TAC 2009 AESOP Task Guidelines
(Also see general TAC 2009 policies and guidelines at http://www.nist.gov/tac/2009/)
The purpose of the Automatically Evaluating Summaries of Peers (AESOP) task is to promote research and development of systems that automatically evaluate the quality of summaries. The focus in 2009 is on developing automatic metrics that accurately measure summary content. Participants will run their automatic metrics on the data from the TAC 2009 Update Summarization task and submit to NIST the results of their evaluations.
The output of automatic metrics will be compared against two manual metrics: the (Modified) Pyramid score, which measures summary content, and Overall Responsiveness, which measures a combination of content and linguistic quality. NIST will calculate Pearson's, Spearman's, and Kendall's correlations between scores produced by each automatic metric and the two manual metrics. Using a one-way ANOVA and the multiple comparison procedure, NIST will also test the discriminative power of the automatic metrics, i.e., the extent to which each automatic metric can detect statistically significant differences between summarizers. The assumption is that a good automatic metric will make the same significant distinctions between summarizers as the manual metrics (and possibly add more), but will not give a contradicting ranking to two summarizers (i.e., infer that Summarizer X is significantly better than Summarizer Y when the manual metric infers that Summarizer Y is significantly better than Summarizer X) or lose too many of the distinctions made by the manual metrics.
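The comparison of significant distinctions described above can be sketched as follows. This is only an illustration, not NIST's procedure: the outcome labels ("X>Y", "Y>X", "none"), the function name, and the four count categories are assumptions introduced here, and the pairwise outcomes are taken as given rather than computed.

```python
from typing import Dict, Tuple

# A pair of summarizer IDs (X, Y).
Pair = Tuple[str, str]

def compare_distinctions(manual: Dict[Pair, str],
                         automatic: Dict[Pair, str]) -> Dict[str, int]:
    """Count how an automatic metric's pairwise outcomes relate to a manual metric's.

    Each outcome is "X>Y", "Y>X", or "none" (no significant difference).
    These labels are hypothetical; the guidelines do not prescribe an encoding.
    """
    counts = {"agree": 0, "contradict": 0, "lost": 0, "added": 0}
    for pair, manual_outcome in manual.items():
        auto_outcome = automatic.get(pair, "none")
        if manual_outcome == "none" and auto_outcome == "none":
            continue  # neither metric makes a distinction for this pair
        if manual_outcome == auto_outcome:
            counts["agree"] += 1       # same significant distinction
        elif manual_outcome == "none":
            counts["added"] += 1       # automatic metric adds a distinction
        elif auto_outcome == "none":
            counts["lost"] += 1        # automatic metric loses a distinction
        else:
            counts["contradict"] += 1  # opposite significant rankings
    return counts

# Made-up outcomes for four summarizer pairs.
manual = {("1", "2"): "X>Y", ("1", "3"): "X>Y", ("2", "3"): "none", ("1", "4"): "Y>X"}
automatic = {("1", "2"): "X>Y", ("1", "3"): "none", ("2", "3"): "Y>X", ("1", "4"): "X>Y"}
result = compare_distinctions(manual, automatic)
```

Under the assumption stated in the guidelines, a good metric would show many agreements, few lost distinctions, and no contradictions.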
AESOP participants will receive all the test data from the TAC 2009 Update Summarization task, plus the human-authored and automatic summaries from that task. The data will be available for download from the TAC 2009 Summarization Track web page on August 24, 2009. The deadline for submission of automatic evaluations is August 30, 2009. Participants may submit up to four runs, all of which will be evaluated by NIST. Runs must be fully automatic. No changes can be made to any component of the AESOP system or any resource used by the system in response to the current year's test data.
Test data for the AESOP task consists of all test data and summaries produced within the TAC 2009 Update Summarization task:
Test data will be distributed by NIST via the TAC 2009 Summarization web page. Teams will need to use their TAC 2009 Team ID and Team Password to download data and submit results through the NIST web site. To activate the TAC 2009 team ID and password for the summarization track, teams must submit the following forms to NIST, even if these forms were already submitted in previous TAC cycles.
When submitting forms, please also include the TAC 2009 team ID, the email address of the main TAC 2009 contact person for the team, and a comment saying that the form is from a TAC 2009 registered participant.
Test Data Format
The topic statements and documents will be in the same format as the TAC 2008 Update Summarization topic statements and documents (sample given below):
The summaries to be evaluated will include both human-authored summaries and automatic summaries. Each summary will be in a single file, with the following file naming convention:
Given a set of peer summaries (model summaries and other summaries produced as part of the TAC 2009 Update Summarization task), the goal of the AESOP task is to automatically produce a summarizer-level score that will correlate with one or both of the following manual metrics from the TAC 2009 Update Summarization task: the (Modified) Pyramid score and Overall Responsiveness.
The actual AESOP task is to produce two sets of numeric summary-level scores:
Different summarizers may have different numbers of summaries. NIST will assume that the final summarizer-level score is the mean of the summarizer's summary-level scores. Please contact the track coordinator if this is not the case for your metric and you use a different calculation to arrive at the final summarizer-level scores.
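The default aggregation that NIST assumes can be sketched as below. The sketch assumes that the summarizer ID is the final dot-separated field of the summary ID (as the IDs in the example run lines suggest); the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict

def summarizer_level_scores(summary_scores: Dict[str, float]) -> Dict[str, float]:
    """Average summary-level scores per summarizer.

    Assumption: the summarizer ID is the last dot-separated field of the
    summary ID, e.g. "2" in "D0901-A.M.100.C.2".
    """
    by_summarizer = defaultdict(list)
    for summary_id, score in summary_scores.items():
        summarizer_id = summary_id.rsplit(".", 1)[-1]
        by_summarizer[summarizer_id].append(score)
    # Summarizers may have different numbers of summaries; the mean handles that.
    return {sid: mean(scores) for sid, scores in by_summarizer.items()}

example = {
    "D0901-A.M.100.C.2": 0.5,
    "D0944-B.M.100.H.2": 0.3,
    "D0901-A.M.100.C.A": 0.8,
}
levels = summarizer_level_scores(example)
```

A metric that aggregates differently (for example, with a median or a weighted mean) would replace `mean` here, which is the case the guidelines ask participants to report.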
Participants are allowed (but not required) to use the designated model summaries as reference summaries to score a peer summary; however, when evaluating a model summary S in the "All Peers" case, participants are not allowed to include S in the set of reference summaries.
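A minimal sketch of the restriction above: when the summary being scored is itself a model summary, it is dropped from the reference set. The function name and the example IDs are illustrative only.

```python
from typing import List

def reference_set(summary_id: str, model_ids: List[str]) -> List[str]:
    """Return the model summaries that may serve as references for summary_id.

    If summary_id is itself one of the models, it is excluded, so a model
    summary is never scored against itself.
    """
    return [m for m in model_ids if m != summary_id]

# Illustrative model summary IDs for one document set.
models = ["D0901-A.M.100.C.A", "D0901-A.M.100.C.B", "D0901-A.M.100.C.C"]

refs_for_system = reference_set("D0901-A.M.100.C.2", models)  # all models
refs_for_model = reference_set("D0901-A.M.100.C.A", models)   # models minus A
```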
All processing of test data and generation of scores must be automatic. No changes can be made to any component of the system or any resource used by the system in response to the current year's test data.
Each team may submit up to four runs. NIST will evaluate all submitted runs.
A run consists of a single ASCII file containing two sets of summary-level scores produced by the participant's automatic evaluation metric. Each line of the file must be in the following format:
AllPeers D0901-A.M.100.C.2 0.5
AllPeers D0901-A.M.100.C.3 0.5
AllPeers D0901-A.M.100.C.A 0.8
AllPeers D0901-A.M.100.C.B 0.8
AllPeers D0901-A.M.100.C.C 0.8
AllPeers D0944-B.M.100.H.1 0.3
AllPeers D0944-B.M.100.H.2 0.3
AllPeers D0944-B.M.100.H.3 0.3
AllPeers D0944-B.M.100.H.A 0.6
AllPeers D0944-B.M.100.H.B 0.6
AllPeers D0944-B.M.100.H.C 0.6
NoModels D0901-A.M.100.C.1 0.45
NoModels D0901-A.M.100.C.2 0.45
NoModels D0901-A.M.100.C.3 0.45
NoModels D0944-B.M.100.H.1 0.25
NoModels D0944-B.M.100.H.2 0.25
NoModels D0944-B.M.100.H.3 0.25
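Before submitting, a run file can be sanity-checked locally. The sketch below is not the NIST checking routine; it assumes, based only on the example lines above, that each line holds three whitespace-separated fields: a case label (AllPeers or NoModels), a summary ID, and a numeric score.

```python
from typing import List, Tuple

# Case labels seen in the example lines; treated here as the only valid ones (assumption).
VALID_CASES = {"AllPeers", "NoModels"}

def parse_run(text: str) -> List[Tuple[str, str, float]]:
    """Parse run-file text into (case, summary_id, score) tuples.

    Raises ValueError on non-ASCII content, a wrong field count,
    an unknown case label, or a non-numeric score.
    """
    if not text.isascii():
        raise ValueError("run file must be ASCII")
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines (an assumption of this sketch)
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields, got {len(fields)}")
        case, summary_id, raw_score = fields
        if case not in VALID_CASES:
            raise ValueError(f"line {lineno}: unknown case label {case!r}")
        try:
            score = float(raw_score)
        except ValueError:
            raise ValueError(f"line {lineno}: score {raw_score!r} is not numeric") from None
        rows.append((case, summary_id, score))
    return rows

sample = "AllPeers D0901-A.M.100.C.2 0.5\nNoModels D0944-B.M.100.H.3 0.25\n"
rows = parse_run(sample)
```

This check does not verify that summary IDs are valid or that every summary has a score; those checks depend on the distributed test data.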
At the time of submission, participants should indicate whether the run is intended to correlate with the (Modified) Pyramid Score, or Overall Responsiveness, or both.
NIST will assume that for each run, the final score that the run is giving to a summarizer is the mean of the summarizer's summary-level scores.
NIST will post the test data on the TAC Summarization web site on August 24, 2009, and results must be submitted to NIST by 11:59 p.m. (EDT) on August 30, 2009. Results are submitted to NIST using an automatic submission procedure. Details about the submission procedure will be emailed to the firstname.lastname@example.org mailing list before the test data is released. At that time, NIST will also release a script that checks submission files for common errors, such as invalid IDs and missing summaries. Participants may wish to check their runs with this script before submitting them to NIST, because the automatic submission procedure will reject any submission in which the script detects errors.
Each AESOP run will be evaluated for:
Correlation: NIST will calculate the Pearson's, Spearman's, and Kendall's correlations between the summarizer-level scores produced by each submitted metric and the manual metrics (Overall Responsiveness and Pyramid).
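The three correlation coefficients can be computed as in the pure-Python sketch below. This is an illustration, not NIST's implementation: the tie handling (average ranks for Spearman, the tau-b variant for Kendall) is an assumption, since the guidelines do not say which variants are used, and the score lists are made up.

```python
from math import sqrt
from typing import List, Sequence

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b, which adjusts the denominator for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: contributes to neither term
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + only_x_tied) * (untied + only_y_tied))

# Made-up summarizer-level scores for four summarizers.
automatic_scores = [0.5, 0.8, 0.3, 0.6]
manual_scores = [0.4, 0.9, 0.2, 0.7]
r = pearson(automatic_scores, manual_scores)
rho = spearman(automatic_scores, manual_scores)
tau = kendall_tau_b(automatic_scores, manual_scores)
```

Pearson's coefficient is sensitive to the actual score values, while Spearman's and Kendall's depend only on how the two metrics rank the summarizers.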
Discriminative Power: NIST will conduct a one-way analysis of variance (ANOVA) on the scores produced by each metric (automatic or manual). The output from ANOVA will be submitted to MATLAB's multiple comparison procedure, using Tukey's honestly significant difference criterion. The multiple comparison procedure tests every pair of summarizers (X, Y) for a significant difference in their mean scores and infers whether X is significantly better than Y, Y is significantly better than X, or X and Y are not significantly different from each other.
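The one-way ANOVA step can be illustrated with the pure-Python sketch below, which computes only the F statistic and its degrees of freedom, treating each summarizer's summary-level scores as one group. The Tukey multiple comparison step that NIST runs in MATLAB is not reproduced here, and the function name and scores are made up for illustration.

```python
from typing import List, Sequence, Tuple

def one_way_anova_f(groups: Sequence[Sequence[float]]) -> Tuple[float, int, int]:
    """Return (F statistic, between-group df, within-group df).

    Each group holds the summary-level scores of one summarizer.
    Groups may have different sizes.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means: List[float] = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up summary-level scores for three summarizers.
scores_by_summarizer = [
    [0.1, 0.2, 0.3],
    [0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7],
]
f_stat, df_between, df_within = one_way_anova_f(scores_by_summarizer)
```

A large F statistic indicates that at least one summarizer's mean differs from the others; identifying which pairs differ is the job of the multiple comparison procedure.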
Where sentences need to be identified for automatic evaluation, NIST will use a simple Perl script for sentence segmentation. Jackknifing will be implemented so that human and system scores can be compared.
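The guidelines do not spell out the jackknifing procedure. One common reading, sketched here as an assumption, is that a system summary is scored against every subset of the models with one model left out and the results are averaged, while a human summary is scored against the remaining models; both are then scored against the same number of references. The overlap score below is a toy stand-in, not any metric used by NIST.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, List

ScoreFn = Callable[[str, List[str]], float]

def toy_overlap(peer: str, references: List[str]) -> float:
    """Hypothetical score: fraction of peer words found in any reference."""
    peer_words = peer.split()
    ref_words = {w for ref in references for w in ref.split()}
    return sum(w in ref_words for w in peer_words) / len(peer_words)

def jackknife_system_score(peer: str, models: List[str], score_fn: ScoreFn) -> float:
    """Average the score over all subsets of models with one model left out."""
    subsets = combinations(models, len(models) - 1)
    return mean(score_fn(peer, list(refs)) for refs in subsets)

def human_score(model: str, models: List[str], score_fn: ScoreFn) -> float:
    """Score a model summary against the other models only."""
    others = [m for m in models if m != model]
    return score_fn(model, others)

# Toy model summaries and a toy system summary.
models = ["a b c", "a b d", "a e f"]
system = jackknife_system_score("b c", models, toy_overlap)
human = human_score("a b c", models, toy_overlap)
```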
TAC 2009 Workshop Presentations and Papers
Each team that submits runs for evaluation is requested to write a paper for the TAC 2009 proceedings that reports how the runs were produced (to the extent that intellectual property concerns allow) and any additional experiments or analysis conducted using TAC 2009 data. A draft version of the proceedings papers is distributed as a notebook to TAC 2009 workshop attendees. Participants who would like to give oral presentations of their papers at the workshop should submit a presentation proposal by September 25, 2009, and the TAC Advisory Committee will select the groups who will present at the workshop. Please see guidelines for papers and presentation proposals at http://www.nist.gov/tac/2009/reporting_guidelines.html.
Last updated: Tuesday, 19-Oct-2010 12:18:40 EDT