TAC 2011 AESOP Task Guidelines
(Also see general TAC 2011 policies and guidelines at http://www.nist.gov/tac/2011/)
The purpose of the Automatically Evaluating Summaries of Peers (AESOP) task is to promote research and development of systems that automatically evaluate the quality of summaries in terms of their (1) content and (2) readability (i.e. linguistic quality).
Measuring content: In the TAC 2009 and TAC 2010 AESOP tasks, the focus was on developing automatic metrics that can measure summary content on the system level (i.e., measuring the average quality of summarizers). In 2011, participating metrics will additionally be evaluated for their ability to accurately measure summary content on the level of individual summaries.
Measuring readability: For the first time in the AESOP task, participating metrics will also be evaluated for their ability to measure summary readability, both on the level of summarizers and individual summaries.
Participants can submit metrics that are designed either to measure content or readability, or both; however, all metrics will be evaluated in all categories to provide a full picture of the metric's capabilities. Participants will run their automatic metrics on the data from the TAC 2011 Guided Summarization task and submit to NIST the results of their evaluations.
The output of automatic metrics will be compared against three manual metrics: the (Modified) Pyramid score, which measures summary content; Overall Readability, which measures linguistic quality; and Overall Responsiveness, which measures a combination of content and linguistic quality. NIST will calculate Pearson's, Spearman's, and Kendall's correlations between scores produced by each automatic metric and the three manual metrics, both on the summarizer and summary level. Using a one-way ANOVA and the multiple comparison procedure, NIST will also test the discriminative power of the automatic metrics, i.e., the extent to which each automatic metric can detect statistically significant differences between summarizers. The assumption is that a good automatic metric will make the same significant distinctions between summarizers as the manual metrics (and possibly add more), but will not give a contradicting ranking to two summarizers (i.e., infer that Summarizer X is significantly better than Summarizer Y when the manual metric infers that Summarizer Y is significantly better than Summarizer X) or lose too many of the distinctions made by the manual metrics.
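The three correlation coefficients named above can be illustrated with a minimal sketch in plain Python. This is not NIST's evaluation code; the function names, the choice of the tau-b variant of Kendall's correlation, and the example scores are assumptions made for illustration only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r between two equal-length score lists.
    Raises ZeroDivisionError when either list is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties in either list."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: excluded from both factors
            elif dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_in_x = concordant + discordant + tied_y_only
    untied_in_y = concordant + discordant + tied_x_only
    return (concordant - discordant) / sqrt(untied_in_x * untied_in_y)

# Hypothetical summarizer-level scores for four summarizers.
manual_scores = [0.2, 0.4, 0.6, 0.8]
metric_scores = [0.1, 0.3, 0.2, 0.9]

for name, fn in [("Pearson", pearson), ("Spearman", spearman),
                 ("Kendall", kendall_tau_b)]:
    print(name, round(fn(manual_scores, metric_scores), 4))
```

The same functions apply at the summary level if the two lists hold the scores of individual summaries rather than summarizer means.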
AESOP participants will receive all the test data from the TAC 2011 Guided Summarization task, plus the human-authored and automatic summaries from that task. The data will be available for download from the TAC 2011 Summarization Track web page on August 22, 2011. The deadline for submission of automatic evaluations is August 28, 2011. Participants may submit up to four runs, all of which will be evaluated by NIST. Runs must be fully automatic. No changes can be made to any component of the AESOP system or any resource used by the system in response to the current year's test data.
Test data for the AESOP task consists of all test data and summaries produced within the TAC 2011 Guided Summarization task:
Source documents for summarization will come from the newswire portion of the TAC 2010 KBP Source Data (LDC Catalog Number: LDC2010E12). The collection spans the years 2007-2008 and consists of documents taken from the New York Times, the Associated Press, and the Xinhua News Agency newswires.
Test source documents will be distributed by the LDC. The remaining test data will be distributed by NIST via the TAC 2011 Summarization web page. Teams will need to use their TAC 2011 Team ID and Team Password to download data and submit results through the NIST web site. To activate the TAC 2011 team ID and password, teams must submit all required agreement forms, even if these forms were already submitted in previous TAC cycles. See TAC 2011 Summarization Registration Information for how to register, submit required agreement forms, and obtain AESOP test data.
Test Data Format
The topic statements and documents will be in a format similar to that of the TAC 2010 Guided Summarization Task.
The topic IDs have the following naming convention:
The summaries to be evaluated will include both human-authored summaries and automatic summaries. Each summary will be in a single file, with the following file naming convention:
Given a set of peer summaries (model summaries and other summaries produced as part of the TAC 2011 Guided Summarization task), the goal of the AESOP task is to automatically produce summary-level and summarizer-level scores that will correlate with one or all of the following manual metrics from the TAC 2011 Guided Summarization task:
The actual AESOP task is to produce two sets of numeric summary-level scores:
Different summarizers may have different numbers of summaries. NIST will assume that the final summarizer-level score is the mean of the summarizer's summary-level scores. Please contact the track coordinator if this is not the case for your metric and you use a different calculation to arrive at the final summarizer-level scores.
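The default aggregation described above (summarizer-level score = mean of that summarizer's summary-level scores) might be sketched as follows. The sketch assumes that the summarizer ID is the last dot-separated field of the summary ID, which is inferred from the example IDs in these guidelines rather than stated here; the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def summarizer_id(summary_id):
    # Assumption: the summarizer ID is the final dot-separated field,
    # e.g. "2" in "D1001-A.M.100.C.2".
    return summary_id.rsplit(".", 1)[1]

def summarizer_level_scores(summary_scores):
    """Average each summarizer's summary-level scores.

    summary_scores maps summary ID -> score. Summarizers may have
    different numbers of summaries; each mean uses only the summaries
    that summarizer actually has.
    """
    by_summarizer = defaultdict(list)
    for sid, score in summary_scores.items():
        by_summarizer[summarizer_id(sid)].append(score)
    return {s: mean(scores) for s, scores in by_summarizer.items()}

# Hypothetical scores: summarizer 1 has two summaries, summarizer 2 has one.
example_scores = {
    "D1001-A.M.100.C.1": 0.45,
    "D1044-B.M.100.H.1": 0.25,
    "D1001-A.M.100.C.2": 0.45,
}
example_means = summarizer_level_scores(example_scores)
print(example_means)
```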
Participants are allowed (but not required) to use the designated model summaries as reference summaries to score a peer summary; however, when evaluating a model summary S in the "AllPeers" case, participants are not allowed to include S in the set of reference summaries (see the explanation in "AllPeers" above).
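One way to respect this restriction is to build the reference set per peer and drop the peer itself whenever it is a model summary. The following is a minimal sketch; the function name and the example IDs are illustrative assumptions, and a real metric would load the summaries behind these IDs.

```python
def reference_set(peer_id, model_ids):
    """Return the model summaries that may serve as references for peer_id.

    If peer_id is itself a model summary, it is excluded, so a model
    summary is never scored against itself. For an automatic summary
    the full list of models is returned.
    """
    return [m for m in model_ids if m != peer_id]

# Hypothetical model summaries for one topic.
models = ["D1001-A.M.100.C.A", "D1001-A.M.100.C.B", "D1001-A.M.100.C.C"]

print(reference_set("D1001-A.M.100.C.A", models))  # model peer: itself removed
print(reference_set("D1001-A.M.100.C.2", models))  # automatic peer: all models
```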
All processing of test data and generation of scores must be automatic. No changes can be made to any component of the system or any resource used by the system in response to the current year's test data.
Each team may submit up to four runs. NIST will evaluate all submitted runs.
A run consists of a single ASCII file containing two sets of summary-level scores produced by the participant's automatic evaluation metric. Each line of the file must be in the following format:
AllPeers D1001-A.M.100.C.2 0.5
AllPeers D1001-A.M.100.C.3 0.5
AllPeers D1001-A.M.100.C.A 0.8
AllPeers D1001-A.M.100.C.B 0.8
AllPeers D1001-A.M.100.C.C 0.8
AllPeers D1044-B.M.100.H.1 0.3
AllPeers D1044-B.M.100.H.2 0.3
AllPeers D1044-B.M.100.H.3 0.3
AllPeers D1044-B.M.100.H.A 0.6
AllPeers D1044-B.M.100.H.B 0.6
AllPeers D1044-B.M.100.H.C 0.6
NoModels D1001-A.M.100.C.1 0.45
NoModels D1001-A.M.100.C.2 0.45
NoModels D1001-A.M.100.C.3 0.45
NoModels D1044-B.M.100.H.1 0.25
NoModels D1044-B.M.100.H.2 0.25
NoModels D1044-B.M.100.H.3 0.25
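Judging only from the example lines above, each line appears to hold three whitespace-separated fields: the evaluation set (AllPeers or NoModels), the summary ID, and a numeric score. A minimal parsing sketch under that assumption follows. It is not the checking routine that NIST distributes, and the function name is hypothetical.

```python
VALID_SETS = {"AllPeers", "NoModels"}

def parse_run_line(line):
    """Split one run-file line into (evaluation set, summary ID, score).

    Assumes three whitespace-separated fields, as in the examples.
    Raises ValueError for a wrong field count, an unknown evaluation
    set, or a score that is not a number.
    """
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    eval_set, summary_id, score_text = fields
    if eval_set not in VALID_SETS:
        raise ValueError(f"unknown evaluation set: {eval_set!r}")
    return eval_set, summary_id, float(score_text)

print(parse_run_line("AllPeers D1001-A.M.100.C.2 0.5"))
```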
At the time of submission, participants should indicate whether the run is intended to correlate with the (Modified) Pyramid Score, Overall Responsiveness, Overall Readability, or any subset of the three metrics.
NIST will assume that for each run, the final score that the run is giving to a summarizer is the mean of the summarizer's summary-level scores.
NIST will post the test data on the TAC Summarization web site on August 22, 2011, and results must be submitted to NIST by 11:59 p.m. (EDT) on August 28, 2011. Results are submitted to NIST using an automatic submission procedure. Details about the submission procedure will be emailed to the email@example.com mailing list before the test data is released. At that time, NIST will release a script that checks submission files for common errors, such as invalid IDs and missing summaries. Participants may wish to check their runs with this script before submitting them to NIST, because the automatic submission procedure will reject any submission in which the script detects errors.
Each AESOP run will be evaluated for:
Correlation: NIST will calculate the Pearson's, Spearman's, and Kendall's correlations between the summarizer-level scores produced by each submitted metric and the manual metrics (Overall Responsiveness, Overall Readability, and Pyramid). NIST will also calculate the Pearson's, Spearman's, and Kendall's correlations between the summary-level scores (within each topic) produced by each submitted metric and the manual metrics.
Discriminative Power: NIST will conduct a one-way analysis of variance (ANOVA) on the scores produced by each metric (automatic or manual). The output from ANOVA will be submitted to MATLAB's multiple comparison procedure, using Tukey's honestly significant difference criterion. The multiple comparison procedure tests every pair of summarizers (X, Y) for a significant difference in their mean scores and infers whether:
Where sentences need to be identified for automatic evaluation, NIST will use a simple Perl script for sentence segmentation. Jackknifing will be implemented so that human and system scores can be compared.
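The comparison of significant distinctions that underlies the discriminative-power evaluation can be sketched as a simple tally. The sketch does not run ANOVA or the multiple comparison procedure; it assumes that their pairwise outcomes are already available, encoded as ">" (X significantly better than Y), "<" (Y significantly better than X), or "=" (no significant difference). That encoding, the category names, and the example data are hypothetical.

```python
from collections import Counter

def compare_distinctions(manual, automatic):
    """Tally how an automatic metric's pairwise decisions relate to a
    manual metric's decisions over the same summarizer pairs.

    Both arguments map a pair (X, Y) to ">", "<", or "=".
    """
    counts = Counter()
    for pair, m in manual.items():
        a = automatic[pair]
        if m == "=" and a == "=":
            counts["neither"] += 1        # no distinction in either metric
        elif m == "=":
            counts["added"] += 1          # automatic metric adds a distinction
        elif a == "=":
            counts["lost"] += 1           # manual distinction is lost
        elif a == m:
            counts["shared"] += 1         # same significant distinction
        else:
            counts["contradicted"] += 1   # opposite significant ranking
    return counts

# Hypothetical outcomes for four summarizer pairs.
manual_outcomes = {("1", "2"): ">", ("1", "3"): ">", ("2", "3"): "=", ("A", "1"): ">"}
automatic_outcomes = {("1", "2"): ">", ("1", "3"): "=", ("2", "3"): "<", ("A", "1"): "<"}

tally = compare_distinctions(manual_outcomes, automatic_outcomes)
print(dict(tally))
```

Under the assumption stated in the overview, a good automatic metric would show many shared distinctions, few lost ones, and no contradicted ones.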
TAC 2011 Workshop Presentations and Papers
Each team that submits runs for evaluation is requested to write a paper for the TAC 2011 proceedings that reports how the runs were produced (to the extent that intellectual property concerns allow) and any additional experiments or analysis conducted using TAC 2011 data. A draft version of the proceedings papers is distributed as a notebook to TAC 2011 workshop attendees. Participants who would like to give oral presentations of their papers at the workshop should submit a presentation proposal by September 25, 2011. Please see guidelines for papers and presentation proposals at http://www.nist.gov/tac/2011/reporting_guidelines.html.
Last updated: Tuesday, 03-May-2011 18:34:13 EDT