Karolina Owczarzak and Hoa Trang Dang


Developing automatic summarization systems requires stable and reliable evaluation measures. To arrive at an accurate picture of a summarization system's quality, evaluations score multiple summaries (one per topic) for each system and average the results. For the same reason, when the evaluation methodology compares candidate summaries against human-crafted model summaries, multiple models are usually employed. However, providing many topics and models is costly; it is therefore useful to know how the evaluation metrics behave when different amounts of data are available.


In this poster, we focus on four evaluation metrics used in the Text Analysis Conference (TAC) 2007 and 2008 Summarization tasks. In each task, over 50 participating systems produced summaries on over 40 different topics. At the same time, human assessors produced four model summaries for each topic, against which the automatic summaries could be compared. Three of the four evaluation metrics we examine here do just that: they compare a candidate summary to the models in terms of overlapping n-grams (i.e. word sequences) or grammatical dependencies (i.e. pairs of phrasal heads and their modifiers). The fourth metric, called the Pyramid, involves human annotators extracting a comprehensive list of “information nuggets” from the four model summaries and then judging how many of these “nuggets” are contained in the candidate summary being evaluated.
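As an illustration, the n-gram overlap comparison can be sketched as a ROUGE-style recall over model summaries. The function below is a simplified sketch, not the exact TAC implementation; tokenization, stemming, and the specific n-gram order differ across the real metrics:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset of word n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_recall(candidate, models, n=2):
    """ROUGE-style recall: fraction of model n-grams matched by the candidate.

    `candidate` is a token list; `models` is a list of token lists,
    one per human model summary. Clipped counts (min) prevent a repeated
    candidate n-gram from matching more model n-grams than it occurs.
    """
    cand = ngrams(candidate, n)
    matched = total = 0
    for model in models:
        ref = ngrams(model, n)
        total += sum(ref.values())
        matched += sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total if total else 0.0

# Toy example with two model summaries:
cand = "the cat sat on the mat".split()
models = ["the cat sat on a mat".split(), "a cat was on the mat".split()]
score = ngram_recall(cand, models, n=2)  # 5 of 10 model bigrams matched
```

With fewer models in `models`, the denominator shrinks and the score becomes noisier, which is precisely the effect the poster measures.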


For each of these metrics, we calculate its Pearson's correlation with Responsiveness, an overall measure of summary quality assigned by a human assessor, and we examine how this correlation changes as the number of topics or models available to the metric decreases. We also examine whether the discriminatory power of the metrics changes when fewer topics or models are available.
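The correlation computation itself is standard Pearson's r between per-system metric scores and per-system Responsiveness scores. A minimal sketch follows; the score lists are invented placeholder values, not TAC results:

```python
import math

def pearson(xs, ys):
    """Pearson's r between two equal-length lists of per-system scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-system averages: automatic metric vs. Responsiveness.
metric_scores = [0.31, 0.42, 0.38, 0.25, 0.47]
responsiveness = [2.9, 3.6, 3.4, 2.5, 3.9]
r = pearson(metric_scores, responsiveness)
```

Averaging each system over fewer topics or models changes `metric_scores`, and the question studied here is how much `r` degrades as a result.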


We find that, given enough topics and systems to be ranked, scores based on only one or two model summaries correlate with Responsiveness about as well as scores based on all four models. Moreover, manual evaluation metrics such as the Pyramid appear to gain less from additional model summaries than the automatic metrics do. A Pyramid score based on any two models correlates very highly with the score based on any other two models and, for the most part, can still detect significant differences between a pair of summarizers. For the automatic metrics, the largest gain comes from adding a second model; after that, the returns diminish.
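The two-model comparison above can be sketched by enumerating all two-model subsets of the four models and checking the system ranking each subset induces. The per-system scores below are invented placeholder values, used only to show the mechanics:

```python
from itertools import combinations

# Hypothetical per-system scores against each of the four model
# summaries (columns 0-3); values are invented, not TAC data.
scores = {
    "sys1": [0.30, 0.34, 0.28, 0.32],
    "sys2": [0.45, 0.41, 0.47, 0.44],
    "sys3": [0.22, 0.25, 0.20, 0.23],
}

def ranking(subset):
    """System ranking induced by averaging over a subset of model indices."""
    avg = {s: sum(vals[i] for i in subset) / len(subset)
           for s, vals in scores.items()}
    return sorted(avg, key=avg.get, reverse=True)

# One ranking per two-model subset; stable rankings across subsets
# indicate the metric gains little from the remaining models.
rankings = {subset: ranking(subset) for subset in combinations(range(4), 2)}
```

For this toy data every two-model subset yields the same ranking; the poster's finding is that Pyramid scores behave close to this ideal on real TAC data.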


Limiting the number of available topics has a more pronounced impact, though: both the correlations and the discriminatory power of the evaluation metrics drop gradually when fewer than 36 topics are available. Our experiments also suggest that, as the number of topics available for evaluation increases, so does the number of additional topics necessary to change the system ranking produced by a metric.