Assessing the Effect of Inconsistent Assessors on Summarization Evaluation
Karolina K. Owczarzak, Peter Rankel, Hoa T. Dang, John M. Conroy
We investigate the consistency of human assessors involved in summarization evaluation to understand its effect on system ranking and automatic evaluation techniques. Using Text Analysis Conference data, we measure annotator consistency based on human scoring of summaries for Responsiveness, Readability, and Pyramid evaluation. We identify inconsistencies in the data and measure the extent to which these inconsistencies affect the ranking of automatic summarization systems. Finally, we examine the stability of automatic scoring metrics (ROUGE and CLASSY) with respect to the inconsistent assessments.
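As a rough illustration of the kind of ranking-stability check described in the abstract (not the paper's actual procedure or data), one can compare the system rankings induced by two sets of assessor scores with a rank correlation such as Kendall's tau. The system names and scores below are hypothetical placeholders.

```python
# Minimal sketch: compare system rankings produced by two (hypothetical)
# assessor groups using Kendall's tau rank correlation.
from scipy.stats import kendalltau

# Illustrative mean Responsiveness scores for the same systems
# as assigned by two different assessor groups (made-up values).
scores_group_a = {"sysA": 3.2, "sysB": 2.7, "sysC": 4.1, "sysD": 3.8}
scores_group_b = {"sysA": 3.0, "sysB": 3.1, "sysC": 4.0, "sysD": 3.5}

# Align the two score lists by system name, then correlate.
systems = sorted(scores_group_a)
tau, p_value = kendalltau(
    [scores_group_a[s] for s in systems],
    [scores_group_b[s] for s in systems],
)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```

A tau close to 1 would indicate that the two assessor groups rank the systems nearly identically, while lower values would signal that assessor inconsistency changes the system ranking.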
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics
Owczarzak, K., Rankel, P., Dang, H. and Conroy, J., Assessing the Effect of Inconsistent Assessors on Summarization Evaluation, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju, KR, [online], https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=911315 (Accessed October 17, 2021)