The Effect of Assessor Errors on IR System Evaluation
Ben Carterette, Ian Soboroff
Recent efforts in test collection building have focused on scaling back the number of necessary relevance judgments and then scaling up the number of search topics. Since the largest source of variation in a Cranfield-style experiment comes from the topics, this is a reasonable approach. However, as topic set sizes grow, and researchers look to crowdsourcing and Amazon's Mechanical Turk to collect relevance judgments, we are faced with issues of quality control. This paper examines the robustness of the TREC Million Query track methods when some assessors make significant and systematic errors. We find that while averages are robust, assessor errors can have a large effect on system rankings.
Proceedings of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Carterette, B. and Soboroff, I., The Effect of Assessor Errors on IR System Evaluation, Proceedings of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Geneva, CH
(Accessed December 8, 2023)