Author(s)
Ben Carterette, Ian Soboroff
Abstract
Recent efforts in test collection building have focused on scaling back the number of necessary relevance judgments and then scaling up the number of search topics. Since the largest source of variation in a Cranfield-style experiment comes from the topics, this is a reasonable approach. However, as topic set sizes grow, and researchers look to crowdsourcing and Amazon's Mechanical Turk to collect relevance judgments, we are faced with issues of quality control. This paper examines the robustness of the TREC Million Query track methods when some assessors make significant and systematic errors. We find that while averages are robust, assessor errors can have a large effect on system rankings.
Proceedings Title
Proceedings of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Conference Dates
July 19-23, 2010
Conference Location
Geneva, CH
Keywords
information retrieval, test collections
Citation
Carterette, B. and Soboroff, I. (2010), The Effect of Assessor Errors on IR System Evaluation, Proceedings of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Geneva, CH (Accessed May 20, 2026)
Issues
If you have any questions about this publication or are having problems accessing it, please contact [email protected].