We investigate to what extent people making relevance judgments for a reusable IR test collection are exchangeable. We consider three classes of judge: gold standard judges, who are topic originators and are experts in a particular information seeking task; silver standard judges, who are task experts but did not create topics; and bronze standard judges, who neither defined topics nor are experts in the task. Analysis shows low levels of agreement in relevance judgments between these three groups. We report on experiments to determine whether this is sufficient to invalidate the use of a test collection for measuring system performance when relevance assessments have been created by silver standard or bronze standard judges. We find that both system scores and system rankings are subject to consistent but small differences across the three assessment sets. It appears that test collections are somewhat robust to changes of judge, even if these judges vary widely in task and topic expertise. Bronze standard judges may be able to substitute for topic and task experts, with some caution regarding relative system performance, but gold standard judges are preferred.
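As a minimal illustration of the kind of inter-assessor agreement analysis the abstract refers to, the sketch below computes Cohen's kappa between two hypothetical sets of binary relevance labels. The judgment lists and the choice of kappa are assumptions for illustration, not necessarily the statistic or data used in the paper.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of documents both judges labeled identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if the two judges labeled independently,
    # given each judge's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical judgments (1 = relevant, 0 = not relevant) from two judge classes.
gold   = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
bronze = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]
print(round(cohens_kappa(gold, bronze), 2))  # → 0.4
```

Here the two judges agree on 7 of 10 documents (raw agreement 0.7), but after correcting for chance agreement the kappa drops to 0.4, which would conventionally be read as only moderate agreement.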
Proceedings Title: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Conference Dates: July 21-25, 2008
Conference Location: Singapore
Conference Title: ACM SIGIR 2008 (Special Interest Group on Information Retrieval)
Pub Type: Conferences
enterprise search, information retrieval, relevance assessment, test collections