On Building Fair and Reusable Test Collections using Bandit Techniques
Published
2018
Author(s)
Ellen Voorhees
Abstract
While test collections are a vital piece of the research infrastructure for information retrieval, constructing fair, reusable test collections for large data sets is challenging because of the number of human relevance assessments required. Various approaches for minimizing the number of judgments required have been proposed, including a suite of methods based on multi-armed bandit optimization techniques. However, most of these approaches seek to maximize the total number of relevant documents found, which is not necessarily fair, and they have been demonstrated only in simulation on existing test collections. The TREC 2017 Common Core track provided the opportunity to build a collection de novo using a bandit method. Doing so required addressing two problems not encountered in simulation: giving the human judges time to learn a topic, and allocating the overall judgment budget across topics. The resulting modified bandit technique was used to build the 2017 Common Core test collection, consisting of approximately 1.8 million news articles, 50 topics, and 30,030 judgments. Unfortunately, the constructed collection is of lower quality than anticipated: a large percentage of the known relevant documents were retrieved by only one team, and for 21 topics more than a third of the judged documents are relevant. As such, the collection is less reusable than desired. Further analysis demonstrates that the greedy approach common to most bandit methods can be unfair even to the runs participating in the collection-building process when the judgment budget is small relative to the (unknown) number of relevant documents.
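The abstract describes the bandit approach only at a high level. As an illustration, the following Python sketch shows one generic way such a method can work; it is not the paper's modified technique, and the names (bandit_pooling, judge, budget) are hypothetical. Each retrieval run submitted to the track is treated as a bandit arm, the reward for pulling an arm is whether that run's next unjudged document turns out to be relevant, and a Thompson-sampling rule spends the per-topic judgment budget:

import random
from collections import defaultdict

def bandit_pooling(runs, judge, budget, prior=(1.0, 1.0)):
    """Spend up to `budget` relevance judgments on one topic.

    runs   : dict mapping run id -> ranked list of document ids
    judge  : callable doc_id -> bool (stands in for the human assessor)
    budget : total judgments allowed for this topic
    """
    alpha = defaultdict(lambda: prior[0])   # relevant docs seen per run (+ prior)
    beta = defaultdict(lambda: prior[1])    # non-relevant docs seen per run (+ prior)
    judged = {}                             # doc id -> relevance
    cursor = defaultdict(int)               # next unconsumed rank per run

    while len(judged) < budget:
        # Skip documents already judged through another run.
        for r in runs:
            while cursor[r] < len(runs[r]) and runs[r][cursor[r]] in judged:
                cursor[r] += 1
        candidates = [r for r in runs if cursor[r] < len(runs[r])]
        if not candidates:
            break  # every run exhausted before the budget was spent
        # Thompson sampling: draw a plausible precision for each run
        # from its Beta posterior and pull the arm with the best draw.
        chosen = max(candidates,
                     key=lambda r: random.betavariate(alpha[r], beta[r]))
        doc = runs[chosen][cursor[chosen]]
        rel = judge(doc)
        judged[doc] = rel
        if rel:
            alpha[chosen] += 1.0
        else:
            beta[chosen] += 1.0
    return judged

# Toy usage with two ranked runs and an oracle assessor:
runs = {"runA": ["d1", "d2", "d3"], "runB": ["d3", "d4", "d5"]}
relevant = {"d1", "d4"}
qrels = bandit_pooling(runs, judge=lambda d: d in relevant, budget=4)

Replacing the posterior draw with the posterior mean yields the purely greedy selection the abstract critiques: judgments concentrate on the runs with the highest observed precision, so when the budget is small relative to the number of relevant documents, a run that retrieves different relevant documents may never get them judged, which is the kind of unfairness reported for the 2017 collection.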
Proceedings Title
Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM'18)
Voorhees, E. (2018), On Building Fair and Reusable Test Collections using Bandit Techniques, Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM'18), Torino, IT, [online], https://doi.org/10.1145/3269206.3271766, https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=926509 (Accessed October 14, 2025)