Bias and the Limits of Pooling for Large Collections

C E. Buckley; Darrin L. Dimmick; Ian Soboroff; Ellen M. Voorhees

Published

July 17, 2007

Author(s)

C E. Buckley, Darrin L. Dimmick, Ian Soboroff, Ellen M. Voorhees

Abstract

Modern retrieval test collections are built through a process called pooling, in which only a sample of the entire document set is judged for each topic. The idea behind pooling is to find enough relevant documents that, when unjudged documents are assumed to be nonrelevant, the resulting judgment set is sufficiently complete and unbiased. Yet a constant-size pool represents an increasingly small percentage of the document set as document sets grow larger, and at some point the assumption of approximately complete judgments must become invalid. This paper shows that the judgment sets produced by traditional pooling can be biased when the pools are too small relative to the total document set size, in that they favor relevant documents that contain topic title words. This phenomenon is wholly dependent on the collection size and does not depend on the number of relevant documents for a given topic. We show that the AQUAINT test collection constructed in the recent TREC 2005 workshop exhibits this biased relevance set; it is likely that the test collections based on the much larger GOV2 document set also exhibit the bias. The paper concludes with suggested modifications to traditional pooling and evaluation methodology that may allow very large reusable test collections to be built.
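
The pooling idea described in the abstract can be illustrated with a minimal Python sketch. It is not the procedure used in the paper or at TREC; the function names (`build_pool`, `judged_fraction`, `precision_at_k`), the run names, the document identifiers, and the relevance judgments are all hypothetical and chosen only for illustration. The sketch assumes a pool is the union of the top-ranked documents from each submitted run, and that any document outside the pool counts as nonrelevant at evaluation time.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Union of the top-`depth` documents from each run's ranking for one topic."""
    pool: Set[str] = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def judged_fraction(pool: Set[str], collection_size: int) -> float:
    """Share of the whole collection that the pool (the judged set) covers."""
    return len(pool) / collection_size


def precision_at_k(ranking: List[str], qrels: Dict[str, bool], k: int) -> float:
    """Precision at rank k; documents missing from `qrels` are unjudged
    and are treated as nonrelevant."""
    hits = sum(1 for doc in ranking[:k] if qrels.get(doc, False))
    return hits / k


# Hypothetical rankings from three systems for a single topic.
runs = {
    "run_a": ["d1", "d2", "d3", "d4"],
    "run_b": ["d2", "d5", "d1", "d6"],
    "run_c": ["d7", "d1", "d2", "d8"],
}

pool = build_pool(runs, depth=2)

# Hypothetical assessor judgments for the pooled documents only.
qrels = {"d1": True, "d2": False, "d5": True, "d7": False}

# The same pool covers a far smaller share of a larger collection.
small = judged_fraction(pool, collection_size=1_000)
large = judged_fraction(pool, collection_size=1_000_000)

# d6 was never pooled, so it counts against run_b even if it is relevant.
p4 = precision_at_k(runs["run_b"], qrels, k=4)

print(sorted(pool), small, large, p4)
```

With a fixed pool depth, the judged fraction shrinks as the collection grows, which is the setting in which the abstract argues the completeness assumption eventually breaks down.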

Citation

Information Retrieval

Pub Type

Journals

Keywords

evaluation of information retrieval, information retrieval, test collections

Metrology

Citation

Buckley, C., Dimmick, D., Soboroff, I. and Voorhees, E. (2007), Bias and the Limits of Pooling for Large Collections, Information Retrieval, [online], https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=51236 (Accessed July 29, 2026)

Created July 16, 2007, Updated October 12, 2021
