Using the inferred measures framework is a popular choice for constructing test collections when the target document set is too large for pooling to be a viable option. Within the framework, different amounts of assessing effort is placed on different regions of the ranked lists as defined by a sampling strategy. The sampling strategy is critically important to the quality of the resultant collection, but there is little published guidance as to the important factors. This paper addresses this gap by examining the effect on collection quality of different sampling strategies within the inferred measures framework. The quality of a collection is measured by how accurately it distinguishes the set of significantly different system pairs. Top-K pooling is competitive, though not the best strategy because it cannot distinguish topics with large relevant set sizes. Incorporating a deep, very sparsely sampled stratum is a poor choice. Strategies that include a top-10 pool create better collections than those that do not, as well as allow Precision(10) scores to be directly computed.
Proceedings of SIGIR 2014
July 6-11, 2014
Gold Coast, -1
information retrieval evaluation, test collections