"Evaluation as a service" (EaaS) is a new methodology that enables community-wide evaluations and the construction of test collections on documents that cannot be distributed. The basic idea is that evaluation organizers provide a service API through which the evaluation task can be completed. This concept, however, violates some of the premises of traditional pool-based collection building, and, as a result, the quality of the resulting test collection may be compromised. In particular, the service API might restrict the diversity of runs that contribute to the pool: not only may this hamper innovation by researchers, but the lack of diversity might also lead to incomplete judgment pools that reduce the reusability of the collection. This paper shows that the distinctiveness of the retrieval runs used to construct the first test collection built using EaaS, the TREC 2013 Microblog collection, is not substantially different from that of the runs used for the TREC-8 ad hoc collection, a high-quality collection built using traditional pooling. An additional test of collection reusability, the "leave out uniques" test, suggests that the Microblog 2013 collection's pools are less complete than those of the TREC-8 collection, though both collections benefit strongly from the presence of a set of distinctive and effective manual runs. Although we cannot yet generalize to all EaaS evaluations, our analyses reveal no obvious flaws in the test collection built using the methodology in the TREC 2013 Microblog track.
Proceedings of SIGIR 2014
July 6-11, 2014
Gold Coast, Australia
information retrieval, test collection, TREC