Evaluating the Evaluation: A Case Study Using the TREC 2002 Question Answering Track
Published
Author(s)
Ellen M. Voorhees
Abstract
Evaluating competing technologies on a common problem set is a powerful way to improve the state of the art and hasten technology transfer. Yet poorly designed evaluations can waste research effort or even mislead researchers with faulty conclusions. Thus it is important to examine the quality of a new evaluation task to establish its reliability. This paper provides an example of one such assessment by analyzing the task within the TREC 2002 question answering (QA) track. The analysis demonstrates that comparative results from the new task are stable, and empirically estimates the size of the difference required between scores to confidently conclude that two runs are different.
Proceedings Title
Proceedings of the 2003 Human Language Technology Conference (HLT-NAACL 03)
Conference Dates
May 1, 2003
Conference Location
Edmonton, CA
Conference Title
International Conference on Human Language Technology
Voorhees, E. (2003), Evaluating the Evaluation: A Case Study Using the TREC 2002 Question Answering Track, Proceedings of the 2003 Human Language Technology Conference (HLT-NAACL 03), Edmonton, CA, [online], https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=50781 (Accessed October 15, 2025)