Evaluating the Evaluation: A Case Study Using the TREC 2002 Question Answering Track

Published

Author(s)

Ellen M. Voorhees

Abstract

Evaluating competing technologies on a common problem set is a powerful way to improve the state of the art and hasten technology transfer. Yet poorly designed evaluations can waste research effort or even mislead researchers with faulty conclusions. Thus it is important to examine the quality of a new evaluation task to establish its reliability. This paper provides an example of one such assessment by analyzing the task within the TREC 2002 question answering (QA) track. The analysis demonstrates that comparative results from the new task are stable, and empirically estimates the size of the difference required between scores to confidently conclude that two runs are different.
Proceedings Title
Proceedings of the 2003 Human Language Technology Conference (HLT-NAACL 03)
Conference Dates
May 1, 2003
Conference Location
Edmonton, Canada
Conference Title
International Conference on Human Language Technology

Keywords

evaluation, question answering, TREC

Citation

Voorhees, E. (2003), Evaluating the Evaluation: A Case Study Using the TREC 2002 Question Answering Track, Proceedings of the 2003 Human Language Technology Conference (HLT-NAACL 03), Edmonton, CA, [online], https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=50781 (Accessed March 1, 2024)
Created May 1, 2003, Updated February 17, 2017