The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text

Published

Author(s)

Paul B. Kantor, Ellen M. Voorhees

Abstract

A known-item search is a particular information retrieval task in which the system is asked to find a single target document in a large document set. The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance. Two corrupted versions of a 55,600-document corpus whose true content was known were created by applying OCR techniques to page images. The first version of the corpus used the page images as scanned, resulting in an estimated character error rate of approximately 5%. The second version used page images that had been down-sampled, resulting in an estimated character error rate of approximately 20%. The true text and each of the corrupted versions were then searched using the same set of 49 questions. In general, retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.
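
The abstract reports estimated character error rates of roughly 5% and 20% but does not say how they were estimated. As an illustration only, a common way to measure character error rate when the true text is available is to divide the edit distance between the true text and the OCR output by the length of the true text. The sketch below assumes that definition; the function names `levenshtein` and `character_error_rate` are hypothetical and are not taken from the paper.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the length of the reference text.
    This normalization is an assumption, not the paper's stated method."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    true_text = "information retrieval"
    ocr_text = "inforrnation retrieva1"  # made-up OCR-style corruption
    print(f"CER = {character_error_rate(true_text, ocr_text):.3f}")
```

Under this definition, a rate of 5% corresponds to about one character edit per 20 characters of true text, and 20% to about one edit per 5 characters.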

Citation

Information Retrieval, Volume 2, No. 2-3

Keywords

information retrieval, text retrieval conference

Citation

Kantor, P. and Voorhees, E. (2000), The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text, Information Retrieval (Accessed April 12, 2024)
Created December 31, 1999, Updated October 12, 2021