The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text
Published
Author(s)
Paul B. Kantor, Ellen M. Voorhees
Abstract
A known-item search is a particular information retrieval task in which the system is asked to find a single target document in a large document set. The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance. Two corrupted versions of a 55,600 document corpus whose true content was known were created by applying OCR techniques to page images. The first version of the corpus used the page images as scanned, resulting in an estimated character error rate of approximately 5%. The second version used page images that had been down-sampled, resulting in an estimated character error rate of approximately 20%. The true text and each of the corrupted versions were then searched using the same set of 49 questions. In general, retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.
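The two quantities the abstract relies on, character error rate and success at a known-item task, can be illustrated with a minimal sketch. The abstract does not say how the error rates were estimated or how the known-item runs were scored, so both choices here are assumptions: character error rate is taken as Levenshtein edit distance divided by the length of the true text, and a known-item result is scored by the reciprocal of the rank at which the target document appears. All function names are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a character of ref
                            curr[j - 1] + 1,      # insert a character of hyp
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Assumed definition: edit distance normalized by true-text length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def reciprocal_rank(ranked_ids: list[str], target_id: str) -> float:
    """Assumed known-item score: 1/rank of the target, 0.0 if not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == target_id:
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    # One OCR-style confusion ("l" read as "1") in a nine-character word.
    print(character_error_rate("retrieval", "retrieva1"))
    # The single target document is returned in second position.
    print(reciprocal_rank(["d3", "d7", "d1"], "d7"))
```

Averaging `reciprocal_rank` over all 49 tasks would give one summary number per corpus version, which makes it possible to compare the true text against the 5% and 20% versions under the same assumed metric.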
Citation
Information Retrieval
Volume
2 No. 2-3
Pub Type
Journals
Keywords
information retrieval, text retrieval conference
Citation
Kantor, P. and Voorhees, E. (2000), The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text, Information Retrieval (Accessed February 14, 2025)