The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text
Paul B. Kantor, Ellen M. Voorhees
A known-item search is a particular information retrieval task in which the system is asked to find a single target document in a large document set. The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance. Two corrupted versions of a 55,600-document corpus whose true content was known were created by applying OCR techniques to page images. The first version of the corpus used the page images as scanned, resulting in an estimated character error rate of approximately 5%. The second version used page images that had been down-sampled, resulting in an estimated character error rate of approximately 20%. The true text and each of the corrupted versions were then searched using the same set of 49 questions. In general, retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.
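To make the "character error rate" figures above concrete, the following is a minimal sketch of one common way such a rate can be computed: the edit (Levenshtein) distance between the true text and the OCR output, divided by the length of the true text. This is an illustrative assumption, not the estimation procedure used by the track; the function names `edit_distance` and `character_error_rate` are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning reference into hypothesis."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the true (reference) text.
    Assumption: this normalization is one common convention, not necessarily
    the one behind the 5% and 20% estimates quoted in the abstract."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    true_text = "retrieval system"
    ocr_text = "retrieva1 systcm"  # two hypothetical OCR substitutions
    print(character_error_rate(true_text, ocr_text))
```

In this toy example two of the sixteen reference characters are misread, so the rate is 12.5%, which falls between the two degradation levels described above.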
Vol. 2, No. 2-3
information retrieval, text retrieval conference
Kantor, P. and Voorhees, E., The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text, Information Retrieval (Accessed May 27, 2023)