Creating and Validating a Large Database for METTREC
W Klein, Michael D. Garris
The National Institute of Standards and Technology [NIST] is in the process of setting up a new series of conferences named the Metadata Text Retrieval Conferences [METTREC]. The series will focus on evaluating document conversion using optical character recognition [OCR] and information retrieval [IR] technologies. Evaluations will be designed to investigate the impact of machine recognition errors upon information retrieval and to determine what interfaces are appropriate to integrate the two technologies. To implement this conference, NIST requires databases that can be used for conference evaluations and has chosen the Federal Register to be the initial document source. It is a large, complete set of documents containing metadata that will allow quantitative evaluation of recognition and retrieval technologies. This paper describes the activities associated with scanning the Federal Register and validating the document images within the database. The process of image validation includes translating filenames, assuring image quality, and verifying correct page sequences. To reduce the cost of validation, we minimized human resource expenditure by exploiting OCR and high-speed visual adjudication of images by an operator. This process minimizes the expensive handling of paper to validate document image collections.
CD ROM, document, image database, information retrieval, METTREC, OCR, optical character recognition, quality, scanning
Klein, W. and Garris, M.
Creating and Validating a Large Database for METTREC, - 6090, National Institute of Standards and Technology, Gaithersburg, MD, [online], https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=151343
(Accessed December 7, 2023)