Impact of Image Quality in Machine Print Optical Character Recognition
Michael D. Garris, Stanley Janet, W Klein
The National Institute of Standards and Technology (NIST) is in the process of setting up a new series of conferences named the Metadata Text Retrieval Conferences (METTREC). They will focus on evaluating two critical technologies: document conversion using optical character recognition (OCR) and information retrieval (IR). Large collections of document images labeled with correct recognition and retrieval responses are needed to measure performance. Currently, the production of these materials is extremely expensive. NIST is developing a semi-automated truthing tool that will help reduce the cost of data preparation and enable evaluations to scale up. To accomplish this, current OCR technology is needed to produce an initial text-to-image alignment. This paper describes a small experiment in which three different vendor products (two Windows NT/95-based and one UNIX-based) are evaluated across three sets of document images of progressively decreasing print and image quality. The evaluation images are subjectively selected pages from the 1994 Federal Register. Results demonstrate the impact of degrading print and image quality, with reported character recognition error rates ranging from 1% to as high as 74%.
Garris, M., Janet, S. and Klein, W., Impact of Image Quality in Machine Print Optical Character Recognition, NIST Interagency/Internal Report (NISTIR), National Institute of Standards and Technology, Gaithersburg, MD, [online], https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=151348
(Accessed February 25, 2024)