NIST Federal Register Document Image Database: Volume 1

This database has been discontinued and is no longer supported, but it remains available upon request.
 

NIST produced a document image database for evaluating document analysis and recognition technologies and information retrieval systems. This database, formerly part of the NIST Special Databases collection and known as Special Database 25, contains page images from the 1994 Federal Register together with ground truth text, OCR results, and supporting software.

A new, fully automated process developed at NIST in 1992 was used to derive ground truth for document images. The method matches optical character recognition (OCR) results from a page against the typesetting files for an entire book. Public domain software for deriving ground truth is provided in the form of Perl scripts and C source code, and includes new, more efficient string alignment technology and a word-level scoring package. The documentation includes a complete software reference guide, including online manual pages. With this ground truthing technology, it is now feasible to produce much larger data sets, at much lower cost, than was possible with previous labor-intensive, manual data collection projects.
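To illustrate what word-level alignment and scoring between OCR output and reference text can look like, here is a minimal sketch in Python. It is not the NIST software, which is distributed as Perl scripts and C source code; it substitutes the standard library's `difflib.SequenceMatcher` for the database's own string alignment method, and the function names `word_accuracy` and `word_errors` are hypothetical.

```python
from difflib import SequenceMatcher


def _matcher(truth: str, ocr: str) -> SequenceMatcher:
    # Align at the word level by comparing lists of whitespace-separated tokens.
    # autojunk=False disables difflib's heuristic that ignores very frequent tokens.
    return SequenceMatcher(a=truth.split(), b=ocr.split(), autojunk=False)


def word_accuracy(truth: str, ocr: str) -> float:
    """Fraction of ground-truth words that the alignment matches in the OCR text."""
    truth_words = truth.split()
    if not truth_words:
        return 1.0
    matched = sum(block.size for block in _matcher(truth, ocr).get_matching_blocks())
    return matched / len(truth_words)


def word_errors(truth: str, ocr: str) -> list:
    """List (operation, truth_words, ocr_words) for every non-matching span."""
    m = _matcher(truth, ocr)
    return [
        (tag, m.a[i1:i2], m.b[j1:j2])
        for tag, i1, i2, j1, j2 in m.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr = "the qu1ck brown fox"
    print(word_accuracy(truth, ocr))  # 0.75
    print(word_errors(truth, ocr))    # [('replace', ['quick'], ['qu1ck'])]
```

In the setting described above, the harder part is locating a single page's OCR text within the typesetting files for an entire book; this sketch assumes the two texts already cover the same page.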

Roughly 250 issues of the Federal Register, comprising nearly 69,000 pages, were published in 1994. This volume of the database contains the pages of 20 books published in January of that year. The database includes scanned images, SGML-tagged ground truth text, commercial OCR results, and image quality assessment results. These data files are useful in a wide variety of experiments and research. Future volumes may be released, depending on the level of interest.

This volume of the database contains 4711 page images scanned as binary images at 15.75 pixels per millimeter (400 pixels per inch). The images are stored in the NIST IHead format and are compressed using CCITT Group 4 compression. Documentation for this format and source code for reading and writing IHead images are provided. Of these page images, 4519 have corresponding ground truth.
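The figures in the paragraph above can be checked with a few lines of arithmetic. The following Python sketch converts the scan resolution between units and computes the share of page images that have ground truth; the constant and function names are illustrative and are not part of the database software.

```python
MM_PER_INCH = 25.4  # exact by definition of the inch

TOTAL_PAGES = 4711        # page images in this volume
PAGES_WITH_TRUTH = 4519   # page images with corresponding ground truth


def ppi_to_px_per_mm(ppi: float) -> float:
    """Convert a resolution in pixels per inch to pixels per millimeter."""
    return ppi / MM_PER_INCH


density = ppi_to_px_per_mm(400)
coverage = PAGES_WITH_TRUTH / TOTAL_PAGES
missing = TOTAL_PAGES - PAGES_WITH_TRUTH

if __name__ == "__main__":
    print(f"{density:.2f} pixels per millimeter")         # 15.75 pixels per millimeter
    print(f"{coverage:.1%} of pages have ground truth")   # 95.9% of pages have ground truth
    print(f"{missing} pages lack ground truth")           # 192 pages lack ground truth
```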

This volume is distributed on two ISO-9660 CD-ROMs and occupies 1.27 GB of storage.

A PDF version of the Users' Guide is available.

 

The contact for this database is:
Patricia Flanagan
100 Bureau Drive, Stop 8940
Gaithersburg, MD 20899-8940
flanagan@nist.gov  

Keywords: document image database; OCR; optical character recognition technology



 

Contact

Standard Reference Data, NIST:
100 Bureau Drive, Stop 6410
Gaithersburg, MD 20899-6410
(844) 374-0183 (Toll Free)

If you have any questions regarding this website, or notice any problems or inaccurate information, please contact the webmaster by sending e-mail to: data@nist.gov

Created August 27, 2010, Updated July 23, 2018