ENCLOSURE 3: COMMENTS ABOUT ENCLOSURES 1 AND 2


GOALS AND MATERIALS OF THE TEST

The digital images used in the Conference will contain most, if not all,
ASCII characters, and other non-ASCII characters. The goal of the
recognition task will be to convert each image of an upper case or lower
case letter in the images into the corresponding upper case ASCII
character, to convert the image of each digit into the corresponding ASCII
digit, to convert all other characters into ASCII spaces, to replace
multiple spaces by a single space, and to report the result as the
hypothetical classification. 
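As a concrete illustration, the conversion rule above can be sketched in
Python. The function name and the trimming of leading and trailing spaces
are our own choices for the sketch, not part of the plan:

```python
import re

def normalize(text):
    """Map text to the target ASCII alphabet described in the plan:
    letters -> upper case, digits kept as-is, all other characters
    -> ASCII space, then runs of spaces collapsed to one space.
    Leading/trailing spaces are trimmed (an assumption)."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            out.append(ch.upper())
        elif ch.isascii() and ch.isdigit():
            out.append(ch)
        else:
            out.append(" ")
    return re.sub(r" +", " ", "".join(out)).strip()
```

For example, normalize("Tiping, Filing") yields TIPING FILING, matching
the punctuation-removal behavior of the reference file discussed below.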

Refer now to Enclosures 1 and 2, and note that the top three miniforms
contain only hand print, but that the quality of the hand print is
variable both in character formation and in segmentation. 

The bottom two miniforms show answer formats that key punch operators can
handle, even though most OCR systems probably will not be able to. This is
indicated by the hypotheses, made up mostly of H's and I's, which illustrate
the results of a made-up system that tends to classify anything that is not
an ASCII character as either an H or an I. Presumably, the confidences for
these hypotheses would be very low or zero, so the adverse effect of these
aberrant answers on the OCR error rate (fraction) can be minimized by a
good rejection process. 

Also, notice that the top field in the last form is empty. In the case of
an empty field, the key entry operators were instructed to enter BLANK for
the 1980 large sample, as shown in the reference file, but were not so
instructed for the 1990 Census; instead, they were to leave the field
empty. In fact, the procedure that we are using to remove the worst
quality images from the sample also removes images that have empty fields
as a side effect. Nevertheless, you will be instructed to enter BLANK into
any blank fields that you encounter as a precaution against a few sneaking
through. This is illustrated in the hypothesis file for the last form in
Enclosure 1. 

Spelling errors by the people filling out the forms and by the key entry
operators introduce problems that complicate scoring. These are
illustrated by the last field in the last record of the reference file:
r04_f02 MANAGERUG. The key entry operators were instructed to type what
was printed without attempting to correct spelling or typographical
errors, and MANAGERUG is what's in the reference file, even though we
might guess that MANAGING is what was meant. However, sometimes the key
entry operators will not notice the misspelling, but will just type the
word that they recognize. This is illustrated by r03_f02 TYPING FILING,
where the actual writing on the image gives Tiping, Filing. This field
also illustrates the fact that all punctuation has been removed from the
reference file data in accordance with the goal stated at the beginning of
this enclosure. 

Some of the incorrect words in the made-up hypothesis file shown in
Enclosure 2 can be corrected by a sufficiently powerful dictionary look-up
algorithm. Therefore, we will be providing word and phrase dictionaries
for use in performing the recognition task. These are available at the ftp
site mentioned above in the directory dicts. There are nine dictionaries
there: 

phrase_0.lng	word_0.lng	phrase_0.sht
phrase_1.lng	word_1.lng	phrase_1.sht
phrase_2.lng	word_2.lng	phrase_2.sht


These were made from a 132,000-field sample of the fields f00, f01, and f02
obtained from the 1980 Census. The dictionary phrase_Z.lng contains all of
the phrases (after removal of punctuation and double spaces) occurring in
field f0Z in the sample, and the dictionary word_Z.lng contains all of the
words occurring in that field, while phrase_Z.sht contains all of the
phrases that occur more than once. The coverage of the short dictionaries
is quite good. Each contains only about 8000 phrases, but covers 70%, 70%,
and 60%, respectively, of the fields f00, f01, and f02 in the 132,000-phrase
sample for each field. It is expected that the short dictionaries
will provide nearly this level of coverage for the 1990 Census data being
used for the Conference sample, training, and testing data. The long
phrase dictionaries are of the order of 45,000 phrases, and it is possible
that they will not cover the 1990 Census data much better than the short
phrase dictionaries do. 
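The construction of the long and short phrase dictionaries, and the notion
of coverage quoted above, can be sketched as follows. The function names,
and the exact-match definition of coverage, are illustrative assumptions:

```python
from collections import Counter

def build_dictionaries(phrases):
    """Hypothetical reconstruction of phrase_Z.lng / phrase_Z.sht:
    the long dictionary holds every distinct phrase seen in the field;
    the short one holds only phrases occurring more than once."""
    counts = Counter(phrases)
    long_dict = sorted(counts)
    short_dict = sorted(p for p, n in counts.items() if n > 1)
    return long_dict, short_dict

def coverage(fields, dictionary):
    """Fraction of field strings found verbatim in a dictionary."""
    entries = set(dictionary)
    return sum(1 for f in fields if f in entries) / len(fields)
```

With this definition, the 70%/70%/60% figures above correspond to the
fraction of sample fields whose entire phrase appears in the short
dictionary for that field.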

The word dictionaries are about 13,000 words long. About half of the words
are either misspellings or abbreviations. We have looked into the
possibility of mapping these into correct words or roots, but have not
found a fool-proof way of doing this so far. This fact, combined with the
fact that the key entry operators do not always key what they were
supposed to, introduces some uncertainty into how best to use the
dictionaries, and the resolution of this problem is left to the
participants. Remember, the goal is to reproduce the letters and digits
contained in the image, and the dictionaries will contain common
misspellings and abbreviations. 

A report describing our preliminary study of the problems associated with
dictionaries both for correcting the results of OCR and for scoring can be
found in /pub/NISTIR/ir_5180.ps at the ftp site listed above. It is in
PostScript(C) format; copies can be obtained from Allen Wilkinson at the
e-mail and FAX addresses listed above if you don't have access to
PostScript printing capability. 

New, improved dictionaries will be provided with the training material.
They will have all of the words and phrases in the sample dictionaries,
but will also include extra words and phrases. We do not expect the short
dictionaries to be much larger, but would not be surprised if the long
dictionaries grew substantially. 

Since some potential participants may not have dictionary-based correction
algorithms available for use with their OCR results, we tentatively plan
to allow each participant to request that we run no more than one set of
his or her test results through a NIST-developed correction suite. We
would then score both sets of results with two different measures of field
level accuracy as described below. Typical results for synthesized data
designed to simulate NIST participation in the Conference are shown below
to give an example of what such dictionary correction can do: 

SCORING EXAMPLES AND DEFINITIONS

FIELD LEVEL ACCURACY MEASURES FOR SIMULATED OCR CLASSIFICATION DATA

			BEFORE DICTIONARY	AFTER DICTIONARY
			BASED CORRECTION	BASED CORRECTION

field distance fraction		33%			21%

field error fraction 		92%			51%

field level     field level
rej. fraction   error fraction
0.00		0.51
0.10		0.46
0.20		0.40
0.30		0.35
0.40		0.31
0.50		0.27
0.60		0.24
0.70		0.20
0.80		0.14
0.90		0.06

Now we describe what these scores, which are what we plan to use for the
Conference, actually mean, and we welcome your comments on this aspect of
the plan, which is still being perfected. You will receive the final plan
with the training materials, and the decision of the Committee in this
regard will become final at that time. Let's start with a few definitions. 

A hypothesis classification (hypothesis for short) is an ASCII phrase that
has been assigned by a system to an unknown digital image of a hand print
phrase. A reference classification (reference for short) is the phrase
that the hypothesis phrase will be scored against. Unfortunately, the
reference phrase will not always be what you or I would consider the
correct phrase. Many images contain misspelled words. The key entry
operators were instructed to key what was printed on the form without
correcting misspellings. However, since humans recognize words rather than
letters when they are reading, the key entry operators sometimes entered
the correct version of a word rather than the misspelled version, never
noticing the misspelling. Also, many images contain abbreviations.
Unfortunately, as mentioned above in connection with dictionaries, we have
not been able to devise an automated way to map abbreviations and
misspellings onto corrected or expanded words or roots.

Under these conditions, the field error fraction, by itself, might not
give a good comparison of the performance of two different systems.
Therefore, we will calculate not only the field error fraction but also a
measure of the distance between the hypothesis and the reference field. 

To calculate the field error fraction, we will just compare each
hypothesis field with the corresponding reference field, including spaces.
If they are identical, we will increment a correct-field hypothesis
counter, cf. If not, we will increment an error-field hypothesis counter,
ef. We will sum the cf and ef counters over all accepted (not rejected)
fields, and the field error rate will then be calculated as 

field error rate = ef/(cf+ef).
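The tally just described can be sketched as follows, assuming the
hypotheses, references, and accept/reject decisions arrive as parallel
lists (the argument names are illustrative):

```python
def field_error_rate(hypotheses, references, accepted):
    """Exact-match field error rate ef/(cf+ef) over accepted fields.
    Spaces count in the comparison; rejected fields are withheld."""
    cf = ef = 0
    for hyp, ref, acc in zip(hypotheses, references, accepted):
        if not acc:          # rejected: excluded from both counters
            continue
        if hyp == ref:       # identical including spaces
            cf += 1
        else:
            ef += 1
    return ef / (cf + ef)
```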

To calculate the distance between a hypothesis and reference field, we
will compute an alignment between each hypothesis and the corresponding
reference phrase that minimizes the Levenshtein distance [1-5] between the
two phrases. In calculating the Levenshtein distance, we plan to use 3, 1,
and 5 as the penalties for letter substitution, insertion, and deletion
errors, respectively. Finally, we will use the alignment of the hypothesis
and reference phrase to calculate 

error rate = (s+i+d)/(c+s+i+d),

where 

s = # of substitution errors 
i = # of insertion errors 
d = # of deletion errors 
c = # of correct characters

are summed over all accepted fields. 
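A sketch of the weighted alignment with these penalties follows; the
traceback counts c, s, i, and d for a single field. The orientation chosen
here (an insertion is a hypothesis character with no reference mate, a
deletion a reference character the hypothesis missed) is our assumption:

```python
SUB, INS, DEL = 3, 1, 5   # penalties stated in the plan

def align_counts(hyp, ref):
    """Minimal-cost alignment of a hypothesis against a reference;
    returns (correct, substitutions, insertions, deletions)."""
    m, n = len(hyp), len(ref)
    # cost[i][j]: best cost aligning hyp[:i] with ref[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i * INS
    for j in range(1, n + 1):
        cost[0][j] = j * DEL
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = cost[i-1][j-1] + (0 if hyp[i-1] == ref[j-1] else SUB)
            cost[i][j] = min(diag, cost[i-1][j] + INS, cost[i][j-1] + DEL)
    # trace back through the table, counting the four outcomes
    c = s = ins = dele = 0
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and cost[i][j] ==
                cost[i-1][j-1] + (0 if hyp[i-1] == ref[j-1] else SUB)):
            if hyp[i-1] == ref[j-1]:
                c += 1
            else:
                s += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i-1][j] + INS:
            ins += 1
            i -= 1
        else:
            dele += 1
            j -= 1
    return c, s, ins, dele
```

Summing c, s, i, and d over all accepted fields and forming
(s+i+d)/(c+s+i+d) then gives the distance measure defined above.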

We will calculate both the field error fraction and the field distance
fraction as a function of the field level rejection fraction. We will not
use any character level rejection fractions. The latter do not seem to
be useful as final system outputs even though they might be very useful in
obtaining the final system output. This is illustrated below for an image
that says 

TIPING FILING

Suppose the hypothesis were 

TIPMG FILMQ

with a field level confidence of 0.72, and with the following confidences
for the individual letters: 

T 0.853    
I 0.573
P 0.993
M 0.678
G 0.921

F 0.950
I 0.976 
L 0.892 
M 0.734 
Q 0.621 

If character level rejection were used, then with a rejection fraction of
0.00, the hypothesis would be 

TIPMG FILMQ 

but with a rejection fraction of 0.40, the hypothesis might be

T P G FIL  

This does not seem to be useful for any application. 
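One plausible reading of a character level rejection fraction, which
reproduces the example above by blanking the lowest-confidence letters of
each field, can be sketched as follows. This per-field interpretation is
our assumption, not part of the plan:

```python
def char_reject(hyp, confs, frac):
    """Blank out the lowest-confidence characters of one hypothesis.
    confs holds one confidence per non-space character of hyp; the
    frac lowest-confidence characters are replaced by spaces."""
    n_reject = int(round(frac * len(confs)))
    # indices of the n_reject least-confident characters
    cut = set(sorted(range(len(confs)), key=lambda k: confs[k])[:n_reject])
    out, k = [], 0
    for ch in hyp:
        if ch == " ":
            out.append(" ")
        else:
            out.append(" " if k in cut else ch)
            k += 1
    return "".join(out)
```

With the confidences listed above and a rejection fraction of 0.40, this
yields the hypothesis T P G FIL shown in the example.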

On the other hand, with a field level rejection of 0.72, either the entire
hypothesis along with all of its letters is accepted and included in the
set of hypotheses to be scored for both field error and distance, or else
it is rejected and withheld from the set to be scored, depending upon the
rejection threshold. 

This example can also be used to illustrate a potential problem with the
field error fraction. Suppose the hypotheses from three different systems
for the image that said TIPING FILING were 


TIPMG FILMQ

TYPING FILING

and

TIPING FILING

and suppose that the reference phrase were TYPING FILING because the key
entry operator did not notice the misspelling. Then the system giving the
correct classification TIPING FILING would get the same bad score, ef = ef
+ 1, for the correct phrase as the system giving the almost unreadable
TIPMG FILMQ, while the system giving the incorrect phrase TYPING FILING
would get a good score, cf = cf + 1, for what is actually an incorrect
classification. Since this type of error is rare, it may not be a
significant source of error, but correlation of the field error fraction
with the field distance fraction for the various systems should help to
point out potential problems if there are any, or show that there are
none. 


[1] H. G. Zwakenberg, Inexact Alphanumeric Comparisons, The C Users
Journal, 127 (May 1991).

[2] R. Valdes, Finding String Distance, Dr. Dobb's Journal, 56 (April
1992), and references therein. 

[3] R. A. Wagner and M. J. Fischer, The String-to-String Correction
Problem, J. ACM 21, 168 (1974), and reprinted in S. N. Srihari, Tutorial:
Computer Text Recognition and Error Correction, IEEE Computer Press
(1985).

[4] M. D. Garris and S. A. Janet, NIST Scoring Package User's Guide
Release 1.0, NISTIR 4950 (October 1992).

[5] M. D. Garris, Methods for Evaluating the Performance of Systems
Intended to Recognize Characters from Image Data Scanned from Forms,
NISTIR 5129, (February 1993).


