
Graduate Student Research Showcase


Thursday, September 8, 2011 - 11:00am


Thomas Packer, Bill Lund, and Doug Kennard

PhD students in the Computer Science Department at BYU


Bill Barrett

Title: Performing Information Extraction to Improve OCR Error Detection in
Semi-structured Historical Documents

Thomas Packer

Optical character recognition (OCR) produces transcriptions of document
images. These transcriptions often contain incorrectly recognized characters
which we must avoid or correct downstream. An ability to both identify OCR
errors and extract information from OCR output would allow us to extract and
index only correct information and to post-process specific parts of the OCR
output with targeted resources (e.g. re-OCR using specialized dictionaries).
We present a general approach to OCR error detection that uses a hidden
Markov model trained to simultaneously detect OCR errors and extract
information. We evaluate this approach in two information extraction
settings and on semi-structured text from two machine-printed family history
documents. We show this joint approach to OCR error detection to be an
improvement over two alternative approaches, one based on dictionary
matching and the other using a hidden Markov model trained only to detect
OCR errors. In particular, we report an average increase of 8% in
macro-averaged F-measure from the dictionary approach to our best HMM.
Our contribution is to show that a word-model-based approach to OCR error
detection can be improved by performing it jointly with information
extraction, and that the improvement holds regardless of the particular
extraction task.
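
As a rough illustration of the joint labeling idea, the sketch below runs
Viterbi decoding over a state set that crosses a field label with an error
flag, so a single pass both extracts fields and flags likely OCR errors. The
state names, toy transition and emission probabilities, and surface cues are
illustrative assumptions, not the trained model described above.

import math

# Hidden states: cross-product of a field label and an error flag.
STATES = ["NAME/OK", "NAME/ERR", "DATE/OK", "DATE/ERR", "OTHER/OK", "OTHER/ERR"]

# Toy start and transition probabilities; a real model would be trained.
START = {s: 1.0 / len(STATES) for s in STATES}
TRANS = {s: {t: 1.0 / len(STATES) for t in STATES} for s in STATES}

def emission(state, token):
    """Toy emission score for a token under a state, using crude surface
    cues (capitalization, digits, non-alphanumeric OCR noise)."""
    field, err = state.split("/")
    noisy = any(not c.isalnum() for c in token)
    looks_name = token[:1].isupper() and token.isalpha()
    looks_date = any(c.isdigit() for c in token)
    p = 0.1
    if field == "NAME" and looks_name:
        p = 0.6
    if field == "DATE" and looks_date:
        p = 0.6
    if field == "OTHER" and not (looks_name or looks_date):
        p = 0.5
    # Error states prefer tokens containing noise characters, and vice versa.
    p *= 0.7 if (err == "ERR") == noisy else 0.3
    return p

def viterbi(tokens):
    """Standard Viterbi decoding over the joint label set."""
    V = [{s: math.log(START[s]) + math.log(emission(s, tokens[0]))
          for s in STATES}]
    back = [{}]
    for t in range(1, len(tokens)):
        V.append({})
        back.append({})
        for s in STATES:
            prev, score = max(((p, V[t - 1][p] + math.log(TRANS[p][s]))
                               for p in STATES), key=lambda x: x[1])
            V[t][s] = score + math.log(emission(s, tokens[t]))
            back[t][s] = prev
    state = max(V[-1], key=V[-1].get)      # trace back the best path
    path = [state]
    for t in range(len(tokens) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))

if __name__ == "__main__":
    ocr_tokens = ["Jo#n", "Smith", "b.", "1851"]   # "Jo#n" simulates an OCR error
    for tok, label in zip(ocr_tokens, viterbi(ocr_tokens)):
        print(f"{tok:>6}  ->  {label}")

Decoding assigns each token a joint label, so the same pass that tags "1851"
as a date can flag "Jo#n" as a probable OCR error.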


Title: Progressive Alignment and Discriminative Error Correction for Multiple OCR Engines

William Lund, Daniel D. Walker, and Dr. Eric K. Ringger

Abstract—This paper presents a novel method for improving optical character
recognition (OCR). The method employs the progressive alignment of
hypotheses from multiple OCR engines followed by final hypothesis selection
using maximum entropy classification methods. The maximum entropy models are
trained on a synthetic calibration data set. Although progressive alignment
is not guaranteed to be optimal, the results are nonetheless strong. The
synthetic data set used to train or calibrate the selection models is chosen
without regard to the test data set; hence, we refer to it as “out of
domain.” It is synthetic in the sense that document images have been
generated from the original digital text and degraded using realistic error
models. Along with the true transcripts and OCR hypotheses, the calibration
data contains sufficient information to produce good models of how to select
the best OCR hypothesis and thus correct mistaken OCR hypotheses. Maximum
entropy methods leverage that information using carefully chosen feature
functions to choose the best possible correction. Our method shows a 24.6%
relative improvement over the word error rate (WER) of the best performing
of the five OCR engines employed in this work. Relative to the average WER
of all five OCR engines, our method yields a 69.1% relative reduction in the
error rate. Furthermore, 52.2% of the documents achieve a new low WER.
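
To make the pipeline concrete, the sketch below progressively merges the
token sequences of several OCR hypotheses into aligned columns and then
selects one token per column. It is only an outline under assumptions: the
alignment uses Python's difflib rather than the authors' aligner, and plain
majority voting stands in for the maximum entropy selection trained on the
synthetic calibration data.

from collections import Counter
from difflib import SequenceMatcher

GAP = ""  # placeholder for an insertion/deletion in the alignment

def merge(columns, hyp, width):
    """Merge one hypothesis (a token list) into the running alignment.
    `columns` is a list of token columns of the given `width`."""
    # Align the new hypothesis against a rough consensus: the first
    # non-gap token in each existing column.
    consensus = [next((t for t in col if t != GAP), GAP) for col in columns]
    ops = SequenceMatcher(a=consensus, b=hyp, autojunk=False).get_opcodes()
    merged = []
    for tag, i1, i2, j1, j2 in ops:
        if tag in ("equal", "replace"):
            for k in range(max(i2 - i1, j2 - j1)):
                col = columns[i1 + k] if i1 + k < i2 else [GAP] * width
                tok = hyp[j1 + k] if j1 + k < j2 else GAP
                merged.append(col + [tok])
        elif tag == "delete":      # existing columns with no counterpart
            for i in range(i1, i2):
                merged.append(columns[i] + [GAP])
        elif tag == "insert":      # new tokens with no existing column
            for j in range(j1, j2):
                merged.append([GAP] * width + [hyp[j]])
    return merged

def progressive_align(hypotheses):
    """Fold each hypothesis, in turn, into a column-wise alignment."""
    columns = [[tok] for tok in hypotheses[0]]
    for n, hyp in enumerate(hypotheses[1:], start=1):
        columns = merge(columns, hyp, n)
    return columns

def select(columns):
    """Stand-in for the maximum entropy selector: plain majority vote."""
    output = []
    for col in columns:
        tok, _ = Counter(col).most_common(1)[0]
        if tok != GAP:
            output.append(tok)
    return output

if __name__ == "__main__":
    hyps = ["the qu1ck brown fox".split(),
            "the quick brown fax".split(),
            "tho quick brown fox".split()]
    print(" ".join(select(progressive_align(hyps))))   # -> the quick brown fox

For reference, a relative WER reduction of this kind is conventionally
computed as (baseline WER - new WER) / baseline WER.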


Title: Word Warping: A New Approach to Offline Handwriting Recognition

Doug Kennard and Bill Barrett

Abstract—We present a novel method of offline whole-word handwriting
recognition.  We use automatic image morphing to compute 2-D geometric warps
that align the strokes of each word image with the strokes of word images of
training examples.  Once the strokes of a given word are aligned to a
training example, we use distance maps to compare how similar the two words
are.  Like 1-D Dynamic Programming (DP) methods, our warp-based method is
robust to limited variation in word length and letter spacing.  However, due
to its 2-D nature, our method is also more robust than 1-D DP methods in
handling variations caused by additional inconsistencies in character shape
and stroke placement.  Although we use DP for coarse alignment, the novel
contribution of this paper is not 2-D DP, but morphing to automatically
discover an actual 2-D mesh-based warp, followed by the use of distance maps
to compute similarity between words.  Early results are encouraging.  On two
datasets (1,000 training and 1,000 test words each), we get 88.77% and
89.33%  recognition accuracy for in-vocabulary words.  These are increases
of 7.89% and 17.16% above the results of a 1-D DP approach.
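
The fragment below sketches only the final comparison step: given two
already-aligned binary word images, it scores their similarity with a
symmetric, distance-map-based (chamfer-style) measure and returns the label
of the closest training example. The morphing-based 2-D warp itself is
omitted, and the use of SciPy's Euclidean distance transform and the toy
images are assumptions, not the authors' implementation.

import numpy as np
from scipy.ndimage import distance_transform_edt

def stroke_distance(a, b):
    """Average distance from each ink pixel of `a` to the nearest ink pixel
    of `b`, and vice versa (lower means more similar). `a` and `b` are
    boolean arrays in which True marks ink."""
    dist_to_b = distance_transform_edt(~b)    # distance map of image b
    dist_to_a = distance_transform_edt(~a)    # distance map of image a
    forward = dist_to_b[a].mean()             # a's strokes against b's map
    backward = dist_to_a[b].mean()            # b's strokes against a's map
    return 0.5 * (forward + backward)

def recognize(query, training_examples):
    """Whole-word recognition: label of the nearest training word image."""
    return min(training_examples,
               key=lambda labeled: stroke_distance(query, labeled[1]))[0]

if __name__ == "__main__":
    # Tiny synthetic "word" images, just to exercise the scoring: a vertical
    # bar, a slightly shifted copy, and a horizontal bar.
    blank = np.zeros((10, 10), dtype=bool)
    bar = blank.copy()
    bar[2:8, 4] = True
    shifted = blank.copy()
    shifted[2:8, 5] = True
    horiz = blank.copy()
    horiz[5, 2:8] = True
    print(recognize(bar, [("shifted-bar", shifted), ("horizontal-bar", horiz)]))
    # -> shifted-bar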


Thomas Packer is a student in Dr. Embley's Data Extraction Research Group. He is currently writing his dissertation proposal.  The project he is exploring aims to automatically infer the structure of lists of relational facts in OCRed documents (e.g., printed sales receipts or family history books) and to extract rich information from them.

William Lund, the Assistant University Librarian for Information Technology at the Lee Library of Brigham Young University, holds a Master of Science in Operations Research from Stanford University and an MLIS degree from Drexel University. A former research and development manager for the Hewlett-Packard Company, he has a background spanning both library and information technology. He is currently researching methods for correcting optical character recognition errors in degraded historical documents using natural language processing tools, as part of a PhD program in computer science at BYU.

Doug Kennard is a PhD student working for Dr. Barrett as a T.A. and research assistant in the Graphics, Computer Vision, and Image Processing Lab.  His current research is in the area of handwriting recognition and historical document image processing.  He also works on family history technology projects, including the BYU Historic Journals website.