Publication
Document cleanup using page frame detection
Faisal Shafait; Joost van Beusekom; Daniel Keysers; Thomas Breuel
In: International Journal on Document Analysis and Recognition, Vol. 11, No. 2, Pages 81-96, Springer-Verlag, 11/2008.
Abstract
When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective for document image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page contents area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. Further tests were run on a dataset of magazine pages and on a set of camera-captured document images. To demonstrate the benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page frame, the OCR error rate is reduced from 4.3% to 1.7% on the UW-III dataset. The use of page frame detection in a layout-based document image retrieval application decreases the retrieval error rates by 30%.
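
The cleanup step mentioned in the abstract, removing characters that lie outside the computed page frame, can be illustrated with a minimal Python sketch. It assumes the page frame has already been found and is available as an axis-aligned rectangle, and that the OCR system reports one bounding box per character. The names `Box`, `inside`, and `filter_by_page_frame` are hypothetical and chosen for illustration only. The sketch does not implement the geometric matching algorithm that computes the page frame.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in pixel coordinates.

    (x0, y0) is the top-left corner and (x1, y1) the bottom-right corner.
    This representation is an assumption made for the example.
    """
    x0: int
    y0: int
    x1: int
    y1: int


def inside(frame: Box, box: Box) -> bool:
    """Return True if `box` lies entirely within `frame`."""
    return (frame.x0 <= box.x0 and frame.y0 <= box.y0
            and box.x1 <= frame.x1 and box.y1 <= frame.y1)


def filter_by_page_frame(frame: Box, char_boxes: List[Box]) -> List[Box]:
    """Keep only the character boxes fully contained in the page frame."""
    return [b for b in char_boxes if inside(frame, b)]
```

With a frame of `Box(100, 100, 900, 1200)`, a character box at `Box(10, 150, 30, 180)` (for example, text from the neighboring page) would be dropped, while one at `Box(120, 150, 140, 180)` would be kept. Requiring full containment is one possible design choice; an overlap threshold would be an equally reasonable alternative.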