Publication
Meaningless Text OCR Model for Medieval Scripts
Syed Saqib Bukhari; Adnan Ul-Hasan; Andreas Dengel
2/2016.
Abstract
The availability of a large amount of ground-truth data for training an Optical
Character Recognition (OCR) engine is critical. Training data is usually
produced by manually transcribing thousands of document images. In order to
augment the limited training data, synthetic training data is also used, where training
data is produced by rendering text into images in suitable fonts and styles. The most
important part of synthetic training data is the corresponding real-world text. If
real-world text data is unavailable, which can be the case for historical manuscripts,
generating synthetic training data is not possible. In this paper, we address this
problem for historical manuscripts whose vocabulary and sentence structure are
neither available in text form nor similar to those of any existing
(contemporary) script. For such cases, we introduce a novel meaningless
text OCR model, where meaningless words of variable sizes are generated by
permuting characters. Meaningless text lines are subsequently produced by randomly
choosing these meaningless words. Testing the meaningless textline recognizer
on real textlines shows good performance.
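
The two generation steps described above (permuting characters into meaningless words, then randomly choosing words to form text lines) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not state the alphabet, the word-length range, the number of words per line, or whether a character may repeat within a word. Here we assume permutations without repetition, a placeholder four-letter alphabet, and hypothetical function names (`generate_meaningless_words`, `generate_text_line`).

```python
import itertools
import random


def generate_meaningless_words(alphabet, min_len=1, max_len=3):
    """Build meaningless words of variable size by permuting characters.

    Assumption: each word is an ordered arrangement of distinct characters
    from `alphabet`; the source does not say whether repeats are allowed.
    """
    words = []
    for length in range(min_len, max_len + 1):
        for chars in itertools.permutations(alphabet, length):
            words.append("".join(chars))
    return words


def generate_text_line(words, n_words, rng):
    """Form one meaningless text line by randomly choosing words."""
    return " ".join(rng.choice(words) for _ in range(n_words))


# Placeholder alphabet; a real setup would use the character set of the script.
alphabet = "abcd"
words = generate_meaningless_words(alphabet, min_len=1, max_len=3)

# A seeded generator keeps the sketch reproducible.
rng = random.Random(0)
lines = [generate_text_line(words, n_words=5, rng=rng) for _ in range(3)]
```

Each resulting line would then be rendered into an image in a suitable font to serve as synthetic training data. Note that the number of permutations grows quickly with alphabet size and word length, so a practical version would likely sample words rather than enumerate all of them.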
The rest of the paper answers the following questions in sequence: Which types of
historical documents are we dealing with? Why is a textline-based recognizer
preferable to a character-based recognizer? What is the traditional way of training
textline-based recognizers? What novel technique do we present to overcome the
limitations of the traditional training procedure? What initial results have we
achieved?