Publication

Meaningless Text OCR Model for Medieval Scripts

Syed Saqib Bukhari; Adnan Ul-Hasan; Andreas Dengel

2/2016.

Abstract

The availability of a large amount of ground-truth data is critical for training an Optical Character Recognition (OCR) engine. Training data is usually produced by manually transcribing thousands of document images. To augment limited training data, synthetic training data is also used, produced by rendering text into images in suitable fonts and styles. The most important ingredient of synthetic training data is the corresponding real-world text. If real-world text is unavailable, as can be the case for historical manuscripts, generating synthetic training data in this way is not possible. This paper addresses that problem for historical manuscripts whose vocabulary and sentence structure are neither available in text form nor similar to those of any existing (contemporary) script. For such cases, we introduce a novel meaningless text OCR model, in which meaningless words of variable length are generated by permuting characters. Meaningless textlines are then produced by randomly choosing from these meaningless words. Testing the meaningless textline recognizer on real textlines shows good performance. The rest of the paper answers the following questions in sequence: Which types of historical documents are we dealing with? Why is a textline-based recognizer preferable to a character-based recognizer? What is the traditional way of training textline-based recognizers? What novel technique do we present to overcome the limitations of the traditional training procedure? What initial results have we achieved?
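
The text-generation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the length ranges, and the example alphabet are assumptions made for illustration, and "permuting characters" is interpreted here as drawing characters at random with repetition allowed, which is only one possible reading. Rendering the generated lines into images is not shown.

```python
import random


def make_meaningless_words(alphabet, n_words, min_len=1, max_len=8, rng=None):
    """Build a pool of meaningless words of variable length.

    Assumption: characters are sampled uniformly with repetition from
    `alphabet`; the paper may use a different permutation scheme.
    """
    if rng is None:
        rng = random.Random()
    words = []
    for _ in range(n_words):
        length = rng.randint(min_len, max_len)  # inclusive on both ends
        words.append("".join(rng.choice(alphabet) for _ in range(length)))
    return words


def make_meaningless_lines(words, n_lines, min_words=3, max_words=8, rng=None):
    """Build text lines by randomly choosing words from the pool."""
    if rng is None:
        rng = random.Random()
    lines = []
    for _ in range(n_lines):
        count = rng.randint(min_words, max_words)
        lines.append(" ".join(rng.choice(words) for _ in range(count)))
    return lines


if __name__ == "__main__":
    # Hypothetical character set; a real one would come from the target script.
    alphabet = "abcdefghiklmnopqrstuvxyz"
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    pool = make_meaningless_words(alphabet, n_words=50, rng=rng)
    for line in make_meaningless_lines(pool, n_lines=3, rng=rng):
        print(line)
```

Each generated line would then be rendered in a suitable font to obtain an image and transcription pair for training a textline recognizer.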