Publication

iDocChip: A Configurable Hardware Architecture for Historical Document Image Processing: Percentile Based Binarization

Vladimir Rybalkin; Syed Saqib Bukhari; Aqib Ghafoor; Muhammad Mohsin Ghaffar; Norbert Wehn; Andreas Dengel

In: The 18th ACM Symposium on Document Engineering. ACM Symposium on Document Engineering (DocEng-2018), August 28-31, Halifax, Nova Scotia, Canada, ACM, 2018.

Abstract

End-to-end Optical Character Recognition (OCR) systems are heavily used to convert document images into machine-readable text. Commercial and open-source OCR systems (such as Abbyy, OCRopus, and Tesseract) have traditionally been optimized for contemporary documents like books, letters, memos, and other end-user documents. However, these systems are difficult to use equally well for digitizing historical document images, which contain degradations like non-uniform shading, bleed-through, and irregular layout; such degradations usually do not exist in contemporary document images. The open-source anyOCR is an end-to-end OCR pipeline that contains the state-of-the-art techniques required for digitizing degraded historical archives with high accuracy. However, high accuracy comes at the cost of high computational complexity, which results in long runtime that limits the digitization of big collections of historical archives, and in high energy consumption, which is the most critical limiting factor for portable devices with a constrained energy budget. Therefore, we target energy-efficient and high-throughput acceleration of the anyOCR pipeline. General-purpose computing platforms fail to meet these requirements, which makes custom hardware design mandatory. In this paper, we present a new concept named iDocChip. It is a portable hybrid hardware-software FPGA-based accelerator characterized by a small footprint, high power efficiency that allows its use in portable devices, and high throughput that makes it possible to process big collections of historical archives in real time without affecting the accuracy. In this paper, we focus on binarization, which is the second most critical step in the anyOCR pipeline after the text-line recognizer, which we presented in a previous publication.
The anyOCR system makes use of a Percentile Based Binarization (PBB) method that is suitable for overcoming degradations like non-uniform shading and bleed-through. To the best of our knowledge, we propose the first hardware architecture of the PBB technique. Based on the new architecture, we present a hybrid hardware-software FPGA-based accelerator that outperforms the existing anyOCR software implementation running on an i7-4790T in terms of runtime by a factor of 21x, while achieving a higher energy efficiency of 10 Images/J than low-power embedded processors with negligible loss of recognition accuracy.
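
The abstract does not spell out the PBB algorithm, so the following is only a minimal software sketch of the general idea behind percentile-based binarization, not the anyOCR implementation or the proposed hardware architecture. It assumes one common variant: the local paper brightness is estimated as a high percentile of the gray values in a window around each pixel, the pixel is divided by that estimate to flatten uneven shading, and the result is thresholded. The function names, the window radius, the 80th percentile, and the 0.5 threshold are all illustrative assumptions.

```python
def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in [0, 100]."""
    ordered = sorted(values)
    rank = round(q / 100 * (len(ordered) - 1))
    return ordered[rank]


def local_percentile(image, radius, q):
    """Per-pixel percentile over a (2*radius+1)^2 window, clipped at the borders."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            out[y][x] = percentile(window, q)
    return out


def binarize(image, radius=2, q=80, threshold=0.5):
    """Binarize a grayscale image given as rows of floats in [0, 1].

    Returns rows of 1 (paper) and 0 (ink). All defaults are assumed,
    illustrative values rather than parameters taken from the paper.
    """
    background = local_percentile(image, radius, q)
    result = []
    for row, bg_row in zip(image, background):
        out_row = []
        for pixel, bg in zip(row, bg_row):
            # Dividing by the local background estimate flattens the shading.
            flat = min(1.0, pixel / bg) if bg > 0 else 1.0
            out_row.append(1 if flat > threshold else 0)
        result.append(out_row)
    return result
```

On a page whose brightness falls off from one side to the other, a single global threshold at 0.5 would mark the darker side as ink; dividing by the local estimate removes that dependence on absolute brightness.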