Publication

Structural Information Extraction from Document Images: Addressing Challenges in Layout Analysis, Table Detection, and Classification

Mohammad Minouei

PhD-Thesis, RPTU, 2026.

Abstract

Paper documents remain a vital part of our daily lives, and the need for automated systems to analyze and extract valuable information from these documents is increasingly important. Recent advancements in artificial intelligence have raised user expectations for the extraction of structural information from document images, going beyond the traditional goal of extracting raw text. Typically, document understanding systems comprise multiple components, including layout analysis, table detection, and document classification, each of which presents unique challenges. These challenges include handling complex and varied layouts, addressing imbalanced datasets, and developing systems that can adapt and learn over time. Layout analysis is a critical component of document understanding, as it involves organizing and structuring the various elements of a document, such as text, tables, and figures. Accurate table recognition is also essential, as it enables the effective extraction and interpretation of structured data. This research improves the accuracy, robustness, and efficiency of document analysis, addressing current shortcomings in structural information extraction from documents through novel datasets, model architectures, and learning strategies.
The dissertation presents multiple contributions to the field of document understanding. First, we developed a CNN-based method for layout analysis, achieving a 3 percent improvement over baseline techniques on PubLayNet. Second, we introduced a continual learning strategy employing experience-replay techniques, which reduced catastrophic forgetting in table detection by 15 percent. Third, we presented a novel dataset and developed an asymmetric convolution-based neural network, improving table ruling line recognition. Fourth, to mitigate class imbalance in document classification, we integrated visual and textual features with a customized loss function, resulting in a 13 percent increase in accuracy. The use of Large Language Models (LLMs) for document comprehension was also studied. A technique for fine-tuning LLMs by structuring the input as HTML was created, yielding results on par with state-of-the-art methods while requiring less computational power. In addition, a three-phase prompt engineering strategy for zero-shot information extraction was empirically evaluated, yielding promising outcomes.
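
As a rough illustration of the experience-replay idea mentioned in the abstract, the following minimal sketch keeps a small memory of earlier training samples and mixes some of them into each new batch, so that a model trained on a new document domain keeps seeing older examples. This is a generic sketch and not the thesis's implementation: the class `ReplayBuffer`, its method names, the use of reservoir sampling, and the sample format are all assumptions made for illustration.

```python
import random


class ReplayBuffer:
    """Hypothetical fixed-size memory of past samples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0  # total number of samples offered so far
        self.rng = random.Random(seed)

    def add(self, sample):
        # Reservoir sampling: every sample seen so far has an equal
        # chance of being in the buffer, regardless of arrival order.
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = sample

    def mixed_batch(self, new_batch, replay_size):
        # Append up to `replay_size` stored samples to the new batch.
        k = min(replay_size, len(self.samples))
        return list(new_batch) + self.rng.sample(self.samples, k)


# Usage: store samples from an earlier domain, then build a batch
# for a new domain that also replays two old samples.
buffer = ReplayBuffer(capacity=4)
for i in range(10):
    buffer.add(("old_domain", i))

new_batch = [("new_domain", 0), ("new_domain", 1)]
batch = buffer.mixed_batch(new_batch, replay_size=2)
```

In a real training loop, each mixed batch would be passed to the detector's usual update step; how many old samples to replay per batch is a tuning choice that this sketch leaves open.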

Further Links

https://kluedo.ub.rptu.de/frontdoor/deliver/index/docId/9510/file/phd-thesis-minouei.pdf

phd-thesis-minouei.pdf (PDF, 16 MB)