Combining Knowledge about Document Types and Structures for Enhanced Content Curation

Karolina Zaczynska, Florian Kintzel, Julian Moreno Schneider, Georg Rehm

In: Adrian Paschke, Georg Rehm, Jamal Al Qundus, Clemens Neudecker, Lydia Pintscher (eds.). Proceedings of QURATOR 2021 -- Conference on Digital Curation Technologies. Conference on Digital Curation Technologies (QURATOR-2021), February 8-12, Berlin/Virtual, Germany. CEUR Workshop Proceedings, 2/2021.


We present the conceptual design of a language technology (LT) system that enables enhanced curation and processing of different document types by providing customized NLP workflows that respond and adapt to the extracted characteristics of the input documents. To optimize document and text understanding, the processing steps will incorporate not only textual features but also layout and document-type-related features, such as document structure and the communicative function of specific parts or constituents of a document (e.g., header, subtitle, paragraph, footer). We address the lack of standardized representation formats for many of these document features by presenting a first draft of an ontology (QOntology), which we plan to incorporate into the overall workflow manager. Since this is work in progress, we present the theoretical background and the conceptual design decisions of the approach, which will form the basis of experiments in future work.
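
The idea of a workflow that adapts to document type and to the communicative function of document parts can be illustrated with a minimal Python sketch. It is not the system described in the paper: all names (Segment, Document, WORKFLOWS, run_workflow), the document types, and the toy processing steps are hypothetical and chosen only for illustration. QOntology is not modeled here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Segment:
    role: str  # assumed label for the communicative function, e.g. "header"
    text: str


@dataclass
class Document:
    doc_type: str  # assumed label for the detected document type
    segments: List[Segment] = field(default_factory=list)


Step = Callable[[Segment], str]


def process_header(seg: Segment) -> str:
    # Toy step: normalize whitespace and tag the result as a title.
    return f"TITLE: {seg.text.strip()}"


def process_paragraph(seg: Segment) -> str:
    # Toy step: report a whitespace-based token count for body text.
    return f"BODY[{len(seg.text.split())} tokens]"


# Hypothetical mapping: document type -> segment role -> processing step.
WORKFLOWS: Dict[str, Dict[str, Step]] = {
    "news_article": {"header": process_header, "paragraph": process_paragraph},
    "letter": {"paragraph": process_paragraph},
}


def run_workflow(doc: Document) -> List[str]:
    """Apply only the steps registered for this document type and role."""
    steps = WORKFLOWS.get(doc.doc_type, {})
    results = []
    for seg in doc.segments:
        step = steps.get(seg.role)
        if step is not None:  # roles without a registered step are skipped
            results.append(step(seg))
    return results


doc = Document(
    "news_article",
    [
        Segment("header", "  Digital Curation  "),
        Segment("paragraph", "Three short tokens"),
        Segment("footer", "Page 1"),
    ],
)
print(run_workflow(doc))
```

In this sketch the footer is ignored because no step is registered for it, and an unknown document type yields an empty result. A real workflow manager would presumably select steps from a richer description of the document than a plain dictionary.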


