Publication
LThist 2012: First International Workshop on Language Technology for Historical Text(s)
Thierry Declerck; Brigitte Krenn; Karlheinz Mörth (eds.)
Workshop on Language Technology for Historical Text(s) (LThist-12), held at the 11th Conference on Natural Language Processing, September 21, Vienna, Austria, ÖGAI, 9/2012.
Abstract
The interaction between Human Language Technology (HLT) and Digital Humanities (DH) at large
has been the focus of various projects and initiatives in recent years, which aim to advance
language resources and tools for the Humanities, Social Sciences and Cultural Heritage.
The specific focus of LThist 2012 lies on the development of technology and resources required for
processing historical texts. Workshop contributors and participants discuss ways and strategies for
shaping HLT resources (tools, data and metadata) in ways that are maximally beneficial for
researchers in the Humanities. The necessity for a strong interplay between proponents from
language technology and from the Humanities is also reflected in the invited talks. While Caroline
Sporleder takes a language technology perspective, Sonia Horn addresses the needs and requirements
from a medical historian's point of view. A major aspect of the workshop is the exchange of
experiences with and comparison of tools, approaches, and standards that make historical texts
accessible to automatic processing. Moreover, LThist encourages the interchange of historical data
and processing tools.
In the present workshop, historical texts are understood in two ways: i) texts as documents of older
forms of languages, and ii) texts as documentations of historical content. Accordingly, the
contributions comprise a broad range of topics, genres and diachronic language varieties, including
scientific prose, narratives, folk tales, riddles etc., as well as trade-related documents and marriage
license books, the latter being a valuable resource for demographic studies. The presented
papers address various aspects of data preparation and (semi-)automatic processing for a number of
languages including Old Swedish, Late Middle English, Middle English, Early Modern English and
Modern English, diachronic varieties of German, Dutch and Spanish, and Old Occitan. The proposed
approaches and technical solutions center around problem areas such as improving the OCR quality
of historical texts, orthography harmonization and mapping historical to modern word forms, as
prerequisites for automatic mining of historical texts. Also, the possibilities of cross-language
transfer of morphosyntactic and syntactic annotation from resource-rich source languages to under-resourced
target languages are examined. Technical infrastructures, specifically tailored for historical
corpora, are discussed, including mark-up languages for historical texts and representation formats
for diachronic lexical databases, processing tools and architectures.
Overall, LThist 2012 reflects well the current discussions on the automatic processing of
historical texts, where OCR errors and the lack of harmonization in orthography are still major
practical issues, but where machine learning and cross-language transfer are also increasingly
coming into focus.