Speech and Language Technology

Language, Data and Knowledge Technologies

Language technologies support a wide range of use cases and application scenarios. In the topic field Language, Data and Knowledge Technologies, we take a holistic approach and actively work on every level of technological component needed to address a given use case, whether with a prototypical demonstrator or with a deployed system ready for production use.

Many language technologies are based on machine learning methods that are able to learn statistical models on the basis of annotated text and language data with the goal of identifying or classifying certain patterns. We collect and curate large data sets, develop fit-for-purpose and fit-for-domain annotation formats and apply dedicated tools for the creation and evaluation of annotated corpora.
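As a rough illustration of what learning a statistical model from annotated text can look like, the following minimal sketch trains a small Naive Bayes text classifier using only the Python standard library. The corpus, the labels and the function names are hypothetical and chosen purely for illustration; they do not describe the tools or data sets actually used in our projects.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, hand-made annotated corpus: (text, label) pairs.
ANNOTATED_CORPUS = [
    ("the court ruled on the contract dispute", "legal"),
    ("the judge dismissed the appeal", "legal"),
    ("the patient received a new vaccine", "health"),
    ("doctors reported fewer infections", "health"),
]


def train(corpus):
    """Count label and word frequencies from the annotated examples."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in corpus:
        label_counts[label] += 1
        for token in text.lower().split():
            word_counts[label][token] += 1
            vocabulary.add(token)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior of the label, estimated from the annotated corpus.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(ANNOTATED_CORPUS)
print(classify("the judge reviewed the contract", model))
```

In practice, the quality of such a model depends far more on the size and the annotation quality of the corpus than on the few lines of training code, which is why data collection and curation carry so much weight.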

Depending on the individual use case, we apply large language models as well as rule-based methods and machine learning methods. In addition to using pre-trained models, we also develop our own novel language models. In this regard we concentrate, among other things, on combining symbolic knowledge representation approaches with large language models so that we can exploit their respective advantages, for example in terms of explainability.

The technologies, tools and language resources we develop are deployed and made available through scalable platforms that are based on modern microservice architectures and that use standardised formats and interfaces for communication and data exchange. We are also involved in the development of these language technology platforms: among other things, we actively bring together the whole European language technology community, including research and industry, under the joint umbrella of the European Language Grid (ELG) cloud platform, which is coordinated by DFKI. Furthermore, we are involved in the development of the German National Research Data Infrastructure (Nationale Forschungsdateninfrastruktur, NFDI) and in Gaia-X.

The common denominator of the various prototypes and demonstrators developed in our research projects is that they are all used as curation technologies. These are AI-based technologies that not only simplify and speed up the processing of digital content but also enable completely novel use cases such as semantic storytelling. So far, curation technologies have been explored in the journalism domain, in medicine/health (especially Covid-19), in the library domain, in museums and in the legal domain.

In our language technology applications, we put a special emphasis on the enrichment of digital content, the linking of extracted information with external knowledge graphs, various types of text classification and content credibility assessment, and automated summarisation and question answering. For these applications, we want to make use of a combination of large language models and knowledge graphs.
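To give a concrete idea of what linking extracted information with a knowledge graph means, the following minimal sketch matches known entity labels in a text and attaches an identifier to each match. The tiny dictionary standing in for a knowledge graph, the example.org identifiers and the function name are all hypothetical; real systems rely on much larger graphs and on disambiguation methods that go well beyond string matching.

```python
# Hypothetical mini knowledge graph: lower-cased label -> entity identifier.
KNOWLEDGE_GRAPH = {
    "berlin": "https://example.org/entity/Berlin",
    "european language grid": "https://example.org/entity/ELG",
}


def link_entities(text, graph):
    """Return (surface form, identifier) pairs for graph labels found in text.

    Only the first occurrence of each label is linked. The simple
    case-insensitive matching assumes plain ASCII input.
    """
    lowered = text.lower()
    matches = []
    for label, identifier in graph.items():
        position = lowered.find(label)
        if position != -1:
            surface = text[position:position + len(label)]
            matches.append((position, surface, identifier))
    # Sort by position so links appear in reading order.
    matches.sort()
    return [(surface, identifier) for _, surface, identifier in matches]


sentence = "The European Language Grid is coordinated from Berlin."
for surface, identifier in link_entities(sentence, KNOWLEDGE_GRAPH):
    print(surface, "->", identifier)
```

Once a mention is tied to an identifier, the content can be enriched with whatever else the graph knows about that entity, which is the basis for the enrichment applications mentioned above.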

With regard to national and international standardisation activities, we contribute to the development of the German Standardisation Roadmap on Artificial Intelligence (DIN, Deutsches Institut für Normung) and to various working and community groups at W3C (World Wide Web Consortium), the German/Austrian Chapter of which is led by our team.

Finally, we are involved in the coordination of initiatives that attempt to bring together the whole European language technology community and also to develop technologies for the multilingual and digital European society, because many European languages are in severe danger of digital extinction. We want to fundamentally change the situation and achieve digital language equality in Europe by the year 2030.

Relevant keywords are:

Language and knowledge technology platforms
Technologies for the curation of digital content
Large language models
Knowledge graphs, ontologies, Linked Open Data
Text and document processing
Interoperability of language technologies
Description and standardisation of language technologies and language resources
Open Data and Open Science
Digital language equality and multilingualism

Our current and recent projects develop, among other things, platforms for language technologies and language resources (ELG) as well as tools, services and technologies for the curation of digital content with a specific focus on professional use cases in selected domains (QURATOR, PANQURA, Lynx, DKT). We use knowledge-based methods (knowledge graphs, ontologies, Linked Data) as well as large language models that we also train ourselves (SPEAKER, OpenGPT-X). A separate area of work deals with the formal description and standardisation of language technologies and language resources, among other things with regard to Open Data and Open Science, as well as with the development of infrastructures for research and deployment (ELG, OpenGPT-X, NFDI4DataScience).

Selected projects:

European Language Grid (ELG)
European Language Equality (ELE)
QURATOR – Curation Technologies
PANQURA
SPEAKER – Sprachassistenzplattform made in Germany
NFDI4DataScience (NFDI for Data Science and Artificial Intelligence)
OpenGPT-X
SoNAR
Lynx
Digitale Kuratierungstechnologien (DKT)

Links

Sprachtechnologie am DFKI