Publication
Approaches for Automated Data Quality Analysis: Syntactic and Semantic Assessment
Agbodzea Pascal Ahiagble; Hannah Stein
In: Jahrestagung der Gesellschaft für Informatik 2022. Jahrestagung der Gesellschaft für Informatik (INFORMATIK-2022), September 26-30, Hamburg, Germany, ISBN 978-3-88579-720-3, Gesellschaft für Informatik, Bonn, 9/2022.
Abstract
Data quality significantly influences data usability and plays an important role in data trading. This paper presents a data quality analysis (DQA) of data tables on two levels. The first, the so-called syntactic level, concerns the structure of the elements within the database; the second, the so-called semantic level, concerns the relationship between the elements in the database and the "real world". Based on a literature review, the most relevant data quality criteria and corresponding metrics were derived. Subsequently, a service for automated syntactic DQA is designed and implemented, based on heuristics, a data-centric approach, and the unsupervised machine learning clustering algorithm DBSCAN. In the next step, an automated semantic DQA service is designed and implemented as well. This approach is used to examine data tables, for example, for missing relevant columns (i.e., semantic completeness). The services’ output is a data quality index, which is derived from the automated analysis of various data quality criteria. This enables the assessment of data quality, as well as the detection of potential for improving quality and thus increasing the value of tradeable data.
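As a rough illustration of how criteria on the two levels could feed into a single index, the following Python sketch computes one syntactic criterion (the share of filled cells) and one semantic criterion (the share of expected columns that are present) and combines them into a weighted average. The abstract does not specify the paper's metrics, weights, or aggregation rule, so the function names, the list of expected columns, and the equal weighting are all assumptions made for this example, and the DBSCAN-based part of the syntactic analysis is not covered.

```python
from typing import Dict, Optional, Sequence

# A table row maps column names to cell values (hypothetical representation).
Row = Dict[str, Optional[str]]


def syntactic_completeness(rows: Sequence[Row], columns: Sequence[str]) -> float:
    """Share of cells that are neither None nor an empty string."""
    total = len(rows) * len(columns)
    if total == 0:
        return 0.0
    filled = sum(
        1 for row in rows for col in columns if row.get(col) not in (None, "")
    )
    return filled / total


def semantic_completeness(columns: Sequence[str], expected: Sequence[str]) -> float:
    """Share of expected (domain-relevant) columns that the table contains."""
    if not expected:
        return 1.0
    present = set(columns)
    return sum(1 for col in expected if col in present) / len(expected)


def quality_index(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of criterion scores (assumed aggregation rule)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Illustrative data, not taken from the paper.
rows = [
    {"name": "Ada", "city": "Hamburg"},
    {"name": "Bob", "city": ""},
]
columns = ["name", "city"]
expected_columns = ["name", "city", "country", "zip"]  # assumed domain knowledge

scores = {
    "syntactic_completeness": syntactic_completeness(rows, columns),
    "semantic_completeness": semantic_completeness(columns, expected_columns),
}
weights = {"syntactic_completeness": 1.0, "semantic_completeness": 1.0}
index = quality_index(scores, weights)
print(scores, index)
```

In this toy table, three of four cells are filled and two of four expected columns are present, so a low semantic score points directly at the missing columns as the place to improve the data.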