Publication

Towards Automated Data Cleaning Workflows

Mohammad Mahdavi; Felix Neutatz; Larysa Visengeriyeva; Ziawasch Abedjan
In: Robert Jäschke; Matthias Weidlich (Eds.). Proceedings of the Conference on "Lernen, Wissen, Daten, Analysen". GI-Workshop-Tage "Lernen, Wissen, Daten, Analysen" (LWDA-2019), September 30 - October 2, Berlin, Germany, Pages 10-19, CEUR, 9/2019.

Abstract

The success of AI-based technologies depends crucially on trustworthy and clean data. Research in data cleaning has provided a variety of approaches to address different data quality problems. Most of them require some prior knowledge about the dataset in order to select and configure the approach correctly. We argue that for unknown datasets, it is unrealistic to know the data quality problems up front and to formulate all necessary quality constraints in one shot. Pragmatically, the user solves data quality problems by implementing an iterative cleaning process. This incremental approach poses the challenge of identifying the right sequence of cleaning routines and their configurations. In this paper, we highlight our work in progress towards building a cleaning workflow orchestrator that learns from past cleaning tasks and proposes promising cleaning workflows for a new dataset. To this end, we highlight new approaches for selecting the most promising error detection routines, aggregating their outputs, and explaining the final results.
