
Publication

Learn2Clean Event Data

Yousef Koka
Master's thesis, German University in Cairo, August 2024.

Abstract

Data preprocessing is crucial to the success of machine learning (ML) models, particularly in survival analysis, where the goal is to predict time-to-event outcomes. Learn2Clean, a tool that uses Q-Learning to optimize data preprocessing pipelines, has shown promise in improving ML model performance. However, it offered limited support for survival analysis and limited flexibility across different scenarios. This thesis presents an extension of Learn2Clean that addresses these limitations. The extended framework incorporates three prominent survival analysis models: Cox Proportional Hazards, Random Survival Forest, and the DeepHit neural network. It adapts the reward structure and action space of the Q-Learning algorithm to optimize preprocessing pipelines for these models. The framework also improves categorical data handling through ordinal encoding and introduces a configuration file for greater user customization. Dynamic reward matrices, defined in JSON files, further increase the tool's adaptability to diverse datasets and objectives. To validate the extended tool, experiments were conducted on various datasets. The results show that the tool identifies preprocessing pipelines that improve the performance of survival analysis models compared to baseline approaches. The configuration file and dynamic reward matrices allow users to tailor the tool's behavior to their specific needs. This research contributes to the field of survival analysis an extended framework for automated data preprocessing. By adapting Learn2Clean, the work addresses the specific challenges posed by missing data in this domain, aiming to improve the accuracy and robustness of survival models.
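To illustrate how a JSON-defined reward matrix can drive Q-Learning over preprocessing steps, here is a minimal sketch. It is not the thesis's implementation: the JSON schema, the action names, the reward values, and the helper functions `learn_q` and `greedy_pipeline` are all assumptions made for illustration. In the actual tool, the reward for reaching the model step would presumably come from a survival model's performance metric rather than from a fixed number.

```python
import json
import random

# Hypothetical reward matrix file content; this schema is an assumption,
# not the format used by the thesis. rewards[i][j] is the reward for moving
# from action i to action j, and -1 marks a disallowed transition.
REWARD_JSON = """
{
  "actions": ["impute", "encode", "scale", "model"],
  "rewards": [
    [-1, 20,  0,  -1],
    [-1, -1,  0,  50],
    [-1, -1, -1, 100],
    [-1, -1, -1,  -1]
  ]
}
"""


def learn_q(rewards, goal, episodes=500, gamma=0.8, seed=0):
    """Tabular Q-learning (learning rate 1) with random exploration."""
    rng = random.Random(seed)
    n = len(rewards)
    q = [[0.0] * n for _ in range(n)]
    for _ in range(episodes):
        state = rng.randrange(n)  # start each episode at a random step
        while state != goal:
            allowed = [a for a in range(n) if rewards[state][a] >= 0]
            if not allowed:
                break
            nxt = rng.choice(allowed)
            # Immediate reward plus discounted best value of the next state.
            q[state][nxt] = rewards[state][nxt] + gamma * max(q[nxt])
            state = nxt
    return q


def greedy_pipeline(q, rewards, start, goal):
    """Follow the highest-valued allowed transition from start to goal."""
    n = len(q)
    path = [start]
    state = start
    while state != goal and len(path) <= n:
        allowed = [a for a in range(n) if rewards[state][a] >= 0]
        state = max(allowed, key=lambda a: q[state][a])
        path.append(state)
    return path


config = json.loads(REWARD_JSON)
actions = config["actions"]
rewards = config["rewards"]
goal = actions.index("model")

q_table = learn_q(rewards, goal)
pipeline = [actions[i] for i in greedy_pipeline(q_table, rewards, 0, goal)]
print(" -> ".join(pipeline))
```

Because the matrix is read from JSON, changing which transitions are allowed or how strongly they are rewarded requires no code change, which is the kind of adaptability the abstract attributes to dynamic reward matrices.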

Projects