Publication
Materialization and Reuse Optimizations for Production Data Science Pipelines
Behrouz Derakhshan; Alireza Rezaei Mahdiraji; Zoi Kaoudi; Tilmann Rabl; Volker Markl
In: SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data. ACM SIGMOD International Conference on Management of Data (SIGMOD), June 12-17, Philadelphia, USA, The ACM Special Interest Group on Management of Data, 2022.
Abstract
Many companies and businesses train and deploy machine learning (ML) pipelines to answer prediction queries. In many applications, new training data continuously becomes available. A typical approach to ensure that ML models are up to date is to retrain the ML pipelines following a schedule, e.g., every day on the last seven days of data. Several use cases, such as A/B testing and ensemble learning, require many pipelines to be deployed in parallel. Existing solutions train and deploy one pipeline at a time, which generates redundant data processing since pipelines usually share similar operators. Our goal is to eliminate redundant data processing in production data science pipelines using materialization and reuse optimizations. We first categorize the generated artifacts of the pipeline operators into three groups, i.e., computed statistics, transformed data, and trained models. Then, we optimize the execution of the pipelines by materializing and reusing the generated artifacts. Our solution employs a materialization algorithm that, given a storage budget, materializes the subset of artifacts that minimizes the run time of subsequent executions. Furthermore, we offer a reuse algorithm that generates an optimal execution plan by combining the deployed pipelines into a directed acyclic graph (DAG) and reusing the materialized artifacts when appropriate. Our experiments show that our system can reduce the training time by up to an order of magnitude for different deployment scenarios.
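
To make the two ideas in the abstract concrete, the following is a minimal sketch and not the paper's algorithms. It assumes a simple greedy heuristic (saved run time per megabyte) for choosing what to materialize under a storage budget, and it merges linear pipelines by shared operator prefix so that common steps appear only once in the combined graph. All names (`Artifact`, `choose_materialized`, `merge_pipelines`) and all sizes and timings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    name: str       # hypothetical identifier
    kind: str       # "statistics", "transformed_data", or "model"
    size_mb: float  # storage cost; assumed to be > 0
    saved_s: float  # assumed estimate of run time saved if reused


def choose_materialized(artifacts, budget_mb):
    """Greedy heuristic (an assumption, not the paper's method):
    rank by saved seconds per MB and keep whatever still fits."""
    ranked = sorted(artifacts, key=lambda a: a.saved_s / a.size_mb, reverse=True)
    chosen, used = [], 0.0
    for artifact in ranked:
        if used + artifact.size_mb <= budget_mb:
            chosen.append(artifact)
            used += artifact.size_mb
    return chosen


def merge_pipelines(pipelines):
    """Merge linear pipelines into one graph keyed by operator prefix.
    Returns (nodes, edges); a shared prefix yields a single shared node."""
    nodes, edges = set(), set()
    for operators in pipelines:
        prefix = ()
        for op in operators:
            node = prefix + (op,)
            nodes.add(node)
            if prefix:
                edges.add((prefix, node))
            prefix = node
    return nodes, edges


artifacts = [
    Artifact("scaler_stats", "statistics", size_mb=1.0, saved_s=5.0),
    Artifact("scaled_data", "transformed_data", size_mb=100.0, saved_s=60.0),
    Artifact("lr_model", "model", size_mb=10.0, saved_s=120.0),
]
chosen = choose_materialized(artifacts, budget_mb=50.0)

pipelines = [
    ["load", "scale", "train_lr"],
    ["load", "scale", "train_svm"],
]
nodes, edges = merge_pipelines(pipelines)
```

With these made-up numbers, the 100 MB transformed dataset does not fit the 50 MB budget, so only the model and the statistics are kept. The two pipelines contain six operators in total but merge into four nodes, because `load` and `scale` are shared. A greedy ratio rule does not guarantee an optimal selection; it is used here only to keep the sketch short.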