
Publication

Creating Customers That Never Existed - Synthesis of E-commerce Data Using CTGAN

Melle Mendikowski; Mattis Hartwig
In: ibai - publishing (Ed.). Machine Learning and Data Mining in Pattern Recognition. International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM-2022), 18th International Conference on Machine Learning and Data Mining, July 16-21, New York, New York, USA, Pages 91-105, Vol. 284, ISBN 978-3-942952-93-4, Springer, Heidelberg, 7/2022.

Abstract

Many e-commerce use cases that companies implement in their applications rely on customers' personal data. Because privacy and data protection are central to any discussion of how such data is used, the demand for data collection conflicts with the demand for data protection. Researchers have found a promising solution to this problem: generating synthetic data that is not connected to real people. In this paper, we use the deep learning architecture Conditional Tabular Generative Adversarial Network (CTGAN) to synthesize e-commerce data. In particular, the categorical relationships between columns of e-commerce data include fixed dependencies; for example, an entry in the sub-category column also determines the entry in the category column. These characteristics make it necessary to evaluate whether the CTGAN architecture is suitable for synthesizing e-commerce data, which is the focus of this paper. We present a new similarity measure for synthetic and original datasets that focuses on categorical correlations: the Cramér's V deviation (CV-deviation). In our experiments, we create synthetic e-commerce data from a publicly available dataset using CTGAN. We use an existing measure and our newly developed CV-deviation measure in hyperparameter selection and compare the outcomes. By incorporating CV-deviation into the performance metric, we increase the ability of CTGAN to preserve correct categorical relations by 63%. Despite these enhancements, the evaluation of the synthetic datasets shows that the overall architecture still has room for improvement, because the CTGAN model appears to have difficulty learning all categorical constraints automatically.
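
The abstract does not spell out how CV-deviation is computed, so the following is only a minimal sketch under one plausible reading: compute Cramér's V for every pair of categorical columns in the original and in the synthetic dataset, then average the absolute differences. The function names `cramers_v` and `cv_deviation`, the column names, and the toy data are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V (without bias correction) for two categorical sequences."""
    n = len(xs)
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    joint = Counter(zip(xs, ys))
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k == 0:
        # Assumption: an empty or constant column carries no association.
        return 0.0
    chi2 = 0.0
    for x, row_count in row_totals.items():
        for y, col_count in col_totals.items():
            expected = row_count * col_count / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


def cv_deviation(real, synthetic, columns):
    """Mean absolute difference in Cramér's V over all column pairs.

    `real` and `synthetic` map column names to lists of values.
    This averaging scheme is an assumed definition, not taken from the paper.
    """
    pairs = list(combinations(columns, 2))
    if not pairs:
        raise ValueError("need at least two columns")
    diffs = [
        abs(cramers_v(real[a], real[b]) - cramers_v(synthetic[a], synthetic[b]))
        for a, b in pairs
    ]
    return sum(diffs) / len(diffs)


# Toy data: in `real`, each sub-category belongs to exactly one category.
real = {
    "category": ["A", "A", "B", "B"],
    "sub_category": ["a1", "a2", "b1", "b2"],
}
# In `synthetic`, the fixed dependency is lost entirely.
synthetic = {
    "category": ["A", "B", "A", "B"],
    "sub_category": ["a1", "a1", "b1", "b1"],
}

print(cv_deviation(real, synthetic, ["category", "sub_category"]))  # prints 1.0
```

In the toy data, the original columns are perfectly associated (V = 1) while the synthetic columns are independent (V = 0), so the deviation reaches its maximum of 1.0. A synthetic dataset that preserved the sub-category to category dependency would score close to 0.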
