Publication
Creating Customers That Never Existed - Synthesis of E-commerce Data Using CTGAN
Melle Mendikowski; Mattis Hartwig
In: ibai-publishing (Ed.). Machine Learning and Data Mining in Pattern Recognition. International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM-2022), 18th International Conference on Machine Learning and Data Mining, July 16-21, New York, New York, USA, Pages 91-105, Vol. 284, ISBN 978-3-942952-93-4, Springer, Heidelberg, 7/2022.
Abstract
Many e-commerce use cases that companies implement in
applications rely on customers' personal data. Privacy and data protection play an important role in any discussion of the use of personal
customer data, which results in conflicting demands between data collection
and data protection. Researchers have found a promising solution to
this problem: the generation of synthetic data which is not connected to
real people. In this paper, we use the deep learning architecture Conditional Tabular Generative Adversarial Network (CTGAN) to synthesize e-commerce data. In particular, the categorical relationships between columns
of e-commerce data include fixed dependencies: for example, an entry in
the sub-category column also determines the entry in the category column.
These characteristics make it necessary to evaluate the
suitability of the CTGAN architecture for synthesizing e-commerce data,
which is the focus of this paper. We present a new similarity measure
for synthetic and original datasets that focuses on categorical correlations: the Cramér's V deviation (CV-deviation). In our experiments, we
create synthetic e-commerce data from a publicly available dataset using CTGAN. We use an existing measure and our newly developed CV-deviation
measure in hyperparameter selection and compare the outcomes. By incorporating CV-deviation into the performance metric, we increase the ability of CTGAN to preserve correct categorical relations by
63%. Despite these enhancements, the evaluation of the synthetic datasets
shows that there is still room for improvement in the overall architecture,
because it seems difficult for the CTGAN model to efficiently learn all
categorical constraints automatically.
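
The idea behind a Cramér's V based comparison can be illustrated with a short sketch. The abstract does not give the exact formula for CV-deviation, so the aggregation used here (mean absolute difference of pairwise Cramér's V values between the original and the synthetic dataset) is an assumption, and the function names `cramers_v` and `cv_deviation` as well as the toy data are hypothetical. Cramér's V itself is computed in its standard form without bias correction.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cramers_v(xs, ys):
    """Standard Cramér's V (no bias correction) for two categorical sequences."""
    n = len(xs)
    row = Counter(xs)             # marginal counts of the first column
    col = Counter(ys)             # marginal counts of the second column
    joint = Counter(zip(xs, ys))  # contingency table as a sparse counter
    k = min(len(row), len(col))
    if n == 0 or k < 2:
        # V is undefined for a constant column; returning 0.0 is a choice
        # made for this sketch.
        return 0.0
    chi2 = 0.0
    for x, rx in row.items():
        for y, cy in col.items():
            expected = rx * cy / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


def cv_deviation(real, synth, columns):
    """ASSUMED aggregation: mean absolute difference of pairwise Cramér's V.

    `real` and `synth` map column names to lists of categorical values.
    """
    pairs = list(combinations(columns, 2))
    if not pairs:
        return 0.0
    diffs = [
        abs(cramers_v(real[a], real[b]) - cramers_v(synth[a], synth[b]))
        for a, b in pairs
    ]
    return sum(diffs) / len(diffs)


# Toy data: in `real`, each sub-category belongs to exactly one category.
real = {
    "category": ["toys", "toys", "books", "books"],
    "sub_category": ["lego", "puzzle", "novel", "comic"],
}
# In `synth`, the dependency is broken: each sub-category occurs with both categories.
synth = {
    "category": ["toys", "books", "toys", "books"],
    "sub_category": ["lego", "lego", "novel", "novel"],
}
deviation = cv_deviation(real, synth, ["category", "sub_category"])
```

In this toy example the original data has a Cramér's V of 1 for the column pair and the synthetic data a value of 0, so the deviation is 1.0. Note that Cramér's V measures only the strength of association: a synthetic dataset that consistently maps a sub-category to the wrong category would still score well, so a direct check of the fixed dependencies remains useful alongside such a measure.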