TU Berlin Master's graduate Alexander Kumaigorodski and his co-authors present a new approach that speeds up the loading and processing of tabular CSV data by orders of magnitude. The authors belong to Prof. Dr. Volker Markl's Intelligent Analytics for Mass Data (IAM) research area at the German Research Centre for Artificial Intelligence (DFKI) and to the Department of Database Systems and Information Management (DIMA) at TU Berlin.
CSV is one of the most widely used formats for exchanging structured data. For example, the City of Berlin publishes its structured datasets in CSV format on the Berlin Open Data Portal. Such datasets can be imported into databases for analysis. Accelerating this import lets users handle growing data volumes and reduces the time needed for analysis. Each new generation of computer networks and storage media offers higher bandwidth and faster read times. However, current loading and processing approaches that rely on the main processor (CPU) cannot keep up with this hardware and become an unnecessary bottleneck.
The approach described in the paper instead reads and processes CSV data on graphics processors (GPUs). The main advantages of GPUs are their massively parallel computing power and fast memory access. This allows new hardware technologies, such as NVLink 2.0 or InfiniBand with Remote Direct Memory Access (RDMA), to be fully exploited. As a result, CSV data can be read directly from main memory or from the network and processed at multiple gigabytes per second.
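To illustrate why CSV loading lends itself to parallel hardware, the following is a minimal sketch in Python, not the authors' implementation. It assumes a simplified CSV dialect in which newlines never occur inside quoted fields, and the function names and the `chunk_size` parameter are hypothetical. The input is split into fixed-size chunks, row delimiters are counted per chunk, and a prefix sum over those counts tells each chunk where to write its row offsets without coordinating with the others. The loops here run sequentially; on a GPU, each chunk could be handled by its own thread.

```python
from itertools import accumulate


def count_newlines_per_chunk(data: bytes, chunk_size: int) -> list[int]:
    """Count row delimiters in each fixed-size chunk (chunks are independent)."""
    return [
        data[i:i + chunk_size].count(b"\n")
        for i in range(0, len(data), chunk_size)
    ]


def row_start_offsets(data: bytes, chunk_size: int) -> list[int]:
    """Return the byte offset at which each row starts.

    Assumption: newlines never appear inside quoted fields.
    """
    counts = count_newlines_per_chunk(data, chunk_size)
    # Exclusive prefix sum: number of newlines before each chunk.
    newlines_before = [0] + list(accumulate(counts))[:-1]
    offsets = [0] * (sum(counts) + 1)  # row 0 starts at byte 0
    for chunk_index, base in enumerate(newlines_before):
        start = chunk_index * chunk_size
        slot = base + 1  # first output slot owned by this chunk
        for j, byte in enumerate(data[start:start + chunk_size]):
            if byte == 0x0A:  # "\n"
                offsets[slot] = start + j + 1
                slot += 1
    # Drop offsets at or past the end of the input (e.g. after a
    # trailing newline), since no row starts there.
    return [offset for offset in offsets if offset < len(data)]
```

Because every chunk writes only to the output slots assigned to it by the prefix sum, the result does not depend on the chunk size or on the order in which chunks are processed.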
The transparency of the tests and the independent confirmation of the results also earned the paper the first-ever Reproducibility Badge, awarded at BTW 2021. In the data science community, the reproducibility of research results is becoming increasingly important. It makes it possible to verify results and to compare them with existing work, and is therefore an important aspect of scientific quality assurance. Leading international conferences have accordingly begun to devote special attention to this topic.
To ensure high reproducibility, the authors provided the reproducibility committee with source code, additional test data, and instructions for running the benchmarks. The tests were demonstrated in a live session and then successfully replicated by a member of the committee. Above all, the Reproducibility Badge recognizes the authors' good scientific practice.