
Publication

Emma in Action: Declarative Dataflows for Scalable Data Analysis (Demo Track)

Alexander Alexandrov; Andreas Salzmann; Georgi Krastev; Asterios Katsifodimos; Volker Markl
In: Proceedings of the 2016 International Conference on Management of Data. ACM SIGMOD International Conference on Management of Data (SIGMOD-16), June 26 - July 1, San Francisco, CA, USA, Pages 2073-2076, ISBN 978-1-4503-3531-7, ACM, New York, NY, 2016.

Abstract

Parallel dataflow APIs based on second-order functions were originally seen as a flexible alternative to SQL. Over time, however, their complexity increased due to the number of physical aspects that had to be exposed by the underlying engines in order to facilitate efficient execution. To retain a sufficient level of abstraction and lower the barrier of entry for data scientists, projects like Spark and Flink currently offer domain-specific APIs on top of their parallel collection abstractions. This demonstration highlights the benefits of an alternative design based on deep language embedding. We showcase Emma – a programming language embedded in Scala. Emma promotes parallel collection processing through native constructs like Scala’s for-comprehensions – a declarative syntax akin to SQL. In addition, Emma also advocates quoting the entire data analysis algorithm rather than its individual dataflow expressions. This allows for decomposing the quoted code into (sequential) control flow and (parallel) dataflow fragments, optimizing the dataflows in context, and transparently offloading them to an engine like Spark or Flink. The proposed design promises increased programmer productivity due to avoiding an impedance mismatch, thereby reducing the lag times and cost of data analysis.
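
To illustrate the abstract's remark that Scala's for-comprehensions are a declarative syntax akin to SQL, the following minimal sketch contrasts a comprehension with the equivalent chain of second-order functions. It uses only plain Scala standard-library collections and does not show Emma's actual API; the `Customer` and `Order` types and the sample data are hypothetical and chosen purely for illustration.

```scala
// Illustrative sketch on plain Scala collections; this is NOT Emma's API.
// Customer, Order and the sample data are hypothetical.
final case class Customer(id: Int, name: String)
final case class Order(customerId: Int, amount: Double)

object ComprehensionSketch {
  val customers: Seq[Customer] = Seq(Customer(1, "Ada"), Customer(2, "Bob"))
  val orders: Seq[Order] = Seq(Order(1, 120.0), Order(1, 30.0), Order(2, 75.0))

  // Declarative form, roughly:
  //   SELECT c.name, o.amount FROM customers c, orders o
  //   WHERE c.id = o.customerId AND o.amount > 50
  val declarative: Seq[(String, Double)] =
    for {
      c <- customers
      o <- orders
      if c.id == o.customerId && o.amount > 50.0
    } yield (c.name, o.amount)

  // The same query written directly with second-order functions,
  // which is what the comprehension above desugars to.
  val desugared: Seq[(String, Double)] =
    customers.flatMap { c =>
      orders
        .withFilter(o => c.id == o.customerId && o.amount > 50.0)
        .map(o => (c.name, o.amount))
    }
}
```

Both values hold the same pairs. The comprehension states what is being joined and filtered, while the second form fixes a particular nesting of `flatMap`, `withFilter`, and `map` calls.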

Projects