Workshop on Serverless Data Analytics

Serverless computing has become popular with cloud providers because of its ease of use by application developers, its lightweight runtime, ease of management, resource elasticity, and fine-grained billing. Adapting data analytics applications to serverless platforms poses new challenges. To maximize the benefits of serverless, data analytics systems must be well designed, implemented, maintained, and optimized to efficiently process massive amounts of intermediate data, achieve the best possible job performance within a limited budget, and minimize cost without violating QoS objectives. Recent work on serverless analytics has demonstrated the benefits of serverless architectures for resource- and cost-efficient data analytics.

This workshop focuses on these problems, from both the application development side and the serverless system infrastructure side. It aims to bring together researchers and practitioners from the data analytics and serverless computing communities to address the emerging need for serverless data analytics systems and applications. We welcome new ideas and critical research on Serverless Data Analytics as well as reports on best practices.

Topics

The topics covered in the workshop include, but are not limited to:

Serverless architecture for data analytics
Task scheduling for serverless analytics
Intermediate storage systems for serverless analytics
Fine-grained resource allocation for serverless analytics
High performance runtime for serverless analytics
Query optimization in a serverless environment
Data caching for serverless analytics

Workshop Program

Room: Parksville

8:30 - 10:30	Breakfast (served)
10:30 - 11:30	Keynote: Rethinking Serverless Computing: from the Programming Model to the Platform Design, Gustavo Alonso, ETH Zürich
11:30 - 12:00	BabelMR: A Polyglot Framework for Serverless MapReduce, Thomas Bodner (Hasso Plattner Institute, University of Potsdam); Fabian Mahling (Hasso Plattner Institute, University of Potsdam); Paul Rößler (Hasso Plattner Institute, University of Potsdam); Tilmann Rabl (HPI, University of Potsdam)
12:00 - 13:00	Lunch (served)
13:00 - 13:30	Ephemeral Per-query Engines for Serverless Analytics, Michal Wawrzoniak (ETH Zürich); Rodrigo Bruno (Instituto Superior Técnico / INESC-ID Lisboa); Ana Klimovic (ETH Zürich); Gustavo Alonso (ETH Zürich)
13:30 - 14:00	Hyperspecialized compilation for serverless functions, Leonhard Spiegelberg (Brown University); Tim Kraska (MIT); Malte Schwarzkopf (Brown University)

Workshop Proceedings

Published as part of Joint Proceedings of Workshops at the 49th International Conference on Very Large Data Bases

Program Details

Keynote:

Rethinking Serverless Computing: from the Programming Model to the Platform Design, Gustavo Alonso, Ana Klimovic, Tom Kuchler, Michal Wawrzoniak (Slides)

Abstract: Serverless computing offers a number of advantages over conventional, Virtual Machine (VM) based deployments on the cloud, e.g., greater elasticity, simplicity of use and management, finer granularity billing, and rapid deployment and start up times. Naturally, there is a growing interest in exploring how to run applications in this new environment and data analytics is not an exception. Unfortunately, current serverless platforms are limited along several dimensions, which makes things quite difficult from the perspective of data analytics. In this paper we explore what serverless has to offer today, what is missing, and what can be done to make serverless a better computing platform in general and for data analytics in particular.

Speaker: Gustavo Alonso is a professor in the Department of Computer Science of ETH Zürich, where he is a member of the Systems Group. He graduated from the Technical University of Madrid, Spain, and did his MSc and PhD at the University of California at Santa Barbara. He was a research scientist at the IBM Almaden Research Center in San Jose, California, before joining ETH. His research interests include data management, distributed systems, cloud computing architecture, and hardware acceleration through reconfigurable computing. Gustavo has received 4 Test-of-Time Awards for his research in databases, software runtimes, middleware, and mobile computing. He is an ACM Fellow, an IEEE Fellow, a Distinguished Alumnus of the Department of Computer Science of UC Santa Barbara, and has received the Lifetime Achievement Award from the European Chapter of ACM SIGOPS (EuroSys).

Papers

BabelMR: A Polyglot Framework for Serverless MapReduce, Fabian Mahling (Hasso Plattner Institute, University of Potsdam); Paul Rößler (Hasso Plattner Institute, University of Potsdam); Thomas Bodner (Hasso Plattner Institute, University of Potsdam); Tilmann Rabl (HPI, University of Potsdam) (Slides)

Abstract: The MapReduce programming model and its open-source implementation Hadoop have democratized large-scale data processing by providing ease-of-use and scalability. Subsequently, systems such as Spark have dramatically improved efficiency. However, for a large number of users and applications, using these frameworks remains challenging, because they typically restrict them to specific programming languages or require cluster management expertise.

In this paper, we present BabelMR, a data processing framework that provides the MapReduce programming model to arbitrary containerized applications to be executed on serverless cloud infrastructure. Users provide application logic in Map and Reduce functions that read and write their inputs and outputs to the ephemeral filesystem of a serverless function container. BabelMR orchestrates the data-parallel programs across stages of concurrent cloud function executions and efficiently integrates with serverless storage systems and columnar storage formats. Our evaluation shows that BabelMR reduces the entry hurdle to analyzing data in a distributed serverless environment in terms of development effort. BabelMR’s I/O and data shuffle building blocks outperform handwritten Python and C# code, and BabelMR is competitive with state-of-the-art serverless MapReduce systems.

Ephemeral Per-query Engines for Serverless Analytics, Michal Wawrzoniak (ETH Zürich); Rodrigo Bruno (Instituto Superior Técnico / INESC-ID Lisboa); Ana Klimovic (ETH Zürich); Gustavo Alonso (ETH Zürich)

Abstract: We challenge the common assumption that queries are submitted to a pre-configured, already running engine and put forward the idea of dynamically instantiating a chosen data processing engine upon query submission by leveraging Function-as-a-Service (FaaS) platforms. We demonstrate the idea by running unmodified data processing engines (we use Apache Drill as an initial example) on real-world serverless FaaS platforms and show that such engines can be instantiated on demand when a query arrives. We aim to eventually support a wide range of queries and workloads. Wide access to such functionality would be a game changer in data processing. First, it would enable pay-per-query models supporting sporadic, interactive data analysis on arbitrary engines. Second, it would significantly increase the flexibility for data processing by enabling the possibility of dynamically choosing the actual engine, its configuration, and the resource allocation on a per-query basis. Logically, this amounts to dynamically attaching a query engine to the query rather than sending the query to a pre-configured and already deployed engine. In this paper we elaborate on this vision, outline the design of the MetaQ prototype that we are building to explore the idea, demonstrate that it is realistic through initial experiments, and discuss its many exciting practical implications.

Hyperspecialized compilation for serverless functions, Leonhard Spiegelberg (Brown University); Tim Kraska (MIT); Malte Schwarzkopf (Brown University)

Abstract: Serverless functions can be spun up in milliseconds and scaled out quickly, forming an ideal platform for quick, interactive parallel queries over large data sets. Modern databases use code generation to produce efficient physical plans, but compiling such plans on each serverless function is costly: every millisecond spent executing on serverless functions multiplies many times in cost. Existing serverless data science frameworks therefore generate and compile code on the client, which precludes specializing this code to patterns that may exist in the input data of individual serverless functions. This paper argues for exploring a trade-off space between one-off code generation on the client, and hyperspecialized compilation that generates bespoke code on each serverless function. Our preliminary experiments show that hyperspecialization outperforms client-based compilation on typical heterogeneous datasets in both cost and performance by 2–4×.

Paper Submission & Evaluation

Paper submission is to be done through CMT at the following site: https://cmt3.research.microsoft.com/SDA2023

All papers will be peer reviewed by the Program Committee. Submitted papers must not have been previously published or be concurrently under consideration elsewhere. Work-in-progress papers that are shorter in length are acceptable and encouraged. The workshop will be in-person and one author of each paper is required to register.

Regular papers are 12 pages (excluding references); shorter, work-in-progress papers are limited to 6 pages (excluding references). PVLDB formatting guidelines and styles apply. Please consult http://vldb.org/pvldb/volumes/17/formatting for templates.

Important Dates

Papers due: ~~22 June 2023, midnight EST~~ 30 June 2023, midnight EST
Author notification: ~~15 July 2023~~ 30 July 2023
Camera-ready: ~~22 July 2023~~ 9 August 2023, midnight EST

Proceedings

A joint proceedings of most VLDB 2023 workshops will be published as part of CEUR-WS, and SDA papers will be included in it.

Organizers

M. Tamer Özsu, University of Waterloo (tamer.ozsu@uwaterloo.ca)
Xun Xue, Huawei Technologies Canada (xun.xue@huawei.com)

Program Committee Members

Ashraf Aboulnaga, Qatar Computing Research Institute
Gustavo Alonso, ETH Zürich
Samer Al-Kiswany, University of Waterloo
Khuzaima Daudjee, University of Waterloo
Niv Dayan, University of Toronto
Schahram Dustdar, TU Wien
Robin Grosman, Huawei Technologies Canada
Alexandru Iosup, Vrije Universiteit Amsterdam
Guoliang Li, Tsinghua University
Samuel Madden, MIT
Mohammad Shahrad, University of British Columbia
Jianguo Wang, Purdue University