- Apache Beam is a unified programming model for creating batch and stream data processing pipelines.
- The project just had its first release and is now working towards the second release, 0.2.0-incubating.
- Interviewees are a mix of committers and users of Beam.
- Interviewees describe Beam as an abstraction for stream processing and batch processing, a unified programming model, and an API that allows writing parallel data processing pipelines.
- Interviewees share their favorite features of Beam, including its simplicity, runner agnosticism, and windowing capabilities.
Apache Beam just had its first release. Now that we’re working towards the second release, 0.2.0-incubating, I’m catching up with the committers and users to ask some of the common questions about Beam. Each committer and user is sharing their own opinion and not necessarily that of their company.
Our interviewees are:
Neville Li (NL)
Aparup Banerjee (AB)
Viswadeep Veguru (VV)
Emanuele Cesena (EC)
Sumit Chawla (SC)
Dan Halperin (DH)
Sergio FernÃ¡ndez (SF)
Jean-Baptiste OnofrÃ© (JB)
Lukasz Cwik (LC)
Ganesh Raju (GR)
Siva Viswabrahmana (SV)
Amit Sela (AS)
Kenneth Knowles (KK)
IsmaÃ«l MejÃa (IM)
Aljoscha Krettek (AK)
Frances Perry (FP)
Eric Anderson (EA)
Manu Zhang (MZ)
Tyler Akidau (TA)
Andrew Psaltis (AP)
How do you describe Beam to others?
EC: Apache Beam is a unified programming model to create Batch and Stream data processing pipelines.
SC: An abstraction for stream processing and batch processing
DH: Write your pipeline once. Run it efficiently, for any type of data, anywhere.
JB: Apache Beam is an Apache project. It’s a full featured unified programming model, where your data processing jobs run on any execution runtimes (agnostic).
SF: Apache Beam provides a unified model that makes very flexible to run your (batch or stream) pipeline in different frameworks
SV: It’s a programming model to handle Batch and Streaming data more efficiently with latest data patterns.
AS: Apache Beam let’s you write your data pipelines focusing on your business needs, while providing you a robust programming model (especially for stream processing), and taking care of the rest for you.
KK: A programming model for portable big data computation that unifies bounded & unbounded/infinite data processing while giving users control over the balance between completeness (how much data you think has arrived), latency (how long you wait for results), and cost (for example, how much speculative/redundant computation you do).
IM: Apache Beam is a programming model for data processing pipelines (Batch/Streaming). Beam pipelines are runtime agnostic, they can be executed in different distributed processing back-ends.
AK: Apache Beam is an API that allows to write parallel data processing pipeline that that can be executed on different execution engines.
FP: A programming model for representing data processing pipelines that cleanly separates data shape (bounded, unbounded) from runtime characteristics (batch, streaming, etc). Can be constructed via multiple language-specific SDKs and run on multiple distributed processing backends.
EA: Portable data processing job description framework that supports popular execution environments.
MZ: A unified programming model of batch and streaming that allows users to write data processing applications without knowledge of any specific engines.
TA: Streaming and batch done right ;-), with portability across multiple runtimes and via multiple languages to boot.
AP: A single streaming API and common way of talking and thinking about streaming systems. Truly the only way forward as a community that will enable tool developers to build the missing pieces in the streaming ecosystem that helps bring streaming to the masses.
What’s your favorite feature of Beam?
NL: simplicity, a single abstract data type PCollection that encapsulates both batch and streaming datasets and apply/transform that abstract most pipeline operations in both modes.
EC: The same exact code runs in stream or batch, by simply changing the input (and output). This is great, for example for reprocessing data from a batch source when a stream pipeline has failed.
SC: Abstraction layer for multiple runners, windowing capabilities to solve most of the Streaming use cases. And above all a stack built on proven technology by Google
JB: Runner agnostic: as pipeline designer, I don’t mind on which execution engine my pipelines will actually run (Spark, Flink,).
GR: Agree with JB
LC: Unification of batch and streaming pipelines and support for multiple runners for those of us who are tired of gluing together the same business logic to different execution environments.
SV: Beam allows runtime agnostic implementation on top of them.
AS: Decoupling – your (application) code is decoupled from the nature of the source (bounded/unbounded), from the execution runtime, and (at work) doing that in multiple languages.
KK: The carefully crafted unified model separates what you want to compute from the tuning of its physical execution.
IM: The unified and advanced programming model and the execution engine independence.
AK: The windowing/trigger model and how it interacts with side inputs. Right now, this is very new for most people but I think everyone will soon realize how important it is for real-world analytics jobs.
FP: To avoid duplicating already great answers, Composite PTransforms make the graphs / user experience much more tractable and library extensibility much cleaner.
EA: As a data engineer, I can be productive with any team/environment and not have to learn something new.
MZ: The simplicity of modeling data processing problems into what, where, when and how.
TA: Retractions! 😉 Oh wait, not implemented yet; damn. Everything else in the unified model then. I like how it all fits together nicely and allows for clean and clear pipelines that are easy to maintain as they evolve.
AP: The simplicity of the model and consistent way thinking about solving your problem and not worrying about the underpinnings of an execution engine.
Would you run Beam in production? If yes, are there any caveats like using a specific runner? If no, when do you think it will be ready?
NL: We (Spotify) use Dataflow/Beam in production with Dataflow runner both in batch and streaming mode. We have a better understanding of batch and feel it’s more mature than streaming right now. There are some questions about complex use cases in streaming, e.g. joins, side input, reloading pipelines but hopefully it’ll be clearer soon.
AB: We (Cisco) are planning to use Beam in production with Dataflow runner -mostly in streaming mode to start with.
EC: We (shopkick) will run Beam in production. Despite it is a relatively early stage Apache project, Beam is really mature as it originates from Google Dataflow. We’re planning to use it with Spark runner, which currently has some limitations, e.g. no sessionization, but already supports windowing on event time.
JB: We (Talend) plan to generate Beam production ready code in our studio. Even if the first target runner is Spark Runner, the code will run on any runners. One of the main area where we are working on (and especially me) is the Beam IO (to provide extended and complete connectivity in Beam).
LC: Yes, since Beam is from the Google Cloud Dataflow codebase, it is quite mature.
SV: I am still in inception.
IM: Beam allows us (Talend) to focus on an independent programming model that we can reuse and evolve, e.g. to support new DSLs, new model semantics, or new distributed execution engines.
AK: On the managed Cloud Dataflow I would use it but the runners are not yet 100% there. The Flink runner will likely have support for almost all features by the next release.
FP: The SDK (having come from Cloud Dataflow) is quite mature. Looking forward to more integration tests and benchmarks to help categorize runner maturity. Google Cloud Platform users should stick to the Cloud Dataflow SDK in the short term. Really excited to officially run Beam on GCP in the near future 😉
MZ: We (Intel) plan to involve some Chinese Internet companies in running Beam in production.
TA: Absolutely. As we continue to improve cross-runner integration test, benchmark, and capability matrix coverage, use those to help guide you in finding a suitable runner for your situation.
AP: I would for sure, the capability matrix helps to identify what is possible with the current runners.
If you’re non-Google, why are you working on Beam? If you are from Google, why were you drawn to working on Beam?
NL: non-Google. We (Spotify) are heavy users of Google Cloud Platform and Dataflow/Beam. We discover bugs, production issues and contribute back to the code base. We also built Scio, a Scala DSL for Dataflow/Beam.
AB, VV, SC: non-Google . We (Cisco) have started using Beam in multiple of our projects. We are also building a small domain specific SDK on top of Beam.
SC: (non-Google). Backing from Google with a proven technology stack.
DH: (Google) Before coming to Google, I spent three years at the University of Washington eScience Institute working with scientists – oceanographers, astronomers, social scientists, etc. We found lots of common problems in big-data-driven science, and a huge barrier to entry for researchers to do their work. After spending 3 years trying to simplify computing for science, promoting the use of commercial cloud computing at universities, and building better self-serve, no-knobs service for data processing, I decided to join Google and work on the Dataflow-now-Beam project to build something generally useful for real. I still analyze scientific data sets and problems using Dataflow/Beam; you can see relevant code on my github profile.
SF: (non-Google, Redlink GmbH) I just started to use Beam, but I’m always open to contribute wherever I see there is a gap to cover (e.g., PR #537)
JB: (non-Google) Talend BigData provides a studio where you can design your data processing jobs (both batch and streaming). We support all major distributions (CDH, HDP,) and different execution runtime (Spark, M/R,). However, supporting new runtimes (Flink), and versions updates are really painful (lot of code generation change to do). More other, we planned to provide an ingestion client based on this, as the current solutions available are not super good. As I’m also working on Apache Camel, I saw similar features in Beam and Camel. So Beam is the perfect layer to build our new base.
LC: (Google) Before coming to Google, I spent several years at Amazon working on pipelines to enable fast, flexible, error free, and efficient order processing. Working on Beam at Google allows me to look at those same problems in a different light and several new problems.
SV: (Non-Google – Independent) Started learning Big Data frameworks and found Beam interesting.
AS: (non-Google) I started working on stream-processing a little bit over a year ago, and while doing a survey on open-source stream processing frameworks I noticed that none of them provides a robust model for those “pesky” little details of stream processing such as out-of-order processing or event-time processing. After finding Cloudera Labs’ spark-dataflow project (and getting to know Google’s Dataflow) I started working on a simple integration of spark streaming, and well, this is still on-going ;–). In Addition, I’ve been working with Apache projects for a long time and I’m happy to get my chance to contribute.
KK: My academic background is in programming languages/models and concurrency. I have always been interested in new models for parallel programming. In the last decade, a lot of what I’ve done has been building large parallel processing systems for things like analytics and content production. The need for a scalable, high level, unified batch/streaming model is very clear. I love the chance to have a huge impact on the problem space with Dataflow+Beam. And all in a great open platform!
IM: (non-Google, Talend)**: I work on Beam because I love distributed systems and I think it is the future of Massive Data Processing, a unified and robust model that can be used independently of the execution environment. I also find that working in Beam is an unique chance to work with really talented engineers.
AK: (Non-Google – from Data Artisans) I like working on APIs, so working on a cross-execution-engine API seemed like the logical next step. Plus, I really like how the Beam API is structured, with PCollection.apply(Operation). A lot of people find it strange but I think it’s very neat.
FP: I was involved in the FlumeJava project internally at Google for many years. I love being able to iterate and improve those concepts and be able to share them more broadly.
EA: I love the community. Beam has serious engineering and good people behind it.
MZ: (Non-Google, Intel) I work on Apache Gearpump, an Intel open-sourced lightweight streaming engine on Akka. Back from the time of Dataflow, I had been reading its paper and codes and considering porting the APIs to Gearpump. Then there came Beam and we thought it would be lower risk for users to try out Gearpump through Beam. Also, given Gearpump is a young project and still lacks some critical features, we could fill those gaps when we support capabilities of Beam. Hence, I volunteered for implementing Gearpump runner when Tyler asked on Twitter.
TA: (Google) I want to bring stream processing to the masses, and I believe open source software is the only way to actually make that happen.
AP: (Non-Google) I’m a streaming geek and view Beam as an essential missing link and want to help drive it in anyway I can.
When will we see Beam’s API stabilize?
DH: Even today, much of the Beam API surface is very stable: the core primitives and transforms are by and large stabilized. We are mainly adding advanced features (e.g., DoFn reuse) and cleanups here and there. Users writing pipelines to run on Beam should not worry that things will look completely different any time soon — work is not wasted.
The biggest churn today is in the APIs that face new SDKs and runners. We’re aiming to produce a first major release of Beam before the end of 2016. By that time, we will have a serious revision of these APIs, and have reached a stable first version of the new Runner and Fn APIs. The other big thing we need for the first release is a cleaner story around Filesystems interaction. JB: I think Beam SDK is already very stable. I think “faced user” (pipeline developer) part of Beam can already be considered as stable. Things are moving or Fn and Runner API (runner developer). We plan a 0.2.0-incubating release for end of July/beginning of August. The purpose of this release is mostly to provide new IOs and runners. The first really stable release is plan for the end of this year (2016), where hopefully Beam will go Apache TLP (Top Level Project).
LC: Probably late this year, since there are backwards incompatible changes we still want to do from what we learned from our users of Google Cloud Dataflow that will help a lot.
AS: Like others have already mentioned, the SDK seems very stable, as it came out of a Google product (Dataflow), so the Beam API is definitely stable for an incubating project.
IM: If you are using the Java SDK we could say it is already stable. Other Beam parts are still moving, but the final users should not be impacted by these changes. The next part to stabilize is the support for new languages (Python / Scala). Both should be ready soon.
AK: The user-facing API is already quite stable and only some edge cases will require changes. This will be all most people care about since the internal changes to runner-facing APIs don’t touch them.
FP: Very excited to fix a few of the rough edges in the Dataflow SDK for Java. But at the same time, API design is a journey! Looking forward to shipping a mature Beam SDK soon.
TA: As far as the user-facing API goes, it’s already pretty stable. Keep in mind that Beam is based off of the production-quality Cloud Dataflow SDK that’s been generally available for almost a year, and which itself was based off of APIs used internally at Google for years. There’s been some thrash as part of the Beam migration, but most of that is over now. The majority of API changes I can think of in the future will be additions of new functionality rather than destabilizing changes to existing features. The APIs used by SDK and runner builders are a bit of another story, as those are currently seeing significant evolution for the better, but those affect a much smaller portion of the community.
What’s a feature that you’d really like to see added to Beam?
AB: A DSL on top of Beam. We(cisco) can work in this together with Beam community. We have some thoughts around it.
VV: Api on top of the DAG.
EC: Time-based/historical views over datasets. Streaming data often needs to be joined with a big dataset, e.g. user properties, to create an enriched, de-normalized data stream. The dataset however changes over time, so it is hard to re-process old streaming data and obtain exactly the same enriched stream that was produced the first time. A nice feature would be to abstract time-based views over the dataset, such that streaming data is joined with the proper view, and re-processing of old data gets easy.
LC: Iterative processing to support ML.
JB: My first target is on Beam IO. Right now, our “out of the box” IOs. People needs more useful IOs to use Beam. On the other hand, new runners and DSLs on top of the SDK are also interesting. Thanks to that, the Beam user community will increase for sure.
SC: Lots of out-of-box metrics specifically for streaming scenarios. For inspiration, on how Storm provides streaming specific metrics (rate of processing, latency etc) for Last 10 Minutes, Last 1 hour, Last 3 hours etc..
AS: I would actually like to see a “drag-and-drop” web interface, at least to some extent, since Beam’s composite approach could probably support this.
KK: I’d like to see iterative convergence-based computations for ML applications as well as retractions (more generally, principled extensions to the notion of accumulation mode).
IM: New DSLs, in particular for Complex-Event Processing (CEP) and Streaming SQL. More IOs, Iterative Processing, A unified monitoring facility for Beam Runners, that also provides dynamic data about the pipeline execution.
AK: On the API side I would like to have data-driven triggers and iterative processing. The latter is quite tricky though and might require looking into things like [Naiad/differential dataflow] (https://www.microsoft.com/en-us/research/publication/differential-dataflow/). On the runner side I would like to see an open-source streaming runner that supports dynamic adaptation of compute resources to the workload.w
EA: More libraries, especially ones similar to FlinkCEP and Spark’s MLlib
MZ: Partial pipeline that is reusable and can be connected to another.
TA: State API! Retractions! Per-Key Workflows! SQL!
AP: SQL, metrics and data linage that can be exposed so that common operational UI’s can be built on top of them. Beam today helps eliminate to eliminate the “compete on API” nature of many Distributed Stream Processing Engines (DSPE) having metrics and lineage exposed as clean common API’s as well would allow the development of consistent UI’s and tooling and eliminate one more feature set that DSPE’s compete on. The more of that “gruff” that can be made common, the more we can focus on the innovation of the engines.
How would you recommend a new person get started with Beam?
EC: There are good tutorials to get started with Docker 🙂
SC: Playing with Google DataFlow examples and DataFlow runtime to get used to the concepts.
LC: Clone the Apache Beam repository and look at the how to guides for Google Cloud Dataflow.
AS: Ask around the mailing list, we have a pretty good timezone coverage 😉
IM: I would recommend first getting familiar with the Beam model and terminology, read both Streaming 101 and 102, try to catch one of the Beam presentations, then try to run the examples and beam-samples that JB mentioned, and of course to play first with the direct runner before moving into the distributed runners.
AK: I would suggest to read the initial Google Dataflow paper to get a good grasp of the ideas behind the model. After this, reading code, documentation and examples.
EA: I’m partial to Python and found the nascent Python SDK really easy to grasp.
TA: The talk Frances and I gave at Kafka Summit, Streaming 101 and 102, and the Dataflow walkthroughs are all good places to start. Also, shameless plug, Jesse and I are going to be giving a tutorial on using Apache Beam at Strata NYC (Sep) and Strata Singapore (Dec) if you want a nice hands-on introduction.
How would you recommend a person start contributing to Beam?
JB: The “low hanging fruit” are Jira tagged with newbie allows people to contribute on the code. However any contribution is welcome: documentation, test, ideas, are key for the project (it’s not only code). We are really looking forward more feedbacks, discussions and ideas (on the mailing lists, slack,).
SV: I recently wrote a draft blog
AS: As others mentioned, “low hanging fruit”. This will help you to know the project better and allow you to do more in the future.
KK: Introduce yourself on the dev list! And then look for JIRA tickets tagged “newbie” or “starter” (we try to just tag them with all the likely tags, so try any/all of these).
IM: Contribution is not only implementing advanced stuff. Beam really needs help from people testing and reporting issues about the Runners / IOs / Pipelines, etc. This for me should be the first step, and then you can go chase some JIRAs, or write your own DSL on top of Beam for example, or if you are really advanced you can write your own Runner.
AK: The “low hanging fruit” issues are good for starters to get their feet wet. If you really want to dig deep, though, a good idea would be to implement a new runner for a as-of-yet unsupported execution framework.
FP: Use it! We have many early rough edges and need help identifying and fixing them.
AP: Start to build something with it, pick up JIRA’s, there is always low hanging fruit.
Note: opinions expressed by the interviewees are not necessarily the opinions of their employers.