Three Top Themes From Strata+Hadoop World

Blog Summary:
  • Real-time Big Data is becoming increasingly popular and companies are finding that it gives them an advantage and agility they didn't have before.
  • Real-time systems like Kafka allow Data Scientists to get from hypothesis to production quicker and run and score using several models at the same time.
  • Second- and third-generation (and later) APIs that support real-time from the beginning, such as Spark Streaming, Apache Flink, and Apache Kafka, are becoming more common.
  • Using intermediary libraries such as Apache Crunch or Apache Beam, instead of programming directly to an API, allows for easier testing and changing of execution engines.
  • Companies are starting to open source or sell their AIs as cloud services, such as Google's TensorFlow and IBM's Watson capabilities.

I spoke at Strata+Hadoop World two weeks ago on Kafka. I came away from the conference with three main themes: real-time Big Data is the (present) future, we should be using intermediary libraries instead of programming directly to an API, and applied AI is the (present) future.


Big Data

Companies are realizing it’s possible to handle Big Data in real-time, also known as streaming. They’re finding that using real-time gives them an advantage and agility that they didn’t have before.

I spoke at Galvanize in San Francisco on how real-time systems are going to change Data Science. The agility that real-time systems like Kafka give us allows Data Scientists to get from hypothesis to production quicker. Instead of being limited to one model, a real-time system allows us to run and score using several models at the same time. Consuming systems can choose, in real-time, which of the models is performing best at any given time.
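As a rough sketch of that idea, the Python below scores every event with several models and lets a consumer ask which one is doing best right now. Everything here is an assumption made for illustration: `model_a`, `model_b`, and `ModelScoreboard` are hypothetical names, the in-memory list stands in for a real stream such as a Kafka topic, and rolling mean absolute error is just one possible way to rank models.

```python
from collections import deque

# Hypothetical stand-ins for trained models: each maps a feature to a prediction.
def model_a(x):
    return 2.0 * x

def model_b(x):
    return 2.0 * x + 5.0

class ModelScoreboard:
    """Tracks a rolling mean absolute error per model over the last `window` events."""

    def __init__(self, models, window=100):
        self.models = models
        self.errors = {name: deque(maxlen=window) for name in models}

    def observe(self, x, actual):
        # Score the same event with every model, as each would if it
        # consumed the same stream independently.
        for name, model in self.models.items():
            self.errors[name].append(abs(model(x) - actual))

    def best(self):
        # Name of the model with the lowest rolling error at this moment.
        # Assumes observe() has been called at least once.
        return min(
            self.errors,
            key=lambda name: sum(self.errors[name]) / len(self.errors[name]),
        )

# An in-memory list stands in for a stream of (feature, actual outcome) events.
events = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
board = ModelScoreboard({"model_a": model_a, "model_b": model_b})
for x, actual in events:
    board.observe(x, actual)
print(board.best())
```

Because the error window is bounded, the answer from `best()` can change as the data drifts, which is what lets a consuming system switch models without a redeploy.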

We’re seeing second- and third-generation (and later) APIs that support real-time from the beginning. These are APIs like Spark Streaming, Apache Flink, and Apache Kafka. Kafka is making real-time processing easier with the new Kafka Streams library. We’ll be seeing a proliferation of companies switching batch use cases to real-time with these technologies.

No More Direct APIs

I was eating lunch and talking to random tablemates. Two of them were talking about how excited they were to go back to work and rewrite their system to use the new frameworks they’d just learned about. The engineer in me thinks “awesome!” The business person in me thinks “I hope they run it by their manager first. That’s a massive time sink and probably a waste of time.”

That conversation should have been much different. Had the engineers gone through my class, they would have used Apache Crunch or another intermediary API. They would have said “I’m going to update a single line of code to use a different execution engine.” The two statements are vastly different. By only changing one line of code, they could test out MapReduce or Spark.

So far, I’ve held the minority opinion on not programming directly to an API like Spark or MapReduce. Don’t get me wrong, you should know how Spark and MapReduce work. I’m saying you shouldn’t be programming directly to their APIs. Instead, you should be using one of the new intermediary APIs. As companies have just completed their rewrites from Hadoop MapReduce to Spark, they’re starting to understand this need. They shouldn’t have had to rewrite. It should have been a change of execution engines.
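To make the "change one line" claim concrete, here is a minimal Python sketch of the pattern. It is not Crunch's or Beam's actual API: `Engine`, `InMemoryEngine`, `ThreadedEngine`, and `Pipeline` are hypothetical names, and the two toy engines only stand in for real ones like MapReduce and Spark. The point is that the business logic is written once against the intermediary layer and never mentions an engine.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor

class Engine(ABC):
    """Hypothetical execution-engine interface (stand-in for MapReduce, Spark, ...)."""

    @abstractmethod
    def map(self, fn, records):
        ...

class InMemoryEngine(Engine):
    def map(self, fn, records):
        # Runs sequentially in the calling thread; handy for tests.
        return [fn(r) for r in records]

class ThreadedEngine(Engine):
    def map(self, fn, records):
        # Runs on a thread pool; results come back in input order.
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(fn, records))

class Pipeline:
    """Business logic is written against this class, never against an engine."""

    def __init__(self, engine):
        self.engine = engine
        self.steps = []

    def apply(self, fn):
        self.steps.append(fn)
        return self

    def run(self, records):
        for fn in self.steps:
            records = self.engine.map(fn, records)
        return records

def build_word_lengths(engine):
    # The same pipeline definition works for any engine.
    return Pipeline(engine).apply(str.strip).apply(str.lower).apply(len)

engine = InMemoryEngine()  # the only line that changes: swap in ThreadedEngine()
result = build_word_lengths(engine).run([" Kafka ", "Spark", " Beam"])
print(result)
```

Testing also gets easier under this design: the pipeline can be exercised with the in-memory engine in a unit test and then run unchanged on the heavier engine.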

I talked to other companies whose code needs to run on both MapReduce and Spark. How do they handle this? Create two codebases or artifacts? No, they’re (now) using intermediary APIs. I’ve talked about Crunch, but I want to talk about the new intermediary APIs from Strata+Hadoop World.

Apache Beam looks very promising. It was originally called Dataflow and comes from Google. You can see some examples of their API here. One big feature is that it supports both batch and real-time from the same API. Other commercial companies are creating APIs that allow their customers to run on any framework. Arimo is one such company with its API. There are some notable downsides to intermediary APIs.

For Spark, you’re going to be missing Spark SQL. IMHO, that’s one of the biggest draws of Spark, and Beam, Crunch, and the like won’t have it. Personally, I used SQL most often when joining datasets. Both Crunch and Beam have built-in join functions that make this vastly easier. You might miss Spark SQL less than you thought.
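For a sense of what a built-in join replaces, here is a small Python sketch of an inner join over two keyed datasets, the kind of work that would otherwise be a SQL JOIN. This is not the Crunch or Beam join API: `inner_join` and the `users` and `orders` sample data are hypothetical, and a real library would distribute the work across the cluster rather than hold one side in memory.

```python
from collections import defaultdict

def inner_join(left, right):
    """Join two lists of (key, value) pairs on key.

    Returns a list of (key, (left_value, right_value)) for every matching pair.
    """
    # Index the right side by key so each left record is matched in one lookup.
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)

    joined = []
    for key, left_value in left:
        for right_value in right_by_key.get(key, []):
            joined.append((key, (left_value, right_value)))
    return joined

# Hypothetical sample data keyed by user id.
users = [(1, "ana"), (2, "bo"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
print(inner_join(users, orders))
```

User 2 has no orders, so it drops out of the result, just as it would with an inner join in SQL.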

Applied AI

Companies are starting to open source or sell their AIs (artificial intelligence) as cloud services. Google has open sourced TensorFlow. IBM has various Watson capabilities in cloud services.

We’re starting to move from every machine learning (ML) model being very specific to one company toward more general usages. This will allow companies to use ML without a large investment in developing their own. It will also allow developers to use AI in different or more imaginative ways. We’ll start to see some really interesting and valuable products using ML for end users.
