Kafka 0.10 Changes for Developers

Kafka 0.10 is out. Here are the changes that developers need to know about. Here is the new URL to the Kafka 0.10 JavaDoc.

KafkaConsumer

The KafkaConsumer had a minor change that allows you to specify a maximum number of messages to return from a single poll. You set this by giving the max.poll.records property a number in the KafkaConsumer's configuration. This change keeps your consumer from retrieving too many messages at once.
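
As a minimal sketch of that setting, the class below builds consumer properties with plain java.util.Properties. The broker address, group id, and the limit of 100 are placeholder values I picked for illustration; only max.poll.records comes from the change described above.

```java
import java.util.Properties;

public class ConsumerConfigSketch {

    // Builds consumer settings. The broker address and group id are
    // placeholder values chosen for illustration only.
    static Properties buildConsumerProps(int maxPollRecords) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "example-group");
        props.setProperty("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        // Cap how many records a single poll() call may return.
        props.setProperty("max.poll.records", Integer.toString(maxPollRecords));
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConsumerProps(100);
        // With kafka-clients on the classpath, these properties would be
        // passed to the consumer: new KafkaConsumer<String, String>(props)
        System.out.println(props.getProperty("max.poll.records"));
    }
}
```

A smaller limit means each poll returns less work, at the cost of more polls for the same number of messages.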

Kafka Streams

The major, and I mean major, change for this version is the addition of Kafka Streams. The KIP (Kafka Improvement Proposal) gives some of the design and motivation behind creating Kafka Streams. Confluent’s docs also give some information about the design and concepts behind Kafka Streams. Jay Kreps wrote a post introducing Kafka Streams.

There are two divisions to Kafka Streams: a “DSL” and a low-level API. Both share many of the same features. I put quotes around DSL because I don’t think it really adheres to the definition of a DSL. Think of them more as two APIs. One is higher level and the other exposes the lower-level guts.

You can find some code in Confluent’s GitHub showing how to use the DSL. There are some guides in Confluent’s docs. These show you can do real-time processing with Java 8 lambdas.

You can find a full example of the low level API in my GitHub. This example takes Spotify-esque data and processes it with Kafka Streams. You can watch a YouTube video I created showing the code at work.

If you’re used to the functions that real-time processing systems like Apache Spark, Apache Flink, or Apache Beam expose, you’ll be right at home in the DSL. If you’re not, you’ll need to spend some time understanding what methods like map, flatMap, or mapValues mean.
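
If those method names are new to you, here is a rough analogy using plain Java 8 streams over a list of key-value pairs. This is not the Kafka Streams API; the class name, the helper methods, and the sample data are all made up for illustration. The idea carries over, though: a mapValues-style step changes only the value, a flatMap-style step turns one record into zero or more records, and a map-style step may change both key and value.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map.Entry;
import java.util.stream.Collectors;

public class TransformSketch {

    // Small helper to build a key-value record.
    static Entry<String, String> entry(String key, String value) {
        return new SimpleEntry<>(key, value);
    }

    // mapValues-style: one output per input, key untouched, value changed.
    static List<Entry<String, String>> lowerCaseValues(List<Entry<String, String>> in) {
        return in.stream()
            .map(r -> entry(r.getKey(), r.getValue().toLowerCase()))
            .collect(Collectors.toList());
    }

    // flatMap-style: each input line becomes zero or more word records.
    static List<Entry<String, String>> splitIntoWords(List<Entry<String, String>> in) {
        return in.stream()
            .flatMap(r -> Arrays.stream(r.getValue().split("\\W+"))
                .map(word -> entry(r.getKey(), word)))
            .collect(Collectors.toList());
    }

    // map-style: one output per input, and the key changes too.
    static List<Entry<String, String>> rekeyByWord(List<Entry<String, String>> in) {
        return in.stream()
            .map(r -> entry(r.getValue(), r.getValue()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Entry<String, String>> lines = Arrays.asList(
            entry("k1", "Hello Kafka"),
            entry("k2", "hello"));

        List<Entry<String, String>> words =
            rekeyByWord(splitIntoWords(lowerCaseValues(lines)));

        // prints [hello=hello, kafka=kafka, hello=hello]
        System.out.println(words);
    }
}
```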

Here is a paraphrased Word Count example that uses the DSL:

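// Serdes (serializer/deserializer pairs) for the String and Long types
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

// Start building a DAG
KStreamBuilder builder = new KStreamBuilder();

// Use TextLinesTopic as the input Kafka topic
KStream<String, String> textLines = builder.stream(
    stringSerde, stringSerde, "TextLinesTopic");

KStream<String, Long> wordCounts = textLines
    // Split each line into lower-case words
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
    // Use the word itself as the key so that counting groups by word
    .map((key, word) -> new KeyValue<>(word, word))
    // Count occurrences per key; "Counts" names the resulting KTable
    .countByKey("Counts")
    // Convert the KTable of counts back into a stream of updates
    .toStream();

// Write the `KStream<String, Long>` to the output topic.
wordCounts.to(stringSerde, longSerde, "WordsWithCountsTopic");

// streamsConfiguration is a java.util.Properties holding the Streams settings
KafkaStreams streams = new KafkaStreams(builder, streamsConfiguration);
streams.start();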
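
The paraphrased code above uses streamsConfiguration without showing how it is built. As a hedged sketch, assuming a local 0.10.0-era setup, it could be a Properties object along these lines. The application name and host addresses are placeholders, and the property keys and serde class names are written as plain strings from memory, so check them against the StreamsConfig JavaDoc before relying on them.

```java
import java.util.Properties;

public class StreamsConfigSketch {

    // Builds a Properties object of the kind passed to new KafkaStreams(...).
    // All values here are placeholders for a local test setup.
    static Properties buildStreamsConfiguration() {
        Properties streamsConfiguration = new Properties();
        // Name that identifies this Streams application.
        streamsConfiguration.setProperty("application.id", "wordcount-example");
        // Where to find the Kafka brokers.
        streamsConfiguration.setProperty("bootstrap.servers", "localhost:9092");
        // Assumption: the 0.10.0 Streams release also expected a ZooKeeper address.
        streamsConfiguration.setProperty("zookeeper.connect", "localhost:2181");
        // Default serdes used when a DSL call does not name one explicitly.
        streamsConfiguration.setProperty("key.serde",
            "org.apache.kafka.common.serialization.Serdes$StringSerde");
        streamsConfiguration.setProperty("value.serde",
            "org.apache.kafka.common.serialization.Serdes$StringSerde");
        return streamsConfiguration;
    }

    public static void main(String[] args) {
        Properties config = buildStreamsConfiguration();
        System.out.println(config.getProperty("application.id"));
    }
}
```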

Here is a paraphrased example of the low level API:

@Override
public void process(String key, String value) {
  // Process a single message
  String mapKey = getKeyName(value);
  String oldValue = stateStore.get(mapKey);

  if (oldValue == null) {
    // Swap k/v around as eventsim key is null
    stateStore.put(mapKey, value);
  } else {
    // TODO: Handle when k/v already there
    // this.kvStore.put(key, oldValue + newValue);
  }

  context.commit();
}

@Override
public void punctuate(long streamTime) {
  // Process a window
  currentGeneration.incrementAndGet();

  KeyValueIterator<String, String> iter = stateStore.all();

  double totalDuration = 0;

  long totalEntries = 0;
  // ...
}

These examples should give you a good idea of the differences between the two API approaches. The low-level example keeps its state in a state store, and the base interface for these stores is KeyValueStore. You can see my usage of one KeyValueStore implementation, MeteredKeyValueStore, in the low-level example.
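
To make the process and punctuate split more concrete, here is a small stand-alone sketch of the pattern. It does not use Kafka at all: a plain HashMap stands in for the KeyValueStore, the class and its data are hypothetical, and clearing the store at the end of each window is my own simplification rather than what my GitHub example does.

```java
import java.util.HashMap;
import java.util.Map;

public class ProcessorSketch {

    // Stand-in for a KeyValueStore<String, String>; not a real state store.
    private final Map<String, String> stateStore = new HashMap<>();
    private long generation = 0;

    // Called once per incoming message: keep the first value seen per key.
    void process(String key, String value) {
        stateStore.putIfAbsent(key, value);
    }

    // Called periodically: summarize what was collected, then start over.
    long punctuate() {
        generation++;
        long totalEntries = stateStore.size();
        stateStore.clear();
        return totalEntries;
    }

    long currentGeneration() {
        return generation;
    }

    public static void main(String[] args) {
        ProcessorSketch processor = new ProcessorSketch();
        processor.process("song-1", "play");
        processor.process("song-2", "play");
        processor.process("song-1", "skip"); // ignored, key already present

        System.out.println(processor.punctuate()); // prints 2
        System.out.println(processor.punctuate()); // prints 0
    }
}
```

The point of the split is that process handles one message at a time, while punctuate looks across everything the store has accumulated since the last call.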
