Kafka Topic Design Checklist

Designing data for consumption in a Kafka topic requires more forethought than point-to-point messaging. Instead of messages being consumed by a single, known endpoint, there are many different consumers.

You will need to decide on:

  • Name
  • Schema
  • Contents
  • Key/Ordering
  • Number of Partitions
  • Number of Replicas

Name

The choice of a topic name shouldn’t be difficult. I suggest using a name that is descriptive and as long as necessary.

Don’t hardcode the name all over the place in your code. A common early bug is misspelling the topic name in several different places. I suggest using a class that exposes topic names as public static final Strings.
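A minimal sketch of such a class, assuming hypothetical topic names like "user-events" and "order-payments":

    // Hypothetical example: the class and topic names are placeholders.
    public final class Topics {

        public static final String USER_EVENTS = "user-events";
        public static final String ORDER_PAYMENTS = "order-payments";

        private Topics() {
            // No instances; this class only holds constants.
        }
    }

Producers and consumers then reference Topics.USER_EVENTS instead of retyping the string, so a typo becomes a compile error rather than a silent runtime bug.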

Schema

You might have noticed that I broke the topic’s schema out from its contents or payload. A topic’s schema is different from the actual data sent.

Some examples of schema formats are JSON and Apache Avro. Don’t use XML as your post-ETL schema. Avro is the recommended format for post-ETL schemas, though some organizations choose to use JSON. A binary format such as Avro brings big benefits, including a compact encoding and well-defined schema evolution.
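For illustration, here is a minimal Avro schema (an .avsc file, which is itself written as JSON); the record and field names are assumptions made up for this example:

    {
      "type": "record",
      "name": "UserEvent",
      "namespace": "com.example.events",
      "fields": [
        {"name": "userId",    "type": "string"},
        {"name": "eventType", "type": "string"},
        {"name": "timestamp", "type": "long"}
      ]
    }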

Contents

The contents are the actual payload of the message. Sometimes this includes the key, but the contents are primarily the value.

When you’re deciding on the contents of the value, remember not to design only for the first use case. Data usage will grow over time, and it’s easy to add more consumers. Other teams can, and will, write new consumers and require new data.

If the value’s contents are designed around one use case, you might have to change them for a new use case. My general suggestion is to add all of the data that makes sense to the value, even if some of the fields aren’t being used yet.
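If you do need to extend the value later, Avro makes this manageable: adding a field with a default is a backward-compatible change, so existing consumers keep working. Continuing the hypothetical schema above, a new optional field would look like:

    {"name": "sessionId", "type": ["null", "string"], "default": null}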

Key/Ordering

By default, your choice of key determines which partition a message is written to, and Kafka only guarantees ordering within a partition. In a worst-case scenario, choosing the wrong key could make a future consumer’s use case impossible. It’s important to choose a key that makes sense given the value’s contents.
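As a short sketch of how keys drive ordering, assuming a hypothetical "user-events" topic and a local broker:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyedProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Records with the same key hash to the same partition, so
                // one user's events are consumed in the order they were sent.
                producer.send(new ProducerRecord<>("user-events", "user-42", "login"));
                producer.send(new ProducerRecord<>("user-events", "user-42", "add-to-cart"));
            }
        }
    }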

Number of Partitions

Partitions are how Kafka breaks a topic down into smaller pieces. The number of partitions caps how far consumers can scale out, because each partition is read by at most one consumer in a consumer group.

Saying how to choose the number of partitions is outside the scope of this post. You can increase the number of partitions later on, but doing so changes which partition each key maps to, so I highly suggest spending the time up front to figure out the right number.
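To show the mechanics (not the sizing), here is a sketch using Kafka’s Admin API; the topic name and partition counts are placeholders:

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewPartitions;
    import org.apache.kafka.clients.admin.NewTopic;

    public class PartitionExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                // Create the topic with 12 partitions and 3 replicas.
                admin.createTopics(List.of(new NewTopic("user-events", 12, (short) 3)))
                     .all().get();

                // Partitions can only be increased, and doing so re-maps
                // which partition each key hashes to.
                admin.createPartitions(Map.of("user-events", NewPartitions.increaseTo(24)))
                     .all().get();
            }
        }
    }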

Number of Replicas

The number of replicas comes down to one question: how important is your data? If you don’t really care about the data, go with 1 replica. If you remotely care about your data, make sure you have 3 replicas.
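As a sketch of wiring that choice in at creation time: a common durability setup pairs a replication factor of 3 with the topic config min.insync.replicas=2 (and acks=all on producers); the topic name here is a placeholder:

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewTopic;

    public class ReplicaExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                // 3 replicas; require at least 2 in sync before an
                // acks=all write is acknowledged.
                NewTopic topic = new NewTopic("order-payments", 12, (short) 3)
                        .configs(Map.of("min.insync.replicas", "2"));
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }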
