Kafka Topic Design Checklist

Jesse Anderson
May 17, 2017
Blog, Business, Data Engineering, Data Engineering is hard
No Comments

Blog Summary: (AI Summaries by Summarizes)

Designing data for consumption in a Kafka topic requires more forethought than point-to-point consumption.
When designing a Kafka topic, you need to decide on the name, schema, contents, key/ordering, number of partitions, and number of replicas.
The topic name should be descriptive and not hardcoded in multiple places in your code.
Use a class that exposes topic names as public static final strings.
A topic's schema is different than the actual data sent and can be in JSON or Apache Avro format.

Designing data for consumption in a Kafka topic requires more forethought. Instead of the messages being a consumed from point to point, there are many different consumers.

You will need to decide on:

Name
Schema
Contents
Key/Ordering
Number of Partitions
Number of Replicas

Name

The choice of a topic name shouldn’t be difficult. I suggest using a descriptive and long as necessary.

Don’t hardcode the name all over the place in your code. It’s a common early bug to misspell the topic name in several different places. I suggest using class that exposes topic names as public static final Strings.

Schema

You might have noticed that I broke out the actual schema of the topic apart from the contents or payload. A topic’s schema is different than the actual data sent.

Some examples of schema are JSON or Apache Avro. Don’t use XML as your post-ETL schema. Avro is the recommended format for post-ETL schema. Some organizations choose to use JSON. There are some big benefits to using a binary format such as Avro.

The contents are the actual payload of the message. Sometime this includes the key, but is primarily the value.

When you’re deciding on the contents of the value, remember not to focus on the first use case. Remember that data usage will grow over time and it’s easy to add more consumers. Other teams can, and will, write new consumers and require new data.

If the value’s contents are designed for one use case, you might have to change it for a new use case. My general suggestion is to add the data that makes sense to the value, even if all of the fields aren’t being used.

Key/Ordering

By default, your choice of key affects ordering in Kafka. In a worst case scenario, choosing the wrong key could make a future consumer’s use case impossible. It’s important to choose a key that makes sense given the value’s contents.

Number of Partitions

Partitions are how Kafka breaks down a topic into smaller pieces. Choosing the number of partitions affects the scalability of the topic for consumers.

It’s outside the scope of this post to say how to choose the number of partitions. You can change the number of partitions later on, but I highly suggest spending the time to figure out the right number of partitions.

Number of Replicas

The number of replicas comes down to: how important is your data? If don’t really care about the data, go for 1 replica. If you remotely care about your data, make you have 3 replicas.

Kafka Topic Design Checklist

Name

Schema

Contents

Key/Ordering

Number of Partitions

Number of Replicas

Related Posts

Gemini Batch API for Java

Unapologetically Technical Episode 20 – Shane Murray

Unapologetically Technical Episode 19 – Jacopo Tagliabue

Unapologetically Technical Episode 18 – Adrian Woodhead

Unapologetically Technical Episode 17 – Semih Salihoglu

Unapologetically Technical Episode 16 – David Jayatillake

Unapologetically Technical Episode 15 – Frances Perry

Unapologetically Technical Episode 14 – Cliff Crosland

Data Teams Survey 2020-2024 Analysis

Join the Newsletter