Kafka Topic Design Checklist

Designing data for consumption in a Kafka topic requires more forethought than designing for point-to-point consumption. Instead of each message being consumed by a single receiver, a topic can have many different consumers.

You will need to decide on:

  • Name
  • Schema
  • Contents
  • Key/Ordering
  • Number of Partitions
  • Number of Replicas

Name

The choice of a topic name shouldn’t be difficult. I suggest using a name that is descriptive and as long as necessary.

Don’t hardcode the name all over the place in your code. It’s a common early bug to misspell the topic name in one of several different places. I suggest using a class that exposes topic names as public static final Strings.
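
A minimal sketch of such a class might look like the following. The class name and the topic names are hypothetical and only there for illustration.

```java
// Central place for topic names so they are spelled exactly once.
// Class name and topic names are hypothetical; substitute your own.
// Declared package-private so the sketch compiles in any file;
// in a real project you would likely make it public.
final class Topics {
    public static final String CUSTOMER_ORDERS = "customer-orders";
    public static final String CUSTOMER_ORDER_REFUNDS = "customer-order-refunds";

    private Topics() {
        // Constants holder; not meant to be instantiated.
    }
}
```

Producers and consumers then refer to Topics.CUSTOMER_ORDERS instead of repeating the string literal, so a typo becomes a compile error rather than a silent write to the wrong topic.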

Schema

You might have noticed that I broke out the actual schema of the topic apart from the contents or payload. A topic’s schema is different from the actual data sent: the schema describes the structure of the messages, while the contents are the data itself.

Some examples of schema formats are JSON and Apache Avro. Don’t use XML as your post-ETL schema. Avro is the recommended format for post-ETL schema. Some organizations choose to use JSON. There are some big benefits to using a binary format such as Avro, including smaller messages and support for schema evolution.

Contents

The contents are the actual payload of the message. Sometimes this includes the key, but it is primarily the value.

When you’re deciding on the contents of the value, remember not to focus on the first use case. Remember that data usage will grow over time and it’s easy to add more consumers. Other teams can, and will, write new consumers and require new data.

If the value’s contents are designed for one use case, you might have to change it for a new use case. My general suggestion is to add the data that makes sense to the value, even if all of the fields aren’t being used.

Key/Ordering

By default, your choice of key affects ordering in Kafka. In a worst-case scenario, choosing the wrong key could make a future consumer’s use case impossible. It’s important to choose a key that makes sense given the value’s contents.
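
The reason the key matters is that Kafka only guarantees ordering within a partition, and the default partitioner sends records with the same key to the same partition by hashing the key. The sketch below illustrates the idea. It uses a simple array hash as a stand-in (Kafka's default partitioner uses a murmur2 hash), and the class name, method name, and keys are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class KeyPartitioningSketch {
    // Simplified stand-in for a hash-based partitioner.
    // This is NOT Kafka's actual hash function; it only shows the principle
    // that the same key always maps to the same partition.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int hash = Arrays.hashCode(keyBytes);
        // floorMod keeps the result non-negative even for negative hashes.
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 6; // assumed partition count for the example
        // Hypothetical keys; "customer-17" appears twice on purpose.
        String[] keys = {"customer-17", "customer-42", "customer-17"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
    }
}
```

Both records keyed by "customer-17" land in the same partition, so a consumer sees them in the order they were written. Records with different keys may land in different partitions, and no ordering is guaranteed between them. If a future consumer needs all events for an order in sequence but the topic is keyed by customer, that ordering may simply not be available.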

Number of Partitions

Partitions are how Kafka breaks down a topic into smaller pieces. Choosing the number of partitions affects the scalability of the topic for consumers.

It’s outside the scope of this post to say how to choose the number of partitions. You can increase the number of partitions later on (Kafka does not let you decrease it), but doing so changes which partition a given key maps to, so I highly suggest spending the time to figure out the right number of partitions up front.
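
One reason to get the number right early is that, for keyed topics, the key-to-partition mapping depends on the partition count. The sketch below counts how many keys would map to a different partition if a topic grew from 6 to 12 partitions. It again uses a simple array hash as a stand-in for Kafka's real partitioner, and the class name, method names, keys, and partition counts are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RepartitioningSketch {
    // Simplified stand-in for a hash-based partitioner (not Kafka's actual hash).
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
    }

    // Counts the keys whose partition differs between the old and new partition counts.
    static int countMovedKeys(int keyCount, int oldPartitions, int newPartitions) {
        int moved = 0;
        for (int i = 0; i < keyCount; i++) {
            String key = "customer-" + i; // hypothetical keys
            if (partitionFor(key, oldPartitions) != partitionFor(key, newPartitions)) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        int moved = countMovedKeys(1000, 6, 12);
        System.out.println(moved + " of 1000 keys changed partition");
    }
}
```

Any key that moves has its older records in one partition and its newer records in another, so per-key ordering across the change is no longer guaranteed.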

Number of Replicas

The number of replicas comes down to: how important is your data? If you don’t really care about the data, go for 1 replica. If you remotely care about your data, make sure you have 3 replicas.
