Is Kafka Only a Big Data Tool?


I’ve been teaching Kafka at companies that don’t have, and won’t have in the future, what you’d call textbook Big Data problems. As a result, the students ask me if using Kafka is appropriate for their use cases. Put another way, is Kafka only a Big Data tool?

For most Big Data technologies, the lack of a Big Data problem, now or in the future, is the reason not to use technologies like Apache Hadoop or Apache Spark. It’s a pretty clear pass/fail because the technical and operational overhead of these projects immediately negates any other benefits. Using Big Data tools for small data isn’t just massive overkill; it’s going to waste a lot of time and money.

For Kafka, it’s different. I define Kafka as a distributed publish/subscribe system. Companies without clear Big Data problems are gaining value from it because they’re able to use Kafka’s other interesting features.

Here are some of the pros I see for using Kafka with small data:

  • All data can be replicated to more than one computer
  • Kafka removes single points of failure for the brokers
  • Kafka removes single points of failure for consumers with consumer groups
  • Consumers can move freely through the commit log and go back in time
  • Consumers don’t miss data as a result of downtime because the data is saved
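The last three pros all come from Kafka’s commit-log model. To make that concrete, here is a minimal pure-Python sketch of a commit log (not the real Kafka API; the `CommitLog` and `Consumer` classes are invented for illustration). Reads don’t remove data, each consumer tracks its own offset, and a consumer can seek backward to replay or come up late without missing anything:

```python
# Conceptual sketch of a Kafka-style commit log (not the Kafka client API).

class CommitLog:
    """An append-only log; records are never removed by reads."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

class Consumer:
    """Each consumer keeps its own offset into the shared log."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        """Return all records from the current offset onward."""
        batch = self.log.records[self.offset:]
        self.offset = len(self.log.records)
        return batch

    def seek(self, offset):
        """Move freely through the log, e.g. back in time to replay."""
        self.offset = offset

log = CommitLog()
for i in range(5):
    log.append(f"event-{i}")

c = Consumer(log)
first_read = c.poll()   # event-0 .. event-4

# "Go back in time": rewind to offset 2 and replay.
c.seek(2)
replayed = c.poll()     # event-2 .. event-4 again

# A consumer that was down while events were produced misses nothing,
# because the data is retained in the log rather than deleted on delivery.
late = Consumer(log)
late_read = late.poll()
```

Contrast this with a traditional queue, where a delivered message is gone: here the log is the source of truth, and offsets are just cursors into it.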

Here are some of the cons I see for using Kafka compared to a traditional small data pub/sub:

  • Programmatic API is more complex than others
  • Conceptually more complex (e.g. partitions and offsets) than others
  • Ordering is no longer global and is only on a partition basis
  • Consumer groups will need to handle state transitions for failures
  • Fewer people available with Kafka skills (you will probably need to train)
  • Operationally, more processes will need to be monitored
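The ordering con deserves a closer look, because it trips people up. Records are distributed across partitions by key, so ordering holds within a partition but not across the topic. A rough sketch of that idea (again pure Python, not the Kafka API; the partition count and hash are stand-ins):

```python
# Conceptual sketch of key-based partitioning (not the Kafka client API).
# Records with the same key always land in the same partition, so order
# is preserved per partition, but there is no global order across them.

NUM_PARTITIONS = 3  # hypothetical partition count

def partition_for(key):
    # Kafka hashes the record key to pick a partition;
    # a trivial stand-in hash is used here.
    return sum(key.encode()) % NUM_PARTITIONS

partitions = {p: [] for p in range(NUM_PARTITIONS)}

# Produce keyed records in an interleaved order.
for key, value in [("user-a", 1), ("user-b", 1),
                   ("user-a", 2), ("user-b", 2),
                   ("user-a", 3)]:
    partitions[partition_for(key)].append((key, value))

# All of user-a's records sit in one partition, in produce order,
# even though globally they were interleaved with user-b's.
user_a_records = [v for k, v in partitions[partition_for("user-a")]
                  if k == "user-a"]
```

The practical takeaway: if your application needs ordering, choose the record key so that everything that must stay ordered shares a key, and therefore a partition.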

With these pros and cons in mind, you can make a choice between Kafka and your small data pub/sub of choice. If the pros are really compelling and outweigh the cons, I suggest you start looking at Kafka. If the cons outweigh the pros, you’re probably better off with your small data pub/sub.

