The Many Meanings of Event-Driven Architecture: Kafka Edition

Blog Summary:
  • Kafka is often used to event changes and notify other systems of information changes or actions performed.
  • The amount of information sent in an event should depend on how expensive a lookup is. If a lookup is expensive, event more information to save downstream lookups.
  • An eventually consistent database architecture involves events being moved serially through different databases. A more common pattern with Kafka is to have all databases pull their updates from Kafka directly.
  • Schema is important in Kafka and can affect the number of projects that need to be updated due to a data format change.
  • For use cases that require eventing really large updates, consider sending the event as the row/key id for a NoSQL database or breaking up the file into parts and publishing them into Kafka.

I spoke at GOTO Chicago last week with Martin Fowler. He gave a keynote on The Many Meanings of Event-Driven Architecture. It wasn’t tied to or specific to any particular technology. In this post, I’m applying some of his points specifically to Kafka and Big Data.

Change Events

Kafka is often used to event changes. These changes serve as notifications to other systems that some information changed or that an action was performed.

Teams ask me how much information should be sent. Should it just have the action and what field changed? Should it give more information about the action?

My general answer is to figure out how expensive a lookup is. If your database lookup is expensive and you can save X number of downstream lookups, you’re coming out ahead. In those cases, I suggest eventing more information. The event would contain the action performed, the data before the action, and the data after the action was performed.
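As a rough illustration of what such a "fat" event could look like, here is a minimal Python sketch. The `ChangeEvent` class, its field names, and the JSON encoding are assumptions made for this example, not a prescribed format. It only shows the idea of carrying the action plus the before and after data so that downstream systems can skip a lookup.

```python
import json
from dataclasses import dataclass, asdict
from typing import Any, Dict, List, Optional


@dataclass
class ChangeEvent:
    """Hypothetical change event: the action plus before/after data."""
    action: str                       # e.g. "insert", "update", "delete"
    key: str                          # id of the row that changed
    before: Optional[Dict[str, Any]]  # None for an insert
    after: Optional[Dict[str, Any]]   # None for a delete

    def to_json(self) -> bytes:
        # Bytes that a producer would publish as the message value.
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    def changed_fields(self) -> List[str]:
        # Fields whose value differs between the before and after data.
        before = self.before or {}
        after = self.after or {}
        return sorted(
            name for name in before.keys() | after.keys()
            if before.get(name) != after.get(name)
        )


event = ChangeEvent(
    action="update",
    key="customer-42",
    before={"email": "old@example.com", "tier": "free"},
    after={"email": "old@example.com", "tier": "paid"},
)
```

A consumer that receives this event can see that only `tier` changed, and what it changed from and to, without going back to the source database.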

Eventually Consistent Databases

Martin talked about another event architecture I haven’t seen in Big Data architectures. It’s an architecture where events are moved serially through different databases in an eventually consistent manner. A system will make a change to a database, and then another system will take all, or a subset, of the changes and add them to its local database of changes.

A more common pattern with Kafka is to have all X number of databases pull their updates from Kafka directly. From there, the database can choose all or a subset of changes.
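To make the pull pattern concrete, here is a small Python sketch that uses an in-memory list as a stand-in for a Kafka topic. The `DatabaseSink` class and the event fields are hypothetical, and no real Kafka client is involved. Each database reads the same log at its own offset and keeps all or a subset of the changes.

```python
from typing import Callable, Dict, List

# Stand-in for a Kafka topic: an append-only list of change events.
topic: List[dict] = [
    {"table": "customers", "key": "c1", "after": {"name": "Ann"}},
    {"table": "orders", "key": "o1", "after": {"total": 30}},
    {"table": "customers", "key": "c2", "after": {"name": "Bob"}},
]


class DatabaseSink:
    """Hypothetical database that reads the topic directly at its own offset."""

    def __init__(self, wanted: Callable[[dict], bool]):
        self.wanted = wanted              # filter: which changes this database keeps
        self.offset = 0                   # position in the log, tracked per database
        self.rows: Dict[str, dict] = {}

    def poll(self, log: List[dict]) -> None:
        # Read everything after the last seen offset and apply what is wanted.
        for event in log[self.offset:]:
            if self.wanted(event):
                self.rows[event["key"]] = event["after"]
        self.offset = len(log)


everything = DatabaseSink(lambda e: True)
customers_only = DatabaseSink(lambda e: e["table"] == "customers")
for sink in (everything, customers_only):
    sink.poll(topic)
```

Neither database depends on the other here. Both read from the same source, which is the difference from the serial, database-to-database flow.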

I’m not saying you shouldn’t have multiple databases. In fact, I encourage it in certain cases. I’m saying you should update all databases from Kafka. I’ve seen the serial, database-to-database sort of architecture lead to siloing, and we’re trying to avoid that with Big Data.

The Importance of Schema

When I teach Kafka, I stress the importance of schema. I tell the class that using or not using schema won’t manifest as a failure in the early phases of the project. You’re not going to see its need until 6-12 months later as you start to make changes to data and code.

My metric for success with schema is around data changes. If you make a change 6 months after release, how many projects have to change? If every project has to be updated due to a data format change, you’ve failed at schema. If only projects that need the new data change, then you’re doing schema correctly.
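One way to picture that metric is a reader that ignores fields it doesn’t know and a new field that ships with a default. The sketch below uses plain JSON and Python dataclasses purely for illustration. The `CustomerV1` and `CustomerV2` classes and the `decode` helper are hypothetical. Schema formats such as Avro provide field defaults for the same purpose.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class CustomerV1:
    """Shape known to a project that was never updated."""
    customer_id: str
    email: str


@dataclass
class CustomerV2:
    """Shape known to a project that needs the new data."""
    customer_id: str
    email: str
    tier: str = "free"  # new field; the default covers old messages


def decode(cls, payload: bytes):
    """Build cls from JSON, dropping unknown fields; defaults fill missing ones."""
    data = json.loads(payload)
    known = {f.name for f in fields(cls)}
    return cls(**{name: value for name, value in data.items() if name in known})


old_message = json.dumps(
    {"customer_id": "c1", "email": "a@example.com"}
).encode("utf-8")
new_message = json.dumps(
    {"customer_id": "c2", "email": "b@example.com", "tier": "paid"}
).encode("utf-8")

# An old project reads new data without a code change.
old_reader_new_data = decode(CustomerV1, new_message)
# A new project reads old data and gets the default.
new_reader_old_data = decode(CustomerV2, old_message)
```

Only the project that wants `tier` had to change. The project on `CustomerV1` keeps working untouched.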

Really Large Updates

Some use cases require eventing really large updates (>1MB). Should those be sent through Kafka?

These questions are very use case specific. Sometimes, I’ll suggest that the event be sent as the row/key id for a NoSQL database, or as the path to the file in question, along with what changed. This way, the entire row or file doesn’t need to be sent through Kafka.
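A minimal sketch of that pointer-style event follows, with a Python dict standing in for the NoSQL database. The function names and event fields are assumptions for illustration only.

```python
import json
from typing import Dict, List

# Stand-in for a NoSQL table or file store that holds the large payloads.
blob_store: Dict[str, bytes] = {}


def publish_reference(row_key: str, payload: bytes, changed: List[str]) -> bytes:
    """Store the large payload out of band and return a small event pointing at it."""
    blob_store[row_key] = payload
    event = {"row_key": row_key, "changed": changed, "size_bytes": len(payload)}
    return json.dumps(event).encode("utf-8")


def consume_reference(event_bytes: bytes) -> bytes:
    """Follow the pointer in the event and fetch the payload from the store."""
    event = json.loads(event_bytes)
    return blob_store[event["row_key"]]


large_payload = b"x" * (2 * 1024 * 1024)  # 2 MB, above the >1MB mark
small_event = publish_reference("doc-7", large_payload, ["body"])
```

The event that would go through Kafka is tens of bytes. The cost is that each consumer now does a lookup against the store to get the payload.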

Other companies have broken up the file into parts and published the parts into Kafka. Using some metadata, the files get reassembled at read time.
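The chunking approach could look roughly like the following Python sketch. The `Chunk` fields (`file_id`, `index`, `total`) are one guess at the metadata needed for reassembly, not a description of what those companies actually used.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Chunk:
    """Hypothetical message: one part of a file plus reassembly metadata."""
    file_id: str
    index: int   # position of this part within the file
    total: int   # number of parts the file was split into
    data: bytes


def split_file(file_id: str, content: bytes, chunk_size: int) -> List[Chunk]:
    # Slice the content into fixed-size parts; an empty file still yields one part.
    parts = [content[i:i + chunk_size]
             for i in range(0, len(content), chunk_size)] or [b""]
    return [Chunk(file_id, i, len(parts), part) for i, part in enumerate(parts)]


def reassemble(chunks: Iterable[Chunk]) -> Dict[str, bytes]:
    """Group chunks by file and join complete files; incomplete files are skipped."""
    by_file: Dict[str, Dict[int, Chunk]] = {}
    for chunk in chunks:
        by_file.setdefault(chunk.file_id, {})[chunk.index] = chunk
    files: Dict[str, bytes] = {}
    for file_id, parts in by_file.items():
        total = next(iter(parts.values())).total
        if len(parts) == total:
            files[file_id] = b"".join(parts[i].data for i in range(total))
    return files


content = bytes(range(256)) * 10  # 2560 bytes of sample data
chunks = split_file("report.bin", content, chunk_size=1000)
shuffled = list(reversed(chunks))  # arrival order should not matter
```

The reader has to wait until every part has arrived before it can rebuild the file. That is why `reassemble` skips files with missing parts.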

These decisions about out-of-band changes affect replayability.

Replayability is the ability to take the events and recreate the state of a database or state store. This means that every single mutation has to go through Kafka. If a mutation occurs out-of-band or indirectly, that change can’t be recreated or replayed.

The need for replayability is very use case specific. If you do have one, you’ll need to make sure that all mutations happen as events.
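Here is a small Python sketch of why out-of-band changes break replay. The log is a plain list standing in for a Kafka topic, and the function names are made up for this example.

```python
from typing import Dict, List, Optional

log: List[dict] = []          # stand-in for the Kafka topic
live: Dict[str, dict] = {}    # the database's current state


def apply(state: Dict[str, dict], event: dict) -> None:
    # A None "after" value means the key was deleted.
    if event["after"] is None:
        state.pop(event["key"], None)
    else:
        state[event["key"]] = event["after"]


def mutate_via_log(key: str, after: Optional[dict]) -> None:
    # Every mutation is recorded as an event first, then applied.
    event = {"key": key, "after": after}
    log.append(event)
    apply(live, event)


def replay(events: List[dict]) -> Dict[str, dict]:
    # Rebuild state from scratch using nothing but the events.
    state: Dict[str, dict] = {}
    for event in events:
        apply(state, event)
    return state


mutate_via_log("a", {"v": 1})
mutate_via_log("b", {"v": 2})
mutate_via_log("a", None)
in_band_ok = replay(log) == live      # True: the log fully explains the state

live["c"] = {"v": 3}                  # out-of-band write that bypasses the log
out_of_band_ok = replay(log) == live  # False: replay cannot recreate "c"
```

As long as every mutation goes through `mutate_via_log`, the replayed state matches the live state. The single direct write to `live` is enough to make them diverge.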

Update: Here is Martin’s keynote
