The Many Meanings of Event-Driven Architecture: Kafka Edition

Blog Summary: (AI Summaries by Summarizes)
  • Kafka is often used to event changes and notify other systems of information changes or actions performed.
  • The amount of information sent in an event should depend on how expensive a lookup is. If a lookup is expensive, event more information to save downstream lookups.
  • An eventually consistent database architecture involves events being moved serially through different databases. A more common pattern with Kafka is to have all databases pull their updates from Kafka directly.
  • Schema is important in Kafka and can affect the number of projects that need to be updated due to a data format change.
  • For use cases that require eventing really large updates, consider sending the event as the row/key id for a NoSQL database or breaking up the file into parts and publishing them into Kafka.

I spoke at GOTO Chicago last week with Martin Fowler. He gave a keynote on The Many Meanings of Event-Driven Architecture. It wasn’t tied to or specific to any particular technology. In this post, I’m applying some of his points specifically to Kafka and Big Data.

Change Events

Kafka is often used to event changes. These changes are used as notifications to other systems that some information changed or an action was performed.

Teams ask me how much information should be sent. Should it just have the action and what field changed? Should it give more information about the action?

My general answer is to figure out how expensive a lookup is. If your database lookup is expensive and you can save X number of downstream lookups, you’re coming out ahead. In those cases, I suggest eventing more information. The event would contain the action performed, the data before the action, and the data after the action was performed.
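As a rough illustration, here is a minimal sketch of what such a "fat" change event could look like. The `ChangeEvent` class, its field names, and the JSON encoding are my assumptions for the example, not a prescribed format. It only builds the payload; handing it to a Kafka producer is left out.

```python
import json
from dataclasses import dataclass, asdict
from typing import Any, Dict, Optional


@dataclass
class ChangeEvent:
    """Hypothetical change event: the action plus the data before and after it."""
    action: str                       # e.g. "insert", "update", "delete"
    key: str                          # id of the row that changed
    before: Optional[Dict[str, Any]]  # None for an insert
    after: Optional[Dict[str, Any]]   # None for a delete

    def changed_fields(self) -> Dict[str, Any]:
        """Return {field: (old, new)} for every field whose value differs."""
        before = self.before or {}
        after = self.after or {}
        return {
            name: (before.get(name), after.get(name))
            for name in sorted(set(before) | set(after))
            if before.get(name) != after.get(name)
        }

    def to_bytes(self) -> bytes:
        """Serialize to JSON bytes, ready to be used as a message value."""
        return json.dumps(asdict(self)).encode("utf-8")


event = ChangeEvent(
    action="update",
    key="customer-42",
    before={"email": "old@example.com", "tier": "silver"},
    after={"email": "new@example.com", "tier": "silver"},
)
payload = event.to_bytes()
```

A downstream consumer that receives this payload can see what changed without going back to the source database, which is the lookup being saved.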

Eventually Consistent Databases

Martin talked about another event architecture I haven’t seen in Big Data architectures. It’s an architecture where events are moved serially through different databases in an eventually consistent manner. A system will make a change to a database and then another system will take all, or a subset, of the changes and add them to its local database of changes.

A more common pattern with Kafka is to have all of the databases pull their updates directly from Kafka. Each database can then take all of the changes or only a subset.
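The pattern can be sketched without a cluster. In this toy example, a plain Python list stands in for a Kafka topic, and the `Database` class, its `wanted` filter, and the event fields are hypothetical names I made up for illustration. Each database keeps its own offset into the same log and decides which changes it cares about.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]

# Stand-in for a Kafka topic: an append-only list of change events.
topic: List[Event] = [
    {"entity": "customer", "key": "c1", "value": "Ada"},
    {"entity": "order", "key": "o1", "value": "book"},
    {"entity": "customer", "key": "c2", "value": "Grace"},
]


class Database:
    """A toy database that pulls its updates straight from the shared log."""

    def __init__(self, name: str, wanted: Callable[[Event], bool]) -> None:
        self.name = name
        self.wanted = wanted            # which changes this database keeps
        self.offset = 0                 # this database's own position in the log
        self.rows: Dict[str, str] = {}

    def poll(self, log: List[Event]) -> None:
        """Apply every event since the last poll that passes the filter."""
        for event in log[self.offset:]:
            if self.wanted(event):
                self.rows[event["key"]] = event["value"]
        self.offset = len(log)


search_index = Database("search", lambda e: True)  # takes all changes
customer_db = Database("customers", lambda e: e["entity"] == "customer")  # subset

for db in (search_index, customer_db):
    db.poll(topic)
```

Neither database reads from the other, so adding a third one means adding another reader of the log rather than another hop in a chain.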

I’m not saying you shouldn’t have multiple databases. In fact, I encourage it in certain cases. I’m saying you should update all databases from Kafka. I’ve seen the serial sort of architecture lead to siloing and we’re trying to avoid that with Big Data.

The Importance of Schema

When I teach Kafka, I stress the importance of schema. I tell the class that using or not using schema won’t manifest as a failure in the early phases of the project. You’re not going to see its need until 6-12 months later as you start to make changes to data and code.

My metric for success with schema is around data changes. If you make a change 6 months after release, how many projects have to change? If every project has to be updated due to a data format change, you’ve failed at schema. If only projects that need the new data change, then you’re doing schema correctly.
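To make that metric concrete, here is a small sketch of the idea behind schema evolution: a reader ignores fields it doesn’t know and fills in defaults for fields that are missing. Schema formats such as Avro support field defaults for this purpose, but the code below is plain Python and the field names and the `read` helper are hypothetical.

```python
from typing import Any, Dict

# Hypothetical reader schema for a project written against the original format.
READER_V1_DEFAULTS: Dict[str, Any] = {"id": None, "email": ""}

# A later project needs a new field and supplies a default for older records.
READER_V2_DEFAULTS: Dict[str, Any] = {"id": None, "email": "", "loyalty_tier": "none"}


def read(record: Dict[str, Any], reader_defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Project a record onto the reader's fields.

    Unknown fields are ignored and missing fields take the reader's default.
    """
    return {name: record.get(name, default) for name, default in reader_defaults.items()}


old_record = {"id": 1, "email": "a@example.com"}
new_record = {"id": 2, "email": "b@example.com", "loyalty_tier": "gold"}

# The old project keeps working on new data without a code change.
seen_by_old_project = read(new_record, READER_V1_DEFAULTS)

# The new project can still read data written before the field existed.
seen_by_new_project = read(old_record, READER_V2_DEFAULTS)
```

Only the project that needs `loyalty_tier` had to change, which is the outcome the metric is looking for.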

Really Large Updates

Some use cases require eventing really large updates (>1MB). Should those be sent through Kafka?

These questions are very use case specific. Sometimes, I’ll suggest that the event carry only the row/key id for a NoSQL database, or the path to the file in question, along with what changed. This way, the entire row or file doesn’t need to be sent through Kafka.
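A minimal sketch of that first option follows. The `store` dictionary stands in for a NoSQL table, and `make_reference_event` and `handle` are hypothetical helpers; the point is only that the event stays tiny while the large value stays in the store.

```python
import json
from typing import Any, Dict, List

# Stand-in for a NoSQL table keyed by row id, holding a large value (about 2 MB).
store: Dict[str, Dict[str, Any]] = {
    "doc-7": {"title": "Q3 report", "body": "x" * 2_000_000},
}


def make_reference_event(row_id: str, changed_fields: List[str]) -> bytes:
    """Build an event that points at the row instead of carrying it."""
    return json.dumps({"row_id": row_id, "changed": changed_fields}).encode("utf-8")


def handle(event_bytes: bytes) -> Dict[str, Any]:
    """Consumer side: look up the row and return only the fields that changed."""
    event = json.loads(event_bytes)
    row = store[event["row_id"]]
    return {name: row[name] for name in event["changed"]}


event_bytes = make_reference_event("doc-7", ["title"])
changed = handle(event_bytes)
```

The trade-off is that every consumer now pays for a lookup, so this ties back to the earlier question of how expensive a lookup is.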

Other companies have broken up the file into parts and published the parts into Kafka. Using some metadata, the file gets reassembled at read time.
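Here is one way the split-and-reassemble approach could look. The metadata fields (`file_id`, `index`, `total`) and both function names are assumptions of mine, not a description of what those companies actually built. In a real deployment you would probably key every part by `file_id` so the parts land in the same partition.

```python
from typing import Any, Dict, Iterable, List


def split_into_parts(data: bytes, part_size: int, file_id: str) -> List[Dict[str, Any]]:
    """Cut data into parts, each tagged with the metadata needed to rebuild it."""
    total = max(1, -(-len(data) // part_size))  # ceiling division, at least one part
    return [
        {
            "file_id": file_id,
            "index": i,
            "total": total,
            "payload": data[i * part_size:(i + 1) * part_size],
        }
        for i in range(total)
    ]


def reassemble(parts: Iterable[Dict[str, Any]]) -> bytes:
    """Rebuild the original bytes, refusing to proceed if any part is missing."""
    ordered = sorted(parts, key=lambda p: p["index"])
    if not ordered:
        raise ValueError("no parts received")
    expected = list(range(ordered[0]["total"]))
    if [p["index"] for p in ordered] != expected:
        raise ValueError("missing or duplicate parts")
    return b"".join(p["payload"] for p in ordered)


data = bytes(range(256)) * 10                      # 2,560 bytes of sample data
parts = split_into_parts(data, part_size=1000, file_id="file-1")
restored = reassemble(reversed(parts))             # arrival order doesn't matter
```

The completeness check matters because a reader may see some parts before the rest have been published.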

These decisions about out-of-band changes affect replayability.

Replayability

Replayability is the ability to take the events and recreate a database’s or state store’s state. This means that every single mutation has to go through Kafka. If it occurs out-of-band or indirectly, those changes can’t be recreated or replayed.
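A small sketch shows why. The event shape and the `apply` and `replay` functions are hypothetical, and a list stands in for the Kafka topic. State is rebuilt by applying every event in order, so a write that never became an event is simply absent after a replay.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]
State = Dict[str, Any]


def apply(state: State, event: Event) -> State:
    """Apply one mutation event to the state, in place."""
    if event["action"] == "delete":
        state.pop(event["key"], None)
    else:  # "insert" and "update" both set the row to its new value
        state[event["key"]] = event["after"]
    return state


def replay(log: List[Event]) -> State:
    """Recreate the state from scratch by applying every event in order."""
    state: State = {}
    for event in log:
        apply(state, event)
    return state


log: List[Event] = [
    {"action": "insert", "key": "c1", "after": {"tier": "silver"}},
    {"action": "update", "key": "c1", "after": {"tier": "gold"}},
    {"action": "insert", "key": "c2", "after": {"tier": "silver"}},
    {"action": "delete", "key": "c2", "after": None},
]

live_state = replay(log)
live_state["c3"] = {"tier": "bronze"}  # out-of-band write that bypassed the log

rebuilt = replay(log)                  # "c3" can't be recreated from the events
```

After the replay, `rebuilt` contains only what the events describe, and the out-of-band row is gone.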

The need for replayability is very use case specific. If you do have one, you’ll need to make sure that all mutations happen as events.

Update: Here is Martin’s keynote
