- Justin Coffey and François Jehl from Criteo introduced the concept of keeping things "stupidly simple" in data teams.
- To keep things simple, they focused on the primary use case and implemented it with the least bells and whistles possible.
- They extensively used HDFS, identified anti-patterns, and standardized how data landed and was queried.
- They deeply understood the technologies they were using and chose a technology for a specific purpose.
- Inexperienced people tend to gravitate towards more complicated architectures when a simple architecture is available.
During my interviews for another case study in Data Teams, I was introduced to a concept I teach but hadn’t heard so brilliantly stated. The case study was with Justin Coffey and François Jehl, who were both at Criteo. Justin introduced this concept of keeping things “stupidly simple”:
To keep things aggressively simple, “we looked at the primary use case and said, how can we implement that with the least bells and whistles possible in a way that’s the most repeatable and most immediately understandable to the next person who’s going to come on?'” That led them to make extensive use of HDFS. They identified anti-patterns and didn’t let users run “crazy jobs.” They focused on standardizing how data landed and was queried so it could be usable by others. They removed one-offs to make sure the primary uses cases were handled effectively.
Part of being stupidly simple included deeply understanding the technologies they were using. They chose a technology and read the papers that the creator wrote about it. “We’re going to take that, and we’re going to take this tool, and we’re going to deploy it to do that specifically, and not do anything else with it,” Justin said.
I’ve said streaming systems at scale are 15x more complex than small data. When you’re choosing a streaming system, a wrong choice for the use case can cause a massive increase in workarounds and complexity. Streaming systems are complicated enough already, and adding more complexity through workarounds just exacerbates the problem. I can’t stress this enough that inexperienced people will gravitate to more complicated architectures when a simple architecture is available. When you add or remove boxes (technologies), each one has to earn its way into the diagram or architecture because there is overhead to adding new technologies operationally, architecturally, and developmentally.
There are blog posts from Slack and Uber on their streaming systems that I want to go through. We’ll look at the workarounds they have to do, why they had to make them, and what you could do to avoid them.
Note: This isn’t a teardown or critique of these company’s projects. There are many reasons these teams chose to go these routes, and I’m more trying to point out how they could have stayed stupidly simple with Pulsar and avoided the workarounds necessary for Kafka.
Slack Kafka Job Queue
Slack’s second iteration using Kafka and Redis task queue
Slack was facing an issue with running their tasks at scale. They started by using a Redis task queue. For their second iteration, they used Kafka to put the jobs into a specific Redis task queue.
For this design, we face the predicament that Kafka is a poor choice for running tasks. I’ve written an entire post going through the reasons why. As a direct result, the team had to continue to use the Redis task queue, and Kafka is functioning more as a buffer between the website and the process executing the actual tasks. It created more workarounds and lost the simplicity of design.
This setup increased complexity at the operational, development, and architectural levels. The ops team must maintain the Kafka cluster and make sure all messages make it correctly from the website to Kafka, then Kafka to Kafka, then Redis to the final destination. This architecture is a circuitous route, to be sure. On the architecture and development side, we no longer have a single path for the data. The developers and architects must know the extra steps to copy data before the tasks can be executed.
The architecture diagram for the Slack use case with Pulsar
The design using Pulsar would be straightforward and stupidly simple. The website would queue a task into Pulsar, and task processes would consume directly from Pulsar. There would be no extra copies or circuitous routes, and we altogether remove the need for Redis and “kafkagate”.
Uber’s Consumer Proxy
Uber makes heavy use of Kafka but has both streaming and messaging use cases. They tried to use Kafka directly for all use cases. However, Kafka lacks support for task queues (messaging) use cases (the reason why). The team created a consumer proxy to work around this limitation of Kafka’s lack of messaging support.
It’s worth reading through Uber’s post because they outline Kafka’s messaging issues and potential data loss and latency issues.
A diagram showing the Consumer Proxy at work
To work around Kafka’s messaging limitations, the team created their Consumer Proxy. This consumer proxy sits between Kafka and the actual processes that need to run the tasks. Instead of using the Kafka API, these processes use GRPCs to receive and execute their tasks.
One significant feature of the Consumer Proxy starts with a design limitation of Kafka. This video describes the limitation in Kafka and why Pulsar doesn’t have this limitation. In Kafka, once a message is produced to a partition, that message can only be accessed through that partition. Uber wanted to have features such as partition parallel processing, out-of-order commits, and dead letter queue (all three of these features are in Pulsar). To do this in Kafka, Uber had to create these features themselves using the Consumer Proxy.
This setup increased complexity at the operational, development, and architectural levels. The ops team has to maintain the Consumer Proxy. Uber’s post doesn’t mention it, but the Consumer Proxy has to have state storage maintained somewhere. The operations team would have to deal with this state storage in addition to the other pieces. Architects and developers must understand the one-off nature, current feature state, and limitations in the Consumer Proxy. For the development side, we no longer use the Kafka API and have to use GRPC. The developers need to understand the ways that tasks are en-queued and de-queued.
Uber’s engineering team needs to maintain the codebase and verify its functionality against new versions of Kafka. This workaround can be a substantial undertaking, but a company such as Uber has these resources. It takes deep expertise in distributed systems to create a Consumer Proxy like this.
The design for Pulsar is relatively simple because all of the feature requirements are supported in Pulsar. In Pulsar, messages aren’t stuck in a partition forever. The team would only use Pulsar APIs and out-of-box features that are already documented:
- For partition parallel processing, you would use round-robin or key shared subscriptions.
- For out-of-order commits, you would use selective or individual acknowledgment of messages.
- For dead letter queues, you would use the built-in dead letter queues.
The Pulsar design and implementation are stupidly simple, and we altogether remove the need for a consumer proxy.
And This Helps Me How?
I’ve been talking about the limitations in Kafka and the benefits of Pulsar for years. You might be sitting there thinking that Pulsar won’t help you because you’re currently using Kafka and dealing with the limitations and workarounds. There are things you can do to move to Pulsar, even if you are using Kafka.
The primary reason companies chose to keep Kafka is because they think moving off Kafka would be too difficult. I’d argue that the work of creating these workarounds over time relative to a move to Pulsar would be a wash. KoP (Kafka on Pulsar), can streamline the transition from Kafka to Pulsar. With KoP, the Pulsar broker natively supports Kafka’s line protocol, or you could recompile with Pulsar’s Kafka API-compliant bindings. It all becomes a question of when to cut the Gordian Knot. How many workarounds before you realize you should have gone to Pulsar in the first place? In my experience, as the (Kafka) framework advances, it becomes more and more challenging to support the one-offs.
If you want to support both streaming and messaging use cases in the future, you should consider Pulsar over Kafka. The Kafka workarounds will continue, and you go deeper into your own code. The deviations get bigger, and moving to newer versions becomes more difficult.