Some of the contenders for Big Data messaging systems are Apache Kafka, Google Cloud Pub/Sub, and Amazon Kinesis (not discussed in this post). While similar in many ways, there are enough subtle differences that a Data Engineer needs to know. These can range from nice to know to we’ll have to switch.
Cloud vs DIY
At its core, Pub/Sub is a service provided by Google Cloud. If you’re already using Google Cloud or looking to move to it, that’s not an issue. If you’re looking for an on-premises solution, Pub/Sub won’t be a fit.
Kafka does have the leg up in this comparison. It can be installed as an on-premises solution or in the cloud. I’ve trained at companies using both of these approaches.
Kafka can store as much data as you want. The actual storage SLA is a business and cost decision rather than a technical one. I’ve had companies store between four and 21 days of messages in their Kafka clusters.
Kafka supports log compaction too. This allows Kafka to remove all previous versions of the same key and only keep the latest version. For some use cases, this will allow you to store more data if you only need the latest version of the key.
Pub/Sub stores messages for seven days. You can’t configure Pub/Sub to store more.
All operational parts of Kafka are your purview. Confluent has an administrator course to learn the various ins and outs of Kafka you’ll need to know. One big part of the operational portion is disaster recovery and replication. Kafka calls this mirroring and uses a program called MirrorMaker to mirror one Kafka cluster’s topic(s) to another Kafka cluster.
Pub/Sub is a cloud service. There isn’t anything you need to do operationally, including replication. Pub/Sub adheres to an SLA for uptime and Google’s own engineers maintain that uptime. On the replication side, all messages are automatically replicated to several regions and zones.
As of Kafka 0.9, there is support for authentication (via Kerberos) and line encryption. At rest encryption is the responsibility of the user. Pub/Sub encrypts line and at rest. It has built-in authentication use Google Cloud’s IAM.
Coding and API
Kafka has its own API for creating producers and consumers. These APIs are written in Java and wrap Kafka’s RPC format. There are other languages that have libraries written by the community and their support/versions will vary. Confluent has created and open sourced a REST proxy for Kafka.
In 0.9 and 0.10 Kafka has started releasing APIs and libraries to make it easier to move data around with Kafka. There is Kafka Connect and Kafka Streams. Kafka Connect focuses on move data into or out of Kafka. Kafka Streams focuses on processing data already in Kafka and publishing it back to another Kafka topic.
Kafka’s consumers are pull. Data is only retrieved during a
Pub/Sub has a REST interface. Google provides libraries that wrap the REST interface with the languages own methods. An RPC-based library is in alpha. Their libraries support 11 different languages.
Pub/Sub consumers choose between a push or a pull mechanism.
For more in-depth processing of Pub/Sub data, Google provides Apache Beam (previously Dataflow Model). With Beam, you are given a PubSubIO class that allows you to read in and write to Pub/Sub. With Beam, you can start using any of the transforms or processing that Beam supports.
Note: Apache Beam supports Kafka too.
Comparing prices between a cloud service and Kafka is difficult. For this I’ll mostly focus on Pub/Sub’s pricing model.
Pub/Sub is priced per million messages and for storage. There are prices breaks as you move up in the number of messages you send. The pricing page gives an example where publishing and consuming 10 million messages would cost $16.
For calculating or comparing costs with Kafka, I recommend creating a price per unit. This will help you understand costs around your systems and help you compare to cloud providers.
Both technologies benefit from an economy of scale. As you send more messages in Pub/Sub, you will be given price breaks. With Kafka, the more messages you send, the more you’ll be able to amortize the costs of the cluster. This creates a decreasing price per unit.
Normally, your biggest cost center isn’t the messaging technology itself. Usually, it’s wrapped up in the publishing and processing of the messages. The code and distributed system to process the data is where most costs are incurred.
The biggest differences for Data Engineers come with the architecture differences.
Kafka gives knobs and levers around delivery guarantees. Most people try to write an at least once. Pub/Sub guarantees an at least once and you can’t change that programmatically.
Both products feature massive scalability. Mis-configuring or partitioning incorrectly can lead to scalability issues in Kafka. Pub/Sub doesn’t expose those knobs and you’re guaranteed performance out-of-the-box.
Ordering guarantees are a big difference. Kafka guarantees ordering in a partition. Pub/Sub does not have ordering guarantees. Instead, each message has an ID and you’ll need to include ordering information in the message payload.
Depending on the use case and the use of ordering, this difference can be a deal breaker. Not every use has a needed for message ordering. A qualified Data Engineer can sort out whether your ordering use case needs Kafka’s or Pub/Sub’s ordering.
Choosing a Big Data messaging system is a tough choice. In Big Data, there are only a few choices. The subtle nuances are important in choosing one or another. Data Engineers will be careful to understand the use case and access pattern to choose the right tool for the job.