Solving the First and Last Mile Problem With Kafka Part 2

Blog Summary: (AI Summaries by Summarizes)
  • Big Data faces challenges in finding value in large amounts of data, especially in real-time systems.
  • Kafka provides several windows into real-time streams, including a REST proxy for creating real-time dashboards.
  • Before creating a dashboard, data must be processed and analyzed upstream, and the dashboard should consume processed data products from Kafka topics.
  • The code for a real-time dashboard using Kafka can be found on GitHub, including functions for creating a consumer instance, starting a stream interval, and consuming and processing JSON payloads to update charts.

In the first post in the series, I talked about Big Data’s first and last mile problems. I showed how the first mile problems could be solved with Kafka.

In this post, I’m going to talk about the last mile problems.

Big Data Last Mile

With Big Data we’re faced with finding value in large amounts of data. This can come in the form of reports or other data products. Sometimes this comes from an API endpoint serving up Big Data.

No matter the output, we’re starting with Big Data. This means our systems have to sift through large amounts of data to find what we’re looking for. Adding real-time systems into the mix makes it even more difficult. Now, we’re faced with the challenge of finding value in data as it’s coming into the system.

Kafka’s Last Mile

Kafka gives us several windows into real-time streams. Often, we're creating dashboards around real-time data. One way to create a real-time dashboard is with a REST proxy. This proxy isn't part of Kafka itself, but it is Apache licensed and supported by Confluent.
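
The dashboard examples later in this post talk to the proxy's HTTP endpoints from the browser with jQuery. As a quick smoke test that the proxy is reachable, you can ask it for its topic list. This is a minimal sketch that assumes a proxy running locally on its default port, 8082; adjust the host and port for your setup:

// Minimal sketch: list the topics visible to the REST proxy.
// Assumes a local proxy on the default port; not part of the dashboard code itself.
$.ajax({
  url: "http://localhost:8082/topics",
  headers: { Accept: "application/vnd.kafka.v1+json" }
})
.done(function(topics) {
  console.log("Topics visible to the proxy:")
  console.dir(topics)
})
.fail(function(data) {
  console.log("Error listing topics:\n")
  console.dir(data)
});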

Preparing for the Last Mile

This post isn’t about processing data and preparation, but I want to point it out. In the data pipeline, you’ll need to have processed and analyzed the data already. A browser isn’t the place to do this. Ideally, the dashboard is consuming a topic or two that have the processed data product.

Dashboarding

If you haven’t seen a real-time dashboard, take a look at the video. It demonstrates the code we’re about to see.

Coding a Real-time Dashboard

The code for this dashboard can be found here.

Before starting to consume, a consumer instance must be created:

function createConsumerInstance(groupname, callbackfunction) {
  // Consumer instance not created yet, so create one
  console.log("Creating new Consumer Instance")
  // POST to the REST proxy's /consumers/<group> endpoint.
  // base_uri is defined elsewhere in the dashboard code and is assumed to point
  // at the proxy's consumers endpoint, e.g. "http://localhost:8082/consumers/".
  var jqxhr = $.ajax({
    type: "POST",
    data: "{\"format\": \"binary\"}",
    url: base_uri + groupname,
    contentType: "application/vnd.kafka.v1+json"
  })
  .done(function(data) {
    console.log("New consumer created " + data["base_uri"])
    var consumerinstance = data["base_uri"]
    // Take the consumer instance id off the end of the URI
    var consumerinstancesArray = consumerinstance.split("/")
    consumerinstance = consumerinstancesArray[consumerinstancesArray.length - 1]
    console.log("New consumer parsed " + consumerinstance)
    callbackfunction(groupname, consumerinstance)
  })
  .fail(function(data) {
    console.log("Error creating Consumer Instance:\n")
    console.dir(data)
  });
}

The call returns a base_uri for the new consumer instance, and all subsequent consume requests are made against that instance. The code above keeps only the instance id from the end of the URI and rebuilds the full path later.
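
For reference, the REST proxy's v1 consumer-creation response is a small JSON document along these lines; the host, group name, and instance id shown here are placeholders:

{
  "instance_id": "rest-consumer-1",
  "base_uri": "http://localhost:8082/consumers/eventsimgroup/instances/rest-consumer-1"
}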

Browser JavaScript lacks threading, so instead of a background consumer loop you use setInterval to run a function on a fixed schedule. This makes the JavaScript act like a pull mechanism, polling for new events every five seconds.

function startConsumingEventsimStream(groupname, consumerinstance) {
  console.log("Starting Stream Interval")
  // Poll the REST proxy for new records every 5 seconds
  setInterval(function() {
    consumeEventsimStream(groupname, consumerinstance);
  }, 5000);
}

Each poll returns a JSON payload of records. This function takes the payload, decodes it, and updates the chart with the latest information.

function consumeEventsimStream(groupname, consumerinstance) {
  // Consume the latest records from the eventsimstream topic through this consumer instance
  var jqxhr = $.ajax({
    headers: {
        Accept: "application/vnd.kafka.binary.v1+json"
    },
    url: base_uri + groupname + "/instances/" + consumerinstance + "/topics/eventsimstream"
  })
  .done(function(data) {
    //console.log("Consumed ");
    //console.dir(data)
    if (data.length != 0) {
      var jsonObjects = []
      for (var i = 0; i < data.length; i++) {
        // Record values come back base64 encoded; decode and parse the JSON
        var decodedValue = atob(data[i].value)
        var jsonObject = JSON.parse(decodedValue)
        // Add the item count so D3.js has a single number per row
        jsonObject['itemcount'] = 1
        jsonObjects.push(jsonObject)
      }
      processStreamJSONObjects(jsonObjects)
    }
  })
  .fail(function(data) {
    console.log("Error consuming raw topic:\n")
    console.dir(data)
  });
}

There are several other charts on the page being updated in a similar fashion.
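
The processStreamJSONObjects function referenced above isn't reproduced in this post; it lives in the repository and feeds the decoded events into the D3.js charts. Purely as an illustrative sketch, and not the repository's actual implementation, a function like it might tally events by a category field before handing them to a chart-update routine. The page field and updatePageCountChart name below are placeholders:

// Hypothetical sketch only, not the dashboard's actual implementation
function processStreamJSONObjects(jsonObjects) {
  // Tally the decoded events by a category field so the chart gets one number per category
  var countsByCategory = {}
  for (var i = 0; i < jsonObjects.length; i++) {
    var category = jsonObjects[i]["page"]
    countsByCategory[category] = (countsByCategory[category] || 0) + jsonObjects[i]["itemcount"]
  }
  // Hand the aggregated counts to a chart-update routine (placeholder name)
  updatePageCountChart(countsByCategory)
}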

The reality is that the dashboard is the culmination of lots of upstream work: getting the data into Kafka, processing it, and analyzing it. That upstream work is where the aggregation happens and where the Big Data gets reduced to something manageable. You'll need to think ahead about what you want to display in the dashboard so that data is readily available in a topic.
