Solving the First and Last Mile Problem With Kafka Part 2

Blog Summary: (AI Summaries by Summarizes)
  • Big Data faces challenges in finding value in large amounts of data, especially in real-time systems.
  • Kafka provides several windows into real-time streams, including a REST proxy for creating real-time dashboards.
  • Before creating a dashboard, data must be processed and analyzed upstream, and the dashboard should consume processed data products from Kafka topics.
  • The code for a real-time dashboard using Kafka can be found on GitHub, including functions for creating a consumer instance, starting a stream interval, and consuming and processing JSON payloads to update charts.

In the first post in the series, I talked about Big Data’s first and last mile problems. I showed how the first mile problems could be solved with Kafka.

In this post, I’m going to talk about the last mile problems.

Big Data Last Mile

With Big Data we’re faced with finding value in large amounts of data. This can come in the form of reports or other data products. Sometimes this comes from an API endpoint serving up Big Data.

No matter the output, we’re starting with Big Data. This means our systems have to sift through large amounts of data to find what we’re looking for. Adding real-time systems into the mix makes it even more difficult. Now, we’re faced with the challenge of finding value in data as it’s coming into the system.

Kafka’s Last Mile

Kafka gives us several windows into real-time streams. Often, we're creating dashboards around real-time data. One way to create a real-time dashboard is with a REST proxy. This proxy isn't part of Kafka itself, but it is Apache licensed and supported by Confluent.
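
The dashboard examples later in this post talk to the proxy's HTTP endpoints from the browser with jQuery. As a quick smoke test that the proxy is reachable, you can ask it for its topic list. This is a minimal sketch that assumes a proxy running locally on its default port, 8082; adjust the host and port for your setup:

// Minimal sketch: list the topics visible to the REST proxy.
// Assumes a local proxy on the default port; not part of the dashboard code itself.
$.ajax({
  url: "http://localhost:8082/topics",
  headers: { Accept: "application/vnd.kafka.v1+json" }
})
.done(function(topics) {
  console.log("Topics visible to the proxy:")
  console.dir(topics)
})
.fail(function(data) {
  console.log("Error listing topics:\n")
  console.dir(data)
});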

Preparing for the Last Mile

This post isn’t about processing data and preparation, but I want to point it out. In the data pipeline, you’ll need to have processed and analyzed the data already. A browser isn’t the place to do this. Ideally, the dashboard is consuming a topic or two that have the processed data product.

Dashboarding

If you haven’t seen a real-time dashboard, take a look at the video. It demonstrates the code we’re about to see.

Coding a Real-time Dashboard

The code for this dashboard can be found here.

Before starting to consume, a consumer instance must be created:

function createConsumerInstance(groupname, callbackfunction) {
  // Consumer instance not created yet, so create one
  console.log("Creating new Consumer Instance")
  // POST to the REST proxy's /consumers/<group> endpoint.
  // base_uri is defined elsewhere in the dashboard code and is assumed to point
  // at the proxy's consumers endpoint, e.g. "http://localhost:8082/consumers/".
  var jqxhr = $.ajax({
    type: "POST",
    data: "{\"format\": \"binary\"}",
    url: base_uri + groupname,
    contentType: "application/vnd.kafka.v1+json"
  })
  .done(function(data) {
    console.log("New consumer created " + data["base_uri"])
    var consumerinstance = data["base_uri"]
    // Take the consumer instance id off the end of the URI
    var consumerinstancesArray = consumerinstance.split("/")
    consumerinstance = consumerinstancesArray[consumerinstancesArray.length - 1]
    console.log("New consumer parsed " + consumerinstance)
    callbackfunction(groupname, consumerinstance)
  })
  .fail(function(data) {
    console.log("Error creating Consumer Instance:\n")
    console.dir(data)
  });
}

The call returns a base_uri for the new consumer instance, and all subsequent consume requests are made against that instance. The code above keeps only the instance id from the end of the URI and rebuilds the full path later.
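
For reference, the REST proxy's v1 consumer-creation response is a small JSON document along these lines; the host, group name, and instance id shown here are placeholders:

{
  "instance_id": "rest-consumer-1",
  "base_uri": "http://localhost:8082/consumers/eventsimgroup/instances/rest-consumer-1"
}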

Browser JavaScript lacks threading, so instead of a background consumer loop you use setInterval to run a function on a fixed schedule. This makes the JavaScript act like a pull mechanism, polling for new events every five seconds.

function startConsumingEventsimStream(groupname, consumerinstance) {
  console.log("Starting Stream Interval")
  // Poll the REST proxy for new records every 5 seconds
  setInterval(function() {
    consumeEventsimStream(groupname, consumerinstance);
  }, 5000);
}

Each poll returns a JSON payload of records. This function takes the payload, decodes it, and updates the chart with the latest information.

function consumeEventsimStream(groupname, consumerinstance) {
  // Consume the latest records from the eventsimstream topic through this consumer instance
  var jqxhr = $.ajax({
    headers: {
        Accept: "application/vnd.kafka.binary.v1+json"
    },
    url: base_uri + groupname + "/instances/" + consumerinstance + "/topics/eventsimstream"
  })
  .done(function(data) {
    //console.log("Consumed ");
    //console.dir(data)
    if (data.length != 0) {
      var jsonObjects = []
      for (var i = 0; i < data.length; i++) {
        // Record values come back base64 encoded; decode and parse the JSON
        var decodedValue = atob(data[i].value)
        var jsonObject = JSON.parse(decodedValue)
        // Add the item count so D3.js has a single number per row
        jsonObject['itemcount'] = 1
        jsonObjects.push(jsonObject)
      }
      processStreamJSONObjects(jsonObjects)
    }
  })
  .fail(function(data) {
    console.log("Error consuming raw topic:\n")
    console.dir(data)
  });
}

There are several other charts on the page being updated in a similar fashion.
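
The processStreamJSONObjects function referenced above isn't reproduced in this post; it lives in the repository and feeds the decoded events into the D3.js charts. Purely as an illustrative sketch, and not the repository's actual implementation, a function like it might tally events by a category field before handing them to a chart-update routine. The page field and updatePageCountChart name below are placeholders:

// Hypothetical sketch only, not the dashboard's actual implementation
function processStreamJSONObjects(jsonObjects) {
  // Tally the decoded events by a category field so the chart gets one number per category
  var countsByCategory = {}
  for (var i = 0; i < jsonObjects.length; i++) {
    var category = jsonObjects[i]["page"]
    countsByCategory[category] = (countsByCategory[category] || 0) + jsonObjects[i]["itemcount"]
  }
  // Hand the aggregated counts to a chart-update routine (placeholder name)
  updatePageCountChart(countsByCategory)
}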

The reality is that the dashboard is the culmination of lots of upstream work: getting the data into Kafka, processing it, and analyzing it. That upstream work is where the aggregation happens and where the Big Data gets reduced to something manageable. You'll need to think ahead about what you want to display in the dashboard so that data is readily available in a topic.
