Solving the First and Last Mile Problem With Kafka Part 1

Blog Summary: (AI Summaries by Summarizes)
  • Apache Kafka is a great technology for moving, storing, and processing data in real-time.
  • Before Kafka, Apache Flume was one of the most commonly used technologies for moving data in real-time, but it lacked a good place to store the data.
  • Kafka eliminates the problems of the first and last mile in Big Data by providing a great way of not just moving data, but storing it and processing it.
  • The first mile code is responsible for getting large amounts of data into the system. An example of the first mile code is the JMXMetricProducer program, which takes in JMX metrics and produces them to Kafka.
  • The JMXThread class in the JMXMetricProducer program pulls the latest JMX metrics and creates a JSON string, which is then passed into the produce method.

See part 2 in the series.

In telecommunications, there is the term “last mile”. It refers to getting the connection to the customer. It’s the last mile between the company’s infrastructure and the customer’s location.

We have similar issues in Big Data. We don’t just have a last mile problem; we have a first mile problem too. We have an issue with getting large amounts of data into our system (first mile). Then, we have an issue getting large amounts of data out of our system and into the hands of our customers (last mile).

Before Kafka

Before Kafka came along, we didn’t have a great way of moving data in real-time. Apache Flume was one of the most commonly used technologies. It could accept and process the data in a limited fashion, but lacked a good place to store the data. The data was often written in real-time to a file in HDFS.

After Kafka

Now that we have Apache Kafka, we have a great way of not just moving data, but storing it and processing it. Having Kafka in the technology stack really eliminates the problems of the first and last mile.

First Mile Code

I have an extensive Kafka example in my GitHub. One of the programs is an example of the first mile code. This program takes in JMX metrics and produces them to Kafka. The main work is done in the JMXThread.java class.

Here is the code for the run method:

@Override
public void run() {
    logger.debug("Running producer thread");

    // Poll, produce, and sleep in a loop until the thread is interrupted
    while (true) {
        try {
            // Pull the latest JMX metrics as a JSON string
            String json = getJMX();

            // Send the metrics to Kafka
            produce(json);

            // Wait before polling again
            Thread.sleep(sleepTime);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and stop the loop
            Thread.currentThread().interrupt();
            break;
        } catch (Exception e) {
            logger.error("Error in JMXThread", e);
        }
    }
}

The run method pulls the latest JMX metrics and creates a JSON string. This JSON string is then passed into the produce method. Finally, the thread sleeps before polling again.
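The getJMX() helper isn’t shown above. As a hedged sketch (the class name and the specific metrics here are my assumptions, not the repository’s actual code), here is one way such a helper could pull metrics from the platform MBean server and format them as JSON:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical sketch of a getJMX()-style helper: read heap metrics
// from the JVM's platform MBean server and format them as JSON.
// The real getJMX() in the repository may collect different beans.
public class JMXSketch {
    static String getJMX() {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memoryBean.getHeapMemoryUsage();

        // Build a small JSON document by hand; a real producer would
        // likely use a JSON library instead
        return String.format("{\"heapUsed\":%d,\"heapMax\":%d}",
                heap.getUsed(), heap.getMax());
    }

    public static void main(String[] args) {
        System.out.println(getJMX());
    }
}
```

The same pattern extends to any MBean exposed by the JVM or your application; the heap metrics are just the simplest ones to query.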

private void produce(String json) {
    // Key by host so all metrics from one host land on the same partition
    ProducerRecord<String, String> record = new ProducerRecord<String, String>(
            topic, host, json);
    producer.send(record);

    logger.debug("Produced");
}

The produce method takes in the JSON formatted string and creates a ProducerRecord object. This object wraps information about what you want to send to Kafka, like the topic, key, and value. Choosing topics, keys, and values requires a concerted design effort.

Finally, the send method is called to send the message to Kafka.
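The produce method assumes a KafkaProducer that was configured elsewhere. As a minimal sketch (the broker address, class name, and property values are assumptions, not the repository’s actual configuration), the configuration might look like:

```java
import java.util.Properties;

// Minimal sketch of the configuration a String-keyed, String-valued
// producer depends on. Broker address is an assumed local broker.
public class ProducerConfigSketch {
    static Properties producerProps() {
        Properties props = new Properties();
        // Where to find the Kafka cluster
        props.put("bootstrap.servers", "localhost:9092");
        // Keys and values are plain strings in this example
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Wait for the full in-sync replica set to acknowledge each write
        props.put("acks", "all");
        return props;
    }

    public static void main(String[] args) {
        // The producer used by produce() would then be created with:
        // Producer<String, String> producer = new KafkaProducer<>(producerProps());
        System.out.println(producerProps());
    }
}
```
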

This completes the first mile of the journey. The data is in Kafka now and we can start processing and gaining value from the data.

See part 2 in the series.
