The Four Types of Technologies You Need for Real-time Big Data Systems

Blog Summary: (AI Summaries by Summarizes)
  • Real-time data pipelines bring new challenges and require new concepts and technologies to be learned and understood.
  • Real-time data pipelines can be broken down into four general types: processors, analytics, ingestion and dissemination, and storage.
  • Processors are responsible for processing incoming data and getting it ready for subsequent usage, including enrichment.
  • Analytics create value out of the data and answer business questions in real-time, making them the most important part of the pipeline for businesses.
  • Ingestion and dissemination systems are needed to move data around and save it, and must be able to scale and provide data at a fast speed to many different systems doing processing and analytics.

Creating real-time data pipelines brings new challenges. There are new concepts and technologies you’ll need to learn and understand. To help you understand the basic technologies you need in a real-time data pipeline, I break them down into four general types. These types are:

  • Processors
  • Analytics
  • Ingestion and dissemination
  • Storage

Processors

A processor is the part that processes the incoming data. As data comes into a system, it needs to be changed and transformed. This is often the T in ETL. The processor is responsible for getting the data ready for subsequent usage.
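
To make the transform step concrete, here’s a minimal Python sketch of a processor doing the T in ETL. The input format and field names (userId, ts, amount_cents) are hypothetical; the point is the renaming, type normalization, and unit conversion that happen before downstream use.

```python
import json

def transform(raw_event: str) -> dict:
    """Parse a raw event and reshape it for downstream use.

    The field names here are hypothetical, just to illustrate
    a typical transform step in a real-time processor.
    """
    event = json.loads(raw_event)
    return {
        "user_id": event["userId"],               # rename to snake_case
        "timestamp": int(event["ts"]),            # normalize the type
        "amount": event["amount_cents"] / 100.0,  # convert units
    }

# Example: one event flowing through the processor
print(transform('{"userId": "u42", "ts": "1700000000", "amount_cents": 1999}'))
```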

Depending on the complexity of the ETL, the data may need to be enriched. This is where processing gets tricky. How do you get the data needed to perform the enrichment? That data may come in real-time or from another datastore. Now that you’re tied to another system’s latency, your enrichment process could be slowed down.
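
Here’s a hedged sketch of one common way to deal with that latency: cache recent lookups locally so the external datastore isn’t hit on every event. The profile service and its fields are hypothetical stand-ins.

```python
from functools import lru_cache

# Stand-in for a call to another datastore (e.g., a user profile service).
# In a real pipeline, this network hop is where enrichment latency comes from.
def fetch_profile_from_store(user_id: str) -> dict:
    return {"user_id": user_id, "country": "US", "tier": "gold"}

# Caching recent lookups keeps the slow external call off the hot path.
@lru_cache(maxsize=100_000)
def get_profile(user_id: str) -> dict:
    return fetch_profile_from_store(user_id)

def enrich(event: dict) -> dict:
    profile = get_profile(event["user_id"])
    return {**event, "country": profile["country"], "tier": profile["tier"]}

print(enrich({"user_id": "u42", "amount": 19.99}))
```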

Analytics

Analytics is the part that creates some kind of value out of the data. At this point, you’re taking the data and creating a real-time data product. On the simple side, this could be counting interactions in real-time. On the complex side, this could be a real-time data science or machine learning model.
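
As an illustration of the simple end, here’s a sketch of counting interactions in tumbling one-minute windows. A production pipeline would do this in a stream processor; plain Python over an in-memory list is used here just to show the windowing logic.

```python
from collections import Counter

WINDOW_SECONDS = 60

def window_start(ts: int) -> int:
    """Align a timestamp to the start of its one-minute tumbling window."""
    return ts - (ts % WINDOW_SECONDS)

# Hypothetical interaction events: (timestamp, event_type)
events = [
    (1700000005, "click"), (1700000010, "view"),
    (1700000042, "click"), (1700000065, "click"),
]

counts = Counter()
for ts, event_type in events:
    counts[(window_start(ts), event_type)] += 1

for (window, event_type), n in sorted(counts.items()):
    print(f"window={window} type={event_type} count={n}")
```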

This is the most important part of the pipeline for the business. This is where you take the data and show what’s happening. The analytics are answering a business question in real-time. This is the core driver for businesses wanting to make the move from batch to real-time. The business needs to know what’s happening as it’s happening.

Ingestion and Dissemination

In order to move data around and save it, you will need a system for ingestion and dissemination. This isn’t an easy proposition. We’re dealing with large amounts of data, and latency is key. We won’t be able to use the same middleware/sockets/networking that we’ve always used.

When you’re moving at Big Data scale and in real-time, the system needs to be able to scale. It needs to provide the data at a fast speed to many different systems doing processing and analytics.

Being used by many different systems is an important difference from real-time small data systems. Most small data systems are point-to-point. The data goes from one machine to another and there’s never any crossover. With real-time Big Data systems, many, many different systems are all accessing and receiving the data at once. This makes your ingestion and dissemination technologies crucial to the performance of your real-time systems.
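
As a sketch of that fan-out, here’s what subscribing to a shared stream might look like with the kafka-python client. It assumes a broker at localhost:9092 and a topic named events, both hypothetical here. Each distinct group_id independently receives every message, so many systems can consume the same data at once rather than point-to-point.

```python
# A hedged sketch using the kafka-python client; assumes a running broker
# at localhost:9092 and a topic named "events" (both hypothetical).
from kafka import KafkaConsumer

# Unlike point-to-point messaging, a distributed log lets many independent
# systems read the same stream. Each distinct group_id receives every
# message, so processing, analytics, and archival can all subscribe at once.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="analytics-service",  # another system would use its own group_id
    auto_offset_reset="earliest",
)

for message in consumer:
    print(message.value)  # hand off to this system's processing/analytics
```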

Storage

Storage is another issue for real-time systems. We don’t want our data to be ephemeral. At this scale, we have to engineer with the expectation of failure. Our clients and consumers of data will fail and we can’t lose that data.

Storing many small files leads to issues on many Big Data systems. You might have heard of the small file issue with HDFS. Most of the time, the small files come from a real-time source that’s trying to save into HDFS. For real-time pipelines, we’ll need new storage systems that can store data in real-time efficiently.

Not all processing and analytics should be done in real-time. You will still need to go back and process in batch. This is where filesystems like HDFS come in. You will need to archive the data out of your real-time storage technology and put it into HDFS/S3/GCS/etc.
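
One common way to avoid the small file issue during archival is to buffer events and flush them as large, date-partitioned files. Here’s an illustrative sketch; the paths, batch size, and layout are assumptions, but the dt=YYYY-MM-DD convention maps directly onto HDFS directories or S3/GCS key prefixes.

```python
import json
from pathlib import Path
from datetime import datetime, timezone

BATCH_SIZE = 10_000  # flush large batches to avoid the small file problem

def archive(events: list[dict], root: str = "archive") -> None:
    """Write a batch of events to one date-partitioned file.

    The dt=YYYY-MM-DD layout is a common convention that maps directly
    onto HDFS directories or S3/GCS key prefixes.
    """
    now = datetime.now(timezone.utc)
    part_dir = Path(root) / f"dt={now.strftime('%Y-%m-%d')}"
    part_dir.mkdir(parents=True, exist_ok=True)
    out = part_dir / f"part-{now.timestamp():.0f}.jsonl"
    with out.open("w") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")

buffer = [{"user_id": "u42", "amount": 19.99}] * 3
if buffer:  # in a real pipeline, flush when len(buffer) >= BATCH_SIZE or on a timer
    archive(buffer)
```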

Multiples

Some technologies may be a mix of 2 or more of these types. This is where things get really cloudy. You need to deeply understand each technology and the pieces that are required to create a real-time data pipeline.

To make it even more difficult, some technologies try to push or market themselves as a mix of types when they aren’t. Some technologies that are primarily processors or analytics engines also market themselves as handling ingestion and dissemination. This leads to all sorts of issues in production because the technology just doesn’t work well for that purpose.
