Creating real-time data pipelines brings new challenges. There are new concepts and technologies that you’ll need to learn and understand. To help you understand the basic technologies you need in a real-time data pipeline, I break them down into 4 general types. These types are:
- Processing
- Analytics
- Ingestion and dissemination
- Storage
Processing
A processor is the part that processes the incoming data. As data comes into a system, it needs to be changed and transformed. This is often the T in ETL. The processor is responsible for getting the data ready for subsequent usage.
Depending on the complexity of the ETL, the data may need to be enriched. This is where processing gets tricky. How do you get the data to perform the enrichment? This data may come in real-time or from another datastore. Now that you’re tied to another system’s latency, your enrichment process could be slowed down.
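One common way to soften the latency of an enrichment lookup is to cache results for a short time. The following is a minimal Python sketch of that idea, not a production design. The names `CachedEnricher` and `slow_profile_lookup`, the `user_id` field, and the 60-second TTL are all assumptions made for illustration; the lookup function stands in for whatever remote datastore you would actually call.

```python
import time
from typing import Callable, Dict, Tuple


class CachedEnricher:
    """Enrich events from a slow lookup, caching each result for ttl_seconds."""

    def __init__(self, lookup: Callable[[str], dict], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, dict]] = {}
        self.misses = 0  # number of times the slow lookup was actually called

    def enrich(self, event: dict) -> dict:
        key = event["user_id"]  # hypothetical join key
        now = self._clock()
        cached = self._cache.get(key)
        if cached is None or now - cached[0] >= self._ttl:
            # Cache miss or stale entry: pay the other system's latency once.
            self.misses += 1
            cached = (now, self._lookup(key))
            self._cache[key] = cached
        # Merge the looked-up attributes into a copy of the event.
        return {**event, **cached[1]}


def slow_profile_lookup(user_id: str) -> dict:
    """Hypothetical stand-in for a call to a remote datastore."""
    return {"country": "US" if user_id == "u1" else "unknown"}


enricher = CachedEnricher(slow_profile_lookup, ttl_seconds=60.0)
events = [{"user_id": "u1", "action": "click"},
          {"user_id": "u1", "action": "view"}]
enriched = [enricher.enrich(e) for e in events]
```

Here the second event for the same user is served from the cache, so the slow lookup runs only once. The trade-off is staleness: a longer TTL means fewer remote calls but older enrichment data.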
Analytics
Analytics is the part that creates some kind of value out of the data. At this point, you’re taking the data and creating a real-time data product. On the simple side, this could be counting interactions in real-time. On the complex side, this could be a real-time data science or machine learning model.
This is the most important part of the pipeline for the business. This is where you take the data and show what’s happening. The analytics are answering a business question in real-time. This is the core driver for businesses wanting to make the move from batch to real-time. The business needs to know what’s happening as it’s happening.
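To make the simple end of that spectrum concrete, here is a small Python sketch of counting interactions in fixed time windows. It is only an illustration: the function name `count_per_window`, the `(timestamp, kind)` event shape, and the 60-second window are assumptions, and a real streaming engine would update these counts incrementally as events arrive rather than iterate over a finished list.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def count_per_window(events: Iterable[Tuple[float, str]],
                     window_seconds: int = 60) -> Dict[int, Counter]:
    """Count interactions by kind within fixed (tumbling) time windows.

    Each event is a (timestamp_in_seconds, kind) pair. The result maps the
    start time of each window to a Counter of kinds seen in that window.
    """
    windows: Dict[int, Counter] = defaultdict(Counter)
    for timestamp, kind in events:
        # Round the timestamp down to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start][kind] += 1
    return dict(windows)


sample = [(0.5, "click"), (10.0, "click"), (59.9, "view"),
          (60.0, "click"), (125.0, "view")]
counts = count_per_window(sample, window_seconds=60)
```

With this sample, the first window holds two clicks and one view, and the later events fall into the windows starting at 60 and 120 seconds.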
Ingestion and Dissemination
In order to move data around and save it, you will need a system for ingestion and dissemination. This isn’t an easy proposition. We’re dealing with large amounts of data and latency is key. We won’t be able to use the same middleware/sockets/networking that we’ve always used.
When you’re moving at Big Data scale and in real-time, the system needs to be able to scale. It needs to provide the data at a fast speed to many different systems doing processing and analytics.
Being used by many different systems is an important difference from real-time small data systems. Most small data systems are point-to-point. The data goes from one machine to another and there’s never any crossover. With real-time Big Data systems, many, many different systems are all accessing and receiving the data at once. This makes your ingestion and dissemination technologies crucial to the performance of your real-time systems.
Storage
Storage is another issue for real-time systems. We don’t want our data to be ephemeral. At this scale, we have to engineer with the expectation of failure. Our clients and consumers of data will fail and we can’t lose that data.
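One way to picture non-ephemeral data that many consumers can read is an append-only log where each consumer tracks its own position. The Python sketch below is a toy under stated assumptions: `ReplayableLog` and its methods are hypothetical names, and everything lives in memory, whereas a real system would persist and replicate the records. It only illustrates why a consumer that fails before committing does not lose data.

```python
from typing import Dict, List


class ReplayableLog:
    """Toy append-only log with an independent read offset per consumer."""

    def __init__(self) -> None:
        self._records: List[str] = []
        self._offsets: Dict[str, int] = {}  # consumer name -> next offset

    def append(self, record: str) -> int:
        """Store a record and return its offset. Records are never removed."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer: str, max_records: int = 10) -> List[str]:
        """Return records after the consumer's last committed offset."""
        start = self._offsets.get(consumer, 0)
        return self._records[start:start + max_records]

    def commit(self, consumer: str, count: int) -> None:
        """Advance the consumer's offset once it has finished its work."""
        self._offsets[consumer] = self._offsets.get(consumer, 0) + count


log = ReplayableLog()
for record in ["a", "b", "c"]:
    log.append(record)

# One consumer reads everything and commits.
batch = log.poll("analytics")
log.commit("analytics", len(batch))

# Another consumer reads but "crashes" before committing.
crashed = log.poll("processor", max_records=2)
# After a restart it sees the same records again, so nothing is lost.
retried = log.poll("processor", max_records=2)
```

Because reading does not delete anything, the two consumers proceed independently, and the failed one simply picks up from its last committed offset.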
Storing many small files leads to issues on many Big Data systems. You might have heard of the small file issue with HDFS. Most of the time, the small files come from a real-time source that’s trying to save into HDFS. For real-time pipelines, we’ll need new storage systems that can store data in real-time efficiently.
Not all processing and analytics should be done in real-time. You will still need to go back and process in batch. This is where filesystems like HDFS come in. You will need to archive the data out of your real-time storage technology and into HDFS/S3/GCS/etc.
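A typical way to avoid creating many small files when archiving is to buffer records and write them out in larger batches. The following Python sketch shows the idea against the local filesystem only. `RollingFileWriter`, the `part-NNNNN.txt` naming, and the tiny 64-byte threshold are assumptions chosen to keep the example short; a real archiver would target sizes closer to the storage system's block size and write to HDFS or an object store.

```python
import os
import tempfile
from typing import List


class RollingFileWriter:
    """Buffer small records and write them out as fewer, larger files."""

    def __init__(self, directory: str, max_bytes: int) -> None:
        self._directory = directory
        self._max_bytes = max_bytes
        self._buffer: List[bytes] = []
        self._buffered_bytes = 0
        self.files: List[str] = []  # paths of the files written so far

    def write(self, record: str) -> None:
        data = (record + "\n").encode("utf-8")
        self._buffer.append(data)
        self._buffered_bytes += len(data)
        # Roll to a new file only once enough data has accumulated.
        if self._buffered_bytes >= self._max_bytes:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        path = os.path.join(self._directory, f"part-{len(self.files):05d}.txt")
        with open(path, "wb") as f:
            f.writelines(self._buffer)
        self.files.append(path)
        self._buffer = []
        self._buffered_bytes = 0


out_dir = tempfile.mkdtemp()
writer = RollingFileWriter(out_dir, max_bytes=64)
for i in range(100):
    writer.write(f"event-{i:03d}")  # 10 bytes per record including newline
writer.flush()  # write whatever is left in the buffer
```

In this run the 100 records end up in 15 files instead of 100. The cost is that buffered records are not yet archived, so the buffer size is a trade-off between file count and how much data is in flight.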
Some technologies may be a mix of 2 or more of these types. This is where things get really cloudy. You need to deeply understand each technology and the pieces that are required to create a real-time data pipeline.
To make it even more difficult, some technologies try to push or market themselves as a mix of types when they aren’t. Some technologies that are primarily processing or analytics technologies also market themselves as handling the ingestion and dissemination type. This leads to all sorts of issues in production as the technology just doesn’t work well.