The Three Components of a Big Data Data Pipeline
There’s a common misconception in Big Data that you only need 1 technology to do everything that’s necessary for a data pipeline – and that’s incorrect.
Data Engineering != Spark
The misconception that Apache Spark is all you’ll need for your data pipeline is common. The reality is that you’re going to need components from three different general types of technologies in order to create a data pipeline. These three general types of Big Data technologies are:
Fixing and remedying this misconception is crucial to success with Big Data projects or one’s own learning about Big Data. Spark is just one part of a larger Big Data ecosystem that’s necessary to create data pipelines.
Put another way:
Data Engineering = Compute + Storage + Messaging + Coding + Architecture + Domain Knowledge + Use Cases
Batch and Real-time Systems
There are generally 2 core problems that you have to solve in a batch data pipeline. The first is compute and the second is the storage of data. Most people point to Spark as a way of handling batch compute. It’s a good solution for batch compute, but the more difficult solution is to find the right storage – or more correctly – the different and optimized storage technologies for that use case.
As we get into real-time Big Data systems, we still find ourselves with the need for compute. Some people will point to Spark as a compute component for real-time, but do the requirements change with real-time? We’ll need storage, but now we’ll need a messaging technology.
The need for all of these technologies is what makes Big Data so complex. The issue with a focus on data engineering=Spark is that it glosses over the real complexity of Big Data. You could need as many 10 technologies working together for a moderately complicated data pipeline. For a mature and highly complex data pipeline, you could need as many as 30 different technologies.
Compute is how your data gets processed. Some common examples of Big Data compute frameworks are:
- Hadoop MapReduce
- Apache Spark
- Apache Flink
- Apache Storm
- Apache Heron
These compute frameworks are responsible for running the algorithms and the majority of your code. For Big Data frameworks, they’re responsible for all resource allocation, running the code in a distributed fashion, and persisting the results.
There are all different levels of complexity to the compute side of a data pipeline. From the code standpoint, this is where you’ll spend the majority of your time. This is partly to blame for the misconceptions around compute being the only technology that’s needed.
You may have seen simple or toy examples that only use Spark. Even in production, these very simple pipelines can get away with just compute. However, as we get into more complex pipelines – even pipelines of moderate complexity – we’ll need the other 2 components.
Storage is how your data gets persisted permanently. Some examples of simple storage are:
- S3 or other cloud filesystems
- Local storage (but this doesn’t scale)
For simple storage requirements, people will just dump their files into a directory. As it becomes slightly more difficult, we start to use partitioning. This will put files in directories with specific names. A common partitioning method is to use the date of the data as part of the directory name.
For more optimized storage requirements, we start using NoSQL databases. The need for NoSQL databases is especially prevalent when you have a real-time system. Some examples of NoSQL databases are:
- Apache Cassandra
- Apache HBase
- Apache Druid
Most companies will store data in both a simple storage technology and one or more NoSQL database. Storing data multiple times handles the different use cases or read/write patterns that are necessary. One application may need to read everything and another application may only need specific data.
At small scales you can get away with not having to think about the storage of the data, but once you actually hit scale, then you have to think about how the data stored. As I’ve worked with teams on their Big Data architecture, they’re the weakest in using NoSQL databases. Often, they’ve needed a NoSQL database much sooner, but hadn’t start using it due to a lack of experience or knowledge with the system.
You can’t process 100 billion rows or one petabyte of data every single time. A NoSQL database lays out the data so you don’t have to read 100 billion rows or 1 petabyte of data each time. All reads and writes are efficient, even at scale.
More importantly, NoSQL databases are known to scale. We can’t hit 1 TB and start losing our performance. They also scale cost effectively.
From the architecture perspective, this is where you will spend most of your time. You’ll have to understand your use case and access patterns. This part isn’t as code-intensive. This is so architecture-intensive because you will have to study your use cases and access patterns to see if NoSQL is even necessary or if a simple storage technology will suffice. This is where an architect’s or data engineer’s skill is crucial to the project’s success.
A NoSQL database is used in various ways with your data pipeline. I often explain the need for NoSQL databases as being the
WHERE clause or way to constrain large amounts of data. It can serve as the source of data for compute where the data needs to be quickly constrained. Also, it can serve as the output storage mechanism for a compute job. This allows other non-Big Data technologies to use the results of a compute job. For example, if we were creating totals that rolled up over large amounts of data over different entities, we could place these totals in the NoSQL database with the row key as the entity name. Another technology, like a website, could query these rows and display them on the website. Thus, the non-Big Data technologies are able to use and show Big Data results.
Aside: With the sheer number of new databases out there and the complexity that’s intrinsic to them, I’m beginning to wonder if there’s a new specialty update engineering that is just knowing NoSQL databases or databases that can scale.
Messaging is how knowledge or events get passed in real-time. Some examples of messaging frameworks are:
- Apache Kafka
- Apache Pulsar
- RabbitMQ (doesn’t scale)
You start to use messaging when there is a need for real-time systems. These messaging frameworks are used to ingest and disseminate a large amount of data. This ingestion and dissemination is crucial to real-time systems because it solves the first mile and last mile problems.
From the architecture and coding perspective, you will spend an equal amount of time. You’ll have to understand your use case and access patterns. You’ll have to code those use cases.
Some technologies will be a mix of two or more components. However, there are important nuances that you need to know about. For example, Apache Pulsar is primarily a messaging technology, but it can be a compute and storage component too.
Using Pulsar Functions or a custom consumer/producer, events sent through Pulsar can be processed. From an operational perspective, the custom consumer/producer will be different than most compute components. From an architectural point of view, Pulsar Functions and custom consumer/producers can perform the same (with some advanced caveats) as other compute components.
Pulsar also has its own capability to store events for near-term or even long-term. Pulsar uses Apache BookKeeper as warm storage to store all of its data in a durable way for the near-term. It also features a hot storage or cache that is used to serve data quickly. For long-term storage, it can also directly offload data into S3 via tiered storage (thus being a storage component). With tiered storage, you will have performance and price tradeoffs. Retrieving data from S3 will take slightly longer, but will be cheaper in its storage costs.
Why Do You Need Compute and Storage With Batch Data Pipelines?
Now that you have more of a basis for understanding the components, let’s see why they’re needed together.
With Hadoop, MapReduce and HDFS were together in the same program, thus having compute and storage together. With Spark, it doesn’t have a built-in storage component. You will need to give Spark a place to store data. Spark will need a place to both from and store/save to.
Just storing data isn’t very exciting. We need a way to process our stored data. This is why a batch technology or compute is needed. You need a scalable technology that can process the data, no matter how big it is.
Why Do You Need Compute, Storage, and Messaging With Real-time Pipelines?
With real-time systems we’ll need all 3 components.
You might have seen or read that real-time compute technologies like Spark Streaming can receive network sockets or Twitter streams. The reality is that messaging systems are a significantly better means handling ingestion and dissemination of real-time data. As a result, messaging systems like Pulsar are commonly used with the real-time compute. Messaging systems also solve the issues of back pressure in a significantly better way.
The messaging system makes it easier to move data around and make data available. A common real-time system would look like:
- Event data is produced into Pulsar with a custom Producer
- The data is consumed with a compute component like Pulsar Functions, Spark Streaming, or another real-time compute engine and the results are produced back into Pulsar
- This consume, process, and produce pattern may be repeated several times during the pipeline to create new data products
- The data is consumed as a final data product from Pulsar by other applications such as a real-time dashboard, real-time report, or another custom application
Moving the data from messaging to storage is equally important. This is where Pulsar’s tiered storage really comes into play. You can configure Pulsar to use S3 for long-term storage of data. This data can still be accessed by Pulsar for old messages even though its stored in S3. Other compute technologies can read the files directly from S3 too.
As I mentioned, real-time systems often need NoSQL databases for storage. This where a messaging system like Pulsar really shines. The data and events can be consumed directly from Pulsar and inserted into the NoSQL database. This makes adding new NoSQL databases much easier because the data is already made available.
Crucial for Success
All three components are critical for success with your Big Data learning or Big Data project success. As you can see, data engineering is not just using Spark. If you rewind to a few years ago, there was the same connotation with Hadoop. This sort of thinking leads to failure or under-performing Big Data pipelines and projects. Only by recognizing all of the components you need, can you succeed with Big Data.
Full disclosure: this post was supported by Streamlio. Streamlio provides a solution powered by Apache Pulsar and other open source technologies.