On Cheating with Big Data

To achieve the scales of Big Data, you have to cheat in some way. Sometimes people call these tradeoffs. In Big Data, I prefer to call them cheats. A tradeoff makes it sound like a small thing, but the reality is that Big Data tradeoffs can make a use case possible or impossible.

I don’t want to use cheating in a negative sense. These cheats are necessary to achieve the scales that Big Data provides. Without them, you just can’t scale.

I want to give a few examples of cheats so that you can see what I mean.

HBase and Cassandra

HBase and Cassandra are both column-oriented NoSQL datastores. Their cheats, however, are entirely different.

HBase divides large tables into regions. Each region is served entirely by a single RegionServer, and its data is replicated via HDFS. HBase cheats by spreading a table’s regions across RegionServers, so that no single RegionServer has to serve the entire table, and by leaving data replication to HDFS.

Cassandra divides large tables into partitions. The partitions are served up by Cassandra nodes. Each partition is replicated to several nodes, and any of those replicas can serve reads and writes for it. Cassandra cheats by letting several different nodes read and write the data.
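Here is a rough sketch of that routing difference in Python. It is a toy model and not either project's client code: the region boundaries, server names, node names, and replication factor are all made up for illustration, and real Cassandra places data with its own partitioner and token ring rather than the simple MD5-and-modulo step assumed here.

```python
import bisect
import hashlib

# Hypothetical cluster layout, invented for this example.
REGION_START_KEYS = ["", "g", "n", "t"]        # sorted start key of each region
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]  # the ONE server hosting each region

NODES = ["node-a", "node-b", "node-c", "node-d"]
REPLICATION_FACTOR = 3  # assumed value


def hbase_server_for(row_key):
    """Range lookup: pick the region with the greatest start key <= row_key."""
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_SERVERS[index]


def cassandra_replicas_for(partition_key):
    """Hash lookup: hash the key to a position, then take the next N nodes."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]


# One answer for HBase, several answers for Cassandra.
print(hbase_server_for("hbase"))         # rs2
print(cassandra_replicas_for("user42"))  # three distinct nodes
```

The point to notice is the return type: the HBase-style lookup gives back exactly one server, while the Cassandra-style lookup gives back a list of nodes that can all accept the request.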

This subtle difference of having one node serve all reads and writes (HBase) versus several different nodes (Cassandra) sounds inconsequential, but this cheat makes atomicity possible or impossible. In HBase, you get atomicity at a row level because the entire row is served by a single node. In Cassandra, you don’t have built-in atomicity and you’ll have to layer it on with a consistency check.

In a similar way, HBase’s cheat of having one node serve all of a region’s data makes a RegionServer failure an issue for accessing data. The regions the RegionServer was serving will be down until the failed RegionServer is detected and the regions are reassigned. In Cassandra, there isn’t a single place (with some caveats) where data is stored or written. Several different nodes hold each piece of data, so losing one of them doesn’t make it unreachable.
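The failure behavior can be sketched the same way. This is a toy availability check, not real failure detection: the server and node names are invented, and `required_replicas` is my own stand-in for the "some caveats" above, in the spirit of how many replicas a request insists on hearing from.

```python
def hbase_row_available(owning_server, live_servers):
    """A row is reachable only if the single server owning its region is up."""
    return owning_server in live_servers


def cassandra_row_available(replicas, live_nodes, required_replicas=1):
    """A row is reachable if enough of its replicas are still up."""
    live_replicas = [node for node in replicas if node in live_nodes]
    return len(live_replicas) >= required_replicas


# rs1 / node-a has failed in both hypothetical clusters.
print(hbase_row_available("rs1", {"rs2", "rs3"}))  # False until reassignment
print(cassandra_row_available(["node-a", "node-b", "node-c"],
                              {"node-b", "node-c"}))  # True
print(cassandra_row_available(["node-a", "node-b", "node-c"],
                              {"node-b", "node-c"},
                              required_replicas=3))  # False
```

The last call is the caveat in miniature: if you demand an answer from every replica, one failed node blocks you just as it would in the single-owner model.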

HDFS

HDFS cheats by having an immutable filesystem. Because a file’s blocks never change once they are written, this cheat allows any DataNode holding a copy of the data to serve it up.

However, this comes with several downsides. You can’t go back and edit a file in place. This forces a workaround in HBase: it has to periodically rewrite its files through a major compaction.
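To make that workaround concrete, here is a minimal sketch of what a major compaction does in spirit. It is heavily simplified and is not HBase's implementation: I assume each store file is a sorted, never-modified list of `(row_key, timestamp, value)` tuples, that only the newest version of a row is kept, and that a value of `None` marks a delete. The function name `major_compact` is my own.

```python
import heapq


def major_compact(store_files):
    """Merge immutable, sorted store files into one new file.

    Each input must be sorted by row key, newest timestamp first within a row.
    The inputs are never modified; the result is a brand new list.
    """
    merged = heapq.merge(*store_files, key=lambda cell: (cell[0], -cell[1]))
    compacted = []
    last_row = None
    for row_key, timestamp, value in merged:
        if row_key == last_row:
            continue  # an older version of a row we already handled
        last_row = row_key
        if value is not None:  # skip rows whose newest cell is a delete marker
            compacted.append((row_key, timestamp, value))
    return compacted


older_file = [("row1", 1, "a"), ("row2", 1, "b"), ("row3", 1, "c")]
newer_file = [("row1", 2, "a2"), ("row3", 2, None)]  # update row1, delete row3

print(major_compact([older_file, newer_file]))
# [('row1', 2, 'a2'), ('row2', 1, 'b')]
```

Notice that the "edit" to row1 and the "delete" of row3 were never applied in place. They were written as new files, and the old data only disappears when everything is rewritten into a new file.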

Spark Streaming

Spark Streaming cheats by processing data in micro-batches. These micro-batches range from 500 ms to 10,000 ms. Instead of processing each piece of data as it’s received, Spark Streaming groups together the data that arrives within that short window of time and processes it as one batch.

This cheat comes with the downside that very low latency processing isn’t possible with Spark Streaming. Depending on the length of the micro-batch, you could be 500 ms to 10,000 ms behind the current time. If you want lower latency processing, you’ll need to use other streaming technologies that cheat in a different way.
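The latency cost is easy to see with a little arithmetic. The sketch below is plain Python and does not use Spark's API. The function names and sample events are made up, and it only models the time an event waits for its batch to close, ignoring the time it takes to actually process the batch.

```python
def batch_end_ms(arrival_ms, batch_interval_ms):
    """Time at which the micro-batch containing this event closes."""
    return (arrival_ms // batch_interval_ms + 1) * batch_interval_ms


def micro_batches(events, batch_interval_ms):
    """Group (arrival_ms, payload) events by the batch they fall into."""
    batches = {}
    for arrival_ms, payload in events:
        end = batch_end_ms(arrival_ms, batch_interval_ms)
        batches.setdefault(end, []).append(payload)
    return dict(sorted(batches.items()))


def waits_ms(events, batch_interval_ms):
    """How long each event sits before its batch is even handed off."""
    return [batch_end_ms(arrival, batch_interval_ms) - arrival
            for arrival, _ in events]


events = [(100, "a"), (450, "b"), (500, "c"), (1200, "d")]

print(micro_batches(events, 500))  # {500: ['a', 'b'], 1000: ['c'], 1500: ['d']}
print(waits_ms(events, 500))       # [400, 50, 500, 300]
```

With a 500 ms interval, an event can wait up to the full 500 ms before anything looks at it. Stretch the interval to 10,000 ms and that worst case stretches with it.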

Cheats and Data Engineering

I’ve written about what makes data engineering so complex. A big portion of that complexity is tied up in the technology’s cheats. A qualified Data Engineer needs to know the cheats, know what they mean given the use case, and know which technology is the best tool for the job.

If you are starting a Big Data project and haven’t looked over your use case in depth, you need to. These cheats will either make your use case possible or impossible.
