Saying You Have Small Data Isn’t Belittling Your Use Case

Blog Summary: (AI Summaries by Summarizes)
  • Many engineers starting out with Big Data ask which technology to use for processing a dataset of 3 billion rows in 10,000 files that is 100 GB in size.
  • The assumption is that small data technologies can't handle this, but this is a misunderstanding of what Big Data is and isn't.
  • A dataset of 100 GB can easily fit in memory, so it's likely not a Big Data problem.
  • Using a relational database instead of a Big Data technology has benefits such as less conceptual complexity, more prevalence in the marketplace, and faster speeds of queries.
  • When someone tells you that your use case is small data, they're not belittling you, they're saving you time, money, and effort.

There is a common beginner question from engineers starting out with Big Data. An engineer will post on a social media site: “I need to know which Big Data technology to use. I have 3 billion rows in 10,000 files. The whole dataset is 100 GB. Is Big Data Technology X efficient for processing this?”

The short answer is no. The long answer is more than likely no, and only a qualified data engineer can tell you for sure.

The issue starts with a misunderstanding of what Big Data is and isn’t. Here’s my definition. The poster is assuming that small data technologies can’t handle the job. After all, 3 billion rows sounds like a lot. It isn’t.

If you think about it, you can easily provision a VM with 256 GB of RAM. For a dataset of 100 GB, the entire dataset could fit in memory. There are some nuances like how much this dataset will grow and the complexity of the processing, but this probably isn’t a Big Data problem.
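The arithmetic behind that claim can be sketched in a few lines of Python. This is a rough back-of-the-envelope calculation, not a sizing guide: it assumes decimal units (1 GB = 10^9 bytes) and looks only at the raw size of the data. The in-memory representation is often larger than the files on disk, which is part of the nuance mentioned above.

```python
# Back-of-the-envelope sizing for the dataset described in the question.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 MB = 10**6 bytes).
GB = 10**9
MB = 10**6

rows = 3_000_000_000
files = 10_000
dataset_bytes = 100 * GB
vm_ram_bytes = 256 * GB  # the VM size mentioned above

bytes_per_row = dataset_bytes / rows         # average row size
mb_per_file = dataset_bytes / files / MB     # average file size
ram_fraction = dataset_bytes / vm_ram_bytes  # share of RAM taken by raw data

print(f"average row size:  {bytes_per_row:.1f} bytes")
print(f"average file size: {mb_per_file:.0f} MB")
print(f"raw data uses {ram_fraction:.0%} of the VM's RAM")
```

Three billion rows sounds large, but at roughly 33 bytes per row and about 10 MB per file, the raw data takes up well under half of the memory on a single 256 GB machine.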

In the threads answering these questions, someone will usually respond that the use case doesn’t need Big Data. Sometimes the original poster gets insulted or thinks that people are belittling their use case. They aren’t.

This is because their use case would be so much better off in a small data technology like a relational database. Using a relational database instead of a Big Data technology has these major benefits:

  • Less conceptual complexity
  • More prevalent in the marketplace
  • More people who know the technology
  • Easier operationally
  • Faster speeds of queries
  • Cheaper operationally, technically, and people-wise
  • Shorter development cycles
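To give a feel for how little code the small data route can take, here is a minimal sketch that loads a directory of CSV files into a relational database and runs an aggregate query. Everything specific in it is an assumption made for illustration: the `events` table, the `user_id` and `amount` columns, the `part-N.csv` file names, and the use of SQLite (via Python's standard `sqlite3` module) as a stand-in for whichever relational database you would actually choose.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def load_csv_files(conn, directory):
    """Load every CSV file in `directory` into one table; return rows loaded."""
    # Hypothetical schema, chosen only for this example.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (user_id INTEGER, amount REAL)"
    )
    total = 0
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="") as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            rows = [(int(user_id), float(amount)) for user_id, amount in reader]
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        total += len(rows)
    conn.commit()
    return total


# Tiny demo: three generated files stand in for the 10,000 real ones.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        with open(Path(tmp) / f"part-{i}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["user_id", "amount"])
            writer.writerows([(1, 10.0), (2, 5.0)])

    conn = sqlite3.connect(":memory:")
    loaded = load_csv_files(conn, tmp)
    totals = conn.execute(
        "SELECT user_id, SUM(amount) FROM events "
        "GROUP BY user_id ORDER BY user_id"
    ).fetchall()
    conn.close()

print(loaded, totals)
```

Once the data is in a table, the processing is plain SQL, which ties back to the points above about conceptual complexity and the number of people who already know the technology.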

When someone tells you that your use case is small data, they aren’t belittling you or your use case. They’re saving you time, money, and effort.

For toy and personal projects, using Big Data technologies on these sorts of small datasets is fine. If you’re doing this for a production use case or a real project, do yourself a favor and stick to the small data technologies.

If you do have Big Data problems, you are specifically held back by a small data technology limitation. You are saying “can’t” because you are hitting a known technical limitation. The only way to solve these problems is with Big Data technologies. For these problems, you will need data engineers.

Remember that even if you have Big Data use cases, not every use case within an organization requires Big Data. There are still small data use cases that work nicely in their small data technologies. Using Big Data technologies for every use case will bring the same sorts of issues to those small data use cases.
