There Are Several Hard Problems with Big Data


There’s a common misconception that if you just change one thing in Big Data, everything else will become easier. In reality, Big Data has several different hard problems, and solving one of them doesn’t solve the others.

Sometimes, I’ll see tweets or posts about how companies or vendors haven’t made Big Data easy. These posts assume that everything about Hadoop can be made simple. They also perpetuate the assumption that there’s only one hard problem to solve.

Big Data is complex. In Chapter 2, “The Need for Data Engineering,” of Data Engineering Teams, I show how Big Data is 10-15x more complex than small data.

The three main problems for Big Data are: operations, development, and management.

Management

Setting up the team correctly is crucial to the success of the project. I make that point over 73 pages in Data Engineering Teams.

In the scope of making this easier, there isn’t much that can be done. I’ve written the book giving the steps. If you still need help, we provide mentoring services for management and teams.

Problems in management tend to materialize early on. These problems are the culprits behind the early failures of Big Data projects. These projects just never go anywhere because they have the wrong people on the team.

Operations

Operational problems can be the easiest to reduce in complexity. You can move entirely to the cloud and remove the majority of operational overhead. You can use purpose-built software like Cloudera Manager or Apache Ambari. These allow you to have fewer people monitor and maintain a cluster, but don’t remove the need for operations people.

Operations problems tend to manifest after the first few months of the project.

Development

Development problems are the most difficult to reduce in complexity. Many people think that the move from Apache Hadoop to Apache Spark will reduce complexity. It doesn’t.

Others think that the complexity stems from Hadoop or Spark being immature. It actually comes from them being general-purpose systems.

Development problems tend to manifest throughout the project. A data pipeline is constantly being updated and added to. If the development team isn’t ready, these updates will take forever or the team will say they aren’t possible.

I stress the need for qualified Data Engineers. Without proper training and resources, data engineering projects never finish.

What to Do?

Some problems can be lessened and others require smart people. Don’t fall into the misconception that these problems can be magically made easy. In Big Data, an ounce of prevention is worth a ton of cure.
