There Are Several Hard Problems with Big Data

Jesse Anderson
April 26, 2017

There’s a common misconception that if you change just one thing in Big Data, everything else will get easier. In reality, Big Data has several different hard problems, and solving one of them doesn’t solve the others.

Sometimes, I’ll see tweets or posts about how companies or vendors haven’t made Big Data easy. These posts assume that everything about Hadoop can be made simple. They also perpetuate the assumption that there’s only one hard problem to solve.

Big Data is complex. In chapter 2 “The Need for Data Engineering” in Data Engineering Teams, I show how Big Data is 10-15x more complex than small data.

The three main problems for Big Data are operations, development, and management.

Management

Setting up the team correctly is crucial to the success of the project. I make that point over 73 pages in Data Engineering Teams.

When it comes to making this easier, there isn’t much that can be done. I’ve written the book giving the steps. If you still need help, we provide mentoring services for management and teams.

Problems in management tend to materialize early on. These problems are the culprits behind the early failures of Big Data projects. These projects just never go anywhere because they have the wrong people on the team.

Operations

Operational problems can be the easiest to reduce in complexity. You can move entirely to the cloud and remove the majority of operational overhead. You can use purpose-built software like Cloudera Manager or Apache Ambari. These allow you to have fewer people monitor and maintain a cluster, but don’t remove the need for operations people.

Operations problems tend to manifest after the first few months of the project.

Development

Development problems are the most difficult to reduce in complexity. Many people think that the move from Apache Hadoop to Apache Spark will reduce complexity. It doesn’t.

Others think that the complexity stems from Hadoop or Spark being immature. It doesn’t; it comes from them being general-purpose systems.

Development problems tend to manifest throughout the project. A data pipeline is constantly being updated and added to. If the development team isn’t ready, these updates will take forever or the team will say they aren’t possible.

I stress the need for qualified Data Engineers. Without proper training and resources, data engineering projects never finish.

What to Do?

Some problems can be lessened and others require smart people. Don’t fall into the misconception that these problems can be magically made easy. In Big Data, an ounce of prevention is worth a ton of cure.
