Getting Stuck Crawling with Big Data

Blog Summary: (AI Summaries by Summarizes)
  • Breaking down Big Data projects into smaller pieces is recommended, using the crawl, walk, run process.
  • Some companies get stuck at the crawl phase and don't progress to the walk and run phases.
  • Stopping at crawl looks like cloning the data warehouse in Hadoop without improving or using new technologies in the data pipeline.
  • Hadoop can be used as a data warehouse, but stopping at data warehousing is a waste of its potential.
  • The source of the problem is having the wrong team or members of the team tasked with the Big Data transition.

I always encourage companies to break down their Big Data projects into smaller pieces. I call this process crawl, walk, run.

There is an interesting corollary to this process. Some companies get stuck at the crawl phase and don’t progress on to the walk and run phases. The first time I saw this, I was so intrigued. How could a company stop? Why would they stop when there’s so much more they could do?

Stopped at Crawl

You’re probably wondering what it looks like when your Big Data project stops at crawl.

It looks like you’ve cloned your data warehouse into Hadoop and stopped there. No further work has gone into improving the data pipeline or adopting new technologies.

That leads to a common question I get asked: is Hadoop a data warehouse? My answer is yes, Hadoop can be used as a data warehouse, but stopping at data warehousing is a terrible waste. Hadoop and its ecosystem can do so much more than a data warehouse can.

What’s the Source?

The source of the problem is having the wrong team or members of the team tasked with the Big Data transition. The common misconception is that a Data Engineer is the same thing as a DBA.

The two positions are very different. A DBA has a place on the data engineering team, but having a team of just DBAs leads to being stuck at crawling. Creating a Big Data pipeline requires Java skills.

The crawling phase, moving data out of an RDBMS and landing it in Hadoop, is easy; that’s why I call it the crawling phase. There is so much more that can be done with Hadoop, but those things can’t be done with just SQL skills. You will need qualified data engineers who can create the complex data pipelines.
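To make that contrast concrete, here is a rough sketch, not a production pipeline. The crawl phase is essentially a bulk copy, something like a single Sqoop import of your RDBMS tables into HDFS. The walk and run phases start to look like real code. The Spark job below is my own illustration: the paths, table names, and the 30-minute window are made up, and it assumes Spark’s Java API is on the classpath.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class ClickstreamPipeline {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ClickstreamPipeline")
                .getOrCreate();

        // The crawl: the orders table was already bulk-copied from the RDBMS
        // into HDFS (for example, with a single Sqoop import).
        Dataset<Row> orders = spark.read().parquet("hdfs:///warehouse/orders");

        // The walk/run: work with raw clickstream events that never fit in the
        // data warehouse, bucketing them into 30-minute windows per user.
        Dataset<Row> clicks = spark.read().json("hdfs:///raw/clickstream");
        Dataset<Row> activity = clicks
                .withColumn("event_time", to_timestamp(col("event_time")))
                .groupBy(col("user_id"), window(col("event_time"), "30 minutes"))
                .count();

        // Join behavioral data with transactional data and write a new,
        // analytics-ready dataset back to HDFS.
        activity.join(orders, "user_id")
                .write().mode("overwrite")
                .parquet("hdfs:///analytics/user_activity_with_orders");

        spark.stop();
    }
}

Even this toy example needs someone comfortable writing, testing, and deploying code rather than just writing SQL against a warehouse, and that is exactly the skills gap that keeps teams stuck at crawl.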
