Crawl, Walk, Run with Big Data

Blog Summary: (AI Summaries by Summarizes)
  • Attacking a Big Data project with an all-or-nothing mindset leads to failure
  • Breaking the project into manageable phases called crawl, walk, and run is recommended
  • Crawling phase involves doing the minimum to start using Big Data, such as getting current data into Hadoop and setting up systems to bring in data
  • ETL coding and data normalization are done in this phase to set up for success in analyzing the data
  • Walking phase involves building on the foundation built in the crawling phase and starting to gain value from the data


Attacking a Big Data project with an all-or-nothing mindset leads to failure. I highly suggest breaking the overall project into more manageable phases: crawl, walk, and run.


Crawl

In this phase, you’re doing the absolute minimum to start using Big Data. This might be as simple as getting your current data into Hadoop. You’ll also set up the systems to continually bring in data.
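As a sketch of that “continually bring in data” step, here is a minimal date-partitioned ingestion job in Python. The dt=YYYY-MM-DD directory layout mirrors a common Hadoop/Hive landing-zone convention; the function name, paths, and file pattern are illustrative assumptions, not from the post.

```python
import shutil
from datetime import date
from pathlib import Path


def ingest(source_dir: Path, landing_root: Path, run_date: date) -> list[Path]:
    """Copy raw files into a date-partitioned landing directory,
    the usual layout consumed by downstream Hadoop/Hive jobs."""
    # One partition directory per run date, e.g. landing/dt=2024-01-05/
    partition = landing_root / f"dt={run_date.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)

    copied = []
    for src in sorted(source_dir.glob("*.csv")):
        dest = partition / src.name
        shutil.copy2(src, dest)  # preserves timestamps for auditing
        copied.append(dest)
    return copied
```

In production this would be scheduled (cron, Oozie, Airflow) and would write to HDFS or object storage rather than a local path, but the partitioning idea is the same.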

In this phase you’ll start on your ETL coding and data normalization. This will set you up for success as you start analyzing the data. This phase has minimal amounts of analysis. Your focus is on creating the system that will make it as simple as possible to analyze the data.
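To make the normalization idea concrete, here is a small sketch that cleans one raw customer record: trimming whitespace, standardizing case, and converting a US-style date to ISO format. The field names and input format are assumptions for illustration.

```python
from datetime import datetime


def normalize_record(raw: dict) -> dict:
    """Normalize one raw record: trim whitespace, standardize case,
    and parse a MM/DD/YYYY date into ISO format."""
    return {
        "name": raw["name"].strip().title(),
        "email": raw["email"].strip().lower(),
        # ISO dates sort lexicographically, which simplifies later analysis
        "signup_date": datetime.strptime(
            raw["signup_date"].strip(), "%m/%d/%Y"
        ).date().isoformat(),
    }
```

Doing this once, up front, means every later analysis can assume clean, consistent fields instead of re-cleaning the data each time.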


Walk

In this phase, you’re building on the solid foundation you built while crawling. Everything is ready for you to start gaining value from your data.

In this phase you’re starting to really analyze your data. You’re using the best tool for each analysis because the cluster is already running the right tools.

You’re creating data products. These aren’t just entries in a database table; they’re in-depth analyses with business value. At this point, you should be creating direct and quantifiable business value.
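As a toy illustration of a data product, rather than a single table entry, here is a sketch that rolls raw order events up into a per-region revenue summary that other teams could consume directly. The field names are assumptions.

```python
from collections import defaultdict


def revenue_by_region(orders: list[dict]) -> dict[str, float]:
    """Aggregate raw order events into a per-region revenue summary,
    a small example of a consumable data product."""
    totals: dict[str, float] = defaultdict(float)
    for order in orders:
        totals[order["region"]] += order["amount"]
    return dict(totals)
```

In practice this kind of rollup would run on the cluster (Hive, Spark, etc.) over the normalized data from the crawl phase; the point is that the output is a finished, business-facing artifact.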

You’re starting to look at data science and machine learning as ways to improve your analysis or data products.


Run

In this phase, you’re moving into the advanced parts of Big Data architectures. You’re gaining the maximum amount of value from your data.

You’re also looking at how batch-oriented systems are holding you back. You start looking at real-time systems.

You’re looking at how to optimize your storage and retrieval of data. This might be choosing a better storage format or working with a NoSQL database.
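To see why a better storage format helps, here is a minimal sketch of the core idea behind columnar formats like Parquet and ORC: pivoting row-oriented records into per-column arrays, so a query that scans one field no longer has to touch every field of every row. This is a simplification of what those formats do, shown in plain Python.

```python
def to_columnar(rows: list[dict]) -> dict[str, list]:
    """Pivot row-oriented records into a column-oriented layout.
    Assumes every row has the same keys, as in a fixed schema."""
    columns: dict[str, list] = {key: [] for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return columns
```

Real columnar formats add compression, encodings, and statistics on top of this layout, which is where most of the scan-time savings come from.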

You’re using machine learning and data science to their fullest potential.

If you go directly from crawling to running, you’ll trip and fall. You’ll continually trip and fall and not know why. This is a common issue with new data engineering teams: they don’t set up a good foundation before they move on to the advanced phases.
