Brief History of Data Engineering

Blog Summary: (AI Summaries by Summarizes)
  • Google created MapReduce and GFS in 2004 for scalable systems.
  • Apache Hadoop was created by Doug Cutting in 2005 based on Google's papers.
  • Cloudera and Hortonworks commercialized open-source big data technologies in 2008 and 2011.
  • Apache Hive added SQL to Hadoop, and Apache Pig offered another way to make it easier to program.
  • Apache HBase and Apache Cassandra were developed for scalable databases in 2007 and 2008.

In the beginning, there was Google. Google looked over the expanse of the growing internet and realized they’d need scalable systems. They created MapReduce and GFS in 2004. They published the papers for them in the same year.

Doug Cutting took those papers and created Apache Hadoop in 2005.

Cloudera was started in 2008, and Hortonworks started in 2011. They were the first companies to commercialize open-source big data technologies, and they drove the marketing and commercialization of Hadoop.

Hadoop was hard to program, so Apache Hive came along in 2010 to add SQL. Apache Pig arrived in 2008 too, but it never saw as much adoption.

With an immutable file system like HDFS, we needed scalable databases to read and write data randomly. Apache HBase came in 2007, and Apache Cassandra came in 2008. Along the way, there were various explosions of databases within a type, such as GPU, graph, JSON, column-oriented, MPP, and key value.

Hadoop didn’t support real-time processing, so Apache Storm was open sourced in 2011. It didn’t get wide adoption, as it was a bit early for real-time and the API was difficult to wield.

Apache Spark came in 2009 and gave a unified batch and streaming engine. It gained in usage and eventually displaced Hadoop.

Apache Flink came in 2011 and gave us our first real streaming engine. It handled the stateful problems of real-time elegantly.

We lacked a scalable pub/sub system. Apache Kafka came in 2011 and gave the industry a much better way to move real-time data. Apache Kafka has its architectural limitations, and Apache Pulsar was released in 2016.

The first big conferences were Strata and Hadoop World, which eventually merged in 2012. The combined conference was the place where the brightest big data minds came and spoke. It was shepherded well by Ben Lorica, Doug Cutting, and Alistair Croll.

There was (and still is) an overall problem in the industry because most projects failed to get into production. Some people blamed the technologies. The technologies more or less work well.

Big data projects were given to data scientists and data warehouse teams, where the projects subsequently failed. As clearly evident as that sounds now, my writing about needing data engineering went heavily against the grain of everything that was written at the time.

DJ Patil coined the term Data Scientist in 2008. For the majority of companies, that was the only title working on data problems at scale. Honorable mentions go to Paco Nathan, John Thompson, and Tom Davenport, who wrote about data science and analytic team management.

Google’s 2015 paper Hidden Technical Debt in Machine Learning Systems highlighted the fact that machine learning isn’t just the creation of models. It is predominantly data engineering and all of the technical debt difficulties that come with data.

I started to write about the management side of big data in 2016 by talking about how data engineering is more difficult than other industry trends. I further expanded on these ideas in 2017 by talking about complexity in big data and writing my first book Data Engineering Teams. I continued to help people understand the need for data engineers in 2018 by discussing the differences between data scientists and data engineers. I followed that post up in 2019 by showing that data scientists are not data engineers. In 2020, I published my third book Data Teams to expand on how data teams and business need to cooperate. To share even more best practices and knowledge, I started the Data Dream Team podcast in 2021.

Maxime Beauchemin was writing about data engineering in 2017 too. He wrote The Rise of the Data Engineer, showing how the industry was changing. He followed it up later that year with The Downfall of the Data Engineer to talk about the growing pains of data engineering.

Zhamak Dehghani first introduced data mesh in 2019 as a sociotechnical approach to data. She wrote Data Mesh in 2022 to provide more information about the subject.

Gene Kim talks about the management of data teams in The Unicorn Project, which was published in 2019.

The programming language du jour has changed over the years. At various times it’s been Java, Scala, and Python. Now people are excited about Rust. Large, untyped codebases are landmines in an industry that deals with data.

This brief history leaves out many technologies and companies. Over time, they have died, are dying a slow death, are still trying to find their footing, or are moving along nicely. Making poor technology choices can make for a late-game failure.

People who don’t know their history are doomed to repeat it. Many data engineers are new and don’t understand the history or the technologies they’re using. There is still a focus on technology and programming languages as the main driver for success or failure. However, people and organizational structure are still the primary drivers for the early success or failure of data projects.

Looking at the technological improvements over the years, we have better tools, but they didn’t make problems easy. None of them took a really hard problem and made it so easy anyone could do it. The gains were 5 to 10 percent improvements in ease, where more time could be spent on business problems because the solution was built-in rather than custom written. I firmly believe that no general-purpose distributed system will make data engineering easy. There isn’t going to be the equivalent of a WordPress event where the bar lowers dramatically.

Frequently Asked Questions (AI FAQ by Summarizes)

When were MapReduce and GFS created?

Google created MapReduce and GFS in 2004 for scalable systems.

Who created Apache Hadoop and when?

Apache Hadoop was created by Doug Cutting in 2005 based on Google's papers.

When did Cloudera and Hortonworks commercialize open-source big data technologies?

Cloudera and Hortonworks commercialized open-source big data technologies in 2008 and 2011.

What were Apache Hive and Apache Pig introduced for?

Apache Hive was introduced to add SQL to Hadoop, which was hard to program. Apache Pig was introduced for the same reason but never saw as much adoption.

When were Apache HBase and Apache Cassandra developed?

Apache HBase and Apache Cassandra were developed for scalable databases in 2007 and 2008.

What was Apache Storm open-sourced for in 2011?

Apache Storm was open-sourced in 2011 for real-time data processing.

How did Apache Spark impact the big data landscape?

Apache Spark unified batch and streaming processing in 2009, eventually displacing Hadoop.

What was Apache Flink known for when it was introduced in 2011?

Apache Flink was introduced in 2011 as the first real streaming engine.

What revolutionized real-time data movement in 2011?

Apache Kafka revolutionized real-time data movement in 2011.

Why is data engineering crucial in machine learning systems?

Data engineering is crucial in machine learning systems, highlighted by Google's technical debt paper in 2015.
