Why Data Science Teams Don’t Think They Need Data Engineering

Blog Summary: (AI Summaries by Summarizes)
  • Data science teams may believe they don't need data engineering, but this can lead to underperformance and technical debt.
  • Lack of understanding of what data engineers do can lead to a belief that data engineering is unnecessary.
  • Repeatable data science projects require data engineering to ensure maintainability and long-term value.
  • Organizations that start with small or medium data may not realize the need for data engineering until they face scalability issues.
  • If a data science team is stuck on a problem, spending excessive time on it, or lacking technical competence, it may be time to consider the need for data engineering.

Some of the most interesting consultations are when I help data science teams that don’t think they need data engineering. I’ve compiled a list of some of the more common reasons why data science teams believe they don’t need data engineering and why those reasons might not be valid.

Data science teams need data engineering because, without it, the data scientists are often just getting by or severely underperforming. The results of going without a data engineering team leave much to be desired. Commonly, data scientists create technical debt that data engineers later have to spend time fixing.

Lack of Understanding

For some data scientists, there is a total lack of understanding of what data engineers do. This gap usually comes from having only a cursory knowledge of programming and maybe some exposure to distributed systems. It leads to a “how hard could it be?” attitude that downplays the complexities that data engineers hide from data scientists.

To help data scientists understand the differences between a data engineer and a data scientist, I created some visualizations that lay them out clearly.

Repeatable Data Science

Some data science is repeatable. By that, I mean the work is automated and consistent data products are being created and maintained. Other data science is ad hoc and not repeatable. In these scenarios, every project is started from scratch and, once the project is done, is completely discarded.

For ad hoc projects, there’s no big engineering onus. The projects only live for hours, days, or weeks. There’s no real need for any long-lived planning. I’d argue that organizations lose much of the value of data science when everything is so ephemeral.

When ad hoc organizations transition to long-term projects, they bear the brunt of their engineering mistakes. Until then, they’ve been able to escape the data engineering rigors of projects that need to be repeatable and run consistently. They find out the hard way that data engineering isn’t over-engineering; it’s making sure that the data products are maintainable. Creating repeatable data products requires data engineers.

There’s No Scale…Yet

Sometimes organizations start out with small or medium data and don’t have to deal with scale issues (count yourselves lucky). They’ve been able to get by with Excel, single processes, or waiting longer for results. The transition to big data and scale catches them by surprise.

The transition to big data technologies comes with a significant increase in complexity because those technologies are distributed systems. At first, the data scientists think they can handle the growth. It quickly becomes apparent that they can’t deal with the increase in complexity and need data engineers.

Creating scalable data products requires data engineers.

What It Looks Like and What to Do

If your team is experiencing one of these problems, it will look like the data science team is stuck. They’ll spend a week on something that seems like it should take hours or a day. They’ll spend hours searching Google or Stack Overflow for answers, even though these sorts of solutions can’t be found there. The data scientists simply won’t have the technical background to recognize the underlying issue. These sorts of problems fall right into the wheelhouse of data engineering.

Managers and data scientists will need to take an honest look at the team’s productivity and skills. More than likely, they will need data engineers and will have to establish a data engineering team. I cover how to start and resource a data engineering team in my Data Teams book.
