What Happens When Data Science Teams Add A Data Engineer

Blog Summary:
  • Data science teams often don't fully understand the critical nature of data engineering and may think adding a data engineer or two to their team will solve the problem.
  • There should be 2-5 data engineers per data scientist to ensure success.
  • Hiring the wrong data engineers can lead to underperformance and bias in the hiring process.
  • Choosing the wrong technologies and not understanding the challenges in data engineering can lead to failure in projects.
  • Fixing technical debt while not breaking the entire system requires a qualified data engineer or team of data engineers.

By Jesse Anderson and Mikio Braun

Organizations are gradually getting the message about the critical nature of data engineering. Data science teams are getting that message too. Sometimes, that message gets muddled, and data science teams think they just need to add a data engineer or two to their team. In their minds, this solves the problem, and they can go back to business as usual. We’d like to share our experiences when this happens and why this isn’t the right course of action.

Buy-In

The core issue here is that data science teams don’t fully buy into the notion that data engineering is critical to success. Instead, there is a “there I did it” or “there I fixed it” sort of mentality. Naturally, the actual data science work is on their minds, and they often don’t have enough knowledge to fully understand the challenges in data engineering. In addition, they often perceive the amount of time data engineering takes, compared to the data science side of things, as a problem, again without fully understanding why it takes that long. In Data Teams, Jesse recommends a ratio of 2-5 data engineers per data scientist.
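As a rough illustration of what that ratio implies for headcount, here is a minimal Python sketch. The function name `recommended_engineer_range` and its parameters are hypothetical, and treating the ratio as a simple linear multiplier is an assumption made for illustration, not a formula taken from the book.

```python
def recommended_engineer_range(num_data_scientists, low_ratio=2, high_ratio=5):
    """Return the (min, max) data engineer headcount implied by a
    2-5 engineers-per-scientist ratio (assumed to scale linearly)."""
    if num_data_scientists < 0:
        raise ValueError("num_data_scientists must be non-negative")
    return (num_data_scientists * low_ratio, num_data_scientists * high_ratio)


if __name__ == "__main__":
    # Hypothetical team sizes, chosen only to show the arithmetic.
    for scientists in (1, 3, 5):
        low, high = recommended_engineer_range(scientists)
        print(f"{scientists} data scientist(s): {low}-{high} data engineers")
```

By this arithmetic, a team of three data scientists that adds one or two data engineers still sits well below the recommended range.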

Hiring

Many problems trace their way back to hiring. Put simply, data scientists often hire the wrong engineers, and it just gets worse from there.

Hiring the wrong people can have all sorts of root causes. For example, data scientists may not believe data engineering can help or is all that difficult. Or, they could completely misunderstand data engineering and have worked with the wrong kind of data engineer, one who only perpetuates the flawed archetype of a data engineer. We’ve also seen data science teams change the title of the most data-engineering-savvy data scientist to data engineer. Usually, this puts the most competent data scientist on the job, but compared to actual data engineers, that person is the least qualified.

The poor hiring becomes a self-fulfilling prophecy. Not knowing how to evaluate a data engineering candidate, the data science team chooses the wrong person, which leads to underperformance, and that underperformance teaches the team little about how to hire the right person. This cycle repeats itself and creates a strong bias in hiring.

Getting The Project Underway

The project gets underway. There are so many technologies to choose from. Too many things to be done and fixed. How should the data engineer start to make headway when they can’t even understand things?

Projects with the wrong people end up as questions on Reddit. They usually say something like, “I was just hired, and I don’t really know what to do. Here is what they’re asking for. Could you help me choose some technologies?” The responses are well-meaning but miss crucial information because the original post leaves it out. Some suggestions are flat-out wrong. This leaves the unqualified data engineer trying to implement something they couldn’t understand or vet in the first place (see the issues with using beginners in Chapter 10, “Starting a Team,” of Data Teams). This failure leaves the business and value creation in the same place as before, or worse.

Performing Surgery

Being the first data engineer to start working with data scientists’ code and architecture can be daunting. In addition, the data scientists could have created a mountain of technical debt.

Making any progress can require the most delicate surgery: fixing technical debt without breaking the entire system. From a personnel standpoint, it takes a qualified data engineer even to attempt the fix. More likely, it will require a whole team of data engineers to do the necessary fixes and rearchitecting. As a result, with the wrong person, you will find yourself worse off than before (see the self-fulfilling prophecy above).

Outnumbered and Outgunned

When data engineers are outnumbered, they’re often outvoted and outgunned. As a result, the issues, tasks, and challenges significant to data engineers aren’t seen as essential, or even understood, by the data scientists on the team.

The data scientists perceive a data engineer’s concerns as too expensive, as slowing down the data science, or as over-engineering. Without a more prominent voice on the team, the data engineers can be easily overlooked or shouted down. For example, data engineers will see the issues and poor design that led to the data scientists’ technical debt in the first place. The data scientists will veto the fixes or changes because they expect the changes to slow them down or see them as unnecessary.

In the worst-case scenarios, all of the data engineer’s ideas and changes are ignored while the data engineer is assigned the more menial tasks the data scientists don’t want to do. This creates a poor match on both sides of the equation.

What Do The Problems Look Like?

If you’re a data engineer on one of these teams, you already know what it looks like. Nothing is changing; you’re frustrated and looking for a new position.

For management, this looks like you’ve added a data engineer with the thought that it would fix a problem, and there’s no change. Instead, the status quo has been maintained. You’ve simply added a person without fixing the deeper organizational issues that got you there. We’ve helped many organizations in this situation, but there isn’t a one-size-fits-all fix. The initial steps start with management and organizational change, not with the individual contributors. We’d love to sort out the problems and give you clarity on the next steps. You can contact us here to set a time to talk.

 

 
