- Data science teams often don't fully understand the critical nature of data engineering and may think adding a data engineer or two to their team will solve the problem.
- In Data Teams, Jesse recommends a ratio of 2-5 data engineers per data scientist.
- Hiring the wrong data engineers leads to underperformance, which in turn reinforces a bias in later hiring.
- Choosing the wrong technologies and not understanding the challenges in data engineering can lead to failure in projects.
- Fixing technical debt without breaking the entire system requires a qualified data engineer or a team of data engineers.
By Jesse Anderson and Mikio Braun
Organizations are gradually getting the message about the critical nature of data engineering. Data science teams are getting that message too. Sometimes, that message gets muddled, and data science teams think they just need to add a data engineer or two to their team. In their minds, this solves the problem, and they can go back to business as usual. We’d like to share our experiences when this happens and why this isn’t the right course of action.
The core issue here is that data science teams don’t fully buy into the notion that data engineering is critical to success. Instead, there is a “there I did it” or “there I fixed it” sort of mentality. Naturally, the actual data science work is foremost on their minds, and they often don’t know enough to fully understand the challenges in data engineering. In addition, they often perceive the amount of time data engineering takes, compared to the data science side of things, as a problem, again without fully understanding why it takes that long. In Data Teams, Jesse recommends a ratio of 2-5 data engineers per data scientist.
Many problems trace their way back to hiring. Put simply, data scientists often hire the wrong engineers, and it just gets worse from there.
Hiring the wrong people can have all sorts of root causes. For example, data scientists may not believe data engineering can help or is all that difficult. Or, they could completely misunderstand data engineering because they have worked with the wrong kind of data engineer, one who only perpetuates the flawed archetype of a data engineer. We’ve also seen data science teams change the title of their most data-engineering-savvy data scientist to data engineer. Usually, this puts the most competent data scientist on the job, but compared to actual data engineers, that person is the least qualified.
The poor hiring becomes a self-fulfilling prophecy. Not knowing how to evaluate a data engineering candidate, the data science team chooses the wrong person, which leads to underperformance. That underperformance teaches the team little about how to hire correctly the next time. The cycle repeats itself and creates a strong bias.
Getting The Project Underway
The project gets underway. There are so many technologies to choose from and too many things to be done and fixed. How should the data engineer start to make headway when they can’t even understand what is already there?
Projects with the wrong people end up as questions on Reddit. They usually say something like, “I was just hired, and I don’t really know what to do. Here is what they’re asking for. Could you help me choose some technologies?” The responses are well-meaning but miss crucial information because the original post leaves it out. Some suggestions are flat-out wrong. That leaves the unqualified data engineer trying to implement something they couldn’t understand or vet in the first place (see the issues with using beginners in Chapter 10, “Starting a Team,” of Data Teams). This failure leaves the business and its value creation in the same place as before, or worse.
Being the first data engineer to start working with data scientists’ code and architecture can be daunting. In addition, the data scientists could have created a mountain of technical debt.
Getting anywhere can require the most delicate surgery: fixing technical debt without breaking the entire system. From a personnel standpoint, it takes a qualified data engineer even to attempt the fix. More likely, it will require a whole team of data engineers to make the necessary fixes and rearchitecting. As a result, with the wrong person you will find yourself worse off than before (see the self-fulfilling prophecy above).
Outnumbered and Outgunned
When data engineers are outnumbered, they’re often outvoted and outgunned. As a result, the issues, tasks, and challenges significant to data engineers aren’t seen as essential, or even understood, by the data scientists on the team.
The data scientists perceive a data engineer’s issues as too expensive, as slowing down the data science, or as over-engineering. Without a more prominent voice on the team, the data engineers can be easily overlooked or shouted down. For example, data engineers will see the issues and poor design that led to the data scientists’ technical debt in the first place. The data scientists will veto the fixes or changes, either because the changes would slow them down or because they perceive them as unnecessary.
In the worst cases, all of the data engineer’s ideas and changes are ignored, and the data engineer is assigned the more menial tasks the data scientists don’t want to do. It creates a poor match on both sides of the equation.
What Do The Problems Look Like?
If you’re a data engineer on one of these teams, you already know what it looks like. Nothing is changing; you’re frustrated and looking for a new position.
For management, this looks like you’ve added a data engineer with the thought that it would fix a problem, and there’s no change. Instead, the status quo was maintained. You’ve simply added a person without fixing the deeper organizational issues that got you there. We’ve helped many organizations in this situation, but there isn’t a one-size-fits-all fix. The initial steps start with management and organizational change, not with the individual contributors. We’d love to sort out the problems and give you clarity on the next steps. You can contact us here to set a time to talk.