Some of the most interesting consultations are when I help data science teams that don’t think they need data engineering. I’ve compiled a list of some of the more common reasons why data science teams believe they don’t need data engineering and why those reasons might not be valid.
Data science teams need data engineering because, without it, the data scientists are often just getting by or severely underperforming. The results of working without a data engineering team leave much to be desired. Commonly, data scientists will create technical debt that data engineers later have to spend time fixing.
Lack of Understanding
For some data scientists, there is a total lack of understanding of what data engineers do. This gap comes from having only a cursory knowledge of programming and maybe some distributed systems. It leads to a “how hard could it be?” attitude that downplays the complexities that data engineers hide from data scientists.
To help data scientists understand the differences between a data engineer and a data scientist, I created some visualizations that show them clearly.
Repeatable Data Science
Some data science is repeatable. By that, I mean that automation is in place and consistent data products are being created and maintained. Other data science is ad hoc and not repeatable. In these scenarios, every project is started from scratch and, once it is done, completely discarded.
For ad hoc projects, there’s no significant engineering burden. The projects only live for hours, days, or weeks. There’s no real need for any long-term planning. I’d argue that organizations lose much of the value of data science when everything is so ephemeral.
When ad hoc organizations transition to long-term projects, they bear the brunt of their engineering mistakes. Until then, they’ve been able to escape the data engineering rigor that repeatable, consistently running projects demand. They find out the hard way that data engineering isn’t over-engineering; it’s making sure that the data products are maintainable. Creating repeatable data products requires data engineers.
There’s No Scale…Yet
Sometimes organizations start out with small or medium data and don’t have to deal with scale issues (count yourselves lucky). They’ve been able to get by with Excel, single processes, or waiting longer for results. The transition to big data and scale catches them by surprise.
The transition to big data technologies comes with a significant increase in complexity because they are distributed systems. At first, the data scientists think they can handle the growth. It quickly becomes apparent that they can’t deal with the increase in complexity and need data engineers.
Creating scalable data products requires data engineers.
What It Looks Like and What to Do
If your team is experiencing one of these problems, it will look like the data science team is stuck. They’ll spend a week on something that seems like it should take hours or a day. They’ll spend hours searching Google or Stack Overflow for answers, even though these sorts of solutions can’t be found there. The data scientists simply won’t have the technical background to recognize the issue. These sorts of problems fall right into the wheelhouse of data engineering.
Managers and data scientists will need to take an honest look at the team’s productivity and skills. More than likely, they will need data engineers and will have to establish a data engineering team. I cover how to start and resource a data engineering team in my Data Teams book.