What Happens When You Hire a Data Scientist Without a Data Engineer

Blog Summary:
  • Data Scientists are often hired with the expectation that they will create models, but they may not have the necessary skills to create the data pipeline needed for those models.
  • The definition of a Data Scientist is highly variable, and their programming and distributed system skill level can range from beginner to advanced.
  • Beginner to intermediate programmers may struggle to create a data pipeline due to a lack of programming, distributed systems, and Big Data skills.
  • This can lead to Data Scientists being idle for 2-6 months, which can result in them quitting after about 6 months.
  • It is recommended to have a data pipeline in place before hiring a Data Scientist, which may require creating a data engineering team first.

Sometimes I’ll train at a company that’s creating a data engineering team. The team often includes a Data Scientist.

I always make a point of talking to the Data Scientist about their experience and interactions with the team before I arrived. These Data Scientists are recent hires, brought on within the last 6 months. A clear theme is that their time is under-utilized. They’ve been waiting 2-6 months for a Data Engineer to create the data pipeline for them.

The trouble is that the definition of Data Scientist is highly variable. For some, it means a person with math skills who also has some programming skills. Among Data Scientists, the level of programming and distributed systems skill varies enormously. They can range from people with a CS degree to beginner programmers.

The beginner to intermediate programmers will have the most difficulty creating the data pipeline. Building a pipeline is a complex endeavor that takes programming, distributed systems, and Big Data skills, and those are the skills they lack. Their math and statistical skills are not the problem.

This skills gap leads to mismatched expectations on both sides. The Data Scientist expected the data pipeline to already exist when they were hired. They’re used to creating models, not doing the hardcore data engineering that a pipeline requires. They’re consumers of the pipeline, not its creators. Meanwhile, the company and its managers expect the Data Scientist to create the data pipeline.

When I’ve encountered this issue, the Data Scientist has been idle for 2-6 months. After about 6 months, they quit. They haven’t done any of the cool stuff they thought they were signing on for. At small companies, this spells the end of the Big Data foray.

My suggestion is to make sure you have a data pipeline before hiring your first Data Scientist. This will require you to create a data engineering team before, or at the same time as, you create a data science team. At a minimum, you need to inventory your datasets and make them available before hiring a Data Scientist.

I talk more about the relationship between data science and data engineering teams in my Data Engineering Teams book. It walks you through the skills a data engineering team needs and why they’re so important.
