“What we know is a drop, what we don’t know is an ocean.”
― Isaac Newton
Data engineering is one of the disciplines where what you know is just a drop. Some companies say it’s easy and that a drop is all you need to know. My experience in the field and in teaching tells me otherwise. A data engineer needs to learn many different technologies and possess in-depth knowledge of big data.
To help you sort it out, I want you to imagine the skills as a technology tree. You might have played the Civilization series at some point and maybe even spent way too much time on it (just one more turn). If you aren’t familiar with it, here is Civilization 6’s technology tree.
You’ll notice that you start with the most basic technologies in the world, such as pottery or animal husbandry. As you begin to research those technologies, you unlock more technologies. Each of these technologies takes a certain number of turns to research, and that number is based on the science your civilization produces.
If you didn’t know, you can try to skip researching technologies. Instead of gaining all of the foundational knowledge, the player can try to skip ahead. Skipping technologies causes all kinds of problems in-game, just like we’re about to see in our real-life example.
Let’s imagine data engineering as a technology tree. I think it all starts with a specialization in technology and branches out from there. These branches are systems, programming, and architecture. Looking at the diagram below, you can see various relationships.
At the very end of lots of research and time (turns), we become a data engineer. Ideally, all or the vast majority of the data engineer’s tree is green. Leveraging all of the skills we’ve acquired, we can start to create systems. Those systems will produce data projects.
I hold this technology tree out as a way to gauge your own or your team’s skills on the road to data engineering. Let’s go through a few examples of this tree.
Imagine that the team or individual comes from a DBA, Data Warehouse, or SQL-focused background. We can look at the diagram to see which skills (technologies) the team is missing.
We can see that the DBAs will have excellent SQL skills. The rest of the technology tree is missing. The software engineering skills are missing. There may be some understanding of the easier architecture skills such as data formats, but the rest of the advanced skills are missing. Using the technology tree, we can see that the skills acquisition will be extensive and time-consuming because the advanced skills are missing.
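One way to make this kind of gap analysis concrete is to treat the tree as a prerequisite graph and walk it from the goal backwards. The sketch below is only illustrative: the skill names and the prerequisite edges in `PREREQUISITES` are my assumptions, loosely based on the skills named in this post, and are not a transcription of the diagram. The `missing_skills` helper is likewise hypothetical.

```python
# Hypothetical prerequisite map: skill -> skills that must be learned first.
# The names and edges are illustrative assumptions, not the actual diagram.
PREREQUISITES = {
    "sql": [],
    "programming": [],
    "data_formats": [],
    "software_engineering": ["programming"],
    "multi_threading": ["software_engineering"],
    "coordination": ["multi_threading"],
    "distributed_algorithms": ["coordination"],
    "big_data_ecosystem": ["data_formats", "distributed_algorithms"],
    "data_engineer": ["sql", "big_data_ecosystem"],
}


def missing_skills(goal, known):
    """Return the skills still needed for `goal`, prerequisites first."""
    ordered = []
    seen = set()

    def visit(skill):
        if skill in seen:
            return
        seen.add(skill)
        # Walk the prerequisites before the skill itself so the result
        # lists foundational skills ahead of the ones that depend on them.
        for prerequisite in PREREQUISITES[skill]:
            visit(prerequisite)
        if skill not in known:
            ordered.append(skill)

    visit(goal)
    return ordered


# A SQL-focused team, assumed here to also know basic data formats.
dba_skills = {"sql", "data_formats"}
print(missing_skills("data_engineer", dba_skills))
```

Under these assumed edges, the SQL-focused team still has nearly the whole programming and architecture chain ahead of it, while a team that starts with `programming` and `software_engineering` as well gets a noticeably shorter list. The ordering matters as much as the length: each missing skill appears only after the skills it depends on, which is the point about not skipping ahead.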
The companies and individuals that try to skip ahead on the technology tree without filling it in will have all kinds of problems. The lack of software engineering skills forces all code and technologies to be written with SQL. The lack of architecture leads to incorrect or improper uses of technologies.
Let’s imagine that the team or individual comes from a software engineering background. Looking at the diagram, we can see that they have far more of the technology tree covered, but not the entire tree.
The software engineers will have excellent SQL and software engineering skills. The common missing parts of the tree are the multi-threading and coordination concepts that lead to big data. On the architecture side, they will be missing the big data technology ecosystem knowledge and distributed algorithms.
The companies and individuals that try to skip ahead on the technology tree without filling it in will still have problems. The lack of multi-threading skills that are foundational to big data causes misunderstandings in solutions. The absence of ecosystem knowledge leads to incorrect or improper uses of technologies. I’ve found that these teams get stuck trying to exhaustively go through each potential technology without truly understanding any of them.
Another common misconception is around data scientists and data engineers. Often, managers don’t understand the differences between data scientists and data engineers.
Data scientists will have some SQL and software engineering skills. However, these skills are on the beginner to intermediate level. They will be missing the big data technology ecosystem knowledge and likely distributed algorithms on the architecture side.
The companies and individuals that try to use data scientists as data engineers will have problems. The lack of multi-threading skills that are foundational to big data causes misunderstandings in solutions. The absence of ecosystem knowledge leads to incorrect or improper uses of technologies. I’ve found that these teams choose technologies by popularity rather than fit for a use case.
Technology Trees and You
When management is looking to create a new data engineering team or fix an existing one, they need to make sure the data engineers have the entire technology tree. When a team is under-performing, it is often missing some or all of the technology tree. I suggest you read Data Teams or Data Engineering Teams to understand how to start or fix the team.
As an individual, you need to make an honest assessment of yourself and where you are on the technology tree. My Ultimate Guide to Switching Careers to Big Data will help you understand the next steps to take.
The implications of technology trees affect both management and individuals. In either case, their technology tree’s completeness will dictate the success or failure of their projects or goals.