You’re Probably Not a Distributed Systems Engineer

Jesse Anderson
October 18, 2017

As I’ve worked with software teams, I’ve found some interesting views on distributed systems. Some teams think they’re creators of distributed systems. They usually aren’t.

I think there are three main groups of teams that interact with distributed systems: users of end data products, users of existing distributed system frameworks, and creators of distributed systems frameworks.

These nuances make a big difference in how a team interacts with distributed systems. For example, a team that uses end data products will fail if they try to create their own distributed system. This is one of the more common ways I’ve seen teams fail with Big Data.

Users of End Data Products

Users of end data products are the people who work with already created data pipelines and data products. These teams may be DBAs/SQL-focused or a software engineering team. The difficult parts of creating the distributed system are done for them. They’re given the data in an already usable form.

Users of Existing Distributed System Frameworks

Users of existing distributed systems frameworks are the people who use open source or other distributed systems to create data pipelines and data products. They’re using existing technologies like Apache Spark, Apache Hadoop, and Apache Kafka.

Creators of Distributed System Frameworks

Creators of distributed system frameworks are the people who create new distributed systems or improve existing distributed systems frameworks. They’re creating everything themselves. This includes writing schedulers, resource managers, and harnesses.

Confused Teams

Sometimes teams get confused about their core competencies. An end data product team will think they’re users of distributed system frameworks. A team that uses existing distributed systems frameworks will think they can create their own distributed system. Both scenarios lead to failure.

I’ve written about the increase in complexity when using Big Data. An end data product team will experience a 10x increase in complexity when trying to use a Big Data framework. For most teams, this will lead to failure. They’ll need more guidance and mentoring to get through their Big Data journey.

That leads me to a somewhat common issue: teams that think they can create their own distributed system. All sorts of failure are wrapped up in creating your own distributed system. This mostly stems from the fact that you’re probably not a distributed systems engineer. There are very few people with the computer science, system design, and operational understanding to create a distributed system from scratch.

Creating your own distributed system may sound like a good idea initially: “We’ll write our own that does exactly what we want.” Except:

You will have to spend the time to write it
Debugging and testing a distributed system is tough
There are so many unknown unknowns that only time and usage reveal
The operations team won’t be able to leverage existing knowledge
Any operational issue will be escalated to the development team
The development team will spend their time debugging their distributed system instead of creating new features

Do yourself and your team a favor. Take an honest look at your abilities before going down one of these routes. This will save you all kinds of time, money, and heartache. Using the wrong team for the job is always a bad idea.
