The Difficulty of Transitioning to Data Pipelines

Jesse Anderson
June 7, 2017


Companies commonly run into the same difficulty when transitioning to Big Data, especially Kafka. They’re coming from systems where everything is exposed as an RPC-esque call (a remote procedure call, a REST call, etc.). They’re transitioning to a data pipeline where everything is exposed as raw data.

For these companies, data pipelines are a brand new concept. With RPCs, there was much tighter coupling: a team that needed to change a call could change it together with its known callers. With a data pipeline, the coupling is very loose. The team producing the data may not know everyone who consumes it, so changes to the data pipeline will ripple through the organization in different ways.

Here are questions that teams and organizations need to answer when using a data pipeline:

Organizationally

How do we socialize that the data pipeline exists?
How do we get other members of the organization to start adopting the data pipeline?
How do we monetize the data or results from the analysis of the data?
Which team is directly responsible for the data pipeline? (Hint: this is the reason a data engineering team needs to exist)

Security

How do we lock down who has access to the data pipeline?
How do we encrypt the data as it’s being sent around the data pipeline?
How do we mask PII from consumers of the data pipeline that don’t need that information?
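As one possible starting point for the masking question, here is a minimal sketch in Python. It assumes records are plain dictionaries. The field names (`email`, `ssn`), the `PII_FIELDS` set, the salt, and the `mask_record` helper are all hypothetical and are not part of Kafka or any other library. The idea is that the team owning the pipeline could publish a masked view of the data for consumers that don’t need PII.

```python
import hashlib

# Hypothetical set of field names treated as PII; a real pipeline would
# derive this from its own schema or data catalog.
PII_FIELDS = {"email", "ssn"}


def mask_record(record, pii_fields=PII_FIELDS, salt="example-salt"):
    """Return a copy of record with PII fields replaced by a salted hash.

    Hashing (rather than dropping) keeps the field usable as a join key
    without exposing the raw value. The salt here is only a placeholder.
    """
    masked = {}
    for key, value in record.items():
        if key in pii_fields and value is not None:
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            masked[key] = digest[:16]  # shortened hash stands in for the value
        else:
            masked[key] = value  # non-PII fields pass through unchanged
    return masked


raw = {"user_id": 42, "email": "a@example.com", "ssn": "123-45-6789", "plan": "pro"}
masked = mask_record(raw)
```

Note that hashing low-entropy values offers limited protection on its own, so whether to hash, tokenize, or drop a field entirely is a decision for whoever owns the pipeline’s security.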

Technically

How do we make sure that teams have the skills to use the data pipelines?
How do we design the data pipeline to evolve as use cases increase?
What technologies make sense for our data pipeline given our use case?
How do we notify other teams when the data changes?
How do we decide when and how to change our data in the data pipeline?
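To make the last two questions more concrete, here is a small sketch in Python of checking a proposed data change before it ships. It is similar in spirit to the compatibility checks that schema registries perform, but it is not any registry’s actual API. The schema format (a dictionary of field name to type and optional default) and the `breaking_changes` helper are assumptions made for illustration.

```python
def breaking_changes(old_schema, new_schema):
    """List changes likely to cause trouble while old and new records coexist.

    Schemas are hypothetical dicts: field name -> {"type": str, "default": ...}.
    A field without a "default" key is treated as required.
    """
    problems = []
    # Removed or retyped fields can break consumers that still expect them.
    for name, spec in old_schema.items():
        if name not in new_schema:
            problems.append(f"removed field: {name}")
        elif new_schema[name]["type"] != spec["type"]:
            problems.append(
                f"type change on {name}: {spec['type']} -> {new_schema[name]['type']}"
            )
    # A new required field has no value in records written before the change.
    for name, spec in new_schema.items():
        if name not in old_schema and "default" not in spec:
            problems.append(f"new required field without default: {name}")
    return problems


v1 = {"user_id": {"type": "long"}, "email": {"type": "string"}}
v2 = {
    "user_id": {"type": "string"},                  # type changed
    "country": {"type": "string"},                  # new, no default
    "plan": {"type": "string", "default": "free"},  # new, with default
}
problems = breaking_changes(v1, v2)
```

A non-empty `problems` list could be the trigger for notifying downstream teams before the change goes out, rather than after something breaks.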
