The Difficulty of Transitioning to Data Pipelines

There’s a common difficulty companies face when transitioning to Big Data, and especially to Kafka. They’re coming from systems where everything is exposed as an RPC-style call (a remote procedure call, a REST endpoint, etc.), and they’re moving to a data pipeline where everything is exposed as raw data.

For most of these companies, data pipelines are a brand-new concept. With RPCs, the coupling was much tighter: a team changing a call knew exactly which callers it had to coordinate with. With a data pipeline, the coupling is very loose: producers rarely know every consumer of their data, so changes to the data pipeline ripple through the organization in ways that are hard to predict.
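
To make the contrast concrete, here is a minimal sketch of the pipeline side using the confluent-kafka Python client (the broker address, topic name, and event shape are all hypothetical). The producer publishes raw data to a topic and never learns who consumes it; in the RPC world, it would have called each downstream service directly.

```python
import json

from confluent_kafka import Producer

# Loose coupling: the producer only knows the topic name. New consumers
# can subscribe later without the producer changing at all. With RPCs,
# adding a consumer meant adding another direct call here.
producer = Producer({"bootstrap.servers": "localhost:9092"})

event = {"order_id": 1234, "status": "shipped"}  # hypothetical event
producer.produce("orders", value=json.dumps(event).encode("utf-8"))
producer.flush()
```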

Here are the questions that teams and organizations need to answer when adopting a data pipeline:

Organizationally

  • How do we socialize that the data pipeline exists?
  • How do we get other members of the organization to start adopting the data pipeline?
  • How do we monetize the data or results from the analysis of the data?
  • Which team is directly responsible for the data pipeline? (Hint: this is the reason a data engineering team needs to exist)

Security

  • How do we lock down who has access to the data pipeline?
  • How do we encrypt the data as it’s being sent around the data pipeline?
  • How do we mask PII from consumers of the data pipeline that don’t need that information? (One approach is sketched after this list.)
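
One common answer to the PII question is to mask the sensitive fields at produce time, or in an intermediate stream processor, so that consumers of general-purpose topics never see the raw values. A minimal sketch, assuming dict-shaped events and a hypothetical set of PII fields:

```python
import hashlib
import hmac

# Hypothetical: a secret held by the producing team, and the set of
# fields this organization treats as PII.
MASKING_KEY = b"replace-with-a-managed-secret"
PII_FIELDS = {"email", "ssn", "phone"}

def mask_pii(event: dict) -> dict:
    """Replace PII values with a keyed hash (HMAC-SHA256). Consumers can
    still join and group on the field without seeing the raw value."""
    masked = dict(event)
    for field in PII_FIELDS:
        if masked.get(field) is not None:
            token = hmac.new(MASKING_KEY,
                             str(masked[field]).encode("utf-8"),
                             hashlib.sha256).hexdigest()
            masked[field] = token[:16]  # deterministic, opaque token
    return masked
```

Because the hash is keyed and deterministic, two events with the same email still mask to the same token, so joins and de-duplication keep working for consumers who don’t need the raw value.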

Technically

  • How do we make sure that teams have the skills to use the data pipelines?
  • How do we design the data pipeline to evolve as use cases increase? (See the schema evolution sketch after this list.)
  • What technologies make sense for our data pipeline given our use case?
  • How do we notify other teams when the data changes?
  • How do we decide when and how to change our data in the data pipeline?
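
For the evolution and change questions, the usual Kafka-ecosystem answer is a schema registry (e.g., Confluent Schema Registry) that rejects incompatible schema changes before they ever reach a topic. The sketch below illustrates just one of the underlying rules, with hypothetical Avro schemas: a field added with a default value is backward compatible, because consumers reading old records with the new schema fill in the default.

```python
# Version 1 of a hypothetical "Order" event schema (Avro, as a dict).
ORDER_V1 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "status", "type": "string"},
    ],
}

# Version 2 adds a field WITH a default. Old records lack "currency",
# so readers using v2 substitute "USD" and nothing downstream breaks.
ORDER_V2 = {
    "type": "record",
    "name": "Order",
    "fields": ORDER_V1["fields"] + [
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}

def added_fields_have_defaults(old: dict, new: dict) -> bool:
    """Naive backward-compatibility check: every field added in the new
    schema must carry a default. A real registry enforces the full rules."""
    old_names = {f["name"] for f in old["fields"]}
    return all("default" in f
               for f in new["fields"] if f["name"] not in old_names)

print(added_fields_have_defaults(ORDER_V1, ORDER_V2))  # True
```

Enforcing a rule like this at publish time turns “notify other teams when the data changes” from an email chain into a gate: incompatible changes can be blocked outright, and compatible ones are safe for existing consumers.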