The Difficulty of Transitioning to Data Pipelines

There’s a common difficulty companies run into when transitioning to Big Data, especially with Kafka. They’re coming from systems where everything is exposed as an RPC-esque call (a remote procedure call, REST call, and so on). They’re transitioning to a data pipeline where everything is exposed as raw data.

These data pipelines are a brand new concept. With RPCs, the coupling was much tighter: a team changing a call knew exactly who its callers were and could coordinate the change with them. With a data pipeline, the coupling is very loose: producers often don’t know who is consuming their data, so changes to the data pipeline ripple through the organization in unpredictable ways.
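
To make this concrete, here is a minimal sketch of the consumer side using the kafka-python client. The broker address, topic name, and field names are illustrative assumptions rather than details from any particular system:

```python
# A minimal consumer sketch using kafka-python (pip install kafka-python).
# The broker address, topic, and fields below are illustrative assumptions.
import json

from kafka import KafkaConsumer

# With an RPC, this team would call the producing service directly, and the
# two teams would coordinate any change to the call. Here, the team simply
# reads the raw events off the pipeline.
consumer = KafkaConsumer(
    "orders",                          # hypothetical topic of raw order events
    bootstrap_servers="localhost:9092",
    group_id="analytics-team",         # each team consumes independently
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    order = message.value
    # The producing team may not even know this consumer exists. That is
    # the loose coupling -- and why a renamed field can break a team the
    # producer has never heard of.
    print(order["order_id"], order["amount"])
```

Nothing registers this consumer with the producing team, which is exactly why changes ripple unpredictably: the producer cannot see who depends on each field.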

Here are questions that teams and organizations need to answer when using a data pipeline:

Organizationally

  • How do we socialize that the data pipeline exists?
  • How do we get other members of the organization to start adopting the data pipeline?
  • How do we monetize the data or results from the analysis of the data?
  • Which team is directly responsible for the data pipeline? (Hint: this is the reason a data engineering team needs to exist)

Security

  • How do we lock down who has access to the data pipeline?
  • How do we encrypt the data as it’s being sent around the data pipeline?
  • How do we mask PII from consumers of the data pipeline that don’t need that information? (A sketch follows this list.)
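
Locking down access and encrypting data in transit are typically handled at the broker level (Kafka ships with ACL and TLS support). For the PII question, one common pattern is to publish a masked copy of a sensitive topic and point most consumers at the copy. A rough sketch, again using kafka-python, with invented topic and field names:

```python
# A rough sketch of PII masking between topics; the topic and field names
# are invented for illustration.
import hashlib
import json

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "orders",                          # raw topic: access tightly restricted
    bootstrap_servers="localhost:9092",
    group_id="pii-masker",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    # Hash the customer identifier so downstream joins still work, and
    # drop fields that most consumers have no reason to see.
    event["customer_id"] = hashlib.sha256(event["customer_id"].encode()).hexdigest()
    event.pop("email", None)
    event.pop("credit_card", None)
    producer.send("orders-masked", event)   # broadly readable masked copy
```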

Technically

  • How do we make sure that teams have the skills to use the data pipelines?
  • How do we design the data pipeline to evolve as use cases increase?
  • What technologies make sense for our data pipeline given our use case?
  • How do we notify other teams when the data changes?
  • How do we decide when and how to change our data in the data pipeline? (See the sketch after this list.)
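
For the last two questions, one common convention is to version events and keep changes backward compatible: add optional fields with documented defaults rather than renaming or removing fields that consumers already depend on. A sketch with invented fields; in practice, a schema registry (such as Confluent Schema Registry) can enforce these compatibility rules and gives other teams one place to watch for changes:

```python
# A sketch of evolving event data without breaking existing consumers.
# The fields and version numbers are invented for illustration.

# v1 producers wrote: {"schema_version": 1, "order_id": "123", "amount": 42.0}
# v2 adds an optional field instead of renaming or removing one, so v1
# consumers keep working unchanged.
event_v2 = {
    "schema_version": 2,
    "order_id": "123",
    "amount": 42.0,
    "currency": "USD",  # new optional field with a documented default
}

def read_amount(event):
    # Consumers read defensively: a missing field falls back to the
    # documented default instead of crashing on older events.
    return event["amount"], event.get("currency", "USD")

print(read_amount(event_v2))
print(read_amount({"schema_version": 1, "order_id": "99", "amount": 7.5}))
```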
