- Companies transitioning to Big Data, especially Kafka, face difficulty in moving from RPC-esque calls to a data pipeline where everything is exposed as raw data.
- Data pipelines are a new concept and are loosely coupled, unlike RPCs, which are tightly coupled.
- Organizations need to answer questions related to socializing the data pipeline, getting other members to adopt it, monetizing data, and identifying the team responsible for it.
- Security concerns include locking down access to the data pipeline, encrypting data, and masking PII from consumers who don't need it.
- Technical considerations include ensuring teams have the skills to use the data pipeline, designing it to evolve with use cases, selecting appropriate technologies, notifying other teams of data changes, and deciding when and how to change data in the pipeline.
Companies transitioning to Big Data, and to Kafka in particular, commonly run into the same difficulty. They’re coming from systems where everything is exposed as an RPC-esque call (remote procedure call, REST call, etc.). They’re moving to a data pipeline where everything is exposed as raw data.
These data pipelines are a brand new concept. With RPCs, the coupling was much tighter. Teams could change an RPC when they needed to change the call. With a data pipeline, the coupling is very loose. Changes to the data pipeline will ripple through the organization in different ways.
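To make the ripple effect concrete, here is a minimal sketch of a producer renaming a field while a downstream consumer still expects the old shape. Everything in it is hypothetical: the event fields (`amount_cents`, `total_cents`), the `billing_consumer` function, and the `find_breakage` helper are assumptions for illustration, and plain JSON strings stand in for records on a real pipeline such as Kafka.

```python
import json


def produce_v1():
    # Hypothetical event as an order service might first publish it.
    return json.dumps({"order_id": 17, "amount_cents": 1250})


def produce_v2():
    # The producer later renames a field without knowing who reads it.
    return json.dumps({"order_id": 17, "total_cents": 1250})


def billing_consumer(raw):
    # A downstream consumer written against the v1 shape.
    event = json.loads(raw)
    return event["amount_cents"] / 100


def find_breakage(raw, consumers):
    # Return the names of consumers that fail on this record.
    broken = []
    for name, consumer in consumers.items():
        try:
            consumer(raw)
        except KeyError:
            broken.append(name)
    return broken


consumers = {"billing": billing_consumer}
print(find_breakage(produce_v1(), consumers))  # []
print(find_breakage(produce_v2(), consumers))  # ['billing']
```

With an RPC, the same rename would typically surface as a failed call against a known endpoint. Here the producer gets no signal at all; the failure only shows up inside the consumer.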
Here are the questions that teams and organizations need to answer when using a data pipeline:
- How do we socialize that the data pipeline exists?
- How do we get other members of the organization to start adopting the data pipeline?
- How do we monetize the data or results from the analysis of the data?
- Which team is directly responsible for the data pipeline? (Hint: this is the reason a data engineering team needs to exist)
- How do we lock down who has access to the data pipeline?
- How do we encrypt the data as it’s being sent around the data pipeline?
- How do we mask PII from consumers of the data pipeline that don’t need that information?
- How do we make sure that teams have the skills to use the data pipelines?
- How do we design the data pipeline to evolve as use cases increase?
- What technologies make sense for our data pipeline given our use case?
- How do we notify other teams when the data changes?
- How do we decide when and how to change our data in the data pipeline?
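As one illustration of the PII question above, here is a minimal sketch of masking fields before a record reaches a consumer that doesn't need them. The field names in `PII_FIELDS`, the `mask_record` function, and the `allowed_pii` parameter are all assumptions made for this example, not part of any particular library. A real pipeline would also need to decide where in the flow the masking happens and who maintains the list of PII fields.

```python
# Hypothetical set of fields treated as PII; each organization defines its own.
PII_FIELDS = {"email", "phone"}

REDACTED = "[REDACTED]"


def mask_record(record, allowed_pii=frozenset()):
    """Return a copy of the record with PII fields redacted.

    Fields named in allowed_pii are passed through unchanged, for
    consumers that have a legitimate need for them.
    """
    masked = {}
    for key, value in record.items():
        if key in PII_FIELDS and key not in allowed_pii:
            masked[key] = REDACTED
        else:
            masked[key] = value
    return masked


record = {"user_id": 42, "email": "pat@example.com", "phone": "555-0100"}

# An analytics consumer that needs no PII sees both fields redacted.
print(mask_record(record))

# A support consumer that is allowed to see email gets it unmasked.
print(mask_record(record, allowed_pii={"email"}))
```

Redaction is the simplest option. If consumers need to join on a masked field, a keyed hash or tokenization scheme would be an alternative, at the cost of managing the key.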