The Data Discovery Team

Jesse Anderson
November 14, 2023

Blog Summary: (AI Summaries by Summarizes)

The data discovery team plays a crucial role in searching for data in the IT landscape.
The data discovery team must make data discoverable to the operations team.
Collaboration between the data discovery team and the operations team is essential for creating data products.
Strong domain knowledge and technical skills are required for the data discovery team to operate effectively.
The data discovery team ensures that all data sources are searchable by the data science team.

A Guest Post by Ole Olesen-Bagneux

In this blog post I would like to describe a new data team that I call ‘the data discovery team’. It’s a team that connects naturally into the constellation of the three data teams described in Jesse Anderson’s book Data Teams (2020):

Operations team
Data engineering team
Data science team

Before I explain what the data discovery team should do, it is necessary to add a bit of context on the concept of data discovery itself.

Data discovery is thought of in different ways in data science and in information science, respectively.

First of all, in data science, data discovery means finding patterns in data using database query languages to test hypotheses. This kind of data discovery can be subdivided into several steps, as suggested, for example, by Piethein Strengholt in Data Management at Scale. But basically, data science thinks of data discovery as finding new insights inside large amounts of data.

However, in the less flashy discipline of information science, data discovery means something else. Here, data discovery is about searching for data: not getting insight from data, but simply trying to find data. In an enterprise data reality, searching for data is a bit of a hassle. The data is scattered across an enormous number of applications in a landscape that is to a large extent opaque.

And that is what the data discovery team that I propose in this blog post should work on: searching for data. Accordingly, if we position the data discovery team in relation to Anderson’s thinking, it would not be placed together with the data science team on the left side, but on the right side of the overall constellation of data teams, like this:

A diagram showing the overall constellation of data teams

The extra dotted lines simply indicate that the data discovery team works with more teams than just the data teams.

The mission of the data discovery team is twofold:

1) The data discovery team must discover the data in the IT landscape.

2) The data discovery team must make the data in the IT landscape discoverable to the operations team.

Let’s unpack this.

1) The data discovery team must work on discovering the IT landscape. That is done via a careful examination of all metadata repositories describing data sources. Once those repositories have been carefully studied, the identified data sources must be scanned by a data catalog, so that a metadata mirror of these data sources is made discoverable for the operations team.

2) At this point, the operations team can discover data, in the sense that they can search for data. Once that is achieved, the data discovery team can work together with the operations team to create data products. The data discovery team must have very, very strong domain knowledge and the skillset to create a platform of perfect data discovery. They will have some knowledge of the creation of data products, which is one of the key competences of the operations team, so they will collaborate on that.
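To make the idea of a metadata mirror a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the names `CatalogEntry`, `scan_source` and `search` are hypothetical and do not come from any particular data catalog product, and an in-memory SQLite database stands in for one application in the IT landscape. The point it tries to show is that the scan copies descriptions of the data (table and column names), never the data itself, and that those descriptions are what becomes searchable.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One row of the metadata mirror: describes a column, holds no data."""
    source: str
    table: str
    column: str
    col_type: str


def scan_source(source_name: str, conn: sqlite3.Connection) -> list[CatalogEntry]:
    """Hypothetical scanner: read table and column metadata from a SQLite source."""
    entries = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, column, col_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            entries.append(CatalogEntry(source_name, table, column, col_type))
    return entries


def search(catalog: list[CatalogEntry], term: str) -> list[CatalogEntry]:
    """Case-insensitive match on table or column names."""
    needle = term.lower()
    return [
        e for e in catalog
        if needle in e.table.lower() or needle in e.column.lower()
    ]


# Stand-in for one application database (assumed names, for illustration only).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, email TEXT, country TEXT)")
crm.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")

catalog = scan_source("crm", crm)
hits = search(catalog, "customer")
for hit in hits:
    print(f"{hit.source}.{hit.table}.{hit.column} ({hit.col_type})")
```

A real data catalog would scan many kinds of sources and capture far richer metadata (owners, descriptions, classifications). The sketch only shows the basic shape: scan metadata once, then let other teams search the mirror instead of the applications.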

To understand the role the data discovery team plays for the data teams, and to illustrate the points above, we can look at the team triangle, which describes what happens to a big data project if one of the teams is missing.

The data discovery team plays a role for the operations team and for the data science team:

A diagram showing how a data discovery team plays a role for the data teams

Thanks to the data discovery team, the entire IT landscape, meaning all data sources, can be searched by the data science team. Once a data source has been found, the data products can be created jointly by the operations team and the data discovery team, which holds deep domain knowledge but can lack the technical knowledge of how to actually create the data product.

The most complete overview of data is created if the data discovery team follows the usage of the data products by documenting both the data pipelines created by the data engineers and the data science team’s usage of the data products delivered to them by the operations team.

This creates a metadata mirror, representing all the activity of the big data project conducted by the three data teams:

A diagram representing all the activity of the big data project conducted by the three data teams
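As a rough sketch of what documenting pipelines and usage could look like, the Python below records which sources feed which data product and which team uses it. Everything here is an assumption made for illustration: the `ActivityMirror` class, its method names, and the source and product names are hypothetical, not part of the original proposal or of any specific tool.

```python
from collections import defaultdict


class ActivityMirror:
    """Records who produces and who uses each data product (metadata only)."""

    def __init__(self) -> None:
        self.inputs = defaultdict(set)     # data product -> upstream sources
        self.consumers = defaultdict(set)  # data product -> consuming teams

    def record_pipeline(self, source: str, product: str) -> None:
        """Document a pipeline built by the data engineers."""
        self.inputs[product].add(source)

    def record_usage(self, product: str, team: str) -> None:
        """Document that a team uses a data product."""
        self.consumers[product].add(team)

    def sources_used_by(self, team: str) -> set[str]:
        """Every source that feeds a product this team uses."""
        used = set()
        for product, teams in self.consumers.items():
            if team in teams:
                used |= self.inputs.get(product, set())
        return used

    def unused_products(self) -> set[str]:
        """Products that have a pipeline but no documented consumer."""
        return {p for p in self.inputs if not self.consumers.get(p)}


# Hypothetical names, for illustration only.
mirror = ActivityMirror()
mirror.record_pipeline("crm.customers", "customer_360")
mirror.record_pipeline("erp.invoices", "customer_360")
mirror.record_pipeline("web.clickstream", "session_summary")
mirror.record_usage("customer_360", "data science")

print(sorted(mirror.sources_used_by("data science")))
print(sorted(mirror.unused_products()))
```

Even in this toy form, the mirror can answer two questions that are hard to answer from the applications alone: which sources a team actually depends on, and which data products nobody is documented as using.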

Finally, if the data discovery team is missing, then the data scientists’ work is not prioritized correctly. This is because they work not with the best data that exists, but with the best data that they happen to come across, by carrying out an unstructured, haphazard process of initial data discovery (the kind defined in information science).

A diagram showing how the absence of a data discovery team affects the data teams

That can lead to inefficient big data projects, because the operations team makes data products available to the best of their knowledge, without being supported by a team that is dedicated to making data discoverable, which is the data discovery team.

What’s next?

A data discovery team would be able to serve a company in many ways, in terms of compliance, operations, and analytics.

Specifically for big data, a data discovery team can ensure that the most relevant data is made discoverable to the data science team. Once this data is discovered, the operations team and the data discovery team can jointly transform these data sources into data products.

Consider that, for the data science team in your company to deliver maximum value, it should work with the best data available. If you want to make that possible, your company needs to create a data discovery team, so that all data sources are perfectly searchable.

Frequently Asked Questions (AI FAQ by Summarizes)

What role does the data discovery team play in the IT landscape?

The data discovery team plays a crucial role in searching for data in the IT landscape.

Why is collaboration between the data discovery team and operations team essential?

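Collaboration between the data discovery team and the operations team is essential for creating data products: the data discovery team brings deep domain knowledge, while creating data products is a key competence of the operations team.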

What skills are required for effective data discovery team operations?

Strong domain knowledge and technical skills are required for effective data discovery team operations.

Why is documentation of data pipelines crucial for efficient data discovery?

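Documenting the data pipelines created by the data engineers, together with the data science team’s usage of the data products, creates a metadata mirror of all the activity in a big data project and gives the most complete overview of the data.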

What can result from the absence of a dedicated data discovery team?

Inefficient big data projects can result from the absence of a dedicated data discovery team.

How can a data discovery team ensure the best data is available for the data science team?

By making all data sources searchable, a data discovery team lets the data science team work with the best data that exists, rather than the best data they happen to come across.

Why is establishing a data discovery team essential for maximizing the value delivered by the data science team?

The data science team delivers maximum value only when it works with the best data available, and that requires a data discovery team that makes all data sources searchable.
