- The concept of a "data discovery team" is introduced, which focuses on searching for data in an enterprise data reality.
- Data discovery in data science refers to finding patterns in data using database query languages, while in information science it refers to searching for data.
- The data discovery team's mission is twofold: discovering the IT landscape and making the data discoverable to the operations team.
- The team examines metadata repositories and scans data sources to create a metadata mirror for the operations team.
- The data discovery team collaborates with the operations team to create data products.
A Guest Post by Ole Olesen-Bagneux
In this blog post I would like to describe a new data team that I call ‘the data discovery team’. It is a team that fits naturally into the constellation of the three data teams:
- Operations team
- Data engineering team
- Data Science team
as described in Jesse Anderson’s book Data Teams (2020).
Before I explain what the data discovery team should do, it is necessary to add a bit of context on the concept of data discovery itself.
Data discovery is understood differently in data science and in information science.
First of all, in data science, data discovery means finding patterns in data using database query languages to test hypotheses. This kind of data discovery can be subdivided into several steps, as e.g. suggested by Piethein Strengholt in Data Management at Scale. But basically, data science thinks of data discovery as finding new insights inside large amounts of data.
However, in the less flashy discipline of information science, data discovery means something else. Here, data discovery is about searching for data: not getting insight from data, but simply trying to find data. In an enterprise data reality, searching for data is a bit of a hassle. The data is scattered across an enormous number of applications, in a landscape that is to a large extent opaque.
And that is what the data discovery team that I propose in this blog post should work on: searching for data. Accordingly, if we position the data discovery team in relation to Anderson’s thinking, it would not be placed together with the data science team on the left side, but on the right side of the overall constellation of data teams, like this:
The extra dotted lines simply indicate that the data discovery team works with more teams than the data teams.
The mission of the data discovery team is twofold:
1) The data discovery team must discover the data in the IT landscape
2) The data discovery team must make the data in the IT landscape discoverable to the operations team
Let’s unpack this.
1) The data discovery team must work on discovering the IT landscape. That is done via a careful examination of all metadata repositories describing data sources. Once those repositories have been carefully studied, the identified data sources must be scanned by a data catalog, so that a metadata mirror of these data sources is made discoverable for the operations team.
2) At this point, the operations team can discover data, in the sense that they can search for data. Once that is achieved, the data discovery team can work together with the operations team to create data products. The data discovery team must have very strong domain knowledge and the skillset to create a platform for data discovery. They will have some knowledge of how data products are created, which is one of the key competences of the operations team, so the two teams will collaborate on that.
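To make the idea of a metadata mirror more concrete, here is a minimal sketch in Python. It is not how any particular data catalog product works; it only illustrates the principle of scanning sources for their metadata (table names, column names, declared types) without copying the data itself, and then searching that mirror. The two in-memory SQLite databases, the names `ColumnEntry`, `scan_source` and `search`, and the example tables are all assumptions made up for illustration.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnEntry:
    """One row of the metadata mirror: where a column lives and its declared type."""
    source: str
    table: str
    column: str
    declared_type: str


def scan_source(source_name: str, conn: sqlite3.Connection) -> list[ColumnEntry]:
    """Collect table and column metadata from one source, without reading any rows."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    entries = []
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        for _cid, column, declared_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            entries.append(ColumnEntry(source_name, table, column, declared_type))
    return entries


def search(mirror: list[ColumnEntry], term: str) -> list[ColumnEntry]:
    """Case-insensitive substring search over table and column names."""
    needle = term.lower()
    return [
        e for e in mirror
        if needle in e.table.lower() or needle in e.column.lower()
    ]


# Two hypothetical sources standing in for applications in the IT landscape.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, country TEXT)")

billing = sqlite3.connect(":memory:")
billing.execute(
    "CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)

# The mirror holds only metadata; the operations team searches it, not the sources.
mirror = scan_source("crm", crm) + scan_source("billing", billing)
hits = search(mirror, "customer")
for hit in hits:
    print(f"{hit.source}.{hit.table}.{hit.column} ({hit.declared_type})")
```

A search for "customer" here finds both the `customers` table in the hypothetical CRM source and the `customer_id` column in the billing source, which is exactly the kind of cross-application lookup that is hard to do when the landscape is opaque.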
To really understand the role the data discovery team plays for the data teams, and to illustrate the above points, we can look at the team triangle that describes what happens to a big data project if one of the teams is missing.
The data discovery team plays a role for the operations team and for the data science team:
Thanks to the data discovery team, the data science team can search the entire IT landscape, that is, all data sources. Once a data source has been found, the operations team and the data discovery team can create the data products jointly: the data discovery team holds deep domain knowledge but can lack the technical knowledge of how to actually create the data product.
The most complete overview of data is created if the data discovery team follows the usage of the data products, documenting both the data pipelines created by the data engineers and the data science team’s usage of the data products delivered to them by the operations team.
This creates a metadata mirror, representing all the activity of the big data project conducted by the three data teams:
Finally, if the data discovery team is missing, then the data scientists’ work is not prioritized correctly. This is because they work not with the best data that exists, but with the best data that they happen to come across, through an unstructured, haphazard process of initial data discovery in the information science sense.
That can lead to inefficient big data projects, because the operations team makes data products available to the best of their knowledge, without being supported by a team that is dedicated to making data discoverable, namely the data discovery team.
A data discovery team would be able to serve a company in many ways, in terms of compliance, operations, and analytics.
Specifically for big data, a data discovery team can ensure that the most relevant data is made discoverable to the data science team. Once this data is discovered, the operations team and the data discovery team can jointly transform these data sources into data products.
For the data science team in your company to deliver maximum value, it should work with the best data available. If you want to make that possible, your company needs to create a data discovery team, so that all data sources are searchable.