What are the questions that a data lineage repository should be able to answer?
Asking ourselves this one question starts us thinking about how data lineage should be structured and designed.
There are other questions to ask of data lineage, but let’s concentrate on this first one. Here are the top ten questions that I believe a lineage repository should be able to answer without too much difficulty (in no particular order):
- What systems store a given data element, and which is the nominated golden source for the data element (if any)?
- What other data elements have been used to derive a given data element, and what is the derivation logic?
- What are the processes that create, update, and delete a given data element, and who operates and owns these processes?
- What processes use a given data element?
- What is the agreed business definition for a given data element, and where does it fit in the organisation’s business data model?
- What is the quality of a given data element in a particular storage area?
- What are the risks to data quality along the lineage of a given data element?
- What known issues relate to a given data element at any point in its lineage?
- What controls are in place along the data lineage to protect the quality of a given data element?
- How complete is the data lineage repository for a particular organisation unit?
With modern data tools, such as graph databases, questions like these can be articulated in a query language, and the lineage repository should be able to answer them quickly and with little effort. This holds only if the lineage repository has been suitably structured and populated, of course. Some lineage tools (such as my lineage tool Architector) provide some of these capabilities out of the box without the need for queries, with ad-hoc queries available as well.
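To give a feel for what "articulated in a query language" might look like, here is a minimal sketch in plain Python that stands in for a graph database. Everything in it is an assumption made for illustration: the element, system and process names, the relationship labels (`STORES`, `GOLDEN_SOURCE_FOR`, `DERIVES`, `USED_BY`) and the helper functions are hypothetical, and a real repository would express the same traversals in its own graph query language.

```python
from collections import deque

# Hypothetical lineage graph held as (source, relationship, target) triples.
# All names below are illustrative, not taken from any real repository.
EDGES = [
    ("crm_db", "STORES", "customer_risk_score"),
    ("warehouse", "STORES", "customer_risk_score"),
    ("warehouse", "GOLDEN_SOURCE_FOR", "customer_risk_score"),
    ("credit_rating", "DERIVES", "customer_risk_score"),
    ("payment_history", "DERIVES", "customer_risk_score"),
    ("transactions", "DERIVES", "payment_history"),
    ("customer_risk_score", "USED_BY", "loan_approval"),
]


def sources(target, relationship):
    """Nodes that point at `target` through the given relationship."""
    return sorted(s for s, r, t in EDGES if r == relationship and t == target)


def targets(source, relationship):
    """Nodes that `source` points at through the given relationship."""
    return sorted(t for s, r, t in EDGES if r == relationship and s == source)


def upstream_elements(element):
    """Breadth-first walk over DERIVES edges to collect every data element
    that contributes, directly or indirectly, to `element`."""
    seen = set()
    queue = deque([element])
    while queue:
        current = queue.popleft()
        for parent in sources(current, "DERIVES"):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)


if __name__ == "__main__":
    element = "customer_risk_score"
    print("Stored in:", sources(element, "STORES"))
    print("Golden source:", sources(element, "GOLDEN_SOURCE_FOR"))
    print("Derived from:", upstream_elements(element))
    print("Used by:", targets(element, "USED_BY"))
```

The point of the sketch is that once lineage is held as nodes and typed relationships, several of the questions above reduce to a filter on one relationship type or a walk along it. Questions about quality, risks, issues and controls would need further node and relationship types that are not modelled here.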
So, to the challenge – can you think of any other questions that a data lineage repository should be able to answer?