How to measure the quality of a data lineage?

Measuring the quality of data is a problem more-or-less cracked now, isn’t it? Policy/DQ Standard -> DQ Thresholds -> DQ Requirements/Rules -> Control Points -> DQ Measurements/Metrics -> Dashboard/Alerts -> RCA -> Remediation. Something like that, if not necessarily in that order.

What about the data used to document data lineage? Assuming that someone is going to actually use our carefully documented data lineage, then presumably those lineage users will have some quality requirements. Can I tell them how complete the data lineage is? Can I tell them how up-to-date it is, or how accurate they can expect it to be?

Coming from a data quality background, my inclination is to define a quality standard for lineage, and then measure it. And of course ask my lineage users, whoever they are, what their tolerances are.

If we agree, then how would we go about making appropriate measurements? Let’s say we want to know how complete the data lineage is…

First, we might want to be able to aggregate and dissect the completeness measurements by:

  • Data Owner (i.e. who is accountable for the lineage metadata)
  • Data Steward/Custodian (i.e. who is responsible for day-to-day management of the metadata)
  • Quality Control Date (allowing us to compare the quality yesterday with quality today)
  • Business Unit (of the data steward – useful in a large organisation)
  • Region (of the data steward)

This tells me that all lineage metadata should have these associations documented.
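As a minimal sketch of what that looks like in practice, each lineage assertion could carry those dimensions as attributes, so completeness can be sliced by any one of them. All record and field names below are illustrative, not taken from any particular lineage tool:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class LineageRecord:
    element: str          # the critical data element being documented
    data_owner: str       # accountable for the lineage metadata
    data_steward: str     # responsible for day-to-day management
    business_unit: str    # of the data steward
    region: str           # of the data steward
    qc_date: str          # quality control date (ISO format)
    is_complete: bool     # did this record pass all completeness rules?

def completeness_by(records, dimension):
    """Percentage of complete lineage records per value of one dimension."""
    totals, passed = defaultdict(int), defaultdict(int)
    for r in records:
        key = getattr(r, dimension)
        totals[key] += 1
        passed[key] += r.is_complete  # bool counts as 0 or 1
    return {k: 100.0 * passed[k] / totals[k] for k in totals}

records = [
    LineageRecord("CustomerID", "A. Owner", "S. One", "Retail", "EMEA", "2024-05-01", True),
    LineageRecord("TradeDate", "A. Owner", "S. Two", "Markets", "APAC", "2024-05-01", False),
]
print(completeness_by(records, "data_owner"))  # {'A. Owner': 50.0}
print(completeness_by(records, "region"))      # {'EMEA': 100.0, 'APAC': 0.0}
```

The same `completeness_by` call works for any of the five dimensions, and comparing results across `qc_date` values gives the yesterday-versus-today view mentioned above.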

Next, what are the appropriate quality rules? Well, we can derive these directly from the underlying data lineage meta-model (you should have one of these). Here is a data lineage meta-model:

[Figure: Lineage Quality Metamodel]

In other words:

  • Every critical data element should be output from at least one business process (i.e. there must be at least one process that creates it, and there might be more that update it)
  • Every critical data element should be an input to at least one business process (otherwise it could not be considered ‘critical’)
  • Every (physicalized) critical data element must be stored in at least one data-store
  • Every data-store must be generated by a business system (which could be core, end-user app, etc.)

This is a simplistic and hypothetical model, so let’s not discuss the actual model here – it is only to illustrate the quality approach. At least now we have some quality rules, and we can start making some measurements.
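The four rules above can be expressed as executable checks against a lineage repository. This is a hedged sketch only: the dictionaries stand in for whatever repository you actually hold lineage in, and every name is hypothetical:

```python
def lineage_rule_failures(elements, processes, stores, systems):
    """Return (item, failed-rule) pairs for the four meta-model rules."""
    failures = []
    for e in elements:
        # Rule 1: at least one business process outputs (creates) the element
        if not any(e in p["outputs"] for p in processes):
            failures.append((e, "no producing process"))
        # Rule 2: at least one business process consumes the element
        if not any(e in p["inputs"] for p in processes):
            failures.append((e, "no consuming process"))
        # Rule 3: the (physicalised) element is stored in at least one data-store
        if not any(e in s["elements"] for s in stores):
            failures.append((e, "not stored in any data-store"))
    # Rule 4: every data-store is generated by a known business system
    for s in stores:
        if s["system"] not in systems:
            failures.append((s["name"], "no generating business system"))
    return failures

elements = ["CustomerID", "TradeDate"]
processes = [{"name": "Onboarding", "inputs": ["TradeDate"], "outputs": ["CustomerID"]}]
stores = [{"name": "CRM_DB", "elements": ["CustomerID"], "system": "CRM"}]
systems = ["CRM"]
# Three failures here: CustomerID has no consumer; TradeDate has no
# producer and is not stored anywhere.
print(lineage_rule_failures(elements, processes, stores, systems))
```

Each failure is a concrete gap in the documented lineage, which is exactly what feeds the completeness metric.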

But there is one other angle we need to consider. Just because we have discovered one or two processes consuming a particular data element does not necessarily mean we have identified all of them. So we need to make sure we have looked in all the right places, and asked all the right people, before we can assert that we have identified all the consuming processes. Once we have done that, and we have demonstrated compliance with the meta-model, then we can say with some confidence that this section of the lineage is xx% complete. (Data quality is a workflow problem as well as a measurement problem.)
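One way to combine the two conditions is to count an element towards the completeness figure only when it both passes the meta-model rules and has had its discovery signed off. The `discovery_signed_off` flag is an assumed workflow attribute, not a standard field:

```python
def section_completeness(elements):
    """Percentage of elements that pass the rules AND are signed off.

    elements: list of dicts with 'rules_pass' and 'discovery_signed_off' flags.
    """
    if not elements:
        return 0.0
    complete = sum(
        1 for e in elements if e["rules_pass"] and e["discovery_signed_off"]
    )
    return 100.0 * complete / len(elements)

section = [
    {"element": "CustomerID", "rules_pass": True, "discovery_signed_off": True},
    {"element": "TradeDate", "rules_pass": True, "discovery_signed_off": False},
    {"element": "Balance", "rules_pass": False, "discovery_signed_off": True},
]
print(f"{section_completeness(section):.0f}% complete")  # 33% complete
```

Requiring both flags is the point: `TradeDate` passes the rules but has not been fully researched, so it does not count as complete yet.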

This approach can be extended to accommodate accuracy and timeliness tests for lineage metadata, and other sources of quality requirements.

Now, here is the challenge to you – do you know what the quality of your data lineage is? And if no-one is asking this in your organisation, how long will it be until someone does?
