Data Edits, Data Quality Controls, and a Performance Question

January 28, 2011 by David Loshin
Filed under: Data Quality 

In a number of places I have used the term “data quality control” in the context of a service that inspects a data instance (at different levels of granularity, ranging from a single data value to an entire table or file) for compliance with a defined data quality rule. The objective of the control is to ensure that the consuming process's expectations are validated before the data is exchanged.
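To make that concrete, here is a minimal sketch in Python of what such a control service might look like. The rule, the ControlResult type, and the column names are all hypothetical; the point is that the control inspects a table-level instance against one rule and reports the result without changing anything:

```python
from dataclasses import dataclass, field

@dataclass
class ControlResult:
    rule_name: str
    passed: bool
    failures: list = field(default_factory=list)  # offending row indices

def completeness_control(rows, column, rule_name="non_null"):
    """Table-level control: every row must supply a value for `column`."""
    failures = [i for i, row in enumerate(rows)
                if row.get(column) in (None, "")]
    return ControlResult(rule_name, passed=not failures, failures=failures)

# Validate the consumer's expectation *before* the data is exchanged.
rows = [{"customer_id": "C1"}, {"customer_id": None}]
result = completeness_control(rows, "customer_id")
if not result.passed:
    print(f"control '{result.rule_name}' failed on rows {result.failures}")
```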

Sometimes when I have used this term, people respond by saying “oh yes, data edits.” But I think there is a difference between a data edit and a control. In most of the cases I have seen, an edit is intended to compare a value to some expectation and then change the value if it doesn’t agree with the intended target’s expectation. This is quite different from the control, which generates a notification or other type of event indicating a missed expectation. Another major difference is that the edit introduces inconsistency between the supplier of the data and the consumer of the data; while the transformation may benefit the consumer in the short run, it may lead to issues at a later point when there is a sudden need for reconciliation between the target and the source.
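The contrast is easy to see side by side. In this sketch (again with made-up names and a deliberately trivial validity rule), the edit rewrites the nonconforming value in place, while the control leaves the data untouched and emits an event:

```python
VALID_STATES = {"NY", "CA", "TX"}
DEFAULT_STATE = "XX"

def data_edit(record):
    """Edit: silently rewrite the value to meet the target's expectation."""
    if record["state"] not in VALID_STATES:
        record["state"] = DEFAULT_STATE  # source and target now disagree
    return record

def data_quality_control(record, events):
    """Control: leave the value alone; record an event for the discrepancy."""
    if record["state"] not in VALID_STATES:
        events.append({"rule": "valid_state", "record": dict(record)})
    return record

events = []
record = {"customer_id": "C2", "state": "ZZ"}
data_quality_control(record, events)  # record unchanged; one event emitted
data_edit(record)                     # record["state"] is now "XX"
```

Note that the event list is exactly the reconciliation record the edit never writes down: once the edit runs, the decision it made is invisible to both supplier and consumer.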

A different consideration is that data edits are a surreptitious attempt to harmonize data to meet expectations, and they do so without the knowledge of the data supplier. The control at least notifies a data steward that there is a potential discrepancy, allowing the steward to invoke the appropriate policy and to notify both the supplier and the consumer that some transformation needs to be made (or not).

One other question often crops up about controls: how do they affect performance? This is a much different question, and in fact a recent client mentioned to me that they used to have controls in place but removed them because they slowed processing down. It boils down to a decision about what is more important: ensuring high quality data or ensuring high throughput. The corresponding questions to ask center on the cumulative business impacts. Processing-time service level agreements may be contractually imposed (with corresponding penalties for missing the SLA), which may even suggest that allowing some errors through and patching the process downstream is less impactful than incurring the SLA non-compliance penalty costs.
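Before ripping controls out wholesale, it is worth putting a number on the throughput cost. A crude experiment like the following (illustrative names, a trivial set-membership rule) times the same load with and without the control in the path:

```python
import time

VALID_STATES = {"NY", "CA", "TX"}

def control(row, events):
    if row["state"] not in VALID_STATES:
        events.append(row)

def load(rows, check=False):
    """Time a pass over the rows, optionally running the control inline."""
    events = []
    start = time.perf_counter()
    for row in rows:
        if check:
            control(row, events)
        # ... write row to the target here ...
    return time.perf_counter() - start, events

rows = [{"state": "NY" if i % 10 else "??"} for i in range(1_000_000)]
t_plain, _ = load(rows)
t_checked, events = load(rows, check=True)
print(f"control overhead: {(t_checked - t_plain) / max(t_plain, 1e-9):.0%}; "
      f"{len(events)} exceptions caught")
```

The overhead obviously depends on the rule: a membership check like this one costs far less than a cross-record consistency rule, which is worth measuring before deciding the trade-off in the abstract.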

Comments


Jim Harris on Fri, 28th Jan 2011 9:50 AM

Excellent post, David.

I have seen this challenge most frequently in data warehousing environments, where the ETL processes use data edits to conform the source data values to the target expectations (in this case, the data warehouse tables being loaded) without tracking these edits or notifying a data steward.

I have implemented error suspension and recycling ETL processes in a data warehousing environment to combine data edits and data quality controls: track all edits, suspend data considered to have severe data quality issues, and notify data stewards to take corrective action where necessary.

However, in data warehousing environments, performance concerns usually trump data quality concerns, and therefore my data quality controls were usually turned into non-notifying automatic data edits.

Best Regards,

Jim
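For readers who have not seen the suspend-and-recycle pattern Jim describes, here is a rough sketch of the idea; the severity rules and field names are purely illustrative. Conforming rows load, minor issues are edited *and* logged, and severe issues are suspended for steward review:

```python
SEVERE, MINOR = "severe", "minor"
VALID_STATES = {"NY", "CA", "TX"}

def classify(row):
    """Severity rules are illustrative; real ones come from the rule set."""
    if row.get("customer_id") is None:
        return SEVERE                 # cannot load without a key
    if row.get("state") not in VALID_STATES:
        return MINOR                  # conformable, but the edit is tracked
    return None

def etl(rows):
    loaded, suspended, edit_log = [], [], []
    for row in rows:
        issue = classify(row)
        if issue == SEVERE:
            suspended.append(row)     # recycled after steward corrective action
        elif issue == MINOR:
            edit_log.append((row["customer_id"], "state", row["state"]))
            loaded.append(dict(row, state="XX"))  # edit applied *and* recorded
        else:
            loaded.append(row)
    return loaded, suspended, edit_log

loaded, suspended, log = etl([
    {"customer_id": "C1", "state": "NY"},
    {"customer_id": "C2", "state": "ZZ"},
    {"customer_id": None, "state": "CA"},
])
```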

