Data Edits, Data Quality Controls, and a Performance Question

January 28, 2011
Filed under: Data Quality 

In a number of places I have used the term “data quality control” to mean a service that inspects whether a data instance (at any level of granularity, from a single data value to an entire table or file) complies with a defined data quality rule. The objective of the control is to ensure that the expectations of the consuming process are validated before the data is exchanged.

Sometimes when I have used this term, people respond by saying “oh yes, data edits.” But I think there is a difference between a data edit and a control. In most of the cases I have seen, an edit compares a value to some expectation and then changes the value if it doesn’t agree with the intended target’s expectation. This is quite different from the control, which generates a notification or other type of event indicating a missed expectation. Another major difference is that the edit introduces inconsistency between the supplier of the data and the consumer of the data. While the transformation may benefit the consumer in the short run, it may lead to issues later, when there is a sudden need to reconcile the target with the source.

A different consideration is that data edits are a surreptitious attempt to harmonize data to meet expectations, and they do so without the knowledge of the data supplier. The control at least notifies a data steward that there is a potential discrepancy, and allows the data steward to invoke the appropriate policy and to notify both the supplier and the consumer that some transformation needs to be made (or not).
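
To make the distinction concrete, here is a minimal Python sketch of the same rule applied as an edit and as a control. Everything in it is hypothetical and chosen only for illustration: the state-code rule, the `DataSteward` class, and the function names do not come from any particular tool.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical rule: a state code must be one of a known set.
VALID_STATE_CODES = {"CA", "MD", "NY"}


def is_valid_state_code(value: str) -> bool:
    return value in VALID_STATE_CODES


@dataclass
class Notification:
    rule_name: str
    value: str
    message: str


@dataclass
class DataSteward:
    """Hypothetical stand-in for whoever receives control events."""
    inbox: List[Notification] = field(default_factory=list)

    def notify(self, notification: Notification) -> None:
        self.inbox.append(notification)


def edit_state_code(value: str, default: str = "MD") -> str:
    """Data edit: silently replace a non-conforming value.

    The supplier never learns that the value was changed, so the
    source and the target now disagree.
    """
    return value if is_valid_state_code(value) else default


def control_state_code(value: str, steward: DataSteward) -> bool:
    """Data quality control: leave the value alone and raise an event.

    Returns True when the value meets the rule, False otherwise.
    """
    if is_valid_state_code(value):
        return True
    steward.notify(
        Notification(
            rule_name="valid_state_code",
            value=value,
            message=f"State code {value!r} is not in the accepted set",
        )
    )
    return False


steward = DataSteward()
edited = edit_state_code("XX")               # value is changed, nobody is told
passed = control_state_code("XX", steward)   # value is untouched, steward is told
```

After this runs, the edit has replaced the bad value with the default, while the control has left the value as supplied and placed one notification in the steward's inbox. What happens next is a policy decision for the steward rather than something buried in the code.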

One other question often crops up about controls: how do they affect performance? This is a very different question, and in fact a recent client mentioned to me that they used to have controls in place, but the controls slowed processing down and so they removed them. It boils down to a decision about what is more important: ensuring high-quality data or ensuring high throughput. The questions to ask center on the cumulative business impacts. Processing-time service level agreements (SLAs) may be contractually imposed, with corresponding penalties for missing them, which may even suggest that allowing some errors through and patching the process downstream is less costly than incurring the SLA penalties.
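
One rough way to frame that trade-off is to compare expected costs. The sketch below is only an illustration under simplifying assumptions: the function names, the inputs, and every number are made up, and a real analysis would need measured SLA-miss rates, error rates, and remediation costs.

```python
def expected_sla_cost(p_sla_miss: float, sla_penalty: float) -> float:
    """Expected penalty per run when controls stay in the pipeline.

    p_sla_miss: assumed probability that the added processing time
    causes the run to miss its SLA.
    """
    return p_sla_miss * sla_penalty


def expected_error_cost(records: int, error_rate: float,
                        cost_per_downstream_fix: float) -> float:
    """Expected cost per run of letting errors through and patching later."""
    return records * error_rate * cost_per_downstream_fix


# Purely illustrative inputs, not drawn from any real engagement.
with_controls = expected_sla_cost(p_sla_miss=0.2, sla_penalty=50_000)
without_controls = expected_error_cost(
    records=1_000_000, error_rate=0.001, cost_per_downstream_fix=5.0
)

if without_controls < with_controls:
    decision = "let errors through and patch downstream"
else:
    decision = "keep the controls in place"
```

With these invented inputs the downstream-patching option comes out cheaper, but the comparison flips as soon as the cost of each uncaught error rises or the chance of missing the SLA falls. That is why the decision belongs with the business impacts rather than with throughput alone.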