The Validity of Value Validation

March 29, 2011 by
Filed under: Data Profiling, Data Quality, Recommendations 

I have nothing against data validation as a general practice. In fact, I might claim to be one of the more forceful proponents of validation as a practical methodology, having written a book that has guided the development of automated data validation tools. Yet validation provides only one level of trust when it comes to evaluating the quality of information.

First, let’s review what validation means. In general, it refers to a process for ensuring that a value is consistent with respect to its context. For a single value, that might mean verifying that it conforms to a set of formatting rules, such as a social security number conforming to the rules by which those numbers are assigned. It might also mean checking that an attribute’s value is drawn from a predefined value domain, such as the ISO three-character country codes. One might even assert multiple-attribute validation rules, such as ensuring that a postal code is consistent with a street location.
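To make those three kinds of rules concrete, here is a minimal sketch in Python. The function names and the tiny lookup tables are assumptions made up for illustration; a real check would load the full ISO country code list and a postal reference file, and the cross-attribute rule is simplified here to matching a ZIP code against a state rather than a street.

```python
import re

# Hypothetical, deliberately tiny reference data for illustration only.
COUNTRY_CODES = {"USA", "CAN", "MEX", "GBR", "DEU"}
ZIP_TO_STATE = {"20901": "MD", "10001": "NY", "94105": "CA"}

SSN_PATTERN = re.compile(r"^(\d{3})-(\d{2})-(\d{4})$")


def is_valid_ssn_format(value: str) -> bool:
    """Format rule: NNN-NN-NNNN, with no all-zero part and an assignable area."""
    match = SSN_PATTERN.match(value)
    if match is None:
        return False
    area, group, serial = match.groups()
    # Area numbers 000, 666, and 900-999 are never assigned.
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


def is_valid_country(code: str) -> bool:
    """Domain rule: the value must come from the predefined code set."""
    return code in COUNTRY_CODES


def zip_matches_state(zip_code: str, state: str) -> bool:
    """Multiple-attribute rule: two fields must be consistent with each other."""
    return ZIP_TO_STATE.get(zip_code) == state


print(is_valid_ssn_format("123-45-6789"))  # True
print(is_valid_country("US"))              # False
print(zip_matches_state("10001", "NY"))    # True
```

Note that every one of these checks asks only whether a value could be right. None of them asks whether it is right.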

Value validation is a good first-cut process for data quality management, since it means that some effort is exerted in determining whether the values are appropriate for the context. But the challenge is that people begin to confuse validity with accuracy or correctness. Just because a value is valid for its domain does not mean that it is the right value.

There is even a cascading effect in which many dependent values are verified as valid, yet none are correct. For example, if we are looking at a location, we can validate that the street name is a valid street name, or that the city is within the state, or that the ZIP code encompasses the street location. But if we got the street wrong in the first place, the city, state, and ZIP code may also be incorrect, even though each one passes its validity check.
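The cascade can be shown with a small sketch. Everything in it is hypothetical: the street names, the `STREET_TO_ZIP` table, and the idea that an entry screen derives the ZIP code from the street are assumptions chosen to illustrate the point, not a description of any particular system.

```python
# Hypothetical reference data, invented for illustration only.
STREET_TO_ZIP = {
    ("Oak St", "Springfield", "IL"): "62701",
    ("Elm St", "Springfield", "IL"): "62704",
}


def derive_zip(street, city, state):
    """Fill in the ZIP code from the street, as an entry screen might."""
    return STREET_TO_ZIP[(street, city, state)]


def is_valid(record):
    """Valid: the street exists in that city and state, and the ZIP covers it."""
    key = (record["street"], record["city"], record["state"])
    return STREET_TO_ZIP.get(key) == record["zip"]


def incorrect_fields(record, truth):
    """Correctness needs a trusted source: list the fields that differ from it."""
    return [name for name in truth if record[name] != truth[name]]


# Where the person actually lives (the trusted source).
truth = {"street": "Elm St", "city": "Springfield", "state": "IL", "zip": "62704"}

# The street is keyed in wrong; the ZIP is then derived from the wrong street.
entered = {"street": "Oak St", "city": "Springfield", "state": "IL"}
entered["zip"] = derive_zip(entered["street"], entered["city"], entered["state"])

print(is_valid(entered))                 # True
print(incorrect_fields(entered, truth))  # ['street', 'zip']
```

The record passes every consistency rule precisely because the dependent value was derived from the wrong starting point. Only the comparison against a trusted source exposes the error.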

There are approaches to evaluating accuracy or correctness, many of which depend on a reliable alternate source for comparison. But if no authoritative source exists, a different approach might have the data quality analyst roll up his or her sleeves and examine the data directly to review correctness. This is certainly not a scalable process, but in certain circumstances, eyeballing enough records selected using the right selection method can provide a qualitative assessment of the percentage of correct and incorrect records in the set. More on this in an upcoming post…
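As a rough sketch of what that review might look like, the Python below draws a simple random sample of record IDs and then turns the reviewer's verdicts into an estimated share of correct records. Simple random sampling and the normal-approximation interval are my assumptions here, one reasonable choice among several; the function names are hypothetical.

```python
import math
import random


def draw_review_sample(record_ids, sample_size, seed=0):
    """Pick records for manual review by simple random sampling."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(list(record_ids), sample_size)


def estimate_correct_rate(verdicts, z=1.96):
    """Estimate the share of correct records from reviewer verdicts.

    verdicts is a list of booleans (True means the reviewer judged the
    record correct). Returns (rate, low, high), where low and high are a
    rough interval from the normal approximation (z=1.96 for about 95%).
    """
    n = len(verdicts)
    if n == 0:
        raise ValueError("at least one reviewed record is required")
    rate = sum(verdicts) / n
    margin = z * math.sqrt(rate * (1 - rate) / n)
    return rate, max(0.0, rate - margin), min(1.0, rate + margin)


# Example: review 50 of 1,000 records; suppose 45 are judged correct.
sample_ids = draw_review_sample(range(1000), 50)
verdicts = [True] * 45 + [False] * 5
rate, low, high = estimate_correct_rate(verdicts)
print(f"{rate:.0%} correct (roughly {low:.0%} to {high:.0%})")
```

The interval is wide for a sample this small, which fits the point above: the result is a qualitative read on the data set, not a precise measurement.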


