Validity vs. Correctness Continued: Accuracy Percentages

Yesterday I shared some thoughts about the differences between data validity and data correctness, and why validity is a good start but ultimately is not the right measure for quality. Today I am still ruminating about what data correctness or accuracy really means.

For example, I have long been thinking about the existence (or more accurately, nonexistence) of benchmarks for data quality methods and tools, especially when it comes to data accuracy. I often see both vendors and their customers reporting “accuracy percentages” (e.g. “our customer data is 99% accurate”), and I wonder what is meant by accuracy and how those percentages are calculated and verified.

 

Let’s start with the first question: what is meant by “accurate”? My gut reaction is to say that a managed data value is accurate, within the context of a data element that is part of a data model, when the value assigned to the data element identically represents the value of the corresponding characteristic of the real-world object being modeled. That is a somewhat rambling definition, so some examples might shed more light on what I mean:

  • A first name value for an individual is accurate if the value is identical to the first name provided on that individual’s most recent government-registered identity documentation (such as a birth certificate or a passport).
  • An individual’s mobile telephone number is accurate if the party answering a connected call is that named individual.
  • A sewer drain geolocation is accurate if an individual who goes to that geolocation finds the sewer drain at that position.

 

I could go on with many other examples, but I would rather point out that when I tried to come up with these, I had some difficulty articulating what I really meant, because I kept thinking of variations in which the verification was not sufficient. For example, at first I thought that a first name was correct if it was the name given as the first name on a birth certificate. Then I realized that people sometimes officially change their names, and that does not retroactively change the birth certificate. Then I thought to use a passport, but not everybody has one. So I settled on those as examples of government-registered documentation, and I hope that qualification is sufficient (but I am sure it is not).

At the same time, none of the examples I provided can be automated: to verify the correctness of the sewer drain location, you have to go there and check reality against the data element value. That is why I am dubious about claims of “98% accurate,” especially for big data sets. If verification requires manual checking and you have 1,000,000 records, the claim suggests that you have manually determined that 980,000 of those records have the correct value. I doubt anyone does this.
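To make the scale of that claim concrete, here is a minimal Python sketch of the arithmetic. The function names and the five-minutes-per-check figure are my own illustrative assumptions, not measurements; the only numbers taken from the discussion above are the 1,000,000 records and the 98% claim.

```python
def records_verified_correct(total_records: int, claimed_percent: int) -> int:
    """Number of records that must have been confirmed correct to back the claim."""
    return total_records * claimed_percent // 100


def manual_effort_hours(total_records: int, minutes_per_check: float) -> float:
    """Person-hours needed to manually check every record once.

    minutes_per_check is a hypothetical figure chosen only for illustration.
    """
    return total_records * minutes_per_check / 60


total = 1_000_000
print(records_verified_correct(total, 98))  # 980000
print(manual_effort_hours(total, minutes_per_check=5.0))
```

Even at an optimistic few minutes per record, the person-hours run into the tens of thousands, which is why a census-style manual verification seems implausible for a data set of this size.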

Perhaps in some cases the definition of accuracy is looser than I have asserted. Perhaps, to those reporting these percentages, a data value in data set A is accurate if it matches the corresponding data value in data set B. As an example, “a first name is accurate in my customer database if it matches the first name in the master party registry.” In this case, we can say we are measuring accuracy, but really we are measuring consistency between data sets, since the correctness of the customer database value depends on the correctness of the master party registry’s value, and who verified the correctness of *that* value?
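The gap between consistency and accuracy can be sketched in a few lines of Python. Everything here is hypothetical: the `match_rate` helper, the three sample records, and especially the `real_world` dictionary, which stands in for the manually verified truth that we normally do not have.

```python
def match_rate(dataset_a: dict, dataset_b: dict) -> float:
    """Share of keys present in both data sets whose values are identical."""
    shared_keys = [key for key in dataset_a if key in dataset_b]
    if not shared_keys:
        return 0.0
    matches = sum(1 for key in shared_keys if dataset_a[key] == dataset_b[key])
    return matches / len(shared_keys)


# Hypothetical first names keyed by a party identifier.
customer_db = {1: "Jon", 2: "Maria", 3: "Li"}
master_registry = {1: "Jon", 2: "Maria", 3: "Li"}  # same error as customer_db
real_world = {1: "John", 2: "Maria", 3: "Li"}      # assumed ground truth

consistency = match_rate(customer_db, master_registry)
accuracy = match_rate(customer_db, real_world)

print(f"consistency with registry: {consistency:.0%}")
print(f"accuracy against reality:  {accuracy:.0%}")
```

The two data sets agree on every record, yet one of the three names is wrong in both, so a perfect match rate says nothing about whether either value is correct.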

I do have some ideas I plan to share. More information to follow…

Comments

2 Comments on Validity vs. Correctness Continued: Accuracy Percentages

  1. Henrik Liliendahl Sørensen on Wed, 30th Mar 2011 1:17 PM
     An ever interesting subject to think about David. I always also like to add some diversity to the musings. Examples:
    • The label “first name” may cause troubles when having data from cultures where the first word in a name is the family name.
    • In countries having a permanent single citizen master data hub the name of a citizen at a certain time is in fact well defined.

  2. […] As promised in earlier posts, I do have some ideas to help address the question of data correctness. Knowing that validation is only a preliminary filter for assessing the quality of a set of values, what approach can we take instead to get a more effective (or rather, more believable) quantification of accuracy and correctness? Here, I believe we are going to have to push outside the typical technical envelope and look at external factors and measures that can be used to our advantage. […]
