Filed under: Data Governance, Data Quality, Events, Identity Resolution, Master Data
I have a friend in the neighborhood, and coincidentally he shares a name with another person who also lives in our neighborhood. As a joke, while I refer to my friend by his name (say for argument’s sake it is “Arnie Hollingsworth”), I refer to the other guy as “The Other Arnie Hollingsworth.”
Filed under: Business Impacts, Data Quality, Identity Resolution, Master Data
Yesterday I shared some thoughts about the differences between data validity and data correctness, and why validity is a good start but ultimately is not the right measure for quality. Today I am still ruminating about what data correctness or accuracy really means.
For example, I have been thinking for a long time about the existence (or more accurately, nonexistence) of benchmarks for data quality methods and tools, especially when it comes to data accuracy. On the one hand, I often see both vendors and their customers reporting “accuracy percentages” (e.g. “our customer data is 99% accurate”) and I wonder what is meant by accuracy and how those percentages are both calculated and verified.
Filed under: Data Analysis, Data Governance, Data Profiling, Identity Resolution
Yesterday, Henrik Liliendahl Sørensen posted an interesting entry about data profiling, data domain values, and the quality and completeness of the hierarchies associated with the values populating any particular data element in a data set. I’d like to jam along with that concept with respect to a conversation I had the other day that was essentially about capturing and tracking spend data, although the context was capturing and reporting the aggregate payments made by a pharmaceutical company (or other covered manufacturer) to specific practitioners.
Comparing character strings for an exact match is straightforward. However, when simple errors creep in through finger flubs or incorrect transcription, a human can still intuitively see the similarity. For example, “David Loshin” is sometimes misspelled as “David Loshion,” yet one can readily see that the two names are similar.
To automatically determine that two strings are similar, you need to implement some method of measuring similarity between data values. One such measure for two character strings is the edit distance: the minimum number of basic edit operations required to transform one string into the other. There are three basic edit operations:
• Insertion (where an extra character is inserted into the string),
• Deletion (where a character has been removed from the string), and
• Transposition (in which two adjacent characters are reversed in sequence).
Wikipedia actually has some good references about edit distance, and if you are interested in learning more about how those algorithms are implemented, it is a good place to start learning.
As an example of this calculation, the edit distance between the strings “INTERMURAL” and “INTRAMURAL” is 2, since to change the first string to the second, we would delete the “E” and then insert an “A.” Some people also include substitution as a basic edit operation; a substitution is equivalent to a deletion followed by an insertion. Strings that compare with small edit distances are likely to be similar, while value pairs with large edit distances are likely to be less similar or not similar at all.
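As a minimal sketch of this computation, here is the restricted Damerau-Levenshtein formulation in Python, which counts insertions, deletions, substitutions, and adjacent transpositions each as a single operation (one common variant of the edit distances the Wikipedia references describe):

```python
def edit_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance between strings a and b."""
    m, n = len(a), len(b)
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all i characters of a
    for j in range(n + 1):
        d[0][j] = j          # insert all j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

With this function, comparing “David Loshin” against the misspelled “David Loshion” yields a distance of 1 (a single insertion), which is the kind of small distance that signals likely similarity.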
Scoring every string in a data set against a query string this way is computationally expensive. That is why edit distance is often invoked as one of a number of similarity measures, applied only once a collection of candidate matches has been narrowed down.
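To illustrate that two-phase idea, here is a hypothetical sketch: a cheap blocking key (here, just the first letter of the name) narrows the data set to candidates, and only those candidates are scored with a similarity measure. The blocking key and the threshold are invented for illustration, and Python's standard-library difflib ratio stands in for whatever similarity measure a real matching tool would use.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def block_key(name: str) -> str:
    # Hypothetical blocking key: the uppercased first character.
    return name[:1].upper()

def candidate_matches(query: str, names: list[str], threshold: float = 0.9) -> list[str]:
    """Return names similar to query, scoring only within the query's block."""
    # Phase 1: group all names into blocks by the cheap key.
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)
    # Phase 2: apply the (expensive) similarity score only to the
    # candidates that share the query's block.
    return [
        name
        for name in blocks[block_key(query)]
        if SequenceMatcher(None, query, name).ratio() >= threshold
    ]
```

Names in other blocks are never scored at all, which is where the savings come from; the trade-off is that a blocking key that is too coarse saves little, while one that is too aggressive can discard true matches before they are ever scored.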
I had a set of discussions recently with representatives of different business functions and found an interesting phenomenon: although folks from almost every area of the business indicated a need for some degree of identity resolution and matching, there were different requirements, expectations, processes, and even tools/techniques in place. In some cases, the matching algorithms each group uses refer to different data elements, apply different scoring weights and thresholds, and follow different processes for manual review of questionable matches. Altogether, the result is inconsistency in matching precision.
And it is reasonable for different business functions to require different levels of matching precision. You don’t need as strict a set of scoring thresholds when matching individuals for marketing purposes as you do when assuring customer privacy. But when different tools and methods are used, there is bound to be duplicative work in implementing and managing the different matching processes and rules.
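To make that concrete, here is a hypothetical sketch of a single shared matching service in which every business function uses the same data elements and scoring weights, but each applies its own threshold. The attribute names, weights, and threshold values are all invented for illustration, and difflib again stands in for a real similarity measure.

```python
from difflib import SequenceMatcher

# Hypothetical shared scoring weights over the matching data elements.
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}

# Hypothetical per-function thresholds: marketing tolerates looser
# matches than a privacy-sensitive process does.
THRESHOLDS = {"marketing": 0.75, "privacy": 0.95}

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def match_score(rec1: dict, rec2: dict) -> float:
    """Weighted sum of per-attribute similarities between two records."""
    return sum(w * similarity(rec1[attr], rec2[attr])
               for attr, w in WEIGHTS.items())

def is_match(rec1: dict, rec2: dict, business_function: str) -> bool:
    # One scoring engine; only the threshold varies by consumer.
    return match_score(rec1, rec2) >= THRESHOLDS[business_function]
```

In this arrangement a pair of records can legitimately count as a match for marketing yet fall below the privacy bar, while the scoring logic itself is implemented and managed only once.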
To address this, it might be worth considering whether the existing approaches serve the organization in the most appropriate way. This involves performing at least these steps:
1) Document the current state of matching/identity resolution
2) Profile the data sets to determine the best data attributes for matching
3) Document each business process’s matching requirements
4) Evaluate the existing solutions and determine either that the current situation is acceptable or that there is an opportunity to standardize on one specific approach across the organization