Institutionalized Entity Identification Problems

December 13, 2011
Filed under: Data Governance, Data Quality 

It has been a while since I posted an entry – mostly a sign of the busyness that comes with trying to wrap up projects before the end of the year. However, I did have an interesting experience recently with one of our customers, with whom we are working on developing a best practices guide for data governance.

For this customer, I was provided with badge access so I could get in and out of the building, and we had an appointment with security to have the badge created (along with a bunch of other security-related tasks). As some of you might know, I never use my first name, but go by my middle name, David. However, since my driver’s license has my full name on it, I was told that my badge would have my real first name (“Howard”) on it, and that if I needed to contact security for any reason, I would need to give them my real first name (which I *never* use).

Partial Entity Resolution

The other day I had a conversation about product master data, and one of the participants, almost as an aside, mentioned the concept of a “virtual product.” More specifically, he was referring to an operational context in which a maintenance team needed to look for a type of part to be used to replace an existing worn machine part. The curious aspect of this was that they were not looking for a specific part. Rather, they needed to describe the characteristics of the part and then see which available parts matched those characteristics. If none were available, they’d either need to create a new one or search other suppliers for a matching part.
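As a minimal sketch of what that characteristic-based lookup might look like (the catalog entries and attribute names here are purely illustrative, not from any real parts system):

```python
# Illustrative sketch: find candidate replacement parts by matching
# described characteristics rather than a specific part number.
# The catalog entries and attribute names are hypothetical.

def find_matching_parts(required, catalog):
    """Return catalog parts whose attributes satisfy every requirement."""
    matches = []
    for part in catalog:
        if all(part.get(attr) == value for attr, value in required.items()):
            matches.append(part)
    return matches

catalog = [
    {"part_id": "B-1001", "type": "bearing", "bore_mm": 25, "material": "steel"},
    {"part_id": "B-1002", "type": "bearing", "bore_mm": 30, "material": "steel"},
    {"part_id": "S-2001", "type": "seal",    "bore_mm": 25, "material": "rubber"},
]

# The "virtual product": a description of what is needed, not a specific part.
required = {"type": "bearing", "bore_mm": 25, "material": "steel"}

print(find_matching_parts(required, catalog))
# -> [{'part_id': 'B-1001', 'type': 'bearing', 'bore_mm': 25, 'material': 'steel'}]
```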


Identity Resolution, Name Matching, and Similarity: Edit Distance

January 24, 2011
Filed under: Data Quality, Identity Resolution 

Comparing character strings for an exact match is straightforward. However, when there are simple errors due to finger flubs or incorrect transcriptions, a human can still intuitively see the similarity. For example, “David Loshin” is sometimes misspelled as “David Loshion,” yet in that case one can see that the two names exhibit similarity.

In order to automatically determine that two strings are similar, you need to implement some method of measuring similarity between the data values. One measure of similarity between two character strings is to measure what is called the edit distance between those strings. The edit distance between two strings is the minimum number of basic edit operations required to transform one string to the other. There are three edit operations:

• Insertion (where an extra character is inserted into the string),
• Deletion (where a character is removed from the string), and
• Transposition (where two adjacent characters are swapped in sequence).

Wikipedia actually has some good references about edit distance, and if you are interested in learning more about how those algorithms are implemented, it is a good place to start.

As an example of this calculation, the edit distance between the strings “INTERMURAL” and “INTRAMURAL” is 2, since to change the first string into the second we can delete the “E” and then insert an “A” after the “R.” (A longer sequence – transposing “ER” to “RE,” then deleting the “E” and inserting an “A” – also works, but the distance is defined by the minimum.) Some people include substitution as a basic edit operation, which is essentially a deletion followed by an insertion. Strings that compare with small edit distances are likely to be similar, while value pairs with large edit distances are likely to be less similar or not similar at all.
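To make the calculation concrete, here is a minimal sketch of a dynamic-programming edit distance. It follows the common “optimal string alignment” variant, which counts insertion, deletion, substitution, and adjacent transposition each as a single operation; for the example above it returns the same value of 2.

```python
def edit_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and adjacent transpositions each cost 1."""
    m, n = len(a), len(b)
    # d[i][j] = distance between a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(edit_distance("INTERMURAL", "INTRAMURAL"))     # 2
print(edit_distance("David Loshin", "David Loshion"))  # 1 (one insertion)
```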

Attempting to score every string against a query string using this approach is going to be computationally inefficient. That is why edit distance is often invoked as one of a number of similarity measures applied once a collection of candidate matches has been found.
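As a rough sketch of that pattern, a cheap blocking key can narrow the search first, with edit distance scored only against the surviving candidates. The blocking key, records, and threshold below are illustrative, and edit_distance refers to the function sketched above:

```python
from collections import defaultdict

def blocking_key(name):
    """Cheap candidate filter: first four characters of the surname, uppercased.
    Any inexpensive, reasonably stable key would do; this one is illustrative."""
    surname = name.split()[-1]
    return surname[:4].upper()

def find_candidates(query, records, max_distance=2):
    """Group records by blocking key, then score only the query's block."""
    index = defaultdict(list)
    for rec in records:
        index[blocking_key(rec)].append(rec)
    scored = []
    for rec in index.get(blocking_key(query), []):
        d = edit_distance(query, rec)
        if d <= max_distance:
            scored.append((rec, d))
    return scored

records = ["David Loshin", "David Loshion", "Davide Losch", "Daniel Lord"]
print(find_candidates("David Loshin", records))
# -> [('David Loshin', 0), ('David Loshion', 1)]
```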

Variant Approaches to Identity Resolution and Record Matching

January 18, 2011
Filed under: Data Analysis, Identity Resolution 

I recently had a set of discussions with representatives of different business functions and found an interesting phenomenon: although folks from almost every area of the business indicated a need for some degree of identity resolution and matching, there were different requirements, expectations, processes, and even tools and techniques in place. In some cases the matching algorithms each group uses refer to different data elements, apply different scoring weights and thresholds, and rely on different processes for manual review of questionable matches. Altogether the result is inconsistency in matching precision.

And it is reasonable for different business functions to require different levels of matching precision. You don’t need as strict a set of scoring thresholds for matching individuals for marketing purposes as you might for assuring customer privacy. But when different tools and methods are used, there is bound to be duplicative work in implementing and managing the different matching processes and rules.
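One way to reduce that duplication, sketched below with invented attributes, weights, and thresholds, is to share a single weighted scoring function and let each business function supply only its own acceptance threshold:

```python
# Illustrative sketch: one weighted match score, with each business function
# supplying only its acceptance threshold. All weights and thresholds are
# invented for the example.

WEIGHTS = {"name": 0.5, "dob": 0.3, "postal_code": 0.2}

THRESHOLDS = {
    "marketing": 0.70,            # looser: a duplicate mailing is cheap
    "privacy_compliance": 0.95,   # stricter: a false match is costly
}

def attribute_score(a, b):
    """Crude attribute similarity: 1.0 for exact match, else 0.0.
    A real implementation would use edit distance or another measure."""
    return 1.0 if a == b else 0.0

def match_score(rec_a, rec_b):
    return sum(w * attribute_score(rec_a.get(attr), rec_b.get(attr))
               for attr, w in WEIGHTS.items())

def is_match(rec_a, rec_b, business_function):
    return match_score(rec_a, rec_b) >= THRESHOLDS[business_function]

a = {"name": "David Loshin", "dob": "1960-01-01", "postal_code": "20910"}
b = {"name": "David Loshin", "dob": "1960-01-01", "postal_code": "02139"}

print(is_match(a, b, "marketing"))           # True  (score 0.8 >= 0.70)
print(is_match(a, b, "privacy_compliance"))  # False (score 0.8 <  0.95)
```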

To address this, it might be worth considering whether the existing approaches serve the organization in the most appropriate way. This involves performing at least these steps:

1) Document the current state of matching/identity resolution
2) Profile the data sets to determine the best data attributes for matching (see the profiling sketch after this list)
3) Document each business process’s matching requirements
4) Evaluate the existing solutions and determine whether the current situation is acceptable or whether there is an opportunity to select one specific approach that can be standardized across the organization
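For step 2, even a small profiling pass over candidate matching attributes can be informative; completeness (fill rate) and distinctness are typical first checks. The records and attributes below are invented for illustration:

```python
# Illustrative profiling sketch for step 2: measure completeness (fill rate)
# and distinctness of candidate matching attributes. Attributes with high
# fill rates and high distinctness tend to be better matching keys.

def profile_attributes(records, attributes):
    stats = {}
    total = len(records)
    for attr in attributes:
        filled = [r.get(attr) for r in records if r.get(attr) not in (None, "")]
        stats[attr] = {
            "fill_rate": len(filled) / total if total else 0.0,
            "distinctness": len(set(filled)) / len(filled) if filled else 0.0,
        }
    return stats

records = [
    {"name": "David Loshin", "dob": "1960-01-01", "gender": "M"},
    {"name": "D. Loshin",    "dob": "1960-01-01", "gender": "M"},
    {"name": "Jane Smith",   "dob": "",           "gender": "F"},
    {"name": "John Smith",   "dob": "1975-06-30", "gender": "M"},
]

for attr, s in profile_attributes(records, ["name", "dob", "gender"]).items():
    print(attr, s)
# name is fully populated and highly distinct; gender is fully populated but
# discriminates poorly, so it is a weak matching attribute on its own.
```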