edit distance : The Practitioner's Guide to Data Quality Improvement

Identity Resolution, Name Matching, and Similarity: Edit Distance

January 24, 2011
Filed under: Data Quality, Identity Resolution 

Comparing character strings for an exact match is straightforward. However, when there are simple errors due to finger flubs or incorrect transcriptions, a human can still intuitively see the similarity. For example, “David Loshin” is sometimes misspelled as “David Loshion,” yet a reader can still see at a glance that the two names are similar.

To determine automatically that two strings are similar, you need some method of measuring similarity between the data values. One such measure for character strings is the edit distance. The edit distance between two strings is the minimum number of basic edit operations required to transform one string into the other. In the formulation used here, there are three basic edit operations:

• Insertion (where an extra character is inserted into the string),
• Deletion (where a character is removed from the string), and
• Transposition (where two adjacent characters are swapped).
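
To make the three operations concrete, here is a minimal Python sketch of each one as a string transformation. The function names and the use of zero-based positions are assumptions made for illustration; they do not come from any particular library.

```python
def insert_char(s: str, index: int, ch: str) -> str:
    """Insertion: place ch so that it ends up at position index."""
    return s[:index] + ch + s[index:]


def delete_char(s: str, index: int) -> str:
    """Deletion: remove the character at position index."""
    return s[:index] + s[index + 1:]


def transpose_adjacent(s: str, index: int) -> str:
    """Transposition: swap the characters at index and index + 1."""
    return s[:index] + s[index + 1] + s[index] + s[index + 2:]


print(insert_char("David Loshin", 11, "o"))   # David Loshion
print(delete_char("David Loshion", 11))       # David Loshin
print(transpose_adjacent("Loshin", 4))        # Loshni
```

Each call counts as one edit operation, so the misspelling “David Loshion” is a single insertion away from “David Loshin.”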

Wikipedia has some good references on edit distance, and it is a good place to start if you are interested in learning more about how these algorithms are implemented.

As an example of this calculation, the edit distance between the strings “INTERMURAL” and “INTRAMURAL” is 2, since to change the first string into the second, we would delete the “E” and then insert an “A” after the first “R.” (Transposing the “ER” into “RE,” then deleting the “E” and inserting an “A” also works, but that path takes three operations, so it is not the minimum.) Some people include substitution as a basic edit operation; a substitution has the same effect as a deletion followed by an insertion, but it is typically counted as a single operation. Strings that compare with small edit distances are likely to be similar, while value pairs with large edit distances are likely to be less similar or not similar at all.
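
One way to compute such a distance is with a dynamic-programming table. The sketch below is not a definitive implementation: it assumes the "optimal string alignment" variant, in which a transposed pair of characters is not edited again afterwards, so in rare cases it can report a value slightly above the true minimum. The function name `edit_distance` and the `allow_substitution` flag are hypothetical names chosen for this example.

```python
def edit_distance(a: str, b: str, allow_substitution: bool = False) -> int:
    """Count insertions, deletions, and adjacent transpositions (cost 1 each).

    Assumption: a substitution costs 1 when allow_substitution is True,
    and otherwise 2, because it must be done as a deletion plus an insertion.
    """
    sub_cost = 1 if allow_substitution else 2
    rows, cols = len(a) + 1, len(b) + 1

    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            best = min(
                d[i - 1][j - 1] + (0 if same else sub_cost),  # match or substitute
                d[i - 1][j] + 1,                              # deletion
                d[i][j - 1] + 1,                              # insertion
            )
            # Adjacent transposition: the last two characters are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                best = min(best, d[i - 2][j - 2] + 1)
            d[i][j] = best

    return d[-1][-1]


print(edit_distance("David Loshin", "David Loshion"))  # 1 (one insertion)
print(edit_distance("abcd", "abdc"))                   # 1 (one transposition)
```

Filling the table takes time proportional to the product of the two string lengths, which is fine for a single pair of names but adds up quickly across many comparisons.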

Attempting to score every string in a large data set against a query string using this approach is computationally expensive, because the standard calculation takes time proportional to the product of the two string lengths for each pair compared. That is why edit distance is often invoked as one of a number of similarity measures applied once a collection of candidate matches has been found.
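
The "candidates first, edit distance second" idea can be sketched as follows. This is an illustration under stated assumptions, not a recommended matching pipeline: candidates are selected with a simple length filter (two strings whose lengths differ by more than k cannot be within k edits of each other), and the scorer is plain Levenshtein distance (insertion, deletion, substitution) standing in for whatever measure you prefer. The names `build_length_index` and `find_similar` are hypothetical.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def build_length_index(names):
    """Group names by length so candidates can be found without scoring."""
    index = defaultdict(list)
    for name in names:
        index[len(name)].append(name)
    return index


def find_similar(query, index, max_distance=2):
    """Score only the names whose length is within max_distance of the query."""
    matches = []
    low, high = len(query) - max_distance, len(query) + max_distance
    for length in range(low, high + 1):
        for candidate in index.get(length, []):
            dist = levenshtein(query, candidate)
            if dist <= max_distance:
                matches.append((candidate, dist))
    return sorted(matches, key=lambda pair: (pair[1], pair[0]))


names = ["David Loshin", "David Loshion", "Daniel Lewis", "D. Loshin"]
index = build_length_index(names)
print(find_similar("David Loshin", index))
# [('David Loshin', 0), ('David Loshion', 1)]
```

Here “D. Loshin” is never scored at all because its length rules it out, while “Daniel Lewis” is scored and then rejected. In practice the candidate step is usually something more selective than length alone, but the division of labor is the same.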