Identity Resolution, Name Matching, and Similarity: Edit Distance

January 24, 2011 by
Filed under: Data Quality, Identity Resolution 

Comparing character strings for an exact match is straightforward. However, when there are simple errors due to finger flubs or incorrect transcriptions, a human can still intuitively see the similarity. For example, “David Loshin” is sometimes misspelled as “David Loshion,” yet in that case one can see that the two names exhibit similarity.

In order to automatically determine that two strings are similar, you need to implement some method of measuring similarity between the data values. One measure of similarity between two character strings is to measure what is called the edit distance between those strings. The edit distance between two strings is the minimum number of basic edit operations required to transform one string to the other. There are three edit operations:

• Insertion (where an extra character is inserted into the string),
• Deletion (where a character has been removed from the string), and
• Transposition (in which two characters are reversed in their sequence).

Wikipedia actually has some good references about edit distance, and if you are interested in learning more about how those algorithms are implemented, it is a good place to start learning.

As an example of this calculation, the edit distance between the strings “INTERMURAL” and “INTRAMURAL” is 3, since to change the first string to the second, we would transpose the “ER” into “RE,” then delete the “E” followed by an insertion of an “A.” Some people include substitution as a basic edit operation, which is basically a deletion followed by an insertion. Strings that compare with small edit distances are likely to be similar, while value pairs with large edit distances are likely to be less similar or not similar at all.

Attempting to score every string against a query string using this approach is going to be computationally inefficient. That is why edit distance is often invoked as one of a number of similarity measures applied once a collection of candidate matches have been found.

Comments

2 Comments on Identity Resolution, Name Matching, and Similarity: Edit Distance

    [...] This post was mentioned on Twitter by Pervasive Software, Paige Roberts and Vish Agashe, David Loshin. David Loshin said: Today's #dataquality post is about identity resolution, similarity scoring, and edit distance http://bit.ly/fbWW7m [...]

  1. Galina O. Fedorkova on Thu, 28th Jul 2011 8:24 AM
  2. Hello David!

    “Attempting to score every string against a query string using this approach is going to be computationally inefficient”

    I know how to make it efficient.

    In my PhD thesis I’ve examined the edit distance (Levenshtein distance) in the task of data merging for an insurance company (they had an enormous quantity of finger flubs or incorrect transcriptions from the clinics). There were about 50 000 record in one table against 200 000 records in other (each month). Of course, you can’t build 50 000 x 200 000 = 10 000 000 pairs to compute distance within them, it is inefficient.
    So, i used some information retrieval methods (especially tries) to solve this task, and achieved quite linear time (relative to result set cardinality). I’ve found the information retrieval methods are useful in Identity Resolution problems.
    In next few months a’m going to compare tries with qgramms method (qgramms do not require complex software complex and allow blocks permutation). And then i have more ideas for Identity Resolution problem.

    I can write you detailed information if it is interesting for you. My book will be printed soon, but it is in Russian. I have no English papers at present.

    Galina O. Fedorkova, PhD
    Associate professor of Lipetsk State Technical University

Tell me what you're thinking...
and oh, if you want a pic to show with your comment, go get a gravatar!





*