Wolf Garbe, CEO and co-founder of SeekStorm

Very fast Data cleaning of product names, company names & street names

Correcting product names, company names, street names and addresses is a frequent task in data cleaning and deduplication. These names are often misspelled, either through OCR errors or through mistakes made by the people who collected the data.

Unlike ordinary single-word spelling correction, these names often consist of multiple words and contain white space and punctuation. For large data sets and Big Data applications, speed is also very important.

The SymSpell algorithm supports both requirements and is up to 1 million times faster than conventional approaches (see benchmark). The C# source code is available as open source on GitHub. A simple modification of the original source code adds support for names with multiple words, white space and punctuation:

You can simply use CreateDictionaryEntry("company/street/product name", "") to add multi-word company, street & product names to the dictionary. Spaces within the names are allowed.

Then Correct("misspelled street","") returns the correct street name from the dictionary. With the verbosity parameter you specify whether you want only the best match or all matches within a certain edit distance (the number of character operations by which two strings differ).
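The add-then-look-up workflow can be sketched in a few lines. The following is a minimal Python sketch for illustration only, not the C# SymSpell code: `NameDictionary`, `add`, `lookup`, `max_distance` and `best_only` are hypothetical names, the lookup is a brute-force scan rather than SymSpell's fast lookup, and plain Levenshtein distance stands in for Damerau-Levenshtein to keep the example short.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), two-row version."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


class NameDictionary:
    """Hypothetical stand-in for the name dictionary (brute-force scan)."""

    def __init__(self) -> None:
        self.names: List[str] = []

    def add(self, name: str) -> None:
        # Multi-word names are stored whole; spaces and punctuation are kept.
        if name not in self.names:
            self.names.append(name)

    def lookup(self, text: str, max_distance: int = 2,
               best_only: bool = True) -> List[Tuple[str, int]]:
        # Compare case-insensitively and keep names within max_distance.
        matches = []
        for name in self.names:
            distance = levenshtein(text.lower(), name.lower())
            if distance <= max_distance:
                matches.append((name, distance))
        # Closest names first; ties are ordered alphabetically.
        matches.sort(key=lambda match: (match[1], match[0]))
        return matches[:1] if best_only else matches


names = NameDictionary()
for street in ("Baker Street", "Bakers Street", "Oxford Street"):
    names.add(street)

print(names.lookup("Bakr Street"))                   # [('Baker Street', 1)]
print(names.lookup("Bakr Street", best_only=False))  # [('Baker Street', 1), ('Bakers Street', 2)]
```

With `best_only=False` the sketch returns every name within `max_distance`, closest first, which corresponds to asking for all matches instead of only the best one.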

For every similar term (or phrase) found in the dictionary, the algorithm gives you the Damerau-Levenshtein edit distance to your input term (look for suggestion.distance in the source code). The edit distance is the number of characters that have to be inserted, deleted, substituted or transposed to turn the input term into the dictionary term. It therefore measures the similarity between the input term (or phrase) and the similar terms (or phrases) found in the dictionary: the smaller the distance, the closer the match.
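To make the four operations concrete, here is a minimal Python sketch of a Damerau-Levenshtein distance. It is an illustration, not the SymSpell source: the function name `damerau_levenshtein` is hypothetical, and the sketch assumes the common "optimal string alignment" variant, in which only adjacent characters can be transposed and no substring is edited twice.

```python
def damerau_levenshtein(source: str, target: str) -> int:
    """Count insertions, deletions, substitutions and adjacent transpositions
    needed to turn source into target (optimal string alignment variant)."""
    rows, cols = len(source) + 1, len(target) + 1
    # d[i][j] = distance between the first i chars of source and first j of target
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Two adjacent characters swapped count as a single operation.
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


print(damerau_levenshtein("Main Stret", "Main Street"))    # 1 (one character added)
print(damerau_levenshtein("Main Streeet", "Main Street"))  # 1 (one character deleted)
print(damerau_levenshtein("Maim Street", "Main Street"))   # 1 (one character altered)
print(damerau_levenshtein("Mian Street", "Main Street"))   # 1 (two characters transposed)
print(damerau_levenshtein("Mian Stret", "Main Street"))    # 2 (transposition plus addition)
```

Counting a transposition as one operation is what separates this measure from plain Levenshtein distance, which would score "Mian Street" against "Main Street" as 2.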

29 Sep 2015
