FastFuzzyStringMatcher is a BK tree implementation for quick in-memory string matching. (Also available for .NET).
- Fast, fuzzy, string matching.
- Search based on percentage and edit distance.
- Associate data with string keywords and return both. For example, search for a file name, and return associated file paths.
Although hash maps can be used for exact string matching, and tries can be used for prefix matching, there are few solutions out there for fast matching based on edit distance or percentage difference. Of course, you can search through every string in a collection, comparing its edit distance to the keyword you're searching for, but this tends to be pretty inefficient.
FastFuzzyStringMatcher builds a BK tree to make searching a lot more efficient.
The project was originally built using Eclipse and Java 8 and should build cleanly, assuming you have the latest JDK installed.
The main class can be found in src/main/java
--> package: com.gitub.pekoto.fastfuzzystringmatcher
--> StringMatcher.java
.
Usage is fairly simple:
- Declare a new instance:
StringMatcher<T> myStringMatcher = new StringMatcher<T>();
- Add your data by calling
myStringMatcher.add(...)
- Search for your data by calling
myStringMatcher.search(...)
EditDistanceCalculator.java
is also public, so it can also be used independently to calculate the edit distance between two CharSequence
objects. (CharSequence
includes String
, StringBuilder
, CharBuffer
, etc.)
src/test/java
contains unit tests which demonstrate and verify the functionality of the StringMatcher
and EditDistanceCalculator
classes.
These are just standard JUnit tests and can be run in Eclipse by right-clicking on the package and selecting Run As --> JUnit Test.
src/example/java
shows how the StringMatcher
can be used to implement a translation memory dictionary with fuzzy matching. EnglishJapaneseDictionarySearcher.java
contains the implementation of the translation memory dictionary. SearchDriver.java
shows how it can be used.
The dictionary loads around 50,000 entries from the JMDict Project. StringMatcher
should be able to handle a lot more than 50,000 translation pairs, but I wanted to keep the download size fairly small.
The fuzzy string matching relies on edit distance.
Edit distance, better known as Levenshtein distance, is the minimum number of edits it takes to turn one string into another, using substitution, insertion, and deletion.
For example, to turn cat into hate:
- cat > hat (substitute c for h)
- hat > hate (insert e)
Edit distance = 2
The algorithm to calculate this uses dynamic programming to build a matrix of edit distances. Wiki has a nice explanation and good examples.
EditDistanceCalculator.java
uses the iterative with two matrix rows approach. This seems to give the best performance based on some quick tests I ran.
StringMatcher
is essentially a BK tree implementation.
In a BK tree, every node is added based on its edit distance from the root.
For example, say we had this collection of words: hat, cat, kate, ball, and bat.
We start by adding hat. It becomes the root:
Next we add cat. This has an edit distance of 1 from hat (substitute h for c), so we add it as a child with a key of 1:
We do the same with kate and ball -- calculate their edit distances respective to the root, and then add them as children with those keys:
Finally we add bat. But notice that the edit distance is 1, and we already have a child with edit distance 1 -- cat. No problem. We just move down to cat, calculate the edit distance between cat and bat, and add the node as a child of cat.
Okay, now we're ready to search!
Imagine we accidentally typed in the word zat, and we want to get a list of potential corrections for our typo. Let's say we want to search all of our nodes with a maximum edit distance of 1.
First, we compare zat with our root, hat. Sure enough, the edit distance is 1, which is within our threshold, so we add hat to our list of results to return.
Next, we examine all of the child nodes within the current edit distance +/- our edit distance threshold of 1.
- 1 (current edit distance) - 1 (our threshold) = 0 (min edit distance)
- 1 (current edit distance) + 1 (our threshold) = 2 (max edit distance)
So we'll examine all of the children that have an edit distance between 0-2. That means we'll examine kate and cat, but ignore ball. First, let's check out kate.
Oh uh, the edit distance between zat and kate is 2, so we ignore this node, and there are no children, so let's back up.
cat has an edit distance of 1, so let's check it out. The edit distance between zat and cat is also 1, which is within our threshold, so hooray! We have another result.
Oh yeah, and cat has a child node. We repeat the step we did at the root but using our current node: work out the maximum and minimum threshold based on the edit distance between zat and cat, and then examine children within that threshold.
This brings us down to bat. We check the edit distance, and again find it's within our threshold.
With that we're done, and we return hat, cat, and bat. We can imagine any of these might be a typo for zat. If you wanted to predict which of the three words was most likely meant by the user, you could also take into account which keys are most commonly mistyped. For example, c is closest to z on a standard QWERTY keyboard, so you could guess that they probably meant cat.
Overall, we still ended up searching 80% of our tree, but 80% can still lead to a significant saving if you have, for example, 500,000 strings in your collection.
Exercise
What would happen if zap had been added to our BK tree?
The BK tree is a simple data structure that can deliver decent performance gains when you need to search a large number of strings. They're quick to implement and having fuzzy searching and custom spell checking can be a super nice feature for your application, especially when you're dealing with translation data or you have lots of custom strings that won't be picked up by a standard spell checker, like fund codes, stock ticker symbols, or fictional character names.
Happy searching :)