SMAC is a tiny autocompletion engine written in Go. It supports UTF-8 alphabets. Emphasis is on speed and simplicity.
Performance figures for a 355k-word dictionary on a modern machine (see the benchmark files):
- initialization time ~0.7 seconds
- average completion time 9000 ns (> 100k completions/sec)
- ratio of memory occupation to dictionary size between 13 and 16 (for prefixMapDepth = 3 and 4 respectively)
SMAC is case-sensitive. If case-insensitive behaviour is needed, clients should convert words to a single case (upper or lower) before adding them and before calling Complete().
Paging is not supported, since it is mostly the client's responsibility.
SMAC learns new words on the go via the Learn() function, can UnLearn() them (including words provided in the bootstrap dictionary), and keeps track of how often words are accepted, giving frequently used words priority when generating completion lists.
If customizations of the autocompleter (learnt words etc.) must survive re-instantiation, or be transferred somewhere else, SMAC can Save() what it has learnt to a file and subsequently Retrieve() it from that file.
SMAC has not been tested on logographic writing systems, for which it would not make much sense unless an orthographic spelling is provided (hiragana to kanji, for example). If such a need arises, SMAC could be adapted to it.
You would usually find a dictionary file and bootstrap SMAC from it:
ac, err := NewAutoCompleteLinoF("/home/....", 4, 10, 90)
if err != nil {
	os.Exit(-1)
}
The parameters to the constructor make sense for an average use (more detail below).
After getting an instance of SMAC, you can go straight to using it:
completions, _ := ac.Complete("chair")
fmt.Println(completions)
To make SMAC smarter, make sure to Accept() every word that is selected after autocompletion:
err := ac.Accept("chairman")
if err != nil {
	// do something
}
Accept() will bounce an error if a word is not in the dictionary.
To make SMAC learn a new word, use Learn():
err := ac.Learn("Pneumonoultramicroscopicsilicovolcanoconiosis")
if err != nil {
	// do something
}
Learn() will bounce an error if the word to learn is already in the dictionary.
To make SMAC forget a word, use UnLearn():
err := ac.UnLearn("Pneumonoultramicroscopicsilicovolcanoconiosis")
if err != nil {
	// do something
}
UnLearn() will bounce an error if the word to forget was not learnt in the first place.
To save to file whatever SMAC has learnt, use Save():
err := ac.Save("/home/....")
if err != nil {
	// do something
}
Save() will only save a diff from the bootstrap dictionary, and will bounce an error if it cannot write to the file. To retrieve a saved SMAC state (possibly after bootstrapping from a dictionary), use Retrieve():
err := ac.Retrieve("/home/....")
if err != nil {
	// do something
}
Retrieve() will only retrieve a diff from the bootstrap dictionary, and will bounce an error if it cannot read from the file.
You can also bootstrap from an array of strings:
ac, err := NewAutoCompleteLinoS([]string{"mary", "had", "a", "little", "lamb"}, 4, 10, 90)
if err != nil {
	os.Exit(-1)
}
or just start from scratch and learn as you go:
ac, err := NewAutoCompleteLinoE(4, 10, 90)
if err != nil {
	os.Exit(-1)
}
Meaning of the constructor parameters
The first parameter is prefixMapDepth. It speeds up autocompletion at the expense of memory usage. Practical values are 1, 2, 3 and 4. Speed and memory usage for a dictionary of 355k words are:
prefixMapDepth | speed | memory usage |
---|---|---|
1 | 630k ns/completion (1500 completions/sec) | 46 MB |
2 | 160k ns/completion (6000 completions/sec) | 50 MB |
3 | 30k ns/completion (32k completions/sec) | 51 MB |
4 | 9k ns/completion (111k completions/sec) | 60 MB |
The second parameter is the result size: how many words you get for each autocompletion. For example, a result size of 5 would yield these results for "chair":
chair chairborne chaired chairer chairing
while a result size of 10 would yield:
chair chairborne chaired chairer chairing chairladies chairlady chairless chairlift chairmaker
The third parameter is the radius. It determines how deep SMAC will "fish" for frequently used words (those marked with Accept()). Let's say I frequently use the word "chairmaker". If my result size is 5 and my radius is also 5, I will never see "chairmaker" when I type "chair". With a radius of 20, SMAC will look beyond the 5th result, find that "chairmaker" is frequently used, and move it to the front of the list.
Autocompletion is basically about building a data structure containing all possible prefixes to the words of a dictionary, and accessing them quickly.
Human reaction time is in the hundreds of milliseconds, so an autocompletion should take less than that.
Three challenges present themselves: quickly finding the first word corresponding to a given prefix; quickly scanning the chosen data structure until all the possible alternatives (completions) are found; and not generating an unmanageably large number of nodes in the chosen data structure (the set of all prefixes of a given dictionary can grow very large).
Speed of access
The fastest data structure for finding a given prefix in a set of words is the trie (radix tree). Every possible prefix has the shortest, most direct path in the tree. Access costs O(M) time, where M is the length of the desired string - negligible. Once pointing at the desired word, you can simply scan its subtree, which has linear cost (proportional to the number of results desired, usually a small number). The penalty is that each word has to be reassembled character by character while walking the trie, for every result. The overall cost is then O(m²), where m is the result size - basically a small, quasi-fixed cost.
The AVL tree can reach the first word starting with the desired prefix in O(log n), many orders of magnitude below human reaction time. Once the desired word is found, scanning its subtree also has a logarithmic cost per step, limited by the result size m, so the overall cost is O(m log n), a very small number.
Data structure size
A regular trie is simply not practical for autocompletion: the number of nodes explodes - a dictionary of 355k words generates an 800 MB trie. An alternative is a compressed trie, at the cost of increased implementation complexity.
An AVL tree is a better choice since the storage cost is simply an overhead of two pointers per key.
An alternative approach
Simplifying the programming of SMAC led me to search for an alternative approach. I thought of simply storing the dictionary in a (Go) map, where the words are keys and the value for each word is the next word in sorted order. This creates a sort of linked list of words.
Once the word map is filled, I extract all prefixes that actually occur in the dictionary, up to a certain length (usually 3), and create a map of prefixes, where each prefix points to the first word in the main word map starting with that prefix.
This creates a sort of flattened trie with O(1) access cost.
So for each autocompletion, the prefix is first looked up in the main word map. If it is found, a scan of the linked list starting at that word yields the completions.
If the prefix is not found in the main word map (i.e., the stem is not a complete word in itself), the longest existing prefix in the prefix map is looked up instead. From the word it points to, the main word map is scanned until a word is found that matches the stem (the prefix to be autocompleted). From that point, a further scan yields the completions.
Cost of the approach
There are two cost factors: the initial scan of the word map to find the first word starting with a given prefix (for prefixes longer than the maximum prefix in the prefix map), and the cost of scanning the list from that hit. Say I built a prefix map up to prefixes of 3 characters. The average number of English words sharing a 3-character prefix is 580 (see stats.go), so the average search scans 290 words. Once pointing at the desired word, I scan the word list up to my result size (usually a small number), a negligible cost. Scanning a linked list of 290 elements with a string comparison at each step has a very modest cost, further mitigated by the fact that the longer the stem is, the more likely it is to be a word in itself (sham/shamble, for example), in which case it is found in O(1) time in the main word map. 44% of 3-letter prefixes are complete words in English.
Overall, I measured a factor-of-two performance penalty (with a prefix map of depth 4) compared to a trie.