Skip to content
This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

Latest commit

 

History

History
37 lines (31 loc) · 3.39 KB

readme.md

File metadata and controls

37 lines (31 loc) · 3.39 KB

#Twitch Spell Check Challenge#

##License:## Academic Free License ("AFL") v. 3.0

##Usage:## To compile the programs, just type make and the makefile will produce the two executables. If you want to produce the debug executables, type make debug. You can also remove the executables and related output files by typing make clean. The programs make heavy use of C++11 features, so it is recommended that the programs are compiled on OS X Mountain Lion with the latest version of the developer tools installed (more information can be found inside the makefile).

To verify the programs, execute the verification script (./verification.sh). This action will produce two additional files called wordsgenerated.txt and verificationresults.txt. The former contains the words generated by the word generator and the latter contains the output of the spell checker.

##About the Storage:## For this challenge, storing the dictionary efficiently is part of how you keep your runtime below O(n) where n is the length of the dictionary. Originally I thought that a hash table would be a good idea as each word could be hashed and then stored in 26 buckets. However, the search time is on average O(1 + n/k), where k is the number of buckets. I decided that maybe something could be faster. After looking over binary search trees and red-black trees, I ran across the trie. Tries store information alphabetically, which is much more helpful than the pseudo-random order of hash tables, there are no collisions, and there is no need for a hash function. In addition, tries are more space efficient since nodes are shared between words with common letters. Since nodes are shared, this allows for longest prefix matching which means you can search using an approximate matching algorithm. With these properties, search time is at worst O(m), where m is the length of the key (or in this case, the word being looked for).

##Runtime Complexity:## To lookup words, I first change the word into all lowercase, which is O(m). Then I search the trie for the word without any other changes, which is O(m). If the word is not found in the trie, I recursively search the trie again with the cost of O(m) for each substring (branching at duplicates and vowels). Since all of the operations are O(m), the runtime complexity looks like it is O(km), where k is a nonzero constant and, in this case, k is the number of searches plus the cost of manipulating characters in the string. However, because O(km) = O(m) if k is nonzero, the runtime of the program is O(m).

##Other Thoughts:## The trie can be simplified further since all that is being stored is characters. A directed acyclic word graph (DAWG) compresses identical branches of a trie which lead to the same suffixes. Another option is to convert the trie into a bitwise trie and do all operations using bits. This would allow for parallelized searching and can be executed quickly on out-of-order CPUs using special instructions in some of the recent instruction sets. It is also possible to change the implementation to solely rely on Damerau-Levenshtein Distance, but the runtime may get slower as you have to iterate through the dictionary calculating distances and then check whether or not it is below the maximum threshold.