-
Notifications
You must be signed in to change notification settings - Fork 296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
K-mer banding, bit patterns, and the fast (2-bit) hash #1761
Comments
I'm now skeptical this could work; I'm not sure how well something like XOR
would work to randomize things.
|
Just to be clear: the benefits of banding are manifold and not discussed here. The benefit of using the 2-bit hash is that it's much faster than MurmurHash3, and compatible with the graph stuff. |
An idea that sacrifices speed but might be simple to implement: feed the 2bit hash value to murmurhash and use the output to decide the band a kmer is in. In the sketch you still store stuff with the 2bit hash so you keep the graph functionality. Edit: Maybe you can use a 16bit (or similarly low bit) murmurhash version and improve the speed a bit? I think "generate two 'unrelated' values from two very similar inputs" is the definition of what a hash function does. So it is unlikely we will discover a different way of doing this? What surprised me while doing some benchmarking for #1760 is that FNV hash seems no faster than murmurhash?? Or at the very least my benchmark is too noisy to see a speed up. It seems like you'd be hard pressed to come up with an even simpler hashing scheme than FNV in terms of mathematical operations. |
It was pretty clear early on in testing the banding functionality that it would not play nicely with the 2-bit hashing scheme in khmer. My first approach to banding was to look at bit patterns (
for band i \in 2^N
, do theN
least significant bits in this k-mer matchi
?), but since have settled on more flexible numeric range tests. In either case, the direct two-bit encoding of sequence content is problematic. The fact that two nearly identical sequences can hash to two nearly identical integer values is kind of antithetical to the idea of hashing. Things worked much more nicely when we used banding on murmur3-hashed k-mers.During a chat this morning, @ctb and I entertained the possibility of revisiting the idea of banding 2-bit hashed k-mers using a modified bit pattern approach. Rather than looking at the least significant bits of the hashed k-mer, could we test against an arbitrary bit pattern across all 64 bits?
For this to work, we'd need to satisfy the following criteria.
N
, we would need to be able to deterministically generateN
"random" bit patterns.N
bit patterns.So, is this scheme viable? How would we generate the bit patterns and test k-mers against a given bit pattern?
The text was updated successfully, but these errors were encountered: