Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

K-mer banding, bit patterns, and the fast (2-bit) hash #1761

Open
standage opened this issue Aug 22, 2017 · 3 comments
Open

K-mer banding, bit patterns, and the fast (2-bit) hash #1761

standage opened this issue Aug 22, 2017 · 3 comments

Comments

@standage
Copy link
Member

standage commented Aug 22, 2017

It was pretty clear early on in testing the banding functionality that it would not play nicely with the 2-bit hashing scheme in khmer. My first approach to banding was to look at bit patterns (for band i \in 2^N, do the N least significant bits in this k-mer match i?), but since have settled on more flexible numeric range tests. In either case, the direct two-bit encoding of sequence content is problematic. The fact that two nearly identical sequences can hash to two nearly identical integer values is kind of antithetical to the idea of hashing. Things worked much more nicely when we used banding on murmur3-hashed k-mers.

During a chat this morning, @ctb and I entertained the possibility of revisiting the idea of banding 2-bit hashed k-mers using a modified bit pattern approach. Rather than looking at the least significant bits of the hashed k-mer, could we test against an arbitrary bit pattern across all 64 bits?

For this to work, we'd need to satisfy the following criteria.

  • For some user-specified number of bands N, we would need to be able to deterministically generate N "random" bit patterns.
  • We need to be able to test whether a hashed k-mer matches a given bit pattern using one or more bitwise operators.
  • Each hashed k-mer must match one and only one of the N bit patterns.

So, is this scheme viable? How would we generate the bit patterns and test k-mers against a given bit pattern?

@ctb
Copy link
Member

ctb commented Aug 22, 2017 via email

@standage
Copy link
Member Author

Just to be clear: the benefits of banding are manifold and not discussed here. The benefit of using the 2-bit hash is that it's much faster than MurmurHash3, and compatible with the graph stuff.

@betatim
Copy link
Member

betatim commented Aug 23, 2017

An idea that sacrifices speed but might be simple to implement: feed the 2bit hash value to murmurhash and use the output to decide the band a kmer is in. In the sketch you still store stuff with the 2bit hash so you keep the graph functionality. Edit: Maybe you can use a 16bit (or similarly low bit) murmurhash version and improve the speed a bit?

I think "generate two 'unrelated' values from two very similar inputs" is the definition of what a hash function does. So it is unlikely we will discover a different way of doing this?

What surprised me while doing some benchmarking for #1760 is that FNV hash seems no faster than murmurhash?? Or at the very least my benchmark is too noisy to see a speed up. It seems like you'd be hard pressed to come up with an even simpler hashing scheme than FNV in terms of mathematical operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants