rkmh Read Classification by MinHash

Eric T. Dawson edited this page Feb 19, 2018 · 1 revision

Welcome to the rkmh wiki!

For questions and comments, please post an issue. You can also contact me by email, though I prefer to keep discussion on GitHub so that open development works as designed.

Getting started

rkmh should be very easy to build as long as your compiler supports C++11 (clang 3.8 or newer; gcc 4.9 or newer).

git clone --recursive https://github.com/edawson/rkmh
cd rkmh
make

This should build the backing mkmh, murmurhash3 and kseq_reader libraries and produce the rkmh executable.

Running
./rkmh or ./rkmh -h

Either command should print a list of subcommands and their descriptions. The currently available subcommands are:

  • hash - Generate the 64-bit hashes of the input sequences. Optionally, rkmh can skip hashing and output the kmers themselves.

  • stream - Compare a set of reads against a set of references and output a file that maps each read name to the reference it most resembles.

  • filter - Given a set of reads, a set of references and a threshold N for the minimum number of matches, return all the reads that share at least N hashes with any reference.

  • call - Call SNVs against a reference. Multiple references are permitted, but we don't recommend using more than one at the moment.
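
To make the idea behind hash, stream and filter concrete, here is a minimal Python sketch of MinHash-style read classification. It is an illustration under stated assumptions, not rkmh's implementation: blake2b stands in for murmurhash3, strand handling is ignored, and the function names, the kmer size, the sketch size and the min_matches parameter are all hypothetical.

```python
import hashlib


def kmers(seq, k):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """64-bit hash of a kmer (blake2b is a stand-in, not murmurhash3)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=4, size=8):
    """Bottom-`size` MinHash sketch: the smallest distinct kmer hashes."""
    return set(sorted({hash64(km) for km in kmers(seq, k)})[:size])


def classify(read, refs, k=4, size=8, min_matches=1):
    """Return (best reference name, shared hash count).

    Returns (None, 0) when no reference shares at least `min_matches`
    hashes with the read, which mirrors the thresholding idea of filter.
    """
    read_sketch = sketch(read, k, size)
    best_name, best_shared = None, 0
    for name, ref_seq in refs.items():
        shared = len(read_sketch & sketch(ref_seq, k, size))
        if shared > best_shared:
            best_name, best_shared = name, shared
    if best_shared < min_matches:
        return None, 0
    return best_name, best_shared


# Hypothetical toy data: two short references and one read.
refs = {"refA": "ACGTACGTTGCA", "refB": "GGGGCCCCGGGG"}
name, shared = classify("ACGTACGTTGCA", refs)
print(name, shared)
```

Raising min_matches makes the classification stricter: a read that shares too few hashes with every reference is reported as unclassified instead of being assigned to its nearest reference.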

Soon to be deprecated:

  • classify - do the same as stream but require exact counts when using minimum / maximum occurrence filters (i.e. use a std::map instead of a lock-free hashtable that permits collisions). Collision rates tend to be low, and we plan to give stream the option to do this soon.
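
The trade-off between exact counts and a collision-permitting table can be sketched in Python as follows. This is only an illustration of the general idea: the fixed-size list below is a hypothetical stand-in for a hashtable that permits collisions, it does not model lock-free updates, and none of the names come from rkmh.

```python
from collections import Counter


def exact_counts(hashes):
    """Exact occurrence counts keyed by the full hash (like a map)."""
    return Counter(hashes)


def bucketed_counts(hashes, n_buckets):
    """Approximate counts in a fixed-size table.

    Distinct hashes that land in the same bucket share one counter,
    so a count can be overestimated but never underestimated.
    """
    table = [0] * n_buckets
    for h in hashes:
        table[h % n_buckets] += 1
    return table


def passes_filter(count, min_occ, max_occ):
    """Minimum / maximum occurrence filter on a single count."""
    return min_occ <= count <= max_occ


# Hypothetical hashes: 3 and 11 collide when there are 8 buckets.
hashes = [3, 3, 11, 7]
exact = exact_counts(hashes)
approx = bucketed_counts(hashes, n_buckets=8)

print(exact[11], approx[11 % 8])
```

In this toy case hash 11 occurs once, but its bucket also holds the two occurrences of hash 3, so a minimum-occurrence filter of 2 would reject it with exact counts and accept it with the bucketed table.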

Coming soon:
Consider the following functions experimental, unstable, and as yet unsupported (but with support coming in the next few months):

  • count - count the number of times a kmer occurs in a query file, and return a two-column file mapping each kmer to its number of occurrences. We'd recommend using Jellyfish for this if your genomes are more than a few kb, as it's a fantastic kmer counter.
  • search - Given a list of kmers or hashes, find reads in the query set that contain them. Much like filter but designed as a step in a different workflow (stay tuned).
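
As a rough picture of what count is meant to produce, the following Python sketch counts kmers and formats them as a two-column table. The tab separator, the sorted output order and the function names are assumptions made for illustration, and this is not rkmh's code.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count every k-length substring across all input sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def to_two_column(counts):
    """Format counts as 'kmer<TAB>count' lines, sorted by kmer."""
    return "\n".join(f"{kmer}\t{n}" for kmer, n in sorted(counts.items()))


# Hypothetical toy input: a single 8-base sequence.
counts = count_kmers(["ACGTACGT"], k=4)
table = to_two_column(counts)
print(table)
```

An in-memory counter like this is only practical for small inputs, which matches the advice above to use Jellyfish for anything beyond a few kb.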