Why is mmseqs search (without prefiltering) much slower than blastn? #852

Open
rmostowy opened this issue Jun 21, 2024 · 0 comments

Expected Behavior

I use MMseqs2 a lot, including as an alternative to blastn. Prefiltering hits, especially with large k values, makes a lot of sense because it reduces the run time compared to blastn.

Current Behavior

However, I have found that with prefiltering disabled (--prefilter-mode 2), mmseqs search is much slower than blastn. I would like to understand whether this is a consequence of the parameters I chose, in which case there may be a way to speed it up, or whether it is a more fundamental limitation.

Steps to Reproduce (for bugs)

Here is the code to reproduce my results.

BLASTN

DATA=file.fasta
mkdir -p blastn

# create DB
makeblastdb -in $DATA -dbtype nucl -out blastn/file.db

# run blastn (-task blastn, i.e. not megablast)
blastn -query $DATA -db blastn/file.db -out blastn/file.tsv -evalue 1e-3 -num_threads 8 -task blastn -max_target_seqs 10000 -outfmt '6 qseqid sseqid length mismatch pident nident qlen slen qstart qend sstart send positive ppos gaps'

MMSEQS

DATA=file.fasta
mkdir -p mmseqs
SCORING=blastn-scoring.out

# create DB
mmseqs createdb $DATA mmseqs/DB

# run nucleotide search with prefiltering disabled (--prefilter-mode 2), then convert to tabular output
mmseqs search mmseqs/DB mmseqs/DB mmseqs/resultDB mmseqs/tmp --search-type 3 --threads 8 --sub-mat "$SCORING" --seed-sub-mat "$SCORING" -s 7.5 --prefilter-mode 2 -e 1e-3
mmseqs convertalis mmseqs/DB mmseqs/DB mmseqs/resultDB mmseqs/result.m8

The resulting runtimes are roughly 30s for blastn and 4m for mmseqs. The files needed to reproduce this are provided as a ZIP.
input.zip
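For anyone re-running the comparison, here is a minimal sketch of how the two wall-clock times could be measured the same way. The helper `time_cmd` is a hypothetical name introduced for illustration, not something from blastn or MMseqs2, and it only resolves whole seconds, which is enough for a 30 s versus 4 min difference.

```bash
#!/usr/bin/env bash
# time_cmd is a hypothetical helper (an assumption, not part of any tool):
# it runs a command, discards the command's output, and prints
# "<label><TAB><elapsed whole seconds>".
time_cmd() {
    local label=$1
    shift
    local start end
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    printf '%s\t%d\n' "$label" "$((end - start))"
}

# Intended usage with the commands from this report (left commented out so
# the snippet can be sourced on a machine without either tool installed):
#   time_cmd blastn blastn -query "$DATA" -db blastn/file.db -out blastn/file.tsv \
#       -evalue 1e-3 -num_threads 8 -task blastn -max_target_seqs 10000
#   time_cmd mmseqs mmseqs search mmseqs/DB mmseqs/DB mmseqs/resultDB mmseqs/tmp \
#       --search-type 3 --threads 8 --prefilter-mode 2 -e 1e-3
```

The mmseqs timing should probably exclude `createdb` and `convertalis`, just as the blastn timing excludes `makeblastdb`, so that only the search step is compared.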

MMseqs Output (for bugs)

https://gist.github.com/rmostowy/f08c6389e9e04a380a03ffc03c3bfa85

Context

I want to know whether mmseqs search is a viable alternative to blastn for intermediate values of k (e.g. k=11), which should give accuracy comparable to blastn in a fraction of the run time.
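To make "comparable accuracy" concrete, one rough check is to count how many (query, target) pairs the two tools report in common. The sketch below assumes both result files are tab-separated with the query ID in column 1 and the target ID in column 2, which matches the `-outfmt` string above and the default `convertalis` layout, and it assumes both tools write the same sequence identifiers. `hit_pairs` and `compare_hits` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# hit_pairs (hypothetical helper): print the unique (query, target) pairs of
# a tab-separated hit table, sorted bytewise so that comm can consume them.
hit_pairs() {
    cut -f1,2 "$1" | LC_ALL=C sort -u
}

# compare_hits (hypothetical helper): print how many pairs occur only in the
# first file, only in the second file, and in both files.
compare_hits() {
    local a b
    a=$(mktemp)
    b=$(mktemp)
    hit_pairs "$1" > "$a"
    hit_pairs "$2" > "$b"
    # tr strips the padding that some wc implementations add around the count
    printf 'only_first\t%s\n'  "$(LC_ALL=C comm -23 "$a" "$b" | wc -l | tr -d '[:space:]')"
    printf 'only_second\t%s\n' "$(LC_ALL=C comm -13 "$a" "$b" | wc -l | tr -d '[:space:]')"
    printf 'shared\t%s\n'      "$(LC_ALL=C comm -12 "$a" "$b" | wc -l | tr -d '[:space:]')"
    rm -f "$a" "$b"
}

# Intended usage with the files from this report:
#   compare_hits blastn/file.tsv mmseqs/result.m8
```

This only compares which pairs were found, not alignment boundaries or scores, so it is a coarse sensitivity check rather than a full accuracy benchmark.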

Your Environment


  • Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): 15-6f452
  • Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): Homebrew
  • For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory): Apple M2 Max, 64GB of memory
  • Operating system and version: macOS 14.5 (23F79)