The search
command performs batch homology searching on input protein sequences against a reference sequence database, and filters hits based on alignment metrics and taxonomic information.
hgtector search -i sample.fa -o <output_dir> <parameters...>
The simplest (but not recommended) usage of the search
command is to have it communicate with the NCBI server to complete searching, self-aligning, sequence and taxonomy information retrieval. This does not require any local database or compute resource. The command is as simple as:
hgtector search -i sample.fa -o .
In this recommended mode, the program requires an external aligner to execute searching. Two common aligners are supported: DIAMOND (Buchfink et al., 2015) and BLASTp (Altschul et al., 1990). We recommend DIAMOND for normal use cases, since it is much faster whereas retaining comparable accuracy versus the classical BLAST.
The program also needs a reference protein sequence database, and a taxonomy database. Please refer to the database section.
hgtector search -i sample.faa.gz -o . -m diamond -p 32 -d <diamond_db> -t <taxdump_dir>
One output file, sample.tsv
is created for each sample to host basic properties and hit table for each protein. The header section contains the following keys:
ID, Length, Product, Score, Hits
The hit table has the following columns, which are a subset of the standard BLAST output metrics:
Code | Description |
---|---|
qaccver |
Query accesion.version |
saccver |
Subject accession.version |
pident |
Percentage of identical matches |
evalue |
Expect value |
bitscore |
Bit score |
qcovhsp |
Query coverage per HSP |
staxids |
Unique subject taxonomy ID(s) |
Multiple search parameters can be specified (see below for details). Here is an example command:
hgtector search ... --maxhits 500 --minsize 50 --evalue 1e-20 --identity 50 --coverage 50 --maxseqs 1000 --extrargs "--sensitive --range-culling"
Note that maxhits
and maxseqs
are different. The later controls how many hits to return in a DIAMOND/BLAST search; the former controls how many hits to preserve after filtering.
Multiple parameters are available to control the inclusion and exclusion of taxonomic groups. For example:
hgtector search ... --tax-include 2,2157 --tax-exclude 1117,766 --tax-unirank genus --tax-block environmental,uncultured,...
This will limit the search result within domains Bacteria (TaxID: 2) and Archaea (TaxID: 2157), but exclude Cyanobacteria (TaxID: 1117) and Rickettsiales (TaxID: 766). It will only retain the top hit within each genus. It will also remove hits with taxon names reading "environmental XXX", "uncultured XXX", etc.
In some cases, sequence homology search has already been executed for other tasks in the research project, and it is favorable to re-use the search result to ensure consistency and to save compute.
hgtector search -i sample.faa -m precomp -s sample.blast6out -o . -t <taxdump_dir> [--taxmap taxon.map.gz]
One still needs the taxonomy database. If the pre-computed search result does not contain subject TaxIDs, an additional taxon map is also needed. See database.
The native HGTector-style hit table can be manually generated using the following commands:
BLAST:
blastp -query sample.fa -db <blast_db> -outfmt "6 qaccver saccver pident evalue bitscore qcovhsp staxids"
DIAMOND:
diamond blastp --query sample.fa --db <diamond_db> --outfmt 6 qseqid sseqid pident evalue bitscore qcovhsp staxids
But this is not the typical command one would execute before meeting HGTector. In many cases, the standard tabular output (-outfmt 6
) is used:
diamond blastp --query sample.fa --db <diamond_db> --outfmt 6
blastp -query sample.fa -db <blast_db> -outfmt 6
HGTector automatically recognizes and parses this format. Note that subject TaxID is not part of the standard tabular output, so a taxon map file is necessary in this case.
HGTector computes the bit score of each protein sequence aligned against itself, and uses it as a baseline for normalizing bits scores of subjects. Several methods are available for doing this self-alignment, and the priority is automatically determined by default. The typical scenario is as follow (in case one wants to override):
hgtector search ... -m diamond --aln-method native
This will use the same method (DIAMOND) for both homology search and self-alignment.
Other options are:
lookup
: The search result already contains a self-hit. This only works if the query protein is also in the database.
fast
: Use a built-in implementation of a simplified BLAST algorithm to calculate this score. The result should be identical to the result of DIAMOND in its default cost scoring setting, but it is slightly different from the result of modern releases of BLAST. If the native method fails (it happens occassionally), the program will automatically use this algorithm instead.
precomp
: Provide precomputed self-alignment results, in a simple format of protein <tab> score
. For example:
hgtector search ... -m precomp -s sample.b6 --aln-method precomp --aln-precomp sample.txt
One can perform search on multiple samples under the same directory:
hgtector search -i <input_dir> -o <output_dir> [-s <precompute_dir>] <parameters...>
This will generate one output file per input file. The filename without extension (e.g., "Bin1" of Bin1.faa
) will be considered as sample ID and carried to the output file (Bin1.tsv
).
Option | Default | Description |
---|---|---|
-i , --input |
- | A file or a directory of files which store query protein sequences in multi-Fasta format. Compressed files (.gz , .bz2 , .xz and .lz ) are supported. Plain lists of protein IDs without sequences are supported if the IDs can be found in the database. |
-o , --output |
- | Directory where search results are to be saved. |
-m , --method |
auto | Search method (auto, diamond, blast, remote, precomp). |
-s , --precomp |
- | File or directory of precomputed search results (when method = precomp). |
Option | Default | Description |
---|---|---|
-d , --db |
- | Reference protein sequence database. |
-t , --taxdump |
- | Directory of taxonomy database files (nodes.dmp and names.dmp ). |
--taxmap |
- | Sequence ID to taxID mapping file (not necessary if protein database already contains taxonomy). |
Option | Default | Description |
---|---|---|
-k , --maxhits |
0 | Maximum number of hits to preserve per query (0 for unlimited) |
--minsize |
30 | Minimum length of query sequence (aa). |
--queries |
(depends) | Number of queries per run (0 for whole sample). |
--maxchars |
(depends) | Maximum number of characters per run (0 for unlimited). |
Option | Default | Description |
---|---|---|
--maxseqs |
500 | Maximum number of sequences to return. |
--evalue |
1e-5 | Maximum E-value cutoff. |
--identity |
0 | Minimum percent identity cutoff. |
--coverage |
0 | Minimum percent query coverage cutoff. |
--extrargs |
(depends) | Extra arguments for choice of search method. This can be command-line arguments for DIAMOND or BLAST, or URL API for remote search. |
Option | Default | Description |
---|---|---|
--tax-include |
- | Include taxa under those TaxIDs (a comma-delimited string, or a file of one TaxID per line). |
--tax-exclude |
- | Exclude taxa under those TaxIDs (a comma-delimited string, or a file of one TaxID per line). |
--tax-unique |
yes | Ignore more than one hit with same TaxID. |
--tax-unirank |
- | Ignore more than one hit under same taxon at this rank (species, genus, family, etc.). Recommended if the database is highly taxonomically imbalanced (e.g., contains thousands of E. coli strains). |
--tax-capital |
yes | Ignore taxon names that are not capitalized. |
--tax-latin |
no | Ignore species names that are not Latinate. |
--tax-block |
unknown, uncultured, unidentified, unclassified, unresolved, environmental, plasmid, vector, synthetic, phage |
Ignore taxon names containing any of these words. |
Option | Default | Description |
---|---|---|
-p , --threads |
0 | Number of threads (0 for all CPU cores). |
--tmpdir |
- | Temporary directory. |
--diamond |
diamond | diamond executable. |
--blastp |
blastp | blastp executable. |
--blastdbcmd |
blastdbcmd | blastdbcmd executable. |
Option | Default | Description |
---|---|---|
--algorithm |
kmerBlastp | Remote search algorithm (blastp, psiBlast, deltaBlast, kmerBlastp (a.k.a. Quick BLASTP), phiBlast, etc., see here). |
--retries |
5 | Maximum number of retries per search. |
--delay |
60 | Seconds between two search requests. |
--timeout |
900 | Seconds before program gives up waiting. |
--entrez |
all [filter] NOT(environmental samples[filter] OR metagenomes[orgn]) txid131567[orgn] | Entrez query text. |
--server |
link | Remote search server URL. |
Option | Default | Description |
---|---|---|
--aln-method |
auto | Self-alignment method (auto, native, fast, lookup, precomp). |
--aln-precomp |
- | File or directory of precomputed sequence Id to score maps (when self-alignment method = precomp). |
--aln-server |
link | Remote align server URL. |
Option | Default | Description |
---|---|---|
--fetch-enable |
auto | Whether to enable remote fetch (yes, no, auto). |
--fetch-queries |
100 | Maximum number of query entries per search. |
--fetch-retries |
3 | Maximum number of retries per search. |
--fetch-delay |
5 | Seconds between two fetch requests. |
--fetch-timeout |
60 | Seconds before program gives up waiting. |
--fetch-server |
link | Remote fetch server URL. |