-
Notifications
You must be signed in to change notification settings - Fork 15
Obtaining a list of bidirectional best blast hits
Bidirectional-best blast hits (BBH) are two proteins in different organisms that, when BLAST is run against all proteins in the other organism, hit each other as the best hit. That a pair of proteins is a bidirectional-best BLAST hit is often used as evidence that the two proteins are orthologs (having a common ancestor without duplication).
ITEP currently includes a single script to compute the bidirectional-best hits for a set of organisms in the database: db_bidirectionalBestHits.py. The script supports different metrics for what is considered the "best" hit:
- Best e-value (default, and what people usually use)
- Highest minbit score
- Highest maxbit score
To get a list of ALL BBH with ITEP gene IDs run:
$ db_bidirectionalBestHits.py > all_bbh
This is the last two lines of the file:
$ cat all_bbh | tail -2
fig|931626.1.peg.1355 fig|386415.1.peg.951 Clostridium novyi NT 59.5229 58.5229
fig|386415.1.peg.1323 fig|386415.1.peg.1323 Clostridium novyi NT 200.0000 200.0000
The columns are: query gene, target gene, target organism, score (log-E value) from query to target and score from target to query. (Note - the exact scores might change from BLAST version to version)
To specify minbit or maxbit scores, just use the -m flag (-m minbit or -m maxbit)
You can also specify subsets of organisms from which to get bidirectional best hits with two different methods. If you have already specified a cluster group and run clustering, you can specify a run ID with the -r flag to restrict calculation of BBH to only organisms in that cluster run. Alternatively, you can generate a list of organisms by any other means and use the -f flag to pass that list of organisms into the function. For example, doing this would generate bidirectional best hits for Clostridium species only.
$ cat organisms | grep Clostridium | db_getBidirectionalBestHits.py -f -
It might be desired to use locus tags instead of gene IDs for all of these bidirectional best hits. If this is the case for you, use the replaceGeneNamesWithAliases.py function to turn them into aliases (you will need a regex that locus tags match in your organisms - e.g. Cbei_\d+ for C. beijerinckii, NT01CX_\d+ for C. novii NT, and Awo_c\d+ for A. woodii). Prepare a table (using the aliases file) containing only these IDs:
$ cat $root/aliases/aliases | grep -P "(Cbei_\d+)|(NT01CX_\d+)|(Awo_c\d+)" > locus_tags
Then use it as input to the replaceGeneNamesWithAliases.py function. The same two lines as shown above for the bidirectional-best hits are repeated below after replacement.
$ cat all_bbh | replaceGeneNamesWithAliases.py -n locus_tags | tail -2
Awo_c0013055 NT01CX_087651 Clostridium novyi NT 59.5229 58.5229
NT01CX_0868323 NT01CX_0868323 Clostridium novyi NT 200.0000 200.0000
Note that this same function can be used with ANY script that has gene IDs in it. However, note that if you do this you will no longer be able to link those IDs to other ITEP analyses.