Searching for functions using conserved domains
WARNING: The VM distribution cannot run RPSBLAST against the full CDD because of the memory limitations of a 32-bit machine. (A 64-bit guest VM only works if your host computer has sufficient virtualization capabilities, which many do not.) As a result, main4.sh will fail unless you modify it to search only a particular database of interest (e.g. Pfam.pn instead of Cdd.pn) rather than the entire CDD.
If you are interested in conserved domains that match a particular description, you can search the descriptions with the db_getExternalClustersByDescription.py script. It takes any number of descriptions, matches them case-insensitively, and returns every CDD domain whose description matches. For example, to find domains related to biotin synthase (some descriptions have been truncated for readability):
$ db_getExternalClustersByDescription.py "biotin synthase"
30848 COG0502 BioB Biotin synthase and related enzymes [Coenzyme metabolism] 335
32586 COG2516 COG2516 Biotin synthase-related enzyme [General function prediction only] 339
178013 PLN02389 PLN02389 biotin synthase 379
180492 PRK06256 PRK06256 biotin synthase; Validated 336
180835 PRK07094 PRK07094 biotin synthase; Provisional 323
181453 PRK08508 PRK08508 biotin synthase; Provisional 279
185063 PRK15108 PRK15108 biotin synthase; Provisional 345
129447 TIGR00347 bioD dethiobiotin synthase. [description truncated] 166
200012 TIGR00433 bioB biotin synthase. [description truncated] 296
100105 cd01335 Radical_SAM Radical SAM superfamily. ... Examples are biotin synthase (BioB),... 204
198863 cl06149 BATS Biotin and Thiamin Synthesis associated domain. Biotin synthase (BioB), ... 0
148534 pfam06968 BATS Biotin and Thiamin Synthesis associated domain. Biotin synthase (BioB), EC:2.8.1.6, c... 93
205678 pfam13500 AAA_26 AAA domain. ... found in a number of proteins involved in cofactor biosynthesis such as dethiobiotin synthase ... 197
197846 smart00729 Elp3 Elongator protein 3, MiaB family, Radical SAM. This superfamily contains ... biotin synthase ... 216
197944 smart00876 BATS Biotin and Thiamin Synthesis associated domain... 94
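The case-insensitive matching described above can be illustrated with a minimal Python sketch. It is not the script's actual implementation: the `match_descriptions` helper, the plain substring test, and the interpretation of the columns (ID, accession, name, description, length) are assumptions made for illustration. The rows are copied from the example output.

```python
# Rows copied from the example output above.
# Assumed column meaning: (ID, accession, name, description, length).
ROWS = [
    ("30848", "COG0502", "BioB", "Biotin synthase and related enzymes [Coenzyme metabolism]", 335),
    ("178013", "PLN02389", "PLN02389", "biotin synthase", 379),
    ("129447", "TIGR00347", "bioD", "dethiobiotin synthase. [description truncated]", 166),
    ("148534", "pfam06968", "BATS",
     "Biotin and Thiamin Synthesis associated domain. Biotin synthase (BioB), EC:2.8.1.6, c...", 93),
]


def match_descriptions(rows, queries):
    """Return rows whose description contains any of the queries, ignoring case.

    Hypothetical helper: the real script may match differently
    (for example with SQL wildcards).
    """
    lowered = [query.lower() for query in queries]
    return [row for row in rows if any(query in row[3].lower() for query in lowered)]


for row in match_descriptions(ROWS, ["biotin synthase"]):
    print("\t".join(str(field) for field in row))
```

Note that "dethiobiotin synthase" is matched as well, because a substring test finds "biotin synthase" inside it. This is consistent with the TIGR00347 row appearing in the output above.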
You can also restrict the results to a specific database, such as Pfam here:
$ db_getExternalClustersByDescription.py "biotin synthase" -d pfam
148534 pfam06968 BATS Biotin and Thiamin Synthesis associated domain. Biotin synthase (BioB), EC:2.8.1.6, c... 93
205678 pfam13500 AAA_26 AAA domain. ... found in a number of proteins involved in cofactor biosynthesis such as dethiobiotin synthase ... 197
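As a rough illustration of restricting results to one database, the sketch below keeps only accessions that start with the database name. This prefix rule and the `in_database` helper are assumptions; how the script's `-d` option decides membership is not documented on this page.

```python
# Accessions copied from the example output earlier on this page.
ACCESSIONS = [
    "COG0502", "PRK06256", "TIGR00433", "cd01335",
    "pfam06968", "pfam13500", "smart00729",
]


def in_database(accession, database):
    """Assumed rule: an accession belongs to a database if it starts
    with the database name, compared case-insensitively."""
    return accession.lower().startswith(database.lower())


pfam_only = [acc for acc in ACCESSIONS if in_database(acc, "pfam")]
print(pfam_only)  # ['pfam06968', 'pfam13500']
```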
You can search for the conserved domains associated with a protein with the db_getExternalClusterGroups.py script, which takes a list of gene IDs from standard input and returns a list of RPSBLAST hits to the CDD.
The script can optionally append the cluster's name (e.g. BATS) or description to the results table, cut off results at an E-value lower than the default of 1E-5, or limit the printed results to those from a given conserved domain database (e.g. COG). See the script's help text for details.
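To make the effect of the E-value cutoff and the database restriction concrete, here is a small Python sketch. The hit tuples, the gene ID, the E-values, and the `filter_hits` helper are all made up for illustration; they are not output from the script, and whether the script treats a hit exactly at the cutoff as passing is an assumption.

```python
DEFAULT_EVALUE_CUTOFF = 1e-5  # default cutoff stated in the text above

# Hypothetical hits for illustration only: (gene ID, external cluster ID, E-value).
hits = [
    ("gene_A", "pfam06968", 3e-20),
    ("gene_A", "COG0502", 1e-7),
    ("gene_A", "smart00729", 2e-4),
]


def filter_hits(hits, cutoff=DEFAULT_EVALUE_CUTOFF, database=None):
    """Keep hits with E-value <= cutoff, optionally only from one database.

    Database membership is assumed to be an accession-prefix match.
    """
    kept = []
    for gene, cluster, evalue in hits:
        if evalue > cutoff:
            continue  # too weak: a larger E-value means a less significant hit
        if database is not None and not cluster.lower().startswith(database.lower()):
            continue  # hit belongs to a different database
        kept.append((gene, cluster, evalue))
    return kept


print(filter_hits(hits))                   # default cutoff drops the 2e-4 hit
print(filter_hits(hits, cutoff=1e-10))     # stricter cutoff keeps only the 3e-20 hit
print(filter_hits(hits, database="COG"))   # only the COG hit remains
```

Lowering the cutoff makes the filter stricter, since smaller E-values correspond to stronger hits.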
NOTE: The following error indicates that you have not run main4.sh (or that it failed):
Traceback (most recent call last):
  File "[directory]/src/db_getExternalClusterGroups.py", line 47, in <module>
    cur.execute(cmd, (geneid, ) )
sqlite3.OperationalError: no such table: rpsblast_results
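If you want to check for the missing table before running the script, a sketch along these lines may help. It uses Python's standard sqlite3 module and SQLite's built-in `sqlite_master` catalog. The `has_table` helper and the one-column schema are hypothetical, and the example uses an in-memory database only so that it runs on its own; the location of your actual database file is not given on this page.

```python
import sqlite3


def has_table(connection, table_name):
    """Return True if a table with this name exists in the SQLite database."""
    row = connection.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row is not None


# In-memory database so the example is self-contained; replace ":memory:"
# with the path to your own database file.
conn = sqlite3.connect(":memory:")
print(has_table(conn, "rpsblast_results"))  # False: table not created yet

# Hypothetical one-column schema, only to demonstrate the check.
conn.execute("CREATE TABLE rpsblast_results (geneid TEXT)")
print(has_table(conn, "rpsblast_results"))  # True
```

If the check returns False against your real database, run main4.sh (or fix whatever made it fail) before using db_getExternalClusterGroups.py.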
You can perform the reverse search (looking for proteins that match a domain, such as pfam00001) with the db_getHitsToExternalClusters.py script. It takes a list of external cluster IDs as input and returns a list of RPSBLAST hits to those external clusters (including names and descriptions).
You can visualize the locations and strengths (E-values) of a protein's hits to conserved domain databases with the db_displayExternalClusterHits.py script. It takes a list of gene IDs as input and produces a PNG file showing the position and name of each sufficiently strong hit to external domains relative to the gene (the strongest hits are at the bottom).