Searching for functions using conserved domains
WARNING: The 32-bit VM distribution cannot run RPSBLAST against the full CDD because of the memory limitations of a 32-bit machine. If you wish to analyze conserved domains with the VM, you should use the 64-bit VM.
We ran RPSBLAST on another machine with sufficient memory and copied the results over to the 32-bit VM so that you can still explore the functions.
If you are interested in conserved domains that match a particular description, you can search the descriptions with the db_getExternalClustersByDescription.py script. It takes any number of descriptions, matches them case-insensitively, and returns every CDD domain whose description matches. For example, if you are interested in biotin synthase, you can search for related domains as follows (some descriptions have been truncated for readability):
$ db_getExternalClustersByDescription.py "biotin synthase"
30848 COG0502 BioB Biotin synthase and related enzymes [Coenzyme metabolism] 335
32586 COG2516 COG2516 Biotin synthase-related enzyme [General function prediction only] 339
178013 PLN02389 PLN02389 biotin synthase 379
180492 PRK06256 PRK06256 biotin synthase; Validated 336
180835 PRK07094 PRK07094 biotin synthase; Provisional 323
181453 PRK08508 PRK08508 biotin synthase; Provisional 279
185063 PRK15108 PRK15108 biotin synthase; Provisional 345
129447 TIGR00347 bioD dethiobiotin synthase. [description truncated] 166
200012 TIGR00433 bioB biotin synthase. [description truncated] 296
100105 cd01335 Radical_SAM Radical SAM superfamily. ... Examples are biotin synthase (BioB),... 204
198863 cl06149 BATS Biotin and Thiamin Synthesis associated domain. Biotin synthase (BioB), ... 0
148534 pfam06968 BATS Biotin and Thiamin Synthesis associated domain. Biotin synthase (BioB), EC:2.8.1.6, c... 93
205678 pfam13500 AAA_26 AAA domain. ... found in a number of proteins involved in cofactor biosynthesis such as dethiobiotin synthase ... 197
197846 smart00729 Elp3 Elongator protein 3, MiaB family, Radical SAM. This superfamily contains ... biotin synthase ... 216
197944 smart00876 BATS Biotin and Thiamin Synthesis associated domain... 94
You can also specify that you only want results from a specific database, such as PFAM here:
$ db_getExternalClustersByDescription.py "biotin synthase" -d pfam
148534 pfam06968 BATS Biotin and Thiamin Synthesis associated domain. Biotin synthase (BioB), EC:2.8.1.6, c... 93
205678 pfam13500 AAA_26 AAA domain. ... found in a number of proteins involved in cofactor biosynthesis such as dethiobiotin synthase ... 197
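The matching behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script's actual implementation: the in-memory `CLUSTERS` list stands in for the database, the function name `search_clusters` is hypothetical, and treating the database filter as a prefix match on the cluster ID is an assumption.

```python
# Hypothetical in-memory stand-in for the external cluster table.
# Columns: CDD ID, cluster ID, name, description, length.
CLUSTERS = [
    ("30848", "COG0502", "BioB",
     "Biotin synthase and related enzymes [Coenzyme metabolism]", 335),
    ("148534", "pfam06968", "BATS",
     "Biotin and Thiamin Synthesis associated domain.", 93),
    ("235111", "PRK03202", "PRK03202",
     "6-phosphofructokinase; Provisional", 320),
]


def search_clusters(rows, terms, database=None):
    """Return rows whose description contains any term, ignoring case."""
    lowered = [term.lower() for term in terms]
    hits = []
    for row in rows:
        cluster_id, description = row[1], row[3]
        # Assumption: a cluster belongs to a database if its ID starts
        # with the database name (e.g. "pfam06968" belongs to "pfam").
        if database and not cluster_id.lower().startswith(database.lower()):
            continue
        if any(term in description.lower() for term in lowered):
            hits.append(row)
    return hits


matches = search_clusters(CLUSTERS, ["biotin synthase"])
pfam_only = search_clusters(CLUSTERS, ["biotin"], database="pfam")
```

Lowercasing both the query and the description is the simplest way to get case-insensitive substring matching; the real script may delegate this to the database instead.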
You can search for the conserved domains associated with a protein with the db_getExternalClusterGroups.py script, which takes a list of genes from standard input and returns a list of RPSBLAST hits to the CDD. Doing this for our favorite 6-phosphofructokinase gene gives the following set of conserved domains:
$ echo "fig|290402.1.peg.4768" | db_getExternalClusterGroups.py
fig|290402.1.peg.4768 235111 63.64 319 115 1 1 318 1 319 2e-157 550.0 PRK03202
fig|290402.1.peg.4768 238388 58.68 317 131 0 2 318 1 317 2e-123 437.0 cd00763
fig|290402.1.peg.4768 213713 63.12 301 108 2 3 300 1 301 1e-119 425.0 TIGR02482
fig|290402.1.peg.4768 223283 51.47 340 143 6 1 318 2 341 1e-111 398.0 COG0205
fig|290402.1.peg.4768 109425 58.99 278 111 2 2 276 1 278 2e-110 394.0 pfam00365
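Each row lists the gene ID first, followed by the CDD ID of the external cluster, the percent identity, further alignment metrics (the columns appear to follow the standard BLAST tabular layout: alignment length, mismatches, gap openings, and the start and end positions in the query and in the domain), the E-value (e.g. 2e-157), the bit score, and the cluster's ID in its source database. Here the strongest hit is to PRK03202.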
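If you want to post-process these rows yourself, a small parser is enough. The sketch below assumes whitespace-delimited rows with the 13 columns shown above. The field names between the percent identity and the E-value are borrowed from the BLAST tabular convention and are an assumption, as are the names `Hit` and `parse_hit`.

```python
from collections import namedtuple

# Field names after "pident" and before "evalue" are assumed from the
# BLAST tabular convention; they are not given by the script's output.
Hit = namedtuple(
    "Hit",
    "gene cdd_id pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore cluster_id",
)


def parse_hit(line):
    """Convert one whitespace-delimited result row into a Hit."""
    f = line.split()
    return Hit(
        f[0], f[1], float(f[2]),
        int(f[3]), int(f[4]), int(f[5]),
        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
        float(f[10]), float(f[11]), f[12],
    )


rows = [
    "fig|290402.1.peg.4768 238388 58.68 317 131 0 2 318 1 317 2e-123 437.0 cd00763",
    "fig|290402.1.peg.4768 235111 63.64 319 115 1 1 318 1 319 2e-157 550.0 PRK03202",
    "fig|290402.1.peg.4768 109425 58.99 278 111 2 2 276 1 278 2e-110 394.0 pfam00365",
]
hits = [parse_hit(row) for row in rows]

# The strongest hit is the one with the smallest E-value.
best = min(hits, key=lambda hit: hit.evalue)
```

Selecting by smallest E-value reproduces the observation above that PRK03202 is the strongest hit for this gene.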
The script can optionally append the cluster's name (e.g. BATS) or description to the results table, apply an E-value cutoff stricter than the default of 1E-5, or limit the printed results to a given conserved domain database (e.g. COG). See the script's help text for details.
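The effect of these two filters can be sketched as follows. This is a stand-alone illustration rather than the script's own code: `filter_hits` is a hypothetical helper, and matching a database by cluster ID prefix is an assumption. Only the default cutoff of 1E-5 comes from the text above.

```python
DEFAULT_EVALUE = 1e-5  # default cutoff stated in the text


def filter_hits(hits, max_evalue=DEFAULT_EVALUE, database=None):
    """Keep (cluster_id, evalue) pairs that pass both filters."""
    kept = []
    for cluster_id, evalue in hits:
        if evalue > max_evalue:
            continue  # weaker than the requested cutoff
        # Assumption: the database is identified by the ID prefix.
        if database and not cluster_id.upper().startswith(database.upper()):
            continue
        kept.append((cluster_id, evalue))
    return kept


hits = [
    ("PRK03202", 2e-157),
    ("cd00763", 2e-123),
    ("COG0205", 1e-111),
    ("pfam00365", 2e-110),
]
strict = filter_hits(hits, max_evalue=1e-120)
cog_only = filter_hits(hits, database="COG")
```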
You can also get information about a particular cluster after the fact using db_getExternalClustersById.py:
$ echo "PRK03202" | db_getExternalClustersById.py
235111 PRK03202 PRK03202 6-phosphofructokinase; Provisional 320
NOTE: If you get the following error, it indicates that you have not run setup_step4.sh (or that it failed):
Traceback (most recent call last):
  File "[directory]/src/db_getExternalClusterGroups.py", line 47, in <module>
    cur.execute(cmd, (geneid, ) )
sqlite3.OperationalError: no such table: rpsblast_results
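To confirm whether the table is missing before running the scripts, you can query SQLite's schema table directly. The sketch below uses an in-memory database so that it runs anywhere. The location of the real database file is not given here, so substitute your own path. The helper name `has_table` and the column in the `CREATE TABLE` statement are hypothetical.

```python
import sqlite3


def has_table(conn, table_name):
    """Return True if the SQLite database contains the named table."""
    cur = conn.cursor()
    cur.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    )
    return cur.fetchone() is not None


# In-memory database for illustration; connect to your own database
# file instead to check a real installation.
conn = sqlite3.connect(":memory:")
before = has_table(conn, "rpsblast_results")

# Hypothetical one-column table, created only to show the check
# flipping to True; the real table schema is not documented here.
conn.execute("CREATE TABLE rpsblast_results (querygene TEXT)")
after = has_table(conn, "rpsblast_results")
conn.close()
```

If the check returns False against your database, rerun setup_step4.sh as described in the note above.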
You can perform the reverse search (looking for proteins matching a domain, such as pfam00001) using the db_getHitsToExternalClusters.py function. It takes a list of external cluster IDs as input and returns a list of RPSBLAST hits to those external clusters (including names and descriptions).
$ echo "PRK03202" | db_getHitsToExternalClusters.py
fig|290402.1.peg.581 235111 28.12 352 165 20 35 368 33 314 6e-28 121.0 235111 PRK03202 PRK03202 6-phosphofructokinase; Provisional 320
fig|290402.1.peg.992 235111 43.42 357 165 5 5 361 1 320 4e-123 437.0 235111 PRK03202 PRK03202 6-phosphofructokinase; Provisional 320
fig|290402.1.peg.4768 235111 63.64 319 115 1 1 318 1 319 2e-157 550.0 235111 PRK03202 PRK03202 6-phosphofructokinase; Provisional 320
fig|386415.1.peg.406 235111 62.07 319 120 1 1 318 1 319 3e-153 537.0 235111 PRK03202 PRK03202 6-phosphofructokinase; Provisional 320
fig|931626.1.peg.1249 235111 52.47 324 143 4 1 318 2 320 4e-126 447.0 235111 PRK03202 PRK03202 6-phosphofructokinase; Provisional 320
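Because the output mixes genes from several organisms, it can help to group the hits by organism. The sketch below assumes gene IDs shaped like `fig|<organism>.peg.<number>`, as in the rows above, and that the E-value is the eleventh whitespace-delimited column. The helper names are hypothetical, and the sample rows are cut off after the bit score for brevity.

```python
from collections import defaultdict


def organism_of(gene_id):
    """Extract the organism ID, assuming 'fig|<organism>.peg.<n>'."""
    return gene_id.split("|", 1)[1].split(".peg.", 1)[0]


def group_by_organism(lines):
    """Map organism ID -> list of (gene ID, E-value) pairs."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        groups[organism_of(fields[0])].append((fields[0], float(fields[10])))
    return dict(groups)


lines = [
    "fig|290402.1.peg.581 235111 28.12 352 165 20 35 368 33 314 6e-28 121.0",
    "fig|290402.1.peg.992 235111 43.42 357 165 5 5 361 1 320 4e-123 437.0",
    "fig|290402.1.peg.4768 235111 63.64 319 115 1 1 318 1 319 2e-157 550.0",
    "fig|386415.1.peg.406 235111 62.07 319 120 1 1 318 1 319 3e-153 537.0",
    "fig|931626.1.peg.1249 235111 52.47 324 143 4 1 318 2 320 4e-126 447.0",
]
groups = group_by_organism(lines)

# Best (smallest E-value) gene per organism.
best_per_organism = {
    org: min(pairs, key=lambda pair: pair[1])[0]
    for org, pairs in groups.items()
}
```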
You can visualize the locations and strengths (E-values) of the hits from a given protein to conserved domain databases using the db_displayExternalClusterHits.py script. It takes a list of gene IDs as input and produces a PNG file showing the position and name of each sufficiently strong hit to external domains in relation to the gene (the strongest hits are at the bottom).
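The layout idea (one track per hit, placed along the gene, strongest at the bottom) can be illustrated with a plain-text stand-in. This is not how the script draws its PNG. The function `render_tracks`, the track width, and the gene length of 320 residues are all assumptions made for the example.

```python
def render_tracks(gene_length, hits, width=40):
    """Render one text track per hit.

    hits: list of (name, qstart, qend, evalue) with 1-based, inclusive
    positions on the gene.
    """
    rows = []
    # Weakest first, so the strongest hit (smallest E-value) is last.
    for name, qstart, qend, evalue in sorted(
        hits, key=lambda hit: hit[3], reverse=True
    ):
        start = (qstart - 1) * width // gene_length
        end = min(width, max(start + 1, qend * width // gene_length))
        track = "." * start + "=" * (end - start) + "." * (width - end)
        rows.append(track + " " + name)
    return rows


hits = [
    ("PRK03202", 1, 318, 2e-157),
    ("pfam00365", 2, 276, 2e-110),
    ("COG0205", 1, 318, 1e-111),
]
# 320 is an assumed gene length used only for scaling.
tracks = render_tracks(320, hits)
print("\n".join(tracks))
```

Sorting by descending E-value before drawing is what puts the strongest hit on the bottom row.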