Building your database 4 - RPSBLAST
NOTE: You should perform at least [step 1](Building your database 1 - BLASTP and BLASTN) of building the database before proceeding to this step. This step is only required if you want to analyze the similarity to conserved domains (e.g. PFAM, COG, etc.) in your organisms.
WARNING: The VM distribution cannot run RPSBLAST against the full CDD because of the memory limitations of a 32-bit machine (64-bit VMs only work if your host computer has sufficient virtualization capabilities, which many do not). If you run setup_step4.sh on the VM, it will therefore fail unless you modify it to search only a particular database of interest (e.g. Pfam.pn instead of Cdd.pn) rather than the entire CDD.
WARNING: You need a recent enough version of RPSBLAST (the one that comes with BLAST+ 2.2.28 or later) for this to work. NCBI changed the syntax of RPSBLAST relatively recently but did not change the name of the program. If you type "rpsblast -h", it should list options similar to those of the BLASTP/BLASTN programs (e.g. -query) rather than old-style single-character options.
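The check described in the warning can be scripted. The following is a minimal sketch, not part of ITEP: the function name `has_new_style_options` is hypothetical, and it simply looks for `-query` in the help text, as the warning suggests.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ITEP): reads help text on stdin and
# succeeds if it mentions the new-style "-query" option.
has_new_style_options() {
    grep -q -- '-query'
}

if command -v rpsblast >/dev/null 2>&1; then
    # Capture both stdout and stderr, since old and new programs may
    # print their help text to different streams.
    if rpsblast -h 2>&1 | has_new_style_options; then
        echo "rpsblast accepts new-style options such as -query"
    else
        echo "rpsblast looks like the old-style program; upgrade BLAST+" >&2
    fi
else
    echo "rpsblast was not found on your PATH" >&2
fi
```

If the second message appears, install a newer BLAST+ and make sure its `rpsblast` comes first on your PATH before running the setup script.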
The setup_step4.sh script runs RPSBLAST for all of your input genomes against every conserved domain in the NCBI CDD database, using Ruffus to parallelize the calls. It takes about 1 hour for every N genomes (where N is the number of cores assigned), so it is highly recommended to run it in a UNIX screen session.
Run setup_step4.sh as follows (it must be run from $root like all the other setup scripts):
$ ./setup_step4.sh [NCORES]
e.g. ./setup_step4.sh 16 if you have 16 cores.
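Because the run can take hours, it helps to start it inside screen so that it survives a dropped connection. The sketch below is one possible way to do that; the session name `itep_step4` and the function `start_step4_in_screen` are arbitrary, hypothetical names, and the function assumes it is called from $root.

```shell
#!/usr/bin/env bash
# Sketch: launch setup_step4.sh inside a detached screen session.
# SESSION is an arbitrary, hypothetical session name.
SESSION="itep_step4"

start_step4_in_screen() {
    local ncores="$1"
    # -d -m starts the session detached; -S gives it a name to reattach to.
    screen -dmS "$SESSION" ./setup_step4.sh "$ncores"
}

# Usage, from $root:
#   start_step4_in_screen 16   # start the run on 16 cores
#   screen -ls                 # list sessions to confirm it is running
#   screen -r itep_step4       # reattach; detach again with Ctrl-a d
```

A session started this way closes when the script exits, so reattach while it is running if you want to watch its output.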
The script does the following:
- If you have not run the script before, downloads a copy of the NCBI CDD database and unpacks it in the expected location.
- Runs RPSBLAST against all of the conserved domains in the CDD. This includes PFAM, COG, TIGRFAM, SMART, and others.
- Imports conserved domain data and the homology results into the ITEP database.
After running this setup, you can use several ITEP scripts to query for the conserved domain hits to proteins, get visualizations of the locations of those hits, and search for domains whose definitions match certain strings (such as "biotin" or "Ack"). See other tutorials for details.
NOTE: If you want to run RPSBLAST without putting the results in the database, you can set up a folder with FASTA files in it and an RPSBLAST database, and then run Rpsblast_all_vs_one.py.
The short version: if loading the RPSBLAST results into the database failed, re-run the following command and it will load the RPSBLAST data into the database:
$ ./setup_step4.sh [any number]
The longer version:
If the database loading step of setup_step4.sh fails with a SQLite error for any reason (e.g. the disk is full, or the database is locked because someone else is using it), the RPSBLAST data that was computed will not be loaded into the database. To fix this, run setup_step4.sh again. RPSBLAST will not be rerun (as long as you do not delete the rpsblast_res folder) because the result files already exist from the previous call. Re-running setup_step4.sh will skip that part and go straight to the database loading.
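The reason a re-run is cheap is the familiar "skip work whose output already exists" pattern. The sketch below only illustrates that idea and is not ITEP's code (the real script relies on Ruffus); `run_search` and `search_if_missing` are hypothetical names, and `run_search` is a stand-in for the actual RPSBLAST call.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real RPSBLAST call: writes a result file.
run_search() {
    echo "results for $1" > "$2"
}

# Run the search for one input only if its result file is missing or empty.
search_if_missing() {
    local input="$1" result="$2"
    if [ -s "$result" ]; then
        echo "skip $input"
    else
        run_search "$input" "$result"
        echo "ran $input"
    fi
}
```

Calling `search_if_missing` twice with the same arguments does the work on the first call and skips it on the second, which is why deleting the result folder forces everything to be recomputed.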