-
Notifications
You must be signed in to change notification settings - Fork 15
Building your database 1 blastp and blastn
The script that automatically runs BLASTP and BLASTN all vs. all for your organisms is main.sh. It is run as
$ ./main.sh [NCORES]
Note that this and the other "main" scripts MUST be run while you are currently in $root or they will not work. This requirement might be fixed in a later version.
For example if you have 16 cores, run
$ ./main.sh 16
The script will do the following, in order:
- The script first does consistency checks on the input files - checking to make sure the file names are in the correct format, that taxIDs and organism names are available for everything, that the format of the tab-delimited files in raw/ are correct, that every genbank file has a raw file, and several other sanity checks that should always pass if you ran convertGenbankToTable.py successfully on your input genbank files. (see "How to import genomes and format them for use with ITEP" for details)
- The script then takes the tab-delimited files and makes FASTA files (nucleotide and amino acid) for each of them, and compiles each one into a BLAST database.
- The scripts use the parallelization capabilities of Ruffus to run BLASTP and BLASTN all vs. all.
- The scripts calculate gene neighborhoods for each gene in every organism and cache them
- The results of all of these calculations are stored in a SQLite database that is created in $root/db/DATABASE.sqlite
Note: One reason that the FASTA files are kept separate rather than concatenating them before building the BLAST database is that it allows us to add new organisms to the ITEP database and not have to re-run BLAST all vs. all against all of the organisms that had already been run.
The E-value cutoffs for BLASTP and BLASTN default to 1E-5 and 1 respectively. To change them use the following syntax:
./main.sh [Ncores] (BLASTP_cutoff) (BLASTN_cutoff)
(Both BLASTP cutoff and BLASTN cutoff are optional - but if you want to specify a BLASTN cutoff you must also specify a BLASTP cutoff). For example the following will cut off both BLASTP and BLASTN at 1E-10.
./main.sh 16 1E-10 1E-10
However, NOTE that IF YOU ALREADY HAVE BLAST RESULTS BETWEEN A SET OF ORGANISMS AND SUBSEQUENTLY CHANGE THE CUTOFF IT WONT CHANGE ANYTHING because the already-existing files won't be touched. You need to go in and delete the files (in blastres/ or blastnres/) before the script will re-run them with the new cutoff. If you want to run a lot of BLASTP and BLASTN runs with a different cutoff other than the default, you might want to go into main.sh and change the values for BLASTP_EVALUE and BLASTN_EVALUE to the ones you want so you don't forget to do it every time.
With only three genomes (as in this tutorial) and 9 cores (only 9 BLASTs need to be run) this step takes less than 10 minutes. The time goes up as N^2 where N is the number of organisms, so we recommend running it in a UNIX screen when you are analyzing many genomes.
Any of the analysis scripts that do not depend on genome sequences, RPSBlast or clustering results can be run after completing this step.
The short version: Run the following command and it will load the BLAST data into the database:
$ ./main.sh [any number]
The longer version:
If, when running main.sh, the database loading fails with a SQLite error for any reason (e.g. full disk space), the data that was computed will not be loaded into the database. To fix this, first make sure that the problem that was reported by SQLite is no longer a problem, then run main.sh again. BLAST will not be rerun because the results files all already exist from the prior main.sh call. Instead, re-running main.sh will just skip that part and go right to the database loading.