-
Notifications
You must be signed in to change notification settings - Fork 15
Building your database 1 blastp and blastn
PREREQUISITES: You must have already downloaded and formatted your data so that ITEP can read it. See "How to import genomes and format them for use with ITEP" for directions on how to do this. The only requirement is a concatenated Genbank file for each organism you wish to analyze.
The script that automatically runs BLASTP and BLASTN all vs. all for your organisms is setup_step1.sh. It is run as
$ ./setup_step1.sh NCORES
Note that this and the other setup scripts MUST be run while you are currently in $root or they will not work.
For example if you want BLAST to use 16 cores, run
$ ./setup_step1.sh 16
The script will do the following, in order:
- The script calls checkForDependencies.sh to ensure that required packages are installed and warn the user if optional (but still useful) packages are missing.
- The script first does consistency checks on the input files - checking to make sure the file names are in the correct format, that taxIDs and organism names are available for everything, that the format of the tab-delimited files in raw/ are correct, that every genbank file has a raw file, and several other sanity checks that should always pass if you ran convertGenbankToTable.py successfully on your input genbank files. (see "How to import genomes and format them for use with ITEP" for details)
- The script then takes the tab-delimited files and makes FASTA files (nucleotide and amino acid) for each of them, and compiles each one into a BLAST database.
- The scripts use the parallelization capabilities of Ruffus to run BLASTP and BLASTN all vs. all.
- The scripts calculate gene neighborhoods for each gene in every organism and cache them
- The results of all of these calculations are stored in a SQLite database that is created in $root/db/DATABASE.sqlite
Note: One reason that the FASTA files are kept separate rather than concatenating them before building the BLAST database is that it allows us to add new organisms to the ITEP database and not have to re-run BLAST all vs. all against all of the organisms that had already been run.
The E-value cutoffs for BLASTP and BLASTN default to 1E-5 and 1 respectively. To change them use the following syntax:
./setup_step1.sh NCORES [BLASTP_cutoff] [BLASTN_cutoff]
Both BLASTP cutoff and BLASTN cutoff are optional - but if you want to specify a BLASTN cutoff you must also specify a BLASTP cutoff. For example the following will cut off both BLASTP and BLASTN at 1E-10.
./setup_step1.sh 16 1E-10 1E-10
WARNING IF YOU ALREADY HAVE BLAST RESULTS BETWEEN A SET OF ORGANISMS AND SUBSEQUENTLY CHANGE THE CUTOFF IT WONT CHANGE ANYTHING because the already-existing files won't be touched. You need to go in and delete the files (in blastres/ or blastnres/) before the script will re-run them with the new cutoff. If you want to run a lot of BLASTP and BLASTN runs with a different cutoff other than the default, you might want to go into setup_step1.sh and change the values for BLASTP_EVALUE and BLASTN_EVALUE to the ones you want so you don't forget to do it every time.
With only three average-sized Bacterial genomes (as in this tutorial) and 9 cores (only 9 BLASTs need to be run) this step takes less than 10 minutes. The time goes up as N^2 where N is the number of organisms, so we recommend running it in a UNIX screen when you are analyzing many genomes. Of course it will also be slower if you are using larger genomes.
Any of the analysis scripts that do not depend on genome sequences, RPSBlast or clustering results can be run after completing this step.
The short version: Run the following command and it will load the BLAST data into the database:
$ ./setup_step1.sh NCORES
The longer version:
If, when running setup_step1.sh, the database loading fails with a SQLite error for any reason (e.g. full disk space), the data that was computed will not be loaded into the database. To fix this, first make sure that the problem that was reported by SQLite is no longer a problem, then run setup_step1.sh again. BLAST will not be rerun because the results files all already exist from the prior call. Instead, re-running setup_step1.sh will just skip that part and go right to the database loading.