Installing ITEP on your machine
ITEP only runs on Linux (several of its dependencies, such as MCL, are Linux-only). There is a virtual machine available (linked from our homepage at http://price.systemsbiology.net/itep) which includes all of the dependencies - see Using the ITEP virtual machine for details. The VM also includes a copy of ITEP that contains the genomes and a pre-built ITEP database used to write this tutorial (a small subset of the organisms used for the larger analyses described in the manuscript). The VM can be run on any operating system (it has been tested on VirtualBox but can likely also be used with other virtualization software).
ITEP is hosted on GitHub, so you will need to install git on your machine first. On Ubuntu this is simply:
$ sudo apt-get install git
(you will need to be an administrator to do this). You will also need to create a GitHub account and upload your SSH public key to GitHub (see GitHub's help for details on how to set this up).
Once you have done this, create a folder for ITEP and navigate to it. Then run
$ git clone git@github.com:mattb112885/clusterDbAnalysis
Type your passphrase if you have one attached to your SSH key.
You will need to set the execute ("x") bit on all of the Python and shell scripts in the repo (EXCEPT for SourceMe.sh) if it isn't already set.
Make sure all the .py and .sh files in the src/ directory are executable:
$ chmod u+x src/*.py
$ chmod u+x src/*.sh
Do the same with the src/internal and src/utilities directories and the scripts/ directory:
$ chmod u+x src/internal/*.py
$ chmod u+x src/internal/*.sh
$ chmod u+x src/utilities/*.py
$ chmod u+x src/utilities/*.sh
$ chmod u+x scripts/*.py
$ chmod u+x scripts/*.sh
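The per-directory commands above can also be collapsed into a single loop. The following is a minimal sketch and not part of ITEP: the function name make_scripts_executable is hypothetical, and it assumes you pass it the top-level directory of your clone.

```shell
# Hypothetical helper (not shipped with ITEP): set the user execute bit on
# every .py and .sh file in the script directories listed above.
make_scripts_executable() {
    local root=${1:-.}
    local dir
    for dir in src src/internal src/utilities scripts; do
        # Skip directories that do not exist in this checkout.
        [ -d "$root/$dir" ] || continue
        # -maxdepth 1 mirrors the non-recursive globs (src/*.py and so on);
        # SourceMe.sh is excluded because it is meant to be sourced, not run.
        find "$root/$dir" -maxdepth 1 -type f \
            \( -name '*.py' -o -name '*.sh' \) \
            ! -name 'SourceMe.sh' -exec chmod u+x {} +
    done
}

# Example, run from the top of the clone:
# make_scripts_executable .
```

Directories that are missing are skipped silently, so the same call should be safe to repeat after pulling new changes.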
Note the directory that contains the SourceMe.sh file (I'll call this $SOURCEDIR). Using your favorite editor (I use emacs), open up your .bashrc file (~/.bashrc) and add the following line to the bottom:
source $SOURCEDIR/SourceMe.sh
Save the file, then back in the shell enter the following command:
$ source ~/.bashrc
That's it! Now you can access all of the scripts in ITEP and the libraries are also accessible to Python commands.
If you are running multiple copies of ITEP on your machine, you might want to source the SourceMe.sh of the copy you want at the start of each session instead of adding this line to your .bashrc, to make sure you are using the right copy (run db_getItepRoot.py if you want to verify which copy you are using at any given moment).
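If you prefer to script the .bashrc change rather than edit the file by hand, something like the following could be used. It is only a sketch under stated assumptions: the function name add_itep_source_line is hypothetical, the first argument is the directory called $SOURCEDIR above, and the optional second argument is the startup file to modify (defaulting to ~/.bashrc).

```shell
# Hypothetical helper (not shipped with ITEP): append the "source" line for
# SourceMe.sh to a startup file unless that exact line is already there.
add_itep_source_line() {
    local sourcedir=$1
    local rcfile=${2:-$HOME/.bashrc}
    local line="source $sourcedir/SourceMe.sh"
    # -x matches whole lines only, -F treats the text as a literal string.
    if ! grep -qxF -- "$line" "$rcfile" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rcfile"
    fi
}

# Example:
# add_itep_source_line /path/to/clusterDbAnalysis
# source ~/.bashrc
```

Because the line is only appended when it is missing, running the helper twice should not leave duplicate entries in the file.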
The following is a complete list of direct dependencies for ITEP. Which dependencies you want to install depends on which parts of the code you want to run. You will need Python (2.6 or 2.7) and some form of BASH, both of which typically come with Linux systems. For easy installation you'll also want to install setuptools (using $ sudo apt-get install python-setuptools).
ITEP will automatically check for dependencies when you run setup_step1.sh but you can check yourself by running this:
$ ./checkForDependencies.sh
This script will throw errors for missing required packages and warn you about missing packages that are not required but still useful. (Note that you will get warnings on the VM because we cannot install everything due to licensing restrictions. All dependencies are free as in beer, so you can go to their websites and download them if you need them.) At the moment the script does not check for the correct versions - this will be fixed soon.
- NCBI BLAST+: Download from the NCBI website at ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ . 2.2.28+ is known to work. Note that the version of BLAST in the current Ubuntu repositories is NOT BLAST+. RPSBLAST will only work with version 2.2.28 or later. See warning below.
- MCL: Clustering tool that is the workhorse of the toolkit. Download from the MCL website (http://micans.org/mcl/) and follow their directions to install.
- Sqlite: Download from the SQLite website (https://www.sqlite.org/) and follow their directions to install. As explained in the warning below, it is quite possible that the sqlite install that comes with your distro is too old and has bugs that could impact the toolkit's integrity. ITEP is known to work with 3.7.15.2 and presumably later versions. NOTE - you may need to run ldconfig after upgrading to get the binary to link to the right version of the libraries, due to an sqlite install bug.
- Python: If this doesn't come with your distro, run $ sudo apt-get install python2.6 (or python2.7)
(Python packages)
- Biopython: Used for writing and reading files and some visualizations. Version 1.61 or later is required for correctly parsing Genbank files from certain sources (see notes below). $ sudo easy_install -f http://biopython.org/DIST/ biopython. Note that this won't remove the old version so you'll have to do that manually if you have an old version on your machine.
- ETE: Used for visualizing and manipulating phylogenetic trees. Recommend using the latest version (2.1 as of the time of this writing). $ sudo easy_install -U ete2
- NumPy/SciPy: Used for many miscellaneous computations. $ sudo apt-get install python-numpy python-scipy (note many distributions already have these)
- Ruffus: Used to parallelize BLAST and RPSBLAST computations. $ sudo easy_install -U ruffus
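To see quickly which of the Python packages above are already importable, a small loop like the one below may help. This is a sketch rather than anything shipped with ITEP: the helper names and the PYTHON variable are hypothetical, and the import names used (Bio for Biopython, ete2 for ETE) are the names those packages are imported under.

```shell
# Interpreter to test; override with e.g. PYTHON=python2.7 (illustrative
# variable, not something ITEP itself reads).
PYTHON=${PYTHON:-python}

# Succeeds if the named module can be imported by $PYTHON.
have_python_module() {
    "$PYTHON" -c "import $1" >/dev/null 2>&1
}

# Print one "found" or "missing" line per module name given.
report_python_modules() {
    local mod
    for mod in "$@"; do
        if have_python_module "$mod"; then
            echo "found:   $mod"
        else
            echo "missing: $mod"
        fi
    done
}

# Import names for the packages listed above.
report_python_modules Bio ete2 numpy scipy ruffus
```

A module reported as missing only means that this particular interpreter cannot import it, so check that PYTHON points at the Python you intend to run ITEP with.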
NOTE: Depending on your version of Ubuntu, the Sqlite that comes with it might be too old. Older versions fail on data of the size you will get from ITEP with significant numbers of genomes, and one of them also had a string-handling bug that makes comparisons incorrect. If this is the case, download and compile the latest version from the SQLite website instead.
NOTE: The latest version of Biopython (1.61 as of the time of this writing) fixes a bug in reading genbank files from some sources like JGI. If you get errors from Biopython related to reading the genbank files, try upgrading your Biopython (and make sure to remove the old version too). Note that the version in the Ubuntu repos as of the time of this writing was 1.60 (which has the aforementioned bug) so you will need to download from their website or use easy_install to get the latest version.
NOTE: The latest versions of NCBI's CDD will ONLY compile with the newest RPSBLAST (so make sure you get the latest version of BLAST+). Unfortunately, NCBI did not change the name of the RPSBLAST program when changing the syntax and input formats. If you have both installed, type "rpsblast -help" before attempting to run setup_step4.sh and make sure that the name "rpsblast" refers to the new version of RPSBLAST and not an old one. Only the new one will actually show help with this command; the old one requires you to use --help.
The following packages are used only in small numbers of scripts and can often be substituted with other programs that input and output the same file formats (e.g. FASTA / Newick) or that interface with other databases.
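Since the dependency check does not yet compare versions, the minimum versions mentioned in the notes above can be checked by hand. The following is a rough sketch, not part of ITEP: the function names are hypothetical, it relies on GNU sort -V for the comparison, and it assumes that sqlite3 -version prints the version number first and that the BLAST+ rpsblast prints a first line of the form "rpsblast: 2.2.28+" for -version.

```shell
# version_ge A B: succeeds if dotted version A is at least B (GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# report_version LABEL FOUND MINIMUM: print whether FOUND meets MINIMUM.
report_version() {
    if [ -n "$2" ] && version_ge "$2" "$3"; then
        echo "$1 $2: OK (need $3 or later)"
    else
        echo "$1 ${2:-unknown}: not confirmed as $3 or later"
    fi
}

# Only query tools that are actually on the PATH.
if command -v sqlite3 >/dev/null 2>&1; then
    report_version sqlite3 "$(sqlite3 -version | awk '{print $1}')" 3.7.15.2
fi
if command -v rpsblast >/dev/null 2>&1; then
    # Assumed output format; an old rpsblast may not understand -version.
    found=$(rpsblast -version 2>/dev/null | awk 'NR==1 {print $2}' | tr -d '+')
    report_version rpsblast "$found" 2.2.28
fi
```

If a tool reports its version in a different format than assumed here, the script falls back to "not confirmed" rather than claiming the version is fine.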
- Alignments and trees
FastTreeMP: Download from http://www.microbesonline.org/fasttree/ and follow directions to compile. Note that the default program name used in ITEP scripts is "FastTreeMP".
FastTree comparison tools: Download from http://www.microbesonline.org/fasttree/treecmp.html and add the Perl scripts to your PERL5LIB (you will also need Phylip to use them).
MAFFT: Download from http://mafft.cbrc.jp/alignment/software/source.html and follow directions to compile.
RAxML: Clone it from GitHub at https://github.com/stamatak/standard-RAxML and follow the directions to compile it. Alternatively (if you don't need PTHREADS for parallelization) just install it from the Ubuntu repo (sudo apt-get install raxml). Note that the default program name is "raxml-PTHREADS".
PAL2NAL: This is useful for converting a protein alignment to a nucleotide alignment (which is needed for things like dN/dS analysis) and is required by one script if you ask it for a codon-correct nucleotide alignment of a set of genes in the database. Download it from http://www.bork.embl.de/pal2nal/
- Needed to download genomes from RAST
MyRast: Download from http://blog.theseed.org/servers/installation/distribution-of-the-seed-server-packages.html and follow directions to compile.
Note that MyRast requires you to have perl5 on your machine (often this is included in your distribution).
- Ortholog computation
OrthoMCL: Download it from http://orthomcl.org/common/downloads/software/ (we support v2.0). However, see the notes below.
NOTE: If you want to run OrthoMCL you need to install it and its dependencies, which include MySQL, Perl 5, and the Perl packages DBI and DBD::mysql. If you don't have the two Perl packages, you can install them (as root) using CPAN:
$ sudo cpan DBI
$ sudo cpan DBD::mysql
OrthoMCL support is still somewhat beta. In particular, it will hit memory limitations and fail on smaller data sets than MCL by itself. Consider yourself warned.
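Before running CPAN, you can check whether the two Perl packages named above are already installed. This is a minimal sketch, not part of ITEP or OrthoMCL; the function name have_perl_module is hypothetical, and the check simply asks Perl to load each module.

```shell
# Succeeds if Perl can load the named module (e.g. DBI or DBD::mysql).
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

for mod in DBI DBD::mysql; do
    if have_perl_module "$mod"; then
        echo "found:   $mod"
    else
        echo "missing: $mod (try: sudo cpan $mod)"
    fi
done
```

A module is also reported as missing when Perl itself is not installed, so check for perl first if both lines say "missing".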
- Python packages for visualization
matplotlib: Used for some plotting functions. $ sudo apt-get install python-matplotlib
networkx: Used to make GML files for network visualization. $ sudo apt-get install python-networkx
PyQt4: Required to visualize trees with ETE. Download it and its dependencies from http://www.riverbankcomputing.com/software/pyqt/download and follow directions to install (note - this may or may not be installed when you install ETE with easy_install)
ReportLab: Needed for Biopython visualizations. Available from https://pypi.python.org/pypi/reportlab
ReportLab fonts: If you get an error about missing fonts from ReportLab, download them from http://www.reportlab.com/ftp/fonts/pfbfer.zip , unzip and place in ~/fonts/
- Other
xlwt: Python package for saving .xls files (required in some scripts if you want to save Excel files). Any script that has this option can also save the results as flat text.