1. Installation and Setup

Rauf Salamzade edited this page Dec 22, 2020 · 1 revision

Installing the seQuoia framework is somewhat involved and depends on eight conda environments.

Four of these encapsulate multiple programs and their dependencies:

  • main_env : This is the main environment which will be used to run the framework itself as well as certain modules within it.
  • ont_env : This environment is used to install software for the Hybrid Assembly workflow.
  • hut_env : This environment is used to install software from the Huttenhower lab's biobakery.
  • shiny_env : This environment is used to install R and other dependencies for the seeQc Shiny application.

Whereas the remaining four are focused on specific software packages:

  • ARIBA_env : The environment for running ARIBA for finding genes/sequences of interest and performing MLST analysis from raw sequencing reads directly.
  • MultiQC_env : The environment for running a fork of the MultiQC report generation suite called seQc_MultiQC.
  • StrainGE_env : The environment for running the StrainGE suite, specifically StrainGST to identify what the closest known/public strains are to each sample.
  • GAEMR_env : The environment for running the GAEMR suite for assembly QC. This will involve the installation of several dependencies.

The yaml files for creating seven of the eight environments can be found in /path/to/seQuoia/conda_environment_ymls/. A yaml file is not currently provided for the GAEMR environment, which requires the careful installation of dependencies, including NCBI's large nt database.
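The environments can be created one by one with conda. A minimal sketch follows, assuming each environment is created under a common parent directory and named after its yaml file; the function name create_sequoia_envs and the target directory are illustrative and not part of seQuoia.

```shell
# Sketch: create one conda environment per yaml file found in a directory.
# Assumptions (not from seQuoia itself): environments live under a common
# parent directory and each one is named after its yaml file.
create_sequoia_envs() {
    yml_dir=$1      # e.g. /path/to/seQuoia/conda_environment_ymls
    env_root=$2     # hypothetical parent directory for the new environments
    for yml in "$yml_dir"/*.yml "$yml_dir"/*.yaml; do
        [ -e "$yml" ] || continue            # skip unmatched glob patterns
        name=$(basename "$yml")
        name=${name%.*}                      # strip the .yml/.yaml extension
        # --file and --prefix are standard options of `conda env create`
        conda env create --file "$yml" --prefix "$env_root/$name" || return 1
    done
}

# Usage:
# create_sequoia_envs /path/to/seQuoia/conda_environment_ymls /path/to/conda_envs
```

Creating environments by prefix keeps their locations explicit, which is convenient later when the environment paths have to be listed in conda_environments.txt.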

Additional Configuration of Conda Environments

It is necessary to manually update certain environment variables after creating some of the conda environments.

To do this please check out the section on "Saving Environment Variables" in the guide at:

https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#saving-environment-variables

Main Environment

For the main environment (main_env), please add the scripts folder in the primary directory of the seQuoia package to the PATH variable:

export PATH=$PATH:/path/to/seQuoia/scripts/
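To make such an export persist across activations, the conda guide linked above describes activation and deactivation hook scripts inside the environment directory. The following is a rough sketch of that approach; the function name save_path_in_env and the SEQUOIA_OLD_PATH variable are made up for illustration, and the function overwrites any existing env_vars.sh hooks.

```shell
# Sketch: persist a PATH addition for a conda environment via activation hooks.
# The etc/conda/activate.d and deactivate.d locations follow the conda guide;
# the function name and the SEQUOIA_OLD_PATH variable are illustrative only.
save_path_in_env() {
    env_prefix=$1   # environment directory, e.g. "$CONDA_PREFIX" of main_env
    extra_dir=$2    # directory to append, e.g. /path/to/seQuoia/scripts/
    mkdir -p "$env_prefix/etc/conda/activate.d" "$env_prefix/etc/conda/deactivate.d"
    # Sourced on activation: remember the old PATH, then extend it.
    cat > "$env_prefix/etc/conda/activate.d/env_vars.sh" <<EOF
export SEQUOIA_OLD_PATH="\$PATH"
export PATH="\$PATH:$extra_dir"
EOF
    # Sourced on deactivation: restore the previous PATH.
    cat > "$env_prefix/etc/conda/deactivate.d/env_vars.sh" <<'EOF'
export PATH="$SEQUOIA_OLD_PATH"
unset SEQUOIA_OLD_PATH
EOF
}

# Usage (with main_env activated):
# save_path_in_env "$CONDA_PREFIX" /path/to/seQuoia/scripts/
```

The same pattern should carry over to the PATH and PYTHONPATH settings of the other environments described on this page.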

Additionally, since this is the environment which runs executables in the bin/ directory, please also activate this environment and within the seQuoia primary directory run:

python setup.py install
pip install -e .

The setup installation will create bash scripts for loading relevant conda environments and executing programs in /path/to/seQuoia/seQuoia/external_wrappers/. To ensure this works as intended, update the files:

  • path_to_conda_installation.txt : A one-line file containing only the path to the Miniconda/Anaconda bin/ directory.
  • conda_environments.txt : A tab separated file with two to three columns: (1) conda installation name, (2) conda installation path, and optionally (3) whether R is a component of the conda installation.
  • prog_to_environment.txt : A tab separated file with three columns: (1) external program as it would be called on the command line, (2) name of conda installation it is found in, and (3) the version of the program installed.
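Because the two tab-separated files are edited by hand, a quick column-count check can catch stray spaces or missing tabs. This is only a sketch based on the column counts listed above; the function name check_tsv_columns is hypothetical, and it assumes blank lines are allowed and should be ignored.

```shell
# Sketch: check that every non-empty line of a tab-separated file has an
# allowed number of columns. Function name and messages are illustrative.
check_tsv_columns() {
    file=$1; min=$2; max=$3
    awk -F '\t' -v min="$min" -v max="$max" '
        NF == 0 { next }                       # ignore blank lines
        NF < min || NF > max {
            printf "%s: line %d has %d column(s)\n", FILENAME, NR, NF
            bad = 1
        }
        END { exit bad + 0 }                   # non-zero exit if any line failed
    ' "$file"
}

# Usage, run from the directory that holds the two files:
# check_tsv_columns conda_environments.txt 2 3
# check_tsv_columns prog_to_environment.txt 3 3
```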

The pip installation should also enable the proper importation of seQuoia functions within various programs and allow users to incorporate seQuoia OOP constructs in their own programs if they choose to develop in the environment.

If using the hybrid assembly workflow is of interest, please edit the pilon installation in this conda environment's bin/ directory to allow for greater memory usage. This process is described here:

https://github.com/broadinstitute/pilon/issues/26

GAEMR Environment

For the GAEMR environment (GAEMR_env), please edit the PATH and PYTHONPATH variables.

export PATH=/path/to/GAEMR/bin/:$PATH
export PYTHONPATH=/path/to/GAEMR/:$PYTHONPATH

Some plotting scripts require R, so it is also necessary to either add an existing R installation to the PATH or install R in the conda environment.

StrainGE Environment

For the StrainGE environment (StrainGE_env), please edit the PATH and PYTHONPATH variables.

export PATH=/path/to/StrainGE/bin/:$PATH
export PYTHONPATH=/path/to/StrainGE/:$PYTHONPATH

Huttenhower Biobakery Environment

While the three programs from the biobakery suite can each be easily downloaded using conda, additional steps must be performed for two of them.

MetaPhlAn2:

While installation with conda works for the executables, the databases must be obtained by first downloading the full software package (1.2 Gb), which includes them. These database files must then be moved to /path/to/hut_env/bin/metaphlan_databases/. Things should work automatically after that.

ShortBRED:

Usearch must be downloaded from https://www.drive5.com/usearch/manual/install.html.

Afterwards, place the program in /path/to/hut_env/bin/, change permissions using chmod to make it executable for the intended users, and create a symlink named usearch that points to the downloaded executable, whose file name likely includes the version.
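The three steps can be sketched as a small shell function. The name install_usearch and the example file name in the usage line are hypothetical; the sketch copies the binary rather than moving it and makes it executable for all users, which may be broader than needed on a shared system.

```shell
# Sketch: install a downloaded usearch binary into an environment's bin/.
# The function name and the example file name below are hypothetical.
install_usearch() {
    downloaded=$1   # path to the downloaded, version-named usearch binary
    bin_dir=$2      # e.g. /path/to/hut_env/bin
    exe=$(basename "$downloaded")
    cp "$downloaded" "$bin_dir/$exe" || return 1
    chmod a+x "$bin_dir/$exe"                 # executable for all users; tighten if needed
    # Relative symlink so the program can be called simply as `usearch`.
    ( cd "$bin_dir" && ln -sf "$exe" usearch )
}

# Usage:
# install_usearch ~/Downloads/usearch_versioned_binary /path/to/hut_env/bin
```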

Setup and Install a Fork of MultiQC Configured for Compatibility with seQuoia

To ensure MultiQC runs smoothly and interprets sample naming properly, we forked and adapted the versatile and fantastic MultiQC suite for generating detailed QC reports.

First you must set up the conda environment using the provided yaml file. Afterwards, activate the environment, clone the git repository for seQc_MultiQC, and install:

source activate /path/to/seQc_MultiQC_env/
git clone git@github.com:broadinstitute/seQc_MultiQC.git
cd seQc_MultiQC/
python setup.py install

Log Details of Conda Environments

Once all your conda environments are set up, they will probably not match ours exactly, depending on when and on which platform you created them. Therefore, it is important to update the logged information on each conda environment. The most recent catalog of versioning will be copied over to the output directory of each subsequent seQuoia sheppard analysis.

To catalog the current structure of environments, enter the environment_provenance/ subdirectory and run the python script updateVersioningInfo.py.

Update hUGE.py Configuration

This will likely be necessary to allow for smooth running on various server/HPC systems. Configuring will require familiarity with the Python language.

A final necessity is to update the hUGE.py class structure in /seQuoia/ to match your server/HPC setup. Running with -c/--cluster set to UGE will probably not work out of the box. We found this framework provided an easier way to run the workflow but understand that it has clear drawbacks when setting up seQuoia on a new system. While we will not be actively supporting seQuoia moving forward, next steps would likely have involved creating separate environments for each of the tasks/modules and building workflows in existing management frameworks such as Snakemake, Nextflow, or WDL.

Notes on Hybrid Assembly workflow

Unlike the rest of the six workflows, the Hybrid Assembly workflow is non-linear and is DAG-like in structure to allow for multiple modules to be executed simultaneously. It might be necessary to update the updateProgress(self) function to reflect how to properly parse results from qstat or squeue.
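To illustrate the kind of parsing involved, here is a rough shell sketch of extracting a job's state from scheduler output. It is not seQuoia's updateProgress(self) implementation, which is written in Python; the function names and the finished_or_unknown placeholder are invented for this example, and the qstat column positions assume the default listing layout (job ID in column 1, state in column 5), which may differ on your cluster.

```shell
# Sketch: report a job's state from scheduler output.
# Column positions assume the default `qstat` layout (job ID in column 1,
# state in column 5); function names are illustrative, not seQuoia's.
qstat_job_state() {
    # Reads a qstat listing on stdin; $1 is the job ID to look up.
    awk -v id="$1" '$1 == id { print $5; found = 1; exit }
                    END { if (!found) print "finished_or_unknown" }'
}

squeue_job_state() {
    # -h drops the header, -j selects the job, -o '%T' prints the full state name.
    state=$(squeue -h -j "$1" -o '%T' 2>/dev/null)
    echo "${state:-finished_or_unknown}"
}

# Usage:
# qstat | qstat_job_state 231
# squeue_job_state 231
```

A job that no longer appears in the listing has either finished or failed, so the caller still needs to inspect the job's output to tell the two apart.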

Additionally, two parameters for the Hybrid Assembly workflow have options which can only be turned on when run on Broad servers, due to local software dependencies not yet made public.