Quickstart

Francisco Zorrilla edited this page Apr 16, 2021 · 26 revisions

Automated installation

Clone this repository to your HPC or local computer:

git clone https://github.com/franciscozorrilla/metaGEM.git # Download metaGEM repo
cd metaGEM # Move into metaGEM directory
rm -r .git # Remove ~250 MB of unneeded git history files

Press y and Enter when prompted to remove write-protected files. These files are not needed and only take up disk space.

rm: remove write-protected regular file ‘.git/objects/pack/pack-f4a65f7b63c09419a9b30e64b0e4405c524a5b35.pack’? y
rm: remove write-protected regular file ‘.git/objects/pack/pack-f4a65f7b63c09419a9b30e64b0e4405c524a5b35.idx’? y
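If answering each prompt by hand is tedious, the force flag of rm skips the confirmation for write-protected files. The snippet below is a minimal sketch, not part of metaGEM itself; the function name remove_git_history is hypothetical, and it assumes you really do not need the git history afterwards.

```shell
# Sketch: delete the .git folder of a cloned repo without confirmation prompts.
# remove_git_history is a hypothetical helper name, not a metaGEM command.
remove_git_history() {
    local repo_dir="${1:-.}"            # default to the current directory
    # Only act when a .git directory is actually present
    if [ -d "$repo_dir/.git" ]; then
        rm -rf "$repo_dir/.git"         # -f: do not ask about write-protected files
    fi
}
```

From inside the metaGEM directory this would be called as remove_git_history with no arguments.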

Run the env_setup.sh script:

bash env_setup.sh # Run automated setup script

The env_setup.sh script prompts you to set up three conda environments (metagem, metawrap, and prokkaroary), which are activated as required by Snakemake jobs. You don't need to install everything right away: you can start processing your raw sequences with just the metagem conda environment installed.
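After the script finishes, one quick way to see which of the three environments are still missing is to filter the output of conda env list. This is a rough sketch that assumes the environments were created under exactly the names given above; missing_envs is a hypothetical helper, not something shipped with metaGEM.

```shell
# Sketch: print the expected conda environments that are NOT yet installed.
# Reads the output of `conda env list` on stdin; missing_envs is a
# hypothetical helper name.
missing_envs() {
    local listing env
    # Keep the first column (environment name) of every non-comment line
    listing="$(awk '!/^#/ && NF { print $1 }')"
    for env in metagem metawrap prokkaroary; do
        if ! printf '%s\n' "$listing" | grep -qx "$env"; then
            echo "$env"
        fi
    done
}
```

A typical call would be conda env list | missing_envs, which prints nothing once all three environments exist.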

Checking your installation

To make sure that the basics have been properly configured, you should run the check task using the metaGEM.sh parser:

bash metaGEM.sh -t check

This checks whether conda is installed and available, and verifies that the environments were properly set up by the env_setup.sh script. The check task also prompts you to create results folders if they are not already present. Finally, it checks whether any sequencing files are present in the dataset folder, prompting you either to organize existing files into sample-specific subfolders or to download a small toy dataset.

metaGEM expects data files to be organized into sample-specific subdirectories within the dataset folder:

dataset
└── {SAMPLE_ID}
    ├── {SAMPLE_ID}_R1.fastq.gz
    └── {SAMPLE_ID}_R2.fastq.gz

Note that this will be done automatically after downloading the toy dataset files.
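For your own data, the layout above can also be produced by hand. The following is a minimal sketch rather than the logic used by the check task: it assumes paired-end files named {SAMPLE_ID}_R1.fastq.gz and {SAMPLE_ID}_R2.fastq.gz sitting directly in the dataset folder, and organize_dataset is a hypothetical function name.

```shell
# Sketch: move paired-end FASTQ files into sample-specific subfolders.
# Assumes the {SAMPLE_ID}_R1.fastq.gz / {SAMPLE_ID}_R2.fastq.gz naming scheme;
# organize_dataset is a hypothetical helper, not a metaGEM command.
organize_dataset() {
    local dataset_dir="${1:-dataset}"
    local r1 sample
    for r1 in "$dataset_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                    # glob matched nothing
        sample="$(basename "$r1" _R1.fastq.gz)"     # strip the suffix to get the sample ID
        mkdir -p "$dataset_dir/$sample"
        mv "$r1" "$dataset_dir/$sample/"
        # Move the reverse-read file too, if it exists
        if [ -e "$dataset_dir/${sample}_R2.fastq.gz" ]; then
            mv "$dataset_dir/${sample}_R2.fastq.gz" "$dataset_dir/$sample/"
        fi
    done
}
```

Calling organize_dataset dataset from the metaGEM directory would then sort every top-level read pair into its own subfolder; files already inside subfolders are left alone.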

Config files

Make sure to inspect and set up the two config files to ensure smooth metaGEM runs:

Snakemake configuration

The config.yaml handles all the tunable parameters, subfolder names, paths, and more. Please refer to the config.yaml wiki page for a more in-depth look at this config file.

Cluster configuration

The cluster_config.json handles parameters for submitting jobs to the cluster workload manager. Please refer to the cluster_config.json wiki page for a more in-depth look at this config file.

Tools requiring additional configuration

Please note that you will need to set up the following tools/databases to run the complete core metaGEM workflow:

1. CheckM

CheckM is used extensively within the metaWRAP modules to evaluate the output of various intermediate steps. Although the CheckM package is installed in the metawrap environment, the user is required to download the CheckM database and run checkm data setRoot <db_dir> as outlined in the CheckM installation guide.
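As a rough illustration of the setRoot step, the wrapper below checks that the database folder exists and that checkm is reachable before calling it. It is a sketch under assumptions: set_checkm_root is a hypothetical helper, the database is assumed to be already downloaded and unpacked, and the metawrap environment is assumed to be active so that checkm is on the PATH.

```shell
# Sketch: point CheckM at an already downloaded and unpacked database folder.
# set_checkm_root is a hypothetical helper; the folder you pass is your own choice.
set_checkm_root() {
    local db_dir="$1"
    if [ -z "$db_dir" ] || [ ! -d "$db_dir" ]; then
        echo "CheckM database folder not found: $db_dir" >&2
        return 1
    fi
    if ! command -v checkm >/dev/null 2>&1; then
        echo "checkm not on PATH; activate the metawrap environment first" >&2
        return 1
    fi
    # The command named in the CheckM installation guide
    checkm data setRoot "$db_dir"
}
```

With the metawrap environment active, this would be called with the path to your database folder as its only argument.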

2. GTDB-Tk

GTDB-Tk is used for taxonomic assignment of MAGs, and requires a database to be downloaded and configured. Please refer to the GTDB-Tk installation documentation for detailed instructions.

3. CPLEX

Unfortunately, CPLEX cannot be installed automatically by the env_setup.sh script; you must install this dependency manually within the metagem conda environment. GEM reconstruction and GEM community simulations require the IBM CPLEX solver, which is free to download with an academic license. Refer to the CarveMe and SMETANA installation instructions for further information or troubleshooting. Note: CPLEX v.12.8 is recommended.
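Once CPLEX is installed, a simple sanity check is to see whether its Python module can be imported from the interpreter of the metagem environment. The sketch below assumes the CPLEX Python bindings are what the workflow needs and that the module is named cplex; check_cplex is a hypothetical helper, and it does not verify the CPLEX version or license.

```shell
# Sketch: test whether the CPLEX Python module can be imported.
# check_cplex is a hypothetical helper; pass a different interpreter as $1
# if "python" is not the one belonging to the metagem environment.
check_cplex() {
    local py="${1:-python}"
    if "$py" -c "import cplex" >/dev/null 2>&1; then
        echo "cplex: importable"
    else
        echo "cplex: not importable"
        return 1
    fi
}
```

Run it after conda activate metagem so that the default python resolves to the environment's interpreter.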