Quickstart
❗Please refer to the installation one-liner in the main README or to the detailed setup guide for the recommended installation❗
Clone this repository to your HPC or local computer:
git clone https://github.com/franciscozorrilla/metaGEM.git # Download metaGEM repo
cd metaGEM # Move into metaGEM directory
rm -r .git # Remove ~250 Mb of unneeded git history files
Press y and Enter when prompted to remove write-protected files. These files are not needed and only take up disk space.
rm: remove write-protected regular file ‘.git/objects/pack/pack-f4a65f7b63c09419a9b30e64b0e4405c524a5b35.pack’? y
rm: remove write-protected regular file ‘.git/objects/pack/pack-f4a65f7b63c09419a9b30e64b0e4405c524a5b35.idx’? y
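If you would rather not confirm each prompt, forcing the removal is one option. The following is a minimal sketch, assuming you are certain you do not need the git history; remove_git_history is a hypothetical helper name and not part of metaGEM:

```shell
# Hypothetical helper (not part of metaGEM): delete the .git folder of a
# cloned repository without being prompted for write-protected files.
remove_git_history() {
    repo_dir="${1:-.}"
    if [ -d "$repo_dir/.git" ]; then
        # -f suppresses the per-file confirmation prompts shown above
        rm -rf -- "$repo_dir/.git"
    fi
}

# Usage, from inside the metaGEM directory:
# remove_git_history .
```

Only the .git folder is touched; all other files in the repository stay in place.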
Run the env_setup.sh script:
bash env_setup.sh # Run automated setup script
This env_setup.sh script will prompt you to set up 4 conda environments in the envs/ folder:
- mamba: Only used for installing mamba and setting up subsequent environments from recipe files
- metagem: Contains most metaGEM core workflow tools (Python 3)
- metawrap: Contains only metaWRAP and its dependencies (Python 2)
- prokkaroary: Contains bonus tools
Don't worry, you don't need to install everything right away. You can already start processing your raw sequences with just the metagem conda env installed.
If you run into issues with the automated installation please refer to the manual installation page.
To make sure that the basics have been properly configured, run the check task using the metaGEM.sh parser:
bash metaGEM.sh -t check
This will check if conda is installed/available and verify that the environments were properly set up by the env_setup.sh script.
Additionally, this check function will prompt you to create results folders if they are not already present.
Finally, this task will check if any sequencing files are present in the dataset folder, prompting the user either to organize existing files into sample-specific subfolders or to download a small toy dataset.
The conda environments will be set up under the envs/ folder:
envs/
├── mamba/
├── metagem/
├── metawrap/
└── prokkaroary/
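As a quick manual complement to the check task, you can list which of these folders are still missing. This is a minimal sketch based only on the folder names in the tree above; missing_envs is a hypothetical helper, not a metaGEM command, and it does not verify that the environments themselves are functional:

```shell
# Hypothetical helper (not part of metaGEM): print the name of every expected
# environment folder that does not exist under the given envs directory.
missing_envs() {
    envs_dir="${1:-envs}"
    for env in mamba metagem metawrap prokkaroary; do
        [ -d "$envs_dir/$env" ] || echo "$env"
    done
}

# Usage, from the metaGEM directory:
# missing_envs envs
```

No output means all four folders are present. Since only metagem is needed to start processing reads, a non-empty list is not necessarily a problem.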
metaGEM expects data files to be organized into sample-specific subdirectories within the dataset folder. This is done automatically after downloading the toy dataset files. Alternatively, users can dump all fastq files in the dataset folder and run the metaGEM task organizeData:
bash metaGEM.sh --task organizeData
This is how the dataset folder should look:
dataset/
├── {SAMPLE ID 1}/
│   ├── {SAMPLE ID 1}_R1.fastq.gz
│   └── {SAMPLE ID 1}_R2.fastq.gz
├── {SAMPLE ID 2}/
│   ├── {SAMPLE ID 2}_R1.fastq.gz
│   └── {SAMPLE ID 2}_R2.fastq.gz
├── {SAMPLE ID 3}/
│   ├── {SAMPLE ID 3}_R1.fastq.gz
│   └── {SAMPLE ID 3}_R2.fastq.gz
└── ...
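To confirm that a dataset folder follows the layout above, you can look for samples that lack one of the two read files. This is a minimal sketch that assumes paired-end data named as shown; incomplete_samples is a hypothetical helper, not a metaGEM task:

```shell
# Hypothetical helper (not part of metaGEM): for every sample subfolder,
# print the expected read files that are not present.
incomplete_samples() {
    dataset_dir="${1:-dataset}"
    for sample_dir in "$dataset_dir"/*/; do
        [ -d "$sample_dir" ] || continue    # dataset folder has no subfolders
        sample=$(basename "$sample_dir")
        for mate in 1 2; do
            # sample_dir already ends with a slash because of the glob
            [ -f "${sample_dir}${sample}_R${mate}.fastq.gz" ] \
                || echo "${sample}_R${mate}.fastq.gz"
        done
    done
}

# Usage, from the metaGEM directory:
# incomplete_samples dataset
```

No output means every sample subfolder contains both its R1 and R2 file.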
Note that the organizeData task expects that your samples are named according to the following scheme:
{SAMPLE ID}_R{1|2}.fastq.gz, e.g. ERR260137_R1.fastq.gz, ERR260137_R2.fastq.gz, ERR260138_R1.fastq.gz, etc.
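Before running organizeData, it can help to spot files that do not follow this scheme. The sketch below only inspects file names at the top level of the dataset folder; check_fastq_names is a hypothetical helper, and the pattern is an interpretation of the scheme above rather than the rule used inside metaGEM:

```shell
# Hypothetical helper (not part of metaGEM): print every file directly inside
# the dataset folder whose name does not look like {SAMPLE ID}_R{1|2}.fastq.gz.
check_fastq_names() {
    dataset_dir="${1:-dataset}"
    for f in "$dataset_dir"/*; do
        [ -f "$f" ] || continue             # skip subfolders and empty globs
        name=$(basename "$f")
        case "$name" in
            ?*_R[12].fastq.gz) ;;           # conforms: non-empty ID, R1 or R2
            *) echo "$name" ;;              # does not conform
        esac
    done
}

# Usage, from the metaGEM directory:
# check_fastq_names dataset
```

Any file that is listed, for example one ending in _1.fastq.gz instead of _R1.fastq.gz, would need to be renamed to match the scheme.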
Make sure to inspect and set up the two config files to ensure smooth metaGEM runs:
The config.yaml handles all the tunable parameters, subfolder names, paths, and more. The root path is automatically set by the metaGEM.sh parser to be the current working directory. Most importantly, you should make sure that the scratch path is properly configured. Most clusters have a location for temporary or high I/O operations, such as $TMPDIR or $SCRATCH. Please refer to the config.yaml wiki page for a more in-depth look at this config file.
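To find a usable scratch location on your system, you can test the usual candidates before writing one into config.yaml. This is a minimal sketch; pick_scratch is a hypothetical helper, and whether $SCRATCH or $TMPDIR is defined depends entirely on your cluster:

```shell
# Hypothetical helper (not part of metaGEM): print the first candidate path
# that is an existing, writable directory; return non-zero if none qualifies.
pick_scratch() {
    for candidate in "$@"; do
        if [ -n "$candidate" ] && [ -d "$candidate" ] && [ -w "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Usage: unset variables expand to empty strings and are skipped.
# pick_scratch "${SCRATCH:-}" "${TMPDIR:-}" /tmp
```

The printed path is only a suggestion for the scratch entry; check your cluster's documentation for quotas and clean-up policies on that location.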
The cluster_config.json handles parameters for submitting jobs to the cluster workload manager. Most importantly, you should make sure that the account is properly defined to be able to submit jobs to your cluster. Please refer to the cluster_config.json wiki page for a more in-depth look at this config file.
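A quick way to see what is currently set is to search the file for the account entry. This sketch assumes the key is literally spelled "account" in the JSON, which you should confirm against your copy of the file; show_account is a hypothetical helper:

```shell
# Hypothetical helper (not part of metaGEM): print the lines of the cluster
# config that mention an "account" key, with line numbers.
# Assumption: the key is written exactly as "account".
show_account() {
    config_file="${1:-cluster_config.json}"
    # -n prefixes each match with its line number for easy editing
    grep -n '"account"' "$config_file"
}

# Usage, from the metaGEM directory:
# show_account cluster_config.json
```

If nothing is printed, the file has no entry with that exact key and the wiki page is the better guide.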
Please note that you will need to set up the following tools/databases to run the complete core metaGEM workflow:
CheckM is used extensively within the metaWRAP modules to evaluate the output of various intermediate steps. Although the CheckM package is installed in the metawrap environment, the user is required to download the CheckM database and run checkm data setRoot <db_dir> as outlined in the CheckM installation guide.
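Before pointing CheckM at a database folder, it is worth confirming that the folder exists and actually contains the extracted files. This is a minimal sketch; checkm_db_looks_ready is a hypothetical helper and /path/to/checkm_db is a placeholder, while checkm data setRoot is the command named above:

```shell
# Hypothetical helper (not part of CheckM or metaGEM): succeed only if the
# given database directory exists and is not empty.
checkm_db_looks_ready() {
    db_dir="$1"
    [ -d "$db_dir" ] && [ -n "$(ls -A "$db_dir" 2>/dev/null)" ]
}

# Usage, with the metawrap environment active:
# db_dir=/path/to/checkm_db        # placeholder, use your own location
# checkm_db_looks_ready "$db_dir" && checkm data setRoot "$db_dir"
```

The helper does not validate the database contents; it only guards against setting the root to a missing or empty folder.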
GTDB-Tk is used for taxonomic assignment of MAGs, and requires a database to be downloaded and configured. Please refer to the installation documentation for detailed instructions.
Unfortunately, CPLEX cannot be installed automatically by the env_setup.sh script; you must install this dependency manually within the metagem conda environment. GEM reconstruction and GEM community simulations require the IBM CPLEX solver, which is free to download with an academic license. Refer to the CarveMe and SMETANA installation instructions for further information or troubleshooting. Note: CPLEX v.12.8 is recommended.
- Quality filter reads with fastp
- Assembly with megahit
- Draft bin sets with CONCOCT, MaxBin2, and MetaBAT2
- Refine & reassemble bins with metaWRAP
- Taxonomic assignment with GTDB-Tk
- Relative abundances with bwa
- Reconstruct & evaluate genome-scale metabolic models with CarveMe and memote
- Species metabolic coupling analysis with SMETANA