-
Notifications
You must be signed in to change notification settings - Fork 43
Snakefile
The Snakefile is the backbone of the metaGEM
workflow. This file is the recipe book that contains all the instructions necessary to take inputs from one bioinformatic tool, process them, and pass them on to the next set of analyses. When experiencing errors, you should always look at what the Snakefile rule is trying to do in order to troubleshoot.
Snakefiles are split into rules, which are structured like this:
rule nameOfRule:
input:
youCanHave = rules.previousRule.output,
multipleInputs = f'{config["path"]["root"]}/{config["folder"]["qfiltered"]}'
output:
same = directory(f'{config["path"]["root"]}/{config["folder"]["concoct"]}/{{IDs}}/cov'),
with = directory(f'{config["path"]["root"]}/{config["folder"]["metabat"]}/{{IDs}}/cov'),
outputs = directory(f'{config["path"]["root"]}/{config["folder"]["maxbin"]}/{{IDs}}/cov')
benchmark:
f'{config["path"]["root"]}/benchmarks/{{IDs}}.ruleName.benchmark.txt'
message:
"""
Explanation of what the rule is trying to do goes here, usually with some warnings or word of caution.
"""
shell:
"""
# some bash code goes here that probably runs some super fancy algos
"""
The Snakefile
rules generally make use of wildcards, which are useful for expanding job submissions for all samples in a dataset. We can easily switch wildcards to expand jobs based on sample IDs, or bin IDs, or GEM IDs, etc. as needed. The wildcard {{IDs}}
is expanded based on the sample IDs using the following code at the top of the Snakefile
:
import os
import glob
def get_ids_from_path_pattern(path_pattern):
ids = sorted([os.path.basename(os.path.splitext(val)[0])
for val in (glob.glob(path_pattern))])
return ids
IDs = get_ids_from_path_pattern('dataset/*')
DATA_READS_1 = f'{config["path"]["root"]}/{config["folder"]["data"]}/{{IDs}}/{{IDs}}_R1.fastq.gz'
DATA_READS_2 = f'{config["path"]["root"]}/{config["folder"]["data"]}/{{IDs}}/{{IDs}}_R2.fastq.gz'
That is, the Snakefile
will search for the folder called dataset
and will take the sub-folder names to be sample IDs.
For practical purposes we usually delete the raw dataset after quality filtering, especially if working with multiple/large datasets, in which case you should replace the dataset
folder with qfiltered
, for example, inside the get_ids_from_path_pattern()
function.
Note that the root
folder is defined in the config.yaml
file, and is automatically set as the working directory when running the metaGEM.sh
parser.
- Quality filter reads with fastp
- Assembly with megahit
- Draft bin sets with CONCOCT, MaxBin2, and MetaBAT2
- Refine & reassemble bins with metaWRAP
- Taxonomic assignment with GTDB-tk
- Relative abundances with bwa
- Reconstruct & evaluate genome-scale metabolic models with CarveMe and memote
- Species metabolic coupling analysis with SMETANA