This repository contains a Makefile to process paired-end illumina data with bowtie2, samtools, picard, and gatk. Much of the analyses are based on @knausb's bam processing workflow, but tweaked for a haploid plant pathogen with an available genome (here we use it for Sclerotinia sclerotiorum). This is designed to run on the HCC SLURM Cluster using the SLURM_Array submission script in the users $PATH. There are no guarantees that this will work anywhere else.
To build your analysis, add your data to directories called reads/ and
mitochondria_genome/, edit variables in make like ROOT_DIR
, and ensure
that you have an environment variable called $EMAIL
so you can be spammed
every time the processes start and finish. In your shell, type:
make
This will generate the genome index, sam files, bam files, g.vcf files
and a final GATK/res.vcf.gz
. You can find a manifest of the files
generated from each sample in [manifest.txt] (Which will be updated randomly
as I work out the bugs and test things).
If you want to add steps/rules to the workflow, you should first be familiar or comfortable with Makefiles. Here are some helpful guides:
One of the things that gets kind of weird about this Makefile as compared to traditional makefiles is the fact that I wrote it in a really wonky sort of way where there are many dependencies for a single rule (This may change in the future).
Many of the rules in the makefile take the form of:
.PHONY: all out
all : out
SAMPLES := $(shell ls -1d samples-dir/*)
OUT_SAMPLES := $(patsubst %.in,%.out,$(SAMPLES))
out : $(OUT_SAMPLES)
runs/ARRAY-JOB-NAME/ARRAY-JOB-NAME.sh : $(SAMPLES)
echo $^ | sed -r 's/([^ ]+?).in */script-to-run.sh \1.in -o \1.out\n/' > $(RUNFILES)/run-script-to-run.txt
SLURM_Array -c $(RUNFILES)/run-script-to-run.txt \
--mail $(EMAIL) \
-r runs/ARRAY-JOB-NAME \
-l $(MODULE) \
--hold \
-w $(ROOT_DIR)
$(OUT_SAMPLES) : $(SAMPLES) runs/ARRAY-JOB-NAME/ARRAY-JOB-NAME.sh
Where the target is a phony target that generates one OUT_SAMPLE
for every
SAMPLE
via runs/ARRAY-JOB-NAME/ARRAY-JOB-NAME.sh
. This shell script as a
target is generated via SLURM_Array
as it submits a
SLURM array job. Is this an elegant solution? No. It's a bit
hamfisted, and I will probably change it in the future.
- mitochondria_genome/: A gzipped fasta file such as one from here: ftp://ftp.broadinstitute.org/pub/annotation/fungi/sclerotinia_sclerotiorum/broad/genomes/sclerotinia_sclerotiorum/
- reads/: Paired-end genomic data, in
\*_[12].fq.gz
format
- runfiles/: shell scripts for pre-processing submission scripts (kept in this directory for posterity)
- bt2-index/: genome index files generated via
make index
- SAMS/: mapped sam files generated via
make map
- BAMS/: filtered bam files
- GVCF/:
*.g.vcf
and*.vcf
files generated via GATK - runs/: std out and std err of runs