This workflow is my attempt at modernizing and generalizing existing viral infection detection pipelines. It uses Snakemake to scale and parallelise the analysis tasks.
Clone the repository
cd <installation directory> # Git clone makes a subdirectory with the name of the repository
git clone git@github.com:srinzema/dna_methylation.git
cd dna_methylation
Recreate and activate the environment.
conda env create -f environment.yml
conda activate viralfind
Prepend the installation path to your PATH
export PATH="$(pwd):$PATH"
This pipeline consists of two separate workflows, init and run. init creates the files the pipeline needs in the current working directory. Once you have filled out those files, call run to start the pipeline.
usage: viralfind [-h] {init,run} ...

positional arguments:
  {init,run}  Choose init or run.
    init      Initialize the pipeline in the current directory.
    run       Runs the pipeline in the current working directory.

options:
  -h, --help  show this help message and exit
The init workflow creates three files in the current working directory: config.yaml, samples.tsv, and assemblies.tsv.
The config.yaml file contains essential parameters for running the pipeline:
- run: Directory where all pipeline-generated files will be stored.
- results: Subdirectory within run to store final results.
- fastq_dir: Path to the directory containing your FASTQ sample files.
- samplesheet: Path to the samples.tsv file containing metadata about your samples.
- genome_dir: Directory housing your genome files.
- first_assembly: Directory name of the first genome assembly (e.g., human genome).
- second_assembly: Directory name of the second genome assembly (e.g., HPV metagenome).
- assembly_file: Path to the assemblies.tsv file used to construct a custom metagenome.
- merge_replicates: Boolean flag indicating whether to merge the count table based on the replicate_group column from samples.tsv.
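As a rough illustration of how these keys fit together, the sketch below checks an already-parsed config for the keys listed above and derives the final results directory as a subdirectory of run. The function name `check_config` and all example values are hypothetical; this is not how the pipeline itself validates its config.

```python
from pathlib import Path

# Keys documented above for config.yaml.
REQUIRED_KEYS = {
    "run", "results", "fastq_dir", "samplesheet", "genome_dir",
    "first_assembly", "second_assembly", "assembly_file", "merge_replicates",
}


def check_config(config: dict) -> Path:
    """Validate a parsed config and return the final results directory."""
    missing = sorted(REQUIRED_KEYS - set(config))
    if missing:
        raise KeyError(f"config.yaml is missing keys: {', '.join(missing)}")
    if not isinstance(config["merge_replicates"], bool):
        raise TypeError("merge_replicates must be true or false")
    # results is described as a subdirectory within run.
    return Path(config["run"]) / config["results"]


# Hypothetical values, for illustration only.
example_config = {
    "run": "viralfind_run",
    "results": "results",
    "fastq_dir": "fastq",
    "samplesheet": "samples.tsv",
    "genome_dir": "genomes",
    "first_assembly": "first_assembly_dir",
    "second_assembly": "second_assembly_dir",
    "assembly_file": "assemblies.tsv",
    "merge_replicates": True,
}

results_dir = check_config(example_config)
print(results_dir)
```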
The samples.tsv file provides metadata for your FASTQ samples, structured as follows:
- samples: The base name of your FASTQ files without the file extension or _R1/_R2 suffixes.
- alias: The name to which your FASTQ samples will be renamed for processing.
- condition: Placeholder for experimental conditions (not implemented in the current pipeline).
- replicate_group: Identifies groups for merging replicates in the final count table.
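To make the column meanings concrete, here is a minimal sketch that derives a samples value from a FASTQ file name and groups aliases by replicate_group. It assumes file names of the form `<sample>_R1.fastq.gz` (or `.fq`, with or without `.gz`); the extensions the pipeline actually accepts are not stated here, and the helper names and sample rows are made up for illustration.

```python
import csv
import io
import re
from collections import defaultdict

# Assumed FASTQ naming: <sample>[_R1|_R2].<fastq|fq>[.gz]
FASTQ_SUFFIX = re.compile(r"(_R[12])?\.(fastq|fq)(\.gz)?$")


def sample_name(fastq_filename: str) -> str:
    """Strip the read suffix and extension to get the samples-column value."""
    return FASTQ_SUFFIX.sub("", fastq_filename)


def group_replicates(samplesheet_text: str) -> dict:
    """Map each replicate_group to the aliases that belong to it."""
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(samplesheet_text), delimiter="\t")
    for row in reader:
        groups[row["replicate_group"]].append(row["alias"])
    return dict(groups)


# Hypothetical sample sheet with the four documented columns.
example_sheet = (
    "samples\talias\tcondition\treplicate_group\n"
    "sampleA\tcontrol_1\tcontrol\tcontrol\n"
    "sampleB\tcontrol_2\tcontrol\tcontrol\n"
    "sampleC\tinfected_1\tinfected\tinfected\n"
)

print(sample_name("sampleA_R1.fastq.gz"))
print(group_replicates(example_sheet))
```

With merge_replicates enabled, the aliases that share a group would end up as one column in the final count table; how the counts are combined is not specified here.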
The assemblies.tsv file is based on the output of genomepy search. The most important columns are name, provider, and species; all three must be filled in for every assembly, or the pipeline will not work. An assembly without an annotation (annotation column set to False) cannot be used.
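The checks described above could be sketched as follows. This assumes the file is tab-separated with a header row, and that the annotation column holds the text True or False; the real file probably carries more columns than the four used here, and the function name and rows are hypothetical.

```python
import csv
import io

# Columns that must be filled in for every assembly.
REQUIRED_COLUMNS = ("name", "provider", "species")


def usable_assemblies(assemblies_text: str) -> list:
    """Return the names of fully filled-in rows that have an annotation."""
    usable = []
    reader = csv.DictReader(io.StringIO(assemblies_text), delimiter="\t")
    for row in reader:
        # Reject rows where a required column is absent or blank.
        if any(not (row.get(col) or "").strip() for col in REQUIRED_COLUMNS):
            raise ValueError(f"row is missing a required value: {row}")
        # Skip assemblies whose annotation column is not True.
        if (row.get("annotation") or "").strip().lower() != "true":
            continue
        usable.append(row["name"])
    return usable


# Hypothetical rows, for illustration only.
example_assemblies = (
    "name\tprovider\tspecies\tannotation\n"
    "assembly_A\tprovider_X\tspecies one\tTrue\n"
    "assembly_B\tprovider_Y\tspecies two\tFalse\n"
)

print(usable_assemblies(example_assemblies))
```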
- Metagenome Construction:
  - Constructs a custom metagenome from multiple assemblies.
- Quality Control & Trimming:
  - fastp: Trims and assesses quality of reads.
- Mapping:
  - STAR First Pass: Maps reads to the first assembly.
  - Extract Unmapped: Extracts unmapped reads.
  - STAR Second Pass: Maps unmapped reads to a custom metagenome.
- Data Processing:
  - featureCounts: Counts features from the aligned BAM files.
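The data flow of the mapping stage can be pictured with a toy model. Here set membership stands in for a successful alignment; this only illustrates that the second pass sees nothing but the reads the first pass could not map, and says nothing about how STAR is actually invoked. All names and read IDs are hypothetical.

```python
def two_pass_mapping(reads, first_reference, second_reference):
    """Toy model of the two mapping passes; membership stands in for alignment."""
    # First pass: map every read against the first assembly.
    first_hits = [r for r in reads if r in first_reference]
    # Extract the reads that did not map in the first pass.
    unmapped = [r for r in reads if r not in first_reference]
    # Second pass: only the unmapped reads go to the custom metagenome.
    second_hits = [r for r in unmapped if r in second_reference]
    return first_hits, unmapped, second_hits


# Hypothetical read IDs and references.
reads = ["read1", "read2", "read3", "read4"]
first_reference = {"read1", "read2"}
second_reference = {"read2", "read3"}

first_hits, unmapped, second_hits = two_pass_mapping(
    reads, first_reference, second_reference
)
print(first_hits, unmapped, second_hits)
```

Note that `read2` would match both references, but it never reaches the second pass because it already mapped in the first.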