Viral Find

This workflow is my attempt at modernizing and generalizing existing viral infection detection pipelines, using Snakemake for computational scaling and parallelisation of analysis tasks.

Setup

Clone the repository

cd <installation directory> # git clone creates a subdirectory named after the repository
git clone git@github.com:EllenvdL/viralfind.git
cd viralfind

Create and activate the conda environment.

conda env create -f environment.yml
conda activate viralfind

Prepend the installation directory to your PATH.

export PATH="$(pwd)":$PATH

Basic usage

This pipeline consists of two separate workflows: init and run. init initializes the files needed by the pipeline in the current working directory; after filling out the resulting files, run can be called to start the pipeline.

usage: viralfind [-h] {init,run} ...

positional arguments:
  {init,run}  Choose init or run.
    init      Initialize the pipeline in the current directory.
    run       Runs the pipeline in the current working directory.

options:
  -h, --help  show this help message and exit
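
A typical session looks like this (a minimal sketch based only on the two subcommands documented above; no additional flags are assumed):

# Initialize config.yaml, samples.tsv and assemblies.tsv in the current directory
viralfind init

# Fill out the generated files, then start the pipeline
viralfind run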

1. Init workflow

The init workflow creates three files in the current working directory:

config.yaml

The config.yaml file contains essential parameters for running the pipeline:

  • run: Directory where all pipeline-generated files will be stored.
  • results: Subdirectory within run to store final results.
  • fastq_dir: Path to the directory containing your FASTQ sample files.
  • samplesheet: Path to the samples.tsv file containing metadata about your samples.
  • genome_dir: Directory housing your genome files.
  • first_assembly: Directory name of the first genome assembly (e.g., human genome).
  • second_assembly: Directory name of the second genome assembly (e.g., HPV metagenome).
  • assembly_file: Path to the assemblies.tsv file used to construct a custom metagenome.
  • merge_replicates: Boolean flag indicating whether to merge the count table based on the replicate_group column from samples.tsv.
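
A hypothetical config.yaml might look like the example below. The keys are those listed above; every value is a placeholder to be replaced with your own paths and assembly names:

# Example values only -- adjust all paths and names to your setup
run: pipeline_output
results: results
fastq_dir: /path/to/fastq
samplesheet: samples.tsv
genome_dir: /path/to/genomes
first_assembly: GRCh38
second_assembly: hpv_metagenome
assembly_file: assemblies.tsv
merge_replicates: true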

samples.tsv

The samples.tsv file provides metadata for your FASTQ samples, structured as follows:

  • samples: The base name of your FASTQ files without the file extension or _R1/_R2 suffixes.
  • alias: The name to which your FASTQ samples will be renamed for processing.
  • condition: Placeholder for experimental conditions (not implemented in the current pipeline).
  • replicate_group: Identifies groups for merging replicates in the final count table.
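
For illustration, a samples.tsv for two paired-end samples could look like this (tab-separated; all names are hypothetical):

samples	alias	condition	replicate_group
patient1_S1_L001	tumor_rep1	tumor	tumor
patient1_S2_L001	tumor_rep2	tumor	tumor

Here patient1_S1_L001 would correspond to FASTQ files such as patient1_S1_L001_R1.fastq.gz and patient1_S1_L001_R2.fastq.gz.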

assemblies.tsv

The assemblies.tsv file is based on the output of genomepy search. The columns name, provider, and species are required; the pipeline will not run if they are missing. If an assembly has no annotation (its annotation column is False), it cannot be used.
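
Assemblies can be looked up with genomepy search (for example, genomepy search "Human papillomavirus") and the relevant rows copied into assemblies.tsv. A minimal, purely illustrative file could look like this (tab-separated; the entries are examples, not recommendations):

name	provider	species	annotation
GRCh38.p14	Ensembl	Homo sapiens	True
ASM141v1	NCBI	Human papillomavirus type 16	True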

Workflow Structure

  1. Metagenome Construction:
    • Constructs a custom metagenome from multiple assemblies.
  2. Quality Control & Trimming:
    • fastp: Trims and assesses quality of reads.
  3. Mapping:
    1. STAR First Pass: Maps reads to the first assembly.
    2. Extract Unmapped: Extracts unmapped reads.
    3. STAR Second Pass: Maps unmapped reads to a custom metagenome.
  4. Data Processing:
    • featureCounts: Counts features from the aligned BAM files.
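
The QC, mapping and counting steps correspond roughly to the commands below. This is a simplified sketch, not the actual Snakemake rule code; file names, index directories, thread counts and extra flags are assumptions:

# Trim and quality-check the reads with fastp
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
    -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz \
    -h sample_fastp.html -j sample_fastp.json

# First pass: map trimmed reads to the first assembly (e.g. the human genome)
# and write reads that fail to map to Unmapped.out.mate1/2
STAR --genomeDir first_assembly_index \
    --readFilesIn trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --outReadsUnmapped Fastx

# Second pass: map the unmapped reads to the custom metagenome
STAR --genomeDir metagenome_index \
    --readFilesIn Unmapped.out.mate1 Unmapped.out.mate2 \
    --outSAMtype BAM SortedByCoordinate

# Count features in the resulting alignments
featureCounts -p -a metagenome_annotation.gtf -o counts.txt Aligned.sortedByCoord.out.bam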

Detailed DAG of the rules
