QC-pipe

A snakemake pipeline for generating QC metrics from sequencing data.

What it does

QC-pipe takes fastq files as input and computes a number of quality control metrics, which are collated into a MultiQC report. The tools it runs include:

FastQC -- a QC tool for sequencing data.
FastQ Screen -- a screening tool that aligns reads to a set of sequence databases, to determine sequence composition.
QualiMap -- calculates QC metrics based on alignment data.
Samtools stats -- produces statistics from alignment files.

QC-pipe can align your reads using STAR (recommended for RNA) or bwa (recommended for DNA).

QC-pipe can also demultiplex your data from Illumina BCL files using bcl2fastq or bcl-convert, or from MGI data using splitbarcode. If you have multiple lanes, QC-pipe can automatically merge these into one pair of fastq files per sample.

Installation

The only prerequisite is snakemake. To install snakemake, you will need to install a Conda-based Python3 distribution. For this, Mambaforge is recommended. Once mamba is installed, snakemake can be installed like so:

mamba create -c conda-forge -c bioconda -n snakemake snakemake

Now activate the snakemake environment (you'll have to do this every time you want to run the pipeline):

conda activate snakemake

Now clone the repository:

git clone https://github.com/WEHIGenomicsRnD/qc-pipe.git
cd qc-pipe

Testing

You can test the pipeline via:

conda activate snakemake
snakemake --use-conda --conda-frontend mamba --cores 1 --directory .test

Configuration

The configuration file is found under config/config.yaml and the config file for FastQ Screen is found under config/fastq_screen.conf. Please carefully go through these settings.

Running

Place your fastq files, named {sample}_R[1|2].fastq.gz (that is, ending in _R1.fastq.gz or _R2.fastq.gz), under the directory specified in your config.yaml file (fastq by default). Now run the pipeline as follows:

conda activate snakemake
snakemake --use-conda --conda-frontend mamba --cores 1
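
The file naming convention described above can be checked before a run. The following is a minimal sketch and not part of QC-pipe: the function name check_fastq_names is hypothetical, and the fastq default only mirrors the default input directory mentioned above.

```shell
# Report files whose names do not end in _R1.fastq.gz or _R2.fastq.gz.
# check_fastq_names is a hypothetical helper; pass the directory that is
# set in your config.yaml (defaults to "fastq" here).
check_fastq_names() {
    dir="${1:-fastq}"
    bad=0
    for f in "$dir"/*; do
        [ -e "$f" ] || continue          # skip when the directory is empty
        name="$(basename "$f")"
        case "$name" in
            ?*_R1.fastq.gz|?*_R2.fastq.gz) ;;                 # expected name
            *) echo "unexpected name: $name"; bad=1 ;;
        esac
    done
    return "$bad"
}
```

Calling check_fastq_names fastq prints each file name that does not fit the convention and returns a non-zero status if any were found.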

If you need to demultiplex your reads via bcl2fastq, bcl-convert or MGI's splitbarcode, you can run this through the pipeline by setting process_from_bcl: True in the config file and specifying the input BCL/fastq directory (raw_input in the config file). Note that due to the licensing of these demultiplexing tools, you will have to source the software from the Illumina website, or if you are using MGI, directly from MGI support.
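
Before starting a demultiplexing run, it can help to confirm what the config file currently says. This is a minimal sketch, assuming that process_from_bcl and raw_input appear as top-level keys in config/config.yaml (the parameters are named above, but the exact file layout is an assumption). The function name show_demux_settings is hypothetical.

```shell
# Print the demultiplexing-related settings from a config file.
# Assumption: both keys sit at the start of a line, i.e. at the top level.
show_demux_settings() {
    config="${1:-config/config.yaml}"
    grep -E '^(process_from_bcl|raw_input):' "$config"
}
```

If nothing is printed, the keys are either missing or nested, and the config file should be inspected by hand.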

You may also start the pipeline with already demultiplexed output. In this case, the pipeline can perform the merge step (in order to merge lanes). This is also handy if your file names are in the standard Illumina BCL output format. To do this, set the merge_from_dir parameter in the config file.
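
To illustrate what merging lanes means, the sketch below concatenates per-lane gzipped fastq files into one file per sample and read. It is not the pipeline's own merge rule. It assumes Illumina-style file names of the form {sample}_S{n}_L{lane}_R{read}_001.fastq.gz, and merge_lanes is a hypothetical helper. It relies on the fact that concatenated gzip files decompress to the concatenation of their contents.

```shell
# Illustration only: merge lanes by concatenating gzipped fastq files.
# Assumes names such as sampleA_S1_L001_R1_001.fastq.gz.
merge_lanes() {
    in_dir="$1"
    out_dir="$2"
    mkdir -p "$out_dir"
    # Use the lane 1 files to discover each sample/read combination.
    for first in "$in_dir"/*_S[0-9]*_L001_R[12]_001.fastq.gz; do
        [ -e "$first" ] || continue            # no matching files
        base="$(basename "$first")"
        sample="${base%_S[0-9]*_L001_R[12]_001.fastq.gz}"
        rest="${base##*_L001_}"                # e.g. R1_001.fastq.gz
        mate="${rest%%_*}"                     # R1 or R2
        # The glob expands in sorted order, so lanes are appended L001, L002, ...
        cat "$in_dir/${sample}"_S[0-9]*_L[0-9][0-9][0-9]_"${mate}"_001.fastq.gz \
            > "$out_dir/${sample}_${mate}.fastq.gz"
    done
}
```

For example, merge_lanes demux_out fastq would write fastq/sampleA_R1.fastq.gz from all lanes of sampleA read 1 found in demux_out (both directory names are placeholders).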

If you want to submit your jobs to the cluster using SLURM, use the following to run the pipeline:

conda activate snakemake
snakemake --use-conda --conda-frontend mamba --profile slurm --jobs 8 --cores 24

Output

The pipeline writes all results under a results directory. The most relevant output is:

results/multiqc/multiqc_report.html -- contains collated QC data in the form of an HTML report.
