Process Genotypage by Sequencing data with low coverage using a dDocent-based workflow. We performed this workflow as part of the SEACONNECT project on 900 samples among 2 fish species mullus Surmuletus and diplodus Sargus in Mediterranean sea.
You can get the development version of the code by cloning this repository:
git clone https://github.com/Grelot/seaConnect--dDocent.git
- bash (linux OS ubuntu xenial)
- singularity
- snakemake
You will need to have the following programs installed on your computer. Alternatively we provide a singularity container (see Singularity section below).
- OSX or GNU Linux
- bash 4.4.19
- SINGULARITY 2.4.2-dist
- curl 7.47.0
- Python 2.7.12
- CD-HIT 4.6
- samtools 1.3.1
- bedtools 2.26.0
- PEAR 0.9.11
- fastp 0.20.0
- STACKS 1.48
- dDocent 2.7.6
- VCFtools 0.1.14
- Trimmomatic 0.33
- JAVA 8.0
- freebayes 1.3.1
- GNUPLOT 5.0
- MAWK 1.2
- rainbow 2.0.4
- GNU parallel
- seqtk 1.0
- BWA 0.7.17
- (optional) DEMORT 0.2.4
See https://www.sylabs.io/docs/ for instructions to install Singularity.
We provide a Singularity recipe and ready to run container with all required dependencies.
Build a local container as administrator with all required programs and dependencies using Singularity recipe Singularity.seaconnect
sudo singularity build seaconnect.simg Singularity.seaconnect
Pull a ready to run version of Singularity container
singularity pull --name seaconnect.simg shub://Grelot/seaConnect--dDocent:seaconnect
First, SINGLE-END fastq
files must be quality-filtered.
We provide a complete workflow to perform preprocessing of sequencing ngs raw data. This workflow is available as a github repository here : clean-fastq
{species}
: any complete project (in our case we have 2 projects : mullus and diplodus){lane}
: any physical lane on a flow cell that goes into the sequencing machine. We have many{lane}
by{species}
{barcode}
: any DNA sequence attached to a reads which belong to a sample. We have many{barcode}
by{lane}
by{species}
{pop}
: any group of samples{sample}
: any sample
The container (see Singularity section above) must be stored into the main directory of this project seaConnect--dDocent with the name seaconnect.simg
Write the absolute path of the folder containing filtered data after preprocessing into config.yaml fastq
section folder
subsection. Prefix of each fastq file you want to process must be write as a list into subsection file
For instance :
Each fastq file into my filtered folder is a {lane}
:
ls /entrepot/donnees/seaconnect/gbs_mullus/cleaned/
C6JATANXX_2.i.p.q.fastq.gz C6JATANXX_5.i.p.q.fastq.gz
C6JATANXX_3.i.p.q.fastq.gz C6JATANXX_6.i.p.q.fastq.gz
C6JATANXX_4.i.p.q.fastq.gz C8BJGANXX_1.i.p.q.fastq.gz
I write the fastq file to process into config.yaml this way :
fastq:
folder: /entrepot/donnees/seaconnect/gbs_mullus/cleaned
file:
- C6JATANXX_5
- C6JATANXX_6
- C8BJGANXX_1
- C6JATANXX_4
- C6JATANXX_3
- C6JATANXX_2
For each {species}
, we provide a sample information file
.
It must be stored as 01-infos/{species}_sample_information.tsv
.
(for instance we provide mullus_sample_information.tsv)
This first 4 colons must be :
{lane} {barcode} {pop} {sample} ...
01-infos/{species}_sample_information.tsv
must have a header. (The first line of this file is skipped by the program)
We create a barcode file
for each {lane}
which contains a list of the {barcode}
present into the give {lane}
.
bash 00-scripts barcodes.sh {species}
This command will create a barcode
file such as 01-infos/barcodes/{lane}.txt
for each {lane}
Demultiplexing refers to the step in processing where you’d use the {barcode}
information in order to know which sequences came from which {sample}
after they had all be sequenced together. Barcodes refer to the unique sequences {barcode}
that were ligated to each of your invidivual samples genetic material {sample}
before the samples got all mixed together.
Samples are lumped together all in one fastq file with barcodes still attached. We use a barcode file
to split fastq file sequences by {barcode}
We use snakemake workflow managment system for this step.
We use process_radtags from STACKS 1.48 to perform demultiplexing. Parameters can be set into config.yaml process_radtags
section.
snakemake -s 00-scripts/snakeFile.process_radtags -j 8 --use-singularity --configfile 01-infos/config_mullus.yaml --singularity-args "-B /entrepot:/entrepot"
This command process demultiplexing for each {lane}
. Results are stored into 03-samples/{lane}/sample_{barcode}.fq.gz
--singularity-args "-B /entrepot:/entrepot"
argument of the command below.
If you want a blacklist of {sample}
fastq files with low coverage, you can use DEMORT to get number of reads by {sample}
fastq files.
bash 00-scripts/demort.sh {species}
This command evaluates demultiplexed fastq files by computing various metrics. Each {sample}
with a low number of reads is blacklisted. Results are stored into 98-metrics folder.
Each fastq file belonging to a specific {barcode}
is renamed by the corresponding association {pop}
and {sample}
as stipulated into {species}_sample_information.tsv
file. (see for instance mullus_sample_information.tsv)
bash 00-scripts/rename.sh {species} 01-infos/{species}_sample_information.csv 98-utils/{species}_samples_blacklist.txt
This command create a symlink of all fastq files such as 03-samples/{lane}/sample_{barcode}.fq.gz
is linked by 04-ddocent/{species}/{pop}_{sample}.F.fq.gz
.
📎 Optionally, if a samples blacklist 98-utils/{species}_samples_blacklist.txt
is provided, each blacklisted {sample}
is filtered.
dDocent is simple bash wrapper to QC, assemble, map, and call SNPs from almost any kind of RAD sequencing. If you have a reference already, dDocent can be used to call SNPs from almost any type of NGS data set.
- dDocent performs a quality control of input sequences
- dDocent aligns sequences against a
{species}
reference genome sequence. - dDocent performs a quality control of alignments
- Bayesian, haplotype based, population-aware, genotyping from FreeBayes.
FreeBayes is a Bayesian genetic variant detector designed to detect SNPs, INDels (insertions and deletions), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment. FreeBayes is haplotype-based, in the sense that it calls variants based on the literal sequences of reads aligned to a particular target, not their precise alignment, and for any number of individuals from a population and a to determine the most-likely combination of genotypes for the population at each position in the reference.
run dDocent workflow :
- Set parameters into 01-infos/ddocent_config.file
- Go into the
04-ddocent/{species}
folder. - Then add a
04-ddocent/{species}/reference.fasta
genome reference sequence file. - Now you can run the complete dDocent workflow. Type the following commands :
## global variable
SPECIES=mullus
CONTAINER=/entrepot/working/seaconnect/seaConnect--dDocent/seaconnect.simg
DDOCENT_CONFIG=/entrepot/working/seaconnect/seaConnect--dDocent/01-infos/ddocent_config.file
## run dDocent
singularity exec -B "/entrepot:/entrepot" $CONTAINER dDocent $DDOCENT_CONFIG
This command run dDocent workflow. It generates genotypes for each {sample}
for each locus. These genotypes are stored into a VCF
file.
-B "/entrepot:/entrepot"
argument)
Genotypes are stored as VCF
files. We keep only SNPs (Single Nucleotide Polymorphism) variants.
## global variable
SPECIES=mullus
CONTAINER=/entrepot/working/seaconnect/seaConnect--dDocent/seaconnect.simg
## filter VCF remove INDELs
singularity exec "${CONTAINER}" vcftools --vcf 04-ddocent/"${SPECIES}"/Final.recode.vcf --remove-indels --recode --recode-INFO-all --out 05-vcf/"${SPECIES}"
Final result is generated as VCF
files into 05-vcf folder and contains only SNPs.
GBS SNP calling has been now performed. SNPs will be then filtered for next analysis using Emilie Boulanger's workflow.
- All the commands to run the workflow are available as a bash script main.sh.
- You can also run the whole workflow step by step using scripts collection 00-scripts.