CDPHE Enterovirus repository

Disclaimer: Next generation sequencing and bioinformatic and genomic analysis at CDPHE is not CLIA validated at this time. These workflows and their outputs are not to be used for diagnostic purposes and should only be used for public health action and surveillance purposes. CDPHE is not responsible for the incorrect or inappropriate use of these workflows or their results.

In this repository you will find all things related to Enterovirus D68 (EV-D68) bioinformatics and analysis. The general directory stucture is as follows:

CDPHE_EV-D68
| README_EV-D68_overview.md # this document
| dockstore.yml
| workflows # terra workflows you can find in dockstore and push to your terra workspace
      | EV-D68_illumina_pe_assembly.wdl
      | EV-D68_seq_results.wdl
| scripts # custom python scripts for analysess used in wdl workflows
      | ev-d68_calc_percent_coverage.py
      | ev-d68_concat_assembly_software_illumina.py
      | ev_d68_concat_seq_results.py
| workspace_data # reference data
| workflow_diagrams

CDPHE_EV-D68 Workflows

Overview

The following documentation describes the Colorado Department of Public Health and Environments's workflows for the assembly and analysis of whole genome sequencing data for Enterovirus D68 (EV-D68) on GCP's Terra.bio platform. Workflows are written in WDL and can be imported into a Terra.bio workspace through dockstore.

Our EV-D68 whole genome reference-based assembly workflows are highly adaptable and facilitate the assembly and analysis of tiled amplicon based sequencing data of EV-D68. The workflows can accomodate various amplicon primer schemes, as well as diffent sequencing technology platforms including both Illumina and Oxford Nanopore Technology (ONT - COMING SOON!). Below is a high level overview of our workflows followed by detailed descriptions of each workflow which you can access by clicking the dropdown menus.

Briefly, using the platform appropriate assembly workflow on Terra.bio (r.g. EV-D68_illumina_pe_assembly), we perform quality control, trimming, and filtering of raw reads, and perform reference-guided whole genome assembly. Following assembly, intermediate and results files are transfered to a user defined google storage bucket. Next, we produce a results summary file for the set of sequences analyzed using the EV-D68_seq_results workflow. Finally, genotyping is performed using an open source online tool.

Getting set up

Prior to running any of the workflows, you must set up the terra table and link reference files and custom python scripts to your workspace data. Below is a table detailing the workspace data you will need to set up.

Workspace data

The reference files can be found in this repository in the workspace_data directory. Python scripts can be found in the scripts directory.

workspace variable name	workflow	file name	description
`adapters_and_contaminants_fa`	`EV-D68_illumina_pe_assembly`	Adapters_plus_PhiX_174.fasta	adapaters sequences and contaiment sequences removed during fastq cleaning and filtering using SeqyClean. Thanks to Erin Young at Utah Public Health Laboratory for providing this file!
`ev-d68_genome_gff`	`EV-D68_illumina_pe_assembly`, `EV-D68_ont_assembly`	KT285485-1_annotations_20230811.gff	whole genome reference sequence annotation file in gff format (we use NCBI genbank ID KT285485.1 with liftover annotations from NC_038308.1)
`ev-d68_genome_fa`	`EV-D68_illumina_pe_assembly`, `EV-D68_ont_assembly`	KT285485-1.fasta	EV-D68 whole genome reference sequence in fasta format (we use NCBI genbank ID KT285485.1)
`primer_bed`	`EV-D68_illumina_pe_assembly`, `EV-D68_ont_assembly`	{primer_scheme}.primer.bed	primer bed file for your amplicon primer set. We obtained tiled amplicon primers from https://drive.google.com/file/d/1fEqBT7_NvG04Pemb5IdzvBDLW08aPafK/view](https://www.medrxiv.org/content/10.1101/2023.11.20.23297677v2
`ev-d68_calc_per_cov_py`	`EV-D68_illumina_pe_assembly`, `EV-D68_ont_assembly`	ev-d68_calc_percent_coverage.py	see detailed description in the readme file found in `./python_scripts/` repo directory
`ev-d68_concat_assembly_software_illumina_py`	`EV-D68_illumina_pe_assembly`	`ev-d68_concat_assembly_software_illumina.py`	see detailed description in the readme file found in `./python_scripts/` repo directory
`ev-d68_concat_results_py`	`EV-D68_seq_results`	ev_d68_concat_seq_results.py	see detailed description in the readme file found in `./python_scripts` repo directory

Reference-Based Assembly workflows

The following workflows describe the reference based assembly methods for paired-end illumina sequencing data sequencing data. The workflow accepts "sample" as the root entity type. Click the drop down to expand for details. For preprocessing fastq files and generating terra data tables prior to assembly and analysis on Terra.bio, we use the following script and README (https://github.com/CDPHE-bioinformatics/seq_data_preprocessing-private).

EV-D68_illumina_pe_assembly.wdl

Overview

This workflow was developed for the assembly of Illumina paired-end read data. The workflow accepts "sample" as the root entity type. The workflow will:

Use Seqyclean to quality filter and trim raw fastq files

Seqyclean parameters include a minimum read length set to 70 bp and quality trimming set to a minimum Phred quality score of 30.

Run FastQC on both the raw and cleaned reads
Align reads to the reference genome using bwa and then sort the bam by coordinates using Samtools
Use iVar trim to trim primer regions and then sort the trimmed bam by coordinates using Samtools
Use iVar variants to call variants from the trimmed and sorted bam

iVar variants parameters include a minimum quality score set to 20, a minimum variant base frequency set to 0.6 and a minimum read depth set to 10.

Use iVar consensus to call the consensus genome sequence from the trimmed and sorted bam

iVar consensus parameters include a minimum quality score set to 20, a minimum variant base frequency set to 0.6 and a minimum read depth set to 10.

Use Samtools flagstat, stats, depth, and coverage to output statistics from the bam
Rename the fasta header of consensus sequences in the GISAID-acceptable format: CO-CDPHE-{sample_id}
Calculate the percent coverage using the ev-d68_calc_percent_coverage.py script available in the python_scripts directory of this repo.
Use gsutils to transfer the outputs and results to a user-specified google bucket

Inputs

Terra data table.

The terra data table must include the following columns as listed below. Note that optional columns are not neccessary for the assembly workflow but must be present for the EV-D68_seq_results.wdl workflow described below.

column header	description
`entity:sample_id`	column with the list of sample names. (e.g. `entity:evd68-0203_id`)
`fastq_1`	The google bucket path to the R1 fastq file
`fastq_2`	The google bucket path to the R2 fastq file
`out_dir`	User defined google bucket for where the files will be transfered during the transfer task
`terra_data_table_path`	(optional; required for results summary workflow)
`project_name`	(optional; requried for results summary workflow)

Terra Workspace Data.

See Getting set up above.

Setting up the workflow inputs

For setting up the worklfow inputs, use the EV-D68_illumina_pe_assembly-input.json in the workflow_inputs directory.

workflow variable	attribute (input syntax into workflow)
`adapters_and_contaminants`	workspace.adapters_and_contaminants_fa
`concat_assembly_software_illumina_py`	workspace.ev-d68_concat_assembly_software_illumina_py
`evd68_calc_percent_coverage_py`	workspace.evd68_calc_percent_coverage_py
`evd68_genome`	workspace.evd68_genome_fa
`evd68_gff`	workspace.evd68_genome_gff
`fastq_1`	this.fastq_1
`fastq_2`	this.fastq_2
`out_dir`	this.out_dir
`primer_bed`	workspace.primers_bed
`sample_name`	this.{entity_name}_id
`project_name`	this.project_name

Outputs

WDL task name	software/program	variable name	description
seqyclean	seqyclean	`filtered_reads_1`	file
seqyclean	seqyclean	`filtered_reads_2`	file
seqyclean	seqyclean	`seqyclean_summary`	file
fastqc as fastqc_raw	fastqc	`fastqc_raw1_html`	file
fastqc as fastqc_raw	fastqc	`fastqc_raw2_html`	file
fastqc as fastqc_cleaned	fastqc	`fastqc_clean1_html`	file
fastqc as fastqc_cleaned	fastqc	`fastqc_clean2_html`	file
ivar trim	ivar trim and samtools	`trimsort_bam`	file
ivar trim	ivar trim and samtools	`trimsort_bamindex`	file
ivar variants	ivar variants	`variants`	vcf file formatted as a tsv
ivar consensus	ivar consnesus	`consensus`	fasta file of conensus genome, Ns are called in places with less than 10 bp read depth
bam_stats	samtools flagstat, stats, percent_coverage	`flagstat_out`	file
bam_stats	samtools flagstat, stats, percent_coverage	`stats_out`	file
bam_stats	samtools flagstat, stats, percent_coverage	`covhist_out`	file
bam_stats	samtools flagstat, stats, percent_coverage	`cov_out`	file
bam_stats	samtools flagstat, stats, percent_coverage	`depth_out`	file
rename_fasta	N/A	`renamed_consensus`	fasta file; consesnus genome sequence with the fasta header renamed to be CO-CDPHE-{sample_name}
calc_percent_cvg	ev-d68_calc_percent_coverage.py	`percent_cvg_csv`	csv file, see ev-d68_calc_percent_cvg.py script readme for details found in the ./python_scripts directory of this repository
align_reads	bwa and samtools	``assembler_software_file`	file recording the version for bwa and ivar
align_reads	bwa	``bwa_version`	string recording the version for bwa, this information is used later for submitting to public repositories
ivar trim	ivar version	`ivar_version`	String

EV-D68_seq_results.wdl

This workflow should be run following assembly with one of the reference based assembly workflows. The workflow accepts "sample_set" as the root entity type and uses the data table from either assembly workflows. Both assembly workflows (illumina pe and ont) are compatible with this workflow. Breifly the workflow performs the following:

Concaenates sequencing assembly metrics (e.g. percent coverage, assembler version), and sequence metadata (e.g. plate name, sample well location) into a single csv file.
Generates a csv file with sequencing assembly metrics that can be used to parse sequencing data into our LIMS.
Transfers intermediate and summary files to a user defined google bucket.

Inputs

Below is a summary of the workflow input variables along with the syntax used for the attribute column when setting up the workflow to run on Terra.bio. For the attributes, the ""this.sample{terra_datatable_name}s." syntax refers Terra to pull the variable from the terra datatable as used for sample sets. These variables were either in the original terra datatable as inputs for the assembly workflow (see referece based assembly workflow inputs sections above for more details) or added as outputs during the assemlby workflow (see reference based assembly workflow outputs sections for more details). The "workspace." syntax refers Terra to pull the variable from the terra workspace data. Workspace data is describe in the Getting Started drop down menu above.

workflow variable	attribute (input syntax into workflow)
`assembler_version_array`	this.sample{terra_datatable_name}s.assembler_version
`ev-d68_concat_seq_results_py`	workspace.ev-d68_concat_results_py
`cov_out`	this.sample{terra_datatable_name}s.cov_out
`out_dir_array`	this.sample{terra_datatable_name}s.out_dir
`percent_cvg_csv`	this.sample{terra_datatable_name}s.percent_cvg_csv
`project_name_array`	this.sample{terra_datatable_name}s.project_name
`renamed_consensus`	this.sample{terra_datatable_name}s.renamed_consesnus
`sample_name`	this.sample{terra_datatable_name}s.sample{terra_datatable_name}_id
`workbook_path_array`	this.sample{terra_datatable_name}s.workbook_path

Outputs

This workflow generates several output files which are transfered to the user defined user google bucket as defined by this.sample{terra_datatable_name}s.out_dir. The table below details each output. For more detailed regarding the values in each column for the outputs see either the software readmes or the readme for the specific python script as listed in the description.

output variable name	file_name	description	google bucket path
`sequencing_results_csv`	`{seq_run}_sequencing_results.csv`	summary of the sequencing metrics for each sequence generated from the `ev-d68_concat_seq_results.py`	`gs://{user_defined_gcp_bucket}/summary_results/`
`wgs_horizon_report_csv`	`{seq_run}_wgs_horizon_report.csv`	results csv used for parsing results into our LIMS. This file is generated from the `ev-d68_concat_seq_results.py`	`gs://{user_defined_gcp_bucket}/summary_results/`

EV-D68 Genotyping

Finally, we use the publicly available Enterovirus Genotyping Tool (available here: https://www.rivm.nl/mpf/typingtool/enterovirus/) for genotyping the consensus genome assemblies.

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
data/workspace_data		data/workspace_data
docs/workflow_diagrams		docs/workflow_diagrams
scripts		scripts
tasks		tasks
workflows		workflows
.dockstore.yml		.dockstore.yml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CDPHE Enterovirus repository

CDPHE_EV-D68 Workflows

Table of Contents

Overview

Getting set up

Workspace data

Reference-Based Assembly workflows

EV-D68_illumina_pe_assembly.wdl

Overview

Inputs

Outputs

EV-D68_seq_results.wdl

Inputs

Outputs

EV-D68 Genotyping

About

Releases 1

Packages

Languages

License

CDPHE-bioinformatics/CDPHE-Enterovirus

Folders and files

Latest commit

History

Repository files navigation

CDPHE Enterovirus repository

CDPHE_EV-D68 Workflows

Table of Contents

Overview

Getting set up

Workspace data

Reference-Based Assembly workflows

EV-D68_illumina_pe_assembly.wdl

Overview

Inputs

Outputs

EV-D68_seq_results.wdl

Inputs

Outputs

EV-D68 Genotyping

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages