nf-flu: Output

The output produced by the CFIA-NCFAD/nf-flu pipeline is described here.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

IRMA
BLAST analysis
Coverage Plots
Assembled Consensus Sequences
Mismatch Report
Reference Sequences
Variant Calling
H/N Subtyping
Annotation

IRMA

Output files

irma/<sample>
- amended_consensus/
  - Assembled gene segment consensus sequences: *.fa
- figures/
  - Coverage and variants plot for gene segment: *-coverageDiagram.pdf
  - Heuristics graph for gene segment: *-heuristics.pdf
  - Gene segment variant phasing heatmap using experimental enrichment distances: *-EXPENRD.pdf
  - Gene segment variant phasing heatmap using modified Jaccard distances: *-JACCARD.pdf
  - Gene segment variant phasing heatmap using mutual association distances: *-MUTUALD.pdf
  - Gene segment variant phasing heatmap using normalized joint probability distances: *-NJOINTP.pdf
  - Read filtering, QC and assembly info plots: READ_PERCENTAGES.pdf
- intermediate/
  - Intermediate analysis output files for each step in IRMA assembly.
- logs/
  - Counts and scores for assembly, QC and read mapping: *_log.txt
  - Configuration file for IRMA analysis: FLU-*.sh
  - Table of IRMA execution parameters: run_info.txt
- matrices/
  - Variant phasing matrices to construct heatmaps under figures/: *.sqm
- secondary/
  - Secondary assemblies and unmatched reads.
- tables/
  - Summary of gene segment paired-end merging stats, if applicable: *-pairingStats.txt
  - Summary coverage stats for assembly of gene segment: *-coverage.txt
  - Stats for every position and allele in assembly of gene segment: *-allAlleles.txt
  - Insertion variants called for gene segment: *-insertions.txt
  - Deletion variants called for gene segment: *-deletions.txt
  - SNP variants called for gene segment: *-variants.txt
  - Read counts at various stages of IRMA assembly process: READ_COUNTS.txt
- Sorted BAM file for gene segment assembly: *.bam
- Final assembled plurality consensus (no mixed basecalls) for gene segment: *.fa
- IRMA variant call file for gene segment: *.vcf
Concatenated "amended" IRMA consensus sequences for all gene segments assembled: <sample>.irma.consensus.fasta

IRMA output is described in the official IRMA output documentation.

The primary output from IRMA are the consensus sequences for gene segments, which are used for H/N subtyping and performed blastn against influenza database to pull top match reference sequences for each segment of each sample.

BLAST analysis

Output files

blast/ncbi/blast_db/
- Nucleotide BLAST database of NCBI Influenza DB and reference database (if provided option --ref_db): influenza_db.*
blast/ref_db/blast_db/
- Nucleotide BLAST database of the reference database (if provided option --ref_db) ref_fasta.fixed.*`
blast/blastn/irma
- Nucleotide BLAST tabular output files (-outfmt "6 qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs stitle") of sample IRMA assembled gene segments against the NCBI Influenza DB and the reference database (if provided option --ref_db)
blast/blastn/consensus
- Nucleotide BLAST tabular output files (-outfmt "6 qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs stitle") of sample final consensus assembled gene segments against the NCBI Influenza DB and the reference database (if provided option --ref_db)
blast/blastn/against_ref_db
- Nucleotide BLAST tabular output files (-outfmt "6 qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen qcovs stitle") of sample final consensus assembled gene segments against the reference database only (if provided option --ref_db)

Nucleotide BLAST (blastn) is used to query IRMA assembled gene segment sequences against Influenza sequences from NCBI (and optionally, against user-specified sequences (--ref_db) to predict the H and N subtype of each sample if possible (i.e. if segments 4 (hemagglutinin) and/or 6 (neuraminidase) were assembled) and to determine the closest matching reference sequence for each segment for reference mapped assembly.

Coverage Plots

Output files

coverage_plots/<sample>/
- Coverage plot in linear and log scale: *.pdf

Assembled Consensus Sequences

Output files

consensus/bcftools/<sample>/
- Assembled consensus sequences for each segment: *.bcftools.consensus.fasta
consensus/bcftools/
- Concatenated consensus sequences for all segments assembled: <sample>.consensus.fasta
consensus/irma/
- Assembled consensus sequences for each segment: <sample>.irma.consensus.fasta

Mismatch Report

Output files

mistmacth_report/
- <sample>-blastn-report.xlsx

The report contains 2 sheets:

1_Mismatch_Report: Count number of mismatches in BLASTN report (see sheet 2) against each reference sequences in reference database
2_Blastn_Results: Nucleotide BLASTN report of sample final consensus against reference database

Reference Sequences

Output files

<sample>/
- Top reference sequences for all segments: *.reference.fasta
- List of top reference ID pulled from influenza database: *.topsegments.csv

Segments Mapping

Output files

mapping/<sample>/
- The results of segments mapping using minimap2: *.bam, *.bai, *.depths.tsv, *.flagstat, *.idxstats, *.stats

Variant Calling

Output files

variants/<sample>/
- Filter Frameshift VCF: *.filt_frameshift.vcf
- BCF Tools stats: *.bcf_tools.stats.txt
- Clair3 or Medaka output directory

H/N Subtyping

Output files

H/N subtyping Excel report: iav-subtyping-report.xlsx

A H/N subtyping Excel report is generated from all BLAST analysis results for all samples and final assembled gene segments.

The subtyping report spreadsheet contains four sheets:

1_Subtype Predictions: H/N subtype prediction results for each sample along with top matching Influenza DB segment for the H and N segments
2_Top Segment Matches: Top 3 Influenza DB sequence matches for each segment of each sample along with BLASTN hit values and reference sequence metadata.
3_H Segment Results: Top H subtype prediction, BLASTN results and top matching sequence metadata for each sample.
4_N Segment Results: Top N subtype prediction, BLASTN results and top matching sequence metadata for each sample.

Sheet: 1_Subtype Predictions

This sheet contains the H/N subtype prediction results for each sample along with top matching Influenza DB segment for the H and N segments

Field	Description	Example
Sample	Sample name	ERR3338653
Subtype Prediction	H/N subtype prediction based on BLAST against the Influenza DB. If a type could not be assigned to either H or N segment or both, then the subtype prediction will be missing the H or N value or if both the H and N cannot be assigned then the subtype prediction will be null or an empty cell value	H1N1
H: top match accession	NCBI accession of top matching Influenza sequence for the H segment	CY147779
H: type prediction	H subtype prediction number. Value is a number.	1
H: top match virus name	Top matching sequence virus name	Influenza A virus (A/Mexico/24036/2009(H1N1))
H: NCBI Influenza DB subtype match proportion	Proportion of BLAST matches that support the H subtype prediction. This value is a decimal number where 1.0 indicates 100% of matches support the subtype prediction.	0.9980057896
N: top match accession	NCBI accession of top matching Influenza DB sequence for the N segment	MN371610
N: type prediction	N subtype prediction. Value is a number.	1
N: top match virus name	Top matching sequence virus name	Influenza A virus (A/California/04/2009)
N: NCBI Influenza DB subtype match proportion	Proportion of BLAST matches that support the N subtype prediction. This value is a decimal number where 1.0 indicates 100% of matches support the subtype prediction.	0.9993240503

Sheet: 2_Top Segment Matches

This sheet contains the top 3 Influenza DB sequence matches for each segment of each sample along with BLASTN hit values and reference sequence metadata.

Field	Description	Example
Sample	Sample name	ERR3338653
Sample Genome Segment Number	Influenza genome segment number	4
Reference NCBI Accession	Matching sequence NCBI accession	CY147779
Reference Subtype	Matching sequence subtype	H1N1
BLASTN Percent Identity	BLASTN percent identity	99.941
BLASTN Alignment Length	BLASTN alignment length	1701
BLASTN Mismatches	BLASTN number of mismatches	1
BLASTN Gaps	BLASTN number of gaps	0
BLASTN Sample Start Index	Sample sequence alignment start index	1
BLASTN Sample End Index	Sample sequence alignment end index	1701
BLASTN Reference Start Index	Matching reference sequence start index	1
BLASTN Reference End Index	Matching reference sequence end index	1701
BLASTN E-value	BLASTN alignment e-value	0
BLASTN Bitscore	BLASTN alignment bitscore	3136
Sample Sequence Length	Length of sample sequence segment	1701
Reference Sequence Length	Length of matching reference sequence segment	1701
Sample Sequence Coverage of Reference Sequence	Sample segment sequence coverage of reference sequence	100
Reference Sequence ID	Matching reference sequence identifier	gi
Reference Genome Segment Number	Reference sequence segment number	4
Reference Virus Name	Reference sequence virus name	Influenza A virus (A/Mexico/24036/2009(H1N1))
Reference Host	Reference sequence host organism	Human
Reference Country	Reference sequence country of isolation	Mexico
Reference Collection Date	Reference sequence date of collection	2009/04/27
Reference Patient Age	Reference sequence patient age	152
Reference Patient Gender	Reference sequence patient gender	Female
Reference Group ID	NCBI Influenza DB reference sequence internal genome ID	1018714

Sheets: "3_H Segment Results" and "4_N Segment Results"

These sheets ("3_H Segment Results" and "4_N Segment Results") contain the subtype prediction, BLASTN results and top matching sequence metadata for each sample.

Below are shown the fields for the "3_H Segment Results" sheet. The fields are nearly identical for the "4_N Segment Results" sheet except "N:" instead of "H:" in the field names.

Field	Description	Example
Sample	Sample name	ERR3338653
Subtype Prediction	Overall subtype prediction	H1N1
H: NCBI Influenza DB subtype match proportion	Proportion of BLAST matches that support the N subtype prediction. This value is a decimal number where 1.0 indicates 100% of matches support the subtype prediction.	0.9980057896
H: NCBI Influenza DB subtype match count	Number of reference sequences that have a BLASTN match to the sequence of this sample and have the same H subtype as the top prediction.	31028
H: NCBI Influenza DB total count	Total number of reference sequences that have a BLASTN match to the sequence of this sample.	31090
H: top match BLASTN % identity	BLASTN alignment percent identity	99.941
H: top match BLASTN alignment length	BLASTN alignment length	1701
H: top match BLASTN mismatches	BLASTN alignment mismatches	1
H: top match BLASTN gaps	BLASTN alignment gaps	0
H: top match BLASTN bitscore	BLASTN alignment bitscore	3136
H: sample segment length	Sample sequence length	1701
H: top match sequence length	Reference sequence length	1701
H: top match accession	Matching sequence NCBI accession	CY147779
H: top match virus name	Reference sequence virus name	Influenza A virus (A/Mexico/24036/2009(H1N1))
H: top match host	Reference sequence host organism	Human
H: top match country	Reference sequence country of isolation	Mexico
H: top match collection date	Reference sequence date of collection	2009/04/27
H: type prediction	H segment subtype prediction	1

Annotation

Consensus sequences are annotated using VADR. The output files are available in multiple formats including Feature Table, Genbank, nucleotide and amino acid FASTA and GFF.

Output files

annotation/vadr/<sample>
- Each sample will have its own VADR annotation analysis output directory. Feature table output can be found in the *.vadr.pass.tbl files.
annotation/<sample>/
- VADR Feature Table output is converted to Genbank, GFF and FASTA format for downstream analyses. FASTA files with nucleotide sequences of genetic features (CDS, mature peptide, signal peptide, etc) can be found in the .ffn files and amino acid sequences of genetic features can be found in the .faa files.
annotation/vadr-annotation-failed-sequences.txt: list of sequences that failed VADR annotation
annotation/vadr-annotation-issues.txt: table describing sequences that had issues with VADR annotation

Pipeline information

Output files

pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.csv.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.fixed.csv.
- Documentation for interpretation of results in HTML format: results_description.html.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

output.md

output.md

nf-flu: Output

Pipeline overview

IRMA

BLAST analysis

Coverage Plots

Assembled Consensus Sequences

Mismatch Report

Reference Sequences

Segments Mapping

Variant Calling

H/N Subtyping

Sheet: 1_Subtype Predictions

Sheet: 2_Top Segment Matches

Sheets: "3_H Segment Results" and "4_N Segment Results"

Annotation

Pipeline information

Files

output.md

Latest commit

History

output.md

File metadata and controls

nf-flu: Output

Pipeline overview

IRMA

BLAST analysis

Coverage Plots

Assembled Consensus Sequences

Mismatch Report

Reference Sequences

Segments Mapping

Variant Calling

H/N Subtyping

Sheet: 1_Subtype Predictions

Sheet: 2_Top Segment Matches

Sheets: "3_H Segment Results" and "4_N Segment Results"

Annotation

Pipeline information