Merge pull request #73 from peterk87/add-vadr
Add VADR Influenza sequence annotation
peterk87 authored Jul 26, 2024
2 parents 9860c05 + 4ee2b70 commit ee8844c
Showing 13 changed files with 501 additions and 7 deletions.
2 changes: 1 addition & 1 deletion .zenodo.json
@@ -15,5 +15,5 @@
"name": "Lung, Oliver"
}
],
"license": "MIT",
"license": "MIT"
}
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,16 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [[3.4.0](https://github.com/CFIA-NCFAD/nf-flu/releases/tag/3.4.0)] - 2024-07-24

This release adds Influenza virus sequence annotation using VADR.

### Changes

* Add VADR for Influenza consensus sequence annotation
* Add table2asn for Feature Table conversion to Genbank
* Add pre- and post-table2asn processing to work around the sequence ID length limit imposed by table2asn when converting from Feature Table format to Genbank

## [[3.3.10](https://github.com/CFIA-NCFAD/nf-flu/releases/tag/3.3.10)] - 2024-05-31

Fix MultiQC report generation, which was broken because module filter paths no longer work as they did in MultiQC v1.12.
17 changes: 16 additions & 1 deletion README.md
@@ -24,7 +24,8 @@ After reference sequence selection, the pipeline performs read mapping to each r
5. H/N subtype prediction and Excel XLSX report generation based on BLAST results.
6. Automatically select top match reference sequences for segments
7. Read mapping, variant calling and consensus sequence generation for each segment against top reference sequence based on BLAST results.
8. MultiQC report generation.
8. Annotation of consensus sequences with [VADR][]
9. [MultiQC][] report generation.

## Quick Start

@@ -178,6 +179,18 @@ Ewels, P.A., Peltzer, A., Fillinger, S., Patel, H., Alneberg, J., Wilm, A., Garc
[seqtk][] is used for rapid manipulation of FASTA/Q files. Available from GitHub at [lh3/seqtk](https://github.com/lh3/seqtk)
### [VADR][]
[VADR][] is used for annotation of Influenza virus sequences.
```text
Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney Brister, Ilene Karsch-Mizrachi, Eric P Nawrocki; VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics 21, 211 (2020). https://doi.org/10.1186/s12859-020-3537-3
```
### [table2asn][]
[table2asn][] is used for converting the [VADR][] Feature Table format output to Genbank format to help with conversion to other formats such as FASTA and GFF.
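
For a rough sense of why the Genbank intermediate is convenient, the minimal sketch below (not part of the pipeline; the file names are hypothetical) uses Biopython to re-emit a table2asn Genbank file as FASTA; in nf-flu this conversion, plus GFF output, is handled by `bin/post_table2asn.py`.

```python
# Sketch only: convert a table2asn Genbank file to FASTA with Biopython.
# "sample.gbk" and "sample.fa" are hypothetical file names.
from Bio import SeqIO

n_records = SeqIO.convert("sample.gbk", "genbank", "sample.fa", "fasta")
print(f"Wrote {n_records} records to sample.fa")
```
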
## Credits
The nf-flu pipeline was originally developed by [Peter Kruczkiewicz](https://github.com/peterk87) from [CFIA-NCFAD](https://github.com/CFIA-NCFAD); [Hai Nguyen](https://github.com/nhhaidee) extended the pipeline for Nanopore data analysis.
@@ -203,3 +216,5 @@ The nf-flu pipeline was originally developed by [Peter Kruczkiewicz](https://git
[Mosdepth]: https://github.com/brentp/mosdepth
[seqtk]: https://github.com/lh3/seqtk
[MultiQC]: https://multiqc.info/
[VADR]: https://github.com/ncbi/vadr
[table2asn]: https://www.ncbi.nlm.nih.gov/genbank/table2asn/
146 changes: 146 additions & 0 deletions bin/post_table2asn.py
@@ -0,0 +1,146 @@
#!/usr/bin/env python

import logging
import os
import re
from pathlib import Path

import typer
from Bio import SeqIO
from Bio.SeqIO.InsdcIO import _insdc_location_string
from Bio.SeqRecord import SeqRecord
from gfflu.io import get_translation, write_gff
from rich.logging import RichHandler

REGEX_SUBNAME = re.compile(r"(SEQUENCE-\d+)")


def init_logging():
    from rich.traceback import install

    install(show_locals=True, width=120, word_wrap=True)
    logging.basicConfig(
        format="%(message)s",
        datefmt="[%Y-%m-%d %X]",
        level=logging.DEBUG,
        handlers=[RichHandler(rich_tracebacks=True, tracebacks_show_locals=True)],
    )


def get_namesub(namesub_txt: Path) -> dict[str, str]:
    """Read tab-delimited mapping of table2asn seqids ("SEQUENCE-N") to original seqids."""
    namesub = {}
    with open(namesub_txt) as fh:
        for line in fh:
            line = line.strip()
            if line == "":
                continue
            sub_name, original_name = line.split("\t", maxsplit=1)
            namesub[sub_name] = original_name
    return namesub


def write_aa_fasta(recs: list[SeqRecord], out_file: os.PathLike) -> None:
    """Write out FASTA file with amino acid sequences"""
    with open(out_file, "w") as out_handle:
        for rec in recs:
            for feature in rec.features:
                if feature.type not in {"CDS", "mat_peptide", "sig_peptide"}:
                    continue
                try:
                    translation = feature.qualifiers["translation"][0]
                except (KeyError, IndexError):
                    logging.warning(
                        f"Could not retrieve translation qualifier for feature {feature} from {rec}. "
                        "Trying to obtain translation from nucleotide sequence."
                    )
                    translation = get_translation(rec.seq, feature.location)
                try:
                    gene = feature.qualifiers["gene"][0]
                except KeyError:
                    gene = ""
                try:
                    product = feature.qualifiers["product"][0]
                except KeyError:
                    product = ""
                location_str = _insdc_location_string(
                    location=feature.location, rec_length=len(rec.seq)
                )
                out_handle.write(
                    f">{rec.id}|{feature.type}|{gene}|{location_str} {product}\n{translation}\n"
                )


def write_cds_nt_fasta(recs: list[SeqRecord], out_file: os.PathLike) -> None:
    """Write out FASTA file with CDS nt sequences"""
    with open(out_file, "w") as out_handle:
        for rec in recs:
            for feature in rec.features:
                if feature.type not in {"CDS", "mat_peptide", "sig_peptide"}:
                    continue
                subseq = str(feature.location.extract(rec.seq))
                try:
                    gene = feature.qualifiers["gene"][0]
                except KeyError:
                    gene = ""
                try:
                    product = feature.qualifiers["product"][0]
                except KeyError:
                    product = ""
                location_str = _insdc_location_string(
                    location=feature.location, rec_length=len(rec.seq)
                )
                out_handle.write(
                    f">{rec.id}|{feature.type}|{gene}|{location_str} {product}\n{subseq}\n"
                )


def output_subbed_genbank(
    gbk_path: Path,
    gbk_out: Path,
    namesub: dict[str, str],
) -> None:
    """Replace table2asn "SEQUENCE-N" seqids with the original seqids in a Genbank file."""
    with open(gbk_path) as fh, open(gbk_out, "w") as fout:
        for line in fh:
            match = REGEX_SUBNAME.search(line)
            if match:
                subname = match.group(1)
                original_name = namesub[subname]
                line = line.replace(subname, original_name)
            fout.write(line)


def main(
    table2asn_genbank: Path,
    namesub_txt: Path,
    outdir: Path = typer.Option(Path("./"), "--outdir", "-o"),
    prefix: str = typer.Option("SAMPLE", "--prefix", "-p"),
):
    """Restore original seqids in table2asn Genbank output and write GFF/FASTA outputs."""
    init_logging()
    logging.info(f"{table2asn_genbank=}")
    logging.info(f"{namesub_txt=}")
    logging.info(f"{outdir=}")
    logging.info(f"{prefix=}")
    namesub = get_namesub(namesub_txt)
    logging.info(f'Read {len(namesub)} seqid substitutions from "{namesub_txt}"')
    gbk_out = outdir / f"{prefix}.gbk"
    output_subbed_genbank(table2asn_genbank, gbk_out, namesub)
    logging.info(f'Wrote Genbank with original names to "{gbk_out}"')
    recs = list(SeqIO.parse(gbk_out, "genbank"))
    logging.info(
        f"Read {len(recs)} sequence records from '{gbk_out}'. Writing GFF, amino acid and nucleotide FASTAs for "
        "sequence features (CDS, mature peptides, signal peptides)..."
    )
    gff_out = outdir / f"{prefix}.gff"
    write_gff(recs, gff_out)
    logging.info(f'Wrote GFF to "{gff_out}"')
    faa_out = outdir / f"{prefix}.faa"
    write_aa_fasta(recs, faa_out)
    logging.info(f'Wrote amino acid FASTA to "{faa_out}"')
    ffn_out = outdir / f"{prefix}.ffn"
    write_cds_nt_fasta(recs, ffn_out)
    logging.info(f'Wrote nucleotide FASTA of gene features to "{ffn_out}"')
    logging.info("Done!")


if __name__ == "__main__":
    typer.run(main)
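
A compact, self-contained sketch of the name-restoration step performed by `output_subbed_genbank()` above; the mapping and Genbank `LOCUS` line are made-up example data:

```python
# Sketch (made-up data): restore an original seqid from its table2asn placeholder.
import re

REGEX_SUBNAME = re.compile(r"(SEQUENCE-\d+)")
# Hypothetical *.namesub.txt mapping: table2asn-safe name -> original name
namesub = {"SEQUENCE-1": "sample-ABC123_segment4_HA_consensus"}

line = "LOCUS       SEQUENCE-1              1701 bp    RNA     linear\n"
match = REGEX_SUBNAME.search(line)
if match:
    line = line.replace(match.group(1), namesub[match.group(1)])
print(line, end="")
# LOCUS       sample-ABC123_segment4_HA_consensus              1701 bp    RNA     linear
```
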
70 changes: 70 additions & 0 deletions bin/sub_seqids_for_table2asn.py
@@ -0,0 +1,70 @@
#!/usr/bin/env python

import logging
from pathlib import Path

import typer
from rich.logging import RichHandler


def init_logging():
    from rich.traceback import install

    install(show_locals=True, width=120, word_wrap=True)
    logging.basicConfig(
        format="%(message)s",
        datefmt="[%Y-%m-%d %X]",
        level=logging.DEBUG,
        handlers=[RichHandler(rich_tracebacks=True, tracebacks_show_locals=True)],
    )


def main(
    input_fasta: Path = typer.Option(..., "--input-fasta", "-f"),
    input_feature_table: Path = typer.Option(..., "--input-feature-table", "-t"),
    outdir: Path = typer.Option(Path("./"), "--outdir", "-o"),
    prefix: str = typer.Option("SAMPLE", "--prefix", "-p"),
):
    """Substitute long seqids with short "SEQUENCE-N" names so table2asn accepts them."""
    init_logging()
    logging.info(f"{input_fasta=}")
    logging.info(f"{input_feature_table=}")
    logging.info(f"{outdir=}")
    logging.info(f"{prefix=}")
    ft_seqids = []
    ft_out = outdir / f"{prefix}.tbl"
    with input_feature_table.open() as fh, ft_out.open("w") as fout:
        seqid_count = 0
        for line in fh:
            if line.startswith(">Feature "):
                seqid = line.replace(">Feature ", "").strip()
                seqid_count += 1
                new_seqid = f"SEQUENCE-{seqid_count}"
                ft_seqids.append((seqid, new_seqid))
            fout.write(line.replace(seqid, new_seqid))
    logging.info(f"table2asn compatible feature table written to '{ft_out}'")

    namesub_path = outdir / f"{prefix}.namesub.txt"
    with namesub_path.open("w") as fout:
        for seqid, new_seqid in ft_seqids:
            fout.write(f"{new_seqid}\t{seqid}\n")
    logging.info(
        f"Text file with original and table2asn seqids written to '{namesub_path}'",
    )
    fasta_out = outdir / f"{prefix}.fa"
    seqid_to_new_seqid = dict(ft_seqids)
    logging.info(f"{seqid_to_new_seqid=}")
    with input_fasta.open() as fh, fasta_out.open("w") as fout:
        for line in fh:
            if line.startswith(">"):
                seqid, *_ = line[1:].strip().split(" ")
                try:
                    fout.write(line.replace(seqid, seqid_to_new_seqid[seqid]))
                except KeyError:
                    logging.warning(f"{seqid=} not in feature table seqids!")
            else:
                fout.write(line)
    logging.info(f"table2asn compatible fasta written to '{fasta_out}'")


if __name__ == "__main__":
    typer.run(main)
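
A minimal sketch (made-up example data) of the renaming the script above performs: long consensus seqids become `SEQUENCE-N` in the feature table and FASTA passed to table2asn, and the tab-delimited `*.namesub.txt` file records the mapping so `post_table2asn.py` can restore the original names afterwards.

```python
# Sketch (made-up data) of the seqid substitution applied before table2asn.
feature_table = [
    ">Feature sample-ABC123_segment4_HA_consensus\n",
    "1\t1701\tCDS\n",
    "\t\t\tgene\tHA\n",
]
ft_seqids = []
renamed_lines = []
count = 0
for line in feature_table:
    if line.startswith(">Feature "):
        seqid = line.replace(">Feature ", "").strip()
        count += 1
        new_seqid = f"SEQUENCE-{count}"
        ft_seqids.append((seqid, new_seqid))
    renamed_lines.append(line.replace(seqid, new_seqid))

print("".join(renamed_lines))       # feature table now references SEQUENCE-1
for original, short in ft_seqids:
    print(f"{short}\t{original}")   # contents of the *.namesub.txt mapping
```
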
33 changes: 33 additions & 0 deletions conf/modules_illumina.config
@@ -26,6 +26,39 @@ process {
        ]
    }

    withName: 'VADR' {
        ext.args = '--mkey flu -r --atgonly --xnocomp --nomisc --alt_fail extrant5,extrant3'
        publishDir = [
            [
                path: { "${params.outdir}/annotation/vadr" },
                mode: params.publish_dir_mode,
                saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
            ]
        ]
    }

    withName: 'VADR_SUMMARIZE_ISSUES' {
        ext.args = ''
        publishDir = [
            [
                path: { "${params.outdir}/annotation" },
                mode: params.publish_dir_mode,
                saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
            ]
        ]
    }

    withName: 'POST_TABLE2ASN' {
        ext.args = ''
        publishDir = [
            [
                path: { "${params.outdir}/annotation/${sample}" },
                mode: params.publish_dir_mode,
                saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
            ]
        ]
    }

    withName: 'BLAST_MAKEBLASTDB' {
        ext.args = '-dbtype nucl'
        publishDir = [
33 changes: 33 additions & 0 deletions conf/modules_nanopore.config
@@ -244,6 +244,39 @@ process {
        ]
    }

    withName: 'VADR' {
        ext.args = '--mkey flu -r --atgonly --xnocomp --nomisc --alt_fail extrant5,extrant3'
        publishDir = [
            [
                path: { "${params.outdir}/annotation/vadr" },
                mode: params.publish_dir_mode,
                saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
            ]
        ]
    }

    withName: 'VADR_SUMMARIZE_ISSUES' {
        ext.args = ''
        publishDir = [
            [
                path: { "${params.outdir}/annotation" },
                mode: params.publish_dir_mode,
                saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
            ]
        ]
    }

    withName: 'POST_TABLE2ASN' {
        ext.args = ''
        publishDir = [
            [
                path: { "${params.outdir}/annotation/${sample}" },
                mode: params.publish_dir_mode,
                saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
            ]
        ]
    }

    withName: 'VCF_FILTER_FRAMESHIFT' {
        publishDir = [
            [
18 changes: 18 additions & 0 deletions docs/output.md
@@ -14,6 +14,7 @@ The directories listed below will be created in the results directory after the
- [Reference Sequences](#reference-sequences)
- [Variant Calling](#variant-calling)
- [H/N Subtyping](#hn-subtyping)
- [Annotation](#annotation)

### IRMA

@@ -246,6 +247,22 @@ Below are shown the fields for the "3_H Segment Results" sheet. The fields are n
| H: top match collection date | Reference sequence date of collection | 2009/04/27 |
| H: type prediction | H segment subtype prediction | 1 |

### Annotation

Consensus sequences are annotated using [VADR][]. The output files are available in multiple formats, including Feature Table, Genbank, GFF, and nucleotide and amino acid FASTA.

<details markdown="1">
<summary>Output files</summary>

- `annotation/vadr/<sample>`
  - Each sample will have its own VADR annotation analysis output directory. Feature table output can be found in the `*.vadr.pass.tbl` files.
- `annotation/<sample>/`
  - VADR Feature Table output is converted to Genbank, GFF and FASTA format for downstream analyses. FASTA files with nucleotide sequences of genetic features (CDS, mature peptides, signal peptides, etc.) can be found in the `.ffn` files, and amino acid sequences of those features can be found in the `.faa` files.
- `annotation/vadr-annotation-failed-sequences.txt`: list of sequences that failed VADR annotation
- `annotation/vadr-annotation-issues.txt`: table describing sequences that had issues with VADR annotation

</details>
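
The `.faa` and `.ffn` headers written by `bin/post_table2asn.py` follow a `seqid|feature type|gene|location product` layout. A minimal sketch of parsing them downstream (the example header values are hypothetical):

```python
# Sketch: parse a header such as ">SAMPLE_4|CDS|HA|1..1701 hemagglutinin"
def parse_annotation_header(header: str) -> dict[str, str]:
    ids, _, product = header.lstrip(">").partition(" ")
    seqid, feature_type, gene, location = ids.split("|", maxsplit=3)
    return {
        "seqid": seqid,
        "feature_type": feature_type,
        "gene": gene,
        "location": location,
        "product": product,
    }

print(parse_annotation_header(">SAMPLE_4|CDS|HA|1..1701 hemagglutinin"))
```
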

### Pipeline information

<details markdown="1">
@@ -264,3 +281,4 @@ Below are shown the fields for the "3_H Segment Results" sheet. The fields are n
[NCBI Influenza DB]: https://www.ncbi.nlm.nih.gov/genomes/FLU/Database/nph-select.cgi?go=database
[IRMA]: https://wonder.cdc.gov/amd/flu/irma/
[BLAST]: https://blast.ncbi.nlm.nih.gov/Blast.cgi
[VADR]: https://github.com/ncbi/vadr