Merge pull request #73 from peterk87/add-vadr
Add VADR Influenza sequence annotation
peterk87 authored Jul 26, 2024
2 parents 9860c05 + 4ee2b70 commit ee8844c
Showing 13 changed files with 501 additions and 7 deletions.
2 changes: 1 addition & 1 deletion .zenodo.json
@@ -15,5 +15,5 @@
"name": "Lung, Oliver"
}
],
"license": "MIT",
"license": "MIT"
}
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,16 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [[3.4.0](https://github.com/CFIA-NCFAD/nf-flu/releases/tag/3.4.0)] - 2024-07-24

This release adds Influenza virus sequence annotation using VADR.

### Changes

* Add VADR for Influenza consensus sequence annotation
* Add table2asn for Feature Table conversion to Genbank
* Add pre- and post-table2asn processing to work around the sequence ID length limit imposed by table2asn when converting from Feature Table format to Genbank

## [[3.3.10](https://github.com/CFIA-NCFAD/nf-flu/releases/tag/3.3.10)] - 2024-05-31

Fix MultiQC report generation, which was broken because module filter paths no longer work as they did in MultiQC v1.12.
17 changes: 16 additions & 1 deletion README.md
@@ -24,7 +24,8 @@ After reference sequence selection, the pipeline performs read mapping to each r
5. H/N subtype prediction and Excel XLSX report generation based on BLAST results.
6. Automatically select top match reference sequences for segments
7. Read mapping, variant calling and consensus sequence generation for each segment against top reference sequence based on BLAST results.
8. MultiQC report generation.
8. Annotation of consensus sequences with [VADR][]
9. [MultiQC][] report generation.

## Quick Start

@@ -178,6 +179,18 @@ Ewels, P.A., Peltzer, A., Fillinger, S., Patel, H., Alneberg, J., Wilm, A., Garc
[seqtk][] is used for rapid manipulation of FASTA/Q files. Available from GitHub at [lh3/seqtk](https://github.com/lh3/seqtk)
### [VADR][]
[VADR][] is used for annotation of Influenza virus sequences.
```text
Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney Brister, Ilene Karsch-Mizrachi, Eric P Nawrocki; VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics 21, 211 (2020). https://doi.org/10.1186/s12859-020-3537-3
```
### [table2asn][]
[table2asn][] is used for converting the [VADR][] Feature Table format output to Genbank format to help with conversion to other formats such as FASTA and GFF.
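
For a rough sense of why the Genbank intermediate is convenient, the minimal sketch below (not part of the pipeline; the file names are hypothetical) uses Biopython to re-emit a table2asn Genbank file as FASTA; in nf-flu this conversion, plus GFF output, is handled by `bin/post_table2asn.py`.

```python
# Sketch only: convert a table2asn Genbank file to FASTA with Biopython.
# "sample.gbk" and "sample.fa" are hypothetical file names.
from Bio import SeqIO

n_records = SeqIO.convert("sample.gbk", "genbank", "sample.fa", "fasta")
print(f"Wrote {n_records} records to sample.fa")
```
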
## Credits
The nf-flu pipeline was originally developed by [Peter Kruczkiewicz](https://github.com/peterk87) from [CFIA-NCFAD](https://github.com/CFIA-NCFAD); [Hai Nguyen](https://github.com/nhhaidee) extended the pipeline for Nanopore data analysis.
@@ -203,3 +216,5 @@ The nf-flu pipeline was originally developed by [Peter Kruczkiewicz](https://git
[Mosdepth]: https://github.com/brentp/mosdepth
[seqtk]: https://github.com/lh3/seqtk
[MultiQC]: https://multiqc.info/
[VADR]: https://github.com/ncbi/vadr
[table2asn]: https://www.ncbi.nlm.nih.gov/genbank/table2asn/
146 changes: 146 additions & 0 deletions bin/post_table2asn.py
@@ -0,0 +1,146 @@
#!/usr/bin/env python

import logging
import os
import re
from pathlib import Path

import typer
from Bio import SeqIO
from Bio.SeqIO.InsdcIO import _insdc_location_string
from Bio.SeqRecord import SeqRecord
from gfflu.io import get_translation, write_gff
from rich.logging import RichHandler

REGEX_SUBNAME = re.compile(r"(SEQUENCE-\d+)")


def init_logging():
    from rich.traceback import install

    install(show_locals=True, width=120, word_wrap=True)
    logging.basicConfig(
        format="%(message)s",
        datefmt="[%Y-%m-%d %X]",
        level=logging.DEBUG,
        handlers=[RichHandler(rich_tracebacks=True, tracebacks_show_locals=True)],
    )


def get_namesub(namesub_txt: Path) -> dict[str, str]:
    """Read tab-delimited mapping of table2asn seqids ("SEQUENCE-N") to original seqids."""
    namesub = {}
    with open(namesub_txt) as fh:
        for line in fh:
            line = line.strip()
            if line == "":
                continue
            sub_name, original_name = line.split("\t", maxsplit=1)
            namesub[sub_name] = original_name
    return namesub


def write_aa_fasta(recs: list[SeqRecord], out_file: os.PathLike) -> None:
    """Write out FASTA file with amino acid sequences"""
    with open(out_file, "w") as out_handle:
        for rec in recs:
            for feature in rec.features:
                if feature.type not in {"CDS", "mat_peptide", "sig_peptide"}:
                    continue
                try:
                    translation = feature.qualifiers["translation"][0]
                except (KeyError, IndexError):
                    logging.warning(
                        f"Could not retrieve translation qualifier for feature {feature} from {rec}. "
                        "Trying to obtain translation from nucleotide sequence."
                    )
                    translation = get_translation(rec.seq, feature.location)
                try:
                    gene = feature.qualifiers["gene"][0]
                except KeyError:
                    gene = ""
                try:
                    product = feature.qualifiers["product"][0]
                except KeyError:
                    product = ""
                location_str = _insdc_location_string(
                    location=feature.location, rec_length=len(rec.seq)
                )
                out_handle.write(
                    f">{rec.id}|{feature.type}|{gene}|{location_str} {product}\n{translation}\n"
                )


def write_cds_nt_fasta(recs: list[SeqRecord], out_file: os.PathLike) -> None:
    """Write out FASTA file with CDS nt sequences"""
    with open(out_file, "w") as out_handle:
        for rec in recs:
            for feature in rec.features:
                if feature.type not in {"CDS", "mat_peptide", "sig_peptide"}:
                    continue
                subseq = str(feature.location.extract(rec.seq))
                try:
                    gene = feature.qualifiers["gene"][0]
                except KeyError:
                    gene = ""
                try:
                    product = feature.qualifiers["product"][0]
                except KeyError:
                    product = ""
                location_str = _insdc_location_string(
                    location=feature.location, rec_length=len(rec.seq)
                )
                out_handle.write(
                    f">{rec.id}|{feature.type}|{gene}|{location_str} {product}\n{subseq}\n"
                )


def output_subbed_genbank(
    gbk_path: Path,
    gbk_out: Path,
    namesub: dict[str, str],
) -> None:
    """Replace table2asn "SEQUENCE-N" seqids with the original seqids in a Genbank file."""
    with open(gbk_path) as fh, open(gbk_out, "w") as fout:
        for line in fh:
            match = REGEX_SUBNAME.search(line)
            if match:
                subname = match.group(1)
                original_name = namesub[subname]
                line = line.replace(subname, original_name)
            fout.write(line)


def main(
    table2asn_genbank: Path,
    namesub_txt: Path,
    outdir: Path = typer.Option(Path("./"), "--outdir", "-o"),
    prefix: str = typer.Option("SAMPLE", "--prefix", "-p"),
):
    """Restore original seqids in table2asn Genbank output and write GFF/FASTA outputs."""
    init_logging()
    logging.info(f"{table2asn_genbank=}")
    logging.info(f"{namesub_txt=}")
    logging.info(f"{outdir=}")
    logging.info(f"{prefix=}")
    namesub = get_namesub(namesub_txt)
    logging.info(f'Read {len(namesub)} seqid substitutions from "{namesub_txt}"')
    gbk_out = outdir / f"{prefix}.gbk"
    output_subbed_genbank(table2asn_genbank, gbk_out, namesub)
    logging.info(f'Wrote Genbank with original names to "{gbk_out}"')
    recs = list(SeqIO.parse(gbk_out, "genbank"))
    logging.info(
        f"Read {len(recs)} sequence records from '{gbk_out}'. Writing GFF, amino acid and nucleotide FASTAs for "
        "sequence features (CDS, mature peptides, signal peptides)..."
    )
    gff_out = outdir / f"{prefix}.gff"
    write_gff(recs, gff_out)
    logging.info(f'Wrote GFF to "{gff_out}"')
    faa_out = outdir / f"{prefix}.faa"
    write_aa_fasta(recs, faa_out)
    logging.info(f'Wrote amino acid FASTA to "{faa_out}"')
    ffn_out = outdir / f"{prefix}.ffn"
    write_cds_nt_fasta(recs, ffn_out)
    logging.info(f'Wrote nucleotide FASTA of gene features to "{ffn_out}"')
    logging.info("Done!")


if __name__ == "__main__":
    typer.run(main)
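
A compact, self-contained sketch of the name-restoration step performed by `output_subbed_genbank()` above; the mapping and Genbank `LOCUS` line are made-up example data:

```python
# Sketch (made-up data): restore an original seqid from its table2asn placeholder.
import re

REGEX_SUBNAME = re.compile(r"(SEQUENCE-\d+)")
# Hypothetical *.namesub.txt mapping: table2asn-safe name -> original name
namesub = {"SEQUENCE-1": "sample-ABC123_segment4_HA_consensus"}

line = "LOCUS       SEQUENCE-1              1701 bp    RNA     linear\n"
match = REGEX_SUBNAME.search(line)
if match:
    line = line.replace(match.group(1), namesub[match.group(1)])
print(line, end="")
# LOCUS       sample-ABC123_segment4_HA_consensus              1701 bp    RNA     linear
```
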
70 changes: 70 additions & 0 deletions bin/sub_seqids_for_table2asn.py
@@ -0,0 +1,70 @@
#!/usr/bin/env python

import logging
from pathlib import Path

import typer
from rich.logging import RichHandler


def init_logging():
    from rich.traceback import install

    install(show_locals=True, width=120, word_wrap=True)
    logging.basicConfig(
        format="%(message)s",
        datefmt="[%Y-%m-%d %X]",
        level=logging.DEBUG,
        handlers=[RichHandler(rich_tracebacks=True, tracebacks_show_locals=True)],
    )


def main(
    input_fasta: Path = typer.Option(..., "--input-fasta", "-f"),
    input_feature_table: Path = typer.Option(..., "--input-feature-table", "-t"),
    outdir: Path = typer.Option(Path("./"), "--outdir", "-o"),
    prefix: str = typer.Option("SAMPLE", "--prefix", "-p"),
):
    """Substitute long seqids with short "SEQUENCE-N" names so table2asn accepts them."""
    init_logging()
    logging.info(f"{input_fasta=}")
    logging.info(f"{input_feature_table=}")
    logging.info(f"{outdir=}")
    logging.info(f"{prefix=}")
    ft_seqids = []
    ft_out = outdir / f"{prefix}.tbl"
    with input_feature_table.open() as fh, ft_out.open("w") as fout:
        seqid_count = 0
        for line in fh:
            if line.startswith(">Feature "):
                seqid = line.replace(">Feature ", "").strip()
                seqid_count += 1
                new_seqid = f"SEQUENCE-{seqid_count}"
                ft_seqids.append((seqid, new_seqid))
            fout.write(line.replace(seqid, new_seqid))
    logging.info(f"table2asn compatible feature table written to '{ft_out}'")

    namesub_path = outdir / f"{prefix}.namesub.txt"
    with namesub_path.open("w") as fout:
        for seqid, new_seqid in ft_seqids:
            fout.write(f"{new_seqid}\t{seqid}\n")
    logging.info(
        f"Text file with original and table2asn seqids written to '{namesub_path}'",
    )
    fasta_out = outdir / f"{prefix}.fa"
    seqid_to_new_seqid = dict(ft_seqids)
    logging.info(f"{seqid_to_new_seqid=}")
    with input_fasta.open() as fh, fasta_out.open("w") as fout:
        for line in fh:
            if line.startswith(">"):
                seqid, *_ = line[1:].strip().split(" ")
                try:
                    fout.write(line.replace(seqid, seqid_to_new_seqid[seqid]))
                except KeyError:
                    logging.warning(f"{seqid=} not in feature table seqids!")
            else:
                fout.write(line)
    logging.info(f"table2asn compatible fasta written to '{fasta_out}'")


if __name__ == "__main__":
    typer.run(main)
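
A minimal sketch (made-up example data) of the renaming the script above performs: long consensus seqids become `SEQUENCE-N` in the feature table and FASTA passed to table2asn, and the tab-delimited `*.namesub.txt` file records the mapping so `post_table2asn.py` can restore the original names afterwards.

```python
# Sketch (made-up data) of the seqid substitution applied before table2asn.
feature_table = [
    ">Feature sample-ABC123_segment4_HA_consensus\n",
    "1\t1701\tCDS\n",
    "\t\t\tgene\tHA\n",
]
ft_seqids = []
renamed_lines = []
count = 0
for line in feature_table:
    if line.startswith(">Feature "):
        seqid = line.replace(">Feature ", "").strip()
        count += 1
        new_seqid = f"SEQUENCE-{count}"
        ft_seqids.append((seqid, new_seqid))
    renamed_lines.append(line.replace(seqid, new_seqid))

print("".join(renamed_lines))       # feature table now references SEQUENCE-1
for original, short in ft_seqids:
    print(f"{short}\t{original}")   # contents of the *.namesub.txt mapping
```
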
33 changes: 33 additions & 0 deletions conf/modules_illumina.config
@@ -26,6 +26,39 @@ process {
        ]
    }

    withName: 'VADR' {
        ext.args = '--mkey flu -r --atgonly --xnocomp --nomisc --alt_fail extrant5,extrant3'
        publishDir = [
            [
                path: { "${params.outdir}/annotation/vadr" },
                mode: params.publish_dir_mode,
                saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
            ]
        ]
    }

    withName: 'VADR_SUMMARIZE_ISSUES' {
        ext.args = ''
        publishDir = [
            [
                path: { "${params.outdir}/annotation" },
                mode: params.publish_dir_mode,
                saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
            ]
        ]
    }

    withName: 'POST_TABLE2ASN' {
        ext.args = ''
        publishDir = [
            [
                path: { "${params.outdir}/annotation/${sample}" },
                mode: params.publish_dir_mode,
                saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
            ]
        ]
    }

    withName: 'BLAST_MAKEBLASTDB' {
        ext.args = '-dbtype nucl'
        publishDir = [
33 changes: 33 additions & 0 deletions conf/modules_nanopore.config
@@ -244,6 +244,39 @@ process {
        ]
    }

    withName: 'VADR' {
        ext.args = '--mkey flu -r --atgonly --xnocomp --nomisc --alt_fail extrant5,extrant3'
        publishDir = [
            [
                path: { "${params.outdir}/annotation/vadr" },
                mode: params.publish_dir_mode,
                saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
            ]
        ]
    }

    withName: 'VADR_SUMMARIZE_ISSUES' {
        ext.args = ''
        publishDir = [
            [
                path: { "${params.outdir}/annotation" },
                mode: params.publish_dir_mode,
                saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
            ]
        ]
    }

    withName: 'POST_TABLE2ASN' {
        ext.args = ''
        publishDir = [
            [
                path: { "${params.outdir}/annotation/${sample}" },
                mode: params.publish_dir_mode,
                saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
            ]
        ]
    }

    withName: 'VCF_FILTER_FRAMESHIFT' {
        publishDir = [
            [
18 changes: 18 additions & 0 deletions docs/output.md
@@ -14,6 +14,7 @@ The directories listed below will be created in the results directory after the
- [Reference Sequences](#reference-sequences)
- [Variant Calling](#variant-calling)
- [H/N Subtyping](#hn-subtyping)
- [Annotation](#annotation)

### IRMA

@@ -246,6 +247,22 @@ Below are shown the fields for the "3_H Segment Results" sheet. The fields are n
| H: top match collection date | Reference sequence date of collection | 2009/04/27 |
| H: type prediction | H segment subtype prediction | 1 |

### Annotation

Consensus sequences are annotated using [VADR][]. The output files are available in multiple formats, including Feature Table, Genbank, GFF, and nucleotide and amino acid FASTA.

<details markdown="1">
<summary>Output files</summary>

- `annotation/vadr/<sample>`
  - Each sample will have its own VADR annotation analysis output directory. Feature table output can be found in the `*.vadr.pass.tbl` files.
- `annotation/<sample>/`
  - VADR Feature Table output is converted to Genbank, GFF and FASTA format for downstream analyses. FASTA files with nucleotide sequences of genetic features (CDS, mature peptides, signal peptides, etc.) can be found in the `.ffn` files, and amino acid sequences of those features can be found in the `.faa` files.
- `annotation/vadr-annotation-failed-sequences.txt`: list of sequences that failed VADR annotation
- `annotation/vadr-annotation-issues.txt`: table describing sequences that had issues with VADR annotation

</details>
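
The `.faa` and `.ffn` headers written by `bin/post_table2asn.py` follow a `seqid|feature type|gene|location product` layout. A minimal sketch of parsing them downstream (the example header values are hypothetical):

```python
# Sketch: parse a header such as ">SAMPLE_4|CDS|HA|1..1701 hemagglutinin"
def parse_annotation_header(header: str) -> dict[str, str]:
    ids, _, product = header.lstrip(">").partition(" ")
    seqid, feature_type, gene, location = ids.split("|", maxsplit=3)
    return {
        "seqid": seqid,
        "feature_type": feature_type,
        "gene": gene,
        "location": location,
        "product": product,
    }

print(parse_annotation_header(">SAMPLE_4|CDS|HA|1..1701 hemagglutinin"))
```
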

### Pipeline information

<details markdown="1">
@@ -264,3 +281,4 @@ Below are shown the fields for the "3_H Segment Results" sheet. The fields are n
[NCBI Influenza DB]: https://www.ncbi.nlm.nih.gov/genomes/FLU/Database/nph-select.cgi?go=database
[IRMA]: https://wonder.cdc.gov/amd/flu/irma/
[BLAST]: https://blast.ncbi.nlm.nih.gov/Blast.cgi
[VADR]: https://github.com/ncbi/vadr