Merge pull request #25 from peterk87/feature/add-new-ref-seqs
Update Influenza ref seqs DB to use all Orthomyxoviridae viruses from NCBI FTP site
peterk87 authored Jul 13, 2023
2 parents 080bd84 + 2ed8bf5 commit f155d43
Showing 18 changed files with 618 additions and 509 deletions.
50 changes: 36 additions & 14 deletions .github/workflows/ci.yml
@@ -10,6 +10,10 @@ on:

env:
NXF_ANSI_LOG: false
+ # URLs to Influenza ref data should be updated in step with nextflow.config
+ # default ncbi_influenza_fasta and ncbi_influenza_metadata params
+ FASTA_ZST_URL: https://api.figshare.com/v2/file/download/41415330
+ CSV_ZST_URL: https://api.figshare.com/v2/file/download/41415333

concurrency:
group: "${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}"
@@ -58,23 +62,32 @@ jobs:
make -j2
make install
which seqtk
- - name: Cache subsampled influenza.fna.gz
+ - name: Cache subsampled influenza.fna
uses: actions/cache@v3
id: cache-influenza-fna
with:
- path: influenza-10k.fna.gz
+ path: influenza-10k.fna.zst
key: influenza-fna
- name: Subsample NCBI influenza.fna
if: steps.cache-influenza-fna.outputs.cache-hit != 'true'
run: |
- curl --silent -SLk https://ftp.ncbi.nih.gov/genomes/INFLUENZA/influenza.fna.gz > influenza.fna.gz
- echo "Subsample 10k seqs from influenza.fna.gz with seqtk"
- seqtk sample -s 789 influenza.fna.gz 10000 | gzip -ck > influenza-10k.fna.gz
+ curl --silent -SLk ${FASTA_ZST_URL} | zstdcat | seqtk sample -s 789 - 10000 | zstd -ck > influenza-10k.fna.zst
+ - name: Cache influenza.csv
+ uses: actions/cache@v3
+ id: cache-influenza-csv
+ with:
+ path: influenza.csv.zst
+ key: influenza-csv
+ - name: Download influenza.csv
+ if: steps.cache-influenza-csv.outputs.cache-hit != 'true'
+ run: |
+ curl --silent -SLk ${CSV_ZST_URL} > influenza.csv.zst
- name: Run pipeline with test data
run: |
nextflow run ${GITHUB_WORKSPACE} \
-profile test_illumina,docker \
- --ncbi_influenza_fasta influenza-10k.fna.gz
+ --ncbi_influenza_fasta influenza-10k.fna.zst \
+ --ncbi_influenza_metadata influenza.csv.zst
- name: Upload Artifact
if: success()
uses: actions/upload-artifact@v1.0.0
@@ -155,37 +168,46 @@ jobs:
echo "ERR6359501-10k,$(realpath reads/ERR6359501-10k.fastq)" | tee -a samplesheet.csv
echo "ERR6359501,$(realpath run1)" | tee -a samplesheet.csv
echo "ERR6359501,$(realpath run2)" | tee -a samplesheet.csv
- - name: Cache subsampled influenza.fna.gz
+ - name: Cache subsampled influenza.fna
uses: actions/cache@v3
id: cache-influenza-fna
with:
- path: influenza-10k.fna.gz
+ path: influenza-10k.fna.zst
key: influenza-fna
- name: Subsample NCBI influenza.fna
if: steps.cache-influenza-fna.outputs.cache-hit != 'true'
run: |
- curl --silent -SLk https://ftp.ncbi.nih.gov/genomes/INFLUENZA/influenza.fna.gz > influenza.fna.gz
- echo "Subsample 10k seqs from influenza.fna.gz with seqtk"
- seqtk sample -s 789 influenza.fna.gz 10000 | gzip -ck > influenza-10k.fna.gz
+ curl --silent -SLk ${FASTA_ZST_URL} | zstdcat | seqtk sample -s 789 - 10000 | zstd -ck > influenza-10k.fna.zst
+ - name: Cache influenza.csv
+ uses: actions/cache@v3
+ id: cache-influenza-csv
+ with:
+ path: influenza.csv.zst
+ key: influenza-csv
+ - name: Download influenza.csv
+ if: steps.cache-influenza-csv.outputs.cache-hit != 'true'
+ run: |
+ curl --silent -SLk ${CSV_ZST_URL} > influenza.csv.zst
- name: Run pipeline with test data
run: |
nextflow run ${GITHUB_WORKSPACE} \
-profile test_nanopore,docker \
--platform nanopore \
--input samplesheet.csv \
- --ncbi_influenza_fasta influenza-10k.fna.gz
+ --ncbi_influenza_fasta influenza-10k.fna.zst \
+ --ncbi_influenza_metadata influenza.csv.zst
- name: Upload pipeline_info/
if: success()
uses: actions/upload-artifact@v1.0.0
with:
name: nanopore-test-results-pipline_info-${{ matrix.nxf_ver }}
path: results/pipeline_info
- - name: Upload iav-subtyping-report.xlsx
+ - name: Upload nf-flu-subtyping-report.xlsx
if: success()
uses: actions/upload-artifact@v1.0.0
with:
name: nanopore-test-results-subtyping-report-${{ matrix.nxf_ver }}
- path: results/iav-subtyping-report.xlsx
+ path: results/nf-flu-subtyping-report.xlsx
- name: Upload multiqc_report.html
if: success()
uses: actions/upload-artifact@v1.0.0
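
Reviewer note on the subsampling step in the workflow above: `seqtk sample -s 789 - 10000` draws a fixed-size, seeded subsample from a FASTA stream, which keeps the cached test database the same from run to run. The Python sketch below illustrates that general idea with reservoir sampling. It is an illustration only, not seqtk's implementation, and it will not select the same records as seqtk; the function names `read_fasta` and `reservoir_sample` are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    seq_parts: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq_parts)
            header, seq_parts = line[1:], []
        elif header is not None:
            seq_parts.append(line)
    if header is not None:
        yield header, "".join(seq_parts)


def reservoir_sample(records: Iterable[Record], k: int, seed: int) -> List[Record]:
    """Keep k records chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    reservoir: List[Record] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir
```

Because only `k` records are held in memory at a time, this approach works on a stream piped from a decompressor, which is how the workflow feeds `zstdcat` output to `seqtk`.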
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,14 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+ ## [[3.3.0](https://github.com/CFIA-NCFAD/nf-flu/releases/tag/3.3.0)] - 2023-07-11

+ This release migrates to more recently updated Influenza virus sequences since the last update for the [NCBI Influenza DB FTP data](https://ftp.ncbi.nih.gov/genomes/INFLUENZA/) was in 2020-10-13. By default, all Orthomyxoviridae virus sequences were parsed from the daily updated NCBI Viruses [`AllNucleotide.fa`](https://ftp.ncbi.nlm.nih.gov/genomes/Viruses/AllNucleotide/) and [`AllNuclMetadata.csv.gz`](https://ftp.ncbi.nlm.nih.gov/genomes/Viruses/AllNuclMetadata/AllNuclMetadata.csv.gz) and uploaded to [Figshare](https://figshare.com/articles/dataset/2023-06-14_-_NCBI_Viruses_-_Orthomyxoviridae/23608782) as Zstd compressed files. nf-flu no longer uses the [influenza.fna.gz](https://ftp.ncbi.nih.gov/genomes/INFLUENZA/influenza.fna.gz) and [genomeset.dat.gz](https://ftp.ncbi.nih.gov/genomes/INFLUENZA/genomeset.dat.gz) files for Influenza sequences and metadata, respectively.

+ ### Fixes

+ * More up-to-date Influenza sequences database used by default (#24)
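
Reviewer note: the changelog entry above says the Orthomyxoviridae sequences were parsed out of the NCBI Viruses FASTA and metadata files, but the commit excerpt does not show how. A minimal Python sketch of one way to do that kind of filtering follows. The column names `#Accession` and `Family`, and the assumption that FASTA record IDs match metadata accessions exactly, are guesses for illustration and should be checked against the real files; the function names are hypothetical.

```python
import csv
from typing import Iterable, Iterator, Set


def accessions_for_family(
    metadata_lines: Iterable[str],
    family: str = "Orthomyxoviridae",
    accession_col: str = "#Accession",  # assumed column name
    family_col: str = "Family",  # assumed column name
) -> Set[str]:
    """Collect accessions from CSV metadata rows whose family matches."""
    reader = csv.DictReader(metadata_lines)
    return {row[accession_col] for row in reader if row.get(family_col) == family}


def filter_fasta(fasta_lines: Iterable[str], keep: Set[str]) -> Iterator[str]:
    """Yield the lines of FASTA records whose ID (first header token) is in keep."""
    keeping = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The record ID is the text between '>' and the first whitespace.
            tokens = line[1:].split(None, 1)
            keeping = bool(tokens) and tokens[0] in keep
        if keeping:
            yield line
```

Both functions take any iterable of text lines, so they can be fed from an open file handle or from a decompression stream without loading the full dataset into memory.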

## [[3.2.1](https://github.com/CFIA-NCFAD/nf-flu/releases/tag/3.2.1)] - 2023-07-07

### Fixes
13 changes: 7 additions & 6 deletions README.md
@@ -17,13 +17,14 @@ After reference sequence selection, the pipeline performs read mapping to each r

## Pipeline summary

- 1. Download latest [NCBI Influenza DB][] sequences and metadata (or use user-specified files)
- 2. Merge reads of re-sequenced samples ([`cat`](http://www.linfo.org/cat.html)) (if needed)
+ 1. Download latest [NCBI Orthomyxoviridae sequences](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=11308&lvl=3&keep=1&srchmode=1&unlock) and metadata (parsed from [NCBI Viruses FTP data](https://ftp.ncbi.nlm.nih.gov/genomes/Viruses/AllNucleotide/)).
+ 2. Merge reads of re-sequenced samples ([`cat`](http://www.linfo.org/cat.html)) (if needed).
3. Assembly of Influenza gene segments with [IRMA][] using the built-in FLU module
- 4. Nucleotide [BLAST][] search against [NCBI Influenza DB][]
- 5. Automatically select top match references for segments
- 6. H/N subtype prediction and Excel XLSX report generation based on BLAST results
- 7. Perform Variant calling and genome assembly for all segments.
+ 4. Nucleotide [BLAST][] search against [NCBI Influenza DB][] sequences
+ 5. H/N subtype prediction and Excel XLSX report generation based on BLAST results.
+ 6. Automatically select top match reference sequences for segments
+ 7. Read mapping, variant calling and consensus sequence generation for each segment against top reference sequence based on BLAST results.
+ 8. MultiQC report generation.
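
Reviewer note on step 6 of the revised summary: the README does not say how the top match reference is chosen, so the Python sketch below shows only one plausible reading, namely keeping the highest-bitscore hit per query sequence. It assumes BLAST tabular output with the default 12 `-outfmt 6` columns, where bitscore is the last column; the pipeline's actual output format and ranking criteria may differ. `Hit`, `parse_outfmt6` and `top_hits` are hypothetical names.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    qseqid: str  # query ID, e.g. an assembled segment contig
    sseqid: str  # subject ID, i.e. the reference sequence
    pident: float  # percent identity
    bitscore: float


def parse_outfmt6(lines: Iterable[str]) -> Iterator[Hit]:
    """Parse default 12-column BLAST tabular lines, skipping blanks and comments."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        yield Hit(fields[0], fields[1], float(fields[2]), float(fields[11]))


def top_hits(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Return the highest-bitscore hit for each query ID (first one wins ties)."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.qseqid)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.qseqid] = hit
    return best
```

Ranking on bitscore alone is the simplest choice; a real selection step might also weigh alignment length or coverage of the segment.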

## Quick Start
