Merge pull request #25 from peterk87/feature/add-new-ref-seqs

Update Influenza ref seqs DB to use all Orthomyxoviridae viruses from NCBI FTP site
peterk87 · Jul 13, 2023 · f155d43
2 parents 080bd84 + 2ed8bf5
commit f155d43

Showing 18 changed files with 618 additions and 509 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -10,6 +10,10 @@ on:
 
 env:
   NXF_ANSI_LOG: false
+  # URLs to Influenza ref data should be updated in step with nextflow.config
+  # default ncbi_influenza_fasta and ncbi_influenza_metadata params
+  FASTA_ZST_URL: https://api.figshare.com/v2/file/download/41415330
+  CSV_ZST_URL: https://api.figshare.com/v2/file/download/41415333
 
 concurrency:
   group: "${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}"
@@ -58,23 +62,32 @@ jobs:
           make -j2
           make install
           which seqtk
-      - name: Cache subsampled influenza.fna.gz
+      - name: Cache subsampled influenza.fna
         uses: actions/cache@v3
         id: cache-influenza-fna
         with:
-          path: influenza-10k.fna.gz
+          path: influenza-10k.fna.zst
           key: influenza-fna
       - name: Subsample NCBI influenza.fna
         if: steps.cache-influenza-fna.outputs.cache-hit != 'true'
         run: |
-          curl --silent -SLk https://ftp.ncbi.nih.gov/genomes/INFLUENZA/influenza.fna.gz > influenza.fna.gz
-          echo "Subsample 10k seqs from influenza.fna.gz with seqtk"
-          seqtk sample -s 789 influenza.fna.gz 10000 | gzip -ck > influenza-10k.fna.gz
+          curl --silent -SLk ${FASTA_ZST_URL} | zstdcat | seqtk sample -s 789 - 10000 | zstd -ck > influenza-10k.fna.zst
+      - name: Cache influenza.csv
+        uses: actions/cache@v3
+        id: cache-influenza-csv
+        with:
+          path: influenza.csv.zst
+          key: influenza-csv
+      - name: Download influenza.csv
+        if: steps.cache-influenza-csv.outputs.cache-hit != 'true'
+        run: |
+          curl --silent -SLk ${CSV_ZST_URL} > influenza.csv.zst
       - name: Run pipeline with test data
         run: |
           nextflow run ${GITHUB_WORKSPACE} \
             -profile test_illumina,docker \
-            --ncbi_influenza_fasta influenza-10k.fna.gz
+            --ncbi_influenza_fasta influenza-10k.fna.zst \
+            --ncbi_influenza_metadata influenza.csv.zst
       - name: Upload Artifact
         if: success()
         uses: actions/upload-artifact@v1.0.0
@@ -155,37 +168,46 @@ jobs:
           echo "ERR6359501-10k,$(realpath reads/ERR6359501-10k.fastq)" | tee -a samplesheet.csv
           echo "ERR6359501,$(realpath run1)" | tee -a samplesheet.csv
           echo "ERR6359501,$(realpath run2)" | tee -a samplesheet.csv
-      - name: Cache subsampled influenza.fna.gz
+      - name: Cache subsampled influenza.fna
         uses: actions/cache@v3
         id: cache-influenza-fna
         with:
-          path: influenza-10k.fna.gz
+          path: influenza-10k.fna.zst
           key: influenza-fna
       - name: Subsample NCBI influenza.fna
         if: steps.cache-influenza-fna.outputs.cache-hit != 'true'
         run: |
-          curl --silent -SLk https://ftp.ncbi.nih.gov/genomes/INFLUENZA/influenza.fna.gz > influenza.fna.gz
-          echo "Subsample 10k seqs from influenza.fna.gz with seqtk"
-          seqtk sample -s 789 influenza.fna.gz 10000 | gzip -ck > influenza-10k.fna.gz
+          curl --silent -SLk ${FASTA_ZST_URL} | zstdcat | seqtk sample -s 789 - 10000 | zstd -ck > influenza-10k.fna.zst
+      - name: Cache influenza.csv
+        uses: actions/cache@v3
+        id: cache-influenza-csv
+        with:
+          path: influenza.csv.zst
+          key: influenza-csv
+      - name: Download influenza.csv
+        if: steps.cache-influenza-csv.outputs.cache-hit != 'true'
+        run: |
+          curl --silent -SLk ${CSV_ZST_URL} > influenza.csv.zst
       - name: Run pipeline with test data
         run: |
           nextflow run ${GITHUB_WORKSPACE} \
             -profile test_nanopore,docker \
             --platform nanopore \
             --input samplesheet.csv \
-            --ncbi_influenza_fasta influenza-10k.fna.gz
+            --ncbi_influenza_fasta influenza-10k.fna.zst \
+            --ncbi_influenza_metadata influenza.csv.zst
       - name: Upload pipeline_info/
         if: success()
         uses: actions/upload-artifact@v1.0.0
         with:
           name: nanopore-test-results-pipline_info-${{ matrix.nxf_ver }}
           path: results/pipeline_info
-      - name: Upload iav-subtyping-report.xlsx
+      - name: Upload nf-flu-subtyping-report.xlsx
         if: success()
         uses: actions/upload-artifact@v1.0.0
         with:
           name: nanopore-test-results-subtyping-report-${{ matrix.nxf_ver }}
-          path: results/iav-subtyping-report.xlsx
+          path: results/nf-flu-subtyping-report.xlsx
       - name: Upload multiqc_report.html
         if: success()
         uses: actions/upload-artifact@v1.0.0
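
Review note on the `Subsample NCBI influenza.fna` step: the FASTA now reaches `seqtk sample` on stdin (`-`) instead of from a file, so the subsampling has to work in a single pass over the stream. The Python sketch below illustrates one-pass reservoir sampling of FASTA records with a fixed seed. It shows the general technique only and is not seqtk's implementation. `iter_fasta` and `reservoir_sample` are hypothetical helper names, and the records it selects will not match those chosen by `seqtk sample -s 789`.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', concatenated sequence)


def iter_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) records from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records: Iterable[Record], k: int, seed: int) -> List[Record]:
    """Keep a uniform random sample of at most k records in one pass."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    reservoir: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(rec)
        else:
            # Replace a slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec
    return reservoir
```

Under these assumptions, `reservoir_sample(iter_fasta(sys.stdin), k=10000, seed=789)` would play the role of the sampling stage between `zstdcat` and `zstd -ck`. Only the `k` retained records are held in memory at any time. Because the cache key `influenza-fna` is static, the fixed seed matters mainly when the cache is rebuilt.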

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,14 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [[3.3.0](https://github.com/CFIA-NCFAD/nf-flu/releases/tag/3.3.0)] - 2023-07-11
+
+This release migrates to more recently updated Influenza virus sequences since the last update for the [NCBI Influenza DB FTP data](https://ftp.ncbi.nih.gov/genomes/INFLUENZA/) was in 2020-10-13. By default, all Orthomyxoviridae virus sequences were parsed from the daily updated NCBI Viruses [`AllNucleotide.fa`](https://ftp.ncbi.nlm.nih.gov/genomes/Viruses/AllNucleotide/) and [`AllNuclMetadata.csv.gz`](https://ftp.ncbi.nlm.nih.gov/genomes/Viruses/AllNuclMetadata/AllNuclMetadata.csv.gz) and uploaded to [Figshare](https://figshare.com/articles/dataset/2023-06-14_-_NCBI_Viruses_-_Orthomyxoviridae/23608782) as Zstd compressed files. nf-flu no longer uses the [influenza.fna.gz](https://ftp.ncbi.nih.gov/genomes/INFLUENZA/influenza.fna.gz) and [genomeset.dat.gz](https://ftp.ncbi.nih.gov/genomes/INFLUENZA/genomeset.dat.gz) files for Influenza sequences and metadata, respectively.
+
+### Fixes
+
+* More up-to-date Influenza sequences database used by default (#24)
+
 ## [[3.2.1](https://github.com/CFIA-NCFAD/nf-flu/releases/tag/3.2.1)] - 2023-07-07
 
 ### Fixes
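
Review note on the 3.3.0 entry: the changelog says the Orthomyxoviridae sequences were parsed from `AllNucleotide.fa` and `AllNuclMetadata.csv.gz`, but the script that produced the Figshare files is not part of this diff. The Python sketch below shows one way such a subset could be derived. It makes several assumptions: the metadata has `#Accession` and `Family` columns, the first whitespace-delimited token of each FASTA header is the accession, and accessions match exactly between the two files. `family_accessions` and `filter_fasta` are hypothetical names.

```python
import csv
import io
from typing import Iterable, Iterator, Set, TextIO


def family_accessions(
    metadata: TextIO,
    family: str = "Orthomyxoviridae",
    acc_col: str = "#Accession",  # assumed column name
    family_col: str = "Family",   # assumed column name
) -> Set[str]:
    """Collect accessions of metadata rows belonging to one virus family."""
    reader = csv.DictReader(metadata)
    return {row[acc_col] for row in reader if row.get(family_col) == family}


def filter_fasta(lines: Iterable[str], keep: Set[str]) -> Iterator[str]:
    """Yield only the FASTA lines of records whose accession is in `keep`."""
    writing = False
    for line in lines:
        if line.startswith(">"):
            # Assumed header layout: '>ACCESSION free text...'
            tokens = line[1:].split()
            writing = bool(tokens) and tokens[0] in keep
        if writing:
            yield line


# Tiny in-memory demonstration with made-up accessions.
demo_meta = io.StringIO(
    "#Accession,Family\nX1,Orthomyxoviridae\nX2,Coronaviridae\n"
)
demo_keep = family_accessions(demo_meta)
demo_fasta = [">X1 segment 4\n", "ACGT\n", ">X2 other\n", "TTTT\n"]
print("".join(filter_fasta(demo_fasta, demo_keep)), end="")
```

Both functions stream their input, so the same approach would work on the full files read through a decompressor. If the FASTA headers carry versioned accessions and the metadata does not (or the reverse), the matching step would need to normalise them first.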

diff --git a/README.md b/README.md
@@ -17,13 +17,14 @@ After reference sequence selection, the pipeline performs read mapping to each r
 
 ## Pipeline summary
 
-1. Download latest [NCBI Influenza DB][] sequences and metadata (or use user-specified files)
-2. Merge reads of re-sequenced samples ([`cat`](http://www.linfo.org/cat.html)) (if needed)
+1. Download latest [NCBI Orthomyxoviridae sequences](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=11308&lvl=3&keep=1&srchmode=1&unlock) and metadata (parsed from [NCBI Viruses FTP data](https://ftp.ncbi.nlm.nih.gov/genomes/Viruses/AllNucleotide/)).
+2. Merge reads of re-sequenced samples ([`cat`](http://www.linfo.org/cat.html)) (if needed).
 3. Assembly of Influenza gene segments with [IRMA][] using the built-in FLU module
-4. Nucleotide [BLAST][] search against [NCBI Influenza DB][]
-5. Automatically select top match references for segments
-6. H/N subtype prediction and Excel XLSX report generation based on BLAST results
-7. Perform Variant calling and genome assembly for all segments.
+4. Nucleotide [BLAST][] search against [NCBI Influenza DB][] sequences
+5. H/N subtype prediction and Excel XLSX report generation based on BLAST results.
+6. Automatically select top match reference sequences for segments
+7. Read mapping, variant calling and consensus sequence generation for each segment against top reference sequence based on BLAST results.
+8. MultiQC report generation.
 
 ## Quick Start