I downloaded new SRA accessions using edirect/eutilities. This was the command I used to filter for salmonella full genome sequences in the SRA archive. The list is saved as SraAccList.txt
in the project directory.
esearch -db sra -query "txid28901[Organism:exp] AND (cluster_public[prop] AND 'biomol dna'[Properties] AND 'library layout paired'[Properties] AND 'platform illumina'[Properties] AND 'strategy wgs'[Properties] OR 'strategy wga'[Properties] OR 'strategy wcs'[Properties] OR 'strategy clone'[Properties] OR 'strategy finishing'[Properties] OR 'strategy validation'[Properties])" | efetch -format runinfo -mode xml | xtract -pattern Row -element Run > SraAccList.txt
In 2023, Lei's efetch command downloaded 543713 SRA sequences and took a few hours to finish processing. On 2024-09-09, Annette re-ran esearch and efetch and downloaded 685477 SRA sequences.
Using the BigQuery browser interface, I uploaded the accession list SraAccList.txt and obtained the metadata using the SQL query below. The resulting table was too large to download at once, so I searched for SRR (NCBI), ERR (European Bioinformatics Institute), and DRR (DNA Data Bank of Japan) accessions seperately and downloaded as .json files in Google Drive.
SELECT * FROM `nih-sra-datastore.sra.metadata`
where consent = "public" and acc like "SRR%" and acc in (SELECT string_field_0 FROM `salmonella-sra-2024.SRA_all.sra_670k`);
Prefix | Accessions | Metadata entries |
---|---|---|
SRR | 606195 | 605491 |
ERR | 78004 | 78004 |
DRR | 1278 | 1270 |
Total | 685477 | 684765 |
Based on the file main.nf. Note that Singularity
has been transitioned to Apptainer
.
Processes in workflow:
-
sra_n_ch
-
sra_ch
-
fetch_SRA
: Uses fastq-dump to download the raw fastq files at 2,000,000 read depth. Fromghcr.io/usda-ars-gbru/salmonella-varcal/sra-tools-bash:latest
. -
megahit
: assembles raw reads into acontigs.fa
file. -
sistr
: Uses the SISTR command line tool to perform in silico typing on the assembled genome. Generates a one line.tab
tab delimited file with SISTR's output -
bwa_indexing
: Burrow-Wheeler Aligner for pairwise alignment between DNA sequences. Indexes the reference genomeGCF_000006945.2.fasta.gz
. Frombiocontainers/bwa:v0.7.17_cv1
. -
gatk_createdict
: Create dictionary for reference FASTA file. Frombroadinstitute/picard:2.27.4
. -
gatk_vcf_indexer
: Frombroadinstitute/picard:2.27.4
. -
map_reads
: bwa mem to map forward and reverse reads to the reference genome, generating a.sam
file. Frombiocontainers/bwa:v0.7.17_cv1
. -
sort_and_index
: Uses samtools to sort and index the sam files into.bam
format.Frombiocontainers/samtools:v1.9-4-deb_cv1
. -
mpileup
: Uses bcftools mpileup to call variants based on the reference genome. Generates tabix indexed and bgzipped.mpileup.vcf.gz
file. -
add_read_groups
: Frombroadinstitute/picard:2.27.4
. -
vg_pack
: Uses vg toolkit's giraffe, filter, and pack functions to generate apack.edge.table.gz
file which describes paths through each node of the graph alignment. -
gvg_call
: Uses inhouse caller gfa_var_genotyper.pl to call variants based on the pack edge table and the reference graph. Generates a gzipped (note: NOT bgzipped)vg.vcf.gz
file in the output directory -
gatk_haplotypecaller
: Frombroadinstitute/gatk:4.2.6.1
.
- cleanup (sra_done_signal and mpileup_done_signal)
- For every SRA accession, once it's passed through both variant calling pipelines, replaces the raw fastqs with decoy files using tricking_nextflow_cache
- For every SRA accession, once it's passed through the mpileup variant calling pipeline, replaces all intermediate files from map_reads, sort_and_index, and mpileup with decoy files using same tools as above
- Files not tracked by nextflow are deleted with
rm
in the bash shell commands of each process
These output files can be found in the output.tar archive on Juno. There should be 1 of each of the files for each salmonella SRA accession.
- mpileup VCF: variants called using mpileup. Accompanied by a .tbi index file in the same folder
- gvg VCF: variants called using gvg
- sistr: output of sistr in silico serotyping
- pack: pack edge tables for the gvg VCFs
Formerly known as Nextflow Tower, Seqera helps you manage and monitor your workflow. To create a token and use it with your Nextflow pipeline, follow their tutorial.
Annette M. Hynes and Lei Ma