written by Lauren Krausfeldt & Poorani Subramanian - bioinformatics@niaid.nih.gov
This is a pipeline for exploring viruses (ssDNA, dsDNA phage, and giant DNA viruses) and viral diversity in metagenomes. It can be run in the cloud application Nephele (under Explore) or on HPC. More details here.
The pipeline accepts metagenomic assembly sequences (.fasta) and binary alignment map (.bam) files of the reads mapped back to the assemblies as input. (These files can be produced by the WGSA2 pipeline in Nephele [1].) The pipeline outputs the viral genomes found in the metagenome assembly, their taxonomy and level of completeness, viral functional genes and their abundances, and vOTU abundances and their host taxonomy.
The pipeline first searches for viral genomes using geNomad [2], which also provides the taxonomy and functional classification of each viral genome. The viral genomes are also functionally classified with DRAM-v [3] and (optionally) DIAMOND [4] using the nr database. Gene abundances per sample are produced from these outputs using VERSE [5]. From here, the user has the option to filter the resulting sequences based on completeness using CheckV [6]. Either the output of geNomad or CheckV is used to cluster viral genomes with BBTools dedupe [7] and MMseqs2 [8] to produce vOTUs [9]. Finally, abundances and host taxonomy of vOTUs are produced.
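The order of the steps described above, including the two optional branches, can be summarized in a short sketch. This only illustrates the flow as described in this README and is not the Snakefile's logic: the function name, the option names `run_diamond` and `filter_with_checkv`, and the step labels are all hypothetical.

```python
def planned_steps(run_diamond=False, filter_with_checkv=False):
    """Return the pipeline steps, in order, for the chosen options.

    Hypothetical helper for illustration; the option names and step labels
    are assumptions and do not come from the pipeline's config file.
    """
    steps = [
        "geNomad: find viral genomes, taxonomy, functional classification",
        "DRAM-v: functional annotation",
    ]
    if run_diamond:
        # Optional search against the nr database
        steps.append("DIAMOND: search against nr")
    steps.append("VERSE: gene abundances per sample")
    if filter_with_checkv:
        # Optional completeness filter; otherwise geNomad output is clustered
        steps.append("CheckV: filter by completeness")
    steps.append("BBTools dedupe + MMseqs2: cluster into vOTUs")
    steps.append("vOTU abundances and host taxonomy")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(planned_steps(filter_with_checkv=True), start=1):
        print(f"{number}. {step}")
```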
- Snakefile: pipeline script (reads in configs, commands for each pipeline step/rule)
- cluster_setup.smk: helper script for reading in cluster config file
- project_config.yaml: config file for the snakemake `--configfile` option, with details for a specific project: working/input/output directories, paths to scripts and other configs, sample names, options for specific rules in the pipeline, etc.
- locus.cluster_config.yaml: cluster configuration file for the snakemake `--cluster-config` option, specifically for the NIAID Locus HPC, which uses UGE (sets parameters for the `qsub` command for each rule's job, and which environment modules to use)
- locus_submit_vp.sh: batch job submit script for running the pipeline on Locus
- scripts: see scripts README
- docs/README_for_DiscoVir_outputs.md: explanation of outputs of the pipeline
The inputs to the pipeline are assembled contigs/scaffolds - one fasta file per sample; and bam files of reads aligned to the assemblies - one bam per sample. They should be located in (or symlinked to) a single directory, and the filenames should start with a unique per-sample name.
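Before launching a run, it can help to confirm that the input directory follows this layout. The sketch below is not part of the pipeline; `find_sample_inputs` is a hypothetical helper, and it assumes that assemblies end in `.fasta` and alignments in `.bam`.

```python
from pathlib import Path


def find_sample_inputs(input_dir, sample_names):
    """Map each sample name to its (fasta, bam) pair in input_dir.

    Hypothetical check, assuming assemblies end in .fasta and alignments
    in .bam; adjust the suffixes to match your own files.
    """
    input_dir = Path(input_dir)
    inputs = {}
    for sample in sample_names:
        # Filenames must start with the sample name
        fastas = sorted(input_dir.glob(f"{sample}*.fasta"))
        bams = sorted(input_dir.glob(f"{sample}*.bam"))
        if len(fastas) != 1 or len(bams) != 1:
            raise ValueError(
                f"{sample}: expected 1 fasta and 1 bam, "
                f"found {len(fastas)} fasta and {len(bams)} bam"
            )
        inputs[sample] = (fastas[0], bams[0])
    return inputs
```

A sample name that is a prefix of another (for example `S1` and `S10`) matches more than one file and raises an error, which is one way to catch names that are not unique per sample.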
- Clone this repo locally: `git clone https://github.com/niaid/virome-pipeline`
- Copy the project config file project_config.yaml and the submit script locus_submit_vp.sh to your project working directory, and edit both with the details for your specific project.
  - For the submit script, the main items to edit are:
    - the path to the project config file and the email address
    - the arguments for the `snakemake` command at the bottom of the script (see comments in the script)
  - For the project config, the main items to edit are:
    - the paths to the input, output, and working directories, and the email address
    - the pipeline options detailed in the comments of the config file
- Submit the job script: `qsub ./locus_submit_vp.sh`
- Success?
- This pipeline is tested on NIAID's HPC Locus. However, it would be easy to adapt to another HPC that uses environment modules by making your own cluster config file (with the correct module names and job parameters) and your own job submit script (in particular, modifying `$clustercmd` for whatever job scheduler your HPC uses).
- In the future, we will work on making it more general (perhaps using conda or a containerized workflow instead of environment modules).
- We also plan to add steps for specialized analysis and make the pipeline more flexible.
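As a rough illustration of adapting the submit command to another scheduler, the sketch below pairs a UGE `qsub` template with a SLURM `sbatch` equivalent. It assumes the submit script passes `$clustercmd` to snakemake's `--cluster` option, which this README does not confirm. The `{cluster.*}` key names, the `threaded` parallel environment, and the helper itself are hypothetical; check your own cluster config and site documentation for the real values.

```python
# Hypothetical submit-command templates, one per scheduler. With snakemake's
# --cluster and --cluster-config options, {cluster.*} placeholders are filled
# in from the cluster config file. The key names (jobname, threads, mem, time)
# and the "threaded" parallel environment are assumptions for illustration.
CLUSTER_CMD_TEMPLATES = {
    "uge": (
        "qsub -cwd -N {cluster.jobname} "
        "-pe threaded {cluster.threads} -l h_vmem={cluster.mem}"
    ),
    "slurm": (
        "sbatch --job-name={cluster.jobname} "
        "--cpus-per-task={cluster.threads} --mem={cluster.mem} "
        "--time={cluster.time}"
    ),
}


def cluster_command(scheduler):
    """Return the submit-command template for a scheduler name."""
    try:
        return CLUSTER_CMD_TEMPLATES[scheduler.lower()]
    except KeyError:
        known = ", ".join(sorted(CLUSTER_CMD_TEMPLATES))
        raise ValueError(
            f"unknown scheduler {scheduler!r}; known: {known}"
        ) from None
```

The returned string is what you would assign to `$clustercmd` in your own copy of the submit script.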
1. https://www.protocols.io/view/wgsa2-workflow-a-tutorial-n92ldm98xl5b/v1
2. Camargo, A. P., Roux, S., Schulz, F., Babinski, M., Xu, Y., Hu, B., ... & Kyrpides, N. C. (2023). Identification of mobile genetic elements with geNomad. Nature Biotechnology, 1-10. doi: 10.1038/s41587-023-01953-y.
3. Shaffer, M., Borton, M. A., McGivern, B. B., Zayed, A. A., La Rosa, S. L., Solden, L. M., ... & Wrighton, K. C. (2020). DRAM for distilling microbial metabolism to automate the curation of microbiome function. Nucleic acids research, 48(16), 8883-8900. doi: 10.1093/nar/gkaa621.
4. Buchfink, B., Reuter, K., & Drost, H. G. (2021). Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature methods, 18(4), 366-368. doi: 10.1038/s41592-021-01101-x.
5. Zhu, Q., Fisher, S. A., Shallcross, J., & Kim, J. (2016). VERSE: a versatile and efficient RNA-Seq read counting tool. bioRxiv, 053306. doi: 10.1101/053306.
6. Nayfach, S., Camargo, A. P., Schulz, F., Eloe-Fadrosh, E., Roux, S., & Kyrpides, N. C. (2021). CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nature biotechnology, 39(5), 578-585. doi: 10.1038/s41587-020-00774-7.
7. https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/dedupe-guide/
8. Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature biotechnology, 35(11), 1026-1028. doi: 10.1038/nbt.3988.
9. Roux, S., Adriaenssens, E. M., Dutilh, B. E., Koonin, E. V., Kropinski, A. M., Krupovic, M., ... & Eloe-Fadrosh, E. A. (2019). Minimum information about an uncultivated virus genome (MIUViG). Nature biotechnology, 37(1), 29-37. doi: 10.1038/nbt.4306.
10. Shumate, A., & Salzberg, S. L. (2021). Liftoff: Accurate mapping of gene annotations. Bioinformatics, 37(12), 1639–1643. doi: 10.1093/bioinformatics/btaa1016.