GAMBIT (Genomic Approximation Method for Bacterial Identification and Tracking): A methodology to rapidly leverage whole genome sequencing of bacterial isolates for clinical identification

A Snakemake workflow to (re)produce figures and data in the initial GAMBIT publication:

Lumpe J, Gumbleton L, Gorzalski A, Libuit K, Varghese V, et al. (2023) GAMBIT (Genomic Approximation Method for Bacterial Identification and Tracking): A methodology to rapidly leverage whole genome sequencing of bacterial isolates for clinical identification. PLOS ONE 18(2): e0277575. https://doi.org/10.1371/journal.pone.0277575

The source code for GAMBIT itself is maintained in a separate repository.

Please feel free to contact me at jared@jaredlumpe.com with any questions you have.

Instructions

After installing and activating the conda environment (see the Setup section below), simply run:

snakemake [TARGETS...]

from the project's root directory. TARGETS are one or more rule names or output files. By default the main rule is run, which creates figures 1-6. See the Targets section for a list of options.
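For instance, a call that mixes a rule name with an output file might look like the sketch below. The `--cores 1` option is an assumption rather than something this README specifies: recent Snakemake releases refuse to run without a core count, while older ones do not need it. The `echo` only prints the command so it can be checked before anything runs.

```shell
# Targets can be rule names, output file paths, or a mix of both.
targets="fig1 results/figures/figure-3.png"

# Preview the full command; remove "echo" to actually run it.
# $targets is left unquoted on purpose so each target becomes its own argument.
echo snakemake --cores 1 $targets
```

Adding Snakemake's `-n` option to the real command performs a dry run, which lists the jobs that would be executed without running them.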

Directory structure

  • workflow/: Snakemake workflow files and related scripts.
  • config/: Workflow configuration files.
  • resources/: Input data.
    • genomes/: Sets of bacterial genome assemblies used for analysis.
    • gambit-db/: GAMBIT database files.
  • intermediate-data/: Output of intermediate workflow targets.
  • results/: Processed result data.
  • gambit_pub/: Python package containing common code for this repo.
  • env/: The conda environment can be installed here.

Setup

This workflow has been built and tested on Linux only. It may work on macOS (untested), but I believe there are issues that prevent it from running on Windows.

Required software

All software dependencies are installed using the conda package manager. If you do not already have it installed, I recommend the Miniconda installer. Make sure the conda command is available in your shell.

Conda environment

Install the conda environment into the env/ subdirectory with:

conda env create -f env.yaml -p env

Before running the workflow you must activate the environment by running conda activate ./env from the project's root directory. This must be done with each new shell session.
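Because activation is needed in every new shell, a quick check before starting a long run can save a failed job. The sketch below is not part of the repository; it relies on the `CONDA_PREFIX` variable that conda sets on activation, and it assumes the shell's working directory is the project root and that the path contains no symlinks that would make the two strings differ. The variable names `expected_env` and `env_status` are illustrative.

```shell
# Run from the project's root directory.
expected_env="$(pwd)/env"

# conda sets CONDA_PREFIX to the path of the currently active environment.
if [ "${CONDA_PREFIX:-}" = "$expected_env" ]; then
    env_status="active"
else
    env_status="inactive"
    echo "Environment not active; run: conda activate ./env"
fi
```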

Install GAMBIT

The preferred way to install GAMBIT is through the Bioconda channel:

conda install -c bioconda gambit=1.0

Make sure your Conda environment is activated first.

Configuration

Most editable config settings are in config/config.yaml.

Download source data

Large files in resources/ are not present in version control and need to be downloaded separately. You can do this all up front by running the fetch_src_data target, which may make things easier to debug if you run into any connection problems. Otherwise the individual data sets will be downloaded as needed when running the workflow.

Targets

This is a list of all "endpoint" rules and output files that you may want to request as targets. It does not include rules that generate intermediate data.

Aggregate rules

| Rule | Description |
| --- | --- |
| all | main and supplemental. |
| main | Generate all primary figures (default). |
| supplemental | Generate all supplemental figures. Note - supplemental figure 1 is VERY slow. |
| fetch_src_data | Download all source data. Not necessary to invoke manually. |

Main results

| Rule | Output | Description |
| --- | --- | --- |
| fig1 | results/figures/figure-1.{png,csv} | Generate figure 1. |
| fig2 | results/figures/figure-2{a,b}.png | Generate figure 2. |
| fig3 | results/figures/figure-3.png | Generate figure 3. |
| fig4 | results/figures/figure-4{a,b}.png | Generate figure 4. |
| fig5 | results/figures/figure-5{a,b}.png | Generate figure 5. |
| fig6 | results/figures/figure-6.png | Generate figure 6. |

Supplemental results

| Rule | Output | Description |
| --- | --- | --- |
| sfig1 | results/figures/supplemental-figure-1.png | Generate supplemental figure 1. Note - VERY slow. |
| sfig2 | results/figures/supplemental-figure-2.png | Generate supplemental figure 2. |
| stable3 | results/tables/supplemental-table-3.png | Generate supplemental table 3. |
| stable4 | results/tables/supplemental-table-4.png | Generate supplemental table 4. |

Benchmarks

| Rule | Output | Description |
| --- | --- | --- |
| benchmark_query | results/benchmarks/gambit-query/ | Benchmark GAMBIT taxonomic classification from CLI. |

Source data

| Rule | Output | Description |
| --- | --- | --- |
| fetch_gambit_db | resources/gambit-db/ | Download GAMBIT reference database files. |
| | resources/genomes/set{1,2}/fasta/ | Download FASTA files for data set 1 or 2 from NCBI. Invoke by output directory. |
| | resources/genomes/set{3,4}/fasta/ | Download FASTA files for data set 3 or 4. Invoke by output directory. |
| fetch_genome_set_5 | resources/genomes/set5/fasta/ | Download FASTA files for data set 5. |
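The genome sets that have no rule name are requested by passing the output directory itself as the target. The `{1,2}` notation in the table is shorthand for two separate paths; the sketch below spells them out with a loop so it works in any POSIX shell. As before, `--cores 1` is an assumption about the installed Snakemake version, and the `echo` only previews the command.

```shell
# Expand the shorthand set{1,2} into one directory target per data set.
targets=""
for n in 1 2; do
    targets="$targets resources/genomes/set${n}/fasta/"
done

# Preview the command; remove "echo" to start the downloads.
echo snakemake --cores 1 $targets
```

In bash, writing `resources/genomes/set{1,2}/fasta/` directly on the command line has the same effect, since the shell expands the braces before Snakemake sees the arguments.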

Development

You can enable "test mode" by adding --config test=1 to the command line options. This loads an alternate set of parameters which greatly reduces the amount of work to be done.
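A test-mode run of a single figure might look like the sketch below. The target is placed before `--config` because that option accepts several KEY=VALUE pairs, so a target written after it can be mistaken for another config entry. `--cores 1` is an assumption about the installed Snakemake version, and the `echo` only prints the command for inspection.

```shell
# Any rule name or output file from the Targets section can be used here.
target="fig1"

# Preview the command; remove "echo" to run the workflow in test mode.
echo snakemake --cores 1 "$target" --config test=1
```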