
Single-nucleotide-polymorphism prediction

This script maps metagenomic reads to bacterial reference genomes and quantifies nucleotide variation along the entire genome (i.e. the read depth and observed alleles at each position). After running it for multiple samples, you can perform pooled-sample, core-genome SNP calling (see Merge results across samples below). You can either target one or more specific species (--species_id) or provide the script with a species abundance file (see the database options below).

The pipeline can be broken down into the following steps:

  • build a database of genome sequences for abundant bacterial species (1 representative genome/species)
  • map high-quality reads to the database
  • generate pileups and count alleles at each genomic site
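For processing many samples, the per-sample invocations are easy to script. Below is a minimal sketch; the sample table and file paths are hypothetical, and only the run_midas.py snps arguments come from the usage section below:

```python
import subprocess
from pathlib import Path

# Hypothetical sample table: sample_id -> (forward reads, reverse reads)
samples = {
    "sample1": ("reads/sample1_1.fq.gz", "reads/sample1_2.fq.gz"),
    "sample2": ("reads/sample2_1.fq.gz", "reads/sample2_2.fq.gz"),
}

for sample_id, (m1, m2) in samples.items():
    # The output directory name doubles as the sample identifier
    outdir = Path("midas_out") / sample_id
    subprocess.run(
        ["run_midas.py", "snps", str(outdir), "-1", m1, "-2", m2, "-t", "4"],
        check=True,
    )
```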

Usage

Usage: run_midas.py snps <outdir> [options]

positional arguments:
  outdir                Path to directory to store results.
                        Directory name should correspond to sample identifier

optional arguments:
  -h, --help            show this help message and exit
  --remove_temp         Remove intermediate files generated by MIDAS (False).
                        Useful to reduce disk space of MIDAS output

Pipeline options (choose one or more; default=all):
  --build_db            Build bowtie2 database of representative genomes
  --align               Align reads to genome database
  --pileup              Run samtools mpileup and count the 4 alleles at each genomic site

Database options (if using --build_db):
  -d DB                 Path to reference database
                        By default, the MIDAS_DB environmental variable is used
  --species_cov FLOAT   Include species with >X coverage (3.0)
  --species_topn INT    Include top N most abundant species
  --species_id CHAR     Include specified species. Separate ids with a comma

Read alignment options (if using --align):
  -1 M1                 FASTA/FASTQ file containing 1st mate if using paired-end reads.
                        Otherwise FASTA/FASTQ containing unpaired reads.
                        Can be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2)
  -2 M2                 FASTA/FASTQ file containing 2nd mate if using paired-end reads.
                        Can be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2)
  --interleaved         Reads in -1 are paired and contain both forward AND reverse reads
  -s {very-fast,fast,sensitive,very-sensitive}
                        Bowtie2 alignment speed/sensitivity (very-sensitive)
  -m {local,global}     Global/local read alignment (global)
  -n MAX_READS          # reads to use from input file(s) (use all)
  -t THREADS            Number of threads to use (1)

Pileup options (if using --pileup):
  --mapid FLOAT         Discard reads with alignment identity < MAPID (94.0)
  --mapq INT            Discard reads with mapping quality < MAPQ (20)
  --baseq INT           Discard bases with quality < BASEQ (30)
  --readq INT           Discard reads with mean quality < READQ (20)
  --aln_cov FLOAT       Discard reads with alignment coverage < ALN_COV (0.75)
  --trim INT            Trim N base-pairs from 3'/right end of read (0)
  --discard             Discard discordant read-pairs (False)
  --baq                 Enable BAQ: per-base alignment quality (False)
  --adjust_mq           Adjust MAPQ (False)
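The pileup filters above operate per read (--mapid, --mapq, --readq, --aln_cov) or per base (--baseq). The sketch below illustrates how the read-level thresholds are typically applied to an alignment using pysam; it is an illustration under stated assumptions, not MIDAS's exact implementation, and estimating identity from the NM tag and the BAM path are both assumptions:

```python
import pysam

def passes_read_filters(aln, min_mapid=94.0, min_mapq=20,
                        min_readq=20, min_aln_cov=0.75):
    """Apply read-level filters analogous to --mapid/--mapq/--readq/--aln_cov."""
    if aln.is_unmapped or aln.query_qualities is None:
        return False
    # --mapq: discard reads with low mapping quality
    if aln.mapping_quality < min_mapq:
        return False
    # --readq: discard reads with low mean base quality
    if sum(aln.query_qualities) / len(aln.query_qualities) < min_readq:
        return False
    # --aln_cov: discard reads where too little of the read is aligned
    if aln.query_alignment_length / aln.query_length < min_aln_cov:
        return False
    # --mapid: percent identity, estimated here from the edit distance (NM tag)
    edit_dist = aln.get_tag("NM")
    mapid = 100.0 * (aln.query_alignment_length - edit_dist) / aln.query_alignment_length
    return mapid >= min_mapid

# Usage: count passing reads in the BAM produced by the --align step
bam = pysam.AlignmentFile("temp/genomes.bam", "rb")  # hypothetical path
kept = sum(1 for aln in bam if passes_read_filters(aln))
```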

Examples

  1. run entire pipeline using defaults:
    run_midas.py snps /path/to/outdir -1 /path/to/reads_1.fq.gz -2 /path/to/reads_2.fq.gz

  2. run entire pipeline for a specific species:
    run_midas.py snps /path/to/outdir --species_id Bacteroides_vulgatus_57955 -1 /path/to/reads_1.fq.gz -2 /path/to/reads_2.fq.gz

  3. just align reads, use faster alignment, only use the first 10M reads, use 4 CPUs:
    run_midas.py snps /path/to/outdir --align -1 /path/to/reads_1.fq.gz -2 /path/to/reads_2.fq.gz -s very-fast -n 10000000 -t 4

  4. just count variants, keep reads with >=95% alignment identity and keep bases with quality-scores >=35:
    run_midas.py snps /path/to/outdir --pileup --mapid 95 --baseq 35

Output files

  • output: directory of per-species output files; files are tab-delimited, gzip-compressed, with header
  • species.txt: list of species_ids included in local database
  • summary.txt: tab-delimited with header; summarizes alignment results per-species
  • log.txt: log file containing parameters used
  • temp: directory of intermediate files; run with --remove_temp to remove these files

Output formats

output/<species_id>.snps.gz:

  • ref_id: id of reference scaffold/contig/genome
  • ref_pos: position in reference scaffold (1-indexed)
  • ref_allele: reference nucleotide
  • depth: number of mapped reads at this site
  • count_a: count of A allele
  • count_c: count of C allele
  • count_g: count of G allele
  • count_t: count of T allele
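Because each per-species file is tab-delimited, gzip-compressed, and has a header, it can be streamed directly. A minimal sketch that computes the reference-allele frequency at each covered site; the file path and the 0.95 threshold are illustrative placeholders, while the column names are those listed above:

```python
import csv
import gzip

def ref_allele_freqs(path):
    """Yield (ref_id, ref_pos, reference allele frequency) for covered sites."""
    with gzip.open(path, "rt") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            counts = {b: int(row["count_" + b]) for b in "acgt"}
            total = sum(counts.values())
            if total == 0:
                continue  # no allele observations at this site
            ref = row["ref_allele"].lower()
            freq = counts.get(ref, 0) / total
            yield row["ref_id"], int(row["ref_pos"]), freq

# Usage: print candidate variant sites (illustrative threshold)
for ref_id, pos, freq in ref_allele_freqs("output/Bacteroides_vulgatus_57955.snps.gz"):
    if freq < 0.95:
        print(ref_id, pos, freq)
```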

summary.txt:

  • species_id: species id
  • genome_length: number of base pairs in representative genome
  • covered_bases: number of reference sites with at least 1 mapped read
  • fraction_covered: proportion of reference sites with at least 1 mapped read
  • mean_coverage: average read-depth across reference sites with at least 1 mapped read
  • aligned_reads: number of aligned reads BEFORE quality filtering
  • mapped_reads: number of aligned reads AFTER quality filtering
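summary.txt is convenient for deciding which species have enough coverage to be worth analyzing downstream. A minimal sketch, using the columns listed above; the thresholds are illustrative, not MIDAS defaults:

```python
import csv

def well_covered_species(path, min_frac=0.40, min_depth=10.0):
    """Return species_ids from summary.txt passing coverage filters."""
    keep = []
    with open(path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if (float(row["fraction_covered"]) >= min_frac
                    and float(row["mean_coverage"]) >= min_depth):
                keep.append(row["species_id"])
    return keep

print(well_covered_species("summary.txt"))
```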

Memory usage

  • Memory usage will depend on the number of species you search and the number of reference genomes sequenced per species.
  • In practice, peak memory usage should not exceed 3.5 GB for most samples

Speed

  • Speed will depend on the number of species you search and the number of sequenced reference genomes per species.
  • For a single species with 1 reference genome, expect ~16,000 reads/second; at that rate, a library of 10 million reads takes roughly 10 minutes
  • Use -n (cap the number of reads used) and -t (run on multiple threads) to increase throughput

Next step

Merge results across samples