Strain tracking

These scripts will allow you to identify rare SNPs that discriminate individual strains and to track these SNPs between hosts to elucidate transmission patterns.

Before running these scripts, you'll need to have run:
merge_midas.py snps (see that command's documentation for details).

Step 1: identify rare SNPs that discriminate individual strains of a particular species

  • Scan across the entire genome of a particular species
  • At each genomic site, compute the presence-absence of the four nucleotides across metagenomic samples from unrelated individuals
  • Identify SNPs (a particular nucleotide at a genomic site) that rarely occur in different unrelated samples
  • Because these SNPs are rarely found in different individuals, they serve as good markers of host-specific strains
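The steps above can be sketched in Python as follows. This is a minimal illustration of the idea, not the code in strain_tracking.py: the function name find_marker_alleles, the layout of the presence dictionary, and the toy data are all assumptions made for this example.

```python
from collections import Counter

def find_marker_alleles(presence, max_occurrences=1):
    """Return (site, nucleotide) pairs seen in at most `max_occurrences` samples.

    `presence` maps sample name -> set of (site, nucleotide) pairs called
    present in that sample (hypothetical layout, for illustration only).
    """
    counts = Counter()
    for alleles in presence.values():
        counts.update(alleles)  # a set, so each sample counts once per allele
    return sorted(a for a, n in counts.items() if 1 <= n <= max_occurrences)

# Toy data: three unrelated samples, two genomic sites
presence = {
    "sample1": {(1, "A"), (2, "C")},
    "sample2": {(1, "A"), (2, "G")},
    "sample3": {(1, "T"), (2, "C")},
}
markers = find_marker_alleles(presence)
print(markers)  # [(1, 'T'), (2, 'G')]
```

Here (1, "A") and (2, "C") each occur in two samples and are discarded, while (1, "T") and (2, "G") each occur in a single sample and are kept as candidate markers.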

Command usage:

strain_tracking.py id_markers --indir <PATH> --out <PATH> [options]

Options:

--samples STR
Comma-separated list of samples to use for training
Useful for specifying the subset of samples from unrelated subjects in SNP matrix
By default, all samples are used

--min_freq FLOAT
Minimum allele frequency (proportion of reads) per site for SNP calling (0.10)

--min_reads INT
Minimum number of reads supporting allele per site for SNP calling (3)

--allele_freq INT
Maximum occurrences of an allele across samples (1)
Setting this to 1 (default) will pick alleles found in exactly 1 sample

--max_sites INT
Maximum number of genomic sites to process (use all)
Useful for quick tests
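To make the --min_freq and --min_reads thresholds concrete, here is a hedged sketch of how an allele might be called present at one site in one sample. It assumes that both thresholds must be met and that both are inclusive; the function name call_alleles and the read_counts layout are hypothetical and are not taken from strain_tracking.py.

```python
def call_alleles(read_counts, min_freq=0.10, min_reads=3):
    """Return the nucleotides considered present at one site in one sample.

    `read_counts` maps nucleotide -> number of reads supporting it
    (hypothetical input layout). An allele is kept only if it passes
    both thresholds, which is an assumption of this sketch.
    """
    total = sum(read_counts.values())
    if total == 0:
        return set()  # no coverage at this site
    return {
        nt for nt, n in read_counts.items()
        if n >= min_reads and n / total >= min_freq
    }

counts = {"A": 18, "C": 2, "G": 0, "T": 0}
default_call = call_alleles(counts)                             # {'A'}
relaxed_call = call_alleles(counts, min_freq=0.05, min_reads=2)  # {'A', 'C'}
print(default_call, relaxed_call)
```

With the default values, the allele C (2 of 20 reads) meets the frequency threshold but not the read threshold, so only A is called. Relaxing both thresholds admits C as well.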

Examples:

  1. Use a subset of samples in the SNP matrix for training:
    strain_tracking.py id_markers --indir merged_snps/species_id --out species.markers --samples sample1,sample2,sample3

  2. Run a quick test:
    strain_tracking.py id_markers --indir merged_snps/species_id --out species.markers --max_sites 10000

  3. Use strict criteria for picking marker alleles:
    strain_tracking.py id_markers --indir indir --out outfile --min_freq 0.90 --min_reads 5 --allele_freq 1

Step 2: track rare SNPs between samples and determine transmission

  • Compute the presence of marker SNPs (identified in Step 1) across all metagenomic samples, including from related individuals
  • Quantify the number and fraction of marker SNPs that are shared between all pairs of metagenomic samples
  • Based on a SNP sharing cutoff (e.g. 5%), determine if a strain is shared or not
  • Because these SNPs are rarely found in unrelated individuals (Step 1), their presence in multiple samples is strong evidence of strain sharing/transmission
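The pairwise comparison described above can be sketched as follows. This is an illustration under stated assumptions rather than the script's actual logic: the function name marker_sharing, the sample names, and the choice to compute the fraction relative to the smaller of the two marker sets are all hypothetical.

```python
from itertools import combinations

def marker_sharing(sample_markers, cutoff=0.05):
    """Yield (sample_a, sample_b, n_shared, fraction, is_shared) for each pair.

    `sample_markers` maps sample name -> set of marker SNPs detected in it.
    The fraction is taken relative to the smaller marker set, which is an
    assumption of this sketch.
    """
    for a, b in combinations(sorted(sample_markers), 2):
        shared = len(sample_markers[a] & sample_markers[b])
        denom = min(len(sample_markers[a]), len(sample_markers[b]))
        fraction = shared / denom if denom else 0.0
        yield a, b, shared, fraction, fraction >= cutoff

# Toy data: marker SNPs detected in three samples (hypothetical names)
sample_markers = {
    "mother": {"m1", "m2", "m3", "m4"},
    "infant": {"m1", "m2", "m9"},
    "unrelated": {"m5", "m6", "m7", "m8"},
}
results = list(marker_sharing(sample_markers))
for a, b, n, frac, is_shared in results:
    print(f"{a}\t{b}\t{n}\t{frac:.2f}\t{is_shared}")
```

In this toy data, only the infant and mother samples have markers in common (2 of the infant's 3), so only that pair exceeds the 5% cutoff.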

Command usage:

strain_tracking.py track_markers --indir /path/to/snps/species_id --out species_id.marker_sharing --markers species_id.markers [options]

Options:

--min_freq FLOAT
Minimum allele frequency (proportion of reads) per site for SNP calling (0.10)

--min_reads INT
Minimum number of reads supporting allele per site for SNP calling (3)

--max_sites INT
Maximum number of genomic sites to process (use all)
Useful for quick tests

--max_samples INT
Maximum number of samples to process (use all)
Useful for quick tests