forked from ekg/mutatrix

genome simulation across a population with zeta-distributed allele frequency, snps, insertions, deletions, and multi-nucleotide polymorphisms

==== MUTATRIX ====

mutatrix is a population genome simulator: it generates mutated copies of a
reference genome for each individual in a simulated population.

It reads a reference FASTA file and outputs a VCF description of the variants
on stdout, and writes each simulated, mutated copy of the reference to the
current directory or a user-defined path (--file-prefix).


Example usage:

    % ./mutatrix -S sample -P test/ -p 2 -n 10 reference.fasta

This command writes VCF to stdout and writes the mutated references to test/,
with files named according to this pattern:

    # <prefix>/<sample id>:<fasta sequence name>:<copy number>.fa

    % ls test
    sample10:seq_1:0.fa  sample1:seq_1:0.fa  sample2:seq_1:0.fa  ...
    sample10:seq_1:1.fa  sample1:seq_1:1.fa  sample2:seq_1:1.fa  ...
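
The naming pattern above can be reproduced with a short Python sketch, for
instance to check that a run wrote every file you expect.  The helper name and
its arguments are illustrative and not part of mutatrix.  Numbering samples
from 1 and copies from 0 is inferred from the listing above, not from the
source code.

```python
def expected_fasta_paths(prefix, sample_prefix, n_samples, ploidy, seq_names):
    """List the FASTA paths a run is expected to write (hypothetical helper).

    Assumes samples are numbered from 1 and copies from 0, as in the
    example listing; this is inferred from the README, not from the source.
    """
    directory = prefix.rstrip("/")
    paths = []
    for sample in range(1, n_samples + 1):
        for seq in seq_names:
            for copy in range(ploidy):
                name = f"{sample_prefix}{sample}:{seq}:{copy}.fa"
                paths.append(f"{directory}/{name}")
    return paths


# Mirrors the example command: 10 samples, ploidy 2, one sequence "seq_1".
paths = expected_fasta_paths("test/", "sample", 10, 2, ["seq_1"])
```

For the example command this yields 20 paths, two per sample.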

mutatrix is suitable for use in testing pooled variant detectors, as it
distributes alleles throughout the population according to a zeta distribution,
which is roughly consistent with the power-law allele frequency spectrum
observed by large population sequencing projects like the 1000 Genomes Project.


Alternate allele generation:

mutatrix generates alleles using the following model:

At each position in the reference, we draw a pseudorandom number on [0,1).  If
this number, scaled by the number of copies of the genome in the population, is
below --rate (default 0.001), then we generate an alternate minor allele.
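
A minimal Python sketch of this per-position test follows.  It assumes that
"scaled by the number of copies" means the uniform draw is divided by the copy
count, so that larger populations accumulate more variant sites.  The text
above does not state the direction of the scaling, so treat that as an
assumption.  The function name is illustrative.

```python
import random


def is_variant_site(rng, rate=0.001, copies=20):
    """Decide whether one reference position receives an alternate allele.

    Assumption: the uniform draw is divided by the number of genome copies
    before it is compared with the rate. The README only says "scaled".
    """
    draw = rng.random()  # uniform on [0, 1)
    return draw / copies < rate


# 20 copies corresponds to 10 diploid samples, as in the example command.
rng = random.Random(42)
n_sites = sum(is_variant_site(rng) for _ in range(100_000))
```

Under this reading the chance of a variant at any position is rate * copies,
or 0.02 for the values shown.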

We then sample a second number, and if it is below --indel-snp-ratio, we
generate an indel.  Otherwise, we generate a SNP or MNP.  MNP lengths follow a
geometric distribution governed by --mnp-ratio.  For example, with a ratio of
0.01, a 2bp MNP occurs at 0.01 times the rate of SNPs, a 3bp MNP occurs at 0.01
times the rate of 2bp MNPs, and so on.
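
The choice between indel, SNP, and MNP can be sketched in Python as below.
This is an illustration of the model as described, not the mutatrix source.
It assumes the geometric length is produced by extending a substitution one
base at a time with probability mnp_ratio.  The function name and the ratio
values in the usage lines are made up for the example.

```python
import random


def choose_variant(rng, indel_snp_ratio, mnp_ratio):
    """Return (kind, length) for one variant site.

    kind is "indel", "snp", or "mnp". Indel lengths are drawn elsewhere,
    so length is None for indels.
    Assumption: a substitution grows by one base with probability
    mnp_ratio at each step, which requires mnp_ratio < 1.
    """
    if rng.random() < indel_snp_ratio:
        return ("indel", None)
    length = 1
    while rng.random() < mnp_ratio:
        length += 1
    return ("snp" if length == 1 else "mnp", length)


# Illustrative ratios only; these are not claimed to be the defaults.
rng = random.Random(7)
draws = [choose_variant(rng, 0.2, 0.1) for _ in range(100_000)]
```

With this construction the probability of length k is proportional to
mnp_ratio^(k-1), so each extra base is mnp_ratio times as likely as the
previous length, matching the 2bp and 3bp relationship described above.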

Indels are generated by drawing a length from a zeta distribution with alpha
--indel-alpha (an alpha of 1.7 is used, per observations in [1]).  If the drawn
length exceeds --indel-max, we skip the indel and continue.  The sequences of
novel insertions are randomly generated.
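
One way to draw zeta-distributed lengths is Devroye's rejection method, shown
here as a Python sketch.  This is a swapped-in textbook sampler and may differ
from how mutatrix itself draws lengths.  The function names are illustrative,
and indel_max is passed explicitly because no default is stated here.

```python
import random


def sample_zeta(rng, alpha):
    """Draw an integer k >= 1 with P(k) proportional to k**-alpha.

    Uses Devroye's rejection method for the zeta (Zipf) distribution,
    which requires alpha > 1.
    """
    b = 2.0 ** (alpha - 1.0)
    while True:
        u = 1.0 - rng.random()  # uniform on (0, 1], so u is never zero
        v = rng.random()
        x = int(u ** (-1.0 / (alpha - 1.0)))
        t = (1.0 + 1.0 / x) ** (alpha - 1.0)
        if v * x * (t - 1.0) / (b - 1.0) <= t / b:
            return x


def sample_indel_length(rng, indel_max, indel_alpha=1.7):
    """Return an indel length, or None when the draw exceeds indel_max."""
    length = sample_zeta(rng, indel_alpha)
    return length if length <= indel_max else None


rng = random.Random(1)
lengths = [sample_zeta(rng, 1.7) for _ in range(20_000)]
```

Most draws are 1bp or 2bp, with a long tail of rare large values.  Returning
None for the oversized ones mirrors the "skip and continue" behaviour
described above.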


Allele frequency spectrum simulation:

Once generated, the alternate allele is distributed across the population of
simulated individuals by sampling an allele frequency from a zeta distribution
(also with alpha 1.7).  The alternate alleles are randomly distributed across
the population.
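
The distribution step can be sketched in Python as follows.  Two assumptions
are made that the text above does not confirm: the zeta draw is treated as the
number of genome copies carrying the alternate allele, and the distribution is
truncated at the total number of copies in the population.  The function name
is illustrative.

```python
import random


def assign_alt_allele(rng, n_samples, ploidy, alpha=1.7):
    """Return per-sample genotypes (0 = reference, 1 = alternate) for one site.

    Assumptions: the alternate allele count follows a zeta law truncated
    at the total copy number, and carriers are chosen uniformly at random.
    """
    total = n_samples * ploidy
    counts = range(1, total + 1)
    weights = [k ** -alpha for k in counts]
    n_alt = rng.choices(counts, weights=weights, k=1)[0]

    # Pick which genome copies carry the alternate allele.
    carriers = rng.sample(range(total), n_alt)
    genotypes = [[0] * ploidy for _ in range(n_samples)]
    for copy_index in carriers:
        genotypes[copy_index // ploidy][copy_index % ploidy] = 1
    return genotypes


rng = random.Random(11)
genotypes = assign_alt_allele(rng, n_samples=10, ploidy=2)
```

Because the weights fall off as a power law, most sites end up with the
alternate allele on only one or two copies, and common alleles are rare.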

There is no concept of haplotype blocks or linkage in mutatrix.  Each allele
and site is effectively independent of all other sites.


author: Erik Garrison <erik.garrison@bc.edu>
license: MIT (free)

references:

[1] Problems and Solutions for Estimating Indel Rates and Length Distributions.
Reed A. Cartwright.  http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2734402/
