forked from ekg/mutatrix
genome simulation across a population with zeta-distributed allele frequency, snps, insertions, deletions, and multi-nucleotide polymorphisms
gkno/mutatrix
==== MUTATRIX ====

mutatrix is a population genome simulator which generates simulated genomes. It reads a reference FASTA file and outputs a VCF description of the variants on stdout, and writes each simulated, mutated copy of the reference to the current directory or a user-defined path (--file-prefix).

Example usage:

    % ./mutatrix -S sample -P test/ -p 2 -n 10 reference.fasta

This command writes VCF to stdout and writes mutated references to test/, with this format:

    # <prefix>/<sample id>:<fasta sequence name>:<copy number>.fa
    % ls test
    sample10:seq_1:0.fa  sample1:seq_1:0.fa  sample2:seq_1:0.fa  ...
    sample10:seq_1:1.fa  sample1:seq_1:1.fa  sample2:seq_1:1.fa  ...

mutatrix is suitable for use in testing pooled variant detectors, as it distributes alleles throughout the population according to a zeta distribution, which is roughly consistent with the power-law allele frequency spectrum observed by large population sequencing projects like the 1000 Genomes Project.

Alternate allele generation:

mutatrix generates alleles using the following model:

At each position in the reference, we draw a pseudorandom number on [0,1). If this number, scaled by the number of copies of the genome in the population, is below --rate (default 0.001), then we generate an alternate minor allele. We then sample a second number, and if it is below --indel-snp-ratio, we generate an indel. Otherwise, we generate a SNP or MNP.

MNPs are generated using a geometric distribution conditioned on the --mnp-ratio. A 2bp MNP occurs at 0.01 the rate of SNPs, a 3bp MNP occurs at 0.01 the rate of 2bp MNPs, etc.

Indels are generated by obtaining a length from a zeta distribution with alpha --indel-alpha. (An alpha of 1.7 is used per observations in [1]). If the indel is longer than --indel-max, we continue without generating the indel. Novel insertions are randomly generated.
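
The model above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not mutatrix's actual C++ code. The function names are hypothetical, and the values used for indel_snp_ratio and indel_max are placeholders, since their real defaults are not given here. "Scaled by the number of copies" is read as dividing the draw by the copy count. Insertions and deletions are assumed to be equally likely. The zeta sampler uses Devroye's rejection method, which may differ from the sampler mutatrix uses.

```python
import math
import random

BASES = "ACGT"


def sample_zeta(alpha, rng):
    """Draw k >= 1 with P(k) proportional to k**-alpha (alpha > 1).

    Uses Devroye's rejection method (an assumption, see the text above).
    """
    b = 2.0 ** (alpha - 1.0)
    while True:
        u = 1.0 - rng.random()  # in (0, 1], so the power below is finite
        v = rng.random()
        x = math.floor(u ** (-1.0 / (alpha - 1.0)))
        t = (1.0 + 1.0 / x) ** (alpha - 1.0)
        if v * x * (t - 1.0) / (b - 1.0) <= t / b:
            return x


def sample_mnp_length(mnp_ratio, rng):
    """Length 1 is a SNP; each extra base is kept with probability mnp_ratio,
    so a 2bp MNP is mnp_ratio times as likely as a SNP, and so on."""
    length = 1
    while rng.random() < mnp_ratio:
        length += 1
    return length


def propose_variant(ref, pos, copies, rng, rate=0.001, indel_snp_ratio=0.2,
                    mnp_ratio=0.01, indel_alpha=1.7, indel_max=1000):
    """Return (kind, ref_allele, alt_allele) for one site, or None.

    indel_snp_ratio and indel_max are placeholder values, not real defaults.
    """
    # Assumption: "scaled by the number of copies" means dividing the draw.
    if rng.random() / copies >= rate:
        return None
    if rng.random() < indel_snp_ratio:
        length = sample_zeta(indel_alpha, rng)
        if length > indel_max:
            return None  # too long: continue without generating the indel
        if rng.random() < 0.5:  # assumption: insertion and deletion equally likely
            novel = "".join(rng.choice(BASES) for _ in range(length))
            return ("ins", "", novel)
        return ("del", ref[pos:pos + length], "")
    # SNP or MNP: replace each affected base with a different base.
    length = min(sample_mnp_length(mnp_ratio, rng), len(ref) - pos)
    old = ref[pos:pos + length]
    new = "".join(rng.choice([b for b in BASES if b != base]) for base in old)
    return ("snp" if length == 1 else "mnp", old, new)


if __name__ == "__main__":
    rng = random.Random(42)
    ref = "".join(rng.choice(BASES) for _ in range(5000))
    for pos in range(len(ref)):
        variant = propose_variant(ref, pos, copies=20, rng=rng)
        if variant is not None:
            print(pos, *variant)
```

The sketch proposes a variant independently at each position and does not skip past a deleted or substituted span, which a full simulator would have to handle when writing the mutated sequence.
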

Allele frequency spectrum simulation:

Once generated, the alternate allele is distributed across the population of simulated individuals by sampling an allele frequency from a zeta distribution (also with alpha 1.7). The alternate alleles are randomly distributed across the population.

There is no concept of haplotype block or linkage in mutatrix. Each allele and site is effectively independent from other sites.

author: Erik Garrison <erik.garrison@bc.edu>

license: MIT (free)

references:

[1] Problems and Solutions for Estimating Indel Rates and Length Distributions. Reed A. Cartwright. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2734402/