==== MUTATRIX ====
mutatrix is a population genome simulator. It reads a reference FASTA file,
writes a VCF description of the simulated variants to stdout, and writes each
simulated, mutated copy of the reference to the current directory or to a
user-defined path (--file-prefix).
Example usage:

    % ./mutatrix -S sample -P test/ -p 2 -n 10 reference.fasta

This command writes VCF to stdout and writes mutated references to test/, with
this format:

    # <prefix>/<sample id>:<fasta sequence name>:<copy number>.fa
    % ls test
    sample10:seq_1:0.fa  sample1:seq_1:0.fa  sample2:seq_1:0.fa  ...
    sample10:seq_1:1.fa  sample1:seq_1:1.fa  sample2:seq_1:1.fa  ...
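
For downstream scripts, the file naming scheme above can be split back into
its parts. This is a minimal Python sketch, not part of mutatrix: the helper
name parse_copy_filename is hypothetical, and it assumes that sample ids
contain no ':' and that every file ends in ".fa".

```python
def parse_copy_filename(path):
    """Split '<sample id>:<fasta sequence name>:<copy number>.fa' into parts."""
    name = path.rsplit("/", 1)[-1]  # drop the <prefix>/ directory part
    if not name.endswith(".fa"):
        raise ValueError(f"unexpected file name: {path}")
    stem = name[: -len(".fa")]
    # The sample id is the first field and the copy number the last; the
    # sequence name sits in between and may itself contain ':'.
    sample_id, _, rest = stem.partition(":")
    seq_name, _, copy_number = rest.rpartition(":")
    return sample_id, seq_name, int(copy_number)


print(parse_copy_filename("test/sample10:seq_1:1.fa"))
```
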
mutatrix is suitable for testing pooled variant detectors because it
distributes alleles throughout the population according to a zeta distribution,
which is roughly consistent with the power-law allele frequency spectrum
observed by large population sequencing projects such as the 1000 Genomes
Project.
Alternate allele generation:
mutatrix generates alleles using the following model:
At each position in the reference, we draw a pseudorandom number on [0,1). If
this number, scaled by the number of copies of the genome in the population, is
below --rate (default 0.001), then we generate an alternate minor allele.
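
The per-position test can be sketched in Python as follows. This is an
illustration, not mutatrix's code. The text above does not say whether
"scaled" means multiplied or divided, so the sketch assumes the draw is
divided by the number of genome copies. The function name, population size
and reference length are hypothetical.

```python
import random


def is_variant_site(rng, rate, copies):
    """Decide whether one reference position receives an alternate allele."""
    draw = rng.random()  # uniform on [0, 1)
    # ASSUMPTION: "scaled" is read as dividing the draw by the number of
    # genome copies, so larger populations carry more variant sites.
    return draw / copies < rate


rng = random.Random(42)      # fixed seed so the run is repeatable
copies = 10 * 2              # hypothetical population: 10 samples, 2 copies each
reference_length = 100_000   # hypothetical reference length
n_sites = sum(is_variant_site(rng, 0.001, copies) for _ in range(reference_length))
# Under the assumption above the expected count is 100_000 * 0.001 * 20 = 2000.
print(n_sites)
```
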
We then sample a second number, and if it is below --indel-snp-ratio, we
generate an indel. Otherwise, we generate a SNP or MNP. MNP lengths follow a
geometric distribution governed by --mnp-ratio: with a ratio of 0.01, a 2bp
MNP occurs at 0.01 times the rate of SNPs, a 3bp MNP occurs at 0.01 times the
rate of 2bp MNPs, and so on.
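
The choice between indel, SNP and MNP described above can be sketched in
Python. This is one way to obtain the stated geometric fall-off and is not
taken from the mutatrix source. The function name and the indel_snp_ratio
value of 0.2 are hypothetical, and mnp_ratio is assumed to be below 1.

```python
import random


def draw_variant_type(rng, indel_snp_ratio, mnp_ratio):
    """Return ('indel', None) or ('snp_or_mnp', length_in_bp)."""
    if rng.random() < indel_snp_ratio:
        return "indel", None
    # Geometric length: each extra base is added with probability mnp_ratio,
    # so P(length = k + 1) = mnp_ratio * P(length = k). Requires mnp_ratio < 1.
    length = 1
    while rng.random() < mnp_ratio:
        length += 1
    return "snp_or_mnp", length


rng = random.Random(0)
tally = {}
for _ in range(200_000):
    kind, length = draw_variant_type(rng, indel_snp_ratio=0.2, mnp_ratio=0.01)
    key = kind if length is None else f"{length}bp"
    tally[key] = tally.get(key, 0) + 1
print(tally)  # the 2bp count should be roughly 0.01 times the 1bp count
```
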
Indels are generated by drawing a length from a zeta distribution with alpha
--indel-alpha (an alpha of 1.7 is used, per the observations in [1]). If the
drawn length exceeds --indel-max, we move on without generating the indel.
The sequence of a novel insertion is randomly generated.
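
A sketch of the indel length draw in Python. The zeta sampler here uses
Devroye's rejection method, which is a standard technique and not necessarily
what mutatrix does internally. The function names and the indel_max value of
1000 are hypothetical.

```python
import random


def sample_zeta(rng, alpha):
    """Draw k >= 1 with P(k) proportional to k ** -alpha (requires alpha > 1)."""
    am1 = alpha - 1.0
    b = 2.0 ** am1
    while True:
        u = 1.0 - rng.random()  # uniform on (0, 1], so the power below is finite
        v = rng.random()
        x = float(int(u ** (-1.0 / am1)))  # candidate length, always >= 1
        t = (1.0 + 1.0 / x) ** am1
        # Devroye's acceptance test for the zeta (Zipf) distribution.
        if v * x * (t - 1.0) / (b - 1.0) <= t / b:
            return int(x)


def draw_indel_length(rng, indel_alpha, indel_max):
    """Return an indel length, or None when the draw exceeds indel_max."""
    length = sample_zeta(rng, indel_alpha)
    return length if length <= indel_max else None


rng = random.Random(1)
print([draw_indel_length(rng, 1.7, 1000) for _ in range(10)])
```

Most draws are 1bp or 2bp, with an occasional much longer indel from the heavy
tail of the distribution.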
Allele frequency spectrum simulation:
Once generated, the alternate allele is distributed across the population of
simulated individuals by sampling an allele frequency from a zeta distribution
(also with alpha 1.7). The alternate alleles are randomly distributed across
the population.
There is no concept of haplotype blocks or linkage in mutatrix. Each allele and
site is effectively independent of other sites.
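
The allele distribution step can be sketched in Python as follows. This is an
illustration under stated assumptions, not mutatrix's implementation: the
allele frequency is treated as a count of genome copies, the zeta weights are
truncated at the total number of copies, and the function name is
hypothetical. Each call handles one site on its own, matching the absence of
linkage noted above.

```python
import random


def assign_alt_allele(rng, n_samples, ploidy, alpha=1.7):
    """Return per-sample genotypes (lists of 0/1) for one variant site."""
    total_copies = n_samples * ploidy
    counts = range(1, total_copies + 1)
    # ASSUMPTION: zeta weights k ** -alpha, truncated at the number of copies.
    weights = [k ** -alpha for k in counts]
    alt_count = rng.choices(counts, weights=weights, k=1)[0]
    # Choose which genome copies carry the alternate allele, uniformly at random.
    carriers = set(rng.sample(range(total_copies), alt_count))
    return [
        [1 if s * ploidy + c in carriers else 0 for c in range(ploidy)]
        for s in range(n_samples)
    ]


genotypes = assign_alt_allele(random.Random(7), n_samples=10, ploidy=2)
print(" ".join("/".join(str(a) for a in g) for g in genotypes))
```
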
author: Erik Garrison <erik.garrison@bc.edu>
license: MIT (free)
references:
[1] Problems and Solutions for Estimating Indel Rates and Length Distributions.
Reed A. Cartwright. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2734402/