This program converts Short Tandem Repeat (STR) length data to numbers of tandem repeats. This is especially useful for applying models of microsatellite evolution (Sainudiin et al. 2004) to infer divergence times or demographic models.
The CSV output can be uploaded directly to BEAST v2.6.2 by installing the BEASTvntr package.
The code runs under both Python 2 and 3, but I suggest using Python 3 to avoid future issues. There are no other dependencies.
wget https://raw.githubusercontent.com/santiagosnchez/STRlength2repeat/master/STRlength2repeat.py
python3 STRlength2repeat.py input_file.csv
The program expects a CSV file with STR data in standard length format. For example:
indiv1,124,124,345,347,233,239
indiv2,122,124,345,345,230,239
...
Here, both alleles for a single locus are contiguous. This means that if there are n STR loci, there will be (n * 2) + 1 columns: one for the sample name and two for each locus.
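The column layout above can be sketched in a few lines of Python. This is only an illustration, not the program's own parser: the function name `read_loci` is hypothetical, and it assumes a header-less file with integer allele lengths, as in the example rows.

```python
import csv
import io

def read_loci(handle):
    """Yield (sample_name, [(allele1, allele2), ...]) for each CSV row.

    Hypothetical helper: assumes no header row and integer allele lengths.
    """
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        sample = row[0]
        values = [int(v) for v in row[1:]]
        if len(values) % 2 != 0:
            raise ValueError("expected two alleles per locus for " + sample)
        # Pair up contiguous columns: (1, 2), (3, 4), ...
        pairs = list(zip(values[0::2], values[1::2]))
        yield sample, pairs

# The two example rows shown above, read from an in-memory file.
example = io.StringIO(
    "indiv1,124,124,345,347,233,239\n"
    "indiv2,122,124,345,345,230,239\n"
)
rows = list(read_loci(example))
for sample, pairs in rows:
    print(sample, pairs)
```

With n = 3 loci, each row has (3 * 2) + 1 = 7 columns, and each sample ends up with three allele pairs.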
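The program includes a function to infer the size of the tandem repeat. For each candidate size m, it computes (A_i - A_min) % m for every allele A_i at the locus, where A_min is the shortest allele, and records the proportion of alleles for which the remainder is zero. Here, m ranges from 2 to 10. The size with the highest proportion is chosen, and ties go to the largest m.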
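# Body of the tandem-size inference function. `locus` holds the allele
# lengths for one locus; entries equal to 0 are skipped.
motif = range(2,11)  # candidate tandem sizes, 2 to 10
mprop = []
# shortest observed allele, ignoring the skipped zeros
shortest = min([ x for x in locus if x != 0 ])
for m in motif:
    # remainder of each allele's distance from the shortest allele
    y = [ (x - shortest) % m for x in locus if x != 0 ]
    # proportion of alleles that sit a whole number of repeats away
    mprop.append(round(float(sum([ x == 0 for x in y ]))/len(y),2))
bprop = max(mprop)
besti = [ i for i in range(len(mprop)) if mprop[i] == bprop ]
# ties go to the largest candidate size
bestm = max([ motif[i] for i in besti])
return(bestm,shortest,mprop)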
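To see how the inference behaves on actual numbers, here is a self-contained sketch of the same logic. The function name `infer_tandem_size` is hypothetical (the source does not show the real name), and this version assumes zeros are missing data and excludes them when finding the shortest allele.

```python
def infer_tandem_size(locus, motif=range(2, 11)):
    """Return (best tandem size, shortest allele, proportions per size).

    Hypothetical name; zeros in `locus` are assumed to be missing data.
    """
    observed = [x for x in locus if x != 0]
    shortest = min(observed)
    mprop = []
    for m in motif:
        # Count alleles whose distance from the shortest is a multiple of m.
        hits = sum((x - shortest) % m == 0 for x in observed)
        mprop.append(round(hits / float(len(observed)), 2))
    best = max(mprop)
    # Ties are resolved in favour of the largest candidate size.
    bestm = max(m for m, p in zip(motif, mprop) if p == best)
    return bestm, shortest, mprop

# Third locus of the example rows: differences are 0, 9, 3, 9.
print(infer_tandem_size([230, 239, 233, 239])[:2])  # (3, 230)

# Differences 0, 4, 8 fit both m = 2 and m = 4; the larger size wins.
print(infer_tandem_size([100, 104, 108])[:2])       # (4, 100)
```

The second call shows a consequence of the tie-break: when every allele is compatible with several sizes, the largest one is reported, so a locus with few distinct alleles can be assigned a larger tandem size than its true motif.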
A predefined tandem size should be easy to implement by providing it in CSV format. That will be done in future versions.
Repeat numbers are computed in get_repeats(), which follows this logic: for each allele A_ij (allele i within locus j), we take the difference between A_ij and the shortest allele at that locus, A_minj, divide it by the tandem size m_j, and add 1:
((A_ij - A_minj) / m_j) + 1
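The formula can be tried on a single locus with the sketch below. It is not the body of get_repeats(): the name `repeats_for_locus` is hypothetical, and two choices are assumptions made here, namely that a length of 0 is missing data and is passed through as 0, and that the division is integer division.

```python
def repeats_for_locus(alleles, m):
    """Convert allele lengths at one locus to repeat numbers.

    Hypothetical helper. Assumes 0 means missing data (kept as 0) and
    uses integer division, so lengths that are not a whole number of
    repeats from the shortest allele are rounded down.
    """
    observed = [a for a in alleles if a != 0]
    shortest = min(observed)
    return [((a - shortest) // m) + 1 if a != 0 else 0 for a in alleles]

# One locus with tandem size m = 2; the shortest allele is 122.
locus = [124, 124, 122, 124, 0, 128]
counts = repeats_for_locus(locus, 2)
print(counts)  # [2, 2, 1, 2, 0, 4]
```

Note that the shortest allele always maps to 1, so the values are repeat numbers relative to the shortest allele observed at each locus rather than absolute repeat counts.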
Sainudiin R, Durrett RT, Aquadro CF, Nielsen R. Microsatellite mutation models: insights from a comparison of humans and chimpanzees. Genetics. 2004;168(1):383‐395. doi:10.1534/genetics.103.022665 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1448085/