-
Notifications
You must be signed in to change notification settings - Fork 0
/
ffigueiredo-cjs_trepid_SOM.tex
1306 lines (1003 loc) · 51.2 KB
/
ffigueiredo-cjs_trepid_SOM.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\NeedsTeXFormat{LaTeX2e}[1995/12/01]
\documentclass[10pt]{article}
% Load packages
\usepackage{cite} % Make references as [1-4], not [1,2,3,4]
\usepackage{url} % Formatting web addresses
% \usepackage[utf8]{inputenc} %unicode support
\urlstyle{rm}
\usepackage{indentfirst}
%% \setlength{\doublerulesep}{\arrayrulewidth}
\usepackage{amsmath,amsfonts,amssymb}
\usepackage[pdftex, bookmarks, colorlinks]{hyperref}
%% \def\includegraphic{}
%% \def\includegraphics{}
\usepackage{color,graphicx}
\setlength{\topmargin}{0.0cm}
\setlength{\textheight}{21.5cm}
\setlength{\oddsidemargin}{0cm}
\setlength{\textwidth}{16.5cm}
\setlength{\columnsep}{0.6cm}
\begin{document}
\title{Supplementary Online Material for ``An individual-based model
simulator for transposable elements dynamics and evolution''}
\author{Felipe Figueiredo\footnote{philsf79@gmail.com} \and
Claudio Struchiner }
%% \footnote{Programa de Pós-graduação em Biologia Computacional e
%% Sistemas, Instituto Oswaldo Cruz, BCS/IOC/Fiocruz}
%% \footnote{Programa de Computação Científica, PROCC/Fiocruz}
\maketitle
\tableofcontents
%%%%%%%%%%%%%%%%%%
\section{Overview of the simulator}
%% Here we describe the overall structure of the simulator. More specific
%% details are available in the API documentation, provided with the
%% software.
The \verb|TRepid| simulator is an Object-Oriented system, with
individual hosts and TEs each represented by objects. It's main
interface with the user is provided by a plain-text configuration file
that should be present in directory of the simulation. The simulation
sub-directory tree will be created, and data and log files written by
the simulator as required.
\subsection{The main players}
\label{sec:main_players}
The main class defines simulation objects that contain all the
information and data related to a given simulation. Associated with
each simulation object is a configuration object to access or modify
any parameter of the simulation, including equations parameters, file
names, etc. It collects all configuration parameters and variables
from configuration files, or use defaults if values are not provided
by the user.
%\section{Usage of the simulator}
\subsection{Basic usage: Input}
The principal way of setting up a new simulation is through a
configuration file called \verb|trepid.conf|, inside which the user
defines parameters for all the models.
The default configuration file that comes with the simulator has all
parameters filled with default values, and comments explain what they
mean. This file is divided in sections to organize semantics for the
user. These categories are not interpreted though, since only variable
names are used and users are free to remove the section names and
define the config file in the order they prefer.
After preparing the configuration file, the user can run the
\verb|trepid| program inside the directory where the config file is
located. Upon starting, the simulator looks for the \verb|trepid.conf|
file, and if found, create a directory structure for the population
(even if the population is not saved at the end of the simulation), a
directory for logs and results, and finally a directory for the
Templates for the hosts with TEs.
The Templates dir is where one or more genome templates are located,
and used to create the GM (Genetically Modified, i.e. with TEs)
sub-population. If the user wants to use a GM population, at least one
GM template must exist. If the templates directory is previously
non-existent, a GM population won't be created, and in this case the
simulator aborts.
If the configuration file is not found or is empty, default values
will be used for all parameters. The same happens to any parameter not
defined in the config file, if it is incomplete. The user can check
what parameters were set for each simulation in the log header.
\subsection{The default config file}
\label{sec:default_config}
In this section we describe all of the variables definable in the
config file, showing the default values as illustration. These are the
parameters used for a very simple simulation of a Master Gene model in
a small host population that does not have generation overlap.
\subsection{General}
\label{sec:default_config_general}
{\bf Simulation time and initial sub-populations}
\begin{verbatim}
#####################################
[General]
# Number of generations (default: 10)
it_max = 10
\end{verbatim}
FIXME: include other general variables, save\_*, sample\_gm,
interactive, often\_max.
\subsection{Initial}
\label{sec:default_config_initial}
FIXME: include other initial variable: init\_gender .
\begin{verbatim}
#####################################
[Initial]
# Initial Susceptibles (default: 900)
S_0 = 0
# Initial Infectious (default: 100)
I_0 = 10000
# Initial Removed (default: 0)
R_0 = 0
\end{verbatim}
The parameter \verb$it_max$ determines how many iterations of the model,
i.e., how many generations will be simulated.
The $F_0$ population is determined by three initial sub-populations. In
these parameters, the notations inspired by the SIR epidemic model of
the early prototypes were maintained. \verb$S_0$ symbolizes the
initial wild sub-population, \verb$I_0$ the initial GM sub-population
(i.e. individuals that carry TEs) and \verb$R_0$ the Deceased or past
individuals.
If a population was previously saved to disk, the files will be
automatically read and the $F_0$ population will be constructed using
those individuals. If the initial parameters for either of \verb$S_0$,
\verb$I_0$ and \verb$R_0$ are greater than the pre-existing
individuals, additional ones will be created to match the simulation
criteria.
\subsection{Reproduction}
\label{sec:default_config_reproduction}
{\bf Age structure}
\begin{verbatim}
#####################################
[Reproduction]
# Life expectancy (default: 4)
age_max = 4
# Reproductive age (default: 3)
age_reprod = 3
# Offspring count (default: 100)
offspring_count = 100
\end{verbatim}
The two parameters \verb|age_max| and \verb|age_reprod| determine the
age structure in the population. They are measured in
generations. Setting the \verb|age_reprod| to one unit lower than
\verb|age_max| configures a population without generation
overlap. Setting \verb|age_max| to 2 and \verb|age_reprod| to 1
disables the age structure in the sense that all individuals
last only one generation after being created.
{\centering\noindent\rule{0.5\textwidth}{0.4pt}}
\begin{verbatim}
# Recombination model (default: 1step)
recombination_model = "none"
\end{verbatim}
The above parameters set the total offspring per couple, and turn off
the recombination for the simulation.
{\centering\noindent\rule{0.5\textwidth}{0.4pt}}
\begin{verbatim}
# Reproduction model (default: Logistic)
reproduction_model="Logistic"
# Logistic parameters
# reproductive (default: 0.1)
r = 20
# carrying capacity (default: 5000)
K = 20000
\end{verbatim}
The above options select the ecological model (in this case a logistic
population) and define the model parameters (in this case $r$ and
$K$).
\begin{verbatim}
# ploidy parameters (default = 2)
ploidy = 1
\end{verbatim}
The ploidy can be artificially set to diploid or haploid. Haploid
populations are created in the following manner: upon creation of a
new individual, the first chromosome comes from the father and the
second from the mother, but since only one chromosome will be
considered throughout the simulation. The contents of both chromosomes
are merged, possibly overwriting a specific TE, in such cases where
there is a TE in the same site in both parents' chromosomes. This is
done by design in order to guarantee that there exists only one active
TE in the haploid case, as opposed to what would be expected in the
diploid case in which one active element would be inherited from the
father, and another element from the mother. In the next generation,
two would be inherited from the father and two from the mother, in
geometric progression.
\subsection{Transposition}
\label{sec:default_config_transposition}
\begin{verbatim}
#####################################
[Transposition]
# Species' categories sites (default: k=0, S=30, N=70)
killer_sites = 0
severe_sites = 0
neutral_sites = 200
\end{verbatim}
Sizes of each chromosome section. {\em Killer sites} are those that
kill the individual immediately upon birth, if occupied by an
element. {\em Severe sites} add fitness costs, and {\em Neutral sites}
are ignored.
\begin{verbatim}
# Fitness cumulative impact of transposition (default .05)
severe_impact = .05
\end{verbatim}
This value is incremented for each TE that occupies a {\em severe
site}. The total impact is rounded up to the next integer. This
total impact is the quantity of offspring the individual will not be
able to produce, due to TE activity. In this sense, the total TE
impact will be subtracted from the amount of offspring the individual
will contribute to the population, and it is calculated for both the
father and the mother.
\begin{verbatim}
# Transposition model: constant, exponential or SKR (default: SKR)
transposition_model = "constant"
\end{verbatim}
The {\em constant} model produces the same amount of new elements per
transposition event, i.e., in the gametogenesis. At least one active
element is required for the creation of new copies.
\begin{verbatim}
# Transposition rate for constant and exponential model (default: .5)
transposition_rate=1
\end{verbatim}
The {\em transposition rate} parameter has different meanings,
depending on which transposition model was selected. If the {\em
constant model} is being used, this parameter is the amount of new
copies being created at each gametogenesis. In the case of the {\em
exponential model} or the {\em SKR model} it is the probability that
new copies will be created, i.e., the probability of occurrence of a
transposition event in that gametogenesis.
\begin{verbatim}
# Excision rate for constant and exponential models
excision_rate = 0
\end{verbatim}
The {\em excision rate} is analogous to the {\em transposition rate},
but instead of contributing to the creation of new copies, if
subtracts TEs from chromosomes. Both the possible values and meaning
for each transposition model are analogous to the interpretation in
the {\em transposition model}, as stated above.
\begin{verbatim}
# Deleterious threshold (default: severe + neutral)
# could be set to any arbitrary value
#delete = 10
\end{verbatim}
FIXME: explain.
\begin{verbatim}
#Inactivation model: mastergene, random and progressive (default: mastergene)
inact_model = "mastergene"
\end{verbatim}
The activity {\em status} is represented by a number between 0 and
1. Any positive number implies the element is active, while 0 means
the element is inactive.
There are three {\em inactivation models} available. The {\em Master
Gene} postulates that every new copy is automatically inactive
(thus, has a {\em status} of 0). The {\em random} model samples a
random number between 0 and the {\em status} of the template element,
thus reducing the mean {\em status} over successive generations. The
{\em progressive} model decreases the {\em status} of each new copy by
a fixed amount, also decreasing mean {\em status} over time. Small
{\em status} values are truncated to 0.
\begin{verbatim}
# Invasion threshold for GM sub population (default: 0.9). Set to 1 to disable.
invasion_threshold = 0.9
# Loss threshold for GM sub population (default: 0.01). Set to 0 to disable.
loss_threshold = 0
\end{verbatim}
FIXME: explain.
\subsection{Evolution}
\label{sec:default_config_evolution}
\begin{verbatim}
#####################################
[Evolution]
# Probability that a mutation (substitution) will occur (default: .1)
mutate_prob = 1
\end{verbatim}
The {\em mutate probability} is the probability that a evolutionary
event will occur in the selected transposition event. This implies
that at some amount of point substitutions is bound to occur,
although some might be silent substitutions.
\begin{verbatim}
# Number of point mutations to be introduced in each evolutionary event (default: 2)
mutation_count = 2
\end{verbatim}
FIXME: explain.
\subsection{Output}
\label{sec:output}
Over the course of a simulation some output files are created: a log
file, two CSV files (one for the demography and one with TE data), and
if an individual can be sampled from the population in the last
generation, it will be saved to disk in two files (in the formats
FASTA and NEXUS). All of these files are prefixed with the simulation
date, and a serial number, so files from successive simulations can be
observed in chronological order.
\subsubsection{Log file}
\label{sec:output_log}
FIXME: explain
\subsubsection{CSV tables}
\label{sec:output_csv}
FIXME: explain.
\subsubsection{Sampled individual}
\label{sec:output_sample}
FIXME: explain.
\subsection{Advanced usage: API for Perl programmers}
Although the configuration file can provide a practical input for
several simulation scenarios, some users might have more complex
needs. Besides the configuration file there is an API which can be
directly accessed by software, scripts or an interactive shell in
Perl language which facilitates step-by-step procedures, analysis,
reanalysis and more detailed data acquisition.
All data generated in each component can be accessed from the API, and
accessor methods are provided for each data type for all object
classes. Additionally, the whole generation algorithm can be bypassed
and simulations can be manually run, step by step, and scrutinized in
any way desired. One can change config parameters after a few
generations for any of the models, or even the models themselves,
insert or remove individuals in the population, etc.
The API documentation along with the complete list of functions and
class and object methods are provided inline with the code and can be
accessed by the usual \verb|perldoc| command. PDF and HTML documents
are also provided in the software package.
Each simulation run is identified by a unique ID string, and all
objects related to that simulation are tagged with this ID, so several
simulations can be run by the same instance or script without risk of
collision of variable values, either in batch or in parallel. We
provide an example script to run a batch of simulations with different
parameters.
\section{Algorithms and models}
The simulator is flexible in construction and able to produce a
realistic scenario in which transposition events occur in finite
populations that agree with the Wright-Fisher
conditions \cite{HC98}. Transposable elements (TEs) replicate according
to an arbitrary transposition model and TE sequences' evolve according
to Kimura's infinite-sites model \cite{Taj96}, and an arbitrary model
of molecular evolution \cite{Yan06}.
\subsection{Population}
%% %FIXME: colocar esse paragrafo na discussao
%% The SIR model is one of the most studied epidemiological models and its
%% simplicity is both its strength and weakness. To compensate for the
%% latter, several modifications have been proposed in the last decades
%% to introduce specific effects into its basic framework including Allee
%% effect, periodicity and many others.
As described above, the population has two independent structures:
gender and age. We have drawn inspiration from the SIR epidemiological
model to structure the population in discrete compartments. The age
structure is also discrete, as described below.
The population is divided into {\bf Wild} and {\bf GM} (genetically
modified) compartments of hosts similarly to the SIR compartmental
model classes \emph{Susceptible} and \emph{Infected}. There's also a
{\bf Deceased} compartment, in analogy to the \emph{Recovered}
compartment, where individuals from earlier generations are kept for
future reference or analysis. A given individual cannot change from
the {\bf Wild} to the {\bf GM} compartment, though, and vice
versa. Instead, these compartments' dynamics evolve through
generations of sexual reproduction of the host population.
Time is measured in simulation steps representing discrete
generations. The age structure is defined by two classes, regarding
wither or not a host is mature enough to reproduce. Each new
individual remains non-reproductive for a maturation period, after
which it becomes mature and enter the reproductive pool in the
population.
Age of sexual maturity and maximum lifespan are passed to the
simulator as configuration parameters by the user. Default parameters
for maturation, lifespan and offspring count reflect behavior
expected for strongly $r$-selected species with no generation overlap,
suitable for modeling some insect populations such as mosquitoes or
fruit flies.
\bigskip
\par
\begin{center}
\mbox{
\begin{tabular}{|c|c|}
\hline \multicolumn{2}{|c|}{Default age parameters}\\ \hline\hline
Initial age & {\bf 1} \\ \hline
Reproductive age & {\bf 3} \\ \hline
Maximum age & {\bf 4} \\ \hline
\end{tabular}
}
\end{center}
\bigskip
The growth rate of the population is determined at the beginning of
each generation from an arbitrary ecological model (see below), and
from this rate the net income of new individuals for the next
generation is obtained. The necessary amount of reproductive
encounters to reach the appropriate number of new individuals for the
next generation is then determined by dividing the total income by the
expected number of offspring per couple as indicated in equation
(\ref{eq:couples}). Note that assuming a non-monogamous species, this
is equal to the number of females expected to mate in this generation,
as there should always be enough males to impregnate any available
fertile females.
\begin{equation}
\label{eq:couples}
\mbox{reproductive encounters} =
\frac{\mbox{total income of new hosts}}
{\mbox{expected number of offspring}}
\end{equation}
Couples are then pooled from the available mature population to
satisfy the necessary number of reproductive encounters. We are
modeling host species that are not monogamous, so males are chosen
from the population with replacement while females are chosen without
replacement. This approximates the random mating behavior (i.e., no
spatial structure), while retaining the realistic behavior that a
given female can only copulate at most once per generation.
%% Age of sexual maturity and expected number of offspring per couple are
%% passed as configuration parameters to the simulator by the user in a
%% plain text configuration file, along with the logistic parameters and
%% options for other sections of the simulator. Default parameters for
%% maturation, reproductive age, maximum age were chosen as to avoid
%% overlapping of generations.
The host fitness is defined as the total offspring count that host has
generated. The mean fitness over the population for a given generation
is the arithmetic mean of the fitnesses of all live individuals at the
end of the generation.
The offspring amount for each reproductive encounter could be drawn
from a distribution whose mean is determined for the population. For
simplicity, however, we consider in this version of the simulator a
fixed initial amount of offspring for each couple. From this number it
will be subtracted a quantity proportional to the impact TEs might
have on the host fecundity. We postpone the definition of fitness
impact and how it is calculated to the transposition model section.
Migration between meta-populations can be introduced in the simulator
by the use of several geographic patches. Each of these patches is a
micro-environment of its own, with the same compartment structure. For
simplicity, in this version of the simulator we consider only one
geographic patch, so migration events are ignored. This feature is
planned for a future release.
%% The probability of a particular male copulating with a particular
%% female is therefore proportional to the population density in the
%% patch, and inversely proportional to the distance between the patches
%% they are in.
\subsubsection{Ecological Model}
At each generation, the total number of new offspring must be
determined before mating begins. To this end, an ecological model is
consulted to determine the dynamics of the population in the absence
of external influence (in our case, impact from TEs, which is
described below).
The simulator is modular in the sense that the ecological and
transposition models are implemented and considered independently from
the rest of the system framework, so they function as exchangeable
parts in the mechanism. This way, many other models can be included in
the simulator in order to suit particular demands for each of these
sections.
Indeed, any population model can be used in this system provided it
can be represented in the form:
\begin{equation}
p_{n+1} = F(p_n)
\end{equation}
where $F$ is a function that depends on the current population size
$p_n$. It is not required that F be deterministic, i.e., depends only
on $p_n$. It can optionally depend on a random variable $\xi_n$ to
introduce intrinsic stochasticity in the growth rate:
\begin{equation}
p_{n+1} = F(p_n,\xi_n)
\end{equation}
As a result, it is also not required that the model should be written
explicitly as a differential or difference equation. The only
requisite is that the input is the current population size, and the
output is the expected quantity for the next generation.
We currently implement five population models, two models of
unrestricted growth and and two models for saturated growth: constant
population, constant growth (linear), exponential growth (or decay),
the Logistic and the Hassel \cite{Has75} equations. All of these are
discretized from Ordinary Differential Equations, as described in
section \ref{sec:impl_pop_models}.
%% We implemented two population models of interest in the simulator: the
%% logistic equation
%% \begin{equation}
%% \label{eq:logistic}
%% \dot{p} = r p \left(1-\frac{p}{K}\right)
%% \end{equation}
%% and the Hassel equation \cite{Has75}
%% \begin{equation}
%% \label{eq:hassel}
%% \dot{p} = r p \left(1+\frac{p}{K} \right)^{-\beta}
%% \end{equation}
%% where $r,K,\beta>0$. The parameter $r$ is the reproductive rate, $K$
%% is the support capacity of the population and $\beta$ is the
%% intra-specific competition exponent. For $\beta = 1$, the Hassel model
%% is identical to the logistic model.
\subsubsection{Age structure}
The population has an age structure in which it is divided in discrete
classes. The behavior of such models has been thoroughly studied in
the ecological literature \cite{nisbet82}.
One can configure the parameters to set any arbitrary number of age
classes for a simulation, but there are two age classes of obvious
interest: before and after reaching the reproductive age. The
durations for the maturation stage and the adult stage last are
selected as configuration parameters to the simulator, and can be
chosen to reproduce several scenarios, according to the organism being
modeled. The default values for maturation and longevity are
appropriate to species that don't have generation overlap, such as
most insects.
(FIXME: move to another section) The default value total offspring for
each reproductive encounter is also selected as to represent
$r$-selected species with high fecundity and high mortality, such as
insects.
%Leslie matrix $2 \times 2$
\subsection{Molecular evolution}
\subsubsection{The genome}
The genome in each individual is composed by two chromosomes, and each
chromosome is represented discretely as a list of insertion sites that
act as \emph{loci} for TEs. Each insertion site in each chromosome can
therefore be either empty or occupied by a TE. To reflect the fact
that some insertions can cause disadvantageous, deleterious or even
fatal mutations on the host, we categorize the insertion sites into
three classes: fatal (\emph{killer sites}), meaning they disrupt
essential genes in a way they render the cell useless; severe
(\emph{severe sites}), which may disrupt non-essential genes, or
metabolic pathways, and neutral (\emph{neutral sites}). Under the
hypothesis that within a species genomes from different hosts should
bear more similarities than differences, the number of insertion sites
in each of the above classes should not vary much within a species,
and could be considered species specific if such numbers can be
estimated based on genome size.
The genome size in this simulator is thus defined as a result of
config parameters choices. Each chromosome has $\eta = k + s + n$
\emph{loci}, each of which can contain one or zero TEs. These sites
can be reordered, without loss of generality, in such a way that the
first $k$ sites are \emph{killer sites}, the next $s$ are \emph{severe
sites}, and the remaining $n$ are of \emph{neutral sites}.
%% Since the information contained in each twin strand in the pair is
%% identical, we refer to each helicoidal pair of strands simply as ``DNA
%% strand'', so the two strands mentioned from now on actually means the
%% two strand pairs.
%% Since the host species is diploid, TE copies can reside on any (most
%% likely both) of DNA molecules, and only one copy of each strand is
%% passed to offspring, we must keep track of which strand and insertion
%% site each TE copy is stored. The child receives all copies of TEs
%% currently present in one of the DNA strands of each parent, in
%% addition to any extra copies produced by replication of the parent's
%% elements.
\subsubsection{Recombination and ploidy}
A model of crossing-over recombination is optionally used to promote
additional variability of chromosome content. It is implemented in
three variants called \emph{1-step}, \emph{2-step} and
\emph{all-step}. The resulting gamete retains the original size in
insertion sites, and each site is filled consulting one of the
parent's corresponding site from one of the parental gamete.
In the \emph{1-step} model a site is chosen at random and this site is
the cutting point between the parts coming from each chromosome. The
resulting gamete will be filled up to this site identical to the first
chromosome and then the remaining sites are merged from the second
chromosome.
The \emph{2-step} model is similar to the above, but there are two
cutting points instead of one. This means the first and third sections
of the gamete are merged from the first chromosome and the second is
merged from the other chromosome.
There's also an \emph{all-step} model that abstracts from the two
models above in the sense that every site is a cutting point. This way
every site is taken from a random chromosome preserving only the order
from which it came. This will provide maximum variability considering
recombination events.
If none of the above recombination models is selected, the resulting
gamete will be equal to one of the parental chromosomes, chosen at
random.
Additionally, although we are mainly interested in simulating sexual
populations, a workaround is provided to represent haploid
simulations, for the benefit of simplicity. If this option is chosen,
the second chromosome of the offspring being created is always
ignored, and its contents copied to the first one respecting the
original chromosome site, possibly overwriting an existing TE that
previously existed there. This ``haploid sexual population'' is a
practical way of creating simple simulations and hypotheses.
\subsubsection{The evolutionary model}
Over time mutations occur and are accumulated in the genome of
simulated hosts. Whenever a mutation occurs a substitution drawn from
an evolutionary model.
We implemented a simple model that follow the assumptions of the
Jukes-Cantor substitution model \cite{JC69}. All mutations are
substitutions and equally likely to occur. As a result no special
assumption is made over sites in the sequence which are more likely to
change, nor transition/transversion bias. \emph{ All sites are
considered equally likely to change as well as all nucleotides
changes are considered equally likely to occur.}
As happens with the ecological section, almost any evolutionary model
can be implemented in the simulator. The requirement here is that the
input is the original sequence, and the output is the mutated
sequence. Which changes are allowed or disallowed are completely up to
the model in question. The evolutionary model is any model or function
that, given an input nucleotide sequence, outputs a similar homologous
sequence.
\subsection{Transposition model}
The existence of TEs in a host genome can cause several kinds of
impacts in the host, ranging from fitness reduction, longevity
reduction and even host inviability. There are several ways to model
both how the TEs replicate in the genome, and the impact that such
replication causes in the host that carry them.
The three key elements to be considered in the modeling of these
phenomena are the way the total amount of TE copies vary, the ability
of TEs to be transposed and the impact that each TE might cause in the
host.
\subsubsection{Activity cycle}
The initial phase of the TE invasion, should be regarded as the period
during which the TE is most active. Otherwise, genetic drift may lead
the TE to extinction \cite{rouzic2005b}. After this initial phase, if
the TE succeeds in fixating in the host population, it must decrease
its activity so as not to disrupt too much the host's fertility.
%% \begin{figure}
%% \includegraphics[width=1\textwidth]{figures/cycle.png}
%% \caption{Transposon cycle (\cite{kidwell2001})}
%% \end{figure}
Transposable elements usually undergo some degeneration process, that
inactivates TE copies. We take this phenomenon into account in the
model in the form of an \emph{status score}, that is a fixed number
between zero and one for each TE. When a status of zero is selected,
it renders the TE inactive. This score is also used to determine
whether of not the TE is actively transposing.
Every time a new TE is created from an existing TE, the new copy's
\emph{status score} is chosen to be less than the original score. How
fast this score drops to zero in successive transpositions can be
implemented in several ways, and could be used to reproduce several
interesting scenarios:
\begin{itemize}
\item if the status is constant and zero, then every new copy is
inactive, and the only active copy is the original one. This is
known as the \emph{master gene model};
\item if the status is constant and non-zero, then every new copy is
active, which represents the \emph{transposon model};
\item the status can decrease in a constant rate, which will render
new copies inactive after a few
\item the new status can be chosen randomly, between zero and the
original value
\end{itemize}
\subsubsection{Forms of impact on hosts}
New TEs can cause an impact on the host, depending on where in the
genome it lands when created. This is implemented in the three
categories of insertion sites where TEs might appear. If a TE is
created in a \emph{killer site}, it will be considered a deleterious
transposition. Hosts can be born dead due to deleterious
transposition, and otherwise the impact will be considered a fitness
toll, thus the host will have a lower offspring count when compared.
All transposition models can be used with or without individual
fitness impact on hosts, as an additive fitness impact can be
optionally defined as a config parameter, which is a real number. All
elements that have such an effect on the host are accounted for and
their relative impact is summed and rounded up. This total impact is
the total amount of offspring this particular host will be unable to
produce due to disadvantageous transpositions in its genome.
Lifespan can also be decreased if the impact is too great, albeit not
fatal (as proposed in \cite{rouzic2005}). We include this
characteristic in the model as follows. First we take the ratio
between the total fitness impact as calculated above and the maximum
age defined in the config for the simulation, then this ratio is
rounded down. This integer is the total age classes the host will be
unable to achieve due to deleterious transpositions.
%\subsubsection{Changes in the ammount of copies}
\subsubsection{TE activity dynamics}
Several transposition models can be implemented in the system. We
implemented both neutral models in respect to natural selection
(constant and exponential growth), and a transposition model that
takes into account intrinsic deleterious effects by insertion of new
elements \cite{SKR05}.
Transposition models implemented in our framework have two main
components: one for the the acquisition of new copies and one for the
excision of existing copies.
The acquisition model (usually called by a \verb|transpose()|
function) determines the total amount of new copies that should be
created for the gamete. The excision model (usually called by a
\verb|excise()| function) is the exact opposite. Parameter choices in
the simulation configuration should consider cases where the creation
of new copies never falls behind the deletion of old copies, otherwise
the
%% Here we describe one of the simplest transposition models we
%% implemented in the simulator. It doesn't take into account any
%% regulation due to selection on the host population, so the only
%% regulation imposed on the TE family, besides self-regulation, is the
%% excision of TE copies. The analytic behaviour of this model is
%% thoroughly described in \cite{rouzic2005}.
%% \begin{equation}
%% \Delta c_n = (u_n - e) c_n
%% \end{equation}
%% where $c_n$ is the TE count at time $n$, $u_n$ is a decreasing
%% positive function of $n$ and $e>0$. According to this model, the TE
%% can invade the host population provided $u_n$ is larger than $e$ for
%% small $n$ \cite{rouzic2005}, and can fixate if $u_n$ never decreases
%% below $e$. Furthermore, TE copy number reaches an equilibrium if $u_n$
%% converges to $e$ for large $n$.
%% \subsection{The generation algorithm}
%% The general algorithm that happens at each generation models the basic
%% life cycle of a diploid sexual species.
%% \begin{enumerate}
%% \item Host couples are chosen randomly from available mature hosts at
%% the beginning of each reproductive season. Males are chosen with
%% replacement, females are chosen without replacement.
%% \item Each adult bears new gametes after transposition and
%% recombination.
%% \item Transposition draws a recruitment amount of new TE copies and
%% the deletion amount of excised copies from the transposition
%% model. This changes the content of the gametes in terms of
%% availability of TEs.
%% \item Mutations are sampled from the evolutionary model for newly
%% created TE copies.
%% \item Recombination provides additional shuffling of gamete contents.
%% \item Each couple gives birth to a number of offspring defined by the
%% user as a parameter.
%% \item Each newborn individual is composed of one chromosome from each
%% of its parents, and new individuals are introduced to the population
%% pool.
%% \item The fitness cost from TEs in newborn individuals is calculated
%% and any that exceeds a given threshold is killed before birth and
%% removed from the population.
%% \item The age of every surviving individuals is incremented ate the
%% end of the generation.
%% %% \item Newborn individuals inherit one chromosome from each parent,
%% %% after recombination, transposition and excision of active TEs.
%% \end{enumerate}
% FIXME: diagram in xfig.
\subsection{The population genealogy}
%% Since in the absence of transposition impact the population grows
%% according to the logistic model, we expect $p_n$ not to vary much
%% after a large number of generations, when the parameters are chosen
%% as to minimize phenotypic altering mutations induced by transposition.
We are interested in analyzing sequence data from an individual to
reconstruct the phylogenetic relations between its TEs. Therefore it
becomes necessary to compare the reconstructed history with the real
parental history we simulated.
Each host is identified by a name defined as a serial number, and
carries the names of both parents. A simple recursive breadth-first
algorithm is used to recover the names of parents, grandparents and so
on. As such the genealogy can be fully reconstructed up to any valid
time interval. The population stored in the {\bf Deceased}
compartment can be consulted to create genealogies of different depths
after the execution of the simulation, or inspect each individual for
it's genome, in order to compare the elements present in the sampled
individual, and its ancestors.
\subsection{Sources of stochasticity}
Although most of models described so far involve deterministic models
the overall behavior is stochastic. This happens due to both sampling
effects and the fact that the system is an individual based model, or
agent based model \cite{Bon02}. The following are sources of such
uncertainty:
\begin{enumerate}
\item sampling of individuals in finite population
\label{item:finite}
\item recombination
\label{item:recombination}
\item replication and excision of TEs
\label{item:transposition}
\item mutations (substitutions) within TE sequences
\label{item:mutation}
\end{enumerate}
The item \ref{item:finite} refers to Wright-Fisher models of
populations \cite{HC98}. Since we're simulating small populations
(typically $<10^5$, but there's no restriction to population size)
there is a sampling effect to be considered when an individual is
chosen for reproduction. Should we simulate very large simulations the
importance of sampling effects could be diminished, and we could
approximate the behavior of an infinite population (as in a
Hardy-Weinberg population).
Since we're simulating sexual populations, we incorporate
recombination by crossing over of games. This can be implemented in
several ways, and the item \ref{item:recombination} is related to the
recombination model used to promote additional variability to the
chromosome pool.
FIXME: complete this section
The item \ref{item:transposition} is related to the availability of
new TEs in the chromosomes (as defined by different \emph{loci} and
sequence).
The item \ref{item:mutation} is related to the evolutionary model used
to promote variability across generations.
\section{Availability}
The simulator has been developed and tested in GNU/Linux systems
running Ubuntu Linux. As it is a portable language, it should run on
any system for which its dependencies are available. Those include Mac
OS X and Windows, besides UNIX/Linux in general. Packaging has been
prepared following guidelines typical of Perl modules and
applications, which should ease the installation in any platform Perl
is available with the help of the CPAN
framework\footnote{http://www.cpan.org}, as well as Debian/Ubuntu DEB
packages.
The software is released in the open-source license GPL and is
available from \url{https://launchpad.net/trepid} .
\section{Implementation details}
\label{sec:implementation}
\subsection{Population models}
\label{sec:impl_pop_models}
\subsubsection{Implementation of the logistic model}
We will use the logistic equation as an example to describe the
implementation because of its simplicity; the implementation of the
Hassel model is analogous.
The differential equation is discretized as a single iteration of
Euler's method, with a step size $h=1$. This results in formula
\eqref{eq:logistic_implem}:
\begin{equation}
\label{eq:logistic_implem}
p_{n+1} - p_n = \Delta p = Ceiling \left(
\frac{r \times p \times \left(1-\frac{p}{K}\right)}
{\hbox{offspring count}}
\right)
\end{equation}
where $Ceiling(x)$ ($Floor(x)$) is the smallest (greatest) integer