-
Notifications
You must be signed in to change notification settings - Fork 0
/
ch05.tex
425 lines (388 loc) · 25.7 KB
/
ch05.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
\chapter{Computer simulation methods}
\section{Introduction}
Common diseases such as coronary heart disease have a complicated aetiology
which involves, among other things, genetic heterogeneity, epistasis,
pleiotropy and gene-environmental interaction, which underlies most theoretical
explorations. For example, the RFA proposal for GAW 12 problem II delineated
such a system in which gene and environmental effects operate together on
disease susceptibility, and the major etiological components underlying the
system are simplified into five individual quantitative traits. Nevertheless a
particular feature often overlooked by most such computer programs is LD.
Computer simulation can also be used for practical problems. Perhaps the most
established application is power analysis of a linkage study via computer
programs {\bf SLINK} and {\bf SIMLINK}. Given certain family structures and
genetic markers, simulation-based statistics are obtained by generating
replicates of the observed families structures conditioning on available
phenotype information. Assuming that each replicate occurs with equal
probability, expected lod score is simply the average of lod scores of all
replicates. This quantity at the true recombination value is designated as
ELOD. The maximum of expected lod score (MELOD) over different recombination
values and the expected maximum lod score (EMLOD) can also be obtained. MELOD
may not be ELOD if the estimated recombination value is inconsistent due to
sampling schemes or the wrong assumption of parameter values. EMLOD is an
average of the maximum lod scores obtained in each replicate. EMLOD of a
pedigree conditional on the heterozygosity/homozygosity status of each pedigree
member provides means of identifying pedigree member(s) whose marker status
have a strong impact on the linkage information provided by the pedigree, which
yields a large difference in the EMLOD for his/her pedigree depending on
whether or not (s)he is marker heterozygous. By definition ELOD $\le$ MELOD
$\le$ EMLOD (Terwilliger \& Ott 1994; Ott 1999).
\subsection*{Chapter aims}
This Chapter highlights two scenarios of computer simulation. First, a general
procedure for simulating oligogenic trait, taking into account LD, is
introduced. The disease model incorporates major gene effects as well as
polygenic and environmental effects with genetic loci in linkage equilibrium or
disequilibrium. This has been used to study power for a linkage disequilibrium
problem (unpublished data). Second, statistical power analysis using computer
simulation is applied to two examples of homozygosity mapping (Smith 1953;
Lander \& Botstein 1987).
\section{Methods, implementations and results}
\subsection*{Simulation of oligogenic traits}
A genetic model incorporating major gene, polygenic and environmental effects
is as follows:
\begin{eqnarray}
\mathbf{x}=\mathbf{g}+\mathbf{p}+\mathbf{c}+\mathbf{e}
\end{eqnarray}
where $\mathbf{x}$ is a multivariate trait ($x_i,i=1,...,NTRAIT$), a result of
additive effects of major loci $\mathbf{g}$
($g_j,j=1,...,NMG$), polygenic effects $\mathbf{p}$
($p_k,k=1,...,NPG)$, common environment $\mathbf{c}$
($c_l,l=1,...,NCE$) and unique environment $\mathbf{e}$
($e_m,m=1,...,NUE$). The variables $\mathbf{g}, \mathbf{p}, \mathbf{c},
\mathbf{e}$ are collectively referred to as causal components of the trait.
The $NMG$ major loci are a subset of $NLOCI$ loci for which the order of loci,
their recombination fractions ($\theta_1,\theta_2,...,\theta_{NLOCI-1}$), the
number of alleles and allele frequencies of them are specified. The meaning of
some model constants are summarised in Table~\ref{simcons}.
\begin{table}[h]
\centering
\caption{Some model constants\label{simcons}}
\begin{tabular}{ll}
\\
\hline
Constants & meaning\\
\hline
$MAXFAM$ & maximum \# of families\\
$MAXIND$ & maximum \# of individuals within a family\\
\\
$MAXLOCI$ & maximum \# of loci\\
$MAXALLELES$ & maximum \# of alleles at each locus\\
$NTRAIT$ & \# of traits\\
$NLOCI$ & \# of loci\\
$NMG$ & \# of major genes\\
$NPG$ & \# of polygenes\\
$NCE$ & \# of common environments\\
$NUE$ & \# of unique environments\\
$\beta$ & matrix of regression coefficients for each trait\\
\\ \hline
\end{tabular}
\end{table}
The polygenic effect of an offspring, conditional on those of parents
($p_F,p_M$), is $p_o={{{(p_F+p_M)}/{2}}+{u/\sqrt{2}}}$, $u\sim N(0,1)$ is
specific for each offspring. Suppose the degree of transmission of parental
shared environment to offspring is $k$, then under no assortative mating, the
common environment for offspring is $k (c_F+c_M)+v\sqrt{(1-2k^2)}$, where
$v\sim N(0,1)$ is the same within the whole sibship. The valid range for $k$
is $(0,1/\sqrt{2})$. The unique environment is different among individuals and
is $N(0,1)$. All latent variables ($\mathbf{g}$, $\mathbf{p}$, $\mathbf{c}$,
$\mathbf{e}$) have mean 0 and variance 1.
The regression equation of trait $x$ on latent variables is
\begin{eqnarray}
x_i&=&\beta_{ig_1}g_1+\cdots+\beta_{ig_{NMG}}g_{NMG}
+ \beta_{ip_1}p_1+\cdots+\beta_{ip_{NPG}}p_{NPG} \cr
&+&\beta_{ic_1}c_1+\cdots+\beta_{ic_{NCE}}c_{NCE}
+ \beta_{ie_1}e_1+\cdots+\beta_{ie_{NUE}}e_{NUE}
\end{eqnarray}
The mean and variance of $x_i$ are then 0 and $\sum \beta_{ij}^2, j=1,
..., NMG + NPG + NCE + NUE$. To ease the specifications, the input
$\beta$'s can be standardised by this to yield unit trait variance.
It is possible to generate disequilibrium between multiallelic markers
systematically. Without loss of generality, consider two marker loci, with
alleles $m$ and $n$, allele frequencies $p_1,p_2,...,p_m$, $q_1,q_2,...,q_n$,
we can then arrange the $m\times n$ haplotype frequencies into a contingency
table, with some cells taking values specfied by the user. The $m\times n$
contingency table $\chi^2$ statistic becomes $\chi^2=\sum_{i=1}^m\sum_{j=1}^n
{h_{ij}^2}/{p_iq_j}-1$. The $\chi^2$ statistic can be rewritten as a familiar
quadratic form. let ${\bf w}$ be the $m\times n$ diagonal matrix of direct
product of the table marginals, the $\chi^2$-squared statistic is ${\bf h}{\bf
w}^{-1}{\bf h}'-1$, where {\bf h} contains the actual haplotype frequencies.
Under Hardy-Weinberg equilibrium, the haplotype frequencies is obtained via
numerical optimisation in a SAS program, with the constraints of haplotype
frequency estimates being nonnegative. For $m=n=2$, the haplotype frequencies
can be parameterised by allele frequencies and a single linkage disequilibrium
parameter $D$, to be $p_1q_1+D$, $p_1q_2-D$, $p_2q_1-D$ and $p_2q_2+D$. The
constraints are $\max(-p_1q_1,-p_2q_2)=-\min(p_1q_1,p_2q_2)\leq D\leq
\min(p_1q_2,p_2q_1)$, $p_1q_1+D\leq p_1$ and $p_2q_2+D\leq q_2$, $D\leq
\min(p_2q_1,p_1q_2)$ to ensure nonnegative cell and marginal probabilities. In
summary, -$\min(p_1q_1,p_2q_2)\leq D \leq \min(p_1q_2,p_2q_1)$. With another
biallelic marker in equilibrium with the second marker, with allele frequencies
$r_1$, $r_2$, the haplotype frequencies will be $(p_1q_1+d)r_1$,
$(p_1q_1+d)r_2$, $(p_1q_2-d)r_1$, $(p_1q_2-d)r_2$, $(p_2q_1-d)r_1$,
$(p_2q_1-d)r_2$, $(p_2q_2+d)r_1$ and $(p_2q_2+d)r_2$. We first get parental
haplotypes from population frequencies and then proceed to generate the
genotypes for children. Some loci are assigned as major loci with dominance
parameters. The disequilibrium is simulated if the disequilibrium matrix of
haplotype frequencies is specified.
The implementation is made compatible with pedigree files similar to {\bf
LINKAGE} and {\bf SIMULATE}. Details are givein in Figure~\ref{simfig1}.
\begin{figure}[h]
\begin{tabular}{llc}
\\
&&\framebox{simulate/read pedigree}\\
&&$\downarrow$\\
\framebox{model parameters}&$\longrightarrow$&\framebox{SIM}\\
&&$\downarrow$\\
&&\framebox{pedgree file}\\
&&$\downarrow$\\
&&\framebox{selection}\\
&&$\downarrow$\\
&&\framebox{final pedigree file}
\end{tabular}
\caption{Flowchart of SIM\label{simfig1}}
\end{figure}
Step \framebox{SIM} processes one family at a time, so that certain selection
procedure can be adopted easily. For simulation of genotypes, the usual way of
obtaining allele in a specific locus was described in Weir (1990). Consider a
4-allele locus with allele frequencies $p_1, p_2, p_3, p_4$, when drawing a
(pseudo)random number from [0,1] we could divide the unit interval into four
segments, with boundaries $0, p_1, p_1+p_2, p_1+p_2+p_3, 1$, these segments
therefore have lengths $p_1, p_2, p_3, p_4$. We assign allele number $i$ if
the random number falls within the $i$th interval. Similar logic is applicable
to simulation of recombination events. A comprehensive C library, RANLIB, is
used for generating random variables. Numerical Recipes (Press et al. 1992)
utility nrutil.h/nrutil.c is used to create dynamic arrays and matrices. A
compilation control file Makefile has been created and tested under Sun Solaris
and DEC Alpha with GNU C compiler, PC with Symantec C++ and Cygwin. Prolix
output could be obtained by commenting on \#undef DEBUG statement in
include/sim.h. A program {\bf diseq} is provided to simulate nuclear families
assuming linkage disequilibrium. The sibship size distribution is based on the
popular negative binomial distribution or truncated Poisson distribution, i.e.,
a particular class cannot be observed and eliminated from the sample space
(Casella \& Berger 1990, p123; Sham et al. 1997). Thus
$P(X_T=x)={P(X=x)}/{P(X>0)}, ~x=1,2,\ldots,$ is the truncated distribution
where class 0 cannot be observed. With Poisson distribution we have
$P(X=0)=\mu$, the truncated distribution is
$P(X_T=x)={\mu^xe^{-\mu}}/{x!(1-\mu)}, ~x=1,2,\ldots.$
A version of {\bf CHRSIM}, a chromosome-based simulation program, {\bf
SIMULATE}, a program for data simulation under no linkage, together with the
{\bf LINKAGE} utility routine makeped.c have been included as separate
programs. The method in {\bf CHRSIM} has been extended to allow inclusion of
random markers based on recent estimates of chromosome lengths and Genethon map
(Dib et al. 1996, ftp://ftp.genethon.fr/pub/Gmap/Nature-1995/), which would
make it more flexible and suitable for genome scan research.
\subsection*{Homozygosity mapping}
Familial hemophagocylic lymphohistiocytosis (FHL), also known as familial
erythrophagocytic lymphohistiocytosis and familial histiocytic reticulosis, is
a rare Mendelian recessive disorder of early childhood characterised by
excessive immune activation. The population prevalence of the disorder is
approximately 1/50000, so that the disease allele frequency would be 0.00472.
It is that expected markers with 5 equal frequent alleles and heterozygousity
of 80\% is a good approximation. There are five families available (see
Figures~\ref{k1}-Figure~\ref{k5}). The availability of genotyping is 4, 7-9,
12-17 for family 1, 7-8, 11-20 for family 2 8-13 for family 3 17-26, 29-32,
36-39 for family 4 and 18-19 for family 5.
\fig{4.5truein}{1.ps}{Familial hemophagocylic lymphohistiocytosis family 1}{k1} %4.5
\fig{4.5truein}{2.ps}{Familial hemophagocylic lymphohistiocytosis family 2}{k2} %4.5
\fig{4.5truein}{3.ps}{Familial hemophagocylic lymphohistiocytosis family 3}{k3} %4.5
\fig{5.5truein}{4.ps}{Familial hemophagocylic lymphohistiocytosis family 4}{k4} %6.5
\fig{4.5truein}{5.ps}{Familial hemophagocylic lymphohistiocytosis family 5}{k5} %4.5
%\ffig{3.5truein}{1.ps}{k1}
%\ffig{3.5truein}{2.ps}{k2}
%\ffig{3.5truein}{3.ps}{k3}
%\ffig{3.5truein}{4.ps}{k4}
%\ffig{3.5truein}{5.ps}{k5}
Surprisingly with such an ordinary simulation problem, it took 58 hours on DEC
Alpha to get a replicate using {\bf SLINK}. So only 100 replicates would take
almost a year for data simulation! To get around this the {\bf SIMULATE} is
used to get many replicates, family by family, keeping those replicates whose
phenotypes of each family member matched the observed phenotypes ({\bf
SIMULATE} simulates under no linkage without conditioning on the phenotypic
information of other family members, so genotypes of three loci are simulated
and the middle locus is treated as disease locus and used for phenotype
matching).
The simulated data is experimented with {\bf LINKAGE}, but owing to presence of
many loops, it is quite slow. Because of the loops, the current {\bf VITESSE}
could not be used. {\bf GENEHUNTER} is used and individuals that are less
informative for linkage are removed from the analysis. The lod scores for the
first four pedigrees over 100 replicates are 117.1, 78.6, 91.8, 95.3 for 5cM.
As it is very slow, some results are estimated rather than obtained from real
simulation and computation, for example, pedigree 5 is assumed to have lod
score at least that of pedigree 2. Under 10cM, lod score for pedigree 1 is
92.7, so roughly the lod scores from 5cM to 10cM would be scaled down by a
factor of 0.8. Then ELODs for 5cM and 10cM are then 4.614 and 3.691,
respectively. This corresponds to $\chi^2$ of 17 for 10cM and a noncentrality
parameter of 16. Given that we obtained the lod score distribution under
linkage, a lod score of 3 corresponds to $\chi^2$ of 13.8 and a power of 61\%,
compared to a lod score of 2 with of power 83\% (These can be verified via
Stata as 1-nchi2(1,16,13.8) and 1-nchi(1,16,9.2)).
The simulation guarantees the proceeding of genotyping, and the linkage of
familial homophagocytic lymphohistiocytosis to 9q21.3-22 has been reported in
Ohadi et al. (1999): `` {\em Linkage of the disease gene to a region between
markers D9S1867 and D9S1790 at 9q21.3-22 is identified by homozygosity mapping
in four inbred FHL families of Pakistani descent with a combined maximum
multipoint LOD score of 6.05. This is the first genetic locus to be described
in FHL. However, homozygosity by descent across this interval could not be
demonstrated in an additional affected kindred of Arab origin, whose maximum
multipoint LOD score was 20.12. The combined sample revealed significant
evidence for linkage to 9q markers (LOD score with heterogeneity, 5.00)''}.
The second example is a leukemia family (Figure~\ref{k6}, Dr Layton, personal
communication). A Mendelian recessive model is also assumed, the gene
frequency is about 0.001. Neither {\bf SIMULATE} nor {\bf SLINK} is fast.
With 10cM distance between the flanking marker loci the lod score is
approximately 1 using 20 replicates of {\bf SLINK}. Linkage analysis is not
pursed owing to this result.
\fig{4.5truein}{layton.ps}{The Leukemia family}{k6}
\section{Discussion}
Simulation of oligogenic traits can be seen as a two-step procedure. First, a
microsimulation or demographic simulation is used to get the basic study units,
typically either nuclear or extended families. Second, chromosome-based marker
loci are generated. Characterisation of these markers is desirable in order to
represent features of the human genome and will be served as basis of marker
selection. Several programs have been developed for GAWs (Genetic Analysis
Workshop) 9 and 10 (e.g. {\bf POPGEN}, {\bf MARKCHAR}, {\bf CHRSIM}). Unlike
previous GAWs, GAW 11 started to consider disequilibrium information in the
simulated data. When families are involved, this information is mostly among
parents, which can be estimated using standard EM algorithm when phase
information is unavailable. In fact there are many mechanisms of association
so the actual simulation may appear in different forms. Other aspects may be
featured. For instance, the RFA for GAW12 described a metabolic model with
gene effects operating at different stages and manifest clinical outcomes, and
may be highly suggestive of survival analysis and involvement of other
particular groups. The events are rare and thus needs stringent ascertainment.
It has been shown that chromatid interference does exist and a proper account
of crossover interference in gene mapping methods may increase their
efficiency. It is therefore desirable to have it incorporated in actual data
simulation. Simulation is as useful for validity-checking as for power
analysis of test statistics. On both occasions, multiple replicates of a given
sample are generated assuming different model parameters and appropriate
statistics are obtained. A general discussion, including a population
simulation program, {\bf POPSIM} was given by Hampe et al. (1998).
Homozygosity mapping assumes that a rare recessive disorder running in
consanguineous families is due to homozygosity by descent (HBD). While
powerful, it poses a heavy computational problem, so the analysis is most
likely to be good instance for MCMC. Currently no such a program is available.
A brief comparison of {\bf SLINK} with {\bf SIMLINK} is useful. For a given
problem {\bf SIMLINK} obtains conditional probabilities of trait genotypes for
each pedigree member, a replicate of each of the user-supplied pedigree(s), and
lod/location scores for the replicate and their summary statistics. These
probabilities are calculated by conditioning on the trait genotypes of their
relatives. Simulations are carried out at the specified true recombination
fractions for one marker locus or at the recombination fractions corresponding
to the specified map distance for two flanking marker loci. For linked
markers, {\bf SIMLINK} obtains the ELOD/location score assuming homogeneity or
heterogeneity, and the probability of a maximum lod/location score greater than
specified constants. For unlinked markers, it gives ELOD/expected location
score for several test recombination fractions/map distances, and the
probability of a lod/location score greater than these constants. These give
an estimate of the exclusion region when testing for linkage to an unlinked
marker (pair). Estimating the probability of a maximum lod/location score
greater than 3 for a true recombination fraction of 0.5 gives the probability
$\alpha$ of incorrectly concluding linkage to an unlinked marker (pair). The
overall probability of type I error can be conservatively corrected if many
(pairs of flanking) markers are considered. Assuming that the linkage
calculations for the different (pairs of flanking) markers are independent, the
overall probability of making a type I error becomes $1 - (1 - \alpha)^m$,
where $m$ is the number of (pairs of flanking) markers. Data from {\bf
SIMLINK} can only be analysed under the simulated model (Terwilliger \& Ott
1994). Both {\bf SIMLINK} and {\bf SLINK} can generate data under linkage
heterogeneity. However if families generated with heterogeneity are analysed
assuming homogeneity, the resulting lod scores are too low and do not
correspond to the ELOD under heterogeneity; also, the estimates of $\theta$ are
biased upwards. When generating data under heterogeneity, the easiest solution
for analysis is to focus on ELOD; working with the probability that the maximum
lod score exceeds a given threshold requires more complicated manipulations.
{\bf SLINK} was based on an algorithm of Ott (1989). Suppose there are $N$
people in a pedigree, $\mathbf{x} = (x_1, x_2,\ldots,x_N)$ is the vector of
phenotypes, $\mathbf{g} = (g_1, g_2,\ldots, g_N)$ the vector of multilocus
genotypes including phase information, and $P(g_i|\mathbf{x}) = {P(x_1, \ldots,
x_i, g_i, x_{i+1}, \ldots, x_N)}/{P(\mathbf{x})}$, where the denominator is the
likelihood of the entire set of pedigree data, and the numerator is the
likelihood of the pedigree data given that individual $i$ has genotype $g_i$,
then the conditional probability distribution of the genotypes given the
phenotypes may be calculated by a series of successive risk calculations
$P(\mathbf{g}|\mathbf{x}) = P(g_1|\mathbf{x}) P(g_2|g_1,\mathbf{x})
P(\mathbf{g}|g_1, g_2, \mathbf{x})\ldots$. {\bf SLINK} calculates these
probabilities (or risks) of all the possible multilocus genotypes with phase
$P(g_1|\mathbf{x})$ and, based on these, randomly assigns one of the genotypes
to person 1. Then $P(g_2|g_1, \mathbf{x})$ is calculated, taking into account
the genotype $g_1$ just generated, and randomly assign a genotype to person 2,
and so on. The process of successive risk calculations continues until all
individuals in the pedigree have been assigned a multilocus genotype. This
approach is efficient since once a multilocus genotype has been assigned to one
individual, it is known in all subsequent steps. This algorithm permits
simulation conditioning on any combination of phenotypic data. For example, if
the pedigree were partially typed at a marker of interest, one could simulate
conditional on both the disease phenotypes and the marker data currently
available. {\bf SLINK} allows one to carry out power calculations by
simulating genotypes at one locus given the phenotypes at another locus linked
with the first one. There are also faster and parallel versions of {\bf SLINK}
called {\bf FASTSLINK} and {\bf PSLINK}.
\section{Bibliographic notes}
{\bf GASP} uses a similar method for family data generation but does not
consider disequilibrium. There have been considerable interest in association
study using theory of population genetics, where simulation has been common
practice, for example, Kruglyak (1999), Collins et al. (1999). A simulation
method for data analysis has been elegantly implemented in {\bf MORGAN}, which
was designed for genetic analysis of both qualitative and quantitative traits
and has recently incorporated module {\bf Genedrop} for data simulation. It
features MCMC in a number of computer programs: {\bf PolyEM}, multiple trait
polygenic model with additive effects, {\bf MIXED}, mixed model (major
Mendelian genes plus polygenic additive effects), {\bf Multipnt}, Mendelian
quantitative trait model with Mendelian markers, {\bf PBGibbs}, Mendelian
qualitative trait model, {\bf Autozyg}, multiple Mendelian loci (using meiosis
indicators) {\bf Checkers}, pedigree checking, {\bf Loki}, MCMC linkage
analysis. A useful expression for constraints over three-locus disequilibria
was given by Ayre \& Balding (2001).
While actual mechanism for a particular problem may be quite specific, a
general program has many factors to take into account. One example is
family-size distribution. Cavalli-Sforza \& Bodmer (1971) examined progeny
size as a measure of fertility, one of the major components of biological
fitness. It is in turn measured, crudely, by the mean progeny size after the
reproductive period has finished. If all women had the same constant chance of
bearing children over a given fixed time period, births would be distributed
randomly throughout this period and the distribution of their number would be
Poisson. It can be shown that the negative binomial distribution is equivalent
to a particular type of mixture of Poisson distributions with different means.
More exactly, the means must be distributed as gamma distribution. In their
example, a survey of 307 women, the distribution of observed progeny sizes
followed negative binomial distribution given by the terms of $(q-p)^{-n}$,
where $q=1+p$ and $p$ is positive, $p_r=q^{-n}[(n+r-1)!]/[r!(n-1)!](p/q)^r,
~r=0,1,\ldots,$ so $p_0=q^{-n}, p_1=q^{-n}np/q$, etc. The mean and variance of
this distribution is $np$ and $np(1+p)$. They found $n=5.37$ and $p=0.629$ fit
the data well. Ewens (1982) \& Morton (1982) showed that family-size
distribution should not affect the result of segregation analysis, the
distribution of family sizes can affect power calculations because a data set
containing 100 two-child families contains less information than, say 100
five-child families. Suarez \& Van Eerdewegh (1984) used a geometric sibship
size distribution $f(k)=p(1-p)^{k-2}=0.4551(1-0.4551)^{k-2}, k\ge 2$ so that
two siblings occur at less than 50\% of the times, three siblings at less than
25\% of the time, etc. However, Ewens (1991) gave more justifications why
family size distribution is not usually used in a likelihood function involving
genetic parameters. Two alternative forms of negative binomial are given here.
Suppose that we count the number of Bernoulli trials $x$ required to get a
fixed number of successes $r$ then we get negative binomial distributions
$P(X=x|r,p)=\pmatrix{x-1\cr r-1}p^r(1-p)^{x-r}$. Since \{$X=x$\} can only
occur when there are $r-1$ successes in the first $x-1$ trials and a success on
the $x$th trial which is the product of probability $P(X=x|r,p)=\pmatrix{x-1\cr
r-1} p^{r-1}(1-p)^{x-r}$ and $p$. If we define in terms of the number of
failures before the $r$th success in $X$ trials and $Y=X-r$ and
$P(Y=y)=\pmatrix{r+y-1\cr y}p^r(1-p)^{y}=(-1)^y\pmatrix{-r\cr y}p^r(1-p)^y,
~y=0,1,\ldots$. The mean and variance of this distribution are $\mu=r(1-p)/p$
and $r(1-p)/p^2=\mu+\mu^2/r$. When $r\rightarrow \infty, p\rightarrow 1,
r(1-p)\rightarrow \mu, 0< \mu <\infty$ then the mean and variance agree with
Poisson distribution. If $r=1$, we have geometric distribution
$P(X=x)=p(1-p)^{x-1}$ with the mean $1/p$ and variance $(1-p)/p^2$ (Casella \&
Berger 1990, pp 95-97, 98). Finally, the negative binomial distribution can be
obtained as a gamma mixture of Poisson distributions (Knight 2000, p101). The
density function of gamma distribution is $f(x)
=1/[\Gamma(\alpha)\beta^{\alpha}]x^{\alpha-1}e^{-x/\beta}, ~~ 0<x<\infty,
~~\alpha>0, ~~\beta>0 $, where $\Gamma(\alpha)=\int\limits_0^\infty
x^{\alpha-1}e^{-x}$ is the Gamma function, and $\alpha$, $\beta$ are the shape
and scale parameters of gamma distribution. This distribution has mean
$\alpha\beta$ and variance $\alpha\beta^2$. It is clear that the central
$\chi^2$ distribution is just $\Gamma(p/2,1/2)$, and $p$ is the degree of
freedom. Now let $P(x)$ be a Poisson frequency function with mean $\lambda$
and $f(\lambda)$ be a gamma distribution with mean $\mu$ and variance
$\mu^2/\alpha$, the mixture distribution has frequency function
$g(x)=[\Gamma(x+\alpha)]/[x!\Gamma(\alpha)] [\alpha/(\alpha+\mu)]^\alpha
[\mu/(\alpha+\mu)]^x$.