Dynamics of Microsatellite Divergence Under Stepwise Mutation and Proportional Slippage/Point Mutation Models
Peter P. Calabreseᵃ, Richard T. Durrettᵇ, and Charles F. Aquadroᶜ
a Department of Applied Mathematics, Cornell University, Ithaca, New York 14853
b Department of Mathematics, Cornell University, Ithaca, New York 14853
c Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853
Corresponding author: Richard T. Durrett, Department of Mathematics, 523 Malott Hall, Cornell University, Ithaca, NY 14853. E-mail: rtd1{at}cornell.edu
Communicating editor: S. TAVARÉ
| ABSTRACT |
|---|
Recently Kruglyak, Durrett, Schug, and Aquadro showed that microsatellite equilibrium distributions can result from a balance between polymerase slippage and point mutations. Here, we introduce an elaboration of their model that keeps track of all parts of a perfect repeat and a simplification that ignores point mutations. We develop a detailed mathematical theory for these models that exhibits properties of microsatellite distributions, such as positive skewness of allele lengths, that are consistent with data but are inconsistent with the predictions of the stepwise mutation model. We use our theoretical results to analyze the successes and failures of the genetic distances (δµ)² and DSW when used to date four divergences: African vs. non-African human populations, humans vs. chimpanzees, Drosophila melanogaster vs. D. simulans, and sheep vs. cattle. The influence of point mutations explains some of the problems with the last two examples, as does the fact that these genetic distances have large stochastic variance. However, we find that these two features are not enough to explain the problems of dating the human-chimpanzee split. One possible explanation of this phenomenon is that long microsatellites have a mutational bias that favors contractions over expansions.
MICROSATELLITES are simple sequence repeats in DNA that typically have a high level of variability due to a high rate of mutations that alter their length. For this reason they have been useful for studying population structure on the time scale of thousands of generations (see ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
µ)2 of ![]()
![]()
![]()
We examine the behavior of two genetic distances, (δµ)² and DSW, in four increasingly divergent examples: (i) African vs. non-African human populations, (ii) human vs. chimpanzee, (iii) Drosophila melanogaster vs. D. simulans, and (iv) cattle vs. sheep. If one assumes the stepwise mutation model (SMM) of ![](), then (δµ)² grows linearly in time. When used on example (i), the statistic (δµ)² gives good estimates (see ![]()). The statistic DSW is not as good as (δµ)² at dating the human population split but performs slightly better for examples (ii) and (iii), yielding estimates that are about one-third and one-eighth of the commonly accepted values.
Finally, in example (iv), the two species are too far diverged for microsatellites to be useful molecular clocks. Results of ![]()
![]()
repeat units and then increases linearly. The PS/PM model can be used to estimate slippage rates from DNA sequence data, but to address the divergence question we need a second model, called the PCR model, that keeps track of the lengths of all perfect repeats that make up an imperfect repeat.
The PCR model is complicated, but it is possible to obtain a simple formula for the variance of a repeat length $L_t$ as a function of time t in generations (see Theorem 2). Using a = 2 × 10⁻⁸ as an estimate for the point mutation rate per repeat unit and a threshold of four repeat units for slippage events to be possible, this formula shows that the variance of the repeat length begins to depart from linearity when t/(10,000,000) is not small relative to one. This result explains some of the problems with the use of (δµ)² in the comparison of D. melanogaster and D. simulans, which diverged ~25,000,000 generations ago, but makes the failure of (δµ)² in the human vs. chimpanzee split even more mysterious, since, as our calculations show, point mutations will not have had a significant effect in 250,000 generations.
To further investigate the problems in dating the human vs. chimpanzee split, we investigated the behavior of the PS/PM model when there are no point mutations. This special case, called the PS/0M model and denoted $A_t^0$, is equivalent to the binary branching process of probability theory, so it is possible to do many exact calculations. Theorem 3 gives expressions for the first four moments of $A_t^0$. Expressions for the third moment show that the distribution of $A_t^0$ has a positive skewness, which contrasts with the symmetric distributions of the SMM, but is consistent with the skewness observed in microsatellite data.
Calculations for the fourth moment show that if β is the per locus slippage rate and α is the initial activity of a microsatellite, i.e., the length minus the threshold κ for slippage to occur, then the kurtosis becomes large when βt/α² is large relative to one. In general, fourth moments of the microsatellite lengths are larger under the PS/0M model than under the SMM. Consequently, microsatellite statistics that use these moments, such as those of ![]()
![]()
(δµ)² and DSW in dating the human-chimpanzee split. The last observation and the fact that the simulated microsatellite distributions given in Figures 1 and 3 have many more large microsatellites than are typically observed lead us to conclude that there are forces constraining growth that are not yet incorporated into our models. We return to this point in the DISCUSSION.
| GENETIC DISTANCES |
|---|
Our first step is to define the two genetic distances (δµ)² and DSW and to compute their values for the four examples. We then introduce our two new models, state the theoretical results we have obtained, and use them to study the four examples. To define (δµ)², let µA and µB be the mean length of alleles at a microsatellite locus in populations A and B, and define the genetic distance between the two populations [see (1) of GOLDSTEIN et al. 1995b] as

$$(\delta\mu)^2 = (\mu_A - \mu_B)^2.$$

Given data, the distance is estimated by the corresponding statistic

$$\widehat{(\delta\mu)^2} = (\hat{\mu}_A - \hat{\mu}_B)^2,$$

where $\hat{\mu}_A$ and $\hat{\mu}_B$ are the average lengths observed in samples from populations A and B.
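To motivate the definition of our second distance, we recall (see, e.g., p. 6723 of ![]()) that if X and X' are the lengths of two alleles chosen at random from population A, and Y and Y' are the lengths of two alleles chosen at random from population B, then

$$(\delta\mu)^2 = E(X - Y)^2 - \tfrac{1}{2}\left[E(X - X')^2 + E(Y - Y')^2\right],$$

where in each case we take the square before computing expected value. Replacing the squares in the last formula by absolute values, we can follow ![]() and define

$$D_{SW} = E|X - Y| - \tfrac{1}{2}\left[E|X - X'| + E|Y - Y'|\right].$$

Given microsatellite lengths $X_1, \ldots, X_m$ from population A and $Y_1, \ldots, Y_n$ from population B, DSW is estimated by

$$\hat{D}_{SW} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}|X_i - Y_j| - \frac{1}{2}\left[\binom{m}{2}^{-1}\sum_{i<i'}|X_i - X_{i'}| + \binom{n}{2}^{-1}\sum_{j<j'}|Y_j - Y_{j'}|\right].$$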
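Suppose that microsatellites follow the SMM of ![]() with per locus slippage rate β, and that the two populations diverged τ generations ago; then

$$(\delta\mu)^2 = 2\beta\tau. \tag{1}$$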
![]()
generation ago.
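THEOREM 1: If 2βτ is large and τ ≥ Ne then

$$D_{SW} \approx \left(\frac{2(2\beta\tau + 4\beta N_e)}{\pi}\right)^{1/2} - \frac{4\beta N_e}{(1 + 8\beta N_e)^{1/2}}. \tag{2}$$

When τ >> Ne, the terms involving Ne can be dropped and

$$D_{SW} \approx \left(\frac{4\beta\tau}{\pi}\right)^{1/2},$$

so in the long run DSW grows like a constant times τ^{1/2}.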
| FOUR EXAMPLES |
|---|
To test the behavior of the statistics (δµ)² and DSW we consider four increasingly divergent examples.
Divergence of human populations:
![]()
(δµ)² between African and non-African populations was 6.47. Using this in (1) with their mutation rate estimate of 5.6 × 10⁻⁴ gives a prediction of 5776 generations for the divergence time. Assuming a human generation time of 27 years, they then arrived at the estimate of 156,000 years, a figure that they argued was in agreement with previous genetic estimates and with archaeological data.
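The arithmetic behind this estimate can be sketched as follows, assuming that relation (1) takes the linear form (δµ)² = 2βτ, which is consistent with the numbers quoted above. The function name is hypothetical.

```python
def smm_divergence_generations(delta_mu_sq, beta):
    """Invert (delta mu)^2 = 2 * beta * tau for tau, in generations."""
    return delta_mu_sq / (2.0 * beta)


# Values quoted in the text: (delta mu)^2 = 6.47, beta = 5.6e-4,
# and a human generation time of 27 years.
tau = smm_divergence_generations(6.47, 5.6e-4)
years = tau * 27
print(tau, years)
```

The same function reproduces the later SMM-based estimates when given the corresponding (δµ)² and slippage rate.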
![]()
![]()
![]()
Humans vs. chimpanzees:
![]()
= 3.78, was surprising since the ratio of the divergence times for the two splits is at least 50. The nonlinearity of DSW shown in Theorem 1 helps explain this discrepancy. If we use the slippage rate of β = 5.6 × 10⁻⁴ from the previous example for both humans and chimpanzees and assume an effective population size of Ne = 10⁴ for each population, then using Theorem 1 we arrive at an estimate of τ = 88,200 generations for their divergence time. If we use an average generation time of 20 years for humans and chimpanzees this translates into 1.76 million years, about one-third the accepted estimate of 5–6 million years (see, e.g., ![]()
![]()
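The inversion of Theorem 1 can be sketched numerically. The block below assumes that (2) has the form DSW ≈ [2(2βτ + 4βNe)/π]^{1/2} − 4βNe/(1 + 8βNe)^{1/2}, which follows from the derivation in Appendix A, and it uses the observed human-chimpanzee value DSW = 5.475 reported with the coalescent simulations. The function names and the bisection bounds are illustrative choices.

```python
import math


def d_sw_expected(tau, beta, n_e):
    """Approximate expected D_SW under the SMM (assumed form of Theorem 1)."""
    # Mean number of slippage events separating two alleles, E(U + V).
    mean_events = 2.0 * beta * tau + 4.0 * beta * n_e
    # Within-population term, E|S_U|.
    within = 4.0 * beta * n_e / math.sqrt(1.0 + 8.0 * beta * n_e)
    return math.sqrt(2.0 * mean_events / math.pi) - within


def solve_tau(d_sw_obs, beta, n_e, lo=0.0, hi=1e9, iters=200):
    """Bisection on tau; d_sw_expected is increasing in tau."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if d_sw_expected(mid, beta, n_e) < d_sw_obs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


tau = solve_tau(5.475, 5.6e-4, 1e4)
print(tau, tau * 20 / 1e6)  # generations, and millions of years at 20 years/generation
```

Because DSW grows like τ^{1/2}, small changes in the observed statistic translate into large changes in the estimated divergence time.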
Since ![]()
µ)2. ![]()
![]()
![]()
(δµ)² values of 7.56, 86.19, and 40.19, respectively. Even though the second estimate is >11 times the first, we can use all 25 loci in Table 1 together to get (δµ)² ≈ 40. Using (1) now with the slippage rate estimate β = 5.6 × 10⁻⁴ gives 35,700 generations, or ~700,000 years, which is less than one-seventh the accepted age.
Assuming the SMM and that the above parameters remain constant, coalescent simulations show that the (δµ)² and DSW estimates are significantly smaller than those expected under the SMM. Specifically, for two samples of 20 individuals with 25 unlinked microsatellites in two separate random-mating populations of size 10⁴, which separated 275,000 generations ago, and with mutations following the SMM with β = 5.6 × 10⁻⁴, we expect a 95% confidence interval for (δµ)² of 179–465, whereas the observed value was only 40, and a 95% confidence interval for DSW of 7.97–14.6, whereas the observed value was 5.475.
Drosophila species:
The divergence between D. melanogaster and D. simulans is estimated to have occurred ~2.5 million years ago (see ![]()
![]()
(δµ)² = 19.393 between these species. Using the mutation estimate of 6.3 × 10⁻⁶ from ![]()
![]()
One of the problems with this estimation is that tri- and tetranucleotide repeats have considerably smaller slippage rates than dinucleotide repeats in Drosophila (see ![]()
![]()
(δµ)² for these loci is 16.09. Using the estimate β = 9.3 × 10⁻⁶ from ![]() in (1) gives ~865,000 generations. Using the previous estimate of 10 generations per year, this translates into 86,500 years, which is about one-thirtieth of the estimate of ![]()
Independently, ![]()
(δµ)² to estimate the divergence times in the phylogeny of D. melanogaster, D. simulans, D. sechellia, and D. mauritiana. From the possible choices of the mutation rate β they list, we choose 10⁻⁵, which is the closest to that of ![]()
![]()
Our second statistic DSW does much better on the data set of ![](), yielding an estimate of τ = 3,330,000 generations. With 10 generations a year this becomes 330,000 years, which is about one-eighth of the estimate of ![]()
Again coalescent simulations with the above parameters show that these estimates are significantly smaller than those expected under the SMM. Assuming the two populations separated 25 million generations ago, we expect a 95% confidence interval of 315–728 for (δµ)² while the data are <20 and a 95% confidence interval of 10.5–18.0 for DSW while the data are 3.64.
Cattle vs. sheep:
These two species diverged ~16 million years ago, which, assuming a generation time of 2 years, translates into 8 million generations. ![]()
Two of these loci studied by ![]()
µ. This suggests that again much of this difference in length is due to mutations involving the flanking sequence.
If we remove these two loci, which have an average (δµ)² of 653, the remaining 22 loci have an average (δµ)² of 74.4 per locus. If we use an average generation time of 2 years for cattle and sheep, then using (1) we can estimate that the average slippage rate must be β = 4.65 × 10⁻⁶. We could find no information about slippage rates in cattle or sheep, but this is about one-thirteenth the rate of 6 × 10⁻⁵ that ![]()
| TWO MODELS WITH POINT MUTATIONS |
|---|
In all but the first example of the African vs. non-African split in the human population, if we use the SMM with either of our statistics (δµ)² or DSW, then we underestimate divergence times. In view of this, it is natural to ask if there is some mechanism that interferes with the normal rate of growth of these divergence statistics. One possibility is that point mutations spoiling perfect repeats reduce microsatellite mutation rates over time. To investigate this we introduce a new model called the PS/PM model that is a modest generalization of the one proposed by ![]()
PS/PM model:
There are three types of changes that can occur:
- Proportional slippage: A microsatellite of length ℓ > κ becomes length ℓ ± 1 at rate b(ℓ − κ) each. Microsatellites of length ≤ κ do not experience slippage events.
- Point mutations: For 1 ≤ j < ℓ, a microsatellite of length ℓ becomes length j at rate a.
- Birth of microsatellites: κ → κ + 1 at rate c.
For later purposes, it is convenient to write the new proportional slippage rule succinctly as b(ℓ − κ)⁺, where

$$(\ell - \kappa)^+ = \max(\ell - \kappa, 0)$$

denotes the positive part of ℓ − κ; i.e., ℓ − κ if the difference is positive and 0 otherwise.
When κ = 1 the PS/PM model reduces to the original model of ![](). The motivation for generalizing from κ = 1 to a general κ comes from several studies. ![]()
![]()
![]()
In formulating the PS/PM model introduced above, our thought experiment consists of picking two nucleotides at random and seeing how many times they are repeated as we scan to the right, so we only need to keep track of the left one-half of a newly imperfect repeat that has been hit by a mutation. This viewpoint, along with appropriate bookkeeping, can be used to fit the model to data and estimate mutation rates (see ![]()

Here we used lower case letters for the perfect repeat segments to make them more clearly visible. In dividing this imperfect repeat into segments it is convenient to include in each piece the final pair of nucleotides that spoil the pattern. Thus the vertical bars mark the ends of the perfect repeat segments, and we record the state as (6, 3, 8). The reason for this convention will become clear as we develop properties of the model.
In words, in our PCR fragment size model, each of the lengths $X_t^i$ of the perfect repeat segments evolves according to the rules of the PS/PM model. In using this model we are concerned only with the life and death of existing microsatellites, so we ignore the birth of new ones.
PCR model:
If the state at time t is $(X_t^1, \ldots, X_t^n)$, then there are two types of changes for any of the lengths $X_t^i$ with 1 ≤ i ≤ n.
- Proportional slippage: $X_t^i \to X_t^i \pm 1$ at rate $b(X_t^i - \kappa)^+$.
- Point mutation: $(X_t^1, \ldots, X_t^i, \ldots, X_t^n) \to (X_t^1, \ldots, X_t^i - y, y, \ldots, X_t^n)$ at rate a if $1 \le y \le X_t^i - 1$.
Note that because we include the final imperfect repeat unit in each block, the lengths of the two new pieces created by a point mutation add up to the original length. One final minor point is that since our new bookkeeping system includes the final imperfect repeat, the κ here should be equal to κ + 1, where the second κ is the parameter of the PS/PM model.
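The bookkeeping of the PCR model can be sketched as plain list operations. This is an illustration only: the function names are hypothetical, the state (6, 3, 8) is the example recorded in the text, and the threshold κ = 5 used in the demonstration is an assumption taken from the value used for the examples later in the paper.

```python
def positive_part(x):
    return x if x > 0 else 0


def activity(state, kappa):
    """Sum of (x - kappa)^+ over the perfect-repeat blocks."""
    return sum(positive_part(x - kappa) for x in state)


def event_rates(state, a, b, kappa):
    """Per-block rates: (slip up, slip down, total point-mutation rate).

    A block of length x has x - 1 possible split points, each hit at rate a.
    """
    return [(b * positive_part(x - kappa), b * positive_part(x - kappa), a * (x - 1))
            for x in state]


def slip(state, i, step):
    """Proportional slippage: block i gains (step=+1) or loses (step=-1) one unit."""
    new_state = list(state)
    new_state[i] += step
    return new_state


def point_mutate(state, i, y):
    """Point mutation: block i of length x splits into (x - y, y), 1 <= y <= x - 1."""
    x = state[i]
    if not 1 <= y <= x - 1:
        raise ValueError("y must satisfy 1 <= y <= x - 1")
    return state[:i] + [x - y, y] + state[i + 1:]


state = [6, 3, 8]
after = point_mutate(state, 2, 3)
print(after, sum(after) == sum(state), activity(state, kappa=5))
```

The check `sum(after) == sum(state)` reflects the remark above that the two new pieces add up to the original length.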
Let $L_t = \sum_i X_t^i$ be the total length of the microsatellite and let

$$A_t = \sum_i (X_t^i - \kappa)^+$$

be its activity; i.e., $2bA_t$ is the rate at which slippage events occur at time t. Since point mutations do not change the total length, and under proportional slippage the microsatellite is equally likely to gain or lose a repeat unit, $EL_t = L_0$. That is, the average value of the length stays constant in time. It is somewhat remarkable that there is a simple formula for the variance of $L_t$ despite the complexity of the PCR model.
THEOREM 2: If the initial activity of the microsatellite is $A_0$ then at any time t ≥ 0

$$\operatorname{var}(L_t) = 2bA_0 t\left(\frac{1 - e^{-a\kappa t}}{a\kappa t}\right). \tag{3}$$
This result is derived in Appendix B. Theorem 2 concerns the variance of the process, not the population samples. The relationship between this quantity and (δµ)² is that if each sample is of size one then 2 var($L_t$) = (δµ)². And when the samples are larger than size one and the time to the most recent common ancestor of each sample is much less than the time to the most recent common ancestor of these ancestors, then 2 var($L_t$) ≈ (δµ)². Note that if we let β = 2b$A_0$, the initial per locus slippage rate, then the first factor is simply βt, the answer for the SMM. We call the second term in parentheses the correction factor, since it indicates how much the variance has been reduced from the prediction of the SMM due to the effect of point mutations. Using the series expansion $e^{-x} = 1 - x + x^2/2 - \cdots$ we see that when aκt is small, the correction factor is ≈1. In the other direction, if aκt = 1 then the correction factor is 1 − e⁻¹ = 0.632 and a significant reduction has occurred. From this computation, we see that point mutations begin to make a difference when the number of generations t ≈ 1/(aκ).
To understand the implications of Theorem 2 we return to our four examples. Thinking of dinucleotide repeats, we assume a point mutation rate of a = 2 × 10⁻⁸ per repeat unit (see ![]() ![]()) and take κ = 5, so in all cases aκ = 10⁻⁷, and we expect point mutations to have a significant effect after ~10 million generations. In the African vs. non-African comparison of human populations, t = 6000 generations, so aκt = 6 × 10⁻⁴ and the correction factor is 0.9997. For humans vs. chimpanzees, t = 250,000, so aκt = 2.5 × 10⁻² and the correction factor is 0.9876, which is again ≈1. Coalescent simulations show that the 95% confidence intervals for (δµ)² and DSW for the PCR model have changed by <10% from those for the SMM for this example, so the data are not consistent with the PCR model either. (For the simulations we assumed that for all microsatellites their most recent common ancestor was a perfect repeat of length 19 and the per repeat slippage rate was b = 1.9 × 10⁻⁵. This assumption corresponds to a per locus slippage rate of the most recent common ancestor microsatellite being β = 5.6 × 10⁻⁴ as in the SMM.) For cattle vs. sheep, t = 8,000,000 generations, so aκt = 0.8 and the correction factor is 0.688. For D. melanogaster vs. D. simulans, t = 25,000,000, so aκt = 2.5 and the correction factor is 0.367. Coalescent simulations for this example show the 95% confidence intervals for (δµ)² and DSW are 79.2–342 and 6.34–11.8, whereas the observed statistics were <20 and 3.64, respectively. (For the simulations we assumed that for all microsatellites their most recent common ancestor was a perfect repeat of length 15 and the per repeat slippage rate was b = 5.0 × 10⁻⁷.) As predicted by Theorem 2, the mean (δµ)² for the simulations was 184. Fig 1 shows the results of simulating the PCR model to obtain the probability density of the length of a single microsatellite that has evolved for 25 million generations with the parameters of this example. Note that 23% of the microsatellites are longer than 18 repeat units, while only 2 of 186 dinucleotide microsatellites in the original 1-Mb sample of D. melanogaster DNA in ![]()
(δµ)², but not enough to account for the 13- and 30-fold underestimation observed.
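The correction factors quoted for the four examples can be reproduced with a short calculation. The sketch assumes the factor has the form (1 − e^{−x})/x with x = aκt, which matches the value 1 − e⁻¹ = 0.632 at x = 1 and the figures above; the function names are hypothetical.

```python
import math


def correction_factor(a, kappa, t):
    """(1 - exp(-x)) / x with x = a * kappa * t; tends to 1 as x -> 0."""
    x = a * kappa * t
    if x == 0:
        return 1.0
    return -math.expm1(-x) / x  # expm1 keeps precision when x is tiny


def variance_of_length(beta, a, kappa, t):
    """SMM variance beta * t, reduced by the point-mutation correction factor."""
    return beta * t * correction_factor(a, kappa, t)


a, kappa = 2e-8, 5  # values used in the text
generations = {
    "African vs. non-African": 6_000,
    "human vs. chimpanzee": 250_000,
    "cattle vs. sheep": 8_000_000,
    "D. melanogaster vs. D. simulans": 25_000_000,
}
for name, t in generations.items():
    print(name, correction_factor(a, kappa, t))

# Twice the variance approximates the mean (delta mu)^2 for the Drosophila example,
# taking beta = 1e-5 as the per locus slippage rate.
print(2 * variance_of_length(1e-5, a, kappa, 25_000_000))
```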
| A MODEL WITHOUT POINT MUTATIONS |
|---|
Our discussion of Theorem 2 suggests that when aκt is small, as is the case for comparisons between human populations or between humans and chimpanzees, we can ignore the effects of point mutations. If we set the point mutation rate a = 0 in the PS/PM model and add a superscript 0 to remind ourselves that we have done this, then the activity $A_t^0 = \sum_i (X_t^i - \kappa)^+$ follows a very simple dynamic, which we call the proportional slippage/zero mutations (PS/0M) model.
PS/0M model:
If $A_t^0 = k$ then it jumps to k + 1 at rate bk and to k − 1 at rate bk. The process is thus identical to the binary branching process $Z_t$ of probability theory, in which $Z_t$ is the number of particles at time t and each particle splits into two or dies at rate b each (see, e.g., ![]()
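The dynamics just described can be simulated event by event. The following is a minimal sketch using a standard Gillespie-style loop; the parameter values, seed, and number of runs are illustrative choices, not values from the paper. With these values the sample mean should stay near the starting activity and the sample variance near 2·(starting activity)·b·t = 15.

```python
import random


def simulate_ps0m(alpha, b, t, rng):
    """Activity after t generations, starting from activity alpha."""
    k, clock = alpha, 0.0
    while k > 0:
        clock += rng.expovariate(2.0 * b * k)  # total jump rate is 2*b*k
        if clock > t:
            break
        k += 1 if rng.random() < 0.5 else -1   # up or down with equal probability
    return k  # 0 means the microsatellite has lost all activity


rng = random.Random(1)
alpha, b, t = 15, 1e-4, 5000  # illustrative values giving b*t = 0.5
runs = [simulate_ps0m(alpha, b, t, rng) for _ in range(2000)]
sample_mean = sum(runs) / len(runs)
sample_var = sum((x - sample_mean) ** 2 for x in runs) / (len(runs) - 1)
print(sample_mean, sample_var)
```

Once the activity reaches 0 the loop stops, because the total jump rate is then 0 and the process cannot restart.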
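THEOREM 3: If we use $E_\alpha$ to denote the expected value for the process starting from $A_0^0 = \alpha$, then

$$E_\alpha A_t^0 = \alpha, \tag{4}$$

$$E_\alpha (A_t^0 - \alpha)^2 = 2\alpha bt, \tag{5}$$

$$E_\alpha (A_t^0 - \alpha)^3 = 6\alpha (bt)^2, \tag{6}$$

$$E_\alpha (A_t^0 - \alpha)^4 = 2\alpha bt + 12\alpha^2 (bt)^2 + 24\alpha (bt)^3. \tag{7}$$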
In words, proportional slippage is equally likely to increase or decrease the activity by one, so the average activity does not change in time. The second equation, which can be derived by setting a = 0 in Theorem 2, says that even though the slippage rate varies in time in the PS/0M model, the variance of $A_t^0$ is linear in time, just as in the SMM, which has constant slippage rates.
Substantial differences between the PS/0M model and the SMM appear when we look at third and higher moments. The SMM is symmetric so $E(X_t - X_0)^3 = 0$, but as (6) shows, the proportional slippage model has positive skewness. ![]()
![]()
![]()
Computing the fourth moment reveals another difference between our proportional slippage model and stepwise mutation. In the SMM the difference in microsatellite length, $X_t - Y_t$, between two individuals with a most recent common ancestor t generations ago is the sum of independent random variables. Thus, if t is large, $X_t - Y_t$ has approximately a normal distribution and the kurtosis

$$\frac{E(X_t - Y_t)^4}{[E(X_t - Y_t)^2]^2} \approx 3.$$

In contrast, (7) and (5) show that when 2αbt is large the kurtosis in the proportional slippage model is

$$\frac{E(X_t - Y_t)^4}{[E(X_t - Y_t)^2]^2} \approx 3 + \frac{3\beta t}{2\alpha^2}, \tag{8}$$

where β = 2αb is the initial per locus slippage rate.
If the kurtosis is large then the distribution of $X_t - Y_t$ will have a heavy tail and estimation of quantities such as (δµ)² will be difficult. To see when the kurtosis will become large, we note that (8) implies this will occur when βt/(2α²) is large. To see that this answer is reasonable, note that the expected number of slippage events in t generations is βt and recall that in n steps a random walk typically moves about $n^{1/2}$ steps. Thus the kurtosis becomes large when the "typical amount of change" in the microsatellite, (βt)^{1/2}, exceeds its initial activity α and hence there is significant probability of microsatellite death.
In the African/non-African split if we assume t = 6000 generations, use an average activity α = 15, which corresponds to an average size of 20 repeat units, and set β = 5.6 × 10⁻⁴ then βt/(2α²) = 0.0075 so the kurtosis is 3.02. For the human-chimpanzee split, t = 250,000, β = 5.6 × 10⁻⁴, and α = 15, so βt/(2α²) = 0.311 and the kurtosis is 3.93. For D. melanogaster vs. D. simulans, t = 25,000,000, β = 10⁻⁵, and α = 10, so βt/(2α²) = 1.25 and the kurtosis is 6.75. Finally, for cattle vs. sheep, we take t = 8,000,000 and α = 10, so if we use the estimate β = 6 × 10⁻⁵ from pig microsatellites, βt/(2α²) = 2.4 and the kurtosis is 10.2. One should note, however, that the values for D. melanogaster vs. D. simulans and cattle vs. sheep are overestimates of the kurtosis since they are based on the proportional slippage model, and our earlier calculations showed that in these cases point mutations had a significant effect on the variance.
To interpret the numerical values of the kurtosis, we observe that if a random variable V has kurtosis $K = EV^4/(EV^2)^2$ then

$$\operatorname{var}\left(\frac{V^2}{EV^2}\right) = \frac{EV^4 - (EV^2)^2}{(EV^2)^2} = K - 1,$$

and hence the standard deviation of $V^2/EV^2$ is $(K - 1)^{1/2}$. This shows that if the kurtosis is 3.93, as it is in the human vs. chimpanzee comparison, instead of the 3 for the normal distribution, then the width of confidence intervals will be $(2.93/2)^{1/2}$ = 1.21 times as large or, equivalently, 1.21² = 1.46 times as much data will be needed to obtain the same accuracy of estimation.
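The kurtosis values for the four examples, and the resulting widening of confidence intervals, can be tabulated with a short script. The sketch assumes that (8) has the large-βt form 3 + 3βt/(2α²), which reproduces the quoted values 3.02, 3.93, 6.75, and 10.2; the function names are hypothetical.

```python
import math


def kurtosis_ps0m(beta, t, alpha):
    """Assumed large-(beta*t) approximation: 3 + 3*beta*t / (2*alpha^2)."""
    return 3.0 + 3.0 * beta * t / (2.0 * alpha ** 2)


def ci_width_ratio(kurt):
    """Std. dev. of V^2/EV^2 relative to the normal case: sqrt((K - 1) / 2)."""
    return math.sqrt((kurt - 1.0) / 2.0)


# (beta, t, alpha) for each comparison, as given in the text.
cases = {
    "African vs. non-African": (5.6e-4, 6_000, 15),
    "human vs. chimpanzee": (5.6e-4, 250_000, 15),
    "D. melanogaster vs. D. simulans": (1e-5, 25_000_000, 10),
    "cattle vs. sheep": (6e-5, 8_000_000, 10),
}
for name, (beta, t, alpha) in cases.items():
    k = kurtosis_ps0m(beta, t, alpha)
    print(name, k, ci_width_ratio(k), ci_width_ratio(k) ** 2)
```

The last printed column is the factor by which the amount of data must grow to keep the same accuracy.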
The last conclusion shows that estimates of (δµ)² under the proportional slippage model are not very much more variable than under the SMM. However, the fluctuations under the SMM in this case are huge. Fig 2 gives a simulation of (δµ)² under the parameters of the human-chimpanzee split. We used two populations of size Ne = 10,000 individuals, a divergence time of 250,000 generations, and a mutation rate of 5.6 × 10⁻⁴ per locus per generation. It is interesting to compare the simulations, where 51% of the (δµ)² values are >120, with the data in Table 1, where the largest (δµ)² among 25 loci is 112. Indeed, as (1) predicts, the average value of (δµ)² in the simulation is 2βτ = 280.
Further, coalescent simulations of the PS/0M model show that for the human-chimpanzee and the D. melanogaster-D. simulans splits the observed (δµ)² and DSW statistics are not within the expected 95% confidence intervals. These observations suggest that there may be some additional mechanism(s) preventing microsatellites from getting too long.
| DEATH OF MICROSATELLITES |
|---|
Our final topic is to compute the probability of microsatellite death in the PS/0M model, i.e., the probability a microsatellite will reach 0 activity in t generations. Since, as noted above, the PS/0M model is equivalent to the binary branching process of probability theory, we can compute not only all of the moments of $A_t^0$ but also the exact distribution of $A_t^0$. It follows from results on page 109 of ![]()
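THEOREM 4: Letting $P_\alpha$ denote the probability law for the PS/0M model starting from $A_0^0 = \alpha$,

$$P_\alpha(A_t^0 = 0) = \left(\frac{bt}{1 + bt}\right)^{\alpha}, \tag{9}$$

while for k ≥ 1,

$$P_\alpha(A_t^0 = k) = \sum_{j=1}^{\min(\alpha, k)} \binom{\alpha}{j}\binom{k-1}{j-1}\left(\frac{bt}{1 + bt}\right)^{\alpha + k - 2j}\left(\frac{1}{1 + bt}\right)^{2j}. \tag{10}$$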
To apply Theorem 4 to our four examples, we begin by recalling that b = β/(2α), where β is the per locus slippage rate and α is the activity, that is, the length minus κ = 4. In the African vs. non-African human comparison, t = 6000, β = 5.6 × 10⁻⁴, and α = 15 (i.e., an average length of 19 repeat units), so (9) shows that the probability of having no activity after t = 6 × 10³ generations is (0.11/1.11)¹⁵ < 10⁻¹⁵. In the human vs. chimpanzee comparison, t = 250,000, β = 5.6 × 10⁻⁴, and α = 15, so the probability of having no activity after t generations is 0.054.
Fig 3 shows the distribution of the lengths in this case as computed from (10). Note the positive skewness in the distribution as predicted by Theorem 3. Note also that our numerical solution has 17% of the microsatellites having >30 repeat units while only 1 of 205 dinucleotide microsatellites in the original 1-Mb sample of human DNA in ![]()
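The death probability and the length distribution can be evaluated directly. The sketch below assumes the standard binary branching process facts: each of the initial lineages dies out by time t with probability q = bt/(1 + bt), and a surviving lineage has a geometric number of descendants with the same parameter. The function names are hypothetical, and the threshold of four repeat units is taken from the text.

```python
import math


def death_probability(alpha, b, t):
    """Probability that activity is 0 after t generations, starting from alpha."""
    q = b * t / (1.0 + b * t)
    return q ** alpha


def activity_pmf(k, alpha, b, t):
    """P(activity = k at time t | activity alpha at time 0), summing over the
    number j of initial lineages that survive."""
    q = b * t / (1.0 + b * t)
    if k == 0:
        return q ** alpha
    total = 0.0
    for j in range(1, min(alpha, k) + 1):
        total += (math.comb(alpha, j) * math.comb(k - 1, j - 1)
                  * q ** (alpha + k - 2 * j) * (1.0 - q) ** (2 * j))
    return total


# Human vs. chimpanzee parameters from the text.
alpha, beta, t = 15, 5.6e-4, 250_000
b = beta / (2 * alpha)
pmf = [activity_pmf(k, alpha, b, t) for k in range(400)]
print(death_probability(alpha, b, t))
# Length = activity + 4, so more than 30 repeat units means activity of 27 or more.
print(sum(pmf[27:]))
```

The second printed value can be compared with the fraction of long microsatellites quoted for Fig 3; small differences may arise from how the length cutoff is counted.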
For the D. melanogaster vs. D. simulans and cattle vs. sheep comparisons, the PS/0M model overestimates the number of microsatellites with no activity. But this is to be expected since our earlier results show that point mutations have slowed down microsatellite mutation processes over this amount of time.
| DISCUSSION |
|---|
In summary, microsatellite mutation models that incorporate point mutations and proportional slippage events fit the data better than the SMM. However, these two features are not enough to explain, for example, the observation that the genetic distance statistics (δµ)² and DSW tend to underestimate divergence times and have more difficulty with more distant comparisons. This and other evidence we presented suggest that long microsatellites are more likely to become shorter rather than longer when a mutation occurs.
One possibility is that there is selection against longer alleles. This effect is clearly noticeable in microbial genomes where selection for small genome size appears to cause microsatellites to be much shorter than they would be by chance alone (see ![]()
![]()
Upper limits on allele sizes are a severe form of selection that has been incorporated in some models (e.g., ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
| ACKNOWLEDGMENTS |
|---|
We thank two anonymous reviewers, Tessa Bauer DuMont, Jennifer Calkins, Semyon Kruglyak, Willie Swanson, and Todd Vision for their many helpful comments. This work was partially supported by National Institutes of Health (NIH) grant GM36431 to C.F.A., NIH grant GM36431-14S1 to C.F.A. and R.T.D., and National Science Foundation grant DMS9877066 to R.T.D.
Manuscript received June 15, 2000; Accepted for publication July 6, 2001.
| APPENDIX A |
|---|
To compute DSW and (δµ)², let $\xi_1, \xi_2, \ldots$ be independent and ±1 with probability 1/2 each and let $S_n = \xi_1 + \cdots + \xi_n$. The $\xi_i$ are the results of the various slippage events and $S_n$ is the total change after n events. If each population consists of N diploid individuals then the number of slippage events, U, before the ancestors of X and X' (or of Y and Y') coalesce has a shifted geometric distribution with success probability p = 1/(1 + 4βN); that is, we have

$$P(U = m) = p(1 - p)^m \quad \text{for } m = 0, 1, 2, \ldots \tag{A1}$$

and hence EU = 1/p − 1 = 4βN. If the two populations diverged τ generations ago then the number of slippage events before the ancestors of X and Y coalesce has the same distribution as U + V, where V gives the number that occurs during the first τ generations. V has a Poisson distribution with mean 2βτ; that is, $P(V = m) = e^{-2\beta\tau}(2\beta\tau)^m/m!$ for m = 0, 1, 2, ...
In the case of (δµ)², breaking things down according to the values of U and V and using the fact that $ES_n^2 = n$ we have

$$(\delta\mu)^2 = ES_{U+V}^2 - ES_U^2 = E(U + V) - EU = EV = 2\beta\tau.$$
The computation for DSW starts out the same,

$$D_{SW} = E|S_{U+V}| - E|S_U|, \tag{A2}$$

but the computation of $E|S_U|$ and $E|S_{U+V}|$ is more complicated. To begin, we note that since $P(\xi_i = 1) = P(\xi_i = -1) = 1/2$, considering two cases $S_{n-1} = 0$ and $S_{n-1} \ne 0$ we have

$$E(|S_n| - |S_{n-1}|) = P(S_{n-1} = 0). \tag{A3}$$
Since $S_n$ alternates between even and odd values, it can be 0 only after an even number of steps, and simple path counting gives

$$P(S_{2k} = 0) = \binom{2k}{k}2^{-2k} = \frac{(2k)!}{k!\,k!}\,2^{-2k},$$

where $\binom{n}{m}$ is the usual binomial coefficient, which gives the number of ways of choosing m things out of a set of n, and k! = 1 · 2 · · · k.
Let T be a random time, e.g., U or U + V. Writing 1(T ≥ n) for the function that is 1 if T ≥ n and 0 otherwise, we have $|S_T| = \sum_{n=1}^{\infty}(|S_n| - |S_{n-1}|) \cdot 1(T \ge n)$, so taking expected values and using the independence of T and $S_n$ with (A3) we have

$$E|S_T| = \sum_{n=1}^{\infty} P(S_{n-1} = 0)\,P(T \ge n). \tag{A4}$$
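Changing variables n = 2k + 1 and using (A1), which gives $P(U \ge 2k + 1) = (1 - p)^{2k+1}$, shows that in the case T = U we have

$$E|S_U| = \sum_{k=0}^{\infty}\binom{2k}{k}2^{-2k}(1 - p)^{2k+1}.$$

Differentiating the function $f(x) = (1 - x)^{-1/2}$ we find its kth derivative is

$$f^{(k)}(x) = \frac{1}{2}\cdot\frac{3}{2}\cdots\frac{2k-1}{2}\,(1 - x)^{-(2k+1)/2} = \frac{(2k)!}{4^k k!}\,(1 - x)^{-(2k+1)/2}.$$

Recalling the formula for the Taylor series of a function f,

$$f(x) = \sum_{k=0}^{\infty}\frac{f^{(k)}(0)}{k!}\,x^k,$$

and comparing with the formula for $E|S_U|$ we have that

$$E|S_U| = (1 - p)\,f((1 - p)^2) = \frac{1 - p}{(1 - (1 - p)^2)^{1/2}} = \frac{4\beta N}{(1 + 8\beta N)^{1/2}}, \tag{A5}$$

the last equality following from 1 − p = 4βN/(1 + 4βN).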
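The closed form for $E|S_U|$ can be checked against the series numerically. This sketch assumes the series is $\sum_k \binom{2k}{k}2^{-2k}(1-p)^{2k+1}$ and the closed form is $4\beta N/(1 + 8\beta N)^{1/2}$ with p = 1/(1 + 4βN), as derived from the random-walk argument of this appendix; the function names and the truncation point are illustrative.

```python
import math


def abs_walk_mean_series(p, terms=5000):
    """Truncated sum over k of P(S_2k = 0) * (1 - p)^(2k + 1)."""
    r = 1.0 - p
    total = 0.0
    return_prob = 1.0  # P(S_2k = 0) = C(2k, k) / 4^k, starting at k = 0
    power = r          # r^(2k + 1), starting at k = 0
    for k in range(terms):
        total += return_prob * power
        return_prob *= (2 * k + 1) / (2 * k + 2)  # ratio of successive return probabilities
        power *= r * r
    return total


def abs_walk_mean_closed(beta, n):
    """Assumed closed form 4*beta*N / sqrt(1 + 8*beta*N)."""
    return 4.0 * beta * n / math.sqrt(1.0 + 8.0 * beta * n)


beta, n = 5.6e-4, 1e4  # human parameters used in the text
p = 1.0 / (1.0 + 4.0 * beta * n)
print(abs_walk_mean_series(p), abs_walk_mean_closed(beta, n))
```

The return probabilities are updated by their ratio rather than computed from factorials, which avoids overflow for large k.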
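In the case T = U + V, P(U + V ≥ 2k + 1) is given by

$$P(U + V \ge 2k + 1) = \sum_{m=0}^{2k} e^{-2\beta\tau}\frac{(2\beta\tau)^m}{m!}(1 - p)^{2k+1-m} + \sum_{m=2k+1}^{\infty} e^{-2\beta\tau}\frac{(2\beta\tau)^m}{m!}. \tag{A6}$$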
Together with (A2), (A4), and (A5) this can be used to compute $E|S_{U+V}|$ numerically, but it does not seem possible to sum the series to get an exact solution. To begin to derive an approximation for $E|S_{U+V}|$, we note that if n is large, $S_n/n^{1/2} \approx \chi$, where χ has a standard normal distribution, so $E|S_n| \approx n^{1/2}E|\chi| = (2n/\pi)^{1/2}$. If we let $g(n) = E|S_n|$ then $E|S_{U+V}| = Eg(U + V)$. Writing W = U + V to simplify formulas and expanding in Taylor series,

$$g(W) \approx g(EW) + g'(EW)(W - EW) + \tfrac{1}{2}g''(EW)(W - EW)^2.$$

Taking expected value of each side,

$$Eg(W) \approx g(EW) + \tfrac{1}{2}g''(EW)\operatorname{var}(W). \tag{A7}$$
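Our next goal is to show that if 2βτ is large and τ ≥ N we can drop the second term from (A7) to end up with

$$E|S_{U+V}| \approx g(EW) \approx \left(\frac{2(2\beta\tau + 4\beta N)}{\pi}\right)^{1/2},$$

which with (A5) gives (2). To do this we note that for large x, $g(x) \approx Cx^{1/2}$ so $g''(x) \approx -(C/4)x^{-3/2}$ and the ratio of the two terms is, in absolute value,

$$\frac{\tfrac{1}{2}|g''(EW)|\operatorname{var}(W)}{g(EW)} \approx \frac{\operatorname{var}(W)}{8(EW)^2}.$$

To see when this will be small we use formulas for the mean and variance of the Poisson and geometric distributions to conclude

$$EW = 2\beta\tau + \frac{1 - p}{p} = 2\beta\tau + 4\beta N, \qquad \operatorname{var}(W) = 2\beta\tau + \frac{1 - p}{p^2} = 2\beta\tau + 4\beta N + 16\beta^2N^2,$$

since p = 1/(1 + 4βN). From this we see that the ratio of interest is

$$\frac{2\beta\tau + 4\beta N + 16\beta^2N^2}{8(2\beta\tau + 4\beta N)^2}. \tag{A8}$$

If 2βτ is large and τ ≥ N we can drop the 2βτ + 4βN from the numerator and then divide top and bottom by 4β² to see that the last expression is

$$\approx \frac{4N^2}{8(\tau + 2N)^2} \le \frac{4}{8 \cdot 9} = \frac{1}{18} \tag{A9}$$

when τ ≥ N. In words, the error we make by neglecting the second term in (A7) is at most 5.5%, and as τ/N increases, the error will become smaller.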
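The quality of the approximation can be examined by summing the series for $E|S_{U+V}|$ directly. The sketch below is illustrative: it assumes U is shifted geometric with p = 1/(1 + 4βN), V is Poisson with mean 2βτ, and $E|S_T| = \sum_k P(S_{2k} = 0)P(T \ge 2k + 1)$, and it compares the result with the square-root approximation. The function names and the truncation point `kmax` are hypothetical choices.

```python
import math


def exact_abs_mean_w(tau, beta, n, kmax=3000):
    """Sum over k of P(S_2k = 0) * P(U + V >= 2k + 1), truncated at kmax."""
    r = 1.0 - 1.0 / (1.0 + 4.0 * beta * n)  # 1 - p
    lam = 2.0 * beta * tau                  # Poisson mean of V
    pois = math.exp(-lam)                   # P(V = m), starting at m = 0
    cum = 0.0                               # P(V < m)
    g = 0.0                                 # sum over j < m of P(V = j) * r^(m - j)
    return_prob = 1.0                       # P(S_2k = 0), starting at k = 0
    total = 0.0
    for m in range(2 * kmax):
        if m % 2 == 1:
            k = (m - 1) // 2
            tail_w = g + max(0.0, 1.0 - cum)  # P(U + V >= m)
            total += return_prob * tail_w
            return_prob *= (2 * k + 1) / (2 * k + 2)
        g = r * (g + pois)
        cum += pois
        pois *= lam / (m + 1)
    return total


def approx_abs_mean_w(tau, beta, n):
    """Square-root approximation g(EW) = sqrt(2 * EW / pi)."""
    return math.sqrt(2.0 * (2.0 * beta * tau + 4.0 * beta * n) / math.pi)


tau, beta, n = 88_200, 5.6e-4, 1e4  # human vs. chimpanzee values from the text
within = 4.0 * beta * n / math.sqrt(1.0 + 8.0 * beta * n)  # assumed closed form of E|S_U|
exact = exact_abs_mean_w(tau, beta, n) - within
approx = approx_abs_mean_w(tau, beta, n) - within
print(exact, approx)
```

For these parameters τ/N is about 9, so the neglected second-order term is expected to be well below the 5.5% bound.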
| APPENDIX B |
|---|
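Writing $X_t = (X_t^1, \ldots, X_t^n)$ and $e_i$ for the vector that has one in the ith place and zero otherwise, it follows from the definition of the PCR model and the Kolmogorov differential equations for the associated Markov chain that

$$\frac{d}{dt}Eg(X_t) = E\,\mathcal{G}_1 g(X_t) + E\,\mathcal{G}_2 g(X_t), \tag{B1}$$

where the two parts of the right-hand side correspond to proportional slippage and point mutation events:

$$\mathcal{G}_1 g(x) = \sum_i b(x^i - \kappa)^+\left[g(x + e_i) + g(x - e_i) - 2g(x)\right], \qquad \mathcal{G}_2 g(x) = \sum_i a\sum_{y=1}^{x^i - 1}\left[g(x^1, \ldots, x^i - y, y, \ldots, x^n) - g(x)\right].$$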
If we let $g_1(X_t) = \sum_i X_t^i$ be the total length then $\mathcal{G}_2 g_1(X_t) = 0$ since point mutations do not change the length and $\mathcal{G}_1 g_1(X_t) = 0$ by computation so

$$EL_t = L_0. \tag{B2}$$
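To prepare for the computation of the variance, let $h(X_t, j) = \sum_i (X_t^i - j)^+$, where j ≤ κ. Since proportional slippage is a fair game and no slippage occurs for pieces of length j, $\mathcal{G}_1 h = 0$. To compute the other term, we note that

$$\mathcal{G}_2 h(X_t, j) = a\sum_i\sum_{y=1}^{X_t^i - 1}\left[(X_t^i - y - j)^+ + (y - j)^+ - (X_t^i - j)^+\right].$$

To evaluate the sum we use the identity $\sum_{z=1}^{k} 2z = k(k + 1)$ to conclude

$$\sum_{y=1}^{X_t^i - 1}\left[(X_t^i - y - j)^+ + (y - j)^+\right] = (X_t^i - j - 1)^+(X_t^i - j)^+ = (X_t^i - j - 1)(X_t^i - j)^+.$$

To check the second equality note that if $X_t^i \le j + 1$ it says 0 = 0, while for $X_t^i > j + 1$ the positive parts are irrelevant. Combining our computations,

$$\mathcal{G}_2 h(X_t, j) = a\sum_i\left[(X_t^i - j - 1) - (X_t^i - 1)\right](X_t^i - j)^+ = -aj\,h(X_t, j).$$

Using this with (B1) and solving the differential equation we have

$$Eh(X_t, j) = h(X_0, j)\,e^{-ajt}. \tag{B3}$$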
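Turning now to $g_2(X_t) = (\sum_i X_t^i)^2$, we have $\mathcal{G}_2 g_2(X_t) = 0$ since point mutations do not change the total length. For the other term we note that if $L_t = \sum_i X_t^i$ then $(L_t + 1)^2 - 2L_t^2 + (L_t - 1)^2 = 2$ so

$$\mathcal{G}_1 g_2(X_t) = 2b\sum_i (X_t^i - \kappa)^+ = 2bA_t.$$

Using this in (B1) and then applying (B3) with j = κ gives $\frac{d}{dt}EL_t^2 = 2bEA_t = 2bA_0e^{-a\kappa t}$. Integrating and using (B2), $\operatorname{var}(L_t) = EL_t^2 - L_0^2 = 2bA_0(1 - e^{-a\kappa t})/(a\kappa)$, which is (3).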