Genetics, Vol. 159, 839-852, October 2001, Copyright © 2001

Dynamics of Microsatellite Divergence Under Stepwise Mutation and Proportional Slippage/Point Mutation Models

Peter P. Calabresea, Richard T. Durrettb, and Charles F. Aquadroc
a Department of Applied Mathematics, Cornell University, Ithaca, New York 14853
b Department of Mathematics, Cornell University, Ithaca, New York 14853
c Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853

Corresponding author: Richard T. Durrett, Department of Mathematics, 523 Malott Hall, Cornell University, Ithaca, NY 14853., rtd1{at}cornell.edu (E-mail)

Communicating editor: S. TAVARÉ


*  ABSTRACT
*TOP
*ABSTRACT
*GENETIC DISTANCES
*FOUR EXAMPLES
*TWO MODELS WITH POINT...
*A MODEL WITHOUT POINT...
*DEATH OF MICROSATELLITES
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

Recently Kruglyak, Durrett, Schug, and Aquadro showed that microsatellite equilibrium distributions can result from a balance between polymerase slippage and point mutations. Here, we introduce an elaboration of their model that keeps track of all parts of a perfect repeat and a simplification that ignores point mutations. We develop a detailed mathematical theory for these models that exhibits properties of microsatellite distributions, such as positive skewness of allele lengths, that are consistent with data but are inconsistent with the predictions of the stepwise mutation model. We use our theoretical results to analyze the successes and failures of the genetic distances ({delta}µ)2 and DSW when used to date four divergences: African vs. non-African human populations, humans vs. chimpanzees, Drosophila melanogaster vs. D. simulans, and sheep vs. cattle. The influence of point mutations explains some of the problems with the last two examples, as does the fact that these genetic distances have large stochastic variance. However, we find that these two features are not enough to explain the problems of dating the human-chimpanzee split. One possible explanation of this phenomenon is that long microsatellites have a mutational bias that favors contractions over expansions.


MICROSATELLITES are simple sequence repeats in DNA that typically have a high level of variability due to a high rate of mutations that alter their length. For this reason they have been useful for studying population structure on the time scale of thousands of generations (see BOWCOCK et al. 1994 Down; ROY et al. 1994 Down; GOLDSTEIN et al. 1995B Down; UNDERHILL et al. 1996 Down; GOLDSTEIN and POLLOCK 1997 Down; HARR et al. 1998 Down; IRWIN et al. 1998 Down; REICH and GOLDSTEIN 1998 Down; GOLDSTEIN et al. 1999 Down; PRITCHARD et al. 1999 Down; RUIZ-LINARES et al. 1999 Down). To make inferences from observed patterns, one needs a statistic to measure differentiation between populations and a model to give the distribution of that statistic. Here, we consider two genetic distances: ({delta}µ)2 of GOLDSTEIN et al. 1995A Down, GOLDSTEIN et al. 1995B Down and DSW of SHRIVER et al. 1995 Down.

We examine the behavior of two genetic distances ({delta}µ)2 and DSW in four increasingly divergent examples: (i) African vs. non-African human populations, (ii) human vs. chimpanzee, (iii) Drosophila melanogaster vs. D. simulans, and (iv) cattle vs. sheep. If one assumes the stepwise mutation model (SMM) of OHTA and KIMURA 1973 Down, then the expected value of ({delta}µ)2 grows linearly in time. When used on example (i), the statistic ({delta}µ)2 gives good estimates (see GOLDSTEIN et al. 1995B Down), but when applied to examples (ii) and (iii), it gives answers that are roughly one-seventh and one-thirtieth of the commonly accepted values. The nonlinear distance DSW does not do as well as ({delta}µ)2 at dating the human population split but has a slightly better performance for examples (ii) and (iii), yielding estimates that are about one-third and one-eighth of the commonly accepted values.

Finally, in example (iv), the two species are too far diverged for microsatellites to be useful molecular clocks. Results of ELLEGREN et al. 1997 Down show that roughly one-half of the microsatellites they isolated in one species were monomorphic in the other and have presumably lost their ability to mutate. This observation suggests that in the long run point mutations break up perfect repeats and reduce the mutation rates of microsatellite loci. It is natural to ask if this mechanism can explain the underestimates that arise in examples (ii) and (iii). To investigate this possibility, we introduced two new models. The first is a slight generalization of the model of KRUGLYAK et al. 1998 Down, which we call the proportional slippage/point mutation (PS/PM) model. In this model point mutations spoil perfect repeats; the slippage rate is zero for microsatellites with fewer than {kappa} repeat units and then increases linearly. The PS/PM model can be used to estimate slippage rates from DNA sequence data, but to address the divergence question we need a second model, called the PCR model, that keeps track of the lengths of all perfect repeats that make up an imperfect repeat.

The PCR model is complicated, but it is possible to obtain a simple formula for the variance of a repeat Lt as a function of time t in generations (see Theorem 2). Using a = 2 x 10-8 as an estimate for the point mutation rate per repeat unit and a threshold of four repeat units for slippage events to be possible, this formula shows that the variance of the repeat length begins to depart from linearity when t/(10,000,000) is not small relative to one. This result explains some of the problems with the use of ({delta}µ)2 in the comparison of D. melanogaster and D. simulans, which diverged ~25,000,000 generations ago, but makes the failure of ({delta}µ)2 in the human vs. chimpanzee split even more mysterious, since, as our calculations have shown, point mutations will not have had a significant effect in 250,000 generations.

To further investigate the problems in dating the human vs. chimpanzee split, we investigated the behavior of the PS/PM model when there are no mutations. This special case, called the PS/0M model, and denoted A0t, is equivalent to the binary branching process of probability theory, so it is possible to do many exact calculations. Theorem 3 gives expressions for the first four moments of A0t. Expressions for the third moment show that the distribution of A0t has a positive skewness, which contrasts with the symmetric distributions of the SMM, but is consistent with the skewness observed in microsatellite data.

Calculations for the fourth moment show that if ß is the per locus slippage rate, and {alpha} is the initial activity of a microsatellite, i.e., the length minus the threshold {kappa} for slippage to occur, then the kurtosis becomes large when ßt/{alpha}2 is large relative to one. In general fourth moments of the microsatellite lengths are larger under the PS/0M model than under the SMM. Consequently, microsatellite statistics that use these moments, such as those of REICH and GOLDSTEIN 1998 Down and GONSER et al. 2000 Down, will have much different distribution under PS/0M than under SMM. In the case of our four examples the kurtoses are (i) 3.02, (ii) 3.93, (iii) 6.75, and (iv) 10.7, compared to 3 for the normal distribution. In the case of the human-chimpanzee split, (ii), this implies that confidence intervals are 1.21 times as large as they would be under the SMM. However, this again does not explain the magnitude of the failures of ({delta}µ)2 and DSW in dating the human-chimpanzee split. The last observation and the fact that the simulated microsatellite distributions given in Fig 1 and Fig 3 have many more large microsatellites than are typically observed lead us to conclude that there are forces that constrain the growth not yet incorporated into our models. We return to this point in the DISCUSSION.



View larger version (22K):
In this window
In a new window
Download PPT slide
 
Figure 1. PCR model simulation. Probability density of the length in repeat units of a single microsatellite after 25,000,000 generations is shown. Initially the microsatellite was a perfect repeat with length 15, = 5, a = 2 x 10-8, and per repeat slippage rate b = 5.0 x 10-7. These are the parameters we used to study the D. melanogaster vs. D. simulans split.



View larger version (14K):
In this window
In a new window
Download PPT slide
 
Figure 2. SMM simulation. Probability density of ({delta}µ)2 in repeat units for samples of size 20 from two populations with Ne = 10,000 that diverged {tau} = 250,000 generations ago is shown; per locus slippage rate ß = 5.6 x 10-4. These are the parameters we used to study the human-chimpanzee split. In contrast, in Table 1 only 2 of 25 estimates of ({delta}µ)2 are >100.



View larger version (31K):
In this window
In a new window
Download PPT slide
 
Figure 3. Exact PS/0M calculation. Probability density of the length in repeat units of a single microsatellite after 250,000 generations. Initially the microsatellite has length 19, {kappa} = 4, and per repeat slippage rate b = 1.9 x 10-5. These are the parameters we used to study the human-chimpanzee split. Note the positive skewness in the distribution.


*  GENETIC DISTANCES
*TOP
*ABSTRACT
*GENETIC DISTANCES
*FOUR EXAMPLES
*TWO MODELS WITH POINT...
*A MODEL WITHOUT POINT...
*DEATH OF MICROSATELLITES
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

Our first step is to define the two genetic distances ({delta}µ)2 and DSW and to compute their values for the four examples. We then introduce our two new models, state the theoretical results we have obtained, and use them to study the four examples. To define ({delta}µ)2, let µA and µB be the mean length of alleles at a microsatellite locus in populations A and B, and define genetic distance between the two populations [see (1) of GOLDSTEIN et al. 1995b] as

Given data, the distance is estimated by the corresponding statistic

where A and B are the average lengths observed in samples from populations A and B.

To motivate the definition of our second distance, we recall (see, e.g., p. 6723 of GOLDSTEIN et al. 1995B Down) that, if X and X' are the lengths of the microsatellite locus in a sample of size two from population A, and Y and Y' are a similar random sample of size two from population B, then

where in each case we take the square before computing expected value. Replacing the squares in the last formula by absolute values, we can follow SHRIVER et al. 1995 Down and define the genetic distance

Given microsatellite lengths X1, ... Xm from population A, and Y1, ... Yn from population B, DSW is estimated by

Suppose that microsatellites follow the SMM of OHTA and KIMURA 1973 Down in which microsatellites change by ±1 unit at a rate ß independent of their length. In this case GOLDSTEIN et al. 1995A Down have shown that if one assumes the divergence of the two populations occurred {tau} generations ago, then

SHRIVER et al. 1995 Down simulated the behavior of DSW under the SMM and concluded that over short timescales DSW was linear. In Appendix A we prove the following result about the nonlinear behavior of DSW under the SMM when two populations of Ne diploid individuals diverged {tau} generation ago.

THEOREM 1: If 2ß{tau} is large and {tau} >= Ne then

(2)

When {tau} >> Ne, the terms involving Ne can be dropped and

so in the long run DSW grows like a constant times {tau}1/2.


*  FOUR EXAMPLES
*TOP
*ABSTRACT
*GENETIC DISTANCES
*FOUR EXAMPLES
*TWO MODELS WITH POINT...
*A MODEL WITHOUT POINT...
*DEATH OF MICROSATELLITES
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

To test the behavior of the statistics ({delta}µ)2 and DSW we consider four increasingly divergent examples.

Divergence of human populations:
GOLDSTEIN et al. 1995B Down investigated 30 microsatellite loci and estimated that the value of ({delta}µ)2 between African and non-African populations was 6.47. Using this in (1) with their mutation rate estimate of 5.6 x 10-4 gives a prediction of 5776 generations for the divergence time. Assuming a human generation time of 27 years, they then arrived at the estimate of 156,000 years, a figure that they argued was in agreement with previous genetic estimates and with archaeological data.

RUBINSZTEIN et al. 1995 Down studied 24 microsatellite loci in East Anglians and Sub-Saharan Africans and obtained an estimate of 1.45 for DSW. Assuming ß = 5.6 x 10-4 and taking Ne = 5000 as the size of one of the two subpopulations, we can use (2) to give a prediction of 9880 generations for the divergence time. Multiplying by 27 years leads to an estimate of 267,000 years, which is much larger than the estimate of GOLDSTEIN et al. 1995B Down. One possible explanation is that we have chosen the wrong effective population size for our estimate. If instead we use Ne = 750 then an estimate of 5630 generations results, which is similar to the value estimated by GOLDSTEIN et al. 1995B Down.

Humans vs. chimpanzees:
RUBINSZTEIN et al. 1995 Down also studied 24 microsatellite loci in chimpanzees. Combining this with their human data, they obtained an estimate of 5.475 for DSW for the human-chimpanzee comparison. They commented that the ratio of this estimate to the East Anglian vs. African comparison, = 3.78, was surprising since the ratio of the divergence times for the two splits is at least 50. The nonlinearity of DSW shown in Theorem 1 helps explain this discrepancy. If we use the slippage rate of ß = 5.6 x 10-4 from the previous example for both humans and chimpanzees and assume an effective population size of Ne = 104 for each population, then using Theorem 1 we arrive at an estimate of {tau} = 88,200 generations for their divergence time. If we use an average lifetime of 20 years for humans and chimpanzees this translates into 1.76 million years, about one-third the accepted estimate of 5–6 million years (see, e.g., GOODMAN et al. 1998 Down or KUMAR and HEDGES 1998 Down).

Since RUBINSZTEIN et al. 1995 Down report only the genetic distances DSW for their loci, we need to turn to other sources for data we can use to calculate ({delta}µ)2. BOWCOCK et al. 1994 Down, DEKA et al. 1994 Down, and GARZA et al. 1995 Down studied 10, 7, and 8 microsatellite loci, respectively, in these two species. The data are given in Table 1. From this, we can compute ({delta}µ)2 values of 7.56, 86.19, and 40.19, respectively. Even though the second estimate is >11 times the first, we can use all 25 loci in Table 1 together to get ({delta}µ)2 {approx} 40. Using (1) now with the slippage rate estimate ß = 5.6 x 10-4 gives 35,700 generations, or ~700,000 years, which is less than one-seventh the accepted age.


 
View this table:
In this window
In a new window

 
Table 1. Human/chimpanzee data

Assuming the SMM and that the above parameters remain constant, coalescent simulations show that the ({delta}µ)2 and DSW estimates are significantly smaller than those expected under the SMM. Specifically, for two samples of 20 individuals with 25 unlinked microsatellites in two separate random-mating populations of size 104, which were separated until 275,000 generations ago and with mutations following the SMM with ß= 5.6 x 10-4, we expect a 95% confidence interval for ({delta}µ)2 of 179–465, whereas the data were only 40, and a 95% confidence interval for DSW of 7.97–14.6, whereas the data were 5.475.

Drosophila species:
The divergence time between D. melanogaster and D. simulans is estimated to have occurred ~2.5 million years ago (see HEY and KLIMAN 1993 Down). WETTERSTRAND 1997 Down used eight di-, four tri-, and four tetranucleotide repeats and estimated ({delta}µ)2 = 19.393 between these species. Using the mutation estimate of 6.3 x 10-6 from SCHUG et al. 1997 Down, she then used (1) to estimate that the divergence time occurred 1.52 million generations ago. Assuming 10 generations per year, she computed a divergence time of 152,000 years, which is about one-sixteenth of the estimate of HEY and KLIMAN 1993 Down.

One of the problems with this estimation is that tri- and tetranucleotide repeats have considerably smaller slippage rates than dinucleotide repeats in Drosophila (see SCHUG et al. 1998 Down). With this in mind, we applied Wetterstrand's analysis to data on 31 dinucleotide repeat loci from HUTTER et al. 1998 Down given in Table 2. The average value of ({delta}µ)2 for these loci is 16.09. Using the estimate ß = 9.3 x 10-6 from SCHUG et al. 1998 Down in (1) we estimate the divergence time to be ~865,000 generations. Using the previous estimate of 10 generations per year, this translates into 86,500 years, which is about one-thirtieth of the estimate of HEY and KLIMAN 1993 Down.


 
View this table:
In this window
In a new window

 
Table 2. Drosophila data, ({delta}µ) in repeat units

Independently, HARR et al. 1998 Down also used ({delta}µ)2 to estimate the divergence times in the phylogeny of D. melanogaster, D. simulans, D. sechelia, and D. mauritiana. From the possible choices of the mutation rate ß they list, we choose 10-5, which is the closest to that of SCHUG et al. 1998 Down. In this case their estimates differ from those of HEY and KLIMAN 1993 Down by factors of 10–30.

Our second statistic DSW does much better on the data set of HUTTER et al. 1998 Down. The estimate of DSW from their data is 3.64, so assuming an effective population size of N = 106 and using (2) with ß = 9.3 x 10-6, we obtain an estimate of {tau} = 3,330,000 generations. With 10 generations a year this becomes 330,000 years, which is about one-eighth of the estimate of HEY and KLIMAN 1993 Down.

Again coalescent simulations with the above parameters show that these estimates are significantly smaller than those expected under the SMM. Assuming the two populations are separated until 25 million generations ago we expect a 95% confidence interval of 315–728 for ({delta}µ)2 while the data are <20 and a 95% confidence interval of 10.5–18.0 for DSW while the data are 3.64.

Cattle vs. sheep:
These two species diverged ~16 million years ago, which, assuming a generation of 2 years, translates into 8 million generations. ELLEGREN et al. 1997 Down examined 13 loci of bovine origin and 14 of ovine origin. Discarding 3 loci of bovine origin for which there was not reliable information about their length in sheep, the data are given in Table 3.


 
View this table:
In this window
In a new window

 
Table 3. Cattle/sheep data

Two of these loci studied by ELLEGREN et al. 1997 Down show clear signs of mutations other than microsatellite slippage events. At RM103 allele sizes are 115–151 bp in cattle vs. 73 bp in sheep, but the example sequence given for the repeat in cattle is (CA)16. Thus at least part of the average 61.6 bp difference must be due to a major deletion in the sequence flanking the microsatellite in sheep or to an insertion in cattle. At RME11 we have the surprising result that this locus is much longer in sheep than in cattle but that this longer and hence presumably more mutable microsatellite is monomorphic in sheep. Note also that this is the only locus of bovine origin with a large negative {delta}µ. This suggests that again much of this difference in length is due to mutations involving the flanking sequence.

If we remove these two loci, which have an average ({delta}µ)2 of 653, the remaining 22 loci have an average ({delta}µ)2 of 74.4 per locus. If we use an average generation time of 2 years for cattle and sheep, then using (1) we can estimate that the average slippage rate must be ß = 4.65 x 10-6. We could find no information about slippage rates in cattle or sheep, but this is about one-thirteenth the rate of 6 x 10-5 that ELLEGREN 1995 Down observed for microsatellites in pigs.


*  TWO MODELS WITH POINT MUTATIONS
*TOP
*ABSTRACT
*GENETIC DISTANCES
*FOUR EXAMPLES
*TWO MODELS WITH POINT...
*A MODEL WITHOUT POINT...
*DEATH OF MICROSATELLITES
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

In all but the first example of the African vs. non-African split in the human population, if we use the SMM with either of our statistics ({delta}µ)2 or DSW, then we underestimate divergence times. In view of this, it is natural to ask if there is some mechanism that interferes with the normal rate of growth of these divergence statistics. One possibility is that point mutations spoiling perfect repeats reduce microsatellite mutation rates over time. To investigate this we introduce a new model called the PS/PM model that is a modest generalization of the one proposed by KRUGLYAK et al. 1998 Down.

PS/PM model:
There are three types of changes that can occur:

  • Proportional slippage: A microsatellite of length {ell} > {kappa} becomes length {ell} ± 1 at rate b({ell} - {kappa}) each. Microsatellites of length {ell} <= {kappa} do not experience slippage events.

  • Point mutations: For 1 <= j < {ell}, a microsatellite of length {ell} becomes length j at rate a.

  • Birth of microsatellites: {kappa} -> {kappa} + 1 at rate c.

For later purposes, it is convenient to write the new proportional slippage rule succinctly as b({ell} - {kappa})+, where

denotes the positive part of {ell} - {kappa}; i.e., {ell} - {kappa} if the difference is positive and 0 otherwise.

When {kappa} = 1 the PS/PM model reduces to the original model of KRUGLYAK et al. 1998 Down. The motivation for the change from {kappa} = 1 to a general {kappa} comes from several studies. GOLDSTEIN and CLARK 1995 Down studied 17 microsatellite loci in Drosophila, plotted variance of repeat count vs. maximum repeat count, and found (see p. 3884) a straight line that hit zero at seven repeat units. BRINKMAN et al. 1998 Down studied 10,844 parent/child allelic transfers at nine short tandem repeat loci, finding 23 mutations. There were no mutations at loci with fewer than nine repeats and an approximate linear growth of mutations after that point (see Fig 3 on p. 1412). Finally, ROSE and FALUSH 1998 Down studied dinucleotide repeats in the yeast genome and compared their frequency with what would be expected on the basis of random chance. The ratio was close to 1 for one to four repeat units and then the logarithm of the ratio increased linearly (see the middle figure on their p. 614).

In formulating the PS/PM model introduced above, our thought experiment consists of picking two nucleotides at random and seeing how many times they are repeated as we scan to the right, so we only need to keep track of the left one-half of a newly imperfect repeat that has been hit by a mutation. This viewpoint, along with appropriate bookkeeping, can be used to fit the model to data and estimate mutation rates (see KRUGLYAK et al. 1998 Down). However, if we are going to look at microsatellites through the eyes of an experimentalist who only tracks the length of PCR-amplified fragments of DNA, we need to define a new process that keeps track of the lengths of all the perfect repeats in an interrupted repeat as a vector (X1t, X2t, ... Xnt). To explain what we have in mind, consider the concrete sequence

Here we used lower case letters for the perfect repeat segments to make them more clearly visible. In dividing this imperfect repeat into segments it is convenient to include in each piece the final pair of nucleotides that spoil the pattern. Thus the vertical bars mark the ends of the perfect repeat segments, and we record the state as (6, 3, 8). The reason for this convention will become clear as we develop properties of the model.

In words, in our PCR fragment size model, each of the lengths of the perfect repeat units Xti evolves according to the rules of the PS/PM model. In using this model we are concerned only with the life and death of existing microsatellites, so we ignore the birth of new ones.

PCR model:
If the state at time t is (X1t, ... Xnt), then there are two types of changes for any of the lengths Xit with 1 <= i <= n.

  • Proportional slippage: Xit -> Xit ± 1 at rate b(Xit - )+.

  • Point mutation: (X1t, ... , Xit, ... Xnt) -> (X1t, ... , Xit - y, y, ... Xnt) at rate a if 1 <= y <= Xit - 1.

Note that because we include the final imperfect repeat unit in each block, the lengths of the two new pieces created by a point mutation add up to the original length. One final minor point is that since our new bookkeeping system includes the final imperfect repeat, the here should be equal to {kappa} + 1, where {kappa} is the parameter of the PS/PM model.

Let Lt = {Sigma}i Xit be the total length of the microsatellite and let

be its activity; i.e., 2bAt is the rate at which slippage events occur at time t. Since point mutations do not change the total length, and under proportional slippage the microsatellite is equally likely to gain or lose a repeat unit, ELt = L0. That is, the average value of the length stays constant in time. It is somewhat remarkable that there is a simple formula for the variance of Lt despite the complexity of the PCR model.

THEOREM 2: If the initial activity of the microsatellite is A0 then at any time t >= 0

(3)

This result is derived in Appendix B. Theorem 2 concerns the variance of the process, not the population samples. The relationship between this quantity and ({delta}µ)2 is that if each sample is of size one then 2 var(Lt) = ({delta}µ)2. And when the samples are larger than size one and the time to the most recent common ancestor of each sample is much less than the time to the most recent common ancestor of these ancestors, then 2 var(Lt) {approx} ({delta}µ)2. Note that if we let ß = 2bA0, the initial per locus slippage rate, then the first factor is simply ßt, the answer for the SMM. We call the second term in parentheses the correction factor, since it indicates how much the variance has been reduced from the prediction of the SMM due to the effect of point mutations. Using the series expansion e-x = 1 - x + x2/2 - · · · we see that when at is small, the correction factor is {approx}1. In the other direction if at = 1 then the correction factor is 1 - e-1 = 0.632 and a significant reduction has occurred. From this computation, we see that point mutations begin to make a difference when the number of generations t {approx} .

To understand the implications of Theorem 2 we return to our four examples. Thinking of dinucleotide repeats, we assume a point mutation rate of a = 2 x 10-8 per repeat unit (see DRAKE et al. 1998 Down). Based on the work of ROSE and FALUSH 1998 Down we choose = 5, so in all cases a = 10-7, and we expect point mutations to have a significant effect after ~10 million generations. In the African vs. non-African comparison of human populations, t = 6000 generations, so at = 6 x 10-4 and the correction factor is 0.9997. For humans vs. chimpanzees, t = 250,000, so at = 2.5 x 10-2 and the correction factor is 0.9876, which is again ~1. Coalescent simulations show that the 95% confidence intervals for ({delta}µ)2 and DSW for the PCR model have changed by <10% from those for the SMM for this example, so the data are not consistent with the PCR model either. (For the simulations we assumed that for all microsatellites their most recent common ancestor was a perfect repeat of length 19 and the per repeat slippage rate was b = 1.9 x 10-5. This assumption corresponds to a per locus slippage rate of the most recent common ancestor microsatellite being ß = 5.6 x 10-4 as in the SMM.) For cattle vs. sheep, t = 8,000,000 generations, so at = 0.8 and the correction factor is 0.688. For D. melanogaster vs. D. simulans, t = 25,000,000, so at = 2.5 and the correction factor is 0.367. Coalescent simulations for this example show the 95% confidence intervals for ({delta}µ)2 and DSW are 79.2–342 and 6.34–11.8, whereas the observed statistics were <20 and 3.64, respectively. (For the simulations we assumed that for all microsatellites their most recent common ancestor was a perfect repeat of length 15 and the per repeat slippage rate was b = 5.0 x 10-7.) As predicted by Theorem 2, the mean ({delta}µ)2 for the simulations was 184. Fig 1 shows the results of simulating the PCR model to obtain the probability density of the length of a single microsatellite that has evolved for 25 million generations with the parameters of this example. Note that 23% of the microsatellites are longer than 18 repeat units, while only 2 of 186 dinucleotide microsatellites in the original 1-Mb sample of D. melanogaster DNA in KRUGLYAK et al. 1998 Down were this long. The last two examples show that the PCR model leads to significant reductions in predicted values of ({delta}µ)2, but not enough to account for the 13- and 30-fold underestimation observed.


*  A MODEL WITHOUT POINT MUTATIONS
*TOP
*ABSTRACT
*GENETIC DISTANCES
*FOUR EXAMPLES
*TWO MODELS WITH POINT...
*A MODEL WITHOUT POINT...
*DEATH OF MICROSATELLITES
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

Our discussion of Theorem 2 suggests that when at is small, as is the case for comparisons between human populations or between humans and chimpanzees, we can ignore the effects of point mutations. If we set the point mutation rate a = 0 in the PS/PM model and add a superscript 0 to remind ourselves that we have done this, then the activity A0t = {Sigma}i(Xit - )+ follows a very simple dynamic, which we call the proportional slippage/zero mutations (PS/0M) model.

PS/0M model:
If A0t = k then it changes to k ± 1 at rate bk. The process A0t jumps from k to k + 1 at rate bk, and from k to k - 1 at rate bk, and is thus identical to the binary branching process Zt of probability theory, in which Zt is the number of particles at time t and each particle splits into two or dies at rate b each (see, e.g., ATHREYA and NEY 1972 Down). The following is shown in Appendix C:

THEOREM 3: If we use E{alpha} to denote the expected value for the process starting from A00 = {alpha}, then

(4)


(5)


(6)


(7)

In words, proportional slippage is equally likely to increase or decrease the average activity by one, so the average activity does not change in time. The second equation, which can be derived by setting a = 0 in Theorem 2, says that even though the slippage rate varies in time in the PS/0M model, the variance of at0 is linear in time, just as in the SMM, which has constant slippage rates.

Substantial differences between the PS/0M model and the SMM appear when we look at third and higher moments. The SMM is symmetric so E(Xt - {ell})3 = 0, but as (6) shows, the proportional slippage model has positive skewness. FARRALL and WEEKS 1998 Down performed an analysis of 4558 AC dinucleotide repeat loci assayed in the CEPH pedigrees and found positive skewness in the distribution of microsatellite allele lengths. RUBINSZTEIN et al. 1994 Down had earlier observed this skewness and suggested that it was evidence for "a bias in favor of gains" (see p. 1096 of RUBINSZTEIN et al. 1999 Down). However, our results show such a skewed distribution can result from the PS/0M model that has no mutational bias.

Computing the fourth moment reveals another difference between our proportional slippage model and stepwise mutation. In the SMM the difference in microsatellite length, Xt - Yt, between two individuals with a most recent common ancestor t generations ago is the sum of independent random variables. Thus, if t is large, has approximately a normal distribution and the kurtosis

In contrast, (7) and (5) show that when 2{ell}bt is large the kurtosis in the proportional slippage model is

(8)

where ß = 2{alpha}b is the initial per locus slippage rate.

If the kurtosis is large then the distribution of Xt - Yt will have a heavy tail and estimation of quantities such as ({delta}µ)2 will be difficult. To see when the kurtosis will become large, we note that (8) implies this will occur when ßt/2{alpha}2 is large. To see that this answer is reasonable, note that the expected number of slippage events in t generations is ßt and recall that in n steps a random walk typically moves about steps. Thus the kurtosis becomes large when the "typical amount of change" in the microsatellite, (ßt)1/2, exceeds its initial activity {alpha} and hence there is significant probability of microsatellite death.

In the African/non-African split if we assume t = 6000 generations, use an average activity {alpha} = 15, which corresponds to an average size of 20 repeat units, and set ß = 5.6 x 10-4 then ßt/2{alpha}2 = 0.0075 so the kurtosis is 3.02. For the human-chimpanzee split, t = 250,000, ß = 5.6 x 10-4, and {alpha} = 15, so ßt/2{alpha}2 = 0.311 and = 3.93. For D. melanogaster vs. D. simulans, t = 25,000,000, ß = 10-5, and {alpha} = 10, so ßt/2{alpha}2 = 1.25 and = 6.75. Finally, for cattle vs. sheep, we take t = 8,000,000 and {alpha} = 10, so if we use the estimate ß = 6 x 10-5 from pig microsatellites, ßt/2{alpha}2 = 2.4 and = 10.2. One should note, however, that the values for D. melanogaster vs. D. simulans and cattle vs. sheep are overestimates of the kurtosis since they are based on the proportional slippage model, and our earlier calculations showed that in these cases point mutations had a significant effect on the variance.

To interpret the numerical values of the kurtosis, we observe that if a random variable V has kurtosis then

and hence the standard deviation of V2/EV2 is . This shows that if the kurtosis is 3.93 as it is in the human vs. chimpanzee comparison, then, instead of the 3 for the normal distribution, the width of confidence intervals will be = 1.21 times as large or, equivalently, 1.212 = 1.46 times as much data will be needed to obtain the same accuracy of estimation.

The last conclusion shows that estimates of ({delta}µ)2 under the proportional slippage model are not very much more variable than under the SMM. However, the fluctuations under the SMM in this case are huge. Fig 2 gives a simulation of ({delta}µ)2 under the parameters of the human-chimpanzee split. We used two populations of size Ne = 10,000 individuals, a divergence time of 250,000 generations, and a mutation rate of 5.6 x 10-4 per locus per generation. It is interesting to compare the simulations where 51% of the ({delta}µ)2 values are >120 with the data in Table 1 where the largest ({delta}µ)2 among 25 loci is 112. Indeed, as (1) predicts, the average value of ({delta}µ)2 in the simulation is 2ß{tau} = 280.

Further, coalescent simulations of the PS/0M model show that for the human-chimpanzee and the D. melanogaster-D. simulans splits the observed ({delta}µ)2 and DSW statistics are not within the expected 95% confidence intervals. These observations suggest that there may be some additional mechanism(s) preventing microsatellites from getting too long.


*  DEATH OF MICROSATELLITES
*TOP
*ABSTRACT
*GENETIC DISTANCES
*FOUR EXAMPLES
*TWO MODELS WITH POINT...
*A MODEL WITHOUT POINT...
*DEATH OF MICROSATELLITES
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

Our final topic is to compute the probability of microstellite death in the PS/0M model, i.e., the probability a microsatellite will reach 0 activity in t generations. Since, as noted above, the PS/0M model is equivalent to the binary branching process of probability theory, we can compute not only all of the moments of A0t but also the exact distribution of A0t. It follows from results on page 109 of ATHREYA and NEY 1972 Down that

THEOREM 4: Letting P{alpha} denote the probability law for the PS/0M model starting from A00 = {alpha},

(9)

while for k >= 1,

(10)

To apply Theorem 4 to our four examples, we begin by recalling that b = , where ß is the per locus slippage rate and {alpha} is the activity, that is, the length minus {kappa} = 4. In the African vs. non-African human comparison, t = 6000, ß = 5.6 x 10-4, and {alpha} = 15 (i.e., an average length of 19 repeat units), so (9) shows that the probability of having no activity after t = 4 x 103 generations is (0.11/1.11)15 < 10-15. In the human vs. chimpanzee comparison, t = 250,000, ß = 5.6 x 10-4, and {alpha} = 15, so the probability of having no activity after t generations is 0.054.

Fig 3 shows the distribution of the lengths in this case as computed from (10). Note the positive skewness in the distribution as predicted by Theorem 3. Note also that our numerical solution has 17% of the microsatellites having >30 repeat units while only 1 of 205 dinucleotide microsatellites in the original 1-Mb sample of human DNA in KRUGLYAK et al. 1998 Down has this length. This again suggests that there may be some additional mechanism(s) preventing microsatellites from getting too long.

For the D. melanogaster vs. D. simulans and cattle vs. sheep comparisons, the PS/0M model overestimates the number of microsatellites with no activity. But this is to be expected since our earlier results show that point mutations have slowed down microsatellite mutation processes over this amount of time.


*  DISCUSSION
*TOP
*ABSTRACT
*GENETIC DISTANCES
*FOUR EXAMPLES
*TWO MODELS WITH POINT...
*A MODEL WITHOUT POINT...
*DEATH OF MICROSATELLITES
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

In summary, microsatellite mutation models that incorporate point mutations and proportional slippage events fit the data better than the SMM. However, these two features are not enough to explain, for example, the observation that the genetic distance statistics ({delta}µ)2 and DSW tend to underestimate divergence times and have more difficulty with more distant comparisons. This and other evidence we presented suggests that long microsatellites are more likely to become shorter rather than longer when a mutation occurs.

One possibility is that there is selection against longer alleles. This effect is clearly noticeable in microbial genomes where selection for small genome size appears to cause microsatellites to be much shorter than they would be by chance alone (see FIELD and WILLS 1998 Down). The low frequency of di- and tetranucleotide repeats and the enhanced frequency of trinucleotide repeats in coding sequence in yeast (see YOUNG et al. 2000 Down) are another sign of the effects of selection.

Upper limits on allele sizes are a severe form of selection that has been incorporated in some models (e.g., NAUTA and WEISSING 1996 Down; FELDMAN et al. 1997 Down; POLLOCK et al. 1998 Down; STEFANINI and FELDMAN 2000 Down). This simple approach allows one to develop a detailed theory. However, it is not clear what biological mechanism sets an absolute upper bound on the length of all CA repeats. A second approach (see GARZA et al. 1995 Down; ZHIVOTOVSKY et al. 1997 Down; ZHIVOTOVSKY 1999 Down) is that there is a mutational bias such that alleles of large size mutate preferentially to alleles of smaller size. WIERDL et al. 1997 Down observed this where they inserted GT repeats of various sizes into the coding sequence of a yeast gene. In D. melanogaster mutation-accumulation lines HARR and SCHLOTTERER 2000 Down observed that for long microsatellites, although the number of upward and downward mutations was identical, the size of the downward mutations was larger than the size of the upward ones. Two recent studies of mutations observed in human pedigrees also support this notion. ELLEGREN 2000 Down found a weak but statistically significant negative relationship between the magnitude of mutation and standardized allele size. XU et al. 2000 Down examined 236 mutations at 122 tetranucleotide repeat loci and found that the rate of expansion mutations is roughly constant but contraction mutations increase with length. It will be of interest to explore the dependence of the mutational bias on length and to incorporate these effects into our proportional slippage/point mutation model to develop a more accurate model of microsatellite evolution. If mutational bias appears only when microsatellites are fairly long this bias should have only a limited impact on our previous model-based estimates of slippage rates (see KRUGLYAK et al. 1998 Down, KRUGLYAK et al. 2000 Down).


*  ACKNOWLEDGMENTS

We thank two anonymous reviewers, Tessa Bauer DuMont, Jennifer Calkins, Semyon Kruglyak, Willie Swanson, and Todd Vision for their many helpful comments. This work was partially supported by National Institutes of Health (NIH) grant GM36431 to C.F.A., NIH grant GM36431-14S1 to C.F.A. and R.T.D., and National Science Foundation grant DMS9877066 to R.T.D.

Manuscript received June 15, 2000; Accepted for publication July 6, 2001.


*  APPENDIX A
*TOP
*ABSTRACT
*GENETIC DISTANCES
*FOUR EXAMPLES
*TWO MODELS WITH POINT...
*A MODEL WITHOUT POINT...
*DEATH OF MICROSATELLITES
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

To compute DSW and ({delta}µ)2, let {xi}1, {xi}2, ... be independent and ±1 with probability 1/2 each and let Sn = {xi}1 + · · · + {xi}n. The {xi}i are the results of the various slippage events and Sn is the total change after n events. If each population consists of N diploid individuals then the number of slippage events, U, before the ancestors of X and X' (or of Y and Y') coalesce has a shifted geometric distribution with success probability p = ; that is, we have

(A1)

and hence EU = - 1 = 4ßN. If the two populations diverged {tau} generations ago then the number of slippage events before the ancestors of X and Y coalesce has the same distribution as U + V where V gives the number that occurs during the first {tau} generations. V has a Poisson with mean 2ß{tau}; that is, P(V = m) = for m = 0, 1, 2, ...

In the case of ({delta}µ)2 breaking things down according to the values of U and V and using the fact that ES2n = n we have

The computation for DSW starts out the same,

(A2)

but the computation of E|SU| and E|SU+V| is more complicated. To begin, we note that since P({xi}i = 1) = P({xi}i = -1) = , considering two cases Sn-1 = 0 and Sn-1 != 0 we have

(A3)

Since Sn alternates between even and odd values, it can be 0 only after an even number of steps, and simple path counting gives

where (nm) is the usual binomial coefficient, which gives the number of ways of choosing m things out of a set of n and k! = 1 · 2 · · · k.

Let T be a random time, e.g., U or U + V. Writing 1(T>=n) for the function that is 1 if T >= n and 0 otherwise, we have |ST| = {Sigma}{infty}n=1(|Sn| - |Sn-1|) · 1(T>=n), so taking expected values and using the independent of T and Sn with (A3) we have

(A4)

Changing variables n = 2k + 1 and using (A1) shows that in the case T = U we have

Differentiating the function f(x) = (1 - x)-1/2 we find its kth derivative is

Recalling the formula for the Taylor series of a function f,

and comparing with the formula for E|SU| we have that

(A5)

the last equality following from 1 - p = .

In the case T = U + V, P(U + V >= 2k + 1) is given by

(A6)

Together with (A2), (A4), and (A5) this can be used to compute E|SU+V| numerically, but it does not seem possible to sum the series to get an exact solution. To begin to derive an approximation for E|SU+V|, we note that if n is large, Sn/ {approx} {chi}, where {chi} has a normal distribution, so E|Sn| {approx} n1/2E|{chi}| = ()1/2. If we let g(n) = E|Sn| then E|SU+V| = Eg(U + V). Writing W = U + V to simplify formulas and expanding in Taylor series,

Taking expected value of each side,

(A7)

Our next goal is to show that if 2ß{tau} is large and {tau} >= N we can drop the second term from (A7) to end up with

which with (A5) gives (2). To do this we note that for large x, g(x) {approx} Cx1/2 so g''(x) {approx} (C/4)x-3/2 and the ratio of the two terms is

To see when this will be small we use formulas for the mean and variance of the Poisson and geometric distributions to conclude

since p = . From this we see that the ratio of interest is

(A8)

If 2ß{tau} is large and {tau} >= N we can drop the 2ß{tau} + 4ßN from the numerator and then divide top and bottom by 4ß2 to see that the last expression is

(A9)

when {tau} >= N. In words the error we make by neglecting the second term in (A7) is at most 5.5%, and as {tau}/N increases, the error will become smaller.


*  APPENDIX B
*TOP
*ABSTRACT
*GENETIC DISTANCES
*FOUR EXAMPLES
*TWO MODELS WITH POINT...
*A MODEL WITHOUT POINT...
*DEATH OF MICROSATELLITES
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

Writing Xt = (X1t, ... Xnt) and ei for the vector that has one in the ith place and zero otherwise, it follows from the definition of the PCR model and the Kolmogorov differential equations for the associated Markov chain that

(B1)

where the two parts of the right-hand side correspond to proportional slippage and point mutation events:

If we let g1(Xt) = {Sigma}iXit be the total length then 2g1(Xt) = 0 since point mutations do not change the length and 1g1(Xt) = 0 by computation so

(B2)

To prepare for the computation of the variance, let h(Xt, j) = {Sigma}i(Xit - j)+, where j <= . Since proportional slippage is a fair game and no slippage occurs for pieces of length j, 1h = 0. To compute the other term, we note that

To evaluate the sum we use the identity {Sigma}kz=1 2z = k(k + 1) to conclude

To check the second equality note that if Xit <= j + 1 it says 0 = 0, while for Xit > j + 1 the positive parts are irrelevant. Combining our computations,

Using this with (B1) and solving the differential equation we have

(B3)

Turning now to g2(Xt) = ({Sigma}iXit)2, we have 2g2(Xt) = 0 since point mutations do not change the total length. For the other term we note that if Lt = {Sigma}iXit then (Lt + 1)2 - 2L2t + (Lt - 1)2 = 2 so 1g2(Xt) =