Abstract
Distributions of pairwise differences often called “mismatch distributions” have been extensively used to estimate the demographic parameters of past population expansions. However, these estimations relied on the assumption that all mutations occurring in the ancestry of a pair of genes lead to observable differences (the infinitesites model). This mutation model may not be very realistic, especially in the case of the control region of mitochondrial DNA, where this methodology has been mostly applied. In this article, we show how to infer past demographic parameters by explicitly taking into account a finitesites model with heterogeneity of mutation rates. We also propose an alternative way to derive confidence intervals around the estimated parameters, based on a bootstrap approach. By checking the validity of these confidence intervals by simulations, we find that only those associated with the timing of the expansion are approximately correctly estimated, while those around the population sizes are overly large. We also propose a test of the validity of the estimated demographic expansion scenario, whose proper behavior is verified by simulation. We illustrate our method with human mitochondrial DNA, where estimates of expansion times are found to be 10–20% larger when taking into account heterogeneity of mutation rates than under the infinitesites model.
WITH the advent of the coalescent theory (Kingman 1982), people have become increasingly aware of the profound effect of demography on the amount of genetic variability maintained in a population (Hudson 1990; Donnelly and Tavaré 1997). Population expansions or contractions indeed leave recognizable signatures in the pattern of molecular diversity (reviewed in Harpendinget al. 1998). For instance, sudden demographic expansions lead to starshaped phylogenies and unimodal distributions of pairwise differences (Slatkin and Hudson 1991; Rogers and Harpending 1992), to a reduction of the number of segregating sites (Bertorelle and Slatkin 1995; ArisBrosou and Excoffier 1996; Tajima 1996), to a lower amount of linkage disequilibrium between linked loci (Slatkin 1994), or to the occurrence of a large proportion of very low frequency mutations (Fu and Li 1993; Fu 1997). Note that population bottlenecks usually have other recognizable effects, often opposed to those of population expansions (e.g., Tajima 1993; Cornuet and Luikart 1996; Harpendinget al. 1998).
When exposed to the evidence of a past demographic expansion, one might want to estimate the parameters of the expansion, such as the time at which it occurred and its magnitude, but the choice of parameters to be estimated depends on a particular scenario of population growth one might choose, such as exponential growth, logistic growth, or an instantaneous stepwise population size change. These three models are obviously related but have rarely been compared (but see Polanskiet al. 1998). The latter model has been favored in the literature due to its simplicity and because simulation studies have shown that it was a good approximation of rapid logistic growth (Rogers and Harpending 1992). Rogers and Harpending (1992) convincingly showed that under an infinitesites model, the shape of the distribution of the number of observed differences between pairs of DNA sequences (often called the mismatch distribution) conveyed information on the timing and the amplitude of a stepwise expansion. They proposed a nonlinear leastsquares approach (Rogers and Harpending 1992) or a method of moments (Rogers 1995) to find the theoretical mismatch distribution that would best fit the observations. Several authors noted that this approach could be inadequate (Lundstromet al. 1992; Bertorelle and Slatkin 1995; ArisBrosou and Excoffier 1996; Wakeley and Hey 1997) because the infinitesites model was not realistic, especially in the case of the mitochondrial genome where an important heterogeneity of mutation rates had been observed (Lundstromet al. 1992; Wakeley 1993). In fact, the presence of a few fast mutating sites can lead to unimodal mismatch distributions even for moderate expansions (Lundstromet al. 1992; ArisBrosou and Excoffier 1996) and would thus lead to an underestimation of the age of the expansion and to an overestimation of its magnitude. Rogers and his collaborators tried to address these concerns. They concluded that both the mean number of pairwise difference (Rogers 1992) and the confidence intervals around the estimated demographic parameters were relatively unaffected by slight to moderate amounts of rate heterogeneity (Rogerset al. 1996). They thus proposed to ignore any heterogeneity of mutation rates when estimating demographic parameters from the mismatch distribution and therefore to stick to the infinitesites predictions. One cannot be entirely satisfied by these conclusions because simulation studies have shown that the mean of the mismatch distribution was indeed very sensitive to rate heterogeneity (ArisBrosou and Excoffier 1996) and that the relative insensitivity of the confidence intervals to rate heterogeneity was mainly due to their very large size (Rogerset al. 1996).
We thus propose in this article to extend the model of Rogers and Harpending (1992) to explicitly take into account possible heterogeneity of mutation rates when estimating the demographic parameters. We also propose an alternative way to derive confidence intervals around the estimated parameters based on a simple bootstrap approach. We check by simulations the validity of those confidence intervals for a few test cases and show that only those associated with the timing of the expansion are approximately correctly estimated. As one can sometimes observe a poor fit between the data and the mismatch distribution predicted by the model, we also propose a test of the validity of the stepwise demographic expansion scenario. We finally illustrate our method by human data from the mitochondrial DNA control region.
THEORY AND METHODS
The mismatch distribution under the infinitesites model: We assume that t generations ago, a population at equilibrium of size N_{0} entered a demographic expansion phase to instantaneously reach a new size N_{1} and that it remained at that size ever since. Under this demographic scenario described in Figure 1 and assuming that every new mutation occurs at a new site [the infinitesites mutation model of Kimura (1969)], Li (1977) derived an expression for the probability of observing i differences between two genes taken at random from this population, as
The mismatch distribution under a finitesites model: Under the finitesites model,
Onesite case: We first solve the problem for one site (m = 1) assuming Kimura's twoparameters model of mutation (Kimura 1980) with arbitrary relative transition (s) and transversion (v) rates subject to s + v = 1. Conditional on the number of mutations, we consider the mutation process as a random walk between the nucleotides A, C, G, and T. This Markov process has the singlestep transition matrix
We thus obtain
Multisite case, homogeneous mutation rates: Instead of deriving an explicit equation for H^{m}(i, j), when m > 1 we can compute these probabilities numerically using a recurrence equation, as shown below. Let us suppose that we have already derived the probability H^{m}^{–1}(i, j) and that we want to study the case for an additional site and thus derive H^{m}(i, j). Suppose that l mutations have occurred at the mth site and that the (j – l) remaining mutations have occurred at the m – 1 other sites. The probability of observing overall i differences will depend upon whether we observe one or no difference at the mth site. With probability P_{1}, one difference will be observed at the mth site and (i – 1) at the (m – 1) other sites, and with probability P_{2}, all i differences will be observed among the (m – 1) other sites. Therefore,
The mismatch distribution under a tworates finitesites model: Mutation rate heterogeneity arises when the mutation rates are not equal for all nucleotide sites. The simplest form of heterogeneity to be considered is a tworates mutation model, where we make the distinction between fast and slow sites. As most mutations accumulate at a small number of fast sites, convergent or reverse mutations can become quite common. The consequence of this type of heterogeneity on the pattern of diversity has been studied in the case of the control region of human mitochondrial DNA (Wakeley 1993; Yang 1994, 1996; Bertorelle and Slatkin 1995; Yanget al. 1995). Following Yang (1996), who inferred the number of segregating sites in a stationary population under the finitesites model for two classes of mutation rates, we can modify Equation 3 by considering that m_{1} of the m sites are fast and that m_{2} are slow. In this case, we have to take into account all possible ways of partitioning the i differences among the slow and fast regions. If we assume that mutations occur independently, the probabilities of these partitions simply follow a binomial distribution. The expected mismatch probability distribution is thus given by
Multisite case, mrates mutation model: Suppose that we have a sequence of length m and that each nucleotide has a potentially different probability p_{i} (i = 1... m) of being hit by a mutation, subject to the condition
The mismatch distribution for gammadistributed mutation rates: A gamma distribution of mutation rates can be seen as a special case of the mrates mutation model. Such a distribution has been hypothesized for explaining the pattern of diversity in the control region of the mitochondrial genome (e.g., Kocher and Wilson 1991; Hasegawaet al. 1993; Wakeley 1993; Yang 1996). The density of the gamma distribution is given by
In the present article, we used the values of the shape parameters α computed by S. Meyer (Meyeret al. 1999) on a human mtDNA control region sequence database (Handtet al. 1998) as α = 0.26 for HV1 and α = 0.13 for HV2.
Estimation of past demographic parameters using a leastsquares approach: We estimated the demographic parameters θ_{0}, θ_{1}, and τ from the mismatch distribution, using a nonlinear leastsquares approach. We use the Hooke and Jeeves algorithm (Hooke and Jeeves 1963) to find those parameters that minimize the sum of square deviations (SSD) between the observed mismatch distribution
{F_{i–}_{obs}}, i = 1,..., n and its expectation {F_{i–}_{exp}} under a particular model. SSD is conventionally defined as
Bootstrap confidence intervals: We followed a parametric bootstrap approach to generate percentile confidence intervals around the estimated parameters
Testing the validity of the sudden expansion model: We tested the hypothesis that the observed data fitted the sudden expansion model defined by the estimated parameters using the same parametric bootstrap approach as described above. We used here SSD defined in Equation 9 as a test statistic. We obtained its distribution under the hypothesis that the estimated parameters are the true ones by simulating B samples around the estimated parameters. As before, we reestimated each time new parameters θ *_{0}, θ *_{1}, and τ* and computed their associated sums of squares SSD_{sim}. The P value of the test is therefore approximated by
RESULTS
We show in Figure 2 the theoretical mismatch distributions and the demographic parameters estimated using different methods and mutation models for the two hypervariable segments of Mandenka mtDNA control region (Gravenet al. 1995). Except for the curve based on Rogers' method of moments (Rogers 1995), the shapes of the theoretical mismatch distributions are very similar to each other and provide no real indication of the validity of the underlying model. Note also that a much better fit is found for HV1 than for HV2. While the shape of the bestfitted mismatch does not seem to depend much on the mutation model, the values of the estimated parameters do change quite extensively, especially the expansion time τ, which shows larger values for finitesites models than for the infinitesites model. This is of course due to the fact that several mutations can accumulate at a given site in finitesites models and that a longer evolutionary time is necessary to lead to the same number of observed differences. The magnitude of the expansion is also found to be smaller in finitesites models, in agreement with previous simulation results (ArisBrosou and Excoffier 1996).
In Figure 3, we show the expected mismatch distributions fitted for the Turkana sample (Watsonet al. 1996) according to a finitesites tworates model (Figure 3a) and to a finitesites gamma distribution model (Figure 3b). We also report the average mismatch distributions obtained from 5000 simulations performed according to the estimated parameters. Simulated and expected mismatch distributions are found to be in very good agreement, motivating the use of simulations to get empirical confidence intervals around the parameters.
To check if these confidence intervals have good coverage properties (i.e., the true parameters should be included in the confidence interval with a probability 1 – α), we performed a series of simulations for a set of predefined parameters. For a given set of parameters θ_{0}, θ_{1}, and τ, we simulated 1000 data sets from which we estimated the parameters
For illustration purposes, we present in Figure 5 the expected mismatch distributions of a few human samples analyzed for HV1 or HV2, as well as the limits of a 95% confidence interval around the mismatch distributions. Despite an obvious lack of goodnessoffit for some distributions, the adequacy of the sudden expansion model could only be rejected for the Ngoebe HV2 sample (SSD P value = 0.007). For the other samples, random mismatch distributions generated by simulations lead to SSD values larger than the observation in >5% of the cases, making the observed mismatch distributions compatible with the estimated parameters.
DISCUSSION
In this study, we extend the model of Rogers and Harpending (1992) to estimate the parameters of a sudden stepwise demographic expansion by explicitly taking into account a possible heterogeneity of mutation rates. Contrary to previous claims (Rogers 1992; Rogerset al. 1996), we find that the estimated values of the parameters and their confidence intervals are quite sensitive to departure from the infinitesites model. For instance, the estimated values of the expansion time (τ) shown in Figure 2 for the Mandenka population are found, respectively, 9 and 20% larger for HV1 and HV2 when using a model with gammadistributed mutation rates than for the infinitesites model. Even though our methodology appears computationally more intensive, it thus seems justified to take into account the known departures from the infinite sites model to estimate the parameters of the stepwise demographic model. The present approach does not allow us to retrieve all the parameters of a demographic expansion with the same efficiency. As shown in Tables 1 and 2, the expansion time (τ) and the initial population size (θ_{0}) are the only parameters that can be estimated without much bias and with reasonable precision, while the estimation of θ_{1} is clearly biased upward. The confidence intervals obtained from the parametric bootstrap approach are fairly estimated only for the expansion time τ, while those for the population sizes are clearly too large and thus overly conservative. This implies that the magnitude of the expansion cannot be precisely recovered by the present approach. This is understandable because once the expansion is sufficiently large, very few coalescent events (if any) will have occurred between the present time and the beginning of the expansion. As it is the accumulation of those coalescent events that can provide some information on the present population size, there will often be too few of them to get a reliable estimate of the present size, which will also tend to be overestimated. The present parametric bootstrap approach for defining the confidence intervals differs somewhat from that described in previous studies (Rogers 1995; Rogerset al. 1996). The previous approaches consisted of finding a set of values of the demographic parameter compatible with the observed data. The “compatibility” criterion was a statistic of goodnessoffit (mean absolute error) between the observed and the theoretical mismatch distribution. A set of demographic parameters θ_{0}, θ_{1}, and τ was declared compatible if the goodnessoffit statistic fell within a 95% confidence interval obtained by simulation. While this approach seems valid, it requires much heavier computations than ours if one wants to adequately explore the space of possible parameters, as a series of simulations needs to be carried out for each set of parameters. Moreover, the potential impact of the chosen goodnessoffit statistic on the results and the reliability of the confidence intervals has not been addressed. The fact that the effective population sizes are not well recovered from the mismatch distribution would suggest that this previous approach may suffer from the same problems as the simple parametric bootstrap procedure and thus also lead to overly large confidence intervals.
A recent study has shown that timedependent demographic models (including the present stepwise expansion model) were unstable with respect to the estimation of the demographic parameters describing the population sizes (Polanskiet al. 1998), in the sense that large fluctuations in the demographic parameters lead only to small changes in the mismatch distribution (see also Rogers 1997). Conversely, small differences in the shape of the observed mismatch distribution will profoundly affect the values of the estimated parameters. As the estimation of the expansion time τ depends essentially on the mean of the mismatch distribution (Rogers and Harpending 1992), while the other parameters θ_{0} and θ_{1} depend on higher moments of the distribution, those latter two parameters are more likely to be affected by the stochasticity of the genealogical process than τ. This is in keeping with our simulations, which show that the expansion time is usually quite well recovered from the mismatch distribution (Tables 1 and 2).
Even though we have refined the mutation model for mtDNA sequences, one can see that the theoretical mismatch distributions do not always perfectly fit with the observed distributions (Figures 2, 3, and 5). We can see two reasons explaining this discrepancy.
First, the single stepwise expansion model may be inadequate for some populations. Alternative population expansion models such as exponential growth or logistic growth could be more realistic that the stepwise growth used in this study (Polanskiet al. 1998), but as long as the magnitude of the expansion is large and we start from a small population, they should lead to results very similar to those provided here (Rogers and Harpending 1992; Rogers 1997). It seems more likely that demographic scenarios very different from population growth may explain these discrepancies. Population contractions may indeed have occurred in some populations and could explain the rejection of the sudden expansion model. A population crash could have occurred in the Ngoebe population from Panama, as well as in other native Amerindian populations, where the hypothesis of sudden expansion is not supported, as in the Kuna from Panama (P = 0.05, HV1), the Huetar from CostaRica (P = 0.026, HV2), or the Mapuche from Argentina (P = 0.009, HV2). Additional evidence for a recent population contraction comes from the observation of large positive values for Tajima's Dstatistics (Tajima 1989) in those populations (results not shown), which are expected in the case of a recent population bottleneck (Tajima 1993). Note also that other factors like admixture events, population substructure, or inbreeding could all affect the shape of the mismatch distribution but to an extent that has not yet been quantified.
Second, the probabilities derived in Equations 1 and 2 and their derivatives apply to a pair of genes chosen at random from the population, while they are applied here to a random pair chosen from the sample. However, pairs drawn from the sample are not independent due to the shared portions of their gene genealogy. In populations having gone through a recent and large expansion, the internal branches are very short due to the starlike structure of the tree (Slatkin and Hudson 1991; Fu 1997), and a very few mutations will accumulate on those branches. In that case, the correlation between the number of pairwise differences (d_{ij}) will only be due to shared external branches (i.e., d_{12} and d_{13} will be correlated due to the shared lineage leading to sequence 1, but d_{12} and d_{34} should be almost independent), and our derivations should better hold at the sample level. On the other hand, for stationary populations or relatively small or remote expansions, some coalescent events will occur before and after the expansion. The internal branches will be longer and have a large associated variance. Those equations, while still being correct for a single pair of genes, will thus not allow us to get the sample distribution of pairwise differences as they do not take into account the covariance of pairwise differences. Therefore, the present method is not expected to recover the parameters of a demographic expansion efficiently unless the expansion has been very large.
Although mismatch distributions carry some information on the shape of the underlying gene genealogy and coalescent process, other aspects of molecular diversity are not explicitly taken into account by this approach. It has been shown that demographic parameters recovered from the mismatch distribution did not allow the correct prediction of the number of observed polymorphic sites (Bertorelle and Slatkin 1995) or of the distribution of mutation frequencies (Wakeley and Hey 1997) for human mtDNA. This could either be due to departure from the infinitesites mutation model or from the proposed simple demographic model. Even after having introduced a more realistic mutation model, we still find that the observed number of segregating sites is not always in agreement with the distribution obtained from simulations based on the estimated demographic parameters. For instance, considering the mismatch distributions shown in Figure 5, even if we get a perfect fit for the English HV1 sample, the estimated parameters lead on average to far fewer segregating sites than observed, although not significantly so (S_{obs} = 67; S_{mean} = 60.6; SD(S) = 6.2; P = 0.872). Interestingly, the Mandenka sample presents a significant lack of segregating sites for HV2 as compared to the estimated expansion conditions (S_{obs} = 27; S_{mean} = 38.0; SD(S) = 5.0; P = 0.014). This discrepancy could be explained by a very large heterogeneity of mutation rates in HV2 for this population, but it seems difficult to understand how and why the structural and functional constraints that are supposed to shape the heterogeneity of mutation rates (Wakeley 1993, p. 614) could differ between populations. Additional studies on that matter would nevertheless be needed to exclude this possibility.
To get absolute values for the demographic parameters inferred using the present approach, one should get an estimation of the substitution rate at the nucleotide level. The real value of mutation rate in humans has recently been the subject of an intense debate between those advocating the use of a phylogenetic mutation rate (∼3 × 10^{–6} substitutions per site per generation of 20 yr) calibrated by the divergence between humans and chimpanzees (Jazinet al. 1998) and those studying the mutation process directly on pedigrees giving numbers ∼10 times larger (∼2.7 × 10^{–5} substitutions per site per generation; Howellet al. 1996; Parsonset al. 1997; Parsons and Holland 1998). For the present methodology to be fully beneficial, it thus seems highly necessary to get reliable estimates of mutation rates. Otherwise, the importance of taking into account more realistic mutation models would seem rather futile.
Even if the present approach is an improvement over previous methods, it seems that the use of the mismatch distribution as a summary statistic may not exploit the full potential of molecular data and that maximumlikelihood methods that take into account phylogenetic relationships between DNA sequences (e.g., Griffiths and Tavaré 1994; Kuhneret al. 1995; Tavaré et al. 1997; Kuhneret al. 1998; Weiss and von Haeseler 1998) would be needed to get more reliable estimates of demographic parameters. However, considering the fact that these methods are extremely computerintensive when heterogeneity of mutation rates is considered, the present approach may still be useful in most practical purposes.
Acknowledgments
We are grateful to Ziheng Yang, Monty Slatkin, and two anonymous reviewers for their helpful comments on the manuscript and to Simon Tavaré for some useful advice. We thank Alan Rogers for providing us with C source code for generating gammadistributed random variates. We are grateful to André Langaney for his support throughout this work. This study was made possible by Swiss National Fund grant nos. 32047053.96 and 31039847.93. A program for computing the estimated demographic parameters from the mismatch distribution and performing the tests described in this article is available from S.S. upon request. These programs are also integrated into the Arlequin software, available on http://anthropologie.unige.ch/arlequin/.
Footnotes

Communicating editor: G. B. Golding
 Received July 30, 1998.
 Accepted March 19, 1999.
 Copyright © 1999 by the Genetics Society of America