Abstract
Microsatellites have been widely used as tools for population studies. However, inference about population processes relies on the specification of mutation parameters that are largely unknown and likely to differ across loci. Here, we use data on somatic mutations to investigate the mutation process at 14 tetranucleotide repeats and carry out an advanced multilocus analysis of different demographic scenarios on worldwide population samples. We use a method based on less restrictive assumptions about the mutation process, which is more powerful to detect departures from the null hypothesis of constant population size than other methods previously applied to similar data sets. We detect a signal of population expansion in all samples examined, except for one African sample. As part of this analysis, we identify an “anomalous” locus whose extreme pattern of variation cannot be explained by variability in mutation size. Exaggerated mutation rate is proposed as a possible cause for its unusual variation pattern. We evaluate the effect of using it to infer population histories and show that inferences about demographic histories are markedly affected by its inclusion. In fact, exclusion of the anomalous locus reduces interlocus variability of statistics summarizing population variation and strengthens the evidence in favor of demographic growth.
INTEREST in the use of microsatellites as tools for the study of population processes followed soon after their discovery (Litt and Luty 1989; Weber and May 1989; Burke 1991; Pena 1993; Bowcock et al. 1994; Di Rienzo et al. 1994). Population genetics theory can be used to relate aspects of the patterns of variation at microsatellite loci, in samples from a population, to the history of the population and the details of the mutation process generating variability. All attempts to infer population history from microsatellite data rely on assumptions, notably about the nature of the mutation processes at the loci involved (Valdes et al. 1993; Di Rienzo et al. 1994; Zhivotovsky and Feldman 1995; Pritchard and Feldman 1996; Feldman et al. 1997; Reich and Goldstein 1998). While the results of such analyses can depend critically on the assumptions made, there has been little experimental work aimed at observing directly the products of these mutation processes to provide an empirical foundation for population studies. Moreover, the validity of most methods or their power to detect population expansion relies on strong assumptions about the mutation mechanism: typically that both the mutation rate and the process that changes allele length are the same at each locus and/or that all changes to allele length are equally likely to involve the gain or loss of a single repeat unit (Shriver et al. 1993; Valdes et al. 1993; Kimmel et al. 1998; Reich and Goldstein 1998; Reich et al. 1999).
As a step toward an improved characterization of the mutation process on a locus-by-locus basis, we devised an approach based on the analysis of somatic microsatellite mutations in cancer patients, which allows the estimation of the distribution of mutation sizes for each locus (Di Rienzo et al. 1998). It transpires that one particular feature of this distribution, namely the average squared change in repeat number, or mutation mean square, plays a central role in the behavior of commonly used summaries of population variation such as the variance of the repeat number in a population sample. Our earlier study (Di Rienzo et al. 1998) suggested that locus-by-locus estimates of the mutation mean square from somatic mutations are informative for the germ-line mutation processes, and in addition, that these processes can differ across loci in important respects.
Here, we report on three results and a subsequent population analysis. First, we extend our previous findings to a broader range of human populations by applying the same approach to a second data set on 14 additional microsatellite loci. The results further validate the use of microsatellite instability as a means of characterizing the mutation process of microsatellites. Second, we investigate the variability of mutation rates across loci by using our locus-specific estimates of the distributions of mutation size. The purpose of this analysis is to identify loci that are “anomalous” in that their extreme pattern of population variation cannot be accounted for solely by mutation size variability. We identify one such locus and conclude that its inclusion may compromise the inference about population histories. Third, we extend a result of Slatkin (1995) to show that a linear relationship between the population variance and the mutation mean square would be expected under any demographic scenario for any generalized stepwise mutation model, even if there is a bias in the direction of the mutational change.
The amount of variation expected at a particular locus in a population increases with the variability (measured by the mutation mean square) of the mutation process. Interlocus variability in the mutation mean square is effectively another source of noise in an already very noisy system. Locus-specific estimates of the mutation mean square can be used to correct for this effect before combining information across loci. Di Rienzo et al. (1998) introduced a new statistic, called the normalized population variance (NPV), defined as the ratio of the population variance at a locus to (an estimate of) the mutation mean square at that locus. Inferences based on population data from a single locus are typically very imprecise, even if the genetic mechanisms at the locus are completely understood (e.g., Donnelly and Tavaré 1995). In making inferences about population history, it is thus desirable to use data from many unlinked loci. In the case of multilocus microsatellite analyses, the use of NPV values, rather than just the observed population variance, considerably improves the quality of the inference: tests of particular scenarios about population history will have greater power, and estimates of population parameters will be more precise. The better the estimates of the mutation mean square at each locus, the more marked this effect will be.
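The NPV defined above can be illustrated with a minimal sketch. The function names and the repeat counts below are hypothetical and serve only to show the arithmetic: two loci with very different raw variances yield the same NPV once each is divided by its own mutation mean square.

```python
from statistics import variance


def normalized_population_variance(repeat_counts, mutation_mean_square):
    """NPV = sample variance of repeat number / estimated mutation mean square."""
    if mutation_mean_square <= 0:
        raise ValueError("mutation mean square must be positive")
    return variance(repeat_counts) / mutation_mean_square


def average_npv(samples_by_locus, mean_square_by_locus):
    """Average the per-locus NPV values within one population sample."""
    npvs = [
        normalized_population_variance(samples_by_locus[locus],
                                       mean_square_by_locus[locus])
        for locus in samples_by_locus
    ]
    return sum(npvs) / len(npvs)


# Hypothetical repeat counts (not data from this study).
samples = {
    "locus_A": [10, 11, 12, 11, 10, 12],  # narrow mutation sizes
    "locus_B": [8, 12, 10, 14, 6, 10],    # broad mutation sizes
}
mean_squares = {"locus_A": 1.0, "locus_B": 10.0}

mean_npv = average_npv(samples, mean_squares)
```

In this toy input the raw variances differ tenfold, yet both loci contribute the same NPV, which is the sense in which the normalization removes one source of interlocus noise before pooling.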
As a result, we are in a position to allow for most of the complexities of the mutation process at microsatellite loci and, thus, carry out an advanced multilocus analysis of different demographic scenarios. Because of our less restrictive assumptions about the mutation process, our method is more powerful to detect departures from the null hypothesis of constant population size than other methods that have been applied to similar data sets (Kimmel et al. 1998; Reich and Goldstein 1998). Our analysis shows evidence for historical population expansions in all the populations examined with the exception of Africa in the second data set.
MATERIALS AND METHODS
Subjects: Study subjects included 219 patients with sporadic colorectal cancer diagnosed and treated at the Northwestern Memorial Hospital, Chicago, Illinois, as described in Di Rienzo et al. (1998). Sections of tumor and normal tissue were cut from paraffin-embedded tissue blocks and the DNA was extracted from the tissue sections as described in Wright and Manos (1990).
The Sardinia (Italy) population sample was randomly selected from previously described samples (Di Rienzo and Wilson 1991). The DNAs were extracted from placental tissue of unrelated individuals selected from the general population. The remaining population data analyzed in this article were published in Jorde et al. (1995). We typed the 14 loci in Table 1 in the same Sardinian sample examined by Di Rienzo et al. (1998) to allow direct comparison across data sets. We then calculated the population variance for these loci in Sardinia and in three major ethnic groups (Africans, Asians, and Europeans; second data set) on the basis of a published report (Jorde et al. 1995). Because each sample in the second data set was made up of several subpopulations, we analyzed the data by also using only the most numerous subpopulation sample for each ethnic group, i.e., Sotho for Africa, Japanese for Asia, and French for Europe, to control for the effect of admixture on the pattern of microsatellite variation. In all the analyses performed in this article, the results for each subpopulation (not shown) were always consistent with those for the overall population sample.
Microsatellite instability screening
Typing protocol: For both population and patient tissue samples, we used previously described typing protocols based on radioactively end-labeling one of the PCR primers (Di Rienzo et al. 1994, 1998). PCR conditions for each microsatellite locus were obtained through the Genome Database. Samples from the CEPH database that have been widely used as size markers were run on each gel to ensure consistent allele identification across gels. Every instance of instability was amplified and electrophoresed twice and classified as a somatic mutation only when a consistent pattern was obtained in at least two assays.
RESULTS
Microsatellite instability and patterns of somatic mutations: Fourteen out of the 30 tetranucleotide repeat loci described in Jorde et al. (1995) were typed in normal and tumor DNA extracted from surgical specimens of patients with sporadic colorectal cancer. A somatic mutation was determined to have occurred when one or more bands appeared in the tumor tissue in addition to those observed in the normal tissue of the same patient. The rate of microsatellite instability per locus, calculated as the number of patients with a somatic mutation at the locus over the total number of patients tested, was 12.10% on average, in line with our previous estimates of the instability rate for tetranucleotide repeats (Table 1; Di Rienzo et al. 1998). To assess the pattern of somatic mutations for each locus, we estimated the distributions of mutation size based on the mutations observed in tumor tissue. Because the typing results do not allow one to determine unequivocally which allele was hit by a somatic mutation, we used the expectation maximization (EM) algorithm to produce maximum-likelihood estimates of the distributions of mutation sizes for each locus (Dempster et al. 1977; Di Rienzo et al. 1998). These distributions (Figure 1) show a moderate degree of interlocus variability and have on average a relatively narrow range of mutation sizes with most mutations involving one or two repeat units. However, as in our previous analysis of somatic mutations, a minority of loci, such as D10S526, have a broader range of mutations asymmetrically distributed around the mean. To infer population parameters accurately, it is important to take this interlocus variability of mutation size into account. The estimated values of the mutation mean square can be found on the website of the Department of Human Genetics of the University of Chicago (http://www.genes.uchicago.edu).
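The EM step can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study: each observation is assumed to be a triple (normal allele 1, normal allele 2, mutant allele) in repeat units, each normal allele is assumed equally likely a priori to be the one hit, and the function names and toy data are hypothetical.

```python
from collections import defaultdict


def em_mutation_size_distribution(observations, n_iter=200, tol=1e-10):
    """EM estimate of P(mutation size) when the mutated allele is unknown.

    observations: list of (a1, a2, m) triples, so the candidate mutation
    sizes for one patient are m - a1 and m - a2 (assumed data layout).
    """
    candidates = sorted({m - a for a1, a2, m in observations for a in (a1, a2)})
    p = {d: 1.0 / len(candidates) for d in candidates}  # uniform start
    for _ in range(n_iter):
        expected_counts = defaultdict(float)
        # E-step: split each observation between its two candidate sizes
        # in proportion to their current probabilities.
        for a1, a2, m in observations:
            d1, d2 = m - a1, m - a2
            total = p[d1] + p[d2]
            expected_counts[d1] += p[d1] / total
            expected_counts[d2] += p[d2] / total
        # M-step: renormalize the expected counts.
        new_p = {d: expected_counts[d] / len(observations) for d in candidates}
        change = max(abs(new_p[d] - p[d]) for d in candidates)
        p = new_p
        if change < tol:
            break
    return p


def mutation_mean_square(distribution):
    """Average squared change in repeat number under the fitted distribution."""
    return sum(d * d * prob for d, prob in distribution.items())


# Hypothetical data: one homozygote (unambiguous +1) and two heterozygotes.
toy = [(10, 10, 11), (10, 12, 11), (10, 14, 11)]
fitted = em_mutation_size_distribution(toy)
eta2 = mutation_mean_square(fitted)
```

In the toy input the unambiguous +1 event pulls the ambiguous cases toward the +1 interpretation as well, which is the behavior one would expect from a maximum-likelihood resolution of the allele ambiguity.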
Maximum-likelihood estimates of the mutation sizes at 14 microsatellite loci.
For 8 out of 14 loci, the mean of the estimated distribution of mutation size was above zero. Because this is close to the one-half expected by chance, the data show no evidence for a mutational bias toward an increase of repeat size.
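One simple way to check this count, offered here as an illustration rather than as the test used in the study, is an exact two-sided sign test of the hypothesis that a locus mean is equally likely to fall above or below zero. The function name is hypothetical, and the sketch assumes no locus has a mean of exactly zero.

```python
from math import comb


def two_sided_sign_test(n_positive, n_total):
    """Exact two-sided binomial P value for H0: P(mean > 0) = 1/2."""
    k = max(n_positive, n_total - n_positive)
    upper_tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * upper_tail)


p_value = two_sided_sign_test(8, 14)  # about 0.79
```

A P value of this size is far from any conventional significance threshold, consistent with the absence of a detectable directional bias.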
Validating the use of somatic mutations for estimating germ-line mutation parameters: In appendix a, we show that a linear relationship is expected between the variance of repeat number in a population sample and the mutation mean square for each microsatellite locus for any demographic scenario. This extends earlier results of Roe (1992), Slatkin (1995), Zhivotovsky and Feldman (1995), Kimmel and Chakraborty (1996), Pritchard and Feldman (1996), Chakraborty et al. (1997), and Di Rienzo et al. (1998). Therefore, regardless of the demographic history of the population, the population variance is expected to be linearly related to the mutation mean square estimated from somatic mutations in cancer patients, if the latter is similar to that underlying the observed population variability. These findings imply that the validation of our approach through the analysis of population variability does not depend on making the correct assumption about a particular demographic model.
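The reasoning behind this linear relationship can be sketched as follows. This is not a reproduction of Equation A1, and the constant of proportionality depends on conventions that are assumptions here: two sampled genes with repeat numbers X_1 and X_2 share an ancestor of length X_0 at time T_12 (in generations), mutations arise on each lineage as a Poisson process of rate μ, mutation sizes J are independent draws with mean square η² = E[J²], and S² is the unbiased sample variance.

```latex
% Sketch under the assumptions stated in the lead-in; J, N_i, and X_0 are
% notation introduced for this illustration only.
\begin{align*}
X_i &= X_0 + \sum_{k=1}^{N_i} J_{ik}, \qquad
      N_i \mid T_{12} \sim \mathrm{Poisson}(\mu T_{12}), \quad i = 1, 2,\\
\mathrm{E}\left[X_1 - X_2 \mid T_{12}\right]
    &= \mu T_{12}\,\mathrm{E}[J] - \mu T_{12}\,\mathrm{E}[J] = 0,\\
\mathrm{Var}\left[X_1 - X_2 \mid T_{12}\right]
    &= 2\,\mu T_{12}\,\mathrm{E}[J^2] = 2\,\mu\,\eta^2\,T_{12},\\
\mathrm{E}[S^2] &= \tfrac{1}{2}\,\mathrm{E}\left[(X_1 - X_2)^2\right]
    = \mu\,\eta^2\,\mathrm{E}(T_{12}).
\end{align*}
```

The second line shows why a directional bias does not disturb the result: any drift E[J] accumulates equally on both lineages and cancels in the difference, so only the mean square of the mutation size enters. Demography affects the expression solely through E(T_12), which is why the linearity in η² holds for any demographic scenario.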
Rank correlation between population variance (S2) and mutation mean square (η2)
The fit of the data to the general expectation of a linear relationship between population and mutation parameters can be tested by assessing the significance of the rank correlation between them. As shown in Table 2, all the population samples, with the exception of the African sample, have a significant rank correlation. A graphical representation of the relationship between the population variance and the mutation mean square is also shown in Figure 2.
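A rank correlation of this kind can be computed as in the sketch below. The section does not state how significance was assessed, so the Monte Carlo permutation P value (one-sided, for a positive association) is an assumption made for illustration, as are all function names.

```python
import random


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(x, y, n_perm=10000, seed=0):
    """One-sided Monte Carlo P value for a positive rank association."""
    rng = random.Random(seed)
    rx, ry = average_ranks(x), average_ranks(y)
    observed = pearson(rx, ry)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(ry)
        if pearson(rx, ry) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Here `x` would hold the per-locus mutation mean squares and `y` the per-locus population variances for one population sample. Because only ranks are used, the test makes no assumption about the slope of the relationship, which differs between demographic histories.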
We regard the significance for all populations other than the African sample as evidence that the somatic mutations in cancer are indeed informative for the mutation process generating population variation. Recall that if they were, we would expect a linear relationship between population variance and mutation mean square. The test based on rank correlation examines the null hypothesis of no association. Whether or not such a test will detect a linear relationship if one is present depends on the power of the test, which in turn will depend on the variability of the data around the linear relationship. This variability is considerable for our data, due to the stochastic nature of the evolutionary process. If informative at all, the cancer data will be equally informative for all populations. In the light of the other results, we are thus inclined to view the lack of significance of the rank correlation for the African sample as a reflection of the low power of the test, rather than as evidence against the utility of the cancer data.
Bootstrap resampling was used to assess the sampling error in our estimates of mutation mean square. The bootstrap distributions are shown in Figure 3. We discuss below the consequences of this sampling variability for estimation of demographic parameters and testing of demographic scenarios.
Identifying “anomalous” loci: Equation A1 in appendix a shows that at a given locus the expected population variance depends linearly on features of the mutation mechanism (the mutation mean square, η2, and the mutation rate, μ) and a feature of the population demographic history [the expected coalescence time for a pair of genes at the locus, E(T12)]. As noted above, the use of the NPV facilitates interlocus comparison by “correcting” for an estimate of mutation mean square at the locus. For example, microsatellites with large variability in mutation size, such as D10S526, are expected to show greater population variance, and allowance should be made for this effect when pooling results across loci. However, reliable empirical estimates of locus-specific mutation rates are not available and an analogous correction is not feasible. We now utilize the estimated distributions of mutation size to investigate the interlocus variability of mutation rate.
Writing NPV(l, p) for the NPV value at locus l in population p, and kp for the number of loci tested in population p, define
Relationship between the mutation mean square (η2) and the population variance (S2). The data from the second data set are shown for the African, Asian, and European samples. The slope of the line was determined as the average NPV for each population sample. For the Sardinian sample, the NPV was averaged across loci for the first and second data sets without the outlier D19S244 (indicated by an arrow).
The most striking feature of Figure 4 is that the bar for locus D19S244 is much higher than that for any other locus in the corresponding samples. The difference is significant: a permutation test (Good 1994) of the null hypothesis of exchangeability of γ(l, p) values across loci within populations has P = 0.02. The variability of the estimator γ(l, p) will depend on the true demographic scenario. It could be substantial, especially in a population of constant size. We note, however, that the use of the permutation test, and hence our conclusion that D19S244 is unusual, is valid regardless of the magnitude of such variability.
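A permutation test of this kind can be sketched as below. The test statistic is not specified in this section, so the choice made here (the largest across-population mean of γ over loci) is an assumption, as is the simplification that every population was typed at the same loci. The function takes precomputed γ(l, p) values as input and does not attempt to define them.

```python
import random


def locus_statistic(gamma):
    """Largest across-population mean of gamma over loci (assumed statistic).

    gamma: dict mapping population -> list of gamma values, one per locus,
    with loci in the same order in every population.
    """
    n_loci = len(next(iter(gamma.values())))
    means = [
        sum(values[locus] for values in gamma.values()) / len(gamma)
        for locus in range(n_loci)
    ]
    return max(means)


def permutation_test(gamma, n_perm=10000, seed=0):
    """P value under exchangeability of gamma across loci within populations."""
    rng = random.Random(seed)
    observed = locus_statistic(gamma)
    hits = 0
    for _ in range(n_perm):
        # Shuffle locus labels independently within each population.
        shuffled = {pop: rng.sample(values, len(values))
                    for pop, values in gamma.items()}
        if locus_statistic(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Shuffling within populations preserves each population's own distribution of γ values, so the test asks only whether the large values line up at the same locus more often than chance would allow. That is why its validity does not depend on how variable the estimator is under the true demographic scenario.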
We propose three possible explanations for this observation. The first is that this locus has a substantially larger mutation rate than the other loci in the first data set. The second is that it has been affected by natural selection: if natural selection acted at or near the locus in all populations, it may systematically increase (balancing selection) or decrease the expected average coalescence time (background selection or a selective sweep; Smith and Haigh 1974; Hudson and Kaplan 1988; Kaplan et al. 1988; Charlesworth et al. 1995). The former case would tend to manifest itself as large γ(l, p) values for that locus in all populations, and hence in a large bar in Figure 4, as for D19S244. (The latter forms of selection would have the opposite effect, leading to a small bar in Figure 4.) A third potential explanation is that the mutation mean square at that locus was substantially underestimated, hence inflating the NPV values in all populations.
A high rate of de novo mutations has been observed at the D19S244 locus in family studies (Weber and Wong 1993). Further, a search of the Human Transcript Map database did not reveal any gene known to evolve under balancing selection in the neighborhood of this marker, and the number of mutations from which the mutation mean square was estimated was larger than for most other loci (29 instances of microsatellite instability were observed at D19S244). Thus, while we cannot be definitive, we propose that the increased γ(l, p) value for the D19S244 locus reflects a substantially higher mutation rate for that locus. While the second data set (Figure 4) also suggests heterogeneity across loci, no locus appears to be markedly different from the others.
Bootstrap estimates of the sampling error in our estimates of mutation mean square for each locus. For each locus, the following procedure was repeated 50,000 times: (1) A new data set was generated by choosing randomly one of the two normal alleles in each individual with a somatic mutation at the locus and then “mutating” it by adding a (positive or negative) mutation size chosen randomly according to the maximum-likelihood estimate of the mutation size distribution at that locus. (If this resulted in an allele of the same length as the other normal allele, the result was discarded and the procedure repeated for that pair.) The resulting bootstrap data set then had two normal alleles and one mutant allele for each individual. (2) For each bootstrap data set, the EM algorithm was used to calculate the maximum-likelihood estimate of mutation size distribution, and so of mutation mean square. A histogram is plotted of the 50,000 bootstrap values of the estimated mutation mean square for each locus. The procedure was vacuous for D20S161 because in the estimated distribution of mutation sizes mutations were always of size ±1, and so all bootstrap estimates of the mutation mean square were also 1.
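Step (1) of this procedure can be sketched as follows. The data layout (pairs of normal alleles in repeat units, and a dictionary mapping mutation size to estimated probability) and the function name are assumptions for illustration, and the estimated distribution is assumed to place no mass on size 0.

```python
import random


def bootstrap_data_set(normal_pairs, size_distribution, rng):
    """Generate one bootstrap data set of (a1, a2, mutant) triples.

    normal_pairs: list of (a1, a2) normal alleles for individuals with a
    somatic mutation at the locus.
    size_distribution: dict mapping mutation size -> estimated probability.
    """
    sizes = list(size_distribution)
    weights = [size_distribution[s] for s in sizes]
    data = []
    for a1, a2 in normal_pairs:
        while True:
            # Choose at random which normal allele is "mutated".
            hit, other = (a1, a2) if rng.random() < 0.5 else (a2, a1)
            mutant = hit + rng.choices(sizes, weights=weights)[0]
            # Discard results that coincide with the other normal allele,
            # since such an event would not be scored as a mutation.
            if mutant != other:
                break
        data.append((a1, a2, mutant))
    return data


# Hypothetical inputs (not data from this study).
pairs = [(10, 12), (11, 11), (9, 10)] * 50
estimated = {-1: 0.3, 1: 0.6, 2: 0.1}
replicate = bootstrap_data_set(pairs, estimated, random.Random(1))
```

Each replicate would then be refitted by EM as in step (2) to obtain one bootstrap value of the mutation mean square. Note that the discarding rule makes the simulated data mimic what can actually be observed on a gel rather than the raw mutation process.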
Values of γ(l, p) for the first and second data sets.
As shown in Di Rienzo et al. (1998; Equations A7 and A15), the precision of estimators of demographic parameters based on average NPV values is decreasing in the variance, across loci, of the mutation rate. So too is the power of tests of demographic scenarios based on interlocus variability of NPV values. Having identified D19S244 as an “anomalous” locus, it is thus most efficient to ignore this locus in subsequent statistical analyses. To evaluate the effect of this locus on population inferences, we compared the results from the first data set by including and excluding it from all the population samples.
Using the NPV to make inferences about human evolution: Parameters of natural interest include the effective population size, in the case of a constant-sized population, or the timing or rate of changes in population size, under other demographic scenarios. It follows from Equation A1 in appendix a that the average NPV value across loci within a population is an unbiased estimator of the product
One scenario of interest in connection with human evolution would be a population of nontrivial size that grows rapidly to become quite large. Then the expected average coalescence time would be larger than the time since growth, so that it would provide an upper bound on that time (provided, as seems plausible for humans, that the time since growth was less than the effective population size after growth).
Estimates of population parameters based on the average NPV
The present data set also confirms our previous finding that the coalescence time of the overall pooled sample is very similar to the coalescence times of the individual populations. This strongly suggests that populations from different ethnic groups share a substantial portion of their genetic ancestry and is in agreement with previous studies indicating that a small proportion of human genetic diversity occurs between populations (Lewontin 1972; Nei 1987).
Testing demographic scenarios: Aside from estimating parameters, under the assumption that a particular demographic scenario applies, multilocus microsatellite data allow testing of demographic scenarios. It turns out that the expected variability, across loci, in statistics such as the population variance or the NPV changes considerably under different demographic scenarios. For example, under the null hypothesis of constant population size, relatively large values of the variance of NPV across loci would be expected, while under scenarios of recent population growth this variance will be smaller. Di Rienzo et al. (1995, 1998) exploited this fact in developing a method for inference about demographic history.
Here, we apply our method to the second data set. In the light of the earlier analysis suggesting an unusual status for the locus D19S244, we also reapply the method to the first data set, both with and without D19S244. The details of the hypothesis testing procedure are described in appendix b. Table 4 gives the P values for three different test statistics, F1, F2, and g, defined in Equations B1, B2, and B3. The statistic g was effectively introduced by Reich and Goldstein (1998). (For ease of comparison, we actually use the reciprocal of their statistic, but the testing procedure is equivalent.) This statistic uses only the population variance at each locus, without correcting for interlocus variability in the mutation process before combining information across loci. The statistics F1 and F2 each use NPV values at each locus, hence taking advantage of locus-specific information on the mutation mean square. Of the two, F1 is much simpler, but F2 leads to a slightly more powerful test.
Our test rejects the null hypothesis of constant population size for large values of the test statistics, corresponding to the data showing significantly less variability across loci than would be expected under this demographic scenario (see Table 4). Whatever the true demographic scenario, variation in mutation rates across loci will tend to increase the variation in population variances and NPVs across loci, thus decreasing the value of all three test statistics. While this means that it is conservative to calculate P values under the assumption of the same mutation rate for all loci, this can involve a considerable loss of power. We believe it is more helpful to calculate P values separately under a range of assumptions (see Table 4 for details) about the variation in mutation rate and have done so.
Reich et al. (1999) have shown that the procedure based on the statistic g is conservative under a range of other deviations from the simple assumptions usually made about the mutation mechanism. As described in appendix b, we have conducted extensive simulations to establish that tests based on either F1 or F2 remain conservative if the mutation mechanism also varies across loci and (as for our data) the mutation mean square is estimated with error.
While the extent of variation in mutation rate across microsatellite loci has not been well documented, either our “medium” or “high” variability scenarios may be most realistic (Weber and Wong 1993; Seielstad et al. 1999). In this case, excluding D19S244 from the analysis, all populations in the first data set show significant evidence for departure from the assumption of constant population size. The same is true for all populations, except Africa (in fact the Sotho) in the second data set. Under the scenario of high variability in mutation rate, the data for this sample are almost significant (P = 0.07).
Effects of the sampling error on the calculation of the NPV at each locus in the pooled population sample. A histogram is plotted of 50,000 values of the NPV; each value corresponds to the sample variance of the allele length observed at a locus divided by a bootstrap value of the estimated mutation mean square for that locus.
Estimated P values for tests of the null hypothesis of constant population size based on the statistics F1, F2, and g, for various levels of variability in μ
All three test statistics lead to valid tests. The differences in P values on the same data are due to differences in their power to detect departures from the null hypothesis. The statistic F2 is slightly more powerful than F1. We should expect the statistics based on NPV (F1 and F2) to be more powerful than one based simply on population variance (g), exactly because the normalization (division by mutation mean square) corrects for one source of variation before combining data across loci. Nonetheless, the difference in power between g and the two F statistics is striking, particularly if one were to use the procedure based on g without making allowance for variation in mutation rate.
DISCUSSION
Somatic and germ-line mutations: An understanding of microsatellite mutation patterns is central to their use for the accurate reconstruction of population processes. We have developed and validated an experimental approach to estimate the distribution of mutation sizes for each individual microsatellite locus. These distributions were estimated from somatic mutations observed in the tumor tissue of patients with sporadic colorectal cancer.
It is not known whether such mutations arise from the same events that produce variation in the normal population. Microsatellite instability in some cancer patients may reflect defects in mismatch repair; but, in other patients, it may be a consequence of the higher number of cell divisions that occurs in the tumor compared to the normal tissue. Nevertheless, in the absence of specific mechanistic or genetic information on the source of these mutations, it is still possible to test whether they reflect the mutation process in the general population by using population theory. Here, we demonstrate that under the generalized stepwise model with arbitrary distribution of mutation sizes, the relationship between the variance of repeat number at a given locus in a population sample and the mutation mean square for the same locus is expected to be linear regardless of assumptions about the demographic history of the population (see appendix a). Therefore, if the mutation mean square estimated in cancer patients parallels that of the “real” mutation process, one expects it to be linearly related to the variance of repeat number of different population samples. Three out of the four population samples examined in this article conform to this expectation. This observation extends our previous findings of a linear relationship between the population variance and the mutation mean square for an additional three population samples. Taken together, the results of these two studies indicate that the somatic mutations observed in sporadic colorectal cancer patients provide a useful means of characterizing the mutation process of microsatellite loci on a locus-by-locus basis.
Even though most of the loci show a preponderance of short mutations, i.e., gain or loss of one or two repeat units, our estimated distributions of mutation sizes (Figure 1) are relatively broad for a small subset of the loci examined. To investigate whether mismatch repair defects result in unusually large mutation sizes, we partitioned the patients into two groups. The first group includes patients with high levels of microsatellite instability (at least 20% of loci tested had somatic mutations). These patients are more likely to have mismatch repair defects, and this was recently confirmed by staining their tumor tissue with antibodies against MSH2 and MLH1 (A. Di Rienzo, K. Halling and S. Thibodeau, unpublished results; Thibodeau et al. 1998). The second group includes patients with low levels of microsatellite instability; the somatic mutations observed in these patients may well be “normal” mutations probably reflecting the large number of cell divisions in tumor tissue. In this analysis we pooled classes of loci to maintain reasonable sample sizes. There were 123, 89, and 163 instances of somatic mutation at di-, tri-, and tetranucleotide repeats in the high instability group, respectively, and 47, 24, and 53 instances in the low instability group in the corresponding groups of loci. The preponderance of short mutations, with some larger mutations, was apparent in both the high and low microsatellite instability groups, and estimates of mutation mean squares were similar in each group (data not shown).
Furthermore, a similar broad range of mutation sizes (e.g., from −12 to +11 repeat units) was observed in the largest survey reported to date of de novo mutations in family studies (1107 events over 952,962 parent-offspring transmissions; Seielstad et al. 1999). In contrast, earlier, smaller studies observed predominantly one-step mutations (e.g., Weber and Wong 1993; Brinkmann et al. 1998). These findings suggest that large samples are necessary to observe mutations of large amplitude and further support our inference of concordant patterns in somatic and germ-line mutations.
In addition to the study of germ-line de novo or somatic mutations, another approach to understanding the mutation processes is to examine the variation at tightly linked microsatellite loci. For example, the analysis of multilocus haplotypes carrying the CCR5-Δ32 allele showed that 9.5% of the alleles at locus D3S4580, located 28 kb from CCR5, differ from the most common one by 4–10 repeat units. Detailed haplotype analysis revealed that this pattern cannot be easily explained by recombination and is more consistent with occasional large mutations (J. Martinson, personal communication; Martinson et al. 1998).
Overall, our finding in this and the preceding article of a significant rank correlation between population variance and mutation mean square estimated from the cancer data in five of the six population/loci pairs we have examined would seem extraordinarily unlikely if, in fact, the cancer data were uninformative for the germ-line processes. Further, the results of our analyses of human demography, utilizing the mutation mean squares estimated from the cancer data, are in broad agreement with those of analyses of other genetic systems.
Identifying anomalous loci: Here, we developed a method for identifying loci that are anomalous, either in the sense of having a different mutation rate from others in the study or because their evolution is not governed by the class of (neutral, generalized stepwise) models on which the analysis is based.
We identified one such locus in our studies, D19S244. In the light of independent evidence as to its unusually high mutation rate, we regard this as the most likely explanation for its status as an outlier (Weber and Wong 1993). Whatever the reason for the anomaly, such loci should be excluded from population analyses. An important consequence of the ability to detect such loci is thus the potential for improved inferences as to population parameters and population history. In our study, excluding D19S244 led to clear differences in the population inferences obtained.
More generally, this method has the potential to detect loci at or near which natural selection has acted. Recall that balancing selection acting near a locus will increase the observed γ(l, p) value at that locus relative to others in the same population, whereas background selection or a selective sweep will decrease it. In our analysis the effect of natural selection is confounded with a higher mutation rate at the locus. One way of distinguishing between the effects of selection and mutation rate changes would be to examine tightly linked microsatellite loci near the anomalous one. Selection should have an effect in the same direction on all such loci. If the original outlier results from an unusually high or low mutation rate, the effect should not extend to linked loci.
Inferences for population parameters and population history: Under general assumptions, the average NPV value across loci within a population is an unbiased estimator of
Several points about this estimation procedure are noteworthy. The first is that recovery of time or population size estimates is dependent on assumptions about the average mutation rate for the loci used. Direct empirical evidence is scanty, yet a change by a factor of two, for example, in this average rate will change estimated times or sizes by a factor of one-half. This problem, effectively one of calibrating mutational events into numbers of generations, afflicts all estimates of such parameters from microsatellite data. Particularly in view of the current speculative nature of such calibrations, the actual value of such estimates should be interpreted with caution. We have presented point estimates with no attempt at assessing the precision of the estimates or, equivalently, of giving confidence intervals. In large part, this is because of the difficulty with the calibration just described. In addition, while the estimator is unbiased for the compound parameter
We performed significance tests of the null hypothesis of constant population size for both data sets. Taking the effective population size as 10,000 individuals, allowing medium variability in mutation rate across loci, and ignoring the locus D19S244 shown above to be anomalous, the null hypothesis would be rejected, in favor of scenarios involving population growth, for all populations in the first data set and for all but the African population in the second data set. If there were more variation in mutation rates across loci (our “high” variability scenario), then the African population in the second data set also becomes highly suggestive of population expansion (P = 0.07).
The mutation process at microsatellite loci is clearly complex. Accordingly, we chose to use an analytical approach that takes into account most of these complexities. There are related recent approaches that use microsatellite data to estimate demographic histories (Kimmel et al. 1998; Reich and Goldstein 1998). These methods typically make the restrictive assumption that all mutations involve a gain or loss of one repeat unit and do not allow for heterogeneities across loci in parameters of the mutation process. In the presence of such heterogeneities, and possible larger step sizes, estimates of parameters describing population history may be biased, and while tests for population expansion may be conservative, this will be at the cost of a loss of power to detect an expansion. Assuming, as seems plausible from the rank correlation results (Table 2), that our estimates of mutation mean square are related to the relevant germ-line processes, our approach to the reconstruction of demographic histories will have substantially more power to detect expansion. In particular, our analysis (see Table 4) has shown that a method that does not use locus-specific information on the mutation mean square is much less powerful than those developed here.
A signal of population expansion has been observed in virtually all major ethnic groups for mtDNA (Harpending et al. 1993; Rogers 1995). However, based on microsatellite analyses, other authors have detected a significant signal only in nonoverlapping subsets of human populations (Kimmel et al. 1998; Reich and Goldstein 1998). Our results are in broad agreement with those based on mtDNA and those obtained by Kimmel and colleagues on microsatellites: all but one of our samples result in unmistakable rejection of the null hypothesis of constant population size, in a direction consistent with expansion; only one (Luo) of the two African populations results in rejection of this null hypothesis. In this regard, it is interesting to note that the inclusion of a single “anomalous” locus in our data set, D19S244, renders the Sardinian population sample consistent with a constant population size without variability of mutation rates. Thus, failure to identify and allow for heterogeneity in the mutation process even at a single locus may dramatically affect the power to detect population expansion. With regard to microsatellite analyses, our methods provide more powerful tests in the presence of departures from the simple one-step mutation model or interlocus heterogeneity in either mutation rates or step size distributions. Thus, a more likely explanation for the discrepancies across microsatellite analyses is the different power of the statistical tests employed to detect population expansion. The fact that the null hypothesis of constant population size is not rejected by a particular test does not mean that the hypothesis is true. While some scenarios proposed for expansion, say, in Africa but not in other regions (Reich and Goldstein 1998), are not without interest, our results suggest that they are not needed.
Acknowledgments
We thank D. Barch, G. K. Haines, and B. Sisk for help in sample collection; R. R. Hudson and C. Ober for comments on the manuscript; M. Stephens for helpful discussions; and L. Jorde for providing a file containing the original population data. This work was supported in part by grants from the National Science Foundation (SBR-9317266 to A.D. and DMS-9505129 to P.D.) and the American Cancer Society, Illinois Division, and Digestive Disease Research Center (DK-42086) to A.D. and a UK Engineering and Physical Sciences Research Council Advanced Fellowship (B/AF1255) to P.D.
APPENDIX A
Writing S2 for the sample variance of allele length in a sample from a population at a particular locus, we show here that, for any demographic scenario,
Note that for a population with constant effective size N chromosomes, E(T12) = N, and for one that has expanded rapidly from a very small size T generations ago, E(T12) ≈ T. The general result (A1) thus reduces to known results [for example, Di Rienzo et al. (1998, Equation A2, and the equation below A11)] in these cases.
Throughout, we assume the generalized stepwise model for mutation, namely that neither the mutation rate nor the distribution of the change in allele length caused by mutation will depend on the length of the progenitor allele, and we assume selective neutrality. Aside from this, we allow arbitrary distribution of mutation sizes and, as we have noted, an arbitrary demographic scenario.
Recall from Equation A9 of Di Rienzo et al. (1998) that we can write
Now, write W1 for the number of mutations on the lineage to the first sampled chromosome since its common ancestor with the second, and W12 for the total number of mutations along either lineage since their common ancestor. Conditional on T12, the number of generations since this common ancestor, W1 and W12 have binomial distributions with parameters (T12, μ) and (2T12, μ), respectively. In particular, conditional on T12, the means of W1 and W12 are μT12 and 2μT12, respectively, and their respective variances are μT12(1 − μ) and 2μT12(1 − μ). Since μ is small, we approximate these conditional variances by μT12 and 2μT12, respectively.
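The size of the approximation made here can be illustrated with a short calculation. The sketch below is only a worked example of the binomial moments quoted above; the function name and the values of T12 and μ are hypothetical and are not taken from the data.

```python
def mutation_count_moments(t12, mu, lineages=1):
    """Moments of a Binomial(lineages * t12, mu) mutation count.

    lineages=1 corresponds to W1 and lineages=2 to W12 in the text.
    Returns (mean, exact variance, approximate variance).
    """
    trials = lineages * t12
    mean = trials * mu
    exact_var = trials * mu * (1.0 - mu)   # mu*T12*(1 - mu), or twice that for W12
    approx_var = trials * mu               # the approximation that drops the (1 - mu) factor
    return mean, exact_var, approx_var


# Hypothetical values chosen only for illustration.
mean, exact_var, approx_var = mutation_count_moments(t12=5000, mu=1e-3, lineages=2)
relative_error = (approx_var - exact_var) / approx_var
print(mean, exact_var, approx_var, relative_error)
```

Whatever the value of T12, the relative error of the approximation equals μ itself, which is why it is negligible for realistic microsatellite mutation rates.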
As in Di Rienzo et al. (1998, equations preceding A10 and A11), we can write
Now,
The result (A1) now follows on substituting (A5) and (A6) into (A3) and (A7) into (A4), before substituting the resulting expressions into (A2). (Recall that η2 = m2 + σ2.)
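The argument above can be checked numerically for a fixed T12. The following Monte Carlo sketch rests on two assumptions that are not confirmed by the text as reproduced here: that the expected sample variance equals half the expected squared difference in allele length between two sampled chromosomes, and that result (A1) takes the form E(S2) = μ η2 E(T12), which is the form suggested by the two special cases quoted above. The step-size distribution, the parameter values, and all function names are hypothetical.

```python
import random


def draw_step(rng):
    """Symmetric step (hypothetical): +/-1 repeat with probability 0.8, +/-2 otherwise."""
    size = 1 if rng.random() < 0.8 else 2
    return size if rng.random() < 0.5 else -size


def half_squared_difference(t12, mu, rng):
    """Simulate two lineages for t12 generations and return (X1 - X2)^2 / 2."""
    diff = 0
    for sign in (1, -1):          # lineage 1 adds to the difference, lineage 2 subtracts
        for _ in range(t12):      # one mutation opportunity per generation (binomial count)
            if rng.random() < mu:
                diff += sign * draw_step(rng)
    return diff * diff / 2.0


T12, MU = 50, 0.02                # illustrative values only, not estimates
ETA2 = 0.8 * 1 + 0.2 * 4          # mean square step size; the mean step m is 0 here
rng = random.Random(42)
draws = [half_squared_difference(T12, MU, rng) for _ in range(20000)]
estimate = sum(draws) / len(draws)
predicted = MU * ETA2 * T12       # assumed form of (A1) with T12 held fixed
print(estimate, predicted)
```

Because the steps are symmetric (m = 0), the binomial variance correction does not enter and the simulated average should agree with the predicted value up to Monte Carlo error.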
APPENDIX B
We describe here the details of the significance tests of demographic scenarios used in the article. Write S2 and V, respectively, for the population variance and the normalized population variance, L for the number of loci in the data, and η2 and η4, respectively, for the mutation mean square and the fourth moment of the distribution of mutation size. We use an overbar to denote the average of a quantity across loci in the data set, and Var to denote its variance across loci. Thus, for example,
The three test statistics we consider are
For the values of L and n (the number of chromosomes in the sample) appropriate for each part of our data sets, we evaluated the null distribution of g by simulating 30,000 realizations of evolution with constant population size, θ ≡ 2Nμ = 4 (where N is the number of chromosomes in the population), a simple stepwise mutation model, and no variation in either the mutation rate or the mutation mechanism across loci. For each of the scenarios in which there is variation across loci in the mutation rate, we found the null distribution of g by repeating these simulations, except that a population size of N = 10,000 was assumed, and in the simulations the mutation rate for each locus was chosen randomly according to
Moderate variability:
In each case, the simulated null distribution for g (under the appropriate assumption about variability in μ) was also used as the null distribution for F1 and F2. Significance levels for each of the three statistics were calculated as the percentage of times in the relevant simulation that the simulated value of g was larger than the observed value of the test statistic.
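The significance calculation just described amounts to a simple count over the simulated null values. A minimal sketch is given below; the function name and the toy list of simulated values are hypothetical stand-ins for the 30,000 simulated values of g.

```python
def percent_exceeding(simulated, observed):
    """Percentage of simulated null values strictly larger than the observed statistic."""
    if not simulated:
        raise ValueError("need at least one simulated value")
    larger = sum(1 for value in simulated if value > observed)
    return 100.0 * larger / len(simulated)


# Toy null distribution (hypothetical numbers, for illustration only).
null_values = [0.2, 0.5, 0.9, 1.4, 2.0, 2.6, 3.1, 4.8, 5.5, 7.3]
level = percent_exceeding(null_values, 5.0)
print(level)
```

Here two of the ten toy values exceed 5.0, so the reported level is 20%. The same function would be applied to F1 and F2 with the simulated distribution of g as reference.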
Our principal interest focuses on the use of the statistics F1 and F2. To establish the validity of our procedure we carried out extensive simulations to check that the nominal P values calculated as just described were conservative, in the sense that the actual probability of a type I error does not exceed the nominal level, under more general assumptions about the processes involved.
First, we simulated from the null distribution of the F statistics under exactly the same assumptions as for g. Next, we weakened the assumption of simple stepwise mutation at each locus, simulating instead using the “two-phase” model introduced in Di Rienzo et al. (1994). Initially we assumed that all loci had the same two-phase distribution of mutation size, varying the parameters (notation as in Di Rienzo et al. 1994) in the range
Next, we introduced possible variation in the mutation mechanism across loci by choosing parameters for the two-phase model independently for each locus, with p being chosen uniformly over various ranges ((0,1), (0.5, 1), (0.8, 1), and (0.9, 1)) for each of which
Finally, we allowed for uncertainty in the estimation of the mutation mean square at each locus by repeating the simulations described in the previous paragraph, but, in addition, using a value of η2 for the locus that is chosen from a normal distribution, with mean given by the true value of η2 (which is specified once the parameters for the mutation model are chosen) and variance (η2 − 1)^2/4, independently for each locus. The choice of distribution for the sampling error in our estimates of mutation mean square is motivated by the bootstrap estimates of the sampling variability described in this article.
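The perturbation of η2 described in this paragraph can be sketched as follows. The variance is read here as the square of (η2 − 1) divided by 4, so that the standard deviation is |η2 − 1|/2; that reading, the function name, and the example values are assumptions. The text does not say how draws that fall below plausible values were handled, so no truncation is applied.

```python
import random


def noisy_eta2(true_eta2, rng):
    """Draw an 'estimated' mutation mean square for one locus.

    Normal with mean true_eta2 and variance (true_eta2 - 1)^2 / 4,
    i.e. standard deviation |true_eta2 - 1| / 2 (assumed reading of the text).
    """
    sd = abs(true_eta2 - 1.0) / 2.0
    return rng.gauss(true_eta2, sd)


rng = random.Random(0)
true_values = [1.0, 1.5, 3.0]     # hypothetical per-locus true values of eta2
estimates = [noisy_eta2(v, rng) for v in true_values]
print(estimates)
```

Under this reading a locus that follows the simple stepwise model (η2 = 1) has no sampling error, and the error grows with the departure of η2 from 1.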
Each of these sets of simulations was performed under each of the assumptions about variation in mutation rate, with the nominal level of the test set at 0.05. On no occasion was the actual type I error >0.05.
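For readers who wish to reproduce the flavor of the constant-size null simulations used in this appendix, a minimal coalescent sketch is given below. It is not the code used in the article. It assumes a standard coalescent for n chromosomes with time measured in units of N generations, so that mutations arise along each lineage at rate θ/2 with θ = 2Nμ, a Poisson approximation to the per-generation mutation count, symmetric single-repeat steps, and a sample variance computed with denominator n − 1. All function names are hypothetical, and far fewer replicates are used than the 30,000 realizations reported above.

```python
import random


def poisson(rng, mean):
    """Poisson draw: count unit-rate exponential arrivals falling in [0, mean]."""
    count, total = 0, rng.expovariate(1.0)
    while total <= mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def simulate_locus(n, theta, rng):
    """Allele lengths (relative to the root allele) for n chromosomes at one locus."""
    lengths = [0] * n
    lineages = [[i] for i in range(n)]            # sampled descendants of each lineage
    while len(lineages) > 1:
        k = len(lineages)
        t = rng.expovariate(k * (k - 1) / 2)      # waiting time, in units of N generations
        for members in lineages:
            n_mut = poisson(rng, theta * t / 2)   # mutations on this lineage in the interval
            shift = sum(rng.choice((-1, 1)) for _ in range(n_mut))
            for i in members:
                lengths[i] += shift
        a, b = rng.sample(range(k), 2)            # pick two lineages to coalesce
        merged = lineages[a] + lineages[b]
        lineages = [m for j, m in enumerate(lineages) if j not in (a, b)] + [merged]
    return lengths


def sample_variance(xs):
    """Sample variance with denominator len(xs) - 1."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


rng = random.Random(1)
reps = [sample_variance(simulate_locus(20, 4.0, rng)) for _ in range(3000)]
mean_s2 = sum(reps) / len(reps)
print(mean_s2)
```

Under these assumptions the average sample variance across replicates should be close to θ/2 = 2 for θ = 4, which provides a quick sanity check before any test statistic is computed from the simulated loci.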
Footnotes
- Communicating editor: N. Takahata
- Received July 29, 1999.
- Accepted November 29, 1999.
- Copyright © 2000 by the Genetics Society of America