Heterogeneity of Microsatellite Mutations Within and Between Loci, and Implications for Human Demographic Histories
Anna Di Rienzo (a,f), Peter Donnelly (b), Chris Toomajian (a), Bronwyn Sisk (a), Adrian Hill (c), Maria Luiza Petzl-Erler (d), G. Ken Haines (e), and David H. Barch (f)
a Department of Anthropology, Northwestern University, Evanston, Illinois 60208,
b Department of Statistics, University of Oxford, Oxford, OX1 3TG, United Kingdom,
c Wellcome Trust Center for Human Genetics and Institute of Molecular Medicine, University of Oxford, Oxford, OX3 9DU United Kingdom,
d Department of Genetics, Federal University of Paraná, 81531-970 Curitiba, Brazil,
e Department of Pathology, Northwestern University Medical School, Chicago, Illinois 60611
f Lurie Cancer Center, Northwestern University Medical School, Chicago, Illinois 60611
Corresponding author: Anna Di Rienzo, Center for Medical Genetics, University of Chicago, 924 E. 57th St., BSLC Rm. 116, Chicago, IL 60637, dirienzo{at}genetics.uchicago.edu (E-mail).
Communicating editor: M. SLATKIN
| ABSTRACT |
|---|
Microsatellites have been widely used to reconstruct human evolution. However, the efficient use of these markers relies on information regarding the process producing the observed variation. Here, we present a novel approach to the locus-by-locus characterization of this process. By analyzing somatic mutations in cancer patients, we estimated the distributions of mutation size for each of 20 loci. The same loci were then typed in three ethnically diverse population samples. The generalized stepwise mutation model was used to test the predicted relationship between population and mutation parameters under two demographic scenarios: constant population size and rapid expansion. The agreement between the observed and expected relationship between population and mutation parameters, even when the latter are estimated in cancer patients, confirms that somatic mutations may be useful for investigating the process underlying population variation. Estimated distributions of mutation size differ substantially amongst loci, and mutations of more than one repeat unit are common. A new statistic, the normalized population variance, is introduced for multilocus estimation of demographic parameters, and for testing demographic scenarios. The observed population variation is not consistent with a constant population size. Time estimates of the putative population expansion are in agreement with those obtained by other methods.
MICROSATELLITES, also called simple tandem repeat polymorphisms (STRPs), are an abundant group of repetitive DNA sequences with repeat units up to six nucleotides long. Their high level of variation in the number of repeat units is thought to reflect their high mutation rate.
Instability of simple sequences is responsible for a number of genetic diseases. There are two main types of instability. The first involves a single locus at a time and is characterized by expansion of the repeat and increased mutation rate at the germline and often somatic level.
The high degree of polymorphism, with alleles that can be typed by automated assays, makes microsatellites promising candidates as tools for the study of population processes. However, certain assumptions about the mode of producing length variation are needed to relate the variation among populations to population genetic processes. The mechanism of slipped strand mispairing leads to a process generating gain or loss of repeat units. The most important implication is that alleles with the same repeat number may not be identical by descent because several mutations may have produced the same allele more than once. The stepwise mutation model, which assumes mutational changes of one repeat unit, is potentially suitable for describing microsatellite variation in populations.
In this paper, we characterized the process generating variation in human populations at 20 microsatellite loci, on a locus-by-locus basis, to investigate possible inter-locus heterogeneity of mutation patterns. This was done by analyzing the somatic mutations detected in a large sample of colon cancer patients. To determine whether the observed somatic mutations in cancer indeed reflect the "real" mutation process, the same microsatellites were typed in three population samples from different ethnic backgrounds. This approach allowed us to determine that the patterns of mutation in cancer are consistent with the pattern of population variation, implying that somatic mutations in cancer cells are useful for estimating the parameters of the "real" mutation process.
| MATERIALS AND METHODS |
|---|
Patient material:
Study subjects were identified through a search of the Northwestern Memorial Hospital Tumor Registry, Chicago, IL. All patients with adenocarcinoma of the colon who had their primary diagnosis and primary resection of their colon carcinoma at Northwestern Memorial Hospital between January 1, 1984 and December 31, 1988 (219 cases) were identified. The original slides for each patient were reviewed by a pathologist (G.K.H.) and sections of tumor and normal tissue were cut from paraffin-embedded tissue blocks. The presence of tumor or normal tissue was confirmed by an additional hematoxylin and eosin stained slide. Additional sections were then cut from each block and the normal and tumor portions of the sections separated. DNA was then extracted from the tissue sections as described previously.
Population samples:
Sardinia (Italy) and Kaingang (Brazil) population samples were randomly selected from previously described samples.
Typing protocol:
For both population and patient tissue samples, we used a typing protocol based on radioactively endlabeling one of the PCR primers as described previously.
Estimating mutation sizes:
Owing to the high degree of heterozygosity of microsatellites, it was not possible to determine unequivocally which of the two alleles in the normal tissue mutates to produce the additional band(s) observed in the tumor DNA. We describe here four methods for estimating the distribution of mutation sizes at a locus from tumor data.
Measure length in repeat copy number and suppose $l$ mutant alleles are observed at a particular locus. Write $Y_{i,1}$ and $Y_{i,2}$, $i = 1, 2, \ldots, l$, for the two possible mutation sizes for each of the $l$ somatic mutations observed, with these values being negative if the mutation reduces allele length and positive otherwise. Each of the methods used effectively assigns a probability, $p_{i,1}$, that for the $i$th mutant the mutation was of size $Y_{i,1}$. Probability $p_{i,2} = 1 - p_{i,1}$ is then defined to be the probability that the mutation was of size $Y_{i,2}$. Having chosen values for the $p_{i,j}$, the underlying distribution of mutation size is estimated to give probability
$$\hat{\pi}(k) = \frac{1}{l} \sum_{i=1}^{l} \sum_{j=1}^{2} p_{i,j}\, \mathbf{1}\{Y_{i,j} = k\} \qquad (1)$$
to a mutation of size $k$.
The simplest method assumes that the mutation was definitely of the smaller of the two possible (absolute) values. That is, it puts $p_{i,1} = 1$ if $|Y_{i,1}| < |Y_{i,2}|$, $p_{i,1} = 1/2$ if $|Y_{i,1}| = |Y_{i,2}|$, and $p_{i,1} = 0$ if $|Y_{i,1}| > |Y_{i,2}|$.
Two natural Bayesian methods make specific prior assumptions about the relationship between the probabilities of certain changes and the size changes involved. One such method assumes a priori that the probabilities of certain changes decrease inversely with the (absolute) length of the change. Write $f_{i,j} = 1/|Y_{i,j}|$. Then put
$$p_{i,j} = \frac{f_{i,j}}{f_{i,1} + f_{i,2}} \qquad (2)$$
If it is assumed a priori that probabilities decrease inversely with the square of the length change, one simply uses (2) with $f_{i,j}$ redefined as $1/Y_{i,j}^2$.
Each of the three methods just described tries to encapsulate the belief that of the two possible (absolute) changes, the smaller one is thought more likely than the larger one. The first method always adopts the smaller change, the third has a strong bias toward the smaller (absolute) change, and the second has a less marked bias toward the smaller (absolute) change. A comparison between the results of each of these methods and maximum likelihood estimation is given in Figure 1.
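The three weighting rules just described can be sketched in Python. This is an illustrative sketch rather than the code used in the study: the names `assignment_weights` and `size_distribution` are hypothetical, and the sketch assumes that both candidate mutation sizes are nonzero integers.

```python
from collections import defaultdict


def assignment_weights(y1, y2, method="smaller"):
    """Return (p1, p2), the weights given to candidate sizes y1 and y2."""
    a1, a2 = abs(y1), abs(y2)
    if method == "smaller":
        # Always adopt the smaller absolute change; split evenly on ties.
        if a1 < a2:
            return 1.0, 0.0
        if a1 > a2:
            return 0.0, 1.0
        return 0.5, 0.5
    if method == "inverse":
        f1, f2 = 1.0 / a1, 1.0 / a2            # weight ~ 1 / |size|
    elif method == "inverse_square":
        f1, f2 = 1.0 / a1 ** 2, 1.0 / a2 ** 2  # weight ~ 1 / size^2
    else:
        raise ValueError(f"unknown method: {method}")
    total = f1 + f2
    return f1 / total, f2 / total


def size_distribution(pairs, method="smaller"):
    """Estimated probability of each mutation size (Equation 1)."""
    dist = defaultdict(float)
    for y1, y2 in pairs:
        p1, p2 = assignment_weights(y1, y2, method)
        dist[y1] += p1 / len(pairs)
        dist[y2] += p2 / len(pairs)
    return dict(dist)
```

For a mutant whose candidate sizes are +1 and -3, the three rules give the +1 change weights of 1, 0.75, and 0.9 respectively, which matches the ordering of biases described above.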
The maximum likelihood estimate of the underlying mutation size distribution cannot be found directly. Instead, we used the expectation maximization (EM) algorithm to approximate the maximum likelihood estimate. The estimated probabilities $\hat{\pi}(k)$ of a mutation of size $k$ are found in a two stage process. First, we put all the $p_{i,j} = 1/2$, and define $\hat{\pi}(k)$ as in (1). Then repeat sequentially each of the two steps below until respective values of the $\hat{\pi}$'s agree very closely in successive iterations. (1) Calculate the $p_{i,j}$'s from (2) with $f_{i,j} = \hat{\pi}(Y_{i,j})$, for the current values of the $\hat{\pi}$'s. (2) Using the current values of the $p_{i,j}$, calculate new values for the $\hat{\pi}$'s from (1).
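The two-step iteration can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the function name `em_size_distribution`, the convergence tolerance `tol`, and the iteration cap `max_iter` are all hypothetical choices, and candidate sizes are assumed to be nonzero integers.

```python
from collections import defaultdict


def em_size_distribution(pairs, tol=1e-10, max_iter=10_000):
    """EM estimate of the mutation size distribution.

    `pairs` holds, for each observed mutant, its two candidate
    mutation sizes (Y_i1, Y_i2).
    """
    if not pairs:
        raise ValueError("at least one observed mutation is required")
    n_mutants = len(pairs)

    def update_distribution(weights):
        # Equation 1: spread each mutant's unit mass over its two candidates.
        dist = defaultdict(float)
        for (y1, y2), (p1, p2) in zip(pairs, weights):
            dist[y1] += p1 / n_mutants
            dist[y2] += p2 / n_mutants
        return dist

    # Initial stage: p_i1 = p_i2 = 1/2 for every mutant.
    dist = update_distribution([(0.5, 0.5)] * n_mutants)
    for _ in range(max_iter):
        # Step 1: Equation 2 with f_ij set to the current estimate at Y_ij.
        weights = []
        for y1, y2 in pairs:
            f1, f2 = dist[y1], dist[y2]
            weights.append((f1 / (f1 + f2), f2 / (f1 + f2)))
        # Step 2: recompute the distribution from the new weights.
        new_dist = update_distribution(weights)
        change = max(abs(new_dist[k] - dist[k]) for k in new_dist)
        dist = new_dist
        if change < tol:  # successive estimates agree very closely
            break
    return dict(dist)
```

With the made-up input `[(1, -3), (1, 4), (1, -2), (2, -3)]`, the iteration concentrates the estimate on sizes 1 and -3 (approximately 2/3 and 1/3), because those two sizes explain all four mutants with the fewest distinct values. This is the "clumping" behaviour that maximum likelihood tends to show on small samples.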
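Having estimated the mutation size distribution, we estimate its mean square by
$$\hat{\sigma}^2 = \sum_{k} k^2\, \hat{\pi}(k) \qquad (3)$$
where $\hat{\pi}(k)$ is the estimated probability of a mutation of size $k$.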
Some of the literature relates population statistics to the variance of the mutation size distribution. Such analyses typically assume that the mean of the mutation size distribution is 0, in which case the variance and mean square of the distribution are identical. In APPENDIX I, we extend the theoretical results to general mutation size distributions. In this more general setting, the mean square of the distribution, rather than its variance, plays the central role.
Finally, we note that in connection with the constant population size scenario described below, some numerical adjustment is needed for X-linked loci, because of the differing number of chromosomes in the population. Perhaps the easiest correction is simply to redefine the mutation mean square for these loci, multiplying it by a factor of 3/4 for X-linked loci. With this change, the results in APPENDIX I all apply, where N is now the effective number of chromosomes at an autosomal locus. We have made this adjustment in the presentation of our data. (A related adjustment would also be needed for data on Y-linked loci.)
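The distinction between mean square and variance, and the X-linked adjustment, can be illustrated with a short Python sketch. The function names and the example distribution are hypothetical; the only facts taken from the text are that the mean square is the average squared mutation size and that it is multiplied by 3/4 for X-linked loci.

```python
def mean_square(dist):
    """Mean square of a mutation size distribution {size: probability}."""
    return sum(p * k * k for k, p in dist.items())


def distribution_variance(dist):
    """Variance of the same distribution, for comparison."""
    mean = sum(p * k for k, p in dist.items())
    return mean_square(dist) - mean ** 2


def adjusted_mean_square(dist, x_linked=False):
    """Mean square with the 3/4 correction applied to X-linked loci."""
    value = mean_square(dist)
    return 0.75 * value if x_linked else value


if __name__ == "__main__":
    # Made-up asymmetric distribution with a bias toward gains.
    example = {1: 0.5, -1: 0.25, 3: 0.25}
    print(mean_square(example), distribution_variance(example))
    print(adjusted_mean_square(example, x_linked=True))
```

For a distribution whose mean is not 0, as in this made-up example, the two quantities differ, which is why the general theory is stated in terms of the mean square.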
| RESULTS |
|---|
Microsatellite instability screening:
To observe an adequate number of somatic mutation events at each microsatellite locus, we used surgical specimens from patients with sporadic colon cancer as a source of paired normal and tumor DNA. Normal and tumor DNA patterns at 18 autosomal and two X-linked microsatellites were compared in each patient to detect somatic mutations (Table 1). A somatic mutation was considered to have occurred when there were one or more bands in the tumor tissue that were absent in the normal tissue (Figure 2). The fact that the same bands are seen in the normal and tumor tissue in the presence of an extra band can be explained either by the presence of some normal tissue in the tumor tissue sample or by the fact that the somatic mutation occurred after the onset of tumor growth. The average rate of instability per patient was 9.13%, with a slightly higher rate in tetranucleotide repeats compared to the di- and trinucleotide repeats.
Patterns of somatic mutations:
Heterogeneity of mutation sizes may account for many of the inter-locus differences in population patterns of variability as well as the propensity to produce extreme size alleles, such as those leading to expanded triplets. Therefore, we examined the somatic mutation data to estimate the distribution of mutation sizes for each STRP. Because the mutation patterns described above (e.g., Figure 2) do not allow unequivocal determination of mutation size, we used the EM algorithm, as described in MATERIALS AND METHODS, to produce maximum likelihood estimates of the distribution of mutation sizes in the colon cancer patients. The distributions, shown in Figure 3, vary greatly in shape and range, supporting the idea that microsatellites may differ significantly in mutation patterns.
In our estimation we ignored the possibility of "double" (or higher) mutations, in which an observed mutant allele has undergone more than one mutation event, via unobserved intermediate alleles, from the progenitor normal allele. The growing tumor represents an exponentially expanding population of cells, and standard population genetics theory suggests that it is unlikely that mutant alleles that arise during growth will subsequently be lost to the population. Additional, empirical, support for our assumption comes from the fact that most individuals had at most one observable mutation.
Some care should be exercised in interpreting the estimated distributions shown in Figure 3. For the distributions that show a wide range of possible mutation sizes (e.g., D8S164), the sample sizes do not allow precise inferences about the underlying distribution. For example, the apparently "ragged" estimated distributions may arise even if the underlying distribution is much smoother. However, the number of observed mutations for these loci is comparable with that for loci showing a small range of mutation sizes (compare, for example, D8S164 and DXS981, with 24 and 23 patients with somatic mutations, respectively) (see Table 1). Note that in estimating an unknown distribution, the method of maximum likelihood will tend to produce "clumped" estimates.
In addition to the maximum likelihood method, we implemented three natural Bayesian methods for estimating the underlying distribution of mutation sizes. These correspond respectively to the prior assumption that: (1) the shorter of the two possible mutation sizes occurred; (2) the probability of a given mutation size is inversely proportional to that size, and (3) the probability of a given mutation size is inversely proportional to the square of that size. Broadly, these methods gave estimates similar to those obtained by maximum likelihood. Figure 1 illustrates the four estimated distributions for two representative loci. Our subsequent analysis only uses the mean square of the distribution estimated by the maximum likelihood method, i.e., the average squared mutation size, and neither this quantity nor our subsequent conclusions are sensitive to the estimation method.
It was previously proposed that microsatellites evolve directionally with a bias toward increase of repeat number.
Population data:
The pattern of population variation for a genetic locus depends on the features of the mutation process of the locus and the demography of the population examined. To test whether the somatic mutations in cancer tissue follow the same rules as the "real" mutation process underlying the population variation, we typed the same microsatellite loci in three population samples from different ethnic backgrounds: the Sardinian population from Europe, the Luos from Kenya, and the Kaingang from Brazil. These populations were chosen to maximize, within the human evolutionary timescale, the amount of independent evolution. Hence, the alleles observed in these populations may well share only a small part of their mutational history. Compared to the other populations, the Kaingang showed a lower degree of genetic heterogeneity, a higher homozygosity, and a lower number of alleles (Table 2). These results are in agreement with other surveys of genetic variation in Amerindian populations, and suggest a smaller effective population size, probably because of a bottleneck that occurred during the colonization of the Americas.
Expected relationship between mutation and population parameters:
The generalized stepwise mutation model offers a broad theoretical framework for studying the process generating length variation at microsatellite loci.
We used population genetics theory to relate aspects of the mutation size distribution to patterns of variation expected in population samples.
One summary of population variation that is convenient for assessing the consequences of various demographic assumptions is the variance of repeat number in a population sample, denoted here by $S^2$. By this we mean, for a particular locus, the sample variance of the collection of repeat copy numbers in the chromosomes sampled from the population. That is, if $X_1, X_2, \ldots, X_n$ denote the number of repeats in each of the $n$ chromosomes sampled, and $\bar{X}$ denotes their average,
$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.$$
Population genetics theory will be used to relate this to the mean square of the distribution of the mutation size, which we denote here by $\sigma^2$. As described in MATERIALS AND METHODS, we estimate the value of $\sigma^2$ for each locus from the estimated distributions shown in Figure 3.
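In practice, typing data are usually tabulated as allele counts, and the variance of repeat number can be computed directly from such a table. The sketch below is illustrative only: the function name and the example counts are hypothetical, and it assumes the usual unbiased sample variance with divisor n - 1.

```python
def repeat_number_variance(allele_counts):
    """Sample variance of repeat number from {repeat_number: chromosome_count}."""
    n = sum(allele_counts.values())
    if n < 2:
        raise ValueError("at least two chromosomes are required")
    mean = sum(k * c for k, c in allele_counts.items()) / n
    squared_deviations = sum(c * (k - mean) ** 2 for k, c in allele_counts.items())
    return squared_deviations / (n - 1)  # unbiased divisor (an assumption)


if __name__ == "__main__":
    # Made-up sample: two chromosomes with 10 repeats, two with 12.
    print(repeat_number_variance({10: 2, 12: 2}))
```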
Constant population size scenario:
In this case the population in question is panmictic and has maintained a constant effective size of N chromosomes throughout its history. We assumed a generalized stepwise mutation model with arbitrary distributions of mutation sizes; in particular, we allow for asymmetric distributions as well as biases towards increase or decrease of repeat number. In APPENDIX I, we show that under this scenario of demography and mutation, population theory predicts the following relationship
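$$E(S^2) = N \mu \sigma^2 \qquad (4)$$
where $\mu$ is the mutation rate per generation at the locus and $\sigma^2$ is the mean square of its mutation size distribution.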
In Equation 4, the expectation E pertains to averages over realizations of evolutionary and sampling processes. In a particular realization of evolution the actual pattern of variation observed in a sample for a particular microsatellite locus is the result of various chance processes. One of these is sampling: a different sample from the same population would not produce exactly the same pattern. The second and more important source of randomness stems from the chance events intrinsic to the evolutionary history of the population: unlinked loci within the same population and subject to identical mutation processes will give rise, because of chance events, to different patterns of variation. Thus, the actual value of S2 observed in a population sample at a given locus is not expected to be given exactly by Equation 4, but to deviate from it because of the above chance effects.
One particular consequence of Equation 4 pertains to values of $S^2$ at different loci within a population. If we can assume that the mutation rate is approximately the same across these loci, and plot the observed values of $S^2$ against the mutation mean square, $\sigma^2$, for each locus, the points should lie around a straight line. The slope of the line is proportional to the product of population size and mutation rate.
Rapid population growth scenario:
We considered a scenario, based on mtDNA evidence, that the human population underwent rapid demographic growth in the past.
Under this combined scenario of demography and mutation process, for a given microsatellite locus we derive in APPENDIX I the following approximate relationship
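$$E(S^2) \approx T \mu \sigma^2 \qquad (5)$$
where $T$ is the time since the population expansion, $\mu$ is the mutation rate at the locus, and $\sigma^2$ is the mean square of its mutation size distribution. Thus, plotting the observed value of $S^2$ at each locus against the mutation mean square, $\sigma^2$, for that locus should again give points lying around a straight line. For this scenario, the slope of the line is proportional to the product of the time since the population expansion, and the mutation rate.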
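A heuristic sketch of why a relationship of the form of Equation 5 arises is given below; it is not the derivation in APPENDIX I. It assumes an idealized star genealogy, in which every sampled chromosome descends from one ancestral allele along an independent lineage of length $T$ generations, and the unbiased sample variance. The symbols $X_0$, $M_i$, and $U_{i,m}$ are introduced here for illustration only.

```latex
% Assumptions: star genealogy, Poisson mutation events at rate \mu per
% generation, i.i.d. mutation sizes U with mean square \sigma^2 = E(U^2).
\begin{align*}
X_i &= X_0 + \sum_{m=1}^{M_i} U_{i,m}, \qquad M_i \sim \mathrm{Poisson}(\mu T), \\
\operatorname{Var}(X_i \mid X_0)
    &= E(M_i)\operatorname{Var}(U) + \operatorname{Var}(M_i)\,[E(U)]^2 \\
    &= \mu T \left\{ \operatorname{Var}(U) + [E(U)]^2 \right\}
     = \mu T\, E(U^2) = T \mu \sigma^2, \\
E(S^2) &= \operatorname{Var}(X_i \mid X_0) = T \mu \sigma^2 .
\end{align*}
```

Because the number of mutations on a lineage is itself random, the mean of the mutation size distribution contributes to the spread of repeat numbers, so the mean square $E(U^2)$ rather than the variance of $U$ appears in the result.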
Observed relationship between mutation and population parameters:
Plots of the variance of repeat number against the mean square of mutation size estimated from the somatic mutations observed in cancer patients are shown in Figure 4 for each of the three populations in the study. If the constant population size scenario is valid, all loci within a population will share a common value of N (with suitable adjustment, as described in MATERIALS AND METHODS, for X-linked loci), while under the rapid expansion scenario all loci within a population will share the same value of T. Neither of these values need necessarily be shared between populations; thus each population is shown separately.
As shown in Figure 4, in each case the points showed clustering around a straight line. This linear relationship observed consistently in each of the three population samples strongly suggests that the estimated mutation parameters based on the somatic mutations in colon cancer are good approximations of the "real" mutation parameters, namely those characterizing the process underlying the population variation.
Normalized population variance:
The above theoretical framework leads to the introduction of a new statistic for each locus, the normalized population variance (NPV), which is defined here as the ratio between the population variance, i.e., the variance of repeat number at a locus in a population sample, and the (estimated) mutation mean square at the same locus. Thus, if we use $S_j^2$ and $\hat{\sigma}_j^2$, respectively, to denote population variance and mutation mean square at locus $j$, the normalized population variance, $V_j$, for that locus is defined by
$$V_j = \frac{S_j^2}{\hat{\sigma}_j^2}.$$
Intuitively, a locus with a large mutation mean square, and hence large variability in mutation size, is expected to show greater population variance, and allowance should be made for this effect when combining information across loci. The normalized population variance provides the "correct" measure for comparison across loci. Its use is particularly appropriate in light of the apparent heterogeneity of mutation processes across loci (Figure 3).
The theoretical expectations for the demographic scenarios outlined above can thus be expressed in terms of the normalized population variance: its expected value, averaged across loci, is $N\bar{\mu}$ in the constant population size scenario, and approximately $T\bar{\mu}$ in the rapid expansion scenario, where $\bar{\mu}$ is the average mutation rate across the loci considered. By using estimates of the (average) mutation rate, this offers a convenient way to estimate the parameters N and T. Table 3 shows such estimates for the population samples examined. It should be noted that our approach has the advantage that the estimates of underlying population parameters are obtained by combining independent loci, thus reducing the substantial variability inherent in single-locus estimates. In addition, the estimation allows and corrects for observed interlocus variability of mutation processes because it uses the mean square mutation size estimated individually for each locus rather than making restrictive and possibly unrealistic assumptions about the mutation mechanism, such as that gain or loss of only one repeat is possible. In essence, locus-specific information about the mutation mean square allows us to calibrate, and hence to combine efficiently, population variability at the different loci studied.
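The estimation step can be sketched in Python as follows. The function names are hypothetical, and the numbers in the example are made up for illustration; they are not values from Table 3. The mutation rate passed in is an externally supplied assumption.

```python
from statistics import mean


def normalized_population_variances(population_variances, mutation_mean_squares):
    """Per-locus ratio of population variance to mutation mean square."""
    return [s2 / ms for s2, ms in zip(population_variances, mutation_mean_squares)]


def estimate_scale(npv, mean_mutation_rate):
    """Average NPV divided by the assumed average mutation rate.

    The result is read as N (chromosomes) under the constant size scenario,
    or as T (generations) under the rapid expansion scenario.
    """
    return mean(npv) / mean_mutation_rate


if __name__ == "__main__":
    # Made-up values for three loci.
    s2 = [2.0, 6.0, 3.0]        # population variances of repeat number
    sigma2 = [1.0, 3.0, 2.0]    # estimated mutation mean squares
    v = normalized_population_variances(s2, sigma2)
    print(v, estimate_scale(v, mean_mutation_rate=1e-3))
```

In this toy example the raw population variances differ threefold across loci, but after normalization the values are much closer, which is the calibration effect described above.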
Discriminating between different demographic scenarios:
While both demographic scenarios described above predict that the expected variance of repeat number in the population sample and the mean square of mutation size will be related linearly, there are major differences in other aspects of the predicted relationship between these two variables. As described in APPENDIX I, one can quantify the amount of variability around the line expected under each scenario, i.e., the extent to which the points are scattered around the line. Equivalently, one can assess the variability in measured values of the normalized population variance across loci. As shown in Table 3, much more scatter is expected under the constant population size scenario than in the case of a rapid expansion; loosely speaking, evolutionary variability (genetic drift) will be more pronounced in the constant population size scenario. Sampling variability is effectively the same in both scenarios.
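The contrast in scatter between the two scenarios can be explored with a small simulation. The sketch below is an illustration under simplifying assumptions, not the calculation in APPENDIX I: it uses a standard coalescent for a population of N chromosomes in the constant size case and an idealized star genealogy for the rapid expansion case, with the same mutation rate at every locus. All function names and parameter values are hypothetical.

```python
import random
from statistics import mean, variance


def total_shift(rng, expected_mutations, sizes, probs):
    """Summed size of a Poisson(expected_mutations) number of mutations."""
    shift = 0
    arrival = rng.expovariate(1.0)  # unit-rate arrivals on [0, expected_mutations)
    while arrival < expected_mutations:
        shift += rng.choices(sizes, weights=probs)[0]
        arrival += rng.expovariate(1.0)
    return shift


def star_sample(rng, n, T, mu, sizes, probs):
    """Rapid expansion: n independent lineages, each T generations long."""
    return [total_shift(rng, mu * T, sizes, probs) for _ in range(n)]


def constant_size_sample(rng, n, N, mu, sizes, probs):
    """Constant size: coalescent genealogy in a population of N chromosomes."""
    repeats = [0] * n                   # repeat numbers relative to the root
    lineages = [[i] for i in range(n)]  # sampled chromosomes below each lineage
    while len(lineages) > 1:
        k = len(lineages)
        wait = rng.expovariate(k * (k - 1) / (2 * N))  # generations to next merger
        for members in lineages:
            shift = total_shift(rng, mu * wait, sizes, probs)
            for i in members:
                repeats[i] += shift
        a, b = rng.sample(range(k), 2)  # two distinct lineages coalesce
        lineages[a].extend(lineages[b])
        del lineages[b]
    return repeats


def simulate_npv(sampler, rng, loci, n, scale, mu, sizes, probs):
    """Normalized population variance at each of `loci` independent loci."""
    mean_square = sum(p * k * k for k, p in zip(sizes, probs))
    return [float(variance(sampler(rng, n, scale, mu, sizes, probs))) / mean_square
            for _ in range(loci)]


if __name__ == "__main__":
    rng = random.Random(1)
    sizes, probs = [-1, 1], [0.5, 0.5]  # hypothetical one-step mutation model
    for label, sampler in [("constant size", constant_size_sample),
                           ("rapid expansion", star_sample)]:
        v = simulate_npv(sampler, rng, loci=200, n=20, scale=1000,
                         mu=1e-3, sizes=sizes, probs=probs)
        print(label, round(mean(v), 2), round(variance(v), 2))
```

With N and T both set to 1000 and a mutation rate of 0.001, both scenarios have an expected normalized population variance near 1, so any difference between them shows up in the spread of values across loci rather than in their average.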
It should be noted that in our data there are sources of variability additional to those incorporated in the model because we have had to use estimates of the mutation mean square. Thus, if the model were valid we should expect to observe more than the predicted variability. In addition, if mutation rates are not constant across loci, the variability around the line is expected to be even larger (see Equation A6 and Equation A14 in APPENDIX I). The average normalized population variances and their observed variances are presented in Table 3.
We can use population genetics theory to estimate a lower bound on the variance across loci of the normalized population variance for each of the two demographic scenarios. In addition, for each scenario of demography, we considered three models of mutation rate: (1) constant rate across loci, (2) high, and (3) moderate variability of rate across loci. A range of values of N and T were used to calculate the expected variance of the normalized population variance under the constant population size and the rapid expansion scenarios, respectively. The expected values of variance of the normalized population variance are shown in Table 3.
Even under the assumption of constant mutation rate across loci, we observe markedly less variability in normalized population variance than predicted under the constant population size scenario for the Luo, Kaingang, and pooled samples, and slightly less than predicted for the Sardinian sample. Allowing for even moderate variability in mutation rates, the observed variability is substantially less than predicted under the constant population size scenario for all samples. While a formal significance test does not seem feasible in this context, we interpret this as evidence against the assumption that the populations have evolved with roughly constant size. For a detailed discussion see APPENDIX II. Conversely, the level of observed variability is consistent with the rapid growth scenario if interlocus variability of mutation rate is incorporated into the model.
Analysis of the pooled population sample:
All three populations showed a pattern consistent with rapid growth and the estimates of T for Sardinia and Luo were reasonably similar. Therefore, we pooled the microsatellite typing data for the three populations and calculated the variance of repeat number in the pooled sample. We then plotted the mutation mean square against the pooled population variance, as shown in Figure 4. As with the single population analysis, we observe a linear relationship and a low variability of the normalized population variance. Again, the average across loci of the normalized population variance can be used to obtain an estimate of T, which in this context represents the coalescence time for the species under the assumption of rapid population growth. Interestingly, the estimate of the time since the rapid population growth obtained on the pooled sample is very similar to those for the Sardinian and Luo samples (see Table 3).
| DISCUSSION |
|---|
It is not easy to obtain experimental evidence to describe the mutation process of microsatellites on a locus-by-locus basis. Here, we present a novel experimental and theoretical approach to this problem that uses the mutator phenotypes of cancer patients with microsatellite instability and the theoretical framework of the generalized stepwise mutation model. By analyzing the somatic mutations occurring in the tumor tissue of such patients, we estimated the distribution of mutation sizes and the mean square mutation size for each locus independently. The variance of repeat number in a population sample is related to the mean square of the size of the mutations that produce the observed population variation. By using our estimated mean square mutation size in tumor tissue, we observed the expected relationship supporting the idea that the mutation sizes in cancer cells are similar to those occurring in the germline. Further analysis of the relationship between mean square mutation size and variance of repeat number in human population samples allowed testing of different demographic scenarios of human evolution.
Somatic mutations in cancer and in the germline of the general population:
Our integrated analysis of microsatellite instability and population variation demanded testing of the idea that mutation patterns in cancer do not differ substantially from those in normal germline cells. In agreement with the predictions of population genetic theory, a significant correlation between the variance of repeat number and the mutation mean square is evident even though the latter was obtained through the analysis of somatic mutations in cancer. The predicted relationship between the population and mutation parameters is observed consistently in three samples from distantly related populations, which represent partially independent realizations of evolution.
Differences between the mutations observed in cancer patients and those underlying population variation might be expected if the mechanisms maintaining genome stability recognized and fixed certain mutation types more efficiently than others, e.g., replication slippage events involving a larger number of repeats were more likely to be recognized. One such mechanism is mismatch repair, but others may exist in sporadic cancer patients.
Information on mutation patterns resulting from defects in genome stability is available only for mismatch repair mutants in yeast. In this organism, the rate and pattern of mutations were compared in wild-type and mismatch repair mutant strains, including msh2 mutants (the human homolog of msh2 is the most common mutant in hereditary colon cancer).
The significant proportion of relatively large mutations in our sporadic cancer patients as well as a higher (rather than lower) rate of instability at tetranucleotide compared to di- and tri-nucleotide repeats contrasts with the above scenario of mismatch repair mutants in yeast. This difference could be due to the presence of mutations in genes other than those involved in mismatch repair. In fact, mutations in mismatch repair genes could be identified only in approximately 43% of sporadic cancers with microsatellite instability.
Our finding of a significant correlation between the mutation parameters estimated in cancer and the population parameters in samples from three ethnically diverse human populations suggests that the mutation process in our sporadic cancer patients is closely related to the process that generates population variation at human microsatellites.
Mutation sizes have a wide spectrum:
Previous surveys of microsatellite mutations have consisted of the observation of mutation events in family studies. Over all loci, the majority of mutations (78% in ![]()
Estimating population parameters from multilocus microsatellite data:
Most estimators of population parameters in the literature are of necessity based on single-locus data. It is well understood that because of the extent of evolutionary variability, estimates based on such data have limited precision.
Here, we have shown how to exploit locus-specific information on microsatellite mutation processes to combine data from distinct loci in estimating population parameters. Central to our approach is the introduction of what we have called normalized population variance, i.e., the population variance at a locus divided by the estimated mean square mutation size. Our theoretical model explicitly allows for differences across loci in the distribution of mutation sizes (in particular, we do not assume that changes only involve one repeat unit) and we have shown that the use of the normalized population variance properly allows for this interlocus variability.
Data from different loci can then be combined using the normalized population variances. For the two demographic scenarios that were considered, the average of these normalized population variances across loci leads to a natural estimator for the effective population size, or time since the rapid expansion, respectively. The ability to use data from multiple loci has considerable advantages for estimating population parameters or for the assessment of evolutionary hypotheses. Unlinked loci effectively provide independent sources of information on population parameters, so that as in classical statistics the variance of our estimators decreases as the inverse of the number of loci involved. In addition, the combination of information across loci acts to reduce the effect of special features (such as selection) at particular loci, another problem that may affect single-locus analyses.
Theoretical novelties of this approach include the derivation of the mean and variance of population variance under extreme population growth, for a general stepwise mutation model. (In particular, there are no restrictions on mutation size, and the distribution of mutation size is allowed to have a bias toward either increase or decrease, and to be asymmetric.) We have also derived the mean of the population variance, for this level of generality in the distribution of mutation size, in the constant population size scenario.
We also note that information on the mean square (or variance if the mean is assumed to be 0) of mutation size at different loci could be used to improve the use of recently introduced measures of genetic distance for microsatellite loci [see for example ![]()
Discriminating between different demographic scenarios:
In addition to the estimation problem, we have developed a multilocus method for discriminating among competing demographic scenarios. The central idea is that while both scenarios should result in a linear relationship between certain statistics, they make very different predictions of the amount of variability to be expected across loci.
The finding of relatively low variability in the normalized population variance is not consistent with the scenario of constant population size. It should be noted that our results are biased toward higher variability because of the estimation of the mutation mean square. This bias further supports the rejection of the constant population size scenario. However, the observed variability in the normalized population variance is higher than expected under the rapid expansion scenario, if the mutation rate is assumed to be constant across loci. This may be due either to the estimation of the mutation mean square or to a violation of the assumption of lack of interlocus variability of mutation rate; these explanations are not mutually exclusive.
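The contrast in interlocus variability can be illustrated with a small simulation. The sketch below is not the authors' procedure: it assumes symmetric one-repeat mutations, the same mutation rate at every locus, 20 chromosomes per locus, and scaled parameters chosen so that both scenarios have the same expected normalized population variance. All function names are hypothetical. The constant-size case uses a standard coalescent simulation with time measured in units of N generations.

```python
import math
import random
from statistics import mean, variance


def poisson(rng, lam):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def net_change(rng, n_mutations):
    """Sum of n_mutations steps of +1 or -1 repeat (assumed mutation model)."""
    return sum(rng.choice((-1, 1)) for _ in range(n_mutations))


def npv_expansion(rng, n, t_mu):
    """Star genealogy: n independent lineages, each with Poisson(t_mu) mutations."""
    sizes = [net_change(rng, poisson(rng, t_mu)) for _ in range(n)]
    return variance(sizes)  # mean square mutation size is 1 for +/-1 steps


def npv_constant(rng, n, n_mu):
    """Coalescent genealogy for a constant-size population."""
    sizes = [0] * n
    lineages = [[i] for i in range(n)]  # each lineage lists its sampled descendants
    while len(lineages) > 1:
        k = len(lineages)
        tau = rng.expovariate(k * (k - 1) / 2)  # waiting time to next coalescence
        for members in lineages:
            change = net_change(rng, poisson(rng, n_mu * tau))
            for i in members:
                sizes[i] += change
        a, b = rng.sample(range(k), 2)  # pick the two lineages that coalesce
        merged = lineages[a] + lineages[b]
        lineages = [lin for j, lin in enumerate(lineages) if j not in (a, b)]
        lineages.append(merged)
    return variance(sizes)


rng = random.Random(1)
n, n_loci, scaled_rate = 20, 1000, 1.0  # T*mu = N*mu = 1 in both scenarios
v_exp = [npv_expansion(rng, n, scaled_rate) for _ in range(n_loci)]
v_con = [npv_constant(rng, n, scaled_rate) for _ in range(n_loci)]
print("expansion: mean", mean(v_exp), "variance", variance(v_exp))
print("constant:  mean", mean(v_con), "variance", variance(v_con))
```

Under these assumptions the two scenarios give similar averages across loci, but the spread across loci is far larger for the constant-size genealogy, because loci then differ in their shared ancestry as well as in their mutations.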
It is important to ask whether departures from the assumptions of the mutation model, rather than failure of the constant population size scenario, may explain the observed low variability in the normalized population variance. One such possibility is the presence of constraints on microsatellite allele size. It has been proposed that microsatellite variation between species is subject to size constraints; more specifically, that the expected difference in allele sizes between human and chimpanzee is larger than observed (![]()
Mutation rates which depend on the length of the progenitor allele would also violate the assumptions of the generalized stepwise model. The most obvious possibility is presumably for mutation rates to increase with allele length (![]()
Implications for human evolution:
The hypothesis of a rapid population growth during the evolutionary history of human populations has been proposed on the basis of mtDNA data showing a star-shaped genealogy and an approximately Poisson distribution of pairwise sequence differences (![]()
Our findings are difficult to reconcile with a constant population size scenario, but are consistent with rapid population growth and agree with the above proposal. The multilocus analysis performed here allows us to rule out the possibility that our findings are the result of selection.
Under the assumption of extreme population growth, our analysis allows for multilocus estimation of the time since the expansion event occurred. These estimates, for the three populations separately and for the pooled population sample, are given in Table 3. The estimates of T obtained for the pooled sample can be interpreted as the time since the most recent common ancestor of humans. They are not inconsistent with analogous estimates obtained for mtDNA and Y chromosome loci (![]()
| ACKNOWLEDGMENTS |
|---|
This work was supported in part by grants from the National Science Foundation (SBR-9317266 to A.D.R. and DMS-9505129 to P.D.) and the American Cancer Society, Illinois Division, to A.D. and a UK EPSRC Advanced Fellowship (B/AF1255) to P.D. We thank M. NORDBORG, R. THOMAS, and G. WICHMANN for technical help, D. J. BALDING, D. R. COX, and P. MCCULLAGH for helpful discussions, and M. KREITMAN, C. OBER, and A. TURKEWITZ for critical reading of the manuscript.
Manuscript received July 16, 1997; Accepted for publication November 20, 1997.
| APPENDIX 1 |
|---|
Theory for rapid expansion:
Focus attention on a particular locus within a population. It is natural to derive results via genealogical methods. In the case of a large constant size population, the genealogy is well described by the coalescent. The effect on this genealogical tree of changes in population size is well understood, at least under plausible assumptions on the dynamics of the process governing change in the population size (![]()
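Write $A$ for the number of repeats in the common ancestor. Then write $M_i$ for the change from the common ancestor along the lineage to the $i$th chromosome sampled, so that $X_i$, the observed repeat number in the $i$th chromosome, is given by $X_i = A + M_i$.
Then, writing $\bar{X}$ for the sample average, $\bar{X} = n^{-1}(X_1 + X_2 + \cdots + X_n)$, and analogously for $\bar{M}$, the constant $A$ cancels from every deviation, $X_i - \bar{X} = M_i - \bar{M}$, so that the population variance $S^2$ of the $X_i$ equals that of the $M_i$: $S^2 = S^2_M$.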
Now, the random variables $M_i$ are independent and identically distributed. Write $W$ for the number of mutation events along the lineage leading from the common ancestor to the $i$th chromosome. The random variable $W$ has a binomial distribution with parameters $T$ and $\mu$, the mutation probability at the locus in question. Since $T$ is large and $\mu$ small, we will assume that in fact $W$ has a Poisson distribution with mean $T\mu$. If $F$ denotes the distribution governing mutation sizes, we can write $M_i = Z_1 + Z_2 + \cdots + Z_W$, where the $Z_j$ are independent random variables with distribution $F$.
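It turns out to be convenient to calculate the cumulants (see e.g., ![]()) of $M_i$. Writing $\phi(t)$ for the moment generating function of the distribution $F$, the cumulant generating function of $M_i$ is $T\mu\,\phi(t) - T\mu$, so that, writing $\kappa_r$ for the $r$th cumulant of $M_i$ and $m_r$ for the $r$th moment of $F$ about the origin, $m_r = E(Z_j^r)$,
$$\kappa_r = T\mu\, m_r \tag{A1}$$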
Standard results for independent and identically distributed random variables (![]()
![]() |
(A2) |
![]() |
(A3) |
Thus, writing V for the normalized population variance (NPV),
![]() |
(A4) |
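For reference, the standard sampling results for the variance of independent and identically distributed variables can be written out as follows. This is a sketch rather than a reproduction of the displayed equations: it assumes that $S^2$ is the unbiased sample variance (divisor $n - 1$) and that the mean square mutation size is known rather than estimated. The symbols $\kappa_r$ (the $r$th cumulant of $M_i$) and $m_r$ (the $r$th moment of $F$ about the origin) are notational assumptions.

```latex
\begin{align*}
\kappa_r &= T\mu\, m_r, \\
E(S^2) &= \kappa_2 = T\mu\, m_2, \\
\operatorname{Var}(S^2) &= \frac{\kappa_4}{n} + \frac{2\kappa_2^2}{n-1}
  = \frac{T\mu\, m_4}{n} + \frac{2\,(T\mu)^2 m_2^2}{n-1}, \\
V &= \frac{S^2}{m_2}, \qquad E(V) = T\mu, \qquad
\operatorname{Var}(V) = \frac{T\mu\, m_4}{n\, m_2^2} + \frac{2\,(T\mu)^2}{n-1}.
\end{align*}
```

For example, with one-repeat mutations only ($m_2 = m_4 = 1$), $n = 20$ and $T\mu = 1$, this gives $\operatorname{Var}(V) = 1/20 + 2/19 \approx 0.155$.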
If the mutation rate varies across loci, then
![]() |
(A5) |
![]() |
(A6) |
where $\bar{\mu}$ and Var(µ) are the average and variance, respectively, of the mutation rate. With NPV values for $L$ different loci, their average $\bar{V}$ provides a natural estimate of $T\bar{\mu}$. Further, the sampling variance of the estimator is
![]() |
(A7) |
Note the decrease in this sampling variance as a function of the number of loci and, for loci with the same value of µ [and hence Var(µ) = 0], the decrease also as a function of the sample size n (assumed here to be the same at all loci, though this is easily relaxed).
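The behavior of the multilocus estimator under rapid expansion can be checked by simulation. The following sketch assumes a star genealogy, a common value of Tµ at all loci, and an illustrative mutation-size distribution that includes two-repeat changes. The distribution, parameter values, and function names are hypothetical choices made for the example.

```python
import math
import random
from statistics import mean, variance

# Assumed mutation-size distribution F (repeat units and their probabilities).
SIZES, PROBS = (-2, -1, 1, 2), (0.1, 0.4, 0.4, 0.1)
M2 = sum(p * z * z for z, p in zip(SIZES, PROBS))  # mean square mutation size


def poisson(rng, lam):
    """Knuth's multiplication method; adequate for small means."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def locus_npv(rng, n, t_mu):
    """NPV at one locus for n chromosomes descending independently from one ancestor."""
    changes = []
    for _ in range(n):
        w = poisson(rng, t_mu)  # number of mutations on this lineage
        changes.append(sum(rng.choices(SIZES, PROBS, k=w)))
    return variance(changes) / M2


def estimator(rng, n_loci, n=30, t_mu=2.0):
    """Average NPV over unlinked loci, an estimate of T*mu."""
    return mean(locus_npv(rng, n, t_mu) for _ in range(n_loci))


rng = random.Random(7)
reps = 200
est_5 = [estimator(rng, 5) for _ in range(reps)]    # replicate data sets, 5 loci
est_20 = [estimator(rng, 20) for _ in range(reps)]  # replicate data sets, 20 loci
print("mean estimate (20 loci):", mean(est_20))
print("sampling variance, 5 vs 20 loci:", variance(est_5), variance(est_20))
```

With these settings the estimates center on the assumed value of Tµ, and the sampling variance with 20 loci is markedly smaller than with 5, in line with the inverse dependence on the number of loci noted above.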
Theory for constant population size:
In the context of constant large population size, genealogy is well described by the coalescent. Write $X_i$, $M_i$, and $S^2$, as above, for the number of repeats in the $i$th sampled chromosome, the change in repeat number in that chromosome since the common ancestor of the sample, and the population variance, respectively. Then, as above, $S^2 = S^2_M$, so that
![]() |
(A8) |
Now, consider the lineages leading to the first two chromosomes in the sample. It may be that the common ancestor of these two chromosomes is the common ancestor of the sample, or it may be that their common ancestor occurred more recently than the common ancestor of the entire sample, in which case the two chromosomes will have shared some ancestry subsequent to the sample's common ancestor. Write Yc for the change in repeat number along this shared lineage, with Yc = 0 if there is no shared lineage, and Y1 and Y2 for the respective changes in repeat number since the common ancestor of the two chromosomes. Then
Some algebra then gives
![]() |
(A9) |
Now, in the coalescent, the number of mutations, $W_1$, on the lineage to the first chromosome since its common ancestor with the second has a geometric distribution with mean $\theta/2$, where $\theta = 2N\mu$ is the usual scaled mutation parameter (recall that $N$ is the number of chromosomes, rather than individuals, in the population). The total number of mutations, $W_{12}$, along either lineage since their common ancestor is also geometric, with mean $\theta$.
With $Z_1, Z_2, \ldots$ denoting independent random variables with mean $m$, variance $\sigma^2$ and distribution $F$, we can write $Y_1 = \sum_{i=1}^{W_1} Z_i$. Thus
![]() |
(A10) |
Similarly, $Y_1 + Y_2 = \sum_{i=1}^{W_{12}} Z_i$, so that
![]() |
(A11) |
Finally, on substituting (A10) and (A11) into (A9), and then (A8) we have,
This result has recently been derived independently, by related methods, in ![]()
Recall that
![]() |
(A12) |
If the mutation rate varies across loci, then
![]() |
(A13) |
![]() |
(A14) |
where $\bar{\mu}$ and Var(µ) are the average and variance, respectively, of the mutation rate. With data on NPV values at $L$ different loci, then under this demographic scenario, their average $\bar{V}$ provides a natural estimator of $N\bar{\mu}$. The variance of this estimator is then
![]() |
(A15) |
Note the decrease in this sampling variance with L, the number of loci. For loci with the same mutation rate µ, this sampling variance reduces to
![]() |
(A16) |
In addition to variation in mutation rate, the mutation mechanism may also vary across loci, as suggested for example by Figure 3 of the paper. In this case, A6, A7, A14, A15, and A16 still apply, with
replaced by the average of this quantity across loci.
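The two-chromosome argument used for the constant population size case can be checked numerically. The sketch below assumes symmetric one-repeat mutations, so that the mean square mutation size is 1 and the difference in repeat number between the two chromosomes has the same distribution as a sum of W12 random steps. It draws W12 from a geometric distribution with mean θ = 2Nµ, as stated in the text. The parameter value and function name are hypothetical.

```python
import math
import random


def geometric(rng, mean_value):
    """Geometric count on 0, 1, 2, ... with P(K >= k) = q**k and q = mean/(1 + mean)."""
    q = mean_value / (1.0 + mean_value)
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(q))


rng = random.Random(3)
n_mu = 1.0            # assumed value of N * mu
theta = 2.0 * n_mu    # scaled mutation parameter, theta = 2 * N * mu
reps = 100_000

total = 0.0
for _ in range(reps):
    w12 = geometric(rng, theta)  # mutations separating the two chromosomes
    diff = sum(rng.choice((-1, 1)) for _ in range(w12))  # X1 - X2 for +/-1 steps
    total += 0.5 * diff * diff   # sample variance for a sample of size 2

estimate = total / reps          # Monte Carlo estimate of E(V)
print("simulated E(V):", estimate, "expected N*mu:", n_mu)
```

Because the expected variance does not depend on the sample size, a sample of two chromosomes is enough to check the mean, although the spread of individual values is large.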
| APPENDIX 2 |
|---|
A formal significance test of the null hypothesis of constant population size and generalized stepwise mutation is complicated by several factors. One of these is the presence of nuisance parameters: N and the variation of mutation rates across loci. Another is the fact that only moments of some observables, rather than their full distribution, are known under the null hypothesis. The following informal analysis may nonetheless be helpful in assessing the strength of the evidence in the data against the null hypothesis.
For definiteness, consider the case of constant N, with N = 10,000 (recalling that N is twice the effective number of individuals in the population) and moderate variability in mutation rate, as described in Table