Genetics, Vol. 150, 487-497, September 1998, Copyright © 1998

Simultaneous Estimation of All the Parameters of a Stepwise Mutation Model

Yun-Xin Fua and Ranajit Chakrabortya
a Human Genetics Center, University of Texas, Houston, Texas 77225

Corresponding author: Yun-Xin Fu, Human Genetics Center, University of Texas at Houston, 6901 Bertner Ave., Houston, TX 77030., fu{at}hgc.sph.uth.tmc.edu (E-mail).

Communicating editor: G. B. GOLDING


*  ABSTRACT
*TOP
*ABSTRACT
*STEPWISE MUTATION MODELS
*MINIMUM CHI-SQUARE ESTIMATOR OF...
*APPLICATIONS
*DISCUSSION
*LITERATURE CITED

Minisatellite and microsatellite are short tandemly repetitive sequences dispersed in eukaryotic genomes, many of which are highly polymorphic due to copy number variation of the repeats. Because mutation changes copy numbers of the repeat sequences in a generalized stepwise fashion, stepwise mutation models are widely used for studying the dynamics of these loci. We propose a minimum chi-square (MCS) method for simultaneous estimation of all the parameters in a stepwise mutation model and the ancestral allelic type of a sample. The MCS estimator requires knowing the mean number of alleles of a certain size in a sample, which can be estimated using Monte Carlo samples generated by a coalescent algorithm. The method is applied to samples of seven (CA)n repeat loci from eight human populations and one chimpanzee population. The estimated values of parameters suggest that there is a general tendency for microsatellite alleles to expand in size, because (1) each mutation has a slight tendency to cause size increase and (2) the mean size increase is larger than the mean size decrease for a mutation. Our estimates also suggest that most of these CA-repeat loci evolve according to multistep mutation models rather than single-step mutation models. We also introduced several quantities for measuring the quality of the estimation of ancestral allelic type, and it appears that the majority of the estimated ancestral allelic types are reasonably accurate. Implications of our analysis and potential extensions of the method are discussed.

SINCE the discovery that a large number of loci with tandemly repeated sequences in human and many eukaryote species are highly polymorphic because of copy number variation of the repeats in different individuals (JEFFREYS 1985; LITT and LUTY 1989 Down; WEBER and MAY 1989 Down), allele size data from such loci are rapidly becoming the dominant source of genetic markers for genome mapping, forensic testing, and population studies. Loci with repeat sequences longer than 5 bp are generally referred to as minisatellite or variable number tandem repeat loci, and those with repeat sequences between 2 to 5 bp are referred to as microsatellite or short tandem repeat loci (TAUTZ 1993 Down). Because mutations change the copy number of such loci in a stepwise fashion, rapid accumulation of population samples from minisatellite and microsatellite loci has resurrected the interest of the stepwise mutation model (SMM), which was popular in the 1970s.


To avoid misinterpretation when using the information from these loci, understanding the dynamics of polymorphism at minisatellite and microsatellite loci is important. It is also vital for population and evolutionary study. Important for better understanding of the evolution of such loci is the estimation of relevant population parameters. There are several parameters of a population that can affect the pattern of polymorphism in a sample. The most important is {theta} = 4Nµ, where N is the effective population size and µ is the mutation rate per locus per generation. {theta} is primarily responsible for the amount of polymorphism in a sample. Under the infinite-allele model, that is, each mutation at a locus creates a new allele in the population, {theta} is the only parameter for the distribution of polymorphism in a sample from a steady population. EWENS' (1972) sampling formula provides the basis for estimating {theta} from a single quantity—the number of alleles in a sample. However, under SMM, the pattern of polymorphism in a sample becomes more complex and depends on additional parameters, the number of which depends on the complexity of the mechanism of evolution for such a locus. Unfortunately, a sampling distribution for a SMM parallel to EWENS' (1972) has not been found, resulting in difficulty in making inference under the SMM. Since the values of parameters are necessary in interpreting the evolution of a locus under the SMM, proper estimation of parameters is critical in studying the mechanism of evolution of loci under the SMM. To date, no method is available for simultaneously estimating all the parameters of the SMM, which limits the usefulness of such loci for studying the history of a population, particularly the human population.

We develop in this article a method for simultaneous estimation of all the parameters of the SMM and the ancestral allelic type of the alleles in a sample. The estimator is a combination of the minimum chi-square estimator (MCS) and Monte Carlo simulation, taking advantage of fast coalescent algorithms. We apply the method to the samples of dinucleotide repeats of DEKA et al. 1995 Down and discuss the implications of our analyses.


*  STEPWISE MUTATION MODELS
*TOP
*ABSTRACT
*STEPWISE MUTATION MODELS
*MINIMUM CHI-SQUARE ESTIMATOR OF...
*APPLICATIONS
*DISCUSSION
*LITERATURE CITED

Let {pi}ij be the probability that a mutation causes an allele size change from i to j. For a stable population, which is assumed throughout this article, a SMM is completely specified by the distribution {pi}ij and the mutation parameter {theta} = 4Nµ, where N is the effective population size and µ is the mutation rate per allele per generation. Following the introduction of the SMM by OHTA and KIMURA 1973 Down, most of the subsequent studies in the 1970s were based on single- or two-step SMMs (e.g., MORAN 1975 Down; LI 1976 Down; WEIR et al. 1976 Down; CHAKRABORTY and NEI 1977). In particular, MORAN 1975 Down showed that under a single-step mutation model, allelic frequencies do not reach a steady distribution. Consequently, later studies of SMMs have focused on various moments of allele frequencies, e.g., the variance of allele sizes that have steady-state distributions (WEIR et al. 1976 Down; CHAKRABORTY and NEI 1982 Down). This tradition appears to continue in the recent surge of interest in the SMM (SHRIVER et al. 1993 Down; VALDES et al. 1993 Down; DI RIENZO et al. 1994 Down; KIMMEL et al. 1996 Down).

Although most studies on SMMs assume either a single- or two-step SMM, there are as many SMMs as different distributions for {pi}ij. One problem is that many SMMs can result in patterns of polymorphism that are practically indistinguishable. As a result, the choice of the distribution {pi}ij is not trivial. Distributions that are sufficiently flexible and depend on few parameters, each having clear biological meaning, should be preferred. Although the method developed in this article for estimating parameters of a SMM applies to any distribution for {pi}ij, we shall consider a distribution {pi}ij that is homogeneous for a different value of i, partly because such a model has fewer parameters and partly because only the relative sizes of alleles in the samples analyzed later are known. We shall consider the following distribution:

(1)
where 0 <= {alpha} <= 1 and 0 < P < 1. We note that because {Sigma}j>i{pi}ij = {alpha}, the probability of a mutation increasing allele size is {alpha}, and the probability of a mutation decreasing allele size is 1 - {alpha}. In particular, {alpha} = 1 implies that mutations always increase allele size, {alpha} = 0 implies that mutations always decrease allele size, and {alpha} = 0.5 implies that there is equal chance of increasing and decreasing allele size.

It follows from the distribution (1) that given the direction of size change (increasing or decreasing) the size of a change is given by geometric distribution (1 - P)P|i-j|-1. A small value of P implies that the size of a change is likely small and a large value of P means that the size of a change can be large. An example of {pi}ij is given in Figure 1. Note that the geometric distribution (1-P)P|i-j|-1 does not have maximum size (step) for a size change. Any size of change is possible at least theoretically. Although we can consider a truncated geometric distribution that imposes a maximum size of change, doing so will introduce another parameter. We note that, although any size change is possible under a geometric distribution, the probabilities for most changes of large size are small and therefore negligible in practice. For this reason, it is simpler to consider an effective number of steps rather than to impose an absolute maximum step. We define the effective number of steps of a SMM to be the smallest integer s such that when {pi}ij < {epsilon}, |i - j| > s for some threshold value {epsilon}, we shall use {epsilon} = 10-3. For our model, s increases with P and is the largest integer that is not larger than

Figure 2 plots the effective number of steps against the value of P.



View larger version (17K):
In this window
In a new window
Download PPT slide
 
Figure 1. Numerical examples of geometric distribution (1 - P)Pi-1.



View larger version (11K):
In this window
In a new window
Download PPT slide
 
Figure 2. The effective number of steps of a SMM. Solid and dashed lines correspond to {alpha} = 0.5 and 0.9, respectively.

Given that a locus evolves according to the SMM described above, the values of three parameters {theta}, {alpha}, and P then determine the values of various moments of allele frequencies at equilibrium. Therefore, in general, moments computed from a sample can be used to estimate these parameters. Because moments are computed from allele frequencies of a sample, allele frequencies thus contain more information about the parameters than a set of moments do, and consequently the accuracy of estimation directly based on allele frequencies should be higher than that based on a small set of moments. However, MORAN 1975 Down showed that for any set of initial allele frequencies, the frequency of any allele does not have a steady distribution after a sufficient number of generations. Also, because the allele frequencies of a population many generations ago are generally unknown, MORAN's analysis appears to suggest that it would be futile to make an inference about the parameters of a SMM based directly on allele frequencies of a sample. A close examination of what determines the allele frequencies in a sample will change this perception.

The sequences (chromosomes) of a sample of size n can be traced back to their most recent common ancestor (MRCA), who lived on average 4N(1 - 1/n) generations ago. It is obvious from a coalescent point of view that the distribution of allele frequencies in a sample is entirely determined by the allele A possessed by the MRCA and the three parameters {theta}, {alpha}, and P of SMM. The allele frequencies of a population in the more distant past are irrelevant to the allele frequencies in a sample once the ancestral allele A is known or specified. This observation implies that inference on parameters {theta}, {alpha}, and P based on allele frequencies in a sample can be made once A is specified.

In addition to the importance of the ancestral allele A in determining the distribution of allele frequencies in a sample, the value of A for a sample is itself of great interest in studying the evolution of the locus from which alleles were sampled. It is thus desirable to be able to infer the value of A as well as the values of {theta}, {alpha}, and P from a sample. Such an inference would be difficult if it were based on a set of moments only. In this article, we treat A as a parameter whose value is to be estimated from a sample. For convenience, we collect these four parameters in a vector

(2)
In the next section, we describe how {Gamma} can be estimated from a sample.


*  MINIMUM CHI-SQUARE ESTIMATOR OF {Gamma}
*TOP
*ABSTRACT
*STEPWISE MUTATION MODELS
*MINIMUM CHI-SQUARE ESTIMATOR OF...
*APPLICATIONS
*DISCUSSION
*LITERATURE CITED

Let fi, i = 1, ... be the number of alleles of size i in a sample of n chromosomes. Our aim is to derive an estimate of {Gamma} from {fi}.

There are two widely used approaches for estimating parameters: the maximum likelihood method and the least-squares method. Often both methods result in similar estimates and share many properties that are desirable. The maximum likelihood method computes the probabilities of allele frequencies {fi} given the value of {Gamma} . The computation is difficult and time-consuming, but the likelihood approach has the advantage of being able to test hypotheses. On the other hand, least-squares-based methods compute the means and perhaps the variances and covariances of allele frequencies, which are much easier to compute than probabilities. Therefore, least-squares-based methods are generally easier to use in practice and are particularly appealing when a large number of samples need to be analyzed. The major disadvantage is that they are difficult to extend for hypothesis testing.

Let ei({Gamma} ), i = 1, ... be the expected number of alleles of size i conditional on the value of {Gamma} , i.e., ei({Gamma} ) = E(fi|{Gamma} ). Our strategy is to find the value of {Gamma} that minimizes the quantity

(3)
where wi({Gamma} ) is a function of {Gamma} , and g > 0. When wi = 1 and g = 2, the value of {Gamma} that minimizes L is the well-known least-squares estimator. However, since ei({Gamma} ) is not linear with the parameters in {Gamma} , the least-squares estimator is not necessarily a good choice. For a discrete distribution, a generalized least-squares method, often referred to as the MCS estimator, is usually a better one. The MCS estimator corresponds to g = 2 and wi = e-1i({Gamma} ). Therefore, the MCS estimate of {Gamma} is the value of {Gamma} that minimizes the function

(4)
It is well known that the MCS estimator has similar asymptotic properties to the maximum likelihood estimator when samples are from a multinominal distribution, with each cell's probability being a function of the parameters to be estimated (e.g., STUART and ORD 1991 Down). At the time when alleles were sampled from a population, there was a specific number of alleles of each size, and therefore the sample was indeed from a multinomial distribution. However, the probability of the number of alleles of size i is not a deterministic function of {Gamma} , but a random variable whose distribution depends on {Gamma} . Nevertheless, we expect the MCS estimator to be a reasonably good estimator of {Gamma} . Note that the MCS estimator was used in WEIR et al. 1976 Down for estimating parameters in a two-step SMM from moments of allele frequencies.

Although some optimization procedures with constraints can be adapted to search for the MCS estimate, a simple grid search is sufficient here because there are only four parameters. What complicates this seemingly straightforward procedure is that a formula or an even numerical solution for ei({Gamma} ) is not available at present. Therefore, their values have to be estimated from simulated samples. Estimating ei({Gamma} ) for a large number of combinations of parameter values can be very time consuming even when samples are generated by a fast algorithm from coalescent theory (KINGMAN 1982A Down, KINGMAN 1982B Down; see HUDSON 1991 Down for a recent review). When there are only a few samples to be analyzed, a two-steps grid search approach can be used. The first step is to carry out a fast full-grid search to identify the vicinity of the MCS estimate. To achieve a fast full-grid search, only a modest number of Monte Carlo samples is used to estimate ei({Gamma} ) for each {Gamma} . The second step is to carry out a fine-scale grid search in the small area identified by the first step. In the fine-scale search, a relatively large number of Monte Carlo samples is used to obtain more accurate estimates of ei({Gamma} ), and thus more accurate MCS estimates. When there are many samples to be analyzed, an alternative approach is to create a database of {ei} for all reasonable combinations of parameter values, and estimation for each sample will retrieve the values of {ei} from the database. This latter approach is the one we used in our analyses of 63 samples of dinucleotide repeats.

Let us consider how ei({Gamma} ) can be estimated. Suppose we have M simulated samples of size n given the value of {Gamma} , and suppose the number of alleles of size i in the jth sample is nij. Then we can use the mean êi of nij as an estimate of ei({Gamma} ). That is,

(5)
The estimator is unbiased with variance var(êi) = var(ni)/M, where ni is the number of alleles of size i in a single random sample of n chromosomes. It follows that estimation accuracy increases with the number M of simulated samples. Therefore, the ability to simulate a large number of samples in a reasonable amount of time is critical for the MCS estimator to be practical. Coalescent algorithm is ideal for this purpose, and a sample of n alleles from a locus from a population evolving according to a SMM with value {Gamma} = ({theta}, P, {alpha}, A) can be simulated as follows:

First, a genealogy of n sequences (alleles) is generated using a coalescent algorithm (e.g., HUDSON 1991 Down). For the simulated genealogy, we have not only the topological relationships of these sequences, but also the number of mutations that occurred on each branch of the genealogy. Simulation of such a genealogy requires only the value of {theta}. Second, assign a value to A, which is by definition the allele at the root of the genealogy; then determine the resulting allele of each mutation that occurred on the genealogy. Obviously the exercise requires knowing the allele type before a mutation and can be accomplished by starting from the root and progressing toward the tips of the genealogy. In this process, the type of a new mutant allele is simulated according to the distribution {{pi}i}, which is completely specified by the values of P and {alpha}.

Table 1 shows examples of the values of êi and their standard errors for different numbers of Monte Carlo samples. Note that the majority of estimates are reasonably accurate even when M = 500. These results as well as many other simulation experiments we performed suggest that for the purpose of identifying the vicinity of the MCS estimate, 500–1000 Monte Carlo samples is usually sufficient, and 10,000 Monte Carlo samples is adequate for a fine-scale grid search to obtain the final MCS estimate.


 
View this table:
In this window
In a new window

 
Table 1. Standard errors in estimating ei in a sample of 100 chromosomes

We define ti(n) = ei/n as the proportion of alleles being size i and measure the closeness between {ti(n)} for two different sample sizes by Euclidean distance, i.e.,

where e'i is the expected number of alleles i in a sample of size n'. An interesting observation that we made in our simulations is that values of {ti(n)} become steady rapidly even for a modest sample size. Table 2 shows several examples of Euclidean distances between {ti(n)} of different sample sizes. Note that the difference between them when sample size is larger than 100 is extremely small. This observation, although not too surprising from the viewpoint of coalescent process, is a reward for our MCS method, because it means that one can convert with excellent accuracy the estimate êi for a sample of size n to the corresponding estimate for a sample of size m by i/n, saving considerable computer cpu time when many samples are to be analyzed.


 
View this table:
In this window
In a new window

 
Table 2. Estimated Euclidean distance between {ei/n} of different sample sizes


*  APPLICATIONS
*TOP
*ABSTRACT
*STEPWISE MUTATION MODELS
*MINIMUM CHI-SQUARE ESTIMATOR OF...
*APPLICATIONS
*DISCUSSION
*LITERATURE CITED

Since the discovery of highly polymorphic CA-repeat loci (LITT and LUTY 1989 Down; WEBER and MAY 1989 Down), many samples of such loci from human populations have been reported (e.g., KAMINO et al. 1993 Down; BOWCOCK et al. 1994 Down; DI RIENZO et al. 1994 Down; DEKA et al. 1995 Down). The samples of DEKA et al. 1995 Down are particularly useful for population study because their sample sizes are relatively large and the populations sampled are anthropologically well defined. We thus use their data to illustrate our method and to examine several issues about the evolution of microsatellite loci. Eight CA-repeat loci in nine different populations, the Samoan (SA), Dogrib Indian (DG), Pehuenche Indians (PH), New Guineans (NG), Kachari (KA), German (GR), CEPH (CP), Sokoto (SO), and Chimpanzee (CH), were reported in DEKA et al. 1995 Down, but we shall exclude the locus D13S137 from our analysis because there are many single nucleotide insertions/deletions within the repeat motifs at the locus. For the remaining seven loci, we also exclude all the alleles that are results of single nucleotide insertion/deletion because the mechanisms for changing allele size and for insertion/deletion are likely different. The allele sizes and their frequencies at these loci are given in the appendix of DEKA et al. 1995 Down. Figure 3 shows the allele frequencies at the loci FLT1 and D13S122. Because polymerase chain reaction (PCR) was used for amplification and the distances between the upstream primer sequences and the first CA-repeat were unknown for these loci, the resulting alleles were given in terms of sequence length from the primer sequences, instead of copy numbers. This does not pose difficulty in our analysis because the mutation model (1) is only dependent on the relative number |i - j| of copy numbers, which are available from the samples. However, the estimated ancestral allele A at each locus will have to be given in terms of sequence length from the primer sequence.



View larger version (27K):
In this window
In a new window
Download PPT slide
 
Figure 3. Allele frequencies at loci FLT1 and D13S122.

Our task of estimation appears to be challenging at first glance because there are 63 samples to be analyzed, each requiring a considerable amount of computer cpu time. The observation we made earlier that the proportion of alleles of a given size is rather insensitive to sample size is of great help. Instead of generating a large number of samples for estimating ei for each of the 63 samples, we choose to obtain good estimates of ei/n for a sample of 100 chromosomes only and rescale them to obtain ei for samples of different sizes. We selected {theta} = 0.2(0.2)10 (i.e., 0.2, 0.4, ..., 9.8, 10), P = 0.01(0.01)0.10(0.05)0.9, and {alpha} = 0.5(0.5)1.0, a total of 14,500 combinations of values of the three parameters. For each set of parameters, we generated 20,000 independent samples and obtained estimates of ei/n from (5). Note that we do not need to estimate ei/n for {alpha} < 0.5 separately, because they can be obtained from those for {alpha} > 0.5 because of symmetry. Therefore, we effectively obtained estimates of ei/n for 29,000 sets of parameters. These estimates were stored in a database and can be retrieved easily.

For each of the 63 samples, we performed a grid search over all the parameter sets to obtain a MCS estimate of {Gamma} . In other words, for each of 29,000 parameter sets, we computed the {chi}2 value, updated the minimum {chi}2 value and corresponding parameter values, and obtained the MCS estimate after all the parameter sets had been examined. This fine-scale grid search still requires nontrivial computer cpu time but is quite manageable. The estimation results are given in Table 3.


 
View this table:
In this window
In a new window

 
Table 3. Estimated values (x100) of parameters {theta}, P, {alpha}, and A

Note that 26 of the 63 samples show contraction ({alpha} < 0.5) in allele sizes and 37 show expansion ({alpha} > 0.5) in allele sizes (Table 3). This suggests that there is a slight bias of mutations toward expansion. Table 3 also suggests that most of the loci evolve in a multi-stepwise fashion. We divided samples into two groups, one showing allele size contraction and the other expansion; we then found that P values of the latter group are considerably larger than those of the former group, which means that size change during expansion is likely greater than that during contraction. This result implies that if the probability {alpha} of expanding allele size by a mutation is the same as or even slightly smaller than the probability of contracting allele size, the alleles in a population will still tend to increase in size. Because Table 3 shows that {alpha} in the majority of samples is larger than 1/2, there is a tendency, stronger than that suggested by {alpha} alone, that most loci are expanding in allele size.

RUBINSZTEIN et al. 1995 Down compared allele sizes of 42 microsatellite loci in several primate species and found that alleles in humans are generally longer than in other primates. They argued that microsatellite loci can evolve directionally and at different rates in closely related species. Our estimates of the ancestral allele sizes in human populations and in chimpanzees also show a slight bias toward longer alleles in humans than in chimpanzees, because the ancestral alleles of humans in five (FLT1, D13S118, D13S71, D13S122, and D13S124) of the seven microsatellite loci are longer than those of chimpanzees. However, some of these differences may be due to ascertainment bias, and analyses of more loci are needed to resolve this issue.

WEBER and WONG 1993 Down studied 28 microsatellite loci in human chromosome 19 and a total of 20,000 parent-offspring allele pairs. They found that 78% of the 24 size changes in vivo were either gain or loss of single repeat unit, and gain or loss of more than three repeat units was not observed. When all the mutations in vivo and in vitro are considered, there is a strong tendency toward gains over losses in repeat units. Our analysis in general agrees with their observations. In vivo mutations as observed by WEBER and WONG 1993 Down do not appear to suggest large P values. However, they do not necessarily contradict our estimates: first, because the number of mutations found in each locus examined in their study is simply too small to yield a reliable estimate of P for that locus; and second, because it is not unreasonable to suggest that a similar mutational bias may be occurring both in vitro and in vivo, and when both in vitro and in vivo mutations found in their study are considered, the mutation pattern would agree well with larger P values.

Let {theta}ij be the value of {theta} for the ith population at the jth locus. If these loci are selectively neutral, then it would be reasonable to assume that the mutation rate at each locus is the same for different populations. Therefore, under the neutrality assumption, {theta}ij/{theta}ij' = Nj/Nj', where Nj and Nj' are the effective population sizes of the jth and j'th populations, respectively. That is, the ratio of {theta}s of two different populations is independent of the locus studied. If estimates of {theta} are accurate, then we should expect to see a consistent value of ratio of {theta}s over different loci. The estimated {theta}s in Table 3 show that this ratio for most pairs of populations is not very consistent over loci. This suggests that the variance in {theta} estimate is likely to be large. The estimated {theta}s also vary considerably among different populations for each locus, but this is expected because of different effective population sizes. A large variance in the estimation of {theta} is not unexpected because the estimate of {theta}, [for example, by WATTERSON's (1975) estimator], from segregating sites of DNA sequences that contain more information about {theta} than microsatellite data, is also accompanied by a relatively large variance. Furthermore, KIMMEL and CHAKRABORTY (1996) showed that the variance of a variance estimator of {theta} does not diminish with sample size.

When we draw conclusions based on estimated values of parameters, which are associated with variances, it is important to have some measures of accuracy in the estimates. Often the variance of an estimate of a single parameter is hard enough to compute, and measuring the accuracy of simultaneous estimation of all four parameters in a SMM model is undoubtedly more difficult, but nevertheless of great importance for recommending a method. This is an area in which answers demand substantially more computer resource than estimation, and it deserves much effort in the future. Therefore, we do not intend to provide all the answers here. Instead, we will focus on discussing the accuracy in the estimation of ancestral allele A. Although estimates of the four parameters are interrelated, we observed that values of the three parameters p, {alpha}, and {theta} that result in {chi}2 that is close to the minimum are also close to the set that results in minimum {chi}2, while MCS estimates conditional on different ancestral alleles can differ substantially. This is not difficult to understand. For example, specifying a small ancestral allele size (relative to the sizes of alleles in a sample) will require large {alpha} and P (or {theta}) to explain the observed allele frequencies, while, on the other hand, specifying a large ancestral allele size will result in a small estimated value for {alpha}. Our experience suggests that accuracy in the estimation of an ancestral allele is a good indicator of the overall accuracy of estimation.

To measure the accuracy of the estimate of A, we compare the minimum {chi}2 values conditional on different ancestral allele sizes. One way to facilitate the comparison is a plot of {chi}2 values vs. various ancestral allele sizes. A sharp decrease in the overall minimum value of {chi}2 at allele A should suggest high accuracy of estimation. To allow comparison of the estimates for different samples, we use the ratio of conditional minimum {chi}2 to the overall minimum {chi}2. Figure 3 shows two examples. Part a suggests that the estimate of A for the FLT1 locus from population DG is accurate but the estimate from population CH is uncertain. The allele frequencies in Figure 3 concur with this analysis, because the frequency of estimated A = 168 in the DG sample is extremely high, while the frequency of estimated A = 176 in the CH sample is only intermediate, although it is the highest. Figure 4B also shows a similar pattern for locus D13S122. It is interesting to note that estimated A = 105 for the SA sample is not the most frequent allele in that sample. Table 3 shows that the {theta} estimate from this sample is suspiciously large. Because Figure 3B shows that the estimate of A for this sample is rather uncertain, we expect that quite different sets of P, {alpha}, and {theta} can result in {chi}2 values close to the minimum. Indeed, when ancestral allele A is set to 85 the conditional minimum {chi}2 value is equal to 154.59, which is less than 1% larger than the minimum {chi}2 value, and the corresponding estimates of P, {alpha}, and {theta} are 0.3, 1.0, and 4.2, respectively.



View larger version (14K):
In this window
In a new window
Download PPT slide
 
Figure 4. Examples of the ratio of {chi}2 over the minimum {chi}2.

When the ancestral allele is reasonably certain, it makes sense to examine closely the estimation of other parameters. Take the locus D13S122, for example; allele size 95 appears to be the ancestral allele for the PH and GR samples (see Figure 3 and Table 4). Under the condition A = 95, we examined the minimum {chi}2 values for different values of {alpha}, and results are given in Figure 5. For the PH sample, Figure 5 shows that it is very unlikely that {alpha} < 0.5 while for the GR sample, {alpha} should not be substantially different from 0.5. These conclusions are reinforced by the allele frequencies in the two samples that are shown in Figure 3.



View larger version (12K):
In this window
In a new window
Download PPT slide
 
Figure 5. Minimum {chi}2 for different values of {alpha} for PH and GR samples at locus D13S122, conditional on ancestral allele size being 95.


 
View this table:
In this window
In a new window

 
Table 4. Some measures of quality in the estimates of ancestral allelic type

Another measure R2 of accuracy is the ratio of the {chi}2 value of the second best estimate of A to that of the best. The larger the ratio, the worse the fit for the second best A, and thus the better the estimate of A. Another useful measure Rm is the ratio of the mean {chi}2 values of the two neighboring sizes of A to that of A, which measures the goodness of the A compared to its neighbors. Obviously we always have R2 <= Rm. These two measures are particularly convenient when there are many samples to analyze, as in our situation. The values of R2 and Rm, as well as the minimum {chi}2 value for the 63 samples, are given in Table 4.

Table 4 shows that 43 of the 63 samples result in R2 > 1.20, and 30 result in R2 > 1.5. Although further study is required for proper interpretation of these R2 values, it appears that other than the MCS estimate an increase of 20% or more in the {chi}2 value for an ancestral allele should be a reasonable indication that the estimation is not totally out of line.


*  DISCUSSION
*TOP
*ABSTRACT
*STEPWISE MUTATION MODELS
*MINIMUM CHI-SQUARE ESTIMATOR OF...
*APPLICATIONS
*DISCUSSION
*LITERATURE CITED

A strength of the MCS estimator developed in this article is its ability to simultaneously estimate all the parameters of the SMM, including the ancestral allele, making better use of available information in a sample. To date, mutation mechanisms for minisatellite and microsatellite loci are not yet fully understood, and even less is known about the mode of mutations, i.e., whether it is symmetric or nonsymmetric, single-step or multi-step. Although KIMMEL et al. 1996 Down and KIMMEL and CHAKRABORTY (1996) emphasized that allelic size variance-based estimates of intra- and interpopulation variation at repeat loci are not affected by asymmetry of allele size changes by mutations, their analyses reflect as well the concept that knowledge of the distribution of size change by mutation is critical. Furthermore, the modes of mutations are likely to differ from loci to loci. Therefore, being able to estimate simultaneously all the parameters of a SMM has considerable advantages over methods for estimating a single parameter that assume a mode of mutations, such as a symmetric single-step mutation model, which may be grossly incorrect.

Another strength of our method is its flexibility. Although we only considered a mutation model with three parameters and assumed a constant population size, it is not difficult to see that any combination of mutation model and population genetics model can be analyzed in a similar manner as long as alleles under these models can be simulated. In particular, one can consider mutation models with constraints on allele size or mutation rates dependent on allele size and more complex population genetic models, such as growth populations or subdivided populations. This strength should not be overlooked for two reasons. First, rapid accumulation of population samples from microsatellite and minisatellite loci provides excellent opportunities to examine various mutation models, and, second, many natural populations, particularly human populations, are not panmictic. Proper statistical inferences should be based on more realistic population models, allowing for population growth and subdivision. The expectation of alleles of a given size as estimated by Monte Carlo simulation provides great flexibility of the method, although it has the drawback of requiring more computer cpu time. Note that methods for parameter estimation that rely on Monte Carlo samples to obtain some necessary quantities were used in FU 1994 Down.

The MCS estimator we developed is a generalized least squares estimator and is often used in statistics for discrete distributions. Therefore, we expect our estimator to have many desirable properties. Even though the procedure of estimation takes advantage of a fast coalescent algorithm, it is still a time-consuming method, which makes it hard to investigate the statistical properties of the estimator. Nevertheless, the statistical properties of the estimator will be worth studying in the future.

There are a number of potential extensions to the method we proposed. We chose to obtain parameter estimates from allele frequencies, but the same approach can be applied to a set of summary statistics, including various moments, number of alleles, heterozygosity, etc. It is also possible to incorporate variances and covariances of allele frequencies into an estimator, although doing so will demand even more computer resource. Another potential extension is to use the {chi}2 statistics for testing hypotheses, such as the hypothesis that mutations follow a single-step SMM, but hypothesis testing is likely to be more effective using likelihood-based approaches such as the method by NIELSON (1997).


*  ACKNOWLEDGMENTS

This work was supported in part by National Institutes of Health grant R29 GM-50428 (Y.-X.F.) and GM58545 (R.C.).

Manuscript received March 13, 1998; Accepted for publication May 26, 1998.


*  LITERATURE CITED
*TOP
*ABSTRACT
*STEPWISE MUTATION MODELS
*MINIMUM CHI-SQUARE ESTIMATOR OF...
*APPLICATIONS
*DISCUSSION
*LITERATURE CITED

BOWCOCK, A. M., A. RUIZ-LINARES, J. TOMFOHRDE, E. MINCH, and J. R. KIDD et al., 1994  High resolution of human evolutionary trees with polymorphic microsatellites. Nature 368:455-457[Medline].

CHAKRABORTY, R. and M. NEI, 1982  Genetic differentiation of quantitative characters between populations or species. Genet. Res. 39:303-314.

CHAKRABORTY, R. and M. NEI, 1997  Bottleneck effects on average heterozygosity and genetic distance with the stepwise mutation model. Evolution 31:347-356.

DEKA, R., L. JIN, M. D. SHRIVER, L. M. YU, S. DECROO, and J. HUNDRIESER et al., 1995  Population genetics of dinucleotide (dC-dA)n·(dG-dT)n polymorphism in world populations. Am. J. Hum. Genet. 56:461-474[Medline].

DI RIENZO, A., A. C. PETERSON, J. C. GARZA, A. M. VALDES, and M. STATKIN et al., 1994  Mutational processes of simple-sequence repeat loci in human populations. Proc. Natl. Acad. Sci. USA 91:3166-3170[Abstract/Free Full Text].

EWENS, W. J., 1972  The sampling theory of selectively neutral alleles. Theor. Pop. Biol. 3:87-112[Medline].

FU, Y. X., 1994  Estimating effective population size or mutation rate using the frequencies of mutations of various classes in a sample of DNA sequences. Genetics 138:1375-1386[Abstract].

HUDSON, R. R., 1991 Gene genealogies and the coalescent process, pp. 1–44 in Oxford Surveys in Evolutionary Biology, Vol. 7, edited by D. FUTUYMA and J. ANTONOVICS. Oxford University Press, New York.

JEFFREYS, A. J., V. WILSON, and S. L. THEIN, 1985  Hypervariable "minisatellite" regions in human DNA. Nature 314:67-73[Medline].

KAMINO, K., J. NAKURA, K. KIHARA, L. YE, and K. NAGANO et al., 1993  Population variation in dinucleotide repeat polymorphism at the D8S360 locus. Hum. Mol. Genet. 2:1751[Free Full Text].

KIMMELL, M. and R. CHAKRABORTY, 1996  Measure of variation at DNA repeat loci under a general stepwise mutation model. Theor. Popul. Biol. 50:345-367[Medline].

KIMMEL, M., R. CHAKRABORTY, D. N. STIVERS, and R. DEKA, 1996  Dynamics of repeat polymorphisms under a forward-backward mutation model: within- and between-population variability at microsatellite loci. Genetics 143:549-555[Abstract].

KINGMAN, J. F. C., 1982a  The coalescent. Stochastic Process. Appl. 13:235-248.

KINGMAN, J. F. C., 1982b  On the genealogy of large populations. J. Appl. Probab. 19A:27-43.

LI, W. H., 1976  Distribution of nucleotide differences between two randomly chosen cistrons in a subdivided population: the finite island model. Theor. Popul. Biol. 10:303-308[Medline].

LITT, M. and J. A. LUTY, 1989  A hypervariable microsatellite revealed by in vitro amplification of a dinucleotide repeat within the cardiac muscle actin gene. Am. J. Hum. Genet. 44:397-401[Medline].

MORAN, P. A. P., 1975  Wandering distributions and the electrophoretic profile. Theor. Popul. Biol. 8:318-330[Medline].

NIELSEN, R., 1997  A likelihood approach to populations samples of microsatellite alleles. Genetics 146:711-716[Abstract].

OHTA, T. and M. KIMURA, 1973  A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population. Genet. Res. 22:201-204[Medline].

RUBINSZTEIN, D. C., W. AMOS, J. LEGGO, S. GOODBURN, and S. JAIN et al., 1995  Microsatellite evolution—evidence for directionality and variation in rate between species. Nature Genet. 10:337-343[Medline].

SHRIVER, M. D., L. JIN, R. CHAKRABORTY, and E. BOERWINKLE, 1993  VNTR allele frequency distribution under the stepwise mutation model: a computer simulation approach. Genetics 134:983-993[Abstract].

STUART, A., and J. K. ORD, 1991 Kendall's Advanced Theory of Statistics, Vol. II, Ed. 6. Oxford University Press, New York.

TAUTZ, D., 1993 Notes on the definition and nomenclature of tandemly repetitive DNA sequence, pp. 21–28 in DNA Fingerprinting: State of the Science, edited by S. D. J. PENA, R. CHAKRABORTY, J. T. EPPEN and A. J. JEFFREYS.

VALDES, A. M., M. SLATKIN, and N. B. FREIMER, 1993  Allele frequencies at microsatellite loci: the stepwise mutation model revisited. Genetics 133:737-749[Abstract].

WATTERSON, G. A., 1975  On the number of segregation sites. Theor. Popul. Biol. 7:256-276[Medline].

WEBER, J. L. and P. E. MAY, 1989  Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am. J. Hum. Genet. 44:388-396[Medline].

WEBER, J. L. and C. WONG, 1993  Mutation of human short tandem repeats. Hum. Mol. Genet. 2:1123-1128[Abstract/Free Full Text].

WEIR, B. S., A. H. D. BROWN, and D. R. MARSHALL, 1976  Testing for selective neutrality of electrophoretically detectable protein polymorphisms. Genetics 84:639-659[Abstract/Free Full Text].