- Hongyan Xu and
- Yun-Xin Fu1
- Human Genetics Center, University of Texas, Houston, Texas 77030
- 1
Corresponding author: Human Genetics Center, School of Public Health, University of Texas, 1200 Herman Pressler, Houston, TX 77030. E-mail: yunxin.fu{at}uth.tmc.edu
Abstract
Microsatellites are short tandem repeats that are widely dispersed among eukaryotic genomes. Many of them are highly polymorphic; they have been used widely in genetic studies. Statistical properties of all measures of genetic variation at microsatellites critically depend upon the composite parameter θ = 4Nμ, where N is the effective population size and μ is mutation rate per locus per generation. Since mutation leads to expansion or contraction of a repeat number in a stepwise fashion, the stepwise mutation model has been widely used to study the dynamics of these loci. We developed an estimator of θ, θ^F , on the basis of sample homozygosity under the single-step stepwise mutation model. The estimator is unbiased and is much more efficient than the variance-based estimator under the single-step stepwise mutation model. It also has smaller bias and mean square error (MSE) than the variance-based estimator when the mutation follows the multistep generalized stepwise mutation model. Compared with the maximum-likelihood estimator θ^L by Nielsen (1997), θ^F has less bias and smaller MSE in general. θ^L has a slight advantage when θ is small, but in such a situation the bias in θ^L may be more of a concern.
MICROSATELLITE loci, also known as short tandem repeats, are tandem repeat loci with repeat motifs of two to six nucleotides in length (Tautz 1993). Microsatellites are highly informative as polymorphic markers. Variations at microsatellite loci have been used to study the history and genetic structure of individual populations, such as DNA fingerprinting, paternity and relatedness testing, reconstruction of evolutionary trees, and genetic distance. In addition, they are useful for inferring migration histories, for identifying individuals of unknown origin, and for detecting the hidden population substructure. Microsatellites are also widely used in linkage mapping.
Statistical properties of all measures of genetic variation critically depend upon the composite parameter θ= 4Nμ, where N is the effective population size and μ is the mutation rate per locus per generation. An accurate estimate of θ will greatly facilitate the inference on the basis of variation at microsatellite loci. While the variation at microsatellites is extremely useful, little has been done to estimate θ using microsatellite data. This is partly due to the unknown mutation mechanism at such loci. Microsatellite loci are hypervariable and the mechanisms that produce new variation at such loci are unusual in comparison with those of classical loci. While the exact mechanism of mutations at such loci is still not well characterized at a molecular level (Jeffreyset al. 1994), it is generally believed that the processes and the patterns of mutations at different loci may differ from locus to locus, depending on the motif as well as the size of alleles at each locus. Empirical and theoretical studies indicate that for most microsatellite loci, mutations lead to stepwise changes of the repeat size of alleles although the rate of mutation leading to expansion may not be equal to that of contraction of allele size (Chakrabortyet al. 1997; Dekaet al. 1999). The stepwise mutation model, originally proposed for the study of protein charge changes (Ohta and Kimura 1973), in a more generalized form may be more suitable for the study of most microsatellite loci (Kimmelet al. 1996).
Although a number of estimators of θ (Wehrhahn 1975; Nielsen 1997; Fu and Chakraborty 1998) use microsatellite data, each has its limitations, in part being either too complicated or too simple. There is need for a relatively simple yet robust estimator and the purpose of this article is to develop one such estimator of θ using microsatellite data. Here we assume the neutral Wright-Fisher model without population substructure. The estimation of θ becomes the estimation of effective population size, N, when the mutation rate, μ, is known or the estimation of mutation rate, μ, when the effective population size, N, is known.
METHODS AND RESULTS
Existing estimators: Assuming the single-step stepwise mutation model, in which each mutation produces either one-step contraction or expansion in allele size, for a population without substructure and a neutral locus, the variance in allele size from a sample, V_{s}, has a mean equal to θ/2 (Wehrhahn 1975). Then a convenient unbiased moment estimator is given by
θ^V=2Vs.
(1)
The estimator θ^V simplicity is a large variance. The variance of allele size variance, V_{s}, was given by Zhivotovsky and Feldman (1995) as
Var(Vs)=112θ+13θ2.
(2)
Consequently, the variance of θ^V is given by
Var(θ^V)=13θ+43θ2.
(3)
Several examples of the value of θ^V are shown in Table 1. In general the standard deviation is >θ.
TABLE 1
Large sample variance of estimator
An even better known quantity is heterozygosity, denoted as H and defined as the probability that two randomly chosen sequences are of different allelic type; it is a measure of genetic variation at a microsatellite locus. The complement of heterozygosity, F = 1 – H, is called homozygosity. Since F contains the information of both number of alleles and allele frequency, an estimator based on F may be a possible solution.
Under the single-step stepwise mutation model, for a population without substructure and a neutral locus, the expected homozygosity (Ohta and Kimura 1973) is given by
E(F)=11+2θ.
(4)
Supposing a sample is taken from a population and letting k be the number of alleles in the sample, the homozygosity F can be estimated by
F^=∑i=1kpi2,
(5)
where p_{i} is the allele frequency of the ith allele in the sample. Then a moment estimator of θ can be derived from Equation 4, replacing F with F̂:
θ∼F=12(1F^2−1).
(6)
Since the transformation is not linear, the estimator θ∼F is usually biased, particularly when θ is large. Simple correction based on the infinite allele model was proposed before (Zouros 1979; Chakraborty and Weiss 1991), which is based on an analytical relationship between the expected value of the θ estimator and real value of θ. Unfortunately, such an analytical formula is not yet known for genetic loci evolving under the step-wise mutation model.
Besides the two estimators of θ using microsatellite data, Nielsen (1997) proposed an estimator using the maximum-likelihood approach. In addition to being restricted to the single-step stepwise mutation model, the estimator is rather demanding computationally and can handle only modest sample size. Fu and Chakraborty (1998) proposed an approach to simultaneously estimate all the parameters in a generalized stepwise mutation model, including θ. They use a minimum chi-square method to perform a grid search of all the possible values in the multidimensional parameter space, which makes it a challenge to analyze a large amount of data. To date, many population studies using microsatellites involve larger and larger samples and multiple loci. A relatively simple yet efficient estimator is highly desirable. In many ways, such an estimator can serve a role similar to that of Watterson's or Tajima's estimator of θ for DNA sequence data, despite the fact that several sophisticated estimators of θ for DNA sequence data have been available.
New estimator: The approach we take uses a combination of computer simulation and statistical regression, trying to find the relationship between the expectation of θ∼F and the real value of θ. On the basis of the relationship, we try to develop a new unbiased estimator of θ. Computer simulation is an efficient way to study the properties of the homozygosity-based estimator θ∼F . For each combination of θ value and sample size, n, a large number of samples are simulated according to coalescent theory. For each sample, the homozygosity is estimated through Equation 5. Then the homozygosity-based estimate is obtained through Equation 6. Some of the results are shown in Figure 1, where each point in the figure is the mean of θ∼F over 50,000 simulated samples. Figure 1 shows that θ∼F on average overestimates θ. The magnitude of overestimation is a function of sample size n and θ, and, in many cases, the biases are severe.
To summarize the relationship among θ, n, and the mean of θ∼F , a regression approach can be used. The challenge is to find the simplest equation that is sufficiently accurate for describing the relationship. From Figure 1, it seems that mean of θ∼F is reversely related to sample size and positively proportional to θ. We include the terms 1/n and θ in the regression formula. We started to consider equations that incorporate 1/n and √θ in various ways. Choosing √θ as the basic unit was partly inspired by Equation 4. The most complex equation we consider is a polynomial including all combinations of 1/n, √θ, and (1/n)^{2}.
The regression analysis shows that two regression equations summarize remarkably well (R^{2} = 99.99%) the relationship of θ, n, and mean of θ∼F (see Figure 1). For θ≤ 10,
E(θ∼F)=(1.1313+3.4882n+28.2878n2)θ+0.3998θ.
(7)
For θ> 10,
E(θ∼F)=(1.1675+3.3232n+63.698n2)θ+0.2569θ.
(8)
The regression equations have two nice properties. First, when θ∼F=0 , we have θ= 0. Second, when sample size n → ∞, θ∼F has a limit value, which does not depend on n. Actually, when n > 200, the effect of sample size is very small.
On the basis of the above regression equations, we propose the following new estimator θ^F :
θ^F={θvalue satisfies Equation7withθ∼FreplacingE(θ∼F)ifθ∼F≤15θvalue satisfies Equation8withθ∼FreplacingE(θ∼F)otherwise.}
The threshold value 15 is based on the observation that 90% of the value of θ∼F is <15.0 with θ= 10. However, we found that the choice is not critical, because choosing 10 as the threshold value does not make much difference. This is because when θ is ∼10, Equations 7 and 8 give very similar results.
The performance of θ^F was investigated through simulation. For a given combination of θ and sample size n, 50,000 samples were simulated and for each sample θ∼F was estimated by Equation 6 and then corrected through Equation 7 or Equation 8. Some of the results are summarized in Table 2. Table 2 shows that the estimator θ^ is unbiased (or nearly so). The small bias is likely due to fluctuation in simulation and is insignificant compared to the variance.
Next we compare the performance of our estimator θ^F with that of the estimator based on allele size variance, θ^V . There are two ways to compute the variance of θ^V . The theoretical value of the large sample variance can be computed through Equation 3 and the variance can also be estimated through computer simulation. We computed it in both ways because on the one hand the validity of our simulation program can be checked and on the other hand the results can corroborate each other. The results are summarized in Table 3. Table 3 shows that the theoretical value of the variance of θ^V agrees well with the simulation value, which indicates that our simulation is accurate. More importantly, Table 3 shows that while both estimators are unbiased, our homozygosity-based estimator θ^F is better than the size variance-based estimator θ^V in that the variance of θ^ is smaller than that of θ^V . The relative efficiency of θ^F against θ^V , defined as the ratio of the variance of θ^V and variance of θ^F , is also given in Table 3. The relative efficiency increases as θ increases, which means that θ^F becomes more and more efficient with increasing θ value. Note that since microsatellite loci have a relatively high mutation rate, the θ value can easily be of the range of 10–100, which makes θ^F superior to θ^V for most microsatellite loci.
Comparison with the maximum-likelihood estimator: The performance of the homozygosity-based estimator θ^F is further compared to that of the maximum-likelihood (ML) estimator θ^L proposed by Nielsen (1997). Assuming the single-step stepwise mutation model, 10,000 samples are simulated for a number of combinations of θ and sample size. The two estimators, θ^F and θ^L , are computed for each simulated sample. The mean value and mean square error (MSE) for the corresponding estimates are then computed and the results are summarized in Table 4. Two conclusions are obvious from Table 4. First, the ML estimator θ^L is, in general, upwardly biased. Although the bias decreases with sample size, it is still appreciable even when the sample size is 300. In comparison, the mean value of the homozygosity-based estimator θ^F exhibits little bias, similar to the case of comparing θ^F and θ^V . Second, in general the ML estimator θ^L has a larger MSE than that of θ^F , except in the cases where θ is small and sample size is large. It is somehow surprising that as θ increases, the relative performance of θ^L , measured by MSE, gets worse compared to θ^F . Two possible causes might be that the ML estimator implemented by Nielsen may not be a true ML estimator and it is not efficient. Indeed, in Nielsen's algorithm, a k-allele model was used to approximate the stepwise mutation model (Nielsen 1997) in which the accuracy is not well known. Because of a high mutation rate for microsatellites, the θ value can be quite large
even for a modest population size. For example, many samples from human populations have yielded estimates of θ> 10. This makes θ^F more preferable in general than θ^L .
TABLE 2
Properties of
To address the issue of efficiency, we performed a large-scale simulation to see the extent to which performance of the ML estimator is affected by the number of runs through the Markov chain. In the comparison
with the ML estimator θ^L shown in Table 4, the θ^L was computed using the default Markov chain steps, 100,000 runs. Table 5 shows the results with three different numbers of runs through the Markov chain, 10,000, 100,000 and 1,000,000, where θ is set to 10.0. It is clear from Table 5 that there is a big improvement in the performance of θ^L in terms of MSE when the number of runs through the Markov chain changes from 10,000 to 100,000, but only a small improvement when the replicate number changes from 100,000 to 1,000,000. More importantly, even when 1,000,000 replicates were used for the θ^L , it still has larger bias and MSE than the homozygosity-based estimator θ^F when θ= 10.0. An extreme case was carried out in which the number of runs through the Markov chain for θ^L was set to 10,000,000 when θ= 10.0 and sample size n = 50. In this case, the MSE of θ^L was 69.53, which is still >50.62, the MSE of θ^F .
Robustness of the estimator: So far, the analysis is based on the single-step stepwise mutation model. While this may be true for some microsatellite loci, statistical analysis suggests that not all of them adhere to this simple version of the stepwise mutation model (Shriveret al. 1993; Di Rienzoet al. 1994). Furthermore, direct mutation assays at several loci showed that occasionally mutation may lead to jumps of allele sizes beyond one repeat unit (Weber and Wong 1993). On the basis of these lines of evidence, a generalized version of the stepwise mutation model (Kimmel and Chakraborty 1996; Fu and Chakraborty 1998) was proposed in which each mutation is supposed to change the allele size from X to X + U. The mutation is symmetric and the absolute value of the offset U is sampled from a geometric distribution with parameter λ; that is,
P(∣U∣=x)=(1−λ)x−1λ,0<λ≤1.
(9)
The performance of both estimators under this generalized stepwise mutation model was investigated through computer simulation. A total of 50,000 samples were simulated assuming the generalized model with λ= 0.67. With this λ value,
E(∣U∣)=1∕λ=1.5.
That is, on average each mutation causes a jump of allele sizes of ∼1.5 repeat units. For each simulated sample, the sample procedure as before was taken to obtain the two estimators, θ¯F and θ¯V . The bias and MSE were also taken for each estimator. The corresponding theoretical values for the bias and MSE of θ^V were also computed. The details are in the appendix. The simulation value agrees well with the theoretical value. The results are shown in Table 6.
Table 6 shows that under the generalized stepwise mutation model, both estimators are upwardly biased. That is, both estimators on average overestimate the real θ value. The bias is an increasing function of θ. When the bias of θ^F , is compared to that of θ^V , the former always has a smaller bias than the latter, which means that θ^F is less biased than θ^V especially when θ is high. Comparison between the corresponding MSEs also shows that θ^F has a smaller MSE than θ^V . These two points make θ^F still more preferable than θ^V even when
the actual mutation model is the generalized stepwise mutation model.
APPLICATION
To test the performance of the homozygosity-based estimator θ^F with real data, we use the allele frequency data from the ALFRED database at Yale University (Cheunget al. 2000). There are altogether 115 dinucleotide repeats with data from 10 worldwide populations. The 10 populations are Biaka, Mbuti, Druze, Danes, Han, Japanese, Melanesian-Nasioi, Yakut, Maya-Yucatan, and Surui. More information about the loci and populations can be found at http://alfred.med.yale.edu/alfred/index.asp.
For each population-locus combination, θ^F and θ^V are computed. To compare the consistency of the estimators, one locus is randomly chosen as the base locus and the ratio of the estimate for other loci in the same population is taken over the estimate for the base locus. Since the effective population size is generally supposed to be the same in the same population for all loci from the same sample, we are estimating the ratio of mutation rates using information from different populations. Assuming the mutation rate for a particular locus is constant
across the populations, the estimates of the ratio of mutation rates from different populations are the estimates of the same quantity. Consequently, the dispersion of the results is an indicator of the consistency of the estimator. The coefficient of variance (ratio of standard deviation to mean) is taken as a measure of dispersion. In almost all the cases, the coefficient of variance is smaller with θ^F than with θ^V , which indicates that the homozygosity-based estimator θ^F is more stable and more consistent than the variance-based estimator θ^V . Examples of the results from four loci are tabulated in Table 7, where the base locus (locus 1) is D11S935, locus 2 is D7S640, locus 3 is D6S441, and locus 4 is D5S408, with the corresponding mutation rates denoted as μ_{1}–μ_{4}, respectively.
DISCUSSION
Kimmel and Chakraborty (1996) showed that sample homozygosity at a microsatellite locus depends not only on θ, but also on the pattern of allele size change caused by mutation. Therefore, any attempt to estimate θ on the basis of homozygosity has to be mutation model dependent. Interestingly, the regression formula we found on the basis of the single-step stepwise mutation model is reasonably robust against deviations from the single-step model. This is a useful property since it is very difficult to specify the model with confidence. On the other hand, if one has sufficient confidence in a particular model, a similar approach can be used to derive the regression formula under the model. This can be seen from our simulation study when the mutation
model deviates from the single-step stepwise mutation model to the generalized stepwise mutation model.
Although the maximum-likelihood estimator, θ^L , proposed by Nielsen (1997) is computationally demanding, its performance was compared to that of the homozygosity-based estimator θ^F through a large-scale simulation. The ML estimator θ^L is found to be slightly upwardly biased. This is not too surprising because many maximum-likelihood estimators are known to be biased
for small sample sizes. Indeed we found that the θ^L approaches the true value as sample size (n) increases. However, even when n = 300, there is still an appreciable amount of bias. The MSE of θ^L decreases with the increase of the sample size. However, in the most likely range of θ for microsatellites, θ^L has in general larger MSE than θ^F unless the sample size is extremely large. θ^L has a slight advantage when θ is small. However, in such a situation, the bias of θ^L may be more of a concern. For example, from Table 4 when θ= 2.0 and n = 30, the bias can be nearly 18%. All these factors make θ^F an attractive alternative to θ^L .
We have relied on regression to find a way to remove bias as an estimator of θ from θ∼F . It should be pointed out that jackknife is a widely used approach to reduce bias in estimation (e.g., Manly 1997). The underlying theory is that a jackknife estimator removes the bias of order 1/n; that is, if the original biased estimate θ^ has the form
E(θ^)=θ(1+An),
(10)
where A is a constant, then the jackknife estimator can remove the bias. However, the relationship between E(θ∼F) and θ is rather complex. Although the exact relationship is unknown, Equations 7 and 8 indicate that the relationship is certainly not in the form of Equation 10. So the jackknife estimator is unlikely to be able to remove much of the bias in θ∼F . Indeed, when the jackknife method was applied in our simulated sample, we found that it was able to remove only ∼10% of the bias in many combinations of parameters. Therefore, jackknife is not an appropriate approach to use in this situation.
From Equation 5 of Kimmel and Chakraborty (1996), the estimator based on allele size variance under any arbitrary stepwise mutation model is given by
θ∼V=VE(U02),
(11)
where V = 2E(V_{s}) and U_{0} is the symmetrized allele size change in a single generation and is mutation model dependent. Consequently, the variance-based estimator θ^V is mutation model dependent and is applicable to the particular model itself. In the case of the single-step stepwise mutation model, Equation 11 is reduced to Equation 1 since E(U02)=1 . Therefore, θ^V is a special case of θ∼V and is mutation model dependent and applicable to the single-step stepwise mutation model. Hence it is no surprise that θ^V becomes biased under the generalized stepwise mutation model.
Rubinsztein et al. (1995) argued that the mutational transitions may be asymmetric. During the analysis in this article we did not differentiate the asymmetric model from the symmetric model. This is because from Kimmel and Chakraborty (1996) homozygosity and allele size variance are independent of mutation direction. Indeed, these are confirmed in our simulation (data not shown). Consequently, our homozygosity-based estimator θ^F is applicable for single-step stepwise mutation, symmetric or not. Computer programs to carry out the analysis and to estimate θ^F are available upon request.
Acknowledgments
We thank R. Nielsen for sharing his ML program. This work was supported partly by National Institutes of Health grants R01 GM50428 and R01 GM60777 to Y.-X. Fu.
APPENDIX: CALCULATING THE BIAS AND MSE OF θ^V UNDER THE GENERALIZED STEPWISE MUTATION MODEL
Given that
P(∣U∣=x)=(1−λ)x−1λ,whereλ=0.67,
that is, |U| ∼ geometric(0.67), since the mutation is symmetric, U_{0} = U, we have
E(U02)=E(U2)=Var(U)+(E(U))2=1−λλ2+1λ2=2−λλ2.
(A1)
Since θ^V=2V and from Equation 4 of Kimmel and Chakraborty (1996) we have
E(Vs)=V2=θ2E(U02),
(A2)
where V is defined in Kimmel and Chakraborty (1996),
E(θ^V)=2E(Vs)=2×θ2E(U02)=2−λλ2θ.
(A3)
Substituting λ= 0.67 into Equation A3, we have
E(θ^V)=3θ.
(A4)
Therefore,
Bias(θ^V)=3θ−θ=2θ.
(A5)
To calculate the MSE of θ^V we need to calculate variance of size variance V first. From Equation 16 of Kimmel and Chakraborty (1996),
Var(Vs)=13V2+112VE(U04)E(U02).
(A6)
Since U_{0} = U and U ∼ geometric(0.67), the moment-generating function of U_{0} is
M(t)=λe21−λe2.
(A7)
Taking the fourth derivative of Equation A7 and setting t = 1, λ= 0.67, we have
E(U04)=30.
(A8)
From Equation 5 of Kimmel and Chakraborty (1996), we have
V=θE(U02).
(A9)
Substituting Equations A1, A8, and A9 into Equation A6, we have
Var(Vs)=3θ2+52θ.
(A10)
Since θ^V=2Vs , we have
Var(θ^V)=4Var(Vs)=12θ2+10θ.
(A11)
Therefore,
MSE(θ^V)=[Bias(θ^V)]2+Var(θ^V)=(2θ)2+12θ2+10θ=16θ2+10θ.
(A12)
- Received May 30, 2003.
- Accepted September 24, 2003.
- Copyright © 2004 by the Genetics Society of America