Estimating Effective Population Size or Mutation Rate With Microsatellites
Hongyan Xu and Yun-Xin Fu
Human Genetics Center, University of Texas, Houston, Texas 77030
Corresponding author: Yun-Xin Fu, School of Public Health, University of Texas, 1200 Herman Pressler, Houston, TX 77030. E-mail: yunxin.fu{at}uth.tmc.edu
Communicating editor: J. B. WALSH
| ABSTRACT |
|---|
Microsatellites are short tandem repeats that are widely dispersed among eukaryotic genomes. Many of them are highly polymorphic; they have been used widely in genetic studies. Statistical properties of all measures of genetic variation at microsatellites critically depend upon the composite parameter $\theta$ = 4Nµ, where N is the effective population size and µ is the mutation rate per locus per generation. Since mutation leads to expansion or contraction of a repeat number in a stepwise fashion, the stepwise mutation model has been widely used to study the dynamics of these loci. We developed an estimator of $\theta$, $\hat{\theta}_F$, on the basis of sample homozygosity under the single-step stepwise mutation model. The estimator is unbiased and is much more efficient than the variance-based estimator under the single-step stepwise mutation model. It also has smaller bias and mean square error (MSE) than the variance-based estimator when the mutation follows the multistep generalized stepwise mutation model. Compared with the maximum-likelihood estimator $\hat{\theta}_L$ by Nielsen (1997), $\hat{\theta}_F$ has less bias and smaller MSE in general. $\hat{\theta}_L$ has a slight advantage when $\theta$ is small, but in such a situation the bias in $\hat{\theta}_L$ may be more of a concern.
MICROSATELLITE loci, also known as short tandem repeats, are tandem repeat loci with repeat motifs of two to six nucleotides in length (![]()
Statistical properties of all measures of genetic variation critically depend upon the composite parameter $\theta$ = 4Nµ, where N is the effective population size and µ is the mutation rate per locus per generation. An accurate estimate of $\theta$ will greatly facilitate inference on the basis of variation at microsatellite loci. While the variation at microsatellites is extremely useful, little has been done to estimate $\theta$ using microsatellite data. This is partly due to the unknown mutation mechanism at such loci. Microsatellite loci are hypervariable and the mechanisms that produce new variation at such loci are unusual in comparison with those of classical loci. While the exact mechanism of mutations at such loci is still not well characterized at a molecular level (![]()
![]()
![]()
![]()
![]()
Although a number of estimators of
(![]()
![]()
![]()
using microsatellite data. Here we assume the neutral Wright-Fisher model without population substructure. The estimation of $\theta$ becomes the estimation of effective population size, N, when the mutation rate, µ, is known or the estimation of mutation rate, µ, when the effective population size, N, is known.
| METHODS AND RESULTS |
|---|
Existing estimators:
Assuming the single-step stepwise mutation model, in which each mutation produces either a one-step contraction or a one-step expansion in allele size, for a population without substructure and a neutral locus, the variance in allele size from a sample, Vs, has a mean equal to $\theta$/2. This leads to the variance-based estimator

$$\hat{\theta}_V = 2V_s. \tag{1}$$
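As a concrete illustration of Equation 1, the variance-based estimate can be computed directly from the repeat counts observed in a sample. This is only a minimal sketch: the function name `theta_v` and the example repeat counts are hypothetical, and it assumes that Vs is the usual unbiased sample variance (divisor n - 1), which the text does not state explicitly.

```python
from statistics import variance


def theta_v(allele_sizes):
    """Variance-based estimate: twice the sample variance of allele size.

    Assumption: Vs is the unbiased sample variance (divisor n - 1).
    """
    if len(allele_sizes) < 2:
        raise ValueError("need at least two sampled gene copies")
    return 2.0 * variance(allele_sizes)


# Hypothetical repeat counts, one entry per sampled chromosome.
sample = [10, 10, 11, 12, 12, 12, 13, 14]
print(theta_v(sample))
```

Because the estimate depends only on the spread of allele sizes, a few unusually long or short alleles can change it substantially, which is consistent with the large variance discussed next.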
The estimator $\hat{\theta}_V$ is rather simple, but the price of its simplicity is a large variance. The variance of the allele size variance, Vs, was given by ![]()
![]() |
(2) |
Consequently, the variance of $\hat{\theta}_V$ is given by
![]() |
(3) |
Several examples of the value of $\hat{\theta}_V$ are shown in Table 1. In general the standard deviation is > $\theta$.
An even better known quantity is heterozygosity, denoted as H and defined as the probability that two randomly chosen sequences are of different allelic type; it is a measure of genetic variation at a microsatellite locus. The complement of heterozygosity, F = 1 - H, is called homozygosity. Since F contains the information of both number of alleles and allele frequency, an estimator based on F may be a possible solution.
Under the single-step stepwise mutation model, for a population without substructure and a neutral locus, the expected homozygosity (Ohta and Kimura 1973) is

$$F = \frac{1}{\sqrt{1 + 2\theta}}. \tag{4}$$
Supposing a sample of size n is taken from a population and letting k be the number of alleles in the sample, the homozygosity F can be estimated by

$$\hat{F} = \frac{n \sum_{i=1}^{k} p_i^2 - 1}{n - 1}, \tag{5}$$
where $p_i$ is the frequency of the ith allele in the sample. Then a moment estimator of $\theta$ can be derived from Equation 4, replacing F with $\hat{F}$:

$$\hat{\theta}_F = \frac{1}{2}\left(\frac{1}{\hat{F}^2} - 1\right). \tag{6}$$
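The two steps above (estimate homozygosity from allele frequencies, then invert the expected-homozygosity relation) can be sketched in a few lines. The sketch rests on two assumptions: that Equation 5 is the usual unbiased form (n Σ p_i² - 1)/(n - 1), and that Equation 4 is the single-step expectation F = 1/√(1 + 2θ), so that Equation 6 is θ = (1/F² - 1)/2. The function names and the example sample are hypothetical, and the result is the uncorrected moment estimate only.

```python
from collections import Counter


def sample_homozygosity(alleles):
    """Homozygosity estimate (n * sum(p_i^2) - 1) / (n - 1).

    Written with integer allele counts c_i = n * p_i, which gives the
    algebraically identical form (sum(c_i^2) - n) / (n * (n - 1)).
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two sampled gene copies")
    sum_sq = sum(c * c for c in Counter(alleles).values())
    return (sum_sq - n) / (n * (n - 1))


def theta_f_moment(alleles):
    """Uncorrected moment estimate obtained by inverting F = 1/sqrt(1 + 2*theta)."""
    f_hat = sample_homozygosity(alleles)
    if f_hat <= 0.0:
        return float("inf")  # every sampled copy carries a different allele
    return 0.5 * (1.0 / f_hat ** 2 - 1.0)


# Hypothetical repeat counts, one entry per sampled chromosome.
sample = [10, 10, 11, 12, 12, 12, 13, 14]
print(sample_homozygosity(sample))
print(theta_f_moment(sample))
```

Note that only allele identity enters the calculation; the sizes of the alleles are not used, in contrast to the variance-based estimator.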
Since the transformation is not linear, the estimator $\hat{\theta}_F$ is usually biased, particularly when $\theta$ is large. Simple correction based on the infinite allele model was proposed before (![]()
![]()
estimator and real value of $\theta$. Unfortunately, such an analytical formula is not yet known for genetic loci evolving under the stepwise mutation model.
Besides the two estimators of $\theta$ using microsatellite data, ![]()
![]()
. They use a minimum chi-square method to perform a grid search of all the possible values in the multidimensional parameter space, which makes it a challenge to analyze a large amount of data. To date, many population studies using microsatellites involve larger and larger samples and multiple loci. A relatively simple yet efficient estimator is highly desirable. In many ways, such an estimator can serve a role similar to that of Watterson's or Tajima's estimator of $\theta$ for DNA sequence data, despite the fact that several sophisticated estimators of $\theta$ for DNA sequence data have been available.
New estimator:
The approach we take uses a combination of computer simulation and statistical regression to find the relationship between the expectation of $\hat{\theta}_F$ and the real value of $\theta$. On the basis of this relationship, we try to develop a new unbiased estimator of $\theta$. Computer simulation is an efficient way to study the properties of the homozygosity-based estimator $\hat{\theta}_F$. For each combination of $\theta$ value and sample size, n, a large number of samples are simulated according to coalescent theory. For each sample, the homozygosity is estimated through Equation 5. Then the homozygosity-based estimate is obtained through Equation 6. Some of the results are shown in Figure 1, where each point in the figure is the mean of $\hat{\theta}_F$ over 50,000 simulated samples. Figure 1 shows that $\hat{\theta}_F$ on average overestimates $\theta$. The magnitude of overestimation is a function of sample size n and $\theta$, and, in many cases, the biases are severe.
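The simulation step described above can be sketched as follows. This is a minimal sketch and not the authors' program. It assumes the standard coalescent with time measured in units of 2N generations, so that mutations fall on each lineage at rate θ/2, and it assumes each mutation changes the repeat count by +1 or -1 with equal probability. The function names, the seed, and the replicate count are hypothetical choices for illustration. As a check, the simulated mean homozygosity is compared with the single-step expectation 1/√(1 + 2θ), which is assumed here to be the content of Equation 4.

```python
import math
import random
from collections import Counter


def net_steps(branch_length, theta, rng):
    """Net repeat-count change along one branch under single-step mutation."""
    rate = theta / 2.0  # mutations per lineage per unit of 2N generations
    if rate <= 0.0:
        return 0
    net = 0
    t = rng.expovariate(rate)  # waiting time to the first mutation
    while t < branch_length:
        net += rng.choice((-1, 1))  # one-step contraction or expansion
        t += rng.expovariate(rate)
    return net


def simulate_sample(n, theta, rng):
    """Repeat counts of n gene copies, relative to their common ancestor (0)."""
    sizes = [0] * n
    lineages = [[i] for i in range(n)]  # copies descending from each lineage
    while len(lineages) > 1:
        k = len(lineages)
        wait = rng.expovariate(k * (k - 1) / 2.0)  # time to next coalescence
        for leaves in lineages:
            shift = net_steps(wait, theta, rng)
            for leaf in leaves:  # a mutation shifts every descendant copy
                sizes[leaf] += shift
        i, j = sorted(rng.sample(range(k), 2))  # the two lineages that merge
        merged = lineages[i] + lineages[j]
        del lineages[j]  # delete the larger index first
        del lineages[i]
        lineages.append(merged)
    return sizes


def sample_homozygosity(alleles):
    """(n * sum(p_i^2) - 1) / (n - 1), written with integer allele counts."""
    n = len(alleles)
    sum_sq = sum(c * c for c in Counter(alleles).values())
    return (sum_sq - n) / (n * (n - 1))


def mean_homozygosity(n, theta, replicates, seed=1):
    """Average homozygosity estimate over simulated samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replicates):
        total += sample_homozygosity(simulate_sample(n, theta, rng))
    return total / replicates


if __name__ == "__main__":
    theta = 2.0
    print("simulated mean homozygosity:", mean_homozygosity(20, theta, 2000))
    print("single-step expectation:    ", 1.0 / math.sqrt(1.0 + 2.0 * theta))
```

Tracking the descendants of each lineage keeps the sketch short at the cost of quadratic work per sample; a study on the scale described in the text (50,000 samples per parameter combination) would likely call for a more efficient tree representation.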
To summarize the relationship among $\theta$, n, and the mean of $\hat{\theta}_F$, a regression approach can be used. The challenge is to find the simplest equation that is sufficiently accurate for describing the relationship. From Figure 1, it seems that the mean of $\hat{\theta}_F$ is inversely related to sample size and positively related to $\theta$. We include the terms 1/n and
in the regression formula. We started to consider equations that incorporate 1/n and
in various ways. Choosing
as the basic unit was partly inspired by Equation 4. The most complex equation we consider is a polynomial including all combinations of 1/n,
, and $(1/n)^2$.
The regression analysis shows that two regression equations summarize remarkably well ($R^2$ = 99.99%) the relationship of $\theta$, n, and the mean of $\hat{\theta}_F$ (see Figure 1). For $\theta \le 10$,
![]() |
(7) |
For $\theta > 10$,
![]() |
(8) |
The regression equations have two nice properties. First, when $\hat{\theta}_F$ = 0, we have $\theta$ = 0. Second, when sample size $n \to \infty$, $\hat{\theta}_F$ has a limit value, which does not depend on n. Actually, when n > 200, the effect of sample size is very small.
On the basis of the above regression equations, we propose the following new estimator $\hat{\theta}_F$:

The threshold value 15 is based on the observation that 90% of the values of $\hat{\theta}_F$ are <15.0 with $\theta$ = 10. However, we found that the choice is not critical, because choosing 10 as the threshold value does not make much difference. This is because when $\theta$ is ~10, Equation 7 and Equation 8 give very similar results.
The performance of $\hat{\theta}_F$ was investigated through simulation. For a given combination of $\theta$ and sample size n, 50,000 samples were simulated and for each sample $\hat{\theta}_F$ was estimated by Equation 6 and then corrected through Equation 7 or Equation 8. Some of the results are summarized in Table 2. Table 2 shows that the estimator $\hat{\theta}_F$ is unbiased (or nearly so). The small bias is likely due to fluctuation in simulation and is insignificant compared to the variance.
Next we compare the performance of our estimator $\hat{\theta}_F$ with that of the estimator based on allele size variance, $\hat{\theta}_V$. There are two ways to compute the variance of $\hat{\theta}_V$. The theoretical value of the large-sample variance can be computed through Equation 3, and the variance can also be estimated through computer simulation. We computed it in both ways, both to check the validity of our simulation program and so that the results can corroborate each other. The results are summarized in Table 3. Table 3 shows that the theoretical value of the variance of $\hat{\theta}_V$ agrees well with the simulation value, which indicates that our simulation is accurate. More importantly, Table 3 shows that while both estimators are unbiased, our homozygosity-based estimator $\hat{\theta}_F$ is better than the size variance-based estimator $\hat{\theta}_V$ in that the variance of $\hat{\theta}_F$ is smaller than that of $\hat{\theta}_V$. The relative efficiency of $\hat{\theta}_F$ against $\hat{\theta}_V$, defined as the ratio of the variance of $\hat{\theta}_V$ to the variance of $\hat{\theta}_F$, is also given in Table 3. The relative efficiency increases as $\theta$ increases, which means that $\hat{\theta}_F$ becomes more and more efficient with increasing $\theta$ value. Note that since microsatellite loci have a relatively high mutation rate, the $\theta$ value can easily be in the range of 10-100, which makes $\hat{\theta}_F$ superior to $\hat{\theta}_V$ for most microsatellite loci.
Comparison with the maximum-likelihood estimator:
The performance of the homozygosity-based estimator $\hat{\theta}_F$ is further compared to that of the maximum-likelihood (ML) estimator $\hat{\theta}_L$ proposed by ![]()
and sample size. The two estimators, $\hat{\theta}_F$ and $\hat{\theta}_L$, are computed for each simulated sample. The mean value and mean square error (MSE) for the corresponding estimates are then computed and the results are summarized in Table 4. Two conclusions are obvious from Table 4. First, the ML estimator $\hat{\theta}_L$ is, in general, upwardly biased. Although the bias decreases with sample size, it is still appreciable even when the sample size is 300. In comparison, the mean value of the homozygosity-based estimator $\hat{\theta}_F$ exhibits little bias, similar to the case of comparing $\hat{\theta}_F$ and $\hat{\theta}_V$. Second, in general the ML estimator $\hat{\theta}_L$ has a larger MSE than that of $\hat{\theta}_F$, except in the cases where $\theta$ is small and sample size is large. It is somewhat surprising that as $\theta$ increases, the relative performance of $\hat{\theta}_L$, measured by MSE, gets worse compared to $\hat{\theta}_F$. Two possible causes are that the ML estimator implemented by Nielsen may not be a true ML estimator and that it may not be efficient. Indeed, in Nielsen's algorithm, a k-allele model was used to approximate the stepwise mutation model (![]()
value can be quite large even for a modest population size. For example, many samples from human populations have yielded estimates of $\theta$ > 10. This makes $\hat{\theta}_F$ preferable in general to $\hat{\theta}_L$.
To address the issue of efficiency, we performed a large-scale simulation to see the extent to which the performance of the ML estimator is affected by the number of runs through the Markov chain. In the comparison with the ML estimator $\hat{\theta}_L$ shown in Table 4, $\hat{\theta}_L$ was computed using the default number of Markov chain steps, 100,000 runs. Table 5 shows the results with three different numbers of runs through the Markov chain, 10,000, 100,000, and 1,000,000, where $\theta$ is set to 10.0. It is clear from Table 5 that there is a big improvement in the performance of $\hat{\theta}_L$ in terms of MSE when the number of runs through the Markov chain changes from 10,000 to 100,000, but only a small improvement when the number of runs changes from 100,000 to 1,000,000. More importantly, even when 1,000,000 runs were used for $\hat{\theta}_L$, it still has larger bias and MSE than the homozygosity-based estimator $\hat{\theta}_F$ when $\theta$ = 10.0. An extreme case was carried out in which the number of runs through the Markov chain for $\hat{\theta}_L$ was set to 10,000,000 when $\theta$ = 10.0 and sample size n = 50. In this case, the MSE of $\hat{\theta}_L$ was 69.53, which is still larger than 50.62, the MSE of $\hat{\theta}_F$.
Robustness of the estimator:
So far, the analysis is based on the single-step stepwise mutation model. While this may be true for some microsatellite loci, statistical analysis suggests that not all of them adhere to this simple version of the stepwise mutation model (![]()
![]()
![]()
![]()
![]()
; that is,
![]() |
(9) |
The performance of both estimators under this generalized stepwise mutation model was investigated through computer simulation. A total of 50,000 samples were simulated assuming the generalized model with
= 0.67. With this
value,

That is, on average each mutation causes a jump in allele size of ~1.5 repeat units. For each simulated sample, the same procedure as before was followed to obtain the two estimators, $\hat{\theta}_F$ and $\hat{\theta}_V$. The bias and MSE were also computed for each estimator. The corresponding theoretical values for the bias and MSE of $\hat{\theta}_V$ were also computed. The details are in the Appendix. The simulation values agree well with the theoretical values. The results are shown in Table 6.
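The mutation step of the generalized model can be sketched in a few lines. The exact form of Equation 9 is not recoverable here, so the sketch assumes, on the basis of the "geometric(0.67)" notation in the Appendix and the stated mean jump of ~1.5 repeat units, that the absolute step size |U| takes the values 1, 2, ... with probability 0.67 × 0.33^(k-1), and that expansion and contraction are equally likely. The names `P_GEOM`, `draw_step`, and `step_moments` are hypothetical.

```python
import random

P_GEOM = 0.67  # parameter value used in the simulation described above


def draw_step(rng, p=P_GEOM):
    """Draw one signed mutation step U with |U| geometric on 1, 2, 3, ..."""
    size = 1
    while rng.random() >= p:  # continue with probability 1 - p
        size += 1
    return size if rng.random() < 0.5 else -size


def step_moments(p=P_GEOM):
    """Closed-form E|U| and E(U^2) for a geometric step size on 1, 2, 3, ..."""
    mean_abs = 1.0 / p
    second_moment = (2.0 - p) / p ** 2
    return mean_abs, second_moment


rng = random.Random(2024)  # arbitrary seed for the illustration
draws = [draw_step(rng) for _ in range(20000)]
print("simulated E|U|:", sum(abs(u) for u in draws) / len(draws))
print("closed-form E|U| and E(U^2):", step_moments())
```

Under this assumed parameterization the mean absolute step is 1/0.67, which is consistent with the ~1.5 repeat units quoted in the text.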
Table 6 shows that under the generalized stepwise mutation model, both estimators are upwardly biased. That is, both estimators on average overestimate the real $\theta$ value. The bias is an increasing function of $\theta$. The bias of $\hat{\theta}_F$ is always smaller than that of $\hat{\theta}_V$, especially when $\theta$ is high. Comparison between the corresponding MSEs also shows that $\hat{\theta}_F$ has a smaller MSE than $\hat{\theta}_V$. These two points make $\hat{\theta}_F$ preferable to $\hat{\theta}_V$ even when the actual mutation model is the generalized stepwise mutation model.
| APPLICATION |
|---|
To test the performance of the homozygosity-based estimator $\hat{\theta}_F$ with real data, we use the allele frequency data from the ALFRED database at Yale University (![]()
For each population-locus combination, $\hat{\theta}_F$ and $\hat{\theta}_V$ are computed. To compare the consistency of the estimators, one locus is randomly chosen as the base locus, and for each other locus in the same population the ratio of its estimate to the estimate for the base locus is taken. Since the effective population size is generally supposed to be the same for all loci from the same sample, we are estimating the ratio of mutation rates using information from different populations. Assuming the mutation rate for a particular locus is constant across the populations, the estimates of the ratio of mutation rates from different populations are estimates of the same quantity. Consequently, the dispersion of the results is an indicator of the consistency of the estimator. The coefficient of variation (ratio of standard deviation to mean) is taken as a measure of dispersion. In almost all the cases, the coefficient of variation is smaller with $\hat{\theta}_F$ than with $\hat{\theta}_V$, which indicates that the homozygosity-based estimator $\hat{\theta}_F$ is more stable and more consistent than the variance-based estimator $\hat{\theta}_V$. Examples of the results from four loci are tabulated in Table 7, where the base locus (locus 1) is D11S935, locus 2 is D7S640, locus 3 is D6S441, and locus 4 is D5S408, with the corresponding mutation rates denoted as µ1-µ4, respectively.
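The consistency check described above can be sketched as follows. The numbers are hypothetical and serve only to show the calculation; they are not values from ALFRED or from Table 7. The function names and the nested-dictionary layout are also assumptions, as is the use of the sample standard deviation (divisor one less than the number of populations).

```python
from statistics import mean, stdev


def rate_ratios(estimates, base_locus):
    """Return {locus: [ratio to the base-locus estimate, one per population]}."""
    ratios = {}
    for by_locus in estimates.values():
        base = by_locus[base_locus]
        for locus, theta_hat in by_locus.items():
            if locus != base_locus:
                ratios.setdefault(locus, []).append(theta_hat / base)
    return ratios


def coefficient_of_variation(values):
    """Ratio of the sample standard deviation to the mean."""
    return stdev(values) / mean(values)


# Hypothetical theta estimates, for illustration only.
estimates = {
    "pop_A": {"locus1": 8.0, "locus2": 12.0},
    "pop_B": {"locus1": 5.0, "locus2": 9.0},
    "pop_C": {"locus1": 10.0, "locus2": 14.0},
}

for locus, values in rate_ratios(estimates, "locus1").items():
    print(locus, values, coefficient_of_variation(values))
```

Running the same calculation once with homozygosity-based estimates and once with variance-based estimates gives the two coefficients of variation that the comparison relies on.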
| DISCUSSION |
|---|
![]()
, but also on the pattern of allele size change caused by mutation. Therefore, any attempt to estimate $\theta$ on the basis of homozygosity has to be mutation model dependent. Interestingly, the regression formula we found on the basis of the single-step stepwise mutation model is reasonably robust against deviations from the single-step model. This is a useful property since it is very difficult to specify the model with confidence. On the other hand, if one has sufficient confidence in a particular model, a similar approach can be used to derive the regression formula under the model. This can be seen from our simulation study when the mutation model deviates from the single-step stepwise mutation model to the generalized stepwise mutation model.
Although the maximum-likelihood estimator, $\hat{\theta}_L$, proposed by ![]()
F through a large-scale simulation. The ML estimator $\hat{\theta}_L$ is found to be slightly upwardly biased. This is not too surprising because many maximum-likelihood estimators are known to be biased for small sample sizes. Indeed we found that $\hat{\theta}_L$ approaches the true value as sample size (n) increases. However, even when n = 300, there is still an appreciable amount of bias. The MSE of $\hat{\theta}_L$ decreases with the increase of the sample size. However, in the most likely range of $\theta$ for microsatellites, $\hat{\theta}_L$ has in general a larger MSE than $\hat{\theta}_F$ unless the sample size is extremely large. $\hat{\theta}_L$ has a slight advantage when $\theta$ is small. However, in such a situation, the bias of $\hat{\theta}_L$ may be more of a concern. For example, from Table 4 when $\theta$ = 2.0 and n = 30, the bias can be nearly 18%. All these factors make $\hat{\theta}_F$ an attractive alternative to $\hat{\theta}_L$.
We have relied on regression to find a way to remove the bias of $\hat{\theta}_F$ as an estimator of $\theta$. It should be pointed out that the jackknife is a widely used approach to reduce bias in estimation (e.g., ![]()
has the form
![]() |
(10) |
where A is a constant, then the jackknife estimator can remove the bias. However, the relationship between E($\hat{\theta}_F$) and $\theta$ is rather complex. Although the exact relationship is unknown, Equation 7 and Equation 8 indicate that the relationship is certainly not in the form of Equation 10. So the jackknife estimator is unlikely to be able to remove much of the bias in $\hat{\theta}_F$. Indeed, when the jackknife method was applied to our simulated samples, we found that it was able to remove only ~10% of the bias in many combinations of parameters. Therefore, the jackknife is not an appropriate approach to use in this situation.
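For readers who wish to repeat this comparison, a jackknife version of the homozygosity-based moment estimate can be sketched as below. The text does not say which jackknife variant was used, so the sketch assumes the standard delete-one jackknife, n·θ̂ - (n - 1)·(mean of the leave-one-out estimates), applied to the uncorrected moment estimate θ = (1/F² - 1)/2 with F estimated as (n Σ p_i² - 1)/(n - 1). All function names and the example sample are hypothetical.

```python
from collections import Counter


def theta_f_from_counts(counts):
    """Uncorrected moment estimate of theta from a list of allele counts."""
    n = sum(counts)
    sum_sq = sum(c * c for c in counts)
    # (n * sum(p_i^2) - 1) / (n - 1), written with integer counts.
    f_hat = (sum_sq - n) / (n * (n - 1))
    if f_hat <= 0.0:
        raise ValueError("homozygosity estimate is zero; theta is unbounded")
    return 0.5 * (1.0 / f_hat ** 2 - 1.0)


def jackknife_theta_f(alleles):
    """Delete-one jackknife of the moment estimate (assumed variant)."""
    counts = Counter(alleles)
    n = len(alleles)
    full = theta_f_from_counts(list(counts.values()))
    loo_total = 0.0
    for allele, c in counts.items():
        # Removing any one copy of the same allele gives the same estimate,
        # so compute it once and weight it by the allele count.
        reduced = [v - 1 if a == allele else v for a, v in counts.items()]
        reduced = [v for v in reduced if v > 0]
        loo_total += c * theta_f_from_counts(reduced)
    return n * full - (n - 1) * (loo_total / n)


# Hypothetical repeat counts, one entry per sampled chromosome.
sample = [10, 10, 10, 11, 11, 12, 12, 12, 12, 13]
print(theta_f_from_counts(list(Counter(sample).values())))
print(jackknife_theta_f(sample))
```

The delete-one jackknife removes bias of order 1/n, which is why its usefulness depends on the expectation having the simple form discussed in the text.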
From Equation 5 of ![]()
![]() |
(11) |
where V = 2E(Vs) and U0 is the symmetrized allele size change in a single generation and is mutation model dependent. Consequently, the variance-based estimator
V is mutation model dependent and is applicable to the particular model itself. In the case of the single-step stepwise mutation model, Equation 11 is reduced to Equation 1 since
. Therefore,
V is a special case of
V and is mutation model dependent and applicable to the single-step stepwise mutation model. Hence it is no surprise that $\hat{\theta}_V$ becomes biased under the generalized stepwise mutation model.
![]()
![]()
F is applicable for single-step stepwise mutation, symmetric or not. Computer programs to carry out the analysis and to estimate $\hat{\theta}_F$ are available upon request.
| ACKNOWLEDGMENTS |
|---|
We thank R. Nielsen for sharing his ML program. This work was supported partly by National Institutes of Health grants R01 GM50428 and R01 GM60777 to Y.-X. Fu.
Manuscript received May 30, 2003; Accepted for publication September 24, 2003.
| APPENDIX |
|---|
CALCULATING THE BIAS AND MSE OF $\hat{\theta}_V$ UNDER THE GENERALIZED STEPWISE MUTATION MODEL
Given that

that is, |U| ~ geometric(0.67), and since the mutation is symmetric, U0 = U, we have
![]() |
(A1) |
Since $\hat{\theta}_V$ = 2Vs and from Equation 4 of ![]()
![]() |
(A2) |
where V is defined in ![]()
![]() |
(A3) |
Substituting
= 0.67 into Equation A3, we have
![]() |
(A4) |
Therefore,
![]() |
(A5) |
To calculate the MSE of $\hat{\theta}_V$ we need to calculate the variance of the size variance V first. From Equation 16 of ![]()
![]() |
(A6) |
Since U0 = U and U ~ geometric(0.67), the moment-generating function of U0 is
![]() |
(A7) |
Taking the fourth derivative of Equation A7 and setting t = 1,
= 0.67, we have
![]() |
(A8) |
From Equation 5 of ![]()
![]() |
(A9) |
Substituting Equation A8, and Equation A9 into Equation A6, we have
![]() |
(A10) |
Since $\hat{\theta}_V$ = 2Vs, we have
![]() |
(A11) |
Therefore,
![]() |
(A12) |
| LITERATURE CITED |
|---|
CHAKRABORTY, R. and K. M. WEISS, 1991 Genetic variation of the mitochondrial DNA genome in American Indians is at mutation-drift equilibrium. Am. J. Phys. Anthropol. 86:497-506.
CHAKRABORTY, R., M. KIMMEL, D. STIVERS, L. DAVISON, and R. DEKA, 1997 Relative mutation rates at di-, tri-, and tetra-nucleotide microsatellite loci. Proc. Natl. Acad. Sci. USA 94:1041-1046.
CHEUNG, K. H., M. V. OSIER, J. R. KIDD, A. J. PAKSTIS, and P. L. MILLER et al., 2000 ALFRED: an allele frequency database for diverse populations and DNA polymorphisms. Nucleic Acids Res. 28:361-363.
DEKA, R., G. SUN, D. SMELSER, Y. ZHONG, and M. KIMMEL et al., 1999 Rate and directionality of mutations and effects of allele size constraints at anonymous, gene-associated and disease-causing trinucleotide loci. Mol. Biol. Evol. 16:1166-1177.
DI RIENZO, A., A. C. PETERSON, J. C. GARZA, A. M. VALDES, and M. SLATKIN et al., 1994 Mutational process of simple-sequence repeat loci in human populations. Proc. Natl. Acad. Sci. USA 91:3166-3170.
FU, Y. X. and R. CHAKRABORTY, 1998 Simultaneous estimation of all the parameters of a stepwise mutation model. Genetics 150:487-497.
JEFFREYS, A. J., K. TAMAKI, A. MACLEOD, D. G. MONCKTON, and D. L. NEIL et al., 1994 Complex gene conversion events in germline mutation at human minisatellites. Nat. Genet. 6:136-145.
KIMMEL, M. and R. CHAKRABORTY, 1996 Measures of variation at DNA repeat loci under a general stepwise mutation model. Theor. Popul. Biol. 50:345-367.
KIMMEL, M., R. CHAKRABORTY, D. STIVERS, and R. DEKA, 1996 Dynamics of repeat polymorphisms under a forward-backward mutation model: within- and between-population variability at microsatellite loci. Genetics 143:549-555.
MANLY, B. F. J., 1997 Randomization, Bootstrap and Monte Carlo Methods in Biology. Chapman & Hall, New York.
NIELSEN, R., 1997 A likelihood approach to population samples of microsatellite alleles. Genetics 146:711-716.
OHTA, T. and M. KIMURA, 1973 A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population. Genet. Res. 22:201-204.
RUBINSZTEIN, D. C., W. AMOS, J. LEGGO, S. GOODBURN, and S. JAIN et al., 1995 Microsatellite evolution: evidence for directionality and variation in rate between species. Nat. Genet. 10:337-343.
SHRIVER, M. D., L. JIN, R. CHAKRABORTY, and E. BOERWINKLE, 1993 VNTR allele frequency distributions under the stepwise mutation model: a computer simulation approach. Genetics 134:983-993.
TAUTZ, D., 1993 Notes on the definition and nomenclature of tandemly repetitive DNA sequence, pp. 21-28 in DNA Fingerprinting: Current State of the Science, edited by S. D. J. PENA, R. CHAKRABORTY, J. T. EPPLEN and A. J. JEFFREYS. Birkhäuser Publishing, Basel, Switzerland.
WEBER, J. L. and C. WONG, 1993 Mutation of human short tandem repeats. Hum. Mol. Genet. 2:1123-1128.
WEHRHAHN, C. F., 1975 The evolution of selectively similar electrophoretically detectable alleles in finite natural populations. Genetics 80:375-394.
ZHIVOTOVSKY, L. A. and M. W. FELDMAN, 1995 Microsatellite variability and genetic distances. Proc. Natl. Acad. Sci. USA 92:11549-11552.
ZOUROS, E., 1979 Mutation rates, population sizes and amounts of electrophoretic variation of enzyme loci in natural populations. Genetics 92:623-646.