- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Huang, S.-P.
- Articles by Weir, B. S.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Huang, S.-P.
- Articles by Weir, B. S.
Estimating the Total Number of Alleles Using a Sample Coverage Method
Shu-Pang Huanga and B. S. Weiraa Program in Statistical Genetics, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695-7566
Corresponding author: B. S. Weir, Bioinformatics Research Ctr., Campus Box 7566, North Carolina State University, Raleigh, NC 27695-7566., weir{at}stat.ncsu.edu (E-mail)
Communicating editor: M. A. ASMUSSEN
| ABSTRACT |
|---|
Previously reported methods for estimating the number of different alleles at a single locus in a population have not described a useful general result. Using the number of alleles observed in a sample gives an underestimate for the true number of alleles. The similar problem of estimating the number of species in a population was first investigated in 1943. In this article we use the sample coverage method proposed by Chao and Lee in 1992 to estimate the number of alleles in a population when there are unequal allele frequencies. Simulation studies under the recurrent mutation model show that, for reasonable sample sizes, a significantly better estimate of the true number can be obtained than that using only the observed alleles. Results under the stepwise mutation model and infinite-allele model are presented. Possible applications include improving the characterization of the prior distribution for the allele frequencies, adjusting the estimates of genetic diversity, and estimating the range of microsatellite alleles.
A major topic in population genetics is the characterization of the distribution of allele frequencies for a population. Some theoretical results under different evolutionary forces have been proposed (![]()
![]()
![]()
Mi=1p2i, where pi is the frequency for the ith allele, faces the same problem. The parameter M is usually estimated by the number of alleles observed in a sample. This will, of course, underestimate the true allele number.
The same problem has been recognized for a long time by ecologists who want to use a sample to estimate the number of species (or individuals) in a population. Since ![]()
![]()
![]()
Most of the estimating methods are based on sampling theory. If the underlying population is finite, it is natural to use the hypergeometric model. When the population size is large, however, the multinomial model is a good approximate model to use. Several methods have been proposed under the multinomial model (![]()
![]()
![]()
We describe how to obtain estimators based on the sample coverage method, which is followed by simulation studies using the coalescent process under different mutation models. Examples are given to illustrate applications of this method.
| METHOD |
|---|
Suppose there are M different alleles for a locus in a population. A random sample of n alleles is drawn from the population. Let Xi be the number of the ith type of allele observed in the sample and D the number of different observed allele types. Furthermore, let fj be the number of alleles that have j representatives in the sample. It is easy to see that D =
nj=1fj and n =
Mi=1 Xi =
nj=1jfj. If the ith allele type has frequency pi in the population, then under the equal frequencies assumption (i.e., p1 = p2 = · · · = pM =
), the likelihood for the given sample is

The maximum-likelihood estimator
for M can be derived as
![]() |
(1) |
On the other hand, the probability mass function of the number U of unseen allele types is

(![]()
= Me-n/M. This distribution converges to a Poisson (
) distribution as n increases. But, since at least one type will appear in the sample, the appropriate distribution for U is actually the truncated distribution of PU, i.e.,

Taking the expectation of U, and by suitable transformation, we have


or
![]() |
(2) |
From Equation 1 and Equation 2, the approximate maximum-likelihood estimate
of M satisfies

(![]()
![]()

The assumption of equiprobable frequencies is usually unrealistic and
will therefore underestimate the number of allele types. To solve this problem, several works have proposed different distributions to model the so-called "capture" probability for each class (e.g., species, allele types; ![]()
Instead of estimating the number of classes M directly, if we can estimate the percentage (denoted by C) of classes that are represented in the sample, the quantity D/C, the ratio of the observed classes and their total percentage, can serve as an estimate for the parameter M. A formal definition for the parameter C, namely, sample coverage, is

This is the sum of the probabilities of classes observed in a sample. Now if all allele types have the same frequency in the population, i.e., p1 = p2 = · · · = pM =
, then

If we can estimate the sample coverage C, the estimate of M will follow directly. The quantity C has been well studied. ![]()
![]()
![]()
![]()

Under the equal probability case, we have
![]() |
(3) |
(![]()
![]()
![]() |
(4) |
where
= [
i(pi -
)2/M]1/2/
is the coefficient of variation (CV) and is always
0. By observing that
![]() |
(5) |
we can substitute Equation 5 into Equation 4 and get the estimating function
![]() |
(6) |
![]()

By substituting Equation 3 and fj into the formula, we can get an estimate for
2:
![]() |
(7) |
Replacing the expected quantities by observed values and combining with Equation 7 leads to an estimate of M:
![]() |
(8) |
The bias of
2 is greater when
is large. An adjusted estimator,
2, of
2 is recommended by ![]()
![]() |
(9) |
where

For the variance of the estimators, recall that all of the quantities used in the estimators are functions of the multinomial counts (f0, f1, · · · , fn). Note that under this setting, the sample size n =
nj=1jfj is also a random variable. The asymptotic variance for the estimator can be derived using standard asymptotic approach,
![]() |
(10) |
where

| SIMULATION STUDY |
|---|
Simulation studies were performed under the following three mutation models:
- RMM: Every mutation produces a preexisting type of allele. It is a reversible process. There is no restriction for an allele to mutate to another type as long as the mutation rate toward a particular type is not zero.
- Infinite-allele model (IAM): Each mutation creates a new allele type.
- Stepwise mutation model (SMM): Each mutation is more likely to change to its adjacent type(s).
Note that, even under the RMM where the number of alleles is set up beforehand, the actual number of alleles in a population is unknown because of the evolutionary process. For each simulated population, we may have different numbers of alleles. The quantity of interest is the actual number of alleles in the simulated population rather than the number of all possible alleles.
The simulation consists of two levels. In the first level, we simulate a population of size 10,000 and calculate allele types and frequencies on the basis of the mutation model. For the second level, we randomly choose a sample of size 100 from the simulated population and use our method to estimate the number of alleles in that particular population. We take 400 samples from each of 400 simulated populations.
The simulation algorithm for the first level is based on the coalescent process (![]()
![]()
![]() |
(11) |
where
ki and Dki are the estimate and the observed number of alleles for the kth sample from the ith simulated population and
simi is the actual number of alleles in the ith simulated population. For each population, we plot these three quantities against the number of alleles in that population. Note that, because of the evolutionary process, the CV among allele frequencies is usually large. Since the first two estimators
1 and
2 usually underestimate the true M when CV is large (![]()
3 is used in the simulation studies to measure the performance of the sample coverage method.
Simulation results under RMM:
Since the magnitude of CV is reported to be an important effect on performance, particular configurations of pi's are selected to reflect different CV levels. We chose Zipf's law to specify allele frequencies because "Zipf's laws are probability distributions on the positive integers which decay algebraically. Such laws have been shown empirically to describe ... the distribution of biological genera and species" (![]()
ci-(1+
) as i
. We select
= 0 so that pi = ci-1.
Under this particular model, suppose the rate for all other allele types mutating to a particular type i is
i. When the population reaches equilibrium, the allele frequency for allele i is
i/
., where
. =
Mi=1
i. In our simulation, we chose
i = 0.0001
so that the slightly adjusted Zipf's law is
![]() |
(12) |
The role of b here is to adjust the CV of allele frequencies. The larger the b, the smaller the CV. We chose b to be 1, 10, and 100 to represent high, medium, and low CV values, respectively.
From Fig 1 and Fig 2, we observe that the biases of estimates are all close to 0 when M is small but the estimates exhibit some negative bias when M is large. This also shows that the number of unobserved alleles could be very large. We also note that, when CV is large, the standard error of the estimator is also large.
|
|
Simulation results under IAM:
In the infinite-allele case, the parameter
= 4Nµ is selected to be {10, 5, 1}. The results are shown in Fig 3. From the third column of Fig 3, we see that the difference between the observed and estimated number of alleles increases as
values increase. This is reasonable since, when the mutation rate increases, more distinct alleles are expected in the population under the IAM.
|
Simulation results under SMM:
Under the SMM, we adopted the model proposed by ![]()

i.e., the transition probability from type i to type j depends on the absolute value of the size difference (|i - j|). If only the size change is concerned, the distribution of size change is geometrically distributed. Since

the parameter
is the probability of increasing size when a mutation occurs. Hence, if
> 0.5, mutations tend to increase the size. The parameter P has the effect of determining the magnitude of size change. The size change reaches its maximum when P = 0.5 and becomes smaller when P approaches its boundaries (0 or 1).
Although there is no restriction on the size change under the model, observations show that each mutation is more likely to change an allele to an adjacent allele. Fu and Chakraborty also set up a maximum-mutation step when the transition probability greater than that step is less than some critical value
. The maximum step s is determined by

In our simulation study, we set
= 0.5 and chose two P values = {0.5, 0.8} and two
values = {5, 10}. Results are shown in Fig 4. The
value was set to 0.001, so that the maximum steps for P = 0.5 and 0.8 are eight and four, respectively. Simulation results for different
values are very similar to Fig 4 and are not shown here. On the basis of Fig 4, we observe that, even though we set P = 0.5 and each mutation can change an allele by up to 8 units, the number of alleles observed in the sample is still very limited. Under the same mutation rate and sample size, we can observe many more alleles under IAM than SMM. As a result, the estimates are not very different from the observed numbers.
|
| EXAMPLES AND APPLICATIONS |
|---|
![]()
|
Under the IAM, ![]()
![]() |
(13) |
where
(1 - x)
-1x-1 is the distribution of frequencies of allele frequencies. Using the information listed in Table 1 together with Equation 13 we can get the theoretical number of alleles for the population. Table 2 shows the theoretical allele numbers and our three estimates. The estimated allele numbers are generally close to MCK.
|
Note that, in some cases, MCK is smaller than the observed number. This reflects the problem that the two parameters, effective population size and mutation rate, are generally difficult to estimate and have high uncertainty. One advantage of the sample coverage method is that we do not need to estimate the effective population size and mutation rate before estimating the allele numbers.
Besides the direct applications in improving the estimation of gene diversity and characterization of prior distribution of allele frequencies, under SMM, we can use this information to predict the range of allele sizes as well.
| DISCUSSION |
|---|
A drawback of the sample coverage method is that, when the allele number is large but the sample size is relatively small and all the alleles in the data are different (i.e., D = f1 = n), we have
= 0 and hence
=
. Although in this case we would expect there are many alleles in the population, this estimate is not useful. For this situation, ![]()
low = D +
for the lower bound of M. We use this estimate when simulations would give an infinite estimate. In real data, fortunately, this kind of situation is almost impossible.
We also note that when the CV is large, the estimates still have a large negative bias. One way to conquer this problem is to use the stabilization technique (![]()
=
+ r. Since the alleles with high frequencies are not used in estimation, the CV will be reduced significantly. The reason that those alleles with higher frequencies can be removed from the estimation process is that, since they are so abundant, observing them will not provide any useful information about unseen alleles and hence they are not helpful in estimating the total number. The choice of k is somewhat arbitrary. Simulations show that k = 10 works well (data not shown). We should be careful, however, in applying this technique to real genetic data, because the variation of frequencies among alleles could be very large. For small samples, removing alleles with >10 copies from the data may result in an unrealistically large estimate because only a very small amount of data is used in the estimation. Hence, we did not use this technique in Table 2.
Besides these technical problems, some concerns still need to be addressed. For example, what should we do when some alleles have been reported previously but are not in the present sample? If there are several samples available, should we combine them to give a larger sample or should we use them separately? For the first problem, assuming there are L alleles we know exist but are not in our sample, possible solutions are
' =
+ L, or - replace n by n' = n + L, D' = D + L, and f'1 = f1 + L to obtain
', or - since
has already estimated the unseen alleles from the data, it should already include those L alleles. We need only adjust our estimates when
< D + L and set
' = D + L.
For the second problem, besides the method of combining data, if the samples are from contemporary generations, we can regard them all as a sample from a capture-recapture experiment. We may use any of several methods (e.g., ![]()
![]()
![]()
![]()
| ACKNOWLEDGMENTS |
|---|
The authors thank anonymous reviewers for their help. This work was supported in part by National Institutes of Health grant GM 45344.
Manuscript received September 29, 2000; Accepted for publication August 31, 2001.
| LITERATURE CITED |
|---|
BUNGE, J. and M. FITZPATRICK, 1993 Estimating the number of species: a review. J. Am. Stat. Assoc. 88:364-373.
BUNGE, J., M. FITZPATRICK, and J. HANDLEY, 1995 Comparison of 3 estimators of the number of species. J. Appl. Stat. 22:45-59.
CHAO, A., 1984 Nonparametric estimation of the number of the classes in a population. Scand. J. Stat. 11:265-270.
CHAO, A. and S.-M. LEE, 1992 Estimating the number of classes via sample coverage. J. Am. Stat. Assoc. 87:210-217.
CHAO, A. and P. TSAY, 1998 A sample coverage approach to multiple-system estimation with application to census undercount. J. Am. Stat. Assoc. 93:283-293.
CHAO, A., W. H. HWANG, Y. C. CHEN, and C. Y. KUO, 2000 Estimating the number of shared species in two communities. Stat. Sinica 10:227-246.
CHEN, W.-C., 1980 On the weak form of Zipf's law. J. Appl. Probab. 17:611-622.
CROW, J., and M. KIMURA, 1970 Introduction to Population Genetics Theory. Harper & Row, New York.
DARROCH, J. N. and D. RATCLIFF, 1958 A note on capture and recapture estimation. Biometrics 36:149-153.
ENGEN, S., 1978 Stochastic Abundance Model. Chapman & Hall, London.
ESTOUP, A., L. GARNERY, M. SOLIGNAC, and J. CORNUET, 1995 Microsatellite variation in honey bee (Apis Mellifera L.) populations: hierarchical genetic structure and test of the infinite allele and stepwise mutation models. Genetics 140:679-695[Abstract].
ESTY, W. W., 1982 Confidence intervals for the coverage of low coverage samples. Ann. Stat. 11:905-912.
ESTY, W. W., 1986 The efficiency of Good's nonparametric coverage estimator. Ann. Stat. 14:1257-1260.
FELLER, W., 1950 An Introduction to Probability Theory and Its Applications. John Wiley and Sons, New York.
FISHER, R., A. CORBET, and C. WILLIAMS, 1943 The relation between the number of species and the number of individuals in a random sample of an animal population. J. Anim. Ecol. 12:1257-1260.
FU, Y.-X. and R. CHAKRABORTY, 1998 Simultaneous estimation of all the parameters of a stepwise mutation model. Genetics 150:487-497
GOOD, I. J., 1953 On the population frequencies of species and the estimation of population parameters. Biometrika 40:237-264
GOOD, I. J. and G. TOULMIN, 1956 The number of new species and the increase of population coverage when a sample is increased. Biometrika 43:45-63
GRIFFITHS, R. C., 1979 A transition density expansion for a multi-allele diffusion model. Adv. Appl. Probab. 11:310-325.
HAAS, P. and L. STOKES, 1998 Estimating the number of classes in a finite population. J. Am. Stat. Assoc. 93:1475-1487.
HARRIS, B., 1968 Statistical inference in the classical occupancy problem: unbiased estimation of the number of classes. J. Am. Stat. Assoc. 63:837-847.
HUDSON, R. R., 1993 The how and why of generating gene genealogies, pp. 2336 in Mechanisms of Molecular Evolution, edited by N. TAKAHATA and A. G. CLARK. Sinauer Associates, Sunderland, MA.
KINGMAN, J. F. C., 1982 The coalescent. Stoch. Proc. Appl. 13:235-248.
LEE, S. and A. CHAO, 1994 Estimating population-size via sample coverage for closed capture-recapture models. Biometrics 50(1):88-97.
LEWONTIN, R. and T. PROUT, 1956 Estimation of the number of different classes in a population. Biometrics 12:211-223.
NORRIS, J. and K. POLLOCK, 1996 Nonparametric mle under two closed capture recapture models with heterogeneity. Biometrics 52(2):639-649.
WEIR, B. S., 1996 Genetic Data Analysis II. Sinauer Associates, Sunderland, MA.
This article has been cited by other articles:
![]() |
S. C. Wang and P. Dodson Estimating the diversity of dinosaurs PNAS, September 12, 2006; 103(37): 13601 - 13605. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. R. Miller and L. P. Waits The history of effective population size and genetic diversity in the Yellowstone grizzly (Ursus arctos): Implications for conservation PNAS, April 1, 2003; 100(7): 4334 - 4339. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. Guinand, A. Topchy, K. S. Page, M. K. Burnham-Curtis, W. F. Punch, and K. T. Scribner Comparisons of Likelihood and Machine Learning Methods of Individual Classification J. Hered., July 1, 2002; 93(4): 260 - 269. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. A. Kuznetsov, G. D. Knott, and R. F. Bonner General Statistics of Stochastic Process of Gene Expression in Eukaryotic Cells Genetics, July 1, 2002; 161(3): 1321 - 1332. [Abstract] [Full Text] [PDF] |
||||
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Huang, S.-P.
- Articles by Weir, B. S.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Huang, S.-P.
- Articles by Weir, B. S.



















