Genetics, Vol. 159, 1365-1373, November 2001, Copyright © 2001

Estimating the Total Number of Alleles Using a Sample Coverage Method

Shu-Pang Huanga and B. S. Weira
a Program in Statistical Genetics, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695-7566

Corresponding author: B. S. Weir, Bioinformatics Research Ctr., Campus Box 7566, North Carolina State University, Raleigh, NC 27695-7566., weir{at}stat.ncsu.edu (E-mail)

Communicating editor: M. A. ASMUSSEN


*  ABSTRACT
*TOP
*ABSTRACT
*METHOD
*SIMULATION STUDY
*EXAMPLES AND APPLICATIONS
*DISCUSSION
*LITERATURE CITED

Previously reported methods for estimating the number of different alleles at a single locus in a population have not described a useful general result. Using the number of alleles observed in a sample gives an underestimate for the true number of alleles. The similar problem of estimating the number of species in a population was first investigated in 1943. In this article we use the sample coverage method proposed by Chao and Lee in 1992 to estimate the number of alleles in a population when there are unequal allele frequencies. Simulation studies under the recurrent mutation model show that, for reasonable sample sizes, a significantly better estimate of the true number can be obtained than that using only the observed alleles. Results under the stepwise mutation model and infinite-allele model are presented. Possible applications include improving the characterization of the prior distribution for the allele frequencies, adjusting the estimates of genetic diversity, and estimating the range of microsatellite alleles.


A major topic in population genetics is the characterization of the distribution of allele frequencies for a population. Some theoretical results under different evolutionary forces have been proposed (CROW and KIMURA 1970 Down). For example, under the recurrent mutation model (RMM), which assumes that every mutation will lead only to another preexisting allele, the stationary distribution for allele frequencies will be the Dirichlet distribution (GRIFFITHS 1979 Down). This model assumes that the number of alleles M is known. Unfortunately, we do not know this number in general. Many other population genetic parameters are also associated with allele numbers. For instance, the genetic diversity (WEIR 1996 Down), defined as 1 - {Sigma}Mi=1p2i, where pi is the frequency for the ith allele, faces the same problem. The parameter M is usually estimated by the number of alleles observed in a sample. This will, of course, underestimate the true allele number.

The same problem has been recognized for a long time by ecologists who want to use a sample to estimate the number of species (or individuals) in a population. Since FISHER et al. 1943 Down first proposed a statistical model to estimate the number of species in a population, it has been an active research field with applications in many other fields. For example, LEWONTIN and PROUT 1956 Down derived a maximum-likelihood estimator under the assumption of equal frequencies and applied it to estimate the number of genes on a chromosome. Several methods have been proposed to manage the unequal frequencies situation, including both parametric and nonparametric approaches under frequentist or Bayesian philosophies (BUNGE and FITZPATRICK 1993 Down). It is quite straightforward to relate our problem to theirs if we treat allele types as species.

Most of the estimating methods are based on sampling theory. If the underlying population is finite, it is natural to use the hypergeometric model. When the population size is large, however, the multinomial model is a good approximate model to use. Several methods have been proposed under the multinomial model (BUNGE and FITZPATRICK 1993 Down). We use the sample coverage (SC) method proposed by CHAO and LEE 1992 Down. It is a nonparametric method and its performance is better than other methods (BUNGE et al. 1995 Down).

We describe how to obtain estimators based on the sample coverage method, which is followed by simulation studies using the coalescent process under different mutation models. Examples are given to illustrate applications of this method.


*  METHOD
*TOP
*ABSTRACT
*METHOD
*SIMULATION STUDY
*EXAMPLES AND APPLICATIONS
*DISCUSSION
*LITERATURE CITED

Suppose there are M different alleles for a locus in a population. A random sample of n alleles is drawn from the population. Let Xi be the number of the ith type of allele observed in the sample and D the number of different observed allele types. Furthermore, let fj be the number of alleles that have j representatives in the sample. It is easy to see that D = {Sigma}nj=1fj and n = {Sigma}Mi=1 Xi = {Sigma}nj=1jfj. If the ith allele type has frequency pi in the population, then under the equal frequencies assumption (i.e., p1 = p2 = · · · = pM = ), the likelihood for the given sample is

The maximum-likelihood estimator for M can be derived as

(1)

(FELLER 1950 Down).

On the other hand, the probability mass function of the number U of unseen allele types is

(FELLER 1950 Down), where {lambda} = Me-n/M. This distribution converges to a Poisson ({lambda}) distribution as n increases. But, since at least one type will appear in the sample, the appropriate distribution for U is actually the truncated distribution of PU, i.e.,

Taking the expectation of U, and by suitable transformation, we have


or

(2)

From Equation 1 and Equation 2, the approximate maximum-likelihood estimate of M satisfies

(LEWONTIN and PROUT 1956 Down). HARRIS 1968 Down obtained the asymptotic variance for this estimator

The assumption of equiprobable frequencies is usually unrealistic and will therefore underestimate the number of allele types. To solve this problem, several works have proposed different distributions to model the so-called "capture" probability for each class (e.g., species, allele types; ENGEN 1978 Down). Although those parametric approaches can deal with the heterogeneous problem in some way, they are still highly dependent on the suitability of the parametric model.

Instead of estimating the number of classes M directly, if we can estimate the percentage (denoted by C) of classes that are represented in the sample, the quantity D/C, the ratio of the observed classes and their total percentage, can serve as an estimate for the parameter M. A formal definition for the parameter C, namely, sample coverage, is

This is the sum of the probabilities of classes observed in a sample. Now if all allele types have the same frequency in the population, i.e., p1 = p2 = · · · = pM = , then

If we can estimate the sample coverage C, the estimate of M will follow directly. The quantity C has been well studied. GOOD 1953 Down and ESTY 1982 Down, ESTY 1986 Down used the following estimator proposed by Turing (GOOD 1953 Down):

Under the equal probability case, we have

(3)

(DARROCH and RATCLIFF 1958 Down). Compared to the maximum-likelihood estimator (MLE) under the equiprobable population, this estimator is very efficient. Both estimators, however, suffer the same problem of underestimating M when the pi's are not all equal. But in the definition of C, we did not require all pi's to be the same. CHAO and LEE 1992 Down therefore proposed the following approach to obtain an adjusted estimator for M. They used a Taylor series to expand E(D)/E(C) up to the second order with respect to the equal probability point p1 = p2 = · · · = pM = 1/M. This provides

(4)

where {gamma} = [{Sigma}i(pi - )2/M]1/2/ is the coefficient of variation (CV) and is always >=0. By observing that

(5)

we can substitute Equation 5 into Equation 4 and get the estimating function

(6)

GOOD and TOULMIN 1956 Down obtained the equation

By substituting Equation 3 and fj into the formula, we can get an estimate for {gamma}2:

(7)

Replacing the expected quantities by observed values and combining with Equation 7 leads to an estimate of M:

(8)

The bias of 2 is greater when {gamma} is large. An adjusted estimator, 2, of {gamma}2 is recommended by CHAO and LEE 1992 Down,

(9)

where

For the variance of the estimators, recall that all of the quantities used in the estimators are functions of the multinomial counts (f0, f1, · · · , fn). Note that under this setting, the sample size n = {Sigma}nj=1jfj is also a random variable. The asymptotic variance for the estimator can be derived using standard asymptotic approach,

(10)

where


*  SIMULATION STUDY
*TOP
*ABSTRACT
*METHOD
*SIMULATION STUDY
*EXAMPLES AND APPLICATIONS
*DISCUSSION
*LITERATURE CITED

Simulation studies were performed under the following three mutation models:

  1. RMM: Every mutation produces a preexisting type of allele. It is a reversible process. There is no restriction for an allele to mutate to another type as long as the mutation rate toward a particular type is not zero.

  2. Infinite-allele model (IAM): Each mutation creates a new allele type.

  3. Stepwise mutation model (SMM): Each mutation is more likely to change to its adjacent type(s).

Note that, even under the RMM where the number of alleles is set up beforehand, the actual number of alleles in a population is unknown because of the evolutionary process. For each simulated population, we may have different numbers of alleles. The quantity of interest is the actual number of alleles in the simulated population rather than the number of all possible alleles.

The simulation consists of two levels. In the first level, we simulate a population of size 10,000 and calculate allele types and frequencies on the basis of the mutation model. For the second level, we randomly choose a sample of size 100 from the simulated population and use our method to estimate the number of alleles in that particular population. We take 400 samples from each of 400 simulated populations.

The simulation algorithm for the first level is based on the coalescent process (KINGMAN 1982 Down). HUDSON 1993 Down gave a general description of simulation methods. The results for each simulated population are summarized as

(11)

where ki and Dki are the estimate and the observed number of alleles for the kth sample from the ith simulated population and simi is the actual number of alleles in the ith simulated population. For each population, we plot these three quantities against the number of alleles in that population. Note that, because of the evolutionary process, the CV among allele frequencies is usually large. Since the first two estimators 1 and 2 usually underestimate the true M when CV is large (CHAO and LEE 1992 Down), only 3 is used in the simulation studies to measure the performance of the sample coverage method.

Simulation results under RMM:
Since the magnitude of CV is reported to be an important effect on performance, particular configurations of pi's are selected to reflect different CV levels. We chose Zipf's law to specify allele frequencies because "Zipf's laws are probability distributions on the positive integers which decay algebraically. Such laws have been shown empirically to describe ... the distribution of biological genera and species" (CHEN 1980 Down). Its form is pi ~ ci-(1+{alpha}) as i -> {infty}. We select {alpha} = 0 so that pi = ci-1.

Under this particular model, suppose the rate for all other allele types mutating to a particular type i is {nu}i. When the population reaches equilibrium, the allele frequency for allele i is {nu}i/{nu}., where {nu}. = {Sigma}Mi=1{nu}i. In our simulation, we chose {nu}i = 0.0001 so that the slightly adjusted Zipf's law is

(12)

The role of b here is to adjust the CV of allele frequencies. The larger the b, the smaller the CV. We chose b to be 1, 10, and 100 to represent high, medium, and low CV values, respectively.

From Fig 1 and Fig 2, we observe that the biases of estimates are all close to 0 when M is small but the estimates exhibit some negative bias when M is large. This also shows that the number of unobserved alleles could be very large. We also note that, when CV is large, the standard error of the estimator is also large.



View larger version (28K):
In this window
In a new window
Download PPT slide
 
Figure 1. The simulation study for the RMM when M = 20. The x-axes are the actual numbers of alleles obtained from the simulated populations and the y-axes are the quantities described in Equation 11.



View larger version (37K):
In this window
In a new window
Download PPT slide
 
Figure 2. The simulation study for the RMM when M = 100.

Simulation results under IAM:
In the infinite-allele case, the parameter {theta} = 4Nµ is selected to be {10, 5, 1}. The results are shown in Fig 3. From the third column of Fig 3, we see that the difference between the observed and estimated number of alleles increases as {theta} values increase. This is reasonable since, when the mutation rate increases, more distinct alleles are expected in the population under the IAM.



View larger version (39K):
In this window
In a new window
Download PPT slide
 
Figure 3. The simulation study for the IAM.

Simulation results under SMM:
Under the SMM, we adopted the model proposed by FU and CHAKRABORTY 1998 Down,

i.e., the transition probability from type i to type j depends on the absolute value of the size difference (|i - j|). If only the size change is concerned, the distribution of size change is geometrically distributed. Since

the parameter {alpha} is the probability of increasing size when a mutation occurs. Hence, if {alpha} > 0.5, mutations tend to increase the size. The parameter P has the effect of determining the magnitude of size change. The size change reaches its maximum when P = 0.5 and becomes smaller when P approaches its boundaries (0 or 1).

Although there is no restriction on the size change under the model, observations show that each mutation is more likely to change an allele to an adjacent allele. Fu and Chakraborty also set up a maximum-mutation step when the transition probability greater than that step is less than some critical value {epsilon}. The maximum step s is determined by

In our simulation study, we set {alpha} = 0.5 and chose two P values = {0.5, 0.8} and two {theta} values = {5, 10}. Results are shown in Fig 4. The {epsilon} value was set to 0.001, so that the maximum steps for P = 0.5 and 0.8 are eight and four, respectively. Simulation results for different {alpha} values are very similar to Fig 4 and are not shown here. On the basis of Fig 4, we observe that, even though we set P = 0.5 and each mutation can change an allele by up to 8 units, the number of alleles observed in the sample is still very limited. Under the same mutation rate and sample size, we can observe many more alleles under IAM than SMM. As a result, the estimates are not very different from the observed numbers.



View larger version (36K):
In this window
In a new window
Download PPT slide
 
Figure 4. The simulation study for the SMM.


*  EXAMPLES AND APPLICATIONS
*TOP
*ABSTRACT
*METHOD
*SIMULATION STUDY
*EXAMPLES AND APPLICATIONS
*DISCUSSION
*LITERATURE CITED

ESTOUP et al. 1995 Down published a set of microsatellite data from nine populations of honey bee (Apis Mellifera L.) subspecies. The adequacy of two mutation models usually used for microsatellite, IAM and RMM, are tested by comparing the observed and theoretical number of alleles. The authors claimed that the IAM fits the data better and argued that the reason may be because the majority of the microsatellites they used in the study contain repeats of several different length motifs. They also provided estimates of effective population size (2Ne) and mutation rates (µ) for each locus on the basis of some assumptions. We chose the subspecies Tiznit as an example to demonstrate the use of our method. The allele frequencies in Estoup et al.'s data are changed to allele counts to fit our need. The data are listed in Table 1.


 
View this table:
In this window
In a new window

 
Table 1. The data of ESTOUP et al. 1995 Down

Under the IAM, CROW and KIMURA 1970 Down derived the expected number of alleles that can be maintained in a population, MCK, to be

(13)

where {theta}(1 - x){theta}-1x-1 is the distribution of frequencies of allele frequencies. Using the information listed in Table 1 together with Equation 13 we can get the theoretical number of alleles for the population. Table 2 shows the theoretical allele numbers and our three estimates. The estimated allele numbers are generally close to MCK.


 
View this table:
In this window
In a new window

 
Table 2. Results for the data of ESTOUP et al. 1995 Down

Note that, in some cases, MCK is smaller than the observed number. This reflects the problem that the two parameters, effective population size and mutation rate, are generally difficult to estimate and have high uncertainty. One advantage of the sample coverage method is that we do not need to estimate the effective population size and mutation rate before estimating the allele numbers.

Besides the direct applications in improving the estimation of gene diversity and characterization of prior distribution of allele frequencies, under SMM, we can use this information to predict the range of allele sizes as well.


*  DISCUSSION
*TOP
*ABSTRACT
*METHOD
*SIMULATION STUDY
*EXAMPLES AND APPLICATIONS
*DISCUSSION
*LITERATURE CITED

A drawback of the sample coverage method is that, when the allele number is large but the sample size is relatively small and all the alleles in the data are different (i.e., D = f1 = n), we have = 0 and hence = {infty}. Although in this case we would expect there are many alleles in the population, this estimate is not useful. For this situation, CHAO 1984 Down derived an estimate low = D + for the lower bound of M. We use this estimate when simulations would give an infinite estimate. In real data, fortunately, this kind of situation is almost impossible.

We also note that when the CV is large, the estimates still have a large negative bias. One way to conquer this problem is to use the stabilization technique (HAAS and STOKES 1998 Down) that uses only those alleles appearing less than k times in the sample to obtain the estimates and then adding alleles having more than k representatives. For example, suppose we have r alleles appearing more than k times for a total of nr alleles in the sample. We first use the remaining data (n - nr) to estimate the allele number (M - r) and then add the r alleles back. Our estimates, therefore, will be =

+ r. Since the alleles with high frequencies are not used in estimation, the CV will be reduced significantly. The reason that those alleles with higher frequencies can be removed from the estimation process is that, since they are so abundant, observing them will not provide any useful information about unseen alleles and hence they are not helpful in estimating the total number. The choice of k is somewhat arbitrary. Simulations show that k = 10 works well (data not shown). We should be careful, however, in applying this technique to real genetic data, because the variation of frequencies among alleles could be very large. For small samples, removing alleles with >10 copies from the data may result in an unrealistically large estimate because only a very small amount of data is used in the estimation. Hence, we did not use this technique in Table 2.

Besides these technical problems, some concerns still need to be addressed. For example, what should we do when some alleles have been reported previously but are not in the present sample? If there are several samples available, should we combine them to give a larger sample or should we use them separately? For the first problem, assuming there are L alleles we know exist but are not in our sample, possible solutions are

  1. ' = + L, or

  2. replace n by n' = n + L, D' = D + L, and f'1 = f1 + L to obtain ', or

  3. since has already estimated the unseen alleles from the data, it should already include those L alleles. We need only adjust our estimates when < D + L and set ' = D + L.

For the second problem, besides the method of combining data, if the samples are from contemporary generations, we can regard them all as a sample from a capture-recapture experiment. We may use any of several methods (e.g., LEE and CHAO 1994 Down; NORRIS and POLLOCK 1996 Down) for capture-recapture data. Recently CHAO and TSAY 1998 Down proposed a method to estimate M under a multisystem sample scheme that may offer another possible solution. Currently, our study focuses on only one particular population. It may be more interesting to know the allele numbers for a whole species or how many alleles are shared between subspecies. CHAO et al. 2000 Down extended the method to this kind of multipopulation problem in ecology. Further studies are needed to adjust their method to population genetics.


*  ACKNOWLEDGMENTS

The authors thank anonymous reviewers for their help. This work was supported in part by National Institutes of Health grant GM 45344.

Manuscript received September 29, 2000; Accepted for publication August 31, 2001.


*  LITERATURE CITED
*TOP
*ABSTRACT
*METHOD
*SIMULATION STUDY
*EXAMPLES AND APPLICATIONS
*DISCUSSION
*LITERATURE CITED

BUNGE, J. and M. FITZPATRICK, 1993  Estimating the number of species: a review. J. Am. Stat. Assoc. 88:364-373.

BUNGE, J., M. FITZPATRICK, and J. HANDLEY, 1995  Comparison of 3 estimators of the number of species. J. Appl. Stat. 22:45-59.

CHAO, A., 1984  Nonparametric estimation of the number of the classes in a population. Scand. J. Stat. 11:265-270.

CHAO, A. and S.-M. LEE, 1992  Estimating the number of classes via sample coverage. J. Am. Stat. Assoc. 87:210-217.

CHAO, A. and P. TSAY, 1998  A sample coverage approach to multiple-system estimation with application to census undercount. J. Am. Stat. Assoc. 93:283-293.

CHAO, A., W. H. HWANG, Y. C. CHEN, and C. Y. KUO, 2000  Estimating the number of shared species in two communities. Stat. Sinica 10:227-246.

CHEN, W.-C., 1980  On the weak form of Zipf's law. J. Appl. Probab. 17:611-622.

CROW, J., and M. KIMURA, 1970 Introduction to Population Genetics Theory. Harper & Row, New York.

DARROCH, J. N. and D. RATCLIFF, 1958  A note on capture and recapture estimation. Biometrics 36:149-153.

ENGEN, S., 1978 Stochastic Abundance Model. Chapman & Hall, London.

ESTOUP, A., L. GARNERY, M. SOLIGNAC, and J. CORNUET, 1995  Microsatellite variation in honey bee (Apis Mellifera L.) populations: hierarchical genetic structure and test of the infinite allele and stepwise mutation models. Genetics 140:679-695[Abstract].

ESTY, W. W., 1982  Confidence intervals for the coverage of low coverage samples. Ann. Stat. 11:905-912.

ESTY, W. W., 1986  The efficiency of Good's nonparametric coverage estimator. Ann. Stat. 14:1257-1260.

FELLER, W., 1950 An Introduction to Probability Theory and Its Applications. John Wiley and Sons, New York.

FISHER, R., A. CORBET, and C. WILLIAMS, 1943  The relation between the number of species and the number of individuals in a random sample of an animal population. J. Anim. Ecol. 12:1257-1260.

FU, Y.-X. and R. CHAKRABORTY, 1998  Simultaneous estimation of all the parameters of a stepwise mutation model. Genetics 150:487-497[Abstract/Free Full Text].

GOOD, I. J., 1953  On the population frequencies of species and the estimation of population parameters. Biometrika 40:237-264[Abstract/Free Full Text].

GOOD, I. J. and G. TOULMIN, 1956  The number of new species and the increase of population coverage when a sample is increased. Biometrika 43:45-63[Abstract/Free Full Text].

GRIFFITHS, R. C., 1979  A transition density expansion for a multi-allele diffusion model. Adv. Appl. Probab. 11:310-325.

HAAS, P. and L. STOKES, 1998  Estimating the number of classes in a finite population. J. Am. Stat. Assoc. 93:1475-1487.

HARRIS, B., 1968  Statistical inference in the classical occupancy problem: unbiased estimation of the number of classes. J. Am. Stat. Assoc. 63:837-847.

HUDSON, R. R., 1993 The how and why of generating gene genealogies, pp. 23–36 in Mechanisms of Molecular Evolution, edited by N. TAKAHATA and A. G. CLARK. Sinauer Associates, Sunderland, MA.

KINGMAN, J. F. C., 1982  The coalescent. Stoch. Proc. Appl. 13:235-248.

LEE, S. and A. CHAO, 1994  Estimating population-size via sample coverage for closed capture-recapture models. Biometrics 50(1):88-97[Medline].

LEWONTIN, R. and T. PROUT, 1956  Estimation of the number of different classes in a population. Biometrics 12:211-223.

NORRIS, J. and K. POLLOCK, 1996  Nonparametric mle under two closed capture recapture models with heterogeneity. Biometrics 52(2):639-649.

WEIR, B. S., 1996 Genetic Data Analysis II. Sinauer Associates, Sunderland, MA.




This article has been cited by other articles:


Home page
Proc. Natl. Acad. Sci. USAHome page
S. C. Wang and P. Dodson
Estimating the diversity of dinosaurs
PNAS, September 12, 2006; 103(37): 13601 - 13605.
[Abstract] [Full Text] [PDF]


Home page
Proc. Natl. Acad. Sci. USAHome page
C. R. Miller and L. P. Waits
The history of effective population size and genetic diversity in the Yellowstone grizzly (Ursus arctos): Implications for conservation
PNAS, April 1, 2003; 100(7): 4334 - 4339.
[Abstract] [Full Text] [PDF]


Home page
J HeredHome page
B. Guinand, A. Topchy, K. S. Page, M. K. Burnham-Curtis, W. F. Punch, and K. T. Scribner
Comparisons of Likelihood and Machine Learning Methods of Individual Classification
J. Hered., July 1, 2002; 93(4): 260 - 269.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
V. A. Kuznetsov, G. D. Knott, and R. F. Bonner
General Statistics of Stochastic Process of Gene Expression in Eukaryotic Cells
Genetics, July 1, 2002; 161(3): 1321 - 1332.
[Abstract] [Full Text] [PDF]