help button home button Genetics Email Content Delivery
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS

This Article
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Anderson, E. C.
Right arrow Articles by Scheet, P. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Anderson, E. C.
Right arrow Articles by Scheet, P. A.
Genetics, Vol. 158, 1383-1386, July 2001, Copyright © 2001


Letter to the Editor

Improving the Estimation of Bacterial Allele Frequencies

Eric C. Andersona and Paul A. Scheetb
a Interdisciplinary Program in Quantitative Ecology and Resource Management, University of Washington, Seattle, Washington 98195
b Department of Statistics, University of Washington, Seattle, Washington 98195

Corresponding author: Eric C. Anderson, Department of Statistics, University of Washington, Box 354322, Seattle, WA 98195., eriq{at}stat.washington.edu (E-mail)

RANNALA et al. (2000) develop a useful Poisson process model for estimating allele frequencies in bacterial populations using presence/absence data on infected hosts. However, they also show that the estimators they derive suffer from bias. In this letter we derive different estimators from the same model and show these estimators to be preferable.

RANNALA et al. 2000 Down write the log-likelihood function for allele frequencies p = (p1, ... , pk) and Poisson rate parameter {lambda} as

(1)

where zj is the number of sampled hosts infected by bacteria with allele j, j = 1, ... , k, and where n is the total number of infected hosts sampled. This model arises by considering only infected hosts and by assuming that the distribution of the number of bacterial lineages infecting each host independently follows a Poisson distribution with mean {lambda} and that the allelic type of each lineage is drawn from the k types according to the vector of probabilities p. Although RANNALA et al. 2000 Down note that a rigorous statistical approach to finding estimators for p and {lambda} would require maximizing the likelihood jointly over those parameters, they employ an heuristic maximization method to derive their estimators. Their method does not obtain the maximum-likelihood estimators for p and {lambda}, because it does not properly account for the constraint that the allele frequencies sum to unity.

The true maximum-likelihood estimators (MLEs) from Equation 1 may be found by performing a constrained optimization by the method of LaGrange multipliers. We show this in the next section. In most practical situations, however, data will also be available on the number of uninfected hosts. Using those data improves and also simplifies the estimation of p and {lambda} as described in ESTIMATION FROM THE WHOLE SAMPLE OF HOSTS. The MLEs derived in both of the following sections are shown to be less biased than the estimators given in RANNALA et al. 2000 Down.

CONSTRAINED OPTIMIZATION
Any values of p and {lambda} that maximize (1) while satisfying the constraint {sum}kj=1 pj = 1 will also be maximizers of the equation

(2)

where µ is an arbitrary nonzero constant (µ is called the "LaGrange multiplier" in this context). The MLEs may be found by setting the first partial derivatives of (2) with respect to p and {lambda} equal to 0 and solving for p, {lambda}, and µ. Thus we have

(3)


(4)

By adding and subtracting terms of (pjµ)/{lambda}, (4) may be rewritten as

(5)

In (5), the part in braces is equal to ()(), which is equal to 0. So, we may solve for µ:

By substituting this value for µ into (3), the MLE for pj is found to be

(6)

and the MLE for {lambda} follows from the fact that the pj sum to 1:

(7)

The system of (6) and (7) cannot be solved explicitly, but it is straightforward to find and iteratively to arbitrary precision. For reasons that become clear shortly, we call these estimators the partial sample MLEs.

ESTIMATION FROM THE WHOLE SAMPLE OF HOSTS
The quantity n/(1 - e-{lambda}) in Equation 7 can be seen to estimate the total number of hosts sampled, both infected and uninfected. In fact, it is possible to improve the estimation of p and {lambda} by including information, which should typically be available, on the number of uninfected hosts sampled. Since RANNALA et al. 2000 Down use only the infected hosts, their log-likelihood (1) contains the term -n log(1 - e-{lambda}), which arises from conditioning upon the hosts being infected. If one uses the information in both infected and uninfected hosts, that term vanishes from the log-likelihood. Doing so, and letting M denote the total number of hosts (infected and uninfected) sampled, the log-likelihood becomes

Since the pj and {lambda} occur together only as the products {lambda}pj in this expression, the unconstrained maximization procedure pursued by Rannala et al. would be appropriate, and it follows that the MLEs (which we call the complete sample MLEs) are

(8)


(9)

These expressions are identical to those in RANNALA et al. 2000 Down except that M is used in place of n.

Another, more intuitive, derivation of the complete sample MLEs can be given: Let 1 - qj be the probability that a host carries no bacteria of allelic type j. Then, detecting the presence or absence of allelic type j in a host is a Bernoulli trial with probability of success (i.e., probability of presence of the allele) qj. Since each host sampled is an independent Bernoulli trial, the number of hosts infected with allelic type j from a total sample size of M is binomially distributed, and the MLE of qj is just the familiar MLE for a binomial proportion, j = zj/M. By the assumptions of the infection model, 1 - qj is given by the zero term of the Poisson distribution, e-{lambda}pj. So, by the invariance of MLEs to transformation, the MLE of the product, {lambda}pj, can be found by solving j = 1 - e-. Doing so and then imposing the constraint that {sum}kj=1 pj = 1 gives the estimators in (8) and (9).

If M, the total number of hosts sampled, is available, (8) and (9) are the preferred estimators for two reasons. First, being available in closed form, it is not necessary to iteratively compute the complete sample MLEs as it is for the partial sample MLEs. And second, it can be shown that if M is known, and {lambda} unknown, then M and the set of zj (j = 1, ... , k) are the minimal sufficient statistics for the parameters {lambda} and p. Since M is not a deterministic function of n it follows from the definition of minimal sufficiency that n and the set of zj (j = 1, ... , k) are not sufficient for {lambda} and p, and so the partial sample MLE is not based on all the available information unless the investigator failed to record or does not actually know M. Since estimators based on sufficient statistics typically have smaller variance than those that are not based on sufficient statistics, the complete sample MLE should be used when M is known. Of course, if only n and the zj's are known, then, in that context, the partial sample MLE is based on the sufficient statistic and should be used. Simulations (Fig 2) confirm that the complete sample MLE has lower variance than the partial sample MLE.



View larger version (21K):
[in this window]
[in a new window]
[Download PPT slide]
 
Figure 1. Figures from simulations like those in RANNALA et al. 2000 Down for the case {lambda} = 2. R indicates the estimator from Rannala et al. P denotes the partial sample MLE (Equation 6 and Equation 7), and C denotes the complete sample MLE (Equation 8 and Equation 9). We show results for just one sample size, n = 100 infected hosts. For the C estimates, the additional uninfected hosts are used in the estimation. (a) Mean values of the estimates for {lambda} from data on a three-allele locus with p2 = p3 = 0.5(1 - p1) based on 50,000 simulated replicates. The R estimator consistently overestimates {lambda}, while both the partial and complete sample estimators estimate {lambda} well (though the partial sample MLE tends to overestimate {lambda} when the allele frequencies are highly uneven). (b) Percentage of standardized bias for the estimate of p1 in the three-allele locus described above. The arching curve is the same found in Figure 1 of RANNALA et al. 2000 Down. The other two curves show that the estimators we propose estimate p with very little bias.



View larger version (24K):
[in this window]
[in a new window]
[Download PPT slide]
 
Figure 2. Estimated variance of the partial and complete sample MLEs based on 50,000 replicates. The variance of the complete sample MLE is lower, especially for uneven allele frequencies.

When {lambda} is large, so that most of the sampled hosts are infected, the complete sample and partial sample MLEs will differ little from the estimator derived by RANNALA et al. 2000 Down. They will differ more from the estimator in RANNALA et al. 2000 Down when {lambda} is small and many of the hosts sampled are not infected by any bacteria. We have empirically found that the MLEs we derive usually give estimates that are between the estimates from Rannala et al.'s estimator and the "uncorrected estimates" of zi. Our MLEs do not fall between the other two estimators only at alleles for which all the estimators give very similar frequency estimates. Hence, while the uncorrected estimator used in WANG et al. 1999 Down and QIU et al. 1997 Down tends to overestimate low-frequency alleles and underestimate high-frequency alleles, the estimator proposed by RANNALA et al. 2000 Down overcorrects the problem, underestimating low-frequency alleles and overestimating high-frequency alleles. As our simulations show, the estimators we propose don't suffer from this bias.

Programs to compute the partial sample and complete sample MLEs described above may be downloaded from http://www.rannala.org.

SIMULATIONS AND DATA
We repeated a small set of simulations like those carried out in RANNALA et al. 2000 Down. Briefly, hosts were infected at rate {lambda} = 2 by k = 3 different allelic types. Hosts were sampled from this population until n = 100 infected hosts had been sampled. The total number, M, of hosts sampled to achieve 100 infected ones was also recorded for each simulated sample so that we could compute the complete sample MLE. Simulations were done for nine different sets of allele frequencies in which p1 was taken from {0.1, 0.2, ... , 0.9} and p2 = p3 = 0.5(1 - p1). For each set of allele frequencies, 50,000 replicate samples were drawn and the various estimators were computed for each one. Occasionally, when p1 was large, a sample would be drawn with z1 = n. In such a case the partial sample MLE is undefined, and so we discarded these samples (thus, 9 were discarded when p1 = 0.8 and 1483 when p = 0.9). The results are summarized in Fig 1 and Fig 2. They demonstrate that the MLEs we derived here are less biased than the estimators given in RANNALA et al. 2000 Down, and they demonstrate that the variance of the complete sample MLE is smaller than that of the partial sample MLE (albeit only slightly for {lambda} = 2).

We also reanalyzed the tick data given in Table 1 of RANNALA et al. 2000 Down, using the partial sample MLE. Our estimates of the allele frequencies for all of the alleles at ospA and all of the alleles except allele D at ospC were intermediate to the "uncorrected estimate" and the Rannala et al. estimate. The three different estimates for allele D at ospC differed only at the fourth decimal place. Our method estimates that the Borrelia burgdorferi allele frequencies are somewhat more uniform than inferred by RANNALA et al. 2000 Down.

Though the model presented by RANNALA et al. 2000 Down was used only in estimating allele frequencies of Borrelia found in ticks, it may be useful in other host/parasite systems. Further, by defining "hosts" to be samples of DNA pooled from different sources, the model, or some altered version of it, may be useful in other situations involving the estimation of allele frequencies from presence/absence data.

ACKNOWLEDGMENTS

We thank Bruce Rannala for comments on a draft of this letter and the members of the statistical genetics seminar group in the Departments of Statistics and Biostatistics at the University of Washington for helpful discussions, particularly Elizabeth Thompson, Matthew Stephens, Andrew George, and Anne-Louise Leutenegger. We also thank two anonymous referees for helpful and insightful comments. E.C.A. was supported by National Science Foundation grant BIR-9807747 and P.A.S. was supported by National Institutes of Health grant T32 HG00035-06.

Manuscript received January 9, 2001; Accepted for publication April 30, 2001.

LITERATURE CITED

QIU, W.-G., E. M. BOSLER, J. R. CAMPBELL, G. D. UGINE, and I.-N. WANG et al., 1997  A population genetic study of Borrelia burgdorferi sensu stricto from eastern Long Island, New York, suggested frequency-dependent selection, gene flow and host adaptation. Hereditas 127:203-216[Medline].

RANNALA, B., W.-G. QIU, and D. E. DYKHUIZEN, 2000  Methods for estimating gene frequencies and detecting selection in bacterial populations. Genetics 155:499-508[Abstract/Full Text].

WANG, I., D. E. DYKHUIZEN, W. QIU, J. J. DUNN, and E. M. BOSLER et al., 1999  Genetic diversity of ospC in a local population of Borrelia burgdorferi sensu stricto. Genetics 151:15-30[Abstract/Full Text].




This article has been cited by other articles:


Home page
GeneticsHome page
W.-G. Qiu, D. E. Dykhuizen, M. S. Acosta, and B. J. Luft
Geographic Uniformity of the Lyme Disease Spirochete (Borrelia burgdorferi) and Its Shared History With Tick Vector (Ixodes scapularis) in the Northeastern United States
Genetics, March 1, 2002; 160(3): 833 - 849.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Anderson, E. C.
Right arrow Articles by Scheet, P. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Anderson, E. C.
Right arrow Articles by Scheet, P. A.


HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
Copyright © 2001 by the Genetics Society of America.