| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Letter to the Editor |
Corresponding author: Eric C. Anderson, Department of Statistics, University of Washington, Box 354322, Seattle, WA 98195., eriq{at}stat.washington.edu (E-mail)
RANNALA et al. (2000) develop a useful Poisson process model for estimating allele frequencies in bacterial populations using presence/absence data on infected hosts. However, they also show that the estimators they derive suffer from bias. In this letter we derive different estimators from the same model and show these estimators to be preferable.
![]()
as
![]() |
(1) |
where zj is the number of sampled hosts infected by bacteria with allele j, j = 1, ... , k, and where n is the total number of infected hosts sampled. This model arises by considering only infected hosts and by assuming that the distribution of the number of bacterial lineages infecting each host independently follows a Poisson distribution with mean
and that the allelic type of each lineage is drawn from the k types according to the vector of probabilities p. Although ![]()
would require maximizing the likelihood jointly over those parameters, they employ an heuristic maximization method to derive their estimators. Their method does not obtain the maximum-likelihood estimators for p and
, because it does not properly account for the constraint that the allele frequencies sum to unity.
The true maximum-likelihood estimators (MLEs) from Equation 1 may be found by performing a constrained optimization by the method of LaGrange multipliers. We show this in the next section. In most practical situations, however, data will also be available on the number of uninfected hosts. Using those data improves and also simplifies the estimation of p and
as described in ESTIMATION FROM THE WHOLE SAMPLE OF HOSTS. The MLEs derived in both of the following sections are shown to be less biased than the estimators given in ![]()
CONSTRAINED OPTIMIZATION
Any values of p and
that maximize (1) while satisfying the constraint
kj=1 pj = 1 will also be maximizers of the equation
![]() |
(2) |
where µ is an arbitrary nonzero constant (µ is called the "LaGrange multiplier" in this context). The MLEs may be found by setting the first partial derivatives of (2) with respect to p and
equal to 0 and solving for p,
, and µ. Thus we have
![]() |
(3) |
![]() |
(4) |
By adding and subtracting terms of (pjµ)/
, (4) may be rewritten as
![]() |
(5) |
In (5), the part in braces is equal to (
)(
), which is equal to 0. So, we may solve for µ:

By substituting this value for µ into (3), the MLE for pj is found to be
![]() |
(6) |
and the MLE for
follows from the fact that the pj sum to 1:
![]() |
(7) |
The system of (6) and (7) cannot be solved explicitly, but it is straightforward to find
and
iteratively to arbitrary precision. For reasons that become clear shortly, we call these estimators the partial sample MLEs.
ESTIMATION FROM THE WHOLE SAMPLE OF HOSTS
The quantity n/(1 - e-
) in Equation 7 can be seen to estimate the total number of hosts sampled, both infected and uninfected. In fact, it is possible to improve the estimation of p and
by including information, which should typically be available, on the number of uninfected hosts sampled. Since ![]()
), which arises from conditioning upon the hosts being infected. If one uses the information in both infected and uninfected hosts, that term vanishes from the log-likelihood. Doing so, and letting M denote the total number of hosts (infected and uninfected) sampled, the log-likelihood becomes

Since the pj and
occur together only as the products
pj in this expression, the unconstrained maximization procedure pursued by Rannala et al. would be appropriate, and it follows that the MLEs (which we call the complete sample MLEs) are
![]() |
(8) |
![]() |
(9) |
These expressions are identical to those in ![]()
Another, more intuitive, derivation of the complete sample MLEs can be given: Let 1 - qj be the probability that a host carries no bacteria of allelic type j. Then, detecting the presence or absence of allelic type j in a host is a Bernoulli trial with probability of success (i.e., probability of presence of the allele) qj. Since each host sampled is an independent Bernoulli trial, the number of hosts infected with allelic type j from a total sample size of M is binomially distributed, and the MLE of qj is just the familiar MLE for a binomial proportion,
j = zj/M. By the assumptions of the infection model, 1 - qj is given by the zero term of the Poisson distribution, e-
pj. So, by the invariance of MLEs to transformation, the MLE of the product,
pj, can be found by solving
j = 1 - e-
. Doing so and then imposing the constraint that
kj=1 pj = 1 gives the estimators in (8) and (9).
If M, the total number of hosts sampled, is available, (8) and (9) are the preferred estimators for two reasons. First, being available in closed form, it is not necessary to iteratively compute the complete sample MLEs as it is for the partial sample MLEs. And second, it can be shown that if M is known, and
unknown, then M and the set of zj (j = 1, ... , k) are the minimal sufficient statistics for the parameters
and p. Since M is not a deterministic function of n it follows from the definition of minimal sufficiency that n and the set of zj (j = 1, ... , k) are not sufficient for
and p, and so the partial sample MLE is not based on all the available information unless the investigator failed to record or does not actually know M. Since estimators based on sufficient statistics typically have smaller variance than those that are not based on sufficient statistics, the complete sample MLE should be used when M is known. Of course, if only n and the zj's are known, then, in that context, the partial sample MLE is based on the sufficient statistic and should be used. Simulations (Fig 2) confirm that the complete sample MLE has lower variance than the partial sample MLE.
|
|
When
is large, so that most of the sampled hosts are infected, the complete sample and partial sample MLEs will differ little from the estimator derived by ![]()
![]()
is small and many of the hosts sampled are not infected by any bacteria. We have empirically found that the MLEs we derive usually give estimates that are between the estimates from Rannala et al.'s estimator and the "uncorrected estimates" of
zi. Our MLEs do not fall between the other two estimators only at alleles for which all the estimators give very similar frequency estimates. Hence, while the uncorrected estimator used in ![]()
![]()
![]()
Programs to compute the partial sample and complete sample MLEs described above may be downloaded from http://www.rannala.org.
SIMULATIONS AND DATA
We repeated a small set of simulations like those carried out in ![]()
= 2 by k = 3 different allelic types. Hosts were sampled from this population until n = 100 infected hosts had been sampled. The total number, M, of hosts sampled to achieve 100 infected ones was also recorded for each simulated sample so that we could compute the complete sample MLE. Simulations were done for nine different sets of allele frequencies in which p1 was taken from {0.1, 0.2, ... , 0.9} and p2 = p3 = 0.5(1 - p1). For each set of allele frequencies, 50,000 replicate samples were drawn and the various estimators were computed for each one. Occasionally, when p1 was large, a sample would be drawn with z1 = n. In such a case the partial sample MLE is undefined, and so we discarded these samples (thus, 9 were discarded when p1 = 0.8 and 1483 when p = 0.9). The results are summarized in Fig 1 and Fig 2. They demonstrate that the MLEs we derived here are less biased than the estimators given in ![]()
= 2).
We also reanalyzed the tick data given in Table 1 of ![]()
![]()
Though the model presented by ![]()
ACKNOWLEDGMENTS
We thank Bruce Rannala for comments on a draft of this letter and the members of the statistical genetics seminar group in the Departments of Statistics and Biostatistics at the University of Washington for helpful discussions, particularly Elizabeth Thompson, Matthew Stephens, Andrew George, and Anne-Louise Leutenegger. We also thank two anonymous referees for helpful and insightful comments. E.C.A. was supported by National Science Foundation grant BIR-9807747 and P.A.S. was supported by National Institutes of Health grant T32 HG00035-06.
Manuscript received January 9, 2001; Accepted for publication April 30, 2001.
LITERATURE CITED
QIU, W.-G., E. M. BOSLER, J. R. CAMPBELL, G. D. UGINE, and I.-N. WANG et al., 1997 A population genetic study of Borrelia burgdorferi sensu stricto from eastern Long Island, New York, suggested frequency-dependent selection, gene flow and host adaptation. Hereditas 127:203-216[Medline].
RANNALA, B., W.-G. QIU, and D. E. DYKHUIZEN, 2000 Methods for estimating gene frequencies and detecting selection in bacterial populations. Genetics 155:499-508[Abstract/Full Text].
WANG, I., D. E. DYKHUIZEN, W. QIU, J. J. DUNN, and E. M. BOSLER et al., 1999 Genetic diversity of ospC in a local population of Borrelia burgdorferi sensu stricto. Genetics 151:15-30[Abstract/Full Text].
This article has been cited by other articles:
![]() |
W.-G. Qiu, D. E. Dykhuizen, M. S. Acosta, and B. J. Luft Geographic Uniformity of the Lyme Disease Spirochete (Borrelia burgdorferi) and Its Shared History With Tick Vector (Ixodes scapularis) in the Northeastern United States Genetics, March 1, 2002; 160(3): 833 - 849. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |