Rannala et al. (2000) develop a useful Poisson process model for estimating allele frequencies in bacterial populations using presence/absence data on infected hosts. However, they also show that the estimators they derive suffer from bias. In this letter we derive different estimators from the same model and show these estimators to be preferable.
Rannala et al. (2000) write the loglikelihood function for allele frequencies p = (p_{1},..., p_{k}) and Poisson rate parameter λ as
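As a point of reference, a loglikelihood consistent with the model described in this letter can be written out as follows. This is a sketch under stated assumptions rather than a transcription of Equation 1: it takes n to be the number of infected hosts sampled, z_{j} to be the number of those hosts in which allelic type j is detected, and the number of type-j bacteria in a host to be Poisson with mean λp_{j}, independently across types.

```latex
\log L(\mathbf{p}, \lambda)
  = \sum_{j=1}^{k} \left[ z_j \log\!\left(1 - e^{-\lambda p_j}\right)
      - (n - z_j)\,\lambda p_j \right]
  - n \log\!\left(1 - e^{-\lambda}\right),
\qquad \sum_{j=1}^{k} p_j = 1 .
```

In this form the first term in the brackets accounts for hosts in which type j is present, the second for infected hosts in which it is absent, and the final term is the one that arises from conditioning on each sampled host being infected.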
The true maximum-likelihood estimators (MLEs) from Equation 1 may be found by constrained optimization using the method of Lagrange multipliers, as we show in the next section. In most practical situations, however, data will also be available on the number of uninfected hosts. Using those data both improves and simplifies the estimation of p and λ, as described in the section ESTIMATION FROM THE WHOLE SAMPLE OF HOSTS. The MLEs derived in both of the following sections are shown to be less biased than the estimators given in Rannala et al. (2000).
CONSTRAINED OPTIMIZATION
Any values of p and λ that maximize (1) while satisfying the constraint
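The constrained optimization can be sketched numerically as follows. The sketch assumes the conditional loglikelihood implied by the model (type-j bacteria per host Poisson with mean λp_{j}, conditioning on infection, z_{j} the number of the n infected hosts carrying type j). Under that assumption the Lagrange conditions reduce to e^{–λp_j} = 1 – z_{j}(1 – e^{–λ})/n for each j, so the constraint that the p_{j} sum to one leaves a single equation in λ, solved here by bisection. The function name and the choice of bisection are illustrative and are not taken from the authors' programs.

```python
import math

def partial_sample_mle(z, n, iters=200):
    """Estimate (p, lam) from infected hosts only (illustrative sketch).

    z[j] = number of the n infected hosts in which allele j was detected.
    """
    if any(zj >= n for zj in z):
        raise ValueError("undefined when an allele is present in every infected host")
    if sum(z) <= n:
        raise ValueError("no interior solution: the estimate of lam sits at the boundary 0")

    def g(lam):
        # Constraint sum_j p_j = 1 rewritten as g(lam) = 0, using
        # exp(-lam * p_j) = 1 - z_j * (1 - exp(-lam)) / n.
        s = -math.expm1(-lam)                      # 1 - exp(-lam)
        return lam + sum(math.log1p(-zj * s / n) for zj in z)

    lo, hi = 1e-9, 1.0
    while g(hi) <= 0:                              # expand until the root is bracketed
        hi *= 2.0
    for _ in range(iters):                         # plain bisection on g
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    lam = 0.5 * (lo + hi)
    s = -math.expm1(-lam)
    p = [-math.log1p(-zj * s / n) / lam for zj in z]
    return p, lam
```

A call such as `partial_sample_mle([70, 45, 30], 100)` returns the estimated frequencies and rate; the two guard clauses correspond to samples for which no finite interior maximum exists.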
ESTIMATION FROM THE WHOLE SAMPLE OF HOSTS
The quantity n/(1 – e^{–λ}) in Equation 7 can be seen to estimate the total number of hosts sampled, both infected and uninfected. In fact, it is possible to improve the estimation of p and λ by including information, which should typically be available, on the number of uninfected hosts sampled. Since Rannala et al. (2000) use only the infected hosts, their loglikelihood (1) contains the term –n log(1 – e^{–λ}), which arises from conditioning upon the hosts being infected. If one uses the information in both infected and uninfected hosts, that term vanishes from the loglikelihood. Doing so, and letting M denote the total number of hosts (infected and uninfected) sampled, the loglikelihood becomes
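For orientation, the form this complete sample loglikelihood takes can be sketched under the same assumptions used throughout (type-j bacteria per host Poisson with mean λp_{j}; z_{j} the number of sampled hosts carrying type j; M the total number of hosts sampled). The notation is an assumption and may differ from the displayed equation in the original.

```latex
\log L_{c}(\mathbf{p}, \lambda)
  = \sum_{j=1}^{k} \left[ z_j \log\!\left(1 - e^{-\lambda p_j}\right)
      - (M - z_j)\,\lambda p_j \right],
\qquad \sum_{j=1}^{k} p_j = 1 .
```

Each summand depends on the parameters only through the product λp_{j}, so the summands can be maximized one at a time; this separability is what makes a closed-form solution possible once the conditioning term is gone.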
Another, more intuitive, derivation of the complete sample MLEs can be given: Let 1 – q_{j} be the probability that a host carries no bacteria of allelic type j. Then, detecting the presence or absence of allelic type j in a host is a Bernoulli trial with probability of success (i.e., probability of presence of the allele) q_{j}. Since each host sampled is an independent Bernoulli trial, the number of hosts infected with allelic type j from a total sample size of M is binomially distributed, and the MLE of q_{j} is just the familiar MLE for a binomial proportion,
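The binomial argument above translates directly into a few lines of code. The sketch below assumes that z_{j} denotes the number of the M sampled hosts in which allelic type j is detected and that q_{j} = 1 – e^{–λp_j}, as follows from a Poisson count with mean λp_{j}; inverting that relation and using the fact that the p_{j} sum to one gives λ and p. The function name is hypothetical and this is not the authors' program.

```python
import math

def complete_sample_mle(z, M):
    """Closed-form estimates of (p, lam) from all M sampled hosts (sketch).

    z[j] = number of the M hosts in which allele j was detected.
    """
    if any(zj >= M for zj in z):
        raise ValueError("every z_j must be smaller than M")
    if sum(z) == 0:
        raise ValueError("no infected hosts in the sample")
    # q_j = z_j / M estimates P(allele j present) = 1 - exp(-lam * p_j),
    # so -log(1 - q_j) estimates the product lam * p_j.
    lam_p = [-math.log1p(-zj / M) for zj in z]
    lam = sum(lam_p)                  # because the p_j sum to one
    p = [x / lam for x in lam_p]
    return p, lam
```

For instance, `complete_sample_mle([70, 45, 30], 120)` treats 120 as the total number of hosts examined, infected or not, and needs no iteration.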
If M, the total number of hosts sampled, is available, (8) and (9) are the preferred estimators for two reasons. First, because the complete sample MLEs are available in closed form, they need not be computed iteratively, as the partial sample MLEs must be. Second, it can be shown that if M is known and λ is unknown, then M and the set of z_{j} (j = 1,..., k) are the minimal sufficient statistics for the parameters λ and p. Since M is not a deterministic function of n, it follows from the definition of minimal sufficiency that n and the set of z_{j} (j = 1,..., k) are not sufficient for λ and p, and so the partial sample MLE is not based on all the available information unless the investigator failed to record or does not actually know M. Since estimators based on sufficient statistics typically have smaller variance than those that are not, the complete sample MLE should be used when M is known. Of course, if only n and the z_{j}'s are known, then, in that context, the partial sample MLE is based on the sufficient statistic and should be used. Simulations (Figure 2) confirm that the complete sample MLE has lower variance than the partial sample MLE.
When λ is large, so that most of the sampled hosts are infected, the complete sample and partial sample MLEs will differ little from the estimator derived by Rannala et al. (2000). They will differ more from the estimator in Rannala et al. (2000) when λ is small and many of the hosts sampled are not infected by any bacteria. We have empirically found that the MLEs we derive usually give estimates that are between the estimates from Rannala et al.'s estimator and the “uncorrected estimates” of
Programs to compute the partial sample and complete sample MLEs described above may be downloaded from http://www.rannala.org.
SIMULATIONS AND DATA
We repeated a small set of simulations like those carried out in Rannala et al. (2000). Briefly, hosts were infected at rate λ = 2 by k = 3 different allelic types. Hosts were sampled from this population until n = 100 infected hosts had been sampled. The total number, M, of hosts sampled to achieve 100 infected ones was also recorded for each simulated sample so that we could compute the complete sample MLE. Simulations were done for nine different sets of allele frequencies in which p_{1} was taken from {0.1, 0.2,..., 0.9} and p_{2} = p_{3} = 0.5(1 – p_{1}). For each set of allele frequencies, 50,000 replicate samples were drawn and the various estimators were computed for each one. Occasionally, when p_{1} was large, a sample would be drawn with z_{1} = n. In such a case the partial sample MLE is undefined, and so we discarded these samples (thus, 9 were discarded when p_{1} = 0.8 and 1483 when p_{1} = 0.9). The results are summarized in Figures 1 and 2. They demonstrate that the MLEs we derived here are less biased than the estimators given in Rannala et al. (2000), and they demonstrate that the variance of the complete sample MLE is smaller than that of the partial sample MLE (albeit only slightly for λ = 2).
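A scaled-down version of this simulation design can be sketched as follows. It is an illustration under stated assumptions, not the code used for Figures 1 and 2: presence of each allelic type in a host is drawn as an independent Bernoulli trial with probability 1 – e^{–λp_j} (equivalent to Poisson counts with mean λp_{j}), the replicate count is reduced for speed, the partial sample estimate is obtained by bisection on the single equation for λ implied by the assumed conditional loglikelihood, and all function names are hypothetical.

```python
import math
import random

def simulate_sample(lam, p, n, rng):
    """Sample hosts until n are infected; return (z, M)."""
    q = [-math.expm1(-lam * pj) for pj in p]   # P(allele j present in a host)
    z = [0] * len(p)
    infected = 0
    M = 0
    while infected < n:
        M += 1
        present = [rng.random() < qj for qj in q]
        if any(present):
            infected += 1
            for j, here in enumerate(present):
                if here:
                    z[j] += 1
    return z, M

def complete_mle(z, M):
    """Closed form: -log(1 - z_j / M) estimates lam * p_j."""
    lam_p = [-math.log1p(-zj / M) for zj in z]
    lam = sum(lam_p)
    return [x / lam for x in lam_p], lam

def partial_mle(z, n, iters=100):
    """Solve lam + sum_j log(1 - z_j (1 - e^-lam) / n) = 0 by bisection."""
    def g(lam):
        s = -math.expm1(-lam)
        return lam + sum(math.log1p(-zj * s / n) for zj in z)
    lo, hi = 1e-9, 1.0
    while g(hi) <= 0:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    lam = 0.5 * (lo + hi)
    # Same closed form, with M replaced by its estimate n / (1 - e^-lam).
    return complete_mle(z, n / -math.expm1(-lam))

def run(lam=2.0, p=(0.5, 0.25, 0.25), n=100, reps=200, seed=1):
    """Return mean estimates of p_1 (complete, partial) and replicates kept."""
    rng = random.Random(seed)
    complete, partial = [], []
    for _ in range(reps):
        z, M = simulate_sample(lam, p, n, rng)
        if max(z) >= n or sum(z) <= n:
            continue                  # partial sample MLE undefined or on the boundary
        complete.append(complete_mle(z, M)[0][0])
        partial.append(partial_mle(z, n)[0][0])
    kept = len(complete)
    return sum(complete) / kept, sum(partial) / kept, kept
```

Calling `run()` for each p_{1} in {0.1,..., 0.9}, with p_{2} = p_{3} = 0.5(1 – p_{1}) and a larger `reps`, mirrors the design described above; comparing the spread of the two lists of estimates is one way to examine the variance difference between the estimators.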
We also reanalyzed the tick data given in Table 1 of Rannala et al. (2000), using the partial sample MLE. Our estimates of the allele frequencies for all of the alleles at ospA and all of the alleles except allele D at ospC were intermediate between the “uncorrected estimate” and the Rannala et al. estimate. The three different estimates for allele D at ospC differed only at the fourth decimal place. Our method estimates that the Borrelia burgdorferi allele frequencies are somewhat more uniform than inferred by Rannala et al. (2000).
Though the model presented by Rannala et al. (2000) was used only in estimating allele frequencies of Borrelia found in ticks, it may be useful in other host/parasite systems. Further, by defining “hosts” to be samples of DNA pooled from different sources, the model, or some altered version of it, may be useful in other situations involving the estimation of allele frequencies from presence/absence data.
Acknowledgments
We thank Bruce Rannala for comments on a draft of this letter and the members of the statistical genetics seminar group in the Departments of Statistics and Biostatistics at the University of Washington for helpful discussions, particularly Elizabeth Thompson, Matthew Stephens, Andrew George, and Anne-Louise Leutenegger. We also thank two anonymous referees for helpful and insightful comments. E.C.A. was supported by National Science Foundation grant BIR9807747 and P.A.S. was supported by National Institutes of Health grant T32 HG0003506.
Footnotes

Communicating editor: G. B. Golding
Received January 9, 2001.
Accepted April 30, 2001.
Copyright © 2001 by the Genetics Society of America