Abstract
A population’s effective size is an important quantity for conservation and management. The effective size may be estimated from the change of allele frequencies observed in temporally spaced genetic samples taken from the population. Though momentbased estimators exist, recently Williamson and Slatkin demonstrated the advantages of a maximumlikelihood approach that they applied to data on diallelic genetic markers. Their computational methods, however, do not extend to data on multiallelic markers, because in such cases exact evaluation of the likelihood is impossible, requiring an intractable sum over latent variables. We present a Monte Carlo approach to compute the likelihood with data on multiallelic markers. So as to be computationally efficient, our approach relies on an importancesampling distribution constructed by a forwardbackward method. We describe the Monte Carlo formulation and the importancesampling function and then demonstrate their use on both simulated and real datasets.
REDUCTIONS in population size can lead to inbreeding, which increases the probability of population extinction in typically outbreeding species (Frankham 1995). Reductions in population size also lead to a loss of genetic diversity, which may restrict a population’s ability to adapt to changing conditions (Soulé 1986). To predict the risk to a population from these types of genetic factors, biologists are often interested in knowing the effective population size, N_{e}. An effective size is defined by comparison to an ideal population model, the WrightFisher model. The WrightFisher model assumes discrete, nonoverlapping generations of constant size, and it assumes that the gametes that unite to form adults in one generation are randomly sampled with replacement from the previous generation. The variance effective size of a natural population is the size of a WrightFisher population that would experience a comparable increase in variance of gene frequency over time. The inbreeding effective size is defined similarly, but is based on the increase in gene identity by descent over time.
It is possible to estimate the variance effective size from observed changes in allele frequencies in a population over time. Momentbased estimators using Fstatistics have been developed for this purpose (Krimbas and Tsakas 1971; Nei and Tajima 1981; Pollak 1983; Waples 1989; Jorde and Ryman 1995). Recently, Williamson and Slatkin (1999) described a method to estimate N_{e} by the method of maximum likelihood. To find the maximumlikelihood estimate
This likelihood method has been restricted to data on diallelic loci, because, with data on multiallelic loci, evaluating the likelihood for N_{e} exactly is computationally intractable. Here we describe the problem as one of inference from a hidden Markov chain (Baumet al. 1970). We develop an algorithm for importance sampling, which makes it possible to compute the likelihood by Monte Carlo.
FORMULATION OF THE MODEL AND MONTE CARLO
The model: The data are genetic samples collected at different generations. The first sample is collected at generation 0 and the last sample at generation T. Any samples drawn at intervening generations may be evenly or irregularly spaced in time. For notational simplicity, we assume for now that individuals are genotyped at a single locus, though we describe later the extension to multiple, independently segregating loci. The data include K different allelic types, indexed by k = 1,..., K. The allele frequencies observed in samples taken from different generations will differ due to genetic drift and sampling variation.
Let Y_{t} = (Y_{t}_{,1},..., Y_{t}_{,}_{K}) be the counts of the K different allelic types in the sample at generation t, and let S_{t} denote the number of diploid individuals in the sample. We assume that the samples were taken from a WrightFisher population of size N_{e} and denote the unobserved population allele counts at generation t by X_{t} = (X_{t}_{,1},..., X_{t,K}), with
The genetic sample at a time t is assumed to be drawn with replacement from the copies of alleles present in the population at time t. This is equivalent to drawing the sample Y_{t} from a very large gamete pool produced by the population at time t: sampling plan II of Waples (1989). This type of sampling applies to many organisms, especially those species with high fecundity that may be sampled as juveniles, or those that may be sampled (preferably noninvasively) as adults in populations having census sizes considerably larger than their effective sizes (Waples 1989). The sample allele counts Y_{t}, given the latent variable X_{t}, are conditionally independent of all the other variables and follow the multinomial distribution depending on the parameter N_{e}, the sample size S_{t}, and X_{t},
Such a system forms a hidden Markov chain with the dependence structure shown in the directed graph of Figure 1. The allele counts in the population when the first sample is drawn, X_{0}, are nuisance parameters. To avoid estimating X_{0} and the associated consistency problems with the maximumlikelihood estimator (Neyman and Scott 1948), we consider the integrated likelihood, assuming a uniform prior distribution, π(X_{0}), on the population allele counts at time 0. The likelihood for N_{e} is the probability of the data Y = (Y_{0},..., Y_{T}) given the parameter N_{e}. The probability of Y is the sum of the joint probability of Y and the latent variables X = (X_{0},..., X_{T}) over the space of all X:
Monte Carlo evaluation: For likelihood inference, we must evaluate P_{N}_{e}(Y) for a number of different values of N_{e}. Expressing this probability as an expectation with respect to the distribution of X gives
Instead, we pursue a more efficient Monte Carlo approximation by using importance sampling (Hammersley and Handscomb 1964). We express P_{N}_{e}(Y) as an expectation with respect to a different distribution
Sampling from
An approximation for multiple alleles: In our application, with multiple alleles at a locus, since there are so many possible states that each X_{t} may take, the above procedure is computationally infeasible. However, we make use of the Baum et al. (1970) algorithm in spirit, employing two alterations to make it feasible to simulate from
The first approximation is to perform the forwardbackward cycle separately for each allele. To describe this, we introduce some more notation. Denote by X_{(}_{k}_{)} the vector (X_{0,}_{k},..., X_{T}_{,}_{k}) of latent counts of the kth allele from time t = 0 to t = T. Similarly we define Y_{(}_{k}_{)} = (Y_{0,}_{k},..., Y_{T}_{,}_{k}). To do the forwardbackward cycle separately over alleles we first focus on allele 1, simulating
Monte Carlo variance and multiple loci: The quantity
The ability to estimate N_{e} typically requires data from many loci. The extension to data on J independently segregating loci, indexed by j = 1,..., J, is straightforward—each locus is treated separately, and the estimated likelihoods from each locus are multiplied together. Thus, let P˜_{N}_{e},_{j}(Y) be the Monte Carlo likelihood estimate from the data on the jth locus. The Monte Carlo likelihood estimate using all the loci is then
When displaying the Monte Carlo likelihood curve it is preferable to plot the loglikelihood values,
SIMULATED AND REAL DATASETS
We demonstrate the method by computing loglikelihood curves for N_{e} from three different datasets. First, to verify that the method gives correct results we apply it to a simple simulated dataset (dataset 1) for which it is possible to compute the likelihood exactly. We simulated samples of 20 diallelic loci from 100 diploid individuals at generations 0, 6, and 12 from a WrightFisher population of 25 diploid individuals. For each locus, the initial allele frequency in the population at time zero was an independently drawn uniform real number between 0 and 1. The loglikelihood for N_{e} given these data was estimated for values between 10 and 52, in steps of 2, using m = 20,000 realizations of X from
We simulated a second dataset (dataset 2) to see how the method performed with multiallelic markers taken from a WrightFisher population. The dataset included three samples of 100 diploids for 12 fiveallele loci at generations 0, 4, and 8 from a population of 50 diploids. The allele frequencies at each locus in generation 0 for these simulations were independently drawn from a uniform Dirichlet density with five components. For this dataset, the loglikelihood was computed for values of N_{e} between 20 and 100 in increments of 4 using m = 50,000 realizations of X for each locus and each value of N_{e}.
Finally, we computed a loglikelihood curve for N_{e} given data on a population of Drosophila in Begon et al. (1980). These data were analyzed using Fstatistics by Begon et al. (1980) as well as by Pollak (1983). They observed allele frequencies in three samples at each of nine enzyme loci. The first two samples were taken a little more than 1 yr apart, and the third sample was taken some 8 mo later. Though the natural populations do not have discrete generations, they have been modeled previously by Begon et al. and Pollak as populations with discrete generations. Because of the different growth rates of flies during different seasons, seven generations separate the first two samples, while only two generations separate the second two samples (Begonet al. 1980). The sample sizes for all loci were the same, with larger sample sizes taken in the latter sampling periods. The sample sizes were S_{0} = 190, S_{7} = 250, and S_{9} = 335 flies. Pollak (1983) notes that since Begon et al. (1980) sampled adult flies, their sampling scheme is closer to what is known in the literature as sampling scheme I than it is to sampling scheme II. However, as discussed by Waples (1989), the probability models underlying the two different sampling schemes are very similar when the actual size of the population is much larger than the effective size of the population. This is the case with these Drosophila. Begon et al. (1980) report census sizes in the tens of thousands of flies, while the estimated N_{e} is orders of magnitude smaller. Because of this, it is still reasonable to analyze the data using the likelihood method we have developed here.
The data appear as allele frequencies in Table 1 of Begon et al. (1980). Unfortunately the allele frequencies at the Pgm locus are misreported there and fail to sum to one. We thus used only the remaining eight loci. Of these eight, three had three alleles, two had four alleles, two had five alleles, and one had six alleles. We evaluated
RESULTS
For each of the datasets, we were able to use our importancesampling method to compute a loglikelihood curve. Using a program we wrote in C, the runs for datasets 1 and 2 each took ∼10 hr on a 266Mhz laptop computer. The loglikelihood curves from datasets 1 and 2 appear as solid lines in Figure 2. The estimated 90% confidence intervals around each value of log
The loglikelihood curve computed for the data of Begon et al. (1980) is shown in Figure 3. It took ∼54 hr on a 450Mhz desktop computer to produce the results. As before, the 90% confidence intervals around the Monte Carlo estimates appear as dotted lines. With this dataset, even with m = 500,000 realizations of X, the Monte Carlo variance is not negligible. It is, however, small enough that reliable inferences may be made from the loglikelihood curve. The maximumlikelihood estimate of N_{e} is 500. Using the values of N_{e} at which the loglikelihood has decreased 2 units from its maximum gives an estimate of a 95% confidence interval for the true N_{e}. These points are 250 and 975. By contrast, Pollak (1983), using an Fstatistic method, estimated N_{e} to be 251 with a standard error of 115. We discuss the discrepancy between the two estimates in the next section. Our results are not comparable to the N_{e} estimated by Begon et al. (1980) because, at the time, those authors were unable to make a single estimate of N_{e} using the samples at all three time points.
DISCUSSION
As discussed in Williamson and Slatkin (1999), estimating N_{e} by maximum likelihood has advantages over estimating N_{e} using Fstatistics. Until now, it was impractical to compute the likelihood for N_{e} using all the data when loci with more than two alleles were available. While it has been suggested that one may bin lowfrequency alleles together to turn multiallelic loci into apparently diallelic loci and then apply exact likelihood calculation methods to such reduced data, this invariably throws away some information. Furthermore, different binning strategies lead to different results. Allowing full use of the data, the Monte Carlo likelihood procedure described here is a preferable way to analyze temporal data on multiallelic loci. The method is suitable for multiallelic loci such as the microsatellite markers becoming available in a wide variety of species.
Monte Carlo methods use realizations of random variables to estimate an expectation by a sample average. There are a number of ways one can express the likelihood of N_{e} as an expectation, and then estimate it by Monte Carlo, but few of those schemes will be successful, because most will have high Monte Carlo variance. We attempted several different schemes before settling on the importancesampling method presented here. Although these less sophisticated Monte Carlo estimators produced reasonable estimates in very small problems, when applied to data involving loci with many alleles these methods failed to converge reliably, even after many days of computation (E. C. Anderson and E. G. Williamson, unpublished data).
The importancesampling method we use is successful because our importance sampling function,
It should be pointed out that while many Monte Carlo problems involving highdimensional random variables like X make use of Markov chain Monte Carlo (MCMC) methods, our method is not an MCMC method. In MCMC, successive realizations are correlated. In our method each X^{(}^{i}^{)} realized from the distribution
It is interesting that our maximumlikelihood estimate differs so much from the estimate given by Pollak (1983) for the same data. Though perhaps some of this is attributable to the fact that we chose not to use the incorrectly reported data at the Pgm locus, there are differences between the two estimation methods that could also account for some of the discrepancy. The most notable differences occur when combining information from multiple samples in time. Consider the fact that a better estimate of N_{e} may be made from two samples taken many generations apart than from two samples separated by fewer generations. Likewise two large samples will yield a better estimate than two small samples. When there are many samples, the relative information content in different intersample intervals will depend on the relative sample sizes and the number of generations between the samples. By its nature, the maximumlikelihood approach will appropriately weight information from different intervals. In contrast, Pollak’s Fstatistic, F_{K}_{r}, neither includes terms for sample size nor interval length between samples, and, Ñ_{K}_{r}, his estimate of N_{e} based on F_{K}_{r}, includes a term for only the number of generations between the first and the last sample and is invariant to permutations of the sample sizes at different times. Since the data from Begon et al. (1980) span sampling intervals of different lengths and include different sample sizes at different times, differences between our results and those in Pollak (1983) should be expected.
The Monte Carlo variance of our estimate of the likelihood given the data from a natural population of Drosophila was higher than the variance associated with our estimates from simulated data. Although a good estimate was achieved after sufficient computation, it may still be that data generated under a model that differs from the WrightFisher model present difficulty for the Monte Carlo likelihood method. For example, it may be that the effective size of the natural Drosophila population was different during the two different sampling intervals. We note that one could extend the likelihood framework to allow for N_{e} changing over time. For example, if the estimated census size of the population were available and was known to change over time it would be more sensible to estimate directly the ratio, λ, of the effective size of the population to the census size of the population. This ratio would be more useful for the purposes of modeling genetic change in the population than a single estimate of N_{e} over the entire time period between the first and last samples. The forwardbackward approach implemented here could easily be modified to estimate this parameter, λ.
It would also be possible to extend the present approach to compute likelihoods from different stochastic models of allele frequency change. We suspect there would seldom be enough information in the data to estimate accurately models in which allele frequency change is due jointly to genetic drift and some other genetic mechanism such as mutation or selection. However, our methods could be modified to handle different demographic models. For example, consider an alteration of the WrightFisher model such that the current generation is formed by sampling genes without replacement from a gamete pool in which each parental allele is represented by M gametes. In such a case, allele frequency changes occur due to multivariate hypergeometric sampling determined by the parameters N and M. For values of M that are not very small, the normal distribution is still a reasonable approximation to the hypergeometric distribution, and importance sampling could proceed as before. The importancesampling distribution
Another possible extension would be to stochastic models involving populations of organisms with more complex life histories, for example, overlapping generations or agestructured populations. As it becomes easier to gather extensive genetic data on populations, and as understanding of the structure within those populations improves, it will be possible to specify much richer probability models for temporal population genetic data. The forwardbackward method here, and variations of it, should be useful in estimating population parameters from such models using Monte Carlo likelihood.
A software package, MCLEEPS, implementing the algorithm described in this article is available for free download from http://www.stat.washington.edu/thompson/Genepi/Mcleeps.shtml.
APPENDIX
Using θ_{t,k} and ϕ_{t,k} in a continuous setting: We define the random variables
The forward step: Following CavalliSforza and Edwards (1967), if θ_{t}_{1,}_{k} is normally (+) distributed with mean μ_{t}_{1} and variance
Carrying this out until t = T gives values for the mean and variance of θ_{T,k} given ϕ_{0,}_{k},..., ϕ_{T}_{,}_{k}, assuming they follow a normal distribution. In fact, for each t, it gives us the parameters for the normal distribution of θ_{t,k} conditional on ϕ_{r}_{,}_{k} for all r ≤ t. We are thus in a position to realize
The backward step: The backward step is more complicated than the forward step, because after realizing each value of
Working backward, each
Computing the probability
It remains only to compute
Details of
With N^{*} and γ positive integers, δ ∊ {0, 1}, and κ ∊ {0, 1,..., min(2N^{*}  δ, γ  δ}, let
if (δ = 2N^{*}  κ or δ = γ  κ = 0) then θ^{*} ← 0
else if (L < θ < H) then θ^{*} ← θ
else if (θ < L)
and if (δ = 0) then θ^{*} ← θ
else if (δ = 1) then θ_{[L]} ← 2L  θ (this is reflection around θ = L), and then
if (L ≤ θ_{[}_{L}_{]} < H) then θ^{*} ← θ_{[}_{L}_{]}
else we know θ_{[}_{L}_{]} ≥ H, and we consider the sequence θ_{[}_{i}_{]} = i(L  H) + θ_{[}_{L}_{]}, i = 1, 2,..., and we assign θ^{*} ← θ_{[}_{i}_{*]}, where i^{*} is the least i such that L < θ_{[}_{i}_{]} < H. (The sequence θ_{[}_{i}_{]} represents successive translation leftward.)
else if (θ > H)
and if (κ = 0) then θ^{*} ← π/2
else if (κ > 1) then θ_{[}_{H}_{]} ← 2H  θ (this is reflection around θ = H), and then
if (L < θ_{[}_{H}_{]} < H) then θ^{*} ← θ_{[}_{H}_{]}
else we know θ_{[}_{H}_{]} < L and we consider the sequence θ_{[}_{j}_{]} = j(H  L) + θ_{[}_{H}_{]}, j = 1, 2,..., and we assign θ^{*} ← θ_{[}_{j}_{*]}, where j^{*} is the least j such that L ≤ θ_{[}_{j}_{]} < H. (The sequence θ_{[}_{j}_{]} represents successive translation rightward.)
finally we use θ^{*}, making the assignment
where
The probability
If, on the other hand, 2N^{*}  κ > δ and γ  κ > 0, then for X^{(}^{i}^{)} = 0 and X^{(}^{i}^{)} = 2N^{*} we have
This folding and translating might seem to be a very involved process, but it is computationally much faster than realizing θ from a truncated normal distribution and computing the probability of X^{(}^{i}^{)} when θ is from such a distribution.
Multilocus Monte Carlo variance calculation: We derive an expression for the variance of a product of J independent random variables, W_{j}, j = 1,... J:
Acknowledgments
We thank an anonymous referee for helpful comments on the manuscript and for suggesting the extension to the population model with hypergeometric sampling. This study was supported by National Science Foundation grant BIR9807747 to E.T. Additionally, E.A. was supported by National Science Foundation grant BIR9256537, the University of Washington QERM program, and a Burroughs Wellcome Fund PMMB training fellowship. E.W. was supported by National Institutes of Health grant GM40282 to M. Slatkin.
Footnotes

Communicating editor: G. A. Churchill
 Received December 27, 1999.
 Accepted August 7, 2000.
 Copyright © 2000 by the Genetics Society of America