Monte Carlo Evaluation of the Likelihood for Ne From Temporally Spaced Samples
- * Interdisciplinary Program in Quantitative Ecology and Resource Management, University of Washington, Seattle, Washington 98195
- † Department of Integrative Biology, University of California, Berkeley, California 94720-3141
- ‡ Department of Statistics, University of Washington, Seattle, Washington 98195
- Corresponding author: Eric C. Anderson, Department of Statistics, University of Washington, Box 354322, Seattle, WA 98195. E-mail: eriq{at}cqs.washington.edu
Abstract
A population’s effective size is an important quantity for conservation and management. The effective size may be estimated from the change of allele frequencies observed in temporally spaced genetic samples taken from the population. Though moment-based estimators exist, recently Williamson and Slatkin demonstrated the advantages of a maximum-likelihood approach that they applied to data on diallelic genetic markers. Their computational methods, however, do not extend to data on multiallelic markers, because in such cases exact evaluation of the likelihood is impossible, requiring an intractable sum over latent variables. We present a Monte Carlo approach to compute the likelihood with data on multiallelic markers. So as to be computationally efficient, our approach relies on an importance-sampling distribution constructed by a forward-backward method. We describe the Monte Carlo formulation and the importance-sampling function and then demonstrate their use on both simulated and real datasets.
REDUCTIONS in population size can lead to inbreeding, which increases the probability of population extinction in typically outbreeding species (Frankham 1995). Reductions in population size also lead to a loss of genetic diversity, which may restrict a population’s ability to adapt to changing conditions (Soulé 1986). To predict the risk to a population from these types of genetic factors, biologists are often interested in knowing the effective population size, Ne. An effective size is defined by comparison to an ideal population model, the Wright-Fisher model. The Wright-Fisher model assumes discrete, nonoverlapping generations of constant size, and it assumes that the gametes that unite to form adults in one generation are randomly sampled with replacement from the previous generation. The variance effective size of a natural population is the size of a Wright-Fisher population that would experience a comparable increase in variance of gene frequency over time. The inbreeding effective size is defined similarly, but is based on the increase in gene identity by descent over time.
It is possible to estimate the variance effective size from observed changes in allele frequencies in a population over time. Moment-based estimators using F-statistics have been developed for this purpose (Krimbas and Tsakas 1971; Nei and Tajima 1981; Pollak 1983; Waples 1989; Jorde and Ryman 1995). Recently, Williamson and Slatkin (1999) described a method to estimate Ne by the method of maximum likelihood. To find the maximum-likelihood estimate
This likelihood method has been restricted to data on diallelic loci, because, with data on multiallelic loci, evaluating the likelihood for Ne exactly is computationally intractable. Here we describe the problem as one of inference from a hidden Markov chain (Baumet al. 1970). We develop an algorithm for importance sampling, which makes it possible to compute the likelihood by Monte Carlo.
FORMULATION OF THE MODEL AND MONTE CARLO
The model: The data are genetic samples collected at different generations. The first sample is collected at generation 0 and the last sample at generation T. Any samples drawn at intervening generations may be evenly or irregularly spaced in time. For notational simplicity, we assume for now that individuals are genotyped at a single locus, though we describe later the extension to multiple, independently segregating loci. The data include K different allelic types, indexed by k = 1,..., K. The allele frequencies observed in samples taken from different generations will differ due to genetic drift and sampling variation.
Let Yt = (Yt,1,..., Yt,K) be the counts of the K different allelic types in the sample at generation t, and let St denote the number of diploid individuals in the sample. We assume that the samples were taken from a Wright-Fisher population of size Ne and denote the unobserved population allele counts at generation t by Xt = (Xt,1,..., Xt,K), with
The genetic sample at a time t is assumed to be drawn with replacement from the copies of alleles present in the population at time t. This is equivalent to drawing the sample Yt from a very large gamete pool produced by the population at time t: sampling plan II of Waples (1989). This type of sampling applies to many organisms, especially those species with high fecundity that may be sampled as juveniles, or those that may be sampled (preferably noninvasively) as adults in populations having census sizes considerably larger than their effective sizes (Waples 1989). The sample allele counts Yt, given the latent variable Xt, are conditionally independent of all the other variables and follow the multinomial distribution depending on the parameter Ne, the sample size St, and Xt,
Such a system forms a hidden Markov chain with the dependence structure shown in the directed graph of Figure 1. The allele counts in the population when the first sample is drawn, X0, are nuisance parameters. To avoid estimating X0 and the associated consistency problems with the maximum-likelihood estimator (Neyman and Scott 1948), we consider the integrated likelihood, assuming a uniform prior distribution, π(X0), on the population allele counts at time 0. The likelihood for Ne is the probability of the data Y = (Y0,..., YT) given the parameter Ne. The probability of Y is the sum of the joint probability of Y and the latent variables X = (X0,..., XT) over the space of all X:
—A directed graph showing the dependence structure of the components of X and Y. The YT’s are observations of a hidden Markov chain. The graph shown represents a situation where samples were taken at generations 0, 2, t, and T, and no samples were taken at generations 1 and t - 1.
Monte Carlo evaluation: For likelihood inference, we must evaluate PNe(Y) for a number of different values of Ne. Expressing this probability as an expectation with respect to the distribution of X gives
Instead, we pursue a more efficient Monte Carlo approximation by using importance sampling (Hammersley and Handscomb 1964). We express PNe(Y) as an expectation with respect to a different distribution
Sampling from
An approximation for multiple alleles: In our application, with multiple alleles at a locus, since there are so many possible states that each Xt may take, the above procedure is computationally infeasible. However, we make use of the Baum et al. (1970) algorithm in spirit, employing two alterations to make it feasible to simulate from
The first approximation is to perform the forward-backward cycle separately for each allele. To describe this, we introduce some more notation. Denote by X(k) the vector (X0,k,..., XT,k) of latent counts of the kth allele from time t = 0 to t = T. Similarly we define Y(k) = (Y0,k,..., YT,k). To do the forward-backward cycle separately over alleles we first focus on allele 1, simulating
Monte Carlo variance and multiple loci: The quantity
The ability to estimate Ne typically requires data from many loci. The extension to data on J independently segregating loci, indexed by j = 1,..., J, is straightforward—each locus is treated separately, and the estimated likelihoods from each locus are multiplied together. Thus, let P˜Ne,j(Y) be the Monte Carlo likelihood estimate from the data on the jth locus. The Monte Carlo likelihood estimate using all the loci is then
When displaying the Monte Carlo likelihood curve it is preferable to plot the log-likelihood values,
SIMULATED AND REAL DATASETS
We demonstrate the method by computing log-likelihood curves for Ne from three different datasets. First, to verify that the method gives correct results we apply it to a simple simulated dataset (dataset 1) for which it is possible to compute the likelihood exactly. We simulated samples of 20 diallelic loci from 100 diploid individuals at generations 0, 6, and 12 from a Wright-Fisher population of 25 diploid individuals. For each locus, the initial allele frequency in the population at time zero was an independently drawn uniform real number between 0 and 1. The log-likelihood for Ne given these data was estimated for values between 10 and 52, in steps of 2, using m = 20,000 realizations of X from
We simulated a second dataset (dataset 2) to see how the method performed with multiallelic markers taken from a Wright-Fisher population. The dataset included three samples of 100 diploids for 12 five-allele loci at generations 0, 4, and 8 from a population of 50 diploids. The allele frequencies at each locus in generation 0 for these simulations were independently drawn from a uniform Dirichlet density with five components. For this dataset, the log-likelihood was computed for values of Ne between 20 and 100 in increments of 4 using m = 50,000 realizations of X for each locus and each value of Ne.
Finally, we computed a log-likelihood curve for Ne given data on a population of Drosophila in Begon et al. (1980). These data were analyzed using F-statistics by Begon et al. (1980) as well as by Pollak (1983). They observed allele frequencies in three samples at each of nine enzyme loci. The first two samples were taken a little more than 1 yr apart, and the third sample was taken some 8 mo later. Though the natural populations do not have discrete generations, they have been modeled previously by Begon et al. and Pollak as populations with discrete generations. Because of the different growth rates of flies during different seasons, seven generations separate the first two samples, while only two generations separate the second two samples (Begonet al. 1980). The sample sizes for all loci were the same, with larger sample sizes taken in the latter sampling periods. The sample sizes were S0 = 190, S7 = 250, and S9 = 335 flies. Pollak (1983) notes that since Begon et al. (1980) sampled adult flies, their sampling scheme is closer to what is known in the literature as sampling scheme I than it is to sampling scheme II. However, as discussed by Waples (1989), the probability models underlying the two different sampling schemes are very similar when the actual size of the population is much larger than the effective size of the population. This is the case with these Drosophila. Begon et al. (1980) report census sizes in the tens of thousands of flies, while the estimated Ne is orders of magnitude smaller. Because of this, it is still reasonable to analyze the data using the likelihood method we have developed here.
The data appear as allele frequencies in Table 1 of Begon et al. (1980). Unfortunately the allele frequencies at the Pgm locus are misreported there and fail to sum to one. We thus used only the remaining eight loci. Of these eight, three had three alleles, two had four alleles, two had five alleles, and one had six alleles. We evaluated
—Log-likelihood curves estimated by Monte Carlo from datasets 1 and 2. The values of Ne at which the likelihood was computed are indicated by vertical lines above the horizontal axis in each figure. The log-likelihood values are connected by a solid line. Vertical bars intersecting the solid line indicate 90% confidence intervals around log computed using the Monte Carlo variance estimate (13). The endpoints of the confidence intervals are connected by dashed lines. (a) Dataset 1 is simulated data from 20 diallelic loci. (b) Dataset 2 is simulated data from 12 loci with five alleles each.
RESULTS
For each of the datasets, we were able to use our importance-sampling method to compute a log-likelihood curve. Using a program we wrote in C, the runs for datasets 1 and 2 each took ∼10 hr on a 266-Mhz laptop computer. The log-likelihood curves from datasets 1 and 2 appear as solid lines in Figure 2. The estimated 90% confidence intervals around each value of log
The log-likelihood curve computed for the data of Begon et al. (1980) is shown in Figure 3. It took ∼54 hr on a 450-Mhz desktop computer to produce the results. As before, the 90% confidence intervals around the Monte Carlo estimates appear as dotted lines. With this dataset, even with m = 500,000 realizations of X, the Monte Carlo variance is not negligible. It is, however, small enough that reliable inferences may be made from the log-likelihood curve. The maximum-likelihood estimate of Ne is 500. Using the values of Ne at which the log-likelihood has decreased 2 units from its maximum gives an estimate of a 95% confidence interval for the true Ne. These points are 250 and 975. By contrast, Pollak (1983), using an F-statistic method, estimated Ne to be 251 with a standard error of 115. We discuss the discrepancy between the two estimates in the next section. Our results are not comparable to the Ne estimated by Begon et al. (1980) because, at the time, those authors were unable to make a single estimate of Ne using the samples at all three time points.
DISCUSSION
As discussed in Williamson and Slatkin (1999), estimating Ne by maximum likelihood has advantages over estimating Ne using F-statistics. Until now, it was impractical to compute the likelihood for Ne using all the data when loci with more than two alleles were available. While it has been suggested that one may bin low-frequency alleles together to turn multiallelic loci into apparently diallelic loci and then apply exact likelihood calculation methods to such reduced data, this invariably throws away some information. Furthermore, different binning strategies lead to different results. Allowing full use of the data, the Monte Carlo likelihood procedure described here is a preferable way to analyze temporal data on multiallelic loci. The method is suitable for multiallelic loci such as the microsatellite markers becoming available in a wide variety of species.
Monte Carlo methods use realizations of random variables to estimate an expectation by a sample average. There are a number of ways one can express the likelihood of Ne as an expectation, and then estimate it by Monte Carlo, but few of those schemes will be successful, because most will have high Monte Carlo variance. We attempted several different schemes before settling on the importance-sampling method presented here. Although these less sophisticated Monte Carlo estimators produced reasonable estimates in very small problems, when applied to data involving loci with many alleles these methods failed to converge reliably, even after many days of computation (E. C. Anderson and E. G. Williamson, unpublished data).
The importance-sampling method we use is successful because our importance sampling function,
It should be pointed out that while many Monte Carlo problems involving high-dimensional random variables like X make use of Markov chain Monte Carlo (MCMC) methods, our method is not an MCMC method. In MCMC, successive realizations are correlated. In our method each X(i) realized from the distribution
It is interesting that our maximum-likelihood estimate differs so much from the estimate given by Pollak (1983) for the same data. Though perhaps some of this is attributable to the fact that we chose not to use the incorrectly reported data at the Pgm locus, there are differences between the two estimation methods that could also account for some of the discrepancy. The most notable differences occur when combining information from multiple samples in time. Consider the fact that a better estimate of Ne may be made from two samples taken many generations apart than from two samples separated by fewer generations. Likewise two large samples will yield a better estimate than two small samples. When there are many samples, the relative information content in different intersample intervals will depend on the relative sample sizes and the number of generations between the samples. By its nature, the maximum-likelihood approach will appropriately weight information from different intervals. In contrast, Pollak’s F-statistic, FKr, neither includes terms for sample size nor interval length between samples, and, ÑKr, his estimate of Ne based on FKr, includes a term for only the number of generations between the first and the last sample and is invariant to permutations of the sample sizes at different times. Since the data from Begon et al. (1980) span sampling intervals of different lengths and include different sample sizes at different times, differences between our results and those in Pollak (1983) should be expected.
The Monte Carlo variance of our estimate of the likelihood given the data from a natural population of Drosophila was higher than the variance associated with our estimates from simulated data. Although a good estimate was achieved after sufficient computation, it may still be that data generated under a model that differs from the Wright-Fisher model present difficulty for the Monte Carlo likelihood method. For example, it may be that the effective size of the natural Drosophila population was different during the two different sampling intervals. We note that one could extend the likelihood framework to allow for Ne changing over time. For example, if the estimated census size of the population were available and was known to change over time it would be more sensible to estimate directly the ratio, λ, of the effective size of the population to the census size of the population. This ratio would be more useful for the purposes of modeling genetic change in the population than a single estimate of Ne over the entire time period between the first and last samples. The forward-backward approach implemented here could easily be modified to estimate this parameter, λ.
It would also be possible to extend the present approach to compute likelihoods from different stochastic models of allele frequency change. We suspect there would seldom be enough information in the data to estimate accurately models in which allele frequency change is due jointly to genetic drift and some other genetic mechanism such as mutation or selection. However, our methods could be modified to handle different demographic models. For example, consider an alteration of the Wright-Fisher model such that the current generation is formed by sampling genes without replacement from a gamete pool in which each parental allele is represented by M gametes. In such a case, allele frequency changes occur due to multivariate hypergeometric sampling determined by the parameters N and M. For values of M that are not very small, the normal distribution is still a reasonable approximation to the hypergeometric distribution, and importance sampling could proceed as before. The importance-sampling distribution
Another possible extension would be to stochastic models involving populations of organisms with more complex life histories, for example, overlapping generations or age-structured populations. As it becomes easier to gather extensive genetic data on populations, and as understanding of the structure within those populations improves, it will be possible to specify much richer probability models for temporal population genetic data. The forward-backward method here, and variations of it, should be useful in estimating population parameters from such models using Monte Carlo likelihood.
A software package, MCLEEPS, implementing the algorithm described in this article is available for free download from http://www.stat.washington.edu/thompson/Genepi/Mcleeps.shtml.
APPENDIX
Using θt,k and ϕt,k in a continuous setting: We define the random variables
The forward step: Following Cavalli-Sforza and Edwards (1967), if θt-1,k is normally (+) distributed with mean μt-1 and variance
Carrying this out until t = T gives values for the mean and variance of θT,k given ϕ0,k,..., ϕT,k, assuming they follow a normal distribution. In fact, for each t, it gives us the parameters for the normal distribution of θt,k conditional on ϕr,k for all r ≤ t. We are thus in a position to realize
The backward step: The backward step is more complicated than the forward step, because after realizing each value of
Working backward, each
Computing the probability
It remains only to compute
Details of
With N* and γ positive integers, δ ∊ {0, 1}, and κ ∊ {0, 1,..., min(2N* - δ, γ - δ}, let
if (δ = 2N* - κ or δ = γ - κ = 0) then θ* ← 0
else if (L < θ < H) then θ* ← θ
else if (θ < L)
and if (δ = 0) then θ* ← θ
else if (δ = 1) then θ[L] ← 2L - θ (this is reflection around θ = L), and then
if (L ≤ θ[L] < H) then θ* ← θ[L]
else we know θ[L] ≥ H, and we consider the sequence θ[i] = i(L - H) + θ[L], i = 1, 2,..., and we assign θ* ← θ[i*], where i* is the least i such that L < θ[i] < H. (The sequence θ[i] represents successive translation leftward.)
else if (θ > H)
and if (κ = 0) then θ* ← π/2
else if (κ > 1) then θ[H] ← 2H - θ (this is reflection around θ = H), and then
if (L < θ[H] < H) then θ* ← θ[H]
else we know θ[H] < L and we consider the sequence θ[j] = j(H - L) + θ[H], j = 1, 2,..., and we assign θ* ← θ[j*], where j* is the least j such that L ≤ θ[j] < H. (The sequence θ[j] represents successive translation rightward.)
finally we use θ*, making the assignment
where
—Figures representing and
for 2N* = 20. The normal curve is the density for θ. (a) Reflections and translations as described in the appendix. Long-dashed lines represent the curve after reflection through L or H, while the short-dashed lines represent the reflected curve after one or more successive translations. (b) If δ = 1 and κ = 1, then X(i) is constrained to be in {1..., 2N* - 1}. The shaded regions correspond to those values of θ for which X(i) = 13 by
. The total shaded area is equal to
. (c) If δ = 0 then X(i) may take the value 0. The shaded area shows
. (d) If κ = 0 then X may take the value 2N*. The shaded area shows
; 10, δ, γ, 0).
The probability
If, on the other hand, 2N* - κ > δ and γ - κ > 0, then for X(i) = 0 and X(i) = 2N* we have
This folding and translating might seem to be a very involved process, but it is computationally much faster than realizing θ from a truncated normal distribution and computing the probability of X(i) when θ is from such a distribution.
Multilocus Monte Carlo variance calculation: We derive an expression for the variance of a product of J independent random variables, Wj, j = 1,... J:
Acknowledgments
We thank an anonymous referee for helpful comments on the manuscript and for suggesting the extension to the population model with hypergeometric sampling. This study was supported by National Science Foundation grant BIR-9807747 to E.T. Additionally, E.A. was supported by National Science Foundation grant BIR-9256537, the University of Washington QERM program, and a Burroughs Wellcome Fund PMMB training fellowship. E.W. was supported by National Institutes of Health grant GM40282 to M. Slatkin.
Footnotes
-
Communicating editor: G. A. Churchill
- Received December 27, 1999.
- Accepted August 7, 2000.
- Copyright © 2000 by the Genetics Society of America