## Abstract

This article presents an efficient importance-sampling method for computing the likelihood of the effective size of a population under the coalescent model of Berthier *et al.* Previous computational approaches, using Markov chain Monte Carlo, required many minutes to several hours to analyze small data sets. The approach presented here is orders of magnitude faster and can provide an approximation to the likelihood curve, even for large data sets, in a matter of seconds. Additionally, confidence intervals on the estimated likelihood curve provide a useful estimate of the Monte Carlo error. Simulations show the importance sampling to be stable across a wide range of scenarios and show that the *N*_{e} estimator itself performs well. Further simulations show that the 95% confidence intervals around the *N*_{e} estimate are accurate. User-friendly software implementing the algorithm for Mac, Windows, and Unix/Linux is available for download. Applications of this computational framework to other problems are discussed.

THE effective size *N*_{e} of a population is an important parameter determining the rate at which genetic drift and inbreeding occur in the population, as well as the population's capacity to respond to natural selection and to purge itself of deleterious mutations. It is consequently a parameter of great interest. However, it is difficult to estimate *N*_{e} using demographic data alone, especially for organisms with high fecundity and high juvenile mortality. For this reason, a variety of methods have been developed for estimating *N*_{e} from genetic data, including the “temporal methods” in which a population's effective size is estimated using data on the change of allele frequencies observed in two or more temporally spaced genetic samples.

The first temporal methods used moment-based estimators (Krimbas and Tsakas 1971; Nei and Tajima 1981; Pollak 1983; Waples 1989; Jorde and Ryman 1995). These estimators suffer from upward bias when low-frequency alleles are present. Williamson and Slatkin (1999) introduced a likelihood-based estimator of *N*_{e} by modeling the genetic samples as observations of the hidden Markov chain that arises from the Wright-Fisher population model. They showed the likelihood-based estimator to be less biased than the moment-based estimators, but their formulation allowed only for the analysis of loci with two alleles. Anderson *et al.* (2000) extended that work to loci with more than two alleles, using a computationally intensive Monte Carlo likelihood scheme. Using the same hidden Markov model, Wang (2001) developed a faster method for approximating the likelihood and conducted numerous simulations demonstrating the superiority of the likelihood-based method over moment-based estimators.

Berthier *et al.* (2002) introduced a likelihood method for two temporally spaced samples based on a different underlying model—they derive the likelihood using the coalescent (Kingman 1982; Hudson 1990). This provides a computational advantage when a large number of generations separate the samples. Additionally, it is easier to understand how this model applies to a continuously reproducing population rather than the likelihood models based on the discrete-generation Wright-Fisher population. Beaumont (2003) extended Berthier *et al.*'s (2002) model to multiple samples in time and developed several computational improvements. He also provided a general formula for using importance sampling within Markov chain Monte Carlo (MCMC) in difficult problems. Unfortunately, the approaches of both Berthier *et al.* (2002) and Beaumont (2003) are computationally intensive, requiring computation on the order of hours to analyze a small data set of 30 individuals per sample with 10–20 loci (Berthier *et al.* 2002). Further, since the posterior density curves for *N*_{e} are obtained by performing density estimation on values of *N*_{e} generated from a Markov chain, it is difficult to assess their accuracy.

The purpose of this article is to present a more efficient Monte Carlo approximation of the two-sample likelihood of Berthier *et al.* (2002). This new method is an importance sampling approach that is upward of 1000 times faster than the MCMC method. Excellent approximations to the likelihood curve for *N*_{e} are achieved in several seconds. Further, since the method is not based on MCMC, assessing the accuracy of the Monte Carlo estimate is easy and robust. In the following I review the likelihood introduced by Berthier *et al.* (2002). Then I present the new importance sampling method for computing the likelihood. Finally, I conduct simulations to verify that the estimates obtained are comparable to those in Berthier *et al.* (2002), to explore the accuracy of the importance sampling method, to assess the behavior of the estimator in the presence of many alleles, to show that the confidence intervals for estimates of *N*_{e} using the genealogical model are reliable, and to determine how much effect random mutations have on the estimate of *N*_{e}.

## PROBABILITY MODEL

I first consider the probability model for a single locus. The extension to multiple loci is straightforward and is described later. The data are two genetic samples: one of *n*_{0} codominant gene copies (*n*_{0}/2 diploid individuals) assumed sampled without replacement from the population at time 0, and another of *n*_{T} codominant gene copies assumed sampled with replacement from the population *T* generations *before* time 0, at time *T*. The sample at time 0 is assumed to be sampled without replacement because we will be modeling the sample using the neutral coalescent, which assumes that the sample consists of *n*_{0} *distinct* gene copies sampled from the population. In contrast, the sample at time *T* is assumed to be sampled with replacement because that allows us to model it as a multinomial sample, which, as described below, leads to further simplifications. In practice, samples are typically drawn without replacement: distinct individuals are seldom multiply sampled, and, if they are, the duplicates are identified by allelic identity at multiple loci and one of the individuals is removed. The model of sampling with replacement, however, is a good approximation of sampling without replacement as long as the actual size (and not necessarily the effective size) of the population from which the sample is taken is much larger than the sample itself.
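How good the with-replacement approximation is can be checked with a quick calculation. The sketch below (the population and sample sizes are arbitrary illustrations, not values from the article) computes the exact probability that a with-replacement sample contains no duplicates, which is the event on which the two sampling models agree:

```python
from math import prod

def prob_all_distinct(n: int, N: int) -> float:
    """Exact probability that n gene copies drawn WITH replacement from a
    population of N gene copies contain no duplicates -- the event on
    which the two sampling models agree."""
    return prod((N - i) / N for i in range(n))

# With N much larger than n, duplicates are rare and the approximation is
# good; with N close to n it breaks down.  Sizes here are arbitrary.
print(prob_all_distinct(50, 100_000))
print(prob_all_distinct(50, 500))
```

For the larger population the probability of an entirely duplicate-free sample is close to 1, while for the small population it is far from 1, matching the condition stated above.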

The number of distinct allelic types observed in the samples at times 0 and *T* is denoted by *K*, and the observed counts of different allelic types in the samples are denoted **a**_{0} = (*a*_{0,1}, … , *a*_{0,K}) and **a**_{T} = (*a*_{T,1}, … , *a*_{T,K}), respectively. We denote by *K*^{(0)} and *K*^{(T)} the number of distinct allelic types found in the sample at time 0 or in the sample at time *T*, respectively. What constitutes an allelic type will depend on the genetic marker system being used. For example, if one is using microsatellites, then alleles correspond to different numbers of repeats observable on a gel; with allozymes the alleles correspond to proteins with different electrophoretic mobilities; with single-nucleotide polymorphisms the alleles correspond to different nucleotide bases; and so forth. The probability model of Berthier *et al.* (2002) arises by considering the genealogy of the *n*_{0} gene copies sampled at time 0 and assuming that the genealogy follows the neutral coalescent process for a population of size *N*_{e} between time *T* and time 0.

At time 0 the *n*_{0} gene copies represent *n*_{0} separate lineages; however, if we were to trace each of those lineages back in time, some lineages may merge (“coalesce”) so that the number of lineages extant in the population at time *T* and ancestral to the *n*_{0} gene copies will be a number smaller than or equal to *n*_{0}. We let *n*_{f} denote the (unknown) number of lineages extant at time *T* that are ancestral to the *n*_{0} sampled genes. If the effective size of the population is small, coalescences will occur rapidly and *n*_{f} will typically be smaller than it would be if *N*_{e} were large. The probability that *n*_{0} lineages at time 0 are the descendants of *n*_{f} lineages at time *T* in a population of effective size *N*_{e} can be computed analytically (Tavaré 1984) as described below.

The *n*_{f} extant lineages at time *T* can be considered *n*_{f} gene copies that existed in the population at time *T* and that represent all of the ancestors at time *T* of the *n*_{0} genes sampled at time 0. We denote the (unknown) numbers of different allelic types carried among those *n*_{f} ancestors by **a**_{f} = (*a*_{f,1}, … , *a*_{f,K}). It is assumed that no mutation occurs between time *T* and time 0. As discussed later, this assumption means the method is suitable for samples taken a moderate number of generations apart. It follows that only allelic types appearing in the sample at time 0 appear among the *n*_{f} ancestors (*i.e.*, *a*_{0,k} = 0 implies *a*_{f,k} = 0 for all *k* = 1, … , *K*). It also follows that each allelic type observed in the *n*_{0} genes at time 0 must occur at least once among the *n*_{f} ancestors (*i.e.*, *a*_{0,k} > 0 ⇒ *a*_{f,k} > 0, *k* = 1, … , *K*), which implies that *K*^{(0)} ≤ *n*_{f} ≤ *n*_{0}. Just as the sample of *n*_{T} gene copies was assumed to be sampled with replacement from the population at time *T*, the *n*_{f} gene copies are assumed to be a separate, independent sample, with replacement, of *n*_{f} gene copies from the population at time *T*. The unknown frequencies of the *K* alleles in the population at time *T* are denoted by **p** = (*p*_{1}, … , *p*_{K}). My notation differs here from that of Berthier *et al.* (2002), who used **x** to denote the allele frequencies. The vector **p** is a nuisance parameter that may be integrated out by assuming a prior distribution for it. The prior is taken to be a (*K* − 1)-dimensional Dirichlet distribution with parameter **λ** = (λ_{1}, … , λ_{K}). Such a distribution arises as the equilibrium distribution of allele frequencies under a *K*-allele model with reversible mutation (Wright 1937). Often each λ_{k} is set equal to 1, giving a uniform prior for **p**, although, especially for large *K*, another sensible prior would be λ_{k} = 1/*K*, *k* = 1, … , *K* (Kass and Wasserman 1995).

The above sampling scheme implies a set of conditional probability densities involving the parameters and variables **a**_{0}, **a**_{T}, *n*_{f}, **a**_{f}, **p**, *n*_{0}, *n*_{T}, *T*, *N*_{e}, and **λ**. These conditional densities are derived as follows. Both **a**_{T} and **a**_{f} are independent multinomial samples from a population with allele frequencies **p**. Thus, *P*(**a**_{T} | *n*_{T}, **p**) ≡ Mult_{K}(*n*_{T}, **p**) and *P*(**a**_{f} | *n*_{f}, **p**) ≡ Mult_{K}(*n*_{f}, **p**), where Mult_{K}(*n*, **p**) denotes the probability mass function of a multinomial random variable of *n* trials with *K* categories having cell probabilities **p**. Conditional on **a**_{f}, the counts of different alleles **a**_{0} among the *n*_{0} descendants sampled at time 0 follow a distribution having the form of a Dirichlet-compound multinomial distribution (Johnson *et al.* 1997) defined by a product of binomial coefficients,

$$
P(\mathbf{a}_0 \mid \mathbf{a}_f, n_f, n_0) = \frac{\prod_{k=1}^{K}\binom{a_{0,k}-1}{a_{f,k}-1}}{\binom{n_0-1}{n_f-1}}, \tag{1}
$$

for values of **a**_{0} satisfying *a*_{0,k} ≥ *a*_{f,k}, *k* = 1, … , *K*, and where the binomial coefficient $\binom{-1}{-1}$ is defined as 1 (to deal easily with values of *k* for which *a*_{0,k} = *a*_{f,k} = 0). This follows from the fact that, forward in time, the bifurcations of a neutral coalescent starting with labeled lineages can be interpreted as steps in a Pólya-Eggenberger urn scheme (Hoppe 1984) in which each round of sampling involves taking a ball from the urn and placing it in a separate sample, and then returning two balls of like color to the urn. Under this interpretation, the *n*_{f} lineages at time *T* are like *n*_{f} balls in an urn, each one colored according to the allelic type it carries, so **a**_{f} counts the numbers of balls of *K* different colors in the urn before the onset of Pólya-Eggenberger sampling. Then **a**_{0} represents the number of balls of each color in the urn after *n*_{0} − *n*_{f} rounds of sampling. This is equivalent to collecting a sample in *n*_{0} − *n*_{f} rounds of sampling in which the numbers of balls of different colors added are given by the vector **a**_{0} − **a**_{f}, which follows the probability given in (1).
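The urn interpretation can be checked by simulation. The sketch below is a toy example of my own (the starting configuration and the number of rounds are arbitrary): starting from two balls of different colors and running two rounds of Pólya-Eggenberger sampling, the three possible final configurations turn out to be equally likely.

```python
import random
from collections import Counter

def polya_rounds(a_f, rounds, rng):
    """Pólya-Eggenberger sampling: each round draws a ball with probability
    proportional to its color's count, then adds one ball of that color
    (the net effect of removing one ball and returning two of like color)."""
    counts = list(a_f)
    for _ in range(rounds):
        r = rng.randrange(sum(counts))
        k = 0
        while r >= counts[k]:   # locate the color of the drawn ball
            r -= counts[k]
            k += 1
        counts[k] += 1
    return tuple(counts)

# Start with a_f = (1, 1) (two ancestral lineages of different types) and
# run n_0 - n_f = 2 rounds; each of the three possible final count
# vectors should appear about 1/3 of the time.
rng = random.Random(1)
sims = Counter(polya_rounds((1, 1), 2, rng) for _ in range(30000))
for outcome in [(3, 1), (2, 2), (1, 3)]:
    print(outcome, sims[outcome] / 30000)
```

The uniformity of the outcomes in this small case is the classical exchangeability property of the Pólya urn.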

The probability that *n*_{0} lineages at time 0 have *n*_{f} extant ancestral lineages *T* generations in the past in a neutral coalescent process, given an effective population size of *N*_{e}, can be computed following Tavaré (1984). Letting *t* = *T*/(2*N*_{e}), we have

$$
P(n_f \mid n_0, N_e, T) = \sum_{k=n_f}^{n_0} e^{-k(k-1)t/2}\,\frac{(2k-1)\,(-1)^{k-n_f}\,{n_f}_{(k-1)}\,{n_0}_{[k]}}{n_f!\,(k-n_f)!\,{n_0}_{(k)}}, \tag{2}
$$

where *i*_{[k]} = *i*(*i* − 1) · · · (*i* − *k* + 1) and *i*_{(k)} = *i*(*i* + 1) · · · (*i* + *k* − 1) are notations for the falling and rising factorial functions, respectively.
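A direct transcription of Tavaré's ancestral-process probability, as usually stated, can serve as a numerical check (a sketch; the function and variable names are mine):

```python
from math import exp, factorial

def falling(i, k):
    """Falling factorial i_[k] = i(i-1)...(i-k+1); equals 1 when k = 0."""
    out = 1
    for j in range(k):
        out *= i - j
    return out

def rising(i, k):
    """Rising factorial i_(k) = i(i+1)...(i+k-1); equals 1 when k = 0."""
    out = 1
    for j in range(k):
        out *= i + j
    return out

def p_num_ancestors(n0, nf, t):
    """Probability that n0 sampled lineages have nf ancestral lineages a
    scaled time t = T/(2*Ne) in the past, per the usual statement of
    Tavare's (1984) result."""
    total = 0.0
    for k in range(nf, n0 + 1):
        total += (exp(-k * (k - 1) * t / 2.0)
                  * (2 * k - 1) * (-1) ** (k - nf)
                  * rising(nf, k - 1) * falling(n0, k)
                  / (factorial(nf) * factorial(k - nf) * rising(n0, k)))
    return total

# For n0 = 2 this reduces to the familiar pairwise-coalescent results
# e^{-t} (no coalescence) and 1 - e^{-t} (coalescence).
print(p_num_ancestors(2, 2, 0.5), p_num_ancestors(2, 1, 0.5))
```

Summing the function over all possible numbers of ancestors returns 1, as a probability distribution must.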

With the component conditional densities specified as above, the joint probability density of all the variables may be written

$$
P(\mathbf{a}_0, \mathbf{a}_T, \mathbf{a}_f, n_f, \mathbf{p} \mid n_0, n_T, T, N_e, \boldsymbol{\lambda}) = P(\mathbf{a}_0 \mid \mathbf{a}_f, n_f, n_0)\, P(\mathbf{a}_f \mid n_f, \mathbf{p})\, P(\mathbf{a}_T \mid n_T, \mathbf{p})\, P(n_f \mid n_0, N_e, T)\, P(\mathbf{p} \mid \boldsymbol{\lambda}). \tag{3}
$$

This factorization of the probability model is depicted in the acyclic directed graph of Figure 1.

The likelihood of *N*_{e} is obtained by integrating the nuisance parameter **p** and the unknown, latent variables **a**_{f} and *n*_{f} out of the joint density:

$$
L(N_e) = \sum_{n_f=K^{(0)}}^{n_0} \sum_{\mathbf{a}_f} \int_{\mathbf{p}} P(\mathbf{a}_0 \mid \mathbf{a}_f, n_f, n_0)\, P(\mathbf{a}_f \mid n_f, \mathbf{p})\, P(\mathbf{a}_T \mid n_T, \mathbf{p})\, P(n_f \mid n_0, N_e, T)\, P(\mathbf{p} \mid \boldsymbol{\lambda})\, d\mathbf{p}. \tag{4}
$$

The sum over **a**_{f} in (4) has a great many terms, especially if *n*_{0}, *n*_{f}, and *K* are large, so an approximation to that sum is desirable. The next section explains how that sum can be efficiently approximated using an importance-sampling algorithm.

Before describing my own importance-sampling algorithm, I briefly describe the computational approach taken by Berthier *et al.* (2002) and Beaumont (2003). They use the Metropolis-Hastings algorithm to define a Markov chain of values of *N*_{e} (and of **p** in Berthier *et al.* 2002) having a limiting distribution proportional (or almost proportional) to

$$
P(\mathbf{a}_0 \mid \mathbf{p}, n_0, N_e, T)\, P(\mathbf{a}_T \mid n_T, \mathbf{p})\, P(\mathbf{p} \mid \boldsymbol{\lambda})\, P(N_e) \tag{5}
$$

in the case of Berthier *et al.* (2002) and

$$
P(\mathbf{a}_0, \mathbf{a}_T \mid n_0, n_T, T, N_e, \boldsymbol{\lambda})\, P(N_e) \tag{6}
$$

in the case of Beaumont (2003), where *P*(*N*_{e}) is a prior distribution assumed for *N*_{e}. Samples from the Markov chain are used to make a density estimate of the posterior density for *N*_{e}. Note that the first two terms of (5) are what would remain of the integrand in (4) after the sum over *n*_{f} and **a**_{f} was performed. Similarly, the first term in (6) is what would remain of the integrand in (4) after integrating out **p** and then summing over *n*_{f} and **a**_{f}. Thus, the MCMC method requires approximating the two sums in (4) at every step in the Markov chain to approximate (5) or (6) for use in a Metropolis-Hastings ratio. Berthier *et al.* (2002) and Beaumont (2003) do this approximation by Monte Carlo, using *P*(**a**_{f}, *n*_{f} | **a**_{0}, *N*_{e}, *T*) as an importance-sampling distribution. As Beaumont (2003) notes, this importance-sampling distribution could be improved by accounting for the dependence (apparent in Equation 4 and in the directed graph of Figure 1) of **a**_{f} on **p**. That improvement and others are demonstrated in the next section.

## EFFICIENT APPROXIMATION OF THE LIKELIHOOD

The computation of (4) presented here is made efficient by (i) avoiding the use of MCMC altogether; (ii) integrating over **p** in (4) analytically; (iii) recognizing that the difficult steps in calculating (4) involve neither *N*_{e} nor *T*, so that a number of quantities may be computed only once for any **a**_{0} and **a**_{T} and then used to quickly calculate *L*(*N*_{e}) for any value of *N*_{e}; and (iv) choosing a suitable importance-sampling distribution.

As in Beaumont (2003), analytical integration of the nuisance variable **p** proceeds from Mosimann's (1962) result that if **a** has a multinomial distribution with cell probabilities **p**, and **p** has a Dirichlet distribution, then, marginally, **a** will follow the Dirichlet-compound multinomial distribution. Starting from (4), reversing the order of integration and summation yields (7). Because *P*(**a**_{T} | *n*_{T}, **p**) is a multinomial distribution (which follows from the assumption that the sample at time *T* is drawn with replacement), the product of *P*(**a**_{T} | *n*_{T}, **p**) and *P*(**p** | **λ**) is proportional to *P*(**p** | **a**_{T}, *n*_{T}, **λ**), the Dirichlet-distributed posterior probability density of **p** conditional on **a**_{T} and the prior **λ**. Recognizing this yields (8). Then, using the fact that *P*(**p** | **a**_{T}, *n*_{T}, **λ**) is a Dirichlet density and the fact that **a**_{f} follows a multinomial distribution (again, a consequence of the assumption that the allelic types of the *n*_{f} lineages are a sample with replacement from the population at time *T*), we apply Mosimann's (1962) result to obtain (9). In (9), *P*(**a**_{f} | *n*_{f}, **a**_{T}, *n*_{T}, **λ**) is a Dirichlet-compound multinomial distribution with parameters **a**_{T} + **λ**, so that (10) holds, where λ_{•} denotes $\sum_{k=1}^{K}\lambda_k$.

*C*(**a**_{T}, **λ**) is a constant involving multinomial coefficients and the coefficients of the Dirichlet distribution, given in (11). By rearranging the sums in (9) to obtain (12), it can be seen that the difficult part of evaluating *L*(*N*_{e}) is the sum over **a**_{f}. It is also clear that once the sum over **a**_{f} has been computed for every value of *n*_{f} from *K*^{(0)} to *n*_{0}, computing the likelihood for any value of *N*_{e} requires only a small sum over the possible values of *n*_{f}. Hence, the primary task is to develop a good approximation for

$$
P(\mathbf{a}_0 \mid n_f, \mathbf{a}_T, n_T, \boldsymbol{\lambda}) = \sum_{\mathbf{a}_f} P(\mathbf{a}_0 \mid \mathbf{a}_f, n_f, n_0)\, P(\mathbf{a}_f \mid n_f, \mathbf{a}_T, n_T, \boldsymbol{\lambda}). \tag{13}
$$

This is undertaken by Monte Carlo, made efficient by importance sampling. The importance-sampling formulation follows from the fact that (13) may be rewritten as

$$
P(\mathbf{a}_0 \mid n_f, \mathbf{a}_T, n_T, \boldsymbol{\lambda}) = \sum_{\mathbf{a}_f :\, P^*(\mathbf{a}_f) > 0} \frac{P(\mathbf{a}_0 \mid \mathbf{a}_f, n_f, n_0)\, P(\mathbf{a}_f \mid n_f, \mathbf{a}_T, n_T, \boldsymbol{\lambda})}{P^*(\mathbf{a}_f)}\, P^*(\mathbf{a}_f), \tag{14}
$$

where *P**(**a**_{f}) is a probability mass function for **a**_{f} having the property that, for any value of **a**_{f} for which *P**(**a**_{f}) = 0, the product *P*(**a**_{0} | **a**_{f}, *n*_{f}, *n*_{0})*P*(**a**_{f} | *n*_{f}, **a**_{T}, *n*_{T}, **λ**) is also equal to zero. Equation 14 suggests that *P*(**a**_{0} | *n*_{f}, **a**_{T}, *n*_{T}, **λ**) may be approximated by simulating *m* values of **a**_{f} from *P**(**a**_{f}) and computing

$$
\hat{P}(\mathbf{a}_0 \mid n_f, \mathbf{a}_T, n_T, \boldsymbol{\lambda}) = \frac{1}{m} \sum_{i=1}^{m} \frac{P(\mathbf{a}_0 \mid \mathbf{a}_f^{(i)}, n_f, n_0)\, P(\mathbf{a}_f^{(i)} \mid n_f, \mathbf{a}_T, n_T, \boldsymbol{\lambda})}{P^*(\mathbf{a}_f^{(i)})}. \tag{15}
$$

The variance of this Monte Carlo estimate is reduced to the extent that *P**(**a**_{f}) can be made proportional to *P*(**a**_{0} | **a**_{f}, *n*_{f}, *n*_{0})*P*(**a**_{f} | *n*_{f}, **a**_{T}, *n*_{T}, **λ**) (Hammersley and Handscomb 1964).
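The importance-sampling identity of the kind used in (14) and (15) can be illustrated generically. In the sketch below, the summand *f* and the proposal *P** are toy stand-ins of my own, not the article's distributions; a finite sum is rewritten as an expectation under the proposal and estimated by averaging weighted draws:

```python
import random

xs = list(range(11))                      # toy support {0, ..., 10}
f = {x: (x + 1) ** 2 for x in xs}         # arbitrary positive summand
p_star = {x: (x + 1) / 66 for x in xs}    # non-uniform proposal; sums to 1

exact = sum(f.values())                   # the sum we want: 1^2 + ... + 11^2

rng = random.Random(42)
m = 20000
draws = rng.choices(xs, weights=[p_star[x] for x in xs], k=m)
# Average of f(x) / P*(x) over draws from P* estimates the sum, mirroring
# the rewriting of (13) as the weighted expectation (14) and the
# Monte Carlo average (15).
estimate = sum(f[x] / p_star[x] for x in draws) / m
print(exact, estimate)
```

The closer the proposal is to proportionality with the summand, the smaller the variance of the estimate, which is exactly the consideration driving the choice of *P**(**a**_{f}) above.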

Our goal is thus to find a *P**(**a**_{f}) that is approximately proportional to *P*(**a**_{0} | **a**_{f}, *n*_{f}, *n*_{0})*P*(**a**_{f} | *n*_{f}, **a**_{T}, *n*_{T}, **λ**). Such an approximation may be obtained by sequentially simulating the components of **a**_{f}. The reasoning is as follows: *P*(**a**_{0} | **a**_{f}, *n*_{f}, *n*_{0})*P*(**a**_{f} | *n*_{f}, **a**_{T}, *n*_{T}, **λ**) is the product of two Dirichlet-compound multinomial probability mass functions, and the marginal distribution of any component of a Dirichlet-compound multinomial random vector follows a beta-binomial distribution with parameters that are easily computed (see Johnson *et al.* 1997, p. 81). We are thus able to compute, exactly, the marginal distribution of *a*_{f,1}, the first component of **a**_{f}, under the distribution proportional to *P*(**a**_{0} | **a**_{f}, *n*_{f}, *n*_{0})*P*(**a**_{f} | *n*_{f}, **a**_{T}, *n*_{T}, **λ**), as desired. That marginal distribution is proportional to the product of two beta-binomial probability mass functions, and normalizing it can be done quickly because *a*_{f,1} may assume no more than *n*_{f} values. We may then simulate a value, *a*_{f,1}^{*}, from that distribution. After simulating *a*_{f,1}^{*}, the alleles corresponding to *k* = 1 are conceptually “discarded” from the data set (thus reducing *n*_{0} to *n*_{0} − *a*_{0,1}, *n*_{f} to *n*_{f} − *a*_{f,1}, and *n*_{T} to *n*_{T} − *a*_{T,1}), and a similar scheme is pursued to simulate *a*_{f,2}. Then the alleles corresponding to *k* = 2 are discarded and *a*_{f,3} is simulated, and so forth.
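The beta-binomial computation underlying each step of the sequential proposal can be sketched as follows. The parameter values below are arbitrary illustrations; in the actual algorithm the two pmfs would carry the remaining gene-copy counts and Dirichlet parameters at step *k*:

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    """Standard beta-binomial pmf: the marginal distribution of one
    component of a Dirichlet-compound multinomial with n trials, where
    alpha is that component's parameter and beta is the sum of the rest."""
    return comb(n, k) * exp(log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))

# Proposal for one component: proportional to a product of two
# beta-binomial pmfs, normalized over its few possible values.
support = range(6)        # at most a handful of values; n = 5 is arbitrary
weights = [beta_binomial_pmf(k, 5, 2.0, 3.0) * beta_binomial_pmf(k, 5, 1.0, 4.0)
           for k in support]
z = sum(weights)          # plays the role of the normalizing constant
proposal = [w / z for w in weights]
print(proposal)
```

Because the support has at most *n*_{f} + 1 points, normalizing the product and drawing from it is cheap, which is what makes the sequential scheme fast.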

Mathematically, the probability mass function *P**(**a**_{f}) is defined to be the product, over *k* = 1, … , *K*, of the sequential proposal probabilities for the components *a*_{f,k}, where *I*(*a*_{0,k} > 0) is 1 if *a*_{0,k} > 0 and 0 otherwise, and where *n*_{0,≥k}, *n*_{f,≥k}, *n*_{T,≥k}, and λ_{•,≥k} are defined to be *n*_{0} − ∑_{j<k} *a*_{0,j}, *n*_{f} − ∑_{j<k} *a*_{f,j}, *n*_{T} − ∑_{j<k} *a*_{T,j}, and λ_{•} − ∑_{j<k} λ_{j}, respectively. The value *z*_{k} is a normalizing constant equal, for each *k*, to the sum of the part within square brackets over values of *a*_{f,k} between *I*(*a*_{0,k} > 0) and the min term in the equation, inclusive.

While this *P**(**a**_{f}) is not exactly proportional to *P*(**a**_{0} | **a**_{f}, *n*_{f}, *n*_{0})*P*(**a**_{f} | *n*_{f}, **a**_{T}, *n*_{T}, **λ**), it is close enough that the Monte Carlo estimate based on (14) is quite good, even with *m* as small as 100. Further, by judicious use of recurrence relations for binomial coefficients and the gamma function, and by storage of frequently used quantities, values of **a**_{f}^{*} may be simulated from *P**(**a**_{f}) rapidly.

Once the Monte Carlo approximations to *P*(**a**_{0} | *n*_{f}, **a**_{T}, *n*_{T}, **λ**) are computed for all possible values of *n*_{f}, they may be used in (12) to compute the likelihood for any value of *N*_{e}. In practice, the likelihood curve is computed by evaluating (12) over a fine grid of values of *N*_{e}. The maximum-likelihood estimate *N̂*_{e} can then be found by parabolic interpolation of the point on the grid with highest likelihood and its two neighbors.
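The grid-plus-parabola refinement can be sketched in a few lines (the function name, grid, and example curve are mine; on an exactly quadratic log-likelihood the vertex is recovered exactly):

```python
def parabolic_peak(x, y):
    """Abscissa of the vertex of the parabola through three equally spaced
    points (x[0], y[0]), (x[1], y[1]), (x[2], y[2]), middle point highest."""
    h = x[1] - x[0]
    return x[1] + 0.5 * h * (y[0] - y[2]) / (y[0] - 2.0 * y[1] + y[2])

# Example: a quadratic curve peaked at the (arbitrary) value 103.7,
# evaluated on a grid that does not include the peak.
f = lambda v: -(v - 103.7) ** 2
grid = [90.0, 100.0, 110.0]
print(parabolic_peak(grid, [f(v) for v in grid]))
```

Near its maximum a log-likelihood curve is approximately quadratic, which is why this refinement of the best grid point works well in practice.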

When data are available on multiple loci that are not in linkage disequilibrium, the overall likelihood is the product over loci of the likelihoods for each locus. The variance of the Monte Carlo estimator can be computed by standard methods, providing a direct estimate of the Monte Carlo error. For multiple loci, the calculation of the Monte Carlo variance follows that in the Appendix of Anderson *et al.* (2000).
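Because independent loci multiply, per-locus log-likelihood curves evaluated on a shared grid of *N*_{e} values simply add. A minimal sketch (the per-locus curves below are synthetic placeholders of my own, not output of the actual method):

```python
import math

# Shared geometric grid of Ne values (arbitrary choice of range/spacing).
ne_grid = [20.0 * 1.1 ** i for i in range(40)]

def fake_locus_loglik(ne, peak):
    """Hypothetical per-locus log-likelihood curve peaked at `peak`."""
    return -(math.log(ne) - math.log(peak)) ** 2

# Three loci whose individual curves peak at different (made-up) values.
curves = [[fake_locus_loglik(ne, peak) for ne in ne_grid]
          for peak in (80.0, 100.0, 125.0)]
total = [sum(vals) for vals in zip(*curves)]      # log L summed over loci
mle_grid_point = ne_grid[max(range(len(ne_grid)), key=total.__getitem__)]
print(mle_grid_point)
```

For these synthetic curves the combined maximum falls near the geometric mean of the per-locus peaks, illustrating how the multilocus curve pools information across loci.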

The calculations described above are implemented in the computer program *CoNe*, which may be downloaded from santacruz.nmfs.noaa.gov/staff/eric_anderson/. *CoNe* computes a Monte Carlo estimate of the likelihood curve and summarizes the Monte Carlo error with upper and lower 95% confidence intervals on the estimate of the likelihood curve. It also reports the maximum-likelihood estimate (MLE), *N̂*_{e}, and a 95% confidence interval around *N̂*_{e}. The endpoints of the confidence interval around *N̂*_{e} are the values of *N*_{e} for which the natural logarithm of the likelihood is 1.96 units smaller than log *L*(*N̂*_{e}). Given any prior distribution for *N*_{e}, the posterior distribution may be computed from the likelihood, if desired.
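The interval rule (keep all *N*_{e} whose log-likelihood is within 1.96 units of the maximum) can be sketched on a grid. The Gaussian-shaped curve below is a synthetic stand-in whose endpoints are known analytically, roughly mu ± sigma × sqrt(2 × 1.96):

```python
def support_interval(grid, loglik, drop=1.96):
    """Smallest and largest grid values whose log-likelihood lies within
    `drop` units of the maximum: a grid version of the interval rule."""
    top = max(loglik)
    inside = [g for g, ll in zip(grid, loglik) if ll >= top - drop]
    return min(inside), max(inside)

# Synthetic Gaussian-shaped log-likelihood (mu and sigma are arbitrary);
# the endpoints should land near 100 +/- 15 * sqrt(3.92) ~ 100 +/- 29.7.
mu, sigma = 100.0, 15.0
grid = [50.0 + 0.1 * i for i in range(1001)]
ll = [-0.5 * ((g - mu) / sigma) ** 2 for g in grid]
lo, hi = support_interval(grid, ll)
print(lo, hi)
```

On a real, asymmetric likelihood curve the two endpoints need not be equidistant from *N̂*_{e}, which the grid scan handles automatically.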

## SIMULATIONS AND RESULTS

It is not my goal here to undertake an exhaustive set of simulations comparing the genealogical method to other methods for estimating *N*_{e}. Such a study has been completed recently (Tallmon *et al.* 2004). Instead, I conduct five sets of simulations to (i) confirm that the estimator presented here estimates the same thing as Berthier *et al.*'s (2002) estimator, (ii) assess the reliability of the estimate of the Monte Carlo error, (iii) assess the method's behavior in the presence of many alleles, (iv) demonstrate that the 95% confidence interval on *N̂*_{e} computed by *CoNe* is accurate, and (v) assess the effect of mutations on the estimation of *N*_{e} with *CoNe*.

### Comparison to previous results:

First, I investigated the difference in running times between *CoNe* and the program TM3 presented by Berthier *et al.* (2002). (The program TMVP presented in Beaumont (2003) is supposed to be somewhat faster, but it gave spurious results on the data set used to test running times.) To do this comparison I used the data file supplied as an example data set in the distribution of TM3. It includes simulated data of 10 loci sampled from 50 diploids on two occasions separated by a scaled time of *t* = 0.05. I analyzed it assuming that the number of generations between samples was 10. Accordingly, the correct *N*_{e} is 100. The analysis of the data by *CoNe* using *m* = 250 importance-sampling repetitions required user and system time of 2.0 sec on a 2-GHz Macintosh G5 processor and had a maximum memory usage of 1.6 Mb. The likelihood curve was estimated with negligible Monte Carlo error (Figure 2, thick solid line).

I then analyzed the same data set on the same computer using TM3 with the default settings. After 20, 200, and 2000 sec, I took the output and estimated the log-likelihood curve using the density function in the computer package R (Ihaka and Gentleman 1996). Each of these curves is plotted in Figure 2. It is apparent that after running 1000 times as long as *CoNe*, TM3 has obtained a good estimate for *N̂*_{e}, but it has not estimated the whole likelihood curve very well. The maximum memory usage for each run of TM3 was 179 kb.

Second, I repeated the simulation experiment performed by Berthier *et al.* (2002), in which samples of *L* loci (*L* = 5, 10, or 20) from 30 or 60 diploid individuals, separated by one or five generations, were drawn from simulated populations of effective size 10, 20, or 50, with a variety of different initial allele frequencies (scenarios *A*–*C*, see Table 1 legend). (For brevity I omit the simulation of allele frequency scenario F from Berthier *et al.* 2002.) For each combination of parameters, 2000 simulated data sets were analyzed with *CoNe* using 1000 Monte Carlo samples (*i.e.*, *m* = 1000), and the median maximum-likelihood estimate, the square root of the mean squared error, and a summary of the confidence intervals obtained were recorded. The results of these simulations are shown in italics in Table 1. The results from Berthier *et al.*'s (2002) simulations are shown for comparison in regular type in adjacent columns. The two methods are clearly comparable, although *CoNe* is much faster. The difference in results between the two methods is accounted for by the fact that this study sampled 10 times as many simulated data sets, and the distribution of estimated values of *N*_{e} is heavy tailed to the right.

### Monte Carlo error:

When estimating parameters, uncertainty is often expressed in terms of a confidence interval or a “credible set.” When the likelihood curve itself is estimated by Monte Carlo, there is another source of uncertainty due to the Monte Carlo variance in the estimate of the likelihood, and that uncertainty may also be expressed by a confidence interval. It is this uncertainty due to Monte Carlo variance that is the topic of this section, and the confidence intervals on the likelihood *L*(*N*_{e}), referred to here, should not be confused with confidence intervals on the estimate *N̂*_{e} itself (discussed later). Ninety-five percent confidence intervals for the Monte Carlo estimate of *L*(*N*_{e}) may be computed by adding (for the upper limit) and subtracting (for the lower limit) 1.96 times the Monte Carlo standard error of the estimate.
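The construction of these Monte Carlo error bars can be sketched generically. The lognormal "weights" below are synthetic stand-ins for the importance-sampling weights whose mean estimates the likelihood:

```python
import math
import random
import statistics

# A likelihood estimate of the form (15) is a sample mean of m weights,
# so its Monte Carlo standard error is the sample standard deviation
# divided by sqrt(m), and the 95% band is the mean +/- 1.96 SE.
rng = random.Random(7)
weights = [rng.lognormvariate(0.0, 0.3) for _ in range(1000)]

est = statistics.fmean(weights)
se = statistics.stdev(weights) / math.sqrt(len(weights))
lower, upper = est - 1.96 * se, est + 1.96 * se
print(est, (lower, upper))
```

Repeating this calculation at every grid value of *N*_{e} produces the band around the likelihood curve described in the text.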

In the simulations described above, 30,000 data sets were analyzed by *CoNe*. Of those, the data set yielding the highest Monte Carlo variance was simulated using allele frequency scenario A, with 20 loci and true *N*_{e} = 10 and *T* = 1. The likelihood curve for *N*_{e} given this data set and computed using *m* = 1000 Monte Carlo samples is shown as the solid line in Figure 3a. The dashed lines on either side of the likelihood curve are the 95% confidence intervals on the Monte Carlo estimate of *L*(*N*_{e}). The distance between the two dashed lines measures how much uncertainty is in the Monte Carlo estimate of *L*(*N*_{e}). This figure is provided as an example of how efficient this Monte Carlo scheme is. The curve in Figure 3a was achieved in <6 sec on a 2 GHz G5 processor, and, although it represents the worst Monte Carlo estimate in 30,000 data sets, the approximation is still good. By increasing the number of Monte Carlo samples to *m* = 50,000, an excellent approximation is achieved (Figure 3b) in only 4 min 3 sec.

It should be apparent that the lower and upper confidence intervals on the Monte Carlo estimate (Figure 3, dashed lines) provide a convenient summary of the accuracy of the approximation. This is a more useful measure of uncertainty than is available when MCMC and density estimation are used to estimate the likelihood curve.

A series of simulations was undertaken to determine how reliable the Monte Carlo confidence intervals are. This was done by analyzing 70 separate data sets [two samples, separated by 20 generations, of 100 individuals typed at 12 loci with seven alleles having frequencies drawn from a uniform Dirichlet (1, … , 1) distribution] from simulated populations of 1000 individuals. For each data set an estimate of the likelihood curve was computed with *CoNe* using *m* = 100,000 replicates. This estimate was taken to be close to the exact likelihood. Then the same data set was reanalyzed 500 more times, each time with only *m* = 100 replicates, and it was recorded whether the “exact” likelihood curve estimated with *m* = 100,000 fell within the Monte Carlo confidence intervals given by analyzing the data with *m* = 100. Even with as few as *m* = 100 replicates, the average width of the Monte Carlo confidence intervals over all the simulations was 0.07 units of log-likelihood. Sixty percent of the time, the exact likelihood curve fell entirely within the confidence intervals at all points within 4 log-likelihood units of the maximum. When confidence intervals failed to contain the exact likelihood, the average distance between the exact likelihood and the edge of the confidence interval was only 0.013. Thus, while the confidence intervals for the estimate of *L*(*N*_{e}) are not strictly 95% confidence intervals, they do provide a very good measure of the uncertainty in the Monte Carlo estimation. Regardless, in all simulated data sets I have analyzed, the Monte Carlo error can be reduced to negligible levels with never more than a few minutes of computation and typically in a matter of seconds.

### Behavior with many alleles:

Some computational methods for estimating *N*_{e} become unstable or converge slowly when applied to data sets with many alleles (*cf.* Anderson *et al.* 2000). Therefore, I investigated the performance of the importance-sampling method and of the coalescent-based *N*_{e} estimation procedure with loci having many alleles. *CoNe* was applied to simulated data of 100 individuals genotyped at 10 loci, each with *K* alleles and a uniform initial allele frequency. *K* was set equal to 5, 8, 13, 20, 30, 50, 75, or 100 in each simulation for all 10 of the loci. These simulations were done using a coalescent method. In one set of simulations the genetic drift between samples was set to that of 20 generations in a population of size 100 [*i.e.*, a scaled time of *t* = *T*/(2*N*_{e}) = 0.1] and in another set of simulations it was set to that of 2 generations in a population of size 100 (*t* = 0.01). For each combination of number of alleles and amount of genetic drift, 250 data sets were simulated and analyzed. For all data sets the importance-sampling procedure remained stable. Even with as many as 100 alleles, the likelihood curve was reliably estimated with as few as *m* = 1000 Monte Carlo replicates. For the scaled time of 0.1 (corresponding to 20 generations in a population of size 100), the performance of the coalescent-based estimator of *N*_{e} was not greatly affected by the number of alleles. At all numbers of alleles, the estimator had a slight upward bias, with the mean maximum-likelihood estimate of *N*_{e} being ∼103—only slightly greater than the true value of 100 (Figure 4a). With a scaled time of 0.01 (corresponding to 2 generations in a population of size 100), the importance-sampling procedure was once again stable. However, the estimator itself showed an upward bias, particularly as the number of alleles increased beyond 20. Additionally, with very large numbers of alleles (75 and 100), on average, the 95% confidence interval around the maximum-likelihood estimate of *N*_{e} did not overlap the true value of 100 (Figure 4b).
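Data sets of this general form can be approximated with a simple forward-in-time Wright-Fisher sketch. Note that the article's own simulations used a coalescent method, so this forward simulation is an illustrative stand-in rather than the procedure actually used; parameter defaults below are one of the simulated scenarios:

```python
import numpy as np

def simulate_temporal_samples(N=100, T=20, K=13, n=100, seed=0):
    """Forward Wright-Fisher sketch for one locus: K alleles start at
    Dirichlet(1, ..., 1) frequencies, drift for T generations in a diploid
    population of size N, and samples of n diploids (2n gene copies) are
    drawn at both ends of the interval."""
    rng = np.random.default_rng(seed)
    freqs = rng.dirichlet(np.ones(K))               # uniform Dirichlet start
    sample_T = rng.multinomial(2 * n, freqs)        # allele counts at time T
    for _ in range(T):                              # multinomial drift
        counts = rng.multinomial(2 * N, freqs)
        freqs = counts / (2 * N)
    sample_0 = rng.multinomial(2 * n, freqs)        # allele counts at time 0
    return sample_T, sample_0

sT, s0 = simulate_temporal_samples()
```

Repeating this over 10 loci and 250 replicate data sets reproduces the structure (though not the exact mechanics) of the simulation study described above.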

It is important to note that such pathological data sets would rarely, if ever, be encountered from natural populations having an *N*_{e} low enough that it might be reliably estimated. The appearance of so many alleles in such a population would suggest either that the alleles were under some sort of balancing selection or that the mutation rate at the locus was quite high, violating the assumptions of the *N*_{e} estimation method.

### Accuracy of confidence intervals for *N*_{e}:

*CoNe* reports a 95% confidence interval around the MLE, *N̂*_{e}. The true value of *N*_{e} ought to be contained in that confidence interval in 95% of data sets (simulated under the model) analyzed by *CoNe*. Tallmon *et al.* (2004) report that the credible intervals (these are like confidence intervals, but are computed from a Bayesian perspective) computed by a related program, TMVP (Beaumont 2003), are grossly inaccurate. TMVP is based on the same likelihood model as *CoNe*, but it approximates the likelihood by MCMC instead of using the efficient importance sampling algorithm presented here. In Tallmon *et al.*'s (2004) simulations, data sets of *n* = 20 or 60 diploid individuals were drawn from populations of *N*_{e} = 20, 50, or 100, separated by 1, 3, 5, or 10 generations. The individuals were genotyped at 5 or 15 loci, which were initialized with eight alleles having frequencies drawn from a uniform Dirichlet distribution. Over all the simulation conditions, TMVP's credible intervals failed to contain the true value of *N*_{e} 24.4% of the time. In one instance, with *N*_{e} = 20, *n* = 60, *T* = 3, and with 15 loci, TMVP's credible intervals failed to contain the true *N*_{e} >78% of the time.

Since *CoNe* is based on the same likelihood model as TMVP, I performed simulations like those of Tallmon *et al*. (2004) to investigate whether *CoNe*'s confidence intervals suffer similar degrees of inaccuracy. For each set of simulation parameters, I applied *CoNe* to 1000 simulated data sets, recording how often the true *N*_{e} was not contained within *CoNe*'s 95% confidence intervals for *N*_{e} (Table 2).
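The article does not restate how *CoNe* constructs its 95% confidence interval. One conventional construction, assumed here purely for illustration, is a likelihood-ratio interval: all *N*_{e} values whose log-likelihood lies within χ²₁(0.95)/2 ≈ 1.92 units of the maximum.

```python
import numpy as np

def likelihood_ratio_ci(ne_grid, loglik, drop=1.92):
    """Hypothetical sketch of a 95% likelihood-ratio interval: the MLE plus
    the extreme grid points whose log-likelihood is within `drop`
    (= chi-squared_1(0.95) / 2) of the maximum."""
    i = int(np.argmax(loglik))
    inside = np.where(loglik >= loglik[i] - drop)[0]
    return ne_grid[i], ne_grid[inside[0]], ne_grid[inside[-1]]

ne = np.arange(10, 500)
ll = -0.5 * ((ne - 100) / 40.0) ** 2     # mock quadratic log-likelihood
mle, lo, hi = likelihood_ratio_ci(ne, ll)
```

Tallying, over many simulated data sets, how often the true *N*_{e} falls between `lo` and `hi` gives exactly the coverage rates reported in Table 2.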

*CoNe*'s confidence intervals appear to be more accurate than TMVP's credible intervals as reported by Tallmon *et al.* (2004). Over all simulation conditions, true *N*_{e} was contained in the 95% confidence interval 94.3% of the time, just as expected. The worst performance of *CoNe*'s 95% confidence intervals was on the data simulated with *N*_{e} = 50, *n* = 20, *T* = 3, using 15 loci, when 11.1% of the time the true *N*_{e} was not contained in the confidence interval.

It is not clear why Tallmon *et al.* (2004) found TMVP's credible intervals to perform so poorly when evaluated from a frequentist perspective on simulated data. It is possible that the Markov chain of *N*_{e} values produced by TMVP was not run long enough to achieve a good estimate of the posterior density near the tails of the posterior distribution. At any rate, the confidence intervals around *N*_{e} computed by *CoNe* seem to be reliable, and it is straightforward to assess how well the Monte Carlo estimate of the likelihood has converged.

### The effect of mutations:

Methods for estimating *N*_{e} from temporally spaced samples have traditionally been applied to samples separated by a small number of generations in populations of plants or animals. In such situations, the assumption of no mutation is quite reasonable. However, the growing ability to extract and amplify genetic markers from archived samples, as well as the study of short-lived organisms like bacteria and viruses, makes it increasingly likely that two samples will be separated by enough time that the assumption of no mutation may be violated. An apparent mutation can be caused by any heritable alteration that changes the observed allelic state of a gene copy. Depending on the type of marker system used, these alterations could be point mutations, insertions, deletions, recombinations, or gene conversions. I undertook a short simulation to determine under what conditions (of mutation rate and time between samples) mutation can appreciably affect the inference of *N*_{e} with a program like *CoNe*. I initialized a simulated Wright-Fisher population of *N*_{e} = 1000 diploids at time *T* with allele frequencies drawn from a uniform Dirichlet distribution with eight alleles and simulated the population forward in time until time 0, under both an infinite-alleles model (IAM) of mutation and a symmetric *K*-allele model (KAM) of mutation, with mutation rate *u* per gamete per generation. Samples of size 100 were drawn at times *T* and 0, and *CoNe* was used to estimate *N*_{e} and the scaled time *t* = *T*/(2*N*_{e}). This was done for all combinations of *u* ∈ {0, 10^{−3}, 10^{−4}, 10^{−5}, 10^{−6}} and *T* ∈ {100, 200, 400, 600, 1000} generations. For each *T* and *u* and mutation model (IAM or KAM), 300 data sets were simulated and analyzed. The mean estimated *N*_{e} and the mean estimated scaled time *t* for each condition were recorded.
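A minimal sketch of the KAM portion of this simulation follows (the IAM, which introduces brand-new alleles, is omitted). Applying the mutation step deterministically to expected frequencies before the multinomial drift step is a simplifying assumption for brevity:

```python
import numpy as np

def drift_with_kam(freqs, N, T, u, rng):
    """Wright-Fisher drift with symmetric K-allele mutation (KAM) at rate u
    per gamete per generation: a mutating gene copy moves to one of the
    other K - 1 alleles uniformly at random."""
    K = len(freqs)
    for _ in range(T):
        # deterministic mutation step applied to expected frequencies
        freqs = freqs * (1 - u) + u * (1 - freqs) / (K - 1)
        # multinomial drift in a diploid population of size N
        freqs = rng.multinomial(2 * N, freqs) / (2 * N)
    return freqs

rng = np.random.default_rng(7)
f0 = rng.dirichlet(np.ones(8))                 # eight alleles, uniform Dirichlet
fT = drift_with_kam(f0, N=1000, T=200, u=1e-4, rng=rng)
```

With *u* = 0 this reduces to pure drift, the no-mutation baseline against which the biased estimates are compared in Figure 5.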

Mutation biases the estimate of *N*_{e} upward. This makes sense—mutation tends to counteract the effects of drift (*i.e*., the fixation and loss of alleles), so it makes the population appear larger than it is. In investigating the effect of mutation, it is more convenient to express it in terms of the estimated scaled time *t* = *T*/(2*N*_{e}): mutation biases the estimate of *t* downward. Figure 5 plots the results of the simulations.

On Figure 5's *x*-axis are the mean estimates of *t* = *T*/(2*N*_{e}) for each of the five values of *T* with no mutation (*u* = 0). On Figure 5's *y*-axis are the mean estimates of *t* under either the IAM or the KAM models for the different values of *u*. It is clear from the figure that the effect of mutation becomes more pronounced as *t* increases. This is expected—as more time elapses between the samples, there is more opportunity for mutations to occur. However, it is also clear that the mutation rate must be quite high for it to have a substantial effect, especially at values of *t* < 0.2.

Because mutations will have an effect only if they occur on lineages ancestral to the sample at time 0, and because the probability that such mutations will occur depends on the scaled mutation rate θ = 4*N*_{e}*u* and only weakly on sample size, some rough generalizations may be drawn from the above results. When θ < 4 × 1000 × 10^{−5} = 0.04 the effect of mutation for all *t* < 0.5 is not overwhelming. Further, for the KAM, scaled mutation rates of θ < 4 × 1000 × 10^{−4} = 0.4 are unlikely to bias estimates of *N*_{e} for *t* ≤ 0.1. Conversely, if you are using markers that mutate according to an IAM, then even with θ = 0.4 your estimates of *t* may be substantially biased downward and hence your estimates of *N*_{e} biased upward.

As a practical example, suppose you have sampled microsatellites in a fish population. Assume that the mutation rate at each locus is *u* = 10^{−4} and that the KAM provides a reasonable approximation over these timescales to the mode of mutation of microsatellites. If the estimate of the scaled time between samples is 0.1, then, provided your samples are separated by no more than 200 generations (so that *N*_{e} ≤ 1000 and hence θ ≤ 0.4), it is unlikely that mutations are biasing your estimate.

## DISCUSSION

I have presented a computational method for fast and precise approximation of the likelihood for *N*_{e} under a coalescent model. This method is an importance sampling algorithm implemented in the computer program *CoNe*. Previous approaches used MCMC and were considerably slower. The performance of my importance-sampling algorithm demonstrates that, although MCMC is broadly applicable and is typically easy to implement, a much faster solution may be available if the problem can be decomposed in such a manner as to avoid MCMC. In addition to being faster, it is often also easier to assess the Monte Carlo error if one is using independently simulated samples (as in *CoNe*) rather than correlated samples from a Markov chain (as in MCMC).

The program *CoNe* is based upon the same underlying model as the programs TM3 (Berthier *et al.* 2002) and TMVP, in the case with only two samples (Beaumont 2003). Thus, the coalescent-based estimator implemented in *CoNe* is expected to perform similarly to TM3 and TMVP. In this article, I used computer simulations to show that estimates made by *CoNe* and TM3 are similar.

Some approximations to the likelihood for *N*_{e} become unstable when the data include many alleles (*e.g.*, Anderson *et al.* 2000). This is not the case with *CoNe*. I subjected *CoNe* to a number of tests involving samples with very large numbers of alleles. The importance-sampling algorithm always performed well in approximating the likelihood. The maximum-likelihood estimator, however, seems to be upwardly biased for *N*_{e} when the amount of genetic drift is small, *i.e*., when *T*/(2*N*_{e}) is on the order of 0.01. This upward bias is exacerbated when more alleles at low frequency are present. Interestingly, this bias is of a different nature than the bias shown by moment-based estimators of *N*_{e} applied to data with low-frequency alleles. Moment-based estimators show more bias when drift is relatively strong and the low-frequency alleles have been lost from the population (Waples 1989). With *CoNe* the situation is exactly the opposite—the bias is very small (≈3%) when *t* = *T*/(2*N*_{e}) = 0.1 but the bias is more severe with *t* = 0.01.

I have shown how to compute the likelihood for *N*_{e} in what is essentially a frequentist analysis. However, two points must be made. First, should one desire a Bayesian posterior distribution for *N*_{e}, it can easily be computed from the likelihood. Second, the likelihood is an integrated likelihood: a prior for the allele frequencies at time *T* must be assumed. I have used a Dirichlet prior with parameter **λ**. This corresponds to the equilibrium frequencies of a *K*-allele model with reversible mutation or to the equilibrium distribution for allele frequencies under drift and recurrent migration from a large population (Wright 1937). In simulations (not shown) I found that changing the value of **λ** from (1, … , 1) to (1/*K*, … , 1/*K*) had little effect on the inference of *N*_{e}. It is accordingly unlikely that the use of other diffuse priors would greatly influence the estimates of *N*_{e}.
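The first point, that a Bayesian posterior for *N*_{e} is easily obtained from the likelihood, amounts to multiplying by a prior and normalizing. A sketch over a discrete grid of *N*_{e} values, assuming a flat prior on the grid purely for illustration (the choice of prior is the analyst's, not prescribed by the article):

```python
import numpy as np

def posterior_from_likelihood(ne_grid, loglik, log_prior=None):
    """Convert a log-likelihood curve on a grid of Ne values into a
    normalized posterior. With no prior supplied, a flat prior on the
    grid is assumed."""
    lp = np.zeros_like(loglik) if log_prior is None else log_prior
    unnorm = loglik + lp
    unnorm -= unnorm.max()          # subtract max to guard against underflow
    post = np.exp(unnorm)
    return post / post.sum()        # normalize to a probability distribution

ne = np.arange(10, 1000, dtype=float)
ll = -0.5 * ((ne - 100) / 40.0) ** 2   # mock log-likelihood curve for Ne
post = posterior_from_likelihood(ne, ll)
```

Any other prior on *N*_{e} (for example 1/*N*_{e}) could be passed as `log_prior` without recomputing the likelihood itself.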

The importance-sampling algorithm presented here is quite efficient for the case where only two temporally spaced samples are taken; however, it is worth asking if the importance sampling could be extended to multiple samples in time (Beaumont 2003). Such a task could be challenging. The algorithm presented here works well because it is possible to compute the probability of the observed data given that there were *n*_{f} lineages at time *T* for all *n*_{0} − *K*_{0} + 1 possible values of *n*_{f}. Those probabilities are then used in (13) to compute the likelihood for *N*_{e}. Naively taking the same approach with more than two samples could lead to computational demands that are exponential in the number of samples, but it might be possible to make the problem linear in the number of samples by using an algorithm like that described in Baum (1971). Unfortunately, the conditional probabilities that would have to be calculated and normalized for each sampling episode would require considerably more (on the order of *n* times more, where *n* is the sample size) computation than they do with only two samples, and there is no guarantee that the resulting importance-sampling distribution would be as effective as it is in the two-sample case. Extending this importance sampling approach to more than two samples remains an open problem.

It is important to point out that, although the likelihood for *N*_{e} used here is based on the coalescent, the calculation of the likelihood is easier than in many other coalescent-based inference problems. By making the assumption of no mutation between the samples, it is possible to treat the different allelic types separately, without considering the number of mutational steps between alleles. This simplification makes it unnecessary to consider different topologies of coalescent trees. In effect, the formulation of (1) follows from an implicit sum over all possible topologies—without having to actually perform that sum. Thus, although the importance-sampling algorithm presented here offers dramatic improvements for calculations involving the coalescent without mutation, it is not a solution that applies equally well to other difficult problems such as computing the likelihood for θ = 4*N*_{e}*u* from a single sample of sequences (Griffiths and Tavaré 1994a; Kuhner *et al.* 1995; Stephens and Donnelly 2000) or computing the likelihood of recombination rates from a single sample of sequences (Griffiths and Marjoram 1996) or of migration rates from a single sample of sequences or microsatellites (Beerli and Felsenstein 1999). In those cases, not only is it necessary to explicitly sum over different genealogical trees and their branch lengths, but also it is necessary to sum over the unknown ancestral state of the progenitor of all alleles in the sample and over locations in the tree where mutations might have occurred.

The program *CoNe* is intended to provide estimates of contemporary *N*_{e} of well-circumscribed populations. The *N*_{e} that is estimated is the effective size of the population that prevailed over the time interval between the samples. This contrasts with the methods that estimate θ = 4*N*_{e}*u* from a single sample. The *N*_{e} referred to there is the effective size of the population over the entire coalescent history of the sample, which typically represents far more time than just the interval between two samples taken from the population. Recently, a method that allows the separate estimation of *N*_{e} and *u* (rather than estimation only of the composite parameter θ) from temporally spaced samples of sequences was developed by Drummond *et al.* (2002). Their program, Bayesian Evolutionary Analysis Sampling Trees (BEAST), has been used in a number of instances involving genetic sequences sampled at short time intervals from rapidly mutating and short-lived organisms like viruses or sampled at long time intervals (by obtaining DNA from subfossil material) from longer-lived organisms (reviewed in Drummond *et al.* 2003). BEAST is designed for use with temporally spaced samples of *sequences*, and the temporally spaced element of the samples is useful to the program only if the samples are separated by enough time that mutations are expected to have accumulated in the lineages. This is very different from *CoNe*, which uses codominant allelic count data (which may come from sequences, microsatellites, SNPs, etc.) and performs best when enough time has elapsed between the samples for a substantial amount of drift to have occurred, but not so much time that many mutations have occurred between the two sampling episodes.

In the absence of mutation the importance-sampling algorithm presented here applies directly to the general problem of computing the likelihood of the number of coalescences that have occurred during the time between two sampling episodes. Accordingly, there are a number of related inference problems to which the algorithm could be applied. First, estimating *N*_{eT}, the effective size of the population at time *T*, and a growth rate *r* of the population until the sample at time zero would be straightforward. This is because (4) can be expressed as a likelihood for the *scaled* time *t*, and any pair of *N*_{eT} and *r* implies a single scaled time *t* by the results of Griffiths and Tavaré (1994b) for rates of coalescence in populations of varying size; hence (4) implies a likelihood for the pair (*N*_{eT}, *r*). Also, with some modifications, the method could be used to estimate the number of lineages founding small, colonized populations. Such a problem, like the estimation of effective size, is closely related to the problem of estimating the degree of inbreeding accumulated in a population between two time points. Laval *et al.* (2003) approach the problem of estimating inbreeding using MCMC and “several levels of approximation … to circumvent the combinatorial problems raised by the exact coalescent approach when the number of alleles increases” (p. 1199). The Monte Carlo approach taken in this article could be applied to the problem of estimating inbreeding using the exact coalescent approach, dealing with the combinatorial problem by efficient importance sampling. It is likely this would lead to a better estimator that could be computed many times more quickly and without the need to “tune” various parameters for MCMC.
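The reduction of a pair (*N*_{eT}, *r*) to a single scaled time *t* has a closed form under exponential growth, following Griffiths and Tavaré's (1994b) treatment of coalescence in populations of varying size. The parameterization below (size *N*_{eT} at time *T*, growing at rate *r* per generation toward the present) is an assumption for illustration:

```python
import math

def scaled_time(ne_T, r, T):
    """Scaled coalescent time t = integral from 0 to T of dg / (2 * Ne(g)),
    where Ne(g) = ne_T * exp(r * (T - g)) is the size g generations before
    the present. For r = 0 this reduces to the familiar t = T / (2 * Ne)."""
    if r == 0.0:
        return T / (2.0 * ne_T)
    return (1.0 - math.exp(-r * T)) / (2.0 * ne_T * r)

t = scaled_time(ne_T=1000, r=0.0, T=200)   # constant-size case
```

Maximizing the likelihood over this two-parameter surface would then require no change to the importance-sampling machinery, only the mapping from (*N*_{eT}, *r*) to *t*.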

Finally, the problem of estimating admixture proportions in populations subject to drift, first addressed by Thompson (1973), could be treated using the importance-sampling scheme presented here. In this scenario, an admixed population is formed at a known time in the past with an unknown proportion π of the colonizers coming from population *A* and 1 − π from population *B*. Using present-day genetic samples from populations *A* and *B* and the admixed population, the goal is to estimate π while taking account of the genetic drift that has occurred in all three populations since the time of colonization. Chikhi *et al.* (2001) develop a coalescent-based likelihood for a two-population admixture model and estimate the admixture proportion using MCMC. Runs of the chain required ∼1 week on a 500-MHz Pentium computer. Being able to compute the likelihood more efficiently using importance sampling would clearly be desirable, and, with some adjustments and approximation, the importance-sampling scheme presented here could deliver substantial improvements in computation time. Further, since the importance-sampling algorithm would yield an unnormalized likelihood, it would be useful for conducting tests of the null hypothesis that the admixed population contains ancestry only from the two putative source populations. This would provide a way of dealing with the possibility that the admixture contains ancestry from another, unsampled population.
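The admixture model described here can be sketched as a forward simulation. Equal population sizes for the three populations and a single locus are simplifying assumptions; this illustrates the generative model, not Chikhi *et al.*'s estimation procedure:

```python
import numpy as np

def simulate_admixture(pi, fA, fB, N, T, rng):
    """Two-population admixture sketch: an admixed population is founded
    with allele frequencies pi * fA + (1 - pi) * fB, then all three
    populations drift independently for T generations (all of size N here,
    an assumption for simplicity)."""
    freqs = {"A": fA.copy(), "B": fB.copy(), "mix": pi * fA + (1 - pi) * fB}
    for _ in range(T):
        for k in freqs:   # independent Wright-Fisher drift in each population
            freqs[k] = rng.multinomial(2 * N, freqs[k]) / (2 * N)
    return freqs

rng = np.random.default_rng(3)
fA = rng.dirichlet(np.ones(6))
fB = rng.dirichlet(np.ones(6))
out = simulate_admixture(0.3, fA, fB, N=500, T=50, rng=rng)
```

An inference method must invert this process: given samples from the three present-day populations, recover π while integrating over the unobserved drift in each.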

The genealogical perspective provides a powerful framework for formulating likelihoods in a number of problems; however, its use in estimating *N*_{e} and admixture proportions has not been rapidly adopted, in part because of the computational burden of currently available methods. The algorithm presented in this article reduces that computational burden and should make genealogical approaches even more practical in the future.

## Acknowledgments

I thank Dave Tallmon, Mark Beaumont, Montgomery Slatkin, and Carlos Garza for helpful discussions on this article; Kevin Dunham for help with testing and releasing the software; and the editor and two anonymous referees for their insightful comments. This article developed out of work initiated while E.C.A. was supported by National Institutes of Health grant GM-40282 to M. Slatkin.

## Footnotes

Communicating editor: J. Wakeley

- Received November 9, 2004.
- Accepted March 17, 2005.

- Genetics Society of America