## Abstract

The estimation of ancestral and current effective population sizes in expanding populations is a fundamental problem in population genetics. Recently it has become possible to scan entire genomes of several individuals within a population. These genomic data sets can be used to estimate basic population parameters such as the effective population size and population growth rate. Full-data-likelihood methods potentially offer a powerful statistical framework for inferring population genetic parameters. However, for large data sets, computationally intensive methods based upon full-likelihood estimates may encounter difficulties. First, the computational method may be prohibitively slow or difficult to implement for large data. Second, estimation bias may markedly affect the accuracy and reliability of parameter estimates, as suggested from past work on coalescent methods. To address these problems, a fast and computationally efficient least-squares method for estimating population parameters from genomic data is presented here. Instead of modeling genomic data using a full likelihood, this new approach uses an analogous function, in which the full data are replaced with a vector of summary statistics. Furthermore, these least-squares estimators may show significantly less estimation bias for growth rate and genetic diversity than a corresponding maximum-likelihood estimator for the same coalescent process. The least-squares statistics also scale up to genome-sized data sets with many nucleotides and loci. These results demonstrate that least-squares statistics will likely prove useful for nonlinear parameter estimation when the underlying population genomic processes have complex evolutionary dynamics involving interactions between mutation, selection, demography, and recombination.

THE estimation of ancestral and current effective population sizes in expanding populations is central to understanding the genetics of natural populations (Crandall *et al.* 1999). It is now possible to scan entire genomes of several individuals within a population (Nielsen *et al.* 2005; Schaffner *et al.* 2005; McVean and Spencer 2006). In this article I present a fast and reliable statistical method for estimating population parameters such as effective population size and growth rate using genomic data. Although the method is applicable to more complex population and selection models, I focus in this article on illustrating the method in a model of exponential population growth.

The problem of determining the parameters of a demographic expansion is a fundamental problem in population genetics theory and coalescents (Avise 2000). Several articles have appeared addressing this problem, ranging from methods based upon summary statistics (Rogers and Harpending 1993) to those using the full data in a sample, such as maximum-likelihood (ML) estimators (Griffiths and Tavaré 1994; Kuhner *et al.* 1998). However, for large data sets and complex models, computationally intensive methods may exhibit difficulties.

First, the methods may be prohibitively slow or difficult to implement on large data, especially when integrating the Felsenstein equation (Stephens 1999; Hey and Nielsen 2007). This often involves analysis of the “mixing properties” of a complex Markov chain Monte Carlo (MCMC) algorithm—a technically difficult task (Sisson 2007). In previous work, Vasco *et al.* (2001) demonstrated the close relationship of summary-statistics-based phylogenetic estimation methods to earlier coalescent methods (Watterson 1975; Tajima 1983, 1989; Fu 1995) as well as to those based on full-likelihood approaches. They argued that instead of utilizing the full likelihood, an analogous function, determined by least-squares (LS) fitting of the data, may be computed in which the full data are replaced by a vector of summary statistics. These are still standard methods that are widely used for coalescent data analysis across the subdisciplines of evolutionary genetics (Avise 2000; Emerson *et al.* 2001; Knowles 2004; Hickerson *et al.* 2006). However, important questions remain: When does the minimum of the LS function coincide with the maximum of the ML approach? How do the performance and reliability of the two methods compare over various sample sizes of nucleotide sequences? How does performance compare for fast *vs.* slow population expansions? As data sets become larger and more complex, these kinds of questions are likely to loom large in the near future.

A second and related problem involves diagnosing systematic patterns of estimation bias for likelihood modeling of nucleotide sequence data. Recent work utilizing MCMC methods has indicated that likelihoods for infinite-sites coalescent models may be quite complicated for changing population size (Stephens 1999; Stephens and Donnelly 2000; Wakeley *et al.* 2001; Hey and Nielsen 2007). Similar observations have been made in likelihood modeling of microsatellite data (Beaumont 1999). These studies have suggested that there may be rugged topographies underlying the likelihood surface, rendering accurate estimation (or identifiability) of population parameters difficult or impossible in some regions. Recently, Sisson (2007) discussed the general problem of Bayesian computation with regard to intractable likelihoods for complex genetical data.

To address both of these problems I take the following approach. First, I present the coalescent model with a focus on its close relationship to summary genetic data in a sample (Vasco *et al.* 2001; Hein *et al.* 2005; Marjoram and Tavaré 2006). Next, I review two standard coalescent point estimation methods and from these discuss more computationally intensive, simulation-based estimation methods (Hjorth 1994; Gourieroux and Monfort 1996; Givens and Hoeting 2005). Simulation-based estimation is a natural extension of the computationally efficient LS coalescent point estimators developed in past work (Fu 1994a,b; Li and Fu 1994; Deng and Fu 1996; Vasco *et al.* 2001). I then compare the computationally intensive statistics with estimates obtained using EVE (Vasco *et al.* 2001), which gives a LS estimate, and FLUCTUATE (Kuhner *et al.* 1998), which gives a maximum-likelihood estimate. Both of these programs give point estimates of the genetic diversity parameter θ = 2*N*μ (where *N* is the effective population size, and μ is the mutation rate) and the population growth rate given a sample of nucleotide sequences. Finally, I demonstrate that it is possible to obtain a nearly unbiased estimate of the population parameters by recursive use of the LS estimator. The estimator is applied to the African human mtDNA sample collected by Vigilant *et al.* (1991).

## GENOMIC DATA, COALESCENTS, AND STOCHASTIC SIMULATION

Population genomic data can be characterized by three properties (see, for example, Christianini and Hahn 2007): first, the observed data are determined by a single stochastic realization of the evolutionary process; second, they consist of a sample space of many dimensions (*i.e.*, genomic scans generate a large number of sequenced sites; also, a large number of individuals may be sequenced, so the sample space may be quite large); and third, stochastic simulation must often be used to infer genomic parameters since analytic methods rarely exist to describe complex stochastic processes. Inferring population parameters from genomic data requires linking together all three of these properties into a single statistical theory. Coalescent theory offers a means to do this by generating random samples from a Wright–Fisher model: one can efficiently undertake the joint simulation of genealogies and mutations (Vasco *et al.* 2001). This allows utilizing coalescent theory as the cornerstone for the statistical analysis of molecular population samples (Fu and Li 1999). In this section I develop a statistical coalescent model with a focus on its close relationship to summary genetic data in a sample of DNA sequences.

Figure 1A shows how to go from the complete data, a sampled set of five DNA sequences, to a summary of the data: each large dot represents a mutation (neutral gene substitution) on one of the sequences shown in Figure 1C (shown as smaller dots). The pattern of polymorphism for the sample can be observed as the number of neutral nucleotide substitutions (Kimura 1983) along a lineage of the tree, *d*_{i}, where each *i* corresponds to a branch in the coalescent tree. This quantity is a useful summary measure of the complete data (the set of sequences).

Next, consider the coalescent tree of the sample itself, without recombination and migration, so that the sampled sequences are connected by a single genealogy as shown in Figure 1B. In the Wright–Fisher model each generation is discrete and formed randomly by sampling *N* parents with replacement from the current generation. In general, for any coalescent tree, it is possible to label the time intervals, *t*_{n}, for 2, …, *n* sequences, and these are referred to here as the *n*th coalescent times. These intervals are random variables representing the time duration (the number of generations) required for coalescence from *n* sequences to *n* − 1 sequences. Thus, in Figure 1B each of the five sequences can be traced back in time first to *n* − 1 ancestral sequences, next to *n* − 2 sequences, and so on until a single ancestral sequence remains (the most recent common ancestor, MRCA).

We have now covered the first two criteria for developing a statistical theory of genomic data: it has been demonstrated how the complete data (a sampled set of sequences) are connected to a coalescent tree. Before the role of stochastic simulation and statistical inference can be introduced, I first develop some bookkeeping rules that are useful for large coalescent-based phylogenies.

Each branch length of the tree can be represented as a simple sum of the coalescent times between ancestors. It is not difficult to show that, since the jump process of a coalescent tree is fixed (Fu 1994a,b), the branch lengths of any coalescent tree can be computed as

$$l_i = \sum_{k=2}^{n} s_{ik}\,t_k, \qquad (1)$$

where *s*_{ik} represents a set of index variables for each branch that bookkeeps the number of times the coalescent time *t*_{k} contributes to the length of the *i*th branch. Thus, for branch *i*, one can define *n* − 1 index variables (*k* = 2, …, *n*) such that *s*_{ik} = 1 if the branch has a segment of length *t*_{k} between the *k*th and the (*k* − 1)th coalescence and *s*_{ik} = 0 otherwise. Thus, the branch lengths over the entire topology of a tree for a sample of *n* genes can be quantitatively characterized in terms of the set of 2(*n* − 1)^{2} index variables and the corresponding coalescent times. For example, the coalescent tree shown in Figure 1B has 2(5 − 1)^{2} = 32 index variables. Detailed examples of how to use this bookkeeping device appear in Fu (1994a,b) as well as in Deng and Fu (1996). Here I use it to define a (2*n* − 2) × (*n* − 1) matrix **J** for the jump process of the coalescent, where each element is determined by the *s*_{ik} variables (see Figure 1). If the vector of coalescent times is defined as **T** = (*t*_{2}, …, *t*_{n})^{T}, where the superscript T signifies transposition, then any vector of branch lengths **L** satisfying Equation 1 can be written in matrix form

$$\mathbf{L} = \mathbf{J}\mathbf{T}.$$

This equation is an expression of the fact that the jump process of the coalescent, **J**, is invariant with respect to model assumptions regarding the distribution of coalescent times or branch lengths of a tree once its topology is fixed.
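To make the bookkeeping concrete, here is a minimal Python sketch (not from EVE; the three-sequence topology is invented for illustration, not taken from Figure 1B) that builds a small **J** matrix and recovers the branch lengths as **L** = **JT**:

```python
import numpy as np

# Hypothetical 3-sequence topology: A and B coalesce first, then with C.
# Rows index branches (A, B, C, ancestor of AB); columns index (t_2, t_3).
J = np.array([[0, 1],   # branch A exists only while 3 lineages remain
              [0, 1],   # branch B likewise
              [1, 1],   # branch C spans both intervals
              [1, 0]])  # internal branch spans only the 2-lineage interval

T = np.array([0.8, 0.3])  # (t_2, t_3), in units of 2*N0 generations
L = J @ T                 # Equation 1 in matrix form: L = J T
print(L)                  # [0.3 0.3 1.1 0.8]
```

Row *i* of **J** plays the role of the indicator variables *s*_{ik}, and the same multiplication scales to any fixed topology.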

Consider next the effect of exponential population expansion on the coalescent times and branch lengths of the tree. Thus, *N*(*t*) = *N*_{0}*e*^{gt}, where *g* is the growth rate (*g* > 0), *t* is the time since the initial generation, and *N*_{0} is the initial effective population size. In using a coalescent approach, it is useful to look backward in time at *N*(−*t*), and from this vantage it appears that the population exponentially declines in size as the population experiences coalescence toward its most recent common ancestor. In a coalescing population, currently at size *N*_{0}, one unit of time corresponds to 2*N*_{0} generations. I now describe how to simulate coalescent trees under this model. The *k*th coalescent time, *t*_{k}, in a genealogy corresponds to a time interval in which exactly *k* ancestors exist in the sample. Each *t*_{k} is conditioned upon the time τ_{k} during which there are more than *k* ancestors in the sample. The intervals between coalescent events can be efficiently simulated using the distribution

$$P(t_k > t \mid \tau_k) = \exp\!\left[-\binom{k}{2}\frac{e^{g\tau_k}\left(e^{gt}-1\right)}{g}\right],$$

with τ_{n} = 0 (Griffiths and Tavaré 1994). The dependence of the time *t*_{k} upon τ_{k} has an intuitive basis: it occurs because each period *t*_{k} in a genealogy starts only when *n* sequences have coalesced to *k* sequences. For the cases of constant population size and exponential growth, the distribution for the *t*_{k} can be determined exactly (see Hudson 1990; Slatkin and Hudson 1991). The value *t*_{k} (*k* = 2, …, *n*) is an outcome of the coalescent with exponential growth *g*,

$$t_k = \frac{1}{g}\ln\!\left[1 - \frac{g\,e^{-g\tau_k}\ln U}{\binom{k}{2}}\right], \qquad (2)$$

where $\tau_k = \sum_{j=k+1}^{n} t_j$ and *U* is randomly drawn from the uniform distribution on the open interval (0, 1). This last result determines the final property required to describe genomic data (stated earlier in this section): by generating random numbers, stochastic simulation may be used to study the complex stochastic processes by which coalescent trees are formed. The result can be combined with a model of DNA sequence evolution to simulate the evolution of sampled sequences on the tree itself, as explained below. Before this is done, however, I show how to develop a statistical coalescent model that may be used to estimate population parameters.

Assume now that the coalescent for a sample of five sequences in an exponentially growing population has produced a coalescent tree with parameters (θ, *g*). Although this genealogy is the product of a stochastic process, the branch lengths, *l*_{i}, and the number of mutations on a branch, *d*_{i}, for a single genealogical history can be observed exactly for a simulation or accurately reconstructed for a set of sequences. Throughout this article the *d*_{i}'s are the components of a data vector **D**, where the convention on numbering and ordering is shown in Figure 1. I now show how to approximate the average branch lengths and number of mutations on each lineage of such a tree given its topology. For the moment I assume that the tree topology is known exactly. To compute the expected branch lengths, one simulates the coalescent time distribution using Equation 2 many times and averages the results. Each average coalescent time, $\bar{t}_k$, is computed using

$$\bar{t}_k = \frac{1}{G}\sum_{j=1}^{G} t_k^{(j)}. \qquad (3)$$

The quantity *G* represents the number of genealogies that are simulated to obtain the average *k*th coalescent time. Equation 3 can be used to study the statistical properties of genealogies under several different kinds of models involving population and selective change. Substitution of Equation 3 into Equation 1 yields

$$E(\mathbf{L}) = \mathbf{J}\,E(\mathbf{T}) \approx \mathbf{J}\bar{\mathbf{T}}. \qquad (4)$$

Utilizing a UPGMA phylogenetic reconstruction algorithm (see below) gives the expected number of mutations between pairs of coalescing sequences. The key point is that it is possible to fix the tree topology and rapidly compute an approximation of the number of mutations on each branch of the tree.
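The simulation and averaging of Equations 2 and 3 can be sketched as follows (a minimal Python illustration, assuming time in units of 2*N*_{0} generations; the three-sequence topology used for **J** is hypothetical, not the tree of Figure 1B):

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_times(n, g, rng):
    """One genealogy's coalescent times (t_2, ..., t_n) under growth g,
    drawn by inverting the interval distribution (a sketch of Equation 2)."""
    tau, times = 0.0, []
    for k in range(n, 1, -1):              # from n lineages down to 2
        pairs = k * (k - 1) / 2.0
        t_k = np.log(1.0 - g * np.log(rng.uniform()) * np.exp(-g * tau) / pairs) / g
        times.append(t_k)
        tau += t_k                         # tau accumulates backward in time
    return np.array(times[::-1])           # ordered (t_2, ..., t_n)

G = 5000                                   # simulated genealogies (Equation 3)
T_bar = np.mean([draw_times(3, 2.0, rng) for _ in range(G)], axis=0)

# Expected branch lengths via Equation 4 for a hypothetical 3-sequence topology
J = np.array([[0, 1], [0, 1], [1, 1], [1, 0]], dtype=float)
E_L = J @ T_bar
print(T_bar, E_L)
```

Note that under growth the average times come out shorter than their constant-size expectations, which is the signature the estimator exploits.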

## STATISTICAL MODEL

In this article I assume the infinite-sites model of mutation and that all mutations are selectively neutral. Each offspring differs from its parent by a Poisson-distributed number of mutations. Hence for an individual lineage of length *l*_{i}, the number of mutations on that lineage, *d*_{i}, follows a Poisson distribution with parameter θ*l*_{i} conditional on *l*_{i}, where θ = 2*N*_{0}μ. All moments of *d*_{i} (including the mean and variance) are determined by the moments of *t*. The expected number of mutations can be found using Equation 4 and is equal to

$$E(d_i) = \theta\,E(l_i). \qquad (5)$$

Thus one may compute the expected number of neutral mutational changes *d*_{i} (see Figure 1A) along an expected branch length *l*_{i} (recall Figure 1D) for a sample of DNA sequences. So for the entire coalescent tree, the vector of mutations is conditioned on the vector of branch lengths by the Poisson distribution with expectation

$$E(\mathbf{D} \mid \mathbf{L}) = \theta\mathbf{L} \qquad (6)$$

and variance

$$\mathrm{Var}(\mathbf{D} \mid \mathbf{L}) = \theta\,\mathrm{diag}(\mathbf{L}), \qquad (7)$$

where diag(**L**) is the diagonal matrix for which the nonzero entries correspond to *l*_{i} with respect to a given *d*_{i}.

_{i}Given a fixed genealogy with **J** and *E*(**T**), Equations 4–7 can be used to define the following nonlinear mutation model,(8)where is the vector of expected branch lengths, and the model error term is , with the *i*th component .

Expectations for the coalescent model can be efficiently computed over the entire genealogy:

$$E(\mathbf{D}) = \theta\,E(\mathbf{JT}). \qquad (9)$$

Variances for coalescent genealogies satisfying the mutation model can likewise be computed, but are not utilized in the present article. For each branch of a coalescent genealogy the data *d*_{i} can be computed using distance, parsimony, or any other method that gives an accurate estimate of the number of mutations on a branch (see below). These data are stored in the vector **D**. All other quantities, such as *E*(**T**) and *E*(**JT**), represent theoretical expectations that can be efficiently computed using Monte Carlo integration of sampled genealogies based upon Equations 3 and 4.

## STATISTICS AND COMPUTATION-BASED INFERENCE

In this section I discuss how to compute point estimates of targets *g* and θ using a LS and a ML method, how to use computational statistics and simulation to quantify the bias associated with a targeted estimate, and how to correct the bias of the targeted estimate.

#### Least-squares estimation:

Assume the exponential growth coalescent model with parameters θ, *g*, *N*_{0}, where the effective population size *N*_{0} is fixed at the time of sampling. Using Equation 8, define the scalar function

$$V(\theta, g) = \sum_{i=1}^{2n-2}\left[d_i - \theta\,E(l_i)\right]^2. \qquad (10)$$

Then there exists a nonlinear LS estimator for (θ, *g*) upon solution of the optimization problem

$$\min_{\theta,\,g} V(\theta, g)$$

with respect to θ and *g*. Application to multiple loci is straightforward: compute a function *V*_{i}(θ, *g*) using the coalescent history for each *i*th locus, sum over all *i*, and then minimize the sum with respect to (θ, *g*). This can be rapidly computed even for large numbers of loci. A computer program written in the C language, called Estimators for Variable Environments (EVE), was developed to perform these computations (Vasco *et al.* 2001). At the point *g* = 0, the LS minimum point determining θ for the EVE algorithm is expected to be the same as that computed by Fu's UPBLUE recursive estimator for θ when the effective population size is constant (Fu 1994a). Thus, this statistic can be considered an extension of Fu's method to the case of variable population size.
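The minimization itself can be sketched compactly. Because θ enters the model linearly, it can be profiled out analytically for each trial value of *g*, leaving a one-dimensional search. The following Python sketch is an illustration of this idea, not the EVE algorithm; the three-sequence topology and the target parameters are invented, and idealized (noise-free) expected counts stand in for data:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical 3-sequence topology (invented for illustration)
J = np.array([[0, 1], [0, 1], [1, 1], [1, 0]], dtype=float)

def mean_times(g, G=4000, n=3):
    """Monte Carlo average of (t_2, ..., t_n) under exponential growth."""
    acc = np.zeros(n - 1)
    for _ in range(G):
        tau, row = 0.0, []
        for k in range(n, 1, -1):
            pairs = k * (k - 1) / 2.0
            t = np.log(1.0 - g * np.log(rng.uniform()) * np.exp(-g * tau) / pairs) / g
            row.append(t)
            tau += t
        acc += row[::-1]          # store as (t_2, ..., t_n)
    return acc / G

# Synthetic "observed" mutation counts built at target (theta, g) = (40, 2)
d = 40.0 * (J @ mean_times(2.0))

# Profile theta out (it enters linearly), then grid-search g (Equation 10)
best = None
for g in np.linspace(0.5, 5.0, 10):
    El = J @ mean_times(g)
    theta_hat = El @ d / (El @ El)
    V = np.sum((d - theta_hat * El) ** 2)
    if best is None or V < best[0]:
        best = (V, theta_hat, g)
print(best)  # the minimum should land near theta = 40, g = 2
```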

#### Maximum-likelihood estimation:

For the problem of estimating θ and *g* from sequence data, Kuhner *et al.* (1998) proposed the maximum-likelihood computation program FLUCTUATE. This program computes θ using a transformation in which the coalescent structure of a genealogy becomes identical to the constant-population expectation of the program COALESCE (Kuhner *et al.* 1995). The conversion between EVE's θ and FLUCTUATE's θ is made by multiplying the FLUCTUATE θ by the number of sites in a sequence. Since EVE assumes that coalescent times are scaled by 2*N*_{0} generations, I rescaled FLUCTUATE growth rate estimates appropriately. In addition, growth rate is scaled in per mutation units of 1/μ. Note that to make this conversion and comparison with the unscaled EVE estimate of growth rate, *a priori* information is essentially required with regard to the constant population size expectation of θ_{0} = 2*N*_{0}μ. When computing genealogies, FLUCTUATE takes a step that is the construction of a single genealogy; a “chain” of genealogies is formed to compute parameter estimates. For chain settings I used those suggested in Kuhner *et al.* (1998) for a single locus: 10 short chains of 1000 steps followed by 2 long chains of 15,000 steps. For an initial guess of θ, Watterson's (1975) estimate was used. For an initial guess of *g* I used 10^{−5}.

#### Computation-based estimation:

Statistical problems such as severe bias create difficulties for obtaining accurate parameter estimates from a point estimate. Simulation-based estimation is a powerful way to diagnose some of these difficulties if one has access to a fast estimator. In this section I describe three computationally intensive estimators based upon jackknifing, bootstrapping, and simulation principles (Efron and Tibshirani 1993; Shao and Tu 1995; Gourieroux and Monfort 1996) that are used to investigate statistical properties of the point estimators EVE and FLUCTUATE.

##### Jackknife estimation of parameters:

Quenouille (1956) proposed utilizing the jackknife to estimate the bias of an estimator by deleting one datum at a time from the original data set and then recalculating the estimator on the basis of the remaining data. The jackknife estimator of a parameter *p* (where *p* = *g* or θ in this article) utilizes the subsample estimates to quantify the bias-reduction process,

$$\hat{p}_{\mathrm{jack}} = n\hat{p} - \frac{n-1}{n}\sum_{i=1}^{n}\hat{p}_{-i}, \qquad (11)$$

where $\hat{p}$ is the estimate computed by applying a method such as ML or LS to the whole sample and $\hat{p}_{-i}$ is the *j*th subsample estimate, computed with the *i*th sequence removed, this removal being denoted by the subscript −*i*. For subsampling of the set of sequences one can take the complete set, deleting a single sequence at a time. Each $\hat{p}_{-i}$ is then a statistic computed using 2*n* − 1 observations with datum *d*_{r} removed (which is uniquely determined when the *i*th sequence is removed from the sample, so that *r* is equal to the subscript of the external branch corresponding to the removed sequence). To compute the estimator stated by Equation 11 one iteratively solves the LS problem

$$V_{j,-i}(\theta, g) = \big\|\mathbf{D}_{-i} - \theta\,\mathbf{J}_{-i}\,E(\mathbf{T})\big\|^{2} \qquad (12)$$

*i* = 1, …, *n* times. As before, −*i* represents an iteration index that is determined by the consecutive removal of a sequence from the total set, but now it can be used to uniquely determine **J**_{−i} and *V*_{j,−i}. Since the removal of a sequence corresponds to the removal of an external branch, to find **J**_{−i} it suffices to compute the *s*_{ik} matrix with 0's placed in the coalescent time entries of the full *s*_{ik} matrix for the removed branch. Having determined **J**_{−i} and **D**_{−i}, one then performs the matrix multiplication given by Equation 12 to obtain *V*_{j,−i}. Minimizing *V*_{j,−i} with respect to (θ, *g*) gives a jackknifed LS estimate for the subsample. The subscript *j* bookkeeps the computation of the jackknifed LS statistic for each subsample. Summing the estimates with respect to *j* over all *n* subsamples then gives the sum in Equation 11.
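Equation 11 is Quenouille's general-purpose correction, so it can be illustrated on any estimator with known bias. In this Python sketch (a toy example, not the coalescent estimator) the plug-in ML variance is biased by the factor (*n* − 1)/*n*, and the jackknife removes that bias exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=50)

def biased_var(sample):
    """Plug-in ML variance: biased downward by the factor (n - 1) / n."""
    return np.mean((sample - sample.mean()) ** 2)

n = x.size
p_hat = biased_var(x)                          # whole-sample estimate
loo = np.array([biased_var(np.delete(x, i)) for i in range(n)])
p_jack = n * p_hat - (n - 1) * loo.mean()      # Quenouille correction (Eq. 11)

print(p_hat, p_jack)  # p_jack equals the unbiased sample variance exactly
```

For the variance this identity is exact; for nonlinear targets such as *g* the jackknife removes the leading-order bias term.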

##### Parametric bootstrap-known tree:

In a parametric bootstrap, samples (replicate sets of nucleotide sequences) are simulated from specified values of *g* and θ. The estimator is then fed the known tree topology (**J**), the known branch lengths **L**, and the known number of mutations on each branch for a given MC replicate. From these data, which correspond to perfect knowledge of the phylogenetic tree of the coalescent, a LS estimate $\hat{p}^{*(b)}$ is computed for each replicate. The bootstrap expectation is $\bar{p}^{*} = \frac{1}{B}\sum_{b=1}^{B}\hat{p}^{*(b)}$, where $\hat{p}^{*(b)}$ is the *b*th of *B* MC replicates. I computed the MC estimated standard deviation as $\big[\frac{1}{B-1}\sum_{b=1}^{B}(\hat{p}^{*(b)}-\bar{p}^{*})^{2}\big]^{1/2}$.
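The parametric bootstrap recipe (simulate at the target, re-estimate, average) can be illustrated in a setting where the estimator's bias is known in closed form. In this hypothetical Python example the data are exponential with rate λ and the ML estimator 1/x̄ has expectation λ*n*/(*n* − 1), so the bootstrap bias estimate can be checked against theory:

```python
import numpy as np

rng = np.random.default_rng(3)

lam_true, n, B = 2.0, 10, 20_000  # target parameter, sample size, replicates

# Parametric bootstrap: simulate data at the target, re-estimate each time.
reps = np.array([1.0 / rng.exponential(1.0 / lam_true, n).mean()
                 for _ in range(B)])

boot_mean = reps.mean()        # bootstrap expectation of the estimator
boot_sd = reps.std(ddof=1)     # MC estimated standard deviation
bias = boot_mean - lam_true    # estimated bias; theory gives lam/(n-1), about 0.22

print(boot_mean, boot_sd, bias)
```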

##### Parametric bootstrap-unknown tree:

This case corresponds to that stated above except that for each MC replicate the tree topology is estimated using UPGMA (see, for example, Swofford and Olsen 1990). For each replicate, UPGMA is used to compute an approximation to the branch lengths of the tree, giving an estimate of the observed number of mutations between pairs of coalescing sequences. Statistics for the MC replicates are computed as stated in the last section, with a change of notation for the mean and the MC replicates to mark the unknown-tree case.

##### Simulation of sampled sequences:

When implementing simulation studies I assumed a Jukes–Cantor model of DNA nucleotide sequence evolution (Li 1997). This assumes that all four nucleotides are equally frequent and all substitution types are equally likely in a sample. I assumed a sequence length of 1000 sites and a per locus mutation rate μ in simulations when the target parameters were known. When simulating the Vigilant *et al.* (1991) data I assumed a sequence length of 610 sites. Under the infinite-sites model, the number of mutations separating a pair of sequences is simply the number of nucleotide differences between the pair of sequences (Fu 1994a). In this case the number of substitutions between sequences *i* and *j* corresponds to the number of mutations separating two coalescing sequences *d*_{ij} (Tajima 1983). Once the distance matrix for a sample of sequences is computed, the UPGMA method can be applied to compute the tree. When applied to an empirical set of nucleotide sequences one computes the genetic distance, applies UPGMA to the sample, and then inputs this into EVE (see Vasco *et al.* 2001 for an example using HIV sequences).
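Under the infinite-sites assumption the distance step is a pairwise difference count; the following Python sketch (with a toy alignment invented for illustration) builds the distance matrix that UPGMA would take as input:

```python
import numpy as np

# Toy alignment (rows are sequences; invented for illustration)
seqs = np.array([list("AAGCT"),
                 list("AAGTT"),
                 list("ACGTT")])

n = len(seqs)
dist = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(n):
        # Infinite sites: distance is the count of differing positions
        dist[i, j] = np.sum(seqs[i] != seqs[j])

print(dist)
```

The resulting matrix can then be passed to any UPGMA (average-linkage clustering) implementation to obtain the input tree for EVE.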

## RESULTS

In this section two different ways of assessing statistical truth with respect to LS and ML estimation are investigated. The first involves the case where a target parameter *p* = *p*_{T} is known from simulation. For this case I investigate how accurately the known value of *p* can be recovered using LS or ML statistics. The robustness of the LS statistic is then investigated under recombination. The second involves the case where only an estimate is known for *p*_{T}. For this case all that one can hope for is that the estimate $\hat{p}$ is sufficiently close to *p*_{T}, since *p*_{T} itself is unknown. Nature gives researchers a single stochastic realization when fitting their stochastic simulation models to genomic data. So an important statistical question is how to robustly characterize the statistical variation associated with that realization.

To investigate the second case I used an African human mtDNA sample drawn from Vigilant *et al.* (1991) to compute an LS point estimate (the mtDNA control region with a sequence length of 610 bp and a sample size of *n* = 97). I then use simulation-based estimation to compute Monte Carlo replicates. This allows using the parametric bootstrap means to quantify bias. The bootstrap replicates also allow estimation of the variance associated with the biased mean. Finally, I use the LS statistic to assess bias and correct for it. As noted by Efron and Tibshirani (1993), bias correction is a worthwhile pursuit but a dangerous one. Here I show it is possible to compute a nearly unbiased estimate of the target parameters. For the purpose of computational approaches to population genetics data analysis this result clearly demonstrates that the LS statistic has a peak at or near an estimated quantity (Cabrera and Fernholz 1999). Elimination of troublesome sources of bias allows computing reliable distributions of statistical errors for the sample that can be utilized in subsequent statistical studies, such as the development of power tests and model prediction and selection (Hjorth 1994).

#### Performance of the LS statistic:

Tables 1 and 2 show the performance of the LS method in estimating *g* and θ for the case θ = 40. In these tables the “size” of the table, in terms of increasing *g*, corresponds to the range of consistency for estimation of *g*. Thus, higher sample sizes generally gave better estimates of *g*. Table 2 shows that simultaneous estimation of θ is nearly unbiased over a wide range of *g*—the reason being that θ appears in Equation 8 linearly. It is also shown that a significant reduction in the MC standard deviation of the estimators can be achieved by increasing sample size. In all cases, the MC standard deviation of the estimators decreases with increased numbers of sequences. There is a slight bias in estimation near *g* = 0.001 because of the singularity at *g* = 0 in Equation 2. As noted previously, however, one may use the UPBLUE method of Fu (1994a) to estimate θ at *g* = 0 (or at least use its assumption of constant population size to compute the expected branch lengths in Equation 10).

Figure 2 graphically shows the effect of sample size on the performance of LS estimation. For higher substitution rate (θ > 40), a significant range of growth rates can be consistently estimated. Better knowledge of the gene tree always allows expansion of the range of consistent estimation. Figure 2A shows that for a range of *g* from 0.001 to 6, consistent estimation of *g* was possible for the unknown tree. Lack of knowledge of the actual gene tree significantly affects the ability to consistently estimate *g* > 5 for θ = 40. Figure 2B shows that the range of consistent estimation for *g* in the case θ = 40 is in the range of *g* from 0.001 to 80 for the known gene tree. Since many rapid population expansions, such as for humans and viruses, may lie in this range it is important to correct the bias for any point estimates obtained that lie in this range. Below, I show how to address this problem using the jackknife bias-correction statistic. I then compare this statistic to those obtained using the more standard parametric bootstrap method.

#### Comparison of LS and ML statistics:

In this section, I compare the LS and ML statistics, utilizing the LS surface as a tool to diagnose performance. A ML surface should exhibit a maximum with respect to the minimum on the LS surface. Therefore, it is possible to investigate how the statistics behave by visually scanning the distribution of MC replicates around the LS minimum in (θ, *g*) parameter space. I show that there are two scales in parameter space that determine the properties of the estimators. The first is determined by a “local” region surrounding a LS minimum point that determines EVE estimates. This is determined by the LS criterion, *i.e.*, the minimization of Equation 10. In this region, one would expect the distribution of MC replicates to cluster around the minimum, with density highest over the target parameters. In fact, a second “global” region over (θ, *g*) parameter space plays a role with respect to the LS surface. This happens because of the appearance of a large plateau in parameter space. In this section I show how this plateau affects performance of the ML *vs.* the LS estimator.

Figure 3C shows the distribution of Monte Carlo replicates for the target *g* = 5 and θ = 60 at a sample size *n* = 10. Figure 3, D–F, shows the distribution of Monte Carlo replicates for the target *g* = 15 and θ = 60 at three different sample sizes *n* = 10, 30, and 100. For both *g* = 5 and *g* = 15, the sample size of *n* = 10 shows a cloud of LS and ML replicates close to the targets (see Figure 3, C and D). Also several outliers appear for both the LS and the ML estimates. However, the LS outliers appear to track the valley in the LS surface, while the ML estimators tend to be dispersed on a steeply rising ridge in the LS surface as shown in Figure 3A. The height of these ridges requires blowing up the region around the targets to see contours of the valley (see Figure 3B). FLUCTUATE outperforms EVE for the case shown in Figure 3C. It has a MC mean closer to the target than EVE, with smaller standard deviation. However, for the cases shown in Figure 3, D–F, the presence of outliers shifts the mean MC value for θ away from its target. This gets progressively worse as the sample size increases—see Figure 3, E (*n* = 30) and F (*n* = 100). In contrast to the severe estimation bias observed for θ, FLUCTUATE estimation of *g* appears to be nearly unbiased. However, this conclusion must be accepted with the proviso that perfect knowledge of θ_{0} = 2*N*_{0}μ is available. Table 3 gives the MC means and standard deviation for the comparison between LS and ML statistics.

For larger sample sizes the distributions of the ML and LS estimates move further apart. Overall, these results imply that the ML estimator FLUCTUATE does not exhibit consistency at high growth rates and sample sizes when estimating θ. However, at small growth rate and small sample size FLUCTUATE outperforms EVE.

Previous MCMC studies for infinite-sites models (Wakeley *et al.* 2001) have shown that the likelihood surface for changing population size has a complex topography: the parameters become nonidentifiable in some regions and plateaus appear. Beaumont (1999) made similar observations with likelihood modeling of microsatellite data. The findings of this section lend support to this earlier work. Under rapid population expansion the majority of the mutations occur on the terminal branches of the tree. In a population sample, this appears as a common ancestral haplotype, with each of the remaining haplotypes differing from it by singleton mutations. Such a star-shaped coalescent tree gives rise to a difficult data set for an MCMC coalescent likelihood method. For these data the MCMC computation may generate the same star-shaped pattern for many choices of θ and growth rate. Indeed, Stephens (1999) and Stephens and Donnelly (2000) have previously predicted such computational difficulties for certain classes of MCMC methods. They noted that some MCMC algorithms, such as FLUCTUATE, use a fixed or “driving” value of θ. This gives rise to the possibility that such algorithms find modes in an essentially flat surface or ridge where none are present. Figure 3A shows a steeply rising LS contour up to a long flat plateau. The FLUCTUATE MC replicates project onto the contours of such a plateau (see Figure 3F). The large pink dot centered at the cloud of replicates suggests that it may mark a mode “found” by FLUCTUATE where clearly none is present.

#### Effect of recombination on the LS statistic:

Schierup and Hein (2000a) demonstrated that even small amounts of recombination can produce biases in tree reconstruction. These biases can affect phylogenetic inferences using reconstructed trees for nucleotide sequence data. For example, bias from recombination can cause trees to superficially resemble phylogenies for sequences from an exponentially growing population. They also found that recombination created a large overestimation of rate heterogeneity and that standard ML methods [the DNAml and DNAmlk programs of PHYLIP (Felsenstein 1995)] could not infer the presence of a molecular clock from sequence data (Schierup and Hein 2000b). Given that the LS method uses the shape of a single phylogenetic tree to estimate parameters, I tested the LS estimator on several thousand simulated samples of nucleotide sequences exhibiting varying amounts of recombination. I used the coalescent algorithm with recombination described in Hudson (1983), modified to include exponential population growth. Since coalescence and recombination events are generated independently, I was able to use the approach described in this article to efficiently simulate the coalescent time distribution under recombination (Fearnhead and Donnelly 2001). The mutation model used was the Jukes–Cantor model (Li 1997), applied to simulate nucleotide data sets under recombination as described by Schierup and Hein (2000a).
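The article's simulator follows Hudson (1983) with recombination added; as a simpler illustration of how coalescent times under exponential growth can be drawn efficiently, the sketch below simulates intercoalescence times under growth alone (no recombination). It assumes the usual parameterization, with time measured backward in units of 2*N*_{0} generations and coalescence intensity for *k* lineages equal to *k*(*k* − 1)/2 · e^{gt}; the function name and interface are mine, not the article's.

```python
import numpy as np

def coalescent_times_exp_growth(n, g, rng=None):
    """Draw the n - 1 intercoalescence times for a sample of n
    lineages under exponential growth at scaled rate g, by
    inverting the cumulative coalescence intensity.  For g == 0
    this reduces to the standard constant-size coalescent."""
    rng = np.random.default_rng(rng)
    s = 0.0          # current time, measured backward from the sample
    times = []
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0
        e = rng.exponential(1.0)
        if g == 0.0:
            t = e / rate
        else:
            # solve rate * (exp(g*(s+t)) - exp(g*s)) / g = e for t
            t = np.log(np.exp(g * s) + g * e / rate) / g - s
        s += t
        times.append(t)
    return np.array(times)
```

Because each waiting time is obtained by a single inversion of the intensity function, the cost is linear in the sample size, which is what makes this construction attractive for large simulation studies.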

I examined the performance of estimation under two scenarios. First, if the growth rate is zero, how well will the LS estimator perform given varying rates of recombination? Second, in an exponentially growing population with high growth rate, how well does the LS estimator perform under varying rates of recombination?

The results for the first case are shown in Figure 4. Setting the growth rate to zero in the simulations allows one to generate a null distribution for the estimators and thereby determine the range of bias created by recombination. For low to moderate rates of recombination (ρ = 0–15) the LS estimator performs reasonably well: estimated growth rates are close to zero in each case. The range of estimation bias lies well within the range observed to be caused by intrinsic bias in the coalescent time distribution. For ρ > 15, however, the bias due to recombination gradually increases with the amount of recombination. Each point in Figure 4 was computed from the mean over 1000 replicated data sets.

For the second case I simulated samples of nucleotide sequences under moderate growth rate (*g* = 15) and varying rates of recombination ρ = 0–50. The results are shown in Figure 5. As in the previous case, in the region ρ = 0–15 the LS estimator appears to perform reasonably well. In this region the estimates of *g* and θ exhibit low bias, on the order of that observed in the absence of recombination. For ρ > 15, recombination increasingly inflates the estimate of *g*. This bias rises more rapidly in the presence of high growth than in the absence of growth (compare Figures 4 and 5). Each point in Figure 5 was computed from the mean over 1000 replicated data sets.

#### Inference on sample with parameters unknown:

LS statistics were computed for the African subset of the sampled human mtDNA sequenced by Vigilant *et al.* (1991). This gave the LS point estimates, which imply a significant recent population expansion (see Figure 6A). The distributions of the MC replicates for all three simulation-based estimators track the LS surface. This makes sense in light of the results shown earlier (see Tables 1 and 2, Figure 2) for the case when the target is known.

Consider next the quantification of estimation bias. Parametric bootstrap estimates of the point estimate were first computed with the tree assumed unknown. The results are shown in Figure 6B. The mean is shown as the large dot at (θ, *g*) = (197.82 ± 39.35, 39.70 ± 13.13), and the small blue dots correspond to MC replicates. Although extreme bias is present in this estimator, the spread of the replicates closely tracks the LS surface determined by the point estimate. Next, parametric bootstrap estimates of the point estimate were computed with the tree assumed known. The results are shown in Figure 6C. The MC mean is shown as the large dot at (θ, *g*) = (259.52 ± 66.78, 69.88 ± 30.25). This estimator shows substantially less bias. Finally, the nonparametric jackknife estimator gives a nearly unbiased estimate of the target. The small dots in Figure 6D are the MC jackknife replicates, and these cluster tightly around the target. Although the estimated population growth rate for the African sample was very high, it was still possible to obtain a nearly unbiased estimate of the point estimate using the jackknife.
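The bias-quantification step above (simulate replicate data sets at the point estimate, re-estimate on each, and compare the Monte Carlo mean to the target) can be sketched generically as follows. Here `simulate_fn` and `estimate_fn` are hypothetical placeholders for the coalescent simulator and the LS (or ML) estimator; they are not the article's actual routines.

```python
import numpy as np

def parametric_bootstrap_bias(point_est, simulate_fn, estimate_fn,
                              n_reps=1000, rng=None):
    """Simulate n_reps data sets at the point estimate, re-estimate
    on each, and return (MC mean - point estimate, replicates).
    The first component estimates the bias of the estimator."""
    rng = np.random.default_rng(rng)
    reps = np.array([estimate_fn(simulate_fn(point_est, rng))
                     for _ in range(n_reps)])
    return reps.mean(axis=0) - point_est, reps
```

Subtracting the estimated bias from the point estimate then gives a bias-corrected estimator, at the cost of `n_reps` extra simulations.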

The jackknife statistic exploits the intuitive idea that most of the mutations informative for computing accurate estimates of θ and *g* lie on the tips of the coalescent tree. Thus, jackknifing may be regarded as a computationally cheap way to take advantage of this property and obtain nearly unbiased estimates of the parameters. It is essentially a recalibration method that reweights the estimator using the residuals of the nonlinear mutation model. The jackknife is so efficient that it effectively weights each external branch as if it were adding genetic information from many independent loci (see Figure 2 for the cases *n* = 30, *l* = 10). However, in this case the only information in the sample is from a single locus—the human mtDNA.
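The delete-one jackknife bias correction can be sketched generically as below. The article jackknifes over features of the sample, whereas this sketch leaves out one generic data unit at a time; it is a minimal illustration of the resampling idea, not the article's implementation.

```python
import numpy as np

def jackknife_bias_correct(data, estimate_fn):
    """Delete-one jackknife: recompute the estimator on each
    leave-one-out subsample and return the bias-corrected estimate
    n * full - (n - 1) * mean(leave-one-out), plus the replicates."""
    n = len(data)
    full = estimate_fn(data)
    reps = np.array([estimate_fn(np.delete(data, i)) for i in range(n)])
    return n * full - (n - 1) * reps.mean(), reps
```

The correction removes the O(1/*n*) term of the bias; in the classic plug-in-variance example it is exact, recovering the unbiased sample variance.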

## DISCUSSION

In this article I show that it is possible to infer sources of bias with regard to coalescent estimation of a sample of sequences in an expanding population. Once the bias has been quantified, it is then possible to rapidly compute a bias-corrected estimator. In this section, I address two issues associated with these problems: LS estimation *vs.* ML estimation and the relationship between the results of this article and ongoing work in MCMC estimation methods.

#### Least-squares *vs.* maximum-likelihood estimation:

In many situations an ML estimator is preferable to an LS estimator because the former is more flexible and can incorporate statistical tests more readily than the latter. However, estimation of demographic and selective parameters using the coalescent is an inherently nonlinear problem. That is, the surface to be minimized or maximized is likely to have multiple optima and a complex topography. When the tree is not known or must be approximated, determining this surface by likelihood estimation is extremely computationally intensive. Despite this, using a parametric bootstrap it is possible to compare the ML and LS estimators. One generates Monte Carlo replicate data sets and superimposes the resulting estimates on top of each other. The LS estimates should cluster around the minimum and the ML estimates around a maximum. The minima and the maxima should be approximately centered over each other, so the two clusters should lie roughly on top of each other with centers over the target. This is true for the LS estimator (EVE) but not always for the ML estimator analyzed in this article (FLUCTUATE). As sample size increases, the correspondence between optima located on the LS surface and optima located on the ML surface is lost. Yet as sample size increases one would expect an efficient estimator to become more accurate in predicting the target. My results show that the LS estimator does have a minimum over the target and becomes more accurate in predicting the target as sample size increases. I also show that the LS point estimator can be efficiently bias corrected if a nonparametric resampling estimation method is used.

#### Measurement error, summary statistics, and MCMC:

Recently Sisson (2007) pointed out some of the technical difficulties associated with developing MCMC algorithms for complex genetic data. He reviews computational problems associated with application of standard Metropolis–Hastings samplers to population genetics problems involving intractable likelihoods. As noted above, Stephens (1999) earlier pointed to similar problems, centered on integration of the Felsenstein equation (Hey and Nielsen 2007). An important part of Sisson's critique lies in evaluating how to construct proposals for MCMC algorithms on the basis of summary statistics that allow efficient “mixing” of the algorithm. He points to the analysis of Tanaka *et al.* (2006), who constructed a stochastic model at the genetic level that allows direct application of the Metropolis–Hastings approach of Marjoram *et al.* (2003) to a summary statistic. Such an approach is related to the methods developed in the present article. However, that method would have been substantially slower, and I would never have been quite sure when to stop the algorithm. Instead I implemented a cheaper LS model that allowed obtaining a fast answer. The frequentist approach developed here can also be expressed within the framework of Bayesian sensitivity analysis (Oakley and O'Hagan 2004). Whatever framework one chooses, however, the present article offers a cheap and fast method of obtaining parameter estimates for complex genetic data sets that can be compared to computationally expensive MCMC methods. Future work should lie in bringing together these two approaches into a more complete and useful statistical theory of genomic data.

## Acknowledgments

I thank the associate editor, two anonymous reviewers, Romulus Breban, Gordon Harkins, and Paul Schliekelman for their work on helping me to improve the overall presentation and quality of the manuscript. This work was partly supported by grants from the National Institutes of Health (R01 HD34350-01A1) to Keith A. Crandall and (R29 GM50428) to Y.-X. Fu.

## Footnotes

Communicating editor: R. Nielsen

- Received January 14, 2008.
- Accepted March 31, 2008.

- Copyright © 2008 by the Genetics Society of America