Likelihoods From Summary Statistics: Recent Divergence Between Species

We describe an importance-sampling method for approximating likelihoods of population parameters based on multiple summary statistics. In this first application, we address the demographic history of closely related members of the Drosophila pseudoobscura group. We base the maximum-likelihood estimation of the time since speciation and the effective population sizes of the extant and ancestral populations on the pattern of nucleotide variation at DPS2002, a noncoding region tightly linked to a paracentric inversion that strongly contributes to reproductive isolation. Consideration of summary statistics rather than entire nucleotide sequences permits a compact description of the genealogy of the sample. We use importance sampling first to propose a genealogical and mutational history consistent with the observed array of summary statistics and then to correct the likelihood with the exact probability of the history determined from a system of recursions. Analysis of a subset of the data, for which recursive computation of the exact likelihood was feasible, indicated close agreement between the approximate and exact likelihoods. Our results for the complete data set also compare well with those obtained through Metropolis-Hastings sampling of fully resolved genealogies of entire nucleotide sequences.

E STIMATION of population parameters in classical population genetics has traditionally proceeded through a moments approach, based on closed-form expressions for the expectation and variance of summary statistics (reviewed by Hey and Machado 2003;Beaumont 2004). For the purposes of this discussion, we regard as a summary statistic any measure of genetic variation that can be determined directly from numbers of derived mutations observed in a sample without explicit reference to the genealogy of the sampled genes. Felsenstein (1992) proposed a maximum-likelihood (ML) approach to the estimation of population parameters, treating the intervals separating the nodes of the sample genealogy as missing data. He showed that ML methods that use knowledge of the node intervals scaled to the mutation rate have greater statistical power than classical moments-based methods that use summary statistics. Metropolis-Hastings (MH) sampling to simulate the posterior distribution of the gene genealogy based on entire nucleotide sequences now forms the basis of the dominant ML and Bayesian approaches to both population parameter estimation and phylogeny reconstruction (Kuhner et al. 1995;Huelsenbeck and Ronquist 2001).
Among the appealing properties of estimation procedures based on summary statistics are greater simplicity and potential for customization to specific biological systems. While computation of summary statistics by definition requires no knowledge of the genealogy of the sample, the intimate relationship between the pattern of segregating variation and the sample genealogy lies at the core of coalescence theory and molecular population genetics (Ewens 1972;Watterson 1975). In deriving not only the first two moments but also entire probability distributions of summary statistics, those works laid the basis for parameter estimation through likelihoods as well as for the parametric tests for which they are well known. Our objective here is to develop a likelihood-based method for inferring demographic history from multiple summary statistics.
Classes of segregating mutations: Wakeley and Hey (1997) introduced a set of summary statistics suitable for exploring the demographic history of two groups: numbers of sites polymorphic in both groups, polymorphic in only one group, or showing fixed differences. Our classification expands theirs by using an outgroup to distinguish between ancestral and derived bases (see materials and methods). With respect to the genes sampled from a given group, we describe each segregating derived mutation as absent (a), segregating (s; present in at least one but not all), or fixed ( f ; present in all). A joint classification with respect to two groups comprises only seven types because segregating mutations are neither absent nor fixed in both groups. 1 Distinguishing between the effects of recent divergence and introgression after divergence on the sharing of polymorphisms between closely related species can be difficult, particularly in single-locus analyses (Wang et al. 1997). To address the time since divergence alone, we chose to study variation at DPS2002, a noncoding region for which tight linkage to a paracentric inversion strongly associated with multiple reproductive isolating mechanisms precludes its introgression between the closely related Drosophila species Drosophila pseudoobscura and D. persimilis (Noor and Smith 2000;Noor et al. 2001a,b) (see materials and methods). Table 1 shows the array of mutations observed in DPS2002 haplotypes sampled from D. persimilis and two subspecies of D. pseudoobscura, D. p. pseudoobscura and D. p. bogotana (Machado et al. 2002). Consistent with the absence of introgression at DPS2002, the D. persimilis sample shares no mutations with the D. p. bogotana sample and only one ( f/s) with the D. p. pseudoobscura sample. In contrast, the two subspecies of D. pseudoobscura share 10 mutations, suggesting recent divergence or ongoing gene flow. In the interspecific comparison involving D. p. bogotana (per/bog), the observation of both reciprocal arrangements of fixed differences ( f/a and a/f ) indicates that the most recent common ancestor (MRCA) of genes sampled from each species postdates any between-species coalescence events: the DPS2002 gene genealogy shows reciprocal monophyly. While the reciprocal numbers of DPS2002 polymorphisms restricted to one species (s/a and a/s) appear similar between D. p. bogotana and D. persimilis, D. p. pseudoobscura shows fourfold more polymorphisms than D. persimilis, perhaps reflecting larger effective population size. These observations suggest that the numbers of the various kinds of mutations segregating in the sample contain information about genealogical and demographic history.
Models of speciation: Under the Dobzhansky-Muller scenario for the origin of species (reviewed by Turelli et al. 2001), genetic factors contributing to reproductive isolation arise during an initial phase of allopatry between the incipient species. We identify species divergence with the onset of the allopatric phase and not in particular with the origin of genetic isolating mechanisms. Speciation corresponds to a change in coalescence structure, the most recent point at which ancestral lineages with descendants in different species can have coalesced.
We assume that the paracentric inversion that precludes introgression at DPS2002 between D. pseudoobscura and D. persimilis arose as a neutral mutation and became associated with isolating barriers during the allopatric phase. That this second chromosome region is homosequential in D. pseudoobscura and outgroup D. miranda (Dobzhansky and Tan 1936) indicates that it is the D. persimilis arrangement that is derived. This inversion appears to correspond to a fixed difference, not only in our sample but also between the species (Dobzhansky and Powell 1975). For simplicity, we assume instantaneous fixation of the alternative gene orders in the two descendant species, with the inversion having arisen immediately before the MRCA of the sampled inverted chromosomes.
Likelihoods from summary statistics: Marjoram et al. (2003) introduced a Markov chain Monte Carlo (MCMC) method that directly approximates the posterior distribution of the model, obviating the need to determine the probability of the data. Numerically generating genealogical and mutational histories that match the data may require extensive simulation.
We address methods that base inferences about demographic history and population parameters (M) on a set of summary statistics (D). Implementation of likelihood-based approaches entails the development of a practical means of determining or approximating probability distributions that depend on the unknown genealogical history (G) of the sampled genes and the number and location (U ) of mutations that occurred on it, for V G , the set of all possible placements of mutations on a given G. We adopt the infinite-sites model, under which neutral base substitutions occur independently, with the number in a given time interval following a Poisson distribution. Many likelihood-based methods proceed from the prior distribution of the genealogy, dependent on the population parameters alone, while others, including fully Bayesian approaches (Wilson and Balding 1998), incorporate the posterior distribution of histories given D (reviewed by Stephens and Donnelly 2000). With respect to time since divergence of populations, exact analytical expressions are available for certain simple demographic histories (Watterson 1985;Takahata et al. 1995); however, the derivation of closed-form solutions is intractable for most systems of biological interest. Uyenoyama and Takebayashi (2004) described a method for the recursive determination of exact likelihoods from joint probability-generating functions of correlated summary statistics (compare Griffiths and Tavaré 1996). Takebayashi et al. (2004) used this approach to obtain a maximum-likelihood estimate (MLE) of the rate of recombination between a determinant of mating type in a flowering plant and a closely linked marker locus from the numbers of segregating sites at the two loci. While this method can in principle accommodate multiple summary statistics and general evolutionary models, the computation time makes its application impractical. Likelihoods have been approximated by simulating complete genealogical and mutational histories under a coalescent model and computing summary statistics for each realization (for example, Fu and Li 1997;Weiss and von Haeseler 1998;Pritchard et al. 1999). While full simulation can accommodate a wide range of evolutionary questions, analysis of even small data sets under simple models may impose a considerable computational burden (Wall et al. 2002;Wall 2003), reflecting the exceedingly low probability of any particular realization (Stephens and Donnelly 2000;Marjoram et al. 2003). Tavaré et al. (1997) introduced an acceptancerejection algorithm for the estimation of the age of the MRCA of a sample of genes that entails simulating a gene genealogy and accepting the value for this random variable on the basis of the probability for the tree of the observed total number of segregating mutations. Beaumont et al. (2002) used a local regression method to approximate the likelihood of the actual data from realizations within a certain tolerance (Estoup et al. 2004;Tallmon et al. 2004;Hamilton et al. 2005).
Alternatively, the ''fixed-S '' approach (e.g., Hudson 1993; Depaulis and Veuille 1998) entails generating a genealogy under a standard coalescent model, randomly placing the observed total number of segregating sites (S) on the genealogy, and determining the fraction of outcomes consistent with the remaining observed summary statistics (H, for D ¼ fS, Hg). Markovtsova et al. (2001) showed that the distribution generated by this procedure does not in fact approximate P M (HjS), the conditional distribution of H given S. In particular, the genealogy should be sampled, not from a prior distribution determined by the standard coalescent, but from a conditional distribution, given S and the model parameters M. This discrepancy can become problematic under large departures of the observed value of S from its expectation (Depaulis et al. 2001;Markovtsova et al. 2001;Wall and Hudson 2001).
Approach through importance sampling: We develop an importance-sampling (IS) approximation (see Liu 2001) to the likelihood (1). We use a proposal distribution to sample a genealogical and mutational history consistent with the observed array of seven types of seg-regating sites and then correct the bias by determining the exact probability of the history.
Here, a genealogical path (G) corresponds to an ordered list of the states assumed by the process at the nodes of the full genealogy, without specification of the mutations. Observation of the segregating sites present in a sample (D) provides multiple kinds of information, for D 1 the types and D 2 the numbers of base substitutions observed. We rewrite (1) as for Q M (D 1 , G), the stationary distribution of genealogies compatible with D 1 , and Q M (D 2 , UjD 1 , G), a heuristic distribution of placements of mutations on G compatible with D 2 . We approximate this average (2) by for (U i , G i ) independent and identically distributed (i.i.d.) samples from the proposal density Q M (D 2 , UjD 1 , G)Q M (D 1 , G).
Likelihoods approximated through this procedure may serve as the basis of either Bayesian or maximumlikelihood analyses. Here, we use IS to determine MLEs of the time since divergence between closely related species of Drosophila and the effective population sizes of the extant and ancestral species.

Sequence information:
We studied the pattern of nucleotide variation segregating among DPS2002 sequences obtained by Machado et al. (2002) from the D. pseudoobscura species group, including 19 D. p. pseudoobscura, 13 D. p. bogotana, and 13 D. persimilis sequences (GenBank nos. AF450689-AF450734). This region, $940 bp in length, shows numerous single-nucleotide polymorphisms and variable oligonucleotide repetitive motif tracts, but no detectable open-reading frames. For each site segregating within the ingroup, we assumed that the base present in the single D. miranda sequence represented the ancestral base.
Although Machado et al. (2002) localized DPS2002 within a fixed paracentric inversion that distinguishes D. persimilis from D. pseudoobscura, the Noor group has recently determined that it lies $1.5 Mb outside this inversion, on the telomeric side (M. A. F. Noor, unpublished data). In general, recombination in inversion heterozygotes appears to be suppressed beyond the bounds of inversion breakpoints, perhaps reflecting disruption of chromosome pairing or production of unbalanced gametes upon crossing over (see Navarro et al. 1997 and references therein). In particular, DPS2002 has been shown to be tightly linked to the inverted region in inversion heterozygotes (0/357 recombinants) (Noor and Smith 2000).
In the present study, we assume complete linkage of sites within DPS2002 and of DPS2002 to the inversion. Association of the inversion with multiple reproductive isolating mechanisms, including hybrid male sterility, hybrid inviability, hybrid male courtship dysfunction, and behavioral isolation, prevents introgression of linked regions, including DPS2002 (Noor and Smith 2000;Noor et al. 2001a,b).
Classification of mutations: Our assumption of the absence of intragenic recombination entails that all sites within a locus share a single genealogical history. This common history constrains the observed array of neutral, independent mutations, n ¼ ðn 1 ; n 2 ; . . . ; n 7 Þ; for n i the number of mutations of type i ( Table 2). For example, the observation of a mutation fixed in group 1 and absent from group 2 implies a topology in which the MRCA of the haplotypes sampled from group 1 postdates all between-group coalescence events. This topology excludes the presence in group 2 of mutations segregating in group 1 (s/s and s/f ). With the exception of types 1 and 3, the presence of each mutational type excludes the presence of two other types, implying observation of a maximum of four distinct types of mutations. Figure 1 illustrates the four possible distinguishable topologies (modulo reciprocals) assumed by the sample genealogy: f f/a, a/f g, f f/s, f/ag, f f/s, s/sg, and fs/sg. Summary statistics: We used CLUSTAL-W ( Thompson et al. 1994), with minor manual modifications, to generate a multiple alignment of the 46 DPS2002 sequences. After elimination of sites with gaps in any of the sequences, 892 homologous sites remained. For each of three triads (outgroup D. miranda and a pair of ingroup taxa), we restricted attention to sites at which more than one base segregated within the ingroup taxa under consideration. In accordance with an infinite-sites model of mutation, we designated the base in the outgroup sequence as ancestral and any other base as a mutation. After removing sites at which the ancestral base was absent from both ingroup taxa, we counted the number of mutations in each category shown in Table 1. For sites at which more than two bases segregated, both derived bases contributed to our mutation counts: for example, at a site at which C was designated ancestral, C and G segregated in the sample from one ingroup taxon, and C and T in the other, we counted each mutation (G and T) as segregating in one group and absent in the other. A Bioperl (Stajich et al. 2002) script for counting the various types of mutations is available from its author on request ( J.E.S.: jason.stajich@duke.edu).

METHODS OF ESTIMATION
We present an evolutionary model for the demographic history of a sample of haplotypes from two species. We then construct a recursion in the exact probability of the observed array of the seven kinds of segregating sites under this model and describe its importance-sampling approximation (3).
Evolutionary model: We assume that the time since divergence of species 1 and 2 from ancestral species 0 follows an exponential distribution with parameter l, treating l as the rate of species fusion. During the interval spanned by the gene genealogy, species i (i ¼ 0, 1, 2) maintains a constant effective size of N i genes (for autosomal regions, twice the effective number of individuals).
At any point within the gene genealogy of a sample, we classify each lineage according to its species membership and the distribution of its descendants between the two groups of the initial sample. A type 1 lineage has descendants among the genes sampled from species 1 but not from species 2, and type 2 has those from species 2 but not from species 1. Type 3 lineages have descendants in both groups. On level l of the gene genealogy (the interval in which a total of l ancestral lineages remain), the state or configuration of the process corresponds to (l 01 , l 02 , l 03 , l 11 , l 22 ), for l ij , the number of ancestral lineages of type Speciation corresponds to a change in population structure: transition from two isolated groups to a single panmictic group. Our assumption of the instantaneous fixation of the alternative chromosomal types upon speciation entails that postspeciation coalescence events occur at rate Our speciation scenario stipulates the origin of the inversion immediately before the MRCA of the D. persimilis  (Table 2).
subsample. For cases in which its origin predated the speciation event, we assume that it segregated in the ancestral species at frequency p. During the interval between the speciation event and the MRCA of the inverted chromosomes, coalescence among type 1 lineages occurs at rate l 01 2 =pN 0 and among type 2 lineages at rate with no between-type coalescence events. Upon the origin of the inversion, all pairs of lineages coalesce at the same rate (1/N 0 ), irrespective of type. A genealogical path G corresponds to an ordered list of descriptions of the nodes of the full genealogy, without specification of mutations or within-level transitions, for S l ; the entry (most recent) state on level l, and L total sample size. For a sample comprising L 1 haplotypes from species 1 and L 2 from species 2 (L ¼ L 1 1 L 2 ), the initial state S L corresponds to (0, 0, 0, L 1 , L 2 ) and the MRCA of the entire sample S 1 to (0, 0, 1, 0, 0). Characterization of the stationary distribution of genealogical paths requires only determination of Markov matrices of within-and between-level transition rates (appendix a). We extend the procedure introduced by Wiuf and Donnelly (1999) to condition the gene genealogies proposed in the IS procedure to have a topology compatible with the observed combination of mutational types (appendix b). Restricting sampling to compatible genealogies affords a considerable increase in efficiency for a modest computational cost. Exact recursion in likelihoods: We derive a recursion in the joint probability-generating function (PGF) for the array of summary statistics.
Array of mutations in the sample: For each configuration on level l, we determine a PGF for the array of mutations accumulated in the subtree extending from level l to the MRCA. Let g l ðaÞ denote the vector of these PGFs, for a ¼ ða 1 ; a 2 , a 3 , a 4 , a 5 , a 6 , a 7 ) comprising PGF parameters corresponding to the seven types of segregating mutations (Table 1). Figure 2 illustrates that the total number of mutations accumulated within the subtree beginning on level l corresponds to the sum of numbers accumulated on level l and in the subtree beginning on level l ÿ 1: appendix a presents the derivation of R l ðaÞ; the PGF of mutations accumulated within level l. Recursion (5) has initial condition reflecting that mutations in the MRCA do not segregate in the sample. The PGF of the numbers of the seven types of mutations observed in a sample of size L corresponds to in which the matrix product begins on the left with the largest index value.
Recursive computation of exact likelihoods: We determine likelihoods from a recursion in probabilities rather than in the PGFs themselves. Taking derivatives of (5), we obtain an expression for the probability of observing the array p ¼ ðp 1 ; p 2 , . . . , p 7 ) of mutations in the subtree extending from level l to the MRCA, in which q denotes the array of mutations that arose within level l and p ÿ q the remaining mutations; the summation in q runs over all possible subsets of the total array p; and superscripts of the form (p) indicate the order of derivatives with respect to the parameters representing the mutational types (the p i th derivative with respect to a i , i ¼ 1, 2, . . . , 7). We initialize the recursion in probabilities (6) by considering all possible assignments of mutations to level 2, for q, a subset of the observed mutations n. We then determine g ðpÞ 3 ð0Þ= Q p i ! for all possible mutational arrays p that can occur in the subtree comprising levels 2 and 3. This recursion ends with g ðnÞ L ð0Þ= Q n i !; the probability of the array of mutations observed in the initial sample. Mutations in the subtree extending from the MRCA to level l comprise those occurring within level l and those in the subtree extending to level l ÿ 1. Independence of the mutational process in these disjunct time periods implies that the PGFs of mutation numbers to level l correspond to the product of PGFs of mutations to level l ÿ 1 and within level l (5).
Importance-sampling approximation: To approximate (2), we first sample from an analytical stationary distribution a genealogical path G (4) consistent with the types of mutations observed in the sample (D 1 ) as well as the speciation scenario, place the observed numbers of mutations n along G according to a heuristic distribution, and then correct the bias introduced by the proposal, using the exact probability of the genealogical and mutational history.
Genealogical path: Given entry state S l on level l, the stationary distribution of the level l ÿ 1 entry state S lÿ1 is given by the corresponding row of for Ũ l and Ṽ l matrices of rates of within-and betweenlevel transition probabilities (appendix a), conditional on the observed types of mutations (D 1 ; appendix b).
Beginning with the state of the initial sample (0, 0, 0, L 1 , L 2 ), we construct a genealogical path G i by sampling, for each successive level until termination in the MRCA, a path segment from the row of (7) that corresponds to the entry state for the level.
Proposal distribution: Given a genealogical path G i , we propose placements of the observed mutations according to a multinomial distribution, ðr l;j Þ n j;l n j;l ! ; in which n j,l represents the number of type j mutations on level l ( P n j;l ¼ n j ) and r l,j the probability that a lineage on level l receives a type j mutation. For e l,j the number of lineages on level l that are eligible to receive mutations of type j, r l;j ¼ e l;j w l P L k¼2 e k;j w k ; in which w l represents the relative weight assigned to level l. Weight w l reflects the expected duration of level l, which has an exponential distribution with parameter corresponding to the rate of coalescence within the level (appendix c).
For the path segment corresponding to level l, we obtain the true probability P M (D, U i,l , G i,l ) from the element of R ðqÞ l ð0Þ= Q q j ! (A8) in the row and column associated with the entry (most recent) states on levels l and l ÿ 1, respectively, for q, the array of mutations assigned to level l.
Likelihood function: Griffiths and Tavaré (1994) described an importance-sampling procedure for generating likelihoods of arbitrary models (M) from those obtained under a particular driving model (M 0 ). This approach entails first intensively sampling genealogical paths and placements of mutations under M 0 and then characterizing the entire likelihood function by rescaling the probabilities (see Griffiths and Tavaré 1994;Kuhner et al. 1995;Felsenstein et al. 1999;Stephens and Donnelly 2000).
Because the choice of M 0 affects the reliability of the approximation (Kuhner et al. 1995(Kuhner et al. , 1998Stephens and Donnelly 2000), we first estimate the MLE through a two-tier search (appendix d) and then sample intensively under this driving model. APPLICATION We began our exploration with a comparison between D. p. bogotana and D. persimilis. For this smaller data set (Table 1, per/bog), determination of both the exact likelihoods by recursive computation (6) and their importance-sampling approximations was in fact feasible. Having established a basis for confidence in our IS implementation, we then addressed the estimation of population parameters from the comparison between D. p. pseudoobscura and D. persimilis (Table 1, per/pse).
That the inversion difference appears to be fixed between the species as well as in our sample (Dobzhansky and Powell 1975) suggests the absence or segregation in very low frequency (p) of the derived gene order in the ancestral species. In accordance with this expectation, preliminary results (not shown) indicated higher likelihoods of lower values of p. As the data set contains little information about this aspect, we arbitrarily assigned p as 0.0001 in estimating the remaining parameters.
Comparison of D. p. bogotana and D. persimilis: We used our two-phase search procedure (appendix d) to locate the maximum-likelihood values of the four parameters of the system (l/u, uN 0 , uN 1 , uN 2 ). Our IS approximations compare well to the exact likelihood computed using (6).
Maximum-likelihood estimates: Table 3 provides the MLE array (''Unconstrained'' column), with probabilities estimated using (3), based on a sample of 4 3 10 6 genealogical paths. A likelihood-ratio test detected no significant differences in effective population size among the species (comparison to ''N 0 -N 1 -N 2 '').
Comparison of exact and IS likelihood curves: To assess the accuracy of our IS approximation, we constructed conditional-likelihood curves for the speciation rate (l/u) with the remaining parameters assigned to their MLE values (Table 3) under both the exact recursion (6) and our IS method. Figure 3 presents the exactlikelihood function and 18 IS curves, each based on 500,000 sampled genealogies generated using the MLEs as the driving values. The average of the IS curves (blue line), based on a total of 9 3 10 6 samples, corresponds well to the exact conditional likelihood (red line), although it somewhat underestimates the absolute value of the likelihood.
Computation using (6) of a single point of the exactlikelihood function required $4 hr on a Macintosh PowerPC G5 (2.5-GHz processor, 3.5 GB DDR SDRAM). Construction of the 200 points constituting the entire conditional-likelihood curve in Figure 3 required $800 hr using the exact recursion, compared to $30 min using our IS method based on 500,000 sampled genealogical histories. While computation time under the exact recursion increases roughly with the product of the numbers of mutations observed (elements of n), observation of more mutations has virtually undetectable effect on computation time under IS. Figure 4 compares the exact log-likelihood function for the speciation rate (l/u) computed under (6) to the IS approximation. Our IS curve provides an excellent indication of both the MLE and the breadth of the likelihood function. As noted by Stephens and Donnelly (2000), error of the IS approximation generally increases farther from the driving value. In the case studied here, however, detectable discrepancies arise only in regions quite distant from the MLE, well beyond the $95% confidence range spanned by 2 log-likelihood units.
Comparison of D. p. pseudoobscura and D. persimilis: The interspecific comparison involving D. p. pseudoobscura (Table 1) comprises more sequences (longer genealogies) and many more mutations (87 compared to 42). Both factors, but especially the increase in mutation number, render computation of the exact-likelihood function impractical.
Maximum-likelihood estimates: Table 4 provides maximumlikelihood estimates of the population parameters, determined by our IS method using 4 3 10 6 sampled genealogies. Likelihood-ratio tests suggest that the effective size of D. p. pseudoobscura (uN 2 ) significantly exceeds those of both D. persimilis (uN 1 ) and the ancestor (uN 0 ).
Profile-likelihood curves: Various approaches exist for conveying a sense of the level of confidence in the estimate of a parameter in a multiple-parameter model (e.g., Berger et al. 1999). Within the maximum-likelihood framework adopted here, we have chosen to present our results in terms of the profile likelihood, under which the likelihood of a given value of a parameter corresponds to the maximum achieved over all assignments of the other parameters. appendix e describes our procedure for approximating the full four-dimensional likelihood surface using interpolating splines.  (Table 3) as the driving values, we generated 18 IS curves, each based on 500,000 sampled genealogies. Red indicates the exact likelihood surface of l/u, computed using (6) under the same assignments of the other parameters, and blue shows the average of the IS curves (9 3 10 6 total samples).  Figure 3. The red line represents the log of the exact-likelihood function, scaled to its maximum value, and the blue line shows our IS approximation, based on 9 3 10 6 sampled genealogies. Figure 5 shows the profile-likelihood curve for the relative rate of speciation (l/u), and Figure 6 shows the relative effective size of D. persimilis (uN 1 ). Table 5 provides 90% confidence intervals for our MLEs, determined under the assumption of a x 2 -distribution for the log-likelihood. DISCUSSION We have described a likelihood-based method for the estimation of population parameters. This first application addresses the time since divergence of closely related species in the D. pseudoobscura group. More importantly, it serves as a proof-of-concept demonstration of the speed and reliability of our approach, based on summary statistics rather than on entire nucleotide sequences.
Divergence between D. pseudoobscura and D. persimilis: Our maximum-likelihood analysis indicates a significantly larger effective population size for D. pseudoobscura than for D. persimilis (Table 4) and a speciation event somewhat more ancient than some previous estimates. Table 6 presents our estimates (Table 5) scaled to the rate of mutation per kilobase (ũ ¼ u1000=892; for our 892-bp region). Numbers for effective population size correspond toũN i (i ¼ 0, 1, 2) and those for divergence time toũ=l; the expectation of the exponentially distributed speciation-time variable.
Calibration of mutation rate: Rescaling of our estimates into units of numbers of years or individuals requires determination of the rate of substitution of neutral mutations in noncoding regions. For the same Drosophila species studied here, Hey and Nielsen (2004) (HN) estimated 5.3 3 10 ÿ6 mutations per kilobase per year. This number, an average across 14 regions for which their analysis indicated substantial heterogeneity, reflects divergence in both coding and noncoding regions and both synonymous and nonsynonymous substitutions.
For mammals (humans and rodents), Bustamante et al. (2002) estimated a 70% reduction in the rate of substitution at silent sites in expressed genes compared to their homologous pseudogenes. This accelerated substitution in pseudogenes, particularly marked in genes with high GC content, reflects in part the release from selective constraints on hypermutable CpG dinucleotides (Sved and Bird 1990). In primates, Subramanian and Kumar (2003) found a significant overall excess in neutral    substitution in exons compared to noncoding regions (pseudogenes, introns, and intergenic tracts), which disappeared upon exclusion of CpG sites.
While Drosophila genomes do not show significant CpG deficiency (Gentles and Karlin 2001), selective constraints on codon usage do influence the pattern of substitution at synonymous sites (Akashi 1995). Using the skew in base composition as an index of codon usage bias, Tamura et al. (2004) (TSK) inferred a neutral substitution rate of 1.1 3 10 ÿ5 mutations per kilobase per year for recently diverged (5.1 million years) Drosophila species.
Effective population size: We identify Wright's ''inbreeding effective number'' of individuals (Crow and Denniston 1988) with half the inverse of the rate of coalescence (half the expected number of generations to coalescence between a random pair of genes; see Slatkin 1991). Our estimates of scaled population size (uN i ), a ratio of rates, have units of mutations per coalescence. Division of these estimates by twice the rate of mutation per generation converts the scale to generations per coalescence, which we identify with effective numbers of individuals. Table 6 reports ''absolute'' effective population sizes (millions of individuals), obtained under the assumption of four generations per year (Schaeffer 1995) and HN and TSK mutation rates.
Divergence time: Under the TSK mutation rate, our MLE of the time since speciation between D. pseudoobscura and D. persimilis corresponds to $850 thousand years (KY), identical to the number obtained by Tamura et al. (2004) from their moments-based analysis of nuclear protein-coding genes located throughout the genome. Our estimate shows an insignificant excess over that of Aquadro et al. (1991) (500 KY), based on restriction site differences at the Amy region calibrated by DNA-DNA hybridization data. Hey and Nielsen (2004) based their ML analysis on MCMC reconstruction of the gene genealogy of entire nucleotide sequences from 14 genomic regions. It permits variation among regions in rates of substitution and introgression, but does not accommodate the population substructuring induced by linkage of DPS2002 to the D. persimilis inversion, which prevents its introgression. Their estimates of divergence time scaled to substitution rate (t 3 u S , analogous to ourũ=l) vary over a 13-fold range across the 14 regions, and those of absolute divergence time vary 238-fold. Their overall estimate of absolute divergence time (589 KY) corresponds to the average scaled divergence time divided by the average substitution rate. Hey and Nielsen's (2004) estimate of divergence time based on DPS2002 alone exceeds the average across the 14 regions by .86%. This number corresponds to 1113 KY under the average substitution (HN) rate and to 536 KY under the TSK rate. Expressed on the same scale, our MLEs (l/u ¼ 0.12, uN 0 ¼ 0.81, uN 1 ¼ 2.71, uN 2 ¼ 18.21) show nonsignificant differences from theirs for DPS2002 alone (0.19, 0.9, 1.9, 17.2), in which we associate their number for divergence time with the inverse of the parameter of the exponentially distributed speciation-time variable.
Approaches to estimation: Basing the estimation on summary statistics rather than on entire nucleotide sequences permits considerable simplification of the description of genealogical history. Streamlining of the evidentiary and computational basis affords greater computational and analytical freedom to address more realistic demographic scenarios and biological processes.  Genealogies as nuisance parameters: Our primary interests lie in the characterization of the evolutionary process of speciation and genetic divergence. In this context, the genealogy of the sampled genes represents a nuisance parameter, an unknown aspect that influences the estimation of population parameters but that holds little interest in itself. More precisely, the gene genealogy is not a parameter but another manifestation of the evolutionary process under study (Donnelly and Tavaré 1995;Stephens 2001).
To accommodate some of the diversity of evolutionary processes, a number of methods entail estimation of a great many parameters. For example, full genealogical reconstruction in structured populations requires estimation of the ages of all nodes, mutations, migration events, changes in population size, and population divergences (Beerli and Felsenstein 2001;Nielsen and Wakeley 2001;Wilson et al. 2003;Hey and Nielsen 2004). A given biological system may correspond to a set of parameter assignments within the full model (for example, Yang 1996; Hey and Nielsen 2004). However, a model that accommodates the unique origin of the second chromosome inversion that prevents introgression at the DPS2002 region is not nested within available data analysis packages. Griffiths and Tavaré (1996) showed that the variance of estimates of population parameters obtained from full genealogical reconstruction of entire nucleotide sequences tends to be smaller than that based on summary statistics. Even so, one may well prefer simpler methods based on summary statistics in cases for which they are nearly sufficient for the estimation of the population parameters of interest (Marjoram et al. 2003). The development and implementation of analyses capable of accommodating additional evolutionary processes can demand literally years of effort (Felsenstein et al. 1999), and even entire sequences may contain insufficient information to support the estimation of fully resolved gene genealogies as well as all parameters of a heavily parameterized model (Wiuf 2003). A related observation is that less detailed models can sometimes generate more accurate estimates (Takahashi and Nei 2000;Piontkivska 2004;Kosakovsky Pond and Frost 2005).
Reflecting our interest in population parameters rather than gene genealogies themselves, our method adopts a much-condensed genealogical description. A genealogical path in our analysis corresponds to an ordered list of lineage types associated with the nodes of the gene genealogy (4). It differs, in particular, from the genealogical history of Griffiths and Tavaré (1994), for which the state space includes the mutations.
Reduction of the computational burden invested in estimating genealogy may permit analysis of more realistic biological processes or demographic histories for which full genealogical reconstruction of entire nucleotide sequences may be altogether infeasible. For the application at hand, our model explicitly conditions the genealogical histories on the unique origin of the inversion that prevents introgression in the genomic region studied (appendix b). Incorporation of this biological information into the estimation procedure entailed only modification of Markov matrices of rates of within-and between-level transitions (appendix a). This structural flexibility may permit customization of the analysis to a wide variety of biological systems.
Estimation of divergence times: Species divergence corresponds to a change in coalescence structure: the most recent point at which ancestral lineages with descendants in different species can have coalesced. Using the age of the most recent node with descendants sampled from different populations as a surrogate for divergence time generates negligible error only for ancient divergence events involving small ancestral population sizes and little interpopulation gene flow (see Nichols 2001). A number of recent reviews of moment-and likelihood-based approaches have addressed the estimation of population divergence apart from node age (Arbogast et al. 2002;Rosenberg and Feldman 2002;Takahata and Satta 2002).
Many likelihood-based methods related to ours treat time since speciation as a parameter. Takahata et al. (1995), Yang (2003), andWall (2003) based the estimation of divergence time and ancestral population size on numbers of segregating sites in a present-day sample comprising one sequence from each of two or more species. Nielsen and Wakeley (2001) and Hey and Nielsen (2004) used MH sampling to approximate the posterior distribution of fully resolved gene genealogies, including all node ages and time since speciation.
To incorporate time into the method of Griffiths and Tavaré (1994), for which the genealogical histories record only the relative order of events, Nielsen (1998) determined the probability distribution of the numbers of ancestral lineages remaining in each group at the divergence event. Our method characterizes the time since speciation as an exponentially distributed random variable and estimates the instantaneous rate of speciation (l/u). This construction obviates the need to incorporate time into the backward construction of genealogical histories.
Unique evolutionary events: Slatkin andRannala (1997, 2000) based the estimation of the age of an advantageous or deleterious mutation on the relative magnitudes of neutral variation segregating within the affected and unaffected subsamples. The sample genealogy reflects coalescence of affected lineages only among themselves, with the number of ancestors from which they could have descended determined from a branching-process model. In contrast, Wiuf and Donnelly (1999) addressed the age of a neutral marker restricted to a subset $ of genes sampled from a population. Determination of the likelihood entails conditioning a random gene genealogy to contain a node from which all members of $ and none of the other sampled genes descend and requiring the occurrence of exactly one neutral mutation on the branch immediately ancestral to that node. These approaches differ with respect to more than statistical philosophy. In the latter case, the partitioning of the sample arises only after observation of the mutation of interest, while in the former, the distinction between affected and unaffected genes exists before observation of the sample.
We chose to study the DPS2002 region precisely because tight linkage to a second chromosome inversion precludes its introgression. We have imposed the simplifying assumption that the origin of the inversion occurred immediately before the MRCA of the inverted lineages. Further, for cases in which the MRCA of the inverted lineages predates the speciation event, our model assumes a constant frequency (p) of the inversion in the ancestral species and permits coalescence only within and not between gene orders. A more detailed analysis would incorporate a description of evolutionary change in p (e.g., Slatkin and Rannala 1997;De Iorio and Griffiths 2004).
Importance sampling: The high complexity of virtually all biological systems of interest ensures that any particular realization of the evolutionary process occurs with extremely low probability, making approximation of likelihoods by ''naive'' Monte Carlo simulation impractical (Stephens and Donnelly 2000). Importance sampling offers a means of compensating for discrepancies generated by sampling from convenient but incorrect proposal distributions. As discussed in the introductory section, sprinkling the observed number of mutations over a random genealogy under the ''fixed-S '' procedure very rapidly generates genealogical and mutational histories consistent with the data, but approximates an incorrect distribution (Markovtsova et al. 2001).
Our method (2) proposes genealogical paths (4) by sampling, not from a posterior distribution given the full data (D ¼ fD 1 , D 2 g), but from an analytically determined stationary distribution (7) of paths consistent with the types of segregating mutations observed (Q M (D 1 , G)). It then sprinkles the observed numbers of mutations on the genealogical path (Q M (D 2 , UjD 1 , G)) according to a heuristic weighting scheme (8). We then correct the bias introduced by the proposal distribution using the exact probability of the proposed genealogical and mutational history (appendix a).
Expansion of the evidentiary basis to include additional summary statistics would extract more information from the sampled sequences. Through straightforward redefinition of the state space, our method can incorporate counts of mutations of various kinds, including numbers of mutations classified according to their distribution among groups Wakeley et al. 2001), the number of haplotypes and their frequency spectrum (Ewens 1972), and the frequency spectrum of mutation numbers (Fu 1995). Our approach is less well suited to summary statistics defined as various moments, including average pairwise differences, variances, regressions, and correlations. However, the distribution of mutations among groups can replace pairwise F ST values as the basis for the characterization of gene flow , and the relative numbers of segregating sites at linked loci can replace pairwise linkage disequilibrium as the basis for the estimation of recombination rate . The simplicity of our approach (appendix a) facilitates structural modification both to incorporate more information contained in the sampled sequences and to broaden the scope of evolutionary processes amenable to analysis.
We thank Beatrix Jones for perceiving the relevance of importance sampling to this problem, Michael Lavine for comments and suggestions that improved the analysis, Sudhir Kumar for insights into rate calibration and CpG avoidance, John Willis for reminding us of the unique evolutionary origin of inversions, and two anonymous reviewers and John Wakeley for comments. This research was supported in part by funding from National Science Foundation

LITERATURE CITED
For convenience, we summarize the recursive determination of a PGF of the array n of segregating sites observed in a sample of arbitrary size from the two species (see Uyenoyama and Takebayashi 2004).
Matrices P l and and Q l , respectively, provide per-generation rates of within-level and between-level transitions for states on level l. For state a within level l, let C l;a denote the total rate of transition to any other configuration, irrespective of level, C l;a ¼ P l;a 1 Q l;a ; for P l;a and Q l;a representing row sums of the within-and between-level transition rate matrices. Matrices U l and V l denote within-and between-level transition probabilities, given the occurrence of a transition, for C l , a diagonal matrix in which the diagonal element in row a corresponds to C l;a (A1). Under a geometric distribution for the total number of mutations accumulated in the interval terminated by the first transition from state a and a multinomial distribution, given this total number, for the number of mutations arising on the three types of lineages, we obtain the joint PGF of mutation numbers, f l;a ðaÞ ¼ C l;a C l;a 1 u½l 1 ð1 ÿ c 1 Þ 1 l 2 ð1 ÿ c 2 Þ 1 l 3 ð1 ÿ c 3 Þ ; in which a ¼ (a 1 , a 2 , a 3 , a 4 , a 5 , a 6 , a 7 ) represents the array of seven PGF parameters corresponding to the seven types of observed segregating mutations (Table 1), and the assignment of these PGF parameters to c 1 , c 2 , and c 3 depends on the configuration of state a: c 1 ¼ a 1 if l 1 . 1 or both l 1 ¼ 1 and l 3 . 0 a 2 if l 1 ¼ 1 and l 3 ¼ 0 These expressions indicate that the conditional transition matrices (A2) and the joint distribution of mutation numbers (A3) depend only on the relative rates of transition and mutation: l/u, uN 0 , uN 1 , and uN 2 .
To obtain an expression for R l ðaÞ (5), the joint PGF of mutations occurring within level l of the sample genealogy, we consider the array of mutations that occurred before and after the most recent transition, g l ðaÞ ¼ F l ðaÞ½U l g l ðaÞ 1 V l g lÿ1 ðaÞ; ðA5Þ for F l ðaÞ; a diagonal matrix with the PGFs of mutation numbers (A3) for states within level l arrayed along the diagonal. Rearrangement of (A5) completes the recursion (5), for D l ðaÞ ¼ F l ðaÞC ÿ1 l : Because speciation alone induces within-level transitions and occurs exactly once in a genealogy, under which the matrix inverses in (7) and (A6) reduce to ½I ÿ U l ÿ1 V l ¼ ½I 1 U l V l R l ðaÞ ¼ ½I 1 F l ðaÞU l F l ðaÞV l ¼ ½I 1 D l ðaÞP l D l ðaÞQ l : In determining the recursion in probabilities (6), we observe that the PGF parameters (a 1 , a 2 , . . . , a 7 ) appear only in the D l ðaÞ: Derivatives of these diagonal matrices with respect to the a i take the form dD l ðaÞ da i ¼ D l ðaÞ 2 E l;i ; for E l;i ; a diagonal matrix of the absolute values of the coefficients of a i in the denominators of the elements of D l ðaÞ (A3). All configurations within a given block (l 1 , l 2 , l 3 ) share the same coefficient of a i , implying that the corresponding submatrix of E l;i is proportional to the identity matrix ul x I; for x, the lineage type associated with a i in this block (A4). Using P l E l;i ¼ E l;i P l and (A7), we obtain in which qð¼ P q i . 0Þ represents the total number of mutations arising on level l. The product of the E l;i matrices in (A8) is nonzero only for arrays q that specify a combination of mutations that can occur on level l.

APPENDIX B: CONDITIONED GENEALOGIES
We extend the method of Wiuf and Donnelly (1999) to restrict the genealogical paths proposed by our IS procedure to those consistent with the kinds of mutations observed in the sample (D 1 ). We describe the conditioning of genealogies on each of the four possible topologies (f f/a, a/f g, f f/a, f/sg, fs/s, f/sg, and fs/sg), both with and without the substructuring of the ancestral species induced by the presence of the chromosomal inversion that precludes introgression at DPS2002. In the system under study, the unique origin of this inversion entails the presence of an f/a branch (first two topologies), whether or not the data set includes an f/a mutation.
Let T 0 (x, y, z) represent the probability that a process presently in the prespeciation state (l 01 ¼ x, l 02 ¼ y, l 03 ¼ z, 0, 0) has a genealogy of the required kind, and let T 1 (x, y, 0) be the corresponding probability for a process in the We accept the proposed pointṼ only if its likelihood estimated using itself as the driving model exceeds that estimated using the present V (i) as the driving model. Upon acceptance, the algorithm sets V ði11Þ ¼Ṽ and initiates another cycle. Upon rejection, it either terminates, after a specified number of consecutive rejections, or initiates another cycle from V (i) .
To guard against settling on local modes, we repeat this procedure several times, beginning from perturbations of the seed value.
APPENDIX E: PROFILE-LIKELIHOOD SURFACE Having assigned p as 0.0001, we used IS with the MLEs (Table 4) as driving values to approximate the likelihood surface over the remaining four dimensions at 20,000 grid points (l/u, 10 values beginning at 0.0 with a step size of 0.1; uN 0 and uN 1 , 10 values from 0.01, step size 0.75; uN 2 , 20 values from 0.01, step size 2.0), each based on 300,000 sampled genealogical histories. Invoking the csapi function of the MatLab spline toolbox, we approximated the full four-dimensional likelihood surface using multivariate cubic splines (de Boor 2001). We then discretized the interpolated surface on a finer grid, beginning at zero for each of the four parameters with smaller step sizes (l/u, 0.02; uN 0 and uN 1 , 0.25; uN 2 , 1.0). From these .1.7 3 10 6 estimated and interpolated points, we generated the profilelikelihood curves by determining for each value of a given parameter the values of the remaining three parameters that gave the highest likelihood.