Abstract
This article presents a new method for jointly estimating species divergence times and ancestral population sizes. The method improves on previous ones by explicitly incorporating intragenic recombination, by utilizing orthologous sequence data from closely related species, and by using a maximumlikelihood framework. The latter allows for efficient use of the available information and provides a way of assessing how much confidence we should place in the estimates. I apply the method to recently collected intergenic sequence data from humans and the great apes. The results suggest that the humanchimpanzee ancestral population size was four to seven times larger than the current human effective population size and that the current human effective population size is slightly >10,000. These estimates are similar to previous ones, and they appear relatively insensitive to assumptions about the recombination rates or mutation rates across loci.
THE effective population size (N_{e}) of a species has a direct effect on the amount and the pattern of DNA sequence variation. Researchers have therefore used sequence polymorphism data to estimate N_{e} (e.g., Kreitman 1983; Takahata 1993; Nachman and Crowell 2000). The amount of observed diversity can be used to estimate the population mutation parameter θ= 4N_{e}μ, and the per generation mutation rate μ can be estimated either directly (Haradaet al. 1993; Giannelliet al. 1999) or indirectly (e.g., Kimura 1983; Sattaet al. 1993; Kumar and Hedges 1998; Nachman and Crowell 2000) from divergence data (given assumptions about the divergence date and the average generation time).
Most estimates of N_{e} for humans are ∼10,00015,000 (e.g., Takahata 1993; Hardinget al. 1997). While there are many possible reasons why the effective population size may be quite different from the census population size (Caballero 1994), it remains surprising that the human N_{e} is so low, especially given humans’ large range over the past 12 million years (MY; e.g., Swisheret al. 1994; Gabunia and Vekua 1995). In particular, great ape species historically have had much smaller ranges, but have N_{e} two to three times larger than the human N_{e}. Did some event associated with the founding of the genus Homo (Takahata 1993) or some other particular event in human history lead to a sharp reduction in effective population size? It is difficult to answer this without knowing how N_{e} has varied over evolutionary time. Recently, progress has been made in estimating the effective population size of the population directly ancestral to two extant daughter species.
A few main methods exist for estimating N_{a} (see Takahata and Satta 2002 for a more indepth discussion). (We refer to the ancestor’s population size as the “ancestral N_{e},” or N_{a}.) One method, referred to as the trichotomy method, requires orthologous sequence data from three closely related species (Nei 1987; Wu 1991). This approach uses a single orthologous sequence from each of three species and assumes a simple model of population history where at fixed times different species become isolated with no further admixture (cf. Hey 1994). Random mating is assumed within each population. If the time between the two speciation events is small, the gene tree for a particular region will not always match the species tree (Nei 1987; see Figure 1). The probability that this happens depends in part on N_{a} for the ancestral population of species 1 and 2. In particular, a necessary (but not sufficient) condition for the gene tree and the species tree to be incompatible is that the species 1 and 2 lineages do not coalesce between the two speciation events. If N_{a} is larger, the probability of a common ancestor before time T_{2} is reduced, leading to a greater chance that the gene tree and the species tree do not match. The trichotomy method uses orthologous data from many unlinked loci, infers the gene tree at each locus, calculates the proportion of loci where the inferred gene tree does not match the species tree, and then uses this proportion to estimate N_{a}. Application of the trichotomy method to human and great ape sequence data has led to estimates of N_{a} (for the humanchimpanzee ancestral population) substantially larger than the current N_{e} for humans (Ruvolo 1997; Chen and Li 2001; Takahata and Satta 2002). Chen and Li (2001) estimate, for example, N_{a} = 52,00096,000, or roughly five to nine times larger than the current human N_{e}.
Another method for estimating N_{a} requires divergence data from two or three species (Takahata 1986; Takahataet al. 1995; Yang 1997, 2002). Here, I describe the twospecies method, because that is what is generally used. Given two orthologous sequences, one each from a pair of species, they will coalesce at some time that predates the species divergence time (see Figure 1). For the autosomes, the time spent in the ancestral population before coalescence (x in Figure 1) is exponentially distributed, with mean 2N_{a}g (where g is the average generation time and N_{a} is the diploid ancestral N_{e}). In contrast, the postspeciation branch lengths (y in Figure 1) are fixed. Given data from multiple unlinked loci and assumptions about g and μ, one can use maximum likelihood to jointly estimate the speciation time and N_{a} (Takahataet al. 1995; Yang 1997). The general idea is that large values of N_{a} correspond to greater variability in the coalescence time of two orthologous sequences and thus greater variance in the observed divergences across loci. Using human and chimpanzee divergence data, estimates of the humanchimpanzee N_{a} are ∼510 times the current N_{e} (Takahata and Satta 1997; Takahata 2001).
Finally, two other methods require intraspecific polymorphism data from two species and use either a momentbased (Wakeley and Hey 1997) or a maximumlikelihood (Nielsen and Wakeley 2001) approach to estimate model parameters (including in particular N_{a} and the divergence time). Both of these methods are well suited for species that have diverged relatively recently, but less so for species such as humans and chimpanzees that share very little ancestral polymorphism. In any case, at the present they cannot be used to estimate the humanchimp N_{a} because of a lack of chimpanzee polymorphism data.
Large estimates of the humanchimpanzee N_{a} are concordant with a study of Mhc that used the high levels of diversity there to estimate a longterm (i.e., over the past 1020 million years) average effective population size of ∼10^{5} (Takahata 1991). However, it should be noted that the large estimates of N_{a} are difficult to reconcile with humanchimpanzee divergence times estimated from molecular data. Most recent estimates of the divergence time fall between 4 and 6 million years ago (MYA; e.g., Horaiet al. 1995; Easteal and Herbert 1997; Kumar and Hedges 1998; Kumar and Subramanian 2002). These estimates are for a single human and a single chimpanzee sequence; they reflect both divergence between species and segregation in the ancestral population (x and y in Figure 1). If N_{a} is large, then x must be large; if x is large and x + y is fixed, then y must be small. Suppose, for example, that (x + y) = 5.5 MY, as estimated by Kumar and Hedges (1998), and that N_{a} = 52,00096,000 (cf. Chen and Li 2001). Then, if the average generation time is 20 years, y would be 1.73.4 MY. If the average generation time were 25 years (see discussion), then y = 0.72.9 MY. These estimates postdate many welldocumented australopithecine fossils and are therefore dubious estimates of the time since speciation.
One possible explanation is that N_{a} has been consistently overestimated. Indeed, both the trichotomy method and the twospecies maximumlikelihood method have been criticized (Hudson 1992; Takahataet al. 1995; Sattaet al. 2000; Takahata and Satta 2002), and it is not clear how accurate the estimates are. For one, the trichotomy method assumes one already knows the time between the two speciation events, but this is generally not known a priori. Furthermore, it assumes one can correctly infer the phylogeny for any particular locus. Errors in phylogenetic inference arise when analyzing actual data, and the whole endeavor does not make sense in the presence of intragenic recombination (cf. Nordborg 2001). With three closely related species, the true phylogenies for nearby sites are not always the same. So, when loci are analyzed, they are often an amalgamation of sites with different phylogenies. Trying to infer a single phylogeny from such data is clearly not appropriate (Sattaet al. 2000). Even if the problems of recombination and phylogenetic reconstruction were ignored, the trichotomy method does not make an efficient use of the available information. Data from each locus are summarized into a single binary variable, depending on whether the inferred locus phylogeny agrees or disagrees with the species tree.
The maximumlikelihood methods are more rigorous and efficient, but they too have two main drawbacks. As with the trichotomy method and the method of Nielsen and Wakeley (2001), intragenic recombination is ignored. In addition, the methods of Takahata et al. (1995) are highly sensitive to variation in μ among loci. The problem is that variation in μ leads to greater variance in observed divergences across loci, which inflates the estimate of N_{a}. Although variation in μ can be explicitly modeled in the analyses (Yang 1997; Takahata and Satta 2002), it is difficult to know whether a particular model of rate variation is appropriate, especially given data from only two species (but see Yang 2002).
In this article, I present a new method for estimating N_{a}. The method requires orthologous sequence data from three or more species (two plus one or more outgroups) and jointly estimates N_{a} and species divergence times using a summary maximumlikelihood approach. Unlike the previous maximumlikelihood methods, intragenic recombination is incorporated, and likelihoods are estimated from coalescent simulations. Also, the model can account for variation in mutation rates across loci. Although the method can be used on data from any taxonomic group (as long as at least one outgroup species is available), I concentrate here on analyzing human and great ape sequence data. The maximumlikelihood framework allows for the estimation of confidence intervals; this, along with a more realistic model, allows us to assess with greater rigor whether the humanchimp N_{a} was much larger than the current human N_{e}, as previous studies have claimed. I apply the method to the orthologous data from 53 intergenic regions reported in Chen and Li (2001) and generate both point estimates and approximate confidence intervals for N_{a} and species divergence times. Intergenic sequence data are preferable to data from genes (even synonymous sites or introns) because they are less likely to have been affected by natural selection at closely linked sites.
METHODS
I describe the model in which there are orthologous sequence data from four species. The case in which there are three species (or five or more) follows analogously.
Suppose we have four species with a known phylogeny. We assume a null model of speciation (cf. Hey 1994; Sattaet al. 2000) whereby a panmictic ancestral population splits at a fixed time into two panmictic descendant populations, with no subsequent migration between the descendant populations. The scaled mutation and recombination parameters are θ (= 4N_{h}μ) and ρ (= 4N_{h}r), where μ is the mutation rate per site per generation and r is the recombination rate per site per generation. Label the species H, C, G, and O, with current diploid effective population sizes N_{h}, N_{c}, N_{g}, and N_{o}, respectively. Suppose H and C split at time T_{1}, H and G at time T_{2}, and H and O at time T_{3}, with T_{1} < T_{2} < T_{3} (see Figure 2). T_{1}, T_{2}, and T_{3} are scaled in units of 4N_{h} generations. From time T_{1} until T_{3} both the HC ancestral population and the HCG ancestral population have effective size N_{a}, while the HCGO ancestral population has effective size N_{o}. The results are similar if the latter ancestral population has effective size N_{a} (results not shown). Finally, define n_{s} as the number of contiguous nucleotide sites in the simulation. There are a total of 11 parameters in the model, listed in Table 1. We assume that the generation time and mutation rate do not vary across species. These assumptions are reasonable when the species considered have similar lifehistory traits and are closely related. There are three possible (unrooted) gene trees, with H and C, H and G, or C and G as sibling species.
Now, suppose we have a single orthologous sequence from each species. For each site that is “segregating” (i.e., is not identical across all species), we can infer that one or more mutations happened on certain branches in the unrooted tree. We do this assuming the fewest number of mutations that can explain the data. For example, if the H, C, G, and O sequences have A, G, A, and A, respectively, then we infer that a mutation happened on the branch leading to species C. All biallelic segregating sites fall into seven categories, resulting from mutations on seven different branches of an unrooted tree. These seven branches have H, C, G, O, HC, HG, or CG as descendants and are referred to as the seven types of branches. Note that for any particular gene tree there are only five possible branches, four external ones (with a single species as a descendant) and one internal one. Any site may have one of three possible gene trees, leading to seven possible branches over all possible gene trees (the four external branches that are common to each gene tree and one internal branch from each gene tree). For sites with three segregating nucleotides, we assume that the two species with the same base share the ancestral state and that the two other bases each arose from a single mutation. The Chen and Li (2001) data do not contain any sites where each species has a different nucleotide, so we do not consider this possibility.
The sequence data for the 53 intergenic regions reported in Chen and Li (2001) were kindly provided by the authors and aligned by eye. All indels were excluded. From the remaining sequence, we count the inferred number of mutations that happened on each of the seven branch types. (No distinction is made between transitions and transversions.) For a given region, denote these numbers of inferred mutations as b = (b_{1}, b_{2},... b_{7}). For given values of M = (θ, ρ, N_{h}, N_{c}, N_{g}, N_{o}, N_{a}, T_{1}, T_{2}, T_{3}, n_{s}) we estimate the likelihood of observing the vector b using Monte Carlo simulations.
The population model in Figure 2 is simulated using a modification of the coalescent with recombination (Hudson 1983). For each site in each replicate, we classify all the branches in the genealogy as one of the seven types of branches (i.e., with descendants H, C, G, O, HC, HG, or CG in the unrooted tree). Since mutations happen at rate μ per site per generation, we can tabulate from the total branch lengths the expected number of mutations that lie on each of the types of branches. For the jth replicate, denote these expected values as B_{j} = (B_{1}, B_{2},... B_{7}). The probability of observing b given B_{j} is then
The above equation describes how to estimate the likelihood of M for a single locus. Define M′ as a vector containing the first 10 values of M. Estimation of the likelihood of M′ over multiple loci is straightforward. Given a collection of k loci, define
Of the 11 parameters that make up the model M, only 9 can freely vary. n_{s} is fixed from the actual data, while N_{h} is relevant only indirectly; it turns out that the simulations use only the ratios of the effective population sizes (i.e., N_{a}/N_{h}, N_{c}/N_{h}, etc.), not their actual values. The actual N_{h} comes into play when interpreting the simulation results (e.g., translating from scaled time to actual time). Ideally, one would like to let θ, ρ, N_{c}/N_{h}, N_{g}/N_{h}, N_{o}/N_{h}, N_{a}/N_{h}, T_{1}, T_{2}, and T_{3} vary freely and determine which combination of parameter values maximizes the likelihood of observing the actual data. However, this is computationally prohibitive, so we fix those values for which we have prior information and let the others vary: N_{a}/N_{h}, T_{1}, T_{2}, and T_{3} vary freely (at increments of 1.0, 0.25, 0.5, and 1.0, respectively), and we consider the following four schemes for the other parameters:
Model 1: θ=ρ= 0.001/bp; N_{c} = N_{g} = N_{o} = 3N_{h}.
Model 2: θn_{s} for each locus is proportional to the total inferred number of mutations (across all four species), and the average θ/bp (over all loci) is 0.001; ρ= 0.001/bp; N_{c} = N_{g} = N_{o} = 3N_{h}.
Model 3: Same as model 2, but all CpG sites were excluded, and the average θ/bp (over all loci) is 0.00075.
Model 4: θ= 0.001/bp; ρ= 0.002/bp; N_{c} = N_{g} = N_{o} = 3N_{h}.
Model 5: θ=ρ= 0.001/bp; N_{c} = N_{g} = N_{o} = 6N_{h}.
θ can be easily estimated from human sequence polymorphism data (e.g., Watterson 1975). Putatively neutral sites from recent resequencing studies of human variation suggest that θ= 0.001/bp is a good ballpark figure for the autosomes and that roughly onefourth of all segregating mutations occur at CpG sites (e.g., Nachman and Crowell 2000; Przeworskiet al. 2000; Templetonet al. 2000; Ebersbergeret al. 2001; Frisseet al. 2001). Model 2 tests how sensitive the results are to variation in θ across loci by taking the same average θ as model 1, but assuming that θn_{s} for each locus is proportional to the observed number of inferred mutations. (This is equivalent to estimating θ using Watterson 1975, assuming an average of θ= 0.001/bp.) The genomewide average rate of crossing over in humans is r = 1.3 × 10^{8}/bp (Yuet al. 2001). If N_{h} ≈ 10^{4}, then ρ ≈ 5.2 × 10^{4}/bp. We take slightly larger ρ values to account for the unknown contribution of gene conversion to overall rates of recombination (see, e.g., Frisseet al. 2001; Przeworski and Wall 2001). Finally, levels of nonhuman great ape diversity seem to be substantially higher than human diversity levels (Deinard and Kidd 1999; Kaessmann et al. 1999, 2001), but not enough data have been gathered to accurately estimate N_{c}/N_{h}, N_{g}/N_{h}, or N_{o}/N_{h}. We have chosen values that might plausibly reflect the total species diversity in chimps, gorillas, and orangs. Models 35 were chosen to explore the sensitivity of the results to the presence of hypermutable CpG sites, assumptions about the recombination rate, and assumptions about great ape population sizes, respectively.
In addition to using the parameter combination that maximizes the likelihood as a point estimate, it would be useful to determine how much confidence we should place in the estimated values. To get a sense of how the likelihood varies as a function of T_{1}, for example, I calculate the (approximate) profile likelihood:
To verify the accuracy of the method, I run coalescent simulations with known T_{1}, T_{2}, T_{3}, and N_{a}/N_{h} values; then, I use the new method on the simulated data to estimate parameters and to compare the estimated values with the actual ones. These simulations modeled 50 loci of 500 bp each, with θ =ρ= 0.001/bp, N_{c} = N_{g} = N_{o} = 3N_{h}, T_{1} = 5.0, T_{2} = 8.0, T_{3} = 14.0, and N_{a}/N_{h} = 5.0. The parameter values were chosen to roughly match both the Chen and Li (2001) data and our a priori knowledge about species divergence times and ancestral population sizes. Five replicates were run; I analyzed each one under the assumptions of model 1 (see above). Note that this assumes an idealized situation, where the nuisance parameters are known exactly.
All programs were written in C and are available from the author on request. A total of 5 × 10^{4} replicates were run for each model and parameter combination. To give a sense of the computational efficiency, the total simulations took 5 months to run on a pair of 1.7 GHz Pentium 4 processors.
RESULTS
The maximumlikelihood estimates for T_{1}, T_{2}, T_{3}, and N_{a}/N_{h} are presented in Table 2. The estimates across the five models are broadly similar; all of them estimate an ancestral population size five to six times larger than the current human effective population size, in keeping with previous studies (Takahata and Satta 1997; Chen and Li 2001; Takahata 2001). The estimates of T_{1}, the humanchimpanzee divergence time, are also roughly in line with expectations. If we assume that g = 25 years and N_{h} = 10^{4} (or that g = 20 years and N_{h} = 12,500), then these estimates range from 3.5 to 4.0 MYA. In contrast, the paleontological record suggests that uniquely human ancestors were around at least 44.5 MYA (Whiteet al. 1994; Leakeyet al. 1995) and perhaps much earlier (HaileSelassie 2001; Brunetet al. 2002). This disparity can easily be reconciled if both the average generation time and the current human effective population size are on the larger side of previous estimates (e.g., g = 25 years and N_{h} = 15,000). Given our uncertainty in parameter estimates, these values are quite plausible. If instead we were to assume the generation time and species divergence time were known, then we could use the results to estimate the current human effective population size. If T_{1} = 6 MYA and g = 25 years, then the point estimates of N_{h} range from 15,000 to 17,100. The other species divergence times are also on the recent side; assuming once again that g = 25 years and N_{h} = 10,000, the estimated humangorilla divergence time ranges from 5.0 to 5.5 MYA, while the estimated humanorangutan divergence time ranges from 12 to 13 MYA.
To assess how much confidence we should place in the point estimates, I calculated approximate profilelikelihood curves and estimated ∼95% confidence intervals. The intervals for T_{1} and N_{a}/N_{h} are listed in Table 2. For both T_{1} and N_{a}/N_{h} the intervals are quite narrow, which suggests that the estimates are precise. All four models exclude N_{a}/N_{h} ≤ 3.5 and N_{a}/N_{h} ≥ 7.1 from the approximate confidence intervals. For T_{1}, the lower boundaries range from 2.7 to 3.0 and the upper boundaries range from 4.2 to 4.8. If as before we take g = 25 years and N = 10^{4}, the upper boundaries range from 4.2 to 4.8 MYA; these times are still more recent than the paleontological record would suggest. As mentioned above, a small increase in N_{h} is sufficient to reconcile the time estimates with the paleontological record. Figure 3 shows the profilelikelihood functions of N_{a}/N_{h} and T_{1} for model 2. The curves quickly become quite steep, suggesting that the range of plausible values is not that large. So, even if the approximate confidence intervals were nonconservative, it is likely that conservative ones would not differ much from the intervals listed in Table 2. The corresponding likelihood curves for the other models are qualitatively similar to those in Figure 3.
To verify the accuracy of the method, I applied it on five simulated data sets with known parameter values (see methods). Each one had actual values of T_{1} = 5.0, T_{2} = 8.0, T_{3} = 14.0, and N_{a}/N_{h} = 5. The estimated parameter values, along with the confidence intervals for T_{1} and N_{a}/N_{h}, are given in Table 3. The means of the parameter estimates are 5.0, 8.1, 14.0, and 4.8 for T_{1}, T_{2}, T_{3}, and N_{a}/N_{h}, respectively, which suggests that the method has no or low bias. In addition, the confidence intervals for T_{1} and N_{a}/N_{h} contain the true value all five times. Due to the large computational burden, it was not possible to run enough replicates to accurately estimate the coverage properties of the confidence intervals.
Comparing the different rows in Table 2 can give us some idea of how sensitive the results are to assumptions about the nuisance parameters (i.e., θ, ρ, N_{c}/N_{h}, N_{g}/N_{h}, and N_{o}/N_{h}). Since the results from all of the models are very similar, it appears that the particular assumptions made do not appear to be very important. In particular, unlike the twospecies maximumlikelihood method of Takahata et al. (1995), the results do not seem to be very sensitive to variation in mutation rates across loci. This may be due to the information about locusspecific mutation rates contained in the outgroup species or because the actual data have very little variation in mutation rates across loci.
DISCUSSION
Estimating ancestral population sizes has been an active research area for several years. The work presented here improves on previous efforts by explicitly incorporating intragenic recombination (see also Sattaet al. 2000) and by efficiently utilizing data from outgroup species. The estimates of the humanchimpanzee N_{a} are five to six times larger than the current human effective population size (see Table 2). Although most previous studies came to similar conclusions, it was not clear how much confidence to place in these estimates because of unrealistic assumptions, such as no recombination or no variation in mutation rates (Takahata and Satta 2002). The narrow confidence intervals and simulation results presented here (Tables 2 and 3) provide additional evidence that ancestral population sizes were substantially larger than the current human effective population size.
One recent study that came to a different conclusion (namely, that N_{a} is roughly as small as N_{h}) incorporated variation in mutation rates across loci but not intragenic recombination (Yang 2002). Recombination tends to decrease the variance in estimated branch lengths across loci; because of this, models that assume no recombination tend to underestimate N_{a} (Takahata and Satta 2002). Further work must be done to quantify how model assumptions (both here and in other studies) affect estimates of the ancestral population size.
Although this application focuses on the humanchimpanzee N_{a}, the same method can be used to estimate N_{a} from other taxa, as long as there are orthologous sequence data from three or more species (including at least one outgroup) at multiple unlinked loci. Below, I discuss issues that might affect the general applicability of the method.
Likelihood model: One possible criticism of the model is that the relative locations of the segregating mutations are ignored. However, this is not likely to be very important, since the number and the pattern of segregating mutations are far more informative. Incorporating the segregating site locations may lead to narrower confidence intervals and more accurate estimation of the likelihood function, but excluding them is not expected to bias the results in either direction. Given the results, it does not seem to be worth the substantial computational burden to consider the fulllikelihood model.
Mutational model: The mutational model that was adopted makes no distinction between transitions and transversions and assumes the mutation rate at each site in a locus is the same. However, some sites have higher mutation rates than others (Nachman and Crowell 2000; Templetonet al. 2000), which would increase the number of sites experiencing multiple mutations. Those multiply hit sites with fewer than three segregating nucleotides would then be misclassified by the model. In primates, the transition rate away from CpG sites is thought to be elevated by more than an order of magnitude due to methylatedcytosine mutagenesis (e.g., Joneset al. 1992; Giannelliet al. 1999). To test whether homoplasies from multiply hit CpG sites affected the parameter estimates, I reran model 2 excluding all CpG sites (i.e., model 3). Both the maximumlikelihood estimate and the shape of the profilelikelihood curves are almost identical (Table 2; results not shown), suggesting that the results presented here are relatively insensitive to the effects of multiple mutations at CpG sites. Calculations suggest that other proposed sites with elevated mutation rates, such as mononucleotide runs or DNA polymerase αarrest sites (Krawczak and Cooper 1991; Templetonet al. 2000), are too rare to appreciably increase the expected number of homoplasies (results not shown). For studies of species with larger levels of divergence, the effects of homoplasies may be a more serious concern. Future work will concentrate on implementing a finitesite mutation model in the maximumlikelihood scheme described here.
Molecular clock: The method described here assumes that the rate of mutation per unit time is the same on all branches. This is likely a reasonable assumption for the data considered here. Noncoding regions are less likely to be affected by natural selection than are the coding regions analyzed in other studies. Also, there is no reason to assume substantial differences in mutation rates (per generation) between humans and great apes. The data on generation times are sparse; EyreWalker and Keightley (1999) cite a time of g = 23 years in chimpanzees, while estimates of current human generation times are ∼30 years (Siguroardóttiret al. 2000; Tremblay and Vézina 2000). Average human generation times (over the last several million years) may be substantially smaller. Indeed, the Chen and Li (2001) data show no evidence for more mutations on the chimpanzee branch than on the human branch (Chen and Li 2001; results not shown), suggesting that the longterm average generation times for humans and chimpanzees are quite similar.
For other taxa, the clock assumption may not be appropriate. It would be straightforward to generalize the model to have different rates of evolution on different branches and to estimate these as well as species divergence times and ancestral population sizes. More sequence data and more computational time would be required to accurately estimate the additional parameters, and the method (with variable rates) may not be feasible with more than three species.
Nuisance parameters: Although the goal of this article is to estimate ancestral population sizes and species divergence times, the model presented here also includes other parameters, such as θ, ρ, or N_{c}/N_{h}. The reason ρ is included is not to estimate the recombination rate from divergence data (which would be somewhat challenging). Rather, the values of parameters like ρ affect the likelihoods, so some assumptions must be made about them. In the interest of computational tractability, I have chosen plausible values for θ, ρ, N_{c}/N_{h}, N_{g}/N_{h}, and N_{o}/N_{h}. Comparing model 1 with models 4 and 5 suggests that the choice of particular values for these other parameters may not affect the estimates of the parameters of interest. Further simulations show that this is true for a wider range of values (ρ= 0.00050.003/bp; N_{h} ≤ N_{c}, N_{g}, N_{o} ≤ 6N_{h}), although it should be pointed out that assuming no recombination (as previous methods do) leads to a likelihood of 0, due to the presence of several incompatibilities within loci. So, the estimates of N_{a}/N_{h}, T_{1}, T_{2}, and T_{3} are robust to the assumptions made about the other parameters.
Speciation model: Estimates of N_{a}/N_{h} and T_{1} provide information about the mean and the variance of the distribution of coalescent times of a single human and a single chimpanzee sequence. Under the simple speciation model considered here, greater variances in coalescent times must be the result of larger ancestral population sizes. Some researchers have suggested that there is often gene flow between “incipient species” (e.g., Wu 2001). A model of limited gene flow (prior to strict isolation) will lead to a greater variance in coalescent times. In particular, if there were gene flow after the initial divergence of the human and chimpanzee lines, then the ancestral population size (before the initial divergence) would be overestimated. Further work must be done to develop methods that can distinguish between limited gene flow and large ancestral population sizes using orthologous sequence data.
Acknowledgments
I thank M. Hare, M. Przeworski, N. Takahata, J. Wakeley, and an anonymous reviewer for comments on an earlier version of this manuscript. J.D.W. was supported in part by a National Science Foundation Postdoctoral Fellowship in Bioinformatics.
Footnotes

Communicating editor: N. Takahata
 Received February 12, 2002.
 Accepted October 14, 2002.
 Copyright © 2003 by the Genetics Society of America