Abstract
The effective population sizes of ancestral as well as modern species are important parameters in models of population genetics and human evolution. The commonly used method for estimating ancestral population sizes, based on counting mismatches between the species tree and the inferred gene trees, is highly biased as it ignores uncertainties in gene tree reconstruction. In this article, we develop a Bayes method for simultaneous estimation of the species divergence times and current and ancestral population sizes. The method uses DNA sequence data from multiple loci and extracts information about conflicts among gene tree topologies and coalescent times to estimate ancestral population sizes. The topology of the species tree is assumed known. A Markov chain Monte Carlo algorithm is implemented to integrate over uncertain gene trees and branch lengths (or coalescence times) at each locus as well as species divergence times. The method can handle any species tree and allows different numbers of sequences at different loci. We apply the method to published noncoding DNA sequences from the human and the great apes. There are strong correlations between posterior estimates of speciation times and ancestral population sizes. With the use of an informative prior for the humanchimpanzee divergence date, the population size of the common ancestor of the two species is estimated to be ∼20,000, with a 95% credibility interval (8000, 40,000). Our estimates, however, are affected by model assumptions as well as data quality. We suggest that reliable estimates have yet to await more data and more realistic models.
THE (effective) population size N is a central parameter in models of population genetics, conservation genetics, and human evolution. For example, the amount of genetic variation in a population is determined by θ= 4Nμ, where μ is the mutation rate per site per generation. When an independent estimate of the mutation rate is available, we can use the estimate of θ to infer the population size N. Estimation of θ or N of a modern species is relatively simple. The population size of modern humans has been estimated to be ∼10,000 (Takahataet al. 1995; Ruvolo 1997; Edwards and Beerli 2000; Zhaoet al. 2000; Yuet al. 2001). Estimation of population sizes of extinct ancestors of closely related species is more challenging, but Takahata et al. (1995) have developed a maximumlikelihood method under the infinitesites model for either two or three species. Another commonly used method, for three species, exploits the fact that ancestral polymorphism creates conflicts between the species tree and the gene tree (Nei 1987; Wu 1991) and estimates the ancestral population size by equating the proportion of mismatched gene trees to the theoretical expectation. Application of this method to hominid data sets has led to large estimates of population sizes for the common ancestor of humans and chimpanzees, on the order of 100,000 (Ruvolo 1997; Chen and Li 2001). However, this treemismatch method ignores sampling errors in the reconstructed gene tree, due to a finite number of nucleotide sites at each locus, and produces serious overestimates (Yang 2002).
Yang (2002) implemented a finitesites model using both likelihood and Bayes methodologies. The method is limited to the case of three species, with one individual sequenced at each locus from each species. However, it is advantageous to analyze multiple species and loci simultaneously, which may circumvent the sensitivity of such analysis to possible variation in evolutionary rates among loci (Yang 1997; Chen and Li 2001). The realization that typical data do contain information about ancestral population sizes and that information is better extracted by a combined analysis of sequence data from multiple species and loci provided motivation for the present study. Here we extend the method of Yang (2002) to deal with an arbitrary species tree and different numbers of sequences at different loci. The likelihood calculation using numerical integration does not seem feasible due to the increased dimension of the integral. Thus we adopt the Bayes approach and implement a Markov chain Monte Carlo (MCMC) algorithm. We apply the method to published data of noncoding DNA sequences from the human and the great apes.
Computationintensive MCMC algorithms are increasingly being used in inference in molecular population genetics (see Felsensteinet al. 1999; Stephens and Donnelly 2000 for reviews). Bahlo and Griffiths (2000) developed a likelihood approach to analyzing sequence data from subdivided populations, estimating jointly the population sizes and migration rates. They used an importancesampling strategy that works efficiently under the infinitesites model (Griffiths and Tavaré 1994). Beerli and Felsenstein (1999, 2001) implemented MCMC algorithms for likelihood analysis of subdivided populations under the finitesites model. Those methods assume an equilibrium migration model, which is suitable for geographically structured populations, but not for the species phylogeny considered in this study, as different species diverged at different times and never reached equilibrium. Nielsen and Wakeley (2001) implemented an MCMC algorithm for likelihood and Bayes inference using data from two species, modeling both ancestral polymorphism and gene flow after the species divergence. More recently Wilson et al. (2003) extended the Bayes MCMC algorithm developed for microsatellite data by Wilson and Balding (1998) to account for population subdivision and growth. Wilson et al.’s (2003) populationsplit model allows different subpopulations to diverge at different times, although the total population size is fixed. Their method was not implemented to analyze sequence data. The implementation we present here does not yet account for population demographic processes (such as population growth) or possible gene flow after species divergences, although it is straightforward to add these features. Our method is unique among current methods in being applicable to any species phylogeny and allowing combined analyses of multiple sequences per species as well as sequences from multiple loci.
THEORY
Data and model parameters: The data consist of aligned homologous DNA sequences at multiple neutral loci sampled from presentday species. The model and implementation apply to any species tree. As an example, we focus on the case of the great apes: human (H), chimpanzee (C), gorilla (G), and orangutan (O). The topology of the species tree, (((HC)G)O), is assumed known and fixed in the analysis (Figure 1). The number of sequences sampled may differ among loci. Let D = {D_{i}} be the entire data set, where D_{i} represent the sequence alignment at locus i, with i = 1, 2,..., L for a total of L loci. We expect the method to be applied to closely related species only and assume the molecular clock, that is, rate constancy among lineages. Furthermore, we assume random mating in each population and no gene flow after species divergences. We also assume no recombination within a locus and free recombination between loci.
Parameters in the model include the species divergence times as well as the ancestral and current population sizes. The population size of a current species is considered only if more than one individual is sampled from that species at some loci. Because time and rate are confounded in the data, both divergence times and population sizes are multiplied by the mutation rate. Thus parameters in the model for the example of Figure 1 include the three divergence times τ_{HC}, τ_{HCG}, and τ_{HCGO} and population size parameters θ_{H} for humans; θ_{C} for chimpanzees; and θ_{HC}, θ_{HCG}, and θ_{HCGO} for the three ancestral species. The divergence times (τ’s) are measured by the expected number of mutations per site from the ancestral node in the species tree to the present time (Figure 1). Collectively we let 0398;= {θ_{H}, θ_{C}, θ_{HC}, θ_{HCG}, θ_{HCGO}, τ_{HC}, τ_{HCG}, τ_{HCGO}} denote all parameters in the model to be estimated.
Bayes estimation of parameters: The Bayes hierarchical model has two main components: the prior distribution of the parameters and the likelihood, i.e., the probability of the data given the parameters. We use independent gamma distributions as priors for θ’s. The gamma density is
For parameters τ’s, we used independent gamma priors for the time gaps (interarrival times) on the species tree. For example, for the species tree of Figure 1, (τ_{HCGO} τ_{HCG}), (τ_{HCG} τ_{HC}), and τ_{HC} are assumed to have independent gamma distributions, and the prior f(0398;) is a product of the independent gamma densities. We also implemented an option of specifying gamma priors for the node ages (τ_{HCGO}, τ_{HCG}, and τ_{HC} for the tree of Figure 1), but the prior means are not given by α/β anymore; because of the constraints on node ages (for example, τ_{HCGO} >τ_{HCG} >τ_{HC}), the joint gamma distribution is truncated. The MCMC always updates the node ages (τ’s) and not time gaps.
The gene genealogy G_{i} at each locus i is represented by the tree topology T_{i} and the coalescent times t_{i}. Given parameters 0398;, the probability distribution of G_{i} = {T_{i}, t_{i}} is specified by the coalescent processes under the model. This is described in the next section. Let G = {G_{i}}. We have
We construct a Markov chain, whose states are (0398;, G) and whose stationary distribution is f(0398;, GD). A MetropolisHastings algorithm (Metropoliset al. 1953; Hastings 1970) is used. Given the current state of the chain (0398;, G), a new state (0398;^{*}, G^{*}) is proposed through a proposal density, q(0398;^{*}, G^{*}0398;, G), and is accepted with probability
The proposal density q can be rather flexible as long as it specifies an aperiodic and irreducible Markov chain. The algorithm we implemented cycles through several steps, with each step updating some variables. Step 1 changes the age of an internal node in each gene tree without changing the gene tree topology or speciation times. Step 2 cycles through all loci and, at each locus, changes the gene tree topology by pruning a subtree and then regrafting it back onto a feasible branch. Step 3 updates the θ’s. Step 4 cycles through all speciation times (τ’s) in the species tree and modifies each. This step also uses a “rubberband” algorithm to jointly modify the ages of nodes in each gene tree such that the coalescence times on the gene trees remain compatible with the modified species divergence times. Step 5 is a mixing step, in which all coalescent times in the gene trees and all species divergence times are multiplied by the same constant. The details of the algorithm are given in the appendix.
Distribution of the gene genealogy derived from censored coalescent processes: The prior probability, f(G_{i}0398;), of any gene tree and its coalescent times at a locus are specified by the coalescent processes in the different populations in the species tree. The theory applies to any gene tree, but is best explained with an example, for which we use the gene tree of Figure 1. Five populations, H, C, HC, HCG, and HCGO, are considered. We use HC to represent the population ancestral to H, and C. Yang (2002; see also Takahataet al. 1995) derived the joint prior distribution f(G_{i}0398;) = f(T_{i}, t_{i}0398;) for three species by considering the marginal probability of the tree topology T_{i} and the conditional distribution of the coalescent times given the topology; that is, f(T_{i}, t_{i}0398;) = f(T_{i}0398;) f(t_{i}T_{i}, 0398;). This strategy is not workable for larger species trees because of the increased number of tree topologies and the high dimension of the integral in deriving f(T_{i}0398;). Here we derive the joint distribution f(T_{i}, t_{i}0398;) directly.
Note that two sequences from different species can coalesce only in populations that are ancestral to the two species. For example, sequences H_{1} and G can coalesce in populations HCG or HCGO, but not in populations H or HC. The coalescent processes in different populations are independent. For each population, we trace the genealogy backward in time, until the end of the population at time τ, and record the number of lineages (m) entering the population and the number of lineages leaving it (n). For example, m = 3, n = 2, and τ=τ_{HC}, for population H (Table 1). Such a coalescent process may be termed a censored coalescent process since the process is terminated before it is complete. When n > 1, the genealogical tree in the population consists of n disconnected subtrees or lineages.
Within each population, we measure time in units of 2N generations and further multiply time by the mutation rate. With this scaling, coalescent times are measured by the expected number of mutations per site, and any two lineages in the sample coalesce at the rate θ/2 (Hudson 1990). The waiting time t_{j} until the next coalescent event, which reduces the number of lineages from j to j  1, has the exponential density
Multiplying those probabilities together, we obtain the joint probability distribution of the gene tree topology in the population and its coalescent times t_{m}, t_{m}_{1},..., t_{n}_{+1} as
APPLICATION TO HOMINOID DATA
Data: We apply the new method to the following data, all composed of noncoding regions. Noncoding regions are preferable for this kind of analysis as they are likely to be evolving neutrally, less affected by background selection than, for example, silent sites in coding regions.
Chen and Li (2001) sequenced one individual from each of the four species, human, chimpanzee, gorilla, and orangutan, at 53 independent noncoding loci (contigs), with ∼500 bp at each locus. Chen and Li’s analysis using the treemismatch method estimated the population size for the common ancestor of humans and chimpanzees to be from 52,000 to 150,000. Maximumlikelihood (ML) analysis of the same loci using only the HCG sequences suggested smaller estimates of ∼12,00021,000 (Yang 2002). Here we use the data from all four species.
Yu et al. (2001) sequenced ∼10 kb at the region 1q24 from 61 humans, one chimpanzee, one gorilla, and one orangutan. This region was intended to be noncoding but was discovered to contain four exons (of 115, 155, 138, and 151 nucleotides long), which are removed before analysis.
Makova et al. (2001) sequenced a region of ∼6.6 kb at 16q24.3, located upstream from the melanocortin 1 receptor gene and containing its promoter, from 54 humans, one chimpanzee, one gorilla, and one orangutan. The orangutan sequence was incomplete and unavailable from GenBank. Only the human, chimpanzee, and gorilla sequences are used.
Zhao et al. (2000) sequenced ∼10 kb in the region 22q11.2 from 64 humans, one chimpanzee, and one orangutan. One human sequence (AF291608) appears to be corrupted, so only 63 human sequences are used.
In all data sets, most alignment gaps occur at the ends of the sequence and probably represent undetermined nucleotides. The three large loci (Zhaoet al. 2000; Makovaet al. 2001; Yuet al. 2001) involve many ambiguity nucleotides. These are included in the likelihood calculation (Yang 2000).
Two analyses are performed. The first estimates the ancestral population size and speciation time parameters θ_{HC}, θ_{HCG}, θ_{HCGO}, τ_{HCGO}, τ_{HCG}, and τ_{HC}, initially using the data set of Chen and Li (2001) at 53 loci and then including the data at the three other loci as well, in which case an additional parameter θ_{H} is also estimated. The results are presented in Table 2 under the headings “53 loci” and “56 loci,” respectively. The second analysis uses only the human sequences at the three loci (Zhaoet al. 2000; Makovaet al. 2001; Yuet al. 2001) to estimate θ_{H} and t_{MRCA}, the time to the most recent common ancestor in the sample. The results are presented in Table 3.
Estimation of ancestral population sizes and speciation times: The gamma priors for the ancestral population size and speciation time parameters are specified on the basis of our expectations about those parameters (Table 2). For easy comparison, the same priors are used for parameters θ_{HC}, θ_{HCG}, (τ_{HCG} τ_{HC}), and τ_{HC} as in Yang (2002), although parameters θ_{HCGO} and (τ_{HCGO}  τ_{HCG}) are new. The gamma parameter α is chosen to be >1, so that the distribution peaks at a positive value instead of 0. The prior for each θ has the mean 0.001 and 95% of the density is in the interval (0.00012, 0.00279). We assume a generation time of g = 20 years and a neutral mutation rate of 10^{9} mutations/site/year. Thus the population sizes have a prior mean of 12,500 with the 95% interval (1500, 34,800). The mean speciation times in the prior are 5 million years (MY) before present, 6.6 MY, and 14 MY for the HC, HCG, and HCGO divergences, respectively.
We use 10,000 iterations as the burnin and then take 1,000,000 samples, sampling every two iterations. The results are presented in Tables 2 and 3. The posterior distribution for θ_{HC} from the 53loci data (Chen and Li 2001) indicates a population size of 24,600 with the 95% credibility interval (CI) of (9600, 46,800) for the HC ancestor. These are larger than the Bayes estimates obtained from the HCG sequences only, which were 13,100 with the 95% CI (1700, 32,100; Yang 2002; Table 3). The size for population HCG has posterior mean 42,700 with the 95% CI (27,000, 60,900), which are also larger than those from the HCG sequences only (Yang 2002). The HC divergence time is calculated to be ∼4.8 MY with the 95% CI (4.1, 5.5). Those estimates seem too young, as current opinion appears to favor a date as old as 7 MY (Brunetet al. 2002). The gap between the HC and HCG divergences is estimated to be ∼1.2 MY with the 95% CI (0.47, 2.11), smaller than the estimates from the HCG sequences only (Yang 2002). We also analyzed the HC sequences only from the Chen and Li data and obtained estimates of θ_{HC} and τ_{HC} that are almost identical to estimates from the HCG sequences (results not shown). In sum, inclusion of the orangutan has led to more recent estimates of speciation times and to large estimates of ancestral population sizes. The reasons for the differences are unclear, but they are not due to exclusion of alignment gaps in the analysis of Yang (2002).
We note strong negative correlations in the posterior density between τ’s and θ’s, especially between τ and θ for the population representing the root of the species tree (Table 4). For example, the correlation between τ_{HCGO} and θ_{HCGO} is 0.75. The joint posterior density for θ_{HCGO} and (τ_{HCGO} τ_{HCG}) is shown in Figure 2, after kernel density smoothing (Silverman 1986).
Including the three large loci of Yu et al. (2001), Makova et al. (2001), and Zhao et al. (2000) leads to even younger estimates of speciation times and larger estimates of ancestral population sizes (Table 2; column labeled 56 loci). The HC divergence time, estimated at 4.3 MY with a 95% CI (3.7, 5.0), appears too recent. The strong correlations between parameters θ and τ in the posterior distribution suggest that estimation of θ’s is affected by uncertainties in the τ’s. To alleviate such effects, we used a highly informative prior for τ_{HC} with α= 120, β= 20,000, corresponding to a prior mean of 6 MY for the HC divergence with the 95% prior interval (5.0 MY, 7.1 MY). The 53loci data then give the posterior mean 5.3 MY with the CI (4.7 MY, 5.9 MY) for the HC divergence. The posterior mean and the 95% CI for θ_{HC} are 0.0015 (0.0006, 0.0029), which correspond to an HC population size of 19,000 with the CI (7600, 36,600), 0.0034 (0.0022, 0.0048) for θ_{HCG}, and 0.0019 (0.0002, 0.0044) for θ_{HCGO}. The posterior means and 95% CIs for speciation times are 0.0079 (0.0062, 0.0094) for τ_{HCGO} τ_{HCG} and 0.0010 (0.0004, 0.0017) for τ_{HCG} τ_{HC}. Those estimates appear more reasonable.
Application of this informative prior for τ_{HC} to the 56loci data had a similar effect of reducing θ_{HC}. The posterior mean and 95% CI for θ_{HC} are 0.00201 (0.00088, 0.00352), which correspond to a mean N_{HC} of 25,000 with the CI (11,000, 44,000). Estimates for τ_{HC} are 0.00481 (0.00430, 0.00535). Estimates for θ_{H} are 0.00055 (0.00039, 0.00075), identical to those of Table 2. Estimates of other parameters are 0.00379 (0.00258, 0.00529) for θ_{HCG} and 0.00240 (0.00042, 0.00453) for θ_{HCGO}. The posterior means and 95% CIs for speciation times are 0.00757 (0.00629, 0.00887) for τ_{HCGO} τ_{HCG} and 0.00109 (0.00048, 0.00177) for τ_{HCG} τ_{HC}.
Estimation of human population size θ_{H} and the t_{MRCA}: The human sequences at the three large loci (Zhaoet al. 2000; Makovaet al. 2001; Yuet al. 2001) are analyzed separately and then combined. The only parameter to be estimated is θ_{H}, and the same gamma prior G(2, 2000) is used as in Table 2, which corresponds to a prior mean of N_{H} = 12,500 with the 95% interval (1500, 34,800). The results are shown in Table 3.
For locus 1q24 (Yuet al. 2001), the Bayes analysis suggests a posterior mean of 0.00035 with the 95% CI (0.00017, 0.00062) for θ_{H}. With the generation time g = 20 years and mutation rate μ= 10^{9} substitutions/site/ year, those estimates correspond to an average longterm human population size of only N_{H} = 4400 with the 95% CI (2100, 7700). Yu et al. (2001) suggested a lower mutation rate for the locus at μ= 0.74 × 10^{9} substitutions/site/year. Use of this rate gives the posterior mean 5900 with the CI (2800, 10,500). The t_{MRCA} has the posterior mean 0.31 MY with the CI (0.15, 0.55) if the mutation rate is μ= 10^{9} or 0.42 MY with the CI (0.20, 0.74) if the mutation rate is μ= 0.74 × 10^{9}. Yu et al. (2001) estimated θ_{H} using Watterson’s method based on the number of segregating sites (Watterson 1975), Tajima’s method (Tajima 1983), and Fu and Li’s BLUE method (Fu 1994), either with or without the singletons removed. The estimates varied considerably among methods and are all much larger than the Bayes estimates obtained here. The estimate suggested by the authors was θ= 6.7/8991 = 0.00074, twice as large as the Bayes mean and outside the 95% CI. With μ = 0.74 × 10^{9} used, the population size was estimated to be N_{H} = 12,600 (Yuet al. 2001). Similarly Yu et al.’s analysis estimated t_{MRCA} of the human sample to be ∼1.5 MY, more than three times older than the Bayes estimates.
For locus 16q24.3 (Makovaet al. 2001), the Bayes analysis suggests a posterior mean of N_{H} = 8800 with the 95% CI (5500, 15,000) if g = 20 years and μ= 10^{9}. Makova et al. (2001) estimated a mutation rate of μ= 1.65 × 10^{9} for this locus. Use of this rate gives the posterior mean 5300 with the CI (3300, 9100) for N_{H}. The t_{MRCA} has the posterior mean 0.77 MY with the CI (0.44, 1.2) if the mutation rate is μ= 10^{9} or 0.47 MY with the CI (0.27, 0.73) if the mutation rate is μ= 1.65 × 10^{9}. Makova et al.’s (2001) estimates are θ= 13.7/6545 = 0.00209, which is twice as large as the Bayes mean, and N_{H} = 10,000 for the human population size and t_{MRCA} = ∼1.5 MY.
For locus 22q11.2 (Zhaoet al. 2000), the Bayes analysis suggests a posterior mean of N_{H} = 8100 with the 95% CI (4700, 13,000), if g = 20 years and μ= 10^{9}. The t_{MRCA} has the posterior mean 0.51 MY with the CI (0.30, 0.79). Zhao et al.’s (2000) estimates of θ varied among methods and were all larger than the Bayes estimates. The population size N_{H} was estimated to be ∼10,00015,000, which is comparable with the Bayes estimate. However, t_{MRCA} was estimated to be ∼1.3 MY, with a confidence interval of (0.71, 2.1), much larger than the Bayes estimates.
In sum, the Bayes estimates of θ_{H} and t_{MRCA} are considerably smaller than the estimates of Yu et al. (2001), Makova et al. (2001), and Zhao et al. (2000). As found by those authors, the estimation methods have a great impact. In all three data sets, there is an excess of rare mutants, such as singletons and doubletons (Zhaoet al. 2000; Makovaet al. 2001; Yuet al. 2001), which explains why estimates obtained using Watterson’s method are often a few times larger than other estimates; this is the pattern expected for a recent population expansion. The exact reasons for the large differences are not entirely clear. One possible reason is the many ambiguity nucleotides in the human sequence data, which are properly dealt with in the Bayes and likelihood calculations but are typically removed in heuristic methods.
The three human loci are then combined in a Bayes analysis, with a single θ_{H} estimated (Table 3). The posterior mean θ_{H} = 0.00056 with the 95% CI of (0.00040, 0.00076), corresponding to a population size of N_{H} = 7000 with the 95% CI of (5000, 9500), is an average across the three loci. However, the 95% CI is much narrower than those at individual loci, indicating the increased accuracy in the estimate in the combined analysis. The posterior means and CIs of the t_{MRCA} are similar to estimates from the separate analyses of the three loci (Table 3).
DISCUSSION
Validation of the MCMC algorithm and convergence monitoring: While MCMC provides a powerful framework for fitting sophisticated multiparameter models to heterogeneous data sets from multiple loci, MCMC algorithms are notoriously difficult to validate. MCMC implementations are notably more difficult to debug than maximumlikelihood programs. For example, in most numerical optimization algorithms for maximumlikelihood estimation, the likelihood always increases. However, an MCMC algorithm is stochastic and we cannot expect any summary statistics to increase or decrease monotonically. Second, a likelihood program converges to a fixed point (or points in the presence of multiple local optima). In contrast, convergence of an MCMC algorithm is to a distribution.
We used several strategies to validate the theory and implementation. For small data sets with only 2 or 3 species, quantities such as the probability of a particular gene tree topology and the expectations of coalescent times in the gene tree were calculated by both MCMC simulation and numerical integration using Mathematica. For larger species trees (with, say, 10 species), the MCMC algorithm was run without data [that is, by fixing f(DG) = 1], and the resulting posterior distributions of parameters (0398;) were compared with the prior gamma distributions. In addition, a simulation program was written to generate testing data and to simulate the prior distribution of gene trees and coalescent times to validate the theory and the implementation.
For the purpose of convergence monitoring, we found it useful to run multiple chains and to monitor the values of parameters and the log likelihood over iterations (e.g., Gelman and Rubin 1992). For the data analyzed in this study, our algorithm appeared to be fast to converge, even if poor starting points were used, but could be slow in mixing due to correlation between parameters (Table 4, Figure 2). A relatively short burnin of 2000 or 5000 iterations appeared sufficient to bring all parameters to a reasonable range, with each iteration consisting of the five steps described in the appendix. After the burnin, 10,000 samples, sampling every 2 iterations, produced stable estimates of posterior distributions. Results reported in Tables 2, 3, 4 were obtained from much longer runs.
Sampling strategies and accuracy of parameter estimation: A small simulation study was conducted to evaluate the information content in the data sets. Data are simulated using the species tree (((HC)G)O) with the parameter values θ_{HC} = 0.001, θ_{HCG} = 0.001, θ_{HCGO} = 0.001, τ_{HCGO} = 0.014, τ_{HCG} = 0.0066, and τ_{HC} = 0.005. The prior of Table 2 is used in the Bayes analysis. The fact that the prior means are equal to the true values of parameters suggests that the results are besttime results. Clearly the method will perform well if we have long sequences to reduce sampling errors in the gene tree and branch lengths (coalescent times) at each locus and also many loci to average over stochastic variations in the coalescent process among loci. However, given the total combined sequence length, it is not obvious whether it is better to have a few long sequences or many short ones. Thus we simulated a few cases in which the number of loci, L, and the number of nucleotides in the sequence at each locus, C, vary but the total sequence length from each species is fixed at L × C = 100,000 (Table 5). The average posterior means of parameters (not shown) were found to be close to the true values of parameters in all the cases considered. We examined the size of the 95% credibility interval as an indication of the information content in the data. Surprisingly the results suggest very little difference among the different strategies (Table 5). Comparison of the posterior CIs with the prior intervals suggests that speciation times (τ’s) are, in general, well estimated, with their CIs reduced by three to seven times. Parameters θ are estimated less well, with about twice the reduction in the CI for θ_{HC} and θ_{HCG}, while θ_{HCGO} is the most poorly estimated parameter.
Population size of the humanchimpanzee common ancestor: The humanchimpanzee ancestral population size, N_{HC} or θ_{HC}, has been of much recent interest (Takahata and Satta 2002; Wall 2003). As the size of modern humans has been consistently estimated to be ∼10,000 (e.g., Takahataet al. 1995; Ruvolo 1997; Zhaoet al. 2000; Makovaet al. 2001; Yuet al. 2001; Takahata and Satta 2002), which is surprisingly small given the widespread distributions of humans in the past 12 million years (Wall 2003), reliable estimation of the humanchimpanzee ancestral population size is essential to understand whether there has been a dramatic size reduction during human evolution (Hacia 2001; Kaessmannet al. 2001). Our likelihood and Bayes analyses of the data of Chen and Li (2001) produced estimates that are a few times smaller than estimates obtained from the same data using the treemismatch method (Yang 2002 and this study), demonstrating the importance of the estimation procedure.
However, the reliability of our estimates is affected both by the assumptions made in our model and by the quality of the data. We assumed that the evolutionary rate is the same both among sites within each locus and among different loci. Withinlocus rate variation is not expected to be important because its effect is mainly on correction for multiple hits and because the sequences used are highly similar. This assumption can be relaxed, although at greater computational cost. Rate variation among loci should have a greater effect on estimation of ancestral population sizes (Yang 1997). It is straightforward to incorporate variable evolutionary rates across loci in the MCMC algorithm. However, the effect is less important when multiplespecies data are analyzed simultaneously (Yang 2002). Another assumption we made is the absence of recombination within a locus. Takahata and Satta (2002) and Wall (2003) pointed out that recombination reduces variation among loci and leads to underestimates of N_{HC}. Furthermore, the data of Chen and Li (2001) may be somewhat atypical, since analyses of other genomic regions using the maximumlikelihood method of Takahata et al. (1995) have typically produced much larger estimates of N_{HC} (e.g., Takahataet al. 1995; Takahata and Satta 2002). We note that the information content in the data analyzed here is low, as indicated by the sensitivity of our Bayes estimates to the prior, and that reliable estimates are possible only with accumulation of more genomic data and with implementation of more realistic models. We stress that our likelihoodbased MCMC method makes an efficient use of the information in the data and provides a powerful framework for combining heterogeneous data sets from multiple loci.
Program availability: The program MCMCcoal, for Bayes MCMC analysis under coalescent models, is available at http://abacus.gene.ucl.ac.uk/software/MCMCcoal/. This can also be used to simulate sequence data sets.
APPENDIX: PROPOSAL STEPS IN THE MCMC
The MCMC algorithm implemented in this article involves several proposal steps, each of which updates some parameters in the Markov chain. The main problem is to combat the constraints posed by the speciation times while updating coalescent times in the gene tree and vice versa. The algorithm is tedious. The details follow.
Step 1. Updating the coalescent times at internal nodes in a gene tree: This step cycles through all loci and, for each locus, through all internal nodes in the gene tree to propose changes to the node ages (coalescent times). The age of only one internal node is changed at a time, and the gene tree topology remains intact. First the lower and upper bounds for the new age are determined by examining the current values of speciation dates and the ages of the mother and daughter nodes in the gene tree. A sliding window is then used to propose the new age; that is,
Step 2. Subtree pruning and regrafting in the gene tree: This step cycles through nodes in each gene tree (except the root), removes the subtree represented by the node (which includes the node and all its descendent nodes), and then reattaches it to the remaining gene tree (Figure A1A). The step changes the gene tree topology and the coalescent time for the mother node. The feasible range for the new age of the mother node is determined on the basis of the current values of the speciation times and the age of concerned node in the gene tree. Then a random age is chosen by using a sliding window around the current age,
We also implemented a version of this proposal in which only the tips are pruned and regrafted. This was found to be effective for small gene trees, such as in the data of Chen and Li (2001), but is inefficient for large gene trees.
Step 3. Updating population size parameters: This step updates the population size parameters (θ’s) one by one. A sliding window is used to propose a new value; that is,
Step 4. Updating speciation times in the species tree: This step cycles through the speciation times, that is, ages at internal nodes of the species tree. The age τ at any internal node is bounded upward by the speciation time of its mother node and downward by the ages of its daughter nodes. Let the interval be (τ_{L}, τ_{U}). A sliding window is used to propose a new age,
Our calculation of the new ages for the affected nodes in the gene tree mimics the movements of marks (nodes in the gene tree) on a rubber band when its two ends are fixed (at τ_{L} and τ_{U}) and when the rubber is held at a fixed point (τ) and pulled slightly to one end (Figure A1B). The marks will move relative to the two ends when the rubber expands on one side of the holding point and shrinks on the other. If the node age in the gene tree t >τ, the new age is given by (τ_{U}  t^{*})/(τ_{U}  t) = (τ_{U} τ^{*})/(τ_{U} τ); that is,
An example is shown in Figure A1B, where τ_{HC} of Figure 1 is updated. The range is τ_{L} = 0 and τ_{U} =τ_{HCG}. The ages of nodes a, b, c, and d in the gene tree should be changed as well to maintain compatibility between the gene tree and proposed species tree. Nodes a and b, which are younger than τ_{HC}, are repositioned relative to the lower bound τ_{L} according to Equation A8, while nodes c and d, which are older than τ_{HC}, are repositioned relative to the upper bound τ_{U} according to Equation A7.
Suppose m node ages are changed relative to the upper bound τ_{U} (using Equation A7) and n node ages are changed relative to the lower bound τ_{L} (using Equation A8) across all loci. To derive the proposal ratio, we apply the following transform: y_{0} =τ, y_{j} = (τ_{U}  t_{j})/(τ_{U} τ) for each of the m nodes with t_{j} >τ, and y_{k} = (t_{k} τ_{L})/(τ  τ_{L}) for each of the n nodes with t_{k} <τ. Note that only y_{0} is changed while all other m + n variables y_{j} and y_{k} remain the same in the proposal. The proposal ratio in the transformed variables is 1. The proposal ratio in the original variables is easily derived as ((τ_{U} τ^{*})/(τ_{U} τ))^{m}((τ^{*} τ_{L})/(τ  τ_{L}))^{n}. The acceptance ratio is thus
Step 5. Mixing step: A mixing step is found to be effective in speeding up convergence, especially from a poor starting point. The gene tree topologies remain unchanged, but all parameters in the model (θ’s and τ’s) and node ages (coalescent times) in each gene tree are multiplied by a constant
To overcome the strong correlation between parameters θ and τ (Table 4; see also Yang 2002), the mixing step is modified at an early stage of the MCMC run, say, when 10% of the samples have been taken, making use of the correlation coefficients between parameters calculated during the MCMC up to that point. For each parameter θ, its strongest correlation with parameters τ is found. If that correlation is <0.2, θ is not changed. Otherwise θ is multiplied by c if the correlation is positive or divided by c if the correlation is negative. Thus with the correlation coefficients of Table 4, all the θ parameters are divided by c. The proposal ratio for the modified algorithm is c^{m}^{}^{n}, where m is the total number of parameters multiplied by c and n is the total number of parameters divided by c.
Acknowledgments
This work was funded by a Biotechnology and Biological Sciences Research Council grant (31/G13580) to Z.Y. and grants from the National Institutes of Health (HG01988), the Canadian Institute of Health Research (MOP44064), the Peter Lougheed Foundation, and the Alberta Heritage Foundation for Medical Research to B.R.
Footnotes

Communicating editor: N. Takahata
 Received December 4, 2002.
 Accepted April 18, 2003.
 Copyright © 2003 by the Genetics Society of America