Abstract
Polymorphisms in an ancestral population can cause conflicts between gene trees and the species tree. Such conflicts can be used to estimate ancestral population sizes when data from multiple loci are available. In this article I extend previous work for estimating ancestral population sizes to analyze sequence data from three species under a finitesite nucleotide substitution model. Both maximumlikelihood (ML) and Bayes methods are implemented for joint estimation of the two speciation dates and the two population size parameters. Both methods account for uncertainties in the gene tree due to few informative sites at each locus and make an efficient use of information in the data. The Bayes algorithm using Markov chain Monte Carlo (MCMC) enjoys a computational advantage over ML and also provides a framework for incorporating prior information about the parameters. The methods are applied to a data set of 53 nuclear noncoding contigs from human, chimpanzee, and gorilla published by Chen and Li. Estimates of the effective population size for the common ancestor of humans and chimpanzees by both ML and Bayes methods are ∼12,00021,000, comparable to estimates for modern humans, and do not support the notion of a dramatic size reduction in early human populations. Estimates published previously from the same data are several times larger and appear to be biased due to methodological deficiency. The divergence between humans and chimpanzees is dated at ∼5.2 million years ago and the gorilla divergence 1.11.7 million years earlier. The analysis suggests that typical data sets contain useful information about the ancestral population sizes and that it is advantageous to analyze data of several species simultaneously.
THE amount of sequence polymorphism in a population is mainly determined by the parameter θ= 4Nμ, where N is the (effective) population size and μ is the mutation rate per nucleotide site per generation. In addition to its effect on the amount of neutral variation maintained in a population, the population size is also important in affecting the efficiency of natural selection and is a critical parameter in population genetics models of mutation and selection. It is important to competing theories of origin and evolution of modern humans. When an estimate of the mutation rate is available, as is assumed here, the population size can be obtained from estimates of θ. The population size of a presentday species can be estimated by using observed DNA sequence variation in the population (e.g., Kreitman 1983; Fu 1994; Takahataet al. 1995; Yang 1997a). The population size of modern humans has been estimated to be ∼10,000 (e.g., Takahataet al. 1995; Zhaoet al. 2000; Yuet al. 2001).
The population size of an extinct ancestral species, such as the common ancestor of humans and chimpanzees, is harder to estimate, but a few methods have been developed (see Edwards and Beerli 2000 and Sattaet al. 2000 for extensive reviews). They require data from multiple loci and make use of the information in the conflict of gene trees among loci. For example, Takahata (1986) suggested a method for estimating the population size of the common ancestor of two closely related species when a pair of homologous DNA sequences are available from the two species at each locus. The average divergence at many loci provides information on the species divergence time, while the variation of sequence divergence among loci reflects the ancestral population size. The method of Takahata (1986) uses summary statistics from the data and was later extended to a fulllikelihood analysis and to the case of three species, where the population sizes of the two extinct ancestors as well as the two speciation dates were estimated (Takahataet al. 1995).
In the case of three species, another simpler method has also been used (Nei 1987; Wu 1991; Hudson 1992). This approach uses the proportion of loci at which the gene tree does not match the species tree to estimate the ancestral population size and exploits the fact that ancestral polymorphism creates conflicts between the gene tree and the species tree. I refer to it as the treemismatch method. Its application to human and great ape sequence data has led to estimates of the population size for the ancestor of humans and chimpanzees that are much larger than estimated population sizes for modern humans (Ruvolo 1997; Chen and Li 2001). For example, Chen and Li (2001) sequenced 53 noncoding contigs from human, common chimpanzee, gorilla, and orangutan, and estimated the population size of the common ancestor of humans and chimpanzees to range from 52,000 to 150,000, depending on the generation time used (15 or 20 years) and on whether parsimony or neighbor joining was used to infer the gene trees.
The treemismatch approach has room for improvement. First, the conflicts in the estimated gene trees among loci can be due to both ancestral polymorphism and sampling errors in tree reconstruction. As the sequences are highly similar, with few variable or informative sites at each locus, the reconstructed gene tree, no matter what method was used to infer it, is unreliable. Ignoring the sampling error in the estimated gene tree leads to overestimation of the conflict among loci and overestimation of the ancestral population size (see below). Second, the branch lengths in the gene tree provide, in a probabilistic sense, information about the ancestral population parameters, but are ignored by the method. The method clearly makes poor use of the information in the data. Third, the method assumes that one knows the species divergence times, while they might not be known. Indeed, some authors argue for the importance of accounting for ancestral polymorphism when estimating speciation dates (Takahataet al. 1995).
It seems advantageous to use information in both the conflicting gene genealogies as well as the branch lengths while accounting for their uncertainties. This can be achieved by using the likelihood method under a coalescent model developed by Takahata et al. (1995). However, the infinitesites model assumed in the method is sometimes violated by the data. In this article, I extend the method of Takahata et al. for estimating ancestral population sizes, using data from three species. I use the nucleotide substitution model of Jukes and Cantor (1969) to correct for multiple substitutions at the same site. The model is also implemented in a Bayes framework, with Markov chain Monte Carlo (MCMC) used to achieve efficient calculation. The methods are applied to the data of Chen and Li (2001).
MAXIMUMLIKELIHOOD ESTIMATION
The species phylogeny is represented ((12)3) and is assumed known. Species 1 and 2 diverged τ_{1} generations ago while species 3 diverged τ_{0} generations earlier (Figure 1a). In this article, species 1, 2, and 3 represent human (H), chimpanzee (C), and gorilla (G), respectively. The effective population size of the ancestor of species 1 and 2 is N_{1}, and that of the common ancestor of all three species is N_{0}. The mutation rate μ is measured as the number of nucleotide substitutions per site per generation. As rate and time are confounded in such analysis, the parameters of the model are θ_{0} = 4N_{0}μ, θ_{1} = 4N_{1}μ, γ_{0} =τ_{0}μ, and γ_{1} =τ_{1}μ (Figure 1a). Collectively they are denoted 0398;= {θ_{0}, θ_{1}, γ_{0}, γ_{1}}. The data contain DNA sequences from multiple loci, with one individual sequenced from each of the three species at each locus. It is assumed that there is no recombination within a locus and free recombination between loci. The population is assumed to be random mating, with no gene flow after species divergence.
The likelihood function: Consider the probability distribution of the gene tree and its branch lengths at any locus i. Two cases are possible and are dealt with separately. Case I is represented by Figure 1b, where the gene tree is T_{0} = ((12)3), identical to the species tree, and sequences 1 and 2 coalesce between the two speciation events C and D. The coalescent times t_{0} and t_{1} are defined in Figure 1b. Case I occurs when t_{1} <τ_{0}, with probability
In case I (tree T_{0} in Figure 1b), the coalescent time t_{1} is an exponential random variable truncated at t_{1} < τ_{0}, and t_{0} is an independent random variable with an exponential distribution
In case II (Figure 1, ce), both coalescence events occur in the population ancestral to all three species. The three gene trees T_{1}, T_{2}, and T_{3} have equal probabilities. The coalescent times t_{0} and t_{1}, defined in Figure 1, ce, are independent random variables with exponential distributions
Calculation of the probability, P(D_{i}T_{k}, b_{i}_{0}, b_{i}_{1}), of observing data D_{i} at locus i, given the gene tree T_{i} = T_{k} (k = 1, 2, 3) and its branch lengths b_{i}_{0} and b_{i}_{1}, will be described in the next section. The probability of observing the data at locus i for case II is
Averaging over cases I and II or over the four gene trees T_{0}, T_{1}, T_{2}, T_{3} in Figure 1, we obtain the marginal probability of observing data D_{i} at locus i as
The log likelihood is a sum over all the L loci,
Probability of data at a locus given the gene tree and branch lengths: Given the gene tree T_{i}, which is one of T_{0}, T_{1}, T_{2}, or T_{3} in Figure 1b, and its branch lengths (b_{i}_{0} and b_{i}_{1}), the probability of observing the data, P(D_{i}T_{i}, b_{i}_{0}, b_{i}_{1}), in Equation 8 can be calculated under any model of nucleotide substitution (Lio and Goldman 1998). The model of Jukes and Cantor (1969) is used in this article, which seems sufficient for such highly similar sequences. Under this model, the sequence data can be summarized as counts of five possible site patterns (configurations): xxx, xxy, yxx, xyx, and xyz, where x, y, and z are any three different nucleotides. The data at locus i can thus be represented by the number of sites and Li (2001) is listed in Table 1.
For gene trees T_{0} or T_{1}, the probabilities of observing the five site patterns are
Mutation rate variation among loci: An important factor that may influence the estimation of ancestral population sizes is the variation of mutation rates among loci (Yang 1997a; Chen and Li 2001). For example, estimation of ancestral population size from comparison between two species was found to be sensitive to even slight rate variation (Yang 1997a). If relative rates for the loci are available, it will be straightforward to incorporate them in the likelihood calculation (Yang 1997a). In this article, I use the average distance from the orangutan to the three African apes to calculate the relative rate for the locus (Table 1). This ad hoc approach appears sensible since the orangutan diverged from the African apes very early and ancestral polymorphism in their common ancestor does not seem important. The likelihood calculation proceeds as before except that the branch lengths for the gene tree at each locus are multiplied by the relative rate for that locus.
Application to the data of Chen and Li: Chen and Li (2001) sequenced one individual from each of the four species, human, chimpanzee, gorilla, and orangutan, at 53 independent noncoding loci (contigs). An advantage of the data set is that the sequences are expected to be outside and far away from coding regions and not affected by selection at linked sites or loci. The model of this article assumes the molecular clock and uses only three species. The counts of site patterns are listed in Table 1. At some loci, the total number of sites used in this article is larger than that in Chen and Li (2001; Table 1), because some sites had alignment gaps in the orangutan and were removed by Chen and Li.
The threespecies data are analyzed by the maximumlikelihood (ML) method of this article. The estimates of parameters are given in Table 2. If we assume a generation time of 20 years and a mutation rate of 10^{9} substitutions per site per year (e.g., Nachman and Crowell 2000), the estimates suggest a population size for the ancestor of humans and chimpanzees of ∼12,000. This is several times smaller than the estimates of Chen and Li (2001) from the same data, at a minimum of 52,000. The estimate is also similar to estimates of the population size of modern humans, for example, 12,000 by Yu et al. (2001). The population size for the common ancestor of all three species is estimated to be ∼38,000. The same analysis estimated the humanchimpanzee divergence time at 5.1 million years ago (MYA) and the gorilla divergence at ∼1.1 million years (MY) earlier. Those estimates are largely consistent with those of previous studies (e.g., Hasegawaet al. 1985; Ruvolo 1997; Kumar and Hedges 1998; Yoder and Yang 2000). Figure 2a shows the likelihood surface as a function of θ_{0} and θ_{1} when γ_{0} and γ_{1} are fixed at their maximumlikelihood estimates (MLEs). The ∼95% confidence region is given by the likelihood contour at 3.32 units below the optimum, that is, at 3102.73 (Figure 2a). The sampling errors are quite large. Analysis of the human and chimpanzee sequences at the 53 loci using the ML method of Takahata et al. (1995) under the infinitesites model leads to
To examine the effect of mutation rate variation among loci, I calculated the average sequence distance under the model of Jukes and Cantor (1969) from the human, chimpanzee, and gorilla to the orangutan, that is, d_{HCGO} = (d_{HO} + d_{CO} + d_{GO})/3. This is divided by the average across all loci to give the relative rate for that locus (Table 1). The average distance from the orangutan to the African apes is found to be 0.0296; this is consistent with a mutation rate of 10^{9} substitutions per site per year and an orangutan divergence date of ∼13 MYA (e.g., Hasegawaet al. 1987) since 0.0296/(2 × 13 × 10^{6}) = 1.1387 × 10^{9}. The relative rates for loci calculated this way are used as fixed constants in the likelihood calculation for the data of three species. The MLEs of parameters are shown in Table 2. Using the same generation time and mutation rate as above, we get the estimate of the population size for the ancestor of humans and chimpanzees to be 21,000, larger than the estimate under the assumption of a constant rate for all loci. The population size for the ancestor of all three species is estimated to be 29,000, smaller than under the constantrate model. The differences between the two analyses are somewhat surprising, as one might expect the population sizes to be smaller when rate variation among loci is accounted for. However, it is noted that the distance from orangutan to the African apes has only a weak correlation (0.44) with the average distance within the HCG group, which appears to suggest that the mutation rates are rather homogeneous among loci and that the conflict among loci in sequence divergence is mainly caused by ancestral polymorphism.
If the average distances within the HCG group, (d_{HC} + d_{CG} + d_{GH})/3, are used as relative rates for the loci, parameter estimates become
Comparison with the treemismatch method: When the gene tree is T_{2} or T_{3} (Figure 1), there is a mismatch between the species tree and the gene tree. This occurs with probability P_{SG} = f(T_{2}) + f(T_{3}) = 2(1 Ψ)/3 = ⅔e^{2γ0/θ1} (e.g., Nei 1987, pp. 288289). The treemismatch method estimates θ_{1} by equating this probability to the proportion (pˆ_{)} of loci at which the estimated gene tree differs from the species tree, with γ_{0} being assumed known; that is,
To understand the difference between the treemismatch method and ML, note that three aspects of the data are ignored by the treemismatch method and accounted for by ML: (i) uncertainty in the estimated gene tree due to the finite number of nucleotide sites at the locus, (ii) unresolved loci (ties), and (iii) branch lengths in the gene tree reflected in the sequence divergences. While all three probably contribute to the large differences discussed above, uncertainty in the estimated gene tree seems to be the predominant reason. A more “proper” treemismatch method should equate the observed proportion of mismatches (pˆ_{)} not to P_{SG} but to P_{SE}, the probability of a mismatch between the species tree and the estimated gene tree. This probability is given by
Figure 3 shows the probabilities that the species tree (S), the gene tree (G), and the estimated gene tree (E) differ from each other. The probability of a mismatch between the species tree and the gene tree is P_{SG} = 2(1 Ψ)/3 = 0.0739, much lower than the observed mismatch proportion pˆ = 0.367. The probability of a mismatch between the species tree and the estimated gene tree is higher. With 466 sites (the average across the 53 loci, Table 1), P_{SE} = 0.2028, which is 2.7 times as high as P_{SG} (Figure 3). Now consider the four gene trees T_{0}, T_{1}, T_{2}, and T_{3} (Figure 1), which occur with probabilities f(T_{0}) =Ψ = 0.8892 and f(T_{1}) = f(T_{2}) = f(T_{3}) = (1 Ψ)/3 = 0.0369 (Equation 1). According to the simulation, the probability that the topology of the gene tree is incorrectly reconstructed when the true gene tree is T_{0}, T_{1}, T_{2}, or T_{3} is P_{0} = 0.1453 and P_{1} = P_{2} = P_{3} = 0.1981. Note that for the estimated gene tree, we consider only the topology and disregard its divergence times relative to the speciation times. The average error probability of gene tree reconstruction is thus
THE BAYES APPROACH USING MCMC
A Bayes approach is implemented under the same model, using MCMC. As parameters 0398;= {θ_{0}, θ_{1}, γ_{0}, γ_{1}} are all positive, I use independent gamma distributions as the prior. The gamma density is
Instead of the coalescent times t_{0} and t_{1}, which have different definitions in different gene trees (Figure 1), branch lengths b_{0} and b_{1} in the gene tree are used in the MCMC. The joint prior distributions of the gene tree T_{0} and its branch lengths b_{0} and b_{1} given 0398; can be easily derived from the distributions of the coalescent times t_{0} and t_{1} (Equation 2),
Similarly, the joint prior distribution of gene tree T_{k}, k = 1, 2, 3, and its branch lengths b_{0} and b_{1} given 0398; is
The variables to be updated in the Markov chain include the parameters 0398;= {θ_{0}, θ_{1}, γ_{0}, γ_{1}} and the gene trees and branch lengths at all L loci, G = {T_{i}, b_{i}_{0}, b_{i}_{1}}, i = 1, 2,..., L. The Markov chain is constructed so that its steadystate distribution is the posterior distribution of those variables. Bayes inference is then based on the joint posterior distribution
A MetropolisHastings algorithm (Metropoliset al. 1953; Hasting 1970) is used to update variables in the MCMC. Given the current state of the chain (0398;, G), a new state (0398;^{*}, G^{*}) is proposed through a proposal distribution q(0398;^{*}, G^{*}0398;, G). The new state is then accepted with probability
The proposal density q can be rather flexible as long as it specifies an aperiodic and irreducible Markov chain. The algorithm I implemented cycles through several steps, with each step updating some but not all variables. In step 1, the gene tree and branch lengths at each locus i (T_{i}, b_{i}_{0}, b_{i}_{1}) are updated, while parameters 0398; are fixed. Each locus is updated once in this step. Step 2 updates parameters 0398; while the branch lengths {b_{i}_{0}, b_{i}_{1}} are fixed. This step can cause changes to the gene trees at some loci. Step 3 is a mixing step, in which parameters θ_{0}, θ_{1}, γ_{0}, γ_{1} and branch lengths at all loci are multiplied by a constant while the gene trees remain unchanged. The MCMC algorithm is tedious and the details are given in the appendix.
The Markov chain is started from a random initial state. Sampling starts after a certain number of generations, which are discarded as burnin, and samples are taken every certain number of steps, thus “thinning” the chain. Convergence of the chain is checked by examining the output and also by running multiple chains. The algorithm is also run without sequence data, and the posterior distribution generated is found to be close to the prior.
Application to the data of Chen and Li: The Bayes MCMC algorithm is applied to the data of Chen and Li (2001; see Table 1). I used two sets of priors (Table 3). Parameters α and β in the gamma prior distributions are chosen by considering likely values and ranges of ancestral population sizes and species divergence times and converting them into parameters θ_{0}, θ_{1}, γ_{0}, and γ_{1} using a generation time of 20 years and a mutation rate of 10^{9} substitutions per site per year. The first set is considered more realistic and referred to as the “good priors” in Table 3. Ancestral population sizes N_{0} and N_{1} are centered ∼12,500, close to estimates for modern humans, with the 2.5 and 97.5% percentiles at 1500 and 34,800, respectively. The divergence time for humans and chimpanzees is centered ∼5 MY, while the divergence time for the gorilla is centered ∼1.6 MY. Note that parameters θ_{0}, θ_{1}, γ_{0}, and γ_{1} are all ⪡1 but are definitely >0; thus values of α>1 are used so that the gamma distribution has a mode >0.
The posterior distributions of parameters θ_{0}, θ_{1}, γ_{0}, and γ_{1} are shown in Figure 4 together with their priors. They are also summarized in Table 3 (good priors). The means of the posterior distributions for θ_{0}, θ_{1}, γ_{0}, and γ_{1} are listed, and then the means and the 95% credibility sets for the two population sizes (N_{0} and N_{1}) and for the two speciation times are listed. The posterior means and medians are close. The population size for the ancestor of humans and chimpanzees is estimated to be 12,400, with the 95% credibility interval (CI) to be (1700, 32,100). The HC divergence is dated at 5.3 MY, with the 95% CI to be (4.4, 6.1). The estimates of θ_{1} and γ_{1} are very similar to the MLEs. The posterior mean of θ_{0} is smaller and that of γ_{0} is larger than the MLEs (Tables 2 and 3). The correlation coefficients calculated from the posterior distributions of parameters θ_{0}, θ_{1}, γ_{0}, and γ_{1} are shown in Table 4. There is strong negative correlation between θ_{0} and γ_{0} and between θ_{1} and γ_{1}. Comparison of the prior and posterior distributions (Figure 4) suggests that the data contain much more information about the divergence times, especially the HC divergence time (γ_{1}), than about the population sizes.
To see the effect of prior assumptions on the posterior distributions, I used a second set of priors, which are more spread out and also assume large population sizes (mean 62,500). The posterior distributions are summarized in Table 3 under the heading “Poor priors.” The point estimates of both N_{0} and N_{1} are ∼33,000, smaller than the prior means. The HC divergence is dated at 4.6 MY, and the gorilla divergence is dated 1.9 MY earlier. Those estimates appear reasonable, although the HC divergence date is too recent. Similar strong correlations between the parameters are discovered as in the analysis using the good priors. The negative correlation between γ_{1} and θ_{1} (calculated to be 0.76), combined with the assumed and estimated large population sizes, appears to have led to a HC divergence date that is too recent.
DISCUSSION
ML and Bayes methods of this article estimated the population size for the common ancestor of humans and chimpanzees to be ∼12,000, similar to estimates for modern humans. The estimates are several times smaller than those obtained by Chen and Li (2001) from the same data using the treemismatch method, which range from 52,000 to 150,000. Even the worstcase estimates—e.g., 36,000 by ML under the assumption that all sequence divergence variation among loci is due to mutation rate variation and 33,000 from the Bayes analysis using the poor priors—are smaller than the minimum estimate of Chen and Li. The treemismatch method used by Chen and Li appears to have serious biases due to errors in gene tree reconstruction, and the likelihood and Bayes estimates reported here are probably more reliable. Thus it may be concluded that the sequence data of Chen and Li (2001) do not support much larger ancestral populations than the modern humans or the notion that early human populations experienced dramatic size reductions (Hacia 2001; Kaessmannet al. 2001).
While the ML and Bayes methods are expected to have better statistical properties than the simple treemismatch method, it is worthwhile to examine some of the assumptions made in this article. First, the evolutionary rate is assumed to be constant over lineages. This assumption seems reasonable as the species compared are very closely related; Chen and Li’s (2001) relativerate tests supported the molecular clock. The large differences between the treemismatch method and the likelihood and Bayes methods are clearly not due to the use of the clock assumption in this article; use of clock rooting in the treemismatch method produced even larger estimates of the population size for the ancestor of humans and chimpanzees. Second, the analysis assumes no recombination within a locus. The effect of recombination on estimation of parameters θ_{0}, θ_{1}, γ_{0}, and γ_{1} is not well understood, although Satta et al. (2000) emphasized its possible significance. As the human, chimpanzee, and gorilla sequences are extremely similar, most of the recombination events will not be visible in the sequence data, and the few sites at which more than two nucleotides are observed in the data (see counts n_{i}_{4} for site pattern xyz in Table 1) are probably due to multiple substitutions at the same site. Third, the substitution model of Jukes and Cantor (1969) is simplistic. More complex models, such as those that account for variable substitution rates among sites within the locus, can be easily implemented, but are expected to have little effect. The most serious issue seems to be mutation rate variation among loci. In the case of two species, the ancestral population size is overestimated when mutation rate variation is ignored and accounting for the bias leads to dramatic reduction in the estimated population size (Yang 1997a). In this article, the population size of the ancestor of humans and chimpanzees is not very large under the constantrate model and becomes larger when variable rates for loci are assumed. The effect is much less important and also in the opposite direction compared with the twospecies case. Lack of strong correlation among sequence distances with the orangutan seems to suggest that the rates are relatively homogeneous among those loci. It seems that simultaneous analysis of data from three species allows the parameters to constrain each other, leading to a better use of information in the data. It is quite likely that the estimation can be further improved by sampling multiple individuals from the same species.
The ML and Bayes methods produced similar results for the data analyzed in this article. However, the ML calculation is slower than the MCMC algorithm. The Bayes approach also provides a framework for incorporating prior information about the parameters. For example, a wealth of information is available about the divergence time between humans and chimpanzees. By forcing a very narrow prior distribution for γ_{1}, such information can be incorporated in the Bayes analysis. Using an informative prior will reduce the adverse effect of strong correlation among parameters when other parameters are estimated. Furthermore, the Bayes algorithm seems easier than ML to extend to data that contain more than three species and more than one individual from each species.
Program availability: C programs implementing the MCMC algorithm and calculating the mismatch probabilities (P_{SG}, P_{SE}, and P_{GE}, etc.) are available from the author upon request. The C and Mathematica programs for the likelihood method are available as well, but they make use of the Mathlink library and are awkward to use.
APPENDIX: IMPLEMENTATION OF THE MCMC ALGORITHM
Updating the gene tree and branch lengths at each locus: This step of the algorithm goes through the L loci to update the branch lengths and possibly the gene tree as well. The algorithm visits each locus once. The description in this section concerns one locus i. The subscript i in b_{i}_{0} and b_{i}_{1} may be suppressed. The proposal algorithm for gene trees T_{2} and T_{3} is simpler and is described first, before the algorithm for gene trees T_{0} and T_{1} (Figure 1).
If the current gene tree is either T_{2} or T_{3}, only the branch lengths are updated and the gene tree is not changed. Note that b_{0} and b_{1} have to satisfy the requirements 0 < b_{0} < ∞ and γ_{0} +γ_{1} < b_{1} < ∞. A sliding window is used to propose new branch lengths,
If the current gene tree is either T_{0} or T_{1} (Figure 1, b and c), a change between those two trees is allowed when the branch lengths are updated. Both the current and the new branch lengths should satisfy the constraints γ_{1} < b_{1} < ∞ and γ_{0} +γ_{1} < b_{0} + b_{1} < ∞. To propose new branch lengths under those constraints, I apply the transformation
New states are proposed for y_{0} and y_{1} as
If the gene tree is T_{1}, T_{2}, or T_{3}, the algorithm may (with a small probability of, say, 0.2 or 0.3) attempt to swap the gene trees. The current gene tree is replaced by one of the other two trees chosen at random. The proposal ratio is 1, and the acceptance ratio is
Updating population size and speciation date parameters: This step makes two proposals: the first to change θ_{0} and θ_{1} and the second to change γ_{0} and γ_{1}. Parameters θ_{0} and θ_{1} are positive but are not constrained otherwise. They are updated simultaneously, with all other variables fixed. New values are proposed around the current values by
Next, speciation date parameters γ_{0} and γ_{1} are updated while θ_{0} and θ_{1} as well as branch lengths b_{i}_{0} and b_{i}_{1} for all loci are fixed. At each locus the following constraints have to be satisfied: 0 <γ_{1} < b_{1} <γ_{0} +γ_{1} < b_{0} + b_{1} < ∞ in gene tree T_{0} and 0 < b_{0} < ∞ and γ_{0} + γ_{1} < b_{1} < ∞ in gene tree T_{1}, T_{2}, or T_{3} (Figure 1). To allow the chain to move more freely, the gene tree is allowed to change, if necessary, from T_{0} to T_{1}, T_{2}, or T_{3} or vice versa. Thus only the following constraints have to be satisfied when new values are proposed for γ_{0} and γ_{1}: 0 <γ_{1} < b_{1} and γ_{0} +γ_{1} < b_{0} + b_{1} at every locus; that is, γ_{1} < min{b_{i}_{1}} = c and γ_{1} <γ_{0} +γ_{1} < min{b_{i}_{0} + b_{i}_{1}} = d. The following transformation is used to propose new states,
Mixing step: A mixing step is found to be effective in speeding up convergence when a poor starting point is chosen for the chain. In this step, the gene trees remain unchanged, but parameters θ_{0}, θ_{1}, γ_{0}, and γ_{1} and branch lengths b_{i}_{0} and b_{i}_{1} for all loci are multiplied by a constant
Performance of the algorithm: The performance of the MCMC algorithm is noted to depend on the choice of priors (values of α and β in the gamma distributions). For some priors, the algorithm is noted not to mix well, and in particular, parameters θ_{0} and γ_{0} appear to change slowly. The high correlation among parameters and the constraints seem to cause difficulties for the algorithm. In such cases, the chain has to be run much longer than usual to achieve stable estimates. In proposing new values for γ_{0} and γ_{1}, only small steps are taken (with H_{3} in the range 0.010.05) to achieve an acceptance rate of ∼0.10.3. For other variables, even large steps (with H_{1}, H_{2}, H_{4} in the range 0.10.5) are accepted at high frequencies (>50%). The mixing step seems rather effective so that ∼1000 generations seem enough for the burnin. For some priors, <500,000 generations appear sufficient, which takes a few minutes on a Pentium III PC at 1.2 GHz for the data of Chen and Li (2001). For other priors, the algorithm has to be run much longer.
Acknowledgments
I am very grateful to Drs. W.H. Li and F.C. Chen for providing the data analyzed in this article. I thank M. Hasegawa and B. Larget for discussions and N. Takahata for comments. This study is supported by grant 31/G13580 from the Biotechnology and Biological Sciences Research Council (United Kingdom).
Footnotes

Communicating editor: Y.X. Fu
 Received April 18, 2002.
 Accepted September 6, 2002.
 Copyright © 2002 by the Genetics Society of America