Likelihood and Bayes Estimation of Ancestral Population Sizes in Hominoids Using Data From Multiple Loci
Ziheng Yang

Abstract

Polymorphisms in an ancestral population can cause conflicts between gene trees and the species tree. Such conflicts can be used to estimate ancestral population sizes when data from multiple loci are available. In this article I extend previous work for estimating ancestral population sizes to analyze sequence data from three species under a finite-site nucleotide substitution model. Both maximum-likelihood (ML) and Bayes methods are implemented for joint estimation of the two speciation dates and the two population size parameters. Both methods account for uncertainties in the gene tree due to few informative sites at each locus and make an efficient use of information in the data. The Bayes algorithm using Markov chain Monte Carlo (MCMC) enjoys a computational advantage over ML and also provides a framework for incorporating prior information about the parameters. The methods are applied to a data set of 53 nuclear noncoding contigs from human, chimpanzee, and gorilla published by Chen and Li. Estimates of the effective population size for the common ancestor of humans and chimpanzees by both ML and Bayes methods are ∼12,000-21,000, comparable to estimates for modern humans, and do not support the notion of a dramatic size reduction in early human populations. Estimates published previously from the same data are several times larger and appear to be biased due to methodological deficiency. The divergence between humans and chimpanzees is dated at ∼5.2 million years ago and the gorilla divergence 1.1-1.7 million years earlier. The analysis suggests that typical data sets contain useful information about the ancestral population sizes and that it is advantageous to analyze data of several species simultaneously.

THE amount of sequence polymorphism in a population is mainly determined by the parameter θ= 4Nμ, where N is the (effective) population size and μ is the mutation rate per nucleotide site per generation. In addition to its effect on the amount of neutral variation maintained in a population, the population size is also important in affecting the efficiency of natural selection and is a critical parameter in population genetics models of mutation and selection. It is important to competing theories of origin and evolution of modern humans. When an estimate of the mutation rate is available, as is assumed here, the population size can be obtained from estimates of θ. The population size of a present-day species can be estimated by using observed DNA sequence variation in the population (e.g., Kreitman 1983; Fu 1994; Takahataet al. 1995; Yang 1997a). The population size of modern humans has been estimated to be ∼10,000 (e.g., Takahataet al. 1995; Zhaoet al. 2000; Yuet al. 2001).

The population size of an extinct ancestral species, such as the common ancestor of humans and chimpanzees, is harder to estimate, but a few methods have been developed (see Edwards and Beerli 2000 and Sattaet al. 2000 for extensive reviews). They require data from multiple loci and make use of the information in the conflict of gene trees among loci. For example, Takahata (1986) suggested a method for estimating the population size of the common ancestor of two closely related species when a pair of homologous DNA sequences are available from the two species at each locus. The average divergence at many loci provides information on the species divergence time, while the variation of sequence divergence among loci reflects the ancestral population size. The method of Takahata (1986) uses summary statistics from the data and was later extended to a full-likelihood analysis and to the case of three species, where the population sizes of the two extinct ancestors as well as the two speciation dates were estimated (Takahataet al. 1995).

In the case of three species, another simpler method has also been used (Nei 1987; Wu 1991; Hudson 1992). This approach uses the proportion of loci at which the gene tree does not match the species tree to estimate the ancestral population size and exploits the fact that ancestral polymorphism creates conflicts between the gene tree and the species tree. I refer to it as the tree-mismatch method. Its application to human and great ape sequence data has led to estimates of the population size for the ancestor of humans and chimpanzees that are much larger than estimated population sizes for modern humans (Ruvolo 1997; Chen and Li 2001). For example, Chen and Li (2001) sequenced 53 noncoding contigs from human, common chimpanzee, gorilla, and orangutan, and estimated the population size of the common ancestor of humans and chimpanzees to range from 52,000 to 150,000, depending on the generation time used (15 or 20 years) and on whether parsimony or neighbor joining was used to infer the gene trees.

The tree-mismatch approach has room for improvement. First, the conflicts in the estimated gene trees among loci can be due to both ancestral polymorphism and sampling errors in tree reconstruction. As the sequences are highly similar, with few variable or informative sites at each locus, the reconstructed gene tree, no matter what method was used to infer it, is unreliable. Ignoring the sampling error in the estimated gene tree leads to overestimation of the conflict among loci and overestimation of the ancestral population size (see below). Second, the branch lengths in the gene tree provide, in a probabilistic sense, information about the ancestral population parameters, but are ignored by the method. The method clearly makes poor use of the information in the data. Third, the method assumes that one knows the species divergence times, while they might not be known. Indeed, some authors argue for the importance of accounting for ancestral polymorphism when estimating speciation dates (Takahataet al. 1995).

It seems advantageous to use information in both the conflicting gene genealogies as well as the branch lengths while accounting for their uncertainties. This can be achieved by using the likelihood method under a coalescent model developed by Takahata et al. (1995). However, the infinite-sites model assumed in the method is sometimes violated by the data. In this article, I extend the method of Takahata et al. for estimating ancestral population sizes, using data from three species. I use the nucleotide substitution model of Jukes and Cantor (1969) to correct for multiple substitutions at the same site. The model is also implemented in a Bayes framework, with Markov chain Monte Carlo (MCMC) used to achieve efficient calculation. The methods are applied to the data of Chen and Li (2001).

MAXIMUM-LIKELIHOOD ESTIMATION

The species phylogeny is represented ((12)3) and is assumed known. Species 1 and 2 diverged τ1 generations ago while species 3 diverged τ0 generations earlier (Figure 1a). In this article, species 1, 2, and 3 represent human (H), chimpanzee (C), and gorilla (G), respectively. The effective population size of the ancestor of species 1 and 2 is N1, and that of the common ancestor of all three species is N0. The mutation rate μ is measured as the number of nucleotide substitutions per site per generation. As rate and time are confounded in such analysis, the parameters of the model are θ0 = 4N0μ, θ1 = 4N1μ, γ00μ, and γ11μ (Figure 1a). Collectively they are denoted 0398;= {θ0, θ1, γ0, γ1}. The data contain DNA sequences from multiple loci, with one individual sequenced from each of the three species at each locus. It is assumed that there is no recombination within a locus and free recombination between loci. The population is assumed to be random mating, with no gene flow after species divergence.

The likelihood function: Consider the probability distribution of the gene tree and its branch lengths at any locus i. Two cases are possible and are dealt with separately. Case I is represented by Figure 1b, where the gene tree is T0 = ((12)3), identical to the species tree, and sequences 1 and 2 coalesce between the two speciation events C and D. The coalescent times t0 and t1 are defined in Figure 1b. Case I occurs when t10, with probability ψ=0τ012N1et1(2N1)dt1=1e2γ0θ1 (1) (e.g., Takahataet al. 1995). In case II, both coalescent events occur in the population ancestral to all three species (Figure 1, c-e). The three gene trees T1 = ((12)3), T2 = ((23)1), and T3 = ((31)2) occur with equal probability (1 -Ψ)/3.

In case I (tree T0 in Figure 1b), the coalescent time t1 is an exponential random variable truncated at t1 < τ0, and t0 is an independent random variable with an exponential distribution fI(t1T0)=12N1et1(2N1)ψ,0<t1<τ0,fI(t0T0)=12N0et0(2N0),0<t0<. (2) Let branch lengths b0 and b1 in the gene tree of Figure 1b be defined as follows: b0=(τ0+t0t1)μ=γ0+t0μt1μ,b1=(τ1+t1)μ=γ1+t1μ. (3) Note that b0 is the length of the internal branch AB and b1 is the distance from sequence 1 to ancestor B (Figure 1b). Given that the gene tree at locus i is Ti = T0 and its branch lengths are bi0 and bi1, the probability of observing sequence data Di at locus i, P(Di|T0, bi0, bi1), is the likelihood in traditional molecular phylogenetics (Felsenstein 1981). Calculation of this probability is discussed in the next section. By averaging over the distribution of the random variables t0 and t1 (or b0 and b1), the probability of observing the data at locus i for case I is f(DiT0)=00τ0P(DiT0,γ0+t0μt1μ,γ1+t1μ)×fI(t1T0)fI(t0T0)dt1dt0=1ψ002γ0θ1P(DiT0,γ0+12θ0x012θ1x1,γ1+12θ1x1)×e(x0+x1)dx1dx0, (4) after a change of variables.

Figure 1.

—(a) The species tree ((12)3) for three species: 1 (human), 2 (chimpanzee), and 3 (gorilla). Species 1 and 2 diverged τ1 generations ago and species 3 diverged τ0 generations earlier. The population sizes are N0 in the common ancestor of all three species and N1 in the common ancestor of species 1 and 2. The four parameters in the model are θ0 = 4N0μ, θ1 = 4N1μ, γ00μ, and γ11μ, where μ is the mutation rate per site per generation. There are four possible gene trees, represented by b-e. In case I (b), sequences 1 and 2 coalescence between the two speciation events and the gene tree T0 = ((12)3) is consistent with the species tree. This is referred to as case I in the text. In c-e, both coalescence events occur in the common ancestor of all three species. In this case (case II), the gene tree can be T1 = ((12)3), T2 = ((23)1), or T3 = ((31)2), with equal probability.

In case II (Figure 1, c-e), both coalescence events occur in the population ancestral to all three species. The three gene trees T1, T2, and T3 have equal probabilities. The coalescent times t0 and t1, defined in Figure 1, c-e, are independent random variables with exponential distributions fII(t1Tk)=32N0e3t1(2N0),0<t1<,fII(t0Tk)=12N0et0(2N0),0<t0<, (5) for k = 1, 2, or 3. Similarly, the branch lengths in the gene tree are defined as b0=t0μ,b1=γ0+γ1+t0μ. (6)

Calculation of the probability, P(Di|Tk, bi0, bi1), of observing data Di at locus i, given the gene tree Ti = Tk (k = 1, 2, 3) and its branch lengths bi0 and bi1, will be described in the next section. The probability of observing the data at locus i for case II is f(DiTk)=00P(DiTk,t0μ,γ0+γ1+t1μ)×fII(t1Tk)fII(t0Tk)dt1dt0=00P(DiTk,12θ0x0,γ0+γ1+12θ0x1)×3e(x0+3x1)dx1dx0, (7) for k = 1, 2, 3.

Averaging over cases I and II or over the four gene trees T0, T1, T2, T3 in Figure 1, we obtain the marginal probability of observing data Di at locus i as f(Diθ0,θ1,γ0,γ1)=ψf(DiT0)+1ψ3k=13f(DiTk)=002γ0θ1P(DiT0,γ0+12θ0x012θ1x1,γ1+12θ1x1)×e(x0+x1)dx1dx0+e2γ0θ100k=13P(DiTk,12θ0x0,γ0+γ1+12θ0x1)×e(x0+3x1)dx1dx0. (8)

The log likelihood is a sum over all the L loci, (θ0,θ1,γ0,γ1D)=i=1Llog{f(Diθ0,θ1,γ0,γ1)}, (9) where D = {Di} represents data at all L loci. Parameters θ0, θ1, γ0, and γ1 are estimated by numerical maximization of the log-likelihood function. A C-optimization routine from the PAML package (Yang 1997b) was used. Each likelihood calculation requires evaluation of 2L two-dimensional integrals (Equation 8), which are calculated numerically using Mathematica. The Math-link library was used to establish communication between the C routine and Mathematica. For the data of Chen and Li (2001) with L = 53 loci, the ML iteration takes ∼1 hr on a Pentium III PC at 1.2 GHz.

Probability of data at a locus given the gene tree and branch lengths: Given the gene tree Ti, which is one of T0, T1, T2, or T3 in Figure 1b, and its branch lengths (bi0 and bi1), the probability of observing the data, P(Di|Ti, bi0, bi1), in Equation 8 can be calculated under any model of nucleotide substitution (Lio and Goldman 1998). The model of Jukes and Cantor (1969) is used in this article, which seems sufficient for such highly similar sequences. Under this model, the sequence data can be summarized as counts of five possible site patterns (configurations): xxx, xxy, yxx, xyx, and xyz, where x, y, and z are any three different nucleotides. The data at locus i can thus be represented by the number of sites and Li (2001) is listed in Table 1.

View this table:
TABLE 1

Number of site patterns from the data of Chen and Li (2001)

For gene trees T0 or T1, the probabilities of observing the five site patterns are p0(b0,b1)=prob(xxx)=(1+3e8b13+6e8(b0+b1)3+6e(8b0+12b1)3)16,p1(b0,b1)=prob(xxy)=(3+9e8b136e8(b0+b1)36e(8b0+12b1)3)16,p2(b0,b1)=prob(yxx)=(33e8b13+6e8(b0+b1)36e(8b0+12b1)3)16,p3(b0,b1)=prob(xyx)=(33e8b13+6e8(b0+b1)36e(8b0+12b1)3)16=p2,p4(b0,b1)=prob(xyz)=(66e8b1312e8(b0+b1)3+12e(8b0+12b1)3)16 (10) (Yang 1994). For gene trees T2 and T3 (Figure 1, d and e), the probabilities can be obtained by considering the symmetry of the problem. Thus with functions p0-p4 defined above, the probability of data at locus i, conditional on the gene tree Tk, k = 1 (or 0), 2, 3, and its branch lengths b0 and b1, is given by the multinomial distribution P(DiT1,b0,b1)=C×p0ni0p1ni1p2ni2+ni3p4ni4,P(DiT2,b0,b1)=C×p0ni0p1ni2p2ni3+ni1p4ni4,P(DiT3,b0,b1)=C×p0ni0p1ni3p2ni1+ni2p4ni4, (11) where C=ni!j=04nij! , and ni = ni0 + ni1 + ni2 + ni3 + ni4.

Mutation rate variation among loci: An important factor that may influence the estimation of ancestral population sizes is the variation of mutation rates among loci (Yang 1997a; Chen and Li 2001). For example, estimation of ancestral population size from comparison between two species was found to be sensitive to even slight rate variation (Yang 1997a). If relative rates for the loci are available, it will be straightforward to incorporate them in the likelihood calculation (Yang 1997a). In this article, I use the average distance from the orangutan to the three African apes to calculate the relative rate for the locus (Table 1). This ad hoc approach appears sensible since the orangutan diverged from the African apes very early and ancestral polymorphism in their common ancestor does not seem important. The likelihood calculation proceeds as before except that the branch lengths for the gene tree at each locus are multiplied by the relative rate for that locus.

Application to the data of Chen and Li: Chen and Li (2001) sequenced one individual from each of the four species, human, chimpanzee, gorilla, and orangutan, at 53 independent noncoding loci (contigs). An advantage of the data set is that the sequences are expected to be outside and far away from coding regions and not affected by selection at linked sites or loci. The model of this article assumes the molecular clock and uses only three species. The counts of site patterns are listed in Table 1. At some loci, the total number of sites used in this article is larger than that in Chen and Li (2001; Table 1), because some sites had alignment gaps in the orangutan and were removed by Chen and Li.

View this table:
TABLE 2

Maximum-likelihood estimates of parameters

The three-species data are analyzed by the maximum-likelihood (ML) method of this article. The estimates of parameters are given in Table 2. If we assume a generation time of 20 years and a mutation rate of 10-9 substitutions per site per year (e.g., Nachman and Crowell 2000), the estimates suggest a population size for the ancestor of humans and chimpanzees of ∼12,000. This is several times smaller than the estimates of Chen and Li (2001) from the same data, at a minimum of 52,000. The estimate is also similar to estimates of the population size of modern humans, for example, 12,000 by Yu et al. (2001). The population size for the common ancestor of all three species is estimated to be ∼38,000. The same analysis estimated the human-chimpanzee divergence time at 5.1 million years ago (MYA) and the gorilla divergence at ∼1.1 million years (MY) earlier. Those estimates are largely consistent with those of previous studies (e.g., Hasegawaet al. 1985; Ruvolo 1997; Kumar and Hedges 1998; Yoder and Yang 2000). Figure 2a shows the likelihood surface as a function of θ0 and θ1 when γ0 and γ1 are fixed at their maximum-likelihood estimates (MLEs). The ∼95% confidence region is given by the likelihood contour at 3.32 units below the optimum, that is, at -3102.73 (Figure 2a). The sampling errors are quite large. Analysis of the human and chimpanzee sequences at the 53 loci using the ML method of Takahata et al. (1995) under the infinite-sites model leads to θ^1=0.0017 and γ^=0.0055 (N. Takahata, personal communication). Those estimates are similar to the MLEs of Table 2.

To examine the effect of mutation rate variation among loci, I calculated the average sequence distance under the model of Jukes and Cantor (1969) from the human, chimpanzee, and gorilla to the orangutan, that is, dHCG-O = (dH-O + dC-O + dG-O)/3. This is divided by the average across all loci to give the relative rate for that locus (Table 1). The average distance from the orangutan to the African apes is found to be 0.0296; this is consistent with a mutation rate of 10-9 substitutions per site per year and an orangutan divergence date of ∼13 MYA (e.g., Hasegawaet al. 1987) since 0.0296/(2 × 13 × 106) = 1.1387 × 10-9. The relative rates for loci calculated this way are used as fixed constants in the likelihood calculation for the data of three species. The MLEs of parameters are shown in Table 2. Using the same generation time and mutation rate as above, we get the estimate of the population size for the ancestor of humans and chimpanzees to be 21,000, larger than the estimate under the assumption of a constant rate for all loci. The population size for the ancestor of all three species is estimated to be 29,000, smaller than under the constant-rate model. The differences between the two analyses are somewhat surprising, as one might expect the population sizes to be smaller when rate variation among loci is accounted for. However, it is noted that the distance from orangutan to the African apes has only a weak correlation (0.44) with the average distance within the H-C-G group, which appears to suggest that the mutation rates are rather homogeneous among loci and that the conflict among loci in sequence divergence is mainly caused by ancestral polymorphism.

If the average distances within the H-C-G group, (dHC + dCG + dGH)/3, are used as relative rates for the loci, parameter estimates become θ^0=0.0000014 , θ^1=0.002902 , γ^0=0.003229 , and γ^1=0.004555 , with =3069.73 . Those correspond to a population size of 36,000 for the ancestor of humans and chimpanzees, a population size of only 18 for the ancestor of all three species, 4.5 MY for the H-C divergence, and 7.7 MY for the gorilla divergence. This calculation effectively attributes all variation in sequence divergence among loci to mutation rate variation and causes underestimation of θ0 and γ1 and overestimation of θ1 and γ0 (see also Table 2).

Comparison with the tree-mismatch method: When the gene tree is T2 or T3 (Figure 1), there is a mismatch between the species tree and the gene tree. This occurs with probability PSG = f(T2) + f(T3) = 2(1 -Ψ)/3 = ⅔e-2γ01 (e.g., Nei 1987, pp. 288-289). The tree-mismatch method estimates θ1 by equating this probability to the proportion () of loci at which the estimated gene tree differs from the species tree, with γ0 being assumed known; that is, θ^1=2γ0log{3p^2} . Chen and Li (2001) used the orangutan to root the H-C-G tree and were able to resolve the gene tree at 33 loci, out of which 9 mismatches were found, at a proportion of 27.3%. Several coding loci were included as well, so that 16 mismatches were found at a total of 52 resolved loci, at the proportion 30.8%. The authors assumed an H-C divergence at γ1 = 1.6 MYA and arrived at θ^1=0.00414 , which, if the generation time is 20 years, corresponds to a minimum population size of 1 = 52,000 for the ancestor of humans and chimpanzees. In this article, the molecular clock has been assumed, which can also be used to root the H-C-G tree. Under the clock, T1, T2, or T3 is the ML tree if ni1, ni2, or ni3 is the greatest among the three, respectively (Saitou 1988; Yang 1994). The clock-rooting approach uses more “informative” sites than out-group rooting and resolves the gene tree at 49 of the 53 loci, out of which 18 are mismatches, at the proportion 36.7%. This proportion is even higher than those of Chen and Li and produces even larger estimates of θ1 and N1.

Figure 2.

—Log-likelihood surface (contour) as a function of θ0 and θ1 when γ0 and γ1 are fixed at their MLEs. (a) The same substitution (mutation) rate is assumed for all loci. (b) Fixed relative rates obtained from comparison between the orangutan and the African apes (human, chimpanzee, and gorilla) are used to account for possible evolutionary rate variation among loci. Maximum-likelihood estimates of parameters are listed in Table 2.

To understand the difference between the tree-mismatch method and ML, note that three aspects of the data are ignored by the tree-mismatch method and accounted for by ML: (i) uncertainty in the estimated gene tree due to the finite number of nucleotide sites at the locus, (ii) unresolved loci (ties), and (iii) branch lengths in the gene tree reflected in the sequence divergences. While all three probably contribute to the large differences discussed above, uncertainty in the estimated gene tree seems to be the predominant reason. A more “proper” tree-mismatch method should equate the observed proportion of mismatches () not to PSG but to PSE, the probability of a mismatch between the species tree and the estimated gene tree. This probability is given by PSE=max{ni2,ni3}>ni1f(Diθ0,θ1,γ0,γ1), (12) where f is the probability of data Di = {ni0, ni1, ni2, ni3, ni4} in Equation 8, and the summation is over all data configurations in which the ML tree for the locus is either T2 or T3 (Figure 1). Unlike PSG, PSE is dependent on all four parameters, θ0, θ1, γ0, and γ1, as well as the sequence length ni and appears no easier to calculate than the full likelihood (Equation 9). Instead I use Monte Carlo simulation to calculate those probabilities to assess the impact of errors in gene tree reconstruction on the difference between PSG and PSE. The MLEs of parameters in Table 2 (“constant rate”) are used to generate gene trees, which are used to “evolve” sequences. The sites are counted to obtain the data Di = {ni0, ni1, ni2, ni3, ni4}, which are used to estimate the gene tree by ML.

Figure 3 shows the probabilities that the species tree (S), the gene tree (G), and the estimated gene tree (E) differ from each other. The probability of a mismatch between the species tree and the gene tree is PSG = 2(1 -Ψ)/3 = 0.0739, much lower than the observed mismatch proportion = 0.367. The probability of a mismatch between the species tree and the estimated gene tree is higher. With 466 sites (the average across the 53 loci, Table 1), PSE = 0.2028, which is 2.7 times as high as PSG (Figure 3). Now consider the four gene trees T0, T1, T2, and T3 (Figure 1), which occur with probabilities f(T0) =Ψ = 0.8892 and f(T1) = f(T2) = f(T3) = (1 -Ψ)/3 = 0.0369 (Equation 1). According to the simulation, the probability that the topology of the gene tree is incorrectly reconstructed when the true gene tree is T0, T1, T2, or T3 is P0 = 0.1453 and P1 = P2 = P3 = 0.1981. Note that for the estimated gene tree, we consider only the topology and disregard its divergence times relative to the speciation times. The average error probability of gene tree reconstruction is thus PGE=i=03f(Ti)Pi=0.8892×0.1453+3×0.0369×0.1981=0.1512 (see Figure 3). When gene trees T0 or T1 are incorrectly reconstructed, the estimated gene tree will always be a mismatch with the species tree; such errors will cause an overcount of f(T0)P0 + f(T1)P1 = 0.1366. Conversely, when gene trees T2 or T3 are incorrectly reconstructed, the estimated tree will not be counted as a mismatch one-half of the time, so the undercount is f(T2)P2/2 + f(T3)P3/2 = f(T2)P2 = 0.0074. The difference between those two error rates gives rise to the net error due to gene tree reconstruction: PSE - PSG = f(T0)P0P0 = 0.1290. The above argument suggests that ignoring errors in gene tree reconstruction always causes overestimation of the mismatch between the species tree and the gene tree and leads to overestimation of the ancestral population size N1. It is interesting that the bias in the tree-mismatch method is caused by reconstruction errors for gene tree T0 alone and thus can be reduced if Ψ is reduced, for example, if the two speciation events are very close or if the ancestral population size N1 is large. Obviously factors that reduce the reconstruction error P0, such as longer sequences (Figure 3) and higher mutation rates, will reduce the bias as well.

Figure 3.

—Tree-mismatch probabilities calculated using Monte Carlo simulation plotted as functions of the sequence length. MLEs of parameters in Table 2 (one rate for all loci) are used in the simulation. Note that S, G, and E in the subscripts stand for the species tree, the gene tree, and the estimated gene tree (the ML tree), respectively. Thus PSG is the probability that the species tree and the gene tree differ; this is 0.0739 for the parameter values used. PSE is the probability of a mismatch between the species tree and the estimated gene tree, and PGE is the probability of a mismatch between the gene tree and the estimated gene tree. Ptie is the proportion of replicates in which a tie occurs, that is, the two best trees are equally good. Ties are excluded in calculation of PSE and PGE. Ten million replicates were simulated for each sequence length.

THE BAYES APPROACH USING MCMC

A Bayes approach is implemented under the same model, using MCMC. As parameters 0398;= {θ0, θ1, γ0, γ1} are all positive, I use independent gamma distributions as the prior. The gamma density is g(x;α,β)=βαeβxxα1Γ(α), (13) with mean α/β and variance α/β2. The hyperparameters α and β are chosen to reflect the range and likely values of the parameters.

Instead of the coalescent times t0 and t1, which have different definitions in different gene trees (Figure 1), branch lengths b0 and b1 in the gene tree are used in the MCMC. The joint prior distributions of the gene tree T0 and its branch lengths b0 and b1 given 0398; can be easily derived from the distributions of the coalescent times t0 and t1 (Equation 2), f(T0,b0,b1ϴ)=f(T0ϴ)f(b1T0,ϴ)f(b0T0,b1,ϴ)=ψ×2ψθ1e2(b1γ1)θ1×2θ0e2(b0+b1γ0γ1)θ0=4(θ0θ1)×e2(b1γ1)θ12(b0+b1γ0γ1)θ0, (14) for γ1 < b110 and γ10 - b1 < b0 < ∞.

Similarly, the joint prior distribution of gene tree Tk, k = 1, 2, 3, and its branch lengths b0 and b1 given 0398; is f(Tk,b0,b1ϴ)=f(Tkϴ)f(b0Tk,ϴ)f(b1Tk,ϴ)=1ψ3×2θ0e2b0θ0×6θ0e6(b1γ0γ1)θ0=4θ02e2γ0θ1[2b0+6(b1γ0γ1)]θ0, (15) with 0 < b0 < ∞ and γ10 < b1 < ∞.

The variables to be updated in the Markov chain include the parameters 0398;= {θ0, θ1, γ0, γ1} and the gene trees and branch lengths at all L loci, G = {Ti, bi0, bi1}, i = 1, 2,..., L. The Markov chain is constructed so that its steady-state distribution is the posterior distribution of those variables. Bayes inference is then based on the joint posterior distribution f(ϴ,GD)=f(DG)f(Gϴ)f(ϴ)f(D)=i=1LP(DiTi,bi0,bi1)×i=1Lf(Ti,bi0,bi1ϴ)×f(θ0)f(θ1)f(γ0)f(γ1)f(D) (16) The denominator f(D) is the marginal probability of the data f(D)=ϴGf(DG)f(Gϴ)f(ϴ)dGdϴ, (17) where the integral over G represents summation over the four gene trees (T0, T1, T2, T3 in Figure 1) and integration over branch lengths in each tree. The posterior distribution of any parameter is then given by integrating over the joint posterior distribution. For example, f(ϴD)=Gf(ϴ,GD)dG. (18)

View this table:
TABLE 3

Prior and posterior distributions of parameters in the Bayes analysis

A Metropolis-Hastings algorithm (Metropoliset al. 1953; Hasting 1970) is used to update variables in the MCMC. Given the current state of the chain (0398;, G), a new state (0398;*, G*) is proposed through a proposal distribution q(0398;*, G*|0398;, G). The new state is then accepted with probability R=min{1,f(ϴ,GD)f(ϴ,GD)×q(ϴ,Gϴ,G)q(ϴ,Gϴ,G)}=min{1,f(DG)f(Gϴ)f(ϴ)f(DG)f(Gϴ)f(ϴ)×q(ϴ,Gϴ,G)q(ϴ,Gϴ,G)}. (19) If the new state is accepted, the chain moves to the new state (0398;*, G*). Otherwise the chain stays in the old state (0398;, G). Note that f(D) in Equation 16 cancels in calculation of the acceptance ratio R. Calculation of f(0398;*, G *|D)/f(0398;, G|D) is straightforward due to the conditional independence in the model as described above. So the focus here is the proposal mechanism and the proposal ratio q(0398;, G|0398;*, G*)/q(0398;*, G*|0398;, G).

The proposal density q can be rather flexible as long as it specifies an aperiodic and irreducible Markov chain. The algorithm I implemented cycles through several steps, with each step updating some but not all variables. In step 1, the gene tree and branch lengths at each locus i (Ti, bi0, bi1) are updated, while parameters 0398; are fixed. Each locus is updated once in this step. Step 2 updates parameters 0398; while the branch lengths {bi0, bi1} are fixed. This step can cause changes to the gene trees at some loci. Step 3 is a mixing step, in which parameters θ0, θ1, γ0, γ1 and branch lengths at all loci are multiplied by a constant while the gene trees remain unchanged. The MCMC algorithm is tedious and the details are given in the appendix.

The Markov chain is started from a random initial state. Sampling starts after a certain number of generations, which are discarded as burn-in, and samples are taken every certain number of steps, thus “thinning” the chain. Convergence of the chain is checked by examining the output and also by running multiple chains. The algorithm is also run without sequence data, and the posterior distribution generated is found to be close to the prior.

Application to the data of Chen and Li: The Bayes MCMC algorithm is applied to the data of Chen and Li (2001; see Table 1). I used two sets of priors (Table 3). Parameters α and β in the gamma prior distributions are chosen by considering likely values and ranges of ancestral population sizes and species divergence times and converting them into parameters θ0, θ1, γ0, and γ1 using a generation time of 20 years and a mutation rate of 10-9 substitutions per site per year. The first set is considered more realistic and referred to as the “good priors” in Table 3. Ancestral population sizes N0 and N1 are centered ∼12,500, close to estimates for modern humans, with the 2.5 and 97.5% percentiles at 1500 and 34,800, respectively. The divergence time for humans and chimpanzees is centered ∼5 MY, while the divergence time for the gorilla is centered ∼1.6 MY. Note that parameters θ0, θ1, γ0, and γ1 are all ⪡1 but are definitely >0; thus values of α>1 are used so that the gamma distribution has a mode >0.

Figure 4.

—Prior and posterior distributions for parameters θ0, θ1, γ0, and γ1. Parameter estimates are shown in Table 3 (good priors).

The posterior distributions of parameters θ0, θ1, γ0, and γ1 are shown in Figure 4 together with their priors. They are also summarized in Table 3 (good priors). The means of the posterior distributions for θ0, θ1, γ0, and γ1 are listed, and then the means and the 95% credibility sets for the two population sizes (N0 and N1) and for the two speciation times are listed. The posterior means and medians are close. The population size for the ancestor of humans and chimpanzees is estimated to be 12,400, with the 95% credibility interval (CI) to be (1700, 32,100). The H-C divergence is dated at 5.3 MY, with the 95% CI to be (4.4, 6.1). The estimates of θ1 and γ1 are very similar to the MLEs. The posterior mean of θ0 is smaller and that of γ0 is larger than the MLEs (Tables 2 and 3). The correlation coefficients calculated from the posterior distributions of parameters θ0, θ1, γ0, and γ1 are shown in Table 4. There is strong negative correlation between θ0 and γ0 and between θ1 and γ1. Comparison of the prior and posterior distributions (Figure 4) suggests that the data contain much more information about the divergence times, especially the H-C divergence time (γ1), than about the population sizes.

To see the effect of prior assumptions on the posterior distributions, I used a second set of priors, which are more spread out and also assume large population sizes (mean 62,500). The posterior distributions are summarized in Table 3 under the heading “Poor priors.” The point estimates of both N0 and N1 are ∼33,000, smaller than the prior means. The H-C divergence is dated at 4.6 MY, and the gorilla divergence is dated 1.9 MY earlier. Those estimates appear reasonable, although the H-C divergence date is too recent. Similar strong correlations between the parameters are discovered as in the analysis using the good priors. The negative correlation between γ1 and θ1 (calculated to be -0.76), combined with the assumed and estimated large population sizes, appears to have led to a H-C divergence date that is too recent.

View this table:
TABLE 4

Correlation coefficients in the posterior distribution

DISCUSSION

ML and Bayes methods of this article estimated the population size for the common ancestor of humans and chimpanzees to be ∼12,000, similar to estimates for modern humans. The estimates are several times smaller than those obtained by Chen and Li (2001) from the same data using the tree-mismatch method, which range from 52,000 to 150,000. Even the worst-case estimates—e.g., 36,000 by ML under the assumption that all sequence divergence variation among loci is due to mutation rate variation and 33,000 from the Bayes analysis using the poor priors—are smaller than the minimum estimate of Chen and Li. The tree-mismatch method used by Chen and Li appears to have serious biases due to errors in gene tree reconstruction, and the likelihood and Bayes estimates reported here are probably more reliable. Thus it may be concluded that the sequence data of Chen and Li (2001) do not support much larger ancestral populations than the modern humans or the notion that early human populations experienced dramatic size reductions (Hacia 2001; Kaessmannet al. 2001).

While the ML and Bayes methods are expected to have better statistical properties than the simple tree-mismatch method, it is worthwhile to examine some of the assumptions made in this article. First, the evolutionary rate is assumed to be constant over lineages. This assumption seems reasonable as the species compared are very closely related; Chen and Li’s (2001) relativerate tests supported the molecular clock. The large differences between the tree-mismatch method and the likelihood and Bayes methods are clearly not due to the use of the clock assumption in this article; use of clock rooting in the tree-mismatch method produced even larger estimates of the population size for the ancestor of humans and chimpanzees. Second, the analysis assumes no recombination within a locus. The effect of recombination on estimation of parameters θ0, θ1, γ0, and γ1 is not well understood, although Satta et al. (2000) emphasized its possible significance. As the human, chimpanzee, and gorilla sequences are extremely similar, most of the recombination events will not be visible in the sequence data, and the few sites at which more than two nucleotides are observed in the data (see counts ni4 for site pattern xyz in Table 1) are probably due to multiple substitutions at the same site. Third, the substitution model of Jukes and Cantor (1969) is simplistic. More complex models, such as those that account for variable substitution rates among sites within the locus, can be easily implemented, but are expected to have little effect. The most serious issue seems to be mutation rate variation among loci. In the case of two species, the ancestral population size is overestimated when mutation rate variation is ignored and accounting for the bias leads to dramatic reduction in the estimated population size (Yang 1997a). In this article, the population size of the ancestor of humans and chimpanzees is not very large under the constant-rate model and becomes larger when variable rates for loci are assumed. The effect is much less important and also in the opposite direction compared with the two-species case. Lack of strong correlation among sequence distances with the orangutan seems to suggest that the rates are relatively homogeneous among those loci. It seems that simultaneous analysis of data from three species allows the parameters to constrain each other, leading to a better use of information in the data. It is quite likely that the estimation can be further improved by sampling multiple individuals from the same species.

The ML and Bayes methods produced similar results for the data analyzed in this article. However, the ML calculation is slower than the MCMC algorithm. The Bayes approach also provides a framework for incorporating prior information about the parameters. For example, a wealth of information is available about the divergence time between humans and chimpanzees. By forcing a very narrow prior distribution for γ1, such information can be incorporated in the Bayes analysis. Using an informative prior will reduce the adverse effect of strong correlation among parameters when other parameters are estimated. Furthermore, the Bayes algorithm seems easier than ML to extend to data that contain more than three species and more than one individual from each species.

Program availability: C programs implementing the MCMC algorithm and calculating the mismatch probabilities (PSG, PSE, and PGE, etc.) are available from the author upon request. The C and Mathematica programs for the likelihood method are available as well, but they make use of the Mathlink library and are awkward to use.

APPENDIX: IMPLEMENTATION OF THE MCMC ALGORITHM

Updating the gene tree and branch lengths at each locus: This step of the algorithm goes through the L loci to update the branch lengths and possibly the gene tree as well. The algorithm visits each locus once. The description in this section concerns one locus i. The subscript i in bi0 and bi1 may be suppressed. The proposal algorithm for gene trees T2 and T3 is simpler and is described first, before the algorithm for gene trees T0 and T1 (Figure 1).

If the current gene tree is either T2 or T3, only the branch lengths are updated and the gene tree is not changed. Note that b0 and b1 have to satisfy the requirements 0 < b0 < ∞ and γ01 < b1 < ∞. A sliding window is used to propose new branch lengths, b0U(b0w2,b0+w2),b1U(b1w2,b1+w2), (A1) where U(a, b) is the uniform distribution in the interval (a, b). The window size is set at w = H1, with H1 a small fine-tuning parameter. If the proposed value is outside the range of the variables, the excess is reflected back into the range; that is, if b0<0 , then b0 is reset to -b0 , and if b1<γ0+γ1 , then b1 is reset to 2(γ0+γ1)b1 . The proposal ratio is 1. The acceptance ratio is R=min{1,P(DiTi,bi0,bi1)P(DiTi,bi0,bi1)×f(Ti,bi0,bi1θ0,θ1,γ0,γ1)f(Ti,bi0,bi1θ0,θ1,γ0,γ1)}. (A2)

If the current gene tree is either T0 or T1 (Figure 1, b and c), a change between those two trees is allowed when the branch lengths are updated. Both the current and the new branch lengths should satisfy the constraints γ1 < b1 < ∞ and γ01 < b0 + b1 < ∞. To propose new branch lengths under those constraints, I apply the transformation y0=b0+b1(γ0+γ1),y1=(b1γ1)(b0+b1γ1), (A3) with y0 > 0 and 0 < y1 < 1. Note that y0 is the distance with y0 > 0 and 0 < y1 < 1. Note that y0 is the distance between the root of the gene tree (node A in Figure 1) and the first speciation event (C in Figure 1), and changing y0 will shrink or expand the height of the gene tree, while y1 is the ratio of distance BD to AD, and changing y1 will slide node B between A and D (Figure 1, b and c).

New states are proposed for y0 and y1 as y0=y0eH1(r0.5),y1U(y1H12,y1+H12), (A4) where r is a random variable from U(0, 1). If y1 is out of the range (0, 1), it is reflected back into the range. The new branch lengths b0 and b1 are calculated from the relationships b0=(γ0+y0)(1y1),b1=γ1+y1(γ0+y0). (A5) If b1>γ0+γ1 , the new gene tree Ti is set to T1; otherwise it is set to T0. This change of gene tree does not change the proposal ratio. The proposal ratio for variables y0 and y1 is y0y0 . The proposal ratio for the original variables b0 and b1 can be derived by noting J=b0y0b0y1b1y0b1y1=1y1(γ0+y0)y1γ0+y0=γ0+y0. Thus the acceptance ratio is R=min{1,P(DiTi,bi0,bi1)P(DiTi,bi0,bi1)×f(Ti,bi0,bi1θ0,θ1,γ0,γ1)f(Ti,bi0,bi1θ0,θ1,γ0,γ1)×(γ0+y0)y0(γ0+y0)y0}. (A6)

If the gene tree is T1, T2, or T3, the algorithm may (with a small probability of, say, 0.2 or 0.3) attempt to swap the gene trees. The current gene tree is replaced by one of the other two trees chosen at random. The proposal ratio is 1, and the acceptance ratio is R=min{1,P(DiTi,bi0,bi1)P(DiTi,bi0,bi1)×f(Ti,bi0,bi1θ0,θ1,γ0,γ1)f(Ti,bi0,bi1θ0,θ1,γ0,γ1)}. (A7)

Updating population size and speciation date parameters: This step makes two proposals: the first to change θ0 and θ1 and the second to change γ0 and γ1. Parameters θ0 and θ1 are positive but are not constrained otherwise. They are updated simultaneously, with all other variables fixed. New values are proposed around the current values by θ0=θ0eH2(r00.5),θ1=θ1eH2(r10.5), (A8) where r0 and r1 are uniform random numbers in the interval (0, 1) and H2 is a small fine-tuning parameter. The proposal ratio is θ0θ1(θ0θ1) and the acceptance ratio is R=min{1,i=1Lf(Ti,bi0,bi1θ0,θ1,γ0,γ1)f(Ti,bi0,bi1θ0,θ1,γ0,γ1)×(θ0θ1θ0θ1)α1eβ(θ0θ0+θ1θ1)×θ0θ1θ0θ1}. (A9)

Next, speciation date parameters γ0 and γ1 are updated while θ0 and θ1 as well as branch lengths bi0 and bi1 for all loci are fixed. At each locus the following constraints have to be satisfied: 0 <γ1 < b101 < b0 + b1 < ∞ in gene tree T0 and 0 < b0 < ∞ and γ0 + γ1 < b1 < ∞ in gene tree T1, T2, or T3 (Figure 1). To allow the chain to move more freely, the gene tree is allowed to change, if necessary, from T0 to T1, T2, or T3 or vice versa. Thus only the following constraints have to be satisfied when new values are proposed for γ0 and γ1: 0 <γ1 < b1 and γ01 < b0 + b1 at every locus; that is, γ1 < min{bi1} = c and γ101 < min{bi0 + bi1} = d. The following transformation is used to propose new states, y0=γ0(dγ1),y1=γ1, (A10) with constraints 0 < y0 < 1 and 0 < y1 < c. Note that y0 is the ratio of the distance CD to AD in Figure 1, and changing y0 will slide the speciation date C (Figure 1) between A and D. Note that γ0 = y0(d1), γ1 = y1, and J=γ0y0γ0y1γ1y0γ1y1=dγ1. Sliding windows are used to propose new values, y0U(y0H32,y0+H32),y1U(y1H32,y1+H32), (A11) where H3 is a small fine-tuning parameter. If the new values y0 and y1 are out of the range, they are reflected back into the range. The proposal ratio for the transformed variables y0 and y1 is 1. The proposal ratio for variables γ0 and γ1 is (dγ1)(dγ1) . Next, all loci are scanned to see whether the gene tree needs updating. If the current gene tree is T0 but γ0+γ1<bi1 , the gene tree Ti is set to be one of T1, T2, or T3, chosen at random. The proposal ratio will be multiplied by 3 since there are three trees to move to and only one tree to move back. If the current tree is T1, T2, or T3 but γ0+γ1>bi1 , the gene tree is set to Ti=T0 , and the proposal ratio is divided by 3. Otherwise the gene tree for the locus remains unchanged: Ti=Ti . In sum the acceptance ratio is R=min{1,i=1LP(DiTi,bi0,bi1)f(Ti,bi0,bi1θ0,θ1,γ0,γ1)P(DiTi,bi0,bi1)f(Ti,bi0,bi1θ0,θ1,γ0,γ1)×f(γ0,γ1)f(γ0,γ1)×dγ1dγ1cT}, (A12) where f(γ0,γ1)f(γ0,γ1)=(γ0γ1γ0γ1)α1eβ(γ0γ0+γ1+γ1) from the gamma priors and cT is the proposal ratio due to changes to gene trees at some loci (a product of threes and one-thirds).

Mixing step: A mixing step is found to be effective in speeding up convergence when a poor starting point is chosen for the chain. In this step, the gene trees remain unchanged, but parameters θ0, θ1, γ0, and γ1 and branch lengths bi0 and bi1 for all loci are multiplied by a constant c=eH4(r0.5), (A13) where r is a random number from U(0, 1) and H4 is a small fine-tuning parameter. The proposal ratio is c4+2L. The acceptance ratio is R=min{1,i=1L[P(DiTi,bi0,bi1)×f(Ti,bi0,bi1θ0,θ1,γ0,γ1)P(DiTi,bi0,bi1)×f(Ti,bi0,bi1θ0,θ1,γ0,γ1)]×f(ϴ)f(ϴ)×c4+2L}, where f(ϴ)f(ϴ)=c4(α1)eβ(θ0θ0+θ1θ1+γ0γ0+γ1γ1) , given by the gamma prior distributions.

Performance of the algorithm: The performance of the MCMC algorithm is noted to depend on the choice of priors (values of α and β in the gamma distributions). For some priors, the algorithm is noted not to mix well, and in particular, parameters θ0 and γ0 appear to change slowly. The high correlation among parameters and the constraints seem to cause difficulties for the algorithm. In such cases, the chain has to be run much longer than usual to achieve stable estimates. In proposing new values for γ0 and γ1, only small steps are taken (with H3 in the range 0.01-0.05) to achieve an acceptance rate of ∼0.1-0.3. For other variables, even large steps (with H1, H2, H4 in the range 0.1-0.5) are accepted at high frequencies (>50%). The mixing step seems rather effective so that ∼1000 generations seem enough for the burn-in. For some priors, <500,000 generations appear sufficient, which takes a few minutes on a Pentium III PC at 1.2 GHz for the data of Chen and Li (2001). For other priors, the algorithm has to be run much longer.

Acknowledgments

I am very grateful to Drs. W.-H. Li and F.-C. Chen for providing the data analyzed in this article. I thank M. Hasegawa and B. Larget for discussions and N. Takahata for comments. This study is supported by grant 31/G13580 from the Biotechnology and Biological Sciences Research Council (United Kingdom).

Footnotes

  • Communicating editor: Y.-X. Fu

  • Received April 18, 2002.
  • Accepted September 6, 2002.

LITERATURE CITED

View Abstract