- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Yang, Z.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Yang, Z.
Likelihood and Bayes Estimation of Ancestral Population Sizes in Hominoids Using Data From Multiple Loci
Ziheng Yangaa Galton Laboratory, Department of Biology, University College London, London WC1E 6BT, England
Corresponding author: Ziheng Yang, University College London, Darwin Bldg., Gower St., London WC1E 6BT, England., z.yang{at}ucl.ac.uk (E-mail)
Communicating editor: Y.-X. FU
| ABSTRACT |
|---|
Polymorphisms in an ancestral population can cause conflicts between gene trees and the species tree. Such conflicts can be used to estimate ancestral population sizes when data from multiple loci are available. In this article I extend previous work for estimating ancestral population sizes to analyze sequence data from three species under a finite-site nucleotide substitution model. Both maximum-likelihood (ML) and Bayes methods are implemented for joint estimation of the two speciation dates and the two population size parameters. Both methods account for uncertainties in the gene tree due to few informative sites at each locus and make an efficient use of information in the data. The Bayes algorithm using Markov chain Monte Carlo (MCMC) enjoys a computational advantage over ML and also provides a framework for incorporating prior information about the parameters. The methods are applied to a data set of 53 nuclear noncoding contigs from human, chimpanzee, and gorilla published by Chen and Li. Estimates of the effective population size for the common ancestor of humans and chimpanzees by both ML and Bayes methods are
12,00021,000, comparable to estimates for modern humans, and do not support the notion of a dramatic size reduction in early human populations. Estimates published previously from the same data are several times larger and appear to be biased due to methodological deficiency. The divergence between humans and chimpanzees is dated at
5.2 million years ago and the gorilla divergence 1.11.7 million years earlier. The analysis suggests that typical data sets contain useful information about the ancestral population sizes and that it is advantageous to analyze data of several species simultaneously.
THE amount of sequence polymorphism in a population is mainly determined by the parameter
= 4Nµ, where N is the (effective) population size and µ is the mutation rate per nucleotide site per generation. In addition to its effect on the amount of neutral variation maintained in a population, the population size is also important in affecting the efficiency of natural selection and is a critical parameter in population genetics models of mutation and selection. It is important to competing theories of origin and evolution of modern humans. When an estimate of the mutation rate is available, as is assumed here, the population size can be obtained from estimates of
. The population size of a present-day species can be estimated by using observed DNA sequence variation in the population (e.g., ![]()
![]()
![]()
![]()
10,000 (e.g., ![]()
![]()
![]()
The population size of an extinct ancestral species, such as the common ancestor of humans and chimpanzees, is harder to estimate, but a few methods have been developed (see ![]()
![]()
![]()
![]()
![]()
In the case of three species, another simpler method has also been used (![]()
![]()
![]()
![]()
![]()
![]()
The tree-mismatch approach has room for improvement. First, the conflicts in the estimated gene trees among loci can be due to both ancestral polymorphism and sampling errors in tree reconstruction. As the sequences are highly similar, with few variable or informative sites at each locus, the reconstructed gene tree, no matter what method was used to infer it, is unreliable. Ignoring the sampling error in the estimated gene tree leads to overestimation of the conflict among loci and overestimation of the ancestral population size (see below). Second, the branch lengths in the gene tree provide, in a probabilistic sense, information about the ancestral population parameters, but are ignored by the method. The method clearly makes poor use of the information in the data. Third, the method assumes that one knows the species divergence times, while they might not be known. Indeed, some authors argue for the importance of accounting for ancestral polymorphism when estimating speciation dates (![]()
It seems advantageous to use information in both the conflicting gene genealogies as well as the branch lengths while accounting for their uncertainties. This can be achieved by using the likelihood method under a coalescent model developed by ![]()
![]()
![]()
| MAXIMUM-LIKELIHOOD ESTIMATION |
|---|
The species phylogeny is represented ((12)3) and is assumed known. Species 1 and 2 diverged
1 generations ago while species 3 diverged
0 generations earlier (Fig 1A). In this article, species 1, 2, and 3 represent human (H), chimpanzee (C), and gorilla (G), respectively. The effective population size of the ancestor of species 1 and 2 is N1, and that of the common ancestor of all three species is N0. The mutation rate µ is measured as the number of nucleotide substitutions per site per generation. As rate and time are confounded in such analysis, the parameters of the model are
0 = 4N0µ,
1 = 4N1µ,
0 =
0µ, and
1 =
1µ (Fig 1A). Collectively they are denoted
= {
0,
1,
0,
1}. The data contain DNA sequences from multiple loci, with one individual sequenced from each of the three species at each locus. It is assumed that there is no recombination within a locus and free recombination between loci. The population is assumed to be random mating, with no gene flow after species divergence.
|
The likelihood function:
Consider the probability distribution of the gene tree and its branch lengths at any locus i. Two cases are possible and are dealt with separately. Case I is represented by Fig 1B, where the gene tree is T0 = ((12)3), identical to the species tree, and sequences 1 and 2 coalesce between the two speciation events C and D. The coalescent times t0 and t1 are defined in Fig 1B. Case I occurs when t1 <
0, with probability
![]() |
(1) |
(e.g., ![]()
)/3.
In case I (tree T0 in Fig 1B), the coalescent time t1 is an exponential random variable truncated at t1 <
0, and t0 is an independent random variable with an exponential distribution
![]() |
(2) |
Let branch lengths b0 and b1 in the gene tree of Fig 1B be defined as follows:
![]() |
(3) |
Note that b0 is the length of the internal branch AB and b1 is the distance from sequence 1 to ancestor B (Fig 1B). Given that the gene tree at locus i is Ti = T0 and its branch lengths are bi0 and bi1, the probability of observing sequence data Di at locus i, P(Di|T0, bi0, bi1), is the likelihood in traditional molecular phylogenetics (![]()
![]() |
(4) |
after a change of variables.
In case II (Fig 1, ce), both coalescence events occur in the population ancestral to all three species. The three gene trees T1, T2, and T3 have equal probabilities. The coalescent times t0 and t1, defined in Fig 1C&NDASH;E, are independent random variables with exponential distributions
![]() |
(5) |
for k = 1, 2, or 3. Similarly, the branch lengths in the gene tree are defined as
![]() |
(6) |
Calculation of the probability, P(Di|Tk, bi0, bi1), of observing data Di at locus i, given the gene tree Ti = Tk (k = 1, 2, 3) and its branch lengths bi0 and bi1, will be described in the next section. The probability of observing the data at locus i for case II is
![]() |
(7) |
for k = 1, 2, 3.
Averaging over cases I and II or over the four gene trees T0, T1, T2, T3 in Fig 1, we obtain the marginal probability of observing data Di at locus i as
![]() |
(8) |
The log likelihood is a sum over all the L loci,
![]() |
(9) |
where D = {Di} represents data at all L loci. Parameters
0,
1,
0, and
1 are estimated by numerical maximization of the log-likelihood function. A C-optimization routine from the PAML package (![]()
![]()
1 hr on a Pentium III PC at 1.2 GHz.
Probability of data at a locus given the gene tree and branch lengths:
Given the gene tree Ti, which is one of T0, T1, T2, or T3 in Fig 1B, and its branch lengths (bi0 and bi1), the probability of observing the data, P(Di|Ti, bi0, bi1), in Equation 8 can be calculated under any model of nucleotide substitution (![]()
![]()
![]()
|
For gene trees T0 or T1, the probabilities of observing the five site patterns are
![]() |
(10) |
(![]()
![]() |
(11) |
where C = ni!/
4j=0nij!, and ni = ni0 + ni1 + ni2 + ni3 + ni4.
Mutation rate variation among loci:
An important factor that may influence the estimation of ancestral population sizes is the variation of mutation rates among loci (![]()
![]()
![]()
![]()
Application to the data of Chen and Li:
![]()
![]()
The three-species data are analyzed by the maximum-likelihood (ML) method of this article. The estimates of parameters are given in Table 2. If we assume a generation time of 20 years and a mutation rate of 10-9 substitutions per site per year (e.g., ![]()
12,000. This is several times smaller than the estimates of ![]()
![]()
38,000. The same analysis estimated the human-chimpanzee divergence time at 5.1 million years ago (MYA) and the gorilla divergence at
1.1 million years (MY) earlier. Those estimates are largely consistent with those of previous studies (e.g., ![]()
![]()
![]()
![]()
0 and
1 when
0 and
1 are fixed at their maximum-likelihood estimates (MLEs). The
95% confidence region is given by the likelihood contour at 3.32 units below the optimum, that is, at -3102.73 (Fig 2A). The sampling errors are quite large. Analysis of the human and chimpanzee sequences at the 53 loci using the ML method of ![]()
1 = 0.0017 and
1 = 0.0055 (N. TAKAHATA, personal communication). Those estimates are similar to the MLEs of Table 2.
|
|
To examine the effect of mutation rate variation among loci, I calculated the average sequence distance under the model of ![]()
13 MYA (e.g., ![]()
If the average distances within the H-C-G group, (dHC + dCG + dGH)/3, are used as relative rates for the loci, parameter estimates become
0 = 0.0000014,
1 = 0.002902,
0 = 0.003229, and
1 = 0.004555, with
= -3069.73. Those correspond to a population size of 36,000 for the ancestor of humans and chimpanzees, a population size of only 18 for the ancestor of all three species, 4.5 MY for the H-C divergence, and 7.7 MY for the gorilla divergence. This calculation effectively attributes all variation in sequence divergence among loci to mutation rate variation and causes underestimation of
0 and
1 and overestimation of
1 and
0 (see also Table 2).
Comparison with the tree-mismatch method:
When the gene tree is T2 or T3 (Fig 1), there is a mismatch between the species tree and the gene tree. This occurs with probability PSG = f(T2) + f(T3) = 2(1 -
)/3 = 2/3e-2
0/
1 (e.g., ![]()
1 by equating this probability to the proportion (
) of loci at which the estimated gene tree differs from the species tree, with
0 being assumed known; that is,
1 = -2
0/log{3
/2}. ![]()
1 = 1.6 MYA and arrived at
1 = 0.00414, which, if the generation time is 20 years, corresponds to a minimum population size of
1 = 52,000 for the ancestor of humans and chimpanzees. In this article, the molecular clock has been assumed, which can also be used to root the H-C-G tree. Under the clock, T1, T2, or T3 is the ML tree if ni1, ni2, or ni3 is the greatest among the three, respectively (![]()
![]()
1 and N1.
To understand the difference between the tree-mismatch method and ML, note that three aspects of the data are ignored by the tree-mismatch method and accounted for by ML: (i) uncertainty in the estimated gene tree due to the finite number of nucleotide sites at the locus, (ii) unresolved loci (ties), and (iii) branch lengths in the gene tree reflected in the sequence divergences. While all three probably contribute to the large differences discussed above, uncertainty in the estimated gene tree seems to be the predominant reason. A more "proper" tree-mismatch method should equate the observed proportion of mismatches (
) not to PSG but to PSE, the probability of a mismatch between the species tree and the estimated gene tree. This probability is given by
![]() |
(12) |
where f is the probability of data Di = {ni0, ni1, ni2, ni3, ni4} in Equation 8, and the summation is over all data configurations in which the ML tree for the locus is either T2 or T3 (Fig 1). Unlike PSG, PSE is dependent on all four parameters,
0,
1,
0, and
1, as well as the sequence length ni and appears no easier to calculate than the full likelihood (Equation 9). Instead I use Monte Carlo simulation to calculate those probabilities to assess the impact of errors in gene tree reconstruction on the difference between PSG and PSE. The MLEs of parameters in Table 2 ("constant rate") are used to generate gene trees, which are used to "evolve" sequences. The sites are counted to obtain the data Di = {ni0, ni1, ni2, ni3, ni4}, which are used to estimate the gene tree by ML.
Fig 3 shows the probabilities that the species tree (S), the gene tree (G), and the estimated gene tree (E) differ from each other. The probability of a mismatch between the species tree and the gene tree is PSG = 2(1 -
)/3 = 0.0739, much lower than the observed mismatch proportion
= 0.367. The probability of a mismatch between the species tree and the estimated gene tree is higher. With 466 sites (the average across the 53 loci, Table 1), PSE = 0.2028, which is 2.7 times as high as PSG (Fig 3). Now consider the four gene trees T0, T1, T2, and T3 (Fig 1), which occur with probabilities f(T0) =
= 0.8892 and f(T1) = f(T2) = f(T3) = (1 -
)/3 = 0.0369 (Equation 1). According to the simulation, the probability that the topology of the gene tree is incorrectly reconstructed when the true gene tree is T0, T1, T2, or T3 is P0 = 0.1453 and P1 = P2 = P3 = 0.1981. Note that for the estimated gene tree, we consider only the topology and disregard its divergence times relative to the speciation times. The average error probability of gene tree reconstruction is thus PGE =
3i=0f(Ti)Pi = 0.8892 x 0.1453 + 3 x 0.0369 x 0.1981 = 0.1512 (see Fig 3). When gene trees T0 or T1 are incorrectly reconstructed, the estimated gene tree will always be a mismatch with the species tree; such errors will cause an overcount of f(T0)P0 + f(T1)P1 = 0.1366. Conversely, when gene trees T2 or T3 are incorrectly reconstructed, the estimated tree will not be counted as a mismatch one-half of the time, so the undercount is f(T2)P2/2 + f(T3)P3/2 = f(T2)P2 = 0.0074. The difference between those two error rates gives rise to the net error due to gene tree reconstruction: PSE - PSG = f(T0)P0 =
P0 = 0.1290. The above argument suggests that ignoring errors in gene tree reconstruction always causes overestimation of the mismatch between the species tree and the gene tree and leads to overestimation of the ancestral population size N1. It is interesting that the bias in the tree-mismatch method is caused by reconstruction errors for gene tree T0 alone and thus can be reduced if
is reduced, for example, if the two speciation events are very close or if the ancestral population size N1 is large. Obviously factors that reduce the reconstruction error P0, such as longer sequences (Fig 3) and higher mutation rates, will reduce the bias as well.
|
| THE BAYES APPROACH USING MCMC |
|---|
A Bayes approach is implemented under the same model, using MCMC. As parameters
= {
0,
1,
0,
1} are all positive, I use independent gamma distributions as the prior. The gamma density is
![]() |
(13) |
with mean
/ß and variance
/ß2. The hyperparameters
and ß are chosen to reflect the range and likely values of the parameters.
Instead of the coalescent times t0 and t1, which have different definitions in different gene trees (Fig 1), branch lengths b0 and b1 in the gene tree are used in the MCMC. The joint prior distributions of the gene tree T0 and its branch lengths b0 and b1 given
can be easily derived from the distributions of the coalescent times t0 and t1 (Equation 2),
![]() |
(14) |
for
1 < b1 <
1 +
0 and
1 +
0 - b1 < b0 <
.
Similarly, the joint prior distribution of gene tree Tk, k = 1, 2, 3, and its branch lengths b0 and b1 given
is
![]() |
(15) |
with 0 < b0 <
and
1 +
0 < b1 <
.
The variables to be updated in the Markov chain include the parameters
= {
0,
1,
0,
1} and the gene trees and branch lengths at all L loci, G = {Ti, bi0, bi1}, i = 1, 2, ... , L. The Markov chain is constructed so that its steady-state distribution is the posterior distribution of those variables. Bayes inference is then based on the joint posterior distribution
![]() |
(16) |
The denominator f(D) is the marginal probability of the data
![]() |
(17) |
where the integral over G represents summation over the four gene trees (T0, T1, T2, T3 in Fig 1) and integration over branch lengths in each tree. The posterior distribution of any parameter is then given by integrating over the joint posterior distribution. For example,
![]() |
(18) |
A Metropolis-Hastings algorithm (![]()
![]()
, G), a new state (
*, G*) is proposed through a proposal distribution q(
*, G*|
, G). The new state is then accepted with probability
![]() |
(19) |
If the new state is accepted, the chain moves to the new state (
*, G*). Otherwise the chain stays in the old state (
, G). Note that f(D) in Equation 16 cancels in calculation of the acceptance ratio R. Calculation of f(
*, G*|D)/f(
, G|D) is straightforward due to the conditional independence in the model as described above. So the focus here is the proposal mechanism and the proposal ratio q(
, G|
*, G*)/q(
*, G*|
, G).
The proposal density q can be rather flexible as long as it specifies an aperiodic and irreducible Markov chain. The algorithm I implemented cycles through several steps, with each step updating some but not all variables. In step 1, the gene tree and branch lengths at each locus i (Ti, bi0, bi1) are updated, while parameters
are fixed. Each locus is updated once in this step. Step 2 updates parameters
while the branch lengths {bi0, bi1} are fixed. This step can cause changes to the gene trees at some loci. Step 3 is a mixing step, in which parameters
0,
1,
0,
1 and branch lengths at all loci are multiplied by a constant while the gene trees remain unchanged. The MCMC algorithm is tedious and the details are given in the Appendix.
The Markov chain is started from a random initial state. Sampling starts after a certain number of generations, which are discarded as burn-in, and samples are taken every certain number of steps, thus "thinning" the chain. Convergence of the chain is checked by examining the output and also by running multiple chains. The algorithm is also run without sequence data, and the posterior distribution generated is found to be close to the prior.
Application to the data of Chen and Li:
The Bayes MCMC algorithm is applied to the data of ![]()
and ß in the gamma prior distributions are chosen by considering likely values and ranges of ancestral population sizes and species divergence times and converting them into parameters
0,
1,
0, and
1 using a generation time of 20 years and a mutation rate of 10-9 substitutions per site per year. The first set is considered more realistic and referred to as the "good priors" in Table 3. Ancestral population sizes N0 and N1 are centered
12,500, close to estimates for modern humans, with the 2.5 and 97.5% percentiles at 1500 and 34,800, respectively. The divergence time for humans and chimpanzees is centered
5 MY, while the divergence time for the gorilla is centered
1.6 MY. Note that parameters
0,
1,
0, and
1 are all <<1 but are definitely >0; thus values of
>1 are used so that the gamma distribution has a mode >0.
|
The posterior distributions of parameters
0,
1,
0, and
1 are shown in Fig 4 together with their priors. They are also summarized in Table 3 (good priors). The means of the posterior distributions for
0,
1,
0, and
1 are listed, and then the means and the 95% credibility sets for the two population sizes (N0 and N1) and for the two speciation times are listed. The posterior means and medians are close. The population size for the ancestor of humans and chimpanzees is estimated to be 12,400, with the 95% credibility interval (CI) to be (1700, 32,100). The H-C divergence is dated at 5.3 MY, with the 95% CI to be (4.4, 6.1). The estimates of
1 and
1 are very similar to the MLEs. The posterior mean of
0 is smaller and that of
0 is larger than the MLEs (Table 2 and Table 3). The correlation coefficients calculated from the posterior distributions of parameters
0,
1,
0, and
1 are shown in Table 4. There is strong negative correlation between
0 and
0 and between
1 and
1. Comparison of the prior and posterior distributions (Fig 4) suggests that the data contain much more information about the divergence times, especially the H-C divergence time (
1), than about the population sizes.
|
|
To see the effect of prior assumptions on the posterior distributions, I used a second set of priors, which are more spread out and also assume large population sizes (mean 62,500). The posterior distributions are summarized in Table 3 under the heading "Poor priors." The point estimates of both N0 and N1 are
33,000, smaller than the prior means. The H-C divergence is dated at 4.6 MY, and the gorilla divergence is dated 1.9 MY earlier. Those estimates appear reasonable, although the H-C divergence date is too recent. Similar strong correlations between the parameters are discovered as in the analysis using the good priors. The negative correlation between
1 and
1 (calculated to be -0.76), combined with the assumed and estimated large population sizes, appears to have led to a H-C divergence date that is too recent.
| DISCUSSION |
|---|
ML and Bayes methods of this article estimated the population size for the common ancestor of humans and chimpanzees to be
12,000, similar to estimates for modern humans. The estimates are several times smaller than those obtained by ![]()
![]()
![]()
![]()
While the ML and Bayes methods are expected to have better statistical properties than the simple tree-mismatch method, it is worthwhile to examine some of the assumptions made in this article. First, the evolutionary rate is assumed to be constant over lineages. This assumption seems reasonable as the species compared are very closely related; CHEN and LI's (2001) relative-rate tests supported the molecular clock. The large differences between the tree-mismatch method and the likelihood and Bayes methods are clearly not due to the use of the clock assumption in this article; use of clock rooting in the tree-mismatch method produced even larger estimates of the population size for the ancestor of humans and chimpanzees. Second, the analysis assumes no recombination within a locus. The effect of recombination on estimation of parameters
0,
1,
0, and
1 is not well understood, although ![]()
![]()
![]()
The ML and Bayes methods produced similar results for the data analyzed in this article. However, the ML calculation is slower than the MCMC algorithm. The Bayes approach also provides a framework for incorporating prior information about the parameters. For example, a wealth of information is available about the divergence time between humans and chimpanzees. By forcing a very narrow prior distribution for
1, such information can be incorporated in the Bayes analysis. Using an informative prior will reduce the adverse effect of strong correlation among parameters when other parameters are estimated. Furthermore, the Bayes algorithm seems easier than ML to extend to data that contain more than three species and more than one individual from each species.
Program availability:
C programs implementing the MCMC algorithm and calculating the mismatch probabilities (PSG, PSE, and PGE, etc.) are available from the author upon request. The C and Mathematica programs for the likelihood method are available as well, but they make use of the Mathlink library and are awkward to use.
| ACKNOWLEDGMENTS |
|---|
I am very grateful to Drs. W.-H. Li and F.-C. Chen for providing the data analyzed in this article. I thank M. Hasegawa and B. Larget for discussions and N. Takahata for comments. This study is supported by grant 31/G13580 from the Biotechnology and Biological Sciences Research Council (United Kingdom).
Manuscript received April 18, 2002; Accepted for publication September 6, 2002.
| APPENDIX |
|---|
IMPLEMENTATION OF THE MCMC ALGORITHM
Updating the gene tree and branch lengths at each locus:
This step of the algorithm goes through the L loci to update the branch lengths and possibly the gene tree as well. The algorithm visits each locus once. The description in this section concerns one locus i. The subscript i in bi0 and bi1 may be suppressed. The proposal algorithm for gene trees T2 and T3 is simpler and is described first, before the algorithm for gene trees T0 and T1 (Fig 1).
If the current gene tree is either T2 or T3, only the branch lengths are updated and the gene tree is not changed. Note that b0 and b1 have to satisfy the requirements 0 < b0 <
and
0 +
1 < b1 <
. A sliding window is used to propose new branch lengths,
![]() |
(A1) |
where U(a, b) is the uniform distribution in the interval (a, b). The window size is set at w = H1, with H1 a small fine-tuning parameter. If the proposed value is outside the range of the variables, the excess is reflected back into the range; that is, if b*0 < 0, then b*0 is reset to -b*0, and if b*1 <
0 +
1, then b*1 is reset to 2(
0 +
1) - b*1. The proposal ratio is 1. The acceptance ratio is
![]() |
(A2) |
If the current gene tree is either T0 or T1 (Fig 1B and Fig C), a change between those two trees is allowed when the branch lengths are updated. Both the current and the new branch lengths should satisfy the constraints
1 < b1 <
and
0 +
1 < b0 + b1 <
. To propose new branch lengths under those constraints, I apply the transformation
![]() |
(A3) |
with y0 > 0 and 0 < y1 < 1. Note that y0 is the distance between the root of the gene tree (node A in Fig 1) and the first speciation event (C in Fig 1), and changing y0 will shrink or expand the height of the gene tree, while y1 is the ratio of distance BD to AD, and changing y1 will slide node B between A and D (Fig 1B and Fig C).
New states are proposed for y0 and y1 as
![]() |
(A4) |
where r is a random variable from U(0, 1). If y*1 is out of the range (0, 1), it is reflected back into the range. The new branch lengths b*0 and b*1 are calculated from the relationships
![]() |
(A5) |
If b*1 >
0 +
1, the new gene tree T*i is set to T1; otherwise it is set to T0. This change of gene tree does not change the proposal ratio. The proposal ratio for variables y0 and y1 is y*0/y0. The proposal ratio for the original variables b0 and b1 can be derived by noting

Thus the acceptance ratio is
![]() |
(A6) |
If the gene tree is T1, T2, or T3, the algorithm may (with a small probability of, say, 0.2 or 0.3) attempt to swap the gene trees. The current gene tree is replaced by one of the other two trees chosen at random. The proposal ratio is 1, and the acceptance ratio is
![]() |
(A7) |
Updating population size and speciation date parameters:
This step makes two proposals: the first to change
0 and
1 and the second to change
0 and
1. Parameters
0 and
1 are positive but are not constrained otherwise. They are updated simultaneously, with all other variables fixed. New values are proposed around the current values by
![]() |
(A8) |
where r0 and r1 are uniform random numbers in the interval (0, 1) and H2 is a small fine-tuning parameter. The proposal ratio is
*0
*1/(
0
1) and the acceptance ratio is
![]() |
(A9) |
Next, speciation date parameters
0 and
1 are updated while
0 and
1 as well as branch lengths bi0 and bi1 for all loci are fixed. At each locus the following constraints have to be satisfied: 0 <
1 < b1 <
0 +
1 < b0 + b1 <
in gene tree T0 and 0 < b0 <
and
0 +
1 < b1 <
in gene tree T1, T2, or T3 (Fig 1). To allow the chain to move more freely, the gene tree is allowed to change, if necessary, from T0 to T1, T2, or T3 or vice versa. Thus only the following constraints have to be satisfied when new values are proposed for
0 and
1: 0 <
1 < b1 and
0 +
1 < b0 + b1 at every locus; that is,
1 < min{bi1} = c and
1 <
0 +
1 < min{bi0 + bi1} = d. The following transformation is used to propose new states,
![]() |
(A10) |
with constraints 0 < y0 < 1 and 0 < y1 < c. Note that y0 is the ratio of the distance CD to AD in Fig 1, and changing y0 will slide the speciation date C (Fig 1) between A and D. Note that
0 = y0(d -
1),
1 = y1, and

Sliding windows are used to propose new values,
![]() |
(A11) |
where H3 is a small fine-tuning parameter. If the new values y*0 and y*1 are out of the range, they are reflected back into the range. The proposal ratio for the transformed variables y0 and y1 is 1. The proposal ratio for variables
0 and
1 is (d -
*1)/(d -
1). Next, all loci are scanned to see whether the gene tree needs updating. If the current gene tree is T0 but
*0 +
*1 < bi1, the gene tree T*i is set to be one of T1, T2, or T3, chosen at random. The proposal ratio will be multiplied by 3 since there are three trees to move to and only one tree to move back. If the current tree is T1, T2, or T3 but
*0 +
*1 > bi1, the gene tree is set to T*i = T0, and the proposal ratio is divided by 3. Otherwise the gene tree for the locus remains unchanged: T*i = Ti. In sum the acceptance ratio is
![]() |
(A12) |
where f(
*0,
*1)/f(
0,
1) = (
*0
*1/
0
1)
-1e-ß(
*0-
0+
*1-
1) from the gamma priors and cT is the proposal ratio due to changes to gene trees at some loci (a product of threes and one-thirds).
Mixing step:
A mixing step is found to be effective in speeding up convergence when a poor starting point is chosen for the chain. In this step, the gene trees remain unchanged, but parameters
0,
1,
0, and
1 and branch lengths bi0 and bi1 for all loci are multiplied by a constant
![]() |
(A13) |
where r is a random number from U(0, 1) and H4 is a small fine-tuning parameter. The proposal ratio is c4+2L. The acceptance ratio is

where f(
*)/f(
) = c4(
-1)e-ß(
*0-
0+
*1-
1+
*0-
0+
*1-
1),
given by the gamma prior distributions.
Performance of the algorithm:
The performance of the MCMC algorithm is noted to depend on the choice of priors (values of
and ß in the gamma distributions). For some priors, the algorithm is noted not to mix well, and in particular, parameters
0 and
0 appear to change slowly. The high correlation among parameters and the constraints seem to cause difficulties for the algorithm. In such cases, the chain has to be run much longer than usual to achieve stable estimates. In proposing new values for
0 and
1, only small steps are taken (with H3 in the range 0.010.05) to achieve an acceptance rate of
0.10.3. For other variables, even large steps (with H1, H2, H4 in the range 0.10.5) are accepted at high frequencies (>50%). The mixing step seems rather effective so that
1000 generations seem enough for the burn-in. For some priors, <500,000 generations appear sufficient, which takes a few minutes on a Pentium III PC at 1.2 GHz for the data of ![]()
| LITERATURE CITED |
|---|
CHEN, F.-C. and W.-H. LI, 2001 Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am. J. Hum. Genet. 68:444-456.[Medline]
EDWARDS, S. V. and P. BEERLI, 2000 Perspective: gene divergence, population divergence, and the variance in coalescence time in phylogeographic studies. Evolution 54:1839-1854.[Medline]
FELSENSTEIN, J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.[Medline]
FU, Y.-X., 1994 A phylogenetic estimator of effective population size or mutation rate. Genetics 136:685-692.[Abstract]
HACIA, J. G., 2001 Genome of the apes. Trends Genet. 17:637-645.[Medline]
HASEGAWA, M., H. KISHINO, and T. YANO, 1985 Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174.[Medline]
HASEGAWA, M., H. KISHINO, and T. YANO, 1987 Man's place in Hominoidea as inferred from molecular clocks of DNA. J. Mol. Evol. 26:132-147.[Medline]
HASTING, W. K., 1970 Monte Carlo sampling methods using Markov chains and their application. Biometrika 57:97-109.
HUDSON, R. R., 1992 Gene trees, species trees and the segregation of ancestral alleles. Genetics 131:509-513.[Medline]



































