- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Email this article to a friend
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Rannala, B.
- Articles by Yang, Z.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Rannala, B.
- Articles by Yang, Z.
Bayes Estimation of Species Divergence Times and Ancestral Population Sizes Using DNA Sequences From Multiple Loci
Bruce Rannalaa and Ziheng Yangba Department of Medical Genetics, University of Alberta, Edmonton, Alberta T6G 2H7, Canada
b Galton Laboratory, Department of Biology, University College London, London WC1E 6BT, England
Corresponding author: Ziheng Yang, Darwin Bldg., Gower St., London WC1E 6BT, England., z.yang{at}ucl.ac.uk (E-mail)
Communicating editor: N. TAKAHATA
| ABSTRACT |
|---|
The effective population sizes of ancestral as well as modern species are important parameters in models of population genetics and human evolution. The commonly used method for estimating ancestral population sizes, based on counting mismatches between the species tree and the inferred gene trees, is highly biased as it ignores uncertainties in gene tree reconstruction. In this article, we develop a Bayes method for simultaneous estimation of the species divergence times and current and ancestral population sizes. The method uses DNA sequence data from multiple loci and extracts information about conflicts among gene tree topologies and coalescent times to estimate ancestral population sizes. The topology of the species tree is assumed known. A Markov chain Monte Carlo algorithm is implemented to integrate over uncertain gene trees and branch lengths (or coalescence times) at each locus as well as species divergence times. The method can handle any species tree and allows different numbers of sequences at different loci. We apply the method to published noncoding DNA sequences from the human and the great apes. There are strong correlations between posterior estimates of speciation times and ancestral population sizes. With the use of an informative prior for the human-chimpanzee divergence date, the population size of the common ancestor of the two species is estimated to be
20,000, with a 95% credibility interval (8000, 40,000). Our estimates, however, are affected by model assumptions as well as data quality. We suggest that reliable estimates have yet to await more data and more realistic models.
THE (effective) population size N is a central parameter in models of population genetics, conservation genetics, and human evolution. For example, the amount of genetic variation in a population is determined by
= 4Nµ, where µ is the mutation rate per site per generation. When an independent estimate of the mutation rate is available, we can use the estimate of
to infer the population size N. Estimation of
or N of a modern species is relatively simple. The population size of modern humans has been estimated to be
10,000 (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
Computation-intensive MCMC algorithms are increasingly being used in inference in molecular population genetics (see ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
| THEORY |
|---|
Data and model parameters:
The data consist of aligned homologous DNA sequences at multiple neutral loci sampled from present-day species. The model and implementation apply to any species tree. As an example, we focus on the case of the great apes: human (H), chimpanzee (C), gorilla (G), and orangutan (O). The topology of the species tree, (((HC)G)O), is assumed known and fixed in the analysis (Fig 1). The number of sequences sampled may differ among loci. Let D = {Di} be the entire data set, where Di represent the sequence alignment at locus i, with i = 1, 2, ... , L for a total of L loci. We expect the method to be applied to closely related species only and assume the molecular clock, that is, rate constancy among lineages. Furthermore, we assume random mating in each population and no gene flow after species divergences. We also assume no recombination within a locus and free recombination between loci.
|
Parameters in the model include the species divergence times as well as the ancestral and current population sizes. The population size of a current species is considered only if more than one individual is sampled from that species at some loci. Because time and rate are confounded in the data, both divergence times and population sizes are multiplied by the mutation rate. Thus parameters in the model for the example of Fig 1 include the three divergence times
HC,
HCG, and
HCGO and population size parameters
H for humans;
C for chimpanzees; and
HC,
HCG, and
HCGO for the three ancestral species. The divergence times (
's) are measured by the expected number of mutations per site from the ancestral node in the species tree to the present time (Fig 1). Collectively we let
= {
H,
C,
HC,
HCG,
HCGO,
HC,
HCG,
HCGO} denote all parameters in the model to be estimated.
Bayes estimation of parameters:
The Bayes hierarchical model has two main components: the prior distribution of the parameters and the likelihood, i.e., the probability of the data given the parameters. We use independent gamma distributions as priors for
's. The gamma density is
![]() |
(1) |
with mean
/ß and variance
/ß2. The hyperparameters
and ß are chosen by the user to reflect the range and likely values of the parameters.
For parameters
's, we used independent gamma priors for the time gaps (interarrival times) on the species tree. For example, for the species tree of Fig 1, (
HCGO -
HCG), (
HCG -
HC), and
HC are assumed to have independent gamma distributions, and the prior f(
) is a product of the independent gamma densities. We also implemented an option of specifying gamma priors for the node ages (
HCGO,
HCG, and
HC for the tree of Fig 1), but the prior means are not given by
/ß anymore; because of the constraints on node ages (for example,
HCGO >
HCG >
HC), the joint gamma distribution is truncated. The MCMC always updates the node ages (
's) and not time gaps.
The gene genealogy Gi at each locus i is represented by the tree topology Ti and the coalescent times ti. Given parameters
, the probability distribution of Gi = {Ti, ti} is specified by the coalescent processes under the model. This is described in the next section. Let G = {Gi}. We have
![]() |
(2) |
The probability of data Di given the gene tree and coalescent times (and thus branch lengths) at the locus, f(Di|Gi), is the traditional likelihood in molecular phylogenetics and can be calculated using any Markov model of nucleotide substitution (![]()
![]()
![]() |
(3) |
Bayes inference is based on the joint conditional distribution
![]() |
(4) |
For example, the posterior density of
is given by
![]() |
(5) |
where the integration represents summation over all possible gene tree topologies and integration over the coalescent times at each locus.
We construct a Markov chain, whose states are (
, G) and whose stationary distribution is f(
, G|D). A Metropolis-Hastings algorithm (![]()
![]()
, G), a new state (
*, G*) is proposed through a proposal density, q(
*, G*|
, G), and is accepted with probability
![]() |
(6) |
If the new state is accepted, the chain moves to the new state (
*, G*). Otherwise the chain stays in the old state (
, G). The challenge of the MCMC algorithm and the focus of this article is to derive the prior distribution, f(G|
), and to implement an efficient proposal algorithm and calculate the proposal ratio, q(
, G|
*, G*)/q(
*, G*|
, G).
The proposal density q can be rather flexible as long as it specifies an aperiodic and irreducible Markov chain. The algorithm we implemented cycles through several steps, with each step updating some variables. Step 1 changes the age of an internal node in each gene tree without changing the gene tree topology or speciation times. Step 2 cycles through all loci and, at each locus, changes the gene tree topology by pruning a subtree and then regrafting it back onto a feasible branch. Step 3 updates the
's. Step 4 cycles through all speciation times (
's) in the species tree and modifies each. This step also uses a "rubber-band" algorithm to jointly modify the ages of nodes in each gene tree such that the coalescence times on the gene trees remain compatible with the modified species divergence times. Step 5 is a mixing step, in which all coalescent times in the gene trees and all species divergence times are multiplied by the same constant. The details of the algorithm are given in the Appendix.
Distribution of the gene genealogy derived from censored coalescent processes:
The prior probability, f(Gi|
), of any gene tree and its coalescent times at a locus are specified by the coalescent processes in the different populations in the species tree. The theory applies to any gene tree, but is best explained with an example, for which we use the gene tree of Fig 1. Five populations, H, C, HC, HCG, and HCGO, are considered. We use HC to represent the population ancestral to H, and C. ![]()
![]()
) = f(Ti, ti|
) for three species by considering the marginal probability of the tree topology Ti and the conditional distribution of the coalescent times given the topology; that is, f(Ti, ti|
) = f(Ti|
) f(ti|Ti,
). This strategy is not workable for larger species trees because of the increased number of tree topologies and the high dimension of the integral in deriving f(Ti|
). Here we derive the joint distribution f(Ti, ti|
) directly.
Note that two sequences from different species can coalesce only in populations that are ancestral to the two species. For example, sequences H1 and G can coalesce in populations HCG or HCGO, but not in populations H or HC. The coalescent processes in different populations are independent. For each population, we trace the genealogy backward in time, until the end of the population at time
, and record the number of lineages (m) entering the population and the number of lineages leaving it (n). For example, m = 3, n = 2, and
=
HC, for population H (Table 1). Such a coalescent process may be termed a censored coalescent process since the process is terminated before it is complete. When n > 1, the genealogical tree in the population consists of n disconnected subtrees or lineages.
|
Within each population, we measure time in units of 2N generations and further multiply time by the mutation rate. With this scaling, coalescent times are measured by the expected number of mutations per site, and any two lineages in the sample coalesce at the rate
/2 (![]()
![]() |
(7) |
If n > 1, we have to consider the probability that no coalescent event occurred between the last coalescent event and the end of the population at time
, that is, during the time interval
- (tm + tm-1 + ... + tn+1). This probability is exp{-(n(n - 1)/
)[
- (tm + tm-1 + ... + tn+1)]} and is 1 if n = 1. In addition, to derive the probability of a particular gene tree topology in the population, note that if a coalescent event occurs in a sample of j lineages, the probability that a particular pair of lineages coalesce is
j = m, m - 1, ... , n + 1.
Multiplying those probabilities together, we obtain the joint probability distribution of the gene tree topology in the population and its coalescent times tm, tm-1, ... , tn+1 as
![]() |
(8) |
The probability of the gene tree and coalescent times for the locus is the product of such probabilities across all the populations. Thus, for the gene genealogy of Fig 1, we have
![]() |
(9) |
| APPLICATION TO HOMINOID DATA |
|---|
Data:
We apply the new method to the following data, all composed of noncoding regions. Noncoding regions are preferable for this kind of analysis as they are likely to be evolving neutrally, less affected by background selection than, for example, silent sites in coding regions.
-
CHEN and LI 2001 sequenced one individual from each of the four species, human, chimpanzee, gorilla, and orangutan, at 53 independent noncoding loci (contigs), with
500 bp at each locus. Chen and Li's analysis using the tree-mismatch method estimated the population size for the common ancestor of humans and chimpanzees to be from 52,000 to 150,000. Maximum-likelihood (ML) analysis of the same loci using only the H-C-G sequences suggested smaller estimates of
12,00021,000 (YANG 2002 ). Here we use the data from all four species.
-
YU et al. 2001 sequenced
10 kb at the region 1q24 from 61 humans, one chimpanzee, one gorilla, and one orangutan. This region was intended to be noncoding but was discovered to contain four exons (of 115, 155, 138, and 151 nucleotides long), which are removed before analysis. -
MAKOVA et al. 2001 sequenced a region of
6.6 kb at 16q24.3, located upstream from the melanocortin 1 receptor gene and containing its promoter, from 54 humans, one chimpanzee, one gorilla, and one orangutan. The orangutan sequence was incomplete and unavailable from GenBank. Only the human, chimpanzee, and gorilla sequences are used. -
ZHAO et al. 2000 sequenced
10 kb in the region 22q11.2 from 64 humans, one chimpanzee, and one orangutan. One human sequence (AF291608) appears to be corrupted, so only 63 human sequences are used.
In all data sets, most alignment gaps occur at the ends of the sequence and probably represent undetermined nucleotides. The three large loci (![]()
![]()
![]()
![]()
Two analyses are performed. The first estimates the ancestral population size and speciation time parameters
HC,
HCG,
HCGO,
HCGO,
HCG, and
HC, initially using the data set of ![]()
H is also estimated. The results are presented in Table 2 under the headings "53 loci" and "56 loci," respectively. The second analysis uses only the human sequences at the three loci (![]()
![]()
![]()
H and tMRCA, the time to the most recent common ancestor in the sample. The results are presented in Table 3.
|
|
Estimation of ancestral population sizes and speciation times:
The gamma priors for the ancestral population size and speciation time parameters are specified on the basis of our expectations about those parameters (Table 2). For easy comparison, the same priors are used for parameters
HC,
HCG, (
HCG -
HC), and
HC as in ![]()
HCGO and (
HCGO -
HCG) are new. The gamma parameter
is chosen to be >1, so that the distribution peaks at a positive value instead of 0. The prior for each
has the mean 0.001 and 95% of the density is in the interval (0.00012, 0.00279). We assume a generation time of g = 20 years and a neutral mutation rate of 10-9 mutations/site/year. Thus the population sizes have a prior mean of 12,500 with the 95% interval (1500, 34,800). The mean speciation times in the prior are 5 million years (MY) before present, 6.6 MY, and 14 MY for the H-C, HC-G, and HCG-O divergences, respectively.
We use 10,000 iterations as the burn-in and then take 1,000,000 samples, sampling every two iterations. The results are presented in Table 2 and Table 3. The posterior distribution for
HC from the 53-loci data (![]()
![]()
![]()
4.8 MY with the 95% CI (4.1, 5.5). Those estimates seem too young, as current opinion appears to favor a date as old as 7 MY (![]()
1.2 MY with the 95% CI (0.47, 2.11), smaller than the estimates from the H-C-G sequences only (![]()
HC and
HC that are almost identical to estimates from the H-C-G sequences (results not shown). In sum, inclusion of the orangutan has led to more recent estimates of speciation times and to large estimates of ancestral population sizes. The reasons for the differences are unclear, but they are not due to exclusion of alignment gaps in the analysis of ![]()
We note strong negative correlations in the posterior density between
's and
's, especially between
and
for the population representing the root of the species tree (Table 4). For example, the correlation between
HCGO and
HCGO is -0.75. The joint posterior density for
HCGO and (
HCGO -
HCG) is shown in Fig 2, after kernel density smoothing (![]()
|
|
Including the three large loci of ![]()
![]()
![]()
and
in the posterior distribution suggest that estimation of
's is affected by uncertainties in the
's. To alleviate such effects, we used a highly informative prior for
HC with
= 120, ß = 20,000, corresponding to a prior mean of 6 MY for the H-C divergence with the 95% prior interval (5.0 MY, 7.1 MY). The 53-loci data then give the posterior mean 5.3 MY with the CI (4.7 MY, 5.9 MY) for the H-C divergence. The posterior mean and the 95% CI for
HC are 0.0015 (0.0006, 0.0029), which correspond to an HC population size of 19,000 with the CI (7600, 36,600), 0.0034 (0.0022, 0.0048) for
HCG, and 0.0019 (0.0002, 0.0044) for
HCGO. The posterior means and 95% CIs for speciation times are 0.0079 (0.0062, 0.0094) for
HCGO -
HCG and 0.0010 (0.0004, 0.0017) for
HCG -
HC. Those estimates appear more reasonable.
Application of this informative prior for
HC to the 56-loci data had a similar effect of reducing
HC. The posterior mean and 95% CI for
HC are 0.00201 (0.00088, 0.00352), which correspond to a mean NHC of 25,000 with the CI (11,000, 44,000). Estimates for
HC are 0.00481 (0.00430, 0.00535). Estimates for
H are 0.00055 (0.00039, 0.00075), identical to those of Table 2. Estimates of other parameters are 0.00379 (0.00258, 0.00529) for
HCG and 0.00240 (0.00042, 0.00453) for
HCGO. The posterior means and 95% CIs for speciation times are 0.00757 (0.00629, 0.00887) for
HCGO -
HCG and 0.00109 (0.00048, 0.00177) for
HCG -
HC.
Estimation of human population size
H and the tMRCA:
The human sequences at the three large loci (![]()
![]()
![]()
H, and the same gamma prior G(2, 2000) is used as in Table 2, which corresponds to a prior mean of NH = 12,500 with the 95% interval (1500, 34,800). The results are shown in Table 3.
For locus 1q24 (![]()
H. With the generation time g = 20 years and mutation rate µ = 10-9 substitutions/site/year, those estimates correspond to an average long-term human population size of only NH = 4400 with the 95% CI (2100, 7700). ![]()
![]()
H using Watterson's method based on the number of segregating sites (![]()
![]()
![]()
= 6.7/8991 = 0.00074, twice as large as the Bayes mean and outside the 95% CI. With µ = 0.74 x 10-9 used, the population size was estimated to be NH = 12,600 (![]()
1.5 MY, more than three times older than the Bayes estimates.
For locus 16q24.3 (![]()
![]()
![]()
= 13.7/6545 = 0.00209, which is twice as large as the Bayes mean, and NH = 10,000 for the human population size and tMRCA =
1.5 MY.
For locus 22q11.2 (![]()
![]()
varied among methods and were all larger than the Bayes estimates. The population size NH was estimated to be
10,00015,000, which is comparable with the Bayes estimate. However, tMRCA was estimated to be
1.3 MY, with a confidence interval of (0.71, 2.1), much larger than the Bayes estimates.
In sum, the Bayes estimates of
H and tMRCA are considerably smaller than the estimates of ![]()
![]()
![]()
![]()
![]()
![]()
The three human loci are then combined in a Bayes analysis, with a single
H estimated (Table 3). The posterior mean
H = 0.00056 with the 95% CI of (0.00040, 0.00076), corresponding to a population size of NH = 7000 with the 95% CI of (5000, 9500), is an average across the three loci. However, the 95% CI is much narrower than those at individual loci, indicating the increased accuracy in the estimate in the combined analysis. The posterior means and CIs of the tMRCA are similar to estimates from the separate analyses of the three loci (Table 3).
| DISCUSSION |
|---|
Validation of the MCMC algorithm and convergence monitoring:
While MCMC provides a powerful framework for fitting sophisticated multiparameter models to heterogeneous data sets from multiple loci, MCMC algorithms are notoriously difficult to validate. MCMC implementations are notably more difficult to debug than maximum-likelihood programs. For example, in most numerical optimization algorithms for maximum-likelihood estimation, the likelihood always increases. However, an MCMC algorithm is stochastic and we cannot expect any summary statistics to increase or decrease monotonically. Second, a likelihood program converges to a fixed point (or points in the presence of multiple local optima). In contrast, convergence of an MCMC algorithm is to a distribution.
We used several strategies to validate the theory and implementation. For small data sets with only 2 or 3 species, quantities such as the probability of a particular gene tree topology and the expectations of coalescent times in the gene tree were calculated by both MCMC simulation and numerical integration using Mathematica. For larger species trees (with, say, 10 species), the MCMC algorithm was run without data [that is, by fixing f(D|G) = 1], and the resulting posterior distributions of parameters (
) were compared with the prior gamma distributions. In addition, a simulation program was written to generate testing data and to simulate the prior distribution of gene trees and coalescent times to validate the theory and the implementation.
For the purpose of convergence monitoring, we found it useful to run multiple chains and to monitor the values of parameters and the log likelihood over iterations (e.g., ![]()
Sampling strategies and accuracy of parameter estimation:
A small simulation study was conducted to evaluate the information content in the data sets. Data are simulated using the species tree (((HC)G)O) with the parameter values
HC = 0.001,
HCG = 0.001,
HCGO = 0.001,
HCGO = 0.014,
HCG = 0.0066, and
HC = 0.005. The prior of Table 2 is used in the Bayes analysis. The fact that the prior means are equal to the true values of parameters suggests that the results are best-time results. Clearly the method will perform well if we have long sequences to reduce sampling errors in the gene tree and branch lengths (coalescent times) at each locus and also many loci to average over stochastic variations in the coalescent process among loci. However, given the total combined sequence length, it is not obvious whether it is better to have a few long sequences or many short ones. Thus we simulated a few cases in which the number of loci, L, and the number of nucleotides in the sequence at each locus, C, vary but the total sequence length from each species is fixed at L x C = 100,000 (Table 5). The average posterior means of parameters (not shown) were found to be close to the true values of parameters in all the cases considered. We examined the size of the 95% credibility interval as an indication of the information content in the data. Surprisingly the results suggest very little difference among the different strategies (Table 5). Comparison of the posterior CIs with the prior intervals suggests that speciation times (
's) are, in general, well estimated, with their CIs reduced by three to seven times. Parameters
are estimated less well, with about twice the reduction in the CI for
HC and
HCG, while
HCGO is the most poorly estimated parameter.
|
Population size of the human-chimpanzee common ancestor:
The human-chimpanzee ancestral population size, NHC or
HC, has been of much recent interest (![]()
![]()
10,000 (e.g., ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
However, the reliability of our estimates is affected both by the assumptions made in our model and by the quality of the data. We assumed that the evolutionary rate is the same both among sites within each locus and among different loci. Within-locus rate variation is not expected to be important because its effect is mainly on correction for multiple hits and because the sequences used are highly similar. This assumption can be relaxed, although at greater computational cost. Rate variation among loci should have a greater effect on estimation of ancestral population sizes (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
Program availability:
The program MCMCcoal, for Bayes MCMC analysis under coalescent models, is available at http://abacus.gene.ucl.ac.uk/software/MCMCcoal/. This can also be used to simulate sequence data sets.
| ACKNOWLEDGMENTS |
|---|
This work was funded by a Biotechnology and Biological Sciences Research Council grant (31/G13580) to Z.Y. and grants from the National Institutes of Health (HG01988), the Canadian Institute of Health Research (MOP44064), the Peter Lougheed Foundation, and the Alberta Heritage Foundation for Medical Research to B.R.
Manuscript received December 4, 2002; Accepted for publication April 18, 2003.
| APPENDIX |
|---|
PROPOSAL STEPS IN THE MCMC
The MCMC algorithm implemented in this article involves several proposal steps, each of which updates some parameters in the Markov chain. The main problem is to combat the constraints posed by the speciation times while updating coalescent times in the gene tree and vice versa. The algorithm is tedious. The details follow.
Step 1. Updating the coalescent times at internal nodes in a gene tree:
This step cycles through all loci and, for each locus, through all internal nodes in the gene tree to propose changes to the node ages (coalescent times). The age of only one internal node is changed at a time, and the gene tree topology remains intact. First the lower and upper bounds for the new age are determined by examining the current values of speciation dates and the ages of the mother and daughter nodes in the gene tree. A sliding window is then used to propose the new age; that is,
![]() |
(A1) |
where U(a, b) is a random variable from the uniform distribution in the interval (a, b), and
1 is the window size, which is adjustable. If the new age is outside the feasible range, the excess is reflected back into the interval. As there are always the same number of routes from tj to t*j as from t*j to tj, the proposal ratio is 1. The acceptance ratio is
![]() |
(A2) |
Step 2. Subtree pruning and regrafting in the gene tree:
This step cycles through nodes in each gene tree (except the root), removes the subtree represented by the node (which includes the node and all its descendent nodes), and then reattaches it to the remaining gene tree (Fig 1A). The step changes the gene tree topology and the coalescent time for the mother node. The feasible range for the new age of the mother node is determined on the basis of the current values of the speciation times and the age of concerned node in the gene tree. Then a random age is chosen by using a sliding window around the current age,
![]() |
(A3) |
where the window size
2 can be fine tuned. If the new age is outside the feasible range, the excess is reflected back. Next the feasible branches in the gene tree at which the mother node can join are counted, and one of them is chosen at random. If m feasible lineages (to which the mother node can be attached) are in the current gene tree and n feasible lineages are in the proposed gene tree, the proposal ratio is n/m. Thus
![]() |
(A4) |
Note that this is the subtree-pruning and regrafting (SPR) algorithm used in phylogenetic tree search (![]()
). A new age t*c is proposed according to Equation A3, which happens to be in population HCG. There are n = 2 feasible branches (ed and eG) that the subtree can attach to at t*c, and one of them is chosen at random. In the current gene tree, the subtree can attach to m = 2 lineages (dH1 and db) at the current age tc.
We also implemented a version of this proposal in which only the tips are pruned and regrafted. This was found to be effective for small gene trees, such as in the data of ![]()
Step 3. Updating population size parameters:
This step updates the population size parameters (
's) one by one. A sliding window is used to propose a new value; that is,
, where
3 is the window size. If
*j < 0, it is reset to -
*j. The proposal ratio is 1, and the acceptance ratio is
![]() |
(A5) |
Step 4. Updating speciation times in the species tree:
This step cycles through the speciation times, that is, ages at internal nodes of the species tree. The age
at any internal node is bounded upward by the speciation time of its mother node and downward by the ages of its daughter nodes. Let the interval be (
L,
U). A sliding window is used to propose a new age,
![]() |
(A6) |
where
4 is the adjustable window size. If the new age is outside the range (
L,
U), the excess is reflected back. To maintain the compatibility of the gene trees with the species tree, we change the ages of the affected nodes in the gene tree at each locus. A node in the gene tree is affected if its age is in the interval (
L,
U) and if it is in the population(s) represented by the concerned node in the species tree or its two daughter nodes.
Our calculation of the new ages for the affected nodes in the gene tree mimics the movements of marks (nodes in the gene tree) on a rubber band when its two ends are fixed (at
L and
U) and when the rubber is held at a fixed point (
) and pulled slightly to one end (Fig 1B). The marks will move relative to the two ends when the rubber expands on one side of the holding point and shrinks on the other. If the node age in the gene tree t >
, the new age is given by (
U - t*)/(
U - t) = (
U -
*)/(
U -
); that is,
![]() |
(A7) |
If the node age t
, the new age is given by (t* -
L)/(t -
L) = (
* -
L)/(
-
L); that is,
![]() |
(A8) |
If the node in the species tree is the root so that
U =
, all node ages are changed relative to the lower bound (Equation A8).
An example is shown in Fig 1B, where
HC of Fig 1 is updated. The range is
L = 0 and
U =
HCG. The ages of nodes a, b, c, and d in the gene tree should be changed as well to maintain compatibility between the gene tree and proposed species tree. Nodes a and b, which are younger than
HC, are repositioned relative to the lower bound
L according to Equation A8, while nodes c and d, which are older than
HC, are repositioned relative to the upper bound
U according to Equation A7.
Suppose m node ages are changed relative to the upper bound
U (using Equation A7) and n node ages are changed relative to the lower bound
L (using Equation A8) across all loci. To derive the proposal ratio, we apply the following transform: y0 =
, yj = (
U - tj)/(
U -
) for each of the m nodes with tj >
, and yk = (tk -
L)/(
-
L) for each of the n nodes with tk <
. Note that only y0 is changed while all other m + n variables yj and yk remain the same in the proposal. The proposal ratio in the transformed variables is 1. The proposal ratio in the original variables is easily derived as ((
U -
*)/(
U -
))m((
* -
L)/(
-
L))n. The acceptance ratio is thus
![]() |
(A9) |
Step 5. Mixing step:
A mixing step is found to be effective in speeding up convergence, especially from a poor starting point. The gene tree topologies remain unchanged, but all parameters in the model (
's and
's) and node ages (coalescent times) in each gene tree are multiplied by a constant
![]() |
(A10) |
where r is a random number from U(0, 1) and
5 > 0 is a small fine-tuning parameter. The proposal ratio is cn, where n is the total number of variables updated. The acceptance ratio is
![]() |
(A11) |
To overcome the strong correlation between parameters
and
(Table 4; see also ![]()
, its strongest correlation with parameters
is found. If that correlation is <0.2,
is not changed. Otherwise
is multiplied by c if the correlation is positive or divided by c if the correlation is negative. Thus with the correlation coefficients of Table 4, all the
parameters are divided by c. The proposal ratio for the modified algorithm is cm-n, where m is the total number of parameters multiplied by c and n is the total number of parameters divided by c.
| LITERATURE CITED |
|---|
BAHLO, M. and R. C. GRIFFITHS, 2000 Inference from gene trees in a subdivided population. Theor. Popul. Biol. 57:79-95.[Medline]
BEERLI, P. and J. FELSENSTEIN, 1999 Maximum-likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetics 152:763-773.
BEERLI, P. and J. FELSENSTEIN, 2001 Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proc. Natl. Acad. Sci. USA 98:4563-4568.
BRUNET, M., F. GUY, D. PILBEAM, H. T. MACKAYE, and A. LIKIUS et al., 2002 A new hominid from the Upper Miocene of Chad, Central Africa. Nature 418:145-151.
CHEN, F.-C. and W.-H. LI, 2001 Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am. J. Hum. Genet. 68:444-456.[Medline]
EDWARDS, S. V. and P. BEERLI, 2000 Perspective: gene divergence, population divergence, and the variance in coalescence time in phylogeographic studies. Evolution 54:1839-1854.[Medline]
FELSENSTEIN, J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.[Medline]
FELSENSTEIN, J., M. KUHNER, J. YAMATO, and P. BEERLI, 1999 Likelihoods on coalescents: a Monte Carlo sampling approach to inferring parameters from population samples of molecular data. IMS Lect. Notes Monogr. Ser. 33:163-185.
FU, Y., 1994 Estimating effective population size or mutation rate using the frequencies of mutations of various classes in a sample of DNA sequences. Genetics 138:1375-1386.[Abstract]
GELMAN, A. and D. B. RUBIN, 1992 Inference from iterative simulation using multiple sequences (with discussion). Stat. Sci. 7:457-511.
GRIFFITHS, R. C. and S. TAVARÉ, 1994 Ancestral inference in population genetics. Stat. Sci. 9:307-319.
HACIA, J. G., 2001 Genome of the apes. Trends Genet. 17:637-645.[Medline]
HASTINGS, W. K., 1970 Monte Carlo sampling methods using Markov chains and their application. Biometrika 57:97-109.
HUDSON, R. R., 1990 Gene genealogies and the coalescent process, pp. 144 in Oxford Surveys in Evolutionary Biology, edited by D. J. FUTUYMA and J. D. ANTONOVICS. Oxford University Press, New York.
JUKES, T. H., and C. R. CANTOR, 1969 Evolution of protein molecules, pp. 21123 in Mammalian Protein Metabolism, edited by H. N. MUNRO. Academic Press, New York.
KAESSMANN, H., V. WIEBE, G. WEISS, and S. PAABO, 2001 Great ape DNA sequences reveal a reduced diversity and an expansion in humans. Nat. Genet. 27:155-156.[Medline]
MAKOVA, K. D., M. RAMSAY, T. JENKINS, and W. H. LI, 2001 Human DNA sequence variation in a 6.6-kb region containing the melanocortin 1 receptor promoter. Genetics 158:1253-1268.
METROPOLIS, N., A. W. ROSENBLUTH, M. N. ROSENBLUTH, A. H. TELLER, and E. TELLER, 1953 Equations of state calculations by fast computing machines. J. Chem. Physiol. 21:1087-1092.
NEI, M., 1987 Molecular Evolutionary Genetics. Columbia University Press, New York.
NIELSEN, R. and J. WAKELEY, 2001 Distinguishing migration from isolation: a Markov chain Monte Carlo approach. Genetics 158:885-896.
RUVOLO, M., 1997 Molecular phylogeny of the hominoids: inferences from multiple independent DNA sequence data sets. Mol. Biol. Evol. 14:248-265.[Abstract]
SILVERMAN, B. W., 1986 Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.
STEPHENS, M. and P. DONNELLY, 2000 Inference in molecular population genetics (with discussions). J. R. Stat. Soc. B 62:605-655.
SWOFFORD, D. L., G. J. OLSEN, P. J. WADDELL and D. M. HILLIS, 1996 Phylogeny inference, pp. 411501 in Molecular Systematics, edited by D. M. HILLIS, C. MORITZ and B. K. MABLE. Sinauer Associates, Sunderland, MA.
TAJIMA, F., 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437-460.
TAKAHATA, N., and Y. SATTA, 2002 Pre-speciation coalescence and the effective size of ancestral populations, pp. 5271 in Developments in Theoretical Population Genetics, edited by M. SLATKIN and M. VEUILLE. Oxford University Press, Oxford.
TAKAHATA, N., Y. SATTA, and J. KLEIN, 1995 Divergence time and population size in the lineage leading to modern humans. Theor. Popul. Biol. 48:198-221.[Medline]
WALL, J. D., 2003 Estimating ancestral population sizes and divergence times. Genetics 163:395-404.
WATTERSON, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7:256-276.[Medline]
WILSON, I. J. and D. J. BALDING, 1998 Genealogical inference from microsatellite data. Genetics 150:499-510.
WILSON, I. J., M. E. WEAL, and D. J. BALDING, 2003 Inference from DNA data: population histories, evolutionary processes and forensic match probabilities. J. R. Stat. Soc. A 166:155-201.
WU, C.-I, 1991 Inferences of species phylogeny in relation to segregation of ancient polymorphisms. Genetics 127:429-435.[Abstract]
YANG, Z., 1997 On the estimation of ancestral population sizes. Genet. Res. 69:111-116.[Medline]
YANG, Z., 2000 Maximum likelihood estimation on large phylogenies and analysis of adaptive evolution in human influenza virus A. J. Mol. Evol. 51:423-432.[Medline]
YANG, Z., 2002 Likelihood and Bayes estimation of ancestral population sizes in hominoids using data from multiple loci. Genetics 162:1811-1823.
YU, N., Z. ZHAO, Y. X. FU, N. SAMBUUGHIN, and M. RAMSAY et al., 2001 Global patterns of human DNA sequence variation in a 10-kb region on chromosome 1. Mol. Biol. Evol. 18:214-222.
ZHAO, Z., L. JIN, Y. X. FU, M. RAMSAY, and T. JENKINS et al., 2000 Worldwide DNA sequence variation in a 10-kilobase noncoding region on human chromosome 22. Proc. Natl. Acad. Sci. USA 97:11354-11358.
This article has been cited by other articles:
![]() |
L. Liu and S. V. Edwards Phylogenetic Analysis in the Anomaly Zone Syst Biol, July 9, 2009; (2009) syp034v1. [Full Text] [PDF] |
||||
![]() |
O. Akerborg, B. Sennblad, L. Arvestad, and J. Lagergren Simultaneous Bayesian gene tree reconstruction and reconciliation analysis PNAS, April 7, 2009; 106(14): 5714 - 5719. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. S. Kubatko, B. C. Carstens, and L. L. Knowles STEM: species tree estimation using maximum likelihood for gene trees under coalescence Bioinformatics, April 1, 2009; 25(7): 971 - 973. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Subramanian Temporal Trails of Natural Selection in Human Mitogenomes Mol. Biol. Evol., April 1, 2009; 26(4): 715 - 717. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Wang and B. Rannala Bayesian inference of fine-scale recombination rates using population genomic data Phil Trans R Soc B, December 27, 2008; 363(1512): 3921 - 3930. [Abstract] [Full Text] [PDF] |
||||
![]() |
Z. Yang Empirical evaluation of a prior for Bayesian phylogenetic inference Phil Trans R Soc B, December 27, 2008; 363(1512): 4031 - 4039. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Liu BEST: Bayesian estimation of species trees under the coalescent model Bioinformatics, November 1, 2008; 24(21): 2542 - 2543. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Burgess and Z. Yang Estimation of Hominoid Ancestral Population Sizes under Bayesian Coalescent Models Incorporating Mutation Rate Variation and Sequencing Errors Mol. Biol. Evol., September 1, 2008; 25(9): 1979 - 1994. [Abstract] [Full Text] [PDF] |
||||
![]() |
I. J. Maureira-Butler, B. E. Pfeil, A. Muangprom, T. C. Osborn, and J. J. Doyle The Reticulate History of Medicago (Fabaceae) Syst Biol, June 1, 2008; 57(3): 466 - 482. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. R. Riddle, M. N. Dawson, E. A. Hadly, D. J. Hafner, M. J. Hickerson, S. J. Mantooth, and A. D. Yoder The role of molecular genetics in sculpting the future of integrative biogeography Progress in Physical Geography, April 1, 2008; 32(2): 173 - 202. [Abstract] [PDF] |
||||
![]() |
H. A. Ross, S. Murugan, and W. L. Sibon Li Testing the Reliability of Genetic Methods of Species Identification via Simulation Syst Biol, April 1, 2008; 57(2): 216 - 230. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. L. Knowles and B. C. Carstens Delimiting Species without Monophyletic Gene Trees Syst Biol, December 1, 2007; 56(6): 887 - 895. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Zhou, K. Zeng, W. Wu, X. Chen, Z. Yang, S. Shi, and C.-I Wu Population Genetics of Speciation in Nonmodel Organisms: I. Ancestral Polymorphism in Mangroves Mol. Biol. Evol., December 1, 2007; 24(12): 2746 - 2754. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. C. Leman, M. K. Uyenoyama, M. Lavine, and Y. Chen The evolutionary forest algorithm Bioinformatics, August 1, 2007; 23(15): 1962 - 1968. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. C. Carstens and L. L. Knowles Estimating Species Phylogeny from Gene-Tree Probabilities Despite Incomplete Lineage Sorting: An Example from Melanoplus Grasshoppers Syst Biol, June 1, 2007; 56(3): 400 - 411. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Liu and D. K. Pearl Species Trees from Gene Trees: Reconstructing Bayesian Posterior Distributions of a Species Phylogeny Using Estimated Gene Tree Distributions Syst Biol, June 1, 2007; 56(3): 504 - 514. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. V. Edwards, L. Liu, and D. K. Pearl High-resolution species trees without concatenation PNAS, April 3, 2007; 104(14): 5936 - 5941. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. F. Baines and B. Harr Reduced X-Linked Diversity in Derived Populations of House Mice Genetics, April 1, 2007; 175(4): 1911 - 1921. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Hey and R. Nielsen Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics PNAS, February 20, 2007; 104(8): 2785 - 2790. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. S. Kubatko and J. H. Degnan Inconsistency of Phylogenetic Estimates from Concatenated Data under Coalescence Syst Biol, February 1, 2007; 56(1): 17 - 24. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Charlesworth and A. Eyre-Walker The Rate of Adaptive Evolution in Enteric Bacteria Mol. Biol. Evol., July 1, 2006; 23(7): 1348 - 1356. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Slatkin and J. L. Pollack The Concordance of Gene Trees and Species Trees at Two Linked Loci Genetics, March 1, 2006; 172(3): 1979 - 1984. [Abstract] [Full Text] [PDF] |
||||
![]() |
W. P. Maddison and L. L. Knowles Inferring Phylogeny Despite Incomplete Lineage Sorting Syst Biol, February 1, 2006; 55(1): 21 - 30. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Nielsen and M. Matz Statistical Approaches for DNA Barcoding Syst Biol, February 1, 2006; 55(1): 162 - 169. [Full Text] [PDF] |
||||
![]() |
Z. Yang and B. Rannala Bayesian Estimation of Species Divergence Times Under a Molecular Clock Using Multiple Fossil Calibrations with Soft Bounds Mol. Biol. Evol., January 1, 2006; 23(1): 212 - 226. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. E. Walsh, I. L. Jones, and V. L. Friesen A Test of Founder Effect Speciation Using Multiple Loci in the Auklets (Aethia spp.) Genetics, December 1, 2005; 171(4): 1885 - 1894. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Y. Yampolsky, F. A. Kondrashov, and A. S. Kondrashov Distribution of the strength of selection against amino acid replacements in human proteins Hum. Mol. Genet., November 1, 2005; 14(21): 3191 - 3201. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. C. Leman, Y. Chen, J. E. Stajich, M. A. F. Noor, and M. K. Uyenoyama Likelihoods From Summary Statistics: Recent Divergence Between Species Genetics, November 1, 2005; 171(3): 1419 - 1436. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. D. Keightley, G. V. Kryukov, S. Sunyaev, D. L. Halligan, and D. J. Gaffney Evolutionary constraints in conserved nongenic sequences of mammals Genome Res., October 1, 2005; 15(10): 1373 - 1378. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Wang Estimation of effective population sizes from data on genetic markers Phil Trans R Soc B, July 29, 2005; 360(1459): 1395 - 1409. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. V Edwards, W Bryan Jennings, and A. M Shedlock Phylogenetics of modern birds in the era of genomics Proc R Soc B, May 22, 2005; 272(1567): 979 - 992. [Abstract] [Full Text] [PDF] |
||||
![]() |
T.-K. Seo, H. Kishino, and J. L. Thorne Incorporating gene-specific variation when inferring and evaluating optimal evolutionary tree topologies from multilocus sequence data PNAS, March 22, 2005; 102(12): 4436 - 4441. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Hey and R. Nielsen Multilocus Methods for Estimating Population Sizes, Migration Rates and Divergence Time, With Applications to the Divergence of Drosophila pseudoobscura and D. persimilis Genetics, June 1, 2004; 167(2): 747 - 760. [Abstract] [Full Text] [PDF] |
||||
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Email this article to a friend
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Rannala, B.
- Articles by Yang, Z.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Rannala, B.
- Articles by Yang, Z.































