Estimating Mutation Parameters, Population History and Genealogy Simultaneously From Temporally Spaced Sequence Data
Alexei J. Drummond, Geoff K. Nicholls, Allen G. Rodrigo, Wiremu Solomon


Molecular sequences obtained at different sampling times from populations of rapidly evolving pathogens and from ancient subfossil and fossil sources are increasingly available with modern sequencing technology. Here, we present a Bayesian statistical inference approach to the joint estimation of mutation rate and population size that incorporates the uncertainty in the genealogy of such temporally spaced sequences by using Markov chain Monte Carlo (MCMC) integration. The Kingman coalescent model is used to describe the time structure of the ancestral tree. We recover information about the unknown true ancestral coalescent tree, population size, and the overall mutation rate from temporally spaced data, that is, from nucleotide sequences gathered at different times, from different individuals, in an evolving haploid population. We briefly discuss the methodological implications and show what can be inferred, in various practically relevant states of prior knowledge. We develop extensions for exponentially growing population size and joint estimation of substitution model parameters. We illustrate some of the important features of this approach on a genealogy of HIV-1 envelope (env) partial sequences.

ONE of the most significant developments in population genetics modeling in recent times was the introduction of coalescent or genealogical methods (Kingman 1982a,b). The coalescent is a stochastic process that provides good approximations to the distribution of ancestral histories that arise from classical forward-time models such as the Fisher-Wright (Fisher 1930; Wright 1931) and Moran population models. The explicit use of genealogies to estimate population parameters allows the nonindependence of sampled sequences to be accounted for. (“Genealogy” and “tree” are used interchangeably throughout. In both cases we are referring to a collection of edges, nodes, and node times that together completely specify a rooted history.) Many coalescent-based estimation methods focus on a single genealogy (Fu 1994; Neeet al. 1995; Pybuset al. 2000) that is typically obtained using standard phylogenetic methods. However, there is often considerable uncertainty in the reconstructed genealogy. To allow for this uncertainty it is necessary to compute the average likelihood of the population parameters of interest. The calculation involves integrating over genealogies distributed according to the coalescent (Griffiths and Tavare 1994; Kuhneret al. 1995). We can carry out this integration for some models of interest, using Monte Carlo methods. Importance-sampling algorithms have been developed to estimate the population parameter Θ = 2Neμ (Griffiths and Tavare 1994; Stephens and Donnelly 2000), migration rates (Bahlo and Griffiths 2000), and recombination (Griffiths and Marjoram 1996; Fearnhead and Donnelly 2001). Metropolis-Hastings Markov chain Monte Carlo (MCMC; Metropoliset al. 1953; Hastings 1970) has been used to obtain sample-based estimates of Θ (Kuhneret al. 1995), exponential growth rate (Kuhneret al. 1998), migration rates (Beerli and Felsenstein 1999, 2001), and recombination (Kuhneret al. 2000).

In addition to developments in coalescent-based population genetic inference, sequence data sampled at different times are now available from both rapidly evolving viruses, such as human immunodeficiency virus (HIV; Holmeset al. 1992; Wolinskyet al. 1996; Rodrigoet al. 1999; Shankarappaet al. 1999), and from ancient DNA sources (Hanniet al. 1994; Leonardet al. 2000; Loreilleet al. 2001; Barneset al. 2002; Lambertet al. 2002). This temporally spaced data provides the potential to observe the accumulation of mutations over time and thus estimate mutation rate (Drummond and Rodrigo 2000; Rambaut 2000). In fact, it is even possible to estimate variation in the mutation rate over time (Drummondet al. 2001). This leads naturally to the more general problem of simultaneous estimation of population parameters and mutation parameters from temporally spaced sequence data (Rodrigo and Felsenstein 1999; Rodrigoet al. 1999; Drummond and Rodrigo 2000; Drummondet al. 2001).

In this article we estimate population and mutation parameters, dates of divergence, and tree topology from temporally spaced sequence data, using sample-based Bayesian inference. The important novelties in the inference are the data type (i.e., temporally sampled sequences), the relatively large number of unknown model parameters, and the MCMC sampling procedures, not the Bayesian framework itself. The coalescent gives the expected frequency with which any particular genealogy arises under the Fisher-Wright population model. The coalescent may then be treated either as part of the observation process defining the likelihood of population parameters or as the prior distribution for the unknown true genealogy. In either case we must integrate the likelihood over the state space of the coalescent. Bayesian and purely likelihood-based population genetic inference use the same reasoning and software up to the point where prior distributions are given for the parameters of the coalescent and mutation processes.

Are there then any important difficulties or advantages in a Bayesian approach over a purely likelihood-based approach? The principal advantage is the possibility of quantifying the impact of prior information on parameter estimates and their uncertainties. The new difficulty is to represent different states of prior knowledge of the parameters of the coalescent and mutation processes as probability densities. However, such prior elicitation is often instructive. In the absence of prior information, researchers frequently choose to use noninformative/improper priors for the parameters of interest. Such an approach may be problematic and can result in improper posterior distributions. There exist a number of important cases in the literature in which knowledgeable authors inadvertently analyze a meaningless, improper posterior distribution. Why then do we choose to treat improper priors in this article? We are developing and testing inferential and sampling methods. These methods become more difficult as the amount of information in the prior is reduced. The sampling problem becomes significantly more difficult. We therefore treat the “worst case” prior that might naturally arise. Since this prior is improper, we are obliged to check that the posterior is proper. However, when confronted with a specific analysis, detailed biological knowledge should be encoded in the prior distributions wherever possible.

Although Bayesian reasoning has frequently been applied to phylogenetic inference (Yang and Rannala 1997; Thorneet al. 1998; Mauet al. 1999; Huelsenbecket al. 2000), it has thus far been the exception in population genetic inference (Wilson and Balding 1998).

In this article, we begin with a description of the models we use. We then give the overall structure of the inferential framework, followed by an overview of how MCMC is carried out. We mention extensions of the basic inference that allow for (1) deterministically varying populations and (2) estimation of substitution parameters. Finally, we illustrate our methods with a group of studies of a sample of HIV-1 envelope (env) sequences and a second group of studies of synthetic sequence data.

Kingman coalescent with temporally offset leaves: In this section we define the coalescent density for the constant-sized Fisher-Wright population model. In extensions we give the corresponding density for the case of a population with deterministic exponential growth. It is assumed genealogies are realized by the Kingman coalescent process. Our time units in this article are “calendar units before the present” [e.g., days before present (BP)], where the present is the time of the most recent leaf and set to zero. Let ρ denote the number of calendar units per generation and θ = Neρ. The scale factor θ converts “coalescent time” to calendar time and is one of two key objects of our inference. Note that we do not estimate ρ and Ne separately, only their product.

Consider a rooted binary tree g with n leaf nodes and n − 1 ancestral nodes. For node i, let ti denote the age of that node in calendar units. Node labels are numerically increasing with age so i > j implies titj. Let I denote the set of leaf node labels and let Y denote the set of ancestral node labels. There is one leaf node i ϵ I associated with each individual in the data. These individuals are selected, possibly at different times, from a large background population. An edge ⟨i, j⟩, i > j of g represents an ancestral lineage. Going back in time, an ancestral node i ϵ Y corresponds to a coalescence of two ancestral lineages. The root node, with label i = 2n − 1, represents the most recent common ancestor (MRCA) of all leaves. Let tI be the times of the leaves and tY be the divergence times of the ancestral nodes. Let Eg denote the edge set of g, so that g = (Eg, tY) specifies a realization of the coalescent process. For given n and tI, let Γ denote the class of all coalescent trees (Eg, tY) with n leaf nodes having fixed ages tI. The ages tY are subject to the obvious parent-child age order constraint. The element of measure in Γ is dg = dtn+1dt2n−1 with counting measure over distinct topologies associated with the distinguishable leaves.

The probability density for a tree, fG(g|θ), g ϵ Γ is computed as follows. Let ki denote the number of lineages present in the interval of time between the node i − 1 and the node i. The coalescent process generates g = (Eg, tY) with probability density fG(gθ)=1θn1i=22n1e(ki(ki1)2θ)(titi1). (1) The interpretation is as follows. Fix a time t and suppose k lineages are present at that time. A coalescence event between any of the k(k − 1)/2 pairs of distinguished lineages occurs at instantaneous rate 1/θ. Given that two lineages coalesce at time t, the probability it was some particular pair is 2/k(k − 1). It follows that, in the time interval of length titi−1 preceding the time of a leaf node i ϵ I, “nothing” happens with probability e(−ki(ki−1)/2θ)(titi−1) and that the length of time, tti−1, preceding coalescent node i ϵ Y is a random variable with density (ki (ki − 1)/2θ) · e(−ki(ki−1)/2θ)(titi−1). Taking the product of these factors over all intervals [ti−1, ti], i = 2, 3, …, 2n − 1, we obtain Equation 1 (Rodrigo and Felsenstein 1999).

Mutation: We use the standard finite-sites selection-neutral likelihood framework (Felsenstein 1981) with a general time-reversible (GTR) substitution model (Rodriguezet al. 1990). However, as we are considering genealogies in calendar units (or generations) as opposed to mutations we take some space to develop notation.

Associated with each leaf node i ϵ I there is a nucleotide sequence Di = (Di,1, Di,2, …, Di,s, …, Di,L) of some fixed length L, say. Nucleotide base characters Di,s, i ϵ I, s = 1, 2, …, L take values in the set C = {A, C, G, T}. An additional gap character, ϕ, indicates missing data. Let D = (D1, D2, …, Dn)T denote the n × L matrix of sequences associated with the tree leaves, and let DA denote the (n − 1) × L matrix of unknown sequences associated with the ancestral nodes. The data are D together with tI, that is, the n sequences observed in the leaf individuals and the n ages at which those individual sequences were taken. Let D = C(n−1)L denote the set of all possible ancestral sequences. Consider a site s = 1, 2, …, L in the nucleotide sequence of an individual. The character at site s mutates in forward time according to a Poisson jump process with 4 × 4 rate matrix Q. Here, Qi,j is the instantaneous rate for the transition from character i to character j, and A ← 1, C ← 2, G ← 3, T ← 4. We assume mutations are independent between sites. Let π = (πA, πC, πG, πT) be a 1 × 4 vector of base frequencies, corresponding to the stationary distribution of the mutation process, πQ = (0, 0, 0, 0).

The matrix Q is parameterized in terms of a symmetric “relative rate” matrix R, R=[RACRAGRATRACRCGRCTRAGRCG1RATRCT1] (2) as Qi,j=πiRi,jkπklkπlRk,l,ijQi,j=jiQi,j. (3) The time units of the rate Qi,j have been chosen so that the mean number of mutations per unit time occurring at a site is equal to one. Let μ give the mean number of mutations per calendar unit (e.g., mutations per year) at a site.

The conversion factor μ is the second of the two principal objects of our inference. In addition to μ, the relative rates, R, may be estimated. We have found that wherever it is feasible to estimate the scale parameters μ and θ, our data are informative about the elements of R. We return to inference for relative rates in extensions.

We now write the likelihood for μ. Consider an edge ⟨i, j⟩ ϵ Eg of tree g. The individual associated with node j is a direct descendant of the individual associated with node i. However, the sequences Di and Dj may differ if mutations have occurred in the interval. Let eQ denote the 4 × 4 matrix exponential of Q. In the standard finite-sites selection-neutral likelihood framework Pr{Dj,s = c′|Di,s = c} = [eQμ(titj)]c,c for c ϵ C. The probability for any particular set of sequences D, DA to be realized at the nodes of a given tree is Pr{D,DAg,μ}=i,jEgs=1Dj,sϕL[eQμ(titj)]Di,s,Dj,s (4) (in the above formula, compact notation is obtained by including in the product over edges an edge terminating at the root from an ancestor of infinite age). We may eliminate the unknown ancestral sequences DA from the above expression by simply summing all DA ϵ D, Pr{Dg,μ}=DADPr{D,DAg,μ}. (5) It is feasible to evaluate this sum, using a pruning algorithm (Felsenstein 1981).

Bayesian inference for scale parameters: We now consider Bayesian inference for scale parameters μ and θ. Both of these quantities take a real positive value. The joint posterior density, hMΘG(μ, θ, g|D), for the scale parameters and genealogy, is given in terms of the likelihood and coalescent densities above and two additional densities, fM(μ) and fΘ(θ). These functions quantify prior information about the scale parameters. Let Z be an unknown normalizing constant. The posterior is then hMθG(μ,θ,gD)=1ZPr{Dg,μ}fG(gθ)fM(μ)fϴ(θ). (6) We are interested in the marginal density, h(μ, θ|D). We summarize this density using samples (μ, θ, g) ~ hMΘG. The sampled genealogies can be thought of as uninteresting “missing data.”

Consider now the densities fM(μ) and fΘ(θ). In any particular application these functions will be chosen to summarize available prior knowledge of scale parameters. It is common practice to avoid the problem of prior elicitation and attempt to construct a “noninformative” prior. This notion is poorly defined, since a prior may be noninformative with respect to some hypotheses, but informative with respect to others. Nevertheless, we illustrate sample-based Bayesian inference under a prior that contains little information. We do this for two reasons. First, we wish to give our sampling instruments a thorough workout. From this point of view an improper prior is the best choice. Second, when carrying out Bayesian inference, it is necessary to test the sensitivity of conclusions to changes in the state of prior knowledge. What conclusions would a person in a state close to ignorance reach from these data? The improper prior we consider represents ignorance of a rather natural kind. People using our methods will very likely want to consider this particular state of knowledge, along with others that are more representative of their own.

In our case μ and θ are both scale parameters (for time). The Jeffreys prior, f(z) ∝ 1/z, z > 0, invariant under scale transformations zaz, and the uniform prior on z > 0 are candidates for fM(μ) and fΘ(θ). If fM ∝ 1/μ, fΘ ∝ 1/θ, and fG(g|θ) and Pr{D|g, μ} are as given in Equations 1 and 5 then it may be shown that the posterior density in Equation 6 is not finitely normalizable. We may nevertheless consider ratios of posterior densities. But that means the only feasible Bayesian inference, at least under the uniform, improper prior, is exactly frequentist inference. We cannot treat the parameters of interest as random variables. Suppose fixed upper limits μ ≤ μ* and troottroot may be set, along with a lower limit θ ≥ θ*. For the problems we use to illustrate our methods in examples, conservative limits of this kind determine a state of knowledge that arises quite naturally. Moreover it may be shown that the posterior density is finitely normalizable under uniform priors on the restricted state space, even though the prior on θ remains improper.


The posterior density hMΘG is a complicated function defined on a space of high dimension (between 30 and 40 in the examples that follow). We summarize the information it contains by computing the expectations, over hMΘG, of various statistics of interest. These expectations are estimated using samples distributed according to hMΘG. We use MCMC to gather the samples we need. MCMC and importance sampling are part of a family of Monte Carlo methods that may be used individually or in concert to solve the difficult integration problems that arise in population genetic inference. Earlier work on this subject is cited in the Introduction. Figure 1 shows a cartoon of two proposal mechanisms used. See the appendix for details of the proposal mechanisms and MCMC integration performed.

As always in MCMC, it is not feasible to test for convergence to equilibrium. MCMC users are obliged to test for stationarity as a proxy. We make three basic tests. First, we check that results are independent of the starting state using 10 independent runs with very widely dispersed initializations. Second, we visually inspect output traces. These should contain no obvious trend. Third, we check that the MCMC output contains a large number of segments that are effectively independent of one another, independent, at least, in the distribution determined empirically by the MCMC output. Let ρf(k) give the autocorrelation at lag k for some function f of the MCMC output. Let γf denote the asymptotic standard deviation of some estimate of ρf(k), formed from the MCMC output. Large lag autocorrelations should fall off to zero and remain within Of) of zero, as discussed by Geyer (1992). Note that in the examples section, these standards are not uniformly applied. The first two analyses pass all three checks. The last two analyses pass the first test. Here we are displaying the limitations of our MCMC algorithm. However, we believe the convergence is adequate for the points we make. In the appendix, Convergence and standard errors describes the integrated autocorrelation time (IACT) and effective sample size (ESS) measures used to test the efficiency of our sampler.

Figure 1.

Diagrams of two proposal mechanisms used to modify tree topology during an MCMC analysis. (A) This move is called the “narrow exchange” and is similar to a nearest neighbor interchange. This move picks two subtrees at random under the constraint that they have an aunt-niece relationship; i.e., the parent of one is the grandparent of the other, but neither is parent of the other. Once picked these two subtrees are swapped so long as doing so does not require any modifications in node heights to maintain parent-child order constraints. (B) This move is similar to one proposed by Wilson and Balding (1998) and involves removing a subtree and reattaching it on a new parent branch.

The MCMC algorithm we used was implemented twice, more or less independently, by A. Drummond, in JAVA and by G. K. Nicholls in MatLab. This allowed us to compare results and proved very useful in debugging some of the more complex proposal mechanism combinations. To minimize programming burden, one of our implementations (G. K. Nicholls in MatLab) was partial, allowing only fixed population size and fixed R to be compared. This is discussed more extensively in Implementation issues in the appendix.


Extending the framework of the Introduction and MCMC for evolutionary parameters to include deterministically varying models of population history and estimation of relative rate parameters is straightforward. Let Φ = (0, ∞)5 be the state space for the relative rates of R above the diagonal and excluding RG↔T. Let s = (μ, θ, g, r, R), and let hS(s|D) denote the posterior density for SΩS , where ΩS=ΩMϴG×R×Φ (see the appendix). The posterior probability density has the form hS(sD)=1ZPr{Dg,μ,R}fG(gθ,r)fM(μ)fϴ(θ)fr(r)fR(R). (7) Let T denote the age of the most recent leaf, i.e., T = miniϵIti. In this article T = 0. Let tT be a generic age. In this model Ne = Ne(t). Recall that ρ, the number of calendar units per generation, is an unknown constant. Define a constant θ = Ne(T)ρ and a growth rate parameter r. The density fG(g|θ, r) is the density determined by the coalescent process with a population growing as Ne(t) = (θ/ρ)er(tT) (Slatkin and Hudson 1991). In terms of the notation defined in Kingman coalescent with temporally offset leaves in connection with Equation 1, for genealogies with temporally spaced tips the density is fG(gθ,r)=1θn1i=22n1ertie(ki(ki1)2θr)(ertierti1). (8) If all of the relative rates in R, except RG↔T, are estimated we are fitting a general time-reversible model of substitution. However, it is sometimes useful to consider simpler nested models. One such model is the Hasegawa-Kishino-Yano (HKY) model (Hasegawaet al. 1985). In the HKY model transitions occur at rate κ relative to transversions. Thus RA↔G = RC↔T = κ and RA↔C = RA↔T = RC↔G = RG↔T = 1. Either a Jeffreys prior or a uniform prior can be used for the relative rates. However, as a result of our parameterization, the Jeffreys prior provides more accurate estimates. In the examples that follow, a uniform prior is used for R and κ as this represents the most ignorant state of knowledge and is more than adequate for the purpose of illustrating the methodology. In the same spirit fr(r) is set uniform on r, and this also proves acceptable.


In this section, we illustrate our methods on two HIV-1 env data sets and a series of synthetic data sets of comparable size.

HIV-1 env data: The method was first tested on HIV-1 partial envelope sequences obtained from a single patient over five sampling occasions spanning ~3 years: an initial sample (day 0) followed by additional samples after 214, 671, 699, and 1005 days. Details of this dataset have been published previously (Rodrigoet al. 1999). An important feature of these data is that monotherapy with Zidovudine was initiated on day 409 (Drummondet al. 2001) and continued during the remainder of the study. The total dataset consists of 60 sequences from these five time points. The length of the alignment is 660 nucleotides. Gapped columns were included in the analysis. The evidence for recombination seems to be negligible in this dataset (Rodrigoet al. 1999) and recombination is ignored for the purposes of illustrating our method. Rough estimates of Ne may be obtained by assuming a generation length of ρ = 1 day per generation (Rodrigoet al. 1999). However, we emphasize that we estimate Neρ only in this work. The dataset was split into two subsets for separate analysis. One contained all pretreatment sequences (28 sequences), and the other contained all sequences after treatment commenced (32 sequences; henceforth called posttreatment). The rationale behind this split is that both (1) population size and (2) mutation rate per unit time may be affected by a replication inhibitor such as Ziduvodine. In all of the analyses, base frequencies were fixed to empirically determined values; however, inference of these would have been trivial. Two analyses are undertaken on each dataset. The pretreatment data are strongly informative for all parameters estimated. The results are robust to the choice of priors and MCMC convergence is quick. In contrast, the posttreatment data are only weakly informative for μ, θ, and troot parameters; the results are sensitive to the choice of prior; and MCMC convergence is very slow.

Pretreatment data, constant population size, HKY substitution: In this first analysis of the pretreatment dataset, we fit the HKY substitution model and assume a constant population size. We are estimating μ, θ, g, and κ. We illustrate our methods using uniform prior distributions on μ and θ, an upper limit on mutation rate of μ* = 1, a lower limit on Neρ of θ* = 1, and a very conservative upper limit on troot of t* = 107 days. Ten MCMC runs were made, with starting values for mutation rate distributed on a log scale from 5 × 10−3 down to 10−7 mutations/site/day. This range greatly exceeds the range of values supported by the posterior. To test MCMC convergence on tree topologies, each of the 10 MCMC runs was started on a random tree drawn from a coalescent distribution with population size equal to 1000 (in exploratory work we initialize on a sUPGMA or neighbor-joining topology). The 10 Markov chain simulations were run for 2,000,000 steps and the first 100,000 steps were discarded as burn-in. Each run took ~4 hr on a machine with a 700 MHz Pentium III processor. The mean IACT of the mutation rate parameter was 4190, giving an ESS of ~450 per simulation. Table 1 presents parameter estimates for all 10 runs, illustrating close concordance between runs. Note also that the variability, between runs, of estimated means is in line with standard errors estimated within runs. This is a consistency check on our estimation of the IACT. Figures 2 and 3 show the marginal posterior density of μ and θ for each of the 10 runs. In all 10 runs the consensus tree computed from the MCMC output was the same, despite the fact that the starting trees were drawn randomly (data not shown). Combining the output of all 10 runs, the 95% highest posterior density (HPD) intervals for the mutation rate and troot are, respectively, [4.20, 8.28] × 10−5 mutations per site per day, and [580, 1040] days.

View this table:

Parameter estimates for 10 independent analyses of the pretreatment dataset assuming constant population size and HKY model of mutation

Pretreatment data, exponential growth, general substitution model: In this second analysis of the pretreatment dataset, we fit the general time-reversible substitution model, with exponential growth of population size. We are estimating μ, θ, g, r, RA↔C, RA↔G, RA↔T, RC↔G, and RC↔T. This is the most parameter-rich model we fit. To assess the convergence characteristics of this analysis we ran 10 independent runs of 3,000,000 cycles, each starting with an independent random tree topology (the mean IACT for μ was 7955 giving an ESS of 358 per run). Figure 4 shows the 10 estimates of the marginal posterior density of mutation rate. Table 2 shows parameter estimates for each of the 10 runs. Convergence is still achieved with the extra parameters.

Compare the distribution of summary statistics under the two models described here and in Pretreatment data, constant population size, HKY substitution. Given the nature of infection of HIV-1, it seems likely that an exponential growth rate assumption is more accurate. Estimated 95% HPD intervals for the growth rate r, [1.09 × 10−3, 6.65 × 10−3], exclude small growth rates, corroborating this view. The 95% HPD intervals for the mutation rate and troot are, respectively, [3.61, 8.11] × 10−5 mutations per site per day and [570, 1090] days. Compare these with the model in Pretreatment data, constant population, HKY substitution. The change in model has minimal effect (<10%) on the posterior mean mutation rate.

Figure 2.

The marginal posterior density of mutation rate for 10 independent MCMC runs on the pretreatment HIV-1 env dataset. Each run was started on a random topology. Initial mutation rates ranged from 5e-3 to 1e-7.

Figure 3.

The marginal posterior density of θ for 10 independent MCMC runs on the pretreatment HIV-1 env dataset. Each run was started on a random tree topology. Initial mutation rates ranged from 5e-3 to 1e-7.

Posttreatment: The posttreatment data are analyzed twice under the HKY substitution model with constant population size. The first analysis uses the same priors as the first pretreatment analysis. In contrast to the pretreatment dataset, the mutation rate of the posttreatment dataset is difficult to estimate. This is illustrated in Figures 5 and 6, in which the marginal posterior densities of μ and θ estimated from 10 independent MCMC runs, each 5,000,000 cycles long, are compared. We were unable to compute an IACT for each run, so we are unable to compare within- and between-run variability. However, the between-run concordance visible in Figure 5 justifies the following statement. The posttreatment mutation rate shows one mode at ~2.8 × 10−5 mutations/site/day with a second mode on the lower boundary. The data determine a diffuse, and bimodal, marginal posterior on μ. One of the modes is associated with states (μ, θ, g) with physically unrealistic root times (greater than the age of the patient). These are allowed, if we are not prepared to assert some restriction on troot. This behavior also occurs when we use a Jeffreys prior on the mutation rate (data not shown). It reflects a real property of the data, namely that states of low μ and large troot are not well distinguished from otherwise identical states of larger μ and smaller troot.

Figure 4.

The marginal posterior density of mutation rate for 10 independent MCMC runs on the pretreatment HIV-1 env dataset. An exponential growth rate mode of demography and a general time-reversible (GTR) model of substitution were assumed. Each run was started on a random tree topology. Initial mutation rates ranged from 5e-3 to 1e-7.

View this table:

Parameter estimates for 10 independent analyses of the pretreatment dataset assuming exponential growth and GTR model of mutation

In the second posttreatment analysis, we revise the upper limit on troot downwards, from 107 to t* = 3650, a value more representative of actual prior knowledge for this dataset. The new limit, set 3 years before sero-conversion occurred in the infected patient, is still conservative. Here we explored the prior belief that HIV infection most often originates from a small, homogenous population and then subsequently accumulates variation. This prior effectively assumes that all viruses in an infected individual share a common ancestor at most as old as the time of infection of the host. The estimated 95% HPD interval for the mutation rate was [1.16, 4.27] × 10−5 mutations/site/day, markedly down from the pretreatment mutation rate. Figure 7 depicts the resulting unimodal marginal posterior density for mutation rate, showing that the spurious mode has been eliminated. Again, no IACT was computed. However, between-run variability was much improved over Figures 5 and 6. Information about troot has been converted into information about mutation rates and population size.

Figure 5.

The marginal posterior density of mutation rate for 10 independent MCMC runs on the posttreatment HIV-1 env dataset. The thick line represents the density of all 10 runs combined. Each run was started on a random tree topology. Initial mutation rates ranged from 5e-3 to 1e-7.

Figure 6.

The marginal posterior density of θ for 10 independent MCMC runs on the posttreatment HIV-1 env dataset. Each run was started on a random tree topology. Initial mutation rates ranged from 5e-3 to 1e-7.

Simulated sequence data: To test the ability of our inference procedure to recover accurate estimates of parameters from the above HIV-1 dataset we undertook four simulation studies. In each experiment we generated 100 synthetic datasets. For experiment 1, the posterior estimates of θ, μ, and κ obtained from the pretreatment dataset in Pretreatment data, constant population size, HKY substitution were used to generate 100 coalescent trees and then simulate sequences on each of the resulting trees. The synthetic data were generated under a constant-size population model with the HKY mutation model but analyzed under an exponentially growing population model and a GTR mutation model. In the second experiment, 100 synthetic datasets were generated using the pretreatment parameter estimates in Pretreatment data, exponential growth, general substitution model as the true values. In this case the models for simulation and inference are matched. Synthetic data were generated under an exponentially growing population model and a GTR mutation model. In both experiments 1 and 2 uniform bounded priors were used for all parameters. Experiments 3 and 4 differed from experiments 1 and 2 only in that we used Jeffreys' (1946) prior for scale parameters (mutation rate, population size, and relative rates).

Figure 7.

The marginal posterior density of mutation rate for 10 independent MCMC runs on the posttreatment HIV-1 env dataset where the age of the root had an upper limit of 10 years (3650 days). The thick line represents the density of all 10 runs combined. Each run was started on a random tree topology. Initial mutation rates ranged from 5e-3 to 1e-7.

View this table:

Percentage of times that the true parameter was found in the 95% HPD region of the marginal posterior density

All datasets had the same number of sequences (28), the same sampling times (0 and 214 days), and the same sequence length (660) as the pretreatment dataset. Table 3 shows that the true values are successfully recovered (i.e., fall within the 95% HPD interval) ≥90% of the time in all cases except for the relative rate parameters in experiment 1. In the most complex model we fit, we recover true parameter values. The overparameterization present in experiments 1 and 3 does not seem problematic for estimating mutation rate, θ, or growth rate. These results suggest that inference of biologically realistic growth rates is quite feasible. The relative rates performed most poorly in the parameters of interest. This is caused predominantly because the uniform prior on relative rates introduces metric factors that inflate the densities. In experiment 1, when the true value of a relative rate parameter was not within the 95% HPD interval (which occurred 75 times out of 500), it was almost always overestimated (74 out of 75 times). Furthermore, conditioning on a tranversion (RG↔T = 1), a rare event, may also have an impact. However, experiments 3 and 4 demonstrate that the use of a Jeffreys prior for these and other scale parameters results in >90% recovery in all parameters. We are not aiming to prescribe any particular noninformative prior. Our choice of uniform prior in earlier experiments is deliberately crude. However, it allows us to lay out the methodology with as little emphasis as possible on prior elicitation. The reader should undertake this process for a specific problem.


We have described Bayesian coalescent-based methods to estimate and assess the uncertainty in mutation parameters, population parameters, tree topology, and dates of divergence from aligned temporally spaced sequence data. The sample-based Bayesian framework allows us to bring together information of different kinds to reduce uncertainty in the objects of the inference. Much of the hard work is in designing, implementing, and testing a suitable Monte Carlo algorithm. We found a suite of MCMC updates that do the job.

We have analyzed two contrasting HIV-1 datasets and 400 synthetic datasets to illustrate the main features of our methods. The results of the three HIV-1 env data subsections show that a robust summary of parameter-rich models, including the joint estimation of mutation rate and population size, is possible for some moderate-sized datasets. The pretreatment data restrict the set of plausible parameter values to a comparatively small range. For this dataset, useful results can be obtained from a state of ignorance about physically plausible outcomes. This situation is in contrast to the situation illustrated in the Posttreatment section. For this dataset, prior ignorance implies posterior ambiguity, in the form of a bimodal posterior distribution for the mutation rate. One of these modes is supported by genealogies conflicting with very basic current ideas about HIV population dynamics. We modify the coalescent prior on genealogies to account for this prior knowledge, restricting the most recent common ancestor to physically realistic values. The ambiguity in mutation rate is removed. Similar results could be obtained in a likelihood-based analysis of the posttreatment data, since the prior information amounts to an additional hard constraint on the root time of the coalescent genealogy.

There is some redundancy in the set of MCMC updates we used, in the sense that the limiting distribution of the MCMC is unaltered if we remove the scaling update (move 1) or the Wilson-Balding update (move 2; see appendix for details of these moves). However, these two updates types are needed in practice. There are two timescales in MCMC, time to equilibrium and mixing time in equilibrium. The scaling move sharply reduces mixing time in equilibrium. The Wilson-Balding update is needed to bring the equilibrium time to acceptable values. We have seen MCMC simulations, minus the Wilson-Balding move, in which an apparently stationary Monte Carlo process undergoes a sudden and unheralded mean shift at ~2,000,000 updates. This problem was picked up at the debugging stage, in comparisons between our two MCMC implementations. Subsequent simulation has shown that the genealogies explored in the first 2,000,000 updates of that simulation were just one of the tree clusters supported by the target distribution.

The methods presented here reduce to those of Felsenstein and co-workers (Kuhneret al. 1995) in the case of a uniform prior on Θ = 2Neμ, a fixed R, a fixed μ, and contemporaneous data, if instead of summarizing results using 95% HPD interval estimates, we use the mode and curvature of the posterior density for Θ to recover the maximum-likelihood estimate (MLE) and its associated confidence interval.

A distinction can be made between a dataset, like the pretreatment dataset, for which there is strong statistical information about mutation rates (we refer to populations from which such datasets may be obtained as “measurably evolving”), and a dataset, like the posttreatment data, in which the statistical signal is weak. In both of these datasets the familiar parameter Θ = 2Neμ is in fact well determined by the data (not shown above), so that MCMC convergence in Θ is quick. However, it is only in the pretreatment data that this parameter can be separated easily into its two factors. This is related to the well-known problem of identifiability for population size and mutation rate. We can see that temporally spaced data may or may not contain information that allows us to separate these two factors. In this particular example, lineages of the posttreatment viruses branch from those of the pretreatment viral population. Consequently a more appropriate analysis for this dataset would allow for a change of mutation rate and/or population size over the genealogy of the entire set of sequences. In the case of mutation rate this has already been demonstrated within a likelihood framework (Drummondet al. 2001). In a Bayesian analysis, coalescence of posttreatment lineages with pretreatment lineages will tend to limit the age of the most recent common ancestor of the posttreatment data, so that the pretreatment lineages will play the role of the reduced upper-bound troot in the Posttreatment section.

A software package called molecular evolutionary population inference (MEPI), developed using the phylogenetic analysis library (PAL; Drummond and Strimmer 2001), implementing the described method and further extensions (codon position rate heterogeneity, etc.), is available from


We gratefully acknowledge two anonymous reviewers for helpful comments that much improved the manuscript. In addition, A.D. thanks A. Ferreira. A.D. was supported by a New Zealand Foundation for Research, Science and Technology Bright Futures scholarship. Research by A.G.R. and A.D. was also supported by National Institutes of Health grant GM59174.


Markov chain Monte Carlo for temporally spaced sequence data including proposal mechanism used is described.

Denote by ΩMΘG the space [0, ∞) × [0, ∞) × Γ of all possible (μ, θ, g) values. Let ΩMϴG={(μ,θ,(Eg,tY))ΩMϴG:μμ,θθ,troottroot}.

We now describe a Monte Carlo algorithm realizing a Markov chain Xn, n = 0, 1, 2, … with states x=(μ,θ,g),xΩMϴG , and equilibrium hX = hMΘG.

Suppose Xn = x. A value for Xn+1 is computed using a Metropolis-Hastings algorithm. Define a set of random operations on the state. A given move may alter one or more of μ, θ, and g. Label the different move types m = 1, 2, …, M. The random operation with label m, acting on state x, generates state x′, with probability density qm(x′|x), say. Let (ab) equal a if a < b and otherwise b and (ab) equal a if a > b and otherwise b, let P(x,x)=hX(xD)hX(xD) stand for the ratio of posterior densities, and let Qm(x,x)=qm(xx)qm(xx) give the ratio of the densities for proposals xx and x → and xx′. The algorithm determining Xn+1 given Xn can be described as follows. First, a label m is chosen according to some arbitrary fixed probability distribution on the M move types. A value for the candidate state x′ is drawn according to the density qm(x′|x). Second, we accept the candidate, and set Xn+1 = x′ with probability αm(x,x)=1(P(x,x)Qm(x,x)). (9) otherwise, with probability 1 − αm(x, x′), the candidate is rejected and we set Xn+1 = x.

Proposal mechanisms: In this section we describe the proposal mechanisms (moves) and their acceptance probabilities. In each move, Xn = x, with x = (μ, θ, (Eg, tY)). For each node i let parent(i) ϵ Y denote the label of the node ancestral to i and connected to i by an edge. We get a compact notation if we treat Y and g as if Y contained a notional parent(root) node with tparent(root) = ∞, as we did in Equation 4. Also, we now drop the convention that node labels increase with age.

Let dx = dμ dθ dg in ΩMϴG and HX(dxD)=hX(xD)dx. The moves listed below determine an HX-irreducible aperiodic Metropolis-Hastings kernel. The MCMC is Harris recurrent and ergodic, with HX its unique equilibrium distribution.

Scaling move: Label this move m = 1. Let a real constant β > 1 be given. For β−1 ≤ δ ≤ β, let x → δx denote the transformation (μ,θ,(Eg,tY))(μδ,δθ,(Eg,δtY)). If x′ = δx then x = δ ′x′ with δ′ = 1/δ. The change of variables in the product measure is HX(dxD)dδ=δn3HX(dxD)dδ. Note that this transformation is not simply a change of units. The times ti associated with ancestral nodes i ϵ Y are scaled while leaf node times ti, i ϵ I (which are part of the data) are left unchanged.

The move is as follows. Choose a δ ~ Unif(β−1, β) and set x′ = δx. If xΩMϴG (if, for example, μ/δ > μ*, or the parent-child age order constraint is violated at the unscaled leaves in the scaled tree), then the move fails and we set Xn+1 = x. In a slight abuse of notation we set Q1(x, x′) = 1/δn−3 in the formula for α1(x, x′) in Equation 9 (Green 1995 explains how this scale factor arises in Metropolis-Hastings MCMC). The choice β = 1.2 gave reasonable acceptance rates in our simulations.

Wilson-Balding move: Label this move m = 2. A random subtree is moved to a new branch. This move is based on the branch-swapping move of Wilson and Balding (1998). The SPR move in PAUP* (Swofford 1999) is similar. However, the move below acts on a rooted tree and maintains all node ages except one.

Two nodes, i, j ϵ IY are chosen uniformly at random without replacement. Let jp = parent(j) and ip = parent(i). If tjpti, if ip = j or ip = jp, then the move fails and we set Xn+1 = x. Given i and j, the candidate state x′ = (μ, θ, g′) is generated in the following way. Let ĩ denote the child of ip that is not i, and let ipp = parent(ip), the grandparent of i. Reconnect node ip so that it is a child of jp and a parent of j; that is, set Eg={jp,j,ip,i,ipp,ip}Eg{jp,ip,ip,j,ipp,i}. If node j is not the root, assign to node ip a new time tip chosen uniformly at random in the interval [(titj), tjp]. If node j is the root, choose δ ~ Exp(θ) and set tip=tj+δ . Let tY denote the set of node times with tip replaced by tip . Let x=(μ,θ,(Eg,tY)) . If node j and node ip are not root, the ratio Q2(x, x′) in Equation 9 is Q2(x,x)=(tjp(titj))(tipp(titi)). If node j is the root, Q2(x,x)=θ(exp(δθ)(tipp(titi))), and if ip is the root, Q2(x,x)=(tjp(titj))exp((tipti)θ)θ.

Subtree exchange: Label this move m = 3. Choose a node i ϵ IY. Let ip = parent(i), jp = parent(ip), and let j denote the child of jp that is not ip. If node i is the root or a direct child of the root, or tip < tj, then the move fails and we set Xn+1 = x. Given i and j, the candidate state x′ = (μ, θ, g′) is generated in the following way. Swap nodes i and j, setting Eg={ip,jjp,i}Eg{jp,j,ip,i}. Let x=(μ,θ,(Eg,tY)) . The ratio Q3(x, x′) = 1 in Equation 9.

The subtree exchange above is a local operation. In a second version of this move we chose node j uniformly at random over the whole tree.

Node age move: Label this move m = 4. Choose an internal node, i ϵ Y, uniformly at random. Let ip = parent(i) and let j and k be the two children of i [so i = parent(j) and i = parent(k), jk]. If i is not the root, choose a new time ti uniformly at random in [(tjtk), tip]; otherwise, if i is the root, choose δ ~ Unif(β−1, β) (see move m = 1) and set ti=(tjtk)+δ(tiδ(tjtk)) . Let tY denote the set of ancestral node times, tY, with ti replaced by ti . Let x=(μ,θ,(Eg,tY)) . If i is not the root, then Q4(x, x′) = 1 in Equation 9. If i is the root then Q4(x, x′) = 1/δ.

Random walk moves for θ and μ: Label this move m = 5. The random walk update to θ is as follows. Let a real constant wθ > 0 be given. Choose δ ~ Unif(−wθ, wθ) and set x′ = (μ, θ + δ, g). If xΩMϴG , then the move fails and we set Xn+1 = x. Since the candidate generation process is symmetric, Q5(x, x′) = 1, in the formula for α5(x, x′) in Equation 9. The random walk move for μ, with random walk window parameter wμ, say, is similar to the move just described for θ. The window sizes wθ and wμ must be adjusted to get reasonable sampling efficiency.

Implementation, convergence checking, and debugging: Convergence and standard errors: The efficiency of our Markov sampler, as a tool for estimating the mean of a given function f, is measured by calculating from the output τf = 1 + 2Σ ρf(k), the IACT of f. Dividing the run length by τf, we get the number of “effective independent” samples in the run (the number of independent samples required to get the same precision for estimation of the mean of f). We call this the ESS. Better MCMC algorithms have smaller IACTs and thus larger ESSs, though it may be necessary to measure τ in units of CPU time to make a really useful comparison. One will typically want to run the Markov chain at least a few hundred times the IACT, to test convergence and get reasonably stable marginal histograms. Note first that we do not know the IACT when we set the MCMC running. Exploratory runs are needed. Second, a statement like “We ran the MCMC for 106 updates discarding the first 104” is worthless without some accompanying measurement of an IACT or equivalent. This point is made in Sokal (1989). The summation cutoff in the estimate for the IACT, τf, is determined using a monotone sequence estimator (Geyer 1992). The IACTs we get for our MCMC algorithms suggest that analysis of large datasets (50–100 sequences and 500–1000 nucleotides) is feasible with current desktop computers. Examples may be found in examples (Table 2). The inverse of the IACT of a given statistic is the “mixing rate.” Statistics with small mixing rates are called the “slow modes” of a MCMC algorithm. The mutation rate μ was the slowest mode among those we checked, and we therefore present IACTs for that statistic in examples.

Implementation issues: In this section we discuss debugging and MCMC efficiency of our two implementations. We compare expectations computed in the coalescent with estimates obtained from MCMC output. Standard errors are obtained from estimates of the corresponding IACT. Consider a tree with four leaves, two at time zero and two offset τ time units to greater age. Consider simulation in the coalescent, with no data. The expectation of troot is EG{troot}=(τ+4θ3)(1eτθ)+(τ+3θ2)eτθ. A number of other expectations may be computed.

For problems involving data, expectations are not available. However, an MCMC algorithm with several different move types may be tested for consistency. The equilibrium is the posterior distribution of μ, θ, and g and should not alter as we vary the proportions in which move types are used to generate candidate states. For example, move 2 (Wilson-Balding) is irreducible on its own, while moves 3 and 4 (subtree exchange and nodeage move) form another irreducible group. We fix a small synthetic dataset and compare the output of two MCMC runs: one generated using move 2 alone and the other using moves 3 and 4 alone in tandem.

We now turn to questions of MCMC efficiency. Each update has a number of parameters. These are adjusted, by trial and error for each analysis, so that the MCMC is reasonably efficient. An ad hoc adaptive scheme, based on monitoring acceptance rates, and akin to that described in Larget and Simon (1999), was used. The samples used in output analysis are taken from the final portion of the run, in which these parameters are fixed. The scaling and Wilson-Balding updates are particularly effective.

We have experimented with a range of other moves. However, while it is easy to think up computationally demanding updates with good mixing rates per MCMC update, we have focused on developing a set of primitive moves with good mixing rate per CPU second. In our experience simple moves may have low acceptance rates, but they are easy to implement accurately and are rapidly evaluated. They may give good mixing rates when we measure in CPU seconds. Larget and Simon (1999) have given an effective MCMC scheme for a similar problem. We did not use their scheme, as its natural data structure did not fit well with our other operators. A second update, which may be useful to us in the future, would use the importance-sampling process of Stephens and Donnelly (2000) to determine an independence sampling update.

Because of the explicit nature of MCMC inference, the details of a particular analysis, including the proposal mechanisms, the chain length, the evolutionary model, and the prior distributions, can be quite difficult to keep track of. One of us (A. Drummond) developed an XML data format to describe phylogenetic/population genetic analyses. This enables the user to write down the details of an analysis in a human-readable format that can also be used as the input for the computer program. For the more visually inclined a graphical user interface (GUI) was developed that can generate the XML input files, given a NEXUS or PHYLIP alignment.


  • Communicating editor: J. Hein

  • Received December 15, 2001.
  • Accepted March 12, 2002.


View Abstract