- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Drummond, A. J.
- Articles by Solomon, W.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Drummond, A. J.
- Articles by Solomon, W.
Estimating Mutation Parameters, Population History and Genealogy Simultaneously From Temporally Spaced Sequence Data
Alexei J. Drummonda, Geoff K. Nichollsb, Allen G. Rodrigoa, and Wiremu Solomonca School of Biological Sciences, University of Auckland 1001, Auckland, New Zealand
b Department of Mathematics, University of Auckland 1001, Auckland, New Zealand
c Department of Statistics, University of Auckland 1001, Auckland, New Zealand
Corresponding author: Alexei J. Drummond, University of Oxford, South Parks Rd., Oxford, OX1 3PS, United Kingdom., alexei.drummond{at}zoology.oxford.ac.uk (E-mail)
| ABSTRACT |
|---|
Molecular sequences obtained at different sampling times from populations of rapidly evolving pathogens and from ancient subfossil and fossil sources are increasingly available with modern sequencing technology. Here, we present a Bayesian statistical inference approach to the joint estimation of mutation rate and population size that incorporates the uncertainty in the genealogy of such temporally spaced sequences by using Markov chain Monte Carlo (MCMC) integration. The Kingman coalescent model is used to describe the time structure of the ancestral tree. We recover information about the unknown true ancestral coalescent tree, population size, and the overall mutation rate from temporally spaced data, that is, from nucleotide sequences gathered at different times, from different individuals, in an evolving haploid population. We briefly discuss the methodological implications and show what can be inferred, in various practically relevant states of prior knowledge. We develop extensions for exponentially growing population size and joint estimation of substitution model parameters. We illustrate some of the important features of this approach on a genealogy of HIV-1 envelope (env) partial sequences.
ONE of the most significant developments in population genetics modeling in recent times was the introduction of coalescent or genealogical methods (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
(![]()
![]()
![]()
![]()
![]()
![]()
![]()
(![]()
![]()
![]()
![]()
![]()
In addition to developments in coalescent-based population genetic inference, sequence data sampled at different times are now available from both rapidly evolving viruses, such as human immunodeficiency virus (HIV; ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
In this article we estimate population and mutation parameters, dates of divergence, and tree topology from temporally spaced sequence data, using sample-based Bayesian inference. The important novelties in the inference are the data type (i.e., temporally sampled sequences), the relatively large number of unknown model parameters, and the MCMC sampling procedures, not the Bayesian framework itself. The coalescent gives the expected frequency with which any particular genealogy arises under the Fisher-Wright population model. The coalescent may then be treated either as part of the observation process defining the likelihood of population parameters or as the prior distribution for the unknown true genealogy. In either case we must integrate the likelihood over the state space of the coalescent. Bayesian and purely likelihood-based population genetic inference use the same reasoning and software up to the point where prior distributions are given for the parameters of the coalescent and mutation processes.
Are there then any important difficulties or advantages in a Bayesian approach over a purely likelihood-based approach? The principal advantage is the possibility of quantifying the impact of prior information on parameter estimates and their uncertainties. The new difficulty is to represent different states of prior knowledge of the parameters of the coalescent and mutation processes as probability densities. However, such prior elicitation is often instructive. In the absence of prior information, researchers frequently choose to use noninformative/improper priors for the parameters of interest. Such an approach may be problematic and can result in improper posterior distributions. There exist a number of important cases in the literature in which knowledgeable authors inadvertently analyze a meaningless, improper posterior distribution. Why then do we choose to treat improper priors in this article? We are developing and testing inferential and sampling methods. These methods become more difficult as the amount of information in the prior is reduced. The sampling problem becomes significantly more difficult. We therefore treat the "worst case" prior that might naturally arise. Since this prior is improper, we are obliged to check that the posterior is proper. However, when confronted with a specific analysis, detailed biological knowledge should be encoded in the prior distributions wherever possible.
Although Bayesian reasoning has frequently been applied to phylogenetic inference (![]()
![]()
![]()
![]()
![]()
In this article, we begin with a description of the models we use. We then give the overall structure of the inferential framework, followed by an overview of how MCMC is carried out. We mention extensions of the basic inference that allow for (1) deterministically varying populations and (2) estimation of substitution parameters. Finally, we illustrate our methods with a group of studies of a sample of HIV-1 envelope (env) sequences and a second group of studies of synthetic sequence data.
Kingman coalescent with temporally offset leaves:
In this section we define the coalescent density for the constant-sized Fisher-Wright population model. In EXTENSIONS we give the corresponding density for the case of a population with deterministic exponential growth. It is assumed genealogies are realized by the Kingman coalescent process. Our time units in this article are "calendar units before the present" [e.g., days before present (BP)], where the present is the time of the most recent leaf and set to zero. Let
denote the number of calendar units per generation and
. The scale factor
converts "coalescent time" to calendar time and is one of two key objects of our inference. Note that we do not estimate
and Ne separately, only their product.
Consider a rooted binary tree g with n leaf nodes and n - 1 ancestral nodes. For node i, let ti denote the age of that node in calendar units. Node labels are numerically increasing with age so i > j implies ti
tj. Let I denote the set of leaf node labels and let Y denote the set of ancestral node labels. There is one leaf node i
I associated with each individual in the data. These individuals are selected, possibly at different times, from a large background population. An edge
i, j
, i > j of g represents an ancestral lineage. Going back in time, an ancestral node i
Y corresponds to a coalescence of two ancestral lineages. The root node, with label
, represents the most recent common ancestor (MRCA) of all leaves. Let tI be the times of the leaves and tY be the divergence times of the ancestral nodes. Let Eg denote the edge set of g, so that
specifies a realization of the coalescent process. For given n and tI, let
denote the class of all coalescent trees (Eg, tY) with n leaf nodes having fixed ages tI. The ages tY are subject to the obvious parent-child age order constraint. The element of measure in
is dg = dtn+1 ... dt2n-1 with counting measure over distinct topologies associated with the distinguishable leaves.
The probability density for a tree, fG(g|
), g
is computed as follows. Let ki denote the number of lineages present in the interval of time between the node i - 1 and the node i. The coalescent process generates
with probability density
![]() |
(1) |
The interpretation is as follows. Fix a time t and suppose k lineages are present at that time. A coalescence event between any of the k(k - 1)/2 pairs of distinguished lineages occurs at instantaneous rate 1/
. Given that two lineages coalesce at time t, the probability it was some particular pair is 2/k(k - 1). It follows that, in the time interval of length ti - ti-1 preceding the time of a leaf node i
I, "nothing" happens with probability e(-ki(ki-1)/2
)(ti-ti-1) and that the length of time, t - ti-1, preceding coalescent node i
Y is a random variable with density (ki(ki - 1)/2
) · e(-ki(ki-1)/2
)(ti-ti-1). Taking the product of these factors over all intervals [ti-1, ti], i = 2, 3, ... , 2n - 1, we obtain Equation 1 (![]()
Mutation:
We use the standard finite-sites selection-neutral likelihood framework (![]()
![]()
Associated with each leaf node i
I there is a nucleotide sequence Di = (Di,1, Di,2, ... , Di,s, ... , Di,L) of some fixed length L, say. Nucleotide base characters Di,s, i
I, s = 1, 2, ... , L take values in the set C = {A, C, G, T}. An additional gap character,
, indicates missing data. Let D = (D1, D2, ... , Dn)T denote the n x L matrix of sequences associated with the tree leaves, and let DA denote the (n - 1) x L matrix of unknown sequences associated with the ancestral nodes. The data are D together with tI, that is, the n sequences observed in the leaf individuals and the n ages at which those individual sequences were taken. Let D = C(n-1)L denote the set of all possible ancestral sequences. Consider a site s = 1, 2, ... , L in the nucleotide sequence of an individual. The character at site s mutates in forward time according to a Poisson jump process with 4 x 4 rate matrix Q. Here, Qi,j is the instantaneous rate for the transition from character i to character j, and A
1, C
2, G
3, T
4. We assume mutations are independent between sites. Let
= (
A,
C,
G,
T) be a 1 x 4 vector of base frequencies, corresponding to the stationary distribution of the mutation process,
Q = (0, 0, 0, 0).
The matrix Q is parameterized in terms of a symmetric "relative rate" matrix R,
![]() |
(2) |
as
![]() |
(3) |
The time units of the rate Qi,j have been chosen so that the mean number of mutations per unit time occurring at a site is equal to one. Let µ give the mean number of mutations per calendar unit (e.g., mutations per year) at a site.
The conversion factor µ is the second of the two principal objects of our inference. In addition to µ, the relative rates, R, may be estimated. We have found that wherever it is feasible to estimate the scale parameters µ and
, our data are informative about the elements of R. We return to inference for relative rates in EXTENSIONS.
We now write the likelihood for µ. Consider an edge
i, j
Eg of tree g. The individual associated with node j is a direct descendant of the individual associated with node i. However, the sequences Di and Dj may differ if mutations have occurred in the interval. Let eQ denote the 4 x 4 matrix exponential of Q. In the standard finite-sites selection-neutral likelihood framework
. The probability for any particular set of sequences D, DA to be realized at the nodes of a given tree is
![]() |
(4) |
(in the above formula, compact notation is obtained by including in the product over edges an edge terminating at the root from an ancestor of infinite age). We may eliminate the unknown ancestral sequences DA from the above expression by simply summing all DA
D,
![]() |
(5) |
It is feasible to evaluate this sum, using a pruning algorithm (![]()
Bayesian inference for scale parameters:
We now consider Bayesian inference for scale parameters µ and
. Both of these quantities take a real positive value. The joint posterior density, hM
G(µ,
, g|D), for the scale parameters and genealogy, is given in terms of the likelihood and coalescent densities above and two additional densities, fM(µ) and f
(
). These functions quantify prior information about the scale parameters. Let Z be an unknown normalizing constant. The posterior is then
![]() |
(6) |
We are interested in the marginal density, hM
(µ,
|D). We summarize this density using samples (µ,
, g)
hM
G. The sampled genealogies can be thought of as uninteresting "missing data."
Consider now the densities fM(µ) and f
(
). In any particular application these functions will be chosen to summarize available prior knowledge of scale parameters. It is common practice to avoid the problem of prior elicitation and attempt to construct a "noninformative" prior. This notion is poorly defined, since a prior may be noninformative with respect to some hypotheses, but informative with respect to others. Nevertheless, we illustrate sample-based Bayesian inference under a prior that contains little information. We do this for two reasons. First, we wish to give our sampling instruments a thorough workout. From this point of view an improper prior is the best choice. Second, when carrying out Bayesian inference, it is necessary to test the sensitivity of conclusions to changes in the state of prior knowledge. What conclusions would a person in a state close to ignorance reach from these data? The improper prior we consider represents ignorance of a rather natural kind. People using our methods will very likely want to consider this particular state of knowledge, along with others that are more representative of their own.
In our case µ and
are both scale parameters (for time). The Jeffreys prior, f(z)
1/z, z > 0, invariant under scale transformations z
az, and the uniform prior on z > 0 are candidates for fM(µ) and f
(
). If fM
1/µ, f
1/
, and fG(g|
) and Pr{D|g, µ} are as given in Equation 1 and Equation 5 then it may be shown that the posterior density in Equation 6 is not finitely normalizable. We may nevertheless consider ratios of posterior densities. But that means the only feasible Bayesian inference, at least under the uniform, improper prior, is exactly frequentist inference. We cannot treat the parameters of interest as random variables. Suppose fixed upper limits µ
µ* and troot
t*root may be set, along with a lower limit
*. For the problems we use to illustrate our methods in EXAMPLES, conservative limits of this kind determine a state of knowledge that arises quite naturally. Moreover it may be shown that the posterior density is finitely normalizable under uniform priors on the restricted state space, even though the prior on
remains improper.
| MARKOV CHAIN MONTE CARLO FOR EVOLUTIONARY PARAMETERS |
|---|
The posterior density hM
G is a complicated function defined on a space of high dimension (between 30 and 40 in the examples that follow). We summarize the information it contains by computing the expectations, over hM
G, of various statistics of interest. These expectations are estimated using samples distributed according to hM
G. We use MCMC to gather the samples we need. MCMC and importance sampling are part of a family of Monte Carlo methods that may be used individually or in concert to solve the difficult integration problems that arise in population genetic inference. Earlier work on this subject is cited in the Introduction. Fig 1 shows a cartoon of two proposal mechanisms used. See the Appendix for details of the proposal mechanisms and MCMC integration performed.
|
As always in MCMC, it is not feasible to test for convergence to equilibrium. MCMC users are obliged to test for stationarity as a proxy. We make three basic tests. First, we check that results are independent of the starting state using 10 independent runs with very widely dispersed initializations. Second, we visually inspect output traces. These should contain no obvious trend. Third, we check that the MCMC output contains a large number of segments that are effectively independent of one another, independent, at least, in the distribution determined empirically by the MCMC output. Let
f(k) give the autocorrelation at lag k for some function f of the MCMC output. Let
f denote the asymptotic standard deviation of some estimate of
f(k), formed from the MCMC output. Large lag autocorrelations should fall off to zero and remain within O(
f) of zero, as discussed by ![]()
The MCMC algorithm we used was implemented twice, more or less independently, by A. Drummond, in JAVA and by G. K. Nicholls in MatLab. This allowed us to compare results and proved very useful in debugging some of the more complex proposal mechanism combinations. To minimize programming burden, one of our implementations (G. K. Nicholls in MatLab) was partial, allowing only fixed population size and fixed R to be compared. This is discussed more extensively in Implementation issues in the Appendix
| EXTENSIONS |
|---|
Extending the framework of the Introduction and MCMC FOR EVOLUTIONARY PARAMETERS to include deterministically varying models of population history and estimation of relative rate parameters is straightforward. Let
= (0,
)5 be the state space for the relative rates of R above the diagonal and excluding RG
T. Let s = (µ,
, g, r, R), and let hS(s|D) denote the posterior density for S
*S, where
(see the Appendix). The posterior probability density has the form
![]() |
(7) |
Let T denote the age of the most recent leaf, i.e., T = mini
Iti. In this article T = 0. Let t
T be a generic age. In this model Ne = Ne(t). Recall that
, the number of calendar units per generation, is an unknown constant. Define a constant
= Ne(T)
and a growth rate parameter r. The density fG(g|
, r) is the density determined by the coalescent process with a population growing as
(![]()
![]() |
(8) |
If all of the relative rates in R, except RG
T, are estimated we are fitting a general time-reversible model of substitution. However, it is sometimes useful to consider simpler nested models. One such model is the Hasegawa-Kishino-Yano (HKY) model (![]()
relative to transversions. Thus
and
. Either a Jeffreys prior or a uniform prior can be used for the relative rates. However, as a result of our parameterization, the Jeffreys prior provides more accurate estimates. In the examples that follow, a uniform prior is used for R and
as this represents the most ignorant state of knowledge and is more than adequate for the purpose of illustrating the methodology. In the same spirit fr(r) is set uniform on r, and this also proves acceptable.
| EXAMPLES |
|---|
In this section, we illustrate our methods on two HIV-1 env data sets and a series of synthetic data sets of comparable size.
HIV-1 env data:
The method was first tested on HIV-1 partial envelope sequences obtained from a single patient over five sampling occasions spanning
3 years: an initial sample (day 0) followed by additional samples after 214, 671, 699, and 1005 days. Details of this dataset have been published previously (![]()
![]()
![]()
= 1 day per generation (![]()
only in this work. The dataset was split into two subsets for separate analysis. One contained all pretreatment sequences (28 sequences), and the other contained all sequences after treatment commenced (32 sequences; henceforth called posttreatment). The rationale behind this split is that both (1) population size and (2) mutation rate per unit time may be affected by a replication inhibitor such as Ziduvodine. In all of the analyses, base frequencies were fixed to empirically determined values; however, inference of these would have been trivial. Two analyses are undertaken on each dataset. The pretreatment data are strongly informative for all parameters estimated. The results are robust to the choice of priors and MCMC convergence is quick. In contrast, the posttreatment data are only weakly informative for µ,
, and troot parameters; the results are sensitive to the choice of prior; and MCMC convergence is very slow.
Pretreatment data, constant population size, HKY substitution:
In this first analysis of the pretreatment dataset, we fit the HKY substitution model and assume a constant population size. We are estimating µ,
, g, and
. We illustrate our methods using uniform prior distributions on µ and
, an upper limit on mutation rate of µ* = 1, a lower limit on Ne
of
* = 1, and a very conservative upper limit on troot of t* = 107 days. Ten MCMC runs were made, with starting values for mutation rate distributed on a log scale from 5 x 10-3 down to 10-7 mutations/site/day. This range greatly exceeds the range of values supported by the posterior. To test MCMC convergence on tree topologies, each of the 10 MCMC runs was started on a random tree drawn from a coalescent distribution with population size equal to 1000 (in exploratory work we initialize on a sUPGMA or neighbor-joining topology). The 10 Markov chain simulations were run for 2,000,000 steps and the first 100,000 steps were discarded as burn-in. Each run took
4 hr on a machine with a 700 MHz Pentium III processor. The mean IACT of the mutation rate parameter was 4190, giving an ESS of
450 per simulation. Table 1 presents parameter estimates for all 10 runs, illustrating close concordance between runs. Note also that the variability, between runs, of estimated means is in line with standard errors estimated within runs. This is a consistency check on our estimation of the IACT. Fig 2 and Fig 3 show the marginal posterior density of µ and
for each of the 10 runs. In all 10 runs the consensus tree computed from the MCMC output was the same, despite the fact that the starting trees were drawn randomly (data not shown). Combining the output of all 10 runs, the 95% highest posterior density (HPD) intervals for the mutation rate and troot are, respectively, [4.20, 8.28] x 10-5 mutations per site per day, and [580, 1040] days.
|
|
|
Pretreatment data, exponential growth, general substitution model:
In this second analysis of the pretreatment dataset, we fit the general time-reversible substitution model, with exponential growth of population size. We are estimating µ,
, g, r, RA
C, RA
G, RA
T, RC
G, and RC
T. This is the most parameter-rich model we fit. To assess the convergence characteristics of this analysis we ran 10 independent runs of 3,000,000 cycles, each starting with an independent random tree topology (the mean IACT for µ was 7955 giving an ESS of 358 per run). Fig 4 shows the 10 estimates of the marginal posterior density of mutation rate. Table 2 shows parameter estimates for each of the 10 runs. Convergence is still achieved with the extra parameters.
|
|
Compare the distribution of summary statistics under the two models described here and in Pretreatment data, constant population size, HKY substitution. Given the nature of infection of HIV-1, it seems likely that an exponential growth rate assumption is more accurate. Estimated 95% HPD intervals for the growth rate r, [1.09 x 10-3, 6.65 x 10-3], exclude small growth rates, corroborating this view. The 95% HPD intervals for the mutation rate and troot are, respectively, [3.61, 8.11] x 10-5 mutations per site per day and [570, 1090] days. Compare these with the model in Pretreatment data, constant population, HKY substitution. The change in model has minimal effect (<10%) on the posterior mean mutation rate.
Posttreatment:
The posttreatment data are analyzed twice under the HKY substitution model with constant population size. The first analysis uses the same priors as the first pretreatment analysis. In contrast to the pretreatment dataset, the mutation rate of the posttreatment dataset is difficult to estimate. This is illustrated in Fig 5 and Fig 6, in which the marginal posterior densities of µ and
estimated from 10 independent MCMC runs, each 5,000,000 cycles long, are compared. We were unable to compute an IACT for each run, so we are unable to compare within- and between-run variability. However, the between-run concordance visible in Fig 5 justifies the following statement. The posttreatment mutation rate shows one mode at
2.8 x 10-5 mutations/site/day with a second mode on the lower boundary. The data determine a diffuse, and bimodal, marginal posterior on µ. One of the modes is associated with states (µ,
, g) with physically unrealistic root times (greater than the age of the patient). These are allowed, if we are not prepared to assert some restriction on troot. This behavior also occurs when we use a Jeffreys prior on the mutation rate (data not shown). It reflects a real property of the data, namely that states of low µ and large troot are not well distinguished from otherwise identical states of larger µ and smaller troot.
|
|
In the second posttreatment analysis, we revise the upper limit on troot downwards, from 107 to t* = 3650, a value more representative of actual prior knowledge for this dataset. The new limit, set 3 years before seroconversion occurred in the infected patient, is still conservative. Here we explored the prior belief that HIV infection most often originates from a small, homogenous population and then subsequently accumulates variation. This prior effectively assumes that all viruses in an infected individual share a common ancestor at most as old as the time of infection of the host. The estimated 95% HPD interval for the mutation rate was [1.16, 4.27] x 10-5 mutations/site/day, markedly down from the pretreatment mutation rate. Fig 7 depicts the resulting unimodal marginal posterior density for mutation rate, showing that the spurious mode has been eliminated. Again, no IACT was computed. However, between-run variability was much improved over Fig 5 and Fig 6. Information about troot has been converted into information about mutation rates and population size.
|
Simulated sequence data:
To test the ability of our inference procedure to recover accurate estimates of parameters from the above HIV-1 dataset we undertook four simulation studies. In each experiment we generated 100 synthetic datasets. For experiment 1, the posterior estimates of
, µ, and
obtained from the pretreatment dataset in Pretreatment data, constant population size, HKY substitution were used to generate 100 coalescent trees and then simulate sequences on each of the resulting trees. The synthetic data were generated under a constant-size population model with the HKY mutation model but analyzed under an exponentially growing population model and a GTR mutation model. In the second experiment, 100 synthetic datasets were generated using the pretreatment parameter estimates in Pretreatment data, exponential growth, general substitution model as the true values. In this case the models for simulation and inference are matched. Synthetic data were generated under an exponentially growing population model and a GTR mutation model. In both experiments 1 and 2 uniform bounded priors were used for all parameters. Experiments 3 and 4 differed from experiments 1 and 2 only in that we used JEFFREYS' (1946) prior for scale parameters (mutation rate, population size, and relative rates).
All datasets had the same number of sequences (28), the same sampling times (0 and 214 days), and the same sequence length (660) as the pretreatment dataset. Table 3 shows that the true values are successfully recovered (i.e., fall within the 95% HPD interval)
90% of the time in all cases except for the relative rate parameters in experiment 1. In the most complex model we fit, we recover true parameter values. The overparameterization present in experiments 1 and 3 does not seem problematic for estimating mutation rate,
, or growth rate. These results suggest that inference of biologically realistic growth rates is quite feasible. The relative rates performed most poorly in the parameters of interest. This is caused predominantly because the uniform prior on relative rates introduces metric factors that inflate the densities. In experiment 1, when the true value of a relative rate parameter was not within the 95% HPD interval (which occurred 75 times out of 500), it was almost always overestimated (74 out of 75 times). Furthermore, conditioning on a tranversion (RG
T = 1), a rare event, may also have an impact. However, experiments 3 and 4 demonstrate that the use of a Jeffreys prior for these and other scale parameters results in >90% recovery in all parameters. We are not aiming to prescribe any particular noninformative prior. Our choice of uniform prior in earlier experiments is deliberately crude. However, it allows us to lay out the methodology with as little emphasis as possible on prior elicitation. The reader should undertake this process for a specific problem.
|
| DISCUSSION |
|---|
We have described Bayesian coalescent-based methods to estimate and assess the uncertainty in mutation parameters, population parameters, tree topology, and dates of divergence from aligned temporally spaced sequence data. The sample-based Bayesian framework allows us to bring together information of different kinds to reduce uncertainty in the objects of the inference. Much of the hard work is in designing, implementing, and testing a suitable Monte Carlo algorithm. We found a suite of MCMC updates that do the job.
We have analyzed two contrasting HIV-1 datasets and 400 synthetic datasets to illustrate the main features of our methods. The results of the three HIV-1 env data subsections show that a robust summary of parameter-rich models, including the joint estimation of mutation rate and population size, is possible for some moderate-sized datasets. The pretreatment data restrict the set of plausible parameter values to a comparatively small range. For this dataset, useful results can be obtained from a state of ignorance about physically plausible outcomes. This situation is in contrast to the situation illustrated in the Posttreatment section. For this dataset, prior ignorance implies posterior ambiguity, in the form of a bimodal posterior distribution for the mutation rate. One of these modes is supported by genealogies conflicting with very basic current ideas about HIV population dynamics. We modify the coalescent prior on genealogies to account for this prior knowledge, restricting the most recent common ancestor to physically realistic values. The ambiguity in mutation rate is removed. Similar results could be obtained in a likelihood-based analysis of the posttreatment data, since the prior information amounts to an additional hard constraint on the root time of the coalescent genealogy.
There is some redundancy in the set of MCMC updates we used, in the sense that the limiting distribution of the MCMC is unaltered if we remove the scaling update (move 1) or the Wilson-Balding update (move 2; see Appendix for details of these moves). However, these two updates types are needed in practice. There are two timescales in MCMC, time to equilibrium and mixing time in equilibrium. The scaling move sharply reduces mixing time in equilibrium. The Wilson-Balding update is needed to bring the equilibrium time to acceptable values. We have seen MCMC simulations, minus the Wilson-Balding move, in which an apparently stationary Monte Carlo process undergoes a sudden and unheralded mean shift at
2,000,000 updates. This problem was picked up at the debugging stage, in comparisons between our two MCMC implementations. Subsequent simulation has shown that the genealogies explored in the first 2,000,000 updates of that simulation were just one of the tree clusters supported by the target distribution.
The methods presented here reduce to those of Felsenstein and co-workers (![]()
= 2Neµ, a fixed R, a fixed µ, and contemporaneous data, if instead of summarizing results using 95% HPD interval estimates, we use the mode and curvature of the posterior density for
to recover the maximum-likelihood estimate (MLE) and its associated confidence interval.
A distinction can be made between a dataset, like the pretreatment dataset, for which there is strong statistical information about mutation rates (we refer to populations from which such datasets may be obtained as "measurably evolving"), and a dataset, like the posttreatment data, in which the statistical signal is weak. In both of these datasets the familiar parameter
= 2Neµ is in fact well determined by the data (not shown above), so that MCMC convergence in
is quick. However, it is only in the pretreatment data that this parameter can be separated easily into its two factors. This is related to the well-known problem of identifiability for population size and mutation rate. We can see that temporally spaced data may or may not contain information that allows us to separate these two factors. In this particular example, lineages of the posttreatment viruses branch from those of the pretreatment viral population. Consequently a more appropriate analysis for this dataset would allow for a change of mutation rate and/or population size over the genealogy of the entire set of sequences. In the case of mutation rate this has already been demonstrated within a likelihood framework (![]()
A software package called molecular evolutionary population inference (MEPI), developed using the phylogenetic analysis library (PAL; ![]()
| ACKNOWLEDGMENTS |
|---|
We gratefully acknowledge two anonymous reviewers for helpful comments that much improved the manuscript. In addition, A.D. thanks A. Ferreira. A.D. was supported by a New Zealand Foundation for Research, Science and Technology Bright Futures scholarship. Research by A.G.R. and A.D. was also supported by National Institutes of Health grant GM59174.
Manuscript received December 15, 2001; Accepted for publication March 12, 2002.
| APPENDIX |
|---|
MCMC DETAILS AND MOVE TYPES
Markov chain Monte Carlo for temporally spaced sequence data including proposal mechanism used is described.
Denote by
M
G the space [0,
) x [0,
) x
of all possible (µ,
, g) values. Let

We now describe a Monte Carlo algorithm realizing a Markov chain Xn, n = 0, 1, 2, ... with states x = (µ,
, g), x
*M
G, and equilibrium hX = hM
G.
Suppose Xn = x. A value for Xn+1 is computed using a Metropolis-Hastings algorithm. Define a set of random operations on the state. A given move may alter one or more of µ,
, and g. Label the different move types m = 1, 2, ... , M. The random operation with label m, acting on state x, generates state x', with probability density qm(x'|x), say. Let (a
b) equal a if a < b and otherwise b and (a
b) equal a if a > b and otherwise b, let

stand for the ratio of posterior densities, and let

give the ratio of the densities for proposals x'
x and x
x'. The algorithm determining Xn+1 given Xn can be described as follows. First, a label m is chosen according to some arbitrary fixed probability distribution on the M move types. A value for the candidate state x' is drawn according to the density qm(x'|x). Second, we accept the candidate, and set Xn+1 = x' with probability
![]() |
(9) |
Otherwise, with probability 1 -
m(x, x'), the candidate is rejected and we set Xn+1 = x.
Proposal mechanisms:
In this section we describe the proposal mechanisms (moves) and their acceptance probabilities. In each move, Xn = x, with x = (µ,
, (Eg, tY)). For each node i let parent(i)
Y denote the label of the node ancestral to i and connected to i by an edge. We get a compact notation if we treat Y and g as if Y contained a notional parent(root) node with tparent(root) =
, as we did in Equation 4. Also, we now drop the convention that node labels increase with age.
Let dx = dµ d
dg in
*M
G and

The moves listed below determine an HX-irreducible aperiodic Metropolis-Hastings kernel. The MCMC is Harris recurrent and ergodic, with HX its unique equilibrium distribution.
Scaling move:
Label this move m = 1. Let a real constant ß > 1 be given. For ß-1
ß, let x
x denote the transformation

If
then
with
. The change of variables in the product measure is

Note that this transformation is not simply a change of units. The times ti associated with ancestral nodes i
Y are scaled while leaf node times ti, i
I (which are part of the data) are left unchanged.
The move is as follows. Choose a
Unif(ß-1, ß) and set x' =
x. If x
*M
G (if, for example, µ/
> µ*, or the parent-child age order constraint is violated at the unscaled leaves in the scaled tree), then the move fails and we set Xn+1 = x. In a slight abuse of notation we set
in the formula for
1(x, x') in Equation 9 (![]()
Wilson-Balding move:
Label this move m = 2. A random subtree is moved to a new branch. This move is based on the branch-swapping move of ![]()
![]()
Two nodes, i, j
I
Y are chosen uniformly at random without replacement. Let jp = parent(j) and ip = parent(i). If tjp
ti, if ip = j or ip = jp, then the move fails and we set Xn+1 = x. Given i and j, the candidate state x' = (µ,
, g') is generated in the following way. Let
denote the child of ip that is not i, and let ipp = parent(ip), the grandparent of i. Reconnect node ip so that it is a child of jp and a parent of j; that is, set

If node j is not the root, assign to node ip a new time t'ip chosen uniformly at random in the interval [(ti
tj), tjp]. If node j is the root, choose
Exp(
) and set
. Let t'Y denote the set of node times with tip replaced by t'ip. Let
If node j and node ip are not root, the ratio Q2(x, x') in Equation 9 is

If node j is the root,

and if ip is the root,

Subtree exchange:
Label this move m = 3. Choose a node i
I
Y. Let ip = parent(i), jp = parent(ip), and let j denote the child of jp that is not ip. If node i is the root or a direct child of the root, or tip < tj, then the move fails and we set Xn+1 = x. Given i and j, the candidate state
is generated in the following way. Swap nodes i and j, setting

Let
. The ratio
in Equation 9.
The subtree exchange above is a local operation. In a second version of this move we chose node j uniformly at random over the whole tree.
Node age move:
Label this move m = 4. Choose an internal node, i
Y, uniformly at random. Let ip = parent(i) and let j and k be the two children of i [so i = parent(j) and i = parent(k), j
k]. If i is not the root, choose a new time ti' uniformly at random in [(tj
tk), tip]; otherwise, if i is the root, choose
Unif(ß-1, ß) (see move m = 1) and set
. Let t'Y denote the set of ancestral node times, tY, with ti replaced by ti'. Let
. If i is not the root, then Q4(x, x') = 1 in Equation 9. If i is the root then
.
Random walk moves for
and µ:
Label this move m = 5. The random walk update to
is as follows. Let a real constant w
> 0 be given. Choose
Unif(-w
, w
) and set x' = (µ,
+
, g). If x
*M
G, then the move fails and we set Xn+1 = x. Since the candidate generation process is symmetric,
, in the formula for
5(x, x') in Equation 9. The random walk move for µ, with random walk window parameter wµ, say, is similar to the move just described for
. The window sizes w
and wµ must be adjusted to get reasonable sampling efficiency.
Implementation, convergence checking, and debugging: Convergence and standard errors:
The efficiency of our Markov sampler, as a tool for estimating the mean of a given function f, is measured by calculating from the output
, the IACT of f. Dividing the run length by
f, we get the number of "effective independent" samples in the run (the number of independent samples required to get the same precision for estimation of the mean of f). We call this the ESS. Better MCMC algorithms have smaller IACTs and thus larger ESSs, though it may be necessary to measure
in units of CPU time to make a really useful comparison. One will typically want to run the Markov chain at least a few hundred times the IACT, to test convergence and get reasonably stable marginal histograms. Note first that we do not know the IACT when we set the MCMC running. Exploratory runs are needed. Second, a statement like "We ran the MCMC for 106 updates discarding the first 104" is worthless without some accompanying measurement of an IACT or equivalent. This point is made in ![]()
f, is determined using a monotone sequence estimator (![]()
Implementation issues:
In this section we discuss debugging and MCMC efficiency of our two implementations. We compare expectations computed in the coalescent with estimates obtained from MCMC output. Standard errors are obtained from estimates of the corresponding IACT. Consider a tree with four leaves, two at time zero and two offset
time units to greater age. Consider simulation in the coalescent, with no data. The expectation of troot is

A number of other expectations may be computed.
For problems involving data, expectations are not available. However, an MCMC algorithm with several different move types may be tested for consistency. The equilibrium is the posterior distribution of µ,
, and g and should not alter as we vary the proportions in which move types are used to generate candidate states. For example, move 2 (Wilson-Balding) is irreducible on its own, while moves 3 and 4 (subtree exchange and node-age move) form another irreducible group. We fix a small synthetic dataset and compare the output of two MCMC runs: one generated using move 2 alone and the other using moves 3 and 4 alone in tandem.
We now turn to questions of MCMC efficiency. Each update has a number of parameters. These are adjusted, by trial and error for each analysis, so that the MCMC is reasonably efficient. An ad hoc adaptive scheme, based on monitoring acceptance rates, and akin to that described in ![]()
We have experimented with a range of other moves. However, while it is easy to think up computationally demanding updates with good mixing rates per MCMC update, we have focused on developing a set of primitive moves with good mixing rate per CPU second. In our experience simple moves may have low acceptance rates, but they are easy to implement accurately and are rapidly evaluated. They may give good mixing rates when we measure in CPU seconds. ![]()
![]()
Because of the explicit nature of MCMC inference, the details of a particular analysis, including the proposal mechanisms, the chain length, the evolutionary model, and the prior distributions, can be quite difficult to keep track of. One of us (A. Drummond) developed an XML data format to describe phylogenetic/population genetic analyses. This enables the user to write down the details of an analysis in a human-readable format that can also be used as the input for the computer program. For the more visually inclined a graphical user interface (GUI) was developed that can generate the XML input files, given a NEXUS or PHYLIP alignment.
| LITERATURE CITED |
|---|
BAHLO, M. and R. C. GRIFFITHS, 2000 Inference from gene trees in a subdivided population. Theor. Popul. Biol. 57:79-95.[Medline]
BARNES, I., P. MATHEUS, B. SHAPIRO, D. JENSEN, and A. COOPER, 2002 Dynamics of Pleistocene population extinctions in Beringian brown bears. Science 295:2267-2270.
BEERLI, P. and J. FELSENSTEIN, 1999 Maximum-likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetics 152:763-773.
BEERLI, P. and J. FELSENSTEIN, 2001 Maximum likelihood estimation of a migration matrix and effective population sizes in subpop















