help button home button Genetics AJP: Lung
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS

Genetics, Vol. 168, 2407-2420, December 2004, Copyright © 2004
doi:10.1534/genetics.104.030411

This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Ewing, G.
Right arrow Articles by Rodrigo, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Ewing, G.
Right arrow Articles by Rodrigo, A.

Using Temporally Spaced Sequences to Simultaneously Estimate Migration Rates, Mutation Rate and Population Sizes in Measurably Evolving Populations

Greg Ewing*,{dagger}, Geoff Nicholls*,{ddagger} and Allen Rodrigo*,{dagger},1

* Allan Wilson Centre for Molecular Ecology and Evolution, University of Auckland, Auckland, New Zealand 1020
{dagger} Bioinformatics Institute, University of Auckland, Auckland, New Zealand 1020
{ddagger} Department of Mathematics, University of Auckland, Auckland, New Zealand 1020

1 Corresponding author: School of Biological Sciences, Computational and Evolutionary Biology Lab, University of Auckland, Private Bag 92019, Auckland, New Zealand 1020.
E-mail: a.rodrigo{at}auckland.ac.nz

Manuscript received April 23, 2004. Accepted for publication September 15, 2004.


    ABSTRACT
 TOP
 ABSTRACT
 ISLAND-MODEL GENEALOGIES
 MUTATION
 BAYESIAN INFERENCE
 MARKOV CHAIN MONTE CARLO...
 CODE IMPLEMENTATION AND...
 SELECTED RESULTS FROM SIMULATED...
 HIV PATIENT DATA
 DISCUSSION
 APPENDIX:
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
We present a Bayesian statistical inference approach for simultaneously estimating mutation rate, population sizes, and migration rates in an island-structured population, using temporal and spatial sequence data. Markov chain Monte Carlo is used to collect samples from the posterior probability distribution. We demonstrate that this chain implementation successfully reaches equilibrium and recovers truth for simulated data. A real HIV DNA sequence data set with two demes, semen and blood, is used as an example to demonstrate the method by fitting asymmetric migration rates and different population sizes. This data set exhibits a bimodal joint posterior distribution, with modes favoring different preferred migration directions. This full data set was subsequently split temporally for further analysis. Qualitative behavior of one subset was similar to the bimodal distribution observed with the full data set. The temporally split data showed significant differences in the posterior distributions and estimates of parameter values over time.


DRUMMOND et al. (2002, 2003a) treat joint estimation of mutation rate, population size, and sample genealogies from time-stamped serially sampled sequence data, using a Bayesian approach and Markov chain Monte Carlo (MCMC). In this article we extend this method to fit an island migration model (NOTOHARA 1990). We focus on the case of two populations. We model asymmetric migration rates and unequal population sizes using serial samples of sequences.

Software tools for estimating migration parameters from DNA sequence data exist. These include Migrate (BEERLI and FELSENSTEIN 1999, 2001), GenTree (BAHLO and GRIFFITHS 2000), and MDIV (NIELSEN and WAKELEY 2001). These methods use MCMC to obtain maximum-likelihood estimates of migration rates and population sizes, but apply only to sequences obtained at a single time. As a consequence these methods estimate the composite parameter {Theta} = 2Nµ (here N is the effective population size and µ the mutation rate).

We focus on measurably evolving populations (MEPs; DRUMMOND et al. 2003b). Sequences of a given locus are sampled from individuals in a population on several sampling occasions. By definition, MEPs show a statistically significant increase in the number of substitutions over the sampling interval (DRUMMOND et al. 2003b). The human immunodeficiency virus (HIV) type I is a MEP. Serial samples taken from a single patient over a period of years show rapid accumulation of mutations in the viral genome and the generation of a large number of genetic variants. HIV forms distinct subpopulations in different body tissues, for example, in the brain and in the blood (WONG et al. 1997; POSS et al. 1998; WANG et al. 2001; ZHANG et al. 2002). Serially sampled sequence data, labeled by subpopulation, are therefore available. With HIV, the presence of different tissue compartments may signal the availability of reservoirs of virus that can hamper the effectiveness of antiviral therapy (NICKLE et al. 2003). It is therefore important to understand the pattern of HIV compartmentalization in the body and the timing of these "colonization" events.

Our method differs from the methods employed by BAHLO and GRIFFITHS (2000) and BEERLI and FELSENSTEIN (2001). Because we work with time-stamped sequence data, and MEPs, we can estimate population, migration, and mutation parameters simultaneously and separately (RODRIGO and FELSENSTEIN 1999; DRUMMOND and RODRIGO 2000). We use a Bayesian framework rather than a maximum-likelihood approach. This allows scientists using this approach to incorporate prior information as they deem appropriate. Such prior information may include knowledge about the means and variance of mutation rates or the most plausible direction of migration. Physically irrelevant parameter ranges, such as populations of size very much smaller than one, must be ruled out explicitly. This imposes an additional discipline on the inference.

Our work may be thought of as a methodologically straightforward but technically demanding extension of DRUMMOND et al. (2002)(2003a) to handle the island migration model. We apply our algorithms to a set of simulation studies. This tests software and identifies quantities poorly resolved by the data. We compare our results with results obtained by other authors. We then treat a real HIV data set, drawn from blood and semen samples, from a single patient taken at four time points over the period of 3 years. The data are rich in features of interest. We use them to illustrate the way the tools we have provided may be used to explore such data sets. We are unwilling, however, to make too many strong inferences about HIV biology on the basis of our analyses, because the complexities of HIV evolution present constant challenges to such analyses. In particular, our analyses ignore selection, recombination, and changes in population size, all of which will have significant impact on the results.

The outline is as follows: in ISLAND-MODEL GENEALOGIES and MUTATION we describe the island model of migration and its likelihood for serially sampled sequences. In BAYESIAN INFERENCE, we determine a posterior distribution for the parameters of interest. The MCMC integration tools we used to sample and summarize that distribution are described in MARKOV CHAIN MONTE CARLO FOR MIGRATION GENEALOGIES and CODE IMPLEMENTATION AND VERIFICATION. In SELECTED RESULTS FROM SIMULATED DATA, we present the results of the simulation studies and finally HIV PATIENT DATA is devoted to real HIV sequences obtained from two tissue compartments in a single patient over a number of time points. MCMC details are given in the APPENDIX.


    ISLAND-MODEL GENEALOGIES
 TOP
 ABSTRACT
 ISLAND-MODEL GENEALOGIES
 MUTATION
 BAYESIAN INFERENCE
 MARKOV CHAIN MONTE CARLO...
 CODE IMPLEMENTATION AND...
 SELECTED RESULTS FROM SIMULATED...
 HIV PATIENT DATA
 DISCUSSION
 APPENDIX:
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
We now describe the probability density for a Fisher-Wright population model (FISHER 1930; WRIGHT 1931) using the Kingman coalescent (KINGMAN 1982a,b) extended to include migration (HUDSON 1990; NOTOHARA 1990) and nonisochronous (i.e., serial or time stamped) leaf tips. For an analysis of the properties of the isochrone model, see HUDSON (1990) and NOTOHARA (1990) and references therein.

The island model of migration is a model of p populations, or demes. For j , = {1, 2 ... , p}, deme j is a panmictic population of Nj haploid individuals. Time increases into the past and is measured in calendar units. Let {lambda}ij denote the per capita migration rate from deme i to j (time increases into the past, so in forward time the individual is moving from j to i).

The migration process we describe below is a process that realizes migration-coalescent genealogies under the island model of migration. A migration-coalescent genealogy g is a rooted and directed binary tree graph with four node types: n leaf nodes (with label set ) of in-degree one and out-degree zero, n – 1 coalescent nodes (label set ) of which n – 2 have in-degree one and out-degree two and one (the root, label R say) has in-degree zero and out-degree two, plus an indeterminate number, m say, of migration nodes (label set ) of in-degree one and out-degree one. Let = {cup} denote the set of all ancestral (i.e., nonleaf) node labels and V = {cup} denote the set of all node labels. Tree edges <r, s> are directed toward the present. Let E denote the set of all edges in the tree graph and VR = V \{R} the set of all node labels excluding the root.

Individuals corresponding to leaf nodes are sampled from the demes. Deme labels are recorded. Because the observation process is conditioned on the scientist's sampling of individuals over demes, the number of individuals sampled from a deme need not reflect the deme size. For r , suppose individual r was sampled from deme ir at calendar time tr. The event represented by node r occurred at calendar time tr. Nodes are labeled from r = 1 to r = m + 2n – 1 in order of increasing age and by least child label in case of ties, so that r > s {Rightarrow} tr ≥ ts and <r, s> implies tr ≥ ts. For any set X V let tX = (tX, r X), with entries ordered by increasing r. Let t = tV. Let |X| denote the number of elements in set X.

Let {rho} equal the mean number of units of calendar time per generation. We do not estimate {rho}; instead we present results for a nominal {rho}. For example, for the HIV data set in HIV PATIENT DATA, tr is measured in days before an arbitrary zero and we set {rho} = 1 day.

The demographic process realizing migration-coalescent tree graphs is defined as follows. An ancestral lineage is associated with each sampled individual and carries a label indicating deme membership. As time increases into the past, each lineage in deme i migrates independently of all other lineages at rate {lambda}ij to deme j. Each pair of lineages in deme i coalesces at instantaneous rate 1/{theta}i, where {theta}i = Ni{rho}. The process terminates when the number of lineages equals one. With each event we associate a node, r , and with each lineage between events an edge <r, s> E. For each s VR, let is give the deme on edge <r, s>. Let = (i1 ... im+2n–2) be the set of all deme edge labels and and , respectively, the sets of deme labels for edges {<r, s> E, s } and {<r, s> E, s } attached to leaf and ancestral nodes. Let {lambda} = ({lambda}1,2 ... {lambda}p–1,p) and {theta} = ({theta}1 ... {theta}p). A visual representation can be seen by skipping ahead to Figure 4, where the leaf deme membership is shown with either a dashed line for one deme or a solid line for the other deme. Migration nodes (events) are where the line changes deme (line type); otherwise it is a traditional coalescent genealogy.



View larger version (23K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 4.— Typical trees realized by MCMC simulation of the full data set. (—) Blood deme; (---) semen. (Top) Trees from the s -> b mode; (bottom) b -> s mode.

 
The free and conditioned parameters of a migration genealogy g are (E, , t) given (, t). Because the data and t are known and fixed throughout the analysis, and leaf labels are determined from the label-time ordering, we write g = (E, , t) and keep in mind that some of the parameters in g are fixed. The parameter set is subject to constraints determined from the leaf demes. When there are just two demes, the deme labels are uniquely determined from the event topology E by propagating the demes from the leaves to the root (switch deme at each migration event). Let {Gamma} denote the set of all admissible migration genealogies g, which can be realized by the migration-coalescent process above, for given and t. {Gamma} is the union over m of sets {Gamma}m containing all migration genealogies with m migration events. Note that the Euclidean dimension of the space {Gamma}m is m + n – 1 (one dimension for each time variable tr, r ) and as a consequence, {Gamma} is a union of spaces of unequal dimension.

We now write the joint density f(g|{theta}, {lambda}) for a migration tree (the corresponding distribution is given in APPENDIX A). Consider the interval of time {delta}r = tr+1tr between consecutive nodes on the tree. There are m + 2n – 2 such intervals on a tree g {Gamma}m, one interval above each node r VR (for isochronous leaves, {delta}r = 0 for r = 1, 2 ... n – 1). For i and r VR, let kir denote the number of lineages in deme i in interval r. For i , let i = \{i} denote the set of demes omitting deme i. For each r VR, the interval (tr, tr+1] contributes a factor

to the density, along with a second factor equal to 1/{theta}i or {lambda}ij as the event type at time tr+1 is coalescent in deme i or (i -> j) migration. An interval terminated by a leaf (when r + 1 ) ends in a nonevent, and the second factor is one. Let mij denote the total number of (i -> j) migrations, i.e., mij = |{r ; ir = j, i = i}| with the child node of migration node r in g. Let ci denote the total number of coalescent events in deme i, with 1 and 2 the child nodes of coalescent node r in g. The overall density is

We note a few distributional details relevant to the MCMC over migration genealogies. Technically, f(g|{theta}, {lambda}) is the density of a distribution f(g|{theta}, {lambda})dg. If g has m migration events then is the element of volume in {Gamma}m. Migration and coalescent events are distinguished by their position on the tree and we take counting measure over topologies and -labels conditioned on leaf properties. The density given above is normalized, {int}{Gamma}f(g|{theta}, {lambda})dg = 1.

The migration coalescent generalizes the Kingman coalescent. Free movement or strong migration (NAGYLAKI 1980; NOTOHARA 1993) is signaled by {lambda}ij >> 1/{theta}j. Model populations with high migration rates can still be structured, as migration imbalance determines a source and sink population structure. If in addition, for each deme j , the local immigrant and emigrant population fluxes balance,

the aggregate population evolves as one panmictic population of size {sum}jNj.


    MUTATION
 TOP
 ABSTRACT
 ISLAND-MODEL GENEALOGIES
 MUTATION
 BAYESIAN INFERENCE
 MARKOV CHAIN MONTE CARLO...
 CODE IMPLEMENTATION AND...
 SELECTED RESULTS FROM SIMULATED...
 HIV PATIENT DATA
 DISCUSSION
 APPENDIX:
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
Mutation rate, µ, is inferred with our method by incorporating the traditional mutation model. We use the finite sites mutation model, with neutral selection and general time reversible (GTR) substitution process of FELSENSTEIN (1981) and RODRIGUEZ et al. (1990). The substitution process is a continuous-time Markov process with states {A, C, G, T}, a 4 x 1 vector of equilibrium probabilities {pi}, and a 4 x 4 rate matrix Q normalized to generate one substitution per unit calender time (–{sum}d{pi}dQdd = 1). The substitution and migration processes are independent.

Each leaf node r has associated with it nucleotide sequence data Dr = (Dr,1, Dr,2, Dr,3, ... , Dr,L) of length L with Dr,a {A, C, G, T, {phi}} for a = 1, 2, ... , L. Gaps, indicated by {phi}, are treated as unobserved sites. Let D be the n x L matrix of sequences on the leaves.

The likelihood P is defined and computed in the usual way, using node-to-node transition probabilities (FELSENSTEIN 1981). For these purposes migration nodes may be ignored and we consider the topology in the traditional sense. We calculate the likelihood P in the usual manner using pruning (FELSENSTEIN 1981).


    BAYESIAN INFERENCE
 TOP
 ABSTRACT
 ISLAND-MODEL GENEALOGIES
 MUTATION
 BAYESIAN INFERENCE
 MARKOV CHAIN MONTE CARLO...
 CODE IMPLEMENTATION AND...
 SELECTED RESULTS FROM SIMULATED...
 HIV PATIENT DATA
 DISCUSSION
 APPENDIX:
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
In this section we set out Bayesian inference for µ the mutation rate, the vector {theta} of Ni{rho}-values, and the vector {lambda} of migration rates. The migration genealogy g may be of direct interest also. The joint posterior density of these variables

(1)
is given in terms of the likelihood function P, the migration genealogy prior f, a prior p on µ, {theta} and {lambda}, and z, an unknown and intractable normalization constant. Here h is the density of a distribution h({lambda}, {theta}, µ, g) d{lambda}d{theta}dµdg with d{lambda} = d{lambda}1,2d{lambda}1,3 ... d{lambda}p–1,p and d{theta} = d{theta}1d{theta}2 ...d{theta}p.

Subjective, informative priors are a fundamental part of Bayesian inference. However, we want to set out a new parameter estimation scheme for a new data-model pair (namely serial sequence data in an island migration model). Informative priors can hide certain difficulties users will face when they apply Monte Carlo Bayesian inference. First, MCMC convergence is more easily achieved as the sampled probability density is more concentrated in its space of states. For example, the bimodality that makes the HIV data of HIV PATIENT DATA such a challenge to MCMC, and such a good pedogogical example, can be removed by adding prior information. Second, diffuse priors that are actually improper can lead to improper posteriors and meaningless results. In the last paragraphs of this section and in HIV PATIENT DATA we explain how an improper posterior could but does not arise in our case. Improper priors put mass on physically irrelevant parameter values. Should not a prior worth the name rule out such states? Certainly. However, suppose that is achieved via a simple cutoff in parameter space. We want to know whether results are sensitive to the choice of cutoff. If the posterior is improper without the cutoff then parameter estimates will be strongly influenced by the choice of cutoff. We should take particular care in choosing the cutoff and make a sensitivity analysis. Just this scenario arises in HIV PATIENT DATA. Third, when we form parameter estimates using Bayesian inference we should test each data set with several priors and make many entire MCMC analyses of each data set, not just one. What is the range of inferential outcomes that result from approaching these data with different subjective prejudices? How do conclusions change as we move from an informative to a more diffuse prior? Fourth, priors that seem to be noninformative can turn out to be strongly informative for some hypotheses. For example, an improper prior on genealogies assigns equal probability density to all rooted trees with n leaves. This prior is noninformative for comparison of unique tree topologies. However, the marginal prior density of the root time tR in this prior is tn–2R. This prior is very strongly informative for root time. Diffuse priors bring special problems. We choose examples in which those problems are present so that we can show how to deal with them. We regard our explanations of how to deal with these problems as an integral part of our exposition of the methodology itself.

Note that we are estimating rates for mutation, migration, and coalescence simultaneously from a single data set. DRUMMOND et al. (2002) have shown that mutation rate µ and population size {theta}i = Ni{rho} parameters may be separated when sequenced individuals are sampled serially over a timescale long enough to see mutational change. This is feasible for populations, such as HIV, that are measurably evolving. DRUMMOND et al. (2003a) give conditions for the estimation problem to be well defined in the absence of migration. This issue needs to be considered when priors are improper. We bound {theta}iµ, i , µ, and {lambda}ij above and below. (Note that we bound the traditional parameter N{rho}µ, but this implies a bound on {theta}.) In this setting any density of bounded range determines a proper posterior. Note that panmictic populations lead to migration rates {lambda}ij large compared to 1/{theta}j. Bounds must allow such parameter values or the panmictic condition will be eliminated by the prior. We bound {lambda}ij above so that the number of migrations per generation does not exceed one, {lambda}ij{rho} ≤ 1, relying on a fixed estimate for {rho}. This allows {lambda}ij as large as about Nj/{theta}j (depending on the accuracy of the estimated generation time), while still providing the upper bound on migration rate needed for posterior normalization.

The upper bound (or upper tail) imposed on {theta}i, i , by the prior plays an important role in the inference. Migration genealogies containing no coalescent events in deme i [so ci = 0 in f(g|{theta}, {lambda})] make up a set Si {Gamma} of nonzero posterior probability pi, say. Now, for each g Si there is {epsilon}(g, i) so that f(g|{theta}, {lambda}) > {epsilon}(g, i) for all {theta}i ≥ 0. In other words, for each demographic parameter {theta}i there is a component of the posterior in which the distribution of {theta}i at large values is controlled only through the prior p(µ, {theta}, {lambda}). These physically irrelevant parameter values, corresponding to populations of negligible size, must be ruled out explicitly. It follows that if, for example, the {theta}i priors are untruncated uniform (or otherwise nonintegrable) priors on {theta}i ≥ 0, the posterior cannot be proper. An example of this posterior sensitivity to prior bounds is discussed in detail at the end of HIV PATIENT DATA. A lower bound on {lambda}ij plays a similar role for Jefferys priors (of the form 1/X for parameter X). In that case the problem arises where tree states with no migration events in one direction mij = 0 have nonzero probability, the likelihood is bounded away from zero as {lambda}ij -> 0, and the posterior has the form 1/{lambda}ij at small {lambda}ij. Again, unphysical parameter values must be ruled out explicitly.


    MARKOV CHAIN MONTE CARLO FOR MIGRATION GENEALOGIES
 TOP
 ABSTRACT
 ISLAND-MODEL GENEALOGIES
 MUTATION
 BAYESIAN INFERENCE
 MARKOV CHAIN MONTE CARLO...
 CODE IMPLEMENTATION AND...
 SELECTED RESULTS FROM SIMULATED...
 HIV PATIENT DATA
 DISCUSSION
 APPENDIX:
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
The posterior density h is summarized using samples drawn from h via Metropolis Hastings Markov chain Monte Carlo (MCMC; METROPOLIS et al. 1953; HASTINGS 1970). The constant z does not need to be evaluated. The main challenges we encountered are the classic obstacles of MCMC-Bayesian inference, the bimodality apparent in some parameters, and a posterior distribution that is very close to being improper. We discuss these issues below. In APPENDIX A we describe a Metropolis-Hastings algorithm that determines a Markov chain Xn, n = 0, 1, 2, ... , with unique equilibrium distribution coinciding with the posterior distribution. The arguments {psi} = ({lambda}, {theta}, µ, g) of the posterior density function h make up the state vector. The MCMC acceptance probability we write in Equation A1 is a slightly simplified form of the Metropolis-Hastings-Green acceptance probability of GREEN (1995). The number of migration events in the state {psi} is randomly variable, and as a consequence the tree-component g of the MCMC state must jump between subspaces g {Gamma}m, which are, as we note above, of unequal dimension. The Metropolis-Hastings-Green generalization of the usual Metropolis-Hastings algorithm treats this feature.

The MCMC operators used to transform the state are called "moves." We implemented ~10 distinct move types. At each MCMC step we choose one of these moves according to some random schedule and apply it to the state. See APPENDIX A for details. We omit detailed descriptions of moves specified in DRUMMOND et al. (2002). Where a tree-topology change is proposed, it is necessary to check that the candidate state is legal (identical deme label for all edges in E attached to every coalescent node). Our candidate generation ignores deme labels. Any candidate state that was not a legal migration-coalescent genealogy was rejected and the MCMC was counterincremented. Such rejections are computed rapidly and appear to give good mixing per CPU cycle, even in the case of many demes (four). Addition and deletion of migration nodes were implemented using a pair-birth/death operation (A4) and a pair-split/merge operation (A5). These moves are illustrated in Figure 1. These operators give irreducibility over migration node number and position for two demes. With the migration birth/death operation (A3) these moves allow the MCMC to visit any migration history of a given coalescent tree with three or more demes. A set of now standard coalescent tree operators (DRUMMOND et al. 2002) gives irreducibility over {Gamma}. Mixing over the parameters µ, {theta}, and {lambda} of the mutation and demography models is achieved via scaling moves, that is, by taking random multiples. This is just random-walk MCMC carried out on a log scale. The two advantages of scaling MCMC are, first, that the size of the change is automatically at the scale of the parameter and, second, that the posterior distribution is insensitive to certain scaling transformations (so t/{theta} is invariant under t -> {delta}t, {theta} -> {delta}{theta}). Tricks of this kind are discussed in detail in DRUMMOND et al. (2002).



View larger version (11K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 1.— Two basic migration moves used in our MCMC implementation. (A) Two migration nodes or events are either deleted or added to a single edge. (B) The migration event is "moved" through a coalescent node.

 
Moves that are simple may give adequate mixing per CPU second if they can be evaluated quickly. Such moves may be relatively easy to implement accurately. We found that we were able to treat at least some problems of practical interest with the simple moves listed in APPENDIX A. Operations on migration nodes are fast, as no likelihood change is involved.

In the experiments reported in SELECTED RESULTS FROM SIMULATED DATA and HIV PATIENT DATA, we restrict attention to populations spread over just two demes, so that p = 2 and = {1, 2}. The first real two-deme data set we looked at was rich in features of potential methodological and biological interest, so we have chosen to display our work in this setting. In the two-deme problem, the deme type is of each edge <r, s> is determined uniquely from , the leaf deme values. The MCMC moves in APPENDIX A treat p ≥ 2. There are two simplifications to the MCMC moves for p = 2. First, the migration birth/death operation (A3) is not required (the MCMC parameters are fixed so that it is selected with probability zero at the proposal step). Second, there is a deme-selection step in the pair-birth move that is uniform at random from the set of demes that might admissibly occupy the new position. It will be seen that this set has just one member in the binary case, so the admissible deme is selected with probability one.

Suppose we iterate the MCMC J steps, collecting samples {psi}s every S steps for a total N = J/S samples. A MCMC realization of this kind is called a "run." We estimate = N–1{sum}sf({psi}s). It is important to have reliable estimates of var{} to debug MCMC, that is, to determine whether the difference between and Eh{f({psi})} is significant. We follow GEYER (1992). The uncertainty in our estimate depends on the integrated autocorrelation time {tau}f. Since var{()} = {tau}fvar{(f)}/N, {tau}f can be interpreted as the number of correlated MCMC samples f({psi}s) with the same variance-reducing effect as one independent sample. We estimate {tau}f from the lag a autocorrelation function {gamma}a = cov(f({psi}s), f({psi}s+a))/var({psi}s) using the monotone sequence estimator described in GEYER (1992). We report the effective sample size (ESS) N/{tau}f for a few statistics computed from our MCMC runs to give a quality check on the MCMC. Efficiency comparisons can be decided from estimated integrated autocorrelation times. Let c denote the mean number of CPU seconds per update. The program with the smallest c{tau}f-value is generating iid-equivalent samples f({psi}s) most rapidly.

It is necessary to check that the MCMC has reached equilibrium and that the variance estimates discussed in the preceding paragraph are reliable. We make multiple independent MCMC runs r = 1, 2, ... , R, from starting conditions drawn independently from the {psi}-prior. We evaluate a r for each run and check that the between-run variance of is predicted by its in-run variance. When we report results in SELECTED RESULTS FROM SIMULATED DATA and HIV PATIENT DATA, we superimpose histograms of f({psi}) computed from the R independent runs. We inspect traces, f({psi}s) as a function of s, for any visual evidence of a trend. We perform a number of further checks as described in GEYER (1992).


    CODE IMPLEMENTATION AND VERIFICATION
 TOP
 ABSTRACT
 ISLAND-MODEL GENEALOGIES
 MUTATION
 BAYESIAN INFERENCE
 MARKOV CHAIN MONTE CARLO...
 CODE IMPLEMENTATION AND...
 SELECTED RESULTS FROM SIMULATED...
 HIV PATIENT DATA
 DISCUSSION
 APPENDIX:
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
The program was written in the JAVA programming language. JAVA was chosen primarily because of its portability and object-oriented features. MCMC is computationally very intensive, so some effort went into tuning performance. However, correctness and ease of debugging were prioritized ahead of performance.

A number of tests were used to verify and debug the code. Naturally we checked that we could recover parameter values from synthetic data, for a wide range of parameter values. Our set of MCMC moves includes moves that are not needed for irreducibility. We check that the simulated posterior density does not change as we vary the proportions in which moves are used. We used the MCMC to simulate the prior density f(g|{theta}, {lambda}) for migration-coalescent trees. Independent samples from this density can be obtained by backward simulation of the migration-coalescent process. A number of statistics [for example, tR = max(t) and m = ||] were checked in this way and were found to have excellent agreement.


    SELECTED RESULTS FROM SIMULATED DATA
 TOP
 ABSTRACT
 ISLAND-MODEL GENEALOGIES
 MUTATION
 BAYESIAN INFERENCE
 MARKOV CHAIN MONTE CARLO...
 CODE IMPLEMENTATION AND...
 SELECTED RESULTS FROM SIMULATED...
 HIV PATIENT DATA
 DISCUSSION
 APPENDIX:
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
The results of 150 simulated data sets are summarized in Table 1. The first set of simulation studies (top of the table) is for symmetric migration rates and population sizes, where this restriction was relaxed for the second set of simulation studies. The details of each are now discussed.


View this table:
[in this window]
[in a new window]

 
TABLE 1 Means of the modes, standard deviation of the modes, and coverage estimators for relevant parameters of 25 simulated data sets each

 
Symmetric migration and population sizes:
For simplicity of exposition and to connect with earlier work, we take two identical demes. We suppose that the migration rates either way and population sizes are known a priori to be equal. In the next section we treat a more general estimation problem. We set {lambda}1,2 = {lambda}2,1 = {lambda} and {theta}1 = {theta}2 = {theta}. We make two pairs of studies, corresponding to parameter estimation in the weak ({lambda} = 2, {theta} = 0.05, {lambda} < 1/{theta}) and strong ({lambda} = 200, {theta} = 0.05, {lambda} > 1/{theta}) migration regimens. The MCMC sampling problem becomes harder as the posterior mean {lambda} increases, as the mean and variance of the number of migration events (about 300 in our strong migration example) increases. In each migration regime we consider serial and isochronous leaf data. We consider isochronous leaf data to allow readers to compare our results with previous studies, in particular, BEERLI and FELSENSTEIN (1999). For serial data we estimate {lambda}, {theta}, µ, and g. For isochronous data, {theta} and µ are confounded. In that setting we condition on knowledge of µ and estimate {lambda}, {theta}, and g. We tried Jeffreys priors and uniform priors for {lambda}, {theta}, and µ with conservative upper bounds. Results presented are for Jeffreys priors, but were in any case very similar.

In each of the four studies we generate 25 migration-coalescent trees. Each tree has 50 leaves, with 25 individuals in each deme. On each tree we simulate synthetic sequence data using a GTR model with fixed relative rate matrix (SHANKARAPPA et al. 1999; normalized to unit mean total substitution rate). This rate matrix is appropriate for HIV and is used in the study of real HIV data presented in the next section. All sequences were 1000 bp long and the mutation rate µ was set equal to one. For the serial data the 50 leaves were split into two groups of 25, offset in time by 0.1 time units (since data are synthesized with µ = 1, these time units happen to be substitutions per site). The earlier sample set was made up of 12 sequences from subpopulation 1 and 13 from subpopulation 2. The MCMC runs were 3 million states long. The worst mixing (by far) was observed for serial data with {lambda} = 200, where there are a large number of migration events on the tree, and µ and {theta} are estimated separately. For this group of 25 synthetic data sets, each MCMC run was monitored for convergence, and terminated when appropriate, rather than run as a batch for a predetermined number of updates. Run times varied between 2 and 10 hr with run lengths up to 8 million updates. ESS values depend on the particular realization of synthetic data, varying between 25 and 200 for µ and 10 and 100 for {lambda}.

Results are summarized in Table 1, top. For each study we report the proportion of the 25 trials in which the true parameter values were inside the 95% highest posterior density (HPD) confidence set. We uncover the truth as we had hoped. Figure 2 shows the marginal posterior density for migration of a typical MCMC run. It is a skewed unimodal distribution. For this reason we used a mode estimator on the marginal posterior density. There is a slight bias, of the same kind observed by BEERLI and FELSENSTEIN (1999) in studies of the likelihood for isochronous data. Mode estimation was accomplished by noting that the local density is inversely proportional to the spread of a fixed number of adjacent point samples. We note that the mode is generally a much better point estimator of the true value than mean estimators used in other articles. Contour plots of 95% migration parameters of representative samples from isochronous and serial leaf data are shown in Figure 3. The isochronous data give migration rate estimates that are both less precise and more strongly skewed than migration rate estimates derived from serial data.



View larger version (15K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 2.— A typical marginal density plot for simulated data for {lambda}. In this example {theta} = 0.05 and {lambda} = 2.

 


View larger version (17K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 3.— Approximate 95% HPD confidence contour plots for representative synthetic data runs. True values are {theta} = 0.05 and {lambda} = 2 indicated by crosshairs. (A) For serial samples; (B) for isochronous samples. Note reduced variance at B.

 
Asymmetric migration rates and population sizes:
Two simulated studies relaxed the original constraint of symmetric population size and migration rates. A total of 60 taxa were used, and the true values tested are {lambda}1 = 1, {lambda}2 = 10 and {theta}1 = 0.1, {theta}2 = 0.2 while for the second set the population size parameters were reversed to {theta}1 = 0.2, {theta}2 = 0.1. In all other aspects the simulation setup was the same as that for the symmetric case above except as noted below.

The reason for the increase in taxa was to obtain informative confidence interval estimates; otherwise they would often be very like the prior. Table 1, bottom, gives the results for the two-parameter sets and shows good recovery of the truth. Generally the convergence and variance are less favorable for this case and longer runs were required (~5 million); otherwise the qualitative behavior is similar to that described above.


    HIV PATIENT DATA
 TOP
 ABSTRACT
 ISLAND-MODEL GENEALOGIES
 MUTATION
 BAYESIAN INFERENCE
 MARKOV CHAIN MONTE CARLO...
 CODE IMPLEMENTATION AND...
 SELECTED RESULTS FROM SIMULATED...
 HIV PATIENT DATA
 DISCUSSION
 APPENDIX:
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
In this section we present an analysis of a real data set. We have chosen HIV sequence data from a single patient. Four sets of samples were collected from two viral demes (blood and semen) over a period of 3 years yielding 31 blood (b-deme) and 25 semen (s-deme) sequences of length 638. The distribution of leaf demes across time can be seen from the line type at the leaf tips in Figure 4.

In the following analysis we use a GTR substitution model with the same fixed rate matrix used for the synthetic data. We do not assume, as we did above, that the two populations are behaving in the same way. The migration rates and population sizes are all distinct. We have = {blood, semen}, p = 2, and parameters {lambda} = ({lambda}s,b, {lambda}b,s), {theta} = ({theta}s, {theta}b), µ, and g.

We began by making some exploratory runs on the complete data set, varying priors and start-state and pseudo-random number initialization between runs. The key feature is bimodality in migration rate parameters. The Markov chain state {psi} = ({lambda}, {theta}, µ, g) flips between two different interpretations of the data. Figure 5A shows the behavior of the parameter {lambda}s,b along a 200 million update segment of a 540 million update run. In this run the {lambda}, {theta} and µ priors were flat and bounded at conservative values. The {lambda}s,b-parameter is jumping between two quite different values. All parameters jump in concert with {lambda}s,b. This posterior distribution has two well-defined peaks. This is visible in a contour plot of the joint posterior {lambda}s,b, {lambda}b,s|D distribution, Figure 5B.



View larger version (25K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 5.— Some plots for the full HIV data set. (A) Plot of {lambda}s->b for the full HIV data set showing 400 million of the total 540 million updates. Flat bounded priors were used. (B) The migration contour plot from the same data clearly showing the bimodality.

 
What is the origin of the bimodality? The simplest possibility is that the data tell us that the migration is asymmetric, but leave the favored direction in doubt. Uncertainty of this kind can be removed, if further prior knowledge is available; as noted in the Introduction, with appropriate information, we may weight the asymmetry of rates in favor of one or the other direction of migration. On the other hand, the MCMC might be failing us. Perhaps there is no real bimodality and the MCMC is not in equilibrium. We consider also the possibility that the model is wrong. In particular the population sizes and per capita migration rates may change with time. Also the blood deme may in fact be multiple unobserved demes, which we discuss below in more detail.

Figure 4 gives typical genealogies for the respective modes. At the top is a tree from the s -> b ({lambda}b,s > {lambda}s,b) mode. There are in forward time many migration events, most of which are s -> b and close to the leaves (so {lambda}b,s is large). At the bottom is a tree from the b -> s mode ({lambda}s,b > {lambda}b,s). There are fewer migration events, most of which are b -> s. There are no important changes in the topology of coalescent events between modes.

Trees in the b -> s mode have fewer migration events than those in the s -> b mode. In the s -> b mode, the proximity of many s-deme leaves to the root supports an s-deme for the root. Coalescent branches terminating in b-deme leaves are typically much longer than those terminating in s-deme leaves. This feature is particularly marked in the topmost b-clade associated with the last two time stages in the data set (compare it with the s-clade for those two stages). Because so much of the total branch length is close to b-deme leaves, the s -> b events needed to convert the s-deme at the root are most probably located close to the leaves. This statement has been checked by analyzing the simple but related problem of estimating the relative rates of a two-state mutation process on a fixed tree. Extending the leaf branches raises the likelihood of asymmetric rate estimates. This kind of reconstruction (many migration events close to leaves) was never seen in synthetic data on simulated trees. It is seen in synthetic data generated on trees, like those in Figure 4, simulated from the posterior of this HIV data set. It seems likely that the bimodality is related to the long branches attached to the blood-deme leaves.

Why are the branches attached to the blood deme stretched in this way? We may be seeing a model violation. The blood deme may be a composite of several unobserved demes. The likely consequence of multiple, hidden demes is to lengthen the time to coalescence of any two lineages, i.e., to lengthen the branch lengths. As we note above, the apparent relationship between bimodality and long branches may therefore be a reflection of the fact that we have not sampled from these hidden demes. Other aspects of HIV evolution can also account for long branches; for instance, both population growth and recombination have the characteristic effect of producing long terminal lineages, consistent with the patterns we observe here.

In the remainder of this section we rule out the possibility that the bimodality is a consequence of software artifacts and insufficient mixing of the Monte Carlo chain. In these bimodal runs, the MCMC state is moving between two classes of migration genealogies that differ by the number and position of a large number of nodes (~60). The intermediate states have low probability. It is the bimodality of this data set rather than its size that puts it at the limit of what we can study with this software on current hardware. The states make sense as alternative explanations of the given leaf deme types. This basic consistency, with positive results for the MCMC convergence checks described in MARKOV CHAIN MONTE CARLO FOR MIGRATION GENEALOGIES, convinces us that the bimodality is real and that the MCMC is delivering states representative of the posterior.

We search now for evidence of time dependence in {lambda} and {theta}. We separate the data into two sets. The "new" ("old") data set contains sequences from all individuals sampled at the two "last" ("first") time stages. The MCMC mixes far more rapidly on these data sets. Effective sample sizes in the hundreds are obtained from overnight runs and we were able to make a more thorough study. For each of these data sets we made 20 runs with random starting states and 500 million updates per run, sampling every 10,000 states. Of the 20 runs, 10 used flat priors and 10 used scale invariant Jeffreys priors. Conservative hard upper bounds were imposed in all cases. Figure 6 shows the estimated autocorrelation function computed from the MCMC output for µ in the new data set with flat priors and was typical. These statistics yielded a worst-case parameter ESS of 1300, the smallest effective sample size over all runs.



View larger version (14K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 6.— Autocorrelation function {gamma}s (defined in MARKOV CHAIN MONTE CARLO FOR MIGRATION GENEALOGIES) for µ for the new data set. The x-axis is sample number (multiply by 10,000 to get updates). This plot indicates (i.e., good) typical mixing.

 
Selected marginal posterior plots are shown in Figures 7 and 8. Marginal posterior densities are consistent between runs, which supports other evidence that the MCMC runs have equilibrated. The time to equilibrium was a tiny fraction of the run length. The bimodality present in the full data set is visible in Figure 10, the new data set, which tends to support the view that it represents a real ambiguity in the data, rather than an artifact of time-varying demography. When the migration genealogies associated with the two modal classes of the new data set are examined, we see the same qualitative behavior as for the full data set, which is shown in Figure 9.



View larger version (22K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 7.— Marginal posterior density for µ in old (right-hand peak) and new (left-hand peak) data sets. Note the decrease in mutation rate with time.

 


View larger version (24K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 8.— Marginal {lambda}s->b in the old data set. The higher peak was obtained using Jeffreys priors and the lower one using flat priors.

 


View larger version (18K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 10.— Contour plots of migration rates for temporally split data sets. (A) The new data set (note the bimodality). (B) The old data set.

 


View larger version (10K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 9.— Instances of trees from the new data set MCMC chain showing bimodality similar in character to the full data set. Analysis of the old data set does not show bimodality.

 
Comparison of the two data sets does reveal significant differences in parameter values. Referring to Figure 7, the mutation rate is higher in the old data set. We have used a constant nominal generation time {rho} equal to 1 day. The mutation rate depends on generation time, which in turn is dependent on the type of cell infected: broadly, HIV-infected cells can be classified into three categories—productively infected cells, long-lived cells, or latently infected cells (PERELSON et al. 1996). Each of these categories has different generation times, the shortest being productively infected cells (on the order of 2 days) and the longest being latently infected cells (on the order of several years). It is conceivable that as infection progresses, the relative proportions of these infected cells change in the tissues sampled, thus leading to a change in observed mutation rates. Migration is strongly s -> b in the early part of the infection, but only weakly asymmetric in the later stages, possibly reflecting a higher rate of early colonization events, before the saturation of available target cells in semen (Figure 10).

Figure 11B illustrates the general point that the switch from flat to Jeffreys priors has little consequence for marginal posterior densities. In Figure 11B we see that the Jeffreys prior does pull in the upper tail of the {theta}s-distribution fairly sharply. This sensitivity of the upper tail of the {theta}i-distribution to the choice of prior can be understood as follows. Bounds can be set to conservative values without particular care if the MCMC output is studied carefully. Where MCMC runs actually visit bounds, we have a possible signal that the data are adding little to the information in the prior. The parameter {theta}sµ visits its upper bound (at Ns{rho}µ = 0.5); since µ {approx} 0.2 x 10–5 this bound acts around {theta}s ~= 0.5/0.2 x 10–5 ~= 25,000. This is visible in Figure 11A, where {theta}s exceeds the plotted range. This is just what we expect from the discussion in BAYESIAN INFERENCE concerning migration genealogies with no coalescent events in a given deme [the ci parameter of f(g|{theta}, {lambda}) is zero with posterior probability pi] for the flat prior used for that run. If pi is very small, the MCMC {theta}i-trace will not visit the tail of the posterior density made up of states associated with ci = 0, even though (for the flat prior) that tail does not die to zero as {theta}i -> {infty}. For the full HIV data set this is the case. The posterior mean {theta}-values will diverge as the prior upper bound is sent to infinity but the problem is not visible in the MCMC because the corresponding pi-values are negligible. However, in the new component of the data set, both ps and pb are sufficiently large. The long tail in the {theta}s-distribution for the flat prior in Figure 11B and the spiky excursions to the upper bound at {theta}s{rho}µ = 0.5 are instances of the phenomenon. Care needs to be taken to ensure that the upper tail of the prior {theta}i, i , distributions really represent prior knowledge. No Bayesian inference that is completely noninformative of {theta}i can be made. Where the MCMC does not reveal the true tail behavior, as in the full data runs, the best we can do is assert that we have enough evidence to know what we would see if we waited long enough.



View larger version (25K):
[in this window]
[in a new window]
[Download PPT slide]
 
FIGURE 11.— New data set plots for {theta}s. (A) One MCMC trace and (B) multiple marginal plots. The higher peak in B was obtained using Jeffreys priors and the other using flat priors. Despite the appearance of bad mixing, we note the excellent agreement between chains in B.

 

    DISCUSSION
 TOP
 ABSTRACT
 ISLAND-MODEL GENEALOGIES
 MUTATION
 BAYESIAN INFERENCE
 MARKOV CHAIN MONTE CARLO...
 CODE IMPLEMENTATION AND...
 SELECTED RESULTS FROM SIMULATED...
 HIV PATIENT DATA
 DISCUSSION
 APPENDIX:
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
We have shown that we can simultaneously recover mutation rate, population sizes, migration rates, and genealogical information from temporal and spatially sampled sequence data within a Bayesian framework using MCMC. This nontrivial problem involving a space of varying dimension has been shown to converge in practical time frames with complex data sets of moderate size.

Simulation results demonstrated that recovery of the true parameters is consistent and repeatable with rapid convergence for low (