Genetics, Vol. 155, 1429-1437, July 2000, Copyright © 2000

An Integrated Framework for the Inference of Viral Population History From Reconstructed Genealogies

Oliver G. Pybusa, Andrew Rambauta, and Paul H. Harveya
a Department of Zoology, University of Oxford, Oxford OX1 3PS, United Kingdom

Corresponding author: Oliver G. Pybus, Department of Zoology, So. Parks Rd., Oxford OX1 3PS, United Kingdom., oliver.pybus{at}zoo.ox.ac.uk (E-mail)

Communicating editor: S. TAVARÉ


*  ABSTRACT
*TOP
*ABSTRACT
*METHOD
*SIMULATIONS
*EXAMPLE: THE HIV-1 EPIDEMIC
*CONCLUSION
*LITERATURE CITED

We describe a unified set of methods for the inference of demographic history using genealogies reconstructed from gene sequence data. We introduce the skyline plot, a graphical, nonparametric estimate of demographic history. We discuss both maximum-likelihood parameter estimation and demographic hypothesis testing. Simulations are carried out to investigate the statistical properties of maximum-likelihood estimates of demographic parameters. The simulations reveal that (i) the performance of exponential growth model estimates is determined by a simple function of the true parameter values and (ii) under some conditions, estimates from reconstructed trees perform as well as estimates from perfect trees. We apply our methods to HIV-1 sequence data and find strong evidence that subtypes A and B have different demographic histories. We also provide the first (albeit tentative) genetic evidence for a recent decrease in the growth rate of subtype B.


COALESCENT theory provides a framework for understanding the relationship between a population's demographic history and its genealogy. The rapid accumulation of gene sequence data has prompted the development of many coalescent-based methods with the common aim of inferring the history of population size from samples of gene sequences. These methods fall into three categories: (i) methods that compare observed distributions of pairwise genetic differences with expected distributions derived from coalescent theory (DI RIENZO and WILSON 1991 Down; SLATKIN and HUDSON 1991 Down; POLANSKI et al. 1998 Down; GRASSLY et al. 1999 Down); (ii) methods that calculate the likelihood of an observed set of sequences given specified models of sequence evolution and demographic change (GRIFFITHS and TAVARE 1994 Down; KUHNER et al. 1995 Down, KUHNER et al. 1998 Down); and (iii) methods that infer past population history from a reconstructed genealogy (FU 1994 Down; NEE et al. 1995 Down; PYBUS et al. 1999 Down).

Most previous studies have concentrated on estimating the parameters of simple demographic models, such as exponential growth and constant population size. However, estimated demographic parameters are meaningful only if there is a prior reason to believe that the sampled population fits the specified demographic model. How, therefore, can we proceed with parametric estimation if we have sequences sampled from a population with an unknown history? NEE et al. 1995 Down suggested using lineages-through-time (LTT) plots that display the rate of coalescence in a reconstructed genealogy through time. POLANSKI et al. 1998 Down introduced a method for estimating the history of population size from an observed distribution of pairwise differences. Although both of these methods can be used to infer demographic trends, neither provides a measure of confidence or allows demographic hypotheses to be tested. Furthermore, pairwise difference methods are expected to be less efficient than methods that incorporate genealogical information (FELSENSTEIN 1992 Down). Here, we introduce a new nonparametric estimate of population history and describe a maximum-likelihood (ML) framework for demographic parameter estimation and hypothesis testing. We apply these methods to HIV-1 sequence data and compare our results with those obtained using other techniques.

The rationale of our approach is as follows. Given a set of sequences S, the likelihood of demographic hypothesis H is given by L[H|S] = {int}G L[H|G] · L[G|S], where G is a genealogy with specified branch lengths. L[H|G] is provided by coalescent theory and L[G|S] can be calculated using standard likelihood methods (FELSENSTEIN 1981 Down). However, the integration must be performed over the set of all possible genealogies, which is impractically large. Consequently GRIFFITHS and TAVARE 1994 Down and KUHNER et al. 1995 Down, KUHNER et al. 1998 Down have developed general methods for estimating L[H|S] using Monte Carlo integration. Unfortunately these methods are computationally intensive and difficult to implement with complex substitutional and demographic models. We propose a complementary simpler approach in which the likelihood of H is calculated directly from G*, a genealogy with specified branch lengths reconstructed from S. G* must be estimated under the assumption of a molecular clock. We also recommend that G* is reconstructed using a maximum-likelihood approach, so that it is an estimate of the most likely tree given the data (the tree with the greatest value of L[G|S]). In effect, our approach assumes that (i) very few genealogies are likely to have given rise to S, and (ii) the reconstruction method can reliably locate these genealogies. This method may be practical for sequences that contain much phylogenetic information, particularly those sampled from rapidly evolving RNA virus populations, and we report simulation results suggesting that this is so. Monte Carlo integration methods should also run more quickly on such sequences, in comparison to data sets that contain little variation.


*  METHOD
*TOP
*ABSTRACT
*METHOD
*SIMULATIONS
*EXAMPLE: THE HIV-1 EPIDEMIC
*CONCLUSION
*LITERATURE CITED

Background theory:
We are interested in the genealogy of individuals randomly sampled from a large population, the size of which varies deterministically through time. GRIFFITHS and TAVARE 1994 Down and DONNELLY and TAVARE 1995 Down have shown that such genealogies can be modeled using the variable population size coalescent process. An outline of this process is given below. Consider a large haploid population with no recombination, subdivision, or migration, and let Ne(x) be the effective population size at time x. Time (in units of generations) increases into the past; Ne(0) is the effective population size at the present. The relationship between Ne(x) and the census population size N(x) is determined by the population's reproduction model. If the population reproduces according to the Wright-Fisher model (each offspring selects a parent randomly from the previous generation), then Ne(x) = N(x) (KINGMAN 1982 Down).

Consider a set of n gene sequences randomly sampled from the population at the present. The genealogy of the sampled sequences will contain n - 1 ordered internode intervals, labeled I2, I3, ... , In. The sizes of these intervals are denoted g2, g3, ... , gn. The subscripts refer to the number of lineages present in the sampled genealogy during each interval. We apply the coalescent approximation throughout, such that the interval sizes gi are distributed according to the probability density function

(1)

where ti is the time at which interval Ii starts (GRIFFITHS and TAVARE 1994 Down). Equation 1 provides the cornerstone of our framework. However, genealogies reconstructed under the assumption of a molecular clock have internode intervals measured in units of expected substitutions per site, not generations. We therefore rescale the coalescent process into the same units by implementing the change of variable {gamma}i = µgi, where {gamma}i is interval size in substitutions per site and µ is the mutation rate in substitutions per site per generation. After this rescaling, p(gi|ti) becomes p({gamma}i|{tau}i), and many variables become functions of µ (see below).

We call Ne(x) the demographic model, as it represents change in population size through time. Here, we consider two tractable and common demographic models, constant population size, Ne(x) = Ne(0), and exponential growth, Ne(x) = Ne(0)e-rx, where r is the exponential growth rate. Ne(0) and r are demographic parameters that we may wish to estimate. If time is measured in substitutions, then {theta} = Ne(0)µ and {rho} = r/µ. A demographic hypothesis is a demographic model with specified parameter values. All the methods described in this section are implemented in the program GENIE, available from http://evolve.zoo.ox.ac.uk.

Simulating coalescent trees:
If U is a unit uniform random variable, then solving

(2)

for gi will generate a variate sampled randomly from Equation 1 (DONNELLY and TAVARE 1995 Down). Solutions for the constant size and exponential growth models are provided by HUDSON 1990 Down and SLATKIN and HUDSON 1991 Down. Equation 2 enables coalescent trees to be simulated under almost any demographic hypothesis; the only caveat is that the value of the integral must tend to infinity as x increases, that is, Ne(x) cannot increase indefinitely into the past. Fortunately, all demographic histories that fail this restriction are biologically implausible.

The skyline plot:
Here we describe a new nonparametric estimate of demographic history. Since a reconstructed genealogy provides estimates of the random variables gi and ti, what can we infer about Ne(x) from these values? Rearranging Equation 2 we obtain the relationship

(3)

Hi has a meaningful biological interpretation—it is the harmonic mean of effective population size in the range [ti, gi + ti], where [ti, gi + ti] is the time interval delimited by internode interval Ii. If time is measured in substitutions per site, then

(4)

Since -ln(U) represents random error, the term i = {gamma}i (i2) is an estimate of Hiµ that can be calculated from an observed genealogy. Consequently, a plot of i against time defines a piecewise function that is a nonparametric estimate of demographic history. We name these plots skyline plots. i represents all the information about Ne(x) that can be inferred from the observed interval Ii. In other words, the most we can infer from Ii is that it defines a time interval [ti, gi + ti], during which the harmonic mean of Ne(x)µ is estimated to be i. If the substitution rate µ is known, then gi(i2) can be used to estimate Hi directly.

Fig 1 illustrates the ability of skyline plots to reconstruct population history under a variety of demographic scenarios. i is equal to Ne(x)µ only if Ne(x)µ is constant. Hence, for the constant size model, the arithmetic mean of the i is equal to the maximum-likelihood estimate of effective population size (FELSENSTEIN 1992 Down). For other demographic models, i can underestimate Ne(x)µ, because a harmonic mean is always smaller than its corresponding arithmetic mean. As illustrated in Fig 1B, this systematic bias is small when the rate of coalescence is large compared to the rate of population change [that is, when the harmonic and arithmetic means of Ne(x)µ during an interval are similar].






View larger version (58K):
In this window
In a new window
Download PPT slide
 
Figure 1. Skyline plots can reconstruct population history under different demographic scenarios. (a) Constant population size, {theta} = 100. (b) Exponential growth, {theta} = 100, {rho} = 100. (c) A 100-fold instantaneous increase in population size at time 0.5. (d) Logistic growth. The vertical axis shows estimated Neµ. The horizontal axis represents time in units of substitutions; time is zero at the present. In a–d, the top graph shows the expected skyline plot, obtained by calculating the mean of 5000 plots. The bottom graph shows the skyline plot of a single genealogy simulated under the same conditions. Both the inferred (thick lines) and true (thin lines) demographic histories are shown.

Parametric estimation:
We use a maximum-likelihood approach to parameter estimation. Given an observed interval size {gamma}i, the likelihood function is

(5)

where {psi} represents the parameter values of the demographic hypothesis and k is an arbitrary constant (EDWARDS 1972 Down). l[{psi}|G], the log likelihood of the {gamma}2, {gamma}3, ... , {gamma}n from an observed genealogy is simply the sum of the log likelihoods for each interval,

(6)

The maximum-likelihood estimate (MLE) of {psi}, denoted , can be found numerically. In addition, the shape of the likelihood surface near can be used to obtain approximate confidence limits. We obtained ~95% confidence sets using the likelihood-ratio test (LRT). A point in parameter space, {psi}', lies within the confidence set if it satisfies

(7)

where d is the number of demographic parameters. Our simulation results (see next section) suggest that this heuristic approach is reasonably accurate. The likelihood framework above can be used to estimate the parameters of almost any demographic model, provided that the probability model is well formed and the MLE can be reliably located.

Hypothesis testing:
Here we provide two methods for testing demographic hypotheses. First, we describe how to accept or reject specific hypotheses. Given a demographic hypothesis Ne(x), we can use Equation 2 to transform the observed internode intervals {gamma}2, {gamma}3, ... , {gamma}n into u2, u3, ... , un, a series of numbers in the range [0, 1]. If Ne(x) represents the true history of the observed genealogy, then the ui will be distributed according to a unit uniform distribution. The hypothesis Ne(x) can therefore be tested using the one-sample Kolmorgorov-Smirnov (KS) test. The KS test statistic is a measure of the difference between an observed distribution and an expected distribution. Therefore, values of the KS test statistic can also be used to compare the goodness-of-fit of any pair of demographic hypotheses.

The second method enables us to reject entire classes of hypotheses. More specifically, demographic model A can be rejected in favor of model B, provided A is a special case of B. For example, the constant-size model is a special case of the exponential model (corresponding to r = 0). We can reject the constant-size model in favor of the exponential model if the confidence intervals of the MLE of r do not include zero. This test can be similarly applied to any pair of nested demographic models. However, the reliability of this procedure, which is essentially a LRT, depends on the accuracy of the approximate confidence intervals (see above).


*  SIMULATIONS
*TOP
*ABSTRACT
*METHOD
*SIMULATIONS
*EXAMPLE: THE HIV-1 EPIDEMIC
*CONCLUSION
*LITERATURE CITED

The performance of maximum-likelihood estimates:
Extensive simulations were carried out to investigate the bias, variability, and type I error rate of ML estimates calculated from simulated coalescent trees. These simulations describe the performance of ML estimates when genealogies are reconstructed without error, and therefore represent the "best case" scenario, against which the performance of other methods should be compared. The simulations were performed as follows:

  1. A coalescent tree with 30 tips was simulated with specified parameter values, using Equation 2.

  2. A MLE () and ~95% confidence intervals were obtained for each parameter.

  3. Steps i and ii were repeated 200 times.

  4. The bias of the MLE of parameter value p was calculated as b(p) = .

  5. The variability of the MLE of parameter value p was calculated as {nu}(p) = .

  6. The type I error rate of the MLE of parameter value p, denoted e(p), was calculated as the number of simulated trees for which the true parameter value lay outside the 95% confidence intervals of the MLE. e(p) is expected to be binomially distributed with parameters 0.05 and 200.

  7. Steps i–vi were repeated for many different parameter values.

The constant-size model was investigated first and our results agreed with the theoretical values provided by FELSENSTEIN 1992 Down. The MLE of {theta} is unbiased and has variability {nu}({theta}) = 1/(n - 1), where n is the number of tips in the genealogy. The type I error rate e({theta}) was within its expected range (results not shown).

The bias and efficiency of the exponential model parameters, {theta} and {rho}, are shown in Fig 2. In agreement with KUHNER et al. 1998 Down, both parameters are biased upward, the bias of {rho} being more severe than the bias of {theta}. In addition, our results reveal an important and hitherto undetected pattern—the bias and variability of both parameters depend only on their product, {alpha} = {theta}{rho}. {alpha} has previously been shown to be important in determining the behavior of the exponential-growth coalescent process (SLATKIN and HUDSON 1991 Down; PYBUS et al. 1999 Down). As {alpha} increases, the bias and variability of {theta} estimates increase, but the bias and variability of {rho} estimates decrease. Hence, MLEs of {rho} are more accurate when {alpha} << 1 and are less accurate when {alpha} >> 1. The opposite is true for MLEs of {theta}. It appears that {alpha} = 1 marks a transition in the behavior of the exponential coalescent process, which behaves similarly to the constant-size process when {alpha} << 1, but generates increasingly star-like trees as {alpha} increases. The error rates e({theta}) and e({rho}) were within their expected ranges (results not shown).






View larger version (72K):
In this window
In a new window
Download PPT slide
 
Figure 2. The bias and variability of MLEs of exponential model parameters. (a) Bias of {theta} estimates. (b) Variability of {theta} estimates. (c) Bias of {rho} estimates. (d) Variability of {rho} estimates.

The effect of phylogenetic reconstruction:
A second set of simulations was performed to investigate the relative performance of MLEs calculated from correct and reconstructed genealogies. The simulations were performed as follows:

  1. Coalescent trees with 15 tips were simulated with specified parameter values, using Equation 2.

  2. MLEs of {theta} and {rho} were obtained from each coalescent tree (the true tree).

  3. Sequences, 1000 nucleotides in length, were simulated down each coalescent tree according to the Hasegawa-Kishino-Yano (HKY85) substitution model, with equal base frequencies and a transition:transversion ratio (ti:tv) of 2 (HASEGAWA et al. 1985 Down). This step was performed with Seq-Gen (RAMBAUT and GRASSLY 1997 Down).

  4. Reconstructed trees were obtained from the simulated sequences using two methods. For the first method, a ML distance matrix was estimated using the HKY85 model (ti:tv was estimated), from which a UPGMA tree was constructed. The branch lengths of the UPGMA tree were then reestimated using ML (again under the HKY85 model). The second reconstruction method was a heuristic ML topology search, using the stepwise-addition and nearest-neighbor-interchange algorithms. Tree likelihoods were calculated using HKY85 (ti:tv was estimated) with molecular clock enforced. This step was performed with PAUP4.0b3a (SWOFFORD 1999 Down).

  5. MLEs of {theta} and {rho} were calculated from each reconstructed UPGMA and ML tree.

  6. Bias, variability, and error rates were calculated as described in the previous section.

The above procedure was repeated with two sets of demographic parameters ({theta} = 10, {rho} = 100) and ({theta} = 1, {rho} = 1000), which represent two populations with the same demographic history (Ne(0) = 104, r = 0.1) but different substitution rates (µ = 10-3 and µ = 10-4, respectively). A lower substitution rate will result in less diverse sequences, making accurate tree reconstruction more difficult.

Fig 3 shows the distribution of {theta} estimates obtained using the µ = 10-3 substitution rate. This distribution is highly skewed with a long upper tail, making b({theta}) and {nu}({theta}) very difficult to estimate with a small number of replicates (and perhaps causing the stochasticity seen in Fig 2, a and b). Table 1, which contains the simulation results, therefore displays percentiles of this distribution. The distribution of {rho} estimates was much less skewed so b({rho}) and v({rho}) could be estimated accurately.



View larger version (16K):
In this window
In a new window
Download PPT slide
 
Figure 3. Distribution of MLEs of {theta} obtained from true trees (10,000 replicates) and ML-reconstructed trees (500 replicates). The parameters used were {theta} = 10, {rho} = 100.


 
View this table:
In this window
In a new window

 
Table 1. The effect of phylogenetic reconstruction on demographic parameter estimates

For the faster rate (µ = 10-3), MLEs from reconstructed trees performed as well as MLEs from the correct tree (Table 1; Fig 3). Surprisingly, the UPGMA algorithm did as well as ML tree estimation. For the slower rate (µ = 10-4), the error rates of MLEs from reconstructed trees were higher than those from the correct trees. In this scenario, UPGMA fared worse than ML tree estimation. These results suggest that using a reconstructed genealogy to infer demographic history becomes more reasonable as substitution rate increases (provided that the sequences do not become saturated with substitutions).

As far as possible, the above simulations were designed to mimic the evolution of RNA viruses; µ = 10-3 is slightly less than current estimates of HIV-1 substitution rate (LI et al. 1988 Down; LEITNER and ALBERT 1999 Down). However, simplifications have been made for the sake of computational feasibility, so these results should be considered as encouraging but preliminary. A more complete study that incorporates among-site rate heterogeneity, larger trees, different sequence lengths, alternative heuristic search strategies, and more complex substitutional models is necessary.


*  EXAMPLE: THE HIV-1 EPIDEMIC
*TOP
*ABSTRACT
*METHOD
*SIMULATIONS
*EXAMPLE: THE HIV-1 EPIDEMIC
*CONCLUSION
*LITERATURE CITED

Here, we illustrate our methods using four HIV-1 data sets, which contain env and gag gene sequences from HIV-1 subtypes A and B. These two prevalent subtypes have differing geographical distributions and transmission routes. Subtype A is found mostly in sub-Saharan Africa, where ~90% of transmissions occur through heterosexual intercourse. In contrast, subtype B circulates mainly in the developed world and has been predominately transmitted via intravenous drug use and homosexual intercourse (UNAIDS 1998 Down). The data sets used here, labeled envA, envB, gagA, and gagB, were reported in PYBUS et al. 1999 Down. They were carefully compiled to minimize the effects of nonrandom sampling and intersubtype recombination. ML genealogies were estimated for each data set using HKY85 and a codon-position model of rate heterogeneity, under the assumption of a molecular clock. The genealogies are described fully in PYBUS et al. 1999 Down.

Fig 4 shows the skyline plots obtained from the four HIV-1 genealogies. The subtype A plots indicate a constant-rate exponential increase in population size toward the present, whereas the demographic history of subtype B appears to be logistic. However, the subtype B plots are also consistent with the hypothesis of exponential growth, because genealogies from rapidly expanding populations have large internode intervals near the present (SLATKIN and HUDSON 1991 Down), resulting in an underestimation of population size (see Fig 1B for example). Previous LTT plot analyses of HIV-1 have only indicated that both subtypes A and B have increased exponentially (HOLMES et al. 1999 Down).




View larger version (34K):
In this window
In a new window
Download PPT slide
 
Figure 4. The skyline plots of the four HIV-1 genealogies (see Fig 1 for details).

To further investigate the demographic history of HIV-1, we estimated {theta} and {rho} using the ML framework described above (results shown in Fig 5). gag and env genealogies belonging to the same subtype generate different parameter estimates. This is because {theta} and {rho} are both functions of the substitution rate, which is higher in the env gene than in the gag gene (LI et al. 1988 Down; LEITNER and ALBERT 1999 Down). Hence, for each subtype, gag genealogies tend to generate larger {rho} estimates and smaller {theta} estimates than env genealogies.



View larger version (26K):
In this window
In a new window
Download PPT slide
 
Figure 5. MLEs of {theta} (open bars) and {rho} (shaded bars) for the four HIV-1 genealogies.

We calculated MLEs of {alpha} = {theta}{rho} using the results in Fig 5. For both subtypes, gag and env estimates of {alpha} are similar, which is expected because {alpha} is independent of substitution rate (results not shown). These MLEs were compared with {alpha} estimates obtained from the same sequences using the mid-depth method (PYBUS et al. 1999 Down). The ML and mid-depth estimates were very similar and both suggested that {alpha} is greater for subtype B than for subtype A. The mid-depth method confidence intervals were larger, indicating that the method presented here is more powerful.

Returning to Fig 5, both {theta} and {rho} are greater for subtype B than for subtype A. For both subtypes {alpha} >> 1, hence {rho} estimates are expected to be almost unbiased (Fig 2) and it is safe to infer that subtype B has a faster growth rate than subtype A. We suggest that this result is due to the different modes of transmission that characterized the initial spread of the subtypes. In the developed world, subtype B transmission was aided by the presence of interconnected "standing networks" of intravenous drug users and homosexual men, within which transmission rates were very high (ROBERTSON et al. 1986 Down; JACQUEZ et al. 1994 Down). In sub-Saharan Africa, subtype A was spread via heterosexual intercourse, so average waiting times between transmissions were longer (TARANTOLA and SCHWARTLANDER 1997 Down).

Our estimates of current population size, {theta}, are larger for subtype B than for A. Taken at face value this result is surely wrong: sub-Saharan Africa contains ~70% of the world's HIV-1-infected individuals (UNAIDS 1998 Down), more than half of which appear to be infected with subtype A (RAYFIELD et al. 1998 Down; ROBBINS et al. 1999 Down). However, {alpha} values for subtype B are large, so MLEs of {theta} are expected to be biased upward (Fig 2). In contrast, {theta} estimates for subtype A are expected to be almost unbiased. We also suggest a second possible explanation for these results; the demographic history of subtype B may not be exponential. The subtype B skyline plots are consistent with both a logistic and an exponential demographic history. If the logistic model is correct, then estimates of {theta} obtained under the exponential model will be too large. A logistic scenario is partially consistent with epidemiological evidence, as the introduction of behavioral intervention and antiretroviral therapies in western Europe has led to a decrease (although not a cessation) in the number of new infections (UNAIDS 1998 Down).

GRASSLY et al. 1999 Down studied the population dynamics of HIV-1 using a pairwise difference distribution method. They estimated the current effective population size of subtype A to be larger than that of subtype B. However, they assumed equal growth rates for both subtypes, an assumption that our results suggest is incorrect. Clearly further work is needed to reconcile these differences.

We believe that our results are not an artifact of nonrandom sampling, recombination, selection, or variable substitution rates. As discussed previously in HOLMES et al. 1999 Down and GRASSLY et al. 1999 Down, these processes are not expected to be acting differentially at the subtype level. Furthermore, our skyline plots are consistent across different genes, indicating that they are surprisingly robust to different substitution rates and modes of selection (LI et al. 1988 Down).


*  CONCLUSION
*TOP
*ABSTRACT
*METHOD
*SIMULATIONS
*EXAMPLE: THE HIV-1 EPIDEMIC
*CONCLUSION
*LITERATURE CITED

Skyline plots are the most appropriate way of graphically displaying the demographic information contained in reconstructed genealogies. They display estimated population size against time, and are therefore more intuitive than NEE et al. 1995 Down LTT plots, which must be interpreted using transformations. As skyline plots explicitly incorporate genealogy, they are expected to make more efficient use of the data than POLANSKI et al. 1998 Down more complex pairwise difference distribution method.

Omitting the computationally difficult task of integrating over all possible genealogies greatly simplifies ML parameter estimation. This allows us to use the complex substitution models necessary to accurately represent HIV-1 evolution (LEITNER et al. 1997 Down). Furthermore, in future work it should be possible to implement more realistic demographic models; our HIV-1 results suggest that the constant size and exponential models alone are not sufficient. However, it is essential to quantify the effects of tree reconstruction on parameter estimation and to determine the conditions under which our method may be appropriate—the simulations reported here are only preliminary.

Research is also needed to quantify the effects of recombination, selection, subdivision, variable substitution rates, and nonrandom sampling on the accuracy of demographic inference. If these processes, at biologically realistic levels, have a significant effect, then they must be incorporated into the coalescent framework. Significant progress in this area is well underway (see RODRIGO and FELSENSTEIN 1999 Down), although it may be impossible to implement recombination in a framework that considers only a single tree.


*  ACKNOWLEDGMENTS

Thanks to Mike Charleston, Eddie Holmes, Mike Worobey, and Bob Griffiths for their advice. Thanks to Peter Donnelly for comments on the manuscript and helpful discussions of statistical problems. We also thank the reviewing editor and two anonymous referees for their suggestions. This work was funded by the Wellcome Trust (grant 050275) and The Royal Society.

Manuscript received May 21, 1999; Accepted for publication March 30, 2000.


*  LITERATURE CITED
*TOP
*ABSTRACT
*METHOD
*SIMULATIONS
*EXAMPLE: THE HIV-1 EPIDEMIC
*CONCLUSION
*LITERATURE CITED

DI RIENZO, A. and A. C. WILSON, 1991  Branching pattern in the evolutionary tree for human mitochondrial DNA. Proc. Natl. Acad. Sci. USA 88:1597-1601[Abstract/Free Full Text].

DONNELLY, P. and S. TAVARÉ, 1995  Coalescents and genealogical structure under neutrality. Annu. Rev. Genet. 29:401-421[Medline].

EDWARDS, A. W. F., 1972 Likelihood. Cambridge University Press, Cambridge, United Kingdom.

FELSENSTEIN, J., 1981  Evolutionary trees from gene frequencies and quantitative characters: finding maximum likelihood estimates. Evolution 35:1229-1242.

FELSENSTEIN, J., 1992  Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates. Genet. Res. 59:139-147[Medline].

FU, Y.-X., 1994  A phylogenetic estimator of effective population size or mutation rate. Genetics 136:685-692[Abstract].

GRASSLY, N. C., P. H. HARVEY, and E. C. HOLMES, 1999  Population dynamics of HIV-1 inferred from gene sequences. Genetics 151:427-438[Abstract/Free Full Text].

GRIFFITHS, R. C. and S. TAVARÉ, 1994  Sampling theory for neutral alleles in a varying environment. Philos. Trans. R. Soc. Lond. B 344:403-410[Medline].

HASEGAWA, M., H. KISHINO, and T. YANO, 1985  Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174[Medline].

HOLMES, E. C., O. G. PYBUS and P. H. HARVEY, 1999 The molecular population genetics of HIV-1, pp. 177–207 in The Evolution of HIV, edited by K. A. CRANDALL. John Hopkins University Press, Baltimore.

HUDSON, R. R., 1990  Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7:1-44.

JACQUEZ, J. A., J. S. KOOPMAN, C. P. SIMON, and I. M. LONGINI, 1994  Role of primary infection in epidemics of HIV infection in gay cohorts. J. Acquired Immune Defic. Syndr. 7:1169-1184.

KINGMAN, J. F. C., 1982  On the genealogy of large populations. J. Appl. Prob. 19A:27-43.

KUHNER, M. K., J. YAMATO, and J. FELSENSTEIN, 1995  Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics 140:1421-1430[Abstract].

KUHNER, M. K., J. YAMATO, and J. FELSENSTEIN, 1998  Maximum likelihood estimation of population growth rates based on the coalescent. Genetics 149:429-434[Abstract/Free Full Text].

LEITNER, T. and J. ALBERT, 1999  The molecular clock of HIV-1 unveiled through analysis of a known transmission history. Proc. Natl. Acad. Sci. USA 96:10752-10757[Abstract/Free Full Text].

LEITNER, T., S. KUMAR, and J. ALBERT, 1997  Tempo and mode of nucleotide substitutions in gag and env gene fragments in human immunodeficiency virus type I populations with a known transmission history. J. Virol. 71:4761-4770[Abstract].

LI, W. H., M. TANIMURI, and P. M. SHARP, 1988  Rates and dates of divergence between AIDS virus nucleotide sequences. Mol. Biol. Evol. 5:313-330[Abstract].

NEE, S., E. C. HOLMES, A. RAMBAUT, and P. H. HARVEY, 1995  Inferring population history from molecular phylogenies. Philos. Trans. R. Soc. Lond. 349:25-31[Medline].

POLANSKI, A., M. KIMMEL, and R. CHAKRABORTY, 1998  Application of a time-dependent coalescence process for inferring the history of population size changes from DNA sequence data. Proc. Natl. Acad. Sci. USA 95:5456-5461[Abstract/Free Full Text].

PYBUS, O. G., E. C. HOLMES, and P. H. HARVEY, 1999  The mid-depth method and HIV-1: a practical approach for testing hypotheses of viral epidemic history. Mol. Biol. Evol. 16:953-959[Abstract].

RAMBAUT, A. and N. C. GRASSLY, 1997  Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235-238[Abstract/Free Full Text].

RAYFIELD, M. A., R. G. DOWNING, J. BAGGS, J. HU, and D. PIENIAZEK et al., 1998  A molecular epidemiologic survey of HIV in Uganda. AIDS 12:521-527[Medline].

ROBBINS, K. E., L. G. KOSTRIKIS, T. M. BROWN, O. ANZALA, and S. SHIN et al., 1999  Genetic analysis of human immunodeficiency virus type 1 strains in Kenya: a comparison using phylogenetic analysis and a combinatorial melting assay. AIDS Res. Hum. Retro. 15:329-335[Medline].

ROBERTSON, J. R., A. B. BUCKNALL, P. D. WELSBY, J. J. ROBERTS, and J. M. INGLIS et al., 1986  Epidemic of AIDS related virus (HTLV-III/LAV) infection among intravenous drug abusers. Br. Med. J. 292:527-529.

RODRIGO, A. G., and J. FELSENSTEIN, 1999 Coalescent approaches to HIV population genetics, pp. 233–272 in The Evolution of HIV, edited by K. A. CRANDALL. John Hopkins University Press, Baltimore.

SLATKIN, M. and R. R. HUDSON, 1991  Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics 129:555-562[Abstract].

SWOFFORD, D. L., 1999 PAUP* version 4.0b3a. Sinauer Associates, Sunderland, MA.

TARANTOLA, D. and B. SCHWARTLANDER, 1997  HIV/AIDS epidemics in sub-Saharan Africa: dynamism, diversity and discrete declines. AIDS 11:S5-S21.

UNAIDS, 1998 Report on the global HIV/AIDS epidemic. www.who.org/emc-hiv.




This article has been cited by other articles:


Home page
Mol Biol EvolHome page
V. N. Minin, E. W. Bloomquist, and M. A. Suchard
Smooth Skyride through a Rough Skyline: Bayesian Coalescent-Based Inference of Population Dynamics
Mol. Biol. Evol., July 1, 2008; 25(7): 1459 - 1471.
[Abstract] [Full Text] [PDF]


Home page
J. Gen. Virol.Home page
N. Jimenez-Hernandez, M. Torres-Puente, M. A. Bracho, I. Garcia-Robles, E. Ortega, J. del Olmo, F. Carnicer, F. Gonzalez-Candelas, and A. Moya
Epidemic dynamics of two coexisting hepatitis C virus subtypes
J. Gen. Virol., January 1, 2007; 88(1): 123 - 133.
[Abstract] [Full Text] [PDF]


Home page
J. Virol.Home page
P. C. Aulicino, E. C. Holmes, C. Rocco, A. Mangano, and L. Sen
Extremely Rapid Spread of Human Immunodeficiency Virus Type 1 BF Recombinants in Argentina
J. Virol., January 1, 2007; 81(1): 427 - 429.
[Abstract] [Full Text] [PDF]


Home page
J. Virol.Home page
C.-C. Hon, T.-Y. Lam, A. Drummond, A. Rambaut, Y.-F. Lee, C.-W. Yip, F. Zeng, P.-Y. Lam, P. T. W. Ng, and F. C. C. Leung
Phylogenetic Analysis Reveals a Correlation between the Expansion of Very Virulent Infectious Bursal Disease Virus and Reassortment of Its Genome Segment B.
J. Virol., September 1, 2006; 80(17): 8503 - 8509.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
S. M. Goodreau
Assessing the Effects of Human Mixing Patterns on Human Immunodeficiency Virus-1 Interhost Phylogenetics Through Social Network Simulation.
Genetics, April 1, 2006; 172(4): 2033 - 2045.
[Abstract] [Full Text] [PDF]


Home page
J. Virol.Home page
J. T. Herbeck, D. C. Nickle, G. H. Learn, G. S. Gottlieb, M. E. Curlin, L. Heath, and J. I. Mullins
Human Immunodeficiency Virus Type 1 env Evolves toward Ancestral States upon Transmission to a New Host
J. Virol., February 15, 2006; 80(4): 1637 - 1644.
[Abstract] [Full Text] [PDF]


Home page
J. Gen. Virol.Home page
T. Nakano, L. Lu, Y. He, Y. Fu, B. H. Robertson, and O. G. Pybus
Population genetic history of hepatitis C virus 1b infection in China
J. Gen. Virol., January 1, 2006; 87(1): 73 - 82.
[Abstract] [Full Text] [PDF]


Home page
J. Virol.Home page
M. Salemi, S. L. Lamers, S. Yu, T. de Oliveira, W. M. Fitch, and M. S. McGrath
Phylodynamic Analysis of Human Immunodeficiency Virus Type 1 in Distinct Brain Compartments Provides a Model for the Neuropathogenesis of AIDS
J. Virol., September 1, 2005; 79(17): 11343 - 11352.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
A. J. Drummond, A. Rambaut, B. Shapiro, and O. G. Pybus
Bayesian Coalescent Inference of Past Population Dynamics from Molecular Sequences
Mol. Biol. Evol., May 1, 2005; 22(5): 1185 - 1192.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
J. F. Wilkins
A Separation-of-Timescales Approach to the Coalescent in a Continuous Population
Genetics, December 1, 2004; 168(4): 2227 - 2244.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
P. Lemey, O. G. Pybus, A. Rambaut, A. J. Drummond, D. L. Robertson, P. Roques, M. Worobey, and A.-M. Vandamme
The Molecular Population Genetics of HIV-1 Group O
Genetics, July 1, 2004; 167(3): 1059 - 1068.
[Abstract] [Full Text] [PDF]


Home page
Appl. Environ. Microbiol.Home page
S. F. Sarkar and D. S. Guttman
Evolution of the Core Genome of Pseudomonas syringae, a Highly Clonal, Endemic Plant Pathogen
Appl. Envir. Microbiol., April 1, 2004; 70(4): 1999 - 2012.
[Abstract] [Full Text] [PDF]


Home page
ScienceHome page
B. T. Grenfell, O. G. Pybus, J. R. Gog, J. L. N. Wood, J. M. Daly, J. A. Mumford, and E. C. Holmes
Unifying the Epidemiological and Evolutionary Dynamics of Pathogens
Science, January 16, 2004; 303(5656): 327 - 332.
[Abstract] [Full Text] [PDF]


Home page
Int J EpidemiolHome page
G. L Armstrong
Commentary: Modelling the epidemiology of hepatitis C and its complications
Int. J. Epidemiol., October 1, 2003; 32(5): 725 - 726.
[Full Text] [PDF]


Home page
J. Virol.Home page
K. E. Robbins, P. Lemey, O. G. Pybus, H. W. Jaffe, A. S. Youngpairoj, T. M. Brown, M. Salemi, A.-M. Vandamme, and M. L. Kalish
U.S. Human Immunodeficiency Virus Type 1 Epidemic: Date of Origin, Population History, and Characterization of Early Strains
J. Virol., June 1, 2003; 77(11): 6359 - 6366.
[Abstract] [Full Text] [PDF]


Home page
Proc. Natl. Acad. Sci. USAHome page
P. Lemey, O. G. Pybus, B. Wang, N. K. Saksena, M. Salemi, and A.-M. Vandamme
Tracing the origin and history of the HIV-2 epidemic
PNAS, May 27, 2003; 100(11): 6588 - 6592.
[Abstract] [Full Text] [PDF]


Home page
ScienceHome page
D. A. Joy, X. Feng, J. Mu, T. Furuya, K. Chotivanich, A. U. Krettli, M. Ho, A. Wang, N. J. White, E. Suh, et al.
Early Origin and Recent Expansion of Plasmodium falciparum
Science, April 11, 2003; 300(5617): 318 - 321.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
O. G. Pybus, A. J. Drummond, T. Nakano, B. H. Robertson, and A. Rambaut
The Epidemiology and Iatrogenic Transmission of Hepatitis C Virus in Egypt: A Bayesian Coalescent Approach
Mol. Biol. Evol., March 1, 2003; 20(3): 381 - 387.
[Abstract] [Full Text] [PDF]


Home page
Proc. Natl. Acad. Sci. USAHome page
Y. Tanaka, K. Hanada, M. Mizokami, A. E. T. Yeo, J. W.-K. Shih, T. Gojobori, and H. J. Alter
Inaugural Article: A comparison of the molecular clock of hepatitis C virus in the United States and Japan predicts that hepatocellular carcinoma incidence in the United States will increase over the next two decades
PNAS, November 26, 2002; 99(24): 15584 - 15589.
[Abstract] [Full Text] [PDF]