Abstract
Using pseudomaximumlikelihood approaches to phylogenetic inference and coalescent theory, we develop a computationally tractable method of estimating effective population size from serially sampled viral data. We show that the variance of the maximumlikelihood estimator of effective population size depends on the serial sampling design only because internal node times on a coalescent genealogy can be better estimated with some designs than with others. Given the internal node times and the number of sequences sampled, the variance of the maximumlikelihood estimator is independent of the serial sampling design. We then estimate the effective size of the HIV1 population within nine hosts. If we assume that the mutation rate is 2.5 × 10^{−5} substitutions/generation and is the same in all patients, estimated generation lengths vary from 0.73 to 2.43 days/generation and the mean (1.47) is similar to the generation lengths estimated by other researchers. If we assume that generation length is 1.47 days and is the same in all patients, mutation rate estimates vary from 1.52 × 10^{−5} to 5.02 × 10^{−5}. Our results indicate that effective viral population size and evolutionary rate per year are negatively correlated among HIV1 patients.
ONE of the most striking features of human immunodeficiency virus (HIV)1 infections is the high variation among patients of the length of the asymptomatic period. During this period, the number of CD4+ T cells decreases slowly, the immune system gradually weakens, and the viral load is roughly constant. The length of the asymptomatic stage can range from a few years to >10 years. At the end of the asymptomatic period, progression to AIDS starts. This progression is characterized by increasing viral loads and a continued decrease in the number of CD4+ T cells (for a review, see Viscidi 1999).
Mathematical models to account for the variation of the asymptomatic period have been introduced (Nowaket al. 1990; Nowak and May 1992), but its underlying cause is still unknown. Despite the variation in the length of the asymptomatic stage, Shankarappa et al. (1999) noted that the pattern of viral evolution during this period can be divided into three stages: (1) an early phase with a linear increase over time in both the amount of extant sequence diversity and in divergence from the HIV1 sequence that founded the infection; (2) an intermediate phase in which sequence divergence keeps increasing, but diversity stabilizes or decreases; and (3) a late phase in which divergence becomes stable and diversity is stable or decreases.
Because of their high evolutionary rates, RNA viruses such as HIV1 are potentially more informative regarding evolutionary processes than are more slowly evolving model systems. For example, with the constant rate assumption of a molecular clock, the rate of molecular evolution and the internal node times of a phylogenetic tree can be simultaneously estimated with serially sampled viral data (Rambaut 2000). In contrast, rate and time parameters are confounded when all sequences have been isolated at the same date. With slowly evolving organisms, differences of a few years in sequence sampling dates are not sufficient to effectively separate rate and times. Therefore, supplemental information such as fossil data is needed to estimate internal node times for slowly evolving organisms but is fortunately not required for estimating these times from serially sampled viral data. Similarly, with contemporaneously isolated sequences, the molecular clock hypothesis of a constant rate of molecular evolution over time cannot be separated from the more general hypothesis that all evolutionary lineages share a common rate of evolution at a given time but that this common rate changes over time. With serially sampled viral data, these two hypotheses can be distinguished (Seoet al. 2002).
As another illustration of the rich information available in serially sampled viral data sets, important population genetic parameters that are confounded for contemporaneously isolated sequence data can be separately estimated for serially sampled data. For example, effective population size and the rate of mutation per generation are two of the central parameters in population genetic theory. With contemporaneously isolated sequence data, only their product can be estimated. With serially sampled viral data and with a known generation time, effective population size and mutation rate per generation can be separately estimated. Likewise, with serially sampled viral data and with a known mutation rate, effective population size and generation time can be disentangled (Rodrigo and Felsenstein 1999; Rodrigoet al. 1999). Although there is less confounding of parameter estimates from serially sampled data than for contemporaneously isolated data, effective population size, mutation rate per generation, and generation time cannot all three be simultaneously estimated solely from serially sampled sequence data. However, when externally derived values of either generation time or mutation rate are employed, the other two parameters can be separately inferred.
The stages identified by Shankarappa et al. (1999) of genetic diversity and divergence during the HIV1 asymptomatic period are intriguing, but it may also be worthwhile to focus on the differences among patients in HIV1 evolution. Here, a coalescentbased approach is developed and employed to estimate important population genetic parameters particular to HIV1 evolution in each of the nine patients who were included in the Shankarappa et al. (1999) study. We find a negative correlation between the effective viral population size within a patient and the rate of viral sequence evolution per year and we discuss possible sources of this negative correlation.
In this article, a twostage estimation procedure is adopted. Times of internal nodes are estimated from sequence data and then these estimated node times serve as the basis for inferring effective population size. Because the main interest is effective population size and not times of internal nodes, internal node times are nuisance parameters in our analysis and the number of these nuisance parameters increases as the number of sequences increases. This situation can by analyzed with a pseudomaximumlikelihood approach (Gong and Samaniego 1981).
THEORY
Coalescent theory: Subsequent to the pioneering work of Kingman (1982a,b), coalescent theory has attracted widespread interest (for a review see Hudson 1991). According to coalescent theory, the probability density function of the time duration (t_{i}) of the period with exactly i different ancestral lineages is
Initially, we treat the vector of time intervals t as observed and we discuss inference of N_{e} for this situation. Later, we replace this vector t in the equations below by its estimate. When there are n contemporaneously isolated sequences, the likelihood function L_{1} = L_{1} (N_{e}t) is
Coalescent likelihood of serially sampled data: Recently, coalescent theory has been applied to the investigation of serially sampled viral populations (Felsensteinet al. 1999; Rodrigo and Felsenstein 1999; Rodrigoet al. 1999; Drummond and Rodrigo 2000). For serially sampled data, Equation 1 has to be slightly modified. Suppose that n_{1} and n_{2} sequences are sampled at times k_{1} and k_{2}, respectively, where k_{2} represents an earlier time than k_{1}. We assume that the n_{1} sequences sampled at time k_{1} have coalesced into c ancestral lineages when the n_{2} sequences are sampled at time k_{2} (Figure 1). For the period between times k_{2} and k_{1}, t_{i} represents the length of time in which the n_{1} sequences sampled at time k_{1} have exactly i ancestral lineages. Note that
Pseudomaximumlikelihood approach for serial samples: Usually, the time of sampling is measured in chronological units (day, year, etc.) and the times of internal nodes are inferred in the same units. To apply coalescent theory, chronological time should be transformed to time units in terms of generations. We can estimate the evolutionary rate r (number of substitutions per year) with serially sampled data. To make the problem simple, suppose that r is almost constant. To easily convert the mutation rate per generation into the substitution rate per year, we also assume that all mutations are selectively neutral. If the generation length τ (days/generation) is known, 1 year can be regarded as roughly 365/τ generations and the mutation rate μ (number of mutations/generation) can be calculated as rτ/365. If the mutation rate μ is known, 1 year can be regarded as roughly r/μ generations and the generation length (τ) can be calculated 365μ/r. Unfortunately, we cannot determine μ and τ separately with only serially sampled data measured in chronological time units. One of the two must be inferred from other information. Previously estimated values of μ include 4.0 × 10^{−5} (Mansky 1996) and 2.5 × 10^{−5} (Fu 2001). Previously estimated values of τ include 2.6 (Perelsonet al. 1996), 1.8 (see reference to personal communication from A. Perelson in Rodrigoet al. 1999), 1.78 (Fu 2001), and 1.2 (Rodrigoet al. 1999).
Equations 3 and 9 apply when the time intervals are known. In real situations, they are estimated rather than being known with certainty. The uncertainty arises because the sequence data X are insufficient for exact estimation of the internal node times and topology. Therefore, uncertainty of internal node times needs to be considered when calculating the variance of the maximumlikelihood estimate for the effective population size. Suppose that we know definitely one of μ and τ. For n sequences (equivalently, n − 1 coalescent events),
The values of
METHODS
Sequence sampling from asymptomatic period: We analyzed the C2V5 region of HIV1 env sequences from nine patients. These data were described by Shankarappa et al. (1999) who reported on viral sequence data isolated from peripheral blood mononuclear cells (PBMCs) and on a parallel sequence data set isolated from the plasma of each patient. Because there seems to be little difference in patterns of evolution for the PBMC data and the plasma data, we consider only the PBMC data here. The single exception is that we analyzed the plasmaderived sequence data from the patient referred to as p11 by Shankarappa et al. (1999). This exception was made because PBMC data from patient p11 were unavailable.
A striking pattern among the nine HIV1 sequence data sets is that divergence, defined as the evolutionary distance from an early founder sequence, is linearly increasing during the first portion of the infection and then tends to stabilize at a later stage (Shankarappaet al. 1999). This linear phase is consistent with an almost constant rate of molecular evolution. The time span of this linearly increasing portion varies among patients. For each patient, we obtained a rough estimate of the time point at which the linear portion of increasing divergence ended (Table 1). Our analyses are based only on viral sequences isolated from a patient during the estimated linear period of divergence.
The sequence set from each patient was aligned via the default options of ClustalW Ver.1.07 (Thompsonet al. 1994). Next, alignments were manually edited. Alignment columns with gaps were removed from the data prior to subsequent analysis. To avoid the elimination of too many columns, cases where only one or two sequences exhibited a deletion at a site relative to other sequences were treated by removing the sequences with the deletion rather than by removing the alignment columns exhibiting the deletion. This procedure resulted in only a relatively small number of sequences being eliminated from the total number of sequences in the nine data sets (see Table 1).
Phylogenies were then inferred from aligned sequences via maximumlikelihood. Tree reconstruction was performed in three steps. First, a consensus of all sequences with the earliest isolation date from a particular patient was constructed. This consensus sequence was added to the data and was treated as an outgroup. Then, the neighborjoining method (Saitou and Nei 1987) was applied to a distance matrix with entries calculated according to the HasegawaKishinoYano (HKY) model of nucleotide substitution (Hasegawaet al. 1985). With the neighborjoining tree as an initial guess, local rearrangements as implemented in the Molphy software package (Adachi and Hasegawa 1996) were made to search the space of tree topologies with the maximumlikelihood criterion. Last, on the basis of the rooted topology and the HKY model of nucleotide substitution, times of internal nodes and evolutionary rates were simultaneously estimated with the Tipdate Version 1.1 software (Rambaut 2000).
Estimation of evolutionary rate and effective population size: With Equation 12, knowledge of
When sampling node times from a multivariate normal distribution, it is possible to observe samples that are inconsistent with the serial sampling design. For example, in Equation 5, if the set of random numbers shows >n_{1} − 1 coalescent events after time k_{2}, the likelihood function is not defined. In cases where the likelihood function was not defined, the set of sampled node times was discarded.
As an alternative to our strategy of sampling from a multivariate normal distribution to estimate the uncertainty in
RESULTS
Correlation between N_{e} and r: Table 1 shows estimated evolutionary rates and effective population sizes. There appears to be large variation among patients in evolutionary rates. This variation can be explained by any of three scenarios: (1) mutation rates are the same among patients, but generation lengths differ; (2) generation lengths are the same, but mutation rates differ; (3) both mutation rates and generation lengths differ. Because estimates from serially sampled data of the mutation rate and generation length are confounded when the sampling times are measured in chronological units, we considered only the first two of the three potential causes for evolutionary rate variation.
When an identical mutation rate among patients was assumed, we used the value 2.5 × 10^{−5} (mutations per generation) that was estimated by Fu (2001). Contrary to the frequently assumed mutation rate of 4.0 × 10^{−5} (Mansky 1996), this lower value was obtained by excluding insertion, deletion, and frameshift mutations. We chose the value 2.5 × 10^{−5} (mutations per generation) because we too excluded insertions, deletions, and frameshifts.
Estimates of generation time τ have been obtained via a wide variety of approaches and these estimates themselves have widely varied. Estimates of τ range from 1.2 days (Rodrigoet al. 1999) to 2.6 days (Perelsonet al. 1996). It seems plausible that there is much variation in viral generation length among patients. Our generation length estimates, on the basis of the simple relation τ = 365μ/r, ranged among patients from 0.76 to 2.64 days with a mean of 1.47 days. This mean value is close to values of 1.78 days (Fu 2001) and 1.8 days (Rodrigoet al. 1999) that have been previously obtained. These two previous values were obtained via different estimation methods and with different data than analyzed here. Using each estimated generation length, we transformed the chronological units of internal node times to generation time units and then estimated effective population sizes by maximizing the likelihood function (a generalized form of Equation 5).
Figure 2 shows the relation between evolutionary rate and effective population size for two scenarios: constant generation length and constant mutation rate. Assuming a viral generation length of 1.47 days in each patient, we observed a clearly negative correlation between evolutionary rate and effective population size. A negative correlation was also observed when the mutation rate was assumed to be shared among viruses in different patients, but here the correlation was weaker.
Bias and precision of pseudomaximumlikelihood estimates: Via simulation, we investigated the effect on effective population size estimates under various circumstances (Table 2). Our simulation scenarios differed according to the number of sequences (5, 10, 15, 20, or 25) that were sampled at each of four sampling times. The interval between sampling times was 1 year, which corresponds to 248.3 generations under the assumption that the generation length (τ) is 1.47 days. The true value of N_{e} was 4000, because this is close to the mean estimated effective population size (4232.2) of the nine patients when τ was set equal to 1.47 days. Sequence lengths were simulated to be 600 bases, because the lengths of the aligned sequences from the nine patients ranged from 544 to 600 bases. The evolutionary rate was set to 0.00735 substitutions/year because this is the mean of the estimates from the nine patients. This evolutionary rate corresponds to a mutation rate of 2.97 × 10^{−5} mutations/generation.
Node times and the tree topology were simulated according to the coalescent process. In the simulations, one additional sequence was added so that the root of the ingroup genealogy could be inferred. This additional sequence had a sampling time that was set to chronologically precede the root of the ingroup by one generation. The time at which the lineage leading to the ingroup joined with the lineage leading to the outgroup was then randomly determined according to the coalescent process. The JukesCantor model (Jukes and Cantor 1969) was then used to randomly simulate sequences on the tree relating the ingroup sequences and the outgroup sequence.
After generating the sequences, effective population size was inferred from these data in five different situations. In three cases (referred to as cases 1–3), τ = 1.47 was assumed. For case 1, the true tree topology and coalescent times were both treated as known. For case 2, the true tree topology was treated as known but the coalescent times were estimated with maximum likelihood. For case 3, the tree topology was estimated with a combination of neighborjoining and local rearrangements and then the coalescent times were estimated with maximum likelihood. In two situations (referred to as cases 4 and 5), μ = 2.97 × 10^{−5} was assumed. As with case 2, case 4 involved treating the true tree topology as known but estimating coalescent times. As with case 3, case 5 involved reconstructing the tree topology via a combination of neighborjoining and local rearrangements and then estimating the coalescent times on this topology with maximum likelihood.
For each entry in Table 2, the estimated mean and standard error of
As the number of sequences gets larger, the difference grows between the standard errors for the cases where the true tree topology was and was not treated as known. This indicates that topological uncertainty is more important for data sets with large numbers of sequences. But the impact of topological uncertainty on the standard errors of N_{e} seems to be relatively small when compared with the impact of uncertainty of coalescence times.
Results from cases 3 and 5 indicate that the approach introduced here to estimate N_{e} has upward bias. The first step of our pseudomaximumlikelihood approach is to estimate t by maximizing P(Xt). These estimates of t will be asymptotically unbiased and will asymptotically approach a multivariate normal distribution. However, when the sequence length is not long enough, the sampling distribution of the estimates of t may greatly differ from a multivariate normal. When there are many sequences, coalescent time intervals are relatively short near the tips and the estimated internal node times are constrained by the times of tips. Thus, the deviation between the sampling distribution of the estimates of t and a multivariate normal distribution may become particularly important when there are many sequences. As a result, the sampling distribution of estimates of t is truncated and the expected values of
The estimation properties of the pseudomaximumlikelihood technique are also affected by the fact that one step is to reconstruct a bifurcating topology but this reconstruction step is not sufficiently influenced by the times at which sequences are sampled. As an illustration, consider a data set where sequence A and sequence B were sampled at the same time but sequence C was sampled k generations earlier. It is possible, especially if the mutation rate is small, that these three sequences are identical because no substitutions occur following their common ancestor. In this case, our topology reconstruction procedure would consider each of the three rooted topologies shown in Figure 3 to be equally likely and one of these three topologies would be arbitrarily selected. If the topology of Figure 3a were selected, the resulting estimate of N_{e} for these data would be 0. In contrast, selection of the topologies in Figure 3b or 3c would yield an estimate for N_{e} of k. The necessity to arbitrarily select one bifurcating topology due to branch length estimates being zero becomes more frequent when the number of sequences is large and the impact of these arbitrary selections is likely to be more significant when the number of sequence sampling times is large. The effect of phylogenetic reconstruction on estimating population parameters has also been investigated by Fu (1994), who noted a downward bias for inferring θ = 4N_{e}μ when the number of sequences is large and when using UPGMA trees without correcting for multiple hits.
The number of generations represented by each branch on the geneaology controls the estimate of effective population size. Our approach is to estimate these numbers of generations from sequence data. Because sequence data can be used to directly estimate the expected number of sequence changes on each branch of the genealogy, the number of generations represented by each branch can be straightforwardly estimated from the numbers of sequence changes on branches and from a known rate μ of sequence change per generation. When the chronological time per generation τ is assumed known and μ is assumed unknown, the rate of sequence change per chronological time unit must also be estimated to permit estimation of the number of generations represented by each branch on the tree. Because rate of sequence change per chronological time unit is subject to estimation uncertainty, the standard errors for estimating N_{e} are smaller for cases 4 and 5 than for cases 2 and 3.
DISCUSSION
Recently, a Bayesian approach that incorporates the uncertainty in the genealogical structure has been developed (Drummondet al. 2002). This Bayesian approach simultaneously estimates the mutation rate, population size, and tree topology. In contrast, our frequentist approach neglects the uncertainty of the estimated tree topology to reduce the required amount of calculation. The accounting for topological uncertainty is a distinct advantage of the Bayesian approach over that presented here.
Although there are multiple Bayesian and frequentist options for model checking, a potential advantage of the approach here over the full Bayesian strategy is ease of model checking. The first step of our procedure begins with reconstructing a geneaology and the structure of the reconstructed genealogy can then be immediately inspected. Inspection of the genealogy permits both formal and informal checks of whether the assumed model of population history is plausible. For example, the structure of the reconstructed genealogy can give an immediate indication as to whether a model of population size increase over time is justified. Because we believe that models are apt to be the weakest point of evolutionary analysis, whereas the method of estimation based upon the model is an important but secondary concern, ease of model checking should not be dismissed when evaluating a procedure.
In our approach, the estimated times of internal nodes are nuisance parameters rather than being of major interest. Our pseudomaximumlikelihood method does not account for uncertainty of these nuisance parameters. To account for this uncertainty, an empirical Bayes approach could be adopted. In empirical Bayes approaches, the marginallikelihood function is obtained by integrating the conditional distribution over the space of nuisance parameters (O'Hagan 1996).
In our case, the marginallikelihood function of N_{e} given the sequence data X could be obtained by integrating over the times t,
To facilitate data analysis, we combine the constant rate assumption of the molecular clock hypothesis and Kingman's ncoalescent (Kingman 1982a,b). With a constant rate, divergence time estimation is straightforward. With a simple coalescent process, estimation of effective population size from known divergence times is not difficult. However, neither the clock nor the simple coalescent assumptions are technically correct. Both are violated by realistic forms of natural selection. For example, there is ample evidence that positive selection operates in HIV1 genes (Nielsen and Yang 1998). It is unclear whether inaccuracy of the clock and coalescent assumptions would have a large effect on the results of this analysis. Some evolutionary scenarios that invoke selection have been shown to have little effect on the shape of gene genealogies relating contemporaneous data (Golding 1997; Neuhauser and Krone 1997; Przeworskiet al. 1999).
Kingman's ncoalescent also requires a constant population size. Because clinical measurements such as viral load counts do not change much during the asymptomatic period (reviewed in Viscidi 1999), the constant population size assumption seems reasonable. If population size is increasing over time, our estimates of it are expected to be intermediate between the early and later population levels. For a high rate of increase, the estimates are expected to be closer to the early population sizes than the later sizes.
Regarding the molecular clock, we focus here on analysis of sequences that were isolated during the approximately linear phase of sequence divergence that is characteristic of the early asymptomatic period in HIV1 infections. This linear pattern is what would be seen with neutral evolution and with a molecular clock. Because the linear pattern does not extend into later portions of the asymptomatic period, we did not analyze sequences isolated during the later portions. Therefore, instead of using a highly realistic and overly complicated model to analyze the entire data set, we opted here to investigate only the portions of the data that seemed relatively compatible with the simple model of a molecular clock.
Searching for associations between effective population size and measurements that reflect the physiological condition of a patient may also be fruitful. As a potential marker of disease progression, we examined the time points at which CD4+ T cell counts of each patient decreased to 200/μl (see Shankarappaet al. 1999). However, we were unable to detect a relationship between these time points and the estimated effective viral population size within a patient. Nonetheless, informative associations between effective population size and other covariates may be detected in more thorough future analyses. These associations have the potential to illuminate the factors that are related to viral adaptation within a patient.
In previous work (Seoet al. 2002), we noted that it is desirable to disperse serial sampling times as much as possible to accurately estimate evolutionary rates and times. The same strategy applies in the estimation of effective population size, because differences in
In this study, we assumed either that viral mutation rate or viral generation length is constant among patients. In reality, both mutation rate and generation time probably vary to some extent among patients. To investigate this simultaneous variation, more data and especially a more sophisticated model for viral evolution may be warranted.
The negative correlation between evolutionary rate and effective population size (Figure 2) is noteworthy. We cannot formally exclude the possibility that this negative correlation is an artifact of our pseudomaximumlikelihood procedure because the fact that viral generation length and evolutionary rate may both vary among patients makes this exclusion difficult. Because differences in effective population size and evolutionary rate estimates among patients exceed the uncertainty within patients, the possibility that this negative correlation is simply an artifact seems unlikely.
In population genetics, this negative correlation is predicted by both the slightly deleterious mutation model (Ohta 1987) and the nearly neutral mutation model (Tachida 1991). In a strictly neutral model, the rate at which mutants become fixed in the population does not depend on the population size but only on the mutation rate. In contrast, in the slightly deleterious or nearly neutral models, the rate of fixation of advantageous mutants depends on both the population size and selective coefficients. In the slightly deleterious model, it is assumed that most mutations are neutral or slightly deleterious and the mean of the distribution of selective coefficient(s) is <0. A negative correlation between evolutionary rate and effective population size is then predicted. As noted by Overbaugh and Bangham (2001), if the error rate of reverse transcriptase is high, deleterious mutations would be frequent and negative selection is likely to be a dominant force in viral evolution. In the nearly neutral model, the selection coefficient of mutations has a distribution with variance σ^{2} and with a mean that can be either negative or positive. It was shown by simulation that the substitution (note that “substitution” here does not refer solely to variants that have been fixed in the population) rate is negatively correlated with N_{e}σ (Tachida 1991). This might result in a negative correlation between evolutionary rate and effective population size for a fixed σ.
It is not clear whether the above classical views of population genetics explain our finding of a negative correlation. There are potentially many population genetic scenarios that could result in this correlation. Also, the correlation could be attributable to the variation of immune response among patients. If a patient's immune system is strong, viral sequences could experience strong positive selection and this could lead to an increased evolutionary rate and a decreased effective population size. We believe that further characterization of this apparent correlation and its possible sources would facilitate our understanding of the mechanism of viral adaptation within hosts.
Acknowledgments
We thank two anonymous reviewers for their suggestions. T.K.S. and H.K. were supported by the Japan Society for the Promotion of Science (JSPS) United StatesJapan research collaboration program 12554037; T.K.S. by the Japanese government scholarship program for foreign students; M.H. and H.K. by grant BSAR497 from the Ministry of Education, Culture, Sports, Science, and Technology (MECSST); J.L.T. and H.K. by grant 13308013 of MECSST; and J.L.T. by National Science Foundation grants DBI0077503 and INT990934.
APPENDIX
Using the following equation,
Footnotes

Communicating editor: G. B. Golding
 Received August 13, 2001.
 Accepted January 21, 2002.
 Copyright © 2002 by the Genetics Society of America