Estimation of Effective Population Size of HIV-1 Within a Host: A Pseudomaximum-Likelihood Approach
Tae-Kun Seo, Jeffrey L. Thorne, Masami Hasegawa, Hirohisa Kishino


Using pseudomaximum-likelihood approaches to phylogenetic inference and coalescent theory, we develop a computationally tractable method of estimating effective population size from serially sampled viral data. We show that the variance of the maximum-likelihood estimator of effective population size depends on the serial sampling design only because internal node times on a coalescent genealogy can be better estimated with some designs than with others. Given the internal node times and the number of sequences sampled, the variance of the maximum-likelihood estimator is independent of the serial sampling design. We then estimate the effective size of the HIV-1 population within nine hosts. If we assume that the mutation rate is 2.5 × 10−5 substitutions/generation and is the same in all patients, estimated generation lengths vary from 0.73 to 2.43 days/generation and the mean (1.47) is similar to the generation lengths estimated by other researchers. If we assume that generation length is 1.47 days and is the same in all patients, mutation rate estimates vary from 1.52 × 10−5 to 5.02 × 10−5. Our results indicate that effective viral population size and evolutionary rate per year are negatively correlated among HIV-1 patients.

ONE of the most striking features of human immunodeficiency virus (HIV)-1 infections is the high variation among patients of the length of the asymptomatic period. During this period, the number of CD4+ T cells decreases slowly, the immune system gradually weakens, and the viral load is roughly constant. The length of the asymptomatic stage can range from a few years to >10 years. At the end of the asymptomatic period, progression to AIDS starts. This progression is characterized by increasing viral loads and a continued decrease in the number of CD4+ T cells (for a review, see Viscidi 1999).

Mathematical models to account for the variation of the asymptomatic period have been introduced (Nowaket al. 1990; Nowak and May 1992), but its underlying cause is still unknown. Despite the variation in the length of the asymptomatic stage, Shankarappa et al. (1999) noted that the pattern of viral evolution during this period can be divided into three stages: (1) an early phase with a linear increase over time in both the amount of extant sequence diversity and in divergence from the HIV-1 sequence that founded the infection; (2) an intermediate phase in which sequence divergence keeps increasing, but diversity stabilizes or decreases; and (3) a late phase in which divergence becomes stable and diversity is stable or decreases.

Because of their high evolutionary rates, RNA viruses such as HIV-1 are potentially more informative regarding evolutionary processes than are more slowly evolving model systems. For example, with the constant rate assumption of a molecular clock, the rate of molecular evolution and the internal node times of a phylogenetic tree can be simultaneously estimated with serially sampled viral data (Rambaut 2000). In contrast, rate and time parameters are confounded when all sequences have been isolated at the same date. With slowly evolving organisms, differences of a few years in sequence sampling dates are not sufficient to effectively separate rate and times. Therefore, supplemental information such as fossil data is needed to estimate internal node times for slowly evolving organisms but is fortunately not required for estimating these times from serially sampled viral data. Similarly, with contemporaneously isolated sequences, the molecular clock hypothesis of a constant rate of molecular evolution over time cannot be separated from the more general hypothesis that all evolutionary lineages share a common rate of evolution at a given time but that this common rate changes over time. With serially sampled viral data, these two hypotheses can be distinguished (Seoet al. 2002).

As another illustration of the rich information available in serially sampled viral data sets, important population genetic parameters that are confounded for contemporaneously isolated sequence data can be separately estimated for serially sampled data. For example, effective population size and the rate of mutation per generation are two of the central parameters in population genetic theory. With contemporaneously isolated sequence data, only their product can be estimated. With serially sampled viral data and with a known generation time, effective population size and mutation rate per generation can be separately estimated. Likewise, with serially sampled viral data and with a known mutation rate, effective population size and generation time can be disentangled (Rodrigo and Felsenstein 1999; Rodrigoet al. 1999). Although there is less confounding of parameter estimates from serially sampled data than for contemporaneously isolated data, effective population size, mutation rate per generation, and generation time cannot all three be simultaneously estimated solely from serially sampled sequence data. However, when externally derived values of either generation time or mutation rate are employed, the other two parameters can be separately inferred.

The stages identified by Shankarappa et al. (1999) of genetic diversity and divergence during the HIV-1 asymptomatic period are intriguing, but it may also be worthwhile to focus on the differences among patients in HIV-1 evolution. Here, a coalescent-based approach is developed and employed to estimate important population genetic parameters particular to HIV-1 evolution in each of the nine patients who were included in the Shankarappa et al. (1999) study. We find a negative correlation between the effective viral population size within a patient and the rate of viral sequence evolution per year and we discuss possible sources of this negative correlation.

In this article, a two-stage estimation procedure is adopted. Times of internal nodes are estimated from sequence data and then these estimated node times serve as the basis for inferring effective population size. Because the main interest is effective population size and not times of internal nodes, internal node times are nuisance parameters in our analysis and the number of these nuisance parameters increases as the number of sequences increases. This situation can by analyzed with a pseudomaximum-likelihood approach (Gong and Samaniego 1981).


Coalescent theory: Subsequent to the pioneering work of Kingman (1982a,b), coalescent theory has attracted widespread interest (for a review see Hudson 1991). According to coalescent theory, the probability density function of the time duration (ti) of the period with exactly i different ancestral lineages is p(tiNe)=i(i1)2Neexp{i(i1)2Neti}, where we denote Ne as the effective population size to emphasize that it is not necessarily the same as the total population size. Here, the time intervals ti are measured in terms of generations. The vector of these intervals is denoted t. The mean and variance of ti are, respectively, E[ti]=2Nei(i1) and Var[ti]=4Ne2i2(i1)2.

Initially, we treat the vector of time intervals t as observed and we discuss inference of Ne for this situation. Later, we replace this vector t in the equations below by its estimate. When there are n contemporaneously isolated sequences, the likelihood function L1 = L1 (Ne|t) is L1=i=n2i(i1)2Neexp{i(i1)2Neti}. (1) Using the log-likelihood equation lnL1Ne=(n1)Ne+i=n2i(i1)2Ne2ti=0, (2) the maximum-likelihood estimate of the effective population size is N^e=12(n1)i=n2i(i1)ti. (3) This estimate is unbiased and its variance is Var(N^e)=122(n1)2i=2ni2(i1)2Var(ti)=Ne2n1 (4) (e.g., Felsenstein 1992; Rodrigoet al. 1999).

Coalescent likelihood of serially sampled data: Recently, coalescent theory has been applied to the investigation of serially sampled viral populations (Felsensteinet al. 1999; Rodrigo and Felsenstein 1999; Rodrigoet al. 1999; Drummond and Rodrigo 2000). For serially sampled data, Equation 1 has to be slightly modified. Suppose that n1 and n2 sequences are sampled at times k1 and k2, respectively, where k2 represents an earlier time than k1. We assume that the n1 sequences sampled at time k1 have coalesced into c ancestral lineages when the n2 sequences are sampled at time k2 (Figure 1). For the period between times k2 and k1, ti represents the length of time in which the n1 sequences sampled at time k1 have exactly i ancestral lineages. Note that tc (Figure 1) is a special case in which the end of the time interval is determined not by a coalescent event but by the known time k2. For times preceding k2, it is convenient to have si denote the amount of time during which there were exactly i ancestral lineages. The expressions e(ti) and e(si), respectively, specify the points in time at which intervals ti and si end (i.e., the interval endpoints that are most recent). The time point e(tn1) = k1. For a vector t representing the pertinent time intervals (e.g., tn1, … , tc+1, tc + sc+n2, sc+n2−1, … , s2), the likelihood L1 = L1 (Ne|t) can be expressed as a product of probability densities of time intervals L1=[i=n1c+1p(tie(ti),Ne)]×p(tc,sc+n2e(tc),k2,Ne)×[i=c+n212p(sie(si),Ne)]. (5) As was shown by Rodrigo and Felsenstein (1999), each of the above factors in the product of n1 + n2 − 1 terms can be simply expressed: p(tie(ti),Ne)=i(i1)2Neexp(i(i1)2Neti)p(tc,sc+n2e(tc),k2,Ne)=exp(c(c1)2Netc)(c+n2)(c+n21)2Ne×exp((c+n2)(c+n21)2Nesc+n2)p(sie(si),Ne)=i(i1)2Neexp(i(i1)2Nesi). (6) Moreover, each factor provides some information regarding the value of Ne. We define Ne(i)={argmaxNep(tin2e(tn2),Ne)=(in2)(in21)tin22,i{n1+n2,,c+n2+1}argmaxNep(tc,sc+n2e(tc),k2,Ne)=c(c1)tc+(c+n2)(c+n21)sc+n22argmaxNep(sie(si),Ne)i(i1)si2,i{c+n21,,2},} (7) where argmaxθ f(θ) represents the value of θ that maximizes f(θ). Because the lengths of time intervals between coalescent events are random, the Ne(i) are random variables. By making a transformation of random variables from the time intervals in Equation 6 to the Ne(i) , the distributions of the Ne(i) are seen to be independent of the serial sampling design. Specifically, the probability density p(Ne(i)Ne) is an exponential distribution with mean Ne. By expressing Equation 5 in terms of Ne(i) instead of in terms of time intervals, we get L1=1Nen1n21exp(1Nei=n1+n22Ne(i)). (8) The maximum-likelihood estimate of Ne is therefore N^e=1n1+n21i=n1+n22Ne(i). (9) Because the Ne(i) are exponential random variables with mean Ne, Var(N^e)=Ne2n1+n21. (10) It is straightforward to show that the approaches used here to derive N^e and Var(N^e) apply to more general serial sampling designs. Therefore, given the divergence times, Var(N^e) is affected by the total number of sampled sequences but not by other particulars of the serial sampling design.

Figure 1.

A simple example of serial sampling where n1 sequences are sampled at time k1 and n2 sequences are sampled at an earlier time k2. The n1 sequences coalesce into c lineages when the n2 sequences are added at time k2.

Pseudomaximum-likelihood approach for serial samples: Usually, the time of sampling is measured in chronological units (day, year, etc.) and the times of internal nodes are inferred in the same units. To apply coalescent theory, chronological time should be transformed to time units in terms of generations. We can estimate the evolutionary rate r (number of substitutions per year) with serially sampled data. To make the problem simple, suppose that r is almost constant. To easily convert the mutation rate per generation into the substitution rate per year, we also assume that all mutations are selectively neutral. If the generation length τ (days/generation) is known, 1 year can be regarded as roughly 365/τ generations and the mutation rate μ (number of mutations/generation) can be calculated as rτ/365. If the mutation rate μ is known, 1 year can be regarded as roughly r/μ generations and the generation length (τ) can be calculated 365μ/r. Unfortunately, we cannot determine μ and τ separately with only serially sampled data measured in chronological time units. One of the two must be inferred from other information. Previously estimated values of μ include 4.0 × 10−5 (Mansky 1996) and 2.5 × 10−5 (Fu 2001). Previously estimated values of τ include 2.6 (Perelsonet al. 1996), 1.8 (see reference to personal communication from A. Perelson in Rodrigoet al. 1999), 1.78 (Fu 2001), and 1.2 (Rodrigoet al. 1999).

Equations 3 and 9 apply when the time intervals are known. In real situations, they are estimated rather than being known with certainty. The uncertainty arises because the sequence data X are insufficient for exact estimation of the internal node times and topology. Therefore, uncertainty of internal node times needs to be considered when calculating the variance of the maximum-likelihood estimate for the effective population size. Suppose that we know definitely one of μ and τ. For n sequences (equivalently, n − 1 coalescent events), p(X,tNe,μ,τ)=p(tNe,μ,τ)p(XNe,μ,τ,t)=p(tNe,μ,τ)p(Xt,r)=p(tNe)p(Xt,r). (11) The first factor above, p(t|Ne), is itself a product of n − 1 terms that each correspond to a specific coalescent event (see Equations 5 and 6). The second of the factors, p(X|t, r), is the conventional quantity that is calculated for phylogeny reconstruction with Felsenstein's pruning algorithm (Felsenstein 1981). Here, we adopt a pseudomaximum-likelihood approach (Gong and Samaniego 1981). In this approach, t is first estimated and then t is later treated as observed in a likelihood function. Letting L1(Ne|t) = p(t|Ne) and L2(t) = p(X|t, r), we get l(Ne,t)=log(p(X,tNe,μ,τ))=log{L1(tNe)L2(Xt,r)}=l1(Net)+l2(t). Also, we note that l1(Net)=i=n2l1,i(Nti), where l1,i(N|t) corresponds to a particular of the n − 1 coalescent events. With pseudomaximum likelihood, all parameters except those of interest are regarded as nuisance parameters. Replacing the nuisance parameters with consistent estimates simplifies the estimation problem. Instead of estimating t and Ne simultaneously, we infer t by maximizing l2 and then estimate Ne by maximizing l1(Net) . We use N^e(t) to refer to the estimated value of Ne that is obtained by maximizing l1(Ne|t) over Ne for some fixed value of t. Obviously, the estimates t and N^e(t) are not necessarily equal to those obtained by jointly maximizing l(Ne, t) over Ne and t.

The values of t can be inferred with a maximum-likelihood method that incorporates serially sampled data (e.g., the Tipdate software package; Rambaut 2000). Once the t are obtained, N^e(t) can be inferred with (Ne)l1(Net)=0 (Equation 2). This approach is sensible because the maximum-likelihood estimate of t is consistent (Felsenstein 1988). As derived in the appendix, the variance of N^e(t) can be approximated as Vart,X(N^e(t))Ne2n1+122(n1)2Et{VarX(i=n2i(i1)tit)}. (12) The right side of Equation 12 is the sum of two terms where the first term, Ne2(n1) , is the variance of Ne when the divergence times are known (see Equation 10). For a given value of Ne, the divergence times are random and will stochastically vary among coalescent realizations. It is this randomness due to genetic sampling that is captured by Ne2(n1) . We can approximate this genetic sampling term with N^e(t)2(n1) The second of the two terms in Equation 12 arises from statistical sampling. With finite sequence lengths, divergence times cannot be perfectly estimated and this second term represents the uncertainty of N^e(t) that results. The impact of this randomness due to statistical sampling can be handled either via a numerical approximation of the second term of Equation 12 or by simulation (see below).


Sequence sampling from asymptomatic period: We analyzed the C2-V5 region of HIV-1 env sequences from nine patients. These data were described by Shankarappa et al. (1999) who reported on viral sequence data isolated from peripheral blood mononuclear cells (PBMCs) and on a parallel sequence data set isolated from the plasma of each patient. Because there seems to be little difference in patterns of evolution for the PBMC data and the plasma data, we consider only the PBMC data here. The single exception is that we analyzed the plasma-derived sequence data from the patient referred to as p11 by Shankarappa et al. (1999). This exception was made because PBMC data from patient p11 were unavailable.

A striking pattern among the nine HIV-1 sequence data sets is that divergence, defined as the evolutionary distance from an early founder sequence, is linearly increasing during the first portion of the infection and then tends to stabilize at a later stage (Shankarappaet al. 1999). This linear phase is consistent with an almost constant rate of molecular evolution. The time span of this linearly increasing portion varies among patients. For each patient, we obtained a rough estimate of the time point at which the linear portion of increasing divergence ended (Table 1). Our analyses are based only on viral sequences isolated from a patient during the estimated linear period of divergence.

The sequence set from each patient was aligned via the default options of ClustalW Ver.1.07 (Thompsonet al. 1994). Next, alignments were manually edited. Alignment columns with gaps were removed from the data prior to subsequent analysis. To avoid the elimination of too many columns, cases where only one or two sequences exhibited a deletion at a site relative to other sequences were treated by removing the sequences with the deletion rather than by removing the alignment columns exhibiting the deletion. This procedure resulted in only a relatively small number of sequences being eliminated from the total number of sequences in the nine data sets (see Table 1).

Phylogenies were then inferred from aligned sequences via maximum-likelihood. Tree reconstruction was performed in three steps. First, a consensus of all sequences with the earliest isolation date from a particular patient was constructed. This consensus sequence was added to the data and was treated as an outgroup. Then, the neighbor-joining method (Saitou and Nei 1987) was applied to a distance matrix with entries calculated according to the Hasegawa-Kishino-Yano (HKY) model of nucleotide substitution (Hasegawaet al. 1985). With the neighbor-joining tree as an initial guess, local rearrangements as implemented in the Molphy software package (Adachi and Hasegawa 1996) were made to search the space of tree topologies with the maximum-likelihood criterion. Last, on the basis of the rooted topology and the HKY model of nucleotide substitution, times of internal nodes and evolutionary rates were simultaneously estimated with the Tipdate Version 1.1 software (Rambaut 2000).

Estimation of evolutionary rate and effective population size: With Equation 12, knowledge of Var(t) is necessary to determine Var(N^e(t)) . According to asymptotic theory, the distribution of maximum-likelihood estimates can be approximated by a multivariate normal distribution. Here, the asymptotic approximation improves as sequence length increases. We approximated the variance of the estimated times of internal nodes with the inverse of a numerically estimated Fisher information matrix. By sampling from the resulting multivariate normal distributions, we simulated new sets of estimated times and then estimated effective population sizes from these simulated times. To estimate Ne in this fashion, it was necessary to assume a specific value for either τ or μ and then apply a generalization of Equation 5. For each patient and with each value of τ or μ, variance estimates were based on 20,000 samples from the appropriate multivariate normal distribution. This computational procedure yields an approximation of the second of the two terms in Equation 12 that are needed to estimate Var(N^e(t)) .

When sampling node times from a multivariate normal distribution, it is possible to observe samples that are inconsistent with the serial sampling design. For example, in Equation 5, if the set of random numbers shows >n1 − 1 coalescent events after time k2, the likelihood function is not defined. In cases where the likelihood function was not defined, the set of sampled node times was discarded.

As an alternative to our strategy of sampling from a multivariate normal distribution to estimate the uncertainty in N^e(t) due to uncertainty in node times, a bootstrap approach could be used. The idea would be to sample sites with replacement. For each bootstrap sample, effective population size could be estimated and the variance of these estimates among bootstrap samples could be determined. As with our multivariate sampling strategy, this bootstrap approach would only estimate the uncertainty in N^e(t) that is due to uncertainty in node times. This contribution from node time uncertainty would be added to N^e(t)2(n1) to approximate Var(N^e(t)) Advantages of our multivariate normal sampling strategy over the bootstrap alternative are simplicity and computational feasibility.


Correlation between Ne and r: Table 1 shows estimated evolutionary rates and effective population sizes. There appears to be large variation among patients in evolutionary rates. This variation can be explained by any of three scenarios: (1) mutation rates are the same among patients, but generation lengths differ; (2) generation lengths are the same, but mutation rates differ; (3) both mutation rates and generation lengths differ. Because estimates from serially sampled data of the mutation rate and generation length are confounded when the sampling times are measured in chronological units, we considered only the first two of the three potential causes for evolutionary rate variation.

When an identical mutation rate among patients was assumed, we used the value 2.5 × 10−5 (mutations per generation) that was estimated by Fu (2001). Contrary to the frequently assumed mutation rate of 4.0 × 10−5 (Mansky 1996), this lower value was obtained by excluding insertion, deletion, and frameshift mutations. We chose the value 2.5 × 10−5 (mutations per generation) because we too excluded insertions, deletions, and frameshifts.

Estimates of generation time τ have been obtained via a wide variety of approaches and these estimates themselves have widely varied. Estimates of τ range from 1.2 days (Rodrigoet al. 1999) to 2.6 days (Perelsonet al. 1996). It seems plausible that there is much variation in viral generation length among patients. Our generation length estimates, on the basis of the simple relation τ = 365μ/r, ranged among patients from 0.76 to 2.64 days with a mean of 1.47 days. This mean value is close to values of 1.78 days (Fu 2001) and 1.8 days (Rodrigoet al. 1999) that have been previously obtained. These two previous values were obtained via different estimation methods and with different data than analyzed here. Using each estimated generation length, we transformed the chronological units of internal node times to generation time units and then estimated effective population sizes by maximizing the likelihood function (a generalized form of Equation 5).

Figure 2 shows the relation between evolutionary rate and effective population size for two scenarios: constant generation length and constant mutation rate. Assuming a viral generation length of 1.47 days in each patient, we observed a clearly negative correlation between evolutionary rate and effective population size. A negative correlation was also observed when the mutation rate was assumed to be shared among viruses in different patients, but here the correlation was weaker.

View this table:

Estimates of viral evolution parameters from nine different populations

Bias and precision of pseudomaximum-likelihood estimates: Via simulation, we investigated the effect on effective population size estimates under various circumstances (Table 2). Our simulation scenarios differed according to the number of sequences (5, 10, 15, 20, or 25) that were sampled at each of four sampling times. The interval between sampling times was 1 year, which corresponds to 248.3 generations under the assumption that the generation length (τ) is 1.47 days. The true value of Ne was 4000, because this is close to the mean estimated effective population size (4232.2) of the nine patients when τ was set equal to 1.47 days. Sequence lengths were simulated to be 600 bases, because the lengths of the aligned sequences from the nine patients ranged from 544 to 600 bases. The evolutionary rate was set to 0.00735 substitutions/year because this is the mean of the estimates from the nine patients. This evolutionary rate corresponds to a mutation rate of 2.97 × 10−5 mutations/generation.

Node times and the tree topology were simulated according to the coalescent process. In the simulations, one additional sequence was added so that the root of the ingroup genealogy could be inferred. This additional sequence had a sampling time that was set to chronologically precede the root of the ingroup by one generation. The time at which the lineage leading to the ingroup joined with the lineage leading to the outgroup was then randomly determined according to the coalescent process. The Jukes-Cantor model (Jukes and Cantor 1969) was then used to randomly simulate sequences on the tree relating the ingroup sequences and the outgroup sequence.

Figure 2.

A negative correlation between the evolutionary rate per year Formula and the effective population size Formula . (A) Assuming a generation length of 1.47 days. (B) Assuming a mutation rate of 2.5 × 10−5 substitutions/generation.

View this table:

Mean and standard error of Formula

After generating the sequences, effective population size was inferred from these data in five different situations. In three cases (referred to as cases 1–3), τ = 1.47 was assumed. For case 1, the true tree topology and coalescent times were both treated as known. For case 2, the true tree topology was treated as known but the coalescent times were estimated with maximum likelihood. For case 3, the tree topology was estimated with a combination of neighbor-joining and local rearrangements and then the coalescent times were estimated with maximum likelihood. In two situations (referred to as cases 4 and 5), μ = 2.97 × 10−5 was assumed. As with case 2, case 4 involved treating the true tree topology as known but estimating coalescent times. As with case 3, case 5 involved reconstructing the tree topology via a combination of neighbor-joining and local rearrangements and then estimating the coalescent times on this topology with maximum likelihood.

For each entry in Table 2, the estimated mean and standard error of N^e are based upon results from 100 simulated data sets. As expected due to the second term of Equation 12, case 1 exhibits lower standard errors than do the other cases. Another general and expected trend seems to be that the standard error for estimating Ne decreases as the number of sequences increases. This decrease can be explained by the first term in Equation 12.

As the number of sequences gets larger, the difference grows between the standard errors for the cases where the true tree topology was and was not treated as known. This indicates that topological uncertainty is more important for data sets with large numbers of sequences. But the impact of topological uncertainty on the standard errors of Ne seems to be relatively small when compared with the impact of uncertainty of coalescence times.

Results from cases 3 and 5 indicate that the approach introduced here to estimate Ne has upward bias. The first step of our pseudomaximum-likelihood approach is to estimate t by maximizing P(X|t). These estimates of t will be asymptotically unbiased and will asymptotically approach a multivariate normal distribution. However, when the sequence length is not long enough, the sampling distribution of the estimates of t may greatly differ from a multivariate normal. When there are many sequences, coalescent time intervals are relatively short near the tips and the estimated internal node times are constrained by the times of tips. Thus, the deviation between the sampling distribution of the estimates of t and a multivariate normal distribution may become particularly important when there are many sequences. As a result, the sampling distribution of estimates of t is truncated and the expected values of t will be larger than their true values. This could explain the upward bias that we observe for estimating Ne even when the true tree topology is used. The effect of the truncation of coalescent times was also noted by Kuhner et al. (1998).

The estimation properties of the pseudomaximum-likelihood technique are also affected by the fact that one step is to reconstruct a bifurcating topology but this reconstruction step is not sufficiently influenced by the times at which sequences are sampled. As an illustration, consider a data set where sequence A and sequence B were sampled at the same time but sequence C was sampled k generations earlier. It is possible, especially if the mutation rate is small, that these three sequences are identical because no substitutions occur following their common ancestor. In this case, our topology reconstruction procedure would consider each of the three rooted topologies shown in Figure 3 to be equally likely and one of these three topologies would be arbitrarily selected. If the topology of Figure 3a were selected, the resulting estimate of Ne for these data would be 0. In contrast, selection of the topologies in Figure 3b or 3c would yield an estimate for Ne of k. The necessity to arbitrarily select one bifurcating topology due to branch length estimates being zero becomes more frequent when the number of sequences is large and the impact of these arbitrary selections is likely to be more significant when the number of sequence sampling times is large. The effect of phylogenetic reconstruction on estimating population parameters has also been investigated by Fu (1994), who noted a downward bias for inferring θ = 4Neμ when the number of sequences is large and when using UPGMA trees without correcting for multiple hits.

Figure 3.

The three possible rooted topologies when sequence C is sampled k generations earlier than sequence A and sequence B.

The number of generations represented by each branch on the geneaology controls the estimate of effective population size. Our approach is to estimate these numbers of generations from sequence data. Because sequence data can be used to directly estimate the expected number of sequence changes on each branch of the genealogy, the number of generations represented by each branch can be straightforwardly estimated from the numbers of sequence changes on branches and from a known rate μ of sequence change per generation. When the chronological time per generation τ is assumed known and μ is assumed unknown, the rate of sequence change per chronological time unit must also be estimated to permit estimation of the number of generations represented by each branch on the tree. Because rate of sequence change per chronological time unit is subject to estimation uncertainty, the standard errors for estimating Ne are smaller for cases 4 and 5 than for cases 2 and 3.


Recently, a Bayesian approach that incorporates the uncertainty in the genealogical structure has been developed (Drummondet al. 2002). This Bayesian approach simultaneously estimates the mutation rate, population size, and tree topology. In contrast, our frequentist approach neglects the uncertainty of the estimated tree topology to reduce the required amount of calculation. The accounting for topological uncertainty is a distinct advantage of the Bayesian approach over that presented here.

Although there are multiple Bayesian and frequentist options for model checking, a potential advantage of the approach here over the full Bayesian strategy is ease of model checking. The first step of our procedure begins with reconstructing a geneaology and the structure of the reconstructed genealogy can then be immediately inspected. Inspection of the genealogy permits both formal and informal checks of whether the assumed model of population history is plausible. For example, the structure of the reconstructed genealogy can give an immediate indication as to whether a model of population size increase over time is justified. Because we believe that models are apt to be the weakest point of evolutionary analysis, whereas the method of estimation based upon the model is an important but secondary concern, ease of model checking should not be dismissed when evaluating a procedure.

In our approach, the estimated times of internal nodes are nuisance parameters rather than being of major interest. Our pseudomaximum-likelihood method does not account for uncertainty of these nuisance parameters. To account for this uncertainty, an empirical Bayes approach could be adopted. In empirical Bayes approaches, the marginal-likelihood function is obtained by integrating the conditional distribution over the space of nuisance parameters (O'Hagan 1996).

In our case, the marginal-likelihood function of Ne given the sequence data X could be obtained by integrating over the times t, P(XNe)=P(Xt)P(tNe)dt. This integration is not analytically simple but a numerical approximation is available (Congdon 2001). The numerical technique involves sampling times from a multivariate normal distribution that approximates P(X|t) up to a constant of proportionality. The mean of the multivariate normal distribution is the value of t that maximizes P(X|t). The covariance structure of the multivariate normal distribution is approximated by the inverse of the Fisher information matrix. The ith sample from the multivariate normal distribution is denoted t(i) and the total number of these samples is n. Then, P(X|Ne) should be well approximated by P(XNe)1ni=1nP(t(i)Ne). (13) The marginal likelihood can be calculated with Equation 13 for each Ne and the maximum-likelihood estimate of Ne can be found by maximizing P(X|Ne).

To facilitate data analysis, we combine the constant rate assumption of the molecular clock hypothesis and Kingman's n-coalescent (Kingman 1982a,b). With a constant rate, divergence time estimation is straightforward. With a simple coalescent process, estimation of effective population size from known divergence times is not difficult. However, neither the clock nor the simple coalescent assumptions are technically correct. Both are violated by realistic forms of natural selection. For example, there is ample evidence that positive selection operates in HIV-1 genes (Nielsen and Yang 1998). It is unclear whether inaccuracy of the clock and coalescent assumptions would have a large effect on the results of this analysis. Some evolutionary scenarios that invoke selection have been shown to have little effect on the shape of gene genealogies relating contemporaneous data (Golding 1997; Neuhauser and Krone 1997; Przeworskiet al. 1999).

Kingman's n-coalescent also requires a constant population size. Because clinical measurements such as viral load counts do not change much during the asymptomatic period (reviewed in Viscidi 1999), the constant population size assumption seems reasonable. If population size is increasing over time, our estimates of it are expected to be intermediate between the early and later population levels. For a high rate of increase, the estimates are expected to be closer to the early population sizes than the later sizes.

Regarding the molecular clock, we focus here on analysis of sequences that were isolated during the approximately linear phase of sequence divergence that is characteristic of the early asymptomatic period in HIV-1 infections. This linear pattern is what would be seen with neutral evolution and with a molecular clock. Because the linear pattern does not extend into later portions of the asymptomatic period, we did not analyze sequences isolated during the later portions. Therefore, instead of using a highly realistic and overly complicated model to analyze the entire data set, we opted here to investigate only the portions of the data that seemed relatively compatible with the simple model of a molecular clock.

Searching for associations between effective population size and measurements that reflect the physiological condition of a patient may also be fruitful. As a potential marker of disease progression, we examined the time points at which CD4+ T cell counts of each patient decreased to 200/μl (see Shankarappaet al. 1999). However, we were unable to detect a relationship between these time points and the estimated effective viral population size within a patient. Nonetheless, informative associations between effective population size and other covariates may be detected in more thorough future analyses. These associations have the potential to illuminate the factors that are related to viral adaptation within a patient.

In previous work (Seoet al. 2002), we noted that it is desirable to disperse serial sampling times as much as possible to accurately estimate evolutionary rates and times. The same strategy applies in the estimation of effective population size, because differences in (N^e(t)) among serial sampling designs that share a common number of sampled sequences will be attributable to differences in Var(t) (see Equation 12). Designs with smaller Var(t) are expected to yield smaller Var(N^e(t)) .

In this study, we assumed either that viral mutation rate or viral generation length is constant among patients. In reality, both mutation rate and generation time probably vary to some extent among patients. To investigate this simultaneous variation, more data and especially a more sophisticated model for viral evolution may be warranted.

The negative correlation between evolutionary rate and effective population size (Figure 2) is noteworthy. We cannot formally exclude the possibility that this negative correlation is an artifact of our pseudomaximum-likelihood procedure because the fact that viral generation length and evolutionary rate may both vary among patients makes this exclusion difficult. Because differences in effective population size and evolutionary rate estimates among patients exceed the uncertainty within patients, the possibility that this negative correlation is simply an artifact seems unlikely.

In population genetics, this negative correlation is predicted by both the slightly deleterious mutation model (Ohta 1987) and the nearly neutral mutation model (Tachida 1991). In a strictly neutral model, the rate at which mutants become fixed in the population does not depend on the population size but only on the mutation rate. In contrast, in the slightly deleterious or nearly neutral models, the rate of fixation of advantageous mutants depends on both the population size and selective coefficients. In the slightly deleterious model, it is assumed that most mutations are neutral or slightly deleterious and the mean of the distribution of selective coefficient(s) is <0. A negative correlation between evolutionary rate and effective population size is then predicted. As noted by Overbaugh and Bangham (2001), if the error rate of reverse transcriptase is high, deleterious mutations would be frequent and negative selection is likely to be a dominant force in viral evolution. In the nearly neutral model, the selection coefficient of mutations has a distribution with variance σ2 and with a mean that can be either negative or positive. It was shown by simulation that the substitution (note that “substitution” here does not refer solely to variants that have been fixed in the population) rate is negatively correlated with Neσ (Tachida 1991). This might result in a negative correlation between evolutionary rate and effective population size for a fixed σ.

It is not clear whether the above classical views of population genetics explain our finding of a negative correlation. There are potentially many population genetic scenarios that could result in this correlation. Also, the correlation could be attributable to the variation of immune response among patients. If a patient's immune system is strong, viral sequences could experience strong positive selection and this could lead to an increased evolutionary rate and a decreased effective population size. We believe that further characterization of this apparent correlation and its possible sources would facilitate our understanding of the mechanism of viral adaptation within hosts.


We thank two anonymous reviewers for their suggestions. T.-K.S. and H.K. were supported by the Japan Society for the Promotion of Science (JSPS) United States-Japan research collaboration program 12554037; T.-K.S. by the Japanese government scholarship program for foreign students; M.H. and H.K. by grant BSAR-497 from the Ministry of Education, Culture, Sports, Science, and Technology (MECSST); J.L.T. and H.K. by grant 13308013 of MECSST; and J.L.T. by National Science Foundation grants DBI-0077503 and INT-990934.


Using the following equation, Nel(Ne,t)=0, we can estimate N^e(t) . With a Taylor expansion of (Ne)l(N^e(t),t) around Ne, we get 0=Nel(N^e(t),t)Nel(Ne,t)+22Nel(N^e,t)(N^e(t)Ne)Nel(Ne,t)n1Ne2(N^e(t)Ne), (A1) because 2Ne2l(Ne,t)=i=n22Ne2l1,i(Neti)(n1)Eti[2Ne2l1,i(Neti)]=n1Ne2. Equation A1 leads to N^e(t)NeNe2n1Nel(Ne,t) and tTN^e(t)Ne2n12NetTl(Ne,t). A Taylor expansion of N^e(t) around t yields N^e(t)N^e(t)+tTN^e(t)(tt)N^e(t)+Ne2n12NetTl(Ne,t)(tt)N^e(t)+12(n1)i=n2i(i1)(titi). Therefore, Vart,X(N^e(t))=Vart{EX(N^e(t)t)}+Et{VarX(N^e(t)t)}Vart(N^e(t))+(Ne2n1)2×Et{2NetTl(Ne,t)VarX(tt)2Netl(Ne,t)}=Ne2n1+122(n1)2Et{VarX(i=n2i(i1)tit)}. (A2)


  • Communicating editor: G. B. Golding

  • Received August 13, 2001.
  • Accepted January 21, 2002.


View Abstract