The evolution of the human immunodeficiency virus (HIV-1) during chronic infection involves the rapid, continuous turnover of genetic diversity. However, the role of natural selection, relative to random genetic drift, in governing this process is unclear. We tested a stochastic model of genetic drift using partial envelope sequences sampled longitudinally in 28 infected children. In each case the Bayesian posterior (empirical) distribution of coalescent genealogies was estimated using Markov chain Monte Carlo methods. Posterior predictive simulation was then used to generate a null distribution of genealogies assuming neutrality, with the null and empirical distributions compared using four genealogy-based summary statistics sensitive to nonneutral evolution. Because both null and empirical distributions were generated within a coalescent framework, we were able to explicitly account for the confounding influence of demography. From the distribution of corrected P-values across patients, we conclude that empirical genealogies are more asymmetric than expected if evolution is driven by mutation and genetic drift only, with an excess of low-frequency polymorphisms in the population. This indicates that although drift may still play an important role, natural selection has a strong influence on the evolution of HIV-1 envelope. A negative relationship between effective population size and substitution rate indicates that as the efficacy of selection increases, a smaller proportion of mutations approach fixation in the population. This suggests the presence of deleterious mutations. We therefore conclude that intrahost HIV-1 evolution in envelope is dominated by purifying selection against low-frequency deleterious mutations that do not reach fixation.
DURING chronic human immunodeficiency virus (HIV-1) infection the turnover of virions is rapid (Ho et al. 1995; Wei et al. 1995), the mutation rate high (Mansky and Temin 1995), and the population size large (Haase 1999). Consequently a huge amount of diversity accumulates (Wain-Hobson 1993), so that viral genomes within an individual can differ by between 3 and 5% in their envelope regions (Balfe et al. 1990; Wolfs et al. 1990; Lamers et al. 1993). This variation has important consequences, allowing HIV-1 to infect different cell types (Groenink et al. 1992; Chavda et al. 1994), evade the immune system (Burns et al. 1993; Price et al. 1997; Wei et al. 2003), and acquire resistance to antiviral drugs (Fitzgibbon et al. 1993; Condra et al. 1996; de Jong et al. 1996). During chronic intrahost infection, the HIV-1 envelope gene diverges from the founding population at a rate of ∼1%/year (Shankarappa et al. 1999). Revealing the processes that govern this diversification, and the properties of mutations that contribute to increasing levels of diversity, is therefore central to our understanding of HIV-1 evolution.
The trajectory of emergent mutations in the population is potentially influenced by both random genetic drift and natural selection. If selection played no role then evolution would proceed according to a stochastic model of genetic drift. With increasing selection, the accumulation of polymorphisms becomes more predictable. When selection is strong and the effect of drift negligible, evolution can be considered deterministic. In between the extremes of deterministic and purely stochastic evolution, it is possible for drift, selection, and mutation to all have interacting and important influences on population evolution (Rouzine et al. 2001).
The degree of evolutionary stochasticity to which a population is subjected is critically influenced by the number of replicating virions N. If the population were ideal (all changes are neutral, with discrete generations and no population subdivision) then stochasticity could be predicted by N, with large N (Nμ ≫ 1; where μ is the mutation rate) associated with a deterministic outcome. Because real populations are not ideal, stochasticity is instead represented by the effective population size Ne, which is the expected value of N under ideal conditions, given the stochasticity observed (Wright 1931). In HIV-1, Ne is typically estimated to be ∼103–104 (Leigh Brown 1997; Seo et al. 2002; Achaz et al. 2004; Shriner et al. 2004). These small Ne values (Neμ ≪ 1) have been interpreted as evidence for stochastic evolution (Leigh Brown 1997; Shriner et al. 2004). However, in each case Ne was obtained under the assumption of neutrality, despite the potential influence that selection may have on its estimation. Because selection is confounded with Ne, Ne does not provide an accurate predictor of stochastic evolution. Instead, it should be regarded simply as a measure of sequence diversity in the population θ, weighted by 1/2μ (i.e., Ne = θ/2μ in a haploid population). Hence, because diversity can be reduced by selection, a low estimated Ne could also be a sign of deterministic evolution in the form of a mutation–selection balance. Arguments have been forwarded that a neutral model of evolution is indeed valid (Leigh Brown 1997; Shriner et al. 2004), so that a low Ne provides an indication of stochasticity. However, by considering individual time points from a single population separately, these have suffered from a lack of statistical power to reject neutrality and nonindependence.
Here we test directly for a stochastic model of evolution in which the fate of emergent mutations is dictated only by genetic drift. Rejecting this model, we infer it to be an inadequate representation of HIV-1 envelope evolution and conclude an important role for natural selection.
We apply our test to envelope sequences from a cohort of HIV-1-infected children, sampled longitudinally from birth. The external envelope protein is a likely site of selection, being targeted by the patient's antibody response (Moore et al. 1994), and responsible for receptor binding and entry into host cells (Wyatt and Sodroski 1998), and therefore constitutes an ideal region with which to investigate the evolutionary processes acting on HIV-1.
MATERIALS AND METHODS
Currently the best evidence of a role for natural selection in HIV-1 evolution is provided by estimates of the ratio of nonsynonymous to synonymous changes per site (dN/dS), with dN/dS = 1 being the expectation under neutrality. Codon-based estimates of the dN/dS ratio in HIV-1 envelope sequences have consistently shown a preponderance of constrained sites (dN/dS < 1), punctuated by relatively frequent positive selection (dN/dS > 1) (Nielsen and Yang 1998; Ross and Rodrigo 2002; Guindon et al. 2004). These techniques are sensitive to recombination, which leads to an overestimation of the number of positively selected sites (Anisimova et al. 2003; Shriner et al. 2003). Recently a coalescent approximation has been proposed to overcome this problem (Wilson and McVean 2006). However, whether all synonymous changes in HIV-1 are selectively neutral is still uncertain.
Another approach is provided by population genetic arguments based on the frequency distribution of polymorphic sites. The most popular of these tests are Tajima's D (Tajima 1989), and Fu and Li's D (Fu and Li 1993). These compare the contributions of low- and high-frequency polymorphisms to population diversity, representing recent and established substitutions, respectively. The frequency distribution of nonneutral segregating sites will be influenced by selection. Purifying and directional positive selection will lead, on average, to an excess of low-frequency polymorphisms in the contemporary population. However, this pattern of polymorphism can also be produced by demographic change, most notably exponential growth, necessitating the careful application of these statistics.
Because of its sensitivity to selection, we focus on Fu and Li's D as a test statistic, adopting a genealogy-based approach to its estimation. We call this the genealogical D, estimated using a version of coalescent theory (Kingman 1982a,b) that can consider sequences obtained from multiple time points simultaneously (Rodrigo and Felsenstein 1999). This increases the power of our test and removes the problem of nonindependence of samples obtained sequentially from a single individual.
In addition to the genealogical D we include measures of tree symmetry. Selection is likely to leave its mark on the genealogy by biasing the extinction and branching rates of different lineages (Grenfell et al. 2004). This will lead to imbalance, or asymmetry, in the phylogeny (Kirkpatrick and Slatkin 1993), so that the use of measures of tree asymmetry in this study is justified.
Estimates of each test statistic involve sampling a large number of genealogies within a Bayesian framework. Hence, uncertainties in parameters of the evolutionary model and the ancestral genealogy are explicitly acknowledged. For each test of neutrality, a distribution of coalescent genealogies was produced from the empirical data and compared to a simulated null distribution using a goodness-of-fit test.
Testing a model of genetic drift:
Generating the null and posterior genealogical distributions:
For each set of longitudinally sampled patient sequences the posterior distributions of genealogies were generated using the BEAST software package (Drummond and Rambaut 2004). This selects genealogies according to their posterior probability within a Bayesian framework using Markov chain Monte Carlo (MCMC), as described previously (Drummond et al. 2002). The HKY85 model of nucleotide substitution (Hasegawa et al. 1985), with a four-category discrete approximation to a gamma distribution of rate heterogeneity across sites (Yang 1994), was implemented. All substitution model parameters, including κ, the shape parameter α, and substitution rate μ in changes per site per day, were estimated from the data using MCMC along with demographic model parameters (shaded area in Figure 1). Both a constant population size and exponential growth were assumed. Using the serial sampled coalescent, population size is considered as the product of the effective population size and generation length in days Neτ (Rodrigo and Felsenstein 1999). HIV-1 is known to follow a reproducible pattern of increasing diversity during the course of infection (Shankarappa et al. 1999), making the exponential model appropriate. It was compared to the constant alternative according to the Akaike information criteria (AIC) (Akaike 1973), with the model with the lowest AIC score considered to be the best representation of the data.
During MCMC a marginal posterior distribution of genealogies is generated, with each genealogy associated with demographic model parameter values that can be used to predict future iterations of the evolutionary process. This is known as posterior predictive simulation (Rubin 1984), performed here using the COALGEN program (Rambaut and Drummond 2004). It proceeds by simulating new coalescent genealogies using the posterior distribution of demographic parameter values generated from the data, with one simulated tree for each step in the MCMC chain (see Bollback 2002). This generates an appropriate null distribution to test if the assumptions made by the coalescent are accurate, because under this condition the posterior and null distributions are expected to be identical in the long run.
The neutral coalescent proceeds according to a stochastic model that represents an evolutionary process dominated by genetic drift. We are interested in departures from this assumption that are attributable to natural selection. We therefore compared the null and posterior distributions in goodness-of-fit tests using genealogy-based statistics sensitive to nonneutral evolution. One of these statistics, the genealogical D, is also sensitive to demography. In particular, exponential growth is known to lead to more negative values, so that if the null distribution was generated under the incorrect assumption of a constant population size, it would yield a large discrepancy between the posterior and null distribution of values even for a population not influenced by selection. An advantage of posterior predictive simulation is that it can incorporate the assumption of exponential growth so that its potentially confounding influence is explicitly accounted for.
Comparing the null and posterior genealogical distributions:
The observed and predicted genealogical distributions were compared using four summary statistics (Table 1). The genealogical D-statistic is a corrected comparison of the total length of the rooted tree to that expected from the length of the tips under neutral genetic drift (see appendix), and assuming that all sequences were sampled simultaneously (i.e., the neutral expectation for synchronous sequences is 0). When longitudinal samples are considered, the genealogical D is usually <0, even for a neutral population, because the length of internal branches is shortened relative to that predicted from the tips. If slightly deleterious mutations are abundant in the population and selected against, then they will on average be at lower frequency than other mutations (Nielsen and Weinreich 1999) and segregate near to the tips of the genealogy (Williamson and Orive 2002). Simulation studies, in which singleton mutations were randomly allocated to sequences generated under a neutral model of evolution, showed that the presence of low-frequency mutations leads to more negative D-values (data not shown). Positive selection on the other hand is more likely to lead to the fixation of changes between time points (Williamson 2003) and less negative estimates of D. As noted above, the genealogical D is sensitive to exponential growth, which was factored out of the analysis when evidence for exponential growth was detected in the empirical data (i.e., when the exponential demographic model was found to have the lowest AIC score).
In addition, three measures of tree symmetry were used: B1 (Kirkpatrick and Slatkin 1993), Colless tree imbalance (Ic) (Colless 1982), and cherry count (Cn) (McKenzie and Steel 2000). The response of each of these statistics to increasing asymmetry is listed in Table 1, with mathematical definitions given in the appendix.
Testing for significance:
To test for a statistically significant difference between the null and posterior genealogical distributions (Figure 1), we first calculated the proportion of times that the simulated genealogy yielded a summary statistic (T) value greater (for Ic) or less (for D, B1, and Cn) than the genealogy estimated from the empirical data. This provided the posterior predictive P-value P (see appendix). In the case of the genealogical D, this specifically tests the significance of more negative values, which would indicate an excess of low-frequency polymorphisms and be consistent with the action of purifying selection.
By considering the entire distributions of T-values in our estimation of P, we account for the considerable uncertainty inherent in estimates of the true genealogy. However, posterior predictive P-values are known to be conservative (Meng 1994), in that they do not reveal the full discrepancy between the model being tested and the empirical data. Statistically, there are fewer extreme values than would be expected if P followed a uniform reference distribution between 0 and 1. To overcome this problem we simulated multiple sequence data sets under a neutral coalescent model of evolution. By estimating P for each, we obtained the expected reference distribution of P-values. This allowed us to apply the appropriate correction to P-values obtained from the empirical data (see appendix), ensuring that the expected distribution of corrected P-values was uniform. We thus not only increased the sensitivity of our test, but also were able to combine corrected P-values across patients so that a more powerful inference could be made.
Because we have prior knowledge of the effect of selection on each of the statistics, a one-tailed test was applied to each corrected P-value, with P < 0.05 considered significant.
Envelope sequences were obtained from 28 HIV-1-positive children infected at birth. The majority of these sequences have been published previously (Edwards et al. 2006) and detailed descriptions of the cohort (Thomas et al. 1994; Abrams et al. 1995) and sequencing techniques (Strunnikova et al. 1995) are given elsewhere.
The sequences analyzed were derived from a heteroduplex mobility assay (HMA), which screens the sample to isolate distinct clones. They were therefore more divergent than would be expected from a random sample. Because closely related sister taxa were underrepresented, the sampled distribution of coalescent times may have been slightly biased toward the past, in comparison to that expected from a random sample. This will have had an effect on estimates of population demography, contributing to the appearance of an expanding or logistically growing population. However, it is unlikely to have affected our test for selection, since the distribution of node heights is accounted for by the demographic correction implemented.
Sequences were ∼360 bp in length, spanning the highly variable envelope V3 region. Multiple clones were available from serial time points postinfection (supplemental Table 1 at http://www.genetics.org/supplemental/) with the majority during the chronic stages of disease (supplemental Table 2 at http://www.genetics.org/supplemental/). All sequences (excepting those from pa, pb, pc, and pd) were derived from viral RNA. These sequences are available from GenBank under accession nos. AY823998–AY824946.
Application of test statistics to patient data:
We tested for departures from a model of genetic drift through the application of each test statistic to envelope sequences from 28 infected children, sampled longitudinally from birth. In the majority of cases (19/28) evidence for exponential growth of the viral population was detected (Table 2). Notable exceptions were the patients for which only proviral DNA was available. Sequences derived from this source will represent virus that may have been active some time in the past. However, the serial sampled coalescent will treat each sequence as a member of the temporal population from which it was sampled. Therefore, because the population may have remained apparently unchanged, it could appear to follow a constant demographic. This suggests that caution should be exercised in interpreting the results from these patients, because unaccounted for exponential growth may have taken place.
The distributions of uncorrected P-values across patients, assuming both a constant and an exponential demographic, are shown in Figure 2. Also shown are the reference distributions of P-values obtained from sequences simulated under a neutral coalescent model (see appendix). These are typically S-shaped, rather than lying on the diagonal, consistent with the conservative nature of posterior predictive P-values (i.e., there is a lower frequency of extreme values than would be expected from a uniform distribution). After correcting P-values with the simulated reference distributions, and assuming the best-fit (preferred) demographic model, we obtained significant results using both the genealogical D and measures of tree asymmetry (Table 2). The genealogical D gave the strongest evidence for selection, with 15 of the patients tested giving significant corrected P-values (Table 2).
Accounting for recombination:
The posterior distributions of genealogies were generated assuming no recombination. If recombination is not accounted for, the resultant homoplasious changes will lead to a lengthening of the tips of the genealogy relative to the internal branches, in a manner similar to exponential growth (Schierup and Hein 2000). It also has the potential to affect tree asymmetry. Because recombination in HIV-1 is frequent (Jung et al. 2002; Zhuang et al. 2002), it may therefore have influenced our estimation of each test statistic. In particular, our estimate of the genealogical D will be biased toward more negative values if sequences were obtained from a recombining population. To account for the potential influence of recombination on our estimates of D and tree asymmetry, we simulated reference distributions under this assumption and used them to correct our P-values.
For each patient we first estimated the per site rate of recombination using the LDhat software package (McVean et al. 2002). Because of the short sequence length, we assumed a linear model of recombination. The significance of each result was tested using the Lkmax test, which randomly permutes segregating sites along the sequence and tests whether the maximum composite likelihood obtained from the real data differs from this null distribution. The average ratio of the recombination to mutation rate (r/μ) was found to be 0.44 (range 0.0–3.3) (supplemental Table 3 at http://www.genetics.org/supplemental/). Sequences were then generated as before assuming both a constant population size and exponential growth, with an r/μ ratio of 0.5 (approximately equal to the mean across data sets).
If the reference distributions of P-values obtained with and without recombination are different, this indicates that recombination may have confounded interpretation of our significant results. Using the Kolmogorov–Smirnov test we found significant differences in the reference distributions of P-values for all statistics (Figure 2 and Table 3). This indicated that our original correction to empirical P-values may not have been robust to the effect of recombination. In the case of the genealogical D, recombination leads to more negative values, so that the significance of P-values will have been increased (see reference distributions in Figure 2). Interestingly, this bias was much greater when a constant population size was assumed. In the case of tree asymmetry, the effect of recombination was also dependent on the demographic model assumed (Table 3). The reasons for this are unclear, since recombination is likely to have broad and unpredictable effects on the inferred genealogy. Nevertheless, we consider there to be sufficient evidence to necessitate a correction for its effect. We therefore applied a new correction based on reference distributions constructed under the assumption of recombination. This ensured that recombination was taken into account when interpreting the significance of our results.
For the genealogical D, we found a notable reduction in the significance of our results under the preferred demographic model (Table 2). This indicated that significant departures from a simple stochastic model of evolution could have been induced by an empirically relevant rate of recombination. Furthermore, in four of the seven patients for which significant results were still observed (p11, p13, pa, and pb) r/μ ratios >0.5 were estimated (supplemental Table 3 at http://www.genetics.org/supplemental/). For these patients correction for the effects of recombination may have been inadequate.
The effect of recombination on measures of tree asymmetry was more complex. For Ic the number of significant results was also reduced, again indicating that recombination can inflate the significance of departures from the expectation under drift, in this case by increasing levels of asymmetry. The correction applied lowered the significance of observed Ic-values to take this effect of recombination into account. In contrast, for B1 and Cn the number of significant results was increased. This can be explained if recombination lowered the degree of asymmetry recorded by these measures, so that the significance of observed values of B1 and Cn was increased when recombination was taken into account.
Even after correction for recombination a number of significant results remained (Table 2). Because the statistics applied were specifically designed to detect nonneutral evolution, and because their significance cannot be accounted for by demography, we interpreted this as evidence for selection.
Distribution of PT*-values across patients:
To improve the sensitivity of our analysis we next examined the distribution of corrected P-values across patient data sets for evidence of a role for natural selection in HIV-1 evolution. Departures from the expected uniform distribution were assessed using a one-tailed -test (see appendix).
We found our results to be highly significant. Figure 3 shows the distribution of corrected P-values across patients, with corrections applied using a reference distribution with and without recombination. For purposes of illustration, only P-values obtained assuming the preferred demographic model for each patient are shown. The significance of departures from the expected uniform distribution is listed in Table 4. Assuming recombination, and under the preferred demographic model, all statistics yielded a significant result.
The presence of slightly deleterious mutations:
More negative genealogical D-values, compared to the expectation under neutral genetic drift, indicate an excess of low-frequency polymorphisms in the population. Because positive selection is likely to fix changes between time points, leading to a deficiency of low-frequency polymorphisms, this result suggests that these represent transient, slightly deleterious mutations that are removed from the population by purifying selection before they increase in frequency.
To gain further insight into the nature of the selective forces operating we estimated the overall dN/dS ratio across the region using a codon-based method (Goldman and Yang 1994). This provides an appropriate comparison to the results presented here. In line with previous work on HIV-1 envelope (Nielsen and Yang 1998), the overall dN/dS ratio is <1 for all patients (except p3 and p7). These results do not in themselves provide a robust test of selection, but nevertheless indicate that if selection is acting, as we have argued here, its primary role is in lowering the rate of nonsynonymous substitution.
If the majority of mutations are slightly deleterious, then purifying selection will lower their contribution to the substitutions that occur between time points. Therefore the stronger the selection, the lower the rate of fixation (Kimura 1962). To investigate the relationship between the strength of selection and rate of substitution we plotted μ against the effective population size Neτ, both estimated using BEAST. Because estimates of Neτ will be influenced by selection (as discussed above), it was obtained using the third codon site only. The majority (∼70%) of changes at this position are synonymous, so that diversity is more likely to reflect the strength of selection acting on the population.
Under a strictly neutral model of sequence evolution the rate of substitution is dependent only on the underlying mutation rate and is therefore independent of the population size. In contrast to this expectation, we found a strong negative correlation between μ and Neτ (P < 0.0001; Figure 4). This result is consistent with theoretical predictions (Ohta 1987) and suggests the action of purifying selection in removing slightly deleterious mutations from the population before they increase in frequency.
We have tested for a stochastic model of evolution in which mutations arise randomly and in proportion to the length of each branch in the coalescent genealogy. This represents evolution driven by mutation and random genetic drift only. Our analysis shows that the empirical data do not fit this model. Because the genealogical statistics used to test model fit were designed specifically to detect nonneutral evolution, we interpret this as evidence for natural selection acting on the population.
Population genetics theory states that genetic drift will dominate evolution when the product of the effective population size and selection coefficient is much less than one (Nes ≪ 1). Because the HIV-1 genome is not strictly neutral (s ≠ 0), this would be interpreted as evidence for a low Ne. We have shown that this is not the case, so that selection is important in shaping the evolution of HIV-1 envelope sequences. This conclusion does not imply that evolution is deterministic with natural selection the only important component. Genetic drift is likely to play a substantial role, particularly at sites where the selection coefficient is small. In the case of purifying selection, evolution may still be considered stochastic but with an increased bias toward the loss (rather than fixation) of emergent mutations.
Our analyses focused on the frequency distribution of polymorphic sites in the population and tree asymmetry, both estimated from the coalescent genealogy. By simulating a null distribution of genealogies under a stochastic model of genetic drift, we could test directly whether the observed frequency distribution differed significantly from this expectation. The approach adopted incorporates a number of improvements on previous tests that have applied similar goodness-of-fit tests (Leigh Brown 1997; Shriner et al. 2004). Multiple time points were considered simultaneously, improving the power of each test and removing nonindependence from the analysis. Furthermore, the use of posterior predictive simulation to generate the null distribution of genealogies eliminated the confounding influence of demographic change. Further improvements to the sensitivity of our test could be made by estimating demography using synonymous sites only. Assuming that synonymous and nonsynonymous sites evolve independently, this would allow posterior predictive simulation under conditions that more closely reflect the actual demography of the viral population. Although these techniques are currently unavailable within the implemented framework, they could provide a fruitful avenue for future research (see Hahn et al. 2002).
In addition to higher than expected levels of tree asymmetry, we found that the frequency distribution of polymorphic sites (summarized by the genealogical D) was significantly different from the expectation under drift, with a clear excess of low-frequency mutations. This suggests that purifying selection acts to remove slightly deleterious mutations before they increase in frequency, an interpretation consistent with codon-based estimates of the dN/dS ratio, both here and in previous work (Nielsen and Yang 1998; Ross and Rodrigo 2002), that show the rate of nonsynonymous substitution is generally low. To further investigate this hypothesis, we plotted the relationship between the rate of substitution μ and effective population size Neτ. We found a strong negative correlation between μ and Neτ, as expected if the majority of mutations are removed from the population and therefore make a decreasing contribution to the overall rate of change as the efficacy of selection increases (Ohta 1987). Because Neτ will be influenced by selection, we considered diversity only at third codon positions, which is largely synonymous. Nevertheless, synonymous diversity may still be reduced through linkage to sites that are under directional positive selection (Maynard Smith and Haigh 1974). Thus, although it would predict a deficiency in the proportion of low-frequency mutations and therefore be incompatible with our previous observations, an alternative explanation of this relationship is that the strength of positive directional selection differs between patients, being strongest at low levels of associated diversity.
A role for frequency-dependent selection is also plausible, so that the fitness advantage of a particular mutation is lost as it increases in frequency, impeding its eventual fixation. The negative relationship could then be explained if strong frequency-dependent selection imposed by the immune response simultaneously increased diversity and lowered the rate of substitution. This, however, would contradict what is known from empirical studies, which show that antibody escape mutations in envelope reach high frequencies before their selective advantage is lost (Wei et al. 2003). Frequency-dependent selection is therefore unlikely to be strong enough to influence our estimates of the substitution rate. Furthermore, there does not appear to be any correlation between vigor of the antibody response and plasma viral loads (Richman et al. 2003; Schmitz et al. 2003), making it unlikely that the strength of frequency-dependent selection will correlate consistently with diversity across patients. On the basis of these arguments we conclude that frequency-dependent selection is an improbable explanation for the negative relation observed between μ and Neτ.
Our results suggest that HIV-1 carries a mutational load of low frequency, deleterious mutations, which represent a substantial contribution to the total genetic diversity of the viral population. A burden of deleterious mutations is a probable consequence of the high rate at which genetic errors are introduced during replication (Mansky and Temin 1995), necessitating a short, compact genome to maintain evolutionary consistency (Overbaugh and Bangham 2001; Holmes 2003). Even in the envelope gene, the most highly variable region of the genome, a degree of constraint is required for operational viability. This is not surprising, given its essential role in receptor binding and cell entry (Wyatt and Sodroski 1998) and the need to maintain extensive glycosylation on its surface to avoid antibody neutralization (Reitter et al. 1998).
Mathematical definitions of genealogy-based statistics for testing nonneutral evolution:
Given the number of sequences n, the total length of the genealogy T, and the total length of the external branches Te, the genealogical D is defined as
The definitions of υD, uD, an, bn, and cn are reproduced directly from Fu and Li (1993).
Each interior node in a bifurcating genealogy is considered as the root of a subgenealogy. M is the maximum number of interior nodes between each root i and the terminal branches. This is summed over all interior nodes, excluding the root for the entire genealogy:
Colless tree imbalance Ic:
For each of n − 1 internal nodes in a bifurcating genealogy, sequences are partitioned into two groups of sizes ri and si, with ri ≥ si. Ic is based on the differences between ri and si summed over all internal nodes:
Cherry count Cn:
A cherry is defined simply as a pair of sequences adjacent to a single common ancestor node on the genealogy. The number of cherries gives the cherry count Cn.
Testing for neutrality in serially sampled genetic sequences from a single viral population:
Estimation of genealogy and demographic parameters:
Two demographic models, namely constant population size and exponential growth, were fitted to sequences from each patient using the BEAST program (Drummond and Rambaut 2004). This applies a Bayesian coalescent framework using MCMC, with all substitution and demographic parameters estimated from the data and values represented by their marginal posterior probability distributions. All priors were assumed to be uniform on a natural scale. In this way a distribution of genealogies was generated around the posterior expectation, with each genealogy associated with a vector of mutational and demographic model parameters (Drummond et al. 2002). This is referred to in the text as the “empirical” posterior distribution.
The relative fit of each demographic model to the data was assessed using the AIC (Akaike 1973). The AIC of a given model is twice its marginal log likelihood plus the number of parameters specified (AIC = 2 lnLk + 2p). The model with the lowest AIC was selected as the best representation of the data.
Simulation of the null distribution:
The posterior distributions of demographic model parameters, namely the product of effective population size and generation length in days (Neτ) (see Rodrigo and Felsenstein 1999) and growth rate r, were then used to simulate new genealogies in a neutral coalescent process. One new genealogy was produced for each genealogy in the empirical posterior distribution. This is known as posterior predictive simulation (Rubin 1984) and effectively produces a null distribution of genealogies under the coalescent model that can be compared to the empirical distribution in a goodness-of-fit test. This tests for violations of the assumptions made by the coalescent.
Comparison of null and empirical distributions:
To test for departures from the neutral coalescent model we compared the “empirical” posterior distribution and “null” predicted distribution of genealogies in a goodness-of-fit test using descriptive statistics (referred to here as T). The posterior predictive P-value is(A1)where T(GP) is the value of T given by a predicted genealogy and T(GE) is that by an empirical genealogy. PT can be obtained using a consistent estimator as the proportion of times that the predicted genealogy yields a value of T less than the genealogy estimated from the real data,(A2)where the indicator function I(.) takes the value 1 when its argument is true and 0 otherwise. By considering the entire distribution of T-values, P accounts for the considerable uncertainty inherent in estimates of the true genealogy. Note that the direction of the inequality is dependent on the statistic used. Equations A1 and A2 are appropriate for the genealogical D, B1, and Cn. Selection leads to a reduction in the value of these test statistics (Table 1). Significance is therefore obtained when P < 0.05. The value for Ic, however, is increased by selection. In this case therefore the inequality is reversed.
The need to correct PT*-values:
By convention, P-values are expected to follow a uniform distribution between 0 and 1, so that the type I error rate is equal to the chosen significance level. However, posterior predictive P-values such as P are known to be conservative (Meng 1994). Intuitively, this is because the same data are used both to estimate the parameter distributions and to test the goodness-of-fit. Using the uniform distribution as a reference will therefore lead to a conservative test, and it is necessary to obtain a new reference distribution. This can be used to correct our P-values so that their expectation is indeed uniform.
Simulating the necessary reference distributions:
New reference distributions were obtained by generating sequences using a neutral coalescent simulator. The sampling structure, length of artificial sequences, and mutational and demographic parameters were selected as representative of the patient data. Both a constant population size and exponential growth were assumed. One hundred time-structured data sets were obtained with sample times at 0, 300, 600, and 900 days and 10 sequences at each time. Artificial DNA sequences were 400 bp in length and simulated down the coalescent tree under an HKY + continuous Γ model of substitution with κ = 8.0, α = 0.1, and μ = 4.0 × 10−5/site/day. Insertions and deletions were not simulated. For a constant population size, Neτ was set to 1500. For data sets simulated under exponential growth, Neτ was 5000 and the growth rate 2.0 × 10−3.
For each artificial sequence data set produced, a P-value was estimated. The distribution of these values is shown in Figure 2, which reveals that the expected P is indeed not uniform, with a paucity of extreme values.
Obtaining corrected PT*-values:
If N is the number of data points in the simulated reference distribution and n the number of data points in the reference distribution ≤P, then(A3)
Note that each uncorrected P was obtained assuming a particular demographic model and corrected using a reference distribution generated under the same demographic assumption.
This procedure allowed us to apply a one-tailed 5% significance test to the corrected P-value generated for each test with the expectation of a 5% type I error rate.
Testing for neutrality across multiple-sequence data sets:
Because corrected P-values follow a uniform distribution between 0 and 1, they can be combined across patients, even if obtained under different demographic scenarios. Instead of relying on individual tests of neutrality, this allows for a more powerful inference.
Combining values over all 28 patients, letwhere is the inverse cumulative distribution function of the standard normal distribution N(0, 1). Under the null model, C follows a -distribution with 28 d.f., so that a one-tailed -test gives the combined corrected P-value. This effectively tests whether the empirical distribution of corrected P-values differs significantly from the uniform expectation (Figure 3 and Table 4).
The BEAST analysis software was developed by Alexei Drummond and Andrew Rambaut and contributed to by Roald Forsberg, Oliver Pybus, and Korbinian Strimmer. We gratefully acknowledge the contribution of the New York City Maternal Infant HIV Transmission Study, funded by the Centers for Disease Control and Prevention. This work was supported by the Wellcome Trust (C.E., A.D., R.P., and E.H.) and the Biotechnology and Biological Sciences Research Council (D.W.).
- Received October 7, 2005.
- Accepted August 17, 2006.
- Copyright © 2006 by the Genetics Society of America