| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Genetics, Vol. 174, 1441-1453, November 2006, Copyright © 2006
doi:10.1534/genetics.105.052019
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





,2
* Nuffield Department of Clinical Medicine and
Department of Statistics, University of Oxford, Oxford OX1 3SY, United Kingdom,
Center for Infectious Disease Dynamics, Department of Biology, Pennsylvania State University, University Park, Pennsylvania 16802,
Department of Zoology, University of Oxford, Oxford OX1 3PS, United Kingdom, ** Department of Pediatrics, The Johns Hopkins Hospital, Baltimore, Maryland 21287 and 
Department of Pediatrics, Columbia University College of Physicians and Surgeons and Harlem Hospital Center, New York, New York 10032
1 Corresponding author: Department of Maths and Applied Maths, University of Cape Town, Rondebosch, Cape Town 7701, Republic of South Africa.
E-mail: cedwards{at}maths.uct.ac.za
| ABSTRACT |
|---|
|
|
|---|
1%/year (SHANKARAPPA et al. 1999). Revealing the processes that govern this diversification, and the properties of mutations that contribute to increasing levels of diversity, is therefore central to our understanding of HIV-1 evolution. The trajectory of emergent mutations in the population is potentially influenced by both random genetic drift and natural selection. If selection played no role then evolution would proceed according to a stochastic model of genetic drift. With increasing selection, the accumulation of polymorphisms becomes more predictable. When selection is strong and the effect of drift negligible, evolution can be considered deterministic. In between the extremes of deterministic and purely stochastic evolution, it is possible for drift, selection, and mutation to all have interacting and important influences on population evolution (ROUZINE et al. 2001).
The degree of evolutionary stochasticity to which a population is subjected is critically influenced by the number of replicating virions N. If the population were ideal (all changes are neutral, with discrete generations and no population subdivision) then stochasticity could be predicted by N, with large N (Nµ >> 1; where µ is the mutation rate) associated with a deterministic outcome. Because real populations are not ideal, stochasticity is instead represented by the effective population size Ne, which is the expected value of N under ideal conditions, given the stochasticity observed (WRIGHT 1931). In HIV-1, Ne is typically estimated to be
103–104 (LEIGH BROWN 1997; SEO et al. 2002; ACHAZ et al. 2004; SHRINER et al. 2004). These small Ne values (Neµ << 1) have been interpreted as evidence for stochastic evolution (LEIGH BROWN 1997; SHRINER et al. 2004). However, in each case Ne was obtained under the assumption of neutrality, despite the potential influence that selection may have on its estimation. Because selection is confounded with Ne, Ne does not provide an accurate predictor of stochastic evolution. Instead, it should be regarded simply as a measure of sequence diversity in the population
, weighted by 1/2µ (i.e., Ne =
/2µ in a haploid population). Hence, because diversity can be reduced by selection, a low estimated Ne could also be a sign of deterministic evolution in the form of a mutation–selection balance. Arguments have been forwarded that a neutral model of evolution is indeed valid (LEIGH BROWN 1997; SHRINER et al. 2004), so that a low Ne provides an indication of stochasticity. However, by considering individual time points from a single population separately, these have suffered from a lack of statistical power to reject neutrality and nonindependence.
Here we test directly for a stochastic model of evolution in which the fate of emergent mutations is dictated only by genetic drift. Rejecting this model, we infer it to be an inadequate representation of HIV-1 envelope evolution and conclude an important role for natural selection.
We apply our test to envelope sequences from a cohort of HIV-1-infected children, sampled longitudinally from birth. The external envelope protein is a likely site of selection, being targeted by the patient's antibody response (MOORE et al. 1994), and responsible for receptor binding and entry into host cells (WYATT and SODROSKI 1998), and therefore constitutes an ideal region with which to investigate the evolutionary processes acting on HIV-1.
| MATERIALS AND METHODS |
|---|
|
|
|---|
Another approach is provided by population genetic arguments based on the frequency distribution of polymorphic sites. The most popular of these tests are Tajima's D (TAJIMA 1989), and Fu and Li's D (FU and LI 1993). These compare the contributions of low- and high-frequency polymorphisms to population diversity, representing recent and established substitutions, respectively. The frequency distribution of nonneutral segregating sites will be influenced by selection. Purifying and directional positive selection will lead, on average, to an excess of low-frequency polymorphisms in the contemporary population. However, this pattern of polymorphism can also be produced by demographic change, most notably exponential growth, necessitating the careful application of these statistics.
Because of its sensitivity to selection, we focus on Fu and Li's D as a test statistic, adopting a genealogy-based approach to its estimation. We call this the genealogical D, estimated using a version of coalescent theory (KINGMAN 1982a,b) that can consider sequences obtained from multiple time points simultaneously (RODRIGO and FELSENSTEIN 1999). This increases the power of our test and removes the problem of nonindependence of samples obtained sequentially from a single individual.
In addition to the genealogical D we include measures of tree symmetry. Selection is likely to leave its mark on the genealogy by biasing the extinction and branching rates of different lineages (GRENFELL et al. 2004). This will lead to imbalance, or asymmetry, in the phylogeny (KIRKPATRICK and SLATKIN 1993), so that the use of measures of tree asymmetry in this study is justified.
Estimates of each test statistic involve sampling a large number of genealogies within a Bayesian framework. Hence, uncertainties in parameters of the evolutionary model and the ancestral genealogy are explicitly acknowledged. For each test of neutrality, a distribution of coalescent genealogies was produced from the empirical data and compared to a simulated null distribution using a goodness-of-fit test.
Testing a model of genetic drift:
Generating the null and posterior genealogical distributions:
For each set of longitudinally sampled patient sequences the posterior distributions of genealogies were generated using the BEAST software package (DRUMMOND and RAMBAUT 2004). This selects genealogies according to their posterior probability within a Bayesian framework using Markov chain Monte Carlo (MCMC), as described previously (DRUMMOND et al. 2002). The HKY85 model of nucleotide substitution (HASEGAWA et al. 1985), with a four-category discrete approximation to a gamma distribution of rate heterogeneity across sites (YANG 1994), was implemented. All substitution model parameters, including
, the shape parameter
, and substitution rate µ in changes per site per day, were estimated from the data using MCMC along with demographic model parameters (shaded area in Figure 1). Both a constant population size and exponential growth were assumed. Using the serial sampled coalescent, population size is considered as the product of the effective population size and generation length in days Ne
(RODRIGO and FELSENSTEIN 1999). HIV-1 is known to follow a reproducible pattern of increasing diversity during the course of infection (SHANKARAPPA et al. 1999), making the exponential model appropriate. It was compared to the constant alternative according to the Akaike information criteria (AIC) (AKAIKE 1973), with the model with the lowest AIC score considered to be the best representation of the data.
|
The neutral coalescent proceeds according to a stochastic model that represents an evolutionary process dominated by genetic drift. We are interested in departures from this assumption that are attributable to natural selection. We therefore compared the null and posterior distributions in goodness-of-fit tests using genealogy-based statistics sensitive to nonneutral evolution. One of these statistics, the genealogical D, is also sensitive to demography. In particular, exponential growth is known to lead to more negative values, so that if the null distribution was generated under the incorrect assumption of a constant population size, it would yield a large discrepancy between the posterior and null distribution of values even for a population not influenced by selection. An advantage of posterior predictive simulation is that it can incorporate the assumption of exponential growth so that its potentially confounding influence is explicitly accounted for.
Comparing the null and posterior genealogical distributions:
The observed and predicted genealogical distributions were compared using four summary statistics (Table 1). The genealogical D-statistic is a corrected comparison of the total length of the rooted tree to that expected from the length of the tips under neutral genetic drift (see APPENDIX), and assuming that all sequences were sampled simultaneously (i.e., the neutral expectation for synchronous sequences is 0). When longitudinal samples are considered, the genealogical D is usually <0, even for a neutral population, because the length of internal branches is shortened relative to that predicted from the tips. If slightly deleterious mutations are abundant in the population and selected against, then they will on average be at lower frequency than other mutations (NIELSEN and WEINREICH 1999) and segregate near to the tips of the genealogy (WILLIAMSON and ORIVE 2002). Simulation studies, in which singleton mutations were randomly allocated to sequences generated under a neutral model of evolution, showed that the presence of low-frequency mutations leads to more negative D-values (data not shown). Positive selection on the other hand is more likely to lead to the fixation of changes between time points (WILLIAMSON 2003) and less negative estimates of D. As noted above, the genealogical D is sensitive to exponential growth, which was factored out of the analysis when evidence for exponential growth was detected in the empirical data (i.e., when the exponential demographic model was found to have the lowest AIC score).
|
Testing for significance:
To test for a statistically significant difference between the null and posterior genealogical distributions (Figure 1), we first calculated the proportion of times that the simulated genealogy yielded a summary statistic (T) value greater (for Ic) or less (for D, B1, and Cn) than the genealogy estimated from the empirical data. This provided the posterior predictive P-value P
(see APPENDIX). In the case of the genealogical D, this specifically tests the significance of more negative values, which would indicate an excess of low-frequency polymorphisms and be consistent with the action of purifying selection.
By considering the entire distributions of T-values in our estimation of P
, we account for the considerable uncertainty inherent in estimates of the true genealogy. However, posterior predictive P-values are known to be conservative (MENG 1994), in that they do not reveal the full discrepancy between the model being tested and the empirical data. Statistically, there are fewer extreme values than would be expected if P
followed a uniform reference distribution between 0 and 1. To overcome this problem we simulated multiple sequence data sets under a neutral coalescent model of evolution. By estimating P
for each, we obtained the expected reference distribution of P
-values. This allowed us to apply the appropriate correction to P
-values obtained from the empirical data (see APPENDIX), ensuring that the expected distribution of corrected P
-values was uniform. We thus not only increased the sensitivity of our test, but also were able to combine corrected P
-values across patients so that a more powerful inference could be made.
Because we have prior knowledge of the effect of selection on each of the statistics, a one-tailed test was applied to each corrected P
-value, with P
< 0.05 considered significant.
Patient data:
Envelope sequences were obtained from 28 HIV-1-positive children infected at birth. The majority of these sequences have been published previously (EDWARDS et al. 2006) and detailed descriptions of the cohort (THOMAS et al. 1994; ABRAMS et al. 1995) and sequencing techniques (STRUNNIKOVA et al. 1995) are given elsewhere.
The sequences analyzed were derived from a heteroduplex mobility assay (HMA), which screens the sample to isolate distinct clones. They were therefore more divergent than would be expected from a random sample. Because closely related sister taxa were underrepresented, the sampled distribution of coalescent times may have been slightly biased toward the past, in comparison to that expected from a random sample. This will have had an effect on estimates of population demography, contributing to the appearance of an expanding or logistically growing population. However, it is unlikely to have affected our test for selection, since the distribution of node heights is accounted for by the demographic correction implemented.
Sequences were
360 bp in length, spanning the highly variable envelope V3 region. Multiple clones were available from serial time points postinfection (supplemental Table 1 at http://www.genetics.org/supplemental/) with the majority during the chronic stages of disease (supplemental Table 2 at http://www.genetics.org/supplemental/). All sequences (excepting those from pa, pb, pc, and pd) were derived from viral RNA. These sequences are available from GenBank under accession nos. AY823998–AY824946.
| RESULTS |
|---|
|
|
|---|
|
-values across patients, assuming both a constant and an exponential demographic, are shown in Figure 2. Also shown are the reference distributions of P
-values obtained from sequences simulated under a neutral coalescent model (see APPENDIX). These are typically S-shaped, rather than lying on the diagonal, consistent with the conservative nature of posterior predictive P-values (i.e., there is a lower frequency of extreme values than would be expected from a uniform distribution). After correcting P
-values with the simulated reference distributions, and assuming the best-fit (preferred) demographic model, we obtained significant results using both the genealogical D and measures of tree asymmetry (Table 2). The genealogical D gave the strongest evidence for selection, with 15 of the patients tested giving significant corrected P
-values (Table 2).
|
-values. For each patient we first estimated the per site rate of recombination using the LDhat software package (MCVEAN et al. 2002). Because of the short sequence length, we assumed a linear model of recombination. The significance of each result was tested using the Lkmax test, which randomly permutes segregating sites along the sequence and tests whether the maximum composite likelihood obtained from the real data differs from this null distribution. The average ratio of the recombination to mutation rate (r/µ) was found to be 0.44 (range 0.0–3.3) (supplemental Table 3 at http://www.genetics.org/supplemental/). Sequences were then generated as before assuming both a constant population size and exponential growth, with an r/µ ratio of 0.5 (approximately equal to the mean across data sets).
If the reference distributions of P
-values obtained with and without recombination are different, this indicates that recombination may have confounded interpretation of our significant results. Using the Kolmogorov–Smirnov test we found significant differences in the reference distributions of P
-values for all statistics (Figure 2 and Table 3). This indicated that our original correction to empirical P
-values may not have been robust to the effect of recombination. In the case of the genealogical D, recombination leads to more negative values, so that the significance of P
-values will have been increased (see reference distributions in Figure 2). Interestingly, this bias was much greater when a constant population size was assumed. In the case of tree asymmetry, the effect of recombination was also dependent on the demographic model assumed (Table 3). The reasons for this are unclear, since recombination is likely to have broad and unpredictable effects on the inferred genealogy. Nevertheless, we consider there to be sufficient evidence to necessitate a correction for its effect. We therefore applied a new correction based on reference distributions constructed under the assumption of recombination. This ensured that recombination was taken into account when interpreting the significance of our results.
|
The effect of recombination on measures of tree asymmetry was more complex. For Ic the number of significant results was also reduced, again indicating that recombination can inflate the significance of departures from the expectation under drift, in this case by increasing levels of asymmetry. The correction applied lowered the significance of observed Ic-values to take this effect of recombination into account. In contrast, for B1 and Cn the number of significant results was increased. This can be explained if recombination lowered the degree of asymmetry recorded by these measures, so that the significance of observed values of B1 and Cn was increased when recombination was taken into account.
Even after correction for recombination a number of significant results remained (Table 2). Because the statistics applied were specifically designed to detect nonneutral evolution, and because their significance cannot be accounted for by demography, we interpreted this as evidence for selection.
Distribution of PT*-values across patients:
To improve the sensitivity of our analysis we next examined the distribution of corrected P
-values across patient data sets for evidence of a role for natural selection in HIV-1 evolution. Departures from the expected uniform distribution were assessed using a one-tailed
-test (see APPENDIX).
We found our results to be highly significant. Figure 3 shows the distribution of corrected P
-values across patients, with corrections applied using a reference distribution with and without recombination. For purposes of illustration, only P
-values obtained assuming the preferred demographic model for each patient are shown. The significance of departures from the expected uniform distribution is listed in Table 4. Assuming recombination, and under the preferred demographic model, all statistics yielded a significant result.
|
|
To gain further insight into the nature of the selective forces operating we estimated the overall dN/dS ratio across the region using a codon-based method (GOLDMAN and YANG 1994). This provides an appropriate comparison to the results presented here. In line with previous work on HIV-1 envelope (NIELSEN and YANG 1998), the overall dN/dS ratio is <1 for all patients (except p3 and p7). These results do not in themselves provide a robust test of selection, but nevertheless indicate that if selection is acting, as we have argued here, its primary role is in lowering the rate of nonsynonymous substitution.
If the majority of mutations are slightly deleterious, then purifying selection will lower their contribution to the substitutions that occur between time points. Therefore the stronger the selection, the lower the rate of fixation (KIMURA 1962). To investigate the relationship between the strength of selection and rate of substitution we plotted µ against the effective population size Ne
, both estimated using BEAST. Because estimates of Ne
will be influenced by selection (as discussed above), it was obtained using the third codon site only. The majority (
70%) of changes at this position are synonymous, so that diversity is more likely to reflect the strength of selection acting on the population.
Under a strictly neutral model of sequence evolution the rate of substitution is dependent only on the underlying mutation rate and is therefore independent of the population size. In contrast to this expectation, we found a strong negative correlation between µ and Ne
(P < 0.0001; Figure 4). This result is consistent with theoretical predictions (OHTA 1987) and suggests the action of purifying selection in removing slightly deleterious mutations from the population before they increase in frequency.
|
| DISCUSSION |
|---|
|
|
|---|
Population genetics theory states that genetic drift will dominate evolution when the product of the effective population size and selection coefficient is much less than one (Nes << 1). Because the HIV-1 genome is not strictly neutral (s
0), this would be interpreted as evidence for a low Ne. We have shown that this is not the case, so that selection is important in shaping the evolution of HIV-1 envelope sequences. This conclusion does not imply that evolution is deterministic with natural selection the only important component. Genetic drift is likely to play a substantial role, particularly at sites where the selection coefficient is small. In the case of purifying selection, evolution may still be considered stochastic but with an increased bias toward the loss (rather than fixation) of emergent mutations.
Our analyses focused on the frequency distribution of polymorphic sites in the population and tree asymmetry, both estimated from the coalescent genealogy. By simulating a null distribution of genealogies under a stochastic model of genetic drift, we could test directly whether the observed frequency distribution differed significantly from this expectation. The approach adopted incorporates a number of improvements on previous tests that have applied similar goodness-of-fit tests (LEIGH BROWN 1997; SHRINER et al. 2004). Multiple time points were considered simultaneously, improving the power of each test and removing nonindependence from the analysis. Furthermore, the use of posterior predictive simulation to generate the null distribution of genealogies eliminated the confounding influence of demographic change. Further improvements to the sensitivity of our test could be made by estimating demography using synonymous sites only. Assuming that synonymous and nonsynonymous sites evolve independently, this would allow posterior predictive simulation under conditions that more closely reflect the actual demography of the viral population. Although these techniques are currently unavailable within the implemented framework, they could provide a fruitful avenue for future research (see HAHN et al. 2002).
In addition to higher than expected levels of tree asymmetry, we found that the frequency distribution of polymorphic sites (summarized by the genealogical D) was significantly different from the expectation under drift, with a clear excess of low-frequency mutations. This suggests that purifying selection acts to remove slightly deleterious mutations before they increase in frequency, an interpretation consistent with codon-based estimates of the dN/dS ratio, both here and in previous work (NIELSEN and YANG 1998; ROSS and RODRIGO 2002), that show the rate of nonsynonymous substitution is generally low. To further investigate this hypothesis, we plotted the relationship between the rate of substitution µ and effective population size Ne
. We found a strong negative correlation between µ and Ne
, as expected if the majority of mutations are removed from the population and therefore make a decreasing contribution to the overall rate of change as the efficacy of selection increases (OHTA 1987). Because Ne
will be influenced by selection, we considered diversity only at third codon positions, which is largely synonymous. Nevertheless, synonymous diversity may still be reduced through linkage to sites that are under directional positive selection (MAYNARD SMITH and HAIGH 1974). Thus, although it would predict a deficiency in the proportion of low-frequency mutations and therefore be incompatible with our previous observations, an alternative explanation of this relationship is that the strength of positive directional selection differs between patients, being strongest at low levels of associated diversity.
A role for frequency-dependent selection is also plausible, so that the fitness advantage of a particular mutation is lost as it increases in frequency, impeding its eventual fixation. The negative relationship could then be explained if strong frequency-dependent selection imposed by the immune response simultaneously increased diversity and lowered the rate of substitution. This, however, would contradict what is known from empirical studies, which show that antibody escape mutations in envelope reach high frequencies before their selective advantage is lost (WEI et al. 2003). Frequency-dependent selection is therefore unlikely to be strong enough to influence our estimates of the substitution rate. Furthermore, there does not appear to be any correlation between vigor of the antibody response and plasma viral loads (RICHMAN et al. 2003; SCHMITZ et al. 2003), making it unlikely that the strength of frequency-dependent selection will correlate consistently with diversity across patients. On the basis of these arguments we conclude that frequency-dependent selection is an improbable explanation for the negative relation observed between µ and Ne
.
Our results suggest that HIV-1 carries a mutational load of low frequency, deleterious mutations, which represent a substantial contribution to the total genetic diversity of the viral population. A burden of deleterious mutations is a probable consequence of the high rate at which genetic errors are introduced during replication (MANSKY and TEMIN 1995), necessitating a short, compact genome to maintain evolutionary consistency (OVERBAUGH and BANGHAM 2001; HOLMES 2003). Even in the envelope gene, the most highly variable region of the genome, a degree of constraint is required for operational viability. This is not surprising, given its essential role in receptor binding and cell entry (WYATT and SODROSKI 1998) and the need to maintain extensive glycosylation on its surface to avoid antibody neutralization (REITTER et al. 1998).
| APPENDIX |
|---|
|
|
|---|
![]() |
The definitions of
D, uD, an, bn, and cn are reproduced directly from FU and LI (1993).
B1:
Each interior node in a bifurcating genealogy is considered as the root of a subgenealogy. M is the maximum number of interior nodes between each root i and the terminal branches. This is summed over all interior nodes, excluding the root for the entire genealogy:
![]() |
Colless tree imbalance Ic:
For each of n – 1 internal nodes in a bifurcating genealogy, sequences are partitioned into two groups of sizes ri and si, with ri
si. Ic is based on the differences between ri and si summed over all internal nodes:
![]() |
Cherry count Cn:
A cherry is defined simply as a pair of sequences adjacent to a single common ancestor node on the genealogy. The number of cherries gives the cherry count Cn.
Testing for neutrality in serially sampled genetic sequences from a single viral population:
Estimation of genealogy and demographic parameters:
Two demographic models, namely constant population size and exponential growth, were fitted to sequences from each patient using the BEAST program (DRUMMOND and RAMBAUT 2004). This applies a Bayesian coalescent framework using MCMC, with all substitution and demographic parameters estimated from the data and values represented by their marginal posterior probability distributions. All priors were assumed to be uniform on a natural scale. In this way a distribution of genealogies was generated around the posterior expectation, with each genealogy associated with a vector of mutational and demographic model parameters (DRUMMOND et al. 2002). This is referred to in the text as the "empirical" posterior distribution.
Model fit:
The relative fit of each demographic model to the data was assessed using the AIC (AKAIKE 1973). The AIC of a given model is twice its marginal log likelihood plus the number of parameters specified (AIC = 2 lnLk + 2p). The model with the lowest AIC was selected as the best representation of the data.
Simulation of the null distribution:
The posterior distributions of demographic model parameters, namely the product of effective population size and generation length in days (Ne
) (see RODRIGO and FELSENSTEIN 1999) and growth rate r, were then used to simulate new genealogies in a neutral coalescent process. One new genealogy was produced for each genealogy in the empirical posterior distribution. This is known as posterior predictive simulation (RUBIN 1984) and effectively produces a null distribution of genealogies under the coalescent model that can be compared to the empirical distribution in a goodness-of-fit test. This tests for violations of the assumptions made by the coalescent.
Comparison of null and empirical distributions:
To test for departures from the neutral coalescent model we compared the "empirical" posterior distribution and "null" predicted distribution of genealogies in a goodness-of-fit test using descriptive statistics (referred to here as T). The posterior predictive P-value is
![]() | (A1) |
![]() | (A2) |
accounts for the considerable uncertainty inherent in estimates of the true genealogy. Note that the direction of the inequality is dependent on the statistic used. Equations A1 and A2 are appropriate for the genealogical D, B1, and Cn. Selection leads to a reduction in the value of these test statistics (Table 1). Significance is therefore obtained when P
< 0.05. The value for Ic, however, is increased by selection. In this case therefore the inequality is reversed.
The need to correct PT*-values:
By convention, P-values are expected to follow a uniform distribution between 0 and 1, so that the type I error rate is equal to the chosen significance level. However, posterior predictive P-values such as P
are known to be conservative (MENG 1994). Intuitively, this is because the same data are used both to estimate the parameter distributions and to test the goodness-of-fit. Using the uniform distribution as a reference will therefore lead to a conservative test, and it is necessary to obtain a new reference distribution. This can be used to correct our P
-values so that their expectation is indeed uniform.
Simulating the necessary reference distributions:
New reference distributions were obtained by generating sequences using a neutral coalescent simulator. The sampling structure, length of artificial sequences, and mutational and demographic parameters were selected as representative of the patient data. Both a constant population size and exponential growth were assumed. One hundred time-structured data sets were obtained with sample times at 0, 300, 600, and 900 days and 10 sequences at each time. Artificial DNA sequences were 400 bp in length and simulated down the coalescent tree under an HKY + continuous
model of substitution with
= 8.0,
= 0.1, and µ = 4.0 x 10–5/site/day. Insertions and deletions were not simulated. For a constant population size, Ne
was set to 1500. For data sets simulated under exponential growth, Ne
was 5000 and the growth rate 2.0 x 10–3.
For each artificial sequence data set produced, a P
-value was estimated. The distribution of these values is shown in Figure 2, which reveals that the expected P
is indeed not uniform, with a paucity of extreme values.
Obtaining corrected PT*-values:
If N is the number of data points in the simulated reference distribution and n the number of data points in the reference distribution
P
, then
![]() | (A3) |
Note that each uncorrected P
was obtained assuming a particular demographic model and corrected using a reference distribution generated under the same demographic assumption.
This procedure allowed us to apply a one-tailed 5% significance test to the corrected P
-value generated for each test with the expectation of a 5% type I error rate.
Testing for neutrality across multiple-sequence data sets:
Because corrected P
-values follow a uniform distribution between 0 and 1, they can be combined across patients, even if obtained under different demographic scenarios. Instead of relying on individual tests of neutrality, this allows for a more powerful inference.
Combining values over all 28 patients, let
![]() |
is the inverse cumulative distribution function of the standard normal distribution N(0, 1). Under the null model, C follows a
-distribution with 28 d.f., so that a one-tailed
-test gives the combined corrected P
-value. This effectively tests whether the empirical distribution of corrected P
-values differs significantly from the uniform expectation (Figure 3 and Table 4). | ACKNOWLEDGEMENTS |
|---|
|
|
|---|
| FOOTNOTES |
|---|
2 Present address: Department of Computer Science, University of Auckland, Private Bag 92019, Auckland, New Zealand. ![]()
| LITERATURE CITED |
|---|
|
|
|---|
ABRAMS, E. J., P. B. MATHESON, P. A. THOMAS, D. M. THEA, K. KRASINSKI et al., 1995 Neonatal predictors of infection status and early death among 332 infants at risk of HIV-1 infection monitored prospectively from birth. New York City Perinatal HIV Transmission Collaborative Study Group. Pediatrics 96: 451–458.
ACHAZ, J., S. PALMER, M. KEARNEY, F. MALDARELLI, J. W. MELLORS et al., 2004 A robust measure of HIV-1 population turnover within chronically infected individuals. Mol. Biol. Evol. 21: 1902–1912.
AKAIKE, H., 1973 Information theory as an extension of the maximum likelihood principle, pp. 267–281 in Second International Symposium on Information Theory, edited by B. N. PETROV and F. CSAKI. Akademiai Kiado, Budapest.
ANISIMOVA, M., R. NIELSEN and Z. YANG, 2003 Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites. Genetics 164: 1229–1236.
BALFE, P., P. SIMMONDS, C. A. LUDLAM, J. O. BISHOP and A. J. LEIGH BROWN, 1990 Concurrent evolution of human immunodeficiency virus type 1 in patients infected from the same source: rate of sequence change and low frequency of inactivating mutations. J. Virol. 64: 6221–6233.
BOLLBACK, J. P., 2002 Bayesian model choice and adequacy in phylogenetics. Mol. Biol. Evol. 19: 1171–1180.
BURNS, D. P., C. COLLIGNON and R. C. DESROSIERS, 1993 Simian immunodeficiency virus mutants resistant to serum neutralization arise during persistent infection of rhesus monkeys. J. Virol. 67: 4104–4113.
CHAVDA, S. C., P. GRIFFIN, Z. HAN-LIU, B. KEYS, M. A. VEKONY et al., 1994 Molecular determinants of the V3 loop of human immunodeficiency virus type 1 glycoprotein gp120 responsible for controlling cell tropism. J. Gen. Virol. 75(11): 3249–3253.
COLLESS, D. H., 1982 Phylogenetics: the theory and practice of phylogenetic systematics. Syst. Zool. 31: 100–104.[CrossRef]
CONDRA, J. H., D. J. HOLDER, W. A. SCHLEIF, O. M. BLAHY, R. M. DANOVICH et al., 1996 Genetic correlates of in vivo viral resistance to indinavir, a human immunodeficiency virus type 1 protease inhibitor. J. Virol. 70: 8270–8276.[Abstract]
DE JONG, M. D., J. VEENSTRA, N. I. STILIANAKIS, R. SCHUURMAN, J. M. LANGE et al., 1996 Host-parasite dynamics and outgrowth of virus containing a single K70R amino acid change in reverse transcriptase are responsible for the loss of human immunodeficiency virus type 1 RNA load suppression by zidovudine. Proc. Natl. Acad. Sci. USA 93: 5501–5506.
DRUMMOND, A., and A. RAMBAUT, 2004 BEAST. Bayesian Evolutionary Analysis Sampling Trees v1.1. (http://evolve.zoo.ox.ac.uk/software.html).
DRUMMOND, A. J., G. K. NICHOLLS, A. G. RODRIGO and W. SOLOMON, 2002 Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics 161: 1307–1320.
EDWARDS, C. T. T., E. C. HOLMES, D. J. WILSON, R. P. VISCIDI, E. J. ABRAMS et al., 2006 Population genetic estimation of the loss of genetic diversity during horizontal transmission of HIV-1. BMC Evol. Biol. 6: 28.[CrossRef][Medline]
FITZGIBBON, J. E., A. E. FARNHAM, S. J. SPERBER, H. KIM and D. T. DUBIN, 1993 Human immunodeficiency virus type 1 pol gene mutations in an AIDS patient treated with multiple antiretroviral drugs. J. Virol. 67: 7271–7275.
FU, Y. X., and W. H. LI, 1993 Statistical tests of neutrality of mutations. Genetics 133: 693–709.[Abstract]
GOLDMAN, N., and Z. YANG, 1994 A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11: 725–736.[Abstract]
GRENFELL, B. T., O. G. PYBUS, J. R. GOG, J. L. WOOD, J. M. DALY et al., 2004 Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303: 327–332.
GROENINK, M., A. C. ANDEWEG, R. A. FOUCHIER, S. BROERSEN, R. C. VAN DER JAGT et al., 1992 Phenotype-associated env gene variation among eight related human immunodeficiency virus type 1 clones: evidence for in vivo recombination and determinants of cytotropism outside the V3 domain. J. Virol. 66: 6175–6180.
GUINDON, S., A. G. RODRIGO, K. A. DYER and J. P. HUELSENBECK, 2004 Modeling the site-specific variation of selection patterns along lineages. Proc. Natl. Acad. Sci. USA 101: 12957–12962.
HAASE, A. T., 1999 Population biology of HIV-1 infection: viral and CD4+ T cell demographics and dynamics in lymphatic tissues. Annu. Rev. Immunol. 17: 625–656.[CrossRef][Medline]
HAHN, M. W., M. D. RAUSHER and C. W. CUNNINGHAM, 2002 Distinguishing between selection and population expansion in an experimental lineage of bacteriophage T7. Genetics 161: 11–20.
HASEGAWA, M., H. KISHINO and T. YANO, 1985 Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22: 160–174.[CrossRef][Medline]
HO, D. D., A. U. NEUMANN, A. S. PERELSON, W. CHEN, J. M. LEONARD et al., 1995 Rapid turnover of plasma virions and CD4 lymphocytes in HIV-1 infection. Nature 373: 123–126.[CrossRef][Medline]
HOLMES, E. C., 2003 Error thresholds and the constraints to RNA virus evolution. Trends Microbiol. 11: 543–546.[CrossRef][Medline]
JUNG, A., R. MAIER, J. P. VARTANIAN, G. BOCHAROV, V. JUNG et al., 2002 Recombination: multiply infected spleen cells in HIV patients. Nature 418: 144.[CrossRef][Medline]
KIMURA, M., 1962 On the probability of fixation of mutant genes in a population. Genetics 47: 713–719.
KINGMAN, M., 1982a The coalescent. Stoch. Proc. Appl. 13: 235–248.[CrossRef]
KINGMAN, M., 1982b On the genealogy of large populations. J. Appl. Probab. 19A: 27–43.[CrossRef]
KIRKPATRICK, M., and M. SLATKIN, 1993 Searching for evolutionary patterns in the shape of a phylogenetic tree. Evolution 47: 1171–1181.[CrossRef]
LAMERS, S. L., J. W. SLEASMAN, J. X. SHE, K. A. BARRIE, S. M. POMEROY et al., 1993 Independent variation and positive selection in env V1 and V2 domains within maternal-infant strains of human immunodeficiency virus type 1 in vivo. J. Virol. 67: 3951–3960.
LEIGH BROWN, A. J., 1997 Analysis of HIV-1 env gene sequences reveals evidence for a low effective number in the viral population. Proc. Natl. Acad. Sci. USA 94: 1862–1865.
MANSKY, L. M., and H. M. TEMIN, 1995 Lower in vivo mutation rate of human immunodeficiency virus type 1 than that predicted from the fidelity of purified reverse transcriptase. J. Virol. 69: 5087–5094.[Abstract]
MAYNARD SMITH, J., and J. HAIGH, 1974 The hitch-hiking effect of a favourable gene. Genet. Res. 23: 23–35.[Medline]
MCKENZIE, A., and M. STEEL, 2000 Distributions of cherries for two models of trees. Math. Biosci. 164: 81–92.[CrossRef][Medline]
MCVEAN, G., P. AWADALLA and P. FEARNHEAD, 2002 A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160: 1231–1241.
MENG, X.-L., 1994 Posterior predictive p-values. Ann. Stat. 22: 1142–1160.[CrossRef]
MOORE, J. P., Y. CAO, D. D. HO and R. A. KOUP, 1994 Development of the anti-gp120 antibody response during seroconversion to human immunodeficiency virus type 1. J. Virol. 68: 5142–5155.
NIELSEN, R., and D. M. WEINREICH, 1999 The age of nonsynonymous and synonymous mutations in animal mtDNA and implications for the mildly deleterious theory. Genetics 153: 497–506.
NIELSEN, R., and Z. YANG, 1998 Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148: 929–936.
OHTA, T., 1987 Very slightly deleterious mutations and the molecular clock. J. Mol. Evol. 26: 1–6.
OVERBAUGH, J., and C. R. BANGHAM, 2001 Selection forces and constraints on retroviral sequence variation. Science 292: 1106–1109.
PRICE, D. A., P. J. GOULDER, P. KLENERMAN, A. K. SEWELL, P. J. EASTERBROOK et al., 1997 Positive selection of HIV-1 cytotoxic T lymphocyte escape variants during primary infection. Proc. Natl. Acad. Sci. USA 94: 1890–1895.
RAMBAUT, A., and A. DRUMMOND, 2004 Coalescent Generator v1.0. (http://evolve.zoo.ox.ac.uk/software.html).
REITTER, J. N., R. E. MEANS and R. C. DESROSIERS, 1998 A role for carbohydrates in immune evasion in AIDS. Nat. Med. 4: 679–684.[CrossRef][Medline]
RICHMAN, D. D., T. WRIN, S. J. LITTLE and C. J. PETROPOULOS, 2003 Rapid evolution of the neutralizing antibody response to HIV type 1 infection. Proc. Natl. Acad. Sci. USA 100: 4144–4149.
RODRIGO, A. G., and J. FELSENSTEIN, 1999 Coalescent approaches to HIV population genetics, pp. 233–272 in The Evolution of HIV, edited by K. A. CRANDALL. John Hopkins University Press, Baltimore.
ROSS, H. A., and A. G. RODRIGO, 2002 Immune-mediated positive selection drives human immunodeficiency virus type 1 molecular variation and predicts disease duration. J. Virol. 76: 11715–11720.
ROUZINE, I. M., A. RODRIGO and J. M. COFFIN, 2001 Transition between stochastic evolution and deterministic evolution in the presence of selection: general theory and application to virology. Microbiol. Mol. Biol. Rev. 65: 151–185.
RUBIN, D., 1984 Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann. Stat. 12: 1151–1172.[CrossRef]
SCHIERUP, M. H., and J. HEIN, 2000 Consequences of recombination on traditional phylogenetic analysis. Genetics 156: 879–891.
SCHMITZ, J. E., M. J. KURODA, S. SANTRA, M. A. SIMON, M. A. LIFTON et al., 2003 Effect of humoral immune responses on controlling viremia during primary infection of rhesus monkeys with simian immunodeficiency virus. J. Virol. 77: 2165–2173.
SEO, T. K., J. L. THORNE, M. HASEGAWA and H. KISHINO, 2002 Estimation of effective population size of HIV-1 within a host: a pseudomaximum-