Characterizing the molecular basis of adaptation is one of the most important goals in modern evolutionary genetics. Here, we report a full-genome sequence analysis of 21 independent populations of vesicular stomatitis ribovirus evolved on the same cell type but under different demographic regimes. Each demographic regime differed in the effective viral population size. Evolutionary convergences are widespread both at synonymous and nonsynonymous replacements as well as in an intergenic region. We also found evidence for epistasis among sites of the same and different loci. We explain convergences as the consequence of four factors: (1) environmental homogeneity that supposes an identical challenge for each population, (2) structural constraints within the genome, (3) epistatic interactions among sites that create the observed pattern of covariation, and (4) the phenomenon of clonal interference among competing genotypes carrying different beneficial mutations. Using these convergences, we have been able to estimate the fitness contribution of the identified mutations and epistatic groups. Keeping in mind statistical uncertainties, these estimates suggest that along with several beneficial mutations of major effect, many other mutations got fixed as part of a group of epistatic mutations.
ONE of the main tasks in evolutionary biology is to measure the forces operating in populations, not only by statistical inference, but by bringing together evidences from population genetics and functional biology (Lewontin 2000). To show which molecular changes are responsible for differences in fitness, we need an appropriate experimental setup, i.e., an experimental population of sufficient size and appropriate time periods where small selection intensities can be detected. RNA viruses meet both conditions. They reach population sizes as high as 1010 after just a few hours of replication in fully susceptible cell cultures. Their evolution can be followed by sampling evolving populations every day or even more often if desired.
Here, we explore the molecular basis of viral adaptation to cell culture by characterizing the molecular changes that occurred during the experimental evolution of two competing populations of the ribovirus vesicular stomatits virus (VSV; Miralleset al. 1999). In previous work, we confirmed the role played by clonal interference in the evolution of VSV (Miralles et al. 1999, 2000). Clonal interference is a phenomenon characteristic of rapidly evolving asexual populations (Gerrish and Lenski 1998). As a consequence of high mutation rates and large population sizes, beneficial mutations arise in different coexisting lineages. The lack of recombination precludes the possibility that two beneficial mutations combine in the same genotype and, therefore, coexisting beneficial mutations compete with each other to become fixed in the population. Consequently, only the best possible candidate will become fixed and all alternative beneficial mutations of smaller effect will be outcompeted. The larger the population size or the mutation rate, the more beneficial mutations may coexist and, thus, the more intense clonal interference will be. In our previous work, we tested two specific predictions of this model: (1) A positive correlation exists between the magnitude of the fixed beneficial effect and the population size (Miralleset al. 1999) and (2) the rate of adaptation is not linearly proportional to the availability of beneficial mutations but a diminishing-returns effect exists (Miralles et al. 1999, 2000).
Our results are interesting in two ways. First, they extend previous observations, on other viral systems, about the extent of convergent evolution under constant environmental conditions. Convergent evolution indicates the existence of constraints in RNA virus evolution. Second, we show how epistasis among nucleotide sites modulates genomic divergence of experimental viral populations. We made an attempt to assign fitness effects to each observed substitution. As a result, we have been able to identify five statistically significant beneficial changes. Four of these changes arose at the protein involved in recognition of the cellular receptor. The fifth positively selected amino acid belongs to the major component of the RNA polymerase. Despite the evidence for multiple mutations occurring on each lineage, our results are still consistent with one of the major predictions of the clonal interference model: the positive correlation between population size and the magnitude of fixed beneficial effects.
Epistasis is a key factor in evolutionary genetics. The existence and abundance of epistasis are important for many different evolutionary theories, including those seeking an explanation for the origin and maintenance of sexual reproduction (Kondrashov 1985), diploidy (Kondrashov and Crow 1991), dominance (Peck and Waxman 2000), reproductive isolation (Coyne 1992), segmentation of viral genomes (Chao 1988), and even the shifting-balance process (Wright 1931). Therefore, one of the most interesting questions for evolutionary biologists is how common is epistasis. However, the role of epistasis on the evolutionary dynamics of RNA viruses has been poorly explored from an experimental stand-point, despite the fact that their compacted genome seems prone to strong interactions among gene products as well as at the level of the RNA secondary structure. So far, only the extent of synergistic epistasis among deleterious mutations accumulated after consecutive bottlenecks of foot-and-mouth disease virus has been explored (Elena 1999).
MATERIALS AND METHODS
Overview of the evolution experiments and fitness determinations: A detailed description of the experimental methods is provided in Miralles et al. (1999). In brief, we mixed, in equal proportions, two variants of VSV that differed only in the presence of a neutral marker, implying that they should coexist in a stable state until a successful beneficial mutation appeared in one of them (Miralleset al. 1999). These two clones were a surrogate wild type and an I1 mAb-resistant mutant (MARM C). Seven different evolutionary regimes were designed, each one differing from the others in effective population size, Ne, ranging between ∼102 and ∼108 viral particles (see Miralleset al. 1999 for an operative definition of Ne). Each regime was independently replicated five times, for a total of 35 experimental lines. Each mixture was kept under the appropriate batch transfer conditions until one of the two variants became fixed. Our criterion for considering that one allele had fixed was to obtain, during three consecutive days, no evidence for the presence of the opposite genetic marker.
Then, the winner variant was placed in a head-to-head competition with its nonevolved counterpart to estimate the fitness effect (W) associated with the beneficial mutation that became fixed. Competition experiments are described elsewhere (Hollandet al. 1991; Miralles et al. 1999, 2000). In brief, each evolved population was mixed, in three independent test tubes, with a known amount of the ancestral strain bearing the opposite phenotypic neutral marker, and the initial ratio for each replicate mixture, R0, was determined by performing plaque assays with and without I1 mAb in the agarose overlay medium. Each competition mixture was transferred serially during a sufficient number of passages to obtain good estimates of relative fitness as follows. At each transfer, the resulting virus was diluted and used to initiate the next competition passage by infection of a fresh host cell monolayer. The ratio of each genotype was determined by plating with and without I1 mAb in the overlay agarose medium at different transfers. These determinations gave the proportion of each competitor at time t, Rt. Selection rate constants (S, with per-day units) were estimated from the linear regression ln Rt = ln R0 + St. Miralles et al. (1999) reported fitness as selection rate constants, whereas the values shown here (fourth column in Figure 1) are dimensionless fitness. To transform a selection rate constant into a true fitness, it is necessary to take into account the dilution factor employed at each competition day (D = 104) and how many days of competition occurred (t), and then apply the expression W = 1 + S/ln D t (Lenskiet al. 1994).
Twenty-one of the evolved mixed populations were chosen for the whole-genome consensus sequencing analysis reported in this study.
RNA sequences determination: Three end-point-evolved populations from each of the seven Ne were randomly chosen for sequencing, as well as the two ancestral clones. Full-genome (11,161 bp) sequencing was done according to standard techniques (Brachoet al. 1998). For each evolved population, we determined the consensus sequence of the whole population. RNA was extracted by a guanidinium thiocyanate-phenol-chloroform method, and it was reverse-transcribed and amplified by reverse transcription (RT)-PCR with specific primers. PCR products were directly sequenced in an ABI 310 automated sequencer following manufacturer’s indications. Sequences were processed and analyzed with the STADEN software package. We used 14 pairs of primers to amplify the whole genome (available upon request). The resulting fragments were overlapping, facilitating the task of assembling fragments.
The sequencing of both ancestral clones (MARM C and wild type) confirmed that they differed only in the Asp259 → Ala substitution in the G surface protein that confers the neutral MARM phenotype.
Minimizing the risk of contaminations: To minimize the possibility of cross-contamination among evolving populations, we took four precautions:
The five replicates of the evolution experiments were done in four independent, sequential in time, blocks (1, 2, 3, and 4 + 5; second column in Figure 1) by Miralles et al. (1999). Each block contained one replicate per each Ne. By doing this, we protected the experiment against among-replicates cross-contamination, since blocks were separated in time (with the exception of only replicates 4 and 5). Still, common changes are present among replicates.
RNA extractions were done in six independent blocks (third column in Figure 1), and each block was separated in time as well. During RNA extraction, control tubes were added. Each mock tube was subjected to exactly the same steps as the true samples. We never obtained RNA from these tubes.
The RT-PCR and sequencing reactions were done in random blocks, protecting the results from undesirable effects associated with specific blocks.
Regarding the probability of cross-contamination during the RT-PCR procedure, each time we ran a RT-PCR, we also ran control tubes containing all the reaction mixture but without the RNA template. DNA amplification was never obtained within these control tubes.
Ruling out the possibility of cross-contamination: Our conclusion of evolutionary convergence is extremely sensitive to contaminations. Contaminations can create identical changes in otherwise unrelated lineages and thus give the false impression of convergence. Therefore, previous to making any inference from our data regarding evolutionary convergences, it is compulsory to test whether our results can be explained simply by cross-contamination among populations. A possible way to test whether the observed sequences are the result of cross-contamination among populations evolved at the same time or among RNA extractions done simultaneously is to compute a phylogenetic tree from the sequences obtained for each population and compare the observed clades with those expected if contaminations occurred at the different critical stages of our experimental protocol. Figure 2 shows the estimated maximum-likelihood unrooted tree topology. This tree was obtained using DNAML program from PHYLIP v3.6α package (http://evolution.genetics.washington.edu/phylip) and assuming Kimura’s two-parameter nucleotide substitution model. Bootstrap values were obtained from a set of 1000 randomized sequences. The Akaike information criterion (AIC; Akaike 1974) for this tree topology is 423.04. As the tree shows, sequences do not group according to RNA extraction blocks (third column in Figure 1). Under the hypothesis of cross-contamination, during the RNA purification, among lineages treated simultaneously, we should expect a tree with six clades, each containing sequences of populations treated at the same time. The AIC for such a tree (actually, the one showing the smallest value) was 587.43, larger than the value obtained for the inferred tree. This result rules out the possibility of cross-contamination during the RNA purification. [Note that the model with minimum AIC is considered to be the most appropriate one (Akaike 1974).] In general, with the exception of three pairs of identical sequences (sequences in lines 4 and 7, 16 and 20, and 17 and 21 in Figure 1), sequences do not group according to blocks of evolution (second column in Figure 1). According to the hypothesis of cross-contamination among populations evolved at the same time, we should expect a tree showing four different clades, one for each evolution block. The AIC for such a tree topology (again, the one showing the smallest value) was 571.46, larger than that observed for the inferred tree topology. [The Shimodaira and Hasewaga (1999) test generated exactly the same conclusion: None of the trees expected under both cross-contamination hypotheses was significantly better than the inferred tree.]
Most of the bootstrap P values shown in Figure 2 are too low for identifying significant clusters. This finding gives further support to the view of an independent evolutionary history for each population; in other words, populations radiated from a common ancestor. In conclusion, results are better explained by evolutionary convergence on independently evolved lineages than by cross-contamination among lineages at the two more critical steps of our experimental protocol.
Nevertheless, the existence of three pairs of identical sequences that were evolved in the same blocks calls for extra caution when interpreting the results. These three pairs could be the result of either evolutionary convergences, as we believe, or rare cross-contaminations during the evolutionary part of our experiment.
Identification of nucleotide substitutions: Figure 1 shows the fitness values estimated for each of the 21 chosen populations (Miralleset al. 1999) as well as the 25 molecular changes observed. Red blocks indicate nonsynonymous changes, green blocks synonymous changes, and yellow blocks changes in intergenic regions. Twenty-five different positions showed differences among evolved populations and the ancestral genotypes. The substitutions observed in each population were always fixed (at the resolution level of the sequencing chromatograms). Of those, 15 were nonsynonymous changes, 6 were synonymous changes, and only 3 were changes at intergenic regions. Regarding specific genes, the nucleocapsid protein (N) does not show any difference, a fact that could be telling us about the importance of this gene. The phosphoprotein (P), which is a minor component of the RNA polymerase, shows four variable sites among populations; two were synonymous and two were nonsynonymous. The matrix protein (M), which covers the inner side of the virion, also shows four variable sites among populations; three of them are nonsynonymous changes. The coat protein (G) is the major antigenic component of the capsid and responsible for interactions with the cell receptors. A priori, we should expect this gene to be one of the main targets of selection during the adaptation to cell culture conditions. Naked eye inspection of the data supports this expectation, since 10 nucleotide positions were variable among populations, being 7 nonsynonymous changes. Finally, the major component of the RNA polymerase protein (L) shows four polymorphic sites among populations, three of them nonsynonymous.
Heterogeneity among sites and among genes in the number of mutations: Of course, the observed pattern of polymorphic sites could, in principle, be produced by chance. To rule out this possibility, we computed the dispersion coefficient, δ, of the number of times that a site changed across lineages (Sokal and Rohlf 1995). Under the null hypothesis of changes being randomly distributed across the entire 11,161 bp of the genome, the number of times that a site changes follows a Poisson distribution and hence δ= 1. Alternatively, if certain sites changed by chance more often than expected, then δ> 1. Our estimate of δ= 5.16 rejects the null hypothesis of observing this pattern by chance (χ2 = 57,566.97, 11,160 d.f., one-tailed P < 0.0001). (This conclusion holds if instead of the full genome we compute δ for only the 25 variable sites: δ= 2.34, χ2 = 56.16, 24 d.f., one-tailed P = 0.0001.) Hence, we conclude that some sites mutated more often than others along the entire genome.
Now, we were interested in testing whether different genes accumulated different numbers of mutations or, alternatively, whether mutations were equally scattered among genes. A Scheirer-Ray-Hare nonparametric two-way ANOVA (Sokal and Rohlf 1995) showed a significant among-genes heterogeneity in the number of mutations accumulated per nucleotide site (Table 1). A post hoc Tukey’s test (Sokal and Rohlf 1995) indicates that this difference was due to the ∼10-fold lower number of mutations per nucleotide site observed for the L gene, the major RNA polymerase component. This observation suggests that the L gene is subjected to more evolutionary constraints than any other component in the genome (with the exception of the N gene, at which we did not find any change at all). On the other side, on average, M, P, and G genes had the same number of mutations per nucleotide site. This tells us about the naïveté of our above expectation of the G gene fixing more changes than any other component of the genome as a result of being an important target of selection. However, as we see in the next two sections, changes in the G protein were selectively more important than changes in other parts of the genome. There is no effect of the Ne on the number of mutations fixed per gene. The interaction of gene by Ne term was not significant either.
Covariation among sites: Another interesting feature of Figure 1 is the existence of an apparent covariation among sites. For example, the substitution of Ser195 → Pro at the P gene in almost every case goes together with the substitution Gly165 → Asp at the M gene. Similarly, substitutions Pro120 → Gln and C2750 → U at the M gene and A10162 → G at the L gene always occur together. Covariation among sites within and among genes arises as a result of epistatic gene interactions and selection acting on the groups of sites rather than on individual sites. To assess the extent of this covariation among sites, as well as to identify pairs of significantly interacting sites, we computed the matrix of covariances among sites. Figure 3 shows, in a schematic way, the 16 significant cases (after sequential Bonferroni’s correction for multiple tests of the same hypothesis). Significant covariance values ranged between 0.0476 (the smallest bubbles in Figure 3) and 0.1857 (the largest bubble in Figure 3). These 16 significant cases can be grouped into the following six different covariation groups.
Group I: U1820 (Ile142 → Thr) in the P gene covaried with site A3675 (Met200 → Val) in the G gene. These two sites are present in only one population (labeled 106 3 4 in Figure 1).
Group II: U1978 (Ser195 → Pro in the P gene) and G2743 (Gly165 → Asp in the M gene) were present together in several lineages (Figure 1). It is worth noting that all synonymous changes always covaried with nonsynonymous changes (with the only exception of unique change U3524 → C in the G gene, which, indeed, was not neutral; see below).
Group III: For example, synonymous sites G1713 and A4394 covaried with nonsynonymous sites A3802 (Asp242 → Arg in the G protein) and U5202 (Leu157 → Trp in the L protein) in the same population (labeled 106 3 4 in Figure 1.
Group IV: A synonymous change at site C2151 was present in several populations significantly associated with a nonsynonymous change at site G3846 (Asp257 → Asn of the G protein).
Group V: Similarly, nonsynonymous changes at sites C2750 and A10162 significantly covaried with a nonsynonymous change at position C2608 (Pro120 → Gln) in several populations.
Group VI: Finally, a synonymous substitution at site U4295 was associated with a nonsynonymous substitution in site U3848 (Asp257 → Asn of the G protein) in a single population (labeled 102 2 1 in Figure 1).
Could we assign fitness values to specific mutations or to covariation groups? We took advantage of these convergences to estimate specific fitness effects for each site or covariation group (i.e., epistatic sites). The general way to model epistatic fitness effects is to assume that the fitness of a genotype that carries k epistatic mutations that interact together as a group is , where the si are the multiplicative effects of each mutation and e1,2,...,k is the increase in fitness gained by the overall interaction of all mutations in the group (Phillipset al. 2000). Hence, to describe the fitness effect of each epistatic group of mutations we need to estimate as many parameters from the dataset as mutations involved in the group [si (i = 1, 2,..., k)] plus an overall interaction term (e1,2,...,k). A way of reducing the number of parameters to be estimated is to assign a single fitness value to each epistatic group as a whole, regardless of the specific contribution of each mutation. This value will, obviously, be composed of the individual contribution of each mutation plus the added value.
By assuming an infinite-sites model, the expected fitness of a given clone, E(Wi), will be determined by multiplying the fitness effects, 1 + sj, associated with each observed covariation group (six in our dataset, involving 15 nucleotide sites) and with each nonepistatic, purely multiplicative, site (25 - 15 = 10). Hence, . These 16 sj parameters can then be estimated, from the 21 populations, by means of a quadratic sequential programming procedure that minimizes the differences between the observed and expected fitnesses (CNLR procedure in the SPSS package). The advantage of quadratic sequential programming compared with linear regression is that the former allows us to impose biologically meaningful restrictions to the parameters. We used the following two restrictions: (1) When a mutation appears alone, it necessarily has to be beneficial (i.e., sj > 0); (2) if a mutation belonging to an epistatic group is found alone in any population, then it must be beneficial by itself (sj > 0) and its numerical value must be necessarily smaller than the value for the whole epistatic group. Furthermore, the Miralles et al. (1999) study dealt with beneficial mutations and fitness improvements. Deleterious mutations would not have an effect under their experimental setup since they were washed out by strong purifying selection. Hence, we implicitly assumed that all the nonvariable sites detected in our study (11,161 - 25 = 11,136) were deleterious or (quasi-neutral) under our experimental conditions.
Table 2 shows the estimated fitness contribution of each observed change (either point mutations or covariation groups). We report in Table 2 only those sites whose fitness contributions converged to nonzero estimates (not necessarily significant) in the quadratic sequential programming procedure. The correlation between predicted and observed fitness values can be used as a way to assess the goodness of fit of the model. Here, the match between observed and predicted fitnesses was highly significant (r = 0.8966, 19 d.f., P < 0.0001). All nonsignificant cases must be considered either as neutral or, in the worse case, as small-effect mutations that are indistinguishable from zero with the statistical power associated with our experimental sample size.
None of the four changes found at the P protein were beneficial by themselves. The synonymous substitution C2151 → U, along with the replacement Asp257 → Asn at the G protein, had an estimated fitness effect of 4.9%, although nonsignificant. The only substitution at intergenic regions with a fitness effect was C2990 → U at the M-G segment. The small effect associated with this substitution, 0.3%, was not significant either. Whether or not this replacement has a real fitness effect, the fact is that nucleotide substitutions in intergenic regions had been previously reported to be of special significance for the transcription levels of downstream genes in VSV (Smallwood and Moyer 1993). One of the four changes found in the M protein was beneficial, with a fitness effect of ∼2.3%. This was a nonsynonymous change, which implies replacing a positive charge (Lys) by a polar radical (Thr). However, it is worthwhile to note that this change was not significant either. The small effect of this change is probably associated with the fact that it was present in only one clone. Five significant changes were detected at the G protein, one of them linked with a change in the P protein (see above). Surprisingly, one of these beneficial substitutions was the synonymous change at position 3524. Furthermore, this change had the largest observed effect on fitness among all the beneficial mutations (∼6.7%). The change Leu211 → Pro (∼4.8% effect) implies a substantial structural change in the protein, since Pro is associated with the induction of β-turns. The change Arg345 → Lys a priori seems conservative, since it maintains the positive charge at the radical. However, it has a significant effect of ∼5.6%. Finally, the change Thr368 → Ala (∼4.3% effect) replaces a polar radical by a short nonpolar one. As we speculated above, the G protein is important for adaptation to a cellular host, since it is responsible for interacting with the cellular receptor (phosphatidylserine). In the other side, only one nonsynonymous change was identified at the L protein. The change Ile1516 → Val at this important gene had the second largest effect detected (∼5.8%), although it is difficult to find a biochemical explanation of this effect, since it implies a reduction of only one carbon in the length of the radical. Nonetheless, changes in the major component of the RNA polymerase should be of importance in our experimental system, since our daily transfer protocol selects for rapidly replicating viruses and improvement in replication efficiency must arise by changes in the L protein.
Conservatively, we can say that changes in 5 of the 11,161 nucleotide sites of the VSV genome were selectively important in our experimental conditions.
Our experiment dealt with the existence of evolutionary convergences. Evolutionary convergences constitute a very slippery topic, since a result of convergent evolution can always be seen as a cross-contamination by those critical of the existence of convergence. The only serious way to address evolutionary convergences is to (1) design and run experiments in such a way that physical or temporal coexistence of evolving lineages is minimized and (2) test whether the results can be explained by potential contaminations at different experimental steps. With our experiments, we took all possible precautions to minimize the risk of cross-contaminations and, in fact, a detailed phylogenetic analysis of our results supports the view that our results are better explained by evolutionary convergences than by a general contamination at different steps. However, we found three pairs of identical sequences that were, unfortunately, obtained from populations evolved in the same blocks. For these three particular cases, our phylogenetic analysis could not fully rule out the possibility of a cross-contamination. The problem in these cases was that we did not have any basis to decide which member of each pair must be eliminated (the contaminated) and which one retained (the contaminator) in the analysis. Conservatively, we removed all six sequences and redid all the above analysis. By doing so, a few covariances in Figure 3 became not significant (but none that were nonsignificant changed) and the CNLR method provided slightly different estimates for fitness effects (interestingly, the four significant cases in Table 2 retained their significance). Hence, despite absolute numerical values, our major conclusions of molecular convergences and epistasis are solidly supported by our dataset.
One of the most amazing features illustrated in Figure 1 is the large amount of evolutionary convergences observed among independent lineages. Twelve of the variable sites were shared by different lineages. More surprisingly, convergences also occurred within synonymous sites and intergenic regions. Evolutionary convergences during the adaptation of viral lineages under identical artificial environmental conditions have been described previously (Bullet al. 1997; Wichmanet al. 1999; Fareset al. 2001). However, this phenomenon is observed not only in the laboratory. It is also a relatively widespread observation among human immunodeficiency virus (HIV)-1 clones isolated from patients treated with different antiviral drugs; parallel changes are frequent, often following a common order of appearance (Larderet al. 1991; Boucheret al. 1992; Kellamet al. 1994; Condraet al. 1996; Martínez-Picadoet al. 2000). Subsequent substitutions may confer increasing levels of drug resistance or, alternatively, may compensate for deleterious pleiotropic effects of earlier mutations (Mollaet al. 1996; Martínez-Picadoet al. 1999; Nijhuiset al. 1999). Also, molecular convergences have been observed between chimeric simian-human immunodeficiency viruses (strain SHIV-vpu+) isolated from pig-tailed macaques, rhesus monkeys, and humans after either chronic infections or rapid virus passage (Hofmann-Lehmannet al. 2002). Convergent evolution at the molecular level is not controversial as long as it can be reconciled with the neutralist and the selectionist theories. The neutral theory suggests that convergences are simply accidents, whereas within the framework of selectionism, there are two qualifications for convergences. The first explanation considers convergences as being adaptive and the result of organisms facing the same environment (as in the case of our experiments) with a few alternative pathways of adaptation (as expected for compacted genomes). Second, keeping in mind the model of clonal interference, beneficial mutations have to become fixed in an ordered way (Gerrish and Lenski 1998), with the best possible candidate fixing first, and then the second best candidate, and so on. This implies that, given a large enough population size to make clonal interference an important evolutionary factor, we should always expect the same mutations to be fixed.
The above argument is valid for nonsynonymous changes but an alternative explanation must be found for synonymous changes and for changes in the intergenic regions. Genomic RNA is involved in many RNA-RNA and RNA-protein interactions that affect viral replication. This is obvious for noncoding, regulatory regions (Stillman and Whitt 1997, 1998), but there is increasing evidence that capsid-coding regions in picornaviruses may also have an effect on viral replication (McKnight and Lemon 1998; Fareset al. 2001). Therefore, the RNA itself (apart from its protein-coding capacity) may contribute to the viral phenotype, and fitness may also be affected by synonymous replacements. Evidence for selection on synonymous sites has been inferred also in mammals (Eyre-Walker 1999), as a consequence of selection acting upon the base composition of isochores and large sections of junk DNA.
For the sake of illustration, it would be interesting to compare the number of selectively important sites in the VSV genome with those estimated for other genomes. For example, Fay et al. (2001) reported that, in humans, the vast majority (80%) of amino acidic changes are deleterious to some extent and only a minor fraction are neutral. Among these deleterious amino acidic mutations, at least 20% are slightly deleterious. Here, we found that 15 amino acid sites changed, with only 5 being significantly advantageous. At this point, we can only speculate about the selective role of all the amino acid sites shown to be invariable in our study. The total number of amino acids in the five genes of VSV is 3536. Assuming that changes in any of the 3536 - 15 = 3521 invariable amino acids would be deleterious (and thus washed out by purifying selection during our evolution experiment), then the fraction of amino acid replacements that are potentially harmful would be 3521/3536 ≡ 99.58%; the fraction of neutral sites would be 10/3536 ≡ 0.28%, whereas only 5/3536 ≡ 0.14% would be beneficial. Despite the differences between humans and VSV in genome size and organization and in the nature of the nucleic acid used, in both cases the fraction of potentially deleterious amino acid substitutions is overwhelmingly larger than that of neutral or beneficial ones.
At the other extreme, we can make the otherwise unrealistic assumption that all the 3521 invariable amino acids are neutral or quasi-neutral (sensu Ohta 1973; depending on population size); then the fraction of neutral sites in the VSV genome would be as high as 3531/3536 ≡ 99.86%, 0.14% would be advantageous, and none would be deleterious. However, recent computations showed that the deleterious genomic mutation rate for VSV might be as high as ∼1.2 per genome replication, with a majority of mutations of small effect but with a significant fraction being of large effect (Elena and Moya 1999).
Quantifying the proportion of amino acid substitutions in the proteome of RNA viruses that are effectively neutral (or quasi-neutral) is of extreme importance for testing the validity of the quasi-species model of viral evolution. The quasi-species model differs from the classical population genetics models in that neutral mutations do not lead to genetic drift of the population, and natural selection acts on the mutant distribution as a whole rather than on individual variants. The reason for challenging the quasi-species model resides on whether or not neutral sites are abundant in the genome (Jenkinset al. 2001; Holmes and Moya 2002). If neutral mutations are abundant, the sequence space may be too large to allow the formation of a stable quasi-species distribution. If an RNA virus genome contains neutral sites, then the number of sequences close to maximum fitness may exceed the population size of the virus itself (Jenkinset al. 2001). This would prevent the formation of a quasi-species because the viral population could not occupy all the sequence space around the master sequence and so would be subjected to genetic drift, which in turn would prevent the mutational coupling necessary for natural selection to act on the entire mutant distribution.
The presence of multiple beneficial mutations in the same genome allows some important conclusions to be drawn. Fitness effects previously measured (Miralleset al. 1999) cannot be assigned to single changes. Instead, the picture is a little bit more complex, involving multiple beneficial mutations as well as linked neutral changes. Is this jeopardizing our previous conclusion of clonal interference as a key factor during the evolution of RNA viruses? To test whether the first prediction (see above) of the clonal interference model still holds when multiple mutations got fixed on each lineage, we studied the correlation between the best beneficial effect fixed within each clone (from Table 2) and Ne. If the prediction still holds, a positive correlation between these two parameters would be expected. Figure 4 illustrates the existence of a significant positive trend (r = 0.6128, 19 d.f., one-tailed P = 0.0016), confirming the prediction made by the clonal interference model.
A second important conclusion of our study is the existence of extensive epistatic interactions among and within viral genes (Figure 3). However, we have shown that the net effect of epistasis on fitness is likely to be very low in VSV, since only 1 out of 16 detected epistatic interactions had significant fitness effects (4.9%, Table 2). This result adds to the emerging picture that epistasis between mutations is very weak across a broad phylogenetic range (de Visseret al. 1997; Elena and Lenski 1997; Charlesworth 1998; Elena 1999; Peters and Keightley 2000).
The results reported in this article open the possibility for new and exciting research. The most straightforward step to be taken is the recreation, by site-directed mutagenesis on a full-genome cDNA, of each observed mutation alone and measuring the fitness of individual mutations to confirm the validity of the above inferences and the role of epistasis among pairs of mutations. Other interesting research will be to quantify the magnitude of nucleotide diversity within our evolving populations. The results of such an analysis would, in principle, be helpful for understanding the role of processes such as hitchhiking, background selection, or selective sweeps in the evolution of RNA viruses.
We thank Rosario Miralles and Rafael Sanjuán for stimulating and fruitful discussions, Olga Cuesta for technical assistance, and A. M. Powell for critical reading of the manuscript. The comments of two reviewers were invaluable for improving the manuscript. This work was supported by a predoctoral fellowship to J.M.C. and a grant to A.M., both from the Spanish Ministerio de Educación y Cultura.
Communicating editor: H. Ochman
- Received March 27, 2002.
- Accepted June 24, 2002.
- Copyright © 2002 by the Genetics Society of America