Volatility of a codon is defined as the probability that a random point mutation in the codon generates a nonsynonymous change. It has been proposed that higher-than-expected mean codon volatility of a gene indicates that positive selection for nonsynonymous changes has acted on the gene in the recent past. I show that strong frequency-dependent selection (minority advantage) in large populations can increase codon volatility slightly, whereas directional positive selection has no effect on volatility. Factors unrelated to positive selection, such as expression-related or GC-content-related codon usage bias, also affect volatility. These and other considerations suggest that codon volatility has only limited utility for detecting positive selection at the DNA sequence level.
Positive Darwinian selection at the DNA sequence level is usually assessed by comparing the number of synonymous substitutions per synonymous site (dS) and the number of nonsynonymous substitutions per nonsynonymous site (dN) from an alignment of at least two homologous protein-coding DNA sequences (Li 1997; Nei and Kumar 2000). A dN significantly higher than dS is a strong indication of positive selection (Hughes and Nei 1988). Plotkin and co-workers recently introduced a new method for detecting positive selection using just one DNA sequence, and this method is based on the concept of codon volatility (Plotkin et al. 2004).
Volatility of a codon is defined as the probability that a random point mutation in this codon is nonsynonymous. Codon volatility necessarily depends on mutational patterns, such as the ratio of transitional to transversional changes. In the simplest mutation model (SMM), where all nucleotides have equal mutation rates and all nucleotides are equally exchangeable, the volatility of a codon is the proportion of point-mutation neighbors that encode different amino acids (Plotkin et al. 2004). For instance, TTG (Leu) has a volatility of 6/8, because 6 of its 8 non-stop-codon neighbors are nonsynonymous. Table 1 lists the volatility of all 61 sense codons under the SMM. One can see that codon volatility varies from 0.5 (CGA for Arg) to 1 (TGG for Trp or ATG for Met). The volatility of a gene is the average volatility for all the codons in the gene. A gene may have an exceptionally high volatility simply because it has a high proportion of amino acids with high volatility (e.g., Trp and Met). Controlled for amino acid composition, a gene may still have an unusually high volatility due to the frequent use of synonymous codons that are of high volatility. Four of the 20 amino acids (Arg, Gly, Leu, and Ser) contain synonymous codons of different volatilities (Table 1).
Plotkin et al. (2004) argued that high volatility (after control for amino acid composition and genomic average patterns of synonymous codon usage) is a result of positive selection for nonsynonymous changes and that volatility can be used as a statistic for detecting positive selection. In particular, it was proposed that compared with a low-volatility codon, the presence of a high-volatility codon in a gene sequence indicates a greater probability that the previous substitution was nonsynonymous (Plotkin et al. 2004). Although this interpretation is valid, it does not follow that high volatility indicates positive selection. Furthermore, in theory, positive selection for nonsynonymous changes may not increase codon volatility, for the following reason: the only advantage of having a high-volatility codon is its greater potential for generating nonsynonymous mutations in the future, but natural selection cannot foresee the future. As long as two codons code for the same amino acid, the high-volatility codon does not confer higher fitness. Therefore, it is unclear under what conditions codon volatility would indicate positive selection. Here I address this question by computer simulation and analysis of genomic sequence data. My results suggest that positive selection rarely increases codon volatility and that factors other than positive selection can affect codon volatility. Thus, the utility of this method appears limited.
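The SMM definition of codon volatility given above can be illustrated with a short sketch. It assumes the standard genetic code and counts only non-stop neighbors in the denominator, as in the TTG example; the names `CODE`, `neighbors`, and `volatility` are illustrative and do not come from the original analysis.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c in CODE if CODE[c] != "*"]


def neighbors(codon):
    """Return the nine codons that differ from `codon` at exactly one position."""
    return [codon[:i] + b + codon[i + 1:]
            for i in range(3) for b in BASES if b != codon[i]]


def volatility(codon):
    """Fraction of non-stop point-mutation neighbors that encode a different amino acid."""
    sense = [n for n in neighbors(codon) if CODE[n] != "*"]
    nonsyn = [n for n in sense if CODE[n] != CODE[codon]]
    return len(nonsyn) / len(sense)


def gene_volatility(codons):
    """Average SMM volatility over all codons of a gene."""
    return sum(volatility(c) for c in codons) / len(codons)
```

With these definitions, `volatility("TTG")` gives 0.75 (6 of 8 non-stop neighbors), `volatility("CGA")` gives 0.5, and `volatility("TGG")` and `volatility("ATG")` both give 1.0, in agreement with the values quoted in the text.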
RESULTS AND DISCUSSION
Volatility varies slightly among genes:
Because volatility is between 0.5 and 1 for each codon, it follows that all genes have volatilities between 0.5 and 1. Figure 1A shows the distribution of volatility for all 3624 genes from the complete genome sequence of the K12 strain of Escherichia coli. The mean volatility is 0.761 and the standard deviation is 0.014. One can see that gene volatility has a narrow distribution, with the standard deviation being only 1.8% of the mean. The distribution is also symmetric, as the median (0.760) is very close to the mean (0.761). The same pattern is observed in several other prokaryotic and eukaryotic genomes examined (Figure 1), except for the malaria parasite Plasmodium falciparum, which appears to have an asymmetric distribution that may be due to the exceptionally high AT content of its genome.
As mentioned, gene volatility is affected by the amino acid composition of the gene as well as by synonymous codon usage. It is thus possible to compute the expected volatility of the gene using its amino acid composition and the average frequencies of synonymous codons found in the genome (Plotkin et al. 2004). The ratio (λ) of the observed volatility to the expected volatility measures the deviation in volatility that is due to the departure of the gene's synonymous codon usage from the genomic average. Higher volatility than expected is indicated by λ > 1 and lower volatility than expected is indicated by λ < 1. Figure 2A shows the distribution of λ for all the genes of E. coli K12. Again, the distribution is very narrow, with the standard deviation being only 0.46% of the mean. The same is true for other species examined (Figure 2). These patterns show that the variation in gene volatility is largely due to the among-gene variation in amino acid composition and that gene volatility becomes extremely homogeneous when the amino acid composition variation is controlled for. This implies the rarity of selection that would substantially increase gene volatility.
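One way to read the definition of λ is sketched below. It is a minimal illustration under stated assumptions, not the procedure actually used for Figure 2: the expected volatility of each codon position is taken as the usage-weighted mean volatility of all synonymous codons for that amino acid, and `genome_usage` is a hypothetical mapping from codon to genome-wide count. All function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c in CODE if CODE[c] != "*"]


def volatility(codon):
    """SMM volatility: nonsynonymous fraction of non-stop single-nucleotide neighbors."""
    nbrs = [codon[:i] + b + codon[i + 1:]
            for i in range(3) for b in BASES if b != codon[i]]
    sense = [n for n in nbrs if CODE[n] != "*"]
    return sum(CODE[n] != CODE[codon] for n in sense) / len(sense)


def observed_volatility(codons):
    """Mean volatility of the codons actually used by the gene."""
    return sum(volatility(c) for c in codons) / len(codons)


def expected_volatility(codons, genome_usage):
    """Mean volatility expected from the gene's amino acids and genome-wide codon usage."""
    total = 0.0
    for c in codons:
        syn = [k for k in SENSE if CODE[k] == CODE[c]]
        weight = sum(genome_usage.get(k, 0) for k in syn)
        total += sum(genome_usage.get(k, 0) * volatility(k) for k in syn) / weight
    return total / len(codons)


def volatility_ratio(codons, genome_usage):
    """Lambda: observed volatility divided by expected volatility."""
    return observed_volatility(codons) / expected_volatility(codons, genome_usage)


# Hypothetical genome in which every sense codon is used equally often.
uniform_usage = {c: 1 for c in SENSE}
toy_gene = ["ATG", "CGA", "CGA", "TTG"]
lam = volatility_ratio(toy_gene, uniform_usage)
```

In this toy case the gene uses the lowest-volatility Arg codon (CGA) twice, so λ falls below 1. A gene made only of Met or Trp codons has λ = 1 regardless of codon usage, because those amino acids have no synonymous alternatives.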
Directional positive selection does not increase codon volatility:
To investigate whether continuous positive selection favoring nonsynonymous substitutions would increase the codon volatility of a gene, I conducted a computer simulation. In the simulation, I randomly generated a sense codon and then introduced mutations randomly according to the SMM described above. I modeled different strengths and directions of selection by varying the dN/dS ratio. For example, when dN/dS was 2, a nonsynonymous mutation was twice as likely as a synonymous mutation to be fixed. After a long period of evolution (equivalent to dS = 10), I measured the volatility of the resultant codon and compared it with that of the initial codon. This simulation was repeated 20,000 times. The results showed that neither positive (dN/dS > 1) nor negative (dN/dS < 1) selection affected the evolution of codon volatility (Table 2). The average codon volatility was always 0.74–0.75 and codon frequencies were not changed by selection. For example, in the extreme case of positive selection with dN/dS = 8, the mean starting volatility was 0.74647 and the mean resultant volatility was 0.74678, and their difference was not statistically significant (P > 0.05; Z-test). Among the 20,000 codons simulated, volatility increased in 7771 codons during the evolution, decreased in 7740 codons, and remained unchanged in the rest. Again, the two numbers, 7771 and 7740, are not significantly different (P > 0.05; binomial test). These simulation results demonstrate that directional positive selection for nonsynonymous changes does not increase codon volatility and thus volatility cannot be used as an indicator for such selection.
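The logic of this simulation can be sketched as follows. This is a simplified illustration, not the original program: it runs a fixed number of proposed mutations per codon instead of evolving to dS = 10, it uses fewer replicates, and it assumes that mutations to stop codons are never fixed. The names `evolve` and `simulate` and their parameters are hypothetical.

```python
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c in CODE if CODE[c] != "*"]


def volatility(codon):
    """SMM volatility: nonsynonymous fraction of non-stop single-nucleotide neighbors."""
    nbrs = [codon[:i] + b + codon[i + 1:]
            for i in range(3) for b in BASES if b != codon[i]]
    sense = [n for n in nbrs if CODE[n] != "*"]
    return sum(CODE[n] != CODE[codon] for n in sense) / len(sense)


def evolve(codon, omega, n_mutations, rng):
    """Propose SMM point mutations; fix nonsynonymous ones omega times as often as synonymous ones."""
    scale = max(1.0, omega)  # keeps both fixation probabilities within [0, 1]
    for _ in range(n_mutations):
        pos = rng.randrange(3)
        new_base = rng.choice([b for b in BASES if b != codon[pos]])
        mutant = codon[:pos] + new_base + codon[pos + 1:]
        if CODE[mutant] == "*":
            continue  # assumption: nonsense mutations are never fixed
        p_fix = (1.0 if CODE[mutant] == CODE[codon] else omega) / scale
        if rng.random() < p_fix:
            codon = mutant
    return codon


def simulate(omega, n_codons=1000, n_mutations=100, seed=0):
    """Return mean volatility of the starting codons and of the evolved codons."""
    rng = random.Random(seed)
    start_sum = end_sum = 0.0
    for _ in range(n_codons):
        start = rng.choice(SENSE)
        end = evolve(start, omega, n_mutations, rng)
        start_sum += volatility(start)
        end_sum += volatility(end)
    return start_sum / n_codons, end_sum / n_codons
```

Calling `simulate(8.0)` or `simulate(0.2)` returns two means that should differ only by sampling noise. In this sketch the reason is easy to see: whether a change is synonymous does not depend on its direction, so the fixation rule is symmetric between any two neighboring codons, and a uniform distribution over sense codons is left unchanged by the process whatever the value of omega.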
Strong frequency-dependent selection may increase codon volatility:
In the above simulation, I assumed that for every codon, there were at most two alleles segregating in the population at any given time. This is identical to the infinite-site model of population genetics (Kimura 1969, 1971) applied at the codon level and it is a reasonable model for most genes in reality. For example, virtually all single-nucleotide polymorphisms observed in human populations have only two different nucleotides. However, this model does not apply to genes under strong overdominant selection or frequency-dependent selection (minority advantage), such as the mammalian major histocompatibility complex genes where more than two different nucleotides may be segregating at a single site (Hughes and Nei 1988, 1989). I suspect that these types of selection might increase codon volatility for two reasons. First, it has been noted that high-volatility codons tend to have neighboring codons of high volatility (Plotkin and Dushoff 2003), as shown in Figure 3. This means that codon volatility is heritable. Second, under strong overdominant selection or frequency-dependent selection, multiple alleles at a given codon site may be segregating. This, in conjunction with the heritability of codon volatility, may allow high-volatility codons to generate more nonsynonymous codons that are positively selected for, which could increase the average volatility of the gene in the population.
To investigate this possibility, I conducted a computer simulation of frequency-dependent selection. Frequency-dependent selection and overdominant selection have similar effects on molecular evolution (Takahata and Nei 1990), and I simulated frequency-dependent selection in a haploid population because of its simplicity. I began the simulation by randomly generating a sense codon and assigning it to the entire population of N haploid individuals in generation 0. In generation t, N alleles were chosen randomly from the gene pool of generation t − 1, under both selection and drift. All synonymous codons encoding amino acid i had the same fitness of $f_i = 1 - s p_i$, where $p_i$ was the total frequency of synonymous codons encoding amino acid i and s (0 ≤ s ≤ 1) was the selection coefficient. This fitness formula reflected one form of minority advantage, where the fitness of an allele decreased linearly with the frequency of the amino acid it encoded. Random mutations were then generated with a rate of u per nucleotide site per generation. Evolution continued for T generations, and the average codon volatility for the population in generation T was computed and compared with that in generation 0. I used sufficiently large T values to ensure that the population reached equilibrium at the end of the simulation. The simulation was repeated 10,000 times. Table 3 lists the simulation results under various conditions. To reduce computer time, I used small population sizes (N) but relatively large selection coefficients (s) and high mutation rates (u) so that Nu was in the range of 0.001–1 and Ns was in the range of 0–99. Nu is usually between 0.01 and 1 for prokaryotes and unicellular eukaryotes and on the order of 0.001 or lower for multicellular eukaryotes (Lynch and Conery 2003). Thus, the simulation was biologically meaningful. As predicted, frequency-dependent selection could increase codon volatility, and this increase became more obvious when Ns and Nu were larger. 
In higher eukaryotes, codon volatility is less useful as an indicator for positive selection because N and Nu are generally quite small. For instance, in humans, N is about $10^4$ (Takahata et al. 1995) and u is about $2 \times 10^{-8}$ per generation (Yi et al. 2002), and Nu is thus $2 \times 10^{-4}$. However, even under the most extreme condition examined (N = 100, s = 0.99, u = 0.01), the mean codon volatility increased by <4%, suggesting that codon volatility responded to only a small extent to the frequency-dependent selection for nonsynonymous changes. Note that the above selection was quite strong, as it could maintain on average 17.7 alleles in the population of 100 haploids and resulted in a nucleotide diversity of 0.662. It should be noted that the above simulation scheme was somewhat different from conventional population genetic analysis, because it was the behavior of a single codon, not that of a DNA sequence, that was examined. The simulation can be regarded as a special case in which at any time of evolution only the variation at one codon of the gene affects fitness. In reality, nonadditive fitness effects of different codons of a gene would make the situation much more complicated. Nevertheless, the simple simulation demonstrates the extent to which frequency-dependent selection could affect codon volatility.
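A single replicate of the haploid frequency-dependent scheme described above can be sketched as below. This is a minimal illustration under stated assumptions rather than the original program: mutants that become stop codons are discarded and the parental codon is kept (the text does not say how such mutants were handled), each generation samples N alleles in proportion to count times fitness, and all function names and default parameter values are hypothetical.

```python
import random
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c in CODE if CODE[c] != "*"]


def volatility(codon):
    """SMM volatility: nonsynonymous fraction of non-stop single-nucleotide neighbors."""
    nbrs = [codon[:i] + b + codon[i + 1:]
            for i in range(3) for b in BASES if b != codon[i]]
    sense = [n for n in nbrs if CODE[n] != "*"]
    return sum(CODE[n] != CODE[codon] for n in sense) / len(sense)


def mutate(codon, u, rng):
    """Mutate each nucleotide independently with probability u."""
    bases = list(codon)
    for i in range(3):
        if rng.random() < u:
            bases[i] = rng.choice([b for b in BASES if b != bases[i]])
    return "".join(bases)


def next_generation(pop, s, u, rng):
    """One generation of selection plus drift, followed by mutation."""
    n = len(pop)
    counts = Counter(pop)
    aa_counts = Counter()
    for codon, k in counts.items():
        aa_counts[CODE[codon]] += k
    alleles = list(counts)
    # Fitness f_i = 1 - s * p_i, shared by all codons encoding amino acid i.
    weights = [counts[c] * (1.0 - s * aa_counts[CODE[c]] / n) for c in alleles]
    if sum(weights) <= 0:  # s = 1 with a single amino acid: fall back to pure drift
        weights = [counts[c] for c in alleles]
    sampled = rng.choices(alleles, weights=weights, k=n)
    new_pop = []
    for c in sampled:
        m = mutate(c, u, rng)
        new_pop.append(m if CODE[m] != "*" else c)  # assumption: stop mutants discarded
    return new_pop


def simulate(n=100, s=0.99, u=0.01, generations=200, seed=1):
    """Return starting volatility, final mean volatility, and the final population."""
    rng = random.Random(seed)
    pop = [rng.choice(SENSE)] * n
    v_start = volatility(pop[0])
    for _ in range(generations):
        pop = next_generation(pop, s, u, rng)
    v_end = sum(volatility(c) for c in pop) / n
    return v_start, v_end, pop
```

Averaging `v_end - v_start` over many seeds, and over runs long enough to reach equilibrium, would give a quantity comparable in spirit to the changes reported in Table 3; a single replicate is dominated by drift and by the choice of starting codon.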
Expression-related codon usage bias potentially affects codon volatility:
Although codon volatility could be enhanced by frequency-dependent selection and thus in principle may be used as an indicator for positive selection, codon volatility may also be affected by factors unrelated to positive selection for nonsynonymous changes. For instance, it is well known that synonymous codon usage bias is related to the level of gene expression in many organisms, with a greater degree of bias in highly expressed genes (Li 1997). It is believed that this phenomenon is due to selection for high translational efficiency through preferential use of synonymous codons whose cognate tRNA has high concentrations. As mentioned, four amino acids have synonymous codons of different volatilities. Table 4 lists the average volatility for each of these amino acids computed by considering the average synonymous codon usage in highly and lowly expressed genes of E. coli, Saccharomyces cerevisiae, and Drosophila melanogaster (Sharp et al. 1988). It is obvious that codon volatility can be affected by gene expression level. The actual amount of effect, however, depends on the frequencies of these four amino acids in a protein, particularly Arg, Leu, and Ser. A recent analysis showed that the synonymous codon usage in genes highly expressed in one stage of the developmental cycle of P. falciparum differs from that in genes highly expressed in another stage (Peixoto et al. 2004). It is unclear, however, whether this difference resulted from codon volatility-related selection (Plotkin et al. 2004) or from translation efficiency-related selection that may arise if tRNA concentrations vary in different developmental stages.
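The per-amino-acid averages of the kind reported in Table 4 are usage-weighted means of codon volatilities. The sketch below shows the calculation for Arg with two made-up usage profiles; the frequencies are hypothetical placeholders chosen for illustration and are not the values of Sharp et al. (1988), and the function name is illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def volatility(codon):
    """SMM volatility: nonsynonymous fraction of non-stop single-nucleotide neighbors."""
    nbrs = [codon[:i] + b + codon[i + 1:]
            for i in range(3) for b in BASES if b != codon[i]]
    sense = [n for n in nbrs if CODE[n] != "*"]
    return sum(CODE[n] != CODE[codon] for n in sense) / len(sense)


def amino_acid_volatility(aa, usage):
    """Usage-weighted mean volatility of the synonymous codons of amino acid `aa`."""
    syn = [c for c in CODE if CODE[c] == aa]
    total = sum(usage.get(c, 0.0) for c in syn)
    return sum(usage.get(c, 0.0) * volatility(c) for c in syn) / total


# Hypothetical usage profiles for Arg (one-letter code 'R'); not real data.
biased_usage = {"CGT": 0.6, "CGC": 0.4}
even_usage = {c: 1.0 for c in ("CGT", "CGC", "CGA", "CGG", "AGA", "AGG")}

arg_biased = amino_acid_volatility("R", biased_usage)
arg_even = amino_acid_volatility("R", even_usage)
```

Here the biased profile gives 2/3 because CGT and CGC both have volatility 6/9, whereas even usage of all six Arg codons gives a slightly lower average. A shift in synonymous codon preference alone therefore changes the volatility contributed by Arg, without any nonsynonymous substitution.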
Furthermore, in some organisms, GC content varies substantially in different regions of the genome (e.g., mammals; Bernardi 2004), which may lead to variable expected codon volatility for different genomic regions. Thus, use of the genomic average synonymous codon usage to compute the expected codon volatility may be inappropriate.
In this work, the simplest mutation model was used. In reality, mutational patterns are quite complicated and may vary among genes and genomes. Although codon volatility can be computed given a mutational model (Plotkin et al. 2004), the lack of knowledge of exact mutational models for a given gene or genome adds another layer of uncertainty to the application of codon volatility in detecting positive selection.
In contrast to the claim of Plotkin et al. (2004), I find that codon volatility is not increased by directional positive selection for nonsynonymous changes and is increased only slightly by strong frequency-dependent selection in large populations. Given that factors unrelated to positive selection also affect codon volatility, the utility of this measure in detecting positive selection at the DNA sequence level seems limited. Nevertheless, the idea of using just one DNA sequence to detect natural selection (Plotkin et al. 2004) is novel and attractive, and it would be interesting to develop other measures that may accomplish this goal.
I thank Hunter Fraser and Xionglei He for discussions. David Webb and two anonymous reviewers provided valuable comments on the manuscript. This work was supported by a startup fund from the University of Michigan and National Institutes of Health grant GM-67030 to J.Z.
Communicating editor: S. W. Schaeffer
- Received August 11, 2004.
- Accepted September 30, 2004.
- Genetics Society of America