Abstract
The relationships between synonymous and nonsynonymous substitution rates and between synonymous rate and codon usage bias are important to our understanding of the roles of mutation and selection in the evolution of Drosophila genes. Previous studies used approximate estimation methods that ignore codon bias. In this study we reexamine those relationships using maximum-likelihood methods to estimate substitution rates, which accommodate the transition/transversion rate bias and codon usage bias. We compiled a sample of homologous DNA sequences at 83 nuclear loci from Drosophila melanogaster and at least one other species of Drosophila. Our analysis was consistent with previous studies in finding that synonymous rates were positively correlated with nonsynonymous rates. Our analysis differed from previous studies, however, in that synonymous rates were unrelated to codon bias. We therefore conducted a simulation study to investigate the differences between approaches. The results suggested that failure to properly account for multiple substitutions at the same site and for biased codon usage by approximate methods can lead to an artifactual correlation between synonymous rate and codon bias. Implications of the results for translational selection are discussed.
SYNONYMOUS substitutions do not affect the amino acid sequence of a protein, yet in some species of Drosophila, Caenorhabditis elegans, Arabidopsis, yeast, and enterobacteria, synonymous substitutions appear to be influenced by natural selection (Grosjean and Fiers 1982; Sharpet al. 1986; Sharp and Li 1987; Shieldset al. 1988; Duret and Mouchiroud 1999). In Drosophila, synonymous codon usage is highly biased toward codons ending with G or C (Shieldset al. 1988; Powell and Moriyama 1997; but see Rodríguez-Trelleset al. 1999), and there is evidence that selection favors substitutions to the preferred synonymous codons and limits substitutions to unpreferred synonymous codons (Akashi 1994, 1995, 1997). The exact causes of selection at synonymous sites, however, are less clear.
Selection for preferred codon usage in Drosophila has been suggested to enhance protein translation. Selective enhancement of translation is supported by the observation that the most frequent synonymous codons tend to match the most abundant tRNAs (Shieldset al. 1988; Moriyama and Powell 1997b). In addition, a perceived negative correlation between silent divergence and codon bias (Shieldset al. 1988; Sharp and Li 1989; Moriyama and Hartl 1993; Powell and Moriyama 1997) and a strong correlation between gene expression and codon bias (Shieldset al. 1988; Powell and Moriyama 1997; Duret and Mouchiroud 1999) are consistent with greater selective pressure on the silent sites of genes with high codon bias. Finally, codon bias has been found to be lower in regions of low recombination (Kliman and Hey 1994; Comeronet al. 1999). This is expected if codon bias is maintained by selection, because in regions of low recombination the efficacy of selection is reduced due to linkage disequilibrium with other selected sites (Hill and Robertson 1966).
Although selective enhancement of translation appears to be the primary source of codon bias in Drosophila, it is less clear which of several mechanisms are operating. Translation could be enhanced by increasing the rate of elongation, reducing the cost of proofreading, increasing the accuracy of translation, or by any combination of those mechanisms (e.g., Akashi and Eyre-Walker 1998). The majority of evidence supports the hypothesis that selection is acting to increase translational accuracy. Preferred codon usage is related to functional constraints, with the most conserved genes and most functionally important amino acid sites exhibiting the most biased codon usage (Akashi 1994). Furthermore, the synonymous substitution rate is perceived to be positively correlated with the nonsynonymous rate (Comeron and Kreitman 1998). These patterns indicate that synonymous substitutions are not independent of selective constraints acting on the amino acid composition of the protein. However, some data are inconsistent with the translational accuracy hypothesis. Codon usage bias is lower in longer genes (Comeronet al. 1999; Duret and Mouchiroud 1999), but selection to increase accuracy should be greater in longer genes because the cost of proofreading is higher. In addition, codon bias increases within the first few hundred codons and then declines along Drosophila genes (Kliman and Eyre-Walker 1998). Hence the presence of additional selection pressures on codon bias, possibly to enhance translational efficiency (Comeronet al. 1999) or to maintain mRNA secondary structure (e.g., Kirbyet al. 1995), cannot be discounted.
Substitution rates in Drosophila genes contain important information about the effectiveness of selection. The relationship of synonymous substitution rate to codon bias, to nonsynonymous rate, and to location within a gene tells us something about the nature of selection (e.g., Kliman and Eyre-Walker 1998). Previous studies of substitution rates employed approximate methods that neglect the effects of codon bias and apply ad hoc corrections for multiple substitutions. Recent studies suggest that approximate methods can lead to seriously biased estimates of synonymous and nonsynonymous substitution rates when codon usage is biased (Ina 1995; Yang and Nielsen 1998, 2000; Bielawskiet al. 2000). An alternative is to use maximum likelihood (ML). The ML method is based on a Markov process model of codon substitution, which describes the changes between the sense codons in the genetic code and accounts for the transition/transversion rate ratio κ, different equilibrium codon frequencies, and the nonsynonymous/synonymous substitution rate ratio ω. Estimates of dS and dN are calculated according to their definitions from the maximum-likelihood estimates of model parameters (such as sequence divergence, κ, and ω). The probability theory corrects for unequal codon usage and for multiple substitutions in a straightforward manner. See Goldman and Yang (1994) and Yang and Nielsen (1998, 2000) for details.
The purpose of this study was to use ML methods to estimate substitution rates in Drosophila nuclear genes to investigate the relationship of synonymous rate to nonsynonymous rate and to codon usage bias. We compiled a sample of 83 loci with homologous DNA sequences from Drosophila melanogaster and at least one other species of Drosophila. Our analyses suggested a very different relationship between synonymous rate and codon usage bias as compared with all previous studies. A simulation study was thus conducted to investigate the source of this difference.
MATERIALS AND METHODS
Sequence data: Sequence data consist of 83 genes from D. melanogaster and at least one of the following species: D. pseudoobscura, D. subobscura, D. simulans, D. yakuba, and D. virilis. A list of these genes and their accession numbers is provided in the appendix. Some analyses were performed on a phylogeny of three species. Such an approach was possible for D. melanogaster, D. pseudoobscura, and D. subobscura (DmDpDsub) for 20 genes and for D. melanogaster, D. simulans, and D. yakuba (DmDsimDy) for 11 genes. Alternatively, larger sets of genes were analyzed in a pairwise fashion, i.e., D. melanogaster vs. D. virilis (DmDv; 39 genes), D. melanogaster vs. D. pseudoobscura (DmDp; 35 genes), and D. melanogaster vs. D. simulans (DmDsim; 24 genes).
Biased patterns of sequence evolution: G + C content at third codon positions (GC3) and codon usage bias, measured by the effective number of codons (ENC; Wright 1990), were computed for each gene, using the program Codon W of J. Penden. The assumption of homogeneous nucleotide frequencies among Drosophila species was tested using a chisquare test of a contingency table of nucleotide counts.
Estimation of the numbers of synonymous (dS) and nonsynonymous (dN) substitutions per site: Lineage-specific estimates of dS and dN were obtained from phylogenetic analyses of three species (DmDpDsub and DmDsimDy) using the ML method (Yang and Nielsen 1998). The assumption of a homogeneous dN/dS ratio among lineages was tested in both datasets by comparing two models. Model 0 assumed the same ratio for all three lineages of Drosophila, whereas model 1 allowed an independent dN/dS ratio for every lineage. Both models account for transition/transversion bias (κ) and unequal codon frequencies, which were determined using the empirical nucleotide frequencies at the three codon positions (F3 × 4 model; Yang 1999). Twice the log-likelihood difference between models 0 and 1 was compared with a χ2 distribution with d.f. = (3 - 1) = 2.
Parameters dS and dN were estimated pairwise by ML for genes in the DmDv, DmDp, and DmDsim datasets (Goldman and Yang 1994). For comparison, we estimated dS and dN using two approximate methods: that of Nei and Gojobori (1986), referred to as NG, and that of Comeron (1995); the primary difference between them is that the method of Comeron (1995) corrects for transition/transversion rate bias while NG does not. The PAML package (Yang 1999) was used to implement NG and ML, while Comeron’s program, K-Estimator 5.2, was used for Comeron’s (1995) method.
Computer simulations: To understand the differences between methods, we simulated pairs of codon sequences using the evolver program of the PAML package (Yang 1999). Model parameters used were transition/transversion rate ratio (κ), dN/dS ratio (ω), codon frequencies (πj), and sequence divergence (t). The κ value was set to one, and ω was the average from genes examined (ω = 0.06). We used the empirically estimated codon frequencies from eight Drosophila genes with different ENC values: 53.4 from sc; 49.1 from ade3; 41.2 from v; 44.7 from Gld; 38 from Gad1; 33.7 from Mlc1; 32.3 from Adh; and 28.3 from Amy-p. Each set of codon frequencies was evaluated at a sequence divergence (t; measured by the expected number of nucleotide substitutions per codon) set to 0.4, 0.8, 1.2, 1.6, 2.0, 2.4, and 2.8. Values of t represent the range observed in the sampled Drosophila genes. In total, 56 pairs of sequences were simulated, each 1 million codons in length. dS and dN were estimated for each pair of sequences using both ML and NG methods. The method of Comeron (1995) was employed for a few parameter combinations only to examine the effect of transition/transversion rate bias.
RESULTS
Nucleotide (codon) frequencies in Drosophila nuclear genes: GC3 varied substantially among the 83 genes, ranging from 28 to 93%. This variation was characteristic of each dataset (DmDpDsub, 44-90%; DmDsimDy, 52-93%; DmDp, 36-90%; DmDv, 45-88%; and DmDsim, 28-91%). Chi-square tests suggested significant heterogeneity in nucleotide composition among lineages for 4 of 20 genes in DmDpDsub (Gapdh2, Gpdh, RpII215, and ry), 8 of 35 genes in DmDp [Gapdh2, Gpdh, l(2)gl, Rh2, RpII215, ry, Tl, and Ubx], and 11 of 39 genes in DmDv [Adh, Amy-p, Cdc37, fu, gbb, Gpdh, Kr, l(2)tid, lama, nos, and Sry-beta].
Estimation of numbers of synonymous (dS) and nonsynonymous substitutions (dN) per site for each gene in the three-species datasets
Consistent with patterns of nucleotide bias, codon usage varied greatly among genes; the ENC values ranged from extreme bias at 26.7 (GstD1) to no bias at 61 (sc). Each dataset exhibited substantial variation in codon bias among genes (DmDpDsub, 27.3-59.8; DmDsimDy, 26.7-61; DmDp, 28.6-56.6; DmDv, 29.2-61; and DmDsim, 26.7-61). Codon bias was not related to heterogeneity in nucleotide composition among lineages. For example, genes with homogeneous nucleotide composition among lineages exhibited both highly biased (e.g., gene Eno, ENC = 28.6) and unbiased (e.g., gene ac, ENC = 56.1) codon usage.
Lineage-specific patterns of synonymous and nonsynonymous substitution: ML estimation of dN and dS was carried out for each gene in the DmDpDsub and DmDsimDy datasets (Table 1). Constancy of nonsynonymous/synonymous rate ratios (dN/dS) was tested, and dN/dS ratios were largely homogeneous over lineages (Table 1). Homogeneity was rejected in only 3 of the 20 genes in DmDpDsub and 3 of the 11 genes in DmDsimDy. This trend differs from that observed in mammals, where over half of genes surveyed exhibited significant variation in selective constraints among lineages (Yang and Nielsen 1998; Bielawskiet al. 2000).
—The relationship between dN and dS (a) and between dS and ENC (b) for Drosophila pseudoobscura. Estimates of dS and dN were obtained from lineage-specific analyses.
Genes exhibiting high synonymous or nonsynonymous rates in one lineage tended to exhibit high rates in other lineages as well. For example, the coefficient of determination (r2) of dS estimates between Dm and Dp is r2 = 0.5098 (P = 0.0004), and the coefficient of determination of dN is r2 = 0.5005 (P = 0.0005). Similar patterns were reported for mammalian genes (Bulmeret al. 1991; Mouchiroudet al. 1995). Substantial variation in dN/dS was found among genes, as they are under very different selective constraints (Table 1).
The relationship between synonymous and nonsynonymous substitution rates was evaluated by linear correlation using rate estimates of Table 1 for five species. Estimates of dS were positively correlated with dN in every lineage (Dp, r2 = 0.4643, P = 0.0009; Dsub, r2 = 0.7312, P ⪡ 0.0001; Dsim, r2 = 0.5722, P = 0.0071; Dy, r2 = 0.3945, P = 0.0385; Dm, r2 = 0.4587, P ⪡ 0.0001). The plot for D. pseudoobscura is presented as an example (Figure 1a). These results are consistent with a previous analysis of substitution rates in Drosophila (Comeron and Kreitman 1998). However, this study represents a wider sample of Drosophila species and the correlations reported here are stronger than previously reported.
Synonymous substitution rates were independent of codon bias, as evaluated by linear regression of dS and ENC, in every lineage (Dp, r2 = 0.0004, P = 0.7911; Dsub, r2 = 0.0138, P = 0.6218; Dsim, r2 = 0.151, P = 0.2376; Dy, r2 = 0.0045, P = 0.8446; Dm, r2 = 0.0079, P = 0.6405). The plot for D. pseudoobscura is presented as an example (Figure 1b). All previous analyses differ from ours in suggesting synonymous substitution rates are negatively correlated with codon bias (Shieldset al. 1988; Sharp and Li 1989; Moriyama and Hartl 1993; Powell and Moriyama 1997). Previous studies, however, also differ from ours in using approximate estimation methods, which ignore codon usage bias. Furthermore, in this study we estimated dS per lineage rather than between pairs of sequences. In the next section, we explore what factors may be responsible for this conflict.
Reconciling differences between ML and approximate methods: Models employed in ML estimation of substitution rates explicitly assumed that nucleotide/codon frequencies were at equilibrium (Goldman and Yang 1994). To see whether violation of this assumption influenced our correlation analysis, we reevaluated the relationship between dS and ENC using only stationary genes. To keep more genes in the dataset and to facilitate direct comparison with approximate methods, we apply the ML method to two-taxa datasets. This pairwise ML analysis led to the same conclusion as the phylogeny-based ML analyses; i.e., synonymous rates were independent of codon bias (Figure 2). However, when either NG or Comeron’s (1995) method was used, dS was negatively correlated with ENC for both the DmDv (NG, r2 = 0.5036, P ⪡ 0.0001; Comeron, r2 = 0.4615, P = 0.0001) and DmDp (NG, r2 = 0.6754, P ⪡ 0.0001; Comeron, r2 = 0.6127, P ⪡ 0.0001) datasets (Figure 2), a result consistent with previous studies (Sharp and Li 1989; Comeron and Aguadé 1996; Powell and Moriyama 1997). However, in the DmDsim dataset, the approximate methods lead to the same conclusion as the ML method; dS was independent of ENC (NG, r2 = 0.0995, P = 0.1332; Comeron, r2 = 0.0583, P = 0.2557; Figure 2). Because results were similar for the two approximate methods, only results for NG are presented in Figure 2. On the basis of these results we conclude that the conflict between approximate methods and ML cannot be attributed to use of a pairwise approach, the sample of genes, or nonstationarity of nucleotide frequencies.
To examine the effect of transition/transversion bias and codon bias on estimation of dS, we changed parameters of the codon model to reanalyze the two-taxa datasets by ML. The effect of ignoring transition/transversion bias was evaluated by setting κ = 1, but allowing for biased codon usage. Results obtained using this codon model did not differ from those obtained using the full codon model; i.e., there was no significant relationship between dS and ENC for DmDv (r2 = 0.140, P = 0.060) and DmDp (r2 = 0.051, P = 0.256). The effect of ignoring codon bias was evaluated by assuming equal codon frequencies, but allowing for transition/transversion bias. This codon model produced results different from those obtained previously and matched the approximate methods, with a significant correlation between dS and ENC in DmDv (r2 = 0.435, P = 0.0005) and DmDp (r2 = 0.6281, P ⪡ 0.0001). These findings suggest that the different treatment of codon bias was responsible for the conflict between ML and approximate methods in the relationship between dS and ENC.
—The relationship between pairwise estimates of dS and mean ENC (a-f). Pairwise estimates of dS were computed using both ML (Goldman and Yang 1994) and NG (Nei and Gojobori 1986). DmDv indicates pairwise comparisons between D. melanogaster and D. virilis; DmDp, pairwise comparisons between D. melanogaster and D. pseudoobscura; and DmDsim, pairwise comparisons between D. melanogaster and D. simulans.
However, in one pairwise comparison (DmDsim), approximate and ML methods were in agreement, yet codon usage in this dataset (mean ENC = 43) was just as biased as in the other two sets of genes (DmDv, mean ENC = 44; DmDp, mean ENC = 41) where approximate and ML methods differed. Hence, there must have been an additional factor. In both the DmDv and DmDp datasets, where approximate methods produced a positive relationship between dS and ENC, uncorrected percent sequence divergences (p) were very large (DmDv, mean p = 21.6 ± 6.0%; DmDp, mean p = 16.3 ± 5.8%). In the DmDsim dataset, where dS was independent of ENC, divergences were very low (mean p = 3.2 ± 1.4%). This pattern suggested a possible “saturation” effect. Consequently, we hypothesized that differences between approximate and ML methods might be related to the combined effects of sequence divergence and codon usage bias.
Simulation studies: Simulations were performed to evaluate the hypothesis that improper treatment of both divergent sequences and codon usage bias by approximate methods could have led to a significant positive correlation between dS and ENC when, in reality, they were independent. Codon sequences were simulated under values of t chosen to reflect the range exhibited among Drosophila genes, and codon frequencies were modeled after observed frequencies of Drosophila genes (see materials and methods). Simulated codon sequences were analyzed using both NG and ML (F3 × 4), and these results were compared with the true values employed in the simulation (Figure 3).
Both sequence divergence and codon bias had an effect on estimates of dS (Figure 3). Only at low sequence divergences (t < 1.6) and relatively unbiased codon usage (ENC = 53) did both NG and ML (F3 × 4) produce unbiased estimation of dS. With one exception (Figure 3a), dS was underestimated by both methods with increasing levels of sequence divergence (Figure 3, b-h), and the NG method involved a much more serious estimation bias than ML. Because equilibrium codon frequencies obtained using the F3 × 4 model (9 frequency parameters) do not perfectly match the empirical Drosophila frequencies, some degree of error was expected for ML. Use of the more parameter-rich model (60 frequency parameters) produced estimates of dS essentially identical to the true values (data not shown). Differences between the analysis of real data using the NG and ML (F3 × 4) methods are consistent with differences observed in our simulation study; i.e., ML estimates of dS were larger than estimates obtained using NG. These simulations further suggested that the ML (F3 × 4) estimates of dS for Drosophila genes are, themselves, likely to be underestimates of the true values. Although the F3 × 4 model is biased in cases of extreme codon bias, this model was acceptable over a wide range of codon biases and consistently outperformed the NG method.
—Estimates of dS in simulated data by NG (○) and ML (F3 × 4; ▵) plotted against sequence divergence t (the expected number of nucleotide substitutions per codon). Data for a pair of sequences, each of 1 million codons, were simulated for different codon bias measured by ENC.
Plots of dS (Figure 3) are reminiscent of the “saturation effect” on plots of uncorrected sequence divergence. For dS, however, the “ceiling” appears to be related to levels of codon usage bias; the effect is relatively minor when codon usage is unbiased and extreme when codon usage is highly biased (Figure 3). For example, in the most extreme case of codon bias (ENC = 28), the estimate of dS obtained using the NG method peaked at 0.5 (t > 0.8), although true values of dS range from 0.95 when t = 0.4 to 6.6 when t = 2.8. The only exception to the general pattern occurred when codon usage was relatively unbiased (Figure 3a). In this case, when 1.2 < t < 2.0, NG overestimated rather than underestimated dS, and when t > 2.0 the method yielded invalid estimates of dS. Interestingly, Muse (1996) suggested that even when there is no codon usage bias, ad hoc corrections for multiple substitutions applied in the approximate methods could introduce substantial bias.
The dramatic effect of codon usage bias on estimation of dS is caused mainly by its effect on counting of synonymous (S) and nonsynonymous (N) sites. When codon usage is biased but ignored (Figure 3, b-h), the number of synonymous sites (S) was overestimated, leading to an underestimation of dS. For example, in the most extreme case (ENC = 28, t = 2.8), most codons end with C or G and most changes at the third position are transversions between C and G, which are more likely to be nonsynonymous than random changes between nucleotides. As a result, the proportion of synonymous sites is as low as S = 8.53%. The NG method assumes unbiased codon usage or equal rates of change between nucleotides and expects more frequent transitional or synonymous changes, giving S = 23.6%, with almost a fourfold difference. When there is little codon bias (ENC = 53), NG reliably estimated S (NG, S = 23.4%; true value, S = 23.3%). Thus the underestimation of dS by NG is more serious for more extreme codon bias (Figure 3). In Figure 4a, the NG estimates of dS corresponding to different sequence divergences (t) from the simulation (Figure 3) were plotted against ENC. Although dS and ENC are independent, NG incorrectly indicated a positive correlation, due to the underestimation of dS at high divergences and extreme codon bias. The ML method (not plotted) indicated no correlation between dS and ENC (r2 = 0.00006, P = 0.95). The pattern seen in the simulated data is consistent with that in the real data. This is most clear when NG estimates of dS from the real data (DmDp, Figure 2d) were superimposed onto the approximate estimates from the simulations (Figure 4b). These data suggest that the significantly positive correlation between dS and ENC indicated by the approximate methods is an artifact of these methods’ failure to properly account for the combined effects of codon bias and sequence divergence.
—(a) Approximate estimates of dS by NG obtained in the simulation (Figure 3) plotted against ENC. Multiple estimates of dS for each ENC correspond to the seven values of t. Note the true dS is independent of ENC, but a positive correlation is suggested by NG. (b) The same data as in a but with NG estimates from the pairwise comparisons between Drosophila melanogaster and D. pseudoobscura superimposed (○).
DISCUSSION
Codon usage in Drosophila varies considerably between genes and does not appear random. Highly conserved genes, and functionally important sites, exhibit highly biased synonymous codon usage (Akashi 1994). Akashi (1994) theorized that these patterns resulted from a balance between selection for translational accuracy, mutation, and drift in finite populations. On the basis of this model, he predicted that synonymous substitution rates should be positively correlated with nonsynonymous substitution rates. However, he did not find a significant correlation. More recently, Comeron and Kreitman (1998) reported that when data from three species of Drosophila are combined, synonymous rates are positively correlated with nonsynonymous substitution rates. Our findings confirm those of Comeron and Kreitman (1998) and further indicate that this correlation is a feature of all five species of Drosophila. In addition, our analysis implies that the correlation between synonymous and nonsynonymous rates is stronger than previously thought, suggesting that synonymous substitution rates in these Drosophila species are not independent of selective constraints acting at the amino acid level. Consistent with patterns of codon usage (Akashi 1994), these data support the hypothesis that codon bias in Drosophila is influenced by selection for translational accuracy.
However, our findings differ from all previous studies (Shieldset al. 1988; Sharp and Li 1989; Moriyama and Gojobori 1992; Moriyama and Hartl 1993; Carulliet al. 1993; Powell and Moriyama 1997) in suggesting that synonymous substitution rates are not correlated with codon bias. Through ML analysis under different models as well as computer simulation, we show that previous reports of a significant correlation between dS and ENC might be an artifact of inadequate correction for the combined effects of saturation and biased codon usage. However, Powell and Moriyama (1997) suggested that artifacts of inadequate correction for codon bias could not be responsible for a correlation between dS and ENC because in their study they employed a method (Moriyama and Powell 1997a) that was claimed to correct for codon bias. However, this method corrects for base composition bias only when correcting for multiple substitutions and assumed unbiased codon usage in the important steps of counting sites and differences (Yang and Nielsen 2000). Because the major source of bias in dS and dN arises from biased estimates of S and N (Yang and Nielsen 2000; see also above), the method of Moriyama and Powell (1997a) does not correct effectively for codon usage bias.
The effect of saturation is still evident in the analysis of Powell and Moriyama (1997). They observed a very strong correlation between dS and codon usage bias in distantly related species pairs, but a much weaker correlation in a more recently diverged pair of species, D. melanogaster and D. simulans. Powell and Moriyama (1997) dismissed the weak correlation as an effect of sampling errors at small sequence divergences and possible shared polymorphisms. Our simulation study suggests an opposite conclusion: that is, the weak correlation is closer to the true relationship, while the strong correlation is an artifact; i.e., because D. melanogaster and D. simulans have a more recent divergence, this species pair provided the least biased estimate of the relationship between dS and codon bias when approximate methods are used.
Our analysis suggests that the discrepancy between approximate and ML methods was caused mainly by codon usage bias and not by the transition/transversion bias (see also Yang and Nielsen 1998, 2000; Bielawskiet al. 2000). The approximate methods of Nei and Gojobori (1986) and Comeron (1995) led to the same conclusion. The ML method both with and without accounting for the transition/transversion bias led to the same conclusion, which is different from that of the approximate methods. To test this notion further, a small simulation study was conducted using the method of Comeron (1995), which corrects for transition/transversion bias. The transition/transversion ratio was fixed at κ = 2, and three levels of codon bias (ENC = 49, 38, and 28.31) were used. Simulated sequences comprised only 30,000 codons as the program of Comeron did not handle very long sequences. Results, shown in Figure 5, indicate that simply accounting for the transition/transversion bias (i.e., the method of Comeron 1995) does not reduce the bias in estimates of synonymous substitution rates. In fact, at moderate codon bias (ENC = 49), NG is closer to the real value of dS than the method of Comeron (1995). This is because the bias introduced by ignoring codon usage is in the opposite direction to that produced by ignoring the transition/transversion ratio, and the two biases partially cancel out in NG. This pattern was discussed by Yang and Nielsen (2000). However, with greater codon bias (i.e., ENC > 49), the effect of ignoring codon bias becomes much larger than ignoring transition bias, hence both NG and Comeron’s (1995) methods yield nearly identical estimates of dS.
Given the evidence for selection for translational accuracy in Drosophila, our finding of lack of correlation between synonymous rate and codon usage bias is puzzling. Models of translational selection predict that selection against substitutions to unpreferred codons at functionally important amino acid sites will result in a positive correlation between dS and ENC (Shieldset al. 1988; Sharp and Li 1989; Moriyama and Hartl 1993; Powell and Moriyama 1997). However, this prediction assumes that mutation pressure does not contribute to codon bias. Kliman and Hey (1994) found a correlation between base composition of third codon positions and introns, and they suggested that at least 10% of variation in codon bias can be explained by mutation pressure. Variable mutation pressures can lead to a negative, rather than positive, correlation between dS and ENC (Bielawskiet al. 2000). If mutation pressure produces a negative relationship between dS and ENC in Drosophila, it might confound the positive relationship between dS and ENC expected under translational accuracy.
—Estimates of dS in simulated data by NG (○), ML (F3 × 4; ▵), and Comeron (1995) (×) plotted against sequence divergence t (the expected number of nucleotide substitutions per codon). (□) Real values of dS. Data for a pair of sequences, each of 30,000 codons, were simulated under κ = 2 and three different levels of codon bias (a-c).
Comeron and Kreitman (1998) found that, among the majority of codons having double nucleotide substitutions in Drosophila, synonymous substitutions represented a shift to an unpreferred rather than a preferred codon, a pattern inconsistent with positive codon selection. Comeron and Kreitman (1998) hypothesized that the explanation for this pattern was a relaxation of selective constraints at these sites. They suggested a covarion-like model whereby the effect of selection on individual codons, for translational accuracy, changes over time. In this model, any relaxation of constraints at an amino acid site also includes relaxation of constraints on synonymous codon usage at the same site. Fluctuations in selective constraints could obscure the relationship between dS and ENC predicted by previous models of translational selection because dS is measured over time and ENC is measured at one point in time. In genes having a large proportion of sites where selective constraints have changed over time, substitution rates will be higher as compared to genes having a smaller fraction of sites subject to changing selective pressures. At any one point in time these genes could have similar values of ENC. Interestingly, we observed considerable variation in dS among genes with similar values of ENC. Moreover, the covarion-like model of Comeron and Kreitman (1998) predicts a positive correlation between synonymous and nonsynonymous substitution rates, which we also observed among Drosophila.
Whatever the causes of codon bias in Drosophila, the results of this study, and others (Kliman and Hey 1994; Comeron and Kreitman 1998; Kliman and EyreWalker 1998; Comeronet al. 1999; Duret and Mouchiroud 1999; McVean and Vieira 1999), indicate that the origins of codon bias in Drosophila are more complex than previously thought. Unbiased estimation of substitution rates will be a crucial aspect of unraveling the origins of codon bias in Drosophila.
Drosophila genes and accession numbers
Acknowledgments
We are grateful for the comments of two anonymous reviewers. This study was supported by a Biotechnology and Biological Sciences Research Council grant (31/G10434) to Z.Y.
Footnotes
-
Communicating editor: J. Hey
- Received July 11, 2000.
- Accepted October 12, 2000.
- Copyright © 2001 by the Genetics Society of America