Genetics, Vol. 159, 1191-1199, November 2001, Copyright © 2001

Codon Usage Bias Covaries With Expression Breadth and the Rate of Synonymous Evolution in Humans, but This Is Not Evidence for Selection

Araxi O. Urrutia and Laurence D. Hurst
Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, United Kingdom

Corresponding author: Laurence D. Hurst, Department of Biology and Biochemistry, University of Bath, Claverton Down, Bath BA2 7AY, United Kingdom. E-mail: l.d.hurst@bath.ac.uk

Communicating editor: J. HEY


ABSTRACT

In numerous species, from bacteria to Drosophila, evidence suggests that selection acts even on synonymous codon usage: codon bias is greater in more abundantly expressed genes, the rate of synonymous evolution is lower in genes with greater codon bias, and there is consistency between genes in the same species in which codons are preferred. In contrast, in mammals, while nonequal use of alternative codons is observed, the bias is attributed to the background variance in nucleotide concentrations, reflected in the similar nucleotide composition of flanking noncoding and exonic third sites. However, a systematic examination of the covariants of codon usage controlling for background nucleotide content has yet to be performed. Here we present a new method to measure codon bias that corrects for background nucleotide content and apply this to 2396 human genes. Nearly all (99%) exhibit a higher amount of codon bias than expected by chance. The patterns associated with selectively driven codon bias are weakly recovered: Broadly expressed genes have a higher level of bias than do tissue-specific genes, the bias is higher for genes with lower rates of synonymous substitutions, and certain codons are repeatedly preferred. However, while these patterns are suggestive, the first two patterns appear to be methodological artifacts. The last pattern reflects in part biases in usage of nucleotide pairs. We conclude that we find no evidence for selection on codon usage in humans.


Does selection act on mutations within exons that do not alter the amino acid sequence of the encoded protein? Originally it was asserted that these synonymous mutations must be neutral (KING and JUKES 1969). However, it is well known that unequal use of alternative codons is a common phenomenon in many unicellular species, as well as in Drosophila and Caenorhabditis, and that this may reflect the activity of selection (MARAIS et al. 2001). In several unicellular species and some invertebrates, the extent of codon usage bias correlates with the level of gene expression, with highly expressed genes tending to show greater bias (GOUY and GAUTIER 1982; SHARP et al. 1986; STENICO et al. 1994; DURET and MOUCHIROUD 1999). Evidence has been found to relate this to tRNA availabilities (SHARP et al. 1995; MORIYAMA and POWELL 1997; KANAYA et al. 1999). These observations suggest that in these species codon usage bias is explained partly by pressures related to translation efficiency. Additionally, the rate of synonymous evolution covaries with the level of codon bias (see, e.g., POWELL and MORIYAMA 1997), although this might be an artifact of the method (DUNN et al. 2001). There are also consistent preferences toward certain codons within any given genome that may be interpreted as a result of selection, as they appear to run in the opposite direction to mutation bias.

In mammalian genomes, codon usage bias is also observed (EYRE-WALKER 1991; IKEMURA and WADA 1991). However, as mammalian genomes show a great variation in nucleotide concentrations across the genome (i.e., isochores; BERNARDI 1995; BERNARDI et al. 1997), codon usage bias has been attributed to this background nucleotide bias. In Figure 1, for example, we plot GC content of third sites in exons for 369 genes on human chromosomes 21 and 22 against the GC content of the 50 kb of DNA flanking the gene in question. As can be seen, the two strongly covary. A similar covariance is also seen between GC content at third sites and intronic GC (see DURET and HURST 2001 and references therein). As codon usage bias strongly reflects the GC content at the silent sites (Figure 2A), it has been difficult to assess the contribution to codon usage bias in mammalian genes of other variables related to protein synthesis, such as expression patterns and rates of substitution.



Figure 1. Correlation of GC content at fourfold degenerate sites (GC4) with the GC content of the 50-kb flanking region (GC50kb). (GC4 = -0.32 + 1.96 GC50kb; r² = 50.4%; P < 0.001).



Figure 2. Correlation of codon usage bias using the MCB method and G + C content at third sites (GC3s), (A) assuming equiprobability and (B) correcting by nucleotide bias in a sample of 2396 human genes. (MCB = 0.57 + 0.067 GC3s, r² = 0.5%, P = 0.001; MCBequiprobability = 2.61 - 8.94 GC3s + 9.56 GC3s², r² = 85.3%, P < 0.001).

Nonetheless, in one case there has been a claim that a highly expressed set of genes (histones) does show codon usage that deviates from background nucleotide content in flanking regions (DEBRY and MARZLUFF 1994). This suggests that selection could operate on codon usage in humans as well. If this were generally true we might expect that codon usage bias in mammals should influence expression patterns. It is unknown whether, more generally, codon usage bias, after correction for background nucleotide content, covaries with any expression parameters. By contrast, there is now some evidence supporting the notion that codon usage affects expression. When nonmammalian genes are to be expressed in mammalian cells, replacing codons that are rare in the mammalian genome with common ones appears to have dramatic effects on the level of gene expression. This method, known as "mammalianization" or "humanization," has been used for increasing the expression of several genes (e.g., LEVY et al. 1996; ZOLOTUKHIN et al. 1996; WELLS et al. 1999; ZHOU et al. 1999).

In this study we present the results from the analysis of codon usage bias in a sample of over 2000 human genes designed to ask whether codon usage bias in mammals can be explained by background nucleotide content alone or whether such parameters as expression breadth might also be important. To achieve this we developed a tool to measure codon usage bias, correcting nucleotide biases.


METHODS

Sequences from 2396 genes were included in the sample. Accession numbers were obtained from the DURET and MOUCHIROUD (2000) database and sequences were retrieved by ACNUC (GOUY et al. 1984). All incomplete sequences (i.e., with internal gaps, nondefined nucleotides, or no start or stop codon) were discarded. Data on expression patterns were also obtained from the DURET and MOUCHIROUD (2000) database. Breadth of expression was calculated by counting the number of tissues where the gene is expressed. Columns referring to the same tissue but in different developmental stages were treated as a single tissue.

Randomization tests:
Random sequences were generated for each gene conserving the base content at first, second, and third sites and for gene length. Start and stop codons were removed from the randomizations. During randomizations all sequences that contained an internal stop codon were discarded. The procedure was repeated until a total of 1000 random sequences were obtained.
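The randomization step can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes that "conserving the base content" means permuting the bases within each codon position, and the names `randomize_gene` and `randomizations` are hypothetical.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}

def randomize_gene(cds, rng, max_tries=10000):
    """Return one shuffled version of `cds` (start and stop codons already removed).

    Bases are permuted separately within the first, second, and third codon
    positions, so gene length and per-position base content are conserved.
    Candidates containing an internal stop codon are rejected (assumed scheme).
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    columns = [list(cds[i::3]) for i in range(3)]
    for _ in range(max_tries):
        for column in columns:
            rng.shuffle(column)
        codons = ["".join(triplet) for triplet in zip(*columns)]
        if not STOP_CODONS.intersection(codons):
            return "".join(codons)
    raise RuntimeError("no stop-free randomization found")

def randomizations(cds, n=1000, seed=0):
    """Collect n stop-free random sequences for one gene."""
    rng = random.Random(seed)
    return [randomize_gene(cds, rng) for _ in range(n)]
```

Rejecting whole candidates, rather than repairing individual stop codons, keeps the per-position base counts exactly equal to those of the real gene.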

Effective number of codons tests:
Effective number of codons (ENC) values were obtained for all sequences and values of original sequences were compared with the distribution of random sequences. As the ENC index has a cutoff at 61 and all sequences with greater values are adjusted to 61, the variance of the distribution was estimated on the basis of the median instead of the mean and by using only the lower half of the distribution.

Defining amino acids:
In all tests, nondegenerate amino acids (methionine and tryptophan) were not taken into account. For the majority of the amino acids all their alternative codons have the same bases at the first and second site. The exceptions are serine, arginine, and leucine, each encoded by six alternative codons. Each of these amino acids was treated in all tests as two independent amino acids, one of twofold degeneracy and one of fourfold degeneracy.
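The definition above amounts to grouping codons by amino acid and by their first two bases. A short sketch under the standard genetic code (the function name `codon_families` is a label chosen here) shows that this yields 21 codon families: 12 twofold, 1 threefold (isoleucine), and 8 fourfold.

```python
from collections import defaultdict

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order at each position.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
GENETIC_CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
                for i, a in enumerate(BASES)
                for j, b in enumerate(BASES)
                for k, c in enumerate(BASES)}

def codon_families():
    """Group sense codons by (amino acid, first two bases).

    Six-codon amino acids (Leu, Ser, Arg) split into a twofold and a
    fourfold family; single-codon amino acids (Met, Trp) and stop codons
    are dropped.
    """
    families = defaultdict(list)
    for codon, amino_acid in GENETIC_CODE.items():
        if amino_acid != "*":
            families[(amino_acid, codon[:2])].append(codon)
    return {key: sorted(codons)
            for key, codons in families.items() if len(codons) > 1}
```

Counting one free proportion fewer than the number of codons in each family gives 12 × 1 + 1 × 2 + 8 × 3 = 38, which agrees with the 38 binomial variables used later in METHODS.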

Background nucleotide bias model expectations:
To obtain expected proportions for each alternative codon correcting for background nucleotide content, all codons were split into three groups according to the number of different nucleotides (two, three, or four) that could appear at the third site without changing the amino acid encoded. The group of degeneracy two was further divided into two groups, those where the choice is between T and C and those ending in A or G. The expected proportions of each alternative codon for a given amino acid were derived from all the other sites with the same degree of degeneracy or greater (i.e., excluding the amino acid being analyzed). For example, for twofold degenerate amino acids that could use thymine or cytosine at the third site, expectations were calculated from the relative frequencies of thymine and cytosine at the third sites of all the other twofold degenerate amino acids with the same choice of nucleotides and of all the fourfold degenerate amino acids. For isoleucine, expectations were calculated from the relative frequencies of adenine, cytosine, and thymine in fourfold degenerate amino acids. Finally, for all the fourfold degenerate amino acids, only the distributions of nucleotides at the third sites of other fourfold degenerate sites were used for calculating expectations.

To minimize the uncertainty in the expected values, all cases with <30 sites to base the expectations on were eliminated. It should be noted that by using this model as null expectation, we are not taking into account the codon bias caused by dinucleotide biases.
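The background model can be illustrated with a short sketch. It is one reading of the description above rather than the authors' code: the pooling rule (families offering the same third-site choice, plus all fourfold families) follows the three worked examples in the text, and the names `codon_families` and `expected_proportions` are hypothetical.

```python
from collections import Counter, defaultdict

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def codon_families():
    """Codon families keyed by (amino acid, first two bases)."""
    families = defaultdict(list)
    for codon, amino_acid in CODE.items():
        if amino_acid != "*":
            families[(amino_acid, codon[:2])].append(codon)
    return {k: sorted(v) for k, v in families.items() if len(v) > 1}

def expected_proportions(cds, target, min_sites=30):
    """Expected codon proportions for the family `target` in one gene.

    Third-site bases are pooled over the other families that offer the same
    choice of bases and over all fourfold families (assumed pooling rule).
    Returns None when fewer than `min_sites` sites support the expectation.
    """
    families = codon_families()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    wanted = {codon[2] for codon in families[target]}
    pool = Counter()
    for key, codons in families.items():
        same_choice = {codon[2] for codon in codons} == wanted
        if key == target or not (same_choice or len(codons) == 4):
            continue
        for codon in codons:
            if codon[2] in wanted:
                pool[codon[2]] += counts[codon]
    total = sum(pool.values())
    if total < min_sites:
        return None
    return {codon: pool[codon[2]] / total for codon in families[target]}
```

For example, `expected_proportions(cds, ("F", "TT"))` returns the expected shares of TTT and TTC from the T and C counts at third sites of the other T/C-ending twofold families and of the fourfold families in the same gene.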

Probability of observed bias:
Proportions of observed and expected codon usage for each amino acid were represented in terms of the minimal number of binomial variables. For amino acids with two alternative codons, codon usage is represented in terms of one variable, the frequency of one codon over the number of times the amino acid is present. For three alternative codons A, B, and C, codon usage can be represented with two variables: (a) the proportion of codon A over the total number of times the amino acid is present and (b) the proportion of codon B from the sum of frequencies of codons B and C. For amino acids with four alternative codons A, B, C, and D, the proportions of codon usage are represented by three variables: (a) the proportion of codons A + B over the frequency of the amino acid, (b) the proportion of codon A over the sum of frequencies of codons A + C, and (c) the proportion of B over the sum of frequencies of codons B + D. Under this method, the distribution of codon usage of a gene, and the expected one, can be represented by 38 binomial variables. All sequences in which not all the variables could be assessed were excluded from analysis (leaving n = 1629). To estimate the probability of the bias observed for each gene under the null hypothesis, the deviation from expectation for each variable was represented in terms of the numbers of standard deviations away from the mean (z). The standard deviation for a binomial variable can be defined as
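In standard notation, for a binomial variable with expected proportion p observed over n codons, the standard deviation of the proportion and the corresponding z score for an observed proportion o can be written as follows (the symbols p, n, and o are labels assumed here and are not necessarily the authors' notation):

$$\sigma = \sqrt{\frac{p\,(1 - p)}{n}}, \qquad z = \frac{o - p}{\sigma}$$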

The squared z values for each of the 38 variables were calculated and then summed to obtain the overall score (x),

$$x = \sum_{i=1}^{38} z_i^2.$$

Assuming that the binomial variables are normally distributed, the probability of occurrence of the observed bias can then be calculated with a χ² distribution of 38 d.f. that has the following probability density function (given here in its standard form, with k = 38):

$$f(x) = \frac{x^{k/2 - 1}\, e^{-x/2}}{2^{k/2}\, \Gamma(k/2)}, \qquad x > 0.$$
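The calculation can be sketched in a few lines. This is a minimal illustration under stated assumptions: the z score uses the binomial standard deviation of a proportion, and the upper-tail probability relies on the closed form that the χ² distribution admits for an even number of degrees of freedom (38 here). The function names are hypothetical.

```python
import math

def z_score(observed_count, n, expected_p):
    """Standard deviations separating an observed proportion from expectation."""
    sd = math.sqrt(expected_p * (1.0 - expected_p) / n)
    return (observed_count / n - expected_p) / sd

def chi2_upper_tail(x, df):
    """P(X >= x) for a chi-square variable with an even number of d.f.

    For df = 2m the tail equals exp(-x/2) * sum_{j<m} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

def bias_probability(variables):
    """variables: (observed_count, n, expected_p) for each binomial variable.

    Returns the summed squared z score and its upper-tail probability,
    with one degree of freedom per variable.
    """
    variables = list(variables)
    x = sum(z_score(o, n, p) ** 2 for o, n, p in variables)
    return x, chi2_upper_tail(x, len(variables))
```

With all 38 variables assessed for a gene, `bias_probability` gives the probability of a bias at least as large as the one observed under the background nucleotide model.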

Analysis of overall bias:
All sequences were concatenated into a single large sequence. Observed and expected codon distributions were obtained as previously described for individual genes. The probability of the observed bias or greater from background nucleotide bias model expectations for each amino acid with n alternative codons was estimated using two standard methods of goodness of fit that approximate the χ² distribution with n - 1 d.f.:

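A. χ² test:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

B. G-test:

$$G = 2 \sum_{i=1}^{n} O_i \ln\frac{O_i}{E_i}$$

Both statistics are written in their standard form, with O_i and E_i the observed and expected counts of the ith alternative codon.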

Probabilities were estimated for the values obtained by comparing with cumulative χ² distributions.
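Both goodness-of-fit statistics are simple to compute from codon counts. The sketch below uses the standard Pearson and likelihood-ratio definitions; the function names and the convention of passing expected proportions (rather than expected counts) are choices made here.

```python
import math

def chi_square_stat(observed, expected_props):
    """Pearson chi-square statistic from observed counts and expected proportions."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed, expected_props))

def g_stat(observed, expected_props):
    """Likelihood-ratio (G) statistic; codons with zero count contribute nothing."""
    n = sum(observed)
    return 2.0 * sum(o * math.log(o / (n * p))
                     for o, p in zip(observed, expected_props) if o > 0)
```

For an amino acid with n alternative codons, either value is then referred to a χ² distribution with n - 1 d.f.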

Dinucleotide analysis:
Dinucleotide proportions were obtained for sites first to second, second to third, and third to first for each gene. Expected proportions for a given dinucleotide (Ed) at sites s2 and s3 given the nucleotide content in the sequence at each site were calculated as
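Consistent with the definitions given in the next sentence, the expectation is the product of the two site-specific nucleotide proportions (the subscript layout is assumed here):

$$E_{d,s_2,s_3} = p(n_2)\, p(n_3)$$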

where p(n2) and p(n3) are the proportions of the second and third nucleotides of the dinucleotide at sites s2 and s3. Dinucleotide bias (DnB) was estimated for each gene by

where Od,s2,s3 is the observed proportion of the dinucleotide.
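The observed and expected dinucleotide proportions for sites s2 and s3 can be tabulated as in the sketch below. It stops short of the DnB summary itself and only returns the observed and expected values per dinucleotide; the function name `dinucleotide_table` is hypothetical.

```python
from collections import Counter

def dinucleotide_table(cds):
    """Observed and expected dinucleotide proportions at codon sites 2 and 3.

    The expected proportion of a dinucleotide is the product of the
    proportions of its two nucleotides at the second and third sites.
    Returns {dinucleotide: (observed, expected)}.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    n = len(codons)
    second = Counter(codon[1] for codon in codons)
    third = Counter(codon[2] for codon in codons)
    pairs = Counter(codon[1:] for codon in codons)
    table = {}
    for a in "ACGT":
        for b in "ACGT":
            expected = (second[a] / n) * (third[b] / n)
            table[a + b] = (pairs[a + b] / n, expected)
    return table
```

The same tabulation applies to sites first to second; for third to first, the pairs are formed from the third base of one codon and the first base of the next.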


RESULTS

Background nucleotide content alone does not explain the codon bias observed in mammalian genes:
The extent of codon usage bias in human genes is largely dictated by the nucleotide content of the chromosomal region in which the gene resides (BERNARDI 1995). Does this alone explain the degree of codon usage bias? We studied codon usage bias in a sample of 2396 human genes. As a first approach to investigate whether the observed codon bias can be explained by nucleotide biases at synonymous sites, 1000 random sequences were generated for each gene, conserving gene length and the base content per site. ENC values (WRIGHT 1990) were obtained as a measure of codon usage bias for original and random sequences. We then compared the ENC for the real sequences to the distribution of random sequences. Over one-half of the genes deviated more than any of the random sequences generated, and 81% deviated significantly at α = 0.05 (see METHODS).

However, because the ENC has a cutoff at 61 it has limited use for sequences with low codon usage bias. In addition, randomizing the first and second positions could potentially influence the distribution of ENC values in the random sequences. We therefore performed a second test to estimate the probability of the bias observed, by comparing the proportions of alternative codons of each sequence from the same sample of genes to expected proportions based on the nucleotide content at the third sites of each sequence. If the bias from equiprobability of a gene can be explained by the nucleotide content of that gene, then the null expectation would be that frequencies for each codon should match proportions of bases at the third site of all the amino acids with the same degree of degeneracy in that gene. To estimate the probability of obtaining the observed bias or greater under the null hypothesis for each gene, the observed and expected frequencies of codon usage for each amino acid were represented in terms of the minimal set of binomial variables, which is a method to approximate a multinomial distribution (see METHODS). The probability of obtaining the observed bias or greater under null expectation was estimated by summing the squared z values of distances of observed from expected for each binomial variable and comparing this with a standard χ² distribution (see METHODS). Significant deviations from expected (defined as P < 0.01) were found for 99% of the genes in the sample (data not shown). We conclude that "background" nucleotide content explains some, but by no means all, of the observed codon usage bias. Dinucleotide biases are also known to affect the bias in codon usage (HANAI and WADA 1988; KARLIN and MRAZEK 1996), so it was expected that some proportion of the genes would be more biased than expected from the background nucleotide content.

Prior methods for assessing codon usage bias have limitations:
There are several methods to measure codon usage bias; however, many of them require a known set of preferred codons estimated from highly expressed genes. ENC (WRIGHT 1990) is a popular method that does not assume preferred codons but is not especially suitable for statistical analysis, as it does not allow testing null hypotheses for codon usage distribution other than equiprobability. KARLIN and MRAZEK (1996) proposed an alternative method that permits the introduction of values of an expected distribution. However, we found that this method is sensitive to biases in the use of amino acids of different degrees of degeneracy; i.e., the proportion of fourfold degenerate amino acids of a sequence correlates with the index of codon usage bias (r² = 14.1%; Figure 3A).



Figure 3. Correlation of codon usage bias and the proportion of fourfold degenerate sites in the sequence, using (A) the Karlin and Mrazek (K&M) method (K&M = 0.136 + 0.588 fourfold degeneracy, r² = 14.1%; P < 0.001) and (B) the MCB method (MCB = 0.535 + 0.145 fourfold degeneracy, r² = 0.4%; P = 0.002).

Maximum-likelihood codon bias is a new method for determining codon usage bias correcting for background nucleotide content:
Given the limitations of the available methods, we chose to develop an alternative method that is easy to obtain and not sensitive to amino acid biases. We wanted a method that could measure the degree of nonrandomness in the use of alternative codons that is minimally affected by the presence of rare amino acids. In addition the method should allow testing of a variety of null hypotheses for codon distribution (i.e., not just equiprobability of occurrence); in this article we use this method to correct for background nucleotide content, but it can be used to correct for dinucleotide biases as well.

The use of alternative codons can be thought of as an ensemble of several random variables, one per amino acid, each with two to six possible different outcomes or codons (amino acids encoded by only one codon cannot have codon usage bias), and each outcome with an associated probability of appearance. Each specific distribution of outcomes is a vector and the codon bias for one amino acid is the distance of the observed vector from the expected one. However, to obtain an index of codon usage bias for a complete gene, the biases of individual amino acids have to be added in a sensible way. Different amino acids within a gene vary in two aspects: frequency within a sequence and their degree of degeneracy. If an amino acid is rare, then the observed distribution is more likely to be far from the expected just by chance; therefore the bias of a rare amino acid should be downscaled to have less impact on the overall index of codon bias. The different amino acids also vary in the number of alternative codons by which they are encoded and this should also be taken into account when biases from different amino acids are to be added.

Taking into account the two aspects discussed above, we developed a new method that is easy to calculate and allows us to test different models to explain codon usage bias. The bias of an individual amino acid BA with frequency NA of level of degeneracy T, having the observed Oc and expected Ec proportions for each alternative codon, is obtained by

The bias for a gene Bg can then be obtained by summing over all amino acids,

where A is the number of amino acids contributing to the index.
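The weighting idea described in this section can be illustrated with a minimal sketch. The specific choices below (a chi-square-like distance between observed and expected codon proportions for each amino acid, and a weight equal to the logarithm of the amino acid's frequency) are assumptions made for illustration only and should not be read as the authors' definition of the index; the function names are hypothetical.

```python
import math

def amino_acid_bias(observed_props, expected_props):
    """Distance of observed from expected codon proportions (assumed form)."""
    return sum((o - e) ** 2 / e
               for o, e in zip(observed_props, expected_props))

def weighted_bias_index(families):
    """Average per-amino-acid bias, downweighting rare amino acids.

    families: list of (n_a, observed_props, expected_props), where n_a is
    the number of times the amino acid occurs in the gene. The log(n_a)
    weight is an illustrative assumption: an amino acid seen once gets
    weight zero, and the weight grows slowly with frequency.
    """
    terms = [amino_acid_bias(obs, exp) * math.log(n_a)
             for n_a, obs, exp in families if n_a > 0]
    if not terms:
        raise ValueError("no amino acids contribute to the index")
    return sum(terms) / len(terms)
```

Dividing by the number of contributing amino acids keeps genes with missing amino acids comparable with complete ones, which is the role the text assigns to A.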

All genes where more than five amino acids were missing or no index could be estimated were removed from all comparisons (leaving n = 2387). We named the method maximum-likelihood codon bias (MCB), because the contribution of each amino acid's bias to the index is weighted by an estimate of the likelihood of occurrence of bias in that amino acid, given its frequency and degree of degeneracy. Nevertheless, MCB is not a maximum-likelihood method in a strict sense. We believe this method would be useful for interspecies comparisons by allowing correction for differences in nucleotide composition. Importantly, MCB is minimally affected by biases in amino acid content of different degrees of degeneracy (r² = 0.4%; Figure 3B) and appears to effectively remove the influence of background GC content (compare Figure 2A and Figure 2B).

It should be noted that with any procedure that estimates the distance from randomness, the size of the sample of events affects the variance that is expected; since the length of genes varies it is expected that this would influence the MCB values that are obtained. Therefore it is important to carefully study the relation of gene length with the variables that are being tested against codon usage values. A script for calculating MCB is available from the authors.

MCB covaries with breadth of expression and rates of synonymous substitution:
Expected distributions for each codon family were derived from the base composition of all third sites with the same or greater level of degeneracy within a given sequence (according to METHODS) and MCB values were obtained for all genes in the data sample. If the residual biases in codon usage, once correcting for nucleotide content, are due to selection then we could expect (a) higher bias in more broadly expressed genes, (b) consistently preferred codons, or (c) an inverse correlation with levels of synonymous substitutions (Ks).

We assessed the effect of breadth of expression on codon usage bias in our sample (see METHODS). Breadth of expression is not a direct measure of expression rate and therefore we may not necessarily be analyzing the key parameter. Nonetheless, the breadth of expression is known to covary with the intensity of purifying selection acting on the nonsynonymous sites (DURET and MOUCHIROUD 2000), so it may reasonably be taken as a covariate of the strength of purifying selection. To assess the interaction between breadth of expression and codon usage bias the sample was divided into three groups according to the number of tissues in which the genes are expressed: (1) genes expressed in up to 5 tissues (n = 1242), (2) genes expressed in more than 5 but not more than 10 tissues (n = 494), and (3) genes expressed in between 11 and 15 tissues (n = 272). The levels of codon bias in the three groups were significantly different from each other (5 vs. 10, P = 0.001; 10 vs. 15, P < 0.001; 5 vs. 15, P < 0.001; Kruskal-Wallis test). In all comparisons genes expressed in fewer tissues tend to have a lower MCB value. The regression of MCB on the number of tissues is consistent with this result (P < 0.001, r² = 3.1%; see Figure 4). This result suggests that genes with broader expression show a higher degree of codon usage bias. These results cannot be explained by compositional biases caused by transcription-coupled mutational biases (e.g., a higher rate of C -> T mutations in "breathing DNA") since the MCB method already takes into account gene-specific background nucleotide concentrations.



Figure 4. The relationship between MCB and expression patterns. Average values of MCB and standard error bars are shown for the genes divided into five groups according to the number of tissues where they are expressed: 1–3, 4–6, 7–9, 10–12, and 13–15 (n = 884, 484, 293, 203, and 144, respectively).

An inverse correlation between codon usage bias and rates of silent site substitutions has been observed in bacteria (SHARP and LI 1987), Drosophila (POWELL and MORIYAMA 1997), and yeast (L. D. HURST, unpublished data). If codon usage bias is due to selective pressures then it is expected that genes with higher codon usage bias would have lower rates of synonymous substitutions, although the effect may be weak. When rates of synonymous substitutions (compared to mouse and rat orthologs; DURET and MOUCHIROUD 2000), using LI's (1993) method (Li93) and removing tandem substitutions (data as in DURET and MOUCHIROUD 1999), were plotted against MCB values, we observed an inverse correlation between rates of silent site substitution and MCB values (r² = 1.2%, P < 0.001; see Figure 5). A similar result is obtained (r² = 1.4%, P < 0.001) when comparing MCB values with rates of substitution at the fourfold degenerate sites, using the Tamura and Nei protocol after removing tandem substitutions. While this result is consistent with selection, it must be treated with caution owing to the fact that estimators of Ks may be biased when nucleotide content is biased (DUNN et al. 2001). Indeed, the correlation is not present (or at most only weakly suggested) if instead we apply the maximum-likelihood method of Goldman and Yang (P = 0.056, r² = 0.2%).



Figure 5. The relationship between the rate of silent site evolution (Ks) and codon usage bias. Average value and standard error bars of Ks (human-rodent comparison using the Li method; DURET and MOUCHIROUD 2000) are shown for genes grouped by MCB values: 0.21–0.39, 0.40–0.57, 0.58–0.75, 0.76–0.93, and ≥0.94 (n = 153, 938, 874, 264, and 67, respectively).

The covariance with expression breadth and synonymous substitution rates is also found when the Karlin and Mrazek method is used as the measure of codon bias (breadth of expression, P < 0.001, r² = 0.9%; synonymous substitution rate using Li93, P = 0.002, r² = 0.4%). Although both the MCB and Karlin and Mrazek indices correlate significantly with synonymous rates of substitution and breadth of expression, these weak correlations should be interpreted cautiously.

Codon preferences:
The above results are suggestive of a role for selection. If selection is to explain the above effects then we should also expect to see certain codons repeatedly being favored among genes. To investigate if the observed biases were favoring specific codons over others we performed an overall analysis of the whole sample by concatenating all genes into one large sequence. If the biases are due to factors specific for individual genes these should cancel each other out in the whole sample. The proportions for each alternative codon were obtained and compared with expectations from the nucleotide biases. Significant differences from expectations were observed for all of the amino acids that have two or more synonymous codons using the two tests of goodness of fit (P < 0.001, see METHODS).

A more conservative test is to investigate the consistency of the direction of the biases for individual genes. If there is no significant tendency favoring a particular set of codons then it is expected that a codon would be overrepresented one-half of the times that it deviates from expectation. The majority of the codons have significantly less heterogeneity than expected by chance and some were biased in one direction in 90% of the genes (Table 1). The above results are consistent with selective pressures favoring specific codons. Were this the result of selection we can predict that tRNA levels should be more highly skewed for the amino acids showing bias than for those showing little bias, as has been shown for other species (cf. SHARP et al. 1995; MORIYAMA and POWELL 1997; KANAYA et al. 1999); however, LANDER et al. (2001) did not find support for this prediction in human genes.
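The null expectation stated above (a codon is overrepresented in one-half of the genes in which it deviates) lends itself to an exact sign test. The sketch below is one way to carry it out and is not necessarily the procedure behind Table 1; the function name is hypothetical.

```python
import math

def two_sided_sign_test(k, n):
    """Exact two-sided binomial P value for k over-representations in n genes.

    Under the null hypothesis each deviation is equally likely to be in
    either direction (p = 0.5), so the distribution is symmetric and the
    two-sided P value is twice the smaller tail, capped at 1.
    """
    tail = min(k, n - k)
    one_sided = sum(math.comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2.0 * one_sided)
```

A codon overrepresented in 90 of 100 genes, for instance, gives a P value many orders of magnitude below conventional thresholds, whereas 5 of 10 gives P = 1.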


 

 
Table 1. Heterogeneity of direction of bias for each codon

Expression breadth and synonymous substitution patterns are most probably due to gene length effects:
The above results are suggestive of selection possibly playing a role in codon usage bias in humans. However, as stated earlier, genes of different length are likely to have different MCB values owing to the nature of the method. Indeed, if we randomize our sequences and measure the mean MCB for 1000 simulants for each of our genes, we find that the MCB, on average, is higher for shorter genes. This is to be expected of any statistic that employs a multinomial distribution and applies equally to the method of Karlin and Mrazek.

Importantly, in our data set longer genes have a slightly higher rate of synonymous substitutions and are not expressed in as broad a range of tissues. Therefore, plotting mean MCB for the randomized genes against breadth of expression for the real gene, we still find a weak positive correlation of the order of magnitude reported for the real genes (P < 0.001, r² = 4.0%). Likewise we find in the mean MCB vs. Ks regression a weak negative correlation of about the order reported for the real genes [Li93, P = 0.001, r² = 0.6%; Tamura and Nei method (TN93), P = 0.002, r² = 0.5%]. Moreover, when we subtract the average bias of the random sequences from the bias of the real sequences, the correlation with breadth of expression disappears and that with rates of substitution weakens considerably (expression, P = 0.348, r² = 0.01%; Ks Li93, P = 0.014, r² = 0.03%). Therefore, the most conservative interpretation of our data is that MCB does covary with expression breadth and Ks, but this is likely to be because larger genes tend to have lower expression breadth and higher rates of silent site substitution. The data appear not to support the hypothesis that covariance is due to selection on codon usage per se. It should be noted that for 96% of the sequences the MCB value of the real data was higher than the mean value for the random sequences.
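The length correction described above amounts to subtracting, for each gene, the mean bias of its randomized sequences from the bias of the real sequence, and then repeating the regression on the residuals. A minimal sketch, with hypothetical function names and the squared Pearson correlation standing in for the reported r²:

```python
from statistics import mean

def length_corrected_bias(real_bias, simulated_biases):
    """Bias of the real gene minus the mean bias of its randomizations."""
    return real_bias - mean(simulated_biases)

def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long samples."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)
```

Because the randomizations conserve gene length, whatever part of the index is driven by length alone is shared by the real and simulated values and cancels in the difference.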

Dinucleotide effects and preferred codons:
We are left trying to understand why there is such a large residual variance in codon usage after background nucleotide content is taken into account. One possibility is that the biases are caused by mutation biases or selection associated with particular dinucleotides. We performed a dinucleotide analysis on the whole sample (see METHODS) and found that the sequences of the sample show significant deviations in dinucleotide frequencies from the expectations based on nucleotide content, consistent with previous observations (KARLIN and MRAZEK 1996). Dinucleotide bias explains part of the codon usage bias that we find in sequences in our sample when correcting for background nucleotide content. Most notably both TA and CG are avoided. It has been suggested that the dearth of CpG is probably related to the mutation of methylated CpG sites to TpG dinucleotides. By contrast, the dearth of TA may be owing to selection related to the susceptibility of UA in mRNA to RNase activity (BEUTLER et al. 1989; but see DURET and GALTIER 2000). A similar bias is also found in noncoding regions (KARLIN and MRAZEK 1996), suggesting some RNA-independent mechanism; however, the bias is significantly more profound in coding regions.

It is not the case, however, that dinucleotide effects can explain the totality of the residual bias. If there are significant biases that cannot be explained by dinucleotide effects, then this should be revealed by comparing the relative frequencies of the codons that encode amino acids with the same degree of degeneracy and that share the second nucleotide in their codons. So, for example, if dinucleotide biases explain codon usage bias, then the relative frequency of A-ending codons among the codons that specify glutamine (CAA, CAG) should be the same as the relative frequency of the A-ending codons among the codons that specify glutamic acid (GAA, GAG).

For each gene, the relative frequency of each codon was calculated with respect to the other codons that encode the same amino acid. Amino acids whose codons have the same nucleotide at the second site and that have the same type of degeneracy were grouped. Three such groups can be formed: (1) tyrosine, histidine, asparagine, and aspartic acid, of twofold degeneracy; (2) glutamine, lysine, and glutamic acid, of twofold degeneracy; and (3) proline, threonine, and alanine, of fourfold degeneracy. Within each group of amino acids, subgroups were formed of those codons that have the same nucleotides at the second and the third sites. Within each subgroup the relative frequencies of codons were compared against each other with Mann-Whitney tests. A total of 21 comparisons were made (for amino acids with twofold degeneracy, only one subgroup of codons was formed since the second subgroup is complementary). In this test, variations in GC content do not affect the comparisons because the relative frequencies for each amino acid are calculated with respect to the other codons that encode the same amino acid. If there are no diamino acid biases or sources of bias other than dinucleotide effects, we expect the relative proportions of codons of different amino acids to be nearly identical (i.e., not significantly different), since they are expected to interact with similar proportions of nucleotides at the first position of the next codon. The largest difference was found in the comparison of the codons CAA and GAA (encoding glutamine and glutamic acid, respectively), with mean frequencies of 0.24 and 0.38, respectively. Of all 21 possible comparisons within subgroups, only the comparison of the codons TAT and AAT (encoding tyrosine and asparagine, respectively) was not significant at α = 0.05. All but 4 were significantly different at α = 0.01.
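The count of 21 comparisons can be checked by enumerating the subgroups, as in the Python sketch below. The choice of the T-ending subgroup for group 1 and the A-ending subgroup for group 2 is an assumption based on the examples given in the text (TAT vs. AAT and CAA vs. GAA); choosing the complementary subgroup would give the same count. The small U-statistic helper is a hand-written illustration of the Mann-Whitney comparison, not the software used in the study, and it does not compute P values.

```python
from itertools import combinations

# Each entry: (first two bases of each amino acid's codons, third bases defining subgroups).
GROUPS = [
    ({"Tyr": "TA", "His": "CA", "Asn": "AA", "Asp": "GA"}, "T"),    # twofold; C-ending is complementary
    ({"Gln": "CA", "Lys": "AA", "Glu": "GA"}, "A"),                 # twofold; G-ending is complementary
    ({"Pro": "CC", "Thr": "AC", "Ala": "GC"}, "TCAG"),              # fourfold; four subgroups
]

def comparisons(groups):
    """All pairs of codons that share second and third bases within a group."""
    pairs = []
    for prefixes, third_bases in groups:
        for third in third_bases:
            codons = [prefix + third for prefix in prefixes.values()]
            pairs.extend(combinations(codons, 2))
    return pairs

def mann_whitney_u(xs, ys):
    """U statistic for xs: pairs (x, y) with x > y count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)

pairs = comparisons(GROUPS)
print(len(pairs), "comparisons")
print(("CAA", "GAA") in pairs, ("TAT", "AAT") in pairs)
```

In practice each pair would be tested on the per-gene relative frequencies of the two codons, with a statistics package supplying the P value for the U statistic.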

Some of the differences observed might be due to trinucleotide biases, diamino acid biases, or more elaborate mutation patterns. These results show, however, that dinucleotide effects alone cannot account for all of the observed distribution of codons in human genes.


*  DISCUSSION

We found that codon usage bias in mammalian genes is not completely explained by variation in background nucleotide content. We therefore developed a method to study the influence of other variables on codon usage. Unlike other methods, ours appears to be insensitive to the influence of rare amino acids. When we apply this method to a sample of human sequences, correcting expected distributions for background nucleotide content, we find that codon usage bias covaries with breadth of expression and is inversely correlated with the rate of synonymous substitution. This could suggest selective pressures related to translation efficiency, as has been conjectured (DEBRY and MARZLUFF 1994). However, the fact that these two correlations disappear when the effect of gene length is included suggests that gene length could be a more relevant variable and that the suggestive results are just artifacts. It is nonetheless interesting to find a weak tendency for short genes to have broader expression and lower synonymous substitution rates. These effects are, however, so weak that it may be improper to suppose that they have any biological meaning.

We also observe that certain codons are consistently over- or underrepresented. This pattern can be explained in part by dinucleotide biases that also influence codon usage. However, we have also shown that not all the bias can be explained by such a simple mutational bias. While the cause of the remaining bias is uncertain, we fail to provide support for the hypothesis that codon usage bias is owing to selection.

Can we be sure that selection does not affect codon usage in mammals? While the above results would tend to suggest an absence of selection, as might be assumed to be the dominant position, several caveats must be noted. First, the dearth of TA dinucleotides may be a result of selection, as we discussed. However, DURET and GALTIER (2000) argue that this is a methodological artifact. Second, our expression analysis looked at breadth of expression, not rate of expression. Nonetheless, the breadth of expression is known to covary with the intensity of purifying selection acting on the nonsynonymous sites (DURET and MOUCHIROUD 2000), so it may reasonably be taken as a covariate of the strength of purifying selection.

Third, we need to understand how to reconcile the present findings with the observation that gene expression increases dramatically when foreign sequences, to be expressed in mammalian cells, are modified to avoid rare codons. One possibility is that negative results are not reported and therefore we are left only with the cases in which the increase in expression could be due to the change in some synonymous sites rather than the effect of codon usage per se. On the other hand, this observation could indeed be indicative of selective pressures related to translation efficiency acting on codon usage distributions. However, because we are using a method that measures distance from random use, rather than the degree to which optimal codons are used, we might not have adequate resolution to detect the patterns. Using a method to measure codon bias based on the degree of use of optimal codons, but correcting for the background nucleotide bias, could allow recovery of evidence of weak selective pressures acting on coding sequences in mammals. In the meantime, we may conclude that codon usage bias covaries with expression breadth and the rate of synonymous evolution in humans but that this is not evidence for selection.


*  ACKNOWLEDGMENTS

We thank Laurent Duret, Brian Charlesworth, Jody Hey, and two anonymous referees for comments on an earlier version of the manuscript. This work was funded by a grant from Conacyt to A.O.U. and by the Biotechnology and Biological Sciences Research Council (BBSRC) and the Royal Society to L.D.H.

Manuscript received May 22, 2001; Accepted for publication August 15, 2001.


*  LITERATURE CITED

BERNARDI, G., 1995 The human genome: organization and evolutionary history. Annu. Rev. Genet. 29:445-476.

BERNARDI, G., D. MOUCHIROUD and C. GAUTIER, 1997 Isochores and synonymous substitutions in mammalian genes, pp. 197–208 in DNA and Protein Sequence Analysis, edited by M. J. BISHOP and C. J. RAWLINGS. IRL Press, Oxford.

BEUTLER, E., T. GELBART, J. H. HAN, J. A. KOZIOL, and B. BEUTLER, 1989 Evolution of the genome and the genetic-code: selection at the dinucleotide level by methylation and polyribonucleotide cleavage. Proc. Natl. Acad. Sci. USA 86:192-196.

DEBRY, R. W. and W. F. MARZLUFF, 1994 Selection on silent sites in the rodent H3 histone gene family. Genetics 138:191-202.

DUNN, K. A., J. P. BIELAWSKI, and Z. H. YANG, 2001 Substitution rates in Drosophila nuclear genes: implications for translational selection. Genetics 157:295-305.

DURET, L. and N. GALTIER, 2000 The covariation between TpA deficiency, CpG deficiency, and G + C content of human isochores is due to a mathematical artifact. Mol. Biol. Evol. 17:1620-1625.

DURET, L. and L. D. HURST, 2001 The elevated GC content at exonic third sites is not evidence against neutralist models of isochore evolution. Mol. Biol. Evol. 18:757-762.

DURET, L. and D. MOUCHIROUD, 1999 Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, Arabidopsis. Proc. Natl. Acad. Sci. USA 96:4482-4487.

DURET, L. and D. MOUCHIROUD, 2000 Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17:68-74.

EYRE-WALKER, A. C., 1991 An analysis of codon usage in mammals: selection or mutation bias. J. Mol. Evol. 33:442-449.

GOUY, M. and C. GAUTIER, 1982 Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 10:7055-7074.

GOUY, M., F. MILLERET, C. MUGNIER, M. JACOBZONE, and C. GAUTIER, 1984 Acnuc: a nucleic-acid sequence data-base and analysis system. Nucleic Acids Res. 12:121-127.

HANAI, R. and A. WADA, 1988 The effects of guanine and cytosine variation on dinucleotide frequency and amino-acid composition in the human genome. J. Mol. Evol. 27:321-325.

IKEMURA, T. and K. WADA, 1991 Evident diversity of codon usage patterns of human genes with respect to chromosome-banding patterns and chromosome-numbers: relation between nucleotide-sequence data and cytogenetic data. Nucleic Acids Res. 19:4333-4339.

KANAYA, S., Y. YAMADA, Y. KUDO, and T. IKEMURA, 1999 Studies of codon usage and tRNA genes of 18 unicellular organisms and quantification of Bacillus subtilis tRNAs: gene expression level and species-specific diversity of codon usage based on multivariate analysis. Gene 238:143-155.

KARLIN, S. and J. MRAZEK, 1996 What drives codon choices in human genes? J. Mol. Biol. 262:459-472.

KING, J. L. and T. H. JUKES, 1969 Non-Darwinian evolution. Science 164:788-798.

LANDER, E. S., L. M. LINTON, B. BIRREN, C. NUSBAUM, and M. C. ZODY et al., 2001 Initial sequencing and analysis of the human genome. Nature 409:860-921.

LEVY, J. P., R. R. MULDOON, S. ZOLOTUKHIN, and C. J. LINK, 1996 Retroviral transfer and expression of a humanized, red-shifted green fluorescent protein gene into human tumor cells. Nat. Biotechnol. 14:610-614.

LI, W.-H., 1993 Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J. Mol. Evol. 36:96-99.

MARAIS, G., D. MOUCHIROUD, and L. DURET, 2001 Does recombination improve selection on codon usage? Lessons from nematode and fly complete genomes. Proc. Natl. Acad. Sci. USA 98:5688-5692.

MORIYAMA, E. N. and J. R. POWELL, 1997 Codon usage bias and tRNA abundance in Drosophila. J. Mol. Evol. 45:514-523.

POWELL, J. R. and E. N. MORIYAMA, 1997 Evolution of codon usage bias in Drosophila. Proc. Natl. Acad. Sci. USA 94:7784-7790.

SHARP, P. M. and W. H. LI, 1987 The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias. Mol. Biol. Evol. 4:222-230.

SHARP, P. M., T. M. F. TUOHY, and K. R. MOSURSKI, 1986 Codon usage in yeast: cluster-analysis clearly differentiates highly and lowly expressed genes. Nucleic Acids Res. 14:5125-5143.

SHARP, P. M., M. AVEROF, A. T. LLOYD, G. MATASSI, and J. F. PEDEN, 1995 DNA-sequence evolution: the sounds of silence. Philos. Trans. R. Soc. Lond. Ser. B 349:241-247.

STENICO, M., A. T. LLOYD, and P. M. SHARP, 1994 Codon usage in Caenorhabditis elegans: delineation of translational selection and mutational biases. Nucleic Acids Res. 22:2437-2446.

WELLS, K. D., J. A. FOSTER, K. MOORE, V. G. PURSEL, and R. J. WALL, 1999 Codon optimization, genetic insulation, and an rtTA reporter improve performance of the tetracycline switch. Transgenic Res. 8:371-381.

WRIGHT, F., 1990 The effective number of codons used in a gene. Gene 87:23-29.

ZHOU, J., W. J. LIU, S. W. PENG, X. Y. SUN, and I. FRAZER, 1999 Papillomavirus capsid protein expression level depends on the match between codon usage and tRNA availability. J. Virol. 73:4972-4982.

ZOLOTUKHIN, S., M. POTTER, W. W. HAUSWIRTH, J. GUY, and N. MUZYCZKA, 1996 A "humanized" green fluorescent protein cDNA adapted for high-level expression in mammalian cells. J. Virol. 70:4646-4654.



