Department of Biological Sciences, University of Iowa, Iowa City, Iowa 52242
1 Address for correspondence: Department of Biological Sciences, University of Iowa, 212 Biology Bldg. (BB), Iowa City, IA 52242.
E-mail: josep-comeron{at}uiowa.edu
| ABSTRACT |
|---|
The influence of selection on gene composition, especially on the unequal use of synonymous codons, has been the archetypal example of a trait under weak selection in species with large Ne (SHARP and LI 1986; LI 1987; SHIELDS et al. 1988; KLIMAN and HEY 1993; MORIYAMA and HARTL 1993; HARTL et al. 1994; AKASHI 1995, 1996, 2003; POWELL and MORIYAMA 1997; COMERON et al. 1999; DURET and MOUCHIROUD 1999; LLOPART and AGUADE 2000; BEGUN 2001; MCVEAN and VIEIRA 2001; DURET 2002; HEY and KLIMAN 2002; CARLINI and STEPHAN 2003). Indeed, two different features are observed in several model eukaryotes such as Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana. First, differences in synonymous codon usage are associated with differences in expression levels (IKEMURA 1985; SHARP and LI 1987; DURET and MOUCHIROUD 1999; DURET 2000). Second, highly expressed genes show a set of synonymous codons that correspond mainly to abundant tRNAs (IKEMURA 1985; MORIYAMA and POWELL 1997; DURET and MOUCHIROUD 1999; COGHLAN and WOLFE 2000; DURET 2000). The combination of both observations strongly supports the action of selection at the level of translational efficiency (i.e., translational selection), increasing either accuracy or speed of translation (IKEMURA 1985; BULMER 1991; KURLAND 1992).
Nevertheless, the evidence supporting translational selection in humans has been, at best, arguable (EYRE-WALKER 1999; IIDA and AKASHI 2000; URRUTIA and HURST 2001; DURET 2002; GALTIER 2003) and the isochoric structure of the genome is, with certainty, the most influential factor shaping synonymous composition. Other factors, such as multiple-splicing forms, methodological biases introduced by the use of serial analysis of gene expression (SAGE) or expressed sequence tag (EST) studies (MARGULIES et al. 2001), pooling of data from different tissues, and the overlaying effect of selection at certain synonymous sites influencing pre-mRNA structures (SHEN et al. 1999; DUAN et al. 2003), might all have played a part in the inability to detect reliable patterns of translational selection in the human lineage.
On the other hand, it is well known that intron size and presence vary considerably between homologous genes. Several studies have applied population-genetic techniques to provide a primary insight on modes of selection that could explain the proliferation and maintenance of spliceosomal introns as well as their variation in size (STEPHAN et al. 1994; LEICHT et al. 1995; CARVALHO and CLARK 1999; COMERON and KREITMAN 2000; LLOPART et al. 2002; LYNCH 2002; SCHAEFFER 2002; PARSCH 2003). Moreover, recent studies in humans (CASTILLO-DAVIS et al. 2002; URRUTIA and HURST 2003) suggest the action of selection favoring short introns in highly expressed genes, possibly due to the beneficial effects of reducing transcriptional costs (time and energy; CASTILLO-DAVIS et al. 2002). Nonetheless, the evolution of introns cannot be understood independently of the known impact of intronic sequences on downstream mRNA metabolism (SUN and MAQUAT 2000; ZHOU et al. 2000; LE HIR et al. 2001; YU et al. 2002), splicing efficiency (KLINZ and GALLWITZ 1985; STERNER et al. 1996), and overall gene regulation.
Here, we report a comprehensive study of the influence of gene expression on both composition and intron presence and size in human protein-coding genes with no evidence of multiple-splicing forms. The application of several approaches to take into account background effects and the study of expression (i.e., transcription) in different tissues based on microarray data allow detection of the signature of natural selection on both traits and transcription-associated mutational biases.
| MATERIALS AND METHODS |
|---|
Microarray data:
Expression data for human tissues were obtained from a high-throughput gene expression study of the normal mammalian transcriptome (SU et al. 2002; http://expression.gnf.org; April 2003) based on hybridization to high-density arrays (GPL91 platform; Affymetrix U95A). A total of 5280 genes (4876 with introns) overlapped with those described above and were used in this study. Presence/absence for each probe/transcript (the Absolute Call) was determined for each sample from the reference series GSE96 using the Affymetrix MAS4 algorithm, as reported in NCBI's Gene Expression Omnibus (EDGAR et al. 2002). Levels of expression were investigated using positive AD values only in genes with validated presence. Nineteen tissues were investigated using a total of 45 samples: adrenal gland, brain (fetal and adult), liver (fetal and adult), heart, kidney, lung, ovary pool, pancreas, pituitary gland, placenta, prostate, salivary gland, spinal cord, spleen, testis, thymus, thyroid, trachea, and uterus.
Two different measures of expression were used in this study: breadth of expression (Expressionbreadth; DURET and MOUCHIROUD 2000; URRUTIA and HURST 2001), which is the number of tissues in which transcription is detected (ranging up to 19 tissues), and the level of transcription within each of these 19 tissues (Expressionlevel). Genes are defined as ubiquitously or narrowly expressed if they are expressed in >14 (1497 genes) and <3 (1732 genes) tissues, respectively. We have avoided using measures of expression based on pooled or mean levels of transcription from different tissues because these latter measures are more dependent on breadth of expression (URRUTIA and HURST 2003) and they are highly sensitive to the particular set of tissues chosen for the analysis and to possible differences in overall expression levels. We also avoided using transcription information based on SAGE because of its known GC content bias (MARGULIES et al. 2001) that could not only influence the analysis of expression and composition but also generate spurious clustering of expression across the genome in association with different isochores.
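As an illustration of the breadth measure defined above, the following minimal Python sketch counts the tissues with detected transcription and applies the >14 and <3 thresholds. The function names, the "intermediate" label, and the assumption that a single presence/absence call per tissue is already available (the way multiple samples per tissue were combined is not described here) are hypothetical and not taken from the original analysis.

```python
def expression_breadth(present_by_tissue):
    """Number of tissues in which transcription is detected.

    `present_by_tissue` maps a tissue name to True/False (hypothetical input
    format; one presence call per tissue is assumed to be given).
    """
    return sum(1 for present in present_by_tissue.values() if present)


def classify_breadth(breadth, n_tissues=19):
    """Classify a gene using the thresholds stated in the text."""
    if not 0 <= breadth <= n_tissues:
        raise ValueError("breadth must lie between 0 and the number of tissues")
    if breadth > 14:
        return "ubiquitous"
    if breadth < 3:
        return "narrow"
    return "intermediate"  # label introduced here for illustration only


calls = {"liver": True, "spleen": True, "testis": False}
print(expression_breadth(calls), classify_breadth(expression_breadth(calls)))
```

Keeping breadth as a simple count of presence calls makes it independent of the quantitative signal, which is the property the text relies on when comparing breadth across platforms.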
BLAST searches and CpG islands:
Local alignments between human and mouse (Mus musculus) orthologous intron sequences were used to estimate the number of conserved sites in introns by applying BLASTn searches (ALTSCHUL et al. 1997). BLASTn searches are highly sensitive to the set of parameters used, most conspicuously when fairly divergent sequences are compared, but there is no reason to presume a systematic bias when comparing broadly and narrowly expressed genes. We used a word size set to 11 and masked off segments of the human introns that have low compositional complexity and human repeats (http://www.ncbi.nlm.nih.gov/blast). The presence of CpG islands was predicted with the program NewCpGreport (EMBOSS v.2.3.1. package) using default parameters. As expected, both approaches reveal higher percentages of CpG islands and conserved sites in first introns than in other introns, which is a positive control for these methods.
Statistical analyses:
All correlation coefficients reported in this study were obtained using all genes independently, avoiding the approach of subdividing genes into groups to later investigate relationships among groups. Note that this latter approach is equally valuable to detect statistically significant associations but it cannot be used to assess the actual strength of association. Statistical analyses were carried out using Statistica for Windows v.6 (StatSoft, Tulsa, OK).
| RESULTS |
|---|
Gene expression and synonymous base composition:
Many studies have shown that GC content at the third positions of codons (GC3) is greater than GC content at introns (GCi) or at the first and second position of codons (GC12) (our data set shows an average GC content of 58.3, 49.4, and 45.7%, for GC3, GC12, and GCi, respectively). This overall difference has been used, at times, as an argument in favor of selection on synonymous base composition (although see DISCUSSION).
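For readers who wish to reproduce these compositional measures, GC3 and GC12 can be computed from an in-frame CDS as in the minimal Python sketch below. The function name is hypothetical, and the handling of start and stop codons (here they are simply counted like any other codon) is an assumption, since it is not specified in this section.

```python
def gc_by_codon_position(cds):
    """Return GC percentages at codon positions 1+2 (GC12) and 3 (GC3).

    Assumes `cds` is an in-frame coding sequence; all codons are counted,
    including any start or stop codon present in the string.
    """
    cds = cds.upper()
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("CDS must be non-empty with length a multiple of 3")
    n_codons = len(cds) // 3
    gc12_hits = 0
    gc3_hits = 0
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        gc12_hits += sum(base in "GC" for base in codon[:2])
        gc3_hits += codon[2] in "GC"
    return {
        "GC12": 100.0 * gc12_hits / (2 * n_codons),
        "GC3": 100.0 * gc3_hits / n_codons,
    }


print(gc_by_codon_position("ATGGCCAAA"))
```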
Previous studies based on EST data (DURET 2002) revealed a negative relationship between GC3 and Expressionbreadth (see MATERIALS AND METHODS), and this is also observed using microarray data (R = -0.120, P < 1 x 10^-12). On the other hand, the study of Expressionlevel, which is the measure of expression that is expected to be associated with translational efficiency, shows the opposite tendency, with GC3 increasing with Expressionlevel in all 19 tissues with R ranging between +0.078 (P = 6 x 10^-6) and +0.393 (P < 1 x 10^-12). The same trend is observed for GC12 and GCi, increasing significantly with Expressionlevel also in all 19 tissues, with R between +0.055 (P = 0.0015) and +0.234 (P < 1 x 10^-12) for GC12 and between +0.142 and +0.433 (P < 1 x 10^-12) for GCi. The covariation of GC3 and GCi with Expressionlevel exposes a strong nonselective component not specific to synonymous composition, with two obvious possible causes for this observation: isochores and transcription-associated mutational biases (TAMB).
We investigated the influence of expression on synonymous base composition relative to that in introns for each nucleotide separately. Figure 1A shows the correlation coefficient, R, between Expressionlevel and the difference in base composition between the third position of codons and introns. In testis and, to a lesser degree, in prostate and pituitary gland, the influence of expression on synonymous composition can be explained by an equivalent influence of expression on intron composition. On the other hand, tissues like spleen, heart, placenta, liver, or kidney, show that C and, to a lesser degree, G content at synonymous sites increases with expression beyond mutational tendencies operating on whole transcripts, a first indication of translational selection in these tissues.
Set of optimal synonymous codons in highly expressed genes:
To investigate consequences of translational selection, we have looked into the set of synonymous codons that increase in frequency with expression (i.e., optimal codons). A caveat, however, should be mentioned since translational selection depends on aspects of protein translation while we used levels of transcription due to the scarcity of information on protein amounts. Here, we assume that transcript levels are strongly correlated with protein levels. Table 1 shows the difference in the relative synonymous codon usage (RSCU; SHARP and LI 1987; DURET and MOUCHIROUD 1999; DURET 2000) between highly and poorly expressed genes (ΔRSCU). Highly and poorly expressed genes were defined as the 25% with highest and lowest levels, respectively, of detectable transcription within each tissue, to allow for differences in overall expression levels among tissues. We avoided using the effective number of codons (WRIGHT 1990), which is a measure of overall codon bias that is not influenced by the number of codons under study (WRIGHT 1990; COMERON and AGUADE 1998), because it does not directly correct for differences in background composition. On the other hand, a measure of codon bias that corrects for background composition such as MCB (URRUTIA and HURST 2001) is strongly influenced by the length of the CDS in a nonlinear manner (URRUTIA and HURST 2001) and it exposes heterogeneity of synonymous base composition among amino acids rather than bias in synonymous codon usage that might, or might not, be consistent among amino acids. Moreover, neither of these two methods reveals the codons that increase in frequency with expression.
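The RSCU of a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally. A minimal Python sketch of this calculation, and of the difference in RSCU between a highly and a poorly expressed gene set, is given below. It assumes the standard genetic code and pools codon counts across all genes in a set; the pooling choice and all function names are assumptions made for illustration and are not confirmed by the text.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """RSCU per codon, pooling counts over all in-frame CDSs in `cds_list`."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TO_AA:  # skips codons with ambiguous bases
                counts[codon] += 1
    values = {}
    for aa in set(AMINO) - {"*"}:
        family = [c for c, a in CODON_TO_AA.items() if a == aa]
        total = sum(counts[c] for c in family)
        if len(family) < 2 or total == 0:
            continue  # Met, Trp, or amino acid absent from the gene set
        for codon in family:
            values[codon] = counts[codon] * len(family) / total
    return values


def delta_rscu(high_cds_list, low_cds_list):
    """RSCU in the highly expressed set minus RSCU in the poorly expressed set."""
    high, low = rscu(high_cds_list), rscu(low_cds_list)
    return {codon: high[codon] - low[codon] for codon in high if codon in low}


print(delta_rscu(["GGCGGC"], ["GGCGGA"])["GGC"])
```

A positive difference for a codon indicates that its relative usage is higher in the highly expressed set, which is the criterion used in the text to flag candidate optimal codons.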
ΔRSCU (Table 1). On the other hand, genes expressed in testis (with strongest TAMB) expose only two synonymous codons increasing significantly with expression (the same two codons showing the strongest effect of expression in tissues with no TAMB). These results, however, could be explained without invoking selection if highly expressed genes cluster in GC-rich isochores (see Introduction) and therefore we have compared highly and poorly expressed genes in GC-rich and GC-poor isochores separately (we defined three isochore categories with equivalent gene numbers based on GCi). A conservative definition of optimal codons then refers to codons showing strong positive ΔRSCU in both GC-rich and GC-poor isochores in tissues with no evidence of TAMB (e.g., liver and spleen). A total of 17 optimal codons are consistently observed in tissues with no detectable TAMB, 12 C- and 5 G-ending codons (see Table 1). In total, the frequency of optimal codons in a gene (Fop) increases with Expressionlevel in all tissues, with R ranging from +0.100 (testis) to +0.406 (spleen; P < 1 x 10^-12 in all cases). Another measure that explores the overall adaptation of codon usage taking into account background mutational biases (isochoric and/or TAMB) is the ratio of GC-ending optimal to GC-ending nonoptimal codons (GCoptimal/GCnonoptimal) in a gene (a ratio that should be computed using only amino acids with both optimal and nonoptimal GC-ending codons, i.e., four- and sixfold degenerate amino acids). As expected if the set of optimal codons properly describes consequences of translational selection, all tissues show a significant increase of GCoptimal/GCnonoptimal with Expressionlevel, with R ranging from +0.139 to +0.307 (P < 1 x 10^-12 in all cases). Note that this latter analysis reveals that the influence of expression on synonymous codon usage is also detectable, although to a much lesser degree, in tissues such as testis where a nonselective component in association with expression (i.e., TAMB) is the main influence on synonymous composition. Figure 2 shows the relationship between GCoptimal/GCnonoptimal and Expressionlevel for genes expressed in liver, the tissue with least evidence of TAMB.
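A minimal Python sketch of the Fop calculation follows. The full set of 17 optimal codons is listed in Table 1 and is not reproduced here; the example uses only GGC for glycine, which is named as optimal elsewhere in the text. Restricting the denominator to codons of amino acids that have at least one optimal codon is an assumption of this sketch, as are the function and variable names.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def fop(cds, optimal_codons):
    """Frequency of optimal codons in one in-frame CDS.

    Only codons encoding amino acids that have at least one optimal codon
    contribute to the denominator (an assumption of this sketch).
    """
    amino_acids_with_optimal = {CODON_TO_AA[c] for c in optimal_codons}
    cds = cds.upper()
    n_optimal = 0
    n_informative = 0
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if CODON_TO_AA.get(codon) in amino_acids_with_optimal:
            n_informative += 1
            n_optimal += codon in optimal_codons
    if n_informative == 0:
        raise ValueError("no informative codons in this CDS")
    return n_optimal / n_informative


# Illustrative one-codon set; the complete set is given in Table 1.
EXAMPLE_OPTIMAL = {"GGC"}
print(fop("GGCGGAGGCATG", EXAMPLE_OPTIMAL))
```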
Gene expression, CDS length, and intron presence
Gene expression and CDS length:
There is a negative relationship between expression and the length of the CDS. This effect is detected using Expressionbreadth (R = -0.088, P = 1.4 x 10^-10) and Expressionlevel in any of the 19 tissues investigated (R ranges between -0.118 and -0.204, P < 1 x 10^-12 in all cases). Equivalent results based on pooled SAGE data have been recently reported (URRUTIA and HURST 2003), and this study broadens the validity of this relationship in humans.
Significantly, the negative correlation between expression and CDS length is observed only among genes with introns (R = -0.111 for Expressionbreadth and R between -0.124 and -0.211 for Expressionlevel, P < 1 x 10^-12 in all cases). In contrast, there is no detectable correlation among genes without introns using either Expressionbreadth or Expressionlevel (all associations are statistically nonsignificant after sequential Bonferroni correction). Nevertheless, genes without introns are usually shorter (average 984 bp) than genes with introns (average 1505 bp) and one could argue that the relationship between expression and CDS length is detected only among genes with intermediate/long CDS. We have then analyzed genes with introns and short CDS, a subset with an average CDS length of 983 bp, and observed again a negative relationship between expression and CDS length, for both Expressionbreadth and Expressionlevel in all tissues (P < 3 x 10^-10). This latter result indicates that the observed distinct behavior of genes with and without introns is not attributable to differences in CDS length.
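The sequential Bonferroni correction mentioned above is commonly implemented as Holm's step-down procedure. The minimal Python sketch below illustrates that procedure; whether this exact variant and a 0.05 familywise threshold were used in the analysis is an assumption, and the function name is hypothetical.

```python
def sequential_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure.

    Returns a list of booleans (same order as `p_values`) that is True where
    the null hypothesis is rejected at familywise error rate `alpha`.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest P value (k = rank + 1) is tested at alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger P values are retained
    return reject


print(sequential_bonferroni([0.01, 0.04, 0.03, 0.005]))
```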
Gene expression and intron density:
Several cellular processes associated with intron presence and size might influence the final amount of mRNA correctly transcribed, spliced, and exported to the cytoplasm. In this regard, various selective models (see DISCUSSION) forecast an association of levels of gene expression with differences in intron presence and size among genes.
Predictably, intron number increases with the length of CDS (R = +0.665, P < 1 x 10^-12). Therefore, to investigate the influence of expression on intron presence and because of the aforementioned association between expression and CDS length, we have studied measures of intron presence relative to the size of the CDS (i.e., intron density; number of introns per kilobase of CDS). Intron density increases with any measure of expression: Expressionbreadth (R = +0.194, P < 1 x 10^-12) and Expressionlevel in all tissues (R ranging between +0.114 and +0.204; P < 1 x 10^-12 in all cases; Figure 3). The same results are obtained when multiple regression analyses that account for variation in CDS length are performed: B = +0.185 (P < 1 x 10^-12) for Expressionbreadth and B ranges from +0.034 (P = 0.01) to +0.107 (P < 1 x 10^-12) for Expressionlevel. These results are not caused simply by intron-less genes being narrowly/lowly expressed because the same trend is obtained when only genes with introns are analyzed: R = +0.177 and R > +0.102 (P < 1 x 10^-12 in all cases), for Expressionbreadth and Expressionlevel, respectively.
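Intron density as defined above, and its gene-by-gene correlation with an expression measure, can be sketched in Python as follows. The type of correlation coefficient is not stated in this section, so the use of Pearson's R here is an assumption, and the toy numbers are invented purely to show the calling convention.

```python
import math


def intron_density(n_introns, cds_length_bp):
    """Number of introns per kilobase of CDS."""
    if cds_length_bp <= 0:
        raise ValueError("CDS length must be positive")
    return n_introns / (cds_length_bp / 1000.0)


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed over individual genes."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values (not real data): (number of introns, CDS length in bp, breadth).
genes = [(3, 1500, 4), (10, 2000, 17), (0, 900, 1), (7, 1200, 12)]
densities = [intron_density(n, length) for n, length, _ in genes]
breadths = [breadth for _, _, breadth in genes]
print(pearson_r(densities, breadths))
```

Computing the coefficient over individual genes, rather than over binned groups, follows the approach described under Statistical analyses.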
Possible influence of isochores:
We considered the possible coincidental basis to our previous results because both transcription patterns and gene structures differ among isochores (MOUCHIROUD et al. 1991; DURET et al. 1995; ZOUBAK et al. 1996; LANDER et al. 2001; D'ONOFRIO 2002). We compared patterns of expression and intron density between physically adjacent genes, both with expression data, hence removing background tendencies even under a restrictive definition of isochore (NEKRUTENKO and LI 2000). This analysis is possible only by using Expressionbreadth because too few adjacent genes show expression data in the same tissue. As shown in Figure 4, when two adjacent genes differ in breadth of expression, the gene expressed in a greater number of tissues shows a higher density of introns (2377 gene pairs, R = +0.134, P < 1 x 10^-12).
17%) compared to any other intron where the reduction is close to 30%, hence supporting the perception that first introns contain a larger number of regulatory elements for transcription control (MAJEWSKI and OTT 2002).
| DISCUSSION |
|---|
This study also illustrates that results based on SAGE or EST data are equivalent to those based on gene chips when investigating intron features but very different, even conflicting, when investigating base composition. The incongruence between methods is likely caused by the known bias in the quantitative aspect (i.e., Expressionlevel) of the SAGE and EST methods relative to GC content (MARGULIES et al. 2001; DURET 2002). On the other hand, qualitative studies (i.e., Expressionbreadth since it is based on presence/absence) are expected to be comparable between SAGE/EST and microarray data, as observed. Another noteworthy difference between this and previous studies is that we have used only genes that have not yet shown evidence of multiple-splicing forms. The use of genes with multiple-splicing forms would introduce a certain degree of ambiguity when composition is investigated because constitutive and facultative exons differ in synonymous GC content (IIDA and AKASHI 2000).
Certainly, the major determinants of various gene features in mammals are the isochore in which they are located and the functional properties of the encoded proteins. In accordance, in this study we show associations that, although with great statistical significance, explain individually only a small percentage of the overall variance in gene composition or intron features (2-16% of the overall variance).
Gene expression and amino acid composition:
Selection at the level of amino acid composition might favor reducing energetic costs of amino acid biosynthesis (AKASHI and GOJOBORI 2002) or act in association with the abundance of tRNAs for each amino acid, increasing translation accuracy or reducing translation costs of proofreading (time and energy). There is no a priori reason to expect the fitness consequence of an amino acid misincorporation to be proportional to the degree of expression of a protein (BULMER 1991). Conversely, the fitness costs associated with amino acid biosynthesis and proofreading will increase with expression. In the prokaryotes Escherichia coli and Bacillus subtilis, the usage of less energetically costly amino acids increases in abundant proteins (AKASHI and GOJOBORI 2002). In C. elegans (DURET 2000) and yeast (AKASHI 2003), highly transcribed protein-coding genes show a stronger correlation between amino acid composition and tRNA abundance than do poorly transcribed genes, supporting the proposal that selection minimizes translational costs. In humans, the overall amino acid composition of proteins also matches tRNA abundance but there is no support for different amino acid composition in differentially expressed genes. Therefore, the data suggest that the coevolution of amino acid composition and tRNA abundance in the human lineage is driven by selection to minimize amino acid misincorporation during translation and not to reduce translational costs.
Selection and mutational biases on synonymous composition:
The observed difference in GC content between synonymous sites and introns (see RESULTS) can be explained under a nonselective scenario by arguing that transposable elements (TEs), which have a reduced GC content (DURET et al. 1995), represent a frequent component of many introns. The difference between synonymous sites and introns then might just reflect the recent insertion of TEs in introns, especially in genes with long introns (DURET and HURST 2001). In partial agreement with this proposal, there is a strong tendency for long introns to have reduced GC content (R = -0.507, P < 1 x 10^-12). Nevertheless, our analyses indicate that highly transcribed genes show the strongest compositional difference between synonymous sites and introns while, at the same time, having shorter introns and a reduced number of TEs in introns (CASTILLO-DAVIS et al. 2002). This would argue against the possibility that the positive association between expression and compositional difference between synonymous sites and introns is a consequence of TE presence.
In addition to a strong effect of isochores, we have also detected the influence of transcription-associated mutational biases evidenced by compositional strand bias in introns. Although TAMB is expected to be apparent only in genes expressed in germline cells (HANAWALT 1994; SVEJSTRUP 2002), recent analyses suggest that some level of germline transcription may involve a large fraction of human genes (GREEN et al. 2003; MAJEWSKI 2003). Here, we report that the influence of expression on strand bias varies widely among tissues, with genes expressed in testis showing the greatest influence while genes expressed in tissues such as liver, spleen, heart, placenta, and kidney show no evidence of TAMB. Genes expressed in tissues with significant TAMB will be subject to conflicting mutational and selective pressures on synonymous composition beyond the isochore effects. As a result, tissues showing TAMB also reveal the least obvious influence of selection on synonymous codon usage. Conversely, tissues showing little or no evidence of TAMB are those in which selection on synonymous composition is better observed.
Translational selection in humans:
Overall, the results shown here are evidence that selection on synonymous codons is operating at a detectable level in the human lineage. As predicted by population genetics theory, however, the signature of translational selection is less conspicuous in humans than in species with much larger Ne. Indeed, genomic patterns of selection on synonymous codons are distinguished only after taking into account the strong influence of background composition (isochores) and tissue-specific features such as TAMB.
We propose a set of 17 synonymous optimal codons selectively favored in highly expressed genes. All optimal codons are GC ending and they resemble the set proposed for D. melanogaster more closely than that for C. elegans or A. thaliana (DURET and MOUCHIROUD 1999). The comparison between optimal codons and gene copy numbers of isoaccepting tRNAs (expected to reflect cellular tRNA abundance) shows a good, although not perfect, association, with 14 of the 17 optimal codons being decoded by the most frequent isoaccepting tRNA according to classical rules of codon-anticodon interactions (IKEMURA 1985). In agreement with the proposal of translational selection, two amino acids (glycine and proline) show a corresponding change in codon preference and tRNA abundance when C. elegans and humans are compared, generating in both species precise, although different, matches between optimal codons and the most frequent isoaccepting tRNA. For instance, in the case of glycine, the optimal codon and the anticodon of the most frequent tRNA in C. elegans are GGA and UCC, respectively (DURET 2000); in humans, the optimal codon and the anticodon of the most frequent tRNA are GGC and GCC, respectively. Certainly, the use of optimal codons will increase our capability to explore further consequences of translational selection at both intra- and interspecific levels. Further, the exposure of translational selection in the human lineage is a factor that should be introduced into evolutionary analyses that often assume neutrality of all synonymous mutations.
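The glycine example can be checked with a short Python sketch that derives the perfectly complementary anticodon of a codon. It considers strict Watson-Crick pairing only and ignores wobble pairing and modified bases, which the classical rules also cover; the function name is hypothetical.

```python
# RNA base complements under strict Watson-Crick pairing.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def watson_crick_anticodon(codon):
    """Return the perfectly complementary anticodon, written 5'->3'.

    Accepts DNA or RNA codons (T is converted to U). Wobble pairing and
    modified bases are deliberately not modeled.
    """
    rna = codon.upper().replace("T", "U")
    if len(rna) != 3 or any(base not in COMPLEMENT for base in rna):
        raise ValueError("codon must be three unambiguous nucleotides")
    # The anticodon pairs antiparallel to the codon, hence the reversal.
    return "".join(COMPLEMENT[base] for base in reversed(rna))


for codon in ("GGA", "GGC"):
    print(codon, watson_crick_anticodon(codon))
```

Under this simplification, GGA pairs with anticodon UCC and GGC with GCC, matching the codon and anticodon pairs quoted for C. elegans and humans, respectively.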
Gene expression and CDS size:
The negative relationship between expression and protein size reported in S. cerevisiae (AKASHI 2003) and C. elegans (JANSEN and GERSTEIN 2000) has been explained by the selective advantage of reducing energetic costs of amino acid biosynthesis in highly expressed genes (AKASHI and GOJOBORI 2002; AKASHI 2003). On the other hand, the overall excess of deletions over insertions described in many eukaryotes, including mammals (OGATA et al. 1996; OPHIR and GRAUR 1997), and the possibility of transcription-associated deletions could also generate a nonselective association between transcription rates in germinal cells and a reduction in protein size. Thus, in multicellular organisms a negative relationship between expression and protein size does not require a selective explanation unless such a relationship is observed among genes not transcribed in germinal cells.
We have shown a negative association between protein size and Expressionbreadth. Because broadly expressed genes are also more likely to be expressed in germinal cells (DURET and MOUCHIROUD 2000), this observation alone would not rule out a mutational (transcription-associated) cause. However, the same trend is also observed using Expressionlevel, including tissues with no detectable mutational trends expected in germinal cells, hence supporting a selective explanation for the association between expression and CDS size in the human lineage. Interestingly, this trend is specific to genes with introns, suggesting that protein size is not the sole factor playing a role in this relationship. Thus, the results indicate that the association between expression and CDS size should be investigated not only by selective models based on total protein size (e.g., on costs of amino acid biosynthesis) but also in conjunction with models based on the evolutionary/metabolic consequences of exon size and intron presence (see below).
Gene expression and intron presence and size:
Previous reports showed that short introns are favored in highly expressed genes and this study confirms this trend in a wide range of different tissues. Altogether, these results support the hypothesis of a measurable selective advantage for having small transcripts to reduce transcriptional costs (time and energy; CASTILLO-DAVIS et al. 2002). Significantly, we also show a counterbalancing trend, not caused by background tendencies, whereby broadly/highly expressed genes have higher intron density in the human lineage. One could argue that genes expressed in many tissues are more likely to have more introns because they are more likely to be alternatively spliced, but these multiple-splicing forms have not yet been detected. Nevertheless, the same trend is observed using Expressionlevel in specific tissues, ruling out the possibility of a spurious relationship between intron density/number and, at least, Expressionlevel.
Selective causes favoring intron presence:
A heterogeneous group of selective causes might associate intron presence in protein-coding genes with levels of correct gene products. At this level, the advantage for having higher intron density would be counterbalanced by a minimum exon size required for proper splicing (UPHOLT and SANDELL 1986; DOMINSKI and KOLE 1991) and restrictions on transcription costs that are likely to be species specific, hence explaining differences among species.
A first possibility is that genes with a broader and/or higher expression require an increased number of regulatory signals in different introns. We have applied two indirect approaches to investigate the presence of regulatory regions (see MATERIALS AND METHODS for details). Our survey of CpG islands reveals an equivalent presence in narrowly and ubiquitously expressed genes, with an average of 1.67 and 1.64 islands per gene, respectively, comparing genes with >10 introns. In the second approach, we applied BLASTn searches to identify conserved segments of noncoding DNA as a proxy for functionally important sequences (HARDISON et al. 1997; JAREBORG et al. 1999; WASSERMAN et al. 2000; SHABALINA et al. 2001). The comparison of human and mouse orthologous sequences reveals that the total number of conserved sites in introns does not increase with breadth of expression. On the contrary, narrowly expressed genes have an average of 473 conserved sites in introns compared to 372 in ubiquitously expressed genes (Mann-Whitney U-test, P = 0.020), with percentages of conserved sites of 2.9 and 2.5%, respectively (U-test, P = 0.029). In all, these indirect analyses suggest that differences in intron number are not likely a consequence of an increased number of regulatory signals distributed in different introns.
A second explanation for the observed association between gene expression and intron density might be related to the influence of introns on mRNA metabolism (SUN and MAQUAT 2000; ZHOU et al. 2000; LE HIR et al. 2001; YU et al. 2002) and splicing efficiency (KLINZ and GALLWITZ 1985; STERNER et al. 1996). The so-called exon-exon junction complexes (EJC) are deposited upstream of intron positions after splicing (KATAOKA et al. 2000; LE HIR et al. 2000, 2001) and there is evidence that EJC enhance export efficiency of spliced mRNAs to the cytoplasm (ZHOU et al. 2000; LE HIR et al. 2001). Also, splicing factors might promote transcriptional elongation (FONG and ZHOU 2001). Therefore, selection could be acting at the level of intron density to increase mRNA transport and/or transcriptional elongation, especially in highly expressed genes.
Another selective model for intron presence is associated with the deleterious consequences of linkage between sites under selection, a phenomenon termed the Hill-Robertson effect (HILL and ROBERTSON 1966; FELSENSTEIN 1974; see also LI 1987; KLIMAN and HEY 1993; COMERON et al. 1999; MCVEAN and CHARLESWORTH 2000; TACHIDA 2000; BETANCOURT and PRESGRAVES 2002; COMERON and KREITMAN 2002; HEY and KLIMAN 2002). Specifically, COMERON and KREITMAN (2000, 2002) have proposed that the Hill-Robertson effect might be detectable at an intragenic level in many eukaryotes due to the prevalence of mutations under weak selection in coding regions. Under this model, introns (generally with a reduced frequency of sites under selection compared to exons) will reduce the Hill-Robertson effect at the intragenic level, i.e., intron-containing genes would exhibit increased effectiveness of selection. Then, all else being equal, highly expressed genes would benefit from high intron density to maximize the consequences of selection on amino acid and synonymous composition. A fraction of replacement (amino acid changing) mutations in many species are likely under weak selection and our report of selection on synonymous mutations increases the likelihood of detectable Hill-Robertson effect within genes in the human lineage, particularly in highly expressed genes. Upcoming large-scale population genetics analyses based on polymorphism and divergence data will allow testing of these possibilities.
| LITERATURE CITED |
|---|
AKASHI, H., 1995 Inferring weak selection from patterns of polymorphism and divergence at "silent" sites in Drosophila DNA. Genetics 139: 1067-1076.
AKASHI, H., 1996 Molecular evolution between Drosophila melanogaster and D. simulans: reduced codon bias, faster rates of amino acid substitution, and larger proteins in D. melanogaster. Genetics 144: 1297-1307.
AKASHI, H., 2003 Translational selection and yeast proteome evolution. Genetics 164: 1291-1303.
AKASHI, H., and T. GOJOBORI, 2002 Metabolic efficiency and amino acid composition in the proteomes of Escherichia coli and Bacillus subtilis. Proc. Natl. Acad. Sci. USA 99: 3695-3700.
ALTSCHUL, S. F., T. L. MADDEN, A. A. SCHAFFER, J. ZHANG, Z. ZHANG et al., 1997 Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
BEGUN, D. J., 2001 The frequency distribution of nucleotide variation in Drosophila simulans. Mol. Biol. Evol. 18: 1343-1352.
BERNARDI, G., 1995 The human genome: organization and evolutionary history. Annu. Rev. Genet. 29: 445-476.
BETANCOURT, A. J., and D. C. PRESGRAVES, 2002 Linkage limits the power of natural selection in Drosophila. Proc. Natl. Acad. Sci. USA 99: 13616-13620.
BULMER, M., 1991 The selection-mutation-drift theory of synonymous codon usage. Genetics 129: 897-907.
CARLINI, D. B., and W. STEPHAN, 2003 In vivo introduction of unpreferred synonymous codons into the Drosophila Adh gene results in reduced levels of ADH protein. Genetics 163: 239-243.
CARVALHO, A. B., and A. G. CLARK, 1999 Intron size and natural selection. Nature 401: 344.
CASTILLO-DAVIS, C. I., S. L. MEKHEDOV, D. L. HARTL, E. V. KOONIN and F. A. KONDRASHOV, 2002 Selection for short introns in highly expressed genes. Nat. Genet. 31: 415-418.
COGHLAN, A., and K. H. WOLFE, 2000 Relationship of codon bias to mRNA concentration and protein length in Saccharomyces cerevisiae. Yeast 16: 1131-1145.
COMERON, J. M., and M. AGUADE, 1998 An evaluation of measures of synonymous codon usage bias. J. Mol. Evol. 47: 268-274.
COMERON, J. M., and M. KREITMAN, 2000 The correlation between intron length and recombination in Drosophila: dynamic equilibrium between mutational and selective forces. Genetics 156: 1175-1190.
COMERON, J. M., and M. KREITMAN, 2002 Population, evolutionary and genomic consequences of interference selection. Genetics 161: 389-410.
COMERON, J. M., M. KREITMAN and M. AGUADE, 1999 Natural selection on synonymous sites is correlated with gene length and recombination in Drosophila. Genetics 151: 239-249.
DOMINSKI, Z., and R. KOLE, 1991 Selection of splice sites in pre-mRNAs with short internal exons. Mol. Cell. Biol. 11: 6075-6083.
D'ONOFRIO, G., 2002 Expression patterns and gene distribution in the human genome. Gene 300: 155-160.
DUAN, J., M. S. WAINWRIGHT, J. M. COMERON, N. SAITOU, A. R. SANDERS et al., 2003 Synonymous mutations in the human dopamine receptor D2 (DRD2) affect mRNA stability and synthesis of the receptor. Hum. Mol. Genet. 12: 205-216.
DURET, L., 2000 tRNA gene number and codon usage in the C. elegans genome are co-adapted for optimal translation of highly expressed genes. Trends Genet. 16: 287-289.
DURET, L., 2002 Evolution of synonymous codon usage in metazoans. Curr. Opin. Genet. Dev. 12: 640-649.
DURET, L., and L. D. HURST, 2001 The elevated GC content at exonic third sites is not evidence against neutralist models of isochore evolution. Mol. Biol. Evol. 18: 757-762.
DURET, L., and D. MOUCHIROUD, 1999 Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc. Natl. Acad. Sci. USA 96: 4482-4487.
DURET, L., and D. MOUCHIROUD, 2000 Determinants of substitution rates in mammalian genes: expression pattern affects selection intensity but not mutation rate. Mol. Biol. Evol. 17: 68-74.
DURET, L., D. MOUCHIROUD and C. GAUTIER, 1995 Statistical analysis of vertebrate sequences reveals that long genes are scarce in GC-rich isochores. J. Mol. Evol. 40: 308-317.
DURET, L., M. SEMON, G. PIGANEAU, D. MOUCHIROUD and N. GALTIER, 2002 Vanishing GC-rich isochores in mammalian genomes. Genetics 162: 1837-1847.
EDGAR, R., M. DOMRACHEV and A. E. LASH, 2002 Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30: 207-210.
EYRE-WALKER, A., 1999 Evidence of selection on silent site base composition in mammals: potential implications for the evolution of isochores and junk DNA. Genetics 152: 675-683.
FELSENSTEIN, J., 1974 The evolutionary advantage of recombination. Genetics 78: 737-756.
FONG, Y. W., and Q. ZHOU, 2001 Stimulatory effect of splicing factors on transcriptional elongation. Nature 414: 929-933.
GALTIER, N., 2003 Gene conversion drives GC content evolution in mammalian histones. Trends Genet. 19: 65-68.
GREEN, P., B. EWING, W. MILLER, P. J. THOMAS and E. D. GREEN, 2003 Transcription-associated mutational asymmetry in mammalian evolution. Nat. Genet. 33: 514-517.
HANAWALT, P. C., 1994 Transcription-coupled repair and human disease. Science 266: 1957-1958.
HARDISON, R. C., J. OELTJEN and W. MILLER, 1997 Long human-mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome. Genome Res. 7: 959-966.
HARTL, D. L., E. N. MORIYAMA and S. A. SAWYER, 1994 Selection intensity for codon bias. Genetics 138: 227-234.
HEY, J., and R. M. KLIMAN, 2002 Interactions between natural selection, recombination and gene density in the genes of Drosophila. Genetics 160: 595-608.
HILL, W. G., and A. ROBERTSON, 1966 The effect of linkage on limits to artificial selection. Genet. Res. 8: 269-294.
IIDA, K., and H. AKASHI, 2000 A test of translational selection at silent sites in the human genome: base composition comparisons in alternatively spliced genes. Gene 261: 93-105.
IKEMURA, T., 1985 Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2: 13-34.
JANSEN, R., and M. GERSTEIN, 2000 Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res. 28: 1481-1488.
JAREBORG, N., E. BIRNEY and R. DURBIN, 1999 Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Res. 9: 815-824.
KATAOKA, N., J. YONG, V. N. KIM, F. VELAZQUEZ, R. A. PERKINSON et al., 2000 Pre-mRNA splicing imprints mRNA in the nucleus with a novel RNA-binding protein that persists in the cytoplasm. Mol. Cell 6: 673-682.
KLIMAN, R. M., and J. HEY, 1993 Reduced natural selection associated with low recombination in Drosophila melanogaster. Mol. Biol. Evol. 10: 1239-1258.
KLINZ, F. J., and D. GALLWITZ, 1985 Size and position of intervening sequences are critical for the splicing efficiency of pre-mRNA in the yeast Saccharomyces cerevisiae. Nucleic Acids Res. 13: 3791-3804.
KURLAND, C., 1992 Translational accuracy and the fitness of bacteria. Annu. Rev. Genet. 26: 29-50.
LANDER, E. S., L. M. LINTON, B. BIRREN, C. NUSBAUM, M. C. ZODY et al., 2001 Initial sequencing and analysis of the human genome. Nature 409: 860-921.
LE HIR, H., M. J. MOORE and L. E. MAQUAT, 2000 Pre-mRNA splicing alters mRNP composition: evidence for stable association of proteins at exon-exon junctions. Genes Dev. 14: 1098-1108.
LE HIR, H., D. GATFIELD, E. IZAURRALDE and M. J. MOORE, 2001 The exon-exon junction complex provides a binding platform for factors involved in mRNA export and nonsense-mediated mRNA decay. EMBO J. 20: 4987-4997.
LEICHT, B. G., S. V. MUSE, M. HANCZYC and A. G. CLARK, 1995 Constraints on intron evolution in the gene encoding the myosin alkali light chain in Drosophila. Genetics 139: 299-308.
LERCHER, M. J., A. O. URRUTIA and L. D. HURST, 2002 Clustering of housekeeping genes provides a unified model of gene order in the human genome. Nat. Genet. 31: 180-183.
LERCHER, M. J., A. O. URRUTIA, A. PAVLICEK and L. D. HURST, 2003 A unification of mosaic structures in the human genome. Hum. Mol. Genet. 12: 2411-2415.
LI, W. H., 1987 Models of nearly neutral mutations with particular implications for nonrandom usage of synonymous codons. J. Mol. Evol. 24: 337-345.
LI, W. H., and L. A. SADLER, 1991 Low nucleotide diversity in man. Genetics 129: 513-523.
LLOPART, A., and M. AGUADE, 2000 Nucleotide polymorphism at the RpII215 gene in Drosophila subobscura. Weak selection on synonymous mutations. Genetics 155: 1245-1252.
LLOPART, A., J. M. COMERON, F. G. BRUNET, D. LACHAISE and M. LONG, 2002 Intron presence-absence polymorphism in Drosophila driven by positive Darwinian selection. Proc. Natl. Acad. Sci. USA 99: 8121-8126.
LYNCH, M., 2002 Intron evolution as a population-genetic process. Proc. Natl. Acad. Sci. USA 99: 6118-6123.
MAJEWSKI, J., 2003 Dependence of mutational asymmetry on gene-expression levels in the human genome. Am. J. Hum. Genet. 73: 688-692.
MAJEWSKI, J., and J. OTT, 2002 Distribution and characterization of regulatory elements in the human genome. Genome Res. 12: 1827-1836.
MARGULIES, E. H., S. L. KARDIA and J. W. INNIS, 2001 Identification and prevention of a GC content bias in SAGE libraries. Nucleic Acids Res. 29: E60.
MCVEAN, G. A., and B. CHARLESWORTH, 2000 The effects of Hill-Robertson interference between weakly selected mutations on patterns of molecular evolution and variation. Genetics 155: 929-944.
MCVEAN, G. A., and J. VIEIRA, 2001 Inferring parameters of mutation, selection and demography from patterns of synonymous site evolution in Drosophila. Genetics 157: 245-257.
MORIYAMA, E. N., and D. L. HARTL, 1993 Codon usage bias and base composition of nuclear genes in Drosophila. Genetics 134: 847-858.
MORIYAMA, E. N., and J. R. POWELL, 1997 Codon usage bias and tRNA abundance in Drosophila. J. Mol. Evol. 45: 514-523.
MORIYAMA, E. N., and J. R. POWELL, 1998 Gene length and codon usage bias in Drosophila melanogaster, Saccharomyces cerevisiae and Escherichia coli. Nucleic Acids Res. 26: 3188-3193.
MOUCHIROUD, D., G. D'ONOFRIO, B. AISSANI, G. MACAYA, C. GAUTIER et al., 1991 The distribution of genes in the human genome. Gene 100: 181-187.
NEKRUTENKO, A., and W. H. LI, 2000 Assessment of compositional heterogeneity within and between eukaryotic genomes. Genome Res. 10: 1986-1995.
OGATA, H., W. FUJIBUCHI and M. KANEHISA, 1996 The size differences among mammalian introns are due to the accumulation of small deletions. FEBS Lett. 390: 99-103.
OPHIR, R., and D. GRAUR, 1997 Patterns and rates of indel evolution in processed pseudogenes from humans and murids. Gene 205: 191-202.
PARSCH, J., 2003 Selective constraints on intron evolution in Drosophila. Genetics 165: 1843-1851.
PERCUDANI, R., A. PAVESI and S. OTTONELLO, 1997 Transfer RNA gene redundancy and translational selection in Saccharomyces cerevisiae. J. Mol. Biol. 268: 322-330.
POWELL, J. R., and E. N. MORIYAMA, 1997 Evolution of codon usage bias in Drosophila. Proc. Natl. Acad. Sci. USA 94: 7784-7790.
RICE, W. R., 1989 Analyzing tables of statistical tests. Evolution 43: 223-225.
SCHAEFFER, S. W., 2002 Molecular population genetics of sequence length diversity in the Adh region of Drosophila pseudoobscura. Genet. Res. 80: 163-175.
SHABALINA, S. A., A. Y. OGURTSOV, V. A. KONDRASHOV and A. S. KONDRASHOV, 2001 Selectiv