Compositional Biases and Polyalanine Runs in Humans
Julie Cocquetᵃ, Elfride De Baereᵇ, Sandrine Caburetᵃ, and Reiner A. Veitiaᵃ
ᵃ INSERM E0021 and U361, Reproduction et Physiopathologie Obstétricale, Hôpital Cochin, 75014 Paris, France
ᵇ Department of Medical Genetics, Ghent University Hospital, B-9000 Ghent, Belgium
Corresponding author: Reiner A. Veitia, Reproduction et Physiopathologie Obstétricale, Hôpital Cochin, Pavillon Baudelocque, 123 Bd. de Port Royal, 75014 Paris, France. E-mail: veitia{at}cochin.inserm.fr
| ABSTRACT |
|---|
Human proteins containing polyalanine tracts tend to have runs of other amino acids and their open reading frames (ORFs) display a biased codon usage. Their alanine, glycine, proline, and histidine content strongly correlates with the GC content of the third codon base, suggesting that the compositional specificity of these proteins is dictated to a great extent by the evolution of their ORFs.
MANY proteins with runs of single amino acids have been described. In humans, the amino acids most frequently encountered in repetitive tracts are, in decreasing incidence, glutamine (Gln, Q), leucine (Leu, L), proline (Pro, P), alanine (Ala, A), and glycine (Gly, G).
We assembled a sample of 78 genes containing one or more polyAla repeats (TestSet) by querying GenBank with BLAST, searching for matches of the word Aₙ. We retained polyAla proteins containing a run of at least seven Ala residues whose expression was supported by expressed sequence tag data, irrespective of their functional annotation or their implication in pathology. We set the threshold arbitrarily to 7 to avoid blurring potentially interesting correlations with the inclusion of proteins with shorter runs, which have a higher probability of appearing by chance. The TestSet contained 132.7 kb [mean open reading frame (ORF) length, 1.68 ± 1.50 kb]. To compare this sequence set with a nonredundant reference sample of the human genome, we collected 223 ORFs (RefSet) by querying the Online Mendelian Inheritance in Man (OMIM) database with the word "gene" and retrieving the coding sequences using the links with GenBank. This type of query allows retrieval of genes irrespective of their potential implication in pathology. The RefSet represented 439 kb (mean ORF length, 1.97 ± 1.41 kb). ORF lengths in both sequence sets (Test and Ref) were statistically similar (P > 0.1).
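The run-length criterion used to build the TestSet can be illustrated with a minimal sketch. It assumes protein sequences are available as plain one-letter strings; the function names `longest_ala_run` and `has_polyala` are hypothetical and are not part of the BLAST-based procedure described above.

```python
import re

def longest_ala_run(protein: str) -> int:
    """Length of the longest uninterrupted run of alanine (A) in a protein."""
    runs = re.findall(r"A+", protein.upper())
    return max((len(run) for run in runs), default=0)

def has_polyala(protein: str, min_run: int = 7) -> bool:
    """True if the protein contains a run of at least min_run Ala residues.

    min_run=7 mirrors the arbitrary threshold stated in the text.
    """
    return longest_ala_run(protein) >= min_run
```

For example, `has_polyala("MK" + "A" * 7 + "GP")` holds, whereas a protein whose longest Ala run is six residues would be excluded.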
The compositional properties of both DNA sequence sets (Test- and RefSet), and of the proteins they potentially encode, were analyzed using COMPSEQ (http://bioweb.pasteur.fr/seqanal/interfaces/compseq.html). This program counts the frequencies of all words (monomers, dimers, etc.) that occur in a sequence, using a sliding window. This window moves up by the length of the "word" (1, 2, etc.) each time, skipping over the intervening words. In the case of DNA it can count only those words that occur in a specified frame.
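The counting scheme just described (a window that advances by the word length, optionally restricted to one reading frame) can be sketched as follows. This is an illustration of the principle only, not the COMPSEQ implementation; `count_words` and `word_frequencies` are hypothetical names.

```python
from collections import Counter

def count_words(seq: str, word_len: int = 3, frame: int = 0) -> Counter:
    """Count non-overlapping words of length word_len, starting at offset frame.

    The window advances by word_len each step, so with word_len=3 and
    frame=0 the words are the codons of an ORF.
    """
    seq = seq.upper()
    stop = len(seq) - word_len + 1
    return Counter(seq[i:i + word_len] for i in range(frame, stop, word_len))

def word_frequencies(seq: str, word_len: int = 3, frame: int = 0) -> dict:
    """Relative frequency of each word counted by count_words."""
    counts = count_words(seq, word_len, frame)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}
```

Calling `count_words("ATGGCCGCGTAA")` counts the four codons ATG, GCC, GCG, and TAA once each; with `frame=1` the same sequence is read as TGG, CCG, CGT.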
The frequencies of the four nucleotides were not the same in the TestSet and in the RefSet. This translates into statistically different GC contents: 0.60 (length-weighted mean) for the TestSet vs. 0.53 for the RefSet (Mann-Whitney test; P < 0.0001). RefSet GC content was in perfect agreement with the weighted mean recalculated from the data in ZHANG (1998).
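A length-weighted mean GC content can be sketched as below, assuming each ORF is a plain nucleotide string. The helper names are hypothetical, and the Mann-Whitney comparison itself is not reproduced here.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def length_weighted_gc(seqs) -> float:
    """Mean GC content in which each ORF is weighted by its length.

    This equals the GC content of all ORFs pooled together, so long
    ORFs contribute proportionally more than short ones.
    """
    total_len = sum(len(s) for s in seqs)
    return sum(gc_content(s) * len(s) for s in seqs) / total_len
```

With the toy input `["GGCC", "ATATATAT"]` the unweighted mean of the two GC contents is 0.5, whereas the length-weighted mean is 1/3, because the longer, GC-free sequence dominates.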
We also analyzed possible departures of the dinucleotide frequencies in the sequence sets from what is expected from the random assortment of the nucleotides. We used the well-known measure $\rho^*_{xy} = f^*_{xy}/(f^*_x f^*_y)$, where $f^*_{xy}$ denotes the frequency of dinucleotide xy and $f^*_x$ and $f^*_y$ are the a priori frequencies of the mononucleotides x and y. The $\rho^*$ analysis was applied to a concatemer of ORFs concatenated with their inverted complementary sequence. Values of $\rho^* \geq 1.23$ or $\rho^* \leq 0.78$ translate into extreme over- or underrepresentation of the dinucleotide in question (P < 0.001; KARLIN and BURGE 1995). The $\rho^*$ values of our data set were statistically similar to those found by [...] $\rho^*_{CG} = 0.50$, while in the polyAla TestSet $\rho^*_{CG} = 0.74$. In the polyAla(-) TestSet $\rho^*_{CG}$ was 0.71. Thus, CG usage in the TestSet was closer to the normal range and the polyAla-encoding regions affected this quantity, but in a minor way. Other compositional biases such as Pro and Gly richness can explain part of this CG contribution. Namely, one-fourth of Pro residues in the polyAla proteins are encoded by a CG-containing codon and some GlyGly dicodons (i.e., GGCGGN) can generate a CG doublet. Other GC-rich codons or their combinations can also make a contribution. In line with this trend, dinucleotide CG was represented twice as much in the TestSet as in the RefSet, having the highest score of all dinucleotides. Dinucleotide AC is at the boundary of underrepresentation in the TestSet ($\rho^*_{AC} = 0.79$ vs. 0.83 in the RefSet) while GT was specifically underrepresented ($\rho^*_{GT} = 0.74$ vs. 0.80 in the RefSet).
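The dinucleotide relative abundance can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used for the values above: dinucleotides are counted in overlapping steps on each ORF and on its reverse complement separately (which avoids artificial dinucleotides at the junctions of a concatemer), and the names `reverse_complement` and `rho_star` are hypothetical.

```python
from collections import Counter
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Inverted complementary sequence of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def rho_star(seqs) -> dict:
    """Symmetrized relative abundance f*_xy / (f*_x f*_y) per dinucleotide.

    Assumption: overlapping dinucleotides, counted on both strands.
    Dinucleotides involving an absent base yield NaN.
    """
    mono, di = Counter(), Counter()
    for s in seqs:
        for strand in (s.upper(), reverse_complement(s)):
            mono.update(strand)
            di.update(strand[i:i + 2] for i in range(len(strand) - 1))
    n_mono, n_di = sum(mono.values()), sum(di.values())
    rho = {}
    for x, y in product("ACGT", repeat=2):
        fx, fy = mono[x] / n_mono, mono[y] / n_mono
        fxy = di[x + y] / n_di
        rho[x + y] = fxy / (fx * fy) if fx and fy else float("nan")
    return rho
```

Under these assumptions, reverse-complementary dinucleotides such as AC and GT receive identical values by construction, so the sketch does not reproduce the separate AC and GT figures quoted above; the exact counting scheme behind those figures is not specified in the text.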
The codon frequencies of the RefSet were in agreement with those reported in the codon usage database for man and with the length-weighted values recalculated from ZHANG's (1998) data. This and the fact that Karlin's $\rho^*_{xy}$ values for the RefSet are similar to those reported in the literature [...] ~25% of Ala and Pro, respectively (and only ~11% in the RefSet) whereas TCG encoded ~13% of Ser residues (compared to ~5% in the RefSet) (online Table 1). We also studied Ala codon usage in the ORF regions specifically encoding runs of four or more Ala residues. Of these 1433 codons, 35.9% corresponded to GCC and 32.2% to GCG. High GCC usage was expected (i.e., 38.9% in TestSet and 40.5% in RefSet), whereas the bias favoring the usage of GCG was unexpectedly high, as noted above.
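The share of each synonymous codon among all codons for one amino acid, as in the GCC and GCG percentages above, can be computed along these lines. The sketch assumes an in-frame ORF string; `synonymous_usage` is a hypothetical name.

```python
from collections import Counter

ALA_CODONS = ("GCA", "GCC", "GCG", "GCT")

def synonymous_usage(orf: str, codons=ALA_CODONS) -> dict:
    """Fraction of each listed codon among all listed codons in the ORF.

    The ORF is read in frame 0; codons outside the list are ignored.
    Returns an empty dict when none of the listed codons occurs.
    """
    orf = orf.upper()
    counts = Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))
    total = sum(counts[c] for c in codons)
    return {c: counts[c] / total for c in codons} if total else {}
```

For the toy ORF `ATGGCCGCGGCGGCATAA`, half of the four Ala codons are GCG and one-quarter each are GCC and GCA. Passing `codons=("CCA", "CCC", "CCG", "CCT")` gives the analogous shares for Pro.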
Di-, tri-, and homotetrapeptides of Pro, Gly, His, Ser, and Gln also showed a tendency (in decreasing order) toward overrepresentation in the TestSet. This suggests that these amino acids tend to be organized in runs in the polyAla-rich proteins (Table 1). This corroborates a previous suggestion that GC-rich genomes (i.e., human and fly) have favored runs of GC-rich codons such as those coding for Ala, Pro, Gly, and His, whereas GC-poor genomes (worm, yeast, and weed; GC < 40%) have favored runs encoded by AT-rich codons.
To explore the relationship between GC content and protein primary sequence, we analyzed the combined content of the amino acids Ala, Gly, and Pro (AGP) vs. the GC content of the third base of the codons (GC3) for each ORF of the TestSet. Our analysis is similar to that described by [...].
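A per-ORF comparison of Ala + Gly + Pro content with third-position GC content (GC3) can be sketched as follows. Two assumptions are made: Ala, Gly, and Pro are recognized from the first two codon bases (GCN, GGN, and CCN in the standard genetic code), and the association is summarized with a Pearson coefficient, which may differ from the statistic actually used. All function names are hypothetical.

```python
import math

def gc3(orf: str) -> float:
    """GC fraction at the third position of each complete codon."""
    thirds = orf.upper()[2::3]
    return sum(base in "GC" for base in thirds) / len(thirds)

def agp_fraction(orf: str) -> float:
    """Fraction of codons encoding Ala (GCN), Gly (GGN), or Pro (CCN).

    Any stop codon present in the ORF is counted in the denominator.
    """
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    return sum(c[:2] in ("GC", "GG", "CC") for c in codons) / len(codons)

def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

Given a list `orfs`, the correlation would be obtained as `pearson_r([gc3(o) for o in orfs], [agp_fraction(o) for o in orfs])`. Extending the prefix test with the His codons CAC and CAT would give the AGPH variant.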
Departures of the observed homodicodon frequencies from random expectation were observed in the regions of the TestSet coding for the polyAla tracts (1433 codons). All homodicodons showed a tendency to overrepresentation with odds ratios >1.5. The most frequent dicodons, (GCC)₂ and (GCG)₂, had odds ratios of 1.55 and 1.75, respectively. Standard frequency comparison also showed highly significant differences from random expectation (P < 0.0001). This is in line with the notion that the initial mechanism to alter the length of amino acid runs is polymerase slippage during replication of repeated units. A further array homogenization process cannot be excluded on the basis of our analysis (BOUZEKRI et al. 1998).
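One way to quantify homodicodon overrepresentation is the ratio of the observed frequency of a codon followed by itself to the frequency expected if codons were arranged at random. The sketch below uses that observed/expected ratio as an assumption; the exact definition of the odds ratio reported above is not given in the text, and `homodicodon_ratio` is a hypothetical name.

```python
from collections import Counter

def homodicodon_ratio(codons) -> dict:
    """Observed/expected frequency of each codon immediately repeated.

    codons: list of codons in reading order (at least two).
    Expected frequency of the pair (c, c) under random arrangement is
    the squared frequency of c; adjacent pairs are counted overlapping.
    """
    n = len(codons)
    if n < 2:
        raise ValueError("need at least two codons")
    freq = Counter(codons)
    pairs = Counter(zip(codons, codons[1:]))
    n_pairs = n - 1
    return {
        c: (pairs[(c, c)] / n_pairs) / ((k / n) ** 2)
        for c, k in freq.items()
    }
```

A ratio above 1 indicates that a codon follows itself more often than its overall frequency predicts. If several tracts are concatenated before counting, pairs spanning tract boundaries are included, which is a simplification of this sketch.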
Both the least-preferred Pro codon CCG and its homodicodon (CCG)₂ are more represented in the TestSet. Interestingly, (CCG)ₙ becomes (GCC)ₙ and would encode polyAla when read in another frame. Similarly, GGC, the most frequent Gly codon in both sequence sets, shows a tendency to form homodicodons in the TestSet. Again, a run of GGC read in another frame is interpreted as (GCG)ₙ (a run of Ala codons, the least preferred in the rest of the genome but strongly represented in the polyAla proteins). This suggests that there might have been interconversion Glyₙ ↔ Alaₙ or Proₙ ↔ Alaₙ, although not many examples could be found in GenBank. If this hypothetical "framesliding" occurred, its result has persisted within mammals and only statistical relicts remain. (See L09550 vs. NM_00523.)
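The frame relationships invoked here can be checked directly. The sketch below reads a repeat in each of the three forward frames and translates it with a small lookup restricted to the six codons involved (standard genetic code); `read_frame` and `CODON_TO_AA` are hypothetical names.

```python
# Subset of the standard genetic code covering the codons that arise
# when (CCG)n and (GGC)n are read in the three forward frames.
CODON_TO_AA = {
    "CCG": "Pro", "CGC": "Arg", "GCC": "Ala",
    "GGC": "Gly", "GCG": "Ala", "CGG": "Arg",
}

def read_frame(seq: str, frame: int) -> list:
    """Complete codons of seq read from offset frame (0, 1, or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

for repeat in ("CCG", "GGC"):
    tract = repeat * 4
    for frame in range(3):
        codons = read_frame(tract, frame)
        residues = [CODON_TO_AA[c] for c in codons]
        print(repeat, frame, codons[0], residues[0], len(codons))
```

A Pro-encoding (CCG)ₙ tract reads as GCC (Ala) in frame 2, and a Gly-encoding (GGC)ₙ tract reads as GCG (Ala) in frame 1; the remaining frame of each repeat encodes Arg.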
In several transcription factors, alanine-rich regions have been shown to be responsible for repression of target genes. [...] α-helices with multiple equilibrated isoforms, leading to the existence of a threshold length beyond which deleterious effects appear.
In conclusion, our results show that the ORFs encoding polyAla proteins have higher GC and GC3 contents than reference human ORFs do. The stronger correlation between AGP (AGPH) and GC3 contents in the TestSet suggests that constraints operating on these ORFs leave a stronger imprint on amino acid composition than those operating on other ORFs in the rest of the genome. The fact that most polyAla proteins in our TestSet are transcription factors raises the question of the impact of the evolution of a set of ORFs, influenced by their genomic contexts, on major evolutionary transitions (organismal diversification).
| ACKNOWLEDGMENTS |
|---|
The authors thank Kenta Sumiyama for interesting discussions about a previous version of the manuscript and Alex Fedorov, a reviewer of this article, for his helpful comments. R.A.V. is funded by the Université Denis Diderot/Paris VII and the Institut National de la Santé et de la Recherche Médicale.
Manuscript received May 8, 2003; Accepted for publication July 8, 2003.
| LITERATURE CITED |
|---|
BOUZEKRI, N., P. G. TAYLOR, M. F. HAMMER, and M. A. JOBLING, 1998 Novel mutation processes in the evolution of a haploid minisatellite, MSY1: array homogenization without homogenization. Hum. Mol. Genet. 7:655-659.
COCQUET, J., E. PAILHOUX, F. JAUBERT, N. SERVEL, and X. XIA et al., 2002 Evolution and expression of FOXL2. J. Med. Genet. 39:916-921.
CUMMINGS, C. J. and H. Y. ZOGHBI, 2000 Fourteen and counting: unraveling trinucleotide repeat diseases. Hum. Mol. Genet. 9:909-916.
DE BAERE, E., D. BEYSEN, C. OLEY, B. LORENZ, and J. COCQUET et al., 2003 FOXL2 and BPES: mutational hotspots, phenotypic variability, and revision of the genotype-phenotype correlation. Am. J. Hum. Genet. 72:478-487.
D'ONOFRIO, G., K. JABBARI, H. MUSTO, and G. BERNARDI, 1999 The correlation of protein hydropathy with the base composition of coding sequences. Gene 238:3-14.
EYRE-WALKER, A. and L. D. HURST, 2001 The evolution of isochores. Nat. Rev. Genet. 2:549-555.
HAN, K. and J. L. MANLEY, 1993 Functional domains of the Drosophila Engrailed protein. EMBO J. 12:2723-2733.
KARLIN, S. and C. BURGE, 1995 Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 11:283-290.
KARLIN, S., L. BROCCHIERI, A. BERGMAN, J. MRAZEK, and A. J. GENTLES, 2002 Amino acid runs in eukaryotic proteomes and disease associations. Proc. Natl. Acad. Sci. USA 99:333-338.
NAKACHI, Y., T. HAYAKAWA, H. OOTA, K. SUMIYAMA, and L. WANG et al., 1997 Nucleotide compositional constraints on genomes generate alanine-, glycine-, and proline-rich structures in transcription factors. Mol. Biol. Evol. 14:1042-1049.
SAKAHIRA, H., P. BREUER, M. K. HAYER-HARTL, and F. U. HARTL, 2002 Molecular chaperones as modulators of polyglutamine protein aggregation and toxicity. Proc. Natl. Acad. Sci. USA 99:16412-16418.
SUMIYAMA, K., K. WASHIO-WATANABE, N. SAITOU, T. HAYAKAWA, and S. UEDA, 1996 Class III POU genes: generation of homopolymeric amino acid repeats under GC pressure in mammals. J. Mol. Evol. 43:170-178.
ZHANG, M. Q., 1998 Statistical features of human exons and their flanking regions. Hum. Mol. Genet. 7:919-932.