Genetics, Vol. 166, 1141-1154, March 2004, Copyright © 2004

Neurological Proteins Are Not Enriched For Repetitive Sequences

Melanie A. Huntley and G. Brian Golding
Department of Biology, McMaster University, Hamilton, Ontario L8S 4K1, Canada

Corresponding author: G. Brian Golding, McMaster University, 1280 Main St. W., Hamilton, Ontario L8S 4K1, Canada. E-mail: golding@mcmaster.ca

Communicating editor: S. W. SCHAEFFER


ABSTRACT

Proteins associated with disease and development of the nervous system are thought to contain repetitive, simple sequences. However, genome-wide surveys for simple sequences within proteins have revealed that repetitive peptide sequences are the most frequent shared peptide segments among eukaryotic proteins, including those of Saccharomyces cerevisiae, which has few to no specialized developmental and neurological proteins. It is therefore of interest to determine if these specialized proteins have an excess of simple sequences when compared to other sets of compositionally similar proteins. We have determined the relative abundance of simple sequences within neurological proteins and find no excess of repetitive simple sequence within this class. In fact, polyglutamine repeats that are associated with many neurodegenerative diseases are no more abundant within neurological specialized proteins than within nonneurological collections of proteins. We also examined the codon composition of serine homopolymers to determine what forces may play a role in the evolution of extended homopolymers. Codon type homogeneity tends to be favored, suggesting replicative slippage instead of selection as the main force responsible for producing these homopolymers.


THE presence and abundance of simple repetitive sequences within nucleotide sequences are well known. Microsatellites and other tandemly repeated sequences within DNA are well characterized; however, similarly repetitive sequences within proteins are less well acknowledged and understood. Nevertheless, such repeats within eukaryotic proteins are abundant. They vary in composition from a simple reiteration of a single amino acid to long tracts of sequence that are predominated by the presence of one or only a few amino acids.

Genome-wide surveys for simple sequences have shown that these low-complexity sequences are the most commonly shared peptide fragments in eukaryotic proteomes (GOLDING 1999; HUNTLEY and GOLDING 2000). The prominence of these regions in proteins is a eukaryotic phenomenon, as they are not as common or as highly repetitive in prokaryotes (MARCOTTE et al. 1999; HUNTLEY and GOLDING 2000). Despite their abundance in eukaryotic proteins, little is known about the structure and function of these highly repetitive, low-complexity regions.

Only a few functions have been ascribed to these unusual regions. Among the first described, and perhaps the best known, are the opa and opa-like repeats found in essential developmental proteins in insects (WHARTON et al. 1985). These repeats are stably located repetitive elements that typically encode a stretch of up to ~30 glutamines, with interspersed histidine residues.

In some prokaryotes, reversible mutations within regions of repetitive simple sequence DNA are involved in phase variation (STERN et al. 1986; HOOD et al. 1996; SAUNDERS et al. 2000). This mechanism allows bacterial populations to adapt to changing environments and is important in bacterial virulence (MOXON et al. 1994). Functional studies have shown that acidic, glutamine-rich, and proline-rich regions comprise three types of activation domains (MITCHELL and TJIAN 1989; TRIEZENBERG 1995), while MAR ALBA et al. (1999) found that transporter proteins were overrepresented among proteins containing serine repeats.

Other well-known repetitive regions in proteins are thought to be the cause of several human neurodegenerative diseases. These are associated with proteins containing extended regions of tandemly repeated glutamine residues. These proteins and others involved in nervous system disease and development contain multiple long homopeptides within their sequence (KARLIN and BURGE 1996). But not all of the homopeptide tracts are composed of glutamine residues; other residues such as proline, serine, glycine, and glutamic acid form extended homopolymers in these proteins as well.

Huntington's disease was one of the first disorders shown to be due to homopeptides. This disease is associated with neural cell death, progressive chorea, dementia, and seizures. It is believed to be caused by an increase in the length of a CAG triplet repeat within the huntingtin gene. The age of onset is inversely correlated with the length of the CAG repeat (SNELL et al. 1993; DUYAO et al. 1993; KIEBURTZ et al. 1994). In normal individuals, the repeat length is typically between 9 and 30 copies, while affected individuals tend to have 40 to 121 copies. The triplet repeat encodes a polyglutamine tract, which can form cross-links within and between proteins. This increased cross-linking may induce the formation of aggregates within the cell and consequent neuronal death (CARIELLO et al. 1996).

Kennedy's disease, also known as spinal and bulbar muscular atrophy (SBMA), is an X-linked disease that causes late onset lower-motor and primary-sensory neuropathy. Clinical symptoms include muscular atrophy, twitching, tremors, and androgen deficiency. The primary cause of this disease is an expanded CAG triplet repeat within the androgen receptor (AR) gene (LA SPADA et al. 1991). As in Huntington's disease, the triplet repeat encodes a polyglutamine tract that may have increasingly toxic effects on neuronal cells as the repeat expands.

Dentatorubral-pallidoluysian atrophy (DRPLA) is phenotypically similar to Huntington's disease, including late onset dementia, cerebellar ataxia, myoclonic seizures, and choreic and athetoid movements. Again an expanded CAG repeat, encoding polyglutamine, is responsible for the pathology of this disease (LI et al. 1993; KOIDE et al. 1994; NAGAFUCHI et al. 1994). Haw River syndrome is also caused by this same CAG expansion in the DRPLA gene (BURKE et al. 1994).

Other neurological diseases that fall into this category are spinocerebellar ataxia (SCA) 1, 2, 3 (Machado-Joseph disease), 6, 7, and 17. All are caused by expansions of a polyglutamine tract in separate proteins (BANFI et al. 1994; KAWAGUCHI et al. 1994; PULST et al. 1996; DAVID et al. 1997; ZHUCHENKO et al. 1997; NAKAMURA et al. 2001; SILVEIRA et al. 2002).

Studies of synthetic homopolymers, including glutamine repeats, have shown that some can form stable structures (KRULL et al. 1965; PERUTZ et al. 1994; ROHL et al. 1999). Glutamine repeats have been shown to link pairs of β-strands by hydrogen bonds, forming polar zippers (PERUTZ et al. 1994). This action can result in rigid, irreversible aggregates of proteins within the cell. This has been used as an explanation of how the extended glutamine repeats in some human neurological proteins induce their associated neurodegenerative diseases (PERUTZ and WINDLE 2001). However, prion proteins within the yeast Saccharomyces cerevisiae also form aggregates, but lack homopeptide sequence within the prion-determining domain (LINDQUIST et al. 2001). Instead these domains tend to be enriched with polar amino acids, such as glutamine and asparagine.

Large numbers of short and long homopeptides are more frequent in developmental proteins than in other classes of proteins (KARLIN and BURGE 1996). We therefore expect that this may also be true for highly repetitive, low-complexity regions. In this study we collected all developmental and neurological proteins available from the human and Drosophila melanogaster proteomes and compared each to similar, but mutually exclusive, data sets to determine whether developmental and neurological proteins do indeed contain more highly repetitive, low-complexity regions than other classes of proteins. We confirm previous results showing that developmental proteins are enriched for homopolymers and, in addition, show that they are enriched for low-complexity sequence regions. But this is not the pattern observed in neurological proteins. As a class of proteins, neurological proteins do not have an excess of regions highly enriched for glutamine.

In most of the neurodegenerative proteins, polyglutamine results from a triplet repeat expansion of the CAG codon. It is generally believed that these simple sequences arise as a byproduct of replicative slippage at the DNA level, similar to the process occurring in microsatellite expansion. However, not all repeats follow this pattern. Serine reiterations in yeast do not show bias toward long tracts of one of the possible codons (MAR ALBA et al. 1999). This suggests that some repeats may have evolved via selection and not slippage.

In this study extended serine homopolymer tracts are used to show that the length of the tract does not affect the mixture of codon types but that the relative position of the codons within a tract does affect codon composition, indicating that these tracts are likely the result of slippage.


MATERIALS AND METHODS

Neurological proteins:
Human and Drosophila neurological, developmental, and kinase proteins were collected from the National Center for Biotechnology Information (NCBI) using the ENTREZ query system. To search for neurological proteins, the key words neural, neuro, nerve, and axon were used. To search for developmental proteins we used the key words development, morphogen, homeotic, differentiation, embryo, larva, and determination. These key words were based on the key words used in the gene ontology database (http://www.godatabase.org/dev/database/). Kinase proteins were collected by searching for the key word kinase. All key words (or modifications of the key word's roots) had to be present on the definition line of the GenPept files. All key word matches were screened to eliminate matches that did not fit into their respective categories, such as homeostasis, which matched the root of homeotic. These databases are not exclusive, but this method is unbiased, explicit, and easily repeatable. All sequences targeted to the mitochondria were removed.

Many coding sequences within a genome are redundant duplicates, isozymes, or ancient duplications. Additionally, sequence databases can contain redundant sets of sequences. To construct a database of, for example, neurological proteins, such duplicates had to be filtered. First a BLAST search (ALTSCHUL et al. 1997) was done to screen for similar proteins within the genome. All proteins that had a BLAST expect value <0.75 were then pairwise aligned, using ALIGN (MYERS and MILLER 1988). The smaller of any two sequences that had a percentage identity >20% (e.g., the percentage identity between hemoglobin and myoglobin) was discarded, as the pair was considered to be too recently evolutionarily related. In this way we retained the larger protein of any related pair of sequences. A nonredundant, human neurological database was then constructed, resulting in 433 sequences, equaling a 60% reduction. A nonredundant developmental protein database containing 242 sequences was similarly constructed by discarding 75% of the sequences. Kinase proteins were collected as a control group and after filtering out 72% comprised 982 nonredundant sequences. From the 982 nonredundant kinase sequences, two more kinase databases were constructed to be comparable to the neurological database. The two kinase databases were each constructed by sampling from the 982 nonredundant sequences. These databases may have a small amount of overlap. The first sampling was designed to be comparable to the neurological database by being within 5% of its protein lengths and contained 422 sequences. The second was within 10% of the neurological sequence lengths and had 429 sequences. In this way, we not only had a full collection of nonredundant kinase sequences against which we could compare the neurological data set, but also had collections of kinase sequences that were compositionally similar to the neurological sequences and thus more directly comparable.
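The redundancy filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `filter_redundant`, the input dictionaries, and the greedy order in which pairs are visited are hypothetical, and the percent identities are assumed to have been computed beforehand (for example, from pairwise alignments of the BLAST-screened pairs).

```python
def filter_redundant(lengths, identities, max_identity=20.0):
    """Drop the shorter protein of every pair whose percent identity exceeds the cutoff.

    lengths:    dict mapping protein id -> sequence length
    identities: dict mapping frozenset({id_a, id_b}) -> percent identity
    Both inputs are hypothetical stand-ins for precomputed alignment results.
    """
    removed = set()
    # Visit pairs from most to least similar so that the outcome is deterministic
    # (an assumption; the source does not specify an order).
    ordered = sorted(identities.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    for pair, identity in ordered:
        if identity <= max_identity:
            continue
        a, b = sorted(pair)
        if a in removed or b in removed:
            continue  # one member is already gone, so this pair is no longer redundant
        # Discard the shorter sequence; ties are broken by identifier.
        shorter = a if (lengths[a], a) <= (lengths[b], b) else b
        removed.add(shorter)
    return sorted(set(lengths) - removed)


# Hypothetical example: the two globins exceed the 20% cutoff, so the shorter one is dropped.
example_lengths = {"hemoglobin": 141, "myoglobin": 154, "kinase_x": 500}
example_identities = {
    frozenset({"hemoglobin", "myoglobin"}): 26.0,
    frozenset({"myoglobin", "kinase_x"}): 12.0,
}
kept = filter_redundant(example_lengths, example_identities)
```

Skipping pairs in which one member has already been discarded keeps as many sequences as possible; a stricter reading of the text would discard the smaller member of every pair regardless.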

Databases from Drosophila protein sequences were constructed, resulting in 77 neurological proteins (a 45% reduction), 139 developmental proteins (a 56% reduction), and 128 kinases (a 65% reduction). The kinase database within 5% of the neurological lengths had 52 proteins, while the one within 10% of the neurological lengths had 64.

In total, we constructed five types of databases each for human and Drosophila proteins: neurological, developmental, kinase, kinase within 5% of neurological lengths, and kinase within 10% of neurological lengths. To analyze these databases we constructed comparison databases that were similar in composition to the original databases, while excluding neurological, developmental, and kinase proteins, respectively. Each database was used as a basis to sample sequences from the NCBI and to construct 100 random comparison databases. For instance, for human neurological proteins, 50 databases were constructed to contain human sequences that were not neurological, but otherwise randomly chosen from the NCBI and within 5% of the lengths of the neurological proteins. Another 50 databases were constructed to be within 10% of the lengths of the neurological proteins. Therefore, each protein within the neurological database had a protein of similar length within each of the comparison databases. In this way, each of the 100 comparison databases is mutually exclusive to the human neurological database, but is similar in protein length composition.
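One way to picture the construction of a length-matched comparison database is the sketch below. It is an assumption-laden illustration, not the procedure actually used: the function name `sample_length_matched`, the in-memory `pool` dictionary, and sampling without replacement within a single database are all hypothetical choices.

```python
import random


def sample_length_matched(target_lengths, pool, tolerance=0.05, rng=None):
    """For each target length, pick one unused pool protein within +/- tolerance of it.

    target_lengths: lengths of the proteins in the database being matched
    pool:           dict id -> length of candidate proteins, assumed to exclude
                    the class of interest already (e.g., no neurological proteins)
    """
    rng = rng or random.Random()
    used = set()
    chosen = []
    for target in target_lengths:
        low, high = target * (1 - tolerance), target * (1 + tolerance)
        candidates = sorted(
            pid for pid, n in pool.items() if low <= n <= high and pid not in used
        )
        if not candidates:
            raise ValueError(f"no unused pool protein within {tolerance:.0%} of length {target}")
        pick = rng.choice(candidates)
        used.add(pick)
        chosen.append(pick)
    return chosen


# Hypothetical pool of nonneurological proteins and two target lengths.
pool = {"p1": 100, "p2": 104, "p3": 200, "p4": 300}
matched = sample_length_matched([100, 200], pool, tolerance=0.05, rng=random.Random(0))
```

Repeating the call with different random seeds would give a set of comparison databases, each mutually exclusive to the database being matched but similar in length composition.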

To determine how common highly repetitive, simple sequences were in these databases, BLAST searches were performed, using 100-residue-long homopolymers of each amino acid. The number of BLAST hits with expect values ≤0.01 was compared to the numbers found in the 100 comparison databases and the corresponding percentiles were recorded.
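The percentile comparison can be illustrated with a short Python sketch. The definition used here (the percentage of comparison databases with a hit count at or below the observed count) is an assumption, as are the function names and the example counts.

```python
def percentile_rank(observed, comparison_counts):
    """Percentage of comparison databases whose hit count is <= the observed count."""
    if not comparison_counts:
        raise ValueError("need at least one comparison database")
    at_or_below = sum(1 for c in comparison_counts if c <= observed)
    return 100.0 * at_or_below / len(comparison_counts)


def exceeds_all(observed, comparison_counts):
    """True only if the observed count is strictly larger than every comparison count."""
    return all(observed > c for c in comparison_counts)


# Hypothetical hit counts from four comparison databases.
rank_tied = percentile_rank(8, [0, 1, 2, 8])
rank_mid = percentile_rank(3, [1, 2, 4, 5])
```

Under this definition a database can sit in the 100th percentile while only tying the largest comparison count, which is why a strict "larger than any" check is kept separate.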

This analysis was also performed on the redundant databases, to examine how the analysis was affected by making the databases nonredundant.

To ensure that these results were robust, we also performed the same analysis using BLAST with 50-amino-acid-long homopolymers and using two entirely distinct algorithms, SIMPLE (ALBA et al. 2002) and SEG (WOOTTON and FEDERHEN 1993).

Of these methods, the SIMPLE algorithm has the most rigid window length to search for cryptically simple sequences. During various trials we used total window lengths ranging from 40 to 100 and searched for monomeric-like simple sequences.

For analysis using the SEG algorithm, we chose a window length, L, of 40 and a complexity cutoff value, K2(1), of 2.6. All low-complexity segments were sorted into amino acid categories on the basis of the composition of the segment. If two or more amino acids each had frequencies of 30% or higher, that segment was counted toward each of those categories. This was done to search for highly repetitive, low-complexity regions.
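The sorting of low-complexity segments into amino acid categories can be sketched as follows. The function name `segment_categories` and the example segments are hypothetical; the 30% threshold is the one stated above.

```python
from collections import Counter


def segment_categories(segment, min_fraction=0.30):
    """Return every residue that makes up at least min_fraction of the segment."""
    counts = Counter(segment)
    total = len(segment)
    return sorted(aa for aa, n in counts.items() if n / total >= min_fraction)


# Hypothetical low-complexity segments.
single = segment_categories("QQQQHQQAQQ")   # glutamine dominates
double = segment_categories("SSSGGGSGSA")   # serine and glycine both pass the threshold
```

A segment in which two residues each reach the threshold is counted toward both categories, so category totals can exceed the number of segments.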

In addition, we analyzed the percentage of low complexity per sequence and the number of low-complexity regions per sequence. We did this using two different sets of SEG parameters: an L of 15 with K2(1) of 1.9 and an L of 40 with K2(1) of 2.6.

This entire analysis was also performed on the proteins from Caenorhabditis elegans to determine how widespread the resulting patterns were.

Homopolymer tracts:
Analysis similar to a previous study (KARLIN and BURGE 1996) was performed on nonredundant protein sets for both humans and Drosophila. Following this previous study, we excluded proteins with extremely biased amino acid content (any amino acid with >20% frequency) and searched for proteins with three or more homopeptides of lengths ≥5 residues whose combined lengths totaled no less than 20. In sequences with extreme bias in composition, long homopeptides are expected to occur more often by chance. Karlin and Burge also screened for proteins containing at least one homopeptide of length ≥10 residues and at least one other of length ≥5 residues. We used the additional requirement that at least one homopeptide within a protein had a length of 15 residues or more to emphasize more extended homopolymers. The protein descriptions, their accession numbers, lengths, and the homopeptide lengths were recorded. Proteins with any known neurological function were grouped in the "neurological" category. Any of the remaining proteins with known developmental function were grouped under the "developmental" category. All other proteins with some known function were termed "other" and any remaining proteins were put in the "unknown or hypothetical" category. We further selected the serine homopeptides within these proteins and analyzed their codon content.
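A minimal Python sketch of the homopeptide screen is given below. The function names and the example sequence are hypothetical, and each criterion is reported as a separate flag because the exact way the criteria were combined is left as an assumption for the reader.

```python
import re
from collections import Counter


def homopeptides(seq, min_len=5):
    """Return (residue, run length) for every run of one residue of at least min_len."""
    pattern = r"(.)\1{%d,}" % (min_len - 1)
    return [(m.group(1), len(m.group(0))) for m in re.finditer(pattern, seq)]


def screen(seq):
    """Evaluate the screening criteria described in the text, one flag per criterion."""
    runs = homopeptides(seq)
    lengths = sorted((n for _, n in runs), reverse=True)
    max_freq = max(Counter(seq).values()) / len(seq) if seq else 0.0
    return {
        "runs": runs,
        "biased_composition": max_freq > 0.20,           # exclusion criterion
        "three_runs_total_20": len(lengths) >= 3 and sum(lengths) >= 20,
        "one_10_plus_another_5": len(lengths) >= 2 and lengths[0] >= 10,
        "has_run_of_15": bool(lengths) and lengths[0] >= 15,
    }


# Hypothetical protein: three homopeptides embedded in run-free filler sequence.
filler = "MKTLVNDERFGHIYCWP" * 10
example = "Q" * 15 + filler + "S" * 5 + "ACDEFGHIKL" + "A" * 6
result = screen(example)
```

The regular expression relies on a backreference (`\1`) so that only runs of a single residue are reported.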

Serine is unique among the amino acid residues as it has two types of codons (TCN and AGY) that are at least two mutational events apart. Because of the mutational distance between the two codon types, studying the codon composition of serine homopolymers allows for a stronger distinction between the two hypotheses for their mechanism of evolution: replicative slippage or selection at the protein level. The TCN codons (TCA, TCC, TCG, and TCT) are more frequent than the AGY codons (AGT and AGC). If the homopeptide was simply the product of DNA slippage during replication, we would expect little mixture of the two codon types. For example, a polyserine tract that was created via strand slippage should be composed of only TCN codons or only AGY codons, but seldom a mixture of both. If, however, other forces, such as selection, are acting to create these homopeptides, then a mixture of the codon types might be more common.
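The distinction between the two serine codon types can be made concrete with a small Python sketch. The function names are hypothetical; the codon sets are those listed above.

```python
TCN = {"TCA", "TCC", "TCG", "TCT"}
AGY = {"AGT", "AGC"}


def codon_types(dna):
    """Classify each codon of an in-frame polyserine tract as 'TCN' or 'AGY'."""
    if len(dna) % 3:
        raise ValueError("tract length must be a multiple of 3")
    types = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3].upper()
        if codon in TCN:
            types.append("TCN")
        elif codon in AGY:
            types.append("AGY")
        else:
            raise ValueError(f"{codon} is not a serine codon")
    return types


def is_mixed(dna):
    """True if the tract contains both codon types."""
    return len(set(codon_types(dna))) > 1


def min_mutational_distance():
    """Smallest number of nucleotide differences between any TCN and any AGY codon."""
    return min(sum(x != y for x, y in zip(t, a)) for t in TCN for a in AGY)


types = codon_types("TCATCCAGC")
distance = min_mutational_distance()
```

Because the first two codon positions always differ between the two types, no single substitution converts one type into the other, which is what makes mixed tracts informative.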

We determined whether the length of the homopolymer tract influenced the mixture of the two codon types, using a likelihood-ratio test, $\chi^2 = -2 \ln (L_0/L)$, where $L_0$ is the likelihood of the null model and $L$ is the likelihood of the model being tested.

Given genomic codon usage frequencies ($f_{AGY}$ and $f_{TCN}$) and $N$ polyserine tracts of length $n_i = x_i + y_i$, where $x_i$ is the number of AGY codons and $y_i$ is the number of TCN codons in the $i$th tract, the likelihood model can be summarized as

(1)

This model assumes a linear relationship between the length of the tract and codon composition. The parameters $a$ and $b$ were adjusted to maximize the likelihood, $L$. The null model, $L_0$, which is a random choice according to the frequencies, is the likelihood obtained with $a = 1$ and $b = 0$.
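To make the likelihood-ratio procedure concrete, the sketch below assumes a binomial form in which the probability of an AGY codon in a tract of length n is a·fAGY + b·n. This form is an assumption chosen only to match the description (linear in tract length, reducing to the genomic frequency at a = 1 and b = 0) and may differ from the model summarized in Equation 1. The function names, the grid search, and the example data are likewise hypothetical.

```python
import math


def log_likelihood(tracts, f_agy, a, b):
    """Log-likelihood of (AGY count, TCN count) tracts under the assumed linear model."""
    total = 0.0
    for x, y in tracts:
        p = a * f_agy + b * (x + y)      # assumed probability of an AGY codon
        if not 0.0 < p < 1.0:
            return float("-inf")         # parameters outside the valid range
        total += x * math.log(p) + y * math.log(1.0 - p)
    return total


def likelihood_ratio_test(tracts, f_agy, steps=200):
    """Grid-search a and b, then compare against the null model (a = 1, b = 0)."""
    null = log_likelihood(tracts, f_agy, 1.0, 0.0)
    best = (null, 1.0, 0.0)
    for i in range(steps + 1):
        a = 2.0 * i / steps              # a searched on [0, 2]
        for j in range(steps + 1):
            b = -0.05 + 0.1 * j / steps  # b searched on [-0.05, 0.05]
            ll = log_likelihood(tracts, f_agy, a, b)
            if ll > best[0]:
                best = (ll, a, b)
    stat = -2.0 * (null - best[0])       # chi-square statistic, -2 ln(L0/L)
    p_value = math.exp(-stat / 2.0)      # exact chi-square tail probability for 2 d.f.
    return stat, p_value, best[1], best[2]


# Hypothetical tracts as (AGY codons, TCN codons) and a hypothetical genomic AGY frequency.
tracts = [(0, 10), (0, 8), (12, 0), (0, 15), (9, 0)]
stat, p_value, a_hat, b_hat = likelihood_ratio_test(tracts, f_agy=0.4)
```

The binomial coefficients are omitted because they cancel in the likelihood ratio. A coarse grid search is used here for transparency; a numerical optimizer would be the more usual choice.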

We used a second model to see if the position of a codon within a homopolymer tract influenced the type of codon found. For instance, if a codon position is flanked by AGY codons, is that position more likely to be occupied by an AGY or a TCN codon? Given $N$ polyserine tracts each with length $n_i$, where $X_j$ denotes the codon at position $j$ within the homopolymer tract, we calculated the likelihood as

(2)

The null model suggests no dependence on neighboring codons. This situation is achieved when and . Otherwise the parameters $P_1$, $P_2$, $P_3$, and $P_4$ can range from $1/e$ to 1. This results in a logarithmic decay function, bounded between zero and one. The parameters $P_1$ and $P_2$ are a measure of how likely the middle codon position will be occupied by the same codon type as the two surrounding codons, given that the two surrounding codons are of the same type. Thus, smaller values of $P_1$ and $P_2$ translate to increased probabilities of codon type homogeneity. $P_3$ and $P_4$ measure the bias of the middle codon position toward the left or the right codon position when they are not occupied by the same codon type. Therefore, smaller values of $P_3$ and $P_4$ mean an increase in the probability of the $X_j$ codon being of the same type as the $X_{j-1}$ codon only, while larger values of $P_3$ and $P_4$ correspond to an increase in the probability of being the same type as the $X_{j+1}$ codon.
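The neighbor dependence that the second model is designed to capture can be explored descriptively by tabulating, for every interior codon, whether it matches its flanking codons. The sketch below is only such a tabulation, not the likelihood model itself; the function name and category labels are hypothetical.

```python
from collections import Counter


def neighbor_context_counts(tract_types):
    """Count interior codon positions by flanking context.

    tract_types: list of tracts, each a list of codon types ('TCN' or 'AGY').
    """
    counts = Counter()
    for types in tract_types:
        for j in range(1, len(types) - 1):
            left, mid, right = types[j - 1], types[j], types[j + 1]
            if left == right:
                key = "flanks_same/middle_matches" if mid == left else "flanks_same/middle_differs"
            else:
                # With only two codon types, the middle codon matches exactly one flank.
                key = "flanks_differ/matches_left" if mid == left else "flanks_differ/matches_right"
            counts[key] += 1
    return counts


# Hypothetical five-codon tract.
counts = neighbor_context_counts([["TCN", "TCN", "TCN", "AGY", "AGY"]])
```

A large excess of "flanks_same/middle_matches" over "flanks_same/middle_differs" would point in the same direction as small values of the first two parameters, that is, toward codon type homogeneity.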


RESULTS

Neurological proteins:
Table 1 shows that the human neurological database contained eight proteins with significant similarity to polyalanine. This number of BLAST hits was larger than that found for any of the 100 human nonneurological databases (matched to be within 5 and 10% of the neurological sequence lengths). The Drosophila neurological database also contained eight significant hits, which were in the 100th percentile of the number of significant BLAST hits from each of the 50 Drosophila nonneurological databases (matched to be within 5% of the neurological sequence lengths) and larger than that found for any of the 50 nonneurological databases (matched to be within 10% of the neurological sequence lengths).


 

 
Table 1. The number of significant BLAST hits within a neurological database compared to 100 nonneurological databases

Table 2 shows that developmental proteins seem to be enriched with alanine (A), glycine (G), proline (P), and serine (S) in comparison to nondevelopmental proteins equally numerous and matched for sequence length. Also, glutamic acid (E) and glutamine (Q) seem to be more common in developmental proteins; however, this result is not as consistent as for A, G, P, and S. It is interesting to note that lysine (K) shows a rather large discrepancy between human and Drosophila. In human developmental proteins, the number of significant BLAST hits to poly(K) was in the 84th to 94th percentile, but in Drosophila it was only in the 2nd to 10th percentile.


 

 
Table 2. The number of significant BLAST hits within a developmental database compared to 100 nondevelopmental databases

Neurological proteins are consistently enriched for alanine (A) and histidine (H) as shown in Table 1. Glutamine, which is associated with many neurodegenerative diseases, is not found to be overrepresented in neurological proteins. There are also large discrepancies between the species for glutamic acid (E) and proline (P).

Table 3 shows that, for the kinase proteins, no amino acid is consistently enriched in both species relative to nonkinase proteins. Kinase databases constructed to be of similar lengths to the neurological proteins (Table 4 and Table 5) show no consistent enrichment and an increase in species-to-species discrepancies.


 

 
Table 3. The number of significant BLAST hits within a kinase database compared to 100 nonkinase databases


 

 
Table 4. The number of significant BLAST hits within a kinase database (containing sequences within 5% of the length of neurological proteins) compared to 100 nonkinase databases


 

 
Table 5. The number of significant BLAST hits within a kinase database (containing sequences within 10% of the length of neurological proteins) compared to 100 nonkinase databases

Neurological proteins have much less enrichment compared to developmental proteins. With the exception of alanine and histidine, neurological proteins are not consistently enriched for repetitive protein sequence.

We performed the BLAST analysis on the redundant data sets to investigate the effect of using nonredundant databases. We found no significant difference except for all of the kinases and for the neurological proteins from D. melanogaster. In these cases, the nonredundant databases were found to have significantly more BLAST hits per sequence than the redundant databases (data not shown).

Using the SIMPLE algorithm we obtained broadly similar results for neurological proteins. However, in many cases the windows detected as significantly simple were not as enriched for a predominant amino acid as those regions detected by BLAST. Another difference is that SIMPLE is not constructed to recognize residues with similar properties and misses such enriched regions as a result. In a parallel analysis using SEG, again the results were consistent with our BLAST analysis, but with more variability found within the Drosophila results (results not shown).

The patterns we obtained using BLAST with 50-amino-acid-long homopolymers were nearly identical to those found using the 100-residue-long homopolymers. However, the Drosophila results, like those from SEG, were more variable.

The parameter space for SEG is very large with numerous parameter sets possible for identifying different types of repetitive low-complexity sequences. Different parameter sets can give rise to dissimilar SEG results. The SEG analysis examining the percentage of low complexity and the number of low-complexity segments per sequence was highly inconsistent between the SEG parameters employed (data not shown).

The proteins of C. elegans yielded similar results to those of humans and Drosophila (data not shown). Again, the neurological proteins had no significant enrichment compared to the nonneurological databases. The developmental proteins had the greatest enrichment, while the kinase proteins had enrichment patterns like those found in humans and Drosophila.

Homopolymer tracts:
Table 6 shows the lengths of the longest homopolymer tracts for each amino acid. This table does not reflect homopolymer frequency or the average lengths of such tracts. Only the individual extreme cases are listed. In humans, many of the longest tracts for neurological proteins are longer than those for the developmental proteins. In Drosophila the opposite is true. Also, for nine amino acids, in both humans and Drosophila, kinase proteins have homopolymers as long as or longer than those of the developmental and neurological proteins.


 

 
Table 6. Length of the longest homopolymer tracts

The appendix lists proteins with multiple homopolymers containing at least one homopeptide of length 15 or more. The list comprises 29 human proteins (Table A1) and 74 Drosophila proteins (Table A2). Such proteins are not only more numerous in Drosophila but also contain significantly (P < 0.05) more homopeptides per protein than do the human sequences. There are 559 homopeptide tracts for the Drosophila proteins and only 133 for humans. While KARLIN et al. (2002) found 192 human proteins with multiple-amino-acid runs, our altered criteria of at least one homopolymer of length ≥15 residues and our nonredundant database account for this difference.

In both humans and Drosophila, poly(Q) is the most frequent homopolymer tract. However, poly(Q) accounts for only 24.1% of the human homopolymers, while accounting for 53.1% of the Drosophila homopolymers. Another discrepancy between the two species is found in the abundance of poly(E), which accounts for 18.8% of the human homopolymers, but only 0.5% of the Drosophila homopolymers. As well, poly(G) and poly(P) are more than double in humans (15.0% vs. 7.3% and 10.5% vs. 4.7%, respectively).

These interspecies discrepancies are largely consistent with previous results (KARLIN et al. 2002). However, the lack of polyleucine within the human homopolymers was not found previously. KARLIN et al. (2002) found that 19% of human proteins with at least one homopolymer of length five or more residues contained polyleucine, and only 14 of 192 proteins with multiple homopolymers contained polyleucine. Because we required longer homopolymers, only 2 of these 14 proteins were present in our data.

Of the 11 polyserine tracts in human, 7 had absolutely no mixture of the codon types. Of the 56 Drosophila polyserine tracts, 26 had no mixture. From the analysis of the first model, which was used to determine if the length of a homopolymer tract influenced the underlying codon mixture, the likelihood-ratio test gave $\chi^2$ values of 22.83 for humans and 57.18 for Drosophila. Using 2 d.f. these values corresponded to probabilities <0.001 of occurring by chance alone. The likelihood model suggests that longer polyserine tracts did not have significantly less mixture of codon types. In fact, the parameter $b$ is small in both cases and did not have a consistent direction between the two species. However, the maximum-likelihood estimate of $a$ took on a fractional value. This indicated that maximum-likelihood codon frequencies within the homopolymers were different from the genomic codon frequencies.

For the second model, which examines the influence of codon position within a homopolymer tract, we found that $P_1$, $P_2$, and $P_4$ were smaller than the null model values. For humans, $P_3$ was slightly greater than the null model value, but for Drosophila $P_3$ was less than the null model value. Likelihood-ratio tests gave $\chi^2$ values of 81.29 for humans and 65.56 for Drosophila. Using 4 d.f., this translates to probabilities <<0.001. Indeed, being surrounded by one type of codon significantly increases the likelihood of the middle position also being that same codon type. Also, if the two neighboring codons are of different types, the middle position ($X_j$) will tend to be occupied by a codon type that matches the left-hand ($X_{j-1}$) site.
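The reported probabilities can be checked with the closed-form tail probability of the chi-square distribution, which exists for even degrees of freedom. The sketch below uses only the Python standard library; the function name is hypothetical, and the test statistics are those reported for the two models.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate for even degrees of freedom.

    Uses the identity P(X > x) = exp(-x/2) * sum_{i < df/2} (x/2)^i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 0.0
    for i in range(df // 2):
        if i > 0:
            term *= half / i             # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


# Statistics reported for the tract-length model (2 d.f.) and the codon-position model (4 d.f.).
p_model1 = {species: chi2_sf_even_df(x, 2) for species, x in {"human": 22.83, "Drosophila": 57.18}.items()}
p_model2 = {species: chi2_sf_even_df(x, 4) for species, x in {"human": 81.29, "Drosophila": 65.56}.items()}
```

All four tail probabilities fall below 0.001, in line with the statements above.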


DISCUSSION

These results confirm previous reports, showing developmental proteins to be enriched for simple sequences composed primarily of alanine, glutamic acid, glycine, proline, glutamine, or serine. However, unexpectedly, neurological proteins are only slightly enriched for alanine or histidine.

Neurological proteins have been thought to be enriched for repeats. These results show that as a class they do not have an excess of glutamine-enriched regions. Yet many neurological disorders are linked with extended polyglutamine tracts and proteins enriched with glutamine residues. There is evidence that many of these disorders result from protein aggregation, triggered by tracts of polyglutamine forming polar zippers (YANAGISAWA et al. 2000; PERUTZ et al. 2002). As a result, polyglutamine may well be the best-characterized amino acid repeat to date.

In contrast, these results confirm the well-known example of simple sequence protein repeats, the opa and opa-like repeats originally found in insects (WHARTON et al. 1985). The opa repeats are typically polyglutamine and are thought to be characteristic of developmentally regulated genes (WHARTON et al. 1985). Polyglutamine was found to have the greatest number of significant hits within the Drosophila developmental database (Table 2). However, for both humans and Drosophila, significant BLAST hits to poly(A), -(G), -(P), and -(S) are more consistently abundant.

When we look at only the proteins containing multiple homopolymer tracts (Table A1 and Table A2), we again find a rather large discrepancy between the two species. Although both species have poly(Q) as the most frequent homopolymer tract, it is far more frequent in the Drosophila proteins, representing over half of the homopolymers, while comprising less than a quarter of the human homopolymers. Poly(E) and poly(P) are much more abundant in humans than in Drosophila.

The amino acids that are found to be overrepresented as repeats within these proteins have diverse properties and thus a variety of implications for the structures of the proteins in which they are embedded.

However, overall, little is known about the types of protein structures extended amino acid repeats can form. A survey of eukaryotic proteins within the structural database revealed that low-complexity protein repeats are underrepresented and rarely structurally characterized (HUNTLEY and GOLDING 2002). One explanation for their absence in the structural databases is that they are disordered and do not form consistent structures. The relationship between intrinsic structural disorder and sequence complexity in proteins has been well studied (ROMERO et al. 1999; ROMERO et al. 2001). Interestingly, all of the amino acids found to be enriched within the simple sequences of developmental and neurological proteins (alanine, glutamic acid, glycine, proline, glutamine, serine, and histidine) are considered disorder promoting (ROMERO et al. 2001). An in-depth survey of 90 regions of protein disorder determined that these proteins were typically involved in molecular recognition and suggested that many may function in signaling pathways (DUNKER et al. 2002). It is argued that due to the conformational flexibility, intrinsic disorder would enable a single binding site to recognize differently shaped partners and have faster rates of association and dissociation (DUNKER et al. 2002).

Although some repeat regions may arise and be maintained by selection, most appear to have arisen via slipped-strand mispairing, as in microsatellite expansion. Our analysis of the serine homopolymers from Table A1 and Table A2 shows evidence for slippage, in contrast to the results reported for yeast serine homopolymers (MAR ALBA et al. 1999) and Drosophila serine homopolymers (KARLIN and BURGE 1996). However, MAR ALBA et al. (1999) examined specific codons, rather than codon types, while KARLIN and BURGE (1996) did not provide a statistical analysis of serine codon type homogeneity. We also know that the repetitive regions have a higher rate of evolution (HUNTLEY and GOLDING 2000). Since one might anticipate a rapid expansion of an amino acid repeat within a protein to be detrimental, the question remains why these extended repeat regions are so abundant and present in such important proteins.
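To make the notion of codon type homogeneity concrete, the following Python sketch scores the coding sequence of a serine homopolymer. It assumes that "codon type" refers to the two serine codon families, TCN (TCT, TCC, TCA, TCG) and AGY (AGT, AGC); the function names and the simple independence model in `prob_single_type` are hypothetical and are not the statistical test used in this study.

```python
# The six standard serine codons, grouped into their two families.
SERINE_TYPES = {
    "TCT": "TCN", "TCC": "TCN", "TCA": "TCN", "TCG": "TCN",
    "AGT": "AGY", "AGC": "AGY",
}


def codon_types(cds):
    """Map the coding sequence of a serine tract to a list of codon types."""
    cds = cds.upper().replace("U", "T")
    if not cds or len(cds) % 3 != 0:
        raise ValueError("sequence must be a non-empty multiple of 3 bases")
    types = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        if codon not in SERINE_TYPES:
            raise ValueError(f"{codon} is not a serine codon")
        types.append(SERINE_TYPES[codon])
    return types


def type_homogeneity(cds):
    """Fraction of codons that belong to the majority codon type."""
    types = codon_types(cds)
    n_tcn = types.count("TCN")
    return max(n_tcn, len(types) - n_tcn) / len(types)


def prob_single_type(n_codons, p_tcn):
    """Probability that n independently drawn serine codons all share one
    type, given a background TCN frequency p_tcn (an assumed null model)."""
    return p_tcn ** n_codons + (1 - p_tcn) ** n_codons
```

Under this assumed null model, long tracts made of a single codon type become improbable quickly as the tract lengthens, which is why an excess of homogeneous tracts would point toward slippage rather than independent codon choice.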

In a study in which glutamine, alanine, and glycine repeats were inserted into the loop of a protein, the stability and folding rates of the protein were minimally affected (LADURNER and FERSHT 1997). In fact, the largest penalty came with the addition of the first few residues and not with further expansion of the repeat. Yet there are numerous deleterious conditions associated with these protein repeats, including the neurodegenerative disorders associated with triplet repeat expansions.

One hypothesis suggests that these repeats allow for protein elongation, followed by functional specialization of the repeat region via mutation (GREEN and WANG 1994). In support of this hypothesis it has been demonstrated that microsatellite expansion can occur quite rapidly and thus protein repeat expansion via slippage may occur rapidly as well. The importance of protein repeats as a mechanism for creating new protein domains may be increased by the findings of mutational bias in trinucleotide repeat evolution (COOPER et al. 1999; RUBINSZTEIN et al. 1999). Originally it was assumed that microsatellites had equal probabilities of gaining and losing repeat units. However, these studies indicate that there is a bias toward adding repeat units.
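The contrast between unbiased and biased gain and loss of repeat units can be illustrated with a toy stepwise simulation in Python. This is only a sketch: the one-unit step size, the parameter `p_gain`, the lower bound `min_len`, and the function names are all assumptions for illustration and are not taken from the cited studies.

```python
import random


def simulate_repeat_length(start, n_mutations, p_gain, rng, min_len=1):
    """Follow one repeat tract through n_mutations slippage events.
    Each event adds one unit with probability p_gain and otherwise
    removes one unit, never shrinking below min_len."""
    length = start
    for _ in range(n_mutations):
        if rng.random() < p_gain:
            length += 1
        elif length > min_len:
            length -= 1
    return length


def mean_final_length(start, n_mutations, p_gain, replicates=2000, seed=0):
    """Average final tract length over independent replicate lineages."""
    rng = random.Random(seed)
    total = sum(
        simulate_repeat_length(start, n_mutations, p_gain, rng)
        for _ in range(replicates)
    )
    return total / replicates
```

With `p_gain=0.5` the average tract length stays near its starting value, whereas any value above 0.5 produces steady net expansion, which is the behavior the mutational bias findings imply.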

Another argument in support of this hypothesis is that eukaryotes may compensate for longer generation times, using the extra variability afforded by protein repeats to rapidly create novel protein domains (MARCOTTE et al. 1999). Indeed, these protein repeats are a eukaryotic phenomenon, and the predominant amino acid differs from species to species. This would indicate that the particular characteristics of the amino acid in the repeat are not important; only the presence of a new domain that can be quickly modified and either selected for a new function or deleted is important.

This speculation about the function of protein repeats still does not clearly explain why they are overly abundant in the critically important developmental proteins but not in neurological proteins.


*  ACKNOWLEDGMENTS

We thank two anonymous reviewers for their valuable comments on the manuscript. This work was supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) grant to G.B.G. and an NSERC scholarship to M.A.H.

Manuscript received August 19, 2003; Accepted for publication December 11, 2003.


*  APPENDIX 1


 

 
Table A1. Human proteins with multiple homopeptides, where at least one must be >=15 residues long


 

 
Table A2. Drosophila proteins with multiple homopeptides, where at least one must be >=15 residues long




*  LITERATURE CITED

ALBA, M. M., R. A. LASKOWSKI, and J. M. HANCOCK, 2002 Detecting cryptically simple protein sequences using the SIMPLE algorithm. Bioinformatics 18:672-678.

ALTSCHUL, S. F., T. L. MADDEN, A. A. SCHAFFER, J. ZHANG, and Z. ZHANG et al., 1997 Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.

BANFI, S., A. SERVADIO, M. Y. CHUNG, T. J. KWIATKOWSKI, JR., and A. E. MCCALL et al., 1994 Identification and characterization of the gene causing type 1 spinocerebellar ataxia. Nat. Genet. 7:513-520.

BURKE, J. R., M. S. WINGFIELD, K. E. LEWIS, A. D. ROSES, and J. E. LEE et al., 1994 The Haw River syndrome: dentatorubropallidoluysian atrophy (DRPLA) in an African-American family. Nat. Genet. 7:521-524.

CARIELLO, L., T. DE CRISTOFARO, L. ZANETTI, T. CUOMO, and L. DI MAIO et al., 1996 Transglutaminase activity is related to CAG repeat length in patients with Huntington's disease. Hum. Genet. 98:633-635.

COOPER, G., N. J. BURROUGHS, D. A. RAND, D. C. RUBINSZTEIN, and W. AMOS, 1999 Microsatellite and trinucleotide-repeat evolution: evidence for mutational bias and different rates of evolution in different lineages. Proc. Natl. Acad. Sci. USA 96:11916-11921.

DAVID, G., N. ABBAS, G. STEVANIN, A. DURR, and G. YVERT et al., 1997 Cloning of the SCA7 gene reveals a highly unstable CAG repeat expansion. Nat. Genet. 17:65-70.

DUNKER, A. K., C. J. BROWN, J. D. LAWSON, L. M. IAKOUCHEVA, and Z. OBRADOVIC, 2002 Intrinsic disorder and protein function. Biochemistry 41:6573-6582.

DUYAO, M., C. AMBROSE, R. MYERS, A. NOVELLETTO, and F. PERSICHETTI et al., 1993 Trinucleotide repeat length instability and age of onset in Huntington's disease. Nat. Genet. 4:387-392.

GOLDING, G. B., 1999 Simple sequence is abundant in eukaryotic proteins. Protein Sci. 8:1358-1361.

GREEN, H. and N. WANG, 1994 Codon reiteration and the evolution of proteins. Proc. Natl. Acad. Sci. USA 91:4298-4302.

HOOD, D. W., M. E. DEADMAN, M. P. JENNINGS, M. BISERCIC, and R. D. FLEISCHMANN et al., 1996 DNA repeats identify novel virulence genes in Haemophilus influenzae. Proc. Natl. Acad. Sci. USA 93:11121-11125.

HUNTLEY, M. and G. B. GOLDING, 2000 Evolution of simple sequence in proteins. J. Mol. Evol. 51:131-140.

HUNTLEY, M. A. and G. B. GOLDING, 2002 Simple sequences are rare in the Protein Data Bank. Proteins 48:134-140.

KARLIN, S. and C. BURGE, 1996 Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development. Proc. Natl. Acad. Sci. USA 93:1560-1565.

KARLIN, S., L. BROCCHIERI, A. BERGMAN, J. MRAZEK, and A. J. GENTLES, 2002 Amino acid runs in eukaryotic proteomes and disease associations. Proc. Natl. Acad. Sci. USA 99:333-338.

KAWAGUCHI, Y., T. OKAMOTO, M. TANIWAKI, M. AIZAWA, and M. INOUE et al., 1994 CAG expansions in a novel gene for Machado-Joseph disease at chromosome 14q32.1. Nat. Genet. 8:221-228.

KIEBURTZ, K., M. MACDONALD, C. SHIH, A. FEIGIN, and K. STEINBERG et al., 1994 Trinucleotide repeat length and progression of illness in Huntington's disease. J. Med. Genet. 31:872-874.

KOIDE, R., T. IKEUCHI, O. ONODERA, H. TANAKA, and S. IGARASHI et al., 1994 Unstable expansion of CAG repeat in hereditary dentatorubral-pallidoluysian atrophy (DRPLA). Nat. Genet. 6:9-13.

KRULL, L., J. WALL, H. ZOBEL, and R. DIMLER, 1965 Synthetic polypeptides containing sidechain amide groups: water insoluble polymers. Biochemistry 4:626-632.

LA SPADA, A. R., E. M. WILSON, D. B. LUBAHN, A. E. HARDING, and K. H. FISCHBECK, 1991 Androgen receptor gene mutations in X-linked spinal and bulbar muscular atrophy. Nature 352:77-79.

LADURNER, A. G. and A. R. FERSHT, 1997 Glutamine, alanine or glycine repeats inserted into the loop of a protein have minimal effects on stability and folding rates. J. Mol. Biol. 273:330-337.

LI, S. H., M. G. MCINNIS, R. L. MARGOLIS, S. E. ANTONARAKIS, and C. A. ROSS, 1993 Novel triplet repeat containing genes in human brain: cloning, expression, and length polymorphisms. Genomics 16:572-579.

LINDQUIST, S., S. KROBITSCH, L. LI, and N. SONDHEIMER, 2001 Investigating protein conformation-based inheritance and disease in yeast. Philos. Trans. R. Soc. Lond. B 356:169-176.

MAR ALBA, M., M. F. SANTIBANEZ-KOREF, and J. M. HANCOCK, 1999 Amino acid reiterations in yeast are overrepresented in particular classes of proteins and show evidence of a slippage-like mutational process. J. Mol. Evol. 49:789-797.

MARCOTTE, E. M., M. PELLEGRINI, T. O. YEATES, and D. EISENBERG, 1999 A census of protein repeats. J. Mol. Biol. 293:151-160.

MITCHELL, P. J. and R. TJIAN, 1989 Transcriptional regulation in mammalian cells by sequence-specific DNA binding proteins. Science 245:371-378.

MOXON, E. R., P. B. RAINEY, M. A. NOWAK, and R. E. LENSKI, 1994 Adaptive evolution of highly mutable loci in pathogenic bacteria. Curr. Biol. 4:24-33.

MYERS, E. W. and W. MILLER, 1988 Optimal alignments in linear-space. Comput. Appl. Biosci. 4:11-17.

NAGAFUCHI, S., H. YANAGISAWA, K. SATO, T. SHIRAYAMA, and E. OHSAKI et al., 1994 Dentatorubral and pallidoluysian atrophy expansion of an unstable CAG trinucleotide on chromosome 12p. Nat. Genet. 6:14-18.

NAKAMURA, K., S. Y. JEONG, T. UCHIHARA, M. ANNO, and K. NAGASHIMA et al., 2001 SCA17, a novel autosomal dominant cerebellar ataxia caused by an expanded polyglutamine in TATA-binding protein. Hum. Mol. Genet. 10:1441-1448.

PERUTZ, M. F. and A. H. WINDLE, 2001 Cause of neural death in neurodegenerative diseases attributable to expansion of glutamine repeats. Nature 412:143-144.

PERUTZ, M. F., T. JOHNSON, M. SUZUKI, and J. T. FINCH, 1994 Glutamine repeats as polar zippers: their possible role in inherited neurodegenerative diseases. Proc. Natl. Acad. Sci. USA 91:5355-5358.

PERUTZ, M. F., B. J. POPE, D. OWEN, E. E. WANKER, and E. SCHERZINGER, 2002 Aggregation of proteins with expanded glutamine and alanine repeats of the glutamine-rich and asparagine-rich domains of Sup35 and of the amyloid beta-peptide of amyloid plaques. Proc. Natl. Acad. Sci. USA 99:5596-5600.

PULST, S. M., A. NECHIPORUK, T. NECHIPORUK, S. GISPERT, and X. N. CHEN et al., 1996 Moderate expansion of a normally biallelic trinucleotide repeat in spinocerebellar ataxia type 2. Nat. Genet. 14:269-276.

ROHL, C. A., W. FIORI, and R. L. BALDWIN, 1999 Alanine is helix-stabilizing in both template-nucleated and standard peptide helices. Proc. Natl. Acad. Sci. USA 96:3682-3687.

ROMERO, P., Z. OBRADOVIC, and A. K. DUNKER, 1999 Folding minimal sequences: the lower bound for sequence complexity of globular proteins. FEBS Lett. 462:363-367.

ROMERO, P., Z. OBRADOVIC, X. LI, E. C. GARNER, and C. J. BROWN et al., 2001 Sequence complexity of disordered protein. Proteins 42:38-48.

RUBINSZTEIN, D. C., B. AMOS, and G. COOPER, 1999 Microsatellite and trinucleotide-repeat evolution: evidence for mutational bias and different rates of evolution in different lineages. Philos. Trans. R. Soc. Lond. B 354:1095-1099.

SAUNDERS, N. J., A. C. JEFFRIES, J. F. PEDEN, D. W. HOOD, and H. TETTELIN et al., 2000 Repeat-associated phase variable genes in the complete genome sequence of Neisseria meningitidis strain MC58. Mol. Microbiol. 37:207-215.

SILVEIRA, I., C. MIRANDA, L. GUIMARAES, M. C. MOREIRA, and I. ALONSO et al., 2002 Trinucleotide repeats in 202 families with ataxia: a small expanded (CAG)n allele at the SCA17 locus. Arch. Neurol. 59:623-629.

SNELL, R. G., J. C. MACMILLAN, J. P. CHEADLE, I. FENTON, and L. P. LAZAROU et al., 1993 Relationship between trinucleotide repeat expansion and phenotypic variation in Huntington's disease. Nat. Genet. 4:393-397.

STERN, A., M. BROWN, P. NICKEL, and T. F. MEYER, 1986 Opacity genes in Neisseria gonorrhoeae: control of phase and antigenic variation. Cell 47:61-71.

TRIEZENBERG, S. J., 1995 Structure and function of transcriptional activation domains. Curr. Opin. Genet. Dev. 5:190-196.

WHARTON, K. A., B. YEDVOBNICK, V. G. FINNERTY, and S. ARTAVANIS-TSAKONAS, 1985 opa: a novel family of transcribed repeats shared by the Notch locus and other developmentally regulated loci in D. melanogaster. Cell 40:55-62.

WOOTTON, J. C. and S. FEDERHEN, 1993 Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17:149-163.

YANAGISAWA, H., M. BUNDO, T. MIYASHITA, Y. OKAMURA-OHO, and K. TADOKORO et al., 2000 Protein binding of a DRPLA family through arginine-glutamic acid dipeptide repeats is enhanced by extended polyglutamine. Hum. Mol. Genet. 9:1433-1442.

ZHUCHENKO, O., J. BAILEY, P. BONNEN, T. ASHIZAWA, and D. W. STOCKTON et al., 1997 Autosomal dominant cerebellar ataxia (SCA6) associated with small polyglutamine expansions in the alpha 1A-voltage-dependent calcium channel. Nat. Genet. 15:62-69.



