Protein is an essential component for life, and its synthesis is mediated by codons in all organisms on Earth. While some codons encode the same amino acid, their usage is often highly biased. Many factors can cause this bias, but the potential effect of mononucleotide repeats, which are known to be highly mutable, on codon usage and codon pair preference is largely unknown. In this study we performed a genomic survey of the relationship between mononucleotide repeats and codon pair bias in 53 bacteria, 68 archaea, and 13 eukaryotes. By distinguishing codon pair bias from codon usage bias, we revealed four general patterns: strong avoidance of five or six mononucleotide repeats in codon pairs; a lower observed/expected (o/e) ratio for codon pairs with C or G repeats (C/G pairs) than for those with A or T repeats (A/T pairs); a negative correlation between genomic GC content and the o/e ratios, particularly for C/G pairs; and avoidance of C/G pairs in highly conserved genes. These results support natural selection against long mononucleotide repeats, which could induce frameshift mutations in coding sequences. The fact that these patterns are found in all kingdoms of life suggests that this is a general phenomenon in living organisms. Thus, long mononucleotide repeats may play an important role in the base composition and genetic stability of a gene and in gene function.
Among the many components of life, proteins are the most essential because living organisms use them not only for body structure but also for function. After the discovery of the genetic code for protein biosynthesis (Crick et al. 1961), redundancy in the genetic code attracted great attention. One consequence of this redundancy is the highly biased use of synonymous codons, and codon usage bias is common not only among species but also within species (Grosjean and Fiers 1982; Akashi 2001). Previous studies showed that codon usage bias is linked to several factors, such as efficiency and accuracy of translation (Robinson et al. 1984; Bulmer 1991; Akashi 1994; Plotkin et al. 2004), compositional bias (Muto and Osawa 1987; McLean et al. 1998), and genome size or other nonselective forces (Lawrence and Ochman 1998; dos Reis et al. 2004).
One additional factor is the sequence environment. It is known that nucleotides surrounding a codon can influence the codon usage preference, called context-dependent codon bias (Yarus and Folley 1985). The context-dependent codon bias affects the efficiency and accuracy of translation (Taniguchi and Weissmann 1978; Irwin et al. 1995) and the suppression of both premature stop codons and missense codons (Bossi and Ruth 1980; Murgola et al. 1984). Reflecting the context-dependent codon bias, a strong codon pair bias is detected in both prokaryotic and eukaryotic genomes (Gutman and Hatfield 1989; Buchan et al. 2006; Tats et al. 2008). It has been suggested that codon pair preference is influenced by all three nucleotides of the ribosomal A-site codon and the third nucleotide of the P-site codon (Buchan et al. 2006). Therefore, tRNA geometry within the ribosome was presumed to be the key factor governing genomic codon pair patterns, as it might enhance the fidelity and/or rate of translation.
A mononucleotide repeat is a homogeneous run of the same nucleotides. Potentially deleterious effects of a mononucleotide repeat in coding sequences (CDS) have been pointed out: the mononucleotide repeats in CDS are prone to transcriptional and translational slippage, which leads to functional disruption of the corresponding gene products (Wagner et al. 1990; Gurvich et al. 2003; Baranov et al. 2005); a strong association between mononucleotide repeats and the occurrence of insertion/deletion (indel) during the DNA replication process will elevate the risk of frameshift mutations (Strauss 1999), which might have severe fitness consequences. The list of diseases resulting from changes of unstable repeats continues to grow (Gatchel and Zoghbi 2005). In addition, previous studies suggested that in long mononucleotide runs, errors during the process of DNA synthesis are easier to escape from polymerase proofreading or mismatch repair (MMR) systems (Kroutil et al. 1996; Tran et al. 1997). C/G mononucleotide runs are found to be more unstable than A/T runs in Escherichia coli (Sagher et al. 1999), yeast (Harfe and Jinks-Robertson 2000), and mammalian cells (Boycheva et al. 2003). Indeed, some of the mononucleotide repeats, such as GGGGGn, are found to be among the unpreferred codon pairs in various species (Tats et al. 2008). Therefore, the number of mononucleotide repeats, as well as their base compositions (A/T runs or C/G runs), might affect the occurrence of indels and the genetic stability of CDS.
In this study, we conducted a systematic survey of 134 genomes in bacteria, archaea, and eukaryotes to evaluate the potential influence of the mononucleotide repeats on codon pair preference. We used the observed/expected (o/e) ratio of codon pairs with mononucleotide repeats to distinguish the codon pair bias from the codon usage bias. Our results suggest a strong avoidance of long (five or six) mononucleotide runs in CDS, most likely due to natural selection against the high mutability, which may shed new light on the forces exerted on both codon and codon-pair usage.
MATERIALS AND METHODS
To cover a diverse range of species, 13 eukaryotic, 53 bacterial, and 68 archaeal genomes were selected from online databases (supporting information, Table S1, Table S2, and Table S3). In addition, four sequence alignments (Saccharomyces cerevisiae and S. paradoxus, Caenorhabditis elegans and C. briggsae, Drosophila melanogaster and D. yakuba, and Homo sapiens and Mus musculus) were obtained from the UCSC Genome Informatics website (http://hgdownload.cse.ucsc.edu). Protein-coding regions were determined on the basis of the annotations in these databases. The 13 eukaryotic genomes, including fungi, plants, and animals, were randomly selected to represent a wide range of species (Table S1). The 53 bacterial genome sequences were selected on the basis of a criterion of >4 Mb to give sufficient data (Table S2), whereas this criterion was not applied to the archaeal genomes because of their small genome sizes (2.24 Mb on average; Table S3).
Studied codon pairs:
We first analyzed codon pairs that have mononucleotides spanning the two codons (sense:sense pairs, Table S4). Among 4096 (= $4^{6}$) possible codon pairs, 928 sense:sense pairs contained two to six mononucleotides in the pair junction, when excluding the pairs containing a stop codon. A/T pairs (codon pairs with A's or T's spanning two codons) or C/G pairs (codon pairs with C's or G's spanning two codons) were analyzed together not only because A and T or C and G are parallel in the nucleotide chain position, but also because the level of bias was similar (Figure S1). Codon pairs with the same number and composition (A/T or C/G) of mononucleotide runs in the pair junction were classified as a group.
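The grouping described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code (stop codons TAA, TAG, and TGA) and measures the run only across the pair junction; the names `junction_run` and `groups` are illustrative and are not taken from the authors' Perl script.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
SENSE = ["".join(c) for c in product(BASES, repeat=3) if "".join(c) not in STOPS]

def junction_run(first, second):
    """Return (base, length) of the run spanning the junction, or None."""
    base = first[2]
    if second[0] != base:
        return None  # no mononucleotide run crosses the junction
    left = len(first) - len(first.rstrip(base))     # run at the end of codon 1
    right = len(second) - len(second.lstrip(base))  # run at the start of codon 2
    return base, left + right

# Group sense:sense pairs by run length (2..6) and composition (A/T or C/G).
groups = Counter()
for first, second in product(SENSE, repeat=2):
    run = junction_run(first, second)
    if run is not None:
        base, length = run
        groups[(length, "A/T" if base in "AT" else "C/G")] += 1

total_pairs = sum(groups.values())
cg_pairs = sum(n for (_, kind), n in groups.items() if kind == "C/G")
```

Under these assumptions the enumeration gives 928 pairs in total, 496 of which are C/G pairs.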
In the analysis, codon pairs containing mononucleotide repeats other than those spanning the two codons were excluded because in such codon pairs the mononucleotide run size is not affected by the adjacent codon. For example, the length of the longest mononucleotide repeat in codon pair AAATCG is three, which is the same as in the single codon AAA. If there are any factors contributing to the reduction in the mononucleotide repeats, the single codon (AAA) would be the actual target, independent of the adjacent codons. We refer to this as codon bias, not codon pair bias.
To further confirm the effect of mononucleotide repeats on codon pair bias, we also analyzed synonymous codon pairs, which are defined as a codon pair that has a choice of nucleotide bases that alter the number of mononucleotide repeats (from two to six) without changing the encoded amino acid sequences (Table S5). The possible longest mononucleotide run in the codon pair was two, three, five or six (Table S5). For example, codon pairs encoding dipeptide lysine:lysine had four possible compositions: AAGAAG, AAGAAA, AAAAAG, and AAAAAA, for which the numbers of the longest mononucleotide runs were two, three, five, and six, respectively.
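The lysine:lysine example can be checked with a short Python sketch that measures the longest mononucleotide run in a six-nucleotide codon pair. The function name `longest_run` is hypothetical and used only for illustration.

```python
from itertools import groupby

def longest_run(pair):
    """Length of the longest run of identical nucleotides in a codon pair."""
    return max(len(list(run)) for _, run in groupby(pair))

# The four synonymous codon pairs encoding the dipeptide lysine:lysine.
lys_lys = ["AAGAAG", "AAGAAA", "AAAAAG", "AAAAAA"]
runs = {pair: longest_run(pair) for pair in lys_lys}
```

For these four pairs the function returns two, three, five, and six, matching the classification in the text.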
The two types of codon pairs above have a partial overlap especially when we consider long mononucleotide repeats. Particularly, six mononucleotide repeats (6N) are entirely shared by both types of codon pairs (Table S4 and Table S5). For shorter mononucleotide runs (≤5N), however, these types of codon pairs generally include different sets of codon pairs. We also analyzed sense:sense codon pairs excluding synonymous codon pairs.
Normalizing codon pair frequencies:
Codon pair bias could be attributed to codon usage bias. To eliminate this effect, we normalized the expectation of codon pair occurrence by the frequency of used codons (Gutman and Hatfield 1989; Buchan et al. 2006). First we calculated the observed ($o_{ij}$) and the expected number ($e_{ij}$) of a codon pair (codon i and codon j), on the basis of the estimated codon frequencies in the kth open reading frame (ORF$_k$) (Buchan et al. 2006),

$$e_{ij} = \frac{c_i}{N_{tot}} \cdot \frac{c_j}{N_{tot}} \cdot N_p,$$

where $c_i$ is the observed count of codon i, $N_{tot}$ is the total number of codons, and $N_p = N_{tot} - 1$ represents the total number of codon pairs in the ORF$_k$. The effect of dipeptide bias on codon pairing was removed by normalizing the expected values of each codon pair, to generate $e_{nor}$,

$$e_{nor,ij} = e_{ij} \cdot \frac{o_{dip,mn}}{e_{dip,mn}}$$

(Gutman and Hatfield 1989; Buchan et al. 2006), where $o_{dip,mn}$ and $e_{dip,mn}$ are the observed and expected codon pair counts, respectively, encoding dipeptide mn. Observed and expected codon pair counts were then summed up at the genomic level. The numbers of codon pairs were calculated using a Perl script.
For synonymous pairs, because the codon pairs of each type encode the same dipeptide, the sum of the observed counts is equal to the sum of the normalized expected counts (enor). For sense:sense pairs, on the other hand, the two sums are equal only when all 4096 possible codon pairs are included (i.e., including the pairs with 0–1 mononucleotides in the pair junction).
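The normalization can be sketched in Python for a single ORF. This is an illustration under stated assumptions, not the authors' Perl script: it assumes the expected count takes the form (c_i/N_tot)(c_j/N_tot)N_p, that the dipeptide correction rescales each expected count by the observed/expected ratio of its dipeptide, and that the standard genetic code applies. All function and variable names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def pair_counts(orf):
    """Observed and expected codon pair counts for one in-frame ORF."""
    usable = len(orf) - len(orf) % 3
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    n_tot = len(codons)   # total number of codons
    n_pairs = n_tot - 1   # total number of adjacent codon pairs
    counts = Counter(codons)
    observed = Counter(zip(codons, codons[1:]))
    expected = {
        (ci, cj): (counts[ci] / n_tot) * (counts[cj] / n_tot) * n_pairs
        for ci, cj in product(counts, repeat=2)
    }
    return observed, expected

def dipeptide(pair):
    return CODE[pair[0]], CODE[pair[1]]

def normalize_by_dipeptide(observed, expected):
    """Rescale expected counts so each dipeptide total matches its observed total."""
    obs_dip, exp_dip = Counter(), Counter()
    for pair, o in observed.items():
        obs_dip[dipeptide(pair)] += o
    for pair, e in expected.items():
        exp_dip[dipeptide(pair)] += e
    return {
        pair: e * obs_dip[dipeptide(pair)] / exp_dip[dipeptide(pair)]
        for pair, e in expected.items()
    }

# Toy ORF (hypothetical sequence, for illustration only).
observed, expected = pair_counts("ATGAAAAAGAAAAAGGGGCCCTAA")
e_nor = normalize_by_dipeptide(observed, expected)
```

With this rescaling, the normalized expected counts sum to the observed number of codon pairs, so an o/e ratio below one for a given pair reflects pair-level avoidance rather than codon or dipeptide frequency.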
Analyzing codon pair bias:
The o/e ratio for each group of codon pairs with the same number (p) of mononucleotide Q in the pair junction was calculated as follows:

$$o/e_{pQ} = \frac{\sum_{(i,j) \in pQ} o_{ij}}{\sum_{(i,j) \in pQ} e_{nor,ij}}$$
The average o/e ratio for a group of genomes was calculated as the geometric mean of the respective ratio of each genome.
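As a minimal sketch, the group ratio and the across-genome average could be computed as follows. It assumes that a group's o/e ratio is the summed observed count divided by the summed normalized expected count; the counts in the example are made up for illustration and are not data from this study.

```python
import math

def group_ratio(observed, expected, pairs):
    """o/e ratio for a group of codon pairs: summed observed over summed expected."""
    o = sum(observed.get(pair, 0) for pair in pairs)
    e = sum(expected.get(pair, 0.0) for pair in pairs)
    return o / e

def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed on the log scale."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical counts for the 6C/G group in one genome.
obs = {"GGGGGG": 3, "CCCCCC": 1}
exp = {"GGGGGG": 6.0, "CCCCCC": 2.0}
ratio_6cg = group_ratio(obs, exp, ["GGGGGG", "CCCCCC"])

# Hypothetical per-genome ratios averaged across a group of genomes.
average = geometric_mean([0.25, 1.0])
```

The geometric mean is the natural average for ratios because a twofold excess and a twofold deficit cancel to one.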
To measure the difference between the observed and expected values of a single codon pair, a normalized offset value defined as r was calculated,

$$r = \frac{o - e}{D_{exp}}$$

(Boycheva et al. 2003), where o and e are the observed and expected counts of the codon pair and $D_{exp}$ is the expected random deviation. The r value is considered to be significant when the absolute value is >2.0 (Boycheva et al. 2003).
Analyzing codon pair usage in conserved regions:
A Perl script was written to calculate the number of nucleotide substitutions and indels throughout the following combinations of alignments: S. cerevisiae and S. paradoxus, C. elegans and C. briggsae, D. melanogaster and D. yakuba, and H. sapiens and M. musculus. The average nucleotide divergence (D) was adjusted with the Jukes and Cantor correction (Jukes and Cantor 1969).
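The Jukes and Cantor correction converts the observed proportion of differing sites, p, into an estimated number of substitutions per site, D = −(3/4) ln(1 − 4p/3). The Python sketch below illustrates it; the way gap columns are skipped is an assumption for illustration, since the handling of indel positions in the original script is not described here.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring alignment columns with a gap."""
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # indel columns are excluded from substitution counts
        compared += 1
        differing += a != b
    return differing / compared

def jukes_cantor(p):
    """Jukes-Cantor corrected divergence for an observed proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)

# Hypothetical aligned fragments, for illustration only.
p = p_distance("ACGT-A", "ACGA-A")
d = jukes_cantor(p)
```

The corrected value is always at least as large as p, because it accounts for multiple substitutions at the same site.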
First, CDS were extracted according to the annotations of S. cerevisiae, C. elegans, D. melanogaster, and H. sapiens, respectively. Then the CDS of each of the four comparisons were classified into three groups according to D. Each group had an equal length of sequences, and the one with the smallest D was regarded as the highly conserved region, while the one with the largest D was the less conserved region. The observed and expected counts of each codon pair were analyzed in both highly and less conserved regions (see Table S6 for details).
RESULTS
Avoidance of long mononucleotide runs in codon pairs:
In the sense:sense codon pairs, the o/e ratio was apparently less than one in codon pairs with long mononucleotide runs, such as five- or six-mononucleotide repeats for C/G pairs (o/e5C/G or o/e6C/G) and 6N for A/T pairs (o/e6A/T, Figure 1). The geometric mean of the o/e ratio for the 6N pairs was 0.528 in eukaryotes, 0.488 for bacteria, and 0.596 for archaea (Table S7, Table S8, and Table S9). The ratios for codon pairs with shorter runs (<4N mononucleotides) were significantly larger than those for pairs with longer runs (>4N mononucleotides) (P < 0.05, t-test). The consistently lower number of observations than expected values (o/e < 1.0 for the 6N pairs in 133/134 genomes, or 99.2%; Figure S2 and Table S7, Table S8, and Table S9) suggests a universal avoidance of long mononucleotide runs for sense:sense codon pairs in these organisms. The only exception was Geobacter uraniireducens, which showed o/e = 1.084 for the 6N pairs (Table S8).
In addition, the o/e ratios for codon pairs with long A/T mononucleotide runs (o/e5A/T or o/e6A/T) were significantly higher than those for C/G pairs in both prokaryotes and eukaryotes (P < 0.05, paired t-test, except for o/e5A/T vs. o/e5C/G in mammals, discussed below). For example, in eukaryotic genomes, o/e6A/T was 1.8 times higher than o/e6C/G, where the o/e6C/G was only 0.335. This result indicates that the 6C/G is the most unfavorable of codon pairs.
On the other hand, there was a pattern unique to mammals. Unlike the other genomes, higher o/e ratios for C/G pairs than for A/T pairs were generally observed in mammals except for 6Ns (Figure 1C). In addition, the o/e ratios for A/T pairs with long mononucleotide runs were higher in prokaryotes than in eukaryotes (o/e5A/T = 0.952 and o/e6A/T = 0.696 in bacteria; and 0.963 and 0.669 in archaea vs. 0.831 and 0.602 in eukaryotes, respectively, both P < 0.05 by t-test). For long mononucleotide C/G runs, however, no clear difference was detected between prokaryotes and eukaryotes. The only exception was o/e5C/G in eukaryotes, which was significantly larger than o/e5C/G in bacteria (0.651 vs. 0.551, P < 0.05 by t-test, Table S7 and Table S8). But the difference drastically decreased when the mammals were excluded (o/e5C/G = 0.574 for nonmammalian eukaryotes, Table S7).
Synonymous codon pairs showed similar patterns (Figure S3, Table S10, Table S11, and Table S12), and the elimination of synonymous codon pairs from sense:sense codon pairs resulted in virtually the same patterns (Figure S4). The negative relationships between the o/e ratio and the number of mononucleotide runs in both sense:sense and synonymous codon pairs further support the consistent avoidance of long mononucleotide runs in codon pairs in the genome evolution of prokaryotes and eukaryotes.
Effect of prokaryotic GC content on the o/e ratio:
The results above revealed a stronger avoidance of codon pairs with long C/G runs relative to A/T runs. Because genomes with higher GC content would contain more C/G pairs under random expectation, we evaluated the effect of genomic GC content on the tolerance of genomes to the C/G mononucleotide runs. The wide range of GC content in prokaryotic genomes (29.9–72.8% in bacteria and 27.6–68.0% in archaea) provided an opportunity for investigating a correlation between o/e ratio and GC content.
When the 53 bacterial genomes were classified into three groups depending on GC content, namely group I (GC% < 50%), group II (50% ≤ GC% < 60%), and group III (GC% ≥ 60%), their o/e ratios for sense:sense pairs were quite different (Figure 2A). The o/e ratios for C/G pairs in group I were significantly higher than those in group III (P < 0.05, t-test). For C/G pairs, the o/e ratios for group II were between those of groups I and III. Notably, o/e6C/G for the high-GC group III decreased to a very low value (0.223, Figure 2A). This observation suggested that the avoidance of long C/G mononucleotide runs was much stronger in genomes with higher GC content. This propensity was also shown through the negative correlation between GC content of individual genomes and their o/e ratios for C/G codon pairs, e.g., o/e6C/G in Figure 2B (R = −0.550, P < 0.0001), and the other ratios in Figure S9B (P < 0.05). For A/T pairs, on the other hand, the negative relationship between the o/e ratio and GC content was much weaker (Figure S5 and Figure S9A). All the patterns observed in bacteria were also present in three groups of the archaeal genomes (Figure S6 and Figure S8, A–C; group I with GC% < 40%, group II with GC% ranging from 40 to 50%, and group III with GC% ≥ 50%).
The normalized offset value (r) measures the difference between the observed and the expected counts of a certain codon pair (Boycheva et al. 2003). Our calculations showed that the number of C/G codon pairs in which the observed counts are significantly less than expected (r ≤ −2) is positively correlated with the genomic GC content (Figure 2C; P < 0.0001), reflecting the strong propensity of genomes with higher GC content to avoid C/G pairs. According to the linear regression, in bacterial genomes with a GC content of 70%, the proportion of significantly underrepresented C/G pairs with mononucleotides in pair junctions was 55.2% (274/496), whereas it was only 28.0% (139/496) in genomes with low GC content (30%). The number of significantly underrepresented A/T pairs was only weakly correlated with genomic GC content in prokaryotes (Figure S5A and Figure S7A).
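The grouping thresholds and the correlation analysis can be sketched in Python. The sketch assumes that R denotes Pearson's correlation coefficient and that the upper bound of each archaeal group is exclusive, as it is for bacteria; the GC values and o/e ratios in the example are made up for illustration.

```python
import math

# Lower bounds of groups II and III, in GC%, for each domain.
THRESHOLDS = {"bacteria": (50.0, 60.0), "archaea": (40.0, 50.0)}

def gc_group(gc_percent, domain):
    """Assign a genome to GC group I, II, or III."""
    low, high = THRESHOLDS[domain]
    if gc_percent < low:
        return "I"
    if gc_percent < high:
        return "II"
    return "III"

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical genomes: GC% and o/e ratio for 6C/G pairs.
gc = [30.0, 50.0, 70.0]
oe_6cg = [0.8, 0.5, 0.2]
r = pearson_r(gc, oe_6cg)
```

A negative r on real data would correspond to the pattern reported here, in which higher GC content is associated with a lower o/e ratio for C/G pairs.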
The o/e ratios in conserved coding sequences:
Given that long mononucleotide runs have a greater potential to produce indels, a lower o/e ratio was expected in more conserved CDS. Since 6Ns had the smallest o/e ratios in analyzed codon pairs with mononucleotides in pair junctions in eukaryotes (Figure 1 and Table S7), o/e6N was analyzed in conserved regions in the four alignments of the eight genomes (see materials and methods).
Indeed, o/e6N was significantly smaller in highly conserved than in less-conserved regions in all alignments of nonmammalian eukaryotes (Figure 3A, P < 0.01 by chi-square test). Moreover, the 6N codon pairs appeared less frequently in highly conserved regions in those three comparisons (Figure 3B). In the mammalian sequences, no such differences were observed. Although o/e6N was slightly smaller in less-conserved regions in the human–mouse comparisons, the difference was not significant.
DISCUSSION
The biased usage of codons or codon pairs is a common phenomenon in a wide range of species (Gutman and Hatfield 1989; Buchan et al. 2006). A variety of factors, selective or nonselective, might be responsible for such bias. For example, the synonymous codons decoded in the ribosomal A site by the same tRNA exhibit significantly similar ribosomal P-site pairing preference (Buchan et al. 2006). In other words, the codon pair preference is primarily determined by the interplay between nucleotides cP3 (the third nucleotide of the codon positioned at the ribosomal P site) and cA1/cA2 (the first/second nucleotide of the codon positioned at the ribosomal A site) (Buchan et al. 2006). Our results suggest that the avoidance of mononucleotide repeats in pair junctions is an additional explanation. For codon pairs encoding a certain dipeptide, nucleotides cP3 and cA3 are degenerate, and cP3 is more important in determining the mononucleotides run size. The interplay between cP3 and cA1/cA2 largely determines the mononucleotide run size in the degenerate codon pairs. Thus, the deleterious effect of indels and the consequent avoidance of long mononucleotide repeats in CDS can contribute to the close connection between cP3 and cA1/cA2 as well.
In this study, we confirmed a deficit of codon pairs with long mononucleotide runs relative to those with short mononucleotide runs in a variety of species by analyzing two kinds of codon pairs. This result is also consistent with previous studies, such as that of Tats et al. (2008), in which certain mononucleotide repeats were identified as avoided codon pairs among several other kinds (e.g., nnTAnn). In addition, we revealed three additional patterns: higher o/e ratio for A/T codon pairs than for C/G pairs; negative correlation between GC content of individual genomes and their o/e ratios, particularly for C/G codon pairs; and lower o/e ratio for codon pairs with long mononucleotide runs in conserved coding sequences. These patterns cannot be explained by the simplest tRNA geometry hypothesis. In E. coli, for example, the deficit of long mononucleotide A/T runs in codon pairs cannot be explained by tRNA geometry because the synonymous codons, AAG and AAA, and TTC and TTT, are recognized by the same tRNA.
Natural selection on mutability of codons or codon pairs might be an alternative explanation. The high frequency of indel occurrence has been confirmed to be closely associated with simple nucleotide repeats (Strauss 1999). A previous investigation of the mutability of mononucleotide runs in yeast showed that the mutation rate of 6N mononucleotide repeats was ∼10-fold that of 2N or 3N (Greene and Jinks-Robertson 1997). The high indel mutability of long mononucleotide runs and the severely detrimental effect of coding indels can constrain the choice of codon or codon pair usage. Consequently, the avoidance of long mononucleotide runs in coding sequences will minimize changes of coding function, particularly in highly conserved regions. Thus, this model can explain the scarcity of long runs in codon pairs and why conserved genes have fewer codon pairs with long mononucleotide runs.
Under this scenario, long A/T mononucleotide runs can be better tolerated in codon pairs than C/G runs, which exhibit higher mutability. For example, in a frameshift reversion assay in S. cerevisiae (Greene and Jinks-Robertson 1997), the mutation events in a 4C run were as many as in a 6A run, consistent with our observation that o/e4C/G was similar to o/e6A/T (0.841 vs. 0.811 in S. cerevisiae, Table S7). It is known that C/G mononucleotides are more prone to produce indels in E. coli and yeast (Greene and Jinks-Robertson 1997; Sagher et al. 1999; Harfe and Jinks-Robertson 2000), and the frameshift instability of mononucleotide C or G runs may be due to stabilization of a stacked intermediate (Sagher et al. 1999). Both the DNA polymerase fidelity (primarily avoidance of slippage) and the efficiency of the removal of frameshift intermediates by the MMR system are affected by the composition of mononucleotide runs; DNA polymerase slippage occurs more often while the MMR system removes frameshift intermediates less efficiently in C/G than in A/T mononucleotide runs (Gragg et al. 2002). Considering their greater ability to produce indels, long C/G runs are less favored in those regions sensitive to frameshifts and would therefore be underrepresented. In contrast, A/T mononucleotide runs would exert less influence on the maintenance of sequence stability. In higher eukaryotes, there are no experimental data on the mutability of mononucleotides, but it has been reported that the mutation rate of G17 repeat sequences was much higher than those of A17 and (CA)17 in mismatch repair-proficient embryonic mouse fibroblasts (Boyer et al. 2002).
If the avoidance of slippage in long mononucleotide runs contributes to the biased usage of codon pairs, a negative correlation between the GC content of individual genomes and their o/e ratios is to be expected, because genomes with higher GC content are more likely to form long C/G mononucleotide sequences by chance. In the three bacterial groups (group I, GC% < 50%; group II, 50% ≤ GC% < 60%; and group III, GC% ≥ 60%), the expected numbers of 6C/G pairs were 466, 745, and 1406 (P = 0.057, P < 0.01, and P < 0.05 for comparison of groups I and II, I and III, and II and III, respectively, t-test), whereas the observed counts were roughly the same: 285, 329, and 361, respectively (P = 0.524–0.787 for comparisons between groups by t-test). The same tendency was observed in archaea as well. Therefore, prokaryotes may have evolved a mechanism to control the mutability of their genomes.
All analyzed genomes have o/e6C/G less than one except G. uraniireducens. This bacterium was isolated from subsurface sediment undergoing uranium bioremediation. This species reduces metals including uranium with acetate and other organic acids serving as the electron donor (Shelobolina et al. 2008). It is generally accepted that uranium induces DNA damage and subsequent high mutation rate through a combination of chemical and radiological effects (Stearns et al. 2005). G. uraniireducens may be able to tolerate more 6C/G codon pairs, due to its higher tolerance of mutational constraints or the advantage of rapid evolution for adaptation to its harsh environment.
A more frequent occurrence of indels in longer C/G mononucleotides may partly explain why some amino acids have more synonymous codons than others. It has been shown that the genetic code is not a random assignment of codons to amino acids and that the code minimizes the effects of point mutation or mistranslation (Freeland and Hurst 1998). Therefore, a good strategy to avoid mutation would be a reduced usage of codons with higher GC content, potentially to minimize the risk of longer C/G runs. To achieve this goal, such codons would have evolved as synonymous codons that are used less often. In fact, this hypothesis can be tested by a GC analysis for all codons. There are 8 amino acids with ≥4 synonymous codons. In these codons, the average GC content is as high as 68.1%, which is significantly higher than 33.3% (P < 0.001, t-test), the GC content of the codons for the other 12 amino acids with ≤3 synonymous codons. Notably, only 6 of the 23 codons for these 12 amino acids have two GCs each, while 26/38 such codons are found for the 8 amino acids with >3 synonymous codons. In addition, the start and stop codons have only one or no G. Thus, almost all AT-rich codons are used for the amino acids with limited synonymous codons or stop/start codons. Clearly, these codons have little chance of forming long C/G mononucleotides and therefore a higher opportunity to maintain stable gene function.
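The codon counts in this comparison can be reproduced with a short Python sketch. It assumes the standard genetic code and interprets "two GCs" as at least two G or C bases in a codon; the variable names are illustrative.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid to its sense codons (stop codons are skipped).
codons_by_aa = defaultdict(list)
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    if aa != "*":
        codons_by_aa[aa].append("".join(codon))

def gc_count(codon):
    """Number of G or C bases in a codon."""
    return sum(base in "GC" for base in codon)

# Codons of amino acids with >=4 synonymous codons versus the remaining ones.
many = [c for cs in codons_by_aa.values() if len(cs) >= 4 for c in cs]
few = [c for cs in codons_by_aa.values() if len(cs) <= 3 for c in cs]

gc_rich_many = sum(gc_count(c) >= 2 for c in many)
gc_rich_few = sum(gc_count(c) >= 2 for c in few)
few_gc_fraction = sum(gc_count(c) for c in few) / (3 * len(few))
```

Under these assumptions, 26 of the 38 codons in the first class and 6 of the 23 codons in the second class carry at least two G or C bases, and the pooled GC content of the 23 codons is one-third.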
The indel-mutability model can shed light on the usage of codons and codon pairs. Our results suggest that the avoidance of long mononucleotides can maintain the conserved gene function by preventing indel occurrence in coding sequences. This may be the best way to minimize mutation by constructing an appropriate gene composition, e.g., the choice of GC content and specific nucleotide combination. In highly conserved genes, on which mutations are supposed to be highly deleterious, the maximal avoidance of long mononucleotide repeats might be essential. Our results on the conserved regions (Figure 3) are very consistent with this scenario.
One exception was the case of mammalian genomes, which showed no consistent pattern compared to the other comparisons (Figure 3). Further study is required to understand the cause of the different patterns in mammals, although less efficient natural selection due to the smaller effective population size in mammals relative to invertebrates and prokaryotes, typically at the magnitude of one or two orders (Lynch and Conery 2003), might explain the phenomenon.
Results from recent studies suggest either up- or downregulation of the mutability level. The gene composition might be the first step in controlling the mutability level. If a severely detrimental effect resulted from the occurrence of any particular indel in CDS, the indel would be removed efficiently (Chen et al. 2009). When a region can tolerate indels, the result could be induction of more mutations (Tian et al. 2008; Zhu et al. 2009), promotion of ectopic recombination (Sun et al. 2008), or reduction of recombination for the surrounding regions to maintain additional mutations (Du et al. 2008). This process would be an efficient way to regulate the mutation level and suggests that the mutability in a gene is self-regulated, at least to some extent. From this point of view, the mechanism for an indel due to slippage might be a consequence of adaptive evolution, which can explain why this mechanism works well for long C/G runs but less well for A/T mononucleotide sequences of the same length. The o/e ratio, particularly the o/eC/G ratio, could be used as a measure of the mutation potential for individual or multiple genes in a species. Therefore, these ubiquitous and selectively maintained mononucleotide runs can contribute greatly to genetic diversity and to molecular evolution. Analysis of the distribution of long mononucleotide runs will provide information on the evolution of genes and genomes. Together with recent work that has revealed possible causes of codon bias, our study suggests that mononucleotide runs can play a very important role in such bias and in shaping genetic evolution.
We thank Gary Stormo and two anonymous reviewers for helpful comments on the earlier version of this manuscript. This study was supported by the National Science Foundation of China (30930049) (to D.T.) and the Swiss National Science Foundation (31003A_125213) (to H.A.).
Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.110.121137/DC1.
1 These authors contributed equally to this work. D.T. and H.A. are equal senior authors.
Communicating editor: G. Stormo
- Received July 20, 2010.
- Accepted August 27, 2010.
- Copyright © 2010 by the Genetics Society of America