We tested whether functionally important sites in bacterial, yeast, and animal promoters are more conserved than their neighbors. We found that substitutions are predominantly seen in less important sites and that those that occurred tended to have less impact on gene expression than possible alternatives. These results suggest that purifying selection operates on promoter sequences.
The study of cis-regulatory evolution presents “challenges beyond those typically encountered in analyses of coding sequence evolution” (Wray et al. 2003). We are currently unable to infer regulatory function from primary sequences and, consequently, do not have a clear understanding of the relationship between function and conservation. Whereas it is clear that many cis-elements are under selective constraint (Bergman and Kreitman 2001; Dermitzakis et al. 2003; Andolfatto 2005; Hahn 2007; Loots and Ovcharenko 2010), in some instances sites known to be functional in one species have been lost in closely related species (Ludwig et al. 1998; Dermitzakis and Clark 2002; Moses et al. 2006; Doniger and Fay 2007; Bradley et al. 2010). Genome annotation approaches, such as “phylogenetic footprinting” (Tagle et al. 1988; Blanchette and Tompa 2002; Zhang and Gerstein 2003) and “phylogenetic shadowing” (Boffelli et al. 2003), rely on greater conservation of functional sites compared to surrounding sequences, yet this supposition may not always be true (Emberly et al. 2003; Balhoff and Wray 2005). Indeed, positive selection may drive turnover of binding sites (Rockman et al. 2003; He et al. 2011). Although evidence suggests that fitness costs of mutations in noncoding regions may be relatively low (Kryukov et al. 2005; Chen et al. 2007; Raijman et al. 2008), few studies have explicitly tested the relationship between functions of individual nucleotides and the fitness costs of mutations at these sites (Shultzaberger et al. 2010).
Our knowledge of the forces driving the evolution of cis-elements largely comes from sequence comparisons between and within species, often without specific reference to the function of individual nucleotides within these elements (Wong and Nielsen 2004; Bush and Lahn 2006; Drake et al. 2006; Casillas et al. 2007). Yet regulatory functions and constraints are not uniformly distributed within cis-elements as evidenced by the correlation of conservation and functional importance of promoter motifs (Johnson et al. 2004).
Binding energy of transcription factor binding sites can be experimentally measured and computationally modeled (Djordjevic et al. 2003; Maerkl and Quake 2007; Weindl et al. 2007; Zhao et al. 2009). Modeling and comparative sequence analyses suggest that selection effects on binding sites may be mediated by their binding energy (Mustonen and Lässig 2005; Mustonen et al. 2008). Sites with high predicted binding strength appear to be more conserved, which is consistent with purifying selection (Moses 2009). Within transcription-factor-binding sites, substitutions occur at position-specific rates (Tanay et al. 2004; Kim et al. 2009). Specifically, the degree of conservation of individual nucleotides is proportional to their information content, likely because sites that make direct contact with transcription factors tend to be highly conserved (Mirny and Gelfand 2002; Moses et al. 2003). For this reason, it is tempting to use binding energy as a proxy for the functional consequences, and ultimately fitness effects, of mutations at a given site.
However, the relationship between binding, function, and fitness is not well understood (Mirny and Gelfand 2002). In some instances, there exists a correlation between binding energy and substitution rate (Brown and Callan 2004), but this may not always be the case (Kotelnikova et al. 2005). Furthermore, nonbinding nucleotides may exert some effect on transcription (Mai et al. 2000; Mirny and Gelfand 2002; Abnizova et al. 2007; Wozniak and Hughes 2008) and potentially on fitness.
A comprehensive understanding of the evolution of cis-regulatory sequences will require the synthesis of knowledge concerning binding energy, function, and fitness consequences of individual mutations within these elements. Because such data are not generally available, it would be desirable to ascertain whether a relationship exists between functions of specific nucleotides within cis-elements and their rates of evolution. Such analyses would constitute a critical link between functional studies and comparative sequence analysis.
Materials and Methods
Functional data were derived from published articles reporting studies of promoter mutagenesis (Table S1). A “complete” data set for a given position would contain the information on the consequences of changing the wild-type nucleotide to every one of the three alternatives. Altogether, our data set contained 182 bp examined in such a way. There were also 209 nucleotides with “incomplete” data, i.e., situations in which information was available for only one or two of the three possible substitutions. Although high-throughput mutagenesis data are available for several additional promoters (Patwardhan et al. 2009), we found that these data were inconsistent with the results of single-gene studies, even on the same promoter (data not shown). We therefore did not include them in the present analysis.
For each cis-regulatory element for which mutagenesis data were available, we identified orthologous sequences in a number of closely related species (Figure S1). In each broad taxonomic group (bacteria, yeast, animals), we endeavored to align sequence from a set of species of roughly equivalent phylogenetic distance (measured by the metric of substitutions per base pair). In counting substitutions, we took into account the phylogenetic relationship of the species being compared. For example, substitutions in sister species that could be parsimoniously attributed to the common ancestor of these species were counted once, not twice.
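The counting rule described above can be illustrated with a minimal sketch. It assumes Fitch's small-parsimony algorithm on a rooted binary tree; the text states only that substitutions were attributed parsimoniously, so the choice of algorithm, the function name, and the species labels below are illustrative assumptions rather than the procedure actually used.

```python
def fitch_substitutions(tree, leaf_states):
    """Minimum number of substitutions at one aligned site (Fitch parsimony).

    tree: a leaf name (str) or a 2-tuple of subtrees (rooted, binary).
    leaf_states: dict mapping each leaf name to its observed nucleotide.
    Both the tree encoding and the names are hypothetical.
    """
    def visit(node):
        # A leaf contributes its observed state and no substitutions.
        if isinstance(node, str):
            return {leaf_states[node]}, 0
        left_set, left_count = visit(node[0])
        right_set, right_count = visit(node[1])
        shared = left_set & right_set
        if shared:
            # Children agree on at least one state: no change is needed here.
            return shared, left_count + right_count
        # Children disagree: one substitution is placed on a descendant branch.
        return left_set | right_set, left_count + right_count + 1

    return visit(tree)[1]


# Sister species "a" and "b" share T while the reference carries G:
# the change is attributed to their common ancestor and counted once.
tree = ("reference", ("a", "b"))
shared_change = fitch_substitutions(tree, {"reference": "G", "a": "T", "b": "T"})
```

With this encoding, a difference shared by both sister species yields a count of 1, not 2.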
We calculated the “mutation cost index” for all experimentally characterized mutations; it is a measure of the extent to which a mutation alters promoter function. Expression levels of all mutagenized promoters were normalized to the expression levels of the wild-type promoter (in rare instances when the mutant promoter drove higher expression, the inverse of the normalization ratio was recorded instead). For a mutation that reduced gene expression to α (normalized to the level of the wild-type promoter), mutation cost index was defined as 1 − α; therefore, it can range from 0 (no alteration of expression level) to 1 (complete abrogation of promoter function). Similarly, every nucleotide within a promoter can be said to have a “site index,” computed as a sum of the mutation cost indexes of all three possible substitutions. Site index can range from 0 (all mutations are inconsequential to promoter function) to 3 (all mutations at the site abolish expression). In the case of two promoters, a measure of function other than the level of gene expression was used (Table S1).
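As a worked illustration of these definitions, the following sketch computes both indexes from raw expression values. The function names and the example numbers are assumptions introduced here; the handling of mutants with higher-than-wild-type expression follows the inverse-ratio rule described above.

```python
def mutation_cost_index(mutant_expression, wildtype_expression):
    """Return 1 - alpha, where alpha is expression normalized to wild type.

    If the mutant drives higher expression than wild type, the inverse of
    the normalization ratio is used, so the index stays between 0 and 1.
    """
    alpha = mutant_expression / wildtype_expression
    if alpha > 1:
        alpha = 1 / alpha
    return 1 - alpha


def site_index(mutant_expressions, wildtype_expression):
    """Sum of mutation cost indexes over all three possible substitutions."""
    if len(mutant_expressions) != 3:
        raise ValueError("a complete site needs data for all three substitutions")
    return sum(mutation_cost_index(m, wildtype_expression)
               for m in mutant_expressions)


# Hypothetical site: wild-type A; mutations to C, G, and T reduce expression
# to 10, 30, and 60% of the wild-type level.
costs = [mutation_cost_index(m, 100.0) for m in (10.0, 30.0, 60.0)]
index = site_index((10.0, 30.0, 60.0), 100.0)
```

For this hypothetical site the three cost indexes are approximately 0.9, 0.7, and 0.4, giving a site index of about 2.0 on the 0 to 3 scale.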
To test whether the mutation cost index was significantly lower for substitutions than for all possible mutations, as would be expected under purifying selection, we performed sampled randomization tests in which artificial data sets were generated by randomly sampling from the set of all mutations (Sokal and Rohlf 1995). Each artificially generated set matched the substitution data in the number of mutations, but differed in the specific mutations sampled. In the most general version of the test the artificially generated sets were randomly drawn from all experimentally tested mutations.
We performed three variations of this test, each of which constrained certain characteristics of the sampled sets. In the first, artificial sets were constructed to have the same frequency of nucleotides as that of the sites that sustained substitutions. In the second variation of the test, artificial sets were matched in nucleotide frequencies to those of derived nucleotides (i.e., those nucleotides to which substitutions changed the ancestral nucleotides). In the third, the numbers of transition and transversion mutations were matched between the set of substitutions and the artificially generated sets. Sampled randomization tests were performed separately for bacteria, yeast, and animals. Each test was composed of at least 10,000 artificially generated sets. We calculated the fraction of instances in which the mean of an artificial set was lower than or equal to that of the substitutions set. This fraction, which estimates the probability that a randomly drawn set of mutations would have a mean mutation cost index as low as that of the observed substitutions, constituted the reported P-value. We chose this method because the distributions of mutation cost indexes were highly non-normal and the substitutions represented a subset of all possible mutations. The sampled randomization test makes no assumptions about the distribution of the underlying data. It reports the likelihood that the observed data set resembles a randomly chosen data set with regard to certain summary statistics (Sokal and Rohlf 1995). All statistical analyses were performed in the R statistical programming language (http://www.r-project.org).
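The logic of the sampled randomization test can be sketched as follows. This is a Python illustration, not the R code used for the analyses. It assumes sampling without replacement within each category, which the text does not specify, and the function name and data layout are hypothetical. Giving every mutation the same category label yields the general test; labeling by wild-type nucleotide, derived nucleotide, or transition versus transversion yields the constrained variations.

```python
import random
from collections import Counter, defaultdict
from statistics import mean


def randomization_p_value(all_mutations, substitutions, n_sets=10_000, seed=0):
    """Fraction of random sets whose mean cost is <= the observed mean.

    all_mutations, substitutions: lists of (category, cost_index) pairs.
    Each artificial set draws, per category, as many mutations as the
    substitution set contains (assumed: without replacement).
    """
    rng = random.Random(seed)

    # Pool the cost indexes of all tested mutations by category.
    pools = defaultdict(list)
    for category, cost in all_mutations:
        pools[category].append(cost)

    # Number of substitutions to match in each category.
    quota = Counter(category for category, _ in substitutions)
    observed = mean([cost for _, cost in substitutions])

    hits = 0
    for _ in range(n_sets):
        drawn = []
        for category, k in quota.items():
            drawn.extend(rng.sample(pools[category], k))
        if mean(drawn) <= observed:
            hits += 1
    return hits / n_sets


# Hypothetical data: three mild substitutions among mostly severe mutations.
pool = [("any", 0.0)] * 3 + [("any", 1.0)] * 50
p_general = randomization_p_value(pool, [("any", 0.0)] * 3, n_sets=2000)
```

In this toy example almost no random set of three mutations has a mean as low as that of the substitutions, so the returned fraction is close to zero.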
We explicitly tested whether functionally important nucleotides within promoters evolve under the same regime as their neighbors. A number of studies have been published in which individual nucleotides in a given promoter were replaced while holding all other nucleotides constant (e.g., Myers et al. 1985). Most commonly, mutagenized promoters were fused to reporter genes to compare their levels of expression to wild-type promoters. These tests measured the impact of each nucleotide substitution on the level of expression. For example, at a given site, an A could be a wild-type nucleotide, while mutations to C, G, and T could reduce gene expression to 10, 30, and 60% of the wild-type level, respectively. Combining these functional data with analysis of orthologous promoters could establish a relationship between function and rates of evolution. We assembled a data set of 14 such studies (Table S1), conducted on organisms from three distinct phylogenetic groups: animals (4), yeast (5), and bacteria (5). Together, these articles reported mutagenesis of 332 nucleotides and examined expression levels of 1040 constructs (animals: 136 nucleotides, 350 constructs; yeast: 79 nucleotides, 275 constructs; bacteria: 117 nucleotides, 415 constructs). Of all these experimentally tested nucleotides, 56 were inferred to have sustained substitutions (Figure S1). Although this data set is limited in size, we believe that it is a near-exhaustive collection of published articles reporting experiments of this type.
It may be expected that the effects on gene expression of substitutions that accumulated during evolution would be less severe than the effects of average mutations that could have occurred within these promoters. We tested this hypothesis (Figure 1). We found that the mutations corresponding to substitutions had lower mutation cost indexes than average mutations (bacteria: P = 1 × 10⁻⁴; yeast: P = 4 × 10⁻⁴; animals: P < 10⁻⁵). Therefore, among the substitutions that did occur, there was a substantial bias in favor of changes with lower impact on gene expression. This implies that purifying selection has acted to maintain gene expression levels.
Results in Figure 1 suggest that, in general, mutations with lower effects on promoter function tended to become fixed. Two distinct scenarios could account for this trend. First, the milder fixed substitutions could be distributed relatively evenly across sites. Alternatively, they may preferentially occur at a particular subset of sites. We used the functional data described above to test the hypothesis that substitutions are more common at sites where mutations have less severe effect on gene expression levels (Figure 2). One measure of functional importance of a site is an index defined as a sum of the mutation cost indexes of all three possible mutations that could occur at this nucleotide. Site indexes were significantly lower for positions with substitutions compared to all sites for which experimental mutagenesis data were available (bacteria: P = 3.2 × 10⁻³; yeast: P = 6.4 × 10⁻³; animals: P = 3.4 × 10⁻³). Therefore, in all three groups, substitutions preferentially occurred at sites where mutations were less disruptive to gene expression.
Mutational biases are not sufficient to account for the trends reported above. First, in the promoter sequences that we analyzed there was no systematic difference in nucleotide composition between sites that sustained substitutions and those that did not (Table S2). Second, correcting for multiple hypothesis testing, there were no significant differences in mutation cost indexes between mutations involving different wild-type nucleotides (Figure S2). Finally, we repeated sampled randomization tests holding constant the number of (i) wild type and (ii) derived nucleotides and (iii) transitions and transversions. All of these modified tests showed significant differences between mutation cost indexes of substitutions compared to all possible mutations (Table S3).
Our results suggest that purifying selection acts on promoter sequences in bacteria, yeast, and animals because we saw fewer than expected substitutions that corresponded to mutations of substantial effect. While these findings are concordant with previous reports of sequence conservation in cis-elements (Andolfatto 2005; Drake et al. 2006; Casillas et al. 2007; Molina and Van Nimwegen 2008), they add an important functional explanation for the observed patterns. An additional reason for the relative abundance of mutations of smaller effect is that they would be more likely to be beneficial and therefore be fixed by directional selection. Positive selection has been shown to act on cis-regulatory elements (Rockman et al. 2005; Haygood et al. 2007), and it may drive transcription-factor-binding-site turnover (Rockman et al. 2003; He et al. 2011). The inference of both positive and negative selection may not be contradictory, as it has been shown that both types of selection operate on gene regulatory elements in a variety of species (Kohn et al. 2004; Macdonald and Long 2005; Haddrill et al. 2008; Torgerson et al. 2009). Also, at least some regulatory regions are evolving under stabilizing selection (Ludwig et al. 2000; Loisel et al. 2006).
Five caveats should be noted. First, it is generally not known how changes in the level of expression translate into measures of fitness. However, our conclusions do not require a particular relationship, but merely a positive correlation between the extent to which a mutation changes expression of a gene and its fitness consequences. Available data suggest that such a correlation is likely (Shultzaberger et al. 2010). Second, the set of mutagenized sites was not random in all studies. In some cases, experimenters chose sites in which to induce mutations in a way presumably biased in favor of nucleotides expected to have more dramatic effects on gene expression. Third, the functional effects of mutations in promoters are highly context specific (Vidal et al. 1995). Therefore, fitness consequences of mutations are contingent on the backgrounds on which they occur and may have changed substantially over time (Bullaughey 2011). Fourth, functions of mutated promoters were tested either in cell lines (animals) or under laboratory conditions (bacteria and yeast). This leaves open a possibility that in vivo or under different environmental conditions, mutations seen in the laboratory as “functionally silent” may have substantial impact on fitness. Furthermore, although a point mutation of a given nucleotide may not have caused an appreciable change in expression level, the site may still be under selection because its deletion could cause a substantial decrease in gene expression (Patwardhan et al. 2009). It appears unlikely, however, that mutations that abrogate or substantially reduce expression are selectively neutral. Finally, all sequences analyzed in this study were derived from proximal promoter elements. The arrangement and composition of functional sites may be different between promoters and other cis-regulatory elements. Therefore, different types of cis-sequences may evolve under different selective regimes. 
Nonetheless, the results presented here highlight the value of functional data obtained at single-nucleotide resolution, not solely binding energy, for understanding regulatory evolution.
We are grateful to Kevin Bullaughey for numerous suggestions for improvement and help with data analysis. We thank Bin He and Marty Kreitman for critical reading and helpful suggestions and Chan Hee Choi for help during an early stage of this study. This work was made possible by grant support from the National Science Foundation (IOS-0843504) and the National Institutes of Health (NIH) (P50 GM081892) to I.R. and by an NIH training grant (T32 GM007197) to R.K.A.
Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.111.133637/DC1.
- Received August 4, 2011.
- Accepted August 26, 2011.
- Copyright © 2011 by the Genetics Society of America