Assessing the Influence of Adjacent Gene Orientation on the Evolution of Gene Upstream Regions in Arabidopsis thaliana

The orientation of flanking genes may influence the evolution of intergenic regions in which cis-regulatory elements are likely to be located: divergently transcribed genes share their 5′ regions, resulting either in smaller “private” spaces or in overlapping regulatory elements. Thus, upstream sequences of divergently transcribed genes (bi-directional upstream regions, or URs) may be more constrained than those of uni-directional gene pairs. We investigated this effect by analyzing nucleotide variation segregating within and between Arabidopsis species. Compared to uni-directional URs, bi-directional URs indeed display lower population mutation rate, as well as more low-frequency polymorphisms. Furthermore, we find that bi-directional regions undergo selection for the maintenance of intergenic distance. Altogether, however, we observe considerable variation in evolutionary rates, with putative signatures of selection on two uni-directional upstream regions.

I N recent years, the relevance of noncoding DNA for phenotypic evolution has been documented in ever greater detail (Wray 2007). Studies of nucleotide variation within and between species suggest that a substantial number of adaptive fixations can be detected in these regions (Keightley and Gaffney 2003;Andolfatto 2005;Mustonen and Lassig 2007). However, much remains to be learned about the functional and genomic factors that influence the evolution of noncoding DNA.
Several recent studies suggest that the distance and relative orientation of adjacent genes may influence the evolution of gene upstream regions (URs), which host most cis-regulatory elements. In the human genome, short URs located between divergently transcribed genes (i.e., oriented head-to-head on the chromosome and ,1000 bp) are overrepresented and display a mirror structure indicative of bi-directional cis-regulatory activity (Trinklein et al. 2004;Engströ m et al. 2006). The preference for closely located head-to-head gene pairs appears to be specific to the mammalian lineage. However, the genes involved are not enriched in mammal-specific genes, suggesting that gene rearrangement per se may provide new regulatory configurations and opportunities for phenotypic novelties (Koyanagi et al. 2005). In yeast, adaptive evolution in the relative orientation of adjacent genes could not be proven, despite numerous gene losses following recurrent wholegenome duplications (Byrnes et al. 2006).
Cis-regulatory elements are known to impose constraints at least up to $1 kb upstream of transcription starting sites (Keightley and Gaffney 2003;Zeller et al. 2008). As a consequence, when distributed headto-head, closely located adjacent genes may share a reduced personal ''regulatory space'' or have common or intermingled cis-regulatory elements, both of which may constrain opportunities for evolutionary change. Instead, adjacent genes distributed on the same chromosomal strand each may have their own ''regulatory space'' and may be more easily decoupled in evolution. DNA located upstream of gene coding regions (in URs) may thus be under distinct constraints, according to the relative orientation and distance of the second gene flanking the UR (Figure 1). In a first approximation, bi-directional URs (i.e., located between head-to-head genes) may have to accommodate up to twice as many constraints for regulation compared to uni-directional URs. This special type of epistasis may constrain the regulatory evolution of divergently transcribed gene pairs, and adaptive novelties may thus more easily evolve in the context of uni-directional URs. Comparative genomics in the Ostreococcus genus of uni-cellular algae has indeed brought evidence for stronger constraints in intergenic regions separating genes in headto-head orientation (Piganeau et al. 2009). However, the same study also revealed that patterns of constraints in the genus Saccharomyces were clearly different.
Compared to animals, the evolution of cis-regulation in higher plants has received less attention (Wright and Andolfatto 2008). Nevertheless, the above-mentioned hypothesis is particularly interesting to test in the plant species Arabidopsis thaliana. This species has undergone recent genome shrinkage, with a reduction in chromosome number (from 8 to 5) and a 40% reduction in genome size ( Johnston et al. 2005;Schranz and Mitchell-Olds 2006;Oyama et al. 2008). One can therefore hypothesize that constraints on URs may have recently increased in this species because genes on average will be more closely located. We examined nucleotide variation in 29 URs and asked the following questions: (i) Are bi-directional URs more constrained than uni-directional URs? (ii) Is variation in unidirectional URs more likely to be adaptive compared to bi-directional URs? (iii) Does intergenic length influence the evolution of the two types of URs differently?

BI-DIRECTIONAL URs TEND TO DISPLAY LOWER LEVELS OF POLYMORPHISM
We picked 29 light-responsive genes, i.e., nonconstitutively expressed genes with a known or putative function, which were separated by ,1000 bp from their closest upstream neighbor. We consider these loci as functionally independent, given that a large proportion of genes are light-responsive in A. thaliana ( Jiao et al. 2005). All upstream neighbors were confirmed by at least a full-length cDNA sequence. Only one pair of neighbor genes were tandem duplicates (locus 760). Sizes of the intergenic regions ranged from 52 to 992 bp. Sixteen of these intergenic regions were bi-directional URs (i.e., located between two start codons). We sequenced these regions in 12 genotypes of A. thaliana (Bor4, RRS-10, Shakhdara, Bur-0, Lov-, Ler-1, Ts-1, C24, Tsu-1, Cvi-0, NFA-8, Got-7) and retrieved the Col-0 alleles from the database (http://www.arabidopsis.org). Sequences were submitted to public databases under the accession numbers FN674063-FN674399. Diversity was measured by the average number of pairwise differences, p. On average, levels of variation observed in our set of URs (p ¼ 0.0057; Table 1) are comparable to levels of synonymous variation reported in A. thaliana (Nordborg et al. 2005;Schmid et al. 2005). Interestingly, variance in levels of diversity across our set of URs (s ¼ 0.008) was very similar to variance displayed by the 56 intergenic loci examined in Nordborg et al. (2005) (s ¼ 0.007). This indicates that our set of URs provides a representative picture of variance in levels of nucleotide variation in noncoding regions. For bi-directional URs, p ranged from 0.0004 to 0.0097, with an average of 0.0029 (s ¼ 0.002; Table 1). For uni-directional URs, the average was more elevated (0.0091), and the variance was significantly greater (s ¼ 0.011), with values ranging from 0.0003 to 0.0421 (Bartlett's test of homoscedasticity: x 2 ¼ 23.522, P , 0.001). The two distributions were significantly different (Kruskal-Wallis test: x 2 ¼ 4.4308, d.f. ¼ 1, P , 0.036).
We computed maximum-likelihood estimates of population mutation rates (u) for each locus with MANVa (described in Schmid et al. 2005). This rate reflects the rate at which mutations are created at each generation minus those mutations that are immediately removed from the population by negative selection. We found that our data set is best described by three population mutation rates: u 1 ¼ 0.0017 for 12 loci, u 2 ¼ 0.0057 for 15 loci, and u 3 ¼ 0.0349 for 2 loci (likelihood comparisons of model with two rates vs. model with three rates, P ¼ 0.04). Of 13 uni-directional URs, 3, 8, and 2 loci exhibited u 1 , u 2 , and u 3 population mutation rates, respectively. By contrast, of 16 bi-directional URs, none exhibited a u 3 population mutation rate, and 9 and 7 exhibited low (u 1 ) and intermediate (u 2 ) rates, respectively. We performed 1000 bootstraps over positions for each locus to compare the mean value of u over all unidirectional and bi-directional URs. On average, u is significantly lower in bi-directional URs (quantile comparison, P > 0.001). Our data therefore indicate that bi-directional URs tend to display lower population mutation rates, although levels of diversity are clearly disparate within and between UR types.
The analysis of sequence divergence between species showed the same trend, although differences between the two types of URs were only marginally significant.   (Tajima 1989), Hn is the normalized Fay and Wu statistic (Fay and Wu 2000;Zeng et al. 2006), and E is a composite test statistic (Zeng et al. 2006 Orthologous sequences were retrieved from the Arabidopsis lyrata genome assembly, kindly provided by J. Fawcett and Y. van de Peer (University of Ghent). Alignments were optimized by eye, and rearranged regions were excluded from the analysis. In addition, fragments of length .30 bp for which .50% of positions did not align with A. thaliana were removed because it was not clear whether they were small insertions or highly diverged fragments. The occurrence of such interspersed fragments was variable among loci ( We nonetheless computed a multilocus Hudson-Kreitman-Aguadé (HKA) to test for homogeneity in the ratio of intraspecific polymorphism to interspecific divergence across all loci (Hudson et al. 1987). This test was significant, indicating variable ratios of polymorphism vs. divergence across loci (x 2 ¼ 109.3, d.f. ¼ 28, P , 0.001). Partial HKA computed for each locus revealed that two uni-directional loci, 410 and 760, strongly depart from the evolutionary rate shown by most loci (Table 1). These two loci display an excess of polymorphism relative to divergence. It is important to note at this point that positive selection, which is usually associated with high levels of divergence, may be difficult to detect. Indeed, we provide estimates of divergence only for the alignable noncoding regions. The two departing loci displayed significant Fay and Wu's Hn statistics, revealing an excess of high-frequency-derived mutations (Hn ¼ À2.57, P ¼ 0.033 and Hn ¼ À3.08, P , 0.014, respectively; Table 1) for the two loci. The two loci present relatively low levels of divergence, so that alignment issues did not influence the identification of the derived state at polymorphic sites.
Locus 410 (located between gene AT3G62400 and AT3G62410) displayed two very distinct haplotypes, one being very similar to the A. lyrata sequence. Interestingly, the derived haplotype was mutated in the termination codon of AT3G62400, suggesting functional differences between segregating alleles at this locus.
At locus 760 (located between genes AT2G03760 and AT2G03770), both ORFs were intact and both code for sulfotransferases. We screened this region for putative polymorphic binding sites using the plant cis-regulatory element database (Higo et al. 1999). This UR contained many defense-related motifs consistent with the known expression pattern of AT2G03760, a pathogen-responsive transcript. We identified a sulfur responsive element (GAGAC) gained in the high-frequency allele at position 6 in the 59 UTR of AT2G03760. It is tempting to speculate that this gain of function has caused the mutation to rise in frequency, but this hypothesis calls for a rigorous experimental validation. Altogether, our data do not bring definitive evidence that uni-directional URs are more likely to be associated with evolutionary novelties in A. thaliana, but support the hypothesis that cis-regulatory mutations are a potent resource for the exploration of novel and potentially adaptive phenotypes (Clark et al. 2006;de Meaux et al. 2006;Stern and Orgogozo 2008).
To understand how gene orientation influences UR evolution, we compared the ratio of polymorphism to divergence in bi-directional vs. uni-directional URs, using the HKA test (Hudson et al. 1987). This test can be implemented to test for differences in evolutionary dynamics between groups of genes because it does not assume equal mutation rates across loci. The test was performed for all positions in the UTR and intergenic regions. To make sure that the two regions apparently subject to selection were not driving differences between UR types, we subtracted the two loci, 410 and 760, from the data set. We found that ratios of polymorphism to fixed differences are comparable between the two types of noncoding regions (x 2 ¼ 0.2, P . 0.64). A Fisher exact test confirmed this finding (odds ratio 1.26, P ¼ 0.14). By contrast with the HKA test, this test may be excessively liberal because it assumes identical evolutionary history across loci of both types, an assumption that is met only with a large number of loci (Andolfatto 2005).
Given the elevated variance in levels of diversity and divergence, we did not compute mean statistics for each type of URs but compared distributions for Tajima's D, Fay and Wu's Hn (normalized H, developed by Zeng et al. 2006) and Zeng's E with Kolmogorov-Smirnov tests (Tajima 1989;Fay and Wu 2000;Zeng et al. 2006). Tajima's D can detect an excess of low-frequency polymorphism, which is either a signature of polymorphism recovery after a selective sweep or an indication of background selection maintaining slightly deleterious mutations at low frequency. Excess of low-frequency variants can also result from recent demographic events such as bottlenecks. Hn measures the excess of highfrequency polymorphism that can hitchhike with selected mutations or be caused by population admixture (Fay and Wu 2000). Zeng's E combines both approaches to yield a statistic that is more robust to confounding demographic effects (Zeng et al. 2006). For bi-directional URs, the distribution of Zeng's E values is shifted toward lower values (Kruskal-Wallis: . For Tajima's D, a similar shift is observed (x 2 ¼ 6.4604, d.f. ¼ 1, P , 0.012). Both UR types presented similar Fay and Wu's Hn (Kruskal-Wallis: x 2 ¼ 2.5, d.f. ¼ 1, P ¼ 0.11). In addition, we compared population mutation rates across UR types, after subtracting the two departing loci, and observed the same trend as reported above, but with marginal significance of the difference between uni-and bidirectional URs (quantile comparison, P ¼ 0.05). Taken together, these results indicate that the heterogeneity in evolutionary rate alone does not account for the differences that we observed across UR types and confirms that bi-directional URs tend to harbor more low-frequency mutations.

EVOLUTION FOR THE MAINTENANCE OF INTERGENIC DISTANCE IN BI-DIRECTIONAL URs
We eventually examined patterns of insertion/ deletion (indel) polymorphisms within A. thaliana. In our data set, low levels of polymorphism allow good alignments and an accurate identification of indel boundaries. Individual indel events were defined by the boundaries of gap stretches. Thus, overlapping gaps with different boundaries were considered as two different events. To characterize indel variation, we computed p, the average number of pairwise indel differences, as well as the average indel length for each locus (Table 1). All loci harbored indel polymorphism, with the exception of one bi-directional UR (490) and one uni-directional UR (960). On average, indel length was larger for uni-directional upstream regions than for bi-directional URs (5.531 bp vs. 4.359 bp), yet the distributions were not significantly different (Kruskal-Wallis test: x 2 ¼ 0.8239, d.f. ¼ 1, P . 0.36). Levels of indel diversity were also comparable between the two types of URs (0.0017 for bi-directional URs and 0.0027 for uni-directional URs; Kruskal-Wallis test: More interestingly, however, we observed a relationship between intergenic length and indel polymor-phism in bi-directional URs (Spearman's r ¼ 0.581, P , 0.018) but not in uni-directional URs (Spearman's r ¼ À0.406, P . 0.16). Indeed, we observe very low levels of indel polymorphism in bi-directional URs ,500 bp (F 1,25 ¼ 6.755, P , 0.016, Figure 2). The exclusion of the two loci showing signs of adaptive evolution did not modify this trend, although the effect became only marginally significant (F 1,23 ¼ 3.854, P ¼ 0.06). An analogous trend was not observed for the average number of pairwise single nucleotide differences p (r ¼ 0.08, P . 0.75 and r ¼ À0.50, P . 0.08 for bidirectional and uni-directional URs, respectively).
We extended our analysis to a larger set of URs. We used the draft genome of A. lyrata and recorded the distance between 10,390 pairs of orthologous flanking coding regions (3331 and 7059 bi-and uni-directional URs, respectively). We then examined the correlation of UR length between species. The correlation between length in A. thaliana and length in A. lyrata is higher in bi-directional than in uni-directional upstream regions, and the slopes are significantly different (R ¼ 0.755 and 0.698, respectively; significant interaction: t ¼ À4.45, P , 1.e-05). Bi-directional URs are shorter in A. thaliana than in A. lyrata but less so than uni-directional URs. This result supports the existence of stronger constraints on intergenic distance in bi-directional URs. The analysis of indel distributions instead did not yield conclusive results (not shown). However, current algorithms are not performing well for generating interspecific sequence alignments in plant noncoding regions. Comparisons of UR length between species provides an interesting example of how differences between species can be quantified without relying on alignments. CONCLUSION A few studies have investigated the patterns of coexpression or shared gene ontology categories among gene neighbors in A. thaliana, the best annotated plant genome (Williams and Bowles 2004;Krom and Ramakrishna 2008;Wang et al. 2009). Yet, surprisingly Note few studies have investigated the evolutionary dynamics of plant noncoding regions (but see de Meaux et al. 2005 andMiyashita 2001). Here, we made the simplifying assumption that the most constrained regulatory elements are located in the 59 upstream region. It is therefore quite remarkable that we observe consistent differences associated with the orientation of gene neighbors: bi-directional URs display lower polymorphism and a greater number of low-frequency polymorphisms.
Several hypotheses can be invoked to interpret this result. First, polymorphism can be decreased by positive selection on (or around) bi-directional URs or maintained by balancing selection on (or around) uni-directional URs. Any of these hypotheses would indeed explain why both types of URs present similar levels of divergence, although they differ in levels of polymorphism. However, it is not clear why selection patterns should differ between uni-and bi-directional URs in this way. Furthermore, we believe that estimates of interspecific divergence do not reflect adequately the evolutionary dynamics in these regions. Indeed, levels of divergence vary within loci, with conserved fragments flanking highly divergent regions. This has also been reported in other noncoding regions (Miyashita 2001;de Meaux et al. 2006). Therefore, levels of divergence give an indication of divergence only for the most conserved fragments.
Several elements indicate that the pattern that we observe could be explained by a second hypothesis: stronger constraints exerted on bi-directional URs. Indeed, we find lower polymorphism levels and a greater number of low-frequency polymorphisms in these regions. Importantly, we also find that indel polymorphisms are less frequent in short bi-directional URs. This relationship was confirmed at the interspecific level because bidirectional URs tended to be less reduced in size than uni-directional URs. Constraints on indel polymorphisms in intergenic regions have been demonstrated in Drosophila melanogaster (Ometto et al. 2005). In rice, a study of genes displaying cis-regulatory differences in a heterotic F 1 cross were shown to be associated with indel variation and not single nucleotide polymorphism (Zhang et al. 2008). Indel polymorphism may therefore alter cis-regulatory function more strongly than single nucleotide polymorphism. Maintaining a minimal length may be important to avoid collision between regulatory complexes and transcription factors in regions where regulatory elements may be particularly dense. Conversely, if regulatory elements are shared or intermingled, increased intergenic distances may disrupt regulatory interactions. Our study suggests that the distance between binding sites is a predominant functional feature encoded in these regions and shows that patterns of indel polymorphism can reveal features of noncoding DNA that are not apparent from SNP polymorphism data.
This conclusion, however, should not mask that our data also reflect the complexity of noncoding evolution in plant genomes. We actually observe considerable variation in evolutionary dynamics across URs in A. thaliana, with great disparity in rates of evolution within loci as well as within and between species (Table 1). The recent progress of high-throughput sequencing technologies will soon allow a comprehensive analysis of the noncoding genome and help confirm the significant differences observed here. Understanding the evolutionary mechanisms that drive noncoding evolution remains a goal of major importance for upcoming research in plants.