Hypervariable Noncoding Sequences in Saccharomyces cerevisiae
Justin C. Fay¹ and Joseph A. Benavides
¹Corresponding author: Department of Genetics, Box 8510, 4444 Forest Park Pkwy., St. Louis, MO 63108. E-mail: jfay{at}genetics.wustl.edu
Abstract
Compared to protein-coding sequences, the evolution of noncoding sequences and the selective constraints placed on these sequences are not well characterized. To compare the evolution of coding and noncoding sequences, we have conducted a survey for DNA polymorphism at five randomly chosen loci among a diverse collection of 81 strains of Saccharomyces cerevisiae. Average rates of both polymorphism and divergence are 40% lower at noncoding sites and 90% lower at nonsynonymous sites in comparison to synonymous sites. Although noncoding and coding sequences show substantial variability in ratios of polymorphism to divergence, two of the loci, MLS1 and PDR10, show a higher rate of polymorphism at noncoding compared to synonymous sites. The high rate of polymorphism is not accompanied by a high rate of divergence and is limited to a few small regions. These hypervariable regions include sites with three segregating bases at a single site and adjacent polymorphic sites. We show that this clustering of polymorphic sites is significantly greater than one would expect on the basis of the spacing between polymorphic fourfold degenerate sites. Although hypervariable noncoding sequences could result from selection on regulatory mutations, they could also result from transient mutational hotspots.
PROBABILISTIC models for the molecular evolution of DNA sequences have provided much insight into protein function and evolution (Kimura 1983; Fay and Wu 2003). The power of these models is derived in part from the genetic code, which results in the interspersion of sites with nonsynonymous and synonymous effects on the amino acid sequence of a protein. In contrast to protein-coding sequences, we know relatively little about the function and evolution of cis-regulatory sequences. Although some models have been developed (Moses et al. 2004a,b), a major limitation is the paucity of experimentally identified cis-regulatory sequences.
The examination of polymorphism and divergence in cis-regulatory sequences has shown that while these sequences are constrained, substantial variation exists both within and between species (Jenkins et al. 1995; Ludwig and Kreitman 1995; Ludwig et al. 1998; Tautz and Nigro 1998; Dermitzakis et al. 2003; Moses et al. 2003; Phinchongsakuldit et al. 2004). This variation can be explained under a neutral model since there are degenerate positions within transcription factor binding sites (Moses et al. 2003) and redundant binding sites within an enhancer (Ludwig et al. 1998, 2000). In one study, the DNA sequence variation was found to be inconsistent with a neutral model (Jenkins et al. 1995). However, these studies have been limited to the few regulatory sequences that have been examined in detail, mostly those acting early in Drosophila development.
The genome sequencing of closely related species has provided a wealth of data on the molecular evolution of both coding and noncoding sequences (Cliften et al. 2003; Kellis et al. 2003; Thomas et al. 2003; Richards et al. 2005). One of the main motivations for these projects has been the identification of conserved noncoding sequences, the majority of which likely function in gene regulation. The identification of regulatory sequences by their conservation between species presents a challenge to understanding their evolution since not all regulatory sequences may be tightly conserved. One approach is to study noncoding sequences in their entirety, eliminating any bias in the method used to distinguish functional and nonfunctional sequences.
The examination of polymorphism and divergence in unannotated noncoding sequences has revealed a number of regions showing a higher than expected rate of polymorphism or divergence. The rate of polymorphism but not divergence was found to be greater in the 5′-UTR and intronic sequence of hunchback compared to that in synonymous sites in adjacent hunchback coding sequences (Tautz and Nigro 1998). A small 200-bp region upstream of Attacin C showed a rate of polymorphism 10-fold higher than that found at nearby synonymous sites (Lazzaro and Clark 2001). Divergence and linkage disequilibrium were also much higher in the region. Examination of polymorphism and divergence in 136 5′-UTR sequences in humans revealed a higher ratio of divergence to polymorphism at 5′-UTR compared to that in fourfold degenerate sites found in adjacent coding sequences (Hellmann et al. 2003). The same result was found for noncoding sequences upstream of accessory gland proteins (Kohn et al. 2004). Although there are a number of caveats to comparing variation in noncoding sites to that in synonymous sites, together the data suggest the selective forces acting on coding and noncoding sequences may be quite different.
To compare rates of variation in coding and noncoding sequences, we have surveyed DNA polymorphism at five randomly chosen loci in a diverse collection of 81 strains of Saccharomyces cerevisiae. For each locus we examined 608–845 bp of coding sequence at the 5′ end of the gene and 611–804 bp of noncoding sequence, nearly the entire 5′-intergenic sequence. The five loci include: CCA1, a tRNA nucleotidyltransferase (Aebi et al. 1990); CYT1, which encodes cytochrome c1, a component of the mitochondrial respiratory chain (Sadler et al. 1984); MLS1, a malate synthase (Hartig et al. 1992); PDR10, an ATP-binding cassette membrane pump involved in pleiotropic drug resistance (Balzi and Goffeau 1995); and ZDS2, known to function in chromatin silencing and cell cycle progression (Bi and Pringle 1996; Roy and Runge 1999). Similar to variation at nonsynonymous sites, all five loci show lower rates of divergence in noncoding compared to synonymous sites. Yet, two genes, MLS1 and PDR10, show higher rates of polymorphism at noncoding compared to synonymous sites.
MATERIALS AND METHODS
Strains:
Strains were obtained from a variety of sources. B1–B6 were obtained from B. Dunn. I14 was collected by J. Fay. CDB and PR were obtained from Red Star Yeast (Oakland, CA). K1–K15 were obtained from N. Goto-Yamamoto and the NODAI culture collection. M1–M34 were provided by R. Mortimer. UC1–UC10 were obtained from the University of California (Davis, CA) Department of Viticulture and Enology culture collection. SB was bought at Whole Foods (Berkeley, CA). Y1–Y12 were provided by C. Kurtzman from the Agriculture Research Service Culture Collection. YJM145–YJM1129 were obtained from J. McCusker. YPS163–YPS1009 were provided by P. Sniegowski.
Polymorphism survey:
Five genes from divergently transcribed intergenic sequences were randomly chosen from the Saccharomyces Genome Database, excluding RNA genes and genes of unknown function. Genes with no clear ortholog in S. paradoxus were not considered. For each gene, the 5′-intergenic sequence and a portion of the coding sequence were amplified by PCR, purified, and both strands were sequenced using BigDye (Perkin Elmer, Boston) termination sequencing. Phred and Phrap were used to call bases and assemble a contiguous sequence for each strain (Ewing and Green 1998). Consed was used to visualize the sequence assemblies and to identify heterozygous sites. Only one of the two haplotypes inferred using PHASE was used in the analyses (Stephens et al. 2001). Sequences were aligned using ClustalW. Population genetic analyses were done using DNASP (Rozas and Rozas 1999). Substitution rates between species were estimated using PAML (Yang 1997).
RESULTS
DNA polymorphism:
DNA polymorphism was surveyed in 81 strains of S. cerevisiae (Table 1), constituting a total of 3561 bp of intergenic sequence and 3671 bp of coding sequence. A total of 191 polymorphic sites were found, constituting 67 unique haplotypes. Four of the polymorphic sites contained 3 segregating bases. Twelve insertions and no deletions were found. The 12 insertions ranged in length from 1 to 3 bp. Of the 12 insertions, 8 were within a string of A or T bases ranging in size from 4 to 11 bp, 1 was within a C4 repeat, 1 consisted of a TA2 repeat, and 1 consisted of a TC5 repeat.
Strains studied and their source
Heterozygous sites were found in 35 of the 81 strains and at 94 of the 191 polymorphic sites. A chi-square test for Hardy-Weinberg equilibrium identified 45 polymorphic sites with a significant deficit of heterozygous strains (P < 0.001). Because most natural isolates of S. cerevisiae are homothallic diploids (Mortimer et al. 1994), haploid spores are capable of switching mating type and selfing. Thus, loss of heterozygosity is not unexpected. However, of the 35 strains with heterozygous sites, 10 strains were heterozygous at between 11 and 26 sites while the remaining 25 strains were heterozygous at 5 or fewer sites. The strains with high levels of heterozygosity can be explained by a recent mating between two distantly related strains or by loss of their capability to sporulate, a common phenotype found in commercial wine strains (Johnston et al. 2000).
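The chi-square test for Hardy-Weinberg equilibrium mentioned above can be illustrated with a minimal Python sketch. It assumes a biallelic site and diploid genotype counts; the function name and the counts in the usage note are hypothetical and are not taken from the survey data. The P-value uses the closed form for a chi-square distribution with 1 d.f.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test for Hardy-Weinberg proportions.

    Takes genotype counts at one biallelic site (hypothetical input format)
    and returns (chi-square statistic, P-value with 1 d.f.).
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)          # frequency of allele A
    q = 1 - p
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

For example, `hwe_chi_square(40, 1, 40)` describes a site with a single heterozygous strain among 81 and gives a strong deficit of heterozygotes, whereas `hwe_chi_square(25, 50, 25)` matches Hardy-Weinberg proportions exactly.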
The distinction of haploid and diploid strains is important for allele frequency estimates and other population genetic analyses. Although sporulation is a clear indication of diploidy, the absence of sporulation is uninformative since some diploids sporulate at very low frequencies. To avoid this problem we analyzed only one allele from each strain. For those strains containing heterozygous sites, we inferred haplotypes using the program PHASE (Stephens et al. 2001) and randomly chose one of the two inferred haplotypes. Because 16 of the heterozygous sites are unique variable sites that are present in only a single strain, the random sampling resulted in the loss of 7 polymorphic sites. All subsequent analyses are based on the 184 polymorphic sites that remained (Table 2).
Polymorphic sites identified in five genes
Diversity at synonymous sites ranges from 0.33 to 1.32% at the five loci (Table 3), where diversity is measured by the average number of pairwise differences between strains per base pair. The overall average diversity, 0.84%, is higher than that in humans, 0.11–0.15% (Cargill et al. 1999; Halushka et al. 1999), but lower than that in Drosophila melanogaster, 1.41% (Kern and Begun 2005).
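As an illustration of the diversity measure defined above (average number of pairwise differences per base pair), the following Python sketch computes it from a list of aligned sequences. It assumes equal-length sequences with no missing data and makes no distinction between synonymous and other sites; the function name is hypothetical.

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    """Average number of pairwise differences per base pair.

    seqs: aligned sequences of equal length, one per strain,
    assumed here to contain no gaps or missing data.
    """
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_diffs / (len(pairs) * length)
```

Restricting the input to synonymous sites only, as in the estimates reported here, would require a separate step that is not shown.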
Population sample statistics from five genes
The frequency spectrum is slightly skewed toward rare variants compared to that expected from a randomly mating population of constant size under a Wright-Fisher model. Tajima's D (Tajima 1989) ranges from −0.60 to −1.09 among the five genes, none of which are significant (Table 3). One hundred twenty-eight SNPs have a minor allele frequency of <10% compared to the 99 expected under a Wright-Fisher model (Watterson 1975). Four of the 12 insertions, all found within the promoter of PDR10, have a minor allele frequency of >10%.
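For readers unfamiliar with these statistics, the sketch below shows how Watterson's estimator and Tajima's D are computed from the number of segregating sites, the mean number of pairwise differences, and the sample size. It follows the standard published formulas; the function names and the numbers in the usage note are illustrative assumptions and do not reproduce the DNASP analysis.

```python
import math

def harmonic_sums(n):
    """Return a1 = sum(1/i) and a2 = sum(1/i^2) for i = 1 .. n-1."""
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    return a1, a2

def watterson_theta(num_segregating, n, length):
    """Watterson's estimator of theta per base pair."""
    a1, _ = harmonic_sums(n)
    return num_segregating / a1 / length

def tajimas_d(num_segregating, mean_pairwise_diffs, n):
    """Tajima's D from S, the mean pairwise differences (not per site), and n."""
    a1, a2 = harmonic_sums(n)
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    s = num_segregating
    return (mean_pairwise_diffs - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))
```

A negative value, as found for all five genes, arises when the mean number of pairwise differences is smaller than S/a1, that is, when there is an excess of rare variants.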
Population structure:
We examined population structure stratified by the source from which each strain was obtained and by continent from which each strain was isolated (Table 1). Forty-two strains were from Europe, 14 from Asia, 15 from America, 4 from Africa, and 6 are of unknown origin. Forty-four strains were isolated from grapes, wine fermentations, or commercial wine production. Seven strains were from natural samples, including oak tree exudates, a mushroom, a fig, and various fruits. Eleven strains were obtained from clinical samples of immunocompromised patients. Ten strains were obtained from sake fermentations. Seven strains were obtained from fermentations excluding wine and sake. Two strains were from an unknown source.
Significant population differentiation was found both among sample sources and among sample locations (Table 3, P < 0.001 for all genes). However, the sources and locations from which the strains were isolated are correlated with one another. Most European strains were obtained from vineyards, most North American strains were obtained from clinical samples, and most Asian strains were obtained from fermentation of substrates other than grapes.
Different patterns of variation were found among different groups of strains. A significant reduction in diversity is found within strains from wine and sake compared with diversity within other groups and total diversity (Table 4). In addition to a reduction in diversity, the vineyard strains also show a greater proportion of rare variants, as measured by Tajima's D, compared to that in other groups and to the total (Table 4).
Diversity within groups
Linkage disequilibrium and recombination:
There are significant levels of linkage disequilibrium between unlinked genes (Figure 1). Linkage disequilibrium was measured by the average absolute value of D′ for all polymorphic sites with a minor allele frequency of >10% among the 45 strains for which there was no missing data (Table 5). The average absolute value of D′ from all pairwise comparisons between loci ranges from 0.57 to 0.82. All pairwise comparisons within loci range from 0.71 (ZDS2) to 0.98 (PDR10). The expected absolute value of D′ for unlinked sites is 0.26 and was obtained by resampling of polymorphic sites, keeping the allele frequencies constant. This high rate of linkage disequilibrium can be expected given the population structure found in S. cerevisiae and its ability to reproduce asexually.
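The |D′| measure used above can be sketched as follows for one pair of biallelic sites. The input format (a list of 0/1 allele pairs, one per haploid genome) and the function name are assumptions made for illustration; the resampling used to obtain the expected value for unlinked sites is not shown.

```python
def abs_d_prime(haplotypes):
    """Absolute value of D' for two biallelic sites.

    haplotypes: list of (allele_at_site_1, allele_at_site_2) tuples,
    with alleles coded 0 or 1 (hypothetical input format).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    # D is scaled by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0
```

Complete association between two sites gives |D′| = 1, and independent assortment of alleles gives 0.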
Linkage disequilibrium measured by |D′|. Only pairs with |D′| > 0.5 and a significant association are shown (Fisher's exact test, P < 0.05). The color indicates |D′| that ranges from 0.5 (yellow) to 1.0 (red).
Average pairwise linkage disequilibrium
There is ample evidence of recombination within each of the five loci. Each locus shows evidence of between two and five recombination events by the four-gamete test (Table 4). The absolute value of D′ within a locus is negatively correlated with distance between polymorphic sites (P = 5 × 10^−5), but not for the randomized data (Table 6). This could be due to gene conversion or recombination between individuals from the same subpopulation but not from different subpopulations. The recombination mutation ratio, estimated from the ratio of θw from synonymous sites over 4Nc, ranges from 1.4 to 2.7 and the average is 2.1 (Table 7). The mutation rate has been estimated from CAN1 and SUP3 at 2.25 × 10^−10 per base pair per generation (Drake 1991). Given that 82% of spontaneous mutations are single-base substitutions (Kang et al. 1992), the point mutation rate is 1.84 × 10^−10. The genomic average rate of recombination is 0.34 cM/kbp or 6.8 × 10^−6 recombination events per base pair (Cherry et al. 1997). Similar to a previous study (Jensen et al. 2001), the laboratory estimate of the ratio of recombination events to mutation events is four orders of magnitude greater than that inferred from the polymorphism data. This can be explained by higher rates of asexual compared to sexual reproduction as well as by mating-type switching, which enables a cell to mate with its forebear following meiosis.
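The four-gamete test can be illustrated with the following Python sketch. It assumes biallelic sites and haplotypes given as equal-length strings restricted to polymorphic positions. The count of non-overlapping intervals is one common way to obtain a lower bound on the number of recombination events and is an assumption here, not necessarily the exact procedure implemented in DNASP.

```python
from itertools import combinations

def has_four_gametes(haplotypes, i, j):
    """True if all four allele combinations occur at biallelic sites i and j."""
    return len({(h[i], h[j]) for h in haplotypes}) == 4

def min_recombination_events(haplotypes):
    """Lower bound on recombination events from the four-gamete test.

    Every pair of sites showing four gametes defines an interval that must
    contain at least one recombination event; the bound is the largest
    number of such intervals that do not overlap.
    """
    n_sites = len(haplotypes[0])
    intervals = [
        (i, j)
        for i, j in combinations(range(n_sites), 2)
        if has_four_gametes(haplotypes, i, j)
    ]
    intervals.sort(key=lambda ij: ij[1])   # greedy choice by right end point
    count, last_end = 0, -1
    for start, end in intervals:
        if start >= last_end:
            count += 1
            last_end = end
    return count
```

Under an infinite-sites model, four gametes at a pair of sites cannot arise by mutation alone, which is why each such pair is taken as evidence of at least one recombination event between the two sites.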
Average absolute value of D′
Population sample statistics from 45 strains with data from all five genes
Selection on synonymous sites:
The detection of selection on nonsynonymous or noncoding sites is greatly facilitated if synonymous sites are effectively neutral. In S. cerevisiae, there is ample evidence that synonymous sites are not neutral (Bennetzen and Hall 1982; Bulmer 1987). However, not all genes and not all synonymous sites may be influenced by selection. For the purposes of detecting selection on nonsynonymous or noncoding sites, synonymous sites may be considered effectively neutral if their substitution rate and pattern of preferred and unpreferred synonymous changes are no different from those in neutral sites.
To determine which genes in S. cerevisiae are clearly affected by selection on synonymous sites, we compared the synonymous substitution rate to codon bias (Figure 2). From 1538 genes, there is a clear reduction in the synonymous substitution rate for genes with high codon bias or a small effective number of codons (ENC). However, most genes have a synonymous substitution rate that is not correlated with codon bias. We arbitrarily classified genes as high and low bias, using an ENC cutoff of 45. The 1331 low-bias genes have an average synonymous substitution rate of 0.87 and show no correlation between codon bias and synonymous substitution rate. In contrast, the high-bias genes have an average synonymous substitution rate of 0.60 and show a significant correlation between codon bias and synonymous substitution rate (Pearson's r = 0.74, P < 10^−15). With the exception of CYT1, the genes examined in this study have a synonymous substitution rate nearly identical to the average rate and do not have high levels of codon bias.
Synonymous substitution rate among S. cerevisiae, S. paradoxus, and S. mikatae in relation to the average codon bias from the three species, as measured by ENC (Wright 1990). The dashed line shows the arbitrary cutoff used to distinguish high- and low-biased genes. The two solid lines are the least-squares fit of a regression of codon bias and synonymous substitution rate for the genes showing high and low codon bias. The five genes examined in this study are shown in red.
The pattern of substitutions at synonymous sites is indicative of whether synonymous sites are at mutation-selection balance. If they are not, the assumption that synonymous sites are effectively neutral is violated. In Drosophila, the relationship between codon bias and synonymous substitution rate is very weak if present (Dunn et al. 2001; Bierne and Eyre-Walker 2003). There is, however, a clear difference in the pattern of preferred and unpreferred synonymous substitutions between D. melanogaster and D. simulans (Akashi 1996; Begun 2001). To determine whether patterns of synonymous substitutions in S. cerevisiae show a similar nonequilibrium status, we compared the number of unpreferred and preferred changes along the lineage leading to S. cerevisiae and within strains of S. cerevisiae (Table 8). Both polymorphic and fixed synonymous changes show an equal number of preferred to unpreferred (P → U) and unpreferred to preferred (U → P) changes.
Preferred and unpreferred synonymous polymorphism and divergence
The data show that the synonymous substitution rate and pattern of synonymous substitution in four of the five genes are consistent with those expected for neutral sites. CYT1 has a reduced rate of synonymous substitution, but, interestingly, has the highest rate of synonymous-site diversity (Table 3). The HKA test (Hudson et al. 1987) reveals a higher ratio of synonymous polymorphism to divergence in CYT1 compared to that in PDR10 (P = 0.048), but not in comparison to that in any of the other three genes. After correction for multiple comparisons this difference is not significant.
Selection on nonsynonymous and noncoding sites:
The ratio of nonsynonymous to synonymous substitutions (dN/dS) measures the selective constraint on a protein. In the absence of positive selection or any changes in selective constraint, the dN/dS ratio should be constant across lineages and should not be greater than one (Fay and Wu 2001). None of the five proteins show significant differences in the levels of constraint among the lineages leading to S. cerevisiae, S. paradoxus, and S. mikatae (likelihood-ratio test using PAML, P > 0.05). The combined dN/dS ratios are 0.09, 0.10, and 0.09 for the branch leading to S. cerevisiae, S. paradoxus, and S. mikatae, respectively.
The ratio of noncoding to synonymous substitutions (dNC/dS) measures the selective constraint on noncoding sequences, assuming the mutation rate is the same across the coding and noncoding sequences. All of the promoters show considerable levels of functional constraint. The combined dNC/dS ratios are 0.53, 0.49, and 0.49 for the branch leading to S. cerevisiae, S. paradoxus, and S. mikatae, respectively. This implies that nearly one-half of intergenic sequences are functionally constrained, only slightly >0.52, the median dNC/dS from 2098 genes (Doniger et al. 2005).
Under the same assumptions used to test for branch-specific dN/dS ratios, the ratio of nonsynonymous- to synonymous-site polymorphism (pN/pS) should equal dN/dS. This comparison is the basis for the McDonald-Kreitman (MK) test (McDonald and Kreitman 1991), which compares polymorphic sites and fixed differences rather than estimates of substitution rates. The average pN/pS ratio, 0.11, is nearly identical to that of divergence, 0.09 (Table 9). Similarly, the average pNC/pS ratio of diversity, 0.62, is similar to that of divergence, 0.54 (Table 9).
Rates of DNA polymorphism compared to divergence
The comparison of N/S ratios from polymorphism and divergence can be misleading if positive selection increases the N/S ratio of divergence and negative selection increases the N/S ratio of polymorphism (Fay et al. 2001). The effect of negative selection on the N/S ratio of polymorphism can be examined by comparing low-frequency to common polymorphism. Both Drosophila (Fay et al. 2002) and humans (Fay et al. 2001) show an elevated N/S ratio of rare compared to common polymorphism, indicative of deleterious mutations segregating at low frequency in the population. In S. cerevisiae, the ratio of the rate of nonsynonymous and synonymous polymorphism that is rare, 0.12, is nearly identical to the rate of common polymorphism, 0.11. The ratio of NC/S from rare polymorphism, 0.60, is also very similar to that of common polymorphism, 0.62. This can be explained if most deleterious mutations are recessive and removed from the population following mating-type switching and selfing.
The overall pattern of polymorphism and divergence indicates selective constraint along the lineage leading to S. cerevisiae is similar to that found among extant populations. However, natural selection may influence polymorphism and divergence at individual genes or regions without affecting overall patterns of polymorphism and divergence. Because of the paucity of nonsynonymous polymorphism and divergence, we compared variation only in noncoding to synonymous sites. The NC/S ratio of diversity is larger than that of divergence for MLS1, PDR10, and ZDS2 and lower than that of divergence for CCA1 and CYT1. In addition, the NC/S ratio of polymorphism is greater than unity for both MLS1 and PDR10.
Two tests can be used to assess the significance of the difference between noncoding and synonymous polymorphism and divergence. The MK test can be applied to the number of polymorphic and fixed noncoding and synonymous changes (McDonald and Kreitman 1991). However, the MK test assumes the mutation rate in the two regions is the same, the coalescence time for the two regions is the same, and the number of fixed differences between species can be reliably determined. The first two assumptions are reasonable when the MK test is applied to nonsynonymous and synonymous changes. However, the latter assumption is not justified when divergence is >5–10% because of multiple hits (Templeton 1996). The HKA test can also be applied to noncoding- and synonymous-site polymorphism and divergence (Hudson et al. 1987). The HKA test explicitly accounts for any differences in mutation rates or coalescence times between the two regions, but, like the MK test, does not account for multiple hits. Despite these concerns, we applied an MK and an HKA test to noncoding and synonymous polymorphism and divergence. For the MK test, the number of fixed differences along the lineage leading to S. cerevisiae was estimated by the maximum-likelihood estimate of the synonymous and noncoding substitution rate multiplied by the number of synonymous and noncoding sites. Neither the MK nor the HKA test was significant for any of the genes (P > 0.05).
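The logic of the MK test applied to noncoding and synonymous sites can be sketched as a test of independence on a 2 × 2 table of polymorphic and fixed changes. The Python sketch below uses a two-sided Fisher's exact test, which is one common choice for such tables and is an assumption here. The function name and counts are hypothetical, and integer counts are required, whereas the fixed differences in this study were estimated from maximum-likelihood substitution rates.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    For an MK-style comparison (hypothetical layout):
      a = noncoding polymorphic,  b = synonymous polymorphic,
      c = noncoding fixed,        d = synonymous fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
```

Because the test conditions on the table margins, it compares only the ratio of polymorphic to fixed changes between the two classes of sites and cannot detect an excess of polymorphism confined to a small part of the noncoding region.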
If locus-specific differences in the ratio of noncoding polymorphism to divergence have occurred by chance, the ratio of polymorphism to divergence should be relatively constant across a sliding window of each noncoding sequence. Alternatively, if natural selection has increased or decreased the rate of noncoding polymorphism or divergence, the effect of selection may well be localized to a portion of the noncoding region. To examine diversity in the rate of noncoding polymorphism and divergence, we plotted a sliding window of diversity and substitution rate across the noncoding region from each gene as well as across the concatenated fourfold degenerate synonymous sites from all of the genes (Figure 3). The two y-axes in Figure 3 are scaled such that the average rate of synonymous polymorphism is equal to the average rate of synonymous divergence. Noncoding divergence is more variable and on average lower than synonymous-site divergence, as expected. Noncoding polymorphism is also much more variable than synonymous polymorphism, but in four different regions is greater than the range found at synonymous sites, as shown in Figure 3 (light gray area). Two of the noncoding hypervariable regions lie upstream of MLS1 and the other two are upstream of PDR10 and ZDS2. The two hypervariable regions upstream of MLS1 contain 2 of the 4 sites with 3 segregating bases and 5 additional segregating sites that are within 3 bases of one another (Table 10). The hypervariable region upstream of PDR10 has 10 segregating sites that are within 7 bp of another segregating site (Table 11).
Sliding window of polymorphism compared to divergence for noncoding and synonymous sites. Divergence (red) is between S. cerevisiae and S. paradoxus as measured by the Jukes-Cantor model (Jukes and Cantor 1969), and polymorphism (black) is measured by diversity among strains without missing data. The range of synonymous-site divergence (gray) and polymorphism (light gray) is shown as shaded regions. The plot is separated into six regions labeled below the abscissa. The first five regions are from the noncoding sequences 5′ of each gene. The last region (Syn) is the concatenation of fourfold degenerate synonymous sites from the coding sequences of all five genes.
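The Jukes-Cantor correction used for divergence converts the observed proportion of differing sites, p, into an estimated number of substitutions per site, d = −(3/4) ln(1 − 4p/3). A minimal Python sketch follows, assuming two aligned sequences and skipping any column that is not A, C, G, or T in both; the function name is hypothetical.

```python
import math

def jukes_cantor_distance(seq1, seq2):
    """Jukes-Cantor distance between two aligned sequences.

    Columns containing gaps or ambiguous characters are ignored.
    """
    sites = [(a, b) for a, b in zip(seq1, seq2) if a in "ACGT" and b in "ACGT"]
    p = sum(a != b for a, b in sites) / len(sites)
    if p >= 0.75:
        raise ValueError("proportion of differences too high for the correction")
    return -0.75 * math.log(1 - 4 * p / 3)
```

The corrected distance always exceeds the raw proportion of differences when p > 0, since it accounts for multiple substitutions at the same site.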
Haplotypes from two hypervariable regions upstream of MLS1
Haplotypes from the hypervariable region upstream of PDR10
To determine whether there are regions with more noncoding polymorphism than can be explained by a neutral model, we examined the distance between segregating sites. The advantage of the distance between segregating sites is that it has well-defined statistical properties compared to a sliding-window analysis, which is dependent on the window length and step size. Assuming a constant rate of polymorphism, p, and given a polymorphic site, the probability of d sites until the next polymorphism is $P(d) = p(1 - p)^{d}$, for $d = 0, 1, 2, \ldots$. Thus, the distance between polymorphic sites is geometrically distributed with parameter p, which can be estimated from the number of polymorphic sites per base pair.
If there is an increase in the rate of polymorphism within a portion of a noncoding region, the distance between segregating sites should be less than that expected under a neutral model. The expected distances were calculated using the geometric distribution with a rate parameter, p, estimated from concatenated fourfold degenerate synonymous sites. To test the goodness-of-fit between the observed and expected distance values we binned the distance between segregating sites in three classes, 0–7 bp, 8–20 bp, and >20 bp, to ensure the expected number of sites in each category is greater than five. Synonymous sites provide a good fit to the geometric distribution using a G-test with Williams' correction (Sokal and Rohlf 1995) (Table 12). Of the five noncoding regions, only MLS1 and PDR10 show a significant deviation from a geometric distribution using the rate parameter from synonymous sites (P = 0.019 and P = 0.049, respectively). Although MLS1 and PDR10 are not individually significant after correction for multiple tests, the combined probability of all five genes is significant (P = 0.007, Fisher's test of combined probabilities) and the sum of the data from all five genes is also significant (P = 0.024, G-test). Furthermore, the G-test is conservative because the overall rate of noncoding polymorphism is less than that of synonymous polymorphism and so the expected distance between segregating noncoding sites should be greater than the distance between concatenated fourfold degenerate sites.
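The goodness-of-fit procedure described above can be sketched in Python as follows. The sketch assumes that the distance d counts the sites between consecutive polymorphisms (d ≥ 0, so P(d) = p(1 − p)^d), uses the three distance classes given in the text, and evaluates the P-value with 2 d.f., for which the chi-square tail has a closed form. The function names and the counts in the usage note are hypothetical and are not the values in Table 12.

```python
import math

def geometric_bin_probs(p, edges=(7, 20)):
    """Probabilities of the classes 0-7, 8-20, and >20 for P(d) = p(1-p)^d."""
    q = 1 - p
    first = 1 - q ** (edges[0] + 1)
    second = q ** (edges[0] + 1) - q ** (edges[1] + 1)
    third = q ** (edges[1] + 1)
    return [first, second, third]

def g_test_williams(observed, probs):
    """G-test of goodness of fit with Williams' correction (three classes)."""
    n = sum(observed)
    expected = [n * pr for pr in probs]
    g = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    a = len(observed)
    q = 1 + (a * a - 1) / (6 * n * (a - 1))   # Williams' correction factor
    g_adj = g / q
    p_value = math.exp(-g_adj / 2)            # chi-square tail, valid for 2 d.f. only
    return g_adj, p_value

def fisher_combined(p_values):
    """Fisher's combined probability: -2 * sum(ln p) is chi-square with 2k d.f."""
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)
    # Closed-form chi-square tail for an even number of degrees of freedom.
    return math.exp(-half_x) * sum(half_x ** i / math.factorial(i) for i in range(k))
```

For instance, `g_test_williams([20, 5, 5], geometric_bin_probs(p))` tests whether an excess of short distances is compatible with a rate p estimated from fourfold degenerate sites, and `fisher_combined` pools the per-locus P-values into a single probability.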
Distance between consecutive segregating sites
The significant clustering of polymorphic sites in noncoding regions suggests that there may be positive or balancing selection on functional noncoding sequences. To determine whether polymorphic sites occur in functional sequences, we compared the number of polymorphic sites in positions conserved among S. cerevisiae, S. paradoxus, and S. mikatae to the number of polymorphic sites in positions that are not conserved. Although a little less than half of the noncoding polymorphic sites are found in conserved positions, the same proportion of fourfold degenerate sites (4d) are found in positions conserved across species (Table 13).
Distribution of SNPs in conserved sequences
Previous studies have found reduced levels of variation in experimentally identified functional noncoding sequences (Ludwig and Kreitman 1995; Ludwig et al. 1998; Tautz and Nigro 1998; Dermitzakis et al. 2003; Phinchongsakuldit et al. 2004), but high levels of variation in unannotated noncoding sequences (Tautz and Nigro 1998; Lazzaro and Clark 2001). Of the five intergenic sequences, the promoters of CYT1, MLS1, and COQ5, adjacent to ZDS2, have been identified by deletion constructs. For CYT1, the minimal promoter was delineated to a 209-bp sequence 351 bp upstream of CYT1 (Oechsner et al. 1992). For MLS1, the minimal promoter was delineated to a 190-bp sequence, 474 bp upstream of MLS1 (Caspary et al. 1997). For COQ5, functional promoter elements were found in a 26-bp and a 90-bp sequence starting 400 bp upstream of COQ5 (Hagerman et al. 2002). For CYT1, the rate of polymorphism in the experimentally defined promoter, 6/209, is much less than the total rate found in all synonymous sites, 47/833 (Table 2). In contrast, the experimentally defined promoter of MLS1 has a rate of polymorphism more than twice that of synonymous sites, 14/190, as it encompasses one of the hypervariable regions identified by the sliding-window analyses (Figure 4). While the rate of polymorphism for one of the two COQ5 promoter regions is low, 1/90, the other has a rate of 3/26, four times that of synonymous sites (Figure 4). Thus, while polymorphic sites are not overrepresented in conserved positions (Table 13), they tend to be found in experimentally defined promoter sequences (Figure 4).
Multiple sequence alignment of experimentally defined promoters (gray) upstream of MLS1 (A) and ZDS2 (B). Polymorphic sites are shown in green and transcription factor-binding sites designated in the original study are shown in red. Arrows show sites with three segregating bases.
DISCUSSION
With short ∼500-bp intergenic sequences, S. cerevisiae provides an excellent opportunity to understand the functional constraints placed on cis-regulatory sequences and their evolution. To address this issue we have compared DNA sequence variation found within and between Saccharomyces species in noncoding and coding sequences. The main result is the observation of noncoding sequences with higher than expected rates of polymorphism. A secondary finding is extensive linkage disequilibrium, even between unlinked loci. Population subdivision, possibly caused by two separate domestication events (Fay and Benavides 2005), is likely a major contributor to linkage disequilibrium.
On average, rates of DNA polymorphism and divergence in noncoding sites are ∼40% lower than those at synonymous sites (Table 9). Yet four small regions within noncoding sequences show rates of polymorphism greater than those at synonymous sites (Figure 3). The clustering of polymorphic sites upstream of MLS1 and PDR10 is significantly greater than that expected on the basis of fourfold degenerate sites.
For a variety of reasons, commonly used statistical tests of neutrality show no significant results for the MLS1 and PDR10 promoters. Neither an MK test nor an HKA test showed any significant differences between rates of polymorphism and divergence at noncoding and synonymous sites. The lack of significance can be explained by the fact that both tests compare rates of polymorphism and divergence between two classes of sites, averaged over the entire region, whereas the hypervariable sequences are limited to small regions within the intergenic sequences. The runs test is designed to detect heterogeneity in the ratio of polymorphism to divergence, since under neutrality polymorphic sites and fixed differences are expected to be evenly interspersed with one another (McDonald 1996). Neither the MLS1 nor the PDR10 noncoding sequences showed any significant departure from neutrality by any of the statistical tests of heterogeneity implemented in DNA Slider (McDonald 1998). It is likely that the power of these tests is limited when rates of divergence are high, since at high divergence the number of runs should become relatively constant for any distribution of polymorphic sites.
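The idea behind a runs test can be sketched by labeling each variable site along the sequence as polymorphic (P) or fixed (F) and counting runs of identical labels. The example below uses the classical Wald-Wolfowitz normal approximation. This is an assumption made for illustration: it is not the implementation in DNA Slider, and the function names and example sequences are hypothetical.

```python
import math


def count_runs(sites):
    """Count maximal runs of identical labels in an ordered sequence."""
    if not sites:
        return 0
    return 1 + sum(1 for a, b in zip(sites, sites[1:]) if a != b)


def runs_z_score(sites, label_a="P", label_b="F"):
    """Wald-Wolfowitz z-score; negative values mean fewer runs than
    expected, i.e., labels are more clustered than random interspersion."""
    n1 = sites.count(label_a)
    n2 = sites.count(label_b)
    n = n1 + n2
    if n1 == 0 or n2 == 0:
        raise ValueError("both kinds of site are required")
    expected = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return (count_runs(sites) - expected) / math.sqrt(variance)


# Hypothetical orderings of polymorphic (P) and fixed (F) sites.
clustered = "PPPPPFFFFFFFFFF"
interspersed = "PFFPFFPFFPFFPFF"

for name, sites in [("clustered", clustered), ("interspersed", interspersed)]:
    print(name, count_runs(sites), round(runs_z_score(sites), 2))
```

The normal approximation is rough for the small site counts typical of a single intergenic region, so this sketch conveys the logic of the test rather than a usable significance threshold.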
There are few explanations for the clustering of polymorphic noncoding sites upstream of MLS1 and PDR10. Our evidence comes from comparing the distance between noncoding polymorphic sites to the distance between synonymous polymorphic sites. Thus, a number of factors that affect polymorphism at synonymous sites should be considered. First, a reduced mutation rate at synonymous sites should cause a decrease in diversity and an increase in the distance between polymorphic synonymous sites. However, divergence at synonymous sites is greater than that found at noncoding sites. Second, selection could result in a reduction in diversity at synonymous but not at noncoding sites, thereby increasing the distance between synonymous polymorphic sites. With the exception of PDR10, there is no evidence for a reduction in synonymous-site diversity across the five genes (Table 3). Even if diversity at synonymous sites within PDR10 were reduced, the statistical test for clustering of noncoding sites relies on the combined fourfold degenerate data from all five genes. Finally, some synonymous sites may be functionally constrained, thereby lowering rates of polymorphism and increasing the distance between polymorphic sites. Only CCA1 shows evidence for significant levels of constraint on synonymous sites (Figure 2). Yet instead of showing a reduced rate of polymorphism, CCA1 has the highest level of synonymous-site diversity (Table 9).
Hypervariable noncoding sequences could also be caused by mutation hotspots in noncoding sequences. Yet hotspots should increase both polymorphism and divergence, and only polymorphism appears inflated. Furthermore, there is little evidence for large-scale mutational heterogeneity across the S. cerevisiae genome (Chin et al. 2005). If mutational hotspots are present but not at a fixed location, polymorphic sites should cluster but divergence should average to a uniform distribution over a long enough period of time. Although transient mutational hotspots provide a rather tenuous explanation for the data, two mechanisms are possible. First, a polymorphic site may increase the mutation rate at nearby bases. Second, recombination hotspots have been found to be transient (Ptak et al. 2005; Winckler et al. 2005). If recombination hotspots are transient in yeast and mutagenic, or repair deficient, they may result in transient mutational hotspots. If this is the case, hypervariable noncoding sequences should be observed across the genome. Little or no clustering would be observed in coding sequences since most mutations would be removed by negative selection.
Selection on noncoding sites can both increase and decrease the distance between polymorphic sites. Both changes in selective constraint and the presence of deleterious mutations can influence the ratio of noncoding- to synonymous-site diversity. Yet, under these two factors alone, the distance between noncoding polymorphic sites should be at least equal to that found at synonymous sites. Positive selection, diversifying selection, and balancing selection can all increase the rate of polymorphism at noncoding sites above that of synonymous sites. Although positive selection predicts very short sojourn times for variants under selection, population subdivision would inhibit the rapid spread of a selected allele through the entire species (Slatkin and Wiehe 1998). Alternatively, balancing or diversifying selection could account for the excess of noncoding compared to synonymous polymorphism and predicts elevated rates of polymorphism but not of divergence. Although there is no clear way to confidently distinguish these models of selection, one pertinent observation is that the ratio of noncoding to synonymous polymorphism (NC/S) is greater than one within some but not all groups of strains categorized by source of isolation.
In conclusion, there are two plausible models that can explain the hypervariable noncoding sequences. First, hypervariable regions could be caused by transient mutational hotspots. Second, hypervariable regions could be caused by some form of natural selection acting on mutations that affect gene expression. With hypervariable regions found in three of the five genes, only two showing significance, it is difficult to determine whether the hypervariable sites are common, and more likely the result of a mutational explanation, or rare, and more likely the result of natural selection.
Acknowledgments
We thank B. Dunn, N. Goto-Yamamoto, R. Mortimer, C. Kurtzman, J. McCusker, and P. Sniegowski for contributing yeast strains; E. Mardis at the Genome and Sequencing Center for use of an ABI 3730xl sequencer; and two anonymous reviewers for useful comments on the interpretation of the data.
Footnotes
- Received February 20, 2005.
- Accepted May 12, 2005.
- Genetics Society of America