Advances in sequencing technology have enabled whole-genome sequences to be obtained from multiple individuals within species, particularly in model organisms with compact genomes. For example, 36 genome sequences of Saccharomyces cerevisiae are now publicly available, and SNP data are available for even larger collections of strains. One potential use of these resources is mapping the genetic basis of phenotypic variation through genome-wide association (GWA) studies, with the benefit that associated variants can be studied experimentally with greater ease than in outbred populations such as humans. Here, we evaluate the prospects of GWA studies in S. cerevisiae strains through extensive simulations and a GWA study of mitochondrial copy number. We demonstrate that the complex and heterogeneous patterns of population structure present in yeast populations can lead to a high type I error rate in GWA studies of quantitative traits, and that methods typically used to control for population stratification do not provide adequate control of the type I error rate. Moreover, we show that while GWA studies of quantitative traits in S. cerevisiae may be difficult depending on the particular set of strains studied, association studies to map cis-acting quantitative trait loci (QTL) and Mendelian phenotypes are more feasible. We also discuss sampling strategies that could enable GWA studies in yeast and illustrate the utility of this approach in Saccharomyces paradoxus. Thus, our results provide important practical insights into the design and interpretation of GWA studies in yeast, and other model organisms that possess complex patterns of population structure.
- genome-wide association studies
- Saccharomyces cerevisiae
- mitochondrial DNA copy number
- model organism genetics
ASSOCIATION studies have become a dominant paradigm in human genetics, with >2000 studies published to data (Hindorff et al. 2009). The application of genome-wide association (GWA) studies to model organisms is a potentially powerful approach to rapidly identify causal variants that mediate heritable phenotypic variation (Risch and Merikangas 1996; Edwards et al. 2009; Atwell et al. 2010). However, a complication of GWA studies in many model organisms is the potential for type I errors produced by population structure, which lead to increased type I error rates (Lander and Schork 1994). Numerous methods have been developed to address this issue, which take structure into account when testing for associations (Pritchard et al. 2000; Patterson et al. 2006; Price et al. 2006; Kang et al. 2008). However, although these approaches work well in the cases of modest levels of population structure typical of human populations, it is unclear how effective they will be in different model systems.
Saccharomyces cerevisiae is a powerful model organism for genome-wide studies because of its small genome size, which has facilitated extensive sequencing of species and strains. For example, the Saccharomyces Genome Resequencing Project (SGRP) has performed whole-genome sequencing on 36 laboratory and wild strains of the model yeast S. cerevisiae (Liti et al. 2009). Genetic diversity between these strains is high, and the average pairwise distance between polymorphisms is 168 bp. Moreover, linkage disequilibrium (LD) is low; the half-life of LD is on average 3 kb. In theory, these characteristics should make S. cerevisiae strains a powerful resource for fine-scale mapping of associated loci. Indeed, association studies are beginning to be pursued using these and other strains (Mehmood et al. 2011; Muller et al. 2011).
However, the genomic patterns of population structure in S. cerevisiae strains are complex. Among the SGRP strains, the patterns of structure are complex and vary considerably across the genome; more than half the strains are considered mosaics (Liti et al. 2009). Thus, the prospect of GWA studies in yeast remains ambiguous. To address this issue, we performed a comprehensive set of analyses including simulations and an empirical GWA study of mtDNA copy number. We find that GWA studies in S. cerevisiae strains may be challenging for complex traits, although potentially feasible for Mendelian traits and in the identification of cis-acting variation. In addition, we suggest alternative study designs to mitigate the confounding effects of population structure, which will be broadly applicable to model organisms.
Materials and Methods
Sequence and SNP data
We obtained the publicly available genome sequences of the SGRP S. cerevisiae (n = 36) and S. paradoxus (n = 36) strains (from ftp://ftp.sanger.ac.uk/pub/dmc/yeast/latest/; Liti et al. 2009). We also obtained SNP genotypes from a second set of 63 S. cerevisiae strains (Schacherer et al. 2009). We note that because the SNP data for the Schacherer et al. (2009) data set was obtained using tiling microarrays with a resolution of 4 bp, it is not possible to combine the two data sets. Additionally, 13 S. cerevisiae strains are shared between the Liti et al. (2009) data set and the Schacherer et al. (2009) data set.
For each data set, we identified SNPs with a minor allele frequency (MAF) of ≥25%. We focused on common variants as the power to detect an association with more rare variation is limited. We then eliminated nearby SNPs in high LD using the program PLINK with a minimum r2 of 0.8 (Purcell et al. 2007). Using this reduced set of SNPs, we randomly selected a “causal” SNP. We then generated quantitative phenotype data on the allele of each strain, using a fixed effect equal to the standard deviation. For the lowest frequency variants (25%), this corresponds to a percentage of variance explained of 17%, using the formula: p(1 − p)k2/(p(1 − p)k2 + 1 − 1/n) = ∼1/(1 + 1/(p(1 − p)k2), where k is the fixed effect of 1.0 times the standard deviation, p is the frequency of the polymorphism with the fixed effect, and n is the number of individuals (in this case 36) (Yu et al. 2006). We examined several different sizes of fixed effects, and found that smaller effects were difficult to detect at all. Therefore, we chose to use a relatively strong fixed effect to maximize our ability to detect a significant association. We generated 1000 simulated quantitative phenotype sets and tested for genome-wide association between the simulated phenotypes and all tagSNPs using three methods: a t-test, a t-test performed on the residuals of phenotypes after regressing out the first principle component derived from the SNP data (Price et al. 2006), and EMMA (Kang et al. 2008; simulation programs and results are available on our website http://akeylab.gs.washington.edu/downloads.shtml). For EMMA, we used the entire set of tagSNPs to generate the K matrix. For Mendelian simulations, we used the same methods as above, but set the fixed effect equal to 10 times the standard deviation, which corresponds to a percentage of variance explained of 95%. For the cis-based association simulations, for each gene we randomly picked a tagSNP within 1 kb up- or downstream of each gene to be a causal variant. We then simulated a fixed effect using the alleles at this causal site, as for the quantitative trait simulations above. To assess the type I error rate at each gene, we then picked 1000 random SNPs to test for association and tested for association using EMMA, as above.
Yeast strains and culture
Two strains, BY4716 and RM11-1a, were obtained from Leonid Kruglyak. The remaining 34 yeast strains were received from the Saccharomyces Genome Resequencing Project (Liti et al. 2009). These strains were confirmed as haploid through mating tests or made haploid by integrating a KanMX cassette at the HO locus, sporulating the transformants, and isolating haploid spores. The remaining strains (DBVPG6044, YIIc17_E5, NCYC110, and Y9) are presumed to be diploids. Strains were grown in YEPD media to an OD of 0.8–1.0. Genomic DNA was isolated using the smash-n-grab method (Rose et al. 1990).
Measuring mtDNA copy number
PCR primers and probes were designed to two genes (COX3 and ATP6) in the mitochondrial DNA and one gene in nuclear DNA (GID8). Primers for qPCR were designed by sequencing a large region of each gene in all strains and designing primers and TaqMan probes to regions where there were no polymorphisms between strains. We used three reporter dyes for the three probes: VIC for the GID8 probe, FAM for the COX3 probe, and TET for the ATP6 probe. We ran a standard curve for each probe to confirm that the efficiency of the probes was high. We found that the efficiency was ∼1.0 for all three probes across a wide range of concentrations. We then ran two technical replicates for each biological replicate in 20-μl reactions using ABI Master mix including ROX dye. qPCR reactions were run on an ABI3300 using ABI’s standard 3300 protocol. We normalized the copy number in the mtDNA using the nuclear GID8 probe and calculated the relative mtDNA copy number in each strain by normalizing to the strain with the lowest copy number (BY4716). The mean copy number in the diploid strains was 1.55 vs. the mean in all strains of 1.86 overall; however, this difference was not significant (t-test, P = 0.14). We therefore included the diploid strains in our association test.
Association study of mtDNA copy number
To test for significant differences between strains, we fit a linear model to the qPCR data, where normalized copy was modeled as a function of biological replicate, technical replicate, mitochondrial probe, and strain. We tested for the significance of each variable using ANOVA and found that there was a significant (P < 0.05) effect for the biological replicate, probe, and strain effects, explaining 4.44, 58.8, and 28.5% of variance, respectively. We used the average of all technical and biological replicates for each strain to test for association. We then identified tagSNPs using the program HaploBlockFinder using a minimum LD (r2) of 0.8 (Zhang and Lin 2003). We then tested for association between the mean mtDNA copy number and the 27,101 tagSNPs as above, using both a simple t-test and EMMA. We corrected for multiple testing by permuting the phenotype data 1000 times, recording the lowest P-value from each genome-wide test, and used this distribution to determine genome-wide significance.
Neighbor-joining trees were generated using the Neighbor program in PHYLIP (Felsenstein 1989). To assess confidence, we performed 100 bootstraps and built consensus trees using the program Consense in PHYLIP.
Resampling simulations in S. paradoxus
We first randomly sampled 24 random strains out of 36, identified all SNPs with a MAF ≥25%, identified tagSNPs as above, and performed 1000 simulations as above, using a fixed effect of 1.0 times the standard deviation. We next used the 24 strains in the European cluster as identified by Liti et al. (2009), identified SNPs with the same MAF cutoff (n = 8822), and performed 1000 simulations as described above.
GWA studies of quantitative traits in yeast can have high type I error rates
To evaluate the effects of population structure on GWA studies in yeast, we first performed extensive simulations conditional on the SGRP sequence data. In the simulations, we identified all SNPs where the minor allele was present in at least 10 strains (MAF ranged from 0.27 to 0.50), as association studies with rare variants would have very low power in this data set, and selected tagSNPs from these (n = 18,637). Next, we randomly selected a causal SNP and assigned phenotypes to each strain based on the allele that the strain carried at this causal site. We initially focused on simulating quantitative trait loci (QTL) with large effects, which explain ∼17% of phenotypic variation (see Materials and Methods for details). We then performed a GWA study on the 18,637 tagSNPs using three different analyses: a simple t-test, a t-test on the residuals of phenotypes regressed on the first principal component (Price et al. 2006), and a more sophisticated linear mixed-model implemented in EMMA (Kang et al. 2008), which was designed for association mapping in model organisms. EMMA has been shown to dramatically reduce type I error rates due to population structure (Kang et al. 2008; Atwell et al. 2010). This process was repeated 1000 times and we calculated the average type I error rate for both t-tests and the linear mixed model of EMMA.
As expected, t-tests lead to a substantial inflation of type I error rate (Figure 1B). For example, at a nominal type I error level of 0.05, the observed false positive rate was 0.20, which is four times higher than expected. Including the first principle component as a covariate was effective in reducing type I error rate; however, the power was also dramatically reduced, rendering this method ineffective for this data set (see Table 1). Surprisingly, EMMA, while performing considerably better than t-tests, also exhibited an elevated type I error rate, with nearly twice as many false positives relative to that expected at a nominal level of 0.05 (Figure 1).
To explore the cause of the elevated type I error rates, we investigated whether there was variation in the type I error rate at individual SNPs across the 1000 simulations. In other words, we were interested in identifying regions of the yeast genome that were particularly susceptible to generating type I errors. We found that the error rate at individual SNPs at an expected rate of 0.05 varied from 0.038 to 0.168, with a mean of 0.096 (Figure 2). The error rate of 0.168 is much higher than would be expected by chance, assuming type I errors are randomly distributed across the genome (P = 2.80 × 10−12). We also found that SNPs with the highest type I error rate (defined as those in the top 1% of all SNPs tested) are highly correlated (average pairwise r2 = 0.58; also see Figure 2) even though they are distributed across the yeast genome. Strikingly, the genealogy of SNPs with the highest type I error rates differs substantially from the average genome-wide genealogy (Figure 2). Specifically, many of the mosaic strains identified by Liti et al. (2009) cluster with the wine/European strains, for SNPs with the highest type I error rate (Figure 2). Thus, the distinct pattern of structure in these regions compared to the genome-wide average likely makes it difficult to fully account for stratification in association test statistics.
To see how representative our data from the SGRP strains was of other yeast data sets, we analyzed a second data set consisting of 63 S. cerevisiae strains (Schacherer et al. 2009). We performed simulations using the same methods as above (see Materials and Methods). We again found an elevated type I error rate using EMMA, though it was modestly lower than in the original data set of 36 strains (Table 1). Similarly to the SGRP data set, including one principle component as a covariate was effective in reducing type I error rates at the expense of power (Table 1). Finally, we also analyzed the S. paradoxus strains sequenced by the SGRP. These 35 strains differ from the S. cerevisiae strains in that the patterns of structure are more consistent; there are no mosaic strains. We hypothesize that EMMA can more easily correct for this consistent pattern of structure. Indeed we found that the type I error rate was lower than that found in the SGRP strains (Figure 6), though still elevated above the expected rate.
We were also interested in more directly comparing the SGRP and Schacherer et al. (2009) data sets to gain insights into the differences in type I error rates between the two. To compare the two data sets, we resampled 36 strains from the 63 strains from Schacherer et al. (2009). We observed a similar small reduction in type I error rate in this resampled set of 36 strains compared to the 63 strains, suggesting that it is the particular set of strains in the Schacherer et al. (2009) data set that is responsible for the reduced type I error rate, not the larger sample size (Figure 3). Additionally, the Schacherer et al. (2009) strains appear to have the least amount of structure or recognizable subpopulations, which may account for their reduced error rate (Figure 3). For example, the average pairwise divergence at noncoding sites is 2.57 × 10−3 and 4.82 × 10−3, respectively, in the Schacherer et al. (2009) and SGRP strains, respectively. Therefore, the patterns of structure found in these two data sets result in elevated type I error rates, and differences in the amount and patterns of structure may explain the differences in error rates.
Association studies of Mendelian traits and cis-acting QTL are feasible
The results presented above demonstrate that GWA studies of quantitative traits in existing S. cerevisiae strains result in high type I error rates. Here, we evaluate whether other association strategies, such as the analysis of cis-acting QTL or Mendelian traits results in more interpretable results. To this end, we first repeated the simulations as described above for a Mendelian phenotype. In brief, for both the SGRP and Schacherer et al. (2009) data sets, a causal SNP was randomly selected and strains were assigned phenotypes based on the allele that they carried. A GWA study was then performed with EMMA using the 18,637 tagSNPs. As in the previous simulations, the type I error rate was elevated; for the SGRP strains, at an expected rate of 0.05 the observed mean rate was 0.07, and similar patterns were observed for the Schacherer et al. (2009) strains. However, as shown in Figure 4, in both data sets, the causal SNP responsible for a Mendelian trait was on average easily distinguishable from the background distribution of P-values. Thus, GWA studies of Mendelian phenotypes are feasible even in yeast strains with complex patterns of population structure.
Next, we investigated the feasibility of cis-based association mapping, which in contrast to a GWA study only performs an association test at a locus of interest. For example, such study designs are popular in mapping cis-regulatory QTL, where gene or protein expression levels have been measured for a large number of genes (Skelly et al. 2009). Here, tests of association are restricted to polymorphisms located in or adjacent to the gene whose transcript or protein abundance is being analyzed. To assess cis-based association mapping in the SGRP and Schacherer et al. (2009) data sets, we randomly selected a causal SNP nearby to each gene, assigned a quantitative phenotype as described above, and then used EMMA to perform an association test on 1000 randomly selected unlinked polymorphisms. The goal of this analysis is to determine whether population structure also results in a higher type I error rate in cis-based association tests. We found that the type I error rate was again elevated above nominal levels. Specifically, at the expected rate of 0.05, we found that the observed rate for cis-association tests was 0.099 for the SGRP strains and 0.082 for the Schacherer et al. (2009) strains, similar to those for genome-wide tests for each data set. However, even though the type I error rate is similar to that observed for a GWA study, cis-based associations are more interpretable because of the limited number of hypothesis tests performed. For example, a cis-based association study of a single gene expression trait would require ∼1–5 hypothesis tests (depending on local levels of LD and density of polymorphisms) vs. ∼19,000 hypothesis tests in a GWA study. In addition, the relative ease in functionally validating putative regulatory polymorphisms in yeast makes cis-associations an attractive approach. Thus, while some caution is warranted in interpreting cis-based association studies in yeast strains, they are a reasonable approach for identifying putative regulatory QTL.
Association mapping of mtDNA copy number
To complement the simulations, we next carried out a GWA study on mtDNA copy number in 36 S. cerevisiae strains. To measure copy number, we isolated total DNA containing both nuclear and mitochondrial DNA from each strain. To control for changes in mtDNA copy number throughout growth, we isolated DNA from the same growth stage for all strains and obtained two biological replicates for each strain. We used qPCR to quantify the amount of mtDNA per cell in the strains by comparing the amounts of nuclear and mtDNA (see Materials and Methods).
Using this assay, we found that relative copy number per strain varied from 1 in BY4716 to 3.28 in BC187 (Figure 5, Supporting Information, Table S1). These differences in copy number appear to approximately correlate with the genealogy of the strains (Figure 5). Using a linear mixed model that explicitly takes into account biological and technical sources of variation, we found that there was a significant level of variation across all the strains (P = 0.003). The effect of strain in this model accounted for 28.5% of total phenotypic variation. Interestingly, the strain with the lowest copy number, BY4716, which is a haploid derivative of S288C, is a long-used laboratory strain and the source of the S. cerevisiae reference sequence. The low copy number could be the result of relaxed selection for respiratory abilities during its many generations in the lab environment.
We next performed a GWA study for mtDNA copy on 27,101 tagSNPs (see Materials and Methods) using a simple t-test. After correcting for multiple testing, we found 73 SNPs significant at P < 0.05 (Figure 6A, File S1). We next repeated the GWA analysis with EMMA and after correcting for multiple testing, we identified one significant (P < 1.19 × 10−5) SNP, a synonymous variant in HPR1 on chromosome IV (Figure 6A). Strains with the C allele at this SNP had a mean copy number 0.5-fold higher than strains with the T allele (Figure 6B, Table S2).
HPR1 mediates mitotic recombination (Merker and Klein 2002) and recombination is thought to play a role in maintaining mtDNA and in proper segregation of mtDNA into newly forming buds (Lockshon et al. 1995; Zelenaya-Troitskaya et al. 1998; Lecrenier and Foury 2000; Ling and Shibata 2002). Thus, it is biologically plausible that HPR1 is a novel regulator of mtDNA copy number. However, the background distribution of P-values is high (Figure 6A), and our simulation results demonstrate the type I error rate is high. Moreover, the functional significance of the synonymous HPR1 variant is unclear, and it is not in significant LD with adjacent polymorphisms (data not shown). Thus, these results highlight the difficulty in interpreting GWA results of quantitative traits in these strains.
Strategies for enabling GWA studies in yeast
Finally, we investigated sampling strategies that may facilitate GWA studies in yeast. Specifically, one possible approach to ameliorate the confounding effects of population stratification is to intelligently choose strains that have either low levels of structure or a consistent pattern of structure across the genome. To assess the type I error rate in such a genealogy, we used a subset of the SGRP S. paradoxus strains. As seen in Figure 6, the S. paradoxus strains contain a clade of 24 European strains with low levels of structure (Figure 7). We note that while the average pairwise divergence in the S. paradoxus European strains (θπ = 0.0010) is somewhat lower than in the global S. cerevisiae population (θπ = 0.0056), there is still a substantial amount of divergence (Liti et al. 2009). We used these strains to evaluate type I error rates in GWA studies as described above. We found that the type I error rate was only slightly elevated above nominal levels (Figure 7). To determine whether the lower type I error rate was simply a function of limited power, we randomly sampled 24 S. paradoxus strains from the four divergent clades and repeated the GWA simulations. As shown in Figure 7, the type I error rate is significantly higher in the 24 randomly selected S. paradoxus strains, demonstrating that restricting association tests to a more homogenous subpopulation can decrease type I errors.
Our findings have biological and methodological implications for further study. The full genome sequencing and SNP genotype data available in 86 strains is a powerful resource to test hypotheses about the population history, domestication, and evolution of S. cerevisiae. There is also considerable interest in using these strains to map QTL (Liti et al. 2009). However, our results suggest that GWA studies of quantitative traits may be challenging in the strains studied to date. Specifically, the complicated and heterogeneous patterns of structure across the yeast genome in the strains analyzed here result in high type I error rates, a problem that will only be exacerbated in analyses of high-dimensional molecular phenotypes, such as gene expression levels. Several approaches could alleviate this problem, such as sequencing a greater number of strains and choosing a subset that mitigates the problems of structure while maintaining a high level of variation. We have shown that strains with a simpler population structure show much lower levels of false positives. Alternatively, a deeper understanding of the empirical patterns of population structure across the S. cerevisiae genome in globally diverse strains will facilitate the development of new statistical models that can be uniquely tailored to ameliorating type I errors. A strategy similar to the mouse collaborative cross (Threadgill et al. 2002) could also be undertaken.
Our results also provide important practical information on the design and interpretation of GWA studies in other model organisms. For example, EMMA has been used to perform GWA studies in dogs (Bokyo et al. 2010), Arabidopsis (Atwell et al. 2010), and Caenorhabditis elegans (Rockman and Kruglyak 2009). We suggest that GWA studies in model organisms with potentially strong and complicated patterns of structure carefully consider study design and sampling strategy and empirically assess the effects of population structure on type I error rates using the simple simulation framework described in our study. Fortunately, the decreasing costs of sequencing will allow individuals to be intelligently chosen to minimize the confounding effects of population structure and enable interpretable association studies in model organisms.
We thank members of the Akey laboratory for helpful discussions related to this project. This work was supported by National Institutes of Health grant R01GM098360 to J.M.A.
Communicating editor: S. Sen
- Received April 11, 2012.
- Accepted May 15, 2012.
- Copyright © 2012 by the Genetics Society of America