In genomewide mapping of expression quantitative trait loci (eQTL), it is widely believed that thousands of genes are trans-regulated by a small number of genomic regions called “regulatory hotspots,” resulting in “trans-regulatory bands” in an eQTL map. As several recent studies have demonstrated, technical confounding factors such as batch effects can complicate eQTL analysis by causing many spurious associations including spurious regulatory hotspots. Yet little is understood about how these technical confounding factors affect eQTL analyses and how to correct for these factors. Our analysis of data sets with biological replicates suggests that it is this intersample correlation structure inherent in expression data that leads to spurious associations between genetic loci and a large number of transcripts inducing spurious regulatory hotspots. We propose a statistical method that corrects for the spurious associations caused by complex intersample correlation of expression measurements in eQTL mapping. Applying our intersample correlation emended (ICE) eQTL mapping method to mouse, yeast, and human identifies many more cis associations while eliminating most of the spurious trans associations. The concordances of cis and trans associations have consistently increased between different replicates, tissues, and populations, demonstrating the higher accuracy of our method to identify real genetic effects.
GENOMEWIDE analysis of gene expression data in segregating populations has been widely conducted to understand the genetic basis of regulation in many organisms including yeast (Brem and Kruglyak 2005), Arabidopsis (Keurentjes et al. 2007), mouse (Bystrykh et al. 2005; Chesler et al. 2005), and human (Cheung et al. 2005; Stranger et al. 2007). To understand the complex regulatory network, numerous statistical analysis methods have been proposed, including clustering of coregulated genes (Yvert et al. 2003), multipoint linkage analysis (Brem et al. 2005; Storey et al. 2005), prediction of regulatory modules (Ghanzalpour et al. 2006; Lee et al. 2006), and pathway enrichment analysis (Subramanian et al. 2005; Ye and Eskin 2007).
Among these “genetical genomics” approaches, the most widely used statistical analysis is expression quantitative trait loci (eQTL) mapping between genetic variation and gene expression levels (Brem et al. 2002). The goal of these studies is to identify associations between an individual genetic variation and the differential expression of a gene that might help explain the transcriptional regulation of the gene. Many recent studies have identified a large number of cis associations between eQTL and the expression of genes in close proximity. They have also identified many more trans associations between eQTL and the expression of genes in other regions of the genome (Yvert et al. 2003; Chesler et al. 2005; Hubner et al. 2005). An interesting observation consistent across multiple data sets is that hundreds or even thousands of genes are trans-regulated by a small number of genomic regions called “regulatory hotspots” (Chesler et al. 2005; Keurentjes et al. 2007) and these associations appear as “trans-regulatory bands” in eQTL plots regardless of the normalization method used (Bystrykh et al. 2005; Chesler et al. 2005, 2006; Hubner et al. 2005; Peirce et al. 2006; Williams et al. 2006).
Recent genetical genomics studies of yeast have provided much evidence supporting the existence of global regulators that induce trans-regulatory bands (Foss et al. 2007; Perlstein et al. 2007). For mammalian expression data sets, although large numbers of regulatory hotspots have consistently been observed, the locations of these regulatory hotspots are inconsistent between different data sets (Chesler et al. 2005; Hubner et al. 2005; Peirce et al. 2006). Simulation studies suggest that spurious regulatory hotspots may be frequently observed in outbred populations (Perez-Enciso 2004; de Koning and Haley 2005; Wang et al. 2007).
Building on previous studies we examine two first-generation expression data sets of recombinant inbred (RI) mice, where their regulatory hotspots have been shown to poorly replicate in previous studies (Peirce et al. 2006). Due to the high degree of systematic confounding inherent in these data sets, it is particularly challenging to distinguish true genetic effects from the spurious associations. The availability of biological replicates in these data sets allows us to compare the level of true positives between different methods. Two observations suggest that many trans-regulatory bands previously identified in these data sets correspond to “spurious” regulatory hotspots, not to real genetic effects. First, the locations of regulatory hotspots are inconsistent across disjoint sets of biologically replicated samples. Second, stronger trans-regulatory bands frequently appear with randomly permuted SNPs. To understand the cause of this phenomenon, we carefully examined these data sets and identified a surprising pattern of intersample correlation where the pairwise correlations of expression arrays between different strains are often stronger than between replicates of the same strain.
Previous studies have shown that many factors contribute to the spurious correlation between microarray samples, including systematic bias from sources such as technical variation in microarray manufacturing (Churchill 2002; Akey et al. 2007), variations introduced during sample preparation such as the time postmortem a sample is collected, and variations introduced during expression measurements such as the batch of reagents used or laboratory ozone levels (Fare et al. 2003; Branham et al. 2007). Such spurious intersample correlations are usually not completely resolved by randomized design of the experiment (Churchill 2002) or through low-level normalization techniques (Yang et al. 2002; Irizarry et al. 2003).
We suspect that when a SNP, by chance, segregates the strains in a manner consistent with the intersample correlation, the P-values of associations between that SNP and the transcripts are inflated, leading to spurious associations and in extreme cases, a trans-regulatory band. To verify this phenomenon, we constructed a set of simulated data to intuitively show how complex intersample correlation structure inherent in expression data leads to associations between genetic loci and a large number of gene transcripts inducing spurious regulatory hotspots. When we generated random expression data using the same intersample correlation structure found in the recombinant inbred expression data, we observed exactly the same regulatory hotspots.
Two types of computational approaches have previously been proposed to reduce the effects of confounding factors in gene expression experiments. The first type is methods that correct for known confounding factors, including ComBat (Johnson et al. 2007), which directly estimates the location and scale model parameters that represent the batch effect using an empirical Bayes (EB) approach. The second type is methods that correct for unknown confounding factors, including surrogate variable analysis (SVA) (Leek and Storey 2007), which identifies, estimates, and corrects for principal components of expression heterogeneity.
We propose a statistical method that corrects for confounding effects induced by complex intersample correlation of expression measurements in eQTL mapping, using a linear mixed model. Our intersample correlation emended (ICE) eQTL mapping directly incorporates the complex correlation structure into the statistical model as a variance component accounting for random effects. Compared to ComBat, our approach is not limited by prior knowledge of confounding factors and is capable of capturing the complex correlation structure introduced by multiple known and unknown effects. Compared to SVA, which projects confounding effects onto several distinct single-dimensional vectors each treated as a fixed effect in the statistical model, our random-effects model is not limited by the number of confounding variables because it does not explicitly infer and correct for each confounding variable. Instead, our method needs only to estimate the total correlation between samples and corrects for the cumulative effects over all confounding factors, using the intersample correlation structure. Furthermore, the statistical power of SVA decreases as the number of confounding variables increases due to the loss of degrees of freedom while our method always uses only one additional degree of freedom for the intersample variance component. As a result, our method has the advantage that it is able to correct for a mixture of strong and moderate confounding effects as shown in our simulation studies while SVA is able to correct only for a number of strong confounding factors.
To gain some intuition as to why our random-effects model corrects for confounding factors, consider a pair of samples with a differing expression level for a given gene and a marker SNP that segregates the pair of samples. If the remaining gene expression values have similar expression values between the samples, intuitively this pair of samples provides more evidence that the SNP is associated with the gene's expression level than if the remaining gene expression values differ greatly between the samples. In the latter case, the expression difference of the gene between the pair of samples is less informative given the large amount of global differences in expression values between the samples, which may be due to a confounding factor such as a batch effect.
We applied our statistical model to expression data from two mouse tissues (hematopoietic stem cell and whole brain). In both cases, ICE eQTL mapping outperformed ComBat and SVA in eliminating the spurious trans-regulatory bands while increasing the number of identified cis associations. The remaining trans associations are more likely to be real genetic effects because they are concordant between tissues and between replicates. In yeast, where global regulators have been previously identified, a separate permutation analysis showed that most of the regulatory hotspots are likely to correspond to real genetic effects. Even though yeast regulatory hotspots are likely to be genuine, they globally influence the expression levels and may seriously confound the identification of gene-specific cis or trans associations (Storey et al. 2005; Leek and Storey 2007). After applying ICE eQTL mapping to correct for the confounding effects from regulatory hotspots, the number of cis associations almost doubled and the concordances of cis and trans associations between disjoint subsets significantly improved. Finally in human lymphoblastoid cell lines, where other known batch effects have been suggested (Akey et al. 2007), our analysis identified more real cis associations than methods that explicitly correct for the batch effects. Our method is publicly available as an R package at http://genetics.cs.ucla.edu/ice
MATERIALS AND METHODS
Gene expression data and genetic maps:
We obtained the yeast expression data set over 112 segregants across 6216 probes from the Gene Expression Omnibus (GEO) database with accession no. GSE1990 (Brem and Kruglyak 2005). Each of them has two replicates, and the values are represented as the log ratio between the expression and the average expression of the reference (BY) strains. A total of 5534 genes with validated genomic annotations were mapped onto the genome to draw the genomewide eQTL maps. For BXD RI data sets, we obtained the hematopoietic stem cell (HSC) data from the GEO database with accession no. GSE2031 and the whole brain data set by request from the authors. Both data sets use the Affymetrix U74Av2 GeneChip platform and contain 12,422 probes. A total of 8596 probes were mapped onto the NCBI build 34 version of the mouse genome, using refSeq to draw the eQTL maps. The HSC data set contains the expression data over 22 strains with duplicates for each strain, and the whole brain data set contains 64 samples over 28 strains, varying from one to four measurements per strain. The second-generation whole brain data set using M430v2 arrays downloaded from GeneNetwork (http://www.genenetwork.org) contains expression profiles over 45,102 probes across 30 BXD RI strains with up to six replicates per strain. Their expression values were normalized using robust multi-array average (Irizarry et al. 2003).
We obtained the human lymphoblastoid cell line expression data over the HapMap individuals from the GEO database with accession no. GSE5859 (Spielman and Cheung 2007). There are a total of 141 samples, 60 from Centre d'Etude du Polymorphisme Humain from Utah (CEU) and 81 from Japanese in Tokyo (JPT) + Han Chinese in Beijing (CHB). Although the Affymetrix Genome Focus Array contains 8500 annotated genes, we focused on the 4030 that are expressed in lymphoblastoid cell lines defined the same way as in Spielman and Cheung (2007).
The genetic map of 2956 SNPs of yeast segregants was obtained by request from the authors. The BXD RI SNPs were obtained from the Wellcome Trust Center, containing 13,270 SNPs over the genome, of which 7413 SNPs are polymorphic between the two parental strains. Sixty-one and 25 SNPs with minor allele frequency <5% were discarded in the HSC and the whole brain data sets, respectively. A very small portion of heterogeneous alleles in the RI strains were assumed to have additive effects, and the missing SNPs were not resolved. The genetic map for the HapMap samples was obtained using release 22 of the human HapMap (International HapMap Consortium 2003). We examined a total of 3942 genes that are within 500 kb of at least one of the 2 million HapMap SNPs.
Traditional eQTL mapping and genomewide eQTL maps:
Traditional eQTL mapping was performed by taking the average of expression values of each strain and performing a t-test between each marker SNP and each transcript. The eQTL mapping using either of the replicates was performed in the same way except that the samples were divided into two disjoint sets of expression experiments by randomly picking one of the replicates. For the seven strains that have only one measurement in the BXD whole brain data set, they were included in both sets. Missing SNPs or missing expression values were excluded in the test only for the corresponding marker–transcript pair, and the P-value was obtained from the asymptotic t-distribution.
Genomewide eQTL maps were plotted on the basis of the relative genomic positions of each transcripts and marker SNPs. Since the previously suggested transcriptome map (Chesler et al. 2005) may create artificial horizontal bands due to nonuniform genomic densities of the probes, we used an eQTL map based on the relative positions of markers and probes, simply corresponding each marker–transcript association to one pixel. The degree of redness of each pixel is proportional to the log P-values.
Explicit batch-effect correction and surrogate variable analysis:
We explicitly corrected for known batch effects, using the ComBat R package (Johnson et al. 2007). We used the default settings and the batch-corrected expression levels were used to perform traditional eQTL mapping using the t-test.
For surrogate variable analysis, we used the SVA R package downloaded from the author's website, identifying surrogate variables without the genotype data as suggested (Leek and Storey 2007). The P-values are obtained using a linear model after correcting for the surrogate variables.
Genomewide intersample correlation:
An intersample correlation matrix from an expression data set is generated as follows. Let Y be an m × n expression matrix with n arrays for m genes; then the intersample correlation matrix is generated as follows. Let μi, σi be the mean and standard deviation of expression values of the ith genes (Yi1, Yi2, … , Yin). Let Z be an m × n matrix with each element Zij = (Yij − μi)/σi; then the intersample correlation matrix is defined as the covariance matrix of Z. It should be noted that we used the covariance matrix K = Cov(Z) instead of the correlation matrix because the variances are not homogeneous across the strains. Such heterogeneous distribution of variances can be an additional source of systematic confounding but is not emphasized in the main text for the sake of simplicity.
To visually compare the consistency between replicated pairs and unrelated pairs, we used the correlation matrix of Z for each replicated data set because the correlation matrix can be more intuitively understood than the covariance matrix. Each diagonal element represents the pairwise correlation of a replicated pair (Figure 3). In the top diagonal region, the correlations between unrelated pairs in one subset of replicates were computed and visualized. In the bottom diagonal region, the other subset was computed and visualized. The seven strains without replicates in the BXD whole brain data set were omitted in the heat-map visualization.
The eQTL mapping with permuted SNPs was performed by permuting the SNPs across the individuals, thereby preserving the correlation between each pair of SNPs. For generating the simulated expression data preserving the genomewide correlation pattern, we assumed the generalized linear model(1)where g represents SNPs encoded by 0 and 1, and u is a multivariate normal random variable sampled from N(0, K). K = Cov(Z) is the intersample correlation matrix defined in the previous section. α is set so that cis-regulatory effects account for 4% of the phenotypic variation explained by each causal SNP. The number of significant cis eQTL is counted using a conservative false discovery rate (FDR) estimate with π0 = 1, considering only the SNP and simulated gene pair where the cis-regulatory effects are simulated.
For the comparison of various systematic confounding effects, we simulated expression data sets of 500 genes over 50 samples from three different intersample correlation structures described in Figure 4, with 75% of the phenotypic variations explained by the confounding effects, using a multivariate normal distribution (Kang et al. 2008). We generated a random SNP of minor allele frequency of 0.3 for each gene and added a SNP effect explaining 5% of phenotypic variation. We performed eQTL mapping using t-test, SVA, and ICE eQTL mapping for all 500 × 500 SNP–gene pairs and computed the true positive rates at each P-value cutoff.
Variance component test:
We applied the following variance component model to assess the statistical significance of each association in the presence of genomewide correlation,(2)where X, β are the fixed effects of known confounding variables and their coefficients, and u ∼ N() and e ∼ N() are random variables accounting for unknown confounding and random errors. K = Cov(Z) is the intersample correlation matrix and I is an identity matrix. and are coefficients of the two variance components. Under the null hypothesis, is assumed. Under the alternative hypothesis, > 0 is tested. Only the mean is used as the fixed effect in the analysis above. The asymptotic null distribution of the likelihood-ratio test statistic follows a 1:1 mixture of - and -distributions (Stram and Lee 1994). The efficient mixed-model association (EMMA) R package was applied for rapid estimation of variance components and maximum likelihood to perform likelihood tests (Kang et al. 2008).
We used a standard t-test to test for the known batch effect for the BXD HSC data set. When testing the significance of surrogate variables, a standard F test is performed to assess the significance of all surrogate variables using a linear model. In all tests above, the FDR is conservatively estimated with π0 = 1.
ICE eQTL mapping:
ICE QTL mapping models the expression levels as the following linear mixed model,(3)where X, β are the fixed effects of known confounding variables and their coefficients, and u and e are random variables accounting for unknown confounding and random errors as described above. G represents the SNPs or other predictor variables to be tested with the coefficients of α. EMMA is applied to test for the significance of α, using the F test as previously suggested on the basis of REML estimates of variance components (Yu et al. 2006; Zhao et al. 2007; Kang et al. 2008).
We classified the eQTL as cis-acting when the location of the SNP and the location of the probe are within 50 kb for yeast, 5 Mb for BXD mouse RI strains and 500 kb for human. Trans-acting eQTL are stringently classified with the distance >250 kb for yeast and 50 Mb for mouse. The number of expected false positives was computed from 10 null randomized runs of each eQTL mapping setting π0 = 1, as suggested in previous studies (Storey and Tibshirani 2003; Brem and Kruglyak 2005).
Assessing the statistical significance of trans-regulatory bands:
The statistical significance of a trans-regulatory band was quantified as the average of log P-values across all probes. We performed 10,000 random permutations of the SNP set with familywise error correction to evaluate the genomewide corrected P-value of the strength of trans-regulatory bands.
Spurious regulatory hotspots in recombinant inbred mice:
We analyzed the expression data from HSC and whole brain tissue collected from BXD mice where prominent trans-regulatory bands have previously been observed (Bystrykh et al. 2005; Chesler et al. 2005). However, most trans-regulatory bands found in these first-generation mouse expression data sets do not overlap with trans-regulatory bands from independent studies (Manly et al. 2005; Peirce et al. 2006). We selected these data sets to evaluate and correct for the systematic confounding effects for two reasons. First, the presence of biological replicates allows us to quantify the level of systematic confounding effects that are heavily imprinted in the data sets. Second, we demonstrate that even in the presence of many complex systematic confounding effects, our method is able to recover true genetic signals better than competing approaches.
We first examined the reproducibility of trans-regulatory bands between different sets of biological replicates. We defined a metric to quantify the strength of a regulatory band, allowing us to compare regulatory patterns between data sets. We performed standard eQTL mapping using the t-test and defined the average log P-values across all genes as the regulatory enrichment score. The resulting eQTL map shows that this score correlates well with the prominence of a trans-regulatory band (Figure 1, A and C). We created two disjoint subsets of expression experiments by picking one of the replicates per strain and compared the enrichment scores between them. Interestingly, the observed patterns of trans-regulatory bands are inconsistent between the subsets in the BXD brain data set (Figure 2A). The enrichment scores between the replicates are uncorrelated (Spearman's r = −0.0067). On the other hand, the HSC data set shows relatively high correlation of the enrichment scores (Spearman's r = 0.30) due to the batch effect shared between two groups of strains (Figure 2B).
Next, we examined whether regulatory hotspots are likely to be observed at random SNPs by chance. If we observe stronger trans-regulatory bands with randomly permuted SNP sets than the original trans-regulatory bands, then it suggests that the original hotspots may not correspond to real genetic effects, but are rather caused by coregulation of a large number of transcripts (Perez-Enciso 2004; Wang et al. 2007) or other systematic confounding factors. Of 1000 random permutations, we observed hotspots with higher enrichment scores than the strongest hotspot in the original data set 890 and 643 times, corresponding to genomewide adjusted P-values of 0.89 and 0.643, for whole brain and HSC data sets, respectively (Figure 2, C and D). The inconsistencies between biologically replicated samples and the occurrence of strong trans-regulatory bands with permuted SNPs suggest that the observed trans-regulatory bands correspond to spurious regulatory hotspots that do not correspond to real genetic effects.
Intersample correlation as signatures of systematic confounding effects:
The question remains as to how systematic confounding effects cause spurious regulatory hotspots. To gain intuition of this phenomenon, we examined the pairwise correlations between expression arrays or the intersample correlation structure. After normalizing each gene's expression levels across strains, we computed the correlation between each strain pair and each replicate pair. The normalization ensures that the correlation between truly unrelated strain pairs is expected to approach zero, while the replicated pairs are likely to have higher correlation between them. We observed that most of the intersample correlations in recombinant inbred mouse strains do not correspond to real genetic effects. Correlation maps between intra- and interstrain replicates show that the diagonals are not pronounced, providing striking evidence that replicated strain pairs are not correlated (Figure 3, A and B). Furthermore, the results show many unrelated strain pairs with much stronger correlation than expected by chance. In the HSC correlation map, there is also a clear division of two groups where members within a group are highly correlated. Upon further analysis, we discovered that the expression measurements for the two groups of individuals were collected in two batches 3 months apart.
To verify that the intersample correlation structure effectively captures the systematic confounding effects inducing spurious regulatory hotspots, we created a simulated expression data set preserving the intersample correlation structure. In this data set, each SNP corresponds to one simulated transcript with cis-regulatory effects accounting for 4% of the variance explained by the SNP. Standard eQTL mapping with simulated data shows trans-regulatory bands that are almost identical to the original data (supplemental Figure 1). The reason for this is that the SNPs that segregate the strains in a manner consistent with the intersample correlation structure are more likely to be associated with many expression transcripts. This result strongly supports that most of the trans-regulatory bands are explained by the complex intersample correlation structure inherent in expression data.
Furthermore, we evaluated how many transcripts are explained by the intersample correlation structure, using a variance component model (see materials and methods). At a FDR of 0.05, we observed that 94.1 and 47.9% of the transcripts are significantly associated with the correlation structure in the whole brain and HSC data sets, respectively. Since the HSC data set has an obvious batch effect, we also tested how many transcripts are differentially expressed between the two batches, using a t-test. At the FDR threshold of 0.05, only 20.0% of the transcripts are differentially expressed. These results suggest that a significant portion of confounding effects in the HSC data set are not captured by the known batch effect. When applying SVA to test the significance of surrogate variables explaining the expression levels, 88.7 and 40.9% of the transcripts were significantly associated with the five and six identified surrogate variables of the whole brain and HSC data sets, respectively, demonstrating that intersample correlation captures more of the systematic confounding than what is captured by surrogate variables with fewer degrees of freedom.
ICE eQTL mapping:
Motivated by our observation of intersample correlation, we propose a new statistical method for identifying eQTL on the basis of a linear mixed model. Since there are thousands of probes in each sample, we can first estimate the pairwise correlation between samples pretty accurately. Instead of assuming independent random variations of expression levels between samples, our method assumes that a gene in a pair of samples with a globally correlated expression pattern is more likely to have similar expression values than a gene in a pair of globally uncorrelated samples. As a result, the variance component of expression levels at each gene is estimated as a mixture of intersample correlation and independent errors. A marker SNP is considered to be significantly associated with a transcript only if it predicts the expression beyond the level suggested by the intersample correlation (see materials and methods). Since spurious regulatory hotspots appear at marker SNPs consistent with the intersample correlation structure, accounting for this correlation in the null model significantly reduces these hotspots.
To demonstrate the effectiveness of our method, we first applied it to the simulated expression data sets presented in the previous section. Although the simulated data sets contain only cis-acting eQTL, traditional eQTL mapping identified both the cis-acting band and spurious trans-regulatory bands (supplemental Figure 2, a and c). The ICE eQTL map shows no trans-regulatory bands and a much stronger cis-regulatory band (supplemental Figure 2, b and d). At a FDR level of 0.05, ICE eQTL mapping recovered 8.4% of the simulated whole brain cis-acting eQTL, which was a >150-fold increase over the standard t-test and a >3-fold increase over SVA. These results illustrate that our method not only eliminates suspicious trans-regulatory bands but also has higher statistical power to recover real eQTL that might be masked by the correlation structure.
To better understand the relative performance of random-effects models vs. fixed-effects models on this problem, we analyzed our simulated data using the simple t-test, SVA, and ICE eQTL mapping. At a SNP effect explaining 5% of phenotypic variations and a systematic confounding effect of 75%, we see that both the fixed-effect model (SVA) and our random-effects model (ICE eQTL) outperform the simple t-test in discovering true positives when the samples are correlated in two batches (Figure 4B). When the samples are correlated within smaller groups of size two, we see that ICE eQTL outperforms SVA and the simple t-test (Figure 4D). In the mixture of large group and small group effects, which we expect to see in real data sets, we again see that ICE eQTL outperforms SVA and the simple t-test (Figure 4F). Under complex systematic confounding effects, because the fixed-effects model requires a large number of confounding variables to completely correct for the confounding, it loses many degrees of freedom and the estimation of confounding variables becomes less accurate, resulting in the loss of statistical power.
It should be noted that ICE fundamentally differs from traditional mixed-model methods such as MANOVA in that it estimates the variance component directly from the expression data. By leveraging the massive number of probes, ICE can accurately estimate the intersample correlation. Although we have used block-structure variance components as examples of systematic confounding in the above simulations, the estimated variance components typically have a much more complex structure. On the other hand, MANOVA uses predefined variance components that are usually block structured to model random effects specific to groups of samples such as batches, cages, cohorts, or strains, depending on the context of the statistical analysis. Since these variance components are predefined, MANOVA cannot correct for unknown confounding factors.
We next applied ICE eQTL mapping to real whole brain and HSC expression data sets from BXD RI mice. In both cases, ICE eQTL mapping eliminated the trans-regulatory bands while enhancing the cis-regulatory bands (Figure 1, B and D). The number of significant cis-acting eQTL discovered increased dramatically. The enrichment in cis-acting eQTL serves as a good indicator of the statistical power to identify differential expressions due to true genetic effects, even though some of the cis associations might be due to polymorphic SNPs residing in the probe sequences (Walter et al. 2007). For the whole brain data set, ICE eQTL mapping identified nearly three times as many genes with cis-acting eQTL (120) than the t-test (43) and 52% more than SVA (79) at a significance level of 10 false positives per genome (Table 1, Figure 5). ICE eQTL mapping of the HSC data set showed fewer significant eQTL due to the reduced power of having a limited number of strains in the data set. Nevertheless, similar to the whole brain results, our method consistently identified more genes with cis-acting eQTL (23) than the t-test (19) and SVA (14). In this case SVA was outperformed by the t-test because having a large number of surrogate variables significantly reduced the degrees of freedom.
Similar to how the number of cis associations detected is a good measure of increased power to identify true genetic effects, another measure is the concordance of association between biologically replicated samples. We leveraged the replicated samples of the BXD data sets to measure the concordances of cis and trans eQTL between replicates. After ordering the transcripts according to the strength of association for each replicated set, we plotted the concordances of cis and trans associations between the sets, using CAT concordance plots (Irizzary et al. 2005). In HSC and brain, both cis and trans eQTL between replicates are significantly more concordant with ICE eQTL mapping (Figure 6, A and B) than with the t-test and SVA. Finally, we compared the results between whole brain and HSC data sets to see if the trans-acting eQTL are replicable across different tissues. Previous studies have suggested that most trans-regulatory elements are tissue specific because they have not been replicated in different tissues (Chesler et al. 2005). We postulate that most trans-regulatory elements were not replicated across tissues in previous studies because they are spurious associations caused by confounding factors. We ordered the transcripts on the basis of the strength of trans-acting eQTL for each data set and computed the Spearman's rank correlation between the two data sets. The P-values of correlation obtained from the standard t-test show a slightly negative correlation (r = −0.012), with a one-sided P-value of 0.857. However, the ICE eQTL show much higher rank concordance between the tissues with a P-value of 1.1 × 10−7 (r = 0.056). The CAT concordance plot (Irizzary et al. 2005) also shows that ICE eQTL mapping results are significantly more concordant between tissues (Figure 6C). This suggests that a significant fraction of real trans-acting eQTL are not tissue specific, and many of the previously identified trans-acting eQTL did not replicate since they are largely confounded by spurious associations.
Some trans-regulatory bands in high-quality data sets are likely to correspond to real genetic effects:
In the previous sections, we demonstrated the ability of ICE eQTL mapping to obtain reliable and consistent associations in first-generation mouse data sets that have been previously shown to have little reproducibility between independent data sets (Peirce et al. 2006). Second-generation data sets collected using better protocols and newer expression chips such as Affymetrix M430v2 are of higher quality, resulting in much higher correlation between replicated samples than between unrelated pairs. Nevertheless, not only do these studies still suffer from moderate levels of intersample correlation between unrelated pairs (supplemental Figure 4), but also potentially genuine regulatory hotspots globally affect the expression levels and may confound the identification of gene-specific cis or trans associations (Storey et al. 2005; Leek and Storey 2007). In this section, we analyze one of the classic genetical genomics data sets in yeast, where global regulators have been previously reported by several studies (Brem et al. 2002; Foss et al. 2007; Perlstein et al. 2007; Zhu et al. 2008). Under the confounding effects from such genuine regulatory hotspots, we demonstrate that ICE eQTL mapping identifies more cis and trans associations that are consistent between disjoint data sets.
Yeast expression profiles and genotypes were collected from 112 segregants derived by crossing the lab isogeneic BY4716 strain with the wild RM11-1A isolate (Brem et al. 2002; Brem and Kruglyak 2005). There are several differences between the yeast and the BXD RI data sets. First, the technological differences between the cDNA arrays used for yeast and the Affymetrix GeneChips used for mice may lead to very different patterns of systematic bias. Second, having a larger number of strains increases the number of eQTL expected at the same significance level due to increased power. Third, since biological replicates are not available in the yeast data set, it is difficult to determine whether the appearance of regulatory hotspots is caused by a systematic bias or a real genetic effect. Although the dye-swap results provide us with technical replicates, they are not suitable for verifying real hotspots because a dye-swap pair tends to have much stronger correlation than a pair of biological replicates due to smaller environmental or sampling biases between replicates than between unrelated samples (Figure 3C). This correlation may lead to the biased conclusion that most regulatory hotspots are highly reproducible.
After applying traditional eQTL mapping, we observed strong trans-regulatory bands, many of which are consistent with the intersample correlation structure (Figure 1E). However, unlike in the BXD recombinant inbred strains, several of the bands remained significant after performing permutation analysis. Three genomic regions in chromosomes 2 (521–584 kb), 14 (418–502 kb), and 15 (171–193 kb) had genomewide significant P-values of <0.05 with the most significant P = 2 × 10−4. This suggests that these trans-regulatory bands may be the result of real genetic effects rather than confounding effects. Recently, linkage studies of small-molecule drug response traits with the same set of yeast strains have shown that most of the QTL hotspots of these traits fall into the same genomic region where the bands occur (Perlstein et al. 2007). Since the yeast data set does not have biological replicates, we instead randomly divided the 112 segregants into two disjoint sets to perform eQTL mappings separately. If the regulatory hotspots are not real genetic effects, it would be unlikely that the same regulatory hotspots consistently appear between the disjoint sets. However, most of hotspots between the sets coincide, suggesting that they correspond to real genetic effects (supplemental Figure 1d).
We further tried to understand the biological importance of those significant trans-regulatory bands in the yeast data. We listed all 61 genes within 10 kb of the significant regulatory hotspots and queried the set of genes in the Comprehensive Yeast Genome Database (CYGD) (Robinson et al. 2002; Guldener et al. 2005). Interestingly, the three regions with significant hotspots on chromosomes 2, 14, and 15 contain IRA1, RAS2, and IRA2, respectively. It has been known that IRA1 and IRA2 genes negatively regulate the RAS2 protein activation state from multiple studies (Tanaka et al. 1990; Colombo et al. 2004). The probability of 3 genes appearing in a random set of 61 genes is 1.3 × 10−6. There are also other genes that encode small GTP-binding proteins of the RAS superfamily such as ARL1, RHO2, and YPT53 near the significant regulatory hotspots. It is possible that the variations in those regions may change the mRNA levels of a large number of genes by perturbing the RAS GTP-binding signal transduction pathway. However, under this interpretation, it is not certain why a portion of mRNA levels are upregulated while others are downregulated by the same variant in those regulatory hotspots. In addition, a recent study suggests that MKT1 is the causal regulator that may be responsible for the regulatory hotspot in chromosome 14 (Zhu et al. 2008).
Even though many trans-regulatory bands in yeast are likely to be real genetic effects, they globally influence the expression levels and may seriously confound the identification of gene-specific cis or trans associations (Storey et al. 2005; Leek and Storey 2007), resulting in the loss of power to identify real cis and trans associations. Correcting for the intersample correlation induced by genuine regulatory hotspots may eliminate true trans-regulatory bands, but also can reveal many true regulatory signals obscured by the hotspots. We compared the power of different eQTL mapping methods to identify true genetic effects by randomly partitioning the data set as described above. In each partition, the transcripts are ordered by the strength of cis or trans associations, and the concordance between the disjointly partitioned data sets is illustrated using the CAT plot (Figure 6D). The results show that ICE eQTL mapping has higher concordance than the t-test and SVA both for cis-acting and trans-acting eQTL, despite the loss of true regulatory hotspots.
We applied ICE eQTL mapping to the entire yeast data set and observed that the trans-regulatory bands are eliminated while the genes with significant cis associations are nearly doubled (Figures 1F and 5, Table 1). The number of genes with trans-acting eQTL is significantly reduced using ICE eQTL mapping due to eliminated regulatory bands, but many new trans-acting genes that have not been identified by the t-test are discovered. For example, among the 363 significant trans-acting genes identified by ICE eQTL mapping at the significance of 10 false positives per genome, 25% (89) of them are not identified by the t-test at the same threshold. On the contrary, only 7% (35) of the 506 significant cis-associated genes identified by the t-test are not identified by ICE eQTL mapping at the same significance level. ICE eQTL mapping outperforms SVA in discovering cis-acting eQTL across different significance thresholds. For trans-acting eQTL, ICE identifies a larger number of eQTL for the conservative significance threshold of FDR <0.1, while SVA identifies more eQTL for higher thresholds. This may be due to the effects from the moderate regulatory hotspots that have not been captured by surrogate variables. In this data set, SVA appears to overcorrect for the trans-regulatory bands and eliminated even the cis-acting eQTL in the middle of chromosome 12. (supplemental Figure 5c).
Correcting for confounding effects in human lymphoblastoid cell line expression:
Finally, we applied our method to a human genetical genomics study of the HapMap individuals where the goal was to determine whether differentially expressed genes between CEU and JPT + CHB populations are caused by allelic or population differences. It is known that the HapMap expression experiments were conducted on different dates for the CEU and JPT + CHB populations and the problems introduced by this batch effect have recently been addressed (Akey et al. 2007; Spielman and Cheung 2007). While the original article (Spielman and Cheung 2007) claimed that 26% of genes are differentially expressed between European and Asian samples at a genomewide Sidak-corrected P < 0.05, none of them were identified to be significant after controlling for the year in which the sample was processed (Akey et al. 2007). In fact, with respect to this batch effect, 28% of the genes were differentially expressed.
We applied ICE eQTL mapping to identify differentially expressed genes. Our method is able to control for the inflated false positives of differentially expressed genes without the prior knowledge of batch information. The P-value distribution appears to be almost uniform (Figure 7). Spielman et al. (2007) provided POMZP3 as an example of a differentially expressed gene between the two populations to demonstrate that not all of their findings were false positives. The gene was associated with a cis-regulatory SNP, whose allele frequency was significantly different between the two populations. We examined how strongly the POMZP3 gene is differentially expressed, using three different methods. Without correcting for confounding effects, the gene is significant at a P-value of 1.91 × 10−6. However, since numerous other genes are identified to be significant, the strength of the signal is ranked only 943th (23.4%) of 4030 genes. After explicitly correcting for the year of the experiment using ComBat (Johnson et al. 2007), the gene is no longer significant at a P-value of 0.309. However, the signal is ranked relatively high, 434th (10.8%) of 4030 genes. After correcting using SVA, it is ranked only 1992th (49.4%) with a P-value of 0.352. After correcting for the intersample correlation pattern using our method, the gene is ranked 6th (0.15%) at a P-value of 3.1 × 10−4. Using the same approach, we examined the top 5 genes among the 11 genes reported as differentially expressed genes with concordant cis eQTL between populations. Correcting for intersample correlation consistently outperformed the other methods at identifying those genes as differentially expressed with higher ranks (supplemental Table 1).
We next performed ICE eQTL mapping and compared the cis associations with those obtained from t-test-based mapping and batch-corrected mapping. We analyzed a total of 3942 genes within 500 kb of at least one of the 2 million HapMap SNPs. In both CEU and JPT + CHB populations, the number of genes with cis associations increased significantly with our method (Figure 8, A and B). eQTL mapping performed after correcting for the known batch effect using ComBat did not significantly outperform the t-test. Furthermore, the concordance of cis-acting genes between populations significantly increased as well, suggesting that ICE association mapping has higher power to identify real genetic effects (Figure 6E).
Finally, we applied our method to identify differentially expressed genes with evidence of concordant cis-acting SNP between populations. We applied a more stringent threshold than previous studies (Spielman et al. 2007) by requiring the cis-acting SNP to have a genomewide P-value of <2.5 × 10−8 in at least one population and a strong P-value of <10−5 in the other after Bonferroni correction. In addition, we required the minor allele frequency of the SNP to differ by at least 0.1 and the strength of differential expression to be ranked in the top 10% of all genes. Using these stringent criteria, only 2 genes are identified using the t-test, and 3 genes are identified after explicitly correcting for the batch effect. On the other hand, ICE association mapping successfully identified 10 differentially expressed genes, including 4 previously unreported genes (Table 2).
Comparison with previous methods:
A key difference of ICE association mapping from the previous methods using singular value decomposition (Alter et al. 2000; Leek and Storey 2007) is that previous methods project the systematic confounding onto several distinct single-dimensional vectors as fixed effects while ICE association mapping directly incorporates the pairwise correlation as random effects into the statistical model. For previously known confounding variables such as batch effects, both methods can incorporate them as fixed effects in the statistical model. While the singular value decomposition methods infer a number of confounding factors strongly affecting the variations in expression, those with relatively moderate effects may remain uncorrected. Our mixed-model ICE approach does not suffer from this shortcoming since it does not explicitly infer a limited number of confounding variables. Instead, the confounding effects from various unknown sources are assumed to be intrinsically imprinted in the expression profiles, specifically as intersample correlation.
The simulation results under various types of confounding effects we presented above (Figure 4) are largely consistent with those seen with applying random models vs. fixed models on the related problem of correcting for population structure in association studies. Previous studies showed that the random-effects model corrects for the heterogeneous population structure better than a fixed-effects model based on principal component analysis such as EIGENSTRAT for association mapping of model organisms (Yu et al. 2006; Zhao et al. 2007; Kang et al. 2008). Although EIGENSTRAT can robustly correct for population structure in human association mapping where an admixture model assuming a small number of distinct ancestral populations accurately describes the structure of the data (Pritchard et al. 2000; Price et al. 2006), in the model organism association mapping involving multilevel population structure, such methods only partially capture the population structure, resulting in an inflated number of false positives (Aranzana et al. 2005). Similarly, we see that fixed-effects models can effectively correct for intersample correlation where there is a relatively simple confounding structure such as batch effects while the random-effects model performs much better when we have more complex and multileveled confounding structures that we see in simulated and real data sets.
A second intuition of why mixed models outperform singular value decomposition methods such as SVA in this case is that a large number of surrogate variables or eigengenes are required to capture the complex expression heterogeneity, resulting in a significant increase in the degrees of freedom, which affects the statistical power. These effects can be substantial especially for those data sets with a limited number of samples. For example, in the HSC data set containing only 22 strains, SVA was shown to be even less powerful than the t-test in identifying cis-acting eQTL.
Both approaches have the potential risk of overcorrecting true genetic effects, especially for those trans-acting eQTL corresponding to true regulatory hotspots. The concordance plots between replicated and disjointly partitioned data sets consistently show that our ICE association mapping provides higher concordance than the standard t-test at identifying both cis-acting and trans-acting eQTL, while the SVA method consistently shows lower concordance of trans-acting eQTL than the standard t-test (Figure 6). Although some genuine regulatory hotspots may have been eliminated using ICE eQTL mapping particularly in the yeast data set, we were able to identify some of the regulatory hotspots as significant through the analysis of replicates and our SNP permutation approach.
In terms of the computational cost, the running time of ICE association mapping is twice as fast as SVA using EMMA (Kang et al. 2008).
We have proposed a novel statistical method, ICE eQTL mapping, which corrects for the systematic confounding effects inherent in expression data sets. Using the first-generation RI mouse expression data set where the problem of systematic confounding effects has already been documented, we have demonstrated that most trans-regulatory bands in the data set correspond to spurious regulatory hotspots through the analysis of biological replicates and the permutation analysis. Using simulated data that preserve the intersample expression correlation structure, we have shown that the intersample correlation effectively characterizes the systematic biases that are responsible for the spurious associations. Using the same methods, we demonstrated that a number of trans-regulatory bands in yeast correspond to genetic variation in global regulators.
From both differential expression analysis in humans and association analysis in recombinant inbred mice and yeast, we conclude that our method is more robust at correcting for systematic confounding factors than previous methods including an explicit batch correction method, ComBat (Johnson et al. 2007), and an automated method that corrects for unknown confounding factors, surrogate variable analysis (Leek and Storey 2007). Not only did ICE eQTL mapping identify more cis-acting eQTL than both methods, those identified cis-acting and trans-acting eQTL also showed higher concordance between replicated data sets (BXD RI strains), different tissues (BXD RI strains), and disjoint subsets (yeast). These results suggest that our method has higher power to identify associations corresponding to real genetic effects.
Our results also highlight the importance of obtaining independent replicates of expression measurements and the utility of these replicates for analyzing and validating eQTL. We have shown that different strategies for obtaining replicates have profound effects on the correlation structure between replicates. Technical replicates obtained by either performing a dye swap (Figure 3C) or running multiple expression arrays in the same sample (Figure 3D) exhibit much higher correlation between replicate pairs than full biological replicates (Figure 3, A and B). We suspect that confounding factors in the sample preparation are largely responsible for the higher pairwise correlation observed among technical replicates, reducing their utility in analysis and validation. Preparing multiple samples from the same individual can help reduce the effect of these confounding factors. In many eQTL studies, it is possible to independently measure expression from genetically identical individuals, which can further reduce the effects of these confounding factors.
The authors thank A. Jake Lusis at the University of California, Los Angeles, and Andreas Beyer at the BIOTEC, TU-Dresden, for useful discussions. H.M.K. is supported by a Samsung Scholarship. H.M.K., C.Y., and E.E. are supported by National Science Foundation grant nos. 0513612 and 0731455 and National Institutes of Health grant no. 1K25HL080079. Part of this investigation was supported using the computing facility made possible by the Research Facilities Improvement Program grant no. C06 RR017588 awarded to the Whitaker Biomedical Engineering Institute and the Biomedical Technology Resource Centers Program grant no. P41 RR08605 awarded to the National Biomedical Computation Resource, University of California, San Diego, from the National Center for Research Resources, National Institutes of Health. Additional computational resources were provided by the California Institute of Telecommunications and Information Technology (Calit2).
↵1 These authors contributed equally to this work.
Communicating editor: G. Stormo
- Received July 21, 2008.
- Accepted September 9, 2008.
- Copyright © 2008 by the Genetics Society of America