Expression quantitative trait loci (eQTL) studies have established convincing relationships between genetic variants and gene expression. Most of these studies focused on the mean of gene expression level, but not the variance of gene expression level (i.e., gene expression variability). In the present study, we systematically explore genome-wide association between genetic variants and gene expression variability in humans. We adapt the double generalized linear model (dglm) to simultaneously fit the means and the variances of gene expression among the three possible genotypes of a biallelic SNP. The genomic loci showing significant association between the variances of gene expression and the genotypes are termed expression variability QTL (evQTL). Using a data set of gene expression in lymphoblastoid cell lines (LCLs) derived from 210 HapMap individuals, we identify cis-acting evQTL involving 218 distinct genes, among which 8 genes, ADCY1, CTNNA2, DAAM2, FERMT2, IL6, PLOD2, SNX7, and TNFRSF11B, are cross-validated using an extra expression data set of the same LCLs. We also identify ∼300 trans-acting evQTL between >13,000 common SNPs and 500 randomly selected representative genes. We employ two distinct scenarios, emphasizing single-SNP and multiple-SNP effects on expression variability, to explain the formation of evQTL. We argue that detecting evQTL may represent a novel method for effectively screening for genetic interactions, especially when the multiple-SNP influence on expression variability is implied. The implication of our results for revealing genetic mechanisms of gene expression variability is discussed.
QUANTITATIVE genetic analysis has long focused on detecting genetic variants that affect organismal phenotypes. This is often done by contrasting mean differences in phenotypes among genotypes. Despite increasing evidence across several species for genetic control of phenotypic variance (Ansel et al. 2008; Hill and Mulder 2010; Jimenez-Gomez et al. 2011), variance differences in phenotypes have been largely ignored. Recently, however, the fact that variance of phenotypes is genotype dependent has inspired the detection of genetic variants associated with phenotypic variability (Pare et al. 2010; Sudmant et al. 2010; Yang et al. 2012).
When gene expression level is considered as a heritable, quantitative trait, statistical associations between mean gene expression and genotype can be established to identify those genomic loci associated with or linked to gene expression level (i.e., expression QTL, eQTL) (Montgomery and Dermitzakis 2011). The difference in mean gene expression and its genetic control have been extensively examined in humans (Stranger et al. 2005, 2007b; Choy et al. 2008; Montgomery et al. 2010; Pickrell et al. 2010). The difference in variance of gene expression (i.e., gene expression variability) is genetically controlled and likely to be selectable (Raser and O’Shea 2005; Blake et al. 2006; Maheshri and O’Shea 2007; Cheung and Spielman 2009; Zhang et al. 2009). A small number of initial efforts have been made to quantify the difference in gene expression variability (that is, variance of gene expression) (Ho et al. 2008; Li et al. 2010a; Mar et al. 2011; Xu et al. 2011b). Yet, little attention has been paid to the genetic control of gene expression variability in humans.
In the present study, we seek to discover genome-wide genetic variants (i.e., SNPs) associated with differences in the variance of gene expression among individuals. We adapt the double generalized linear model (dglm) (Verbyla and Smyth 1998) to test for the inequality of expression variances and measure the contribution of genetic variants to the expression heteroscedasticity. The model has been recently used to detect genetic loci controlling phenotypic variability in chicken F2 crosses (Ronnegard and Valdar 2011). A likelihood ratio test (LRT) of the dglm method allows us to compare the fit of a “full model” and a “mean model.” The full model takes into account the contribution of genotype to both the mean and the variance of gene expression simultaneously, while the mean model takes into account only the contribution of genotype to the mean, ignoring the contribution to the variance. A significant result of the LRT indicates the nonrandom association between the genotypes and the variances of gene expression. Here we designate the genomic loci statistically associated with gene expression variability expression variability QTL (evQTL). The results of our genome-wide scan for evQTL provide a glimpse into the abundance and distribution of expression variability controlling variants in the human genome. Given that the variance of a quantitative trait is likely to differ under the influence of genetic interactions (Pare et al. 2010; Ronnegard and Valdar 2011), our evQTL detecting method may be used to help detect the interactions between genetic variants controlling gene expression.
Gene expression data from the studies of Stranger et al. (2007a) and Choy et al. (2008) were downloaded from the Gene Expression Omnibus (GEO) website with accession nos. GSE6536 and GSE11582, respectively. The two data sets were designated GSE6536 and GSE11582 thereafter. In the two studies, the expression levels were measured in lymphoblastoid cell lines (LCLs) derived from HapMap individuals using two different platforms: Illumina human whole-genome expression array (WG-6 version 1) for GSE6536 (Stranger et al. 2007a,b) and Affymetrix human genome U133A array for GSE11582 (Choy et al. 2008). The downloaded data had been normalized by using quantile normalization across replicates of a single individual and then median normalized across all HapMap individuals. The downloaded data sets included 16,992 genes (19,440 probes) and 13,012 genes (20,995 probes) in GSE6536 and GSE11582 data sets, respectively, and 11,633 shared genes. The Sequence Alignment/Map (SAM) files of the RNA sequencing (RNA-seq) data for 60 individuals of European origin in Utah (CEU) HapMap individuals from the study of Montgomery et al. (2010) were downloaded at the website http://jungle.unige.ch/rnaseq_CEU60. We used SAMMate (Xu et al. 2011a) to estimate the expression level using the number of reads per kilobase of transcript per million mapped reads (RPKM) (Mortazavi et al. 2008).
The coefficient of variation (CV) was used as a normalized measure of the dispersion of expression distribution as in previous studies (Maheshri and O’Shea 2007; Ansel et al. 2008; Ronnegard and Valdar 2011). The CV of each gene was computed as , where σ and μ are the standard deviation and the mean of gene expression levels, respectively.
Human polymorphism data were obtained from the HapMap project (International Hapmap Consortium 2007) and the pilot phase of The 1000 Genomes (1000G) Project (The 1000 Genomes Project Consortium 2010). The HapMap data release 28 includes genotypes of 4 million SNPs merged from phases 1, 2, and 3 of the project (International Hapmap Consortium 2007). HapMap samples are from four different populations: Yorubans from Ibadan, Nigeria (YRI), individuals of European origin in Utah (CEU), Han Chinese from Beijing (CHB), and Japanese from Tokyo (JPT). The raw data from the pilot study of the 1000G project contains ∼15 million SNPs from a total of 179 samples also from the four HapMap populations. From the HapMap data, we extracted genotypes of individuals whose gene expression data are included in the GSE6536 and GSE11582 data sets. The HapMap CEU and YRI populations consist of 60 trios. We removed the 60 child samples from trios and used the remaining 210 unrelated samples in our analysis.
From the 1000G data, we extracted genotypes of 153 and 149 individuals whose gene expression data were available in GSE6536 and GSE11582, respectively. The 1000G project data give a snapshot of human polymorphism at an unprecedented scale and resolution. The released 1000G low-coverage data captures nearly all (∼95%) of the common polymorphism in a relatively ascertainment-free manner (The 1000 Genomes Project Consortium 2010). One disadvantage of using the 1000G genotype data in our case was that the sample size became smaller. In addition, genotypes of 59 individuals were extracted from the HapMap release and paired with expression data in the RNA-seq data set.
The minor allele frequency (MAF) and Fst statistic for SNPs, and LD estimates between SNP pairs (R2 and D′) were computed using Matlab functions in PGEToolbox (Cai 2008). To eliminate low-frequency polymorphisms, we discarded SNPs with MAFs <10% in the YRI, CEU, and CHB/JPT populations. To control for the effect of population stratification, we excluded SNPs with Fst ≥0.2. We also excluded SNPs that were found to be deviated from the Hardy-Weinberg (HW) equilibrium by using hweStrata, an exact stratified test across populations (Schaid et al. 2006).
For each transcript-SNP pair, the association between gene expression level and the genotype is assumed to be linear. The conventional linear regression model is(1)where yi indicates a gene expression trait of individual i, gi is the genotype at the given SNP (encoded as 0, 1, or 2 for homozygous rare, heterozygous and homozygous common alleles, respectively), and εi is the residual with variance σ2. The significance of association can be assessed using the nominal, parametric P-value of the test of no association, i.e., α = 0, or using a permutation test (Stranger et al. 2005).
To account for the effect of population difference on gene expression, we introduced a covariate x into the model. We defined xi as the ith row in a design vector of the covariate, which is encoded as 0 for African (YRI) individuals and 1 for Eurasian (non-YRI) individuals initially and then was also assessed with encoding YRI, CEU, and CHB/JPT with 0, 1, and 2, respectively. With these we considered the double generalized linear model as follows:(2)where θ is the corresponding vector of coefficients of genotype gi on the residual variance. With this model, the mean and variance of expression yi are controlled by both effects of population substructure, xi, and SNP genotype, gi. Equation 2 is otherwise equivalent to Equation 1 but with the additional covariate xi and with [where ], modeling dispersion in generalized linear models. Equation 2 is termed the dglm (Verbyla and Smyth 1998), which is, more specifically, a classical heteroscedastic regression model (Bickel 1978). We coded the fitting procedure for Equation 2 using the dglm package in R. For regression computation, we tested one SNP-gene pair and obtained two P-values: Pdispersion and Pmean, for the effects of genotypes on the variance and the mean of expression levels, respectively (Ronnegard and Valdar 2011). SNP-gene pairs that did not make the algorithm converge during computation were discarded.
To account for multiple testing performed between massive numbers of probe-SNP pairs (747,524 and 763,745 for GSE6536 and GSE11582 data sets, respectively), we used the threshold of Pdispersion < 1 × 10−8, which is closely equivalent to Bonferroni adjusted P < 0.01, to assess the genome-wide significance. To control for the effect of outlier expression data points, permutation tests were conducted for all significant pairs. Specifically, for each transcript probe-SNP pair, we performed 10,000 permutations of expression phenotype relative to the genotypes of the SNP. An association was considered significant if the P-value from the analysis of the observed Pdispersion was lower than the threshold of the 0.001 tail of the distribution of the Pdispersion-values from the 10,000 permutations, that is, Ppermutation < 0.001. In addition, we required the P-value of the F–K test (Fligner and Killeen 1976) for homogeneity of variances in gene expression among different genotype groups to be <0.01, following Fraser and Schadt (2010). The F–K test is a nonparametric alternative to the F-test investigating the homogeneity of variances. The F–K test does not assume that the data are normally distributed and it is also insensitive to violations of this assumption (Hallin and Paindaveine 2008).
For the trans-acting association analysis, we selected 500 representative genes and a subset of genome-wide SNPs. To select representative genes, we applied the k-means clustering algorithm to all genes included in the GSE6536 data set and obtained 500 distinct clusters of genes. We used the k-means clustering algorithm implemented in Matlab function kmeans and we put the initial settings of the random number generator to their default values with seed 0. From each cluster, we randomly selected one gene to form the 500-gene set. To select the subset of genome-wide SNPs, we first detected the genome-wide linkage disequilibrium (LD) blocks with YRI genotypes using Haploview v4.2 with default parameters (Barrett et al. 2005). LD blocks were defined using the method described by Gabriel et al. (2002). From each genomic region of LD block, as well as each genomic region between two neighboring LD blocks (i.e., interblock region), one common SNP (MAF > 10% in all four HapMap populations) was randomly selected. Finally a total of 13,718 HapMap SNPs on autosomes were selected for the trans-acting association analysis. For each pair of gene and SNP, the same criteria for detecting cis-evQTL, with one modification, that is, Pdispersion < 1 × 10−9 instead of < 1 × 10−8, were used to assess the significance of association.
Creation of eQTL by subsampling evQTL
We simulated a population of 10,000 individuals under HW equilibrium with minor allele frequency of 0.4. Three normal distributions of expression levels with equal means and monotonically increased expression variances, one for each genotype, was generated as follows: aa ∼ N(50,2), Aa ∼ N(50,4), and AA ∼ N(50,6). Individuals were randomly assigned a genotype then a gene expression value randomly from the normal distribution that corresponded to the assigned genotype. From the simulated population a random sampling of 210 individuals was extracted. For each random sample eQTL presence was detected using the linear regression Equation 1. This process was repeated for iterations of 10,000 random samples to determine the average number of eQTL detected per 10,000 samples.
Analysis of functional annotation
We tested for significant enrichment of genes with specific ontology terms, keywords in annotation, and pathways relative to the rest of the genes in the human genome, using DAVID (Huang da et al. 2009). The enrichments of terms, keywords, and pathways among the tested genes were evaluated through the use of Fisher’s exact test. The nominal P-values were corrected for multiple testing using the Benjamini–Hochberg method (Benjamini and Hochberg 1995; Huang da et al. 2009). We determined statistical significance of enrichment using a cutoff of false discovery rate < 5%.
Association with copy number variations and positive selection
To evaluate whether evQTL are more likely to be associated with copy number variation (CNV), we obtained the genome-wide CNV information from the study of Johansson and Feuk (2011). They used a total of 30,047 structural variations from the Database of Genomic Variants (Iafrate et al. 2004), containing experimentally detected variations larger than 1 kb in healthy controls, to define the CNV-containing regions, as well as the CNV-free regions, in the human genome. They estimated that 36% of the human genome belongs to CNV-containing regions and the remaining 64% belongs to the CNV-free regions (Johansson and Feuk 2011). We took the chromosomal coordinates of these regions and checked how many evQTL genes we identified were located in each region.
To address the question of whether positive selection is likely to be associated with evQTL genes, we checked whether identified evQTL genes appear to be in genomic regions under positive selection. We compiled the results from different genome-scan studies that detected positive selection. These studies employed a variety of statistics to identify a total of 4282 genomic regions (784 Mb) showing signatures of positive selection (Supporting Information, Table S5). We took the chromosomal coordinates of these regions and determined evQTL genes that were located in each region.
Linkage disequilibrium between cis-evQTL SNPs
To measure the linkage between cis-evQTL SNPs, we computed two estimates of linkage disequilibrium, R2 and D′, between all possible pairs of cis-evQTL SNPs in each corresponding gene. To obtain the null distributions of R2 and D′ values, we randomly sampled the same number of pairs of SNPs in the cis-regions of each gene and computed the two LD estimates between all pairs of SNPs. We plotted the frequency distributions of R2 and D′ values obtained between evQTL SNP pairs against those between pairs of randomly selected SNPs. If the distributions of R2 and D′ for cis-evQTL SNP pairs shifted toward the higher ends, compared to those for randomly selected SNP pairs, it will indicate that evQTL SNPs are nonrandomly associated with other evQTL SNPs in the same locus and the links are made through haplotypes.
We obtained genome-wide expression data assessed by microarray hybridization in human LCLs from the studies of Stranger et al. (2007a) and Choy et al. (2008). We denote the two data sets by their GEO accessions: GSE6536 and GSE11582. To exclude transcripts that are not expressed or extremely lowly expressed in the LCLs, we removed probes with the intensity signal smaller than the 10th percentile of all intensity values. The final data sets included 15,574 genes (17,496 probes) and 12,076 genes (18,896 probes) in GSE6536 and GSE11582 data sets, respectively, and 10,188 shared genes. The two expression data sets include samples from the same 210 unrelated individuals included in the HapMap consortium studies (International Hapmap Consortium 2005, 2007). We used the two data sets separately to repeat our evQTL analysis and synthesized the results.
Biological variability of gene expression
Using the CV as an estimate of variability, we found that different genes showed substantial differences in their expression variability (Figure S1). For example, FXN and XCL1 have a similar mean level of expression in LCLs; however, the CV of FXN (= 2.5) is only one-fifth that of XCL1 (= 12.8) calculated using the GSE6536 data set across 210 unrelated individuals (Figure 1). The same pattern was observed when using the GSE11582 data set to calculate the CVs for the two genes (= 5.7 and 23.7, respectively) (Figure S2).
To illustrate that the gene expression variability is not explained by measurement error, we compared the microarray data (GSE6536) with the RNA-seq data (Montgomery et al. 2010). For each expressed gene in the two data sets, we calculated the CV of expression levels across individuals. The use of CV allows for dimensionless comparison of different types of normalized expression data sets such as microarray data and RNA-seq data. We found that variability in expression for each gene was similar in microarray and RNA-seq data sets [Spearman’s ρ = 0.703, P < 1 × 10−15 (n = 12,964, assuming independence between genes), Figure S3]. This result suggests that technical error in expression measurement cannot fully account for the expression variability observed in the samples. Therefore, variability in gene expression is a fundamental biological property of gene expression itself, rather than the technology used to measure expression (see also Hansen et al. 2011).
To examine the population specificity of gene expression variability, we performed pairwise correlation analyses between CVs estimated separately for different HapMap populations: YRI, CEU, and CHB/JPT. All three correlations were significantly positive and strong [Spearman’s ρ(cvYRI, cvCEU) = 0.92, ρ(cvCEU, cvCHB/JPT) = 0.91, ρ(cvYRI, cvCHB/JPT) = 0.92, all <1 × 10−15]. We also used the coefficient of dispersion (CD), the ratio between the variance σ2 and the mean μ, as an alternative statistic of the variability, and conducted the same correlation analysis using the two microarray data sets. Again, all three correlations were significantly positive and strong [Spearman’s ρ(cdYRI, cdCEU) = 0.93, ρ(cdCEU, cdCHB/JPT) = 0.92, ρ(cdYRI, cdCHB/JPT) = 0.94, all P < 1 × 10−15]. These results suggest that the population specificity of gene expression variability is weak. Thus, it might be “safe” to transfer the estimate of expression variability from one population to another. This finding shows good agreement with the observations reported in the studies of Li et al. (2010a) and Storey et al. (2007), that is, most expression variation is due to variation among individuals rather than among populations.
Identification of cis-acting evQTL
Next we identified genome-wide genetic variants associated with expression variability by considering nonepistatic, additive genetic effects of genotypes on expression levels in an adapted regression model (Methods, Figure 2). As a second-order statistic, the variance of expression levels (or the CV) requires large sample sizes to be accurately estimated (Lee and Nelder 2006; Visscher and Posthuma 2010). To increase the power of association analysis, we pooled data (including normalized expression values and genotypes) from individuals of YRI, CEU, and CHB/JPT populations. This resulted in 210 data points per gene for both GSE6536 and GSE11532 data sets. The pooling operation is justified because gene expression variability is not population specific. Nevertheless, to control for the nonspecific effects of population stratification on the expression variability, we only included SNPs with MAF >10% in all and each of the three populations in the analysis. We also required that all these included SNPs showed no signs of population differentiation and no signs of deviation from HW equilibrium (Methods). In addition, in the regression model (Methods), we included the population identities as a covariate. We encoded YRI and non-YRI individuals using 0 and 1, respectively, and introduced this variable into the final model as the covariate. We also tried another schema, in which YRI, CEU, and CHB/JPT populations were encoded with 0, 1, and 2, respectively. The two encoding schema produced highly similar results indicated by the significant positive correlation between P-values obtained by using two and three categories of populations as covariates (Spearman’s ρ = 0.98, P < 1 × 10−15, n = 15,399) (Figure S4). Only the results derived from the first encoding schema are reported in this article.
We detected cis-acting evQTL SNPs by considering those SNPs located within a 1-Mb region from the transcription starting site (TSS) of a gene. The method we used was based on the dglm, which is the regression model previously used in detecting major loci controlling the variance of nonexpression phenotypes (Ronnegard and Valdar 2011). We applied the dglm test to identify potentially significant associations between genotypes [encoded with 0, 1, and 2 based on the inclusion of the number of major allele(s)] and variance of gene expression in SNP-gene pairs. A SNP-gene pair with an association reaching the significance threshold of Pdispersion < 1 × 10−8 in the dglm test was considered a potential evQTL. To rigorously assess the significance of each potential evQTL, we employed the permutation test, in which the gene expression data were randomly shuffled and the evQTL relationship between the shuffled expression replicate and genotype was reassessed. For each evQTL, the shuffling and reassessing processes were repeated 10,000 times to obtain the null distribution of Pdispersion. We required an evQTL association to have a P-value of permutation test, Ppermutation < 0.001, that is, the observed Pdispersion was not ≥10 out of 10,000 Pdispersion values obtained from the dglm tests with the shuffled expression data sets (Methods). The permutation test was less sensitive to outliers, which could have a large impact on the Pdispersion in the regression (see Figure S5 for an example). Finally, we also required, PF–K, the P-value of the Fligner–Killeen (F–K) test (Fligner and Killeen 1976), a nonparametric alternative to the F-test, for the homogeneity of variances of expression levels among genotype groups to be smaller than 0.01. The purpose of using the three tests jointly is to reduce false positives and prioritize evQTL of large effect.
Using the two expression data sets, GSE6536 and GSE11582, we detected 166 and 60 genes (corresponding to 179 and 108 probes, respectively) that were significantly associated with at least one SNP that is located in their cis-regions. The full list of these genes and probes are given in Table S1 and Table S2. A total of 218 distinct genes were found when combining the results from both data sets, including 8 common genes (Table 1). That is, evQTL SNPs were identified in the genomic regions of these 8 genes using either of the two expression data sets. This number was found to be statistically significant [P (x ≥ 8) = 8.6 × 10−6, hypergeometric test, Figure S6]. For each of the 8 common genes, the effect of selected evQTL SNP on the gene expression variability as a function of genotypes (0, 1, and 2) is shown in Figure S7.
As mentioned, 166 genes were identified using the GSE6536 data set as they are involved in at least one cis-evQTL (Table S1). Among these genes, 81 have more than one evQTL SNP. We iterated all 379 gene-SNP pairs and found that 6 gene-SNP pairs show significant evQTL relationship using either GSE6536 or GSE11582 data sets. In other words, we replicated the discovery of evQTL in these 6 gene-SNP pairs between the exact same SNPs and corresponding genes using both expression data sets. In all these 6 cases, the direction of the effect of genotypes on the variance of expression is consistent between GSE6536 or GSE11582 data sets (Figure S8).
For the remaining 373 gene-SNP pairs, 32 gene-SNP pairs in GSE6536 do not appear in GSE11582. This is because different array specifications used in two data sets result in the coverage of slightly different gene sets. The rest, 341 gene-SNP pairs, appeared in both data sets, but they were found significant using GSE6536 but not GSE11582. We suspect that this inconsistency was produced for two reasons. First, the GSE11582 data set contains a mixup of a handful of RNA samples, which have been accidentally swapped. This mixup was revealed by a recent application of novel microarray experiment QC methods (see http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11582). Second, the stringent Pdispersion cutoff for statistical significance blocked many evQTL signals from the GSE11582 data set. We reassessed the distribution of Pdispersion of associations of these 341 gene-SNP pairs using the GSE11582 data. Indeed, the distribution was shifted toward the lower end of the Pdispersion spectrum, albeit these values were not small enough (i.e., < 1 × 10−8) to become genome-wide significant (Figure S9). We also conducted a side-by-side comparison between GSE6536 and GSE11582 results for evQTL involving 228 SNP-gene pairs (this number is smaller than the original 341 because the expression of some genes in the 341 SNP-gene pairs was measured by multiple probes but we only randomly picked one probe in doing this inspection). We inspected the direction of association between genotypes and the variances of gene expression levels and found most evQTL obtained from the two data sets are in the same direction.
Finally, we used the RNA-seq data to validate the effect of evQTL SNPs. One of the advantages of using the RNA-seq data are that the saturation of signal intensity above a certain abundance level is not a problem (Majewski and Pastinen 2011). We focused on the direction of the allele-specific effect of evQTL SNPs on the increase or decrease of the variability. Again, it is expected that if one allele of an evQTL SNP identified using microarray data is associated with higher expression variability of the corresponding gene, then the same allele-specific effects should be recovered using the RNA-seq data. Here is a positive example: using both GSE6536 and GSE11582 data sets, we found that the expression variability of CTNNA2 decreases with the number of the major allele (C) and increases with the number of minor allele (T) in genotypes of SNP rs11126722, that is, VarTT > VarTC > VarCC (Figure S7). In this case, we expected to find the same effects—TT is associated with higher and CC is associated with lower expression variability of CTNNA2—using RNA-seq data. Indeed, using the RNA-seq data for 59 genotyped CEU individuals, we found 85% of tested evQTL SNPs (35 of the 41) act in the same directions as they were identified using the GSE6536 microarray data (Table S3). Note that, due to the small sample size of 59 individuals to measure variability in the RNA-seq data set, only 41 of the evQTL SNPs were selected for this validation. This was because we required that variances of RNA-seq expression between homozygotes of SNPs be significantly different (i.e., P < 0.01, F–K test)—as only then could the direction of SNPs’ effect on expression variability be assessed.
Taken together, these results provide significant evidence that evQTL SNPs are associated with regulating the variability in expression of corresponding genes in a cis-acting allele-specific manner.
Spatial distribution of cis-acting evQTL SNPs
To understand the spatial distribution of cis-evQTL SNPs around the corresponding genes, we used genotype data from the pilot study of The 1000 Genomes (1000G) Project to replace HapMap SNPs. The 1000G SNPs cover the genome more thoroughly than HapMap SNPs. So the advantage of using the 1000G SNPs is that we can literally “impute” genotypes for sites that were not genotyped in the HapMap project. The disadvantage is that the number of genotyped 1000G individuals whose gene expression data are also available is much smaller than that of HapMap individuals (e.g., 153 vs. 210 for GSE6536). Nevertheless, we applied the same criteria for ascertaining evQTL to the 1000G genotype data. When using the GSE6536 data, we found significant cis-evQTL SNPs in all eight previously identified genes (Figure S10). When using the GSE11582 data, we found significant cis-evQTL SNPs in six of eight genes (except IL6 and PLOD2, Figure S11).
Due to the large number of 1000G SNPs, additional novel cis-evQTL SNPs were identified using the 1000G genotype data. At the IL6 locus, besides the two SNPs identified using HapMap data, 33 new evQTL SNPs (including 3 intronic SNPs, one in each of introns 2, 3, and 4) were added (Figure 3). In addition, 6, 12, and 37 new evQTL SNPs in introns of ADCY1, PLOD2, and SNX7, respectively, were identified. Within SNX7, a synonymous SNP, rs2019213, and a nonsynonymous SNP, rs35296149, were also identified. Finally we computed two estimates of linkage disequilibrium, R2 and D′, between all possible pairs of cis-evQTL SNPs in each corresponding gene and found pairs of these SNPs are more tightly linked than those of randomly selected SNPs (Figure S12, Methods).
Relationship between evQTL and eQTL
To understand the relationship between evQTL and conventional eQTL, we examined how some evQTL identified (using the GSE6536 data set) are also eQTL. Using the simple linear regression model (Equation 1 of Methods), we computed P-values for eQTL between the 379 evQTL SNP-gene pairs. We found that 40 of 379 evQTL SNPs have an eQTL with P < 1 × 10−8—that is, 11% of evQTL SNPs are also eQTL SNPs for the same corresponding genes. These include the evQTL between rs2864793 and PLOD2 (Table 1), which has been previously identified as an eQTL (Veyrieras et al. 2008). We categorized these joint evQTL–eQTL into four groups based on the direction of the genotypes’ effects on the mean and the variance of gene expression levels (Figure 4). The distribution of these 40 SNPs in the four groups is: 12, 5, 3, and 20. Among them, 32 (80%, including groups 1 and 4) showed a pattern of positive correlation; that is, the variance of gene expression increases with increasing mean gene expression. In contrast, 8 (20%, including groups 2 and 3) showed a negative correlation. The existence of the anticorrelation rejects the hypothesis that evQTL are merely byproducts of large effect eQTL.
Given the heterogeneous patterns of the evQTL–eQTL relationship, it is tempting to speculate that some eQTL are, in fact, evQTL. When the sample size is relatively small, evQTL can be falsely identified as eQTL. To show this, we simulated a subsampling process on an evQTL and demonstrated that it is likely for the subsampling process to create a sample that is easily falsely discovered as an eQTL. We started with a population of 10,000 individuals and created an ideal evQTL with three genotypes having the same expression mean and monotonically increased expression variances: Varaa = 2, VarAa = 4, and VarAA = 6 (Figure S13 A, Methods). From this population, we randomly sampled 210 individuals and computed the eQTL significance between their genotypes and gene expression using the simple linear regression (Equation 1, Methods). Four independent realizations of subsampling, in which eQTL were detected (P < 0.01), are shown in Figure S13 B–E. The random subsampling was repeated and each time the presence of an eQTL was tested. In every 10,000 subsamplings, we detected 76 eQTL on average when using the 0.01 cutoff, and detected no eQTL when using the 1 × 10−8 cutoff. It seems unlikely that an evQTL can be detected as an eQTL using statistical tests adapted by most eQTL studies. However, this is not conclusive as our simulation was conducted using a “perfectly symmetric” evQTL, in which gene expression exactly fits normal distributions with zero skewness. In real experimental data, if an evQTL is “asymmetric,” such as the one on the right of Figure 2, then the rate of misidentifications might be higher.
Functions and genomic features of cis-acting evQTL genes
The eight evQTL genes, which were identified using both microarray data sets, encode proteins with multifaceted functions (Table 1). For example, IL6 encodes interleukin 6, which plays a role in many biological processes, including signaling for the final differentiation of B cells to produce antibodies (Hirano et al. 1986) and stimulating production of hepatocytes (Andus et al. 1987). PLOD2 encodes a telopeptide lysyl hydroxylase, which is essential for creating stability of intermolecular collagen cross-linkages by creating hydroxylysines, which serve as attachment sites for carbohydrate units (van der Slot et al. 2003). Putting all 218 evQTL genes identified with both expression data sets together, we found these genes tend to encode extracellular and transmembrane proteins and are more likely to be involved in cell adhesion and immunity (Table S4).
Intriguingly, evQTL in five human leukocyte antigen (HLA) genes, HLA-A, HLA-DRB5, HLA-DRB1, HLA-DQA2, and HLA-DPA1, were identified (Table S1 and Table S2). Three of MHC class II genes, HLA-DRB5, HLA-DRB1, and HLA-DQA2, are located in a 200-kb region on chromosome 6p21, where we identified two evQTL SNPs, rs4410767 and rs9271720, associated with two of these three genes, and one evQTL SNP, rs3129766, associated with all three genes (Figure 5). A remarkable characteristic of evQTL in these HLA genes is that expression levels tend to show a bimodal pattern, instead of a continuum pattern, among individuals with the same genotypes. For instance, rs4410767 causes bimodal expression of three genotypes of HLA-DQA2, as well as bimodal expression of the major allele homozygote of HLA-DRB5. Also when major allele homozygotes of these SNPs (e.g., TT of rs4410767) are associated with a higher variability of a gene (e.g., HLA-DRB5), the minor allele homozygotes of these SNPs (e.g., CC of rs4410767) are associated with a lower variability of the other gene (e.g., HLA-DQA2), or vice versa. These results broadly correspond with expectations from previous findings: (1) the human MHC exhibits extreme genetic diversity and is maintained by balancing selection (Parham and Ohta 1996; Bergstrom et al. 1998), (2) haplotype-specific differences in gene expression are common across the MHC (Vandiedonck et al. 2011), and (3) this region contains genes (e.g., HLA-DPA1) that show large copy number differences across populations (Sudmant et al. 2010). The expression of a gene, e.g., HLA-DQA2, that falls onto two dichotomous parts could be due to the effect of an additional unidentified SNP, which increases (or decreases) the mean expression level of the gene by the presence (or absence) of an allele.
We found that 84% of the 218 evQTL genes are located in CNV-containing regions (Methods). This ratio is highly significant as the CNV-containing regions only occupy 36% of the whole genome (P = 5 × 10−17 Fisher’s exact test), suggesting that evQTL are more likely to be located within genomic regions that contain CNVs. In addition, 22% (47 of 218) of evQTL genes are located in the genomic regions previously reported to be under positive selection (Methods, and see Table S5 for the list of references). This ratio is not significantly different from random expectation given that those previously reported regions under positive selection collectively occupy ∼26% of the whole genome (P > 0.1, χ2 test).
Genome-wide distribution of trans-acting evQTL
To identify trans-acting evQTL, we selected 500 representative genes and >13,000 SNPs across all autosomes to conduct an exhaustive pair-by-pair testing of all possible interactions between these genes and SNPs (Methods). We identified a total of 356 gene-SNP pairs that had a trans-acting evQTL relationship (Figure 6). These trans-evQTL involved 35 distinct genes. Most of these genes (80%) had >1 associated SNP. These include ANO3, NRCAM, PDGFD, and ASPA with 114, 43, 39, and 27 trans-evQTL SNPs, respectively (Table S6). There were 30 SNPs involved in trans-evQTL relationships with two different genes. These results suggest the existence of “evQTL hotspots,” defined as consisting of multiple evQTL SNPs linked to a common gene or one evQTL SNP linked to multiple genes. This is further confirmed by the permutation test in which we randomly shuffled the genotypes and repeated the trans-evQTL detection analysis. With the shuffled data set, we only detected 33 trans-evQTL (false positives) that involve 24 genes, among which only 6 genes have >1 trans-evQTL SNP. The randomly shuffled data set produced no apparent evQTL hotspots. Finally we conducted the same analysis with the GSE11582 data set independently and obtained a comparable number (298) of trans-evQTL with the distribution that also indicated the presence of evQTL hotspots (Figure S14).
We carried out a genome-wide analysis to identify genomic loci associated with gene expression variability. Our analysis is conservative in the sense of not only the stringent criteria we adopted for assessing the statistical significance, but also the experimental conditions we applied to maximize the use of information. For example, we pooled the expression data from different HapMap populations and used the full set of expression data in estimating the variance of gene expression. By pooling the data, we increased the sample size. Meanwhile, we resorted to only analyzing SNPs that were both polymorphic and common in each of populations and lost the chance to detect population-specific evQTL. Thus we may underestimate the number of evQTL because of the unreported evQTL relationships present in one population but not the others. Nevertheless, using two expression data sets, we detected a total of 218 distinct loci that contain cis-acting evQTL SNP(s) and cross-validated eight of these loci. We demonstrated that, in addition to cis-acting evQTL, there are many more trans-acting evQTL in the human genome. Our findings provide orthogonal information that is not available in existing eQTL literature and may lead to the identification of important genetic regulatory mechanisms of gene expression, which may have downstream phenotypic effects.
Formation of evQTL
Throughout the article, we have assumed that gene express variability is measurable. However, it is worth noting that the gene expression variability was measured via a “snapshot” of the difference in the means of gene expression between multiple individuals at a single time point (although two expression data sets were used, they were used independently). This quantity is related to but differs from the stochastic noise of gene expression—the random fluctuation of gene expression within replicated measures. To measure the stochastic noise, we would have to measure gene expression in each of the individuals multiple times. The gene expression variability we measured is the summation of the within-individual variability (i.e., stochastic noise of gene expression within the lines) and the between-individual variability, but not either of them alone. Thus, we did not solely measure expression variability as a simple phenotypic trait but as a composite trait. Consequentially, we put forward two distinct scenarios to explain the formation of evQTL.
First, we emphasize the influence of single SNP (i.e., an evQTL SNP) on within-individual variability of gene expression. Imagine that the introduction of an “a” allele of a given SNP lowers the sensitivity of gene regulation in response to environmental changes or decreases the rate of degeneration of mRNA of the gene, while introduction of an “A” allele raises the sensitivity or increases the degeneration rate. Homozygous “aa” individuals are expected to show a lower degree of variability in expression level of corresponding genes because the “a” allele cushions (or canalizes) the impact of a changing environment or maintain intact mRNA for a longer time. When the “a”allele is replaced with “A,” the cushion or stability effects are removed. Without these cushioning effects, the gene expression phenotype becomes more sensitive to environmental influences, which causes a greater variance among individuals. There is a need for finding approaches to identify biologically variable loci in nonreplicable systems. Mechanistically, a mutation (e.g., the allele “a” or “A”) may alter the regulation of gene expression at any stage of the coupled processes of transcription, cotranscription, and post-transcription (Pandit et al. 2008; Majewski and Pastinen 2011; Chalancon et al. 2012). Also a vast array of biophysical parameters governing these processes may be involved, including the efficiency of initiation, the speed of transcription, the frequency of transcriptional bursting, mRNA stability, the presence of antisense RNA-mediated degradation, nonsense-mediated decay, the status of regulatory network, and so on (Raser and O’Shea 2005; Newman et al. 2006; Volfson et al. 2006; Raj and van Oudenaarden 2008; Li et al. 2010b; Dahan et al. 2011; Majewski and Pastinen 2011; Xu et al. 2011b; Chalancon et al. 2012; Schoenberg and Maquat 2012). Certainly, there are also many nongenetic factors introduced in the path from the human donor to the study of gene expression of an LCL in vitro (Choy et al. 2008). These nongenetic factors may include the random choice of which subpopulations of B cells are selected in the process of immortalization, the amount of and individual response to the EBV virus, the history of passage in cell culture and culture conditions, the laboratory protocols and reagents with which assays are performed, and so on. All these make it difficult to study the mechanism of genetic control of expression variability. A well-controlled experimental system is, therefore, highly expected in the future. We believe that new technologies are required to better understand the molecular mechanisms behind evQTL control of gene expression variability. New experimental methods such as, single-molecule mRNA decay measurements used for revealing the mechanism of promoter-regulated mRNA stability (Trcek et al. 2011), and single-cell genomics technologies used for real-time investigation of transcription levels and decay (Chalancon et al. 2012), could prove to be important for future expression variability studies. Further application of these powerful technologies holds promise for determining the actual mechanisms behind expression variance and variability.
Second, we attribute the formation of evQTL to the interaction between two or more genotypes (SNP–SNP interaction or epistasis) to control a single expression phenotype. While the effect of single SNP may explain the formation of evQTL, it seems more plausible to consider the effects of multiple SNP. Under this scenario, an evQTL is created through the effect of multiple SNPs that interact with one another to influence the expression of a gene. This notion is supported by the observation that genetic interactions can create complex phenotypic patterns that could also explain the variability of complex traits (Ronnegard and Valdar 2011). In evQTL what we have mapped may be a complex gene expression pattern that is created through the interaction between expression-controlling variants. It is well known that there are many molecular mechanisms that allow different genotypes to influence gene expression. These include allele-specific expression through the modulation of activity of cis-regulatory elements, the regulation of transcript isoform levels through disruption of the splicing machinery, and the modification of chromatin accessibility and transcription factor binding (Wang et al. 2008; Pickrell et al. 2010; Lalonde et al. 2011; Degner et al. 2012). However, the mechanisms underlying the interaction between these expression-controlling genotypes are poorly understood. Nevertheless, compared to the single-SNP explanation for evQTL, the idea of the interacting SNPs resulting in evQTL seems to be more widely accepted. In fact, several methods for detecting the interaction between SNPs (e.g., Pare et al. 2010; Struchalin et al. 2010, 2012; Daye et al. 2012) are based on testing for heteroscedasticity of traits. A wider application of these methods (including the dglm method in this study) in genetic association studies will supply more knowledge to understand the interaction between genetic variants.
Overall, we relied on statistical association to detect evQTL effects on the variance of gene expression. Different evQTL may be created through very different mechanisms. Some evQTL can correspond to loci controlling canalization/decanalization of the gene expression level (decrease or increase of the “environmental variance” of gene expression); these evQTL thus are true variance QTL. However, evQTL can also represent loci that are involved in the interaction with other, unidentified loci. The effect of the evQTL then depends on the genetic background, the true effect could be dissected in larger populations (or by comparing twins with the same genotype), but if all genetic backgrounds are pooled in a single analysis (as we did), the locus seems to induce more variable phenotypes.
A potential pitfall of using the dglm model to detect evQTL needs to be clarified. The dglm model defined in Equation 2 does not model the contribution of additional causal variants on mean expression levels per se. Because of this, the resulting lack of fit of the mean model (relative to the full model) could be interpreted as the existence of significant heteroscedasticity. Using HLA-DQA2 as an example, the bimodal expression of the gene (Figure 4) could be explained using an additional causal SNP controlling mean expression levels under a dominant or recessive model, without invoking the evQTL effect. The dglm model used in this study does not model the additional causal SNP. When the current model is applied for detecting genetic interactions, it should only be used as the first step in screening to prioritize SNPs. This is a similar approach as the Levene’s test that is used as the first step of the “variance prioritization” method (Pare et al. 2010). Given the flexibility of dglm, the current model (Equation 2) can be extended to incorporate the effect of single additional casual SNP and the additive (not interaction) effect of multiple SNPs on mean expression levels. Therefore, detecting evQTL and using extended dglm models represent a novel method for effectively screening for genetic interactions, especially when the multiple-SNP influence on expression variability is implied.
Roles of CNV and natural selection
CNVs form an abundant class of genetic variation that is presumed to impact gene expression. Schlattl et al. (2011) identified CNVs associated with the expression of 110 genes in CEU and YRI populations. These included a CNV downstream of HLA-DQB1, which is associated with HLA-DQB1 gene expression. In most of these CNV-associated eQTL, CNVs intersecting genes typically affect genes in the expected direction, i.e., involving a positive correlation between gene copy number and expression level. Thus, CNVs may be implicated in the creation of eQTL via dosage mechanism. However, this assumption cannot be used to explain the negative correlations between gene copy number and expression level. Schlattl et al. (2011) found that ∼20% of associations of CNVs displayed the unexpected negative correlations, including several instances that had been previously observed (Stranger et al. 2007a; Henrichsen et al. 2009).
The results presented in this article suggest that most cis-acting evQTL are located in CNV-containing regions of the human genome. These raise the possibility that evQTL account for the mixture of positive and negative correlations between gene copy number and expression level. In addition, although our results are only correlative, more evQTL in CNV-containing regions may deserve a new theory. It has been suggested that the control of gene copy number represents a way to lower the intrinsic noise in gene expression (Raser and O’Shea 2005). Evolutionary analysis showed, in both yeasts and mammals, the expression of duplicate gene trends to decrease after duplication to rebalance gene dosage (Qian et al. 2010). We argue that noise control by increased copy number provides another possible explanation for the co-occurrence of evQTL and CNVs. In future study, detailed analyses of fine-resolution CNV data are needed to clarify the basis of these complex correlations. SNPs that tag nearby CNVs could plausibly be used as markers for identifying CNV-associated evQTL. CNVs contribute to the susceptibility of various complex neurological disorders in humans. The link between CNVs and human mental disorders may be through the altered expression variability of related loci. Indeed, the expression variability of genes in cells from patients suffering from schizophrenia and Parkinson disease has been found to be different from that in cells from normal controls (Mar et al. 2011). Thus understanding the relationship between CNVs and evQTL may shed light on the study of these complex disorders.
It is expected that control of gene expression variability is under evolutionary pressure. Expression variability allows simultaneous achievement of multiple steady-state phenotypes in a population (Raser and O’Shea 2005), which may play an important role in differentiation in multicellular organisms, such as that in the human immune system. For example, in T cells, the gene expression variability contributes to the useful diversification of biological functions within a clonal population and interferes with accurate antigen discrimination. The genetic components that coregulate the expression levels of multiple genes to achieve phenotypic variability in a controlled manner are obviously under selection (Feinerman et al. 2008). Several evQTL were identified at the human MHC region, which is characterized by a complex genomic landscape maintained by balancing selection (Parham and Ohta 1996; Vandiedonck et al. 2011). Our results underscore the role of natural selection, especially balancing selection, in maintaining expression variability in HLA genes.
In unicellular yeast, phenotypic variability allows some individuals in a population to be in an “anticipatory” state for a sudden environmental change (Acar et al. 2008; Zhang et al. 2009). Increased expression variability is therefore beneficial in changing environmental conditions due to lack of regulatory responses to all possible environmental changes, thus higher noise allows for faster adaptation to changing environments. Yet, there is no evidence that this notion can be applied to humans. Our results do not suggest an under- or overrepresentation of cis-acting evQTL in genomic regions under positive selection. Further studies could be directed to develop new methods for detecting positive selection in evQTL and estimate how many evQTL are under positive selection. For eQTL, such a method has already been developed (Fraser et al. 2011), which may prove to be adaptable for evQTL.
Our genome-wide screening for evQTL is an initial step toward better understanding the genetic factors controlling gene expression variability in humans. Based on our results, we hypothesize that there are many unrecognized and broadly distributed genetic variants that play a role in controlling expression variability or are involved in the formation of evQTL through interacting with other variants. The influence of these genetic variants is further modulated by other genomic and cellular features, such as structural variations and transcriptional regulatory networks. These empirical findings underscore the genetic basis underlying phenotypic variability in biological systems and the roles of variability-determining variants and genetic interactions in population evolution.
We wish to thank all anonymous reviewers for their constructive comments. We are grateful to Tomasz Koralewski for help with data processing. We thank Han Liang, Lan Zhu, Loren Skow, Ying Zhang, and Quan Long for valuable suggestions. We acknowledge the Texas A&M Supercomputing Facility (http://sc.tamu.edu/) for providing computing resources. This research was supported in part by a Gray Lady Foundation grant to J.J.C.
Communicating editor: E. Stone
- Received September 13, 2012.
- Accepted October 31, 2012.
- Copyright © 2013 by the Genetics Society of America