Bayesian Detection of Expression Quantitative Trait Loci Hot Spots

Bottolo, Leonardo; Petretto, Enrico; Blankenberg, Stefan; Cambien, François; Cook,  Stuart A; Tiret, Laurence; Richardson, Sylvia

doi:10.1534/genetics.111.131425

Abstract

High-throughput genomics allows genome-wide quantification of gene expression levels in tissues and cell types and, when combined with sequence variation data, permits the identification of genetic control points of expression (expression QTL or eQTL). Clusters of eQTL influenced by single genetic polymorphisms can inform on hotspots of regulation of pathways and networks, although very few hotspots have been robustly detected, replicated, or experimentally verified. Here we present a novel modeling strategy to estimate the propensity of a genetic marker to influence several expression traits at the same time, based on a hierarchical formulation of related regressions. We implement this hierarchical regression model in a Bayesian framework using a stochastic search algorithm, HESS, that efficiently probes sparse subsets of genetic markers in a high-dimensional data matrix to identify hotspots and to pinpoint the individual genetic effects (eQTL). Simulating complex regulatory scenarios, we demonstrate that our method outperforms current state-of-the-art approaches, in particular when the number of transcripts is large. We also illustrate the applicability of HESS to diverse real-case data sets, in mouse and human genetic settings, and show that it provides new insights into regulatory hotspots that were not detected by conventional methods. The results suggest that the combination of our modeling strategy and algorithmic implementation provides significant advantages for the identification of functional eQTL hotspots, revealing key regulators underlying pathways.

THE current focus of biological research has turned to high-throughput genomics, which encompasses large-scale data generation and a variety of integrated approaches that combine two or more -omics of data sets. An important example of integrative genomics analysis is the investigation of the genetic regulation of transcription, also called expression quantitative trait locus (eQTL) or “genetical genomics” studies (Cookson et al. 2009; Majewski and Pastinen 2011). A typical eQTL analysis follows a natural structure of parallel regressions between the large set of q responses (i.e., expression phenotypes), and that of p explanatory variables (i.e., genetic markers, often single nucleotide polymorphism, SNPs), where p is typically much larger than the number of observations n.

From a statistical point of view, the size and the complex multidimensional structure of eQTL data sets pose a significant challenge. Not only does one wish to detect the set of important genetic control points for each response (expression phenotype), including cis- and trans-acting control for the same transcript, but, ideally, one would wish to exploit the dependence between multiple expression phenotypes. This will facilitate the discovery of key regulatory markers, so-called hotspots (Breitling et al. 2008), i.e., genetic loci or single polymorphisms that influence a large number of transcripts. Identification of hotspots can inform on network and pathways, which are likely to be controlled by major regulators or transcription factors (Yvert et al. 2003; Wu et al. 2008). Most importantly, there is mounting evidence that common diseases may be caused (or modulated) by changes at a few regulatory control points of the system (i.e., hotspots), which can cause perturbations with large phenotypic effects (Chen et al. 2008; Schadt 2009).

In this article, we set out to perform hotspot and eQTL detection in an efficient manner, which exploits fully the multidimensional dependencies within the genome-wide gene expression and genetic data sets. We build upon our previous work (Bottolo and Richardson 2010), where we implemented Bayesian sparse multivariate regression for continuous response to search over the possible subsets of predictors in the large 2^p model space. For each expression phenotype, this corresponds to carrying out multipoint mapping within an inference framework, Bayesian variable selection (BVS), where model uncertainty is fully integrated. Here, we propose a novel structure for linking parallel multivariate regressions that borrows information in a hierarchical manner between the phenotypes to highlight the hotspots. To be precise, we propose a new multiplicative decomposition of the joint matrix of selection probabilities ω_kj that link marker j to phenotype k and demonstrate in a simulation study that this hierarchical structure and its Bayesian implementation (hierarchical evolutionary stochastic search or HESS algorithm) possess good characteristics in terms of sensitivity and specificity, outperforming current methods for hotspot and eQTL detection. Finally, we show the applicability of our method in two real-case eQTL studies, including animal models and human data. Our approach is broadly applicable and extendable to other high-dimensional genomic data sets and represents a first step toward a more reliable identification of functional eQTL hotspots and the underlying causal regulators.

Analysis models for eQTL data are linked to two strands of work: (i) methods for multiple mapping of QTL, where a single continuous response, referred to as a “trait,” is linked to DNA variation at multiple genetic loci by using a multivariate regression approach, and (ii) models that exploit the pattern of dependence between the sets of responses associated with a predictor (i.e., genetic marker). There is a vast literature on multi-mapping QTL (see the review by Yi and Shriner 2008); some of the models have been extended to the analysis of a small number of traits simultaneously (Banerjee et al. 2008; Xu et al. 2008). Several styles of approaches have been adopted ranging from adaptive shrinkage (Yi and Xu 2008; Sun et al. 2010) to variable selection within a composite model space framework that sets an upper bound on the number of effects (Yi et al. 2007). Most of the implemented algorithms sample the regression coefficients via Gibbs sampling. However, these have not been used with a substantial set of markers in the “large p small n” paradigm, but mostly in case of candidate genes or in small experimental cross-animal studies. To face the challenges typical of larger eQTL studies, we have chosen to build our multi-mapping model using a recently developed Bayesian sparse regression approach (Bottolo and Richardson 2010). In this approach, subset selection is implemented in an efficient way for vast (potentially multi-modes) model space by integrating out the regression coefficients and by using a purposely designed MCMC variable selection algorithm that enhances the model search with ideas and moves inspired by evolutionary Monte Carlo algorithms.

The first eQTL modeling approach that explicitly set out to borrow information from all the transcripts was proposed by Kendziorski et al. (2006). In the mixture over markers (MOM) method, each response (expression phenotype) y_k, 1 ≤ k ≤ q, is potentially linked to the marker j with probability p_j or not linked to any marker with probability p₀. All responses linked to the marker j are then assumed to follow a common distribution f_j(⋅), borrowing strength from each other, while those of nonmapping transcripts have distribution f₀. Inspired by models that have been successful for finding differential expression, the marginal distribution of the data for each response y_k is thus given by a mixture model: $p_{0} f_{0} (y_{k}) + \sum_{j = 1}^{p} p_{j} f_{j} (y_{k})$ ⁠. A basic assumption of the MOM model is that a response is associated with at most one predictor. For good identifiability of the mixture, MOM requires a sufficient number of transcripts to be associated with the markers. The authors use the EM algorithm to fit the mixture model and estimate the posterior probability of mapping nowhere or to any of the p locations. By combining information across the responses, MOM is more powerful and can achieve a better control of false discovery rates (FDR) by thresholding the posterior probabilities than pure univariate differential expression methods testing each transcript-marker pair. But as originally developed, it is not fully multivariate as it does not account for multiple effects of several markers on each expression trait.

To improve on identification of eQTL effects, Jia and Xu (2007) formulate a unifying q × p hierarchical model in which each transcript y_k, 1 ≤ k ≤ q, is potentially linked to the complete set of p markers X through a full linear model with regression coefficients, β_k = (β_k₁,..., β_kj,..., β_kp)^T. Inspired by Bayesian shrinkage approaches already used in conventional QTL mapping, they propose using a mixture prior on each of the β_kj, also known as “spike and slab,”

β_{k j} \sim (1 - γ_{k j}) N (0, δ) + γ_{k j} N (0, σ_{j}^{2})

,

(1)

with a fixed very small δ for the spike and an independent prior for the variance

σ_{j}^{2}

of the slab in the jth marker. They then link the q responses through a hierarchical model of the Bernoulli indicators γ_kj, establishing what we refer to as a hierarchical regression set-up. They assume that γ_kj ∼ Bernoulli(ω_j), 1 ≤ k, ≤ q, and give ω_j a Dirichlet(1, 1) prior. In this model, to improve detection of transcript-marker associations, strength is borrowed across all the transcripts via the common latent probability ω_j. Jia and Xu (2007) implement their hierarchical model in a fully Bayesian framework using an MCMC algorithm called BAYES, based solely on Gibbs sampling.

The high dimensionality of both gene expression and marker space has been alternatively addressed through the use of data reduction methods. In particular, Chun and Keleş (2009) have proposed implementing sparse partial least-squares regression (M-SPLS eQTL) on preclustered group of transcripts. M-SPLS selects markers associated with each transcript cluster by evaluating the loadings on a set of latent components. As the dimension of each cluster is moderate, SPLS implements a multivariate formulation that takes into account the correlation between the transcripts in the same cluster. Sparsity of the latent direction vectors is achieved by imposing a combination of L₁ and L₂ penalties, similar to the elastic net. The tuning parameters K and η controlling the number of latent components and the convexity of the penalized likelihood are tuned together by cross-validation. The output of this method is the set of the regression coefficients of the markers belonging to the latent vectors that are significantly associated with a subset of transcripts, selected by bootstrap confidence interval.

Jia and Xu’s linked regression set-up and fully Bayesian formulation is a natural starting point for eQTL detection, which shares common features with our approach. Here, we present a novel model structure and state of the art implementation based upon evolutionary Monte Carlo. We report the results of a simulation study comparing our HESS method to BAYES (Jia and Xu 2007), as well as to two alternative approaches: MOM (Kendziorski et al. 2006) and M-SPLS (Chun and Keleş 2009). Finally, we show the application of our method to two diverse genomic experiments in mouse and human genetic contexts.

Theory and Methods

Hierarchical related sparse regression

Let

Y = {y_{k}^{T}, 1 \leq k \leq q}

the n × q matrix of responses, with y_k = (y_k₁,…, y_kn)^T the sequence of n observations of the kth response, and let X be the n × p design matrix with ith row x_i = (x_i₁,…, x_ip)^T. We assume throughout that x_i is quantitative. It encompasses the case of continuous biomarkers, inbred genotypes {0, 1} for recombinant inbred (RI) strains and {0, 1, 2} genotype coding for F₂ animal crosses or human data. A linear model for the kth response can be described by the equation

y_{k} = α_{k} 1_{n} + X β_{k} + ε_{k},

where α is an unknown constant, 1_n is a column vector of ones, β_k = (β_k₁,…, β_kp)^T is the p × 1 vector of regression coefficients, and ε_k is the error term with

ε_{k} \sim N (0, σ_{k}^{2} I_{n})

⁠, where I_n is the diagonal matrix of dimension n. BVS is performed by placing a constant prior density on α_k and a prior on β_k, which depends on a latent binary vector γ_k = (γ_k₁,…, γ_k_j,…, γ_kp)^T, where γ_k_j = 1 if β_kj ≠ 0 and γ_k_j = 0if β_kj = 0, j = 1,…, p. Conditionally on the latent binary vector, the linear model becomes

y_{k} = α_{k} 1_{n} + X_{γ_{k}} β_{γ_{k}} + ε_{k},

where β_γ_k is the nonzero vector of coefficients extracted from β_k, X_γ_k is the design matrix of dimension n × p_γ_k, with columns corresponding to γ_kj = 1, and

p_{γ_{k}} \equiv γ_{k}^{T} 1_{p}

the number of selected covariates for the k response. For every regression k, we assume that, apart from the intercept α_k, X contains no variables that would be included in every possible model and that the columns of the design matrix have all been centered in 0.

Assuming independence of the q regression equations conditionally on the selected predictors modeled in the q × p latent binary matrix

Γ = {γ_{k}^{T}, 1 \leq k \leq q}

⁠, the likelihood becomes

\overset{q}{\prod_{k = 1}} φ_{n} (y_{k}; α_{k} 1_{n} + X_{γ_{k}} β_{γ_{k}}, σ_{k}^{2} I_{n}),

(2)

where φ_n(⋅) is the n -variate normal density function.

The description of the joint likelihood as the product of q regression equations is similar to the one proposed by Jia and Xu (2007). However, one important difference is the assignment in (2) of a regression specific error variance $σ_{k}^{2}$ ⁠, allowing for transcript-related residual heterogeneity and making our formulation more flexible. A more general model, seemingly unrelated regressions (SUR) introduces additional dependence between the responses Y through the noise ε_k, modeling the correlation between the residuals of different responses (Banerjee et al. 2008). However, it becomes computational unfeasible when the size of q is large, which is typical in eQTL experiments.

Prior set-up

For a given k, we follow the same prior set-up for the regression coefficients and error variance as described in Bottolo and Richardson (2010). First, we treat the intercept α_k separately, assigning it a constant prior, p(α_k) ∝ 1. Second, conditionally on γ_k, we assign a g-prior structure on the regression coefficients and an inverse-gamma (Inv Ga) density to the residual variance

p (β_{k} | γ_{k}, τ, σ_{k}^{2}) = N (0, σ_{k}^{2} τ {(X_{γ_{k}}^{T} X_{γ_{k}})}^{- 1})

(3)

p (σ_{k}^{2}) = Inv Ga (a_{σ}, b_{σ})

,

(4)

with a_σ, b_σ > 0, and

E (σ_{k}^{2}) = b_{σ} / (a_{σ} - 1)

⁠. This conjugate prior set-up has many advantages. The most important is that, for a given k, the marginal likelihood p(y_k|X, γ_k, τ) can be written in a closed form that is particularly simple to compute once (3) and (4) are integrated out. Furthermore, it allows for more efficient MCMC exploration with correlated predictors than the nonconjugate case (i.e., when the variance component

σ_{k}^{2}

in (3) is different from the error variance) and it provides more accurate identification of the high-probability models among those visited during the MCMC (George and McCulloch 1997). Finally it leads to a simple and interpretable expression,

E (β_{k j} | Y, τ) = τ / (1 + τ) β_{k j}^{OLS}

with

β_{k j}^{OLS}

the ordinary least-squares solution, of the level of shrinkage.

The hierarchical structure on the regression coefficients is completed by specifying a hyper-prior on the scaling coefficient τ, p(τ). We adopt the Zellner–Siow priors structure for the regression coefficients that can be thought as a scale mixture of g-priors and an inverse-gamma prior on τ, p(τ) = Inv Ga(1/2, n/2) with heavier tails than the normal distribution, $p (β_{k} | γ_{k}, σ_{k}^{2}) = Cauchy (0, n σ_{k}^{2} {(X_{γ_{k}}^{T} X_{γ_{k}})}^{- 1})$ ⁠. In general it has been observed (Bottolo and Richardson 2010) that data adaptivity of the degree of shrinkage conforms better to different variable selection scenarios than assuming standard fixed values (which can be easily implemented by using a point mass prior for τ). Since the level of shrinkage can influence the results of the variable selection procedure, in our model we force all the q regression equations to share the same common τ, linking the regression equations hierarchically through the variance of their nonzero coefficients.

The prior specification is concluded by assigning a Bernoulli prior on the latent binary value γ_kj, p(γ_kj|ω_kj) = Bernoulli (ω_kj). The prior chosen for ω_kj is of paramount importance in BVS since it controls the level of sparsity, i.e., the association with a parsimonious set of important predictors. For a given response this task can be accomplished by specifying a common small-selection probability for all p predictors, ω_kj = ω_k and giving p(ω_k) = Beta(a_k, b_k) (Bottolo and Richardson 2010). Inducing sparsity when all the responses are jointly considered is harder because further constraints are desirable. eQTL surveys (Cookson et al. 2009) suggest that only a fraction of expression traits are under genetic regulation and the number of their control points is usually small. This can be modeled by assigning a different probability for each marker ω_kj = ω_j with an hyper-prior on ω_j. This solution, first proposed by Jia and Xu (2007) with the conjugate prior p(ω_j) = Dirichlet(d₁_j, d₂_j), assumes that this selection probability is the same for all the responses. However, whatever the sensible choice of the hyperparameters d₁_j and d₂_j_, d₁_j, d₂_j = 1 or d₁_j, d₂_j = 0.5, the posterior density greatly depends on the ratio between the number of transcripts associated to the marker j, q_j_, and the total number the transcripts in the eQTL experiment, q, since E(ω_j|Y) = (q_j + d₁_j)/(q + d₁_j + d₂_j), where q_j = # {j: γ_kj = 1}. In such formulation, the results are thus clearly influenced by the number of responses analyzed and sparsity of each kth regression cannot be controlled in the prior specification adopted for ω_kj of γ_kj_.

In this article we propose a novel way of specifying the selection probability ω_kj to synthetize the benefits of both approaches, Bottolo and Richardson (2010) and Jia and Xu (2007). We propose decomposing this probability into its marginal effects

ω_{j k} = ω_{k} \times ρ_{j}

(5)

with ω_k and ρ_j the “row” and “column” effect, respectively, and 0 ≤ ω_k ≤ 1 and ρ_j ≥ 0, but constrained so that 0 ≤ ω_j_k ≤ 1. The idea behind this decomposition is to control the level of sparsity for each row k through a suitable choice of the hyperparameters a_k, b_k of p(ω_k) = Beta(a_k, b_k), while the parameter ρ_j captures the “relative propensity” of predictor j to influence several responses at the same time. Large values of ρ_j indicate that predictor j has a marked influence on ω_jk and thus likely to be a hotspot. The adopted multiplicative formulation has some similarity to the disease mapping paradigm where the relative risk level acts in a multiplicative fashion on an expected number of cases in a binomial or Poisson disease risk model. A gamma density on the jth latent hotspot effect, p(ρ_j) = Ga(c, d), with E(ρ_j) = c/d, complete the hierarchical structure for the decomposition (5).

We conclude this section by describing the choice of the hyperparameters for ω_k and ρ_j. Since by construction ω_k ⊥ ρ_j, E(ω_jk) = E(ω_k)E(ρ_j). If we assume c = d, the hotspot propensity does not change the a priori row marginal expectation, E(ω_jk) = E(ω_k). However, it inflates the a priori row marginal variance Var(ω_jk) > Var(ω_k), with Var(ω_jk) = Var(ω_k)(1 + d⁻¹) + d⁻¹E²(ω_k). For the specification of the hyperparameters a_k and b_k, we use the Beta-binomial approach illustrated in Kohn et al. (2001), after marginalizing over the column effect in (5). The two hyperparameters can be worked out once E(p_γ_{_k}) and Var(p_γ_{_k}), the expected number and the variance of the number of genetic control points for each response, are specified.

Posterior inference

After integrating out the intercepts, the regression coefficients and the error variances, the joint density can be factorized as

p (Y, X, Γ, Ω, τ) = p (Y | X, Γ, τ) p (Γ | Ω) p (Ω) p (τ),

(6)

where

p (Y | X, Γ, τ) = \prod_{k = 1}^{q} p (y_{k} | X, γ_{k}, τ)

⁠,

p (Γ | Ω) = \prod_{k = 1}^{q} \prod_{j = 1}^{p} p (γ_{k j} | ω_{k j})

⁠, and

p (Ω) = \prod_{k = 1}^{q} p (ω_{k}) \prod_{j = 1}^{p} p (ρ_{j})

⁠. Posterior inference is carried out on the q × p latent binary matrix Γ = {γ_kj, 1 ≤ k ≤ q, 1 ≤ j ≤ p}, on the q × p selection probability matrix Ω = {ω_kj, 1 ≤ k ≤ q, 1 ≤ j ≤ p}, and on the scaling coefficient τ, if not fixed.

Sampling Γ is extremely challenging since complex dependence structures in the X create well-known problems of multimodality of the model space even for a single regression equation. Here the computational challenge is higher since we are aiming to explore a huge model space of dimension (2^p)^q. For this reason vanilla MCMC (MC³, Gibbs sampler, simple dimension changing moves) cannot guarantee a reliable exploration of the model space in a limited number of iterations. In this article we use a sampling scheme introduced by Bottolo and Richardson (2010), evolutionary stochastic search (ESS) as a building block for our new algorithm HESS. For each response, HESS relies on running multiple chains with different “temperature” in parallel, which exchange information about the set of covariates that are selected in each chain. Since chains with higher temperatures flatten the posterior density, global moves (between chains) allow the algorithm to jump from one local mode to another. Local moves (within chains) permit the fine exploration of alternative models, resulting in a combined algorithm that ensures that the chains mix efficiently and do not become trapped in local modes. Specific modifications of ESS were introduced to comply with the structure of ω_kj, which are sampled with the decomposition (5) rather than integrating them out as in ESS. This requires some modifications in the local moves (details in Supporting Information, File S1, Section S.2.2.)

For sampling the selection probability matrix Ω, we implemented a Metropolis-within-Gibbs algorithm for each element of the row effect ω = (ω₁, …, ω_q)^T and column effects ρ = (ρ₁, …, ρ_q)^T, rejecting proposed values outside the range [0, 1]. However, since the dimension of ω and ρ is very large, tuning the proposal for each element of the two vectors is prohibitive. To make HESS fully automatic, we use the adaptive MCMC scheme proposed by Roberts and Rosenthal (2009), where the variance of the proposal density is tuned during the MCMC to reach a specified acceptance rate. To satisfy the asymptotic convergence of the adaptive MCMC scheme, mild conditions are imposed (details in File S1, Section S.2.3).

If not fixed, the scaling coefficient τ, which is common for all the q regression equations and all the L chains, is sampled using a Metropolis-within-Gibbs algorithm with random walk proposal and fixed proposal variance (details in File S1, Section S.2.4).

Finally, we describe a complete sweep of our algorithm. We assume that the design matrix is fully known. If missing values are present, these can be imputed in a preprocessing imputation step (for instance using the fill.geno function from the qtl R package for genetic crosses (Broman and Sen 2009) or FastPhase (Scheet and Stephens 2006) for human data). Without loss of generality, we assume that the responses and the design matrix have both been centered. The same notation is used when τ is fixed or given a prior distribution. For simplicity of notation we do not index variables by the chain index, but we emphasize that the description below applies to each chain:

Given Ω and τ we update γ_k, according to the ESS procedure, using global and local moves. During the burn-in, we sample the latent binary vector γ_k for each k to tune the regression specific temperature ladder (details in File S1, Section S.2.5). After the burn-in, at each sweep, we select at random without replacement a fraction φ of the regressions where to update γ_k.
Given Γ and τ, we sample ω and ρ with a random walk Metropolis with adaptive proposals.
Given Γ and Ω, we sample τ with a random walk Metropolis with a fixed proposal. To balance the number of updates of the latent binary values γ_kj with those of the scaling coefficient, at each sweep, the number of times we sample τ is proportional to q × p × L.

The Matlab implementation of the HESS algorithm is available upon request from the authors.

Postprocessing analysis

In this section we present some of the postprocessing operations required to extract useful information from the rich output of our model. We stress that, while here for simplicity we are not using the output of the heated chains, following Gramacy et al. (2010), posterior inference could also be carried out using the information contained in all the chains.

The primary quantity of interest is the posterior propensity of each predictor to be a hotspot. In the spirit of cluster detection rules in disease mapping (Richardson et al. 2004), we use tail posterior probabilities of the propensities ρ_j, i.e., declare the jth predictor to be a hotspot if

\Pr (ρ_{j} > 1 | Y) \geq t

,

(7)

where t is a chosen threshold. We have found by empirical exploration and simulations that choosing a posterior threshold of t = 0.8 gives good performance across different scenarios with varying dimensions (data not shown).

The next quantity of interest is the posterior probability of inclusion for the pair (k, j). Following Petretto et al. (2010), the marginal probability of inclusion is

p (γ_{k j} = 1 | y_{k}) = C_{k}^{- 1} \sum_{s = 1}^{S} 1_{(γ_{k j}^{(s)} = 1)} (γ_{k}) p (γ_{k}^{(s)} | y_{k}),

(8)

where

γ_{k}^{(s)} = (γ_{k 1}^{(s)}, \dots, γ_{k j}^{(s)}, \dots, γ_{k q}^{(s)})

is the latent binary vector sampled at iteration s,

p (γ_{k}^{(s)} | y_{k})

is the model posterior probability obtained through inexpensive numerical integration in the full output (see File S1, Section S.3) and

C_{k} = \sum_{s = 1}^{S} p (γ_{k}^{(s)} | y_{k})

is the constant of normalization. The Bayes factor (BF) for the pair (k, j) is derived from (8) as the ratio between posterior odds and prior odds

{BF}_{k j} = \frac{p (γ_{k j} = 1 | y_{k})}{1 - p (γ_{k j} = 1 | y_{k}))} / \frac{E (p_{γ_{k}}) / p}{1 - (E (p_{γ_{k}}) / p)},

(9)

where E(p_γk) is the a priori expected number of genetic control points for the kth transcript.

Similarly to (8), if of interest, we can further evaluate the joint posterior probability of the set of predictors declared as hotspots as

p (\cap_{j = 1}^{ℋ} (γ_{k j} = 1) | y_{k}) = C_{k}^{- 1} \sum_{s = 1}^{S} 1_{(\cap_{j = 1}^{ℋ} (γ_{k j} = 1))} (γ_{k}) p (γ_{k}^{(s)} | y_{k})

(10)

with C^k as before and ℋ the set of markers identified as hotspots.

Finally, the best model visited is defined as

γ_{k}^{B} = {γ_{k}^{(s)} : \max_{s} p (γ_{k}^{(s)} | y_{k})} .

(11)

Note that the configuration posterior probability p(Γ|Y) (see File S1, section S.3) can be used as an alternative weight in (8) and (10) or to derive the max a posteriori (MAP) configuration visited

Γ^{B} = {Γ^{(s)} : \max_{s} p (Γ^{(s)} | Y)} .

Results

Simulation studies

We carried out a simulation study to compare our algorithm with recently proposed multiple response models: MOM (Kendziorski et al. 2006), BAYES (Jia and Xu 2007), and M-SPLS (Chun and Keleş 2009).

To create more realistic examples, we decided not to simulate the X matrix, but to use real human-phased genotype data spanning 500 kb, region ENm014 (chromosome 7: 126,368,183–126,865,324 bp), from the Yoruba population used in the HapMap project (Altshuler et al. 2005) as the design matrix. After removing redundant variables, the set of SNPs is reduced to P = 498, with n = 120, giving a 120 × 498 X matrix. As noted by Chun and Keleş (2009), high correlations between markers might affect the performance of variable selection procedures that do not explicitly consider such a grouping structure. The benefit of using real human data are to test competing algorithms when the pattern of correlation, i.e., linkage disequilibrium (LD), is complex and blocks of LD are not artificial, but they derive naturally from genetic forces, with a slow decay of the level of correlation between SNPs (see Figure S1).

In the simulated examples, we carefully select the SNPs that represent the hotspots (Figure S1): (i) all hotspots are inside blocks of correlated variables; (ii) the first four SNPs are weakly dependent (r² < 0.1); and (iii) the remaining two SNPs are correlated with each other (r² = 0.46) and also linked to one of the previous simulated hotspots (r² = 0.52 and r² = 0.44, respectively), creating a potential masking effect difficult to detect. Apart from the hotspots, no other SNPs are used to simulate transcript–SNP associations. We simulated four cases:

SIM1: In this example we simulated q = 100 transcripts from the selected six hotspots, with some transcripts predicted by multiple correlated markers (polygenic control): for instance transcripts 17–20 are regulated by three SNPs at the same time (see Figure S2). Altogether we simulated 94 transcript–SNP associations in 50 distinct transcripts. The effects were simulated from a normal density with smaller variance than in Jia and Xu (2007), β_kj ∼ N(0,0.3²)with ε_k ∼ N(0, 0.1²I_n) to mimic the smaller signal-to-noise ratio expected in genetically heterogeneous human data.
SIM2: As in the previous example, we simulated 100 responses, but there are only three hotspots with the same simulated association as before, leading to 64 transcript–SNP associations in 30 distinct transcripts. Moreover we created potential false-positive associations by simulating transcripts 81–90 and 91–100 using a linear transformation of transcript 20 with a mild negative correlation (in the interval [−0.5, −0.4]) and of transcript 80 with a strong positive correlation (in the interval [0.8, 0.9]), respectively. Since we create false-positive associations, the scenario will inform on how different algorithms behave when correlations among some transcripts are not due to SNPs.
SIM3: This simulation set-up is identical to the first scenario for the first 100 responses, but we increase the number of simulated responses to q = 1,000, simulating the further 900 transcripts from the noise.
SIM4: This is the same as the second simulated data set for the first 100 responses, with additional 900 responses simulated from the noise, giving altogether q = 1000 responses.

We discuss here the hyperparameters set-up. Since a priori, in addition to a large effect of a SNP that is located close to the transcript (cis-eQTL), we expect only a few additional control points associated with the variation of gene expression (typically trans-eQTL); in HESS we set E(p_γ_k) = 2 and Var(p_γk) = 2, meaning the prior model size for each transcript response is likely to range from 0 to 6 (Petretto et al. 2010). Following Kohn et al. (2001), we fixed a_σ = 10⁻¹⁰ and b_σ = 10⁻³, giving rise to a noninformative prior on the error variance. We run the HESS algorithm for 6000 sweeps with 1000 as burn-in with three chains and ϕ = 1/4. Computational time is similar for the first two simulated examples, 6 hr, and 10 times greater for the last two simulated scenarios on a Intel Xeon CPU at 3.33 GHz with 24 Gb RAM.

We run BAYES for 15,000 sweeps with 5000 as burn-in, recording sampled values every 5 sweeps. The variance δ of the spike component 1 is set 10⁻⁴, which is 100 times lower than the noise variance. Since the code available from the authors was written in SAS/IML, we recoded their Gibbs sampler in Matlab. We used the default parameters for MOM, while in M-SPLS the two tuning parameters are obtained through cross-validation selected in the interval K = 1,…, 10 and η = 0.01, …, 0.99. Each simulated example was replicated 25 times and we run the four algorithms on each replicate.

Power to detect hotspots:

The identification of the hotspots is of primary interest for all the algorithms we are comparing. In HESS using the tail posterior probability Pr(ρ_j >1|Y), we can rank the predictors according to their propensity to be a hotspot, while in BAYES the posterior mean of the common latent probability ω_j_, E(ω_j|Y) is utilized to prioritize important markers. In MOM the strength for a predictor to be a hotspot is not directly available but, as suggested by the authors, given a marker, it can be obtained by taking a suitable quantile of the transcript–marker associations distribution across responses. We use their R function get.hotspots recording the average of the distribution for each predictor. M-SPLS, after cross-validation, provides a list of latent components that predicts most of the variability of Y. While this group cannot be interpreted directly as the list of hotspots, we use it to test the existing overlap between the simulated hotspots and the latent components. Finally, differently from the analyses presented in Jia and Xu (2007) and Chun and Keleş (2009), in the HESS power calculation, we simply rank the evidence for being a hotspot provided by each algorithm across the 25 replicates. Therefore we are not using any method-specific procedure to call a hotspot, based for instance on FDR considerations, that could influence the comparison results.

Figure 1 shows the ROC curves for the four algorithms considered. HESS (blue lines) outperforms all the other methods with sizeable power on the simulated examples. It is not significantly affected by the dimension of the eQTL experiment (top, q = 100; bottom, q = 1000). This is somehow expected since the hotspot propensity does not depend directly on the number of transcripts analyzed (see File S1, section S.2.3). Spurious correlations among transcripts not due to SNPs (right) have a negligible effect on the HESS power, showing robust properties of our algorithm in detecting hotspots under different scenarios.

Figure 1

Open in new tab Download slide

ROC curves for hotspots detection using HESS (blue line), MOM (red line), BAYES (green line), and M-SPLS (black star) in the four simulated scenarios (Figure S2). From top to bottom, left to right: SIM1, q = 100 and six hotspots; SIM2, q = 100 and three hotspots; SIM3, q = 1000 and six hotspots; SIM4, q = 1000 and three hotspots. For M-SPLS, type I error and power were calculated conditionally on the list of latent vector components. (Top) MOM is indicated by a red dashed line to highlight that it is not designed in the cases when the number of markers is larger than the number of traits.

The other methods show good properties when q = 100, but their power degrades sensibly when q = 1000. This is expected for BAYES (green lines) since E(ω_j|Y) is affected by q, while the performance of MOM (red lines) is more stable. MOM (red dashed lines) shows good power in the simulated examples even when the number of markers is larger than the number of traits, a situation that MOM is not designed for (top). Finally M-SPLS has greater power than BAYES, but it is outperformed by both HESS and MOM in the more sparse scenarios (bottom). Looking more closely at the list of latent vectors identified by M-SPLS (data not shown), we noted that the simulated hotspots at SNP 362 and 466 that are linked to SNP 239 were rarely selected (false negative) in both SIM1 and SIM3. On the contrary, SNP 75 is very often included (false positive) in the list of latent vectors in all the scenarios. This might reflect the high correlation between SNP 362 and 466 with SNP 239, as well as the strong dependency between SNP 75 and 30 where we simulated a hotspot. This suggests that M-SPLS has limited efficiency in the presence of complex correlation patterns in the design matrix. In Figure S3, Figure S4, Figure S5, and Figure S6, interested readers can find the visual representation of the signals detected by each algorithm and averaged across the 25 replicates.

Power to detect transcript-marker associations:

Figure 2 shows the ROC curves of the transcript–marker marginal associations detected by each method. Also in this case, to perform the power calculation, we are not using any method-specific way to declare a significant association, since we simply record the output from each algorithm and rank it across the 25 replicates. In particular we use the marginal probability of inclusion p(γ_kj = 1|y_k) (8) for HESS; the posterior frequency ${\bar{γ}}_{k j} = Σ_{s = 1}^{S} γ_{k j}^{(s)} / S$ for BAYES, where $γ_{k j}^{(s)}$ is the value recorded at iteration s; the transcript–marker association provided the MOM object momObj; and finally the associations selected by bootstrap confidence interval at different type I error levels (α = 10⁻⁴, 10⁻³, 10⁻², 0.05) for M-SPLS.

Figure 2

Open in new tab Download slide

ROC curves for transcript–marker associations using HESS (blue line), MOM (red line), BAYES (green line), and M-SPLS (black star) in the four simulated scenarios (Figure S2). From top to bottom, left to right: SIM1, q = 100 and six hotspots; SIM2, q = 100 and three hotspots; SIM3, q = 1000 and six hotspots; SIM4, q = 1000 and three hotspots. For M-SPLS, power is calculated conditionally on the list of transcript–marker associations selected by bootstrap confidence interval at a fixed type I error (α = 10⁻⁴, 10⁻³, 10⁻², 0.05). In the top, MOM is indicated by a red dashed line to highlight that it is not designed in the cases when the number of markers is larger than the number of traits.

For transcript–marker association detection, we find that HESS has higher power than that of the other methods in all the simulated scenarios. As expected when more responses are included, the power decreases slightly (bottom), while spurious associations due to the correlation between transcripts do not seem to affect the ability of HESS to distinguish between true and false signals (right). MOM is quite stable across scenarios, but it reaches only half of the power of HESS. BAYES and M-SPLS have similar behavior and their performance degrades when q = 1000. BAYES, in particular, has very low power since the shrinkage to the null effect, caused by common latent probabilityω_j_, is particularly strong in SIM3 and SIM4.

The different power of the methods considered can be better understood by looking at Figure S3, Figure S4, Figure S5, and Figure S6, where, for each simulated example, we averaged the evidence of transcript–marker association across replicates. HESS is able to identify the correct simulated pattern, with very few false positives. When false-positive associations are simulated, for instance, in SIM3 and SIM4, HESS assigns on average lower posterior probability of inclusion than for the true positive ones (Figure S4 and Figure S6, top left). While MOM is able to identify the simulated hotspots, it finds it difficult to separate the true transcript–marker associations from the spurious ones (Figure S3, Figure S4, Figure S5, and Figure S6, bottom left). The main limitation of M-SPLS is the correct identification of the latent vectors when highly correlated predictors are considered (Figure S3, Figure S4, Figure S5, and Figure S6, top right). Finally BAYES is able to identify the simulated pattern when q = 100 (Figure S3 and Figure S4, bottom right), but it seems to be too conservative when the number of responses is large, q = 1000 (Figure S5 and Figure S6, bottom right). The higher false-negative rate in BAYES may depend on the poor efficiency of the MCMC sampler (which is based exclusively on the Gibbs sampling that is not able to jump between distant competing models) and on the spike and slab prior that is not integrated out. The latter influences the sampling of γ_kj since the latent binary vector depends on the regression coefficients (see Figure S7 for an illustration).

Real case studies

Here we present two applications of HESS to: (i) mouse gene expression data published in Lan et al. (2006) that is commonly used as a benchmark data set for detection of eQTL (Chun and Keleş 2009) and eQTL hotspots (Kendziorski et al. 2006; Jia and Xu 2007) and (ii) human monocytes expression data set recently analyzed for disease susceptibility by Zeller et al. (2010).

Mouse data set:

The mouse data set has been previously described in detail (Lan et al. 2006), and it consists of 45,265 probe sets the expression of which has been measured in the liver of 60 mice. Mice were collected from the F₂-ob/ob cross (B6 × BTBR) and genotype data were available for 145 microsatellite markers from 19 autosomal chromosomes. To make our analysis comparable with previously reported studies (Jia and Xu 2007; Chun and Keleş 2009), we focused on 1573 probe sets showing sizeable variation in gene expression in the mouse population (sample variance >0.12). Running HESS for 12,000 sweeps with 2000 as burn-in and the same choice of the hyperparameters described in the simulation studies, among the 145 markers 16 were identified with posterior tail probability >0.8, regulating a significant number of probe sets (Table S1). We report the genome location of the identified hotspots in Figure 3 and show transcript–marker associations in Figure 4. Since large hotspot propensity reveals that multiple traits are controlled by the same marker, we focused on biologically meaningful transcript–marker associations by using marginal probability of association >0.95 (corresponding to local FDR 5%, Ghosh et al. 2006). Six markers were found to control more than 5% of all analyzed probe sets as shown in Figure 3. While marker D15Mit63 was previously detected by BAYES and M-SPLS, three other major regulatory points were identified solely by our method: D13Mit91, D18Mit9, and D18Mit202, controlling 14.1, 10.6, and 9.7% of all analyzed probe sets, respectively (Table S2).

Figure 3

Open in new tab Download slide

Proportion of transcripts associated with each marker in the mouse data example (n = 60, P = 145, and q = 1573). Transcript–marker association was declared at 5% local FDR with marginal probability of inclusion >0.95. The 16 red triangles indicate markers (two of them are overlapping and hence are not distinguishable) that have been identified as hotspots with tail posterior probability >0.8.

Figure 4

Open in new tab Download slide

Heat map of the marginal probabilities of inclusion for each transcript–marker pair in the mouse data example (n = 60, P = 145, and q = 1573). The 16 red triangles indicate markers that have been identified as hotspots with tail posterior probability >0.8.

The regulatory hotspot at marker D13Mit91 is located within the Kif13a (kinesin family member 13A) gene, which is involved in intracellular protein transport and microtubule motor activity via direct interaction with the AP-1 adaptor complex (Nakagawa et al. 2000). This hotspot is associated with 222 probe sets, representing 190 distinct well-annotated genes, that are enriched for specific gene ontology (GO) terms, including “protein localization” (P = 4.2 × 10⁻⁶), “protein transport” (P = 5.7 × 10⁻⁶), and “establishment of protein localization” (P = 6.4 × 10⁻⁶). Hence, given its molecular function Kif13a is likely to be a candidate master regulator of the genes implicated with protein transport, and whose expression is associated with marker D13Mit91.

The other two newly identified markers, D18Mit9 and D18Mit202, are located on mouse chromosome 18. D18Mit9 resides within a known QTL (Hdlq30) involved in HDL cholesterol levels (Korstanje et al. 2004) whereas D18Mit202 resides within a known diabetes susceptibility/resistance locus (Idd21, insulin-dependent diabetes susceptibility 21) (Hall et al. 2003).

Human data set:

The human data set included 648 probe sets, representing 516 unique and well-annotated genes (Ensemble GRCh37), that were found to be coexpressed in monocytes, delineating a network driven by the IRF7 transcription factor in 1490 individuals from the Gutenberg Heart Study (GHS) (for details on the network analysis, see Heinig et al. (2010)). This IRF7-driven inflammatory network (IDIN) was also reconstructed in a distinct population cohort: 758 subjects from the Cardiogenics Study showing significant overlap with the network in GHS. The “core” of the network consisted of a small gene set (q = 17), including IRF7 and coregulated target genes, the expression of which was found to be trans-regulated by a locus on human chromosome 13q32 using MANOVA in Cardiogenics (Heinig et al. 2010). However, this trans-regulation was not found in the GHS study, using similar MANOVA analysis.

Here we take a new look and use HESS to analyze the larger IDIN with 648 probe sets in the GHS population (n = 1490 individuals) and the SNP set (P = 209) spanning 1 Mb on chromosome 13q32 (data available upon request from Stefan Blankenberg under the framework of a formalized collaboration via a Memorandum Transfer Agreement). While MOM and BAYES fail to detect any signal at this locus, using HESS we found two SNPs, rs9557207 and rs11616269, which are detected as hotspots for the IDIN expression with tail posterior probability 0.83 and 0.91, respectively (Figure 5). These SNPs are located 45.3 kb (rs9557207) and 25.1 kb (rs11616269) from SNP rs9585056, which previously showed significant trans-effect on the core gene set of the network in Cardiogenics (P = 5.0 × 10⁻³). This region was also associated with EBI2 expression (P = 2.2 × 10⁻⁸), the candidate gene at this locus, and with type I diabetes (T1D) (P = 7.0 × 10⁻¹⁰) (Heinig et al. 2010). For the two identified hotspots, we looked in detail at each transcript–marker association and compute their BF as given in (9). We observe that 26 and 13 transcripts show clear evidence of associations (BF > 10, Kass and Raftery 2007) in the two hotspots identified (Table S3) delineating the extent of regulatory effects. To further calibrate this evidence, we investigated BF for marker–transcript associations in a comparable simulated set-up, that of SIM3. Using the threshold BF > 10 would lead to declaring <5% false positive marker–transcript associations in the identified hotspots (data not shown). Note that most of these transcripts (80%) are found only in the network inferred in GHS and not with the Cardiogenics network, suggesting a complex pattern of regulatory effects around locus rs9585056 which is highlighted in a specific manner in each population. These population-specific regulatory effects could reflect differences in monocytes selection protocols between GHS and Cardiogenics (see Heinig et al. (2010) for details). However, the identification of hotspots at the 13q32 locus by HESS in GHS represents a significant replication of the findings previously reported, which reflects the increased power of HESS over alternative methods.

Figure 5

Open in new tab Download slide

Tail posterior probability for each marker in the human data example (Gutenberg Heart Study, n = 1490, P = 209, and q = 648). Red triangles indicate markers that have been identified as hotspots with tail posterior probability >0.8. The vertical gray line highlights the physical position of annotated SNP rs9557217 and rs9585056 that were previously associated with IDIN network in the Cardiogenics Study cohort and EBI2 expression (Heinig et al. 2010). Thick horizontal bars on the top of the figure display physical position of genes in the 1-Mb region obtained from Ensemble database.

Discussion

We have presented a new hierarchical model and algorithm, HESS, for regression analysis of a large number of responses and predictors and have applied this to hotspot discovery in eQTL experiments. Simulating a variety of complex scenarios, we have demonstrated that our approach outperforms currently used algorithms. In particular, HESS shows increased power to detect hotspots when a large number of transcripts are jointly analyzed. This is due to the propensity measure ρ_j that we use, which quantifies the latent hotspot effect independently of the response dimensionality. One improvement of HESS over vanilla MCMC-based algorithms is in the search procedure that efficiently probes alternative models and assesses their importance, thus providing a reliable model space exploration (Bottolo and Richardson 2010). We have also illustrated the potential of HESS to discover regulatory hotspots in two eQTL studies that encompass diverse genetic contexts (animal model and human data). In contrast to other methods, using HESS, we were able to replicate an established regulatory control of a large inflammatory network in humans (Heinig et al. 2010). Moreover, in the mouse data set, we identified a new candidate (Kif13a) for the regulation of a set of genes implicated in protein transport, which was not detected by other approaches.

Our model is embedded in the linear regression framework with additive effects. One distinct feature of our formulation is the multiplicative decomposition of the selection probabilities and its hierarchical set-up, which allows other structures and/or different types of prior information to be included. For example, specific weights π_kj (suitable normalized) could be added in (5), ω_jk = ω_k × ρ_j × π_kj, to provide additional prior information about the regulation of kth transcript. This may include cis-acting genetic control or auxiliary information on regulatory effects of the jth SNP (i.e., evolutionary conservation, coding, noncoding, genomic location, etc.) (Lee et al. 2009). Likewise, additional structure on the responses (e.g., KEGG pathways membership, predicted targets of transcription factors, protein complexes, etc.) could be included using k-indexed weights, ω_k, to favor detection of hotspots for similar responses.

Another possible extension of our method is the inclusion of interactions in the linear model and their efficient detection. Recent advances in this direction have employed either a stepwise search for interactions between preselected main effects (Wang et al. 2011) or partition models with which to discover modules or clusters of transcript–marker responses (Zhang et al. 2010). Such approaches could be embedded in our variable selection algorithm.

The current Matlab version of HESS represents a first step toward a more efficient implementation in high-level coding languages (currently undergoing), taking advantage of the existing C++ version of ESS algorithm (Bottolo et al. 2011). The approach that we propose here is ideally suited after prioritizing candidate genomic regions or gene networks, as shown in the discussed human case study. The flexibility to incorporate prior biological knowledge makes our method suitable for a wide range of analyses beyond eQTL hotspots detection, including genetic regulation of miRNA targets and metabolic and epigenetic phenotypes.

Acknowledgements

We thank two anonymous referees and the associate editor for their comments that improved the manuscript. L.B. received funding from the Wellcome Trust Value-in-People award. S.R. gratefully acknowledges support from ESRC National Centre for Research Methods (BIAS II node, grant RES-576–25–0015). L.B. and S.R. acknowledge support from the Medical Research Council (grant G1002319). E.P. and S.A.C. acknowledge support from the Medical Research Council and the British Heart Foundation (grant FS/11/25/28740).

Footnotes

Communicating editor: G. A. Churchill

Literature Cited

Altshuler

D

,

Brooks

L D

,

Chakravarti

A

,

Collins

F S

,

Daly

M D

et al. ,

2005

A haplotype map of the human genome

.

Nature

437

:

1299

–

1320

.

Google Scholar

PubMed

OpenURL Placeholder Text

WorldCat

Banerjee

S

,

Yandell

B S

,

Yi

N

,

2008

Bayesian quantitative trait loci mapping for multiple traits

.

Genetics

179

:

2275

–

2289

.

Bottolo

L

,

Richardson

S

,

2010

Evolutionary stochastic search for Bayesian model exploration

.

Bayesian Anal.

5

:

583

–

618

.

Google Scholar

Crossref

WorldCat

Bottolo

L

,

Chadeau-Hyam

M

,

Hastie

D I

,

Langley

S R

,

Petretto

E

et al. ,

2011

ESS++: a C++ objected-oriented algorithm for Bayesian stochastic search model exploration

.

Bioinformatics

27

:

587

–

588

.

Breitling

R

,

Li

Y

,

Tesson

B M

,

Fu

J

,

Wiltshire

T

et al. ,

2008

Genetical genomics: spotlight on QTL hotspots

.

PLoS Genet.

4

:

e1000232

.

Broman

K W

,

Sen

S

,

2009

A Guide to QTL Mapping with R/qtl

.

Springer-Verlag

,

Berlin

.

Chen

Y

,

Zhu

J

,

Lum

P Y

,

Yang

X

,

Pinto

S

et al. ,

2008

Variations in DNA elucidate molecular networks that cause disease

.

Nature

452

:

429

–

435

.

Chun

H

,

Keleş

S

,

2009

Expression quantitative trait loci mapping with multivariate sparse partial least squares regression

.

Genetics

182

:

79

–

90

.

Cookson

W

,

Liang

L

,

Abecasis

G

,

Moffatt

M

,

Lathrop

M

,

2009

Mapping complex disease traits with global gene expression

.

Nat. Rev. Genet.

10

:

184

–

194

.

George

E I

,

McCulloch

R E

,

1997

Approaches for Bayesian variable selection

.

Statist. Sinica

7

:

339

–

373

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Ghosh

D

,

Chen

W

,

Raghunathan

T

,

2006

The false discovery rate: a variable selection perspective

.

J. Statist. Plann. Inference

136

:

2668

–

2684

.

Google Scholar

Crossref

WorldCat

Gramacy

R B

,

Samworth

R J

,

King

R

,

2010

Importance tempering

.

Stat. Comput.

20

:

1

–

7

.

Google Scholar

Crossref

WorldCat

Hall

R J

,

Hollis-Moffatt

J E

,

Merriman

M E

,

Green

R A

,

Baker

D

et al. ,

2003

An autoimmune diabetes locus (Idd21) on mouse chromosome 18

.

Mamm. Genome

14

:

335

–

339

.

Heinig

M

,

Petretto

E

,

Wallace

C

,

Bottolo

L

,

Rotival

M

et al. ,

2010

A trans-acting locus regulates an anti-viral expression network and type 1 diabetes risk

.

Nature

467

:

460

–

464

.

Jia

Z

,

Xu

S

,

2007

Mapping quantitative trait loci for expression abundance

.

Genetics

176

:

611

–

623

.

Kass

R E

,

Raftery

A E

,

2007

Bayes factor

.

J. Am. Stat. Assoc.

90

:

773

–

795

.

Google Scholar

Crossref

WorldCat

Kendziorski

C M

,

Chen

M

,

Yuan

M

,

Lan

H

,

Attie

A D

,

2006

Statistical methods for expression quantitative trait loci (eQTL) mapping

.

Biometrics

62

:

19

–

27

.

Kohn

R

,

Smith

M

,

Chan

D

,

2001

Nonparametric regression using linear combinations of basis functions

.

Stat. Comput.

11

:

313

–

322

.

Google Scholar

Crossref

WorldCat

Korstanje

R

,

Li

R

,

Howard

T

,

Kelmenson

P

,

Marshall

J

et al. ,

2004

Influence of sex and diet on quantitative trait loci for hdl cholesterol levels in an SM/J by NZB/BlNJ intercross population

.

J. Lipid Res.

45

:

881

–

888

.

Lan

H

,

Chen

M

,

Flowers

J B

,

Yandell

B S

,

Stapleton

D S

et al. ,

2006

Combined expression trait correlations and expression quantitative trait locus mapping

.

PLoS Genet.

2

:

e6

.

Lee

S I

,

Dudley

A M

,

Drubin

D

,

Silver

P

,

Krogan

N J

et al. ,

2009

Learning a prior on regulatory potential from eQTL data

.

PLoS Genet.

5

:

e1000358

.

Majewski

J

,

Pastinen

T

,

2011

The study of eQTL variations by RNA-seq: from SNPs to phenotypes

.

Trends Genet.

27

:

72

–

79

.

Nakagawa

T

,

Setou

M

,

Seog

D

,

Ogasawara Dohmae

N

et al. ,

2000

A novel motor, KIF13A, transports mannose-6-phosphate receptor to plasma membrane through direct interaction with AP-1 complex

.

Cell

103

:

569

–

581

.

Petretto

E

,

Bottolo

L

,

Langley

S R

,

Heinig

M

,

McDermott-Roe

M C

et al. ,

2010

New insights into the genetic control of gene expression using a Bayesian multi-tissue approach

.

PLOS Comput. Biol.

6

:

e1000737

.

Richardson

S

,

Thomson

A

,

Best

N

,

Elliott

P

,

2004

Interpreting posterior relative risk estimates in disease mapping studies

.

Environ. Health Perspect.

112

:

1016

–

1025

.

Roberts

G O

,

Rosenthal

J S

,

2009

Examples of adaptive MCMC

.

J. Comput. Graph. Statist.

9

:

349

–

367

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Schadt

E E

,

2009

A molecular networks as sensors and drivers of common human diseases

.

Nature

461

:

218

–

223

.

Scheet

P

,

Stephens

M

,

2006

A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase

.

Am. J. Hum. Genet.

78

:

629

–

644

.

Sun

W

,

Ibrahim

J G

,

Zou

F

,

2010

Genomewide multiple-loci mapping in experimental crosses by iterative adaptive penalized regression

.

Genetics

185

:

349

–

359

.

Wang

P

,

Dawson

J A

,

Yandell

M P Keller, B S

,

Thornberry

N A

et al. ,

2011

A model selection approach for expression quantitative trait loci (eQTL) mapping

.

Genetics

187

:

611

–

621

.

Wu

C

,

Delano

D L

,

Mitro

N

,

Su

S V

,

Janes

J

et al. ,

2008

Gene set enrichment in eqtl data identifies novel annotations and pathway regulators

.

PLoS Genet.

5

:

e1000070

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Xu

C

,

Wang

X

,

Li

Z

,

Xu

S

,

2008

Mapping QTL for multiple traits using Bayesian statistics. Genet. Res

.

Camb.

91

:

23

–

37

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Yi

N

,

Shriner

D

,

2008

Advances in Bayesian multiple QTL mapping in experimental designs

.

Heredity

100

:

240

–

252

.

Yi

N

,

Xu

S

,

2008

Bayesian lasso for quantitative trait loci mapping

.

Genetics

179

:

1045

–

1055

.

Yi

N

,

Shriner

D

,

Banerjee

S

,

Mehta

T

,

Pomp

D

et al. ,

2007

An efficient Bayesian model selection approach for interacting quantitative trait loci models with many effects

.

Genetics

176

:

1865

–

1877

.

Yvert

G

,

Brem

R B

,

Whittle

J

,

Akey

J M

,

Foss

E

et al. ,

2003

Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors

.

Nat. Genet.

35

:

57

–

64

.

Zeller

T

,

Wild

P

,

Szymczak

S

,

Rotival

M

,

Schillert

A

et al. ,

2010

Genetics and beyond: the transcriptome of human monocytes and disease susceptibility

.

PLoS ONE

5

:

e10693

.

Zhang

W

,

Zhu

J

,

Schadt

E E

,

Liu

J S

,

2010

A Bayesian partition method for detecting pleiotropic and epistatic eQTL modules

.

PLOS Comput. Biol.

6

:

e1000642

.

Author notes

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.111.131425/DC1.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Download all slides

Month:	Total Views:
February 2021	1
March 2021	7
April 2021	12
May 2021	6
June 2021	7
July 2021	4
August 2021	6
September 2021	3
October 2021	4
November 2021	4
December 2021	5
January 2022	5
February 2022	9
March 2022	12
April 2022	18
May 2022	3
June 2022	5
July 2022	8
August 2022	7
September 2022	16
October 2022	1
November 2022	5
December 2022	4
January 2023	2
February 2023	3
March 2023	8
April 2023	11
May 2023	3
June 2023	4
July 2023	6
August 2023	8
September 2023	4
October 2023	6
November 2023	11
December 2023	10
January 2024	10
February 2024	18
March 2024	1
April 2024	13

Article Contents

Bayesian Detection of Expression Quantitative Trait Loci Hot Spots

Abstract

Theory and Methods

Hierarchical related sparse regression

Prior set-up

Posterior inference

Postprocessing analysis

Results

Simulation studies

Power to detect hotspots:

Power to detect transcript-marker associations:

Real case studies

Mouse data set:

Human data set:

Discussion

Acknowledgements

Footnotes

Literature Cited

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

Bayesian Detection of Expression Quantitative Trait Loci Hot Spots

Abstract

Theory and Methods

Hierarchical related sparse regression

Prior set-up

Posterior inference

Postprocessing analysis

Results

Simulation studies

Power to detect hotspots:

Power to detect transcript-marker associations:

Real case studies

Mouse data set:

Human data set:

Discussion

Acknowledgements

Footnotes

Literature Cited

Author notes

Supplementary data

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only