## Abstract

Expression quantitative trait loci (eQTL) mapping constitutes a challenging problem due to, among other reasons, the high-dimensional multivariate nature of gene-expression traits. Next to the expression heterogeneity produced by confounding factors and other sources of unwanted variation, indirect effects spread throughout genes as a result of genetic, molecular, and environmental perturbations. From a multivariate perspective one would like to adjust for the effect of all of these factors to end up with a network of direct associations connecting the path from genotype to phenotype. In this article we approach this challenge with mixed graphical Markov models, higher-order conditional independences, and *q*-order correlation graphs. These models show that additive genetic effects propagate through the network as function of gene–gene correlations. Our estimation of the eQTL network underlying a well-studied yeast data set leads to a sparse structure with more direct genetic and regulatory associations that enable a straightforward comparison of the genetic control of gene expression across chromosomes. Interestingly, it also reveals that eQTLs explain most of the expression variability of network hub genes.

- eQTL
- gene network
- exact-likelihood-ratio test
- conditional Gaussian distribution
- mixed graphical Markov model

THE simultaneous assay of gene-expression profiling and genotyping on the same samples with high-throughput technologies provides one of the primary types of integrative genomics data sets, the so-called *genetical genomics* data (Jansen and Nap 2001). First studies producing such data showed that, for many genes, gene expression is a heritable trait (Brem *et al.* 2002). Following this observation, it soon became evident that gene expression may act as an intermediate data tier that can potentially increase our power to map the genetic component of phenotypic variability in complex traits, such as human disease (Schadt *et al.* 2003). The genetic variants associated with each of these thousands of molecular phenotypes are known as expression quantitative trait loci (eQTL) and they can be broadly categorized into *cis*-acting and *trans*-acting associations, depending on their location relative to the gene whose expression levels map to the eQTL. [We use here the terms *cis* and *trans* to refer to what is also known as *local* and *distant* QTLs (Rockman and Kruglyak 2006), respectively.] Because the relative concentration of RNA molecules reflects functional relationships between genes, overlaying the correlation structure of gene expression on the eQTL associations provides an eQTL network, which can help to approach the problem of reverse engineering the genotype–phenotype map with natural variation (Rockman 2008).

A straightforward way to map eQTL networks to the genome is by treating gene-expression profiles as independent continuous traits and applying classical QTL mapping techniques such as single-marker regression (Tesson and Jansen 2009). However, different than higher-level phenotypes such as disease onset, adult height, or yeast growth rates, gene expression is a high-dimensional multivariate trait involving measurements from thousands of genes coordinately acting under complex molecular regulatory programs. This feature makes eQTL mapping a challenging problem for, at least, two reasons. One is that gene-expression profiles can be highly correlated as a product of gene regulation, thereby complicating the distinction between direct and indirect effects when marginally inspecting eQTL associations that involve only one gene at a time. The other is that high-throughput expression profiling can be very sensitive to nonbiological factors of variation such as batch effects (Leek and Storey 2007; Leek *et al.* 2010), introducing heterogeneity and spurious correlations between gene-expression measurements. These artifacts may compromise the statistical power to map truly biological eQTLs (Stegle *et al.* 2010) or show up as interesting genetic switches with broad pleiotropic effects affecting a large number of genes, commonly known as eQTL hotspots (Leek and Storey 2007; Breitling *et al.* 2008).

These problems can be addressed by estimating and including the confounding factors in the model as main (Leek and Storey 2007; Stegle *et al.* 2010) or mixed effects (Kang *et al.* 2008; Listgarten *et al.* 2010) and restricting the eQTL search to *cis*-acting variants located in the regulatory regions of the gene to which they are associated (Montgomery *et al.* 2010).

Yet *trans*-acting eQTL have proven to be crucial to our understanding of complex regulatory mechanisms. A canonical example is locus control regions (Li *et al.* 2002) that enhance the expression of distal genes under tissue-specific conditions such as the one affecting the human *β*-globin locus. Recent contributions have shown that *trans*-acting mechanisms often mediate the genetic basis of disease (Westra *et al.* 2013).

The importance of identifying nonspurious *trans*-acting eQTLs has been widely recognized and a large number of approaches that aim to address the problems described above exist in the literature. They can be broadly categorized into those extending univariate single-marker regression models to adjust for confounding effects (Leek and Storey 2007; Stegle *et al.* 2010; Listgarten *et al.* 2010) and those using multivariate approaches. The latter can be further categorized into Bayesian networks using conditional mutual information with constraint-based algorithms (Zhu *et al.* 2004), empirical Bayes hierarchical mixtures (Kendziorski *et al.* 2006), directed versions of the PC algorithm (Chaibub Neto *et al.* 2008), structural equation models (Liu *et al.* 2008), sparse partial least squares (Chun and Keleş 2009), fused Lasso regression methods (Kim and Xing 2009), random forests (Michaelson *et al.* 2010), mixed Bayesian networks using the Bayesian information criterion (BIC) with homogeneous conditional Gaussian regression models (Chaibub Neto *et al.* 2010), mixed graphical Markov models restricted to tree network topologies (Edwards *et al.* 2010), sparse factor analysis (Parts *et al.* 2011), and conditional independence tests of order one (Bing and Hoeschele 2005; Chen *et al.* 2007; Kang *et al.* 2010; Chaibub Neto *et al.* 2013).

Mixed graphical Markov models and conditional independence constitute a natural extension of classical QTL mapping to multivariate phenotype vectors. This enables a smooth transition from mapping single phenotypes to building eQTL networks providing an easier statistical interpretation of the resulting associations. However, a main shortcoming of currently available methods based on mixed graphical Markov models and conditional independence is that they are restricted to conditioning on one other gene to disentangle direct and indirect relationships.

Using higher-order conditional independences on high-dimensional data, such as the one produced by genetical genomics experiments, is not trivial. The main purpose of this article is twofold: first, to show a way to use higher-order conditional independence and mixed graphical Markov models in eQTL mapping by means of *q*-order correlation graphs, and second, to demonstrate that this approach can help to obtain valuable insight into the genetic regulatory architecture of gene expression in yeast.

## Materials and Methods

### Expression measurements and genotyping

Throughout this article we use a well-studied genetical genomics data set generated from two yeast strains, a wild-type (RM11-1a) and a lab strain (BY4716) that were crossed to generate 112 segregants, which were genotyped and whose gene expression was profiled using two-channel microarray chips, including a dye swap (Brem and Kruglyak 2005). We applied background correction, discarded control probes, and normalized the raw expression data within and between arrays using the limma package (Smyth and Speed 2003; Ritchie *et al.* 2007). To correct for possible dye effects, we averaged the normalized expression values of the dye-swapped arrays. This first set of normalized data consisted of 6216 genes and 2906 genotype markers, *i.e.*, a total of *p* = 9122 features, by *n* = 112 samples.

Using the geno.table() and findDupMarkers() functions from the R/qtl package (Broman *et al.* 2003) we identified 274 markers with >5% missing genotypes and 749 markers with duplicated genotypes in other 260 markers. These 274 + 749 = 1023 markers were discarded from further analysis. Among the remaining 1883 we discarded 26 showing Mendelian segregation distortion at Holm’s familywise error rate (FWER) <0.01. We also removed 75 genes that could not be found in the March 2014 version of the sgdGene annotation table for the sacCer3 version of the yeast genome at the University of California—Santa Cruz (UCSC) Genome Browser (http://www.genome.ucsc.edu). These filtering steps resulted in a final data set of 1857 markers and 6141 genes, *i.e.*, a total of *p* = 7998 features, by *n* = 112 samples. Missing genotypes were treated by complete-case analysis. The use of marginal distributions in the next section enables this approach.

### Software and reproducibility

The algorithms described in this article are implemented in the open source R/Bioconductor package qpgraph available for download at http://www.bioconductor.org. Scripts with the R code reproducing the results of this article are available at http://functionalgenomics.upf.edu/supplements/eQTLmixedGMMs.

### Mixed graphical Markov models of eQTL networks

We assumed that gene expression forms a *p*-multivariate sample following a conditional Gaussian distribution given the joint probability of all genetic variants. Under this assumption a sensible model for an eQTL network is a mixed graphical Markov model, GMM (Lauritzen and Wermuth 1989). A mixed GMM enables the integration of the joint distribution of discrete genotypes with the joint distribution of continuous expression measurements in a single multivariate statistical model satisfying a set of restrictions of conditional independence encoded by means of a graph; see (Lauritzen 1996; Edwards 2000) for a comprehensive description of this type of statistical model.

Here, we review part of the mixed GMM theory required for this article. Mixed GMMs are statistical models representing probability distributions involving discrete random variables (r.v.’s), denoted by *I _{δ}* with

*δ*∈ Δ, and continuous r.v.’s, denoted by

*Y*with

_{γ}*γ*∈ Γ. This class of GMMs are determined by marked graphs

*G*= (

*V*,

*E*) with

*p*marked vertices

*V*= Δ ∪ Γ, and edge set

*E*⊆

*V*×

*V*. Vertices

*δ*∈ Δ are depicted by solid circles,

*γ*∈ Γ by open ones and the entire set of them,

*V*, index the vector of r.v.’s

*X*= (

*I*,

*Y*). In our context, continuous r.v.’s

*Y*correspond to genes and discrete r.v.’s

*I*to markers or eQTLs; see Figure 1A for a graphical representation of one such mixed GMM. We denote the joint sample space of

*X*by (1)where

*i*are discrete values corresponding to genotype alleles from the marker or eQTL

_{δ}*I*and

_{δ}*y*are continuous values of expression from gene

_{γ}*Y*. The set of all possible joint discrete levels

_{γ}*i*is denoted as ℐ. Following Lauritzen and Wermuth (1989), we assume that the joint distribution of the variables

*X*is conditional Gaussian (also known as CG-distribution) with density function: (2)This distribution has the property that continuous variables follow a multivariate normal distribution

_{|Γ|}(

*μ*(

*i*), Σ(

*i*)) conditioned on the discrete variables. The parameters (

*p*(

*i*),

*μ*(

*i*), Σ(

*i*)) are called moment characteristics where

*p*(

*i*) is the probability that

*I*=

*i*, and

*μ*(

*i*) and Σ(

*i*) are the conditional mean and covariance matrix of

*Y*, which depend on joint discrete level

*i*. If the covariance matrix is constant across the levels of ℐ, that is, Σ(

*i*) ≡ Σ, the model is

*homogeneous*. Otherwise, the model is said to be

*heterogeneous*. Throughout this article we assume a homogeneous mixed GMM underlying the eQTL network. This implies that genotypes affect only the mean expression level of genes and not the correlations between them. We can write the logarithm of the density in terms of the canonical parameters (

*g*(

*i*),

*h*(

*i*),

*K*(

*i*)): (3)where

CG distributions satisfy the Markov property if and only if their canonical parameters are expanded into interaction terms such that only interactions between adjacent vertices are present (Lauritzen 1996). Thus, (7)where *λ _{d}*(

*i*) > 0, with

*d*⊆ Δ complete in

*G*, represent the discrete interactions among the variables indexed by

*d*;

*η*(

_{d}*i*)

_{γ}> 0, with

*d*∪ {

*γ*} complete in

*G*, represent the mixed interactions between

*X*and the variables indexed by

_{γ}*d*; and

*ψ*(

_{d}*i*)

*> 0, with*

_{γη}*d*∪ {

*γ*,

*η*} complete in

*G*, represent the quadratic interactions between

*X*,

_{γ}*X*and the variables indexed by

_{η}*d*. If the model is homogeneous, there are no mixed quadratic interactions,

*i.e.*,

*ψ*(

_{d}*i*)

*= 0 for*

_{γη}*d*≠ Ø. Plugging these expansions into Equation 3 we obtain

#### Decomposable mixed GMMs:

An important subclass of mixed GMMs is defined by decomposable marked graphs. A triple (*A*, *B*, *C*) of disjoint subsets of *V* forms a decomposition of an undirected marked graph *G* if *V* = *A* ∪ *B* ∪ *C* and: (1) *C* is a complete subset of *V*; (2) *C* separates *A* from *B*; and (3) *C* ⊆ Δ or *B* ⊆ Γ. An undirected marked graph *G* is said to be *decomposable* if it is complete or if there exists a proper decomposition (*A*, *B*, *C*) such that the subgraphs *G _{A}*

_{∪}

*and*

_{C}*G*

_{B}_{∪}

*are decomposable. In different terms, when*

_{C}*G*is undirected, decomposability of

*G*holds if and only if

*G*does not contain chordless cycles of length larger than three and does not contain any path between two nonadjacent discrete vertices passing through continuous vertices only.

#### Maximum-likelihood estimates of mixed GMMs:

Let = {*x*^{(}^{ν}^{)}} = {(*i*^{(}^{ν}^{)}, *y*^{(}^{ν}^{)})} be a sample of *ν* = 1, …, *n*, independent and identically distributed observations from a CG distribution. For an arbitrary subset *A* ⊆ *V*, we abbreviate to *i _{A}* =

*i*

_{A}_{∩Δ}, ℐ

*= ℐ*

_{A}_{Δ∩}

*and*

_{A}*y*=

_{A}*y*

_{A}_{∩Γ}and the following sampling statistics are defined:

The likelihood function for the homogeneous, saturated model attains its maximum if and only if *n* ≥ |Γ| + |ℐ|, which is almost surely equal to *n*(*i*) > 0 for all *i* ∈ ℐ (Lauritzen 1996, Proposition 6.10). In such a case the maximum-likelihood estimates (MLEs) of the moment characteristics are defined as (15)It follows, then, that saturated mixed GMMs cannot be directly estimated from data with *p* ≫ *n*, using only the formulas described above. For the unsaturated case, decomposable mixed GMMs also admit explicit MLEs. In the homogeneous decomposable case it can be shown (Lauritzen 1996, Propositoin 6.21) that the MLE exists almost surely if and only if *n*(*i _{C}*) ≥ |

*C*∩ Γ| + |ℐ

*| for all cliques*

_{C}*C*of

*G*and

*i*∈ ℐ

_{C}*. In this case, MLEs are defined with the following canonical parameters (Lauritzen 1996, p. 189): (16) (17) (18)where*

_{C}*S*

_{1}=

*∅*. The matrices [

*M*]

^{|Γ|}of Equation 18 are defined as follows. Given a matrix

*M*= {

*m*}

_{γη}_{|}

_{A}_{|×|}

_{A}_{|}of dimension |

*A*| × |

*A*| with

*A*⊆ Γ, [

*M*]

^{|Γ|}is a |Γ| × |Γ| matrix such thatAnalogously, in Equation 17, [

*M*]

^{|Γ|}is a |Γ|-length vector obtained from a |

*A*|-length vector

*M*.

### Simulation of eQTL network models with mixed GMMs

In this subsection, we describe how to simulate eQTL networks with homogeneous mixed GMMs and data from them. We restrict ourselves to the case of a backcross, in which each genotype marker can have two different alleles AA and AB. However, these procedures could be extended to more complex cross models allowing for other than linear additive effects (codominant model) on the mixed associations, such as dominance effects.

#### Simulation of eQTL network structures:

The first step to simulate a GMM consists of simulating its associated graph *G* = (*V*, *E*) that defines the structure of the eQTL network. eQTLs and expression profiles are represented in *G* by discrete vertices Δ and continuous vertices Γ, respectively, such that *V* = Δ ∪ Γ and |*V*| = *p*. In the context of genetical genomics data we make the assumption that discrete genotypes affect gene-expression measurements and not the other way around. Thus, we consider the underlying graph *G* as a marked graph where some edges are directed and represented by arrows and some are undirected. More concretely, *G* will have arrows pointing from discrete vertices to continuous ones, undirected edges between continuous vertices, and no edges between discrete vertices. From this restriction, it follows that there are no semidirected cycles and allows one to interpret these GMMs as *chain graphs*, which are graphs formed by undirected subgraphs connected by directed edges (Lauritzen 1996).

#### Simulation of parameters of a homogeneous mixed GMM:

Here we show how to simulate the parameters of the homogeneous CG distribution represented by *G* with given marginal linear correlations of magnitude *ρ* on the pure continuous (gene–gene) associations and given additive effects of magnitude *a* on the mixed (eQTL) ones.

##### Covariance matrix:

Given the structure of a graph *G*, a random covariance matrix Σ can be simulated as follows. Let *G*_{Γ} ⊆ *G* denote the subgraph of *p*_{Γ} = |Γ| pure continuous vertices and let *ρ* denote the desired mean marginal correlation between each pair of continuous r.v.’s (*X _{γ}*,

*X*) such that (

_{η}*γ*,

*η*) ∈

*G*

_{Γ}. Let

^{+}(

*G*

_{Γ}) ⊂

^{+}denote the set of all

*p*

_{Γ}×

*p*

_{Γ}positive definite matrices in

^{+}such that every matrix

*S*∈

^{+}(

*G*

_{Γ}) satisfies that {

*S*

^{−1}}

*= 0 whenever*

_{ij}*i*≠

*j*and (

*i*,

*j*) ∉

*G*

_{Γ}.

We simulate Σ such that Σ ∈ ^{+}(*G*_{Γ}) in two steps. First, we build an initial positive definite matrix and, from this, we build the *incomplete matrix* Σ_{0} with elements if either *i* = *j* or (*i*, *j*) ∈ *G*_{Γ}, and the remaining elements unspecified. Second, we search for a *positive completion* of Σ_{0}, which consists of filling up Σ_{0} in such a way that the resulting Σ ∈ ^{+}(*G*_{Γ}).

It can be shown (Grone *et al.* 1984) that the incomplete matrix Σ_{0} admits a positive completion and that it is unique. This means that, given Σ_{0}, we can use algorithms for maximum-likelihood estimation or Bayesian conjugate inference (Roverato 2002) as matrix completion algorithms. To this end, we first draw Σ_{0} from a Wishart distribution with and , where for *i* = *j* and for *i* ≠ *j*. It is required that Λ ∈ ^{+} and this happens if and only if −1/(*p*_{Γ} − 1) < *ρ* < 1 (Seber 2007, p. 317). Finally, we apply the iterative regression procedure introduced by Hastie *et al.* ( 2009, p. 634) for maximum-likelihood estimation of Gaussian GMMs with known structure, as matrix completion algorithm to obtain Σ from Σ_{0}.

##### Probability of discrete levels:

Since we are simulating a backcross, each discrete r.v. takes two possible values *i _{δ}* = {1, 2} with equal probability

*p*(

*i*= 1) =

_{δ}*p*(

*i*= 2) = 0.5. For the sole purpose of simulating eQTL networks we made the simplifying assumption that discrete r.v.’s representing eQTLs are marginally independent between them. From these two assumptions it follows that joint levels

_{δ}*i*∈ ℐ are uniformly distributed; that is,

*p*(

*i*) = 1/|ℐ|, ∀

*i*∈ ℐ.

##### Conditional mean vector:

Let a discrete variable *I _{δ}* have an additive effect

*a*on the continuous variable

_{δγ}*Y*. In the case of a backcross this implies that (19)Using Equation 5 we can calculate the conditional mean vector as

_{γ}*μ*(

*i*) = Σ ×

*h*(

*i*). This enables simulating the covariance matrix independently from discrete r.v.’s while interpreting it as a conditional one to generate

*μ*(

*i*). The values of the canonical parameter

*h*(

*i*) = {

*h*(

_{γ}*i*)}

_{γ}_{∈Γ}determine the strength of the mixed interactions between discrete and continuous r.v.’s. Using Equation 7 they can be calculated as follows:

Without loss of generality, values *η*_{∅}* _{γ}* are set to zero. To set values

*η*(1)

_{δ}*and*

_{γ}*η*(2)

_{δ}_{γ}, so that both Equations 5 and 19 are satisfied, we proceed as follows. Assume that an eQTL

*I*has a pleiotropic effect on a set of genes , where

_{δ}*A*= {

_{δ}*γ*∈ Γ: (

*δ*,

*γ*) ∈

*E*}. By combining Equations 5 and 19, for each

*γ*∈

*A*we have that

_{δ}Note that if *j*, *k* ∈ ℐ are two discrete levels such that *j _{δ}* = 1,

*k*= 2 and

_{δ}*j*

_{Δ\{}

_{δ}_{}}=

*k*

_{Δ\{}

_{δ}_{}}, then

*h*(

_{ζ}*j*) =

*h*(

_{ζ}*k*) for all

*ζ*∉

*A*. It follows that for all

_{δ}*ζ*∉

*A*the terms

_{δ}*h*(

_{ζ}*i*′) in both summations cancel out and we obtain,

Let *η _{δγ}* =

*η*(1)

_{δ}*−*

_{γ}*η*(2)

_{δ}_{γ}and the vectors , , , and . We write the matrix form of the previous expression as

### Simulation of eQTL network models of experimental crosses

We have integrated the algorithms presented above with functions from the R/qtl package (Broman *et al.* 2003) to simulate eQTL network models of experimental crosses and data from them in the following way. First, we simulate a genetic map with a given number of chromosomes and markers using the sim.map() function of the R/qtl package.

Second, we simulate a homogeneous mixed GMM in two steps: (a) we define the, possibly random, underlying regulatory model of eQTLs and gene–gene associations and (b) we simulate the parameters (*p*(*i*), *μ*(*i*), Σ) of this homogeneous mixed GMM according to the procedures described above.

Third, we simulate data from the previous eQTL network model with the function sim.cross() from the R/qtl package. This function is overloaded in qpgraph to plug the eQTL associations into the corresponding genetic loci and return a R/qtl cross object. The function sim.cross() defined in the qpgraph package proceeds as follows. First, the genotype data are simulated by the procedures implemented in the R/qtl package. Genotypes are sampled at each marker from a Markov chain with transition probabilities that depend on the distance between markers and a mapping function. eQTLs are placed at the markers and, in particular, if eQTLs are located sufficiently away from each other, we can assume that the corresponding discrete r.v.’s are marginally independent between them. Finally, qpgraph simulates gene-expression values according to the homogeneous mixed GMM by sampling continuous observations from the corresponding parameters of the CG distribution _{|Γ|}(*μ*(*i*), Σ), given the sampled genotype *i* from all joint eQTLs.

### Conditional independence tests parametrized by mixed GMMs

Approaches to learning the structure of a mixed GMM using higher-order correlations require testing for conditional independence between any two r.v.’s *X _{α}* and

*X*, such that

_{β}*β*⊆ Γ, given a set of conditioning ones

*X*, denoted by

_{Q}*X*∐

_{α}*X*|

_{β}*X*. To this end, we use a likelihood-ratio test (LRT) between two models: a saturated model ℳ

_{Q}_{1}, determined by the complete graph

*G*

^{1}= (

*V*,

*E*

^{1}), where

*V*= {

*α*,

*β*,

*Q*} and

*E*

^{1}=

*V*×

*V*, and a constrained model ℳ

_{0}, determined by

*G*

^{0}= (

*V*,

*E*

^{0}) with exactly one missing edge between the two vertices

*α*,

*β*and thus

*E*

^{0}= {

*V*×

*V*}\(

*α*,

*β*) and

*Q*=

*V*\{

*α*,

*β*}.

Since *G*^{1} is complete and (*α*, *β*, *Q*) is a proper decomposition of *G*^{0}, ℳ_{1} and ℳ_{0} are both decomposable, and therefore, they admit explicit MLEs (Equations 16–18). Given that *V* = Δ ∪ Γ, we denote by (*γ*, *ζ*) a pair of continuous r.v.’s (*i.e.*, *γ*, *ζ* ∈ Γ), and by (*δ*, *γ*) a pair of mixed r.v.’s with *δ* ∈ Δ and *γ* ∈ Γ, so that either *Q* = *V*\{*γ*, *ζ*} or *Q* = *V*\{*δ*, *γ*} are the conditioning subsets.

In the context of homogeneous mixed GMMs, the null hypothesis of conditional independence for the pure continuous case, *γ* ∐ *ζ*|*Q*, corresponds to a zero value in the (*γ*, *ζ*) and (*ζ*, *γ*) entries of the canonical parameter *K* (see Equations 7 and 8). The log-likelihood-ratio statistic, which is twice the difference of the log-likelihoods of models ℳ_{0} and ℳ_{1}, is reduced to (see Lauritzen 1996, p. 192):

The null hypothesis of conditional independence in the mixed case, *δ* ∐ *γ*|*Q*, corresponds to an expansion of the canonical parameter *h _{γ}*(

*i*) where the terms corresponding to

*δ*are zero,

*η*(

_{δ}*i*)

*= 0 (see Equations 7 and 8). In this case, the log-likelihood-ratio statistic is (Lauritzen 1996, p. 194) (21)where Γ*

_{γ}^{∗}= Γ\{

*γ*} and Δ

^{∗}= Δ\{

*δ*}. Since models ℳ

_{1}and ℳ

_{0}are decomposable, they are collapsible onto the same set of variables

*X*

_{V}_{\{}

_{γ}_{}}; see (Edwards 2000, pp. 86–87) and (Didelez and Edwards 2004). This property implies that the density functions

*f*of ℳ

_{1}and ℳ

_{0}can be factorized as

*f*=

_{V}*f*

_{V}_{\{}

_{γ}_{}}⋅

*f*

_{γ}_{|}

_{V}_{\{}

_{γ}_{}}such that the marginal and conditional densities,

*f*

_{V}_{\{}

_{γ}_{}}∈ ℳ

_{V}_{\{}

_{γ}_{}}and

*f*

_{γ}_{|}

_{V}_{\{}

_{γ}_{}}∈ ℳ

_{γ}_{|}

_{V}_{\{}

_{γ}_{}}, respectively, can be parametrized separately. Therefore, the likelihood function of ℳ

_{1}and ℳ

_{0}can be computed as the product of the likelihood of the marginal and the conditional models, and , respectively.

It follows that the second term of these factorizations corresponds to the same saturated model induced by the complete subgraph formed by the vertices in *V*\{*γ*}, , and we have that (22)where and denote the estimates of the conditional variance of the r.v. *X _{γ}*, given the rest of the r.v.’s under the null and the alternative conditional models and , respectively. In particular, these conditional models are equivalent to the ANCOVA models (Edwards 2000, p. 91) in which the continuous r.v.

*γ*∈ Γ is the response variable and the rest are explanatory. In this context, we have that and , where RSS is the residual sum of squares of the corresponding ANCOVA model and

*n*is the sample size. The ANCOVA model corresponding to for the case of a backcross is (23)In this model,

*μ*is the phenotype’s mean, the term

*β*represents the effect of the discrete variable

_{δ}Z_{δ}*δ*∈ Δ that we are testing, and we assume that . The continuous r.v.’s in

*Q*, Γ

^{∗}, are modeled as a linear combination of r.v.’s (third summation of the equation). On the other hand, the joint levels of

*δ*and of the discrete r.v.’s in

*Q*, Δ, are encoded through (|ℐ| − 1) terms where some of them represent the main effects of each discrete r.v. (first summation). The rest of the terms encode all the interacting effects between the discrete r.v.’s (second summation). Each variable

*Z*is an indicator variable that, in the case of a backcross, takes values 0 or 1 if the genotype of

_{κ}*I*is AA or AB, respectively.

_{κ}The count of the number of parameters in the model of Equation 23 is as follows: |Δ| parameters come from the term encoding the main effect of *Z _{δ}* and the first summation; the second summation involves 2

^{|Δ|}− 1 − |Δ| terms, whereas the third one involves |Γ| − 1 terms. Thus, the saturated model described in Equation 23 has 2

^{|Δ|}+ |Γ| − 2 free parameters in total, and therefore,

*n*− 2

^{|Δ|}− |Γ| + 2 degrees of freedom, where

*n*is the sample size of the data.

For the pure continuous case, the conditional model corresponding to is the same as the one in Equation 23 except that we remove the term of the third summation that corresponds to the variable *X _{ζ}*. In this case, this model has

*n*− 2

^{|Δ|}− |Γ| + 3 degrees of freedom. Therefore, under the null hypothesis, the statistic

*D*

_{γζ}_{⋅}

*follows asymptotically a distribution with df = 1 degree of freedom.*

_{Q}In general, we can derive the degrees of freedom for the pure continuous case by writing explicitly the conditional expectation of *X _{γ}* given

*X*

_{V}_{\{}

_{γ}_{}}. Under this corresponds to (24)where and

*β*

_{γλ}_{|Γ\{}

_{γ}_{}}is the partial regression coefficient that is found through the canonical parameter

*K*= {

*k*}, ∀

_{γζ}*γ*,

*ζ*∈ Γ, as

*β*

_{γλ}_{|Γ\{}

_{γ}_{}}= −

*k*/

_{γλ}*k*(Lauritzen 1996, p. 130). This model has

_{γγ}*n*− |ℐ| − |Γ| + 1 degrees of freedom since it has |ℐ| parameters that come from the first term in Equation 24 and |Γ| − 1 from the second term. On the other hand, the conditional expectation of

*X*given

_{γ}*X*

_{V}_{\{}

_{γ}_{}}under iswhich leads to

*n*− |ℐ| − |Γ| + 2 degrees of freedom. By computing the difference in the degrees of freedom of both models, we have that

*D*

_{γζ}_{⋅}

*follows asymptotically a distribution with df = 1 degree of freedom.*

_{Q}In the mixed case, the likelihood-ratio statistic *D _{δγ}*

_{⋅}

*of Equation 21 is related to the LOD score used in QTL mappingthrough the following transformation of the LOD score: (25)In fact, since the ratio between ℒ*

_{Q}_{1}and ℒ

_{0}is equivalent to the ratio between and we have that

In this case, the conditional saturated model for a backcross is the same as the one in Equation 23. By contrast, in the conditional constrained model we delete all the terms of Equation 23 that involve the r.v. *X _{δ}*. This constrained model has

*n*− 2

^{|Δ|−1}−|Γ| + 2 free parameters. Thus, for a backcross the likelihood-ratio statistic

*D*

_{δγ}_{⋅}

*, and therefore, the transformation of the LOD score (Equation 25), follows a distribution with df = 2*

_{Q}^{|Δ|−1}degrees of freedom.

Again, we can derive the degrees of freedom of the distribution of the general case by looking at the conditional expectation of models (Equation 24) and written as (27)Here, the first term involves parameters and the second |Γ| − 1, so that the constrained model has free parameters. Hence, we have that *D _{δγ}*

_{⋅}

*, and therefore, the transformed LOD score in Equation 25 follows a distribution with degrees of freedom.*

_{Q}In fact, Lauritzen (1996, pp. 192–194) observed that, for decomposable mixed GMMs, the likelihood ratios Λ_{γζ}_{⋅}* _{Q}* in Equation 20 and Λ

_{δζ}_{⋅}

*in Equation 21 follow exactly a beta-distribution. To enable such an exact test for homogeneous mixed GMMs, we proceed to derive their corresponding parameters.*

_{Q}Due to the decomposability and collapsibility of the saturated ℳ_{1} and constrained ℳ_{0} models, we have seen that the analysis of the joint densities is equivalent to the study of the univariate conditional densities of *X _{γ}* given the rest of the variables. Concretely, for the pure continuous case, the likelihood-ratio statistic Λ

_{γζ}_{⋅}

*is equivalent to the ratio RSS*

_{Q}_{1}/RSS

_{0}, where RSS

_{0}and RSS

_{1}are the residual sum of squares of the constrained and the saturated univariate models, respectively, and both follow a -distribution, where

*k*is the number of free parameters of each model.

Let RSS_{1.0} denote the difference RSS_{0}-RSS_{1}. Following Rao (1973, p. 166), if a r.v. *X* follows a with *k* degrees of freedom it also follows a gamma distribution Γ(*k*/2, 2). Hence,and RSS_{1.0} ∼ Γ(1/2, 2). Moreover, if *X* and *Y* are two independent r.v.’s such that *X* ∼ Γ(*k*_{1}, *θ*) and *Y* ∼ Γ(*k*_{2}, *θ*), then it can be shown (Rao 1973, p. 165) that (28)where *B*(*k*_{1}, *k*_{2}) denotes the beta distribution with shape parameters *k*_{1} and *k*_{2}. Finally, if we let *X* = RSS_{1} and *Y* = RSS_{1.0}, it follows thatBy an argument analogous to the pure continuous case, the likelihood-ratio statistic raised to the power 2/*n* for the null hypothesis of a missing mixed edge follows a beta distribution with these parameters: (29)Hence, the following transformation of the LOD score,follows a beta distribution with parameters given in Equation 29. On the other hand, if we let *p*_{1} = *k*_{1}/2 and *p*_{2} = *k*_{2}/2 in Equation 28, it can be shown (Rao 1973, p. 167) that *F* = (*X*/*k*_{1})/(*Y*/*k*_{2}) ∼ *F*(*k*_{1}, *k*_{2}). This means that we can also perform an exact conditional independence test in terms of the *F*-distribution parametrized with a mixed GMM, by first calculatingfor the pure continuous and mixed cases, respectively, and then using the fact that they follow the *F*-distribution under the null, specified here asThe proportion, denoted by *η*^{2}, of (phenotypic) variance of *X _{γ}* explained by an eQTL

*X*while controlling for the rest of r.v.’s

_{δ}*X*

_{V}_{\{}

_{γ}_{}}, can be estimated as the difference between the estimated conditional variances of

*X*given

_{γ}*X*

_{V}_{\{}

_{γ}_{}}under the saturated and the constrained models, divided by the total variance of

*X*: (30)which, after applying the law of total variance, leads to (31)Note that when

_{γ}*V*\{

*γ*} = {

*δ*}, that is,

*Q*= Ø, the estimated conditional variance under the constrained model, RSS

_{0}/

*n*, is equal to the unconditional variance of

*X*;

_{γ}*i.e.*, RSS

_{0}/

*n*= var(

*X*). In this case, the proportion of the phenotypic variance explained by the eQTL reduces to (Broman and Sen 2009, p. 77)

_{γ}*q*-Order correlation graphs

There is a wide spectrum of strategies based on testing for conditional independence, which can be followed to learn a mixed GMM from data, similarly as with Gaussian GMMs for pure continuous data; see, *e.g.*, Castelo and Roverato (2006) and Kalisch and Bühlmann (2007). In the context of estimating eQTL networks from genetical genomics data, the number of genes and genotype markers *p* exceeds by far the sample size *n*; *i.e.*, *p* ≫ *n*. This fact precludes conditioning directly on the rest of the genes and markers *X _{V}*

_{\{}

_{i}_{,}

_{j}_{}}when testing for an eQTL association (

*i*,

*j*) while adjusting for all possible indirect effects. In other words, we cannot directly test for full-order conditional independences

*X*∐

_{i}*X*|

_{j}*X*

_{V}_{\{}

_{i}_{,}

_{j}_{}}.

We approach this problem using limited-order correlations, a strategy successfully applied to Gaussian GMMs (Castelo and Roverato 2006). It consists of testing for conditional independences of order *q* < (*p* − 2), *i.e.*, *X _{i}* ∐

*X*|

_{j}*X*with |

_{Q}*Q*| =

*q*, expecting that many of the indirect relationships between

*i*and

*j*can be explained by subsets

*Q*of size

*q*. The extent to which this can happen depends on the sparseness of the underlying network structure

*G*and on the number of available observations

*n*. The mathematical object that results from testing

*q*-order correlations is called a

*q*-order correlation graph, or qp-graph (Castelo and Roverato 2006), and it is defined as follows.

Let *P _{V}* be a probability distribution that is Markov over an undirected graph

*G*= (

*V*,

*E*) with |

*V*| =

*p*and an integer 0 ≤

*q*≤ (

*p*− 2). A qp-graph of order

*q*with respect to

*G*is the undirected graph

*G*

^{(}

^{q}^{)}= (

*V*,

*E*

^{(}

^{q}^{)}), where (

*i*,

*j*) ∉

*E*

^{(}

^{q}^{)}if and only if there exists a set

*U*⊆

*V*\{

*i*,

*j*} with |

*U*| ≤

*q*such that

*X*∐

_{i}*X*|

_{j}*X*holds in

_{U}*P*(Castelo and Roverato 2006).

_{V}Assuming that there are no additional independence restrictions in *P _{V}* than those in

*G*, it can be shown (Castelo and Roverato 2006) that

*G*⊆

*G*

^{(}

^{q}^{)}in the sense that every edge that is present in the true underlying network

*G*is also present in the qp-graph

*G*

^{(}

^{q}^{)}. From this fact it follows that a qp-graph

*G*

^{(}

^{q}^{)}approaches

*G*as

*q*grows large, and therefore,

*G*

^{(}

^{q}^{)}can be seen as an approximation to

*G*; see Castelo and Roverato (2006) for further details.

Because separation in undirected marked graphs with mixed discrete and continuous vertices works the same as in undirected pure graphs with either one of these two types of vertices, it follows that the definition of qp-graph also holds for mixed vertices and CG distributions *P _{V}*.

### Estimation of eQTL networks with qp-graphs

Instead of directly approaching the problem of inferring the graph structure *G* of the underlying eQTL network from genetical genomics data with *p* ≫ *n*, we propose calculating a qp-graph estimate . For this purpose, we measure the association between two r.v.’s by means of a quantity called the non-rejection rate (NRR), which is defined as follows.

Let and let be a binary r.v. associated with the pair of vertices (*i*, *j*) that takes values from the following three-step procedure: (1) an element *Q* is sampled from according to a (discrete) uniform distribution; (2) test the null hypothesis of conditional independence *H*_{0}: *X _{i}* ∐

*X*|

_{j}*X*; and (3) if the null hypothesis

_{Q}*H*

_{0}is rejected then takes value 0, otherwise takes value 1.

We have that follows a Bernoulli distribution and the NRR, denoted as , is defined as its expectancyThe NRR measure was originally developed to learn qp-graphs from pure continuous data (Castelo and Roverato 2006). However, note that by using a suitable test for the null hypothesis *H*_{0}: *X _{i}* ∐

*X*|

_{j}*X*, we can also use the NRR in mixed data sets such as those produced by genetical genomics experiments.

_{Q}It can be shown (Castelo and Roverato 2006) that the theoretical NRR is a function of the probability *α* of the type-I error of the test, the mean value *β _{ij}* of the type-II error of the test for all subsets

*Q*, and the proportion of subsets

*Q*of size

*q*that separate

*i*and

*j*in the underlying

*G*: (33)This expression helps in understanding the information conveyed by the NRR in the following way. If a pair of vertices (

*i*,

*j*) is connected in

*G*, then and . This means that for associations present in

*G*, the NRR is 1 minus the statistical power to detect that association. In such a case, depends on the strength of the association between

*X*and

_{i}*X*over all marginal distributions of size (

_{j}*q*+ 2). Moreover, note that from Equation 33 it follows that a value close to zero implies that both,

*β*and , are close to zero. This means that either (

_{ij}*i*,

*j*) is in

*G*or

*q*is too small.

Analogously, if is large, then either or *β _{ij}* are large, and we can conclude that either (

*i*,

*j*) is not present in

*G*or, otherwise, there is no sufficient statistical power to detect that association. As the statistical power to reject the null hypothesis

*H*

_{0}depends on

*n*−

*q*, the latter circumstance may be due to an insufficient sample size

*n*, a value of

*q*that is too large, or both. From these observations it follows that a NRR value close to zero indicates that (

*i*,

*j*) ∈

*G*

^{(}

^{q}^{)}while a value close to one points to the contrary, (

*i*,

*j*) ∉

*G*

^{(}

^{q}^{)}.

The estimation of for a pair of r.v.’s (*X _{i}*,

*X*) can be obtained by testing the conditional independence

_{j}*X*∐

_{i}*X*|

_{j}*X*for every . However, the number of subsets

_{Q}*Q*in can be prohibitively large. An effective approach to addressing this problem (Castelo and Roverato 2006) consists of calculating an estimate on the basis of a limited number subsets , such as 100, sampled uniformly at random.

We may be interested in explicitly adjusting for confounding factors and other covariates = {*C*_{1}, *C*_{2}, …, *C _{k}*}. It is straightforward to incorporate them into a NRR by sampling subsets

*Q*fromNote that covariates in

*can be known or, in the case of unknown confounding factors, estimated with algorithms such as SVA (Leek and Storey 2007) or PEER (Stegle*

*et al.*2010). Finally, the qp-graph estimate of the underlying eQTL network structure

*G*can be obtained by selecting those edges (

*i*,

*j*) that meet a maximum cutoff value

*ε*:

## Results

### Flow of genetic additive effects through gene expression

Using the R/qtl package (Broman and Sen 2009) we simulated a genetic map formed by one single chromosome 100 cM long and 10 equally-spaced markers. We built an eQTL network of *p*_{Γ} = 5 genes forming a chain, where the first of them had one eQTL placed randomly among the 10 markers (Figure 1A). We simulated 10 mixed GMMs with the eQTL network structure shown in Figure 1A, under increasing values of the marginal correlation between the genes (*ρ* = {0.25, 0.5, 0.75}) and of the additive effect from the eQTL on gene 1 (*a* = {0.5, 1, 2.5, 5}). We sampled 1000 data sets of *n* = 100 observations from each of these 10 models. We estimated the additive effect of the eQTL on each of the five genes, averaged over the 10,000 data sets at each combination of additive effect and marginal correlation. Note that only the additive effect on gene 1 is direct (Figure 1A).

Figure 1, B–D, gray lines, shows the estimated average additive effects for the three different marginal correlation values on the gene–gene associations. These plots demonstrate that additive effects propagate as a function of gene–gene correlations *ρ*. More concretely, when *ρ* ≥ 0.5, moderate to large additive effects may easily show up as indirect eQTL associations when inspecting the margin of the data formed by one genetic variant and one gene-expression profile. Black lines were calculated from the same data, but discarding additive effects corresponding to LOD scores <3. This recreates the effect of selection bias (Broman and Sen 2009) and shows that indirect genetic additive effects are amplified under this circumstance.

### Higher-order conditioning adjusts for confounding effects

Confounding effects in gene-expression data affect most of the genes being profiled. Sometimes the sources of confounding are known or can be estimated with methods such as SVA (Leek and Storey 2007) or PEER (Stegle *et al.* 2010) and may be explicitly adjusted by including them as main effects into the model. Often, however, these sources are unknown and it may be difficult to adjust or remove them without affecting the biological signal and underlying correlation structure that we want to estimate. Using simulations, here we show that confounding effects affecting all genes can be implicitly adjusted by using higher-order conditioning.

Using the same genetic map we simulated before, we built an eQTL network with 100 genes without associations between them and where one of the genes has an eQTL placed randomly among the 10 markers with a fixed additive effect of *a* = 2.5. A continuous confounding factor was included under two models with *ρ* = 0.5: a systematic one where the confounding factor affects all genes and a specific one where it affects only the two genes, or the gene and marker, being tested. We considered a fixed sample size of *n* = 100 and conditioning orders *q* = {0, 1, …, 50}, where *q* = 0 corresponds to the marginal association without conditioning.

We tested the presence of a gene–gene association and of an eQTL association between the marker containing the simulated eQTL and one of the genes not associated with that eQTL. Note that none of these associations were present in the simulated eQTL network. For every *q* order with *q* > 0, a subset *Q* of size *q* was sampled uniformly at random among the rest of the genes not being tested and used for conditioning. When considering the explicit adjustment of the confounding factor, this one was added to *Q* except when *q* = 0 since then *Q* = {Ø}. Once *Q* was fixed, 100 data sets were sampled from the corresponding mixed GMM and two conditional independence tests were conducted in each data set for the presence of both, the eQTL and the gene–gene association, given the sampled genes in *Q*.

Figure 2 shows the empirical type-I error rate as function of the conditioning order *q*, where Figure 2A corresponds to the eQTL association and Figure 2B corresponds to the gene–gene association. This figure shows that, as expected, the explicit inclusion of the confounding factor in the conditioning subset *Q* (dotted lines) adjusts the confounding effect immediately with *q* > 0 in both situations, when either all genes are affected or only the tested ones. When the confounding effect is not included in *Q* and affects only the tested genes (solid lines), it yields high type-I error rates that only decrease linearly with *n* − *q*, quantity on which statistical power depends. However, when confounding affects all genes (dashed lines) the type-I error rate has an exponential decay, and for *q* > 20 the confounding effect is effectively adjusted in these data.

### qp-Graph estimates of eQTL networks are enriched for *cis*-acting associations

Expression QTL acting in *cis* have more direct mechanisms of regulation than those acting in *trans* (Rockman and Kruglyak 2006; Cheung and Spielman 2009). This hypothesis is supported by the observation that *cis*-acting eQTLs often explain a larger fraction of expression variance and show larger additive effects than those acting in *trans* (Petretto *et al.* 2006; Rockman and Kruglyak 2006; Cheung and Spielman 2009). On the other hand, spurious eQTL associations tend to inflate the discovery of *trans*-acting eQTLs (Breitling *et al.* 2008). From this perspective, it makes sense to expect an enrichment of *cis*-eQTLs when indirect associations are effectively discarded (Kang *et al.* 2008; Listgarten *et al.* 2010).

We estimated NRR values on every pair (*i*, *j*) of marker and gene from the yeast data set of *n* = 112 segregants for different *q* = {25, 50, 75, 100} orders, restricting conditioning subsets to be formed by genes only. The resulting estimates , were averaged, , to account for the uncertainty in the choice of the conditioning order *q* (Castelo and Roverato 2009).

We ranked marker–gene pairs (*i*, *j*) by average NRR values and made a comparison against the ranking by *P*-value of the (exact) LRT for marginal independence (*i.e.*, where *q* = 0) to directly assess the added value of higher-order conditioning under the same type of statistical test. We considered conservative and liberal cutoff values *ε* = {0.1, 0.5} on the average NRR and obtained two different qp-graph estimates of the underlying eQTL network, denoted by , each of them having and eQTL edges.

We then selected the top-*k* number of marker–gene pairs (*i*, *j*) with lowest *P*-value in the marginal independence test, where , which led to two other estimates of the eQTL network, denoted by . Note that both and have the same number of edges, in this case, pairs (*i*, *j*) of eQTL associations between a marker and a gene, thereby enabling a direct comparison of the fraction of *cis*- and *trans*-acting selected eQTL associations.

In Figure 3 we can see dot plots of the eQTL associations present in qp-graphs (Figure 3, A and C) and those present in using the marginal approach (Figure 3, B and D). Given the same number of eQTL associations, Figure 3 shows that qp-graph estimates of the underlying eQTL network have a higher number of *cis*-acting eQTLs than the marginal test, with an enrichment between 64 and 73% using the liberal cutoff, which increases to 85 and 87% using the conservative cutoff (see Table 1). Note that with the marginal approach more vertical bands of *trans*-acting associations remained present among the strongest selected eQTLs (Figure 3D) than with the qp-graph estimate (Figure 3C). We interpret this observation as evidence of the propagation of additive effects due to strong gene–gene correlations either present in the underlying eQTL network or created by confounding effects, and possibly aggravated by selection bias, as previously shown in Figure 1.

### Performance comparison against another method

We compared the performance of the methodology introduced in this article with a recent approach for causal inference among pairs of phenotypes (Chaibub Neto *et al.* 2013). This approach, called causal model selection tests (CMSTs), is implemented in the R/CRAN package qtlhot and can be used in three different ways (parametric, nonparametric, and joint parametric), with two different penalized log-likelihood functions (Akaike information criterion, AIC, and BIC); see Chaibub Neto *et al.* (2013, p. 1005) for details.

We ran the CMST analysis on the yeast data analyzed in this article following the procedure described in Chaibub Neto *et al.* (2013, pp. 1008–1010), which assessed performance using differential expression relationships obtained from a database of 247 knock-out (KO) experiments in yeast (Hughes *et al.* 2000; Zhu *et al.* 2008). To enable the comparison we used the raw CMST *P*-value of the so-called “*M*_{3} independent model” (see Chaibub Neto *et al.* 2013, Figure 1) to build a ranking of 30,192 potential regulatory relationships, where each pair of genes had a common eQTL. We compared it against a corresponding ranking of NRR values estimated from the same data. Among these predicted associations 1634 formed part of the database of 247 KO experiments and using them as a bronze standard, we compared the two rankings by means of precision-recall curves. These are shown in Figure 4A and show nearly identical performance among all compared methods.

Since a KO gene may produce a cascade of expression changes, many of the 1634 relationships in the bronze standard may actually be indirect, which would explain the similar performance using either lower (CMST) or higher (qpgraph) conditioning. We attempted to build a bronze standard of more direct regulatory relationships by first restricting the initial set of 30,192 possible associations to 3074 that involved at least one transcription factor gene. Among these, we found 94 that were present in the database of KO experiments and in the Yeastract database of curated transcriptional regulatory relationships (Teixeira *et al.* 2014) and considered them as new bronze standard for comparison. The resulting precision-recall curves are shown in Figure 4B and reveal that NRR values estimated with qpgraph have larger discriminative power than CMSTs to identify direct regulatory interactions. Concretely, the area under the curve (AUC) for qpgraph is between 49 and 80% larger than the AUC of the different versions of CMSTs.

### The genetic control of a gene-expression network in a yeast cross

We performed all pairwise exact tests of marginal independence on every marker–gene and gene–gene pair and corrected the resulting *P*-values by FDR to select those associations with FDR <1%. The resulting graph, denoted by , had 92,710 eQTLs and 2,203,119 gene–gene associations. The graph constitutes a first estimate of the underlying eQTL network and it could be also obtained by the classical approach in QTL analysis of permuting phenotypes to test for the global null hypothesis of no QTL anywhere in the genome. Obviously, because the associations have been selected only using the margin of the data formed by one marker and one gene, or two genes, many of them will be indirect or spurious. To remove those associations we used the previously calculated average NRR values and selected a conservative cutoff of *ε* = 0.1, so that a present association requires at least 90% of the conditional independence tests to be rejected. This resulted in an estimate formed by 3498 eQTL and 1799 gene–gene associations. Note that while the estimation of requires some treatment of multiple testing, the NRR avoids this problem by actually exploiting the fact that is performing many tests to inform, by means of a Bernoulli variable, how close or far two genes, or a marker and a gene, are located in the network.

The genetic connected components of the eQTL network involved 450 genes and 3498 eQTLs on 1493 different loci, with a median of 6 eQTLs per gene. A substantial percentage of genes (27%) had >10 eQTLs on their own chromosome. Since eQTLs in were independently mapped from each other, a fraction of those targeting a common gene may be tagging the same causal variant. We removed redundant eQTLs by the following forward selection procedure. For each gene, we ordered its linked eQTLs by increasing NRR values and proceeded over the ranked eQTLs to test the conditional independence of the gene and the eQTL, given the eQTLs occurring before in the ranking using again the exact test described in the *Materials and Methods*. An eQTL association was retained if the test was rejected at *P* < 0.05 and the selection procedure stopped whenever *P* > 0.05 to continue on the next gene.

The genetic connected components were substantially pruned and the vast majority of genes (402/450) were left with just one eQTL, 46 genes with two, and only two had three eQTLs. This final eQTL network comprised 500 eQTLs and 1391 genes, the vast majority of them (941) forming gene–gene associations without any eQTL.

#### Hub genes have *trans*-acting eQTLs with large genetic effects:

For each of the 450 genes having at least one eQTL, we calculated the percentage of variance explained by eQTLs using Equation 31. The distribution of resulting values is shown in Figure 5A. About 70% of the genes had eQTLs explaining 50% or less of their expression variability, and only in 10% of them their eQTLs explained >70%.

Using a method based on linear mixed modeling and exploiting the relatedness matrix built from all pairs of segregants (Lee *et al.* 2011; Bloom *et al.* 2013), we estimated the narrow-sense heritability *h*^{2} for these 450 genes and compared it against the percentage of variance explained by the eQTLs (Figure 5B). Setting the percentage explained to the expected maximum *h*^{2} when the former was larger, the fraction of missing heritability ranges from 0 to 62%. This figure also shows that the connectivity degree to other genes correlates positively with both, *h*^{2} and percentage of variance explained. In fact, as Figure 5C shows, there are 24 genes connected to 9 or more other genes, for which all their eQTLs explain 68% or more of their expression variability. Interestingly, this figure also reveals that half of them have a *trans*-acting eQTL located in a different chromosome. In fact, these particular *trans*-eQTLs map all of them to chromosomes II and III (Table 2). Concretely, the eQTL of gene *SCW11* is located in position 553,857 of chromosome II, 2.6 kb away from the *AMN1* gene. This gene carries a loss-of-function mutation in the BY strain (Yvert *et al.* 2003) affecting the expression of daughter-cell-specific genes. The eQTLs in chromosome III map to the MAT and LEU2 loci, the latter being one of the engineered deletions in the BY strain.

Genes *YCL065W*, *YCR041W*, and *YCR097W-A* in Table 2 have currently unknown function. However, a gene ontology (GO) enrichment analysis on each subset of genes connected to them in the eQTL network showed that they are potentially involved in mating-specific regulatory processes (Holm’s FWER < 0.05).

Therefore, these highly connected genes, shown in Table 2, are involved in regulatory processes related to mating-specific expression, reacting upon nitrogen starvation, participation in the leucine biosynthesis pathway, and daughter-cell separation. These are fundamental pathways for yeast growth and render these results consistent with previous evidence from yeast genetic interaction networks derived from double-mutant screens (Costanzo *et al.* 2010), where highly connected genes were involved in primary cellular functions (Baryshnikova *et al.* 2013).

#### Differential genetic control of gene expression across chromosomes:

We also investigated how genetic variation affects gene expression differently across the yeast chromosomes. For this purpose, we produced hive plots (Krzywinski *et al.* 2012) shown in Figure 6, using the R/CRAN package HiveR (Hanson 2014). A first observation is that eQTLs occurring within the same chromosome (edges between the markers axis and genes axis of the same chromosome) mostly lead to concentric edges, pointing to *cis*-regulatory mechanisms acting at different distances. A remarkable exception is chromosome III where many of those edges cross through each other. This chromosome is also distinctive in that it has a lower density of *cis*-acting eQTLs than the rest of the genome.

Between 81 and 92 kb from the beginning of chromosome III there is a *cis*-acting eQTL on genes *LEU2* and *NFS1*. This eQTL is also *trans*-associated with genes *BAT1*, *OAC1*, *LEU1*, and *BAP2* located in different chromosomes and involved in the leucine biosynthesis pathway. According to the UCSC Genome Browser (http://genome.ucsc.edu) they all have upstream a binding site of *LEU3*, a major regulatory switch in this pathway. The engineered deletion of *LEU2* affects *LEU3* in a feedback loop and this would lead to expression changes in its targets (Chin *et al.* 2008).

Downstream, at about 200 kb of the beginning of chromosome III, we find the *MAT* locus whose genetic composition determines the mating type of yeast. This eQTL is *cis*-associated with the gene *MATALPHA1*, which is expressed in haploids of the alpha-mating type and has been previously reported as a candidate regulator of the rest of genes associated with this locus (Yvert *et al.* 2003; Curtis *et al.* 2013). This locus is *trans*-associated with two other genes in the same chromosome (*HMLALPHA1*, *HMRA1*) and to a set of genes distributed throughout the genome [*STE2*, *STE3*, *STE6*, *AFB1*, *BAR1*, *MF(ALPHA)1*, *MFA2*], which are all involved in the regulation of mating-type-specific transcription.

In chromosome V, around the locus of *URA3* we find a *cis*-acting eQTL that has a *trans*-acting effect on *URA1* (chromosome XI) and *URA4* (chromosome XII), the three of them taking part in the biosynthesis of pyrimidines (Yvert *et al.* 2003; Curtis *et al.* 2013). As can be easily seen from Figure 6, and consistent with previous observations (Yvert *et al.* 2003), few of the eQTLs directly affect transcription factors, such as the *ARR1* gene in chromosome XI, or RNA-binding proteins, such as *NOP8* in chromosome V.

The edges between the genes axes correspond to gene–gene associations from the corresponding chromosome to the rest of the genome and where at least one of the genes has an eQTL. This is a fraction (503) of a total of 1799 gene–gene associations on the entire eQTL network, more directly affected by the genetic control of gene expression. We observed a systematic pattern of association between genes from the same chromosome (see, *e.g.*, chromosome XVI). Replacing the axis displaying genes from all chromosomes by another gene axis again of the same chromosome (data not shown) reveals that a fraction of the gene–gene associations in which one of the genes has a *cis*-acting eQTL occurs between genes close to each other on the chromosome. It may be possible that inherited coexpression segregates due to linkage disequilibrium and/or that tandem gene duplication events render genes close to each other being coexpressed. Using the strategies introduced in this article further, such as conditioning these associations on nearby genes, may help to elucidate what fraction of them are of genetic, molecular, or evolutionary origin.

## Discussion

Gene expression is a high-dimensional multivariate trait whose variability is the result of genetic, molecular, and environmental perturbations and often different kinds of confounding effects. Dissecting the components of this variability and being able to adjust for some of them is a major goal in every study using genetical genomics data. Here we have used a class of statistical models with a graphical interpretation, mixed GMMs, to approach this challenge from a multivariate perspective. Using simulations we have shown that genetic effects can propagate proportionally to the marginal correlation between the genes and that this effect may be amplified under selection bias (Figure 1), underscoring the need to adjust for indirect associations.

Using standard linear theory and basic principles from mixed GMMs, we have derived the parameters, in terms of a mixed GMM, for an exact LRT on data from conditional Gaussian distributions that accommodate both linear and interaction effects between genetic variants and continuous gene-expression profiles. Higher-order conditioning on mixed data unlocks a number of strategies that one may follow to disentangle direct and indirect effects in genetical genomics experiments. We exploited it by using marginal distributions and *q*-order correlation graphs. We showed that this approach allows one to adjust for confounding effects (Figure 2), increases our power to identify *cis*-acting eQTLs (Figure 3) and direct regulatory relationships (Figure 4), and helps in comparing the genetic control of gene expression between chromosomes (Figure 6) and throughout the gene network (Figure 5). In particular, we could see that the degree of connections of each eQTL gene to other genes in the eQTL network correlated positively with both narrow-sense heritability and fraction of variance explained by eQTLs (Figure 5). An important fraction of large genetic effects were in fact due to *trans*-acting eQTLs from those genes with more connections. Using those connections we could predict potential pathways involving three such hub genes of unknown function. Genetical genomics data from large experimental crosses are becoming increasingly available to the community. We believe that mixed GMMs can play a crucial role in harnessing these data to explore linear and interacting eQTL associations of arbitrary order and advance our understanding of the genetic control of gene expression.

## Acknowledgments

We thank M. M. Albà, D. R. Cox, M. Francesconi, S. L. Lauritzen, B. Lehner, and N. Wermuth for helpful discussions on different parts of this article and the two anonymous reviewers for their valuable comments that have improved this article. We thank R. Brem for kindly providing raw expression data files from the yeast data set analyzed in this article. This work has been supported by a grant from the Spanish Ministry of Economy and Competitiveness to R.C. (ref. TIN2011-22826).

## Footnotes

*Communicating editor: E. G. Petretto*

- Received August 9, 2014.
- Accepted September 22, 2014.

- Copyright © 2014 by the Genetics Society of America