## Abstract

We describe a method for integrating gene expression information into genome scans and show that this can substantially increase the statistical power of QTL mapping. The method has three stages. First, standard clustering methods identify small (size 5–20) groups of genes with similar expression patterns. Second, each gene group is tested for a causative genetic locus shared with the clinical trait of interest. This is done using an EM algorithm approach that treats genotype at the putative causative locus as an unobserved variable and combines expression information from all of the genes in the group to infer genotype information at the locus. Finally, expression QTL (eQTL) are mapped for each gene group that shares a causative locus with the clinical trait. Such eQTL are candidates for the causative locus. Simulation results show that this method has far superior power to standard QTL mapping techniques in many circumstances. We applied this method to existing data on mouse obesity. Our method identified 27 putative body weight QTL, whereas standard QTL mapping produced only one. Furthermore, most gene groups with body weight QTL included *cis* genes, so candidate genes could be immediately identified. Eleven body weight QTL produced 16 candidate genes that have been previously associated with body weight or body weight-related traits, thus validating our method. In addition, 15 of the 16 other loci produced 32 candidate genes that have not been associated with body weight. Thus, this method shows great promise for finding new causative loci for complex traits.

IN the past few years, there has been a great deal of interest in the relationship between genotypic variation and transcriptional variation. Numerous studies have found that there is often a strong genotypic component to variation in transcription (*e.g.*, Brem *et al.* 2002; Oleksiak *et al.* 2002; Yan *et al.* 2002; Cheung *et al.* 2003; Schadt *et al.* 2003; Morley *et al.* 2004). These observations have spurred interest in the possibility of leveraging this strong relationship between genotype and transcription to explore the relationship between genotype and a clinical trait of interest, the premise being that genome-wide expression data provide access to the gene network that links genotype to clinical phenotype (for simplicity, we refer to the organismal trait of interest as the “clinical trait,” to distinguish it from an expression-level trait; of course, it need not be solely of clinical interest). Numerous studies have successfully used such approaches to find causative polymorphisms for complex traits (*e.g.*, Lawn *et al.* 1999; Berge *et al.* 2000; Karp *et al.* 2000; Wayne and McIntyre 2002; Kirst *et al.* 2004; Bystrykh *et al.* 2005; Chesler *et al.* 2005; Hubner *et al.* 2005; Li *et al.* 2005; Schadt *et al.* 2005, 2008; Ghazalpour *et al.* 2006; Schadt 2006; Chen *et al.* 2008; Emilsson *et al.* 2008), showing the power of this type of approach.

To date, most work in this area has been based on prior knowledge of a genomic region containing a QTL for the trait of interest. The most common approaches are to look for genes in the region of interest whose expression levels are correlated with the clinical trait of interest or to look for gene networks with many expression QTL (eQTL) in the region of interest. Our focus in this article is the traditional genome-scan setting—that is, gene mapping with no prior knowledge of any QTL position. There are numerous genetical genomics articles (*e.g.*, Ghazalpour *et al.* 2005, 2006; Ayroles *et al*. 2009; Edwards *et al*. 2009) that are essentially genome scans in the sense of identifying QTL locations without using any prior QTL information. However, there is little understanding of how these approaches compare to traditional QTL mapping, what conditions (if any) make them superior, and what methods for using expression information are best and why. The goals of this article are to (1) develop a self-contained method for integrating expression information into genome scans, (2) explore the assumptions and the statistical properties of this method, and (3) show that this approach can substantially increase the power for QTL mapping.

Figure 1 shows a hypothetical schematic of the relationship between a clinical trait, a gene network, and the causative loci for the clinical trait. Genetic variation at the causative loci leads to changes in transcription in the gene network. These in turn lead to variation in the clinical trait. In traditional gene mapping, we have genotype data for markers (some of which are hopefully in linkage or association with a clinical trait locus) and we have clinical trait data. The main obstacle is that for a complex trait, the statistical link is weak between a causative locus and the trait because individual loci account for only a small part of the variation. The statistical link is further weakened by the nonperfect correlation between marker and causative locus.

Now, suppose that we also have information about the gene network, in the form of gene expression information. The network provides a link between the clinical trait and the causative loci. Consider the circled gene in Figure 1. The transcript level of this gene is affected by genotype variation at three of the causative loci and is thus a three-locus quantitative trait. Because it shares three controlling loci with the clinical trait, its genotype-based variation is in higher concordance with the trait than any single locus can be. Schliekelman (2008) modeled this situation for a binary disease and showed that such a gene will have substantially higher association with the disease than will the causative loci. This is the fundamental idea of genetical genomics: that is, that the genes in the network provide an intermediate link between clinical trait and genotype and genome-wide expression data can be used to access their transcript levels.

Schliekelman (2008) showed that there are substantial statistical obstacles to this idea. As we move through the network farther away from the causative loci and closer to the trait, the statistical link to the trait becomes stronger and the link to the loci becomes weaker. We must establish a statistical link in both directions, but there is a trade-off in statistical power between these two goals. The required sample sizes for reasonable power for both directions using currently available methods range from hundreds to thousands, depending on assumptions about the genetics of transcription and disease.

In this article we present a new method for establishing a link between a gene transcript and a clinical trait that greatly increases statistical power relative to the results of Schliekelman (2008). This article addresses quantitative clinical traits, while a future article will address binary ones. This method infers the genotype of the controlling locus for a group of correlated transcripts and tests whether that genotype is associated with trait status. We model the expression level of a gene as a normal mixture model with components corresponding to genotype. That is, the expression level has a normal distribution with mean and variance dependent on genotypes at the controlling loci for that transcript. A number of other authors have previously modeled expression levels this way in the eQTL context (*e.g.*, Jia and Xu 2005; Kendziorski *et al.* 2006; Nettleton and Wang 2006; Chen and Kendziorski 2007; Gelfond *et al.* 2007; Perez-Enciso *et al.* 2007; Sampson and Self 2008; Sillanpaa and Noykova 2008). Hsieh *et al.* (2007) found evidence of multimodality in transcription in 7–10% of genes in two *Drosophila melanogaster* lines. Other genes may have had either little within-population variation or genetic control too complicated to be detected with the available sample sizes.

We define a controlling locus for a transcript as a locus with genotypic variation between individuals that contributes to variation in the transcript between individuals. If there were sufficient separation between the components of the mixture, then we could use the expression level to assign an individual's genotype at the controlling locus with high probability. Consider Figure 2. This shows a scatter plot of a clinical trait value *vs.* transcript value for a simulated data set. The clinical trait and the transcript are jointly controlled by a locus D with alleles *D* and *d*. The expression means are 0, 0.5, and 1 for the *dd*, *Dd*, and *DD* genotypes, respectively. The standard deviation is 0.1 for each component and is small relative to the differences between components. Thus, given this plot, we could infer genotypes at the D locus for most sample individuals with high certainty. In this case, we could test whether the locus D that controls the transcript is a QTL for the clinical trait by assigning individuals to genotype groups according to their expression values and then testing for an association between the assigned genotypes and the clinical trait values. We did this for the simulated data in Figure 2 with a sample size of 100, using the *k*-means algorithm to assign genotype and an ANOVA to test for association. There was a highly significant association between assigned genotype and the clinical trait, with a *P*-value of 2 × 10^{−8}.

This is, in effect, what the hypothesis test discussed in Schliekelman (2008) is doing. Unfortunately, there is usually not this level of separation between components and a single expression level is not a reliable indicator of underlying genotype. In this article we present a method that combines information from multiple transcripts to infer the controlling genotype more reliably. Figure 3a shows a scatter plot for a simulated data set like that in Figure 2. In this example, the expression standard deviation is 0.5. Thus, there is much less separation between components and the individual genotypes cannot be distinguished. Visually, we can discern only a weak relationship between the transcript and the trait. When we apply the *k*-means algorithm to assign genotypes and an ANOVA to test for association with the clinical trait, we get a *P*-value for association of 0.39 with a sample size of 200. However, in Figure 3b we include a second transcript also controlled by the locus D. The *x*-axis and *y*-axis show values of the two transcripts and the size of the point shows the value of the clinical trait for that individual. We can see a clear increasing pattern in the clinical trait value as we move along the diagonal line from low (*x*, *y*) values to high ones. That is, the joint distribution of the transcripts has a much stronger relationship with genotype than does a single transcript and therefore has a stronger relationship with the clinical trait.

In effect, each expression value is an independent observation of the underlying genotype. Because of the substantial variation in expression no single such observation is very reliable. However, by combining observations of multiple transcripts controlled by the same locus, we can get more reliable inference of the underlying genotype. The values *C*1, *C*2, and *C*3 show the cluster centers when the *k*-means clustering algorithm was applied to the data with three clusters specified. We again used the cluster assignments as a proxy for genotype and tested for an association between inferred genotype and the clinical trait. The result was highly significant (*P*-value = 4 × 10^{−5} with sample size of 200).

In this article we formalize this idea using an approach based on the EM algorithm. The goal is a method for conducting genome scans. We assume that the data consist of (1) genotype data for a large number of markers spaced throughout the genome for each sample individual, (2) expression data from a microarray covering thousands of genes for each individual, and (3) the clinical trait value for each individual. We undertake a three-part procedure:

We apply hierarchical clustering to the expression data and apply an algorithm that extracts gene groups of a target size

*w*from the bottom of the dendrogram. This produces groups of size*w*or smaller with high intergroup correlation.We assume that genes in these groups share some common controlling factor. We then test whether this controlling factor affects the clinical trait. This factor could be environmental or genetic, but our method assumes that it is a genetic locus. Our approach uses the EM algorithm and treats genotype at the unobserved locus as the unobserved data and this is used as the basis of a likelihood-ratio test.

If the controlling factor for a gene group is determined to be causative for the clinical trait, then there are two options. One is to map eQTL for the gene group. This approach is pursued in this article. Any such eQTL will then become candidate genes for the trait. Another option is to then use the expression data to infer the genotype at the putative controlling locus and then use this genotype information to construct an optimal sample for mapping the locus, similarly to the informal procedure used by Schadt

*et al*. (2003). This idea will be developed in a future publication (E. H. Breazel and P. Schliekelman, unpublished data)].

There are several advantages to using expression data to infer genotype, rather than using marker data. The first is that by using expression data we are inferring the genotype of the actual locus and not just a marker linked to it. That is, there is no recombination between marker and locus to reduce power. Of course, we then have error resulting from variation in expression. In this sense, our method trades variation in recombination between marker and genotype for variation in expression. We will show evidence that the genotype can be inferred more accurately using expression information, at least in some circumstances. The second advantage is that, to the extent that gene expression is a good indicator of genotype, then we are directly testing association of genotype to the clinical trait. It is well established that association tests are superior to linkage tests for detecting common variants (Risch and Merikangas 1996). However, association tests rapidly lose power when there are multiple alleles at the locus (Slager *et al.* 2000). Our method collapses all alleles into either expression-increasing or expression-decreasing alleles. Thus, it is similar to an association test with two alleles, regardless of how many alleles there really are. The third advantage (discussed in Schliekelman 2008) is that transcripts that share multiple controlling loci with the clinical trait have a higher correlation with the trait than do the loci themselves. In this case, the expression level is indicating multilocus genotype and a multilocus genotype has stronger association with the clinical trait than a single-locus genotype. These points are further addressed in the discussion.

The remainder of this article is organized as follows: First, we describe the method. Then, we show simulation results that explore the statistical power of the method. Finally, we apply the method to a mouse obesity data set. This is followed by a discussion section.

## METHODS

#### Likelihood-ratio test:

We assume that there are *N* individuals with a quantitative clinical trait that is controlled by genotype at *L* causative loci. The expression levels of *s* genes have been measured for each individual. Denote as the expressional level of the gene in the individual, and as the clinical trait value of the individual. Both *X _{ij}* and

*Y*are treated as continuous random variables.

_{i}Our purpose is to test whether a controlling locus for a group of highly correlated genes is also a causative locus for a clinical trait. We assume that the *n* transcripts in the group share at least one controlling locus. If there is some environmental factor that affects the clinical trait and affects expression of the transcript, our method cannot distinguish this from a genetic factor with the same effect.

The hypothesis test of interest is , the controlling locus for the group of *n* transcripts is not a QTL for the clinical trait, and , the controlling locus for the group of *n* transcripts is also a QTL for the clinical trait.

The likelihood function under is

The likelihood function under is

is the vector of expression levels of the *n* genes in the group for the individual. The distribution is the expression-level probability density function (pdf) of the gene group. This is assumed to be a multivariate normal pdf whose mean and variance are dependent on the genotype *g _{i}* at the controlling locus. The clinical trait distribution is also assumed to be normal and depends on the genotype

*g*at the controlling locus (

_{i}*f*(

*Y*|

_{i}*g*)) under the alternate hypothesis, but not under the null (

_{i}*f*(

*Y*)). The test statistic is the ratio of the maximum likelihood under and under .

_{i}We evaluate evidence for association between transcripts and the clinical trait with the likelihood-ratio statistic . Rejecting the null hypothesis suggests that a controlling locus for these transcripts is also a QTL for the clinical trait.

#### EM algorithm:

We use the EM algorithm (Dempster *et al.* 1977) to obtain maximum-likelihood estimates (MLE) of parameters since we do not actually observe the genotype. Matrix *X* and vector *Y* are treated as observed data. Genotype is the unobserved data, where *g _{i}* = 0, 1, 2, is the number of transcript-increasing alleles in individual

*i*at the controlling locus. In the results shown in this article, we assume that the expression-level distribution and the clinical trait distribution are independent after conditioning on the genotype at the controlling locus, that expression levels are independent within and between individuals after conditioning on genotype, and that the clinical trait is independent between individuals. In this case, the conditional expression and clinical trait distributions simplify to a product of independent normal pdfs. However, the first two of these assumptions are likely not true in many cases. That is, there will be correlations beyond shared genotypes between many transcripts and between transcripts and the clinical trait. To capture this correlation, we also implemented our method assuming that the joint expression and trait distribution is a full multivariate normal distribution. Unfortunately, the implementation of the multivariate normal and associated computations increases the computational time to an infeasible level. Thus, we do not use this approach in any of the results shown in this article. Despite this shortcoming, our simulations show that the method still performs well even in the presence of substantial extragenotype correlation. There is likely some loss of power, but the false-positive rate is not increased (see results).

Using standard EM algorithm techniques, we derive estimators for the genotype-specific expression means and variances, genotype frequencies for the controlling locus, and genotype-specific clinical trait means and variances. Without the clinical trait data , these would be the standard EM estimators for the normal mixture model. These estimators are modified to include the clinical trait data. See supporting information, File S1 for details.

#### Grouping genes:

We wrote an algorithm that uses hierarchical clustering to assign transcripts into groups with similar expression patterns. We used the hclust function in R (http://www.r-project.org) with the expression data (clustering across genes) and a Euclidean distance and complete linkage to create a dendrogram relating the genes.

The algorithm starts by clustering the genes as described above. Then, the cutree function in R is used to divide the genes into two groups according to their position in the dendrogram. If the size of a group exceeds the target, then hierarchical clustering is applied again and the group is again divided into two smaller groups using cutree. This process is continued iteratively and groups are successively divided into smaller groups until no groups exceed the target size. Each of the current groups is checked in each iteration. If its size is less than or equal to the target size, then the group is added to a list of final groups. If its size is greater than the target, then it is divided into smaller groups. The code is included in File S1 with further documentation.

After the groups are selected, we apply our method to each one to test whether a controlling factor for that group is also a causative locus for the clinical trait. Note that we chose hierarchical clustering to produce the gene groups simply because it is a fast, simple method that produces results easy to separate into smaller groups. Our simulations show that it works adequately. However, other clustering methods may work better in some situations and our approach does not require any particular clustering method.

#### Determining significance:

The statistical significance of the gene groups can be determined by using the asymptotic approximation for the likelihood ratio. There is a difference of two free parameters between the alternative and null hypotheses: three genotype-specific clinical trait means under the alternative and one genotype-independent clinical trait mean under the null. Therefore, we approximate the distribution of the likelihood ratio as an asymptotic chi-square distribution with 2 d.f. under the null hypothesis (note that the number of components in the mixture does not change between the null and the alternative hypothesis). Our simulations show that this approach works well. However, there is a potential problem associated with our assumption of independence between different transcripts after conditioning on genotype. This assumption will clearly not be met in some cases and there will be additional dependencies between transcripts. In this case the likelihood function is no longer correct and the asymptotic approximation for determining significance may not perform well. In particular, the probability of a set of transcripts having an apparent correlation with the clinical trait by chance may be underestimated. Our simulations (see File S1) show that the false-positive rate is not increased by this effect. However, because of the possibility that the false-positive rate could be affected with correlation structures different from what we tested, we feel that a permutation test is a more reliable means of assessing significance. Due to considerations of computational time, most of the simulation results that we show in this article use the asymptotic approximation to assess significance. However, we recommend using the permutation approach to assess significance with real data sets.

#### Local maxima:

The EM algorithm is guaranteed only to find local maxima. We found that with a single selection of starting parameter values, the algorithm is quite likely to end up at a local maximum instead of the global one. We used two different approaches to avoid this. One is to choose multiple sets of random starting values for parameters, run the algorithm for each one, and take the parameter fits producing the highest likelihood as best fit. We found that 20 sets of starting values were usually adequate. The other approach we used was the split and merge EM (SMEM) algorithm (Ueda *et al.* 2000), which uses split and merge operations between different mixture components to find global maxima. This procedure had similar performance to 20 starting values in finding the global maximum, but was substantially faster computationally. Thus, most of our results use this approach.

We found the local maximum problem to be particularly serious in our application because we are using the EM algorithm to maximize likelihoods in a likelihood-ratio test. False positives can result if the EM algorithm finds the global maximum under the alternative hypothesis model but misses it under the null hypothesis one. Even using 20 starting values or the split–merge EM, the global maximum is still missed enough for this to be a serious concern. Giving the best-fit parameters under the alternative hypothesis model as one of the sets of starting parameter values for the null hypothesis model greatly reduces this problem. We take the additional measure of conducting an exhaustive search of the parameter space for the null hypothesis model for gene groups that produce significant results, to ensure that the global maximum has not been missed. We do this by defining a grid of plausible values for each parameter and then trying all combinations of parameters over this grid as starting values for the EM.

#### Simulation study:

We conducted simulations to evaluate our method for testing for association between expression levels and the clinical trait. For each individual, the genotypes (*g* = 0, 1, 2, clinical trait-increasing alleles) on *L* causative loci are randomly generated. The clinical trait is assumed to follow a normal distribution whose mean is the sum of effects contributed by each of the *L* loci. Trait means for genotypes 0, 1, and 2 on each locus are , and , respectively, where and *h*_{T} is the dominance coefficient. Each locus contributes equally to the clinical trait. Each transcript is assumed to follow a normal distribution with the mean being the sum of contributions from a subset of *c* of the *L* causative loci. Each subset of *c* causative loci is assumed to control the expression of *M* transcripts. Expression means for genotypes 0, 1, and 2 on each locus are , and , respectively, where and *h*_{E} is the dominance coefficient for expression.

All transcripts were simulated as independent after conditioning on the shared controlling genotype except where specifically noted otherwise. Likewise, the clinical trait was independent of the transcripts after conditioning on shared controlling genotype. The variance for both expression and the clinical trait is 0.1. This leads to heritability values for the transcripts of ∼38% for *c* = 1, 12% for *c* = 2, and 6% for *c* = 3. The heritability for clinical trait QTL is ∼2% for *L* = 6 and ∼1% for *L* = 9 for the standard parameter values in this article.

Appropriate parameter values are largely unknown, so we vary these parameters over reasonable ranges. For each set of parameter values, we simulated 1000 replicates to assess the power or false-positive rate in detecting the association between transcripts and the clinical trait. Table 1 gives definitions and default values of parameters used in the simulations.

#### QTL mapping:

We use the JZmapqtl function in QTL Cartographer Version 1.17 (Basten *et al*. 1994) to map eQTL for gene groups. This function conducts composite-interval QTL mapping jointly for a set of traits on the basis of the method of Jiang and Zeng (1995). Their method assumes that traits follow a normal mixture model, similarly to our method. QTL Cartographer does not have functionality for QTL mapping for multiple gene groups. Thus, we wrote a Perl script that loops over each gene group, calls the JZmapqtl function to conduct joint QTL mapping, and collects information on significant QTL. This script is available in File S1. See the mouse data application for discussion on the choice of significance thresholds.

#### Data:

We apply our method to the data of Ghazalpour *et al*. (2006). They conducted an F_{2} cross between inbred mouse strains C3H/HeJ and C57BL/6J. Using 135 female offspring from this cross, they collected genotype information at 1065 SNPs, liver gene expression data, and data on 22 physiological traits, including body weight. From their microarray data, Ghazalpour *et al*. (2006) selected a set of the 3421 most varying and most connected genes to use for their gene network analysis. These data are publicly available and we obtained them from the authors' web site (http://www.genetics.ucla.edu/labs/horvath/ CoexpressionNetwork/MouseWeight/).

## RESULTS

#### Simulation study:

We start with simulation results that show how the statistical power to detect that a transcript group shares a causative locus with the clinical trait depends on various parameter values and assumptions. Then, we show the application of our full genome-scan procedure to a mouse obesity data set.

#### Effect of gene group size on power:

Figure 4 shows a plot of the power to detect that a set of *n* genes shares a controlling locus with the clinical trait. Specifically, it is the power to reject the null hypothesis that the controlling locus for the group of *n* transcripts is not a QTL for the clinical trait, as a function of sample size and other parameters. The group size *n* is shown next to each curve. In the simulations for Figure 4 it was assumed that genes were correctly grouped (*i.e.*, that all genes in the group share the same controlling locus). The *n* genes in the group were simulated such that they share *c* = 3 controlling loci with the clinical trait. Other parameters are as shown in Table 1. In general all parameter values in this article take the value shown in Table 1 unless otherwise specified. We see that the power increases quickly with group size. The sample size required for 80% power is >1000 for *n* = 1, ∼900 for *n* = 2, 550 for *n* = 4, 450 for *n* = 8, and 350 for *n* = 20. As the number of genes in the group increases, the importance of a transcript's individual variation in expression is reduced and joint expression becomes a more reliable indicator of the underlying controlling genotype. Although most of the absolute improvement in power occurs in going from *n* = 1 to ∼*n* = 4, there is still substantial proportional decrease in required sample size going from *n* = 12 to *n* = 20 (and presumably beyond). This indicates that if (1) the group truly consists of transcripts sharing a controlling locus and (2) there is no correlation beyond that due to shared controlling genotype, then larger group size is always beneficial (at least out to some to unknown but large limit).

The simulations in Figure 4 assumed that the gene group always consists entirely of transcripts that are jointly controlled by the same clinical trait QTL. In reality, the gene groups are determined by hierarchical clustering (or another clustering method) with a preset group size. Thus, some transcript groups may contain transcripts that are not controlled by the QTL and this may adversely affect power. Results in File S1 show that these “extra” transcripts do not affect the results unless the number of transcripts not controlled by the clinical trait locus equals or exceeds the number controlled by that locus. Plots showing the impact of varying the dominance coefficients, trait means, and trait variances are also given in the Results section of File S1.

#### Effect of correlation between transcripts:

Our method assumes that the only correlation between transcripts comes from shared genotype. That is, we assume that the transcripts are independent after conditioning on genotype at the controlling loci. The simulation data from the above results follow this assumption. However, it is likely in reality that there will be additional correlation between transcripts. This correlation will decrease power. In the independent case, each transcript is giving an independent observation about the underlying genotype. When there is additional correlation between transcripts, then the amount of independent information is reduced. Figure 5 shows the effect of this correlation. When there was no correlation, expression values were generated with a common mean according to shared controlling genotype and no additional correlation. When, for example, the correlation is 0.2, then the expression values were generated from a multivariate normal distribution with a common mean and a pairwise correlation of 0.2 (that is, all of the off-diagonal terms in the covariance matrix are 0.2 times the product of the two expression standard deviations). The expression standard deviations all had the default value of 0.333.

The correlation due to shared genotype alone is 0.29 for the parameter values used in generating this plot. With additional correlations of 0.2, 0.4, and 0.6, the total correlations are 0.44, 0.58, and 0.72. Figure 5 shows that there is substantial effect of this additional correlation. For example, 80% power is reached at a sample size of ∼400 with no extra correlation. The required sample size for 80% power increases to ∼600, 800, and 1000 when the extra correlations between transcripts are 0.2, 0.4, and 0.6, respectively. As this correlation gets very high, then power is actually reduced relative to the case using a single transcript because of the larger number of parameters that must be fit. We see that power increases when there is a negative correlation between transcripts. When one transcript has a high value relative to its genotypic mean, then other transcripts in the group will tend to have low values and the deviations cancel each other out. It is not currently known how much of the correlation in expression between genes is typically due to shared genotype and how much is due to other types of shared control.

When there is additional correlation between transcripts, then our model is not correct and thus there is reason to be concerned that the asymptotic chi-square approximation may provide incorrect *P*-values. The probability of a set of transcripts having an apparent correlation with the clinical trait by chance may be underestimated because they are more highly correlated with each other than assumed by the model. Thus, we conducted simulations to test the false-positive rate when there is additional correlation. These results are available in File S1 and show that the false-positive rate is not increased by this effect.

#### Comparison to standard QTL association mapping:

Figure 6 shows comparisons of our method to traditional QTL mapping for different combinations of the parameters *c* (the number of these QTL that jointly control expression of the transcript group), *M* (the number of transcripts that are jointly controlled by the *c* QTL), and *L* (the number of clinical trait QTL). The curves labeled “direct QTL mapping” give the power for directly testing for association between genotype at one of the *L* QTL and the clinical trait. This was calculated assuming that an ANOVA is being used to test for differences in the clinical trait between QTL genotypes. This gives an upper limit on power by standard QTL association mapping methods because we assume that the exact causative polymorphism is being tested. These curves were generated using the same parameter values as the other curves in the plot and also the same multiple-testing correction. We can use this to compare the power of standard QTL mapping to the power of our approach for testing whether a transcript group shares a QTL with the clinical trait. We see that when *c* = 1, the power achieved by our approach is slightly less than by direct QTL mapping, while the power of our method is much higher when *c* > 1. For example, a sample size of 1000 is required to achieve 80% power for direct QTL mapping with *L =* 6, while sample sizes of 700, 500, 400, and 350 are required for *c* = 2, 3, 4, and 5 with *M* = 6 for our approach. The power increases strongly with *c*. As *c* increases, then the transcript group and the clinical trait share more controlling loci and thus the expression patterns and the clinical trait become more concordant (this point is discussed extensively in Schliekelman 2008). Standard QTL mapping is not able to benefit from this effect. Of course, our method requires an additional step of mapping the QTL for the transcript group, which is not modeled here. However, many studies have shown that the power to map such eQTL is often quite high (see reference citations in the Introduction). Furthermore, it is highly unlikely that the actual polymorphism would be tested (as opposed to some linked marker) unless the marker density is very high. In this case, the multiple-testing correction for QTL mapping would be much more stringent.

#### False-positive rate:

Simulation results in File S1 show that the false-positive rate at a significance threshold of 0.05 is 5%, as expected. We have seen some cases of false positives caused by local maxima as discussed in methods. Thus, positive results should be carefully checked for this possibility (see methods).

In another example, we analyzed a data set that mixed data for simulated transcripts controlled by a clinical trait QTL with an expression data set published by Chandran *et al*. (2007). The goal was to see whether we could detect the simulated clinical trait-related genes among the real genes. This allowed us to test the performance of our method in a complex data set with a realistic correlation structure in the background genes. The procedure performed well, detecting transcripts associated with each of the clinical trait QTL while producing only a single false-positive group out of 391. These results are in File S1.

#### Application of the full genome-scan procedure to a mouse obesity data set:

Next, we show an application of the full genome-scan procedure to the data set of Ghazalpour *et al*. (2006). See methods for a description of the data. From their microarray data, Ghazalpour *et al*. (2006) selected a set of the 3421 most varying and most connected genes to use for their gene network analysis. We used the same set of 3421 genes. We applied hierarchical clustering and our gene grouping algorithm with a maximum group size of 6. This produced a total of 983 gene groups, with sizes ranging from 1 to 6.

We then applied our EM algorithm procedure to each gene group, using body weight as the clinical trait. We first used the asymptotic chi-square distribution with 2 d.f. with a Bonferroni correction to determine significance for the likelihood-ratio test. A total of 152 gene groups were found to have a controlling factor significantly associated with body weight. These gene groups contained a total of 576 genes. We also applied a permutation approach to assess significance, which gave similar results (File S1).

Next, we conducted QTL mapping using the R programming language (http://www.r-project.org/) with the R/qtl package (Broman *et al.* 2003). We first applied standard QTL mapping procedures to body weight, using permutations to assess significance. The four most significant SNP positions are shown in Table 2. There was one significant QTL at SNP p45500 and two suggestive QTL at SNPs p45558 and p44688.

Next, we mapped QTL for the 152 significant gene groups using the JZmapqtl function in QTL Cartographer. This function, based on the method of Jiang and Zeng (1995), treats a group of traits as a multivariate mixture model with components corresponding to genotypes at joint QTL and maps the QTL using composite-interval mapping. Its underlying model is similar to the one we assume for testing gene groups. The primary difference is that we infer genotype on the basis of expression data and Jiang and Zeng infer genotype using marker genotype data.

Following the suggestion of Jiang and Zeng (1995), statistical significance was assessed using a Bonferroni-corrected χ^{2}-test with 2*m* + 1 d.f., where *m* is the group size. We also used a permutation approach (see File S1) that gave similar results.

A total of 38 gene groups produced significant eQTL (Table 3) at a total of 45 different genomic locations. Some of these 45 SNPs are close to each other on the same chromosome and are thus apparently due to the same QTL. The first column in Table 3 shows groupings of significant SNPs into distinct loci. These groupings were based on visual inspection of the lod plots. Many of the peaks are broad and thus it is not clear in some cases whether significant SNPs are due to the same or different eQTL. The most ambiguous cases are marked in Table 3. The individual lod plots are given in File S1. By our judgment, there were 27 separate QTL.

We see that one significant SNP position, p44688, matches exactly with a suggestive SNP position from the standard QTL mapping of body weight. Furthermore, SNP p45328 is directly adjacent to significant SNP 45558 from the standard QTL mapping. The significant SNP p45500 from the standard QTL mapping is in the support interval of the significant SNP p45826 from our method. Finally, SNP 45276 that had the fourth lowest *P*-value (but not significant) in the standard QTL mapping is in the support interval of significant SNP p45924 from our method. Thus, our method produced significant QTL at all four of the most significant positions from the standard QTL mapping. Three of these four positions were significant by our method but not by the traditional gene mapping method. This gives a strong proof of principle for our approach.

There were 23 other significant QTL regions in addition to these four QTL. All but one of the significant QTL positions are mapped to the same genomic location as at least one gene in the group that produced the QTL. Thus, these genes become obvious candidate genes for the QTL. There were a total of 48 such *cis*-acting genes.

There are numerous instances of *cis*-eQTL for multiple genes mapping to the same location. These may be cases of genes being strongly linked to a body weight QTL, but whose own controlling genotype does not affect body weight. The expression levels of such genes could be significantly associated with body weight even though they do not affect it.

Ghazalpour *et al*. (2006) identified 12 transcription modules using gene coexpression networks constructed from the expression data. They then looked for correlations between these transcription modules and obesity-related clinical traits by calculating the average correlation between the traits and genes in the module. Finally, they mapped eQTL for all genes within the modules and looked for genomic regions that were enriched for eQTL for each module. They found nine such regions. Four of the nine peak markers were within the support interval of significant markers by our method (markers p45006, p46033, p45207, and p45826 from our method). Three others were not in the support interval of a significant marker, but within the same general genomic region as significant markers from our method (a distance of 30 Mb from marker p45328 and distances of 14 and 36 Mb from marker p44633). The remaining two significant regions from their method were on chromosome 10, where our method produced no significant markers. This strong overlap between the methods gives support for both.

We conducted literature review of the 48 candidate genes using the Entrez Gene database and PubMed to find information about their function. This information is summarized in Table 1 in File S1. Sixteen of the 48 genes have previously been associated with obesity or obesity-related traits. These 16 genes were divided among 11 of the putative QTL. That 16 of our candidate genes have been previously associated with obesity gives evidence that our method is producing biologically relevant results. That it also produced 16 more novel QTL shows that it potentially has great power to find causative loci for complex traits. Of course, it is not necessarily the case that a significant eQTL for a transcript associated with the clinical trait is causative for the clinical trait (see discussion).

#### Test of assumptions:

Our method is based on several key assumptions. We use the four transcript groups that produced the eQTL located at or near the SNPs listed in Table 2 as case studies to examine the validity of these assumptions.

The most fundamental assumption is that polymorphisms that affect a clinical trait of interest at least sometimes also produce detectable changes in expression levels in some genes. We see strong evidence that is true for these data. The four polymorphisms showing the strongest relationship with body weight via traditional QTL mapping also produced strong *cis* expression changes. It is interesting to note that transcript group 18 that produced the eQTL near SNP p45500 (the only SNP that was significant by traditional QTL mapping) produced the highest lod score for expression linkage (23.6) and the 20th (of 983) lowest *P*-value for association with body mass. Thus, a strong relationship between genotype and the clinical trait is reflected in a strong relationship between genotype and expression and between expression and clinical trait.

Another fundamental assumption of our method (although not of the general approach of using expression information in genome scans) is that genetic variation at QTL for the clinical trait drives variation in multiple transcripts (at least some of the time) and that those transcripts will therefore have highly correlated expression. The converse of this is that genes with highly correlated expression share controlling loci at least some of the time.

Table 4 shows the genes in the 4 transcript groups that produced eQTL at or near the SNP positions given in Table 2. The groups contain 6, 2, 1, and 4 genes. Overall, the mean group size was 4.1 for the 48 significant groups (recall that the tree-cutting algorithm was set to produce groups of size ≤6). Group 18 has 2 transcripts with significant eQTL near SNP p45500. The remaining 3 groups each have only a single transcript that produced a significant eQTL. However, although no other transcripts in these groups produced significant (over all marker and all transcripts) eQTL, there is evidence that most of these transcripts are influenced by the significant eQTL for the group. The last column in Table 4 shows the percentage of lod scores for the transcript that are below the maximum observed in the region of the significant eQTL for the group (the region is defined as all contiguous SNP positions with lod score >5). We see that five of the six transcripts in group 18 had lod scores in the top 2% in the region of SNP p45500. The remaining transcript had a lod score value in that region in the top 8%. Thus, we see evidence that the body weight QTL is exerting *trans*-acting control on all of the 6 genes in the group. We see similar, although weaker, evidence for the transcripts in groups 349 and 284. Note that these percentages are not *P*-values and this analysis does not constitute a hypothesis test. We merely intend these results to be suggestive.

File S1 contains lod score plots for some of the 48 transcript groups for different chromosomes. Many transcript groups have many shared lod score peaks. We used a simulation approach to conduct a test of the hypothesis that these matches in lod peaks are due to chance. Each chromosome was divided into two equal parts and the standard deviation in the location of the lod peak across the transcript group in each region was determined. This was compared to the standard deviations in 1000 replicates in which lod peaks were placed randomly in each region. The null hypothesis was strongly rejected in many regions for most transcript groups (see File S1 for more detail).

We cannot tell anything from this analysis about the extent to which the transcripts in a group are independently responding to genotypic variation. It could be, for example, that the transcripts in a group are tightly regulated by a single gene. In this case, the many shared lod score peaks could simply be due to variation (much of it random) in that controlling gene's lod score. On the other hand, similar-looking plots could also result if transcripts in a group share many eQTL of weak to moderate strength. We suspect that the truth is a combination of these two possibilities. It is beyond the scope of this article to explore these issues further.

## DISCUSSION

Using both simulations and an application to data, we have demonstrated that integrating expression information into genome scans for complex traits can substantially increase power to detect complex trait loci relative to using traditional QTL mapping methods. There are three reasons for this. The first is that a transcript may share multiple causative loci with the clinical trait. In this case, the transcript is an indicator for a multilocus genotype and thus will have stronger genotypic concordance with the clinical trait than a single locus can. Essentially, such a transcript allows us to test association of a multilocus genotype with the clinical trait without having to test the many possible multilocus combinations.

The second reason is that we are, in effect, directly inferring association between the underlying controlling genotype and the clinical trait instead of using an indirect quantity such as allele sharing. It is well known that direct tests for association give good power to detect causative loci with common alleles such as we have modeled (Risch and Merikangas 1996). Our method has the added advantage that it collapses all alleles into expression-increasing and expression-decreasing. Thus, there are in effect only two allele types. Association tests have been shown to suffer a major loss in power when there are multiple causative alleles (Slager *et al.* 2000). Our method behaves similarly to an association test that always has two alleles, regardless of the true number of alleles. See Schliekelman (2008) for further discussion.

The third reason is that we are inferring the locus directly, not a marker. Thus, there is no recombination obscuring the relationship between genotype and trait. Of course, we have error resulting from variation in expression. In this sense, our method trades variation in recombination between marker and genotype for variation in expression. When there is a just a single transcript, then this variation can be substantial. However, if there are multiple transcripts controlled by the locus, the effect of this variation can be reduced. Even with SNP-based studies, recombination can be substantial. For example, the Affymetrix GeneChip 500K SNP marker array produces an average correlation to the nearest tag SNP of 0.66–0.8, depending on the population (Frazer *et al.* 2007). Results that will appear in a future publication show that we can infer individual genotypes with >90% accuracy with simulated data generated using similar parameter values to those in Table 1 and a gene group size of 6 (E. H. Breazel and P. Schliekelman, unpublished data).

In the case of a cross with a dense marker map, the second and third benefits listed above will not apply. That is, there will be only two allele types and recombination between a QTL and the nearest marker will likely be minimal. There will still be a benefit in power if there are transcripts controlled by multiple causative loci. Furthermore, the use of dense markers requires a higher number of statistical tests and corresponding loss of power. There may be a benefit in using a less dense marker map and an expression-based approach like ours.

It is an open question as to the extent to which QTL identified by our method are QTL in the traditional sense. It is the case that all four significant and near-significant QTL from the traditional QTL mapping approach were significant by our approach. Thus, our method does capture “real” QTL. Furthermore, other significant QTL by our method have been previously identified as body weight QTL (File S1). However, it is not clear whether all of the loci that we identified would, for example, be identified as QTL in a sufficiently powerful traditional QTL study.

It should be emphasized that an eQTL for a transcript that is trait associated is not necessarily itself causative for the clinical trait. There are three ways that a transcript could be associated with a clinical trait. First, it could be causative for the trait. In this case, any eQTL for the transcript are necessarily causative for the clinical trait. Second, it could be reactive to a clinical trait QTL but not causative to the trait. In this case, it could have other eQTL that are not causative to the trait. The third possibility is that it is reactive to the clinical trait. In this case, any QTL for the trait are also eQTL for the transcript. However, it could also have other eQTL that are not causative for the trait.

#### Assumptions of the method:

The most fundamental assumption of our approach is that polymorphisms that contribute to variation in clinical traits also create consistent variation in the expression level of some set of genes (whether in *cis* or in *trans*). This assumption seems supported by our results using the data of Ghazalpour *et al*. (2006) as well other studies (see Introduction). To the extent that such consistent differences in expression do exist, then these differences can be used to infer genotype at the polymorphic locus or loci.

Random variation in expression level will obscure this genotypic information. A second fundamental assumption of our approach is that we can reduce the effect of this variation by combining information from multiple transcripts. There does seem to be little doubt that clinically important polymorphisms often do affect the expression of multiple transcripts. This is seen in our results and in various other studies (see references cited in the Introduction). There are multiple possible approaches to combining the information from these multiple transcripts. Our approach is to assume that there is some degree of independence between transcripts after conditioning on shared controlling genotype. If this is true, then each transcript gives an independent observation of the controlling genotype and we can combine this information to reduce the effect of individual transcript variation, as we have demonstrated here. It is not clear to what extent this assumption of conditional independence is true or how much is gained by combining information from multiple transcripts. Our simulations show (Figure 5) that there is a major benefit in power even when there is substantial correlation between transcripts beyond shared controlling genotype. However, we assumed a simple correlation structure and it is not clear what would happen with a more complicated structure containing feedback and more realistic patterns of connections between genes, etc. A major task for future work will be to begin to sort out the relationship between gene networks and genotypic variation and to develop statistical methods that take the true network structure into account.

In discussing assumptions, we must distinguish between assumptions of the simulations and assumptions of the method. The simulations are based on many assumptions about the relationship between genotype, expression, and the clinical trait, but these are not assumptions of the method. These include the assumptions of an additive trait model, dominance coefficients, allele frequencies, heritability of expression, and number of transcripts per controlling locus. Some plots are shown with key parameters varying, and other plots are available in File S1. The impact of these assumptions is discussed extensively in Schliekelman (2008), and most of those findings apply directly to the results in this study.

#### Comparison to other genetical genomics approaches:

The unifying theme of recent genetical genomics studies is the idea that gene networks will be more strongly linked with clinical traits of interest than will individual genetic loci. Different studies have taken different approaches as to how to combine information from the genes in the network. Our approach is to posit the existence of a controlling locus or loci shared between the genes in a group and the clinical trait. We then use a maximum-likelihood approach to estimate the parameters of this locus. Ghazalpour *et al*. (2006), for example, identify gene groups associated with the clinical trait by using the average correlation coefficient between genes in the group and the trait. They then locate causative genomic regions by identifying eQTL hotspots for the significantly correlated gene groups. In another example, Chen *et al*. (2008) identified gene subnetworks that contained large numbers of genes whose expression levels were identified as causatively associated and in *cis* with a known metabolic QTL region on mouse chromosome 1. The relative merits of these approaches will depend at least partially on what turns out to be true about the relationships between gene networks, causative genotypes, and clinical traits. For example, the average correlation coefficient is probably not a good summary of the relationship between a gene group and the clinical trait. Our approach may better use the available expression information and therefore have better power. On the other hand, our approach is based on specific assumptions about the relationship between gene networks and genotype and may be less robust to violations of these assumptions than the approach of Ghazalpour *et al*. (2006). The approach of Ghazalpour *et al*. (2006) identifies only causative regions that drive the expression of many genes in a trait-associated gene group. Similarly the approach of Chen *et al.* (2008) and many other studies cited in the Introduction identify only networks with many genes mapping to a preidentified region of interest. While this will certainly produce interesting results, it may also miss important causative components. Further data will be required to determine which approach is best. Most likely, a combination of complementary approaches will be most effective.

## Acknowledgments

This study was supported in part by the University of Georgia Research Computing Center, a partnership between the Office of the Vice President for Research and the Office of the Chief Information Officer of the University of Georgia.

## Footnotes

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.110.123968/DC1.

Communicating editor: H. Zhao

- Received October 8, 2010.
- Accepted December 3, 2010.

- Copyright © 2011 by the Genetics Society of America