## Abstract

Multiple-trait association mapping, in which multiple traits are used simultaneously in the identification of genetic variants affecting those traits, has recently attracted interest. One class of approaches for this problem builds on classical variance component methodology, utilizing a multitrait version of a linear mixed model. These approaches both increase power and provide insights into the genetic architecture of multiple traits. In particular, it is possible to estimate the genetic correlation, which is a measure of the portion of the total correlation between traits that is due to additive genetic effects. Unfortunately, the practical utility of these methods is limited since they are computationally intractable for large sample sizes. In this article, we introduce a reformulation of the multiple-trait association mapping approach by defining the matrix-variate linear mixed model. Our approach reduces the computational time necessary to perform maximum-likelihood inference in a multiple-trait model by utilizing a data transformation. By utilizing a well-studied human cohort, we show that our approach provides more than a 10-fold speedup, making multiple-trait association feasible in a large population cohort on the genome-wide scale. We take advantage of the efficiency of our approach to analyze gene expression data. By decomposing gene coexpression into a genetic and environmental component, we show that our method provides fundamental insights into the nature of coexpressed genes. An implementation of this method is available at http://genetics.cs.ucla.edu/mvLMM.

CLASSICALLY, genome-wide association studies have been carried out using single traits. However, it is well known that genes often affect multiple traits, a phenomenon known as pleiotropy, and more recently it has been shown that performing association mapping with multiple traits simultaneously may increase statistical power (Korol *et al.* 2001; Ferreira and Purcell 2009; Liu *et al.* 2009; Avery *et al.* 2011; Korte *et al.* 2012). Analysis of multiple traits increases power because intuitively, multiple-trait measurements increase sample size relative to a single-trait measurement. However, utilizing the additional data is not straightforward as measurements from the same individual are not independent. This issue is analogous to that of association analysis in cohorts of related individuals, where trait measurements between related individuals are not independent. Variance component methods model this correlation structure by assuming that the covariance due to genetics between related individuals is proportional to their kinship coefficient (Kang *et al.* 2008). This constant of proportionality normalized by the total trait variance is related to narrow-sense heritability of the trait (the variance accounted for by additive genetic effects) (Yang *et al.* 2010).

When the same genetic variants affect multiple traits, trait values for an individual will tend to be correlated. Similarly, shared environmental effects also introduce some level of correlation between traits. A fundamental problem in understanding the relationship between the traits is determining the proportion of the total correlation due to genetics and the proportion due to environment. Classical approaches originating from animal breeding and agricultural research solve this problem by modeling the statistical relationship between traits, using a linear mixed model (LMM) (Falconer 1981; Mrode and Thompson 2005). These approaches decompose the between-trait correlation into both a genetic component and an environmental component and then use the LMM framework to obtain estimates for these quantities. The LMMs used in these classical approaches can be adapted for use in genome-wide association studies (GWAS) by utilizing them to test the association between genetic variants and multiple traits. Multiple-trait variance component methods closely follow the approach utilizing kinship values to model the covariance between different traits among different individuals, such that the genetic covariance between two individuals’ traits is proportional to their kinship coefficient (Henderson and Quaas 1976). In this case, the constant of proportionality is a function of the two trait heritabilities as well as the genetic correlation. These types of models are widely utilized in the plant breeding (Malosetti *et al.* 2008; Kelly *et al.* 2009; Verbyla and Cullis 2012) and animal breeding communities (Ducrocq and Besbes 1993). Similarly, multiple-trait models represent the covariance between traits within an individual as a function of both genetics and shared environment.

To utilize LMMs for association analysis, an iterative procedure must be employed to identify the maximum-likelihood estimates of the parameters of the statistical model used for association. The use of LMMs for single traits has been limited by the computational complexity of traditional maximum-likelihood procedures: $O(tn^3)$, where *n* is the number of individuals in the study and *t* is the number of iterations necessary for the maximum-likelihood algorithm to converge. However, recently developed estimation algorithms have made LMMs computationally efficient and feasible for large population cohorts (Kang *et al.* 2008, 2010; Lippert *et al.* 2011; Zhou and Stephens 2012), reducing the computational complexity from $O(tn^3)$ to $O(n^3 + tn)$ and enabling genome-wide association mapping for single traits using LMMs. Unfortunately, the previous approaches (Kang *et al.* 2008, 2010; Lippert *et al.* 2011; Zhou and Stephens 2012) cannot be directly applied to multiple-trait LMMs, meaning that the same computational inefficiencies that limited the widespread use of LMMs for single-trait GWAS now hinder the scale at which researchers can perform multiple-trait GWAS. More specifically, with *p* traits measured over *n* individuals, the covariance matrix relating the trait measurements will be of size $np \times np$ and the running time for classical multivariate LMMs is $O(tn^3p^3)$. In other words, even when *p* is small (*e.g.*, $p = 2$), the running time scales as the cube of the number of individuals in the sample, meaning that the use of multiple-trait LMMs is not feasible for large sample sizes.

A widely utilized approximation to using the full $np \times np$ covariance matrix is to assume that the genetic and environmental effects are independent, which allows the decomposition of this matrix into a sum of Kronecker products of an $n \times n$ matrix and a $p \times p$ matrix. This type of approach is widely utilized in the plant breeding literature (Malosetti *et al.* 2008; Kelly *et al.* 2009; Verbyla and Cullis 2012). In our work we reformulate this decomposition, using the matrix-variate normal distribution (Gupta and Nagar 2000). Using this formulation, we show how a simple data transformation leads to a model equivalent to the abovementioned model while allowing maximum-likelihood inference to be performed in computational time essentially linear in the size of the data set, given one-time costs of $O(n^3)$ and $O(p^3)$. In a simple case, let us assume that $p \ll n$ (*e.g.*, two traits *vs.* thousands of individuals) and that we have only a global mean for each trait; this leads to a total computational complexity of $O(n^3 + p^3 + tnp)$. The iterative part of the algorithm is then essentially linear in the size of the data set. We call our method the matrix-variate linear mixed model (mvLMM). Our approach differs from previous approaches in the plant and animal breeding communities in that our inference approach is more closely related to the EMMA algorithm (Kang *et al.* 2008) while previous inference methods are more closely related to the average information restricted maximum-likelihood (REML) algorithm as implemented in ASReml (Gilmour *et al.* 1995). The reason why algorithms such as EMMA (Kang *et al.* 2008), EMMAX (Kang *et al.* 2010), FaST-LMM (Lippert *et al.* 2011), and GEMMA (Zhou and Stephens 2012) and related methods have become popular in human GWAS is that they take advantage of the specific formulation of the variance components to allow for efficient estimation compared to methods such as ASReml that can be applied to a more general set of models.

We demonstrate the efficacy of our method by analyzing correlated traits in the Northern Finland Birth Cohort (Sabatti *et al.* 2008). Comparing it to a standard approach (Lee *et al.* 2012), we show that our method results in a >10-fold time reduction for a pair of correlated traits, taking the analysis time from ∼35 min to ∼2.5 min for the cubic operations plus another 12 sec for the iterative part of the algorithm. In addition, the cubic operation can be saved so that it does not have to be recalculated when analyzing other traits in the same cohort. Finally, we demonstrate how this method can be used to analyze gene expression data. Using a well-studied yeast data set (Smith and Kruglyak 2008), we show how estimation of the genetic and environmental components of correlation between pairs of genes allows us to understand the relative contribution of genetics and environment to coexpression.

## Methods

### Modeling multiple traits with the matrix-variate linear mixed model

Given a set of *p* traits for *n* individuals, a standard statistical model for the *i*th trait vector, denoted by $\mathbf{y}_i$, is given by the following LMM relating phenotypes to genotypes:

$$\mathbf{y}_i = \mathbf{X}\boldsymbol{\beta}_i + \mathbf{u}_i + \mathbf{e}_i,$$

where $\mathbf{X}\boldsymbol{\beta}_i$ represents the mean term for the *i*th trait such that **X** is an $n \times q$ matrix encoding *q* covariates including the SNP being tested, $\mathbf{u}_i$ represents the population structure or genetic background component, and $\mathbf{e}_i$ represents the effect due to environment and error. We use $y_{ij}$ to represent the value of the *i*th trait for the *j*th individual. We have assumed that the covariates determining the mean will be shared among traits, but this is not a requirement. The variance of $\mathbf{y}_i$ is given by the following, assuming that $\mathbf{u}_i \sim N(\mathbf{0}, \sigma_{g_i}^2\mathbf{K})$ and $\mathbf{e}_i \sim N(\mathbf{0}, \sigma_{e_i}^2\mathbf{I})$:

$$\mathrm{Var}(\mathbf{y}_i) = \sigma_{g_i}^2\mathbf{K} + \sigma_{e_i}^2\mathbf{I},$$

where $\sigma_{g_i}^2$ represents the genetic variance component for trait *i*, **K** represents the kinship matrix calculated using a set of *m* known variants, and $\sigma_{e_i}^2$ represents the environmental and error variance. We note this model assumes i.i.d. environmental errors for a given trait, which may be unrealistic for some applications (Bello *et al.* 2012). We use $K_{jk}$ to represent the entry of the kinship matrix corresponding to the relation between the *j*th and *k*th individuals. From these models (Henderson and Quaas 1976; Mrode and Thompson 2005), it follows that the covariance between measurements for individuals *j* and *k* for trait *i* is given by

$$\mathrm{Cov}(y_{ij}, y_{ik}) = \sigma_{g_i}^2 K_{jk} + \sigma_{e_i}^2\,\mathbb{1}[j = k]. \tag{1}$$

We now consider models with multiple traits. By letting $\rho_{g_{im}}$ represent the correlation between traits *i* and *m* due to genetic effect and letting $\rho_{e_{im}}$ represent the correlation due to an individual’s environment, the covariance between the trait measurements *i* and *m* for individual *j* is

$$\mathrm{Cov}(y_{ij}, y_{mj}) = \rho_{g_{im}}\sigma_{g_i}\sigma_{g_m} K_{jj} + \rho_{e_{im}}\sigma_{e_i}\sigma_{e_m}. \tag{2}$$

Assuming that environmental effects are independent between individuals, the covariance between traits *i* and *m* for individuals $j \neq k$ is

$$\mathrm{Cov}(y_{ij}, y_{mk}) = \rho_{g_{im}}\sigma_{g_i}\sigma_{g_m} K_{jk}. \tag{3}$$

In fact, these models are standard models utilized in the animal and plant breeding communities.

A straightforward approach to represent this model is to stack the trait vectors for all individuals into one long vector of length $np$ and then represent their covariances in an $np \times np$ matrix populated using Equations 1–3. However, this matrix will have $(np)^2$ elements, and fitting this model to estimate the parameters for even a small number of phenotypes is computationally intractable.

### Matrix-variate normal distribution

We note that the covariance matrix above has a significant amount of structure, as evident in Equations 1–3. In fact, this matrix can be represented by the sum of two matrices, each of which is a Kronecker product of an *n* × *n* and a *p* × *p* matrix. This decomposition is widely utilized in the plant breeding literature (Malosetti *et al.* 2008; Kelly *et al.* 2009; Verbyla and Cullis 2012). Our main contribution is an efficient method for performing inference in these models by modeling the full set of trait measurements, using a matrix-variate normal distribution. The matrix-variate normal distribution is a generalization of the multivariate normal distribution to matrices (Gupta and Nagar 2000) and is a very natural way to represent these types of factored models. Unlike in a multivariate normal model where the data are concatenated into a single vector of length *np*, in a matrix-variate model the data are an $n \times p$ matrix where each column is a trait. Instead of representing a covariance structure using a single $np \times np$ matrix, the matrix-variate normal distribution represents the covariance using two matrices: a $p \times p$ matrix **A** that represents the covariance between columns of the data and an $n \times n$ matrix **B** that represents the covariance between rows of the data. In a matrix-variate normal distribution, the mean is now an $n \times p$ matrix. We denote a matrix-variate normal model, using the notation $\mathcal{MN}(\mathbf{M}, \mathbf{A}, \mathbf{B})$.

Using the matrix-variate normal distribution, our model can be represented as

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{Z} + \mathbf{R},$$

where **Y** is the $n \times p$ matrix of traits and **B** is the $q \times p$ matrix of fixed-effect coefficients; **Z** follows a matrix-variate normal distribution with mean $\mathbf{0}$ and covariance matrices **Ψ** and **K**, where **Ψ** is a $p \times p$ matrix representing the correlation between traits due to genetics and **K** is the kinship matrix. **R** follows a matrix-variate normal distribution with mean zero and covariance matrices **Φ** and $\mathbf{I}$, where **Φ** is a $p \times p$ matrix representing the covariance between traits due to environment and error. The *i*th diagonal component of **Ψ** is given by $\sigma_{g_i}^2$ and the $(i, m)$th component by $\rho_{g_{im}}\sigma_{g_i}\sigma_{g_m}$, and similarly for **Φ**. The distribution for **Y** is then summarized as follows, where $\mathcal{MN}(\mathbf{M}, \mathbf{A}, \mathbf{B})$ denotes the matrix-variate normal distribution with mean matrix **M** and column and row covariance matrices **A** and **B**:

$$\mathbf{Y} \sim \mathbf{X}\mathbf{B} + \mathcal{MN}(\mathbf{0}, \boldsymbol{\Psi}, \mathbf{K}) + \mathcal{MN}(\mathbf{0}, \boldsymbol{\Phi}, \mathbf{I}). \tag{4}$$
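To make the factored representation concrete, the following sketch (ours, not from the mvLMM implementation; all names are hypothetical) verifies the standard identity that a matrix-variate normal draw with column covariance **A** and row covariance **B** has vec-covariance **A** ⊗ **B**:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 2

# Row covariance B (n x n) and column covariance A (p x p), via random factors.
Lb = rng.standard_normal((n, n))
La = rng.standard_normal((p, p))
B = Lb @ Lb.T
A = La @ La.T

# vec() with column stacking, matching vec(Y) in the text.
vec = lambda M: M.flatten(order="F")

# Identity vec(Lb G La^T) = (La (x) Lb) vec(G) for any G, which implies that
# Y = Lb G La^T with G i.i.d. standard normal has Cov(vec(Y)) = A (x) B.
G = rng.standard_normal((n, p))
Y = Lb @ G @ La.T
assert np.allclose(vec(Y), np.kron(La, Lb) @ vec(G))
assert np.allclose(np.kron(La, Lb) @ np.kron(La, Lb).T, np.kron(A, B))
print("matrix-variate draw has vec-covariance A (x) B")
```

This is exactly why the two small covariance matrices suffice: the implied $np \times np$ covariance never has to be formed explicitly.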

### Efficient maximum-likelihood computation

Likelihood evaluation for the matrix-variate distribution given by Equation 4 is accomplished by evaluating the equivalent multivariate normal distribution. By using the $\mathrm{vec}(\cdot)$ operator, which creates a vector from a matrix input by concatenating the columns of the matrix, we are able to represent the distribution given in Equation 4 in the following way, where ⊗ represents the Kronecker product of two matrices:

$$\mathrm{vec}(\mathbf{Y}) \sim N\big(\mathrm{vec}(\mathbf{X}\mathbf{B}),\; \boldsymbol{\Psi} \otimes \mathbf{K} + \boldsymbol{\Phi} \otimes \mathbf{I}\big). \tag{5}$$

The likelihood computation for this model takes time on the order of $O(n^3p^3)$. This computational time becomes prohibitive when maximizing the likelihood function while considering a large cohort with multiple traits. Previous work has shown how similar multivariate models with Kronecker product covariance matrices can be utilized efficiently when residual errors are independent (Stegle *et al.* 2011). However, it is not known how these models may be used efficiently when residual errors are correlated, which is the case for our model. To remedy this problem, we introduce a transformation that results in a reduced computational time.
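A direct, cubic-time evaluation of this vectorized likelihood, which is what the transformation below avoids at scale, can be sketched as follows on synthetic data (toy sizes; the variance parameters and kinship are made up for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n, p, q = 30, 2, 1

X = np.ones((n, q))                       # global mean only
B = rng.standard_normal((q, p))
G = rng.standard_normal((n, 50))
K = G @ G.T / 50                          # toy kinship from 50 "markers"
Psi = np.array([[1.0, 0.5], [0.5, 1.0]])  # genetic trait covariance
Phi = np.array([[1.0, 0.2], [0.2, 1.0]])  # environmental trait covariance
Y = rng.standard_normal((n, p))           # placeholder trait matrix

vec = lambda M: M.flatten(order="F")
V = np.kron(Psi, K) + np.kron(Phi, np.eye(n))   # dense np x np covariance

# Naive log-likelihood: one multivariate normal of dimension np.
ll = multivariate_normal.logpdf(vec(Y), mean=vec(X @ B), cov=V)

# The same quantity written out with slogdet, as a cross-check.
r = vec(Y) - vec(X @ B)
sign, logdet = np.linalg.slogdet(V)
ll_manual = -0.5 * (n * p * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(V, r))
assert np.isclose(ll, ll_manual)
print(f"log-likelihood: {ll:.4f}")
```

Every iteration of a naive optimizer would repeat the $O((np)^3)$ factorization of `V`, which is the cost the transformation removes.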

Let $\mathbf{K} = \mathbf{U}\mathbf{S}\mathbf{U}^{\mathsf{T}}$ be the eigendecomposition of **K**, where **U** is orthogonal and **S** is diagonal. This decomposition is calculated with a computational complexity of $O(n^3)$. Let **L** be a $p \times p$ matrix that diagonalizes both **Ψ** and **Φ**, such that $\mathbf{L}\boldsymbol{\Phi}\mathbf{L}^{\mathsf{T}} = \mathbf{I}$ and $\mathbf{L}\boldsymbol{\Psi}\mathbf{L}^{\mathsf{T}} = \boldsymbol{\Delta}$, a diagonal matrix. This bidiagonalization can be accomplished in $O(p^3)$ (details are in the *Diagonalizing two matrices* section below). We then define the matrix $\mathbf{M} = \mathbf{L} \otimes \mathbf{U}^{\mathsf{T}}$. The transformed data vector is defined as $\mathbf{y}^* = \mathbf{M}\,\mathrm{vec}(\mathbf{Y})$. This transformed vector has the following distribution:

$$\mathbf{y}^* \sim N\big(\mathbf{M}\,\mathrm{vec}(\mathbf{X}\mathbf{B}),\; \boldsymbol{\Delta} \otimes \mathbf{S} + \mathbf{I}\big).$$

Because $\boldsymbol{\Delta} \otimes \mathbf{S} + \mathbf{I}$ is diagonal, the log likelihood of $\mathbf{y}^*$ is given by a sum of independent univariate terms:

$$\ell(\mathbf{B}, \boldsymbol{\Psi}, \boldsymbol{\Phi}) = -\frac{1}{2}\sum_{i=1}^{p}\sum_{j=1}^{n}\left[\log\big(2\pi(\delta_i s_j + 1)\big) + \frac{(y^*_{ij} - \mu^*_{ij})^2}{\delta_i s_j + 1}\right],$$

where $\delta_i$ and $s_j$ are the diagonal entries of **Δ** and **S**, $y^*_{ij}$ denotes the entry of $\mathbf{y}^*$ corresponding to transformed trait *i* and individual *j*, and $\boldsymbol{\mu}^* = \mathbf{M}\,\mathrm{vec}(\mathbf{X}\mathbf{B})$. To calculate the likelihood given **Ψ** and **Φ**, we first obtain the transformation matrix **M**, which is accomplished in $O(p^3)$ given the eigendecomposition of **K**. Next, we compute the transformed data vector in $O(np^2)$. Given $\mathbf{y}^*$, we obtain an estimate of *β*, denoted by $\hat{\beta}$, by generalized least squares against the diagonal covariance, and given this we calculate the residual vector in $O(np)$. Finally, the likelihood is computed in $O(np)$. Part of the reason that our approach is efficient is that much of the computation can be reused for many analyses. For example, the matrix **M** requires diagonalizing the **K** matrix, which requires $O(n^3)$ time and needs to be performed only once for the complete analysis of the data set. Similarly, the transformed data vector does not depend on which variant is actually being tested and can be computed only once for each set of traits that is being considered. Thus the likelihood computation for each variant is dominated by $O(np)$, utilizing the quantities that were computed once. In addition, in many scenarios we can assume that the effect sizes are small, as in human studies. Under this assumption, we can fit the variance parameters just once, assuming that $\beta = 0$ for the tested variant, and then use this estimate to test each variant. In this case, computing the maximum likelihood for each variant reduces to $O(np)$. This transformation is similar to approaches in the plant breeding literature that speed up computations using two eigendecompositions (Piepho *et al.* 2012).
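The transformation can be illustrated end to end on synthetic data (a sketch under our own toy setup, not the released mvLMM code). For brevity we obtain **L** from SciPy's generalized symmetric eigensolver, which is equivalent to the two-eigendecomposition procedure described in *Diagonalizing two matrices* below, and we check the diagonalized likelihood against the dense computation; the two densities differ only by the fixed Jacobian of the linear transform:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
n, p = 30, 2

G = rng.standard_normal((n, 50))
K = G @ G.T / 50                          # toy kinship
Psi = np.array([[1.0, 0.5], [0.5, 1.0]])  # genetic trait covariance
Phi = np.array([[1.0, 0.2], [0.2, 1.0]])  # environmental trait covariance
Y = rng.standard_normal((n, p))
vec = lambda M: M.flatten(order="F")

# One-time O(n^3): eigendecomposition K = U diag(s) U^T.
s, U = np.linalg.eigh(K)

# O(p^3): L with L Phi L^T = I and L Psi L^T = diag(d), via the
# generalized symmetric eigenproblem Psi v = lambda Phi v (L = V^T).
d, Vgen = eigh(Psi, Phi)
L = Vgen.T
assert np.allclose(L @ Phi @ L.T, np.eye(p))
assert np.allclose(L @ Psi @ L.T, np.diag(d))

# Transformed data y* = (L (x) U^T) vec(Y) = vec(U^T Y L^T); its covariance
# diag(d) (x) diag(s) + I is diagonal, so the likelihood costs O(np).
ystar = vec(U.T @ Y @ L.T)
var = np.kron(d, s) + 1.0                 # diagonal of the covariance
ll_fast = -0.5 * np.sum(np.log(2 * np.pi * var) + ystar**2 / var)

# Dense reference computation (zero mean for brevity).
V = np.kron(Psi, K) + np.kron(Phi, np.eye(n))
sign, logdet = np.linalg.slogdet(V)
ll_dense = -0.5 * (n * p * np.log(2 * np.pi) + logdet
                   + vec(Y) @ np.linalg.solve(V, vec(Y)))

# The densities of y* and vec(Y) differ by the Jacobian of the map M:
# log|det M| = n log|det L| (U is orthogonal and contributes nothing).
jac = n * np.log(abs(np.linalg.det(L)))
assert np.isclose(ll_fast + jac, ll_dense)
print("diagonalized likelihood matches the dense computation")
```

In practice `U` and `U.T @ Y` are computed once per data set, so each likelihood evaluation in the iterative search touches only $O(np)$ quantities.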

This assumption is the same assumption that differentiates EMMAX (Kang *et al.* 2010) from EMMA (Kang *et al.* 2008). While this assumption is appropriate for human studies where most identified genetic variants have very small effects, this assumption may not be appropriate for plant and animal models where there are often several loci with very strong effects. An approach to handle this case while avoiding refitting the variance parameters for each variant is to first identify the variants with strong effects, using the above assumption, and then refit the variance parameters after including these strongly associated variants as fixed effects in the model.

### Restricted maximum-likelihood computation

The REML and the maximum-likelihood (ML) solutions should be similar when the model contains no covariates or only a bias term. However, when this is not the case, parameter estimates obtained in REML analysis may deviate significantly from those of ML. We obtain the REML version of the mvLMM by extending the ML solution (Welham and Thompson 1997). Denote the log likelihood obtained by ML as $\ell_{\mathrm{ML}}$ and that obtained by REML as $\ell_{\mathrm{REML}}$. For a standard multivariate normal vector **y** with distribution $N(\mathbf{X}\boldsymbol{\beta}, \boldsymbol{\Sigma})$, where **X** has *q* columns, the REML log likelihood is

$$\ell_{\mathrm{REML}} = \ell_{\mathrm{ML}} + \frac{1}{2}\left[q \log 2\pi + \log\left|\mathbf{X}^{\mathsf{T}}\mathbf{X}\right| - \log\left|\mathbf{X}^{\mathsf{T}}\boldsymbol{\Sigma}^{-1}\mathbf{X}\right|\right]$$

(Kang *et al.* 2008). Given this standard result, we define the REML log likelihood for the mvLMM analogously, replacing **X** with the transformed design matrix $\mathbf{X}^* = \mathbf{L} \otimes (\mathbf{U}^{\mathsf{T}}\mathbf{X})$ and $\boldsymbol{\Sigma}$ with the diagonal covariance $\boldsymbol{\Delta} \otimes \mathbf{S} + \mathbf{I}$:

$$\ell_{\mathrm{REML}} = \ell_{\mathrm{ML}} + \frac{1}{2}\left[pq \log 2\pi + \log\left|\mathbf{X}^{*\mathsf{T}}\mathbf{X}^*\right| - \log\left|\mathbf{X}^{*\mathsf{T}}(\boldsymbol{\Delta} \otimes \mathbf{S} + \mathbf{I})^{-1}\mathbf{X}^*\right|\right].$$

The computational cost of the additional operations required to define $\ell_{\mathrm{REML}}$ does not change the order of the computational complexity.

### Estimating genetic correlation

To evaluate the likelihood function in Equation 5, we obtain estimates for the parameters **Ψ** and **Φ**. We estimate these parameters under the null model, where SNPs are not included as covariates. This assumption has been used previously and is valid for cases when the effect due to each SNP is small (Kang *et al.* 2010; Lippert *et al.* 2011). First, for each trait *i*, we fit the basic LMM from Equation 1 to identify the optimal values of the variance parameters $\sigma_{g_i}^2$ and $\sigma_{e_i}^2$. Holding these parameters constant, we perform a two-dimensional global grid search to identify the optimal genetic and environmental correlation parameters. With caching, the likelihood calculation takes time on the order of $O(np)$. This time will be multiplied by a constant when searching over a grid of size *k* for each correlation parameter. That is, if we evaluate the likelihood for each genetic and environmental correlation combination for a grid size of *k*, then we need to evaluate the likelihood $k^2$ times.
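The grid search can be sketched as follows (synthetic data; variable names are ours). For clarity the likelihood here is evaluated densely, whereas an efficient implementation would use the transformed, diagonal form described above:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
n = 40
G = rng.standard_normal((n, 60))
K = G @ G.T / 60                      # toy kinship
vec = lambda M: M.flatten(order="F")

# Per-trait variance components, assumed already fitted trait by trait.
sg2 = np.array([0.6, 0.8])            # genetic variances
se2 = np.array([0.4, 0.2])            # environmental variances

Y = rng.standard_normal((n, 2))       # placeholder pair of traits

def loglik(rg, re):
    """Dense bivariate LMM likelihood at one (rho_g, rho_e) grid point."""
    cg = rg * np.sqrt(sg2[0] * sg2[1])
    ce = re * np.sqrt(se2[0] * se2[1])
    Psi = np.array([[sg2[0], cg], [cg, sg2[1]]])
    Phi = np.array([[se2[0], ce], [ce, se2[1]]])
    V = np.kron(Psi, K) + np.kron(Phi, np.eye(n))
    return multivariate_normal.logpdf(vec(Y), mean=np.zeros(2 * n), cov=V)

# Two-dimensional grid search over the two correlation parameters.
grid = np.linspace(-0.9, 0.9, 19)     # k = 19 -> k^2 likelihood evaluations
scores = np.array([[loglik(rg, re) for re in grid] for rg in grid])
i, j = np.unravel_index(np.argmax(scores), scores.shape)
rho_g, rho_e = grid[i], grid[j]
print(f"estimated rho_g={rho_g:.2f}, rho_e={rho_e:.2f}")
```

With the transformed likelihood each of the $k^2$ evaluations costs $O(np)$ plus an $O(p^3)$ rediagonalization, so the search remains cheap even for large *n*.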

To expand this approach to more than two traits, we propose a straightforward pairwise approach to identify the maximum-likelihood parameters. Instead of performing a full grid search over the correlation parameters, we identify the ML estimates of the parameters in each pair of traits. This procedure will be much faster than a full grid search over all pairs of traits and we discuss in Supporting Information, File S1, Figure S1, Figure S2, and Table S1 why this procedure is also more robust.

### Calculating sampling variance for parameter estimates

We calculate the sampling variance of the variance parameters and the correlation parameters, using standard multivariate theory. Generally, the sampling variance of an ML parameter estimate is given by the inverse of Fisher’s information (or average information) matrix evaluated at the ML parameters (Searle *et al.* 1992). Using the search technique described above, we identify the ML parameters for a given set of traits and then use these parameters to estimate the sampling variance, using Fisher’s information matrix.

### Association analysis

To identify genetic variants that have an effect on our traits of interest, we employ a hypothesis-testing framework. We first estimate the effect that a particular SNP *x* has on each of the traits, using the mvLMM, and then we jointly test *p* hypotheses, one per trait, each testing the effect of the SNP on that trait. Our null hypothesis for this test is that the SNP has no effect on any of our traits, and the alternative hypothesis is that it has an effect on one or more of the traits.

To obtain estimates for the SNP effect sizes, we include one SNP in the model at a time and estimate *β* from Equation 5. First, we obtain the maximum-likelihood parameters for **Ψ** and **Φ** under the null model in which the SNP has no effect, as described in the previous section. Then, given these two parameters, we compute an estimate of the coefficient matrix **B**, using the following result.

In the previous section, we defined a transformation $\mathbf{M} = \mathbf{L} \otimes \mathbf{U}^{\mathsf{T}}$ and used it to define a transformed data vector $\mathbf{y}^* = \mathbf{M}\,\mathrm{vec}(\mathbf{Y})$. The mean of the transformed data is given by $\mathbf{M}\,\mathrm{vec}(\mathbf{X}\mathbf{B})$, which can be reduced as follows:

$$\mathbf{M}\,\mathrm{vec}(\mathbf{X}\mathbf{B}) = (\mathbf{L} \otimes \mathbf{U}^{\mathsf{T}})(\mathbf{I} \otimes \mathbf{X})\,\mathrm{vec}(\mathbf{B}) = \big(\mathbf{L} \otimes (\mathbf{U}^{\mathsf{T}}\mathbf{X})\big)\,\mathrm{vec}(\mathbf{B}) = \mathbf{X}^*\,\mathrm{vec}(\mathbf{B}).$$

Here we have let $\mathbf{X}^* = \mathbf{L} \otimes (\mathbf{U}^{\mathsf{T}}\mathbf{X})$. By denoting $\mathrm{vec}(\mathbf{B})$ as $\beta$, we obtain an estimate $\hat{\mathbf{B}}$, using the following result, where $\mathrm{vec}^{-1}$ represents the reversal of the $\mathrm{vec}$ operation and we have let $\mathbf{P} = \boldsymbol{\Delta} \otimes \mathbf{S} + \mathbf{I}$, the transformed data covariance matrix:

$$\hat{\mathbf{B}} = \mathrm{vec}^{-1}\Big(\big(\mathbf{X}^{*\mathsf{T}}\mathbf{P}^{-1}\mathbf{X}^*\big)^{-1}\mathbf{X}^{*\mathsf{T}}\mathbf{P}^{-1}\mathbf{y}^*\Big).$$

Since **P** is a diagonal matrix, $\hat{\mathbf{B}}$ can be computed in time linear in *n*, given the one-time cost of $O(n^2q)$ for computing $\mathbf{U}^{\mathsf{T}}\mathbf{X}$.
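A sketch of this estimator on synthetic data (our own toy setup, not the released code), checking the transformed, diagonal-covariance computation against a dense GLS fit of the untransformed model:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(4)
n, p, q = 30, 2, 2                    # q covariates: intercept + tested SNP

G = rng.standard_normal((n, 50))
K = G @ G.T / 50
Psi = np.array([[1.0, 0.5], [0.5, 1.0]])
Phi = np.array([[1.0, 0.2], [0.2, 1.0]])
snp = rng.integers(0, 3, size=n).astype(float)
X = np.column_stack([np.ones(n), snp])
Y = rng.standard_normal((n, p))
vec = lambda M: M.flatten(order="F")

# Transform as in the text: K = U diag(s) U^T, L bidiagonalizes Psi and Phi.
s, U = np.linalg.eigh(K)
d, Vgen = eigh(Psi, Phi)
L = Vgen.T

Xstar = np.kron(L, U.T @ X)           # transformed design, (np) x (pq)
ystar = vec(U.T @ Y @ L.T)
Pinv = 1.0 / (np.kron(d, s) + 1.0)    # inverse of the diagonal covariance P

# Weighted least squares against the diagonal covariance.
A = Xstar.T @ (Pinv[:, None] * Xstar)
b = Xstar.T @ (Pinv * ystar)
beta_hat = np.linalg.solve(A, b)
B_hat = beta_hat.reshape((q, p), order="F")   # vec^{-1}: back to q x p

# Cross-check against dense GLS on the untransformed model.
V = np.kron(Psi, K) + np.kron(Phi, np.eye(n))
Xfull = np.kron(np.eye(p), X)                 # vec(X B) = (I (x) X) vec(B)
beta_dense = np.linalg.solve(Xfull.T @ np.linalg.solve(V, Xfull),
                             Xfull.T @ np.linalg.solve(V, vec(Y)))
assert np.allclose(vec(B_hat), beta_dense)
print("transformed GLS matches dense GLS")
```

The two estimates agree exactly because **M** is invertible, so generalized least squares in the transformed coordinates solves the same normal equations as in the original ones.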

The statistic for testing the proposed hypothesis is obtained by defining a transformation matrix **R** so that $\mathbf{R}\,\mathrm{vec}(\mathbf{B}) = (\beta_{x1}, \ldots, \beta_{xp})^{\mathsf{T}}$, where $\beta_{xi}$ is the coefficient estimate for the effect of SNP *x* on trait *i*. Therefore, given this matrix, we define the *F*-statistic for testing association in Equation 5 as

$$F = \frac{(\mathbf{R}\hat{\beta})^{\mathsf{T}}\big[\mathbf{R}(\mathbf{X}^{*\mathsf{T}}\mathbf{P}^{-1}\mathbf{X}^*)^{-1}\mathbf{R}^{\mathsf{T}}\big]^{-1}(\mathbf{R}\hat{\beta})}{p\,\hat{\sigma}^2},$$

which under the null follows an *F*-distribution with *p* numerator degrees of freedom and $np - pq$ denominator degrees of freedom, where $\hat{\sigma}^2$ represents the sample variance of the transformed residuals. Details of this test can be found in McCulloch and Neuhaus (1999).
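One plausible implementation of this test is sketched below on synthetic data. For clarity it works with the dense covariance rather than the transformed one, and it uses the Wald-style form of the *F*-statistic with the degrees of freedom given above; names and toy parameters are ours:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(5)
n, p, q = 50, 2, 2

G = rng.standard_normal((n, 80))
K = G @ G.T / 80
Psi = np.array([[1.0, 0.5], [0.5, 1.0]])
Phi = np.array([[1.0, 0.2], [0.2, 1.0]])
snp = rng.integers(0, 3, size=n).astype(float)
X = np.column_stack([np.ones(n), snp])    # intercept + tested SNP
Y = rng.standard_normal((n, p))
vec = lambda M: M.flatten(order="F")

V = np.kron(Psi, K) + np.kron(Phi, np.eye(n))
Xfull = np.kron(np.eye(p), X)             # (np) x (pq) design for vec(B)

# GLS estimate and its covariance under the fitted variance parameters.
XtVi = np.linalg.solve(V, Xfull).T        # X^T V^{-1}
cov_beta = np.linalg.inv(XtVi @ Xfull)
beta_hat = cov_beta @ (XtVi @ vec(Y))

# R picks out the SNP coefficient for each trait from vec(B).
R = np.zeros((p, p * q))
for i in range(p):
    R[i, i * q + 1] = 1.0                 # SNP is covariate index 1

Rb = R @ beta_hat
F = Rb @ np.linalg.solve(R @ cov_beta @ R.T, Rb) / p
pval = f_dist.sf(F, p, n * p - p * q)     # df as described in the text
print(f"F = {F:.3f}, p-value = {pval:.3f}")
```

Since the traits here are pure noise, the *P*-value should be unremarkable; the point of the sketch is the mechanics of the joint test, not its power.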

### Diagonalizing two matrices

We are given two positive semidefinite matrices **Φ** and **Ψ**, and we wish to identify a matrix **L** that diagonalizes both of these matrices. This is accomplished in the following way. First, we obtain the eigendecomposition $\boldsymbol{\Phi} = \mathbf{Q}_1\mathbf{S}_1\mathbf{Q}_1^{\mathsf{T}}$ and then define a matrix $\mathbf{C} = \mathbf{S}_1^{-1/2}\mathbf{Q}_1^{\mathsf{T}}$, so that $\mathbf{C}\boldsymbol{\Phi}\mathbf{C}^{\mathsf{T}} = \mathbf{I}$. Next, we obtain an eigendecomposition $\mathbf{C}\boldsymbol{\Psi}\mathbf{C}^{\mathsf{T}} = \mathbf{Q}_2\mathbf{S}_2\mathbf{Q}_2^{\mathsf{T}}$ and then define a matrix $\mathbf{L} = \mathbf{Q}_2^{\mathsf{T}}\mathbf{C}$. With this we see that $\mathbf{L}\boldsymbol{\Phi}\mathbf{L}^{\mathsf{T}} = \mathbf{I}$ and that $\mathbf{L}\boldsymbol{\Psi}\mathbf{L}^{\mathsf{T}} = \mathbf{S}_2$, a diagonal matrix. The entire procedure has complexity $O(p^3)$.
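The procedure translates directly into code; a minimal sketch (toy matrices, names ours):

```python
import numpy as np

def diagonalize_two(Phi, Psi):
    """Return L with L Phi L^T = I and L Psi L^T diagonal (O(p^3))."""
    # Whiten Phi: C Phi C^T = I.
    S1, Q1 = np.linalg.eigh(Phi)
    C = (Q1 / np.sqrt(S1)).T              # C = S1^{-1/2} Q1^T
    # Eigendecompose the whitened Psi and rotate.
    S2, Q2 = np.linalg.eigh(C @ Psi @ C.T)
    return Q2.T @ C

rng = np.random.default_rng(6)
p = 4
A = rng.standard_normal((p, p))
Phi = A @ A.T + p * np.eye(p)             # strictly positive definite
B = rng.standard_normal((p, p))
Psi = B @ B.T

L = diagonalize_two(Phi, Psi)
assert np.allclose(L @ Phi @ L.T, np.eye(p))
W = L @ Psi @ L.T
assert np.allclose(W, np.diag(np.diag(W)))  # off-diagonals vanish
print("L simultaneously diagonalizes Phi and Psi")
```

Note the whitening step requires **Φ** to be invertible; a singular **Φ** would need regularization before this sketch applies.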

### Genotype and phenotype data

We apply our method to the Northern Finland Birth Cohort data (Sabatti *et al.* 2008), which were used in Kang *et al.* (2010) and Korte *et al.* (2012). This data set consists of 5326 individuals that had been filtered to reduce the presence of family structure. The data set contains 331,450 autosomal SNPs after application of exclusion criteria based on Hardy–Weinberg equilibrium, genotyping completeness, and minor allele frequency. Missing genotypes are replaced with the minor allele frequency. Missing phenotypes are replaced with the phenotypic mean.

We use a well-studied yeast data set (Smith and Kruglyak 2008) consisting of 109 yeast strains, each with 5793 gene expression measurements. Bivariate association mapping is performed on all 2956 available SNPs. Gene expression values were normalized and subjected to quality control by Smith and Kruglyak (2008), and we utilized the same data as they did.

## Results

### Association and genetic correlation in the Northern Finland Birth Cohort

#### Association:

We apply our method to the Northern Finland Birth Cohort, a founder cohort consisting of 5043 individuals, each of which has multiple-trait measurements for four different metabolic traits. We analyze a total of six pairs of traits or all combinations of four traits: HDL and LDL cholesterol, C-reactive protein (CRP), and triglycerides (TG). Association between each SNP and each pair of traits is evaluated by assuming that under the null hypothesis the SNP does not affect either trait.

We compare our results to the analysis of Korte *et al.* (2012), which analyzed the same data using a classically based multiple-trait LMM, which they refer to as the multitrait mixed-model (MTMM) method. Our results are highly concordant (0.96–0.99), indicating that our method is consistent with classical approaches. For example, Figure 1 compares the QQ plots of the mvLMM and MTMM for the joint analysis of TG and LDL.

Over 99% of associations identified in marginal analysis are also identified when respective pairs of traits are mapped (significance threshold of 1.5*e*-7). However, the joint mapping uncovers more significant associations; 19 new associations are identified across all trait pairs. For example, in the analysis of TG with CRP, we identify a SNP (rs2000571) with a *P*-value of 8.58*e*-7 and with the MTMM a *P*-value of 1.7*e*-6. This SNP was not significant in the marginal analysis of TG (1.7*e*-5) or CRP (0.03), but belongs to a region on chromosome 11 that has been shown to harbor variants contributing to triglycerides (Braun *et al.* 2012). Many of the identified associated SNPs were more significant in the mvLMM compared to the MTMM, which we suspect is because the mvLMM finds the actual parameters that maximize the likelihood. We also apply our method to analyze all four traits simultaneously and the results are shown in Table 2. For all variants, at least one of the pair of trait *P*-values is more significant than the all-trait *P*-value. Thus it appears that in this scenario, it is best to follow a single-trait analysis with the all-pairs analysis. This raises the more general issue of how one should apply a method such as this and we provide some guidance in the *Discussion*.

#### Genetic correlations:

In multiple-trait models, the total trait correlation is partitioned into a genetic and an environmental component. The genetic component of the correlation (the genetic correlation) represents the part of the total trait covariance that is attributed to genetics normalized by the genetic variances. This quantity provides insight into the genetic architecture of the relationships between traits. We estimate the genetic correlations for each pair of traits analyzed in the Finland Birth Cohort and compare these estimates with those obtained using a standard implementation of a bivariate LMM as implemented in genome-wide complex trait analysis (GCTA) (Lee *et al.* 2012). Table 1 compares estimates of genetic correlation obtained with GCTA and the mvLMM. When we compare our results to those of GCTA, we find that the two methods yield similar results, with genetic correlation estimates falling <1 standard deviation from one another. In addition, the running time for the classical approach was ∼35 min, while the running time for the mvLMM was on average ∼12 sec, given a one-time cost of 2.5 min shared across pairs of traits.

### Bivariate analysis in yeast data

Gene coexpression, defined as the correlation between expression levels of a pair of genes estimated in a set of individuals, is a fundamental quantity that has been utilized for a variety of applications (Stuart *et al.* 2003; Subramanian *et al.* 2005; Ghazalpour *et al.* 2006; Lee *et al.* 2006). There are two prevalent views about the meaning of significant coexpression. The first is that coexpression stems from similar environmental conditions such as disease status (Heller *et al.* 1997). The second comes from the systems genetics literature, where it is thought that coexpressed genes have a similar genetic regulatory program and that specific genetic variants drive modules of coexpressed genes (Ghazalpour *et al.* 2006; Lee *et al.* 2006). However, correlation estimates from gene expression levels measure the combined effect of both the genetic and the environmental components. Our methodology allows us, for the first time, to decompose coexpression into a genetic and an environmental component.

We utilize the major gain in efficiency of our approach to perform an analysis that is not feasible with current methods. Using a well-studied yeast data set (Smith and Kruglyak 2008) consisting of 109 yeast strains, each with 5793 gene expression measurements, we perform a bivariate analysis, estimating genetic correlations for all $\binom{5793}{2}$ gene expression pairs. Within this data set, several regions of the genome have been implicated as harboring genetic variation that affects many gene expression levels.

Using a set of hotspot locations derived from Smith and Kruglyak (2008), we define a set of 13,508 hotspot gene pairs by extracting all pairs of genes that lie in each known hotspot. We then compare the phenotypic correlation to the total proportion of covariation accounted for by genetics for each of these pairs. Assuming that hotspot pairs are under the same genetic regulation, we expect that the phenotypic correlation for any given pair should reflect this by having a high value. However, this might not be the case if the environmental correlation between the pair contributes in such a way to lower the overall phenotypic correlation. Therefore, an estimation of the total phenotypic covariation attributed to genetics may better reflect the fact that the two genes are under the same genetic program.

In Figure 2A, we plot the histogram of the absolute value of the total phenotypic correlation for all gene pairs and for hotspot gene pairs. We see that the distribution of phenotypic correlations for hotspot pairs is shifted toward higher correlations with respect to all pairs, giving an indication of coregulation. However, most of the pairs have correlation <0.5. Figure 2B shows the same plot generated using the total proportion of the phenotypic covariation attributed to genetics. In Figure 2B, we observe that this genetic proportion for hotspot pairs is dramatically skewed toward 1. In fact, for most hotspot pairs the proportion of covariation attributed to genetics is >0.7. This result suggests that the estimated genetic correlations on average give a stronger indication of coregulation than the phenotypic correlation.

## Discussion

In this article, we introduced a method for performing multitrait genome-wide association analysis and for the estimation of the genetic correlation. Our method is based on classical theory, but introduces a computational advance that makes it much faster, reducing running time >10-fold when compared with the classic approach. We have shown that our method achieves similar results to those of the classical approach. In addition, we have shown that the ability to quickly estimate genetic correlation may be of great benefit to researchers, leading to fundamental insights into the architecture of complex traits.

The ability to quickly optimize multiple-trait linear mixed models will have a large impact on the ability to dissect complex traits. For example, multiple-expression quantitative trait loci (multi-eQTL) may be discovered by mapping multiple traits to genetic variants across the genome, a type of analysis that is infeasible with current methodologies. In addition, we have shown that the genetic correlation between gene expression measurements may be a better indicator of coregulation than the total correlation. It stands to reason that these genetic correlations may be used in coexpression analysis and lead to the discovery of gene modules that are truly coregulated and not driven in part by environmental correlations.

We note that in our model, the genetic background component is assumed to have a covariance structure, defined by the matrix **K**, which is computed using all of the marker genotypes. This model inherently assumes that the effect size of each genetic variant is drawn from a normal distribution with equal variance. This may be inaccurate for several reasons. First, not all of the markers are causal variants and even among the variants that are causal, their effect sizes may vary widely. Second, many of the causal variants themselves may not be genotyped in the study and the markers are merely proxies for these causal variants. This difference between the estimated covariance structure from the markers and the true covariance structure has been shown to lead to inaccurate heritability estimates (de Los Campos *et al.* 2013) and may lead to inaccuracies in estimates of genetic correlations. A more appropriate term for the quantities we estimate may be “genomic heritabilities” and “genomic correlations.”

Our method presents an approach for jointly performing association analysis for multiple traits. However, the question remains of how best to analyze a data set with multiple traits. Unfortunately, there is no clear answer. If a variant affects only a single trait, then an individual trait-by-trait analysis is the most powerful way to identify it, because analyzing more than one trait increases the degrees of freedom of the statistical test. On the other hand, if a variant affects multiple traits, then analyzing those traits together will be more powerful. From a practical perspective, we advocate first analyzing each trait independently and then applying this method to groups of traits with suspected shared genetic components, increasing the number of traits analyzed until the *P*-values become less significant. Our estimates of genetic correlation can guide identification of potential groups of traits. Any such sequential strategy complicates control of type I error; exactly how to control type I error in this context is an important avenue of future work.

## Acknowledgments

N.F. and E.E. are supported by National Science Foundation grants 0513612, 0731455, 0729049, 0916676, 1065276, 1302448, and 1320589 and National Institutes of Health (NIH) grants K25-HL080079, U01-DA024417, P01-HL30568, P01-HL28481, R01-GM083198, R01-ES021801, R01-MH101782, and R01-ES022282. N.F. was supported in part by NIH training grant T32MH073526. E.E. is supported in part by the NIH BD2K award, U54EB020403. We acknowledge the support of the National Institute of Neurological Disorders and Stroke Informatics Center for Neurogenetics and Neurogenomics (P30 NS062691).

## Footnotes

Supporting information is available online at http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.114.171447/-/DC1.

*Communicating editor: S. Sen*

- Received September 30, 2014.
- Accepted February 16, 2015.

- Copyright © 2015 by the Genetics Society of America