## Abstract

The recent development of sequencing technology allows identification of association between the whole spectrum of genetic variants and complex diseases. Over the past few years, a number of association tests for rare variants have been developed. Jointly testing for association between genetic variants and multiple correlated phenotypes may increase the power to detect causal genes in family-based studies, but familial correlation needs to be appropriately handled to avoid an inflated type I error rate. Here we propose a novel approach for multivariate family data using kernel machine regression (denoted as MF-KM) that is based on a linear mixed-model framework and can be applied to a large range of studies with different types of traits. In our simulation studies, the usual kernel machine test has inflated type I error rates when applied directly to familial data, while our proposed MF-KM method preserves the expected type I error rates. Moreover, the MF-KM method has increased power compared to methods that either analyze each phenotype separately while considering family structure or use only unrelated founders from the families. Finally, we illustrate our proposed methodology by analyzing whole-genome genotyping data from a lung function study.

GENOME-WIDE association studies (GWASs) have been widely used to identify common single-nucleotide polymorphisms (SNPs) associated with complex human diseases (Hunter *et al.* 2007; Wellcome Trust Case Control Consortium 2007; Yeager *et al.* 2007; Manolio *et al.* 2008; Hindorff *et al.* 2009). In a typical GWAS, large numbers SNPs are genotyped on hundreds or thousands of subjects, and each SNP is subsequently tested, one by one, for association with the phenotype of interest. However, this traditional single-marker association test is not powerful enough to detect rare variants that confer susceptibility to complex diseases (Li and Leal 2008; Schork *et al.* 2009). With the recent development of sequencing technology, identification of rare susceptibility variants for complex diseases has become feasible, provided that novel statistical methods are developed to obtain optimal results.

To increase the power to detect association using rare susceptibility variants, many set-based statistics have been developed that evaluate the joint effect of a group of rare genetic variants in a predefined genomic region on the phenotype of interest (Morgenthaler and Thilly 2007; Li and Leal 2008, 2009; Madsen and Browning 2009; Han and Pan 2010; Morris and Zeggini 2010; Price *et al.* 2010; Lin *et al.* 2011, 2012, 2013, 2014). One commonly used method is the sequence kernel machine–based association test (SKAT) (Wu *et al.* 2010, 2011; Yan *et al.* 2014, 2015). SKAT is a powerful, flexible, and computationally efficient approach for set-based association testing. To increase power, this kernel machine (KM) method assigns weights to each marker and uses the weighted summation to construct test statistics. In addition, SKAT can easily include nongenetic covariates. Both linear and nonlinear kernels may be used to test the genotype-phenotype relationship. Furthermore, SKAT’s test statistic follows a known mixture of chi-square distributions. Thus, *P*-values can be quickly computed analytically without performing resampling.

In genetic studies of complex diseases, multiple correlated phenotypes are often collected. Jointly testing the association between these correlated phenotypes and genetic variants may increase the statistical power to detect causal genes underlying complex diseases. Several multivariate approaches developed for this purpose (Verzilli *et al.* 2005; Zapala and Schork 2006; Liu *et al.* 2009; Zhang *et al.* 2010a; Maity *et al.* 2012) have demonstrated improved statistical power to detect susceptibility variants, especially for pleiotropic variants that influence multiple phenotypes (Zhu and Zhang 2009; Sivakumaran *et al.* 2011).

Family-based designs have been used widely in association studies of complex traits (Falk and Rubinstein 1987; Ott 1989; Terwilliger and Ott 1992; Spielman *et al.* 1993). Although GWASs with unrelated samples often employ general linear models for quantitative phenotypes, this approach can lead to inflated type I error rates in family-based studies if familial correlation is ignored. In family-based studies, a linear mixed model including a random covariate with polygenic effects can account for familial correlations and thus is preferable to a general linear model. In family-based studies, the covariance of random polygenic effects among all subjects is proportional to their kinship coefficients. Linear mixed models with a kinship matrix have been applied commonly in family-based GWASs (Almasy and Blangero 1998; Rabinowitz and Laird 2000; Yu *et al.* 2006; Kang *et al.* 2010; Zhang *et al.* 2010b). More recently, SKAT has been extended to test for quantitative phenotypes in family-based samples by including a kinship matrix (Schifano *et al.* 2012; Chen *et al.* 2013; Oualkacha *et al.* 2013).

Here we develop a new test for gene-based association between rare variants and multiple correlated phenotypes for family-based samples. The recently published MFQLS statistic (Won *et al.* 2015) is for family-based multivariate association analysis with multiple variants, but it is specifically for common variants. Our proposed method, which uses KM regression and is denoted as MF-KM (for multivariate family data using kernel machine regression), is based on a linear mixed-model framework and can be applied to a large range of studies with different types of traits, such as longitudinal studies. In our simulation studies, we show that a usual KM test (Maity *et al.* 2012) (M-KM, considering the correlation among multiple phenotypes) has inflated type I error rates when applied directly to familial data. In contrast, our MF-KM method preserves the expected type I error rates when employed in family-based samples. Moreover, the MF-KM method has increased power compared to methods that either analyze each phenotype separately (F-KM, which considers family structure) or use only unrelated founders (M-KM-ind). Finally, we illustrate our proposed methodology by analyzing whole-genome genotyping data from a lung function study.

## Materials and Methods

### KM regression in a linear mixed-model framework

For KM regression in a linear mixed-model framework on a data set containing *n* people, we assume that the *n* × 1 vector of the quantitative trait **y** follows a linear mixed model:where **X** is an *n* × *p* covariate matrix, **β** is a *p* × 1 parameter vector for fixed effects (an intercept and *p* – 1 covariates), **G** is an *n* × *q* genotype matrix for *q* genetic markers in the region of interest, **γ** is a *q* × 1 vector for the random effects of genetic markers, **u** is an *n* × 1 vector for the random effects of any correlation (*e.g.*, multiple phenotypes or familial structure), and **ε** is an *n* × 1 vector for the random error. The random effect **γ** is assumed to follow a normal distribution with mean zero and variance τ**W**, so the null hypothesis we are interested in testing is *H*_{0}: **γ** **= 0**, which is equivalent to testing *H*_{0}: τ = 0. A variance component score test known as the locally most powerful test can be used to test this *H*_{0} (Wu *et al.* 2011). The error **ε** and the random effects **u** are also assumed to follow normal distributions and are uncorrelated with each other and with **γ**. To be specific, we assume thatwhere **W** is a *q* × *q* diagonal matrix with predefined weights for each variant [such as (Wu *et al.* 2011)], **K** is an *n* × *n* covariance matrix, and is the error variance component.

Under these assumptions, the variance of the quantitative phenotype **y** can be described asUnder the null hypothesis, the estimates areFollowing the same rationale as in previous derivations of the score statistic (Zhang and Lin 2003; Liu *et al.* 2007; Kwee *et al.* 2008), we have the test statistic (1)where is the vector of estimated fixed-effect coefficients of covariates under *H*_{0}, and is the estimated covariance matrix of **y** under *H*_{0}. The statistic *Q* is a quadratic form of and follows a mixture of chi-square distributions, although some of the parameters are estimated (Yuan and Bentler 2010; Schifano *et al.* 2012) under *H*_{0}. Thus, the *P*-values can be calculated using different algorithms, such as the moment-matching method (Satterthwaite 1946; Liu *et al.* 2007), the Davies exact method (Davies 1980), or Kuonen’s saddlepoint method (Kuonen 1999). In this work, we chose to use the Davies method.

### KM for quantitative traits in multivariate family data (MF-KM)

The *Q* statistic derived in Equation (1) can be extended to handle quantitative traits in multivariate family data, denoted as MF-KM. Here the null hypothesis is that the group of genetic variants is not associated with any traits. For simplicity of illustration, we consider a data set containing *m* individuals and two correlated phenotypes. Under the null hypothesis, *H*_{0}: τ = 0, the model with correlation among phenotypes and familial structure iswhere **y** is a vector of quantitative trait [*i.e.*, **y** = (*y*_{11}, *y*_{12}, *y*_{21}, *y*_{22}, …, *y _{m}*

_{1},

*y*

_{m}_{2}), where

*m*is the number of individuals], is the fixed effects of covariates,

**h**is the random effect of correlated phenotypes corresponding to the polygenic contribution, and is the random effect of correlated phenotypes corresponding to the random environmental contribution. In our notation, we do not explicitly distinguish families that can be handled implicitly by the kinship matrix in the variance of

**h**. Since we consider two correlated phenotypes (Bauman

*et al.*2005),where is twice the

*m*×

*m*kinship matrix obtained either from familial relationship or genome-wide data, is the Kronecker product, and , , , , , and represent the polygenic variances of the first and second phenotypes, the polygenic covariance between the two phenotypes, the environmental variances of the first and second phenotypes, and the environmental covariance between the two phenotypes, which can be estimated from the data by using classic optimization methods such as the Nelder-Mead method (Nelder and Mead 1965) or the quasi-Newton method (Broyden 1969; Fletcher 1970; Goldfarb 1970; Shanno 1970), and then the test statistic

*Q*can be constructed.

### Simulation study

#### Simulation of sample genotypes:

We simulated sample genotypes based on a pool of 10,000 haplotypes over a 200-kb chromosome from a calibrated coalescent model (Schaffner *et al.* 2005) with linkage disequilibrium (LD) structure mimicking the European ancestry. We simulated family data using two different family structures (Figure 1). First, we simulated 300 trio families with father, mother, and one child (Figure 1A) by randomly selecting 1200 haplotypes as the parents’ haplotypes. The offspring haplotypes were generated by randomly transmitting one of the two haplotypes of the father and the mother to the child. Similarly, we simulated 100 three-generation families with two grandparents, two independent parents who marry into the families, two dependent parents as the offspring of grandparents, and four children (Figure 1B) by randomly picking two haplotypes for each founder and then randomly picking haplotypes to be transmitted to their descendants. Then we randomly selected 30 rare variants [mean allele frequency (MAF) < 0.05] from the simulated family data over the 200-kb region as one genotype data set. We generated 100 such genotype data sets in the analysis for each of the two family structures.

#### Type I error rate:

To measure type I error rates, for each of the 100 genotype data sets, we simulated 100 sets of a two-dimensional (2D) null phenotype independently of the genotypes. For each trio family, the vector of six quantitative phenotypes for family *i* was generated via the modelwhere is a continuous covariate generated from a normal distribution with mean 50 and standard deviation 5 that repeats twice to mimic two phenotypes for each individual. In other words, one single value is drawn from the distribution for one individual, and this value is assigned to both the individual’s phenotype vectors. is a dichotomous covariate generated from a Bernoulli distribution with probability 0.5, which is also repeated twice for two phenotypes for each individual; follows a multivariate normal distribution with mean vector **0** and covariance matrixwhere , , , and were set to 1, and the covariances and were set to 0.8. The phenotypes for all the subjects were generated in the same way, and the 100 sets of simulated phenotypes for each of the 100 genotype data sets were used to evaluate the type I error rates. In the scenario of families with three generations, the phenotypes were generated in an analogous way, but the kinship matrix was more complicated. Both the fixed-effects and variance parameters were assumed to be unknown when analyzing the simulated data, and they can be estimated from the simulated data.

When analyzing the family data, we compared the performance of the MF-KM method to four other approaches: (1) family KM ignoring the correlation between traits applied to the first and second phenotypes separately (F-KM), (2) multivariate KM without considering familial structure (M-KM), (3) M-KM using independent founders (M-KM-ind), and (4) Fisher’s method (Fisher 1950) combining the *P*-values of F-KM applied to the first and second phenotypes, treating the two phenotypes as independent (Fisher-F-KM).

#### Power evaluation:

To evaluate power, we used the same genotypes as described earlier, but we let the phenotypes be associated with the genotypes. We compared the MF-KM method with F-KM and M-KM-ind. The quantitative phenotypes for one family were generated via the model:where and are the same as described earlier, are the genotypes of causal SNPs, and are effect sizes of the causal SNPs. We considered that 30% and 20% of all variants were disease-susceptibility variants and that ** e_{i}** was determined the same as for the evaluation of type I error rates. Furthermore, were set to in order to assign large weights to rare variants, where = 0.4 was chosen such that when MAF = 0.0001, (Wu

*et al.*2011). Because the KM regression could handle both risk and protective variants, we also considered that one-third of the causal variants were protective, which meant that (

*i.e.*, 20% disease variants and 10% protective variants and 13% disease variants and 7% protective variants). The phenotypes for all families were generated in the same manner, and these 100 sets of phenotypes for each of the 100 genotype data sets in each family scenario were used to evaluate the power.

### Data availability

The MF-KM algorithms have been implemented in R (http://www.r-project.org/) and the source code is available at (http://www.pitt.edu/∼qiy17/Softwares.html). File S1 contains the R program for MF-KM, README of the program and illustrative examples.

## Results

### Simulation of the type I error rate

Table 1 lists the empirical type I error rates of MF-KM, F-KM for the first phenotype, F-KM for the second phenotype, M-KM, M-KM using founders (M-KM-ind), and Fisher’s method combining the *P*-values of F-KM applied to the first and second phenotypes (Fisher-F-KM) at α levels of 0.05, 0.01, 0.005, and 0.001 for trio families and families with three generations. The results indicate that the type I error rate is inflated when the M-KM, which ignores familial structure, is applied to correlated samples, even though the correlation between phenotypes is modeled. The type I error rate is also inflated when Fisher-F-KM is applied, treating the two correlated phenotypes as independent. In contrast, F-KM, MF-KM, and M-KM-ind retain the correct type I error rates. From the quantile-quantile (QQ) plots in Figure 2, we can see similar patterns. The *P*-values are roughly uniformly distributed for MF-KM, F-KM, and M-KM-ind, which indicates that they control type I error rates well, while the type I error rate is inflated for the M-KM applied to correlated samples and for Fisher-F-KM. Comparing the scenarios with two family structures, the inflation of the M-KM is more severe as the number of correlated samples per family increases.

### Statistical power comparison

Because M-KM and Fisher-F-KM have inflated type I error rates for related samples, we only investigated the power of MF-KM, F-KM, and M-KM-ind. As shown in Figure 3, the power of MF-KM is consistently higher than that of F-KM (which uses only one phenotype) and M-KM-ind. This is expected because MF-KM makes full use of the data, while, in contrast, F-KM uses only one phenotype at a time, and M-KM uses unrelated founders to preserve correct type I error rates. In the simulation studies, the covariances and between two phenotypes were set to 0.8. In addition, we varied and so as to study the effect of the correlation between phenotypes on the power. As shown in Supporting Information, Figure S1, the results indicate that MF-KM assuming equal genetic effects on both phenotypes performs best when the correlations between phenotypes are low to moderate and phenotypes have the same direction of genetic effects and have similar effect size, and MF-KM assuming nonequal genetic effects on both phenotypes performs best when the correlations between phenotypes are moderate to high and phenotypes have different effect size and/or have different direction of effects.

### Analysis of genome-wide lung function data

To evaluate the performance of our statistic on a real data set, we applied the method to data from a lung function study (Chen *et al.* 2014, 2015) to carry out gene-based genome-wide association tests of the correlated lung function phenotypes forced expiratory volume in 1 sec. (FEV_{1}) and forced vital capacity ratio (FEV_{1}/FVC). The data contain 578 Costa Rican subjects with and without chronic obstructive pulmonary disease (COPD), including 316 samples from 13 families, with 464 subjects being genotyped. The 72 subjects with unconfirmed COPD were excluded. The detailed recruitment criteria are described elsewhere (Chen *et al.* 2014). A genome-wide panel of 658,502 SNPs was genotyped, including 591,381 common variants (MAF ≥ 0.05) and 67,121 rare variants (MAF < 0.05). We assigned rare variants to a gene if they were located within a 5-kb flank of the gene on either side. In the end, 7064 genes were used in the analysis. We analyzed the association between the correlated two-dimensional FEV_{1} and FEV_{1}/FVC phenotypes and each of the 7064 genes comprised of rare variants using MF-KM adjusting for age, gender, height, and COPD status. The COPD status was included as a covariate to control for potential ascertainment effects because the samples were recruited according to COPD status. In our data set, the Pearson correlation between FEV_{1} and FEV_{1}/FVC is 0.57. In the final analysis, 398 genotyped subjects with full phenotypes were used, but there are still missing genotypes. To handle these, we assigned them to the homozygous reference genotype. In this test, we assumed that the phenotype is caused by rare genetic variants, and thus Wu’s weight (Wu *et al.* 2011) was applied to give rarer variants larger effect sizes. The Manhattan plots of *P*-values for genes from MF-KM and F-KM are shown in Figure 4. In these plots, the location of each gene is determined by the location of its first marker. Using the MF-KM statistic, two genes, *COL6A6* and *RBM16* (marginally), were found to be significantly associated with the joint phenotype of FEV_{1} and FEV_{1}/FVC at an α level of 7 × 10^{−6} (which is the Bonferroni-corrected significance level). Using F-KM, these two genes also were found to be associated with FEV_{1} alone. *COL6A6* includes 7 rare variants of 31 genetic variants, and *RBM16* includes 2 rare variants of 29 genetic variants. The rare alleles in both *COL6A6* and *RBM16* seem to be associated with a higher risk for lung function because more individuals with a higher proportion of rare alleles are in the low FEV_{1} and FEV_{1}/FVC area (the black area in Figure 5) than in the high FEV_{1} and FEV_{1}/FVC area (the red area in Figure 5). Moreover, *COL6A6* (chr. *3*: 130,274,178–130,400,888) is in the COPD-related regions based on the Rat Genome Database (RGD) (Shimoyama *et al.* 2015). There are two known COPD-related regions, *COPD14_H* (chr. *3*: 36,484,119–175,785,038) and *COPD16_H* (chr. *3*: 49,418,084–198,022,430).

## Discussion

Family-based study designs have been used widely in investigating complex diseases, and hundreds of thousands of genetic variants, both common and rare, have been genotyped with advances in high-throughput sequencing technology. Thus, appropriate statistical methods are needed for analyzing data from these studies while accounting for potential pleiotropic effects. Therefore, here we developed the MF-KM statistic using a linear mixed-model framework to analyze multivariate data with quantitative traits in family-based studies.

MF-KM shares the advantages of other set-based methods, such as improved power and reduced multiple testing by jointly testing a set of genetic variants. Our simulation studies show that MF-KM preserves the desired type I error rates. When multiple phenotypes are available, we show that MF-KM achieves higher power than commonly used alternate methods. Based on our simulation results, we believe that MF-KM provides a good option for genetic analysis of multivariate data in family-based studies.

The computational time required to implement the MF-KM method depends on sample size, the number of genetic variants, and the complexity of the model being tested under the null hypothesis. In fact, the computational time of fitting a model under the null hypothesis may not be critical when performing a genome-wide study. Because MF-KM is a score test, the estimates of fixed-effects coefficients and the covariance matrix under the null hypothesis are independent of the genetic variants. Therefore, the linear mixed model under the null hypothesis only needs to be fitted once. The and then can be saved and reused to construct test statistics for all the genes. Therefore, the total computational time is greatly reduced. However, processing genes takes most of the computational time. If the number of markers in a gene is large, inverting the large matrix is still computationally intensive. One way to handle this would be to further group their variants into subgroups such as common or rare nonsynonymous or synonymous coding variants, as in our previous work (Yi *et al.* 2011), and to also use LD blocks if the subgroups are still large. In addition, we may use fast algorithms proposed for linear mixed models, such as EMMA/EMMAX (Kang *et al.* 2010; Zhang *et al.* 2010b), TASSEL (Zhang *et al.* 2010b), and others (Lippert *et al.* 2011; Svishcheva *et al.* 2012; Zhou and Stephens 2012, 2014), which would make our approach faster and more efficient. Although the null model needs to be fitted only once for a genome-wide study, different initial values may need to be tried so as to find the maximum-likelihood estimates (MLEs) because the Nelder-Mead method or other optimization methods find the local maximum. Thus, the computational time also depends on the number of different initial values being tried.

Although the MF-KM method requires certain assumptions, the framework is general and flexible. For example, nongenetic covariates can be easily incorporated; M-KM is a special case of MF-KM where only unrelated samples are involved. Although kinship coefficients can be obtained directly from the pedigree, if genome-wide genotypes are available, it may be more advantageous to use genetic markers to estimate the kinship coefficients among individuals (Balding and Nichols 1995; Lynch and Ritland 1999; Ritland 2005; Yu *et al.* 2006; Kang *et al.* 2010; Liu *et al.* 2011). Using the estimated kinship coefficients allows us to handle any relationship, known or unknown, in the samples. In behavioral and psychological studies, familial correlation is influenced not only by genetics but also by shared environment, which needs to be considered in addition to kinship coefficients (McGue and Bouchard 1998; Turkheimer and Waldron 2000; Hallmayer *et al.* 2011). In the presence of a shared environment, our proposed test statistic could yield inflated type I errors, and this could be controlled by including an extra random intercept within families in the null model, but model complexity increases accordingly. In our lung function study, the shared-environmental influence is not assumed. Although we have only studied the performance of a linear kernel here, it would be straightforward to use a nonlinear kernel within the flexible KM regression framework when a nonlinear association between a disease and genetic variants is assumed.

## Acknowledgments

This work was supported in part by the following grants from the National Institutes of Health: R01-HG007358 and R01-EY024226 (to Q.Y. and W.C.); GM-081488, R03-DE024198, R01-HL092173, and P60-AR064172 (to N.L.); R01-HG006857 (to B.L.); R01-GM073766 (to G.G); National Science Foundation grants EPS-1158862 (to Q.Y. and H.K.T.) and 1462990 (to X.L. and N.L.); and HL-073373 and HL-117191 (to J.C.C.); a grant from the Ministry of Science and Technology of Taiwan (102-2628-B-002-039-MY3); and a grant from National Taiwan University (NTU-CESRP-104R7622-8 to W.-Y.L.).

## Footnotes

*Communicating editor: G. A. Churchill*Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178590/-/DC1

- Received May 27, 2015.
- Accepted October 4, 2015.

- Copyright © 2015 by the Genetics Society of America