## Abstract

Detecting the association between a set of variants and a given phenotype has attracted a large amount of attention in the scientific community, although it is a difficult task. Recently, several related statistical approaches have been proposed in the literature; powerful statistical tests are still highly desired and yet to be developed in this area. In this paper, we propose a powerful test that combines information from each individual single nucleotide polymorphism (SNP) based on principal component analysis without relying on the eigenvalues associated with the principal components. We compare the proposed approach with some popular tests through a simulation study and real data applications. Our results show that, in general, the new test is more powerful than its competitors considered in this study; the gain in detecting power can be substantial in many situations.

WITH the innovations of biomedical and biochemical technologies, large amounts of genetic sequencing data have been produced, providing researchers with great opportunities to investigate the genetic contributions to some phenotypes such as cancers. Genome-wide association studies (GWASs) have successfully identified thousands of single nucleotide polymorphisms (SNPs) that are associated with some common diseases (Manolio *et al.* 2009; Chen and Ng 2012; Chen 2013; Chen *et al.* 2017b). However, most of those identified SNPs from GWAS are variants with relatively high minor allele frequencies (MAFs). Rare variants (*e.g.*, SNPs with MAF <5%) may play a critical role in disease development (Bodmer and Bonilla 2008). Nevertheless, because of their low MAFs, rare variants are usually removed from data analysis in GWASs. And, even if they were included, current statistical methods designed for GWASs may have very limited power to detect the signal if the sample sizes are not large enough. Instead of testing a single variant a time, researchers have proposed statistical approaches to detecting the possible association between a set of variants and a phenotype. Recently, many statistical methods have been designed specifically for gene-set or pathway rare-variant data analysis (Li and Leal 2008; Madsen and Browning 2009; Han and Pan 2010; Basu and Pan 2011; Lin and Tang 2011; Wu *et al.* 2011, 2015; Yi and Zhi 2011; Lee *et al.* 2012; Sha *et al.* 2012; Pan *et al.* 2014; Wang 2016; Chen *et al.* 2017a; Chen and Wang 2017).

The sequencing kernel association test (SKAT) is among the most popular rare-variant association testing methods. The SKAT is essentially based on the principal component analysis (PCA). More specifically, it calculates a test statistic from each individual principal component of the covariance matrix of the genotype data, and then takes the weighted sum of these statistics as the overall test statistic, where the weights are the associated eigenvalues. The null distribution of the overall test statistic is a linear combination of chi-square distributions, which can be approximated by a chi-square distribution (Davies 1980; Liu *et al.* 2009), from which a *P*-value can be approximated.

The optimal sequencing kernel association test (SKAT-O) is a weighted sum of the SKAT and a burden test, which assumes the directions are the same and the magnitudes are similar among all of the rare variants under study (Lee *et al.* 2012). Therefore, the SKAT-O in general is more robust than the SKAT. However, like the SKAT, the SKAT-O still uses the information from eigenvalues. In addition, both the SKAT and the SKAT-O require assigning weigh to each variant (*e.g.*, a function of MAF).

The use of the eigenvalues as weights in the SKAT can be beneficial if indeed the major principal components have stronger association with the phenotype. However, if this assumption is not met, the SKAT can potentially lose power dramatically. In addition, assigning weights to variants can be challenging. To circumvent these difficulties, in this paper, we propose a new statistical association testing method for rare-variant data analysis. This new test has some nice properties, such as simple form and computational efficiency. To study the performance of the proposed approach, we compare it with some popular methods. Our comparison results show that the new test is more powerful than the SKAT and SKAT-O tests under most of the situations studied. Real data applications are also given to illustrate the use of the new approach.

## Methods

We use to denote phenotypes (either qualitative or quantitative) of the *n* subjects in a study. Assume are the observations of *p* covariates from *n* subjects, and are the *k* genotypes from *n* subjects, where the (*i*,*j*) component of if the number of copies of the minor allele of the jth SNP from the ith subject is zero, one, or two, respectively. Denote the standardized residuals (*i.e.*, the raw residual divided by its estimated SD) of after adjusting for the *p* covariates using a generalized linear model (*e.g.*, a logistic regression model for binary phenotype and ordinary linear regression for conditions phenotype). Then, to detect the association between the set of the *k* SNPs and the phenotype, we can conduct an overall test between *z* and the genotypes.

Let be the *n* eigenvalues of matrix where is the centered (*i.e.*, each component is subtracted by its column mean), is the weight matrix, and the eigenvector associated with (). By default, the SKAT uses where is the density function of a distribution with the two shape parameters and is the MAF of the ith SNP, which can be estimated from the data. Unless otherwise specified, in this paper, we use the default weighting function for both of the SKAT and the SKAT-O tests.

The SKAT statistic is asymptotically equivalent to the following expression (Wu *et al.* 2011): (1).It is easy to see that, under the null hypothesis, none of the *k* SNPs is associated with the phenotype, asymptotically follows a linear combination of chi-square distributions, where are independently and identically distributed (iid) distribution with degree of freedom (*df*) 1.

Alternatively, the test statistic SKAT can be rewritten as: (2),where is the ith nonzero eigenvalue of is the ith column of matrix is a diagonal matrix with if 0 otherwise; and is the eigenvectors matrix of It can be shown that the eigenvector associated with nonzero eigenvalue of can be calculated as the corresponding defined above. In fact, let be the eigenvector associated with eigenvalue of then The above defined can be rewritten as for Then, This shows that is the eigenvector associated with nonzero eigenvalue Use the fact that the two sets of nonzero eigenvalues from conformable matrices *AB* (*e.g.*, ) and *BA* (*e.g.*, ) are the same, the set of nonzero eigenvalues of are the same as {}. Therefore, both are are the sets of eigenvectors associated with nonzero eigenvalues of and the equations in (1) and (2) are equivalent. However, expression (2) is preferred when *k* is smaller than *n*, as the computation is more efficient in this situation. From (2), the asymptotic null distribution of is a linear combination of chi-square distributions,

From either (1) or (2), we can see that the SKAT is actually a weighted chi-square test with weights equal to the associated eigenvalues. Therefore, the SKAT is sensitive to the eigenvalues; its performance largely depends on how strong the major principal components correlate with *z* compared with other principal components. In the cases where the correlations between *z* and the major principal components are not stronger than those between *z* and other principal components, the SKAT may perform poorly. Motived by this observation, we propose a robust test without using eigenvalues. We use *C* to denote the new test statistic that has the following expression. (3).It can be shown that the above new test has the following properties.

### Theorem 1

Under the null hypothesis, asymptotically follows a chi-square distribution with where is the number of nonzero eigenvalues of

#### Proof:

Without loss of generality, we assume It is easy to show that under the null hypothesis, asymptotically, follows a normal distribution with mean 0 and variance 1, and the covariance between and () is 0. Therefore, is asymptotically independently and identically distributed as a

### Theorem 2

is invariant of the weight

#### Proof:

Suppose are the *k* eigenvectors of denote then we have where is the identity matrix. In addition, we have where and then Since each column of matrix is also the eigenvectors of matrix Therefore, from (3),

According to Theorem 2, we can calculate the statistic without assigning weight to each SNP.

In the next section, we will compare the proposed test with the SKAT and the optimal SKAT (SKAT-O) through a simulation study.

## Results

### Simulation study

#### Simulation settings:

In the simulation study, we mainly focus on comparing the proposed test (*C*) with the sequencing kernel association test (SKAT), the optimal sequencing kernel association test (SKAT-O), and the burden test. We use the program, simRareSNP (http://www.biostat.umn.edu/∼weip/), provided by W. Pan to generate case-control rare-variant SNP data. For the genotype data, we use a latent multivariate Gaussian variable with compound symmetry (CS) as their covariance structure. The correlation coefficient () in the CS takes different values, *e.g.*, in the simulation study. We simulate SNPs with MAFs ranging from 0.001 to 0.05.

To investigate how the new method controlling type I error rate, we simulate 50 and 100 null SNPs, 1000 cases, and 1000 controls. Using significance level 10^{−4} and 10^{−5}, we obtain the empirical type I error rate based on 10^{6} replicates. We also consider using 700 cases and 1300 controls.

To estimate the power value, we randomly select a proportion () of 100 variants as causal SNPs, where takes values 0.05, 0.1, 0.2, 0.4, and 0.5. Following the simulation settings as described in the SKAT paper, we assume the effect size of each causal SNP is a function of MAF. Specifically, we assume the magnitude of logarithmic relative risk (RR) of heterozygous to homozygous major genotypes is with various values for The logarithmic RR is very close to the logarithmic odds ratio (OR), which was used with similar magnitudes for simulation study in the SKAT paper, if the disease prevalence is low. Of those causal SNPs, we randomly assign 10, 50, and 90% as protective variants, and the rest are risk variants. The commonly used log-additive genetic model is assumed in the simulation. The genotype frequencies of cases can be determined by those of controls and the relative risks of heterozygous and homozygous minor to homozygous major (Chen *et al.* 2012, 2014a, 2016a; Chen and Ng 2012; Chen 2014). Specifically, if the genotype frequencies of homozygous minor, heterozygous, and homozygous major are p_{0}, p_{1} and p_{2} (q_{0}, q_{1}, and q_{2}), respectively, for controls (cases), and the relative risks of heterozygous and homozygous minor to homozygous major are r_{1} and r_{2}, then we have the following relationships. (4).We then consider continuous phenotypes. We use the same procedure as described above to generate genotype data for 2000 subjects. For phenotype, we randomly select a portion () of SNPs as casual variants with 10, 50, and 90% of them having positive effects. The effect for the jth causal SNP is set as where takes 1 (−1) with probability 0.1 (0.9), 0.5 (0.5), and 0.9 (0.1), and takes different values −0.25, −0.2, −0.15, and −0.1 (*i.e.*, half of the values for the above case-control situations). For the ith subject, the phenotype is where is the genotype (0, 1, or 2) and are independently and identically distributed as the standard normal distribution.

#### Simulation results:

Table 1 reports the relative empirical type I error rates (empirical rate to the preset type I error rate) for all methods included in the comparison. It shows that under various conditions, all methods controlled type I error rate well. Table 3, Table 4, and Table 5 give the empirical power values (the highest power value is highlighted for each comparison) from each test when 1000 cases and 1000 controls were simulated, with the proportion of protective causal variants being 10, 50, and 90%, respectively. We observe the following patterns. First, when the SNPs are independent (*i.e.*, ), all methods have higher power values compared with the situations when the SNPs are not independent (*i.e.*, ). Second, as expected, when (the proportion of causal variants) increases while *d* fixed (*e.g.*, and ), the power increases for each method. Third, for most of the conditions, the proposed test has the largest empirical power values. In addition, the power gain of the new test over the SKAT and the SKAT-O tests are substantial under many scenarios.

For the situations where the phenotypes are simulated as continuous variables, Table 2 reports the type I error rate for each method. It shows that all of the methods can control type I error rate well. Table 6, Table 7, and Table 8 give the empirical power values for each method under various conditions. From the simulation results, we have similar observations as those from the case-control situations. As suggested by one reviewer, we also considered many other situations, including (1) different numbers of cases and controls, (2) various number of SNPs, (3) keep d the same value while let θ vary, (4) effect sizes are independent of MAF, and (5) all SNPs have the same MAF value. The simulation results can be found from File S1 . In general, we observed similar patterns as those from Table 1, Table 3, Table 4, Table 5, Table 6, Table 7, and Table 8.

### Real Data Applications

In this section, first, we use the Genetic Analysis Workshop 17 (GAW17) data to demonstrate the application of the proposed method. The GAW17 uses the information of a subset of genes with sequencing data available in the 1000 Genomes Project. In GAW17, SNPs from gene ELAVL4 influence the simulated quantitative phenotype Q1; and gene VNN1 is associated with the simulated quantitative phenotype Q2. Except for the genetic risk factors, both Q1 and Q2 were also assumed to be associated with some covariates, such as age, gender, and smoking status. For each gene, the phenotype (Q1 or Q2 values) was simulated 200 times in the GAW17 data set; therefore, 200 *P*-values can be obtained by each method. We use a linear regression model to account for the effects of those nongenetic factors first, then applied our proposed test, along with SKAT, SKAT-O, and the burden test, to the standardized residuals obtained from the regression.

Figure 1 and Figure 2 plot the −log10(*P*-values) obtained by those methods from genes ELAVL4 and VNN1, respectively. These plots clearly show that the proposed test produced smaller *P*-values compared to SKAT, SKAT-O, and the burden test, for most cases. This indicates that the proposed test is more powerful than its competitors. For some situations, the improvements of the new method were substantial.

We then applied the new method, along with others, to the ocular hypertension treatment study (OHTS) data (Gordon and Kass 1999). OHTS is a National Eye Institute-sponsored multi-center, randomized clinical trial. Its goal is to investigate the efficacy of medical treatment in delaying or preventing the onset of primary open angle glaucoma (POAG) in individuals with elevated intraocular pressure. This data set includes 249 non-Hispanic Black individuals between 40 and 80 years old were enrolled and genotyped in a subsequent study. Data for this genetic study is available at Database of Genotypes and Phenotypes (dbGaP, Study Accession phs000240.v1.p1). There were 1,051,295 genotyped SNPs. The HGNC gene symbols were obtained using the R/Bioconductor package biomaRt (version 2.26.1). There are 30,562 autosomal genes. Genes that contain more than two SNPs were excluded from further consideration.

In this application, we want to detect the association between each gene and the outcome central corneal thickness (CCT), which is used to assess POAG in this study. After adjusting for covariates age and gender using a linear regression, the standardized residues from the regression analysis are used for the association tests. Table 9 reports the *P*-values obtained by SKAT, SKAT-O, the burden test, and the proposed method for genes with the smallest *P*-value from the four methods < For the two identified genes, the *P*-values from the proposed test are both < while the *P*-values from others are all >0.05. More information about the two genes is included in Table S26 in File S1. However, to confirm the true association, the genes listed need further investigation.

## Discussion and Conclusion

Due to the complex relationships among the set of SNPs, rare-variant association testing is a difficult task. Recently, in this area, many statistical approaches have been proposed in the literature; however, none of them is uniformly most powerful. Robust yet powerful statistical methods are still highly desirable. The popular SKAT method is based on PCA analysis and uses eigenvalues as weights when it combines information obtained from each individual principal component. Indeed, under the assumption that the major principal components tend to have stronger associations with the phenotype, the SKAT have decent detecting power. However, it should be pointed out that the weights (eigenvalues) are completely determined by the genotype data; there is no guarantee that the aforementioned assumption is met in practice. Under some situations, it is possible that the minor principal components will have stronger relationship with the phenotype (Aschard *et al.* 2014). If this is the case, the SKAT will be less powerful. For example, for the gene “HCRT” in the real data application, we found that the four eigenvalues are 899, 0.29, 1.4e−07, and 3.9e−09 with associated z-statistics () −1.76, −4.23, 3.00, and −0.61, respectively. Obviously, using eigenvalues as the weights in the SKAT and SKAT-O tests results in large *P*-values. However, the proposed test has better performance under this situation. To circumvent this difficulty, we proposed a robust approach, which does not assume any relationship between the strength of association and the eigenvalues. Another disadvantage of the SKAT and the SKAT-O is the difficulty to estimate the *P*-value (Wu *et al.* 2016). In contrast, the *P*-value from the proposed test can be easily calculated using a standard chi-square distribution.

Our proposed test can actually be viewed as a *P*-value (statistic) combining method (Chen 2011, 2013, 2017; Chen and Nadarajah 2014; Chen *et al.* 2014b, 2016b). Each summand, in (3) is asymptotically iid under the null hypothesis. Therefore, we can calculate each individual *P*-value and then combine those asymptotic independent *P*-values using some appropriate method. The overall *P*-value calculated from (3) is equivalent to the method we studied before (Chen and Nadarajah 2014). Other *P*-value combing methods, such as Fisher test (Fisher 1932), can also be applied. In addition, if we have any prior information, more powerful *P*-value combining methods can be constructed accordingly. However, much more research is needed to investigate under which situations, which *P*-value combining methods are more powerful.

In summary, the proposed test is simple and robust. Through a comprehensive simulation study, we find that the proposed test is more powerful than the SKAT and the SKAT-O tests under many situations. The new method provides alternative or supplementary approach to rare-variant association testing. Finally, it should be pointed out that like the SKAT and the SKAT-O tests, we can use different kernels (*e.g.*, linear or quadratic) in the proposed approach without any additional difficulty.

## Acknowledgments

The authors would like to thank the editors and two anonymous reviewers for their helpful comments which result in an improved presentation of the paper. T.L. was supported by the National Science Foundation of China 61375051. The authors declare that there is no conflict of interest.

## Footnotes

Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.117.300287/-/DC1.

*Communicating editor: N. Yi*

- Received May 4, 2017.
- Accepted September 10, 2017.

- Copyright © 2017 by the Genetics Society of America