# Boosting Gene Mapping Power and Efficiency with Efficient Exact Variance Component Tests of Single Nucleotide Polymorphism Sets

^{*}Division of Epidemiology and Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona, Tucson, Arizona 85724^{†}Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina 27695^{‡}Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695^{§}Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts^{**}Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Boston, Massachusetts^{††}Harvard Medical School, Boston, Massachusetts^{‡‡}Division of Pulmonary and Critical Care Medicine, Department of Medicine, Brigham and Women’s Hospital, Boston, Massachusetts 02115^{§§}Department of Biostatistics, University of California, Los Angeles, California 90095

- 1Corresponding author: Department of Epidemiology and Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona, 1295 N Martin Ave., Tucson, AZ 85724. E-mail: jzhou{at}email.arizona.edu

## Abstract

Single nucleotide polymorphism (SNP) set tests have been a powerful method in analyzing next-generation sequencing (NGS) data. The popular sequence kernel association test (SKAT) method tests a set of variants as random effects in the linear mixed model setting. Its *P*-value is calculated based on asymptotic theory that requires a large sample size. Therefore, it is known that SKAT is conservative and can lose power at small or moderate sample sizes. Given the current cost of sequencing technology, scales of NGS are still limited. In this report, we derive and implement computationally efficient, exact (nonasymptotic) score (eScore), likelihood ratio (eLRT), and restricted likelihood ratio (eRLRT) tests, ExactVCTest, that can achieve high power even when sample sizes are small. We perform simulation studies under various genetic scenarios. Our ExactVCTest (*i.e.*, eScore, eLRT, eRLRT) exhibits well-controlled type I error. Under the alternative model, eScore *P*-values are universally smaller than those from SKAT. eLRT and eRLRT demonstrate significantly higher power than eScore, SKAT, and SKAT optimal (SKAT-o) across all scenarios and various samples sizes. We applied these tests to an exome sequencing study. Our findings replicate previous results and shed light on rare variant effects within genes. The software package is implemented in the open source, high-performance technical computing language Julia, and is freely available at https://github.com/Tao-Hu/VarianceComponentTest.jl. Analysis of each trait in the exome sequencing data set with 399 individuals and 16,619 genes takes around 1 min on a desktop computer.

- SNP set tests
- linear mixed effect model
- exact tests
- next-generation sequencing studies
- small sample sizes

Single nucleotide polymorphism (SNP) set analysis, also referred to as gene set, pathway, or region-based analysis, has been widely used in the genetic association analysis (Wang *et al.* 2007, 2010). They examine groups of SNPs, each of which might contribute a small and individually undetectable effect to the phenotype. The hypothesis is that, when examined jointly, the combined effect of all the genes would rise to the detectable level. SNP sets are usually predefined according to sliding windows, exons, or canonical pathways. Compared to SNP-level analysis, SNP set analysis has increased power because it reduces multiple testing burden and aggregates weak signals. Besides its success in genome-wide association studies (GWAS) (Wang *et al.* 2009; Psychiatric GWAS Consortium Bipolar Disorder Working Group 2011; Chen and Gyllensten 2015), SNP set analysis plays a paramount role in analyzing rare variants in the next-generation sequencing (NGS) studies.

Burden tests are among the first SNP set analysis tools. Burden tests collapse rare variants in a genetic region into a single burden variable, and then regress the phenotype on the burden variable to test for the cumulative effects of rare variants in the set (Morgenthaler and Thilly 2007; Li and Leal 2008; Madsen and Browning 2009; Price *et al.* 2010). The sequence kernel association test (SKAT) is the first generalized linear mixed model-based method for testing the joint effect of a set of variants on a quantitative/binary trait in an unrelated sample (Wu *et al.* 2011). It tests a SNP set as random effects using a quadratic form and uses a mixture of chi-squared distributions as its asymptotic null distribution. Compared to burden tests, a linear mixed model (LMM)-based method is more powerful when a genetic region has both protective and deleterious variants or many noncausal variants (Lee *et al.* 2012). However, SKAT may still be underpowered at small sample sizes, as it uses an asymptotic score test based on large sample theory. In this article we consider exact variance component tests that are applicable to genetic studies with small to moderate sample sizes.

Testing variance components in the LMM framework is challenging and has received considerable attention in the statistical literature (Chen and Dunson 2003; Kinney and Dunson 2007; Greven *et al.* 2008; Saville and Herring 2009; Drikvandi *et al.* 2013; Qu *et al.* 2013). Although likelihood ratio test (LRT) and restricted likelihood ratio test (RLRT) are known to be more powerful than score tests in finite samples, they impose serious computational challenges to genome-wide studies, as the alternative model has to be fit for each SNP set and the calculation of *P*-values is computationally expensive. Previous efforts in genetics studies include Zeng *et al.* (2014, 2015) and Zeng and Wang (2015).

In summary, our contributions in this work are fourfold. First, we develop the exact score (eScore) test that achieves higher power than SKAT at small sample sizes but maintains computational efficiency. Second, we examine the computational bottleneck of the exact likelihood ratio test (eLRT) and the exact restricted likelihood ratio test (eRLRT) and design new algorithms that are scalable to genomic studies. Third, we investigate the power of three exact variance component tests under various genetic study scenarios and demonstrate that the exact variance component tests have proper type I error rates in small sample sequencing association studies, and that eLRT and eRLRT significantly boost power in rare variant studies. Last, we develop and freely distribute a user-friendly software for genetic testing using the three exact variance component tests.

## Methods

### Notations and models

Suppose is an vector of quantitative phenotypes, is an covariate matrix (*e.g.*, gender, smoking history, principal components, *etc*.), is an genotype matrix of *m* genetic variants, and is a prespecified diagonal weight matrix for genetic variants. We consider a standard LMM(1)where are fixed effects, are random genetic effects, and and are variance component parameters for the SNP set and environmental effects respectively. Therefore, the phenotype vector has covariancewhere is the kernel matrix capturing effects of the SNP set. The resulting log-likelihood function is(2)In the following sections, we present the test statistics for the three exact tests along with their null distributions and then outline the computational strategy to scale them to genome-wide studies. Detailed derivations are delegated to the Supplemental Material, File S1.

### eScore

The classical score test statistic for testing (no SNP set effect) *vs.* takes the formwhere is the Fisher information matrix relevant to variance components and is the score function, both evaluated at the maximum likelihood estimate (MLE) under . File S1, section S.2, shows that(3)where is the least squares estimate of fixed effects and represents the sum of diagonal entries of a square matrix The exact score test (eScore) rejects the null hypothesis when is large.

Let , be the projection matrix onto the column space , and be the strictly positive eigenvalues of . Under the null hypothesis , is distributed aswhere are independent standard normals. The *P*-value of observed equals the tail probabilitywhere are independent chi-square random variables. Therefore eScore *P*-values can be calculated using the same numerical methods SKAT uses to evaluate the tail probability of a mixture of independent chi-squares. Moreover, whenever the ratio of two quadratic forms in (3) is less than the threshold , it represents evidence *against* the alternative hypothesis and the correct *P*-value should be 1. This saves considerable computation as most test regions are not associated with the trait.

In contrast, SKAT employs the test statistic(4)and calculates its *P*-value using the null distribution of . Under the null model, converges to the true as sample size *n* increases. Therefore is distributed as only *asymptotically*. Under the alternative model , however, is a biased estimator that tends to overestimate the true . This bias potentially affects the power of .

### eLRT and eRLRT

In this section we first review the eLRT and eRLRT for testing a single variance component proposed by Crainiceanu and Ruppert (2004), and then discuss the computational challenges for applying them to sequencing studies. Section S.3 in File S1 gives self-contained derivation.

The LRT statistic for testing *vs.* is(5)Under the null model , has exact distribution(6)where are independent standard normals, are the strictly positive eigenvalues of and are the strictly positive eigenvalues of .

The RLRT is based on the restricted/residual log-likelihood(7)where The RLRT statistic is(8)which, under the null model , has exact distribution(9)where are independent standard normals and are the strictly positive eigenvalues of .

Applying eLRT and eRLRT to NGS studies, which routinely test genes or SNP sets, incurs serious computational challenges. First we need to find the MLE or restricted maximum likelihood estimate (REML) for each SNP set, which requires repeatedly inverting *n* × *n* matrices, an expensive operation when *n* is large. Second, computing the *P*-value of eLRT or eRLRT for each SNP set is nontrivial. Crainiceanu and Ruppert (2004) propose the straightforward way of simulating *B* points from the null distribution (6) or (9). That involves solving *B* univariate optimizations, where *B* needs to be at order of to obtain *P*-values at order of with accuracy. This method is hard to scale to genomic scans with a large number of SNP sets.

### Implementation

We attack the first computational challenges by an efficient and stable algorithm for fitting the alternative model that avoids repeatedly inverting matrices. We resolve the second challenge by using an accurate approximation that only requires simulating a small number of points from the null distributions.

### Fast algorithm for fitting variance component model

This section describes an efficient algorithm for fitting the variance component Model 2 or restricted-likelihood Model 7. Let be the eigen decomposition of the SNP set variance matrix. Thenwhere and . Our strategy is to update the mean components and variance components alternately. Updating given is a standard weighted least-squares problem. To update given , where the superscript *t* is iteration number, we denote the residuals by . The objective is thenwhich can be maximized by a minorization-maximization (MM) technique (Hunter and Lange 2000). The simple MM updates are(10)See section S.4 in File S1 for the derivation of the MM updates. This algorithm avoids repeatedly inverting *n* × *n* matrices as only one eigen decomposition is required. Each iteration only involves solving a weighted least squares problem and operations for updating variance components. This algorithm is numerical stable as each update of and always increases the log-likelihood value.

For eRLRT, we need to find the REML for each SNP set. Let be an orthonormal basis of , *e.g.*, obtained from the singular value decomposition of Then is multivariate normal with mean and covarianceLet the eigen decomposition of the covariance matrix beThen the transformed data has independent componentsand the restricted log-likelihood function (7) becomesIt becomes clear that the MM updates (10) remain unchanged for finding REML except replacing by and *n* by .

### Approximating null distributions of eLRT and eRLRT

Calculation of eLRT and eRLRT *P*-values relies on drawing samples from the theoretical null distributions (6) and (9). Typical genome scans test SNP sets. An exome-wide significant *P*-value at a level of requires drawing about samples from the null distribution and each of them requires solving a univariate optimization problem. Hence the *P*-value calculation for eLRT and eRLRT is computationally intensive. We propose an approximation scheme that only requires drawing a small number of samples for each SNP set and thus is highly scalable to genomic scans.

We approximate the exact null distributions (6) and (9) by a mixture distribution of form , where the point mass at 0, scale parameter *a*, and the degree of freedom *b* for the chi-squared distribution need to be determined for each SNP set. We illustrate with eLRT. Denote the expression to be maximized in (6) by . The point mass of the null distribution at 0 is well approximated by the probability of having a local maximum at 0Therefore is calculated by either numerically evaluating the cumulative distribution function of the mixture of chi-square distribution at 0 or by the simple Monte Carlo method. To approximate the continuous part of null distribution, we simulate a small number (300 by default) of by numerically maximizing using the Newton–Raphson method, and then estimate parameters *a* and *b* by matching the first two sample moments to those of . This approximation scheme is well known as the Satterthwaite method in statistics (Satterthwaite 1941), which has been used successfully to approximate the null distributions of many test statistics. The performance of our approximation is included in the section S.5 in File S1, which indicates that our approximation method works well for generating *P*-values and reducing computational burden.

### Data availability

Simulated data sets are generated using computing language Julia and are available upon request. COPDGene exome sequencing study (http://www.copdgene.org/) is part of the National Heart, Lung, and Blood Institute (NHLBI) Grand Opportunity Exome Sequencing Project (GO-ESP) and has been deposited to database of Genotypes and Phenotypes (dbGaP) (study accession: phs000296.v3.p2).

## Results

We first illustrate the subtle differences between SKAT and eScore for the motivation of exact tests. We then conduct a comprehensive simulation study to illustrate the control of type I error rate and the power under various conditions of genetic association. These simulations were designed to evaluate two primary questions: (1) What is the relative performance and what are the advantages of using LRT based tests, especially when the causal variants are rare? (2) Can our method still have advantages even when genetic association are not under model assumptions?

### Simulation studies

Differences between SKAT and eScore are demonstrated using simulations. Genotypes of samples are formed by randomly pairing 400 haplotypes drawn from the haplotype pool distributed with the SKAT software (Wu *et al.* 2011). We used the first 5 kb as the test region, which contains 61 monomorphic loci, 20 rare variants with MAF (minor allele frequency) <0.05 (13 with MAFs ), and 12 common SNPs. 1000 replicates of are generated under the null and alternative model , respectively. Under the alternative hypotheses, causal variants are chosen using criterion MAF For simplicity no covariates are included. Figure 1 displays the discrepancy of *P*-values between eScore and SKAT. Under the null model (left panel), SKAT *P*-values roughly match those from eScore, except 73.1% of the eScore tests have *P*-values equal to 1. This reflects the fact that is a fairly accurate estimate of under the null model. Under the alternative model (right panel), however, is a biased estimate and the SKAT *P*-values are systematically larger than those from eScore, especially in the region of small *P*-values. This can lead to loss of power by SKAT in genome scans where a stringent *P*-value threshold is necessary to correct multiple testing. The difference is more dramatic at smaller sample sizes or stronger effect size .

For both type I error and power simulation studies, we use the haplotype pool that comes with the SKAT software (Wu *et al.* 2011) to generate genotypes of study samples. That is, for each simulation replicate, we pair randomly drawn haplotypes to form the genotypes of a sample of *n* subjects. To assess empirical type I error of eScore, eLRT, and eRLRT, we consider combinations of following factors:

test region: first 5 or 10 kb,

samples size

*n*: 500, 1000, or 2000, andsignificance level α: 0.05, 0.01, or 0.0001.

The average number of variants are 97 and 193 for 5- and 10-kb regions, respectively. We evaluate type I error when both common and rare variants are included in the region as well as when only rare variants (MAF <5%) are included in the region. We generate replicates for each simulation scenario. For each replicate, we first simulate four continuous covariates from independent standard normals, one binary covariate from Bernoulli(0.5), and then generate phenotypes from Model 1 with , , and Results in Table 1 show that the three exact tests control type I error at all α levels.

For power comparisons, we take the first 10 kb of the haplotype pool as the test region. Over simulation replicates, testing regions include around 193 variants and 80–150 observed variants on average (Table 2). Average proportion of rare variants (MAF <0.05) are 80.8, 84.3, and 87.9% for sample sizes of 500, 1000, and 2000, respectively. The number of causal variants for different models are also shown in Table 2. This is among the settings where we have evaluated protected type I error. Covariates are generated in the same manner as in the last section and we set fixed effects at For Models 1 and 4, causal effects follow a normal distribution . Models 2, 3, 5, and 6 mimic the simulation schemes in Wu *et al.* (2010), where the magnitude of causal effects is determined by , so that rarer variants have larger effects. In Wu *et al.*’s (2010) article, *c* was set up as 0.4 and in Lee *et al.*’s (2014) article *c* was set as 0.14, which provides 80% power at level when the sample size is 50,000. In our simulations, we chose and *c* by fixing heritability , where , so that power is in the comparable ranges for most of methods given sample sizes. Environmental variance was fixed at 1. For Models 1 and 4, we chose to be,Similarly, *c* was chosen according to the formulaAs a comparison we list the mean and standard deviation of our simulated *c* over simulation replicated in File S1 (Table S1). It is shown that our *c* is smaller compared to Wu *et al.* (2010) and Lee *et al.* (2014), which indicates smaller heritability explained by testing regions. We consider the following simulation factors to evaluate power and label them as Models 1–6 in Table 3:

sample size

*n*: 500, 1000, or 2000,heritability 5 or 10%,

MAF of causal variants: common and rare (MAF <0.5) or rare only (MAF <0.05),

percentage of causal variants: 10 or 30%,

distribution of causal effects: or ,

direction of causal effects: half positive and half negative or all positive.

Significant level α is . We simulate 1000 replicates for each scenario. Therefore the largest Monte Carlo standard error for power estimate is controlled below

For simplicity, in both simulation and real data analysis, SNP weights are not incorporated and the linear kernel is adopted for both exact tests and SKAT. SKAT optimal (SKAT-o) uses the default setting in Lee *et al.* (2012). Note all exact tests can incorporate variant weights or other kernels just as in SKAT or SKAT-o.

Figure 2 displays the results for Models 1, 2, and 3 (common and rare causal variants) and Figure 3 for Models 4, 5, and 6 (rare causal variants only). Left panels of both figures are the results when 10% of the variants in the region are causal, while the right panels show powers when 30% of variants are causal. It is clear that (1) performance of score tests (SKAT and eScore) are comparable in these scenarios; (2) eLRT and eRLRT significantly boosts power over score tests across all scenarios, especially when causal variants are rare only or sample size is small; and (3) when the causal variants are both common and rare, the SKAT-o method can increase power extensively compared with SKAT method with linear kernel. Its power is comparable to eLRT and eRLRT (Figure 2).

### COPDGene exome sequencing study

We further illustrate our methods using the COPDGene exome sequencing study (http://www.copdgene.org/). It is part of NHLBI GO-ESP project (dbGaP study accession: phs000296.v3.p2). After quality control, 399 individuals remain for the analysis (Qiao *et al.* 2016). We analyze genes along the genome and apply different testing methods to three phenotypes: height, cigarette packages per year (PackYears), and body mass index (BMI). Table S2 in File S1 tabulates their descriptive statistics.

For all three phenotypes, we adjust population substructure using the top three eigenvectors generated by the Eigenstrat software (Price *et al.* 2006), age, and gender. For PackYears, we additionally adjust for current smoking status. Table 4 shows the numbers of gene sets that pass Bonferroni-adjusted exome-wide significance level . Table 5 contains detailed information of gene sets that pass exome-wide significance level for height. It also lists gene sets with *P*-value for trait PackYears and BMI for the purpose of side-by-side comparison of *P*-values, as none of the gene sets pass the exome-wide significant level.

We make following observations: (1) For the complex trait height, eLRT and eRLRT identify 8 and 15 genes that pass the Bonferroni-adjusted, genome-wide significant level and eScore identifies 5. In contrast, SKAT and SKAT-o only identify four and three respectively. (2) For the other two traits, no genes pass Bonferroni-adjusted genome-wide significance level in all tests. (3) eScore *P*-values are universally smaller than SKAT. This agrees with the simulation results in Figure 1 that the asymptotic test by SKAT can lose power at small samples size and strong signals. (4) The optimal kernel in SKAT-o does not show advantage over SKAT with linear kernel and no weight in this analysis (Figure S1).

### Computational efficiency of ExactVC

We compare the computational time of different methods. Table 6 records the run times of each method on a desktop with i7-3770 central processing unit of 3.40 GHz and 16 GB RAM. For each trait, exact tests (eScore and eRLRT) complete the analysis in <1 min, while eLRT uses around a minute. SKAT takes slightly longer than eScore, while SKAT-o takes significantly longer than all of the tests. Note that although LRT and RLRT is considered more computationally intensive compared to the score test, Table 6 shows that the speed of our eLRT test is comparable to eScore and SKAT tests, while eRLRT is even faster.

## Discussion

In this report we study and implement computationally efficient exact variance component tests (eScore, eLRT, and eRLRT) for testing SNP sets in sequencing studies. Simulation study and real data analysis show that (1) all exact tests control type I error, (2) eScore yields smaller *P*-values than SKAT at small sample size and strong signal, and (3) eLRT and eRLRT significantly boost power over eScore, SKAT, and SKAT-o, especially when sample size is small or there are plenty of rare variants. By supplying a fast and easy-to-use software package, we hope to boost the power and efficiency of gene mapping based on current NGS technology. Although the derivation of eLRT and eRLRT require normal assumption of genetic effects within a region, we evaluate the misspecified distribution and how that will affect power. In all scenarios, even without normal assumption, our methods show superior power compared with competing methods. The software package, ExactVCTest, is implemented in the open source, high-performance technical computing language Julia and is freely available at https://github.com/Tao-Hu/VarianceComponentTest.jl.

There are a few directions for future work. One advantage of the asymptotic test by SKAT is that it does not depend on the normality assumption and equally applies to association testing of binary traits (Wu *et al.* 2011; Lee *et al.* 2012), while the exact tests depend on the normality assumption. Fortunately many quantitative traits satisfies the normality assumption after suitable transformations. Development of LRT and RLRT for binary trait remains a challenge. Another statistical challenge is to develop LRT or RLRT for testing SNP set in related samples. An asymptotic score test has been developed by Chen *et al.* (2013). Rigorous testing of multiple variance components still remains a statistical challenge (Crainiceanu 2008; Drikvandi *et al.* 2013).

## Acknowledgments

COPDGene study is supported by National Institutes of Health R01 HL-089856 and R01 HL-089897. The whole exome sequencing was supported by the National Heart, Lung, and Blood Institute Exome Sequencing Project. J.J.Z. is supported by National Institutes of Health grant K01 DK-106116, M.H.C. is supported by National Institutes of Health grant R01 HL-113264. H.Z. is partially supported by National Institutes of Health grants HG-006139, GM-105785, GM-53275, and National Science Foundation grant DMS-1055210. Full list of chronic obstructive pulmonary disease investigators unit core and clinical centers are included in the File S1.

## Footnotes

Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.116.190454/-/DC1.

*Communicating editor: N. Yi*

- Received April 15, 2016.
- Accepted September 7, 2016.

- Copyright © 2016 by the Genetics Society of America