## Abstract

Testing for the existence of variance components in linear mixed models is a fundamental task in many applicative fields. In statistical genetics, the score test has recently become instrumental in the task of testing an association between a set of genetic markers and a phenotype. With few markers, this amounts to set-based variance component tests, which attempt to increase power in association studies by aggregating weak individual effects. When the entire genome is considered, it allows testing for the heritability of a phenotype, defined as the proportion of phenotypic variance explained by genetics. In the popular score-based Sequence Kernel Association Test (SKAT) method, the assumed distribution of the score test statistic is uncalibrated in small samples, with a correction being computationally expensive. This may cause severe inflation or deflation of *P*-values, even when the null hypothesis is true. Here, we characterize the conditions under which this discrepancy holds, and show it may occur also in large real datasets, such as a dataset from the Wellcome Trust Case Control Consortium 2 (*n* = 13,950) study, and, in particular, when the individuals in the sample are unrelated. In these cases, the SKAT approximation tends to be highly overconservative and therefore underpowered. To address this limitation, we suggest an efficient method to calculate exact *P*-values for the score test in the case of a single variance component and a continuous response vector, which can speed up the analysis by orders of magnitude. Our results enable fast and accurate application of the score test in heritability and in set-based association tests. Our method is available in http://github.com/cozygene/RL-SKAT.

THE variance component model is a well-established statistical framework used in many scientific fields. Testing for an association between several explanatory variables and a univariate response produces a variety of useful applications. For example, in metagenomics, an association is tested between a phenotype (*e.g.*, body mass index, blood glucose levels, blood lipid levels, *etc*.) and the relative abundance counts of the measured species (Zhao *et al.* 2015).

In statistical genetics, testing for an association between a set of genetic markers and a phenotype, such as a disease or a trait, is a fundamental task. Since studies to detect genetic signals are often underpowered, even with large datasets becoming available, the common approach to help alleviate this issue is grouping together genetic markers and testing them jointly. Grouping genetic markers is commonly implemented under the framework of variance component models. In addition to association testing, this framework can be used to answer several questions, such as estimation of the underlying heritability of a phenotype (Kang *et al.* 2010); estimating the uncertainty of such estimation (Furlotte *et al.* 2014; Schweiger *et al.* 2016, 2017); phenotype prediction (Hayes *et al.* 2001), and more.

We consider two main scenarios in which such tests are performed: (i) a single phenotype, many sets of markers and (ii) many phenotypes, a single set of markers. Scenario (i) is common in set-testing, where relatively few markers are tested jointly. This is particularly useful in the case of rare variants, which are increasingly available for study using sequencing technologies, and which constitute a large part of human genetic variability. In such studies, a single phenotype is often tested against several sets of markers (for example, all rare variants in a single gene), because single-marker tests are often underpowered. Scenario (ii) occurs when studying heritability, defined as the proportion of phenotypic variance explained by genetics. Here, the tested markers are commonly the entire set of genotyped or sequenced single-nucleotide polymorphism (SNP) variants, or large portions of the genome (defined by, *e.g.*, chromosome or functional annotation), and they are often tested against many (*e.g.*, thousands) of phenotypes. Such phenotypes could be expression profiles of genes (Price *et al.* 2011; Wright *et al.* 2014; Lloyd-Jones *et al.* 2017), methylation levels across of various methylation sites in the DNA (Quon *et al.* 2013; Van Dongen *et al.* 2016) or neuroimaging measurements (Ganjgahi *et al.* 2015; Ge *et al.* 2015).

Within the variance components framework, a common approach for association testing is the score test. It is used, for example, for testing the heritability of morphometric measurements derived from brain structural MRI scans (Ge *et al.* 2015) and on fractional anisotropy measures in subjects from the Genetics of Brain Structure study (Ganjgahi *et al.* 2015).

The main popular alternative to the score test is the generalized likelihood ratio (LR) test, *e.g.*, as implemented by GCTA, a popular software package for heritability estimation (Yang *et al.* 2011). Both the score test and the LR test are based on properties of the likelihood function. The LR test statistic is calculated from the likelihood of the best fitting model across different heritability values, and from the likelihood of the model corresponding to no heritability. Conversely, the score test is based on the derivative of the likelihood function at the point corresponding to zero association, and testing if it is significantly nonzero. Compared with the LR test, the score test is often advantageous as it requires parameter estimation only for the null model, whereas the LR test requires parameter estimation for both the null and the alternative model. Additionally, the score test is the locally most powerful test; see Lippert *et al.* (2014) for a thorough comparison between the two tests, mainly in the context of set testing.

The Sequence Kernel Association Test (SKAT) (Wu *et al.* 2011) has become the standard score-based test in statistical genetics and in metagenomics (Zhao *et al.* 2015), in large part due to its computational tractability. One of its merits is that it does not rely on the asymptotic distribution of the score test statistic, instead specifying a nonasymptotic distribution for the statistic under the null hypothesis of no association. However, it has been observed that this distribution may be inaccurate. In the SKAT-O extension (Lee *et al.* 2012), a resampling-based moment-matching correction is suggested. An adaptive permutation testing procedure is suggested in Hasegawa *et al.* (2016). Chen *et al.* (2016) provide a method for calculating exact *P*-values; however, their method may be significantly slower than that of SKAT, as it requires the eigendecomposition of a full rank square matrix, whose computational complexity is typically cubic in the sample size, for each distinct response variable (*e.g.*, phenotype), or each set of explanatory variables (*e.g.*, SNP set). Finally, in these works, it is reported that this discrepancy occurs mainly in studies having a small sample size, and it is currently unclear to which extent the *P*-values of SKAT are calibrated for large sample sizes.

Here, we undertake a thorough analysis of the null distribution of the score test statistic, and its discrepancy under the SKAT approximation. We suggest a practical way to quantify this discrepancy, and show that such discrepancies may occur even at large sample sizes. We show that a discrepancy is expected when the number of markers is comparable to or larger than the number of individuals, and when the individuals are relatively unrelated. In particular, in addition to such inaccuracies occurring in tests of sets of rare-variants in small samples, we conclude that they may also occur in large scale heritability studies. We further suggest a computational method, Recalibrated Lightweight SKAT (RL-SKAT), that allows exact *P*-value computation while maintaining computation time as in SKAT; in particular, for multiple phenotypes tested against the same marker set, only a single eigendecomposition is required. Finally, we demonstrate and validate our results on two real datasets, a large dataset from the Wellcome Trust Case Control Consortium 2 (International Multiple Sclerosis Genetics Consortium *et al.* 2011) (WTCCC2) study and the Cooperative health research in the Region of Augsburg (KORA) study (Holle *et al.* 2005) dataset.

## Materials and Methods

We begin by reviewing the score test, as defined by the SKAT method (Wu *et al.* 2011) [see also the Supplementary Information in Lippert *et al.* (2014) for an excellent review]. We focus here on continuous phenotypes, and on the case of a single variance component; for other cases, see the *Discussion* below.

### The variance components model

We consider the following standard variance components model [see Searle *et al.* (2009) for a detailed review]. Let *n* be the number of observations, and be a vector of responses. Let be a design matrix of *p* covariates, associated with fixed effects (possibly including an intercept vector as a first column, as well as other covariates) and let be a vector of fixed effects. Finally, let be a kernel matrix, which, in a kernel-based method such as SKAT, can be taken to be any symmetric positive-definite matrix that encodes similarity between individuals. Then, is assumed to follow:(1)The fixed effects and the coefficients and are the parameters of the model.

In the context of statistical genetics, is a vector of phenotype measurements for each individual, and is a matrix of covariates (often including an intercept, sex, age, *etc*.). Let be a standardized (*i.e.*, columns have zero mean and unit variance) genotype matrix containing the *m* SNPs we test. The common choice for is a weighted dot product of the genetic markers (Yang *et al.* 2010); formally, define where is a non-negative diagonal matrix assigning a weight per SNP. A standard choice is the uniform [see Wu *et al.* (2011) for a discussion]. The narrow-sense heritability due to genotyped common SNPs is defined as the proportion of total variance explained by genetic factors (Visscher *et al.* 2008):

### The score test

Under the above model, evaluating whether the tested covariates influence the response, while adjusting for additional covariates, corresponds to testing the null hypothesis SKAT tests this hypothesis with a variance component score test in the corresponding mixed model. Specifically, the score statistic in the single-kernel case is obtained from the derivative of the restricted likelihood, discarding terms that are constant with respect to (Lippert *et al.* 2014):(3)where is the projection matrix to the subspace orthogonal to the covariates For clarity of presentation, we will divide the statistic by Then,

#### Proposition 1:

*Let* *be the eigenvalues of* *and be* *are i.i.d. random variables distributed chi-square with one degree of freedom. Then,*(4)The proof of Proposition 1, as well as all proofs below, are deferred to the Supplemental Material in File S1.

### The exact distribution of the score test statistic

The above derivation is exact whenever is known. However, in practice, is not known and needs to be estimated from the data; most often, from the single response vector we are testing. In practice, is replaced with its restricted maximum likelihood (REML) estimate. The REML estimate is simply the corrected mean of the squared entries of the phenotype, after regressing out the covariates and using :(5)We note that sometimes the ML estimate is used, or just as this only introduces a multiplicative constant, we use the unbiased REML estimate for simplicity of presentation later. The statistic *Q* and are, in fact, dependent random variables. Therefore, the assumed distribution of (described in Proposition 1) does not hold when substituting with its estimate, In Zhang and Lin (2003), Liu *et al.* (2007, 2008), and Wu *et al.* (2011), this substitution is justified by the claim that the (restricted) ML estimator is consistent, and may therefore be substituted by its true value for a large enough sample size, *n*. However, this argument does not take into consideration the dependency between *Q* and Also, as shown below, this distribution might not hold in realistic settings. In Chen *et al.* (2016), this discrepancy is reported for small samples, and an exact distribution is derived for the statistic and for any and which we review here:

#### Proposition 2:

*The distribution of* * may be modeled as a ratio of quadratic forms of normal variables. In particular, if* *then*

### Assessing the discrepancy

While noted in the literature (Zhao *et al.* 2015; Chen *et al.* 2016), the above discrepancy is reported for small samples only. However, as we show now, it may occur also when the number of individuals is large. We give a qualitative measure for when to expect large discrepancies between the asymptotic approximation of a weighted mixture of chi-squares and the exact distribution.

In the Supplemental Material in File S1, it is shown that the distributions of and have the same means, but that *i.e.*, the latter has a smaller variance. We can further quantify the ratio between the variances as an indicator to the discrepancy between the distributions.

#### Proposition 3:

*Denote the eigenvalues of* *by* *and note that there are at most* *nonzero eigenvalues* *Denote the first two sample moments of the eigenvalues by* *and* *Denote the empirical variance of the eigenvalues by* *Then,*(7)The expression is the (sample) coefficient of variation (CV) of the eigenvalues—a unitless, relative measure of their dispersion. Therefore, the ratio becomes larger when the CV is smaller. Also, as noted above, since the approximation wrongly ignores the dependency between the statistic *Q* and we expect the discrepancy to grow larger as the correlation between *Q* and increases. We therefore examine this correlation as an additional measure of this discrepancy.

#### Proposition 4:

*Let* *be the CV of the eigenvalues as above. Then,*(8)This again demonstrates that CV affects discrepancy—the correlation becomes stronger when the CV is smaller. When for example, when we have and Conversely, when we have and This also gives the variance ratio as the function of the correlation as(9)To summarize, the discrepancy is strong when the eigenvalues are more uniformly dispersed, and is weak when they have large variability. The dispersion of the eigenvalues of a kinship matrix has been previously shown to be related to the uncertainty in estimation of heritability: In Visscher and Goddard (2015), it is shown that the asymptotic variance of the heritability REML estimator decreases with the variance of the entries of the kinship matrix, and with the variance of the eigenvalues. In Schweiger *et al.* (2016), this result is shown without assumptions of asymptotics.

#### Examples:

We now employ Propositions 3 and 4 to analyze several interesting examples in a genetic context. For simplicity, in the following, we use so that and

##### Completely unrelated cohort:

Suppose the cohort contains completely unrelated individuals; then, Thus, so and is the constant *n*. Compare this to the case where is known; then, it can be easily seen that Therefore, the mean is the same but the variance vanishes completely.

##### Rank-one kinship matrix:

Consider the case of a simple burden test (Lee *et al.* 2012): if we assume the random effects of all SNPs are identical, the burden test becomes equivalent to the score test with where Alternatively, consider the extreme case, where all the individuals are identical: (while unlikely in human, this could be approximately true in studies of plants, yeast, *etc*.). In both these cases, there is a single nonzero eigenvalue: which gives and that is, with large enough sample size, we expect the correlation to be effectively zero, and the SKAT mixture approximation to hold well.

##### A full rank kinship matrix:

Assume the matrix contains SNPs in linkage equilibrium, where each column was mean-centered and normalized to have unit variance. Choosing the linear kernel we follow Patterson *et al.* (2006) in modeling as a matrix of random standard normal variables, from which it follows that is a Wishart matrix. The limit distribution of the density of the eigenvalues of is specified by the Marčhenko-Pastur distribution (Marcenko and Pastur 1967), with its first two moments known to be 1 and Under this approximation, and When as is often the case, and This shows that, for a large class of kinship matrices, we would expect the SKAT mixture approximation to hold poorly.

##### SNP set:

Now, consider the case of set-testing, where is a normalized matrix of SNPs in linkage equilibrium. Following the modeling above, we have again and when and and thus expecting a good approximation by the mixture. This perhaps shows why the SKAT mixture approximation was considered good in the context of set-tests, when few variants or a large sample is considered. This also shows why, in small samples, the mixture is expected to be a poor approximation.

### Calculating P-values

We now describe how to efficiently calculate *P*-values for the distribution of the statistic calculated from the data; that is, given an observed statistic *r*, what is under the null? We review the result in Chen *et al.* (2016):

#### Proposition 5:

*Let r be the observed value of the statistic. Denote by* *the eigenvalues of* *Then,*(10)*where* *are i.i.d. random variables distributed chi-square with one degree of freedom*.

However, this condition requires us to calculate the eigenvalues of for each new value *r*, which, naively, has a complexity of We consider two scenarios where this is problematic. First, in many heritability studies, we wish to test the heritability of many (*e.g.*, thousands) of phenotypes, all relative to the same kernel or kinship matrix (see above). For each phenotype we calculate its score test statistic For *P*-value calculation, we need to compute the eigendecomposition of for each observed statistic which is a significant computational burden.

A second problematic scenario is of an association study of a single phenotype with many sets of SNPs, *e.g.*, rare variants. Choosing a weighted linear kernel as in SKAT (Wu *et al.* 2011), we have for each set. As changes with each test, in principle, we need to perform a costly eigendecomposition for each matrix However, a significant computational saving is gained due to the fact that the nonzero eigenvalues of are the same as those of which is an matrix (Lippert *et al.* 2014). As the number of tested SNPs *m* is often small, calculating the eigenvalues of this matrix instead is significantly faster, taking only with matrix construction taking only (see Lippert *et al.* 2014). However, with the exact approach, we need to calculate the eigenvalues of instead of Even when is low rank, the matrix may be close to full rank, so another approach is needed.

The following characterizes the eigenvalues of given the eigenvalues of

#### Proposition 6:

*Let r be the observed score test statistic. Denote by* *the eigenvalues of* *Denote the column space of a matrix* *by* *its null space by* *Then,*(11)*where* *is the number of nonzero eigenvalues* , *and* *are i.i.d. random variables distributed chi-square with one degree of freedom,*

Proposition 6 shows that calculating the *P*-value amounts to evaluating the cumulative distribution function of a certain weighted mixture of chi-square distribution at 0. This can be done rapidly using the Davies method (Davies 1980), which is based on the numerical inversion of the characteristic function and runs in complexity, or using other methods (Duchesne and De Micheaux 2010).

It remains to calculate *k* and *q*. Naively, this can be done in for example by calculating the singular value decomposition (SVD) of and to get *k* and to obtain vector bases for and and by calculating the SVD of a matrix whose columns are the two vector bases to obtain *q*. When the same kernel is used with many phenotypes, it is a single preprocessing step. However, when the number of SNPs used to construct the kernel and the number of covariates are small, these quantities can be calculated much faster:

#### Proposition 7:

*Suppose* *and let* *and* *Then, k and q can be calculated in complexity* .

Most commonly, When the number of SNPs *m* and the number of covariates *p* are small, the computational saving is substantial.

### Data availability

This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. The data used in this manuscript were obtained via KORA.PASST (https://epi.helmholtz-muenchen.de/) with the following variables: KORA F4 Illumina HumanMethylation450K BeadChip array, BMIQ normalization KORA F4 Affymetrix 6.0 SNP Array; imputed (HapMap2 reference panel). Access to the data may be obtained by request to KORA.

## Results

### Performance summary

We summarize the results described in *Materials and Methods* above in Table 1 and in Algorithms 1 and 2. We compare our method, RL-SKAT, with the SKAT formulation and the correction of Chen *et al.* (2016) using the naive implementation of Proposition 5, as implemented by the MiRKAT software package (Zhao *et al.* 2015). The two scenarios discussed are those of a heritability study (same with many responses ) and SNP set-testing (many low rank ). In all methods, a preprocessing step of calculating and is required. In a heritability study, calculating the statistic amounts to evaluating two quadratic forms in Compared to RL-SKAT, MiRKAT requires a full eigendecomposition for each For a set-testing study, these quadratic forms can be calculated in due to the low rank of Again, MiRKAT requires a full eigendecomposition, compared to the procedure described in Proposition 7.

We now demonstrate our results on two datasets: a dataset from the Wellcome Trust Case Control Consortium 2 (International Multiple Sclerosis Genetics Consortium *et al.* 2011) (WTCCC2) study and the Cooperative health research in the Region of Augsburg (KORA) study (Holle *et al.* 2005). A full description of data preprocessing is given in the Supplemental Material in File S1.

### A simulation study using WTCCC2 data

We first analyze data with real genotypes from the WTCCC2 Multiple Sclerosis dataset, and simulated phenotypes. We used the same data processing described in Yang *et al.* (2014), resulting in *m* = 360,556 SNPs for *n* = 13,950 individuals. We constructed the kinship matrix by a standard, uniformly weighted linear kernel. We sought to demonstrate the discrepancy between the true null distribution and the chi-square weighted mixture distribution. Following Proposition 4, we calculated the correlation to be 0.886 and variance ratio to be indicating that a large discrepancy is possibly expected. To verify this, we simulated 10,000 random phenotypes, where each phenotype is a vector of i.i.d. standard normal variables. We tested whether the variance component is significantly >0, and calculated their *P*-values under the assumptions of either of the two distributions. In Figure 1, we show the quantile-quantile plots for the two sets of *P*-values. As evidenced, using the SKAT mixture distribution results in a severe deflation of small *P*-values, while using the correct distribution as in Proposition 1 results in an accurate *P*-value distribution. This shows that even for large sample sizes (*n* = 13,950), such a discrepancy is possible.

### Testing for heritable methylation sites in the KORA dataset

The longitudinal KORA study consists of whole-blood methylation levels and genotypes of *n* = 1799 individuals. The phenotype is the proportion of methylated samples at a specific site, averaged across DNA samples of an individual. The study consists of independent population-based subjects from the general population living in the region of Augsburg, southern Germany (Holle *et al.* 2005). Whole-blood samples of the KORA F4 study were used as described elsewhere (Pfeifferm *et al.* 2015). In summary, a total of 431,366 methylation site phenotypes, and 657,103 SNPs, were available for analysis. The correlation as in Proposition 4 is 0.976 and the variance ratio is indicating again that a large discrepancy is expected. We performed a heritability study of multiple phenotypes with the same kinship matrix, by testing the heritability of the *N* = 43,140 methylation sites on chromosome 1. As it is common for a methylation site to be correlated with its surrounding SNPs (Gibbs *et al.* 2010; Zhang *et al.* 2010; Bell *et al.* 2011), we avoided such *cis* effects by using a kinship matrix constructed from the *m* = 604,170 SNPs on all chromosomes other than 1. The kinship matrix is constructed by a standard, uniformly weighted linear kernel. For covariates, we used consisting only of an intercept vector. Again, we calculated *P*-values under the assumption of the two distributions. We note that it has been shown that some methylation site profiles often display significant heritability, while others do not; thus, both significant and insignificant *P*-values are expected (Rahmani *et al.* 2017).

In Figure 2 we show the histograms of the of the *P*-value of all the considered phenotypes. The two histograms are indeed very different; *P*-values calculated using the inaccurate SKAT mixture distribution indicate that the heritability of almost all sites is considered insignificant; for example, using a Bonferroni threshold of only 8/43,140 sites are significant. In light of the results above, it is reasonable to suspect that *P*-values of many heritable phenotypes are deflated, thus causing false negatives. The *P*-values distribution has a peak at ∼0.5, likely an artifact of the inaccurate calculation method. In comparison, *P*-values calculated by RL-SKAT do not exhibit such a peak. They are significantly smaller, and using the same Bonferroni threshold, we now find 319/43,140 significant sites. Indeed, a simulated power study of both approaches under varying degrees of a true underlying heritability validates that the inaccurate approach results in a severe decrease in power (Figure 3), which has been reported in the literature (Uemoto *et al.* 2013). As a point of reference, we compared the power of RL-SKAT with that of the popular LR test approach, and found they have similar power (see the Supplemental Material in File S1). We conclude that in this dataset, using the SKAT distribution for *P*-value calculation is highly problematic.

### Benchmarks

Finally, we benchmarked the methods discussed here on the KORA dataset under the two above scenarios, on a 64G, 2.2 GHz Linux workstation, using our implementation in the Python language. We verified that the relevant part of our implementation is equivalent to MiRKAT and has a very similar running time. For the scenario of heritability testing, we calculated the *P*-values of 1000 phenotypes with the kinship matrix. For the scenario of set testing, we used 1000 sets of 100 SNPs each. The results are summarized in Table 2; as expected, the computational savings are very significant, achieving a speedup of more than two orders of magnitude. We expect the speedup to be even more significant for larger datasets.

## Discussion

In summary, we have shown that the distribution suggested by SKAT to the score test statistic may be very inaccurate. Unlike previous studies, which have noted this discrepancy only in small sample sizes, we have shown that it might occur in large studies as well. We have proposed a computational method to accurately calculate *P*-values without compromising computational time. Finally, we demonstrated our findings in two datasets.

The exact calculation of *P*-values can be applied to other variants of the score test; for example, the SKAT-O (Lee *et al.* 2012) test seeks to find an optimal combination of burden tests and nonburden tests, which amounts to the score test with a certain kernel.

In this work, we focused on the case of a single kernel, and on a continuous phenotype. The extension of this work to multiple kernels (*e.g.*, corresponding to several sets of SNPs) or to binary phenotypes (*e.g.*, case/control studies) is nontrivial, as the null distribution cannot be modeled as a ratio of quadratic forms (see, *e.g.*, Wang 2016; Wu *et al.* 2016). It therefore remains a subject for future work.

We believe that the prominence of likelihood-ratio based tests in heritability studies might stem from the statistical issues discussed above (see, for example, Uemoto *et al.* 2013), where SKAT was found to be significantly less powerful. It is our hope that this paper would facilitate the use of score tests in heritability studies in the future.

## Acknowledgments

R.S. is supported by the Colton Family Foundation. This study was supported in part by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel Aviv University to R.S. and E.R. E.H was partially supported by NSF grant 1705197. Funding for the project was provided by the Wellcome Trust under award 076113. The KORA study was initiated and financed by the Helmholtz Zentrum München, German Research Center for Environmental Health, which is funded by the German Federal Ministry of Education and Researchand by the State of Bavaria. Furthermore, KORA research was supported within the Munich Center of Health Sciences, Ludwig-Maximilians-Universität, as part of LMUinnovativ.

## Footnotes

*Communicating editor: C. Sabatti*Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.117.300395/-/DC1.

- Received July 7, 2017.
- Accepted September 24, 2017.

- Copyright © 2017 by the Genetics Society of America