# Local Joint Testing Improves Power and Identifies Hidden Heritability in Association Studies

- Brielin C. Brown*,
^{1}, - Alkes L. Price
^{†}, - Nikolaos A. Patsopoulos
^{‡},^{§},**,^{2}and - Noah Zaitlen
^{††},^{2}

^{*}Department of Computer Science, University of California, Berkeley, California 94720^{†}Harvard T. H. Chan School of Public Health, Harvard University, Boston, Massachusetts 02115^{‡}Department of Neurology, Brigham & Women’s Hospital, Boston, Massachusetts 02115^{§}Division of Genetics, Department of Medicine, Brigham & Women’s Hospital, Boston, Massachusetts 02115^{**}Broad Institute, Cambridge, Massachusetts 02142^{††}Department of Medicine, University of California, San Francisco, California 94158

- 1Corresponding author: 378G Stanley Hall, University of California, Berkeley, Berkeley, CA 94709. E-mail: brielin{at}berkeley.edu

## Abstract

There is mounting evidence that complex human phenotypes are highly polygenic, with many loci harboring multiple causal variants, yet most genetic association studies examine each SNP in isolation. While this has led to the discovery of thousands of disease associations, discovered variants account for only a small fraction of disease heritability. Alternative multi-SNP methods have been proposed, but issues such as multiple-testing correction, sensitivity to genotyping error, and optimization for the underlying genetic architectures remain. Here we describe a local joint-testing procedure, complete with multiple-testing correction, that leverages a genetic phenomenon we call linkage masking wherein linkage disequilibrium between SNPs hides their signal under standard association methods. We show that local joint testing on the original Wellcome Trust Case Control Consortium (WTCCC) data set leads to the discovery of 22 associated loci, 5 more than the marginal approach. These loci were later found in follow-up studies containing thousands of additional individuals. We find that these loci significantly increase the heritability explained by genome-wide significant associations in the WTCCC data set. Furthermore, we show that local joint testing in a *cis*-expression QTL (eQTL) study of the gEUVADIS data set increases the number of genes containing significant eQTL by 10.7% over marginal analyses. Our multiple-hypothesis correction and joint-testing framework are available in a python software package called *Jester*, available at github.com/brielin/Jester.

- statistical genetics
- heritability
- epistasis
- polygenicity
- joint testing

GENETIC association studies typically take a marginal approach to analysis, investigating each SNP in isolation of all other SNPs for association with a phenotype of interest. While this method has led to the discovery of thousands of loci associated with hundreds of phenotypes (Welter *et al.* 2014; Eicher *et al.* 2015), it fails to capture the additional signal available when multiple SNPs representing independent genetic signals are examined simultaneously (Yang *et al.* 2012) or when SNPs are imperfectly imputed (Wood *et al.* 2011). Furthermore, the *hidden* heritability, the difference between the heritability due to genome-wide significant associations and heritability due to genotyped variants, remains substantial (Eichler *et al.* 2010). In this work we investigate a local joint-testing approach to analysis of genetic data sets in which pairs of variants from the same locus are examined simultaneously for association with a phenotype. The motivation for our approach comes from the mounting evidence that complex traits are highly polygenic (Visscher *et al.* 2012), that causal variants are not evenly distributed across the genome (Gusev *et al.* 2013), that known associated loci often harbor multiple causal variants (Udler *et al.* 2009; Fellay *et al.* 2010; Trynka *et al.* 2011; Wood *et al.* 2011; Liu *et al.* 2012; Patsopoulos *et al.* 2013; Pickrell 2014), and that the underlying causal variants can be in linkage disequilibrium (LD) with each other (Corradin *et al.* 2014).

In fact, LD between underlying causal variants can result in additive associations that would be nearly impossible to detect using standard marginal methods. Consider the case of two SNPs: one increasing risk for a disease and the other protective. If these SNPs are correlated in the study population, then marginal association methods will fail to detect the signal due to the large number of individuals carrying both variants (and therefore having little or no increased risk for the disease). The same effect can occur when the variants have the same effect direction but are anticorrelated in the study population. In the context of this article, we refer to these SNP pairs as *linkage masked*. Note that in practice linkage masking may present in two distinct and important ways. The first one is multiple correlated genotyped causal variants with opposite effect direction, which may occur due to the Bulmer effect (Bulmer 1971) (see *Discussion*). The second one is correlated genotyped variants with opposite tagging of an untyped causal variant. Lappalainen *et al.* (2011) give evidence that linkage masking between regulatory and coding variation may be common due to balancing selection. While linkage-masked SNPs are difficult to uncover using standard marginal association methods, mixed-model heritability is determined by a simultaneous fit of all SNPs while accounting for LD and therefore includes signal from linkage-masked SNPs, implicating them as a source of hidden heritability that has not been widely considered (Eichler *et al.* 2010).

Pairwise (joint) testing for additive effects without a statistical interaction term may help unmask these associations and, more generally, improve power in the presence of gene interactions, multiple causal variants, or multiple variants differentially tagging an untyped causal SNP. However, applying joint tests in practice has several problems. Because exact multiple-testing correction is usually unknown, several studies have used joint testing for follow-up and fine mapping of known associated loci, often revealing additional associated variants (Wood *et al.* 2011; Yang *et al.* 2012) and demonstrating the merits of joint testing in practice. Studies such as these are able to ignore multiple-hypothesis correction issues due to their focus on known associated regions but do not have the potential to reveal novel loci. Other studies examining genome-wide joint testing of all pairs of SNPs, including those with statistical interaction terms, pay such severe multiple-hypothesis correction penalties that many loci found via standard marginal approaches would not reach genome-wide significance (Prabhu and Pe’er 2012; Arkin *et al.* 2014) and are computationally expensive, although effect methods for reducing the computational burden have been explored (Prabhu and Pe’er 2012; Wang *et al.* 2015). Slavin and Elston (Slavin *et al.* 2011) proposed testing all adjacent pairs and applied their approach in the Wellcome Trust Case Control Consortium (WTCCC) seven disease study. Howey and Cordell (2014) (SnipSnip) proposed using a conditional test on 10 adjacent SNPs to choose a partner SNP for inclusion in the linear model. While these approaches reduce the multiple-hypothesis correction penalty, we show that they do not capture much of the available power gain. Furthermore, prior approaches have not accounted for a known issue with genotyping error and joint tests (Lee *et al.* 2010), which we show affects these methods.

By testing pairs of SNPs for additive effects, rather than individual SNPs, we improve power to detect loci containing multiple independent causal variants, including those containing linkage-masked SNPs. Through local testing, we substantially reduce the multiple-hypothesis correction penalty, while simultaneously enriching for situations in which joint tests are more powerful, *i.e.*, when there are independent additive genetic signals contained in each of the SNPs in the test. Rather than employing the overly conservative Bonferroni procedure to account for multiple testing, we extend the work of Han *et al.* (2009) to provide a method for sampling from the null distribution of joint tests orders of magnitude faster than a permutation test, making application computationally efficient. We applied our method to the WTCCC (WTCC Consortium 2007) cohorts for bipolar disorder, coronary artery disease, Crohn’s disease, hypertension, rheumatoid arthritis, type-1 diabetes, and type-2 diabetes. We define *significant locus* as any 1-Mb region containing a statistically significant association at family-wise error rate (FWER) 5% and compare the number of significant loci discovered from the National Human Genome Research Institute (NHGRI) database of replicated associations to the standard marginal method. We compare our approach to SnipSnip (Howey and Cordell 2014) and marginal analysis of imputed WTCCC genotypes. We also estimate the phenotypic variance explained by associated SNPs discovered in the marginal and joint approaches. Finally, we apply our method to gene expression data from the *Genetic European Variation in Health and Disease (gEUVADIS)* project, comparing the number of genes containing *cis*-expression QTL (eQTL) at various false discovery rate (FDR) thresholds under marginal- and joint-testing approaches.

We find the following:

Local joint testing provides significant power gains when multiple-risk variants are proximal, reaching as high as 41% when they are linkage masked.

Local joint testing in the original WTCCC cohort discovers seven loci not found via the standard marginal approach, all of which were later replicated in more powerful follow-up studies. Marginal analysis of imputed genotypes discovers only two of these loci, as well as one locus not detectable in either genotype-based method.

New SNPs and loci discovered via local joint testing explain a significant amount of the phenotypic variance of these diseases.

Joint testing all pairs of

*cis*-SNPs in gEUVADIS reveals 607 more genes at FDR 5%, an increase of 10.7% over the marginal approach.

## Methods

We begin by describing the null distributions of the tests in our procedure to motivate the sampling approach used to determine the multiple-testing correction. We then show that given the marginal *Z* scores at two SNPs the joint -test statistic can be exactly determined. In most cases, determining the significance level of a pairwise test of correlated variables requires a computationally expensive permutation test. However, we build upon a framework for generating marginal test statistics under the null efficiently, using a conditional sampling approach (Han *et al.* 2009), which has also been used for genome-wide interaction effect power calculations (Wang *et al.* 2015). This implies that we are able to efficiently sample from the null distribution of the joint test, allowing for a dramatic improvement in speed over a permutation test. Finally, we describe the local joint-testing procedure. For simplicity, we assume the phenotype and genotypes have been standardized.

### Asymptotic distribution of the marginal and joint tests

For many widely used statistical tests, the vector of test statistics over many markers asymptotically follows a multivariate normal distribution (MVN) under the null hypothesis of no association (Lin 2005; Seaman and Muller-Myhsok 2005). In particular, let *Y* be the phenotype of interest and *G* the genotype matrix, with the genotype at SNP *i* in a study with *N* individuals; then the Wald test is asymptotically From this, one can derive the correlation structure for two tests under the null (Han *et al.* 2009), so that the vector of marginal test statistics is asymptotically MVN with mean and covariance matrix

Next, we consider the value of the likelihood-ratio test (LRT) statistic for a linear or logistic two-SNP association test. In the *Appendix* we prove that it is possible to compute the value of this test statistic directly from the marginal association statistics, without fitting the joint model. Specifically, we prove the following:

**Observation 1.** *Let * *be the* *Z values of test statistics for the SNPs* * **against a phenotype Y*. *Let the correlation between SNPs* * **and* * **be* * **Then the* *likelihood-ratio test statistic for the model* * **against the null* * **is*(1)*and is asymptotically * *distributed*.

We verified computationally by simulating pairs of SNPs at all correlation levels that this equation is exact (not shown).

### Estimating the significance threshold and local joint testing

We use a conditional sampling method to sample the marginal test statistics under the null from the multivariate normal distribution. Since distal SNPs are likely to be independent, we choose a window size and “slide” along the genome, sampling null test statistics at SNP *i* conditional on the correlation with the previous SNPs (Han *et al.* 2009). Specifically, the MVN factors asand we can sample via the standard conditional MVNwhere is the vector of correlations between SNP *i* and the conditional SNPs, is the correlation matrix for the conditional SNPs, and represents their sampled values.

Each set of sampled marginal null test statistics roughly corresponds to one genome-wide marginal permutation test. Given these marginal null test statistics, we define a joint-testing window size and compute the joint null test statistics for every pair of SNPs within distance via Equation 1. These inferred joint null test statistics similarly correspond to a permutation test of all SNP pairs within distance of each other. With these sampled null test statistics in hand, computing the significance threshold is straightforward. For more details, see *Algorithm 1* (*Appendix*), which takes as an input a desired FWER and number of samples (roughly analogous to the number of permutations), a window size, and a set of joint tests *T* and outputs a multiple-testing corrected significance level corresponding to the desired FWER. To verify that our method produced results equivalent to those of a permutation test, we performed a permutation test of local joint testing with a window size of 100 SNPs on the first 1000 SNPs of chromosome 1 (∼10 Mb) in the wild-type (WT) control group. We find that the distributions of test statistics under the null for *Jester* and permutation approaches are concordant and that the significance thresholds corresponding to an FWER of 5% are nearly identical (Figure 1).

With the multiple-testing correction in hand, the local joint-testing procedure is a straightforward modification to the standard genome-wide association studies (GWAS) procedure. We choose a window size and correlation cutoff and fit the two-SNP LRT for every pair of SNPs exceeding correlation within markers of each other (*Algorithm 2*, *Appendix*). This procedure is implemented in a python package called *Jester* (github.com/brielin/Jester).

### Filtering false positives

Lee *et al.* (2010) describe an issue where genotyping errors that would go unnoticed in standard quality control (QC) procedures can cause inflation in joint and conditional tests of association. When SNPs are highly correlated, miscalled bases in only the cases or controls induce rare haplotypes. As these haplotypes are present only in the cases or controls, this increases the association signal in the joint test (Lee *et al.* 2010).

While performing our analysis, we found many highly correlated pairs of SNPs where neither SNP had substantial marginal signal but together they showed an extremely strong association. We accounted for this in two ways: first, we considered only associations arising from pairs with correlation <0.9, and second, we used imputation against 1000 genomes to reanalyze jointly significant genotyped SNPs. Specifically, for each pair of potentially significant SNPs, we used the -pgs flag of *Impute2* to hold out the genotyped SNPs and replace them with values imputed from the surrounding SNPs. We then recomputed the test statistic, using the imputed values of the SNPs. When the signal was a true association, the joint test statistic remained significant after pgs imputation (Supplemental Material, Table S2). However, when the association appeared to be driven by genotyping error, the joint test statistic became insignificant after pgs imputation (Table S7). In this way, we overcome the false positive error identified by Lee *et al.* (2010).

### Data sets

We analyzed the WTCCC phenotypes bipolar disorder (BD), Crohn’s disease (CD), coronary artery disease (CAD), hypertension (HT), rheumatoid arthritis (RA), type-1 diabetes (T1D), and type-2 diabetes (T2D). We chose this data set because it was one of the first GWAS performed and the phenotypes have been subsequently studied in independent large-scale GWAS. Thus, we emphasize the potential of early discovery of true effect leveraging nonstandard GWAS methods. We used a window size of SNPs for estimating the null distribution and a correlation cutoff of (all pairs in the window). We performed standard QC on the data, removing individuals with missingness SNPs with missingness markers failing a Hardy–Heinberg equilibrium (HWE) test at significance level 0.001, and SNPs with minor allele frequency The numbers of cases, controls, and SNPs left after QC for each data set are presented in Table S8. To impute the WTCCC cohort, genotypes were split into 1-Mb regions and prephased against the 1000 Genomes EUR reference panel, using HAPI-UR, and then imputed using Impute2 against the same reference panel. Nonbiallelic SNPs and SNPs with reference panel frequency < were not imputed. All imputed SNPs with information score <0.5 were excluded from further analysis.

We also analyzed gene expression data for 16,155 genes of the the gEUVADIS European data set. Raw RNA-sequencing reads obtained from the European Nucleotide Archive were aligned to the transcriptome, using UCSC annotations matching hg19 coordinates. RSEM (software tool) was used to estimate the abundances of each annotated isoform and total gene abundance is calculated as the sum of all isoform abundances normalized to 1 million total counts or transcripts per million (TPM). Genotyping data were obtained from the 1000 Genomes Phase III public release. eQTL mapping was performed on a per-gene basis. The *cis* region of each eQTL was defined as all SNPs with minor allele frequency >5% within 200 kb of the transcription start site (TSS), which was chosen because the vast majority of eQTL are known to be contained in this region (Stranger *et al.* 2012). Joint tests were performed between all pairs of SNPs in the *cis* region. In each analysis, 30 genotype principal components were included as covariates. Approximate permutation tests from our sampling procedure with 2500 samples were used to infer permuted *P*-values separately for the marginal and joint approaches, which were then independently analyzed to determine the number of significant genes at FDR 1–25%.

### Data availability

All genotypes and phenotypes used in the analysis are publicly available. Software used to analyze genetic data are available at github.com/brielin/Jester. Table S1 contains details of each locus discovered in the marginal analysis. Table S2 is the same for *Jester* with window size 100. Table S3 is the same for *Jester* with window size 50 and correlation cutoff of 0.04. Table S4 contains SnipSnip results. Table S5 contains results of a marginal analysis of imputed WT genotypes. Table S6 contains imputation analysis as described in the Filtering false positives section for pairs reported in the Slavin *et al.* paper (2011) that passed QC. Table S7 is a sampling of false positives. Table S8 describes the SNP and sample numbers remaining in our data sets after QC. Figure S1 contains the distribution of correlations of significant SNPs discovered using *Jester*. Figure S2 contains the results of running *FaST-LMM set* on the WTCCC cohort. Scripts used in simulations are available upon request.

## Results

### Multiple-testing penalties for WTCCC and power

We computed the multiple-testing correction for local joint testing in the WTCCC cohort, using *Jester* for various window sizes and cutoffs. We define the *effective number of tests* (ENT) as the number of independent tests that would correspond to a corrected significance level of for corresponding FWER *α*. That is, We found that the significance level of genome-wide marginal testing at FWER 5% was We chose to use a window size of 100 and cutoff of 0 for our main analysis, increasing the ENT by a factor of 19.55 over the marginal test, even though we perform 100 times as many tests We chose the window size based on the work of Han *et al.* (2009) and used no correlation cutoff because our power simulations showed an increase in power even in the absence of LD (Figure 2, right). For smaller window sizes and larger cutoffs, the multiple-testing burden decreases substantially. Using a modest cutoff of 0.004, for example, increases the number of tests by a factor of 8.90 over the marginal test

We sought to determine the relative power of local joint testing *vs.* marginal testing in the case of (1) two correlated variants in LD and (2) a single causal variant. We simulated pairs of genotypes for 5000 individuals with allele frequency 50% and correlation We simulated phenotypes in a linear model with standard normally distributed environmental noise where (1) both SNPs had an effect size of or (2) only SNP 1 had an effect size of We used the significance thresholds computed above to incorporate the effect of testing pairs of correlated variants genome-wide. We found a substantial increase in power for modest window sizes and small correlation cutoffs (Figure 2, right) when there were multiple causal variants. Using a window size of 50 and cutoff of 0.004 led to an increase in power of up to 41.0% when there were multiple correlated causal variants, while a window size of 100 and cutoff of 0.0 resulted in an increase in power of 35.4%. In the absence of multiple causal variants, the increase in multiple-testing burden and degree-of-freedom penalty gave a decrease in power of 15% and 20%, respectively, for the two testing contitions, for all correlation levels. Note that while our method shows its most substantial gain when SNPs are linkage masked, we also see up to a 25% increase in power when the causal variants are uncorrelated.

### Significant loci in WTCCC

We analyzed the WTCCC phenotypes, using four different methods: (1) marginal tests at an FWER of (2) joint tests with window size 100 and at the significance level estimated by *Jester* corresponding to an FWER of (3) SnipSnip (Howey and Cordell 2014) with window size of 10 SNPs at significance level and (4) marginal tests of imputed genotypes at significance level Since SnipSnip does not include an analysis of the multiple-testing correction, we chose to use Howey and Cordel’s (2014) suggested significance threshold.

Local joint testing resulted in the discovery of 2.3 times as many associated SNPs over the marginal method, summarized in Table 1. In Figure S1 we provide a plot of the density of the correlation between the pair of SNPs that are genome-wide significant at FWER 5%, using *Jester*. These SNP pairs have a range of correlations, but SNP pairs where neither SNP was discovered in marginal testing have higher correlation (signed with respect to opposite-effect directions) than the average (Figure S1). The marginal test of genotyped SNPs revealed 17 significant loci. *Jester* discovered 7 additional loci while missing 2 of the original due to the increased multiple-testing burden, for a total of 22 significant loci. For each of these 7 newly significant loci, we searched the NHGRI GWAS database (Welter *et al.* 2014) for reported associations and found that each one had been reported in more powerful disease-specific follow-up studies. The significant SNP pairs in 4 of these 7 novel loci were linkage masked, with correlations ranging from 0.26 to 0.74 (signed with respect to opposite-direction SNP effects). Interestingly, SnipSnip discovered fewer SNPs and loci than the marginal method, but we emphasize that the set of loci it uncovered is not a strict subset of those discovered via the marginal method (Table S4). That is, it discovers some new loci while missing some that are marginally significant and thus remains useful as a secondary analysis tool.

We also compared our method to a GWAS of imputed genotypes, the current gold standard method. This allows us to determine both how our method compares in the number of discovered loci and whether the linkage-masked loci that *Jester* uncovered were due to correlated SNPs with opposite tagging of an untyped causal variant. The imputed GWAS discovered 20 loci in total: the 17 marginally discovered loci, 2 of the 7 linkage-masked loci (Table 2), and 1 locus found by SnipSnip but not *Jester* or the marginal GWAS (CD 18p11.21, Table S4 and Table S5). The 2 loci discovered by both *Jester* and imputation (CD 5q31.1 and T1D 10p15.1) were linkage masked, but the discovery of a significant SNP after imputation implies this was due to opposite tagging of an untyped causal. Of the remaining 5 loci, 3 (CD 10q21.3, RA 1p36.32, and RA 10p15.1) were uncorrelated, supporting the presence of multiple causal variants. The final 2 loci (CD 6p21.32 and T2D 9p21.3) were correlated ( and , respectively) but did not contain a significant SNP after imputation. We conclude that these loci are strong candidates for follow-up study to validate the presence of linkage masking. For complete details of all associations discovered in each method, see Table S1, Table S2, Table S3, Table S4, and Table S5.

### Phenotypic variance explained by two new genome-wide significant associations

We sought to quantify the effect of newly significant SNPs on the phenotypic variance explained by genome-wide significant associations. We used *GCTA* to compute the genetic relationship matrix (GRM) for each phenotype, using only SNPs identified as significant using either the classic marginal- or local joint-testing method. We used the *GCTA* -mgrm mode to fit a mixed model containing both marginal and joint GRMs and performed a likelihood-ratio test of the full model against a reduced model containing only the marginal GRM to assess statistical significance. We were unable to fit the model for RA because of numerical issues. For the four remaining disease phenotypes with genome-wide significant associations, three (CD, T1D, and T2D) show a statistically significant increase in the phenotypic variance explained, while one (CAD) shows no increase in the phenotypic variance explained. The increase in phenotypic variance explained varies by phenotype, with CD showing a 74% increase, T1D showing a 12% increase, and T2D showing a 75% increase (Figure 3).

However, the winner’s curse may disproportionally affect joint testing due to the increased number of tests relative to the marginal approach. To quantify the effect of the winner’s curse we estimated the out-of-sample phenotypic variance explained by genome-wide significant associations for T2D, using European individuals from the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort (detailed description of the cohort and study design can be found in dbGaP, study accession no. phs000674.v1.p1). Of the 16 marginally significant SNPs present in WTCCC, 5 were also present in GERA. Of the 250 joint significant SNPs present in WTCCC, 65 were also present in GERA. In the GERA analysis, we estimated the phenotypic variance explained by those sets of SNPs as the coefficient of determination () in a linear regression of the marginal and joint SNP sets against the T2D phenotype. We found that the variance in liability explained in GERA for the marginally significant SNPs was 1.03% (0.07%), and the variance in liability explained in GERA for the joint significant SNPs was 1.52% (0.11%), an increase of 46.6% ( for an LRT of the joint and marginal SNPs against just the maginal SNPs). Therefore we conclude that the winner’s curse does affect in-sample estimates, but the increase in variance explained due to linkage-masked variants remains substantial.

### gEUVADIS eQTL analysis

We compared the number of genes containing an eQTL at an FDR of 1–25%, using standard marginal linear regression of all *cis*-SNPs against joint tests of all *cis*-SNPs (Table 3). At an FDR of 5%, we find that 5641 of 16,155 genes contain an eQTL using the marginal test, and 6248 genes contain an eQTL using the joint test, an increase of 10.7%. As in our analysis of the Wellcome Trust data, the genes discovered using the marginal approach are not a strict subset of those discovered using the joint approach. For each level of FDR, we determined the proportion of genes uncovered in the joint but not marginal approach that appear to be linkage masked. At FDR 5%, the joint-testing approach discovers 908 new genes. In 381 of those 908 genes the significant SNP pairs have a correlation of >0.2 signed with respect to opposite-direction SNP effects (Table 3). In Figure S1 we provide a plot of the density of the correlation between the pair of SNPs in the most significant pair for each gene containing an eQTL at FDR 5%, using *Jester*. These SNP pairs have a range of correlations, but SNP pairs in genes without an eQTL discovered in marginal testing have higher correlation (signed with respect to opposite-effect directions) than the average (Figure S1).

## Discussion

In this work we described a local joint-testing procedure, its multiple-testing correction, the genetic architecture for which it is well powered, a system for reducing susceptibility to genotyping error, and implications for phenotypic variance explained from GWAS SNPs. We have shown that when loci harbor multiple causal variants, the joint test can outperform the marginal test substantially. We observed that our method outperforms the standard marginal association method, providing further evidence that disease loci frequently harbor multiple causal mutations. In our simulations, we find that the largest power gains come from linkage-masked SNPs: when SNPs have opposite-effect direction but are correlated in the study population. Furthermore, 4 of the 7 (57%) newly significant WT loci have the aforementioned property and 381 of the 908 (42%) newly discovered eQTL at FDR 5% show evidence of linkage masking. This finding is weakened only somewhat by the fact that 2 of the 4 linkage-masked loci in WTCCC appear to be type 2 linkage masking, correlated SNPs with opposite tagging of an untyped causal, while the remaining 2 appear to be type 1 linkage masking. Still, we observe increased power due to detection of linkage masking in all aspects of our analysis. The Bulmer effect implies linkage masking should be common; high-fitness haplotypes are able to resist selective pressures and may acquire fitness-decreasing mutations without being eliminated from the population. SNPs of this kind are hypothesized as a source of missing heritability by Haig (2011) and Lappalainen *et al.* (2011) argue that gene expression data support widespread linkage masking due to balancing selection.

We find more evidence for linkage masking in the regulation of gene expression that in the complex disease phenotypes we consider. However, it is not surprising that such effects are more difficult to find in a genotyped cohort. As the effects of such SNPs are already masked, the tagging SNP pair must be in tight LD with the causal SNP pair to prevent severe power loss. On top of this, the tagging SNP pair must be highly correlated to achieve the increased signal necessary to find linkage-masked SNP pairs. Hemani *et al.* (2013) make a similar argument, showing that small reductions in LD can result in dramatic underestimation of epistatic effects. While linkage-masked SNPs are difficult to uncover using standard marginal association methods, their signal is included in mixed-model SNP heritability estimates. These SNPs are therefore a source of hidden heritability: the difference between the phenotypic variance explained by genome-wide significant associations and the phenotypic variance explained by genotyped variation. This is in contrast to the *missing* heritability, the difference between the variance explained by genotyped variation and the total narrow-sense heritability, which is unaffected by linkage masking. Our result narrows the hidden heritability gap by discovering new associations that increase the variance explained due to statistically significant associations.

Our approach is not without drawbacks. When causal variants are sparse, we see a reduction in power due to the degree-of-freedom and multiple-test correction that reaps benefits only in the presence of multiple signals. While our results indicate this is a less common situation, there are two loci discovered by a marginal GWAS but not by *Jester* (Table S1). Additionally, it can be difficult to distinguish heavy linkage masking from genotyping error. In our analysis of the WT disease associations, we used a window size of 100 SNPs, chosen based on results from Han *et al.* (2009). A logical question is whether a larger or smaller window size would lead to more results. We repeated the analysis with a window size of 50 and a correlation cutoff of 0.04 and discovered the same number of loci and slightly more SNPs than presented here (Table S3).

The relationship between joint testing and set testing (Ionita-Laza *et al.* 2013; Listgarten *et al.* 2013) remains an interesting avenue for further investigation. Set testing and joint testing are similar in that they (1) both improve power to detect associations in the presence of multiple causal variants, (2) both are susceptible to false positives due to genotyping error and population structure, and (3) both require an explicit estimation of the multiple-testing correction to maintain the desired FWER. We consider one recent set test, FaST-LMM-set (Factored Spectrally Transformed Linear Mixed Models), which has shown desirable properties with respect to previous set tests (Casale *et al.* 2015). FaST-LMM-set without a background kernel is equivalent to a likelihood-ratio test of the SNPs in the set against a null model, while our test is a likelihood-ratio test of pairs of SNPs. We used FaST-LMM-set to analyze the WTCCC data set, using 100-SNP windows, the same size used in our analysis, but found concerning levels of inflation in the test statistics that we were unable to resolve (Figure S2). In addition, set tests currently require expensive permutation tests to control the FWER, making genome-wide application computationally intensive (Casale *et al.* 2015). We view approximating the multiple testing correction of set testing without resorting to permutations as an interesting open problem and believe that it may be possible to extend the MVN framework to set tests in the same way we have done for joint tests. We hope to explore this connection thoroughly in future work.

Yang *et al.* (2012) propose a mathematically similar approach to ours, determining joint test statistics from marginal summary test statistics (albeit only at genome-wide significant marginal loci), while using an external reference panel to estimate the pairwise LD. This approach in combination with our multiple-hypothesis correction threshold could provide a way to apply our local method without access to genotype data. We caution that our proposed imputation-based genotyping error correction method will not be applicable here and thus high-LD SNPs should be avoided in such an analysis. Furthermore, the variance of the correlation coefficient estimates can be large even when many hundreds of individuals are available in the reference panel (Zaitlen *et al.* 2009), which could lead to false positives. While the reference panel correlations may still be relatively accurate for controls from the same population, case individuals are more likely to harbor many disease-associated mutations and thus will not match the reference panels as well (Zaitlen *et al.* 2012a,b). Even with these caveats, however, the vast gain in power possible with the increased sample size of summary data makes this a tempting proposition, and we have implemented this method in our software package.

## Acknowledgments

We thank Lior Pachter, Ingileif Hallgrímsdóttir, and Sang Hong Lee for valuable discussion; Anne Biton, Joel Mefford, and Brunilda Balliu for helpful comments on the manuscript; and Nick Furlotte for permission to use SNP tools from pylmm in our software. B.C.B. is supported by the National Science Foundation Graduate Research Fellowship Program. N.Z. is supported by National Institutes of Health (NIH) grant K25HL121295. A.L.P. is supported by NIH grant R03 HG006731. N.A.P. is supported by a Career Independence Award by the National Multiple Sclerosis Society. The authors declare no conflicts of interest.

## Appendix

### Algorithms

#### Algorithm 1

Method for sampling from the null distribution to determine significance threshold.

**Input:** significance level *α*, number of samples *n*, window size a set of joint tests *T*.

**Output:** Significance threshold.

Sample marginal test statistics using a conditional normal approximation.

Compute the *P*-values associated to the marginal tests

**for** **do**

Compute using (1)

Compute the *P*-value associated to *S*

**end**

Sort the *P*-values from all tests performed

**return** th smallest *P*-value.

#### Algorithm 2

*Jester*’s GWAS pipeline.

**Input:** A matrix of genotypes *G* and a vector of phenotypes *Y* for *N* individuals.

**Output:** A set of pairs of variants meeting genome-wide significance.

Perform standard QC on *G* and *Y*

Estimate the joint test significance threshold

**for** SNP **do**

Test for association with *Y*

Jointly test with any of the preceding markers correlated with at level or greater

**end**

**return** SNP pairs with *P*-value

### Proof of Observation 1

Consider the linear modelsThe likelihood-ratio test is asymptotically under the null hypothesis of no association. In fact, it is possible to compute the value of this test statistic directly, without fitting the model against a null phenotype.

**Observation 1.** *Let * *be the* *Z-values of test statistics for the SNPs* * **against a phenotype Y*. *Let the correlation between* * **and* * **be* * **Then the* *likelihood-ratio test statistic for the model J against the null N is**and is asymptotically * *distributed*.

*Proof*. The equality follows from setting up the normal equations and solving them. Let *X* be the matrix of regressors Assume that in the linear model the distribution of the error terms is Then the normal equations for the *β*-coefficients areSolving this and simplifying yieldswhereis the determinant of

The log-likelihood for a linear model is Using the above calculation of *β*, we can find the model log-likelihoods and compute the likelihood ratioSimplifying the above yieldsThus, given the marginal test statistics and the sample correlation of the genotype pair, we can compute the joint test statistic under the null without computationally fitting the model (similar results derived in other contexts can be found in Seaman and Muller-Myhsok 2005 and Yang *et al.* 2012). This, when combined with MVN sampling of the marginal test statistics, allows a substantial speedup over a permutation test.

While the above is derived in the context of a continuous disease phenotype, it is straightforward to conclude that the framework is extensible to case–control (binary) phenotypes. While we assumed for simplicity the phenotype was standardized, the result is independent of the scale of *Y* and thus holds for the diseases on the underlying liability scale. Since the least-squares model does not rely on the assumption that the error terms are normally distributed (only that they are spherical), the result extends to logistically distributed residuals (logistic regression). In this case the *β*’s are log-odds ratios.

## Footnotes

*Communicating editor: E. Eskin*Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.116.188292/-/DC1.

- Received February 17, 2016.
- Accepted April 27, 2016.

- Copyright © 2016 by the Genetics Society of America