## Abstract

With growing human genetic and epidemiologic data, there has been increased interest for the study of gene-by-environment (G-E) interaction effects. Still, major questions remain on how to test jointly a large number of interactions between multiple SNPs and multiple exposures. In this study, we first compared the relative performance of four fixed-effect joint analysis approaches using simulated data, considering up to 10 exposures and 300 SNPs: (1) omnibus test, (2) multi-exposure and genetic risk score (GRS) test, (3) multi-SNP and environmental risk score (ERS) test, and (4) GRS-ERS test. Our simulations explored both linear and logistic regression while considering three statistics: the *Wald* test, the *Score* test, and the *likelihood ratio test* (LRT). We further applied the approaches to three large sets of human cohort data (*n* = 37,664), focusing on type 2 diabetes (T2D), obesity, hypertension, and coronary heart disease with smoking, physical activity, diets, and total energy intake. Overall, GRS-based approaches were the most robust, and had the highest power, especially when the G-E interaction effects were correlated with the marginal genetic and environmental effects. We also observed severe miscalibration of joint statistics in logistic models when the number of events per variable was too low when using either the *Wald* test or LRT test. Finally, our real data application detected nominally significant interaction effects for three outcomes (T2D, obesity, and hypertension), mainly from the GRS-ERS approach. In conclusion, this study provides guidelines for testing multiple interaction parameters in modern human cohorts including extensive genetic and environmental data.

- gene and environment interaction
- joint test analysis
*Score*test statistic- genetic risk score
- environment risk score

GENE and environment (G-E) interaction has been studied for a wide range of human traits using both genome-wide scale interaction screening (Hamza *et al.* 2011; Hancock *et al.* 2012; Wei *et al.* 2012; Wu *et al.* 2012; Siegert *et al.* 2013) and targeted analyses focusing on sets of genes or single nucleotide polymorphisms (SNPs) (Mahdi *et al.* 2009; Risch *et al.* 2009; Nickels *et al.* 2013; Dashti *et al.* 2015). In regards to the limited success, a number of statistical methods have been developed to improve the detection of G-E interaction effects (Thomas 2010a; Aschard *et al.* 2012; Gauderman *et al.* 2013). In particular, statistics based on aggregated genetic information have been shown to be a promising path forward (Manning *et al.* 2011; Hutter *et al.* 2012; Ma *et al.* 2013; Courtenay *et al.* 2014; Qi *et al.* 2014; Jiao *et al.* 2015; Aschard *et al.* 2017). In practice, the most common strategy consists in testing for genetic risk score (GRS)-by-exposure interaction using SNPs previously identified in marginal genetic effect screenings (Ripatti *et al.* 2010; Salvatore *et al.* 2014; Pisanu *et al.* 2017), although the approach is applicable to any sets of SNPs [*e.g.*, gene-level sets, pathway- or network-level sets, or polygenic set (Thomas 2010b; Meyers *et al.* 2013)]. Basically, the GRS-based method aggregates genetic information by summing risk alleles (alleles associated with increased value of quantitative traits or greater risk of disease traits). Potential gain in power for such approaches comes from circumventing a penalty for multiple testing [a single 1 degree-of-freedom (df) test rather than one test per SNP]. However, the main limitation is that the power gain relies on an assumption that interaction effects, if present, are highly correlated with the marginal genetic effects (*i.e.*, the risk alleles of SNPs in the GRS have G-E interaction effects in the same direction). Note that this is very similar to the burden test assumption for rare variants analysis (Lee *et al.* 2014). When this concordance assumption does not hold, the standard individual SNP-based interaction approach can outperform the GRS-based interaction approach.

Given that a growing amount of extensive phenotypic and epidemiological data in human genetic cohorts becomes available, joint interaction tests involving multiple SNPs and multiple exposures have been seldom considered, although a linear mixed model approach using a random effect for multiple interactions has been recently described (Moore *et al.* 2019). There are several arguments in favor of applying joint interaction approach for multiple interactions. First, multiple environmental factors might influence a disease through the same intermediate mechanisms. For example, exposure to various carcinogens increases the risk of cancers by increasing the risk of deleterious genetic mutations (Kawaguchi *et al.* 2006; Ferreccio *et al.* 2013). Similarly, shared intermediate phenotypes (*e.g.*, atherosclerosis) for heart attack and stroke are known to be associated with multiple lifestyle factors (*e.g.*, smoking, diet, and alcohol consumption) (Massin *et al.* 2007; Rafieian-Kopaei *et al.* 2014). In such situations, one can hypothesize that nongenetic risk factors may also have shared interaction effects with genetic variants of the disease in question (Figure 1A). Second, because most exposures associated with human diseases display modest effect sizes, risk score approaches integrating all effects of risk factors (McClelland *et al.* 2015; Merchant 2017) can potentially lead to increased power as done for GRS. Moreover, some exposures have strong correlations with one another [*e.g.*, cigarette smoking and alcohol consumption (Fisher and Gordon 1985), or diet and socio-economic status (Darmon and Drewnowski 2008)]. Correlations among exposures, if induced by an unmeasured variable, can be used to improve power to detect interactions though only a part of exposures interact with genetic variants (Aschard *et al.* 2014) (Figure 1B).

As environmental data are increasingly common in large-scale human genetic studies, interaction analyses including multiple SNPs and multiple exposures might be performed systematically on behalf of standard G-E interaction screenings. However, despite a few recent works published (Casale *et al.* 2017), our knowledge of the strengths and limitations of joint analysis approach for multiple G-E interactions is still limited. Here, we addressed a part of this question and explored the relative performance of four joint G-E interaction test approaches for both quantitative and binary trait models: (1) a joint test for multiple single SNP-by-single exposure interaction effects (omnibus test), (2) an interaction test between (weighted or unweighted) GRS and multiple exposures, (3) an interaction test for multiple SNPs and an environmental risk score (ERS), and (4) a GRS-by-ERS interaction test. Specifically, we assessed their robustness and relative power through simulations using three different statistical tests [*Wald* test, *Score* test and *likelihood ratio test* (LRT)] and varying a range of parameters including the total number of SNPs and exposures considered, the presence of correlation between exposures, dependence between SNPs and exposures, and the pattern of G-E interactions in regards of the marginal genetic and environmental effects. We further demonstrated the relevance of the proposed approaches in three large sets of population-based cohort data focusing on four common complex traits [coronary heart disease (CHD), type 2 diabetes (T2D), obesity, and hypertension] and four environmental risk factors (total energy intake, diet quality, physical activity, and smoking status).

## Materials and Methods

### Model overview

Consider the following generalized linear model, including main effects of genetic and exposure risk factors and G-E interaction effects:(1)where are SNPs, are exposures, and the link function is either the identity when the outcome Y is continuous or expit() when Y is a disease probability. In this model, is the main effect of (, where M is the total number of SNPs), is the main effect of (, where K is the total number of exposures), and is the interaction effect between and .

We aim at assessing the relative performances of joint interaction tests, where multiple interaction parameters are tested jointly in a fixed effect model, whether or not some of the predictors are aggregated into summary variables (see next section about GRS and ERS). For mathematical convenience, we present here three joint tests using the *Wald*, *LRT*, and *Score* statistics. The multivariate *Wald* statistics is defined as:(2)where S is the vector of the L estimated interaction effect parameters tested jointly and Q is the estimated variance-covariance matrix of these parameters, *i.e.*:Under the null hypothesis of no interaction effect ( = … = = 0), Γ follows a chi-squared distribution with df equal to , the total number of interaction terms tested jointly .

The *LRT* statistics is defined as:(3)where is the likelihood of the model when the estimated interaction coefficients using Maximum Likelihood Estimators (MLE) or Ordinary Least Square. The follows a chi-squared distribution with df under the null.

The *Score* test statistics is defined as:(4)where is the value of the derivative of the log-likelihood when , and is the Fisher Information. Under the null hypothesis, the follows a chi-square distribution with df. Note that an important difference between the *Score* test and the other tests is the total number of parameters estimated. Indeed, the *Wald* test requires the unrestricted estimates of the parameter (*i.e.*, the model including interactions, so that the total number of parameters equals M + K + M×K), while the *LRT* requires quires both the restricted (without interaction) and unrestricted estimates of the parameter [*i.e.*, (M+K)×2 + M×K], and the *Score* test requires restricted estimates of the parameter (*i.e.*, M + K). More details on these test statistics are described in Supplemental Note (see Supplemental Material).

### Interaction tests considered

In the standard omnibus test, all interaction effects, from Equation 1 are estimated and tested jointly, so the test statistic . For GRS-based interaction tests, we consider both weighted and unweighted forms, where a GRS is built as the (weighted or unweighted) sum of risk alleles of the M SNPs. Explicitly, , and _{,} where is commonly defined as marginal genetic risk estimates from . We first use GRS in a multi-exposure by GRS (multiE-GRS) model:(5)so that the corresponding combined test of the interaction terms is: . As for GRS, we also consider the use of ERS to capture a global effect of multiple exposures. The ERS is built similarly to the using weights from marginal environmental models (*i.e.*, ). The multi-SNP by ERS (multiSNP-ERS) model is then defined as:(6)and the corresponding combined test of the interaction terms is defined as . Finally, we consider the GRS-by-ERS interaction (GRS-ERS) approach:(7)in which test statistics of can be defined as

### Simulation study

Unless otherwise stated, we simulated series of 10,000 replicates each including *N* = 20,000 samples using Equation 1 with *K* = [2–10] correlated exposures and *M* = [10, 100, 300] independent SNPs, while varying the distribution of exposures data (normal/non-normal, correlated strongly/moderately), and allowing for dependence between SNPs and exposures. We also varied the parameters of the model , but always assumed nonzero main effects of the genetic variants and environmental factors that increased risk of diseases (*i.e.*, and ). In each series, we explored the performance of six joint test approaches (*i.e.*, omnibus, multiE-uGRS, multiE-wGRS, multiSNP-ERS, uGRS-ERS, and wGRS-ERS) in null models and alternative models for robustness and power, respectively.

For null simulation series, we calculated genomic inflation factor as the ratio of the median value of observed chi-square statistics over the median value of expected chi-square statistics. Power was estimated under the alternative models for two main scenarios, (1) different percentages of true G-E interaction (*i.e.*, 20, 40, or 60%), and (2) in the presence or absence of correlation between marginal effects and interaction effects. When comparing power, we also performed a standard univariate model testing each G-E pairwise interaction independently (*e.g.*, 100 univariate models for testing 10 SNP and 10 exposures). The significance threshold for the six interaction approaches was 0.05, while the significance threshold for the univariate model was adjusted for multiple testing by using Bonferroni correction (= 0.05/L, where L is the total number of interaction parameters tested). Binary traits were analyzed using logistic regression, while quantitative traits were analyzed using standard linear regression. For hypothesis testing, we considered three test statistics in the joint test: *Wald* test, *Score* test, and *LRT*.

Genetic variants were drawn independently of each other from a binomial distribution with *n* = 2 and using the coded allele as the risk allele. We considered two scenarios including only common variants [risk allele frequency (RAF) of 1–99%] or only rare variants (RAF of 0.1–1% or 99–99.9%). To mimic the genome-wide significant SNPs, the main effects of SNPs were drawn from a left truncated normal distribution, mean of 0, and variance equals to /, where , the trait heritability, equals 0.3 and , the number of causal SNPs, equals 10,000. Then, actual SNPs coefficients were derived by rescaling the main genetic effects based on the expected probability of allele frequencies , where p is RAF. We generated exposure values from a multivariate normal distribution with mean 0 and a covariance matrix set for the presence of relatively strong (mean pairwise *r*-squared equals 0.10, Supplemental Material, Figure S6A) or moderate correlation (mean pairwise *r*-squared equals 0.02, Figure S6B). For non-normal exposures, we randomly selected 50% of the exposures and squared all values, resulting in a chi-squared distribution of the exposure. Exposure effects on the outcome were drawn from absolute values of a normal distribution and assuming the total outcome variance explained by all exposures ranged between 0.02 and 0.05.

When assuming G-E correlations, we randomly selected 50% of the exposures and a random set of associated SNPs and added genetic effects drawn from a normal distribution with mean 0 and variance 0.005 (so that each associated SNP explained on average 0.5% percent of the variance of the exposure) to the selected exposures. For power, we generated G×E interactions using standardized SNPs in order to ensure that the interaction terms do not modify the marginal effects of SNPs (*i.e.*, adding interactions of unstandardized SNPs could change the direction of SNP marginal effects in the model) (Aschard 2016). For linear outcome, G-E interaction effects were generated from uniform distribution in the range [0.001, 0.002] when assuming correlation between interaction effects and marginal effects, and in the range [−0.007, 0.007] when assuming no correlation between those. For binary outcome, were also drawn from uniform using the ranges [0.002, 0.002] and [−0.01, 0.01] in the presence or absence of the correlation, respectively. Finally, for logistic models, we defined the intercept so that the expected baseline prevalence equaled to 30%.

### Real data application

We applied our interaction test approaches to three large sets of human cohort data, the Nurses’ Health Study (NHS I), NHS II, and Health Professional Follow-up Study (HPFS). The total sample size available with genetic data in the three cohorts was 37,664. Though all analyses were conducted in the combined cohort data, each analysis included different numbers of cohort participants depending upon the availability of disease and exposure data considered. All disease and exposure data were drawn from self-reported biannual questionnaires of each cohort.

We focused on four binary traits, T2D, CHD, obesity, and hypertension. For each trait, we excluded all individuals who had the disease prior to each cohort inception (baseline). Then, we defined cases as individuals who had reported to have the disease since the baseline and controls were defined as individuals who had never reported it between the baseline and their last time of follow-up. Exposure variables we considered were established risk factors of the four diseases, such as smoking status (ever smoker *vs.* never smoker), physical activity [measured as Metabolic Equivalent of Task (MET) hours per week], diet quality [Healthy Eating Index (HEI) (range: 1–100) that indicates healthier dietary intake with higher score)], and total energy intake (kcal/week). To avoid potential reverse causation, we used exposure data measured at the earliest time point in the follow-up (*i.e.*, baseline of each cohort). Since all the traits were binary, and based on our simulation results, we used a logistic regression model with *Score* test statistic.

To build G-E interactions, we included sets of SNPs previously identified to be associated with the four traits from large-scale GWAS (International Consortium for Blood Pressure Genome-Wide Association Studies *et al.* 2011; Morris *et al.* 2012; Locke *et al.* 2015; Nikpay *et al.* 2015). We also included the following covariates: age (when exposures were measured), study (NHS, NHSII, and HPFS), genotyping platforms (Affymetrix, IIllumina, Omniexpress, Oncoarray, and Humancore exomchip) (Lindström *et al.* 2017), and principal components computed from the full sets of genotypes (top three principal components of each platform).

### Data availability

The authors affirm that all data necessary for confirming the conclusions presented in the article are represented fully within the article. Supplemental material available at Figshare: https://doi.org/10.25386/genetics.6849047.

## Results

### Validity of the statistical approaches for multiple interactions

We first compared the robustness of three test statistics (*Wald* test, *Score* test, and *LRT*) when testing multiple parameters jointly in linear and logistic regression models (Figure 2). Using null models of no interaction effect (but assuming the presence of marginal genetic and environmental effects), we simulated series of 20,000 individuals with M = 100 independent common SNPs and K = [2, 6, 10] normally distributed and correlated exposures. For binary outcomes, we considered a prevalence of 30%. The *Wald* statistics showed strong robustness in the linear regression but severe deflation in logistic regression as the number of interactions tested jointly increases. The standard *LRT* statistics, derived from (MLE), showed strong inflation with increasing number of interactions in both linear and logistic regression models. Note that inflation for the linear model could be easily fixed by substituting MLE with ordinary least squares estimates (Supplemental Note and Figure S1), but such a fix is not possible for logistic model. The *Score* statistics showed the highest robustness in the both linear and logistic regression, although we noted some non-negligible inflation for logistic regression as the number of interactions increased. As discussed in the Supplemental Note, the better calibration of the *Score* test is likely explained by the smaller number of parameters that have to be estimated (*i.e.*, conversely, the *LRT* and *Wald* test face instability because of the many interaction parameters that the tests have to estimate).

Because the bias was more obvious in logistic regression, we further examined whether the bias was influenced by the modeling of G-E interaction effects or simply due to the number of outcome events per predictor variables (EPV) (Peduzzi *et al.* 1996) (Figure S2). We compared chi-squared statistic distributions of the three test statistics of omnibus test under a complete null model (*i.e.*, no interaction and no main effects) for a fixed EPV (*e.g.*, EPV = 5), while assessing marginal genetic effects only, or interaction effects only ,with the same number of parameters L in testing jointly. More precisely, we compared two scenarios: (1) draw M independent SNPs and tested L SNPs, only a subset of parameters jointly (*e.g.*, *M* = 120, *L* = 100); or (2) draw M independent SNPs and K correlated exposures, and tested L interaction parameters jointly (*e.g.*, *M* = 10, *K* = 10, *L* = 100). As shown in Figure S2 for L = [100, 400], we observed trends similar to those from Figure 1 in marginal effect models (inflation for *LRT* and deflation for *Wald* test). Although we noticed the bias might be slightly larger for G-E interaction models, we did not observe any major qualitative difference between interaction models and SNP only models, suggesting the bias is driven mostly by the small number of EPV (*e.g.*, EPV < 10).

Exploring further the impact of different numbers of EPV on multivariable interaction tests, we found that joint analysis of multiple parameters (in our case multiple interaction terms) tended to be dramatically more sensitive to EPV than standard univariate test (Figure S3). As expected based on existing literature, univariate tests were robust across different test statistics once the test achieved the rule of thumb of EPV = 10 (Vittinghoff and McCulloch 2007), regardless of the total sample size. Conversely, in omnibus tests for a fixed EPV, *Wald* test and *LRT* statistics showed increasing deflation and inflation, respectively, as the sample size increased. For example, whereas an EPV of 10 might be sufficient to have a calibrated *LRT* for a sample size of 1000, increasing the sample size to 5000 required to reach an EPV of 50 to have a valid test. Again, only the *Score* test showed good calibration across the different numbers of EPV and sample size, highlighting this should be the preferred statistics for testing multiple interactions jointly in modern genetic datasets including hundreds of thousands of individuals.

### Robustness comparison in joint analysis approaches

Based on the results above, we examined the six G-E interaction test strategies (*i.e.*, omnibus, multiE-uGRS, multiE-wGRS, multiSNP-ERS, uGRS-ERS, and wGRS-ERS) under the null using *Score* test for linear and logistic regression. Table 1 shows type I error rates calculated as a genomic inflation factor for normally distributed and highly correlated exposures while varying three parameters: the RAF, the number of SNP analyzed jointly, and the presence or absence of G-E dependence. In linear models, all six approaches showed consistent and strong robustness regardless of RAF, the number of SNPs tested, and dependence between SNPs and exposures. Similarly, logistic models for GRS-related approaches (multiE-uGRS, multiE-wGRS, uGRS-ERS, and wGRS-ERS) showed consistent robustness. Conversely, multi-SNP approaches (omnibus and multi SNP-ERS) showed moderate to strong inflated statistics with increasing number of variants and decreasing RAF (Figure S4 and Table 1), highlighting the limitation of the *Score* test in logistic regression when the number of parameters becomes too large in the baseline model (*i.e.*, the model without interaction). Such inflation was also found in type I error rates for omnibus tests with 300 common SNPs (Table S1). When performing the same simulations but using moderately correlated (Table S2) or non-normally distributed exposures with moderate correlation (Table S3), we observed similar findings of consistent robustness in linear models and GRS-based approaches of logistic models, and inflation in multi-SNP approaches of logistic models.

### Power comparison for joint analysis approaches

We aimed first at understanding the potential benefit of testing jointly multiple interaction parameters, as opposed to testing them separately and correcting for multiple testing using a Bonferroni adjustment. To address this question, we considered a set of K predictor (here interaction effects) and derived the theoretical power of the two aforementioned approaches, *i.e.*, the joint test of all K predictors *vs.* the test of each single predictor followed by correction of the *P*-values for the K tests performed, while assuming a subset of the predictor are associated with the outcome. Unsurprisingly, as showed in Figure 3, A–D we first found that, when , there are at best small gains from using multiple interaction tests. The single predictor approach tends to have higher power as the effect of the variants increased and *K* increased (*e.g.*, noncentrality parameter (*ncp*) = 9, , Figure 3C), while the multivariate approach performs slightly better for small effect. Conversely when there are multiple associated predictors (*i.e.*, as increases relative to K), the multivariate approach performs in general better than the univariate approach (Figure 3, E–H). For example, to achieve 80% power at a nominal level of 5% while analyzing K=100 predictors, and assuming very small effect (*ncp* = 1), the univariate test requires up to 20% of the predictors to be associated with the outcome, while the multivariate test would achieve the same power if 10% of the predictors are associated (Figure 3E).

To compare the relative power of the six G-E interaction strategies, we next performed a series of simulations using linear and logistic regression models (Figure 4). We used, as in Table 1, normally distributed and highly correlated exposures. As expected, the power of all the approaches increased with increasing number SNPs tested and increasing proportion of true effects among the interactions. However, we found that the relative gain in power relied on the correlation between interactions and marginal effects and the correlation across exposures. In linear models, score-based approaches (*i.e.*, using GRS and ERS) were the most powerful when simulating G-E interaction effects correlated with marginal genetic and environmental effects. Conversely, when simulating G-E interaction effects uncorrelated to marginal effects, the omnibus test was the most powerful approach. Logistic models showed qualitatively similar power results. Using normally distributed exposures with moderate correlation appeared to have limited impact on the power results (Figures S5 and S6). However, we noticed that power advantages of the GRS-based approaches over multi-SNP approaches in the presence of correlation between interactions and marginal effects tended to decrease with increasing correlation between exposures (*i.e.*, coefficients: 0.02–0.20) (Figure S7). When there was no correlation between interactions and marginal effects, we observed that the power of multi-SNP approaches tended to increase with increasing exposure correlation, particularly univariate test and multiSNP-ERS test in linear models.

### Application to data from large population-based cohorts

Using the *Score* test statistic in logistic regression, we conducted six interaction analyses per each trait, testing jointly for multiple interactions between SNPs known to be associated with four traits (number of SNPs included = 65, 76, 27, and 48 for T2D, obesity, hypertension, and CHD, respectively) and four exposures (HEI as a measure of diet quality, total calorie intake, MET-hour per week as a measure of physical activity, and ever smoking) (Table S4). Details of the SNPs are described in Tables S7–S10. Prior to the interaction tests, we assessed marginal effects of GRS and exposures with the traits. T2D and obesity showed associations with all the four exposures, whereas hypertension was associated with HEI, MET-hour, and smoking ever, and CHD was associated with HEI and ever smoking (Table S5). For each trait, we constructed ERS including only exposures that were marginally associated with the trait. We observed only limited correlation between the four exposures in our data, such as the maximum correlation coefficient was 0.17 between HEI and MET-hour per week (Table S6). Results of joint interaction tests are presented in Table 2. We found that 7 (29%) out of the 24 interaction tests performed showed nominal significance. The most significant interaction was found in omnibus test for obesity with all four exposures (*P* = 0.003). Their interactions were also detected with GRS-ERS approaches (*P* = 0.037 and 0.028 for uGRS-ERS and wGRS-ERS, respectively). Nominally significant interactions were also observed between hypertension and three exposures (HEI, MET-hour, and smoking) through multiSNP-ERS (*P* = 0.006), uGRS-ERS (*P* = 0.046), and wGRS-ERS (*P* = 0.021). Lastly, omnibus test detected a nominally significant interaction effect of T2D and the four exposures (*P* = 0.015). Note that we kept the omnibus in this analysis despite type I error rate is not fully controlled. However, based on simulations presented in Table S1, we expect any bias to be minimal (*e.g.*, type I error rate at 5% equals 0.059 for the same number predictors in simulated data).

## Discussion

Recent advances in big data era enable us to jointly assess multiple G-E interaction effects; however, the statistical challenges and benefits of such approaches remain unknown. To address this gap, we defined a set of parsimonious approaches for multiple interactions, and explored their statistical properties in simulations and real data application. Our results indicate that joint test approaches, in particular aggregating multiple marginal effects such as GRS or ERS, offer both robustness and power gain, potentially allowing for the identification of G-E interactions missed by a standard univariate test. However, we also found several important issues that should be considered for the joint analysis approaches.

First, contrary to univariate test, the choice of the statistics was crucial in the joint test for multiple interactions (*i.e.*, omnibus) especially for binary traits. In logistic models, *Wald* test and *LRT* statistics were deflated and inflated, respectively, with increasing number of interaction parameters, whereas *Score* statistic was consistently more robust in the same scenarios. Consistent with findings from previous studies, the type I error rate was larger as the EPV decreases (Peduzzi *et al.* 1996; Vittinghoff and McCulloch 2007). Furthermore, the two previous studies using *Wald* test reported that, for the low EPV (*e.g.*, EPV < 10), sample variance estimates were not robust (Peduzzi *et al.* 1996) and bias of regression coefficients increased in both positive and negative directions (Vittinghoff and McCulloch 2007). Because the biased coefficients often lead to extreme values of MLE, we observed highly inflated *LRT* statistics when EPV is < 10 (Supplemental Note). Although the previous studies did not evaluate *Score* test, we found that the *Score* test statistics were substantially robust even with EPV <10, suggesting it should be preferred when testing multiple parameters jointly in logistic regression. Permutations might be an alternative to control type I error rates of the three test statistics. Permutation would be computationally demanding and might break down part of the structure in the data, making interpretation more difficult. However, it might be applicable in some case, and preliminary analyses we conducted showed encouraging results (Tables S11 and S12).

In contrast of the logistic regression, all approaches showed strong robustness in linear models. In general, linear models do not face issues similar to small EPV. For example, prior studies assessing the number of subjects per variable (SPV, similar in principle to the EPV, but applied to linear regression) showed that two SPV were enough to have adequate estimates on regression coefficients, SE, and confidence intervals in linear regression (Austin and Steyerberg 2015). In our simulations, this was true for *Wald* and *Score* test statistics with the lowest SPV of 6.7 (= 20,000/3000) but not for *LRT*. However, slight inflation in *LRT* statistics can be easily corrected by using estimates from ordinary least squares instead of using maximum likelihood estimates (Supplemental Note).

Another important limitation of multi-SNP approaches was minor/risk allele frequency of SNPs when testing for binary traits. Our simulations showed severe type I error rates increase with increasing the number of variants in logistic models and the bias was worse with rare variants (RAF < 1%) included in the analysis. Although we did not assess type I error rates with non-normally distributed continuous outcomes, previous works highlighted that special caution is required for rare variants analysis, especially when analyzing non-normally distributed traits (*e.g.*, traits from gamma or log normal distributions) (Schwantes-An *et al.* 2016). We only considered independent variants in all our analyses. Using correlated variants for GRS-based approaches would require estimating weights from *e.g.*, multiple regressions or penalized models to avoid redundancy of genetic information. For SNP-based approaches, correlation between SNPs might increase instability of all approaches and is therefore not recommended (Table S13). While the *Score* test performed again better, analyzing large number of correlated variants might induce substantial inflation of the type I error rate.

Lastly, the presence or absence of correlation between interaction and marginal effects played a substantial role to gain power in jointly testing multiple interactions. As seen in Figure 4, the relative power gain of the six approaches was highly sensitive to the correlation. When the interaction effects are correlated to marginal effects, GRS-based approaches outperformed the others in most of scenarios. However, when the correlation is absent, SNP-based approaches (*i.e.*, omnibus or multiSNP-ERS) had better performance than the GRS-based approaches. Similar trends of power gain have been discussed in previous studies on rare-variant association tests. For example, burden tests, which are very similar to our aggregating methods (*e.g.*, GRS or ERS), examine genetic associations by aggregating effects of a set of rare variants into a genetic score (Wu *et al.* 2011). Because the test requires a strong assumption of the same direction and magnitude of effects, the test is obviously less powerful when the assumption is violated (Lee *et al.* 2014). As an alternative, researchers have proposed variance-component tests that are powerful with different directions of marginal effects. Such an approach has been recently proposed for GxE (Moore *et al.* 2019) and might be compared against our fixed effect approach in the future.

We found that joint analysis approaches, especially joint tests with aggregated effects (*e.g.*, GRS or ERS), could detect more multiple G-E interaction effects with strong robustness, and, further, much power with correlation between marginal effects and G-E interaction effects. Because of computational convenience, these score-based approaches would be easily applicable to other interaction tests, such as gene-by-gene (G-G) interactions or multivariate interaction tests for multiple traits in future research. Although our simulations used aggregating approaches for both marginal effects and interaction effects as Equation 7, the benefits of the score-based approaches might be achieved even if one tests aggregating effects of interactions (*e.g.*, GRS-ERS) in models with multiple marginal effects (*e.g.*, , ). Because our joint approaches test the interaction effect only (*e.g.*, ), increase in the number of total parameters would not have influence on detecting interaction effects as long as using the *Score* test statistic.

Our joint analysis approaches also have some limitations. First, because joint test approaches examine whether any of multiple interactions have a signal or not, it does not provide evidence for any specific interaction effects of SNPs and the exposures that are the main drivers of the interaction signals. Instead, joint test approaches offer an opportunity to gain insights into global interaction patterns. For example, significant GRS-based interaction would indicates an overall decrease or increase of the genetic effect with exposure, while the Omnibus and SNP-ERS tests would indicate more diffuse G × E interactions with limited structure. Second, in practice, it might not be easy to generate ERS where environmental factors are coded as being categorical or have different units of measurements. Also, it is still an open question what, and how many, environmental factors should be tested jointly for G-E interactions. Our recommendation is to include risk factors that have strong biological evidence on shared mechanisms, because, otherwise, interpretation can be challenging. Third, when applying the standard univariate approach, we used a nominal significance threshold of 5% after correction for multiple testing without accounting for correlations between SNPs and exposures or between exposures. This is the most stringent approach, and more advanced strategies might be considered in future (Sun and Lin 2017).

In summary, our study shows that approaches allowing for the joint analysis of multiple G-E interaction effects outperform standard pairwise interaction test in many scenarios. Particularly, GRS-based approaches in conjunction with a *Score* test showed both strong robustness and some of the largest gain in power, although alternatives approaches might be considered depending on the investigator hypotheses about correlation between GxE effects and marginal effects. Overall, this study provides the community guidelines for testing multiple interaction parameters in modern human cohorts including extensive genetic and environmental data.

## Acknowledgments

This work was supported by National Institutes of Health National Human Genome Research Institute grant R21HG007687 to H.A. and National Institutes of Health grants P30 DK46200 and DK112940.

## Footnotes

Supplemental material available at Figshare: https://doi.org/10.25386/genetics.6849047.

*Communicating editor: E. Eskin*

- Received July 21, 2018.
- Accepted December 10, 2018.

- Copyright © 2019 by the Genetics Society of America