## Abstract

Case–control designs are commonly employed in genetic association studies. In addition to the case–control status, data on secondary traits are often collected. Directly regressing secondary traits on genetic variants from a case–control sample often leads to biased estimation. Several statistical methods have been proposed to address this issue. The inverse probability weighting (IPW) approach and the semiparametric maximum-likelihood (SPML) approach are the most commonly used. A new weighted estimating equation (WEE) approach is proposed to provide unbiased estimation of genetic associations with secondary traits, by combining observed and counterfactual outcomes. Compared to the existing approaches, WEE is more robust against biased sampling and disease model misspecification. We conducted simulations to evaluate the performance of the WEE under various models and sampling schemes. The WEE demonstrated robustness in all scenarios investigated, had appropriate type I error, and was as powerful or more powerful than the IPW and SPML approaches. We applied the WEE to an asthma case–control study to estimate the associations between the thymic stromal lymphopoietin gene and two secondary traits: overweight status and serum IgE level. The WEE identified two SNPs associated with overweight in logistic regression, three SNPs associated with serum IgE levels in linear regression, and an additional four SNPs that were missed in linear regression to be associated with the 75th quantile of IgE in quantile regression. The WEE approach provides a general and robust secondary analysis framework, which complements the existing approaches and should serve as a valuable tool for identifying new associations with secondary traits.

GENOME-WIDE association studies (GWAS) have been widely used to detect the association between common genetic variants and complex traits (Visscher *et al.* 2012) and are commonly conducted using case–control designs. In addition to the primary binary trait used to define the case–control status, data on secondary traits are often collected. For example, in a chronic obstructive pulmonary disease study (Regan *et al.* 2010), researchers collected information on additional respiratory diseases, such as asthma, emphysema, and bronchitis. It is of interest to take full advantage of the existing data and analyze genetic associations with these additional traits. Such secondary analyses have great potential to discover additional variants associated with the secondary traits.

Several classical methods for the direct analysis of secondary traits are available, including direct regression in (1) a combined sample of cases and controls, (2) cases only, (3) controls only, and (4) combined cases and controls, adjusting for the primary disease. These methods, although attractive due to the simplicity of model building, can be biased when the secondary phenotype is associated with the primary disease. This is because the case–control sample is no longer a representative sample of the general population. Numerous studies have demonstrated that none of these methods result in an unbiased estimation of the genetic association with the secondary trait such as in Jiang *et al.* (2006), Richardson *et al.* (2007), and Monsees *et al.* (2009).

More sophisticated methods aiming to avoid the aforementioned bias can be broadly classified into two groups. The first group is based on the inverse probability weighting (IPW) idea that is widely used in survey methodology. Jiang *et al.* (2006), Richardson *et al.* (2007), and Monsees *et al.* (2009) introduced the IPW approach to genetic studies when case and control sampling probabilities are available. The main idea of IPW is to reweight individual observations by the inverse of their selection probabilities and conduct weighted regressions. This approach requires the information of the case–control sampling scheme to construct the weights and is often biased or inefficient when there are unobserved confounders and overweighted subsamples.

The second group explicitly accounts for the case–control sampling scheme by maximizing the retrospective likelihood function conditional on the primary disease or joint modeling of the primary and secondary traits (Jiang *et al.* 2006; Lin and Zeng 2009; He *et al.* 2011; Wang and Shete 2011, 2012; Li and Gail 2012; Ghosh *et al.* 2013; Wei *et al.* 2013). The semiparametric maximum likelihood (SPML) proposed by Lin and Zeng (2009) is the most widely recognized approach in this group as it largely improves the efficiency over IPW. This approach makes the linear logit assumption for the probability of primary disease with respect to genotypic score and secondary trait. It is very sensitive to its assumption and can be largely biased when it is violated. Li and Gail (2012) proposed an adaptive weighted approach aiming at improving the robustness of SPML by weighting the estimates from SPML and its extension (including the interactive effects of genetic markers and secondary trait in predicting case–control status in its assumption). However, the efficiency was lost and simulations showed that its biases can be as large as 46% (Wang and Shete 2012). He *et al.* (2011), Wang and Shete (2011, 2012), Ghosh *et al.* (2013), and Wei *et al.* (2013) approached this issue from different perspectives, but they were all based on a similar model specification of primary disease, and their estimation efficiency over SPML was marginal. Therefore, for the comparison presented in this article, we focus our work mainly on the IPW and SPML approaches.

Despite the tremendous efforts on secondary analysis, there are a number of remaining issues. First, the performance of existing methods heavily depends on either the sampling scheme or the correct specification of the primary disease model. Since many case–control studies of complex diseases do not have clear selection probabilities and distribution of primary disease, the estimations from these existing methods can be biased. Second, for the novel approaches that are likelihood based, additional challenges arise for regressions with no parametric-likelihood functions. For example, the FTO genotype is associated not only with the mean but also with the variance of body mass index (BMI) (Frayling *et al.* 2007). Instead of considering only the mean of phenotypes in GWAS, it is valuable to systematically examine how the markers influence the location, scale, and shape of the entire trait distribution through quantile regression (Koenker and Bassett 1978). Such quantile regressions can identify additional markers missed by mean regressions that are associated with certain quantiles of the trait. Existing secondary analysis approaches are not developed to facilitate these types of regressions, and the expansion from likelihood-based secondary approaches is especially difficult.

In this article, we propose a set of weighted estimating equation (WEE) approaches, providing unbiased estimation for the secondary analysis in genetic case–control studies. We introduce a new concept of counterfactual estimating functions under alternative disease status and combine the observed and counterfactual estimating functions into a set of weighted estimating equations. The counterfactual outcomes idea has been widely used in causal inference to denote the potential outcomes of the subjects if they were in the alternative treatment or exposure group, and conclusions are made if the actual and counterfactual outcomes from the same subjects differ. Here, we borrow this idea to define the potential secondary trait of the subjects if they were in the alternative case–control status and estimate the marker–secondary trait association, using the case–control sample. In comparison with the existing approaches, it provides a very generalizable and robust regression framework for analyzing secondary phenotypes with limited information on the sampling scheme and the underlying primary disease models. It is flexible to handle various types of secondary phenotypes regardless of their relationship with the primary disease. After first considering the WEE for quantile regression in Wei *et al.* (2015), we expand this idea to a more general framework that accommodates a wide range of regressions. This work is outlined in this article, which is structured as follows. First, we introduce the construction of the WEE approach. Second, we present the simulations that investigate the finite sample performance under different scenarios. Finally, we consider a real data example of an asthma case–control study with one binary secondary trait (overweight status) and one continuous secondary trait [serum immunoglobulin E (IgE) levels] to illustrate our approach.

## Proposed Method

### Notations and settings

Let *X* denote the coded genotype for a variant of interest, *Y* denote a secondary phenotype, **Z** denote the vector of covariates to adjust for, and *D* = {0, 1} denote the primary disease status. The aim of the secondary trait analysis is to estimate the genetic effect of *X* onto *Y* in the general population. The relation between *Y* and *X* and *Z* can be modeled as (1)where is a link function, and is the coefficient of primary interest. Depending on the choice of *g*, model (1) covers a wide range of regressions. For example, if for continuous *Y*, then it is linear regression; if for binary *Y*, then it is logistic regression; if is the quantile function at the *τ*th quantile for *Y*, then it is quantile regression.

In a case–control design, the data at a single variant consist of cases and controls We denote by the total sample size. When the secondary phenotype *Y* is associated with the primary disease *D*, directly regressing *Y* against *X* using case–control samples leads to biased estimation of Therefore, we propose a weighted estimating equation-based approach for secondary trait analyses in case–control studies. It utilizes the entire case–control sample and yields consistent estimation of in the general population.

### Weighted estimating equations for the secondary phenotypes in genetic case–control studies

Constructing estimating equations is a common estimation method. The key is to find an estimating function such that for randomly selected subjects from the general population, the following equations hold at the true (2)In generalized linear models (GLM), the estimating function can be constructed as the first derivative of the log-likelihood function, which is known as Fisher’s score function. In other regressions that minimize certain loss functions, is the first derivative of the corresponding loss function. For example, the estimating function with respect to is in linear regression, in logistic regression, and at any quantile in quantile regression. As we do not have a representative sample of the population, solving Equation 2 directly in a case–control sample is biased. However, conditioning on the disease status *D*, we can expand the above equation as follows: (3)This expansion provides the basis of constructing the proposed weighted estimating equations. Suppose that for each in the sample, we are able to observe its counterfactual secondary outcomes under the alternative disease status. Specifically, would be the phenotype of the *i*th case if it is in fact a control, and would be the phenotype of the *i*th control if it is actually a case. If we are able to observe both and ’s, we can then construct the unbiased estimating equations following the expanded estimating Equation 3. The sample estimation equations can be written as (4)where the weight is the probability of being the observed disease status given and is the probability of being the counterfactual disease status. The optimization of estimating Equation 4 can be viewed as a weighted regression, where weights are for the actual outcomes and for the counterfactual outcomes For this reason, we name our proposed approach the WEE.

One can show that for each summand of Equation 4, its conditional expectation given is zero at the true and thus constitutes an unbiased estimating equation. Following classical theories for *M* and *Z* estimations (theorems 5.7 and 5.9 in Van der Vaart 2000), solving Equation 4, leads to the consistent estimation of **β** as is a consistent estimation function as long as is a random sample conditioning on In IPW, the coefficients can be unbiasedly estimated only when given is a random sample. Therefore, the proposed approach is less sensitive to the sampling scheme than the IPW approach. Although the estimating equations involve we are not assuming the disease probability relates only to In reality, the disease risk can relate to *Y* or other auxiliary variables **W** as well, and in Equation 4 can be viewed as the marginal probability given *i.e.*, where is the joint distribution of

Of course in practice we are unable to observe the counterfactual secondary outcomes. To get around this difficulty, we propose two approaches. The first one is to estimate the expectation of counterfactual estimating functions. When is linear in we recommend to replace by its conditional expectation over In general for nonlinear estimation functions, we propose to generate pseudo observations from two sets of stratified models. In the next two subsections, we elaborate on the two approaches under the assumption that the probability is known, and an algorithm to estimate is followed.

### Estimation approach A: estimating the expected counterfactual estimating function

One can easily show that the following estimating equations, which substitute by its conditional expectation, are unbiased as well: (5)When the estimating function is linear in this approach is particularly appealing since one can simply replace by its conditional mean. In this case, the estimation Equation 5 is equivalent to (6)The conditional means can be easily estimated from stratified linear regression. Specifically, one can regress against and separately among cases and controls and estimate by the predicted value under the alternative disease status model. This way, the estimate can be obtained using one-step optimization, by solving the following equation, (7)where is the predicted outcome given and under the alternative disease status.

Estimating Equation 5 remains unbiased when is nonlinear. However, estimating in such a case could be computationally undesirable, especially when the model is high-dimensional. We hence propose an alternative approach that is flexible and computationally efficient.

### Estimation approach B: generating pseudo counterfactual observations

Under model (1), the linear association between *Y* and holds among both cases and controls. The regression coefficients, however, can vary between them. Therefore, we propose to fit model (1) separately for cases and controls and use the resulting stratified models to generate pseudo counterfactual observations. Here, we consider two scenarios to illustrate this idea. One is the GLM and the other is quantile function.

### Simulating counterfactual outcomes in the GLM

Supposeis the stratified estimated models for cases and controls, where for cases and for controls. For each observation *i*, we generate its counterfactual outcome from the estimated model of the alternative disease status Specifically, if *g* is a logit link for a binary secondary trait, then is a random draw from a Bernoulli distribution with success probability If *g* is a log link for a counted secondary trait, then is a random draw from a Poisson distribution with

### Simulating counterfactual outcomes in quantile regression

Since quantile regression does not have full parametric likelihood functions, one needs to consider the main model (1) across the entire distribution of *Y* to simulate counterfactual phenotypes. This joint modeling approach for quantile regression has been described in Wei *et al.* (2006), and the quantile-based counterfactual outcome generations have been described in detail in Wei *et al.* (2015). Here, we summarize how we have been able to generate counterfactual outcomes based on a grid of evenly spaced quantiles. We let denote a set of *k* evenly spaced quantile levels and denote the quantile coefficient functions given disease status *D* = *d* such that (8)for any Then is simulated following the model-estimated conditional distribution of given and as follows:

We estimate the quantile coefficients for in Equation 8 within cases and controls, respectively.

To approximate the coefficient process we define to be a piecewise linear function on [0, 1] that concatenates the estimates for and is subject to the constraint of

We randomly draw the quantile level from Uniform distribution and simulate the pseudo outcome by

### Stabilizing the coefficients

With we construct the sampling estimating equations as (9)As simulating the pseudo counterfactual outcome might introduce variation for small samples, we propose to repeat the procedure above *T* times to obtain stable estimates. The final estimate is the average of the *T* estimates; *i.e.*, (10)where is the estimated coefficient from the *t*th set of pseudo outcomes. In our numerical studies, we used *T* from 1 to 100 and found that the variance of the estimates stabilizes fairly quickly after

### Estimation of

For the two estimation algorithms described above, we assumed that the conditional disease probability was known. In practice, it needs to be estimated. To estimate we could use the models from primary analysis or simply assume a logistic model (11)We can achieve a consistent estimation of the slope parameters and by conducting logistic regression in the case–control sample. The intercept needs to be calibrated to match the overall disease prevalence in the general population. Let denote the known disease prevalence; then we can estimate by solving the following equation, (12)where is the joint distribution of *X* and **Z**, and and are the estimated and from logistic regression. The joint distribution can be estimated using population databases. If difficult to obtain, we propose a sample version to approximate as follows: (13)When the disease prevalence or the disease model is misspecified, the resulting estimated coefficients could be slightly biased. The estimation of the coefficients when the is misspecified is considered in simulations.

### Bootstrap procedure for the confidence intervals and hypothesis tests

In previous sections, we have outlined two estimation algorithms to estimate the parameters in model (1). Although the estimates can be viewed as some form of weighted regressions, the direct output of the Wald test statistics does not take into account the uncertainty from the estimated and simulated Therefore, we propose to use a bootstrap method to obtain the variance–covariance matrices of our proposed estimates. With the bootstrap standard errors, we are able to construct bootstrap confidence intervals and apply Wald test statistics to test the null hypothesis *i.e.*, whether the genetic variant(s) is (are) associated with the secondary phenotype *Y* in the general population. The bootstrap procedure is as follows:

Bootstrap cases and controls separately to assemble a bootstrap case–control sample. Namely we randomly select cases from the case sample and controls from the control sample with replacement.

For each bootstrap sample, we reapply the proposed algorithm to obtain bootstrap estimates. For approach A, it includes re-estimating and For approach B, it includes regenerating pseudo outcomes and re-estimating

We repeat steps 1 and 2

*B*times and then calculate the bootstrap standard error. We use the bootstrap standard error to construct confidence intervals and bootstrap chi-square test statistics for inference.

We evaluated the type I error and power of this bootstrap procedure in simulations and the bootstrap-based inferences are applied to the real data examples.

## Simulation Results

### Finite sample performance

We use simulations to evaluate the performance of the proposed WEE approach and compare its findings with those of several commonly used methods. We consider the scenario of a preselected SNP with binary and continuous secondary phenotypes in the framework of a case–control study. As quantile regression has been extensively explored in Wei *et al.* (2015), here we focus on its performance in parametric regressions (logistic and linear regressions).

#### Model settings:

As before, we denote that is the primary disease status, is the genotype information at the SNP (under an additive model) with minor allele frequency (MAF) = 0.3, and *Z* is a covariate of interest following a standard normal distribution. The correlation coefficient between *X* and *Z* is 0.3. We consider both binary and continuous secondary phenotypes *Y*. For binary *Y*, we assume a logistic model as follows:For continuous *Y*, we assume the linear modelwhere the error term *ε* follows Under the binary secondary phenotype model, the prevalence of *Y* is ∼30%. In both models, is the true coefficient of *X* in predicting *Y* and is the coefficient of primary interest.

To model the disease probability we consider three possible settings. Setting 1 (logistic setting) assumes the probability of disease follows a logistic model with main effects of *X* and *Y*. Similar settings were considered in Lin and Zeng (2009) and Wang and Shete (2011). Setting 2 (interaction setting) extends the logistic setting by including the interaction. Similar settings were considered in Li and Gail (2012) and Wang and Shete (2012). Finally, setting 3 (piecewise setting) assumes that follows a piecewise linear model instead of logistic regression. The detailed mathematical forms of the disease models are given below.

Setting 1 (logistic setting):

Setting 2 (interaction setting):

Setting 3 (piecewise setting):

where and are the 0.25th and 0.75th quantiles of the respectively.

In all the disease models above, we assume that the prevalence of the primary disease () is The intercepts in these models are selected to match the overall disease prevalence in the population. For each of the model settings above, we generate a large number of observations (), which we treat as a general population.

#### Sampling schemes:

We consider three sampling schemes to select the case–control samples. First, we consider a simple sampling scheme in which cases and controls are randomly drawn into the sample, creating the selection probability depends only on the disease status. This is the simplest available sampling design, but the collection of data is often difficult, especially when the sample size is large. Second, we consider the convenient sampling schemes to collect data that researchers often encounter in case–control studies. For example, in a stroke GWAS (Cornelis *et al.* 2010), cases were selected from imaging-confirmed cases from seven clinical centers, while controls were selected using existing healthy subjects in an acute myocardial infarction study with similar recruitment criteria. Data under this sampling scheme are easy to collect, but the confounding factors associated with selection probabilities may not be fully captured. We then evaluate the performance of the WEE and existing methods in the presence of unmeasured confounders. Finally, we consider the stratified sampling scheme, which is often used in large-scale case–control studies to under- or oversample subjects with certain characteristics. This sampling scheme can largely improve the estimation efficiency for the small subgroups. For example, the National Maternal and Infant Health Survey oversampled the infants born with low birthweight (≤2500 g) and very low birthweight (≤1500 g) to study the long-term and short-term health outcomes of these infants. This is a common scenario when IPW fails because certain subjects carrying large weights may dominate the estimates. Specifally, we select 2000 cases and 2000 controls into the sample in one of the following ways:

Setting 1 (random sample): Subjects are randomly selected into the sample.

Setting 2 (convenience sample): Assume

*V*is an uncaptured variable associated with such thatand only the subjects with are selected into the sample.

Setting 3 (stratified sample): and Subjects with are nine times more likely to be selected into the sample than subjects with

#### Comparison methods:

We estimated the coefficients under the random sample of different models above, using the following methods: (1) regression using cases only, (2) regression using controls only, (3) regression using a combined case–control sample, (4) regression using a disease status stratified case–control sample, (5) IPW, (6) SPML, and (7) the proposed WEE approach. In the IPW approach, we model cases and controls separately and weight them inversely to their probability of selection for cases and controls The estimating equation of IPW is as follows: For the SPML approach, we use the external SPREG software provided by the authors (http://dlin.web.unc.edu/software/spreg-2/). In the WEE approach, we use estimating approach A of taking the conditional expectation of the counterfactual score function in the linear regression, which solves through a simple one-step optimization; we use estimating approach B of generating pseudo observations in logistic regression. We vary the *T* values from 1 to 100 (*T* is the number of pseudosamples generated) and compare their estimates. We assume we have no prior knowledge of and the weight is estimated through the sample version equation (Equation 13). We then further compare the WEE with the IPW and SPML approaches in both the convenient and stratified samples. As outlined in the Introduction, IPW is the most simple and robust method and SPML is the most efficient and popular method in the current available secondary analysis literature.

#### Results and discussions:

Table 1 summarizes the relative bias, standard error, and mean square error of the estimated coefficients based on 500 Monte Carlo replicates from random samples using all methods. According to Table 1, the classical methods, including regressions applied to the cases only, controls only, combined case–control sample directly, and combined case–control sample adjusting for primary disease status, are all biased. Hence, without appropriate adjustment, classical methods provide biased estimation for the association in genetic case–control data.

The proposed WEE approach performs well in correcting such bias in all the settings we considered. In logistic regression, we generated pseudo counterfactual observations. The estimated coefficients are unbiased even with The standard errors do decrease slightly as *T* increases, but they quickly stabilize after Therefore, we conclude that a relatively small number of imputations is enough to reach the optimal efficiency of this approach, and therefore it is computationally efficient. In addition, we assumed the working model as in estimation, which is not affected by the associations. It is different from the one used to generate the data; the generating model is based on three settings, including logistic, interaction, and piecewise settings. Even under the misspecified the proposed estimating equation-based approach performs fairly well, indicating that it is quite robust to the model misspecification. Additional scenarios to test the robustness boundary of are described in later sections.

The IPW also produces fairly accurate estimates in all models and demonstrates comparable efficiency in comparison with the WEE in Table 1. The calculation of the IPW method, however, requires information on the sampling scheme. Under random samples, where the selection of cases and controls depends solely on the disease status, it is similar to using the disease prevalence to calculate in the WEE. Therefore, it is not surprising to observe comparable performance in this scenario. Table 2 further compares IPW with the WEE under complex sampling schemes and shows that IPW is less robust and efficient than the WEE in these scenarios. Under convenient sampling schemes with unadjusted confounding factors, every method contains some biases. The biases are controlled within 7% in the WEE, but they can reach up to 25% in IPW. This is because the WEE is less affected by the confounding factors as long as *Y* given is a representative sample of the population. Under the stratified sampling scheme, the IPW estimates suffer from inflated variance, because subjects with large sampling weights dominate the estimates. However, subjects with similar values contribute similar weights in the WEE, which therefore is robust to the stratification. We further compare their power under this scenario in the next section.

The SPML approach provides unbiased and relatively efficient estimations when the linear logistic model assumption is satisfied but introduces biases when this assumption is violated. Specifically, the SPML estimate is the most efficient of all methods we considered for random samples under the logistic setting. Under interaction and piecewise settings, where the linear logistic model assumption is violated, the SPML estimates are very sensitive to the model misspecification and contain considerable bias. As SPML is more sensitive to the underlying disease model than the WEE is, its performance under convenient samples may be worse than that of the WEE even for the linear logistic disease model. Its performance under stratified samples is comparable to that under random samples, as, similar to the WEE, SPML does not use sampling weights that might dominate the estimates.

### Type I error estimates and power comparisons

In this section, we further investigate the type I error and power of the WEE approach with the bootstrap procedure and compare it with IPW and SPML for the primary hypothesis For type I error comparisons, we consider the association between a binary *Y* and a single preselected SNP with no covariates under random samples. The MAF of the SNP ranges from 10% to 50%, and the primary disease models follow logistic, interaction, or piecewise settings. We simulate 100,000 Monte Carlo samples with 2000 cases and 2000 controls. Table 3 summarizes the type I errors of IPW, SPML, and WEE methods at the *α* levels of 0.05, 0.01, and 0.001. According to Table 3, both the WEE method and IPW produce the correct type I errors in all the settings, while the SPML method results in largely inflated type I error in the interaction and the piecewise settings due to its deviation from the linear logistic model assumption.

For the scenarios in which the novel methods hold valid type I errors, we compare their power under two values for the narrow heritability and 0.02, and the association coefficient ranges between 0.14 and 0.33 as a result. The power at significance level for 1000 Monte Carlo samples with 2000 cases and 2000 controls using the random and stratified sampling schemes is presented in Table 4. Under random samples, the WEE and IPW demonstrate similar power regardless of the underlying disease models. The SPML is also more powerful for detecting the associations when the underlying disease model is linear logistic, but its type I error blows up when the underlying models do not satisfy its linear logistic assumption. Under the stratified sampling scheme, where subjects with large sampling weights may dominate the estimates in IPW, the WEE proves far more powerful than IPW.

### The performance under biased estimated

The proposed estimates made two assumptions in the estimation of that the primary disease prevalence is known as and the estimated follows a linear logistic model as follows: (14)The proposed estimates were obtained in previous simulations under model misspecification (the true generating models were logistic, interaction, and piecewise settings), and the performances indicate that the WEE is fairly robust with linear logistic assumption. In this section, we further investigate the robustness of our proposed method when the is incorrectly estimated. The disease prevalence is often estimated from cohort studies or literature and can also be biased. Instead of using the true disease prevalence-generated we assume that we obtain the misspecified value ranging from 5% to 20% for estimation under the logistic setting. Table 5 shows the relative bias, standard error, and mean square error of the estimated coefficients from 500 Monte Carlo replicates with various values used for estimation. We find that while the estimation bias does increase slowly when the deviation from the true prevalence increases, even doubling the disease prevalence incurs biases of <7%. We conclude that the resulting estimates are fairly robust against the misspecified prevalence.

## Real Data Analysis

In this section, we apply the proposed WEE approach to an asthma case–control GWAS from the New York University Bellevue Asthma Registry (NYUBAR) (Liu *et al.* 2011). The study consisted of 387 asthmatics and 212 healthy controls, genotyped 10 tag SNPs at the thymic stromal lymphopoietin (TSLP) gene, and identified the association between the TSLP gene and asthma in the primary analysis. To do so, the study controlled for a number of demographic and clinical variables such as age, gender, race, smoking status, BMI, forced expiratory volume in 1 second (FEV_{1}), and IgE level. As an ongoing non-National Institutes of Health funded study, the asthma cohort data are not currently available to the public.

To illustrate the proposed approach, we consider two secondary phenotypes: one is a binary secondary phenotype of overweight status, and the other is a continuous secondary phenotype of serum IgE levels. Overweight is defined as body mass index >25. A meta-analysis by Flaherman and Rutherford (2006) combining 402 studies from 1966 to October 2004 found that the effect of high body weight during middle childhood showed a 50% increase in relative risk (relative risk = 1.5, 95% C.I. = 1.2–1.8) of having subsequent asthma. The NYUBAR data set is consistent with their findings with an odds ratio (OR) of having asthma for overweight observations of 1.66 (*P*-value = 0.006; 95% C.I. = 1.15–2.38) compared to the normal counterparts. As a result, the commonly used classical secondary analysis methods may provide biased estimation. We sought to apply more appropriate approaches for estimation.

IgE is a class of antibody that mediates the immune responses in the pathogenesis of allergic asthma (Burrows *et al.* 1989). An allergen-specific IgE level >0.35 kilo-international units (kIU)/liter is considered positive. Elevated IgE is associated with many allergic diseases, such as allergic rhinitis, peanut allergy, latex sensitivity, atopic dermatitis, and chronic urticaria (Morjaria and Polosa 2009). The secondary analysis of IgE leads to better understanding of the mechanism by which TSLP influences risk to asthma and other allergic diseases. As serum IgE level is approximately normally distributed among cases and controls after log transformation, we first apply linear regression for estimation. In addition, as elevated IgE level instead of mean IgE level plays an essential role in allergic diseases, we also consider the quantile regression approach to further investigate the genetic association with upper quantiles of the log serum IgE levels. Given that the log serum IgE level is strongly associated with asthma in the data set (OR = 1.4; *P*-value = 6.7*E*-9; 95% C.I. = 1.26–1.58), the commonly used methods may also provide biased estimates of the association between the TSLP gene and log serum IgE level, and we analyze them with novel approaches.

### Logistic regression for overweight status

We denote as the primary case–control status of asthma, as the minor allele count for each of the 10 TSLP SNPs, as the binary secondary trait of whether a person is overweight, and *Z* as a continuous variable of the propensity score (Guo and Fraser 2010) derived from a set of covariates including age, gender, smoking status, FEV_{1}, and the first principal component score from 213 ancestry informative markers (Pritchard *et al.* 2000). Then the logistic model we consider is as follows:Three approaches are used to estimate the coefficient IPW, SPML, and WEE. For this regression and for the mean and quantile regressions in later sections, we calculate the overall asthma prevalence as 10.1% based on six birth cohort studies, and this information is used to approximate selection probability in IPW and estimate in SPML and in the WEE. The population distribution of is unknown and the weight is estimated through the sample version equation (Equation 13). The standard errors and *P*-values were calculated using bootstrap; *i.e.*, we bootstrap cases and controls separately and reapply the entire estimating procedure to the bootstrap case–control sample.

The resulting estimated coefficients of the 10 tag SNPs in the TSLP gene and overweight are summarized in Table 6. Before adjusting for multiple comparisons, both IPW and the WEE are able to identify two SNPs (rs2289276 and rs11466741) at the a *α*-level 0.05, while SPML fails to detect any SNPs. The point estimates of the WEE and IPW are comparable, but the standard errors of the WEE are smaller, resulting in smaller *P*-values. Specifically, the *P*-values of SNP rs11466741 are 0.008 and 0.047 in the WEE and IPW, respectively.

SPML generates very different estimates from IPW and the WEE at many SNPs. As we mentioned in the Introduction, IPW is widely known as a robust approach and SPML is efficient but potentially biased with misspecified Therefore, we further tested the interactions in to understand the underlying models, and some of the interactive effects are significant (SNPs rs2289278 and rs10035870). Therefore, we believe that the SPML approach contains substantial biases due to the violation of the model assumptions, and the WEE as well as IPW provides relatively unbiased estimations.

In summary, this work demonstrated that the WEE approach combines the advantages of IPW and SPML estimators in that it is robust and fairly efficient in estimating the marker–secondary trait associations.

### Mean regression for serum IgE level

In this example, we let *Y* denote a continuous secondary trait of the log serum IgE level. As in the case before, *D* is the case–control status of asthma, *X* is the minor allele count for each of the 10 TSLP SNPs, and *Z* is the propensity score of the covariates. We then consider the IPW, SPML, and WEE approaches under the following linear regression:The resulting estimated coefficients are summarized in Table 7. The WEE identifies three SNPs (rs11466741, rs11466743, and rs10035870) at an *α*-level of 0.05 before adjusting for multiple comparison, IPW detects one significant SNP rs10035870, and SPML fails to detect any SNP. As in the logistic regression for the overweight status, the point estimates of the WEE and IPW are comparable, and the WEE is able to detect more SNPs because it is more efficient than IPW. SPML not only provides biased estimates due to the violation of its assumption on but also fails to improve the efficiency in comparison with the WEE. The WEE provides the most robust and efficient estimation of the associations between the TSLP gene and serum IgE level.

### Quantile regression for serum IgE level

Elevated serum IgE levels instead of the mean serum IgE levels contribute to the allergic effects. Therefore, we also consider quantile regression for the secondary analysis to identify the TSLP variants that are associated with the upper quantiles of the serum IgE level. In addition, we use quantile regression to deepen our knowledge from mean regression on how SNPs affect the location, scale, and distribution of serum IgE levels. The quantile model we consider iswhere *X* is the minor allele count for each of the 10 TSLP SNPs, *Z* is a continuous variable of propensity score developed from covariates, and *Y* is the log serum IgE level. To evaluate the effects of the TSLP gene variants on different levels of IgE, we estimate the model at quantile levels of 0.15, 0.25, 0.5, 0.75, and 0.85, respectively. Two approaches are used to estimate the coefficient IPW and the WEE, as the SPML approach is likelihood based and cannot be applied to nonparametric regressions.

The resulting estimated quantile coefficients are summarized in Table 8. The estimated quantile coefficients from the two approaches are comparable. However, the bootstrap standard errors of the IPW estimates are much bigger than the ones from the WEE. Consequently, the WEE estimates are more powerful in detecting the quantile associations.

For the three SNPs that are significant in mean regression in Table 7, their results from quantile regressions also indicate significant associations, and these associations remain significant even after a conservative Bonferroni correction for estimating different quantile levels and the number of SNPs. Moreover, quantile regression is able to detect the SNPs that are associated only with the upper quantile but not the mean of serum IgE level. Specifically, the WEE shows that SNPs rs2289276, rs2289278, rs2289277, and rs11466750 have significant associations with the 75th quantile of log serum IgE level; however, the mean regression did not indicate significant associations, illustrating the potential for the new approach to discover new associations.

Quantile analysis is able to present a comprehensive picture on the effects of the SNPs on the entire distribution of serum IgE level. To obtain a more complete picture, we estimate the quantile coefficients on a fine grid of quantile levels. In Figure 1, A and B, we plotted the estimated conditional distribution functions with different genotypes at SNPs rs10035870 and rs11466743, respectively. Specifically, the solid curve in Figure 1A is the estimated quantile function for the individuals with genotype AA at rs10035870, and the dashed line is for the individuals with genotype AG/GG. In Figure 1B, the solid curve is the estimated quantile function for genotype GG at rs11466743, and the dashed line is for genotype AG/AA. Both SNPs were found to have a significant impact on the distribution of serum IgE level. SNP rs10035870 has a strong positive effect on the entire distribution of serum IgE level, and thus subjects with the major allele at rs10035870 tend to have a higher serum IgE level in general. In contrast, SNP rs11466743 has a strong impact only on the median and upper quantiles, but has little effect on the lower quantiles of serum IgE level. As indicated in Figure 1B, the subjects with genotype AG/AA in rs11466743 are less likely to have a very high serum IgE level compared to those with genotype GG; however, they have an equal chance to have a low serum IgE level.

In summary, the quantile regression approach demonstrates two attractive features in marker–secondary trait analysis. First, it is able to identify additional SNPs that are associated only with certain quantiles of *Y*. Second, it describes a complete picture of the entire *Y* distribution where the associations exist. Given these findings it is clear that existing approaches are either inefficient for this type of regression (IPW) or unable to perform (SPML). The proposed WEE approach can supplement the secondary analysis by facilitating this type of regression.

## Discussion

In this article, we propose a general framework of weighted estimating equations that provides unbiased estimation of the genetic associations with secondary phenotypes in case–control designs. It enjoys a number of attractive properties in the following aspects.

First, its framework is flexible to accommodate various types of single secondary phenotypes and regressions. We illustrated the WEE, using logistic regression for binary outcomes and linear regression and quantile regression for continuous outcomes. Moreover, with appropriately selected estimating functions, the WEE can be applied to many other types of outcomes, such as ordinal, nominal, count, and time-to-event traits. An example of the application of the WEE for survival analysis is provided in the *Appendix*.

Second, the WEE can easily accommodate multiple SNPs and covariates at the same time. Although the approach presented for GWAS data focuses primarily on a single SNP and a few common covariates including population substructure, it can be applied to a much wider range of scenarios for the secondary analysis. For example, rare variants can be aggregated in a region, and the WEE can be applied to the aggregated data for the detection of rare variant effects in sequencing studies. It can also be incorporated into high-dimensional data analysis such as conducting variable selection from a large number of SNPs and covariates, using cross-validation or penalty functions like Lasso.

Third, the WEE does not make any assumptions on the association, which is more general than most likelihood-based approaches. As shown in the simulation studies, the WEE yields unbiased association estimation regardless of the association.

Fourth, the WEE is not sensitive to sampling schemes. Although the IPW approach is a simple and flexible method that works for any model, it requires knowing additional information on sampling probabilities. The resulting estimates may not be robust or efficient under its sampling schemes, such as under the existence of confounding factors and overweighted observations.

Finally, the WEE is computationally simple and straightforward. The essence of the new estimating equation is weighted regression. Hence the computation does not require special software or packages. We published the R functions of the WEE approach using linear regression, logistic regression, and quantile regression on github (https://github.com/songxiaoyu/secondary-analysis-in-case-control-studies) together with selected codes of comparison methods and simulations. Users can utilize these functions directly for their analyses in R or revise the codes based on their respective needs.

Overall, the WEE provides a very general secondary analysis framework. Through intensive numerical investigations, we found that it is particularly useful and outperforms the existing approaches when do not follow a linear logistic model, the sampling scheme is unknown or complex, or likelihood functions are not available.

Based on our investigations on the secondary analysis in the case–control studies, we provide some suggestions for the selection of methods in a real GWAS study. The first step is investigating the *Y* and *D* relationship. If classical methods are valid, and researchers do not need to consider novel approaches to adjust for biases. For studies where *Y* are *D* are correlated regardless of the source of correlation, it is worthwhile to consider the sampling scheme. The IPW approach provides robust and similar efficient estimates to those of the WEE approach under a random sample. However, if the sampling scheme is unclear or complex, the WEE could serve as an alternative approach. One may need to be cautious with likelihood-based approaches such as SPML, as the estimates may be biased and the efficiency may not be achieved with a misspecified underlying disease model.

#### Future extensions:

The WEE approach can be extended in a few directions. First, it can be adapted for studies with more complex designs. For example, we can apply the same idea to matched case–control designs and combine their estimating functions of the matched subsets for estimation. The WEE can also be extended to studies where the primary disease includes multiple categories or is continuously sampled but oversampled at the “low” and/or “high” extremes. The conditional primary disease prevalence can be estimated using proportional odds regression or multinomial logistic regression for categorical primary disease and certain parametric likelihoods for continuous primary disease.

Another extension is to consider meta-/mega-analysis, combining WEE estimates from multiple case–control studies. As most of the case–control studies are powered for the primary analysis, the power for detecting the important SNPs of secondary traits using a single data set can be limited. If multiple case–control studies exist that measure the same secondary trait, combining them could greatly enhance the power. Another way to improve the power is to model multiple correlated secondary traits jointly. If a SNP is associated with multiple traits, analyzing them jointly could improve the detecting power. The WEE approach can also be extended to multivariate regressions by replacing the estimating function *S* by the generalized estimating functions as in Liang *et al.* (1992).

In summary, the construction of the WEE is straightforward and computationally efficient by evaluating the expected counterfactual estimating functions or generating pseudo observations. It provides a robust and fairly efficient unbiased estimation for the marker–secondary phenotype association for multiple types of secondary traits. It also has appropriate type I error rate and relatively large power and has proved robust against disease prevalence misspecification. Finally, the WEE can be extended to multiple study designs and can be applied to multiple regressions.

## Acknowledgments

This research is supported by the National Science Foundation (DMS-120923) and by the National Institutes of Health (1R03HG007443-01). I.I.-L. is supported by National Institute of Mental Health grant MH095797 and by National Science Foundation grant DMS-1100279.

## Appendix

### The WEE in Survival Analysis

The key for using the new estimating equations is to find the estimating function of each respective outcome and then to plug it into the basis of the estimating equations:For example, we consider a Cox proportion hazards model with the fixed covariates for time-to-event data. Let and be the event time, the censoring time, the genetic variant, and the covariate vector of individual *i*, respectively, for The observed time of individual *i* is defined by and Then, the hazard function of subject *i* is The corresponding partial-likelihood function isand the partial score function iswhere and

With the partial score function, we then plug it into the basis to construct the new estimating equations and solve it for the unbiased estimation of the association between *X* and time-to-event secondary trait *Y* in a case–control study. The new set of estimating equations can be solved by generating pseudo observations for each subject in cases and controls or by estimating expectation of the counterfactual estimating functions, assuming a known weight The weight can be solved by modeling the primary disease through a logistic regression.

## Footnotes

*Communicating editor: I. Hoeschele*

- Received July 24, 2015.
- Accepted February 13, 2016.

- Copyright © 2016 by the Genetics Society of America