## Abstract

This article focuses on conducting global testing for association between a binary trait and a set of rare variants (RVs), although its application can be much broader, to other types of traits, common variants (CVs), and gene set or pathway analysis. We show that many of the existing tests have deteriorating performance in the presence of many nonassociated RVs: their power can dramatically drop as the proportion of nonassociated RVs in the group to be tested increases. We propose a class of so-called sum of powered score (SPU) tests, each of which is based on the score vector from a general regression model and hence can deal with different types of traits and adjust for covariates, *e.g.*, principal components accounting for population stratification. The SPU tests generalize the sum test, a representative burden test based on pooling or collapsing genotypes of RVs, and a sum of squared score (SSU) test that is closely related to several other powerful variance component tests; a previous study (Basu and Pan 2011) has demonstrated good performance of one, but not both, of the Sum and SSU tests in many situations. The SPU tests are versatile in the sense that one of them is often powerful, although its identity varies with the unknown true association parameters. We propose an adaptive SPU (aSPU) test to approximate the most powerful SPU test for a given scenario, consequently maintaining high power and being highly adaptive across various scenarios. We conducted extensive simulations to show superior performance of the aSPU test over several state-of-the-art association tests in the presence of many nonassociated RVs. Finally we applied the SPU and aSPU tests to the GAW17 mini-exome sequence data to compare their practical performance with some existing tests, demonstrating their potential usefulness.

- adaptive SPU (aSPU) test
- genome-wide association study (GWAS)
- score statistic
- sequencing data
- sum of powered score (SPU) test
- sum of squared score (SSU) test
- sum test

THE recent advances in sequencing technologies have made it feasible to conduct global testing for association between complex traits and rare variants (RVs) (Bansal *et al.* 2010). The most popular approach in genome-wide association studies (GWASs) is to test on each single nucleotide variant (SNV) one by one and then select the SNVs meeting a stringent significance level after adjusting for multiple testing. However, such a strategy may have low power due to the weak signal contained in each individual RV, given its extremely low minor allele frequency (MAF). Hence, developing new association tests tailored to RVs has been an active research area in the past few years. Due to the low MAFs of RVs, to achieve practically meaningful power, the majority of existing approaches focus on testing on a group of RVs, rather than on each individual RV (Capanu *et al.* 2011); the main idea is to boost power through aggregating information across multiple RVs in an analysis unit, such as a gene (*e.g.*, Morgenthaler and Thilly 2007; Li and Leal 2008; Madsen and Browning 2009; Liu and Leal 2010; Han and Pan 2010; Hoffmann *et al.* 2010; Li *et al.* 2010; Price *et al.* 2010; Zhang *et al.* 2010; Zhu *et al.* 2010; Luo *et al.* 2011; Neale *et al.* 2011; Ionita-Laza *et al.* 2011; Feng *et al.* 2011; Pan and Shen 2011; Basu and Pan 2011; Gordon *et al.* 2011; Wu *et al.* 2011; Fan *et al.* 2013). As theoretically shown (Cox and Hinkley 1974) and demonstrated in our simulations, there is no uniformly most powerful test for this purpose, which means that, depending on the unknown truth, including specific association effect directions and sizes, a given and fixed test may or may not be powerful. Hence, there have been intensive efforts in developing adaptive tests for RVs (*e.g.*, Pan and Shen 2011; Lin and Tang 2011; Zhang *et al.* 2011; Lee *et al.* 2012; Chen *et al.* 2012; Derkach *et al.* 2013; Sun *et al.* 2013). 
However, due to their limited extent of adaptivity (*e.g.*, with a predetermined and fixed set of weights on the RVs), these adaptive tests are still not flexible (or adaptive) enough, with loss of power in some situations. A main motivation of this article is to develop a broader family of association tests such that at least one of them is powerful for a given situation. We develop such a family of tests, called the sum of powered score (SPU) tests, which generalize the sum (of score) test (Sum) and the sum of squared score (SSU) test (Pan 2009). The Sum test is a representative of the burden tests based on genotype pooling or collapsing (Morgenthaler and Thilly 2007; Li and Leal 2008; Madsen and Browning 2009), whereas the SSU test is closely related to kernel machine regression [and its implementation for RVs, SKAT (Sequence Kernel Association Test)] (Wu *et al.* 2010, 2011), the C-alpha test (Neale *et al.* 2011), and an empirical Bayes test for high-dimensional data (Goeman *et al.* 2006); see Pan (2011) and Basu and Pan (2011). In many simulation setups, one, but not both, of the Sum test and SSU test has been shown to be powerful (Basu and Pan 2011). For example, with different association directions of causal RVs, the Sum test suffers from a loss of power, while the SSU test performs much better. However, we emphasize that, in analysis of multiple RVs, there typically exist nonassociated RVs. For example, in cancer research it has been observed that the vast majority of RVs do not appear to confer risk (Capanu *et al.* 2011). Hence, it is important to assess the performance of a test in the presence of nonassociated RVs in the group of RVs to be tested. In fact, as will be shown, the performance of the Sum test deteriorates rapidly as the number of nonassociated RVs increases, whereas the SSU test is more robust but nevertheless may gradually become less competitive. 
It seems that the performance of various tests has not been fully investigated for the case with many nonassociated RVs, including some new adaptive tests, such as a kernel-based adaptive clustering (KBAC) test (Liu and Leal 2010), a *P*-value weighted sum test (PWST) (Zhang *et al.* 2011), an estimated regression coefficient (EREC) test (Lin and Tang 2011), an adaptive SSU (aSSU) test (Pan and Shen 2011), and an optimized SKAT (SKAT-O) test (Lee *et al.* 2012). As will be shown, these tests can suffer substantial power loss and are no longer competitive in the presence of a high proportion of nonassociated RVs. In contrast, regardless of the number of nonassociated RVs, at least one of our proposed SPU tests may remain relatively more powerful. Since the identity of the most powerful SPU test also changes with the unknown true association pattern of causal and noncausal RVs, we propose a simple yet highly adaptive SPU (aSPU) test to maintain high power across a wide range of scenarios. Our proposed aSPU test is often much more powerful than many existing adaptive tests in the presence of many nonassociated RVs.

We conducted extensive simulation studies to compare our newly proposed tests with several state-of-the-art tests, such as the PWST, EREC, aSSU, and SKAT-O tests, which all appeared after the publication of Basu and Pan (2011) and thus were not compared there. Because this is an active research topic, quite a few association tests for RVs have been proposed in the past 2 or 3 years. However, unfortunately, most of them have not been fully compared to each other, especially in the presence of many nonassociated RVs, which is expected to be the norm instead of the exception in analysis of RVs. Hence, as a second aim of this article, we assess the performance of these existing tests along with our newly proposed tests, offering some insights into their potential in practical use. Our study can be regarded as an update and follow-up to Basu and Pan (2011).

## Methods

### Data and notation

Our proposed methods are based on general regression models and thus can be applied to binary, quantitative, and survival responses or traits in the framework of generalized linear models and the Cox proportional hazards model while adjusting for covariates, such as environmental variables and principal components accounting for population stratification. To be concrete, we consider only the case–control study design with a binary response/trait and no covariates; more general cases can be similarly approached, as shown in our example. Suppose that for subject *i* = 1, …, *n*, *Y _{i}* = 0 or 1 is a binary response or trait, *e.g.*, an indicator of disease, and *X _{i}* = (*X*_{i1}, …, *X _{ik}*)′ is a group of predictors of interest, such as *k* RVs from a candidate gene or region. We use additive coding for each RV; that is, *X _{ij}* is the count of the minor allele at RV *j* for subject *i*. Consider a logistic regression model:

Logit Pr(*Y _{i}* = 1) = *β*_{0} + ∑_{j=1}^{k} *X _{ij}β _{j}*.   (1)

We want to test the null hypothesis *H*_{0}: *β* = (*β*_{1}, …, *β _{k}*)′ = 0; that is, there is no association between any of the RVs and the trait under *H*_{0}.

Many of the existing tests and our new tests are based on the score vector *U* = (*U*_{1}, …, *U _{k}*)′ for *β* in the logistic regression model (1) and *V* = Cov(*U*|*H*_{0}) (Pan 2009; Basu and Pan 2011; Lin and Tang 2011; Wu *et al.* 2011; Lee *et al.* 2012), where

*U* = ∑_{i=1}^{n} (*Y _{i}* − Ȳ)*X _{i}*,   *V* = Ȳ(1 − Ȳ) ∑_{i=1}^{n} (*X _{i}* − X̄)(*X _{i}* − X̄)′,

and Ȳ and X̄ are the sample means of the *Y _{i}*’s and *X _{i}*’s, respectively. An advantage of using a score-based test is that the score vector has a closed form and only a null model (*i.e.*, the model under *H*_{0}) needs to be fitted; thus it is computationally much faster, and sometimes the only feasible option, compared to the corresponding Wald or likelihood-ratio test, for which a more general and complicated model has to be fitted and may not even converge (*e.g.*, when *k* > *n*). Furthermore, it is noted that the score vector *U* in the joint model (1) is the same as *U*_{M} = (*U*_{M,1}, …, *U*_{M,k})′, with *U*_{M,j} being the score statistic for *β*_{M,j} in the marginal model

Logit Pr(*Y _{i}* = 1) = *β*_{M,0} + *X _{ij}β*_{M,j},   (2)

where subscript M denotes parameters from a marginal model. In contrast, in general the maximum-likelihood estimates (MLEs) of the joint and marginal models differ, and our experience suggests that the Wald test based on the marginal models is more powerful than that based on the joint model. The most popular single variant-based analysis corresponds to a minimum *P*-value (UminP) test combining the univariate score tests for the marginal models,

*T*_{UminP} = max_{j=1,…,k} *U _{j}*^{2}/*V _{jj}*,

where *V _{jj}* = Var(*U _{j}*|*H*_{0}) is the *j*th diagonal element of *V*. To adjust for multiple testing, one could apply the conservative Bonferroni adjustment or, better, as implemented here, use the asymptotic null distribution *U* ∼ *N*(0, *V*) under *H*_{0} with numerical integration (or simulation) to obtain an asymptotically exact *P*-value for the UminP test (Conneely and Boehnke 2007).
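As a concrete illustration, the score vector and its null covariance can be computed directly from a genotype matrix; the sketch below uses simulated toy data (the variable names `n`, `k`, `X`, `Y` and the data themselves are ours, purely for illustration) and also forms the UminP statistic from the marginal score tests.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n subjects, k rare variants coded as minor-allele counts (0/1/2),
# and a binary trait. These are simulated placeholders, not real data.
n, k = 200, 10
X = rng.binomial(2, 0.01, size=(n, k)).astype(float)
Y = rng.binomial(1, 0.5, size=n).astype(float)

# Score vector for beta in model (1), no covariates:
#   U = sum_i (Y_i - Ybar) * X_i
U = X.T @ (Y - Y.mean())

# Covariance of U under H0:
#   V = Ybar * (1 - Ybar) * sum_i (X_i - Xbar)(X_i - Xbar)'
Xc = X - X.mean(axis=0)
V = Y.mean() * (1 - Y.mean()) * (Xc.T @ Xc)

# UminP statistic: max_j U_j^2 / V_jj (univariate marginal score tests combined).
diag = np.diag(V).copy()
diag[diag == 0] = np.inf  # guard against monomorphic variants
T_uminp = np.max(U**2 / diag)
```

Only the null model is needed here (the sample mean of *Y*), which is what makes score-based tests cheap to recompute under permutation.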

### A brief review of some existing tests

Basu and Pan (2011) compared the performance of many existing association tests for RVs. Their major conclusion is that if there is (nearly) a common association strength for causal RVs with no or few nonassociated RVs, then the burden tests, such as the Sum test (Pan 2009), were most powerful; otherwise, the SSU test (Pan 2009) and its close relatives (Pan 2011), kernel machine regression (KMR or SKAT) (Wu *et al.* 2010, 2011) and the C-alpha test (Neale *et al.* 2011), performed best. The Sum test is based on a working assumption that in the joint logistic regression model (1) there is a common association parameter between the *k* RVs and the trait, say *β*_{1} = … = *β _{k}* = *β*_{c}. Then we need only test a null hypothesis with a single parameter, *H*_{0}: *β*_{c} = 0, corresponding to fitting a simple logistic regression model:

Logit Pr(*Y _{i}* = 1) = *β*_{0} + *β*_{c} ∑_{j=1}^{k} *X _{ij}*.   (3)

On the other hand, the SSU test is a variance-component test: assuming that *β*_{1}, …, *β _{k}* in model (1) are independent random effects with mean 0 and variance *τ*^{2}, it can be derived as a score test on a null hypothesis with a single parameter, *H*_{0}: *τ*^{2} = 0. Specifically, both the Sum test and the SSU test are based on the score vector *U*,

*T*_{Sum} = ∑_{j=1}^{k} *U _{j}*,   *T*_{SSU} = ∑_{j=1}^{k} *U _{j}*^{2},

from which it is clear that the Sum test, like other burden tests such as the CMC test (Li and Leal 2008) and the weighted Sum test (Madsen and Browning 2009), will lose power if the causal RVs have different association directions, leading to different signs of the *U _{j}*’s and thus a small test statistic *T*_{Sum}, failing to reject *H*_{0}. In contrast, since the components of *U* are squared in the SSU test (and in KMR and the C-alpha test), the SSU test and its close relatives do not lose power with different association directions, because the sum is over the *U _{j}*^{2}’s instead of over the *U _{j}*’s as in the Sum test. A more general score-based statistic can be written as

*T*_{G} = ∑_{j=1}^{k} *ζ _{j}U _{j}*,

where *ζ* = (*ζ*_{1}, …, *ζ _{k}*)′ is a vector of weights for the *k* RVs (Lin and Tang 2011). For example, if *ζ _{j}* = −1 or 1, depending on whether the estimated marginal effect is negative with its *P*-value < 0.1, then *T*_{G} is the adaptive Sum (aSum) test of Han and Pan (2010). Two new tests that were not reviewed in Basu and Pan (2011) are also special cases of the above general test *T*_{G}. First, if *ζ _{j}* = 2(*p _{j}* − 0.5), where *p _{j}* is the *P*-value of a one-sided Wald test for *H*_{j,0}: *β*_{M,j} = 0 *vs.* *H*_{j,1}: *β*_{M,j} < 0, then *T*_{G} is the PWST of Zhang *et al.* (2011). Second, if *ζ _{j}* is the estimated marginal regression coefficient for RV *j*, then the *T*_{G} test is the EREC test of Lin and Tang (2011); it is noted that, due to the instability of estimating *β*_{M,j} for an RV, Lin and Tang (2011) proposed shrinking the estimate toward a constant *d* or −*d*, with *d* = 1 for binary traits. Each of the above three adaptive tests accommodates different association directions by using the signs of the estimated marginal effects, thus overcoming a main shortcoming of the Sum test and retaining high power in the presence of different association directions. Nevertheless, with RVs, for the same reason that motivates pooling or collapsing RVs in most association tests proposed so far, only limited information is contained in each RV, implying that all the above weighting schemes may not work well in some situations, as elaborated later.

In addition to differing association directions for causal RVs, a more common issue is the existence of many nonassociated RVs among the group of RVs to be tested. In particular, with many nonassociated RVs, as shown by Basu and Pan (2011), the burden tests, including the Sum test, lose their power quickly, while the SSU test and its close relatives perform much better. On the other hand, intuitively, if we can exclude nonassociated RVs in constructing a test statistic, it may help improve the power. Along this line, Pan and Shen (2011) proposed a class of adaptive Neyman-type tests (Neyman 1937), including an adaptive Sum (aSum+) test and an adaptive SSU (aSSU) test, which are based on RV selection instead of weighting. Specifically, first, one orders the components of the score vector *U* in descending order of *U _{j}* and of *U _{j}*^{2}, respectively, for the aSum+ and aSSU tests. Second, suppose that the *P*-values for the Sum and SSU tests based on the first *j* components of the ordered *U* are *P*_{Sum,j} and *P*_{SSU,j}, respectively; then the test statistics for the two adaptive tests are

*T*_{aSum+} = min_{1≤j≤k} *P*_{Sum,j},   *T*_{aSSU} = min_{1≤j≤k} *P*_{SSU,j},

and the final *P*-values are obtained by permutations or simulations. In short, the aSum+ and aSSU tests work by selecting the first few components of a reordered score vector *U* that are most informative (with the smallest *P*-values) while possibly ignoring other components of *U* for nonassociated or weakly associated RVs. In particular, the aSum+ test accounts for possibly different association directions by using only the positively associated RVs; however, it may suffer from power loss by ignoring the negatively associated RVs. To improve over the aSum+ test, Pan *et al.* (2011) proposed an adaptive Sum test based on two directional searches, denoted aSum2d, to use both positively and negatively associated SNVs, and found its improved power in detecting gene–gene interactions for CVs. Specifically, we first reorder the components of *U* in descending order of *U _{j}* as for the aSum+ test and suppose that the *P*-value of the Sum test applied to the *last j* components of the reordered *U* is *P*_{Sum,−j}; then the aSum2d test statistic is

*T*_{aSum2d} = min(*T*_{aSum+}, *T*_{aSum−}),   with *T*_{aSum−} = min_{1≤j≤k} *P*_{Sum,−j}.

We can then use permutations or simulations to obtain *P*-values for *T*_{aSum2d} (and *T*_{aSum−} if needed). If desired, one can also just use *T*_{aSum−} to test for only negatively associated RVs.
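The ordering step behind these adaptive Neyman-type tests might be sketched as follows; for brevity only the prefix statistics are computed, with the per-*j* *P*-values and the final permutation layer omitted, so this is an illustration of the selection idea rather than a full implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
U = rng.normal(size=8)  # toy score vector, illustrative only

# aSum+ ordering: components sorted in descending order of U_j (positive first);
# aSSU ordering: sorted in descending order of U_j^2 (largest magnitudes first).
ord_sum = np.argsort(-U)
ord_ssu = np.argsort(-U**2)

# For each prefix length j, the Sum (or SSU) test would be applied to the first
# j ordered components and its P-value recorded; the adaptive statistic is the
# minimum P-value over j. Here we only form the prefix statistics themselves.
prefix_sum = np.cumsum(U[ord_sum])     # Sum statistic on first j components
prefix_ssu = np.cumsum(U[ord_ssu]**2)  # SSU statistic on first j components
```

The full prefix (*j* = *k*) recovers the ordinary Sum and SSU statistics, so the adaptive versions can only gain by stopping earlier.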

Another adaptive test, called KBAC, was proposed by Liu and Leal (2010). A unique feature of the KBAC test is to detect not only the main effects of, but also possible interactions among, RVs. For the latter purpose, rather than weighting on each individual RV, it uses a kernel-based weight on each unique pattern (or combination) of the genotypes across the *k* RVs. It upweights a genotype pattern that appears more frequently in cases (*i.e.*, with a higher risk of disease) and then contrasts the frequencies of genotype patterns between the case and control groups by taking a weighted sum of their frequency differences. As pointed out by Basu and Pan (2011), there are two potential limitations. First, since its test statistic is a weighted sum of the frequency differences between the case and control groups, the presence of opposite association directions may contribute to both positive and negative frequency differences, leading to a small test statistic and thus loss of power. Second, as the number of nonassociated RVs increases, there will be a larger number of unique genotype patterns and thus a smaller number of subjects with each genotype pattern, leading to loss of power. These two points are confirmed later.

### A new class of tests and a data-adaptive test

Our basic observation is that, depending on the unknown pattern of association effects of the group of RVs to be tested, different tests may be more powerful; in spite of the generality of the *T*_{G} test, its performance *critically* depends on the choice of the weights, and any fixed choice may or may not be most suitable. Hence our primary goal is to construct a class of versatile tests such that for a given scenario, at least one of the tests is powerful. Then we combine these tests to obtain a data-adaptive test that maintains high power across a wide range of scenarios. For this purpose, we choose the weight *ζ _{j}* to be as informative and as simple as possible. Since most existing association tests use the score vector *U*, suggesting that most information is already contained in *U*, we simply use *U* to construct the weights. In particular, since *U* ∼ *N*(0, *V*) under *H*_{0}, a large |*U _{j}*| offers strong evidence to reject *H*_{0,j}: *β _{j}* = 0. Specifically, we choose *ζ _{j}* = *U _{j}*^{γ−1} for an integer *γ* ≥ 1, leading to a SPU test:

*T*_{SPU(γ)} = ∑_{j=1}^{k} *U _{j}*^{γ}.   (4)

With various values of *γ* ≥ 1, we obtain a class of SPU tests. The SPU tests cover the Sum and SSU tests as two special cases, corresponding to *γ* = 1 and *γ* = 2, respectively. Importantly, as *γ* increases, the SPU(*γ*) test puts more weight on the larger components of *U* while gradually ignoring the remaining components. An extreme case is that, as an even number *γ* → ∞, we have

*T*_{SPU(γ)}^{1/γ} = (∑_{j=1}^{k} *U _{j}*^{γ})^{1/γ} → max_{j} |*U _{j}*|.

Since the SPU tests are based on resampling methods to calculate their *P*-values, they are invariant to any monotone transformation of their test statistics, such as (⋅)^{1/γ}. That is, we can equivalently define *T*_{SPU(∞)} = max_{j} |*U _{j}*|, which uses only the largest component of |*U*|. More generally, as we increase the value of *γ*, we put higher and higher weights on the larger components of *U*, effectively realizing RV selection. On the other hand, an even integer *γ* automatically eliminates the effects of the different signs of the *U _{j}*’s, avoiding the power loss of the Sum test in the presence of different association directions. However, an odd integer *γ* might be more suitable, as in the SPU(1) or Sum test, when the associations are all in the same direction.

We know that under *H*_{0}, the score vector *U* has an asymptotic Normal distribution *N*(0, *V*). Hence, in theory, we can derive the asymptotic distribution of *T*_{SPU(γ)}, which, however, may not be easy to calculate. As an alternative, we resort to permutations (Churchill and Doerge 1994). Specifically, we permute the original set of traits *Y* to obtain a new set of traits *Y*^{(b)}, based on which we calculate the score vector *U*^{(b)} and the null statistic *T*_{SPU(γ)}^{(b)}; after *b* = 1, …, *B* permutations, we calculate the *P*-value as [∑_{b=1}^{B} I(|*T*_{SPU(γ)}^{(b)}| ≥ |*T*_{SPU(γ)}|) + 1]/(*B* + 1). We used *B* = 200 in our simulations for a nominal significance level of 5%.

In the presence of covariates, we propose generalizing the above permutation scheme. Specifically, first we regress *Y* on the covariates to fit a null model under *H*_{0}, obtaining fitted values *Ŷ _{i}* and residuals *r _{i}* = *Y _{i}* − *Ŷ _{i}*; second, we permute the set of residuals *r* = {*r _{i}* | *i* = 1, …, *n*} to obtain a permuted set *r*^{(b)}; third, we calculate the new score vector based on the permuted residuals as *U*^{(b)} = ∑_{i=1}^{n} *r _{i}*^{(b)}*X _{i}* and the corresponding null statistic *T*_{SPU(γ)}^{(b)}; after repeating the above steps for *b* = 1, …, *B*, we calculate the *P*-value as in the no-covariate case.
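The three-step residual-permutation scheme might be sketched as follows; for brevity the null-model fit is replaced by an intercept-only fit (a real analysis would regress *Y* on the covariates, *e.g.*, by logistic regression), and SPU(2) is used as the test statistic. All data and names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 300, 6

Z = rng.normal(size=n)                           # a covariate (unused in the
Y = rng.binomial(1, 0.3, size=n).astype(float)   # simplified fit below)
X = rng.binomial(2, 0.02, size=(n, k)).astype(float)

# Step 1: fit the null model Y ~ covariates; simplified here to the
# intercept-only fit Yhat_i = mean(Y), giving residuals under H0.
Yhat = np.full(n, Y.mean())
r = Y - Yhat

# Steps 2-3: permute residuals and recompute the score vector
#   U^(b) = sum_i r^(b)_i * X_i, then the SPU(2) null statistic.
B = 100
T_obs = np.sum((X.T @ r) ** 2)
T_null = np.empty(B)
for b in range(B):
    rb = rng.permutation(r)
    T_null[b] = np.sum((X.T @ rb) ** 2)

p_value = (np.sum(T_null >= T_obs) + 1) / (B + 1)
```

Because only the residuals are permuted, the covariate adjustment fitted under *H*_{0} is preserved in every permuted data set.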

Alternatively, we also propose using the parametric bootstrap: we first fit a null model under *H*_{0} to obtain the fitted probabilities *Ŷ _{i}* and then simulate a new set of traits *Y _{i}*^{(b)} ∼ Bernoulli(*Ŷ _{i}*) for *b* = 1, …, *B*; we calculate the test statistic based on each set of simulated *Y*^{(b)} and calculate the *P*-value as in the permutation method.

Since the power of a SPU(*γ*) test depends on the choice of *γ*, while the optimal choice of *γ* depends on the unknown true association pattern of the RVs to be tested, it would be desirable to choose the value of *γ* data-adaptively. For this purpose, we propose an aSPU test that simply combines the *P*-values of multiple SPU tests (with various values of *γ*), although other combining methods are also possible (Pan *et al.* 2010; Cheung *et al.* 2012). Suppose that we have some candidate values of *γ* in Γ, *e.g.*, Γ = {1, 2, 3, …, 8, 15, 16, 31, 32, ∞} as used in our later simulations, and suppose that the *P*-value of the SPU(*γ*) test is *P*_{SPU(γ)}; then our combining procedure is to take the minimum *P*-value:

*T*_{aSPU} = min_{γ∈Γ} *P*_{SPU(γ)}.

Of course, *T*_{aSPU} is no longer a genuine *P*-value; we use the permutation or parametric bootstrap procedure to estimate its *P*-value. It may appear that a double permutation or bootstrap procedure is needed, but in fact this is unnecessary. For example, if we use the permutation method, first we permute the original set of traits to obtain *Y*^{(b)} and the corresponding score vector *U*^{(b)} for *b* = 1, 2, …, *B*. We then calculate the corresponding SPU test statistics *T*_{SPU(γ)}^{(b)} and their *P*-values *p _{γ}*^{(b)} by comparing each *T*_{SPU(γ)}^{(b)} against the statistics from the other *B* − 1 permutations. Thus we have *T*_{aSPU}^{(b)} = min_{γ∈Γ} *p _{γ}*^{(b)}, and the final *P*-value of the aSPU test is [∑_{b=1}^{B} I(*T*_{aSPU}^{(b)} ≤ *T*_{aSPU}) + 1]/(*B* + 1).
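The single-layer permutation idea can be sketched end to end: one set of permutations yields both the SPU *P*-values and the null distribution of their minimum. The toy data, the small candidate set Γ, and *B* = 200 below are for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, B = 200, 8, 200
gammas = [1, 2, 3, 4]  # a small illustrative candidate set Gamma

X = rng.binomial(2, 0.02, size=(n, k)).astype(float)
Y = rng.binomial(1, 0.5, size=n).astype(float)

def spu(U, g):
    return np.max(np.abs(U)) if np.isinf(g) else np.sum(U**g)

U_obs = X.T @ (Y - Y.mean())
T_obs = np.array([spu(U_obs, g) for g in gammas])

# One set of permutations serves all SPU tests and the aSPU test.
T_perm = np.empty((B, len(gammas)))
for b in range(B):
    Yb = rng.permutation(Y)
    Ub = X.T @ (Yb - Yb.mean())
    T_perm[b] = [spu(Ub, g) for g in gammas]

# P-value of each SPU(gamma) test (two-sided, via absolute values).
p_obs = (np.sum(np.abs(T_perm) >= np.abs(T_obs), axis=0) + 1) / (B + 1)
T_aspu = p_obs.min()

# Null distribution of the minimum P-value, reusing the SAME permutations:
# each permutation's statistics are compared against the other B-1.
p_perm = np.empty((B, len(gammas)))
for b in range(B):
    others = np.abs(np.delete(T_perm, b, axis=0))
    p_perm[b] = (np.sum(others >= np.abs(T_perm[b]), axis=0) + 1) / B
p_aspu = (np.sum(p_perm.min(axis=1) <= T_aspu) + 1) / (B + 1)
```

No second layer of permutation is performed: the permuted statistics are simply reranked among themselves, which is what makes the aSPU test computationally practical.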

We note the practicality of permutation- or other resampling-based methods for *P*-value calculations. First, due to the extremely low MAFs of some RVs, it is always dubious whether asymptotic results are applicable. Second, it is computationally feasible to use permutation-based tests for genome-wide searches. In practice, we can first use a smaller *B*, say *B* = 1000, to scan a genome and then gradually and repeatedly increase *B* for the few groups of RVs that pass an initial significance criterion (*e.g.*, *P*-value < 5/*B*) in the previous step; in this way, contrary to what is sometimes claimed, it is indeed feasible to apply a permutation-based test to genome scans and obtain highly significant results (if any). We have applied the permutation- or simulation-based aSPU test to genome-wide scans to yield *P*-values < 10^{−6}.

Finally we comment on the choice of Γ. The following considerations guide the choice of the integers *γ* ≥ 1 in Γ. First, to cover the burden and variance-component tests, which have been shown empirically to perform well in some situations (Basu and Pan 2011), we include *γ* = 1 and *γ* = 2 in Γ. Second, depending on whether the phenotype–RV association directions vary, we may need either even or odd integers *γ* to yield high power; if unsure, it is suggested to use both odd and even integers. Third, depending on how sparse the association signals are, one may use smaller or larger *γ*’s: the more RVs to be tested and the fewer associated RVs expected, the larger the *γ*’s that would be desirable. We have found that often Γ = {1, 2, 3, …, 8, ∞} suffices. For demonstration, we have included *γ* = ∞, which is not necessary but may be beneficial when testing on CVs. In the following, we also show the results of the SPU(*γ*) tests for *γ* ∈ Γ, for two purposes. The first is to show the varying operating characteristics of the various SPU tests. For example, we show higher power of SPU(3) or SPU(4) than that of SPU(1) and SPU(2), demonstrating the power gain of using some SPU(*γ*) test with *γ* > 2. The second is to show that often SPU(8) gives results almost the same as those of SPU(∞), suggesting no need to use other, larger *γ*’s. In practice, we suggest using the aSPU test, which combines the strengths (and possibly weaknesses) of the various SPU tests; the aSPU test can be regarded as a rigorous (and almost exact) means of multiple-testing adjustment over the several SPU tests, while the results of the individual SPU tests may shed light on the underlying genetic architecture. For example, if a large *γ* gives the most significant *P*-value, it may indicate a high degree of signal sparsity; if some odd *γ*’s yield more significant results than even *γ*’s, then most or all of the large associations are in the same direction. More elaborately, as shown below, an analysis of a SPU test can also suggest the relative contribution of each SNV to the aggregated association (if any).

### SNV selection

A limitation of most global tests is their inability to perform variant selection: even if the global null hypothesis is rejected, they may not give any information on which RVs are (or are not) likely to be associated with disease. We note that the aSPU test can be used to rank the importance of the RVs. First, we take the optimal value of *γ* < ∞ chosen by the aSPU test. Second, we assess the relative contribution of each RV *r* to the aSPU test as *C _{r}* = *U _{r}*^{γ}/∑_{j=1}^{k} *U _{j}*^{γ}. Third, we rank the RVs based on their *C _{r}* values, and we can select the top *k*_{1} RVs such that the sum of their relative contributions reaches *α*_{1} = 0.8, say; the choice of *α*_{1} determines the tradeoff between increasing true positives and increasing false positives. Generally, we can use *C _{r}* to prioritize and generate hypotheses on the selection of causal RVs.
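A minimal sketch of this ranking procedure follows, with a toy score vector of ours; an even *γ* is used so that the contributions are nonnegative (for an odd *γ* one might rank on absolute contributions instead, which is our assumption, not the article's prescription).

```python
import numpy as np

U = np.array([0.4, -0.1, 2.0, 0.3, 1.0])  # toy score vector, illustrative only
gamma_hat = 2                             # optimal finite gamma, e.g., from aSPU

# Relative contribution of RV r: C_r = U_r^gamma / sum_j U_j^gamma
C = U**gamma_hat / np.sum(U**gamma_hat)

# Rank RVs by C_r and keep the top ones whose contributions sum to alpha1.
alpha1 = 0.8
order = np.argsort(-C)                    # indices from largest to smallest C_r
cum = np.cumsum(C[order])
k1 = int(np.searchsorted(cum, alpha1) + 1)
selected = order[:k1]                     # here: variants 2 and 4
```

With this toy vector, variant 2 alone contributes about 76% and variant 4 another 19%, so the top two variants already exceed the 0.8 threshold.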

_{r}### Further comments and extensions

Below we briefly comment on the advantages of the aSPU test over several other adaptive tests. First, since the power of any univariate test for a single RV may be low (which is exactly the reason why we combine information across multiple RVs, *e.g.*, through pooling or collapsing), the *P*-value of such a test may not be informative; the aSum test and PWST based on such *P*-values may not perform well. Second, we note that the adaptive Neyman-type tests, such as the aSSU test, are based on the idea of variable or RV selection, while the SPU tests are based more on weighted averaging of variables or RVs. As discussed extensively in the model selection literature (Yuan and Yang 2005; Shen and Huang 2006) and in a genetic application (Newton *et al.* 2007), if signals are strong enough, then model selection is expected to perform better; otherwise, model averaging is preferred. In our current context, again due to the extremely low MAFs of RVs, no matter how strong its association strength, the information contained in each individual RV is quite limited. Thus we expect the model averaging-based SPU tests to outperform the model selection-based aSSU and other adaptive Neyman-type tests. Third, we note that the EREC test is related to the SPU tests. As shown in Pan (2009), the vector of marginal MLEs is approximately Diag(*V*)^{−1}*U*, where Diag(*V*) is a diagonal matrix with its *j*th diagonal element *V _{jj}*. Hence, if the estimate of *β*_{M,j} is much larger than *d*, then the EREC weight *ζ _{j}* is approximately *U _{j}*/*V _{jj}*, which is roughly proportional to *U _{j}* (if the MAFs of the RVs are in a close range), suggesting that the EREC test will be similar to the SPU(2), *i.e.*, SSU, test. On the other hand, if the estimate is small relative to *d*, then *ζ _{j}* ≈ ±*d*, implying that the EREC test will behave similarly to the SPU(1) (or Sum) test. Generally, we expect the EREC test to behave between the SPU(1) and SPU(2) tests.

Furthermore, several approaches (Lee *et al.* 2012; Derkach *et al.* 2013; Sun *et al.* 2013), including SKAT-O, have been proposed to combine a burden test like SPU(1) and a variance-component test like SPU(2). In contrast, our proposed aSPU test is based on combining a broader set of tests, including but going beyond SPU(1) and SPU(2), and hence is more flexible and adaptive. In the presence of many nonassociated RVs, the weight *ζ _{j}* = 1 or *ζ _{j}* = *U _{j}* may not suffice: we may need weights *U _{j}*^{γ−1} with *γ* > 2. In other words, with many nonassociated RVs, the power of the EREC or SKAT-O tests [or similar tests combining SPU(1) and SPU(2)] can be much lower than that of a SPU(*γ*) test with a large *γ* and lower than that of the aSPU test. In addition, the aSum, PWST, and EREC tests use the marginal estimates, which have to be obtained iteratively during each permutation, whereas the score vector *U* is much easier to compute; hence the SPU tests are much faster.

In the presence of external or prior biological information, as for other tests, it may be helpful to incorporate some *external* weights (differing from the *ζ* discussed earlier) into the SPU tests. Given some external weights *w _{j}*, we can have a weighted SPU (wSPU) test with

*T*_{wSPU(γ)} = ∑_{j=1}^{k} *w _{j}U _{j}*^{γ},

while all other aspects, including the construction of an adaptive wSPU test, remain the same as before. For example, if it is believed that causal RVs tend to have lower MAFs, as advocated by Madsen and Browning (2009), one can use a *w _{j}* inversely proportional to the MAF of the *j*th RV. In this way, by suitably weighting both CVs and RVs, it is possible to use the adaptive wSPU test for a joint analysis of CVs and RVs (Ionita-Laza *et al.* 2013). Alternatively, *w _{j}* can be a predicted likelihood of the *j*th RV’s being functional or deleterious based on some computational algorithm (Wei *et al.* 2011). As with other tests, the performance of the wSPU tests depends on how informative or correct the external weights are, while the choice of the external weights may not always be clear; hence we skip further discussion on the use of such external weights.

We have proposed using permutations (or the parametric bootstrap) to calculate the *P*-values for the SPU and aSPU tests. If the asymptotic normality of the score vector is expected to approximately hold, *e.g.*, in analysis of CVs, we may replace the permutation or bootstrap with simulation-based methods, which will be much faster (Zou *et al.* 2004; Seaman and Muller-Myhsok 2005).

## Results

### Simulation setups

We conducted extensive simulation studies to evaluate and compare the performance of various tests. We simulated genotypes as in Wang and Elston (2007). First, a latent vector *Z* = (*Z*_{1}, …, *Z _{k}*)′ was generated from a multivariate normal distribution *N*(0, *R*), where *R* had a first-order autoregressive (AR1) covariance structure with its (*i*, *j*)th element *R _{ij}* = Corr(*Z _{i}*, *Z _{j}*) = *ρ*^{|*i*−*j*|}; we used *ρ* = 0 and *ρ* = 0.9 to generate independent and correlated RVs, respectively. Second, the latent vector *Z* was dichotomized to yield a haplotype with some specified MAFs, each of which was randomly selected from a uniform distribution between 0.001 and 0.01 during each simulation. Third, the above two steps were repeated to generate two independent haplotypes, which were then combined to obtain the genotype *X _{i}* = (*X _{i}*_{1}, …, *X _{ik}*)′ for subject *i*. Fourth, for a nonnull case we randomly chose *k*_{1} causal RVs with their corresponding *β _{j}* ≠ 0 while all other *β _{j}* = 0; for a null case, all *β _{j}* = 0. Fifth, the disease status *Y _{i}* of subject *i* was generated from the logistic regression model (1). We used *β*_{0} = log(0.05/0.95) for a 5% background disease probability; that is, Pr(*Y _{i}* = 1|*X _{i}* = 0) = 0.05. Finally, as in a case–control study, we sampled *n*/2 cases and *n*/2 controls in each data set.

We considered a few setups with combinations of various values: *ρ* = 0 or 0.9, *k*_{1} = 8 or 1, and *n* = 1000. We varied the number of nonassociated RVs, *k* − *k*_{1}, between 0 and 128 and considered a range of possible values of *β _{j}* ≠ 0, covering from a common association effect to varying association strengths or directions and from a single to multiple causal RVs.

Throughout the simulations, the test significance level was fixed at *α* = 0.05. The results were based on 1000 independent replicates for each setup. We compared the performance of the SPU tests with several state-of-the-art adaptive tests not reviewed in Basu and Pan (2011), including one based on a Bayesian hierarchical GLM (BhGLM) (Yi *et al.* 2011). As a benchmark, we also included the UminP test, which tests each individual RV separately and then combines the results by taking their minimum *P*-value.

### Simulation results

To save space, we focus on a few cases with correlated RVs (*i.e.*, with neighboring RVs in linkage disequilibrium). Considering correlated RVs (or covariates) is not only more general but also in agreement with real sequence data, such as those generated from the 1000 Genomes Project (Zhang *et al.* 2013). The extensive simulation results with independent RVs and other association parameters were similar to those presented below and thus are relegated to supporting information, File S1.

First, all the tests maintained well-controlled type I error rates (Table 1). Second, we consider the case with nonzero *β _{j}*'s randomly drawn from a uniform distribution *U*(1, 2), representing an association pattern with varying association strengths but all in the same direction (Table 2). Among the SPU tests, SPU(1), SPU(2), and SPU(4) or SPU(6) were the respective winners with no, a medium, and a large number of nonassociated RVs. This is in agreement with our earlier analysis that an increasing proportion of nonassociated RVs requires a larger value of *γ* in the SPU(*γ*) test to weed out the effects of the nonassociated RVs. In particular, the quickly deteriorating performance of the SPU(1) (*i.e.*, Sum) test was striking. Compared to some more powerful SPU tests, the UminP test was low powered because it used information from only the most significantly associated RV while ignoring the other associated RVs. It is noted that a SPU(*γ*) test with 8 ≤ *γ* < ∞ was only slightly more powerful than SPU(∞), suggesting no need to use *γ* > 8.

Among the adaptive tests, the aSPU test was the overall winner; its performance in the presence of many nonassociated RVs was most impressive: for example, with 126 nonassociated RVs, the power of the aSPU test was 0.811, much higher than 0.749 for SKAT (with a linear kernel used throughout), 0.658 for aSSU, 0.567 for EREC, 0.532 for aSum+, 0.380 for PWST, 0.331 for KBAC, and 0.248 for BhGLM. It is noted that the power of the aSPU test was always close to that of the most powerful SPU test, whose identity, however, changed with the setup. Although *all* the other adaptive tests performed well with no or few nonassociated RVs, they failed to do so otherwise. In particular, the aSSU test was much less powerful than the SSU test with many nonassociated RVs, presumably due to the difficulty of RV selection given the relatively weak signal from each individual RV.

Third, for the case with both varying association directions and effect sizes of the causal RVs (Table 3), among the SPU tests, as the number of nonassociated RVs increased, SPU(2), SPU(4), SPU(6), and SPU(8) became the most powerful, respectively, and, as expected, the SPU(1) (*i.e.*, Sum) test was the least powerful. For example, with 128 null RVs, the power of SPU(1) and that of SPU(2) were only 0.070 and 0.261, respectively, much lower than the 0.370 of SPU(7) and SPU(8); accordingly, the power of aSPU was 0.329, much higher than 0.235 for SKAT and 0.195 for SKAT-O. It was also confirmed that there was no need to use a SPU(*γ*) test with *γ* > 8 to gain power. Among the adaptive tests, the aSPU test was the winner; although the PWST and SKAT were most powerful with no or only a few nonassociated RVs, they quickly lost their edge as the number of nonassociated RVs increased. The aSSU test performed second best after the aSPU test, presumably due to easier RV selection with the larger effect sizes of some causal RVs. The BhGLM, aSum, and KBAC tests did not perform well in this case. Similar results with higher significance levels *α* and with covariates were obtained, as shown in File S1.

### Data example

We applied the methods to the mini-exome sequence data provided by the Genetic Analysis Workshop (GAW) 17 (Almasy *et al.* 2011). The exome sequence data contain 24,487 SNVs in 3205 genes from 697 unrelated subjects. Our analyses focused on RVs with MAFs no larger than 1%; after removing those more frequent SNVs, we had 2476 genes containing at least 1 RV, with a total of 18,131 RVs. We conducted gene-based analyses.

The phenotypes were generated by GAW17 organizers based on some disease liability models with covariates. In particular, biological knowledge on pathways, especially the vascular endothelial growth factor (VEGF) pathway, and on predicted deleterious coding variants was utilized to design a realistic simulation model. Fixing the sequence data for the 697 subjects, 200 independent sets of a binary phenotype were generated. In addition to some causal SNVs, three risk factors, age, gender, and smoking status, were possibly associated with the binary phenotype. An advantage of using the GAW17 data is the opportunity to assess statistical power of any given method due to the known causal SNVs and the availability of replicated phenotypes. Hence, in addition to conducting a genome-wide scan, we applied the methods to each causal gene with all 200 sets of the binary phenotype to estimate the power.

#### A genome-wide scan:

To demonstrate the practical use of the proposed methods, we first conducted a genome-wide scan on all the 2476 genes with the first set of the binary phenotype. A logistic regression model was fitted to each gene with the three covariates, Logit Pr(*Y _{i}* = 1) = *β*_{0} + ∑_{*j*=1}^{*k*} *β _{j}X _{ij}* + *c*_{1}Age_{*i*} + *c*_{2}Sex_{*i*} + *c*_{3}Smoke_{*i*}, for *i* = 1, …, 697, where *X _{i}*_{1}, …, *X _{ik}* are the *k* RVs in the gene to be tested and *Y _{i}* is the binary phenotype. In the presence of the covariates, we used the parametric bootstrap to calculate the *P*-values for the SPU, aSPU, SKAT, and SKAT-O tests.

Throughout the data analysis, we used the following “step-up” procedure to determine the number of bootstraps (or permutations), *B*. We started with *B* = 10^{3} and then gradually increased *B*: if an estimated *P*-value was <5/*B*, we increased *B* to 10 times its current value to reestimate the *P*-value, and the process was repeated until no estimated *P*-value was <5/*B*. For the genome scan on the GAW17 data, we tried *B* = 10^{3}, 10^{4}, and up to 10^{5} to obtain *P*-values. On a multicore computer with 100 cores, it took about 0.21 hr to test the 2476 genes based on *B* = 10^{3}, 0.05 hr to test the 50 genes (with their *P*-values < 0.005 in the previous round) with *B* = 10^{4}, and 0.12 hr to test the 5 genes based on *B* = 10^{5}. If a single-core computer were used, a conservative estimate of the time required would be the above times multiplied by 100, which would still total <2 days. If we doubled the sample size, the computation took about three times as long. Note that our code was written entirely in R and not yet optimized; implementing the core part in C or another compiled language is planned and is expected to yield at least a 10-fold speedup. To obtain more accurate *P*-value estimates, one can replace the threshold 5/*B* with a larger value such as 50/*B*.
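The step-up procedure amounts to a simple loop (a sketch with a generic resampling callback; the thresholds match those described above):

```python
import numpy as np

def step_up_pvalue(t_obs, draw_null, B0=10**3, B_max=10**5, c=5):
    """'Step-up' Monte Carlo P-value: start with B0 resamples; whenever
    the estimate falls below c/B, multiply B by 10 and re-estimate,
    stopping at B_max. draw_null() returns one null test statistic."""
    B = B0
    while True:
        null = np.array([draw_null() for _ in range(B)])
        p = (np.sum(null >= t_obs) + 1) / (B + 1)
        if p >= c / B or B >= B_max:
            return p, B
        B *= 10

# Toy usage: a non-significant statistic stops at the initial B.
rng = np.random.default_rng(0)
p, B = step_up_pvalue(0.0, rng.standard_normal)   # p near 0.5, B = 1000
```

Only genes whose *P*-values survive the current threshold trigger the more expensive rounds, which is what keeps a genome-wide scan tractable.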

Figure 1 shows the Manhattan plots for the tests. Since we were testing 2476 genes, at the usual family-wise significance level of 0.05 and with a Bonferroni adjustment, we would use a gene-wise significance level of 0.05/2476 = 2.02 × 10^{−5}, which would suggest using a bootstrap replication number up to 10^{6} to 10^{7}.

None of the tests detected any significant genes. Nonetheless, we highlighted the top five most significant genes based on each test to show their differing operating characteristics. In Figure 1, it is clear that the three representative SPU tests gave overlapping but different sets of the top genes. It is interesting to note that the aSPU test combined the results of the SPU tests.

As a comparison, Figure 1 also shows the results for the parametric bootstrap- and asymptotics-based SKAT and SKAT-O tests. First, we note the differences among the top genes between the resampling- and asymptotics-based tests, although their overall trends were similar, implying that one has to be cautious in using asymptotics-based tests for analysis of RVs. Second, we note the difference between the results of SKAT and SKAT-O; the latter had some similarity to that of the SPU(1) test, as expected. Most importantly, although similar to some extent, the top ranked genes by the aSPU, SKAT, and SKAT-O tests were still different, suggesting the potential usefulness of the aSPU test as a complement to SKAT and SKAT-O. We also note that most of the genes contained no more than 30 RVs; otherwise, the difference between the aSPU and SKAT or SKAT-O could be larger, as suggested in our simulations.

#### Power comparison:

With the 200 replicated sets of the binary phenotype, the GAW17 data offer a unique opportunity to estimate the power of any test when applied to real sequence data. Due to the small sample size and relatively small effect sizes of the causal RVs, there was low power to detect any causal RVs in the GAW17 data. Accordingly, we used a less stringent gene-wise significance level of 0.05 (*i.e.*, without multiple testing adjustment) and estimated the power of a test as the sample proportion of its rejecting the null hypothesis among the 200 sets of the replicated phenotypes. We considered all 35 causal genes (with at least one causal RV) and tested on each separately. The main results are shown in Table 4.

We excluded the genes for which the maximum power of all the tests was <10% and partitioned the remaining causal genes into three groups, based on whether the aSPU test was more powerful than both SKAT and SKAT-O, between them, or less powerful than both. It is clear that for some genes the aSPU test was more powerful than SKAT and SKAT-O. For gene PIK3C2B, given the connections between SPU(1) and burden tests and between SPU(2) and SKAT, and given that SPU(1) was more powerful than SPU(2), it is not surprising that SKAT-O (which combines SKAT and a burden test) was more powerful than SKAT; furthermore, perhaps due to the relatively high signal sparsity, SPU(3) turned out to be the most powerful among the SPU tests, leading to the aSPU test's being more powerful than both SKAT and SKAT-O. A similar but distinct situation was found for gene VNN1. That the SPU(1) test was less powerful than SPU(2) might explain why SKAT was more powerful than SKAT-O; however, interestingly, since the SPU(5) test was most powerful among the SPU tests while SPU(4)–SPU(8) were also relatively high powered, we found the aSPU test more powerful than both SKAT and SKAT-O. On the other hand, when SPU(1) or SPU(2) was (nearly) more powerful than the other SPU tests, the aSPU test, which combines all the SPU tests (with only one or two high powered and more low powered), lost its edge to either SKAT or SKAT-O, but not both, as for gene RRAS. In some situations, as for gene KDR, since both SPU(1) and SPU(2) were far superior to the other SPU tests, the aSPU test ended up less powerful than both SKAT and SKAT-O. In summary, we found that, as expected, there was no uniform winner among the aSPU, SKAT, and SKAT-O tests; in some situations, the aSPU test could be more powerful than both SKAT and SKAT-O.

Since all the causal genes contained only a relatively small number of noncausal RVs, and all causal RVs were deleterious (Almasy *et al.* 2011), the SPU(3) test was most often the most powerful among the SPU tests. In contrast, perhaps as expected, SPU(∞) was almost always the least powerful. Importantly, we note that the power of the SPU(3) test could be much larger than that of SPU(1) and SPU(2), representatives of the burden tests and variance-component tests, respectively. For example, for gene PIK3C2B with 23 causal RVs among a total of 60 RVs, the power of SPU(3) was 0.650, much larger than 0.565 and 0.445 for SPU(1) and SPU(2), respectively. It was also confirmed that, even with a common association direction but in the presence of nonassociated RVs, the SPU(1) test might lose power, as for genes VNN1 and BCHE. These points reinforce what was observed in the simulation studies: we may need to use SPU tests beyond SPU(1) and SPU(2) to achieve high power, supporting the use of the aSPU test in some applications.

We briefly mention two limitations of the GAW17 data. First, the number of RVs in any gene was often relatively small in this mini-exome sequence data set; hence the clear advantage of SPU(*γ*) tests with *γ* > 3 did not show up, which in turn limited the potential advantage of the aSPU test. However, with the increasing availability of whole-genome sequence data, we expect a much larger number of RVs in or near a gene, for which we may see a more dramatic advantage of the aSPU test. Second, since all the causal SNVs were deleterious, it favored SKAT-O, which might not perform as well otherwise (as shown in simulations; see Table 3).

We also performed a simulation study to confirm that our methods could control type I error rates satisfactorily with real sequence data at a higher significance level. We randomly selected three genes, RRAS, HIF3A, and PIK3C2B, with 5, 15, and 60 RVs, respectively. To mimic the GAW17 phenotype data, we randomly generated a binary phenotype *Y _{i}* with Pr(*Y _{i}* = 1) = 0.3 under *H*_{0} for each of the 697 subjects. We then tested *H*_{0} for possible association between the phenotype and each gene with 10^{5} simulation replicates so that we could obtain more accurate type I error estimates for a higher significance level *α*. As shown in Table 5, our methods could satisfactorily control the type I error rate. We note that, for gene RRAS with only 5 RVs, many tests could be conservative with type I error rates lower than the nominal level *α*, due to the highly discrete null distributions of the test statistics. Furthermore, due to its higher level of discreteness and extreme-value-type test statistic, the tail distribution of the SPU(∞) statistic had larger variability, which in turn could perturb that of the aSPU test. As shown in Table 4, since the SPU(∞) test was almost always the lowest powered, we suggest excluding it when testing on a few RVs with a binary phenotype. Here, for the purpose of demonstration, we included the SPU(∞) test.

#### SNV selection:

The aSPU test can be used for SNV selection. For comparison, we also included two methods: one was the UminP test as used in GWAS, and the other was Lasso-penalized logistic regression (Tibshirani 1996; Zhou *et al.* 2010). For UminP, we ranked the RVs based on their corresponding *P*-values. We used the R package glmnet to fit Lasso-penalized logistic regression. There is a tuning parameter *λ* ≥ 0 in Lasso; as one reduces the value of *λ*, there will be more nonzero coefficient estimates, thus selecting more RVs to be included. In this way, we counted the number of nonzero coefficient estimates and the corresponding number of true positives (*i.e.*, the corresponding causal RVs). One problem with Lasso-penalized regression with RVs was that it was difficult to control the number of nonzero coefficient estimates; as shown in Figure 2 for gene PIK3C2B, we could not obtain 40 or so nonzero coefficients, even with some labor-intensive fine-tuning of *λ*.

The methods were compared on the basis of the number of true positives among a given number of their top-ranked RVs for a causal gene. We found that aSPU and Lasso performed similarly, and both performed much better than UminP. Among the 15 causal genes, if we looked at the top six ranked RVs, the frequencies of selecting (0, 1, 2, 3, 4) true positives were the following: (10, 5, 0, 0, 0) by UminP, (3, 6, 4, 1, 1) by Lasso, and (4, 3, 6, 1, 1) by aSPU. Similar results were obtained for other numbers of top-ranked RVs. The poor performance of the UminP test could be due to an unstable variance estimate for an RV: a variance estimate too close to 0 (from a very small MAF) dramatically inflates the corresponding test statistic. Figure 2 shows a few more examples in detail. It is confirmed that aSPU and Lasso performed similarly. Given the simplicity of the aSPU test and that Lasso may not yield certain given numbers of nonzero coefficient estimates, the use of aSPU for ranking and selecting RVs seems promising.
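For intuition, one simple score-based ranking rule in the spirit of the SPU weighting (an illustrative sketch only; the article's exact selection procedure may differ) is to rank RVs by the magnitude of their contribution to the winning SPU(*γ*) statistic:

```python
import numpy as np

def rank_rvs(U, gamma):
    """Rank RVs by decreasing |U_j|^gamma, i.e., by each RV's
    contribution to the SPU(gamma) statistic (illustrative rule,
    labeled hypothetical: not necessarily the article's method)."""
    contrib = np.abs(np.asarray(U, dtype=float)) ** gamma
    return np.argsort(-contrib)

U = np.array([0.2, -1.5, 0.1, 0.8])
rank_rvs(U, 3)  # indices ordered as [1, 3, 0, 2]
```

Unlike Lasso, such a ranking trivially yields any requested number of top RVs.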

## Discussion

We have proposed and studied a class of SPU tests, which are versatile in the sense that in many scenarios at least one of the SPU tests has high power, although the identity of the more powerful SPU test may change with the scenario. The SPU tests are based on the score vector of a regression model, rendering them both computationally efficient and general enough to cover a wide range of applications with binary, quantitative, ordinal, and survival traits and possible covariates. In particular, compared to several other adaptive tests based on estimated (marginal) regression coefficients (*e.g.*, Feng *et al.* 2011; Zhang *et al.* 2011; Lin and Tang 2011), the SPU tests, like any score-based test, require only fitting a simplified model under the null hypothesis. In addition to its computational simplicity, a score-based test is more stable with RVs: for example, with a binary trait, the MLE of a regression coefficient for an RV does not exist if the minor allele appears only in cases or only in controls (but not both), leading to nonconvergence of the iterative algorithm used to obtain the MLE. On the other hand, as shown earlier, the MLE of a (marginal) regression coefficient is approximately proportional to its score component, implying that, as expected, the score vector is as informative as a vector of the estimated regression coefficients while being computationally much simpler. Our major contribution is that, by recognizing the limitation of the existing adaptive tests with a fixed choice of weights, we allow many possible choices of weights indexed by a single parameter *γ* ≥ 1. There is a simple interpretation of, and thus guidance on, the parameter *γ*: as the value of *γ* increases, we upweight the larger components of the absolute value of the score vector, |*U*|; that is, with a decreasing proportion of causal RVs in the group to be tested, we expect a SPU(*γ*) test with a larger value of *γ* to be more powerful, because its upweighting of the larger components of |*U*| reduces or even eliminates the noisy perturbations that many nonassociated RVs contribute to the test statistic, thus maintaining high power in the presence of many nonassociated RVs. In addition, in the presence of both positive and negative association directions for causal RVs, an even value of *γ* eliminates the canceling effect of positive and negative components of *U*. In particular, compared with some new adaptive tests, such as KBAC, PWST, aSSU, EREC, and SKAT-O, our proposed aSPU test was more adaptive and performed much better in simulations when there was a large number of nonassociated RVs.

For its highly adaptive and versatile performance, the aSPU test has a wide spectrum of applications with other types of traits and/or other genetic variants. For example, our preliminary results showed promising performance of the aSPU test for polygenic testing on association between a binary trait and thousands of CVs (International Schizophrenia Consortium 2009). In principle, the aSPU test can be also applied to gene set or pathway analysis (Liu *et al.* 2008; Wang *et al.* 2010).

The relatively good performance of the SSU test and its close relatives, KMR or SKAT (Wu *et al.* 2010, 2011) and the C-alpha test (Neale *et al.* 2011), was attributed to its treating the regression coefficients *β* as random effects and then testing on the variance component of the random effects (Basu and Pan 2011). Here, based on the formulation of the SPU tests, which is more general than the SSU test, we feel that it can be viewed from another angle: the good performance of the SSU test, or more generally, any SPU(*γ*) test, is due to the weighting of each score component *U _{j}* by itself or by its power *U _{j}*^{*γ*−1}. Since *U _{j}* contains association information about RV *j*, such weights are both simple and informative: specifically, since *U* has a null distribution *N*(0, *V*), a larger component |*U _{j}*| corresponds to stronger evidence of association between the *j*th RV and the trait, and thus assigning a higher weight *U _{j}* or *U _{j}*^{*γ*−1} will help boost power by reducing the noise introduced into the test statistic by nonassociated RVs. However, depending on the unknown truth, such as the proportion of associated RVs and their specific association effects, a SPU(*γ*) test with a suitable *γ* will be more powerful than others. For example, in the presence of many nonassociated RVs, we would expect a larger *γ* to be more effective: a nonassociated RV is expected to have a smaller |*U _{j}*|, and thus an almost zero weight *U _{j}*^{*γ*−1}, which may successfully eliminate the negative effects of many nonassociated RVs on testing. In particular, when the group of RVs to be tested contains a large proportion of nonassociated RVs, we found that a SPU(*γ*) test with *γ* > 2 could be much more powerful than the Sum and SSU tests, explaining when and why the aSPU test could outperform SKAT-O and other tests that combine a burden test like SPU(1) and a variance-component test like SPU(2) (Lee *et al.* 2012; Derkach *et al.* 2013; Sun *et al.* 2013). We also found that a SPU(*γ*) test with a large *γ* > 8 gave results similar to those of the SPU(∞) test in our simulations, suggesting that searching over *γ* from 1 to 8 perhaps suffices in many applications.
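This noise-suppression effect of a larger *γ* can be seen in a toy numeric illustration: with one strong score component among many weak ones, the share of the SPU(*γ*) statistic contributed by the weak (noise) components shrinks rapidly as *γ* grows.

```python
import numpy as np

# One strong score component (4.0) among 99 weak ones (0.5 each):
U = np.concatenate(([4.0], np.full(99, 0.5)))

for gamma in (2, 4, 6, 8):
    total = np.sum(U ** gamma)
    noise_share = np.sum(U[1:] ** gamma) / total
    print(f"gamma={gamma}: noise share = {noise_share:.4g}")
    # shrinks from about 0.61 at gamma=2 to below 1e-5 at gamma=8
```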

Since the SPU tests use a mechanism of weighting to minimize the effects of nonassociated RVs, they are analogous to model averaging in the model-selection literature, in contrast to the adaptive Neyman-type tests, such as the aSSU test (Pan and Shen 2011), which are more in line with model or RV selection. In the current context of analysis of RVs, due to the limited information contained within each individual RV, model averaging is expected to outperform model selection, as supported by our empirical comparison between the aSPU and aSSU tests. Since in general neither model selection nor model averaging can dominate the other (Yuan and Yang 2005; Shen and Huang 2006), there may be merit in combining the two approaches in other applications. In addition, a few modifications or extensions are possible. First, in our current implementation of the aSPU test, we simply take the minimum *P*-value (minP) to combine multiple SPU tests; other combining methods may be equally applicable and may be preferred in certain situations (Pan *et al.* 2010). Second, it appears straightforward to extend the SPU tests to the case with variable thresholds (Price *et al.* 2010), which is related to adaptive Neyman-type tests (Pan and Shen 2011). Third, we have not evaluated the performance of the SPU tests in the presence of interactions among RVs; in particular, it would be interesting to compare their performance with the KBAC test, which was partly designed to detect interactions (Liu and Leal 2010). Finally, although we have pointed out some possible extensions of the aSPU test for (1) analysis of CVs, (2) joint analysis of both CVs and RVs, and (3) pathway analysis of CVs and/or RVs, these topics warrant further investigation in future research. R code is posted on our website at http://www.biostat.umn.edu/∼weip/prog.html.

## Acknowledgments

We thank the reviewers for many helpful and constructive comments. This research was supported by National Institutes of Health grants R01GM081535, R01CA169122, R01HL065462, R01HL105397, and R01HL116720 and by the Minnesota Supercomputing Institute.

## Footnotes

*Communicating editor: N. Yi*

- Received April 7, 2014.
- Accepted May 12, 2014.

- Copyright © 2014 by the Genetics Society of America