## Abstract

In recent years, genome-wide association studies (GWAS) have uncovered a large number of susceptibility variants. Nevertheless, GWAS findings provide only tentative evidence of association, and replication studies are required to establish their validity. Because of this uncertainty, researchers often focus on top-ranking SNPs rather than on strict significance thresholds to guide replication efforts, and the number of SNPs carried forward is often determined *ad hoc*. We show how the rank-based approach can be used for sample size allocation in GWAS as well as for deciding on the number of SNPs to replicate. The basis of this approach is the “ranking probability”: the chance that at least *j* true associations will rank among the top *u* SNPs when SNPs are sorted by *P*-value. By employing simple but accurate approximations for ranking probabilities, we accommodate linkage disequilibrium (LD) and evaluate the consequences of ignoring LD. Further, we relate ranking probabilities to the proportion of false discoveries among the top *u* SNPs. A study-specific proportion can be estimated from *P*-values, and its expected value can be predicted for study design applications.

AFTER a large-scale association study is completed, researchers face the question of how many of the strongest signals to follow up on. *P*-values of association tests are commonly used to sort SNPs, and the most significant SNPs are considered candidates for replication. The exact number of follow-up SNPs is often determined informally and may well exceed the number of SNPs deemed significantly associated. Statistical methodology can be used instead to determine that number. The rank-based approach that we develop here allows us to establish a “model-based” number of follow-up SNPs *before* conducting any association tests. In addition, this approach provides a usually lower, “data-based” number, guided by the ordered *P*-values actually observed in a study. In our approach, SNPs with prominently small *P*-values are followed up on, as well as any additional SNPs from the model-based calculation.

The model-based number of follow-up SNPs, *u*, would be determined on the basis of an association model, to ensure a certain probability that a desired number of true associations are included. Alternatively, one can choose the number of follow-up SNPs before collecting subjects for a genome-wide association study (GWAS) and find the sample size required for a desired number of true associations to be included among the chosen number of SNPs. The rank-based approach can be used in place of the usual power calculations and serves a related purpose (Satagopan *et al.* 2002, 2004; Zaykin and Zhivotovsky 2005; Gail *et al.* 2008). In power calculations, the exact α (significance) threshold is predetermined, rather than the exact number of smallest *P*-values.

The rank-based approach has a number of attractive features:

1. SNP-specific weighting can easily be incorporated to prioritize the results of a study, by specifying different prior probabilities of association for different SNPs. With this approach, SNPs in a GWAS can be weighted differentially or organized into “tiers”, depending on whether they belong to a candidate pathway.

2. The approach is general in that it utilizes information contained in *P*-values that may originate from a variety of association tests. Parametric *P*-values of SNP or multilocus tests can be used, as well as permutational *P*-values.

3. There is only a weak dependence of ranking probabilities on linkage disequilibrium (LD). The probability that a true association will rank among the top *u* SNPs quickly becomes essentially the same for correlated and independent SNPs as *u* increases. This peculiar feature has been noted previously in simulation studies (Zaykin and Zhivotovsky 2005), and here we investigate this phenomenon more fully.

4. It is often more appealing to researchers to think in terms of a number of SNPs rather than in terms of a significance threshold.

5. The rank-based approach has a clear connection to the false discovery rate (FDR) (Benjamini and Hochberg 1995) and to Bayesian measures of association (Wakefield 2007).

For simple models, ranking probabilities and related measures are trivially evaluated. For example, assuming 500,000 chi-square tests and noncentrality 15 for a true association, the probability that the true association will rank among the first 100 nonassociated SNPs can be evaluated by a single R command (R Development Core Team 2007): `1 - pchisq(qchisq(1 - 100/(5e5 + 1), df = 1), df = 1, ncp = 15)`.
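The same one-line calculation can be reproduced in Python; this is a sketch using SciPy's central (`chi2`) and noncentral (`ncx2`) chi-square distributions with the parameter values from the example above.

```python
from scipy.stats import chi2, ncx2

# Probability that a true association with noncentrality 15 ranks among the
# first 100 of 500,000 nonassociated chi-square P-values.
L, u, ncp = 500_000, 100, 15.0
threshold = chi2.ppf(1 - u / (L + 1), df=1)   # chi-square value at rank u
prob = 1 - ncx2.cdf(threshold, df=1, nc=ncp)  # power at that cutoff
print(round(prob, 3))
```

The result matches the R command because `chi2.ppf`/`ncx2.cdf` correspond to `qchisq`/`pchisq` with a noncentrality parameter.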

## Methods

### Ranking probabilities and FDR

Ranking probability 𝒫_{j,u} is the probability that at least *j* of *m* true association *P*-values will have rank smaller than *u* in a sorted list of *P*-values, {*P*_{(1)}, … , *P*_{(k)}}, where *k* is the number of tests. A precise approximation to 𝒫_{j,u} under independence of *P*-values is given by Equation 1, where *F*_{Y(j)}(⋅) is the cumulative distribution function (CDF) of the *j*th ordered true association *P*-value, which we denote by *Y*_{(j)} (*Appendix A*). For *j* = *u* = 1, Equation 1 gives the probability that one of the true associations will have the smallest *P*-value in a study. This is also equal to the power to detect at least one true association at the significance level of 1/(2*L*).

Ranking probability is closely related to the FDR. Benjamini and Hochberg (1995) suggested a method to control the expected proportion of false discoveries among discoveries (*i.e.*, among rejections of a hypothesis) times the power, E{*F*/(*T* + *F*) | *T* + *F* > 0} × Pr(*T* + *F* > 0), where *T* and *F* are the numbers of true and false rejections. In association mapping scenarios, Pr(*T* + *F* > 0) can decrease as the number of tests increases, for example, if the proportion of true associations (*m*/*k*) decreases with the number of tests. Thus, the expectation part, namely the proportion of false discoveries among discoveries, also known as the conditional or positive FDR (pFDR) (Storey 2002), may not always be controlled with this method (Zaykin *et al.* 2000). The pFDR is related to the false positive report probability (FPRP) (Wacholder *et al.* 2004) and to the *q*-value (Storey 2002). If the prior probability of false associations is Pr(H_{0}), and *P*-values for true effects have a common CDF *F*_{Y}(⋅), then for a fixed *P*-value rejection threshold α, FPRP(α) = Pr(H_{0})α / [Pr(H_{0})α + (1 − Pr(H_{0}))*F*_{Y}(α)] (Equation 2). Here, *F*_{Y}(α) is the CDF for a true association *P*-value evaluated at α. It is equal to power at the α-level, *i.e.*, the probability that a true association *P*-value is ≤ α. FPRP and the *q*-value are essentially the same concepts, but the *q*-value approach involves estimation of Pr(H_{0}) and *F*_{Y}(⋅) from data. Both approaches plug in the observed *P*-value (*p*) in place of α. Thus, FPRP(*p*) can be interpreted as an average proportion of false positives in future experiments, provided that the α-level is readjusted to α* = *p*.
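The FPRP calculation can be illustrated with a short function. This is a sketch, not the authors' code: the prior Pr(H_{0}), the noncentrality, and the choice of a 1-df chi-square test for computing the power term *F*_{Y}(α) are illustrative assumptions.

```python
from scipy.stats import chi2, ncx2

def fprp(alpha, prior_h0, ncp):
    """FPRP at a fixed threshold alpha: Pr(H0)*alpha over the total
    probability of observing P <= alpha. Null P-values are uniform, so
    Pr(P <= alpha | H0) = alpha; power F_Y(alpha) is computed here from
    a 1-df noncentral chi-square test (an illustrative choice)."""
    power = 1 - ncx2.cdf(chi2.ppf(1 - alpha, df=1), df=1, nc=ncp)
    return prior_h0 * alpha / (prior_h0 * alpha + (1 - prior_h0) * power)

# Illustrative prior: 99.99% of SNPs are null; noncentrality 15 for true effects.
print(round(fprp(1e-6, 0.9999, 15.0), 3))
```

Note how the FPRP grows as the threshold α is relaxed, since the null contribution grows linearly in α while power saturates.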

While FPRP and the *q*-value are related to the average proportion of false positives for a fixed α-level, it is of interest to estimate the proportion of false positives at and below a particular ordered *P*-value. For example, if we take the *u* smallest *P*-values from a genome scan, what is the expected proportion of false positives among *P*_{(1)}, … , *P*_{(u)}? We term this quantity the rank-FDR, rFDR(*u*), and express it in terms of ranking probabilities in Equation 3. In the terminology of Gail *et al.* (2008), 1 − rFDR is the “proportion positive” (PP). They also introduced the “detection probability”, defined as (PP × *u*)/*m*.
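The rank-FDR has a direct Monte Carlo interpretation that can serve as an intuition check: simulate many studies and average the proportion of false positives among the *u* smallest *P*-values. The parameters below (number of tests *k*, number of true effects *m*, a single common noncentrality) are illustrative, not those used in the paper.

```python
import numpy as np
from scipy.stats import ncx2, chi2

rng = np.random.default_rng(1)

def rfdr_mc(k=10_000, m=20, ncp=15.0, u=10, n_sims=500):
    """Monte Carlo estimate of rFDR(u): the expected fraction of false
    positives among the u smallest P-values, averaged over studies."""
    false_fraction = np.empty(n_sims)
    labels = np.concatenate([np.zeros(k - m), np.ones(m)])  # 1 marks a true effect
    for s in range(n_sims):
        p_null = rng.uniform(size=k - m)          # null P-values are uniform
        stats = ncx2.rvs(df=1, nc=ncp, size=m, random_state=rng)
        p_true = chi2.sf(stats, df=1)             # P-values for true effects
        order = np.argsort(np.concatenate([p_null, p_true]))
        false_fraction[s] = 1 - labels[order[:u]].mean()
    return false_fraction.mean()

r = rfdr_mc()
print(round(r, 2))
```

With these illustrative parameters most of the top 10 are true effects, so the estimated rFDR is small.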

Equation 3 can be used for calculating the *model-based* number of follow-up SNPs, by finding the value *u* that controls the chosen value of rFDR. Alternatively, one can control the probability that at least *j* true associations are found among the top *u* results. This probability is given by the single value 𝒫_{j,u} and is analogous to the probability of making one or more type I errors in the usual family-wise error rate control.

The rFDR can also be expressed as an expectation using ordered *P*-values. Assuming a continuous test statistic, the *P*-value density for true associations is *f*_{Y}(*p*) = *dF*_{Y}(*p*)/*dp*. Then the posterior probability that the *observed* *j*th ordered *P*-value [denoted by lowercase *p*_{(j)}] is a false association (*i.e.*, the posterior probability of the null hypothesis, H_{0}) is given by Equation 4. Note the similarities and differences between Equations 2 and 4. The expectation of Equation 4, that is, its average value across many association studies, corresponds to the true proportion of *P*_{(j)} that are false associations. The rFDR can also be expressed in terms of Equation 4 (Equation 5). A similar result cannot be obtained by averaging FPRP(*p*) or *q*-values; that is, pFDR at a fixed threshold, pFDR(α), cannot be expressed as an expectation of pFDR computed using plugged-in *P*-values, pFDR(*p*). The main distinction between the rFDR and the pFDR approaches is that the rFDR is defined for a rank, whereas the pFDR is defined for a threshold.

Strictly, the right-hand side of Equation 5 is defined for a scenario where the number of true associations is not fixed, but is rather assumed to have a binomial distribution with the rate 1 − Pr(H_{0}) and the expectation *m*. When *m* is assumed to be fixed in the computation of 𝒫_{j,u}, there is a slight discrepancy between Equations 5 and 3 for values of *m* that are close to one. The distinction between random and fixed *m* affects the definition of 𝒫_{j,u}. For example, when all *Y*’s have the same distribution, 𝒫_{j,u} has one expression under the fixed-*m* assumption; with the binomial *m*, it is instead a mixture of such expressions over the possible numbers of true effects, where *k* is the total number of *P*-values and *w*_{m} are the binomial probabilities of observing *m* true effects.

### Estimation of rFDR

The model-based rFDR is the expected proportion of false discoveries, and it can be computed by Equation 3 using sample sizes, an assumed number of true effects, and effect magnitudes. Its main application is the planning of studies. For example, one can estimate the sample size *N* required for the proportion of true effects to be reasonably high among the best 100 *P*-values. Alternatively, if *N* is given, one can estimate the number of best results to follow up on. On the other hand, the sum of probabilities under the expectation operator in Equation 5 involves the random ordered *P*-values that could have been observed in any particular experiment. Therefore, a *study-specific* rFDR among the *u* most significant *P*-values can be estimated by Equation 6.

This requires a specification of the *P*-value density for true associations, *f*_{Y}(⋅), which is now given. Asymptotic normality of a test statistic can often be assumed, and a parametric distribution such as a normal or a chi square is often justified. Conditional on γ, the CDF of the *P*-value for true associations can be modeled as *F*(*p* | γ) = 1 − *G*_{γ}(*G*_{0}^{−1}(1 − *p*)) (Equation 7), where *G*_{0}(⋅) and *G*_{γ}(⋅) denote the CDFs of the test statistic under the null and the alternative hypothesis, and γ is a parameter that governs power, such as the noncentrality parameter of a chi-square statistic. The noncentrality is computed by substituting expected frequencies into the chi-square statistic. For example, a usual chi-square test that compares allele frequencies between two groups, such as cases and controls, with sample sizes *N*_{1} and *N*_{2} has the noncentrality γ = (*q*_{1} − *q*_{2})^{2}/[q̄(1 − q̄)(1/(2*N*_{1}) + 1/(2*N*_{2}))], where *q*_{1} and *q*_{2} are the case and control allele frequencies and q̄ = (*N*_{1}*q*_{1} + *N*_{2}*q*_{2})/(*N*_{1} + *N*_{2}). In a logistic regression model, the noncentrality is approximately γ ≈ *N*β^{2}*q*(1 − *q*) (Equation 8), where β is the assumed log odds ratio, *q* is the population allele frequency, and *N* is a weighted harmonic mean of *N*_{1} and *N*_{2} with weights *q*_{1}(1 − *q*_{1}) and *q*_{2}(1 − *q*_{2}). For a common value of γ, the density *f*_{Y}(⋅) in Equation 4 can be found as follows. First, we define the *P*-value density for a fixed γ, *f*(*p* | γ) = *g*_{γ}(*G*_{0}^{−1}(1 − *p*))/*g*_{0}(*G*_{0}^{−1}(1 − *p*)), where *g*_{γ}(⋅) is the density that corresponds to the distribution *G*_{γ}(⋅). Assuming some distribution of effect sizes among true associations, *h*(γ), the density of *P*-values for true associations is *f*_{Y}(*p*) = ∫ *f*(*p* | γ)*h*(γ)*d*γ (Equation 9). Park *et al.* (2010) proposed a method to estimate both the number of true effects (*m*) and their distribution in large-scale association studies. They also reported tabulated frequencies of effect sizes for several diseases. In their approach, the effect size is expressed as ES = 2β^{2}*q*(1 − *q*); thus, given a sample size *N*, an empirical distribution of the noncentrality can be obtained from the distribution of ES by Equation 8, *i.e.*, γ = ES × *N*/2. A probability distribution, such as a gamma, can be fitted to the empirical distribution of effect sizes. This gives *h*(γ), and further, when *k* is the total number of *P*-values in a study, Pr(H_{0}) can be taken as 1 − *m*/*k*.
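The two noncentrality recipes above can be written as small helpers. Both follow the expressions as reconstructed in the text (the allele-count chi-square noncentrality with pooled frequency q̄, and γ = ES × *N*/2 with ES = 2β²*q*(1 − *q*)), so they should be read as a sketch rather than the authors' exact code.

```python
def ncp_allele_test(q1, q2, n1, n2):
    """Noncentrality of the 1-df chi-square comparing allele frequencies
    between n1 cases and n2 controls (2*n1 and 2*n2 alleles)."""
    q_bar = (n1 * q1 + n2 * q2) / (n1 + n2)  # pooled allele frequency
    return (q1 - q2) ** 2 / (q_bar * (1 - q_bar) * (1 / (2 * n1) + 1 / (2 * n2)))

def ncp_logistic(beta, q, n):
    """Approximate logistic-regression noncentrality via the Park et al.
    effect size: gamma = ES * N / 2 with ES = 2 * beta^2 * q * (1 - q)."""
    es = 2 * beta ** 2 * q * (1 - q)
    return es * n / 2

print(round(ncp_allele_test(0.25, 0.20, 2000, 2000), 2))
print(round(ncp_logistic(0.206, 0.2, 2000), 2))
```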

When effect sizes are defined as in Park *et al.* (2010), ranking probabilities are related to the rFDR as given in Equation 10. Alternatively, it is possible to specify a prior distribution for β (Wakefield 2007), in which case the ES and the noncentrality distribution will depend on the allele frequency at a particular SNP through the term *q*(1 − *q*). In this case, Equation 10 will no longer hold, because rankings by *P*-values and by posterior probabilities will not always be the same. See Wakefield (2008, 2009) for an illuminating discussion of rankings by *P*-values and by Bayesian measures of association.

### rFDR and ranking probabilities under LD

To account for LD among *L* nonassociated SNPs, we utilize the concept of the “effective number of tests”, denoted by *L*_{e}. Due to LD, *L*_{e} is generally smaller than *L*, and accordingly we need to consider an “effective rank”, *i*_{e}, as well. The effective rank parameter was considered previously by Dudbridge in the context of false discovery rate estimation (F. Dudbridge, unpublished results). Dudbridge and Gusnanto (2008) showed that the effective number of tests in a GWAS may exist and is specific to a given set of SNPs and a population. The effective number of tests in GWAS has been considered primarily in multiple-testing applications, as a means to mitigate the conservativeness of adjusting the significance level by the number of SNPs (Dudbridge and Gusnanto 2008; Moskvina and Schmidt 2008; Pe’er *et al.* 2008; Han *et al.* 2009; Gao *et al.* 2010). Theoretically, *L*_{e} is related to the “extremal index”, θ, which arises as a parameter in the asymptotic distribution of the maximum order statistic (Leadbetter *et al.* 1983). *Appendix B* details the calculation of *L*_{e}. Ranking probabilities that incorporate *L*_{e} can be estimated with the approximation given in Equations 11 and 12, where *Q*_{1/2}(*u*_{e} − *j* + 1, *L*_{e} − *u*_{e} + *j*) is the median of a Beta(*u*_{e} − *j* + 1, *L*_{e} − *u*_{e} + *j*) distribution. When LD is present but *L*_{e} cannot be estimated or guessed, we recommend the approximation in Equation 13. A guess value of *L*_{e} can often be taken because, as we will show, the effect of LD on ranking probabilities is generally small and affects only probabilities at small values of *u*. Moreover, all three equations (Equations 1, 12, and 13) give similar results as *u* becomes larger, even when *L*_{e}/*L* is small. The rFDR estimates are similarly robust in the presence of LD, due to their relation to ranking probabilities (Equation 10).
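A minimal sketch of the median-based approximation follows, assuming *j* = 1 and a single 1-df noncentral chi-square alternative (so that *F*_{Y(1)} is just the power curve). The rescaling *u* × *L*_{e}/*L* used for the effective-rank variant is an assumption standing in for the paper's Equation B2, not the authors' exact formula.

```python
from scipy.stats import beta, chi2, ncx2

def rank_prob_median(u, L, ncp, j=1, L_e=None):
    """Median-based approximation: probability that at least j (here j=1)
    true associations rank among the top u of L null P-values. If L_e is
    given, u and L are rescaled to effective counts (u * L_e / L is an
    assumed stand-in for the effective rank)."""
    if L_e is not None:
        u, L = u * L_e / L, L_e
    # Median of Beta(u - j + 1, L - u + j) approximates the relevant
    # null order statistic of uniform P-values.
    p_med = beta.median(u - j + 1, L - u + j)
    # F_Y at that point: power of a 1-df noncentral chi-square test.
    return 1 - ncx2.cdf(chi2.ppf(1 - p_med, df=1), df=1, nc=ncp)

probs = [rank_prob_median(u, L=500_000, ncp=15.0) for u in (1, 10, 100)]
print([round(p, 3) for p in probs])
```

As expected, the probability of capturing the true association increases with the number of top-ranked SNPs retained.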

### Simulations to evaluate accuracy of the rFDR estimation

Estimation of the rFDR by Equation 6 involves specification of the effect size distribution for the assumed number of true associations. These parameters were modeled after the empirical distribution for Crohn’s disease reported in Park *et al.* (2010). Park and colleagues’ method allows one to obtain a table of binned effect sizes (ES) with their respective frequencies. The tabulated distribution of the noncentrality for 142 susceptibility loci was obtained from the distribution of ES as γ = ES × *N*/2, assuming equal sample sizes for cases and controls (*N* = 2000). The probability distribution *h*(γ) in Equation 9 can be fitted to such frequency data. We assumed a gamma distribution for *h*(γ) and fixed the shape parameter at one, to allow a high proportion of very small effects. We also assumed that the distribution was truncated at the maximum observed noncentrality + 10. The scale parameter was fitted by minimizing the squared difference between quantiles of the empirical distribution and *h*(γ); alternatively, maximum-likelihood or method-of-moments estimation can be used. We assumed a 2-million-SNP GWAS with the extent of LD modeled to provide *L*_{e}/*L* = 0.4, as has been observed in HapMap data (Han *et al.* 2009). *Appendix C* gives details of the simulation of correlated *P*-values. In every simulation, we recorded which of the 15 smallest *P*-values were true associations. From this information, the actual proportion of true associations across simulations was computed for each of *P*_{(1)}, … , *P*_{(15)}. These values were compared to the values Pr(H_{0}|*p*_{(1)}), … , Pr(H_{0}|*p*_{(15)}), estimated by Equation 4 and averaged across simulations. The distribution of the actual total number of true associations among the 15 smallest *P*-values was compared to the estimates obtained from *P*-values by Equation 6. We performed >100,000 simulations to construct Table 1.
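The quantile-matching fit of *h*(γ) described above can be sketched as follows. The binned effect-size data here are synthetic stand-ins (the Crohn's disease table of Park *et al.* 2010 is not reproduced), the shape is fixed at 1 as in the text, and the truncation at the maximum noncentrality + 10 is ignored in this sketch.

```python
import numpy as np
from scipy.stats import gamma
from scipy.optimize import minimize_scalar

# Synthetic stand-in for a table of binned noncentralities with frequencies.
ncp_bins = np.array([1.0, 3.0, 6.0, 10.0, 15.0, 25.0])
freqs = np.array([60.0, 40.0, 20.0, 12.0, 7.0, 3.0])

# Midpoint empirical quantile levels for the binned distribution.
levels = (np.cumsum(freqs) - freqs / 2) / freqs.sum()

def loss(scale):
    # Squared distance between gamma quantiles (shape fixed at 1,
    # i.e. exponential) and the empirical bin values.
    return np.sum((gamma.ppf(levels, a=1.0, scale=scale) - ncp_bins) ** 2)

fit = minimize_scalar(loss, bounds=(0.1, 50.0), method="bounded")
print(round(fit.x, 2))
```

Because the loss is quadratic in the scale, the bounded minimizer converges reliably; maximum likelihood or the method of moments would be drop-in alternatives.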

### Simulations to evaluate accuracy of ranking approximations

We evaluated the accuracy of our approach using real GWAS data from a schizophrenia scan (Suarez *et al.* 2006). We also used simulated GWAS data with deliberately high, block-structured LD to assess the robustness of our approach in the presence of extreme LD and under conditions when the extremal index distribution (Equation B1) does not describe the asymptotic distribution of the minimum *P*-value. We considered the probability of capturing at least one true association and computed four types of ranking probabilities for each scenario:

1. Empirical (“gold standard”) ranking probability, denoted by “true” in Tables 2−4: To calculate empirical ranking probabilities, we used the following simulation approach. For each of 100,000 simulations, we shuffled the affection status in the GWAS data set and computed the trend test and its *P*-value for each nonassociated SNP. *P*-values for true associations were obtained from data generated according to the genetic models described below. Then we ranked the *P*-values for both associated and nonassociated SNPs and recorded the ranks of the true associations. On the basis of all simulations, we estimated the empirical probability that at least one true association will rank below the *i*th false positive, Pr(min(*Y*) ≤ *X*_{(i)}). The same probabilities can also be expressed in terms of ranking below the *i*th *P*-value rather than below the *i*th false positive (Equation A1).

2. Median-based approximate ranking probability that accounts for the LD among nonassociated SNPs, denoted by “*L*_{e}-based” in the tables (Equation 14): To obtain a sample of minimum *P*-values for estimating the effective number of tests used in this approximation, we used the simulation approach just described. At each simulation, we recorded the minimum *P*-value for nonassociated SNPs and used Equation B4 to estimate *L*_{e}.

3. Median-based approximate ranking probability assuming independence of nonassociated SNPs, denoted by “median-based” in the tables (Equation 15).

4. Mean-based approximate ranking probability assuming independence of nonassociated SNPs, denoted by “mean-based” in the tables (Equation 16).

We conducted four simulation experiments, the first two based on the schizophrenia scan, in which we retained 701,080 SNPs with a minor allele frequency of at least 0.025.

In the first experiment, we added five independent true associations with the same effect size, assuming a multiplicative model of disease risk for each SNP. In addition, we assumed multiplicative risk for the joint effect of the five SNPs. We considered the Armitage trend test and retrospective sampling with 1301 cases and 1300 controls. We computed the noncentrality parameter (γ in Equation 7) by the formula given in Ahn *et al.* (2006). We assumed that each true association is in Hardy–Weinberg equilibrium (HWE) in the population, with a risk allele frequency of 0.2. We used the base allele risk *a* = 0.05 and considered the log relative risk ln *R* = 0.206. In this model, the susceptibilities for each of the five true associations are Pr(case | *g*_{i}) = *a*^{2}*R*^{i}, where *g*_{i} is the genotype with *i* copies of the risk allele. Probabilities of genotypes for the cases and for the controls were obtained by Bayes’ rule. These probabilities were used to obtain the actual case and control genotypes via multinomial sampling. This setup was used to construct Table 2. The extremal index for this experiment was estimated using Equation B4.

In the second experiment, we modeled 11 SNPs in a region harboring an “unknown” causal mutation (that is, the causal SNP was removed from the association analysis). The 12-SNP haplotype frequencies, including the mutation, were modeled on the basis of the haplotype frequencies of the μ-opioid receptor (Shabalina *et al.* 2009). These haplotype frequencies were used to create samples of haplotypes, which were randomly paired to create diploid individuals. The fifth SNP, with a minor allele frequency of 0.24, was chosen to be the causal variant. This causal SNP contributed additively to a quantitative trait value, assuming different trait means for the two alleles, *m*_{1} = 10 and *m*_{2} = 20, and a normally distributed random contribution with standard deviation σ = 76.72 [this value was chosen to give an empirical probability, Pr(min(*Y*) ≤ *X*_{(1)}), similar to that obtained in the first experiment]. For example, the trait value for an individual that was heterozygous at the mutation was modeled as *m*_{1} + *m*_{2} + *N*(0, σ^{2}). The 12-SNP haplotype frequencies determine the pairwise LD of the causal variant with the 11 markers. The 11 correlation coefficient values of LD (Weir 1996) were 0.15, 0.17, 0.17, 0.24, 0.63, 0.78, 0.78, 0.79, 0.94, 0.95, and 0.96. The causal SNP was removed from the data. Regression *F*-test *P*-values for association of the number of copies of the minor allele in an individual (0, 1, 2) with the trait were computed for all SNPs. The function *F*_{min(Y)}(⋅) needed to compute the ranking approximations was estimated by the empirical distribution function (using the “ecdf” function in R) from a sample of minimum *P*-values for the 11 SNPs. Note that to estimate this function, one needs only a single large sample of minimum *P*-values. This setup was used to construct Table 3.
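The trait model in this experiment can be sketched in miniature: a single causal SNP without its LD partners, with the allele frequency, allele means, and residual standard deviation from the text. The sample size and random seed are arbitrary, and `linregress` is used because its slope t-test equals the regression F-test for a single predictor.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(7)

# Simplified setup: one causal SNP, allele frequency 0.24, additive allele
# means m1 = 10 and m2 = 20, residual standard deviation sigma = 76.72.
n, q, m1, m2, sigma = 2601, 0.24, 10.0, 20.0, 76.72
g = rng.binomial(2, q, size=n)  # copies of the minor (causal) allele
trait = (2 - g) * m1 + g * m2 + rng.normal(0.0, sigma, size=n)

# Regress the trait on the allele count (0, 1, 2).
res = linregress(g, trait)
print(round(res.pvalue, 4))
```

A heterozygote (g = 1) has expected trait value *m*_{1} + *m*_{2}, matching the additive model described above.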

The third experiment was designed to demonstrate that our approach remains valid even under unrealistically high, block-structured LD. For this experiment, we modeled correlated *P*-values directly by the method described in *Appendix C*. We considered a single true association, with a *P*-value derived from a chi-square statistic with the noncentrality parameter set to 18.42, to yield 10% power at the genome-wide significance level 0.05/*L*. We assumed 2 million nonassociated SNPs (*L* = 2 × 10^{6}) and performed 100,000 simulation replicates.

## Results

Table 1 gives the proportions of true positives among each of the 15 smallest *P*-values, assuming the effect size distribution and the number of disease loci estimated for Crohn’s disease (Park *et al.* 2010). There is some bias in the estimated values for low ranks, due to LD (the ratio of the effective number of tests to the actual number, *L*_{e}/*L*, was 0.4 in these experiments). This bias is entirely due to LD: the last two columns of Table 1, constructed under linkage equilibrium (LE), show that the estimates are unbiased in this case. The bias can be reduced by using the following *ad hoc* correction to the prior. For the *i*th *P*-value, we modified the original prior Pr(H_{0}) = 1 − *m*/(*L* + *m*) as shown in Equation 17, where *f*_{e} = *i*_{e}/(*L*_{e} + 1), *f*_{0} = *i*/(*L* + 1), and *i*_{e} is the effective rank (Equation B2). The rationale behind the correction is based on the relation Pr(min(*Y*) < *X*_{(1)}) = 1 − *E*{Pr(H_{0} | *p*_{(1)})}. The minimum false association *X*_{(1)} on the left-hand side behaves approximately as the minimum of only *L*_{e} *P*-values. Thus, for *P*_{(1)}, we may adjust the prior probability used to evaluate the right-hand side by an amount that is approximately equal to *L*/*L*_{e}. Further, we want the adjustment factor to approach 1 quickly as the rank increases. Figure 1 shows the distribution of the actual total number of true associations among the 15 smallest *P*-values, the distribution of estimated values obtained from *P*-values by Equation 6, and the distribution of their difference. There is a good correspondence in the mean values of the true and the estimated distributions. The difference distribution is centered near zero, at 0.21 (−0.02 if the “corrected” prior is used instead). There are about four true positives among the 15 smallest *P*-values, on average.

We note that in experiments with no true associations, the method still predicts about two associations on average, because the prior wrongly assumes 142 disease associations; however, the probability of predicting ≥4 associations is <10%. The model-based calculation assuming independence (the approximation in Equation 3) shows that among the 10 smallest *P*-values, 31% are expected to be true positives. The true value, observed in simulations, is only slightly higher (32%). Among the first 100 *P*-values, only 7% are expected to be true positives; this is ∼5% of the total number of true associations, 142. Using the ranking probability in Equation 13, we can estimate that for the probability of capturing at least 5 associations to be 80%, one would need to follow up on the 80 smallest *P*-values (𝒫_{5,80} ≈ 0.8). To capture at least 10 associations, one would need to follow up on the 550 strongest signals (𝒫_{10,550} ≈ 0.8).

Rank-based calculations can be related to a more common calculation of statistical power. With the assumed parameters for the true associations, the α-level for detecting at least one true association with 90% power would have to be set at 1.1/*L*_{e}. This is considerably higher than what would be given by multiple-testing thresholds that take into account just the number of tests, such as α = 0.05/*L*_{e}, and therefore false positives are also expected with this approach. A comparable calculation with the rank-based approach is to find the minimum *u* such that 𝒫_{1,u} ≥ 0.9, which gives *u* = 3.

The main purpose of Tables 2−4 is to illustrate the accuracy of our approximations for ranking probabilities. Table 2 presents results for the 700K schizophrenia scan with five independent true associations. By comparing the empirical probability with the approximate probabilities based on our approach, we studied the ranking probability that at least one of these true associations will rank among the top *i* false positives (Equation A1 gives a conversion formula for ranking in terms of the top *u* *P*-values, rather than the top *i* false positives). Empirical probabilities are estimates of the true probabilities, since they were obtained by sorting the actual *P*-values, computed via the Armitage trend test applied at each simulation run to all 700K SNPs in LD. Empirical ranking probabilities are very similar to the approximate values in the second column, computed by the approximation that incorporates LD. Thus, our approach based on the effective number of tests accounts well for LD in these data. It is apparent that ignoring the LD has very little effect on ranking probabilities for ranks as low as 10 (Table 2, columns 3 and 4). Use of either the median or the mean of the beta distribution (Equations 15 and 16) yields similar ranking probabilities; the mean-based approximation yields values somewhat closer to the empirical ones. Similar conclusions were reached for probabilities of capturing all, rather than at least one, of the true associations (data not shown). Results in Table 3 were obtained using the schizophrenia GWAS data as well. Here, we considered 11 correlated SNPs in a region harboring an untyped causal variant and a continuous phenotype. We reached similar conclusions with this setup.

In addition to the results presented for the schizophrenia data set, we also evaluated our approach using two GWAS data sets for breast cancer and prostate cancer (Hunter *et al.* 2007; Yeager *et al.* 2007). Ranking probabilities obtained in these analyses were very similar to those obtained for the schizophrenia data (data not shown). Interestingly, the ratio of the estimated effective number of tests to the number of SNPs was very similar for the breast cancer and the prostate cancer data sets, which both consisted of Caucasian individuals and shared most of the SNPs. This finding reaffirms the conclusions of Dudbridge and Gusnanto (2008), who found consistency of the *L*_{e} estimates obtained for the two control data sets of the Wellcome Trust Case Control Consortium (2007) (WTCCC) panel. The two corresponding extremal index values for the WTCCC data are similarly close: 0.58 and 0.59.

These observations suggest that once an estimate of the effective number of tests is obtained for a specific panel of SNPs, it can be reused for GWAS utilizing similar SNPs and samples from populations of similar ancestry. Additionally, the two different designs that we used resulted in similar values of *L*_{e}. For the schizophrenia data, we obtained *L*_{e}/*L* = 0.590 for the retrospective design with a case–control trend test. A similar estimate, *L*_{e}/*L* = 0.610, was obtained for a prospective design with a regression test and a continuous trait.

In Table 4, we focused on the probability that a true association will rank among the top-*i* highly correlated false positives. In these simulations, the effective number of tests was only 6.4% of the actual number of tests, and the behavior of the minimum *P*-value was incompatible with the extremal index distribution given by Equation B1 as judged by the Kolmogorov–Smirnov test. Again, we found that the *L*_{e}-based approximation gave ranking probabilities that were very close to the empirical values across the studied range of *i* values. Approximations that ignore LD underestimate ranking probabilities for low ranks. Nevertheless, as *i* increases, these estimates quickly catch up with the empirical ranking probability. These results are reassuring in light of the fact that ranked *P*-values from this setup do not follow the extremal index distribution. Similar results were observed for relatively small numbers of tests (not shown).

## Discussion

We propose ranking-based strategies for the design of large-scale association studies and for following up on the best-ranking associations. With our approach, one can determine an optimal number of SNPs to carry forward into a replication study, on the basis of the sample size of the study and the assumed model for effect sizes. This model-based step (step 1) is based on calculating the proposed rFDR by Equation 3. It is reminiscent of a power calculation in that the results of association tests are not used in this step. The number of follow-up SNPs determined this way represents an expectation across multiple studies (Equation 10). In step 2, the smallest *P*-values actually observed in a given study are used to estimate the study-specific rFDR (Equation 6). One would use this equation to find the largest value of *u* for which the estimated rFDR is at the desired level. Then the *u* SNPs with the smallest *P*-values are followed up on, plus any additional ones, as determined in step 1. The reason for following up on a potentially larger, model-based number, given by step 1, is the sampling variability of study-specific estimates, illustrated in Figure 1. The averages of true and estimated values (means of the left and the middle histograms of Figure 1) correspond very well, but there is variation in both true and estimated values, as well as in their difference (given by the histogram on the right). There is a chance that none of the best results would have a high true-association probability, either because none of them are true associations or because the probability estimates happened to be incorrect. But one would not discard the results of a large-scale study because too few or none of the *P*-values came out significant. SNP densities currently in use are sufficiently high to ensure coverage of truly associated regions. Even if none of the association tests in a GWAS reach statistical significance, SNPs scoring at the top still constitute putative associations.

At first glance, it may appear that, unlike *P*-value adjustment methods, the methods that we advocate have a disadvantage in that they depend on prior parameters, namely on the effect size distribution and the number of trait loci. However, in our view, proper applications of *P*-value adjustments must rely on the same assumptions. Multiplicity adjustments correct the raw α-level by the number of tests, but the α-level itself must be chosen on the basis of power calculations that are affected in much the same way by the assumed number of effects, their distribution, and the sample size of the study. Two studies with different sample sizes or different numbers of trait loci will have different expected proportions of true associations with *P*-values below the same α-level. The relation of our approach to power can be seen more formally from our approximations: the expected value of the estimated posterior probability that the minimum *P*-value represents a true positive is equal to the power at the significance level 1/(2*L*). Thus, in applications of *P*-value adjustment methods, the same prior parameters must be taken into account in decisions regarding an appropriate significance threshold for a study. Moreover, with our approach, it is straightforward to prioritize association results by simply assigning relatively higher prior probabilities of association to SNPs that belong to candidate pathways.
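The stated connection to power can be illustrated numerically. This minimal sketch computes the power of a 1-d.f. chi-square association test at the significance level 1/(2*L*); the values of *L* and of the noncentrality `lam` are arbitrary illustrative assumptions.

```python
# Numeric illustration of the power connection stated above: power of a
# 1-d.f. chi-square test at significance level 1/(2L).  L and the
# noncentrality lam are illustrative choices, not values from the study.
from scipy.stats import chi2, ncx2

L = 500_000            # number of SNPs tested
lam = 30.0             # assumed noncentrality (effect size x sample size)
alpha = 1.0 / (2 * L)  # significance level 1/(2L)

crit = chi2.isf(alpha, df=1)          # critical value at level 1/(2L)
power = ncx2.sf(crit, df=1, nc=lam)   # power of the noncentral test
print(f"alpha = {alpha:.2e}, power = {power:.3f}")
```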

Many existing methods used for sample and marker allocation to single- and multistage GWAS assume independence of SNPs. The analytic approach described here explicitly accommodates LD. Although simulation techniques can also be used to model LD, typically such a technique would have to utilize data from a particular GWAS. This limits applications of the simulation approach to the design of future studies, because the data required for simulations would have yet to be collected. A virtue of our approach is that it provides insights about the effect of LD on the ranks of true positives. In particular, it reveals consequences of the independence assumption, which have been poorly understood. Our results show that the approximation based on the effective number of tests in a GWAS (*L*_{e}) successfully accounts for the effect of LD between *L* SNPs. We found this method to work well even when the minimum *P*-value does not asymptotically follow the distribution predicted by the extremal index theory. This observation is explained by the fact that the validity of our approach depends only on similarity of the medians of the theoretical and the actual distributions, rather than on similarity of the entire distributions. In fact, *L*_{e} does not need to exist when interpreted in the conventional sense as a parameter governing the behavior of the smallest *P*-value. The θ = *L*_{e}/*L* values appear to be very similar for similar panels of SNPs and samples from populations of similar ancestry. Our estimates of θ for two different samples genotyped on the same platform are within 1% of each other. Similarly, there is only a 1% difference between the θ-estimates for the two control samples from the WTCCC data (Dudbridge and Gusnanto 2008). Further, we showed theoretically and confirmed via applications to GWAS data that as the rank value increases, the effect of LD on the ranking probability quickly becomes negligible. This reaffirms and generalizes our earlier observation from simulation studies that ranks of true associations appear to be little affected by LD (Zaykin and Zhivotovsky 2005). This surprising property holds even when the extent of LD is unrealistically high. In a GWAS with *L* SNPs, LD for moderate values of ranks can be ignored as long as 1/*L* << *L*_{e}/*L*. In practical applications, the ranking probability may be set to some high value, *P**, such as 0.90, and one would then compute the value of the rank *i* such that the ranking probability is ≥ *P**. Unless the resulting value of *i* is small, ranking probabilities computed under the independence assumption are very close to the exact values. Even for low values of ranks, the bias due to the independence assumption is small.
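The rank search just described can be sketched under the independence assumption, using the median-based approximation of Appendix A (F_Y evaluated at the median of Beta(*i*, *L* − *i* + 1)). The noncentrality and *L* values are illustrative assumptions, and `rank_for_probability` is a hypothetical helper name.

```python
# Sketch of the rank search described above: increase the rank i until the
# independence-based probability that a single true association ranks among
# the top i reaches the target P* = 0.90.  Uses the median-based
# approximation F_Y(median of Beta(i, L - i + 1)) from Appendix A.
from scipy.stats import beta, chi2, ncx2

def assoc_cdf(p, lam):
    """F_Y(p): probability that a true-association P-value is <= p,
    for a 1-d.f. chi-square test with noncentrality lam."""
    return ncx2.sf(chi2.isf(p, df=1), df=1, nc=lam)

def rank_for_probability(target, L, lam, i_max=10_000):
    for i in range(1, i_max + 1):
        q_med = beta.median(i, L - i + 1)  # median of i-th null order statistic
        if assoc_cdf(q_med, lam) >= target:
            return i
    return None

i_star = rank_for_probability(0.90, L=500_000, lam=25.0)
print(i_star)
```

Because the ranking probability is increasing in *i*, the loop returns the smallest rank achieving the target.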

The rank-based approach can be used for determination of sample size while planning a study. Just as with a power calculation, one would assume an effect size and the number of tests. Instead of determining a sample size *N* needed for a true association *P*-value to be as small as α/*L* with 90% probability (as in a power calculation), one could determine *N* such that at least *j* true association *P*-values will end up among a specified number (*u*) of best results with 90% probability. This gives the ranking probability, *P*_{j,u}; alternatively, the rFDR can be controlled by the sum of *P*_{j,u} over *j* = 1, . . . , *u* (Equation 3).
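A minimal sketch of this sample-size use, assuming the noncentrality of the 1-d.f. test scales linearly with *N* (the per-individual noncentrality `w` is an illustrative assumption) and using the independence-based median approximation from Appendix A:

```python
# Sketch of the sample-size determination described above: find the smallest
# N such that a true association ranks among the top u of L SNPs with
# probability >= 0.90.  Noncentrality of the 1-d.f. test is modeled as N*w,
# where w is an illustrative per-individual noncentrality.
from scipy.stats import beta, chi2, ncx2

def ranking_prob(N, u, L, w):
    q_med = beta.median(u, L - u + 1)   # median of the u-th null order statistic
    return ncx2.sf(chi2.isf(q_med, df=1), df=1, nc=N * w)

def sample_size(target, u, L, w, N_max=1_000_000):
    lo, hi = 1, N_max
    while lo < hi:                      # bisection on the monotone power curve
        mid = (lo + hi) // 2
        if ranking_prob(mid, u, L, w) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo

N = sample_size(0.90, u=100, L=500_000, w=0.01)
print(N)
```

Bisection is valid here because the ranking probability is monotone in the noncentrality, hence in *N*.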

In summary, our ranking approach provides a simple and intuitively appealing framework for planning and analysis of large-scale association studies. This approach is broadly applicable due to its robustness in the presence of extreme and heterogeneous LD. Although our approximations are mainly developed assuming a large number of SNPs, they work well when the number of SNPs is small, as found in candidate gene studies. These features make our approach well suited for studies with different marker densities and LD patterns, including studies utilizing next-generation sequencing data. For practical use, we provide software that allows one to estimate ranking probabilities and the number of SNPs that would contain true associations with a specified probability, as well as to plan discovery and replication stages in GWAS.

## URLs

Software implementing the methods described here is available at the National Institute of Environmental Health Sciences website (http://www.niehs.nih.gov/research/atniehs/labs/bb/staff/zaykin/index.cfm) or by request to the authors.

## Acknowledgments

We benefited from discussion with Shyamal Peddada and David Umbach and from useful comments by Gary Churchill and two anonymous reviewers. This research was supported by the Intramural Research Program of the National Institutes of Health, National Institute of Environmental Health Sciences.

## Appendix A: Approximations for Ranking Probabilities

We consider a set of true association *P*-values, {*Y*_{1}, . . . , *Y*_{m}}, and a set of false association *P*-values, {*X*_{1}, . . . , *X*_{L}}, that are ordered together. The ranking probability as defined in *Methods* is equivalent to the probability that the *j*th true association *P*-value will rank below the *P*-value with the rank *u* ≤ *L*; *i.e.*, Pr(*Y*_{(j)} ≤ *P*_{(u)}). We can write the ranking probability in a mathematically convenient way that separates true and false associations:

$$\Pr\left(Y_{(j)} \le P_{(u)}\right) = \Pr\left(Y_{(j)} \le X_{(u-j+1)}\right).\tag{A1}$$
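The equivalence of the two events can be checked by simulation; the distributions chosen below for the true and false association *P*-values are arbitrary illustrative choices.

```python
# Monte Carlo check of the event identity above: the j-th smallest
# true-association P-value ranks within the top u of the combined set
# exactly when Y_(j) <= X_(u-j+1).  True associations ~ Beta(0.2, 1) and
# nulls ~ Uniform(0, 1) are illustrative choices only.
import numpy as np

rng = np.random.default_rng(1)
m, L, j, u = 5, 1000, 2, 20
hits_rank, hits_ident = 0, 0
for _ in range(2000):
    y = np.sort(rng.beta(0.2, 1.0, size=m))    # true-association P-values
    x = np.sort(rng.uniform(size=L))           # null P-values
    combined = np.sort(np.concatenate([y, x]))
    hits_rank += y[j - 1] <= combined[u - 1]   # event Y_(j) <= P_(u)
    hits_ident += y[j - 1] <= x[u - j]         # event Y_(j) <= X_(u-j+1)
print(hits_rank, hits_ident)
```

With continuous distributions the two indicators agree on every replicate, so the two counts are identical.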

A random *i*th-ordered *P*-value among nonassociated SNPs is denoted by *X*_{(i)}. Its cumulative and probability distribution functions (CDF and PDF) are denoted by $F_{X_{(i)}}(\cdot)$ and $f_{X_{(i)}}(\cdot)$, respectively. Similarly, *F*_{Y}(⋅) and *f*_{Y}(⋅) are the CDF and the PDF for a true association. When *Y*’s are independent and have the same distribution, the CDF of *Y* is given by Equation 7, and the CDF of the *j*th-ordered *Y* has the standard form

$$F_{Y_{(j)}}(p) = 1 - \text{Binom}\left(j-1;\; m,\; F_Y(p)\right),$$

where Binom denotes the binomial CDF evaluated at *j* − 1 successes in *m* trials with the success probability *F*_{Y}(*p*). Otherwise, the distribution can be estimated by the empirical CDF from a sample of *P*-values obtained under a suitable association model. For *O* simulation samples, an *O* × *m* matrix of *P*-values (**P**) can be obtained. After the rows of **P** are sorted, $F_{Y_{(j)}}$ is given by the empirical CDF (*e.g.*, by the “ecdf” function of R) applied to the *j*th column of the matrix. If *Y*_{1}, … , *Y*_{m} are independent, but different effect sizes are assumed, the corresponding *P*-values can be sampled directly, provided the distribution of the test statistic is standard, such as chi square:

$$Y_i = 1 - F_{\chi^2}\!\left(F^{-1}_{\chi^2(\lambda_i)}(1-U)\right), \quad i = 1, \ldots, m,$$

where *U* is a random number from the Uniform(0, 1) distribution, $F_{\chi^2}$ is the central chi-square CDF, and $F_{\chi^2(\lambda_i)}$ is the noncentral chi-square CDF with noncentrality $\lambda_i$. This formula can be obtained by inverting Equation 7.

We start with an independence assumption, namely that nonassociated SNPs are themselves independent. Considering a single true association for now, the probability that the *P*-value for a truly associated SNP will rank among the *i* smallest null *P*-values is given by

$$\Pr\left(Y \le X_{(i)}\right) = \int_0^1 \left[1 - F_{X_{(i)}}(y)\right] f_Y(y)\, dy,\tag{A2}$$

where $F_{X_{(i)}}$ is the CDF of Beta(*i*, *L* − *i* + 1), if the *L* nonassociated *P*-values are independent and continuous. When *L* = 1, this probability equals one minus the expectation of *Y*, μ_{Y}, which was termed the “expected *P*-value” and advocated as a sensible alternative to a power calculation by Sackrowitz and Samuel-Cahn (1999). Although it is possible to evaluate Equation A2 by numerical integration or by simulation, a useful and precise approximation is obtained as follows. First, by means of changing the order of integration, we can rewrite Equation A2 as

$$\Pr\left(Y \le X_{(i)}\right) = \int_0^1 F_Y(x)\, f_{X_{(i)}}(x)\, dx = E\left[F_Y\!\left(X_{(i)}\right)\right].\tag{A3}$$

This operation replaces the random variable *Y* on the right-hand side of Equation A2 with the order statistic of the uniformly distributed random variables *X*. By using the first-degree Taylor series approximation about *E*(*X*_{(i)}), we obtain

$$\Pr\left(Y \le X_{(i)}\right) \approx F_Y\!\left(E\left[X_{(i)}\right]\right) = F_Y\!\left(\frac{i}{L+1}\right)\tag{A4}$$

$$\approx F_Y\!\left(Q_{1/2}(i,\, L-i+1)\right)\tag{A5}$$

$$\approx F_Y\!\left(\frac{i-1/3}{L+1/3}\right),\tag{A6}$$

where *Q*_{1/2}(*i*, *L* − *i* + 1) is the median of Beta(*i*, *L* − *i* + 1).
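A quick numeric check of these approximations, comparing the exact expectation in Equation A3 (evaluated by quadrature after substituting the quantile function of the beta order statistic) with the mean-based and median-based approximations. The model for *F*_{Y} (1-d.f. noncentral chi square) and all parameter values are illustrative assumptions.

```python
# Numeric check: exact value of Equation A3, E[F_Y(X_(i))] with
# X_(i) ~ Beta(i, L - i + 1), against the mean-based (A4) and
# median-based (A5) approximations.  Parameters are illustrative.
from scipy.integrate import quad
from scipy.stats import beta, chi2, ncx2

L, i, lam = 10_000, 10, 20.0

def F_Y(p):
    """CDF of a true-association P-value under a 1-d.f. noncentral
    chi-square model with noncentrality lam."""
    return ncx2.sf(chi2.isf(p, df=1), df=1, nc=lam)

b = beta(i, L - i + 1)
# E[F_Y(X_(i))] = integral of F_Y(ppf(t)) dt over (0,1): smooth integrand.
exact, _ = quad(lambda t: F_Y(b.ppf(t)), 0.0, 1.0)       # Equation A3
mean_based = F_Y(i / (L + 1))                            # Equation A4
median_based = F_Y(b.median())                           # Equation A5
print(exact, mean_based, median_based)
```

The quantile substitution avoids integrating a sharply peaked beta density directly.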

## Appendix B: Approximations for Ranking Probabilities Under LD

For a variety of stationary stochastic processes with distance-decaying dependencies, the distribution of the maximum of a large number of observations {*T*_{i}} is Pr(max(*T*) ≤ *t*) = Pr(*T* ≤ *t*)^{θL}. The independence is a special case with θ = 1. Considering the minimum of *L* *P*-values, {*X*_{i}}, we can let *T*_{i} = 1 − *X*_{i}, and max(*T*) = 1 − min(*X*). Then the asymptotic distribution of min(*X*) under no association is

$$\Pr\left(\min(X) \le x\right) = 1 - (1-x)^{\theta L},\tag{B1}$$

which is the CDF of Beta(1, θ*L*). Thus, the effective number of tests is *L*_{e} = θ*L*. The effective rank is obtained by scaling the original rank *i* between 1 and *L*_{e}:

$$i_e = 1 + (i-1)\,\frac{L_e - 1}{L - 1}.\tag{B2}$$

The ranking probability is then obtained as

$$\Pr\left(Y \le X_{(i)}\right) \approx F_Y\!\left(Q_{1/2}(i_e,\, L_e - i_e + 1)\right).\tag{B3}$$

There is a concern that assumptions of the extremal index theory, such as stationarity, may not be satisfied at high SNP densities. Indeed, Dudbridge and Koeleman found that at HapMap densities, fit to a beta distribution may become inadequate (Dudbridge and Koeleman 2004). However, our approach does not require that the minimum *P*-value follow a beta distribution: only its median needs to be approximately equal to that of a Beta(1, θ*L*) distribution. Our requirement is much weaker than the distributional assumption, and we will verify that our approach works well even in those cases where the effective number of tests does not exist in the sense defined by the asymptotic distribution in Equation B1. Dudbridge and Gusnanto (2008) gave the method of moments estimator for *L*_{e} as $\hat{L}_e = 1/\bar{p} - 1$, where $\bar{p}$ is the sample average of minimum *P*-values under the null hypothesis. It is also straightforward to derive the maximum-likelihood estimator (MLE) as

$$\hat{L}_e = -\frac{k}{\sum_{i=1}^{k} \ln(1 - p_i)},\tag{B4}$$

where {*p*_{i}} is a sample of *k* minimum *P*-values. In practice, one would repeat the following procedure *k* times to obtain that sample: permute the affection status of individuals, perform an association test for all SNPs, and record the minimum *P*-value. Usage of the effective number of tests approach has an advantage in that once an estimate of *L*_{e} is obtained, it can be reused for computing ranking probabilities for a different data set that utilizes a similar set of SNPs and a sample from a population of similar ancestry. It is also possible to fit a two-parameter beta distribution and obtain MLEs for the two parameters, as suggested by Dudbridge and Gusnanto (2008). These estimates can be used to model the distribution in Equation A4. Ranking probabilities evaluated by this approach according to our simulation results were very close to those based on *L*_{e} (results not shown). On the basis of the form of Equation A5, we expect that the independence assumption may work well even though in practice *P*-values are in fact dependent, due to LD. The reason for this is the similarity of the medians *Q*_{1/2}(*i*, *L* − *i* + 1) and *Q*_{1/2}(*i*_{e}, *L*_{e} − *i*_{e} + 1) for a moderate *i* and a large *L* (*cf*. Equation A5). As *i* increases, *i*_{e} approaches θ*i*, and the median of an intermediate order statistic approaches its mean (Watts *et al.* 1982). Thus, assuming that *L* is large and 1/*L* << θ,

$$F_Y\!\left(Q_{1/2}(i_e,\, L_e - i_e + 1)\right) \approx F_Y\!\left(Q_{1/2}(i,\, L - i + 1)\right).\tag{B5}$$

We also expect that a simpler, mean-based approximation is adequate for intermediate values of *i*:

$$\Pr\left(Y \le X_{(i)}\right) \approx F_Y\!\left(\frac{i_e}{L_e + 1}\right)\tag{B6}$$

$$\approx F_Y\!\left(\frac{i}{L + 1}\right).\tag{B7}$$

One might be interested in the number of SNPs required to be genotyped in a follow-up study to ensure that the ranking probability reaches some high value, such as *P** × 100% = 99%. In other words, one might be interested in determining the index *i* in Equation A5 for the probability to be at least *P** = 0.99. A simple way to do that is to iterate *i* from one up to a value that gives the required probability. The mean-based approximation also allows one to determine *i* by solving Equation B6. It follows that $i_e = (L_e + 1)\, F_Y^{-1}(P^*)$, and from Equation B2, *i* = 1 + (*i*_{e} − 1)(*L* − 1)/(*L*_{e} − 1).
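Both estimators of *L*_{e} can be sketched and checked on simulated minimum *P*-values drawn from their asymptotic Beta(1, *L*_{e}) distribution; the true *L*_{e} below is an arbitrary illustrative value.

```python
# Sketch of the two L_e estimators given above: the method-of-moments
# estimator 1/mean(p) - 1 and the MLE -k / sum(ln(1 - p_i)) (Equation B4),
# applied to simulated minimum P-values from Beta(1, L_e).
import numpy as np

rng = np.random.default_rng(7)
L_e_true = 40_000.0
p_min = rng.beta(1.0, L_e_true, size=500)   # simulated null minimum P-values

L_e_mom = 1.0 / p_min.mean() - 1.0               # method of moments
L_e_mle = -len(p_min) / np.log1p(-p_min).sum()   # Equation B4 (MLE)
print(L_e_mom, L_e_mle)
```

In practice the sample {*p*_{i}} would come from permutation replicates rather than from a parametric draw.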

## Appendix C: Simulation of Correlated *P*-Values

For each simulation replicate, we sampled *P*-values for the true associations by the inverse of Equation 7, as $Y = 1 - F_{\chi^2}\!\left(F^{-1}_{\chi^2(\lambda)}(1-U)\right)$, where *U* is a random number from the Uniform(0, 1) distribution. We assumed the 1-d.f. chi-square statistic. *P*-values for nonassociated SNPs were sampled from the zero-mean Ornstein–Uhlenbeck diffusion process (OUP). The OUP diffusion was employed previously to address issues of multiple testing in genome scans (Lander and Botstein 1989; Lander and Kruglyak 1995; Zaykin and Zhivotovsky 2005). Let us denote a normal score for the *i*th SNP by *S*_{i}, *i* = 1, 2, … , 2 × 10^{6}. We simulated *S*_{i} for *i* > 1 by

$$S_i = S_{i-1}\, e^{-\lambda \Delta_i} + \varepsilon_i \sqrt{\frac{\sigma^2}{2\lambda}\left(1 - e^{-2\lambda \Delta_i}\right)},$$

where ε_{i} is a sample from the standard normal distribution, *N*(0, 1). This formula follows from considering the conditional distribution (*S*_{i} | *S*_{i−1}), which is normal with the mean $S_{i-1}\, e^{-\lambda \Delta_i}$ and the variance $\frac{\sigma^2}{2\lambda}\left(1 - e^{-2\lambda \Delta_i}\right)$ (Gillespie 1996). After {*S*_{i}} were simulated, we converted them to *P*-values using the unconditional limiting normal CDF of *S*_{i}, *N*(0, σ^{2}/(2λ)). The value *S*_{1} was simulated as a draw from this distribution. For simulations with extreme, block-patterned correlation we used λ = 0.5 and σ^{2} = 100. Each nonassociated SNP had a chance of 0.7 to be in an LD block. To model block-structured correlation, the location increment parameter, Δ_{i}, assumed one of two values: 0.033, with probability 0.3, or 0.001, with probability 0.7. This model resulted in extremely correlated *P*-values, with *L*_{e}/*L* = 0.064. On average, the autocorrelation decayed from 0.99 for neighboring SNPs to 0.5 for SNPs 187.5 kb apart. Between the blocks, the autocorrelation decayed to 0.1 in 205.5 kb. Within the blocks, there was very little decay of correlation: to 0.93 in a 200-kb block. To model studies with the HapMap ratio of *L*_{e}/*L* = 0.4, we changed the two Δ_{i} values to 0.1 and 0.07.
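A scaled-down sketch of this simulation, with far fewer SNPs than the 2 × 10^{6} used in the study; the use of two-sided *P*-values is an assumption, as the text does not specify sidedness.

```python
# Minimal sketch of the OUP-based simulation of correlated null P-values.
# The exact conditional update for S_i | S_{i-1} follows Gillespie (1996);
# the SNP count is reduced and two-sided P-values are an assumed choice.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n, lam, sigma2 = 10_000, 0.5, 100.0
sd_stat = np.sqrt(sigma2 / (2 * lam))   # stationary SD: sqrt(sigma^2 / (2 lambda))

# Block-structured location increments: rare large jumps separate LD blocks.
deltas = rng.choice([0.033, 0.001], p=[0.3, 0.7], size=n)

s = np.empty(n)
s[0] = rng.normal(0.0, sd_stat)         # S_1 drawn from the stationary distribution
for i in range(1, n):
    rho = np.exp(-lam * deltas[i])      # conditional mean coefficient e^{-lambda Delta}
    cond_sd = sd_stat * np.sqrt(1.0 - rho**2)
    s[i] = rho * s[i - 1] + cond_sd * rng.normal()

# Convert scores to P-values via the stationary normal CDF N(0, sigma^2/(2 lambda)).
p_values = 2.0 * norm.sf(np.abs(s) / sd_stat)
print(p_values[:5])
```

Neighboring scores are strongly autocorrelated under these increments, mimicking dense LD.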

## Footnotes

*Communicating editor: G. A. Churchill*

- Received May 9, 2011.
- Accepted June 13, 2011.

- Copyright © 2011 by the Genetics Society of America