## Abstract

Genome-wide association studies (GWASs) have been effectively identifying the genomic regions associated with a disease trait. In a typical GWAS, an informative subset of the single-nucleotide polymorphisms (SNPs), called tag SNPs, is genotyped in case/control individuals. Once the tag SNP statistics are computed, the genomic regions that are in linkage disequilibrium (LD) with the most significantly associated tag SNPs are believed to contain the causal polymorphisms. However, such LD regions are often large and contain many additional polymorphisms. Following up all the SNPs included in these regions is costly and infeasible for biological validation. In this article we address how to characterize these regions cost effectively with the goal of providing investigators a clear direction for biological validation. We introduce a follow-up study approach for identifying all untyped associated SNPs by selecting additional SNPs, called follow-up SNPs, from the associated regions and genotyping them in the original case/control individuals. We introduce a novel SNP selection method with the goal of maximizing the number of associated SNPs among the chosen follow-up SNPs. We show how the observed statistics of the original tag SNPs and human genetic variation reference data such as the HapMap Project can be utilized to identify the follow-up SNPs. We use simulated and real association studies based on the HapMap data and the Wellcome Trust Case Control Consortium to demonstrate that our method shows superior performance to the correlation- and distance-based traditional follow-up SNP selection approaches. Our method is publicly available at http://genetics.cs.ucla.edu/followupSNPs.

In recent years, genome-wide association studies (GWASs) have become the standard approach to discover the genetic basis of human disease traits (Wellcome Trust Case Control Consortium 2007; Hindorff *et al.* 2009). In GWASs, information on genetic polymorphisms is collected across the genome from case/control individuals for identifying genomic regions associated with a disease trait. Single-nucleotide polymorphisms (SNPs) are typically used due to their low genotyping cost and abundance in the genome. Resources such as the HapMap Project (Altshuler *et al.* 2005) and the 1000 Genomes Project (1000 Genomes Project Consortium 2010) provide a catalog of the SNPs in the genome. To reduce the cost of GWASs and the redundancy in the information collected, an informative subset of the SNPs, termed tag SNPs, is genotyped in GWASs. Tag SNPs are selected by utilizing the correlation structure between the SNPs, referred to as linkage disequilibrium (LD). Tag SNP selection under different criteria has been very well investigated (Cousin *et al.* 2003, 2006; Carlson *et al.* 2004; Stram 2004, 2005; Lin and Altman 2004; de Bakker *et al.* 2005; Halperin *et al.* 2005; Pardi *et al.* 2005; Qin *et al.* 2006; Saccone *et al.* 2006; Santana *et al.* 2010).

However, genomic regions that are in LD with the most significantly associated tag SNPs are often relatively large and may contain many additional polymorphisms. At this stage of the study it may not be clear to the investigator which specific genes or polymorphisms lead to an increase in disease risk. Additionally, biological validation of all such candidates may be costly and time consuming. A further characterization of such regions is therefore required, by identifying all the associated polymorphisms within them. A complete set of all associated polymorphisms can be seen as a catalog of all possible functional variants, and the actual values of the association statistics at these polymorphisms provide information about which of them may be causal.

How to cost effectively follow up and further investigate the regions represented by the significant tag SNPs presents a challenge. Given all the SNPs within these regions, or candidate SNPs, one way to identify the associated SNPs is to collect genotype information on every candidate SNP. However, this approach is highly inefficient as only a small percentage of the candidate SNPs are likely to be associated. Ideally, we want to know which of the candidate SNPs are the associated SNPs before genotyping them. We introduce a follow-up study approach in which a subset of the candidate SNPs, or follow-up SNPs, which are likely to be associated, is selected and genotyped in the original case/control individuals. We propose a follow-up SNP selection method with the goal of maximizing the number of statistically associated SNPs among the follow-up SNPs. We assume that the candidate SNPs are cataloged in a reference human genetic variation data set, such as the HapMap Project. The intuition behind our method is that a candidate SNP that is strongly correlated to a significantly associated tag SNP is likely to be associated as well (*e.g.*, when the two SNPs are perfectly correlated). We formalize this intuition to compute the probability of each candidate SNP being associated and select the follow-up SNPs accordingly.

Our approach may also be used in conjunction with the fine-mapping efforts that obtain the complete sequence information for regions of interest. In a typical GWAS, thousands of case/control individuals are employed to achieve a reasonable statistical power in the study and sequencing thousands of individuals is still a difficult and costly task compared to genotyping SNPs. Therefore, a small number of individuals can be sequenced in these regions to catalog the candidate SNPs and the follow-up SNPs selected using our method can be genotyped in all of the case/control individuals.

To our knowledge, there is no existing method addressing the follow-up SNP selection. Below we formalize two intuitive approaches that we refer to as the distance- and the correlation-based traditional follow-up SNP selection approaches. The traditional follow-up SNP selection approaches choose candidate SNPs that are within a certain distance or correlated above a minimum correlation cutoff value to the most significant tag SNPs. The distance-based traditional approach assumes that the neighboring SNPs are strongly correlated; however, due to the complexity of the LD landscape, SNPs close to each other may not necessarily be correlated, and assuming so may fail to identify the associated SNPs. Similarly, the correlation-based traditional approach may fail as a consequence of using the same minimum correlation cutoff value for every candidate and tag SNP. Predicting whether a candidate SNP is associated depends on the particular values of the observed tag SNP statistic, pairwise correlation, and the effect size of the candidate SNP. Our method outperforms the traditional approaches both in simulated and in real GWAS data. In our simulations we use the SNPs available from the HapMap Project and the widely used Affymetrix 500K SNP array as the candidate and tag SNPs, respectively. We generate various simulated candidate regions to compare the performance of the follow-up SNP selection approaches. For performance evaluation under real GWASs, we use data available from the Wellcome Trust Case Control Consortium (2007). In each of the seven disease GWASs we use half of the observed SNPs as the tag SNPs and the remaining as the candidate SNPs.

## MATERIALS AND METHODS

#### Problem formulation:

Given a biallelic SNP $m_i$ with true population minor allele frequency $p_i$, we denote the true case and control minor allele frequencies with $p_i^+$ and $p_i^-$ and the observed frequencies with $\hat{p}_i^+$ and $\hat{p}_i^-$. For the simplicity of our equations we work with balanced case and control panels of size $N/2$, which yields $N$ chromosomes in each panel.

The following association statistic $S_i$ is evaluated at SNP $m_i$; for large enough $N$,

$$S_i = \frac{\hat{p}_i^+ - \hat{p}_i^-}{\sqrt{\tfrac{2}{N}\,\hat{p}_i(1-\hat{p}_i)}} \sim \mathcal{N}(\lambda_i, 1), \tag{1}$$

where $\hat{p}_i = (\hat{p}_i^+ + \hat{p}_i^-)/2$. That is, $S_i$ is normally distributed with mean $\lambda_i$ (the noncentrality parameter) and unit variance, where

$$\lambda_i = \frac{p_i^+ - p_i^-}{\sqrt{\tfrac{2}{N}\,p_i(1-p_i)}}. \tag{2}$$

A SNP $m_i$ is associated with the disease trait if its noncentrality parameter (NCP) is not zero, $\lambda_i \neq 0$. The NCP of a SNP is unknown and the association of a SNP is inferred statistically. We refer to SNP $m_i$ as *statistically* associated or significant, if under the null distribution ($\lambda_i = 0$) of the association statistic $S_i$, the observed statistic $\hat{s}_i$ is in the rejection region defined by the significance level α, *i.e.*, $|\hat{s}_i| > \Phi^{-1}(1 - \alpha/2)$, where $\Phi^{-1}$ is the quantile function of the standard normal distribution. Note that even though a SNP can be associated, it may not be detected as statistically associated.
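As a minimal sketch of these definitions, the following Python snippet computes an allele-frequency difference statistic and the two-sided significance threshold. It assumes the statistic takes the form $(\hat{p}^+ - \hat{p}^-)/\sqrt{(2/N)\,\hat{p}(1-\hat{p})}$ with the pooled frequency $\hat{p}$ taken as the average of the two panel frequencies; the function names are illustrative and are not part of the published software.

```python
from math import sqrt
from statistics import NormalDist


def association_statistic(p_case_hat, p_ctrl_hat, n_chrom):
    """Allele-frequency difference statistic for n_chrom chromosomes per panel."""
    # Pooled frequency estimate (assumption: balanced case and control panels).
    p_hat = (p_case_hat + p_ctrl_hat) / 2.0
    return (p_case_hat - p_ctrl_hat) / sqrt(2.0 / n_chrom * p_hat * (1.0 - p_hat))


def significance_threshold(alpha):
    """Two-sided rejection threshold, i.e. the 1 - alpha/2 normal quantile."""
    # By symmetry this equals minus the alpha/2 quantile, which is numerically
    # safer for very small alpha than evaluating the quantile close to 1.
    return -NormalDist().inv_cdf(alpha / 2.0)


def is_significant(s_obs, alpha):
    """True if the observed statistic falls in the two-sided rejection region."""
    return abs(s_obs) > significance_threshold(alpha)
```

With α = 10^{−8} the threshold evaluates to roughly 5.73, so an observed statistic of 6.0 would be called significant while 5.0 would not.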

Additionally, given that the SNP $m_i$ is associated ($\lambda_i \neq 0$), $m_i$ can be either the causal SNP or correlated to the causal SNP. Assuming there is a single causal SNP $m_c$ and $m_i$ is not the causal SNP, then the following relation holds between the NCPs of $m_c$ and $m_i$ (Pritchard and Przeworski 2001):

$$\lambda_i = r_{ic}\,\lambda_c, \tag{3}$$

where $r_{ic}$ is the correlation coefficient between the two SNPs.

Consider that $T$ tag SNPs were genotyped in a GWAS and the association statistic of each tag SNP is computed, $\hat{s}_1, \ldots, \hat{s}_T$. Given $K$ candidate SNPs, the follow-up SNP selection problem is to choose $k$ follow-up SNPs from the $K$ candidate SNPs to genotype in the original GWAS case/control panels, using the observed statistics of the $T$ tag SNPs and the pairwise correlation between the SNPs.

Assume that in a follow-up study the chosen follow-up SNPs are genotyped in the case/control individuals of the original GWAS. In this scheme an ideal follow-up SNP selection, referred to as *oracle*, selects the candidate SNPs that are in fact significant in the original GWAS. We evaluate the performance of a follow-up SNP selection method with the precision criterion, which is the proportion of the follow-up SNPs that are significant, or true positives (TPs). At a given significance level α, if TP(α) denotes the number of true positives among the $k$ follow-up SNPs, then the precision can be expressed as follows:

$$\text{precision}(\alpha) = \frac{\mathrm{TP}(\alpha)}{k}.$$
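A minimal Python sketch of the precision criterion is given here. It assumes SNPs are represented by identifiers and that the set of SNPs significant in the original panels is known, as it is in a simulation or for the oracle; the function name is illustrative.

```python
def precision(selected, significant):
    """Proportion of the selected follow-up SNPs that are significant."""
    if not selected:
        raise ValueError("at least one follow-up SNP must be selected")
    significant = set(significant)
    true_positives = sum(1 for snp in selected if snp in significant)
    return true_positives / len(selected)
```

With four follow-up SNPs of which two are significant, the function returns 0.5.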

#### Traditional follow-up SNP selection approaches:

To the best of our knowledge, no existing method addresses follow-up SNP selection. We therefore formalize two intuitive approaches, which we refer to as the distance- and the correlation-based traditional follow-up SNP selection approaches.

Under both traditional approaches each candidate SNP is paired with a tag SNP. The tag SNP paired with a candidate SNP is the tag SNP with the highest pairwise correlation to the candidate SNP (correlation-based approach) or a tag SNP within a certain distance of the candidate SNP (distance-based approach). For each candidate/tag SNP pair, $\hat{s}_t$ denotes the observed association statistic of the tag SNP, $r_{it}$ denotes the pairwise correlation, and $d_{it}$ denotes the distance between the candidate and the tag SNPs.

Under the correlation-based traditional approach, follow-up SNPs are selected as follows. First, the candidate/tag SNP pairs are sorted according to the significance of their tag SNP’s observed association statistic, from the most significant to the least significant. If two pairs have the same significance of tag SNP statistic, then the pair with the stronger pairwise correlation precedes the other one. To select the follow-up SNPs a minimum pairwise correlation cutoff value, $r_{\min} > 0$, is given such that the top $k$ pairs carrying stronger pairwise correlation than $r_{\min}$ are selected.

The distance-based traditional approach selects the follow-up SNPs with respect to the distance between the candidate and tag SNPs. A distance window, $d_{\max}$, is given such that starting from the most significant tag SNP, candidate SNPs that are within the distance window are paired with the tag SNP. In addition, pairs with the same significance of tag SNP statistic are sorted by the distance between the candidate SNP and its tag SNP, closest first. The top $k$ pairs are selected to determine the follow-up SNPs.
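The two traditional selection rules can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: each candidate SNP is assumed to be already paired with one tag SNP, significance is ordered by the absolute value of the tag statistic, correlation strength is measured by its absolute value, and the `Pair` record and function names are hypothetical.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    candidate: str    # candidate SNP identifier
    tag_stat: float   # observed association statistic of the paired tag SNP
    r: float          # pairwise correlation between candidate and tag SNP
    distance: int     # distance in bp between candidate and tag SNP


def select_by_correlation(pairs, k, r_min):
    """Top-k candidates among pairs whose correlation exceeds r_min."""
    eligible = [p for p in pairs if abs(p.r) > r_min]
    # Most significant tag first; ties broken by stronger correlation.
    eligible.sort(key=lambda p: (-abs(p.tag_stat), -abs(p.r)))
    return [p.candidate for p in eligible[:k]]


def select_by_distance(pairs, k, d_max):
    """Top-k candidates among pairs lying within the distance window d_max."""
    eligible = [p for p in pairs if p.distance <= d_max]
    # Most significant tag first; ties broken by smaller distance.
    eligible.sort(key=lambda p: (-abs(p.tag_stat), p.distance))
    return [p.candidate for p in eligible[:k]]
```

Both rules apply a single cutoff to every pair, which is the property that the statistical framework in the next section relaxes.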

#### A statistical framework to analyze follow-up SNP selection:

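We introduce a simple statistical framework for analyzing the follow-up SNP selection problem. Although this model is an oversimplification, it captures the essence of follow-up SNP selection and is easy to analyze. We pair each candidate SNP with the tag SNP with the highest correlation. We assume that the candidate/tag SNP pairs are independent and focus on the joint distribution of the association statistics in a pair. We can estimate the covariance between two SNPs as shown by Han *et al.* (2009) and calculate the joint distribution of their association statistics as follows:

$$\begin{pmatrix} S_i \\ S_t \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \lambda_i \\ \lambda_t \end{pmatrix}, \begin{pmatrix} 1 & r_{it} \\ r_{it} & 1 \end{pmatrix} \right). \tag{4}$$

Note that, by Equation 3, $\lambda_t = r_{it}\lambda_i$ when the candidate SNP is the causal SNP. The value of the candidate SNP NCP, $\lambda_i$, depends on whether the SNP is causal. We introduce a new parameter, $c_i$, as the probability of the candidate SNP being causal and assume that a causal SNP attains a certain NCP of $\lambda_c$. Under these assumptions the joint distribution can be expressed as follows:

$$\begin{pmatrix} S_i \\ S_t \end{pmatrix} \sim c_i\,\mathcal{N}\!\left( \begin{pmatrix} \lambda_c \\ r_{it}\lambda_c \end{pmatrix}, \begin{pmatrix} 1 & r_{it} \\ r_{it} & 1 \end{pmatrix} \right) + (1 - c_i)\,\mathcal{N}\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & r_{it} \\ r_{it} & 1 \end{pmatrix} \right). \tag{5}$$

Although the joint distribution depends on two unknown parameters, the NCP of the causal SNP, $\lambda_c$, and the probability of the candidate SNP being the causal SNP, $c_i$, it is shown in RESULTS that variation in these parameters has a small effect on the choices of follow-up SNPs.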

#### Follow-up SNP selection under the statistical framework:

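Using the joint distribution given in Equation 5, the conditional distribution of the candidate SNP statistic given the observed tag SNP statistic can be expressed as follows:

$$S_i \mid S_t = \hat{s}_t \;\sim\; \begin{cases} \mathcal{N}\!\left(\lambda_c + r_{it}(\hat{s}_t - r_{it}\lambda_c),\; 1 - r_{it}^2\right) & \text{with probability } c_i', \\ \mathcal{N}\!\left(r_{it}\hat{s}_t,\; 1 - r_{it}^2\right) & \text{with probability } 1 - c_i'. \end{cases} \tag{6}$$

Let $\phi(x; \mu, \sigma^2)$ denote the density of a univariate normal distribution with mean μ and variance σ^{2}. In Equation 6, $c_i'$ is the probability that the candidate SNP is causal given the observed tag SNP statistic,

$$c_i' = \frac{c_i\,\phi(\hat{s}_t;\, r_{it}\lambda_c,\, 1)}{c_i\,\phi(\hat{s}_t;\, r_{it}\lambda_c,\, 1) + (1 - c_i)\,\phi(\hat{s}_t;\, 0,\, 1)}.$$

The density function of the conditional distribution of the candidate SNP statistic, $f$, can be expressed as the following mixture density:

$$f(s_i \mid \hat{s}_t) = c_i'\,\phi\!\left(s_i;\, \lambda_c + r_{it}(\hat{s}_t - r_{it}\lambda_c),\, 1 - r_{it}^2\right) + (1 - c_i')\,\phi\!\left(s_i;\, r_{it}\hat{s}_t,\, 1 - r_{it}^2\right). \tag{7}$$

Using the above equation we can express the probability of a candidate SNP being statistically associated given the observed value of its tag SNP statistic as $P\!\left(|S_i| > \Phi^{-1}(1 - \alpha/2) \mid S_t = \hat{s}_t\right) = \int_{|s_i| > \Phi^{-1}(1 - \alpha/2)} f(s_i \mid \hat{s}_t)\, ds_i$, which is referred to as $\pi_i$. The selection of follow-up SNPs is achieved by computing $\pi_i$ for each candidate SNP and ranking the candidate SNPs in descending order of $\pi_i$. We refer to the follow-up SNP selection method where each candidate SNP is paired with the tag SNP that has the highest pairwise correlation as **r**ank-based **f**ollow-up **S**NP **s**election (RFSS).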
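The ranking step can be illustrated with a short Python sketch. It is a sketch under explicit assumptions, not the published implementation: the pair of statistics is taken to follow a two-component bivariate normal mixture in which, with prior probability `c`, the candidate SNP is causal with NCP `ncp_causal` and the tag SNP has NCP `r * ncp_causal`, and otherwise both NCPs are zero; significance is two-sided at level `alpha`. All function and parameter names are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

STD = NormalDist()


def _sigmoid(x):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + exp(-x)) if x >= 0 else exp(x) / (1.0 + exp(x))


def rfss_probability(s_tag, r, ncp_causal, c, alpha):
    """P(candidate SNP is significant | observed tag statistic s_tag)."""
    z = -STD.inv_cdf(alpha / 2.0)
    if abs(r) >= 1.0:
        # Perfect correlation: the candidate statistic is +/- the tag statistic.
        return 1.0 if abs(s_tag) > z else 0.0
    mu_tag = r * ncp_causal
    # Posterior log-odds that the candidate is causal given s_tag; the ratio
    # of N(mu_tag, 1) to N(0, 1) densities at s_tag is exp(mu*s - mu^2/2).
    log_odds = log(c / (1.0 - c)) + mu_tag * s_tag - 0.5 * mu_tag ** 2
    post_causal = _sigmoid(log_odds)
    sd = sqrt(1.0 - r * r)                  # conditional standard deviation

    def tail(mean):
        # P(|S| > z) for S ~ N(mean, sd^2)
        return STD.cdf((mean - z) / sd) + STD.cdf((-z - mean) / sd)

    mean_causal = ncp_causal + r * (s_tag - mu_tag)
    mean_null = r * s_tag
    return post_causal * tail(mean_causal) + (1.0 - post_causal) * tail(mean_null)


def rfss_rank(pairs, ncp_causal, c, alpha):
    """Order (name, s_tag, r) triples by decreasing probability of significance."""
    scored = [(rfss_probability(s, r, ncp_causal, c, alpha), name)
              for name, s, r in pairs]
    scored.sort(key=lambda item: -item[0])
    return [name for _, name in scored]
```

Because the probability depends jointly on the tag statistic and the correlation, two pairs with the same correlation can receive very different ranks, which a single correlation cutoff cannot express.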

Consider the example given in Figure 1, the selection of two follow-up SNPs from four candidate/tag SNP pairs, where the true framework parameters are $\lambda_c = 5.73$ and $c_i = 10^{-6}$. We assume there are 1 million candidate SNPs (only 4 shown in the example) and the significance level is $10^{-8}$, which takes into account the multiple-testing correction. The NCP value of $\lambda_c = 5.73$ corresponds to 50% power at the causal SNP. Under the correlation-based traditional approach, for the given tag SNP statistics and pairwise correlations, two follow-up SNPs can be selected using three different values for the minimum pairwise correlation cutoff value, $r_{\min} \in \{0.50, 0.90, 0.92\}$. For each candidate SNP the $\pi_i$ value can be calculated as: $\pi_1 = 0.084$, $\pi_2 = 0.0004$, $\pi_3 = 0.0007$, and $\pi_4 = 0.008$. The two optimal follow-up SNPs are $m_1$ and $m_4$, which have the highest $\pi_i$ values. This example shows that the correlation-based traditional approach may fail to identify the optimal follow-up SNP selection under all of the possible correlation cutoff values.

#### Extending the statistical framework to incorporate multiple-tag SNPs and the neighboring candidate SNPs:

We now extend the RFSS approach by grouping each candidate SNP with the multiple tag SNPs that have the highest correlations to it and relaxing the assumption of the candidate SNPs being independent. In this scheme, we incorporate two additional sources of information. First, in addition to the best tag SNP, the observed statistics of the other highly correlated tag SNPs are utilized. Second, even though a candidate SNP may not be causal, it may still be associated (*i.e.*, $\lambda_i \neq 0$) as a result of being correlated to a candidate SNP that is causal. We refer to the multivariate extension of the RFSS approach as mRFSS and present its performance improvements in RESULTS.

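Given a candidate SNP $m_i$, we consider its $L$ most strongly correlated tag SNPs. Let $\mathbf{R}_{iL}$ denote the $L \times 1$ vector of the correlation coefficients between $m_i$ and the $L$ tag SNPs. Similarly, let $\mathbf{S}_L$ and $\boldsymbol{\Lambda}_L$, respectively, be the $L \times 1$ vectors of the association statistics and NCPs of the tag SNPs and $\boldsymbol{\Sigma}_L$ be the $L \times L$ matrix of their pairwise correlation coefficients. The joint distribution of the association statistics of the candidate SNP $m_i$ and the $L$ tag SNPs follows a multivariate normal distribution, which can be expressed as follows:

$$\begin{pmatrix} S_i \\ \mathbf{S}_L \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \lambda_i \\ \boldsymbol{\Lambda}_L \end{pmatrix}, \begin{pmatrix} 1 & \mathbf{R}_{iL}^{T} \\ \mathbf{R}_{iL} & \boldsymbol{\Sigma}_L \end{pmatrix} \right). \tag{8}$$

In Equation 8 the NCPs of the candidate SNP $m_i$ and the tag SNPs are unknown. Suppose the causal SNP $m_c$ is known, where it correlates to $m_i$ by $r_{ic}$ and to the tag SNPs by the $L \times 1$ vector $\mathbf{R}_{cL}$. Using the indirect association rule (Equation 3), we can express the NCPs of $m_i$ and the tag SNPs as follows: $\lambda_i = r_{ic}\lambda_c$ and $\boldsymbol{\Lambda}_L = \mathbf{R}_{cL}\lambda_c$.

Although the causal SNP is unknown, we can consider each candidate SNP $m_k$ as the causal SNP with probability $c_k$ and use the indirect association rule to resolve the unknown NCPs. For each candidate SNP $m_k$, where $k \in \{1, \ldots, K\}$, let $r_{ik}$ and $\mathbf{R}_{kL}$ denote the correlation coefficients of $m_k$ to $m_i$ and to the $L$ tag SNPs most strongly correlated to $m_i$. Then the joint distribution of $(S_i, \mathbf{S}_L)$ can be expressed as a two-level hierarchical model, which uses the indirect association to compute the NCPs:

$$\begin{aligned} (\lambda_i, \boldsymbol{\Lambda}_L) &= \begin{cases} (r_{ik}\lambda_c,\ \mathbf{R}_{kL}\lambda_c) & \text{with probability } c_k,\quad k = 1, \ldots, K, \\ (0,\ \mathbf{0}) & \text{with probability } 1 - \sum_{k=1}^{K} c_k, \end{cases} \\ \begin{pmatrix} S_i \\ \mathbf{S}_L \end{pmatrix} \Bigm| (\lambda_i, \boldsymbol{\Lambda}_L) &\sim \mathcal{N}\!\left( \begin{pmatrix} \lambda_i \\ \boldsymbol{\Lambda}_L \end{pmatrix}, \begin{pmatrix} 1 & \mathbf{R}_{iL}^{T} \\ \mathbf{R}_{iL} & \boldsymbol{\Sigma}_L \end{pmatrix} \right). \end{aligned} \tag{9}$$

Consequently, the density function, $f$, of the conditional distribution of the candidate SNP statistic $S_i$ given the vector of observed tag SNP statistics $\hat{\mathbf{s}}_L$ can be written as follows:

$$f(s_i \mid \hat{\mathbf{s}}_L) = \sum_{k=0}^{K} w_k\, \phi\!\left(s_i;\ r_{ik}\lambda_c + \mathbf{R}_{iL}^{T}\boldsymbol{\Sigma}_L^{-1}(\hat{\mathbf{s}}_L - \mathbf{R}_{kL}\lambda_c),\ 1 - \mathbf{R}_{iL}^{T}\boldsymbol{\Sigma}_L^{-1}\mathbf{R}_{iL}\right), \qquad w_k = \frac{c_k\,\phi_L(\hat{\mathbf{s}}_L;\ \mathbf{R}_{kL}\lambda_c,\ \boldsymbol{\Sigma}_L)}{\sum_{j=0}^{K} c_j\,\phi_L(\hat{\mathbf{s}}_L;\ \mathbf{R}_{jL}\lambda_c,\ \boldsymbol{\Sigma}_L)}, \tag{10}$$

where $\phi_L$ denotes the $L$-variate normal density and the index $k = 0$ stands for the absence of a causal SNP, with $c_0 = 1 - \sum_{k=1}^{K} c_k$, $r_{i0} = 0$, and $\mathbf{R}_{0L} = \mathbf{0}$. Note that in Equation 10 there is one mixture component for each candidate SNP $m_k$ being the causal SNP (with prior weight $c_k$) and a mixture component that corresponds to having no causal SNPs with prior weight $1 - \sum_{k=1}^{K} c_k$.

Although model (9) considers $K$ candidate SNPs, in practice it can be simplified by using only the $M$ neighboring candidate SNPs with the highest pairwise correlations to $m_i$, since the remaining $K - M$ candidate SNPs do not contribute to the sum in Equation 10. In our experiments, we used the 10 most strongly correlated neighboring candidate SNPs, $M = 10$. Additionally, the choice of the most strongly correlated tag SNPs affects the computational cost and the numerical stability of the method. If any two tag SNPs are strongly correlated with each other, the matrix $\boldsymbol{\Sigma}_L$ may become nearly singular, and therefore in practice we discarded any tag SNP with correlation >0.9 to any of the already included tag SNPs. Finally, in our experiments we chose the number of tag SNPs to be at most 10, $L \le 10$.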
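The multi-tag computation can be sketched in plain Python as follows. The sketch makes several assumptions that should be read as such: the statistics of a candidate SNP and its tag SNPs are jointly normal with unit variances and the given correlations, each putative causal SNP contributes one mixture component whose NCPs follow the indirect association rule, and the conditional distribution is obtained with the standard formulas for conditioning a multivariate normal. The small linear solver, the greedy tag filter, and all names are illustrative and are not taken from the published software.

```python
from math import exp, sqrt
from statistics import NormalDist

STD = NormalDist()


def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for j in range(col, n + 1):
                m[row][j] -= factor * m[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def choose_tags(r_i, sigma, max_tags=10, max_pair_corr=0.9):
    """Greedily keep the tags most correlated with the candidate, skipping any
    tag correlated above max_pair_corr with a tag that is already kept."""
    kept = []
    for t in sorted(range(len(r_i)), key=lambda idx: -abs(r_i[idx])):
        if len(kept) < max_tags and all(abs(sigma[t][u]) <= max_pair_corr for u in kept):
            kept.append(t)
    return kept


def mrfss_probability(s_tags, r_i, sigma, causal, ncp_causal, alpha):
    """P(candidate is significant | tag statistics) under the multi-tag mixture.

    causal holds one (c_k, r_ik, r_k_tags) triple per putative causal SNP m_k.
    """
    z = -STD.inv_cdf(alpha / 2.0)
    w = solve(sigma, r_i)                # Sigma_L^{-1} R_iL
    sd = sqrt(1.0 - dot(r_i, w))         # conditional standard deviation
    # Components as (prior weight, candidate NCP, tag NCP vector); the first
    # component corresponds to "no causal SNP".
    comps = [(1.0 - sum(c for c, _, _ in causal), 0.0, [0.0] * len(s_tags))]
    comps += [(c, r_ik * ncp_causal, [r * ncp_causal for r in r_k])
              for c, r_ik, r_k in causal]
    resid = [[s - mu for s, mu in zip(s_tags, mu_tags)] for _, _, mu_tags in comps]
    # Log density of the tag statistics per component, up to a shared constant.
    logs = [-0.5 * dot(d, solve(sigma, d)) for d in resid]
    top = max(logs)
    weights = [prior * exp(lg - top) for (prior, _, _), lg in zip(comps, logs)]
    total = sum(weights)
    prob = 0.0
    for (_, mu_i, _), d, wt in zip(comps, resid, weights):
        mean = mu_i + dot(w, d)          # conditional mean of the candidate statistic
        tail = STD.cdf((mean - z) / sd) + STD.cdf((-z - mean) / sd)
        prob += wt / total * tail
    return prob
```

Filtering the tag SNPs before calling `mrfss_probability` keeps the tag correlation matrix well conditioned, which matters because the solver is applied to it once per mixture component.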

## RESULTS

#### Performance of the correlation-based traditional approach:

In this section we analyze the expected performance (EP) of the correlation-based traditional approach on a single follow-up SNP: that is, how often a candidate SNP is observed as statistically associated given that it is selected as a follow-up SNP. The selection of a candidate SNP as a follow-up SNP depends on the ordering of all the candidate/tag SNP pairs and the given minimum correlation cutoff value, $r_{\min}$. To simplify the interdependence, we introduce a new parameter called the minimum statistic cutoff value, $s_{\min}$. As a rule, a candidate SNP is selected as a follow-up SNP if the observed statistic of its tag SNP, $\hat{s}_t$, is $> s_{\min}$. We assume there is a one-to-one mapping between $r_{\min}$ and $s_{\min}$, such that for every $r_{\min}$, an $s_{\min}$ value exists.

The follow-up SNP selection rule defined for the correlation-based traditional approach using $s_{\min}$ is comparable to RFSS, which uses a conditional probability threshold, π*, as a follow-up SNP selection rule. It can be shown that the probability of a candidate SNP being significant given the observed value of its tag SNP statistic, $\pi_i$, is a monotonic function of $\hat{s}_t$. That is, for a given candidate/tag SNP pair, for every π* it is possible to determine a unique value $s_{\min}^{(i)}$ such that $\pi_i = \pi^*$ at $\hat{s}_t = s_{\min}^{(i)}$, where the candidate SNP is selected if $\hat{s}_t > s_{\min}^{(i)}$. Therefore, we can compare the two selection rules based on $s_{\min}$ and $s_{\min}^{(i)}$, where the correlation-based traditional approach uses the same $s_{\min}$ for every candidate/tag SNP pair, and RFSS determines an $s_{\min}^{(i)}$ for each candidate/tag SNP pair on the basis of $\lambda_c$, $c_i$, $r_{it}$, and α. Below we show how the expected performance changes with respect to $s_{\min}$.

The EP under the correlation-based traditional approach can be computed directly from the corresponding joint distribution of the association statistics given in Equation 5 as $P\!\left(|S_i| > \Phi^{-1}(1 - \alpha/2) \mid S_t > s_{\min}\right)$. This probability can be expressed as $P\!\left(|S_i| > \Phi^{-1}(1 - \alpha/2),\ S_t > s_{\min}\right) / P(S_t > s_{\min})$. In Figure 2, the densely cross-hatched region at the top represents where the follow-up SNP is significant. Likewise, in the cross-hatched region at the bottom, the candidate SNP is selected as a follow-up SNP; however, it is not significant. Therefore EP can be expressed as the ratio of the probability in the densely cross-hatched region to all cross-hatched regions as given in Equation 11.
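The ratio described above can be evaluated numerically, as in the following Python sketch. It assumes the two-component bivariate normal mixture reading of Equation 5 (candidate causal with prior probability `c` and NCP `ncp_causal`, tag NCP `r * ncp_causal`, otherwise both NCPs zero), two-sided significance at level `alpha`, and Simpson integration over the tag statistic truncated 10 standard deviations above the larger of the cutoff and the tag mean. The function name, the truncation, and the step count are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def expected_performance(r, s_min, ncp_causal, c, alpha, steps=2000):
    """P(candidate is significant | tag statistic exceeds s_min)."""
    if abs(r) >= 1.0 or steps % 2:
        raise ValueError("requires |r| < 1 and an even number of steps")
    z = -STD.inv_cdf(alpha / 2.0)
    sd = sqrt(1.0 - r * r)               # conditional standard deviation

    def joint(mu_i, mu_t):
        # P(|S_i| > z, S_t > s_min) for one bivariate normal component:
        # integrate the tag density times the conditional tail probability.
        upper = max(s_min, mu_t) + 10.0
        h = (upper - s_min) / steps
        total = 0.0
        for j in range(steps + 1):
            t = s_min + j * h
            mean = mu_i + r * (t - mu_t)
            tail = STD.cdf((mean - z) / sd) + STD.cdf((-z - mean) / sd)
            weight = 1 if j in (0, steps) else (4 if j % 2 else 2)
            total += weight * STD.pdf(t - mu_t) * tail
        return total * h / 3.0

    mu_tag = r * ncp_causal
    numerator = c * joint(ncp_causal, mu_tag) + (1.0 - c) * joint(0.0, 0.0)
    denominator = c * STD.cdf(mu_tag - s_min) + (1.0 - c) * STD.cdf(-s_min)
    return numerator / denominator
```

Under these assumptions, raising the cutoff on the tag statistic for a strongly correlated pair moves the EP from near zero toward one, while for an uncorrelated pair the EP stays at the unconditional probability of significance regardless of the cutoff.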

For given $\lambda_c$, $c_i$, and α, we can write the EP as a function of $r_{it}$ and $s_{\min}$, where

$$\mathrm{EP}(r_{it}, s_{\min}) = \frac{\int_{s_{\min}}^{\infty} g(t)\,\pi_i(t)\,dt}{\int_{s_{\min}}^{\infty} g(t)\,dt}, \tag{11}$$

with $g(t) = c_i\,\phi(t;\, r_{it}\lambda_c,\, 1) + (1 - c_i)\,\phi(t;\, 0,\, 1)$ the marginal density of the tag SNP statistic under Equation 5 and $\pi_i(t) = P\!\left(|S_i| > \Phi^{-1}(1 - \alpha/2) \mid S_t = t\right)$. In Figure 3 each of $\lambda_c$, $c_i$, and $r_{it}$ is varied while keeping the other two parameters fixed and the effect of variation in EP is shown. When the variation in EP is compared between the change in $\lambda_c$ (Figure 3a) or $c_i$ (Figure 3b) to $r_{it}$ (Figure 3c), we observe that the largest variation in EP is due to the pairwise correlation. A second observation is that varying $\lambda_c$ or $c_i$ approximately corresponds to a shift of EP in the horizontal axis. This suggests that if the same $\lambda_c$ and $c_i$ parameters are used to calculate the EPs of any two candidate/tag SNP pairs, the order of the pairs with respect to their EPs will always be the same. That is, as long as the values of $\lambda_c$ and $c_i$ are close to the true values, the selection of the follow-up SNPs is robust to uncertainties in these parameters. This property is utilized in RFSS and mRFSS, where we show the concordance of the selections under different combinations of the framework parameters.

#### Performance comparison under simulated data:

We evaluate and compare the performance of the traditional approaches and our proposed methods, RFSS and mRFSS, using simulated association studies generated using the ENCODE regions from the HapMap Project. We use the ENCODE SNPs as the candidate SNPs and the Affymetrix 500K array as the tag SNPs. There are 10 ENCODE regions, each 500,000 bp long, which are genotyped separately under the four HapMap populations. The correlation structure and the minor allele frequency of the SNPs in these regions vary depending on the specific population. A summary of the number of candidate and tag SNPs in each region and population is given in Table 1.

We simulate the follow-up study of a GWAS as follows: assuming there is a single causal SNP, the region where the significant tag SNPs are located is simulated by an ENCODE region. Using each ENCODE region in each population, we simulate 2000 association study panels. In half of these panels, we implant a causal SNP, which is randomly selected among the candidate SNPs, and in the rest of the panels there are no causal SNPs. To generate the association statistics of the SNPs, we simulate the case/control genotypes of the SNPs. In each panel, we generate the genotypes by sampling haplotypes depending on whether there is a causal SNP in the panel. If there is no causal SNP, we randomly sample the case/control individuals’ haplotypes from the specific haplotype pool of the ENCODE region from the corresponding population. If there is a causal SNP, we divide the haplotype pool into two separate pools on the basis of the causal SNP’s allele. The true case and control minor allele frequencies of the causal SNP are calculated as follows. We assume the HapMap frequency of the causal SNP under the corresponding population is the true control frequency and determine the true case minor allele frequency such that the NCP of each causal SNP yields 50% statistical power. In other words, when the causal SNP is genotyped, half of the time it is detected as significant. Once the true case and control minor allele frequencies are determined at the causal SNP, we sample case and control individuals using these probabilities from the corresponding haplotype pool. We use a genome-wide significance level α of 10^{−8} and under 50% statistical power the NCP of the causal SNP is $\lambda_c = 5.73$.
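The step that fixes the case frequency of the causal SNP can be sketched as a one-dimensional root search. The Python sketch below assumes the NCP has the form $(p^+ - p^-)/\sqrt{(2/N)\,p(1-p)}$ with $p$ the average of the case and control frequencies, that the minor allele is the risk allele, and that 50% power corresponds to an NCP equal to the two-sided significance threshold; the function names and the panel size used in the usage note are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ncp(p_case, p_ctrl, n_chrom):
    """NCP for true case/control frequencies and n_chrom chromosomes per panel."""
    p = (p_case + p_ctrl) / 2.0
    return (p_case - p_ctrl) / sqrt(2.0 / n_chrom * p * (1.0 - p))


def case_frequency_for_half_power(p_ctrl, n_chrom, alpha, tol=1e-12):
    """Case frequency at which the causal SNP has (about) 50% power."""
    target = -NormalDist().inv_cdf(alpha / 2.0)   # NCP equal to the threshold
    lo, hi = p_ctrl, 1.0 - 1e-9
    if ncp(hi, p_ctrl, n_chrom) < target:
        raise ValueError("target NCP is unreachable with this panel size")
    # The NCP increases with the case frequency, so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if ncp(mid, p_ctrl, n_chrom) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For instance, with a control frequency of 0.2, a hypothetical 4000 chromosomes per panel, and α = 10^{−8}, the search returns a case frequency slightly above 0.25.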

In different HapMap populations and ENCODE regions the correlation structure among the SNPs varies greatly. We compare the performance of the methods for each of the ENCODE regions in each population. In Figure 4 an example performance comparison is given in the ENm010.7p15.2 ENCODE region using the CEU HapMap population. The precision of each method is plotted along the size, *k*, of the follow-up SNPs (Figure 4a). We give the ideal follow-up SNP selection as the “oracle” as a reference to compare each method to. The vertical line in Figure 4a indicates the total number of statistically associated candidate SNPs in the simulation data.

The performance of the correlation-based traditional approach depends on the minimum correlation cutoff value, *r*_{min}, used for selecting the follow-up SNPs. If *r*_{min} is high, such as *r*_{min} = 0.9, the correlation-based traditional approach performs well for a small number of follow-up SNPs and performance degrades rapidly as more follow-up SNPs are collected. However, significant candidate SNPs may not be strongly correlated to the most significant tag SNPs, and by using a high *r*_{min} value such candidate SNPs cannot be selected. On the other hand, when *r*_{min} is low (Figures 4a and 6), the traditional approach selects the significant candidate SNPs that are correlated weakly to the most significant tag SNPs. However, a price is paid by selecting many follow-up SNPs, most of which are not significant. Hence when the number of follow-up SNPs is small, the precision is significantly lower. The distance-based traditional approach performs worse than the correlation-based counterpart, which can be observed in Figure 4a as shown for the distance windows of 1,000 and 10,000 bp. Whether a candidate SNP is statistically associated depends on the values of the observed tag SNP statistic and the pairwise correlation, and due to the complexity of the LD landscape SNPs located near each other may not be strongly correlated.

RFSS makes assumptions about $\lambda_c$ and $c_i$, which affect the estimates of the decision rule. We evaluate how incorrect assumptions on these parameters affect the performance by varying the value of these parameters in our simulations. Figure 4b shows that even with incorrect assumptions about these parameters, performance of our method is nearly identical to the performance using correct parameters.

In Table 2 we show the summary of performance comparisons under all generated association studies. In each population in every ENCODE region, the precision of each method is reported twice, first when the number of follow-up SNPs is equal to the total number of the significant candidate SNPs, *N*_{s}, and second when the number of follow-up SNPs is twice this value. For each population, the two sets of precisions obtained over all ENCODE regions are averaged, respectively. The proposed RFSS and mRFSS methods perform significantly better than the traditional approaches for the relevant sizes of follow-up SNPs.

#### Performance comparison using incorrect HapMap population correlations:

We further experiment with the effect of using incorrect correlations on the performance of each method. Among the four HapMap populations, the correlations among the SNPs vary the most between the CEU and YRI populations. We use the correlation values from the YRI population to select follow-up SNPs from the simulation data generated in the ENm010.7p15.2 ENCODE region using the CEU population.

In Figure 5 the performance comparison of the traditional and the proposed methods is given. We observe that each method performs worse than its performance when the correct correlations are used. However, the proposed methods performed better than the traditional approaches.

#### Performance comparison in discovering the causal SNPs:

Next, we compare the performance of the traditional and the proposed approaches in discovering the causal SNPs. Note that neither the traditional nor the proposed methods are designed for this goal. Nevertheless, we compare their performance on what percentage of the significant causal SNPs that are present in the data are selected as follow-up SNPs. Here we assume only the follow-up SNPs that are observed as significant are further analyzed, which may lead to the discovery of whether a SNP is causal.

In Figure 6 we show the candidate SNPs in the order they are selected as follow-up SNPs in each method. We plot the hidden statistic of each follow-up SNP as a green circle and indicate the causal SNPs with a black circle. The red horizontal lines mark the significance threshold for the association statistic under the significance level α = 10^{−8}. We use two extreme minimum correlation cutoff values of *r*_{min} = 0.1 and *r*_{min} = 0.9 for the correlation-based traditional approach. The blue circles indicate the statistic of the tag SNP that corresponds to each follow-up SNP. The plot for the distance-based traditional approach is omitted as it performs similarly to *r*_{min} = 0.1. Additionally, the likelihood of each candidate SNP being significant is shown with a blue line in the plots of the proposed methods. The traditional approaches select the follow-up SNPs despite the significance level or the assumed statistical power at the causal SNP. However, the proposed approaches rank the candidate SNPs on the basis of these parameters, where the likelihood of each candidate SNP being significant changes accordingly. We observe that when multitag SNPs are used, this ranking is more successful. Both of our proposed methods identify a significantly higher number of causal SNPs, where the mRFSS method prioritizes causal candidate SNPs earlier in the selection.

In Table 3 the summary of the average performance comparisons on each HapMap population is shown. In each population and ENCODE region, the performance of the methods is recorded twice, where the number of the follow-up SNPs, *k*, equals the number of significant candidate SNPs in the corresponding simulation, *k* = *N*_{s} and two times this value *k* = 2*N*_{s}. In each population, we then average the performance of each method over the ENCODE regions. The proposed approaches lead to the discovery of significantly more causal SNPs than the traditional approaches.

#### Performance comparison under real GWAS data:

We compare the performance of the correlation-based traditional approach to the proposed methods using real GWAS studies. We use the data available from the Wellcome Trust Case Control Consortium (WTCCC) GWASs on seven human diseases, which provides genotype data on 2000 case individuals per disease and 3000 shared controls. Each individual is believed to be of European origin (CEU) and genotyped with the Affymetrix 500K array. Bipolar disorder (BD) and hypertension (HT) are excluded from the analysis as no statistically significant associations are observed. The performance comparisons are evaluated on the remaining five diseases, coronary artery disease (CAD), Crohn’s disease (CD), rheumatoid arthritis (RA), type 1 diabetes (T1D), and type 2 diabetes (T2D). We use approximately half of the genotyped Affymetrix SNPs as candidate SNPs and the rest as the tag SNPs. Using each SNP’s unique reference identifier number (rsID), the SNPs with odd rsIDs are used as the candidate SNPs.
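The odd/even split of the genotyped SNPs can be expressed in a few lines of Python. This is only a sketch: it assumes identifiers of the form `rs` followed by an integer, and the function name and the example identifiers in the usage note are placeholders rather than SNPs from the study.

```python
def split_by_rsid_parity(rsids):
    """Split SNP identifiers into (candidates, tags) by the parity of the rsID."""
    candidates, tags = [], []
    for rsid in rsids:
        if not rsid.lower().startswith("rs"):
            raise ValueError(f"not an rsID: {rsid!r}")
        number = int(rsid[2:])            # numeric part after the "rs" prefix
        # Odd rsIDs become candidate SNPs, even rsIDs remain tag SNPs.
        (candidates if number % 2 == 1 else tags).append(rsid)
    return candidates, tags
```

Applied to the placeholder list `["rs11", "rs12", "rs7", "rs100"]`, the function places `rs11` and `rs7` among the candidate SNPs and the other two among the tag SNPs.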

The performance of the correlation-based traditional approach is recorded for minimum correlation cutoff values of 0.9, 0.5, and 0.1. For the proposed methods we assumed the statistical power at the causal SNP to be 50% and used the value of 10^{−6} for the probability of a candidate SNP being causal. In Figure 7, a sample performance comparison is given between the correlation-based traditional approach and the proposed methods under RA.

Table 4 summarizes the performance comparisons in all five diseases. For each disease the precision of each method is reported twice: first when the number of follow-up SNPs equals the total number of statistically associated candidate SNPs, and second when the number of follow-up SNPs is two times that total. The proposed RFSS and mRFSS methods consistently outperform the correlation-based traditional approach.
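The precision reported at the two follow-up budgets can be sketched as follows. This is an illustrative reading, assuming precision means the fraction of the *k* selected follow-up SNPs that are statistically associated; the function name and the toy data are hypothetical.

```python
def precision_at_k(ranked_snps, associated, k):
    """Fraction of the top-k ranked SNPs that are in the associated set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for snp in ranked_snps[:k] if snp in associated)
    return hits / k


# Hypothetical ranking produced by some selection method (best first).
ranking = ["rs1", "rs3", "rs5", "rs7", "rs9", "rs11"]
# Hypothetical set of statistically associated candidate SNPs.
associated = {"rs1", "rs5", "rs9"}

n_associated = len(associated)
precision_k = precision_at_k(ranking, associated, n_associated)
precision_2k = precision_at_k(ranking, associated, 2 * n_associated)
```

In this toy case the first budget selects two associated SNPs out of three, and the doubled budget recovers all three at the cost of a lower precision.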

#### Concordance of the RFSS selections under uncertainty in the framework parameters:

We determine the rank of the follow-up SNPs, which represents the order in which they will be picked, using different values for the framework parameters, the noncentrality parameter of the causal SNP and *c*_{i}, in the WTCCC data. The concordance between two such rankings measures how invariant the selection is to the parameter values used. For the two rankings, the top *k* candidate SNPs are examined and the proportion of candidate SNPs that have the same rsIDs is recorded. The concordance of a ranking to itself is always one for all *k*; hence two rankings are highly similar if their concordance is close to one.
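The concordance measure can be sketched as below. This reads "the same rsIDs" as the overlap between the two top-*k* sets rather than position-by-position agreement, which is an assumption; the function name and the example rankings are hypothetical.

```python
def concordance(ranking_a, ranking_b, k):
    """Proportion of SNPs shared by the top-k entries of two rankings."""
    if k <= 0:
        raise ValueError("k must be positive")
    shared = set(ranking_a[:k]) & set(ranking_b[:k])
    return len(shared) / k


# Hypothetical rankings obtained under two parameter settings.
ranking_a = ["rs1", "rs3", "rs5", "rs7"]
ranking_b = ["rs3", "rs1", "rs7", "rs5"]

concordance_top2 = concordance(ranking_a, ranking_b, 2)
concordance_top3 = concordance(ranking_a, ranking_b, 3)
self_concordance = concordance(ranking_a, ranking_a, 3)
```

Under this reading, swapping the order of SNPs within the top *k* does not lower the concordance; only SNPs entering or leaving the top *k* do.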

Figure 8 shows the concordance plots of RFSS in RA, varying the noncentrality parameter of the causal SNP between 5.73 and 6.57 and *c*_{i} between 10^{−5} and 10^{−8}. The concordance of the mRFSS is similar to that of the RFSS and is hence not shown. The vertical line indicates the total number of statistically significant candidate SNPs under the significance level α = 10^{−8}. We observe that the follow-up SNP selection under RFSS is highly concordant across the different values used for the framework parameters. This verifies empirically that, even though our proposed RFSS method depends on unknown parameters, follow-up SNP selection is highly concordant within the range of likely values.
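The endpoints 5.73 and 6.57 can be related to statistical power under a simple model. The sketch below assumes, which the text does not state explicitly, that they are noncentrality parameters of a two-sided test on a unit-variance normal statistic at α = 10^{−8}. Under that assumption the two values correspond to powers of roughly 50% and 80%. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def two_sided_threshold(alpha):
    """Critical value t such that P(|Z| > t) = alpha under the null."""
    return STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)


def power(ncp, alpha):
    """P(|S| > t) when S is normal with mean ncp and unit variance."""
    t = two_sided_threshold(alpha)
    upper_tail = 1.0 - STD_NORMAL.cdf(t - ncp)
    lower_tail = STD_NORMAL.cdf(-t - ncp)
    return upper_tail + lower_tail


ALPHA = 1e-8
threshold = two_sided_threshold(ALPHA)
power_low = power(5.73, ALPHA)   # lower endpoint of the range
power_high = power(6.57, ALPHA)  # upper endpoint of the range
```

A noncentrality parameter equal to the critical value places half of the sampling distribution above the threshold, which is why the lower endpoint matches a power of about one half.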

## DISCUSSION

Currently, genome-wide association studies initiate the journey of disease gene discovery. The results of a GWAS are effective in guiding investigators to the genomic regions that may contain the causal variants. However, such regions may contain many polymorphisms and biologically validating all of them is not an efficient use of the resources. We introduced a follow-up study approach that may be useful in better characterizing the associated regions and may provide investigators a clear direction for biological validations.

Our method makes certain assumptions for predicting whether a candidate SNP is significantly associated, which depends on the observed tag SNP statistics, pairwise correlation between the tag and candidate SNPs, and the effect size of the causal SNP. We model the effect size of a candidate SNP assuming that the effect size of the causal SNP is known and the probability of each candidate SNP being causal is given. These parameters affect the probability of a candidate SNP being associated and thus whether the candidate SNP should be selected for the follow-up study. We empirically show that although the true values for these parameters are not known, surprisingly, within the range of likely values, predicting the association of a candidate SNP is mainly influenced by the known parameters, and errors in the assumptions on the unknown parameters do not change the predictions of our method. Additionally, our method does not require any genotype data either on the candidate or on the tag SNPs, which makes it easy to apply to the currently available GWAS results.

Our approach requires knowledge of the correlation between the test statistics at each marker and the relation between noncentrality parameters at causal variants and correlated markers. For the standard association statistic presented in this article, the correlation between statistics is simply the correlation (*r*) between the markers. For other association statistics that have the same correlation structure, we can directly apply our method. For statistics that have a different correlation structure, we can still apply our method if we replace Equations 7 and 10 with the correlation structure specific to the association study.
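To make the role of the correlation structure concrete, the sketch below shows one standard way these two ingredients combine. It is not the authors' RFSS and does not reproduce Equations 7 and 10. It assumes that the tag and candidate statistics are jointly normal with unit variance and correlation *r*, that the candidate SNP is causal with a known noncentrality parameter, and that the tag SNP's noncentrality parameter is *r* times that value. All names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def prob_candidate_significant(tag_stat, r, causal_ncp, alpha):
    """P(|S_c| > threshold | S_t = tag_stat), assuming the candidate is causal.

    Assumptions (illustrative): (S_t, S_c) is bivariate normal with unit
    variances and correlation r; E[S_c] = causal_ncp; E[S_t] = r * causal_ncp.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    threshold = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    tag_ncp = r * causal_ncp
    # Conditional distribution of S_c given the observed tag statistic.
    cond_mean = causal_ncp + r * (tag_stat - tag_ncp)
    cond_sd = sqrt(1.0 - r * r)
    conditional = NormalDist(cond_mean, cond_sd)
    return (1.0 - conditional.cdf(threshold)) + conditional.cdf(-threshold)


ALPHA = 1e-8
p_uncorrelated = prob_candidate_significant(6.0, 0.0, 5.73, ALPHA)
p_strong_tag = prob_candidate_significant(6.0, 0.8, 5.73, ALPHA)
p_weak_tag = prob_candidate_significant(3.0, 0.8, 5.73, ALPHA)
```

With *r* = 0 the tag statistic carries no information and the result reduces to the unconditional power. With a large *r*, a strong tag statistic raises the probability that the candidate is significant and a weak one lowers it.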

Several groups have performed sequencing in regions implicated in association studies in a small number of individuals to discover many new polymorphisms in those regions. These studies then follow up a subset of those polymorphisms by genotyping them in the entire case and control population. Usually many polymorphisms are discovered, each with a different correlation structure with respect to the tag SNPs. Our approach can be directly applied to select which subset of these discovered polymorphisms to genotype by using the correlations estimated from the sequenced data. Finally, our method is publicly available at http://genetics.cs.ucla.edu/followupSNPs.

## Acknowledgments

E.K. and E.E. are supported by National Science Foundation grants 0513612, 0731455, 0729049, and 0916676 and National Institutes of Health grants K25-HL080079 and U01-DA024417. This research was supported in part by the University of California, Los Angeles subcontract of contract N01-ES-45530 from the National Toxicology Program and National Institute of Environmental Health Sciences to Perlegen Sciences. J.A.L. is supported by the Saiotek and Research Groups 2007–2012 (IT-242-07) programs (Basque Government), the TIN2010-14931 Ministry of Science and Innovation (MICINN) project, and the COMBIOMED network in computational biomedicine (Carlos III Health Institute).

## Footnotes

Communicating editor: F. Zou

- Received December 27, 2010.
- Accepted March 22, 2011.

- Copyright © 2011 by the Genetics Society of America