Ranks of Genuine Associations in Whole-Genome Scans
Dmitri V. Zaykin, Lev A. Zhivotovsky

Abstract

With recent advances in high-throughput genotyping techniques, it is now possible to perform whole-genome association studies to fine-map causal polymorphisms underlying important traits that influence susceptibility to human diseases and efficacy of drugs. Once a genome scan is completed, the results can be sorted by the association statistic value. What is the probability that true positives will be encountered among the top-ranking markers? When a particular polymorphism is found to be associated with the trait, there is a chance that it represents either a “true” or a “false” association (TA vs. FA). Setting appropriate significance thresholds has been considered to provide assurance of sufficient odds that the associations found to be significant are genuine. However, the problem with genome scans involving thousands of markers is that the statistic values of FAs can reach quite extreme magnitudes. In such situations, the distributions corresponding to TAs and to the most extreme FAs become comparable, and significance thresholds tend to penalize TAs and FAs in a similar fashion. When sorting true from false associations, what becomes important is the “typical” place (i.e., rank) of TAs among the outcomes ordered by the association statistic value. The distribution of ranks that we study here allows calculation of several useful quantities. In particular, it gives the number of most significant markers needed in a follow-up study to guarantee that a true association is included with a certain probability. This can be calculated conditionally on having applied a multiple-testing correction. The effects of multilocus (e.g., haplotype association) tests and the impact of linkage disequilibrium on the distribution of ranks associated with TAs are evaluated and can be taken into account.

CONTINUOUS efforts to characterize genetic contributions to human diseases have led to the identification of many susceptibility genes. Mapping of genes with pronounced effects, such as those involved in certain types of breast cancer, Alzheimer's disease, and cystic fibrosis, provides examples of remarkable success. However, mapping complex diseases is generally complicated, and many studies fail to replicate. Risch and Merikangas (1996) gave recommendations for sample sizes required to detect genetic effects of a specified size with certain power, 1 − β. If an association genome scan contains L markers, the corrected 5% type I error rate is α = 0.05/L, and the required sample size can be calculated for a normally distributed test statistic as N = (Zα − σZ1−β)²/(2μ²), where μ and σ are the parameter values under the hypothesis of association, HA, and Zα and Z1−β refer to the α and 1 − β quantiles of the normal distribution (Risch and Merikangas 1996). When high power at the genome-wide α-level can be assured, this provides good odds that the most significant results represent actual associations.
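
The dependence of the required sample size on the genome-wide corrected α-level can be illustrated with a minimal sketch. It uses the generic normal-approximation relation between effect size, α, and power rather than the exact allele-based parametrization of Risch and Merikangas; the effect size and marker count below are illustrative assumptions only.

```python
from scipy.stats import norm

def required_n(effect, alpha, power):
    """Generic normal-approximation sample size: the one-sided test statistic
    is assumed to be N(sqrt(n)*effect, 1), so power = Phi(sqrt(n)*effect - z_{1-alpha})."""
    z_a = norm.ppf(1 - alpha)
    z_b = norm.ppf(power)
    return ((z_a + z_b) / effect) ** 2

L = 100_000                      # markers in the scan (illustrative)
alpha_gw = 0.05 / L              # Bonferroni-corrected per-marker level
print(required_n(effect=0.1, alpha=0.05, power=0.8))      # nominal 5% level
print(required_n(effect=0.1, alpha=alpha_gw, power=0.8))  # genome-wide level
```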

An important question that has not been adequately addressed so far is that of the ranks of true associations. Distributions of ranks are particularly important when designing follow-up studies and deciding how many markers must be considered in a follow-up study to capture all true associations with high probability (Satagopan et al. 2002). Consider the following study as an example: Ozaki et al. (2002) established association of an SNP in the lymphotoxin-α gene with susceptibility to myocardial infarction. In a genome scan of 65,671 SNPs the actual association was originally found to have a P-value of 0.0022, being less significant than >200 spurious associations that later failed to replicate. These included six false effects with P-values <10⁻⁵ and one false effect with P < 10⁻⁶. Is this a typical outcome? Or should we rather expect genuine associations to be the front runners? Would true associations (TAs) usually rank first among results that passed a “multiple-testing correction”? The genome scan of Ozaki et al. (2002) included 94 cases and 658 population controls. Their larger follow-up study estimated a relative risk of ∼1.6 associated with one of the genotypes at the putatively associated SNP. The approach that we describe here allows us to estimate that there was only a 14% chance that an association with such an effect would rank among the first 200 most significant results. To cover such an effect with only 50% probability in a genome scan with 65,671 SNPs, one would need to examine >3400 of the most significant results—a surprisingly high number. Even larger numbers are needed to cover the TA with probability >50%. Here we describe an approach that allows such calculations.

A simplified motivating description of the problem studied in this article is as follows: Suppose there is a single TA with a specific effect size, which could be measured by the penetrance value of a susceptibility allele. Given a fixed sample size, there is a certain probability (P) that the association test statistic calculated at this marker will exceed a given value, Z. However, as the total number of markers (L) increases, the largest test statistic among the false associations will eventually have a higher chance than P to exceed Z. In other words, the distribution of the smallest “false” P-value becomes increasingly skewed toward zero as L increases. In such situations, multiple-testing corrections cannot distinguish between true and false associations so as to penalize only the false ones. As a result, the position of TAs among results sorted by a measure of association or significance is not substantially affected (even in the subset of experiments that pass the correction). Thus, outcomes of an association genome scan can be thought of as realizations of a stochastic experiment in which statistic values representing TAs and false associations (FAs) compete in the magnitude of the value they are able to attain. The statistic values that rank first carry the highest odds of representing TAs, and the corresponding markers become candidates for an independent follow-up study. Considering recent recommendations calling for relatively large (100,000–200,000) numbers of markers (Goldstein et al. 2003), this becomes a problem of extensive multiple testing.

A traditional way to address the multiple-testing issue is to establish significance thresholds that preserve the probability of making a certain proportion of false claims under the “complete null” (i.e., when all markers represent FAs). Lander and Kruglyak (1995) gave explicit recommendations for significance thresholds appropriate for linkage studies, where the correlation between adjacent markers is high and extends over long genetic distances. Some issues with the type I error control approach are discussed below, and its effect on the distribution of ranks is investigated. Multiple-testing corrections that assure type I error control require a statistic value (or, equivalently, a P-value) to pass a certain threshold to be declared “significant.” The effect of this is merely a “truncation,” where an entire experiment (i.e., genome scan) could arguably be regarded as a failure if none of the markers reaches the threshold significance. We study the effect of such a truncation on the relative order of ranks of true and false associations by restricting attention to the results among the experiments that pass the significance threshold. In a single given experiment, corrections such as Bonferroni do not change the order of results and the original ranks. However, the distributional characteristics of ranks will change for a random experiment, given that some of the tests in that experiment have passed the correction. In other words, the distribution of ranks is generally different for the subset of genome scans with some significant results, compared to all genome scans with and without significant results.

A brief summary of our findings is as follows:

  1. Investigation of relative TA and FA ranks allows us to treat the multiplicity problem in a useful way that is complementary to the more traditional type I error and false discovery rate approaches. We find that in large-scale association genome scans, TAs are not likely to yield the largest association statistic values. This is in part due to the small effect sizes associated with complex traits and corroborates concerns about linkage disequilibrium (LD) mapping outlined in Terwilliger and Weiss (1998).

  2. Multiple-testing corrections do not have a drastic impact on the distribution of ranks, unless the power is high. Thus, procedures that preserve the type I error rate or the unconditional false discovery rate (FDR) are insufficient to guarantee that the most significant results represent TAs. Our approach helps to clarify sources of the apparently low chances of replication for originally found associations (Terwilliger and Weiss 1998; Ioannidis et al. 2001; Vieland 2001; Lohmueller et al. 2003; Lucentini 2004).

  3. The effect of even strong LD on TA ranks is too small to be of substantial importance in genome-wide association studies. This finding is in contrast with the statistical behavior in linkage analysis where the correlation extends throughout large chromosomal regions, greatly reducing the effective number of tests.

  4. Dependency among the association test statistics due to block structure of LD has a similar small impact.

  5. Multilocus approaches such as haplotype-trait association methods confer an additional advantage in terms of the ranks of TAs. This finding is independent of previous arguments proposed in favor of haplotype analysis.

RANKS OF TRUE AND FALSE POSITIVES

It is possible to give explicit recommendations for genome-wide significance thresholds that provide appropriate type I error control. However, type I error control considers only the situation of a complete absence of actual associations. The question of practical importance is rather, What are the chances that a particular significant test represents a true association? Morton (1998) pointed out that it is desirable to assure high reliability (ρ), the proportion of true discoveries among all (true and false) discoveries, an essentially Bayesian concept. Reliability relates to the frequentist false discovery rate (FDR) of Benjamini and Hochberg (1995) as ρ = ST/(ST + SF), where ST and SF are the numbers of true and false associations that were declared significant. The FDR is the average proportion of false positives across multiple studies, including those with no discoveries, whereas the reliability approximates the expectation, E[ST/(ST + SF)], among the studies where one or more rejections have been made. The value of ρ decreases with the overall number of tests conducted within a study and increases with the power to detect TAs.

The power is typically highly variable. Genome-scan marker sets consist of common SNPs spread throughout the genome. Actual causal polymorphisms are unlikely to be captured by any of the markers in the set. Instead, the hope is that sufficient marker density will provide high LD with the causal loci. In this case some of the markers surrounding the functional polymorphisms can be viewed as proxies for TAs. However, the population-specific nature of LD and its high variability, as well as the uncertainty in allele frequencies, make it difficult to specify the magnitude of the association with a marker. Although high power at the genome-wide α-level would provide good odds that the most significant results represent actual associations, the actual power to detect associations in the vicinity of causal variants is likely to be low and indeterminate in practice.

The expected conditional FDR, or the proportion of false discoveries among all discoveries, E(1 − ρ), can be related to the posterior probability of the null hypothesis (Storey 2002). This approach is most straightforward in situations where the proportion of TAs is relatively high, so that the distribution of the mixture of TAs and FAs can be characterized empirically and distinguished from the distribution of FAs. More generally, it involves specifying the prior TA probability (the expected proportion of TAs among all studied markers) as well as the power characteristics of markers representing the TAs. We study the problem in terms of the stochastic ordering and ranks of statistic values associated with true and false positives. Instead of calculating the probability of the null hypothesis (H0) given a P-value, we compute the number of most significant results needed to contain a TA with a given confidence.
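
The distinction between the unconditional FDR and the conditional proportion of false discoveries, 1 − ρ, can be illustrated with a small Monte Carlo sketch. The number of markers, proportion of TAs, per-test power at the corrected level, and threshold below are illustrative assumptions, not values from this study.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n_studies=2000, s=10_000, m=5, power=0.5, alpha=0.05 / 10_005):
    """Compare the FDR (zero-discovery studies counted as 0) with the conditional
    proportion of false discoveries, 1 - reliability, among studies with >= 1 discovery."""
    fdp_all, fdp_conditional = [], []
    for _ in range(n_studies):
        v = rng.binomial(s, alpha)    # false discoveries at the corrected level
        t = rng.binomial(m, power)    # true discoveries (power is at the corrected level)
        r = v + t
        fdp_all.append(v / r if r > 0 else 0.0)
        if r > 0:
            fdp_conditional.append(v / r)
    return np.mean(fdp_all), np.mean(fdp_conditional)

print(simulate())
```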

In the case of independence between genetic markers, the rank probabilities can be derived analytically. The appendix describes the necessary theory and the Monte Carlo approach used to approximate the analytic results as well as to model ranks under more complex situations of dependency due to LD.

RESULTS

We considered two basic calculations. The first is the probability that TAs are found among some fixed number (R) of most significant results in a scan (Table 1). The second is the number of most significant results required to contain TAs with some fixed probability (Tables 2–6). Both calculations are performed with or without a multiple-testing correction. Imposing a multiple-testing correction results in discarding experiments that did not pass the significance threshold. The ranks are then calculated for the subset of scans that passed the correction. Although the ranks do not change within a single given genome scan, they are expected to improve among the scans that passed the threshold. This models a researcher's expectation that genome scans with multiplicity-corrected, statistically significant results are more likely to represent TAs, especially among the tests that passed the correction. The effect of LD on the ranks is studied for the case of strong LD that is homogeneous across the genome, as well as for LD that has a block structure. For the case of multiple TAs, the chances of finding some or all of the effects are evaluated. Finally, the effect of multilocus tests (e.g., haplotype association tests) is considered and compared to the results obtained for tests based on individual SNPs.

TABLE 1

Probabilities that all m = 3 true effects are found among R smallest P-values (with 100,000 false effects)

TABLE 2

The number of most significant SNPs that will contain a single true association with 95% probability (1 TA + 200,000 FA markers scan)

TABLE 3

The number of most significant SNPs that will contain a single true association with 50% probability (1 TA + 200,000 FA markers scan)

TABLE 4

Block LD dependency with one true effect (m = 1)

TABLE 5

Block LD dependency with three true effects (m = 3)

TABLE 6

Two-SNP genotypic (8 d.f.) tests

Table 1 shows unconditional and conditional probabilities of finding all m = 3 TAs among the R most significant results in a genome scan with s = 100,000 independent FAs. We used four values of power to detect individual true effects (0.75, 0.85, 0.95, and 0.99), assuming the normal distribution in (A1), and compared probabilities under no multiple-testing correction (δ = 1) with probabilities obtained conditional on at least one P-value satisfying the Šidak correction, δ = 1 − (1 − 0.05)^(1/100,003). Table 1 indicates that with a multiple-testing correction, the chances that all three true effects are found among the R smallest P-values improve by only a small amount. With 95% power, we need to consider ∼500 most significant results to bring this probability up to 44%; the multiple-testing correction only slightly increases it to 54%. This improvement would be even less pronounced for adjustments less stringent than Šidak or Bonferroni. Power has the most definite influence on the probability of finding true effects. When the power to find any one of the three true effects in a single 5%-level test is 75%, there is practically no chance of finding all three effects when up to 100 markers are considered. The probability conditional on the multiple-testing adjustment is still small even with R = 1000 for this value of power. The effect of the power associated with TAs is of great importance. There is considerable improvement in the probability of finding true effects when power reaches 99% and above. For example, the increase in power from 95 to 99% lowers the number of markers needed for a 50% probability of discovery from R = 500 to R = 50.

This effect is also apparent in Tables 2 and 3. These tables use s = 200,000 FAs and a single TA (m = 1), assuming the 1-d.f. χ²-distribution for the test statistic in (A1). The probability that this effect is among the first R most significant markers is set to 0.95 for Table 2 and to 0.50 for Table 3. The entries in the tables show the value of R required to attain these probabilities. Dependency among P-values due to LD is added assuming 200,000 equally spaced markers with the correlation decay characteristics shown in Figure 1. The correlation shown in Figure 1 corresponds to quite extensive LD. Generally, high correlation between alleles at two loci (i.e., LD) translates into much weaker correlation between the corresponding P-values computed for two single-locus tests of association (Nielsen et al. 2004). Tables 2 and 3 show that unless the power is very high, a substantial proportion of the original 200,000-marker set needs to be examined to ensure that it contains the TA. The effect of power is most substantial at values close to 99%. Only at these high-power values is there a definite benefit of a multiple-testing correction, in that the rank of the true association drops to manageable numbers (e.g., 1671 markers without the correction vs. 70 markers with it, for the 99% power, no-LD entry in Table 2). Since LD usually follows a block structure, additional simulations were carried out with the mixture of two diffusions described in the appendix. The decay of correlation within and between blocks followed the pattern shown in Figure 2, with 25% of SNPs allocated within the blocks. Proportions of genome regions within blocks have been reviewed by Wall and Pritchard (2003) for various populations and subsets of the genome.

The usage of P-value correlation most closely corresponds to the LD measured by the composite correlation rAB (Weir 1996), which is a stronger requirement than that of D′—the gametic or composite LD normalized by the bounds on the covariance between alleles (Lewontin 1964; Hamilton and Cole 2004; Zaykin 2004). The coefficient rAB has well-defined statistical and population-genetic properties. It serves well in the context of association mapping (Meng et al. 2003), because high values of rAB imply that an allele at one locus can be regarded as a proxy for an allele at another locus. This requires dependency as well as relative closeness of allele frequencies at the two loci.

Figure 1.—

Diffusion-generated P-value correlation decay. The distance between two neighboring markers (1 unit on the x-axis) is 15 kb.

Figure 2.—

P-value correlation decay within (left) and between (right) LD blocks. The distance between two neighboring markers (1 unit on the x-axis) is 15 kb.

The block LD parameters were chosen to produce a correlation decay similar to that shown in Figure 1 when averaged across large distances. Results shown in Table 4 indicate that the block structure of correlation did not substantially affect the outcomes compared to the results obtained for the monotone decay of correlation with distance. Dependency among markers has only a very small effect on the ranking of TAs, because of the local nature of the correlation decay. As far as ranks of true positives are concerned, the “effective number of tests” is roughly equal to the total number of markers assuming no LD. It is our experience that in actual genome-wide association studies the average correlations between association test P-values are smaller in magnitude, with a decay that resembles the shape used in the present study.

When the number of TAs is increased, the probability of finding all TAs rapidly decreases. For example, with m = 1 and 99% power, the numbers of most significant markers that contain the TA with 95% probability are 1671 (with no correction) and 70 (with the Šidak correction; Table 2, independence case). When m = 2, the numbers increase to 3969 and 2155, respectively. With m = 3, they become 6112 and 4573. These calculations assume the same 99% power for all TAs, although it is likely that as m increases, the power associated with individual TAs decreases.

The probability of finding one or more TAs increases with the number of TAs, but only under the unlikely assumption that the power associated with individual effects does not decrease. With m = 1 and 78.5% power, the numbers of most significant markers that contain the TA with 95% probability are 54,233 (with no correction) and 47,620 (with the Šidak correction). With m = 3, the numbers become 3165 and 2196, although it could be naively assumed that these should correspond to the case of m = 1 with 1 − (1 − 0.785)³ = 99% power. More results for the case of m = 3 are given in Table 5. The results confirm that a lower power associated with each effect is sufficient to obtain the same probabilities as those for a single true effect. For example, 80% power associated with each of 3 TAs results in numbers similar to those for a single TA with 95% power (for a 50% probability of containing TAs). It is expected, however, that given multiple effects the power associated with each of them will generally be quite low.

An assessment of the anticipated decrease in power can be obtained under a simple model in which multiple effects independently contribute to a binary phenotype. Suppose that there are m susceptibility polymorphisms contributing to the trait. If m = 2, with corresponding frequencies among the cases p1 and p2, then the frequency of carrying at least one of the susceptibility polymorphisms is pc = p1 + p2(1 − p1) = 1 − (1 − p1)(1 − p2). For simplicity, assume the same common frequency, p, of all polymorphisms among the cases. Then pc = 1 − (1 − p)^m and, conversely, p = 1 − (1 − pc)^(1/m). A similar calculation holds for the frequency among controls, qc. For a given relative risk pc/qc it is possible to calculate the expected power associated with any one of these m polymorphisms. The resulting power curves, obtained with the relative risk power formula (Equation 1, discussed below), are shown in Figure 3 for pc = 0.3 and qc = 0.2. The discrepancy is large: nearly 100% power for the compound effect of all three polymorphisms corresponds to only ∼85% power for the individual effects. In other words, it is difficult to ensure in practice that high power holds for all three effects (e.g., Table 5, last row).
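
A minimal sketch of this decomposition, using the per-polymorphism case frequency p = 1 − (1 − pc)^(1/m) and a normal approximation for ln(p/q) along the lines of Equation 1; the sample size below is an illustrative assumption, and the numbers are not those plotted in Figure 3.

```python
from math import log, sqrt
from scipy.stats import norm

def rr_power(p, q, n, alpha=0.05):
    """Approximate one-sided power for H0: RR = 1, based on the asymptotic
    normality of ln(p_hat/q_hat) with n cases and n controls (cf. Equation 1)."""
    se = sqrt((1 - p) / (n * p) + (1 - q) / (n * q))
    return norm.cdf(log(p / q) / se - norm.ppf(1 - alpha))

def individual_frequency(pc, m):
    """Per-polymorphism frequency p when m independent susceptibility
    polymorphisms jointly give compound frequency pc = 1 - (1 - p)^m."""
    return 1 - (1 - pc) ** (1 / m)

pc, qc, m, n = 0.3, 0.2, 3, 500          # pc, qc as in Figure 3; n is an assumed sample size
p, q = individual_frequency(pc, m), individual_frequency(qc, m)
print(rr_power(pc, qc, n), rr_power(p, q, n))   # compound vs. individual-effect power
```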

Figure 3.—

Expected power associated with individual effects (bottom line) given overall power for three effects (m = 3) (top line).

Genome-scan analysis may involve multilocus tests, e.g., statistical tests relating haplotype frequencies to phenotype values. The problem is which polymorphisms to include in a particular test. Previously, we investigated the power of the “sliding-window” approach, where several neighboring polymorphisms are included in the current window and an overall test is performed (Zaykin et al. 2002a). The total number of tests for the whole-genome scan remains essentially the same, although there is additional, but very short-range, correlation between tests sharing the same polymorphisms. Such multilocus approaches are advantageous when there are substantial haplotype effects. Nevertheless, there is the problem of balance between the increase in degrees of freedom associated with the haplotype tests and the increase in the effect size—no power increase is guaranteed with haplotype analysis even if the effects are specifically haplotype driven. Another possible advantage of haplotype analysis is that haplotypes can be in higher LD with unobserved mutations than the individual SNPs comprising them. However, in such situations haplotype tests rarely result in a substantial gain in power. Higher power compared with tests for individual SNPs is observed primarily when there are multiple susceptibility SNPs and the LD is low (Morris and Kaplan 2002).

In our framework, the major influence of multilocus (e.g., haplotype) tests on the distribution of ranks is via the form of the distribution of P-values corresponding to multilocus TAs. We considered the situation of block-LD-correlated test statistics with 8 d.f. that would, for example, correspond to the overall genotypic test involving two diallelic SNPs, with nine distinguishable dilocus genotypes. A haplotypic test with three SNPs would have a similar number (7) of degrees of freedom. Results are shown in Table 6. Although the results are similar to those of the 1-d.f. tests in Table 4, all 8-d.f. ranks are smaller than those for the 1-d.f. tests (excluding the situations when the power is high enough that the TA ranks first). The difference between the 1- and 8-d.f. tests is more pronounced under the multiple-testing correction (second and fourth columns of the tables). The difference between the entries of the two tables is statistically significant (the Wilcoxon signed-rank test P-value is 0.0005 for comparing just the 95% entries).

The relation between the degrees of freedom (assuming a chi-square distribution of the association measure) and the distribution of ranks is rather complicated. For a particular quantile of the P-value distribution, the exact relation regarding the optimal degrees of freedom can be based on the following computation. Let (γ1, γ2, …, γk) be the noncentrality parameters required for tests with (1, 2, …, k) d.f. to have the same power (1 − β) at a particular α-level. Then the i = 1, …, k significance levels αi* at the quantile q of the TA P-value distribution are αi* = 1 − Ψdi,0(Ψdi,γi⁻¹(1 − q)), where Ψdi,γi(·) is the χ²-cumulative distribution function (CDF) with di degrees of freedom and noncentrality parameter γi, Ψdi,0(·) is the corresponding central CDF, and Ψ⁻¹ denotes the inverse CDF. A sample graph for 80% power tests and the 70% quantile is shown in Figure 4. Note that at the 80% quantile the graph would be a straight line at 0.05. In this figure, the 4-d.f. test appears “optimal,” as it corresponds to the minimum significance level; however, more important is the relation at smaller quantiles, somewhere in the neighborhood of the smallest expected P-values corresponding to false effects (i.e., around the Bonferroni level). At such levels (with 200,000 tests) the optimal d.f. = 9 (Figure 5), and tests with even much larger degrees of freedom still have smaller significance levels than a 1-d.f. test. In general, the exact relation between the degrees of freedom and the significance level depends on the particular quantile and on the power at a given α-level. Still, our results indicate that tests with moderate degrees of freedom, e.g., haplotype tests including two to four SNPs, may score better in terms of ranks. This provides an additional justification for multilocus or haplotype-based analysis in whole-genome scans.
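
A sketch of this computation, assuming the relation αi* = 1 − Ψdi,0(Ψdi,γi⁻¹(1 − q)) as reconstructed above and using the noncentral χ² distribution from scipy; the power, α-level, and quantiles are the illustrative settings of Figures 4 and 5, and no claim is made about which d.f. comes out optimal under other settings.

```python
from scipy.stats import chi2, ncx2
from scipy.optimize import brentq

def noncentrality(df, alpha=0.05, power=0.8):
    """Noncentrality giving the stated power at level alpha for a df-d.f. chi-square test."""
    crit = chi2.ppf(1 - alpha, df)
    return brentq(lambda g: ncx2.sf(crit, df, g) - power, 1e-9, 1e3)

def alpha_star(df, q, alpha=0.05, power=0.8):
    """Significance level alpha* such that Pr(P <= alpha*) = q for the TA P-value
    of a df-d.f. test calibrated to the given power at level alpha."""
    g = noncentrality(df, alpha, power)
    return chi2.sf(ncx2.isf(q, df, g), df)

for df in range(1, 13):
    # q = 0.70 corresponds to Figure 4, q = 0.05 to Figure 5
    print(df, alpha_star(df, 0.70), alpha_star(df, 0.05))
```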

Figure 4.—

Plot of degrees of freedom vs. significance level (α) for 80% power tests at the 70% quantile.

Figure 5.—

Plot of degrees of freedom vs. significance level (α) for 80% power tests at the 5% quantile.

Returning to the single-SNP analysis of Ozaki et al. (2002), we emphasize the low (14%) estimated probability that their associated SNP, with the estimated relative risk of 1.6, would rank among the first 200 most significant results, as was observed in their study. For that SNP to appear with 50% probability among the most significant results, the number of most significant SNPs that must be examined is estimated to be 3460, given their sample size of 94 cases and 658 controls. For the case and control frequencies (p, q), the power of a test for H0: relative risk (RR) = 1 can be obtained by considering the asymptotic normal distribution of ln(p/q). For example, Equation 27 in Zaykin et al. (2004) can be inverted to obtain

1 − β = Φ( ln(p/q) / [(1 − p)/(Np) + (1 − q)/(Nq)]^(1/2) − Z1−α ).    (1)

This equation is for equal sample sizes of cases and controls, but an “effective common sample size” can be calculated that provides the same power as two given sample sizes of cases and controls. The power curve closely follows that of the more common test for the difference between two frequencies (McGinnis et al. 2002, Equation 3). For two sample sizes, n1 and n2, we define the effective common sample size as the sample size N that would provide approximately the same power using such a test. We find the value of N as

N = [p(1 − p) + q(1 − q)] / [p(1 − p)/n1 + q(1 − q)/n2].    (2)

When p = q, the common sample size N in (2) is the harmonic mean of n1 and n2. With a relative risk of 1.6, the power reaches 99% when n1 = n2 = 990. With such a sample size, there is a 50% probability that the associated SNP would appear among the first 2 most significant results and a 90% probability that it would appear among the first 200.

Reports of large-scale association studies are still scarce. One recent study (Kammerer et al. 2004) identified a replicated association on chromosome 19 in an intercellular adhesion molecule gene (ICAM) that influences nonfamilial breast cancer risk. Kammerer et al. used a pooled-DNA genome-wide scan with 25,495 SNPs and a sample of 254 cases and 268 controls. Disregarding some increase in the test statistic variance due to pooling, we estimated that there was an 84% probability of encountering the true positive, with the relative risk of 1.3 reported in their study, among the first 550 results. We estimated the mean rank of the true positive to be 506. The actual rank of that SNP in the genome scan of Kammerer et al. was 550, in close agreement with our prediction (M. Nelson, personal communication).
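
A sketch of these two calculations as reconstructed in Equations 1 and 2 above; the frequencies below are assumptions chosen only so that p/q = 1.6, and the output is not meant to reproduce the exact figures quoted for the studies discussed.

```python
from math import log, sqrt
from scipy.stats import norm

def power_rr(p, q, n, alpha=0.05):
    """Equation 1 (as reconstructed): approximate one-sided power of the test of
    H0: RR = 1, based on the asymptotic normality of ln(p_hat/q_hat), n per group."""
    se = sqrt((1 - p) / (n * p) + (1 - q) / (n * q))
    return norm.cdf(log(p / q) / se - norm.ppf(1 - alpha))

def effective_n(p, q, n1, n2):
    """Equation 2 (as reconstructed): common sample size N giving roughly the same
    power as n1 cases and n2 controls; reduces to the harmonic mean when p = q."""
    num = p * (1 - p) + q * (1 - q)
    den = p * (1 - p) / n1 + q * (1 - q) / n2
    return num / den

p, q, n1, n2 = 0.32, 0.20, 94, 658   # assumed frequencies; sample sizes as in Ozaki et al.
n_eff = effective_n(p, q, n1, n2)
print(n_eff, power_rr(p, q, n_eff))
```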

DISCUSSION

True associations are difficult to find even when they are present among the results, because their ranks tend to be large (far from the top) among all results ordered by the magnitude of a measure of association or by P-values. The common solution of stringent multiple-testing control fails to influence the relative order of true and false associations unless the power to detect true associations is very high. When viewed as a problem of stochastic ordering, the reason becomes clear: multiple-testing procedures penalize collections of P-values (whole-genome scans) in a similar fashion, with no regard to which genome scans produced true associations ranking at the top of the results. Introducing a high level of dependency only slightly reduces the effective number of markers, because the extent of the average correlation is short on the whole-genome scale. It is essential that sufficient sample sizes are used to ensure very high power to detect true associations. This problem is specific to situations where the number of tests is very large and approximately equal to the number of false effects (L ≈ s), such as in association genome scans. At the extreme of only two markers, with one of them being a true association, the probability of correctly identifying it using a P-value quickly increases with power. On the other hand, whole-genome scans with >100,000 markers demand extremely high power (at least 99%) to reduce the number of markers required for replication in a follow-up study to a manageable value. The probability of locating a true association increases more rapidly at high values of power: the increase from 95 to 99% is of much greater consequence than the increase from 85 to 89%, although the relative changes, e.g., (1 − 0.95)/(1 − 0.99) vs. (1 − 0.85)/(1 − 0.89), are more comparable.

Our examples assume much lower power values than those used in the calculations of Risch and Merikangas (1996), who considered sample sizes needed to achieve 80% power at the α-level of 0.05/L. The power values we use correspond to α = 0.05, reflecting the influence of the effect magnitude, variability, and the sample size at a given marker. So, for example, 99.999% power at α = 0.05 corresponds to 80% power at α = 0.05/200,000 for a normally distributed test statistic. Such high power demands very large sample sizes, and high power at the genome-wide α-level would ensure that TAs are likely to appear among the front runners. In practice, sample sizes remain modest, especially with regard to the generally small sizes of genetic effects associated with complex traits. Moreover, genome-scan SNPs are expected to be proxies for actual causative polymorphisms only through highly variable and population-specific LD, which further decreases the effect size of such indirect associations. When multiple moderate effects contribute to the trait variation it is tempting to combine results from markers with large statistic values. However, our results suggest that the TAs tend to be interspersed with hundreds to thousands of false effects, and the results of such an analysis should be viewed with caution.
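
The correspondence between power at the nominal and the genome-wide α-levels can be checked with a short calculation, assuming a one-sided normal test statistic as in the text:

```python
from scipy.stats import norm

def power_at(alpha_new, alpha_ref=0.05, power_ref=0.99999):
    """Power of a one-sided normal test at level alpha_new, given its power at alpha_ref."""
    gamma = norm.ppf(1 - alpha_ref) + norm.ppf(power_ref)   # implied shift of the test statistic
    return norm.cdf(gamma - norm.ppf(1 - alpha_new))

print(power_at(0.05 / 200_000))   # roughly 0.8, as stated in the text
```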

One could weigh the number of tests (L) against power (which depends on the sample size N) when trying to design an optimal configuration of L and N. The problem is that high marker density is required because of the relatively short extent of LD in human populations; thus, high values of L (≥100,000) are needed (Goldstein et al. 2003). Therefore, if only a certain number of markers can be followed up in a replication study, the calculations for the required N should be carried out. If a case-control test is conducted, a quantity of interest can be the RR of a genetic variant, A. If the frequencies of A among the cases and controls are p and q, respectively, the power of a test for H0: RR = 1 is approximately as given in (1). The population frequency of A is wp + (1 − w)q, where w is the prevalence of cases in the population. The same power calculation corresponds to a test of the hypothesis that the genetic susceptibility associated with A is the same as the population prevalence (Zaykin et al. 2004).

A potential caveat to the TA definition is that significance around false effects might be relatively more erratic, so there may be additional information in the markers surrounding the peak corresponding to a TA. However, this issue remains controversial. In the context of linkage analysis, Terwilliger et al. (1997) claimed that the peaks around true positives are expected to be wider, but Visscher and Haley's (2001) reexamination suggested that there is no additional information in the length of the peak. Siegmund (2001) concluded that, in general, methods that take the length of the peak into account (e.g., various smoothing techniques) can be somewhat helpful but are not expected to achieve substantial gains and that statistically this problem remains difficult. We have previously proposed an approach to combine neighboring P-values in a sliding window by the “truncated product method” (Zaykin et al. 2002b). The result of applying this method is a new set of P-values combined (smoothed) over regions covered by the sliding window. Theoretically, a combined P-value can exceed the values of the individual P-values in the combined set.
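
A minimal sketch of such a sliding-window combination using the truncated product statistic, with the null distribution approximated by Monte Carlo under an assumption of independent P-values within a window; the window size, truncation point, and simulation settings are illustrative choices, not the ones used in Zaykin et al. (2002b).

```python
import numpy as np

rng = np.random.default_rng(3)

def truncated_product(pvals, tau=0.05):
    """Truncated product statistic on the log scale: -sum of log P-values that are <= tau
    (a window with no P-value below tau gives 0, i.e., an empty product of 1)."""
    logs = np.log(pvals[pvals <= tau])
    return -logs.sum()

def sliding_window_tpm(pvals, window=10, tau=0.05, n_null=5000):
    """Each window's truncated product is referred to a Monte Carlo null of
    independent uniform(0, 1) P-values, yielding one combined P-value per window."""
    null = np.array([truncated_product(rng.uniform(size=window), tau)
                     for _ in range(n_null)])
    combined = []
    for start in range(len(pvals) - window + 1):
        stat = truncated_product(pvals[start:start + window], tau)
        combined.append((null >= stat).mean())   # combined P-value for this window
    return np.array(combined)

print(sliding_window_tpm(rng.uniform(size=100))[:5])
```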

We found that multiple-testing corrections have only a mild effect on the ranks of the TAs. That is, among genome scans that yield one or more tests satisfying the significance threshold, the ranks of TAs (their expected distributional characteristics) are only moderately improved. This is in line with the observation that the proportion of false discoveries among multiple-testing-corrected results can be substantially higher than the significance level (Zaykin et al. 2000). Neither a high amount of LD between the markers nor the presence of an LD block structure was found to greatly influence the ranks of TAs. The extent of high correlation has to be large enough to span whole segments of chromosomes for the effect to be considerable. In the context of linkage analysis such an extended correlation has a profound effect, as has been demonstrated previously (Lander and Botstein 1989).

Our results suggest that statistical tests with moderately large degrees of freedom (e.g., haplotype or other multilocus tests) may score better in terms of ranks than single-SNP tests with 1 or 2 d.f. When the tests are calibrated to the same power at a conventional significance level, the moderate degrees-of-freedom tests tend to produce smaller P-values, ranking closer to the top. This implies a lower conditional FDR for these tests and gives an additional justification for using haplotypic tests in whole-genome scans. Exact recommendations are difficult to give, because the results depend on parameters other than the degrees of freedom, as described above. In genome scans with >100,000 markers, tests with ≥6 d.f. are expected to have a better or equal distribution of TA ranks compared with tests involving single SNPs.

In summary, we suggest that actual genetic associations are unlikely to appear among the front runners unless high power at the genome-wide level can be assured. We emphasize that although the results may be “type I error protected,” the most significant observations are still likely to represent false positives. Similar observations are found in actual association scans (Ozaki et al. 2002; Kammerer et al. 2004). Considering the substantial effort put into every association genome scan, as well as the typically high prior understanding of the genetic contribution to the variation in the trait of interest, it is important to determine the number of markers that should be carried into an independent follow-up study. We provide an approach for making such calculations. A program for computing true association ranking probabilities is available from D.V.Z. upon request.

APPENDIX: THEORY ON STOCHASTIC ORDERING OF TRUE AND FALSE POSITIVES

To model the distribution of ranks, we start with the joint distribution of P-values corresponding to true and false associations, working in terms of order statistics that come from distinct densities corresponding to TA and FA P-values. In this regard, P-values are used as a ranking measure reflecting the degree of association. Equivalently, the problem can be described in terms of an arbitrary association measure used to order results. Integrating the joint distribution allows calculating quantities such as the probability that the largest true association P-value is smaller than the ith smallest false association P-value, as well as the expected ranks for the true associations. These quantities can be calculated either unconditionally or conditionally upon a multiple-testing correction and require the assumption of independence. A diffusion process is used to extend the method, allowing for dependence due to LD between genetic markers.

Consider a continuously distributed test statistic for association, Ti, at the marker i, with the corresponding P-value, pi = 1 − F0(Ti), where F0(·) is the CDF of Ti, assuming the null hypothesis, H0. This is the correct CDF for markers representing FAs, but the actual CDF of the test statistic for TAs is denoted by FT(·|γ). In the case of a χ²-distribution, γ is the noncentrality parameter. If Ti are normally distributed, γ refers to the shift in the mean of the distribution caused by the association with the trait. Markers are ordered in terms of P-values, so that p(1) ≤ p(2) ≤ … ≤ p(L), where L is the overall number of markers in the study. For the FAs the CDF of P-values is Pr(P ≤ p) = p, where P and p denote the random variable and its value, respectively. For the TAs we have, assuming a one-tailed test,

Pr(P ≤ p | γ) = 1 − FT(F0⁻¹(1 − p) | γ).    (A1)

For illustrative purposes, we now assume a normally distributed Ti, although calculations are similar for other continuous distributions. The results are also presented assuming a χ²-distribution that is more relevant in exploratory analysis so that the direction of the effect is not considered. With the normal Ti, Equation A1 becomes Pr(P ≤ p | γ) = 1 − Φ(Φ⁻¹(1 − p) − γ), where Φ(·) is the normal CDF, and γ = √N μ/σ is the power parameter that depends on the sample size, N, effect size, μ, and the variance, σ². The normal test statistic P-value density is

πT(p | γ) = ϕ(Φ⁻¹(1 − p) − γ) / ϕ(Φ⁻¹(1 − p)),

where ϕ(·) is the normal density function. Note that ∫₀¹ πT(p | γ) dp = 1, and πT(p | γ = 0) = 1—i.e., p ∼ uniform(0, 1) under H0. Let Xi be the random variable corresponding to the ith smallest FA P-value. Denote the most significant TA among m true associations, with its random P-value, by Y ≡ Y(1). Consider the events A ≡ {Y < Xi}; B ≡ {min(X1, Y) ≤ δ}; Bᶜ ≡ {min(X1, Y) > δ}, where Pr(B) + Pr(Bᶜ) = 1, and δ is a multiple-testing-adjusted α-level; e.g., Šidak's adjusted δ is 1 − (1 − α)^(1/L). For large L this gives ≈ α/L, the Bonferroni correction. If true and false effects are independent, the joint density function of Xi and Y is (dropping conditioning on γ and other parameters from the notation for simplicity)

f(xi, y) = [xi^(i−1) (1 − xi)^(s−i) / B(i, s − i + 1)] × [ϕ(Φ⁻¹(1 − y) − γ) / ϕ(Φ⁻¹(1 − y))],    (A2)

where the number of FAs is s = L − m, B(·, ·) is the beta function, the first ratio is the beta(i, s − i + 1) density of the ith smallest false P-value, and the ratio of normal densities is the density of the TA P-value distribution, assuming independence and common power characteristics for each of the m associations. The problem is to find Pr(A | B) = Pr(Y ≤ Xi | min(X1, Y) ≤ δ). First consider m = 1. Pr(A | B) is the probability that the true association ranks below the ith most significant FA P-value.
It is found as

Pr(A | B) = [Pr(A) − Pr(A ∩ Bᶜ)] / Pr(B),    (A3)

where Pr(B) = 1 − (1 − δ)^s [1 − Pr(Y ≤ δ)], Pr(A) = Pr(Y ≤ Xi) is obtained by integrating (A2) over {y ≤ xi}, and (written for i ≥ 2)

Pr(A ∩ Bᶜ) = ∫δ¹ ∫x1¹ ∫δ^xi {s!/[(i − 2)!(s − i)!]} (xi − x1)^(i−2) (1 − xi)^(s−i) πT(y | γ) dy dxi dx1.    (A4)

The first term in (A4) is the joint density of X1 and Xi, independent of Y, whose density is given by the second term. Alternatively, we wish to find a value of i such that Pr(A | B) is equal to a certain value, such as 95%. This is found by computing (A3) over the range of i. The resulting value of i is equivalent to the number of most significant associations that will contain the actual TA with 95% probability. Equation A3 measures the probability that the true positive will rank right below the ith most extreme false positive. This probability is conditional on at least one P-value satisfying the multiple-testing threshold, δ. An unconditional version is obtained by setting δ = 1. Satagopan et al. (2002) considered the unconditional ranking in the problem of optimizing the total number of typed markers for two-stage association-mapping study designs. The “average” ranking of Y, or its expected rank, is given by

E(RY) = 1 + Σi=1..s [1 − Pr(Y ≤ Xi | B)].    (A5)

To modify Equation A3 allowing for the calculation of the probability that all TAs will rank below i, we need the distribution of the largest TA P-value among (m > 1) TAs, with density m G(y)^(m−1) πT(y | γ), where G(·) denotes the TA P-value CDF in (A1), as well as the joint density of the maximum and minimum P-value for TAs,

f(y(1), y(m)) = m(m − 1) [G(y(m)) − G(y(1))]^(m−2) πT(y(1) | γ) πT(y(m) | γ),  y(1) ≤ y(m).

The probability that all true positives will rank before the first i false positives, conditional on a multiple-testing correction, is

Pr(Y(m) ≤ Xi | min(X1, Y(1)) ≤ δ) = [Pr(Y(m) ≤ Xi) − Pr(Y(m) ≤ Xi, X1 > δ, Y(1) > δ)] / {1 − (1 − δ)^s [1 − G(δ)]^m}.    (A6)

The expected rank of Y(m) is obtained in the same manner as in (A5).

These equations can be evaluated by numerical integration or, more conveniently, via Monte Carlo simulations. Simulations are set up as follows. FA P-values are sampled from the uniform(0, 1) distribution. The TA P-value distribution function (A1) can be inverted to sample each of the m true association P-values. For example, assuming a normally distributed test statistic, {pi} are obtained as pi = 1 − Φ(Φ⁻¹(1 − ui) + γ), where ui is a uniform(0, 1) random deviate. With a chi-square test statistic, P-values are generated as pi = 1 − Ψd,0(Ψd,γ⁻¹(ui)), where Ψd,γ(·) is the χ² CDF with d degrees of freedom and noncentrality parameter γ, and Ψd,0(·) is the corresponding central CDF. Once all L = s + m P-values are generated, they are ordered and their ranks are recorded. This constitutes a single simulation experiment. Probabilities such as that in (A3) are approximated by the proportion of experiments in which Y ranked below the ith false positive. The parameter γ governs the power associated with individual TAs. For example, γ = 3.29 for the normally distributed test statistic corresponds to a 95% probability of detecting true effects with a 5%-level test and can be obtained as Φ⁻¹(1 − 0.05) + Φ⁻¹(0.95) ≈ 3.29. Such simulations are easy to set up, and they become more convenient than numerical integration when L is large. Although we considered a fixed value of m to keep the calculations transparent, in principle this value can be given a distribution, allowing for uncertainty about m.
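
A minimal sketch of this simulation for the normal-statistic case, conditioning on the Šidak threshold and estimating the probability that all m TAs fall among the R smallest P-values; the settings mirror one cell of Table 1 so the estimate can be compared with the corresponding entry there, and the number of simulations is kept small for speed.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2005)

def rank_simulation(s=100_000, m=3, power=0.95, alpha=0.05, R=500,
                    n_sims=2000, sidak=True):
    """Monte Carlo sketch of the appendix procedure: probability that all m TAs
    fall among the R smallest P-values, conditional on passing a Sidak correction."""
    gamma = norm.ppf(1 - alpha) + norm.ppf(power)           # power parameter for each TA
    delta = 1 - (1 - alpha) ** (1 / (s + m)) if sidak else 1.0
    hits = kept = 0
    for _ in range(n_sims):
        fa = rng.uniform(size=s)                             # false-association P-values
        ta = 1 - norm.cdf(norm.ppf(1 - rng.uniform(size=m)) + gamma)  # TA P-values via (A1)
        if min(fa.min(), ta.min()) > delta:
            continue                                         # scan discarded by the correction
        kept += 1
        # all TAs are captured iff the largest TA P-value beats the (R - m + 1)th smallest FA
        hits += ta.max() < np.partition(fa, R - m)[R - m]
    return hits / kept

print(rank_simulation())
```

The same setup extends to the χ² case by replacing the TA sampling line with the inverse noncentral χ² CDF, as described above.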

In this study, we applied the Monte Carlo approach to obtain numerical results, taking the number of simulations to be at least 50,000. The simulation approach is especially useful to study the effect of correlations between P-values. The positive dependence, such as dependence due to LD, is expected to reduce the effect of the total number of false effects on the distribution of ranks of TAs. Assuming a multivariate normal distribution for the test statistic and the exponential correlation decay with distance, the joint distribution of the test statistic under H0 over neighboring markers can be described by the Ornstein-Uhlenbeck diffusion process, dZt = −aZt dt + σ dWt, where dWt is the white-noise term and a and σ² are the process drift and the variance parameters. After statistic values (normal scores) are converted to P-values, sampling from this diffusion process generates correlated P-values with the stationary uniform(0, 1) distribution. The correlation r(Zi, Zj) generated by diffusion closely translates to the correlation between P-values: r(Pi, Pj) = (6/π)arcsin[r(Zi, Zj)/2] (Kruskal 1958). The largest ratio of r(Zi, Zj) to the correlation between two P-values is π/3 ≈ 1.047, as r(Zi, Zj) approaches zero. We have also implemented a more realistic model of the correlation structure, allowing for extended blocks of very high LD interspersed with regions of low LD, by mixing two (high- and low-correlation) Ornstein-Uhlenbeck diffusions. In reality, P-value correlations may not be adequately described by the diffusion. Nevertheless, this model allows us to generate substantial correlations with specified decay characteristics and therefore allows us to investigate the effect of correlation on the distribution of TA ranks.
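
A sketch of this construction for a single diffusion sampled at equally spaced markers, so that the normal scores form an AR(1) sequence; the marker count and neighbor correlation below are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def correlated_pvalues(L=10_000, rho_adjacent=0.5):
    """P-values with the stationary uniform(0, 1) distribution whose underlying normal
    scores follow a discretely sampled Ornstein-Uhlenbeck (AR(1)) process;
    rho_adjacent = exp(-a * spacing) is the normal-score correlation of neighboring markers."""
    z = np.empty(L)
    z[0] = rng.standard_normal()
    innovations = rng.standard_normal(L - 1) * np.sqrt(1 - rho_adjacent ** 2)
    for i in range(1, L):
        z[i] = rho_adjacent * z[i - 1] + innovations[i - 1]
    return norm.sf(z)   # upper-tail P-values: uniform marginally, locally correlated

p = correlated_pvalues()
# the neighbor P-value correlation should be close to (6/pi)*arcsin(rho/2), per Kruskal (1958)
print(np.corrcoef(p[:-1], p[1:])[0, 1], 6 / np.pi * np.arcsin(0.5 / 2))
```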

Acknowledgments

Clive Bowman, Meg Ehm, Ralph McGinnis, Mike Mosteller, Matt Nelson, Shyamal Peddada, Liling Warren, and two anonymous reviewers provided useful comments that improved the article.

Footnotes

  • Communicating editor: C. Haley

  • Received April 9, 2005.
  • Accepted July 11, 2005.
