A method is given for designing experiments to detect associations (linkage disequilibrium) in a random population between a marker and a quantitative trait locus (QTL), or gene, with a given strength of evidence, as defined by the Bayes factor. Using a version of the Bayes factor that can be linked to the value of an F-statistic with an existing deterministic power calculation makes it possible to rapidly evaluate a comprehensive range of scenarios, demonstrating the feasibility, or otherwise, of detecting genes of small effect. The Bayes factor is advocated for use in determining optimal strategies for selecting candidate genes for further testing or applications. The prospects for fine-scale mapping of QTL are reevaluated in this framework. We show that large sample sizes are needed to detect small-effect genes with a respectable-sized Bayes factor, and that good power to detect a low-frequency QTL allele requires a marker with a similar allele frequency near the gene.
THE advent of dense maps of single-nucleotide polymorphisms (SNPs) covering the genome with 300,000 or more markers offers new opportunities to find and identify genes, by testing for population-level associations between the SNP and disease or other trait of interest. Associations occur because of linkage disequilibrium between the marker and trait. “Linkage disequilibrium (LD) mapping” aims to detect and locate genes relative to a map of existing genetic markers. Location information is obtained because the distance between the gene and a marker on a chromosome is one factor influencing the closeness of association between the gene and marker. In a population, recombinations affecting the association between a gene and marker may occur over many generations. This potentially gives a much finer resolution than pedigrees used for quantitative trait loci (QTL) mapping. Finer resolution comes at a cost, however. More genotyping is needed per individual and, as we shall show, larger sample sizes are needed to take advantage of a finer level of resolution. In this article we develop experimental design techniques necessary to find the sample size needed to reliably detect a given level of linkage disequilibrium between a biallelic QTL and marker.
This article is structured as follows. First we review prospects for LD mapping with SNP markers, possible strategies, and results to date, including problems with reliability of detected QTL, illustrating the need for a sound measure of statistical evidence and for experimental design. Then we discuss measures of statistical evidence, or criteria for “detecting” associations. We argue that there are problems with commonly used P-values as a measure of evidence and advocate the Bayes factor (see, e.g., Spiegelhalter and Smith 1982) as a replacement measure. Further information, including relationships between Bayes factors, posterior probabilities, type I error rates, P-values, power, and false discovery rates as evidence for gene effects, is given in appendix B. We then review the classical approach to power of experiments and correct the deterministic power calculation from Luo (1998) (technical details in appendix A), before introducing a generic Bayes factor (Spiegelhalter and Smith 1982) for comparing linear models in the absence of prior information. This is linked to the classical power calculation, to give designs with a given probability to detect an effect with a given Bayes factor.
Results are presented for sample sizes from 600 to 4800 with Bayes factors from 1/20 to 20, and for QTL and marker allele frequencies from 0.01 to 0.5 with sample sizes from 600 to 38,400 and a Bayes factor of 20. Applications to genome scans and testing large sets of candidate genes are discussed, with results for posterior odds ranging from 1/20 to 20 (Bayes factors ranging from 250 to 1,000,000). The discussion section gives wider applications and implications of the method for strategies for gene discovery.
Kruglyak (1999), reviewing prospects for whole-genome linkage disequilibrium mapping, suggests that the useful range of linkage disequilibrium in the human population is around 3 kb, corresponding to a map with 500,000 SNP markers. The high density of SNP markers, combined with the short range of linkage disequilibrium in a population, means that it is possible, in principle, to locate SNPs near the genes and hence to sequence regions containing genes. Results in the literature on the range of usable linkage disequilibrium are variable. Dunning et al. (2000) report a similarly low range of disequilibrium in humans, while Taillon-Miller et al. (2000; reviewed in Boehnke 2000) report strong LD at distances as high as 1 Mb. At the other extreme, LD extending over a wide range has been reported in domesticated species by, e.g., Dunner et al. (1997) and Farnir et al. (2000) for cattle and by McRae et al. (2002), who reported LD in sheep even between unlinked markers.
The relatively short range of linkage disequilibrium in some populations makes it possible, in principle, to study complex diseases with non-Mendelian inheritance (i.e., disease is not caused by a single gene, but a number of genes contribute to disease susceptibility). If other populations with LD extending over a somewhat greater range can be found this reduces the amount of marker genotyping to more affordable levels, at a cost of lower potential resolution. See Risch (2000) for a discussion of the issues and optimal “postgenome” strategies.
Similar considerations apply to quantitative traits. Colleagues in the Forest Research Cell Wall Biotechnology Centre are interested in economic traits for forest trees, such as growth rate or wood density. Results from QTL mapping studies (Wilcox et al. 1997; Kumar et al. 2000; Ball 2001) suggest that these traits may be influenced by a number of smaller-effect genes.
Roses (2000) discusses a strategy for determining an individual's response to medicine(s), based on selecting a subset of SNP markers associated with the responses to medicine(s) of interest and then using the values of these SNPs for an individual to predict the individual's response.
There are two general strategies for detecting associations: testing for random associations in a population or using family-based tests such as the transmission disequilibrium test (Allison 1997; Boehnke and Langefeld 1998; Spielman and Ewens 1998; Long and Langley 1999). See Nielsen and Zaykin (2001) for a review.
The random population association test has the disadvantage of being confounded with allele frequency differences in subpopulations, if there is any population substructure, while the transmission disequilibrium test requires availability of many small families.
Note that where there may be population substructure we recommend the method of Pritchard et al. (2000a,b) for estimating and allowing for population substructure. This requires additional markers and may reduce power because of the additional number of parameters to be estimated.
With either of these strategies there are currently two common approaches: testing for single-marker associations or testing haplotypes on the basis of a combination of marker values associated with the disease or quantitative trait. In this article, we consider the single-marker approach, with data from association studies, but note that the method could be applied to haplotype data where two classes of haplotypes are considered.
The large number of possible SNPs and the even larger number of possible haplotype combinations make it essential to have soundly based statistical evidence for putative associations. This has not always been the case in the published literature; see, e.g., Altshuler et al. (2000), discussed in Gura (2001, p. 594):
… a team lead by Altshuler tried a slightly different approach. The investigators went back to published studies linking other SNPs to diabetes … and retested 16 reported SNPs in a new group … 13 SNPs were common enough in the population to study … only one SNP held up in the target populations … the SNP association had been reported in 1998, but four out of five subsequent analyses with only a few hundred patients each could not confirm the linkage.
These results are summed up by Altshuler (Hampton 2000):
The lack of replication of the others points to the need for larger samples, controls for population differences, and stronger statistical evidence prior to claiming an association [emphasis added].
Terwilliger and Weiss (1998, Figure 4) show the distribution of ∼260 reported P-values from association studies in two journals and note that there is no evidence of departure from the uniform distribution (i.e., no evidence of any real effect):
… investigators are too frequently gambling on and publishing results in situations where the evidence is not at all compelling.
These results point to lack of reliability of published results. Clearly, the statistical criteria used have not led to reliability of “detected” SNPs in these examples. Experience, and the results below, suggest this problem is likely to be widespread. The objective of this article is to design experiments for testing associations with the required quantifiably stronger statistical evidence.
Terwilliger and Weiss (1998) suggest as a cause of the problem that the simple model underlying the association tests does not reflect the genetic complexity, citing allelic and nonallelic heterogeneity, genetic-by-environment interactions, intragene interactions (epistasis), etc. We suggest that, while this may be true, resulting in complex statistical interactions, there is no hope of detecting the interactions if one cannot first detect and isolate the main effects, which would imply a model similar to that underlying association tests. We suggest the problem is not with the model but with the basic statistical experimental design (insufficient power) and statistical criteria for detection (P-values).
Evidence for a real effect—P-value and posterior probability for H1:
We argue that the P-value, the commonly used measure of statistical significance, is not a sound measure of evidence for a real effect and that the solution of the problems is to adopt a sound framework for statistical inference, based on probability theory, first given in Bayes (1763).
The P-value is the probability of observing a test statistic more extreme than the observed test statistic under repeated sampling assuming the null hypothesis of no effect holds. The problem is that there is no valid interpretation of P-values as evidence for a real effect independent of sample size and test setup. One reason is that the P-value is a probability under H0; the corresponding probability under H1, which is not controlled by the P-value alone, may also be comparably small (appendix B, Equations B10 and B11), in which case the evidence for H1 will be weak. For a given sample size and test setup, smaller and smaller P-values correspond to stronger and stronger evidence. However, we cannot say how small a P-value is required for a given strength of evidence or even quantify strength of evidence, except by reference to Bayesian methods for specific sample sizes and test setups. The larger the sample size, the smaller the P-values need to be to correspond to a given strength of evidence. We shall see that, for the large sample sizes needed for detecting linkage disequilibrium, reasonable evidence corresponds to very small P-values (Table 3), much smaller than those generally used, explaining the unreliability of published results. For further discussion of P-values and their relationship with Bayes factors and posterior probabilities see appendix B.
Overestimation of effects—selection bias:
A related problem is overestimation of effects caused by selection bias. If effects of selected markers are real but not reliably detected, i.e., the power of detection is not high, then reusing the same data to estimate the size of an effect as was used to select a marker results in an upward bias in the size of effects that can be large, in percentage terms. The solution is either to use an independent population to obtain unbiased estimates of effects (which is inefficient) or, in the Bayesian framework, to use model averaging. See Ball (2001) for a Bayesian model-averaging approach for obtaining unbiased estimates in the QTL mapping context. The model-averaging approach applies ideally to sets of multiple markers, but can also be applied when considering only a single marker, in which case the models averaged over correspond to the null hypothesis H0 with no effect and the alternative hypothesis H1 with a real effect.
The interpretation of P-values as representing good evidence for real effects is one problem, although possibly not the only one, with the reliability of published data. Other problems may include spurious associations arising from neglected population structure or inadequate experimental design. Or it may be that the genetic effects are not “stable” across the different environments or subpopulations sampled by various trials. Before claiming that there is instability, however, we need to be sure there really was good evidence for an effect in the first place and then obtain good evidence for an interaction.
This article quantifies the level of replication required for association studies to reliably detect linkage disequilibrium between a DNA marker (or gene) and a trait, with particular interest in detecting small-effect genes. To do this we apply the Bayesian approach to assessing evidence, using the Bayes factor. We determine the sample sizes needed to detect effects of various sizes with a given strength of evidence, as defined by the Bayes factor, and a given probability (power), for a given level of linkage disequilibrium between a biallelic marker and a biallelic quantitative trait locus.
We make use of an existing deterministic power calculation (Luo 1998) for the power of detecting linkage disequilibrium between a biallelic QTL and a marker. The type I error rate (or P-value) corresponding to the desired Bayes factor is determined for each set of parameter values and plugged into the deterministic power calculation.
We assume a biallelic marker with alleles M, m and a biallelic QTL with alleles A, a. Following Luo (1998), let q, p be the probability of A, M, respectively. In the biallelic case, linkage disequilibrium is specified by a single coefficient, D, such that the joint probabilities of alleles are given by

Pr(AM) = pq + D,   Pr(aM) = p(1 − q) − D,    (1)
Pr(Am) = (1 − p)q − D,   Pr(am) = (1 − p)(1 − q) + D    (2)

(cf. Weir 1996). It follows that the conditional probabilities are given by

Pr(A|M) = q + D/p,   Pr(A|m) = q − D/(1 − p),    (3)
Pr(a|M) = 1 − q − D/p,   Pr(a|m) = 1 − q + D/(1 − p).    (4)

The genotypic frequencies of pairwise combinations of QTL and marker genotypes are determined from these quantities (Table 1).
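These probabilities can be sketched in code as follows. This is a minimal illustration of the definitions above, with q = Pr(A), p = Pr(M); the function names are hypothetical helpers, not part of the ldDesign package.

```python
# Hypothetical helpers (not from ldDesign): haplotype and conditional
# probabilities for a biallelic marker (M/m, freq p) and QTL (A/a, freq q)
# with disequilibrium coefficient D.

def haplotype_freqs(p, q, D):
    """Joint probabilities of the four marker/QTL haplotypes."""
    freqs = {
        "AM": p * q + D,
        "aM": p * (1 - q) - D,
        "Am": (1 - p) * q - D,
        "am": (1 - p) * (1 - q) + D,
    }
    if any(f < 0 for f in freqs.values()):
        raise ValueError("D outside the feasible range for these frequencies")
    return freqs

def conditional_qtl_freqs(p, q, D):
    """Pr(A | marker allele), derived from the joint probabilities."""
    return {"A|M": q + D / p, "A|m": q - D / (1 - p)}

freqs = haplotype_freqs(p=0.5, q=0.5, D=0.1)
# The four haplotype probabilities always sum to 1.
```

As a consistency check, averaging Pr(A|M) and Pr(A|m) with weights p and 1 − p recovers the marginal frequency q.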
The statistical model for observed phenotypes is

y_{ijk} = μ + g_j + e_{ijk},    (5)

where i, j, k index marker genotype, QTL genotype, and observations within marker and QTL genotype, respectively, g_j is the effect of the jth QTL genotype, and e_{ijk} is a residual error. Since the QTL genotypes are unobserved, the data are analyzed with a one-way analysis-of-variance model, with marker genotypes as groups, i.e.,

y_{ik} = μ + m_i + e_{ik},    (6)

where m_i is the effect of the ith marker genotype class.
The ANOVA table for this analysis is shown in Table 2.
Classical power calculation:
The classical statistical experimental design seeks to determine designs with a given “power” 𝒫 at a given significance level α; i.e., the probability of obtaining a result with P-value less than α is 𝒫, assuming the effect to be detected is at least a certain size. The effect is said to be detected if this significance level is obtained.
The null hypothesis is “rejected” at level α if the observed value of the F-statistic is greater than the 1 − α point of its distribution under the null hypothesis,

F > F_{1−α:ν1,ν2},    (7)

where F_{1−α:ν1,ν2} is the 1 − α point of the F-distribution on ν1, ν2 d.f.
The probability of rejecting the null hypothesis or power is the probability that the F-statistic exceeds the critical value. To find this probability we need to know the actual distribution of the F-statistic under the alternative hypothesis, which in this case means there is a biallelic QTL in linkage disequilibrium with a marker with parameters D, d, h, p, and q defined above.
Under the alternative hypothesis F is distributed as a noncentral F-distribution with noncentrality parameter δ given by

δ = ν1 (EMSb − EMSw)/EMSw,    (8)

where EMSb and EMSw are the expected values of the mean squares between and within marker classes (cf. Table 2; Johnson and Kotz 1970, p. 189; Luo 1998, Equation 6.2). The power is given by

𝒫 = Pr(F_{ν1,ν2;δ} > F_{1−α:ν1,ν2}),    (9)

where F_{ν1,ν2;δ} is a random variable with the noncentral F-distribution with ν1, ν2 d.f. and noncentrality parameter δ.
To determine δ and complete the calculation, it remains to determine the expected mean squares EMSb and EMSw for the problem. Details of the calculation, including derivation of modified values for ω_{23}, EMSb, and EMSw, are given in appendix A.
Experimental design with power to obtain a given Bayes factor:
Bayes' theorem follows from the basic law of conditional probability:

Pr(A|B) = Pr(A and B)/Pr(B)    (10)

or

Pr(A|B) = Pr(B|A) Pr(A)/Pr(B).    (11)
If we have observed B, the probability of A being true is Pr(A|B). Bayesian statistics applies this with B as the data (y) and A as an unknown parameter (θ). Replacing A by θ and B by y and representing the probability functions by f, π, and g gives

g(θ|y) = f(y|θ)π(θ)/f(y).    (12)
It follows (by integrating both sides over θ) that

f(y) = ∫ f(y|θ)π(θ) dθ    (13)

and hence

g(θ|y) = f(y|θ)π(θ) / ∫ f(y|θ′)π(θ′) dθ′.    (14)
This has the following interpretation: the prior distribution π(θ) represents our knowledge of the unknown θ before observing the data y, and the posterior distribution g(θ|y) represents our knowledge of the unknown θ after observing the data. In other words π has been updated to g. f(y) is the probability of the data.
Now suppose we have two hypotheses (or models), H0 (e.g., the hypothesis of no linkage disequilibrium) and H1 (e.g., the hypothesis of a nonzero QTL effect in linkage disequilibrium with a marker). Each hypothesis represents a model with unknown parameters and probability density functions. Let θi be the parameters under Hi (i = 0, 1) and let πi(θi), fi(y|θi), gi(θi|y), and fi(y) be the probability functions. Now fi(y) is the probability of the data under hypothesis Hi, so we write it as a conditional probability Pr(y|Hi). Additionally let π(Hi) denote the prior probabilities for each model to be the “true” model.
The Bayes factor, B, measures the strength of evidence in the data in support of one hypothesis (or model), H1 (e.g., the hypothesis of a nonzero QTL effect in linkage disequilibrium with a marker), over another, H0 (e.g., the hypothesis of no linkage disequilibrium), and is defined as the ratio of the probability of the data under H1 to the probability of the data under H0, i.e.,

B = Pr(y|H1)/Pr(y|H0).    (15)

Higher Bayes factors (>1) are stronger evidence for H1, while lower Bayes factors (<1) are evidence for H0. A Bayes factor of 1 means the data are equally likely under H0 and H1, so there is no evidence either way. Bayes' theorem in this case can be written as

Pr(H_i|y) = Pr(y|H_i)π(H_i) / [Pr(y|H0)π(H0) + Pr(y|H1)π(H1)],    (16)

where π(H_i) and Pr(H_i|y) are the prior and posterior probabilities of H_i, i.e.,

Pr(H1|y)/Pr(H0|y) = B × π(H1)/π(H0).    (17)

The Bayes factor has a natural interpretation in terms of betting odds. Prior odds are the odds we would be prepared to bet on prior to seeing the data; posterior odds are the odds we would be prepared to bet on after seeing the data. Equation 17 has the interpretation that the Bayes factor is the factor by which we multiply our prior odds after seeing the data. For example, if we had prior odds of 1:10 (i.e., odds against a QTL in a given region) and a Bayes factor of 100, our posterior odds would be 10:1 (100 × 1/10 = 10). This may not sound impressive to readers accustomed to P-values <0.01 or even 0.001, but we shall see that it is not easy to obtain evidence this strong. For example, in a t-test with a noninformative or nearly noninformative prior distribution on the effect and sample size >100, a P-value of 0.05 can correspond to a Bayes factor <1, i.e., little or no evidence against H0, or even evidence for H0 (cf. Berger and Berry 1988 and Table 3).
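The odds arithmetic of Equation 17 can be made concrete with a few lines of code, using the numbers from the example above (prior odds 1:10 against, Bayes factor 100); the helper names are illustrative only.

```python
# Equation 17 as code: posterior odds = Bayes factor x prior odds.

def posterior_odds(bayes_factor, prior_odds):
    return bayes_factor * prior_odds

def odds_to_prob(odds):
    """Convert odds in favor of H1 to a posterior probability."""
    return odds / (1 + odds)

odds = posterior_odds(bayes_factor=100, prior_odds=1 / 10)  # 10:1 in favor
prob = odds_to_prob(odds)                                   # about 0.91
```

Note that the same Bayes factor yields very different posterior odds for different prior odds, which is why genome scans (many markers, few genes) demand much larger Bayes factors than single candidate-gene tests.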
Note that the reader may find some similarity between the likelihood ratio and the Bayes factor: in fact, if there are no unknown parameters, the Bayes factor is the same as the likelihood ratio. However, use of the Bayes factor is not the same as use of the likelihood-ratio test (cf. the discussion).
Prior odds are part of prior knowledge, but do not affect the Bayes factor. The Bayes factor is, however, affected by the prior distribution of parameters [πi(θi) above], particularly the parameters being tested. In general the more prior information we have on the parameter values under H1, the higher the Bayes factor we will obtain.
In a particular situation, where there is prior information, one should choose somewhat conservative prior distributions for these parameters, so that the evidence for an effect is not exaggerated. A vague prior that puts most prior mass on very large values should also not be used naively: in the limit as prior information tends to zero, the Bayes factor also tends to zero. To develop a generic method, or where there is no prior information, one way to proceed is to start with priors containing little or no prior information, update these priors using a small “training sample” ya (i.e., replace the priors by the posteriors after observing ya), and use the rest of the data to estimate the Bayes factor with the updated priors. It can be shown that this is equivalent to defining

B(y) = B0(y)/B0(ya),    (18)

where B0(y) denotes the Bayes factor with data y and the original priors and B(y) is the Bayes factor with data y and the updated priors.
Note that now [setting y = ya in (18)] the Bayes factor is calibrated to be 1 for the small training sample. This is reasonable because the training sample is small and so contains little or no evidence either way, which is consistent with the interpretation of B(y) = 1. This is the motivation for the approach of Spiegelhalter and Smith (1982), who obtain, for a one-way analysis-of-variance model, a closed-form Bayes factor

B = B(m, n_1, …, n_m, F),    (19)

expressed in terms of m, the number of groups; n_i, the number in each group; n, the total sample size; and F, the classical F-value.
Note that our definition of B is the reciprocal of that used by Spiegelhalter and Smith, so high Bayes factors correspond to positive evidence for H1.
This form of the Bayes factor is particularly convenient, because it links directly to the F-statistic used in existing power calculations. To detect an association with power 𝒫 to achieve a given Bayes factor B we solve for F in (19) and select a design using the existing deterministic power calculation.
Note that in general the Bayes factor is sensitive to the prior for the variable(s) being tested (here marker effects). Use of (19) does not require specification of a prior, which is especially useful in a general experimental design context where there are no specific data and hence no prior. Intuitively, it seems reasonable that the above calibration is equivalent to using a nearly noninformative prior with information comparable to that in the small training sample. This is consistent with our experience—when we have used priors with information equivalent to one data point, the resulting Bayes factors were similar to (19).
Equation 19 applies to testing for linkage disequilibrium (cf. the ANOVA table, Table 2), where m = 3 is the number of marker classes; n1, n2, and n3 are the numbers of observations in each marker class; and F is the F-value. To obtain a deterministic formula we set n1, n2, and n3 to their expected values on the basis of the frequencies of the marker classes MM, Mm, and mm in Table 1 and the total sample size:

n1 = p²n,   n2 = 2p(1 − p)n,   n3 = (1 − p)²n.    (20)

Substituting these values in (19) gives a deterministic relationship, (21), between the F-value, the total sample size, and the Bayes factor.
This should be a good approximation for the large sample sizes we need: variations in observed proportions in each marker class will be small and hence will have little effect on the relationship between the F-value and Bayes factor in (19).
For any valid choice of values of QTL/marker allele frequencies, effects and linkage disequilibrium parameters D, d, h, p, and q, sample size n, and Bayes factor B, we can solve for F = F_{1−α:2,n−3} in (21) and then look up the corresponding value of α from the F-distribution. Then, given α and n, we determine the power, 𝒫, from the classical power calculation. To determine the sample size n for given B and 𝒫 we use interpolation.
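The design loop just described can be sketched end to end. The exact Spiegelhalter-Smith form of Equation 19 is not reproduced here, so as a stand-in for the F-to-Bayes-factor link we use the well-known BIC approximation, 2 ln B ≈ n ln(1 + ν1 F/ν2) − ν1 ln n; the resulting numbers are illustrative only, and all names are hypothetical rather than ldDesign's.

```python
# Sketch of the design loop: F threshold for a target Bayes factor (via a
# BIC-style approximation, NOT Equation 19), power at that threshold, and
# bisection on n for a target power.  phi is the per-observation
# noncentrality, so delta = n * phi.
import math
from scipy import stats

def f_for_bayes_factor(B, n, nu1=2):
    """Solve the BIC approximation for the F-value giving Bayes factor B."""
    nu2 = n - nu1 - 1
    return (nu2 / nu1) * (math.exp((2 * math.log(B) + nu1 * math.log(n)) / n) - 1)

def power_for_n(n, B, phi, nu1=2):
    """Power to exceed the F threshold equivalent to Bayes factor B."""
    nu2 = n - nu1 - 1
    F = f_for_bayes_factor(B, n, nu1)
    return stats.ncf.sf(F, nu1, nu2, n * phi)

def sample_size(B, phi, target_power=0.9, lo=100, hi=100_000):
    """Smallest n (by bisection) with power >= target_power."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if power_for_n(mid, B, phi) >= target_power:
            hi = mid
        else:
            lo = mid
    return hi
```

The qualitative behavior matches the text: the F threshold (and hence the required α) grows slowly with n, so the sample sizes needed for large Bayes factors do not increase much, in relative terms, at large n.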
An R package (ldDesign) containing code used for the calculations is available on the Comprehensive R archive network (CRAN), from http://cran.r-project.org/src/contrib/Descriptions/ldDesign.html.
Table 3 shows the type I error rates (or P-values) corresponding to various Bayes factors. For a desired Bayes factor (e.g., B = 10, which implies the data are 10 times more likely under the hypothesis of a real effect than under the hypothesis of no effect), look up the type I error rate in the table for the sample size desired. For example, if the sample size is 1728, we need a type I error rate of α < 2.35 × 10−4 to get B = 10. If the sample size is 300, we need a type I error rate of α < 1.42 × 10−3 to get B = 10. For these sample sizes, α = 0.05 and even α = 0.01 are clearly not a good option, corresponding mostly to Bayes factors <1. For example, with n = 864, B = 1/10, we have P ≈ 0.05, showing that a P-value of 0.05 can correspond to evidence against an effect.
Table 4 is a comparison with results for 12 sample populations from Luo (1998). 𝒫0.05 is the power to detect an effect with comparisonwise significance level α = 0.05, as shown in Luo's (1998) Table 3. Also shown are the equivalent Bayes factors, B, and the sample size nB20 required to obtain a Bayes factor of 20 with power 0.9. Note that the Bayes factors for the original sample sizes are all <1, i.e., not corresponding to positive evidence for a real effect. The sample sizes nB20 are the sample sizes required to have good power to detect an effect with fairly strong evidence and are substantially (up to 13 times) larger.
Note that our calculations for the power, 𝒫0.05, to detect an effect with significance level α = 0.05 agree with those from Luo's (1998) Table 3 and with our stochastic simulations, except for populations 11 and 12, which agree with our simulations but do not agree with the results from Luo (1998). Luo obtained powers of 0.65, 0.60, which are similar to the power obtained when φ = 0. It may be that he used φ = 0 for the power calculations for these populations.
Figure 1 shows graphs of power vs. linkage disequilibrium. Graphs correspond to sample sizes n = 600, 1200, 2400, and 4800 and a range of QTL heritabilities (proportions of phenotypic variance explained). Within each graph, the solid lines give power vs. linkage disequilibrium for Bayes factors of 1/20, 1/10, 1/5, 1, 5, 10, and 20. The lines, in order of increasing power, correspond to the Bayes factors in decreasing order.
Graphs of power vs. disequilibrium for various values of marker and QTL allele frequencies are shown in Figure 2. To facilitate comparisons across the range of allele frequencies, disequilibrium has been represented as D′ = D/Dmax, which varies from 0 to 1, with 1 representing the maximum possible linkage disequilibrium for the given marker and QTL allele frequencies.
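The normalization D′ = D/Dmax can be sketched directly from the haplotype-probability constraints: with q = Pr(A) and p = Pr(M), all four joint probabilities must remain non-negative, which bounds D. The function name is illustrative, not from ldDesign.

```python
# Lewontin's normalized disequilibrium D' = D / Dmax, where Dmax is the
# largest |D| compatible with non-negative haplotype frequencies.

def d_prime(p, q, D):
    if D >= 0:
        d_max = min((1 - p) * q, p * (1 - q))   # keeps Pr(Am), Pr(aM) >= 0
    else:
        d_max = min(p * q, (1 - p) * (1 - q))   # keeps Pr(AM), Pr(am) >= 0
    return D / d_max

# Matched marker and QTL allele frequencies permit |D'| = 1:
ratio = d_prime(p=0.1, q=0.1, D=0.09)   # approximately 1.0
```

This makes the point in the surrounding text concrete: when p and q are poorly matched (e.g., p = 0.5, q = 0.01), Dmax is small, so even the maximum attainable D is far below what matched frequencies would allow.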
Note that the low allele frequencies per se do not have a major effect on the power obtained when disequilibrium reaches its maximum, although this may happen for a lower D when q is lower (cf. top row, middle graph with n = 4800, p = 0.2). However, very low power occurs when there is a poor match between marker and QTL frequencies (e.g., p = 0.01, q = 0.5 or p = 0.5, q = 0.01).
The Bayes factor gives a well-defined measure of strength of evidence, independent of the experimental design or sample size used. The optimal value to use depends on the costs of further experimentation and possible benefits. A Bayes factor of 20 or more represents good evidence for an effect; however, one needs to factor in the prior odds. An effect that is a priori unlikely needs a high Bayes factor to obtain respectable posterior odds. A more formal cost-benefit analysis is possible, in the Bayesian decision theory framework; see, e.g., DeGroot (1970) and Lindley (1985).
Use with genome scans:
In a genome scan the prior odds for a gene to be in the vicinity of a particular marker, defined as the region closer to that marker than to any other, are proportional to the number of genes expected and inversely proportional to the number of markers (cf. Ball 2001). In a genome scan with many markers the prior probability would be small; therefore a high Bayes factor would be required. Kruglyak (1999) suggests that D = 0.1 may be obtainable for the human population, at distances between QTL and marker of up to 3 kb, corresponding to a map with 6 kb spacing and 500,000 markers.
For example, if it is desired to localize an effect to the nearest of 500,000 SNP markers and ∼10 genes were expected, the prior probability per marker would be ∼1/50,000. To obtain respectable posterior odds of say 20:1 in (16), we would require a Bayes factor of 1,000,000.
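The arithmetic of this genome-scan example is simple enough to encode; here the prior probability per marker (10/500,000 = 1/50,000) is treated as prior odds, since it is small, and the helper name is illustrative.

```python
# Required Bayes factor for a genome scan: posterior odds / prior odds,
# approximating prior odds by the small prior probability per marker.

def required_bayes_factor(target_posterior_odds, n_markers, n_genes_expected):
    prior_odds = n_genes_expected / n_markers   # odds ~ probability when small
    return target_posterior_odds / prior_odds

B = required_bayes_factor(20, n_markers=500_000, n_genes_expected=10)
# about 1,000,000, matching the example in the text
```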
Table 5 shows sample sizes required for localizing QTL in a genome scan with 500,000 SNP markers, assuming there are 10 QTL. A sample size of ≈40,000 is needed for a power of 0.9 to obtain posterior odds of 20:1 for a QTL explaining 1% of the phenotypic variance with D = 0.1 to be within ±3 kb of a given marker. Note that, in relative terms, the sample sizes to obtain the higher Bayes factors do not increase much at the larger sample sizes.
Use with candidate genes:
If a large number of “candidate” genes were being tested, at an initial screening stage of experimentation, one may be content with posterior odds of <1. In this case one would want the effects of interest to have a high probability (power) of being accepted, while genes not affecting the trait have a low probability. For example, suppose 50,000 candidate genes were tested with a sample size of n = 4800, and we are looking for QTL explaining 1% of the phenotypic variance. From Figure 1 with n = 4800, B = 1 we have a power of 0.5 provided D ≥ 0.13 and a power of 0.001 when D = 0. Thus with this design we would, on average, end up with 50 false positives and half of the genes with D ≥ 0.13. On the other hand, requiring B = 20 would eliminate 80% of the genes with D ≥ 0.13, which is hardly satisfactory. For larger QTL effects the situation is more favorable: with n = 4800 there is 95% power to detect genes with D ≥ 0.1 with a Bayes factor of 20, with a rate of <1/1000 false positives.
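The expected outcome of such a screen follows from two numbers per gene class, the acceptance rate for null genes and the power for real ones; the helper name is illustrative, and the inputs are those of the example above.

```python
# Expected screening outcome: 50,000 candidates, ~10 real genes, type I
# rate 0.001 for null genes, power 0.5 for genes with D >= 0.13.

def expected_counts(n_candidates, n_true, alpha, power):
    false_pos = (n_candidates - n_true) * alpha   # null genes passing the screen
    true_pos = n_true * power                     # real genes detected
    return false_pos, true_pos

fp, tp = expected_counts(n_candidates=50_000, n_true=10, alpha=0.001, power=0.5)
# on average, about 50 false positives and 5 of the 10 real genes
```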
Table 6 shows sample sizes required for selecting genes from a set of 50,000 candidate genes, assuming there are 10 true genes. A sample size of ≈36,000 is needed for a power of 0.9 to obtain posterior odds of 20:1 for a gene explaining 1% of the genetic variance, with D = 0.1 to be within ±3 kb of a given marker. As with Table 5, the sample sizes needed to obtain the higher Bayes factors are not much higher in relative terms.
Detection of markers in linkage disequilibrium with a trait is a complex statistical problem. Classical single-marker test methods have not led to reliable SNPs in published data, and the need for more rigorous statistical evidence for detection of SNPs has been noted. A major problem is with the interpretation of P-values as evidence. The results of this article confirm the wide range in P-values required to obtain a given Bayes factor, as sample size varies.
The results of Table 4 show that using a threshold of P = 0.05 does not correspond to positive evidence for a real effect, and up to 13 times larger sample sizes are needed to obtain good evidence (a Bayes factor of 20) with power 0.9.
The results of Figure 1 show that good power is achieved with a sample size of n = 2400, for a 5% QTL with D = 0.15, to detect linkage disequilibrium with a Bayes factor of 20. However, at n = 4800 and D = 0.1 power is low even for a Bayes factor of 1.
The results of Figure 2 show that power may be very low if low QTL allele frequency is not matched by a low marker frequency. In this case the linkage disequilibrium is much lower than the maximum possible for the given QTL allele frequency. This sensitivity of power to allele frequency is one reason why plots of the LD test statistics in the neighborhood of a gene generally vary wildly, in a nonsmooth fashion. To detect a gene, particularly one with a low-frequency allele, it is important not only to have a marker close to the gene but also to have a marker with similar allele frequency. For a randomly chosen marker in the vicinity of a gene it seems that this would occur with low probability for a low-frequency QTL allele. This further increases the number of markers needed for adequate genome coverage and/or the population size needed to detect such genes.
The results of Tables 5 and 6 show that it is feasible, albeit with large sample sizes of the order of 7500 and 40,000, respectively, to detect and locate QTL that explain 5% or 1% (or more) of the genetic variance, with a linkage disequilibrium of D = 0.1, by testing sets of up to 50,000 candidate genes or 500,000 SNP markers. For genome scans, it is possible to borrow strength from neighboring markers, justifying higher prior probabilities when estimating the marginal probability for a set of markers in a larger region (cf. Ball 2001). This cannot be pushed too far, however: because of the short range of usable linkage disequilibrium indicated in Kruglyak (1999), only a small number of the 500,000 markers would be close enough to a given gene.
An alternative, in view of the large sample sizes indicated for detecting small QTL with the genome scan or brute force candidate gene approaches, is a hybrid approach whereby candidate genes or genomic regions are preselected, e.g., from differential gene expression or QTL mapping studies. Another alternative is to find populations with a somewhat larger range of usable linkage disequilibrium, giving less precise information but with a lower genotyping cost.
Various alternative experimental designs are possible, such as the transmission disequilibrium test, which avoid potential spurious associations due to population substructure (Allison 1997; Boehnke and Langefeld 1998; Spielman and Ewens 1998; Long and Langley 1999). We do not consider these directly but note that a transmission test can be viewed as a test of association between transmission of an allele and a trait, and the power calculations can be rederived. We would expect that a transmission test of equivalent power (in the classical sense) to the designs considered here would have a similar power to achieve a given Bayes factor; i.e., existing comparisons between the power of these tests and those of association studies would apply. However, this is yet to be confirmed.
An alternative to considering single markers separately is to study haplotypes. For example, if one is interested in a particular haplotype or class of haplotypes from a subset of SNPs being associated with a higher level of disease, compared with all other haplotypes from the subset, this can be considered a biallelic marker for which the methods of this article apply. If more than two haplotype classes are considered the equivalent power calculations would need to be derived. Use of P-values is problematic in this situation because P-values are affected by considerations such as which other haplotypes or haplotype classes are tested. The Bayesian approach is well suited to haplotypes—the Bayes factor or posterior probability does not depend on such considerations. Of course if there were a very large number of combinations or partitions of haplotypes the prior probability for a randomly chosen haplotype would be low, and a high Bayes factor would be required.
The Bayes factor depends on the prior for the variable(s) being tested (here marker effects). We have used a generic or “default” Bayes factor that avoids this prior dependence, giving a generic result that can be applied in the absence of further prior information. In specific situations it may be possible to do better when prior information on the size of QTL effects is available. For example, the variance due to a QTL can be no more than the genetic variance of the trait, for which estimates are often available. However, in our experience, this amount of prior information is not enough to make a major difference.
The approach used here, of determining the power to achieve a given Bayes factor, is a hybrid of Bayesian and non-Bayesian (frequentist) approaches. We are studying the sampling distribution (frequentist) of the Bayes factor (a Bayesian measure of evidence). This is appropriate because we are interested in possible outcomes of future experiments, if there is an association with certain parameter values. However, once an experiment has been carried out, in the Bayesian viewpoint, there is only one observed data set, and the Bayes factor is a property of that data set; we would not normally consider the sampling variability of Bayes factors that might have been obtained.
The problem with P-values remains whether one uses any of the many forms of hypothesis tests: the t-test, F-test, χ2-test, or likelihood-ratio test. One may wonder why the likelihood-ratio test does not solve the problem, when the Bayes factor is essentially a likelihood ratio. The difference is in how the unknown parameters are treated. In the non-Bayesian likelihood-ratio test one maximizes over the unknown parameters,

LR = f1(y|θ̂1) / f0(y|θ̂0),

where θ̂1 and θ̂0 are chosen to maximize their respective likelihood functions f1(y|θ1), f0(y|θ0). This strongly favors the alternative hypothesis, which usually has one or more additional parameters to maximize over. To obtain a hypothesis test, the likelihood ratio, LR, is compared to its sampling distribution under the null hypothesis, and high values of the likelihood ratio lead to rejection of the null hypothesis. As with any other hypothesis test, there is, however, no guarantee that the alternative hypothesis is any more likely.
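The contrast can be made concrete with a minimal one-observation normal example (a sketch; the data value y and the prior scale tau are illustrative assumptions, not values from the article): the likelihood ratio maximizes the H1 likelihood over the mean, while the Bayes factor averages it over a prior.

```python
import math

def normal_logpdf(y, mu, sigma=1.0):
    # Log density of N(mu, sigma^2) at y.
    return -0.5 * math.log(2 * math.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2)

# H0: mu = 0 vs. H1: mu != 0, one observation y, sigma known.
y = 1.8

# Non-Bayesian likelihood ratio: maximize over mu under H1 (MLE is y itself).
LR = math.exp(normal_logpdf(y, y) - normal_logpdf(y, 0.0))

# Bayes factor: average the H1 likelihood over a prior mu ~ N(0, tau^2),
# under which the marginal distribution of y is N(0, 1 + tau^2).
tau = 1.0
BF = math.exp(normal_logpdf(y, 0.0, math.sqrt(1.0 + tau**2))
              - normal_logpdf(y, 0.0))

print(LR, BF)  # maximization makes LR exceed the Bayes factor
```

Here LR ≈ 5.1 while BF ≈ 1.6: the same data look far more decisive when the alternative is allowed to tune its parameter to the observation.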
The problem with P-values remains whether one uses comparisonwise or experimentwise P-values or P-values from a permutation test. A permutation test is just a nonparametric way of obtaining a P-value. Of course the experimentwise P-value usually corresponds to a much lower comparisonwise P-value so represents stronger evidence. The interpretation of the P-value and what threshold to use remains a problem.
Obtaining a Bayes factor of 20 may require much larger sample sizes than obtaining a P-value of 0.05, say. One may ask: Why use Bayes factors if this requires larger sample sizes and hence greater cost, if one has P = 0.05? This is not a reflection on the relative efficiency of the methodologies, because the Bayes factor of 20 represents stronger evidence than the P-value of 0.05, which may represent very weak evidence—“you get (only) what you pay for.” If a much larger sample size is required for power to obtain the desired Bayes factor this implies that the original experiment was underpowered, and the P-value corresponds to a small Bayes factor, i.e., very weak evidence, while creating a misleading impression of evidence for an effect.
DERIVATION OF MODIFIED VALUES FOR EMSb AND EMSw
First we calculate μ, β2, and then ω23. Values for β1, β3, and other ωij values are obtained similarly.
Taking expectations in Equation 5 gives (A1). Substituting QTL genotype probabilities and values and simplifying gives (A2).
Taking expectations conditional on marker class Mm (i = 2) gives (A3). The genotype conditional probabilities are given by (A4)–(A6). Noting that E(y|Mm, AA) = E(y|AA), etc., and substituting QTL genotype conditional probabilities and values and solving for β2, we obtain (A7), which agrees with Equation 3.2 of Luo (1998).
Finally, taking expectations conditional on marker genotype class Mm (i = 2) and QTL genotype aa (j = 3), and solving for ω23, gives (A8). This differs from the expression of Luo (1998, p. 207) by a factor of −1.
Luo (1998, Equations 3.1–3.3 and Appendix 1), following Searle (1987) and Knott (1994), gives expressions for βi and ωij, and expressions for EMSb and EMSw in terms of βi, ωij, fi, and fij, where fi is the relative proportion of the ith marker genotype and fij is the proportion of the ith marker genotype/jth QTL genotype combination.
The formula given for EMSw is (A9), where Gm = 3 is the number of marker genotypes and Gq = 3 is the number of QTL genotypes.
Use of the formulas for EMSb and EMSw from Luo gave results that did not agree with stochastic simulations. Our calculations of βi and ωij agreed with those of Luo except for the equation for ω23, above, which was out by a factor of −1. Using the corrected value for ω23 still did not give results that agreed with simulations. The values of EMSw did agree with simulations, suggesting that there is a further error in EMSb. Rather than rederive the expression for EMSb, we take advantage of the fact that once one of EMSb or EMSw is known, the other can be obtained by subtraction, using the relationship between the between-marker, within-marker, and total sums of squares (cf. Table 2). Taking expectations of this relationship, where Vp is the total, or phenotypic, variance, and solving for EMSb gives (A10).
Finally, we need an expression for Vp. Substituting from Equation 5 and taking variances gives (A11).
BAYES FACTORS, POSTERIOR PROBABILITIES, TYPE I ERROR RATES, P-VALUES, POWER, AND FALSE DISCOVERY RATES AS MEASURES OF EVIDENCE FOR GENE EFFECTS
To fix notation, suppose H0 represents the null hypothesis that θ = 0 (a so-called “precise” hypothesis), and H1 represents the alternative hypothesis that θ ≠ 0, where θ represents one or more parameters for a genetic effect to be tested, which are zero under the assumption of no effect. There may be other parameters common to both H0 and H1; these are suppressed.
Type I error rates and P-values:
The type I error rate (α) for a test is given by

α = Pr(T ≥ Tc | H0), (B1)

where T is a "test statistic" and Tc is the α-critical level of the test statistic (which we also write as Tα), such that H0 is rejected if T ≥ Tc.
Type I error rates are valid if α is chosen preexperimentally. However, an α chosen preexperimentally does not make full use of the information in the data, which is inefficient. Fisher proposed reporting the P-value, which is the value of α when we set Tc = Tobs, the observed value of the test statistic:

p = Pr(T ≥ Tobs | H0). (B2)

The P-value is the smallest, i.e., most optimistic, value of α under which H0 would have been rejected. However, we cannot set α = p after observing the data—the P-value is not a valid type I error rate. Therefore it is problematic to make a decision on the basis of one or more P-values. Even if we had chosen α preexperimentally, giving an average type I error rate under repeated sampling, we would not be able to quantify the evidence for a particular sample that may be worse than average.
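As a concrete sketch (a standard one-sided z-test, not an example from the article), the P-value is simply the type I error rate evaluated with the critical level set equal to the observed statistic:

```python
import math

def p_value(t_obs):
    # One-sided z-test P-value: alpha evaluated with the critical level
    # T_c set equal to the observed statistic, p = Pr(T >= T_obs | H0),
    # using the standard normal survival function.
    return 1.0 - 0.5 * (1.0 + math.erf(t_obs / math.sqrt(2.0)))

# An observed statistic exactly at the one-sided 5% critical level (1.6449)
# gives p = 0.05; larger statistics give smaller, "more optimistic" alphas.
print(p_value(1.6449))
print(p_value(3.0))
```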
Bayes factors as evidence and relationships with P-values:
Bayes factors have a natural interpretation as evidence from Equation 16. Higher Bayes factors correspond to stronger evidence for H1.
Intuitively, more extreme values of the test statistics mean that the data are more likely than those for smaller values to have come from H1; i.e., smaller P-values correspond to stronger evidence. Experience confirms this: for a given problem (H0, H1, and sample size) with an intelligently defined test statistic, a plot of P-value vs. the Bayes factor shows a monotone relationship (cf. Table 3). This means, for example, that P-values and Bayes factors would give the same ranking of genes in order of strength of evidence but does not imply Bayes factors and P-values are equivalent; how to rank genes in order of expected benefit (considering size of gene effects as well as strength of evidence) or how many putative genes to select for further investigation becomes clear only when Bayes factors or posterior probabilities are calculated.
The relationship between Bayes factors and P-values is quantified implicitly in Equation 19 from Spiegelhalter and Smith (1982) for the one-way analysis-of-variance model, where the F-value corresponds to the P-value through the F-distribution. This relationship varies in general, depending on sample size (which is apparent from Equation 19) and on problem setup.
More generally Sellke et al. (2001) give "calibrations"; e.g., their "simple calibration" is given as

B(p) = −e p log(p), p < 1/e, (B3)

leading to a fairly general lower bound,

α(p) ≥ (1 + [−e p log(p)]⁻¹)⁻¹, (B4)

for a valid postexperimental type I error rate, under a technical assumption, that the distribution of −log(p) under H1 has a decreasing failure rate, which is shown to hold for a normal testing example with iid data. Unlike the problem-specific formula from Spiegelhalter and Smith, these bounds are not exact, and so we do not use them for experimental design.
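These calibrations are easy to evaluate; a minimal sketch, assuming the standard Sellke et al. forms B(p) = −e·p·log p and α(p) = (1 + [−e·p·log p]⁻¹)⁻¹:

```python
import math

def bf_bound(p):
    # "Simple calibration": lower bound on the Bayes factor in favor of H0,
    # B(p) = -e * p * log(p), valid for p < 1/e.
    assert 0.0 < p < 1.0 / math.e
    return -math.e * p * math.log(p)

def alpha_bound(p):
    # Corresponding lower bound on a valid postexperimental type I error
    # rate, alpha(p) = (1 + [-e p log p]^-1)^-1.
    return 1.0 / (1.0 + 1.0 / bf_bound(p))

for p in (0.05, 0.01, 0.0034):
    print(f"p = {p}: B >= {bf_bound(p):.3f}, alpha >= {alpha_bound(p):.3f}")
```

For example, p = 0.05 yields a postexperimental error rate of at least ≈0.29, while p ≈ 0.0034 is needed before the bound drops to 0.05.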
Making a decision on the basis of a P-value amounts to conditioning on (i.e., simplifying by replacing the data by) the event

Ep = {y ∈ 𝒮 : T(y) ≥ Tobs} (B5)

occurring, where 𝒮 is the sample space, i.e., conditioning on data we did not observe, which is an error (cf. Berger and Berry 1988). Strictly speaking we did not observe Ep but

T = Tobs. (B6)

Similarly, making a decision on the basis of statistical significance for a predetermined value α amounts to conditioning on

Eα = {y ∈ 𝒮 : T(y) ≥ Tα}. (B7)
We can attempt to calculate a posterior probability for H0 using only the information T ≥ Tα, i.e., that the event Eα occurred. We have

Pr(H0|Eα) = Pr(Eα|H0)Pr(H0)/Pr(Eα) (B8)

by Bayes' rule, where Pr(H0) is the prior probability that H0 is true, and

Pr(Eα) = Pr(Eα|H0)Pr(H0) + Pr(Eα|H1)Pr(H1). (B9)

Combining (B8) and (B9), setting π0 = Pr(H0) = 1 − Pr(H1), α = Pr(Eα|H0), and solving gives

Pr(H0|Eα) = π0α / (π0α + (1 − π0)Pr(Eα|H1)). (B10)

Note the dependence on Pr(Eα|H1). To obtain small values for Pr(H0) and hence good evidence for a real effect we need

(1 − π0)Pr(Eα|H1) ≫ π0α, (B11)

which may not be the case, since the left-hand side of the inequality is not controlled solely by α. The left-hand side of (B11) is related to the power of the test. Specifically, let H1(d) represent the hypothesis θ = d and π(·) be the prior for θ in H1. Then

𝒫α(d) = Pr(T ≥ Tα | H1(d)) (B12)

is the power function, giving the power to detect an effect of size d, at significance level α. Pr(Eα|H1) can be represented as an integral of the power function,

Pr(Eα|H1) = ∫ 𝒫α(d) π(d) dd, (B13)

giving the posterior probability of H0,

Pr(H0|Eα) = π0α / (π0α + (1 − π0) ∫ 𝒫α(d) π(d) dd). (B14)

A corresponding Bayes factor is given by

Bα = Pr(Eα|H1)/Pr(Eα|H0) = (1/α) ∫ 𝒫α(d) π(d) dd. (B15)

The left-hand side of (B14) is equivalent to the positive false discovery rate (Storey 2003; Equation B20 below). Thus (B14) and (B15) show the relationship between the prior and posterior probabilities, a Bayes factor for rejected null hypotheses, type I error rates (α), power curves, and positive false discovery rates. For the same reason that reporting α is not fully efficient, using Bα (or the false discovery rate) instead of the Bayes factor or posterior probability based on the observed data (or T = Tobs) is also not fully efficient as a summary of the observed data.
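The bookkeeping in (B10)–(B15) can be sketched as follows; here avg_power stands in for the integral of the power function against the prior π(d), supplied as a number rather than computed by integration:

```python
def post_prob_H0(alpha, avg_power, pi0=0.5):
    # (B10)/(B14): posterior probability of H0 given only that the test
    # rejected at level alpha; avg_power plays the role of Pr(E_alpha | H1).
    return pi0 * alpha / (pi0 * alpha + (1.0 - pi0) * avg_power)

def bayes_factor_alpha(alpha, avg_power):
    # (B15): Bayes factor for the event E_alpha.
    return avg_power / alpha

# A rejection at alpha = 0.05 in an underpowered study leaves substantial
# posterior probability on H0:
print(post_prob_H0(0.05, 0.2))        # 0.2
print(post_prob_H0(0.05, 0.9))        # ~0.053
print(bayes_factor_alpha(0.05, 0.9))  # ~18
```

The same α = 0.05 thus corresponds to weak or reasonably strong evidence depending entirely on the power of the experiment.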
Note that the integral in (B13), i.e., the Bayesian approach involving π(d), is needed because d (the parameter being tested) is unknown. Since the power curve (B12) is monotonic, a non-Bayesian approximate upper bound for Pr(H0|Eα) could be obtained if a lower bound, dl, for d was known. Suppose we had such a dl and dl > 0 (say); then

Pr(H0|Eα) ≤ π0α / (π0α + (1 − π0)𝒫α(dl)). (B16)

Of course we have no such bound a priori. A natural approach would be to let dl be the lower limit of a confidence interval (e.g., 95 or 99%) for d̂, so the bound holds to within a reasonably small error. The bound would not be useful unless the power 𝒫α(dl) was reasonably good, so we need dl to be at least 2σ(d̂) if α ≤ 0.95 or 3σ(d̂) if α ≤ 0.99, etc., where σ(d̂) is the standard error of the estimate d̂. As an example if α = 0.05, π0 = 0.5, and dl = 2σ(d̂), so that 𝒫α(dl) ≈ 0.5, then

Pr(H0|Eα) ≤ (0.5 × 0.05)/(0.5 × 0.05 + 0.5 × 0.5) ≈ 0.09. (B17)

Thus we obtain, approximately, that Pr(H0|Eα) ≤ 0.09 if

d̂ ≥ 4σ(d̂). (B18)

Note that this requires an estimated effect of twice the usual 2σ(d̂) recommendation corresponding to P = 0.05, to get a posterior probability of just <0.1. By comparison, if the assumptions for Sellke et al. (2001), Equations B3 and B4, hold, the P-value corresponding to a valid postexperimental α = 0.05 or Pr(H0|Eα) = 0.05 is 0.0034, which corresponds to a 2.9σ(d̂)-sized effect.
Problems with the interpretation of P-values as evidence have been pointed out in a number of articles (e.g., Edwards et al. 1963; Berger and Sellke 1987; Berger and Berry 1988). Edwards et al. showed that the evidence corresponding to a given P-value was not as strong as most people think. The later work by Berger and co-authors shows that the problems with P-values, especially the common choice of P = 0.05, are not specific to the choice of prior or to unusual examples but are quite general. For example, Berger and Berry consider the problem of testing for a biased coin. The example has P = 0.05, yet the minimum value of Pr(H0), assuming π0 = 0.5 over a wide class of priors for the probability of heads, is 0.21. Results of this article show similar problems for common analysis-of-variance models (Table 3). For a long time these warnings have largely gone unheeded by the scientific community.
False discovery rates and q-values:
Another alternative to p-values is the "false discovery rate" (FDR) (Benjamini and Hochberg 1995), which has become popular for handling multiple comparisons in genomics. Motivated by the low power obtained with experimentwise error rates, and by the many spurious significant results obtained with comparisonwise error rates when the large number of comparisons in a genome scan is made, the false discovery rate attempts to control the proportion of false positives among the rejected hypotheses. Storey (2003) considers a modification of the FDR, namely the positive false discovery rate (pFDR),

pFDR = E[V/R | R > 0], (B19)

where V is the number of false positives (H0 true but rejected) and R is the number of rejected null hypotheses (e.g., T > Tα). The pFDR avoids a technical problem with V/R when R = 0. This is shown to have a Bayesian interpretation,

pFDR = Pr(H0 | T > Tα). (B20)

Note that the right-hand side of (B20) is the left-hand side of (B14), so the pFDR is given by the expression in (B14).
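A small simulation illustrates the agreement between the frequency definition (B19) and the Bayesian interpretation (B20) for a one-sided z-test (the effect size d, π0, and number of tests here are illustrative assumptions, not values from the article):

```python
import math
import random

random.seed(1)
pi0, alpha, d = 0.5, 0.05, 3.0
z_alpha = 1.6449          # one-sided 5% critical value of N(0, 1)
n_tests = 200_000

V = R = 0
for _ in range(n_tests):
    null_true = random.random() < pi0
    z = random.gauss(0.0, 1.0) + (0.0 if null_true else d)
    if z > z_alpha:       # reject H0
        R += 1
        V += null_true    # count false positives

pfdr_sim = V / R          # empirical proportion of false discoveries (B19)

# Bayesian form (B20), computed as in (B14) with a point alternative d,
# so the power integral reduces to Pr(Z + d > z_alpha):
power = 0.5 * (1.0 + math.erf((d - z_alpha) / math.sqrt(2.0)))
pfdr_theory = pi0 * alpha / (pi0 * alpha + (1.0 - pi0) * power)
print(pfdr_sim, pfdr_theory)
```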
The Bayes error (the posterior probability of making a wrong decision for a given Tα) is shown to be a combination of the pFDR and the analogous positive false nondiscovery rate (pFNR). Storey also defines the "q-value": analogous to the p-value being the lowest α under which a hypothesis would be rejected, the q-value is the lowest pFDR consistent with the data. This optimistic choice (analogous to the definition of the p-value) means that the q-value has some, if not all, of the limitations of the p-value—there is no valid interpretation of the q-value as an error rate (for false discoveries).
From a Bayesian perspective the use of false discovery rates is a step in the right direction, since using the frequentist FDR is approximately equivalent to using the Bayesian posterior probability. For example, the expected proportion of false positives among selected genes is simply the average posterior probability for the genes. These concepts are, however, superfluous within the Bayesian paradigm and we would not advocate using them except for convenience, e.g., if they had already been calculated or the Bayesian model was difficult to fit. In the Bayesian paradigm there is no problem with multiple comparisons and no need to control the FDR—e.g., we compute posterior probabilities for each gene, giving the information needed to maximize expected utility directly. The FDR gives the properties of a procedure that randomly chooses a gene from the set of genes corresponding to rejected null hypotheses, ignoring the individual gene statistics, while the posterior probability gives the actual probability for each specific gene, which is clearly more informative.
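For example, with hypothetical posterior probabilities for a handful of selected genes (the numbers below are illustrative, not from the article), the expected number and proportion of false positives follow directly:

```python
# Hypothetical posterior probabilities Pr(H0 | data) for five selected genes.
post_h0 = [0.02, 0.05, 0.10, 0.20, 0.01]

# The expected number of false positives in the selection is the sum of the
# posterior probabilities; the expected proportion (the FDR analog) is their
# average -- no multiple-comparison correction is involved.
expected_false_positives = sum(post_h0)
expected_fdr = sum(post_h0) / len(post_h0)
print(expected_false_positives, expected_fdr)
```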
Additionally, to estimate the FDR in the non-Bayesian framework, the prior probability for the proportion of true hypotheses needs to be estimated. This is where the procedure potentially loses efficiency compared to a full Bayesian model.
Rejecting or accepting H0 is not the only option:
Finally, we note that it is not always necessary to “reject” or “accept” H0, i.e., to choose only one of H0, H1 and act as if the chosen model is the true model. As in Ball (2001), inference on genetic architecture and avoidance of selection bias in estimates of gene effects can be achieved by considering multiple models according to their posterior probabilities.
The author thanks Craig Echt and Phillip Wilcox, leaders of the Forest Research Genomics group (now Cell Wall Biotechnology Centre), for supporting this work and Rowland Burdon for useful comments. This work was funded by the New Zealand Foundation for Research Science and Technology and an NSOF grant from the New Zealand Forest Research Institute.
Communicating editor: J. B. Walsh
- Received November 19, 2003.
- Accepted February 9, 2005.
- Genetics Society of America