Controlling the Proportion of False Positives in Multiple Dependent Tests
 R. L. Fernando*,†,1,
 D. Nettleton†‡,
 B. R. Southey§,
 J. C. M. Dekkers*,†,
 M. F. Rothschild*,† and
 M. Soller**
 ^{*} Department of Animal Science, Iowa State University, Ames, Iowa 50011
 ^{†} Lawrence H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, Iowa 50011
 ^{‡} Department of Statistics, Iowa State University, Ames, Iowa 50011
 ^{§} Department of Animal Sciences, University of Illinois, Urbana, Illinois 61801
 ^{**} Department of Genetics, Hebrew University, Jerusalem 91904, Israel
 1 Corresponding author: Department of Animal Science, Iowa State University, 225 Kildee Hall, Ames, IA 50011-3150. E-mail: rohan@iastate.edu
Abstract
Genome scan mapping experiments involve multiple tests of significance. Thus, controlling the error rate in such experiments is important. Simple extension of classical concepts results in attempts to control the genomewise error rate (GWER), i.e., the probability of even a single false positive among all tests. This results in very stringent comparisonwise error rates (CWER) and, consequently, low experimental power. We here present an approach based on controlling the proportion of false positives (PFP) among all positive test results. The CWER needed to attain a desired PFP level does not depend on the correlation among the tests or on the number of tests as in other approaches. To estimate the PFP it is necessary to estimate the proportion of true null hypotheses. Here we show how this can be estimated directly from experimental results. The PFP approach is similar to the false discovery rate (FDR) and positive false discovery rate (pFDR) approaches. For a fixed CWER, we have estimated PFP, FDR, pFDR, and GWER through simulation under a variety of models to illustrate practical and philosophical similarities and differences among the methods.
IN recent years a relatively new class of “multiple-test” genetic experiments has come into prominence, in which there is a strong prior assumption that a certain proportion of the tested alternative hypotheses are true. Consider, for example, a genomewide scan for linkage between a marker and a quantitative trait locus (QTL). In this situation, when heritability analysis shows that QTL are segregating in the population, the large number and close spacing of the markers employed ensures that an appreciable proportion of markers are in linkage to segregating QTL. The challenge is to identify these markers among all of the tested markers. Similarly, prior marker-QTL linkage mapping in a particular population may have identified a set of markers in linkage to segregating QTL. For purposes of marker-assisted selection, it is important to identify individuals heterozygous at these QTL. On Hardy-Weinberg assumptions, over a wide range of QTL allele frequencies one-third to one-half of the QTL will be heterozygous in any given individual. Thus, the experiment to identify the markers in linkage to heterozygous QTL in a particular individual starts with the strong prior assumption that a comparable proportion of the markers tested are indeed in such a state. Again, the challenge is to identify the individual-by-marker combinations for which this is true, among all tested individual-by-marker combinations. In many microarray experiments, treatments that cause physiological changes are administered to experimental units. One main goal of such experiments is to identify which of thousands of genes change expression as a result of treatment. Treatments are often designed to alter the expression of particular genes, so it is reasonable to assume that some measurable changes in gene expression occur.
Clearly in these examples, identification of a marker in linkage to a QTL, identification of an individual-by-marker combination that represents a heterozygous QTL, or identification of differentially expressed genes, there is the possibility of false-positive error. Controlling this error is important scientifically to avoid cluttering the literature with false results and, practically, to avoid expenditure of effort on false leads to genetic improvement or gene cloning.
One of the most widely used approaches to control errors in multiple tests is based on controlling the familywise type I error rate (FWER). The FWER is the probability of rejecting one or more true null hypotheses in a family of tests. In genome scans for QTL, it has been proposed that the family of tests should be defined as the set of all possible tests across the entire genome, thus controlling the genomewise type I error (GWER; Lander and Kruglyak 1995). The drawback of this approach is the drastic loss of power.
An alternative to attempting to avoid all false-positive results is to manage the accumulation of false positives relative to the total number of positive results that appear in the literature. Indeed, this is the approach that was traditionally taken in human genetics, where it was early realized that for a monogenic trait, if a comparisonwise type I error rate (CWER) of 0.05 is used as the threshold for declaring linkage, a large proportion of declared linkages would be false. Instead, in human linkage analysis error control has been based on controlling the posterior type I error rate (PER), which is the probability of nonlinkage between two loci given that linkage was declared between these two loci (Morton 1955). By definition, this has the above property of controlling the accumulation of false positives relative to the total number of positive results. Although originally defined for the single-test situation, the PER has also been discussed in a multiple-test situation (Risch 1991), where evenly spaced markers spanning the entire genome were sequentially tested for linkage to a single-trait locus. Assuming that the tests were independent, Risch (1991) computed the posterior type I error rate given that linkage was declared after k_{s} tests. When a constant threshold was used for declaring linkage, the posterior type I error rate decreased as k_{s} increased (Risch 1991).
In QTL scans, testing does not stop when one of the markers is declared to be linked to a QTL; all markers are tested for linkage to QTL. Further, with the increased availability of closely spaced markers, tests cannot be considered to be independent. Thus, to extend the philosophy underlying the posterior type I error rate to QTL scans, Southey and Fernando (1998) defined the proportion of false positives (PFP) as a generalization of the PER to the genome scan situation. As is shown in subsequent sections of this article, the PFP effectively controls the accumulation of false positives relative to the total number of positive results. In addition, the PFP level for a set of tests does not depend on the number of tests or the correlation structure among the tests. This makes the PFP particularly useful in QTL mapping applications, which often involve a large number of tests with a complex correlation structure.
Another approach that has been used to control the accumulation of false positives in QTL scans is based on controlling the false discovery rate (FDR; Benjamini and Hochberg 1995; Weller 2000; Mosig et al. 2001). Mosig et al. (2001) argued intuitively that the FDR as defined by Benjamini and Hochberg (1995) is not appropriate when the experiment has a large number of tests for which the null hypothesis is false; they proposed using an adjusted FDR, which takes this factor into account. Although not considered previously in the QTL mapping context, Storey (2002) defined the positive false discovery rate (pFDR) to be more suitable than FDR as a measure of false discoveries. Differences and similarities of these various methods with respect to PFP are discussed in a subsequent section of this article.
Our development of the PFP is general. However, we use simulations within the QTL mapping application to show how PFP compares to FWER, FDR, and pFDR and to illustrate how the estimated PFP levels compare to true PFP levels.
CONNECTION TO POSTERIOR TYPE I ERROR RATE
The philosophy behind the PFP approach is closely connected to the philosophy of the posterior type I error rate approach developed by Morton (1955) for the case of detecting linkage between a single marker locus and a monogenic trait locus. In this setting, the PER is the conditional probability that the true status between a randomly selected marker locus and the monogenic trait locus is one of nonlinkage, given a statistical test result interpreted as declaring linkage (Morton 1955). In technical notation, let the true status of linkage between the two loci be represented by a random variable L that can take one of two values, L = 1 if the two loci are linked and L = 0 if the two loci are not linked; and let the declared status of linkage between the two loci on the basis of some statistical test be represented by a random variable D that can also take one of two values, D = 1 if the two loci are declared linked and D = 0 if the two loci are declared not linked. Then the PER is Pr(L = 0 | D = 1). Following Morton (1955), this probability can be written as

PER = Pr(L = 0 | D = 1) = [Pr(D = 1 | L = 0)Pr(L = 0)] / [Pr(D = 1 | L = 0)Pr(L = 0) + Pr(D = 1 | L = 1)Pr(L = 1)].
For a monogenic trait in humans, the prior probability that a random marker is within detectable linkage of the trait locus is ∼0.02 (Elston and Lange 1975; Ott 1991), so that for a random marker, Pr(L = 1) = 0.02. Using a CWER of 0.05 to represent significance would give a PER of 0.73; i.e., of every 100 declared linkages, ∼73 would be false. The traditional LOD score of 3 required to declare linkage corresponds to a CWER between 0.0001 and 0.001 (Elston 1997). Taking 0.001 as the critical CWER to declare linkage, and supposing that the average power of the test is 0.90, the PER is

PER = (0.98 × 0.001) / (0.98 × 0.001 + 0.02 × 0.90) ≈ 0.05.
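This calculation is a direct application of Bayes' rule, and a short sketch makes it easy to check. The helper function below is illustrative (not from the article), and it assumes that the power of 0.90 stated for the LOD-3 case also applies to the CWER = 0.05 case, which reproduces the 0.73 figure.

```python
def per(alpha, prior_linked, power):
    """Posterior type I error rate: Pr(not linked | linkage declared).

    alpha        -- comparisonwise type I error rate (CWER)
    prior_linked -- prior probability that the marker is linked, Pr(L = 1)
    power        -- probability of declaring linkage given true linkage
    """
    false_pos = (1 - prior_linked) * alpha  # Pr(D = 1, L = 0)
    true_pos = prior_linked * power         # Pr(D = 1, L = 1)
    return false_pos / (false_pos + true_pos)

# CWER 0.05 with prior 0.02: most declared linkages would be false.
print(round(per(0.05, 0.02, 0.90), 2))   # 0.73
# CWER 0.001 (roughly a LOD score of 3): PER drops to about 0.05.
print(round(per(0.001, 0.02, 0.90), 2))  # 0.05
```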
For the case of a genome scan involving a set of k markers, Southey and Fernando (1998) defined the PFP as

PFP = [Σ_{j=1}^{k} Pr(D_{j} = 1, L_{j} = 0)] / [Σ_{j=1}^{k} Pr(D_{j} = 1)],

where L_{j} and D_{j} denote, as above, the true and declared linkage status for the jth marker.
For the general case involving a family of k hypothesis tests, we define

PFP = E(V)/E(R),

where V is the number of true null hypotheses rejected (false positives) and R is the total number of null hypotheses rejected.
PFP CONTROLS THE PROPORTION OF FALSE POSITIVES ACROSS MANY EXPERIMENTS
In this section we present two useful properties of PFP. Proofs of these properties are presented in the appendix.
Property 1: If the PFP level is equal to γ for each of n sets of tests corresponding to n independent experiments, then the PFP level for the collection of all tests associated with the n experiments is also equal to γ.
Property 2: If the PFP level is equal to γ for each of n sets of tests corresponding to n independent experiments, the observed proportion of false positives out of the total number of rejections across all n experiments converges to γ with probability 1 as the number of experiments increases, provided that the number of tests per experiment does not grow without bound.
Contrast property 1 with the situation encountered in FWER control. If the FWER is controlled at level γ for each of n independent families of tests, the FWER for the family consisting of the union of the n families of tests is 1 − (1 − γ)^{n}. This quantity may be several times larger than γ for even moderate n. As the number of independent sets of tests increases, it becomes prohibitively difficult to control the probability of one or more false-positive errors.
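The inflation is easy to quantify. This small sketch evaluates 1 − (1 − γ)^{n} at γ = 0.05 (an illustrative per-family level, not a value from the article) for a few values of n.

```python
def combined_fwer(gamma, n):
    """FWER for the union of n independent families, each controlled at gamma."""
    return 1 - (1 - gamma) ** n

for n in (1, 10, 50):
    print(n, round(combined_fwer(0.05, n), 3))
# Even at n = 10 the combined FWER is about 0.40; at n = 50 it exceeds 0.92.
```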
Rather than attempting to avoid all false-positive results, it makes sense to manage the accumulation of false positives relative to the total number of positive results that appear in the literature. The PFP approach provides precisely this type of error management, as illustrated by property 2. It is property 2 that suggests “proportion of false positives” as an appropriate name for the error measure E(V)/E(R). We show in a subsequent section of this article that control of other error measures (FWER, FDR, and pFDR) does not necessarily lead to control of the proportion of false-positive results among all positive results.
PFP DOES NOT DEPEND ON EITHER THE NUMBER OF TESTS OR THE CORRELATION STRUCTURE AMONG THE TESTS
Consider a collection of k tests. Let W_{j} be 1 or 0 depending on whether or not the jth null hypothesis is falsely rejected. Let S_{j} be 1 or 0 depending on whether or not the jth null hypothesis is rejected. Suppose the jth test is conducted at CWER α_{j}, and let π_{j} denote the probability that the jth null hypothesis is rejected. Let K_{0} and K_{1} form a partition of the indices 1, ..., k such that j ∈ K_{0} if the jth null hypothesis is true and j ∈ K_{1} if the jth null hypothesis is false. Then for all j ∈ K_{0}, we have E(W_{j}) = E(S_{j}) = α_{j}. For all j ∈ K_{1}, we have E(W_{j}) = 0 and E(S_{j}) = π_{j}. Now let p_{0} denote the proportion of true null hypotheses among all hypotheses tested, let α = Σ_{j∈K_{0}} α_{j}/k_{0} denote the average CWER over the k_{0} tests with true null hypotheses, and let π = Σ_{j∈K_{1}} π_{j}/k_{1} denote the average power over the k_{1} tests with false null hypotheses. Then, by linearity of expectation,

PFP = E(Σ_{j} W_{j}) / E(Σ_{j} S_{j}) = k_{0}α / (k_{0}α + k_{1}π) = p_{0}α / [p_{0}α + (1 − p_{0})π].   (8)
From expression (8) we can see that PFP depends only on the average CWER α, the proportion p_{0} of true null hypotheses out of all hypotheses tested, and the average power π. Note that, as claimed in the Introduction, the PFP does not depend on either the number of tests or the correlation structure among the tests. These properties are particularly desirable for application of the PFP approach to QTL mapping, where there is a nontrivial correlation structure among a large number of tests.
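Expression (8) can be evaluated directly, and the cancellation of k can be checked numerically. In the sketch below the values of p_{0}, α, and π are illustrative only (not taken from the article); the loop confirms that E(V)/E(R) is the same for any number of tests.

```python
def pfp(p0, alpha, power):
    """PFP from expression (8): depends only on p0, average CWER, average power."""
    return p0 * alpha / (p0 * alpha + (1 - p0) * power)

p0, alpha, power = 0.9, 0.001, 0.8  # illustrative values, not from the article
print(pfp(p0, alpha, power))

# The number of tests k cancels: E(V) = k*p0*alpha and
# E(R) = k*(p0*alpha + (1 - p0)*power), so E(V)/E(R) is free of k.
for k in (100, 1000, 10000):
    ev = k * p0 * alpha
    er = k * (p0 * alpha + (1 - p0) * power)
    assert abs(ev / er - pfp(p0, alpha, power)) < 1e-12
```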
INTERPRETATION OF PFP FOR A SINGLE EXPERIMENT: THE RELATION OF PFP AND PER
We have shown that PFP = PER for an experiment consisting of a single test of linkage between a random marker and a monogenic disease locus. In this section we demonstrate a more general result: the level of PER for a test randomly chosen from a family of k tests is equal to the level of PFP for the family of k tests. Let J denote a random index that is equally likely to take each value in {1, ..., k}. Then, using the notation of the previous section,

PER = Pr(W_{J} = 1 | S_{J} = 1) = Pr(W_{J} = 1)/Pr(S_{J} = 1) = [Σ_{j} E(W_{j})/k] / [Σ_{j} E(S_{j})/k] = E(Σ_{j} W_{j}) / E(Σ_{j} S_{j}) = PFP,

where the second equality holds because a falsely rejected null hypothesis is necessarily a rejected one (W_{J} = 1 implies S_{J} = 1).
ESTIMATING PFP FOR A GIVEN EXPERIMENT
For simplicity of notation, we assume henceforth that a single CWER α is used for each of k tests. Consideration of the case where the jth test is conducted at its own CWER α_{j} is a straightforward generalization. For any given CWER α, (8) indicates that the PFP can be estimated as

PFP̂ = p̂_{0}α / [p̂_{0}α + (1 − p̂_{0})π̂],

where p̂_{0} and π̂ are estimates of p_{0} and π.
Because E(R) = k[p_{0}α + (1 − p_{0})π], the denominator of (8) can be estimated by R/k, the observed proportion of rejections, so that the estimated PFP reduces to kp̂_{0}α/R.
An estimator of π is given by π̂ = (R/k − p̂_{0}α)/(1 − p̂_{0}), obtained by solving R/k = p̂_{0}α + (1 − p̂_{0})π̂ for π̂. The estimate p̂_{0} of the proportion of true null hypotheses can be obtained directly from the experimental results, e.g., by the method of Mosig et al. (2001) used in our simulation study.
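A minimal sketch of this estimation procedure is given below. The article estimates p_{0} by the method of Mosig et al. (2001); here a simpler stand-in is used, counting the share of P values above an arbitrary threshold λ = 0.5 (P values from true nulls are roughly uniform, so #{p > λ}/[(1 − λ)k] estimates p_{0}). The function name and P values are hypothetical, for illustration only.

```python
def estimate_pfp(pvalues, alpha, lam=0.5):
    """Estimate PFP at CWER alpha from a list of P values.

    p0 is estimated by #{p > lam} / ((1 - lam) * k), a simple stand-in for
    the Mosig et al. (2001) estimator; the denominator of expression (8) is
    estimated by the observed rejection fraction R/k, giving
    PFP_hat = k * p0_hat * alpha / R.
    """
    k = len(pvalues)
    p0_hat = min(1.0, sum(p > lam for p in pvalues) / ((1 - lam) * k))
    r = sum(p <= alpha for p in pvalues)
    if r == 0:
        return 0.0  # no rejections, hence no positive results to be false
    return min(1.0, k * p0_hat * alpha / r)

# Illustrative P values: a few strong signals plus a roughly uniform remainder.
pvals = [0.0001, 0.0005, 0.001, 0.004, 0.01, 0.06, 0.12, 0.2, 0.3, 0.35,
         0.42, 0.48, 0.55, 0.62, 0.7, 0.78, 0.85, 0.9, 0.95, 0.99]
print(estimate_pfp(pvals, alpha=0.01))  # 20 * 0.8 * 0.01 / 5 ≈ 0.032
```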
COMPARISON OF PFP, FWER, FDR, AND pFDR
Benjamini and Hochberg (1995) defined FDR as

FDR = E(Q), where Q = V/R if R > 0 and Q = 0 if R = 0;

equivalently, FDR = E(V/R | R > 0)Pr(R > 0). Storey (2002) defined the positive false discovery rate as pFDR = E(V/R | R > 0).
We have previously shown that control of PFP across multiple experiments will lead to control of the proportion of falsepositive results among all positive results in the long run. We now show by a hypothetical example that the other error measures (FDR, pFDR, and FWER) do not necessarily share this property.
Suppose that for each experiment in a series of independent and identical experiments V/R is 50/100 with probability 0.1, 0/10 with probability 0.5, and 0/0 with probability 0.4. Then

PFP = E(V)/E(R) = (0.1)(50) / [(0.1)(100) + (0.5)(10)] = 5/15 = 1/3,
FDR = (0.1)(50/100) + (0.5)(0) + (0.4)(0) = 0.05,
pFDR = E(V/R | R > 0) = 0.05/0.6 = 1/12, and
FWER = Pr(V > 0) = 0.1.
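These four quantities follow mechanically from the stated distribution of (V, R). The sketch below encodes the three outcomes with their probabilities and evaluates each error measure from its definition.

```python
# Outcomes (V, R) with their probabilities, as in the hypothetical example.
outcomes = [((50, 100), 0.1), ((0, 10), 0.5), ((0, 0), 0.4)]

ev = sum(p * v for (v, r), p in outcomes)                  # E(V) = 5
er = sum(p * r for (v, r), p in outcomes)                  # E(R) = 15
pfp = ev / er                                              # 1/3
fdr = sum(p * (v / r) for (v, r), p in outcomes if r > 0)  # E(V/R), 0 when R = 0
pr_pos = sum(p for (v, r), p in outcomes if r > 0)         # Pr(R > 0) = 0.6
pfdr = fdr / pr_pos                                        # E(V/R | R > 0) = 1/12
fwer = sum(p for (v, r), p in outcomes if v > 0)           # Pr(V > 0) = 0.1

print(pfp, fdr, pfdr, fwer)
```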
This example shows that control of FDR, pFDR, or FWER will not guarantee control of the accumulation of falsepositive results as a proportion of all positive results over multiple experiments. Obviously the example has been artificially constructed to emphasize the differences among the error measures. This example involves independent experiments, which means that the tests in one experiment are independent of tests in another. The tests within any of the experiments, however, are not necessarily independent of each other. Indeed, these tests must be dependent to obtain the behavior described in the example. Note that when a large number of rejections occur, the ratio V/R is high (50/100). On the other hand, when a small number of rejections occur, the ratio V/R is quite low (0/10). Such a situation can arise in the QTL mapping setting. Suppose that a QTL for a trait of interest lies on a chromosome for which few markers are available. Suppose that some other chromosomes have a high density of markers. A high density of markers on a chromosome without the QTL translates into a high positive correlation among tests for which the null hypothesis is true. Because dense markers are positively correlated, a falsepositive result at any one of these markers is likely to be accompanied by many other falsepositive results at neighboring markers. With few markers on the chromosome containing the QTL, there can never be a large number of true positive results. Thus a large number of rejections will occur only when there are a large number of false positives. It is in such situations that we will see substantial differences between PFP and the other error measures. Such a scenario is created in model 5 of our simulation study described later in this article.
Although the example of this section and model 5 of our simulation show that the error measures can differ substantially, there are many similarities among FDR, pFDR, and PFP. Storey (2002) has shown that when the tests are identically and independently distributed pFDR = PER; i.e., the level of pFDR for a set of k tests is equal to the level of PER for a randomly chosen test. Storey (2003) has shown that pFDR = PFP when the tests are independent (Corollary 1 in Storey 2003) and that pFDR and FDR will be approximately equivalent to PER (and thus PFP) as the number of tests in a family grows large as long as the test statistics corresponding to the family of tests satisfy a “weak dependence” condition (Theorem 4 in Storey 2003). We have shown that the equality between PFP and PER holds in general regardless of the dependence structure among the test statistics or the number of tests conducted. A probability interpretation of pFDR that holds even when tests are not independent or identically distributed is given below.
Let A denote the event, “a positive result, randomly selected from all positive results, is a false positive.” We have

pFDR = E(V/R | R > 0) = Pr(A | R > 0);

i.e., pFDR is the probability that a randomly selected positive result is a false positive, given that the experiment produced at least one positive result.
It is easiest to understand the somewhat subtle difference between this interpretation of pFDR and the interpretation of PFP as PER by considering the example presented in this section. In the example pFDR is determined as follows. Of the experiments with at least one positive result, about five-sixths of the experiments will have 0 as the probability that a randomly selected positive result will be a false positive, while the other one-sixth will have probability 0.5 that a randomly selected positive result is a false positive. Thus pFDR is (5/6) · 0 + (1/6)(0.5) = 1/12, which is exactly the probability that a randomly selected positive result will be a false positive, given that the experiment resulted in at least one positive result. Note that this calculation in no way accounts for the fact that there are many more positive results in the less likely experimental outcome [Pr(V/R = 50/100) = 0.1] than in the more likely outcome [Pr(V/R = 0/10) = 0.5]. On the other hand, PFP = PER is the probability that a randomly selected result is a false positive, given that it is positive. By conditioning on the event that the randomly selected result is positive rather than on the event that the experiment contains at least one positive, PFP accounts for differences in the number of positive results across experimental outcomes because randomly selected events are more likely to be positive in experiments with many positive results. In contrast to pFDR, PFP weights experimental outcomes V/R by both their probability of occurrence and the number of rejections R. For our hypothetical example, we can write PFP as a weighted average of the V/R ratios as

PFP = [(0.1)(100)(50/100) + (0.5)(10)(0)] / [(0.1)(100) + (0.5)(10)] = 5/15 = 1/3.
A SIMULATION STUDY
A QTL scan with 500 backcross offspring from inbred lines was simulated. The simulation was used to compare PFP with FWER, FDR, and pFDR and to illustrate how the estimated PFP levels compare to true PFP levels. The simulation was repeated for five simple genetic models.
QTL model 1: This model had 10 chromosomes with one QTL at the center of the chromosome; the 10 QTL were of equal effect, so that each accounted for 10% of the genetic variance. The remaining 20 chromosomes had no QTL. The simulated trait was completely additive with a heritability of 0.25 in the F_{2} generation. The residuals were normally distributed. Each chromosome was 100 cM long and had 21 equally spaced markers.
QTL model 2: This model was obtained from model 1 by moving the QTL from the center to the left by 25 cM for each of the 10 chromosomes with a QTL.
QTL model 3: This model was obtained from model 1 by increasing the number of chromosomes with a single QTL at the center from 10 to 20 and by decreasing the number of chromosomes with no QTL from 20 to 10. As this model contained 20 QTL of the same effect, each accounted for 5% of the additive genetic variance.
QTL model 4: This model was obtained from model 1 by decreasing the number of chromosomes with a single QTL at the center from 10 to 5 and by increasing the number of chromosomes with no QTL from 20 to 25. As this model contained five QTL of the same size, each accounted for 20% of the additive genetic variance.
QTL model 5: This model with only two chromosomes was constructed to illustrate that PFP can give quite different results from pFDR and FDR. The first chromosome was 100 cM long with one QTL at the center and 11 equally spaced markers. The second chromosome also was 100 cM long with no QTL and 101 equally spaced markers. The heritability for the trait was 0.025.
The scan for QTL was based on testing each marker for linkage to QTL by a t-test for comparing the means for the trait between the two marker genotype classes (Soller et al. 1976). The null hypothesis of no linkage to a QTL was rejected if the P value for the test was lower than the critical CWER. For each experiment, the numbers of positive (R) and false-positive (V) test results were counted given the critical CWER values of 0.01, 0.001, and 0.0001. For each model, 50,000 replications of the experiment were used to obtain empirical values for PFP, pFDR, FDR, and FWER, which in this context is called the GWER (Lander and Kruglyak 1995). The empirical PFP was obtained as V̄/R̄, V̄ and R̄ being the mean values of V and R over the 50,000 replications of the experiment; empirical pFDR was obtained as the mean value of the ratio V/R over all experiments with R > 0; empirical FDR was obtained as empirical pFDR times the proportion of experiments with R > 0; and empirical GWER was obtained as the proportion of experiments with V > 0. The results for these empirical values are given in Table 1.
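The empirical definitions in this paragraph translate directly into code. The sketch below applies them to a small made-up set of replicate (V, R) counts, for illustration only; it is not the simulation output of the article.

```python
def empirical_measures(counts):
    """Empirical PFP, pFDR, FDR, and GWER from per-replicate (V, R) counts."""
    n = len(counts)
    v_bar = sum(v for v, r in counts) / n
    r_bar = sum(r for v, r in counts) / n
    positive = [(v, r) for v, r in counts if r > 0]
    pfp = v_bar / r_bar                                     # Vbar / Rbar
    pfdr = sum(v / r for v, r in positive) / len(positive)  # mean V/R over R > 0
    fdr = pfdr * len(positive) / n                          # pFDR * Pr(R > 0)
    gwer = sum(1 for v, r in counts if v > 0) / n           # proportion with V > 0
    return pfp, pfdr, fdr, gwer

# Made-up replicate counts for illustration only.
reps = [(2, 10), (0, 5), (1, 8), (0, 0)]
print(empirical_measures(reps))
```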
Table 1 shows that PFP, pFDR, and FDR were practically identical to each other for model 1 through model 4, while GWER was very different from these. For these four models, using a P-value threshold of 0.001 was sufficient to control PFP, pFDR, or FDR to well below 0.05, while with this threshold GWER was well above 0.05. The results for model 5 show that PFP can be quite different from pFDR and FDR and that pFDR can be different from FDR.
DISCUSSION
In linkage analysis, significance testing has not been based on controlling the type I error rate, but on controlling the PER, which is the conditional probability of a falsepositive result given a positive test result (Morton 1955; Ott 1991). For QTL scans, which involve multiple tests of linkage, Southey and Fernando (1998) proposed PFP as a natural extension of PER, which was defined for a single test. In this article we provided the mathematical justification for this proposal.
Briefly, the justification is as follows. If the level of PER for a test is γ, then as the number of independent tests increases, the proportion of false positives in the accumulated positive results converges to γ. We have shown here that if the level of PFP in a multiple test experiment is γ, then as the number of such independent experiments increases, the proportion of false positives in the accumulated positive results also converges to γ. Alternatively, we have shown here that when the number k of tests is 1, controlling PER is equivalent to controlling PFP. Further, when k > 1, we showed that controlling the PER for a test that is randomly chosen from the set of k tests is equivalent to controlling the PFP defined over all k tests. These results hold for any dependence structure among the k tests in an experiment.
When tests are identically and independently distributed, pFDR = PFP, and thus, in this situation, controlling PER for a randomly chosen test is equivalent to controlling pFDR (Storey 2002). A probability interpretation of pFDR that holds even when tests are neither independent nor identically distributed was given here: if an experiment with level γ for pFDR has one or more positive test results, γ is the conditional probability that a result randomly sampled from these positive results is a false positive.
Thus in multipletest experiments, controlling PFP will result in controlling the proportion of falsepositive results in the accumulated positive test results over many experiments, while controlling pFDR will result in controlling the expected proportion of false positives in the positive test results in each experiment. When tests are independently and identically distributed, pFDR = PFP, and, thus, false positives will be controlled to the same level in each experiment and in the accumulated test results over many experiments. The simulation results for models 1–4 show that even when tests are highly dependent, pFDR and PFP can give very similar results. For tests that are identically distributed but dependent, Storey (2003) has given conditions under which pFDR will converge to PER = PFP as the number of tests increases. As demonstrated by the results for model 5, however, it is clear that in some situations controlling PFP is not equivalent to controlling pFDR or FDR.
Like pFDR, PFP = 1 if all the null hypotheses tested are true. Thus neither pFDR nor PFP can be controlled in the same sense that FDR can be controlled (Benjamini and Hochberg 1995). Nonetheless, we believe the more direct interpretations of pFDR and PFP make these error measures worth considering. We have illustrated the connection between the PER and PFP and have shown that, unlike FDR and pFDR, PFP is free of the correlation structure among the tests. Storey and Tibshirani (2001) propose a method for approximating FDR and pFDR under general dependence structures. Using their method requires the ability to draw samples from an approximation to the joint distribution of the test statistics when all null hypotheses are true. This is not a trivial computing exercise and may be very difficult to accomplish in some situations. In contrast, the approach to estimate PFP that we have presented here requires only the P values corresponding to the tests of interest. Such P values can be obtained without simulation in situations where the approximate marginal distributions of the test statistics are known.
In this article we used the method proposed by Mosig et al. (2001) to estimate p_{0} and π to demonstrate the estimation of PFP (Table 2). In most of the cases our estimates of PFP were conservative. In only two cases with very low heritability and one QTL was the mean of the estimated PFP levels lower than the empirical value. Even in these cases, when estimated PFP was ∼0.05, the empirical PFP was only slightly higher. Research in methods for estimating p_{0} is ongoing, so we believe the estimates of PFP illustrated here can be improved by using improved estimates of p_{0}. It is also worth noting that in models 1–4, where the number of QTL ranged from 5 through 20, using a critical CWER of 0.001 was sufficient to control PFP below 0.05.
Acknowledgments
The authors thank two anonymous reviewers for useful comments. R. Fernando acknowledges support from the National Research Initiative Competitive Grants Program of the U.S. Department of Agriculture, Award 20023520511546; D. Nettleton acknowledges support from the National Research Initiative Competitive Grants Program of the U.S. Department of Agriculture, Award 19983520510390; M. Soller acknowledges support from the BovMAS project of the European Union FP5 program and the Cotswold Swine Breeding Company.
APPENDIX
Proof of Property 1. Let V_{i} denote the number of rejections of a true null hypothesis in the ith experiment, and let R_{i} denote the number of rejections of a null hypothesis in the ith experiment. If we have E(V_{i})/E(R_{i}) = γ for all i = 1, ..., n, then the PFP across the set of n independent experiments is given by

E(Σ_{i} V_{i}) / E(Σ_{i} R_{i}) = Σ_{i} E(V_{i}) / Σ_{i} E(R_{i}) = Σ_{i} γE(R_{i}) / Σ_{i} E(R_{i}) = γ.
Proof of Property 2. We begin the proof of property 2 by noting that

Σ_{i=1}^{n} V_{i} / Σ_{i=1}^{n} R_{i} = [(1/n)Σ_{i=1}^{n} V_{i}] / [(1/n)Σ_{i=1}^{n} R_{i}].

By Corollary 1 to Theorem 6 in Rohatgi (1976), the averages (1/n)Σ_{i=1}^{n} V_{i} and (1/n)Σ_{i=1}^{n} R_{i} each converge to their expected values with probability 1 as n increases, because the variances of V_{i} and R_{i} are bounded when the number of tests per experiment does not grow without bound. Because E(V_{i})/E(R_{i}) = γ for every i, the ratio on the left therefore converges to γ with probability 1.
Footnotes
Communicating editor: J. B. Walsh
 Received February 8, 2003.
 Accepted October 1, 2003.
 Copyright © 2004 by the Genetics Society of America