Abstract
Molecular epidemiological association studies use valuable biosamples and incur costs. Statistical methods for early genotyping termination may conserve biosamples and costs. Group sequential methods (GSM) allow early termination of studies on the basis of interim comparisons. Simulation studies evaluated the application of GSM using data from a case-control study of GST genotypes and prostate cancer. Group sequential boundaries (GSB) were defined in the EAST2000 software and were evaluated for study termination when early evidence suggested that the null hypothesis of no association between genotype and disease was unlikely to be rejected. Early termination of GSTM1 genotyping, which demonstrated no association with prostate cancer, occurred in >90% of the simulated studies. On average, 36.4% of biosamples were saved from unnecessary genotyping. In contrast, for GSTT1, which demonstrated a positive association, inappropriate termination occurred in only 6.6%. GSM may provide significant cost and sample savings in molecular epidemiology studies.
Although group sequential methods (GSM) are routinely used to monitor randomized clinical trials, they have not yet been widely applied to molecular epidemiology (ME) studies. In clinical trials, GSM allow early closure of one or more treatment arms on the basis of interim analysis (Whitehead 1999). By enabling early closure, GSM protect patients from unnecessary exposure to an unfavorable treatment. The statistical “cost” for early closure, the loss of precision of the effect size estimate, is acceptable because patients are protected from unnecessary exposure to unfavorable therapies. There is an extensive literature on group sequential methods and their application to clinical trials; an excellent summary is provided in Jennison and Turnbull (2000).
Early closure for “futility,” in which the study is unlikely to lead to rejection of the null hypothesis, is becoming more commonly used in clinical trials. Although ME studies lack this ethical imperative for early closure, such studies would benefit from early closure for futility for several reasons. First, such studies often use biologic samples that are difficult to obtain or limited in quantity. Second, genotype assessment incurs both material and labor costs. Thus, early closure for failure to reject the null hypothesis may save samples, reagents, labor, and opportunity costs. Finally, clearly defined interim analysis procedures would provide investigators with a formal tool for evaluating their data on an ongoing basis.
Previous investigators have described the importance of early closure for null effects (Gould 1983; Jennison and Turnbull 2000). O’Neill and Anello first described the use of the Wald sequential probability ratio test (SPRT), an open procedure, in a matched case-control study (O’Neill and Anello 1978). Pasternak and Shore (1980) demonstrated that in a cohort study the group sequential design had generally higher efficiency than that of a fixed sample plan. Kaaks et al. (1994) demonstrated the application of a sequential t-test to the use of biologic samples. Van der Tweel and van Noord (2000) described both a SPRT and a triangular test for sample sequential analysis of genotype data. Recently, Satagopan et al. (2002) described the use of a two-stage design for maximizing power when the total cost is the primary study constraint.
Current molecular epidemiology studies, however, have practical characteristics that preclude these approaches. First, the finite number of available samples and limits on funding time lines prevent the use of an “open” GSM whose sample size is potentially unlimited. Second, almost all molecular epidemiology studies acquire genotype data on a group of samples simultaneously. Thus, the most appropriate GSM must evaluate sequential groups of genotype data rather than sequential individual genotypes. Finally, current studies often evaluate a small number of genotypes (<10), thus making the sample itself the primary limiting variable.
We evaluated group sequential boundary methods because of their widespread use and the availability of commercial GSM software. In GSM, the interim “looks” are usually equally spaced and their number is predefined at the design stage, although these criteria may be relaxed during study conduct. In a case-control study, the test statistic is the χ^{2} value corresponding to the odds ratio of disease between cases and controls. In the case of early stopping for futility, if the χ^{2} test statistic is less than a predefined value, called a boundary value, then it is unlikely that genotyping additional samples will give a statistically significant result. Therefore genotyping stops once the χ^{2} test statistic crosses this boundary. Stopping boundaries may be defined by commercial software packages such as EAST2000 (Cambridge, MA; http://www.cytel.com) or PEST (Reading, UK; http://www.rdg.ac.uk/mps/mps_home/software/pest4/pest4.htm) or by writing local software (Schoenfeld 2001).
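As a minimal sketch of the futility check described above, the following Python fragment computes the odds ratio and the Pearson χ^{2} statistic (1 d.f., no continuity correction) from a 2x2 table of genotype by case status and compares the statistic with a boundary value. The function names and the boundary value used in the example are hypothetical; real boundaries would come from software such as EAST2000 or PEST, and the choice of the uncorrected Pearson statistic is an assumption, since the text does not specify the exact form.

```python
def odds_ratio_and_chi2(case_pos, case_neg, ctrl_pos, ctrl_neg):
    """Return (odds ratio, Pearson chi-square with 1 d.f.) for a 2x2 table.

    case_pos / case_neg: cases with / without the genotype of interest.
    ctrl_pos / ctrl_neg: controls with / without the genotype of interest.
    Assumes no cell or margin is zero.
    """
    a, b, c, d = case_pos, case_neg, ctrl_pos, ctrl_neg
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    # Pearson chi-square without continuity correction
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return odds_ratio, chi2


def stop_for_futility(chi2, boundary):
    """True when the statistic lies below the futility boundary at this look."""
    return chi2 < boundary


# Illustrative counts and an illustrative boundary (not taken from the study)
or_hat, chi2 = odds_ratio_and_chi2(20, 30, 10, 40)
print(f"OR = {or_hat:.2f}, chi-square = {chi2:.2f}, "
      f"stop = {stop_for_futility(chi2, boundary=1.0)}")
```

A statistic below the boundary stops genotyping; a statistic above it sends the study on to the next look.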
Figure 1 demonstrates the evolution of a test statistic in a hypothetical study with eight looks. The study would terminate early to accept the null hypothesis if the path of the test statistic crossed the boundary at any point, as occurs at look number 6. For some choices of parameter values, early closure is not possible. For example, the boundary shown in Figure 1 does not allow closure at the first look, where it is undefined. Therefore, irrespective of the results obtained at the first look, a second round of genotyping would be required.
Simulation studies were used to evaluate the application of GSM. Two previously published data sets of GST genotype and prostate cancer risk were used for the simulations (Rebbeck et al. 1999). These data sets were chosen for several reasons. First, the GSTM1 data set reported a null association and represented the case where early stopping for futility with a GSM could provide significant sample and cost savings. Second, the GSTT1 data set reported a positive association and was used to evaluate the frequency of inappropriate genotyping termination. Since publication, additional cases and controls have been genotyped; the sample set used in the simulations contained a total of 675 GSTM1 and 725 GSTT1 genotypes. The observed odds ratio (OR) for GSTM1 was OR = 0.99, 95% confidence interval (C.I.) 0.72-1.38; for GSTT1, the OR = 1.61, 95% C.I. 1.12-2.32. In addition to representing both a null and a positive association, both data sets have sample sizes and odds ratios typically seen in present-day ME studies. Finally, the raw data were readily available for the simulation studies.
O’Brien-Fleming (OBF) stopping boundaries for both rejection and failure of rejection of the null hypothesis at each interval of genotype data acquisition were defined using EAST2000 (O’Brien and Fleming 1979). We chose the OBF boundary because it is most frequently used to monitor clinical trials and is more conservative than the alternative Pocock boundary for both rejection and acceptance of the null hypothesis. A more conservative boundary also limits the decrease in power associated with using GSM. We chose EAST2000 for its ease of use and commercial availability, since many groups conducting molecular epidemiology studies do not have the resources to generate in-house software. Although EAST2000 requires that boundaries for rejection of the null hypothesis be part of the overall calculation of GSM boundaries, only boundaries for failure of rejection of the null hypothesis were used in the simulations.
For all simulations, the overall two-sided type I error was set at α = 0.05. Since the sample pool was fixed (N = 675 for GSTM1 and 725 for GSTT1), the power was defined by the sample size, null genotype frequencies in controls, and OR. We chose not to specify a type II error rate to examine the performance of the GSM method over a range of genotype frequencies and ORs. Genotype frequencies in controls were set at 10%, at 50%, and at the genotype frequency observed in the data set used. The observed genotype frequencies were 38% for GSTM1 and 28% for GSTT1. ORs of 1.6, 1.8, and 2.0 were examined. The OR of 1.6 was chosen to correspond to that observed for GSTT1. An OR of 2.0 corresponds to that often used as the target “clinically significant association” for many epidemiological studies. The OR of 1.8 was chosen to be intermediate between these two. For these simulations, the interval of genotype data acquisition was termed a “look.” Each look contained a multiple of 90 genotypes to simulate genotype acquisition from a 96-well PCR-based genotyping method (e.g., 90 genotypes and 6 control samples per PCR run).
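To illustrate how a fixed sample pool, the genotype frequency in controls, and the OR together determine power, the sketch below uses a standard normal approximation for a two-sided comparison of two proportions. It is not the calculation performed by EAST2000. The function names are hypothetical, and the split of the 725 GSTT1 genotypes into 362 cases and 363 controls is an assumption made only for illustration, because the case/control split is not stated here.

```python
from statistics import NormalDist


def case_frequency(p1, odds_ratio):
    """Genotype frequency in cases implied by control frequency p1 and the OR."""
    return odds_ratio * p1 / (1.0 - p1 + odds_ratio * p1)


def approx_power(n_cases, n_controls, p1, odds_ratio, alpha=0.05):
    """Normal-approximation power of a two-sided test of two proportions."""
    p2 = case_frequency(p1, odds_ratio)
    se = (p1 * (1 - p1) / n_controls + p2 * (1 - p2) / n_cases) ** 0.5
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of rejecting in the wrong direction
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)


# Assumed split of the GSTT1 pool (hypothetical): 362 cases, 363 controls
for p1 in (0.10, 0.28, 0.50):
    for odds_ratio in (1.6, 1.8, 2.0):
        power = approx_power(362, 363, p1, odds_ratio)
        print(f"p1 = {p1:.2f}, OR = {odds_ratio:.1f}: approx. power = {power:.2f}")
```

Under these assumptions, power rises with the OR for a fixed control frequency, which is the relationship the simulation grid is designed to explore.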
In addition to the simulation parameters defined by the baseline frequencies and OR, three different look strategies were examined. The first strategy had two looks, with the interim look occurring after ∼50% of the samples had been genotyped. The second strategy used the maximum number of possible looks, given the sample size and the restriction that each look (except the last) must include a multiple of 90 samples. The third strategy chosen was intermediate between these. Thus simulations for GSTM1 examined two, three, or seven looks; two, four, or eight looks were examined for GSTT1.
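The maximum-look strategy can be sketched as follows. The rule coded here, in which every look but the last adds one full batch of 90 genotypes and the last look absorbs the remainder, is an assumption; it is one reading of the restriction above that reproduces the stated maxima of seven looks for 675 genotypes and eight looks for 725. The function name is hypothetical.

```python
def max_look_schedule(total, batch=90):
    """Cumulative sample sizes when each look but the last adds one full batch.

    Assumption: the final look absorbs whatever remains after the full batches.
    """
    n_looks = total // batch
    if n_looks < 1:
        return [total]
    schedule = [batch * k for k in range(1, n_looks)]
    schedule.append(total)  # last look takes all remaining samples
    return schedule


for name, total in (("GSTM1", 675), ("GSTT1", 725)):
    looks = max_look_schedule(total)
    print(f"{name}: {len(looks)} looks at cumulative n = {looks}")
```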
A total of 1000 replications were performed for each of the 27 combinations of baseline gene frequency, OR, and number of looks. Simulations were done separately for the GSTM1 and GSTT1 data sets. For each replication, prostate cancer cases and controls were randomly sampled from the true data sets without replacement and in proportion to their relative frequencies. The observed OR and χ^{2} test statistic were calculated for each look. The χ^{2} test statistic was then compared to the boundary value calculated by EAST2000 for study termination. If the test statistic was less than the boundary for early stopping, i.e., if the test statistic “crossed the boundary,” then the run terminated. If the test statistic did not cross the boundary, then an additional look was selected and the test statistic recalculated, accounting for the information gained in the prior look. This procedure was repeated until the test statistic crossed a boundary or all genotypes were sampled. All simulation studies and analyses were performed using STATA v7.0 (College Station, TX).
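The replication procedure can be sketched in Python as below. This is an illustration under stated assumptions, not the STATA code used in the study: the data set is a hypothetical null data set (300 cases and 375 controls, about 38% genotype frequency in each), the futility boundaries are placeholders rather than EAST2000 output, and a random shuffle of all subjects stands in for sampling cases and controls in proportion to their relative frequencies.

```python
import random


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 d.f.) for a 2x2 table; 0.0 if a margin is empty."""
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom


def run_replicate(subjects, look_sizes, futility_bounds, rng):
    """Return (genotypes used, stopped_early) for one simulated study.

    subjects: list of (is_case, has_genotype) pairs.
    look_sizes: cumulative sample sizes; the last entry is the full data set.
    futility_bounds: one value per interim look; None where no stop is allowed.
    """
    order = list(subjects)
    rng.shuffle(order)  # a random order gives sampling without replacement
    for n_cum, bound in zip(look_sizes[:-1], futility_bounds):
        seen = order[:n_cum]  # cumulative data: earlier looks are retained
        a = sum(1 for case, geno in seen if case and geno)
        b = sum(1 for case, geno in seen if case and not geno)
        c = sum(1 for case, geno in seen if not case and geno)
        d = n_cum - a - b - c
        if bound is not None and chi2_2x2(a, b, c, d) < bound:
            return n_cum, True
    return look_sizes[-1], False


# Hypothetical null data set (not the study data)
subjects = ([(True, True)] * 114 + [(True, False)] * 186
            + [(False, True)] * 142 + [(False, False)] * 233)
look_sizes = [90, 180, 270, 360, 450, 540, 675]
placeholder_bounds = [None, 0.05, 0.3, 0.8, 1.5, 2.4]  # NOT EAST2000 output

rng = random.Random(2003)
results = [run_replicate(subjects, look_sizes, placeholder_bounds, rng)
           for _ in range(1000)]
early_fraction = sum(stopped for _, stopped in results) / len(results)
mean_n = sum(n for n, _ in results) / len(results)
print(f"terminated early: {early_fraction:.1%}, mean genotyped: {mean_n:.0f}")
```

The `None` entry mirrors a look at which the boundary is undefined, so that no early closure is possible there regardless of the data.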
In the above, we handled early closure by applying the boundary values directly on the χ^{2} test statistic scale. This approach extends to test statistics that are not built into standard group sequential software packages. However, it should be noted that an alternative means of monitoring a molecular epidemiology study would be to use directly the methods developed for a comparison of two binomials. These methods are available in, for example, EAST2000.
Results for GSTM1 simulations are shown in Table 1. Overall, 91.5% of the simulations terminated early with a range of 4.5-100%. The median genotyped sample size was 459. Thus, use of GSM decreased the median sample size by 32%. Results for the GSTT1 simulations are shown in Table 2. On average, only 6.6% of the GSTT1 simulations terminated early. The median sample size was 714 with the sample size of 725 representing the entire data set. This low frequency of termination is appropriate as an association between GSTT1 genotype and outcome was present in the data set.
Our simulations indicate that GSM may provide significant improvements for case-control molecular epidemiology studies. Our approach of evaluating genotype data in multiples of 90 more closely reflects laboratory data acquisition and is thus directly applicable to large molecular epidemiology studies. For GSTM1 simulations with 80% power, assuming a genotype cost of $3.00/genotype, the use of GSM would save ∼$650 from a total cost of $2025 (675 genotypes at $3.00 each), in addition to savings in technician time and reagents. This sample size savings had a relatively small cost to the overall power of the study. The average difference in study power between a fixed sample design and a GSM design for GSTM1 simulations with 80% power was 3.3% (average fixed sample size power was 86.2%; average GSM design power was 82.9%). For these simulations, the average difference in study power between a one-look and a maximum-look strategy was also small (3.3%).
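The cost figures can be checked with simple arithmetic. The total of $2025 equals 675 genotypes at $3.00 each, i.e., the size of the GSTM1 sample pool. The sketch below additionally assumes that the savings are computed from the overall median genotyped sample size of 459 reported for the GSTM1 simulations; the ∼$650 quoted in the text refers to simulations with 80% power and may be based on a slightly different figure.

```python
COST_PER_GENOTYPE = 3.00   # dollars, as assumed in the text
TOTAL_GENOTYPES = 675      # GSTM1 sample pool
MEDIAN_GENOTYPED = 459     # median genotyped sample size for GSTM1 (assumed basis)

full_cost = TOTAL_GENOTYPES * COST_PER_GENOTYPE
saved_samples = TOTAL_GENOTYPES - MEDIAN_GENOTYPED
saved_cost = saved_samples * COST_PER_GENOTYPE
saved_fraction = saved_samples / TOTAL_GENOTYPES

print(f"full cost: ${full_cost:.0f}")
print(f"samples saved: {saved_samples} ({saved_fraction:.0%})")
print(f"genotyping cost saved: ${saved_cost:.0f}")
```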
A number of observations may be made regarding the effects of varying model parameters on the probability of early stopping. First, the frequency of early stopping decreased as the study power increased. Although power is affected by the baseline frequency, OR, and sample size, the frequency of early stopping was “monotonic” in power. Thus, in all cases lower-power studies had higher rates of termination and terminated at earlier looks than did models with higher power. This corresponds to the intuition that studies with low power should be more likely to close early because the a priori chance of finding a significant association is very small, even if an association exists. However, appropriately powered models closed appropriately early in the GSTM1 simulations and had low rates of inappropriate closure in the GSTT1 simulations.
The baseline genotype frequency in controls (p1) directly affects the statistical power. GSTM1 models with a baseline frequency p1 = 0.38 or p1 = 0.50 and GSTT1 models with p1 = 0.28 had the highest power for a given OR and number of looks. These higher-power models closed later and had larger average sample numbers. Likewise, simulations with larger OR closed later and had larger average sample numbers than simulations with lower OR for the same p1 and number of looks.
Finally, increasing the number of looks decreased the study power and in general decreased the average sample number. Interestingly, for GSTM1 models with a typical power of ∼80%, an intermediate number of looks had higher average sample numbers than models with either the minimum or the maximum number of looks. Models with two looks obtained enough genotype information at the first look to close early at a high rate, with attendant sample size savings. This is consistent with the results of similar analyses in clinical trials (Pocock 1982). Models with seven looks gained the majority of genotype information in the middle looks, also allowing for substantial sample size savings. However, models with three looks had low rates of closure at look 1, thus requiring a second look.
Since our simulations indicate that an intermediate look strategy may give a higher average sample number for studies with ∼80% power, investigators may wish to choose either a minimum- or a maximum-look strategy. Since the power cost of additional looks is relatively small, the optimal number of looks will be determined largely by the opportunity cost of multiple data analyses as well as by the need to conserve samples and costs. If samples are limited or expensive to assay, investigators may wish to perform multiple looks to minimize the average sample number. However, if neither sample conservation nor cost minimization is an overriding concern, then investigators may wish to perform only one interim analysis.
Acknowledgments
Supported by the Doris Duke Charitable Foundation (R.A.) and the Leonard and Madilyn Abramson Endowed Chair, National Institutes of Health grant R01CA85074 (T.R.R.).
Footnotes

Communicating editor: Z-B. Zeng
 Received September 27, 2002.
 Accepted December 9, 2002.
 Copyright © 2003 by the Genetics Society of America