Individual genome scans for quantitative trait loci (QTL) mapping often suffer from low statistical power and imprecise estimates of QTL location and effect. This lack of precision yields large confidence intervals for QTL location, which are problematic for subsequent fine mapping and positional cloning. In prioritizing areas for follow-up after an initial genome scan and in evaluating the credibility of apparent linkage signals, investigators typically examine the results of other genome scans of the same phenotype and informally update their beliefs about which linkage signals in their scan most merit confidence and follow-up via a subjective–intuitive integration approach. A method that acknowledges the wisdom of this general paradigm but formally borrows information from other scans to increase confidence in objectivity would be a benefit. We developed an empirical Bayes analytic method to integrate information from multiple genome scans. The linkage statistic obtained from a single genome scan study is updated by incorporating statistics from other genome scans as prior information. This technique does not require that all studies have an identical marker map or a common estimated QTL effect. The updated linkage statistic can then be used for the estimation of QTL location and effect. We evaluate the performance of our method by using extensive simulations based on actual marker spacing and allele frequencies from available data. Results indicate that the empirical Bayes method can account for between-study heterogeneity, estimate the QTL location and effect more precisely, and provide narrower confidence intervals than results from any single individual study. We also compared the empirical Bayes method with a method originally developed for meta-analysis (a closely related but distinct purpose). In the face of marked heterogeneity among studies, the empirical Bayes method outperforms the comparator.
Most genome scans for linkage in mapping quantitative trait loci (QTL) are analyzed without formal consideration of information provided by other genome scans of the same QTL. Investigators often evaluate scans other than their own when deciding which regions merit further investigation, but they have limited options for formally integrating the data. Individual genome scans have low power to detect QTL and provide imprecise estimates of their location and effect, especially when the effect is small. As a consequence, follow-up for fine mapping and positional cloning is problematic. When multiple studies of the same QTL have been conducted, an analysis method that can formally integrate data from multiple genome scan studies is emerging as a useful and powerful tool in the field of linkage analysis.
Although closely related, the method we offer should not be conflated with meta-analysis. Meta-analysis, which can be viewed as a set of statistical procedures designed to summarize statistics across independent studies that address similar scientific questions, is one way to use data from multiple genome scan studies. Only recently has meta-analysis been applied to studies evaluating linkage between human diseases and genetic markers (Hedges and Olkin 1985; Li and Rao 1996; Rice 1997; Allison and Heo 1998; Gu et al. 1998; Guerra et al. 1999; Wise et al. 1999; Etzel and Guerra 2002; Guerra 2002). Heterogeneity among multiple linkage studies poses many challenges in such analysis. Different studies can use different genetic markers and marker maps, different statistical methods to test for linkage, and different sampling schemes. Furthermore, the QTL effect can vary across studies because of disparate environmental effects and population substructures. The combination of raw data from each study with a well-designed preanalysis procedure could be a preferred approach to overcome such difficulties. However, in many situations this is not feasible and an analysis method that can account for the heterogeneity among studies is more desirable. Moreover, the key difference between meta-analysis and the approach we offer is the space of inference. In meta-analysis the space of inference is generally the superpopulation of all populations from which individual studies sampled their cases. In contrast, we adopt the inference perspective of the individual investigator who asks what the evidence is for linkage at a specific point in the population from which a sample is drawn.
Several meta-analysis methods have been proposed for detecting linkage between genetic markers and QTL. Generally, these methods can be classified into two categories. The first category involves combining test results without the estimation of parameters. A simple yet typical example is to combine P-values across studies (Fisher 1925). Suppose that we have $k$ P-values from independent studies, denoted by $p_i$ ($i = 1, \ldots, k$); Fisher (1925) showed that a linear combination of the natural log of these values, $-2\sum_{i=1}^{k} \ln p_i$, has a $\chi^2$-distribution with $2k$ d.f. The overall test of linkage can be performed with this statistic under the null hypothesis of no linkage in the region. Many researchers have applied this technique across multiple linkage studies (Allison and Heo 1998; Wise et al. 1999; Guerra 2002). For example, Allison and Heo (1998) used Fisher's method to show strong evidence of linkage in OB regions on the basis of five published studies on these regions. They showed that Fisher's method is applicable in situations of different marker maps, different statistical procedures, and different sampling schemes across different studies. However, it is very difficult to estimate the parameters of interest, such as the location and effect of a QTL, because of the method's nonparametric nature. In addition, Fisher's method is sensitive to the existence of a single study showing evidence for linkage and typically does not yield a balanced estimate of the average support for linkage among studies. The approach taken in GSMA bins results from multiple studies and then applies a Fisher-style meta-analysis (Wise et al. 1999). This approach allows the investigator the flexibility of combining data from multiple sources using different tests and different marker panels but loses precision in estimating specific locations for a genetic locus because of the binning procedure.
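As an illustration of Fisher's combined test described above, the following minimal Python sketch computes the statistic $-2\sum \ln p_i$ and its P-value. The function name `fisher_combined_pvalue` is hypothetical and is not taken from any of the cited studies; the sketch relies on the closed form of the $\chi^2$ survival function for even degrees of freedom, so it needs only the standard library.

```python
import math

def fisher_combined_pvalue(p_values):
    """Combine independent P-values with Fisher's method (illustrative helper)."""
    k = len(p_values)
    if k == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("need at least one P-value in (0, 1]")
    # Test statistic: -2 * sum(ln p_i); chi-square with 2k d.f. under the null.
    x = -2.0 * sum(math.log(p) for p in p_values)
    # For even d.f. 2k the chi-square survival function has a closed form:
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return x, math.exp(-half) * total

stat, p_combined = fisher_combined_pvalue([0.04, 0.20, 0.11])
print(stat, p_combined)
```

With a single study the combined P-value reduces to the input P-value, which is a convenient sanity check.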
Methods in the second category involve the estimation of parameters across studies, such as the location and the effect of a QTL (Li and Rao 1996; Gu et al. 1998; Etzel and Guerra 2002). Li and Rao (1996) used a random-effect model to derive an overall regression-coefficient estimate with standard error across studies from the Haseman–Elston regression (Haseman and Elston 1972). These estimates are then used to construct an overall test statistic for linkage between marker loci and QTL. Gu et al. (1998) proposed to use the proportion of shared identical-by-descent (IBD) as the common linkage effect across different linkage studies and developed a random-effect model to extract and estimate such effects. This enables the researchers to combine studies using different types of sib pairs. Etzel and Guerra (2002) developed a weighted least-squares estimator on the basis of regression coefficients from the Haseman–Elston test. In addition, the methods developed by Gu et al. and Li and Rao require the same marker maps across studies but the method of Etzel and Guerra relaxes this requirement.
In this article we propose an empirical Bayes (EB) method to combine the results from multiple linkage studies using sib pairs. This approach has the same spirit as the approach proposed by Bonney et al. (1992) but is different from meta-analysis methods. For the meta-analysis of multiple genome scan studies, each individual study is equivalent to the other studies used and investigators are interested in the overall results. For the EB approach of analyzing multiple genome scan studies, a genome scan study is identified as the primary study and the rest of the studies are considered to be background studies. The primary study of interest to an investigator would be the study in which the investigator was personally involved; presumably, the investigator would be able to obtain further genotypes for fine mapping from the individuals in the primary study but not necessarily the background studies. In the EB method, the estimates of regression coefficients and their corresponding standard errors from the Haseman–Elston regression method in the primary study and the background studies are obtained and used to estimate the prior distribution of these estimates. The estimates of regression coefficients of the primary study and their corresponding standard errors are then refined by incorporating their prior distribution with an EB approach. The updated estimates are then applied to estimate the location and effect of the QTL for the primary study. The proposed method does not require identical marker maps or identical QTL effects. We evaluate the performance of our method and compare it with the method developed by Etzel and Guerra (2002), using extensive simulations. We report the combined mean value, the standard error, and the confidence interval for QTL location and effect.
MATERIALS AND METHODS
Haseman–Elston regression analysis for a single-genome-scan study with m markers using sib pairs:
The Haseman–Elston regression can be used to detect linkage between genetic markers and QTL on the basis of sib pairs. Suppose that the trait values, the squared trait difference, and the estimated proportion of alleles shared IBD at a marker locus for the $j$th sib pair are denoted as $(x_{1j}, x_{2j})$, $Y_j = (x_{1j} - x_{2j})^2$, and $\pi_j$, respectively; the Haseman–Elston method is based on a simple regression of $Y_j$ on $\pi_j$:

$$Y_j = \alpha + \beta \pi_j + e_j.$$

The regression coefficient β has the expectation $-2(1 - 2\theta)^2 \sigma_g^2$, where θ is the recombination fraction between the marker locus and the QTL and $\sigma_g^2$ is the phenotypic variance explained by the effect of this QTL. Because the original Haseman–Elston regression tends to have lower power than the variance component method, other regression methods were subsequently developed (Amos 1994; Drigalenko 1998; Elston et al. 2000; Xu et al. 2000; Feingold 2002; Sham et al. 2002). For example, additional evidence of linkage could be obtained by regressing the mean corrected squared sums of trait values, $(x_{1j} + x_{2j} - 2\bar{x})^2$ (Drigalenko 1998), the mean corrected cross product of trait values, $(x_{1j} - \bar{x})(x_{2j} - \bar{x})$ (Drigalenko 1998; Elston et al. 2000), or a weighted combination of $(x_{1j} - x_{2j})^2$ and $(x_{1j} + x_{2j} - 2\bar{x})^2$ (Xu et al. 2000) on $\pi_j$, where $\bar{x}$ is the mean value of $x_{1j}$ and $x_{2j}$ over all sib pairs.
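The original Haseman–Elston regression is an ordinary least-squares fit of the squared sib-pair trait difference on the IBD proportion. The sketch below is a minimal illustration under that reading; the function name `haseman_elston` and the input layout (a list of trait pairs and a list of IBD proportions) are assumptions made for the example, and the slope variance is the usual OLS estimate rather than anything specific to this article.

```python
def haseman_elston(trait_pairs, ibd_props):
    """OLS regression of squared sib-pair trait differences on IBD sharing.

    Returns (intercept, slope, estimated sampling variance of the slope).
    """
    n = len(trait_pairs)
    if n < 3 or n != len(ibd_props):
        raise ValueError("need at least 3 sib pairs and matching IBD values")
    # Response: squared trait difference for each sib pair.
    y = [(x1 - x2) ** 2 for x1, x2 in trait_pairs]
    pi_bar = sum(ibd_props) / n
    y_bar = sum(y) / n
    sxx = sum((p - pi_bar) ** 2 for p in ibd_props)
    if sxx == 0.0:
        raise ValueError("IBD proportions must vary across sib pairs")
    sxy = sum((p - pi_bar) * (yy - y_bar) for p, yy in zip(ibd_props, y))
    beta = sxy / sxx
    alpha = y_bar - beta * pi_bar
    # Residual sum of squares gives the usual OLS variance of the slope.
    rss = sum((yy - alpha - beta * p) ** 2 for p, yy in zip(ibd_props, y))
    var_beta = rss / (n - 2) / sxx
    return alpha, beta, var_beta
```

A negative slope is the direction expected under linkage, because sib pairs sharing more alleles IBD should have more similar trait values.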
EB model (Bayesian hierarchical normal model):
In Bayesian analysis the choice of reasonable prior distribution for parameters is sometimes not obvious. However, if data from several independent studies are available, the prior information can be extracted from the data. Such approaches are called EB methods (Carlin and Louis 2000b) and can be viewed as an approximation to the complete hierarchical Bayesian analysis. An EB method is a hybrid approach between classical frequentist methods and fully Bayesian methods. Both parametric and nonparametric approaches exist (Carlin and Louis 2000a), but even the parametric varieties do not depend on strong distributional assumptions (Efron and Morris 1973).
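The EB approach as proposed by Efron and Morris (1973, 1975) can be described by a widely used two-level hierarchical normal model. Suppose β is the parameter of interest in the population and $k$ populations are available to estimate $\beta_i$ in each population, where $\beta_i$ can be different among populations. At the first level, the maximum-likelihood estimators ($\hat\beta_i$) for $\beta_i$ can be obtained and ensure that $\hat\beta_i$ asymptotically follows a normal distribution, $\hat\beta_i \mid \beta_i \sim N(\beta_i, V_i)$. At the second level, $\beta_i$ is specified by a normal model with an $r$-dimensional predictor $x_i$, a common regression coefficient μ, and an unknown variance $A$; i.e., $\beta_i \mid \mu, A \sim N(x_i'\mu, A)$. Using the Bayesian rule, it is easy to compute the marginal distribution of $\hat\beta_i$ (given μ and $A$) and conditional distribution of $\beta_i$ (given $\hat\beta_i$, μ, and $A$),

$$\hat\beta_i \mid \mu, A \sim N(x_i'\mu,\; V_i + A) \qquad (1)$$

and

$$\beta_i \mid \hat\beta_i, \mu, A \sim N\big((1 - B_i)\hat\beta_i + B_i x_i'\mu,\; V_i(1 - B_i)\big), \qquad (2)$$

where $B_i = V_i/(V_i + A)$ is an unknown shrinkage factor. Generally, $V_i$ is unknown and is replaced by $\hat V_i$, the estimate of the associated variance of $\hat\beta_i$. $A$ and μ can be estimated by the maximum-likelihood methods or by more advanced techniques developed by Tang and Morris (2002). Note that the estimates of $A$ and μ are not affected by the order of the data. Then we can use $\hat\beta_i^{*} = (1 - \hat B_i)\hat\beta_i + \hat B_i x_i'\hat\mu$ and $\hat V_i^{*} = \hat V_i(1 - \hat B_i)$ as the final estimator for $\beta_i$ and associated variance, respectively.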
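To make the two-level model concrete, the following Python sketch shrinks a set of study-level estimates toward a common mean. It is a simplified illustration only: it assumes an intercept-only second level (the predictor is the constant 1), estimates the common mean and the between-study variance by maximizing the marginal likelihood over a plain grid, and plugs the estimated sampling variances in as if known. It is not the GRIMM procedure of Tang and Morris (2002); the function name `eb_shrink` and the grid range (four times the sample variance of the estimates) are assumptions chosen for the example.

```python
import math

def eb_shrink(betas, variances, grid=2000):
    """Empirical Bayes shrinkage under a two-level normal model (sketch).

    betas: study-level estimates; variances: their sampling variances.
    Returns (mu_hat, A_hat, [(posterior mean, posterior variance), ...]).
    """
    k = len(betas)
    if k < 2 or k != len(variances) or min(variances) <= 0:
        raise ValueError("need >= 2 estimates with positive variances")

    def profile(a):
        # For fixed between-study variance a, the ML estimate of mu is the
        # precision-weighted mean; return the profile log-likelihood and mu.
        w = [1.0 / (v + a) for v in variances]
        mu = sum(wi * b for wi, b in zip(w, betas)) / sum(w)
        ll = -0.5 * sum(math.log(v + a) + (b - mu) ** 2 / (v + a)
                        for b, v in zip(betas, variances))
        return ll, mu

    mean_b = sum(betas) / k
    spread = sum((b - mean_b) ** 2 for b in betas) / (k - 1)
    upper = 4.0 * spread  # heuristic search range for A (an assumption)
    best_ll, best_a, best_mu = None, 0.0, mean_b
    for step in range(grid + 1):
        a = upper * step / grid
        ll, mu = profile(a)
        if best_ll is None or ll > best_ll:
            best_ll, best_a, best_mu = ll, a, mu

    shrunk = []
    for b, v in zip(betas, variances):
        shrink = v / (v + best_a)  # shrinkage factor B_i = V_i / (V_i + A)
        shrunk.append(((1.0 - shrink) * b + shrink * best_mu,
                       v * (1.0 - shrink)))
    return best_mu, best_a, shrunk
```

Estimates with large sampling variance are pulled more strongly toward the common mean, and every posterior variance is no larger than the corresponding sampling variance.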
Application of the EB method to each marker based on k studies with m markers and an identical marker map:
EB methods have been used in many contexts, including genetic research (Bonney et al. 1992; Li and Rao 1996; Witte 1997; Lockwood et al. 2001). We tailored the general EB procedure for linkage analysis. Assume that data for the detection of linkage to the same QTL are available from $k$ genome scans using sib pairs. Within each of the $k$ studies, a set of $m$ markers with the identical map is used within the same chromosomal segment. For each marker locus from each study, the regression coefficient, $\beta_i$, is the parameter of interest and describes the effect of the putative QTL on the phenotype. From the Haseman–Elston regression analysis we can obtain the estimator $\hat\beta_i$ for $\beta_i$ and its sampling variance $\hat V_i$. For $k$ available studies, all studies are first used to estimate parameters μ and $A$, and then the EB estimators $\hat\beta_i^{*}$ and $\hat V_i^{*}$ for $\beta_i$ and associated variance can be easily obtained using Equations 1 and 2 for each of $k$ studies. At this stage no distinction exists between the primary study and the background studies because there is an EB estimate of β and $V$ for each marker locus in each study. One of the studies is then designated as the primary study according to the investigator's interest and the inference is made on the basis of the primary study.
Maximum-likelihood estimates of QTL location and effect based on m markers from a single-genome-scan study using sib pairs:
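Suppose there are $m$ markers in a single-genome-scan study using sib pairs and the location of the $j$th marker, $d_j$, is already known; our new method for estimating the location and effect of QTL is completely based on $\hat\beta_j$ and $\hat V_j$ according to the normal approximation. Haseman and Elston (1972) demonstrated that the expectation of $\hat\beta_j$ is $-2(1 - 2\theta_j)^2\sigma_g^2$, where $\theta_j$ is the recombination fraction between the $j$th marker and the QTL, which can be obtained through a map function and the location of the QTL and the marker locus (Malley et al. 2002), and $\sigma_g^2$ is the variance explained by the QTL. The correlation coefficient of $\hat\beta_i$ and $\hat\beta_j$ is $(1 - 2\theta_{ij})^2$, where $\theta_{ij}$ is the recombination fraction between markers $i$ and $j$ and, again, can be obtained through the map function and the marker locations. If we denote the location of QTL as $d$ and use Haldane's mapping function (with distances in morgans), then $\theta_j = \tfrac{1}{2}(1 - e^{-2|d - d_j|})$ and $\theta_{ij} = \tfrac{1}{2}(1 - e^{-2|d_i - d_j|})$. Thus, we consider the coefficients as observations from a multivariate normal distribution according to the large sample theory,

$$\hat{\boldsymbol\beta} = (\hat\beta_1, \ldots, \hat\beta_m)' \sim N(\boldsymbol\mu, \boldsymbol\Sigma),$$

where $\mu_j = -2e^{-4|d - d_j|}\sigma_g^2$ and $\Sigma_{ij} = \sqrt{V_i V_j}\, e^{-4|d_i - d_j|}$. The variance of $\hat\beta_j$ in the above formula is replaced by the estimated variance from the Haseman–Elston regression. The maximum-likelihood estimate (MLE) method can be used to get the point estimates of QTL location and effect and their confidence intervals. The log-likelihood function for $d$, the location of QTL, and $\sigma_g^2$, the variance of QTL, is

$$\ell(d, \sigma_g^2) = -\tfrac{1}{2}\ln|\boldsymbol\Sigma| - \tfrac{1}{2}(\hat{\boldsymbol\beta} - \boldsymbol\mu)'\boldsymbol\Sigma^{-1}(\hat{\boldsymbol\beta} - \boldsymbol\mu) + \text{const}, \qquad (3)$$

where $\mu_j = -2e^{-4|d - d_j|}\sigma_g^2$ and $\Sigma_{ij} = \sqrt{\hat V_i \hat V_j}\, e^{-4|d_i - d_j|}$, using Haldane's mapping function. Obviously, only two unknown parameters, $d$ and $\sigma_g^2$, need to be estimated.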
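The likelihood-based estimate can be sketched in a few lines of Python. The sketch assumes the multivariate normal model with mean $-2e^{-4|d - d_j|}\sigma_g^2$ and covariance $\sqrt{V_iV_j}\,e^{-4|d_i - d_j|}$ under Haldane's map function, with marker positions supplied in cM and converted to morgans. Because the covariance then does not depend on the unknown parameters, maximizing the likelihood is equivalent to minimizing the quadratic form, and for a fixed location the QTL variance has a closed-form solution; the location is found by a simple grid search. The names `mle_qtl`, `cholesky`, and `solve_spd`, and the 1-cM grid, are choices made for this example, not the authors' implementation.

```python
import math

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][q] * low[j][q] for q in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low

def solve_spd(low, b):
    """Solve (low * low') x = b by forward and back substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][q] * y[q] for q in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[q][i] * x[q] for q in range(i + 1, n))) / low[i][i]
    return x

def mle_qtl(betas, variances, marker_cm, step_cm=1.0):
    """Grid-search ML estimate of (QTL location in cM, QTL variance)."""
    m = len(betas)
    pos = [c / 100.0 for c in marker_cm]  # cM -> morgans
    sigma = [[math.sqrt(variances[i] * variances[j])
              * math.exp(-4.0 * abs(pos[i] - pos[j]))
              for j in range(m)] for i in range(m)]
    low = cholesky(sigma)
    sinv_beta = solve_spd(low, betas)
    bb = sum(b * s for b, s in zip(betas, sinv_beta))  # beta' S^-1 beta
    start = min(marker_cm)
    n_steps = int(round((max(marker_cm) - start) / step_cm))
    best = None
    for s in range(n_steps + 1):
        d_cm = start + s * step_cm
        c = [math.exp(-4.0 * abs(d_cm / 100.0 - p)) for p in pos]
        sinv_c = solve_spd(low, c)
        cb = sum(ci * si for ci, si in zip(c, sinv_beta))
        cc = sum(ci * si for ci, si in zip(c, sinv_c))
        # Closed-form maximizer for the QTL variance at this location, >= 0.
        var_g = max(0.0, -cb / (2.0 * cc))
        # Quadratic form (beta - mu)' S^-1 (beta - mu) with mu = -2 var_g c.
        q = bb + 4.0 * var_g * cb + 4.0 * var_g ** 2 * cc
        if best is None or q < best[0]:
            best = (q, d_cm, var_g)
    return best[1], best[2]
```

A finer grid or a numerical optimizer could replace the 1-cM search; the grid is used here only because it is easy to verify.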
Interval mapping estimates of QTL location and effect based on m markers from a single-genome-scan study using sib pairs:
Here we briefly recall the interval mapping (IM) method developed by Fulker et al. (1995) for estimating QTL location and effect. For the IM method, the proportion of alleles shared IBD at each analysis point along the chromosome is estimated by using the proportion of IBD sharing from all genotype marker loci at the same chromosome. Then $\hat\beta(d)$ and its associated variance $\hat V(d)$ are obtained from the Haseman–Elston regression at the analysis point $d$. The analysis point, $\hat d$, that gives the minimum test statistic is taken as the location of QTL. The point estimate of QTL effect, $\hat\sigma_g^2$, is given by $-\hat\beta(\hat d)/2$.
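Once the regression coefficient and its variance are available at every analysis point, the IM estimate is a simple scan. The sketch below assumes that the test statistic is the ratio of the coefficient to its standard error (most negative value indicating the strongest linkage signal) and that the QTL variance is recovered as minus one-half of the coefficient at the selected point, which follows from the expectation of the Haseman–Elston slope at zero recombination. The function name `im_estimate` is hypothetical.

```python
import math

def im_estimate(points_cm, betas, variances):
    """Pick the analysis point with the minimum t-statistic (sketch).

    Returns (estimated location in cM, estimated QTL variance, t-statistic).
    """
    if not (len(points_cm) == len(betas) == len(variances)) or not betas:
        raise ValueError("inputs must be non-empty and of equal length")
    # Assumed test statistic: coefficient divided by its standard error.
    t_stats = [b / math.sqrt(v) for b, v in zip(betas, variances)]
    idx = min(range(len(t_stats)), key=t_stats.__getitem__)
    # At the QTL itself E(beta) = -2 * sigma_g^2, so sigma_g^2 = -beta / 2.
    return points_cm[idx], -betas[idx] / 2.0, t_stats[idx]
```

Passing EB-updated coefficients and variances for the primary study to the same function gives the corresponding EB version of the scan.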
MLE–EB estimates of QTL location and effect based on m markers and k genome-scan studies using sib pairs:
Suppose that data for detecting linkage to the same QTL are available from $k$ genome scans using sib pairs and consider the first study as the primary study. Within each of the $k$ studies, a set of $m$ markers is used within the same chromosomal region and is denoted as $M_1, \ldots, M_m$. This method requires that all studies have the same marker map. In the first step the coefficients and their associated variances, $\hat\beta_{ij}$ and $\hat V_{ij}$ ($i = 1, \ldots, k$; $j = 1, \ldots, m$), are estimated at each marker locus from each of the $k$ studies using the Haseman–Elston regression analysis. In the second step the EB estimates, $\hat\beta_{ij}^{*}$ and $\hat V_{ij}^{*}$ ($i = 1, \ldots, k$) at the $j$th marker $M_j$, are obtained on the basis of $\hat\beta_{ij}$ and $\hat V_{ij}$ ($i = 1, \ldots, k$), using the Gaussian regression independent multilevel model (GRIMM) (Tang and Morris 2002). GRIMM is independently applied to each of $m$ markers to obtain the corresponding EB estimates for each study. In the third step, the location and effect of QTL for the primary study are estimated by using $(\hat\beta_{1j}, \hat V_{1j})$ (MLE) and $(\hat\beta_{1j}^{*}, \hat V_{1j}^{*})$ (MLE–EB) with the MLE method.
IM–EB estimates of QTL location and effect based on m markers and k genome-scan studies using sib pairs:
Suppose that data for detecting linkage to the same QTL are available from $k$ genome scans using sib pairs and consider the first study as the primary study. Within each of the $k$ studies, a set of $m$ markers is used within the same chromosomal region and is denoted as $M_1, \ldots, M_m$. Note that this method does not require that all studies have the same marker map because it is tied to the analysis point on the chromosome, which does not have to be at the location of a marker. In step 1 the coefficients and their associated variances, $\hat\beta_i(d)$ and $\hat V_i(d)$ ($i = 1, \ldots, k$) at each analysis point $d$ on the chromosome, are estimated using the Haseman–Elston regression from each of the $k$ studies (Fulker et al. 1995). In step 2, $\hat\beta_i^{*}(d)$ and $\hat V_i^{*}(d)$ ($i = 1, \ldots, k$) are obtained from each of the $k$ studies by using GRIMM (Tang and Morris 2002). As before, GRIMM is independently applied to each analysis point along the chromosome for each study. In step 3 the test statistic $t(d)$ for the primary study is calculated on the basis of $\hat\beta_1(d)$ and $\hat V_1(d)$ at the analysis point $d$. In step 4 the analysis point $\hat d$ having a minimum $t$-value over the entire chromosome region is considered the IM estimate of the QTL location and, consequently, the IM estimate of $\sigma_g^2$ is given by $-\hat\beta_1(\hat d)/2$. In step 5 the same procedure can be applied to $\hat\beta_1^{*}(d)$ and $\hat V_1^{*}(d)$ to obtain $\hat d^{*}$ and $\hat\sigma_g^{2*}$, the IM–EB estimates of QTL location and effect, respectively.
Note that the above procedures along with the procedures used in the MLE–EB method distinguish our EB method from meta-analysis methods. For meta-analysis methods a set of genome-scan studies is analyzed in an aggregate manner and the overall results are obtained. For example, Etzel and Guerra (2002) proposed a weighted least-squares estimator method to estimate β and its variance $V$ on the basis of $\hat\beta_i$ and $\hat V_i$ ($i = 1, \ldots, k$) and made inference solely on the basis of such estimates. For their method any individual study is no more important than the other studies. In contrast, the EB method tries to obtain the prior information using all data. This prior information is then incorporated into each study to help estimation and inference. For each study we obtain the estimates of β and $V$ and draw conclusions on the basis of them. In this situation the individual study that is used to draw conclusions is considered the primary study whereas the other studies are considered the background studies.
To investigate the performance of EB methods using multiple linkage studies, we simulated data sets as follows. We assumed that there was only one QTL with no background polygenic variation and no shared sib environment effect or, equivalently, that such effects were subsumed into the residual variance. There were two alleles at the QTL with the high-risk allele having a frequency of 0.05. We chose the markers on chromosome 11 that were used for a recent genome scan of Alzheimer's disease (Blacker et al. 2003) because it provides known parameters for simulation. Fifteen microsatellite markers span a distance of 146 cM. The location of the QTL was set either at 35 or at 65 cM from the p terminus of the chromosome. The marker locations along with the QTL location are shown in Figure 1. Phenotype data for each individual were generated on the basis of the genetic model , where μ is the overall trait mean across the population, is the additive effect of the high-risk allele, and is the normally distributed random error. Furthermore, we assumed that and and set μ = 70 for all studies.
For each iteration of the simulation, 5, 10, or 15 studies were simulated corresponding to a single study of interest with 4, 9, or 14 background studies, respectively. Each study consisted of 1000 independent sib pairs. For the primary study we fixed the additive variance of the QTL to 0.30, the total variance to 1.00, and the location of the QTL to 65 cM from the p terminus of the chromosome and used the actual marker map in Figure 1. For background studies the location of the QTL, the effect of the QTL, and the marker map could be the same as or different from those of the primary study. The detailed settings are presented in Table 1. We used abbreviations to represent a specific simulation setting. For example, L1E1M1 denotes that all studies have the same QTL location, the same QTL effect, and the same marker map. It is worth emphasizing the difference between E2 and E3. For E3 the additive variance of the QTL is the same but the environmental variance varies, and thus the proportion of variance explained by the QTL varies across background studies (Etzel and Guerra 2002). For E2 the proportion of variance explained by the QTL also varies across background studies, but this variability is due to the variability of the additive variance of the QTL across background studies. Even for studies with one QTL, this situation can occur because the allele frequency of the underlying QTL may be different because of ascertainment bias or population substructures across background studies. Similar differences also exist between E4 and E5. Such simulation experiments also reflect some realistic situations in practical studies. For E1 all background studies are considered to have the same significance as the primary study. For E3 half of the background studies have weaker signals and half have stronger signals than the primary study.
Once the genotype and phenotype data were generated, we used the methods listed in Table 2 to obtain the estimates for the location and effect of QTL. The Haseman–Elston regression coefficients at each marker or analysis point were determined by regression of a weighted combination of the squared trait difference and the mean corrected squared trait sum (Xu et al. 2000) on π, the proportion of alleles shared IBD (Drigalenko 1998). For the MLE method, the method outlined by Haseman and Elston (1972) was used to calculate π at a single marker locus. For the IM method the calculation of π at each analysis point was through the interpolation scheme of Fulker et al. (1995). The analysis points in some methods were placed 1.0 cM apart and the length of window was set as 20 cM. This ensures that at least one marker locus for each study is within the window of each analysis point. We repeated this procedure 1000 times to get mean values and standard errors of the estimates. To obtain their confidence intervals, we adopted a bootstrap procedure as follows:
1. Denote the point estimates of a parameter of interest for each of 1000 iterations as $\hat\phi_i$ ($i = 1, \ldots, 1000$).
2. For the primary and the background studies in each iteration, sample the same number of sib pairs with replacement from the original data set and then obtain the estimates for the same parameter using the same method.
3. Repeat the above procedure 400 times for each iteration, thereby obtaining 400 additional point estimates in each iteration, denoted as $\hat\phi_{i,1}, \ldots, \hat\phi_{i,400}$.
4. Assign the α-confidence interval, the simplest way being to sort $\hat\phi_{i,1}$ to $\hat\phi_{i,400}$ in ascending order and assign the α-confidence interval as $[\hat\phi_{i,(400(1-\alpha)/2)}, \hat\phi_{i,(400(1+\alpha)/2)}]$ for iteration $i$, where the subscript in parentheses is the rank in the sorted list. The confidence interval obtained by this method is referred to as the bootstrap confidence interval in the rest of this article. However, these confidence intervals may not be suitable for comparisons of different methods because of different coverage probability. Therefore, we proposed the following procedure to force the α-confidence interval to have exact α-coverage probability:
5. For iteration $i$, compute the mean and its standard deviation, $\bar\phi_i$ and $s_i$ of $\hat\phi_{i,1}, \ldots, \hat\phi_{i,400}$, respectively, and assign the confidence interval as $[\bar\phi_i - \lambda s_i, \bar\phi_i + \lambda s_i]$. We chose λ to enable the proportion α of intervals to contain the true value of parameters. Correspondingly, this kind of confidence interval is referred to as the quantile confidence interval.
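The two interval constructions in the bootstrap procedure above can be sketched as follows. The sketch assumes that the bootstrap interval is read off the sorted replicates at ranks $n(1-\alpha)/2$ and $n(1+\alpha)/2$, and that the quantile interval has the form mean ± λ × standard deviation, with λ chosen as the smallest multiplier for which at least a proportion α of the simulated intervals cover the true value. Both function names are hypothetical, and the symmetric form of the calibrated interval is an inference from the text rather than a documented detail.

```python
import math
import statistics

def percentile_interval(replicates, alpha=0.95):
    """Bootstrap interval from sorted replicates (ranks are 1-based)."""
    s = sorted(replicates)
    n = len(s)
    lo_rank = max(int(round(n * (1.0 - alpha) / 2.0)), 1)
    hi_rank = min(int(round(n * (1.0 + alpha) / 2.0)), n)
    return s[lo_rank - 1], s[hi_rank - 1]

def calibrate_lambda(replicate_sets, true_value, alpha=0.95):
    """Smallest lambda such that mean +/- lambda * sd covers the true value
    in at least a proportion alpha of the iterations.

    Each replicate set needs at least two distinct values (nonzero sd).
    """
    # An interval covers the truth exactly when |truth - mean| / sd <= lambda.
    ratios = sorted(abs(true_value - statistics.mean(r)) / statistics.stdev(r)
                    for r in replicate_sets)
    rank = max(math.ceil(alpha * len(ratios)), 1)
    return ratios[rank - 1]
```

Because λ is calibrated against the known true value, this second construction is available only in simulation, where it serves to compare interval lengths across methods at equal coverage.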
For each method investigated in our simulation, we assessed performance according to the summary statistics listed below. For the point estimates of each parameter, we calculated their mean value, their standard error (SE), and the square root of the mean squared difference between the estimates and the true value (MSE). For the bootstrap and quantile confidence intervals determined in steps 4 and 5, we recorded the coverage probability, the mean, and the standard error of their lengths, respectively.
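The summary statistics named above are straightforward to compute from the simulation output. The following sketch assumes that the reported SE is the sample standard deviation of the point estimates across iterations and that "MSE" denotes the square root of the mean squared difference from the true value, as stated in the text; the function names and the representation of intervals as (lower, upper) pairs are choices made for the example.

```python
import math
import statistics

def summarize_estimates(estimates, true_value):
    """Return (mean, SE, root mean squared error) of point estimates."""
    mean = statistics.mean(estimates)
    se = statistics.stdev(estimates)  # spread across simulation iterations
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates)
                     / len(estimates))
    return mean, se, rmse

def summarize_intervals(intervals, true_value):
    """Return (coverage probability, mean length, SE of length)."""
    covered = sum(1 for lo, hi in intervals if lo <= true_value <= hi)
    lengths = [hi - lo for lo, hi in intervals]
    return (covered / len(intervals),
            statistics.mean(lengths),
            statistics.stdev(lengths))
```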
Number of background studies:
The box plots for the point estimates and the length of the 95% quantile confidence interval of QTL location and effect with 5, 10, and 15 studies are shown in Figures 2 and 3. Because the results from MLE and IM and the results from MLE–EB and IM–EB are similar, only the results based on IM, IM–EB, and the weighted least-squares estimator (WLSE) method are shown in Figures 2 and 3. The results based on all methods can be found in supplemental Figures 1 and 2 (http://www.genetics.org/supplemental/). The results in Figure 2 were obtained with L1E1M1 (all studies have the identical QTL location, QTL effect, and marker map), which represents the minimum heterogeneity or the best case across studies. The results in Figure 3 were obtained with L2E4M2 (studies have different QTL locations, QTL effects, and marker maps), which represents the maximum heterogeneity or one of the worst situations across studies. As expected, the EB methods using multiple studies estimate the location and the additive variance of QTL more precisely with a smaller MSE than does an individual study. Such an improvement becomes more notable with more independent studies included in the analysis under most situations that we simulated. Here we compared the performance of two methods—IM and its EB version (IM–EB)—in detail. For L1E1M1 the estimates of QTL location and effect using IM–EB are closer to true values and have smaller MSEs and tighter 95% confidence intervals than those obtained using IM. With 14 background studies, the MSE of estimates for QTL location and effect and the average length of 95% confidence interval of estimates for QTL location and effect using IM–EB can be reduced up to 64, 43, 41, and 24%, respectively. For L2E4M2 the estimates of QTL location and effect by EB-based methods are biased. 
The EB estimates for QTL location are shifted to the left of the chromosome because half of the background studies have a QTL positioned at 35 cM from the p terminus of the chromosome, which is 30 cM proximal to the location of the QTL in the main study. The QTL effect is underestimated because the additive variance of the QTL in background studies is uniformly distributed in [0.05, 0.15], which will drag the estimates for additive variance of QTL from 0.30 in the primary study toward 0, especially when more background studies are involved. However, the IM–EB estimates still supply a smaller MSE and a tighter confidence interval than do the IM estimates. Such improvements are not as significant as those obtained with L1E1M1, and the increase in the number of background studies apparently does not improve the results significantly. The MSE of estimates for QTL location is reduced 0.5, 4, and 9% with 4, 9, and 14 background studies, respectively.
Location of QTL:
Figure 4 summarizes the point estimates of QTL location and effect using IM and IM–EB based on L1E1M1 and L2E1M1 with 4, 9, and 14 background studies. The results based on all four methods (MLE, MLE–EB, IM, and IM–EB) are shown in supplemental Figure 3 (http://www.genetics.org/supplemental/). The presence of other QTL in background studies significantly affects the precision of estimates for QTL location using the EB-based methods but only slightly changes the estimates for QTL effect using these methods. First, the estimates for QTL location from the primary study (MLE and IM) are close to the true QTL location whereas the estimates for QTL from the EB-based methods (MLE–EB and IM–EB) are shifted to the left of the chromosome, especially when more background studies are included in the analysis (because half of the background studies have one QTL positioned 35 cM from the p terminus of the chromosome). Second, the MSE of estimates for QTL location is reduced 54% using MLE–EB on the basis of L1E1M1 with 9 background studies whereas this is reduced only 31% on the basis of L2E1M1 with 9 background studies. Third, as we noted in the previous section, the EB-based methods yield more precise estimates and narrower confidence intervals for QTL location with more background studies on the basis of L1E1M1 whereas such improvements are not significant on the basis of L2E1M1. Apparently, the estimates for QTL location obtained from MLE–EB with 10 studies are better than those obtained with 5 studies but are similar to those obtained with 15 studies. Similar patterns can be found for their confidence intervals. Fourth, the presence of other QTL in background studies does not severely affect the estimation of QTL effect: the EB-based methods using multiple studies always provide more precise estimates for QTL effect and supply narrower confidence intervals than do methods using a single study. In addition, similar estimates of QTL effect were obtained for L1E1M1 and L2E1M1.
Effect of QTL:
Figure 5 presents the box plots of point estimates and length of 95% quantile confidence intervals for QTL location and effect with 5, 10, and 15 studies using MLE and MLE–EB methods with varied QTL effect, one QTL (L1), and the fixed marker map (M1). Because the results based on E1, E2, and E3 are similar, only the results from E1, E4, and E5 are shown; the results based on E1–E5 are shown in supplemental Figure 4 (http://www.genetics.org/supplemental/). The heterogeneity of QTL effect among background studies affects the precision of estimates for both QTL location and QTL effect. The point estimates for QTL location are unbiased. The EB-based methods significantly improve the estimates of QTL location and narrow its confidence intervals when L1E1M1, L1E2M1, and L1E3M1 are used (please see supplemental Figure 4). The MSE of MLE–EB estimates for QTL location are reduced 55, 48, and 66% with 4, 9, and 14 background studies using L1E1M1, respectively. Such improvements become subtle when L1E4M1 and L1E5M1 are used: the MSE is reduced only up to 20 and 4%, respectively, using MLE–EB with 9 background studies. The QTL effect is underestimated when L1E4M1 is used in the simulation because the additive variance of QTL in background studies is uniformly distributed in [0.05, 0.15], which biases the estimates of additive variance of the QTL from 0.30 in the primary study toward 0. This phenomenon is not observed for L1E5M1 because the additive variance of QTL is fixed at 0.30 although the proportion of variance explained by the QTL is uniformly distributed in [0.05, 0.15]. Similarly to the estimates for QTL location, the EB-based methods significantly improve estimates for QTL effect and narrow its confidence intervals when L1E1M1, L1E2M1, and L1E3M1 are used in the simulation whereas such improvements are minor when L1E4M1 and L1E5M1 are used.
Figure 6 summarizes the point estimates for QTL location and effect for IM and IM–EB methods based on L1E1M1 and L1E1M2. The heterogeneity among marker maps in background studies does not drastically change the performance of EB-based methods, especially for estimates for QTL effect. In a few cases the random-marker map can slightly enhance the performance of EB-based methods. Similar patterns can be observed for WLSE and for the estimates of QTL effect.
Comparison of EB-based methods with the WLSE method:
Table 3 shows the mean and MSE of estimates for QTL location and effect using MLE–EB, IM–EB, and WLSE with 10 studies and different simulation strategies. With a few exceptions, when one method gives a smaller MSE of estimates than another method, it also provides tighter confidence intervals. Therefore we used the MSE of estimates as a criterion for comparison. In Table 3 the numbers in italics indicate the smallest MSE among three methods. No method is better than the others under all situations. In general, when L1 (all studies have a QTL with the same location) and E1, E2, or E3 were used in the simulation, WLSE has the smallest MSE of estimates for QTL location. When L2 or E4 and E5 are used in the simulation, MLE–EB or IM–EB has the smallest MSE of estimates for QTL location. For the estimation of QTL effect, a similar pattern can be found.
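The MSE criterion and the percentage reductions quoted for the EB-based methods can be illustrated with a minimal Python sketch. The function names and the replicate values below are hypothetical and chosen only for illustration; they are not taken from the simulations reported here.

```python
from statistics import fmean


def mse(estimates, true_value):
    """Mean squared error of replicate estimates around the simulated true value."""
    return fmean((e - true_value) ** 2 for e in estimates)


def percent_reduction(mse_single, mse_eb):
    """Percentage by which the EB-based MSE falls below the single-study MSE."""
    return 100.0 * (mse_single - mse_eb) / mse_single


# Hypothetical replicate estimates of QTL location (cM); assumed true location 50 cM.
single_study = [42.0, 58.0, 47.0, 55.0, 50.0]
eb_updated = [46.0, 54.0, 49.0, 52.0, 50.0]

reduction = percent_reduction(mse(single_study, 50.0), mse(eb_updated, 50.0))
print(f"MSE reduced by {reduction:.1f}%")
```

Because the MSE combines squared bias and variance, a method that is biased (as with E4) can have a larger MSE even when its estimates are less variable.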
Novel methods are needed to improve the precision of estimates when multiple genome scan studies are available for mapping the same QTL. To address this issue, we propose several EB methods to estimate the location and effect of a QTL from multiple genome scans using sib pairs. Our method does not mandate that all studies have an identical marker map, the same kinds of markers, the same QTL location, or the same QTL effect. Furthermore, we assessed the performance of the method using extensive simulations and showed that this method is useful in detecting linkage between marker loci and QTL because it provides more precise estimates of QTL location and effect and supplies narrower confidence intervals than do methods using an individual study.
We used the Fulker–Cardon approximation for IM, which has been used extensively for the study of linkage in large pedigrees because the exact computation of IBD sharing is not feasible in this context. For smaller families such as those studied here, the Fulker–Cardon approximation loses a small proportion of informativeness (Amos et al. 1997) compared with exact calculation using the Lander–Green algorithm (Kruglyak and Lander 1995). For our comparison, computational speed is critical and the slight decrease in information caused by using the Fulker–Cardon approximation should not affect our conclusions.
The heterogeneity among studies is a significant obstacle to designing effective meta-analysis methods. We assessed the effect of heterogeneity through extensive simulations based not only on different marker maps but also on varied QTL location and effect. Our EB method was quite robust under all simulated circumstances. That is, it yielded more precise estimates and narrower confidence intervals. However, the influence of these factors varied. In general, varied marker maps across studies had less effect on estimates than did the varied locations and effects of a QTL. Different marker maps across background studies only slightly increased the MSE of estimates and the length of their 95% confidence intervals compared with results based on an identical marker map in all studies using the same EB-based method. In a few cases varied marker maps could be helpful, providing more precise estimates and narrower confidence intervals. This property is very useful for combining all available studies. Distinct marker maps can be used in different studies, and even when all studies use an identical set of markers, the actual genetic distance between markers can vary because of other causes, such as different subpopulations.
We found different degrees of heterogeneity for QTL effect among studies and this influence ranged from small to large. Our simulation strategies addressed some of the more complicated situations. We not only assumed a common additive variance accounted for by the QTL but also assumed that the additive variance of the QTL itself could vary across different studies, which could be a result of different ascertainment schemes (e.g., extreme sampling with different cutoff thresholds can result in different estimations). We found that if the QTL effect varied but the average QTL effect among background studies was the same as for the primary study, the MSE of estimates and the average length of their confidence intervals obtained using an EB-based method were only slightly larger than those based on the same QTL effect across all studies using the same EB-based method. These estimates are significantly less than those based on an individual study. If the average QTL effect among background studies was far different from the QTL effect in the primary study, the MSE of estimates and the average length of confidence intervals obtained using EB-based methods were only slightly less than those based on an individual study.
The presence of other QTL in background studies drastically affected the estimates. The MSE of estimates and the average length of their 95% confidence intervals based on EB-based methods were only slightly less than those based on an individual study. The bias in location and effect size introduced by the presence of multiple loci influencing the trait levels may be resolved by identifying effects from separate loci if enough data are available. Thus, pooling of data across studies may allow the investigator to obtain enough information to reduce or remove bias from multiple loci if the effects from the separate loci can be resolved, perhaps by noting a substantial decrease in log odds ratio between the two separate peaks corresponding to the two separate loci.
We also assessed the effect of the number of background studies for EB-based methods. When only a small or moderate amount of heterogeneity existed in background studies (e.g., simulations with L1 and E1, E2, or E3), more background studies always yielded more precise estimates and narrower confidence intervals. When there was a larger amount of heterogeneity among background studies (e.g., simulations with L2, E4, or E5), the increased number of background studies could have no benefit. A possible reason may be that the information gained from multiple studies was overwhelmed by the noise introduced by them. More efficient methods clearly need to be developed under such situations. On the other hand, study heterogeneity can be explored before such an analysis (Guerra 2002) and if a large amount of heterogeneity is detected among studies, it may not be useful to perform the analysis.
We compared the performance of several EB-based methods and WLSE (Etzel and Guerra 2002) for meta-analysis. No method was superior to any other under all simulation strategies. It is important when doing such comparisons to understand that the current technique is not a meta-analysis. A meta-analysis is a technique to combine results from several studies of the same relationship to obtain a single summary estimate of that relationship. In such an analysis, the results of the studies are combined with equal regard, weighted by their relative precision. In the EB analysis, one study is of primary interest whereas the rest are regarded as background studies. Such differences between the meta-analysis methods (e.g., WLSE) and the EB-based methods (e.g., IM–EB) can to some degree explain why, in general, WLSE performed better in simulations with L1 and E1, E2, or E3 whereas IM–EB was generally more stable and performed better in simulations with L2, E4, or E5. In addition, the purpose of the EB analysis is to use the information gleaned from background studies to improve our decision about where in the genome to focus a fine-mapping study. A fine-mapping study will require further genotyping, which can be performed only on the DNA samples available to the investigator. Thus, the samples available to the investigator must be regarded as fundamentally different from the samples used in the background studies. In this situation, an EB-based method would be preferred. If the samples are currently unavailable to the investigator and the investigator wants to initiate a fine-mapping study based on other publicly available studies, the meta-analysis would be a better choice.
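The idea of combining studies "with equal regard, weighted by their relative precision" can be sketched as a generic fixed-effect inverse-variance pool. This is an illustration of the principle only; it is not the WLSE procedure of Etzel and Guerra (2002), and the function name and input values are hypothetical.

```python
def inverse_variance_pool(estimates, variances):
    """Pool per-study estimates, weighting each by the reciprocal of its variance.

    Returns the pooled estimate and its variance. No study is treated as
    primary: every study enters symmetrically, unlike in the EB analysis.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, 1.0 / total


# Hypothetical regression slopes and their variances from three studies.
slopes = [-0.30, -0.20, -0.25]
slope_vars = [0.01, 0.04, 0.02]

pooled_slope, pooled_var = inverse_variance_pool(slopes, slope_vars)
print(pooled_slope, pooled_var)
```

The pooled value is a single summary for all studies, which is why it suits the case in which no one sample is singled out for follow-up genotyping.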
In this article we showed the results based on the quantile confidence intervals. The quantile confidence intervals ensure that the 95% confidence intervals have exact 95% coverage and enable the appropriate comparisons among different methods. We also computed coverage probability and the average lengths of the bootstrap confidence intervals on the basis of different simulation settings for IM and IM–EB (see supplemental Table 1 at http://www.genetics.org/supplemental/). Generally, the 95% confidence intervals were slightly overestimated (i.e., they had coverage probabilities slightly larger than 95%). However, IM–EB always had tighter confidence intervals than IM.
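A quantile interval of this kind can be sketched as follows, assuming that it is taken as the empirical 2.5th and 97.5th percentiles of the estimates across simulation replicates. The helper names and the linear-interpolation rule are illustrative choices, not details taken from our implementation.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, with 0 <= q <= 1."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def quantile_interval(estimates, level=0.95):
    """Central interval that contains the given share of the replicate estimates."""
    ordered = sorted(estimates)
    tail = (1.0 - level) / 2.0
    return quantile(ordered, tail), quantile(ordered, 1.0 - tail)


# Hypothetical replicate estimates of QTL location (cM).
replicates = [float(x) for x in range(101)]
low, high = quantile_interval(replicates)
print(low, high, high - low)
```

By construction such an interval contains the stated share of the replicate estimates for every method, so the interval lengths can be compared directly across methods.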
The purpose of this study was to develop a method not only to detect linkage but also to estimate QTL location and effect. Thus, we did not evaluate the performance of the EB method under the null hypothesis of there being no linkage between the QTL and the marker loci. Such evaluations were explored elsewhere (Beasley et al. 2005). We also used a relatively large number of samples (1000 sib pairs) and assumed a relatively large QTL effect (0.30 heritability) in our simulation because the EB method is feasible only when the sample size is large and the QTL effect is large. The estimation by GRIMM, which requires normality and normal approximation, is also valid only with large samples. Additionally, we calculated the power using the genetic power calculator developed by Purcell et al. (2003) and assumed that the recombination fraction between the QTL and the genetic marker varies from 0.10 to 0.075 to 0.05, resulting in 62.5, 72.3, and 81.2% power, respectively, with 1000 sib pairs and 30% heritability. The type I error rate in the calculation was set at 0.05 without any adjustment for multiple testing. This range of power values should be required in any real study using sib pairs to detect linkage between QTL and marker loci. At the same time, we explored the validity of the EB-based methods with smaller sample sizes. We repeated the simulations using 500 sib pairs per study. As expected, all simulations resulted in less precise estimates, larger MSEs, and wider confidence intervals for all methods investigated here (MLE, IM, MLE–EB, IM–EB, and WLSE). However, the general patterns for the methods using an individual study (MLE and IM), the EB-based methods (MLE–EB and IM–EB), and the meta-analysis method (WLSE) using multiple studies still hold (see supplemental Table 2 and supplemental Figures 5–9 at http://www.genetics.org/supplemental/).
Although we restricted our analysis to the Haseman–Elston regression slope using sib pairs, this method can be readily extended to other types of data as long as a common parameter that describes the effect of the putative QTL and its variance can be estimated for each study (e.g., Kruglyak and Lander 1995). We assumed that the assumptions of linear genetic effect and normality hold for traits. Changing some assumptions should not alter our main conclusions because all methods are based solely on regression slopes and their associated variances, and the estimates of regression slopes and associated variances obtained from the Haseman–Elston regression analysis have been shown to be robust even when these assumptions do not hold (Allison et al. 2000; Fernandez et al. 2002). We plan to explore the performance of EB-based methods when the assumptions of linear genetic effect and normality are violated.
There are several possible ways to improve the EB method presented here. For the EB method, we modeled that the coefficients of the Haseman–Elston regression at an analysis point from studies, , follow a normal distribution: . At the same time, the expectation of is . We simply assumed that all studies address the same QTL. Therefore the predictor was simply set as 1 in the estimation. However, can be different even for the same QTL, resulting in different across studies, which may be why the EB methods perform poorly in some simulations (e.g., L2 or E4). For this situation, an obvious improvement is to set the predictor as an estimation of , which can be obtained by the initial analysis of each study using an IM method. For the EB analysis, μ and were directly estimated and used, which necessarily ignores some posterior uncertainty. This disadvantage can be eliminated by taking a full Bayesian approach. We plan to explore the performance of this full Bayesian model.
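The plug-in character of the EB update can be illustrated with a minimal normal–normal sketch in Python. This is not the GRIMM algorithm: it assumes a single common prior mean (the predictor fixed at 1) and substitutes a crude method-of-moments estimate of the between-study variance, and all names and numbers are hypothetical.

```python
from statistics import fmean, variance


def eb_update(primary, primary_var, background, background_vars):
    """Shrink the primary-study slope toward the mean of the background slopes.

    The prior mean and between-study variance are estimated from the
    background studies and then plugged in as if known, which is the
    source of the ignored posterior uncertainty. Assumes primary_var > 0.
    """
    prior_mean = fmean(background)
    # Moment estimate: observed spread minus average sampling variance, floored at 0.
    between_var = max(0.0, variance(background) - fmean(background_vars))
    # Shrinkage factor: 0 keeps the primary estimate, 1 replaces it by the prior mean.
    shrink = primary_var / (primary_var + between_var)
    posterior_mean = shrink * prior_mean + (1.0 - shrink) * primary
    posterior_var = (1.0 - shrink) * primary_var
    return posterior_mean, posterior_var


# Hypothetical slopes at one analysis point: one primary and two background studies.
updated, updated_var = eb_update(-0.4, 0.01, [-0.1, -0.3], [0.01, 0.01])
print(updated, updated_var)
```

When the background slopes are highly heterogeneous, the estimated between-study variance is large, the shrinkage factor approaches 0, and the updated slope stays close to the primary-study estimate, which is consistent with the small gains seen under L2, E4, and E5. A full Bayesian treatment would instead average over the uncertainty in the prior mean and between-study variance.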
Finally, we conclude that the EB method can account for between-study heterogeneity, estimate the QTL location and effect more precisely, and provide narrower confidence intervals than can an individual study.
We thank two anonymous reviewers for their valuable comments. We are grateful to Carl N. Morris of the Department of Statistics at Harvard University for providing us the source R code of the GRIMM algorithm and permitting us to use it. The R source code of the GRIMM algorithm is available on request from Carl N. Morris. This article is supported by grant R01ES09912 from the National Institutes of Health.
Communicating editor: R. W. Doerge
- Received February 20, 2006.
- Accepted May 18, 2006.
- Copyright © 2006 by the Genetics Society of America