Abstract
Estimates of the locations and effects of quantitative trait loci (QTL) can be obtained by regressing phenotype on marker genotype. Under certain basic conditions, the signs of regression coefficients flanking QTL must be the same. There is no guarantee, however, that the signs of the regression coefficient estimates will be the same. We use sign inconsistency to describe the situation in which there is disagreement between the signs of the estimated regression coefficients flanking QTL. The presence of sign inconsistency can undermine the effectiveness of QTL mapping strategies that presume intervals whose markers have regression coefficient estimates of differing sign to be devoid of QTL. This article investigates the likelihood of sign inconsistency under various conditions. We derive an analytic expression for the approximate probability of sign inconsistency in the single-QTL case. We also examine sign inconsistency probabilities when multiple QTL are present through simulation. We have discovered that the probability of sign inconsistency can be unacceptably high, even when the conditions for QTL detection are otherwise quite favorable.
STATISTICAL tools for quantitative trait loci (QTL) identification have seen substantial development over the last several years. Early work by Lander and Botstein (1989) based on the maximum-likelihood approach has been extended by several authors. Haley and Knott (1992) provided an interesting alternative to maximum-likelihood approaches. They noted that the locations and effects of QTL could be estimated by conducting a regression of the quantitative trait against functions of recombination fractions at each of several loci spanning the genome. Haley and Knott's approach has proved to be quite popular among practitioners because of its good accuracy and simplicity relative to maximum-likelihood methods.
Whittaker et al. (1996) cleverly show how a regression of phenotype on marker genotype can be used to estimate the locations and effects of additive QTL in backcross and F_{2} populations. In contrast to likelihood approaches and the Haley and Knott regression approach, the method proposed by Whittaker et al. (1996) does not require test statistics to be computed at a fine grid of loci covering the genome. They show that when all QTL on any given chromosome are separated by at least two markers (i.e., when all QTL are isolated), the location and effect of any QTL is a function of the distance between the two markers immediately flanking the QTL and the regression coefficients corresponding to these markers. Whittaker et al. (1996) further note that the regression coefficients flanking a QTL must have the same sign as the effect of the QTL that they flank. They correctly conclude that “a pair of adjacent markers with regression coefficients of opposite sign arises when the data are incompatible with the presence of a single QTL between two markers.” However, the estimated regression coefficients that are obtained in practice can, of course, differ in sign even when the regression coefficients that they estimate agree in sign. Thus it is possible that the estimated regression coefficients of markers flanking a QTL will have opposite signs.
Piepho and Gauch (2001) propose a method called marker pair selection (MPS) for mapping QTL that is based on the work of Whittaker et al. (1996). They compare several model selection criteria for choosing pairs of adjacent markers to use in a regression of phenotype on marker genotype. They say that a pair of adjacent markers is sign consistent if the signs of the estimated coefficients agree. Sign consistency plays an important role in their model selection process. In particular, only sign-consistent adjacent marker pairs are allowed to enter or remain among the set of marker pairs to be used in a regression of phenotype on marker genotype. Thus only QTL flanked by sign-consistent markers can be detected.
We use sign inconsistency to describe a situation where a QTL resides in an interval flanked by markers with estimated regression coefficients of opposite sign. In response to an anonymous referee's comment, Piepho and Gauch (2001) acknowledge the possibility of sign inconsistency. They mention a variation of their proposed marker selection process that might be able to detect sign-inconsistent QTL. They advise against the use of the modified procedure partly because they believe the problem of sign inconsistency to be rare. Neither of us refereed the article of Piepho and Gauch (2001), but we have experienced problems with sign inconsistency in our own research related to the Whittaker et al. (1996) method of QTL mapping. The main purpose of this article is to assess the prevalence of sign inconsistency when using the regression method to map QTL.
REVIEW OF QTL MODEL AND PREVIOUS RESULTS
In this section we briefly review the QTL model and relevant equations connecting a regression of phenotype on marker genotype to QTL locations and effects in backcross populations. We summarize only the main points that are developed in detail by Whittaker et al. (1996) and Piepho and Gauch (2001).
Suppose that we have n independent observations of a phenotype paired with p marker genotypes from a backcross experiment. Let Y_{i} denote the ith phenotype and X_{ij} the genotype of the ith observation at marker j. Following Piepho and Gauch (2001), we code the homozygous genotype as 0 and the heterozygous genotype as 1. We assume that
Whittaker et al. (1996) show that
Throughout the remainder of the article, we drop the h subscript corresponding to QTL number whenever we assume that only a single QTL is present.
RESULTS
The following theorem, proved in the appendix, provides an explicit expression for the approximate probability of sign inconsistency in the case of a single QTL as a function of recombination fraction between adjacent markers, the location of the QTL in the marker interval, the error variance, QTL effect size, and sample size. The approximation is best when the number of backcross observations is large and should be more than adequate for the sample sizes typically used in QTL mapping.
Theorem 1. Suppose a single QTL with effect α is located between two markers separated by recombination fraction θ. Let
Through simulations we have examined the adequacy of the approximation suggested by Theorem 1. The probabilities predicted by the asymptotic result differed from simulation estimates (based on 10,000 trials) by an average of <2%. The approximation was best when the probability of sign consistency was high and tended to underestimate the probability of sign inconsistency when the probability of sign inconsistency was lower. Overall the approximation was quite accurate for n = 300 and seemed to improve for larger n as would be expected.
Figure 1 provides a plot of the estimated probability of sign inconsistency against the recombination fraction between a QTL and its left marker for marker spacings of 5 and 10 cM, sample sizes of n = 300 and 1000, and varying QTL effect sizes. The points on the left in Figure 1a were computed through simulation to assess the adequacy of the approximation suggested by Theorem 1. Each point corresponds to the proportion of 10,000 randomly generated data sets in which sign inconsistency occurred. The solid lines were obtained using the result from Theorem 1. The S-PLUS function pmvnorm was used to compute P(b_{1}b_{2} < 0), where (b_{1}, b_{2})′ is distributed as described in the statement of Theorem 1. An error variance σ^{2} = 100 was selected so that the level of difficulty for QTL detection would vary broadly from high to extremely low as the QTL effect ranged from 3 to 15. (See discussion.)
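The quadrant probability P(b_{1}b_{2} < 0) can also be approximated by direct Monte Carlo sampling from a bivariate normal distribution. The following Python sketch is offered only as an illustration and is not the pmvnorm computation used for Figure 1. The function name prob_opposite_signs is hypothetical, and the mean vector and covariance matrix must be supplied by the reader from the statement of Theorem 1; no values from the theorem are assumed here.

```python
import math
import random


def prob_opposite_signs(mean, cov, n_draws=100_000, seed=0):
    """Monte Carlo estimate of P(b1 * b2 < 0) for a bivariate normal (b1, b2).

    mean is (m1, m2); cov is ((v11, v12), (v12, v22)). Both are inputs
    chosen by the caller; nothing here encodes Theorem 1 itself.
    """
    m1, m2 = mean
    (v11, v12), (_, v22) = cov
    # Cholesky factor of the 2x2 covariance matrix, cov = L L'.
    l11 = math.sqrt(v11)
    l21 = v12 / l11
    l22 = math.sqrt(v22 - l21 * l21)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        b1 = m1 + l11 * z1
        b2 = m2 + l21 * z1 + l22 * z2
        if b1 * b2 < 0:
            hits += 1
    return hits / n_draws
```

A convenient sanity check is the zero-mean case, for which the exact probability of opposite signs is 1/2 − arcsin(ρ)/π, where ρ is the correlation between b_{1} and b_{2}.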
The left plot in Figure 1a shows that, although the asymptotic approximation does tend to underestimate the simulated probability of sign inconsistency, the approximation is already quite accurate for n as small as 300. Figure 1, a and b, shows that the asymptotic probability of sign inconsistency increases as QTL effect size and marker spacing decrease. The problem diminishes, but is not eliminated, as the sample size increases. Figure 1 also clearly demonstrates that the probability of sign inconsistency is minimized when a QTL is located at the center of its marker interval. The probability of inconsistency rises in a symmetric manner as the location of the QTL moves toward either flanking marker. Note that in all cases the probability of inconsistency rises to at least 50% as the QTL reaches a flanking marker. This is a general phenomenon described formally in Theorem 2 below and proved in the appendix. It seems ironic that having a QTL at or near a marker is the best situation from the scientific point of view but the worst situation in terms of sign consistency.
Theorem 2. In addition to the assumptions in Theorem 1, if r_{L} is close to 0 or θ, then the probability of sign inconsistency is close to
A simulation study for multiple QTL: We conducted a simulation study to see how often the signs of the regression coefficients flanking an isolated QTL would differ when multiple QTL are present and when the locations of QTL are uniformly distributed within marker intervals. We examined the estimation of regression coefficients for equally spaced markers on a single chromosome of length 1 M. The number of markers on the chromosome was 21 or 11, corresponding, respectively, to spacings of 5 or 10 cM. The number of QTL on the single chromosome was one, two, or three. All QTL on the chromosome were given the same effect size, although this size was varied across simulation settings to gauge the effect of changes in heritability on the frequency of sign change. Integer values 1–15 were used as the common effect of all QTL on the chromosome. The sample size used to estimate the regression coefficients was n = 300. This sample size was large enough to ensure at least one recombination between every pair of markers, even when markers were separated by only 5 cM. The error variance was fixed at 100 for all simulation settings.
One thousand sets of regression coefficients were estimated for each combination of marker spacing, number of QTL, and QTL effect. We are interested in recording only sign inconsistencies for regression coefficients flanking isolated QTL because the regression method of Whittaker et al. (1996) is not directly applicable for locating and estimating nonisolated QTL. Thus, we randomly generated the positions for multiple QTL subject to a constraint forcing all QTL to be isolated. In effect, the position of the first QTL follows a uniform distribution over the length of the chromosome. The location of the second QTL is uniform over the portion of the chromosome that does not include the interval containing the first QTL or the interval(s) immediately adjacent to the first QTL's interval. Similarly, the third QTL was placed uniformly over the portion of the chromosome not ruled out by either of the first two QTL.
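The placement scheme described above can be sketched with sequential rejection sampling: each candidate position is drawn uniformly over the chromosome and is rejected if it falls in an interval that already contains a QTL or in an interval adjacent to one. This is an illustrative Python sketch, not the code used for the study; the function name place_isolated_qtl and its parameters are hypothetical.

```python
import random


def place_isolated_qtl(n_qtl, n_markers, spacing_cm, rng):
    """Draw QTL positions (in cM) such that every QTL is isolated.

    Markers sit at 0, spacing_cm, 2 * spacing_cm, ...; interval k lies
    between markers k and k + 1. The loop does not terminate if the
    requested number of isolated QTL cannot fit on the chromosome.
    """
    length_cm = (n_markers - 1) * spacing_cm
    positions = []
    intervals = []
    while len(positions) < n_qtl:
        pos = rng.uniform(0.0, length_cm)
        # Index of the marker interval containing the candidate position.
        idx = min(int(pos // spacing_cm), n_markers - 2)
        # Reject a candidate in an occupied interval or in one next to it.
        if all(abs(idx - used) >= 2 for used in intervals):
            positions.append(pos)
            intervals.append(idx)
    return positions
```

Because rejected draws are simply discarded, each accepted QTL is uniformly distributed over the portion of the chromosome not ruled out by the QTL placed before it.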
The ordinary least-squares estimates of the regression coefficients were computed in two ways. First, the trait values were regressed against only those markers flanking a QTL. In the second method of estimation, the trait values were regressed against all markers. Among all pairs of markers flanking a QTL, the number of pairs with regression coefficients of differing sign was noted for both methods and each iteration of the simulation. The proportion of 1000 iterations with one or more sign inconsistencies among markers flanking a QTL was computed for each estimation method.
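For the single-QTL case, the first estimation method can be sketched as follows. This Python example is a simplified illustration under stated assumptions, not the simulation code used for the study: it assumes a backcross with genotypes coded 0 and 1, an additive model in which the phenotype equals the QTL effect times the QTL genotype plus normal error, and Haldane's map function (no interference) to convert map distance to recombination fraction. All function names are hypothetical.

```python
import math
import random


def haldane(d_cm):
    """Recombination fraction for a map distance in cM (assumed map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def flanking_slopes(n, alpha, sigma, d_left_cm, d_right_cm, rng):
    """Simulate one backcross data set; return the two OLS slopes, or None."""
    r_left, r_right = haldane(d_left_cm), haldane(d_right_cm)
    x1, x2, y = [], [], []
    for _ in range(n):
        left = rng.random() < 0.5                  # left marker genotype
        qtl = left != (rng.random() < r_left)      # flips on recombination
        right = qtl != (rng.random() < r_right)    # right marker genotype
        x1.append(float(left))
        x2.append(float(right))
        y.append(alpha * float(qtl) + rng.gauss(0.0, sigma))
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    if det <= 1e-12:
        return None  # no recombinants between the markers; slopes undefined
    # Closed-form least-squares slopes for a two-predictor model with intercept.
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def inconsistency_rate(reps, n, alpha, sigma, d_left_cm, d_right_cm, seed=0):
    """Proportion of simulated data sets whose flanking slopes differ in sign."""
    rng = random.Random(seed)
    bad = total = 0
    for _ in range(reps):
        slopes = flanking_slopes(n, alpha, sigma, d_left_cm, d_right_cm, rng)
        if slopes is None:
            continue
        total += 1
        bad += slopes[0] * slopes[1] < 0
    return bad / total
```

For example, inconsistency_rate(1000, 300, 7.0, 10.0, 2.5, 2.5) estimates the inconsistency probability for a QTL of effect 7 centered in a 5-cM interval with n = 300 and error variance 100, under the assumptions listed above.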
This first estimation method involving a regression on only markers flanking a QTL is most relevant in the context of a forward stepwise regression procedure like the one proposed by Piepho and Gauch (2001). Consider, for example, using their Algorithm 1 for the analysis of a single chromosome. The algorithm would begin by considering all possible regressions using a single pair of adjacent markers. Only sign-consistent pairs would be eligible for consideration at later stages of the model selection procedure. The algorithm would continue to examine all possible regressions using two sets of adjacent marker pairs. Only sets where both adjacent marker pairs are sign consistent would be eligible for later consideration. The procedure continues by analyzing all possible sets of three, four, etc., until no sign-consistent model can be found. Thus the correct model cannot be selected unless there is complete sign consistency in the regression of trait values against exactly those markers flanking a QTL.
Simulation results: In this subsection, we present detailed results on the probability of sign inconsistency when trait values are regressed against exactly those markers flanking a QTL. We do not report our detailed results for the probability of sign inconsistency when regressing against all markers. Generally speaking, our simulations suggest that the probability of sign inconsistency is similar for both methods of estimation. The probability of inconsistency when regressing against all markers appears somewhat lower than the probabilities we report when the QTL effect size is very low. However, the inconsistency probability quickly becomes higher than the reported probabilities as the QTL effect size climbs toward practical levels.
Our simulation results are illustrated in Figure 2. The estimated probability of one or more sign inconsistencies as a function of marker spacing, number of QTL per chromosome, and QTL effect size is presented. The probability of one or more sign inconsistencies clearly increases with increasing number of QTL. Although it is not obvious from Figure 2, the probability of one or more sign inconsistencies becomes greater when marker spacing is decreased from 10 to 5 cM. In fact, this is true for each effect size and number of QTL. Perhaps the most obvious feature exhibited by Figure 2 is the decrease in the inconsistency probability as QTL effect size grows. Note, however, that even in the most favorable case studied the inconsistency probability exceeds 10%.
DISCUSSION
Power considerations: Does sign inconsistency readily occur only when there is little power to detect QTL? If this were the case, the large sign inconsistency probabilities that we have reported would have little practical importance. We claim, however, that sign inconsistency occurs with unacceptable frequency, even when the power for QTL detection is quite high.
Power for the regression method of QTL mapping proposed by Whittaker et al. (1996) and extended by Piepho and Gauch (2001) depends on the details of the model selection method and cannot be easily computed. Furthermore, this method assumes sign consistency, an assumption in question. Hence we gauge power for QTL detection by studying a simpler, traditional method of analysis. Consider, for example, the power of a simple single-marker analysis for QTL detection when there is one QTL midway between 2 of 11 markers separated by 10 cM. For each marker, we could compute a two-sample t-test to compare the mean of the heterozygotes to the mean of the homozygotes (see Liu 1998, Section 13.2, for example). We could declare significant evidence of linkage between a marker and the QTL if the P value of the t-test is <0.05/11. This single-marker analysis would have a chromosome-wise type I error rate no larger than 0.05 because of the conservative Bonferroni adjustment obtained by dividing the desired significance level by the number of markers tested. The probability that a two-sample t-statistic immediately flanking the single QTL would have a P value ≤0.05/11 is approximately
It is troublesome and perhaps surprising that the probability of sign inconsistency can be large when conditions for QTL detection are so favorable. Consider, for example, the case of a single QTL with effect size seven, n = 300, and error variance σ^{2} = 100. From Table 1 we can deduce that the power of a simple single-marker QTL scan across 21 markers evenly spaced on a 1-M chromosome is near 99%. Nevertheless, the probability of sign inconsistency is at least 20% at best (Figure 1) and >30% when location of the QTL within its interval follows a uniform distribution (Figure 2).
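The power of a Bonferroni-adjusted single-marker test of the kind discussed above can be gauged with a simple normal approximation. The Python sketch below is our own illustrative approximation and is not the expression used to produce Table 1. It assumes a backcross with roughly n/2 individuals in each marker class, an expected difference in class means of α(1 − 2r) for a marker at recombination fraction r from the QTL, and a within-class variance of σ^{2} + α^{2}r(1 − r). The function name single_marker_power is hypothetical.

```python
from statistics import NormalDist


def single_marker_power(alpha, sigma, n, r, n_markers, level=0.05):
    """Approximate power of a two-sided, Bonferroni-adjusted two-sample test.

    alpha: QTL effect; sigma: error standard deviation; n: sample size;
    r: recombination fraction between the marker and the QTL.
    """
    std = NormalDist()
    # Expected difference between the two marker-class means.
    delta = alpha * (1.0 - 2.0 * r)
    # Error variance plus QTL segregation within a marker class.
    within_var = sigma ** 2 + alpha ** 2 * r * (1.0 - r)
    # Standard error of the difference, assuming n / 2 individuals per class.
    se = (4.0 * within_var / n) ** 0.5
    # Two-sided critical value at level / n_markers.
    z_crit = std.inv_cdf(1.0 - level / (2.0 * n_markers))
    shift = abs(delta) / se
    return (1.0 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)
```

The approximation treats the t-statistic as normal, which should be reasonable for sample sizes in the hundreds; for small n an exact noncentral t calculation would be preferable.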
Our results suggest that procedures insisting on sign consistency will often fail to map QTL to the marker intervals in which they are contained. It is possible, however, that QTL will be mapped to intervals near the correct interval when sign inconsistency occurs. Thus, depending on how QTL detection is defined, procedures insisting on sign consistency may maintain reasonable power for QTL detection despite sign inconsistency. Indeed, Piepho and Gauch (2001) report very high power for QTL detection under conditions where we would expect sign inconsistency to be frequent. This is not necessarily inconsistent with our results because Piepho and Gauch consider a QTL to be detected if it is mapped within 15 cM of its true location. Since 15 cM could be a couple of marker intervals away from the interval that actually contains the QTL, it is possible that the true interval will exhibit sign inconsistency and yet be successfully detected according to Piepho and Gauch's criterion.
Reason for sign inconsistency: The reasons for sign inconsistency in the single-QTL case can be understood by working through the proofs of Theorems 1 and 2 in the appendix. A less formal understanding can be gained by examining (7) and noting that the correlation between the flanking-marker regression coefficient estimators is negative in the case of a single QTL. Consider, for example, the case where flanking markers are separated by 5 cM, the QTL effect size α = 5, the backcross sample size n = 300, and the error variance σ^{2} = 100. According to Theorem 1, the joint distribution of
When the QTL is centered in the marker interval, the left half of Figure 3 shows that there is considerable probability mass in the second and fourth quadrants because of the negative correlation between
Figure 3 provides insight beyond the special case depicted. In general the asymptotic correlation between
From the left half of Figure 3, it is easy to imagine a situation where the QTL effect α is so large that nearly all the probability mass of the approximate joint distribution will fall in the first quadrant, assuming that the QTL is located at the center of its marker interval. Our simulations and probability calculations suggest that this situation does not occur until α is extremely large. An examination of the covariance matrix M/n can provide some explanation. Note that the approximate variance of
Implications for mapping QTL through regression of phenotype on marker genotype: The regression approach to QTL mapping described by Whittaker et al. (1996) has many attractive features. One particularly appealing aspect is the fact that multiple regression is well understood by many practitioners and well studied in the statistics literature. By adopting the approach of Whittaker et al. (1996), many multiple regression techniques that have been developed over the years have new relevance to the problem of mapping multiple QTL with the aid of molecular markers. We have pointed out one potential weakness of the regression approach and the MPS method proposed by Piepho and Gauch (2001). Our intent is not to condemn these methods, but rather to call attention to an issue that requires more careful examination before these methods can reach their full potential for locating QTL and estimating their effects.
Our results indicate that the problem of sign inconsistency is greatest when markers flanking QTL are close together and thus highly correlated. This is not very surprising because ordinary least-squares regression estimators become unstable (i.e., highly variable) when there is high correlation among predictor variables. This problem is shared by virtually all multiple-marker QTL mapping methods, and there is currently no simple fix. Multicollinearity problems, however, are well documented in the statistics literature, along with many potentially useful methods for combating multicollinearity. Hwang and Nettleton (2000) consider the use of principal components regression and related methods for estimating regression coefficients with correlated predictors. The focus of that article is on reducing the mean-squared error of the regression coefficient estimators. A simulation in Hwang and Nettleton (2000) suggests that the methods proposed there may lead to improved estimates of QTL locations and effects. It is possible that such a version of principal components regression will reduce the probability of sign inconsistency, but we have not yet investigated this.
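The effect of marker spacing on estimator stability can be illustrated with a short calculation. In a backcross with genotypes coded 0 and 1, the correlation between two markers separated by recombination fraction θ is 1 − 2θ, and in a regression on two predictors with correlation ρ each slope variance is inflated by the factor 1/(1 − ρ^{2}) relative to uncorrelated predictors. The Python sketch below assumes Haldane's map function to convert spacing to θ; that choice and the function names are assumptions made for illustration.

```python
import math


def adjacent_marker_correlation(spacing_cm):
    """Correlation of 0/1 backcross genotypes at two markers (Haldane assumed)."""
    theta = 0.5 * (1.0 - math.exp(-2.0 * spacing_cm / 100.0))
    return 1.0 - 2.0 * theta


def variance_inflation(spacing_cm):
    """Variance inflation factor for a regression on the two flanking markers."""
    rho = adjacent_marker_correlation(spacing_cm)
    return 1.0 / (1.0 - rho ** 2)
```

Comparing variance_inflation(5.0) with variance_inflation(10.0) shows how halving the marker spacing increases the variance of the flanking-marker coefficient estimators.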
Piepho and Gauch's (2001) own suggestion for dealing with sign inconsistency is potentially a good one. They mention a modification of their MPS procedure that can lead to the consideration of marker pairs that are more separated than adjacent markers. This suggestion has the potential for reducing sign inconsistency because, as we have shown, the probability of sign inconsistency decreases as markers become more separated. We believe that there is now good reason to investigate their proposed modification and/or similar methods that do not insist upon sign consistency of adjacent marker pairs.
Acknowledgments
The authors thank Professor Allen Back for writing programs to confirm the accuracy of the formulas in this version as well as earlier versions of this article. The authors also thank Dr. C. Z. Wei and Professor David Ruppert for discussions regarding the asymptotic normality of the least-squares estimator. Dan Nettleton acknowledges support from the National Research Initiative Competitive Grants Program of the U.S. Department of Agriculture, Award 19983520510390.
APPENDIX
Proof of Theorem 1. The goal is to prove that the distribution of (M/n)^{−1/2}(
The ordinary least-squares estimate of β = (β_{1}, β_{2})′ is
By direct calculation, the ith diagonal element of D_{0} is
Now as n grows large,
To simplify the second term of (A3), observe that
i. For nonrecombinants X_{i} = (0, 0)′ or (1, 1)′,
ii. For recombinants X_{i} = (1, 0)′ or (0, 1)′,
Now for large n, approximately n(1 − θ) of the X_{i} will be nonrecombinants and approximately nθ will be recombinants. Thus
The proof for the mean (6) follows directly from (3) and the fact that
Proof of Theorem 2. By Theorem 1, the probability of sign inconsistency is P(b_{1}b_{2} < 0), where (b_{1},b_{2})′ has the joint distribution described in the statement of Theorem 1. As r_{L} converges to zero, the mean (6) converges to (α, 0)′, and the covariance matrix converges to
To prove that (8) is greater than one-half, we write the probability as
Footnotes

Communicating editor: G. A. Churchill
 Received July 19, 2001.
 Accepted January 14, 2002.
 Copyright © 2002 by the Genetics Society of America