Abstract
Lander and Botstein introduced statistical methods for searching an entire genome for quantitative trait loci (QTL) in experimental organisms, with emphasis on a backcross design and QTL having only additive effects. We extend their results to intercross and other designs, and we compare the power of the resulting test as a function of the magnitude of the additive and dominance effects, the sample size and intermarker distances. We also compare three methods for constructing confidence regions for a QTL: likelihood regions, Bayesian credible sets, and support regions. We show that with an appropriate evaluation of the coverage probability a support region is approximately a confidence region, and we provide a theoretical explanation of the empirical observation that the size of the support region is inversely proportional to the sample size, not the square root of the sample size, as one might expect from standard statistical theory.
Recent advances in genetics have led to the identification of genes responsible for certain diseases such as cystic fibrosis, Huntington's disease, breast cancer, and others. Linkage analysis, which is especially effective when the disease or trait of interest exhibits Mendelian inheritance, played an important role in the identification of those genetic loci. When the disease is complex in nature (incomplete penetrance, multiple loci involved, etc.) or quantitative, finding the genetic loci involved in the etiology of the trait can be more difficult. In particular, in human studies, it is difficult to separate environmental and genetic effects. However, with experimental organisms, studies can be designed to provide a similar environment for all individuals, so that the variation in phenotypes can be attributed mainly to genetic factors; and breeding designs can control the nature of the differences in genotype. Studies of experimental organisms can provide useful information for agricultural purposes and/or contribute to our understanding of human disease via animal models. Moreover, it is now feasible to search the entire genome for a gene locus influencing a trait of interest. Statistical methods for mapping quantitative trait loci (QTL) from experimental crosses using a dense set of markers were introduced by Lander and Botstein (1989). Applications have involved (i) tomatoes (Paterson et al. 1991) to identify loci influencing traits such as mass per fruit, pH, and soluble solid concentration; (ii) grain yield in maize (Stuber et al. 1992); (iii) high blood pressure in rats (Jacob et al. 1991); and (iv) fatness and growth rate in pigs (Andersson et al. 1994). In their original article, Lander and Botstein suggested statistical tests for general designs, but provided guidelines for declaring statistical significance for the backcross design only. Paterson et al. (1991) used these guidelines for intercross designs, but to avoid an increase in the false-positive error rate, they restricted themselves to a 1-d.f. statistic that ignored dominance effects. Churchill and Doerge (1994) proposed use of the permutation distribution to define thresholds for all design types. This method has the advantage that it makes no assumptions on the distribution of the phenotype. However, the thresholds depend on the observed data, so they need to be computed by Monte Carlo for each study; hence the method is less useful for analyzing and comparing different designs.
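To make concrete why permutation thresholds must be recomputed for each data set, the following is a minimal sketch of the general idea, not the procedure of Churchill and Doerge (1994) as published. The function names, the choice of N r^2 as the per-marker statistic, and the default settings are illustrative assumptions.

```python
import random

def max_marker_stat(phenotypes, genotypes):
    """Largest N * r^2 over markers, where r is the phenotype-genotype
    correlation at a marker (an assumed stand-in for 2 ln LR)."""
    n = len(phenotypes)
    mean_y = sum(phenotypes) / n
    ss_y = sum((y - mean_y) ** 2 for y in phenotypes)
    best = 0.0
    for marker in genotypes:  # one list of genotype codes per marker
        mean_x = sum(marker) / n
        ss_x = sum((x - mean_x) ** 2 for x in marker)
        if ss_x == 0 or ss_y == 0:
            continue  # uninformative marker or constant phenotype
        s_xy = sum((x - mean_x) * (y - mean_y)
                   for x, y in zip(marker, phenotypes))
        best = max(best, n * s_xy ** 2 / (ss_x * ss_y))
    return best

def permutation_threshold(phenotypes, genotypes, level=0.05,
                          n_perm=1000, seed=0):
    """Empirical (1 - level) quantile of the genome-wide maximum statistic
    when phenotypes are shuffled relative to the genotypes."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype association
        null_max.append(max_marker_stat(shuffled, genotypes))
    null_max.sort()
    index = min(n_perm - 1, int((1 - level) * n_perm))
    return null_max[index]
```

Because the returned threshold is a function of the observed phenotypes and genotypes, it cannot be tabulated in advance for a design, which is the limitation noted above.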
In this article we propose for intercross and other designs simple approximations that can be used to compare different designs under various conditions or the same design for different sample sizes or marker densities. We also discuss and compare three methods for constructing confidence intervals for a QTL. We assume throughout that markers are equally spaced, that there are no missing data, and except where noted that recombination occurs without interference. While these are artificially simple assumptions, at the cost of some complication they can all be weakened. Rough preliminary calculations suggest that the resulting picture would not change substantially unless the assumptions are radically altered. The sections on QTL detection and confidence regions are independent and can be read in any order.
RESULTS
The model and likelihood ratio statistics: The starting point for our considerations is a cross between two strains that differ substantially in the quantitative trait of interest. The parental lines can be “pure” breeding lines obtained through inbreeding or simply two different strains of the same organism with widely differing mean phenotype. A cross is obtained from the two parental lines, creating the first generation of offspring (generation F_{1}). The F_{1} generation is then allowed to mate together to produce the second generation (F_{2}), the intercross. We assume that the genotypes of the parental lines are completely different, so that at any marker locus we can label alleles from the strain with the larger mean phenotype as A, and alleles from the other strain as B. At each locus, each individual of the F_{2} generation will have zero, one, or two A alleles. A backcross is generated by mating an individual of the F_{1} generation to one from the parental line. If the parental line with the smaller mean for the trait is used, the offspring from the backcross will have zero or one A alleles at any locus on their genome.
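As a small illustration of the two designs just described, the sketch below draws the number of A alleles at a single locus. The function names are hypothetical; the only assumption is Mendelian segregation, so that an F_{1} parent (AB at every locus) transmits A with probability 1/2.

```python
import random

def gamete_from_f1(rng):
    """Allele transmitted by an F1 parent: 1 for A, 0 for B."""
    return 1 if rng.random() < 0.5 else 0

def intercross_genotype(rng):
    """F2 offspring: both parents are F1, so the count is 0, 1, or 2."""
    return gamete_from_f1(rng) + gamete_from_f1(rng)

def backcross_genotype(rng):
    """Backcross to the B parental line: that parent always transmits B,
    so the count of A alleles is 0 or 1."""
    return gamete_from_f1(rng)

if __name__ == "__main__":
    rng = random.Random(0)
    f2 = [intercross_genotype(rng) for _ in range(10000)]
    # Expected proportions are 1/4, 1/2, 1/4 for 0, 1, 2 A alleles.
    print([f2.count(k) / len(f2) for k in (0, 1, 2)])
```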
A standard model for quantitative traits (e.g., Kempthorne 1957) in notation suitable for our purposes is the following. Let y_{i} be the phenotypic value of individual i, and let x_{ij}(d) be the number of A alleles at locus d on the jth chromosome. The locus is identified by its genetic distance d from one end of the chromosome. If there exists only one QTL on the jth chromosome that influences the trait and its location is q, the phenotype can be modeled as
For backcross data, because x_{i}(q) = 0 or 1, the additive and dominance effects cannot be estimated separately, and the model reduces to
If one observes the genotype of a marker at a putative trait locus d, the maximum log-likelihood ratio at d is given approximately by
Detection of linkage in backcrosses: Because the log-likelihood ratio is maximized over the entire genome, it is unclear whether the conventional threshold of LOD = 3.0 [equivalently 2 ln LR(d) > 13.8] to declare statistical significance is appropriate in the present context. To address this issue, Lander and Botstein (1989) proposed the approximation of
For the case of equispaced markers along the genome, Feingold et al. (1993) proposed an approximation, which agrees closely with the results from Lander and Botstein's simulations. That approximation is
For a backcross design with a QTL located exactly at a marker, Feingold et al. (1993) gave as an approximation for the power
From (6) and (4) we see the importance of the parameter β, which equals 0.02 for backcross designs, but can assume a larger value for other designs (e.g., recombinant inbred designs). In (4), β multiplies the length of the genome, so a larger value requires a larger threshold to maintain a given false-positive error rate. From (6) we see that it also governs the rate at which the noncentrality parameter decays as a function of the distance from QTL to flanking marker. A large value of β means a rapid falling off in power to detect the QTL as a function of that distance. On the other hand, it also provides the possibility for more precise fine mapping of the QTL location, because a large β leads to a sharper delineation of the “peak” in the process 2 ln LR(d) that identifies the location of the QTL. We return to these issues below.
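The role of β can be illustrated numerically. The sketch below assumes Haldane's map function (no interference), under which the correlation between backcross genotypes at two loci d cM apart is 1 - 2r(d) = exp(-0.02 d), matching β = 0.02 per cM. Treating exp(-βd) as the attenuation factor for the noncentrality parameter is our reading of the decay described in the text, and the function names are hypothetical.

```python
import math

BETA_BACKCROSS = 0.02  # per cM, the backcross value quoted in the text

def haldane_recombination(d_cm):
    """Recombination fraction at map distance d_cm under Haldane's function."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def attenuation(d_cm, beta=BETA_BACKCROSS):
    """Assumed factor by which the noncentrality parameter shrinks at a
    marker d_cm away from the QTL."""
    return math.exp(-beta * d_cm)

if __name__ == "__main__":
    # QTL midway between markers 20 cM apart: nearest marker is 10 cM away.
    print(round(attenuation(10.0), 3))                        # backcross
    print(round(attenuation(10.0, beta=0.08), 3))             # larger beta
    print(round(1 - 2 * haldane_recombination(10.0), 3))      # same as first
```

With β = 0.08 the factor at 10 cM is much smaller than with β = 0.02, which is the sense in which a large β costs power on a sparse map.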
The preceding analysis is concerned with the likelihood ratio process observed at the discrete set of marker loci. To mitigate the problems indicated by (6) when the QTL is in the center of a marker interval, Lander and Botstein (1989) suggested the technique of interval mapping, i.e., treating the unobserved intervals between marker loci as missing data and using the EM algorithm to interpolate between the observed data points. Rebai et al. (1994, 1995) have used Rice's formula for the expected number of upcrossings of a level by a piecewise smooth Gaussian process to give approximations for the false-positive rates when using interval mapping. The method is analytically tractable when one assumes complete interference, i.e., the recombination probability and map distance in Morgans are equal. Single chromosome simulations performed by these authors and our own whole genome simulations (data not shown) indicate that the approximation is very good when the sample size is reasonably large and markers are not too closely spaced. For dense markers (∼1 cM) it is conservative. A modification suitable for small samples can be inferred from Johnstone and Siegmund (1989).
An argument of Siegmund and Worsley (1995) can be adapted to give a simple approximation for the power of an interval mapping test (see Appendix A).
Intercrosses: Most previous theoretical analyses have concentrated on backcrosses and consequently have ignored dominance effects. Paterson et al. (1991) used the full model (1) to locate QTL in tomatoes in an intercross, and estimated the dominance effects. However, to detect linkage, they used a 1-d.f. statistic that ignores the dominance effects. Here we analyze the 2-d.f. statistic involving both additive and dominance effects.
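A 2-d.f. statistic of this kind can be sketched at a single marker as follows. The sketch assumes normal errors and a mean that depends on the genotype through an additive term in the number of A alleles plus a dominance term for heterozygotes. Under that assumption the alternative fits one free mean per genotype class, so 2 ln LR = N ln(RSS_0/RSS_1) reduces to a comparison of total and within-genotype sums of squares. The function name is hypothetical and the code computes the exact normal-theory statistic, not the approximation used in the text.

```python
import math

def two_df_lr_statistic(phenotypes, genotypes):
    """2 ln LR for testing no additive and no dominance effect at one marker.

    genotypes holds the number of A alleles (0, 1, or 2) per individual.
    Requires some within-genotype variation (within_ss > 0).
    """
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    total_ss = sum((y - grand_mean) ** 2 for y in phenotypes)  # RSS under H0
    groups = {}
    for y, g in zip(phenotypes, genotypes):
        groups.setdefault(g, []).append(y)
    within_ss = 0.0  # RSS under the alternative: one mean per genotype
    for values in groups.values():
        group_mean = sum(values) / len(values)
        within_ss += sum((y - group_mean) ** 2 for y in values)
    return n * math.log(total_ss / within_ss)
```

A 1-d.f. statistic that ignores dominance would instead regress the phenotype on the allele count alone.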
Consider the likelihood ratio statistic to test the general hypothesis that α = δ = 0 vs. the alternative that α ≠ 0 or δ ≠ 0. For intercross data the vectors with coordinates x_{i}(d) and 1_{(x_{i}(d)=1)} (i = 1,..., N) are asymptotically orthogonal. Therefore, the approximations used to obtain (3) now yield for the log-likelihood ratio at the marker d
Let
Rebai et al. (1995) have given an approximation for the false-positive error rate when interval mapping is used. This approximation involves an elliptic integral, to be evaluated numerically, and so is more complicated than the analogous backcross approximation, which can be written in closed form involving only the exponential and inverse tangent functions. In fact, the mathematically correct form of (9) involves similar complications, although extensive numerical calculations show that there is very little difference between the mathematically correct approximation and the more convenient one given above, which is based on replacing the two parameters β_{1} and β_{2} associated with the two coordinate processes by their average value, (β_{1} + β_{2})/2. In this spirit one can modify the approximation of Rebai et al. (1995) to obtain a closed form approximation that is no more complicated than that obtained for a backcross and gives essentially the same numerical results as the more complicated, mathematically correct approximation.
To check the accuracy of (9) and our interval mapping approximation, we simulated thresholds for the log-likelihood ratio based on an intercross sample of N = 350 organisms with 12 chromosomes of total length 1200 cM (to approximate the tomato genome). The interval mapping step was performed using an approximation due to Haley and Knott (1992), which is much less computer intensive and gives results almost identical to the EM algorithm for large values of N. Results are shown in Figure 1.
Both approximations are very accurate. As predicted, the process with the interval mapping step requires a higher threshold for a given value of the type I error. For smaller N, somewhat different approximations yielding larger thresholds need to be used, since the given approximations do not take into account the variability in the estimate of the variance,
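The idea behind a regression-based interval mapping step can be sketched for the simpler backcross case: at a putative QTL position inside a marker interval, the unobserved QTL genotype is replaced by its conditional probability given the flanking markers, and the phenotype is regressed on that value. The code below computes only this conditional probability, assuming Haldane's map function so that, given the QTL genotype, the two flanking markers are independent. The names are hypothetical and this is not a transcription of Haley and Knott (1992).

```python
import math

def haldane_r(d_cm):
    """Recombination fraction at map distance d_cm (no interference)."""
    return 0.5 * (1.0 - math.exp(-0.02 * d_cm))

def prob_qtl_has_a(left, right, d_left_cm, interval_cm):
    """P(QTL genotype carries one A allele | flanking markers) in a backcross.

    left, right: flanking marker genotypes (1 if an A allele is present).
    d_left_cm:   distance from the left marker to the putative QTL.
    Assumes the flanking pattern has nonzero probability.
    """
    r1 = haldane_r(d_left_cm)
    r2 = haldane_r(interval_cm - d_left_cm)
    # Likelihood of the flanking markers if the QTL genotype is 1 ...
    like_1 = ((1 - r1) if left == 1 else r1) * ((1 - r2) if right == 1 else r2)
    # ... and if it is 0. The prior 1/2 for each genotype cancels.
    like_0 = (r1 if left == 1 else (1 - r1)) * (r2 if right == 1 else (1 - r2))
    return like_1 / (like_1 + like_0)
```

At a marker (d_left_cm = 0) the function returns the observed marker genotype, so the interpolated process agrees with the marker process there.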
For intercross data the noncentrality parameter for a QTL located at a marker locus is
Using simulations and the theoretical power approximations above, we compare in Figure 2 the power of the marker process with the power of the interval mapping test. We also present the power of the interval mapping test using the more stringent threshold (assuming continuous markers) proposed by Lander and Kruglyak (1995). The power was investigated for a dominant model, so δ = α, and ξ = 4.12, 4.41, 4.75, and 5.21, which correspond roughly to powers of 60, 70, 80, and 90% with a continuous map of markers. For recessive (δ = −α) models, the power would be exactly the same. For the same noncentrality values and an additive model (δ = 0), it would be slightly larger. Power under two map densities was estimated (Δ = 20 and 5 cM) and we used N = 350 tomato genomes. Each power simulation is based on 1000 replicates. The gain in power from using interval mapping is small, on the order of 2–4%, a result similar to that found by Darvasi et al. (1993). The gains anticipated by Lander and Botstein (1986, 1989), who write of interval mapping as providing a “virtual marker” midway between the actual markers, are overly optimistic. Their analysis is marred by their comparison of interval mapping with the marker process at only one of the flanking loci, where a more appropriate comparison would be with the maximum of the process at the two flanking loci. They also neglect the increase in threshold required to maintain a given false-positive error rate for the interval mapping process. The gain in power for interval mapping is largest for the sparse map (Δ = 20 cM), but the gain is only ∼3–4%. Using the threshold for a continuous map when in fact a sparse map of markers is used greatly reduces the power (by as much as 20%).
We have made similar computations with similar results for backcross designs.
When the markers-only process is used, the theoretical power approximations are very good, so only the simulated values have been included in Figure 2. The approximations are also good for interval mapping except when the intermarker distance is 5 cM and the QTL is midway between markers. In this case the power is underestimated by ∼5%. The reason is that the theoretical approximation involves only the probability that the process is above the threshold somewhere in the interval containing the QTL and neglects the probability of detecting the QTL to be in a neighboring interval. This is not a problem when the intermarker interval is large.
Other designs and a comparison of different designs: Many other designs can be handled by similar approximations. To evaluate an appropriate threshold, for the markers-only process it is only necessary to know the recombination parameter β (or β_{1} and β_{2}), which depends only on the design, not the mathematical model used for recombination. Although there is no general method to evaluate this parameter, it has been calculated for many different designs. (Some values are given below.) For interval mapping one must know the complete covariance function, which depends on both the design and the model for recombination.
For instance, for recombinant inbred data, which involve the 1-d.f. statistic (3), one can use approximation (4) with β = 0.04 for recombinants produced by selfing and β = 0.08 for recombinants produced by recurrent sib mating (as originally suggested by Lander and Botstein 1989). It is only slightly more complicated to incorporate interval mapping. (See Rebai et al. 1994 for the case of selfing. A similar formula can be obtained for inbreds produced by recurrent sib mating.) For the advanced intercross designs suggested by Darvasi and Soller (1995) to provide more accurate localization of QTL, for the F_{i} offspring one can use (9) with β_{1} = iλ, β_{2} = 2iλ. For reciprocal backcross designs, where half of the offspring are backcrossed to each parental strain, one can use (9) with β_{1} = β_{2} = 0.02.
In Stuber et al. (1992), offspring from a cross of two inbred maize strains (F_{1} generation) were allowed to self twice and then backcrossed to one of the parental lines. A careful examination of that design shows that the maximum LOD for testing the hypothesis of no linkage is approximately [cf. (3), (6)]
Korol et al. (1995) have suggested the use of correlated traits as a technique to improve the power of QTL mapping. If the number of traits is t, this would require a t dimensional version of (4) or a 2t dimensional version of (9) for the backcross or intercross design, respectively. The appropriate k dimensional approximation (k = t or 2t) is given by
We have used the theory developed above to compare the power of backcross, intercross, and recombinant inbred designs (obtained by recurrent sib mating). Let
Suppose ρ = 0. It is easy to see that the noncentrality parameter of the backcross is smallest and that of the recombinant inbred is largest. All three noncentrality parameters are comparable for large H^{2}, but there can be sizeable differences for small H^{2}. Because the threshold required for a given significance level is smallest for the backcross and largest for the intercross, one expects to find the backcross the most powerful design when H^{2} is large, but not otherwise.
A numerical example is given in Table 1. We have determined for continuous markers sample sizes that give 80% power for values of H^{2}, v^{2}, and ρ. Although the exact sample sizes depend on v^{2}, their relative values are roughly constant throughout a broad range where v^{2}, the heritability attributable to the QTL, contributes from roughly ⅛ to ½ of H^{2}, so only the intermediate value v^{2} = 0.2 H^{2} is included in the table. Similarly the relative sample sizes are fairly insensitive to the exact power required. In agreement with the qualitative analysis of the preceding paragraph, for ρ = 0 the sample size required by a backcross design is about the same as that of the intercross for H^{2} = 0.75 but is appreciably larger for H^{2} = 0.25. For ρ^{2} = 0.04, the backcross design can require somewhat smaller or much larger sample sizes than the intercross design depending on whether ρ is positive or negative, which in turn depends on the parental strain used for the backcross. Hence with a small amount of dominance, probably too small to be detected in segregation analysis, a backcross design can yield a very misleading picture. The sample sizes required of the recombinant inbred design are smaller than those of the intercross and backcross designs and are insensitive to the values of ρ, at least for the relatively small values considered here.
We have performed similar calculations when the amount of dominance varies across QTL. The sample sizes in the backcross column can change substantially, but the qualitative picture is the same.
This problem with a backcross design could in principle be eliminated by backcrossing to both parental strains and using a 2-d.f. statistic (with β_{1} = β_{2} = 0.02). One can easily evaluate the noncentrality parameter and see that for small values of H^{2} such a reciprocal backcross is less powerful than an intercross design based on an equal number of progeny, but is slightly more powerful than an intercross design based on an equal number of matings (hence presumably half as many progeny). For larger values of H^{2}, numerical calculations as in Table 1 can help one determine the potential usefulness of such a design.
To simplify the preceding comparison, we have assumed continuously distributed markers. This has the effect of concealing a weakness of the recombinant inbred design, which has a very large recombination parameter (β = 0.08). A consequence is that if markers are not closely spaced there is a considerable loss of power to detect a QTL located midway between markers. For an example consider the fourth row of Table 1, where the recombinant inbred design is much more powerful than either of the other two. For a Δ = 20-cM map and a QTL midway between markers, the power falls to about 0.73 if we use the sample sizes given in the table with an intercross or backcross design. To achieve this power with a recombinant inbred design, one would need a sample size of ∼380, and in this case interval mapping would be mandatory. Otherwise a sample size of ∼690 would be required. For a Δ = 5-cM map, the power of a backcross or intercross would fall only to 0.79 for a QTL midway between markers. Now for a recombinant inbred design a sample size of about 291 would be required (300 without interval mapping). To achieve the benefits of a recombinant inbred design, it appears advisable to type markers at no more than 5 cM distance, and closer would be better. A similar caution is applicable to the advanced intercross designs of Darvasi and Soller (1995).
Confidence regions for QTL: A confidence region can be used to identify a chromosomal region in which to concentrate the search for the exact location of a QTL. In this section, three methods of constructing a confidence region around the gene locus are presented and compared. It is perhaps worth noting from the outset that this is not a “regular” estimation problem as the term is used by statisticians. Because the likelihood function has cusps at marker loci, the maximum-likelihood estimate of a QTL may fail to be approximately normally distributed, so one is not justified in using the maximum-likelihood estimator plus or minus two estimated standard errors as an approximate 95% confidence interval. Darvasi et al. (1993) in one of their suggestions appear to have assumed incorrectly that the standard statistical theory is applicable. Visscher et al. (1996) have suggested a confidence interval based on the unconditional distribution of the maximum-likelihood estimator, which they estimate by bootstrapping. Although their coverage probabilities are shown by a Monte Carlo experiment to be quite close to the specified level, this method does not adapt to the rate of decay of the likelihood function near its maximum and is known to give confidence regions that are unnecessarily large in related “change-point” problems. A numerical example given below suggests that it has the same undesirable feature here. See Siegmund (1988) for a more complete discussion.
Support intervals: Support intervals (cf. Conneally et al. 1985) provide a method of estimating the location of a trait locus. They are essentially equivalent to the standard statistical technique of inverting the likelihood ratio test to obtain a confidence region. Given a value x > 0, a support region includes all the loci q such that
Likelihood methods: A second method to provide a confidence interval for a QTL relies on using likelihood methods for change points (Siegmund 1988; Feingold et al. 1993). It is closely related to the support method described above and provides some analytic tools for studying that concept. Unlike the support method, however, for the special case that the trait locus is exactly at a marker location the likelihood method in principle gives an exact confidence region.
Although the actual procedure is based on twice the loglikelihood ratio, our discussion will be simplified notationally by using the asymptotically equivalent ∥Z_{d}∥^{2}, where Z_{d} = (X_{d}, Y_{d}) is defined in (8) [cf. also (7)] and
As the desired conditional probability does not depend on α, δ, it can be evaluated under the hypothesis that these parameters are both zero. The approximation (B1) of Appendix B yields as a confidence interval for the QTL those loci q such that
By (7) the inequality defining A_{q} and the inequality in (12) are asymptotically equivalent. The important difference between the likelihood ratio and LOD support methods is that for the former x depends on Z_{q} and is chosen to make the conditional probability (13) equal to the desired confidence level. For any value x that does not depend on the data, the probability of (12) depends on the values of α and δ. Hence the support region is not a confidence region in the strict sense of the word. However, the similarity between the support regions and the likelihood ratio regions allows us to gain some interesting theoretical insights. For example, under the assumption that the QTL lies at a marker locus and that the distance Δ between markers is small, we can evaluate approximately the probability that a support region does not contain the true QTL, by taking the expectation of (B1) in Appendix B with respect to Z_{q} = z. The result of some simple approximations is
For problems involving a single parameter, e.g., for backcrosses, recombinant inbreds, or intercrosses where we estimate only α and ignore δ, the factor in square brackets in (15) immediately preceding the exponential would be [(ξ^{2} + x)/ξ^{2}]^{1/2}. It is easy to see that at least for comparatively large values of ξ, the coverage probability for a given value of x is relatively insensitive to this change of dimension.
An approximation for the expected size of a support region, which is valid for dense markers (∼1 cM), is given in Appendix B. A less precise but more easily interpreted approximation, valid when ξ ≫ x, is obtained by approximating the normal density in (B2) with mean ξ by a point mass at ξ, then taking two terms of the Taylor series expansion of ln[ξ^{2}/(ξ^{2} – x)], which yields
Bayesian credible regions: Given a prior probability for the location of the QTL and for the noncentrality parameters (ξ_{1}, ξ_{2}), a set having a posterior probability of 1 – γ is called a Bayesian credible region. Fisher (1934), in his classical study of ancillarity, showed in effect that under certain conditions Bayesian credible sets are in fact 1 – γ confidence regions having many desirable properties. Cobb (1978) pointed out that a special class of statistical problems having the required structure are “change-point” problems, which have been studied extensively from this point of view by Zhang (1991). Feingold et al. (1993) and Kruglyak and Lander (1995) have noted the similarity between estimating the location of a change-point and estimating the location of a trait locus from data on mapped markers. A consequence of this history is the expectation that a Bayesian credible region for a uniform prior distribution on the location of the QTL will provide satisfactory confidence regions.
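A credible region of this type can be sketched once the posterior has been evaluated on a grid of loci. The code below assumes the (possibly unnormalized) posterior weights are already available; it does not reproduce the priors or the likelihood used in this article, and the function name is hypothetical. Loci are added in order of decreasing posterior weight until their share of the total reaches 1 − γ.

```python
def credible_region(positions, posterior_weights, gamma=0.05):
    """Highest-posterior set of grid loci with posterior mass >= 1 - gamma.

    positions:         grid of loci (e.g., in cM).
    posterior_weights: nonnegative posterior weight at each locus
                       (normalization is handled here).
    """
    total = sum(posterior_weights)
    ranked = sorted(zip(posterior_weights, positions), reverse=True)
    region, mass = [], 0.0
    for weight, position in ranked:
        if mass >= (1.0 - gamma) * total:
            break  # enough posterior mass collected
        region.append(position)
        mass += weight
    return sorted(region)
```

Ranking by posterior weight rather than taking a symmetric window around the mode lets the region be asymmetric or even disconnected when the posterior has more than one peak.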
A Bayesian credible region B_{γ} is constructed by including all loci v whose posterior density given the data exceeds c_{γ}, i.e.,
Comparison study: Using simulated tomato genomes, we constructed the likelihood confidence region, the 1.0- and 1.5-LOD support regions, and the Bayes credible regions, with the three different priors mentioned above. However, only the results from Bayes credible sets with a mixture of normal priors are included in Tables 2 and 3. For each tomato, the crossover process for the chromosome containing the QTL was generated using the Haldane mapping function and the phenotype y_{i} was assigned the value
We performed the simulations for the dominance model (δ=α), with ξ= 5, 7.5, and 10.0. The trait locus was either at a marker, midway between markers, or randomly assigned. We generated 1000 sets of 350 tomatoes and calculated the average size and the probability of covering the true locus given a map with Δ = 1, 5, and 10 cM. Interval mapping was used throughout.
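Generating a crossover process under the Haldane mapping function can be sketched as a two-state Markov chain along equally spaced markers: between adjacent markers Δ cM apart, the transmitted allele switches with probability r = (1 − exp(−2Δ/100))/2. The code below is an illustrative sketch with hypothetical function names, not the simulation program used for Tables 2 and 3.

```python
import math
import random

def simulate_gamete(n_markers, spacing_cm, rng):
    """Alleles (1 = A, 0 = B) transmitted by an F1 parent at n_markers
    equally spaced loci, with no interference (Haldane)."""
    r = 0.5 * (1.0 - math.exp(-2.0 * spacing_cm / 100.0))
    allele = rng.randint(0, 1)        # strand at the first marker
    alleles = [allele]
    for _ in range(n_markers - 1):
        if rng.random() < r:          # recombination in this interval
            allele = 1 - allele
        alleles.append(allele)
    return alleles

def simulate_intercross_individual(n_markers, spacing_cm, rng):
    """Number of A alleles at each marker for one F2 individual,
    obtained as the sum of two independent F1 gametes."""
    maternal = simulate_gamete(n_markers, spacing_cm, rng)
    paternal = simulate_gamete(n_markers, spacing_cm, rng)
    return [m + p for m, p in zip(maternal, paternal)]

if __name__ == "__main__":
    rng = random.Random(0)
    # A 100-cM chromosome with markers every 5 cM (21 markers).
    print(simulate_intercross_individual(21, 5.0, rng))
```

Phenotypes would then be generated from the genotype at the chosen QTL position plus normal noise.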
Both the 1.5-LOD (x = 6.9) support regions and the Bayesian credible regions provided at least 95% coverage under all simulated conditions. The support regions gave the smallest confidence regions for dense maps, while the Bayesian credible regions did the same for sparse maps. The coverage probability for the support regions obtained in the simulations is close to that predicted by the approximation (15). The approximate expected size provided by (B2) is close in the case of a dense map, but not otherwise. The likelihood method was conservative; and because it adapts to the observed value of the likelihood ratio statistic at the putative trait locus it resulted in the widest confidence regions for small values of the noncentrality parameter but was equivalent to the support region for the larger values ξ = 7.5 and 10. For all methods, the sizes of the intervals were largest when the trait was midway between markers. The Bayes credible sets were the widest and they fell short of the desired 95% for large values of ξ and sparse maps, especially when the trait was located at a marker.
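The correspondence between a LOD drop and the value x on the 2 ln LR scale is x = 2 ln(10) × LOD, which gives x ≈ 4.6 for 1 LOD and x ≈ 6.9 for 1.5 LOD. The sketch below applies this to a profile of 2 ln LR values on a grid, taking the support region to be all loci whose value is within x of the maximum. The function names and the example profile are hypothetical.

```python
import math

def lod_to_x(lod_drop):
    """Convert a drop in LOD units to the 2 ln LR scale."""
    return 2.0 * math.log(10.0) * lod_drop

def support_region(positions, two_ln_lr, lod_drop=1.5):
    """Grid loci whose 2 ln LR is within x of the maximum over the grid."""
    x = lod_to_x(lod_drop)
    peak = max(two_ln_lr)
    return [p for p, value in zip(positions, two_ln_lr) if value >= peak - x]

if __name__ == "__main__":
    positions = [0, 5, 10, 15, 20]        # cM, made-up grid
    profile = [2.0, 20.0, 25.0, 19.0, 3.0]  # made-up 2 ln LR values
    print(round(lod_to_x(1.5), 1))
    print(support_region(positions, profile, lod_drop=1.5))
```

The returned set need not be an interval if the profile has several peaks of similar height.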
The size of the confidence regions is relatively insensitive to the marker density when the distance between markers and the size of the region are roughly commensurate; but when ξ is large, the dense marker map provides substantially smaller regions.
We performed similar simulations for a backcross with essentially the same results (data not shown). The simulations were repeated with fewer tomatoes (N = 100) (results not shown). The size of the region was unchanged for all methods, and all methods had the right coverage probability when the locus was located at a marker. The coverage probability was substantially reduced for the case of the likelihood method and the Bayes method when the trait was located midway between markers (≈80% instead of 95%). The LOD support method had a slight drop in confidence coverage (≈90%), but was more robust than the other methods.
We have also simulated support regions under the conditions of Table 2 of Visscher et al. (1996), which involved a backcross with no dominance variance and marker spacings of 20 cM. At this intermarker distance 1-LOD (x = 4.6) regions had coverage probabilities ranging from 93 to 96% and in all cases gave smaller regions than the 95% bootstrap regions recommended by Visscher et al. (1996), while 1.5-LOD regions had 98–99% coverage probability and about the same expected sizes as the bootstrap regions. For example, for a heritability of 0.05 and a sample size of 500, which yield a noncentrality parameter ξ = 5.06, the coverage probability of the 1-LOD region based on 1000 simulations was 96%, and the expected size was 29 cM compared with 96% and 43 cM obtained by Visscher et al. (1996) for their bootstrap regions.
Another method to obtain confidence intervals for QTL location has been proposed by Mangin et al. (1994). This method amounts to fixing a putative QTL location and testing the hypothesis that there is no QTL between that location and either end of the chromosome. In the statistical literature on change-point analysis Worsley (1986) has discussed a similar idea and has pointed out that if there is another change-point (here QTL on the same chromosome) the method may produce an empty confidence set, since for every putative QTL there is evidence of another somewhere on the chromosome. Of course, the problem of detecting a second, linked QTL given an already detected QTL is itself interesting and important.
DISCUSSION
In this article we have discussed genome scanning methods to detect QTL in experimental genetics. Our goal has been to produce relatively simple approximations for quantities of interest, e.g., the false-positive error rate, power to detect a QTL, and coverage probability of a support region, so that one can easily address questions concerning sample size, marker density, etc., and can compare different designs. Our approximations for significance level and power seem adequate in this regard, but our approximations for the expected size of a support region are good only for dense markers (e.g., Δ ≈ 1 cM).
Although in a backcross the conventional LOD = 3 threshold produces false-positive rates <0.05 unless intermarker distances are small, it is anticonservative in an intercross even for intermarker distances as large as 25 cM without interval mapping.
Our approximations are based on the artificial assumption that markers are equally spaced and there are no missing data. If markers are not equally spaced, the approximations (4) and (9) can be modified by averaging the function ν with respect to the distribution of the distances Δ between markers. One can also use the original approximations with an average intermarker distance. (This should be the average distance in the neighborhood of detected QTL if one adds additional markers to promising regions.) Since (4) and (9) are insensitive to minor changes in the assumed value of Δ, one can reasonably expect such refinements to have little practical effect. If we use interval mapping to impute missing marker data, the resulting process is more correlated than would be the case if the data were not missing, so the threshold obtained under the assumption of no missing data is still appropriate and, in fact, slightly conservative.
The assumption of normality is robust in the sense that the regression statistics we use are approximately normally distributed in large samples, so our approximations for significance level and power are valid in large samples. However, it is possible that by using a more appropriate model, e.g., a mixture model if the nonnormality arises from large QTL effects, one can obtain greater power, although large QTL effects will be comparatively easy to detect with a suboptimal procedure.
When using a backcross or intercross, intermarker distances up to ∼10 cM are almost as powerful as continuously distributed markers. Except at intermarker distances of ∼20 cM or more, or when using a design involving a large recombination rate, e.g., a recombinant inbred design or advanced intercross design, there is little gain in power from interval mapping, which in any event does not provide nearly as much power as more closely spaced markers.
Although intercross designs involve a 2-d.f. statistic and hence a higher threshold than a backcross design, and have larger residual variance, intercross designs are usually more powerful than backcross designs, unless (a) the effect of the gene is large and additive or (b) there is dominance and the dominance deviation has the same sign as the additive genetic effect. A backcross design can lose considerable power in the presence of even a small departure from additivity if the incorrect parental strain is used for the backcross. A recombinant inbred design can be more efficient than an intercross, except when dominance effects are large compared to additive effects. Because of the high recombination rate associated with recombinant inbreds, especially those based on recurrent sib mating, power to detect linkage falls off rapidly with intermarker distance when a QTL is located midway between markers. To avoid this loss of power when using an inbred design based on recurrent sib mating, intermarker distances should be no more than 5 cM and preferably should be even less. Similar considerations apply to advanced intercross lines (Darvasi and Soller 1995).
We have also presented three methods of constructing confidence regions for the location of QTL: the likelihood method, Bayes credible sets, and support regions. The support method and the Bayesian credible sets seem roughly comparable in large samples, but the coverage probability of the support method is more robust to changes in the sample size. Both methods are better than the likelihood ratio method, which often has a coverage probability substantially smaller than the nominal level, except for the case of dense markers.
The size of a confidence region depends on the noncentrality parameter and the density of the markers in the neighborhood of the QTL. When the noncentrality parameter is ∼5, which provides power of ∼0.9 for QTL detection, little is gained by having markers more closely spaced than ∼10 cM; but when the noncentrality parameter is 7.5, intermarker distances of 1–5 cM provide shorter confidence regions. A reasonable guideline is to achieve a marker density in the neighborhood of a putative QTL about equal to the expected half length of a support region for a QTL of that strength.
When dominance effects are relatively small and markers sufficiently dense, support regions from recombinant inbred designs are often about one-fourth as large as from intercross designs, which in turn are substantially smaller than from backcross designs. Advanced intercross designs (Darvasi and Soller 1995) are also especially powerful for fine localization of QTL. In almost all cases, however, the size of the confidence regions is on the order of several centimorgans unless the sample size is considerably larger than what is required to detect linkage, so there is a continuing need to develop better designs for fine localization of QTL.
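As a rough check on the pairing of a noncentrality parameter of about 5 with power of about 0.9, one can treat the statistic at the QTL as normal with mean ξ and unit variance, so that power is approximately 1 − Φ(b − ξ). This crude first-order sketch ignores the correction terms of the power approximations developed in this article, and the threshold b used below is an assumed, illustrative value (LOD = 3 on the 1-d.f. scale); the function name is hypothetical.

```python
import math

def first_order_power(xi, b):
    """Crude power approximation 1 - Phi(b - xi) for a unit-variance normal statistic."""
    return 0.5 * math.erfc((b - xi) / math.sqrt(2.0))

# Assumed threshold: square root of 2*ln(10)*3, i.e. LOD = 3 with 1 d.f.
b_assumed = math.sqrt(2.0 * math.log(10.0) * 3.0)

power_by_xi = {xi: first_order_power(xi, b_assumed) for xi in (4.0, 5.0, 7.5)}
for xi, power in power_by_xi.items():
    print(xi, round(power, 3))
```

Under these assumptions ξ = 5 gives power close to 0.9, and ξ = 7.5 gives power very near one, consistent with the distinction drawn above between detecting a QTL and localizing it.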
We have not explicitly addressed the complexities associated with identifying multiple, possibly linked, possibly interacting, QTL. For mapping qualitative traits in humans, we have discussed these issues (Dupuis et al. 1995), and expect to return to them for QTL mapping. For example, once a linked QTL is located, conditional search removes the effect of that QTL by subtracting its (estimated) genotypic contribution from the phenotypic value to define a new regression model, hence a new log-likelihood ratio statistic, to search for additional QTL. Suppose an intercross design is used and, for simplicity, we use a 1-d.f. statistic to detect a QTL of purely additive effect. Assume also that we know exactly the location of a QTL making contribution v^{2} to the heritability. The (asymptotic) correlation function between the new and old processes at each unlinked marker is (1 – v^{2})^{1/2}, and under the assumption of no epistasis, the noncentrality parameter for the new statistic is larger by the factor 1/(1 – v^{2})^{1/2}. Hence a large QTL effect v^{2} is necessary at the detected locus to get a reasonable “gain” from the conditional search, although a large value of v^{2} also leads to a new process only weakly correlated with the original search process, which increases the likelihood that conditional search will incur a false-positive error. Of course, there must be another QTL of sufficiently large effect for the gain in noncentrality to be helpful. Rough calculations suggest that suitable combinations of QTL effects will occur relatively rarely.
Similar considerations are relevant to recently suggested multiple regression methods, e.g., Zeng (1994) and Jansen (1994), whereby one searches, for example, a given chromosome or chromosomal arm for a QTL while controlling for QTL on other chromosomes through arbitrarily placed markers. In comparison with conditional search, this method has the potential advantage of controlling the phenotypic variability due to multiple QTL, but at least initially has the disadvantage that the success of the control depends on fortuitously placing the control markers close to true QTL. Straightforward calculations show that the control markers on other chromosomes have no effect on the asymptotic distribution of the log-likelihood ratio process along the currently searched (unlinked) chromosome, although they do reduce the number of degrees of freedom available to estimate the error variance. By considering one chromosome at a time and adding the chromosome-wide false-positive rates, one obtains an asymptotic upper bound on the genome-wide false-positive rate. Because of the independent assortment of chromosomes, this upper bound should not be overly conservative.
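The trade-off in conditional search can be tabulated directly from the two factors just given. The sketch below only evaluates (1 – v^{2})^{1/2} and its reciprocal for a few values of v^{2}; the function name and the chosen values of v^{2} are illustrative.

```python
import math

def conditional_search_factors(v2):
    """Return (correlation with the original process, gain in noncentrality)."""
    if not 0.0 <= v2 < 1.0:
        raise ValueError("v2 must lie in [0, 1)")
    root = math.sqrt(1.0 - v2)
    return root, 1.0 / root

factors = {v2: conditional_search_factors(v2) for v2 in (0.05, 0.1, 0.2, 0.5)}
for v2, (corr, gain) in factors.items():
    print(f"v^2 = {v2:.2f}  correlation = {corr:.3f}  gain = {gain:.3f}")
```

Even v^{2} = 0.2 inflates the noncentrality parameter by only about 12%, while v^{2} = 0.5 gives a gain of about 1.41 at the cost of a correlation of only about 0.71 with the original search process.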
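The remark that the additive bound is not overly conservative can be illustrated by comparing the sum of chromosome-wide rates with the value obtained if the chromosome-wise tests were exactly independent. The per-chromosome rates below are hypothetical, and exact independence is an assumption made for the comparison.

```python
def genomewide_rates(chrom_rates):
    """Return (additive upper bound, rate under exact independence)."""
    upper = min(1.0, sum(chrom_rates))
    prob_no_false_positive = 1.0
    for rate in chrom_rates:
        prob_no_false_positive *= (1.0 - rate)
    return upper, 1.0 - prob_no_false_positive

# Hypothetical example: 20 chromosomes, each searched at a rate of 0.0025.
upper_bound, independent_rate = genomewide_rates([0.0025] * 20)
print(upper_bound, independent_rate)
```

For small chromosome-wide rates the two numbers nearly coincide, so little is lost by simply adding the rates.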
The second method discussed by Dupuis et al. (1995), simultaneous search, will for the reasons given there rarely be useful in the absence of epistasis. Preliminary calculations suggest it can be very helpful when there is substantial epistasis.
We expect to return to the problem of detecting multiple, possibly linked, QTL in a future article.
Acknowledgments
This research was supported by National Institutes of Health grant HG00848 and National Science Foundation grant DMS 9704324.
APPENDIX A
Power of interval mapping: We first consider a backcross and suppose there is a single trait locus (on any particular chromosome) at q. Let Z_{d} denote the signed square root of twice the log-likelihood ratio (incorporating interval mapping), which for large N behaves like a piecewise smooth Gaussian process. We use the basic decomposition
The noncentrality parameter ξ_{q} can be evaluated by a direct computation starting from a suitable explicit representation of the interval mapping statistic. See Rebai et al. (1995) for such a representation in a complete interference model; their equation is easily modified for the Haldane model of no interference. We present here an alternative method, which will be easier to apply to intercross designs, where the explicit statistic is much clumsier to manipulate. We begin with the following asymptotically equivalent expression for the square root of (3):
We can also give as an approximation for the power of the interval mapping process
A more detailed calculation along the lines of that given for a backcross yields an expression for ξ_{q}, which in general is somewhat complicated. In the special case that q is the midpoint between two markers at distance Δ, the parameter ξ_{q} is the norm of the vector with coordinates
APPENDIX B
Approximations for the conditional probability of (14) and the expected size of a LOD support region: To approximate the conditional probability of (14), we begin with the following lemma.
Lemma. Let Z_{t} = (Z_{1,t}, Z_{2,t}) where Z_{1,t} and Z_{2,t} are independent Gaussian processes with covariance functions satisfying
For our particular application, R_{i}(t) = exp(–β_{i} t). Putting b^{2} = ∥z∥^{2} + x and assuming x^{1/2}z_{2} ≪ z_{1}, which will be the case with probability close to one unless there is overdominance, we obtain
We can also obtain a rough approximation for the expected size of the support region as follows. First consider the one-dimensional case of a backcross or recombinant inbreds and assume as before that a marker is at the QTL q. Then the expected size of the support region is
Footnotes

Communicating editor: S. Tavaré
 Received April 14, 1997.
 Accepted September 21, 1998.
 Copyright © 1999 by the Genetics Society of America