Statistical Methods for Mapping Quantitative Trait Loci From a Dense Set of Markers
Josée Dupuis, David Siegmund

## Abstract

Lander and Botstein introduced statistical methods for searching an entire genome for quantitative trait loci (QTL) in experimental organisms, with emphasis on a backcross design and QTL having only additive effects. We extend their results to intercross and other designs, and we compare the power of the resulting test as a function of the magnitude of the additive and dominance effects, the sample size and intermarker distances. We also compare three methods for constructing confidence regions for a QTL: likelihood regions, Bayesian credible sets, and support regions. We show that with an appropriate evaluation of the coverage probability a support region is approximately a confidence region, and we provide a theoretical explanation of the empirical observation that the size of the support region is proportional to the sample size, not the square root of the sample size, as one might expect from standard statistical theory.

Recent advances in genetics have led to the identification of genes responsible for certain diseases such as cystic fibrosis, Huntington's disease, breast cancer, and others. Linkage analysis, which is especially effective when the disease or trait of interest exhibits Mendelian inheritance, played an important role in the identification of those genetic loci. When the disease is complex in nature (incomplete penetrance, multiple loci involved, etc.) or quantitative, finding the genetic loci involved in the etiology of the trait can be more difficult. In particular, in human studies, it is difficult to separate environmental and genetic effects. However, with experimental organisms, studies can be designed to provide a similar environment for all individuals, so that the variation in phenotypes can be attributed mainly to genetic factors; and breeding designs can control the nature of the differences in genotype. Studies of experimental organisms can provide useful information for agricultural purposes and/or contribute to our understanding of human disease via animal models. Moreover, it is now feasible to search the entire genome for a gene locus influencing a trait of interest.

Statistical methods for mapping quantitative trait loci (QTL) from experimental crosses using a dense set of markers were introduced by Lander and Botstein (1989). Applications have involved (i) tomatoes (Paterson et al. 1991) to identify loci influencing traits such as mass per fruit, pH, and soluble solid concentration; (ii) grain yield in maize (Stuber et al. 1992); (iii) high blood pressure in rats (Jacob et al. 1991); and (iv) fatness and growth rate in pigs (Andersson et al. 1994). In their original article, Lander and Botstein suggested statistical tests for general designs, but provided guidelines for declaring statistical significance for the backcross design only. Paterson et al. (1991) used these guidelines for intercross designs, but to avoid an increase in the false-positive error rate, they restricted themselves to a 1-d.f. statistic that ignored dominance effects. Churchill and Doerge (1994) proposed use of the permutation distribution to define thresholds for all design types. This method has the advantage that it makes no assumptions on the distribution of the phenotype. However, the thresholds depend on the observed data, so they need to be computed by Monte Carlo for each study; hence the method is less useful for analyzing and comparing different designs.

In this article we propose for intercross and other designs simple approximations that can be used to compare different designs under various conditions or the same design for different sample sizes or marker densities. We also discuss and compare three methods for constructing confidence intervals for a QTL. We assume throughout that markers are equally spaced, that there are no missing data, and except where noted that recombination occurs without interference. While these are artificially simple assumptions, at the cost of some complication they can all be weakened. Rough preliminary calculations suggest that the resulting picture would not change substantially unless the assumptions are radically altered. The sections on QTL detection and confidence regions are independent and can be read in any order.

## RESULTS

The model and likelihood ratio statistics: The starting point for our considerations is a cross between two strains that differ substantially in the quantitative trait of interest. The parental lines can be “pure” breeding lines obtained through inbreeding or simply two different strains of the same organism with widely differing mean phenotype. A cross is obtained from the two parental lines, creating the first generation of offspring (generation F1). The F1 generation is then allowed to mate together to produce the second generation (F2), the intercross. We assume that the genotypes of the parental lines are completely different, so that at any marker locus we can label alleles from the strain with the larger mean phenotype as A, and alleles from the other strain as B. At each locus, each individual of the F2 generation will have zero, one, or two A alleles. A backcross is generated by mating an individual of the F1 generation to one from the parental line. If the parental line with the smaller mean for the trait is used, the offspring from the backcross will have zero or one A alleles at any locus on their genome.

A standard model for quantitative traits (e.g., Kempthorne 1957) in notation suitable for our purposes is the following. Let $y_i$ be the phenotypic value of individual $i$, and let $x_{ij}(d)$ be the number of A alleles at locus $d$ on the $j$th chromosome. The locus is identified by its genetic distance $d$ from one end of the chromosome. If there exists only one QTL on the $j$th chromosome that influences the trait and its location is $q$, the phenotype can be modeled as

$$y_i = \mu + \alpha x_{ij}(q) + \delta\,1_{\{x_{ij}(q)=1\}} + e_{ij}, \tag{1}$$

where $\mu$, $\alpha$, and $\delta$ are the phenotypic mean, additive effect, and dominance effect, respectively, and $1_C$ equals 1 or 0 according to whether the condition $C$ is satisfied or not. The $e_{ij}$ are residual effects, which include both environmental effects and the genetic effects of QTL on other chromosomes than the $j$th. As we will be considering only a single chromosome at a time, we drop the subscript $j$ in what follows. We assume that $x_i(q)$ and $e_i$ are uncorrelated, which would be the case if there is no epistasis and the environmental effect is uncorrelated with the genetic effects. We also assume that the $e_i$ are independent normally distributed random variables with mean 0 and variance $\sigma_e^2$. The residual variance $\sigma_e^2$ equals the sum of the environmental variance and the genetic variance for those QTL not on the $j$th chromosome. Without the normality assumption the regression-like statistics given below are not exact maximum-log-likelihood ratios, so it is possible that more powerful tests can be found. However, by virtue of the central limit theorem the various approximations to significance level, power, etc. will still be valid in large samples even if the $e$'s are not normally distributed. In fact, for the significance level, it is not necessary to assume any stochastic model for the $y$'s. One can simply regard the $y$'s as fixed numbers and the regression statistic as essentially a weighted (by the $y$'s) sum of the $x$'s, to which the central limit theorem applies under an assumption that the empirical behavior of the $y$'s is about what it would be if they were independent and identically distributed observations from a fixed distribution.
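To make model (1) concrete, the following minimal sketch simulates intercross phenotypes at a single QTL. It assumes the two alleles at the QTL are inherited independently with probability 1/2 each of being A, as in an F2; the function name and the parameter values are illustrative choices, not taken from the article.

```python
import random

def simulate_intercross_trait(n, mu, alpha, delta, sigma_e, seed=0):
    """Draw (x, y) pairs from model (1) for n F2 individuals at one QTL.

    x is the number of A alleles (0, 1, or 2); y is the phenotype.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Each parent transmits A or B with probability 1/2 (assumed F2 segregation).
        x = rng.randint(0, 1) + rng.randint(0, 1)
        dominance = delta if x == 1 else 0.0
        y = mu + alpha * x + dominance + rng.gauss(0.0, sigma_e)
        data.append((x, y))
    return data

# Illustrative values only: N = 350, mu = 10, alpha = 1, delta = 0.5, sigma_e = 1.
sample = simulate_intercross_trait(350, 10.0, 1.0, 0.5, 1.0)
counts = [sum(1 for x, _ in sample if x == g) for g in (0, 1, 2)]
print(counts)  # roughly in 1:2:1 proportions
```

Setting `sigma_e` to zero gives phenotypes that depend on the genotype alone, which is a convenient check that the additive and dominance terms are coded as in (1).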

For backcross data, because $x_i(q) = 0$ or 1, the additive and dominance effects cannot be estimated separately, and the model reduces to

$$y_i = \mu + \alpha^* x_i(q) + e_i, \tag{2}$$

where the parameter $\alpha^*$ in (2) equals $\alpha + \delta$ from the model (1). This is the model developed by Lander and Botstein (1989), which we review briefly here. Treatment of the full model (1) is shown later in this article.

If one observes the genotype of a marker at a putative trait locus $d$, the maximum-log-likelihood ratio at $d$ is given approximately by

$$2\ln LR(d) \approx -N\ln\left(1 - \frac{\hat\alpha_d^2}{4\hat\sigma_y^2}\right) \approx \frac{N\hat\alpha_d^2}{4\sigma_e^2}, \tag{3}$$

where $N$ is the number of typed individuals, $\hat\alpha_d$ is the maximum-likelihood estimate of the parameter $\alpha^* = \alpha + \delta$, and $\hat\sigma_y^2$ is the maximum-likelihood estimate of the phenotypic variance $\sigma_y^2 = \sigma_e^2 + \alpha^{*2}/4$. It is important to note that both $\sigma_y^2$ and $\sigma_e^2$ depend on the design and for a backcross differ from the corresponding quantities for an intercross, although this difference is not reflected in the notation. Note also that (3) involves natural logarithms; the marginal asymptotic distribution of (3) at any unlinked locus is $\chi^2$ with 1 d.f. To convert this and subsequent expressions to the LOD scale, one can divide by $2\ln 10 \approx 4.6$. For the first approximation in (3) we have replaced the empirical variance of $\{x_i(d)\}$, namely $N^{-1}\sum_i [x_i(d) - N^{-1}\sum_j x_j(d)]^2$, by its asymptotic value of ¼; for the second we have approximated the logarithm by the first term of its Taylor expansion and have replaced the estimate $\hat\sigma_y$ by the parameter $\sigma_e$ that it estimates under the hypothesis of no linkage on the $j$th chromosome. Since the trait locus $q$ is typically unknown, the log-likelihood ratio is maximized over all marker locations $d$ and chromosomes $j$. At each marker, assumed to be a QTL, the log-likelihood ratio is computed exactly. Between markers, Lander and Botstein (1989) suggest the use of “interval mapping,” which consists of treating the unobserved marker information as missing data and using the EM algorithm (Dempster et al. 1977) to evaluate the log-likelihood ratio at $d$ based on the marker information at the flanking markers. A noniterative, regression-based alternative to the EM algorithm was proposed by Haley and Knott (1992) and was shown to give equivalent results provided $N$ is sufficiently large.
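As an illustration of the statistic at a fully typed marker, the sketch below computes twice the log-likelihood ratio for the simple linear regression of phenotype on genotype under normal errors, which equals $-N\ln(1 - r^2)$ with $r$ the sample correlation; approximation (3) follows when the empirical genotype variance is replaced by ¼. The function names are ours, and the data in the usage lines are made up purely for demonstration.

```python
import math

def marker_lr_statistic(x, y):
    """2 ln LR for regressing phenotype y on marker genotype x: -N ln(1 - r^2)."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r_squared = sxy * sxy / (sxx * syy)  # assumes x and y both vary
    return -n * math.log(1.0 - r_squared)

def to_lod(two_ln_lr):
    """Convert 2 ln LR to the LOD scale by dividing by 2 ln 10."""
    return two_ln_lr / (2.0 * math.log(10.0))

# Toy backcross data: genotypes coded 0/1, phenotypes invented for illustration.
genotypes = [0, 1, 0, 1]
phenotypes = [1.0, 2.0, 1.5, 2.5]
statistic = marker_lr_statistic(genotypes, phenotypes)
print(statistic, to_lod(statistic))
```

The same conversion shows that the conventional LOD = 3.0 corresponds to a value of about 13.8 on the $2\ln LR$ scale.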

Detection of linkage in backcrosses: Because the log-likelihood ratio is maximized over the entire genome, it is unclear whether the conventional threshold of LOD = 3.0 [equivalently $2\ln LR(d) > 13.8$] to declare statistical significance is appropriate in the present context. To address this issue, Lander and Botstein (1989) proposed the approximation of $N^{1/2}\hat\alpha_d/(2\sigma_e)$ [cf. (3)] by an Ornstein-Uhlenbeck process. This can be justified by the central limit theorem and a straightforward calculation of covariances. For the case of complete marker information (continuous markers), they gave thresholds depending on the length of the genome and the number of chromosomes searched (cf. their Proposition 2). For the case of a discrete set of markers evenly distributed over the genome, they obtained thresholds from a simulation study conducted under the assumption of no interference.
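The covariance calculation can be illustrated numerically. The sketch below assumes Haldane's map function for recombination without interference (an assumption on our part about how the no-interference model is parameterized; the function names are also ours). It checks that the correlation between the 0/1 backcross genotype indicators at two loci, $1 - 2\theta$, equals $\exp(-0.02\,d)$ for a distance of $d$ cM, which is the exponential covariance of an Ornstein-Uhlenbeck process.

```python
import math

def haldane_recombination_fraction(distance_cm):
    """Recombination fraction under Haldane's no-interference map function."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))

def backcross_genotype_correlation(distance_cm):
    """Correlation of the 0/1 genotype indicators at two loci: 1 - 2*theta."""
    return 1.0 - 2.0 * haldane_recombination_fraction(distance_cm)

# Compare with exp(-beta * d) for beta = 0.02 per cM.
for d in (1, 5, 10, 20):
    print(d, round(backcross_genotype_correlation(d), 4), round(math.exp(-0.02 * d), 4))
```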

For the case of equispaced markers along the genome, Feingold et al. (1993) proposed an approximation, which agrees closely with the results from Lander and Botstein's simulations. That approximation is

$$P\{\max_k 2\ln LR(k\Delta) > a\} \approx 1 - \exp\{-2C[1 - \Phi(b)] - 2\beta L b\,\phi(b)\,\nu(b\{2\beta\Delta\}^{1/2})\}, \tag{4}$$

where $a = b^2$, $L$ is the total length of the genome, $C$ is the number of chromosomes, $\beta = 2\lambda$, $\lambda$ being the rate of crossovers ($\lambda = 1$ if $L$ is in Morgans and $\lambda = 0.01$ if $L$ is in centimorgans), $\Delta$ is the distance between markers in the same units as $L$, and $\Phi(x)$ and $\phi(x)$ are the standard normal cumulative distribution and density functions, respectively. The function $\nu$ is a discreteness correction for the distance $\Delta$ between markers. The defining expression can be found in Siegmund (1985), p. 82. Often it is adequate to approximate $\nu(x)$ by $\exp(-0.583x)$, which is valid for $x$ less than about 2, while for $x > 2$ the first four terms of the defining infinite series provide a reasonable approximation. For the case of continuous markers $\Delta = 0$, so $\nu = 1$, and (4) is essentially the same as the approximation of Lander and Botstein (1989).
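Approximation (4) is easy to evaluate and invert numerically. The sketch below is one way to do so, under two stated assumptions: $\nu(x)$ is replaced throughout by $\exp(-0.583x)$ (so results for wide marker spacing, where the argument exceeds about 2, are rough), and the threshold is found by simple bisection. Function names and the search bracket are ours.

```python
import math

def nu(x):
    """Discreteness correction via the exp(-0.583 x) approximation (x below ~2)."""
    return math.exp(-0.583 * x)

def backcross_false_positive_rate(a, n_chrom, genome_len, beta, marker_spacing):
    """Approximation (4); genome_len and marker_spacing in the units implied by beta."""
    b = math.sqrt(a)
    upper_tail = 0.5 * math.erfc(b / math.sqrt(2.0))             # 1 - Phi(b)
    density = math.exp(-0.5 * b * b) / math.sqrt(2.0 * math.pi)  # phi(b)
    correction = nu(b * math.sqrt(2.0 * beta * marker_spacing))
    exponent = (2.0 * n_chrom * upper_tail
                + 2.0 * beta * genome_len * b * density * correction)
    return 1.0 - math.exp(-exponent)

def backcross_threshold(target, n_chrom, genome_len, beta, marker_spacing):
    """Bisection for the a giving the target rate; the rate decreases in a on [4, 60]."""
    lo, hi = 4.0, 60.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        rate = backcross_false_positive_rate(mid, n_chrom, genome_len, beta, marker_spacing)
        if rate > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# 12 chromosomes, 1200 cM in total, beta = 0.02 per cM, dense markers.
a_dense = backcross_threshold(0.05, 12, 1200.0, 0.02, 0.0)
print(round(a_dense, 2), round(a_dense / (2.0 * math.log(10.0)), 2))
```

For this idealized genome the dense-marker threshold comes out near $a = 14.6$, and a positive marker spacing lowers it.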

For a backcross design with a QTL located exactly at a marker, Feingold et al. (1993) gave as an approximation for the power

$$P\{\max_k 2\ln LR(k\Delta) > a\} \approx 1 - \Phi(b - \xi) + \phi(b - \xi)\left[\frac{2\nu}{\xi} - \frac{\nu^2}{b + \xi}\right], \tag{5}$$

where $a = b^2$, $\xi = \{N\ln[1 + (\alpha + \delta)^2/(4\sigma_e^2)]\}^{1/2}$, and $\nu = \nu(b\{2\beta\Delta\}^{1/2})$, as defined previously. The parameter $\xi$ is the noncentrality parameter of (3) expressed in terms of the parameters of the model (2). The first term in (5) is the probability the process is above the threshold at the QTL; the second is the probability that it is below at the QTL but crosses the threshold at some nearby marker. Unless the markers are closely spaced, the first term by itself is a reasonably good approximation. When the QTL is located between markers, it is necessary to analyze the (correlated) process at the two flanking markers. The more complex approximation, which requires a one-dimensional numerical integration, can be found in Dupuis (1994). The noncentrality parameter at a flanking marker at distance $\Delta_1$ from the QTL is

$$\xi\exp(-\beta\Delta_1), \tag{6}$$

where $\beta$ and $\xi$ are as defined above.
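The power approximation (5) and the decay (6) can be sketched in a few lines. The code reads the bracket in (5) as $2\nu/\xi - \nu^2/(b+\xi)$ and again substitutes $\exp(-0.583x)$ for $\nu(x)$; both are assumptions of this sketch, as are the function names and the example parameter values.

```python
import math

def nu(x):
    """Discreteness correction via the exp(-0.583 x) approximation (x below ~2)."""
    return math.exp(-0.583 * x)

def normal_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def backcross_noncentrality(n, alpha, delta, sigma_e):
    """xi = {N ln[1 + (alpha + delta)^2 / (4 sigma_e^2)]}^(1/2)."""
    return math.sqrt(n * math.log(1.0 + (alpha + delta) ** 2 / (4.0 * sigma_e ** 2)))

def noncentrality_at_flanking_marker(xi, beta, distance):
    """Expression (6): exponential decay with distance from the QTL."""
    return xi * math.exp(-beta * distance)

def backcross_power_at_marker(a, xi, beta, marker_spacing):
    """Approximation (5) for a QTL located exactly at a marker."""
    b = math.sqrt(a)
    v = nu(b * math.sqrt(2.0 * beta * marker_spacing))
    near_miss = normal_pdf(b - xi) * (2.0 * v / xi - v * v / (b + xi))
    return 1.0 - normal_cdf(b - xi) + near_miss

# Illustrative values: N = 100, alpha + delta = 1, sigma_e = 1, threshold a = 14.6.
xi = backcross_noncentrality(100, 1.0, 0.0, 1.0)
print(round(xi, 3), round(backcross_power_at_marker(14.6, xi, 0.02, 0.0), 3))
print(round(noncentrality_at_flanking_marker(xi, 0.02, 10.0), 3))
```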

From (6) and (4) we see the importance of the parameter β, which equals 0.02 for backcross designs, but can assume a larger value for other designs (e.g., recombinant inbred designs). In (4), β multiplies the length of the genome, so a larger value requires a larger threshold to maintain a given false-positive error rate. From (6) we see that it also governs the rate at which the noncentrality parameter decays as a function of the distance from QTL to flanking marker. A large value of β means a rapid falling off in power to detect the QTL as a function of that distance. On the other hand, it also provides the possibility for more precise fine mapping of the QTL location, because a large β leads to a sharper delineation of the “peak” in the process 2 ln LR(d) that identifies the location of the QTL. We return to these issues below.

The preceding analysis is concerned with the likelihood ratio process observed at the discrete set of marker loci. To mitigate the problems indicated by (6) when the QTL is in the center of a marker interval, Lander and Botstein (1989) suggested the technique of interval mapping, i.e., treating the unobserved intervals between marker loci as missing data and using the EM algorithm to interpolate between the observed data points. Rebai et al. (1994, 1995) have used Rice's formula for the expected number of upcrossings of a level by a piecewise smooth Gaussian process to give approximations for the false-positive rates when using interval mapping. The method is analytically tractable when one assumes complete interference, i.e., the recombination probability and map distance in Morgans are equal. Single chromosome simulations performed by these authors and our own whole genome simulations (data not shown) indicate that the approximation is very good when the sample size is reasonably large and markers are not too closely spaced. For dense markers (∼1 cM) it is conservative. A modification suitable for small samples can be inferred from Johnstone and Siegmund (1989).

An argument of Siegmund and Worsley (1995) can be adapted to give a simple approximation for the power of an interval mapping test. See Appendix A.

Intercrosses: Most previous theoretical analyses have concentrated on backcrosses and consequently have ignored dominance effects. Paterson et al. (1991) used the full model (1) to locate QTL in tomatoes in an intercross, and estimated the dominance effects. However, to detect linkage, they used a 1-d.f. statistic that ignores the dominance effects. Here we analyze the 2-d.f. statistic involving both additive and dominance effects.

Consider the likelihood ratio statistic to test the general hypothesis that $\alpha = \delta = 0$ vs. the alternative that $\alpha \neq 0$ or $\delta \neq 0$. For intercross data the vectors with coordinates $x_i(d)$ and $1_{\{x_i(d)=1\}}$ ($i = 1, \ldots, N$) are asymptotically orthogonal. Therefore, the approximations used to obtain (3) now yield for the log-likelihood ratio at the marker $d$

$$2\ln LR(d) \approx -N\ln\left\{1 - \frac{\hat\alpha_d^2/2 + \hat\delta_d^2/4}{\hat\sigma_y^2}\right\} \approx \left[\left(\frac{N^{1/2}\hat\alpha_d}{2^{1/2}\sigma_e}\right)^2 + \left(\frac{N^{1/2}\hat\delta_d}{2\sigma_e}\right)^2\right]. \tag{7}$$

To define a significance level, we give an approximation under the hypothesis of no linkage to the distribution of the maximum of (7) over all possible values of $d$.

Let

$$X_d = \frac{N^{1/2}\hat\alpha_d}{2^{1/2}\sigma_e} \quad\text{and}\quad Y_d = \frac{N^{1/2}\hat\delta_d}{2\sigma_e}. \tag{8}$$

A straightforward application of the central limit theorem and calculation of covariances shows that when $\alpha = \delta = 0$, for large $N$, $X_d$ and $Y_d$ are approximately independent Ornstein-Uhlenbeck processes with mean 0 and covariance functions $e^{-2\lambda|t|}$ and $e^{-4\lambda|t|}$, respectively. An approximation to the tail distribution of the maximum of (7) is provided by

$$P\{\max_d 2\ln LR(d) \ge a\} \approx 1 - \exp\left\{-\left[C + \nu b^2 L\,\frac{\beta_1 + \beta_2}{2}\right]\exp\left(-\frac{b^2}{2}\right)\right\}, \tag{9}$$

where $\beta_1 = 2\lambda$, $\beta_2 = 4\lambda$, $a = b^2$, and $\nu = \nu(b\{\Delta(\beta_1 + \beta_2)\}^{1/2})$. As in the case of (4), this approximation does not take interval mapping into account. It is obtained by a suitable modification of Woodroofe's (1976) argument. For an idealized tomato genome consisting of 12 chromosomes of length 100 cM each and a dense set ($\Delta = 0$) of markers, the 0.05 false-positive threshold obtained from (9) is $a = 19.0$ (LOD = 4.13), in comparison with $a = 14.6$ (LOD = 3.17) for the backcross case. Although smaller thresholds are required when the intermarker distance is greater, for an intercross the conventional LOD = 3 threshold would lead to a false-positive rate greater than 0.05 even for intermarker distances of 25 cM. This stands in contrast to the case of a backcross, where the LOD = 3 threshold is conservative for intermarker distances down to ∼1 cM.
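The intercross threshold quoted for the idealized tomato genome can be reproduced from (9) with a short calculation. The sketch below assumes the $\exp(-0.583x)$ approximation for $\nu$ and solves for the threshold by bisection; the function names and the search bracket are ours.

```python
import math

def nu(x):
    """Discreteness correction via the exp(-0.583 x) approximation (x below ~2)."""
    return math.exp(-0.583 * x)

def intercross_false_positive_rate(a, n_chrom, genome_len, beta1, beta2, marker_spacing):
    """Approximation (9) for the 2-d.f. intercross statistic, with a = b^2."""
    b = math.sqrt(a)
    correction = nu(b * math.sqrt(marker_spacing * (beta1 + beta2)))
    bracket = n_chrom + correction * a * genome_len * 0.5 * (beta1 + beta2)
    return 1.0 - math.exp(-bracket * math.exp(-0.5 * a))

def intercross_threshold(target, n_chrom, genome_len, beta1, beta2, marker_spacing):
    """Bisection for the a giving the target rate; the rate decreases in a on [4, 60]."""
    lo, hi = 4.0, 60.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        rate = intercross_false_positive_rate(mid, n_chrom, genome_len,
                                              beta1, beta2, marker_spacing)
        if rate > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# 12 chromosomes of 100 cM, beta1 = 0.02 and beta2 = 0.04 per cM, dense markers.
a_intercross = intercross_threshold(0.05, 12, 1200.0, 0.02, 0.04, 0.0)
print(round(a_intercross, 2), round(a_intercross / (2.0 * math.log(10.0)), 2))

# False-positive rate if LOD = 3 (a = 13.8) were used with dense markers.
print(round(intercross_false_positive_rate(13.8, 12, 1200.0, 0.02, 0.04, 0.0), 2))
```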

Rebai et al. (1995) have given an approximation for the false-positive error rate when interval mapping is used. This approximation involves an elliptic integral, to be evaluated numerically, and so is more complicated than the analogous backcross approximation, which can be written in closed form involving only the exponential and inverse tangent functions. In fact, the mathematically correct form of (9) involves similar complications, although extensive numerical calculations show that there is very little difference between the mathematically correct approximation and the more convenient one given above, which is based on replacing the two parameters β1 and β2 associated with the two coordinate processes by their average value, (β1 + β2)/2. In this spirit one can modify the approximation of Rebai et al. (1995) to obtain a closed form approximation that is no more complicated than that obtained for a backcross and gives essentially the same numerical results as the more complicated, mathematically correct approximation.

Figure 1. Thresholds for 350 simulated tomato genomes.

To check the accuracy of (9) and our interval mapping approximation, we simulated thresholds for the log-likelihood ratio based on an intercross sample of N = 350 organisms with 12 chromosomes of total length 1200 cM (to approximate the tomato genome). The interval mapping step was performed using an approximation due to Haley and Knott (1992), which is much less computer intensive and gives results almost identical to the EM algorithm for large values of N. Results are shown in Figure 1.

Both approximations are very accurate. As predicted, the process with the interval mapping step requires a higher threshold for a given value of the Type-I error. For smaller N, somewhat different approximations yielding larger thresholds need to be used, since the given approximations do not take into account the variability in the estimate of the variance, σy2. However, when N is large (at least 200), the approximations provide thresholds for the statistic and marker density actually used, which are more appropriate than the conventional LOD = 3.0. In mapping human traits, Lander and Kruglyak (1995) have argued that because the investigator is likely to type more markers around promising loci, the threshold for Δ = 0 should be used in all cases. If we use this threshold, it is not necessary to rationalize the choice of Δ, which should otherwise be an average intermarker distance in the neighborhood of detected linkages, or to concern ourselves about the effect of interval mapping on the false-positive error rate. But insistence on this threshold would noticeably reduce the power of the test, as is shown shortly.

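For intercross data the noncentrality parameter for a QTL located at a marker locus is $\xi = \{N\ln[1 + (\alpha^2/2 + \delta^2/4)/\sigma_e^2]\}^{1/2}$. To attribute appropriate parts of the total noncentrality to the two processes in (8), we let $\xi_1 = \xi\alpha/(\alpha^2 + \delta^2/2)^{1/2}$ and $\xi_2 = \xi\delta/[2(\alpha^2 + \delta^2/2)]^{1/2}$, so that $\xi_1^2 + \xi_2^2 = \xi^2$. If the QTL is located at a marker, the power is approximately

$$P\{\max_k 2\ln LR(k\Delta) > a\} \approx 1 - \Phi(b - \xi) + \phi(b - \xi)\left[\frac{1}{2\xi} + \frac{2b^{1/2}\nu}{\xi^{3/2}} - \frac{b^{1/2}\nu^2}{\xi^{1/2}(b + \xi)}\right], \tag{10}$$

where $\nu = \nu(b\{2\beta\Delta\}^{1/2})$ and $\beta = (\beta_1\xi_1^2 + \beta_2\xi_2^2)/\xi^2$. For a QTL between markers, one must as in the backcross case consider the joint distribution at flanking markers. For a marker at distance $\Delta_1$ from the QTL the noncentrality parameters are

$$\xi_1\exp(-\beta_1\Delta_1) \quad\text{and}\quad \xi_2\exp(-\beta_2\Delta_1). \tag{11}$$

See Appendix A for an approximation for the power of the interval mapping process.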
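A sketch of the power approximation (10) follows. It reads the bracket as $1/(2\xi) + 2b^{1/2}\nu/\xi^{3/2} - b^{1/2}\nu^2/[\xi^{1/2}(b+\xi)]$, uses $\exp(-0.583x)$ for $\nu$, and takes the dense-marker threshold $a = 19.0$ for the idealized tomato genome; the function names are ours. Under these assumptions a dominant model ($\delta = \alpha$) with $\xi$ = 4.12, 4.41, 4.75, and 5.21 gives powers close to 60, 70, 80, and 90%.

```python
import math

def nu(x):
    """Discreteness correction via the exp(-0.583 x) approximation (x below ~2)."""
    return math.exp(-0.583 * x)

def normal_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def split_noncentrality(xi, alpha, delta):
    """Split xi into additive (xi1) and dominance (xi2) parts; xi1^2 + xi2^2 = xi^2."""
    scale = math.sqrt(alpha ** 2 + delta ** 2 / 2.0)
    return xi * alpha / scale, xi * delta / (math.sqrt(2.0) * scale)

def intercross_power_at_marker(a, xi1, xi2, beta1, beta2, marker_spacing):
    """Approximation (10) for a QTL located exactly at a marker."""
    b = math.sqrt(a)
    xi = math.hypot(xi1, xi2)
    beta = (beta1 * xi1 ** 2 + beta2 * xi2 ** 2) / xi ** 2
    v = nu(b * math.sqrt(2.0 * beta * marker_spacing))
    bracket = (1.0 / (2.0 * xi)
               + 2.0 * math.sqrt(b) * v / xi ** 1.5
               - math.sqrt(b) * v * v / (math.sqrt(xi) * (b + xi)))
    return 1.0 - normal_cdf(b - xi) + normal_pdf(b - xi) * bracket

# Dominant model (delta = alpha), dense markers, threshold a = 19.0.
for xi in (4.12, 4.41, 4.75, 5.21):
    xi1, xi2 = split_noncentrality(xi, 1.0, 1.0)
    print(xi, round(intercross_power_at_marker(19.0, xi1, xi2, 0.02, 0.04, 0.0), 3))
```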

Using simulations and the theoretical power approximations above, we compare in Figure 2 the power of the marker process with the power of the interval mapping test. We also present the power of the interval mapping test using the more stringent threshold (assuming continuous markers) proposed by Lander and Kruglyak (1995). The power was investigated for a dominant model, so $\delta = \alpha$, and $\xi$ = 4.12, 4.41, 4.75, and 5.21, which correspond roughly to powers of 60, 70, 80, and 90% with a continuous map of markers. For recessive ($\delta = -\alpha$) models, the power would be exactly the same. For the same noncentrality values and an additive model ($\delta = 0$), it would be slightly larger. Power under two map densities was estimated ($\Delta$ = 20 and 5 cM) and we used $N$ = 350 tomato genomes. Each power simulation is based on 1000 replicates. The gain in power from using interval mapping is small, on the order of 2 to 4%, a result similar to that found by Darvasi et al. (1993). The gains anticipated by Lander and Botstein (1986, 1989), who write of interval mapping as providing a “virtual marker” midway between the actual markers, are overly optimistic. Their analysis is marred by their comparison of interval mapping with the marker process at only one of the flanking loci, where a more appropriate comparison would be with the maximum of the process at the two flanking loci. They also neglect the increase in threshold required to maintain a given false-positive error rate for the interval mapping process. The gain in power for interval mapping is largest for the sparse map ($\Delta$ = 20 cM), but the gain is only ∼3 to 4%. Using the threshold for a continuous map when in fact a sparse map of markers is used greatly reduces the power (by as much as 20%).

Figure 2. Power to detect linkage for different map densities, gene locations, and thresholds. In a and b, Δ = 5 cM while Δ = 20 cM in c and d. The trait locus is located at a marker in a and c and mid-markers in b and d. The process without interval mapping is represented by □; the process with interval mapping is represented by ⋄ (solid symbols for the theoretical approximation) and ▿ (power for the higher threshold appropriate when Δ = 0).

We have made similar computations with similar results for backcross designs.

When the markers only process is used, the theoretical power approximations are very good, so only the simulated values have been included in Figure 2. The approximations are also good for interval mapping except when the intermarker distance is 5 cM and the QTL is midway between markers. In this case the power is underestimated by ∼5%. The reason is that the theoretical approximation involves only the probability that the process is above the threshold somewhere in the interval containing the QTL and neglects the probability of detecting the QTL to be in a neighboring interval. This is not a problem when the intermarker interval is large.

Other designs and a comparison of different designs: Many other designs can be handled by similar approximations. To evaluate an appropriate threshold, for the markers only process it is only necessary to know the recombination parameter β (or β1 and β2), which depends only on the design, not the mathematical model used for recombination. Although there is no general method to evaluate this parameter, it has been calculated for many different designs. (Some values are given below.) For interval mapping one must know the complete covariance function, which depends on both the design and the model for recombination.

For instance, for recombinant inbred data, which involve the 1-d.f. statistic (3), one can use approximation (4) with $\beta = 0.04$ for recombinants produced by selfing and $\beta = 0.08$ for recombinants produced by recurrent sib mating (as originally suggested by Lander and Botstein 1989). It is only slightly more complicated to incorporate interval mapping. (See Rebai et al. 1994 for the case of selfing. A similar formula can be obtained for inbreds produced by recurrent sib mating.) For the advanced intercross designs suggested by Darvasi and Soller (1995) to provide more accurate localization of QTL, for the $F_i$ offspring one can use (9) with $\beta_1 = i\lambda$, $\beta_2 = 2i\lambda$. For reciprocal backcross designs, where half of the offspring are backcrossed to each parental strain, one can use (9) with $\beta_1 = \beta_2 = 0.02$.

In Stuber et al. (1992), offspring from a cross of two inbred maize strains (F1 generation) were allowed to self twice and then backcrossed to one of the parental lines. A careful examination of that design shows that the maximum LOD for testing the hypothesis of no linkage is approximately [cf. (3), (6)]

$$\max_d X^2(d) = \max_d \frac{3N\hat\alpha^2(d)}{4\sigma_e^2},$$

where $\hat\alpha(d)$ is the maximum-likelihood estimate of the sum of the additive and dominance effects. One can show that under the null hypothesis, $X(d)$ is approximately a Gaussian process with covariance function $R(d) = 1 - \tfrac{8}{3}\lambda d + o(d)$ as $d \to 0$. Therefore, approximation (4) can be used with $\beta = \tfrac{8}{3}\lambda$ to find an appropriate threshold.

Korol et al. (1995) have suggested the use of correlated traits as a technique to improve the power of QTL mapping. If the number of traits is $t$, this would require a $t$-dimensional version of (4) or a $2t$-dimensional version of (9) for the backcross or intercross design, respectively. The appropriate $k$-dimensional approximation ($k = t$ or $2t$) is given by

$$1 - \exp\left\{-C[1 - F_k(b^2)] - \beta L\,2^{(2-k)/2}\,[\Gamma(k/2)]^{-1}\,b^k \exp\left(-\frac{b^2}{2}\right)\right\}.$$

Here $F_k$ is the $\chi^2$ distribution function with $k$ degrees of freedom, $\Gamma$ denotes the gamma function, and $\beta$ would be replaced by $(\beta_1 + \beta_2)/2$ for an intercross design. For dense markers this expression reduces to (4) when $k = 1$ and to (9) when $k = 2$. Corrections for discrete spacing of markers would be exactly as above.
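The $k$-dimensional approximation can be checked against the one- and two-dimensional cases. The sketch below evaluates it for dense markers, reading the $\chi^2$ tail at $a = b^2$ (our reading of the garbled source) and computing that tail for integer $k$ from standard closed-form identities for the incomplete gamma function. Function names are ours.

```python
import math

def chi2_upper_tail(x, k):
    """P(chi-square with k d.f. > x) for a positive integer k."""
    t = 0.5 * x
    if k % 2 == 0:
        # Even k: finite Poisson-type sum.
        return math.exp(-t) * sum(t ** j / math.factorial(j) for j in range(k // 2))
    # Odd k: normal tail plus half-integer gamma terms.
    tail = math.erfc(math.sqrt(t))
    tail += math.exp(-t) * sum(t ** (j - 0.5) / math.gamma(j + 0.5)
                               for j in range(1, (k - 1) // 2 + 1))
    return tail

def multitrait_false_positive_rate(a, k, n_chrom, genome_len, beta):
    """k-dimensional approximation for dense markers (nu = 1), with a = b^2."""
    b = math.sqrt(a)
    crossing = (beta * genome_len * 2.0 ** ((2 - k) / 2.0) / math.gamma(k / 2.0)
                * b ** k * math.exp(-0.5 * a))
    return 1.0 - math.exp(-n_chrom * chi2_upper_tail(a, k) - crossing)

# k = 1 with beta = 0.02 (backcross) and k = 2 with beta = 0.03 (intercross average).
print(round(multitrait_false_positive_rate(14.6, 1, 12, 1200.0, 0.02), 3))
print(round(multitrait_false_positive_rate(19.0, 2, 12, 1200.0, 0.03), 3))
```

Both printed rates should be close to 0.05, matching the backcross and intercross thresholds quoted for the idealized tomato genome.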

We have used the theory developed above to compare the power of backcross, intercross, and recombinant inbred designs (obtained by recurrent sib mating). Let $\sigma_A^2$, $\sigma_D^2$, $\sigma_E^2$ denote the total additive, dominance, and environmental variances, respectively. Assuming that environmental and genetic effects are uncorrelated and there is no epistasis, we have the usual representation of the phenotypic variance as $\sigma_y^2 = \sigma_A^2 + \sigma_D^2 + \sigma_E^2$. Let $H^2 = (\sigma_A^2 + \sigma_D^2)/\sigma_y^2$ denote the wide sense heritability in the intercross, and put $\rho = \delta/(2^{1/2}\alpha)$. To reduce the number of different special cases we assume that $\rho$ is the same at all QTL; i.e., they all have the same relative amount of dominance. If we let $v^2$ be the heritability attributable to the locus of interest, i.e., $v^2 = (\alpha^2/2 + \delta^2/4)/\sigma_y^2 \le H^2$, then the noncentrality parameters of an intercross, backcross, and recombinant inbred design are, respectively,

$$[-N\ln(1 - v^2)]^{1/2},$$

$$\left\{-N\ln\left[1 - \frac{v^2(1 + 2^{1/2}\rho)^2}{H^2(1 + 2^{1/2}\rho)^2 + 2(1 - H^2)(1 + \rho^2)}\right]\right\}^{1/2},$$

and

$$\left\{-N\ln\left[1 - \frac{2v^2}{1 + \rho^2 + H^2(1 - \rho^2)}\right]\right\}^{1/2}.$$
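The three noncentrality parameters are simple to tabulate. The sketch below codes them directly from the expressions in the text; the function name and the numerical inputs are illustrative only and do not reproduce any entry of Table 1.

```python
import math

def design_noncentralities(n, v2, h2, rho):
    """Return (intercross, backcross, recombinant inbred) noncentrality parameters.

    n: sample size; v2: heritability due to the QTL; h2: wide sense
    heritability in the intercross; rho = delta / (sqrt(2) * alpha).
    """
    intercross = math.sqrt(-n * math.log(1.0 - v2))
    s = (1.0 + math.sqrt(2.0) * rho) ** 2
    backcross_fraction = v2 * s / (h2 * s + 2.0 * (1.0 - h2) * (1.0 + rho ** 2))
    backcross = math.sqrt(-n * math.log(1.0 - backcross_fraction))
    ri_fraction = 2.0 * v2 / (1.0 + rho ** 2 + h2 * (1.0 - rho ** 2))
    recombinant_inbred = math.sqrt(-n * math.log(1.0 - ri_fraction))
    return intercross, backcross, recombinant_inbred

# Illustrative: N = 200, v^2 = 0.2 H^2, no dominance (rho = 0).
for h2 in (0.25, 0.75):
    print(h2, [round(value, 2) for value in design_noncentralities(200, 0.2 * h2, h2, 0.0)])
```

With $\rho = 0$ the backcross value is the smallest and the recombinant inbred value the largest, and the gap narrows as $H^2$ grows.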

Suppose $\rho = 0$. It is easy to see that the noncentrality parameter of the backcross is smallest and that of the recombinant inbred is largest. All three noncentrality parameters are comparable for large $H^2$, but there can be sizeable differences for small $H^2$. Because the threshold required for a given significance level is smallest for the backcross and largest for the intercross, one expects to find the backcross the most powerful design when $H^2$ is large, but not otherwise.

A numerical example is given in Table 1. We have determined for continuous markers sample sizes that give 80% power for values of $H^2$, $v^2$, and $\rho$. Although the exact sample sizes depend on $v^2$, their relative values are roughly constant throughout a broad range where $v^2$, the heritability attributable to the QTL, contributes from roughly ⅛ to ½ of $H^2$, so only the intermediate value $v^2 = 0.2H^2$ is included in the table. Similarly the relative sample sizes are fairly insensitive to the exact power required. In agreement with the qualitative analysis of the preceding paragraph, for $\rho = 0$ the sample size required by a backcross design is about the same as that of the intercross for $H^2 = 0.75$ but is appreciably larger for $H^2 = 0.25$. For $\rho^2 = 0.04$, the backcross design can require somewhat smaller or much larger sample sizes than the intercross design depending on whether $\rho$ is positive or negative, which in turn depends on the parental strain used for the backcross. Hence with a small amount of dominance, probably too small to be detected in segregation analysis, a backcross design can yield a very misleading picture. The sample sizes required of the recombinant inbred design are smaller than those of the intercross and backcross designs and are insensitive to the values of $\rho$, at least for the relatively small values considered here.

TABLE 1

Theoretical sample sizes of intercross, backcross, and recombinant inbred designs necessary to achieve 80% power with dense (Δ = 0) markers

We have performed similar calculations when the amount of dominance varies across QTL. The sample sizes in the backcross column can change substantially, but the qualitative picture is the same.

This problem with a backcross design could in principle be eliminated by backcrossing to both parental strains and using a 2-d.f. statistic (with $\beta_1 = \beta_2 = 0.02$). One can easily evaluate the noncentrality parameter and see that for small values of $H^2$ such a reciprocal backcross is less powerful than an intercross design based on an equal number of progeny, but is slightly more powerful than an intercross design based on an equal number of matings (hence presumably half as many progeny). For larger values of $H^2$, numerical calculations as in Table 1 can help one determine the potential usefulness of such a design.

To simplify the preceding comparison, we have assumed continuously distributed markers. This has the effect of concealing a weakness of the recombinant inbred design, which has a very large recombination parameter (β = 0.08). A consequence is that if markers are not closely spaced there is a considerable loss of power to detect a QTL located midway between markers. For an example consider the fourth row of Table 1, where the recombinant inbred design is much more powerful than either of the other two. For a Δ = 20-cM map and a QTL midway between markers, the power falls to about 0.73 if we use the sample sizes given in the table with an intercross or backcross design. To achieve this power with a recombinant inbred design, one would need a sample size of ∼380, and in this case interval mapping would be mandatory. Otherwise a sample size of ∼690 would be required. For a Δ = 5-cM map, the power of a backcross or intercross would fall only to 0.79 for a QTL midway between markers. Now for a recombinant inbred design a sample size of about 291 would be required (300 without interval mapping). To achieve the benefits of a recombinant inbred design, it appears advisable to type markers at no more than 5 cM distance, and closer would be better. A similar caution is applicable to the advanced intercross designs of Darvasi and Soller (1995).

Confidence regions for QTL: A confidence region can be used to identify a chromosomal region in which to concentrate the search for the exact location of a QTL. In this section, three methods of constructing a confidence region around the gene locus are presented and compared. It is perhaps worth noting from the outset that this is not a “regular” estimation problem as the term is used by statisticians. Because the likelihood function has cusps at marker loci, the maximum-likelihood estimate of a QTL may fail to be approximately normally distributed, so one is not justified in using the maximum-likelihood estimator plus or minus two estimated standard errors as an approximate 95% confidence interval. Darvasi et al. (1993) in one of their suggestions appear to have assumed incorrectly that the standard statistical theory is applicable. Visscher et al. (1996) have suggested a confidence interval based on the unconditional distribution of the maximum-likelihood estimator, which they estimate by bootstrapping. Although their coverage probabilities are shown by a Monte Carlo experiment to be quite close to the specified level, this method does not adapt to the rate of decay of the likelihood function near its maximum and is known to give confidence regions that are unnecessarily large in related “change-point” problems. A numerical example given below suggests that it has the same undesirable feature here. See Siegmund (1988) for a more complete discussion.

Support intervals: Support intervals (cf. Conneally et al. 1985) provide a method of estimating the location of a trait locus. They are essentially equivalent to the standard statistical technique of inverting the likelihood ratio test to obtain a confidence region. Given a value x > 0, a support region includes all the loci q such that

$$2\ln LR(q) \ge \max_d 2\ln LR(d) - x. \tag{12}$$

Often the 2 is omitted and common logarithms are used. Then one speaks of a LOD support region. The value x in (12) provides an (x/2 ln 10)-LOD support region. With data from a single marker the statistical problem is regular, so a 1-LOD support interval (x = 4.6) is approximately a 97% confidence interval [because 4.6 is the 97th percentile of the χ² distribution with 1 d.f.; see Ott (1991, p. 67)]. However, this result does not generalize to genome-wide scans involving reasonably dense markers, where the coverage probability of (12) depends on the density of the map of markers and on the strength of the signal at the trait locus. In fact, there is no exact confidence coefficient that can be assigned to a support region. Through theoretical analysis and a simulation study presented below, we show that a 1-LOD (x = 4.6), respectively 1.5-LOD (x = 6.9), support interval corresponds roughly to a 90%, respectively 95%, confidence region in the case of a dense map of markers (∼1 cM), and provides even greater probability of coverage for sparser maps.
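As a minimal sketch of how the support-region rule (12) is applied, the following Python snippet takes a profile of 2 ln LR values on a grid of loci and returns the loci q with 2 ln LR(q) no more than x below the maximum. It also converts a LOD drop to x and evaluates the single-marker (regular case) χ² coverage quoted above. The function names and the toy profile are illustrative assumptions, not part of the original analysis.

```python
import math

def lod_to_x(lod):
    """Convert a LOD drop to the equivalent drop x in 2 ln LR units."""
    return 2.0 * math.log(10.0) * lod

def chi2_1df_cdf(x):
    """CDF of the chi-square distribution with 1 d.f. (regular case only)."""
    return math.erf(math.sqrt(x / 2.0))

def support_region(positions, two_ln_lr, x):
    """Loci q with 2 ln LR(q) >= max_d 2 ln LR(d) - x."""
    cutoff = max(two_ln_lr) - x
    return [q for q, v in zip(positions, two_ln_lr) if v >= cutoff]

# Toy profile (hypothetical numbers) on a 5-cM grid.
positions = [0, 5, 10, 15, 20, 25, 30]
profile = [3.1, 9.8, 17.2, 21.5, 18.0, 10.4, 2.7]

x = lod_to_x(1.0)  # a 1-LOD drop, about 4.6
region = support_region(positions, profile, x)
print(round(x, 2), region, round(chi2_1df_cdf(x), 3))
```

The χ² value printed here is the regular-case figure only; as the text explains, it is not the coverage of a support region in a dense genome scan.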

Likelihood methods: A second method to provide a confidence interval for a QTL relies on using likelihood methods for change points (Siegmund 1988; Feingold et al. 1993). It is closely related to the support method described above and provides some analytic tools for studying that concept. Unlike the support method, however, for the special case that the trait locus is exactly at a marker location the likelihood method in principle gives an exact confidence region.

Although the actual procedure is based on twice the log-likelihood ratio, our discussion will be simplified notationally by using the asymptotically equivalent $\|Z_d\|^2$, where $Z_d = (X_d, Y_d)$ is defined in (8) [cf. also (7)] and $\|Z_d\|^2 = X_d^2 + Y_d^2$. In terms of these variables the acceptance region for the likelihood ratio test of the hypothesis that a QTL is located at q has the form

$$A_q = \{\max_d \|Z_d\|^2 - \|Z_q\|^2 \le x\}.$$

By sufficiency, the conditional probability of $A_q$ given $Z_q$ does not depend on the unknown parameters α, δ. Hence in principle we can choose $x = x(Z_q)$ such that

$$P(A_q \mid Z_q) = 1 - \gamma. \tag{13}$$

The set of all values q that are not rejected by this test is a (1 − γ)100% confidence region (Cox and Hinkley 1974).

As the desired conditional probability does not depend on α, δ, it can be evaluated under the hypothesis that these parameters are both zero. The approximation (B1) of Appendix B yields as a confidence interval for the QTL those loci q such that

$$P\bigl(\max_d \|Z_d\|^2 > (\max_d \|Z_d\|^2)_{\mathrm{obs}} \mid Z_q\bigr) \ge \gamma. \tag{14}$$

The likelihood method works best for very dense sets of markers (∼1 cM), as the argument given above is technically correct only when the QTL is at a marker. It can be extended to provide a joint confidence region for the locus and the additive and dominance effects (Dupuis 1994).

By (7) the inequality defining $A_q$ and the inequality in (12) are asymptotically equivalent. The important difference between the likelihood ratio and LOD support methods is that for the former x depends on $Z_q$ and is chosen to make the conditional probability (13) equal to the desired confidence level. For any value x that does not depend on the data, the probability of (12) depends on the values of α and δ. Hence the support region is not a confidence region in the strict sense of the word. However, the similarity between the support regions and the likelihood ratio regions allows us to gain some interesting theoretical insights. For example, under the assumption that the QTL lies at a marker locus and that the distance Δ between markers is small, we can evaluate approximately the probability that a support region does not contain the true QTL, by taking the expectation of (B1) in Appendix B with respect to $Z_q = z$. The result of some simple approximations is

$$P(A_q) \approx 1 - 2\nu\{[2\bar\beta\Delta(\xi^2 + x)]^{1/2}\} \times \left[\frac{\xi^2 + x}{\xi^2 + x\xi_2^2/(\xi_1^2 + 2\xi_2^2)}\right]^{3/2} \exp(-x/2), \tag{15}$$

where $\xi = (\xi_1^2 + \xi_2^2)^{1/2}$, $\bar\beta = (\beta\xi_1^2 + 2\beta\xi_2^2)/\xi^2$, and β = 0.02. Numerical calculations based on this approximation suggest, and simulations reported below verify, that for a given value of Δ the coverage probability of the support region is relatively insensitive to the values of ξ and to the relative sizes of the additive and dominance components, at least for values of ξ in the range 4 ≤ ξ ≤ 10, where detection of linkage ranges from reasonably likely to virtually certain, so QTL localization is especially important. The coverage probability is an increasing function of the intermarker distance Δ, so a 1.5-LOD support region has ≈95% coverage when Δ ≈ 1 cM, while a 1-LOD support region gives similar coverage for Δ ≈ 20 cM. Hence for practical purposes a support region is approximately a confidence region, albeit with a different confidence coefficient than that suggested by standard statistical distribution theory.
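The coverage approximation (15) can be evaluated numerically. The sketch below reads (15) as P(A_q) ≈ 1 − 2ν{[2β̄Δ(ξ² + x)]^(1/2)} [(ξ² + x)/(ξ² + xξ₂²/(ξ₁² + 2ξ₂²))]^(3/2) exp(−x/2), with β̄ = (βξ₁² + 2βξ₂²)/ξ² and β = 0.02 per cM. It assumes that ν is Siegmund's function with the series representation ν(t) = 2t⁻² exp{−2 Σₙ n⁻¹Φ(−t√n/2)}, whose definition lies outside this section. The function names and the choice ξ₁ = ξ₂ in the usage lines are illustrative assumptions.

```python
import math

def norm_cdf(z):
    """Standard normal CDF, written with erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def nu(t, tol=1e-12, max_terms=200000):
    """Assumed form: nu(t) = 2 t^-2 exp(-2 sum_{n>=1} Phi(-t sqrt(n)/2) / n)."""
    if t < 0.05:
        # Small-argument approximation nu(t) ~ exp(-0.583 t).
        return math.exp(-0.583 * t)
    total = 0.0
    for n in range(1, max_terms + 1):
        term = norm_cdf(-0.5 * t * math.sqrt(n)) / n
        total += term
        if term < tol:
            break
    return 2.0 / (t * t) * math.exp(-2.0 * total)

def support_coverage(xi1, xi2, x, delta_cm, beta=0.02):
    """Approximate coverage of a support region, following (15) as read above."""
    xi_sq = xi1 ** 2 + xi2 ** 2
    beta_bar = (beta * xi1 ** 2 + 2.0 * beta * xi2 ** 2) / xi_sq
    arg = math.sqrt(2.0 * beta_bar * delta_cm * (xi_sq + x))
    ratio = (xi_sq + x) / (xi_sq + x * xi2 ** 2 / (xi1 ** 2 + 2.0 * xi2 ** 2))
    miss = 2.0 * nu(arg) * ratio ** 1.5 * math.exp(-x / 2.0)
    return 1.0 - miss

# Hypothetical split of the noncentrality xi = 5 into equal components.
r = 5.0 / math.sqrt(2.0)
for x in (4.6, 6.9):
    for delta_cm in (1.0, 5.0, 10.0):
        print(x, delta_cm, round(support_coverage(r, r, x, delta_cm), 3))
```

Since (15) is derived for a QTL at a marker and small Δ, values at larger marker spacings should be read as rough indications only.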

For problems involving a single parameter, e.g., for backcrosses, recombinant inbreds, or intercrosses where we estimate only α and ignore δ, the factor in square brackets in (15) immediately preceding the exponential would be $[(\xi^2 + x)/\xi^2]^{1/2}$. It is easy to see that at least for comparatively large values of ξ, the coverage probability for a given value of x is relatively insensitive to this change of dimension.

An approximation for the expected size of a support region, which is valid for dense markers (∼1 cM), is given in Appendix B. A less precise but more easily interpreted approximation, valid when ξ ≫ x, is obtained by approximating the normal density in (B2) with mean ξ by a point mass at ξ, then taking two terms of the Taylor series expansion of $\ln[\xi^2/(\xi^2 - x)]$, which yields

$$\beta^{-1}\left[\frac{x}{\xi^2} + \frac{0.5x^2}{\xi^4} + \frac{2}{\xi^2}\Bigl(1 - 2\nu(\xi(2\beta\Delta)^{1/2}) + 0.5\nu^2(\xi(2\beta\Delta)^{1/2})\Bigr)\right]. \tag{16}$$

This expression is roughly proportional to $\xi^{-2}$, hence to $N^{-1}$. In contrast, for regular statistical problems the size of a confidence region is inversely proportional to the square root of the sample size. The fact that confidence regions for a QTL are roughly inversely proportional to the sample size has been observed in the simulations of Darvasi et al. (1993) and Visscher et al. (1996), although these authors do not provide a theoretical explanation. The approximation (16) also shows, as one might have anticipated, that the average length of a support region is inversely proportional to β, hence to the recombination rate for the design used. Even if we ignore the difference between noncentrality parameters for recombinant inbred and backcross designs, the recombinant inbred design, for which β = 0.08, will give regions roughly one-fourth the size of those obtained from a backcross, provided the intermarker distances are sufficiently small. In fact, for additive traits recombinant inbreds always have a larger noncentrality parameter than a backcross, so they provide support regions even less than one-fourth as large. In the extreme case of small heritability and a QTL that is responsible for most of the additive variance, the relative size can shrink by another factor of almost 4.
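The two scaling statements above (size roughly proportional to ξ⁻² and to β⁻¹) can be checked numerically. The sketch below reads (16) as β⁻¹[x/ξ² + 0.5x²/ξ⁴ + (2/ξ²)(1 − 2ν + 0.5ν²)], with ν evaluated at ξ(2βΔ)^(1/2). It substitutes the small-argument approximation ν(t) ≈ exp(−0.583t) for ν, which is an assumption of this example. The function names are hypothetical.

```python
import math

def nu_approx(t):
    """Small-argument approximation nu(t) ~ exp(-0.583 t) (assumed here)."""
    return math.exp(-0.583 * t)

def expected_support_size(xi, x, beta, delta_cm, nu=nu_approx):
    """Approximate mean support-region length in cM, following (16) as read above."""
    v = nu(xi * math.sqrt(2.0 * beta * delta_cm))
    bracket = (x / xi ** 2
               + 0.5 * x ** 2 / xi ** 4
               + (2.0 / xi ** 2) * (1.0 - 2.0 * v + 0.5 * v ** 2))
    return bracket / beta

# Doubling xi (a fourfold increase in xi^2, hence in N) with a 1-cM map.
size_5 = expected_support_size(5.0, 4.6, 0.02, 1.0)
size_10 = expected_support_size(10.0, 4.6, 0.02, 1.0)
print(round(size_5, 2), round(size_10, 2), round(size_5 / size_10, 2))

# Backcross (beta = 0.02) versus recombinant inbred (beta = 0.08), very dense map.
bc = expected_support_size(5.0, 4.6, 0.02, 0.001)
ri = expected_support_size(5.0, 4.6, 0.08, 0.001)
print(round(ri / bc, 3))
```

The first ratio is close to, but not exactly, 4 because the ν terms also depend on ξ. The second ratio approaches one-fourth only as the map becomes dense.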

Bayesian credible regions: Given a prior probability for the location of the QTL and for the noncentrality parameters (ξ1, ξ2), a set having a posterior probability of 1 – γ is called a Bayesian credible region. Fisher (1934), in his classical study of ancillarity, showed in effect that under certain conditions Bayesian credible sets are in fact 1 – γ confidence regions having many desirable properties. Cobb (1978) pointed out that a special class of statistical problems having the required structure are “change-point” problems, which have been studied extensively from this point of view by Zhang (1991). Feingold et al. (1993) and Kruglyak and Lander (1995) have noted the similarity between estimating the location of a change-point and estimating the location of a trait locus from data on mapped markers. A consequence of this history is the expectation that a Bayesian credible region for a uniform prior distribution on the location of the QTL will provide satisfactory confidence regions.

A Bayesian credible region $B_\gamma$ is constructed by including all loci v whose posterior density given the data exceeds $c_\gamma$, i.e.,

$$B_\gamma = \{v: \pi(v \mid y, x) > c_\gamma\}, \tag{17}$$

where $c_\gamma$ is chosen so that $\int_{B_\gamma} \pi(v \mid y, x)\,dv = 1 - \gamma$. Here $y = \{y_1, \ldots, y_N\}$, $x = \{x_1, \ldots, x_N\}$, and $x_i$ is the set of all marker genotypes for individual i. The posterior probability π(v | y, x) is often easy to compute and depends on the prior distribution on the location q and the additive and dominance effects α and δ. If one takes uninformative priors on all parameters,

$$\pi(v \mid y, x) \approx \frac{\exp(\tfrac{1}{4}\|Z_v\|^2)}{\int_0^l \exp(\tfrac{1}{4}\|Z_s\|^2)\,ds}, \tag{18}$$

where $Z_t = (X_t, Y_t)$ was defined previously and can be obtained using least-squares estimates or the interval mapping equivalent. Analogous expressions can be obtained for other priors. We have studied properties of three different priors on the additive and dominance effects, with a uniform prior for the gene location. First a flat prior was implemented. Second, we constructed the confidence sets with uncorrelated normal priors with mean 0 and standard deviation of 4. The mean of 0 is to allow the parameters to be positive or negative and a standard deviation of 4 should be large enough to allow the parameters to vary freely. Finally, since the smallest detectable genetic effect involves a noncentrality of ∼4, a uniform mixture of four uncorrelated normal priors with noncentralities of 4 corresponding to dominant (δ = α) and recessive (δ = −α) models and with variance of one was also applied. Results are presented in the next section.
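The construction in (17) can be sketched on a grid of loci. The example below assumes the posterior weight at each locus is proportional to exp(c‖Z_v‖²), with c = 1/4 as (18) is read here, and replaces the integral by a sum over an equally spaced grid. Loci are added in order of decreasing posterior mass until the total reaches 1 − γ. The function name, the `scale` parameter, and the toy profile are illustrative assumptions.

```python
import math

def credible_region(positions, z_norm_sq, gamma=0.05, scale=0.25):
    """Highest-posterior-density set on an equally spaced grid of loci.

    Posterior weights are taken proportional to exp(scale * ||Z_v||^2);
    scale=0.25 is the exponent assumed from (18).
    """
    peak = max(z_norm_sq)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp(scale * (z - peak)) for z in z_norm_sq]
    total = sum(weights)
    post = [w / total for w in weights]
    order = sorted(range(len(post)), key=lambda i: post[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += post[i]
        if mass >= 1.0 - gamma:
            break
    return sorted(positions[i] for i in chosen), mass

# Toy profile of ||Z_v||^2 (hypothetical numbers) on a 5-cM grid.
positions = [0, 5, 10, 15, 20, 25, 30]
z_norm_sq = [3.1, 9.8, 17.2, 21.5, 18.0, 10.4, 2.7]
region, mass = credible_region(positions, z_norm_sq, gamma=0.05)
print(region, round(mass, 3))
```

On a coarse grid the achieved posterior mass overshoots 1 − γ. A finer grid, or interpolation within the last interval added, would bring it closer to the nominal level.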

Comparison study: Using simulated tomato genomes, we constructed the likelihood confidence region, the 1.0- and 1.5-LOD support region and the Bayes credible regions, with the three different priors mentioned above. However, only the results from Bayes credible sets with a mixture of normal priors are included in Tables 2 and 3. For each tomato, the crossover process for the chromosome containing the QTL was generated using the Haldane mapping function and the phenotype $y_i$ was assigned the value

$$y_i = \alpha x_i(q) + \delta 1_{(x_i(q) = 1)} + e_i,$$

where the $e_i$'s are normal random variables with mean 0 and variance 1.

We performed the simulations for the dominance model (δ=α), with ξ= 5, 7.5, and 10.0. The trait locus was either at a marker, midway between markers, or randomly assigned. We generated 1000 sets of 350 tomatoes and calculated the average size and the probability of covering the true locus given a map with Δ = 1, 5, and 10 cM. Interval mapping was used throughout.
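The simulation design described above can be sketched as follows. The snippet is a minimal illustration, not the authors' code. It assumes an intercross in which each of the two gametes is a two-state Markov chain along equally spaced loci, with recombination fraction given by the Haldane map function, and a QTL placed on one of the grid loci. Phenotypes follow y = αx(q) + δ·1{x(q) = 1} + e with standard normal e. All function and parameter names are hypothetical.

```python
import math
import random

def haldane_theta(d_cm):
    """Recombination fraction for a map distance in cM (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def simulate_gamete(n_loci, theta, rng):
    """Parental origin (0 or 1) at each locus along one F1 gamete."""
    origin = [rng.randrange(2)]
    for _ in range(n_loci - 1):
        flip = rng.random() < theta
        origin.append(1 - origin[-1] if flip else origin[-1])
    return origin

def simulate_intercross(n, n_loci, spacing_cm, qtl_index, alpha, delta, seed=0):
    """Return genotypes (0, 1, 2 per locus) and phenotypes for n F2 individuals."""
    rng = random.Random(seed)
    theta = haldane_theta(spacing_cm)
    genotypes, phenotypes = [], []
    for _ in range(n):
        g1 = simulate_gamete(n_loci, theta, rng)
        g2 = simulate_gamete(n_loci, theta, rng)
        x = [a + b for a, b in zip(g1, g2)]  # allele count at each locus
        xq = x[qtl_index]
        y = alpha * xq + delta * (xq == 1) + rng.gauss(0.0, 1.0)
        genotypes.append(x)
        phenotypes.append(y)
    return genotypes, phenotypes

# 350 individuals, 21 loci at 5-cM spacing, QTL at the middle locus, delta = alpha.
genotypes, phenotypes = simulate_intercross(350, 21, 5.0, 10, 0.3, 0.3, seed=1)
print(len(genotypes), len(genotypes[0]), round(sum(phenotypes) / len(phenotypes), 3))
```

A QTL between markers can be simulated the same way by generating loci on a fine grid and treating only every k-th locus as a typed marker.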

TABLE 2

Average size in centimorgans of simulated confidence intervals

Both the 1.5-LOD (x = 6.9) support regions and the Bayesian credible regions provided at least 95% coverage under all simulated conditions. The support regions gave the smallest confidence regions for dense maps, while the Bayesian credible regions did the same for sparse maps. The coverage probability for the support regions obtained in the simulations is close to that predicted by the approximation (15). The approximate expected size provided by (B2) is close in the case of a dense map, but not otherwise. The likelihood method was conservative; and because it adapts to the observed value of the likelihood ratio statistic at the putative trait locus it resulted in the widest confidence regions for small values of the noncentrality parameter but was equivalent to the support region for the larger values ξ= 7.5 and 10. For all methods, the sizes of the intervals were largest when the trait was midmarker. The Bayes credible sets were the widest and they fell short of the desired 95% for large values of ξ and sparse maps, especially when the trait was located at a marker.

The size of the confidence regions is relatively insensitive to the marker density when the distance between markers and the size of the region are roughly commensurate; but when ξ is large, the dense marker map provides substantially smaller regions.

TABLE 3

Coverage probability of simulated confidence intervals

We performed similar simulations for a backcross with essentially the same results (data not shown). The simulations were repeated with fewer tomatoes (N = 100) (results not shown). The size of the region was unchanged for all methods, and all methods had the right coverage probability when the locus was located at a marker. The coverage probability was substantially reduced for the likelihood method and the Bayes method when the trait was located midway between markers (≈80% instead of 95%). The LOD support method had a slight drop in coverage (≈90%), but was more robust than the other methods.

We have also simulated support regions under the conditions of Table 2 of Visscher et al. (1996), which involved a backcross with no dominance variance and marker spacings of 20 cM. At this intermarker distance 1-LOD (x = 4.6) regions had coverage probabilities ranging from 93 to 96% and in all cases gave smaller regions than the 95% bootstrap regions recommended by Visscher et al. (1996), while 1.5-LOD regions had 98–99% coverage probability and about the same expected sizes as the bootstrap regions. For example, for a heritability of 0.05 and a sample size of 500, which yield a noncentrality parameter ξ = 5.06, the coverage probability of the 1-LOD region based on 1000 simulations was 96%, and the expected size was 29 cM compared with 96% and 43 cM obtained by Visscher et al. (1996) for their bootstrap regions.

Another method to obtain confidence intervals for QTL location has been proposed by Mangin et al. (1994). This method amounts to fixing a putative QTL location and testing the hypothesis that there is no QTL between that location and either end of the chromosome. In the statistical literature on change-point analysis Worsley (1986) has discussed a similar idea and has pointed out that if there is another change-point (here QTL on the same chromosome) the method may produce an empty confidence set, since for every putative QTL there is evidence of another somewhere on the chromosome. Of course, the problem of detecting a second, linked QTL given an already detected QTL is itself interesting and important.

## DISCUSSION

In this article we have discussed genome scanning methods to detect QTL in experimental genetics. Our goal has been to produce relatively simple approximations for quantities of interest, e.g., the false-positive error rate, power to detect a QTL, and coverage probability of a support region, so that one can easily address questions concerning sample size, marker density, etc., and can compare different designs. Our approximations for significance level and power seem adequate in this regard, but our approximations for the expected size of a support region are good only for dense markers (e.g., Δ ≈ 1 cM).

Although in a backcross the conventional LOD = 3 threshold produces false-positive rates <0.05 unless intermarker distances are small, it is anticonservative in an intercross even for intermarker distances as large as 25 cM without interval mapping.

Our approximations are based on the artificial assumption that markers are equally spaced and there are no missing data. If markers are not equally spaced, the approximations (4) and (9) can be modified by averaging the function ν with respect to the distribution of the distances Δ between markers. One can also use the original approximations with an average intermarker distance. (This should be the average distance in the neighborhood of detected QTL if one adds additional markers to promising regions.) Since (4) and (9) are insensitive to minor changes in the assumed value of Δ, one can reasonably expect such refinements to have little practical effect. If we use interval mapping to impute missing marker data, the resulting process is more correlated than would be the case if the data were not missing, so the threshold obtained under the assumption of no missing data is still appropriate and, in fact, slightly conservative.

The assumption of normality is robust in the sense that the regression statistics we use are approximately normally distributed in large samples, so our approximations for significance level and power are valid in large samples. However, it is possible that by using a more appropriate model, e.g., a mixture model if the nonnormality arises from large QTL effects, one can obtain greater power, although large QTL effects will be comparatively easy to detect with a suboptimal procedure.

When using a backcross or intercross, intermarker distances up to ∼10 cM are almost as powerful as continuously distributed markers. Except at intermarker distances of ∼20 cM or more, or when using a design involving a large recombination rate, e.g., a recombinant inbred design or advanced intercross design, there is little gain in power from interval mapping, which in any event does not provide nearly as much power as more closely spaced markers.

Although intercross designs involve a 2-d.f. statistic and hence a higher threshold than a backcross design, and have larger residual variance, intercross designs are usually more powerful than backcross designs, unless (a) the effect of the gene is large and additive or (b) there is dominance and the dominance deviation has the same sign as the additive genetic effect. A backcross design can lose considerable power in the presence of even a small departure from additivity if the incorrect parental strain is used for the backcross. A recombinant inbred design can be more efficient than an intercross, except when dominance effects are large compared to additive effects. Because of the high recombination rate associated with recombinant inbreds, especially those based on recurrent sib mating, power to detect linkage falls off rapidly with intermarker distance when a QTL is located midway between markers. To avoid this loss of power when using an inbred design based on recurrent sib mating, intermarker distances should be no more than 5 cM and preferably should be even less. Similar considerations apply to advanced intercross lines (Darvasi and Soller 1995).

We have also presented three methods of constructing confidence regions for the location of QTL: the likelihood method, Bayes credible sets, and support regions. The support method and the Bayesian credible sets seem roughly comparable in large samples, but the coverage probability of the support method is more robust to changes in the sample size. Both methods are better than the likelihood ratio method, which often has a coverage probability substantially smaller than the nominal level, except for the case of dense markers.

The size of a confidence region depends on the noncentrality parameter and the density of the markers in the neighborhood of the QTL. When the noncentrality parameter is ∼5, which provides power of ∼0.9 for QTL detection, little is gained by having markers more closely spaced than ∼10 cM; but when the noncentrality parameter is 7.5, intermarker distances of 1–5 cM provide shorter confidence regions. A reasonable guideline is to achieve a marker density in the neighborhood of a putative QTL about equal to the expected half length of a support region for a QTL of that strength.

When dominance effects are relatively small and markers sufficiently dense, support regions from recombinant inbred designs are often about one-fourth as large as from intercross designs, which in turn are substantially smaller than from backcross designs. Advanced intercross designs (Darvasi and Soller 1995) are also especially powerful for fine localization of QTL. In almost all cases, however, the size of the confidence regions is on the order of several centimorgans unless the sample size is considerably larger than what is required to detect linkage, so there is a continuing need to develop better designs for fine localization of QTL.

We have not explicitly addressed the complexities associated with identifying multiple, possibly linked, possibly interacting, QTL. For mapping qualitative traits in humans, we have discussed these issues (Dupuis et al. 1995), and expect to return to them for QTL mapping. For example, once a linked QTL is located, conditional search removes the effect of that QTL by subtracting its (estimated) genotypic contribution from the phenotypic value to define a new regression model, hence a new log-likelihood ratio statistic, to search for additional QTL. Suppose an intercross design is used and, for simplicity, we use a 1-d.f. statistic to detect a QTL of purely additive effect. Assume also that we know exactly the location of a QTL making contribution $v^2$ to the heritability. The (asymptotic) correlation function between the new and old processes at each unlinked marker is $(1 - v^2)^{1/2}$, and under the assumption of no epistasis, the noncentrality parameter for the new statistic is larger by the factor $1/(1 - v^2)^{1/2}$. Hence a large QTL effect $v^2$ is necessary at the detected locus to get a reasonable “gain” from the conditional search, although a large value of $v^2$ also leads to a new process only weakly correlated with the original search process, which increases the likelihood that conditional search will incur a false-positive error. Of course, there must be another QTL of sufficiently large effect for the gain in noncentrality to be helpful. Rough calculations suggest that suitable combinations of QTL effects will occur relatively rarely.
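To see how small the gain from conditional search typically is, the two quantities in the preceding paragraph can be tabulated: the correlation (1 − v²)^(1/2) between the new and old processes, and the factor 1/(1 − v²)^(1/2) by which the noncentrality parameter grows. The values of v² below are illustrative assumptions, as is the function name.

```python
import math

def conditional_search_factors(v_sq):
    """Correlation with the original process and gain in noncentrality."""
    correlation = math.sqrt(1.0 - v_sq)
    gain = 1.0 / correlation
    return correlation, gain

for v_sq in (0.05, 0.1, 0.3, 0.5):
    corr, gain = conditional_search_factors(v_sq)
    print(f"v^2={v_sq:.2f}  correlation={corr:.3f}  gain={gain:.3f}")
```

Under these formulas, a detected QTL explaining 10% of the variance raises the noncentrality parameter of a second QTL by only about 5%.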

Similar considerations are relevant to recently suggested multiple regression methods, e.g., Zeng (1994) and Jansen (1994), whereby one searches, for example, a given chromosome or chromosomal arm for a QTL while controlling for QTL on other chromosomes through arbitrarily placed markers. In comparison with conditional search, this method has the potential advantage of controlling the phenotypic variability due to multiple QTL, but at least initially has the disadvantage that the success of the control depends on fortuitously placing the control markers close to true QTL. Straightforward calculations show that the control markers on other chromosomes have no effect on the asymptotic distribution of the log-likelihood ratio process along the currently searched (unlinked) chromosome, although they do reduce the number of degrees of freedom available to estimate the error variance. By considering one chromosome at a time and adding the chromosome-wide false-positive rates, one obtains an asymptotic upper bound on the genome-wide false-positive rate. Because of the independent assortment of chromosomes, this upper bound should not be overly conservative.

The second method discussed by Dupuis et al. (1995), simultaneous search, will for the reasons given there rarely be useful in the absence of epistasis. Preliminary calculations suggest it can be very helpful when there is substantial epistasis.

We expect to return to the problem of detecting multiple, possibly linked, QTL in a future article.

## Acknowledgments

Health grant HG-00848 and the National Science Foundation grant DMS 9704324.

## APPENDIX A

Power of interval mapping: We first consider a backcross and suppose there is a single trait locus (on any particular chromosome) at q. Let $Z_d$ denote the signed square root of twice the log-likelihood ratio (incorporating interval mapping), which for large N behaves like a piecewise smooth Gaussian process. We use the basic decomposition

$$P[\max_d Z_d \ge b] = P[Z_q \ge b] + P[Z_q < b, \max_{d \ne q} Z_d \ge b].$$

The first term on the right-hand side is given by $1 - \Phi(b - \xi_q)$, where $\xi_q = E(Z_q)$. To approximate the second term, we assume that if the process exceeds the threshold for some d ≠ q it does so at a value of d between the same two flanking markers as q, or in one of the immediately adjoining marker intervals in the case that q is itself a marker. This analysis can be expected to yield reasonable approximations in the case that intermarker intervals are large, when interval mapping is supposed to be most helpful. It may not be effective when the intermarker distances are small, especially if the noncentrality is also small. We approximate $\max_d Z_d$ by expanding $Z_d$ in two terms of a Taylor series around d = q and using calculus to maximize the resulting expression. See Siegmund and Worsley (1995) for details of this calculation. The final approximation is

$$P[\max_d Z_d \ge b] \approx 1 - \Phi(b - \xi_q) + I_q (\xi_q - b)^{-1} \varphi(b - \xi_q)\bigl[1 - (b/\xi_q)^{1/2}\bigr], \tag{A1}$$

where $I_q$ equals 2 or 1 according to the trait locus being at a marker or in the interval between markers. This discontinuous behavior at the markers is caused by the discontinuity in the derivative of the interval mapping statistic that occurs at the markers.
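A short numerical sketch of (A1) follows. It reads the approximation as 1 − Φ(b − ξ_q) + I_q(ξ_q − b)⁻¹φ(b − ξ_q)[1 − (b/ξ_q)^(1/2)], and uses the limiting value 1/(2ξ_q) of [1 − (b/ξ_q)^(1/2)]/(ξ_q − b) when b = ξ_q. The threshold b = 3.9 in the usage lines is an illustrative assumption, as is the function name.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def power_interval_mapping(b, xi_q, at_marker):
    """Backcross power approximation, following (A1) as read above."""
    i_q = 2.0 if at_marker else 1.0
    if abs(xi_q - b) < 1e-9:
        ratio = 1.0 / (2.0 * xi_q)  # removable singularity at b = xi_q
    else:
        ratio = (1.0 - math.sqrt(b / xi_q)) / (xi_q - b)
    return 1.0 - Phi(b - xi_q) + i_q * ratio * phi(b - xi_q)

# Hypothetical threshold b = 3.9 and noncentrality xi_q = 5.
print(round(power_interval_mapping(3.9, 5.0, at_marker=False), 4))
print(round(power_interval_mapping(3.9, 5.0, at_marker=True), 4))
```

The first term is the chance that the statistic at the QTL itself exceeds the threshold. The correction term is positive on either side of b = ξ_q and doubles when the QTL sits at a marker.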

The noncentrality parameter $\xi_q$ can be evaluated by a direct computation starting from a suitable explicit representation of the interval mapping statistic. See Rebai et al. (1995) for such a representation in a complete interference model; their equation is easily modified for the Haldane model of no interference. We present here an alternative method, which will be easier to apply to intercross designs, where the explicit statistic is much clumsier to manipulate. We begin with the following asymptotically equivalent expression for the square root of (3):

$$\frac{\sum [x_i(d) - \tfrac{1}{2}](y_i - \bar y)}{\sigma_e \{\sum [x_i(d) - \tfrac{1}{2}]^2\}^{1/2}}, \tag{A2}$$

where $\bar y = N^{-1} \sum y_i$. In the case where the locus d lies between flanking markers, we replace the actual marker data, $x_i(d)$, by its conditional expectation given the genotypes of the flanking markers, $E[x_i(d) \mid G_i]$. Taking expectations and using (2), we see from some simple manipulations that the noncentrality is asymptotically equal to

$$[(\alpha + \delta)/\sigma_e] \Bigl\{\sum_i E\bigl[E(x_i(q) \mid G_i) - \tfrac{1}{2}\bigr]^2\Bigr\}^{1/2}.$$

To express this explicitly in terms of recombination fractions, let $\theta_1$ ($\theta_2$) denote the recombination fraction between the QTL at q and the marker flanking on the left (right), and θ the recombination fraction between the two flanking markers. Then straightforward calculations yield

$$\xi_q^2 = \xi^2 \left\{\frac{(1 - \theta_1 - \theta_2)^2}{1 - \theta} + \frac{(\theta_1 - \theta_2)^2}{\theta}\right\},$$

where $\xi^2 = N \ln\{1 + [(\alpha + \delta)/2\sigma_e]^2\}$. This reduces to the noncentrality ξ when $\theta_1 = 0$, so $\theta_2 = \theta$. At the midpoint between markers, if we assume the Haldane model of no interference, $\xi_q^2$ simplifies to

$$2\xi^2 \{\exp(-\beta\Delta)/[1 + \exp(-\beta\Delta)]\}.$$

This always exceeds the parameter (6), although a direct comparison is not really meaningful because the markers-only statistic involves the maximum of the process at the two flanking markers.
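The two special cases stated above can be checked directly from the general expression ξ_q²/ξ² = (1 − θ₁ − θ₂)²/(1 − θ) + (θ₁ − θ₂)²/θ. The sketch assumes the Haldane model in the form 1 − 2θ(d) = exp(−βd) with β = 0.02 per cM for a backcross. The function names are hypothetical.

```python
import math

def theta(d_cm, beta=0.02):
    """Haldane recombination fraction, so that 1 - 2*theta = exp(-beta * d)."""
    return 0.5 * (1.0 - math.exp(-beta * d_cm))

def noncentrality_ratio(d_left_cm, delta_cm, beta=0.02):
    """xi_q^2 / xi^2 for a QTL d_left_cm to the right of the left flanking marker."""
    t1 = theta(d_left_cm, beta)
    t2 = theta(delta_cm - d_left_cm, beta)
    t = theta(delta_cm, beta)
    return (1.0 - t1 - t2) ** 2 / (1.0 - t) + (t1 - t2) ** 2 / t

delta_cm = 20.0
at_marker = noncentrality_ratio(0.0, delta_cm)
midpoint = noncentrality_ratio(delta_cm / 2.0, delta_cm)
closed_form = 2.0 * math.exp(-0.02 * delta_cm) / (1.0 + math.exp(-0.02 * delta_cm))
print(round(at_marker, 6), round(midpoint, 6), round(closed_form, 6))
```

Sweeping `d_left_cm` from 0 to Δ shows how the loss of information grows toward the middle of the interval.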

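We can also give as an approximation for the power of the interval mapping process

$$P[\max_d \|Z_d\| \ge b] \approx 1 - \Phi(b - \xi_q) + \left[\frac{1}{2\xi_q} + I_q (b/\xi_q)^{1/2} \frac{1 - (b/\xi_q)^{1/2}}{\xi_q - b}\right] \times \varphi(b - \xi_q). \tag{A3}$$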

A more detailed calculation along the lines of that given for a backcross yields an expression for $\xi_q$, which in general is somewhat complicated. In the special case that q is the midpoint between two markers at distance Δ, the parameter $\xi_q$ is the norm of the vector with coordinates

$$\xi_1 \left\{\frac{2\exp(-\beta_1\Delta)}{1 + \exp(-\beta_1\Delta)}\right\}^{1/2}, \qquad \xi_2 \exp(-\beta_1\Delta) \left\{\frac{1}{1 + \exp(-\beta_2\Delta)} + \frac{2}{[1 + \exp(-\beta_1\Delta)]^2}\right\}^{1/2},$$

where $\xi_1$, $\xi_2$, $\beta_1$, and $\beta_2$ are as defined in the paper.

## APPENDIX B

Approximations for the conditional probability of (14) and the expected size of a LOD support region: To approximate the conditional probability of (14), we begin with the following lemma.

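Lemma. Let $Z_t = (Z_{1,t}, Z_{2,t})$ where $Z_{1,t}$ and $Z_{2,t}$ are independent Gaussian processes with covariance functions satisfying

$$R_i(t) = 1 - \beta_i |t| + o(|t|) \quad \text{as } t \to 0.$$

Assume b → ∞, Δ → 0, and $b\Delta^{1/2}$ is bounded away from 0 and ∞. Let $0 < \|z\|^2 < b^2$ and define t*, w* to be the solution of

$$\begin{pmatrix} z_1 \\ z_2 \end{pmatrix} = \begin{pmatrix} bR_1(t)\cos w \\ bR_2(t)\sin w \end{pmatrix}.$$

Assume t* is contained in $(0, t_1)$ and is bounded away from the upper endpoint ($t_1 > 0$). Then

$$P\Bigl\{\max_{0 \le i\Delta \le l} \|Z_{i\Delta}\| \ge b \Bigm| Z_0 = z\Bigr\} \approx \frac{\beta \exp[-\tfrac{1}{2}(b^2 - \|z\|^2)]}{\bigl|\dot R_1(t^*) R_2(t^*) \cos^2 w^* + R_1(t^*) \dot R_2(t^*) \sin^2 w^*\bigr|} \times \nu[b(2\beta\Delta)^{1/2}],$$

where $\dot R_i(t) = dR_i(t)/dt$ and $\beta = \beta_1 \cos^2(w^*) + \beta_2 \sin^2(w^*)$.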

For our particular application, $R_i(t) = \exp(-\beta_i t)$. Putting $b^2 = \|z\|^2 + x$ and assuming $|x^{1/2} z_2| \ll |z_1|$, which will be the case with probability close to one unless there is overdominance, we obtain

$$P\Bigl\{\max_{0 \le i\Delta \le l} \|Z_{i\Delta}\|^2 > \|z\|^2 + x \Bigm| Z_0 = z\Bigr\} \approx \frac{[2(\|z\|^2 + x)]^{3/2} \exp(-x/2)}{\bigl(z_1^2 + [(z_1^2 + 2z_2^2)^2 + 4z_2^2 x]^{1/2}\bigr)^{3/2}} \times \nu\bigl([2\Delta(\beta_1 z_1^2 + \beta_2 z_2^2)(1 + x/\|z\|^2)]^{1/2}\bigr). \tag{B1}$$

A proof of the lemma is given in Dupuis (1994). The false-positive error rate in (9) can be obtained by integration with respect to the distribution of $\|Z_0\|$, although it is easier to give a direct calculation along the same lines as the proof of the lemma.
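The step from the lemma to (B1) can be sketched as follows. The sketch assumes $\beta_2 = 2\beta_1$, as suggested by the weights $\beta$ and $2\beta$ in the definition of the averaged recombination parameter in (15), so that $R_2 = R_1^2$. It is an outline of the algebra under that assumption, not a reproduction of the proof in Dupuis (1994).

```latex
\begin{align*}
% Assumption: R_i(t) = e^{-\beta_i t} with \beta_2 = 2\beta_1, hence R_2 = R_1^2.
\dot R_i(t) = -\beta_i R_i(t)
  &\;\Rightarrow\;
  \bigl|\dot R_1 R_2 \cos^2 w^* + R_1 \dot R_2 \sin^2 w^*\bigr|
  = \beta\, R_1 R_2 , \\
% The lemma's prefactor then reduces, with R = R_1(t^*) and b^2 = \|z\|^2 + x, to
\frac{\beta \exp[-\tfrac12 (b^2 - \|z\|^2)]}{\beta\, R_1 R_2}
  &= \frac{\exp(-x/2)}{R^{3}} . \\
% Eliminating w from z_1 = b R \cos w and z_2 = b R^2 \sin w gives a quadratic in R^2:
b^2 R^4 - z_1^2 R^2 - z_2^2 = 0
  &\;\Rightarrow\;
  R^2 = \frac{z_1^2 + [(z_1^2 + 2 z_2^2)^2 + 4 z_2^2 x]^{1/2}}{2(\|z\|^2 + x)} , \\
R^{-3}
  &= \frac{[2(\|z\|^2 + x)]^{3/2}}
          {\bigl(z_1^2 + [(z_1^2 + 2 z_2^2)^2 + 4 z_2^2 x]^{1/2}\bigr)^{3/2}} .
\end{align*}
```

The discriminant uses $z_1^4 + 4b^2 z_2^2 = (z_1^2 + 2z_2^2)^2 + 4z_2^2 x$. The argument of ν follows from approximating $\cos^2 w^*$ by $z_1^2/\|z\|^2$ and $\sin^2 w^*$ by $z_2^2/\|z\|^2$, which gives $b^2 \cdot 2\beta\Delta \approx 2\Delta(\beta_1 z_1^2 + \beta_2 z_2^2)(1 + x/\|z\|^2)$.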

We can also obtain a rough approximation for the expected size of the support region as follows. First consider the one-dimensional case of a backcross or recombinant inbreds and assume as before that a marker is at the QTL q. Then the expected size of the support region is

$$\Delta \sum_k P\{Z_{k\Delta}^2 \ge \max_j Z_{j\Delta}^2 - x\} = \Delta \sum_k \int \varphi(z - \xi) \times P_z\{Z_{k\Delta}^2 \ge \max_j Z_{j\Delta}^2 - x\}\,dz,$$

where $P_z$ denotes probability under the condition that $Z_q = z$. The outcome of substantial calculation along the lines of Siegmund's (1988) Theorem 1 (which contains some minor errors that must be corrected) shows that for large ξ and small Δ, hence in particular for dense markers, the average size of the support region is approximately

$$\beta^{-1} \int \varphi(y - \xi) \Bigl\{\ln[y^2/(y^2 - x)] + 2y^{-2}\bigl[1 - 2\nu(y(2\beta\Delta)^{1/2}) + 0.5\nu^2(y(2\beta\Delta)^{1/2})\bigr]\Bigr\}\,dy. \tag{B2}$$

A similar argument in two dimensions yields a similar expression with β replaced by $\bar\beta$ and the additional factor $(y/\xi)^{1/2}$ multiplying $\varphi(y - \xi)$ to approximate a noncentral χ² density.

## Footnotes

• Communicating editor: S. Tavaré