Statistical Methods for Mapping Quantitative Trait Loci From a Dense Set of Markers
Josée Dupuis^a and David Siegmund^b
a Genome Therapeutics Corporation, Waltham, Massachusetts 02453
b Department of Statistics, Stanford University, Stanford, California 94305
Corresponding author: Josée Dupuis, Genome Therapeutics Corporation, 100 Beaver St., Waltham, MA 02453-8443. E-mail: josee.dupuis{at}genomecorp.com
Communicating editor: S. TAVARÉ
| ABSTRACT |
|---|
Lander and Botstein introduced statistical methods for searching an entire genome for quantitative trait loci (QTL) in experimental organisms, with emphasis on a backcross design and QTL having only additive effects. We extend their results to intercross and other designs, and we compare the power of the resulting test as a function of the magnitude of the additive and dominance effects, the sample size, and intermarker distances. We also compare three methods for constructing confidence regions for a QTL: likelihood regions, Bayesian credible sets, and support regions. We show that with an appropriate evaluation of the coverage probability a support region is approximately a confidence region, and we provide a theoretical explanation of the empirical observation that the size of the support region is inversely proportional to the sample size, not the square root of the sample size, as one might expect from standard statistical theory.
RECENT advances in genetics have led to the identification of genes responsible for certain diseases such as cystic fibrosis, Huntington's disease, breast cancer, and others. Linkage analysis, which is especially effective when the disease or trait of interest exhibits Mendelian inheritance, played an important role in the identification of those genetic loci. When the disease is complex in nature (incomplete penetrance, multiple loci involved, etc.) or quantitative, finding the genetic loci involved in the etiology of the trait can be more difficult. In particular, in human studies, it is difficult to separate environmental and genetic effects. However, with experimental organisms, studies can be designed to provide a similar environment for all individuals, so that the variation in phenotypes can be attributed mainly to genetic factors; and breeding designs can control the nature of the differences in genotype. Studies of experimental organisms can provide useful information for agricultural purposes and/or contribute to our understanding of human disease via animal models. Moreover, it is now feasible to search the entire genome for a gene locus influencing a trait of interest. Statistical methods for mapping quantitative trait loci (QTL) from experimental crosses using a dense set of markers were introduced by Lander and Botstein.
In this article we propose for intercross and other designs simple approximations that can be used to compare different designs under various conditions or the same design for different sample sizes or marker densities. We also discuss and compare three methods for constructing confidence intervals for a QTL. We assume throughout that markers are equally spaced, that there are no missing data, and except where noted that recombination occurs without interference. While these are artificially simple assumptions, at the cost of some complication they can all be weakened. Rough preliminary calculations suggest that the resulting picture would not change substantially unless the assumptions are radically altered. The sections on QTL detection and confidence regions are independent and can be read in any order.
| RESULTS |
|---|
The model and likelihood ratio statistics:
The starting point for our considerations is a cross between two strains that differ substantially in the quantitative trait of interest. The parental lines can be "pure" breeding lines obtained through inbreeding or simply two different strains of the same organism with widely differing mean phenotype. A cross is obtained from the two parental lines, creating the first generation of offspring (generation F1). The F1 generation is then allowed to mate together to produce the second generation (F2), the intercross. We assume that the genotypes of the parental lines are completely different, so that at any marker locus we can label alleles from the strain with the larger mean phenotype as A, and alleles from the other strain as B. At each locus, each individual of the F2 generation will have zero, one, or two A alleles. A backcross is generated by mating an individual of the F1 generation to one from the parental line. If the parental line with the smaller mean for the trait is used, the offspring from the backcross will have zero or one A alleles at any locus on their genome.
A standard model for quantitative traits expresses the phenotype $y_{ij}$ of the $i$th individual, as influenced by a QTL at locus $q$ of the $j$th chromosome, as

$$y_{ij} = \mu + \alpha x_i(q) + \delta 1_{\{x_i(q)=1\}} + e_{ij}, \qquad (1)$$

where $x_i(q)$ is the number of A alleles carried by individual $i$ at $q$; $\mu$, $\alpha$, and $\delta$ are the phenotypic mean, additive effect, and dominance effect, respectively; and $1_C$ equals 1 or 0 according to whether the condition C is satisfied or not. The $e_{ij}$'s are residual effects, which include both environmental effects and the genetic effects of QTL on other chromosomes than the jth. As we will be considering only a single chromosome at a time, we drop the subscript j in what follows. We assume that $x_i(q)$ and $e_i$ are uncorrelated, which would be the case if there is no epistasis and the environmental effect is uncorrelated with the genetic effects. We also assume that the $e_i$ are independent normally distributed random variables with mean 0 and variance $\sigma_e^2$. The residual variance $\sigma_e^2$ equals the sum of the environmental variance and the genetic variance for those QTL not on the jth chromosome. Without the normality assumption the regression-like statistics given below are not exact maximum-log-likelihood ratios, so it is possible that more powerful tests can be found. However, by virtue of the central limit theorem the various approximations to significance level, power, etc. will still be valid in large samples even if the e's are not normally distributed. In fact, for the significance level, it is not necessary to assume any stochastic model for the y's. One can simply regard the y's as fixed numbers and the regression statistic as essentially a weighted (by the y's) sum of the x's, to which the central limit theorem applies under an assumption that the empirical behavior of the y's is about what it would be if they were independent and identically distributed observations from a fixed distribution.
For backcross data, because $x_i(q)$ = 0 or 1, the additive and dominance effects cannot be estimated separately, and the model reduces to

$$y_i = \mu + \alpha^* x_i(q) + e_i, \qquad (2)$$

where $\alpha^*$ in (2) equals $\alpha + \delta$ from the model (1). This is the model developed by Lander and Botstein.
If one observes the genotype of a marker at a putative trait locus d, the maximum-log-likelihood ratio at d is given approximately by

$$2 \ln LR(d) \approx -N \ln\left[1 - \frac{\hat\alpha_d^2}{4\hat\sigma_y^2}\right] \approx \frac{N \hat\alpha_d^2}{4\sigma_e^2}, \qquad (3)$$

where $\hat\alpha_d$ is the maximum-likelihood estimate of the parameter $\alpha^* = \alpha + \delta$, and $\hat\sigma_y^2$ is the maximum-likelihood estimate of the phenotypic variance $\sigma_y^2 = \sigma_e^2 + \alpha^{*2}/4$. It is important to note that both $\sigma_y^2$ and $\sigma_e^2$ depend on the design and for a backcross differ from the corresponding quantities for an intercross, although this difference is not reflected in the notation. Note also that (3) involves natural logarithms; the marginal asymptotic distribution of (3) at any unlinked locus is $\chi^2$ with 1 d.f. To convert this and subsequent expressions to the LOD scale, one can divide by 2 ln 10 ≈ 4.6. For the first approximation in (3) we have replaced the empirical variance of $\{x_i(d)\}$, namely $N^{-1}\sum_i [x_i(d) - N^{-1}\sum_j x_j(d)]^2$, by its asymptotic value of 1/4; for the second we have approximated the logarithm by the first term of its Taylor expansion and have replaced the estimate $\hat\sigma_y$ by the parameter $\sigma_e$ that it estimates under the hypothesis of no linkage on the jth chromosome. Since the trait locus q is typically unknown, the log-likelihood ratio is maximized over all marker locations d and chromosomes j. At each marker, assumed to be a QTL, the log-likelihood ratio is computed exactly. Between markers,
Detection of linkage in backcrosses:
Because the log-likelihood ratio is maximized over the entire genome, it is unclear whether the conventional threshold of LOD = 3.0 [equivalently 2 ln LR(d) > 13.8] to declare statistical significance is appropriate in the present context. To address this issue, ![]()
[cf. (3)] by an Ornstein-Uhlenbeck process. This can be justified by the central limit theorem and a straightforward calculation of covariances. For the case of complete marker information (continuous markers), they gave thresholds depending on the length of the genome and the number of chromosomes searched (cf. their Proposition 2). For the case of a discrete set of markers evenly distributed over the genome, they obtained thresholds from a simulation study conducted under the assumption of no interference.
For the case of equispaced markers along the genome, ![]()
![]() |
(4) |
where $\beta = 2\lambda$, $\lambda$ being the rate of crossovers ($\lambda$ = 1 if L is in Morgans and $\lambda$ = 0.01 if L is in centimorgans), $\Delta$ is the distance between markers in the same units as L, and $\Phi(x)$ and $\phi(x)$ are the standard normal cumulative and density function, respectively. The function $\nu$, defined by an infinite series, is a discreteness correction for the distance $\Delta$ between markers; $\nu(x)$ can be approximated by exp(-0.583x), which is valid for x < ~2, while for x > 2 the first four terms of the defining infinite series provide a reasonable approximation. For the case of continuous markers $\Delta$ = 0, so $\nu$ = 1, and (4) is essentially the same as the continuous-marker approximation described above.
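The threshold calculation can be sketched numerically. The Python sketch below assumes that approximation (4) has the form 1 - exp{-2C[1 - Φ(b)] - 2βLbφ(b)ν(b{2βΔ}^{1/2})}, with C the number of chromosomes. This form, the series used for ν when x > 2, and all function names are assumptions made for illustration; they were chosen because they reproduce the backcross threshold a = 14.6 quoted later in the text for 12 chromosomes of 100 cM each.

```python
import math

def norm_sf(x):
    """Upper tail 1 - Phi(x) of the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def norm_pdf(x):
    """Standard normal density phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def nu(x, terms=4):
    """Discreteness correction: exp(-0.583 x) for x < 2, else an assumed series."""
    if x <= 0.0:
        return 1.0
    if x < 2.0:
        return math.exp(-0.583 * x)
    s = sum(norm_sf(0.5 * x * math.sqrt(n)) / n for n in range(1, terms + 1))
    return (2.0 / (x * x)) * math.exp(-2.0 * s)

def genome_wide_p(b, n_chrom, length, beta, delta):
    """Assumed form of (4): chance that max |Z| over all markers exceeds b."""
    correction = nu(b * math.sqrt(2.0 * beta * delta))
    rate = (2.0 * n_chrom * norm_sf(b)
            + 2.0 * beta * length * b * norm_pdf(b) * correction)
    return 1.0 - math.exp(-rate)

def threshold(p_target, n_chrom, length, beta, delta):
    """Bisection for the b whose genome-wide false-positive rate is p_target."""
    lo, hi = 1.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if genome_wide_p(mid, n_chrom, length, beta, delta) > p_target:
            lo = mid  # still too many false positives: raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Idealized tomato genome: 12 chromosomes of 100 cM, backcross (beta = 0.02 per cM).
b_dense = threshold(0.05, 12, 1200.0, 0.02, 0.0)    # continuous markers
b_sparse = threshold(0.05, 12, 1200.0, 0.02, 20.0)  # markers every 20 cM
```

Squaring `b_dense` gives the threshold on the 2 ln LR scale. A sparser map yields a lower threshold because ν falls below 1 and shrinks the second term.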
For a backcross design with a QTL located exactly at a marker, ![]()
![]() |
(5) |
where $\xi = \{N \ln[1 + \alpha^{*2}/(4\sigma_e^2)]\}^{1/2}$, and $\nu = \nu(b\{2\beta\Delta\}^{1/2})$, as defined previously. The parameter $\xi$ is the noncentrality parameter of (3) expressed in terms of the parameters of the model (2). The first term in (5) is the probability the process is above the threshold at the QTL; the second is the probability that it is below at the QTL but crosses the threshold at some nearby marker. Unless the markers are closely spaced, the first term by itself is a reasonably good approximation. When the QTL is located between markers, it is necessary to analyze the (correlated) process at the two flanking markers. The more complex approximation requires a one-dimensional numerical integration.

The noncentrality parameter at a marker at distance $\Delta_1$ from the QTL is

$$\xi e^{-\beta\Delta_1}, \qquad (6)$$

where $\xi$ and $\beta$ are as defined above. From (6) and (4) we see the importance of the parameter $\beta$, which equals 0.02 for backcross designs, but can assume a larger value for other designs (e.g., recombinant inbred designs). In (4), $\beta$ multiplies the length of the genome, so a larger value requires a larger threshold to maintain a given false-positive error rate. From (6) we see that it also governs the rate at which the noncentrality parameter decays as a function of the distance from QTL to flanking marker. A large value of $\beta$ means a rapid falling off in power to detect the QTL as a function of that distance. On the other hand, it also provides the possibility for more precise fine mapping of the QTL location, because a large $\beta$ leads to a sharper delineation of the "peak" in the process 2 ln LR(d) that identifies the location of the QTL. We return to these issues below.
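As a rough numerical illustration of the first term in (5), the sketch below treats the statistic at a single marker as a normal variable with unit variance and computes the chance that it exceeds the threshold b in absolute value. The noncentrality {N ln[1 + α*²/(4σ_e²)]}^{1/2} follows from the backcross variance decomposition. The function names, the example values, and the decay factor exp(-βΔ₁) for a marker at distance Δ₁ are assumptions of this sketch, and the second term of (5) is ignored.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def backcross_noncentrality(n, alpha_star, sigma_e):
    """xi = {N ln[1 + alpha*^2 / (4 sigma_e^2)]}^(1/2) for a QTL at a marker."""
    return math.sqrt(n * math.log(1.0 + alpha_star ** 2 / (4.0 * sigma_e ** 2)))

def power_at_marker(xi, b, beta=0.02, dist_cm=0.0):
    """P(|Z| > b) with Z ~ N(xi * exp(-beta * dist_cm), 1); first term only."""
    mean = xi * math.exp(-beta * dist_cm)
    return (1.0 - norm_cdf(b - mean)) + norm_cdf(-b - mean)

# Hypothetical example: 200 backcross progeny, effect 0.8 residual SDs.
xi = backcross_noncentrality(n=200, alpha_star=0.8, sigma_e=1.0)
b = math.sqrt(14.6)  # dense-map backcross threshold quoted in the text
power_on_marker = power_at_marker(xi, b)
power_10cm_away = power_at_marker(xi, b, dist_cm=10.0)
```

Comparing the two values shows how quickly power falls off with distance from the nearest marker when only the marker process is used.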
The preceding analysis is concerned with the likelihood ratio process observed at the discrete set of marker loci. To mitigate the problems indicated by (6) when the QTL is in the center of a marker interval, ![]()
An argument of ![]()
Intercrosses:
Most previous theoretical analyses have concentrated on backcrosses and consequently have ignored dominance effects. ![]()
Consider the likelihood ratio statistic to test the general hypothesis that $\alpha = \delta = 0$ vs. the alternative that $\alpha \neq 0$ or $\delta \neq 0$. For intercross data the vectors with coordinates $x_i(d)$ and $1_{\{x_i(d)=1\}}$ (i = 1, ..., N) are asymptotically orthogonal. Therefore, the approximations used to obtain (3) now yield for the log-likelihood ratio at the marker d

$$2 \ln LR(d) \approx \frac{N(2\hat\alpha_d^2 + \hat\delta_d^2)}{4\sigma_e^2}. \qquad (7)$$

To define a significance level, we give an approximation under the hypothesis of no linkage to the distribution of the maximum of (7) over all possible values of d.
Let

$$X_d = \frac{N^{1/2}\hat\alpha_d}{2^{1/2}\sigma_e}, \qquad Y_d = \frac{N^{1/2}\hat\delta_d}{2\sigma_e}. \qquad (8)$$

A straightforward application of the central limit theorem and calculation of covariances shows that when $\alpha = \delta = 0$, for large N, $X_d$ and $Y_d$ are approximately independent Ornstein-Uhlenbeck processes with mean 0 and covariance functions $e^{-2\lambda|t|}$ and $e^{-4\lambda|t|}$, respectively. An approximation to the tail distribution of the maximum of (7) is provided by
![]() |
(9) |
where $\beta_1 = 2\lambda$, $\beta_2 = 4\lambda$, $a = b^2$, and $\nu = \nu(b\{\Delta(\beta_1 + \beta_2)\}^{1/2})$. As in the case of (4), this approximation does not take interval mapping into account. It is obtained by a suitable modification of WOODROOFE's (1976) argument. For an idealized tomato genome consisting of 12 chromosomes of length 100 cM each and a dense set ($\Delta$ = 0) of markers, the 0.05 false-positive threshold obtained from (9) is a = 19.0 (LOD = 4.13), in comparison with a = 14.6 (LOD = 3.17) for the backcross case. Although smaller thresholds are required when the intermarker distance is greater, for an intercross the conventional LOD = 3 threshold would lead to a false-positive rate greater than 0.05 even for intermarker distances of 25 cM. This stands in contrast to the case of a backcross, where the LOD = 3 threshold is conservative for intermarker distances down to ~1 cM.
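The quoted thresholds and their LOD equivalents can be checked with a few lines of Python. The sketch assumes that, for dense markers, the 2-d.f. tail approximation takes the form 1 - exp{-[C + ((β₁ + β₂)/2) L a] e^{-a/2}}, with C the number of chromosomes. That form and the function names are assumptions made for illustration, adopted because they reproduce the a = 19.0 figure above.

```python
import math

TWO_LN_10 = 2.0 * math.log(10.0)  # about 4.6

def to_lod(a):
    """Convert a value on the 2 ln LR scale to the LOD scale."""
    return a / TWO_LN_10

def intercross_p_dense(a, n_chrom, length, beta1, beta2):
    """Assumed dense-marker (Delta = 0) form of the 2-d.f. tail approximation."""
    beta_bar = 0.5 * (beta1 + beta2)
    rate = (n_chrom + beta_bar * length * a) * math.exp(-0.5 * a)
    return 1.0 - math.exp(-rate)

# 12 chromosomes of 100 cM; beta1 = 2*lambda, beta2 = 4*lambda with lambda = 0.01 per cM.
p_intercross = intercross_p_dense(19.0, 12, 1200.0, 0.02, 0.04)
lod_intercross = to_lod(19.0)
lod_backcross = to_lod(14.6)

# False-positive rate if the conventional LOD = 3 cutoff were applied instead.
p_at_lod3 = intercross_p_dense(3.0 * TWO_LN_10, 12, 1200.0, 0.02, 0.04)
```

Under these assumptions `p_at_lod3` comes out far above 0.05, which is the reason the conventional cutoff is too lenient for a dense intercross scan.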
To check the accuracy of (9) and our interval mapping approximation, we simulated thresholds for the log-likelihood ratio based on an intercross sample of N = 350 organisms with 12 chromosomes of total length 1200 cM (to approximate the tomato genome). The interval mapping step was performed using an approximation due to ![]()
Both approximations are very accurate. As predicted, the process with the interval mapping step requires a higher threshold for a given value of the Type-I error. For smaller N, somewhat different approximations yielding larger thresholds need to be used, since the given approximations do not take into account the variability in the estimate of the variance, $\sigma_y^2$. However, when N is large (at least 200), the approximations provide thresholds for the statistic and marker density actually used, which are more appropriate than the conventional LOD = 3.0. In mapping human traits, ![]() $\Delta$ = 0 should be used in all cases. If we use this threshold, it is not necessary to rationalize the choice of $\Delta$, which should otherwise be an average intermarker distance in the neighborhood of detected linkages, or to concern ourselves about the effect of interval mapping on the false-positive error rate. But insistence on this threshold would noticeably reduce the power of the test, as is shown shortly.
For intercross data the noncentrality parameter for a QTL located at a marker locus is $\xi = \{N \ln[1 + (2\alpha^2 + \delta^2)/(4\sigma_e^2)]\}^{1/2}$. To attribute appropriate parts of the total noncentrality to the two processes in (10), we let
1 = 
,
2 =
. If the QTL is located at a marker, the power is approximately
![]() |
(10) |
=
(b{2ß
}
), ß =
. For a QTL between markers, one must as in the backcross case consider the joint distribution at flanking markers. For a marker at distance
1 from the QTL the noncentrality parameters are ![]() |
(11) |
See Appendix 1 for an approximation for the power of the interval mapping process.
Using simulations and the theoretical power approximations above, we compare in Figure 2 the power of the marker process with the power of the interval mapping test. We also present the power of the interval mapping test using the more stringent threshold (assuming continuous markers) proposed by ![]()
The simulations used a dominance model ($\delta = \alpha$) and $\xi$ = 4.12, 4.41, 4.75, and 5.21, which correspond roughly to powers of 60, 70, 80, and 90% with a continuous map of markers. For recessive ($\delta = -\alpha$) models, the power would be exactly the same. For the same noncentrality values and an additive model ($\delta$ = 0), it would be slightly larger. Power under two map densities was estimated ($\Delta$ = 20 and 5 cM) and we used N = 350 tomato genomes. Each power simulation is based on 1000 replicates. The gain in power from using interval mapping is small, on the order of 2 to 4%, a result similar to that found by ![]()
($\Delta$ = 20 cM), but the gain is only ~3 to 4%. Using the threshold for a continuous map when in fact a sparse map of markers is used greatly reduces the power (by as much as 20%).
We have made similar computations with similar results for backcross designs.
When the markers only process is used, the theoretical power approximations are very good, so only the simulated values have been included in Figure 2. The approximations are also good for interval mapping except when the intermarker distance is 5 cM and the QTL is midway between markers. In this case the power is underestimated by ~5%. The reason is that the theoretical approximation involves only the probability that the process is above the threshold somewhere in the interval containing the QTL and neglects the probability of detecting the QTL to be in a neighboring interval. This is not a problem when the intermarker interval is large.
Other designs and a comparison of different designs:
Many other designs can be handled by similar approximations. To evaluate an appropriate threshold, for the markers only process it is only necessary to know the recombination parameter ß (or ß1 and ß2), which depends only on the design, not the mathematical model used for recombination. Although there is no general method to evaluate this parameter, it has been calculated for many different designs. (Some values are given below.) For interval mapping one must know the complete covariance function, which depends on both the design and the model for recombination.
For instance, for recombinant inbred data, which involve the 1-d.f. statistic (3), one can use approximation (4) with ß = 0.04 for recombinants produced by selfing and ß = 0.08 for recombinants produced by recurrent sib mating (as originally suggested by ![]()
![]()
![]()
, ß2 = 2i
. For reciprocal backcross designs, where half of the offspring are backcrossed to each parental strain, one can use (9) with ß1 = ß2 = 0.02.
In ![]()

where
(d) is the maximum-likelihood estimate of the sum of the additive and dominance effects. One can show that under the null hypothesis, X(d) is approximately a Gaussian process with covariance function R(d) = 1 - 
|d| + o(|d|) as d
0. Therefore, approximation (4) can be used with ß = 
to find an appropriate threshold.
![]()

Here $F_k$ is the $\chi^2$ distribution with k degrees of freedom, $\Gamma$ denotes the gamma function, and $\beta$ would be replaced by $(\beta_1 + \beta_2)/2$ for an intercross design. Corrections for discrete spacing of markers would be exactly as above.
We have used the theory developed above to compare the power of backcross, intercross, and recombinant inbred designs (obtained by recurrent sib mating). Let $\sigma_A^2$, $\sigma_D^2$, $\sigma_E^2$ denote the total additive, dominance, and environmental variances, respectively. Assuming that environmental and genetic effects are uncorrelated and there is no epistasis, we have the usual representation of the phenotypic variance as $\sigma_y^2 = \sigma_A^2 + \sigma_D^2 + \sigma_E^2$. Let $H^2 = (\sigma_A^2 + \sigma_D^2)/\sigma_y^2$ denote the wide sense heritability in the intercross, and let $\tau$ denote the signed ratio of dominance to additive effects, scaled so that $\tau^2 = \sigma_D^2/\sigma_A^2$. To reduce the number of different special cases we assume that $\tau$ is the same at all QTL; i.e., they all have the same relative amount of dominance. If we let $v^2$ be the heritability attributable to the locus of interest, i.e., the part of $H^2$ contributed by that locus, then the noncentrality parameters of an intercross, backcross, and recombinant inbred design are, respectively, $[-N \ln(1 - v^2)]^{1/2}$, $\{-N \ln[1 - v^2(1 + 2^{1/2}\tau)^2/(H^2(1 + 2^{1/2}\tau)^2 + 2(1 - H^2)(1 + \tau^2))]\}^{1/2}$, and $\{-N \ln[1 - 2v^2/(1 + \tau^2 + H^2(1 - \tau^2))]\}^{1/2}$.
Suppose $\tau$ = 0. It is easy to see that the noncentrality parameter of the backcross is smallest and that of the recombinant inbred is largest. All three noncentrality parameters are comparable for large $H^2$, but there can be sizeable differences for small $H^2$. Because the threshold required for a given significance level is smallest for the backcross and largest for the intercross, one expects to find the backcross the most powerful design when $H^2$ is large, but not otherwise.
A numerical example is given in Table 1. We have determined for continuous markers sample sizes that give 80% power for values of $H^2$, $v^2$, and $\tau$. Although the exact sample sizes depend on $v^2$, their relative values are roughly constant throughout a broad range where $v^2$, the heritability attributable to the QTL, contributes from roughly 1/8 to 1/2 of $H^2$, so only the intermediate value $v^2 = 0.2 H^2$ is included in the table. Similarly the relative sample sizes are fairly insensitive to the exact power required. In agreement with the qualitative analysis of the preceding paragraph, for $\tau$ = 0 the sample size required by a backcross design is about the same as that of the intercross for $H^2$ = 0.75 but is appreciably larger for $H^2$ = 0.25. For $\tau^2$ = 0.04, the backcross design can require somewhat smaller or much larger sample sizes than the intercross design depending on whether $\tau$ is positive or negative, which in turn depends on the parental strain used for the backcross. Hence with a small amount of dominance, probably too small to be detected in segregation analysis, a backcross design can yield a very misleading picture. The sample sizes required of the recombinant inbred design are smaller than those of the intercross and backcross designs and are insensitive to the values of $\tau$, at least for the relatively small values considered here.
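The three noncentrality parameters given above are easy to compare numerically. In the sketch below, `tau` stands for the signed dominance-to-additive ratio whose square multiplies the additive variance in those formulas; that name, the function name, and the example values of N, H² and v² are assumptions made for illustration only.

```python
import math

def noncentrality(design, n, h2, v2, tau=0.0):
    """Noncentrality parameter for an 'intercross', 'backcross' or 'ri' design."""
    if design == "intercross":
        frac = v2
    elif design == "backcross":
        dom = (1.0 + math.sqrt(2.0) * tau) ** 2
        frac = v2 * dom / (h2 * dom + 2.0 * (1.0 - h2) * (1.0 + tau ** 2))
    elif design == "ri":
        frac = 2.0 * v2 / (1.0 + tau ** 2 + h2 * (1.0 - tau ** 2))
    else:
        raise ValueError("unknown design: " + design)
    return math.sqrt(-n * math.log(1.0 - frac))

# Hypothetical setting: N = 200, low heritability, QTL contributing 0.2 * H^2.
n, h2 = 200, 0.25
v2 = 0.2 * h2
no_dominance = {d: noncentrality(d, n, h2, v2)
                for d in ("backcross", "intercross", "ri")}

# A small amount of dominance (tau^2 = 0.04) with either sign.
bc_positive = noncentrality("backcross", n, h2, v2, tau=0.2)
bc_negative = noncentrality("backcross", n, h2, v2, tau=-0.2)
```

With no dominance the ordering backcross < intercross < recombinant inbred appears directly, and the two backcross values with dominance show how strongly the result depends on which parental strain is used.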
We have performed similar calculations when the amount of dominance varies across QTL. The sample sizes in the backcross column can change substantially, but the qualitative picture is the same.
This problem with a backcross design could in principle be eliminated by backcrossing to both parental strains and using a 2-d.f. statistic (with ß1 = ß2 = 0.02). One can easily evaluate the noncentrality parameter and see that for small values of H2 such a reciprocal backcross is less powerful than an intercross design based on an equal number of progeny, but is slightly more powerful than an intercross design based on an equal number of matings (hence presumably half as many progeny). For larger values of H2, numerical calculations as in Table 1 can help one determine the potential usefulness of such a design.
To simplify the preceding comparison, we have assumed continuously distributed markers. This has the effect of concealing a weakness of the recombinant inbred design, which has a very large recombination parameter ($\beta$ = 0.08). A consequence is that if markers are not closely spaced there is a considerable loss of power to detect a QTL located midway between markers. For an example consider the fourth row of Table 1, where the recombinant inbred design is much more powerful than either of the other two. For a $\Delta$ = 20-cM map and a QTL midway between markers, the power falls to about 0.73 if we use the sample sizes given in the table with an intercross or backcross design. To achieve this power with a recombinant inbred design, one would need a sample size of ~380, and in this case interval mapping would be mandatory. Otherwise a sample size of ~690 would be required. For a $\Delta$ = 5-cM map, the power of a backcross or intercross would fall only to 0.79 for a QTL midway between markers. Now for a recombinant inbred design a sample size of about 291 would be required (300 without interval mapping). To achieve the benefits of a recombinant inbred design, it appears advisable to type markers at no more than 5 cM distance, and closer would be better. A similar caution is applicable to the advanced intercross designs of ![]()
Confidence regions for QTL:
A confidence region can be used to identify a chromosomal region in which to concentrate the search for the exact location of a QTL. In this section, three methods of constructing a confidence region around the gene locus are presented and compared. It is perhaps worth noting from the outset that this is not a "regular" estimation problem as the term is used by statisticians. Because the likelihood function has cusps at marker loci, the maximum-likelihood estimate of a QTL may fail to be approximately normally distributed, so one is not justified in using the maximum-likelihood estimator plus or minus two estimated standard errors as an approximate 95% confidence interval.
Support intervals:
A support region consists of all loci d whose log-likelihood ratio lies within a fixed amount x of the maximum:

$$\{d : 2 \ln LR(d) \geq \max_{d'} 2 \ln LR(d') - x\}. \qquad (12)$$

Often the 2 is omitted and common logarithms are used. Then one speaks of a LOD support region. The value x in (12) provides an (x/2 ln 10)-LOD support region. With data from a single marker the statistical problem is regular, so a 1-LOD support interval (x = 4.6) is approximately a 97% confidence interval (because 4.6 is the 97th percentile of the $\chi^2$ distribution with 1 d.f.).
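The construction of a support region, and the single-marker coverage figure quoted above, can be illustrated with a short Python sketch. The marker positions and statistic values are made up for the example, and the function names are hypothetical.

```python
import math

def support_region(positions, stat, x):
    """Loci whose 2 ln LR is within x of the maximum (x = 2 ln 10 is 1 LOD)."""
    peak = max(stat)
    return [pos for pos, s in zip(positions, stat) if s >= peak - x]

def chi2_1df_cdf(x):
    """CDF of the chi-square distribution with 1 d.f., via the error function."""
    return math.erf(math.sqrt(x / 2.0))

positions = [0, 5, 10, 15, 20, 25, 30]          # marker positions in cM (made up)
stat = [3.1, 9.8, 17.0, 21.0, 18.2, 11.5, 4.0]  # 2 ln LR at each marker (made up)

one_lod = support_region(positions, stat, 2.0 * math.log(10.0))
coverage_single_marker = chi2_1df_cdf(4.6)
```

The value `coverage_single_marker` is the nominal coverage that would apply in a regular single-marker problem; it is not the coverage of a support region in a genome scan, which is the subject of the discussion that follows.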
Likelihood methods:
A second method to provide a confidence interval for a QTL relies on using likelihood methods for change points (![]()
![]()
Although the actual procedure is based on twice the log-likelihood ratio, our discussion will be simplified notationally by using the asymptotically equivalent $\|Z_d\|^2$, where $Z_d = (X_d, Y_d)$ is defined in (8) [cf. also (7)] and $\|Z_d\|^2 = X_d^2 + Y_d^2$. In terms of these variables the acceptance region for the likelihood ratio test of the hypothesis that a QTL is located at q has the form

$$A_q = \{\max_d \|Z_d\|^2 - \|Z_q\|^2 \leq x\}.$$

By sufficiency, the conditional probability of $A_q$ given $Z_q$ does not depend on the unknown parameters $\alpha$, $\delta$. Hence in principle we can choose $x = x(Z_q)$ such that
![]() |
(13) |
The set of all values q that are not rejected by this test is a (1 -
)100% confidence region (![]()
As the desired conditional probability does not depend on $\alpha$, $\delta$, it can be evaluated under the hypothesis that these parameters are both zero. The approximation (B1) of Appendix 1 yields as a confidence interval for the QTL those loci q such that
![]() |
(14) |
The likelihood method works best for very dense sets of markers (~1 cM), as the argument given above is technically correct only when the QTL is at a marker. It can be extended to provide a joint confidence region for the locus and the additive and dominance effects (![]()
By (7) the inequality defining $A_q$ and the inequality in (12) are asymptotically equivalent. The important difference between the likelihood ratio and LOD support methods is that for the former x depends on $Z_q$ and is chosen to make the conditional probability (13) equal to the desired confidence level. For any value x that does not depend on the data, the probability of (12) depends on the values of $\alpha$ and $\delta$. Hence the support region is not a confidence region in the strict sense of the word. However, the similarity between the support regions and the likelihood ratio regions allows us to gain some interesting theoretical insights. For example, under the assumption that the QTL lies at a marker locus and that the distance $\Delta$ between markers is small, we can evaluate approximately the probability that a support region does not contain the true QTL, by taking the expectation of (B1) in Appendix 1 with respect to $Z_q = z$. The result of some simple approximations is
![]() |
(15) |
= (
21 +
22)
,
=
, ß = 0.02 . Numerical calculations based on this approximation suggest, and simulations reported below verify, that for a given value of
the coverage probability of the support region is relatively insensitive to the values of
and to the relative sizes of the additive and dominance components, at least for values of
in the range 4
10, where detection of linkage ranges from reasonably likely to virtually certain, so QTL localization is especially important. The coverage probability is an increasing function of the intermarker distance
, so a 1.5-LOD support region has
95% coverage when
1 cM, while a 1-LOD support region gives similar coverage for
20 cM. Hence for practical purposes a support region is approximately a confidence region, albeit with a different confidence coefficient than that suggested by standard statistical distribution theory.
For problems involving a single parameter, e.g., for backcrosses, recombinant inbreds, or intercrosses where we estimate only
and ignore
, the factor in square brackets in (15) immediately preceding the exponential would be [
]
. It is easy to see that at least for comparatively large values of
, the coverage probability for a given value of x is relatively insensitive to this change of dimension.
An approximation for the expected size of a support region, which is valid for dense markers (~1 cM), is given in Appendix 1. A less precise but more easily interpreted approximation, valid when
x, is obtained by approximating the normal density in (B2) with mean
by a point mass at
, then taking two terms of the Taylor series expansion of ln[
2/(
2 - x)], which yields
![]() |
(16) |
This expression is roughly proportional to $\xi^{-2}$, hence to $N^{-1}$. In contrast, for regular statistical problems the size of a confidence region is inversely proportional to the square root of the sample size. The fact that confidence regions for a QTL are roughly inversely proportional to the sample size has been observed in the simulations of ![]()
Bayesian credible regions:
Given a prior probability for the location of the QTL and for the noncentrality parameters (
1,
2), a set having a posterior probability of 1 -
is called a Bayesian credible region. ![]()
confidence regions having many desirable properties. ![]()
![]()
![]()
![]()
A Bayesian credible region B
is constructed by including all loci v whose posterior density given the data exceeds c
, i.e.,
![]() |
(17) |
is chosen so that

Here y = {y1, ... , yN}, x = {x1, ... , xN} and xi is the set of all marker genotypes for individual i. The posterior probability
(v|y,x) is often easy to compute and depends on the prior distribution on the location q and the additive and dominance effects
and
. If one takes uninformative priors on all parameters,
![]() |
(18) |
=
) and recessive (
= -
) models and with variance of one was also applied. Results are presented in the next section.
Comparison study:
Using simulated tomato genomes, we constructed the likelihood confidence region, the 1.0- and 1.5-LOD support region and the Bayes credible regions, with the three different priors mentioned above. However, only the results from Bayes credible sets with a mixture of normal priors are included in Table 2 and Table 3. For each tomato, the crossover process for the chromosome containing the QTL was generated using the Haldane mapping function and the phenotype yi was assigned the value

where the ei's are normal random variables with mean 0 and variance 1.
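A simulation of this kind can be sketched in a few lines of Python. The sketch generates each gamete of an F2 individual as a two-state chain along equally spaced markers, with the recombination fraction between adjacent markers given by the Haldane map function, and then assigns a phenotype. The phenotype formula y = αx + δ·1{x = 1} + e (with zero baseline mean), the QTL being placed at a marker, and all names and parameter values are assumptions of this sketch rather than the settings used in the study.

```python
import math
import random

def recomb_fraction(d_cm):
    """Haldane map function: recombination fraction for a distance in cM."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def simulate_gamete(n_markers, spacing_cm, rng):
    """Allele indicators (1 = A, 0 = B) along one gamete from an F1 parent."""
    r = recomb_fraction(spacing_cm)
    allele = rng.randint(0, 1)
    gamete = [allele]
    for _ in range(n_markers - 1):
        if rng.random() < r:  # crossover between adjacent markers
            allele = 1 - allele
        gamete.append(allele)
    return gamete

def simulate_f2(n, n_markers, spacing_cm, qtl_index, alpha, delta, rng):
    """Marker genotypes (number of A alleles) and phenotypes for n F2 progeny."""
    genotypes, phenotypes = [], []
    for _ in range(n):
        g1 = simulate_gamete(n_markers, spacing_cm, rng)
        g2 = simulate_gamete(n_markers, spacing_cm, rng)
        x = [a + b for a, b in zip(g1, g2)]
        xq = x[qtl_index]
        y = alpha * xq + delta * (1 if xq == 1 else 0) + rng.gauss(0.0, 1.0)
        genotypes.append(x)
        phenotypes.append(y)
    return genotypes, phenotypes

# One 100-cM chromosome, markers every 5 cM, QTL at the middle marker.
rng = random.Random(1)
genotypes, phenotypes = simulate_f2(350, 21, 5.0, 10, 0.3, 0.3, rng)
```

Placing the QTL between markers would require simulating the genotype at the QTL position as an extra, unobserved locus on the same chain.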
We performed the simulations for the dominance model ($\delta = \alpha$), with $\xi$ = 5, 7.5, and 10.0. The trait locus was either at a marker, midway between markers, or randomly assigned. We generated 1000 sets of 350 tomatoes and calculated the average size and the probability of covering the true locus given a map with $\Delta$ = 1, 5, and 10 cM. Interval mapping was used throughout.
Both the 1.5-LOD (x = 6.9) support regions and the Bayesian credible regions provided at least 95% coverage under all simulated conditions. The support regions gave the smallest confidence regions for dense maps, while the Bayesian credible regions did the same for sparse maps. The coverage probability for the support regions obtained in the simulations is close to that predicted by the approximation (15). The approximate expected size provided by (B2) is close in the case of a dense map, but not otherwise. The likelihood method was conservative. Because it adapts to the observed value of the likelihood ratio statistic at the putative trait locus, it resulted in the widest confidence regions for small values of the noncentrality parameter, but it was equivalent to the support region for the larger values of the noncentrality parameter, 7.5 and 10. For all methods, the sizes of the intervals were largest when the trait was midway between markers. The Bayes credible sets were the widest, and they fell short of the desired 95% for large values of the noncentrality parameter and sparse maps, especially when the trait was located at a marker.
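For readers who want to see how a LOD support region is read off a LOD profile, here is a minimal sketch. It assumes the LOD scores have been evaluated on a grid of positions and returns the outermost grid points within the chosen drop of the peak, so a disconnected region would be reported as its hull; the function names are illustrative. The first helper converts a LOD drop to the likelihood ratio scale (twice the natural log-likelihood difference), which is how a 1.5-LOD drop corresponds to x = 6.9.

```python
import math


def lod_drop_to_lr_scale(lod_drop):
    """Convert a drop in LOD (log10) units to twice the natural log-likelihood ratio."""
    return 2.0 * math.log(10.0) * lod_drop


def lod_support_region(positions_cm, lod_scores, lod_drop=1.5):
    """Return (lower_cm, upper_cm) spanning all grid points within lod_drop of the peak."""
    peak = max(lod_scores)
    inside = [pos for pos, lod in zip(positions_cm, lod_scores)
              if lod >= peak - lod_drop]
    return min(inside), max(inside)
```

A finer grid gives a more accurate region boundary; on a coarse grid the reported endpoints are only accurate to the grid spacing.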
The size of the confidence regions is relatively insensitive to the marker density when the distance between markers and the size of the region are roughly commensurate; but when the noncentrality parameter is large, the dense marker map provides substantially smaller regions.
We performed similar simulations for a backcross with essentially the same results (data not shown). The simulations were repeated with fewer tomatoes (N = 100) (results not shown). The size of the region was unchanged for all methods, and all methods had the right coverage probability when the locus was located at a marker. The coverage probability was substantially reduced for the likelihood method and the Bayes method when the trait was located midway between markers (approximately 80% instead of 95%). The LOD support method had a slight drop in coverage (approximately 90%), but was more robust than the other methods.
We have also simulated support regions under the conditions of Table 2 of ![]()
![]()
= 5.06, the coverage probability of the 1-LOD region based on 1000 simulations was 96%, and the expected size was 29 cM compared with 96% and 43 cM obtained by ![]()
Another method to obtain confidence intervals for QTL location has been proposed by ![]()
![]()
| DISCUSSION |
|---|
In this article we have discussed genome scanning methods to detect QTL in experimental genetics. Our goal has been to produce relatively simple approximations for quantities of interest, e.g., the false-positive error rate, power to detect a QTL, and coverage probability of a support region, so that one can easily address questions concerning sample size, marker density, etc., and can compare different designs. Our approximations for significance level and power seem adequate in this regard, but our approximations for the expected size of a support region are good only for dense markers (e.g., intermarker distances of about 1 cM).
Although in a backcross the conventional LOD = 3 threshold produces false-positive rates <0.05 unless intermarker distances are small, it is anticonservative in an intercross even for intermarker distances as large as 25 cM without interval mapping.
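One way to see why a fixed LOD threshold behaves differently in the two designs is to compare pointwise tail probabilities, assuming the usual large-sample chi-square reference distributions with 1 d.f. (backcross) and 2 d.f. (intercross). The sketch below covers only a single locus; it does not reproduce the genome-wide approximations (4) and (9), and the function names are hypothetical.

```python
import math


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_sf_2df(x):
    """Upper tail probability of a chi-square variable with 2 d.f."""
    return math.exp(-x / 2.0)


def pointwise_p_value(lod, degrees_of_freedom):
    """Pointwise p-value for a LOD score under a 1- or 2-d.f. chi-square reference."""
    x = 2.0 * math.log(10.0) * lod  # LOD converted to the likelihood ratio scale
    if degrees_of_freedom == 1:
        return chi2_sf_1df(x)
    if degrees_of_freedom == 2:
        return chi2_sf_2df(x)
    raise ValueError("only 1 or 2 degrees of freedom are supported in this sketch")
```

At LOD = 3 the 2-d.f. pointwise tail probability is exactly 10^-3, several times larger than the 1-d.f. value, which is consistent with the same threshold being less stringent in an intercross.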
Our approximations are based on the artificial assumption that markers are equally spaced and there are no missing data. If markers are not equally spaced, the approximations (4) and (9) can be modified by averaging the function of intermarker distance that appears in them with respect to the distribution of the distances between markers. One can also use the original approximations with an average intermarker distance. (This should be the average distance in the neighborhood of detected QTL if one adds additional markers to promising regions.) Since (4) and (9) are insensitive to minor changes in the assumed value of the intermarker distance, one can reasonably expect such refinements to have little practical effect. If we use interval mapping to impute missing marker data, the resulting process is more correlated than would be the case if the data were not missing, so the threshold obtained under the assumption of no missing data is still appropriate and, in fact, slightly conservative.
The assumption of normality is robust in the sense that the regression statistics we use are approximately normally distributed in large samples, so our approximations for significance level and power are valid in large samples. However, it is possible that by using a more appropriate model, e.g., a mixture model if the nonnormality arises from large QTL effects, one can obtain greater power, although large QTL effects will be comparatively easy to detect with a suboptimal procedure.
When using a backcross or intercross, intermarker distances up to ~10 cM are almost as powerful as continuously distributed markers. Except at intermarker distances of ~20 cM or more, or when using a design involving a large recombination rate, e.g., a recombinant inbred design or advanced intercross design, there is little gain in power from interval mapping, which in any event does not provide nearly as much power as more closely spaced markers.
Although intercross designs involve a 2-d.f. statistic and hence a higher threshold than a backcross design, and have larger residual variance, intercross designs are usually more powerful than backcross designs, unless (a) the effect of the gene is large and additive or (b) there is dominance and the dominance deviation has the same sign as the additive genetic effect. A backcross design can lose considerable power in the presence of even a small departure from additivity if the incorrect parental strain is used for the backcross. A recombinant inbred design can be more efficient than an intercross, except when dominance effects are large compared to additive effects. Because of the high recombination rate associated with recombinant inbreds, especially those based on recurrent sib mating, power to detect linkage falls off rapidly with intermarker distance when a QTL is located midway between markers. To avoid this loss of power when using an inbred design based on recurrent sib mating, intermarker distances should be no more than 5 cM and preferably should be even less. Similar considerations apply to advanced intercross lines (![]()
We have also presented three methods of constructing confidence regions for the location of QTL: the likelihood method, Bayes credible sets, and support regions. The support method and the Bayesian credible sets seem roughly comparable in large samples, but the coverage probability of the support method is more robust to changes in the sample size. Both methods are better than the likelihood ratio method, which often has a coverage probability substantially smaller than the nominal level, except for the case of dense markers.
The size of a confidence region depends on the noncentrality parameter and the density of the markers in the neighborhood of the QTL. When the noncentrality parameter is ~5, which provides power of ~0.9 for QTL detection, little is gained by having markers more closely spaced than ~10 cM; but when the noncentrality parameter is 7.5, intermarker distances of 1 to 5 cM provide shorter confidence regions. A reasonable guideline is to achieve a marker density in the neighborhood of a putative QTL about equal to the expected half length of a support region for a QTL of that strength.
When dominance effects are relatively small and markers sufficiently dense, support regions from recombinant inbred designs are often about one-fourth as large as from intercross designs, which in turn are substantially smaller than from backcross designs. Advanced intercross designs (![]()
We have not explicitly addressed the complexities associated with identifying multiple, possibly linked, possibly interacting, QTL. For mapping qualitative traits in humans, we have discussed these issues (![]()
Similar considerations are relevant to recently suggested multiple regression methods, e.g., ![]()
![]()
The second method discussed by ![]()
We expect to return to the problem of detecting multiple, possibly linked, QTL in a future article.
| ACKNOWLEDGMENTS |
|---|
This research was partially supported by the National Institutes of Health grant HG-00848 and the National Science Foundation grant DMS 9704324.
Manuscript received April 14, 1997; Accepted for publication September 21, 1998.
| APPENDIX 1 |
|---|
Power of interval mapping:
We first consider a backcross and suppose there is a single trait locus (on any particular chromosome) at q. Let Zd denote the signed square root of twice the log-likelihood ratio (incorporating interval mapping), which for large N behaves like a piecewise smooth Gaussian process. We use the basic decomposition

The first term on the right-hand side is given by 1 − Φ(b − ξq), where ξq = E(Zq). To approximate the second term, we assume that if the process exceeds the threshold for some d ≠ q, it does so at a value of d between the same two flanking markers as q, or in one of the immediately adjoining marker intervals in the case that q is itself a marker. This analysis can be expected to yield reasonable approximations when intermarker intervals are large, which is when interval mapping is supposed to be most helpful. It may not be effective when the intermarker distances are small, especially if the noncentrality is also small. We approximate maxd Zd by expanding Zd in two terms of a Taylor series around d = q and using calculus to maximize the resulting expression. ![]()
![]() |
(A1) |
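The first term of the decomposition, the probability that the statistic at the true locus q already exceeds the threshold b, is a plain normal tail probability if Zq is treated as normal with unit variance and mean E(Zq). A minimal sketch of that term alone follows; it ignores the second term, so it understates the full approximation (A1), and the function names are illustrative.

```python
import math


def std_normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def power_first_term(threshold_b, mean_at_qtl):
    """P(Zq > b) when Zq is normal with mean E(Zq) = mean_at_qtl and unit variance."""
    return 1.0 - std_normal_cdf(threshold_b - mean_at_qtl)
```

When the mean at the QTL equals the threshold, this term is exactly one-half; it increases toward one as the mean grows beyond the threshold.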
The noncentrality parameter ξq can be evaluated by a direct computation starting from a suitable explicit representation of the interval mapping statistic. ![]()
![]() |
(A2) |
where ȳ = N⁻¹ Σ yi. In the case where the locus d lies between flanking markers, we replace the actual marker data, xi(d), by its conditional expectation given the genotypes of the flanking markers, E[xi(d) | Gi]. Taking expectations and using (2), we see from some simple manipulations that the noncentrality is asymptotically equal to

To express this explicitly in terms of recombination fractions, let θ1 (θ2) denote the recombination fraction between the QTL at q and the marker flanking on the left (right), and θ the recombination fraction between the two flanking markers. Then straightforward calculations yield

where
2 = N ln{1 + [
]2}. This reduces to the noncentrality
when
1 = 0, so
2 =
. At the midpoint between markers, if we assume the Haldane model of no interference it simplifies to

This always exceeds the parameter (6), although a direct comparison is not really meaningful because the markers-only statistic involves the maximum of the process at the two flanking markers.
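The conditional expectation E[xi(d) | Gi] used earlier in this appendix can be illustrated for a backcross. The sketch below assumes a 0/1 coding of the genotype at each locus and no crossover interference (the Haldane model), and obtains the conditional probability by Bayes' rule from the two recombination fractions between the putative locus and its flanking markers. The coding and function names are assumptions for the example and may differ from the coding used in the paper.

```python
def flanking_recomb_fraction(theta_left, theta_right):
    """Recombination fraction between the flanking markers under no interference."""
    return theta_left + theta_right - 2.0 * theta_left * theta_right


def expected_qtl_genotype(left_marker, right_marker, theta_left, theta_right):
    """E[x | flanking genotypes] for a backcross with x coded 0 or 1.

    theta_left and theta_right are the recombination fractions between the
    putative locus and the left and right flanking markers. The two genotypes
    are equally likely a priori, so the posterior is a likelihood ratio.
    """
    def likelihood(x):
        p_left = 1.0 - theta_left if left_marker == x else theta_left
        p_right = 1.0 - theta_right if right_marker == x else theta_right
        return p_left * p_right

    weight_0 = likelihood(0)
    weight_1 = likelihood(1)
    # Undefined only for marker data that are impossible under the given fractions.
    return weight_1 / (weight_0 + weight_1)
```

At a marker (one of the recombination fractions equal to zero) the expectation reduces to the observed marker genotype, and at the midpoint of an interval with discordant flanking markers it equals one-half.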
We can also give as an approximation for the power of the interval mapping process
![]() |
(A3) |
A more detailed calculation along the lines of that given for a backcross yields an expression for ξq, which in general is somewhat complicated. In the special case that q is the midpoint between two markers separated by distance Δ, the parameter ξq is the norm of the vector with coordinates

where
1,
2, β1, and β2 are as defined in the paper.
| APPENDIX 2 |
|---|
Approximations for the conditional probability of (14) and the expected size of a LOD support region:
To approximate the conditional probability of (14), we begin with the following lemma.
LEMMA. Le













; the process with interval mapping is represented by
(solid symbols for the theoretical approximation) and
(power for the higher threshold appropriate when 








