Abstract
This article addresses the identification of genetic loci (QTL and elsewhere) that influence nonnormal quantitative traits, with a focus on experimental crosses. QTL mapping is typically based on the assumption that the traits follow normal distributions, which may not be true in practice. Model-free tests have been proposed. However, nonparametric estimation of genetic effects has not been studied. We propose an estimation procedure based on linear rank test statistics. The properties of the new procedure are compared with those of traditional likelihood-based interval mapping and regression interval mapping via simulations and a real data example. The results indicate that the nonparametric method is a competitive alternative to the existing parametric methodologies.
QUANTITATIVE genetics has developed rapidly, especially with progress in DNA-based genetic linkage maps. Various statistical approaches have been proposed to identify QTL by using molecular markers, such as Sax’s (1923) single-marker t-test, Lander and Botstein’s (1989) maximum-likelihood-based interval mapping, Haley and Knott’s (1992) regression interval mapping, and Zeng’s (1993, 1994) and Jansen and Stam’s (1994) composite interval mapping.
All the methods mentioned above are based on the normality assumption (or other parametric models) for the component distributions. The normal mixture model is the default analysis and is implemented in the widely used packages Mapmaker/QTL (Lincoln et al. 1993) and QTL Cartographer (Basten et al. 1997). Many traits, however, are not normally distributed. An example is tumor counts, which arise in cancer studies and often appear to follow a negative binomial distribution (Drinkwater and Klotz 1981). Naively assuming normality of the underlying distributions greatly simplifies the form of the likelihood function. A problem is that if this assumption is violated, then false detection of a major locus may occur (Morton 1984).
When the underlying distributions are suspected to be nonnormal, one strategy is to use a likelihood approach after transforming the data using, for example, the Box-Cox transformation (Draper and Smith 1998). However, an appropriate transformation may not exist or may be difficult to find. Also, this approach can raise serious issues of interpretation, and the transformation involves an extra parametric assumption.
An alternative approach is to consider nonparametric methods. Kruglyak and Lander (1995) apply linear rank statistics to interval mapping, which is implemented in the latest version of Mapmaker/QTL (Lincoln et al. 1993) and Qlink (Drinkwater 1997). However, the method tests only for the presence of a QTL and does not provide an estimate of the phenotypic effect of the QTL. In this article, we extend the rank-based test statistic to the estimation of the quantitative trait effects.
Rank-based methods play an important role in nonparametric statistics. The linear rank statistic has been widely used in practice and its theoretical properties have been thoroughly studied (Hajek and Sidak 1967; Hajek 1968). For linear regression, estimates of the regression coefficients based on linear rank statistics are available and have efficiency and robustness properties that are similar to those of the linear rank statistics. In this article, we adapt the existing methodology to construct rank-based estimates for genetic effects under the assumption that the underlying QTL component distributions have the same form and differ only by a shift. This appears to be the first attempt to apply linear rank-based estimates directly to interval mapping and thus complements existing parametric methods. Simulations are conducted to compare the relative efficiencies of the nonparametric and parametric methods under a variety of distributions.
The article is arranged as follows. In the next section, we briefly introduce linear rank statistics and related estimation procedures for regression analysis. In NONPARAMETRIC INTERVAL MAPPING, estimates of QTL effects are proposed in the context of interval mapping. In NUMERICAL STUDIES, the relative efficiencies of the proposed estimates are compared with those of the parametric estimates in simulation studies, and the methods are illustrated with backcross data where the phenotype has a highly skewed distribution. In CONCLUSION AND REMARKS, we discuss the practical utility of the proposed methods.
RANK-BASED METHODS
First consider a simple regression model: P(Y_{i} < y | X_{i}) = F(y − X_{i}β), where F is an unknown distribution, X_{i} are regressors, and Y_{i} are responses, i = 1,..., n, and we are interested in testing H_{0}: β = 0. Define the shifted responses Y_{i}(b) = Y_{i} − (X_{i} − X̄)b, where X̄ is the mean of the X_{i}, and their ranks R_{i}(b) = rank(Y_{i}(b)). The ranks are 1 for the smallest observation, 2 for the next, and so on, preserving the order of the data but not the values. Under the null, R_{i}(0) is uniformly distributed on {1, 2, 3,..., n}, whatever the distribution F. The Wilcoxon score statistic
The statistic L(b) plays a fundamental role in nonparametric inference. Under the null hypothesis H_{0}: β= 0, L(0) has the following asymptotic property:
To estimate β, find the value b for which the shifted values Y_{i}(b) are no longer associated with the X_{i}’s. A commonly used estimator is the Hodges-Lehmann estimator
The asymptotic properties of the linear rank-based inferences and estimators and their relative efficiencies are discussed in detail in Puri and Sen (1985). The efficiency of the Wilcoxon rank sum test (Hodges-Lehmann estimate) relative to the t-test [maximum-likelihood estimate (MLE)] is ∼95% if the distribution is normal and is never <86% for symmetric distributions. Thus the loss of efficiency in the normal case is slight and is offset by the robustness of the nonparametric method. For heavy-tailed distributions, the gain in efficiency may be great. Later our simulations show that even for nonsymmetric error distributions, such as exponential, the rank-based method may be more efficient.
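The shift-and-rank idea described above can be sketched in a few lines. The sketch below is illustrative only: it assumes the Wilcoxon statistic takes the standard form L(b) = Σ (X_{i} − X̄) R_{i}(b), which may differ from the article's own definition by a normalizing constant, and the function names (midranks, wilcoxon_stat, rank_estimate, hodges_lehmann_two_sample) are hypothetical. Because L(b) is a nonincreasing step function of b, the estimate is located by bisection on the sign change of L(b).

```python
from statistics import median

def midranks(values):
    """Ranks 1..n; tied values share the average of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def wilcoxon_stat(x, y, b):
    """Assumed form: L(b) = sum_i (x_i - xbar) * rank(y_i - (x_i - xbar) * b)."""
    xbar = sum(x) / len(x)
    shifted = [yi - (xi - xbar) * b for xi, yi in zip(x, y)]
    return sum((xi - xbar) * r for xi, r in zip(x, midranks(shifted)))

def rank_estimate(x, y, lo, hi, tol=1e-9):
    """Bisection for the sign change of L(b); needs L(lo) > 0 >= L(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if wilcoxon_stat(x, y, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def hodges_lehmann_two_sample(y1, y0):
    """Median of all pairwise differences between the two groups."""
    return median(a - b for a in y1 for b in y0)
```

When X_{i} is a 0/1 group indicator, the sign change of L(b) occurs at the median of the pairwise differences between the two groups, so rank_estimate and hodges_lehmann_two_sample should agree up to the bisection tolerance.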
For multiple regression, all the above arguments can be extended in a straightforward manner. Suppose
Under some regularity conditions (Puri and Sen 1985, Chap. 5),
To estimate β, define
NONPARAMETRIC INTERVAL MAPPING
Backcross: In this section, we consider a backcross population [(QQ × qq) × QQ]. For a single-QTL model, we assume P(Y_{i} < y | X_{i}) = F(y − βX_{i}), where X_{i} = I(Q_{i}) is the indicator function that takes 1 if the QTL genotype Q_{i} = QQ, and 0 otherwise. We are interested in testing H_{0}: β = 0 vs. H_{1}: β ≠ 0 and in estimating β, the genetic shift in distribution at the QTL between QQ and qQ genotypes.
If the QTL genotypes Q_{i} were known, we could apply the Wilcoxon rank sum test and the Hodges-Lehmann estimator directly in QTL analysis. However, in intervals between known loci, the QTL genotypes are not observed, so the Q_{i}’s are generally not available and the quantitative traits follow discrete mixture models. A natural choice would be to use Haley-Knott regression (Haley and Knott 1992). That is, first, the mixing weights are calculated as the conditional probabilities of the QTL genotypes in intervals between marker loci using the genetic map and the genotypes of the flanking markers. Then, X_{i} is substituted with its conditional expectation E(X_{i} | flanking markers).
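As an illustration of the mixing weights described above, the following sketch computes E(X_{i} | flanking markers) for a backcross. It is not taken from the article: it assumes the Haldane map function with no crossover interference, codes each marker as 1 (homozygous) or 0 (heterozygous), and uses hypothetical function names.

```python
import math

def haldane_recomb(d_cm):
    """Recombination fraction for a map distance in cM (Haldane, assumed)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def expected_qtl_indicator(left, right, d_left_cm, d_right_cm):
    """E(X_i | flanking markers) in a backcross, with X_i = 1 for QQ.

    left, right: flanking-marker genotypes, 1 = homozygous, 0 = heterozygous.
    d_left_cm, d_right_cm: distances from the putative QTL to each marker.
    """
    r1 = haldane_recomb(d_left_cm)
    r2 = haldane_recomb(d_right_cm)
    r = r1 + r2 - 2.0 * r1 * r2  # recombination fraction between the markers
    # P(marker genotype | QTL genotype QQ) for each flanking marker
    p_left = (1.0 - r1) if left == 1 else r1
    p_right = (1.0 - r2) if right == 1 else r2
    # Relative frequency of the observed flanking-marker pair
    p_markers = (1.0 - r) if left == right else r
    return p_left * p_right / p_markers
```

The four possible flanking-marker patterns give four distinct weights, which correspond to the K = 4 mixture groups in a backcross.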
Since individuals with the same flanking markers have the same mixing weights and thus the same mixture distribution, for convenience, we can group the data into K groups by their flanking-marker genotypes. Suppose within each group the data have common distribution M_{k}, k = 1, 2,..., K. Under the null hypothesis H_{0}, M_{1} = M_{2} =... = M_{K}. After substituting X_{i} in (1) with E(X_{i} | flanking markers), we obtain the rank test statistic equivalent to the one in Kruglyak and Lander (1995). Note that instead of testing H_{0} directly, here we test M_{1} = M_{2} =... = M_{K}. Usually, K is much greater than the number of underlying distributions. For example, in the backcross population, we are interested in testing the difference between the two component distributions in the mixture model. In essence, we test for differences among the four mixtures, M_{k}, k = 1,..., 4. Theoretically, it is unclear whether the relative efficiency of the rank sum test vs. the t-test [or, equivalently, the likelihood-ratio test (LRT)] in linear regression still holds in this setting. However, we expect that the rank sum test performs better under most circumstances when data are nonnormal, which we investigate by simulations.
The estimation of β is more problematic than that for simple linear regression. In traditional linear regression, E_{β}{L(β)} = 0. Thus the estimator
The following are some properties of
Extensions: Next we extend the methods to any other cross derived from two inbred lines, such as F_{2}. In general, the model can be expressed as
X_{1,i} = −1, 0, or 1 if individual i has QTL genotype qq, qQ, or QQ, and
X_{2,i} = 1 (or 0) if individual i has QTL genotype qQ (or else)
correspond to the additive and dominance genetic effects, a and d, respectively. In regression mapping, if the unknown X_{j,i}’s are replaced by their conditional expectations E(X_{j,i} | flanking markers), then the estimator
NUMERICAL STUDIES
Simulations were conducted to study the behavior of Z and
The estimates of QTL position and effect from the REG and the ML methods are very similar not only for normal data, which is consistent with Haley and Knott (1992) and with Xu (1995), but also for nonnormal data. Note that the nonparametric test and estimate generally are much more efficient than the parametric versions when data are not normally distributed. There is a modest loss of efficiency with normal data, which agrees with theory for simple linear regression. The marker distances and the magnitude of the QTL effect do not seem to have a large impact on the relative efficiencies of the estimators.
To estimate the power, the rank test statistic Z is first transformed to LOD_{R} = {2 log(10)}^{−1}Z^{2} and the test statistic from REG is also transformed to an equivalent LOD score. We then take threshold 3 for the LOD scores, which is recommended in practical genome-wide QTL analysis (see also Kruglyak and Lander 1995 for analytic genome-wide threshold calculations). The power is calculated as the proportion of significant tests from 100 simulated data sets. For the extreme case where data are Cauchy distributed, there is no power to detect the QTL by ML or REG interval mapping while Rank interval mapping does have power.
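The LOD transformation and the power calculation described above amount to simple arithmetic, sketched below. The function names are hypothetical, log is taken to be the natural logarithm, and z_values stands for the Z statistics obtained from the simulated data sets (how each Z is summarized over map positions is not specified here).

```python
import math

def lod_from_z(z):
    """LOD_R = Z^2 / (2 ln 10) for a standardized rank statistic Z."""
    return z * z / (2.0 * math.log(10.0))

def empirical_power(z_values, lod_threshold=3.0):
    """Proportion of simulated data sets whose LOD score reaches the threshold."""
    hits = sum(1 for z in z_values if lod_from_z(z) >= lod_threshold)
    return hits / len(z_values)
```

Under this transformation, a LOD threshold of 3 corresponds to |Z| of about 3.72.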
To further demonstrate the method, we consider the data on the time to death following infection with Listeria monocytogenes of 116 F_{2} mice from an intercross between the BALB/cByJ and C57BL/6ByJ strains (Boyartchuk et al. 2001). The histograms of the log time to death of the nonsurvivors are given in Figure 1. Roughly 30% of mice survive beyond 264 hours. From the histogram it is hard to justify that the log time to death of the nonsurvivors is normally distributed. Broman (2003) applied four different methods, including both the standard interval mapping and nonparametric interval mapping, to this data set and showed that the locus on chromosome 1 appears to have an effect only on the average time to death among the nonsurvivors. For this reason, our analysis is restricted to chromosome 1 and to the nonsurvivors.
The LOD scores obtained by standard interval mapping and the nonparametric interval mapping with the log time to death are plotted in Figure 2. The two methods attain their maxima at the same position, although the LOD curves differ slightly, which leads to slightly different confidence intervals for the putative QTL location under the conventional 1-LOD drop method. The additive and dominance estimates are 0.262 and 0.059, respectively, from standard interval mapping and are 0.257 and 0.038, respectively, based on our method. To assess whether the differences between the two methods are significant, 1000 bootstrap resamples were drawn, again restricting the analysis to chromosome 1. From our method, the 95% confidence interval (CI) of the QTL locus is (50 cM, 84 cM). The bootstrap mean of the additive effect is 0.247 with standard error 0.077 and the mean of the dominance effect is 0.055 with standard error 0.122. Similarly, from standard interval mapping, we get the 95% CI of the QTL locus as (51 cM, 92 cM). The mean of the additive effect is 0.268 with standard error 0.071 and the mean of the dominance effect is 0.0284 with standard error 0.122. Overall, the nonparametric QTL locus estimator is relatively more efficient than the parametric estimator and our nonparametric analysis confirms the results of Broman (2003).
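A bootstrap summary of the kind reported above could be computed along the following lines. This is a generic sketch rather than the procedure actually used: it assumes individuals are resampled with replacement and that the 95% interval is a percentile interval, neither of which is stated in the text. The name bootstrap_summary is hypothetical, and estimator stands for any function that maps a resampled data set to a single number (for example, an estimated QTL position or effect).

```python
import random
from statistics import mean, stdev

def bootstrap_summary(data, estimator, n_boot=1000, seed=0):
    """Resample individuals with replacement and summarize the estimator.

    Returns (bootstrap mean, bootstrap standard error, 95% percentile interval).
    """
    rng = random.Random(seed)
    n = len(data)
    reps = sorted(
        estimator([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    # Integer arithmetic avoids floating-point surprises in the indices
    lower = reps[(25 * n_boot) // 1000]
    upper = reps[(975 * n_boot) // 1000 - 1]
    return mean(reps), stdev(reps), (lower, upper)
```

In an interval-mapping application, each element of data would hold one individual's phenotype and marker genotypes, and estimator would rerun the mapping analysis on the resampled individuals.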
CONCLUSION AND REMARKS
In this article, traditional rank-based estimators for linear regression have been adapted to analyze quantitative traits. The new method has been shown to be very similar to Haley and Knott’s regression interval mapping when data are normally distributed and more efficient for nonnormal data. Our simulations indicate that the normal likelihood-ratio-based interval mapping is usually unbiased, even when the data are nonnormal, but may have very low efficiency. All our simulations are based on a one-QTL model. We believe that the nonparametric method, like the parametric method, is very likely to produce ghost QTL when two QTL are close to each other, so nonparametric multiple-QTL mapping is needed.
In genetic studies of quantitative traits, adapting rank-based methodologies is complicated because genetic markers are observed only at known loci and the QTL genotypes are usually unknown. Thus, the trait data arise from discrete mixtures of unknown distributions. The mixture structure of the data may distort certain properties of the underlying error distributions. For example, F may be unimodal even though the QTL data may not be. This means that the rank test in QTL mapping may have properties that differ from those for the rank test in linear regression.
As explained in NONPARAMETRIC INTERVAL MAPPING, the rank-based parameter estimate
The computation of
Footnotes

Communicating editor: Z-B. Zeng
 Received April 13, 2003.
 Accepted July 25, 2003.
 Copyright © 2003 by the Genetics Society of America