Rank-Based Statistical Methodologies for Quantitative Trait Locus Mapping
Fei Zou^a, Brian S. Yandell^b, and Jason P. Fine^b
^a Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599
^b Department of Statistics, University of Wisconsin, Madison, Wisconsin 53706
Corresponding author: Fei Zou, University of North Carolina, 3107D McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599. E-mail: fzou@bios.unc.edu
Communicating editor: Z-B. ZENG
| ABSTRACT |
|---|
This article addresses the identification of genetic loci (QTL and elsewhere) that influence nonnormal quantitative traits with focus on experimental crosses. QTL mapping is typically based on the assumption that the traits follow normal distributions, which may not be true in practice. Model-free tests have been proposed. However, nonparametric estimation of genetic effects has not been studied. We propose an estimation procedure based on the linear rank test statistics. The properties of the new procedure are compared with those of traditional likelihood-based interval mapping and regression interval mapping via simulations and a real data example. The results indicate that the nonparametric method is a competitive alternative to the existing parametric methodologies.
QUANTITATIVE genetics has developed rapidly, especially with progress in DNA-based genetic linkage maps. Various statistical approaches have been proposed to identify QTL by using molecular markers, such as SAX's (1923) single-marker t-test, LANDER and BOTSTEIN's (1989) maximum-likelihood-based interval mapping, HALEY and KNOTT's (1992) regression interval mapping, and ZENG's (1993, 1994) and JANSEN and STAM's (1994) composite interval mapping.
All the methods mentioned above are based on the normality assumption (or other parametric models) for the component distributions. The normal mixture model is the default analysis and is implemented in widely used packages such as Mapmaker/QTL (LINCOLN et al. 1993).
When the underlying distributions are suspected to be nonnormal, one strategy is to use a likelihood approach after transforming the data using, for example, the Box-Cox transformation.
An alternative approach is to consider nonparametric methods.
Rank-based methods play an important role in nonparametric statistics. The linear rank statistic has been widely used in practice and its theoretical properties have been thoroughly studied.
The article is arranged as follows. In the next section, we briefly introduce linear rank statistics and related estimation procedures for regression analysis. In NONPARAMETRIC INTERVAL MAPPING, the estimates of QTL effects are proposed in the context of interval mapping. In NUMERICAL STUDIES, the relative efficiencies of the proposed estimates are compared with the parametric estimates in simulation studies and the methods are illustrated in backcross data where the phenotype has a highly skewed distribution. In CONCLUSION AND REMARKS, we discuss the practical utility of the proposed methods.
| RANK-BASED METHODS |
|---|
First consider a simple regression model: P(Yi < y|Xi) = F(y - Xiβ), where F is an unknown distribution, Xi are regressors, and Yi are responses, i = 1, ... , n, and we are interested in testing H0: β = 0. Define the shifted responses

$$Y_i(b) = Y_i - X_i b$$

and their ranks Ri(b) = rank(Yi(b)). The ranks are 1 for the smallest observation, 2 for the next, and so on, preserving the order of the data but not the value. Under the null, the distribution of Ri(0) is independent of the distribution F and uniformly distributed on {1, 2, 3, ... , n}. The Wilcoxon score statistic

$$L(b) = \sum_{i=1}^{n} (X_i - \bar{X})\, R_i(b), \qquad \bar{X} = n^{-1} \sum_{i=1}^{n} X_i,$$

is a simple linear rank statistic.
The statistic L(b) plays a fundamental role in nonparametric inference. Under the null hypothesis H0: β = 0, L(0) has the following asymptotic property:

$$Z = \frac{L(0)}{\sqrt{\dfrac{n(n+1)}{12} \sum_{i=1}^{n} (X_i - \bar{X})^2}} \xrightarrow{d} N(0, 1). \qquad (1)$$
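The test based on L(0) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the standard form L(b) = Σ (Xi − X̄) Ri(b) with ranks 1, ..., n, standardizes L(0) by its null variance n(n + 1)/12 · Σ (Xi − X̄)² (exact when there are no ties), and uses hypothetical function names and toy data.

```python
import math

def midranks(values):
    """Ranks 1..n (smallest = 1); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1  # average of 1-based positions
        i = j + 1
    return r

def linear_rank_stat(y, x, b=0.0):
    """L(b) = sum_i (x_i - xbar) * rank(y_i - b * x_i) (assumed standard form)."""
    xbar = sum(x) / len(x)
    shifted = [yi - b * xi for yi, xi in zip(y, x)]
    return sum((xi - xbar) * ri for xi, ri in zip(x, midranks(shifted)))

def z_stat(y, x):
    """L(0) divided by its null standard deviation."""
    n = len(y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return linear_rank_stat(y, x) / math.sqrt(n * (n + 1) / 12 * sxx)

# Toy data (hypothetical): x is a 0/1 genotype indicator.
y = [1.2, 0.4, 2.9, 3.1, 0.7, 2.2]
x = [0, 0, 1, 1, 0, 1]
print(linear_rank_stat(y, x), z_stat(y, x))
```

With a 0/1 regressor, this Z coincides with the standardized Wilcoxon rank sum statistic for the two groups.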
To estimate β, find the value b that shifts the values of Yi to Yi(b) such that the shifted values Yi(b) are no longer associated with the Xi's. A commonly used estimator is the Hodges-Lehmann estimator $\hat{\beta}$, which is the solution of the estimating equation L(b) = 0. However, the linear rank statistic L(b) is a step function of b and may not reach zero exactly, so in practice $\hat{\beta}$ is taken to be the average of the closest values on either side of 0. In other words, $\hat{\beta} = \frac{1}{2}(\hat{\beta}_U + \hat{\beta}_L)$ with

$$\hat{\beta}_U = \inf\{b : L(b) < 0\}, \qquad \hat{\beta}_L = \sup\{b : L(b) > 0\}. \qquad (2)$$
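The estimator defined through Equation 2 can be computed exactly for a single regressor, because L(b) changes only at the pairwise slopes (Yj − Yi)/(Xj − Xi). The sketch below is an illustration under the assumption that L(b) = Σ (Xi − X̄) Ri(b), which is nonincreasing in b; the function names, the tolerance `tol`, and the toy data are hypothetical.

```python
def midranks(values):
    """Ranks 1..n (smallest = 1); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def rank_stat(y, x, b):
    """L(b) = sum_i (x_i - xbar) * rank(y_i - b * x_i) (assumed standard form)."""
    xbar = sum(x) / len(x)
    shifted = [yi - b * xi for yi, xi in zip(y, x)]
    return sum((xi - xbar) * ri for xi, ri in zip(x, midranks(shifted)))

def hodges_lehmann(y, x, tol=1e-9):
    """Average of sup{b: L(b) > 0} and inf{b: L(b) < 0}."""
    n = len(y)
    # L(b) can only change where two shifted responses swap order.
    cuts = sorted({(y[j] - y[i]) / (x[j] - x[i])
                   for i in range(n) for j in range(i + 1, n) if x[i] != x[j]})
    if not cuts:
        raise ValueError("x must take at least two distinct values")
    # One probe point inside each interval on which L(b) is constant.
    probes = ([cuts[0] - 1.0]
              + [(a + c) / 2 for a, c in zip(cuts, cuts[1:])]
              + [cuts[-1] + 1.0])
    values = [rank_stat(y, x, b) for b in probes]
    last_pos = max(k for k, v in enumerate(values) if v > tol)
    first_neg = min(k for k, v in enumerate(values) if v < -tol)
    beta_l = cuts[last_pos]        # right end of the last interval with L > 0
    beta_u = cuts[first_neg - 1]   # left end of the first interval with L < 0
    return 0.5 * (beta_l + beta_u)

y = [1.2, 0.4, 2.9, 3.1, 0.7, 2.2]
x = [0, 0, 1, 1, 0, 1]
print(hodges_lehmann(y, x))
```

For a 0/1 regressor the result agrees with the familiar two-sample form of the estimator, the median of all pairwise differences between the two groups.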
The asymptotic properties of the linear rank-based inferences and estimators and their relative efficiencies are discussed in detail in the rank-statistics literature. The asymptotic relative efficiency of the rank-based procedure relative to least squares is ~95% if the distribution is normal and is never <86% for symmetric distributions. Thus the loss of efficiency in the normal case is slight and is offset by the robustness of the nonparametric method. For heavy-tailed distributions, the gain in efficiency may be great. Our simulations, presented later, show that even for nonsymmetric error distributions, such as the exponential, the rank-based method may be more efficient.
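The efficiency figures can be checked with the classical formula for the asymptotic relative efficiency of Wilcoxon-score procedures relative to least squares, 12σ²(∫f²)², where f is the error density and σ² its variance. The sketch below plugs in closed-form values for three densities; the function name and the choice of densities are illustrative assumptions, not taken from the article.

```python
import math

def wilcoxon_are(variance, integral_f_squared):
    """ARE of Wilcoxon-score procedures vs. least squares: 12 * sigma^2 * (int f^2)^2."""
    return 12.0 * variance * integral_f_squared ** 2

# Closed-form inputs for three unit-scale error densities.
are_normal = wilcoxon_are(1.0, 1.0 / (2.0 * math.sqrt(math.pi)))  # N(0, 1): 3 / pi
are_laplace = wilcoxon_are(2.0, 0.25)      # f(x) = exp(-|x|) / 2, heavy tailed
are_exponential = wilcoxon_are(1.0, 0.5)   # f(x) = exp(-x), x > 0, nonsymmetric

# Classical lower bound on the ARE, 108/125 = 0.864.
are_lower_bound = 108.0 / 125.0

print(round(are_normal, 3), are_laplace, are_exponential, are_lower_bound)
```

The normal value, 3/π ≈ 0.955, matches the ~95% quoted in the text, and the values above 1 for the double exponential and exponential densities illustrate how the rank-based method can gain efficiency for nonnormal errors.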
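For multiple regression, all the above arguments can be extended in a straightforward manner. Suppose P(Yi < y|Xi) = F(y - Xi'β), where β, Xi $\in \mathbb{R}^p$. Again, F is totally unspecified. Similar to the simple regression, we define

$$Y_i(b) = Y_i - X_i' b, \qquad R_i(b) = \mathrm{rank}(Y_i(b)),$$

and

$$L(b) = (L_1(b), \ldots, L_p(b))' = \sum_{i=1}^{n} (X_i - \bar{X})\, R_i(b),$$

where b = (b1, ... , bp)' $\in \mathbb{R}^p$.
Under some regularity conditions,

$$L(0)'\, V_n^{-1} L(0) \xrightarrow{d} \chi^2_p,$$

where

$$V_n = \frac{n(n+1)}{12} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})'$$

and

$$\bar{X} = n^{-1} \sum_{i=1}^{n} X_i.$$

This result can be used to test H0: β = 0.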
To estimate β, define $\|L(b)\|^2 = \sum_{j=1}^{p} L_j(b)^2$, and let $\mathcal{D}_n = \{\arg\min_b \|L(b)\|^2\}$. Note that the set $\mathcal{D}_n$ may not be a single point. To obtain a unique estimator, we can let $\hat{\beta}$ be the center of mass of $\mathcal{D}_n$. The computation of $\hat{\beta}$ usually requires an iterative procedure.
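One simple iterative procedure of this kind is a derivative-free compass search on ||L(b)||². The sketch below is only an illustration: the article does not say which algorithm was used, the search returns a single point of low ||L(b)||² rather than the center of mass of the minimizing set, and the form of L(b), the function names, and the toy data are assumptions.

```python
def midranks(values):
    """Ranks 1..n (smallest = 1); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def rank_vector(y, X, b):
    """L_j(b) = sum_i (x_ij - xbar_j) * rank(y_i - x_i'b), j = 1..p (assumed form)."""
    n, p = len(y), len(b)
    shifted = [yi - sum(xij * bj for xij, bj in zip(xi, b)) for yi, xi in zip(y, X)]
    r = midranks(shifted)
    xbar = [sum(xi[j] for xi in X) / n for j in range(p)]
    return [sum((X[i][j] - xbar[j]) * r[i] for i in range(n)) for j in range(p)]

def objective(y, X, b):
    """Squared norm ||L(b)||^2."""
    return sum(v * v for v in rank_vector(y, X, b))

def compass_search(y, X, start, step=1.0, tol=1e-4):
    """Move along +/- each coordinate while ||L(b)||^2 decreases; else halve the step."""
    b = list(start)
    best = objective(y, X, b)
    while step > tol:
        improved = False
        for j in range(len(b)):
            for delta in (step, -step):
                trial = b[:]
                trial[j] += delta
                val = objective(y, X, trial)
                if val < best:
                    b, best, improved = trial, val, True
        if not improved:
            step /= 2
    return b, best

# Toy data (hypothetical) with two regressors per individual.
X = [[-1, 0], [0, 1], [1, 0], [0, 1], [-1, 0], [1, 0], [0, 1], [1, 0]]
y = [0.1, 1.3, 2.2, 0.9, -0.3, 1.8, 1.1, 2.4]
b_hat, value = compass_search(y, X, start=[0.0, 0.0])
print(b_hat, value)
```

Because ||L(b)||² is piecewise constant, gradient-based optimizers are not directly applicable, which is why a direct search is used in this sketch.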
| NONPARAMETRIC INTERVAL MAPPING |
|---|
Backcross:
In this section, we consider a backcross population [(QQ x qq) x QQ]. For a single-QTL model, we assume P(Yi < y|Xi) = F(y - βXi), where Xi = I(Qi) is the indicator function that takes the value 1 if the QTL genotype Qi = QQ and 0 otherwise. We are interested in testing H0: β = 0 vs. H1: β ≠ 0 and in estimating β, the genetic shift in distribution at the QTL between QQ and qQ genotypes.
If the QTL genotypes Qi were known, we could apply the Wilcoxon rank sum test and the Hodges-Lehmann estimator directly in QTL analysis. However, in intervals between known loci, the QTL genotypes are not observed, so the Qi's are generally not available and the quantitative traits follow discrete mixture models. A natural choice would be to use Haley-Knott regression (HALEY and KNOTT 1992).
Since individuals with the same flanking markers have the same mixing weights and thus the same mixture distribution, for convenience, we can group the data into K groups by their flanking-marker genotypes. Suppose within each group the data have common distribution Mk, k = 1, 2, ... , K. Under the null hypothesis H0, M1 = M2 = ... = MK. After substituting Xi in (1) with E(Xi|flanking markers), we obtain the rank test statistic equivalent to the one in KRUGLYAK and LANDER (1995).
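The conditional expectations E(Xi|flanking markers) for a backcross can be sketched as follows. This illustration assumes the Haldane map function (no crossover interference), which the text does not specify, and codes each marker as 1 when it carries the QQ-type genotype and 0 otherwise; the function names are hypothetical.

```python
import math

def haldane_r(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))

def backcross_qtl_weights(left_cm, qtl_cm, right_cm):
    """P(Q = QQ | left marker, right marker) for the K = 4 flanking-marker groups.

    Requires left_cm < qtl_cm < right_cm so that no denominator is zero.
    """
    r1 = haldane_r(qtl_cm - left_cm)      # left marker to QTL
    r2 = haldane_r(right_cm - qtl_cm)     # QTL to right marker
    r12 = r1 + r2 - 2.0 * r1 * r2         # between the two markers
    return {
        (1, 1): (1 - r1) * (1 - r2) / (1 - r12),
        (1, 0): (1 - r1) * r2 / r12,
        (0, 1): r1 * (1 - r2) / r12,
        (0, 0): r1 * r2 / (1 - r12),
    }

# Markers at 0 and 10 cM with a putative QTL at 5 cM.
weights = backcross_qtl_weights(0.0, 5.0, 10.0)
for group, w in sorted(weights.items()):
    print(group, round(w, 4))
```

Each individual's unobserved Xi is then replaced by the weight of its flanking-marker group before the rank statistic is computed.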
The estimation of β is more problematic than that for simple linear regression. In traditional linear regression, $E_\beta\{L(\beta)\} = 0$. Thus the estimator $\hat{\beta}$ is consistent. However, due to the mixture structure of QTL data, we can show that $E_\beta\{L(\beta)\}$ does not generally equal 0 when Xi is substituted by its conditional expectation. A theoretical formula of $E_\beta\{L(\beta)\}$ indicates that the magnitude of the deviation from 0 depends on the underlying distributions, the flanking marker distances, and the magnitude of β. It can be shown that, for a given distribution, the deviation goes to 0 as β goes to 0 or as the flanking marker distance goes to 0. Thus we expect the estimator $\hat{\beta}$ to work well in QTL analysis if either there is a relatively dense map (e.g., < 20 cM, a common scenario of current genetic studies) or the QTL effect is relatively small. Efficiency is of less concern when the QTL effect is large than when it is small. In QTL mapping of complex traits, an individual QTL usually has a small effect. For these reasons, we believe, and our simulations confirm, that the rank sum-based estimators are practically useful alternatives to the least-squares estimators from Haley and Knott's regression interval mapping.
The following are some properties of $\hat{\beta}$. To emphasize that $\hat{\beta}$ depends on Y = {Yi}, we rewrite $\hat{\beta}$ as $\hat{\beta}(Y)$. From the definition of $\hat{\beta}$, it is not difficult to show that, for any b $\in \mathbb{R}$,
- (i) $\hat{\beta}(Y + b) = \hat{\beta}(Y)$, and
- (ii) $\hat{\beta}(Y) = -\hat{\beta}(-Y)$.
In words, property i indicates that adding a constant to the data has no effect on the estimator of the QTL effect. Property ii says that if the data are multiplied by -1, the estimator changes sign.
Extensions:
Next we extend the methods to any other cross derived from two inbred lines, such as F2. In general, the model can be expressed as P(Yi < y|Xi) = F(y - Xi'β), where β = (a, d)' and Xi = (X1,i, X2,i)'. The covariates
- X1,i = -1, 0, or 1 if individual i has QTL genotype qq, qQ, or QQ, and
- X2,i = 1 (or 0) if individual i has QTL genotype qQ (or else)
correspond to the additive and dominance genetic effects, a and d, respectively. In regression mapping, if the unknown Xj,i's are replaced by their conditional expectations E(Xj,i|flanking markers), then the estimator $\hat{\beta}$ can be derived as described in RANK-BASED METHODS for multiple regression without any modifications. The methods may also be adapted to map multiple QTL.
| NUMERICAL STUDIES |
|---|
Simulations were conducted to study the behavior of Z and $\hat{\beta}$ in a backcross population. For simplicity, only one chromosomal segment flanked by two markers is simulated. The two markers are located either at 0 and 10 cM, with the simulated QTL at 5 cM, or at 0 and 20 cM, with the simulated QTL at 10 cM. The setups are similar to those used in earlier simulation studies.
The estimates of QTL position and effect from the regression interval mapping (REG) and maximum-likelihood interval mapping (ML) methods are very similar, and this holds not only for normal data, where it is consistent with earlier reports.
To estimate the power, the rank test statistic Z is first transformed to $\mathrm{LOD}_R = \{2\log(10)\}^{-1} Z^2$, and the test statistic from REG is also transformed to an equivalent LOD score. We then take a threshold of 3 for the LOD scores, which is recommended in practical genome-wide QTL analysis.
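The LOD transformation and the resulting power estimate can be sketched as below. The conversion follows the formula in the text, with log taken as the natural logarithm; the function names and the example Z values are hypothetical.

```python
import math

def lod_from_z(z):
    """LOD_R = Z^2 / (2 * ln 10)."""
    return z * z / (2.0 * math.log(10.0))

def estimated_power(z_values, threshold=3.0):
    """Fraction of simulated replicates whose LOD score reaches the threshold."""
    hits = sum(1 for z in z_values if lod_from_z(z) >= threshold)
    return hits / len(z_values)

# |Z| needed to reach a LOD score of 3.
z_cutoff = math.sqrt(2.0 * math.log(10.0) * 3.0)
print(round(z_cutoff, 3))

# Hypothetical Z statistics from four simulated data sets.
print(estimated_power([0.5, 4.0, -4.2, 3.0]))
```

A LOD threshold of 3 therefore corresponds to |Z| of roughly 3.72, which is much stricter than the pointwise 5% cutoff of 1.96.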
To further demonstrate the method, we consider the data on the time to death following infection with Listeria monocytogenes of 116 F2 mice from an intercross between the BALB/cByJ and C57BL/6ByJ strains (BOYARTCHUK et al. 2001).
The LOD scores obtained by standard interval mapping and by nonparametric interval mapping with the log time to death are plotted in Figure 2. The two methods attain their maxima at the same position, although the LOD curves differ slightly, which will result in slightly different confidence intervals for the putative QTL locus under the conventional 1-LOD drop method. The additive and dominance estimates are 0.262 and 0.059, respectively, from standard interval mapping and 0.257 and 0.038, respectively, from our method. To assess whether the differences between the two methods are significant, 1000 bootstrap samples are drawn. We restrict our analysis to chromosome 1. From our method, the 95% confidence interval (CI) of the QTL locus is (50 cM, 84 cM). The mean of the additive effect is 0.247 with standard error 0.077 and the mean of the dominance effect is 0.055 with standard error 0.122. Similarly, from standard interval mapping, the 95% CI of the QTL locus is (51 cM, 92 cM). The mean of the additive effect is 0.268 with standard error 0.071 and the mean of the dominance effect is 0.0284 with standard error 0.122. In all, the nonparametric QTL locus estimator is relatively more efficient than the parametric estimator, and our nonparametric analysis confirms the previously reported results.
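The bootstrap summaries reported above (mean, standard error, and 95% CI) can be sketched generically as follows. The article does not state which bootstrap interval was used, so the percentile interval here is an assumption, and the function name, seed, and toy data are hypothetical; in a QTL analysis each resampled record would carry an individual's phenotype and marker genotypes together, and `statistic` would rerun the mapping on the resample.

```python
import random

def bootstrap_summary(data, statistic, n_boot=1000, level=0.95, seed=0):
    """Bootstrap mean, standard error, and percentile interval of a statistic."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        # Resample whole individuals with replacement.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    mean = sum(estimates) / n_boot
    se = (sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)) ** 0.5
    alpha = 1.0 - level
    lo = estimates[round(n_boot * alpha / 2)]
    hi = estimates[round(n_boot * (1 - alpha / 2)) - 1]
    return mean, se, (lo, hi)

def sample_mean(values):
    return sum(values) / len(values)

# Toy phenotypes (hypothetical); a real analysis would resample full records.
toy = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 6.1, 3.9, 4.4]
boot_mean, boot_se, (ci_lo, ci_hi) = bootstrap_summary(toy, sample_mean)
print(boot_mean, boot_se, ci_lo, ci_hi)
```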
| CONCLUSION AND REMARKS |
|---|
In this article, traditional rank-based estimators for linear regression have been adapted to analyze quantitative traits. The new method has been shown to be very similar to Haley and Knott's regression interval mapping when data are normally distributed and more efficient for nonnormal data. Our simulations indicate that the normal likelihood-ratio-based interval mapping is usually unbiased, even when the data are nonnormal, but may have very low efficiency. All our simulations are based on a single-QTL model. We believe that the nonparametric method, like the parametric method, is very likely to produce ghost QTL when two QTL are close to each other; in that case, nonparametric multiple-QTL mapping is needed.
In genetic studies of quantitative traits, adapting rank-based methodologies is complicated because genetic markers are observed only at known loci and the QTL genotypes are usually unknown. Thus, the trait data arise from discrete mixtures of unknown distributions. The mixture structure of the data may distort certain properties of the underlying error distributions. For example, F may be unimodal even though the QTL data may not be. This means that the rank test in QTL mapping may have properties that differ from those for the rank test in linear regression.
As explained in NONPARAMETRIC INTERVAL MAPPING, the rank-based parameter estimate $\hat{\beta}$ is not generally unbiased with QTL data because the unknown regressors Xi are replaced by their expectations (cf. the theory of general estimating equations; LIANG and ZEGER 1986).
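The computation of $\hat{\beta}$ is usually complicated if the dimension of β is >1 and requires an iterative procedure. KRAFT and VAN EEDEN (1972) proposed a linearized rank estimate, $\tilde{\beta}$. We may use this one-step modification if the calculation of $\hat{\beta}$ is too complicated,

[Equation 3: definition of the one-step estimate $\tilde{\beta}$] (3)

where $\hat{\beta}_{LS}$ is the least-squares estimator of β.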
Manuscript received April 13, 2003; Accepted for publication July 25, 2003.
| LITERATURE CITED |
|---|
BASTEN, C. J., B. S. WEIR and Z-B. ZENG, 1997 QTL Cartographer: A Reference Manual and Tutorial for QTL Mapping. Department of Statistics, North Carolina State University, Raleigh, NC.
BOYARTCHUK, V. L., K. W. BROMAN, R. E. MOSHER, S. E. F. D'ORAZIO, M. N. STARNBACH et al., 2001 Multigenic control of Listeria monocytogenes susceptibility in mice. Nat. Genet. 27:259-260.
BROMAN, K. W., 2003 Mapping quantitative trait loci in the case of a spike in the phenotype distribution. Genetics 163:1169-1175.
DRAPER, N. R., and H. SMITH, 1998 Applied Regression Analysis, Ed. 3. John Wiley & Sons, New York.
DRINKWATER, N. R., 1997 Qlink Documentation. McArdle Laboratory for Cancer Research, University of Wisconsin, Madison, WI.
DRINKWATER, N. R. and J. H. KLOTZ, 1981 Statistical methods for the analysis of tumor multiplicity data. Cancer Res. 41:113-119.
HAJEK, J., 1968 Asymptotic normality of simple linear rank statistics under alternatives. Ann. Math. Stat. 39:325-346.
HAJEK, J., and Z. SIDAK, 1967 Theory of Rank Tests. Academic Press, New York/London.
HALEY, C. S. and S. A. KNOTT, 1992 A simple regression method for mapping quantitative traits in line crosses using flanking markers. Heredity 69:315-324.
JANSEN, R. C. and P. STAM, 1994 High resolution of quantitative traits into multiple loci via interval mapping. Genetics 136:1447-1455.
KAO, C. H., Z-B. ZENG and R. D. TEASDALE, 1999 Multiple interval mapping for quantitative trait loci. Genetics 152:1203-1216.
KRAFT, C. H. and C. VAN EEDEN, 1972 Linearized rank estimates and signed rank estimates for the general linear hypothesis. Ann. Math. Stat. 43:42-57.
KRUGLYAK, L. and E. S. LANDER, 1995 A nonparametric approach for mapping quantitative trait loci. Genetics 139:1421-1428.
LANDER, E. S. and D. BOTSTEIN, 1989 Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121:185-199.
LIANG, K. Y. and S. L. ZEGER, 1986 Longitudinal data analysis using generalized linear models. Biometrika 73:13-22.
LINCOLN, S. E., M. J. DALY and E. S. LANDER, 1993 A Tutorial and Reference Manual for MAPMAKER/QTL. Whitehead Institute for Biomedical Research.
LIU, Y. and Z-B. ZENG, 2000 A general mixture model approach for mapping quantitative trait loci from diverse cross designs involving multiple inbred lines. Genet. Res. 75:345-355.
MORTON, N. E., 1984 Trials of segregation analysis by deterministic and macro simulation, pp. 83-107 in Human Population Genetics: The Pittsburgh Symposium, edited by A. CHAKRAVARTI. Van Nostrand Reinhold, New York.
PURI, M. L., and P. K. SEN, 1985 Nonparametric Methods in General Linear Models. John Wiley & Sons, New York.
SAX, K., 1923 The association of size differences with seed-coat pattern and pigmentation in Phaseolus vulgaris. Genetics 8:552-560.
XU, S., 1995 A comment on the simple regression method for interval mapping. Genetics 141:1657-1659.
ZENG, Z-B., 1993 Theoretical basis of separation of multiple linked gene effects on mapping quantitative trait loci. Proc. Natl. Acad. Sci. USA 90:10972-10976.
ZENG, Z-B., 1994 Precision mapping of quantitative traits loci. Genetics 136:1457-1468.