Genetics, Vol. 165, 1599-1605, November 2003, Copyright © 2003

Rank-Based Statistical Methodologies for Quantitative Trait Locus Mapping

Fei Zou^a, Brian S. Yandell^b, and Jason P. Fine^b
^a Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599
^b Department of Statistics, University of Wisconsin, Madison, Wisconsin 53706

Corresponding author: Fei Zou, University of North Carolina, 3107D McGavran-Greenberg Hall, CB 7420, Chapel Hill, NC 27599. E-mail: fzou@bios.unc.edu

Communicating editor: Z-B. ZENG


*  ABSTRACT
*TOP
*ABSTRACT
*RANK-BASED METHODS
*NONPARAMETRIC INTERVAL MAPPING
*NUMERICAL STUDIES
*CONCLUSION AND REMARKS
*LITERATURE CITED

This article addresses the identification of genetic loci (QTL and elsewhere) that influence nonnormal quantitative traits with focus on experimental crosses. QTL mapping is typically based on the assumption that the traits follow normal distributions, which may not be true in practice. Model-free tests have been proposed. However, nonparametric estimation of genetic effects has not been studied. We propose an estimation procedure based on the linear rank test statistics. The properties of the new procedure are compared with those of traditional likelihood-based interval mapping and regression interval mapping via simulations and a real data example. The results indicate that the nonparametric method is a competitive alternative to the existing parametric methodologies.


QUANTITATIVE genetics has developed rapidly, especially with progress in DNA-based genetic linkage maps. Various statistical approaches have been proposed to identify QTL by using molecular markers, such as SAX's (1923) single-marker t-test, LANDER and BOTSTEIN's (1989) maximum-likelihood-based interval mapping, HALEY and KNOTT's (1992) regression interval mapping, and ZENG's (1993, 1994) and JANSEN and STAM's (1994) composite interval mapping.

All the methods mentioned above are based on the normality assumption (or other parametric models) for the component distributions. The normal mixture model is the default analysis and is implemented in the widely used packages Mapmaker/QTL (LINCOLN et al. 1993) and QTL Cartographer (BASTEN et al. 1997). Many traits, however, are not normally distributed. An example is tumor counts, which arise in cancer studies and often appear to follow a negative binomial distribution (DRINKWATER and KLOTZ 1981). Naively assuming normality of the underlying distributions greatly simplifies the form of the likelihood function. The problem is that, if this assumption is violated, false detection of a major locus may occur (MORTON 1984).

When the underlying distributions are suspected to be nonnormal, one strategy is to use a likelihood approach after transforming the data using, for example, the Box-Cox transformation (DRAPER and SMITH 1998). However, an appropriate transformation may not exist or may be difficult to find. This approach can also raise serious issues of interpretation, and the transformation involves an extra parametric assumption.

An alternative approach is to consider nonparametric methods. KRUGLYAK and LANDER (1995) apply linear rank statistics to interval mapping, an approach that is implemented in the latest version of Mapmaker/QTL (LINCOLN et al. 1993) and in Qlink (DRINKWATER 1997). However, the method tests only for the presence of a QTL and does not provide an estimate of the phenotypic effect of the QTL. In this article, we extend the rank-based test statistic to the estimation of the quantitative trait effects.

Rank-based methods play an important role in nonparametric statistics. The linear rank statistic has been widely used in practice and its theoretical properties have been thoroughly studied (HAJEK and SIDAK 1967; HAJEK 1968). For linear regression, estimates of the regression coefficients based on linear rank statistics are available and have efficiency and robustness properties that are similar to those of the linear rank statistics. In this article, we adapt the existing methodology to construct rank-based estimates for genetic effects under the assumption that the underlying QTL component distributions have the same form and differ only by a shift. This appears to be the first attempt to apply linear rank-based estimates directly to interval mapping and thus complements existing parametric methods. Simulations are conducted to compare the relative efficiencies of the nonparametric and parametric methods under a variety of distributions.

The article is arranged as follows. In the next section, we briefly introduce linear rank statistics and related estimation procedures for regression analysis. In NONPARAMETRIC INTERVAL MAPPING, the estimates of QTL effects are proposed in the context of interval mapping. In NUMERICAL STUDIES, the relative efficiencies of the proposed estimates are compared with the parametric estimates in simulation studies and the methods are illustrated in backcross data where the phenotype has a highly skewed distribution. In CONCLUSION AND REMARKS, we discuss the practical utility of the proposed methods.


*  RANK-BASED METHODS

First consider a simple regression model: P(Yi < y|Xi) = F(y - Xiβ), where F is an unknown distribution, Xi are regressors, and Yi are responses, i = 1, ... , n, and we are interested in testing H0: β = 0. Define the shifted responses Yi(b) = Yi - Xib and their ranks Ri(b) = rank(Yi(b)). The ranks are 1 for the smallest observation, 2 for the next, and so on, preserving the order of the data but not the values. Under the null, the distribution of Ri(0) does not depend on F: each Ri(0) is uniformly distributed on {1, 2, 3, ... , n}. The Wilcoxon score statistic L(b) is a simple linear rank statistic (see PURI and SEN 1985 for some alternatives) and is widely used to test H0. Rank-based procedures can have dramatically smaller variances when data are not normal, leading to more efficient tests and estimators. Note that if we knew the true shift β, then the shifted values Yi(β) would all have the same distribution F and Eβ{L(β)} = 0. All rank-based inference and estimation procedures are built on this premise.

The statistic L(b) plays a fundamental role in nonparametric inference. Under the null hypothesis H0: β = 0, L(0) has the following asymptotic property:

(1)
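The test built on L(0) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: L(b) is taken to be the usual centered Wilcoxon score statistic, L(b) = sum_i (Xi - Xbar) Ri(b), and Z standardizes L(0) by its no-ties null variance, n(n+1)/12 times sum_i (Xi - Xbar)^2. The function names are hypothetical and do not come from the article.

```python
import math

def ranks(values):
    """Ranks 1..n (smallest = 1); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            r[order[k]] = avg
        pos = end + 1
    return r

def wilcoxon_L(x, y, b=0.0):
    """ASSUMED form: L(b) = sum_i (x_i - mean(x)) * rank(y_i - x_i * b)."""
    xbar = sum(x) / len(x)
    r = ranks([yi - xi * b for xi, yi in zip(x, y)])
    return sum((xi - xbar) * ri for xi, ri in zip(x, r))

def rank_z(x, y):
    """Z = L(0) / sqrt(Var0{L(0)}), using the no-ties null variance."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    var0 = n * (n + 1) / 12.0 * sxx
    return wilcoxon_L(x, y, 0.0) / math.sqrt(var0)
```

Under H0, Z is compared with a standard normal distribution. With a binary regressor the statistic reduces to a centered Wilcoxon rank sum.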

To estimate β, find the value b that shifts the values Yi to Yi(b) such that the shifted values Yi(b) are no longer associated with the Xi's. A commonly used estimator is the Hodges-Lehmann estimator $\hat{\beta}$, which is the solution of the estimating equation L(b) = 0. However, the linear rank statistic L(b) is a step function of b and may not equal zero exactly, so in practice $\hat{\beta}$ is taken to be the average of the closest values on either side of 0. In other words, $\hat{\beta} = \frac{1}{2}(\hat{\beta}_U + \hat{\beta}_L)$ with

(2)
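A minimal Python sketch of this estimator for a single regressor follows. It assumes the centered Wilcoxon score form of L(b) and the usual convention that the two bracketing values are sup{b: L(b) > 0} and inf{b: L(b) < 0}. Both are assumptions about the displayed formulas, and the function names are illustrative. The sketch relies on L(b) being a non-increasing step function whose jumps occur at the pairwise slopes (Yi - Yj)/(Xi - Xj).

```python
def midranks(v):
    """Midranks: 1 + (# smaller values) + 0.5 * (# tied values, excluding self)."""
    return [1 + sum(u < w for u in v) + 0.5 * (sum(u == w for u in v) - 1) for w in v]

def wilcoxon_L(x, y, b):
    """ASSUMED form: L(b) = sum_i (x_i - mean(x)) * rank(y_i - x_i * b)."""
    xbar = sum(x) / len(x)
    r = midranks([yi - xi * b for xi, yi in zip(x, y)])
    return sum((xi - xbar) * ri for xi, ri in zip(x, r))

def hodges_lehmann_slope(x, y, tol=1e-9):
    """Midpoint of sup{b: L(b) > 0} and inf{b: L(b) < 0}."""
    n = len(x)
    # Breakpoints of the step function L(b): pairwise slopes with x_i != x_j.
    cuts = sorted({(y[i] - y[j]) / (x[i] - x[j])
                   for i in range(n) for j in range(i) if x[i] != x[j]})
    if not cuts:
        raise ValueError("x must take at least two distinct values")
    # One probe point inside each open interval between consecutive breakpoints.
    probes = ([cuts[0] - 1.0]
              + [(a + c) / 2 for a, c in zip(cuts, cuts[1:])]
              + [cuts[-1] + 1.0])
    vals = [wilcoxon_L(x, y, b) for b in probes]
    # Interval k lies to the left of cuts[k] and to the right of cuts[k - 1].
    upper = max(cuts[k] for k in range(len(cuts)) if vals[k] > tol)
    lower = min(cuts[k - 1] for k in range(1, len(probes)) if vals[k] < -tol)
    return 0.5 * (upper + lower)
```

For a 0/1 regressor this reduces to the median of the pairwise differences between the two groups. The scan costs O(n^2) breakpoints and is meant for illustration, not for large samples.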

The asymptotic properties of the linear rank-based inferences and estimators and their relative efficiencies are discussed in detail in PURI and SEN (1985). The efficiency of the Wilcoxon rank sum test (Hodges-Lehmann estimate) relative to the t-test [maximum-likelihood estimate (MLE)] is ~95% if the distribution is normal and is never <86% for symmetric distributions. Thus the loss of efficiency in the normal case is slight and is offset by the robustness of the nonparametric method. For heavy-tailed distributions, the gain in efficiency may be great. Our simulations, presented later, show that even for nonsymmetric error distributions, such as the exponential, the rank-based method may be more efficient.
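The quoted percentages can be traced to the classical Pitman efficiency formula for the Wilcoxon procedure relative to the t-test in a shift model with density f and variance sigma^2. The worked values below are standard textbook results rather than computations from the article:

```latex
\begin{aligned}
\mathrm{ARE}(\text{Wilcoxon}, t) &= 12\,\sigma^{2}\left(\int f(x)^{2}\,dx\right)^{2}, \\
\text{normal: } \int f(x)^{2}\,dx = \frac{1}{2\sigma\sqrt{\pi}}
  \;&\Longrightarrow\; \mathrm{ARE} = \frac{12\,\sigma^{2}}{4\pi\sigma^{2}} = \frac{3}{\pi} \approx 0.955, \\
\text{standard logistic: } \sigma^{2} = \frac{\pi^{2}}{3},\ \int f(x)^{2}\,dx = \frac{1}{6}
  \;&\Longrightarrow\; \mathrm{ARE} = \frac{\pi^{2}}{9} \approx 1.097, \\
\text{lower bound: } \mathrm{ARE} &\ge \frac{108}{125} = 0.864 .
\end{aligned}
```

For the Cauchy distribution the variance is infinite, so the ratio is unbounded in favor of the rank procedure.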

For multiple regression, all the above arguments can be extended in a straightforward manner. Suppose P(Yi < y|Xi) = F(y - Xi'β), where β, Xi ∈ R^p. Again, F is completely unspecified. As in simple regression, we define

and

where b = (b1, ... , bp)' ∈ R^p.

Under some regularity conditions (PURI and SEN 1985, Chap. 5),

where and This result can be used to test H0: β = 0.
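One way to write the multiple-regression quantities, assuming they mirror the simple-regression case term by term, is sketched below. The centering, the matrix name V_n, and the chi-square form of the limit are assumptions made for illustration, and the authors' scaling may differ:

```latex
\begin{aligned}
R_i(b) &= \operatorname{rank}\bigl(Y_i - X_i' b\bigr), \\
L_j(b) &= \sum_{i=1}^{n} \bigl(X_{ij} - \bar{X}_j\bigr)\, R_i(b), \qquad j = 1, \ldots, p, \\
L(b) &= \bigl(L_1(b), \ldots, L_p(b)\bigr)', \\
L(0)'\, V_n^{-1}\, L(0) &\xrightarrow{d} \chi^{2}_{p} \quad \text{under } H_0, \qquad
V_n = \frac{n(n+1)}{12} \sum_{i=1}^{n} \bigl(X_i - \bar{X}\bigr)\bigl(X_i - \bar{X}\bigr)'.
\end{aligned}
```

With p = 1 the quadratic form is the square of the standardized statistic from the simple-regression case.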

To estimate β, define $\|L(b)\|^2 = \sum_{j=1}^{p} L_j(b)^2$, and let $\Delta_n = \{\arg\min_b \|L(b)\|^2\}$. Note that the set $\Delta_n$ may not be a single point. To obtain a unique estimator, we can let $\hat{\beta}$ be the center of mass of $\Delta_n$. The computation of $\hat{\beta}$ usually requires an iterative procedure.


*  NONPARAMETRIC INTERVAL MAPPING

Backcross:
In this section, we consider a backcross population [(QQ × qq) × QQ]. For a single-QTL model, we assume P(Yi < y|Xi) = F(y - βXi), where Xi = I(Qi) is the indicator function that takes the value 1 if the QTL genotype Qi = QQ, and 0 otherwise. We are interested in testing H0: β = 0 vs. H1: β ≠ 0 and in estimating β, the genetic shift in distribution at the QTL between the QQ and qQ genotypes.

If the QTL genotypes Qi were known, we could apply the Wilcoxon rank sum test and the Hodges-Lehmann estimator directly in QTL analysis. However, in intervals between known loci, the QTL genotypes are not observed, so the Qi's are generally not available and the quantitative traits follow discrete mixture models. A natural choice is to use Haley-Knott regression (HALEY and KNOTT 1992). That is, first, the mixing weights are calculated as the conditional probabilities of the QTL genotypes in intervals between marker loci using the genetic map and the genotypes of the flanking markers. Then, Xi is replaced with its conditional expectation E(Xi|flanking markers).
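The conditional expectation for a backcross can be sketched as follows. The sketch assumes no crossover interference and uses the Haldane map function to turn map distances into recombination fractions, which is a choice the text does not specify. The function and argument names are illustrative.

```python
import math

def haldane_r(d_cM):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cM / 100.0))

def expected_x(m_left, m_right, pos_left_cM, pos_qtl_cM, pos_right_cM):
    """E(X | flanking markers) = P(QTL genotype is QQ | marker genotypes).

    Marker genotypes are coded 1 (homozygous, QQ-type) or 0 (heterozygous).
    ASSUMPTION: no interference, so the two intervals recombine independently.
    """
    r1 = haldane_r(pos_qtl_cM - pos_left_cM)    # left marker to QTL
    r2 = haldane_r(pos_right_cM - pos_qtl_cM)   # QTL to right marker
    # Joint probabilities (up to the common factor 1/2) of the observed
    # markers together with QTL genotype 1 (QQ) or 0 (qQ).
    p1 = ((1 - r1) if m_left == 1 else r1) * ((1 - r2) if m_right == 1 else r2)
    p0 = (r1 if m_left == 1 else (1 - r1)) * (r2 if m_right == 1 else (1 - r2))
    return p1 / (p1 + p0)
```

The four possible flanking-marker classes (1,1), (1,0), (0,1), and (0,0) give four distinct values of E(X | flanking markers), one for each mixture with its own mixing weight.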

Since individuals with the same flanking markers have the same mixing weights and thus the same mixture distribution, for convenience, we can group the data into K groups by their flanking-marker genotypes. Suppose that within each group the data have common distribution Mk, k = 1, 2, ... , K. Under the null hypothesis H0, M1 = M2 = ... = MK. After substituting Xi in (1) with E(Xi|flanking markers), we obtain a rank test statistic equivalent to the one in KRUGLYAK and LANDER (1995). Note that, rather than testing H0 directly, we test M1 = M2 = ... = MK. Usually, K is much greater than the number of underlying distributions. For example, in the backcross population, we are interested in testing the difference between the two component distributions in the mixture model. In essence, we test for differences among the four mixtures, Mk, k = 1, ... , 4. Theoretically, it is unclear whether the relative efficiency of the rank sum test vs. the t-test [or, equivalently, the likelihood-ratio test (LRT)] in linear regression still holds in this setting. However, we expect that the rank sum test performs better under most circumstances when data are nonnormal, which we investigate by simulations.

The estimation of β is more problematic than that for simple linear regression. In traditional linear regression, Eβ{L(β)} = 0. Thus the estimator is consistent. However, due to the mixture structure of QTL data, we can show that Eβ{L(β)} does not generally equal 0 when Xi is replaced by its conditional expectation. A theoretical formula for Eβ{L(β)} indicates that the magnitude of the deviation from 0 depends on the underlying distributions, the flanking marker distances, and the magnitude of β. It can be shown that, for a given distribution, the deviation goes to 0 as β goes to 0 or as the flanking marker distance goes to 0. Thus we expect the estimator to work well in QTL analysis if either there is a relatively dense map (e.g., < 20 cM, a common scenario in current genetic studies) or the QTL effect is relatively small. Efficiency is of less concern when the QTL effect is large than when it is small. In QTL mapping of complex traits, an individual QTL usually has a small effect. For these reasons, we believe, and our simulations show, that the rank sum-based estimators are practically useful alternatives to the least-squares estimators from Haley and Knott's regression interval mapping.

The following are some properties of $\hat{\beta}$. To emphasize that $\hat{\beta}$ depends on Y = {Yi}, we rewrite $\hat{\beta}$ as $\hat{\beta}(Y)$. From the definition of $\hat{\beta}$, it is not difficult to show that, for any b ∈ R,

  1. $\hat{\beta}(Y + b) = \hat{\beta}(Y)$, and

  2. $\hat{\beta}(Y) = -\hat{\beta}(-Y)$.

In words, property 1 indicates that adding a constant to the data has no effect on the estimator of the QTL effect. Property 2 says that if the data are multiplied by -1, the estimator changes sign.

Extensions:
Next we extend the methods to any other cross derived from two inbred lines, such as F2. In general, the model can be expressed as P(Yi < y|Xi) = F(y - Xi'β), where β = (a, d)' and Xi = (X1,i, X2,i)'. The covariates

  • X1,i = -1, 0, or 1 if individual i has QTL genotype qq, qQ, or QQ, and

  • X2,i = 1 (or 0) if individual i has QTL genotype qQ (or else)

correspond to the additive and dominance genetic effects, a and d, respectively. In regression mapping, if the unknown Xj,i's are replaced by their conditional expectations E(Xj,i|flanking markers), then the estimator can be derived as described in RANK-BASED METHODS for multiple regression without any modifications. The methods may also be adapted to map multiple QTL (KAO et al. 1999) or to more complicated designs involving more than two inbred lines (LIU and ZENG 2000) by changing the dimension of β. Of course, the efficiency may be low if the dimension of β is large. This requires further investigation.
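Given the conditional genotype probabilities at a putative QTL position, the expected covariates follow directly from the coding above: E(X1) = P(QQ) - P(qq) and E(X2) = P(qQ). A small Python sketch, with a hypothetical function name and with the genotype probabilities taken as inputs rather than computed from markers:

```python
def expected_f2_covariates(p_qq, p_qQ, p_QQ):
    """Expected additive and dominance covariates for one individual.

    Inputs are the conditional probabilities of the three QTL genotypes
    given the flanking markers (assumed to be computed elsewhere).
    """
    total = p_qq + p_qQ + p_QQ
    if abs(total - 1.0) > 1e-9:
        raise ValueError("genotype probabilities must sum to 1")
    e_x1 = p_QQ - p_qq   # X1 is -1, 0, 1 for qq, qQ, QQ
    e_x2 = p_qQ          # X2 is 1 for qQ and 0 otherwise
    return (e_x1, e_x2)
```

For example, the unconditional F2 frequencies (1/4, 1/2, 1/4) give expected covariates (0, 0.5).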


*  NUMERICAL STUDIES

Simulations were conducted to study the behavior of Z and $\hat{\beta}$ in a backcross population. For simplicity, only one chromosomal segment flanked by two markers is simulated. The two markers are located either at 0 and 10 cM with the simulated QTL at 5 cM or at 0 and 20 cM with the simulated QTL at 10 cM. The setups are similar to those in XU (1995). The QTL effect β is either 1 or 2. Standard normal, exponential(0.5), t(3), standard logistic, and standard Cauchy are used as error distributions. One hundred simulations were conducted for each combination with sample size n = 1000. The average values and corresponding standard errors of the estimated QTL position and QTL effect from parametric interval mapping (ML) and nonparametric Wilcoxon rank sum interval mapping (Rank) are given in Tables 1-4. As a comparison, we also ran the regression analysis (REG) of HALEY and KNOTT (1992); the results are given in Table 1.
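A data-generating step of this kind might look like the following Python sketch. Several details are assumptions, not taken from the article: the Haldane map function, reading "exponential(0.5)" as rate 0.5, uncentered error draws, and all function names.

```python
import math
import random

def haldane_r(d_cM):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cM / 100.0))

def _logistic(rng):
    u = rng.random()
    while u == 0.0:            # avoid log(0)
        u = rng.random()
    return math.log(u / (1.0 - u))

def _t3(rng):
    # t with 3 d.f. = N(0,1) / sqrt(chi2_3 / 3); chi2_3 is Gamma(shape 1.5, scale 2)
    return rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(1.5, 2.0) / 3.0)

ERRORS = {
    "normal": lambda rng: rng.gauss(0.0, 1.0),
    "exponential": lambda rng: rng.expovariate(0.5),   # ASSUMPTION: 0.5 is the rate
    "t3": _t3,
    "logistic": _logistic,
    "cauchy": lambda rng: math.tan(math.pi * (rng.random() - 0.5)),
}

def simulate_backcross(n, marker_dist_cM, qtl_cM, beta, draw_error, rng):
    """Return n tuples (left marker, right marker, phenotype).

    Markers sit at 0 and marker_dist_cM, the QTL at qtl_cM; genotypes are
    coded 1 (QQ) or 0 (qQ). The QTL genotype is used only to generate the
    data and is not returned, mirroring the fact that it is unobserved.
    """
    r1 = haldane_r(qtl_cM)
    r2 = haldane_r(marker_dist_cM - qtl_cM)
    data = []
    for _ in range(n):
        q = rng.randint(0, 1)
        m_left = q if rng.random() >= r1 else 1 - q     # recombine with prob. r1
        m_right = q if rng.random() >= r2 else 1 - q    # recombine with prob. r2
        data.append((m_left, m_right, beta * q + draw_error(rng)))
    return data
```

For example, `simulate_backcross(1000, 20.0, 10.0, 1.0, ERRORS["cauchy"], random.Random(1))` draws one data set for the 20-cM setting with β = 1 and Cauchy errors.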


 

 
Table 1. Comparison of parametric and nonparametric methods (20 cM)


 

 
Table 2. Comparison of parametric and nonparametric methods (20 cM)


 

 
Table 3. Comparison of parametric and nonparametric methods (10 cM)


 

 
Table 4. Comparison of parametric and nonparametric methods (10 cM)

The estimates of QTL position and effect from the REG and the ML methods are very similar not only for normal data, which is consistent with HALEY and KNOTT (1992) and with XU (1995), but also for nonnormal data. Note that the nonparametric test and estimate generally are much more efficient than the parametric versions when data are not normally distributed. There is a modest loss of efficiency with normal data, which agrees with theory for simple linear regression. The marker distances and the magnitude of the QTL effect do not seem to have a large impact on the relative efficiencies of the estimators.

To estimate the power, the rank test statistic Z is first transformed to $\mathrm{LOD}_R = \{2\log(10)\}^{-1} Z^2$, and the test statistic from REG is also transformed to an equivalent LOD score. We then use a threshold of 3 for the LOD scores, which is recommended in practical genome-wide QTL analysis (see also KRUGLYAK and LANDER 1995 for analytic genome-wide threshold calculations). The power is calculated as the proportion of significant tests from 100 simulated data sets. For the extreme case where data are Cauchy distributed, there is no power to detect the QTL by ML or REG interval mapping, while Rank interval mapping does have power.
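The transformation and the power calculation are simple enough to sketch directly. The code assumes that log denotes the natural logarithm, so that the LOD score equals the chi-square statistic Z^2 divided by 2 ln 10. The function names are illustrative.

```python
import math

def lod_from_z(z):
    """LOD_R = Z^2 / (2 ln 10)."""
    return z * z / (2.0 * math.log(10.0))

def empirical_power(z_values, threshold=3.0):
    """Proportion of simulated data sets whose LOD score reaches the threshold."""
    return sum(lod_from_z(z) >= threshold for z in z_values) / len(z_values)
```

A LOD threshold of 3 corresponds to |Z| of about 3.72, since sqrt(6 ln 10) is roughly 3.717.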

To further demonstrate the method, we consider the data on the time to death following infection with Listeria monocytogenes of 116 F2 mice from an intercross between the BALB/cByJ and C57BL/6ByJ strains (BOYARTCHUK et al. 2001). The histogram of the log time to death of the nonsurvivors is given in Figure 1. Roughly 30% of the mice survive beyond 264 hr. From the histogram it is hard to justify that the log time to death of the nonsurvivors is normally distributed. BROMAN (2003) applied four different methods, including both standard interval mapping and nonparametric interval mapping, to this data set and showed that the locus on chromosome 1 appears to affect only the average time to death among the nonsurvivors. For this reason, our analysis is restricted to chromosome 1 and to the nonsurvivors.



 
Figure 1. Histogram of log2(survival time) following infection with Listeria monocytogenes for the 85 nonsurviving mice out of a total of 116 mice. The remaining 31 mice recovered from the infection and survived to the end of the experiment, 264 hr [log2(264) ≈ 8].

The LOD scores obtained by standard interval mapping and by nonparametric interval mapping with the log time to death are plotted in Figure 2. The two methods attain their maximums at the same position, although the LOD curves differ slightly, which results in slightly different confidence intervals for the putative QTL locus by the conventional 1-LOD drop method. The additive and dominance estimates are 0.262 and 0.059, respectively, from standard interval mapping and 0.257 and 0.038, respectively, from our method. To assess whether the differences between the two methods are significant, 1000 bootstrap samples are drawn. We restrict our analysis to chromosome 1. From our method, the 95% confidence interval (CI) of the QTL locus is (50 cM, 84 cM). The mean of the additive effect is 0.247 with standard error 0.077 and the mean of the dominance effect is 0.055 with standard error 0.122. Similarly, from standard interval mapping, the 95% CI of the QTL locus is (51 cM, 92 cM). The mean of the additive effect is 0.268 with standard error 0.071 and the mean of the dominance effect is 0.0284 with standard error 0.122. In all, the nonparametric QTL locus estimator is relatively more efficient than the parametric estimator, and our nonparametric analysis confirms the results of BROMAN (2003).



 
Figure 2. LOD score curves from standard interval mapping (solid line) and nonparametric interval mapping (dashed line).


*  CONCLUSION AND REMARKS

In this article, traditional rank-based estimators for linear regression have been adapted to analyze quantitative traits. The new method has been shown to be very similar to Haley and Knott's regression interval mapping when data are normally distributed and more efficient for nonnormal data. Our simulations indicate that the normal likelihood-ratio-based interval mapping is usually unbiased, even when the data are nonnormal, but may have very low efficiency. All our simulations are based on a one-QTL model. We believe that, like the parametric method, the nonparametric method is very likely to produce ghost QTL when two QTL are close to each other, in which case nonparametric multiple-QTL mapping is needed.

In genetic studies of quantitative traits, adapting rank-based methodologies is complicated because genetic markers are observed only at known loci and the QTL genotypes are usually unknown. Thus, the trait data arise from discrete mixtures of unknown distributions. The mixture structure of the data may distort certain properties of the underlying error distributions. For example, F may be unimodal even though the QTL data may not be. This means that the rank test in QTL mapping may have properties that differ from those for the rank test in linear regression.

As explained in NONPARAMETRIC INTERVAL MAPPING, the rank-based parameter estimate is not generally unbiased with QTL data because the unknown regressors Xi are replaced by their expectations. On the basis of the theory of generalized estimating equations (LIANG and ZEGER 1986), one may show that the estimators of genetic effects from HALEY and KNOTT's (1992) regression method are unbiased, although the variances of the estimators may be larger than those from the Hodges-Lehmann estimators. While the rank-based estimators are theoretically biased, in simulations the bias is negligible when compared with the regression and maximum-likelihood methods.

The computation of $\hat{\beta}$ is usually complicated if the dimension of β is >1 and requires an iterative procedure. KRAFT and VAN EEDEN (1972) proposed an easy one-step modification of the least-squares estimator of β to approximate $\hat{\beta}$. We may use this one-step modification if the calculation of $\hat{\beta}$ is too complicated,

(3)

where is the least-squares estimator of β, and for any d ∈ R^p,

Manuscript received April 13, 2003; Accepted for publication July 25, 2003.
*  LITERATURE CITED

BASTEN, C. J., B. S. WEIR and Z-B. ZENG, 1997 QTL Cartographer: A Reference Manual and Tutorial for QTL Mapping. Department of Statistics, North Carolina State University, Raleigh, NC.

BOYARTCHUK, V. L., K. W. BROMAN, R. E. MOSHER, S. E. F. DORAZIO, and M. N. STARNBACH et al., 2001  Multigenic control of Listeria monocytogenes susceptibility in mice. Nat. Genet. 27:259-260.

BROMAN, K. W., 2003  Mapping quantitative trait loci in the case of a spike in the phenotype distribution. Genetics 163:1169-1175.

DRAPER, N. R., and H. SMITH, 1998 Applied Regression Analysis, Ed. 3. John Wiley & Sons, New York.

DRINKWATER, N. R., 1997 Qlink Documentation. McArdle Laboratory for Cancer Research, University of Wisconsin, Madison, WI.

DRINKWATER, N. R. and J. H. KLOTZ, 1981  Statistical methods for the analysis of tumor multiplicity data. Cancer Res. 41:113-119.

HAJEK, J., 1968  Asymptotic normality of simple linear rank statistics under alternatives. Ann. Math. Stat. 39:325-346.

HAJEK, J., and Z. SIDAK, 1967 Theory of Rank Tests. Academic Press, New York/London.

HALEY, C. S. and S. A. KNOTT, 1992  A simple regression method for mapping quantitative traits in line crosses using flanking markers. Heredity 69:315-324.

JANSEN, R. C. and P. STAM, 1994  High resolution of quantitative traits into multiple loci via interval mapping. Genetics 136:1447-1455.

KAO, C. H., Z-B. ZENG and R. D. TEASDALE, 1999  Multiple interval mapping for quantitative trait loci. Genetics 152:1203-1216.

KRAFT, C. H. and C. VAN EEDEN, 1972  Linearized rank estimates and signed rank estimates for the general linear hypothesis. Ann. Math. Stat. 43:42-57.

KRUGLYAK, L. and E. S. LANDER, 1995  A nonparametric approach for mapping quantitative trait loci. Genetics 139:1421-1428.

LANDER, E. S. and D. BOTSTEIN, 1989  Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121:185-199.

LIANG, K. Y. and S. L. ZEGER, 1986  Longitudinal data analysis using generalized linear models. Biometrika 73:13-22.

LINCOLN, S. E., M. J. DALY and E. S. LANDER, 1993 A Tutorial and Reference Manual for MAPMAKER/QTL. Whitehead Institute for Biomedical Research.

LIU, Y. and Z-B. ZENG, 2000  A general mixture model approach for mapping quantitative trait loci from diverse cross designs involving multiple inbred lines. Genet. Res. 75:345-355.

MORTON, N. E., 1984 Trials of segregation analysis by deterministic and macro simulation, pp. 83–107 in Human Population Genetics: The Pittsburgh Symposium, edited by A. CHAKRAVARTI. Van Nostrand Reinhold, New York.

PURI, M. L., and P. K. SEN, 1985 Nonparametric Methods in General Linear Models. John Wiley & Sons, New York.

SAX, K., 1923  The association of size differences with seed-coat pattern and pigmentation in Phaseolus vulgaris. Genetics 8:552-560.

XU, S., 1995  A comment on the simple regression method for interval mapping. Genetics 141:1657-1659.

ZENG, Z-B., 1993  Theoretical basis of separation of multiple linked gene effects on mapping quantitative trait loci. Proc. Natl. Acad. Sci. USA 90:10972-10976.

ZENG, Z-B., 1994  Precision mapping of quantitative traits loci. Genetics 136:1457-1468.



