Abstract
A number of statistical methods are now available to map quantitative trait loci (QTL) relative to markers. However, no existing methodology can simultaneously map QTL for multiple nonnormal traits. In this article we rectify this deficiency by developing a QTLmapping approach based on generalized estimating equations (GEE). Simulation experiments are used to illustrate the application of the GEEbased approach.
OVER the last 10 years, there has been a great deal of interest in the development of methodology to map quantitative trait loci (QTL) relative to a known marker map in populations derived from inbred line crosses. Perhaps the most commonly used current techniques for univariate traits are based on the work of Jansen (1994) and Zeng (1994). These assume a normal distribution for the environmental errors and solve the resulting likelihood equations via the expectationmaximization (EM) algorithm (Dempsteret al. 1997). These methods naturally extend to simultaneous QTL mapping of several normally distributed traits (Jiang and Zeng 1995). This is useful because when the environmental correlation structure is modeled properly, simultaneous QTL mapping improves the efficiency of parameter estimates (Korolet al. 1995; Henshall and Goddard 1999). Furthermore, pleiotropic effects can be included and estimated. Analogues of these methods based on leastsquares also exist (Haley and Knott 1992; Martínez and Curnow 1992; Knott and Haley 2000) and in most, although not all (Xu 1995; Kao 2000), situations give very similar results to the likelihoodbased methods, with reduced computational complexity (Knott and Haley 2000).
In contrast, relatively little work has been done on mapping procedures for nonnormally distributed traits (but see Hackett and Weller 1995; Visscher et al. 1996a,b; Henshall and Goddard 1999). In particular, no methods map QTL of several correlated nonnormally distributed traits simultaneously. This may be because of the difficulty in specifying a full probability model for such data: This makes likelihoodbased approaches difficult to implement.
Here we avoid these difficulties by using the generalized estimating equation (GEE) approach to correlated data (Liang and Zeger 1986; Diggleet al. 1996). The advantage of the GEE methodology is its semiparametric character: Correct specification of the mean and covariance structure of the model is sufficient to guarantee asymptotically unbiased parameter estimates, regardless of the actual underlying probability model. In fact, estimation of mean parameters (we see below that these include the QTL location and effect) is also robust to misspecification of the covariance structure, although efficiency of estimation is higher the better specified the covariance structure is.
In this article we concentrate on F_{2} populations, but the proposed methodology can easily be applied to any genetic design; we need only derive appropriate mean and variance assumptions for the design of interest.
METHODS
We begin by introducing the GEE concept. We can give only an intuitive motivation of the concept here; more formal treatments can be found in McCullagh and Nelder (1989), Diggle et al. (1996), or Heyde (1997).
Consider a vector of responses Y such that the expectation of Y can be written E(Y) =μ(γ) for some function μ, where γ is a vector of parameters we wish to estimate. Intuitively, a sensible way to do this would be to choose
We often have a number of correlated observations on independent individuals, so that the matrix V(μ, α) becomes block diagonal: Alternatively, the estimating equations may be written as a sum over individuals,
Finally, note that an alternative motivation of GEE is possible by defining the quasilikelihood (QL) q(μ; Y) via
In conclusion, provided we can specify the mean function μ(γ) and the covariance matrix V(α), we can obtain consistent estimates of the parameters of interest γ using GEE. We now derive a suitable mean and variance structure for multivariate QTL mapping.
QTL mapping via GEE: Suppose we have n individuals from an F_{2} population resulting from a cross between two inbred lines, with observations on m quantitative traits and on a number of codominant genetic markers for each individual. The markers are recorded as 1 and −1 for the homozygotes in the two parental lines and 0 for the heterozygotes. The same notation is also applied for the unobserved QTL genotypes, with homozygotes coded as 1 and −1 and heterozygotes as 0. We assume that a marker map exists, although we show below that uncertainty about intermarker recombination fractions can easily be accommodated.
We now derive the mean and variance structure required for our GEE model. We begin by considering the estimation of location and effect for a single QTL for each trait—we consider how multiple QTL might
be dealt with below—and introduce the GEE approach using the familiar special case of normally distributed traits. Denoting the phenotypic value of the kth trait in the jth individual by y_{jk}, the corresponding random variable by Y_{jk}, and the random variable of the unobserved QTL score by Q_{jk}, we assume the connection between the phenotypic information Y_{ij} and Q_{jk} is given by
Let the random variables representing the marker genotype of the left and right flanking markers in the jth individual and the kth trait be
Note the distinction between this and the fulllikelihood approach of Jiang and Zeng (1995): Rather than allowing for the fact that, conditional on a given marker genotype, the phenotypic distribution is a mixture of components corresponding to the unknown QTL genotype, we average out the unknown QTL genotypes to get the conditional mean. Our approach can therefore be seen to be a generalization of the mean assumption of the leastsquaresbased QTLmapping methods (Haley and Knott 1992; Martínez and Curnow 1992) to multivariate data.
Now consider the second moment assumption. To simplify notation, we initially derive the “working” variance matrix under the simplest possible model. We assume that there is no geneenvironment or genetic interaction and that the predictor variables X_{jk} are independent of the flanking marker score. Then the variance of Y_{jk} conditional on the flanking marker information can be written as
We now extend this model to deal with nonnormal traits. This is easily done by applying a link function h_{k}(μ_{jk}) to the conditional mean of Y_{jk} (McCullagh and Nelder 1989), which gives the following mean assumption:
The variance matrix is then constructed on the basis of the link function. We assume that a correlation matrix
Calculation of the conditional means: Here we show how to calculate the expectations given marker genotypes required in μ_{jk}; that is, we calculate
Following the complete interference assumption of Jiang and Zeng (1995), expressions for
Parameterization (12) also gives robustness to variation in map lengths between individuals, for example, because of sex effects, although in practice sexaveraged recombination rates usually give satisfactory results with most methods. Finally, note that these results also hold for the nointerference case.
Parameter estimation: Parameter estimation is performed using a simple extension of the standard GEE approach described by Liang and Zeger (1986). The details, which are by their nature rather technical, can be found in Lange (2000); here we sketch the key steps of the estimation procedure. Recall from Equation 3 that the GEE estimates are defined as the solution of
Equation 3 also involves the matrix of derivatives of each individual's phenotypic means, D_{j}, which is given by
Equation 3 can then be solved by a twostep procedure that iterates between an updating step for the mean parameter estimates, on the basis of the current values of the variance parameters, and an updating step for the variance parameters, on the basis of the current estimates of the mean parameters. In the first step, the estimates of the mean parameter vector
An attractive property of this procedure is that under mild conditions, notably that the momentbased estimators
It is important to note that the validity of this result does not depend on the correct specification of the variance assumption. Regardless of the degree of misspecification, the mean parameter estimates are always consistent and the standard errors provided are correct.
Model selection: QTL mapping for univariate phenotypes relies on searching the genome for putative QTL locations, which give maxima in the likelihood or minima in the residual sum of squares, according to the approach chosen; approximate multipleQTL models are usually fitted by selecting a set of markers, usually via a model selection criterion such as AIC, to include as cofactors when scanning the genome. The evidence for a QTL at any location can then be assessed.
In the above we have shown how QTL location and effect size can be estimated for a given set of marker intervals using our GEE approach, but we have not yet considered how models for different marker intervals can be compared. It is not obvious how to do this in the GEE framework: One of the strengths of the approach is that we need not specify a full probability model for the data, so we have no likelihood to use in model comparison. The quasilikelihood defined above can sometimes be used to play a similar role, but we have already commented that quasilikelihood is not available for our QTL model.
A natural alternative to quasilikelihood is the generalized Pearson chisquare statistic
Models can now be compared just as for the leastsquaresbased methods (see, e.g. Haley and Knott 2000) by replacing the usual RSS by d*. However, as with the leastsquaresbased methods, significance thresholds should be obtained by permutation (Doerge and Churchill 1996) or by use of the parametric bootstrap, rather than by reliance on asymptotic results. Furthermore, analogues of the standard model selection criteria used for linear models can then be defined by replacing the usual residual sum of squares by d*. For example, we can define the Akaike information criterion (AIC) for a given GEE model to be
Any of the standard approaches for QTL analysis can now be implemented. For instance, to reproduce the original intervalmapping approach of Lander and Botstein (1988) for a single trait using GEE we would simply fit a singleQTL model for each interval in turn using the procedure outlined above. This provides estimates of QTL location and effect, together with a value of d*, corresponding to a model with a single QTL in the interval currently under consideration. The interval with the smallest value of d* is then selected, and the corresponding estimates of QTL location and effect are recorded. The QTL will be declared significant if d* is below an appropriate threshold, determined by permutation or simulation as in the leastsquaresbased approaches.
Equivalents to composite interval mapping (Zeng 1994) or multipleQTL mapping (Jansen 1994) can be produced by including markers, usually selected by stepwise variable selection using AIC or similar criteria, as cofactors to control for the presence of QTL outside the interval currently under consideration. It is also of course possible to fit models containing QTL in several intervals simultaneously, by adding the appropriate terms to the models described above. Tests to dissect the genetic architecture of multiple traits can be defined as in Knott and Haley (2000), with d* replacing the RSS. In particular, a test of a single pleiotropic QTL against linked QTL, each affecting a single trait, is produced by comparing the d* values for the relevant models.
RESULTS
Simultaneous QTL mapping for two nonnormally distributed traits: Here we compare the GEE approach to two alternatives that do not explicitly model the nonnormality of the data. The methods used were as follows:
Method I: The GEE approach described above.
Method II: The data are transformed to normality and the appropriate GEE for multivariate normal responses is used (i.e., identity link function, etc.).
Method III: The GEE method appropriate for multivariate normal responses is used on the untransformed data. This is of interest since Visscher et al. (1996a,b) found that assuming normality often works well for univariate binary traits.
A relatively simple experiment is sufficient to compare these methods. We simulate a single chromosome with 16 uniformly distributed markers; the marker interval length is 10 cM, giving a total length of 150 cM. Data are simulated using the following mean assumption:
Results are summarized using box plots displaying the median, 25th percentile, 75th percentile, and the range of the estimated location; and additive and dominance effects over the 200 replicates. The results are in Figures 1, 2 and 3 and Table 5. It is immediately obvious that the efficiency of the methods differs substantially. Method I gives more efficient estimates of QTL location and additive effect, with methods II and III giving estimates of additive effect with considerable bias. Thus even marginal transformation to normality followed by an analysis assuming multivariate normality (method II), although an improvement on use of the untransformed data, is less efficient than the pure GEE approach. Normal theory alone is clearly not sufficient to cope with nonnormally distributed traits.
The poor performance of method III above shows that the good performance of method III for univariate binary traits observed by Visscher et al. (1996a,b) does not generalize. We can explain the good performance of method III for univariate binary traits by noting that for univariate binary data the score equation of the GEE model for method I is given by
Note that this argument applies also to univariate likelihoodbased methods, since in the univariate case the likelihood score and GEE score are identical.
DISCUSSION
We have introduced a method to allow the simultaneous mapping of QTL for several nonnormal traits. Explicitly multivariate analyses of this sort have several advantages over the alternative approach based around a number of singletrait analyses, notably the ease of interpretation of results and the increased efficiency of multivariate analyses (Jiang and Zeng 1995; Korolet al. 1995). It is also possible to test for pleiotropy, although in practice it is very difficult to distinguish between several linked QTL, each affecting a single trait, and a single QTL with pleiotropic effects.
However, full probability models for multivariate nonnormal responses are very difficult to specify, and it is therefore tempting to either ignore the nonnormality or marginally transform the data to normality and then assume multivariate normality in analysis. As the simulation experiment shows, doing this can give poor results. Instead we avoided the difficulties of fulllikelihood methods by adopting a semiparametric approach based on GEE. The key advantages are the avoidance of distributional assumptions, the ease and flexibility of model specification [for example, the model presented here could be easily extended to incorporate factors such as epistasis or parent of origin (imprinting) effects], and the robustness of estimates of QTL parameters to misspecification of the underlying variance structure. We have also shown that a simple reparameterization leads to a method for which the intermarker distances need not be known.
GEE methods are often almost as efficient as fulllikelihood approaches using the correct probability model, while being more robust. Nonetheless, it would be interesting to develop fulllikelihood approaches and to compare these with the GEE approach. This should be possible for at least some special cases, possibly using generalized linear mixed models or hierarchical models implemented in a Bayesian framework via the Markov chain Monte Carlo method. However, such an approach will be extremely computationally demanding, even for relatively small numbers of traits.
Significance thresholds for the GEE approach can be calculated by the usual permutation or simulation procedures, and confidence intervals for QTL location could be obtained by bootstrapping as in Visscher et al. (1996a,b). However, multivariate QTL mapping by its nature becomes increasingly computationally demanding as the number of responses increases, so these computerintensive techniques may be at present limited to relatively small numbers of responses. These are in any case the situation where application of multivariate techniques seems most likely, but less computationally demanding solutions to the threshold/confidence interval problems would nonetheless be useful, and this subject deserves further study. Finally we note that the GEE approach developed here has obvious applications to markerassisted selection for multivariate nonnormal traits. We hope to investigate this elsewhere.
Acknowledgments
We thank Dr. ZhaoBang Zeng and the unknown referee for their constructive comments on an earlier draft of this article. This research was supported by grant MH59532 of The National Institutes of Health.
APPENDIX
Robustness of reparameterization (12): We can investigate the influence of the marker interval length
Footnotes

Communicating editor: ZB. Zeng
 Received November 17, 2000.
 Accepted August 29, 2001.
 Copyright © 2001 by the Genetics Society of America