| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Genetics, Vol. 173, 2269-2282, August 2006, Copyright © 2006
doi:10.1534/genetics.106.058537
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

* Department of Natural Sciences, Royal Veterinary and Agricultural University, DK-1871 Frederiksberg C, Denmark and
Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland 21205
1 Corresponding author: Department of Natural Sciences, Royal Veterinary and Agricultural University, Thorvaldsensvej 40, DK-1871 Frederiksberg C, Denmark.
E-mail: bjarke{at}dina.kvl.dk
| ABSTRACT |
|---|
|
|
|---|
With the advent of dense genetic marker maps, a lot of effort has been devoted to the development of statistical methods for locating QTL and estimating their effects. The seminal article by LANDER and BOTSTEIN (1989) introduced the interval-mapping (IM) method, which considers, one at a time, a suite of putative QTL positions along the genome. In the case of a backcross, say, an individual with genotype g = QQ or Qq at the putative QTL is assumed to have phenotype
. Since the QTL genotypes will generally not be known, the phenotype distribution given the marker data is a mixture of the two normal distributions. Closed-form expressions for the maximum-likelihood estimators are not available for mixtures of normal distributions. Thus, estimation under the IM method must be done numerically; most commonly a version of the expectation-maximization (EM) algorithm (DEMPSTER et al. 1977) is used.
A key disadvantage to the IM method is long computation time, since the EM algorithm often requires many iterations to converge. HALEY and KNOTT (1992) and MARTÍNEZ and CURNOW (1992) independently developed a simple regression method that usually approximates IM very well and requires much less computation. Again, consider a backcross individual with genotype g = QQ or Qq at the putative QTL. The HaleyKnott (HK) regression method applied to backcross data consists simply of regressing the individuals' phenotypes on the conditional probabilities for having genotypes QQ or Qq at the putative QTL, given the marker data. In other words, an individual with marker data m is assumed to have phenotype
. Since we need only to do a simple regression calculation at each putative QTL position, there are great savings in computation time compared to that of the IM method, something that is particularly important when generalizing the methods for multiple-QTL models.
A number of studies have compared the IM and HK methods (HALEY and KNOTT 1992; XU 1998a,b; KAO 2000), and in many cases the two methods provide almost identical parameter estimates and test statistics. There are, however, important differences; for instance, it is well known that the HK method overestimates the residual variance (XU 1995; KAO 2000). In general, the total genetic variance may be split into two parts, one part originating from marker genotype variation and one part from variation of QTL genotype given marker genotypes. In the HK method the latter part is contained in the residual variance estimate, sometimes causing overestimation of the residual variance. Also, KAO (2000), in an extensive comparison of the two methods by computer simulations, found that epistasis between QTL and linkage between QTL may lead to large differences in power to detect QTL and efficiency of parameter estimation. One further difference between the IM and HK methods concerns robustness in the presence of nonnormal phenotype distributions. In such cases the IM method may produce large spurious LOD score peaks (BROMAN 2003; FEENSTRA and SKOVGAARD 2004), something that the HK method avoids.
In this study, we develop an extension of the HK method on the basis of estimating equations. In the HK method the conditional mean of an individual's phenotype given marker data, E(yi | mi), is specified correctly, but the conditional variance Var(yi | mi) is incorrectly assumed to be constant. In the estimating equation (EE) method, we develop joint estimating equations for mean and covariance parameters, on the basis of a coherent specification of both E(yi | mi) and Var(yi | mi). It has been suggested to overcome the bias of the HK method with an iteratively reweighted least-squares (IRLS) approach using Var(yi | mi) to weight observations (XU 1995, 1998a,b). The IRLS method, however, does not fully utilize information about the mean parameters contained in the conditional variance Var(yi | mi) and may be expected to be less efficient than the estimating equation approach taken here.
We explore the performance of the proposed EE method on a number of real and simulated data sets, covering a range of different genetic models and data set structures. Comparison is made with the IM, HK, and IRLS methods with respect to important performance criteria, such as power, robustness, efficiency, bias, and computational speed in implementations. We focus the comparison on situations where either the HK method or the IM method is suspected to perform poorly.
| GENETIC MODEL |
|---|
|
|
|---|
. At any given QTL, the jth say, there are two possible genotypes: QjQj and Qjqj, making the total number of possible QTL genotypes in the population 2m. The goal of a genetic model is to relate the 2m possible genotypic values to a set of genetic parameters, such that these parameters are interpretable in terms of main and epistatic effects of the m QTL. We prefer a genetic model using orthogonal contrast scales because it is consistent in the sense that the effect of a QTL is consistently defined whether the genetic model includes one, two, three, or more QTL (KAO and ZENG 2002; ZENG et al. 2005). The relation between the genotypic value Gi of individual i and the genetic parameters can be expressed by
![]() | (1) |
![]() |
![]() | (2) |
For other kinds of crosses, such as F2 populations, different contrast scales are needed to achieve orthogonality (ZENG et al. 2005).
| STATISTICAL METHODS |
|---|
|
|
|---|
![]() | (3) |
is a row vector of length 2m indicating the multilocus QTL genotype; i.e., one of the Xig = 1, the rest are zeros;
is the vector of parameters to be estimated; and ei is a random error term with an unspecified distribution. The relationship between the model parameters ß and the genetic model parameters is important to guide the formulation of relevant hypotheses to test and to interpret the estimates from the final model. Fortunately, estimates of the model parameters are readily translated to genetic parameter estimates; we outline how to do this below. A key QTL-mapping problem is how to deal with missing genotype data, since the QTL genotypes, Xi, are generally not observed. A number of approaches exist for this; we briefly describe the IM and HK methods and go on to introduce the proposed EE method.
Interval mapping:
The interval-mapping method, pioneered by LANDER and BOTSTEIN (1989) and generalized to multiple loci by KAO et al. (1999), was the first approach to fully exploit the fact that QTL are located in intervals flanked by genetic markers with observed genotypes. This means that given a genetic marker map and a putative QTL position and assuming a map function, we may calculate pig = Pr(g | mi), the conditional probability of QTL genotype g given the multipoint marker data mi. The IM method assumes that
in Equation 3 and models the phenotypes given the observed marker data as a mixture of normal distributions. The likelihood function for the parameters, ß,
2 is
![]() | (4) |
2) being the density function of a normal distribution with mean ßg and variance
2.
HaleyKnott regression:
The HaleyKnott regression method deals differently with the missing observations Xig in the statistical model (3). Although the genotypes Xig are unobserved, we may calculate their conditional expectations given the marker data. Actually, in the case of backcross populations, E(Xig | mi) = pig, since the Xig are indicator variables. The HK method replaces Xig with E(Xig | mi) in the regression (3), which then becomes
![]() | (5) |
. Thus, the likelihood function is
![]() | (6) |
An estimating equation method:
Like the IM and HK methods, the estimating equation method considers the phenotype distribution given the marker data. Initially, we assume that the marginal density function of the phenotype, yi, given the marker data, mi for individual i has the general form
![]() | (7) |
![]() | (8) |
![]() | (9) |
2, is assumed to be the same for all individuals and QTL genotypes and may be interpreted as the environmental variance, whereas the second component corresponds to the variance due to uncertainty of QTL genotype given marker data and varies with marker and QTL genotype.
To estimate the ßg-parameters as well as
2, we must find a set of estimating equations for the parameters satisfying the requirement that their expectation equals 0. For simplicity, we further make the assumption in this article that
. This resembles the assumption in the HK method that
; we view the method presented here as an extension of the HK method that uses a coherent specification of both E(yi | mi) and Var(yi | mi) by which the variance reflects the uncertainty about the QTL genotype. We use the resulting score equations as estimating equations for the parameters. It should be emphasized, however, that these score equations may be used perfectly well as estimating equations without assuming normality.
The contribution of a single observation to the negative log-likelihood function for the normal model is
![]() |
Differentiating this function with respect to the parameters ßg and
2, summing over individuals, and setting the resulting score function equal to 0 yields the estimating equations
![]() | (10) |
![]() | (11) |
and
. Note that E(zi) = 0 and E(zi2) = 1, confirming that the expectations of the left-hand sides of both Equations 10 and 11 equal 0 as required.
The estimating equations must be solved numerically. To do so, we implemented an algorithm that uses two conditional maximizations in each iteration. First, estimates of ßg are updated by solving Equation 10 while keeping fixed
2 and the ßg's that enter in
i and zi2. Second, an updated estimate of
2 is obtained by solving Equation 11 with the ßg's fixed. We iterate until the estimates converge.
Relation to iteratively reweighted least-squares regression:
XU (1995) first pointed out that the HK method tends to give biased estimates of the residual variance. It was suggested to correct the bias by using Equations 8 and 9 for the conditional mean and variance, respectively, of the phenotypes given full marker data. XU (1998a,b) further assumed that
and claimed to maximize the corresponding likelihood function by the iteratively reweighted least-squares method. In the IRLS method, the variance is written
![]() |
i.e., vi depends both on
2 and on the ßg. In the iterations, updated parameter estimates are obtained by treating the vi as known weights and performing weighted least-squares regression,
![]() | (12) |
![]() | (13) |
![]() | (14) |
The above two equations correspond to Equations 7 and 8 in XU (1998b). It may be shown that IRLS iterations using Equations 12 and 13 are (asymptotically) equivalent with using
![]() | (15) |
![]() | (16) |
2. These estimating equations are simpler than the ones (Equations 10 and 11) used in the EE method, and intuitively it might be expected that the EE method captures more information about the parameters than the IRLS method. In the APPENDIX, we demonstrate that the EE method is indeed more efficient than the IRLS method under the assumptions used here. Situations where the assumption that
is not met are explored by computer simulation in the RESULTS.
Estimating genetic parameters:
To illustrate the translation from genetic parameters in the genetic model (1) to mean parameters in the statistical model (3) and vice versa, we consider parameters from a model with three loci in a backcross population.
At each QTL, we index homozygotes by 2 and heterozygotes by 1, e.g., the parameter ß211 corresponds to QTL genotype Q1Q1/Q2q2/Q3q3. Expressed in matrix notation, the relation between the two types of parameters is
![]() | (17) |
, where S is the genetic effect design matrix and
is the vector of genetic parameters. Conversely,
may be found from ß by
= S1ß. In the case where there is no three-locus epistasis, i.e., c123 = 0, there is a constraint on the parameter vector ß in the sense that ß111 can be expressed as a function of the other seven ßg's. Writing the genetic effect design matrix as
![]() |
![]() |
| RESULTS |
|---|
|
|
|---|
As a starting point, we considered the simple situation of detecting a single QTL and estimating its effect. A backcross population was simulated with a single QTL placed at the center of chromosome 1, which had a length of 100 cM and six evenly spaced markers. Progeny sizes of 50, 100, and 200 were considered and the corresponding ranges of additive effects of the QTL (a1 in Equation 2) were 0.543.80, 0.342.00, and 0.201.00, respectively. In each case, six different values for the additive effect were simulated. Moreover, three different genetic models were used in the simulation setup. First, no other QTL were segregating in the population. Second, a QTL of moderate effect (a2 = 0.60) was segregating at a position unlinked to chromosome 1. Third, two unlinked but strongly interacting QTL (a2 = 1.00, a3 = 1.00, and b23 = 4.00) were segregating at positions unlinked to chromosome 1. The environmental variation was sampled from a standard normal distribution in all cases. Single-locus scans were conducted with all four methods to detect the QTL on chromosome 1. In this simple setup there were only minor differences between the methods for all genetic models, and we therefore summarize the results in text only. Power and precision of locating the QTL were virtually the same for all methods as were mean parameter estimates. In accordance with previous reports (XU 1995; KAO 2000) the HK method overestimated the residual variance. There was a slight but consistent trend that the lowest standard deviations and mean squared errors on mean parameter estimates were seen with the IM method followed by the EE, IRLS, and HK methods.
Since the differences in the simple simulation setup were only minor, we focus our attention in the following on more complicated situations where either the HK method or the IM method is known to perform poorly.
Nonrandom missing data patterns:
In many cases the costs involved in genotyping an individual for a large collection of genetic markers are much higher than the costs of obtaining the individual's phenotype. In such situations, selective genotyping, where individuals with extreme phenotypes are genotyped much more heavily than intermediate ones, may be an effective strategy for reducing experimental costs without losing much information about the QTL behind the trait of interest (LANDER and BOTSTEIN 1989; DARVASI and SOLLER 1992; SEN et al. 2005). It appears, however, that the HK method is particularly sensitive to the special kind of nonrandom missing data that follow from selective genotyping.
Consider, for example, the data set consisting of 250 backcross mice studied for hypertension in SUGIYAMA et al. (2001). Initially, individuals with extreme phenotypes were genotyped; in regions showing evidence for QTL, all individuals were genotyped and additional markers were added. Further, at some markers only recombinant individuals were genotyped. In 8 of the 19 autosomes, only 46 individuals in each extreme of the phenotype distribution were genotyped; i.e., the middle 158 individuals were not genotyped for any markers on those chromosomes. When those eight chromosomes are scanned with both the IM and the HK methods, the LOD curves produced by the HK method exceed those produced by the IM method, with differences of up to 1 LOD score unit. If, however, the phenotypes of intermediate individuals are discarded, the LOD curves produced by the HK method are virtually indistinguishable from those produced by the IM method for the eight chromosomes considered (results not shown).
To further investigate this behavior of the HK method, we simulated 250 backcross individuals and a single chromosome of length 100 cM with 10 markers and a QTL at position 45 cM explaining 14% of the phenotypic variance. We scanned the simulated chromosome with one-locus versions of the IM, HK, and EE methods in the case where all individuals had complete marker data, but we also considered the case of selective genotyping by letting observations from the 40th to the 60th percentile in phenotype distribution have missing observations for all markers on the chromosome.
Figure 1 shows that discarding marker genotypes for individuals in the middle 20% of the phenotype distribution inflates the HK LOD curve over the whole range of the chromosome compared to using the HK method with all marker data and compared to using the IM method. The inflation is most pronounced at the peak of the LOD curve. The EE method almost completely avoids this problem of inflated LOD curves. In Figure 1, it can be seen that the LOD curves for the EE method are close to the IM curve, both in the case of full marker data and in the case of selective genotyping.
|
![]() |
. In Figure 2, regression lines are shown for the position with the largest LOD score. Observations from the middle 20% of individuals are shown as solid dots and other observations as open circles.
|
As may be seen from the simulation example in Figure 1, the EE method almost completely avoids this problem, as it weights observations by their inverse variances. Thus, observations with large variance due to uncertainty of QTL genotype [i.e., Pr(QQ | mi) close to 0.5] have little weight in the likelihood calculation.
Nonnormal phenotype distributions:
In earlier work, we have considered QTL mapping strategies in situations where the phenotype distribution deviates from a normal distribution (BROMAN 2003; FEENSTRA and SKOVGAARD 2004). The IM method can occasionally produce spurious LOD score peaks in regions of low genotypic information (e.g., widely spaced markers), especially if the phenotype distribution deviates markedly from a normal distribution. This is caused by the fact that the IM method models the phenotype distribution as a mixture of two or more normal distributions when a QTL is included in the model, while only using a single normal distribution under the null hypothesis. If the phenotype distribution is not normal, the model including a QTL may fit the data much better than the null model, even if there is no real QTL and no genetic marker information (FEENSTRA and SKOVGAARD 2004).
In FEENSTRA and SKOVGAARD (2004), we considered models with a single QTL and developed a two-component mixture model that avoids the problem of spurious LOD score peaks. Here, we broaden our view to models with more than one QTL with possible epistatic interactions between loci. It appears that the problem of spurious LOD score peaks gets worse when the IM method is used to map more QTL simultaneously.
Figure 3 shows the results of two-QTL scans of a simulated data set consisting of 80 backcross individuals, five chromosomes of length 140 cM with 12, 12, 8, 6, and 4 markers, respectively, and two epistatically interacting QTL on chromosome 1 at position 45 cM and chromosome 2 at position 5 cM, respectively. In Figure 3A, results for the IM method are shown. It can be seen that the interacting QTL on chromosomes 1 and 2 are detected with high LOD scores. However, there are also large areas in the plot corresponding to combinations of positions on other chromosomes with high LOD scores. These high LOD score areas involve chromosomes with few markers, i.e., little genetic information, strongly suggesting that this is the same kind of artifact as the spurious LOD score peaks seen in one-QTL scans. In this simulation example, the residual variation was normal, but the influence of the two interacting QTL caused the phenotype distribution to be nonnormal, thereby allowing the phenomenon of artificially high LOD scores at other positions.
|
Epistasis between QTL:
In an extensive analytical and simulation-based comparison between the IM and HK methods, KAO (2000) found that there may be significant differences between the two methods, especially if QTL interact or are linked. Here, we focus on the situation where two unlinked QTL interact. We use the simulation setup from Table 3 in KAO (2000) and also include the IRLS and EE methods in the comparison. Data were generated from a genetic model with two unlinked epistatic QTL with genetic parameters µ = 0, a1 = 1, a2 = 1, and b12 = 1, 2, or 3 (cf. Equation 1). Thus, the strength of epistasis was increased compared to the additive effects of the QTL. Estimates of the genetic parameters,
2, and the broad sense heritability, h2 (see, for example, FALCONER and MACKAY 1996), were recorded as well as likelihood-ratio test (LRT) statistics comparing the full model with a null model of no QTL.
Table 1 displays the simulation results. The means of the estimated main and epistatic effects by the four methods are almost identical and very close to the true values for all three degrees of epistasis. Standard deviations (SDs) and mean square errors (MSEs) are smallest for the IM method, slightly larger for the EE method, and largest for the HK method. The IRLS method resembles the HK method with respect to SDs and MSEs for the main and epistatic effect estimates.
|
2 and h2. The EE method does not show this bias and provides almost identical estimates of
2 and h2 to those by the IM method (Table 1). The IRLS method also performs very well in this respect. We note that the results are in good accordance with those from KAO (2000), with one exception. When calculating the genetic variance component, and thereby the heritability for the HK method, KAO (2000) does not use the model that the data were simulated from (Equation 3). Rather, it appears that the modified regression model (Equation 5) is used for calculating the genetic variance. Indeed, when calculating the genetic variance on the basis of Equation 5 we get h2-estimates for the HK method of 0.310, 0.279, and 0.259 for the three levels of epistasis, which are very close to the values 0.302, 0.277, and 0.255 reported by KAO (2000). Thus, while our estimates of h2 by the HK method are also clearly biased (Table 1), our findings indicate that the bias in KAO (2000) is exaggerated. In any case, the EE method avoids the bias and approximates the IM method very well for all parameter estimates and with respect to the LRT statistics. In addition, it is superior to the IRLS method, considering the efficiency of the parameter estimates.
Linked QTL:
The most pronounced differences between the IM and HK methods are found when two QTL of opposite effect are linked (KAO 2000). Here we focus on that situation, using the simulation setup from Table 5 in KAO (2000) and also including the EE and IRLS methods in the comparison. Data were generated from a genetic model with two linked QTL of opposite effects without epistasis (µ = 0, a1 = 1, a2 = 1, b12 = 0, cf. Equation 1). The two QTL were placed in two neighboring 40-cM intervals and were 10, 20, 30, or 40 cM apart from each other. Haldane's map function was used.
Table 2 shows the simulation results. Again, the means of the estimates of µ, a1, and a2 are close to the true values for all four methods, and again the IM method has the lowest MSEs on the paramater estimates, followed by the EE method and then the IRLS and HK methods. Estimates of
2 and h2 from the HK method are even more biased than in the epistasis case. Again, the IM, IRLS, and EE methods provide very similar estimates of
2 and h2. Like in the epistasis simulations, the results are in good accordance with those reported in KAO (2000) with the exception that the h2-estimates for the HK method are not as dramatically biased as those in KAO (2000).
|
However, this did not come without a price. To investigate the effect on chromosomes unlinked to QTL we simulated a second chromosome with the same marker spacing, but with no QTL on it, while retaining the phenotypes (which are strongly influenced by the QTL on chromosome 1). We then considered positions on this second chromosome that mirrored the positions of the QTL on the first chromosome and calculated LRT statistics for going from a model with two additively acting QTL on the second chromosome to a model with just one QTL (data not shown). Since there were no QTL on this second chromosome, we would expect low LRT statistics. On the contrary, very high LRT statistics were observed for the IM method (the 95th percentile was at 35.5), strongly suggesting that the phenomenon of spurious LOD score peaks had occurred. The HK, IRLS, and EE methods did not show such high LRT statistics on the unlinked chromosome (the 95th percentiles were at 4.2, 6.0, and 3.8, respectively).
Taking a closer look at the LRT statistics for the IRLS method on this second chromosome with no QTL on it revealed another problem. Following XU (1998a,b) we calculate LRT statistics for the IRLS method using the likelihood based on the assumption that
. However, it follows from the APPENDIX that the IRLS method does not give maximum-likelihood estimates corresponding to this likelihood (in contrast to the EE method). Thus, the LRT statistics for the IRLS method are not guaranteed to be nonnegative. In fact, analyzing the second chromosome with no QTL while keeping the phenotypes, which are influenced by the two 10-cM-apart QTL on the first chromosome, yielded negative LRT statistics for the IRLS method in
400 of 1000 simulation replicates with the 10th percentile being at 1.2.
In summary, these simulations have shown that the EE method approximates the IM method very well when two loci with opposite effects are closely linked. The EE method avoids the bias shown by the HK method, estimates the parameters more efficiently than the IRLS method, and also avoids problems with artificially high LRT statistics on other chromosomes observed with the IM method and negative LRT statistics seen with the IRLS method.
| DISCUSSION |
|---|
|
|
|---|
First, the multidimensional model space may be searched much more efficiently compared to doing an exhaustive grid search. More efficient model search procedures include techniques such as forward selection and backward elimination to search through nested sequences of models (BROMAN and SPEED 2002), randomization algorithms such as Markov chain Monte Carlo (YI 2004) or a genetic algorithm (CARLBORG et al. 2000), and deterministic global optimization algorithms that repetitively divide the search space into smaller parts (LJUNGBERG et al. 2004).
Second, any model space search procedure will involve fitting the statistical model many times. Thus, a fast and efficient method for estimating model parameters is needed to reduce total computation time. Currently, the HK method is preferred as a fast approximation to the IM method for estimating model parameters. However, the HK method is known to produce biased estimates of the residual variance and to be sensitive to epistasis and linkage between QTL.
We have focused on the latter issue, fitting multilocus QTL models fast and efficiently. An extension of the HK method is proposed and formulated using estimating equations. This EE method involves simultaneously solving estimating equations for both mean and variance parameters. We have compared the IM, HK, IRLS, and EE methods primarily by computer simulation, focusing on situations where either the HK method or the IM method performs poorly.
It is found here that the HK method is sensitive to certain missing data patterns, e.g., as arise from selective genotyping. With such data, the HK LOD curve may be artificially inflated over large stretches of the genome. The EE method alleviates this problem and produces LOD curves very similar to IM LOD curves. Also, the HK method suffers from large bias in the estimation of the residual variance and has lower power to detect QTL than the IM method, especially in situations of epistatically interacting QTL or QTL that are linked (KAO 2000). Here, it is found that the EE method approximates the IM method more closely in cases of epistasis or linked QTL: it produces unbiased estimates of the residual variance, it has smaller standard deviations on the parameter estimates than the HK method, and it has high power to detect even closely linked QTL of opposite effect.
In comparison to the IM method, the EE method has increased robustness toward nonnormal phenotype distributions. The IM method occasionally produces large spurious LOD score peaks in regions with little marker information if the phenotype distribution deviates markedly from a normal distribution (FEENSTRA and SKOVGAARD 2004). This artifact is caused by the fact that a mixture distribution with many components always produces a better fit than a mixture with few components. It is found that the problem is aggravated for models with multiple loci. The HK and EE methods are immune to this problem, since single normal distributions are used both for full and for reduced models.
The EE method is not as fast as the HK method, since it involves solving a set of estimating equations numerically. Still, it may provide gains in computational speed compared to the IM method. In full two-locus genome scans, for example, our implementation of the EE method was twice as fast as the IM method in computation time, and we expect additional gains in speed when the code has been further optimized.
We have demonstrated analytically in the APPENDIX that the EE method is more efficient than the HK and IRLS methods under the assumption that
. Admittedly, this assumption may often be violated; the residual variation of certain traits may be inherently nonnormal and the effects of major QTL may also cause the phenotype distribution to deviate from normality. It is evident, however, from the simulations investigating epistasis and linkage that the EE method can also be very efficient compared to the HK and IRLS methods in cases where yi | mi is clearly not normally distributed. Moreover, the IRLS method may result in negative LRT statistics, something that the EE method avoids.
The estimating equations used by the EE method involve weighted linear combinations of the simple estimating functions yi µi and (
. Under the assumption taken by the HK, EE, and IRLS methods that
, it may be shown that the EE method is asymptotically optimal in the sense that any other linear combination of the simple estimating functions gives rise to estimates with a larger asymptotic covariance matrix (GODAMBE and HEYDE 1987). A further development might be to assume a specific distribution of the residuals in Equation 3 and derive optimally weighted combinations of the simple estimating functions on the basis of this distribution. We anticipate that the gain in efficiency compared to the EE method will be minimal and possibly at the expense of greater numerical instability.
Few authors have employed estimating equations for mapping QTL in experimental crosses. LANGE and WHITTAKER (2001) provide a recent exception and develop a generalized estimating equation (GEE) approach for QTL mapping of multiple correlated traits. However, in contrast to the EE method, these authors assume that the variance due to uncertainty of QTL genotype given marker genotype could be ignored. This is the same assumption taken by the HK method, which may lead to problems of inflated LOD curves, biased variance estimates, and low power, as seen here. It might be worthwhile to pursue the estimating equation approach further compared to this presentation. For instance, the EE method, as proposed here, tests hypotheses by likelihood-ratio tests based on a normal model. It would, however, be perfectly possible to still obtain parameter estimates by solving the estimating equations (Equations 10 and 11), but then use, e.g., score-type test statistics for hypothesis testing. This could possibly contribute some extra robustness compared to using the EE method in conjunction with LRT statistics.
In conclusion, the estimating equation method presented here may be used as a fast and efficient approach for mapping multiple QTL. Generally, it performs better than the HK method at approximating the IM method. Importantly, it avoids problems shown by the HK method in situations with special missing data patterns, epistasis, and linked QTL. Furthermore, the EE method is more robust than the IM method toward nonnormal phenotype distributions, and it is computationally faster. These issues become especially important in the analysis of multiple-QTL models.
The EE method was implemented with new functions soon to be incorporated in the QTL mapping software R/qtl (BROMAN et al. 2003), an add-on package for the general statistical software, R (IHAKA and GENTLEMAN 1996; R DEVELOPMENT CORE TEAM 2005).
| APPENDIX |
|---|
|
|
|---|
) denote the 2m + 1 vector of estimating functions for the EE method, i.e., GEE(y;
) corresponds to the left-hand sides of the estimating equations (Equations 10 and 11). Similarly, GIRLS(y;
) denotes the estimating functions for the IRLS method, corresponding to the left-hand sides of Equations 15 and 16.
In general, the asymptotic distribution of an estimator
that solves a set of estimating equations G(y;
) = 0 is Gaussian with mean
and variance of the so-called sandwich form
![]() |
The information matrix may be found by inverting
:
![]() |
Assuming that
(or slightly more generally, that the third and fourth cumulants are zero) the information matrices for the EE and IRLS methods are given by the matrix expressions (derivations not shown)
![]() | (A1) |
![]() | (A2) |
![]() |
(i.e., a 1 x n matrix), and 1n is the identity vector of length n.
To invert the information matrices, we use the following result about inverting a partitioned square matrix. Let
![]() |
![]() |
The submatrix corresponding to
may be considered the effective information about the mean parameters, ß, since inverting it yields the asymptotic variance on the estimate of ß. The effective information matrices for the EE and IRLS methods are
![]() | (A3) |
![]() | (A4) |
Consider the difference between the effective information for the two methods:
![]() |
By a matrix version of the CauchySchwarz inequality (MARSHALL and OLKIN 1990) it can be seen that
ieff is positive definite. Thus, by inverting the effective information matrices we get that
is negative definite asymptotically; i.e., the EE method estimates the mean parameters more efficiently than the IRLS method.
We also consider efficiency under the HK method. The closed-form expression for the estimator of ß is
![]() |
![]() |
The corresponding effective information about ß may be found by inverting this matrix, since the estimator of ß is independent of that of
2 under HK regression. This matrix may be written
![]() |
![]() |
Consider now the difference between the effective information for the IRLS method and that for the HK method
![]() |
Again, it follows from the matrix version of the CauchySchwarz inequality MARSHALL and OLKIN (1990) that
is positive definite and consequently that
is negative definite asymptotically. Thus the EE method is more efficient than the IRLS method, which in turn is more efficient than the HK method.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
| LITERATURE CITED |
|---|
|
|
|---|
BROMAN, K. W., 2003 Mapping quantitative trait loci in the case of a spike in the phenotype distribution. Genetics 163: 11691175.
BROMAN, K. W., and T. P. SPEED, 2002 A model selection approach for the identification of quantitative trait loci in experimental crosses. J. R. Stat. Soc. B 64: 641656.[CrossRef]
BROMAN, K. W., H. WU,
. SEN and G. A. CHURCHILL, 2003 R/qtl: QTL mapping in experimental crosses. Bioinformatics 19: 889890.
CARLBORG, Ö., and L. ANDERSSON, 2002 Use of randomization testing to detect multiple epistatic QTLs. Genet. Res. 79: 175184.[CrossRef][Medline]
CARLBORG, Ö., L. ANDERSSON and B. KRINGHORN, 2000 The use of a genetic algorithm for simultaneous mapping of multiple interacting quantitative trait loci. Genetics 155: 20032010.
DARVASI, A., and M. SOLLER, 1992 Selective genotyping for determination of linkage between a marker locus and a quantitative trait locus. Theor. Appl. Genet. 85: 353359.
DEMPSTER, A. P., N. M. LAIRD and D. B. RUBIN, 1977 Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39: 138.
FALCONER, D. F., and T. F. C. MACKAY, 1996 Introduction to Quantitative Genetics, Ed. 4. Pearson Education, Harlow, UK.
FEENSTRA, B., and I. M. SKOVGAARD, 2004 A quantitative trait locus mixture model that avoids spurious lod score peaks. Genetics 167: 959965.
GODAMBE, V. P., and C. C. HEYDE, 1987 Quasi-likelihood and optimal estimation. Int. Stat. Rev. 55: 231244.
HALEY, C. S., and S. A. KNOTT, 1992 A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69: 315324.[Medline]
IHAKA, R., and R. GENTLEMAN, 1996 R: a language for data analysis and graphics. J. Comp. Graph. Stat. 5: 299314.[CrossRef]
KAO, C.-H., 2000 On the differences between maximum likelihood and regression interval mapping in the analysis of quantitative trait loci. Genetics 156: 855865.
KAO, C.-H., and Z-B. ZENG, 2002 Modeling epistasis of quantitative trait loci using Cockerham's model. Genetics 160: 12431261.
KAO, C.-H., Z-B. ZENG and R. D. TEASDALE, 1999 Multiple interval mapping for quantitative trait loci. Genetics 152: 12031216.
LANDER, E. S., and D. BOTSTEIN, 1989 Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185199.
LANGE, C., and J. C. WHITTAKER, 2001 Mapping quantitative trait loci using generalized estimating equations. Genetics 159: 13251337.
LJUNGBERG, K., S. HOLMGREN and Ö. CARLBORG, 2004 Simultaneous search for multiple QTL using the global optimization algorithm DIRECT. Bioinformatics 20: 18871895.
LYNCH, M., and B. WALSH, 1998 Genetics and Analysis of Quantitative Traits. Sinauer, Sunderland, MA.
MARSHALL, A. W., and I. OLKIN, 1990 Matrix versions of the Cauchy and Kantorovich inequalities. Aequationes Mathematicae 40: 8993.[CrossRef]
MARTÍNEZ, O., and R. N. CURNOW, 1992 Estimating the locations and the sizes of the effects of quantitative trait loci using flanking markers. Theor. Appl. Genet. 85: 480488.
R DEVELOPMENT CORE TEAM, 2005 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna.
REBAÏ, A., 1997 Comparison of methods for regression interval mapping in QTL analysis with non-normal traits. Genet. Res. 69: 6974.[CrossRef]
SEN,
., J. M. SATAGOPAN and G. A. CHURCHILL, 2005 Quantitative trait locus study design from an information perspective. Genetics 170: 447464.
SUGIYAMA, F., G. A. CHURCHILL, D. C. HIGGINS, C. JOHNS, K. P. MAKARITSIS et al., 2001 Concordance of murine quantitative trait loci for salt-induced hypertension with rat and human loci. Genomics 71: 7077.[CrossRef][Medline]
XU, S., 1995 A comment on the simple regression method for interval mapping. Genetics 141: 16571659.[Medline]
XU, S., 1998a Further investigation on the regression method of mapping quantitative trait loci. Heredity 80: 364373.
XU, S., 1998b Iteratively reweighted least squares mapping of quantitative trait loci. Behav. Genet. 28: 341355.[CrossRef][Medline]
YI, N., 2004 A unified Markov chain Monte Carlo framework for mapping multiple quantitative trait loci. Genetics 167: 967975.
ZENG, Z-B., T. WANG and W. ZOU, 2005 Modeling quantitative trait loci and interpretation of models. Genetics 169: 17111725.
This article has been cited by other articles:
![]() |
L. Ronnegard, F. Besnier, and O. Carlborg An Improved Method for Quantitative Trait Loci Detection and Identification of Within-Line Segregation in F2 Intercross Designs Genetics, April 1, 2008; 178(4): 2315 - 2326. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Manichaikul, A. A. Palmer, S. Sen, and K. W. Broman Significance Thresholds for Quantitative Trait Locus Mapping Under Selective Genotyping Genetics, November 1, 2007; 177(3): 1963 - 1966. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||