MultipleTrait Mapping of Quantitative Trait Loci After Selective Genotyping Using Logistic Regression
 John M. Henshall and
 Michael E. Goddard1
 Animal Genetics and Breeding Unit,2 University of New England, Armidale, New South Wales 2351, Australia
 Corresponding author: John Henshall, Animal Genetics and Breeding Unit, The University of New England, Armidale, NSW 2351, Australia. Email: jhenshal{at}metz.une.edu.au
Abstract
Experiments to map QTL usually measure several traits, and not uncommonly genotype only those animals that are extreme for some trait(s). Analysis of selectively genotyped, multipletrait data presents special problems, and most simple methods lead to biased estimates of the QTL effects. The use of logistic regression to estimate QTL effects is described, where the genotype is treated as the dependent variable and the phenotype as the independent variable. In this way selection on phenotype does not bias the results. If normally distributed errors are assumed, the logisticregression analysis is almost equivalent to a maximumlikelihood analysis, but can be carried out with standard statistical packages. Analysis of a simulated halfsib experiment shows that logistic regression can estimate the effect and position of a QTL without bias and confirms the increased power achieved by multipletrait analysis.
EXPERIMENTS to detect and locate quantitative trait loci (QTL) in livestock species are becoming more common, driven by the potential for increased genetic gain in traits of economic importance (e.g., Soller 1978; Meuwissen and Goddard 1996). A variety of experimental designs have been used, and various statistical methods have been applied to the resulting data. With commercial livestock, designs commonly measure phenotype and genotype on halfsib families, because of the higher reproductive capacity of males. The phenotype records might be measured directly on the halfsibs, as in a daughter design, or on the progeny of the halfsibs, as in a granddaughter design (Welleret al. 1990). The data are commonly analyzed using maximumlikelihood or regression interval mapping (Lander and Botstein 1989; Haley and Knott 1992; Martinez and Curnow 1992). The regression methods are computationally less demanding, which may be relevant if techniques such as permutation testing (Churchill and Doerge 1994) are used to determine significance thresholds. These methods were initially developed for detecting QTL that affect a single trait. Even when phenotype measurements have been available on multiple traits, the results of singletrait analyses have generally been presented in the literature.
In recent years, the possibility of multipletrait QTL detection has been considered by a number of researchers (e.g., Jiang and Zeng 1995; Korolet al. 1995; Welleret al. 1997). There are several reasons why multipletrait QTL mapping is of interest. Not only will an understanding of a QTL's part in the genetic covariance structure of economically important traits be important if selection decisions are to be based on QTL genotype, but the statistical power to detect QTL is potentially higher in multipletrait analysis than in singletrait analysis. Both Korol et al. (1995) and Jiang and Zeng (1995) demonstrate such increased power on simulated datasets while maximizing the likelihood to obtain multipletrait parameter estimates.
Another approach often applied in livestock QTL detection experiments is selective genotyping, in which only animals with extreme phenotypes are genotyped (see Lebowitzet al. 1987; Lander and Botstein 1989; Darvasi and Soller 1992; Muranty and Goffinet 1997; Bovenhuis and Spelman 1998). For a given number of animals genotyped, the power to detect QTL is increased with this approach. However, although simple regression methods can be used to estimate parameters with selective genotyping, the estimates will be biased. To obtain unbiased estimates, maximum likelihood can be applied to the full dataset, including the ungenotyped animals (Lander and Botstein 1989), or approximations to the parameters can be made (Darvasi and Soller 1992; Muranty and Goffinet 1997). Markov chain Monte Carlo methods, which sample missing data, are also appropriate for selectively genotyped data (see Binket al. 1998; Jansenet al. 1998). The estimation of the effects of QTL in traits correlated to the trait in which selective genotyping occurred is also problematic (see Welleret al. 1997). Unless the selective genotyping and the correlation are taken into account, parameter estimates in the correlated trait will be biased. Maximumlikelihood methods or the less computationally demanding approximation methods of Muranty and Goffinet (1997) and Bovenhuis and Spelman (1998) can be applied. These authors state that these approximations are suitable when QTL effects are small.
Most statistical methods currently used for QTL detection, including those mentioned above, make comparisons between the phenotypes of alternate marker genotypes. Alternatively, the marker genotypes of differing phenotypes can be compared (Stuber et al. 1980, 1982). Lebowitz et al. (1987) called this approach trait based, as opposed to marker based, and presented methods to compare the marker allele frequencies in divergent selection lines or selectively genotyped individuals.
In this article, a more general traitbased method is presented. The method is suited to halfsib data and addresses the problems that arise with multipletrait QTL detection on selectively genotyped data. Results comparable with those obtained with maximum likelihood on the full dataset can be achieved, using software in standard statistical packages. The method is regression based, but instead of regressing phenotype on genotype, the regression is genotype on phenotype. This replaces the assumption that the phenotypes are unselected with the assumption that there was no selection based on genotype. This assumption is easily satisfied by including all genotyped animals in the analysis. With halfsib experiments, in treating genotype as the response, the response variable is binary. Methods for analyzing binary data are well understood (e.g., Cox and Snell 1989; Dobson 1990), and suitable subroutines are included in the major statistical computing packages.
STATISTICAL METHODS
Single trait, no recombination model: In the first case considered, it is assumed that there is no recombination between the genotyped locus and the locus affecting the quantitative trait. This would apply when testing for an effect from a candidate gene. A singlesire, halfsib design will be assumed, with no genotypes available on the dams. Let the QTL have two alleles, Q and q, and the genotypic marker have two alleles, M and m. The model can be written as
If we assume the e_{i} has a normal distribution with variance
The total variance, σ^{2}, is composed of
Figure 1 contains an example of the function p, where the sire QTL allele accounts for 20% of the total variance. The underlying normal distributions are also shown. The parameter b is related to the “slope,” and it is our estimate of b that allows us to estimate the magnitude of the QTL. For the mixture distribution in Figure 1, selecting the upper and lower 5% of observations would exclude records with phenotypes between –2 and 2. For phenotypes of ±2, the ratio of allele frequencies is around 0.1:0.9, and the shape of the curve p between –2 and 2 can be interpolated reasonably well even without observations in this region. If more extreme selection were applied, or if the QTL effect were larger, then the ratio of allele frequencies at the truncation points might approach 0.0:1.0. Then the shape of the curve between the truncation points could not be reliably interpolated. Selection on the basis of phenotype, as in selective genotyping, will reduce the precision with which we estimate b (and therefore α) compared with genotyping the whole sample, the degree by which the precision is reduced being a function of the percentage of records genotyped and the size of the QTL effect.
To be useful in QTL detection, a measure of how well the model fits the data as well as estimates of the parameters is required. Standard logistic regression software generally provides a loglikelihood ratio test for whether b̂ is significantly different from zero. As
Multipletrait, no recombination model: The methods described above are easily extended to the situation where phenotypes are available on more than one trait. Let f_{1} and f_{2} in Equation 1 be multivariate normal distributions. If the sire mean is a vector μ, and the vector of half the average sire allele effects A, then the mean of animals in distribution f_{1} will be μ_{1} = μ + A, and the mean of animals in distribution f_{2} will be μ_{2} = μ – A. If the covariance matrix estimated from the data is R, and the covariance matrix within sire QTL genotype is V, then R = V + AA′. As in the singletrait model, V will contain both genetic and nongenetic components. Then,
Recombination model: Both the single and multipletrait methods described above assume that there is no recombination between the genotyped locus and the locus affecting the quantitative trait. In QTL detection experiments using markers, this assumption cannot be made. The indicator variable s_{i} is now unobservable. If we consider a single marker with a recombination rate r with the QTL, the multivariate logistic model becomes
A more common use of this model is where an estimate of α is required, given a map position relative to a number of markers, as in interval mapping. A method is therefore required to summarize the information from multiple markers into a form that can be used in Equation 8, in which we require a vector p and an associated vector r. As for maximumlikelihood methods, we can calculate the probability that animal i inherited the sire allele Q, on the basis of the observed marker transmission, the recombination rates between the postulated locus and the markers, and the assumed mapping function. Let this probability be q_{i}. Although q_{i} was estimated from multiple markers, we can proceed as if it had been estimated from a single marker and recover a value p_{i}, which will be either zero or one, and a value r_{i}, which will be <0.5. As
There are several numerical methods that can be used to estimate β given vectors p and r. In testing, both fitting Equation 8 using nonlinear least squares and iteratively maximizing the likelihood of Equation 8 appeared to work well with a single trait. However, as the focus of this article is on using standard statistical software for multipletrait analysis, an approximate method of interval mapping is described. There are two parts to the problem, the estimation of β and the evaluation of the loglikelihood ratio statistic. We have already shown how to estimate β at the markers, and provided that the markers are not too far apart, simple interpolation will provide sufficiently accurate estimates between the markers. Given an estimate of β, the evaluation of the loglikelihood is straightforward. For the logistic model, the loglikelihood takes the form
RESULTS OF SIMULATION STUDIES
To compare the logisticregression method to alternative methods and to generally examine its performance, various simulation studies were carried out. Phenotypic values were generated by adding randomly generated error terms to genotypic values. In all cases, a halfsib design was simulated with phenotypes available on 1000 halfsibs. All simulations were repeated 100 times, and mean estimates and significance levels were calculated.
Single trait, no recombination: A sire QTL effect and an error term were simulated, with the sire allele accounting for 0, 1, 4, and 25% of the total phenotypic variance. As the total phenotypic variance was 1.0, the magnitudes of the sire allele effects were 0.0, 0.1, 0.2, and 0.5. Three levels of selective genotyping were tested, with genotypes available on all animals (i.e., no selective genotyping), on 50% of animals, and on 10% of animals. Where selective genotyping was applied, genotypes for the animals with the highest and lowest phenotypes were made available.
The allele effect (α) was estimated with logistic regression (LR), with maximum likelihood (ML), and with the methods of Darvasi and Soller (1992; DS) and Muranty and Goffinet (1997; MG). Estimates of b for the LR method were obtained using SAS procedure LOGISTIC (SAS 1990). The response variable, or marker genotype, was coded 0.0 or 1.0, the independent variable was the phenotype, and a dummy variable n, for number of trials, was set to 1. Where markers were uninformative, or the animal was not genotyped, then that record was not included. However, all of the records were used to estimate the variance (σ^{2}), which is required to estimate α. For the ML analysis, the loglikelihood was maximized numerically using NAG subroutine E04JAF (NAG 1991). The likelihood used was
Three models were used to simulate the data. In the first model, the error term was normally distributed, and the selective genotyping was the only selection applied to the data. This would be the case if all markers were fully informative. This resulted in truncation selection, with equal proportions informative in each tail. Table 1 summarizes the results obtained. All four methods provide similar estimates and standard errors when QTL effects are small. It is only when the QTL effect is large that differences are observed between the methods. For a QTL effect of 0.5, with selective genotyping, methods DS and MG underestimate the QTL effect, and methods LR and ML overestimate the QTL effect.
To test the performance of the methods when the error is not normally distributed, errors were simulated from other distributions in the second model. Two distributions were used: a mixture distribution, composed of two normal distributions, with the difference in means responsible for a variance of 0.5, and a χ^{2}distribution with 4 d.f., scaled to produce a total variance of 1.0. These are similar to the distributions used by other researchers (e.g., Muranty and Goffinet 1997). The mixture distribution could occur because of the effect of other QTL segregating or because of failure to correctly account for fixed effects. The χ^{2}error produces a skewed distribution of phenotypes.
Table 2 contains the results for sire QTL allele effects of 0.1 and 0.2 and one result for a sire QTL allele effect of 0.5. When genotype records were available on all animals, the nonnormal error had little effect on the estimates of the QTL effect. With selective genotyping however, all methods had difficulties in estimating the QTL effect. When the error was from the mixture distribution, method DS produced estimates of QTL effect closest to those simulated, with little difference between the mean estimates produced by methods MG, LG, and ML. When the error term was simulated from the χ^{2}distribution, the effect of reducing the number of genotype records available appeared to be nonlinear. Genotyping most animals, or small QTL effects, resulted in overestimation of the QTL effect, while genotyping less animals, or large QTL effects, resulted in underestimation of the QTL effect. This pattern was consistent for all of the methods, with methods DS and LR requiring greater selection, or larger QTL effects, before the underestimation occurred.
Another departure from the model assumed by the estimation methods is when the selection is not truncation selection with fixed proportions of animals with genotype records in each tail. This might occur when the marker allele inherited from the sire cannot be determined for some animals, as occurs when the marker genotype of the animal is the same as that of sire, with no genotype available for the dam. If one of the sire's marker alleles is at a high frequency in the dam population, then one of the sire's marker alleles will be identified more often in the offspring than the other. The effect of this was tested in the third model, with 50% of genotyped markers assumed to be uninformative, but with one sire allele informative 90% of the time and the other sire allele informative only 10% of the time.
The DS method was not applied to this model because it requires the assumption of known, equal, selected proportions. Results from application of the MG method to data in which all animals were genotyped, are presented for the equation that assumed selection, as it performed better than the equation that did not assume selection. Table 3 contains the results obtained for methods MG, LR, and ML. All of the methods tested have problems with this model. Method MG underestimates the QTL effect, except when the proportion selected is ∼50%. For simulated values of α of 0.2 and 0.5, method LR overestimates the QTL effect when only 10% of animals are genotyped. These overestimates are accompanied by relatively high standard errors. Method ML overestimated the allele effect. This was less apparent when selective genotyping was applied.
Multiple trait, no recombination: A bivariate analysis was performed, where the two traits had covariance matrix
When all animals are genotyped, there is no difference between the allele effect estimates produced using singleor multipletrait analysis, and the MG and LR multipletrait methods produce identical results. When selective genotyping is applied, the singletrait analysis produces good estimates of α_{1}, the effect on the trait used to select animals for genotyping, but biased estimates of α_{2}, the effect on the correlated trait. The estimates of QTL effect from the multipletrait methods are much less biased than the singletrait analysis estimates, with the estimates from the LR method less biased than the estimates from the MG and BS methods. For smaller QTL effects there were no differences between the results produced by the multipletrait methods (results not shown).
Single trait, recombination: Markers and the QTL were simulated in the order
Interval mapping was carried out, using both ML and LR. For the ML analysis, the likelihood was maximized numerically using NAG subroutine E04JAF. The likelihood used was again (10), but with q_{i} calculated from the observed marker transmission for the nearest flanking markers and the locus being mapped. Haldane's mapping function was assumed. SAS procedure LOGISTIC was used to obtain the LR estimates of a and b at the markers and the loglikelihood ratio statistic between the markers calculated from interpolated values of a and b.
Figure 2 contains the loglikelihood ratio profiles obtained. When 100% of animals were genotyped, at the markers, there is almost no difference between the loglikelihood statistics produced by the two methods. Between the markers, the LR profile is slightly lower than that produced by ML. In addition to the ML and LR analyses, the data were analyzed using the regression method of Haley and Knott (1992). The resulting profile (not shown) was more similar to the ML profile than to the LR profile. When only 10% of animals were genotyped the loglikelihood statistics were less than when all animals were genotyped, but there was little difference between the ML and LR profiles. Regardless of whether selective genotyping was applied, it appears that provided that the distance between markers is not too great, and provided that most markers are informative, the profiles are for practical purposes equivalent. Having some markers not informative is equivalent to increasing the distance between markers for some animals, and the profiles produced by the two methods may differ. However, if significance thresholds are determined through permutation testing (Churchill and Doerge 1994), similar conclusions should be drawn despite the differences between the profiles.
Multiple trait, recombination: The marker and QTL locations simulated in the singletrait analysis were used for a bivariate analysis. The two traits had covariance matrix
DISCUSSION
It has been shown in this article that in halfsib families, multipletrait QTL detection is a simple matter of performing multivariate logistic regression. As no assumptions are made regarding selection among phenotypes, the method is useful for selectively genotyped data as well as for experiments where genotypes are known for all animals.
All methods for estimating QTL effects with missing genotype information require assumptions regarding the distribution of phenotypes within QTL genotype classes. The methods considered in this study all assumed normal distributions for phenotypes, and when the data were simulated according to this assumption, with equal proportions genotyped in each phenotypic tail, differences between the results produced by the methods were minor. The exception to this was when a large QTL was segregating and a small percentage of animals were genotyped. It is under these conditions that the assumptions regarding the distributions of phenotype within unknown genotype become most critical. However, although there is little power for any method to accurately estimate such a QTL effect, in practice this is not a major problem. A more important problem is that most QTL effects are too small to be significant. The situation with very large QTL is that we can tell that they are big, but we can't tell how big.
The robustness of the methods to departure from normal error is of interest. The performance of the methods appears to depend heavily on the distribution of the error, the percentage of animals genotyped, and the relative size of the QTL. When error was generated from a mixture distribution, the DS method produced less bias due to selective genotyping, but with a χ^{2}error the MG method and the ML method produced less bias. However, this was only the case for small QTL effects; for larger QTL effects the LR method produced less bias with high levels of selection. It appears that no method is “best” for all circumstances, but all perform reasonably well provided that at least 50% of animals are genotyped. If the data suggest that the distribution of phenotypes is not normal, then the use of data transformations might be considered.
If the selection is not simple truncation selection based on phenotype, then methods such as DS become difficult to implement, because of their use of the expected mean of a truncated distribution. The MG method can be applied to this type of data, but the estimates produced appear to be biased, even when all animals are genotyped. The LR method produces acceptable estimates of QTL effect except when QTL effects are large or selection extreme. The performance of the ML method for this type of data was unsatisfactory. With all animals genotyped the QTL effect was overestimated even for small QTL, but with less genotypic information available the estimates improved. This may be due to the ratio of QTL genotypes among the animals of unknown genotype. In the likelihood function used in the simulations (Equation 10), when genotype was unknown, a value of 0.5 was coded for the probability of inheriting each of the sire's QTL alleles, implying equal probability of either genotype. With all animals genotyped, almost all of the animals with unknown genotype carry the same QTL allele, but, when only some animals are genotyped, the distribution of QTL alleles in the animals of unknown genotype is more balanced. Probabilities for noninformative markers based on the allele frequencies in the dam population could be assigned to alleviate this problem; the other methods might also benefit from such data preparation.
The DS and MG methods estimate QTL effects as functions of means and variances of the data. As such they are quickly and easily computed. However, the cost of this computational simplicity is reduced generality. The DS method requires simple truncation selection. Two singletrait equations are provided to estimate QTL effects by the MG method, one applicable when all genotypes are available and one applicable when selection has taken place. When the selection is based on other than phenotype, as with noninformative marker information, it is not obvious which equation should be applied.
The LR method has more in common with the ML method computationally. In practice, the LR procedures in statistical packages may use ML to fit the logistic curve to the data. Alternatively another iterative method, such as weighted least squares, may be used. Therefore there may not be much advantage in computing time to using LR. However, LR is a common statistical procedure, and the algorithms for LR in the major statistical packages should be highly optimized. The major advantage of the LR method over ML is the availability and ease of use of appropriate software. For example, the commands required to estimate β and the loglikelihood for a twotrait model with SAS procedure LOGISTIC (SAS 1990) are
LR can be used for any experimental design producing markers relating to a mixture of two normal distributions. This would include backcross designs, where the difference between animals heterozygous for the QTL and animals homozygous for the QTL can be estimated, and granddaughter designs, which are essentially halfsib designs with repeated records. Where the genotyped animals are the F_{2} generation resulting from a cross between inbred lines, ML methods estimate effects for the two classes of homozygous animals and a dominance effect. As LR is a method for binary data, it is not possible to fit the full model for this type of data directly. However, it should be possible to estimate the differences between two QTL classes, for example, the difference between the homozygous classes or the difference between heterozygous animals and one homozygous class. In this case weights should be used, proportional to the probability that the animal has one of the QTL genotypes under consideration.
As with ML interval mapping, it is desirable to account for both linked and unlinked QTL in estimating QTL effects and locations. Iterative methods, such as that of Zeng (1994) should be adaptable to the LR method. If multipletrait LR is being performed, then the method is comparable with that of Jiang and Zeng (1995). Also, as for ML or regressioninterval mapping, permutation testing (Churchill and Doerge 1994) will provide significance thresholds that should account for any peculiarities in the data.
The results of the simulation studies presented here are no different from those of earlier studies. Jiang and Zeng (1995) and Korol et al. (1995) provide convincing arguments for multipletrait interval mapping. Muranty and Goffinet (1997) use a bivariate ML analysis as a benchmark against which to compare their approximation methods for multipletrait estimation under selective genotyping. What we have demonstrated here is that using LR, multipletrait QTL analysis becomes a straightforward application of standard statistical software. The method is applicable regardless of whether selective genotyping was applied. The achievable results are comparable to those obtained from the closely related ML methods, but without the complexity of multipletrait ML.
Acknowledgments
J. Henshall was in receipt of a supplementary stipend from the Cooperative Research Centre for Cattle and Beef Industry (Meat Quality) while undertaking this work.
Footnotes

Communicating editor: C. Haley
 Received April 14, 1998.
 Accepted October 30, 1998.
 Copyright © 1999 by the Genetics Society of America