Abstract
A mixture model approach is presented for the mapping of one or more quantitative trait loci (QTLs) in complex populations. In order to exploit the full power of complete linkage maps, the simultaneous likelihood of phenotype and a multilocus (all markers and putative QTLs) genotype is computed. Maximum likelihood estimation in our mixture models is implemented via an Expectation-Maximization algorithm: exact, stochastic or Monte Carlo EM by using a simple and flexible Gibbs sampler. Parameters include allele frequencies of markers and QTLs, discrete or normal effects of biallelic or multiallelic QTLs, and homogeneous or heterogeneous residual variances. As an illustration, a dairy cattle data set consisting of twenty half-sib families has been reanalyzed. We discuss the potential which our and other approaches have for realistic multiple-QTL analyses in complex populations.
The number of genes identified in humans, plants and animals has increased notably during the last decade. The largest increase in number of identified genes has occurred for qualitative single gene traits. In contrast, progress in mapping quantitative trait loci (QTLs) has been slow, except for species for which inbred lines are available. Human pedigrees are often complex and small, and analysis of pedigree data requires sophisticated statistical techniques, the development of which has become a bottleneck in QTL mapping (Guo and Thompson 1992). In outbred populations of animals or plants, this bottleneck is also real but less severe, because of the high reproduction rate and the option to design experiments. The main problems to be dealt with in the analysis of complex populations can be summarized as follows:
The number of alleles and the allele frequencies in the (base) population are unknown for QTLs as well as for marker loci;
If a parent is homozygous at a marker locus, it is impossible to trace which allele from a pair of parental homologous chromosomes has been transmitted to a descendant;
When two parents are heterozygous and carry the same alleles at a marker locus, the parental origin of alleles of a heterozygous descendant cannot be determined;
The genotype at a QTL cannot be observed, and we therefore do not know which parents are heterozygous for the QTL;
Markers may have been selectively genotyped for only a subset of the population;
Linkage phases between markers and between markers and QTLs may be unknown.
Markers are used to follow the inheritance of genome segments from parent to offspring (the pattern of identity-by-descent or IBD of alleles). If the IBD pattern at a certain marker (or QTL) locus is unknown, then neighboring markers may be informative, i.e., linked markers may indicate the likely IBD pattern at the locus under study (cf. Haley et al. 1994; Jansen 1996a; Knott et al. 1996). Phenotype also contains information on genotype, but this information is often ignored to simplify computations (see Modeling QTLs below). In general, all markers and phenotype should be used simultaneously so as to recover at each map position as much information as possible.
One can assume fixed or random effects models or mixed models for the relation between phenotype and “known” genotype. As stated above, the information about genotype can be incomplete for various reasons. We therefore enter the area of so-called mixture models, where the possible genotypic configurations are the components of the mixture. An important monograph on mixture models was written by Titterington et al. (1985). A popular statistical algorithm for handling mixture problems is the expectation-maximization (EM) algorithm. It is an iterative approach, it is relatively easy to program, and it produces maximum likelihood estimates and also empirical Bayesian a posteriori estimates (Dempster et al. 1977). Jansen (1992) and Jansen and Stam (1994) described a general and flexible EM algorithm for recovering information about a multilocus genotype in populations obtained from crosses between inbred lines. In a subsequent article, Jansen (1996a) developed a Monte Carlo method for multilocus analysis in a simple outbred cross between two plant cultivars. Here, we make our EM approach applicable to complex population structures in which additional dependencies between individuals may exist. We propose a stochastic EM algorithm and a Monte Carlo EM algorithm in which a Markov chain of possible genotypic configurations is generated via the Gibbs sampler. With large progeny groups, however, the chain may change genotype states only slowly or can even remain stuck in a certain “subspace” (Janss et al. 1995). To avoid these problems, we introduce a simple and flexible scheme, based on different descriptions of the genotype of founders and nonfounders of the population. Furthermore, we demonstrate how our EM approach can be used to fit models for single or multiple QTLs with fixed or random effects. Recently, data for paternal half-sib families of dairy cattle have been adopted for comparison of analytical approaches developed in the animal breeding community (Bovenhuis et al. 1998). As an illustration, we will analyze these data. We postulate several mixture models and fit them by using our Monte Carlo EM algorithm.
There is a growing need for sophisticated analytical tools to genetically dissect multigenic traits in complex populations. The theory on QTL mapping is developing very fast, and we therefore start with a section in which we briefly review and classify the various recent developments (Jansen 1996a; Knott et al. 1996; Satagopan et al. 1996; Satagopan and Yandell 1997; Thaller and Hoeschele 1996; Uimari et al. 1996a; Xu 1996; Grignola et al. 1997). In the discussion, we focus on the potential that our and other approaches have for realistic multiple-QTL analyses in complex populations.
MODELING QTLS
Recent analytical approaches can be classified according to three criteria: modeling the full mixture of possible phenotype-genotype combinations or not, assuming fixed or random QTL effects, and adopting the (restricted) maximum likelihood or Bayesian approach.
Consider an N-member population on which trait values and marker scores are observed. Let y_{i} denote the ith individual's trait value and let g_{i} denote its combined genotype at all marker loci and one or more putative QTLs. We write y = (y_{1} … y_{N})′ and g = (g_{1} … g_{N})′. The population may consist of multiple generations, and marker and trait data need not be observed for all individuals. For a given genotypic constitution g on the population, one can model the relation between y and the “known” genotype g by assuming a model with discrete and fixed QTL effects and normally distributed error. The distribution f(y|g) is then a multivariate normal distribution and the mean, μ(y|g), is modeled in terms of genetic parameters (θ) such as additivity and dominance of (multiple) QTL effects. On the other hand, one may prefer a random model in case of multiallelic QTLs. It is then often assumed that QTL effects are independent realizations from a normal distribution, which represents the distribution over many alleles in the base population. Now f(y|g) is a multivariate normal distribution with variance-covariance matrix v(y|g) expressed in terms of genetic parameters. Multiple QTLs, random or fixed family (polygenic) effects and additional (experimental) effects such as QTL-QTL interaction or QTL-environment interaction can be included, resulting in so-called mixed models. Other types of distribution can also be assumed in place of the commonly assumed normal distribution.
The genotype g includes full multilocus information about alleles and their IBD pattern, but unfortunately this information can be observed only partially. For each possible genotypic configuration g on the population (that is, a configuration which is consistent with the observed marker data), we can calculate a scalar probability P(g) of occurrence. P(g) is a function of (known or unknown) recombination and allele frequencies. The exact methods use mixture distributions to model the full relation between phenotype and possible genotypes: f(y) = Σ_{g} P(g)f(y|g), where summation is over all possible genotypes g. Jansen (1992, 1994, 1996a), Thaller and Hoeschele (1996), Uimari et al. (1996a), Xu (1996) and Satagopan et al. (1996, 1997) consider mixture models for QTL mapping. Jansen uses maximum likelihood for QTLs with discrete effects; Xu uses maximum likelihood for QTLs with normal effects; Thaller and Hoeschele, Uimari et al. and Satagopan et al. use Bayesian methods for QTLs with discrete effects.
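The mixture density can be illustrated for the simplest possible case. The sketch below assumes a single individual, two QTL genotypes and a common residual standard deviation; the function names and all numbers are hypothetical and only serve to show how f(y) = Σ_{g} P(g)f(y|g) is evaluated.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of a univariate normal distribution."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(y, genotype_probs, genotype_means, sd):
    """f(y) = sum over g of P(g) * f(y | g) for one individual."""
    return sum(p * normal_pdf(y, genotype_means[g], sd)
               for g, p in genotype_probs.items())

# Hypothetical example: two genotypes, equally likely given the markers.
probs = {"Qq": 0.5, "qq": 0.5}
means = {"Qq": 1.0, "qq": -1.0}
density_at_zero = mixture_density(0.0, probs, means, sd=1.0)
print(density_at_zero)
```

In a real pedigree the sum runs over joint configurations of all individuals, which is why the number of terms quickly becomes unmanageable.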
An exact mixture analysis can be computationally demanding, especially if the number of possible genotypes g is huge. Approximate expectation methods first calculate an expected trait mean μ(y) = Σ_{g} P(g)μ(y|g) if a discrete QTL-effects model is used, where summation is again over possible genotypes. In the normal QTL-effects models, an expected variance-covariance matrix v(y) = Σ_{g} P(g)v(y|g) is calculated. Next it is assumed that f(y) is normally distributed with mean μ(y) or variance-covariance matrix v(y). Knott et al. (1996) use the expectation method for discrete QTL-effects models and Grignola et al. (1997) use it for normal QTL-effects models. Zeng (1994) assumes discrete QTL effects and uses a combination of the mixture and the expectation method: the expectation method to deal with missing marker (cofactor) data and the mixture method for the putative QTL.
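The difference between the two treatments can be seen in a toy calculation. The sketch below is a simplification under stated assumptions: individuals are independent, the genotype probabilities are the same for everyone, and the residual standard deviation is held fixed rather than re-estimated. All names and numbers are hypothetical.

```python
import math

def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_loglik(ys, probs, means, sd):
    """Exact treatment: log of sum_g P(g) f(y | g), summed over individuals."""
    return sum(math.log(sum(p * normal_pdf(y, means[g], sd)
                            for g, p in probs.items()))
               for y in ys)

def expectation_loglik(ys, probs, means, sd):
    """Approximation: one normal density centred on sum_g P(g) mu(y | g)."""
    expected_mean = sum(p * means[g] for g, p in probs.items())
    return sum(math.log(normal_pdf(y, expected_mean, sd)) for y in ys)

# Hypothetical bimodal trait values and uninformative markers.
ys = [-1.1, -0.9, 1.0, 1.2]
probs = {"Qq": 0.5, "qq": 0.5}
means = {"Qq": 1.0, "qq": -1.0}
print(mixture_loglik(ys, probs, means, sd=0.5))
print(expectation_loglik(ys, probs, means, sd=0.5))
```

With uninformative markers the expected mean is the same for every individual, so the approximation cannot use the bimodality of the trait, whereas the mixture likelihood can.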
Phenotypes contain information on QTL genotypes. Moreover, if markers are linked to QTLs, phenotypes also contain information about incomplete marker genotypes. The exact methods take into account marker plus phenotype information to retrieve information. In contrast, in the approximate expectation methods, genotype probabilities are calculated on the basis of marker data only and this calculation is done only once, namely before QTL analysis.
MAXIMUM LIKELIHOOD IN MIXTURE MODELS VIA EM ALGORITHMS
Let θ denote the vector of all parameters for fixed and random model terms and for recombination and allele frequencies. In QTL mapping, the genetic map is usually assumed to be known (i.e., the recombination frequencies are known). Jansen (1992, 1996a) developed formulae for simple two-generation designs with f_{θ}(y,h) = Π f_{θ}(y_{i},h_{i}), where the product is over the members of the population and h = (h_{1} … h_{N})′ denotes the observed marker data. Here we consider a general population structure, in which case additional dependency of individuals can exist so that f_{θ}(y,h) cannot be expressed as a simple product of member likelihoods. To simplify notation, we will write f_{θ} again as f (and P_{θ} as P).
The simultaneous likelihood
Exact EM: The likelihood equations can be solved by applying an EM algorithm (Jansen 1992; Jansen and Stam 1994). Each iteration consists of two steps. First, in the so-called E-step, the conditional probability P(g|y,h) is evaluated for all possible genotypes g, given the current parameter estimates and given the observed incomplete information h on the genotype (using Bayes' theorem). Next, in the so-called M-step, the likelihood equations are solved by fixing the weights P(g|y,h), which gives updated parameter estimates. The likelihood equation can be split into two terms: the first term refers to the genetic linkage between loci, the second term to the relation between phenotype and complete genotype. Each term can be recognized as a likelihood equation for nonmixture problems that can be solved with standard statistical routines or packages for (weighted) regression or variance component models (see also Jansen 1992).
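The two steps can be sketched for the simple case of independent individuals. This is a minimal illustration, not the implementation used in the article: it assumes fixed prior genotype probabilities per individual (as if computed from markers), genotype-specific means and one common residual variance. All names and data are hypothetical.

```python
import math

def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def log_likelihood(ys, priors, means, sd):
    return sum(math.log(sum(prior[g] * normal_pdf(y, means[g], sd) for g in means))
               for y, prior in zip(ys, priors))

def em_step(ys, priors, means, sd):
    """One EM iteration; priors[i][g] is the marker-based probability of g."""
    genotypes = list(means)
    # E-step: posterior weights P(g | y_i, h_i) via Bayes' theorem.
    weights = []
    for y, prior in zip(ys, priors):
        joint = {g: prior[g] * normal_pdf(y, means[g], sd) for g in genotypes}
        total = sum(joint.values())
        weights.append({g: joint[g] / total for g in genotypes})
    # M-step: weighted least squares with the weights held fixed.
    new_means = {}
    for g in genotypes:
        wsum = sum(w[g] for w in weights)
        new_means[g] = sum(w[g] * y for w, y in zip(weights, ys)) / wsum
    ss = sum(w[g] * (y - new_means[g]) ** 2
             for w, y in zip(weights, ys) for g in genotypes)
    return new_means, math.sqrt(ss / len(ys)), weights

# Hypothetical data: four individuals with uninformative markers.
ys = [-1.2, -0.8, 0.9, 1.1]
priors = [{"Qq": 0.5, "qq": 0.5} for _ in ys]
means, sd = {"Qq": 0.5, "qq": -0.5}, 1.0
for _ in range(30):
    means, sd, weights = em_step(ys, priors, means, sd)
print(means, sd)
```

The M-step here is just a weighted mean and a weighted residual sum of squares, which is the sense in which standard regression routines can be reused.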
Stochastic EM: In each cycle of the EM algorithm, the likelihood equation can be estimated by using a single Monte Carlo realization
Monte Carlo EM: In each cycle of the EM algorithm, the likelihood equation can be estimated using a number (M) of Monte Carlo realizations
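The idea of replacing exact weights by sampled genotypes can be sketched as follows. This is an illustrative simplification, not the article's algorithm: individuals are assumed independent so that genotypes can be drawn directly from their conditional distribution (no Gibbs sampler is needed), and the M-step pools the imputed data sets. With `n_draws = 1` the sketch behaves like stochastic EM; larger values correspond to Monte Carlo EM. All names and data are hypothetical.

```python
import math
import random

def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def sample_genotypes(ys, priors, means, sd, rng):
    """Draw one genotype per individual from P(g | y_i, h_i)."""
    draw = []
    for y, prior in zip(ys, priors):
        genotypes = list(prior)
        w = [prior[g] * normal_pdf(y, means[g], sd) for g in genotypes]
        draw.append(rng.choices(genotypes, weights=w)[0])
    return draw

def mcem_step(ys, priors, means, sd, n_draws, rng):
    """M-step on n_draws imputed ("known") genotype configurations."""
    draws = [sample_genotypes(ys, priors, means, sd, rng) for _ in range(n_draws)]
    sums = {g: 0.0 for g in means}
    counts = {g: 0 for g in means}
    for draw in draws:
        for y, g in zip(ys, draw):
            sums[g] += y
            counts[g] += 1
    # Keep the old mean if a genotype was never sampled in this cycle.
    new_means = {g: sums[g] / counts[g] if counts[g] else means[g] for g in means}
    ss = sum((y - new_means[g]) ** 2 for draw in draws for y, g in zip(ys, draw))
    # Small floor as a numerical safeguard (an assumption of this sketch).
    new_sd = max(math.sqrt(ss / (n_draws * len(ys))), 1e-6)
    return new_means, new_sd

rng = random.Random(1)
ys = [-1.3, -1.0, -0.7, -1.1, 0.8, 1.2, 1.0, 0.9]
priors = [{"Qq": 0.5, "qq": 0.5} for _ in ys]
means, sd = {"Qq": 0.8, "qq": -0.8}, 1.0
for _ in range(30):
    means, sd = mcem_step(ys, priors, means, sd, n_draws=50, rng=rng)
print(means, sd)
```

Because each imputed data set has "known" genotypes, the M-step reduces to ordinary group means and a pooled residual variance.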
The Monte Carlo samples can also be used for likelihoodratio estimation in the final EM step. The likelihood ratio is estimated as
Gibbs sampler: Unfortunately there may be no direct feasible way to generate the Monte Carlo samples because of the huge number of possible genotypic states g in complex populations with many loci. A solution to this problem is to utilize the Gibbs sampler (cf. Guo and Thompson 1992; Janss et al. 1995). Jansen (1996a) considered a situation with multiple loci in a simple outbred cross between two plant cultivars and described a simple Gibbs sampler in which the offspring genotype is updated in a stepwise manner for only a single locus and a single individual at a time, while taking “for granted” the remaining part of the genotype. In this way, the number of possible genotypic states is small and sampling can easily be done (of course states have to be consistent with observed marker data).
In this article, we deal with general population structures, in which case there are serious problems if we specify genotype in terms of allelic constitution only. For instance, in a half-sib analysis, a parent (sire) with current QTL state a/b produces offspring carrying allele a or b, and a change of the parent's genotype to a/a or b/b is practically prohibited if the male has many offspring (Janss et al. 1995). To avoid this kind of problem, we now introduce different descriptions of the genotype of founders and nonfounders in the population. We specify the genotypic state of any founder by the alleles at each of its two homologues. We express the state of any nonfounder by IBD values indicating parental origin of its alleles. In the above half-sib example, an offspring of the sire a/b inherits the allele of either the first homologue of its parent (IBD = 1) or the second (IBD = 2). In a marker-QTL-marker situation, a sire may have current genotype aaa/bbb, i.e., aaa on homologue 1 and bbb on homologue 2. An offspring of this sire may have current genotype aab/ccc, with c alleles originating from the dam, and we will write this as 112/ccc by using an IBD indicator rather than the actual allele type (a or b). The same can be done in other types of populations (e.g., with more generations).
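The two descriptions can be written down as small data structures. The sketch below is only one possible encoding; the class and function names are hypothetical. It reproduces the marker-QTL-marker example: the sire is aaa/bbb and the offspring 112/ccc carries paternal alleles aab.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Founder:
    homologue1: List[str]   # allele at each locus on homologue 1
    homologue2: List[str]   # allele at each locus on homologue 2

@dataclass
class NonFounder:
    paternal_ibd: List[int]       # 1 or 2 per locus: which sire homologue was inherited
    maternal_alleles: List[str]   # alleles received from the (unobserved) dam

def paternal_alleles(sire: Founder, offspring: NonFounder) -> List[str]:
    """Translate IBD values back into actual allele types."""
    return [sire.homologue1[k] if ibd == 1 else sire.homologue2[k]
            for k, ibd in enumerate(offspring.paternal_ibd)]

sire = Founder(list("aaa"), list("bbb"))
son = NonFounder([1, 1, 2], list("ccc"))
print("".join(paternal_alleles(sire, son)))   # aab

# Changing the sire's QTL allele (middle locus) leaves the IBD pattern untouched.
sire.homologue1[1] = "b"
print("".join(paternal_alleles(sire, son)))   # abb
```

The second call shows the point of the notation: the founder's alleles can change while every offspring keeps its IBD pattern.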
We will now briefly consider the three steps used in our Gibbs sampler for updating the genotype of founders (their QTL states and linkage phases between loci) and the genotype of nonfounders (via IBD pattern).
Step 1: To take all possible QTL states of founders into account, one can sample allelic configurations QTL by QTL. For instance, consider a change of marker-QTL-marker genotype aaa/bbb of a sire to aaa/bab, aba/bab or aba/bbb without changing the “known” IBD pattern (e.g., offspring 112/ccc). One can calculate the corresponding conditional probabilities given “known” IBD pattern and given phenotypes, and next sample one of the four possible states.
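A sketch of such an update for one sire and one biallelic QTL is given below. It rests on assumptions that are not taken from the article: each QTL allele has an additive effect on the offspring mean, residuals are normal with a common standard deviation, the two sire alleles are drawn independently from the population allele frequencies, and dam contributions are ignored. All names and numbers are hypothetical.

```python
import math
import random
from itertools import product

def normal_pdf(y, mean, sd):
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def sire_qtl_state_probs(offspring, allele_freq, allele_effect, sd):
    """P(state | IBD, phenotypes) for the four states (allele on homologue 1, 2).

    offspring is a list of (ibd_at_qtl, phenotype) pairs with ibd in {1, 2}.
    """
    weights = {}
    for state in product(allele_freq, repeat=2):
        w = allele_freq[state[0]] * allele_freq[state[1]]
        for ibd, y in offspring:
            w *= normal_pdf(y, allele_effect[state[ibd - 1]], sd)
        weights[state] = w
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

# Hypothetical family: sons with IBD = 1 score high, sons with IBD = 2 score low.
offspring = [(1, 1.1), (1, 0.8), (2, -0.9), (2, -1.2)]
probs = sire_qtl_state_probs(offspring, {"a": 0.5, "b": 0.5},
                             {"a": 1.0, "b": -1.0}, sd=0.5)
rng = random.Random(0)
new_state = rng.choices(list(probs), weights=list(probs.values()))[0]
print(probs, new_state)
```

Because the IBD pattern is held fixed, the sire can move between heterozygous and homozygous states without any offspring genotype having to change first.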
Step 2: To take all possible linkage phases in the genotypes of founders into account, one can sample linkage phases interval by interval and founder by founder. For instance, consider a change of the linkage phase between the proximal and distal part of the chromosome at a certain interval for a certain founder. In the case of a phase switch, the distal part of its homologue 1 is attached to the proximal part of homologue 2 (i.e., becomes part of the new homologue 2) and vice versa. IBD values are used for the description of genotypes of nonfounders, and, in case of a phase switch, one should change the IBD values at the distal part of the chromosome accordingly (1 becomes 2 and vice versa). One can calculate the conditional probabilities for the two options “phase switch” and “no phase switch” and sample one of them.
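The bookkeeping of a phase switch can be sketched as follows. The function name and the list-based encoding are hypothetical; loci are indexed from the proximal end and `interval` is the index of the locus just before the switch point.

```python
def switch_phase(homologue1, homologue2, offspring_ibd, interval):
    """Swap the distal parts of a founder's homologues after locus `interval`
    and flip the IBD values of its offspring at the distal loci (1 <-> 2)."""
    k = interval + 1
    new_h1 = homologue1[:k] + homologue2[k:]
    new_h2 = homologue2[:k] + homologue1[k:]
    new_ibd = [ibd[:k] + [3 - v for v in ibd[k:]] for ibd in offspring_ibd]
    return new_h1, new_h2, new_ibd

def inherited(h1, h2, ibd):
    """Alleles an offspring received from this founder."""
    return [h1[j] if v == 1 else h2[j] for j, v in enumerate(ibd)]

h1, h2 = list("aaa"), list("bbb")
ibds = [[1, 1, 2], [2, 2, 2]]
n1, n2, new_ibds = switch_phase(h1, h2, ibds, interval=0)
print("".join(n1), "".join(n2), new_ibds)   # abb baa [[1, 2, 1], [2, 1, 1]]
```

The alleles transmitted to each offspring are unchanged by the switch; only the number of recombinations implied in the switched interval differs, which is what drives the two conditional probabilities.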
Step 3: To generate genotypes of nonfounders, one can sample a new IBD pattern given “known” genotype of founders. This can be done individual by individual and locus by locus. If we update the IBD at a certain marker locus, then the two flanking loci (with “known” IBD!) are fully informative and no other loci are needed (no matter whether the two flanking loci are markers and/or QTLs). If we update the IBD at a putative QTL, the update step also depends on the expected phenotype (i.e., ‘fitted values’) given “known” genotype at other putative QTLs. As stated above, only genotypic states consistent with observed marker data are allowed.
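The marker update in this step can be sketched with the two flanking loci only. The sketch assumes that recombination events in adjacent intervals are independent (no interference), which is an assumption of this illustration; `allowed` stands for the IBD values consistent with the observed marker data, and the optional `phenotype_weight` stands for the phenotype density that enters when the locus is a putative QTL. All names are hypothetical.

```python
def ibd_conditional(left, right, r_left, r_right, allowed=(1, 2),
                    phenotype_weight=None):
    """P(IBD = v | flanking IBD values) for v in `allowed`.

    left / right: IBD values (1 or 2) at the flanking loci, or None at a
    chromosome end; r_left / r_right: recombination frequencies to them.
    """
    weights = {}
    for v in allowed:
        w = 1.0
        if left is not None:
            w *= (1.0 - r_left) if v == left else r_left
        if right is not None:
            w *= (1.0 - r_right) if v == right else r_right
        if phenotype_weight is not None:
            w *= phenotype_weight[v]   # e.g. f(y | IBD = v, other QTLs)
        weights[v] = w
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}

print(ibd_conditional(1, 1, 0.1, 0.1))
print(ibd_conditional(1, 2, 0.1, 0.1))
```

A new IBD value would then be drawn from the returned probabilities, for example with `random.choices`.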
Our notation has two important advantages. First, one can now change the genotype of founders independently of the IBD pattern and vice versa, thereby avoiding the problems discussed by Janss et al. (1995). Second, an IBD pattern is generated at all loci and therefore IBD is “known” in the Gibbs sampling process even if a parent is homozygous for a marker or if both parents and offspring are heterozygous carrying the same alleles at a certain marker. Computer implementation is then rather straightforward (using three-locus information instead of multilocus information), not only for the single-QTL models, but also for the more powerful multiple-QTL models. However, if two (or more) loci are closely linked, it may be more efficient to update the genotype at these loci together (“in blocks”) to reduce autocorrelation in the Gibbs sampler, at the cost of more computer programming (Janss et al. 1995). In our application, we blocked marker and QTL when fitting a QTL close to or on top of the marker.
APPLICATION
In this section, our focus will be on the dairy cattle experiment (Bovenhuis et al. 1998). Data were available for 20 sires and their families of half-sib sons. Nine molecular markers on chromosome 6 were scored in sires and sons, and marker alleles were encoded within families (code a and b for alleles of paternal type and code c for maternal alleles that differ from the sire alleles). Protein percentage data were available for sons only (obtained as averages for milk production data of daughters of the sons). We refer to Spelman et al. (1996) for more information on the experiment. It should be noted that we analyzed the data as they were released to the animal breeding community. Spelman et al. (1996) analyzed slightly different data (corrected for some suspicious data).
Many different models can be formulated by making (combinations of) assumptions about the number of genes involved (monogenic, oligogenic or polygenic inheritance), about the number of alleles per QTL (biallelic or multiallelic), about the allelic effects (fixed or random model terms), about interaction effects (QTL-family, QTL-QTL), about residual variance (homogeneity of residual variance over families or heterogeneity), about the effect of the dam (ignored or included in the model), or about linkage phases between loci (unknown, or fixed at a likely configuration). As an illustration, some of the proposed models will be applied to the dairy cattle data (see Table 1 for a description), and our results are compared with those published by Spelman et al. (1996).
We are dealing with a “serious” mixture problem, since there are several sources of missing information: each marker is uninformative (homozygous) for several families; in a number of cases, it cannot be assessed which of the two marker alleles of an offspring originates from the sire; all QTL scores and some marker scores of offspring are missing; all marker scores of dams are missing; marker and QTL allele frequencies are unknown; and linkage phases of all loci are unknown. Clearly the total number of possible configurations g consistent with observed marker data is huge, making an exact analysis demanding. We assume that the recombination frequencies are known (fixed genetic map). Let y_{ij} be the trait value of the jth son of the ith sire.
Model I: Spelman et al. (1996) used the expectation method developed by Knott et al. (1994). To simplify the computational work, the most likely linkage phase was determined and taken for each sire, and when different phases were equally likely, one was chosen at random. Effects of dams were ignored. Information on marker allele frequencies was not used. In the approximate method, P(g) is calculated on the basis of marker data only and this calculation is done only once, namely before QTL analysis. At the map position under study their expectation QTL model reads
Model II: A mixture model for a multiallelic QTL is considered (similar to Spelman's model I). At the map position under study the model for phenotype given “known” genotype reads
Model III: As model II, but now we assume heterogeneous residual variance over families (that is, a separate variance parameter per family was used). As an example of fitting multiple QTLs, we also extended model III and fitted two QTLs simultaneously.
Model IV: Models I–III are multiallelic-QTL models, and dam contributions were ignored. Now we consider a biallelic QTL and we also include the (unobserved) dam contributions. The estimate of the polygenic effect of a sire is affected by the QTL genotype being considered for that sire (Knott et al. 1992). To deal with that we use μ_{QQ}, μ_{Qq} and μ_{qq} instead of μ. If the ith sire has genotype QQ, the model for phenotype given “known” genotype reads
QTL likelihoods: Figure 1 shows the four QTL likelihood plots for models I–IV. At each map position, the value of the test statistic is plotted for the comparison of the two models with and without a QTL at the given map position. The solid curve of model I is obtained by converting the F values reported by Spelman et al. (1996) into likelihood ratio values (likelihood ratio test ≈ pF, where p = 20 is the d.f. for the test; see Haley and Knott 1992). As in model I, the tests for a multiallelic QTL in models II and III have 20 d.f. In contrast, the test for the biallelic QTL in model IV has ~2 d.f.: one for the QTL effect (assuming additivity) and one for the frequency of the QTL allele.
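The conversion and the role of the degrees of freedom can be illustrated numerically. The sketch below assumes that the likelihood ratio statistic is referred to a chi-square distribution, which is an assumption of this illustration rather than a statement about how the significance levels in this article were obtained. For an even number of d.f. the chi-square tail probability has a closed form, so no external library is needed. Function names and the example values are hypothetical.

```python
import math

def lr_from_f(f_value, df):
    """Approximate likelihood ratio statistic from an F value: LR ~ p * F."""
    return df * f_value

def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-square variable with an even d.f.:
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!"""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

lr = lr_from_f(2.0, 20)             # hypothetical F value of 2.0 with p = 20
print(lr, chi2_sf_even_df(lr, 20))  # same statistic judged with 20 d.f. ...
print(chi2_sf_even_df(lr, 2))       # ... and with 2 d.f.
```

The same value of the test statistic is far more extreme with 2 d.f. than with 20 d.f., which is why the curves of models II and III cannot be compared directly with that of model IV.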
To obtain empirical critical values, Spelman et al. (1996) analyzed original marker data with randomly permuted trait values over many permutations. A QTL for protein percentage was detected near marker two with a singletest significance value of 0.01% and an experimentwise significance value of 1%. The evidence was mainly coming from two families (families 1 and 16).
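A permutation threshold of this kind can be sketched generically. The code below is an illustration only: `stat_fn` is a hypothetical user-supplied function that returns the largest test statistic over all map positions for a given vector of trait values, with the marker data held fixed. Whether trait values are permuted within or across families is not specified here; the sketch permutes the whole vector.

```python
import math
import random

def permutation_threshold(trait, stat_fn, n_perm, alpha, rng):
    """Empirical (1 - alpha) quantile of the maximum test statistic
    under random permutation of trait values."""
    shuffled = list(trait)
    null_stats = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_stats.append(stat_fn(shuffled))
    null_stats.sort()
    index = min(n_perm - 1, int(math.ceil((1.0 - alpha) * n_perm)) - 1)
    return null_stats[index]

# Hypothetical single-marker example: absolute difference in class means.
marker = [1] * 10 + [2] * 10
trait = [float(i) for i in range(20)]   # perfectly separated by marker class

def stat_fn(values):
    g1 = [v for v, m in zip(values, marker) if m == 1]
    g2 = [v for v, m in zip(values, marker) if m == 2]
    return abs(sum(g1) / len(g1) - sum(g2) / len(g2))

observed = stat_fn(trait)
threshold = permutation_threshold(trait, stat_fn, n_perm=200, alpha=0.05,
                                  rng=random.Random(0))
print(observed, threshold)
```

Taking the maximum over map positions inside `stat_fn` is what turns a single-test threshold into an experimentwise one.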
Parameter estimation for models II–IV was implemented via Monte Carlo EM, using 1000 Gibbs cycles per EM iteration and using the genotypic state in each tenth cycle as a Monte Carlo realization. In the final EM iteration, QTL likelihood was evaluated at marker positions by running 25,000 Gibbs cycles, using every 20th cycle for Monte Carlo evaluation of the likelihood ratio, and using <20 intermediate models spanning the range between the model with the QTL and that without a QTL. Running 25,000 Gibbs cycles for model III took ~15 min CPU time on a DEC AlphaServer 2100 at 275 MHz.
Figure 1 clearly shows that all curves peak in the first marker bracket, that is, there is similarity of the four QTL likelihood curves in the region between markers one and four. In contrast, large differences between the curves appear in the region between markers six and nine. We will propose several explanations for these dissimilarities below.
Comparison of models I and II: Note that these models assume homogeneity of residual variance, although variances differ significantly between families (not shown). The significant heterogeneity of residual variance is (partially) due to (non)segregation of the putative QTL near marker two. The QTL likelihood curves differ in particular at marker nine. This marker is loosely linked to the other markers and also uninformative for families with the largest values of total variance (families 1 and 16 are among them; data not shown). Under model II, with its faulty assumption of homogeneity of residual variance, we can fit a mixture to those families to reduce their within-family residual variance to the average residual variance; this significantly improves the fit to the data and explains the high value of the test statistic at marker nine under model II.
Comparison of models II and III: In model II, we assume homogeneity of residual variance, whereas in model III we allow for heterogeneity of residual variance. Clearly, under model III, the high QTL likelihood at marker nine has disappeared. Analysis by models II and III demonstrates that the assumption of homogeneous variance is not appropriate when fitting a QTL at marker nine. Between markers one and four the QTL likelihood is much higher under model II than under model III (singletest significance levels of 0.0003 and 0.04%, respectively). Under model III, a larger total variance for a certain family can be met by a larger estimate of residual variance, and therefore the evidence for QTL activity will now originate only from differences between the means of genotype classes at a marker. Under model II, reduction of residual variance (ultimately to the average residual variance) will increase the test statistic value. Therefore QTL likelihood is expected to be higher under model II than under model III if a QTL is segregating near marker two. However, differences between total variances can partially originate from segregation of different sets of QTLs at other parts of the genome and, if that is the case, a certain degree of artificial inflation of QTL likelihood is expected under models I and II (although model I is probably more robust).
QTL activity near marker eight is suggested by model III. We extended model III and fitted a two-QTL model with QTLs at markers two and eight (we have chosen marker two, because markers one and three are uninformative for families 1 and 16). We compared this model with the single-QTL model with a QTL at marker two only. This (conditional) likelihood ratio test for QTL activity near marker eight was significant at a 0.2% single-test significance level. It is interesting to note that this region is known to contain multiple casein loci that affect protein percentage (Bovenhuis et al. 1992).
Comparison of models II–IV: Models II and III assume a multiallelic QTL but ignore the dam contribution. Model IV assumes a biallelic QTL and takes the dam contribution into account. It is somewhat difficult to compare the QTL likelihood curves because of the difference in degrees of freedom involved in the two tests: 20 in models II and III and only two in model IV. The peak in the first marker bracket is more significant under model IV than under model II or III (single-test significance levels of <0.00001, 0.0003 and 0.04%, respectively). Similar power for models I–IV is expected if the true situation is indeed a biallelic-QTL configuration with small QTL effect (Knott et al. 1996). The models indicate a large QTL effect (more than one genetic standard deviation) (Spelman et al. 1996), so that power will be improved significantly by taking the dam contribution into account (model IV). Combining these results, evidence is provided for the presence of a biallelic QTL near marker two. In contrast, models II, III and IV clearly differ for QTL likelihood near marker eight, and we conclude that putative QTL activity in this region cannot be explained by the presence of a single biallelic QTL. The true situation may consist of a multiallelic QTL or cluster of biallelic QTLs. It is known that several casein loci are clustered in this region of chromosome 6 (Bovenhuis et al. 1992).
Comparison of models IV and I: Spelman et al. (1996) used model I and reported that the QTL near marker two affects protein percentage in families 1 and 16. The results from our analysis, using model IV, are in agreement with the previous results: the sire is heterozygous in all Gibbs cycles for family 16 and in ~95% of the Gibbs cycles for family 1 (the conditional probability of being heterozygous is high). Under model IV, the other 18 families are homozygous for the QTL in most Gibbs cycles, but they still follow a mixture distribution: each son inherits one of the two QTL alleles from his dam (equivalent to standard segregation analysis within families).
DISCUSSION
Currently, single-QTL methods are still widely used in plant, animal and human genetics, but they are intrinsically inappropriate for complex traits affected by multiple QTLs. In experimental plant applications, multiple-QTL models (MQM) are now used more and more frequently; background QTLs are taken into account by including them (via linked markers) as cofactors in the model (proposed by Jansen 1992; see Jansen 1996b for a review). This can be done in plants because complete marker maps are available for many plant species and also because experimental plant populations, e.g., F_{2} or BC, are easier to deal with from the analytical point of view. In animal and human applications, the effects of background QTLs are often modeled by a single variance component term because complete marker maps are not (yet) available for livestock or human populations. In general populations, a marker can be segregating in some families whereas the QTL is not and vice versa. Then we cannot use a marker linked to a putative QTL as the cofactor in the (expectation or mixture) model, as also indicated by e.g., Spelman et al. (1996). In such cases, we should really include the QTL instead of the marker as cofactor in the model, although we can put the putative QTL close to or even on top of a marker. Eventually dense marker maps may become available in human and animal applications, and, with cofactors for background QTLs, it may not be necessary to include other parameters for genetic background control. Modeling via cofactors will also make it possible to explain differences in variances between families originating from segregation of different sets of genes, and therefore residual variance may be assumed to be homogeneous; this cannot be achieved by a model term for polygenic background effect. We believe that our mixture model approach via stochastic or Monte Carlo EM brings MQM mapping in complex populations within reach.
Moreover, our approach uses data imputation via the Gibbs sampler (to generate one possible genotype in stochastic EM and multiple genotypes in Monte Carlo EM) and with these “known” genotypes standard software routines for linear regression, variance component or mixed models can be applied. Our Gibbs sampler is implemented in an easy locus-by-locus and individual-by-individual manner. In particular, the stochastic EM algorithm is relatively easy to program. We have mainly used stochastic EM to provide starting values for Monte Carlo EM. More research has to be done to compare the efficiency of stochastic and Monte Carlo EM in various situations.
A Bayesian approach developed by Satagopan et al. (1997) offers an alternative to the MQM mapping approach. These authors assume a Poisson prior distribution for the unknown number of QTLs with discrete effects. By using recent MCMC techniques, the “birth” or “death” of a QTL can be sampled to have great flexibility with respect to the number of QTLs in the model. Other groups now work on similar approaches (I. Hoeschele, personal communication; M. Sillanpää, personal communication).
We expect that ML or Bayesian approaches for multiple QTLs with discrete effects are computationally manageable in complex populations. In contrast, in the case of multiple QTLs with normal effects, the computation of multiple variance components may already be much more intensive for three or more QTLs. Therefore, practical computational considerations may prevent the use of variance component models, although multiallelic QTLs may exist, and drawing inferences about multiallelic QTL variance via normal QTL-effects models would be the natural way to characterize genetic variation in the (base) population.
Although the structure of a population may be very complex, a simplified analysis may often be possible. This can be done by either focusing on a well-designed and simple subset of the entire population or by relaxing assumptions and ignoring possible sources of (genetic) variation. For instance, with multiple families, one can estimate allele contrasts for the parents of the families without considering their relationships; one can ignore full-sib relationships within families and perform half-sib analyses for males and females separately; one can select the most likely linkage phases in parents and ignore other configurations; etc. One can then first use an approximate (expectation) method that is computationally inexpensive (Knott et al. 1996; Grignola et al. 1997) and apply the data simulation method (“parametric bootstrapping”) (Jansen 1994) or permutation method (Churchill and Doerge 1994) to obtain genomewide significance thresholds for QTL detection. In this way, the entire genome would be screened relatively fast to pinpoint regions for further investigation by exact methods that need more computer time. Knott's approximate method uses one step of regression (least squares) analysis at each map location, whereas Jansen's exact method uses multiple cycles of regression analysis (iteratively reweighted least squares). The exact approach is computationally more demanding than the approximate approach, but the difference may be a matter of only seconds if markers are highly informative. Moreover, the power and efficiency of the methods will then be similar. In more complex situations, however, we expect the exact (mixture) approaches to be more powerful and efficient than the approximate (expectation) methods at the cost of more computation.
Particularly when markers are not fully informative, when individuals are selectively genotyped, when QTLs with large(r) effects are present, or when population structure is complex and much information is lost by simplification, the power and precision can increase considerably by the exact mixture approaches.
As indicated by our analysis of the cattle data set, it can be useful to compare various models with rather different assumptions, such as biallelic vs. multiallelic QTLs or homogeneous vs. heterogeneous residual variance over families. Our analysis suggested the presence of a biallelic QTL near marker two of chromosome 6, and the presence of a cluster of biallelic QTLs or a multiallelic QTL in the region of known casein genes near marker eight. We also demonstrated the pitfall of detecting ghost QTLs when erroneously assuming homogeneity of residual variance. Spelman's, Uimari's and our analyses produce slightly different but still consistent results (Spelman et al. 1996; Uimari et al. 1996b). This may not be too surprising because multilocus information for paternal inheritance was relatively high for markers 2–8 (see Figure 5 in Spelman et al. 1996). The data set was adopted by the animal breeding community to stimulate the development and comparison of (recent) analytical approaches to QTL mapping in complex situations. These cattle data have motivated our study, but our methods can handle more complex situations. To investigate properties of the new methodology in a more thorough way, simulation studies are currently being carried out. For instance, we now study the performance of our mixture approach in the presence of selective genotyping. When marker scores are missing, we sample possible allelic configurations by using the Gibbs sampler, as for the case of the unknown QTL. Preliminary results indicate that the estimates of QTL effects are not biased by selection.
Acknowledgments
We are grateful to Livestock Improvement Corporation, Holland Genetics and the Department of Genetics of the University of Liege for data access.
Footnotes

Communicating editor: P. D. Keightley
 Received March 3, 1997.
 Accepted September 17, 1997.
 Copyright © 1998 by the Genetics Society of America