Quantitative Trait Loci Mapping in F2 Crosses Between Outbred Lines
 Miguel Pérez-Enciso and
 Luis Varona
 Corresponding author: Miguel Pérez-Enciso, Station d'Amélioration Génétique des Animaux, INRA, BP 27, 31326 Castanet-Tolosan Cedex, France. Email: mperez{at}toulouse.inra.fr
Abstract
We develop a mixed-model approach for QTL analysis in crosses between outbred lines that allows for QTL segregation within lines as well as for differences in mean QTL effects between lines. We also propose a method called “segment mapping” that is based on partitioning the genome into a series of segments. The expected change in mean according to percentage of breed origin, together with the genetic variance associated with each segment, is estimated using maximum likelihood. The method also allows the estimation of differences in additive variances between the parental lines. Completely fixed, random, and mixed models together with segment mapping are compared via simulation. The segment mapping and mixed-model behaviors are similar to those of classical methods, either the fixed or random models, under simple genetic models (a single QTL with alternative alleles fixed in each line), whereas they provide less biased estimates and have higher power than fixed or random models in more complex situations, i.e., when the QTL are segregating within the parental lines. The segment mapping approach is particularly useful for determining which chromosome regions are likely to contain QTL when these are linked.
QUANTITATIVE traits arise from the joint action of the environment and multiple genes, usually called quantitative trait loci (QTL). The wide availability of DNA markers scattered along the genome, together with recently developed statistical methods, has spurred the massive search for QTL in any species of interest. Crosses between highly divergent lines are a powerful experimental design for this purpose (Lynch and Walsh 1998). The optimum situation in an F_{2} design occurs when all genes affecting the trait of interest are diallelic with the alternative alleles fixed in each parental line. Although in annual plant species and some lab animals highly inbred lines that may fulfill this condition have been developed, outbred parental populations are normally the only genetic material available in domestic animals (e.g., Andersson et al. 1994) or trees (e.g., Grattapaglia et al. 1995), as well as in allogamous wild species (e.g., Hunt et al. 1998). The QTL analysis of crosses between outbred populations poses two main statistical problems (reviews in Bovenhuis et al. 1997; Hoeschele et al. 1997; Elsen et al. 1999). The first one concerns the validity of the genetic model assumed in the analysis. The second one is related to accounting for the variation in the rest of the genome when fitting a QTL model at a particular position.
The usual model for analyzing F_{2} crosses (Lander and Botstein 1989; Haley and Knott 1992) is based on estimating the QTL effect from the phenotypic differences between individuals according to the estimated percentage of breed origin at a given position, assuming that alternative alleles are fixed in each parental line. We call this model the fixed model. Yet, the fact that heritability for a given trait is nonzero, as in most outbred lines, implies that there exists additive variation within lines and thus not all alleles affecting the trait can be fixed. There are also methods that allow for QTL segregation where the QTL effect is modeled as a normally distributed random variable with mean zero and variance to be estimated. This is the random model. The random model strategy has been put forward by several authors in the context of the analysis of outbred populations (Fernando and Grossman 1989; Goldgar 1990; Xu and Atchley 1995; Grignola et al. 1996). The QTL variance is estimated by assessing the degree of phenotypic similarity between relatives according to the probability of sharing alleles identical by descent at specified positions. But the random model does not seem appropriate for the analysis of F_{2} crosses because no particular distinction is made between allele breed origin in current implementations. A strategy similar to the random model is the within-family analysis, where each family (e.g., descendants of each sire) is analyzed separately and the results pooled (e.g., Knott et al. 1996). However, this approach will tend to have small power when the family size and the QTL effect decrease.
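The fixed model described above can be illustrated with a minimal regression sketch. It is not the authors' implementation: the function name `fit_fixed_model`, the coding of breed origin as the expected fraction of breed A alleles at the test position (0, 0.5, or 1 with fully informative markers), and the toy data are all assumptions made for illustration. The likelihood ratio compares the regression with a mean-only model under normal errors.

```python
import math
import random


def fit_fixed_model(phenotypes, breed_origin):
    """Least-squares regression of phenotype on breed-origin fraction.

    Returns (intercept, slope, likelihood_ratio). The likelihood ratio
    compares the regression with a mean-only model, assuming normal errors.
    """
    n = len(phenotypes)
    mean_x = sum(breed_origin) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in breed_origin)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(breed_origin, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sums of squares under the null (mean only) and the regression.
    rss_null = sum((y - mean_y) ** 2 for y in phenotypes)
    rss_full = sum((y - intercept - slope * x) ** 2
                   for x, y in zip(breed_origin, phenotypes))
    likelihood_ratio = n * math.log(rss_null / rss_full)
    return intercept, slope, likelihood_ratio


# Hypothetical toy data: 400 F2 individuals, alternative alleles fixed in
# each line, so breed origin at the QTL is 0, 0.5, or 1 in 1:2:1 proportions.
rng = random.Random(1)
origin = rng.choices([0.0, 0.5, 1.0], weights=[1, 2, 1], k=400)
y = [10.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in origin]

intercept, slope, lr = fit_fixed_model(y, origin)
print(f"estimated breed difference = {slope:.2f}, LR = {lr:.1f}")
```

With this coding the slope estimates the difference between the two homozygous classes. The sketch assumes no within-line segregation, which is exactly the assumption that the mixed model relaxes.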
A mixed-model approach that accounts for variation both between and within lines is thus the most appropriate strategy for analyzing F_{2} crosses between outbred lines. Goddard (1992) proposed a QTL mixed-model strategy for genetic evaluation that can potentially be applied to crosses between outbred lines, but marker information is used only to model covariances between QTL effects, not means, and the method does not account for differences in means and heritabilities between breeds in the genetic covariance matrix of crossed individuals; it is also assumed that marker phases are known in constructing the relationship matrix. Lo et al. (1993) developed the covariance between relatives in crosses between outbred populations for a number of unlinked loci and without marker information, whereas Wang et al. (1998) studied the case of a single marker and a QTL in a genetic evaluation context.
The problem of accounting for the genetic variation in the rest of the genome has been addressed by proposing the use of cofactors (“composite interval mapping”; Jansen 1993; Zeng 1993), but it would be desirable to have a methodology that addresses the issue more generally. Other authors have included a polygenic effect in addition to the fixed QTL effect (e.g., Fernando and Grossman 1989), but this does not allow for the fact that not all the genome contributes equally to the genetic variation and implies that this polygenic component is unlinked to the QTL of interest.
In this work we derive the genetic covariance matrix in crosses between outbred lines allowing for any number of linked markers and QTL, thus permitting a general QTL analysis of F_{2} crosses. This mixed model allows for more flexible genetic models than current strategies. We also propose a method, “segment mapping,” aimed at accounting for the variation in the whole genome simultaneously. The method also allows us to test genetic variance differences between breeds. A simulation study is carried out to compare the performance of segment mapping and mixed model mapping with classical methods, i.e., a genome scan using fixed or random models.
THEORY
The breeding value of an individual is, by definition, twice the average performance of an infinite number of its offspring when mated to a random sample of spouses from the same population. The starting point is the assumption that the breeding values (g) of two outbred populations A and B are normally distributed
The goal of the approach presented here is to estimate, conditional on marker information, the contribution of each segment to total genetic variance/covariance between the F_{2} individuals and to ascertain the expected phenotypic mean of individuals according to the percentage of breed origin in each particular segment. A reasonable strategy would be to include loci of similar effect in the same segment but the theory developed is valid for any partition strategy.
Assume that trait performance has been recorded in an F_{2} cross population derived from breeds A and B and that parental, F_{1}, and F_{2} individuals have been genotyped for a series of markers. A general explanatory model of the F_{2} records is
The model in (1) and (2) together with (3) and (4) provides the general framework to analyze F_{2} populations using standard mixed-model theory and molecular markers. These equations account for the fact that the average effect of alleles can differ between breeds, but also that QTL can simultaneously segregate within breeds. The average difference in allelic effects between both breeds is included as a fixed effect through QΔ, whereas the additional variation within breeds is allowed through G. The usual genome scan/regression strategy means that model (1) is fitted with an infinitesimally small segment (= 1 QTL) in successive positions assuming
Molecular information is used to calculate
Parameter estimates of b, Δ_{s},
SIMULATION
We carried out a simulation study to test the performance of segment mapping and mixed-model scan vs. standard strategies. The F_{2} pedigree consisted of 5 parental sires from breed A, each mated to 2 dams of breed B that produced 5 F_{1} sires (1 per parental sire) and 40 F_{1} dams (4 per parental dam). The number of F_{2} offspring was 400. A 60-cM chromosome was simulated, and completely informative markers (i.e., each line had different marker alleles and as many alleles as founder individuals were generated) were located at positions 0, 20, 40, and 60 cM. Three genetic scenarios as depicted in Figure 1 were considered. A single telomeric locus explained all genetic differences between lines in scenario 1, and there were two telomeric loci at positions 0 and 60 cM in scenario 2. In scenario 3 there were two spaced clusters of 20 genes each, and the loci were of equal effect located every centimorgan in positions 1–20 and 41–60 cM. Three distinct cases were studied in scenario 1. First (case a)
Four methods of analysis were compared:
Segment mapping: The chromosome was divided into two segments, a 10-cM segment (genetic scenarios 1 and 2) or a 20-cM segment (scenario 3) and a segment comprising the rest of the chromosome. The model was
Mixed model: The point model was
Random model: The point model was
Fixed model: The point model was
The relationship matrices and p_{s} were obtained after 1000 iterates of the Gibbs sampling scheme. The parameters were estimated in all cases by maximum likelihood using a Simplex algorithm. At each genome partition (segment mapping) or interval position (mixed, random, and fixed models), the likelihood ratio (LR_{0}) comparing models (5), (6), (7), or (8) vs. y = μ + e was computed. In addition, the segment mapping model (5) was compared vs. model
RESULTS
Table 1 shows the statistics corresponding to the empirical (simulated) distributions of the different likelihood ratios. The distributions analyzed were those corresponding to the maximum LR at each scan or at each chromosome partition. They are not far from the theoretical asymptotic values. As expected, the mean and variance tend to increase with the degrees of freedom and, in fact, the empirical threshold is sometimes less conservative than the theoretical chi-square figure P(χ^{2} > x_{0.05}) > 0.05. Figure 2 shows the empirical cumulative distribution functions (CDFs) together with their chi-square counterparts. We can conclude, as did Knott and Haley (1992), that, for all practical purposes, the chi-square distribution is a valid approximation in this instance.
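The comparison between an empirical threshold and its chi-square counterpart can be sketched as follows. This is only an illustration under stated assumptions: the helper names `chi2_upper_tail` and `empirical_threshold` are hypothetical, the null likelihood ratios are toy draws (squared standard normal deviates, i.e., one degree of freedom) and not the simulated scans of this study, and only the closed-form chi-square tails for 1 and 2 degrees of freedom are implemented.

```python
import math
import random


def chi2_upper_tail(x, df):
    """P(chi-square > x) for df = 1 or 2, using closed-form expressions."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or 2 are supported in this sketch")


def empirical_threshold(lr_values, alpha=0.05):
    """Smallest observed LR that at least a (1 - alpha) fraction does not exceed."""
    ordered = sorted(lr_values)
    index = math.ceil((1.0 - alpha) * len(ordered)) - 1
    return ordered[index]


# Toy null distribution (assumption): 2000 replicates of a 1-df statistic.
rng = random.Random(7)
null_lr = [rng.gauss(0.0, 1.0) ** 2 for _ in range(2000)]

threshold = empirical_threshold(null_lr, alpha=0.05)
nominal_p = chi2_upper_tail(threshold, df=1)
print(f"empirical 5% threshold = {threshold:.2f}")
print(f"chi-square tail probability at that threshold = {nominal_p:.3f}")
```

A tail probability above 0.05 at the empirical threshold would correspond to the situation noted above, in which the empirical threshold is less conservative than the theoretical one.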
Scenario 1: Here there is only one QTL in the linkage group studied. The average LRs over segments are in Figures 3, 4, and 5 for cases a, b, and c, respectively. These figures are equivalent to a LOD score or F-statistic plot in a chromosome scan, but we prefer a bar representation to underline that they are tests at discrete positions. Note again that the LR_{0,SM} corresponds to a test where the whole chromosome is considered; only the partition employed changes (Figure 1b). In the presence of a single QTL, the segment mapping test behaves differently from the point scan strategies (mixed, random, and fixed models). As expected, the scan strategies produce LR_{0} maxima at the QTL position, and LR_{0} decreases as the test position moves away. In contrast, LR_{0,SM} also shows a clear maximum with partition 1, whereas the rest of the partitions show a rather flat profile with no clear decrease. The differences between partitions should be due to random fluctuations because no clear pattern emerges. Now consider LR_{s}. This statistic should be larger than zero whenever there is a QTL in the position considered and close to zero elsewhere. This is what we observe, and LR_{s} shows clear maxima at the QTL positions irrespective of the genetic case, a, b, or c. The drop in LR_{s} when we move away from the QTL position is much larger than in scan methods; e.g., compare the change in LR_{0,MM} and in LR_{s} between positions 1 and 2 (Figures 3, 4, and 5).
Although there are some similarities between LR_{0,MM}, LR_{0,RM}, and LR_{0,FM}, their performance depends critically on the underlying genetic model. Consider first case a (Figure 3), where the random model is the most appropriate strategy. It is not surprising that LR_{0,RM} is very close to LR_{0,MM} and LR_{0,SM} in position 1, despite the larger number of parameters involved in the latter two methods. Moreover, Table 2 shows that segment mapping as well as the mixed and random models lead to the same
In contrast, the fixed model (8) is the best choice in scenario 1b because the premise that the QTL affecting the trait are diallelic with alternative alleles fixed in each parental line is fulfilled. Here LR_{0,RM} was much lower than LR_{0,FM}, and this was very similar to LR_{0,MM} and LR_{0,SM}, because no additional parameters are needed. The fixed model yielded unbiased estimates of
The most complex, and realistic, scenario is when alleles are not fixed and their average effect differs from line to line (case c). All four analysis strategies identified the correct QTL location (except for one replicate in the fixed-model analysis) with power 100%, and in this sense all methods would lead to the detection of a QTL. But classical methods, either fixed or random models, are not capable of extracting all available information from the data. According to previous results, it is not surprising that the fixed-model analysis resulted in a biased estimate of
Scenario 2c: Consider first the behavior of the likelihood ratio under the different models of analysis (Figure 6). The scan approaches (mixed, random, and fixed models) peaked at both QTL positions with probability close to 50% in all methods (Table 3) because the two QTL were of about the same effect. Again, LR_{0,RM} was higher than LR_{0,FM}, and the power was slightly larger with the random model than with the fixed-model approach. The LR_{0,SM} peaks were more scattered, but almost 50% of the maxima were located at intermediate positions (partitions 3 and 5). These partitions correspond to those where segments containing QTL are grouped vs. segments without QTL. We can think of these partitions as the most “reasonable” ones. Occasionally the LR_{0,SM} peaked at partitions 1 or 6 because in that particular replicate a given QTL effect was much larger than the other QTL effect. In no replicate did the maximum LR_{0,SM} coincide with partition 2 or 5. The plot of LR_{s} clearly indicates that only segments 1 and 6 contain QTL (Figure 6). Moreover, Table 3 shows that SM resulted in unbiased estimates of
Scenario 3c: Here the marker positions coincided with segment bounds. The presence of a close but distinct cluster of genes results in a different LR_{0} pattern as compared to scenario 2c. The LR_{0,MM} and LR_{0,RM} tend now to peak in between both clusters, whereas LR_{0,FM} results in a completely flat profile, with maxima randomly located along the chromosome (Figure 7, Table 4). The LR_{s} allows us to identify convincingly that the intermediate segment contains no QTL. Note that LR_{s} for s = 1 and 3 are significant despite the much lower value compared to the other LR. Again LR_{0,SM} peaked at partition 2. The phenomena already described in scenario 2c are noted again but to a larger extent because more than one linked locus is involved now: there is a bias in
DISCUSSION
The QTL mixed model developed here generalizes the Wang et al. (1998) approach by allowing loci to be linked and by making joint use of the information provided by any number of molecular markers; thus the method can be applied to the analysis of QTL studies of F_{2} crosses. The methodology presented here shows as well that the covariance between F_{2} individuals should be split into the probabilities of identity by descent contributed by each breed. Further, the segment-mapping approach allows a global analysis by partitioning the genome, or the chromosome, into segments. Rodolphe and Lefort (1993) proposed considering the whole genome simultaneously, but their approach is a fixed model with multiple regression on all markers genotyped, which results in a loss of power as the number of markers increases. This does not occur with segment mapping because the number of parameters depends on the number of segments defined, not on the number of markers used.
The simulation results presented show that, under a variety of genetic architectures, the mixed-model and segment-mapping procedures are more robust and flexible strategies than the classical methods based on pure fixed or random models. Segment mapping, mixed model, and pure fixed or random models are hierarchical levels of analysis complexity, as can be seen from comparing (5), (6), (7), and (8). A likelihood-ratio test can be used to decide whether there is evidence to consider a genetic model more complex than the one assumed in classical methods. Overall, the point mixed model showed optimum performance with a single QTL. The segment-mapping approach will be most useful in the case of linked QTL (Tables 3 and 4). The LR_{s} will help to determine which chromosome regions are likely to contain QTL. It is interesting that the segment-mapping partition corresponding to the maximum likelihood (at equal number of parameters) occurs when the genome is partitioned according to its effect on the trait. For instance, when the QTL are in both extremes, the likelihood is maximized when the chosen model partitions off a segment equidistant between the two QTL or the two clusters vs. the rest of the genome (Tables 3 and 4). But it is also a nice property of segment mapping that, irrespective of the partition actually chosen, it results in general in accurate estimates of
The classical fixed-model approach is simple to compute and easy to interpret in F_{2} crosses, although it makes very strong assumptions about allele distributions in the parental lines. We have shown that fixed-model estimates can be dramatically affected if alleles are not fixed within lines, even in one-locus scenarios (Tables 2 and 4).
A systematic upward bias of the
It is interesting to compare the performance of random and fixed models under the genetic models considered. The random model was more robust than the fixed-model approach in terms of locating a QTL: the LR_{0,RM} was higher in case b (Figure 4) than LR_{0,FM} in case a (Figure 3), as well as in case c (Figures 5, 6, and 7). That is, the random model behaved better when the random-model assumptions were violated than the fixed model did when fixed-model assumptions did not hold. This is an interesting result; the random model does not seem a priori a reasonable strategy for analyzing F_{2} crosses as no differences in allelic effects between breeds are assumed. Xu (1998) studied by computer simulation the performance of random models in analyzing crosses but in a context where several crosses between different inbred lines were analyzed together. We are not aware of actual F_{2} QTL experiments analyzed using a completely random model. Nonetheless De Koning et al. (1999) have analyzed an F_{2} cross in pigs using a within-sire regression approach (Knott et al. 1996) and a classical fixed model. The former method does not make specific assumptions about number of alleles and frequencies in the parental lines, at the expense of increasing the number of parameters and disregarding genotypic information of dam origin. Interestingly, the two statistical approaches lead to distinct results, both in QTL effect and in location (with the exception of a QTL for backfat thickness on chromosome 7). The within-sire approach exhibited, overall, smaller power than the fixed model. This analysis seems to contradict our simulation results concerning the robustness of the random model, but there are important differences between the random model and the within-sire regression. First, the within-sire regression as used by De Koning et al. (1999) disregards dam information. This can have a negligible effect in very large and outbred populations, but not necessarily so in modest family sizes (22–51 half-sibs in De Koning et al. 1999) and in an F_{2} between divergent breeds where the variation contributed by the meiotic segregation in the dam can be large compared to the environmental variance. Second, we have assumed in the simulations a maximum informativity in terms of marker alleles, and it is plausible that the relative performance of the methods differs at lower levels of heterozygosity.
The approximation of (3) depends on the informativity and density of molecular markers. We have not explored in detail the impact of noninformativeness on the segment mapping approach, but it can be seen that the partitions used in genetic scenarios 1 and 2 (Tables 2 and 3) have segments with one bound not coinciding with markers, i.e., the least informative possible situation. Despite this, the estimates were quite reasonable. Take, e.g., genetic scenario 1 (Table 2): in partition 1 the variance associated with segment 1–10 cM collects almost all genetic variance and
The simulations carried out here have assumed that loci behave additively, both between and within breeds. This may seem a quite strong assumption in view of the ample empirical evidence for heterosis in line crosses (Lynch and Walsh 1998). The general theory to deal with dominance in crosses between outbred lines has been developed by Lo et al. (1995), and it can be extended to deal with molecular markers. Unfortunately the number of parameters that need to be estimated is very large so that in practice one may be confined to providing only approximate estimates of the dominance variance or making strong assumptions about allele distributions. The fixed-model approach and regression-type methods take dominance into account by adding a covariable equal to the probability of the QTL being heterozygous at the position of interest. The same course of action can be followed here, but it should be noted that this strategy presupposes that a diallelic locus is fixed in each line. Otherwise, the dominance deviation estimate will be biased and inaccurate.
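As a sketch of the regression-type treatment of dominance mentioned in this paragraph, a fixed model at a single position could be written with two covariables. The notation is illustrative and is not taken from the numbered equations of this work: a and d denote additive and dominance effects, and p_{AA,i}, p_{AB,i}, and p_{BB,i} denote the probabilities, conditional on markers, that individual i carries two breed A alleles, one allele of each breed, or two breed B alleles at the position. The form assumes a diallelic QTL with alternative alleles fixed in each line.

```latex
y_i = \mu + a\,\bigl(p_{AA,i} - p_{BB,i}\bigr) + d\,p_{AB,i} + e_i
```

If alleles segregate within lines, p_{AB,i} no longer identifies QTL heterozygotes, which is one way to see why the estimate of d becomes biased in that case.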
We have assumed a model
Note that in segment mapping we do not make the distinction between a QTL and a polygenic background, and it is not necessarily assumed in segment mapping that a single locus is segregating within the segment or segments considered. It follows that, in the segment-mapping context, it is more relevant to test whether a given segment, however small, contributes significantly to genetic variation than to locate a QTL accurately, as is emphasized in interval mapping (e.g., Visscher et al. 1996). The importance of locating QTL accurately, or of correctly ascertaining their number, should not be overestimated. First, if a very dense genotyping is carried out, segment mapping will be able to separate intervals contributing to variation more effectively than a genome scan because external “genetic noise” is properly accounted for in segment mapping. Compare, e.g., the drops in LR_{0,MM} and LR_{s} between positions 1 and 2, which have very similar distributions under the null hypothesis (Figure 2). The change in LR_{s} is larger than in LR_{0,MM} for all genetic cases. We may thus conjecture that a combination of LR_{0,SM} and LR_{s} tests may lead to a more accurate location of the QTL than a simple scan with LR_{0,MM}, although more extensive simulation is needed to prove this. Second, the candidate genes will be readily located once a promising region is identified as genetic maps are becoming densely populated with known genes. The current strategy in QTL analysis is to look for candidate genes within the chromosome regions that have shown association with the trait. It is likely, in fact, that the reverse strategy will be predominant in the future: once the number of cloned candidate genes becomes very large and their physiological effects are ascertained or inferred, it will be routine to estimate the fraction of genetic variance associated with these genes, including possible epistatic effects, in a particular population.
A feature of the segment-mapping strategy is that there is no obvious course of action for partitioning the genome. We propose to run a preliminary analysis with a segment partitioning scan as depicted in Figure 1b complemented with LR_{s} tests every, say, 10 or 5 cM. This should allow us to identify which segments are more promising. In a second analysis the uninteresting regions should be discarded from further consideration, and a detailed partitioning of the most relevant genome regions can be studied, together with elucidating whether fixed, random, or mixed models are more suitable for each segment. Interactions between segments can be analyzed as well. The ultimate goal of segment mapping would be to have a function establishing the appropriate weights given to each region of the genome when computing the additive relationship between animals and, additionally, the expected changes in mean as well. Given estimation errors, the most parsimonious model explaining the maximum variance should be chosen. A reasonable compromise is to classify genome regions according to their effect on the trait of interest, e.g., strong, weak, and nonsignificant. Regions of similar effect can be analyzed together in the same segment. Note that different “segmentation” may be used to model variance components or means; i.e., the whole genome may be partitioned in just three segments grouped according to their contribution to total genetic variance, whereas differences in means (Δ_{s}) can be fitted in more segments, or at specific genome locations if there is clear evidence of a QTL. In that manner, it can be considered that QTL that contribute to differences between lines do not contribute necessarily to differences within lines.
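A preliminary partitioning scan of the kind proposed here can be enumerated with a few lines of code. The sketch below is an assumption about how such a scan might be laid out, since Figure 1b is not reproduced in this text: the function name `two_segment_partitions` is hypothetical, and the focal segment is moved in non-overlapping steps equal to its own length, with the rest of the chromosome pooled into a second segment.

```python
def two_segment_partitions(chrom_length_cm, window_cm):
    """List partitions of a chromosome into a focal segment and the rest.

    Each partition is (focal, rest): focal is a (start, end) tuple in cM and
    rest is a list of the remaining (start, end) pieces, pooled as one segment.
    """
    partitions = []
    start = 0
    while start + window_cm <= chrom_length_cm:
        end = start + window_cm
        focal = (start, end)
        # Keep only non-empty flanking pieces on either side of the focal segment.
        rest = [piece for piece in ((0, start), (end, chrom_length_cm))
                if piece[1] > piece[0]]
        partitions.append((focal, rest))
        start = end
    return partitions


# A 60-cM chromosome scanned with a 10-cM focal segment.
for number, (focal, rest) in enumerate(two_segment_partitions(60, 10), start=1):
    print(f"partition {number}: focal {focal}, rest {rest}")
```

With a 60-cM chromosome this layout yields six partitions for a 10-cM focal segment and three for a 20-cM one. A finer step (e.g., 5 cM) would require overlapping windows, which this sketch does not cover.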
In conclusion, we have put forward a methodology based on mixed-model theory that allows for complex genetic models and, at least theoretically, a simultaneous analysis of the whole genome. It has been shown that genome scans using regression or completely random model approaches are but particular cases of the theory presented in this work. The random model shows a more robust behavior than the most commonly used regression approach. Finally, segment-mapping principles can be accommodated to a variety of experimental designs, not only F_{2} crosses.
Acknowledgments
We are grateful to Miguel Toro, Luis Silió, Rohan Fernando, and the referees for useful comments. Some of this work was accomplished during a sabbatical visit of M.P.E. to Iowa State University. M.P.E. expresses his appreciation for the financial support received from Cotswold USA and Max Rothschild during his stay at Iowa State University. Work was funded by projects Comisión Asesora de Ciencia y Technología AGF962510 (Spain) and BIO4CT97962243 (E.U.).
APPENDIX
The variance/covariance matrix of additive genetic values in the F_{2} generation, G, is derived. First a finite number of loci (nloci) is considered, and the result is then extended to an infinitesimal model. Genetic equilibrium and additive genic action, within and between breeds, are assumed. The genetic value of individual i from breed A is
In the absence of marker information,
Footnotes

Communicating editor: C. Haley
 Received March 16, 1999.
 Accepted January 10, 2000.
 Copyright © 2000 by the Genetics Society of America