Abstract
We provide a theoretical framework for quantitative trait locus (QTL) analysis of a crossed population where parental lines may be outbred and dominance as well as inbreeding are allowed for. It can be applied to any pedigree. A biallelic QTL is assumed, and the QTL allele frequencies can be different in each breed. The genetic covariance between any two individuals is expressed as a nonlinear function of the probability of up to 15 possible identity modes and of the additive and dominance effects, together with the allelic frequencies in each of the two parental breeds. The probabilities of each identity mode are obtained at the desired genome positions using a Monte Carlo Markov chain method. Unbiased estimates of the actual genetic parameters are recovered in a simulated F_{2} cross and in a six-generation complex pedigree under a variety of genetic models (allele fixed or segregating in the parental populations and additive or dominance action). Results from analyzing an F_{2} cross between Meishan and Large White pigs are also presented.
THERE is currently much interest in the use of molecular markers to analyze the genetic basis of quantitative or “complex” traits, and an increasing number of experimental designs and statistical methods are being developed for this purpose (e.g., Liu 1998). A widely used design crosses two inbred lines. In this case the quantitative trait locus (QTL) and the markers are fixed for alternative alleles. Unfortunately, completely inbred lines are available in only a few species and are certainly not available in most domestic animals or in some plants like trees. Instead, researchers have resorted to crosses between divergent, although outbred, lines. Thus, both the marker and the QTL may be segregating in the parental lines. Furthermore, neither the number of QTL alleles nor the allelic frequencies are known. It has been shown that assuming that the QTL alleles are fixed in the parental lines when this is not the case may lead to an important loss of power (Alfonso and Haley 1998; Pérez-Enciso and Varona 2000). Under additive inheritance, a mixed-model approach has been suggested for analyzing crosses between outbred lines (Wang et al. 1998; Pérez-Enciso and Varona 2000). However, given the well-documented phenomenon of heterosis, i.e., evidence of dominance, the assumption of additive inheritance in these methods may be an important shortcoming.
Modeling of dominance in outbred populations with inbreeding has proved to be a difficult task, given the large number of parameters that are required. Smith and Mäki-Tanila (1990) gave recursive formulas to compute the identity coefficients in a single breed. Lo et al. (1995) provided a general framework to model inbreeding and dominance in crossed populations. In both studies, however, no marker information was used.
The objective of this work is to present theory to analyze data from crosses of outbred lines using marker information. This theory allows dominance and inbreeding and the use of all available pedigree information. The article is organized as follows. First, we present the theory. Second, we illustrate the approach with simulated data and real data from a pig F_{2} cross. The main emphasis is on F_{2} crosses, given the wide popularity of this experimental design, but we also show results concerning more complex pedigrees.
THEORY
A general explanatory model for performance records is
The theory developed by Lo et al. (1995) is based on an extension to two breeds of Malécot's (1948) kinship coefficients. Take the two alleles of a locus from a single crossed individual. The two-breed identity mode (TIM, Lo et al. 1995) for a single individual can take any of five mutually exclusive values, which are listed in Table 1. If the two alleles are not identical by descent, either both are from breed A (N = 1), or both are from breed B (N = 3), or each allele is from a different breed origin (N = 2, 2′). If the alleles are identical by descent, they originate either from A (N = 4) or from B (N = 5). Now consider two individuals; we can define in similar terms a pair two-breed identity mode. A schematic representation of the identity modes required to model the covariance between two individuals (see below) is depicted in Figure 1. Two related individuals can share one, two, or three alleles, or all four alleles can be identical by descent. (Of course two individuals can share no allele identical by descent at all, but these terms do not contribute to the genetic covariance and are not shown.) Note that identity modes are grouped in Figure 1 by the number of alleles identical by descent shared by any two individuals.
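The single-individual modes just listed can be written as a small classification rule. The sketch below is only an illustration: the allele representation (a breed label plus a founder-allele identifier, with identifiers unique across breeds) and the function name `single_tim` are hypothetical, and the reading of N = 2 as "paternal allele from breed A" and N = 2′ as the reverse is an assumption, since Table 1 is not reproduced here.

```python
def single_tim(paternal, maternal):
    """Classify the two alleles of one individual into a two-breed identity mode.

    Each allele is a (breed, founder_allele_id) tuple, with breed "A" or "B".
    Two alleles are treated as identical by descent when they carry the same
    founder allele identifier (identifiers are assumed unique across breeds).
    """
    breed_p, founder_p = paternal
    breed_m, founder_m = maternal

    if founder_p == founder_m:
        # Identical by descent: both copies trace back to one founder allele.
        return "4" if breed_p == "A" else "5"
    if breed_p == breed_m:
        # Not identical by descent, same breed of origin.
        return "1" if breed_p == "A" else "3"
    # Not identical by descent, different breeds of origin.
    # ASSUMPTION: 2 = paternal allele from A, 2' = paternal allele from B.
    return "2" if breed_p == "A" else "2'"


if __name__ == "__main__":
    print(single_tim(("A", 1), ("B", 40)))
    print(single_tim(("B", 7), ("B", 7)))
```

The pair modes M of Figure 1 would follow the same logic applied to four alleles, but their numbering depends on Table 2 and is not attempted here.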
Lo et al. (1995) showed that the expected genetic value of the ith individual is
The principles of this theory can be applied to QTL detection, thus permitting the QTL analysis of populations of any pedigree structure derived from crosses between outbred populations, irrespective of whether the gene action shows dominance or not. But two main obstacles persist. First, the number of genetic parameters to be estimated for every locus is 20, the mean plus the covariance parameters. Even if we reduced the number of parameters required in Lo et al. (1995), the size and pedigree structure of most QTL experiments do not suffice to obtain meaningful estimates. Second, we need to obtain the probabilities for each identity mode at every desired genome location for all pairs of related individuals, conditional on marker information.
The number of parameters to be estimated can be dramatically reduced if we assume a biallelic QTL with different frequencies in each breed. The model can now be reparameterized solely in terms of the additive (a) and dominance (d) QTL effects, plus the frequencies of each allele in breeds A and B, p_{1} and p_{2}, respectively. The genotypic value of homozygous individuals is thus a and −a for the alternative alleles, and heterozygous individuals have d as genotypic value. The conditional covariances in (4) can be obtained easily if we assume Hardy-Weinberg equilibrium in the purebred founder individuals. Consider, for instance, M = 14 (Table 2), i.e., the case where both individuals are inbred and the locus is from breed A origin. The covariance is, dropping the subscripts in M,
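The biallelic parameterization can be illustrated by direct enumeration of genotypes. The sketch below is not the authors' derivation: it assumes that M = 14 means all four alleles of the two individuals are copies of a single founder allele from breed A, so both individuals carry the same homozygous genotype, and it labels the allele with genotypic value a as "+" with frequency p. The function names are hypothetical.

```python
def genotypic_value(n_plus, a, d):
    """Genotypic value for a genotype carrying n_plus copies of the '+' allele."""
    return {2: a, 1: d, 0: -a}[n_plus]


def moments(outcomes):
    """Mean and variance of a discrete distribution given as (prob, value) pairs."""
    mean = sum(w * g for w, g in outcomes)
    mean_sq = sum(w * g * g for w, g in outcomes)
    return mean, mean_sq - mean ** 2


def cov_all_ibd(p, a, d):
    """Covariance between two inbred individuals sharing one founder allele.

    ASSUMPTION: both individuals are '+ +' with probability p and '- -' with
    probability 1 - p, so the covariance equals the variance of that value.
    """
    outcomes = [(p, genotypic_value(2, a, d)), (1 - p, genotypic_value(0, a, d))]
    return moments(outcomes)[1]


def hwe_mean_var(p, a, d):
    """Mean and variance of genotypic values in a purebred under Hardy-Weinberg."""
    q = 1 - p
    outcomes = [(p * p, genotypic_value(2, a, d)),
                (2 * p * q, genotypic_value(1, a, d)),
                (q * q, genotypic_value(0, a, d))]
    return moments(outcomes)


if __name__ == "__main__":
    p1, a, d = 0.3, 1.0, 1.0
    print(cov_all_ibd(p1, a, d), 4 * p1 * (1 - p1) * a ** 2)
    print(hwe_mean_var(p1, a, d))
```

Under these assumptions the enumeration reduces to 4p_{1}(1 − p_{1})a^2, which does not involve d and vanishes when the allele is fixed in the breed (p_{1} = 0 or 1).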
The TIM probabilities conditional on marker information can be computed via a modification of the Monte Carlo Markov chain (MCMC) approach described in Pérez-Enciso and Varona (2000) and in Pérez-Enciso et al. (2000b). In short, the algorithm consists of two steps, a step where unknown phases are sampled conditional on available marker information and current phases at other loci and a step where crossover positions are sampled conditional on current phases. Once crossovers are sampled it is possible to trace back the genome origins of any individual at any genome position. Thus, any identity-by-descent coefficient can be readily calculated, including the two-breed identity modes. The process just described is repeated and the mean over MCMC iterations is used to obtain Pr(M) and Pr(N) in (3) and (4). The probabilities of TIM coefficients that need to be stored are 3 (M = 7, 8, and 9) for the diagonal of a noninbred individual, 5 if it is inbred (M = 7–9, 14, 15), 9 for the off-diagonal elements if no individual is inbred (M = 1–9), and all 15 if any of the two is inbred. See Figure 1 and Table 2. We ran the MCMC for 1000 iterations. This relatively small number was good enough as the autocorrelation between samples was very small (Pérez-Enciso et al. 2000b). Further, we tested the algorithm with the exact analytical result in Lo et al. (1995) when there was no informative marker at all, and we also saw that 1000 iterations sufficed. Nonetheless, it should be borne in mind that MCMC convergence problems may exist, in particular if the percentage of missing markers is large.
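The averaging step over MCMC iterations amounts to taking the relative frequency of each identity mode across samples. The following sketch covers only that final step; the phase and crossover sampling is not reproduced, and the input format (one integer mode label per iteration for a given pair and position, with 0 used here as a hypothetical label for "no alleles shared") is an assumption made for illustration.

```python
from collections import Counter


def mode_probabilities(sampled_modes, all_modes):
    """Estimate Pr(M) as the mean of mode indicators over MCMC iterations.

    sampled_modes: one identity-mode label per iteration for a given pair of
                   individuals at a given genome position.
    all_modes:     the labels for which a probability should be reported.
    """
    if not sampled_modes:
        raise ValueError("at least one MCMC sample is required")
    counts = Counter(sampled_modes)
    n_iter = len(sampled_modes)
    # Counter returns 0 for modes that were never sampled.
    return {mode: counts[mode] / n_iter for mode in all_modes}


if __name__ == "__main__":
    # Toy chain of four iterations; label 0 stands for "no alleles shared".
    samples = [7, 7, 9, 0]
    print(mode_probabilities(samples, all_modes=[0, 7, 8, 9]))
```

Only the modes relevant to a given pair need to be stored, which is why the number of stored probabilities ranges from 3 to 15 depending on inbreeding status.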
Finally, it should be noted that Equation 4 assumes that loci are unlinked, which would preclude the analysis of linked QTL. However, we have shown that the covariances between loci are zero conditional on marker information, provided that markers are informative and distances between successive markers are small (Pérez-Enciso and Varona 2000).
We used a two-step strategy for the QTL analysis. First, the TIM coefficients were calculated at the desired genome positions. Subsequently, maximum-likelihood estimates for a, d, p_{1}, and p_{2}, plus the fixed effects, were obtained at each genome position to determine the most likely QTL location, its effect, and its frequencies. The log-likelihood is
It is interesting to compare the approach followed here with other classical methods. Take p_{1} = 1 and p_{2} = 0. This is the model used in analyzing crosses between inbred lines. By substituting p_{1} = 1 and p_{2} = 0 into (3) and (4) it is straightforward to show that G = Ø and that the only terms remaining are those involved in μ_{g}, which are the usual regression coefficients employed in QTL analysis. If, in contrast, we set p_{1} = p_{2}, we retrieve a model for analyzing outbred populations, i.e., where breed origins are not taken into account. Similarly, a strict additive model can be studied by constraining d = 0. In summary, we should be able to test specific gene actions in the population under study by choosing an appropriate restriction on the parameters.
SIMULATION
Two populations, an F_{2} cross and a six-generation pedigree, were simulated. The F_{2} population consisted of 10 and 20 founders from each of the two breeds, 20 male and 40 female F_{1} individuals, and 320 F_{2} individuals. All families contributed an equal number of descendants. Two analysis options were considered: Either only performances from F_{2} individuals were used (n = 320) or also records from all F_{0} and F_{1} individuals were available and analyzed jointly with the F_{2} data (n = 410).
In addition, we also tested the method in a general pedigree. More specifically, we simulated a pedigree of six discrete generations (n = 410). It consisted of 10 and 20 founders from each of the two breeds. The individuals of the next generation were produced by mating 5 sires to 2 dams each, sires and dams being chosen at random with replacement (i.e., an individual, male or female, could participate in more than one mating per generation), and 5 full sibs per mating were generated. The exceptions were the F_{1} generation, where 10 sires were chosen to produce the F_{2}, and the F_{2}, where 13 offspring per mating were generated. It was assumed that all individuals were genotyped and phenotyped. All data were included in the analysis.
The trait was assumed to be controlled by a single biallelic QTL in position 10 cM and bracketed by two markers located at 0 and 25 cM. The markers had 12 alleles, with 6 alleles specific to each breed. Hardy-Weinberg equilibrium frequencies were forced for the QTL in the founder individuals. Founder marker genotypes were sampled at random from a uniform distribution for allele frequencies. The additive genetic value was set to a = 1 and d to 0 or 1. Three cases for allele frequencies were considered: p_{1} = p_{2} = 0.5; p_{1} = 1, p_{2} = 0; and p_{1} = 1, p_{2} = 0.5. All six cases considered are listed in Tables 3 and 4. The phenotype was obtained by adding a normal deviate N(0, 1) to the genetic value. We report the estimates obtained by maximizing the likelihood at the true QTL position. This was done to assess the ability of the method to distinguish between alternative genetic models. The performance of the method in a chromosome scan is shown below in the real data example. Thirty replicates per case were done.
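The QTL part of such a simulation can be sketched in a few lines. The code below is a simplified illustration and not the simulation program used here: markers, sexes, and the equal-family-size mating structure are ignored, F_{2} parents are drawn at random from the F_{1}, founder alleles are drawn independently (which gives Hardy-Weinberg proportions in expectation rather than forcing them), and the default founder numbers are one possible reading of the design. All function and parameter names are hypothetical.

```python
import random


def genetic_value(genotype, a, d):
    """Genotypic value of a (allele, allele) pair, where 1 marks the '+' allele."""
    return {2: a, 1: d, 0: -a}[sum(genotype)]


def simulate_f2(p1, p2, a, d, n_founders_a=10, n_founders_b=20,
                n_f1=60, n_f2=320, seed=0):
    """Simulate QTL genotypes and phenotypes for an F2 cross (simplified).

    Returns a list of (genotype, phenotype) tuples for the F2 individuals.
    """
    rng = random.Random(seed)

    def founder(p):
        # Two independent allele draws per founder.
        return (int(rng.random() < p), int(rng.random() < p))

    def gamete(parent):
        # Mendelian sampling: either allele is transmitted with probability 1/2.
        return rng.choice(parent)

    breed_a = [founder(p1) for _ in range(n_founders_a)]
    breed_b = [founder(p2) for _ in range(n_founders_b)]

    # Each F1 receives one gamete from a breed A parent and one from a breed B parent.
    f1 = [(gamete(rng.choice(breed_a)), gamete(rng.choice(breed_b)))
          for _ in range(n_f1)]

    f2 = []
    for _ in range(n_f2):
        genotype = (gamete(rng.choice(f1)), gamete(rng.choice(f1)))
        # Phenotype = genetic value + standard normal deviate.
        phenotype = genetic_value(genotype, a, d) + rng.gauss(0.0, 1.0)
        f2.append((genotype, phenotype))
    return f2


if __name__ == "__main__":
    records = simulate_f2(p1=1.0, p2=0.0, a=1.0, d=1.0)
    print(len(records), records[:3])
```

With p_{1} = 1 and p_{2} = 0 every F_{1} individual is heterozygous, so the F_{2} genotypes are expected to follow the usual 1:2:1 proportions.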
Four models were used to analyze each of the data sets generated under the six genetic situations. These were, first, an additive model where a single allele frequency was estimated, i.e., with a and p (p_{1} = p_{2} forced) as parameters; second, a model containing a, d, and p; third, a model with a, p_{1}, and p_{2} as parameters; and finally, a full model containing a, d, p_{1}, and p_{2}.
REAL DATA
The data were from an F_{2} cross with Meishan and Large White pigs as parental populations. A comprehensive report of the experimental design and results can be found in Milan et al. (1998). The pedigree analyzed comprised six Large White boars, six Meishan females as founder animals, 36 F_{1}, and 300 F_{2} individuals, which were a subset of the 1000 individuals available. Previous analysis using the approach of Haley et al. (1994) provided strong evidence of a QTL on chromosome 4 affecting growth, but the results with respect to backfat were less conclusive. A joint analysis of seven QTL experiments suggested that the chromosome 4 effect on backfat in crosses involving Meishan was much smaller than in crosses with wild boar (Walling et al. 2000). Thus, we selected for the analysis records from backfat thickness adjusted at 80 kg and live weight adjusted for 120 days and markers from chromosome 4. Eight microsatellites were genotyped. They were located in positions 0 (S0227), 27 (SW2547), 50 (S0001), 75 (SW1089), 88 (SW270), 91 (S0214), 121 (SW445), and 141 (S0097) cM. These distances are from the average sex map. Different models were fitted at 2-cM windows between positions 50 and 90 cM. This region should largely contain the 95% confidence interval for the QTL, with maxima located in positions 75 (backfat) and 68 cM (live weight; Milan et al. 1998). The probability coefficients Pr(N) and Pr(M) were obtained after 1000 MCMC iterates using all marker information, even if we restricted the analysis to a specific chromosome region. The same data set was also analyzed using the regression approach in Haley et al. (1994), which assumes a biallelic QTL, and a within-family approach suited for a three-generation pedigree consisting of a mixture of full and half-sib families (Le Roy et al. 1998). In this latter approach both the sire and the dam of the F_{2} individuals are assumed to be heterozygous, not necessarily for the same alleles across families. Estimates are obtained via maximum likelihood.
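The scan over 2-cM positions between 50 and 90 cM can be pictured as a simple grid search. The sketch below assumes a hypothetical function `profile_loglik(position)` that returns the log-likelihood already maximized over a, d, p_{1}, p_{2}, and the fixed effects at that position; that inner maximization is not shown, and a toy curve is used in its place.

```python
def scan_positions(profile_loglik, start_cm=50, stop_cm=90, step_cm=2):
    """Evaluate a profile log-likelihood on a grid and return the best position.

    profile_loglik: callable taking a position in cM and returning the
                    log-likelihood maximized over all other parameters
                    (hypothetical; supplied by the caller).
    Returns (best_position, best_loglik, curve), where curve is the list of
    (position, loglik) pairs in scan order.
    """
    positions = range(start_cm, stop_cm + 1, step_cm)
    curve = [(pos, profile_loglik(pos)) for pos in positions]
    best_position, best_loglik = max(curve, key=lambda item: item[1])
    return best_position, best_loglik, curve


if __name__ == "__main__":
    # Toy stand-in for the real likelihood: a single peak at 68 cM.
    toy = lambda pos: -((pos - 68) ** 2) / 50.0
    best, loglik, curve = scan_positions(toy)
    print(best, loglik, len(curve))
```

Keeping the whole curve, rather than only the maximum, makes it possible to spot several local maxima within the scanned region.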
RESULTS
Simulation: A first step in the analysis is to decide which is the most appropriate genetic model, i.e., whether the alleles are fixed within the parental populations and whether genic action is purely additive or there is evidence of dominance. Consequently, we computed the likelihood ratio (LR) of models including dominance and/or unequal breed allele frequencies vs. the simplest model, i.e., no dominance and equal allele frequencies in both breeds. Figure 2 shows the results corresponding to the F_{2} population. The results are shown for all six parameter combinations used to generate the data. Statistics are presented for two cases, namely, whether only F_{2} phenotypic records or all F_{0}, F_{1}, and F_{2} records are included in the analysis. A LR test allowed us to retrieve the correct model in all instances studied. Take, for example, Figure 2a, where the null model is the true one: no LR exceeded the significance threshold. Whenever data were generated according to a purely additive model (Figure 2, a, c, and e), the model including dominance did not improve upon the additive one. Otherwise (d = 1), the LR clearly showed that a dominance parameter should be included in the model (Figure 2, b, d, and f). Likewise, a LR also discriminated whether allele frequencies are equal (Figure 2, a and b) or not (Figure 2, c–f). In cases c and d (true p_{1} = 1, p_{2} = 0), we also tested whether a model including parameters p_{1} and p_{2} improved over a model that set p_{1} = 1 and p_{2} = 0, with the result that the former model was not significantly better than the latter model (results not shown). Similar results, not presented to avoid repetition, were found for the six-generation pedigree.
Note, in addition, that the inclusion of parental purebred and F_{1} records improves the probability of detecting the correct model as the LR of the most parsimonious correct model increases. In the particular case represented in Figure 2a (d = 0, p_{1} = p_{2} = 0.5), the LRs of less parsimonious models decrease when analyzing all data, giving further support to the null hypothesis model.
Average estimates of the parameters for the F_{2} cross and the six-generation pedigree are in Tables 3 and 4, respectively. The estimates reported are those obtained under the correct model, the rationale being that a test has been carried out to determine which is the appropriate model, as in Figure 2. All in all, we find an excellent agreement between actual parameters and their estimates. The accuracy of allele frequency estimates was very high if alternative alleles were fixed and less so if the alleles were segregating within breeds, but still unbiased estimates were retrieved. Standard errors were, in most cases, smaller when F_{0} and F_{1} records were included in the F_{2} pedigree analysis.
Comparing experimental designs, the estimates from the six-generation pedigree had on average a larger standard error than those from the F_{2} design when alleles were not fixed in each parental breed. This is likely to be due to genetic drift, which increases each generation, and we found a strong interrelationship between allele frequency and QTL effect estimates. In contrast, we also observed a smaller error for QTL position in the six-generation pedigree than in the F_{2} design (results not presented), as expected because of a larger number of meioses in the former population (Darvasi and Soller 1995).
Pig data: The results of the comparison between alternative models on the F_{2} cross pig data from Milan et al. (1998) are listed in Table 5. Five models were fitted. Model 1 is the model assumed in a typical regression approach including dominance; models 2 and 4 assume a pure additive action, whereas models 3 and 5 include d. Models 4 and 5 allow for different allele frequencies in each breed, whereas models 2 and 3 do not. In addition, the parameter estimates using the regression approach in Haley et al. (1994, model 0a) and the within-family analyses of Le Roy et al. (1998, model 0b) are shown as well. For backfat, the allelic action is additive, as can be seen from comparing the LR for models that include d vs. those that do not include d, i.e., models 3 vs. 2 and 5 vs. 4. Moreover, the allele frequencies are also significantly different in each breed, as would be expected. But it is more illuminating to ask whether the breeds have alternative alleles fixed. The difference in LR between model 5 and the model where fixed alleles are assumed (model 1) is 6.8. The exact distribution of this ratio is not known, but a chi-square distribution with between 1 and 2 d.f. can be a good approximation, and even in the most conservative case (2 d.f.) it would be significant (P < 0.05). The analysis would thus suggest that the “fat” allele is fixed in Meishan and at low frequency but still segregating in Large White. Note that the QTL effect is underestimated in a model that forces alleles to be fixed in each breed. It is well known that power decreases in a regression approach if alleles are not fixed in each breed (Alfonso and Haley 1998; Pérez-Enciso and Varona 2000).
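The significance statement for the LR difference of 6.8 can be checked with the closed-form chi-square tail probabilities for 1 and 2 d.f. The helper below is a minimal sketch written for this purpose (the name `chi2_sf` is hypothetical, and only these two degrees of freedom are handled); it treats the reported difference as the test statistic.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable for df = 1 or 2.

    df = 1: P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    df = 2: the distribution is exponential with mean 2, so P = exp(-x / 2).
    """
    if x < 0:
        raise ValueError("the test statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only 1 or 2 d.f. are implemented in this sketch")


if __name__ == "__main__":
    lr_difference = 6.8  # model 5 vs. model 1 for backfat
    for df in (1, 2):
        print(df, chi2_sf(lr_difference, df))
```

Both tail probabilities fall below 0.05, in line with the statement that the contrast is significant even with 2 d.f. Applied to a statistic of 2.8 with 1 d.f., the same helper gives roughly 0.09.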
In contrast to backfat thickness, all statistical evidence suggests that Meishan and Large White pigs have fixed alternative alleles affecting live weight in chromosome 4. Models 4 and 5 converge to p_{1} = 1 and p_{2} = 0, with no increase in likelihood in model 5 with respect to model 1. It is more difficult to ascertain the effect of dominance, the difference in LR being close to significance. The regression approach provided estimates similar to those obtained under the additive model 3. Note that the QTL position estimates vary widely depending on whether dominance was included or not; QTL position changed over 10 cM according to the model chosen. Figure 3 shows a plot of LR for models that include the dominance effect or not and p_{1} = 1 and p_{2} = 0. It can be seen that there are two local maxima in that region, and probably the confidence interval for the QTL position comprises both maxima. In any case, this change in QTL position is particularly worrying here given that the effect of dominance borders significance. Note, in addition, that the within-family analysis agrees with the position estimated under the dominant model, whereas the between-breed regression estimate is close to that obtained with the additive model. The LRs of models that assume equal frequencies in both breeds (models 2 and 3) were nonsignificant, which contrasts with the results obtained for backfat thickness. This occurs because these models assume that there is allelic variation within breeds, which seems to be the case for backfat thickness but not for growth.
DISCUSSION
The theory developed allows us to obtain a very useful insight into QTL genic action. It allows us to diagnose whether the alleles are fixed within the parental populations or segregating at similar frequencies and whether the genic action is dominant or not. Unlike other indirect approaches like within-sire regression, testing can be done irrespective of the population structure, i.e., number of generations, and using all available pedigree and marker information. Certainly the method can be improved; for instance, it would be desirable to use a single MCMC strategy to sample jointly the identity coefficients and the rest of the parameters. Such a strategy would provide exact estimates of the standard errors of the parameters, whereas in this likelihood framework with a simplex algorithm we need to resort to asymptotic approximations. We are currently working on a general Bayesian strategy to address this issue. Nonetheless, we have shown that the approach followed here performed quite well under a variety of genetic and pedigree scenarios (Tables 3 and 4, Figure 2).
The ascertainment of whether the QTL alleles are segregating within lines is an important issue in QTL identification. If a QTL is found, say, in an F_{2} cross, the subsequent experimental procedure to map it finely can be very different depending on whether all F_{1} are heterozygous at the QTL (alternative alleles fixed) or only a percentage are (i.e., alleles segregating in the parental lines). Moreover, the presence of dominance may also alter the statistical results obtained via classical regression type methods. The power of such an approach will diminish if a recessive allele is segregating in any of the crossed populations. Certainly, the most convincing proof would be to analyze the purebreds directly, but usually the number of purebred individuals typed in experimental crosses is small and would require setting up an additional experiment. Our approach is able to extract more information than classical approaches from the already available data.
There are currently a large number of crosses between divergent lines in many different animal and plant species. Certainly, not all the parental lines utilized are completely inbred. It is thus interesting to compare the results using a regression method and the method developed here. The results presented in Table 5 represent two of the possible situations that may be encountered. For the first trait, backfat thickness, there is reasonable evidence that alleles may not be fixed in both lines. Further, the statistical analysis suggests that the QTL is segregating in Large White but not in Meishan (Table 5), which may be explained by the very small number of founder animals of the French Meishan population (Bidanel et al. 1989). Interestingly, it was for backfat that a joint study of seven pig F_{2} crosses reported an interaction between QTL effect and experiment (Walling et al. 2000). One of the most parsimonious explanations for this interaction is that the QTL may be segregating at different frequencies in each breed, and the regression approach simply cannot take this possibility into account. Note that model 4 in Table 5 has only one more parameter than model 2 but that the increase in likelihood is quite important. For this trait, thus, the power will increase to a larger extent by allowing distinct allele frequencies in each breed than by including a dominance effect in the model.
In contrast to backfat, the classical model seems appropriate for live weight and there is not much to be gained by adding extra parameters to the regression model. Here some uncertainty lies on the relevance of the dominant effect, as the significance level of contrasting model 5 vs. 4 is P ≈ 0.09, the probability of a chi-square variable (1 d.f.) exceeding 2.8. The evidence in favor of dominance is thus weak, in agreement with the results of the regression analysis. The QTL position estimates obtained via the within-family analyses are in agreement with those obtained via the TIM coefficients, although the average estimate is lower. In principle, one should also expect a within-family heterogeneity of substitution effects if alleles are not fixed. Nonetheless, we found that the variance of sire effect estimates using the within-family approach was similar for both traits, 0.10 and 0.08 in SD units for live weight and backfat, respectively. This is probably because each half-sib family is analyzed separately and thus a small number of observations is actually used to estimate each substitution effect, in contrast to the more parsimonious approach presented here, where all pedigree and marker information is considered jointly.
Although beyond the scope of this article, the researcher should be aware of possible QTL position shifts according to the model of choice (Figure 3). This is pertinent especially considering that it is customary to include a dominant effect in the model in crossed-population analyses without testing for its effect. We did not carry out a joint multivariate analysis of live weight and backfat, but the fact that the allele frequencies are different for each trait would suggest that there are two linked loci. This hypothesis would be in agreement with results from Marklund et al. (1999), who found that the QTL effect on growth, but not on fat, was diminished in a wild boar × Large White backcross when boars with different QTL genotypes for fat were progeny tested. We also applied our approach to chromosome 4 in an F_{2} cross between Iberian and Landrace pigs (Pérez-Enciso et al. 2000a), finding evidence of alleles fixed in Iberian but not in Landrace pigs for carcass weight and fixed alleles in both breeds for backfat (M. Pérez-Enciso and A. Clop, unpublished results). All in all, this real data example illustrates the advantages of inspecting the data under different models, given that the genetic basis of all traits analyzed is not necessarily the same.
The identity modes in Lo et al. (1995) are a generalization of the kinship coefficients described by Malécot (1948), which are well known in quantitative genetics. They have proved to be a very powerful instrument to model complex genetic relationships (Gillois 1964; Harris 1964; Lo et al. 1995), but we are not aware of any application so far in QTL studies. A definite advantage of Malécot's coefficients is that it is straightforward to take into account other genetic situations as well. The reader can easily figure out how the relevant identity modes depicted in Figure 1 and Tables 1 and 2 could be modified to allow for, e.g., imprinting or sex chromosome inheritance. We also developed a multivariate approach that allows us to distinguish between pleiotropy and linkage in a multiple QTL model; the results will be presented elsewhere. The daunting issue of obtaining the TIM probabilities conditional on marker information was solved via MCMC methods, illustrating once more the versatility of these approaches (Pérez-Enciso et al. 2000b). These methods are computer intensive but relatively easy to implement and program. Further, the number of parameters required was dramatically reduced by considering a biallelic locus. A biological justification of this model would be an ancestral mutation whose frequency has been changed through selection and drift at different speeds in each of the breeds studied. The pig Halothane (Ryr1) gene would be a classical example. But there exist as well large allelic series with a quantitative effect, like the α_{s1}-casein in goats (Martin et al. 1995). In principle, the theory can be extended to deal with a multiple breed population and more than two alleles. Each additional breed considered adds only one parameter, the allele frequency, but each additional allele increases the number by 2 + n_{a}, where n_{a} is the previous number of alleles. This approach is thus not suitable in a large multiallelic system. A method based on analysis of variance will be more appropriate in this instance.
Finally, it should be recalled that most plant and animal individuals exploited commercially are hybrids but that their genetic evaluation is largely based on purebred performance. Thus a further application of the theory developed here, beyond the detection of QTL, will be to include molecular and performance data from hybrids in the genetic evaluation scheme. This approach can also be used to help marker-assisted introgression, where typically data from several generations are available and where dominance and inbreeding may be present.
Acknowledgments
This work began during a sabbatical visit of the senior author to Iowa State University. M.P.-E. expresses his appreciation for the financial support received from Max Rothschild and Cotswold USA during this visit. The pigmap QTL project was funded by the European Union (Bridge and Biotech+ programs), Institut National de la Recherche Agronomique (Department of Animal Genetics, AIP “structure des génomes animaux” and the “Groupement de Recherches et d'Études sur les Génomes”).
Footnotes

Communicating editor: C. Haley
 Received September 13, 2000.
 Accepted June 13, 2001.
 Copyright © 2001 by the Genetics Society of America