Abstract
Because of uncertainty about linkage phases of founders, linkage mapping in nonmodel, outcrossing systems using molecular markers presents one of the major statistical challenges in genetic research. In this article, we devise a statistical method for mapping QTL affecting a complex trait by incorporating all possible QTLmarker linkage phases within a mapping framework. The advantage of this model is the simultaneous estimation of linkage phases and QTL location and effect parameters. These estimates are obtained through maximumlikelihood methods implemented with the EM algorithm. Extensive simulation studies are performed to investigate the statistical properties of our model. In a case study from a forest tree, this model has successfully identified a significant QTL affecting wood density. Also, the probability of the linkage phase between this QTL and its flanking markers is estimated. The implications of our model and its extension to more general circumstances are discussed.
RECENT developments of modern molecular marker technologies and statistical and computational tools have led to a great resurgence of interest in studying the inheritance and genetic architecture of a complex trait at the individual quantitative trait locus (QTL) level (Lander and Schork 1994; Lander and Weinberg 2000; Mackay 2001). A number of analytical methodologies that suit different situations of QTL mapping have been framed (Lynch and Walsh 1998; Jansen 2000; Weller 2001) and the results of genetic analysis for a variety of organisms using these methodologies have been reported (reviewed in Wuet al. 2000; Mackay 2001). According to the biological properties of study objects, all these theoretical or empirical studies can be classified into two categories, one for model systems and the other for nonmodel systems.
QTL mapping for model systems, in which homozygous inbred lines can be developed, is performed with welldesigned experiments. One popular experimental design is to create a segregating progeny population, such as F_{2} or backcross, by using two complementary inbred lines. Statistical technologies for identifying QTL in these standard designs are relatively simple because there are only two segregating alleles for each genetic locus and because the allelic frequencies and linkage phases for both the markers and QTL are known. Lander and Botstein (1989) for the first time proposed a maximumlikelihoodbased method to map a QTL in a chromosomal interval bracketed by two flanking markers. The theory behind this intervalmapping method was subsequently extended to create a socalled compositeinterval mapping by combining multiple marker regression analysis techniques, which can overcome the influences of QTL in other different marker intervals (reviewed in Jansen 2000). Although these two statistical developments of QTL mapping have brought about numerous publications on QTL identification, they are not designed to make a simultaneous search for all possible QTL affecting a quantitative trait throughout the entire genome. Kao and Zeng (1997) have derived general formulas for obtaining maximumlikelihood estimates for QTL positions and effects. In their article, the authors developed a multipleintervalmapping method to search and map all possible QTL by analyzing multiple marker intervals simultaneously and, therefore, to estimate the genetic architecture of a quantitative trait in a comprehensive framework.
Unlike the model systems, it is difficult or impossible to generate inbred lines in nonmodel systems and, thus, QTL mapping for these species should be based on existing nondomesticated populations, such as a fullsib family derived from heterozygous parents (Grattapaglia and Sederoff 1994). In the mapping populations of nonmodel systems the number of segregating alleles per marker locus or QTL and linkage phases between different loci are usually unknown. These two uncertainties make linkage analysis and QTL mapping using molecular markers much more challenging for a fullsib family of outbred lines than for a progeny of a cross derived from inbred lines. Several studies have been conducted for linkage analyses of molecular markers of a different amount of segregation informativeness (Ritteret al. 1990; Aruset al. 1994; Ritter and Salamini 1996; Maliepaardet al. 1997; Ridoutet al. 1998) or QTL identification using these different markers in a fullsib family (SchäferPreglet al. 1996; Johnsonet al. 1999; Songet al. 1999). In some studies, more sophisticated statistical algorithms, such as Bayesian approaches relying on a Markov chain Monte Carlo, have been proposed to take the complexity of fullsib family mapping into account (Hoescheleet al. 1997; Sillanpaa and Arjas 1999; Xu and Yi 2000). However, all these studies are still simplified in practice because they do not provide a robust approach for characterizing linkage phases between markers and QTL. Because different segregation patterns are expected under different linkage phases (Wuet al. 2002), the failure to characterize a correct linkage phase may lead to serious biases for the estimation of QTL positions and effect sizes in a fullsib family.
In this article, we extend Wu et al.'s (2002) multilocus analysis procedure to simultaneously estimate linkage and linkage phases between markers and QTL segregating in outcrossing populations. Our idea here is to integrate all possible linkage phases between a putative QTL and two flanking markers in two parents, each specified by a phase probability, within the framework of a mixture statistical model. In characterizing a most likely linkage phase on the basis of the phase probabilities, the QTL position, QTL effects, and other model parameters are also estimated using a likelihood approach. We perform numerous simulation studies to investigate the robustness, power, and precision of our statistical mapping method, incorporating linkage phases. An example from an outcrossing forest tree is used to validate the application of our method to QTL mapping for nonmodel systems.
STATISTICAL MODEL
Marker segregation types: A commonly used mapping population for outcrossing species is one derived from a fullsib family generated by two heterozygous outbred parental lines. In such a fullsib family, many different marker segregation types can be expected given the heterozygosity of the two parents. Grattapaglia and Sederoff (1994) proposed a wellaccepted pseudotest backcross design for mapping in an outcrossing population, but this design can use only a portion of the genome markers segregating in one parent but null in the other. For a fullsib family derived from two outbred parents, up to four marker alleles, besides a null allele, at a single locus, can occur. Also, the number of alleles may vary over loci. Each of the marker alleles, symbolized by a, b, c, and d, is dominant to the null allele, symbolized with o. Assume that all markers follow a Mendelian segregation without distortion. Depending on how different alleles are combined between the two parents used for the cross, a total of 18 possible cross types exist for a segregating marker locus (Table 1). On the basis of both parental and offspring marker band patterns, these cross types can be classified into seven groups:

Loci that are heterozygous in both parents and segregate in a 1:1:1:1 ratio, including four alleles, ab × cd; three nonnull alleles, ab × ac; three nonnull alleles and a null allele, ab × co; and two null alleles and two nonnull alleles, ao × bo.

Loci that are heterozygous in both parents and segregate in a 1:2:1 ratio, which include three groups:

B_{1}. Three alleles form a nonsymmetrical cross type between the two parents. Of the three alleles, one is a null allele in one parent, e.g., ab × ao.

B_{2}. The reciprocal of B_{1}.

B_{3}. Two alleles form a symmetrical type between the two parents, i.e., ab × ab.


Loci that are heterozygous in both parents and segregate in a 3:1 ratio, i.e., ao × ao.

Loci that are in the testcross configuration between the parents and segregate in a 1:1 ratio, which include two groups:

D_{1}. Heterozygous in one parent and homozygous in the other, including three alleles, ab × cc; two alleles, ab × aa, ab × oo, and bo × aa; and one allele ao × oo.

D_{2}. The reciprocals of D_{1}.

A general statistical framework has been proposed for linkage analysis of different types of markers in nonmodel systems (Wuet al. 2002). A multilocus linkage phase inference model is derived on the basis of a hidden Markov chain process to simultaneously estimate linkage and linkage phases for the markers on a same linkage group. The genetic mapping of QTL is conducted using such a wellconstructed linkage map.
A general framework: Consider two outbred parental lines denoted as P and Q, each containing two homologous chromosomes 12 in a set. The cross between these two lines, 12 × 12, results in four possible parental chromosome pairings, 11, 12, 21, and 22. In this article, we used italic numbers to denote parental chromosomes.
As explained above and seen in Table 1, there may be many different marker types in a fullsib family derived from the two outbred parental lines. However, all observed markers, no matter which type they come from, can be described by two alleles,
Because of different gamete probability combinations, the joint probabilities of the 64 zygotic genotypes (and therefore the conditional probabilities of the QTL genotypes given marker genotypes) will be different among the four phase combinations. However, regardless of the difference among these four phase combinations, these conditional probabilities under different phase combinations can be obtained just by changing the order of the QTL genotypes corresponding to a particular phase combination (Table 2).
Let u and v denote the QTL alleles that an offspring i has received from parent P and Q, respectively. The conditional probability of the QTL genotype for this individual under parentalphase combination Φ_{11} is denoted by h_{uvi} expressed in one of the four vectors,
Assume that the phenotypic values y of a QTL genotype P_{u}Q_{v} are normally distributed with mean μ_{uv} and variance σ^{2}, expressed as
Because marker information for each offspring has been incorporated into the mixture model of Equation 1, one unknown parameter r_{1} or r_{2} (that determines the location of the QTL on the interval) should be estimated along with vector Ω. But in practice, the QTL location is estimated by treating r_{1} (and therefore r_{2}) as fixed. Using a grid approach, we can obtain the MLE of the QTL location from the peak of the profile of the loglikelihood ratio test statistics across a chromosome.
On the basis of quantitative genetic theory, the genotypic value of a QTL can be partitioned into the additive and dominant effects as
Let m = (μ_{uv})_{4×1} and a = (μ, α, β, δ)^{T}, which can be connected by a design matrix D. We have
The MLE of a can be obtained from the MLE of m by
Fitting marker phenotypes: We have built a general framework for QTL mapping based on the twomarker zygote genotypes. But in practice only the phenotypes of the marker zygotes can be observed. The numbers of the zygote phenotypes of a marker are 4, 3, 3, 3, 2, 2, and 2 for marker types A, B_{1}, B_{2}, C, D_{1}, and D_{2}, respectively (Table 1). We have designed different incidence matrices I (Wuet al. 2002; appendix b) to connect the zygotic genotypes to the zygotic phenotypes for all different marker types listed in Table 1. Thus, general expressions for the probability vector of twomarker genotypes or the joint probability matrix of two markers and QTL for particular marker types can be derived by using the corresponding incidence matrices (appendix b), which are expressed as
Hypothesis tests: The existence of a QTL of significant effect within a marker interval can be tested by calculating a loglikelihood ratio (LR) test statistic under the null (H_{0}, there is no QTL) and alternative hypotheses (H_{1}, there is a QTL), expressed as
In a fullsib family derived from two outbred parents, it is possible that a putative QTL does not segregate in a 1:1:1:1 ratio. The genetic model (1) proposed in this article has power to test if the QTL detected is diallelic segregating 1:2:1 (like marker type B) or 1:1 (like marker type D_{1} or D_{2}; see Table 1). The hypothesis that a significant QTL conforms to segregation type B can be tested by formulating
Similarly, the hypothesis for testing for the consistency of the QTL segregation to type D can be formulated as
In each of the two hypotheses above, the LR is calculated similarly to Equation 4. In practice, the segregation pattern of a significant QTL should be tested because this is important for designing an efficient breeding strategy.
MONTE CARLO SIMULATION
Extensive simulation studies are performed to test the statistical properties of our method for simultaneously estimating QTL position and effects and linkage phase between the QTL and markers in an outcrossed population. Suppose a fullsib family is derived from two outcrossed parents. This fullsib family is genotyped at six equally spaced (20cM) fully informative markers (type A), forming five intervals. A QTL is hypothesized at 26 cM from the first marker (located within the second interval).
The phenotypic values for this fullsib family are simulated by giving a particular set of unknown QTL effect parameters under the linkage phase combination Φ_{11} for the two parents. These simulated data are subject to mapping analysis using our model that considers a mixture of all possible linkage phases. Thus, if the MLEs of the phase probabilities p and q are near one, this indicates that our model can precisely characterize the linkage phase for a practical phaseunknown data set. Of course, if the data are simulated under the other linkage phase combination, the values of p and q reflecting this correct linkage phase should be changed correspondingly.
The critical thresholds for declaring a significant QTL are determined from the distribution of the LR values calculated from the simulated phenotypic data assuming no QTL. The simulated data under this null hypothesis are analyzed by the statistical model proposed. The distribution of the LR values over 1000 simulation replicates can be approximated by a χ^{2} distribution. The 99th percentile of the distribution of the maximum is used as empirical critical values to declare chromosomewide existence of a significant QTL at the significance level α= 0.01.
Our simulation schemes include different gene action modes (additive, dominant, or overdominant), different heritabilities (H^{2} = 0.1 or 0.4), and different sample sizes (N = 200 or 400; Table 3). Given a heritability and the genetic variance calculated from hypothesized genetic effect values, we estimate the residual variance (σ^{2}). The accuracy and precision of parameter estimates are affected by gene action modes in three ways (Table 3). First, an overdominant QTL tends to have a more precise estimate of location than does a dominant or additive QTL. For example, the standard error (SE) of the location MLE for an overdominant QTL is 14–20% smaller than those for other QTL when the heritability is 0.4 and sample size is 400. Second, for a small heritability trait, the MLEs of additive effects for an overdominant QTL are less biased than those for additive or dominant QTL. Third, the dominant effect is overestimated to a larger extent than the additive effect, especially for an overdominant QTL.
The estimation accuracy and precision of all parameters can be improved when heritabilities and sample sizes are increased (Table 3). For example, it is difficult to estimate the position of QTL for a low heritability (H^{2} = 0.1) trait when N = 200 (Figure 1A). An increased sample size (N = 400) can lead to more precise estimation of the QTL location. For a high heritability (H^{2} = 0.4) trait, the QTL can be precisely localized, especially when a larger sample size is used (Figure 1B). Similar trends also hold for the estimates of other QTL parameters, such as additive and dominant effects, and model parameters (overall means and residual variance; Table 3). It appears that there are more substantial improvements in the accuracy and precision of parameter estimates due to an increased heritability level from 0.1 to 0.4 than to an increased sample size from 200 to 400.
It is interesting to note that our model can well estimate the linkage phase between the QTL and the markers. The MLEs of phase probabilities are close to 0.90 for a small heritability trait and ≥0.95 for a high heritability trait (Table 3). Unlike the estimation of other parameters, the power of detecting a significant QTL seems to be more sensitive to sample sizes than to heritabilities (Table 3). For a small mapping population, the power of detecting a significant QTL is considerably reduced.
We also performed an additional simulation experiment to test the influence of incorrectly characterizing a linkage phase on QTL detection and parameter estimation. The simulated data, given H^{2} = 0.4 and N = 400, under linkage phase combination Φ_{11} are analyzed using models based on this phase and three other different phases, Φ_{12}, Φ_{21}, and Φ_{22}. Because different linkage phases change only the order of the parental chromosomal pairings, the maximums of the LR values from the correct linkage phase Φ_{11} and the three incorrect linkage phases Φ_{12}, Φ_{21}, and Φ_{22} will be identical (see Figure 2), suggesting that phaseseparate analyses have no power to select a most likely linkage phase. Also as shown by flat, crooked curves, the maximum LR value from a single linkage phase model cannot be used to precisely determine the QTL position. Figure 2 also illustrates the LR values across the linkage group calculated when all linkage phase combinations are considered simultaneously on the basis of the same simulated data set. A higher peak of the curve for a mixedphase analysis (see also Figure 1B) indicates that our model has a greater advantage in detecting a significant QTL than usual phaseseparate analyses do. When an incorrect linkage phase is used, the signs of the MLEs of the additive and dominant effects of a QTL will be reversed (results not shown).
A CASE STUDY
We use an example of an outcrossed forest tree to demonstrate the power of our statistical model for mapping QTL affecting a quantitative trait. The study material used was derived from the hybridization between two poplar species, Populus deltoides and P. euramericana. A genetic linkage map was constructed using a socalled pseudotestcross strategy (Grattapaglia and Sederoff 1994) based on 90 genotypes selected randomly from the F_{1} interspecifc hybrid family with random amplified polymorphic DNAs (RAPDs), amplified fraction length polymorphisms (AFLPs), and intersimple sequence repeats (ISSRs; Yinet al. 2002). This map is composed of the 19 largest linkage groups for each parental map, which roughly represent 19 pairs of chromosomes. The 90 hybrid genotypes used for map construction were measured for wood density with wood samples collected from 11yearold stems in a field trial. The measurement for each genotype was repeated four to six times to reduce measurement errors. The means of these genotypes were calculated and used for QTL mapping here.
Our model can successfully identify a significant QTL for wood density on linkage group D17 as reported in Yin et al. (2002). In this example, the empirical estimate of the critical value is obtained from 1000 permutation tests. It is found that the critical value for declaring the existence of a QTL on the whole linkage group under consideration is 6.9 at the significance level P = 0.05. The profile of the LRs of the full vs. reduced model across the length of linkage group D17 has a steep peak between a narrow marker interval AG/CGA480–AG/ CGA330 (Figure 3). The LR value at this peak is 11.7, well beyond the empirical critical threshold at the significance level P = 0.05.
The additive effect of this significant QTL detected is 0.033, or equivalent to 7% relative to the overall mean. This QTL was found to explain ∼30% of the phenotypic variance for wood density in hybrid poplars. The MLE of phase probability p is 0.82, thus suggesting that there is quite a high probability to have a linkage phase Φ_{11}. This indicates that the positive allele of this QTL that increases wood density is, at a probability of 0.82, in coupling phase with dominant alleles of the two markers AG/CGA480 and AG/CGA330 flanking the QTL.
The same material was analyzed using a traditional intervalmapping approach that assumes a possible QTLmarker linkage phase at one time. This phaseseparate approach can also identify a significant QTL for wood density (data not shown), but cannot determine a correct linkage phase because the maximums of the LR values are identical between two possible linkage phases. Our method provides important information about nonallelic arrangements on the homologous chromosomes.
DISCUSSION
Statistical strategies for mapping QTL segregating in an inbred population have been well established (reviewed in Jansen 2000), which has led to a number of publications reporting on the detection of QTL in different species (Wuet al. 2000; Mackay 2001). Yet, despite significant importance, QTL mapping in outcrossing species is often frustrated due to lack of an appropriate statistical method to consider high heterozygosity of this group of species. In this article, we present a statistical model for mapping QTL in these outcrossing, nonmodel systems by incorporating their heterozygous nature into a mapping framework.
Our model is advantageous over current QTL mapping methods in a fullsib family derived from outcrossing species in three aspects (Anderssonet al. 1994; Haleyet al. 1994; Xu 1996; Knottet al. 1997). First, our model can characterize a correct linkage phase between a putative QTL and markers. For heterozygous populations, allelic arrangements of different markers and QTL on a single chromosome, i.e., linkage phase, generally cannot be known a priori. The current statistical methods for fullsib analysis either were based on a simplified assumption that markers are segregating but QTL fixed in a fullsib family (Haleyet al. 1994) or failed to consider the influence of incorrectly characterizing a markerQTL linkage phase on parameter estimation when both markers and QTL are assumed to be segregating (Knottet al. 1997). Linkage phase affects statistical inference about QTL effect size and direction for a fixedmodel approach, although this problem does not occur for a random modelbased mapping approach (Xu and Atchley 1995).
Second, our model can analyze all possible different types of markers and can test how a QTL is segregating in a family. Most of the current studies consider only fully informative markers. For example, Xu (1996) proposed a fullsib familymapping approach by assuming four different alleles at each marker and QTL. This approach is likely limited because the genome of an outcrossing species is often covered by different types of polymorphism markers, forming many different cross types when two parents are crossed (Table 1). Third, our model provides a way of simultaneously estimating linkage phases and QTL parameters within a unified framework. Simulation studies suggest that this unified framework has power to characterize a most likely linkage phase and also displays increased power to detect a significant QTL (Figure 2). For two heterozygous parents used to generate a fullsib family, there are multiple linkage phases, but only one is correct. For pure marker analysis, maximumlikelihood approaches can be used to select a most likely linkage phase because different phases correspond to different LR values (Wuet al. 2002). But, using two markers to infer a QTL, all different linkage phases theoretically give the identical LR (Figure 2), which indicates that it is not possible to correctly detect a linkage phase on the basis of likelihood analysis. If QTL identification is based on a wrong linkage phase, the estimation of QTL additive and dominant effects will have an inverse sign.
The correct characterization of a linkage phase between the QTL and markers for a practical data set is not only important for parameter estimation and model selection, but also essential for the application of molecular results to genetic improvement programs. For a genetic breeding program, we need to know the direction of genetic effects to make an efficient markerassisted selection. Suppose a dominant marker allele is in a coupling phase with the positive allele of a QTL. Thus, the selection for dominant marker alleles can lead to improved phenotypes due to favorable QTL alleles. Without this knowledge, however, it is possible to select the negative allele of this QTL by using the marker allele if it is based only on the significant relationship between the marker and QTL.
The robustness and performance of our statistical method has been examined through extensive simulations. One of the most important findings is that improvements in the accuracy and precision of QTL parameters can be more substantial with increased heritabilities than with increased sample sizes (Table 2). In practice, this conclusion will have important implications for framing an optimal experiment design for precise estimation of QTL parameters. To increase parameter precision, for example, special care should be paid to the use of silvicultural measurements to increase site homogeneity, rather than planting a huge sample size on largescale, nonuniform sites.
The model proposed is based on simple interval mapping for a single QTL affecting a quantitative trait using a mixed set of marker types. The theory extended to capture information provided by other markers outside interval markers considered is straightforward (see Zeng 1994) and simulations can be similarly conducted to test the analytical advantages and disadvantages of including more markers in our analysis model. Also, a more important statistical aspect of QTLmapping models is to simultaneously map multiple linked QTL for a trait. MultipleQTL analysis obviously can be closer to biological reality because many traits are actually polygenic (Lynch and Walsh 1998) and can also increase the power to detect QTL of smaller effects from an analytical perspective. Several QTL may be located on the same linkage group or different groups. In addition to the modeling of more gene actions and interactions between different QTL, multipleQTL analyses include increased linkage phase combinations relative to the marker intervals on which different QTL are located. It is possible that these more complex models can be solved using Markov chain Monte Carlo algorithms (Robert and Casella 1999; Sillanpaa and Arjas 1999; Xu and Yi 2000). For a broader application, we have proposed a unified framework for simultaneous maximumlikelihood estimation of linkage phases and QTL actions and interactions in our computer program.
Acknowledgments
This work is partially supported by an Outstanding Young Investigators Award of the National Science Foundation of China (30128017), a University of Florida Research Opportunity Fund (02050259), and a University of South Florida Biodefense grant (722206112) to R.W., and National Science Foundation of China grant (30000097) to X.Y.L. The publication of this manuscript is approved as journal series no. R09202 by the Florida Agricultural Experiment Station.
APPENDIX A
We present the formulas for obtaining the MLEs of the unknown parameters Ω = (μ_{uv}, σ^{2}, Φ_{j})^{T} in the M step. For the distribution parameters within the mixture model, we have
APPENDIX B
The pattern and structure of an incidence matrix (I^{k}) relating the zygotic genotypes to phenotypes for marker M^{k} depend on the type of this marker (Table 1). When marker M^{k} is from marker types A, B_{1}, B_{2}, B_{3}, C, D_{1}, and D_{2}, we have
Footnotes

Communicating editor: S. Tavaré
 Received July 13, 2002.
 Accepted June 17, 2003.
 Copyright © 2003 by the Genetics Society of America