## Abstract

F_{2} populations are commonly used in genetic studies of animals and plants. For simplicity, most quantitative trait locus or loci (QTL) mapping methods have been developed on the basis of populations having two distinct genotypes at each polymorphic marker or gene locus. In this study, we demonstrate that dominance can cause the interactions between markers and propose an inclusive linear model that includes marker variables and marker interactions so as to completely control both additive and dominance effects of QTL. The proposed linear model is the theoretical basis for inclusive composite-interval QTL mapping (ICIM) for F_{2} populations, which consists of two steps: first, the best regression model is selected by stepwise regression, which approximately identifies markers and marker interactions explaining both additive and dominance variations; second, the interval mapping approach is applied to the phenotypic values adjusted by the regression model selected in the first step. Due to the limited mapping population size, the large number of variables, and multicollinearity between variables, coefficients in the inclusive linear model cannot be accurately determined in the first step. Interval mapping is necessary in the second step to fine tune the QTL to their true positions. The efficiency of including marker interactions in mapping additive and dominance QTL was demonstrated by extensive simulations using three QTL distribution models with two population sizes and an actual rice F_{2} population.

SIGNIFICANT progress in the development of polymorphic molecular markers has led to the intensive use of quantitative trait locus or loci (QTL) mapping in genetically segregating populations (Paterson *et al*. 1991; Lynch and Walsh 1998; Mackay 2001; Barton and Keightley 2002; Doerge 2002). A number of statistical methods have been developed for QTL detection and effect estimation. For regression-based methods, see Haley and Knott (1992), Martinez and Curnow (1992), Haley *et al*. (1994), Wright and Mowers (1994), Whittaker *et al*. (1996), and Feenstra *et al*. (2006); for maximum-likelihood-based methods, see Lander and Botstein (1989), Knott and Haley (1992), Zeng (1994), Kao *et al*. (1999), and Li *et al*. (2007, 2008); and for Bayesian model-based methods, see Satagopan *et al*. (1996), Ball (2001), Sen and Churchill (2001), Sillanpää and Corander (2002), Yi *et al*. (2003), and Bogdan *et al*. (2004).

For simplicity, most QTL mapping methods (here we mean linkage mapping for quantitatively inherited traits in biparental populations derived through controlled fertilization rather than association mapping in naturally mated populations) have been developed on the basis of backcross populations, doubled haploids, or recombinant inbred lines derived from two parental lines (represented by P_{1} and P_{2}), where two individual genotypes occur at each marker locus or QTL. F_{2} populations have been widely used in genetic studies of animals and plants since the rediscovery of Mendel's hybridization experiments. Relatively few methods have been developed on the basis of F_{2} populations, and dominance has sometimes been ignored (Wright and Mowers 1994; Whittaker *et al*. 1996; Jia and Xu 2007). Using principles of interval mapping (IM) similar to those proposed by Lander and Botstein (1989), Knott and Haley (1992) investigated maximum-likelihood methods for QTL mapping in F_{2} populations using simulated data. However, it is generally agreed that the mapping power of IM is low due to the lack of background control, and linked QTL cannot be properly separated (Zeng 1994). For F_{2} crosses between outbred lines, a mixed model was proposed to account for the variation both between and within lines (Pérez-Enciso and Varona 2000). On the basis of composite-interval mapping (CIM) (Zeng 1994), Jiang and Zeng (1995) used simulated F_{2} populations to demonstrate multiple-trait QTL mapping.

In populations consisting of two distinct genotypes, QTL mapping is focused on additive effects, even though the additive effect is defined differently in different populations. For example, in a backcross where P_{1} was used as the recurrent parent, the additive effect at a specific locus is normally defined as half of the difference between the P_{1} genotype and the F_{1} genotype (Zeng 1994). In doubled haploids or recombinant inbred lines, the additive effect is defined as half of the difference between the P_{1} genotype and the P_{2} genotype. Some authors have claimed that their methods could be extended to F_{2} populations (Zeng 1994). However, we report here that dominance can unexpectedly complicate the QTL mapping procedure by causing interactions between markers. As a result, the interactions detected between markers may be caused by the dominance effect of a single QTL, rather than by real epistasis between interacting QTL.

Due to the lack of suitable QTL mapping methods for epistasis, some authors have used two-way ANOVA between markers to gain a rough idea of the importance of epistasis (Yu *et al*. 1997; Hua *et al*. 2003). More recently, Bayesian models have been widely investigated for mapping epistasis (Ball 2001; Broman and Speed 2002; Yi *et al*. 2003; Baierl *et al*. 2006). ANOVA between marker classes at one marker locus or two marker loci and some Bayesian model-based QTL mapping methods are valid under the assumption that QTL are completely linked with markers. Therefore, if QTL are located between marker intervals, false interacting QTL caused by the dominance effect may be detected by using these methods.

In this study, we report an inclusive linear model that includes interaction variables between two flanking markers, capable of completely absorbing both additive and dominance effects of QTL. On the basis of the linear model, we propose the inclusive composite-interval mapping (ICIM) suitable for QTL studies using F_{2} populations. Simulations were conducted to compare ICIM with CIM, and an actual F_{2} population was used to investigate QTL affecting plant height in rice (*Oryza sativa* L.).

## MATERIALS AND METHODS

#### One-QTL model in F_{2} populations:

For one QTL (*Q* and *q* are the two alleles) in F_{2} populations, the genotypic value of an individual with a known QTL genotype, *i.e*., *QQ*, *Qq*, or *qq*, is written as

$$G = \mu + a w + d v, \tag{1}$$

where $\mu$ is the mean of the two homozygous genotypes *QQ* and *qq*, *a* is the additive genetic effect, *d* is the dominance effect, and *w* and *v* are indicators for QTL genotypes valued at 1 and 0 for *QQ*, 0 and 1 for *Qq*, and −1 and 0 for *qq*.

For two codominant markers (*A*-*a* and *B*-*b*) flanking the QTL, nine marker classes can be found in F_{2} (Table 1). In F_{2} populations, two indicators (represented by *x* and *y*, respectively) occur for each marker locus, similarly defined as indicators *w* and *v* for a QTL in model (1). The expectations of *w* and *v*, *i.e*., *E*(*w*) and *E*(*v*), can be calculated from the frequencies of the three QTL genotypes in each marker class (Table 1). In QTL mapping, the QTL genotype of an individual is usually unknown, but the marker type or the class of its flanking markers is known. In general, we can define the expected genotypic value (the last column in Table 1) of an individual with known marker types as

$$E(G \mid x_1, y_1, x_2, y_2) = \mu + a\,E(w \mid x_1, y_1, x_2, y_2) + d\,E(v \mid x_1, y_1, x_2, y_2), \tag{2}$$

where *x*_{1} and *y*_{1} are the indicators for the left marker, *x*_{2} and *y*_{2} are the indicators for the right marker, *x*_{1} and *x*_{2} have similar values to *w*, and *y*_{1} and *y*_{2} have similar values to *v*. Similar to two genes, we can define the additive effects of the two markers, *i.e*., (*a*)*A*_{1} and (*a*)*A*_{2}, dominance effects of the two markers, *i.e*., (*d*)*D*_{1} and (*d*)*D*_{2}, and various interactions between the two markers, *i.e*., (*d*)*AA*_{12}, *AD*_{12}, *DA*_{12}, and (*d*)*DD*_{12} in Equation 3, where $\mu'$ is the mean of the four homozygous marker classes (Table 1). Writing one equation for each of the nine marker classes gives

$$E(G \mid x_1, y_1, x_2, y_2) = \mu' + (a)A_1 x_1 + (d)D_1 y_1 + (a)A_2 x_2 + (d)D_2 y_2 + (d)AA_{12}\, x_1 x_2 + AD_{12}\, x_1 y_2 + DA_{12}\, y_1 x_2 + (d)DD_{12}\, y_1 y_2. \tag{3}$$

By solving the above linear equations, the relationship between marker effects and QTL effects can be identified (Equation 4). Clearly, the additive QTL effect (*a*) causes only additive marker effects, *i.e*., (*a*)*A*_{1} and (*a*)*A*_{2}, but the dominance QTL effect (*d*) causes additive-by-additive and dominance-by-dominance marker interactions, *i.e*., (*d*)*AA*_{12} and (*d*)*DD*_{12}, as well as dominance marker effects, *i.e*., (*d*)*D*_{1} and (*d*)*D*_{2}. The genetic model used in Equations 1 and 3 is usually called the F_{∞} model (Zeng *et al*. 2005). We found that the use of other models such as the F_{2} model or the G2A model (Zeng *et al*. 2005) cannot eliminate the influence of the dominance effect on the interactions between markers either (results not shown).

In Equation 4, *f*_{1}, *f*_{2}, *f*_{3}, *g*_{1}, *g*_{2}, *g*_{3}, *g*_{4}, and *g*_{5} defined in Table 1 (similar to Table 1 in Haley and Knott 1992) are functions of recombination frequencies and independent of QTL effects. With the notation introduced in Equation 5, the expectations of *w* and *v* under each marker class can be shown to follow Equations 6 and 7, respectively. Equation 6 has been widely used in mapping QTL with additive effects regardless of the statistical method, *e.g*., regression analysis, maximum likelihood, or Bayesian models (for examples see Zeng 1994; Whittaker *et al*. 1996; Kao *et al*. 1999; Ball 2001; Sen and Churchill 2001; Sillanpää and Corander 2002; Yi *et al*. 2003; Bogdan *et al*. 2004; Feenstra *et al*. 2006; Li *et al*. 2007). However, we have not seen Equation 7 used in QTL mapping studies of F_{2} populations.

Using Equations 6 and 7, the genotypic value of an F_{2} individual with known marker class can be represented by marker variables and two-marker interactions as

$$E(G) = \mu' + (a)A_1 x_1 + (a)A_2 x_2 + (d)D_1 y_1 + (d)D_2 y_2 + (d)AA_{12}\, x_1 x_2 + (d)DD_{12}\, y_1 y_2, \tag{8}$$

where $\mu' = \mu + (d)\mu$, representing the mean of the four homozygous marker classes (*i.e*., *AABB*, *AAbb*, *aaBB*, and *aabb* in Table 1).

For clarity, we added the symbols of QTL effects to various marker effects in Equations 3, 4, and 8. For example, (*d*)*μ* is the additional mean contributed by QTL dominance, (*a*)*A*_{1} is the additive effect of the left marker caused by the QTL additive effect, (*d*)*AA*_{12} is the additive-by-additive effect between the left and right markers caused by the QTL dominance effect, and so on. Model (8) is a completely fitted model, and its coefficients contain all the information regarding QTL location and effects. In other words, the additive and dominance effects of the flanked QTL are completely absorbed by the six variables in model (8). The nonzero marker interactions (*d*)*AA*_{12} and (*d*)*DD*_{12}, caused by the dominance effect, indicate that marker variables by themselves cannot completely absorb the effects of a QTL located between the two markers.
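The statement that QTL dominance, but not QTL additivity, induces an additive-by-additive interaction between the flanking markers can be checked numerically. The following minimal sketch enumerates F_{1} gametes for a QTL in the middle of a 10-cM interval and computes *E*(*w*) and *E*(*v*) for the nine marker classes. It assumes the Haldane mapping function and no crossover interference; the function names are illustrative and are not taken from any software used in this study.

```python
import math
from itertools import product


def haldane(cm):
    """Recombination frequency for a distance in cM (Haldane function, an assumption)."""
    return 0.5 * (1.0 - math.exp(-2.0 * cm / 100.0))


def gamete_probs(r1, r2):
    """P(left marker, QTL, right marker) for gametes of the F1 (AQB/aqb).

    Alleles are coded 1 (from P1) or 0 (from P2). Crossovers in the two
    sub-intervals are assumed independent (no interference).
    """
    probs = {}
    for m1, q, m2 in product((1, 0), repeat=3):
        p = 0.5
        p *= r1 if m1 != q else 1.0 - r1
        p *= r2 if q != m2 else 1.0 - r2
        probs[(m1, q, m2)] = p
    return probs


def expected_indicators(r1, r2):
    """Return {(nA, nB): (E(w), E(v))} for the nine flanking-marker classes.

    nA and nB count P1 alleles at the left and right markers.
    """
    gametes = gamete_probs(r1, r2)
    acc = {}
    for (h1, p1), (h2, p2) in product(gametes.items(), repeat=2):
        key = (h1[0] + h2[0], h1[2] + h2[2])
        n_q = h1[1] + h2[1]
        w = n_q - 1                 # 1, 0, -1 for QQ, Qq, qq
        v = 1 if n_q == 1 else 0    # 1 only for the heterozygote Qq
        cell = acc.setdefault(key, [0.0, 0.0, 0.0])
        cell[0] += p1 * p2
        cell[1] += p1 * p2 * w
        cell[2] += p1 * p2 * v
    return {k: (sw / p, sv / p) for k, (p, sw, sv) in acc.items()}


def aa_contrast(values):
    """Additive-by-additive contrast over the four double-homozygous classes."""
    return (values[(2, 2)] - values[(2, 0)] - values[(0, 2)] + values[(0, 0)]) / 4.0


r = haldane(5.0)  # QTL 5 cM from each flanking marker
classes = expected_indicators(r, r)
e_w = {k: wv[0] for k, wv in classes.items()}
e_v = {k: wv[1] for k, wv in classes.items()}
aa_from_additive = aa_contrast(e_w)   # contribution per unit of a
aa_from_dominance = aa_contrast(e_v)  # contribution per unit of d
print(aa_from_additive, aa_from_dominance)
```

Under these assumptions the contrast computed from *E*(*w*) vanishes, whereas the contrast computed from *E*(*v*) does not, so any nonzero *d* contributes an additive-by-additive marker term even though only one QTL is present.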

#### The inclusive linear model for multiple QTL:

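For succinctness, we assume there are *m* QTL located in *m* intervals defined by *m* + 1 markers on one chromosome. The genotypic value of an F_{2} individual is defined as

$$G = \mu + \sum_{j=1}^{m} \left( a_j w_j + d_j v_j \right), \tag{9}$$

where *w*_{j} and *v*_{j} are the indicators for genotypes at the *j*th QTL. By using Equations 6 and 7, the genotypic value of an F_{2} individual with known marker types can be reorganized as a linear function of the marker indicators *x*_{j} and *y*_{j} and of the products of indicators at adjacent markers. Therefore, the inclusive linear model simultaneously containing all markers and phenotyping errors is

$$P = b_0 + \sum_{j=1}^{m+1} \left( b_j x_j + c_j y_j \right) + \sum_{j=1}^{m} \left( b_{j,j+1}\, x_j x_{j+1} + c_{j,j+1}\, y_j y_{j+1} \right) + \varepsilon, \tag{10}$$

where *P* is the phenotypic value of the trait of interest, $b_0$ is the intercept, the remaining *b* and *c* terms are the regression coefficients of the marker variables and marker interactions, and $\varepsilon$ is the random environmental error.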

It can be seen that coefficients in model (10) are affected only by neighboring QTL. In other words, QTL effects will be completely absorbed by the six variables of the two closest markers. Model (10) is suitable for QTL mapping in F_{2} populations, as it completely explains both additive and dominance variations. In some studies, marker interactions were not included (for examples see Jiang and Zeng 1995; Kao *et al*. 1999; Jia and Xu 2007), which may bias the QTL mapping results and be problematic when extending to epistatic mapping.

#### ICIM in F_{2} populations:

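Assume there are *n* individuals in an F_{2} population. Similar to QTL mapping for other populations (Li *et al*. 2007, 2008), we adopted a two-step mapping strategy. In the first step, stepwise regression was used to estimate the parameters in model (10). Coefficients of those variables not retained by stepwise regression were set at 0. However, we did not exclude the possibility that other model selection methods (Miller 1990; Piepho and Gauch 2001) may achieve similar or better performance in model selection than stepwise regression. In the second step, traditional interval mapping (Lander and Botstein 1989) was conducted on adjusted phenotypic values; *i.e*.,

$$\Delta P_i = P_i - \sum_{j \neq k,\, k+1} \left( \hat{b}_j x_{ij} + \hat{c}_j y_{ij} \right) - \sum_{j \neq k} \left( \hat{b}_{j,j+1}\, x_{ij} x_{i,j+1} + \hat{c}_{j,j+1}\, y_{ij} y_{i,j+1} \right), \tag{11}$$

where *k* and *k* + 1 represent the two flanking markers of the current testing position, $i = 1, 2, \ldots, n$ represents each F_{2} individual, the *b* and *c* terms are the coefficients of the marker variables and marker interactions in model (10), and the circumflex means “estimated.” Under the condition of isolated QTL (Whittaker *et al*. 1996), adjusted values in Equation 11 contain all the location and effect information of QTL in the current interval, but at the same time, QTL in other chromosomal intervals have been completely controlled. At a testing position in the interval [*k*, *k* + 1], phenotypes of the three QTL genotypes *QQ*, *Qq*, and *qq* were assumed to be normally distributed as $N(\mu_k, \sigma^2)$, where *k* = 1, 2, 3, representing the three QTL genotypes, respectively. The two hypotheses used to test the existence of QTL at the scanning position are

$$H_0\colon \mu_1 = \mu_2 = \mu_3 \quad vs. \quad H_A\colon \text{at least two of } \mu_1, \mu_2, \text{ and } \mu_3 \text{ are not equal}.$$

The log-likelihood under H_{A} is, therefore,

$$L_A = \sum_{j=1}^{9} \sum_{i \in S_j} \log \left[ \sum_{k=1}^{3} \pi_{jk}\, f(\Delta P_i; \mu_k, \sigma^2) \right],$$

where $S_j$ denotes individuals belonging to the *j*th marker class (*j* = 1, 2, … , 9; Table 1), $\pi_{jk}$ (*k* = 1, 2, 3) is the proportion of the *k*th QTL genotype in the *j*th class, and $f(\cdot\,; \mu_k, \sigma^2)$ is the density function of the normal distribution $N(\mu_k, \sigma^2)$.

Most individuals in marker classes 1, 5, and 9 have QTL genotypes *QQ*, *Qq*, and *qq*, respectively. Hence, the initial parameters used in the EM algorithm (Dempster *et al*. 1977; Li *et al*. 2007) can be defined as the means of the adjusted phenotypic values in these three classes, for $\mu_1^{(0)}$, $\mu_2^{(0)}$, and $\mu_3^{(0)}$, and the corresponding within-class variance, for $\sigma^{2(0)}$.

In the E-step, the posterior probabilities of an individual belonging to the three QTL genotypes were calculated as

$$w_{ik} = \frac{\pi_{jk}\, f(\Delta P_i; \mu_k, \sigma^2)}{\sum_{l=1}^{3} \pi_{jl}\, f(\Delta P_i; \mu_l, \sigma^2)},$$

where $i \in S_j$. In the M-step, parameters in the maximum-likelihood equation were updated by

$$\mu_k = \frac{\sum_{i=1}^{n} w_{ik}\, \Delta P_i}{\sum_{i=1}^{n} w_{ik}} \quad \text{and} \quad \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{3} w_{ik} \left( \Delta P_i - \mu_k \right)^2.$$

The genetic effects in model (1) were therefore estimated by

$$\hat{\mu} = \tfrac{1}{2} (\hat{\mu}_1 + \hat{\mu}_3), \quad \hat{a} = \tfrac{1}{2} (\hat{\mu}_1 - \hat{\mu}_3), \quad \hat{d} = \hat{\mu}_2 - \tfrac{1}{2} (\hat{\mu}_1 + \hat{\mu}_3).$$

Under the null hypothesis, the three QTL genotypes follow the same normal distribution, denoted by $N(\mu_0, \sigma_0^2)$. Parameters under H_{0} were calculated as

$$\hat{\mu}_0 = \frac{1}{n} \sum_{i=1}^{n} \Delta P_i \quad \text{and} \quad \hat{\sigma}_0^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \Delta P_i - \hat{\mu}_0 \right)^2,$$

from which the maximum likelihood under H_{0} and the LOD score between H_{A} and H_{0} can be calculated. Additional hypotheses can be built to further test if the additive or the dominance effect is significant; this is not discussed in detail in this article.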
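The test at a single scanning position can be sketched in a few lines. The code below is a minimal illustration of a three-component normal mixture with known mixing proportions and a common variance, fitted by EM and compared with the single-normal null model. It is not the QTL IciMapping implementation. As simplifying assumptions, the function `em_scan_position` (a hypothetical name) takes the genotype proportions per individual rather than per marker class, and it starts from proportion-weighted means instead of the class means of marker classes 1, 5, and 9.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_loglik(values, priors, means, var):
    """Log-likelihood of the three-genotype mixture with common variance."""
    return sum(
        math.log(sum(p * normal_pdf(y, m, var) for p, m in zip(pri, means)))
        for y, pri in zip(values, priors)
    )


def em_scan_position(values, priors, max_iter=500, tol=1e-10):
    """Fit H_A by EM and return the LOD score and effect estimates.

    values: adjusted phenotypic values, one per individual.
    priors: (pQQ, pQq, pqq) for each individual, i.e. the proportions of the
            three QTL genotypes in that individual's marker class.
    """
    n = len(values)
    # Null model: one normal distribution for all individuals.
    mean0 = sum(values) / n
    var0 = sum((y - mean0) ** 2 for y in values) / n
    ll0 = sum(math.log(normal_pdf(y, mean0, var0)) for y in values)

    # Starting values (a simplification, see the lead-in paragraph).
    means = [
        sum(pri[k] * y for y, pri in zip(values, priors)) / sum(pri[k] for pri in priors)
        for k in range(3)
    ]
    var = var0
    ll_old = mixture_loglik(values, priors, means, var)
    for _ in range(max_iter):
        # E-step: posterior genotype probabilities for each individual.
        post = []
        for y, pri in zip(values, priors):
            dens = [p * normal_pdf(y, m, var) for p, m in zip(pri, means)]
            total = sum(dens)
            post.append([d / total for d in dens])
        # M-step: update the genotype means and the common variance.
        for k in range(3):
            weight = sum(row[k] for row in post)
            means[k] = sum(row[k] * y for row, y in zip(post, values)) / weight
        var = sum(
            row[k] * (y - means[k]) ** 2
            for row, y in zip(post, values)
            for k in range(3)
        ) / n
        ll_new = mixture_loglik(values, priors, means, var)
        if ll_new - ll_old < tol:
            break
        ll_old = ll_new

    ll_a = mixture_loglik(values, priors, means, var)
    mu = 0.5 * (means[0] + means[2])
    return {
        "lod": (ll_a - ll0) / math.log(10.0),
        "mu": mu,
        "a": 0.5 * (means[0] - means[2]),
        "d": means[1] - mu,
        "var": var,
    }
```

In a full scan this function would be called at each testing position with the adjusted values of Equation 11 and the genotype proportions implied by the flanking-marker classes at that position.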

#### QTL distribution models in simulation:

We considered six QTL with different levels of dominance and a genome consisting of eight chromosomes in our simulation studies (Table 2). Each chromosome is of 140 cM, with 15 evenly distributed codominant markers. QTL1 has additive effect 1, without a dominance effect. QTL2 has dominance effect 1, without an additive effect. QTL3 can be viewed as completely dominant, while QTL4 is completely recessive. Both QTL5 and QTL6 show overdominance, but in different directions. No interactions between QTL were considered. Each QTL was assumed to be located in the middle of a marker interval.

To investigate the effect of linkage on QTL mapping, we considered three QTL distribution models (Table 2). QTL were distributed on different chromosomes in model I, and two QTL were linked on each of the first three chromosomes in models II and III. In model I, QTL5 and QTL6 each explained 24.3% of genotypic variation and 17.0% of the phenotypic variance under heritability 0.7. QTL2 explained the least genotypic and phenotypic variation among the six defined QTL (Table 2).
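The percentages quoted for model I can be reproduced from the F_{2} variance of a single locus, $a^2/2 + d^2/4$, when the six QTL are unlinked and do not interact. The sketch below is illustrative only. Table 2 is not reproduced here, so the additive effects of QTL3 to QTL6 are assumed to be 1, an assumption that is consistent with the descriptions of complete dominance and complete recessiveness given above.

```python
def f2_qtl_variance(a, d):
    """Genetic variance of one QTL in an F2 (genotype frequencies 1/4, 1/2, 1/4)."""
    return a * a / 2.0 + d * d / 4.0


# (additive effect, dominance effect); values for QTL3-QTL6 are assumptions.
effects = {
    "QTL1": (1.0, 0.0),
    "QTL2": (0.0, 1.0),
    "QTL3": (1.0, 1.0),
    "QTL4": (1.0, -1.0),
    "QTL5": (1.0, 1.5),
    "QTL6": (1.0, -1.5),
}
heritability = 0.7

variances = {name: f2_qtl_variance(a, d) for name, (a, d) in effects.items()}
total = sum(variances.values())  # unlinked, non-interacting QTL: variances add

for name, var in variances.items():
    genotypic_pct = 100.0 * var / total
    phenotypic_pct = heritability * genotypic_pct
    print(name, round(genotypic_pct, 1), round(phenotypic_pct, 1))
```

With these assumed effects, QTL5 and QTL6 each account for 24.3% of the genotypic variance and 17.0% of the phenotypic variance, and QTL2 accounts for the least.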

F_{2} mapping populations were simulated by the genetics and breeding simulation tool of QuLine, formerly called QuCim (Wang *et al*. 2003, 2004). ICIM was implemented by the software QTL IciMapping, and CIM was implemented by the software QTL Cartographer (Wang *et al*. 2005). For CIM, we applied “Model 6: Standard Model” and “3. Forward & Backward Method” available in Cartographer. The two probabilities for entering and removing variables were set at 0.01 and 0.02. For ICIM, the same probability levels were adopted in the first step of stepwise regression. The threshold LOD score was set at 3.0 for both methods.

#### One F_{2} population in rice:

The actual F_{2} population used in this study consists of 180 individuals and was derived by the Rice Research Institute, Sichuan Agricultural University (Ye *et al*. 2005, 2007). The cross was made in Chengdu, China, in July 2002 between the *indica* rice variety PA64s (full name: Pei'Ai 64s) and *japonica* rice variety Nipponbare. Nipponbare was completely sequenced in 2002, and PA64s was partially sequenced in the same year. The F_{1} population was planted in Hainan, China, in December 2002, and the F_{2} population was planted in Chengdu, China, in April 2003 for genotyping and phenotyping. A total of 137 SSR markers were screened for building the linkage map (Ye *et al*. 2005), and a number of agronomic traits were investigated in the field (Ye *et al*. 2005, 2007). The whole genome was of 2046.2 cM, and the average marker distance was 17.1 cM. Each of the 12 chromosomes had 6–12 relatively evenly distributed markers. We used ICIM for QTL mapping of plant height, where the two probabilities for entering and removing variables in the first step of stepwise regression were set at 0.01 and 0.02, and the threshold LOD score was set at 3.0.

## RESULTS

#### Expected effects of the flanking markers:

The expected additive, dominance, additive-by-additive, and dominance-by-dominance effects of the two nearest flanking markers associated with each defined QTL in Table 2 were calculated from Equation 4 and are shown in Table 3. When the dominance effect is zero, *i.e*., QTL1 (*a* = 1 and *d* = 0), the two flanking markers have only additive effects, the size of which is dependent on the QTL additive effect and its location between the two markers. In cases where the QTL was located at the center of its flanking marker interval, both markers have the same additive effect, which approximates half of the QTL additive effect (Table 3). Therefore, when the dominance effect can be ignored, or the additive effect is the only genetic effect of interest, including one marker indicator for each marker locus will allow the QTL additive effect to be absorbed. This is the case of QTL mapping in populations consisting of two individual genotypes (Li *et al*. 2007), where the additive is the only genetic effect of interest.

When there is no additive effect, *i.e*., QTL2 (*a* = 0 and *d* = 1), the two flanking markers do not have additive effects either, but they do have additive-by-additive interaction (Table 3). Obviously, this interaction was caused by the dominance effect of QTL2 and did not indicate there were two interacting QTL. When both additive and dominance effects are present, *i.e*., QTL3–QTL6, additive, dominance, additive-by-additive, and dominance-by-dominance effects can all occur on the two flanking markers (Table 3). The dominance effect of a QTL causes not only marker dominance effects, but also marker interactions (Equation 4 and Table 3). Under linkage equilibrium and when each marker allele had a frequency of 0.5, ANOVA indicated that marker interactions caused by QTL2 explained >20% of the variation, those caused by QTL3 and QTL4 each explained >5% of the variation, and those caused by QTL5 and QTL6 each explained ∼10% of the variation between marker classes (Table 3).

The results from Equation 4 and Table 3 clearly indicated that the dominance of a QTL could complicate the coefficients of the two markers flanking a QTL by causing interactions between markers. We used the F_{∞} model, *i.e*., Equation 3, to illustrate the phenomenon in this study. We have used other models such as the F_{2} or the G2A model (Zeng *et al*. 2005) and found they would neither eliminate the marker interactions caused by the QTL dominance effect nor make the mapping procedure less complicated. The consequence of this phenomenon is that QTL mapping focusing on estimation of marker effects may lead to erroneous conclusions about QTL locations and effects.

#### Comparison of ICIM with CIM:

In Figure 1, each simulated QTL was assigned to a confidence interval of 15 cM centered at the true QTL location, and the power for the confidence interval was estimated. QTL identified in other intervals were viewed as false positives. In the confidence interval, if multiple peaks occurred, only the highest one was counted. In other chromosome regions, all peaks higher than the LOD threshold of 3.0 were counted, regardless of the distance between the significant peaks (Li *et al*. 2007). Under population size 200, both ICIM and CIM resulted in high powers (*i.e*., >0.60) for QTL3–QTL6 (Figure 1, A, C, and E). QTL1 and QTL2 explain the least genetic variation (Table 2) and their detection powers were relatively low. The difference in powers between ICIM and CIM is minor, except for QTL1 and QTL2 in models I and II and QTL1–QTL3 in model III (Figure 1, A and E). The distribution of QTL has effects on their detection powers (Figure 1, C and E).

As expected, the increase in population size resulted in increased detection power for both methods (Figure 1A *vs*. 1B, Figure 1C *vs*. 1D, and Figure 1E *vs*. 1F). Under population size 500, both CIM and ICIM had powers close to 1 in detecting all QTL (Figure 1, B, D, and F). The false discovery rate (FDR) is defined as the proportion of false positives among all significant discoveries (Benjamini and Hochberg 1995). The FDR of ICIM was always lower than that of CIM (Figure 1). The increase in population size not only improved the detection power of ICIM, but also reduced its FDR. For CIM, the increase in population size improved its detection power, but did not reduce its FDR. As stated earlier, false positives were counted without considering a confidence interval; that is to say, any significant peaks that were not within the QTL confidence intervals were viewed as false positives, which resulted in a large number of false positives for both methods. On the other hand, this may indicate that a higher LOD threshold should be applied when using CIM or ICIM.
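The counting rules for power and FDR described above can be written compactly. The following sketch is one possible reading of those rules and not the code used to produce Figure 1. The function `power_and_fdr`, its arguments, and the representation of a peak as a (chromosome, position, LOD) tuple are assumptions made for illustration.

```python
def power_and_fdr(runs, true_qtl, half_width=7.5, lod_threshold=3.0):
    """Estimate per-QTL power and the overall false discovery rate.

    runs: one list of (chromosome, position_cM, lod) peaks per simulated population.
    true_qtl: list of (chromosome, position_cM) for the simulated QTL.
    half_width: half of the confidence interval centered at each true QTL.
    """
    hits = [0] * len(true_qtl)
    true_positives = 0
    false_positives = 0
    for peaks in runs:
        found = set()
        for chrom, pos, lod in peaks:
            if lod < lod_threshold:
                continue
            matches = [
                q for q, (c, p) in enumerate(true_qtl)
                if c == chrom and abs(pos - p) <= half_width
            ]
            if matches:
                # Several peaks in one interval count as a single detection.
                found.add(matches[0])
            else:
                # Every significant peak outside all intervals is a false positive.
                false_positives += 1
        for q in found:
            hits[q] += 1
        true_positives += len(found)
    power = [h / len(runs) for h in hits]
    discoveries = true_positives + false_positives
    fdr = false_positives / discoveries if discoveries else 0.0
    return power, fdr


# Two toy runs with two linked QTL at 25 and 55 cM on chromosome 1.
demo_runs = [
    [(1, 26.0, 5.0), (1, 24.0, 4.0), (1, 90.0, 3.5)],
    [(1, 54.0, 6.0), (2, 10.0, 2.0)],
]
demo_power, demo_fdr = power_and_fdr(demo_runs, [(1, 25.0), (1, 55.0)])
print(demo_power, demo_fdr)
```

In the toy input, the two peaks near 25 cM count once, the peak at 90 cM is a false positive, and the peak with LOD 2.0 is ignored because it is below the threshold.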

In Figure 2, power was calculated for every marker interval on the genome, which allows monitoring QTL locations if not located in the predefined intervals. It can be clearly seen that false positives were around the true QTL positions and were less likely to be located in chromosomal regions far from the predefined QTL or in chromosomes where no QTL were located (Figure 2). There is an obvious tendency for significant peaks identified by ICIM for QTL distribution models II and III to be closer to the true QTL locations (Figure 2, C–F), indicating that ICIM is more capable of dissecting linked QTL.

Estimated QTL location and effects from QTL distribution model II are shown in Table 4. Unbiased estimations of QTL locations and additive effects were observed for ICIM and CIM under the two population sizes. The dominance effects estimated by ICIM were less biased than those estimated by CIM, indicating the advantage of using model (10) in ICIM. Taking population size 500 as an example, the dominance effects estimated by ICIM were 0.05, 0.98, 0.82, −0.87, 1.38, and −1.38, corresponding to the true effects 0, 1, 1, −1, 1.5, and −1.5, respectively. However, the effects estimated by CIM were 0.39, 1.01, 0.61, −0.64, 0.94, and −0.90, respectively. Considering the higher detection power, lower FDR, and less biased estimation of dominance effect, we can conclude that ICIM built on the inclusive linear model (10) is a better method for mapping QTL with additive and dominance in F_{2} populations. The LOD score from ICIM was always higher than that from CIM (Table 4), indicating the residual variation is better controlled in ICIM.

#### Estimated QTL locations and effects from large simulated F_{2} populations:

To further illustrate the outcomes from ICIM, we conducted QTL mapping on the first simulated F_{2} population with 500 individuals from each of the three QTL distribution models (Figures 3, A–C, and 4, A–F). The genotypic values of the two parents and their F_{1} hybrid were 15, 5, and 11, respectively, for the three QTL models; the F_{1} value equals the midparent value of 10 plus the sum of the six dominance effects, which is 1. Phenotypic values in F_{2} for the three QTL distribution models show continuous distributions (Figure 3, A–C) that are similar to typical quantitative traits. There is no clear classification of the phenotype, and it is impossible to deduce the number of QTL without the assistance of molecular markers.

QTL mapping by ICIM found the difference in genetic mechanism for the three seemingly similar phenotypic distributions in Figure 3, A–C. For QTL distribution model I, six clear peaks on the first six chromosomes can be seen along the one-dimensional LOD profile, indicating six unlinked QTL (Figure 4A). The chromosomes or chromosomal regions not harboring QTL have LOD scores close to 0 (Figure 4A). The six peaks were close to the true QTL position, and the effects at those positions are shown in Table 5. The estimated positions were at 28, 53, 24, 57, 26, and 55 cM, corresponding to the true positions 25, 55, 25, 55, 25, and 55 cM on the first six chromosomes. Along with scanning, the additive and dominance effects (Figure 4B) and variation explained by QTL at the testing positions can also be estimated. The estimated effects at peak positions were close to the true effects in Table 3, although some discrepancies were observed.

For QTL distribution model II, six clear peaks, two each on the first three chromosomes, can be seen on the LOD profile (Figure 4C). The last five chromosomes do not have any QTL and have LOD scores close to 0. The estimated positions were at 21, 54, 26, 55, 24, and 55 cM, corresponding to the true positions 25, 55, 25, 55, 25, and 55 cM on the first three chromosomes. Some bias in estimated effects was observed (Table 5), especially the dominance effect of QTL2. Similar results from Figure 4, E and F, can be observed for QTL distribution model III.

#### QTL affecting plant height in rice:

The plant height of rice variety PA64s, a carrier of one major dwarfing gene, is 74.4 cm, while that of Nipponbare is 98.3 cm (Figure 3D). The distribution of plant height in their F_{2} populations is similar to those in Figure 3, A–C. There are a total of 24,660 (*i.e*., 180 × 137) marker points in the F_{2} population, 5131 belonging to the PA64s marker type, 6175 to the Nipponbare marker type, and 11,114 to the F_{1} marker type. A total of 2240 marker points were missing, representing 9.08% of total marker points. Segregation distortions were observed for a few markers as well. LOD scores, along with estimated additive and dominance effects along the rice genome, are shown in Figure 4, G and H. Obviously, the LOD profile in Figure 4G is more complicated than those in Figure 4, A, C, and E, indicating the genetic model with real data may be more complicated than those used in simulation. The other reasons for the rugged LOD profile may be the large amount of missing data and segregation distortions.

Under the LOD threshold of 3.0, eight QTL affecting plant height in the F_{2} population were identified: two each on chromosomes 1 and 3, one on chromosome 4, and three on chromosome 7 (Table 5). Locus *qPH1-2*, a major QTL explaining ∼30% of the phenotypic variation, has been detected by other methods (Ye *et al*. 2005). The PA64s allele at *qPH1-2* can reduce plant height by ∼10 cm, and the dominance effect is relatively small.

Few F_{2} individuals are shorter than PA64s (Figure 3D), indicating most, if not all, reduced-height alleles are harbored by PA64s. However, many F_{2} individual plants are taller than the taller parent Nipponbare (Figure 3D), which may indicate the presence of overdominance. Five QTL have negative additive effects (Table 5), indicating the reduced-height alleles at these loci are also from PA64s. Overdominance effects were observed for *qPH1-1*, *qPH3-1*, *qPH7-1*, *qPH7-2*, and *qPH7-3*, which explains the large number of F_{2} individuals that are taller than Nipponbare. For *qPH1-1* and *qPH7-1*, the additive effects were close to 0, indicating that these loci will be less likely to be detected in other populations, such as recombinant inbred lines, where heterozygosity is not present. It is therefore not unusual that different QTL are detected even when using populations derived from the same parents.

## DISCUSSION

#### Properties of the proposed inclusive linear model:

In an F_{2} population, all three genotypes at a locus are represented, which allows the estimation of additive effects and dominance deviations for individual QTL (Paterson *et al*. 1991). At the same time, the genetic analysis can be very complicated, as more genetic parameters have to be considered simultaneously. In this study, we proposed an inclusive linear model where the included marker variables can completely explain the additive and dominance effects of QTL. Model (10), built on solid genetic and statistical theories, is the theoretical basis for QTL mapping in F_{2} populations. It has properties similar to those reported by Zeng (1994) for CIM, which are summarized as follows.

##### Property 1:

The QTL additive effect causes marker additive effects, while the QTL dominance effect causes marker dominance effects, as well as additive-by-additive and dominance-by-dominance interactions between the two flanking markers. By including two multiplication variables between flanking markers, the additive and dominance effects of one QTL can be completely absorbed. This property comes from Equations 6 and 7.

##### Property 2:

Assuming the additivity of QTL effects on a phenotypic trait, *i.e*., model (9), the expectation of a main marker effect in model (10), *i.e*., the coefficient of *x*_{j} or *y*_{j}, depends only on those QTL located in the two intervals that share the current marker. The expectation of a marker interaction in model (10), *i.e*., the coefficient of *x*_{j} × *x*_{j+1} or *y*_{j} × *y*_{j+1}, depends only on the QTL located in the interval flanked by the two markers. This property can be seen from the derivation of Equation 10. Thereby, under the condition of isolated QTL (Whittaker *et al*. 1996), the six coefficients of the *j*th marker interval, *i.e*., those of *x*_{j}, *x*_{j+1}, *y*_{j}, *y*_{j+1}, *x*_{j} × *x*_{j+1}, and *y*_{j} × *y*_{j+1}, contain, and contain only, the effect and location information of the QTL located in the interval.

##### Property 3:

Under the condition of isolated QTL, adjusted phenotypic values by Equation 11 retain the effect and location information of the QTL located in the current interval; at the same time, QTL in other intervals and chromosomes have been controlled. Therefore, conditioning on both linked and unlinked markers in the second step of interval mapping reduces the sampling variance of the test statistic by controlling the residual genetic variation, thus increasing the power of QTL mapping.

#### Marker coefficients are biased in the first step of model selection using stepwise regression:

In the QTL and marker distribution model used in our simulation study, there were a total of 464 variables included in model (10), *i.e*., *x*_{1}, *x*_{2}, … , *x*_{120}, *y*_{1}, *y*_{2}, … , *y*_{120}, *x*_{1} × *x*_{2}, … , *x*_{119} × *x*_{120}, *y*_{1} × *y*_{2}, … , and *y*_{119} × *y*_{120}, where the multiplication of the last marker in a chromosome with the first marker in the next chromosome was excluded. When the largest *P*-value for entering variables and the smallest *P*-value for removing variables were set at 0.01 and 0.02, only a few of the six variables associated with each QTL interval were picked up by stepwise regression. For QTL1 in distribution model I, only variable *x*_{4} was retained and its coefficient was estimated as 0.841. Without the second step of interval mapping, one could conclude that one additive QTL was located at 30 cM. For QTL1 in distribution models II and III, only variable *x*_{3} was retained and its coefficient was estimated as 0.648 and 0.713, respectively. Without the second step of interval mapping, one could conclude that one additive QTL was located at 20 cM. However, the second step of interval mapping found the largest LOD score was achieved at 21 cM for model II and at 25 cM for model III (Table 5), which are closer to, or the same as, the true QTL position. For QTL5 and QTL6, the interaction coefficients were more important (Table 2). Under distribution model II, the four variables for QTL6 retained by stepwise regression were *x*_{36}, *x*_{37}, *x*_{36} × *x*_{37}, and *y*_{36} × *y*_{37}. But this does not mean there were two interacting QTL located at 50 and 60 cM on chromosome 3. Rather, the dominance effect of QTL6 caused interactions between the 36th and 37th markers.
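The variable count and the marker positions quoted above follow directly from the simulated genome layout of eight chromosomes with 15 markers spaced 10 cM apart. The short sketch below enumerates the variables of model (10) and converts a marker index into a chromosome and position. The helper names and the string labels such as `x36*x37` are illustrative only.

```python
def model10_variables(n_chrom, markers_per_chrom):
    """List the variable labels of the inclusive linear model for one genome layout."""
    total = n_chrom * markers_per_chrom
    names = [f"x{j}" for j in range(1, total + 1)]
    names += [f"y{j}" for j in range(1, total + 1)]
    for prefix in ("x", "y"):
        for chrom in range(n_chrom):
            first = chrom * markers_per_chrom + 1
            # Products are formed only between adjacent markers on one chromosome.
            for j in range(first, first + markers_per_chrom - 1):
                names.append(f"{prefix}{j}*{prefix}{j + 1}")
    return names


def marker_location(index, markers_per_chrom=15, spacing_cm=10):
    """Return (chromosome, position in cM) of the marker with a 1-based index."""
    chrom = (index - 1) // markers_per_chrom + 1
    position = ((index - 1) % markers_per_chrom) * spacing_cm
    return chrom, position


names = model10_variables(8, 15)
print(len(names))            # number of candidate variables
print(marker_location(36))   # marker to the left of QTL6 in model II
print(marker_location(37))   # marker to the right of QTL6 in model II
```

The products that would join the last marker of one chromosome to the first marker of the next, such as `x15*x16`, are absent from the list, which gives 120 + 120 + 112 + 112 = 464 variables.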

Model (10) is a linear regression model, and the choice of variables is a typical model selection issue (Miller 1990). Treating QTL mapping as a model selection problem and using model selection criteria to identify the best model have been intensively investigated by many authors (Piepho and Gauch 2001; Broman and Speed 2002; Bogdan *et al*. 2004; Baierl *et al*. 2006). A number of statistical methods are available to search through the space of models, and various criteria can be used to select the best model (Miller 1990; Piepho and Gauch 2001). However, there is no general conclusion in statistics as to which model selection method is best (Miller 1990). In the first step of ICIM, we use stepwise regression for model selection. However, we do not exclude the possibility that other model selection methods may achieve similar or even better performance than the stepwise regression used in ICIM. If model selection methods better than stepwise regression are identified, they can readily be used in the first step of ICIM.

#### The second step of interval mapping is necessary in ICIM:

At first glance, the result of ICIM seems to depend on the identification of an appropriate regression model in the first step. However, the two-step approach we adopted in ICIM has the advantage that the best regression model in the first step does not need to be very close to the true model. Ideally, the second step of interval mapping can correct the imprecision of coefficient estimation in the first step. For all three QTL distribution models, a large bias was observed between the true marker effects in Table 3 and the estimated marker effects. In addition, some irrelevant variables were also selected by stepwise regression. Some were close to the markers flanking QTL, but some were not. For example, the marker for *x*_{53} in model I was close to the markers flanking QTL4, but the marker for *y*_{97} and the markers for *x*_{104} × *x*_{105} were on the seventh chromosome, where no QTL was located. However, all biases were apparently corrected to some extent by the second step of interval mapping (Figure 4, A–F; Table 5), which indicates the necessity of fine tuning using interval mapping in the second step of ICIM.

In ICIM, the inference of QTL is not built on the estimated coefficients in model (10). Rather, model (10) is used to control background genetic variation in the second step of interval mapping. In this sense, what matters more is how well model (10) predicts the background genetic effects that are used to adjust the phenotypic values in Equation 11. In regression theory, it is generally agreed that collinearity between regression variables in model (10) can seriously bias the estimation of their effects, but this undesirable bias does not extend to the model's fit (Miller 1990; Harrell 2001). In other words, collinearity does not affect predictions made on the same data set used to estimate the model parameters. This may explain the advantage of using the two-step strategy in ICIM.
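The contrast between unstable coefficients and a stable fit can be illustrated with a toy example. The sketch below is not the simulation used in this study: it assumes two nearly identical predictors, a true model in which only the first has an effect, and unit residual variance. All names and parameter values are hypothetical.

```python
import random


def fit_two_predictors(x1, x2, y):
    """Least squares for y = b1*x1 + b2*x2 (no intercept) via the normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def collinearity_demo(n=200, replicates=50, seed=1):
    """Return (mean absolute error of b2, mean RMS error of the fitted values)."""
    rng = random.Random(seed)
    coef_err, fit_err = 0.0, 0.0
    for _ in range(replicates):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # x2 is almost a copy of x1, mimicking two tightly correlated variables.
        x2 = [a + 0.01 * rng.gauss(0, 1) for a in x1]
        signal = list(x1)  # assumed true model: b1 = 1, b2 = 0
        y = [s + rng.gauss(0, 1) for s in signal]
        b1, b2 = fit_two_predictors(x1, x2, y)
        fitted = [b1 * a + b2 * b for a, b in zip(x1, x2)]
        coef_err += abs(b2)  # the true value of b2 is 0
        fit_err += (sum((f - s) ** 2 for f, s in zip(fitted, signal)) / n) ** 0.5
    return coef_err / replicates, fit_err / replicates


if __name__ == "__main__":
    coef_err, fit_err = collinearity_demo()
    print(f"mean |b2 error| = {coef_err:.2f}, mean RMS fit error = {fit_err:.2f}")
```

In this setting the estimated coefficient of the redundant predictor is typically far from its true value of zero, whereas the fitted values stay close to the underlying signal. That is the property relied on when the fitted model is used only to adjust phenotypic values.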

## Acknowledgments

The authors thank Ping Li, Rice Research Institute, Sichuan Agricultural University, for providing the rice F_{2} population. Software for ICIM, called QTL IciMapping, is available from http://www.isbreeding.net, and the genetics and breeding simulation tool of Quline is available from http://www.uq.edu.au/lcafs/qugene/. This work was supported by the National 863 Programme of China (no. 2006AA10Z1B1), the Natural Science Foundation of China (no. 30771351), the Bill and Melinda Gates Foundation (drought tolerant maize for Africa), and the Generation Challenge Programme of the Consultative Group for International Agricultural Research.

## Footnotes

Communicating editor: C. Haley

- Received May 30, 2008.
- Accepted July 24, 2008.

- Copyright © 2008 by the Genetics Society of America