# A Modified Algorithm for the Improvement of Composite Interval Mapping

^{*}School of Mathematical Sciences, Beijing Normal University, Beijing 100875, China,^{†}Institute of Crop Science and The National Key Facility for Crop Gene Resources and Genetic Improvement, Chinese Academy of Agricultural Sciences, Beijing 100081, China,^{‡}Crop Research Informatics Laboratory and Genetic Resources Enhancement Unit, CIMMYT, 06600 Mexico, D.F., Mexico and^{§}Primary Industries Research Victoria, Bundoora, Victoria 3086, Australia

- 1
*Corresponding author:*Institute of Crop Science and The National Key Facility for Crop Gene Resources and Genetic Improvement, Chinese Academy of Agricultural Sciences, No. 12 Zhongguancun South St., Beijing 100081, China. E-mail: wangjk{at}caas.net.cn

## Abstract

Composite interval mapping (CIM) is the most commonly used method for mapping quantitative trait loci (QTL) with populations derived from biparental crosses. However, the algorithm implemented in the popular QTL Cartographer software may not completely ensure all its advantageous properties. In addition, different background marker selection methods may give very different mapping results, and the nature of the preferred method is not clear. A modified algorithm called inclusive composite interval mapping (ICIM) is proposed in this article. In ICIM, marker selection is conducted only once through stepwise regression by considering all marker information simultaneously, and the phenotypic values are then adjusted by all markers retained in the regression equation except the two markers flanking the current mapping interval. The adjusted phenotypic values are finally used in interval mapping (IM). The modified algorithm has a simpler form than that used in CIM, but a faster convergence speed. ICIM retains all advantages of CIM over IM and avoids the possible increase of sampling variance and the complicated background marker selection process in CIM. Extensive simulations using two genomes and various genetic models indicated that ICIM has increased detection power, a reduced false detection rate, and less biased estimates of QTL effects.

THE rapid increase in availability of fine-scale genetic marker maps has led to the intensive use of QTL mapping in the genetic study of quantitative traits (Falconer and Mackay 1996; Doerge *et al*. 1997; Lynch and Walsh 1998; Kearsey 2002; Steinmetz *et al*. 2002; Wu and Lin 2006). A number of statistical methods have been developed for QTL detection and effect estimation (Lander and Botstein 1989; Haley and Knott 1992; Jansen 1994; Wright and Mowers 1994; Zeng 1994; Satagopan *et al*. 1996; Whittaker *et al*. 1996; Piepho and Gauch 2001; Sen and Churchill 2001; Broman and Speed 2002; van den Oord and Sullivan 2003; Xu 2003; Bogdan *et al*. 2004).

From a statistical perspective, methods for QTL mapping are based on three broad classes: regression (Haley and Knott 1992; Whittaker *et al*. 1996), maximum-likelihood (Doerge *et al*. 1997), and Bayesian models (Sillanpää and Corander 2002). The simplest single-marker analysis identifies QTL on the basis of the difference between the mean phenotypes of different marker groups, but cannot separate the estimates of recombination fraction and QTL effect (Soller *et al.* 1976; Doerge *et al*. 1997). Interval mapping (IM) is based on maximum-likelihood parameter estimation and provides a likelihood-ratio test for QTL position (Lander and Botstein 1989). Regression interval mapping was proposed to approximate maximum-likelihood interval mapping to save computation time at one or multiple genomic positions (Haley and Knott 1992; Martinez and Curnow 1992). The major disadvantage of IM is that the estimates of locations and effects of QTL may be biased when QTL are linked (Haley and Knott 1992; Martinez and Curnow 1992; Zeng 1994). Composite interval mapping (CIM) (Jansen 1994; Zeng 1994) combines IM with multiple-marker regression analysis, which controls the effects of QTL on other intervals or chromosomes onto the QTL that is being tested and thus increases the precision of QTL detection. More recently, the use of Bayesian models has been widely explored for QTL mapping (Satagopan *et al*. 1996; Uimari and Hoeschele 1997; Sen and Churchill 2001; Xu 2003; Bogdan *et al*. 2004; Wang *et al*. 2005a). However, methods based on Bayesian models have not been widely used in practice, partially due to the difficulty of choosing prior distributions, complexity of computation, and lack of user-friendly software.

Due to the accessibility of the freely available software QTL Cartographer (Wang *et al*. 2005b) CIM is now the most commonly used method for QTL mapping with populations derived from biparental crosses. However, in Zeng's algorithm, QTL effect at the current testing position and regression coefficients of the marker variables used to control genetic background were estimated simultaneously in an expectation and conditional maximization (ECM) algorithm. Thus, the same marker variable may have different coefficient estimates as the testing position changes along the chromosomes. The algorithm used in CIM cannot completely ensure that the effect of QTL at the current testing interval is not absorbed by the background marker variables and may result in biased estimation of the QTL effect (see Table 4 and Figure 1 in Zeng 1994).

In this article, we propose a modified algorithm to render CIM more inclusive of all marker data [inclusive composite interval mapping (ICIM)] and then compare ICIM with CIM through extensive simulations.

## MATERIALS AND METHODS

#### The linear regression model and its properties in QTL mapping:

For simplicity, it is supposed that two inbred parents P_{1} and P_{2} differ in *m* QTL, being distributed in *m* intervals flanked by *m* + 1 markers. The parental QTL genotype is assumed to be *Q*_{1}*Q*_{1}*Q*_{2}*Q*_{2}…*Q _{m}Q_{m}* for P

_{1}and

*q*

_{1}

*q*

_{1}

*q*

_{2}

*q*

_{2}…

*q*for P

_{m}q_{m}_{2}. We consider a backcross population where P

_{1}is the recurrent parent. For an individual in a backcross population

**X**= (

*x*

_{1},

*x*

_{2},…,

*x*,

_{m}*x*

_{m}_{+1}) represents marker variables that are 1 and −1, standing for the two marker types (homozygote and heterozygote), respectively, and

**G**= (

*g*

_{1},

*g*

_{2},…,

*g*) represents the QTL variables that are 1 and −1, standing for the two QTL types (homozygote and heterozygote), respectively. Additive effects of QTL are represented by

_{m}*a*

_{1},

*a*

_{2},…, and

*a*. Under the assumption of additivity of QTL effects, the genetic value

_{m}*G*of an individual under an additive genetic model can be written in the following form:(1)(Whittaker

*et al*. 1996).

The expectation of QTL genotype is dependent on the position of the *j*th QTL on the chromosomal interval flanked by the *j*th and (*j* + 1)th markers and the length of the interval (Zeng 1993; Wright and Mowers 1994; Whittaker *et al*. 1996); *i.e.*,(2)where and are functions of the three recombination fractions between the *j*th marker and *j*th QTL, between the *j*th QTL and (*j* + 1)th marker, and between the *j*th and (*j* + 1)th markers. Therefore, the expectation of the genotypic value *G* conditional on the known marker types can be written as a linear function of marker variables; *i.e*.,(3)where , (*j* = 2,…, *m*), and . The coefficient of the *j*th marker is affected by QTL only on intervals (*j* − 1, *j*) and (*j*, *j* + 1). If there are no QTL in the neighboring intervals of the current interval (*j*, *j* + 1), corresponding to the assumption of isolated QTL according to Whittaker *et al*. (1996), the two coefficients and contain all the position and additive effect information of the QTL in the interval (*j*, *j* + 1), which provides the theoretical basis for mapping additive QTL in CIM (Zeng 1994) and other regression mapping methods (Wright and Mowers 1994; Whittaker *et al*. 1996).

Suppose that we have a sample of *n* individuals from a backcross population with observations on a quantitative trait of interest and *m* + 1 ordered markers. The following linear regression model based on Equation 3 can be used in mapping additive QTL; *i.e*.,(4)where is the trait value of the *i*th individual in the mapping population; is the overall mean of the model; is a dummy variable for the genotype of the *i*th individual at the *j*th marker, taking value 1 for homozygote of marker type and −1 for heterozygote; is the regression coefficient of the phenotype on the *j*th marker conditional on all other markers; and is the residual random error that is assumed to be normally distributed.

According to Zeng (1994), the two major properties of CIM were:

*Property 1*: In the multiple-regression analysis, assuming additivity of QTL effects between loci (*i.e.*, ignoring epistasis), the expected partial regression coefficient of the trait on a marker depends only on those QTL that are located on the interval bracketed by the two neighboring markers and is unaffected by the effects of QTL located on other intervals.*Property 2*: Conditioning on unlinked markers in the multiple-regression analysis will reduce the sampling variance of the test statistic by controlling some residual genetic variation and thus will increase the power of QTL mapping.

Both properties come from the regression properties of regression model (4). In Zeng's algorithm, both QTL effect at the current testing interval and regression coefficients of the background markers were estimated simultaneously by an ECM algorithm. However, this algorithm may not completely ensure the two properties.

#### A modified CIM algorithm:

The basic idea behind the modified algorithm is to use all marker information when building model (4), so that properties 1 and 2 in Zeng (1993, 1994) can be completely guaranteed, and then the interval mapping approach of Lander and Botstein (1989) is applied on the adjusted phenotypic data. Considering that the number of QTL is always much lower than the number of markers, stepwise regression can be used to select the most important marker variables and therefore select the significant QTL. The coefficients of unselected markers through stepwise regression are set to 0 in model (4). When scanning for QTL along the chromosomes, the parameters in model (4) are estimated only once. For a testing position in interval (*k*, *k* + 1), the observation values in model (4) can be adjusted by(5)where is the estimate of in model (4). As shown in model (3), the two estimates and contain all the position- and additive-effect information of the QTL located on the current interval (*k*, *k* + 1) under the condition of no QTL in its neighboring intervals and the condition of large samples. Therefore, the use of in the subsequent interval mapping does not lose any information of the QTL at the current mapping interval, but the effects of QTL located on other intervals and chromosomes are controlled through the introduction of other coefficients in Equation 5. *The adjusted observation* does not change until the testing position moves into a new interval. Please note that the only assumption we made here is that the QTL on the same linkage group or chromosome are isolated by at least one empty interval (isolated QTL according to Whittaker *et al*. 1996).

For a testing position in an interval, all individuals in the backcross population can be classified into four groups on the basis of the two flanking markers (Table 1). If there is one QTL (with the two alleles denoted as *Q* and *q*) at the testing position, individuals in all the four groups have QTL genotypes *QQ* or *Qq* and hence follow a mixture distribution consisting of components and (Table 1) (McLachlan and Basford 1988). The distribution proportions in each mixture distribution depend on the recombination frequencies between QTL and the two flanking markers (Table 1). The existence of QTL at the current mapping position can be tested by the following hypotheses:

Supposing that all the *n* individuals have been sorted on the basis of their marker types, the log-likelihood function under the alternative hypothesis H_{A} is(6)where and are the proportions of individuals with *QQ* genotype in group 1 and group 2 or the proportions of individuals with *Qq* genotype in group 4 and group 3, respectively. and represent the probability densities of the two normal distributions of and , corresponding to the two QTL genotypes *QQ* and *Qq*, respectively (Table 1).

The expectation and maximization (EM) algorithm (Dempster *et al*. 1977; McLachlan and Basford 1988) is used to estimate the two means and one variance in Equation 6. The initial values of the three unknown parameters can be defined from groups 1 and 4 (Table 1) asand

In the E-step, the posterior probabilities of an individual being *QQ* at the QTL in groups 1–4 are

andrespectively.

In the M-step, the three parameters were updated asand

The EM algorithm continues until the difference in likelihood function between two consecutive iterations reaches a preassigned precision criterion. The maximum-likelihood estimates thus obtained are represented as , , and , from which the additive effect of the putative QTL can be estimated.

Under the null hypothesis, H_{0}, all defined by Equation 5 follow a normal distribution denoted as . The mean and variance of this distribution can be estimated as

Thus, the log-likelihood function under the null hypothesis H_{0} is

The LOD score at the testing position can be calculated from the log-likelihoods under the two hypotheses.

#### Genetic models used in simulation studies:

Two hypothetical genomes were used in simulation. One genome consisted of six chromosomes, each of 150 cM in length and with 16 evenly distributed markers. Ten QTL (represented by QZ1–QZ10; Table 2) were assumed to contribute to the trait of interest. Three QTL were located on each of the first three chromosomes and one QTL on the fourth chromosome. There was no QTL on chromosomes 5 and 6. The locations and effects of these QTL were similar to the scenario used by Zeng (1994). Both coupling and repulsive linkages and unequal QTL effects were considered in this scenario and therefore should have a wide applicability. To investigate the effect of epistasis on mapping additive QTL, two genetic models were simulated for this genome, one consisting of only additive genetic effects and the other consisting of both additive effects and digenic interactions (Table 2). The additive effects in the epistasis model were the same as those in the additive model, and the interaction effect was drawn from a Gamma distribution implemented by QTL Cartographer (Wang *et al*. 2005b). Under the QTL distribution in Table 2, the theoretical additive variance was 4.0, and the theoretical epistasis variance was 2.0 (estimated by QTL Cartographer). Two heritability (in the broad sense) levels were considered: *H* = 0.8 (representing high heritability traits) and *H* = 0.5 (representing medium heritability traits). One hundred backcross populations of 200 individuals were simulated for each model by heritability combination using QTL Cartographer.

The other genome consisted of four chromosomes, each with 100 cM in length and 21 markers evenly distributed. Eight large-effect QTL (represented by QY1–QY8) and 16 small-effect QTL contributed to the expression of a quantitative trait of interest (for details see Table 1 in Yi *et al*. 2003). To compare CIM and ICIM with the Bayesian mapping methods of Yi *et al*. (2003), 100 backcross populations each of 300 individuals were generated, and the residual variance was adjusted to 1. The population size and the residual variance were the same as those used in Yi *et al.* (2003).

For CIM, we applied different background marker selection methods available in QTL Cartographer. The model using stepwise regression to select control markers was the best in terms of the estimates of QTL positions and effects, so other models for CIM were not considered in power analysis. For ICIM, the stepwise regression was used to select markers and estimate the parameters in model (4), in which the largest *P*-value for entering variables was set at 0.05, and the smallest *P*-value for removing variables was 0.10. CIM was implemented by QTL Cartographer, and IM (where applicable) and ICIM were implemented by an in-house computer program called IciMapping (available from http://www.isbreeding.net/software.html). On the basis of a permutation test, a LOD threshold of 2.5 was used to declare the presence of a QTL.

#### Power calculation and position and effect estimation:

QTL mapping based on an interval test is not a point estimation, which makes it complicated to calculate power through simulation. Especially, when QTL are closely linked, it is difficult to determine which putative QTL the LOD peak belongs to. We adopted two methodologies to calculate power. First, a power was calculated for each interval defined by markers. This power calculation allows monitoring of QTL locations if not on the predefined intervals. Second, each predefined QTL was assigned to a 10-cM interval centered at the true QTL location, and then the power was estimated for the so-defined confidence interval. QTL identified in other intervals were viewed as false positives.

We also adopted two methodologies to calculate the mean QTL position and effect (Zeng 1994). One was calculated from all peaks in the confidence interval across 100 runs, and the other from the peaks having a LOD score over the predefined threshold of 2.5.

## SIMULATION RESULTS

#### LOD score distribution of ICIM:

Permutation tests (Churchill and Doerge 1994) were conducted to find the LOD score distributions of ICIM under the null hypothesis. These distributions were different for different genetic models and heritability levels in genome 1 (Table 3). For a population size of 200, the probabilities that the LOD score was >2.5 were 0.0630, 0.0311, 0.0462, and 0.0204 (calculated from Table 3) for the four combinations of two genetic models and two heritability levels, respectively. For a population size of 300, these probabilities were 0.1393, 0.0899, 0.0932, and 0.0474, respectively. Results in Table 3 indicate that different LOD thresholds should be applied for different data sets to ensure the same level of false-positive rates. For simplicity, we applied the LOD threshold of 2.5 in the simulation study. This threshold value may not result in a false discovery rate <0.05, but will have little effect on the comparison of different mapping methods. Moreover, we compared ICIM with CIM not only in terms of mapping power but also in terms of the number of false positives.

#### Power analysis of CIM and ICIM from genome 1:

On the average LOD profiles of ICIM displayed clear peaks around most of the predefined QTL, but this was not the case for CIM especially on chromosomes where there were multiple QTL (Figure 1). The three QTL on chromosome 2, *i.e*., QZ4, QZ5, and QZ6 (explaining 8.32, 12.30, and 12.96% of the phenotypic variance under the additive genetic model and *H* = 0.8, respectively; Table 2), had similar effects. QZ4 was linked with QZ5 in repulsive phase, and QZ5 was linked with QZ6 in coupling phase. Three clear peaks were observed on the average LOD profiles of ICIM, but it was hard to distinguish QZ5 and QZ6 on the average LOD profiles of CIM (Figure 1). The average LOD profiles were very low on chromosomes 5 and 6 on which there were no QTL, indicating that both CIM and ICIM are less likely to locate a QTL on one chromosome to other chromosomes.

When powers were calculated for all marker intervals along the six chromosomes, the probability that QTL were mapped onto the two devoid chromosomes (*i.e.*, 5 and 6) was rather low for both CIM and ICIM (Figure 2), as has been seen from the average LOD profiles in Figure 1. The advantage of ICIM over CIM was not significant for chromosome 3 (Figure 2), for which QZ7 has a very small effect (explaining 1.74% of phenotypic variance under the additive model and *H* = 0.8; Table 2), and QZ8 and QZ9 are far apart (61 cM apart on chromosome 3; Table 2). The advantage was not significant for chromosome 4 either, on which there was only one QTL (Figure 2). However, the advantage of ICIM was significant for chromosome 2 for all genetic models and heritability levels, on which there were three QTL of similar effects (QZ4, QZ5, and QZ6), and the distances between QZ4 and QZ5 and QZ5 and QZ6 were 40 and 34 cM, respectively (Table 2). ICIM had higher powers to map QZ4 and QZ5 in the right intervals and lower probability to assign them to incorrect intervals than CIM (Figures 2 and 3).

The advantage of ICIM over CIM was much clearer when power was calculated on the basis of an interval of 10 cM centered on the predefined QTL (Figure 3). For QZ1, CIM had higher powers than ICIM for the additive genetic model under heritability 0.5 and the additive and epistasis model under heritability 0.8. For QZ2 and QZ6, CIM also had higher powers in most cases. For QZ8, the powers of CIM and ICIM for the additive genetic model under heritability 0.5 and the additive and epistasis model under heritability 0.5 were the same, and for QZ9 those powers for the additive genetic model under heritability 0.5 were the same. But for all other cases, ICIM had higher powers. On average, CIM identified 6.11, 3.63, 3.97, and 2.35 QTL for the two genetic models and two heritability levels in each run, respectively, while ICIM identified 7.52, 4.40, 4.84, and 2.72 QTL, respectively, and all of them were higher than those observed from CIM (Figure 4). QTL identified in intervals other than the defined intervals were viewed as false positives. On average the false QTL numbers in each run were 6.58, 4.82, 4.85, and 3.36 for CIM, but 3.67, 3.71, 3.31, and 2.73 for ICIM, respectively (Figure 4). Thus, the proportions of true to false positives were 0.93, 0.75, 0.82, and 0.70 for CIM, but 2.05, 1.19, 1.46, and 1.00 for ICIM for the two genetic models and two heritability levels, respectively. As shown in Figure 2, many false positives were located in the neighboring intervals on the chromosomes, and the proportion of true QTL *vs.* false positives was generally dependent on the width of confidence intervals.

Lower heritability and the inclusion of epistasis reduced the power for mapping additive QTL using CIM and ICIM (Figures 1–4⇑⇑⇑). The mapping power under the additive and epistasis model and heritability 0.8 was similar to that under the additive genetic model and heritability 0.5. When mapping was conducted assuming additivity of QTL, the epistatic effect should enter into the sampling error. The effect of epistasis on additive QTL mapping was equivalent to additional random errors on phenotypic data. For the additive and epistasis genetic model under heritability 0.8 (Table 2), the proportion of additive variance to the phenotypic variance was 4/(4 + 2 + 1.5) ≈ 53%, which was similar to the additive genetic model of heritability 0.5 (Table 2). This suggests that CIM and ICIM are still effective for locating the QTL and estimating their additive effects when epistasis is present if the heritability in the narrow sense is not too low.

#### Estimation of QTL positions and effects from genome 1:

The estimates of QTL positions can be on the left or right of the true position. The deviation of the average position estimates across the 100 simulations ranged from −2.23 to 2.40 cM (Table 4). There is a tendency that QTL were mapped toward to their closest markers. In other words, a QTL closer to its left flanking marker had a negative deviation (*i.e.*, QZ4 at 3 cM and QZ5 at 43 cM on chromosome 2 and QZ7 at 33 cM on chromosome 3), and a QTL closer to its right flanking marker had a positive deviation (*i.e.*, QZ2 at 48 cM and QZ3 at 108 cM on chromosome 1, QZ6 at 77 cM on chromosome 2, and QZ8 at 68 cM and QZ9 at 129 cM on chromosome 3). This is understandable by looking into the two coefficients in model (2). If a QTL is located in the middle of a flanking interval, its effect will be evenly absorbed by the two flanking markers. Otherwise, the marker closer to the QTL will absorb most of the QTL variation and is more likely retained through stepwise regression. Therefore, a QTL is more likely to be mapped onto the marker closer to it. Due to the same reason, the highest power of ICIM was not reached on interval (120, 130) on chromosome 3 where QZ9 was located (*i.e.*, 129 cM), but was achieved on interval (130, 140), as shown in Figure 2.

When calculated from all peaks, the estimate of QTL effect was almost unbiased. In comparison, the QTL effect was generally overestimated when calculated from the significant peaks only, which was expected as small-effect estimates in some simulation runs were not counted. Therefore, in any simulation studies, it is not likely to achieve an unbiased estimation of QTL effect if only significant QTL are counted. Compared with CIM, ICIM tends to have smaller bias in effect estimation in most cases (Table 4).

#### Simulation results from genome 2:

The advantage of ICIM over CIM is also significant for genome 2 (Figures 5 and 6). The power given in Figure 5A was calculated for each chromosomal interval of 5 cM as defined by two neighboring markers. For ICIM, the highest powers were achieved in the intervals where QY1, QY3, QY4, QY5, QY7, and QY8 were located, which were 0.56, 0.69, 0.67, 0.72, 0.71, and 0.78, respectively. The highest powers were achieved in the neighboring intervals for QY2 and QY6 (Figure 5A). The result of CIM was comparable to that of ICIM for chromosome 4, for which the two QTL were linked in coupling with a distance of 35 cM. The worst performance of CIM was for chromosome 2, for which the two QTL were linked in repulsion with a distance of 21 cM (Figure 5A).

The mean LOD profile of ICIM across the 100 simulations had eight clear peaks corresponding to the eight major predefined QTL, while that of CIM only had two clear peaks on chromosome 4 (Figure 5B). The mean estimated effects from ICIM across the 100 simulation runs were close to the true QTL effects for all the eight predefined major QTL. But CIM tended to overestimate QTL effects on chromosomes 1, 3, and 4 where the linked QTL were in the coupling phase, but tended to underestimate QTL effects on chromosome 2, where the linked QTL were in the repulsive phase (Figure 5C).

When a confidence interval of 10 cM with the predefined QTL at the center was considered, ICIM had a power >0.87 to map the eight major QTL (Figure 6A). The powers of CIM were >0.76 for QTL on chromosomes 1, 3, and 4, for which linked QTL were in the coupling phase, but were only 0.09 for QY3 and 0.47 for QY4 on chromosome 2, where the two QTL were linked in repulsion (Figure 6A). In addition, ICIM also resulted in less false positives (total false positives in intervals other than the eight confidence intervals divided by 8) than CIM (Figure 6A). Across the 100 simulations, the estimates of QTL effects were almost unbiased for ICIM (Figure 6B), while the effect estimates of QY3 and QY4 on chromosome 2 were about twice the true QTL effects for CIM.

## DISCUSSION

#### Theoretical justification of ICIM:

For both ICIM and CIM, Mendelian segregation and recombination laws (Table 1) and quantitative genetic theories [models (1)–(3)] provide the theoretical basis (Falconer and Mackay 1996; Lynch and Walsh 1998; Hartl and Jones 2005), and regression and maximum-likelihood principles [Table 1, model (4), and Equation 6] provide the statistical basis. The statistical assumption made in CIM and ICIM is that the residual errors in model (4) are normally distributed. The genetic assumptions are that (i) the genotypic value of an individual is the summation of effects from all loci affecting the trait of interest and (ii) linked QTL are separated by at least one blank interval. These well-established genetic and statistical theories ensure that these mapping methods are valid under these assumptions. However, simulations are useful if one wants to investigate their sensitivities to the violation of the underlying assumptions, such as nonisolated QTL and epistasis.

#### ICIM makes the background marker selection process easier:

Various methods for selecting background markers are available in QTL Cartographer implementing CIM (Wang *et al*. 2005b), and different methods may result in different, sometime controversial, mapping results. A mapping population from the first simulation run using the additive genetic model and *H* = 0.8 in genome 1 was used to demonstrate this point. Three cofactor selection methods used in CIM, *i.e.*, unlinked marker control, all marker control, and the standard model using stepwise regression (window size 10 cM), gave rather different LOD profiles (Figure 7A). The method using unlinked markers as a control was similar to IM (Figure 7A), which should not be recommended. The method using all markers as a control had clear peaks where major QTL were located, but the LOD score was generally lower than that from IM (Figure 7A). Moreover, power analysis showed that this method resulted in a large number of false positives on the two devoid chromosomes (results not shown), and this method cannot be applied when markers outnumber the mapping population size.

In ICIM, the background markers were selected only once using the standard stepwise regression, and thus the difficulty in choosing the background markers associated with CIM can be avoided. To investigate the influence of marker inclusion and exclusion criteria in stepwise regression on mapping results, different *P*-values for entering variables (PIN) and removing variables (POUT) were applied to the population used above for comparing marker selection methods in CIM. The LOD profiles from three levels of PIN and POUT showed very little difference (Figure 7B). All three probability levels identified the 8 largest QTL (Figure 7B). When PIN = 0.10 and POUT = 0.20 were used, the average power to detect the 10 predefined QTL from the additive genetic model and *H* = 0.8 was slightly lower than that from PIN = 0.05 and POUT = 0.10, *i.e.*, 73.7 *vs.* 75.2 (calculated from Table 5), but the false positive was much higher, *i.e.*, 511 *vs.* 367 (Table 5). Similar results were also observed for other genetic models. Therefore, ICIM is robust to the choice of probability levels, and in practice a lower probability level should be applied to further reduce the false positives without sufficiently changing the detection power (Figure7B and Table 5).

#### ICIM does not increase sampling variance compared to IM:

According to property 3 proposed by Zeng (1994), conditioning on linked markers in the multiple-regression analysis will reduce the influence of interference caused by possible multiple linked QTL on hypothesis testing and parameter estimation, but with a possible increase of sampling variance. The increased sampling variance can be seen from the lower LOD scores (compared with IM) that resulted from CIM using all markers as a control (Figure 7A). But this is not the case for ICIM. At most peak positions, ICIM has higher LOD scores than IM (Figure 7, A and B). So for ICIM, properties 2 and 3 of CIM (Zeng 1994) can be merged as “Conditioning on both linked and unlinked markers in the multiple regression analysis will reduce the sampling variance of the test statistic by controlling some residual genetic variation and thus will increase the power of QTL mapping” (Zeng 1994, p. 1460).

#### CIM and ICIM are valid and simple methods for mapping with populations derived from biparental crosses:

CIM, when implemented properly, represents the best single-interval mapping method based on linear model and maximum-likelihood principles. Recently, Bayesian models have gained some popularity among theoreticians. In a sense and in the context of QTL mapping using populations derived from biparental crosses, both frequentist statistics and Bayesian statistics deal with the maximization of likelihood function. The major difference is that a prior distribution has to be considered in any Bayesian model, and the choice of the prior in the case of QTL mapping is rather arbitrary and a tedious process (Xu 2003). On the basis of the prior, Bayesian statistics derive the posterior and then conduct inference on the basis of the posterior distribution. The conventional maximum likelihood can be viewed as a special case of Bayesian models where a uniform density is used as the prior distribution. It should be noted that the effect and advantage of prior distribution diminish as the sample size increases (Gelman *et al.* 2004). The population size for a QTL mapping population is normally hundreds, which can be reasonably regarded as a large sample in statistics. Theoretically, Bayesian mapping has no flaw in terms of models. However, Bayesian statistics may not have significant advantages over frequentist statistics in QTL mapping for some standard mapping populations derived from biparental crosses, considering the difficulty in choosing prior distributions and the complexity in computing posterior distributions.

Genome 2 used in our simulation has been used by Yi *et al*. (2003) to demonstrate their Bayesian mapping method. The results of ICIM shown in Figure 5 were comparable with the probability profile from the Bayesian method (Figure 1 in Yi *et al*. 2003). The mean effects for the eight identified large-effect QTL were 1.02, 0.96, −0.83, 0.81, −0.78, −0.87, 1.00, and 1.03, respectively (Figure 5C), which were close to the true additive effects. But ICIM is much simpler in principle and faster in computation. It required <5 min for ICIM to complete the 100 simulation runs in a personal computer. In addition, we also compared ICIM with the Bayesian mapping method proposed by Sillanpää and Arjas (1998), and the results of ICIM were very similar to those from Bayesian models (results not shown).

Since the number of QTL is always much lower than the number of markers, QTL mapping can be viewed as an issue of model selection (Broman and Speed 2002; Sillanpää and Corander 2002). A number of statistical methods are available to search through the space of models and various criteria can be used to select the best model (Miller 1990; Piepho and Gauch 2001). However, there is no conclusion in statistics as to which model selection method is the best (Miller 1990). On the basis of our simulation results, the performance of stepwise regression is satisfactory. However, we do not exclude the possibility that other model selection methods may achieve similar performance as the stepwise regression used in ICIM.

The calculation of probability that a QTL is in a given interval is viewed as a major advantage of Bayesian models (Ball 2001). In any likelihood-ratio test-based mapping methods such as IM, CIM, and ICIM, the QTL position is estimated as the peak of the LOD profile with a LOD score over a specified threshold value. The LOD score is actually a likelihood-ratio test (LRT) [LRT = 2 log(10)LOD ≈ 4.61 LOD]. In the case of mixture models (McLachlan and Basford 1988; Goffinet *et al*. 1992), the asymptotic distribution of LRT may not exactly follow a , where d.f. is the difference in the number of dependent variables under the two hypotheses. However, Zeng (1994) showed when the sample size was large and the number of markers fitted to the model was relatively small, the LRT statistic was still approximately distributed as . Another way to find the LOD threshold is to use permutation tests (Churchill and Doerge 1994). A probability may be calculated at any testing position if required. In any mapping methods based on the likelihood-ratio test, the LOD score actually indicates the likelihood of a QTL at the testing position. The similarity between the mean LOD score profile shown in Figure 5B and the probability profile shown in Figure 1 in Yi *et al*. (2003) is consequently not unexpected.

Multiple interval mapping (MIM) was proposed to map multiple QTL simultaneously (Kao *et al*. 1999; Zeng *et al*. 1999). MIM may have avoided the complicated background selection process associated with CIM, but introduced various model selection methods (Zeng *et al*. 1999; Wang *et al*. 2005b). We applied two model selection methods available in QTL Cartographer to the same mapping population previously used for comparing different cofactor selection methods in CIM. Eleven QTL were identified when the forward and backward selection on markers was used, while only 6 QTL were identified when the MIM forward search method was used (results not shown). Again, different MIM model selection methods resulted in different mapping results.

#### Further considerations of QTL mapping:

An important assumption of most QTL mapping methods is that the QTL are separated by at least one blank interval. It is expected that this assumption is more satisfactory for narrow than for wide marker spacing. Similarly, it is a better assumption for independent or loosely linked than for tightly linked QTL. With the rapid development of molecular technology, high-density linkage maps are becoming available for more and more species. The treatment of tightly linked QTL is more an issue of biology than of statistical methodology. Two linked QTL can be separated only if recombinants are sampled in the mapping population (Kearsey 2002). Therefore, the mapping resolution is limited by the practicable mapping population size. Populations of size ≥500 are rarely seen in practice for mapping using primary populations such as backcross, F_{2}, and recombination inbred lines. Therefore, QTL mapping using primary populations can give only a rough position and effect estimation due to the limited population size and errors in both genotyping and phenotyping. Once the QTL interval has been identified, some secondary mapping populations should be built from the preliminary mapping population and fine mapping needs to be conducted (Kearsey 2002). As whole-genome genotyping is not requested in the secondary population, and selective phenotyping may be implemented, a larger population can be used. At the same time, new markers in the candidate intervals may be discovered and added to the linkage map. By then we may determine whether the identified QTL contain one gene or multiple genes.

It is now common that the number of markers exceeds the sample size of the mapping population. The performance of ICIM and other recommended mapping methods needs to be investigated under this situation. We have used a backcross population to illustrate our method in this article. However, the extension of ICIM to F_{2}, doubled haploids, and recombination inbred lines is straightforward. CIM is not extendable to epistasis (Zeng *et al*. 1999). As can be seen, models (1)–(3) can be easily extended to include epistasis. For instance, by inclusion of marker-pair multiplications in model (4) digenic epistasis can be modeled. This work is currently under development.

#### Conclusions:

The problem with the current CIM method is the arbitrariness of choosing the cofactors. Different methods of cofactor selection will generate different, sometime controversial results. In ICIM, significant cofactors are selected and their corresponding effects are estimated by using stepwise regression analysis prior to interval mapping. The effects of the cofactors are then fixed when used in the genome-scanning process. This eliminates the arbitrariness of cofactor selection associated with CIM. ICIM has a simpler form and faster convergence speed (EM algorithm converges after three to five iterations), without losing the optimal properties of CIM. ICIM gives clearly high LOD scores at chromosomal regions with QTL but rather low LOD scores where no QTL is located and results in less biased estimates of QTL effects, thereby improving the mapping power and precision. Extensive simulations showed that ICIM improved the performance of QTL mapping over the existing CIM method.

## Acknowledgments

The authors thank two anonymous reviewers for their useful comments on earlier versions of the manuscript. This work was supported by the National 973 Projects of China (no. 2006CB101700) and the Generation Challenge Program of the Consultative Group for International Agricultural Research.

## Footnotes

Communicating editor: N. Takahata

- Received October 13, 2006.
- Accepted October 24, 2006.

- Copyright © 2007 by the Genetics Society of America