## Abstract

Endosperm traits are inherited trisomically and are of great economic importance because they are usually directly related to grain quality. Mapping the quantitative trait loci (QTL) underlying endosperm traits can provide an efficient way to genetically improve grain quality. Because the traditional QTL mapping methods (diploid methods) are designed for traits under diploid control, they are not ideal for mapping endosperm traits: they ignore the triploid nature of endosperm. In this article, a statistical method that considers the triploid nature of endosperm (triploid method) is developed on the basis of multiple-interval mapping (MIM) to map the underlying QTL. The proposed triploid MIM method can use the marker information either from only the maternal plants or from both the maternal plants and their embryos in the backcross and F_{2} populations for mapping endosperm traits. Because it uses multiple intervals simultaneously to take multiple QTL into account, the triploid MIM method can provide better detection power and estimation precision than the traditional diploid methods and the current triploid methods that use only one (or two) interval(s), and, as shown in this article, it can analyze and search for epistatic QTL directly. Several important issues in endosperm trait mapping, such as the relation and differences between the diploid and triploid methods, the variance components of genetic variation, and the problems that arise if effects are present and ignored, are also addressed. Simulations are performed to further explore these issues, to investigate the relative efficiency of different experimental designs, and to evaluate the performance of the proposed and current methods in mapping endosperm traits. The MIM-based triploid method can provide a powerful tool to estimate the genetic architecture of endosperm traits and to assist marker-assisted selection for the improvement of grain quality in crop science. The triploid MIM FORTRAN program for mapping endosperm traits is available on the World Wide Web (http://www.stat.sinica.edu.tw/chkao/).

Cereal grains of many crops, such as rice, wheat, barley, and corn, are major sources of food and nutrition for humans, of animal feed, and of industrial products. To enhance the yield and quality of grains, understanding the genetic basis underlying cereal grains has become increasingly important in crop research. Cereal grains are generally composed of diploid (embryo) and triploid (endosperm) tissues as a result of double fertilization. During double fertilization, one of the two sperm cells fuses with the egg cell to produce a diploid zygote, which later divides mitotically to form the embryo, and the other sperm cell unites with the central cell (carrying a diploid set of maternal chromosomes) to form a triploid endosperm nucleus, which also undergoes several mitotic divisions to become the endosperm. The endosperm is known to play a major role in nourishing the embryo in the seed and the young seedling, and the contents of the endosperm, such as protein, sugar, oil, and carbohydrate concentrations, which show quantitative variation, are directly related to the quality of cereal grains. Genetic improvement targeting these endosperm traits can provide an efficient way to enhance grain quality, and it has attracted much attention in plant breeding (Sadimantara *et al.* 1997; Mazur *et al.* 1999; Tan *et al.* 1999; Wang and Larkins 2001; Lou and Zhu 2002). Genetically, the trisomic endosperm represents the next generation and has a more complex genetic mechanism than the diploid tissues. For these reasons, the genetic analysis of endosperm traits differs from that of traits under diploid control, and special treatments are required in the study of endosperm traits.

Most endosperm traits show continuous variation. Quantitative genetic models that consider the triploid nature of endosperm traits have been proposed by several researchers for studying the underlying genetic basis (Gale 1976; Mo 1987; Bogyo *et al.* 1988; Zhu and Weir 1994). These models generally focus on partitioning the phenotypic variance of an endosperm trait into various genetic and nongenetic (environmental) components. These variance components do not provide detailed information about the underlying quantitative trait loci (QTL), such as their number, positions, and effects. To unlock this QTL information, the ideas of the traditional QTL mapping methods, which utilize well-distributed genetic markers along the genome to infer the QTL parameters, can be used. The traditional QTL mapping methods use the information about traits and markers from the same generation, *e.g.*, backcross or F_{2} populations, to detect QTL controlling traits in diploid organisms (Lander and Botstein 1989; Haley and Knott 1992; Jansen 1993; Zeng 1994; Kao *et al.* 1999; Kao and Zeng 2002). Although they are designed for traits under diploid control, some researchers have applied them to mapping QTL controlling endosperm traits (Tan *et al.* 1999; Wang and Larkins 2001; Wang *et al.* 2001). Such application implicitly relies on the invalid assumption that the endosperm traits are directly controlled by the diploid maternal genomes, not by the triploid endosperm genomes. Consequently, the traditional QTL mapping methods have limited power and precision in mapping endosperm traits (Wu *et al.* 2002a).

Wu *et al.* (2002a,b) and Xu *et al.* (2003) pioneered statistical methods to map endosperm traits by taking the triploid nature of endosperms into account using the marker information from the maternal plants (one-stage design) in the backcross or F_{2} population. Wu *et al.* (2002a) further proposed a triploid QTL mapping method that uses the marker information from both the maternal plants and their embryos (two-stage design) to improve the mapping of endosperm traits in the backcross population. Their methods have been shown to provide improved QTL resolution. As these methods consider only one (or two) QTL at a time in the model, they can bias QTL identification and estimation when multiple QTL are located in the same linkage group (Lander and Botstein 1989; Jansen 1993; Zeng 1994). To deal with these problems and further improve endosperm trait mapping, a potential way is to extend the current one-QTL model to a multiple-QTL model such that more genetic variation can be controlled in the model, as has been done in mapping traits in diploid tissues (Kao and Zeng 1997; Kao *et al.* 1999; Zeng *et al.* 1999). In this article, a triploid method based on multiple-interval mapping (MIM), which uses multiple marker intervals simultaneously to fit multiple putative QTL into the model, is developed to achieve these purposes. This MIM-based triploid method can accommodate either the one- or two-stage design in either the backcross or F_{2} population to analyze endosperm traits. As shown in this article, the proposed method can detect QTL responsible for endosperm traits with more power and better precision, and it can readily analyze and search for epistatic QTL because of its multiple-QTL approach. In addition, some related issues in mapping endosperm traits, such as the problems of using the diploid methods, the differences and relation between the diploid and triploid methods, the genetic variance components of endosperm traits, and the problems that arise if QTL effects are present and ignored, are also investigated. A series of simulation studies was performed to further investigate these issues, to examine the relative efficiency of different experimental designs, and to evaluate the performance of the MIM-based method as compared to the current methods in mapping endosperm traits.

## GENETIC MODEL OF ENDOSPERM TRAITS

### Genetic model:

For individuals in a backcross or F_{2} population of autogamous plants, the endosperm tissues of their seeds can have four possible genotypes, *QQQ*, *QQq*, *Qqq*, and *qqq*, if only one QTL *Q* is considered (appendix B). Some genetic models for defining the genetic parameters and modeling the relationship between the genotypic values and the genetic parameters already exist (*e.g.*, Gale 1976; Mo 1987; Bogyo *et al.* 1988; Pooni *et al.* 1992; Zhu and Weir 1994). Here, the genetic model by Bogyo *et al.* is adopted for modeling, and it can be expressed in matrix notation as

$$
\begin{bmatrix} G_1 \\ G_2 \\ G_3 \\ G_4 \end{bmatrix}
= \mu \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}
+ \begin{bmatrix} \tfrac{3}{2} & 0 & 0 \\ \tfrac{1}{2} & 1 & 0 \\ -\tfrac{1}{2} & 0 & 1 \\ -\tfrac{3}{2} & 0 & 0 \end{bmatrix}
\begin{bmatrix} a \\ d_1 \\ d_2 \end{bmatrix},
\tag{1}
$$

where the notations *G*_{1}, *G*_{2}, *G*_{3}, and *G*_{4} denote the genotypic values of genotypes *QQQ*, *QQq*, *Qqq*, and *qqq*, respectively, and *a*, *d*_{1}, and *d*_{2} are the genetic parameters. In Equation 1, the matrix with 4 × 3 dimension is called a genetic design matrix as it specifies the relationship between the genotypic values and genetic parameters, and it is symbolized by *D*. The unique solutions of μ, *a*, *d*_{1}, and *d*_{2} in terms of the genotypic values are

$$
\mu = \frac{G_1 + G_4}{2}, \qquad
a = \frac{G_1 - G_4}{3}, \qquad
d_1 = G_2 - \frac{2G_1 + G_4}{3}, \qquad
d_2 = G_3 - \frac{G_1 + 2G_4}{3}.
$$

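The parameter μ obviously is not a measure of the mean genotypic value, as the genotypic values of *QQq* and *Qqq* are ignored. The parameter *a*, which measures the average effect of substituting *Q* for *q*, is defined as the additive effect, and the parameter *d*_{1} (*d*_{2}), which measures the departure of the substitution effect in the *QQ* (*qq*) background, is defined as the first (second) dominance effect. The genetic model can be expressed more succinctly as

$$
G = \mu + a\,x + d_1 z_1 + d_2 z_2,
\tag{2}
$$

where the coded variables are defined as

$$
x = \begin{cases} \tfrac{3}{2} & \text{if } QQQ \\ \tfrac{1}{2} & \text{if } QQq \\ -\tfrac{1}{2} & \text{if } Qqq \\ -\tfrac{3}{2} & \text{if } qqq, \end{cases}
\qquad
z_1 = \begin{cases} 1 & \text{if } QQq \\ 0 & \text{otherwise,} \end{cases}
\qquad
z_2 = \begin{cases} 1 & \text{if } Qqq \\ 0 & \text{otherwise,} \end{cases}
$$

such that each genotype corresponds to its genotypic value. If different genetic models are used for modeling, they can also be expressed as in Equations 1 and 2, but note that the parameters may have different meanings and the variance components may have a different structure.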
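As a quick check of the one-locus model, the following minimal sketch maps genetic parameters to genotypic values and back. The coding of the design matrix is an assumption here: it is the one implied by the genotypic values listed for the four parameter settings of the simulation study, and the function names are hypothetical.

```python
from fractions import Fraction as F

# Assumed genetic design matrix D: rows follow the genotype order
# QQQ, QQq, Qqq, qqq; columns correspond to (a, d1, d2).
D = [
    [F(3, 2), 0, 0],
    [F(1, 2), 1, 0],
    [F(-1, 2), 0, 1],
    [F(-3, 2), 0, 0],
]

def genotypic_values(mu, a, d1, d2):
    """Return [G1, G2, G3, G4] = mu + D (a, d1, d2)'."""
    return [mu + row[0] * a + row[1] * d1 + row[2] * d2 for row in D]

def solve_parameters(G):
    """Recover (mu, a, d1, d2) from the four genotypic values."""
    G1, G2, G3, G4 = G
    mu = (G1 + G4) / 2
    a = (G1 - G4) / 3
    d1 = G2 - (2 * G1 + G4) / 3
    d2 = G3 - (G1 + 2 * G4) / 3
    return mu, a, d1, d2

# First simulation setting: a = 1, d1 = -2, d2 = -2 (mu taken as 0).
G = genotypic_values(0, 1, -2, -2)
print(G)                    # genotypic values of QQQ, QQq, Qqq, qqq
print(solve_parameters(G))  # round trip back to (mu, a, d1, d2)
```

Exact fractions are used so that the round trip can be compared with the values in the text without rounding error.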

The extension of the one-locus genetic model in Equation 1 to multiple, say *m*, loci is straightforward. Consider *m* QTL, *Q*_{1}, *Q*_{2}, … , and *Q*_{m}, each with four genotypes and three genetic parameters. Together, for *m* QTL, there are 4^{m} possible different QTL genotypes and 3*m* parameters if epistasis between QTL is not considered or 3*m*(3*m* − 1)/2 parameters if only up to digenic epistasis is considered. The columns for epistasis can easily be obtained from the product of columns of marginal effects. By expanding the genetic design matrix *D* of Equation 1 to a 4^{m} × 3*m* or 4^{m} × 3*m*(3*m* − 1)/2 matrix (see THE MIM MODEL FOR MAPPING ENDOSPERM TRAITS), the genetic model for *m* QTL in matrix notation can be obtained. The genetic design matrix *D* plays an important role in the estimation of the QTL effects in the triploid MIM model. The corresponding multiple-QTL model in the form of Equation 2 can be easily obtained using a regression principle. Following the regression principle, the genetic model of *m* QTL by considering up to digenic epistasis can be written as Equation 3, where μ is the intercept; *a*_{j}, *d*_{j1}, and *d*_{j2} are the additive and dominance effects of *Q*_{j}; *i*_{a_{j}a_{k}}, *i*_{a_{j}d_{k1}}, *i*_{a_{j}d_{k2}}, *i*_{d_{j1}d_{k1}}, *i*_{d_{j1}d_{k2}}, *i*_{d_{j2}d_{k1}}, and *i*_{d_{j2}d_{k2}} denote the epistatic effects between QTL; and *x*_{j}, *z*_{j1}, and *z*_{j2} are the coded variables of the additive and dominance effects for *Q*_{j}.

### Variance components:

Consider only one QTL in the genetic model. It is easy to show that the variance of the additive variable, *V*(*x*), is ^{19}/_{16}, and the variances of the two dominance variables, *V*(*z*_{1}) and *V*(*z*_{2}), are ^{7}/_{64}, in a backcross population. In an F_{2} population, these variances are ^{7}/_{4}, ^{7}/_{64}, and ^{7}/_{64}, respectively. The covariances between the variables, Cov(*x*, *z*_{1}), Cov(*x*, *z*_{2}), and Cov(*z*_{1}, *z*_{2}), are ^{5}/_{32}, ^{1}/_{32}, and −^{1}/_{64}, respectively, in the backcross population, and they are ^{1}/_{16}, −^{1}/_{16}, and −^{1}/_{64}, respectively, in the F_{2} population. Therefore, the genetic variance components of an endosperm trait are

$$
V(G) = \tfrac{19}{16}a^2 + \tfrac{7}{64}d_1^2 + \tfrac{7}{64}d_2^2 + \tfrac{5}{16}a d_1 + \tfrac{1}{16}a d_2 - \tfrac{1}{32}d_1 d_2
\tag{4}
$$

in the backcross population, and they are

$$
V(G) = \tfrac{7}{4}a^2 + \tfrac{7}{64}d_1^2 + \tfrac{7}{64}d_2^2 + \tfrac{1}{8}a d_1 - \tfrac{1}{8}a d_2 - \tfrac{1}{32}d_1 d_2
\tag{5}
$$

in the F_{2} population, where each covariance term carries twice the covariance of the corresponding coded variables. Each effect therefore contributes not only to its own variance but also to the covariances with other effects, and the relative importance of an effect in contributing to the total genetic variance depends not only on its size but also on its associated coefficients (the variances or covariances of the coded variables).

When *m* QTL each with complete effects are considered together, the genetic variance has [9*m*^{2}(3*m* − 1)^{2} + 6*m*(3*m* − 1)]/8 components. For example, the total genetic variance has 120 components for *m* = 2 in both populations (not shown), and it reduces to 111 components in the backcross population and 83 components in the F_{2} population when the two QTL are unlinked (appendix A). Among the coefficients of the variances involving the epistatic effects, the coefficients associated with the additive-by-additive effect are much larger than those of the other variances and covariances. For example, in the F_{2} population, the coefficient of *i*^{2}_{a1a2} (the variance of *x*_{1}*x*_{2}) is ^{49}/_{16} (appendix A); *i.e.*, the variance contributed by *i*_{a1a2} is ^{49}/_{16} × *i*^{2}_{a1a2}. The coefficients of the other four epistatic variances involving the additive effects are ^{7}/_{32}, and the coefficients of the remaining four different types of epistatic variance are ^{63}/_{4096}. The coefficients of the covariances between the additive effects and the epistatic effects involving the additive effects are ^{7}/_{32}, and the coefficients of the covariances between *i*_{a1a2} and the other epistatic effects involving the additive effects are ^{7}/_{64}. The other covariances are relatively smaller. This implies that, for the same size of epistatic effects, the epistatic effects involving the additive effects, especially the additive-by-additive effect, are relatively easy to detect, and the other epistatic effects are relatively difficult to detect in practical QTL mapping (with a limited sample size). A similar pattern can also be found in the backcross population.

For two nonepistatic QTL in the F_{2} population, the variance components consist of 21 terms that involve *r*_{12}, the recombination fraction between the two QTL. Similarly, the variance components for the backcross population also have 21 terms (not shown). If the two nonepistatic QTL are unlinked, the variance components reduce to a much simpler form with the first 12 components.
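The one-locus variances and covariances quoted above can be reproduced from the endosperm genotype frequencies. The sketch below is illustrative only and rests on two assumptions that are not stated explicitly in this section: the coding of (*x*, *z*_{1}, *z*_{2}) is the one implied by the simulation settings, and the backcross is made to the *qq* parent, so that the selfed maternal plants are *qq* and *Qq* in equal proportions (this choice reproduces the signs of the covariances given in the text). The function names are hypothetical.

```python
from fractions import Fraction as F

# Assumed coded variables (x, z1, z2) for the genotypes QQQ, QQq, Qqq, qqq.
CODES = [(F(3, 2), 0, 0), (F(1, 2), 1, 0), (F(-1, 2), 0, 1), (F(-3, 2), 0, 0)]

# Endosperm genotype frequencies after selfing the maternal plants.
# A selfed Qq plant gives the four endosperm genotypes with probability 1/4 each.
FREQ = {
    "backcross": [F(1, 8), F(1, 8), F(1, 8), F(5, 8)],  # maternal qq : Qq = 1 : 1
    "F2": [F(3, 8), F(1, 8), F(1, 8), F(3, 8)],         # maternal QQ : Qq : qq = 1 : 2 : 1
}

def covariance_matrix(pop):
    """3 x 3 covariance matrix (nested lists) of the coded variables (x, z1, z2)."""
    p = FREQ[pop]
    mean = [sum(pi * c[k] for pi, c in zip(p, CODES)) for k in range(3)]
    return [[sum(pi * c[j] * c[k] for pi, c in zip(p, CODES)) - mean[j] * mean[k]
             for k in range(3)] for j in range(3)]

def genetic_variance(pop, a, d1, d2):
    """Quadratic form e' V e with e = (a, d1, d2)."""
    V, e = covariance_matrix(pop), (a, d1, d2)
    return sum(e[j] * V[j][k] * e[k] for j in range(3) for k in range(3))

def n_components(m):
    """Number of variance and covariance components for m QTL with digenic epistasis."""
    p = 3 * m * (3 * m - 1) // 2   # number of genetic effects
    return p * (p + 1) // 2        # equals [9m^2(3m-1)^2 + 6m(3m-1)]/8

for pop in ("backcross", "F2"):
    print(pop, covariance_matrix(pop))
print(n_components(1), n_components(2))
```

The count in `n_components` is simply the number of distinct entries in a symmetric matrix whose order is the number of effects, which is an equivalent way of writing the formula in the text.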

## THE RELATION BETWEEN THE DIPLOID AND TRIPLOID METHODS

The traditional QTL mapping methods are usually designed to map QTL controlling traits in diploid organisms (Lander and Botstein 1989; Haley and Knott 1992; Jansen 1993; Zeng 1994; Kao *et al.* 1999; Kao and Zeng 2002). These diploid methods classify the genotypes of each QTL into two groups, *QQ* (*qq*) and *Qq*, for the backcross population or three groups, *qq*, *Qq*, and *QQ*, for the F_{2} population, and they detect the association between the QTL genotype and the trait value, both measured in the same generation, for QTL mapping. Although the endosperms are known to be triploid and represent the next generation, some researchers have applied these diploid methods to mapping endosperm traits of the backcross or F_{2} individuals (Tan *et al.* 1999; Wang *et al.* 2001; Wu *et al.* 2002a). Therefore, it is important to investigate the problems of using the diploid methods and the relation between the diploid and triploid methods in mapping endosperm traits.

### Diploid methods:

When the diploid methods are applied to mapping endosperm traits and only one QTL is considered, the statistical model for *n* endosperms in the backcross population can be written as

$$
y_i = \mu + b\,w^{*}_{i} + \varepsilon_i, \qquad i = 1, 2, \ldots, n,
\tag{6}
$$

where *w*^{*}_{i} is the coded variable for the diploid QTL genotype; *y*_{i} is the endosperm trait value; μ is the intercept; and *b* is the QTL effect measuring the difference between the two diploid QTL genotypes. The corresponding model for *n* endosperms in the F_{2} population can be expressed as

$$
y_i = \mu + b_a w^{*}_{ai} + b_d w^{*}_{di} + \varepsilon_i, \qquad i = 1, 2, \ldots, n,
\tag{7}
$$

where *w*^{*}_{ai} and *w*^{*}_{di} are the coded variables for the additive and dominance effects, and *b*_{a} and *b*_{d} denote the additive and dominance effects. The residual error ε_{i} in the above two models is assumed to have a normal distribution with mean zero and variance σ^{2}_{ε}. As QTL may not be coincident with markers, the QTL genotype is usually unobservable. Therefore, the likelihood of the diploid model is a mixture of normals,

$$
L = \prod_{i=1}^{n} \left[ \sum_{j=1}^{k} p_{ij}\, f\!\left(y_i;\, \mu_j, \sigma^2_{\varepsilon}\right) \right],
\tag{8}
$$

where *f*(·; μ_{j}, σ^{2}_{ε}) denotes the normal density with mean μ_{j} and variance σ^{2}_{ε}; the μ_{j}'s correspond to the genotypic values of the *k* different QTL genotypes (*k* = 2 for the backcross model and *k* = 3 for the F_{2} model); and the mixing proportions, the *p*_{ij}'s, are the conditional probabilities of QTL genotypes (see Tables 1 and 2 in Kao and Zeng 1997). The maximum-likelihood estimate (MLE) of the QTL effects and their asymptotic variance-covariance matrix can be obtained using the EM algorithm (Dempster *et al.* 1977) and Louis's (1982) method by treating the normal mixture model as an incomplete-data problem.
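To make the mixture likelihood concrete, the following minimal sketch evaluates the log-likelihood of a normal mixture with known mixing proportions and performs one generic EM update of the genotype means and the common standard deviation. It is a textbook EM step under stated assumptions, not the FORTRAN implementation mentioned in the abstract and not Louis's method; the function names, the toy data, and the parameterization by genotype means are assumptions for illustration.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def mixture_loglik(y, p, mu, sd):
    """log L = sum_i log sum_j p[i][j] * N(y[i]; mu[j], sd^2)."""
    return sum(math.log(sum(pij * normal_pdf(yi, mj, sd) for pij, mj in zip(pi, mu)))
               for yi, pi in zip(y, p))

def em_step(y, p, mu, sd):
    """One EM iteration; the mixing proportions p[i][j] are known and stay fixed."""
    # E-step: posterior probability of each QTL genotype for each observation.
    post = []
    for yi, pi in zip(y, p):
        w = [pij * normal_pdf(yi, mj, sd) for pij, mj in zip(pi, mu)]
        total = sum(w)
        post.append([wj / total for wj in w])
    # M-step: posterior-weighted means, then the pooled residual variance.
    n, k = len(y), len(mu)
    new_mu = [sum(post[i][j] * y[i] for i in range(n)) /
              sum(post[i][j] for i in range(n)) for j in range(k)]
    var = sum(post[i][j] * (y[i] - new_mu[j]) ** 2
              for i in range(n) for j in range(k)) / n
    return new_mu, math.sqrt(var)

# Toy data: six trait values with hypothetical conditional probabilities (k = 2).
y = [0.1, 0.9, 1.2, -0.2, 1.1, 0.0]
p = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9], [0.9, 0.1], [0.2, 0.8], [0.8, 0.2]]
mu, sd = [0.0, 0.5], 1.0
before = mixture_loglik(y, p, mu, sd)
mu, sd = em_step(y, p, mu, sd)
after = mixture_loglik(y, p, mu, sd)
print(before, after)
```

Iterating `em_step` until the log-likelihood stabilizes gives the MLE of the genotype means; the QTL effects then follow from the model that links means to effects.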

### The relation between the diploid and triploid models:

When applying the diploid models to mapping endosperm traits, it is generally assumed that the endosperm traits are directly controlled by the diploid genomes of the backcross or F_{2} individuals. This assumption, however, violates the fact that the triploid endosperms represent the genetic composition of the next generation, which, in fact, is mainly responsible for the trait variation. Consequently, as compared to the use of the triploid model, some problems, such as less power and precision in QTL detection, will occur in the diploid model as shown below.

When an endosperm trait affected only by one QTL, *Q*, is regressed on a marker *M* along the genome to infer *Q*, the regression coefficient of *M* in the backcross diploid model is

$$
b_M = (1 - 2r_{QM}) \left[ \tfrac{3}{2}a + \tfrac{1}{4}(d_1 + d_2) \right],
\tag{9}
$$

where *r*_{QM} is the recombination fraction between *M* and *Q* in the backcross population (appendix B). If the marker *M* is coincident with *Q* (*r*_{QM} = 0), the coefficient reduces to *b*_{M} = ^{3}/_{2}*a* + ^{1}/_{4}(*d*_{1} + *d*_{2}). The estimated coefficient of the backcross diploid model is therefore composed of the additive effect and the two dominance effects. In the F_{2} diploid model, the regression coefficient for the additive effect of *M* is

$$
(1 - 2r_{QM})\, \tfrac{3}{2}a
\tag{10}
$$

and the coefficient for the dominance effect is

$$
(1 - 2r_{QM})^2\, \tfrac{1}{4}(d_1 + d_2).
\tag{11}
$$

If *M* and *Q* are coincident, the additive coefficient reduces to ^{3}/_{2}*a* and the dominance coefficient reduces to ^{1}/_{4}(*d*_{1} + *d*_{2}). That is, the additive coefficient estimated in the F_{2} diploid model is 1.5 times the additive effect, and the estimated dominance coefficient is one-quarter of the sum of the two dominance effects. When both the additive and dominance variables are fitted in the model, the partial regression coefficients are the same as Equations 10 and 11 because of orthogonality. The above derivations present the relation of parameters between the diploid and triploid models and show that the diploid models cannot directly estimate the QTL effects in mapping endosperm traits.
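The coincident-marker case can be checked directly from the expected endosperm values within each maternal genotype class. The sketch below assumes that the backcross is to the *qq* parent, that the one-locus coding is the one implied by the simulation settings, and that the diploid coded variables differ by one unit between *Qq* and *qq* (backcross) and contrast the heterozygote with the homozygote average (F_{2} dominance); these codings and the function names are assumptions for illustration.

```python
from fractions import Fraction as F

def genotypic_values(a, d1, d2, mu=0):
    """Genotypic values of QQQ, QQq, Qqq, qqq under the assumed coding."""
    return [mu + F(3, 2) * a, mu + F(1, 2) * a + d1,
            mu - F(1, 2) * a + d2, mu - F(3, 2) * a]

def diploid_coefficients(a, d1, d2):
    """Expected diploid-model coefficients when the marker sits on the QTL."""
    G1, G2, G3, G4 = genotypic_values(a, d1, d2)
    mean_QQ = G1                        # selfed QQ plants give only QQQ endosperm
    mean_qq = G4                        # selfed qq plants give only qqq endosperm
    mean_Qq = (G1 + G2 + G3 + G4) / 4   # selfed Qq plants give all four equally
    backcross_b = mean_Qq - mean_qq
    f2_additive = (mean_QQ - mean_qq) / 2
    f2_dominance = mean_Qq - (mean_QQ + mean_qq) / 2
    return backcross_b, f2_additive, f2_dominance

# The four one-QTL settings used in the simulation study.
for setting in [(1, -2, -2), (1, 2, 2), (1, -2, 0), (1, 0, 0)]:
    print(setting, diploid_coefficients(*setting))
```

For the four settings the backcross coefficient evaluates to 0.5, 2.5, 1.0, and 1.5, the values that the simulation section treats as the expected diploid effects.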

The phenotypic variance conditional on the marker *M* in the backcross diploid model is given by Equation 12 (appendix B). It shows that the genetic variances and covariances contributed by the additive and dominance effects cannot be fully controlled in the model. When *M* and *Q* are coincident, the percentages of additive and dominance variances controlled by the diploid model are only ∼47.4% (9/19) and 14.3% (1/7), respectively. For the F_{2} population, the phenotypic variance conditional on the additive and dominance variables of marker *M* is the same as that in the backcross model (appendix B), and the percentages of controlled additive and dominance variances are only ∼64.3% (9/14) and 14.3% (1/7), respectively. In addition, a part of the genetic covariances is also uncontrolled by the diploid model. The uncontrolled variances and covariances become a part of the genetic residual, causing inflation of the sampling variance of the coefficients. In a large sample with size *n*, the sampling variance of the regression coefficient of the backcross model is approximately σ^{2}_{y|M}/(*n*σ^{2}_{M}), where σ^{2}_{y|M} is the conditional phenotypic variance and σ^{2}_{M} is the variance of the coded variable of *M* (Stuart and Ord 1991). Using this approximation, the sampling variances of the regression coefficients of the diploid and triploid models can be compared when *M* and *Q* are coincident (*r*_{QM} = 0). Taking a QTL with no dominance (*a* = 1, *d*_{1} = *d*_{2} = 0) that contributes 10% of the trait variation as an example, the conditional phenotypic variance roughly equals σ^{2} for the triploid model, and it is ∼^{181}/_{171} × σ^{2} for the diploid model. The variances σ^{2}_{M} are ^{1}/_{4} for the diploid model and ^{19}/_{16} for the triploid model. Consequently, the sampling variance of the regression coefficient for the diploid model is ∼5.03 times that for the triploid model in the backcross population. It is ∼3.64 times that for the same setting in the F_{2} population. The sampling variances of the regression coefficients in the diploid models are therefore larger than those in the triploid model.
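The two ratios can be reproduced with exact arithmetic. The sketch below follows the large-sample approximation described in the text (conditional phenotypic variance divided by *n* times the variance of the coded variable); the sample size cancels in the ratio. The function name and arguments are hypothetical, and the variance ^{1}/_{2} used for the F_{2} diploid additive variable is an assumption corresponding to a 1, 0, −1 coding.

```python
from fractions import Fraction as F

def variance_ratio(h2, var_x_triploid, var_w_diploid, b_diploid, a=1):
    """Diploid / triploid ratio of large-sample sampling variances at r = 0
    for a QTL with additive effect a and no dominance."""
    var_g = var_x_triploid * a ** 2                  # genetic variance of the trait
    var_e = var_g * (1 - h2) / h2                    # residual variance implied by h2
    explained = b_diploid ** 2 * var_w_diploid       # variance the diploid model removes
    resid_diploid = var_e + var_g - explained        # conditional variance, diploid model
    diploid = resid_diploid / var_w_diploid
    triploid = var_e / var_x_triploid                # triploid model removes all of var_g
    return diploid / triploid

h2 = F(1, 10)
backcross = variance_ratio(h2, F(19, 16), F(1, 4), F(3, 2))
f2 = variance_ratio(h2, F(7, 4), F(1, 2), F(3, 2))
print(backcross, float(backcross))
print(f2, float(f2))
```

As in the text, the comparison is between the coefficient of the diploid model and the additive effect of the triploid model, which are on different scales.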

On the basis of the above findings, two problems will occur if the diploid models are applied to mapping endosperm traits. First, the estimates in the diploid models are generally confounded by the additive and dominance effects of endosperm QTL (Equations 9–11). Second, the sampling variances of the estimates will inflate because the genetic variances and covariances contributed by QTL are not fully controlled in the model. Consequently, the diploid models cannot directly estimate the effects of the endosperm QTL, and they have the confounding problems in estimation and will decrease the power in endosperm QTL detection.

## THE MIM MODEL FOR MAPPING ENDOSPERM TRAITS

### Endosperm trait multiple-interval mapping:

Assume an endosperm trait is controlled by *m* QTL, *Q*_{1}, *Q*_{2}, … , and *Q*_{m}, located at positions *p*_{1}, *p*_{2}, … , and *p*_{m}, in *m* different marker intervals, *I*_{1}, *I*_{2}, … , and *I*_{m}, along the genome. If only up to digenic epistasis is considered, the value of an endosperm trait, *y*_{i}, in the backcross or F_{2} population can be related to the *m* putative QTL by the model in Equation 13, where the parameters and coded variables have the same definitions as those in the genetic model in Equation 3, and the residual error ε_{i} is assumed to follow a normal distribution with mean zero and variance σ^{2}. In QTL mapping, the endosperm QTL genotype of any putative QTL, say *Q*_{j}, *j* = 1, 2, … , *m*, is usually not observable and could be *Q*_{j}*Q*_{j}*Q*_{j}, *Q*_{j}*Q*_{j}*q*_{j}, *Q*_{j}*q*_{j}*q*_{j}, or *q*_{j}*q*_{j}*q*_{j} with different (conditional) probabilities for different endosperm *i*. The conditional probabilities (distribution) for each *Q*_{j} under different experimental designs can be derived by using its flanking marker information from the maternal plants (and their embryos) as shown below, and then the normal mixture likelihood of the model can be constructed. As multiple (*m*) intervals are used to infer the conditional distribution of the (*m*) endosperm QTL for modeling, this approach is called multiple-interval mapping in QTL mapping (Kao and Zeng 1997; Kao *et al.* 1999), and this model is a MIM-based triploid model. By specifying appropriate conditional probabilities to the 4^{m} endosperm QTL genotypes of the *m* QTL, this triploid MIM model can be applied widely to mapping endosperm traits using data from various designs and populations.

### Likelihood:

For any interval, *I*_{j}, flanked by the two markers, *M*_{j} and *N*_{j}, the maternal plants or their embryos can have four and nine different marker genotypes in the backcross and F_{2} populations, respectively. If both the plants and embryos are considered together, their marker genotypes can have 16 and 25 combinations in the two different populations, respectively (appendix C). For any *Q*_{j} in *I*_{j}, the (conditional) probabilities of the four endosperm QTL genotypes can be inferred only from the maternal plants (one-stage design) or from both the maternal plants and their embryos (two-stage design) as shown in appendix C. To assist with explaining the parameter estimation, these conditional probabilities are extracted to form a matrix **Q**_{j}, *j* = 1, 2, … , *m*. The dimension of **Q**_{j} is 25 × 4 (16 × 4) for a two-stage design in the F_{2} (backcross) population, and it is 9 × 4 (4 × 4) for a one-stage design in the F_{2} (backcross) population (note that *Q* denotes QTL, and **Q** denotes the conditional probability matrix). For the total *m* QTL in the *m* different intervals, there are 4^{m} possible endosperm QTL genotypes in each of 25^{m} (16^{m}, 9^{m}, or 4^{m}) possible marker genotypes. The 4^{m} joint conditional probabilities of endosperm QTL genotypes can be obtained by the product of individual conditional probabilities for each QTL using the property of conditional independence among different QTL (Kao and Zeng 1997), and they play the role of mixing proportions in the normal mixture likelihood. Let the conditional probabilities of the 4^{m} possible QTL genotypes for endosperm *i* from the different designs and populations be denoted as *p*_{ij}'s, *j* = 1, 2, … , 4^{m} (note that *p*_{j}'s denote QTL positions, and *p*_{ij}'s denote the conditional probabilities). The likelihood of the triploid MIM model for the *n* endosperms is a mixture of 4^{m} normals,

$$
L = \prod_{i=1}^{n} \left[ \sum_{j=1}^{4^m} p_{ij}\, f\!\left(y_i;\, \mu_j, \sigma^2\right) \right],
\tag{14}
$$

where *f*(·; μ_{j}, σ^{2}) denotes the normal density with mean μ_{j} and variance σ^{2}; the μ_{j}'s correspond to the genotypic values of the 4^{m} different QTL genotypes; and the mixing proportions, the *p*_{ij}'s, are the corresponding joint conditional probabilities. The density of each individual *i* is a mixture of 4^{m} possible normal densities with different means, μ_{j}'s, and mixing proportions, *p*_{ij}'s. The general formulas proposed by Kao and Zeng (1997) are used to obtain the MLE of the effects and their asymptotic variance-covariance matrix.

### Parameter estimation:

The application of the general formulas to obtain the MLE and the asymptotic variance-covariance matrix for the triploid MIM model is based on the construction of the two matrices *D* and **Q**, where *D* is the genetic design matrix for characterizing the QTL effects, and **Q** is the conditional probability matrix containing the mixing proportions of QTL genotypes. Given the two matrices, the MLE of the QTL effects and their asymptotic variance-covariance matrix in the triploid model can be easily obtained. The construction of the *D* and **Q** matrices is described below.

For one QTL (*m* = 1) in the model, there are four endosperm QTL genotypes and three genetic effects, and the genetic design matrix is a 4 × 3 matrix as shown in Equation 1. For *m* QTL in the model, if epistasis between QTL effects is not considered, there are 4^{m} endosperm QTL genotypes and 3*m* genetic effects (*m* additive effects, *m* first dominance effects, and *m* second dominance effects), and the genetic design matrix is then a 4^{m} × 3*m* matrix. If all the possible digenic epistases between QTL are considered, the column dimension of *D* becomes 3*m*(3*m* − 1)/2. An example of a genetic design matrix with *m* = 2 and all possible effects (with dimension 16 × 15) can be found in Wu *et al.* (2002a). The joint conditional probability matrix **Q** for the *m* QTL has dimension 9^{m} × 4^{m} (4^{m} × 4^{m}) or 25^{m} × 4^{m} (16^{m} × 4^{m}) under the one- or two-stage design in the F_{2} (backcross) population, and it can be obtained by **Q** = **Q**_{1} ⊗ **Q**_{2} ⊗ … ⊗ **Q**_{m}, where ⊗ denotes the Kronecker product. The 4^{m} mixing proportions of any endosperm *i*, the *p*_{ij}'s, in the likelihood can be found to be one of the rows in **Q** according to the marker genotypes of the plants (and embryos). By applying the matrices *D* and **Q** to the general formulas, the MLE of the effects and their asymptotic variance-covariance matrix can be readily obtained.
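The construction of the two matrices can be sketched for the simplest case, a one-stage design in the backcross population. The conditional probabilities below are not copied from appendix C; they are a standard flanking-marker calculation under assumptions made here for illustration: no crossover interference, a backcross to the *qq* parent, marker order *M*, *Q*, *N* with recombination fractions `r1` and `r2`, and the one-locus coding implied by the simulation settings. All function names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

def endosperm_probs_backcross(r1, r2):
    """One-stage backcross Q_j (4 x 4). Rows: maternal flanking-marker genotypes
    (Mm/Nn, Mm/nn, mm/Nn, mm/nn). Columns: endosperm genotypes QQQ, QQq, Qqq, qqq."""
    r = r1 + r2 - 2 * r1 * r2                       # marker-marker recombination
    p_het = [(1 - r1) * (1 - r2) / (1 - r), (1 - r1) * r2 / r,
             r1 * (1 - r2) / r, r1 * r2 / (1 - r)]  # P(maternal plant is Qq | markers)
    # A selfed Qq plant gives each endosperm genotype with probability 1/4;
    # a selfed qq plant gives only qqq.
    return [[h / 4, h / 4, h / 4, 1 - 3 * h / 4] for h in p_het]

def kron(A, B):
    """Kronecker product of two matrices stored as nested lists."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

# Assumed one-locus design matrix: rows QQQ, QQq, Qqq, qqq; columns (a, d1, d2).
D1 = [[F(3, 2), 0, 0], [F(1, 2), 1, 0], [F(-1, 2), 0, 1], [F(-3, 2), 0, 0]]

def design_matrix(m, epistasis=True):
    """Rows: the 4**m joint genotypes, first QTL varying slowest (as in kron).
    Columns: 3m marginal effects, then 9 products for every pair of QTL."""
    rows = []
    for combo in product(range(4), repeat=m):
        marginal = [D1[g][k] for g in combo for k in range(3)]
        row = list(marginal)
        if epistasis:
            for j in range(m):
                for k in range(j + 1, m):
                    row += [marginal[3 * j + u] * marginal[3 * k + v]
                            for u in range(3) for v in range(3)]
        rows.append(row)
    return rows

Q1 = endosperm_probs_backcross(F(1, 10), F(1, 10))
Q = kron(Q1, Q1)          # two QTL in two intervals with the same spacing
D = design_matrix(2)
print(len(Q), len(Q[0]))  # rows: marker genotypes, columns: joint QTL genotypes
print(len(D), len(D[0]))  # rows: joint QTL genotypes, columns: effects
```

Because both `kron` and `design_matrix` let the first QTL vary slowest, column *j* of **Q** and row *j* of *D* refer to the same joint genotype, which is what the mixture likelihood requires.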

### The problems if effects are present and ignored:

Three marginal genetic effects are associated with each endosperm QTL. In practice, QTL may display all or only some of the effects (see Wu *et al.* 2002b for an example), and, before mapping, it is not known which effects are present or absent. The possible drawback of fitting absent effects (overfitting) in the model is a loss of power in QTL detection, as a higher critical value is usually required to claim the significance of QTL. If some displayed effects are ignored in the model, not only is the power of detection affected, but a confounding problem also occurs, as discussed below.

Assume the endosperm trait value *y* is affected by two nonepistatic endosperm QTL, *Q*_{1} and *Q*_{2}. When the trait value is regressed on *Q*_{1} by fitting only its additive variable *x*_{1} into the model, the regression coefficients in terms of the QTL effects and linkage parameter for the backcross and F_{2} populations are shown in appendix D. They show that the estimate of the additive effect of *Q*_{1} is biased for *a*_{1} and is confounded by its other effects and by the effects of *Q*_{2}. The confounding by the *Q*_{2} effects acts through the linkage parameter. If *Q*_{1} and *Q*_{2} are unlinked, the regression coefficients reduce to much simpler forms without the confounding by *Q*_{2}. For example, if *r*_{12} = 0.5, the coefficient is *a*_{1} + ^{5}/_{38}*d*_{11} + ^{1}/_{38}*d*_{12} for the backcross population and *a*_{1} + ^{1}/_{28}(*d*_{11} − *d*_{12}) for the F_{2} population, as follows from the variances and covariances of the coded variables given above. The confounding by *Q*_{2} disappears, and the coefficient is confounded only by the dominance effects of *Q*_{1}. The same confounding problem can also be found for the estimate of the dominance effect if only its dominance variable *z*_{1} is fitted in the model (appendix D). If epistasis is present and ignored in the model, most of the epistatic effects will be confounded in the estimation, as most of the covariances between the marginal and epistatic effects are not zero whether the QTL are linked or not (result not shown). To avoid the confounding problem and enhance the detection power, it is desirable to fit only the displayed effects into the model in QTL mapping.
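The unlinked case can be illustrated numerically: when only the additive variable is fitted, the regression slope is Cov(*x*, *G*)/*V*(*x*), so the dominance effects leak into it through Cov(*x*, *z*_{1}) and Cov(*x*, *z*_{2}). The sketch below is a minimal check under assumptions made here (the coding implied by the simulation settings, a backcross to the *qq* parent, and a second QTL that is unlinked and therefore drops out); the names are hypothetical.

```python
from fractions import Fraction as F

# Assumed coded variables (x, z1, z2) for QQQ, QQq, Qqq, qqq and the
# endosperm genotype frequencies after selfing the maternal plants.
CODES = [(F(3, 2), 0, 0), (F(1, 2), 1, 0), (F(-1, 2), 0, 1), (F(-3, 2), 0, 0)]
FREQ = {
    "backcross": [F(1, 8), F(1, 8), F(1, 8), F(5, 8)],
    "F2": [F(3, 8), F(1, 8), F(1, 8), F(3, 8)],
}

def additive_only_coefficient(pop, a, d1, d2):
    """Slope of the genotypic value on x alone: Cov(x, G) / Var(x)."""
    p = FREQ[pop]
    x = [c[0] for c in CODES]
    g = [a * c[0] + d1 * c[1] + d2 * c[2] for c in CODES]
    mean_x = sum(pi * xi for pi, xi in zip(p, x))
    mean_g = sum(pi * gi for pi, gi in zip(p, g))
    cov = sum(pi * (xi - mean_x) * (gi - mean_g) for pi, xi, gi in zip(p, x, g))
    var = sum(pi * (xi - mean_x) ** 2 for pi, xi in zip(p, x))
    return cov / var

# With a = 1 and d1 = d2 = -2 the backcross slope is pulled away from 1,
# whereas in the F2 the two dominance contributions cancel.
print(additive_only_coefficient("backcross", 1, -2, -2))
print(additive_only_coefficient("F2", 1, -2, -2))
print(additive_only_coefficient("F2", 1, -2, 0))
```

Setting *a* = 0 and one dominance effect to 1 isolates the weight that each dominance effect receives in the slope.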

## SIMULATION STUDY

A series of simulations was performed to achieve three purposes: (1) to verify the derived relations and compare the differences between the diploid and triploid models, (2) to examine the performance of the triploid method in different experimental designs and populations, and (3) to evaluate the performance of the proposed MIM-based triploid method as compared to the current methods in mapping endosperm traits. The simulation study includes two parts. The first part addresses the first two purposes, and the second part addresses the third purpose. In each part, the sample size is assumed to be 200. The first part assumes one QTL affecting the endosperm trait with two levels of heritability (*h*^{2}), 0.1 and 0.2. It includes four different parameter settings: (1) *a* = 1, *d*_{1} = −2, *d*_{2} = −2 (*G*_{1} = ^{3}/_{2}, *G*_{2} = −^{3}/_{2}, *G*_{3} = −^{5}/_{2}, and *G*_{4} = −^{3}/_{2}); (2) *a* = 1, *d*_{1} = 2, *d*_{2} = 2 (*G*_{1} = ^{3}/_{2}, *G*_{2} = ^{5}/_{2}, *G*_{3} = ^{3}/_{2}, and *G*_{4} = −^{3}/_{2}); (3) *a* = 1, *d*_{1} = −2, *d*_{2} = 0 (*G*_{1} = ^{3}/_{2}, *G*_{2} = −^{3}/_{2}, *G*_{3} = −^{1}/_{2}, and *G*_{4} = −^{3}/_{2}); and (4) *a* = 1, *d*_{1} = 0, *d*_{2} = 0 (*G*_{1} = ^{3}/_{2}, *G*_{2} = ^{1}/_{2}, *G*_{3} = −^{1}/_{2}, and *G*_{4} = −^{3}/_{2}). Among the four settings, the QTL genotypes are of the complete-recessive type in the first and third settings, and they are of the complete-dominance type in the second setting. For each setting, the QTL is placed in the middle of a chromosome with six 20-cM equally spaced markers, and data from both the one- and two-stage designs in the backcross and F_{2} populations were generated. The number of simulation replicates is 500. Both the diploid and triploid methods were used to detect the QTL using the generated data sets. The results are shown in Tables 1 and 2. The second part assumes three chromosomes each with six 20-cM equally spaced markers, and each chromosome contains only one QTL. 
The three unlinked QTL, *Q*_{A}, *Q*_{B}, and *Q*_{C}, are assumed to contribute 40% to the total trait variation together and to be located in the middle of the chromosomes. Data from the two-stage design in the F_{2} population were generated. The parameter setting is *a*_{1} = 3, *d*_{11} = −3, and *d*_{12} = −3 for *Q*_{A}; *a*_{2} = 2.5, *d*_{21} = 4, and *d*_{22} = 4 for *Q*_{B}; and *a*_{3} = 1.5, *d*_{31} = 0, and *d*_{32} = 0 for *Q*_{C}. There is additive-by-additive interaction between *Q*_{B} and *Q*_{C}, and the epistatic effect *i*_{a2a3} is assumed to be 1. Under the parameter setting, the genetic and environmental variances are ∼38.37 and 51.66, respectively. In the total genetic variance, the marginal effects of the three QTL contribute ∼45.44, 36.32, and 10.26%, respectively, and the epistatic effect contributes ∼7.98%. In the genetic variance contributed by *Q*_{A} (*Q*_{B}), the variance contributed by the two dominance effects is ∼11.29% (25.11%). The number of simulation replicates is 100. Both the current triploid method considering only one QTL, *i.e*., the interval-mapping (IM)-based method, and the proposed MIM-based method were used to analyze the data. The results are shown in Table 3. In each scenario, permutation tests proposed by Churchill and Doerge (1994) were used to determine the critical values for power calculation.
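The genetic variance and its decomposition for the three-QTL setting can be checked numerically. The following is a minimal sketch, not the author's program: it assumes the F_{2} endosperm genotype frequencies and the coding (*x*, *z*_{1}, *z*_{2}) given in Appendix B, and it assumes that the three QTL are unlinked so that the marginal and epistatic terms are mutually uncorrelated. The function and variable names are illustrative only.

```python
from fractions import Fraction as F

# F2 endosperm genotype frequencies for (QQQ, QQq, Qqq, qqq) and their coding
# (x additive; z1, z2 dominance), taken from Appendix B.
FREQ = (F(3, 8), F(1, 8), F(1, 8), F(3, 8))
CODE = ((F(3, 2), 0, 0), (F(1, 2), 1, 0), (F(-1, 2), 0, 1), (F(-3, 2), 0, 0))

def marginal_variance(a, d1, d2):
    """Variance of G = a*x + d1*z1 + d2*z2 at a single QTL."""
    g = [a * x + d1 * z1 + d2 * z2 for x, z1, z2 in CODE]
    mean = sum(p * v for p, v in zip(FREQ, g))
    return sum(p * (v - mean) ** 2 for p, v in zip(FREQ, g))

var_qa = marginal_variance(F(3), F(-3), F(-3))    # Q_A
var_qb = marginal_variance(F(5, 2), F(4), F(4))   # Q_B
var_qc = marginal_variance(F(3, 2), F(0), F(0))   # Q_C

# Additive-by-additive epistasis i * x_B * x_C with i = 1. Assumption: Q_B and
# Q_C are unlinked, so x_B and x_C are independent with mean zero and the
# product term is uncorrelated with every marginal term.
var_x = marginal_variance(F(1), F(0), F(0))
var_epi = F(1) ** 2 * var_x * var_x

total = var_qa + var_qb + var_qc + var_epi
shares = [round(float(100 * v / total), 2)
          for v in (var_qa, var_qb, var_qc, var_epi)]
print(float(total), shares)
```

Under these assumptions the total is 38.375 and the four percentage shares agree with the values quoted in the text.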

Tables 1 and 2 show the results of the first part of the simulation. The relationship between the estimates of the diploid and triploid models corresponds very well with the derived prediction (Equations 9–11). For the backcross population, the effects of the diploid models in the four settings are expected to be 0.5, 2.5, 1.0, and 1.5, according to Equation 9. The means of the estimates are found to be 0.610, 2.516, 1.040, and 1.521, respectively, for *h*^{2} = 0.1 (Table 1), and they are 0.599, 2.489, 1.005, and 1.475, respectively, for *h*^{2} = 0.2 (Table 2). For the F_{2} population, the means of the estimated additive and dominance effects in the diploid model are also found to be very close to the predicted values in both levels of heritability. For example, the mean of the estimated additive effects for the first setting with *h*^{2} = 0.1 is 1.499 (predicted value 1.5), and the mean of the estimated dominance effects for the second setting with *h*^{2} = 0.2 is 1.010 (predicted value 1.0). The estimated residual variance by the diploid model is found to be upwardly biased in all cases as expected by Equation 12.
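The expected backcross values quoted above follow from the relation *b*_{yM} = (1 − 2*r*_{QM})[3*a*/2 + (*d*_{1} + *d*_{2})/4] in Appendix B, evaluated at the QTL itself (*r*_{QM} = 0). A small sketch of that arithmetic for the four parameter settings is given below; the function names are illustrative and not part of the original program.

```python
def genotypic_values(a, d1, d2):
    """Values of QQQ, QQq, Qqq, qqq under G = a*x + d1*z1 + d2*z2."""
    return (1.5 * a, 0.5 * a + d1, -0.5 * a + d2, -1.5 * a)

def backcross_diploid_effect(a, d1, d2, r_qm=0.0):
    """Expected diploid-model coefficient, (1 - 2r)[3a/2 + (d1 + d2)/4]."""
    return (1 - 2 * r_qm) * (1.5 * a + (d1 + d2) / 4)

# The four (a, d1, d2) settings of the first part of the simulation.
SETTINGS = [(1, -2, -2), (1, 2, 2), (1, -2, 0), (1, 0, 0)]
expected = [backcross_diploid_effect(*s) for s in SETTINGS]
print(expected)  # [0.5, 2.5, 1.0, 1.5]
```

The same helper with *r*_{QM} > 0 shows how the expected coefficient shrinks as the analyzed position moves away from the QTL.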

The most striking differences in power and estimation between the diploid and triploid models are found in the first parameter setting when the additive and dominance effects are in the opposite direction and *h*^{2} = 0.2 (Table 2). The detecting powers of the diploid model are 0.160 and 0.100, respectively, in the two different populations. The detecting powers of the triploid model are 0.508 and 0.926, respectively, under the one-stage design, and they increase to 0.980 and 0.998, respectively, under the two-stage design. For QTL position, the means of position estimates by the diploid model are 46.46 (SD 28.58) and 49.63 (SD 11.10), respectively, in the two populations. The means of position estimates provided by the triploid model under the two-stage design are 49.77 (SD 7.08) and 50.21 (SD 5.68), respectively, and they are 49.00 (SD 24.26) and 51.03 (SD 11.19) under the one-stage design, respectively. Therefore, the triploid model performs significantly better than the diploid model in this setting. In other settings, the triploid model under the two-stage design is also found to be much more powerful and precise than the diploid model, but the triploid model under the one-stage design seems to provide power and precision (in position estimation) similar to the diploid model. For example, in the third setting of the backcross population with *h*^{2} = 0.1, the diploid model has power 0.400 and mean estimated position 49.03 (SD 23.68; Table 1). For the triploid model, they are 0.406 and 50.33 (SD 23.67) under the one-stage design, and they are 0.800 and 49.15 (SD 12.38) under the two-stage design. In the second setting of the F_{2} population with *h*^{2} = 0.1, the diploid model has power 0.668 and mean estimated position 49.66 (SD 19.16). For the triploid model, they are 0.648 and 49.85 (SD 19.05) under the one-stage design, and they are 0.796 and 49.86 (SD 12.35) under the two-stage design. 
A similar pattern can also be found for the other settings in Tables 1 and 2.

The triploid model is found to have better performance under the two-stage design than under the one-stage design in this study. Under the two-stage design, the triploid model can provide higher power for QTL detection and more precise estimates for positions and effects. For example, in the first setting with *h*^{2} = 0.1 in the backcross population, the powers are 0.190 and 0.730, respectively (Table 1), and the means of the position estimates are 47.48 (SD 29.98) and 49.39 (SD 15.10), respectively, under the two different designs. In the second setting with *h*^{2} = 0.2 in the F_{2} population, the powers are 0.938 and 0.986, respectively (Table 2), and the means of the position estimates are 50.70 (SD 11.65) and 49.77 (SD 6.37), respectively, under the two different designs. Besides, the triploid model under the one-stage design seems to have problems in correctly estimating the effects in the backcross population when the additive and dominance effects are in opposite direction. For example, in the first setting (*a* = 1, *d*_{1} = −2, and *d*_{2} = −2), the means of the effect estimates by the triploid model under the one-stage design are 0.199 (SD 0.452), 1.587 (SD 2.342), and −0.589 (SD 2.202), respectively, for *h*^{2} = 0.1 (Table 1), and they are 0.143 (SD 0.344), 2.117 (SD 1.628), and −0.766 (SD 1.547), respectively, for *h*^{2} = 0.2 (Table 2). These estimates are highly biased and imprecise under the one-stage design. Similar problems can also be found in the third setting (*a* = 1, *d*_{1} = −2, and *d*_{2} = 0) for the backcross population. Such estimation problems, however, do not occur in the F_{2} population or under the two-stage design (see Tables 1 and 2), which may suggest that the F_{2} population is a better population than the backcross population and the two-stage design might be a more suitable design than the one-stage design for mapping endosperm traits.

The simulation in the second part aims to evaluate and compare the differences between the proposed MIM-based and the current IM-based methods in mapping endosperm traits. The results are shown in Table 3. When the IM-based method is used to detect QTL, three different models, the additive-effect model (with *a* only), the one dominant-effect model (with *a* and *d*_{1}), and the complete-effect model (with *a*, *d*_{1}, and *d*_{2}), will be implemented in the search. The experimentwise critical values at 0.05 significance level are found to be 9.36, 12.57, and 13.48 for the three different models, respectively, by 1000 permutations. For the additive-effect model, the powers to detect *Q*_{A}, *Q*_{B}, and *Q*_{C} are 0.97, 0.96, and 0.41, respectively. For the one dominant-effect model, the powers to detect the three QTL are 0.97, 0.95, and 0.31, respectively. For the complete-effect model, the powers are 0.97, 0.94, and 0.31, respectively. The three models have similar powers to detect *Q*_{A} and *Q*_{B}, and the additive-effect model has greater power than the other two models to detect *Q*_{C}. Among the 100 replicates, the three models can detect either both or one of *Q*_{A} and *Q*_{B} in each replicate. The results of mapping *Q*_{A} and *Q*_{B} by the complete-effect model and mapping *Q*_{C} by the additive-effect model are presented in Table 3. In Table 3, the means of the position estimates for the three QTL are 48.94 (SD 7.86), 50.95 (SD 6.50), and 50.83 (SD 20.43), respectively. The average LRT statistics are 31.12 (SD 10.00), 25.99 (SD 8.82), and 9.26 (SD 6.45), respectively, for the three QTL. This shows that the larger QTL, *Q*_{A} and *Q*_{B}, can be detected with higher power and better precision as compared to the small QTL, *Q*_{C}. Besides, the estimates of additive effects generally are more precise than those of dominance effects. 
For example, the mean of *â*_{1} is 3.003 (SD 0.554), and the means of *d̂*_{11} and *d̂*_{12} are −1.449 (SD 3.451) and −3.995 (SD 2.585), respectively. One of the advantages of the MIM-based method is that it is capable of fitting the detected QTL into the model in further searching for the other QTL. When the MIM-based method considers only one QTL in the model (*m* = 1), the mapping results are identical to those obtained by the IM-based method. Among the 100 replicates analyzed by the IM-based method, most of the replicates (91 replicates) have both *Q*_{A} and *Q*_{B} detected. For the remaining 9 replicates, either *Q*_{A} or *Q*_{B} is detected. If the detected *Q*_{A} (*Q*_{B}) is fitted into the MIM-based model in the search (*m* = 2), the undetected *Q*_{B} (*Q*_{A}) in the 9 replicates can be identified and the already detected *Q*_{B} (*Q*_{A}) in the other replicates will have a larger LRT statistic by including either their partial or complete effects in the model (that is, the power for detecting *Q*_{A} and *Q*_{B} is 1.0 for MIM with *m* = 2). To shorten the article, only the results of considering complete effects of *Q*_{A} and *Q*_{B} in the analysis are presented (Table 3). The average (partial) LRT statistics of *Q*_{A} and *Q*_{B} increase to 35.18 (SD 10.57) and 30.07 (SD 10.42), respectively. Further, if these two detected QTL are fitted into the MIM model for QTL search along the third chromosome (*m* = 3), the power to detect *Q*_{C} is 0.57 (average LRT statistic 11.36 with SD 6.97) if only the additive effect (*a*_{3}) is considered (Table 3). The power decreases to 40% (36%) if the one-dominant-effect (complete-effect) model is considered (not shown). The means of the position estimates are 49.37 (SD 6.34), 50.60 (SD 6.63), and 49.40 (SD 18.29) for the three QTL, respectively, which become more precise as compared to those by the IM-based method. 
If epistasis is taken into account to search for the third chromosome, many different types of epistasis can be considered. For illustration, only the additive-by-additive epistatic effect between QTL is considered (see also genetic model of endosperm traits for first taking the additive-by-additive effect into account). Among the three possible additive-by-additive effects, only the consideration of *i*_{a2a3} improves the QTL detection. The power increases to 71% (Table 3) when *i*_{a2a3} is considered in the MIM model (*m* = 3 with epistasis) to search for *Q*_{C} (critical value 12.57 by permutation tests; average partial LRT statistic 16.84 with SD 7.79). The mean estimate of *i*_{a2a3} is 0.904 (SD 0.510), and the mean estimate of σ^{2} is 50.39 (SD 8.26). The mean of position estimate for *Q*_{C} becomes 48.97 (SD 17.18), and the mean of the estimated effect is 1.510 (SD 0.698), which is more precise than that obtained by ignoring epistasis.

## CONCLUSION AND DISCUSSION

The endosperm of a seed is a triploid tissue and has a more complicated genetic mechanism than the diploid tissues. Therefore, the traditional QTL mapping methods (Lander and Botstein 1989; Haley and Knott 1992; Jansen 1993; Zeng 1994; Churchill and Doerge 1994; Kao *et al.* 1999; Kao and Zeng 2002) designed for traits under diploid control are not appropriate approaches to map for QTL underlying the endosperm traits because they ignore the triploid nature of endosperms. Wu *et al.* (2002a,b) and Xu *et al.* (2003) first considered the triploid inheritance of endosperms to propose IM-based triploid methods in the detection of the underlying QTL. In this article, a new triploid approach based on the MIM method is developed to take multiple QTL into account in the model for mapping endosperm traits. The proposed method can be implemented to analyze data from either the one-stage design using only maternal genotypes or the two-stage design using both maternal and embryo genotypes in the backcross and F_{2} populations. As shown in this article, the triploid MIM method can provide better detection power and estimation precision, and it can analyze and search for epistatic QTL directly in comparison with the current IM-based methods when mapping endosperm traits. Some important issues in mapping endosperm traits, such as the problems of using the diploid mapping methods, the relation between the diploid and triploid methods, the variance components of genetic variance, the problems if effects are present and ignored, and the relative efficiency of the diploid and triploid models under different experimental designs, are also investigated analytically or by simulation.

The triploid mapping method can provide better power in detection and more precise estimation under the two-stage design than under the one-stage design in mapping endosperm traits as shown in the simulation study (Tables 1 and 2) and also demonstrated by Wu *et al.* (2002b). This is because the two-stage design, which provides both the maternal and embryo marker genotypes, is more informative than the one-stage design, which offers only the maternal marker genotype, in inferring the conditional probabilities of the endosperm QTL genotypes (see the website http://www.stat.sinica.edu.tw/chkao/ for the conditional probabilities under different experimental designs). In the backcross population, the one-stage design provides only 4 different marker genotypes, and these marker genotypes are noninformative in inferring *QQQ*, *QQq*, and *Qqq* as equal conditional probabilities are assigned to them. The two-stage design, however, can provide 16 different marker genotypes, and the marker genotypes are not informative only for *QQq* and *Qqq*. In the F_{2} population, the one- and two-stage designs can provide 9 and 25 marker genotypes, respectively, and each marker genotype in either design is noninformative only for the genotypes *QQq* and *Qqq*. Therefore, the two-stage design is generally more informative than the one-stage design, and the F_{2} population is generally more informative than the backcross population in inferring the conditional probabilities. As these conditional probabilities are the mixing proportions in the normal mixture likelihood, they play a very important role in the quality of the estimation of QTL parameters for the model. A more informative design or population can provide more detailed information in inferring the conditional probabilities and thus can help improve the estimation of QTL parameters.
This argument can explain why the performance of the triploid method is generally poor under the one-stage design in the backcross population as compared to its performance under the other data structures (see, for example, the simulation results in Tables 1 and 2 when the additive and dominance effects are in opposite directions) and why the triploid method under the two-stage design can perform well with satisfactory power and precision in all the parameter settings. The two-stage design generally requires more genotyping work as both the genomes of the plants and their seeds need to be genotyped, and different sampling strategies for allocations of a given sample size between the two generations should be considered for cost control. Besides, Wu *et al.* (2002b) also pointed out that the different sampling strategies for allocations can affect the parameter estimation. Therefore, the best strategy of allocation for the two-stage design under the consideration of cost and estimation deserves further investigation in practical QTL mapping.

The traditional diploid methods proposed for mapping diploid traits have been applied to mapping endosperm traits by several researchers (Tan *et al.* 1999; Wang and Larkins 2001; Wang *et al.* 2001). Such applications generally violate the traditional belief that the endosperm traits are under the control of triploid mechanisms (Benner *et al.* 1989; Zhu and Weir 1994; Wu *et al.* 2002a,b). If the diploid methods are applied to mapping endosperm traits, the confounding problem in estimation will occur (Equations 9–11), and the sampling variances of the estimates will inflate. Consequently, the diploid methods can cause some problems, such as bias in estimation and loss in power, in mapping endosperm traits. Although the diploid methods have these problems, the simulation study indicates that, in some parameter settings, their performance (in power and position estimate) can be similar to that of the triploid method under the one-stage design (Tables 1 and 2) due mainly to the correlation between the genomes of the maternal plant and its endosperms. Therefore, the diploid method can still be used as a preliminary method in mapping endosperm traits. By taking the triploid mechanism into account, the triploid method, especially under the two-stage design, can effectively solve the problems and significantly improve the mapping of endosperm traits.

The proposed MIM-based triploid method is a multiple-QTL model. This multiple-QTL approach distinguishes itself from the current IM-based methods of Wu *et al.* (2002a,b) and Xu *et al.* (2003) by the ability to use multiple-marker intervals simultaneously to fit multiple QTL into the model in mapping endosperm traits. As a result, the proposed method can provide greater power and precision, and it can readily analyze and search for epistatic QTL in endosperm trait mapping. Besides, the estimation procedures between these methods are different. The likelihood of the MIM-based method is a mixture of 4^{*m*} normals and will become increasingly unwieldy in maximization as the number of QTL (*m*) fitted into the model increases. To solve the maximization problem with large *m*, the general formulas proposed by Kao and Zeng (1997) are applied to obtain the MLE of QTL effects as well as their variance-covariance matrix (see the MIM model for mapping endosperm traits). The procedure of the general formulas is a maximum-likelihood approach based on the EM algorithm. The method by Xu *et al.* uses an iteratively reweighted least squares (IRWLS) procedure, which is a second-order approximation to the maximum likelihood, and it has problems in estimating the two dominance effects separately as pointed out by Xu *et al.* The estimation procedure in Wu *et al.* also implements a maximum-likelihood approach via the EM algorithm, but it needs additional procedures in the M-step to obtain the MLE if some QTL effects are not considered in the model (see appendix B in Wu *et al.* 2002b). The general formulas, however, do not have these problems and are relatively straightforward and simple to maximize. An initial version of the triploid MIM program source code (written in Fortran 77 language) is available on the worldwide web (http://www.stat.sinica.edu.tw/chkao/).
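To illustrate why the likelihood becomes unwieldy as *m* grows, the sketch below enumerates the 4^{*m*} mixture components for one individual. It is an illustration under stated assumptions, not the Fortran program: it assumes no epistasis, assumes that the mixing proportion of a joint genotype is the product of per-QTL conditional probabilities (which presumes the QTL lie in separate, unlinked intervals), and uses placeholder uniform probabilities in place of the actual conditional probabilities. The names `CODING`, `mixture_components`, and `EFFECTS` are hypothetical.

```python
from itertools import product

# Coding (x, z1, z2) of the four endosperm genotypes, as in Appendix B.
CODING = {"QQQ": (1.5, 0, 0), "QQq": (0.5, 1, 0),
          "Qqq": (-0.5, 0, 1), "qqq": (-1.5, 0, 0)}

def mixture_components(effects, cond_probs, mu=0.0):
    """Return (mixing proportion, component mean) for all 4**m joint genotypes.

    effects    : list of (a, d1, d2), one tuple per QTL
    cond_probs : list of dicts genotype -> conditional probability, one per QTL
    """
    components = []
    for combo in product(CODING, repeat=len(effects)):
        weight, mean = 1.0, mu
        for geno, (a, d1, d2), probs in zip(combo, effects, cond_probs):
            x, z1, z2 = CODING[geno]
            weight *= probs[geno]               # assumed independence across QTL
            mean += a * x + d1 * z1 + d2 * z2   # marginal effects only
        components.append((weight, mean))
    return components

# Effects of Q_A, Q_B, Q_C from the simulation; uniform placeholder probabilities.
EFFECTS = [(3, -3, -3), (2.5, 4, 4), (1.5, 0, 0)]
uniform = {g: 0.25 for g in CODING}
comps = mixture_components(EFFECTS, [uniform] * 3)
print(len(comps))  # 64 components for m = 3
```

Each additional QTL multiplies the number of normal components by four, which is the motivation for an estimation procedure whose updates do not require handling the components one parameter set at a time.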

It has been pointed out that the critical value for claiming QTL detection is a very complicated issue and deserves further investigation (Lander and Botstein 1989; Jansen 1993; Zeng 1994; Kao *et al.* 1999). Generally, the critical value depends on the number and size of intervals, different levels of heritability (size of QTL), different numbers of (linked or unlinked) QTL, and linked QTL in the same or opposite direction of effects. Visscher and Haley (1996) pointed out that the critical value should be reduced after a QTL of large effect has been detected. The determination of critical value in mapping endosperm traits will be more complicated as each QTL can have three possible effects and many different types of epistasis, and more different experimental designs (the one-stage or two-stage design with different allocations in the backcross or F_{2} population) can be considered. In this article, the permutation tests by Churchill and Doerge (1994) are used to determine the critical value for claiming QTL detection in endosperm trait mapping. It is found that the critical value for the triploid model in the two-stage design is larger than that in the one-stage design (Tables 1 and 2). Given the same heritability, the critical value in the F_{2} population is larger than that in the backcross population except for the third setting. More effort is needed to unravel the issue of critical value in mapping endosperm traits. The understanding of QTL underlying the endosperm traits is very important to cereal breeding in improving yield potential and grain quality. This MIM-based triploid method can serve as an effective tool to estimate the parameters associated with the underlying QTL in mapping endosperm traits. Another important issue worth pursuing is to investigate the properties of different genetic models in mapping endosperm traits.
Besides, several researchers (Zhu and Weir 1994; Mazur *et al.* 1999; van der Meer *et al.* 2001; Wu *et al.* 2002b; Xu *et al.* 2003) have pointed out that the maternal and offspring genomes could jointly affect the seed- or endosperm-specific traits. Therefore, it is important to take the genome information about the two successive generations into account in mapping those traits and, more importantly, to do so on the basis of a multiple-QTL model approach.
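The permutation approach to the critical value can be sketched as follows. This is a simplified illustration only: the test statistic is a single-marker regression LRT, *n* ln(RSS_{0}/RSS_{1}), used here as a stand-in for the triploid LRT, and the data are randomly generated placeholders. The function names and the small number of permutations are assumptions made for the example.

```python
import math
import random

def lrt_single_marker(y, w):
    """LRT statistic n*ln(RSS0/RSS1) for the simple regression of y on w."""
    n = len(y)
    my, mw = sum(y) / n, sum(w) / n
    sxy = sum((a - mw) * (b - my) for a, b in zip(w, y))
    sxx = sum((a - mw) ** 2 for a in w)
    rss0 = sum((b - my) ** 2 for b in y)
    rss1 = rss0 - sxy * sxy / sxx if sxx > 0 else rss0
    return n * math.log(rss0 / rss1)

def permutation_threshold(y, markers, n_perm=1000, alpha=0.05, seed=0):
    """Experimentwise critical value: the (1 - alpha) quantile of the maximum
    statistic over all markers when trait values are shuffled among individuals."""
    rng = random.Random(seed)
    y_perm = list(y)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # breaks any trait-marker association
        maxima.append(max(lrt_single_marker(y_perm, w) for w in markers))
    maxima.sort()
    return maxima[math.ceil((1 - alpha) * n_perm) - 1]

# Placeholder data: 100 individuals, 3 markers coded +1/2 or -1/2, null trait.
demo_rng = random.Random(1)
demo_markers = [[demo_rng.choice((-0.5, 0.5)) for _ in range(100)] for _ in range(3)]
demo_y = [demo_rng.gauss(0, 1) for _ in range(100)]
threshold = permutation_threshold(demo_y, demo_markers, n_perm=200)
print(threshold)
```

Because the maximum is taken over all tested positions within each permutation, the resulting threshold controls the experimentwise rather than the pointwise error rate, and it adapts automatically to the design, population, and model being fitted.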

## APPENDIX A:

### THE GENETIC VARIANCE COMPONENTS OF ENDOSPERM TRAITS

When *m* QTL with complete marginal and epistatic effects are considered together, the genetic variance of an endosperm trait can be decomposed into 4^{*m*} × (4^{*m*} − 1)/2 variance and covariance components. Taking *m* = 2 as an example, the genetic variance can have 120 variance and covariance components in the backcross and F_{2} populations (not shown). If the two QTL are unlinked, the genetic variance reduces to 83 and 111 components in the two populations. For the F_{2} population, these components are

Likewise, the components of variance and covariance for the backcross population can be also obtained.
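The count of components can be reproduced with a short sketch. It rests on one interpretive assumption that is not stated explicitly in the text: that a complete model with *m* QTL has 4^{*m*} − 1 effect parameters (for *m* = 2, six marginal and nine epistatic effects), each contributing one variance term and each pair contributing one covariance term.

```python
def n_effects(m):
    """Number of effect parameters in a complete m-QTL model (assumed 4**m - 1)."""
    return 4 ** m - 1

def n_components(m):
    """Variances plus pairwise covariances among the effect terms."""
    k = n_effects(m)
    return k * (k + 1) // 2   # equals 4**m * (4**m - 1) / 2

print(n_effects(2), n_components(2))  # 15 120
```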

## APPENDIX B:

### THE RELATION BETWEEN THE PARAMETERS OF THE DIPLOID AND TRIPLOID MODELS IN MAPPING ENDOSPERM TRAITS

To simplify the argument, assume that an endosperm trait value, *y*, measured in the backcross or F_{2} population is affected only by a single QTL, *Q*. The backcross individuals can have two possible QTL genotypes, *Qq* (*w* = ^{1}/_{2}) and *qq* (*w* = −^{1}/_{2}), each with frequency 1/2. The F_{2} individuals can have three possible QTL genotypes, *QQ* (*w*_{1} = 1, *w*_{2} = −^{1}/_{2}), *Qq* (*w*_{1} = 0, *w*_{2} = ^{1}/_{2}), and *qq* (*w*_{1} = −1, *w*_{2} = −^{1}/_{2}), with frequencies 1/4, 1/2, and 1/4, respectively. For autogamous plants, the individuals with *QQ* or *qq* genotype can produce only one endosperm genotype, *QQQ* or *qqq*. The individuals with *Qq* genotype can produce four kinds of endosperm genotype, *QQQ* (*x* = ^{3}/_{2}, *z*_{1} = 0, *z*_{2} = 0), *QQq* (*x* = ^{1}/_{2}, *z*_{1} = 1, *z*_{2} = 0), *Qqq* (*x* = −^{1}/_{2}, *z*_{1} = 0, *z*_{2} = 1), and *qqq* (*x* = −^{3}/_{2}, *z*_{1} = 0, *z*_{2} = 0), each with frequency 1/4. The frequencies of the four triploid QTL genotypes are 1/8, 1/8, 1/8, and 5/8, respectively, in the backcross population, and they are 3/8, 1/8, 1/8, and 3/8, respectively, in the F_{2} population. The covariances between the coded variables for the QTL genotypes of a diploid individual and its triploid endosperm are found to be Cov(*x*, *w*) = ^{3}/_{8}, Cov(*z*_{1}, *w*) = ^{1}/_{16}, Cov(*z*_{2}, *w*) = ^{1}/_{16} in the backcross population, and they are Cov(*x*, *w*_{1}) = ^{3}/_{4}, Cov(*z*_{1}, *w*_{1}) = 0, Cov(*z*_{2}, *w*_{1}) = 0, Cov(*x*, *w*_{2}) = 0, Cov(*z*_{1}, *w*_{2}) = ^{1}/_{16}, and Cov(*z*_{2}, *w*_{2}) = ^{1}/_{16} in the F_{2} population.
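The covariances listed above can be verified by direct enumeration of the joint distribution of the plant and endosperm genotypes. The sketch below does so with exact fractions; it uses only the genotype frequencies and coded variables stated in this paragraph, and the helper names are illustrative.

```python
from fractions import Fraction as F

# Coding (x, z1, z2) of the endosperm genotypes.
ENDO = {"QQQ": (F(3, 2), 0, 0), "QQq": (F(1, 2), 1, 0),
        "Qqq": (F(-1, 2), 0, 1), "qqq": (F(-3, 2), 0, 0)}

def endosperm_dist(plant):
    """Endosperm genotype distribution of a selfed (autogamous) plant."""
    if plant == "QQ":
        return {"QQQ": F(1)}
    if plant == "qq":
        return {"qqq": F(1)}
    return {g: F(1, 4) for g in ENDO}

def covariances(plant_freq, plant_code):
    """Cov between each endosperm variable and each plant variable."""
    joint = [(p * q, ENDO[e], plant_code[plant])
             for plant, p in plant_freq.items()
             for e, q in endosperm_dist(plant).items()]

    def mean(f):
        return sum(p * f(e, w) for p, e, w in joint)

    n_w = len(next(iter(plant_code.values())))
    out = {}
    for i, name in enumerate(("x", "z1", "z2")):
        for k in range(n_w):
            exy = mean(lambda e, w: e[i] * w[k])
            out[(name, k + 1)] = exy - mean(lambda e, w: e[i]) * mean(lambda e, w: w[k])
    return out

bc = covariances({"Qq": F(1, 2), "qq": F(1, 2)},
                 {"Qq": (F(1, 2),), "qq": (F(-1, 2),)})
f2 = covariances({"QQ": F(1, 4), "Qq": F(1, 2), "qq": F(1, 4)},
                 {"QQ": (1, F(-1, 2)), "Qq": (0, F(1, 2)), "qq": (-1, F(-1, 2))})
print(bc)
print(f2)
```

In the output, the key `("x", 1)` refers to Cov(*x*, *w*) in the backcross and to Cov(*x*, *w*_{1}) in the F_{2} population, and `("z1", 2)` refers to Cov(*z*_{1}, *w*_{2}).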

If the diploid models in Equation 6 or 7 are applied to analyze a marker, *M*, to infer *Q* along the genome, the regression coefficient of *y* on the marker *M* (coded by *w*_{M}) in the backcross model is given by *b*_{yM} = Cov(*y*, *w*_{M})/*V*(*w*_{M}), where Cov(*y*, *w*_{M}) is the covariance between the endosperm trait and the marker variable, and *V*(*w*_{M}) is the variance of the marker variable. It is easy to obtain Cov(*y*, *w*_{M}) = (1 − 2*r*_{QM})[3*a*/8 + (*d*_{1} + *d*_{2})/16] if there is no covariance between the residual error and marker variable. The regression coefficient is *b*_{yM} = (1 − 2*r*_{QM})[3*a*/2 + (*d*_{1} + *d*_{2})/4] because *V*(*w*_{M}) = ^{1}/_{4}. Similarly, the two regression coefficients for the additive and dominance effects of *M* in the F_{2} diploid model can be obtained. The regression coefficient of the additive variable is *b*_{yMa} = (1 − 2*r*_{QM})3*a*/2, and the regression coefficient of the dominance variable is *b*_{yMd} = (1 − 2*r*_{QM})^{2}(*d*_{1} + *d*_{2})/4.

Note that the partial regression coefficients for the additive and dominance effects are the same as *b*_{yMa} and *b*_{yMd}, as *w*_{Ma} and *w*_{Md} are orthogonal in the F_{2} population.

The conditional phenotypic variance on the marker *M* for the backcross diploid model is , where σ^{2}_{y} is the phenotypic variance, and σ_{yM} denotes the covariance between *y* and *M*. The conditional phenotypic variance is where σ^{2} is the variance of residual error. For the F_{2} diploid model, the conditional phenotypic variance on the marker *M* is . The conditional phenotypic variance is

The conditional phenotypic variances are the same for the backcross and F_{2} models.

## APPENDIX C:

### CONDITIONAL PROBABILITIES OF ENDOSPERM QTL GENOTYPES

Consider a marker interval, *I*_{j}, flanked by markers, *M*_{j} and *N*_{j}, on a linkage group, and let *r* denote the recombination fraction between the two markers. For the plants in the F_{2} population, there are nine observable genotypes for markers *M*_{j} and *N*_{j}. They are *M*_{j}*N*_{j}/*M*_{j}*N*_{j}, *M*_{j}*N*_{j}/*M*_{j}*n*_{j}, *M*_{j}*n*_{j}/*M*_{j}*n*_{j}, *M*_{j}*N*_{j}/*m*_{j}*N*_{j}, *M*_{j}*m*_{j}*N*_{j}*n*_{j} (*M*_{j}*N*_{j}/*m*_{j}*n*_{j} or *M*_{j}*n*_{j}/*m*_{j}*N*_{j}), *M*_{j}*n*_{j}/*m*_{j}*n*_{j}, *m*_{j}*N*_{j}/*m*_{j}*N*_{j}, *m*_{j}*N*_{j}/*m*_{j}*n*_{j}, and *m*_{j}*n*_{j}/*m*_{j}*n*_{j} with proportions (1 − *r*)^{2}/4, *r*(1 − *r*)/2, *r*^{2}/4, *r*(1 − *r*)/2, (1 − *r*)^{2}/2 + *r*^{2}/2, *r*(1 − *r*)/2, *r*^{2}/4, *r*(1 − *r*)/2, and (1 − *r*)^{2}/4, respectively. For the plants in the backcross population, there are four observable genotypes, *M*_{j}*N*_{j}/*M*_{j}*N*_{j}, *M*_{j}*N*_{j}/*M*_{j}*n*_{j}, *M*_{j}*N*_{j}/*m*_{j}*N*_{j}, and *M*_{j}*N*_{j}/*m*_{j}*n*_{j}, with proportions (1 − *r*)/2, *r*/2, *r*/2, and (1 − *r*)/2, respectively. For autogamous plants, the plants with genotypes *M*_{j}*N*_{j}/*M*_{j}*N*_{j}, *M*_{j}*n*_{j}/*M*_{j}*n*_{j}, *m*_{j}*N*_{j}/*m*_{j}*N*_{j}, and *m*_{j}*n*_{j}/*m*_{j}*n*_{j} each can produce only one progeny (embryo) genotype. The plants with genotypes *M*_{j}*N*_{j}/*M*_{j}*n*_{j}, *M*_{j}*N*_{j}/*m*_{j}*N*_{j}, *M*_{j}*n*_{j}/*m*_{j}*n*_{j}, and *m*_{j}*N*_{j}/*m*_{j}*n*_{j} each can produce three different embryo genotypes. For example, the three embryo genotypes produced by plants with genotype *M*_{j}*N*_{j}/*M*_{j}*n*_{j} are *M*_{j}*N*_{j}/*M*_{j}*N*_{j}, *M*_{j}*N*_{j}/*M*_{j}*n*_{j}, and *M*_{j}*n*_{j}/*M*_{j}*n*_{j}. The plants with genotype *M*_{j}*N*_{j}/*m*_{j}*n*_{j} (*M*_{j}*n*_{j}/*m*_{j}*N*_{j}) can produce nine different embryo genotypes. A total of 25 and 16 different combinations of the plant and embryo genotypes are in the F_{2} and backcross populations, respectively.
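The marker genotype proportions and the counts of plant and embryo genotype combinations can be checked by enumeration. The following sketch assumes an F_{1} of type *M*_{j}*N*_{j}/*m*_{j}*n*_{j}, represents an observable genotype as an unordered allele pair at each marker, and uses the fact that a selfed plant yields three observable embryo genotypes per heterozygous marker. The function names are illustrative.

```python
from fractions import Fraction as F
from itertools import product

def f2_marker_proportions(r):
    """Proportions of the nine observable two-marker genotypes in an F2."""
    hap = {"MN": (1 - r) / 2, "Mn": r / 2, "mN": r / 2, "mn": (1 - r) / 2}
    props = {}
    for (h1, p1), (h2, p2) in product(hap.items(), repeat=2):
        # Unordered allele pair at each marker, e.g. ("Mm", "Nn").
        key = ("".join(sorted(h1[0] + h2[0])), "".join(sorted(h1[1] + h2[1])))
        props[key] = props.get(key, 0) + p1 * p2
    return props

def embryo_count(plant):
    """Observable embryo genotypes of a selfed plant: 3 per heterozygous marker."""
    return 3 ** sum(len(set(locus)) == 2 for locus in plant)

r = F(1, 5)
props = f2_marker_proportions(r)
f2_combinations = sum(embryo_count(g) for g in props)
backcross_plants = [("MM", "NN"), ("MM", "Nn"), ("Mm", "NN"), ("Mm", "Nn")]
bc_combinations = sum(embryo_count(g) for g in backcross_plants)
print(len(props), f2_combinations, bc_combinations)  # 9 25 16
```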

If an unobservable QTL, *Q*_{j}, is located in *I*_{j}, among the seeds (progeny) collected from the F_{2} plants, there are three possible embryo genotypes, *Q*_{j}*Q*_{j}, *Q*_{j}*q*_{j}, and *q*_{j}*q*_{j}, and four possible endosperm genotypes, *Q*_{j}*Q*_{j}*Q*_{j}, *Q*_{j}*Q*_{j}*q*_{j}, *Q*_{j}*q*_{j}*q*_{j}, and *q*_{j}*q*_{j}*q*_{j}. The conditional distribution of these endosperm genotypes given the observable marker genotypes of the F_{2} plant (*t*) and embryo (*t* + 1) can be derived on the basis of Haldane's mapping function (Haldane 1919) assuming no crossover interference. For example, the conditional probability of the endosperm genotype, *Q*_{j}*Q*_{j}*Q*_{j}, given the plant genotype *M*_{j}*N*_{j}/*M*_{j}*n*_{j} and its embryo genotype *M*_{j}*N*_{j}/*M*_{j}*N*_{j} is calculated as

$$P(Q_jQ_jQ_j \mid M_jN_j/M_jn_j,\ M_jN_j/M_jN_j) = \frac{P(Q_jQ_jQ_j,\ M_jN_j/M_jn_j,\ M_jN_j/M_jN_j)}{P(M_jN_j/M_jn_j,\ M_jN_j/M_jN_j)}. \tag{C1}$$

The probability in the denominator of Equation C1 is *r*(1 − *r*)/8. As the QTL endosperm genotype *Q*_{j}*Q*_{j}*Q*_{j} implies the embryo genotype *Q*_{j}*Q*_{j}, it ensures that the marker and QTL genotype of the embryo is *M*_{j}*Q*_{j}*N*_{j}/*M*_{j}*Q*_{j}*N*_{j}. The possible F_{2} plants that can produce such an embryo genotype should be from one of the three genotypes, *M*_{j}*Q*_{j}*N*_{j}/*M*_{j}*q*_{j}*n*_{j}, *M*_{j}*q*_{j}*N*_{j}/*M*_{j}*Q*_{j}*n*_{j}, and *M*_{j}*Q*_{j}*N*_{j}/*M*_{j}*Q*_{j}*n*_{j}. It is easy to obtain that the probabilities of the F_{2} plants with these three genotypes are *r*_{1}(1 − *r*_{1})(1 − *r*_{2})^{2}/2, *r*_{1}(1 − *r*_{1})*r*_{2}^{2}/2, and (1 − *r*_{1})^{2}*r*_{2}(1 − *r*_{2})/2, respectively, and that their chances to produce seeds with embryo genotype *M*_{j}*Q*_{j}*N*_{j}/*M*_{j}*Q*_{j}*N*_{j} are (1 − *r*_{2})^{2}/4, *r*_{2}^{2}/4, and 1/4, respectively. This allows calculation of the numerator of Equation C1 as the sum of the following three probabilities:

$$\frac{r_1(1-r_1)(1-r_2)^4}{8},\qquad \frac{r_1(1-r_1)\,r_2^4}{8},\qquad \frac{(1-r_1)^2\,r_2(1-r_2)}{8}.$$

Therefore, the conditional probability of the endosperm genotype, *Q*_{j}*Q*_{j}*Q*_{j}, given the plant marker genotype, *M*_{j}*N*_{j}/*M*_{j}*n*_{j}, and its embryo marker genotype, *M*_{j}*N*_{j}/*M*_{j}*N*_{j}, is

$$\frac{r_1(1-r_1)\left[(1-r_2)^4 + r_2^4\right] + (1-r_1)^2\,r_2(1-r_2)}{r(1-r)}.$$

The same argument yields the other three conditional probabilities of the endosperm genotypes, *Q*_{j}*Q*_{j}*q*_{j}, *Q*_{j}*q*_{j}*q*_{j}, and *q*_{j}*q*_{j}*q*_{j}.

Similarly, the conditional probabilities of endosperm QTL genotypes given the other combinations of the F_{2} (backcross) plant and embryo genotypes (the two-stage design) can be derived. If only the plant marker genotype (the one-stage design) is available for inference, the derivation for the conditional probabilities of endosperm QTL genotypes is simpler and can be also obtained. These conditional probabilities under the one- and two-stage designs in the backcross and F_{2} populations are placed on the website (http://www.stat.sinica.edu.tw/chkao/) or a part of them can be found in Wu *et al.* (2002a,b) and Xu *et al.* (2003).
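The derivation for the plant genotype *M*_{j}*N*_{j}/*M*_{j}*n*_{j} and embryo genotype *M*_{j}*N*_{j}/*M*_{j}*N*_{j} can be cross-checked by brute-force enumeration. The sketch below is an independent check rather than the author's code. It assumes an F_{1} of type *M*_{j}*Q*_{j}*N*_{j}/*m*_{j}*q*_{j}*n*_{j}, no crossover interference, selfing of the F_{2} plant, and an endosperm formed from two copies of the egg's QTL allele plus one sperm allele; *r*_{1} and *r*_{2} are taken as the recombination fractions between *M*_{j} and *Q*_{j} and between *Q*_{j} and *N*_{j}, and the numerical values used are arbitrary.

```python
from fractions import Fraction as F
from itertools import product

def gametes(h1, h2, r1, r2):
    """Gamete distribution of a parent with three-locus haplotypes h1/h2
    (marker, QTL, marker), assuming no crossover interference."""
    out, pair = {}, (h1, h2)
    for s, c1, c2 in product((0, 1), repeat=3):
        p = F(1, 2) * (r1 if c1 else 1 - r1) * (r2 if c2 else 1 - r2)
        g = pair[s][0] + pair[s ^ c1][1] + pair[s ^ c1 ^ c2][2]
        out[g] = out.get(g, 0) + p
    return out

def marker_genotype(h1, h2):
    """Observable (unordered) genotypes at the two flanking markers."""
    return "".join(sorted(h1[0] + h2[0])), "".join(sorted(h1[2] + h2[2]))

def endosperm_posterior(r1, r2, plant, embryo):
    """P(endosperm QTL genotype | plant markers, embryo markers) and P(markers)."""
    joint = {}
    f1 = gametes("MQN", "mqn", r1, r2)
    for (a, pa), (b, pb) in product(f1.items(), repeat=2):      # F2 plant
        if marker_genotype(a, b) != plant:
            continue
        selfed = gametes(a, b, r1, r2)
        for (egg, pe), (sperm, ps) in product(selfed.items(), repeat=2):
            if marker_genotype(egg, sperm) != embryo:
                continue
            endo = "".join(sorted(2 * egg[1] + sperm[1]))       # two maternal copies
            joint[endo] = joint.get(endo, 0) + pa * pb * pe * ps
    denom = sum(joint.values())
    return {g: p / denom for g, p in joint.items()}, denom

r1, r2 = F(1, 10), F(1, 10)
post, denom = endosperm_posterior(r1, r2, ("MM", "Nn"), ("MM", "NN"))
print(post, denom)
```

The returned denominator can be compared with *r*(1 − *r*)/8, where *r* = *r*_{1} + *r*_{2} − 2*r*_{1}*r*_{2}, and the same function gives the remaining three conditional probabilities as well as those for other plant and embryo marker combinations.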

## APPENDIX D:

### THE PROBLEMS IF EFFECTS ARE PRESENT AND IGNORED IN MAPPING ENDOSPERM TRAITS

For simplicity, assume that an endosperm trait is controlled by two QTL, *Q*_{1} and *Q*_{2}, without epistasis. It can be found that the covariances between the coded variables for the effects of different QTL are where *r*_{12} is the recombination fraction between *Q*_{1} and *Q*_{2} in the backcross population. In the F_{2} population, these covariances become

If *Q*_{1} and *Q*_{2} are unlinked (*r*_{12} = 0.5), the covariances are all zeros. In the backcross population, if a single-QTL model considering only the additive effect is used to analyze *Q*_{1}, the regression coefficient is

If only a dominance effect, say *d*_{1}, is considered, the regression coefficient is

In the F_{2} population, the two coefficients are and

They show that the estimate of the additive (dominance) effect of *Q*_{1} is confounded by its other effects and the effects of *Q*_{2}.

## Acknowledgments

The author is grateful to two anonymous reviewers for helpful comments and to Pei-Ying Shih and Chu-Chun Chen for the derivation of variance components. This work was supported by grant NSC92-2118-M-001-038 from the National Science Council, Taiwan, Republic of China.

## Footnotes

Communication editor: A. Paterson

- Received August 28, 2003.
- Accepted April 30, 2004.

- Genetics Society of America