## Abstract

The seeds of flowering plants develop from double fertilization and play a vital role in reproduction and supplying human and animal food. The genetic variation of seed traits is influenced by multiple genetic systems, *e.g.*, maternal, embryo, and/or endosperm genomes. Understanding the genetic architecture of seed traits is a major challenge because of this complex mechanism of multiple genetic systems, especially the epistasis within or between different genomes and their interactions with the environment. In this study, a statistical model was proposed for mapping QTL with epistasis and QTL-by-environment (QE) interactions underlying endosperm and embryo traits. Our model integrates the maternal and the offspring genomes into one mapping framework and can accurately analyze maternal additive and dominant effects, endosperm/embryo additive and dominant effects, and epistatic effects of two loci in the same or two different genomes, as well as interaction effects of each genetic component of QTL with environment. Intensive simulations under different sampling strategies, heritabilities, and model parameters were performed to investigate the statistical properties of the model. A set of real cottonseed data was analyzed to demonstrate our methods. A software package, QTLNetwork-Seed-1.0.exe, was developed for QTL analysis of seed traits.

CROP seeds are the primary source of human food, animal feed, and industrial raw materials, hence understanding how seed traits are genetically determined is of great significance. The embryo and/or endosperm are the principal components of mature cereal grains. Endosperm in monocots such as rice, wheat, and maize or cotyledon in species such as *Arabidopsis thaliana* function as storage organs for nutrients such as carbohydrates, proteins, lipids, and minerals (Mazur *et al.* 1999), which are used in embryo development and maturation (Olsen 1998).

Most crop seed traits are quantitative. Quantitative trait locus (QTL) mapping has become an effective way to dissect the genetic architecture of quantitative traits. So far, several genetic models and statistical methods for seed trait mapping have been developed and applied in analysis of breeding data (Wu *et al.* 2002; Tsilo *et al.* 2011; Walker *et al.* 2013). However, some of these approaches ignored the difference between diploid and triploid inheritance. Because endosperm is triploid, it exhibits more complex genetics than diploid organisms. For example, in a locus with two alleles (Q and q), endosperm has four genotypes (QQQ, QQq, Qqq, and qqq) whereas there are only three genotypes for the embryo (QQ, Qq, and qq). Thus, endosperm usually exhibits reciprocal effects. Furthermore, the endosperm is the next generation of maternal plants. Based on such a genetic feature, Hu and Xu (2005) proposed a statistical method for characterizing the genetic effects of the maternal genome on offspring traits, incorporating both a quantitative genetic model for diploid maternal traits and one for triploid endosperm traits into a unified QTL mapping framework. Wen and Wu (2007) further proposed a method of interval mapping endosperm traits based on a two-stage hierarchical design, which could estimate all kinds of main effects and further improved the mapping accuracy. However, these methods ignored epistatic interactions and QTL-by-environment (QE) interactions, potentially leading to biased estimation of parameters.

Epistasis has long been recognized as fundamentally important to understanding the structure and function of genetic pathways and the evolutionary dynamics of complex genetic systems. It describes the general situation in which the phenotype of a quantitative trait cannot be predicted by the sum of all single-locus effects (Lark *et al.* 1995; Xing *et al.* 2002; Fan *et al.* 2005). Intensive research has illustrated that epistasis is an important factor in complex traits such as grain yield and its components, as well as heterosis and inbreeding depression (Yu *et al.* 1997; Mackay *et al.* 2009). With increasing recognition of the importance of epistasis in complex traits, it is appealing to develop a statistical model that integrates maternal and offspring genome into one mapping framework for dissecting the genetic networks of seed traits. General epistasis is defined as the interaction of two genes at different loci of the same genome; however, some studies have indicated that two epistatic genes can reside in different genomes (Cui *et al.* 2004; Cui and Wu 2005a,b). Zhang *et al.* (2004) proposed a statistical model for testing and estimating the effects of maternal–offspring genome interactions on embryo and endosperm traits during seed development in autogamous plants. This approach can detect epistasis from different genomes, but at the price of ignoring maternal effects of a single QTL. Cui *et al.* (2004; Cui and Wu 2005a,b) proposed a statistical method to detect the genetic mechanism of maternal–offspring genome interactions. Under the assumption that genetic expression of endosperm traits is controlled by both the maternal and the offspring genomes, the method can estimate the direct (offspring) effect, the indirect (maternal) effect, and the effects of maternal–offspring genome interaction simultaneously. But this method is based on the arguable assumption that the two interacting QTL cannot have maternal and offspring effects simultaneously. In other words, the methods proposed by Cui *et al.* (2004; Cui and Wu 2005a,b) can handle only an interaction of paired QTL in which one QTL has only a maternal effect while the other has only an endosperm effect. As the endosperm develops from the maternal plant, it is natural to consider that both maternal and offspring QTL can have phenotypic manifestations. Consequently, assuming the QTL potentially act in both the maternal and the endosperm genomes, the maternal effects and the endosperm effects, respectively, of each QTL can be tested.

GE interaction is also a very important component of phenotypic variation. GE interaction occurs when the environment effect depends on a genotype or, equivalently, when the genotypic effect depends on the environment. Identification of GE interactions has been a hot topic in mapping QTL for crop complex traits and human disease susceptibility genes (Zheng *et al.* 2008; Renz *et al.* 2011; Nickels *et al.* 2013). Qi *et al.* (2014) proposed two statistical models for mapping QTL of crop seed traits with inclusion of maternal effects, embryo or endosperm effects of QTL, environment effects, and QE interactions. All the above methods do not integrate epistatic effects either from one genome or from two different genomes and their interaction effects with environment, which are of great importance for marker-assisted selection in crop improvement (He *et al.* 1999; Bao *et al.* 2004).

To deal with the above issues and further improve the mapping power and estimation accuracy of QTL effects for seed traits, we propose a new methodology for systematically mapping QTL based on a mixed linear model approach. We integrated maternal and offspring genomes into one model. The QTL effects, including epistatic effects between two loci in the same genome and different genomes, and their interaction effects with environment were estimated and tested using the Markov chain Monte Carlo (MCMC) algorithm for Gaussian mixed linear model via Gibbs sampling (Smith and Roberts 1993). Monte Carlo simulations were conducted to investigate the reliability and efficiency of the method. A worked example was provided to demonstrate the effectiveness of our method. A computer program (QTLNetwork-Seed-1.0.exe) was developed for implementing QTL analysis of seed traits.

## Materials and Methods

### Genetic model

Consider a population derived from random mating of immortalized F_{2} (IF_{2}) of which genetically identical individuals can be regenerated for replication of experiments and phenotyping in different environments for detection of GE interactions. Marker information for maternal plants and the phenotypes of offspring are employed to perform QTL mapping. The genetic experiment is conducted in *p* different environments, each with *b* blocks. Suppose a seed trait is controlled by both the maternal genetic effects and endosperm (embryo) genetic effects of *s* segregating QTL (*Q*_{1}, *Q*_{2},*…*, *Q _{s}*) in which

*t*pairs of QTL are involved in epistatic interactions. Let a random variable £

*be the genotype of*

_{ki}*Q*from the

_{k}*i*th line. If QTL genotypes are available, we can easily get the genotypic value (Hu and Xu 2005). In reality, we do not know such information, but we can infer the conditional probabilities based on flanking markers of QTL. Let

*x*=

_{ki}*E*(£

*| the genotypes of flanking markers), which can be estimated by a general algorithm with dominant, codominant, or missing markers (Zeng 1994). For the*

_{ki}*i*th strain in the

*j*th block from the

*h*th environment, its phenotype of seed trait can be expressed by the following mixed linear model: (1)where

*μ*is the population mean; and are the maternal additive and dominance effects of with coefficients and respectively; and are the endosperm (embryo) additive and dominance effects of with coefficients and is the random effect of the

*h*th environment, and are the interaction effects of the maternal additive and the maternal dominance with environment, and are the endosperm (embryo) additive- and dominance-by-environment interaction effects, is the maternal additive–additive epistatic effect between and with coefficient is the endosperm (embryo) additive–additive epistatic effect between and with coefficient is the maternal–endosperm (embryo) additive–additive epistatic effect between and with coefficient is the endosperm (embryo)–maternal additive–additive epistatic effect between and with coefficient is the interaction effect between and the

*h*th environment; is the interaction effect between and the

*h*th environment; is the interaction effect between and the

*h*th environment; is the interaction effect between and the

*h*th environment; is the block effect in the

*h*th environment; and is the residual random effect,

### Mapping strategy

In model (1), we assume that the position and genotype of each QTL are known. In reality, such information is unavailable before mapping. We first need to specify the positions of all QTL, and then the coefficients of each genetic effect can be determined by the conditional probability of the QTL genotype on the observed flanking marker genotype. Then, all the effects can be estimated through solving the mixed linear model. To search for *s* segregating QTL and *t* pairs of epistases, we adopt a whole-genome one-dimensional scanning strategy to detect QTL with individual effects and two-dimensional scanning to detect paired QTL with epistatic effects (Yang *et al.* 2007).

#### Mapping QTL and epistasis:

Candidate marker intervals potentially harboring QTL are first selected through one-dimensional scanning across the whole genome. Then the *F*-test can be conducted with a walking step, such as 1 cM, along the whole genome, while the selected marker intervals are included in the model as the cofactors to control background genetic effects. Without loss of generality, we use the endosperm model to illustrate the proposed analysis. The following model is used to identify the significant marker intervals on the whole genome, (2)where is the population mean in the *h*th environment; *t* (*t* = 1, …, *T*) indexes the *t*th marker interval to be tested in *T* intervals; () and () stand for the maternal additive and dominance effects for the right marker (left marker) of the *t*th marker interval in the *h*th environment, with coefficients () and (), respectively; () and () stand for the endosperm additive effect and the dominance effect for the right marker (left marker) of the *t*th marker interval in the *h*th environment with coefficients () and (), respectively; and the other parameters have the same definitions as model (1).

All ζ’s corresponding to different marker effects can be determined the same way as for the coefficients of QTL effects.

Once the significant intervals are selected, the QTL mapping model for testing a locus *k* within a particular genomic region can be written as (3)where and are the maternal additive and dominance effects of the *k*th putative QTL in the *h*th environment; and are the endosperm additive and dominance effects of the *k*th putative QTL in the *h*th environment, respectively; and all the other parameters have the same definitions as those in models (1) and (2).

When the *F* values for a region exceed the critical threshold determined by permutation testing (Doerge and Churchill 1996), a QTL is indicated at that position with the regional maximum *F* value.

Suppose that *s* QTL have been mapped by a one-dimensional genome scan. To find all possible epistasis, the *s* QTL as well as significant marker interval pairs identified by a marker pair selection (Piepho and Gauch 2001) approach are included in the model as cofactors. For a pair of marker intervals consisting of the *l*th and the *r*th intervals, we can use the following model to test its significance, (4)where and are the maternal, the maternal–endosperm, the endosperm–maternal, and the endosperm additive–additive interaction effects between two right markers of the *t*th paired interval in the *h*th environment, respectively; and similarly, and are the maternal, the maternal–endosperm, the endosperm–maternal, and the endosperm additive–additive interaction effects between two left markers of the *t*th paired interval in the *h*th environment and the remaining parameters are the same as in models (1) and (2).

To reduce computational complexity, we scan only the regions in which significant marker interval interactions were detected by the interval interaction analysis to find all paired QTL with potential epistatic effects. Suppose *p* candidate QTL are identified in one-dimensional QTL scanning and *f* pairs of candidate marker intervals are selected in marker pair selection; we can use the following model to test the epistatic effects between two putative QTL, (5)where all parameters and variables have the same definitions as those described above. Two-dimensional scanning is conducted and permutation testing is employed to determine an empirical threshold value of the *F* statistic. All significant paired epistatic QTL are distinguished. To remove false positive QTL, stepwise regression on all distinguished QTL is performed; as a result, a QTL full model could be established based on all significant QTL.

#### Genetic effect estimation:

After obtaining the positions of the QTL and their epistatic interactions, a QTL full model is constructed to estimate all the parameters in model (1). To fit the mixed linear model, we first obtain a set of initial estimations of parameters including variances of random effects by the minimum norm quadric unbiased estimation (MINQUE) (Rao 1971), the fixed effects by the ordinary least-squares estimation (OLSE), and the random effects by the adjusted unbiased prediction (AUP). These estimates are then used as the prior values of parameters in the MCMC procedure for the mixed linear model implemented with Gibbs sampling (Smith and Roberts 1993). Parameter estimation and statistical inference are conducted by summarizing the Gibbs samplers (Wang *et al.* 1994; Yang *et al.* 2007). In the full model, we use MCMC to fit the parameters, implying an assumption of normality for the random effects. When the assumption of normality does not hold true, the MINQUE can be used to estimate the variance components, OLSE for estimation of fixed effects, and AUP for prediction of random effects.

## Results

### Monte Carlo simulation

#### Simulation scenarios:

We performed a series of Monte Carlo simulations to verify the unbiasedness and robustness of our method. The simulation study was conducted under different heritabilities, sampling strategies, and model parameters. The experiment design was random mating of IF_{2}, with 600 genotypes and two environments. The genetic map consisted of five chromosomes, each with 11 markers evenly distributed. The distance between 2 consecutive markers was set as 10 cM. Three QTL (Q_{1}, Q_{2}, and Q_{3}) and three epistastic interactions (EQ_{1}, EQ_{2}, and EQ_{3}) controlling an endosperm trait were scattered throughout the five chromosomes. EQ_{1} indicated the interaction of Q_{1} and Q_{2}; EQ_{2} indicated the interaction of Q_{2} and a locus on chromosome 1 without main effects; EQ_{3} was the interaction between two loci on chromosomes 1 and 2, respectively, both of which had no main effects. Factors simulated include (I) three QTL heritabilities of 30%, 50%, and 70%; (II) 600 and 300 maternal plant lines in the segregating population, each line having one or two offspring; and (III) different models with or without inclusion of epistatic effects. Such scenarios not only allow us to examine the performance of parameter estimation but also provide useful guidance for practical experiment design. Each scenario repeated 500 simulations.

#### Estimation of QTL parameter:

Since the simulation results had similar patterns for all the scenarios, we chose to report one simulation result under the heritability of 70% with population structure of 600:1 (*i.e.*, 600 IF_{2} lines, each with one offspring). The results are summarized in Table 1, Table 2, Table 3, and Table 4. It was shown that all the parameter estimates were not significantly different from their true values in all the cases. All the statistical powers of QTL detection were extremely high and the lowest still reached 99.80% for Q_{1}. For the estimation of QTL effects, the estimation accuracy of main effects was slightly better than that of QE interaction effects.

#### Influence of different models:

The influence of different models was investigated and three scenarios were simulated under the same heritability (70%) with a population structure of 600:1 in the simulation procedure: (I) the simulated trait was controlled by epistatic effects and the epistatic model was used, (II) the simulated trait was controlled only by QTL main effects but the epistatic model was used, and (III) the simulated trait was controlled by epistatic effects but the reduced model without epistatic effects was used. The simulation results are summarized in Table 5, Table 6, Table 7, and Table 8. Unbiased estimates of QTL positions and close estimates of all effects (Table 6, Table 7, and Table 8) could be observed under the three situations, consistent with the results at different levels of heritability. Under the three simulations, we found that the estimation of main effects was slightly more stable than for the QE interaction effects. In scenario III, the QTL positions and main effects were well estimated by the reduced model but with higher SD. Moreover, this model could not detect epistasis. Hence, it could be concluded that in the presence of epistatic interaction, the model without epistatic interactions tended to result in unstable estimates of QTL parameters. Although the QTL full model was employed, statistical power and SD in scenario II were slightly better than in the other scenarios, which may be due to the absence of epistatic effects on variance of the simulated trait. Compared with scenario III, scenario I, which was based on the QTL full model, exhibited relatively lower SD and could detect epistasis, indicating the full model is more suitable for dissecting the genetic basis of complex traits. Hence, in practice, regardless of whether epistasis exists or not, we suggest using the QTL full model with inclusion of epistatic effects.

### Analysis of cottonseed data

A set of cottonseed data from an IF_{2} population was analyzed by the proposed method. A total of 188 recombinant inbred lines (RILs) were developed from intraspecific crossing between HS46 and WARKCBUCAG8US-1-88 parental strains, which have wide genetic differences in yield, fiber quality, disease resistance, and seed quality traits. In this study, every 2 lines randomly sampled from 188 RILs were crossed during flowering to produce 376 IF_{2} lines in 2009 and 2010, which were used for QTL analysis. Seeds of the IF_{2} population and two parents were manually harvested at maturity. Fiber percentage (FP) was measured for all seeds. The genetic map consists of 388 molecular markers across 30 linkage groups. It covers a total length of 1946.22 cM, accounting for 41.55% of the whole genome, with an average distance of 5.03 cM between adjacent markers.

The critical *F* value to declare QTL was calculated by permutation test at the 0.05 genome-wide significance level. Our method detected two significant QTL with sharp and narrow *F*-statistic peaks located on chromosomes 19 (*F* = 5.20) and 21 (*F* = 9.93) by a one-dimensional genome scan (Figure 1A). Moreover, a total of four QTL with two pairs of epistatic interactions (*F* = 4.83 and 5.47) were detected by a two-dimensional genome scan (Figure 1B). All the QTL and epistases seemed to be sensitive to environment. The genetic effects and the corresponding *P*-values of all the QTL and epistases are presented in Table 9. The estimated relative contribution (RC) of QTL 21-18 is 14.05%, which means that ∼14.05% of phenotypic variation could be explained by this QTL. The suffix 21-18 indicates the QTL is located on the 18th marker interval on chromosome 21. Moreover, this study revealed that nearly 16.72% of the phenotypic variance is attributed to the epistatic interactions, with 4.01% from between-genome interactions and 12.71% from the interactions within one genome. Hence, in practical applications, we should take account of the epistasis from both QTL within the same genome and those in different genomes. More attention should be paid to QTL 4-2, which is involved in two epistases but has no significant marginal effects.

## Discussion

In this study, we proposed a statistical method for mapping endosperm or embryo traits of crops that integrated into one mapping framework the maternal and offspring effects of multiple QTL, epistatic interactions either within one genome or between two genomes, and QE interactions. Epistasis is a topic of current interest in molecular and quantitative genetics and has been proposed to be the source of missing heritability (Zuk *et al.* 2012). Uncovering genetic interaction networks from epistatic interactions between loci will improve our understanding of biological systems that give rise to quantitative trait variation and help reveal the mechanisms that underlie genetic homeostasis and speciation. Knowledge of interaction loci will improve our ability to predict individual disease risk factors in humans and response to natural and artificial selection, as well as inbreeding depression and heterosis in agricultural species. In addition, GE interactions are also an important issue for breeders and quantitative geneticists. It has been well documented that environmental conditions, especially temperature during grain filling, significantly affect cooking and eating quality (Zheng *et al.* 2008). Therefore, detection of GE interactions is very valuable for breeding environment-specific or broadly adapted varieties in genetic improvement of crop traits by marker-assisted selection.

However, despite being important genetic components of seed trait variation, epistasis and GE interactions have not been integrated into a single-QTL mapping model, mainly due to lack of an appropriate methodology. One major difficulty in developing a powerful statistical approach for mapping seed trait QTL with epistasis and QE interaction is fitting many parameters of multiple QTL in the full model, especially for epistasis from multiple genomes. Unlike previous one-QTL models that produce bias when there is more than one linked QTL, we use a QTL full model and conduct model selection to remove false positive QTL. Thus our approach is more effective for the precise estimation of all QTL positions and main effects, as well as the QE interaction effects, although the number of parameters in the model increases rapidly compared with that in the other methods. In QTL genome-wide scanning and model selection on QTL, we use the *F*-test to test significance of QTL based on Henderson’s method III, which avoids inversion of large matrices and therefore requires less computation.

Permanent genetic resources such as RILs or IF_{2} possess many advantages in QTL studies. One is that they can be used to conduct multiple-environment experiments, which are required for studying GE interaction. Another advantage is that the population derived from them need not be genotyped and can be inferred from their genotypes. In this study, we utilize a one-generation design that needs only the maternal marker information and the offspring phenotypes and thus can reduce a large amount of labor and costs, making it easier to implement for breeders in practical studies. Certainly, the work of hybridization in the random-mating design is laborious, but it can produce a very large population of hybrid lines without additional genotyping costs and effectively increases the statistical power of QTL analysis.

As shown by simulation studies, our model is quite robust and reliable in estimation of parameters for a modest sample size (200) and low heritability (0.3). Mapping power for both QTL and epistasis increased when the number of maternal plant lines increased from 100 to 600, while the false discovery rate (FDR) decreased (data not shown). Based on our simulations, we suggest that at least 200 maternal plant lines should be used for sufficient detection power for epistasis and GE interactions, while keeping a reasonable hybridization workload. It is noteworthy that increasing maternal line numbers may be more important than increasing offspring numbers to increase estimation precision. The simulation results from different models in the current study indicate that when the epistatic effect is small and a reduced model is employed, the estimations of positions and effects of QTL are relatively precise. However, when the epistatic effect is large and we still use a reduced model, the position estimations also reach better precision, but the estimated effects of QTL will be biased severely. This is because when we search the significant QTL by a one-dimensional scan, the reduced model does not include interaction effects of paired markers to control background genetic effects. As a result, when we estimate all kinds of effects in the reduced model, the main effects of QTL will be confounded with the epistatic effects. However, the *F*-values profile still exhibits a very similar pattern, and therefore the estimation of QTL position is less influenced by the epistatic effects. Since epistasis is mostly involved in genetic variation of quantitative traits and our method will give accurate estimation of QTL effects no matter whether the epistasis and QE interaction exist or not, we suggest using the QTL full model with inclusion of epistasis and QE interactions for seed trait QTL studies where the genetic architecture is unknown. In the cottonseed data analysis, we found some epistatic QTL without significant individual effects detected. It is noteworthy that these types of epistasis are also very important in genetic buffering and can respond to natural and artificial selection. Moreover, the epistases from different genomes also contribute to phenotypic variation; this kind of epistasis should not be ignored in mapping seed trait QTL.

Although the mechanisms for genetic imprinting are not totally understood, this phenomenon is thought to offer an evolutionary advantage through the maintenance of greater genetic variation. An imprinting QTL can function in a coordinated network of gene–gene and gene–environment interactions. It is worthwhile to develop new statistical models for the detection of interacting regulatory genes that affect the imprinting expression of any QTL involved in a genetic network composed of maternal and zygotic genomes. Under our framework, it is in principle feasible to extend the model to accommodate imprinting effects.

The mixed linear model (1) developed in the present study provides a basic framework that can easily be extended to cover more complex experimental designs, such as double backcross of IF_{2} or selfing of IF_{2} based on a one- or two-generation design. In the two-generation designs, such as double backcross of IF_{2} and selfing of IF_{2}, that include maternal and offspring marker information as well as the phenotypes of the offspring, our model can give accurate position estimations, all the QTL main effects, and the interaction effects, including GE interactions and epistatic interactions. In one-generation designs, which use only maternal marker information and phenotypes of offspring, for double backcross of IF_{2} lines linear dependency will occur between genetic effects of any pair of QTL when the QTL full model (1) is applied. To tackle this problem, a potential strategy is to keep the endosperm dominance effect of the QTL while merging the dominance effect of all other QTL, so that the linear dependency could be removed; however, this procedure will increase programming complexity and require much more computation time because the model has to be adjusted for different QTL. Given that the effectiveness of this strategy still needs to be investigated, we do not currently implement it in our software.

Based on the models and methods proposed in the present study, the QTLNetwork-Seed-1.0.exe software was developed in the C++ programming language. This software can be run on commonly used operation systems and can analyze data for 1000 individuals and 120 markers in ∼2 min with 45 MB memory (data not shown). The simulations on median running time and memory usage suggested that, as expected, the computing time and the memory usage increased with increasing maternal plant lines and marker density. It would take 2.28 hr and 131 MB memory for a set of samples consisting of 600 maternal lines, 1198 individuals, and 1005 markers, suggesting that our software can handle much larger data than those commonly used. We noted that running time and memory usage comparisons might vary as a function of computing environment. Different experimental designs, such as double backcross of IF_{2}; selfing of IF_{2} based on one- or two-generation designs; and random mating for IF_{2}, double haploid, or RILs can be handled by this software for detecting genetic architecture of seed endosperm or embryo traits.

## Acknowledgments

This work was supported in part by the National Basic Research Program of China (973 programs 2011CB109306 and 2010CB126006), the National Natural Science Foundation grants 31271608 and 31470083, the National Institutes of Health grant DA025095, the Science and Technology Innovation Program of the Chinese Academy of Agricultural Sciences, and the Bill and Melinda Gates Foundation Project. The authors declare no conflict of interest.

## Footnotes

*Communicating editor: E. A. Stone*

- Received July 18, 2014.
- Accepted October 8, 2014.

- Copyright © 2015 by the Genetics Society of America