# Modeling Epistasis in Genomic Selection

- Yong Jiang and
- Jochen C. Reif
^{1}

- Department of Breeding Research, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466 Stadt Seeland, Germany

- 1Corresponding author: Department of Breeding Research, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466 Stadt Seeland, Germany. E-mail: reif{at}ipk-gatersleben.de

## Abstract

Modeling epistasis in genomic selection is impeded by a high computational load. The extended genomic best linear unbiased prediction (EG-BLUP) with an epistatic relationship matrix and the reproducing kernel Hilbert space regression (RKHS) are two attractive approaches that reduce the computational load. In this study, we proved the equivalence of EG-BLUP and genomic selection approaches, explicitly modeling epistatic effects. Moreover, we have shown why the RKHS model based on a Gaussian kernel captures epistatic effects among markers. Using experimental data sets in wheat and maize, we compared different genomic selection approaches and concluded that prediction accuracy can be improved by modeling epistasis for selfing species but may not for outcrossing species.

- epistasis
- genomic selection
- genomic best linear unbiased prediction (G-BLUP)
- extended G-BLUP (EG-BLUP)
- reproducing kernel Hilbert space regression (RKHS)
- GenPred
- shared data resource

EPISTASIS has long been recognized as an important component in dissecting genetic pathways and understanding the evolution of complex genetic systems (Phillips 2008). It is hence a biologically influential component contributing to the genetic architecture of quantitative traits (Mackay 2014). The influence of epistasis on genome-wide QTL mapping ranges from limited (*e.g.*, Buckler *et al.* 2009; Tian *et al.* 2011) to high (*e.g.*, Carlborg *et al.* 2006; Würschum *et al.* 2011; Huang *et al.* 2014). These discrepancies can be explained by the complexities of the examined traits, which are controlled by many loci exhibiting small effects entailing a low QTL detection power. In addition, the estimation of QTL main and interaction effects is very likely biased (Beavis 1994), which makes it challenging to reliably elucidate the role of epistasis through genome-wide QTL mapping studies.

Genomic selection has been suggested as an alternative to tackle complex traits that are regulated by many genes, each exhibiting a small effect (Meuwissen *et al.* 2001). Genomic selection approaches based on additive and dominance effects have been successfully applied to predict complex traits in human (Yang *et al.* 2010), animal (Hayes *et al.* 2009), and plant populations (Jannink *et al.* 2010; Zhao *et al.* 2015). Moreover, several genomic selection approaches have been developed to model both main and epistatic effects (Xu 2007; Cai *et al.* 2011; Wittenburg *et al.* 2011; Wang *et al.* 2012). While in some studies prediction accuracies increased (Hu *et al.* 2011), in others modeling epistasis adversely affected prediction accuracies (Lorenzana and Bernardo 2009).

Despite these first attempts, epistasis is often ignored in genomic selection approaches using parametric models mainly because of the high associated computational load, especially if a large number of markers are available. An attractive solution to reduce the computational load is to extend genomic best linear unbiased prediction (G-BLUP) models (VanRaden 2008) by adding marker-based epistatic relationship matrices [extended genomic best linear unbiased prediction (EG-BLUP)]. Dating back to Henderson (1985), EG-BLUP enables the estimation of epistatic components of the genotypic values without explicitly assessing individually epistatic effects. Applied to predicting daily gain in pigs and the total height of pine trees, EG-BLUP outperformed G-BLUP (Su *et al.* 2012; Muñoz *et al.* 2014). The equivalence between G-BLUP and genomic selection approaches, with explicit relevance for modeling marker main effects, has been demonstrated (Habier *et al.* 2007). However, the association between EG-BLUP and genomic selection approaches explicitly modeling marker main and interaction effects has not been studied.

The use of semiparametric reproducing kernel Hilbert space (RKHS) regression models has been promoted as an alternative powerful option to capture epistasis in genomic selection (Gianola *et al.* 2006; Gianola and Van Kaam 2008). The RKHS model outperformed linear models that focused exclusively on marker main effects in a number of studies based on simulated data (*e.g.*, Gianola *et al.* 2006; Howard *et al.* 2014) and empirical data (*e.g.*, Perez-Rodriguez *et al.* 2012; Rutkoski *et al.* 2012; Crossa *et al.* 2013). Choosing an appropriate kernel, which can be interpreted as a relationship matrix among genotypes (*i.e.*, individuals), is a central element of model specification in RKHS regression (De Los Campos *et al.* 2010). Among all possible kernels, the Gaussian kernel has been extensively used and is assumed to implicitly portray the genetic effects including epistasis (Gianola and Van Kaam 2008; Morota and Gianola 2014). The exponential function involved in the Gaussian kernel is a nonlinear transformation of the additive inputs, which encodes a type of epistasis (Gianola *et al.* 2014). Nevertheless, it has not been clarified how RKHS regression based on Gaussian kernels explicitly models epistatic effects among different markers.

In this study, we aimed at (1) explaining how the marker-based epistatic relationship matrix used in EG-BLUP models is related to epistatic effects among markers, (2) unraveling how the RKHS model based on a Gaussian kernel takes epistatic effects among different markers into account, and (3) comparing the prediction abilities of three models (G-BLUP, EG-BLUP, and RKHS), using several published experimental data sets.

## Theory

Throughout this article, we use the following notations: Let *n* be the number of genotypes, *m* be the number of genotypes having phenotypic records, and *p* be the number of markers. Let be the matrix of SNP markers, where equals the number of a chosen allele at the *j*th locus for the *i*th genotype. Let be the *i*th row of the matrix *X*, which is the marker profile for the *i*th genotype. Let be the allele frequency of the *j*th marker. We do not necessarily assume Hardy–Weinberg equilibrium in the population.

### The G-BLUP model with additive relationship matrix

The baseline model for comparison was the standard G-BLUP model focusing on additive effects, (1)where *y* refers to the *m*-dimensional vector of phenotypic records, is an *m*-dimensional vector of ones, *μ* is the population mean, *g* is an *n*-dimensional vector of additive genotypic values, is the corresponding design matrix allocating phenotypic records to genotypes (*i.e.*, if the *j*th entry of *g* corresponds to the *i*th observation in *y*, and otherwise), and *e* is an *m*-dimensional vector of residual terms.

Without loss of generality, we subsequently assume that and that *Z* is the identity matrix, leading to the simpler form of the model, (2)where *y*, , *μ*, and *e* are the same as defined in (1). We assume that *μ* is a fixed parameter, and *g*, *e* are random parameters with and . *G* denotes the genomic relationship matrix among all genotypes, calculated following VanRaden (2008) as , where and is an matrix with . It was proved that the matrix *G* approaches the well-known numerator relationship matrix *A* as the number of markers increases (Habier *et al.* 2007).

### EG-BLUP: an extended G-BLUP model comprising additive and additive × additive relationship matrices

Focusing exclusively on additive × additive epistasis, the EG-BLUP model has the form (3)where *y*, , *μ*, and *e* are the same as defined in (2). For each genotype, not only the additive genotypic values but also epistatic genotypic values are included in the model. Namely, is an *n*-dimensional vector of additive genotypic values, and is an *n*-dimensional vector of additive × additive epistatic genotypic values. We assume that *μ* is a fixed parameter, , , , and Here the matrix *G* is the same as in the G-BLUP model. In an infinitesimal model, Henderson (1985) suggested using the Hadamard product of the additive relationship matrix by itself to obtain the epistatic relationship matrix *H*. Translated to genomic relationship, this yields (4)This extended G-BLUP model was recently used by Su *et al.* (2012) and Muñoz *et al.* (2014).

When the number of markers is large, we proved that EG-BLUP is equivalent to the model EG-BLUP* with explicit epistatic effects of markers (see the *Appendix*), (5)where *y*, , *μ*, and *e* are the same as before; is the *i*th column of the matrix *W*; is the additive effect of the *i*th marker; is the element-wise product of the two vectors and ; is the additive × additive epistatic effect of the *i*th and the *j*th marker; and *e* is the vector of residual terms. We assume that *μ* is a fixed parameter; ; ; ; and no covariance among , , and *e*. The basic setting of EG-BLUP* in Equation 5 appeared in Wittenburg *et al.* (2011) with different assumptions on the parameters.

Note that the parameters in EG-BLUP* should be considered in the framework of Fisher (1918). Namely, *μ* is the population mean, is the average effect of an allele for the *i*th locus, defined as the regression coefficient of the genotypic values on the number of the allele, and is the epistatic deviation for the *i*th and the *j*th loci.

The extension of Equation 3 to include also higher-order additive × additive genotypic values can be deduced using the same method as in Henderson (1985). We need only to note that the ()th-order epistatic relationship matrix is given by (the Hadamard product of *k* copies of *G*).

### The RKHS regression model based on a Gaussian kernel

We consider the following model that is equivalent to RKHS regression (De Los Campos *et al.* 2010): (6)The notations are the same as in (2) and the assumptions are , , where is an kernel matrix whose entries are functions of marker profiles of pairs of genotypes. It is required that *K* satisfies the semipositive definite property , for all real numbers , . Mathematically, a number of matrices would satisfy this property. For example, we may choose whereby the RKHS model is equivalent to G-BLUP.

In this study, we consider only the Gaussian kernel (Gianola and Van Kaam 2008), (7)where denotes the norm in the Euclidean space and *h* is a bandwidth parameter. As the matrix *K* serves as a genetic relationship matrix among genotypes, the parameter *h* controls how fast the relationship between two genotypes decays as the distance between the corresponding pairs of marker vectors increases. The choice of the bandwidth parameter can be optimized by applying a cross-validation or a Bayesian approach, treating *h* as a random variable. Throughout this study, we assume that *h* is known.

### An explicit explanation of why the RKHS model captures epistasis

We start by inspecting the kernel matrix (7) in more detail. Recall that the entries in *W* are defined as . Hence we haveRecall that . Thus the th entry of *G* is . Write , for all . Then we obtainLet be the matrix of ones and let . Note that in terms of power series, (Levi 1968). Rewriting the above steps in matrix form, we have (8)where (9)Therefore, we can see that the epistatic relationship matrices (for each ) used in EG-BLUP are all involved in the Gaussian kernel for the RKHS model. In this sense, the Gaussian kernel indeed carries the information of additive × additive epistasis up to any order. But note that in the Gaussian kernel, the proportions of the additive and each epistatic relationship matrix in the total matrix are fixed, once the bandwidth parameter is chosen. In contrast, in EG-BLUP, the proportion of in *H* depends on the corresponding variance component, which is an unknown parameter to be estimated.

Based on the above observations, we can actually formulate a model with explicit epistasis effects of markers and prove that it is equivalent to the RKHS model with the Gaussian kernel. Let us consider the following model, which seems to be ill-posed as infinitely many unknown parameters are included. But we immediately show that it is equivalent to the RKHS model with Gaussian kernel, (10)where the notations *y*, , *μ*, , , and *e* are the same as in (4). is the element-wise product of the vectors for . are the *s*th-order epistatic effects among the , , …, and the loci. We assume that *μ* is fixed, *v* is an extra random intercept term with , , , , and there is no covariance among *ν*, , , and *e*.

Now, let *a* be the *p*-dimensional vector , be the -dimensional vector , and be the matrix whose columns consist of the vectors for all . Here denotes the binomial coefficient. With the above notations, Equation 6 can be rewritten in matrix form as (11)with assumptions , , , and and all covariance terms are zero.

Then we haveRecall that . We need to calculate for any Note that in the case of , we have shown in the *Appendix* that . This result can be easily generalized for , using the same method. That is, for any , we haveThus, when *p* is very large, we can approximately treatThen we can deduce thatNote that the matrix is exactly the Gaussian kernel *K* (Equation 8) and that the variance–covariance matrix is exactly the same as in the RKHS model with Gaussian kernel.

Using the same approach as in the *Appendix*, it is straightforward to deduce that the modified RKHS (Equation 11) and the RKHS models give the same predictions for the total genotypic values. Thus, we gave a complete explanation on why the RKHS model takes epistasis into account.

### Comparing G-BLUP, EG-BLUP, and RKHS, using experimental data

We used two published data sets each in wheat and maize for our study. The first data set consisted of 599 wheat lines genotyped by 1447 diversity array technology (DArT) markers (Crossa *et al.* 2010). The second data set comprised 254 advanced wheat breeding lines genotyped by 1576 single-nucleotide polymorphism (SNP) markers (Poland *et al.* 2012). The third data set consisted of 300 maize lines with 1148 SNP markers (Crossa *et al.* 2010). The fourth data set comprised two large half-sib maize panels from the flint and dent heterotic pools (Bauer *et al.* 2013). The dent (flint) panel consists of 847 (833) lines with 31,498 (29,466) SNPs. The phenotypic trait on which we focused in this study was grain yield. More details on the data sets are provided in supporting information, File S1.

Using the four data sets, we tested the option to increase the predicting accuracy by modeling epistasis. To this end, we estimated the prediction accuracy based on the G-BLUP, EG-BLUP, and RKHS models, applying fivefold cross-validations. The prediction accuracy was measured as the Pearson product-moment correlation between predicted and observed genotypic values of the individuals in the test set (more details on methods are included in File S1). We observed that the performance of RKHS was very similar to that of EG-BLUP (Table 1), which fits well with our theoretical findings on the congruency of both models. For the two reanalyzed maize data sets, EG-BLUP and RKHS including epistasis did not outperform G-BLUP ignoring epistasis. In contrast, in the two reanalyzed wheat data sets, we observed that the prediction accuracies for RKHS and EG-BLUP were consistently higher than that for the G-BLUP model.

### Data availability

This study was based on published datasets. Detailed description and the sources of all data sets were provided in File S1.

## Discussion

We focused in our study on digenic additive × additive epistatic effects. Extending the EG-BLUP approach toward additive × dominance and dominance × dominance effects or to higher-order epistasis is straightforward (Henderson 1985). It is important to note, however, that based on the framework used to partition the genotypic variance, additive × additive effects are expected to be the prevailing epistatic effects (Fisher 1918; Lynch and Walsh 1998).

### EG-BLUP and RKHS are computational efficient approaches to tackle epistasis in genomic selection

Extending genomic selection models toward epistasis is often hampered by high computational load. We have demonstrated that EG-BLUP is equivalent to genomic selection approaches modeling explicitly epistatic effects (EG-BLUP*, Equation 5). Moreover, RKHS can also be reformulated as a genomic selection model with explicit epistatic effects (modified RKHS, Equation 10). The computational load of EG-BLUP and RKHS mainly depends on the number of genotypes. In contrast, the computational load of EG-BLUP* comprising additive as well as additive × additive epistatic effects depends on the square of the number of markers. Implementing the EG-BLUP and RKHS models for a previously published maize data set (Bauer *et al.* 2013) with 847 genotypes and 1000 randomly sampled markers is, for instance, up to 130 times faster compared with the corresponding RR-BLUP approach. Consequently, EG-BLUP and RKHS are promising models to routinely integrate epistasis in genomic selection studies.

### Modeling epistasis improved the prediction accuracy in selfing but not in outcrossing species

We compared the cross-validated prediction accuracies, using the G-BLUP, EG-BLUP, and RKHS models based on four published data sets. Interestingly, we observed contrast trends for wheat compared with maize on the performance of models including epistasis (EG-BLUP and RKHS) and G-BLUP without considering epistasis. Namely, EG-BLUP and RKHS were superior to G-BLUP for the wheat data sets but not for the maize data sets (Table 1). Hence, our results suggested that modeling additive × additive epistasis can increase the prediction accuracy in genomic selection for selfing but not for outcrossing species. This is in line with recent findings that additive × additive epistasis substantially affects midparent heterosis in the selfing species rice, but contributes only marginally to heterosis in the outcrossing species maize (Garcia *et al.* 2008). Nevertheless, more experimental data sets are required to examine the role of epistasis in selfing and outcrossing species in more detail. In particular, it seems attractive to study also the role of epistasis involving dominance effects, which entails specific designs such as factorial mating designs (Comstock and Robinson 1952).

In the EG-BLUP model, both the additive and the additive × additive epistatic relationship matrices were derived from molecular markers. If the markers under consideration are in linkage equilibrium (LE), the additive and additive × additive terms in EG-BLUP* are orthogonal in the sense of Cockerham (1954), and hence the estimates of additive and epistatic effects are independent (Álvarez-Castro and Carlborg 2007). However, the assumption of linkage equilibrium may never be true in reality unless only a few loci sparsely distributed on the genome are considered. Hence, we performed a simulation study to investigate whether linkage disequilibrium (LD) among markers, which causes nonorthogonality of the model, has an influence on the performance of EG-BLUP.

Our simulation was based on the first wheat data set [599 wheat lines with 1447 markers (Crossa *et al.* 2010)] and the dent panel of the second maize data set [847 lines with 31,498 markers (Bauer *et al.* 2013)]. We simulated two scenarios: (1) markers contributing to the trait are in LE and (2) markers contributing to the trait are in LD. In all cases, both additive and additive × additive epistatic effects were simulated. The heritability was set to be 0.7. Details for the simulation procedure are presented in File S1. We observed that the prediction accuracy of EG-BLUP was consistently higher than that of G-BLUP in both data sets and both scenarios (Figure 1). Hence, we may conclude that LD among markers has low influence on the effectiveness of EG-BLUP *vs.* G-BLUP.

Another factor that may affect the performance of EG-BLUP is inbreeding. In Henderson’s extended BLUP model (Henderson 1985), the derivation of the epistatic relationship matrix being the Hadamard square of the numerator relationship matrix depends on the assumption of random mating (Cockerham 1954), which may never hold for data from plant breeding. In our study, the marker-derived epistatic relationship matrix in EG-BLUP approximately equals the Hadamard square of the marker-derived additive relationship matrix. This result relies only on the assumption that the marker additive and epistatic effects are independent. Maybe this assumption is more likely to hold in noninbred than in inbred populations. If this is true, the superiority of EG-BLUP over G-BLUP would be more pronounced for noninbred than for inbred populations, provided that epistasis substantially contributed to the trait. An investigation of this problem is interesting but beyond the scope of this study. Nevertheless, our results in both simulation and empirical study indicated that EG-BLUP can be effectively applied to noninbred plant data.

### Enhancing prediction accuracy across a biparental population through modeling epistasis

Previous studies have shown that prediction accuracy is impaired when performing genomic selection across connected biparental populations (Zhao *et al.* 2012; Riedelsheimer *et al.* 2013). This may be explained at least partially by epistatic effects as the genetic relatedness across connected populations may be better exploited by modeling epistasis in addition to additive effects. Again we used a published maize data set (Bauer *et al.* 2013) and investigated whether the prediction accuracy across connected biparental families can be increased by modeling additive × additive epistasis. In our scenario, genotypic values of the lines in one family were predicted using lines from each of the other families. We compared the mean and maximal prediction accuracies for each family and observed no superiority for EG-BLUP and RKHS (including epistasis) compared with G-BLUP (ignoring epistasis; Figure 2). The sizes of the biparental populations were small, ranging from 17 to 133. This small population size can substantially reduce prediction accuracy exploiting epistasis, as has been shown previously for QTL mapping (Carlborg and Haley 2004). In addition, maize as an outcrossing species is likely to be influenced only little by additive × additive epistasis in contrast to selfing species (Garcia *et al.* 2008). Therefore, it will be interesting to investigate in future studies whether prediction accuracy across connected biparental populations can be improved, modeling epistasis using large biparental populations in selfing species.

## Acknowledgments

We thank Yusheng Zhao and Timothy Sharbel for their valuable comments on the manuscript. We thank the authors in Crossa *et al.* (2010), Poland *et al.* (2012), Bauer *et al.* (2013), and Lehermeier *et al.* (2014) for making their data sets publicly available. We are grateful to all reviewers and the editor for their helpful comments and suggestions, which greatly improved the manuscript. This study is based on published data sets. The authors have no conflicts of interest to declare.

## Appendix: A Proof of the Equivalence Between EG-BLUP and EG-BLUP* When the Number of Markers Is Large

Let us start with the EG-BLUP* model (Equation 5). Let *a* be the *p*-dimensional vector of the ’s and *v* be the -dimensional vector of the ’s. Let *U* be the matrix whose columns are given by the vectors . Then Equation 5 can be simply written aswith assumptions , , and and all covariance terms are zero.

Then we have

The matrix is an matrix whose entry is given byThen it is easy to deduce thatHence we haveNow we claim thatwhich means that when *p* is very large, we can approximately treat . For this purpose we need only to prove (A2)In fact, the th entry of the matrix is (A3)Note that we always exclude monomorphic markers in the analyses. So we can assume that , where is the threshold of minor allele frequency in the quality control (*e.g.*, or 0.05). Then the numerator of (A3) is a sum of *p* positive numbers, each belonging to the interval , while the denominator is a sum of positive numbers, each in the interval . Thus we havewhich proved (A2).

Hence (A1) is simplified to the following:The right-hand side of the above formula is exactly the same as the variance–covariance matrix for Equation 3 in EG-BLUP.

By the results of Henderson (1975), the BLUPs of *a* and *v* are given by (A4)where (A5)On the other hand, the BLUPs of and in the EG-BLUP model are given by (A6)where is the same as in (A5) as we have proved that the matrices in EG-BLUP and EG-BLUP* are the same.

Comparing (A4) and (A6), we see that and , confirming that EG-BLUP and EG-BLUP* give the same predictions.

## Footnotes

*Communicating editor: F. van Eeuwijk*Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.177907/-/DC1.

- Received May 5, 2015.
- Accepted July 23, 2015.

- Copyright © 2015 by the Genetics Society of America