# Needles: Toward Large-Scale Genomic Prediction with Marker-by-Environment Interaction

- Arne De Coninck*,
^{†},^{1}, - Bernard De Baets*,
^{†}, - Drosos Kourounis
^{‡}, - Fabio Verbosio
^{‡}, - Olaf Schenk
^{‡}, - Steven Maenhout
^{§}and - Jan Fostier
^{†},**

^{*}KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, B-9000 Ghent, Belgium^{†}Bioinformatics Institute Ghent, Ghent University, B-9000 Ghent, Belgium^{‡}Institute of Computational Science, Università della Svizzera italiana, CH-6904 Lugano, Switzerland^{§}Progeno, B-9052 Zwijnaarde, Belgium^{**}Department of Information Technology (INTEC), Ghent University–iMinds, B-9000 Ghent, Belgium

- 1Corresponding author: KERMIT, Department of Mathematical Modelling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, B-9000 Ghent, Belgium. E-mail: arne.deconinck{at}ugent.be

## Abstract

Genomic prediction relies on genotypic marker information to predict the agronomic performance of future hybrid breeds based on trial records. Because the effect of markers may vary substantially under the influence of different environmental conditions, marker-by-environment interaction effects have to be taken into account. However, this may lead to a dramatic increase in the computational resources needed for analyzing large-scale trial data. A high-performance computing solution, called *Needles*, is presented for handling such data sets. Needles is tailored to the particular properties of the underlying algebraic framework by exploiting a sparse matrix formalism where suited and by utilizing distributed computing techniques to enable the use of a dedicated computing cluster. It is demonstrated that large-scale analyses can be performed within reasonable time frames with this framework. Moreover, by analyzing simulated trial data, it is shown that the effects of markers with a high environmental interaction can be predicted more accurately when more records per environment are available in the training data. The availability of such data and their analysis with Needles also may lead to the discovery of highly contributing QTL in specific environmental conditions. Such a framework thus opens the path for plant breeders to select crops based on these QTL, resulting in hybrid lines with optimized agronomic performance in specific environmental conditions.

- GenPred
- shared data resource
- genomic selection
- genomic prediction
- marker-by-environment interaction
- high-performance computing
- variance component estimation
- simulated data

GENOMIC prediction methods most often rely on a linear mixed-model framework that models at the same time fixed effects and random genetic effects (Meuwissen *et al.* 2001). These genetic effects are modeled by assigning a small effect to markers, which are used to genotype the individuals. Introducing a large number of genome-wide markers in the analysis has already proven to be beneficial instead of using only pedigree information or a few markers that are known to have a significant effect (so-called marker-assisted selection) in animal breeding (VanRaden *et al.* 2009; Aguilar *et al.* 2010) as well as in plant breeding (Bernardo and Yu 2007; Crossa *et al.* 2010; Burgueño *et al.* 2012). From an animal breeding perspective, environmental effects are mostly not modeled and regarded only as a nuisance because environmental effects are either negligible (König *et al.* 2005) or can be under the control of the breeders by creating selection environments that are very close to commercial environments (Mulder and Bijma 2005). Using this assumption, a distributed average information restricted maximum likelihood (AI-REML) ridge regression best linear unbiased prediction (DAIRRy-BLUP) framework (De Coninck *et al.* 2014) was developed to employ the computing power of supercomputing clusters for analyzing data sets with a large number of genotyped individuals based solely on dense linear algebra because genetic marker information is mainly dense.

However, when cultivating plants, the environment and some specific environmental conditions (*e.g.*, soil moisture, solar radiation, and air humidity) can have a much stronger impact on the phenotypic trait, and the effects of markers may vary in different environments. It is thus recommended to also include genotype-by-environment interaction (G × E) effects for genomic prediction in plant breeding (Denis *et al.* 1997; Cooper *et al.* 2005). Different models have been presented to account for these interaction effects in genomic prediction, and most of these models apply a two-stage approach, where in the first stage an adjusted genotype mean is computed across environments, which is then used in the second stage to predict breeding values for untested plants based on their marker genotypes (Schulz-Streeck *et al.* 2013b). Actually, this two-stage approach commonly includes a preliminary step in which the intraenvironmental effects, such as block, row, and column effects, are taken into account when computing the genotypic mean per environment. These intraenvironmental effects can be modeled together with a location effect and the G × E effects to immediately obtain the genotypic means across the environments in the first step of a two-step approach (Schulz-Streeck *et al.* 2013b). However, in recent single-stage analyses, where the computation of genotypic means across environments is avoided and the interaction effects are explicitly modeled, the phenotypic records are mostly already corrected for spatial variations inside the environment (Burgueño *et al.* 2012; Heslot *et al.* 2014; Lopez-Cruz *et al.* 2015). Nonetheless, the single-stage approach may include the modeling of these intraenvironmental effects to enable the direct analysis of the raw phenotypic data (Schulz-Streeck *et al.* 2013a). The genetic effects can be assumed to follow a wide range of distributions. The most widely used choice is the assumption that genetic effects come from a normal distribution, and while other assumptions may lead to better predictions of the genomic breeding values, the normality assumption is a viable alternative because of its simplicity and computational efficiency (Crossa *et al.* 2010; Heslot *et al.* 2012). This assumption leads to the so-called best linear unbiased predictors (BLUP) for the random genetic effects (Henderson 1973).

When genetic marker information is applied for calculating correlations between individuals, this is referred to as GBLUP, where the G stands for use of a genomic relationship matrix instead of a relationship matrix based on pedigree data (Habier *et al.* 2007). More advanced methods depending on correlations based on pedigree as well as genetic marker information do sometimes result in a slight increase in the prediction accuracy of the breeding values, but the gain in prediction accuracy mostly does not outweigh the added complexity (Crossa *et al.* 2010; Burgueño *et al.* 2012). However, these methods can be of importance when records of ungenotyped individuals should be included in the analysis, which is commonly the case in animal breeding because historical records of ungenotyped individuals then can be linked to records of genotyped individuals owing to the availability of extensive pedigree information. (Aguilar *et al.* 2010; Christensen and Lund 2010). Recently, the GBLUP methodology has been extended to incorporate G × E effects by assuming that the genetic effects were different in each environment, where correlations between genotypes or environments could be included in the covariance matrices for the random genetic effects and the residual errors (Burgueño *et al.* 2012). Other methods include a global genetic effect and variable genetic effects across the environments, implying correlation across environments through the shared global genetic effects (Jarquín *et al.* 2014; Lopez-Cruz *et al.* 2015).

In all these studies, genetic marker information is used only to derive correlations between individuals, while originally in genomic prediction the effect of each marker was modeled explicitly, and breeding values were calculated by adding all the genetic marker effects (Meuwissen *et al.* 2001). GBLUP is still preferred over the explicit modeling of the genetic marker effects because it enables solving the BLUP equations in the dimensions of the number of genotypic lines used in the study, which in common data sets is still lower than the number of genetic markers included in the linear mixed model. Therefore, the computational burden is still lower for a GBLUP approach than for the explicit modeling of marker effects and their environmental interactions. However, our approach models genetic marker and their environmental interaction (M × E) effects explicitly because we believe that increasing the number of genotypic lines in the analysis has a great potential to increase the prediction accuracy of the model, as was already shown in previous experimental and theoretical research (Hayes *et al.* 2009; De Coninck *et al.* 2014). In such data-driven biology, the number of genotypic lines in fact can become larger than the number of markers, lending support to our explicit modeling approach.

Such explicit modeling of the marker effects was actually first applied in the field of QTL mapping, an important sidetrack of genomic prediction, helping marker-assisted selection by inclusion of the known QTL as genetic markers (Podlich *et al.* 2004). Of course, the field of QTL mapping also includes environmental dependence of the QTL to obtain reliable estimates of the QTL main effects (Cooper *et al.* 2002). Linear mixed models used in QTL mapping resemble those of genomic prediction with as a major difference that the QTL effects are mostly modeled as fixed effects next to a random genetic effect, leading to the detection of only a few QTL (maximum 40) (Schön *et al.* 2004; Piepho 2005; van Eeuwijk *et al.* 2005; Boer *et al.* 2007). However, by treating the QTL effects as fixed effects, they tend to get overestimated, leading to less reliable predictions of the breeding values based on these QTL effects (Boer *et al.* 2007). Therefore, because explicit modeling of the marker and M × E effects in genomic prediction directly returns predictions for these effects, it seemed interesting to investigate whether these genomic prediction models can be applied for detecting whether or not QTL are stable across environments. Although in the next section the statistical model for genomic prediction will be explained generally for SNP markers and their interaction effects, the *Results* section will be focused only on the prediction accuracy of the simulated QTL marker and their environmental interaction (Q × E) effects. In fact, simulating QTL and their effects on the trait is the only way to verify whether estimates of genetic marker effects are accurately predicted because, for random SNP markers, it can never be ascertained what their exact effects are on the phenotypic trait.

Explicitly modeling M × E effects can lead to huge numbers of effects to be estimated because every marker is coupled once to each environmental condition. For instance, when plants are tested in 100 different environmental conditions and genotyped for 3000 markers, this leads to 300,000 M × E effects. Fortunately, the information about the M × E effects is very sparse because each observation is made under a specific environmental condition, resulting in this case in an incidence matrix for the M × E effects that is filled with at least 99% zeroes. Using only dense matrix algebra to provide estimates of all these effects would require huge distributed systems and result in an unnecessary waste of memory and computational resources. Therefore, we exploit the sparsity of the information about the M × E effects and use a technique that couples sparse and dense linear algebra for analyzing such large-scale genomic prediction settings. The first implementation of such a framework, presented herein, is called *Needles*, referring to the saying, “finding needles in a haystack,” which is very appropriate when trying to find the genetic markers that contribute most to a certain trait based on the results of multienvironment trials. The source code of this implementation can be found at https://github.com/arnedc/Needles.

The focus of this research is (in order of importance) (1) presenting a high-performance computing implementation for genomic prediction with marker-by-environment interaction, (2) showing that such an implementation can deal with data sets consisting of many observations of many genotypic lines, (3) showing, by a simulation study, that such an approach also can be useful for accurately predicting QTL effects and their environmental interaction effects, and (iv) raising awareness that acquiring more data may lead to better predictive abilities, even with simple models. It is not the goal of this paper to present a comparison of different models or put forward a certain model, but in contrast, it should be a stepping stone to other research to apply the same methods for performing large-scale analyses with a model of choice on simulated or real-life data. After all, the basic algebraic structure and high-performance computing techniques are already laid out in Needles and can be built on to extend the framework to more complex applications.

## Methods

### Simulated data

All data used for benchmarking were simulated using AlphaMPSim (Hickey *et al.* 2014), which is flexible simulation software for creating large populations of multiparent recombinant inbred lines whose polygenic traits are controlled by large numbers of QTL. The default option of AlphaMPSim was chosen to create a historical population of founder haplotypes using MaCS (Chen *et al.* 2009). The distribution of the AlphaMPSim software included an example for obtaining wheat-like data, mimicking the evolution of wheat according to specified mutation and recombination rates by employing MaCS. In this way, a pool of 400 haplotypes for each of the 21 chromosomes was provided as a starting point for this study. A base generation of 2000 homozygous founders with a genetic diversity as defined by the historical population was then constructed by randomly selecting one of these haplotypes per chromosome. From this base generation, 500 lines of four-way recombinant inbred lines (RILs) were simulated, as depicted schematically in Figure 1. The choice for simulating RILs is motivated by the fact that these are used frequently in gene-environment interaction studies and QTL analyses (Shindo *et al.* 2003; Crossa *et al.* 2014). The pedigree was set up in such a way that there were 50 siblings per line, and each of these siblings went through seven generations of selfing. This resulted in a total of 203,000 individuals to be simulated, of which the 25,000 of the last generation resemble common wheat RILs.

These 25,000 RIL genotypes were simulated by AlphaMPSim, and traits were simulated based on two different numbers of QTL (1575 and 3150). The global contributions of the QTL to the trait were sampled from a Gaussian distribution with mean zero and SD of one unit. For each genotype, the sum of the global QTL effects is defined here as the true breeding value, ignoring any environmental effect and regarding the breeding value as the global genetic potential of a RIL. Because the goal was to simulate multienvironment trials with different setups, random samples were drawn from these 25,000 RIL genotypes, and their phenotypes were simulated for different environments. Therefore, 32% of the QTL (respectively, 500 and 1000) were randomly chosen to have a variable effect depending on the environment, sampled from a Gaussian distribution. The total simulated phenotypes were a sum of the true breeding values, the Q × E effects for the given genotype, a fixed environmental effect, and a residual term sampled from a Gaussian distribution. This means that no specific row, column, or block effects were simulated, and thus the phenotypic records can be regarded as already being corrected for spatial variations inside each environment. The variances of the Gaussian distributions were chosen in such a way that of the total variance of the phenotypes, excluding the fixed environmental effects, 37% is attributed to pure genetic effects, 23% to interactions between the genotype and the environment, and the remaining 40% to factors unaccounted for (residual term). These values are in accordance with the findings of Jarquín *et al.* (2014) on experimental data about grain yield of 139 wheat lines tested in 340 year × location combinations, which also was the basis for the choice of the variances of the Q × E effects and the residual term.

The multienvironment trial was simulated in such a way that in each environment an equal number of different RIL genotypes were tested, and 4% of the total number of phenotypic records per environment came from RIL genotypes that were tested across all environments. These so-called standards were replicated four times per environment, except for the simulation with the largest number of records, where they were replicated 10 times in each environment. As such, the simulation resembles a sort of augmented design (Federer and Raghavarao 1975), where the RIL genotypes that are not standards are tested in four different environments, except for the largest simulation, where they are tested in all (10) environments. To illustrate this with an example, consider a simulation with 1000 phenotypic records per environment and 50 environments in total. In each environment, 40 of the records come from 10 standards that are replicated four times in this environment. The other 960 records are simulated from 960 different RIL genotypes. The 10 standards are also replicated four times in all the other environments, meaning that of the total number of 50,000 multienvironment trial records, 2000 are obtained by simulating the phenotypic trait of the 10 standards in all environments. The other 48,000 records come from the 960 nonstandard RIL genotypes tested in each environment, but because these genotypes are tested in four environments, only 12,000 different nonstandard genotypes can be included in this multienvironment trial. It must be stressed that an environment in this context is defined as an abstract concept and can be seen as a location, a location × year combination, or a class of environments that share more or less the same environmental conditions.

### Statistical method: linear mixed model

The underlying framework of the genomic prediction model with explicit marker-by-environment interaction is the linear mixed model (1)where is the vector of *n* observations, is the vector of *k* fixed environmental effects, is the vector of *l* random marker-by-environment interaction effects, is the vector of *m* random genetic marker effects, and is the residual error term. The design matrices for the effects are, respectively, , , and , where the and matrices are 0/1/2 coded, with 0 standing for homozygosity in the most frequent allele, 1 standing for heterozygosity, and 2 standing for homozygosity in the least frequent allele. This allele coding was chosen instead of a centered and standardized coding so that could be stored more compactly as a sparse matrix. For a specific environment *i*, the model becomes (2)where **1** is a vector of ones, is the fixed environmental effect of environment *i*, is the matrix of genotypes evaluated in environment *i*, and is the marker-by-environment interaction effects in environment *i*. Because the phenotypic records were modeled as being already corrected for spatial variations inside each environment, no specific intraenvironmental effects (*e.g.*, block, row, or column effects) had to be modeled.

The random effects and residual term are modeled as (3)which results in an *a priori* assumption that the observations are distributed as (4)with (5)Although this model, commonly referred to as a *compound symmetry* (CS) model, imposes some restrictions because a certain homogeneity among the environments is assumed, it is the simplest method to include marker-by-environment interaction effects and has been shown to perform rather well compared to more advanced methods (Burgueño *et al.* 2012). Moreover, Boer *et al.* (2007) showed that this model is the best balance between model complexity and fit to the data when analyzing the yield of maize, although they mainly used a factor analytic model (Piepho 1998) for its flexibility, and they concluded that the most appropriate model might be specific to the crop and trait under study. Nonetheless, the CS model also was chosen to simulate the data because it required the least number of assumptions about the environmental interaction effects.

The best linear unbiased estimators (BLUE) and predictors (BLUP) of the fixed and random effects are linear estimates or predictions that minimize the mean squared error and exhibit no bias. They are also the solutions of the so-called mixed-model equations (MMEs) (Henderson 1963): (6)where , , and are the estimators/predictors for , , and . The coefficient matrix of this equation will be referred to as .

This system of linear equations depends on the variance components and , which are not known *a priori*. Actually, and are ratios of variance relative to the variance of the residual error , but for ease of notation, we will also refer to these variance ratios as *variance components*. These variance components are estimated using the AI-REML methodology based on the entire data set. AI-REML is an iterative, gradient-based approach that has been shown to converge more rapidly than a derivative-free or expectation-maximization approach (Gilmour *et al.* 1995). The update equation for the variance components is (7)where is the vector of variance components at iteration *n* [here ], is the AI update matrix at iteration *n*, and is the gradient of the REML log likelihood with respect to evaluated at . This is very similar to Newton’s method for maximizing likelihood functions, but the difference lies in the use of the AI update matrix instead of the Hessian matrix. Originally, the expected value of the Hessian matrix, also known as the *Fisher information matrix*, was used as the update matrix in Equation 7 because of the tedious computation of the Hessian matrix (Patterson and Thompson 1971). But the average information matrix, a simplified average of the Hessian or Fisher information matrix, can be constructed in an elegant way and therefore lends itself better to be used in a distributed computing paradigm for updating the variance components in each iteration (De Coninck *et al*. 2014).

The REML log-likelihood function can be written as (8)with (9)When and are known, an analytical solution for the maximization of this likelihood function with respect to can be found. However, for maximization with respect to and , we have to resort to the iterative AI-REML technique, and the gradient of the REML log likelihood thus should be evaluated for both variance components: (10) (11)where and are the blocks of the inverse of corresponding to the blocks in Equation 6 containing, respectively, and .

### Implementation details

Although the solution of system (6) seems trivial, it can become a very computationally demanding task when the number of effects to be estimated becomes so high that the coefficient matrix no longer fits in the working memory of a single computing node. Moreover, calculation of the inverse matrices in (10) and (11) also will become a time-consuming task because the time complexity is a cubic function of the dimension of the matrix.

To efficiently apply the computing power of a high-performance computing cluster, we developed an algorithm that takes advantage of distributed computing techniques and the fact that a large part of the coefficient matrix is filled with zeroes. The latter is due to the fact that every observation comes from a single environment, and thus, is only sparsely filled with information. Therefore, the coefficient matrix is split up into sparse and dense parts: (12)with (13) (14) (15)where is a sparse submatrix, and and are treated as dense submatrices. The solution of system (6), as well as calculation of the two blocks of the inverse of , then can be performed blockwise (more information is given in Supplemental Material, File S1). These blockwise operations involve construction of the Schur complement of : (16)which transforms the solution of system (6) into solving, respectively, (17) (18)Moreover, the block of (11) is precisely the inverse of this Schur complement. Although only the diagonal elements are needed for calculating the trace of , the entire inverse of is needed for obtaining these diagonal elements. For a sparse matrix, however, a selected inverse can be computed, requiring only the elements of the inverse corresponding to nonzero elements in the factors of the sparse matrix (Takahashi *et al.* 1973). The diagonal elements of then can be calculated using and the selected inverse of : (19)where , is the *i*th row of , is element of , and is element of .

The sparse matrix is stored in compressed sparse row format because it is also used by the PARDISO solver (Schenk *et al.* 2008, 2007; Kuzmin *et al.* 2013), which is applied for solving the sparse matrix equation and calculating the selected inverse of submatrix . PARDISO is a multithreaded solver, but it cannot yet cope with a distributed sparse matrix. Therefore, one computing node is entirely reserved for storing sparse submatrix , and all CPU cores available on this node are applied for constructing, factorizing, and partially inverting it. All other nodes are used for storing dense submatrices and in a one-dimensional block-cyclic column-distributed way. By distributing the matrices in this way, can be calculated easily by broadcasting to all nodes and solving on each node for all columns of available on this node. As such, the solution is automatically distributed in a one-dimensional block-cyclic column-distributed way over the available nodes.

### High-performance computing infrastructure

All results were obtained using the Delcatty cluster on Stevin, the high-performance computing (HPC) infrastructure of Ghent University. This cluster consists of 160 computing nodes interconnected with an FDR InfiniBand network. Each node contains a dual-socket octa-core Intel Xeon Sandy Bridge (E5-2670) 2.5-GHz CPU (16 cores) with 64 GB of physical memory. All data were accessible from storage available through a fast GPFS mount over the InfiniBand network. The Delcatty nodes are running Scientific Linux 6.1 (SL6).

The algebraic operations that are to be performed on the distributed dense matrices rely on the standard libraries PBLAS (Choi *et al.* 1996) and ScaLAPACK (Blackford *et al.* 1997) as implemented in Intel MKL 11.2. All communication that is required between the nodes is handled by MPI (Snir *et al.* 1998) as implemented in Intel MPI 5.0. The entire code was written in C/C++ and compiled with Intel C++ Compiler 15.0. The code makes use of a hybrid MPI/OpenMP parallelism, which means that each MPI process is mapped to an entire node, while OpenMP is used to apply each CPU core on the node to perform the tasks of each MPI process.

### Data availability

Data was simulated with the freely available software AlphaMPSim. The code of Needles is freely available on github (https://github.com/arnedc/Needles).

## Results

The results (discussed further on) are mainly to show the capabilities of our software and to demonstrate the usefulness of the large-scale analyses that are enabled by this parallel sparse-dense framework. An overview of the required wall time and working memory for different sizes of analyses can be found in Table 1. For each analysis, the starting values of the AI-REML algorithm were set to 0.001, and convergence was reached when the relative update of the values of the variance components and the relative update of the log likelihood were less than 1%. The largest data set analyzed, in terms of number of effects included, resulted in a linear mixed model with 100,000 observations, 100 fixed environmental effects, 3150 random genetic marker effects, and 315,000 random M × E effects. The analysis of such a data set could be performed using four computing nodes with 16 CPU cores per node in 3.25 hr. The root node, managing all operations on sparse submatrix , required a maximum peak of 40 GB of virtual memory, while the other nodes required a maximum of 44 GB of virtual memory. In total, the analysis thus required 172 GB of memory, while performing the same analysis without any sparse matrix formalism would require more than 800 GB of working memory. It also can be seen in Table 1 that for a larger number of observations, the required amount of memory does not change dramatically. Only the computing time increases because the setup of the coefficient matrix takes more time. This is also reflected in the average wall time per iteration, which does not increase dramatically when more observations are included because the setup of the matrices, except for the dense submatrix , is performed only once before the iterative algorithm starts.

A detailed analysis of the time complexity of the algorithm is outside the scope of this paper because the coupling of sparse and dense matrix formalisms and the several underlying mechanisms of the analysis framework do not allow for a straightforward conclusion. Nonetheless, when all matrices would be treated as dense, the time complexity of the algorithm would be cubic with respect to the number of effects included in the model because the factorization and inversion of the coefficient matrix dominate the computing time of the algorithm. Table 1 clearly shows that the wall time per iteration does not scale cubicly with the number of included effects, which is due to introduction of the sparse matrix formalism. Factorizing and inverting a sparse matrix are no longer only functions of the dimension of the matrix but depend mainly on the number of nonzeroes in the factors of the matrix. As such, it is hard to predict how the wall time per iteration will increase with an increasing number of effects included in the model, but it will always stay below the upper bound of a cubic complexity with respect to the number of effects included in the model.

In the remainder of this paper, marker effects will coincide with QTL effects because simulated data are used, and thus the genotypes for the QTL are known together with their true effects. Therefore, the marker-by-environment interaction effects are now referred to as Q × E effects because we actually model the QTL-by-environment interaction effects. Using simulated data makes it possible to evaluate the prediction accuracy of these Q × E effects because we know the true values of these effects used to simulate the phenotypic records.

### Accuracy of estimation of effects

Instead of looking at the accuracy of estimated breeding values or predicted phenotypes, which is commonly done in the field of genomic prediction, we decided to concentrate on the prediction accuracy of the values of the effects. This is made possible by the fact that simulated data are analyzed, while for real-life data the true effects of the genetic markers are not known. The prediction accuracy is measured by the Pearson correlation between the estimated and true effects used for simulating the phenotypic traits. Being able to estimate these effects correctly not only might result in a more precise prediction of the phenotypes in certain environments but also might open up opportunities for pinpointing highly contributing effects and detecting variations of these effects under certain environmental conditions. To obtain some insight into which parameters of a trial setup influence prediction accuracy, a range of multienvironment trials was simulated. An overview of the different trial setups is given in Table 2. Each of these trials was simulated for RILs with a trait determined by 1575 QTL and one determined by 3150 QTL to investigate what the impact of the number of QTL would be on the prediction accuracy.

The prediction accuracy for the fixed environmental effects was always at least 0.97, even for the smallest field trial. The results for the other effects are summarized in Figure 2 and Figure 3. Estimation of the QTL effects depends mainly on the total number of observations, regardless of the setup of the field trial (Figure 2). For example, the correlations of the QTL effects for the field trials with a total of 100,000 observations for the trait determined by 3150 QTL are 0.904 (10 environments, 10,000 RILs per environment), 0.902 (50 environments, 2000 RILs per environment), and 0.9 (100 environments, 1000 RILs per environment), resulting in three overlapping points.

On the contrary, the prediction accuracies of the Q × E effects are mainly affected by the field trial setup and in particular by the number of observations per environment (Figure 3). Given a fixed number of observations per environment, the prediction accuracies for the Q × E effects are somewhat lower for a lower number of total observations (corresponding to fewer environments), which is probably a side effect of the fact that the overall QTL effects are also predicted more poorly for these trials. This effect can be seen in Figure 3 by the small differences in Pearson correlation for 1000 observations per environment, where the highest correlation is found for a trial on 100 environments (100,000 total observations) and the lowest for a trial on 10 environments (10,000 total observations).

As already shown theoretically (Lande and Thompson 1990), the number of QTL affecting a trait also plays a major role in the predictive power of the genomic prediction model. The more QTL affecting a trait, the bigger the training set should be for accurately predicting the QTL effects (Figure 2). For obtaining similar prediction accuracy for the Q × E effects when the trait is described by a larger number of QTL, it is again the number of observations per environment that should be increased (Figure 3).

### Detecting QTL susceptible to environmental conditions

The high prediction accuracies of the Q × E effects for the trials with the most observations per environment triggered the question of how well the model could identify the QTL that contributed the most in certain environments. Therefore, the results of the analysis of the trial with 25,000 RILs per environment for a simulated trait with 1575 QTL were examined more thoroughly. The predicted Q × E effects in a certain environment for this trial are plotted in Figure 4. From this plot it can be seen that it is hard to determine how many QTL should be selected as having an important environment-dependent effect. A commonly used approach is to select those with an absolute value greater than a multiple of the SD or interquartile range. However, this always depends on interpretation of the results, and the criterion used may have to change for different traits.

To circumvent these potential problems, a parameter-free methodology akin to machine learning, namely, two-means clustering (Hartigan and Wong 1979), was used to separate the QTL with a substantial Q × E effect in a specific environment from the other QTL. The usefulness of this technique for different numbers of QTL with simulated Q × E effects was assessed by producing data sets for a trait with 1575 QTL where, respectively, 5, 25, 50, 100, and 250 of these QTL had environment-dependent effects. After analysis of these data sets and performing two-means clustering on the estimated Q × E effects, it was seen that the number of QTL in the cluster corresponding to high absolute predicted environmental interaction effects was always less than the number of QTL with a truly simulated Q × E effect. This is in accordance with what is expected because some of these QTL have only a very small simulated Q × E effect in a certain environment and therefore cannot be detected as interacting with that specific environment.

Now that a method is available to select the QTL with a high predicted Q × E effect, some measures are presented to quantify how well this selection corresponds to the underlying truth. The positive predictive value (PPV) is the relative number of QTL found to have important Q × E effects that are also in the set of QTL with a truly simulated Q × E effect. It is thus a measure of the relevance of the detected QTL with an important environmental interaction. The true positive rate (TPR) is the relative number of QTL with a truly simulated Q × E effect that are also found by clustering the predicted values for the Q × E effects of the QTL. Therefore, it measures the model’s ability to correctly detect QTL that do have important Q × E effects.

Not only is it important to pinpoint the QTL that have a specific environmental effect, but the correct estimation of their mutual ranks also is desirable to enable selection of the most contributing QTL for a specific environment. A simple measure is the top-10 rank, which states how many of the 10 truly most contributing QTL in a certain environment are also in the top 10 of the QTL with the highest predicted Q × E effects. For a more complete measure of the ranking of the QTL, Spearman’s rank correlation is used, which is the Pearson correlation of the ranks without taking into account the actual values of the effects. However, when only a small part of the Q × E effects has a true nonzero value, the Spearman correlation is biased downward by the fact that the largest part of the true Q × E effects has an equal rank, while this is not the case for the predicted Q × E effects.

Therefore, the values of the predicted Q × E effects for the QTL not selected by clustering as having a substantial Q × E effect were set to zero, and the Spearman correlation was calculated between the new set of predicted Q × E effects and the simulated Q × E effects in a specific environment. Although this already resulted in a reasonably high rank correlation coefficient, this number is still penalized by comparing the predicted results with an unattainable ideal scenario. To enable a comparison of the results of our model with the best possible results, two-means clustering also was performed on the simulated Q E effects. The values of the Q × E effects for QTL belonging to the cluster corresponding to the highest environmental interactions remained unchanged, while the others were set to zero. This corresponds to a scenario where the Q × E effects are estimated correctly, and from these estimations, the QTL with a high environmental interaction effect are discriminated from the others using two-means clustering. The Spearman correlation between this set and the new set of predicted values after clustering was calculated, representing a fair measure of how well the most contributing QTL in a certain environment can be ranked.

The results of these analyses are summarized in Table 3. The raw Pearson correlation is the same correlation as that used for setting up Figure 3 and is a measure of the accuracy of the predicted values for the Q × E effects. However, it is not robust if outliers are present, which can be seen here by the somewhat lower correlation for small numbers of QTL with a simulated Q × E effect. This is also one of the reasons why the Spearman correlation is used as a measure of the accuracy of ranking the QTL as a function of their Q × E effects further on. The high numbers of the PPV for the comparison with the ground truth are a very reassuring result, stating that if we can extract QTL that have a substantial Q × E effect, these will almost surely be truly influenced by environmental conditions. In addition, the high top-10 ranks and high PPV for the comparison with the clustered simulated Q × E effects show that the QTL that are found to have a high Q × E effect are most probably also the QTL with a truly high environment-specific effect. Furthermore, it can be seen that only fewer than half the QTL with a simulated Q × E effect can be extracted, but if we compare this with what can be extracted in the most ideal situation, cfr clustering on the simulated Q × E effects, between 65 and 75% are also found by our linear mixed model. Finally, the Spearman correlation between the clustered sets of true and predicted Q × E effects shows that the model can discriminate very well between highly contributing, less contributing, and negligible QTL in a specific environment. This correlation remains fairly constant when the number of QTL with a simulated Q × E effect is changed, with a minor decrease noted for the highest number of QTL with a simulated Q × E effect.

## Discussion

This paper presented a novel approach of combining sparse and dense matrix algebra together with distributed computing techniques to enable large-scale genomic prediction with marker-by-environment interaction. Given the increase in genomic data for plants (Ganal *et al.* 2012), there is a need for a high-performance computing framework that can efficiently make use of a supercomputing cluster to analyze these future large-scale data sets. To the best of our knowledge, this is the first attempt at introducing the computing power of a supercomputing cluster in the field of genomic prediction for plants, taking into account variations in genetic effects across environments. It enabled us to analyze data sets with 100,000 observations and more than 300,000 effects in less than 4 hr using four computing nodes. Adding more observations increases the runtime because of a more tedious setup of the system of equations, but this runtime can be decreased by using more computing nodes. The required memory will increase somewhat as a result of the fact that and are read in entirely on each node, but given their sparse storage format, this is not a limiting factor for Needles.

As a first result of the analysis of such large-scale data sets with Needles, it has been shown that by increasing the number of phenotyped and genotyped individuals in the training data, the prediction accuracy for the QTL effects can be increased dramatically. Furthermore, if it is desirable to boost the performance in a certain type of environment, it is essential to have as many observations as possible coming from this environment. In this way, the interactions of the QTL with this environment can be estimated fairly accurately, and it has been shown that when sufficient observations in this environment are included, the most contributing QTL in this specific environment can be detected by two-means clustering on the predicted Q × E effects. This result was at first rather unexpected because BLUP are under the influence of shrinkage toward zero that is homogeneous across markers (de los Campos *et al.* 2013), but apparently it can still identify QTL with a high effect when enough data are available. It thus enables the selection of crops on these QTL for environment-specific plant breeding.

The selection of environment-dependent QTL by two-means clustering was chosen because it required no prior knowledge about the Q × E effects. However, because of the fact that the data were simulated by Q × E effects drawn from a Gaussian distribution and it was also assumed so in our linear model, this information could be included *a priori*. A more advanced method is known as *Gaussian mixture modeling*, where clusters are formed by attributing values to being drawn from different Gaussian distributions (Friedman *et al.* 2001). In our case, the data are split up into two distributions with a different variance. The results are not shown here, but the main difference with two-means clustering is the fact that the TPR was somewhat higher, meaning that more QTL with a true Q × E effect could be detected in this way. However, the PPV dropped, making it less sure that the selected QTL truly had a significant environmental interaction effect.

The postprocessing by two-means clustering of the Q × E effects was proposed because estimations of all possible marker × environment combinations are available as a result of the analysis. The paper by Heslot *et al.* (2014) is the only one we found where M × E effects also were modeled explicitly and where an effort was made to detect QTL with specific environmental variations. However, a Bayesian Lasso approach was used, which made it computationally impractical to model all marker × environment combinations together with the marker effects at once, and thus a multistep approach was needed, leading to quite a complex procedure. Nevertheless, their results indicated that QTL with important Q × E effects had small main effects, which is in contrast with assumptions of earlier studies, where Q × E effects were only modeled for QTL with a high main effect (Schön *et al.* 2004; van Eeuwijk *et al.* 2005). These early studies were mainly interested in mapping QTL that were stable across environments and thus obtained better estimates of the global genetic potential of an individual regardless of the environment. However, currently attention is directed more toward clarifying which QTL are susceptible to specific environmental covariables such as climatologic or soil conditions (Boer *et al.* 2007; Heslot *et al.* 2014). This makes it possible to define certain target populations of environments (TPEs) sharing some of the same environmental covariables so that environment-dependent QTL can be exploited for these TPEs (Heslot *et al.* 2014). Of course, the weather, playing an important factor in QTL expression (Boer *et al.* 2007), is always partly unpredictable, so information about weather-dependent QTL can open up a whole new research field of balancing between yield optimization for a certain weather type and minimizing the risk of yield loss owing to bad weather predictions.

Because environments were treated in this study as an abstract concept, they could easily be defined as a TPE with different (simulated) weather conditions to obtain estimates of the Q × E effects under different weather conditions. As such, a certain homogeneity is present between environments, resembling the assumptions made in this paper, and it has been shown that an across-environment model with inclusion of M × E effects then results in better predictions of grain yield for genotypes already evaluated in a certain environment than when using a stratified (*i.e.*, within-environment) analysis (Lopez-Cruz *et al.* 2015). The same study concluded that for untested genotypes or uncorrelated environments, results were similar for both types of analyses. However, this could be due to the fact that the number of genotypic lines evaluated per environment was fewer than 1000, and as shown here, this might have resulted in bad estimates of the M × E effects, leading to an overestimation of the global marker effects compared to the M × E effects. Moreover, the number of test environments was small (maximum 5), so when more environments are introduced, chances are greater that there will always be an environment more correlated with another, and these specific correlations might be introduced in the analytical model via the covariance structure of the residual effects and the M × E effects. The comparison of such an analysis with a stratified analysis might shed more light on the advantages and disadvantages of a joint analysis of multiple environments that are only sparsely correlated and thus is an interesting path for future research.

The number of observations per environment used in this study might seem high for realistic scenarios, although some studies have shown that the number of observations of different genotypic lines should increase to improve prediction accuracy of complex traits (Schön *et al.* 2004; Hayes *et al.* 2009). This also emanates from our simulation study showing that for the global marker effects, 10,000 RILs needed to be evaluated four times to surpass a Pearson correlation of 0.8 for a trait determined by 3150 QTL. This number is a lot more than a population of 500, which was suggested by Boer *et al.* (2007) as a limit where beyond there would probably not be much to gain. For improving the prediction accuracy of the Q × E effects, it is shown that the number of RILs evaluated per environment should go well beyond 1000, which is common practice in realistic scenarios (Heslot *et al.* 2014; Lopez-Cruz *et al.* 2015). When environments are treated as location × year combinations, it will indeed be hard to obtain more than 1000 observations per environment. Nonetheless, similar environments can be clustered together (Moreau *et al.* 2004) or environments can be categorized by some environmental covariates (Jarquín *et al.* 2014) to increase the number of observations per environment, which might lead to better prediction accuracies of the phenotypic traits in certain types of environments.

All these results were based on analyses where the QTL were known and QTL genotypes were available for all plants. In reality, only some of the QTL are known, and therefore, genome-wide SNP sets are used for determining the genotype of the plants (Ganal *et al.* 2012). In a previous study (De Coninck *et al.* 2014), it was shown that the prediction of breeding values is a lot harder using SNP markers than when using QTL markers. This probably can be explained by the fact that most SNP markers are only in partial linkage disequilibrium with the QTL, so it is more difficult to predict their true effects. Future research should clarify whether the use of SNP markers instead of QTL genotypes would have a substantial effect on the predictive ability of our model, but it is expected that even more observations are required to detect SNPs that contribute more to a specific trait in a certain environment. Moreover, the statistical model used in this study was intentionally kept simplistic because the emphasis of this research was to enable the analysis of large-scale data. Use of such a simplistic model was made possible by simulating the data according to the model. In more realistic scenarios, this model might not obtain the same performance as the results of this study indicate, but the most important result of this study is the indication that including more genotypic lines in the analysis leads to better prediction accuracies of the genetic effects and their interactions with the environment. A next step to take in the further development of this analytic framework is to extend the underlying model so that it can take into account distinct genetic variances in different environments and heterogeneous genetic correlations across environments.

Although Needles, a first high-performance computing implementation of the approach described in this paper, seems promising for analyzing large-scale genomic data sets obtained from multienvironment trials, there are some side notes to take into consideration. First of all, Needles is optimized for analyzing data sets with a large number of observations (*n*) in the training data because this does not influence the memory complexity of the model used. It is thus not recommended for use when the number of observations is several orders of magnitude smaller than the number of effects to be estimated. Second, an important restriction is the fact that the sparse submatrix cannot be distributed across different nodes. Currently, no libraries are available that are able to compute a selected inverse of a distributed sparse matrix, which is also one of the reasons why Needles was developed. When the number of genetic markers and/or the number of environments becomes large, the memory requirements for storing the sparse matrix might surpass the available working memory on a single computing node. Although it was not handled explicitly in this study, there are two ways of dealing with such a situation. As a first step, one could cluster the environments in homogeneous subgroups to reduce the number of different environments (Moreau *et al.* 2004; Bernardo 2010). However, when there are a large number of genetic markers, the number of marker-by-environment interaction effects still may lead to an unduly large, sparse submatrix. Most of the time, not all genetic markers have a variable effect across environments, and thus markers can be preselected by screening simultaneously for a high main effect and for consistency of these effects across environments (Schulz-Streeck *et al.* 2011; Heslot *et al.* 2014). In this way, one can reduce the number of marker-by-environment interaction effects and thus also the dimensionality of the sparse submatrix . The number of genetic markers included then is limited only by the characteristics of the computing cluster because the dense part can be distributed over the working memories of all available computing nodes.

The development of Needles is a first step in taking plant genomic prediction to the next scale. As the cost of genotyping plants decreases and the size of real-life data sets increases, the efficient use of a supercomputing cluster becomes a necessity for enabling the analysis of these data sets. Based on simulated data, it was shown that when a huge amount of training data is available, QTL effects can be predicted accurately, and the QTL that have a variable effect in specific environments can be identified. As a further optimization, an important next step would be to apply distributed computing techniques on the factorization and selected inversion of the sparse submatrix. This could make it possible to use large-scale SNP genotypes and model marker-by-environment interaction effects for each SNP marker, making it unnecessary to preselect markers that could have variable effects across environments. We believe that by making the source code open to the public, the software can be extended to be used for more complex models, and thus this initial implementation is a stepping stone for future large-scale research in plant breeding.

## Acknowledgments

We would like to thank the Swiss National Supercomputing Center (CSCS) for providing additional computational resources. Furthermore, the reviewers of this paper provided some very interesting comments, which significantly improved the quality of this research paper. This study was funded by the Bioinformatics Institute Ghent, From Nucleotides to Networks (BIG N2N). The computational resources (Stevin Supercomputer Infrastructure) and services used in this work were provided by Ghent University, the Hercules Foundation, and the Flemish Government (Department EWI).

## Footnotes

*Communicating editor: F. van Eeuwijk*Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.179887/-/DC1.

- Received June 26, 2015.
- Accepted January 25, 2016.

- Copyright © 2016 by the Genetics Society of America