# Genome-Wide Regression and Prediction with the BGLR Statistical Package

^{*}Socio Economía Estadística e Informática, Colegio de Postgraduados 56230, México^{†}Department of Biostatistics, Section on Statistical Genetics, University of Alabama, Birmingham, Alabama 35294

- 1Corresponding author: Colegio de Postgraduados, Km. 36.5, Carretera Mexico, Montecillo Texcoco, Estado de México, México 56230. E-mail: perpdgo{at}colpos.mx

## Abstract

Many modern genomic data analyses require implementing regressions where the number of parameters (*p*, *e.g.*, the number of marker effects) exceeds sample size (*n*). Implementing these *large-p-with-small-n* regressions poses several statistical and computational challenges, some of which can be confronted using Bayesian methods. This approach allows integrating various parametric and nonparametric shrinkage and variable selection procedures in a unified and consistent manner. The BGLR R-package implements a large collection of Bayesian regression models, including parametric variable selection and shrinkage methods and semiparametric procedures (Bayesian reproducing kernel Hilbert spaces regressions, RKHS). The software was originally developed for genomic applications; however, the methods implemented are useful for many nongenomic applications as well. The response can be continuous (censored or not) or categorical (either binary or ordinal). The algorithm is based on a Gibbs sampler with scalar updates and the implementation takes advantage of efficient compiled C and Fortran routines. In this article we describe the methods implemented in BGLR, present examples of the use of the package, and discuss practical issues emerging in real-data analysis.

- Bayesian methods
- regression
- whole-genome regression
- whole-genome prediction
- genome-wide regression
- variable selection
- shrinkage
- semiparametric regression
- reproducing kernel Hilbert spaces regressions, RKHS
- R
- GenPred
- shared data resource

MANY modern statistical learning problems involve the analysis of high-dimensional data; this is particularly common in genetic studies where, for instance, phenotypes are regressed on large numbers of predictor variables (*e.g.*, SNPs) concurrently. Implementing these *large-p-with-small-n* regressions (where *n* denotes sample size and *p* represents the number of predictors) poses several statistical and computational challenges, including how to confront the so-called “curse of dimensionality” (Bellman 1961) as well as the complexity of a genetic mechanism that can involve various types and orders of interactions. Recent developments in shrinkage and variable selection estimation procedures have made the implementation of these *large-p-with-small-n* regressions feasible. Consequently, whole-genome-regression approaches (Meuwissen *et al.* 2001) are becoming increasingly popular for the analysis and prediction of complex traits in plants (*e.g.*, Crossa *et al.* 2010), animals (*e.g.*, Hayes *et al.* 2009, VanRaden *et al.* 2009), and humans (*e.g.*, Yang *et al.* 2010; Makowsky *et al.* 2011; Vazquez *et al.* 2012; de los Campos *et al.* 2013b).

In the past decade a large collection of parametric and nonparametric methods have been proposed and empirical evidence has demonstrated that no single approach performs best across data sets and traits. Indeed, the choice of the model depends on multiple factors such as the genetic architecture of the trait, marker density, sample size and the span of linkage disequilibrium (*e.g.*, de los Campos *et al.* 2013a). Although various software (BLR, Pérez *et al.* 2010; rrBLUP, Endelman 2011; synbreed, Wimmer *et al.* 2012; GEMMA, Zhou and Stephens 2012) exist, most statistical packages implement a few types of methods and the integration of these methods in a unified statistical and computational framework is needed. Motivated by this we have developed the R (R Core Team 2014) package BGLR. The package offers the user great latitude in combining different methods into models for data analysis and is available at CRAN (http://cran.at.r-project.org/web/packages/BGLR/index.html) and at the R-forge website (https://r-forge.r-project.org/projects/bglr/). In this article we discuss the statistical models implemented in BGLR (*Statistical Models*, *Algorithms*, *and Data*), present several examples based on real and simulated data (*Application Examples*), and provide benchmarks of computational time and memory usage for a linear model (*Benchmark of Parametric Models*). All the examples and benchmarks presented in this article are based on BGLR version 1.0.3. In addition to the scripts presented in the boxes included this article we provide supplementary code in File S1, and text version of all the scripts used to produce the results presented in the article in File S2.

## Statistical Models, Algorithms, and Data

The BGLR package supports models for continuous (censored or not) and categorical (binary or ordinal multinomial) traits. We first describe the models implemented in BGLR using a continuous, uncensored, response as example. Further details about censored and categorical outcomes are provided later on this article and in the supporting information, File S1.

For a continuous response (*y _{i}*;

*i*= 1, …,

*n*) the data equation is represented as

*y*=

_{i}*η*+

_{i}*ε*, where

_{i}*η*is a linear predictor (the expected value of

_{i}*y*given predictors) and

_{i}*ε*are independent normal model residuals with mean zero and variance . Here, the are user-defined weights (by default BGLR sets

_{i}*w*= 1 for all data-points) and is a residual variance parameter. In matrix notation we havewhere

_{i}**y**= {

*y*

_{1}, …,

*y*},

_{n}**η**= {

*η*

_{1}, …,

*η*}, and

_{n}**ε**= {

*ε*

_{1}, …,

*ε*}.

_{n}The linear predictor represents the conditional expectation function, and it is structured as (1)where *μ* is an intercept, **X*** _{j}* are design matrices for predictors,

*X**= {*

_{j}*x*},

_{ijk}**β**

*are vectors of effects associated to the columns of*

_{j}**X**

*, and*

_{j}**u**

*= {*

_{l}*u*

_{l}_{1}, …,

*u*} are vectors of random effects. The only element of the linear predictor included by default is the intercept. The other elements are user specified. Collecting the above assumptions, we have the following

_{ln}*conditional distribution of the data*:where

**θ**represents the collection of unknowns, including the intercept, regression coefficients, other random effects, and the residual variance.

### Prior density

The prior density is assumed to admit the following factorization:The *intercept* is assigned a flat prior and the *residual variance* is assigned a scaled-inverse χ^{2} density with degrees of freedom df* _{ε}*(> 0) and scale parameter

*S*(> 0). In the parameterization used in BGLR, the prior expectation of the scaled-inverse χ

_{ε}^{2}density χ

^{−2}(⋅|

*S*., df) is given by .

*Regression coefficients* {*β _{jk}*} can be assigned either uninformative (

*i.e.*, flat) or informative priors. Those coefficients assigned flat priors, the so-called “fixed” effects, are estimated based on information contained in the likelihood solely. For the coefficients assigned informative priors, the choice of the prior plays an important role in determining the type of shrinkage of estimates of effects induced. Figure 1 provides a graphical representation of the prior densities available in BGLR. The

*Gaussian prior*induces shrinkage of estimate similar to that of ridge regression (RR; Hoerl and Kennard 1970), where all effects are shrunk to a similar extent; we refer to this model as the Bayesian ridge regression (BRR). The

*scaled-t*and

*double exponential*(DE) densities have higher mass at zero and thicker tails than the normal density. These priors induces a type of shrinkage of estimates that is size-of-effect dependent (Gianola 2013). The scaled-t density is the prior used in model BayesA (Meuwissen

*et al.*2001), and the DE or Laplace prior is the one used in the BL (Park and Casella 2008). Finally, BGLR implements two

*finite mixture priors*: a mixture of a point of mass at zero and a Gaussian slab, a model referred to in the literature on genomic selection (GS) as BayesC (Habier

*et al.*2011), and a mixture of a point of mass at zero and a scaled-t slab, a model known in the GS literature as BayesB (Meuwissen

*et al.*2001). By assigning a nonnull prior probability for the marker effect to be equal to zero, the priors used in BayesB and BayesC have potential for inducing variable selection.

#### Hyperparameters:

Each of the prior densities described above are indexed by one or more parameters that control the type and extent of shrinkage/variable selection induced. We treat most of these regularization parameters as random; consequently a prior is assigned to these unknowns. Table 1 lists, for each of the prior densities implemented, the set of hyperparameters. Further details about how regularization parameters are inferred from the data are given in the supporting information, File S1.

#### Combining priors:

Different priors can be specified for each of the set of coefficients of the linear predictor, {**β**_{1}, …, **β*** _{J}*,

**u**

_{1},

**u**

_{2}, …,

**u**

_{L}}, giving the user great flexibility in building models for data analysis; an example illustrating how to combine different priors in a model is given in Example 2 of

*Application Examples*.

#### Gaussian processes:

The vectors of random effects **u*** _{l}* are assigned multivariate-normal priors with a mean equal to zero and covariance matrix , where

**K**

*is a (user-defined)*

_{l}*n*×

*n*-symmetric positive semidefinite matrix and is a variance parameter with prior density . These random effects can be used to deal with different types of problems, including but not limited to: (a) regressions on pedigree (Henderson 1975), (b) genomic BLUP (VanRaden 2008), and (c) nonparametric genomic regressions based on reproducing kernel Hilbert spaces (RKHS) methods (

*e.g.*, de los Campos

*et al.*2009a, 2010; Gianola

*et al.*2006). Examples about the inclusion of these Gaussian processes into models for data analysis are given in

*Application Examples*.

### Categorical Response

The argument response_type is used to indicate BGLR whether the response should be regarded as gaussian, the default value, or ordinal. For continuous traits the response vector should be coercible to numeric; for ordinal traits the response can take on *K* possible (ordered) values *y _{i}* ∈ {1, …,

*K*} (the case where

*K*= 2 corresponds to the binary outcome), and the response vector should be coercible to a factor. For categorical traits we use the probit link; here, the probability of each of the categories is linked to the linear predictor according to the following link functionwhere Φ(⋅) is the standard normal cumulative distribution function,

*η*is the linear predictor, specified as described above, and

_{i}*γ*are threshold parameters, with

_{k}*γ*

_{0}= −∞,

*γ*≥

_{k}*γ*

_{k}_{−1},

*γ*= ∞. The probit link is implemented using data augmentation (Tanner and Wong 1987); this is done by introducing a latent variable (so-called liability)

_{K}*l*=

_{i}*η*+

_{i}*ε*and a measurement model

_{i}*y*=

_{i}*k*if

*γ*

_{k}_{−1}≤

*l*≤

_{i}*γ*. For identification purpouses, the residual variance is set equal to one. At each iteration of the Gibbs sampler the unobserved liability scores are sampled from truncate normal densities; once the unobserved liability has been sampled, the Gibbs sampler proceeds as if

_{k}*l*were observed (see Albert and Chib 1993, for further details).

_{i}### Missing Values

The response vector can contain missing values. Internally, at each iteration of the Gibbs sampler, missing values are sampled from the corresponding fully conditional density. Missing values in predictors are not allowed.

### Censored Data

Censored data in BGLR is described using a triplet {*a _{i}*,

*y*,

_{i}*b*}, the elements of which must satisfy

_{i}*a*<

_{i}*y*<

_{i}*b*. Here,

_{i}*y*is the observed response (

_{i}*e.g.*, a time-to-event variable, observable only in uncensored data points, otherwise missing, NA) and

*a*and

_{i}*b*define lower and upper bounds for the response, respectively. Table 2 gives the configuration of the triplet for the different types of data points. The triplets are provided to BGLR in the form of three vectors (

_{i}**y**,

**a**,

**b**). The vectors

**a**and

**b**have NULL as default value; therefore, if only

**y**is provided this is interpreted as a continuous trait without censoring. If

**a**and

**b**are provided together with

**y**, data are treated as censored. Censoring is dealt with as a missing data problem; at each iteration of the MCMC algorithm the missing values of

*y*, present due to censoring, are sampled from truncated normal densities that satisfy

_{i}*a*<

_{i}*y*<

_{i}*b*. An example of how to fit models for censored data are given in the supporting information in File S1 (Section E).

_{i}### Algorithms

The R-package BGLR draws samples from the posterior density using a Gibbs sampler (Geman and Geman 1984; Casella and George 1992) with scalar updating. For computational convenience the scaled-t and DE densities are represented as infinite mixtures of scaled normal densities (Andrews and Mallows 1974), and the finite-mixture priors are implemented using latent random Bernoulli variables linking effects to components of the mixtures. The computationally demanding steps are performed using compiled C and Fortran code.

### User Interface

The R-package BGLR has a user interface similar to that of BLR (Pérez *et al.* 2010); however, we have modified key elements of the interface, and the internal implementation, to provide the user with greater flexibility for model building. All the arguments of the BGLR function have default values, except the vector of phenotypes. Therefore, the simplest call to the BGLR program is as follows.

When the call fm<-BGLR(y = y) is made, BGLR fits an intercept model, a total of 1500 cycles of a Gibbs sampler (the default value for the number of iterations; see Box 2) are run, and the first 500 samples are discarded, this is the default value for burn-in (see Box 2). As the Gibbs sampler collects samples, some are saved to the hard drive (only the most recent samples are retained in memory) in files with extension ∗.dat, and the running means required for computing estimates of the posterior means and of the posterior standard deviations are updated; by default, a thinning of 5 is used but this can be modified by the user using the thin argument of BGLR. Once the iteration process finishes, BGLR returns a list with estimates and the arguments used in the call.

### Inputs

Box 2, displays a list of the main arguments of the BGLR function, a short description follows:

y, a, b (y, coercible to either numeric or factor, a and b of type numeric) and response_type (character) are used to define the response.

ETA (of type list) is used to specify the linear predictor. By default it is set to NULL, in which case only the intercept is included. Further details about the specification of this argument are given below.

nIter, burnIn, and thin (all of type integer) control the number of iterations of the sampler, the number of samples discarded, and the thinning used to compute posterior means, respectively.

saveAt (character) can be used to indicate BGLR where to store the samples and to provide a pre-fix to be appended to the names of the file where samples are stored. By default samples are saved in the current working directory and no pre-fix is added to the file names.

S0, df0, R2 (numeric) define the prior assigned to the residual variance, df0 defines the degrees of freedom, and S0 defines the scale. If the scale is NULL, its value is chosen so that the prior mode of the residual variance matches the variance of phenotypes times 1-R2 (see supporting information, Section A of File S1, for further details).

### Return

The function BGLR returns a list with estimated posterior means and estimated posterior standard deviations and the arguments used to fit the model. Box 3, shows the structure of the object returned after fitting the intercept model of Box 1. The first element of the list (y) is the response vector used in the call to BGLR, $whichNa gives the position of the entries in y that were missing, these two elements are then followed by several entries describing the call (omitted in Box 3), and this is followed by estimated posterior means and estimated posterior standard deviations of the linear predictor ($yHat and $SD.yHat), the intercept ($mu and $SD.mu), and the residual variance ($varE and $SD.varE). Finally $fit gives a list with *DIC* and *DIC*-related statistics (Spiegelhalter *et al.* 2002).

### Output files

Box 4, shows an example of the files generated after executing the commands given in Box 1. In this case samples of the intercept (mu.dat) and of the residual variance (varE.dat) were stored. These samples can be used to assess convergence and to estimate Monte Carlo error. The R-package coda (Plummer *et al.* 2006) provides several useful functions for the analysis of samples used in Monte Carlo algorithms.

### Data Sets

The BGLR package comes with two genomic data sets involving phenotypes, markers, pedigree, and other covariates.

#### Mice data set:

This data set is from the Wellcome Trust (http://gscan.well.ox.ac.uk) and has been used for detection of quantitative trait loci (QTL) by Valdar *et al.* (2006a,b) and for whole-genome regression by Legarra *et al.* (2008), de los Campos *et al.* (2009b), and Okut *et al.* (2011). The data set consists of genotypes and phenotypes of 1814 mice. Several phenotypes are available in the data frame mice.pheno. Each mouse was genotyped at 10,346 SNPs. We removed SNPs with minor allele frequency (MAF) <0.05 and imputed missing genotypes with expected values computed with estimates of allele frequencies derived from the same data. In addition to this, an additive relationship matrix (mice.A) is provided; this was computed using the R-package pedigreemm (Bates and Vazquez 2009; Vazquez *et al.* 2010).

#### Wheat data set:

This data set is from CIMMYT global wheat breeding program and comprises phenotypic, genotypic, and pedigree information of 599 wheat lines. The data set was made publicly available by Crossa *et al.* (2010). Lines were evaluated for grain yield (each entry corresponds to an average of two plot records) at four different environments; phenotypes (wheat.Y) were centered and standardized to a unit variance within environment. Each of the lines were genotyped for 1279 diversity array technology (DArT) markers. At each marker two homozygous genotypes were possible and these were coded as 0/1. Marker genotypes are given in the object wheat.X. Finally a matrix wheat. A provides the pedigree relationships between lines computed from the pedigree (see Crossa *et al.* 2010 for further details). Box 5, illustrates how to load the wheat and mice data sets.

## Application Examples

In this section we illustrate the use of BGLR with various application examples, including comparison of shrinkage and variable selection methods (Example 1), how to fit models that account for genetic and nongenetic effects such as covariates or effects of the experimental design (Example 2), models for simultaneous regression on markers and pedigree (Example 3), reproducing kernel Hilbert spaces regression (Examples 4 and 5), and the assessment of prediction accuracy (Examples 6 and 7). The scripts used to fit the models discussed in each of these examples are presented in the text; additional scripts with code for post hoc analysis (*e.g.*, plots) are provided in the supporting information (File S1 and File S2).

### Example 1: Comparison of shrinkage and variable selection methods

In this example we show how to fit models that induce variable selection and others that shrink estimates toward zero. In the example, we use simulated data generated using the marker genotypes for the mice data set. We assume a very simple simulation setting with only 10 QTL. Phenotypes were simulated under the standard additive model,where *ε _{i}* ∼

*N*(0, 1 −

*h*

^{2}),

*h*

^{2}= 0.5. Marker effects were sampled from the mixture model:Box 6, shows the R code for simulating the phenotypes. The simulation settings can be changed using parameters that control sample size (

*n*), the number of markers used (

*p*), the number of QTL (

*n*QTL), and trait heritability (h2).

Box 7, shows code that can be used to fit a Bayesian ridge regression, BayesA, and BayesB. Once the models are fitted, estimates of marker effects, predictions, estimates of the residual variance, and measures of goodness of fit and model complexity can be extracted from the object returned by BGLR. Box S1 of File S1, provides the code used to extract the results presented next.

Table 3 provides the estimated residual variance, the deviance information criterion (*DIC*) and the effective number of parameters (*pD*) (Spiegelhalter *et al.* 2002). The estimated residual variances were all closed to the simulated value (0.5). According to *pD* (288.9, 200.2, 198.3 for the BRR, BayesA, and BayesB, respectively) the BRR was the most complex model, and *DIC* (“smaller is better”) favored BayesA and BayesB over the BRR, clearly. This was expected given the simple trait architecture simulated. Model BayesA and BayesB gave very similar estimates and predictions; this happened because in BayesB the estimated probability for the markers to have nonnull effects was very high (>0.9); as this proportion approaches one, BayesB converges to BayesA. The correlation between the true and simulated signals were high in all cases (0.86 for the BRR, 0.947 for BayesA, and 0.955 for BayesB) but favored BayesA and BayesB over the BRR, clearly. We run the simulation using 30 QTL and removing the QTL genotypes in the data analysis and the ranking of the models, based on *DIC* and on the correlation between predicted and simulated signal was similar to the one reported above.

Figure 2 displays the absolute values of estimates of marker effects for models BayesA (red) and the BRR (black). The estimates of BayesB (not shown) are similar to those of BayesA. The vertical lines and blue dots give the position and absolute value of the simulated effects. The BRR gives a profile of estimated of effects where all markers had tiny effects; BayesA and BayesB give a very different profiles of effects: most of the simulated QTL were detected (except the first one), markers having no effects had very small estimated effects, and QTL had sizable estimated effects. This simulation illustrates how in ideal circumstances the choice of the prior density assigned to marker effects can make a big difference in terms of estimates of effects. However, the difference between models is expected to be much smaller under more complex genetic architectures and, perhaps more importantly, when the marker panel does not contain markers in tight LD with QTL, *e.g.*, Wimmer *et al.* (2013). The example also illustrates a very important concept: in high-dimensional regressions it is possible to have similar predictions with very different estimates of effects. Indeed, in the example presented above, although the correlation of effects estimated by BRR and BayesB was low (0.226), the correlation between predictions (yHat) derived from each of the models was relatively high (0.946).

### Example 2: Fitting models for genetic and nongenetic factors

In the next example we illustrate how to fit models with various sets of predictors using the mice data set. Valdar *et al.* (2006b) pointed out that the cage where mice were housed had an important effect in the physiological covariates and Legarra *et al.* (2008) and de los Campos *et al.* (2009b) used models that accounted for sex, litter size, cage, familial relationships, and markers. Therefore, one possible linear model that we can fit to a continuous phenotype is where **y** is the phenotype vector (body mass index, in the example), *μ* is an intercept, **X**_{1} is a design matrix for the effects of sex and litter size, **β**_{1} is the corresponding vector of effects, **X**_{2} is the design matrix for the effects of cage, **β**_{2} is a vector of cage effects, **X**_{3} is the matrix with marker genotypes, and **β**_{3} is the corresponding vector of marker effects. We treat **β**_{1} as “FIXED” and the other two vectors of effects as random; **β**_{2} is treated as Gaussian and marker effects, **β**_{3}, are assigned IID double-exponential priors, which corresponds to the prior used in the Bayesian LASSO model.

#### Fitting the model:

The script needed to fit the model above described is given in Box 8. The first block of code, #1#, loads the data. In the second block of code we set the linear predictor. This is specified using a two-level list. Each of the elements of the inner list is used to specify one element of the linear predictor; these elements are specified by providing a formula or a design matrix and a prior (model argument). When the formula is used, the design matrix is created internally using the model.matrix() function of R. Additional arguments in the specification of the linear predictors are optional (see Table S1, File S1 for the arguments used to specify hyperparameters). Finally in the third block of code we fit the model by calling the BGLR() function. When BGLR begins to run, a message warns the user that hyperparameters were not provided and that consequently they were set using built-in rules (see Table S1, File S1, for further details).

#### Extracting results:

Once the model was fitted one can extract from the list returned by BGLR the estimated posterior means and the estimated posterior standard deviations as well as measures of model goodness of fit and model complexity. Also, as BGLR runs, it saves samples of some of the parameters; these samples can be brought into the R-environment for posterior analysis. Box S2 of File S1, illustrates how to extract estimates from the models fitted in Box 8. Some of the results are given in Figure 3. In the example, phenotypes were standardized to a unit sample variance and the estimated residual variance was 0.53, suggesting that the model explained ∼47% of the phenotypic variance. Figure 3, top left, gives the absolute value of estimated effects and Figure 3, top right, gives a scatter plot of phenotypes *vs.* predicted genomic values; this prediction does not include differences due to sex, litter size, or cage. Figure 3, bottom, gives trace plots of the residual variance (left) and of the regularization parameter of the Bayesian LASSO (right). The residual variance had a very good mixing; however, the mixing of the regularization parameter was not as good. In general, with large numbers of markers long chains are needed to infer regularization parameters precisely.

### Example 3: Fitting a pedigree**+**markers BLUP model using BGLR

In the following example we illustrate how to incorporate in the model Gaussian random effects with user-defined covariance structures. These types of random effects appear both in pedigree and genomic models. The example presented here uses the wheat data set included with the BGLR package. In the example of Box 9 we include two random effects, one representing a regression on pedigree, , where **A** is a pedigree-derived numerator relationship matrix, and one representing a linear regression on markers, where **G** is a marker-derived genomic relationship matrix. For ease of interpretation of estimates of variance components we standardized both matrices to an average diagonal value of (approximately) one. The implementation of Gaussian processes in BGLR exploits the equivalence between these processes and random regressions on principal components (de los Campos *et al.* 2010; Janss *et al.* 2012). The user can implement a RKHS regression either by providing covariance matrix (**K**) or its eigenvalue decomposition (see the example in Box 9). When the covariance matrix is provided, the eigenvalue decomposition is computed internally. Box S3 of File S1, shows how to extract estimates, predictions, and samples from the fitted model. The estimated residual variance (posterior standard deviation) was 0.43 (0.044), and the estimates of the variance components associated to the pedigree, , and markers, , were 0.24 (0.07) and 0.42 (0.09), respectively. The code in Box S3 of File S1, shows how to obtain samples of heritability from the samples collected before each of the variance components. The estimated posterior mean of the ratio of the genetic variance relative to the total variance was 0.6; therefore, we conclude that ∼60% of the phenotypic variance can be explained by genetic factors. In this example, the pedigree explained approximately one-third of the total genetic variance and markers explained the other two-thirds. The samples from the posterior distribution of and had a posterior correlation of −0.184; this happens because both **A** and **G** are, to some extent, redundant.

#### Reproducing kernel Hilbert spaces regressions:

Reproducing Kernel Hilbert Spaces Regressions (RKHS) have been used for regression (*e.g.*, smoothing spline, Wahba 1990), spatial smoothing (*e.g.*, kriging, Cressie 1988), and classification problems (*e.g.*, support vector machine, Vapnik 1998). Gianola *et al.* (2006) proposed using this approach for genomic prediction and since then several methodological and applied articles have been published elsewhere (Gianola and de los Campos 2008; de los Campos *et al.* 2009a, 2010). In this section we illustrate how to implement RKHS using single- (Example 4) and multi- (Example 5) kernel methods.

### Example 4: Single-kernel models

In RKHS the regression function is a linear combination of the basis function provided by the reproducing kernel (RK); therefore, the choice of the RK constitutes one of the central elements of model specification. The RK is a function that maps from pairs of points in input space (*i.e.*, pairs of individuals) into the real line and must be positive semidefinite. For instance, if the information set is given by vectors of marker genotypes the RK, maps from pairs of vectors of genotypes, onto the real line with a map that must satisfy , for any nonnull sequence of coefficients *α _{i}*. Following de los Campos

*et al.*(2009a) the Bayesian RKHS regression can be represented as (2)where is an (

*n*×

*n*)-matrix whose entries are the evaluations of the RK at pairs of points in input space. Note that the structure of the model described by (2) is that of the standard animal model (Quaas and Pollak 1980) with the pedigree-derived numerator relationship matrix (

**A**) replaced by the kernel matrix (

**K**). Box 10, features an example using a Gaussian kernel evaluated in the (average) squared-Euclidean distance between genotypes, that is: In the example genotypes were centered and standardized, but this is not strictly needed. The bandwidth parameter,

*h*, controls how fast the covariance function drops as the distance between pairs of vector genotypes increases. This parameter plays an important role in inferences and predictions. In this example we have arbitrarily chosen the bandwidth parameter to be equal to 0.25; further discussion about this parameter is given in the next example. With this choice of RK, the estimated residual variance was 0.41, which suggests that the RKHS model fitted in Box 10, fits the data slightly better than the pedigree + markers models of Box 9. Box S4 of File S1, provides supplementary code for the model fitted in Box 10.

### Example 5: Multikernel methods

The bandwidth parameter of the Gaussian kernel can be chosen using either cross-validation (CV) or Bayesian methods. From a Bayesian perspective, one possibility is to treat *h* as random; however, this is computationally demanding because the RK needs to be recomputed any time *h* is updated. To overcome this problem de los Campos *et al.* (2010) proposed using a multikernel approach (named kernel averaging, KA) consisting of: (a) defining a sequence of kernels based on a set of values of *h*, and (b) fitting a multikernel model with as many random effects as kernels in the sequence. The model has the form (3)where **K*** _{l}* is the RK evaluated at the

*l*th value of the bandwidth parameter in the sequence {

*h*

_{1}, …,

*h*}. It can be shown (

_{L}*e.g.*, de los Campos

*et al.*2010) that if variance components are known, the model of expression (3) is equivalent to a model with a single random effect whose distribution is where is a weighted average of all the RK used in (3) with weights proportional to the corresponding variance components (hence the name kernel averaging). Performing a grid search for

*h*or implementing a multikernel model requires defining a reasonable range for

*h*. If the value of

*h*is too small, the entries of the resulting Gaussian kernel will approach a matrix full of ones; such a kernel will be redundant with the intercept, which is included as fixed effect. On the other hand, if

*h*is too large, the off-diagonal values of the kernel matrix will approach zero, leading to a random effect that is confounded with the error term. Therefore, in choosing values for

*h*both extremes should be avoided. Although there is no general rule to define the values of the bandwidth parameter, one possibility is to set

*h*to values where

*M*is the median squared Euclidean distance between lines (computed using off-diagonals only). With this choice, the median off-diagonals of

**K**will be , exp(−1) and exp(−5) for

*h*equal to , and , respectively. We use this approach in Box 11, to fit a multikernel model. The resulting entries of the RK are displayed in Figure 4. Box S5 of File S1, provides supplementary code that can be used to retrieve estimates from the model fitted in Box 11. The estimated residual variance was close to 0.3 and the estimated variance components for each of the kernels fitted were 0.62, 0.48, and 0.24 for

*h*equal to 0.098, 0.490 and 2.450, respectively. The script provided in Box S5 of File S1, produces trace plots of variance components. The residual variance has a reasonably good mixing. The sum of the variances of the three kernels also has a reasonably good mixing; however, due to confounding between the kernels, individual variance components show a much poorer mixing.

#### Assessment of prediction accuracy:

In the previous examples we illustrated how to fit different types of models to training data; in the following section we consider two ways of assessing prediction accuracy: a single training–testing partition and multiple training–testing partitions.

### Example 6: Assessment of prediction accuracy using a single training-testing partition

A simple way of assessing prediction accuracy consists of partitioning the data set into two disjoint sets: one used for model training (TRN) and one used for testing (TST). Box 12, shows code that fits a G-BLUP model in a TRN–TST setting using the wheat data set. The code randomly assigns 100 individuals to the TST set. The variable tst is a vector that indicates which data points belong to the TST data set; for these entries we introduce missing values in the phenotypic vector (see Box 12). Once the model is fitted, predictions for individuals in the TST set can be obtained typing fm$yHat[tst] in the R command line; Box S6 of File S1 gives supplementary code to the example in Box 12, including the code used to produce Figure 5 that displays observed *vs.* predicted phenotypes for individuals in TRN (black dots) and TST (red dots) sets. The correlation between observed phenotypes and predictions was 0.83 in the TRN set and 0.60 in the TST set, and the regression of phenotypes on predictions was 1.49 and 1.24 for the TRN and TST set, respectively.

### Example 7: Model comparison based on multiple training–testing partitions

The example presented in the previous section is based on a single training–testing partition. A cross-validation is simply a generalization of the single TRN–TST evaluation. For a *K*-fold cross-validation there are *K* TRN–TST partitions; in each fold, the individuals assigned to that particular fold are used for TST and the remaining individuals are used for TRN. Another possibility is to generate multiple TRN–TST partitions with random assignment of subjects to either TRN or TST. Each partition yields a point estimate of prediction accuracy (*e.g.*, correlation between predictions and phenotypes). The variability of the point estimate across partitions (replicates) reflects uncertainty due to sampling of TRN and TST sets, and a precise estimate of prediction accuracy can be obtained by averaging the estimates of accuracy obtained in each partition.

Box 13, gives an example of an evaluation based on 100 TRN–TST partitions. In each partition two models (*P*, pedigree and PM, pedigree + markers) were fitted and used to predict yield in the TST data set. This yielded 100 estimates of the prediction correlation for each of the models fitted. These estimates should be regarded as paired samples because both share a common feature: the TRN–TST partition. Several statistics can be computed to compare the two models fitted, and a natural approach for testing the null hypotheses *H*_{0}: P and PM have the same prediction accuracy *vs. H _{A}*: the prediction accuracy of models P and PM are different is to conduct a paired-

*t*-test based on the difference of the correlation coefficients. Figure 6 gives the estimated correlations for the pedigree + markers model (PM, vertical axis)

*vs.*the pedigree-only model (P, horizontal axis) by environment. The great majority of the points lay above the 45° line indicating that in most partitions the PM model had higher prediction accuracy than the P-only model. The paired-

*t*-test had

*P*-values <0.001 in all environments indicating strong evidence against the null hypothesis (

*H*

_{0}: P and PM have the same prediction accuracy). The code used to generate the plot in Figure 6 and to carry out the

*t*-test is given in Box S7, File S1.

## Benchmark of Parametric Models

We carried out a benchmark evaluation by fitting a BRR to data sets involving three different sample size (*n* = 1K, 2K and 5K, K = 1000) and four different marker densities (*P* = 5K, 10K, 50K, and 100K). The evaluation was carried out in an Intel Xeon processor, 2 GHz, with R executed in a single thread and linked against OpenBLAS. Computing time (Figure 7) scales approximately proportional to the product of the number of records and the number of effects. For the most demanding scenario (*n* = 5K, *P* = 100K) it took ∼11 min to complete 1000 iterations of the Gibbs sampler. Using the R functions Rprof and summaryRprof we performed an analysis of memory usage. As expected the amount of memory used scaled linearly with marker density with a maximum memory usage of ∼6, 3, 0.6, and 0.3 Gb of RAM for *n* =5K and *P* = 100K, 50K, 10K, and 5K, respectively. Because R holds all the objects in virtual memory and the size of the objects depends on the underlying operating system, as a general rule, for an R session using more than 4 Gb of RAM, a 64-bit build of R would be needed.

## Concluding Remarks

In BGLR we implemented in a unified Bayesian framework several methods commonly used in genome-enabled prediction, including various parametric models and Gaussian processes that can be used for parametric (*e.g.*, pedigree regressions or G-BLUP) or semiparametric regressions (*e.g.*, genomic regressions). The package supports continuous (censored or not) as well as binary and ordinal traits. The software interface gives the user great latitude in combining different modeling approaches for data analysis. In the algorithm implemented in BGLR, operations that can be vectorized are performed using built-in R-functions; however, most of the computing intensive tasks are performed using compiled routines written in C and Fortran languages. The package is also able to take advantage of multithread BLAS implementations in both Windows and UNIX-like systems. Finally, together with the package we have included two data sets and ancillary functions that can be used to read into the R-environment genotype files written in ped and bed formats (Purcell *et al.* 2007).

The Gibbs sampler implemented is computationally very intensive and our current implementation stores genotypes in memory; therefore, despite the effort made to make BGLR computationally efficient, performing regressions with hundreds of thousands of markers requires access to large amounts of RAM and the computational time can be considerable. Certainly, algorithms other than MCMC can be quite faster; however, the MCMC framework adopted offers great flexibility, as illustrated by the examples presented here.

Although some of the computationally intensive algorithms implemented in BGLR can benefit from multithread computing; there is room to further improve the computational performance of the software by making more intensive use of parallel computing. In future releases we plan to exploit parallel computing to a much greater extent. Also, we are currently working on modifying the software in ways that avoid loading genotypes in in memory. Future releases will be made at the R-Forge website (https://r-forge.r-project.org/R/?group_id=1525) first and, after considerable testing, at CRAN.

## Acknowledgments

In the development of BGLR Paulino Pérez and Gustavo de los Campos received financial support provided by National Institutes of Health Grants R01GM099992 and R01GM101219. The authors thank Professor Daniel Gianola for comments made in a draft of the manuscript.

## Footnotes

Available freely online through the author-supported open access option.

Supporting information is available online at http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.114.164442/-/DC1.

*Communicating editor: S. Sen*

- Received March 22, 2014.
- Accepted June 26, 2014.

- Copyright © 2014 by the Genetics Society of America

Available freely online through the author-supported open access option.