## Abstract

The availability of genomewide dense markers brings opportunities and challenges to breeding programs. An important question concerns the ways in which dense markers and pedigrees, together with phenotypic records, should be used to arrive at predictions of genetic values for complex traits. If a large number of markers are included in a regression model, marker-specific shrinkage of regression coefficients may be needed. For this reason, the Bayesian least absolute shrinkage and selection operator (LASSO) (BL) appears to be an interesting approach for fitting marker effects in a regression model. This article adapts the BL to arrive at a regression model where markers, pedigrees, and covariates other than markers are considered jointly. Connections between BL and other marker-based regression models are discussed, and the sensitivity of BL with respect to the choice of prior distributions assigned to key parameters is evaluated using simulation. The proposed model was fitted to two data sets from wheat and mouse populations, and evaluated using cross-validation methods. Results indicate that inclusion of markers in the regression further improved the predictive ability of models. An R program that implements the proposed model is freely available.

GENOMEWIDE dense marker maps are now available for many plant and animal species (*e.g.*, Wang *et al*. 2005). An important challenge is how this information should be incorporated into statistical models for prediction of genetic values in animal and plant breeding programs or for prediction of disease risk.

A standard quantitative genetic model assumes that genetic and environmental effects act additively to produce phenotypic outcomes according to the rule $y_i = g_i + \varepsilon_i$, where $y_i$ is the phenotype of subject *i*, $g_i$ is its genetic value, and $\varepsilon_i$ is an environmental deviate. The information set now available for predicting genetic values may include, in addition to phenotypic records, a pedigree, molecular markers, or both.

Several methodologies have been proposed for incorporating dense marker data into regression models. A distinction can be made between methods that explicitly regress phenotypic records on markers via a regression function, $g_i = \mathbf{x}_i'\boldsymbol{\beta}$, where $\mathbf{x}_i$ is a vector of marker covariates and $\boldsymbol{\beta}$ is a vector of regression coefficients (*e.g.*, marker effects), and those that view genetic values as a function of the subject and use marker information to build a (co)variance structure between subjects. The first group of methods includes standard Bayesian regression (BR) with random coefficients, *i.e.*, a Bayesian model where regression coefficients are assigned the same Gaussian prior, and other shrinkage methods such as Bayes A or Bayes B of Meuwissen *et al*. (2001), and specifications described in Gianola *et al*. (2003). The second type of approach was suggested by Gianola *et al*. (2006) and Gianola and van Kaam (2008), who proposed using reproducing kernel Hilbert spaces regression (RKHS), with the information set consisting of SNP (single-nucleotide polymorphism) genotypes, possibly supplemented by genealogies. As discussed in De los Campos *et al*. (2009), in this approach marker information is used to create a prior (co)variance structure between genomic values, $\mathrm{Cov}\!\left(g_i, g_{i'}\right) = \sigma_g^2\,K\!\left(\mathbf{x}_i, \mathbf{x}_{i'}\right)$, where $K(\cdot,\cdot)$ is some positive-definite function and $\sigma_g^2$ is a parameter to be estimated from the data.

The two types of approaches lead to predictions of genomic values for quantitative traits. An advantage of explicitly regressing phenotypes on marker covariates is that the model can produce information about genomic regions that may affect the trait of interest. However, a main difficulty is that the number of regression coefficients (*p*) is typically large, often much larger than the number of records (*n*), *i.e.*, *p* ≫ *n*. Therefore, a crucial aspect is how this methodology copes with the curse of dimensionality and with collinearity.

With whole-genome scans, many markers are likely to be located in regions that are not involved in the determination of traits of interest. On the other hand, some markers may be in linkage disequilibrium with some QTL or in regions harboring genes involved in the infinitesimal component of the trait. This suggests that differential shrinkage of marker effects should be a feature of the model, as noted by Meuwissen *et al*. (2001). Tibshirani (1996) proposed a regression method (least absolute shrinkage and selection operator, LASSO) that combines the good features of subset selection (*i.e.*, variable selection) with the shrinkage produced by BR. Recently, Park and Casella (2008) presented a Bayesian version of the LASSO method (Bayesian LASSO, BL) and suggested a Gibbs sampler for its implementation. Alternatives to the Gibbs sampler of Park and Casella are discussed in Hans (2008). While the BL described by Park and Casella is appealing for the reasons mentioned above, it does not accommodate pedigree information or regression on (co)variates other than the markers for which a different shrinkage approach may be desired.

Several authors have considered combining pedigree and marker data into a single model in the context of QTL analysis (*e.g.*, Fernando and Grossman 1989; Bink *et al.* 2002, 2008). Here, in this spirit, the BL is modified and extended to accommodate pedigree information as well as covariates other than markers.

The main objectives of this article are to (1) discuss the use of BL and related methods in the context of linear regression of quantitative traits on molecular markers, (2) evaluate the sensitivity of BL with respect to the choice of the prior for the regularization parameter, (3) extend the BL so that pedigrees or regressions on covariates other than markers can also be included in the model, and (4) evaluate the methodology using data from a self-pollinated wheat population and an outcross mouse population. The article is organized as follows: the first section, *The Bayesian LASSO*, introduces the BL as presented in Park and Casella (2008) and discusses connections between BL and closely related methods, such as those proposed by Meuwissen *et al*. (2001) or variants proposed by other authors. *Monte Carlo Study* evaluates the sensitivity of BL with respect to the choice of prior for the regularization parameter. *Bayesian Regression Coupled with LASSO* presents an extension of BL, treating effects of different types of regressors with different priors. In *Data Analysis*, the proposed methodology is applied to two data sets representing a collection of wheat lines and a population of mice. Concluding remarks are provided in the final section of the article. An R function (R Development Core Team 2008) that fits the model, together with the data sets used in this article, is made available (see supporting information, File S1 and File S2).

## THE BAYESIAN LASSO

Tibshirani (1996) proposed using the sum of the absolute values of the regression coefficients (the *L*_{1} norm) as a penalty in regression models, to simultaneously produce variable selection and shrinkage of coefficients; the proposed methodology was termed LASSO. In LASSO, estimates are obtained by solving the constrained optimization problem

$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min} \sum_{i=1}^{n}\left(y_i - \mathbf{x}_i'\boldsymbol{\beta}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\left|\beta_j\right| \le t, \tag{1}$$

where $\mathbf{x}_i$ is a vector of covariates, $\boldsymbol{\beta}$ is the corresponding vector of regression coefficients, and *t* is an arbitrary positive constant. Above, it is assumed that data are centered, *i.e.*, $\mathbf{y} = \{y_i\}$ has zero mean. Optimization problem (1) is equivalent to

$$\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min}\left\{\sum_{i=1}^{n}\left(y_i - \mathbf{x}_i'\boldsymbol{\beta}\right)^2 + \lambda\sum_{j=1}^{p}\left|\beta_j\right|\right\} \tag{2}$$

(*e.g.*, Tibshirani 1996), for some value of the smoothing parameter $\lambda$. It is known that the solution to (2) may involve zeroing out some elements of $\boldsymbol{\beta}$, and there are many ways of illustrating why this may be so. One manner is to examine the shape of the feasible set in (1) (*e.g.*, Tibshirani 1996); another way is to consider the Bayesian interpretation of the LASSO. From (2), it follows that the solution can be viewed as the posterior mode in a Bayesian model with Gaussian likelihood, $\prod_{i=1}^{n} N\!\left(y_i \mid \mathbf{x}_i'\boldsymbol{\beta}, \sigma_\varepsilon^2\right)$, and a prior on $\boldsymbol{\beta}$ that is the product of *p* independent, zero-mean, double-exponential (DE) densities; that is, $p(\boldsymbol{\beta}\mid\lambda) = \prod_{j=1}^{p}\tfrac{\lambda}{2}\exp\!\left\{-\lambda\left|\beta_j\right|\right\}$. In contrast, BR is obtained by assuming the same likelihood and a prior on $\boldsymbol{\beta}$ that is the product of *p* independent normal densities; that is, $p(\boldsymbol{\beta}\mid\sigma_\beta^2) = \prod_{j=1}^{p} N\!\left(\beta_j \mid 0, \sigma_\beta^2\right)$, where $\sigma_\beta^2$ is a variance parameter common to all regression coefficients. The difference between these two priors is illustrated in Figure 1: the DE density places more mass at zero and has thicker tails than the Gaussian density. From this perspective, relative to BR, LASSO produces stronger shrinkage of regression coefficients that are close to zero and less shrinkage of those with large absolute values.
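
As a quick numerical illustration of the contrast drawn above, the short R snippet below compares a DE density with a Gaussian density of equal variance, at zero and in a tail; the particular rate and evaluation points are arbitrary choices made only for illustration.

```r
# Compare a double-exponential (Laplace) density with a Gaussian of equal
# variance; illustrative values only.
ddexp <- function(x, lambda) 0.5 * lambda * exp(-lambda * abs(x))  # DE density

lambda <- sqrt(2)   # a DE with rate sqrt(2) has variance 2 / lambda^2 = 1
round(c(DE = ddexp(0, lambda), Gaussian = dnorm(0)), 3)   # density at zero
round(c(DE = ddexp(3, lambda), Gaussian = dnorm(3)), 4)   # density in the tail
```

Both comparisons go in the direction described above: the DE places more density at zero and in the tails than a Gaussian with the same variance.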

The parameter $\lambda$, sometimes referred to as a regularization parameter, plays a central role: as it approaches zero, the solution to (2) tends to ordinary least squares, while large values of $\lambda$ penalize the *L*_{1} norm of $\boldsymbol{\beta}$, $\sum_{j}\left|\beta_j\right|$, heavily. In the Bayesian view of the LASSO, $\lambda$ controls the prior on $\boldsymbol{\beta}$, with large values of this parameter associated with more informative (sharper) priors.

By construction, the non-Bayesian LASSO solution admits at most *n −* 1 nonzero regression coefficients (*e.g.*, Park and Casella 2008). This is not desirable in models with dense marker-based regressions since, *a priori*, there is no reason why the number of markers with effectively nonzero effects should be smaller than the number of observations. This problem does not arise in BL, which is discussed next.

A computationally convenient hierarchical formulation of a DE distribution is obtained by exploiting the fact that the DE density can be represented as a mixture of scaled Gaussian densities (*e.g.*, Andrews and Mallows 1974; Rosa 1999), where the mixing process of the variances is an exponential distribution,

$$\frac{\lambda}{2}\exp\!\left\{-\lambda\left|\beta_j\right|\right\} = \int_{0}^{\infty} N\!\left(\beta_j \mid 0, \tau_j^2\right)\,\mathrm{Exp}\!\left(\tau_j^2 \,\middle|\, \frac{\lambda^2}{2}\right) d\tau_j^2.$$

Above, $\beta_j$ is the unknown effect of the *j*th marker and $\tau_j^2$ is a variance parameter (measuring prior uncertainty) associated with $\beta_j$. Using this, Park and Casella (2008) suggested the following hierarchical model (BL),

$$p\!\left(\mathbf{y} \mid \mu, \boldsymbol{\beta}, \sigma_\varepsilon^2\right) = \prod_{i=1}^{n} N\!\left(y_i \mid \mu + \mathbf{x}_i'\boldsymbol{\beta}, \sigma_\varepsilon^2\right) \tag{3}$$

$$p\!\left(\mu, \boldsymbol{\beta}, \sigma_\varepsilon^2, \boldsymbol{\tau}^2, \lambda^2\right) \propto \left[\prod_{j=1}^{p} N\!\left(\beta_j \mid 0, \tau_j^2\sigma_\varepsilon^2\right)\mathrm{Exp}\!\left(\tau_j^2 \,\middle|\, \frac{\lambda^2}{2}\right)\right]\chi^{-2}\!\left(\sigma_\varepsilon^2 \mid \mathrm{df}_\varepsilon, S_\varepsilon\right)G\!\left(\lambda^2 \mid \alpha_1, \alpha_2\right). \tag{4}$$

Above, $N\!\left(y_i \mid \mu + \mathbf{x}_i'\boldsymbol{\beta}, \sigma_\varepsilon^2\right)$ and $N\!\left(\beta_j \mid 0, \tau_j^2\sigma_\varepsilon^2\right)$ are normal densities centered at $\mu + \mathbf{x}_i'\boldsymbol{\beta}$ and 0, with variances $\sigma_\varepsilon^2$ and $\tau_j^2\sigma_\varepsilon^2$, respectively; $\chi^{-2}\!\left(\sigma_\varepsilon^2 \mid \mathrm{df}_\varepsilon, S_\varepsilon\right)$ is a scaled-inverted chi-square density, with degrees of freedom $\mathrm{df}_\varepsilon$ and scale $S_\varepsilon$ (in this parameterization, $E\!\left(\sigma_\varepsilon^2\right) = S_\varepsilon\left(\mathrm{df}_\varepsilon - 2\right)^{-1}$); $\mathrm{Exp}\!\left(\tau_j^2 \mid \lambda^2/2\right)$ is an exponential density indexed by a single parameter, $\lambda^2/2$; and $G\!\left(\lambda^2 \mid \alpha_1, \alpha_2\right)$ is a Gamma distribution, with shape parameter $\alpha_1$ and rate parameter $\alpha_2$.
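
The scale-mixture representation above can be checked by simulation; the sketch below (with $\sigma_\varepsilon^2$ set to 1 and an arbitrary value of $\lambda$) draws variances from the exponential mixing distribution, then effects from the conditional normal, and compares the empirical mean absolute effect with the value $1/\lambda$ implied by a DE($\lambda$) density.

```r
# Monte Carlo check of the scale-mixture representation of the DE prior
# (sigma2e is set to 1 and lambda is an arbitrary value).
set.seed(1)
lambda  <- 2
n.draws <- 1e5

tau2 <- rexp(n.draws, rate = lambda^2 / 2)          # tau_j^2 ~ Exp(lambda^2 / 2)
beta <- rnorm(n.draws, mean = 0, sd = sqrt(tau2))   # beta_j | tau_j^2 ~ N(0, tau_j^2)

# Marginally, beta_j should be DE(lambda); for that density E|beta_j| = 1 / lambda.
c(empirical = mean(abs(beta)), theoretical = 1 / lambda)
```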

The role of the $\tau_j^2$'s becomes clearer by changing variables in (3) and (4) from $\beta_j$ to $b_j = \beta_j\tau_j^{-1}$. After this change of variables, the product of the likelihood function and of the joint prior for the regression coefficients, $\prod_{j=1}^{p} N\!\left(\beta_j \mid 0, \tau_j^2\sigma_\varepsilon^2\right)$, becomes $\prod_{i=1}^{n} N\!\left(y_i \mid \mu + \sum_{j=1}^{p}\tilde{x}_{ij}b_j, \sigma_\varepsilon^2\right)\prod_{j=1}^{p} N\!\left(b_j \mid 0, \sigma_\varepsilon^2\right)$, where $\tilde{x}_{ij} = x_{ij}\tau_j$ and $\mathbf{b} = \{b_j\}$ is a vector of regression coefficients with homogeneous variance. Thus, one way of viewing this class of regression models is as a standard BR model with additional unknowns, $\boldsymbol{\tau} = \{\tau_j\}$, which assign different weights to the columns of **X**, with $\tau_j = 0$ being equivalent to removing the *j*th covariate from the model.

Park and Casella (2008) presented a set of fully conditional distributions that allows fitting the BL model via the Gibbs sampler. Some of these distributions are discussed next, to illustrate main features of the algorithm.

#### Location parameters:

In the Gibbs sampler of Park and Casella (2008), the fully conditional distribution of the regression coefficients is multivariate normal, with mean equal to the solution of the system of equations

$$\left\{\mathbf{X}'\mathbf{X} + \mathbf{D}\right\}\tilde{\boldsymbol{\beta}} = \mathbf{X}'\!\left(\mathbf{y} - \mathbf{1}\mu\right), \qquad \mathbf{D} = \mathrm{Diag}\!\left\{\tau_j^{-2}\right\}, \tag{5}$$

and covariance matrix equal to $\sigma_\varepsilon^2$ times the inverse of the coefficient matrix of (5). Recall that ordinary least-squares estimates are obtained by solving $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$ and that the counterpart of (5) in BR is $\left\{\mathbf{X}'\mathbf{X} + \mathbf{I}\sigma_\varepsilon^2\sigma_\beta^{-2}\right\}\tilde{\boldsymbol{\beta}} = \mathbf{X}'\!\left(\mathbf{y} - \mathbf{1}\mu\right)$. A key aspect of BL is that it produces shrinkage that is marker specific, contrary to BR. Since $\sigma_\varepsilon^2$ is a scaling factor common to all regression coefficients, the differential shrinkage is due to the $\tau_j^2$'s. If $\tau_j^2$ is large, *i.e.*, a large variance is associated with the effect of the *j*th marker, the quantity added to the *j*th diagonal element, $\tau_j^{-2}$, will be small. Conversely, if a small variance is associated with the effect of the *j*th coefficient, $\tau_j^{-2}$ will be large. Adding a large constant to the *j*th diagonal element shrinks the estimate of $\beta_j$ toward zero, relative to least squares, and reduces the variance of its fully conditional distribution.
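
A minimal sketch of this Gibbs step is given below; **X**, **y**, $\mu$, the $\tau_j^2$'s, and $\sigma_\varepsilon^2$ are simulated placeholders, included only so that the snippet runs on its own.

```r
# One draw of beta from its fully conditional distribution, as implied by (5).
set.seed(2)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n); mu <- mean(y)
tau2 <- rexp(p, rate = 1); sigma2e <- 1

C         <- crossprod(X) + diag(1 / tau2)     # coefficient matrix of (5)
rhs       <- crossprod(X, y - mu)              # right-hand side X'(y - 1*mu)
beta.mean <- solve(C, rhs)                     # mean of the fully conditional
beta.cov  <- sigma2e * solve(C)                # covariance of the fully conditional
beta      <- as.vector(beta.mean + t(chol(beta.cov)) %*% rnorm(p))
```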

#### Variances of the regression coefficients:

An important aspect of the algorithm is how samples of the regression coefficients affect realizations of the variances of marker effects. In BL, the fully conditional posterior distributions of the $\tau_j^{-2}$'s can be shown to be inverse Gaussian (*e.g*., Chhikara and Folks 1989), with mean $\sqrt{\lambda^2\sigma_\varepsilon^2\beta_j^{-2}}$ and scale parameter $\lambda^2$. For a given $\lambda$, a small absolute value of $\beta_j$ will lead to a fully conditional distribution of $\tau_j^{-2}$ with a large mean, which in turn will generate relatively small values of $\tau_j^2$.
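
A sketch of this update is shown below, using a standard transformation sampler for inverse-Gaussian variates; $\beta_j$, $\sigma_\varepsilon^2$, and $\lambda$ are placeholder values.

```r
# Draw tau_j^{-2} from an inverse-Gaussian distribution with the stated mean and
# scale; one of the two roots of the transformation is chosen at random.
rinvgauss1 <- function(mu, shape) {
  z <- rnorm(1)^2
  x <- mu + (mu^2 * z) / (2 * shape) -
    (mu / (2 * shape)) * sqrt(4 * mu * shape * z + mu^2 * z^2)
  if (runif(1) <= mu / (mu + x)) x else mu^2 / x
}

beta.j <- 0.05; sigma2e <- 1; lambda <- 10   # placeholder values
inv.tau2 <- rinvgauss1(mu = sqrt(lambda^2 * sigma2e / beta.j^2), shape = lambda^2)
tau2.j   <- 1 / inv.tau2   # small |beta_j| -> large mean of tau_j^{-2} -> small tau_j^2
```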

#### The λ parameter of the exponential prior:

In the standard LASSO, $\lambda$ controls the trade-off between goodness of fit and model complexity, and this may be crucial in defining the ability of a model to uncover signal. Small values of $\lambda$ produce better fit, in the sense of a smaller residual sum of squares ($\lambda = 0$ gives ordinary least squares); as $\lambda$ increases, the penalty on model complexity increases (in optimization problem (1), the feasible set becomes smaller). On the other hand, in BL, $\lambda$ controls the shape of the prior distribution assigned to the $\tau_j^2$'s. In general, the exponential prior assigns more density to small values of the $\tau_j^2$'s than to large ones, and this may be reasonable for most SNPs, under the expectation that most of their effects are nil.

In BL, $\lambda$ can be treated as any other unknown. If, as in (4), a Gamma prior is assigned to $\lambda^2$, the fully conditional posterior distribution of $\lambda^2$ is also Gamma, with shape and rate parameters equal to $p + \alpha_1$ and $\tfrac{1}{2}\sum_{j=1}^{p}\tau_j^2 + \alpha_2$, respectively. The expectation of this Gamma distribution is $\left(p + \alpha_1\right)\left(\tfrac{1}{2}\sum_{j=1}^{p}\tau_j^2 + \alpha_2\right)^{-1}$, so a large value of $\sum_{j=1}^{p}\tau_j^2$ will lead to a relatively small $\lambda^2$, and the opposite will occur if the sum of the variances of the regression coefficients is small.
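
This conjugate update is a one-liner in R; the sketch below uses placeholder values for *p*, the $\tau_j^2$'s, and the hyperparameters.

```r
# Fully conditional draw of lambda^2 under the Gamma(alpha1, alpha2) prior.
p      <- 500
tau2   <- rexp(p, rate = 2)     # placeholder current values of the tau_j^2's
alpha1 <- 1.1; alpha2 <- 0.1    # placeholder hyperparameters

lambda2 <- rgamma(1, shape = p + alpha1, rate = sum(tau2) / 2 + alpha2)
lambda  <- sqrt(lambda2)
```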

#### Relationship between LASSO and other regression models used in genomic selection:

Standard BR may not be suitable for regressing phenotypes on a large number of markers because shrinkage of regression coefficients is homogeneous across markers (Fernando *et al*. 2007). In contrast, in BL the prior variance of each marker effect, $\tau_j^2\sigma_\varepsilon^2$, is marker specific, producing shrinkage whose extent is related to the absolute value of the estimated regression coefficient.

Meuwissen *et al*. (2001) recognized that marker-specific variances may be needed and suggested regression models based on marginal priors that are also mixtures of scaled-Gaussian distributions (“Bayes A”) or mixtures of scaled-Gaussian distributions and of a point mass at zero (“Bayes B”). In these models, the likelihood is as in (3) and, in Bayes A, the prior is

$$p\!\left(\mu, \boldsymbol{\beta}, \boldsymbol{\sigma}_\beta^2, \sigma_\varepsilon^2\right) \propto \left[\prod_{j=1}^{p} N\!\left(\beta_j \mid 0, \sigma_{\beta_j}^2\right)\chi^{-2}\!\left(\sigma_{\beta_j}^2 \mid \mathrm{df}_\beta, S_\beta\right)\right]\chi^{-2}\!\left(\sigma_\varepsilon^2 \mid \mathrm{df}_\varepsilon, S_\varepsilon\right). \tag{6}$$

The first two components of (6), $N\!\left(\beta_j \mid 0, \sigma_{\beta_j}^2\right)\chi^{-2}\!\left(\sigma_{\beta_j}^2 \mid \mathrm{df}_\beta, S_\beta\right)$, are the counterparts of the first two components of (4), $N\!\left(\beta_j \mid 0, \tau_j^2\sigma_\varepsilon^2\right)\mathrm{Exp}\!\left(\tau_j^2 \mid \lambda^2/2\right)$, with $\sigma_{\beta_j}^2$ playing the role of $\tau_j^2\sigma_\varepsilon^2$. The difference between BL and Bayes A (or Bayes B) is how the priors of the variances of the marker-specific regression coefficients ($\sigma_{\beta_j}^2$ in Bayes A and $\tau_j^2\sigma_\varepsilon^2$ in BL) are specified. At this level, Bayes A and BL differ in two respects:

1. In Bayes A, the prior assumption is that the marker-specific variances are independent random variables, each following the same scaled-inverted chi-square distribution with known prior degrees of belief $\mathrm{df}_\beta$ and scale $S_\beta$. In BL, the assumption is that these variances are independent as well, but each follows the same exponential distribution with unknown parameter $\lambda$. The conditional (given $\lambda$) marginal prior of a marker effect in BL is DE, while in Bayes A it is a *t*-distribution. Although a *t*-distribution may place more density at zero than the Gaussian prior of BR, the density at zero is larger in the DE. This issue was recognized by Meuwissen *et al*. (2001), leading to the development of Bayes B.

   Xu (2003) employed an improper prior for the marker-specific variances; if $\mathrm{df}_\beta = 0$ and $S_\beta = 0$, then $p\!\left(\sigma_{\beta_j}^2\right) \propto \sigma_{\beta_j}^{-2}$. Similar to the exponential prior, this density decreases monotonically with $\sigma_{\beta_j}^2$. However, unlike the exponential distribution, where $\lambda$ can be used to “tune” the shape of the distribution, this prior does not have parameters to allow any control. In addition, as noted by Ter Braak *et al.* (2005), $p\!\left(\sigma_{\beta_j}^2\right) \propto \sigma_{\beta_j}^{-2}$ yields an improper posterior. As an alternative, these authors suggested using $p\!\left(\sigma_{\beta_j}^2\right) \propto \left(\sigma_{\beta_j}^2\right)^{-\alpha}$ with $\tfrac{1}{2} < \alpha < 1$; although this prior is improper, it does not yield an improper posterior. As with the exponential prior, $p\!\left(\sigma_{\beta_j}^2\right)$ is decreasing with respect to $\sigma_{\beta_j}^2$. Ter Braak (2006) further discussed the role of $\alpha$, which, as $\lambda$ does in the BL, controls the shape of the prior density on the variance of the regression coefficients.

2. A second difference is that, in Bayes A, the values of parameters $\mathrm{df}_\beta$ and $S_\beta$ are specified as known *a priori*. On the other hand, in BL there is an extra level in the model: $\lambda^2$ is assigned a Gamma distribution, and information from all regression coefficients is pooled to make inferences about $\lambda$. This difference is illustrated in Figure 2: in Bayes A, $\mathrm{df}_\beta$ and $S_\beta$ control, as $\lambda$ does in BL, the trade-off between goodness of fit and model complexity.

Yi and Xu (2008) discuss an extension of Bayes A where a prior is assigned to $\mathrm{df}_\beta$ and $S_\beta$, and these quantities are treated as nuisances, as $\lambda$ is in BL. However, as argued earlier, the DE seems to be a better choice if the assumption is that most markers have no effect on the trait of interest.

## MONTE CARLO STUDY

Although in the BL $\lambda$ can be treated as unknown, it is not clear how sensitive results might be with respect to the choice of the hyperparameters $\alpha_1$ and $\alpha_2$. Park and Casella (2008, p. 683) recognized that this may be an issue: “The prior density for $\lambda$ should approach 0 sufficiently fast as $\lambda \rightarrow \infty$ (to avoid mixing problems) but should be relatively flat and place high probability near the maximum likelihood estimate.” The main problem with applying this recommendation is that one does not know in advance what the maximum-likelihood estimate is.

The sensitivity of BL with respect to the choice of the prior distribution of $\lambda$ was investigated here by fitting the model under different priors to simulated data. In addition to the conjugate Gamma prior on $\lambda^2$, we also considered (see File S1 and File S2)

$$p\!\left(\lambda \mid \alpha_1, \alpha_2, \mathrm{max}\right) = \mathrm{Beta}\!\left(\frac{\lambda}{\mathrm{max}} \,\middle|\, \alpha_1, \alpha_2\right), \qquad \lambda \in \left[0, \mathrm{max}\right]. \tag{7}$$

The above distribution gives great flexibility for specifying a relatively flat prior over a wide range of values. The uniform prior appears as a special case when $\alpha_1 = \alpha_2 = 1$. When the Beta prior is used, the fully conditional distribution of $\lambda$ does not have a closed form; however, draws from this distribution can be obtained using the Metropolis–Hastings algorithm (see File S1 and File S2).
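
A minimal Metropolis step of the kind referred to above might look as follows; the target combines the exponential terms for the $\tau_j^2$'s with the scaled-Beta prior, and the current state, proposal standard deviation, and hyperparameter values are arbitrary placeholders (the actual implementation is in File S1 and File S2).

```r
# Random-walk Metropolis update for lambda under the scaled-Beta prior of (7).
set.seed(3)
p    <- 500
tau2 <- rexp(p, rate = 2)                     # placeholder current tau_j^2 values
a1 <- 1.2; a2 <- 1.2; max.lambda <- 200       # placeholder hyperparameters

log.target <- function(lambda) {
  if (lambda <= 0 || lambda >= max.lambda) return(-Inf)
  2 * p * log(lambda) - lambda^2 * sum(tau2) / 2 +      # prod_j Exp(tau_j^2 | lambda^2/2)
    dbeta(lambda / max.lambda, a1, a2, log = TRUE)      # scaled-Beta prior on lambda
}

lambda      <- 10                             # current state of the chain
lambda.star <- lambda + rnorm(1, sd = 1)      # random-walk proposal
if (log(runif(1)) < log.target(lambda.star) - log.target(lambda)) lambda <- lambda.star
```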

#### Data-generating process:

Data were simulated in a simple setting, such that problems could be identified easily, while the phenotypic and genotypic structures attempted to resemble those encountered in real data sets.

Data were generated under the additive model

$$y_i = \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i,$$

where $y_i$ is the phenotype for individual *i*, $\beta_j$ is the effect of allele substitution at marker *j* ($j = 1, \ldots, p$), and $x_{ij}$ is the code for the genotype of subject *i* at locus *j* ($i = 1, \ldots, n$). Residuals were independently sampled from a standard normal distribution; that is, $\varepsilon_i \sim N(0, 1)$.

Two scenarios regarding the genotypic distribution were considered. In scenario *X*_{0}, markers were in low linkage disequilibrium (LD), with almost no correlation between adjacent markers (Table 1). In scenario *X*_{1} a relatively high LD was considered (Table 1).

The effects of allele substitutions were kept constant across simulations and set to zero for all markers except for 10 (Figure 3). The locations of markers with nonnull effects were chosen such that different situations regarding effects of linked markers were represented.
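
A compact sketch of a data-generating process of this kind is given below, assuming unlinked biallelic markers (roughly the low-LD scenario); the number of individuals and markers, the allele frequencies, and the positions and sizes of the nonnull effects are placeholders, not the values used in the study.

```r
# Simulate phenotypes under the additive model with a few nonnull marker effects.
set.seed(4)
n <- 300; p <- 100
maf <- runif(p, 0.05, 0.5)                                      # allele frequencies
X   <- sapply(maf, function(q) rbinom(n, size = 2, prob = q))   # genotype codes 0/1/2

beta <- rep(0, p)
beta[seq(10, p, by = 10)] <- c(rep(1, 5), rep(-1, 5))   # 10 markers with nonnull effects

g <- as.vector(X %*% beta)   # true genomic values
y <- g + rnorm(n)            # phenotypes; residuals are standard normal
```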

#### Choice of prior distribution of λ:

For each Monte Carlo (MC) replicate, five variations of BL were fitted. Four of these (BL1–BL4) involved a Gamma prior on $\lambda^2$, each with a different combination of values of the shape and rate parameters $\left(\alpha_1, \alpha_2\right)$. In BL5, the prior on $\lambda$ was the Beta prior of (7) (Figure 4).

#### Results:

Table 2 shows the average (across 100 MC replicates) of the posterior means of the residual variance, the regularization parameter, and the correlations between true and estimated values of several features (phenotypes, genomic values, and marker effects). The correlation between observed and fitted phenotypes, $\mathrm{Cor}\!\left(y_i, \hat{y}_i\right)$, is a goodness-of-fit measure; $\mathrm{Cor}\!\left(g_i, \hat{g}_i\right)$ measures how well the model estimates genomic values; and $\mathrm{Cor}\!\left(\beta_j, \hat{\beta}_j\right)$ evaluates how well the model estimates marker effects.

The posterior mean and standard deviation of $\lambda$ were influenced by the prior (Table 2). The posterior mean was shrunk toward the prior mode, and the posterior standard deviation was larger for more dispersed priors (see Table 2 and Figure 4). These results suggest that there is not much information about $\lambda$ in the type of samples evaluated. On the other hand, model goodness of fit and the ability of the model to uncover signal were not affected markedly by the choice of prior. This suggests that, while it may be difficult to learn about $\lambda$ from the data, inferences on quantities of interest (*e.g.*, genetic values) may be robust with respect to values of $\lambda$ over a fairly wide range. For example, differences in $\mathrm{Cor}\!\left(g_i, \hat{g}_i\right)$ or in $\mathrm{Cor}\!\left(\beta_j, \hat{\beta}_j\right)$ were small when the prior was changed.

A relatively flat prior based on a Beta distribution (BL5) produced a more dispersed posterior distribution of , and mixing was not as good as when the sharper Gamma priors (BL1–BL4) were used. For example, the average (across MC replicates) effective sample sizes (*e.g.*, Plummer *et al.* 2008) for the residual variance were 1468, 1155, 1091, 1138, and 578 for BL1–BL5, respectively.

## BAYESIAN REGRESSION COUPLED WITH LASSO

In practice, the information set available for prediction of genomic values may include components other than genetic markers. For example, data may cluster into known contemporary groups (*e.g.*, individuals may be measured under different experimental conditions), or a pedigree may be available in addition to genetic markers. It is natural to treat the various classes of predictors in a different way. From a penalized-likelihood point of view, this amounts to using penalty functions that are specific to each class of predictors. From a Bayesian standpoint, treating predictors differently may be achieved by assigning different priors. A straightforward extension of the BL is described next.

The data structure is denoted as $\left\{y_i, \mathbf{x}_{Fi}, \mathbf{x}_{Li}, i\right\}$, where $y_i$ is the phenotype of subject *i*; $\mathbf{x}_{Fi}$ is a vector of covariates whose effects are treated as in a standard BR, with a normal prior and a variance common to all regressions; $\mathbf{x}_{Li}$ is a set of covariates whose effects are assigned a double-exponential prior, as in BL; and *i* is a label that allows tracking subjects in a pedigree. The equation for the data is

$$y_i = \mu + \mathbf{x}_{Fi}'\boldsymbol{\beta}_F + \mathbf{x}_{Li}'\boldsymbol{\beta}_L + u_i + \varepsilon_i,$$

where $\mu$ is an intercept; $\boldsymbol{\beta}_F$ and $\boldsymbol{\beta}_L$ are regressions of $y_i$ on $\mathbf{x}_{Fi}$ and $\mathbf{x}_{Li}$, respectively; $u_i$ is an infinitesimal genetic effect pertaining to individual *i*, for which the prior (co)variance structure is determined by a pedigree; and $\varepsilon_i$ is a model residual, assumed to be distributed independently of other residuals. The likelihood function is

$$p\!\left(\mathbf{y} \mid \mu, \boldsymbol{\beta}_F, \boldsymbol{\beta}_L, \mathbf{u}, \sigma_\varepsilon^2\right) = \prod_{i=1}^{n} N\!\left(y_i \mid \mu + \mathbf{x}_{Fi}'\boldsymbol{\beta}_F + \mathbf{x}_{Li}'\boldsymbol{\beta}_L + u_i, \sigma_\varepsilon^2\right). \tag{8}$$

Prior specification (4) is modified as

$$\begin{aligned}
p\!\left(\mu, \boldsymbol{\beta}_F, \boldsymbol{\beta}_L, \mathbf{u}, \boldsymbol{\tau}^2, \lambda, \sigma_{\beta_F}^2, \sigma_u^2, \sigma_\varepsilon^2\right) \propto{}& \left[\prod_{j} N\!\left(\beta_{Fj} \mid 0, \sigma_{\beta_F}^2\right)\right]\left[\prod_{k} N\!\left(\beta_{Lk} \mid 0, \tau_k^2\sigma_\varepsilon^2\right)\mathrm{Exp}\!\left(\tau_k^2 \,\middle|\, \frac{\lambda^2}{2}\right)\right]p(\lambda) \\
& \times N\!\left(\mathbf{u} \mid \mathbf{0}, \mathbf{A}\sigma_u^2\right)\chi^{-2}\!\left(\sigma_{\beta_F}^2 \mid \mathrm{df}_{\beta_F}, S_{\beta_F}\right)\chi^{-2}\!\left(\sigma_u^2 \mid \mathrm{df}_u, S_u\right)\chi^{-2}\!\left(\sigma_\varepsilon^2 \mid \mathrm{df}_\varepsilon, S_\varepsilon\right),
\end{aligned} \tag{9}$$

where $\sigma_{\beta_F}^2$, $\sigma_u^2$, and $\sigma_\varepsilon^2$ are the variances of $\boldsymbol{\beta}_F$, $\mathbf{u}$, and the model residuals, respectively; the $\mathrm{df}$'s and $S$'s are prior degrees of freedom and scale parameters of the corresponding scaled-inverted chi-square distributions; $\mathbf{A}$ is a (co)variance structure computed from the genealogy (for example, a numerator-relationship matrix); and $p(\lambda)$ is the prior on $\lambda$, which may be as in (4) or (7).

In the model defined by (8) and (9), all fully conditional distributions (except that of $\lambda$, if a nonconjugate prior such as (7) is chosen for $\lambda$) have closed form, so a Gibbs sampler (with a Metropolis–Hastings step for $\lambda$) can be used to draw samples from the joint posterior distribution (see File S1 and File S2). To distinguish the above model from the standard BL, we refer to it as Bayesian regression coupled with LASSO (BRL).
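
As one example of these closed-form updates, the sketch below draws the vector of infinitesimal effects **u** from its fully conditional distribution, which (given all other unknowns) is multivariate normal with precision matrix $\mathbf{I}\sigma_\varepsilon^{-2} + \mathbf{A}^{-1}\sigma_u^{-2}$; the relationship matrix, variance components, and adjusted data (phenotypes corrected for $\mu$, $\mathbf{x}_{Fi}'\boldsymbol{\beta}_F$, and $\mathbf{x}_{Li}'\boldsymbol{\beta}_L$) used here are placeholders.

```r
# Block update of the infinitesimal effects u in the BRL model:
# u | else ~ N(C^{-1} r, C^{-1}), with C = I / sigma2e + A^{-1} / sigma2u and
# r = y.adj / sigma2e, where y.adj = y - mu - X_F %*% betaF - X_L %*% betaL.
set.seed(5)
n <- 50
A <- diag(n)                  # placeholder for a numerator-relationship matrix
sigma2u <- 0.3; sigma2e <- 0.7
y.adj <- rnorm(n)             # placeholder adjusted data

C  <- diag(n) / sigma2e + solve(A) / sigma2u
Ci <- solve(C)
u  <- as.vector(Ci %*% (y.adj / sigma2e) + t(chol(Ci)) %*% rnorm(n))
```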

## DATA ANALYSIS

Two data sets were analyzed with the BRL model. The first set pertains to a collection of wheat lines (see File S1 and File S2); the second set contains information from a population of mice (publicly available at http://gscan.well.ox.ac.uk).

The wheat data set is from the Global Wheat program of the International Maize and Wheat Improvement Center (CIMMYT). This program conducted several international trials across a wide variety of environments. For this study, we took a subset of 599 wheat lines derived from 25 years of Elite Spring Wheat Yield Trials (ESWYT) conducted from 1979 through 2005. The environments represented in these trials were grouped into four macroenvironments. The phenotype considered here was average grain yield performance of the 599 wheat lines evaluated in one of the macroenvironments. An association mapping study based on a reduced number of these ESWYT trials is presented in Crossa *et al*. (2007).

The Browse application of the International Crop Information System (ICIS), as described in http://cropwiki.irri.org/icis/index.php/TDM_GMS_Browse (McLaren *et al*. 2005), was used for deriving the relationship matrix **A** between the 599 lines, and it accounts for selection and inbreeding.

A total of 1447 Diversity Array Technology (DArT) markers were generated by Triticarte (Canberra, Australia; http://www.triticarte.com.au). DArT markers take on one of two values, denoting their presence or absence.

The mouse data come from an experiment carried out to detect and locate QTL for complex traits in a mouse population (Valdar *et al.* 2006a,b). These data have already been analyzed for comparing genome-assisted genetic evaluation methods (Legarra *et al.* 2008). The data file consists of 1884 individuals (168 full-sib families), each genotyped for 10,946 polymorphic markers. The trait analyzed here was body mass index (BMI), precorrected by body weight, season, month, and day. Mice were housed in 359 cages; on average, each litter was allocated into 2.84 cages.

Three models were fitted to each of the data sets: P (standing for pedigree) is a pedigree-based model where markers were not included; M is a model where the only genetic component is the regression on markers; P&M (standing for pedigree and markers) includes regressions on markers and an additive effect with (co)variance structure computed from the pedigree. For both data sets, phenotypes were standardized to have a sample variance equal to one, so that results are easily compared across data sets.

In the mouse data set, the elements of $\boldsymbol{\beta}_F$ were the effects of the cages in which groups of mice were reared. In the wheat data set, the component $\mathbf{x}_{Fi}'\boldsymbol{\beta}_F$ was omitted because there was no such set of regressors.

Models were first fitted to the entire data set. Subsequently, a fivefold cross-validation (CV) was carried out, with random assignment of individuals to folds. The CV yields predictions of phenotypes, $\hat{y}_i^{(f)}$ ($f = 1, \ldots, 5$), obtained from a model in which all observations in the *f*th fold were excluded. The ability of each model to predict out-of-sample data was evaluated via the correlation between phenotypes and predictions from CV. Inferences for each fit were based on 70,000 samples (after 5000 were discarded as burn-in). Convergence was checked by inspection of trace plots and with estimates of effective sample size for (co)variance components, computed using the coda package of R (Plummer *et al*. 2008). Fixed values were assigned to the parameters of the prior distributions; the prior assigned to $\lambda$ was flat over a wide range of values of $\lambda$.
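
The layout of the cross-validation can be sketched as follows; fit.model() is a hypothetical wrapper standing in for refitting the BRL sampler with the phenotypes of the *f*th fold treated as missing, so the snippet is a schematic rather than runnable code.

```r
# Fivefold cross-validation layout: random fold assignment and predictive correlation.
set.seed(6)
n    <- length(y)                           # y: vector of (standardized) phenotypes
fold <- sample(rep(1:5, length.out = n))    # random assignment of individuals to folds

y.hat <- rep(NA, n)
for (f in 1:5) {
  test        <- fold == f
  y.trn       <- y
  y.trn[test] <- NA                         # phenotypes in fold f treated as missing
  fit         <- fit.model(y.trn)           # hypothetical call: refit the model without fold f
  y.hat[test] <- fit$y.pred[test]           # out-of-sample predictions for fold f
}
cor(y, y.hat)                               # predictive ability of the model
```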

## RESULTS AND DISCUSSION

Table 3 shows summaries of the posterior distributions of the variance components and of λ by model and data set. In both populations, a moderate reduction in the posterior mean of the residual variance was observed when the P&M model was fitted, relative to P. Using model P in the wheat population gave a posterior mean of the heritability of grain yield of 0.34, while in the mouse population the posterior mean of *h*^{2} of body mass index was 0.11. These results are in agreement with previous reports for these traits and populations (Valdar *et al*. 2006b and Legarra *et al*. 2008 for the mouse data; Crossa *et al*. 2007 for the wheat data). The inclusion of markers (P&M) reduced the estimate of the variance of the infinitesimal additive effect, relative to P. This happens because, in P&M, part of the infinitesimal additive effect is captured by the regression on markers (*e.g.*, Habier *et al*. 2007; Bink *et al.* 2008). In the models for body mass index, the estimate of the variance between cages was reduced only slightly when the effects of the markers were fitted.

Figure 5 gives absolute values of the posterior means of marker effects. In the mouse data there are several regions showing groups of markers with relatively large estimated effects. This is not evident in the wheat data set where fewer markers were available.

From a breeder's perspective, a relevant question is whether or not the P, M, and P&M models lead to different rankings of individuals on the basis of the estimated genetic values. Table 4 shows the rank (Spearman) correlations between estimated genetic values. As expected, these correlations were high, but not perfect. The correlation between predicted genetic values from M and P&M was larger than that between predictions from P and P&M, suggesting that the inclusion of markers in the model is probably critical. This was clearer in the mouse data set, where (a) the extent of additive relationships was not as strong as in the wheat population and (b) a much larger number of markers were available.

Figure 6 shows scatter plots of predicted genomic values in P and P&M for both data sets. Although the correlation between genetic values estimated from different models was high, using P and P&M would lead to different sets of selected individuals. The difference was more marked in the mouse data set, illustrating that the impact of considering markers in breeding decisions depends on the data structure and on how informative the pedigree and markers are. Also, the dispersion of predicted genetic values was larger when markers were fitted, and this is consistent with the smaller posterior mean of the residual variance observed for P&M (Table 3). An interpretation of this result is that, in certain contexts, markers may help to uncover genetic variance that would not be captured if only pedigree-based predictions were used.