## Abstract

A multiple-trait Bayesian LASSO (MBL) for genome-based analysis and prediction of quantitative traits is presented and applied to two real data sets. The data-generating model is a multivariate linear Bayesian regression on possibly a huge number of molecular markers, with a Gaussian residual distribution posed. Each of the vectors of regression coefficients (one per marker) is assigned the same *T*-variate Laplace prior distribution (*T*: number of traits), with a null mean vector and unknown scale matrix **Σ**. The multivariate prior reduces to that of the standard univariate Bayesian LASSO when *T* = 1. The covariance matrix of the residual distribution is assigned a multivariate Jeffreys prior, and **Σ** is given an inverse-Wishart prior. The unknown quantities in the model are learned using a Markov chain Monte Carlo sampling scheme constructed using a scale mixture of normal distributions representation. MBL is demonstrated in a bivariate context employing two publicly available data sets, with a bivariate genomic best linear unbiased prediction model (GBLUP) used for benchmarking results. The first data set is one where wheat grain yields in two different environments are treated as distinct traits. The second data set comes from genotyped *Pinus* trees, with each individual measured for two traits: rust bin and gall volume. In MBL, the bivariate marker effects are shrunk differentially, *i.e.*, “short” vectors are more strongly shrunk toward the origin than in GBLUP; conversely, “long” vectors are shrunk less. A predictive comparison was carried out as well in wheat, where the comparators of MBL were bivariate GBLUP and bivariate Bayes C*π*, a variable selection procedure. A training-testing layout was used, with 100 random reconstructions of training and testing sets. For the wheat data, all methods produced similar predictions. In *Pinus*, MBL gave better predictions than either a Bayesian bivariate GBLUP or the single-trait Bayesian LASSO.
MBL has been implemented in the Julia language package JWAS, and is now available for the scientific community to explore with different traits, species, and environments. It is well known that there is no universally best prediction machine, and MBL represents a new resource in the armamentarium for genome-enabled analysis and prediction of complex traits.

TWO main paradigms have been employed for investigating statistical associations between molecular markers and complex traits: marker-by-marker genome-wide association studies (GWAS) and whole-genome regression approaches (WGR). GWAS is dominant in human genetics; Visscher *et al.* (2017) present a perspective and Gianola *et al.* (2016) formulate a statistically orientated critique. WGR was developed mostly in animal and plant breeding (*e.g.*, Lande and Thompson 1990; Meuwissen *et al.* 2001; Gianola *et al.* 2003) primarily for predicting future performance, but it has received some attention in human genetics as well (*e.g.*, de los Campos *et al.* 2010; Yang *et al.* 2010; Lee *et al.* 2011; Makowsky *et al.* 2011; López de Maturana *et al.* 2014). de los Campos *et al.* (2013), Gianola (2013) and Isik *et al.* (2017) reviewed an extensive collection of WGR approaches. Other studies noted that WGR can be used both for “discovery” of associations and for prediction (Moser *et al.* 2015; Goddard *et al.* 2016; Fernando *et al.* 2017). Hence, WGR methodology is an active area of research.

Multiple-trait analysis has long been of great interest in plant and animal breeding, mainly from the point of view of joint selection for many traits (Smith 1936; Hazel 1943; Walsh and Lynch 2018). Henderson and Quaas (1976) developed multi-trait best linear unbiased prediction of breeding values for all individuals and traits measured in a population of animals—a methodology that has gradually become routine in the field. For example, Gao *et al.* (2018) described an application of a nine-variate model to data representing close to 7 million and 4 million Holstein and Nordic Red cattle, respectively; the nine traits were milk, fat, and protein yields in each of the first three lactations of the cows.

A multiple-trait analysis is also a natural choice in quests for understanding and dissecting genetic correlations between traits using molecular markers, *e.g.*, evaluating whether pleiotropy or linkage disequilibrium is at the root of between-trait associations (Gianola *et al.* 2015; Cheng *et al.* 2018a). For instance, Galesloot *et al.* (2014) compared six methods of multivariate GWAS via simulation and found that all delivered higher power than single-trait GWAS, even when genetic correlations were weak. Many single-trait WGR methods extend directly to the multiple-trait domain, *e.g.*, genomic best linear unbiased prediction (GBLUP; VanRaden 2007). Other procedures, such as Bayesian mixture models, are more involved, but extensions are available (Calus and Veerkamp 2011; Jia and Jannink 2012; Cheng *et al.* 2018a). The mixture model of Cheng *et al.* (2018a) is particularly interesting because it provides insight into whether markers affect all, some, or none of the traits addressed. For example, Cheng *et al.* (2018a) used single nucleotide polymorphisms (SNPs) to estimate the proportion of markers in each of the (0, 0), (0, 1), (1, 0), and (1, 1) categories, where 0 means “no effect” and 1 denotes “effect” on the corresponding one of two disease traits in *Pinus taeda*. The proportion of Markov chain Monte Carlo (MCMC) samples falling into the (1, 1) class was <3%, with ∼140 markers appearing as candidates for further scrutiny of pleiotropy; 97% of the SNPs were in the (0, 0) class, and 0.5% were in the (1, 0) and (0, 1) classes. It must be noted that Cheng *et al.* (2018a) used Bayesian model averaging, so posterior estimates of effects, and of their uncertainties, constitute averages over all possible configurations. The resulting “average model” is not truly sparse, as Bayesian mixture models always assign some posterior probability to each of the possible configurations.
An alternative to a mixture is to use a prior distribution that produces strong shrinkage toward the origin of “weak” vectors of marker effects; here, each marker has a vector with dimension equal to the number of traits.

The LASSO (least absolute shrinkage and selection operator) method presented by Tibshirani (1996) is a single-response method based on minimizing a linear regression residual sum of squares subject to a constraint on the ℓ1 norm of the coefficients. It can produce a sparse model, *i.e.*, if the linear regression model has *p* regression coefficients, the LASSO yields a smaller model (*i.e.*, model selection), but with a complexity that cannot exceed the number of observations. Tibshirani (1996) noted that the LASSO solutions can also be obtained by calculating the mode of the conditional posterior distribution of the regression coefficients in a Bayesian model in which each coefficient is assigned the same conditional double exponential (DE) or Laplace prior. Using a ridge regression reformulation of LASSO, it can be seen that its Bayesian version shrinks small-value regression coefficients very strongly toward zero, whereas large-effect variants are regularized to a much lesser extent than in ridge regression (Tibshirani 1996; Gianola 2013). Yuan and Lin (2006) and Yuan *et al.* (2007) considered the problem of clustering regression coefficients into groups (factors), with the focus becoming factor selection, as opposed to the predictor variable selection that takes place in LASSO. For instance, a cluster could consist of a group of markers in tight physical linkage. These authors noted that, in some instances, grouping enhances prediction performance over ridge regression, while in others, it does not. Such findings are consistent with knowledge accumulated in close to two decades of experience with genome-enabled prediction in animal breeding: there is no universally best prediction machine. A multiple-trait application of a LASSO penalty on regression coefficients was presented by Li *et al.* (2015). These authors assigned a multivariate Laplace distribution to the model residuals, and a group-LASSO penalty (Yuan and Lin 2006) to the regression coefficients.
The procedure differs from Tibshirani’s LASSO in that the model selects vectors of regressors (corresponding to regressions of a given marker over traits) as opposed to single-trait predictor variables.

Park and Casella (2008) introduced a fully Bayesian LASSO (BL). Contrary to LASSO, BL produces a model where all regression coefficients are non-null (even when the number of markers exceeds the number of observations); most regressions are often tiny in value, except those associated with covariates (markers) with strong effects. In short, LASSO produces a sparse model, whereas BL yields an effectively sparse specification, similar to Bayesian mixture models such as Bayes B (Meuwissen *et al.* 2001). The first application of BL in quantitative genetics was made by Yi and Xu (2008) in the context of quantitative trait locus (QTL) mapping, with subsequent applications reported in de los Campos *et al.* (2009), Legarra *et al.* (2011), Li *et al.* (2011), and Lehermeier *et al.* (2013).

It appears that no multiple-trait generalization of the BL has yet been reported. The present paper describes a multi-trait Bayesian LASSO (MBL) model based on adopting a multivariate Laplace distribution with unknown scale matrix as prior distribution for the markers or variants under scrutiny. The MBL is introduced and compared with a multiple-trait GBLUP (MTGBLUP) using wheat and pine tree data sets. The section *Multi-Trait Regression Model* describes MBL, including a MCMC sampling algorithm. Subsequently, MBL is compared with MTGBLUP using a wheat data set. Finally, bivariate MBL and bivariate MTGBLUP are contrasted from a predictive perspective, showing a better performance of MBL over BLUP and over a single-trait Bayesian LASSO specification, corroborating the usefulness of multiple-trait analyses. The paper concludes with a general discussion and technical details are presented in Appendices.

## Multi-Trait Regression Model

Assume there are *T* traits observed in each of *N* individuals, and let $\boldsymbol{\beta}_j = (\beta_{j1}, \beta_{j2}, \ldots, \beta_{jT})'$ be a vector of allelic substitution effects at marker $j$ $(j = 1, 2, \ldots, p)$, with $\beta_{jt}$ representing the effect of marker *j* on trait *t*. The multi-trait regression model (assuming no nuisance location effects other than a mean) for the *T* responses is

$$\mathbf{y}_i = \boldsymbol{\mu} + \sum_{j=1}^{p} x_{ij}\boldsymbol{\beta}_j + \mathbf{e}_i, \qquad i = 1, 2, \ldots, N, \tag{1}$$

where $\mathbf{y}_i$ is a $T \times 1$ vector of responses observed in individual $i$, $\boldsymbol{\mu}$ is the vector of trait means, and $x_{ij}$ is the genotype individual *i* possesses at marker locus *j*. The residual vector $\mathbf{e}_i$ is assumed to follow the Gaussian distribution $\mathbf{e}_i \sim N(\mathbf{0}, \mathbf{R}_0)$, where $\mathbf{R}_0$ is a $T \times T$ covariance matrix. All residual vectors are assumed to be mutually independent and identically distributed.

If traits are sorted within individuals, the probability model associated with Equation 1 can be represented as

$$p(\mathbf{y}|\boldsymbol{\mu}, \boldsymbol{\beta}, \mathbf{R}_0) \propto |\mathbf{R}_0|^{-N/2}\exp\left[-\frac{1}{2}\mathrm{tr}\left(\mathbf{R}_0^{-1}\mathbf{S}_e\right)\right], \tag{2}$$

where

$$\mathbf{S}_e = \sum_{i=1}^{N}\left(\mathbf{y}_i - \boldsymbol{\mu} - \sum_{j=1}^{p} x_{ij}\boldsymbol{\beta}_j\right)\left(\mathbf{y}_i - \boldsymbol{\mu} - \sum_{j=1}^{p} x_{ij}\boldsymbol{\beta}_j\right)' \tag{3}$$

is a $T \times T$ matrix of sums of squares and products of the unobserved regression residuals.

The regression model can be formulated in an equivalent manner by sorting individuals within traits; hereafter, we will use $T = 3$ for illustration. Let $\mathbf{y}_1$, $\mathbf{y}_2$, and $\mathbf{y}_3$ be response vectors of order $N$ each, observed for traits 1, 2, and 3, respectively. The representation of the model is

$$\mathbf{y} = (\mathbf{I}_3 \otimes \mathbf{1}_N)\boldsymbol{\mu} + (\mathbf{I}_3 \otimes \mathbf{X})\boldsymbol{\beta} + \mathbf{e}, \tag{4}$$

where $\mathbf{y} = (\mathbf{y}_1', \mathbf{y}_2', \mathbf{y}_3')'$ is a $3N \times 1$ vector, $\mathbf{X}$ is an $N \times p$ matrix of marker genotypes, and $\boldsymbol{\beta}_{(t)}$ and $\mathbf{e}_t$ are the vectors of regression coefficients and of residuals for trait $t$, respectively. Above, $\boldsymbol{\beta} = (\boldsymbol{\beta}_{(1)}', \boldsymbol{\beta}_{(2)}', \boldsymbol{\beta}_{(3)}')'$ is a $3p \times 1$ vector, and $\mathbf{e}$ has dimension $3N \times 1$. Note that $\mathrm{Var}(\mathbf{e}) = \mathbf{R}_0 \otimes \mathbf{I}_N$. Putting $\mathbf{e} = \mathbf{y} - (\mathbf{I}_3 \otimes \mathbf{1}_N)\boldsymbol{\mu} - (\mathbf{I}_3 \otimes \mathbf{X})\boldsymbol{\beta}$, the probability model is

$$p(\mathbf{y}|\boldsymbol{\mu}, \boldsymbol{\beta}, \mathbf{R}_0) \propto |\mathbf{R}_0 \otimes \mathbf{I}_N|^{-1/2}\exp\left[-\frac{1}{2}\mathbf{e}'\left(\mathbf{R}_0^{-1} \otimes \mathbf{I}_N\right)\mathbf{e}\right]. \tag{5}$$

We will work with either (1) or (4), depending on the context.
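
The equivalence of the two orderings can be checked numerically; the sketch below builds both representations with simulated genotypes and effects (all values illustrative) and verifies that they contain the same fitted values:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, T = 6, 4, 3                      # individuals, markers, traits

X = rng.integers(0, 3, size=(N, p)).astype(float)   # genotype codes 0/1/2
mu = rng.normal(size=T)                              # trait means
B = rng.normal(size=(p, T))                          # B[j, t]: effect of marker j on trait t

# Per-individual form: y_i = mu + sum_j x_ij * beta_j (traits within individuals)
Y_by_individual = mu + X @ B                         # N x T matrix of means

# Stacked form: individuals within traits, y = (I_T kron 1_N) mu + (I_T kron X) beta
beta_stacked = B.T.reshape(-1)                       # (beta_(1)', beta_(2)', beta_(3)')'
y_by_trait = (np.kron(np.eye(T), np.ones((N, 1))) @ mu
              + np.kron(np.eye(T), X) @ beta_stacked)

# Both orderings carry exactly the same information
assert np.allclose(y_by_trait, Y_by_individual.T.reshape(-1))
```

The stacked form is convenient for writing priors and fully conditional distributions, while the per-individual form makes the residual sum-of-squares matrix transparent.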

### Bayesian prior assumptions

#### Parameters μ and R₀:

The vector μ will be assigned a “flat” improper prior, and a Jeffreys noninformative prior (*e.g.*, Sorensen and Gianola 2002) will be adopted for $\mathbf{R}_0$, so that their joint prior density is

$$p(\boldsymbol{\mu}, \mathbf{R}_0) \propto |\mathbf{R}_0|^{-(T+1)/2}. \tag{6}$$

#### Multivariate Laplace prior distribution for marker effects:

The same *T*-variate Laplace prior distribution with a null mean vector will be assigned to each of the $p$ vectors $\boldsymbol{\beta}_j$, assumed mutually independent, *a priori*. Gómez *et al.* (1998) presented a multi-dimensional version of the power exponential family of distributions; one special case is the multivariate Laplace distribution (MLAP). The density of the MLAP with a zero-mean vector used here is

$$p(\boldsymbol{\beta}_j|\boldsymbol{\Sigma}) = \frac{\Gamma(T/2)}{2^{T+1}\pi^{T/2}\Gamma(T)}|\boldsymbol{\Sigma}|^{-1/2}\exp\left[-\frac{1}{2}\left(\boldsymbol{\beta}_j'\boldsymbol{\Sigma}^{-1}\boldsymbol{\beta}_j\right)^{1/2}\right], \tag{7}$$

where $\boldsymbol{\Sigma}$ is a $T \times T$ positive-definite scale matrix. The variance–covariance matrix of MLAP is

$$\mathrm{Var}(\boldsymbol{\beta}_j) = 4(T+1)\boldsymbol{\Sigma}; \tag{8}$$

note that the absolute values of the elements of the intertrait variance–covariance matrix of marker effects are larger than those of **Σ**. Hence, $\mathrm{Var}(\beta_{jt}) = 4(T+1)\sigma_{tt}$, where $\sigma_{tt}$ is the appropriate diagonal element of $\boldsymbol{\Sigma}$; likewise, $4(T+1)\sigma_{tt'}$ is the covariance of marker effects between traits $t$ and $t'$, for all $t \neq t'$. Putting $T = 1$ in (7) yields

$$p(\beta_j|\sigma^2) = \frac{1}{4\sigma}\exp\left(-\frac{|\beta_j|}{2\sigma}\right). \tag{9}$$

The preceding is the density of a DE distribution with null mean, parameter $2\sigma$, and variance $8\sigma^2$. As mentioned earlier, Tibshirani (1996) and Park and Casella (2008) used the DE distribution as conditional (given Σ) prior for regression coefficients in the BL—a member of the “Bayesian Alphabet” (Gianola *et al.* 2009). Gianola *et al.* (2018) assigned the DE distribution to residuals of a linear model for the purpose of attenuating outliers, and Li *et al.* (2015) used the MLAP distribution for the residuals in a “robust” linear regression model for QTL mapping.

MLAP is therefore an interesting candidate prior for multi-trait marker effects in a multiple-trait generalization of the Bayesian LASSO (MBL). A zero-mean MLAP distribution has a sharp peak at the **0** coordinates. Although MLAP reduces to a DE distribution when $T = 1$, the marginal and conditional densities of MLAP are not DE: Gómez *et al.* (1998) showed that such densities are elliptically contoured, and, thus, not DE. Appendix A and Supplemental Material, Figures S1–S3, give background on MLAP.

Gómez-Sánchez-Manzano *et al.* (2008) showed that MLAP can be represented as a scale mixture of normal distributions under the hierarchy: (1) $\boldsymbol{\beta}_j|\gamma_j, \boldsymbol{\Sigma} \sim N_T(\mathbf{0}, \gamma_j\boldsymbol{\Sigma})$ and (2) $\gamma_j \sim \mathrm{Gamma}\left[(T+1)/2, 1/8\right]$, for $j = 1, 2, \ldots, p$; here, $N_T$ denotes a *T*-variate normal distribution with null mean and covariance matrix $\gamma_j\boldsymbol{\Sigma}$. The density of $\gamma_j$ is

$$p(\gamma_j) = \frac{(1/8)^{(T+1)/2}}{\Gamma\left[(T+1)/2\right]}\gamma_j^{(T-1)/2}\exp\left(-\frac{\gamma_j}{8}\right). \tag{10}$$

Let the collection of all marker effects over traits be represented by the vector

$$\boldsymbol{\beta} = \left(\boldsymbol{\beta}_{(1)}', \boldsymbol{\beta}_{(2)}', \ldots, \boldsymbol{\beta}_{(T)}'\right)', \tag{11}$$

where $\boldsymbol{\beta}_{(t)}$ collects the effects of all $p$ markers on trait $t$. If independent and identical MLAP prior distributions are assigned to each of the $p$ subvectors $\boldsymbol{\beta}_j$, the joint prior density of all marker effects, given Σ, can be represented as

$$p(\boldsymbol{\beta}|\boldsymbol{\Sigma}) = \prod_{j=1}^{p} p(\boldsymbol{\beta}_j|\boldsymbol{\Sigma}), \tag{12}$$

and the joint density of β and $\boldsymbol{\gamma} = (\gamma_1, \gamma_2, \ldots, \gamma_p)'$ is

$$p(\boldsymbol{\beta}, \boldsymbol{\gamma}|\boldsymbol{\Sigma}) = \prod_{j=1}^{p} N_T(\boldsymbol{\beta}_j|\mathbf{0}, \gamma_j\boldsymbol{\Sigma})\,p(\gamma_j). \tag{13}$$

When individuals are sorted within traits (*e.g.*, as in Equation 4), note that $\boldsymbol{\beta}|\boldsymbol{\gamma}, \boldsymbol{\Sigma}$ follows a *Tp*-dimensional normal distribution, with null mean vector and covariance matrix

$$\mathrm{Var}(\boldsymbol{\beta}|\boldsymbol{\gamma}, \boldsymbol{\Sigma}) = \boldsymbol{\Sigma} \otimes \boldsymbol{\Gamma}, \tag{14}$$

where $\boldsymbol{\Gamma} = \mathrm{Diag}\{\gamma_1, \ldots, \gamma_p\}$ is a $p \times p$ diagonal matrix. Hence,

$$p(\boldsymbol{\beta}|\boldsymbol{\gamma}, \boldsymbol{\Sigma}) \propto |\boldsymbol{\Sigma} \otimes \boldsymbol{\Gamma}|^{-1/2}\exp\left[-\frac{1}{2}\boldsymbol{\beta}'\left(\boldsymbol{\Sigma}^{-1} \otimes \boldsymbol{\Gamma}^{-1}\right)\boldsymbol{\beta}\right]. \tag{15}$$
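
As a numerical check of this representation, the sketch below simulates from the hierarchy, assuming a Gamma mixing density with shape $(T+1)/2$ and rate $1/8$ (the parameterization that reproduces the moments stated above, *i.e.*, variance $8\sigma^2$ at $T = 1$ and $4(T+1)\boldsymbol{\Sigma}$ in general), and verifies the implied marginal covariance:

```python
import numpy as np

rng = np.random.default_rng(7)
T, n = 2, 200_000
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])                # scale matrix of MLAP (illustrative)
L = np.linalg.cholesky(Sigma)

# Scale mixture of normals: beta_j | gamma_j ~ N_T(0, gamma_j * Sigma),
# gamma_j ~ Gamma(shape=(T+1)/2, scale=8)  (assumed mixing density)
gamma = rng.gamma(shape=(T + 1) / 2, scale=8.0, size=n)
z = rng.standard_normal((n, T)) @ L.T          # N_T(0, Sigma) draws
beta = np.sqrt(gamma)[:, None] * z

# Implied marginal covariance should approach Var(beta_j) = 4(T+1) * Sigma
emp_cov = np.cov(beta.T)
print(emp_cov / (4 * (T + 1)))                 # approximately Sigma
```

The heavy-tailed, sharply peaked marginal arises purely from the random rescaling of an otherwise Gaussian vector, which is what makes the Gibbs-style implementation tractable.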

#### Scale matrix Σ:

The scale matrix Σ of MLAP can be given a fixed value (becoming a hyper-parameter), or inferred, in which case a prior distribution is needed. Here, an inverse-Wishart distribution with scale matrix $\boldsymbol{\Omega}$ and degrees of freedom $\nu$ will be assigned as prior. The density is

$$p(\boldsymbol{\Sigma}|\boldsymbol{\Omega}, \nu) \propto |\boldsymbol{\Sigma}|^{-(\nu+T+1)/2}\exp\left[-\frac{1}{2}\mathrm{tr}\left(\boldsymbol{\Omega}\boldsymbol{\Sigma}^{-1}\right)\right]. \tag{16}$$

### Joint posterior and fully conditional distributions

The joint posterior distribution, including $\boldsymbol{\gamma}$ from the scale-mixture of normals representation of the prior distribution of β, was assumed to take the form

$$p(\boldsymbol{\mu}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \mathbf{R}_0, \boldsymbol{\Sigma}|\mathbf{y}, H) \propto p(\mathbf{y}|\boldsymbol{\mu}, \boldsymbol{\beta}, \mathbf{R}_0)\,p(\boldsymbol{\beta}|\boldsymbol{\gamma}, \boldsymbol{\Sigma})\,p(\boldsymbol{\gamma})\,p(\boldsymbol{\mu}, \mathbf{R}_0)\,p(\boldsymbol{\Sigma}), \tag{17}$$

where *H* denotes the hyper-parameters; recall that $\mathbf{y}$ is the data vector sorted by individuals within trait. The fully conditional distributions are presented below, with ELSE used to denote all parameters that are kept fixed, together with *H*, in a specific conditional distribution.

#### Parameters *μ* and β given ELSE:

From (17), and using representations (4) and (15), the fully conditional posterior distribution of μ and β has density (18). The preceding is a multivariate normal density (*e.g.*, Sorensen and Gianola 2002), with mean vector given by (19) and variance–covariance matrix given by (20). A more explicit representation is presented in Appendix B.

#### Fully conditional distributions of partitions of the location vector:

For details, see Van Tassell and Van Vleck (1996) and Sorensen and Gianola (2002). Since the joint posterior of the location parameters, given $\mathbf{R}_0$, $\boldsymbol{\Sigma}$, and $\boldsymbol{\gamma}$, is multivariate normal, all conditionals and linear combinations thereof are normal as well. In particular, the fully conditional distributions of μ and of the subvectors of β are the normal distributions given by (21) and (22) and, likewise, by (23) and (24).

#### Fully conditional distributions of R₀ and Σ:

From (17), using (2) and (6),

$$p(\mathbf{R}_0|\mathrm{ELSE}) \propto |\mathbf{R}_0|^{-(N+T+1)/2}\exp\left[-\frac{1}{2}\mathrm{tr}\left(\mathbf{R}_0^{-1}\mathbf{S}_e\right)\right], \tag{25}$$

so $\mathbf{R}_0|\mathrm{ELSE}$ follows an inverse-Wishart distribution with *N* degrees of freedom and scale matrix $\mathbf{S}_e$, the residual sum-of-squares matrix in (3).

Recall that

$$p(\boldsymbol{\beta}|\boldsymbol{\gamma}, \boldsymbol{\Sigma}) \propto |\boldsymbol{\Sigma} \otimes \boldsymbol{\Gamma}|^{-1/2}\exp\left[-\frac{1}{2}\boldsymbol{\beta}'\left(\boldsymbol{\Sigma}^{-1} \otimes \boldsymbol{\Gamma}^{-1}\right)\boldsymbol{\beta}\right], \tag{26}$$

so, from (17),

$$p(\boldsymbol{\Sigma}|\mathrm{ELSE}) \propto |\boldsymbol{\Sigma}|^{-(p+\nu+T+1)/2}\exp\left\{-\frac{1}{2}\mathrm{tr}\left[\boldsymbol{\Sigma}^{-1}\left(\mathbf{S}_{\beta} + \boldsymbol{\Omega}\right)\right]\right\}, \tag{27}$$

where

$$\mathbf{S}_{\beta} = \sum_{j=1}^{p}\frac{\boldsymbol{\beta}_j\boldsymbol{\beta}_j'}{\gamma_j} \tag{28}$$

is a $T \times T$ matrix. Hence, the conditional posterior distribution of **Σ** is inverse-Wishart with $p + \nu$ degrees of freedom and scale matrix $\mathbf{S}_{\beta} + \boldsymbol{\Omega}$.
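
A minimal sketch of the two inverse-Wishart draws, assuming the fully conditional forms take the standard shapes under a Jeffreys residual prior and an inverse-Wishart prior on Σ (the hyper-parameter values and current states below are illustrative, not those used in the paper):

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(3)
N, T, p = 500, 2, 50

# Pretend these are the current residuals and marker effects at an MCMC round
E = rng.standard_normal((N, T))
S_e = E.T @ E                            # T x T residual sum-of-squares matrix

beta = 0.1 * rng.standard_normal((p, T))
gamma = rng.gamma(shape=(T + 1) / 2, scale=8.0, size=p)
S_beta = (beta.T / gamma) @ beta         # sum_j beta_j beta_j' / gamma_j

nu, Omega = 5, np.eye(T)                 # hypothetical inverse-Wishart hyper-parameters

# One draw from each fully conditional distribution
R0_draw = invwishart.rvs(df=N, scale=S_e, random_state=1)
Sigma_draw = invwishart.rvs(df=p + nu, scale=S_beta + Omega, random_state=2)
print(R0_draw.shape, Sigma_draw.shape)
```

Both draws are symmetric positive-definite by construction, so no projection step is needed inside the chain.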

### MCMC algorithm

Starting values for $\mathbf{R}_0$ and Σ can be obtained “externally” from some estimates of $\mathbf{R}_0$ and **B** (the matrix of variances and covariances of marker effects) calculated with standard methods, such as maximum likelihood; recall that $\mathbf{B} = 4(T+1)\boldsymbol{\Sigma}$, so an estimate of **B** translates directly into a starting value for Σ. Each MCMC round then proceeds as follows:

1. Sample each $\gamma_j$ using the following Metropolis-Hastings sampler: at round *t*, draw $y$ from an inverse-gamma ($IG$) proposal distribution, and evaluate $y$ as proposal. Draw $U \sim \mathrm{Uniform}(0, 1)$, with the probability of move being $\alpha = \min(1, R)$, with *R* as in Appendix C. If $U \leq \alpha$, set $\gamma_j^{(t)} = y$ and form $\boldsymbol{\Gamma}^{(t)}$ as a new state; otherwise, set $\gamma_j^{(t)} = \gamma_j^{(t-1)}$.

2. In a “single-pass” sampler, use (19) and (20) for sampling the entire location vector jointly. Otherwise, adopt a blocking strategy; for example, draw μ and the subvectors of β using (21), (22), (23), and (24).

3. Sample $\mathbf{R}_0$ from (25) and **Σ** from (27).

### Remarks

Appendix **E** shows that the degree of shrinkage of marker effects results from a joint action between **Σ** and the strength of marker effects. A vector of effects of a marker with a short Mahalanobis distance away from **0** is more strongly shrunk toward the origin (*i.e.*, the mean of prior distribution) than vectors containing strong effects on at least one trait. MLAP preserves the spirit of BL, producing “pseudoselection” of covariates: all markers stay in the model, but some are effectively nullified. A marker with strong marginal and joint effects on the traits under consideration could flag potentially pleiotropic regions.

### Missing records for some traits

Often, not all traits are measured in all individuals, a situation that is more common in animal breeding than in plant breeding. A standard approach (“data augmentation”) treats missing phenotypes as unknowns in an expanded joint posterior distribution. As shown in Appendix F, a predictive distribution can be used to produce an imputation of the missing data.
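
The single-missing-trait case can be sketched with the standard conditional-normal partition; the bivariate illustration below uses made-up values for the current trait means and residual covariance (the exact predictive distribution employed is in Appendix F):

```python
import numpy as np

rng = np.random.default_rng(5)

# Current values of the trait means and residual covariance at an MCMC round
mu = np.array([1.0, 2.0])                      # (trait 1, trait 2) means, T = 2
R0 = np.array([[1.0, 0.6],
               [0.6, 2.0]])

y_obs = 1.8                                    # trait 1 recorded, trait 2 missing

# Conditional distribution of the missing trait given the observed one:
# y_2 | y_1 ~ N(mu_2 + r_21 / r_11 * (y_1 - mu_1), r_22 - r_21**2 / r_11)
cond_mean = mu[1] + R0[1, 0] / R0[0, 0] * (y_obs - mu[0])
cond_var = R0[1, 1] - R0[1, 0] ** 2 / R0[0, 0]

y_mis_draw = rng.normal(cond_mean, np.sqrt(cond_var))   # imputed record
print(cond_mean, cond_var)
```

At each MCMC round the imputed value is redrawn, so the uncertainty about the missing phenotype propagates into the posterior of all other unknowns.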

## Alternative Formulation in *N* Dimensions

The MCMC sampler described above is based on a regression-on-markers formulation stemming from either (1) or (4). In a “single-pass” sampler, all location parameters must be drawn together; when *p* is very large, direct inversion of the coefficient matrix involved is typically unfeasible, so the scheme must be reformulated into a “block-sampling” one, *i.e.*, by drawing some of the location parameters jointly, conditioning on the other location parameters, or by using a single-site sampler (Sorensen and Gianola 2002). Blocking or single-site sampling facilitates computation at the expense of slowing down convergence to the target distribution. Appendix D gives a scheme in which the trait means and the genomic breeding values are inferred, and the marker effects are calculated indirectly, following ideas of Henderson (1977) as adapted by Goddard (2009) to a genome-based model.

### Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article. The wheat yield data set in the R package BGLR (Pérez and de los Campos 2014) was employed to contrast MBL with GBLUP and Bayes C*π*. This wheat data set has been studied extensively, *e.g.*, by Crossa *et al.* (2010), Gianola *et al.* (2011, 2016), and Long *et al.* (2011). There are 599 wheat inbred lines, each genotyped with 1279 DArT (Diversity Array Technology) markers, and each planted in four environments. The DArT markers are binary and denote the presence or absence of an allele at a marker locus in a given line. Grain yields in environments 1 and 2 were employed to compare outcomes between analyses based on bivariate GBLUP and the bivariate BL. In the bivariate model, yields in the two environments are treated as distinct traits, conceptually—an idea that dates back to Falconer (1952). This type of setting arises in breeding of dairy cattle, where milk production of daughters of bulls in different countries is regarded as different traits, and in multi-environment situations in plant breeding; both instances can be represented as special cases of a multiple-trait mixed effects model.

A publicly available loblolly pine data set described in Cheng *et al.* (2018a) was used to carry out a predictive comparison between a Bayesian bivariate GBLUP and the bivariate Bayesian LASSO, as well as between the latter and a single-trait Bayesian LASSO. After edits, the data consisted of genotyped individuals with SNP markers and measurements on rust bin scores and rust gall volume—two disease traits; see Cheng *et al.* (2018a). Supplemental material available at figshare: https://doi.org/10.25386/genetics.10689380.

## Bivariate Analysis of Wheat Yield: MBL *vs.* GBLUP

### Genomic BLUP and Bayesian BLUP

The bivariate model was

$$\mathbf{y}_t = \mathbf{1}\mu_t + \mathbf{g}_t + \mathbf{e}_t, \qquad t = 1, 2, \tag{30}$$

where $\mathbf{y}_t$ is the vector of grain yields in environment $t$ of the 599 inbred lines; $\mu_1$ and $\mu_2$ are the trait means in the two environments, and **1** is a $599 \times 1$ incidence vector of ones; $\mathbf{g}_1$ and $\mathbf{g}_2$ are the “additive genomic values” of the lines, and $\mathbf{e}_1$ and $\mathbf{e}_2$ are model residuals. In GBLUP (VanRaden 2008), the genetic signals captured by markers are represented as $\mathbf{g}_1 = \mathbf{X}\boldsymbol{\beta}_{(1)}$ and $\mathbf{g}_2 = \mathbf{X}\boldsymbol{\beta}_{(2)}$, where **X** is a centered and scaled matrix of genotype codes, and $\boldsymbol{\beta}_{(t)}$ contains the marker allele substitution effects on trait $t$. The residual distribution was

$$\begin{bmatrix}\mathbf{e}_1\\ \mathbf{e}_2\end{bmatrix} \sim N\left(\mathbf{0}, \mathbf{R}_0 \otimes \mathbf{I}\right), \tag{31}$$

where, as before, $\mathbf{R}_0$ is the between-trait residual variance–covariance matrix. Effects of environment 1 are expected to be uncorrelated with those of environment 2. However, allowance was made for a non-null residual covariance because the additive genomic model may not capture extant epistasis involving additive effects, potentially creating correlations among residuals of the same lines in different environmental conditions.

GBLUP assumed $\boldsymbol{\beta}_j \sim N(\mathbf{0}, \mathbf{B})$, independently over markers, and so

$$\mathrm{Var}\begin{bmatrix}\boldsymbol{\beta}_{(1)}\\ \boldsymbol{\beta}_{(2)}\end{bmatrix} = \mathbf{B} \otimes \mathbf{I}_p \tag{32}$$

is the variance–covariance matrix of marker effects. It follows that

$$\mathrm{Var}\begin{bmatrix}\mathbf{g}_1\\ \mathbf{g}_2\end{bmatrix} = \mathbf{G}_0 \otimes \mathbf{G}, \tag{33}$$

where

$$\mathbf{G}_0 = p\mathbf{B} \tag{34}$$

is a between-trait variance–covariance matrix of the additive genomic values (here, *e.g.*, the genomic variance of trait 1 is $p$ times the variance of marker effects on that trait), and $\mathbf{G} = \mathbf{X}\mathbf{X}'/p$ is a genomic-relationship matrix describing genome-based similarities among the 599 lines. The preceding assumptions induce the marginal distribution

$$\begin{bmatrix}\mathbf{y}_1\\ \mathbf{y}_2\end{bmatrix} \sim N\left(\begin{bmatrix}\mathbf{1}\mu_1\\ \mathbf{1}\mu_2\end{bmatrix}, \mathbf{V}\right), \qquad \mathbf{V} = \mathbf{G}_0 \otimes \mathbf{G} + \mathbf{R}_0 \otimes \mathbf{I}, \tag{35}$$

where **V** is the phenotypic covariance matrix. The bivariate best linear unbiased predictor of $\mathbf{g}_1$ and $\mathbf{g}_2$ (Henderson 1975) is

$$\begin{bmatrix}\hat{\mathbf{g}}_1\\ \hat{\mathbf{g}}_2\end{bmatrix} = \left(\mathbf{G}_0 \otimes \mathbf{G}\right)\mathbf{V}^{-1}\begin{bmatrix}\mathbf{y}_1 - \mathbf{1}\hat{\mu}_1\\ \mathbf{y}_2 - \mathbf{1}\hat{\mu}_2\end{bmatrix}, \tag{36}$$

where

$$\hat{\boldsymbol{\mu}} = \left[\left(\mathbf{I}_2 \otimes \mathbf{1}\right)'\mathbf{V}^{-1}\left(\mathbf{I}_2 \otimes \mathbf{1}\right)\right]^{-1}\left(\mathbf{I}_2 \otimes \mathbf{1}\right)'\mathbf{V}^{-1}\mathbf{y} \tag{37}$$

is the bivariate generalized least-squares (GLS) estimator of the trait means.

BLUP and GLS require knowledge of $\mathbf{B}$ and $\mathbf{R}_0$, and we replaced these unknown matrices by estimates obtained using a crude, but simple, procedure. Genomic and residual variance components were obtained by univariate maximum likelihood analyses of traits 1, 2, and their sum, and covariance component estimates were calculated from the expression $\mathrm{cov}(1, 2) = \left[\mathrm{var}(1+2) - \mathrm{var}(1) - \mathrm{var}(2)\right]/2$. The resulting estimates of $\mathbf{G}_0$ and $\mathbf{R}_0$ were inside their corresponding parameter spaces. An estimate of B was obtained by applying relationship (34) to the estimated $\mathbf{G}_0$.
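
The covariance-from-sums device can be checked numerically (simulated values below; the identity is exact for sample variances computed with a common divisor):

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulate two correlated "genomic value" vectors with known covariance
n = 100_000
cov_true = np.array([[1.0, 0.4],
                     [0.4, 0.8]])
g = rng.multivariate_normal([0.0, 0.0], cov_true, size=n)
g1, g2 = g[:, 0], g[:, 1]

# Identity: cov(g1, g2) = [var(g1 + g2) - var(g1) - var(g2)] / 2
cov_from_sum = (np.var(g1 + g2, ddof=1)
                - np.var(g1, ddof=1)
                - np.var(g2, ddof=1)) / 2
assert np.isclose(cov_from_sum, np.cov(g1, g2)[0, 1])   # holds exactly in-sample
```

The same algebra applies to estimated variance components: three univariate analyses (trait 1, trait 2, and their sum) suffice to recover a covariance estimate.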

Henderson (1977) showed how BLUP of vectors that are not likelihood-identified can be obtained from best linear unbiased predictions of likelihood-identified random effects (see Gianola 2013). Goddard (2009) and Strandén and Garrick (2009) used this property to obtain predictions of marker effects given predictions of signal $\mathbf{g}$. If β and g have a joint normal distribution under (30), then, using iterated expectations, and recalling that BLUP can be viewed as an estimated conditional expectation (with fixed effects replaced by their GLS estimates), BLUP of marker effects is expressible as in (38). After lengthy algebra, and using Henderson (1975), the prediction error variance–covariance matrix of the BLUP of marker effects is given by (39). A set of standardized statistics can be formed by taking the ratio between the BLUP of a given marker effect, as in (38), and the square root of the corresponding diagonal element of (39). The statistic is a crude criterion for association between marker and phenotype, as it ignores the uncertainty associated with the fact that B and $\mathbf{R}_0$ are estimated from the data, as opposed to being the “true values” required by BLUP theory.
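
The back-solving of Goddard (2009) and Strandén and Garrick (2009) rests on the fact that ridge-type solutions can be computed in either the *p*-dimensional (marker) space or the *N*-dimensional (individual) space. A minimal single-trait sketch of this duality, with a hypothetical shrinkage parameter `lam`:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, lam = 30, 200, 1.5         # individuals, markers, ridge parameter (illustrative)

X = rng.standard_normal((N, p))  # centered/scaled genotype matrix
y = rng.standard_normal(N)

# Direct p-dimensional ridge solution (regression on markers)
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Indirect N-dimensional route: solve in the space of individuals, then
# back-solve marker effects as beta = X' (XX' + lam I)^{-1} y
beta_indirect = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(N), y)

assert np.allclose(beta_direct, beta_indirect)
```

When $p \gg N$, as with dense marker panels, the $N \times N$ system is far cheaper to solve, which is exactly why the indirect route is attractive.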

The Bayesian bivariate GBLUP model used standard assumptions, as in Sorensen and Gianola (2002); *i.e.*, it was a multivariate normal-inverse Wishart hierarchical specification. The only difference with GBLUP is that, in the Bayesian treatment, $\mathbf{G}_0$ and $\mathbf{R}_0$ were treated as unknown parameters, with the uncertainty about their values accounted for.

### Bivariate LASSO

Our MCMC implementation for MBL was applied to markers directly, as opposed to inferring their effects indirectly from signal, as is done for GBLUP. The model was as in (4), with $T = 2$. Each marker was assigned a conditional bivariate Laplace prior distribution with scale matrix **Σ**; in turn, **Σ** was given a two-dimensional (2D) inverse-Wishart prior distribution. The residual variance–covariance matrix was assigned the 2D Jeffreys improper prior in (6).

The MCMC scheme employed the scale mixture of normals representation of the bivariate Laplace distribution. First, six independent chains of 1500 iterations each were run. The shrinkage diagnostic metric of Gelman and Rubin (1992) was calculated for $\mathbf{R}_0$ and **Σ**, for the effect of marker 10 on trait 1, and for the effect of marker 200 on trait 2; the R package CODA was used for this purpose. Figures S4–S13 gave no strong evidence of lack of convergence, as indicated by shrinkage factor values close to 1.

Post burn-in samples were collected for an additional 2000 iterations in each chain, so a total of 12,000 samples (without thinning) was used for inference. Figures S14 and S15 depict post burn-in trace plots for the elements of $\mathbf{R}_0$ and Σ, respectively. The six chains “joined” eventually, and sample values fluctuated thereafter within what seemed to be stationary distributions. To assess convergence further, a test suggested by Geweke (1992) was applied to the combined 12,000 samples from the posterior distributions of the residual correlation between yields in environments 1 and 2, and of the correlation between effects of a marker on the two traits. The test compared means of two parts of the combined collection of 12,000 samples at each of 10 segments of the collection: there was no evidence of lack of convergence. In short, the implementation successfully met the convergence tests applied.
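
The flavor of the Geweke comparison can be sketched with a naive version that uses iid standard errors (CODA's implementation instead estimates standard errors from spectral densities, to account for autocorrelation; all chains below are synthetic):

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Naive Geweke-style z-score: compare the means of the first 10% and the
    last 50% of a chain, using iid standard errors (a simplification)."""
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[int((1 - last) * n):]
    se = np.sqrt(np.var(a, ddof=1) / len(a) + np.var(b, ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

rng = np.random.default_rng(4)
stationary = rng.standard_normal(10_000)                 # well-mixed chain
trending = stationary + np.linspace(0.0, 2.0, 10_000)    # chain still drifting

print(geweke_z(stationary), geweke_z(trending))
```

A stationary chain yields a z-score of moderate magnitude, whereas a drifting chain produces a very large one, flagging lack of convergence.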

Figure 1 and Figure 2 display estimated posterior densities of the residual correlation and of the correlation between marker effects on the two traits. Mixing was poorer for one of the two correlations than for the other: the effective numbers of samples were 220.6 and 979.0, respectively, and Monte Carlo errors were small enough. The residual correlation (posterior mean, 0.17) was positive and different from 0, whereas the correlation between effects of a marker on the two traits was estimated at −0.35, also different from zero. However, the posterior densities were not sharp enough for precise inference, probably due to the small sample size and low density of the marker panel. The quality of these estimates is of subsidiary interest here, as our objective was to demonstrate the MBL in a comparison with bivariate BLUP of marker effects.

Location parameters mixed well. For example, the average effective sample size of the trait means over the six chains during burn-in was 1499 for a nominal 1500 iterations. For the effect of marker 10 on trait 1, it was 962 out of 1500, and, for the effect of marker 200 on trait 2, the effective size was 1130 out of 1500. These numbers suggest that all 2558 marker effects were estimated with a very small Monte Carlo error in our MBL implementation, with 12,000 samples used for inference.

### MBL *vs.* BLUP estimates of marker effects

Figure 3 gives a comparison between bivariate BLUP and posterior mean estimates of effects from MBL. The upper panel shows good alignment between estimates, except at the extremes of the scatter plots. The lower panel shows that markers with the strongest absolute effects, as estimated by BLUP, had an even stronger effect when estimated under the bivariate BL. Figure 4 presents standardized estimates of each of the 1279 marker effects, by trait. For GBLUP, the statistic was the estimated marker effect divided by the square root of its prediction error variance; for MBL, it was the posterior mean divided by its posterior standard deviation. There is no evidence that any of the markers had an effect differing from 0, corroborating the view that wheat yield is a typical quantitative trait affected by many variants, each having small effects (Singh *et al.* 1986; Sleper and Poehlman 2006). Using a univariate least-squares, GWAS-type analysis, there were 29 (yield 1) and 56 (yield 2) significant hits after a Bonferroni correction (1279 tests). A comparison between the statistics from the GWAS-type analysis and the standardized BLUP and MBL effects is provided in Figure 5. As expected, shrinkage toward null-mean distributions (bivariate Gaussian in BLUP and bivariate Laplace in MBL) made the standardized effects much smaller in absolute value than the corresponding ones from GWAS.

Standard GWAS aims to find connections between markers and genomic regions having an effect on a single trait (*e.g.*, Manolio *et al.* 2009; Visscher *et al.* 2012; Gianola *et al.* 2016; Schaid *et al.* 2018). A search for pleiotropy, on the other hand, focuses on markers having multi-trait effects. The latter can be viewed as a search for vectors of effects with non-null coordinates that are distant from a *T*-dimensional **0** origin. Mahalanobis squared distances away from **0** were calculated for each of the 1279 bivariate vectors of marker effects, for both BLUP and MBL. For BLUP and marker $j$, the squared distance was computed as $D_j^2 = \hat{\boldsymbol{\beta}}_j'\hat{\mathbf{B}}^{-1}\hat{\boldsymbol{\beta}}_j$, and for MBL it was $\hat{\boldsymbol{\beta}}_j'\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\beta}}_j$, where $\hat{\boldsymbol{\beta}}_j$ contains the effect estimates for marker *j*, and $\hat{\boldsymbol{\Sigma}}$ is the estimated posterior expectation of Σ. For BLUP, the squared distance had median and maximum values of 0.16 and 2.94, respectively, over the 1279 markers. For MBL, the corresponding values were 0.14 and 3.83. Figure 6 shows that the largest estimated distances were obtained with MBL, supporting the view that the method produces less shrinkage of multiple-trait effect sizes than BLUP. If the 95% percentile of a chi-squared distribution on 2 degrees of freedom (5.99 and 14.4, without and with a Bonferroni-type correction) is used as a “significance threshold,” none of the 1279 markers could be claimed to have a bivariate effect on the traits, which is consistent with the standardized effects discussed above.
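
A sketch of the distance computation on synthetic effect estimates (the effect vectors and scale matrix below are made up; in the paper they would be the BLUP or MBL estimates and the corresponding $\hat{\mathbf{B}}$ or $\hat{\boldsymbol{\Sigma}}$):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
p, T = 1279, 2

# Hypothetical bivariate effect estimates and their estimated scale matrix
B_hat = rng.multivariate_normal([0.0, 0.0], [[0.02, 0.005],
                                             [0.005, 0.03]], size=p)
S_hat = np.cov(B_hat.T)

# Squared Mahalanobis distance of each marker's effect vector from the origin
S_inv = np.linalg.inv(S_hat)
D2 = np.einsum("jt,ts,js->j", B_hat, S_inv, B_hat)

threshold = chi2.ppf(0.95, df=T)        # approximately 5.99 for 2 d.f.
n_beyond = int(np.sum(D2 > threshold))
print(round(threshold, 2), n_beyond)
```

Under a bivariate Gaussian null, roughly 5% of the distances exceed the 95% chi-squared quantile, so exceedance counts near that rate carry no evidence of pleiotropic signal.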

## Predictive Comparison between MBL, MTGBLUP, and MT-BayesC*π*: Wheat

Bivariate Bayesian GBLUP and BayesC*π* models (Cheng *et al.* 2018a) were also fitted to the wheat data set. Multiple-trait Bayesian linear models are well known (*e.g.*, Sorensen and Gianola 2002); BayesC*π* consisted of a bivariate mixture in which each of the 1279 markers was allowed to fall, *a priori*, into one of four disjoint classes: (0, 0), (0, 1), (1, 0), and (1, 1), where (0, 0) means that a marker has no effect on either trait, (0, 1) indicates that a marker affects yield 2 only, and so on. The prior for the four probabilities of membership was a Dirichlet distribution. All three methods were run in each of 100 randomly constructed training sets, and predictions were formed for lines in corresponding testing sets. Training and testing sets had 300 and 299 wheat lines, respectively, in each of the 100 runs. For all methods, the MCMC scheme was a single long chain of 50,000 iterations, with a burn-in period of 1000 draws. The analyses were run using the JWAS package written in the Julia language (Cheng *et al.* 2018b).
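
For intuition about the layout, a toy replication of the training-testing exercise can be sketched as follows (ridge regression stands in for the three learners, on synthetic data; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 599, 300

# Synthetic stand-in for a genomic prediction exercise
X = rng.standard_normal((n, p))
beta = 0.05 * rng.standard_normal(p)
y = X @ beta + rng.standard_normal(n)

lam = 50.0                                 # hypothetical shrinkage parameter
corrs, mses = [], []
for _ in range(100):                       # 100 random training-testing partitions
    idx = rng.permutation(n)
    train, test = idx[:300], idx[300:]
    Xt, yt = X[train], y[train]
    b_hat = np.linalg.solve(Xt.T @ Xt + lam * np.eye(p), Xt.T @ yt)
    y_pred = X[test] @ b_hat
    corrs.append(np.corrcoef(y_pred, y[test])[0, 1])
    mses.append(np.mean((y_pred - y[test]) ** 2))

print(np.mean(corrs), np.std(corrs))       # spread across replicates matters
```

The spread of `corrs` and `mses` over the 100 partitions is precisely the replication variability that single-split comparisons between methods ignore.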

Figure 7 and Figure 8 present pairwise plots (bivariate Bayesian GBLUP denoted as RR-BLUP in the plots) of predictive correlations and predictive mean-squared errors, respectively; the plots display <100 points because numbers were rounded to two decimal places. There were no appreciable differences in predictive performance between the three methods, supporting the view that cereal grain yield is multi-factorial and that there are few, if any, genomic regions with large effects. The variability among replications of the training-testing layout is essentially random, reinforcing the importance of measuring the uncertainty of predictions (Gianola *et al.* 2018). Many studies do not replicate the training-testing partition, and often claim differences between methods based on single realizations of predictive analyses.

## Predictive Comparison between MBL *vs.* MTGBLUP and MBL *vs.* Single Trait Bayesian LASSO: *Pinus*

Figure 9 and Figure 10 present scatter plots of the predictive performance (mean squared error and correlation, respectively) of the bivariate Bayesian LASSO and bivariate Bayesian GBLUP (MTGBLUP, denoted as RR-BLUP in the plots) in the 100 testing sets. There were no obvious differences in mean-squared error for either rust bin or gall volume although, for the latter trait, a slight superiority of MBL was noted (Figure 9); the plot contains only 12 distinct points because overlap in numerical values produced "clusters" of points. On the other hand, there was a decisive superiority (Figure 10) of MBL over MTGBLUP in predictive correlation.

Figure 11 contrasts the predictive performance of the bivariate Bayesian LASSO with that of the single-trait Bayesian LASSO for gall volume. The two-trait analysis tended to produce larger predictive correlations and smaller mean-squared errors, illustrating instances in which a multiple-trait specification clearly constitutes a better prediction machine.

## Conclusion

Our study is possibly the first report in the quantitative genetics literature of a MBL, inspired by the BL of Park and Casella (2008). MBL assumes that vectors of marker effects on *T* traits follow a null-mean multivariate Laplace distribution, *a priori*. This distribution has a sharp peak at the origin, and reduces to the DE prior of the BL when applied to a single trait. The implementation of MBL requires MCMC sampling, and a relatively simple Metropolis-Hastings algorithm based on a scale mixture of normals representation (Gómez-Sánchez-Manzano *et al.* 2008) was presented. The algorithm was tested thoroughly with a wheat data set and found to mix well, with no evidence of lack of convergence to the posterior distribution, and with a small Monte Carlo error.

A question that arises often in practice is the extent to which a multiple-trait method will produce better performance than a single-trait specification. If the parameters of the model (assuming it holds) representing the inter-trait distribution are either known or well estimated, one should expect more power for QTL detection and a better predictive performance for the multivariate specification. In our study, we found that MBL outperformed the single-trait Bayesian LASSO in terms of delivering better predictive performance for gall volume, but not for rust bin, in *Pinus*. On the other hand, a multiple-trait analysis is more complex and requires more assumptions, so it may be less robust than a single-trait procedure, and fail to deliver according to expectation in real-life circumstances. It is risky to make sweeping statements arguing in favor of a specific treatment of data, as outcomes are heavily dependent on the biological architecture of the traits considered, and on the data structure as well. The picture emerging from two decades of experience in genome-enabled prediction in the fields of animal and plant breeding is that, in view of the large variability of performance with respect to data structure for any given prediction machine, it is largely futile to categorize methods in terms of expected predictive performance using broad criteria (Morota and Gianola 2014; Gianola and Rosa 2015; Momen *et al.* 2018; Montesinos-López *et al.* 2018a,b, 2019a,b; Azodi *et al.* 2019).

MBL is expected to shrink more strongly toward zero vectors of markers with small effects in their coordinates, thus producing differential shrinkage and preserving the *modus operandi* of BL. Mimicking the single-trait argument in Tibshirani (1996), which shows equivalence between LASSO and a posterior mode, the representation in Appendix E illustrates that the degree of shrinkage of the vectorial effects of a marker (*j*, say) on a set of traits is inversely proportional to the Mahalanobis distance of that vector from the origin. Thus, multivariate Bayesian pseudosparsity is induced by MBL to an extent depending on the heterogeneity of these distances over markers. We note, in passing, that the corresponding term in (66) in Appendix E is the counterpart of the "group-penalty" in Li *et al.* (2015), where a group *g* is some meaningful set of markers arrived at, say, on the basis of biological considerations. The latter penalty assigns the same weight to the group regressions over traits, contrary to MBL, where weights are driven by Σ. The BL or MBL can be adapted to situations where a group structure may be needed via hierarchical modeling; this fairly straightforward issue is outside the scope of the paper, but may be pursued in future extensions of MBL. Actually, Liquet *et al.* (2017) described a Bayesian multiple-trait analysis where a LASSO-type penalty is assigned to group effects, and a spike-and-slab prior induces additional Bayesian sparsity at the level of individual regression coefficients. The authors did not address the predictive ability of their method, so it would be interesting to compare it against MBL and the multiple-trait mixture model of Cheng *et al.* (2018a). We plan to carry out this comparison in collaboration with CIMMYT (Centro Internacional de Mejoramiento de Maíz y Trigo, México) using a large number of data sets in various cereal crops.

Knowledge of the genetic basis of complex traits is limited, and not vast enough to enable formulation of *a priori* prescriptions for any specific trait or situation. The number, location, and effects of causal variants, the linkage disequilibrium structure between such variants and markers, and the mode of gene action of QTL are largely unknown, this holding for all species of domesticated plants and animals, and for most common diseases in humans. Theoretically, MBL is expected to perform better than multiple-trait BLUP whenever appreciable heterogeneity exists over the effects of the markers in the panel employed, while behaving as multiple-trait GBLUP when all markers have tiny and similar effects. This consideration follows directly from the structure of the method, and computer simulations could be easily tailored to create scenarios where MBL has a better or a worse performance simply by design but without necessarily being relevant to a real-life inferential or predictive problem.

The expectation stated above was verified empirically: markers with stronger (positive or negative) effects on the wheat yields examined had larger Mahalanobis distances away from zero than markers with small effects. Further, markers with short distances in GBLUP had even shorter distances under MBL. Neither of the two methods was able to detect variants having a strong effect on wheat yield, contrary to least-squares GWAS. However, outcomes from GWAS are not strictly comparable with those from shrinkage-based procedures. In single-marker least-squares, the estimator is potentially biased because other genomic regions are ignored in the model; further, short- and long-range linkage disequilibria create statistical ambiguity (Gianola *et al.* 2016). In WGR, on the other hand, regressions are akin to partial derivatives, *i.e.*, the coefficient gives the net effect of the marker given that the other markers are fitted; typically, regressions become smaller as *p* is increased at a fixed sample size.

In plant and animal breeding, a focal point is the evaluation of the genetic merit of candidates for artificial selection, and the prediction of expected performance in either collateral relatives or in descendants. Under the assumption of additive inheritance, genome-enabled prediction (Meuwissen *et al.* 2001) produces estimates of marked additive genomic value, or signal, as referred to in our paper. In MBL, signal and marker effects can be inferred from their posterior means, or from a modal approximation (MAP-MBL), described in Appendix E, that does not involve MCMC. A rough comparison between GBLUP, MBL, and MAP-MBL was carried out with the wheat data. For the latter, starting values for the iteration were calculated using BLUP estimates of marker effects, and the MAP equations were iterated for 500 rounds. Figure S16 shows that, at iteration 500, the metric used for monitoring convergence had stabilized at the third decimal place but, for our purposes, iteration could have stopped after 200 rounds. Figure S17 presents a scatter plot of the 2558 (bivariate) marker effect solutions at iterations 1 and 500 against the corresponding BLUP or MBL posterior mean estimates. Clearly, the MAP approach gave markedly different results, producing stronger shrinkage toward 0 of small-effect markers and, thus, an effectively sparser model. Figure S18 compares the fitted genomic values for the two traits. GBLUP and MBL estimates were closely aligned and fit the data in a similar manner. On the other hand, MAP-MBL gave a larger mean-squared error of fit, and a smaller correlation between fitted and observed phenotypes, possibly because of the larger effective sparsity of MAP-MBL. A worse fit to the data does not necessarily imply poorer predictive ability. A thorough comparison of predictive ability between MBL and MAP-MBL will be carried out in future research.

Our predictive comparison in wheat involved three bivariate models: GBLUP, MBL, and BayesC*π*, the latter employing Bayesian model averaging. A training-testing validation replicated 100 times at random indicated no differences among methods. However, it was found that MBL was better than multiple-trait Bayesian GBLUP for the two pine tree traits. After almost two decades of genome-enabled prediction, it is now clear that no universally best prediction machine exists (Gianola *et al.* 2011; Heslot *et al.* 2012; de los Campos *et al.* 2013; Momen *et al.* 2018; Bellot *et al.* 2018; Montesinos-López *et al.* 2018a,b, 2019a,b), even when nonparametric or deep learning techniques are brought into the comparisons. For this reason, we refrain from making any sweeping claim about the superiority (inferiority) of MBL over any other competing multiple-trait Bayesian regression method. If the situation is such that multiple-trait vectors of effects are fairly homogeneous over markers, it is to be expected that most methods will have a similar performance. On the other hand, if there is underlying heterogeneity of vectors of effects, possibly reflecting "structural sparsity" at the QTL level, it is quite likely that MBL and multiple-trait Bayesian variable selection methods (*e.g.*, Cheng *et al.* 2018a) will outperform MTGBLUP or a multiple-trait version of Bayes A. Unfortunately, it is difficult to anticipate *a priori* which method will deliver the best performance, given the limited knowledge of the biological architecture of complex traits, the strong influence of the data structure, and the variability of X matrices over data sets, in dimension and content.

As far as we know, our paper represents the first report in the quantitative genetics literature of a multiple-trait LASSO, implemented in a Bayesian or empirical Bayes (Appendix E) manner. MBL adds to the armamentarium of genome-enabled prediction, and expands the family of members of the Bayesian alphabet (Gianola *et al.* 2009; Habier *et al.* 2011; Gianola 2013). Further, it has been implemented in the publicly available JWAS software (Cheng *et al.* 2018b). We take the view that every prediction problem is unique, and that no claims about the superiority of a specific procedure over others should be made without qualification. For instance, MBL could perform worse or better than here when applied to other species, traits, or when confronted with different data structures. Most quantitative and disease traits are truly complex, and it is dangerous to offer simplistic solutions or predictive panaceas (Goddard *et al.* 2019).

## Acknowledgments

Chris-Carolin Schön is thanked for useful discussions and comments. J.M. Marín provided technical information on the MLAP distribution. Two anonymous reviewers are thanked for helpful suggestions. Financial support to D.G. was provided by the Deutsche Forschungsgemeinschaft (grant No. SCHO 690/4-1 to CCS) and by the J. Lush Endowment as Visiting Professor at Iowa State University in 2018. The University of Wisconsin-Madison, the Technical University of Munich (TUM) Institute for Advanced Study, TUM School of Life Sciences, and the Institut Pasteur de Montevideo, Uruguay, are thanked for providing office space, library and computing resources, and logistic support.

## Appendices

### Appendix A: Excursus on the MLAP distribution

#### Three bivariate Laplace distributions

For illustration, consider three bivariate Laplace distributions, all having null means but distinct scale matrices, given in (40). Using (7), the density under the first scale matrix is given in (41). The covariance matrix in this case is diagonal, so the two random variables are uncorrelated but not independent, since (41) cannot be written as the product of two marginal densities. Under the second and third scale matrices, the densities are (42) and (43), respectively. Five bivariate Laplace densities are shown in Figure S1: (a) gives the density for the uncorrelated case, and (b) and (c) show the positively and negatively correlated situations, respectively. These three densities have a sharp mode at the origin, indicating that a bivariate Laplace prior would strongly shrink vectors toward that point, acting similarly to the DE prior in the univariate Bayesian LASSO. (d) and (e) display bivariate Laplace densities of distributions with non-null means.

#### Conditional distributions

Dropping the subscript *j* denoting a specific marker, partition the vector of effects into two subvectors of orders *T*₁ and *T*₂, respectively; recall that *T* is the number of traits. Correspondingly, partition the scale matrix as in (44). According to J. M. Marín (personal communication), the conditional distribution of one subvector given the other has the density in (45). Similar to the multivariate normal distribution, the conditional expectation is linear in the conditioning variable.

#### Simulation of a multivariate Laplace distribution

Gómez *et al.* (1998) showed that *S* independent draws from a MLAP distribution with a null mean vector can be made as in (46), where the matrix employed results from the Cholesky decomposition of the scale matrix, u is a vector uniformly distributed on a *T*-dimensional unit sphere, and *r* is a realization of a Gamma distribution with shape parameter *T* and scale 2. Vector u can be simulated by effecting *T* independent draws from a standard normal distribution, and then forming each element of u by dividing the corresponding draw by the norm of the vector of draws.
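The recipe above can be sketched as follows; the draw is assumed to take the form sqrt(*r*) L u with the scale matrix factored as L L', and the particular bivariate scale matrix used is a stand-in for those in (40):

```python
import numpy as np

rng = np.random.default_rng(0)

def rmlap(S, sigma, rng):
    """Draw S samples from a null-mean multivariate Laplace distribution
    via the representation sqrt(r) * L @ u, where sigma = L @ L.T,
    u is uniform on the unit sphere, and r ~ Gamma(shape=T, scale=2)."""
    T = sigma.shape[0]
    L = np.linalg.cholesky(sigma)
    z = rng.standard_normal((S, T))
    u = z / np.linalg.norm(z, axis=1, keepdims=True)  # uniform on the sphere
    r = rng.gamma(shape=T, scale=2.0, size=S)
    return np.sqrt(r)[:, None] * u @ L.T

sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
y = rmlap(200_000, sigma, rng)

# Under this construction E[r] = 2T and E[uu'] = I/T, so Cov(y) = 2 * sigma;
# the sample covariance of the draws should reflect this.
print(np.round(np.cov(y, rowvar=False), 2))
```

The sample covariance is proportional to the scale matrix, consistent with the draws coming from an elliptically contoured distribution.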

Marginal distributions for the three bivariate Laplace distributions with the scale matrices given above were estimated by sampling a large number of independent realizations; Equation 46 was employed. Using the samples, zero-mean DE and normal distributions with the same variances were fitted, and the resulting densities were compared with the estimated densities based on the draws. As shown in Figure S2, a normal distribution provided a poor approximation to the marginals from the three bivariate Laplace cases, and the sharp peak at 0, characteristic of a DE density, was not matched by such marginals. This corroborates theoretical results in Gómez *et al.* (1998): marginals from MLAP distributions are elliptically contoured and not DE.

### Appendix B: Mean vector of location parameters given ELSE

Consider Equation 19. Using the definitions in (47), expansion of the Kronecker products in Equation 19 produces the system in (48). Observe how phenotypes for all traits contribute to the solutions of trait-specific marker effects.

### Appendix C: Sampling from the conditional posterior distribution of the mixing variables

Consider (29), and let *Q* take a few illustrative values. Numerical integration of (29) between 0 and 1000 yields the integration constants as reciprocals of the resulting integrals, with the normalized densities shown in Figure S3. The distributions are skewed and, as *Q* increases, the density becomes flatter.

Let the Lévy density of a positive random variable *Y* having a positive stable distribution with parameter *σ* be as defined in Samorodnitsky and Taqqu (2000). From Gómez *et al.* (1998) and Gómez-Sánchez-Manzano *et al.* (2008), the Lévy density is given by (49), which is that of an inverse Gamma distribution. Consider now the transformation employed by Gómez *et al.* (1998), so that, using (29), one obtains (50). Consider a Metropolis-Hastings ratio *R* constructed using the inverse Gamma density in (49) as proposal distribution: let a value be proposed, with the current state being a member of the target distribution. The ratio is then given by (51). Hence, a proposal drawn from the inverse Gamma distribution is accepted, as a member of the conditional posterior distribution, with probability equal to *R* above; otherwise, the current value is retained.

### Appendix D: Alternative algorithm for indirect sampling of marker effects

An alternative sampling scheme that uses an equivalent formulation of the model is presented; a two-trait situation is employed for ease of presentation. Let g₁ and g₂ be the genomic values of the *N* individuals for each of the two traits. A model could be as in (52), where residuals are as before. In a standard genomic best linear unbiased prediction (GBLUP; VanRaden 2008) setting, it is assumed that the genomic values follow (53), where G is an *N* × *N* marker-based matrix of "genomic relationships," and (54) is a matrix containing the trait-specific genomic variances and their covariance. Specifically, the genomic variances and the covariance between marker effects on traits 1 and 2 follow as in (55). Let B be the variance-covariance matrix of marker effects.

For a bivariate Bayesian LASSO model, conditionally on the mixing variables, one has (56) and, hence, (57). Further, (58) and (59) hold. After assigning a flat prior to each of the fixed effects, standard results give that the posterior distribution of the genotypic values, given the dispersion parameters, is normal, with mean vector (60); further (Henderson 1975), its covariance matrix is (61). Hence, draws from the conditional posterior distribution of the genomic values can be obtained by sampling from a multivariate normal distribution with mean vector (60) and covariance matrix (61).

Assuming that, given the genomic values and dispersion parameters, the vector of marker effects has a multivariate normal distribution, one obtains (62). Now, (63) holds, so (64) follows; a similar result yields

(65).

### Appendix E: A conditional posterior mode approximation to marker effects

Despite important advances in high-throughput computing, routine genetic evaluation of plants and animals is seldom done with MCMC methods. As an alternative to MCMC, we describe an iterative algorithm that produces point estimates of marker effects (and of linear functions thereof) and approximate measures of uncertainty in a computationally simpler manner. The algorithm uses a reweighted set of linear "mixed model equations," for which extremely efficient solvers exist. It is assumed that "good" estimates of the residual covariance matrix and of B (the variance-covariance matrix of marker effects) are available. From (8), an assessment of the scale matrix of the MLAP distribution is then easily obtained.

We make use of (2) and (7), but employ the "markers within trait" representation given in (4). The logarithm of the joint (conditional on the dispersion matrices) posterior density of location effects is, apart from a constant, given by (66). Let the two terms in (66) be as in (67); their gradients are (68) and (69). Observe now that the relationship between β and marker effects sorted within traits can be expressed through a nonsingular matrix L of elementary operators that rearranges rows and columns, as illustrated in (70) for a two-trait case. Since L is a matrix of elementary operators, it is orthogonal, and the absolute value of the Jacobian of the transformation is equal to 1. The contribution of the prior to the gradient for marker effects is then (71), where the weight for marker *j* involves the Mahalanobis distance of its effect vector away from 0. Hence, the vector of derivatives with respect to all marker effects, sorted by traits within individuals, is (72), where the diagonal matrix involved has typical elements given by the reciprocals of these distances. Rearranging the differentials such that the sorting is by markers within traits gives (73). Collecting (69) and (73) yields (74). Setting (68) and (74) simultaneously to 0 produces the (not explicit) system of equations (75). Expanding the equations yields (76), where *b* indexes the iterate round. The coefficient matrix changes at every round of iteration, so the system needs to be reconstituted repeatedly. Marker effects producing small values of the Mahalanobis distance away from 0 result in large diagonal elements and, consequently, vectors of markers with weak effects are more strongly shrunk toward the 0 coordinate than those having strong effects in at least one trait.
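The reweighting idea can be sketched as an iteratively reconstituted ridge-type system; this is a schematic reading of the algorithm under simplifying assumptions (no fixed effects, known residual covariance R0 and scale matrix Σ, simulated inputs), with the penalty for marker *j* taken as Σ⁻¹ divided by that marker's Mahalanobis distance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, T = 80, 40, 2  # individuals, markers, traits

# Simulated stand-ins for the data and the "good" dispersion estimates.
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, T))
R0 = np.array([[1.0, 0.2], [0.2, 1.0]])          # residual covariance
Sigma = np.array([[0.01, 0.002], [0.002, 0.01]])  # MLAP scale matrix

R0inv, Sinv = np.linalg.inv(R0), np.linalg.inv(Sigma)
A = np.kron(R0inv, X.T @ X)                       # data part of the system
rhs = np.kron(R0inv, X.T) @ Y.T.flatten()         # right-hand side

# Start from a ridge-type solution, then iterate the reweighted system:
# the penalty block for marker j is Sigma^{-1} / d_j.
theta = np.linalg.solve(A + np.kron(Sinv, np.eye(p)), rhs)
for it in range(200):
    B = theta.reshape(T, p).T                     # p x T matrix of effects
    d = np.sqrt(np.einsum("jt,ts,js->j", B, Sinv, B))
    d = np.maximum(d, 1e-8)                       # guard near-zero vectors
    new = np.linalg.solve(A + np.kron(Sinv, np.diag(1.0 / d)), rhs)
    delta = np.max(np.abs(new - theta))
    theta = new
    if delta < 1e-8:
        break

print("final max change:", delta)
```

Markers whose effect vectors shrink toward the origin acquire ever-larger penalty blocks, reproducing the differential shrinkage described in the text.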

The variance–covariance matrix of the conditional posterior distribution can be approximated as(77)with ∞ indicating parameters evaluated at converged values, assuming that convergence has been attained at a hopefully global mode.

### Appendix F: Treatment of missing data

Let a multi-trait data point on individual *i* be represented as in (78), where "miss" denotes a missing record; for example, with two traits, a record could be missing for trait 1 or for trait 2, with the remaining entries representing the phenotypes observed in individual *i*. The posterior predictive distribution of the missing records has the density in (79), provided that data points in *i* are conditionally independent, given the model parameters, of any other in the sample, with y being all observed data. The preceding formula implies that missing records can be imputed by sampling the model parameters from their posterior distribution, and then drawing from (80). Since the sampling model is normal, the conditional distribution of missing given observed traits is normal as well, as in (81), where the conditional mean involves the submatrix of residual covariances between missing and observed traits and the part of the residual covariance matrix pertaining to observed phenotypes; the conditional covariance matrix is given in (82). For example, let *T* = 3 and suppose that trait 1 is missing in case 250, but that traits 2 and 3 have been recorded; the pertinent partitions are given in (83) and (84). In the MCMC algorithm, missing data are sampled independently across cases, but dependently within case, by addressing the pattern of missingness peculiar to each observation. Samples for missing observations can be used to estimate predictive distributions for the missing data (Gelfand *et al.* 1992; Sorensen and Gianola 2002; Gelman *et al.* 2014).
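The conditional-normal imputation step can be sketched as follows; the mean vector, residual covariance, and observed values are illustrative stand-ins, and the function names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

def impute_missing(y, miss, mu, R0, rng):
    """Draw the missing entries of a multi-trait record from their
    conditional normal distribution given the observed traits.
    y: length-T record (entries flagged in `miss` are ignored),
    mu: mean vector, R0: T x T residual covariance,
    miss: boolean mask of missing traits."""
    obs = ~miss
    R_oo = R0[np.ix_(obs, obs)]
    R_mo = R0[np.ix_(miss, obs)]
    R_mm = R0[np.ix_(miss, miss)]
    # Conditional mean: mu_m + R_mo R_oo^{-1} (y_o - mu_o)
    cmean = mu[miss] + R_mo @ np.linalg.solve(R_oo, y[obs] - mu[obs])
    # Conditional covariance: R_mm - R_mo R_oo^{-1} R_om
    ccov = R_mm - R_mo @ np.linalg.solve(R_oo, R_mo.T)
    return rng.multivariate_normal(cmean, ccov)

# T = 3 example mirroring the text: trait 1 missing, traits 2 and 3 recorded.
mu = np.array([1.0, 2.0, 3.0])
R0 = np.array([[1.0, 0.5, 0.3],
               [0.5, 1.0, 0.4],
               [0.3, 0.4, 1.0]])
y = np.array([np.nan, 2.5, 2.8])
miss = np.isnan(y)
print(impute_missing(y, miss, mu, R0, rng))
```

Within the sampler, this draw would be repeated per case at each MCMC iteration, with mu and R0 replaced by their current sampled values.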

## Footnotes

Supplemental material available at figshare: https://doi.org/10.25386/genetics.10689380.

*Communicating editor: M. Sillanpää*

- Received July 5, 2019.
- Accepted December 20, 2019.

- Copyright © 2020 by the Genetics Society of America

Available freely online through the author-supported open access option.