## Abstract

Bayesian hierarchical shrinkage methods have been widely used for quantitative trait locus mapping. From the computational perspective, the application of the Markov chain Monte Carlo (MCMC) method is not optimal for high-dimensional problems such as the ones arising in epistatic analysis. *Maximum a posteriori* (MAP) estimation can be a faster alternative, but it usually produces only point estimates without providing any measures of uncertainty (*i.e.*, interval estimates). The variational Bayes method, stemming from the mean field theory in theoretical physics, is regarded as a compromise between MAP and MCMC estimation, which can be efficiently computed and produces the uncertainty measures of the estimates. Furthermore, variational Bayes methods can be regarded as the extension of traditional expectation-maximization (EM) algorithms and can be applied to a broader class of Bayesian models. Thus, the use of variational Bayes algorithms based on three hierarchical shrinkage models including Bayesian adaptive shrinkage, Bayesian LASSO, and extended Bayesian LASSO is proposed here. These methods performed generally well and were found to be highly competitive with their MCMC counterparts in our example analyses. The use of posterior credible intervals and permutation tests are considered for decision making between quantitative trait loci (QTL) and non-QTL. The performance of the presented models is also compared with R/qtlbim and R/BhGLM packages, using a previously studied simulated public epistatic data set.

THE general goal of quantitative trait locus (QTL) mapping and association studies is to find certain genomic regions, or QTL, that have nonnegligible contributions to the distribution of quantitative traits (*e.g.*, Xu 2003). In a genome-wide set of markers, the majority of the markers may not link to the QTL and should have zero effects in theory. The ordinary least-squares estimates of marker effects may show large variances. When epistasis is investigated, all possible pairs of marker-by-marker interactions are typically included into the model, which drastically increases the model dimension. This usually leads to an oversaturated model, where the number of marker effects can be many times larger than the sample size (the number of individuals). These problems motivate the implementation of variable selection, including only a subset of important loci in the model and excluding markers irrelevant to the phenotype (*e.g.*, see Broman and Speed 2002; Sillanpää and Corander 2002), and parameter regularization, assigning a penalty term to shrink the marker effects toward zero (Tibshirani 1996). Correspondingly, two classes of Bayesian models (*e.g.*, see O’Hara and Sillanpää 2009) have been developed and used for QTL mapping:

Bayesian variable selection models: The first class is based on a two-component mixture prior distribution assigned to each marker/locus effect. In the first component, the distribution of the marker effects has a narrow Gaussian distribution (or point mass) at zero, while in the other component the marker effects can vary in a large range of values. An auxiliary indicator variable is then used to specify which component a marker belongs to. Examples include stochastic search variable selection (SSVS) (George and McCulloch 1993; Yi

*et al.*2003), Bayes B type methods (Meuwissen*et al.*2001; Sillanpää and Bhattacharjee 2005; Habier*et al.*2007), and a Bayesian classification approach introduced by Zhang*et al.*(2005).Bayesian shrinkage models: The second class is based on hierarchical shrinkage modeling, where hierarchical sparsity-inducing priors are used to shrink unimportant marker effects toward zero. Proposed models include Bayesian adaptive shrinkage (Xu 2003), the Student’s

*t*model (Yi and Xu 2008; Yi and Banerjee 2009), Bayesian LASSO (Park and Casella 2008; Yi and Xu 2008), and extended Bayesian LASSO (Mutshinda and Sillanpää 2010). In class 1, QTL occupancy probabilities and corresponding Bayes factors can be used to judge QTL (Knürr*et al.*2011). In class 2, after performing the shrinkage estimation of the marker effects, criteria such as posterior credible intervals (Li*et al.*2011) or permutation tests (Xu 2003) can be used to decide the location of the real QTL signals. Note, however, that rigorous decision making between QTL and non-QTL is still an open research problem in Bayesian shrinkage models (Heaton and Scott 2010).

After choosing a particular Bayesian model for QTL analysis, the next issue is to estimate the posterior distribution. Overall, there are two families of methods for reaching this goal. The first family consists of stochastic methods, and the other consists of deterministic point-estimation algorithms. Among the stochastic methods, the most commonly used technique is Markov chain Monte Carlo (MCMC) simulation (Robert and Casella 2004). By using this, we are able to generate a sequence of dependent samples for each unknown parameter from our target (posterior) distribution, and these samples can be used to estimate distributional characteristics, *e.g.*, the posterior mean and the parameter uncertainty around it. So far, most of the proposed Bayesian regression methods in QTL mapping have been based on MCMC methods. However, MCMC, as well as many other sampling-based methods, has huge computational demands when applied to large-scale data problems. Therefore, deterministic methods are usually preferable for large-scale data sets, since they can be faster in terms of computation time. A typical example of such a method is the computation of the posterior mode, or so-called *maximum a posteriori* (MAP) estimation. Numerical optimization algorithms, including the Newton–Raphson method, coordinate descent, and expectation maximization (EM) are often used for this task (Gelman *et al.* 2004; Zhang and Xu 2005; Yi and Banerjee 2009; Xu 2010). A third choice, called the variational Bayes (VB) method (Jaakkola and Jordan 2000; Grimmer 2011), can produce a factorized approximated distribution to the targeted posterior in a deterministic manner. It has high computational speed and moreover can give an uncertainty measure for the unknown coefficients.

In this article, we propose three variational Bayes algorithms for Bayesian shrinkage-based QTL and epistasis analysis. The hierarchical shrinkage models that we focus on here are Bayesian adaptive shrinkage (Xu 2003), Bayesian LASSO (Yi and Xu 2008), and extended Bayesian LASSO (Mutshinda and Sillanpää 2010). All of them were originally implemented by MCMC methods. The former two models have been gaining increased interest for QTL analysis, while the latter one was recently developed and therefore needs to be investigated more carefully. To our knowledge, the application of the VB method for quantitative trait locus analysis is very limited. A recent publication is Logsdon *et al.* (2010), which implements a VB algorithm on the basis of the Bayesian classification model of Zhang *et al.* (2005) for the genome-wide association study of human data. Another related work is Armagan (2009), in which a variational bridge regression method was proposed, which is closely related to Bayesian LASSO, but produces sparse models using a different mechanism.

The structure of this article is as follows. *Methods* introduces the Bayesian and frequentist shrinkage-based multiple regression models commonly used in QTL and epistatic analysis. Then we cover the general VB theory and next present the VB algorithm for these shrinkage models. In *Example analyses*, we demonstrate the efficiency of our VB methods for estimating QTL and epistatic effects by using the North American Barley data (Tinker *et al.* 1996) and simulated epistasis data from Xu (2007). Comparisons are done from two perspectives: (1) the difference among the three Bayesian shrinkage models and (2) the difference between VB and MCMC computation. Finally, we discuss the strengths and weaknesses of our VB approaches.

## Methods

### Bayesian shrinkage-based regression models

We consider a standard multiple linear regression model of the form(1)For QTL mapping, *y _{i}* represents the phenotypic value of the

*i*th individual in the mapping population, and

*x*is the genotypic value of individual

_{ij}*i*at marker

*j*, coded 1 for genotype

*AA*and −1 for

*BB*or

*AB*in a doubled haploid (DH) or a backcross (BC) population resulting from inbred line cross experiments with only two possible genotypes at each marker. β

_{0}is the intercept, β

*is the effect of locus*

_{j}*j*, and

*e*is the random error for individual

_{i}*i*, following a normal distribution with mean 0 and unknown variance In addition to the main QTL effects, the interaction between some pairs of markers may also contribute to the phenotypic variation. To study this, model (1) can be extended to(2)where β

*represents the pairwise interaction effect of the marker pair (*

_{uv}*u*,

*v*). Recoding the indexes of Equation 2 properly leads to the simpler expression in (1), where the design matrix

**X**now contains markers and also their interactions (

*e.g.*, see Zhang and Xu 2005). Thus, throughout the article, we consider (1) as our model.

Our interest is to estimate the marker effects (possibly including interaction effects) β* _{j}* for

*j*= 1, 2, ⋅ ⋅ ⋅ ,

*p*, to detect the statistical association between markers and phenotypes. In the Introduction, we pointed out that a class of Bayesian shrinkage regression (BSR) approaches has been discussed and used widely for estimating the marker effects. Many of them are related to the frequentist shrinkage approaches, including ridge regression (Hoerl and Kennard 1970) and LASSO (Tibshirani 1996). Ridge regression can be specified as(3)where λ ≥ 0. The ℓ

_{2}norm penalty function plays a role of shrinking the regression coefficients toward zero. Shrinkage factor λ is a tuning parameter, which determines the level of the shrinkage. λ is usually selected externally by a model selection criterion such as cross-validation, Akaike information criterion (AIC), or Bayesian information criterion (BIC). LASSO, on the other hand, uses an ℓ

_{1}norm penalty and is defined as(4)In contrast to ridge regression, LASSO is able to shrink some of the marker effects exactly to zero because of the nature of the ℓ

_{1}penalty.

Now let us go back to BSR. In BSR, the marker effects are assigned independent hierarchical priors for each marker. Note that through the article, we use the notation *p*(•|parameters) to represent a density function. For example, represents the normal density function with mean 0 and variance . Further sparsity-inducing prior distributions are chosen for nuisance parameters to guarantee that each marker effect has a high probability to be zero on the basis of the assumption of the model. Note that we assume the same amount of shrinkage for all effects in our study regardless of them being main or interaction effects (*cf*. Sillanpää 2009). In addition, noninformative priors can be used for other unknown parameters in (1), for example, *p*(β_{0})∝1 and . These hierarchical prior settings actually play a similar role to the penalty terms in ridge regression or LASSO, which shrink the marker effects toward zero. The choices of the priors for the variances usually determine the behavior of the shrinkage. In the following, we list three possible choices, which have been recently proposed and used for mapping quantitative traits:

Bayesian adaptive shrinkage (BAS): An improper prior was used by Xu (2003) for mapping quantitative traits. This can be regarded as an extreme case of an inverse gamma prior with hyperparameters α and β [in this article, the parameterization of the inverse gamma, gamma, and exponential distributions follows the book by Gelman

*et al.*(2004)] that together determine the amount of shrinkage and the degree of model sparsity. By setting α = 0 and β = 0, the inverse gamma distribution leads to the improper prior (Tipping 2001; Figueiredo 2003).Bayesian LASSO (BL): The prior for is an exponential distribution (with λ>0). It has been demonstrated that under this hierarchical structure of the priors, the marginal prior of β

is a Laplace distribution (also called double-exponential distribution) with the density(5)_{j}

This has been recognized as a Bayesian alternative to the LASSO (Park and Casella 2008). Here λ is also used for determining the degree of the model sparsity. Different from LASSO in which λ should be selected externally, in the Bayesian LASSO, Gamma(γ, υ) with the predetermined hyperparameters γ and υ can be used as a conjugate prior for λ^{2} (other nonconjugate priors are possible), so that λ can be updated through the Bayesian updating scheme. The applications of Bayesian LASSO to QTL mapping include Yi and Xu (2008), Xu (2010), and Li *et al.* (2011).

Extended Bayesian LASSO (EBL): Mutshinda and Sillanpää (2010) generalized the regular setting of the Bayesian LASSO prior as , where δ is a global shrinkage factor like λ in Bayesian LASSO, and η

is specified individually for each marker. [In our work, the parameterization of δ and η_{j}differs slightly from that in Mutshinda and Sillanpää 2010, so that here we use , which is in agreement with the common parameterization of Bayesian LASSO. More details can be found in the_{j}*Discussion*.] By setting our own shrinkage factors for each individual marker effect, the model has more flexibility for controlling the amount of the local shrinkage and choosing model sparsity. A similar approach called Bayesian adaptive LASSO has been proposed in Sun*et al.*(2010) with different parameterization. An earlier related frequentist version is adaptive LASSO (Zou 2006).

In the following, we turn to the general theory of variational Bayes approximation.

### A brief introduction to variational Bayes estimation

The development of VB estimation was motivated by the fact that in the full Bayesian analysis, many posterior distributions we are interested in are intractable, meaning that it is not possible to derive the marginal likelihood for each parameter analytically. By using VB estimation, we are aiming to find a tractable distribution that can approximate the target posterior. Here, we specially refer to a variational approximation method that seeks a factorized distribution to approximate the original posterior distribution. This VB method was originally adapted from the mean field theory (Parisi 1988) in theoretical physics and has been widely used for various Bayesian inferential problems in the fields of machine learning and signal processing (Beal 2003; Bishop 2006; Šmídl and Quinn 2006). In this section, we briefly go through the foundation of the VB method. More detailed introduction to the topic can be found, for example, in the tutorial of Tzikas *et al.* (2008). Specifically, let us assume that our target posterior distribution has the form *p*(**θ**|data) with unknown parameters **θ** = (θ_{1}, θ_{2}, . . . , θ* _{N}*). The goal of variational Bayes estimation is to find a tractable distribution

*q*(

**θ**|data) that is “close” to the target posterior. In VB, the approximate distribution

*q*is supposed to be in a factorized form as(6)where

**θ**

_{1},

**θ**

_{2}, . . . ,

**θ**

*is a certain partition of the parameter vector*

_{K}**θ**so that

*K*≤

*N*(and each

**θ**

*is a block of parameters). Compared to real posterior*

_{k}*p*, the approximate distribution

*q*assumes conditional independence of

**θ**

*(*

_{k}*k*= 1, . . . ,

*K*), so that the posterior correlation among the parameter blocks is ignored. From the perspective of information theory, the “distance” between two probability distributions can be measured by Kullback–Leibler divergence (Kullback and Leibler 1951),(7)where

**Θ**represents the parameter space of

**θ**. Note KL(

*q*||

*p*) ≥ 0, and it equals 0 if and only if the two distributions are the same. In VB, we seek the approximate distribution that satisfies(8)which means the closest possible approximate distribution to the target posterior

*p*(

**θ**|data). It has been demonstrated (see Šmídl and Quinn 2006) that the minimum of (8) is reached at(9)(10)where is the expectation of the log-joint distribution with respect to , the product of distributions of all other partitions of

**θ**except

**θ**

*. It is important to (i) choose suitable (preferably conjugate) prior distributions and (ii) partition the parameter vector*

_{k}**θ**properly to guarantee that each approximate marginal distribution in (10) is tractable. In a convenient case, (

*k*= 1, . . . ,

*K*) can be recognized as a standard parametric distribution from (10). Then it is possible to iteratively estimate by using the following algorithm:

Step 1: Initialize the approximate marginal distribution for each

**θ**._{k}Step 2: Update the parameters of each approximate marginal distribution in the cyclic manner 1, 2, . . . ,

*K*until convergence.

Otherwise, if cannot be recognized as a standard distribution, numerical integration methods are needed to estimate the denominator part in (9) to formulate , which is computationally more demanding. In both cases, we are able to obtain approximate marginal distributions for each block of parameters. Thus, we also obtain uncertainty measures for them from the VB approach. (In some cases, the MAP approach can also provide uncertainty measures for the parameters. See Zhang and Xu 2005 as an example. The posterior variance from MAP may be biased downwardly because it is often conditional on other parameters.) For example, if the approximate marginal distribution is a standard distribution, then we can easily determine the credible interval for the parameter block **θ*** _{k}*. In more complicated cases, this requires numerical techniques.

Finally, it is useful to point out that the VB approximation can be understood from the perspective of the following lower bound of the marginal distribution of the data (marginal likelihood); see Šmídl and Quinn (2006). Note that the following relation is satisfied,(11)where *L*(*q*(**θ** | data)) can be interpreted as a lower bound of the log marginal-likelihood function ln *p*(data), which is supposed to be a constant value, and minimization of KL(*q*(**θ**|data) || *p*(**θ**|data)) is equivalent to maximization of *L*(*q*(**θ**|data)). The lower bound can be further presented as(12)It is not difficult to prove that maximizing (12) with respect to *q*(**θ*** _{k}*|data) leads to the same equation (Equation 10) that we obtained by minimizing the KL divergence. From another perspective, in the above-mentioned iteration algorithm, after updating each once, we can then calculate the value of the corresponding lower bound by using (12) and use it as a stopping criterion for the algorithm. Furthermore, the lower bound can also be used as a criterion for model comparison, since it is an approximation to the log marginal-likelihood function of a particular model, which is essentially needed for constructing a Bayes factor (Beal 2003).

In the next sections, we show that the conjugate priors we choose in the hierarchical shrinkage models guarantee that (10) can be easily evaluated.

### VB algorithm for Bayesian shrinkage models

We are able to formulate posterior approximation algorithms for BAS, BL, and EBL on the basis of the variational Bayes theory described above. In the following, we provide a skeleton of the VB algorithm for the EBL model, which is the most complicated model of the three. The VB procedure can be derived for BAS and BL models in a similar way, and the details are presented in supporting information, File S1.

The likelihood for model (1) can be specified as(13)where . In addition, the priors are specified as(14)(15)(16)where ,(17)(18)and(19)We take each single parameter as a subcomponent of **θ**. Now we seek an approximate distribution to the posterior denoted by(20)From Equation 10, we can derive the approximate density of each component as follows:(21)(22)(23)(24)(25)(26)A detailed description of (21)–(26) is provided in the *Appendix*. After certain initial values are assigned to the parameters of each distribution , an iterative algorithm is used to update them successively until convergence. In practice, the lower bound can be calculated in each iteration and be used as a criterion for stopping. Typically, in the *n*th iteration, we can calculate . If it is smaller than some predefined threshold (small positive value), then the algorithm should stop. After convergence, we obtain the approximate marginal distributions for each parameter defined in the model.

## Example Analyses

We use data sets including the well-known North American Barley data (Tinker *et al.* 1996) and a simulated epistasis data set from Xu (2007) to illustrate the performance of our methods. In the first example, we simulated 50 replicated data sets of phenotypes on the basis of the barley genotype data and empirically assessed the average performance of the methods by presenting results from the replicated data analysis. In the second example, we performed the analysis of epistatic effects on the basis of the simulated data of Xu (2007). In the third example, the sensitivity of the methods was tested with different settings of hyperparameters, using the barley data with real phenotypes. Differences between VB and MCMC approaches were also investigated. Both barley data and simulated data are available as the supplementary materials of Xu (2007), which can be downloaded from the website http://www.biometrics.tibs.org/.

### Analysis of simulated data replicates

We used the barley marker data (described below) as a basis for the simulation study. Before the analysis, each missing marker genotype was completed (once) by random draws from , where is the conditional expectation estimated from flanking markers with known genotypes (see Haley and Knott 1992). We placed QTL positions exactly on the six markers (nos. 2, 20, 40, 60, 80, and 102) with additive effects of −0.5, 0.5, −0.3, 0.3, −0.8, and 0.8, respectively. On this basis, we simulated 50 replicated sets of phenotype data independently, setting the population intercept to 0 and residual variance of normal distribution to 1. By doing so, the average heritability over replications was 0.6881. For each replication, we fitted the BAS, BL, and EBL to the data using both variational Bayes and MCMC estimation algorithms (in the following, we abbreviate the methods as VB-BAS, VB-BL, VB-EBL, MCMC-BAS, MCMC-BL, and MCMC-EBL respectively). In VB-BL, we chose hyperpriors λ^{2} ∼ Gamma(1, 0.0001), and for VB-EBL, we had δ^{2} ∼ Gamma(1, 0.0001) and η_{j}^{2} ∝ Gamma(1, 0.0001). Note that Gamma(1, 0.0001) is a noninformative prior, which is close to an improper flat uniform prior Uni(0, ∞), meaning that we have a weak preassumption on the values of δ^{2} and . For comparison, in MCMC-EBL and MCMC-BL, we used the same priors for shrinkage factors as in the VB approaches. In VB, we chose a threshold of *t* = 10^{−6}, so that if is satisfied, we stopped running the algorithm. Since we obtained an approximate posterior distribution for each parameter from VB, we used the expectation of each posterior distribution as our point estimate. In MCMC, we generated 15,000 samples, where the first 5000 samples were discarded as burn-in, and from the remaining 10,000 samples we stored only every 10th sample to reduce the serial correlation. Next, the posterior mean was calculated and considered as the estimate of the marker effects. Although both VB and MCMC approaches do not shrink any marker effects exactly to zero, they can still be used to guide the variable selection. We used the criterion that if the 95% credible interval contained zero for the estimated coefficient, then the corresponding marker had zero effect (*i.e.*, is not a QTL). The same criterion has been used in Li *et al.* (2011). For 50 replications, we reported the average number of correctly selected nonzero effects (denoted as *C*) and that of incorrectly selected nonzero effects (denoted as *I*) in Table 1. Sometimes, the estimated QTL signals (effects) were detected from the neighboring markers instead of the actual marker positions that were used to generate the phenotypes. Therefore, we also reported the number of correctly selected QTL or their closest neighboring markers (denoted as *C* extension). In addition, the average mean squared error (MSE) and fivefold cross-validation error (CVE) (see Hastie *et al.* 2009) were also calculated for each method. The MSE here refers to the average sum of squared errors, defined as , where are the fitted values. We calculated by using the formula , where represented the posterior mean estimates obtained from VB and MCMC methods or the ℓ_{1}-penalized least-squares estimates from the LASSO method. Note that for comparison, we also performed a standard LASSO analysis (Tibshirani 1996) by using the coordinate descent algorithm (Friedman *et al.* 2010) on the same data, where the LASSO shrinkage factor λ was chosen so that the minimum of CVE was reached. These results are also presented in Table 1. Finally, the estimated QTL effects averaged over 50 replications for all the methods are shown in Figure 1.

From the perspective of producing the correct number of nonzero effects (*i.e.*, QTL), VB-EBL performs best, because it is able to capture on average 5.50 trait loci, which is close to 6, and at the same time provides accurate estimates of QTL effects. VB-BL also estimates almost 6 trait loci correctly, but it overshrinks their QTL effects. VB-BAS tends to identify correct QTL and provides accurate effect estimates for them, but it also gives many spurious signals. Compared to the VB approaches, MCMC-BAS and MCMC-EBL behave equally well in the sense that they accurately estimate the QTL effects. However, the number of correctly identified QTL is much smaller, because the 95% credible intervals estimated from MCMC samples are wider than those from VB. MCMC-BL, similarly to VB-BL, tends to overshrink the QTL effects. Compared to the Bayesian methods, standard LASSO is able to shrink marker effects exactly to zero, but it is not able to provide confidence intervals. Thus, we implemented the phenotype permutation method to decide a 95% significance threshold, on the basis of which QTL were selected from the LASSO solution (Churchill and Doerge 1994). To perform a permutation test, the phenotypes were reshuffled to destroy the association between the phenotypes and the markers for 1000 runs. For each run, we estimated the marker effects and used the largest absolute effect to construct an empirical distribution. We used its 95% quantile as a threshold, so that a marker with the absolute effect larger than the threshold was considered to be a QTL. Results showed that 2.32 QTL on average were detected. Furthermore, we also performed the permutation test for the three VB methods, and the results are summarized in Table 2. It seems that the permutation test performed well with VB-EBL. It correctly detected 3.72 QTL on average, which is less than the number of QTL detected by the credible intervals, but it also provided less falsely selected QTL. For VB-BAS, the estimated threshold was too high so that only 0.46 markers were detected to be the QTL on average. For VB-BL, the threshold was too low, and we obtained too many false positive signals. Moreover, MSE is often used to measure the goodness-of-fit of the model to the data. From the results, it can be seen that MCMC-BL and VB-BAS have smaller MSE on average than others, meaning that they fit the data better. On the other hand, cross-validation provides a measure of predictive ability for a method. Standard LASSO, MCMC-EBL, VB-EBL, and MCMC-BAS approaches tend to give smaller CVE, meaning that they have good predictive ability. MCMC-BL and VB-BAS produced the highest CVEs, indicating that these two methods may suffer from overfitting in this replication study.

### Analysis of simulated data to estimate epistatic effects

A backcross data set with 600 individuals was simulated in Xu (2007). In this data set, 121 markers were evenly distributed through a single chromosome of 1800 cM. Simulated QTL with main effects were exactly placed on 9 markers, and 13 marker pairs were simulated to have interaction effects. Phenotypes were simulated as the cumulative sum of all main and pairwise interaction effects. The population mean and residual variance of normal distribution used in the simulation were 5.0 and 10.0, respectively (the resulting heritability was 0.9065). See Xu (2007) for more details on these data.

We tested the efficiency of VB methods to estimate epistatic effects by using simulated backcross data from Xu (2007). Note that the same linear regression model presented in (1) can also be used here. Design matrix **X** should include not only 121 markers, but also pairwise interaction terms between each possible pair of markers, which leads to an ill-posed (*P* ≫*n*) regression problem. We implemented VB-BAS, VB-BL, and VB-EBL as well as MCMC-BAS, MCMC-BL, and MCMC-EBL (with the same gamma priors for δ^{2} and η_{j}^{2} as in the previous example analysis) on these data. Here we use the same priors for the main and interaction effects. The 95% credible intervals-based criterion is also considered as the tool to judge the QTL for each method.

Results are summarized in Table 3 and Figures 2 and 3. For methods including VB-BAS, MCMC-BAS, VB-EBL, and MCMC-EBL, we conclude that their performance is good in general, because they are able to detect almost all the correct loci with main effects as well as the interaction terms, and the estimated QTL effects are also accurate. VB-EBL, MCMC-EBL, and MCMC-BAS fail to find an interaction effect at marker pair (41, 61). In addition, VB-EBL does not find an effect at pair (21, 22). MCMC-EBL produced a quite accurate estimate of 0.7015 at (21, 22) that is close to the true effect size 1, but the pair (21, 22) is not recognized as a QTL by the interval test. Compared to the former three methods, VB-BAS is able to find all the important signals, but tends to give many spurious signals in addition to the correct ones. Finally, the results obtained from VB-BL and MCMC-BL are not satisfactory. They can detect only a few major QTL, and the estimated QTL effects in these positions are not accurate. Similar analysis has been done by using empirical Bayes, LASSO, penalized maximum likelihood (Zhang and Xu 2005), and SSVS in Xu (2007). Results obtained from our variational Bayes approaches including VB-EBL and VB-BAS are also competitive with the results published earlier.

Finally, we employed still two other efficient methods for detecting epistatic effects and compared them to our methods. The first one is an EM algorithm for estimating epistatic effects introduced by Yi and Banerjee (2009). In summary, a Bayesian hierarchical model was proposed with priors and . Note that an inverse-χ^{2} distribution Inv – χ^{2}(ν, *s*^{2}) is equivalent to an inverse gamma distribution . If both ν and *s*^{2} were set to be zero, we exactly obtain a BAS model. In our example, we chose ν = 0.01 and *s*^{2} = 0.0001, as Yi and Banerjee (2009) suggested. The method has been implemented in the recently developed R package BhGLM (http://www.ssg.uab.edu/bhglm/), which takes the advantage of R’s built-in function glm. We found that the method cannot be directly applied to this data set due to the memory limitation of the computer. Alternatively, we used a model search strategy (with thresholds *t*_{1} = 10^{−8} and *t*_{2} = 0.1) also introduced in Yi and Banerjee (2009), which gradually filters out the markers with negligible effects and therefore reduces the dimensionality of the model. The second method is a MCMC-based model-finding algorithm of Yi *et al.* (2005), which has been implemented in the R package qtlbim (see Yandell *et al.* 2007). We set the expected number of QTL with main effects (main.nqtl) to be 9, and the expected total number of QTL with both main and epistatic effects (mean.nqtl) to be 20 on the basis of the number of main and epistatic effects detected by VB-EBL. The results are shown in Figure 4. It seems that R/BhGLM does not provide as good results as VB-EBL. First of all, the locations of two main QTL, which were found at 88 and 113, are biased from the true locations 91 and 101. Second, many spurious QTL signals were detected in R/BhGLM analysis. These may be mainly caused by the use of the model search strategy. A simultaneous estimation of all the markers and the interaction terms may lead to better results. On the other hand, R/qtlbim seems to work nicely. Since the Bayesian model behind R/qtlbim assigned an indicator variable to each marker, an estimate of the Bayes factor can be derived from the MCMC samples and be used to judge the QTL positions. Results in Figure 4C indicate the markers with large values of Bayes factors were located either exactly at the positions of true main QTL or close to them. Furthermore, R/qtlbim was also able to correctly detect most of the marker pairs with true epistatic effects, as illustrated in Figure 4D. Like VB-EBL, R/qtlbim failed to detect the epistatic effects at marker pairs (21, 22) and (41, 61). In addition, it produces a few false positive signals. We have also found that the performance of R/qtlbim relies on the choice of priors including main.nqtl and mean.nqtl. As we have shown, it is possible to set those priors on the basis of results from our VB methods.

### Real barley data analysis

The mapping population consists of 145 doubled haploid lines (*n* = 145), each grown in a range of environments. A total of 127 markers were genotyped, covering 1270 cM of the barley genome, with the average distance between markers of 10.5 cM. Seven traits including yield, heading, maturity, height, lodging, kernel weight, and test weight were measured for each line.

Here we selected the kernel weight as the quantitative traits used in the analysis. Before the analysis, the phenotype values for each line in the traits were averaged over the environments. Missing marker genotypes were imputed in the same way as in the first example analysis. The VB methods based on the three models as well as the corresponding MCMC approaches were used for the analysis. To test the prior sensitivity of VB-BL and VB-EBL, we implemented VB-BL with different combinations of hyperparameters as (γ_{1}, ν_{1}) = (0.0001, 0.0001), (γ_{2}, ν_{2}) = (1, 0.0001), and (γ_{3}, ν_{3}) = (1, 1), respectively. Correspondingly, for VB-EBL, we chose (γ_{1}, ν_{1}) = (ψ_{1}, ϑ_{1}) = (0.0001, 0.0001), (γ_{2}, ν_{2}) = (ψ_{2}, ϑ_{2}) = (1, 0.0001), and (γ_{3}, ν_{3}) = (ψ_{3}, ϑ_{3}) = (1, 1). In MCMC-EBL and MCMC-BL, we used the same priors as in the corresponding VB approaches. Together with VB-BAS, results of VB estimation are presented in Figure 5. Additionally, results of MCMC estimation are shown in Figure 6. We see that the VB-EBL is quite sensitive to the choices of the hyperparameters. By our preferable choice of shape parameter as 1 and rate parameter as 0.0001, VB-EBL tends to produce a relatively sparse model compared to the other two choices. On the other hand, VB-BL seems to be less sensitive to the choice of the hyperparameters. We also note the difference between the results obtained from VB and MCMC methods. For example, in the BAS model, the MCMC approach leads to a sparser model than the VB approach. Another interesting feature is that the credible interval estimated from VB methods tends to be narrower than the one estimated from the MCMC samples based on the same shrinkage model. Figure 7 shows an example, where we plot the estimated effects of markers (in blue) which are announced to be significant (95% credible intervals do not contain zero) from VB-EBL. The corresponding posterior means and 95% C.I.’s estimated by MCMC (in red) are also shown in Figure 7. Clearly, all the MCMC-derived 95% C.I.’s contain VB-derived 95% C.I.’s. Consequently, from VB, we obtain 12 QTL by interval inference, but from MCMC, only 3 of them including markers 12, 43, and 102 are claimed to be significant. This implies that, in practice, the VB inference based on intervals may tend to give more false positive results. Alternatively, we performed a permutation test with VB-EBL, and three markers, 2, 12, and 102 were found to be significant. These markers are located close to the QTL signals reported in Tinker *et al.* (1996).

## Discussion

We propose a variational Bayes approximation-based algorithm for Bayesian shrinkage analysis of multiple-QTL mapping. Because of the usage of the conjugate priors, iterative algorithms can be easily derived for hierarchical Bayesian shrinkage models including BAS, BL, and EBL to update the approximate marginal distribution of each parameter. Compared to MCMC methods, which are used to generate dependent samples from the target posterior distribution directly, VB seeks an approximate distribution in a factorized form that has minimum KL divergence to the target posterior. Therefore, it is not surprising that for a particular model, the posterior mean estimated by MCMC can be different from that estimated by VB. Another feature is that the credible intervals estimated by VB methods are narrower than those of MCMC. Therefore, if the credible intervals are used to judge QTL, VB tends to give more false positive signals. On the other hand, MCMC-based shrinkage methods tend to provide too large credible intervals arguably due to the collinearity between loci, and a criterion based on these may fail to find some true QTL signals. Therefore, we suggest that a credible intervals-based criterion should be used for QTL detection only in the first stage. A further method may be needed to judge QTL in a more accurate way in the second stage. We found that a permutation-based test might be a suitable tool to control false positive findings under the VB-EBL model, but not for VB-BL and VB-BAS. However, performing such a test often requires huge computation time and is not preferable for the high-dimensional data. As Heaton and Scott (2010) pointed out, how to select the significant variables under a Bayesian shrinkage procedure remains an open problem and needs to be investigated in further studies. Compared to the MCMC approaches, one major advantage of VB approaches is their computational efficiency. Note that the computation time for a single iteration in VB is approximately identical to a Gibbs sampling step. On the basis of our empirical experiments, VB algorithms need ∼100–1500 iterations until convergence. On the other hand, when MCMC is used, we usually need to draw >10,000 MCMC samples to obtain good posterior summaries. Therefore, compared to MCMC, VB methods require much less computation time. Table 4 presents a comparison of the computation time of both VB and MCMC in our second and third example analyses. All the methods were implemented in MATLAB on a regular desktop with a 3.00-GHz Intel Pentium 4 processor and 3 GB RAM. The MATLAB source code is available in File S2.

The VB method we consider here can be regarded as an extension of the well-known EM algorithm and can be applied to a broader class of Bayesian models. In an EM algorithm, when the hidden variables interact with each other and their joint full conditional distribution is not analytically available, the variational approximation may be applied as an alternative to proceed to the E-step. This is then called the VB-EM algorithm (Beal 2003). Furthermore, if the parameters, with respect to which the objective function is maximized, are considered as the hidden variables so that only the E-step is remaining, then we exactly obtain the VB algorithm mentioned in this article. More information about the connection between VB and EM algorithms is provided in File S1.

This article discusses three Bayesian shrinkage models, including BL, EBL, and BAS. BL and EBL can be regarded as the same model class, which uses double exponential prior distributions of marker effects. Compared to BL, an improvement in EBL is to assign a marker-specific shrinkage factor η* _{j}* to each marker in addition to a global shrinkage factor δ. Mutshinda and Sillanpää (2010) pointed out that, by using this marker-specific shrinkage factor, EBL is able to relax the penalty for the important marker effects and meanwhile push the unimportant marker effects strongly toward zero. From the results of our example analyses, it can be seen that VB-EBL outperformed VB-BL. In our simulated examples, VB-EBL gave accurate estimates of the marker effects and meanwhile produced a relatively sparse model. On the other hand, VB-BL tended to shrink the large marker effects too much, so that the estimation was biased. BAS, a third model we focused on, comes from a different model class, where the Student’s

*t*prior is assigned for the marker effects. In the simulation studies, we found that VB-BAS can produce accurate estimates of the marker effects, but the resulting model was not sparse compared to VB-EBL [with our default hyperparameters (γ, ν) = (ψ, ϑ) = (1, 0.0001)], meaning that some spurious signals were present. Therefore, VB-BAS may not be a preferable method for QTL mapping, in which we are more interested in detecting QTL with relatively large effects on the traits.

The Bayesian LASSO method has also been used for predicting phenotypic values and breeding values (De Los Campos *et al.* 2009; Legarra *et al.* 2011). In the first replication study, we used cross-validation to measure the predictive ability of our proposed methods, and we found that the performance of VB-EBL is competitive to LASSO, which has been widely used for prediction problems in many areas. However, the predictive performance of a certain method depends on the property of data sets, *i.e.*, the distribution of QTL effects (Daetwyler *et al.* 2010; Clark *et al.* 2011). The general predictive abilities of our VB methods need to be investigated in more complicated data sets.

In addition, it is necessary to point out the difference between our EBL and BL models and other approaches. Compared to Mutshinda and Sillanpää (2010), we used different parameterizations and priors of the shrinkage factors in the EBL and BL models. Taking the EBL model as an example, in Mutshinda and Sillanpää (2010), δ and η* _{j}* were treated as parameters, and they were assigned the uniform priors Uni(0, 100). However, in our VB approaches, δ

^{2}and are treated as parameters, and conjugate priors δ

^{2}∼ Gamma(γ, ν) and ∼ Gamma(ψ, ϑ) are used to guarantee the convenience of the computation. For comparison, here we used the same gamma priors for MCMC as in the VB approaches. However, on the basis of our empirical experiments, EBL is sensitive to the hyperparameters with the gamma priors in MCMC, and the use of the uniform priors suggested by Mutshinda and Sillanpää (2010) might be a good choice to avoid tuning.

In BL and EBL, another remaining issue is “unimodality”. To guarantee unimodality of the posterior, Park and Casella (2008) suggested the incorporation of the global residual variance σ^{2}_{0} into the prior distribution of marker effects as β* _{j}* ∼

*N*(0, σ

^{2}

_{0}σ

_{j}^{2}). They demonstrated that, by doing this, the posterior distribution has a unique mode (maximum value), and the corresponding MCMC algorithm may converge faster. However, in VB estimation, instead of the posterior mode, we seek the minimum value of the KL divergence between the factorized approximate distribution and the original posterior. Thereby, the model suggested by Park and Casella (2008) may not be optimal for our variational Bayes approach. In fact, we found that the VB algorithm based on the Park and Casella model even slowed down the convergence. Also note that the uniqueness of the minimization (

*i.e.*, convergence to a global minimum) of the KL divergence in VB estimation is not guaranteed in general (see Šmídl and Quinn 2006). Further investigation may be needed to check for uniqueness of the VB approximation.

Finally, we point out some weaknesses of our current VB approach, which can potentially be improved in future work. First, we derived the approximate marginal distribution for each marker effect β* _{j}* separately. During the approximation, the posterior correlations among marker effects β

*are lost. It may be beneficial to take all marker effects as a block*

_{j}**β**= [β

_{1}, … , β

*] and update them simultaneously. This may help to reduce the minimum value of KL divergence and may give a better approximation to the target posterior distribution. In practice, this involves the inversion of a*

_{p}*p*×

*p*matrix where which is demanding in terms of computation and storage when

*p*is large. This problem can be potentially solved by applying “Woodbury matrix identity” (Woodbury 1950), which can be applied to convert the

*p*×

*p*matrix into an

*n*×

*n*one, so that the dimension of the matrix is reduced significantly. Second, in VB-EBL and VB-BL, we used gamma priors for shrinkage factors and chose shape parameters to be 1 and rate parameters to be a small value (say 0.0001) so that the priors were flat. This strategy was successful throughout our example analyses here, but may not be the optimal choice in general, since these methods are quite sensitive to the choice of the gamma parameters (see our third example analysis). A first solution to this problem might be that we perform a prior sensitivity analysis using cross-validation for new data to give optimal models on the basis of different data sets. Second, we have found that in the formula to update (in step VI in the

*Appendix*), if we heuristically replace by , VB-EBL can produce equally good results, and these slight modifications seem to make VB less sensitive to hyperparameters, although they do not have any theoretical justification. Logsdon

*et al.*(2010) also made some slight modifications in their VB algorithm to maintain the ease of computation, and they stated that those modifications did not significantly affect the results. However, in this article, we have presented only the results from standard VB.

## Acknowledgments

We are grateful to Petri Koistinen for giving constructive comments on the manuscript. This work was supported by the Finnish Graduate School of Population Genetics and by research grants from the Academy of Finland and the University of Helsinki’s Research Funds.

## Appendix: VB Estimation Algorithm for Extended Bayesian LASSO

First, the logarithm of the joint distribution of parameters and data is

(A1)where *C* is a constant. Next, we start to derive the VB marginal distributions for parameters on the basis of Equation 10.

### I. Derivation of

By keeping only terms containing β_{0}, we obtain(A2)where represents the product of all other approximate marginal distributions except . Here *E*[•] is a simplified notation of the expectation of the parameter • with respect to its approximate marginal distribution [*i.e.*, is a simplified notation of ], and *C* represents those terms that do not contain β_{0}. We can recognize from (A2) that is a *normal distribution* with mean(A3)and variance(A4)To update *E*[β_{0}] and Var[β_{0}], we need to calculate *E*[β* _{j}*] and on the basis of approximate marginal distributions and , respectively, which are shown in the following.

### II. Derivation of

Following a similar procedure to that in (1), we have(A5)showing that is also a *normal distribution* with mean(A6)and variance(A7)In addition, the expectation of β_{j}^{2} (second moment) can be calculated by *E*[β_{j}^{2}] = *E*[β* _{j}*]

^{2}+ Var[β

*], which is needed later.*

_{j}### III. Derivation of

We have(A8)where(A9)and(A10)Here is recognized as a *gamma distribution* with parameters shape parameter *a*_{1} and rate parameter *b*_{1}. In addition, we need to calculate the mean to update other approximate distributions.

### IV. Derivation of (j = 1, . . . , p)

(A11)Here is recognized as an *inverse Gaussian* distribution (Seshadri 1999) with parameters(A12)and(A13)Note that a general form of the inverse Gaussian density is(A14)To update other approximate distributions, we need to obtain the expectation of as , as well as the expectation of as .

### V. Derivation of

(A15)Here is recognized as a *gamma distribution* with shape parameter(A16)and rate parameter(A17)We also need the expectation .

### VI. Derivation of (j = 1, . . . , p)

(A18)Thus, is also a *gamma distribution* with shape parameter(A19)and rate parameter(A20)The expectation also needs to be calculated.

To update the approximate distributions in *Appendix* sections I–VI successively, we need to first assign initial values to the required expectations. In practice, we use *E*[β* _{j}*] = 0 (

*j*= 0, 1, . . . ,

*p*), (

*j*= 0, 1, . . . ,

*p*),

*E*[δ

^{2}] = 1, and (

*j*= 0, 1, . . . ,

*p*). In addition, we can derive the lower bound as(A21)where Γ(•) is the gamma function.

Finally, after convergence, we obtain an approximate distribution for each marker effect. The mean *E*[β* _{j}*] can be regarded as a point estimate of the effect of marker

*j*, and its 1 – α percent credible interval can be defined as , where

*Q*(•) here represents the normal quantile function.

## Footnotes

*Communicating editor: F. Zou*

- Received September 20, 2011.
- Accepted October 19, 2011.

- Copyright © 2012 by the Genetics Society of America