Abstract
Numerous Bayesian methods of phenotype prediction and genomic breeding value estimation based on multilocus association models have been proposed. Computationally the methods have been based either on Markov chain Monte Carlo or on faster maximum a posteriori estimation. The demand for more accurate and more efficient estimation has led to the rapid emergence of workable methods, unfortunately at the expense of well-defined principles for Bayesian model building. In this article we go back to the basics and build a Bayesian multilocus association model for quantitative and binary traits with carefully defined hierarchical parameterization of Student’s t and Laplace priors. In this treatment we consider alternative model structures, using indicator variables and polygenic terms. We make the most of the conjugate analysis, enabled by the hierarchical formulation of the prior densities, by deriving the fully conditional posterior densities of the parameters and using the acquired known distributions in building fast generalized expectation-maximization estimation algorithms.
- hierarchical model
- genomic selection
- Bayesian Lasso
- generalized expectation maximization (GEM)
- GenPred
- shared data resources
THE availability of genome-wide sets of molecular markers has opened new avenues to animal and plant breeders for estimating breeding values on the basis of molecular markers with and without pedigree information (Bernardo and Yu 2007; Hayes et al. 2009; Lorenzana and Bernardo 2009). The same holds true for phenotype prediction in human genetics (Lee et al. 2008; de los Campos et al. 2010). There are two clearly different model approaches to estimating genomic breeding values, the first of which applies simultaneous estimation and variable selection to multilocus association models, where all markers are included as potential explanatory variables (e.g., Meuwissen et al. 2001; Xu 2003). Multilocus association models assign different, possibly zero, effects to every marker allele or genotype and determine the genetic value of an individual as a sum of the marker effects. The second approach is to utilize the marker information for estimating realized relationships between individuals and to use the marker-estimated genomic relationship matrix instead of the pedigree-based numerator relationship matrix in a mixed-model context (e.g., VanRaden 2008; Powell et al. 2010).
In recent literature there are numerous Bayesian methods of phenotype prediction and breeding value estimation based on multilocus association models, from Meuwissen et al. (2001) BayesA and BayesB onward (e.g., Xu 2003; Yi et al. 2003; Yi and Xu 2008; de los Campos et al. 2009; Verbyla et al. 2009, 2010; Mutshinda and Sillanpää 2010; Sun et al. 2010; Habier et al. 2011; Knürr et al. 2011). The Bayesian methods have proved workable, efficient, and flexible, but the tremendous number of markers in the modern SNP data sets makes the computational methods traditionally connected to Bayesian estimation, e.g., Markov chain Monte Carlo (MCMC), rather cumbersome or even infeasible. For the same models fast alternative estimation procedures have been proposed, most commonly based on estimation of the maximum of the posterior density rather than the whole posterior distribution by the expectation-maximization (EM) algorithm (Dempster et al. 1977; McLachlan and Krishnan 1997; for the methods see, e.g., Figueiredo 2003; Meuwissen et al. 2009; Yi and Banerjee 2009; Hayashi and Iwata 2010; Lee et al. 2010; Shepherd et al. 2010; Sun et al. 2010; Xu 2010). A collection of the Bayesian methods used for phenotype prediction and breeding value estimation is listed in Table 1 with some central features common to several of the methods.
While the proposed methods are competent and useful, we find the recent development somewhat disturbing for a couple of reasons. In the first place, rather than introducing a tailor-made algorithm for every single model variant and emphasizing the differences between the variants, it might be more helpful to focus on the similarities, or on the common framework, of the methods. Our second concern is the widespread fashion of mixing the model and the parameter estimation in a way that makes it hard to tell what constitutes the model, the likelihood, and the priors and what constitutes the estimator. In this article our intention is to respond to the above-mentioned concerns by (1) carefully building a hierarchical Bayesian linear model and showing how many of the different models suggested in the literature can be seen as special cases of the more general one and (2) deriving a generalized EM algorithm for the parameter estimation in a way that presents explicitly how the model is translated into an algorithm.
In typical single-nucleotide polymorphism (SNP) data the number of markers is far greater than the number of individuals, and hence in multilocus association models there usually are a lot more explanatory variables than observations. This leads to a situation where some kind of selection of the predictors is required, either by discarding the unimportant predictors or by shrinking their effects toward zero (e.g., O’Hara and Sillanpää 2009). Contrary to the frequentist way of deriving an estimator by adding a penalty to the loss function (e.g., penalized maximum likelihood, such as ridge regression and Lasso), in the Bayesian context the shrinkage-inducing mechanism is included into the model, namely by specifying an appropriate prior density for the regression coefficients. Although it is clearly more logical to consider the assumptions about the model sparseness as a part of the model (the prior is a part of the model) rather than a part of the estimator (a penalty is a part of the estimator), the difference may seem trivial in practice. However, the fact that in the Bayesian context the model includes all available information permits the estimator to be always the same, either the whole posterior density or a maximum a posteriori (MAP) point estimate, and this enables the straightforward translation of the model into an algorithm.
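To make the correspondence between penalties and priors concrete, note the standard identity (not specific to any one of the cited methods): under a Laplace prior on the coefficients, the MAP estimate minimizes the Lasso objective,

$$\hat{\beta}_{\mathrm{MAP}} = \arg\max_{\beta}\, p(\beta \mid y) = \arg\min_{\beta} \left\{ \frac{1}{2\sigma_0^2}\,\lVert y - X\beta\rVert^2 + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert \right\},$$

given the prior density $p(\beta_j) = (\lambda/2)\,e^{-\lambda|\beta_j|}$; a normal prior yields the ridge penalty $\propto \sum_j \beta_j^2$ in exactly the same way.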
Model-wise our purpose is to examine the behavior of different submodels and prior densities, especially to compare the pros and cons of Student’s t and Laplace priors. We want to explain the significance of the parameterization and hierarchical structure of the model for elegant derivation of posterior densities and mixing or convergence properties of an estimation algorithm and to emphasize in particular the beauty of the hierarchical specification of the prior density, contrary to the common enthusiasm to integrate out intermediate parameters when working within the EM context.
Estimation-wise we reclaim an approach familiar to those acquainted with MCMC, Gibbs sampling in particular, of deriving the fully conditional posterior density of every parameter and latent variable. Since the fully conditional posteriors are known distributions, we know the conditional posterior expectations and maximums required by the EM algorithm without integrating or setting the derivative to zero and solving, which is the traditional way to derive estimation steps in the EM framework. One of our purposes is to bring the MCMC and EM estimation frameworks closer together and, more generally, to point out that this approach makes it possible to construct an infinite variety of different algorithms, including EM algorithms, with different parameters considered as “missing data,” and type II maximum-likelihood algorithms, where all parameters are maximized (Tipping 2001), as well as mixtures of the previous algorithms and Gibbs sampling, where some variables are stochastically sampled (stochastic EM) (e.g., Lee et al. 2008; Zhan et al. 2011).
Hierarchical Model
Our hierarchical model, depicted as a directed acyclic graph in Figure 1, has a total of six layers, two of which are optional. The observed data, located in layers 1 and 2 in the graph, comprise phenotype and genotype information, plus a known pedigree, of a sample of related individuals. The phenotype measurements of the n individuals, denoted by an n-vector y, are assumed to be either continuous or binary; in the latter case the observed phenotypes are located in layer 1 in Figure 1. The genetic data X are an n × p matrix consisting of the genotypes of p biallelic SNP markers, coded as the number of copies of the reference allele (0, 1, or 2) and standardized to have zero mean and unit variance. The pedigree information is given in the form of an additive genetic relationship matrix (Lange 1997), commonly denoted by A.
Figure 1 Hierarchical structure of the model. The ellipses represent random variables and the rectangles fixed values, while the round-cornered rectangles may be either, depending on the selected model. Solid arrows indicate statistical dependency and dotted arrows a functional relationship. The background boxes indicate the dimension of the vectors or matrices in question.
Linear model
The phenotypes are connected to the marker and pedigree information with a normal linear association model

$$y = \mathbf{1}\beta_0 + X\Gamma\beta + Zu + \varepsilon, \qquad (1)$$

where β0 is the population intercept, and the n-vector ε corresponds to the residuals, assumed normal and independent, $\varepsilon \sim N(0, \sigma_0^2 I_n)$. If necessary, the intercept β0 can be easily replaced with a vector of environmental variables.
The second term on the right-hand side of Equation 1 comprises the observed genotypes X and the allele substitution effects Γβ, modeled following Kuo and Mallick (1998) as a product of the size of the effect and a variable indicating whether the marker is linked to the phenotype. In Equation 1, β is a p-vector of regression coefficients, denoting the additive effect sizes, and Γ = diag(γ) = diag(γ1, …, γp) is a p × p diagonal matrix of indicator variables, whose jth diagonal element γj has value 1 if the jth SNP is included in the model and 0 otherwise.
The term u in Equation 1 denotes the additive polygenic effects due to the combined effect of an infinite number of loci. It represents the genetic variation not captured by the SNP markers, as well as takes account of residual dependencies between individuals (Yu et al. 2006). The total number of individuals in the data is denoted by an uppercase N, in contrast to the lowercase n representing the number of individuals with an observed phenotype. The vector of polygenic effects, u = [u1 … uN]′, has an N-dimensional multivariate normal prior distribution with mean vector 0 and covariance matrix $\sigma_u^2 A$, where A is the additive genetic relationship matrix and $\sigma_u^2$ is the additive variance of the polygenes. $Z = [I_n \mid 0_{n \times (N-n)}]$ is an n × N design matrix connecting the polygenic effects to the observed phenotypes.
The individuals, or their phenotypic values yi, are assumed conditionally independent given the genotype information X and the polygenic effects u. This assumption and the described linear marker association model (1) give a multivariate normal likelihood for the phenotype vector y; due to the independence of the observations, the likelihood can equally be interpreted as a univariate normal density

$$y_i \sim N(\beta_0 + x_i\Gamma\beta + z_iu,\; \sigma_0^2)$$

for a single observation yi, where $x_i$ and $z_i$ denote the ith rows of X and Z. The parameters of the multilocus association model that are present in the likelihood function are located in the “parameter” level of the graph in Figure 1.
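To make the pieces of model (1) concrete, the following is a minimal NumPy simulation; the dimensions, the variance values, and the identity relationship matrix (A = I) are illustrative choices of ours, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, p = 100, 120, 500                            # phenotyped, total individuals, markers

X = rng.integers(0, 3, size=(n, p)).astype(float)  # SNP genotypes coded 0/1/2
X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize each marker column

beta0 = 1.0
gamma = (rng.random(p) < 0.01).astype(float)       # indicators: ~1% of SNPs linked
beta = rng.normal(0.0, 0.5, size=p)                # additive effect sizes
u = rng.normal(0.0, 0.3, size=N)                   # polygenic effects (A = I here)
Z = np.hstack([np.eye(n), np.zeros((n, N - n))])   # Z = [I_n | 0]

# Model (1): y = 1*beta0 + X*Gamma*beta + Z*u + eps
y = beta0 + X @ (gamma * beta) + Z @ u + rng.normal(0.0, 1.0, size=n)
```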
Shrinkage-inducing priors
A central feature of handling an oversaturated model is selection of the important predictors (e.g., O’Hara and Sillanpää 2009). In the Bayesian context the sparseness is included into the model by specifying such a prior density for the regression coefficients that it represents the a priori understanding that most of the predictors have only a negligible effect, while there are a few predictors with possibly large effect sizes. A prior that would evince this idea should consist of a probability mass centered near zero and a probability mass distributed over the nonzero values, including a reasonably high probability for large values. The most prominent prior densities used for acquiring the desired shape are Student’s t (e.g., Meuwissen et al. 2001; Xu 2003) and Laplace distributions (de los Campos et al. 2009), either alone or combined with a point mass at zero (Meuwissen et al. 2001; Shepherd et al. 2010) or as a mixture of two Student’s t distributions with different variances (stochastic search variable selection) (e.g., Verbyla et al. 2009). Student’s t and Laplace densities possess several favorable features, including high kurtosis and proportionally heavy tails, that make them worthy candidates for shrinkage-inducing priors. However, in some cases, especially with an oligogenic trait, the shrinkage introduced by these densities may not be sufficient and a mixture prior with a point mass at zero is required to produce the optimal sparseness. Due to its weaker shrinking ability the Student’s t prior is more prone to the problem than the Laplace prior. In fact, Pikkuhookana and Sillanpää (2009) and Hayashi and Iwata (2010) have noted that adding a point mass at zero to a Student’s t prior model improves the performance, due to the elimination of the cumulative effect of a multitude of insignificant but nonzero marker effects. Although the Laplace prior generates more sparseness, there is evidence in support of the usefulness of the Laplace and point mass mixture prior (Shepherd et al. 2010). The Laplace prior leads to an estimate nearly identical to the frequentist Lasso estimate (Tibshirani 1996) and is therefore commonly denoted as the Bayesian Lasso (Park and Casella 2008). However, contrary to the original frequentist version, the Bayesian Lasso shrinks the unimportant coefficients to small values instead of zero (Sun et al. 2010), which also supports the putative benefit of a mixture prior.
The insertion of a zero-point mass into the prior of the marker effects is carried out by adding an indicator variable into the model to tell whether the effect of a given predictor variable comes from the density part (Laplace or Student’s t) or from the point mass part of the prior, that is to say, whether the variable is included in the model or not. Following Kuo and Mallick (1998), the marker effects are modeled as a product of the indicator variable γj and the effect size βj, which are considered a priori independent; hence the joint prior of the marker effect becomes simply p(γj, βj) = p(γj)p(βj), where p(γj) is a Bernoulli density with a prior probability π for a marker to be linked to the trait, the effect size prior p(βj) being either a Student’s t or a Laplace density.
Both Student’s t and Laplace distributions can be expressed as a scale mixture of normal distributions with a common mean and effect-specific variances. The Student’s t density can be formulated as a scale mixture of normal distributions with variances distributed as a scaled inverse χ2, while a Laplace density can be presented in a similar manner, the mixing distribution now being an exponential one. As the hierarchical representation of the Student’s t distribution leads to conjugate priors for the normal likelihood parameters, it is a perfect choice for a conjugate analysis. The exponential prior of the effect variances in the Laplace model is not conjugate to the conditionally normally distributed effect sizes $\beta_j \mid \sigma_j^2 \sim N(0, \sigma_j^2)$; however, the inverse of the effect variance has an inverse Gaussian fully conditional distribution (Chhikara and Folks 1989). Within the MCMC world, the hierarchical formulation of the prior densities, also known as model or parameter expansion, is a well-known device to simplify computations, by transforming the prior into a conjugate one and hence enabling Gibbs sampling, and to accelerate convergence of the sampler, by adding more working parts and therefore more space for the random walk to move in (see, e.g., Gilks et al. 1996; Gelman 2004; Gelman et al. 2004). In MAP estimation, on the other hand, a commonly adopted approach to simplify the model is to integrate out the effect variances. However, the conjugacy maintained by preserving the intermediate variance layer is a valuable feature also for MAP estimation, as it enables the straightforward derivation of the fully conditional posterior density functions.
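As a quick numerical check of these scale-mixture identities (purely illustrative; the parameter values are arbitrary), one can sample from the hierarchical form and compare against the known marginal variances:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200_000

# Student's t as a normal scale mixture: sigma2_j ~ Scaled-Inv-chi2(nu, tau2),
# beta_j | sigma2_j ~ N(0, sigma2_j)  =>  beta_j ~ t_nu(0, tau2)
nu, tau2 = 4.0, 1.0
sigma2 = nu * tau2 / rng.chisquare(nu, size=m)     # scaled inverse chi-square draws
beta_t = rng.normal(0.0, np.sqrt(sigma2))
print(np.var(beta_t), nu * tau2 / (nu - 2))        # both ~ 2.0

# Laplace as a normal scale mixture: sigma2_j ~ Exp(rate = lambda2/2),
# beta_j | sigma2_j ~ N(0, sigma2_j)  =>  beta_j ~ Laplace with parameter lambda
lam2 = 4.0
sigma2 = rng.exponential(scale=2.0 / lam2, size=m)
beta_l = rng.normal(0.0, np.sqrt(sigma2))
print(np.var(beta_l), 2.0 / lam2)                  # Laplace variance 2/lambda^2
```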
Hyperparameter selection in MAP estimation
The prior hyperparameters (layer 5 in Figure 1) either can be estimated simultaneously with the model parameters or can be defined, e.g., by cross-validation or the Bayesian information criterion (see Sun et al. 2010). The estimation of the hyperparameters is depicted in Figure 1 by considering the priors for the indicator and the variance (round-cornered rectangles at layer 5) random and adding the sixth layer, the optional hyperprior, to the model. The latter option corresponds to a situation where layer 6 is absent and all of the parameters of the fifth layer are considered fixed. We call the process of determining the fixed values prior to parameter estimation tuning of the model. As Figure 1 points out, even if the hyperparameters are estimated from the data, the need for tuning does not vanish, but simply passes to the next layer of the model hierarchy; hence, at one level or another of the hierarchy, the user inevitably has to provide some values.
In Bayesian Lasso models the hyperparameters are usually treated as random parameters, while in Student’s t-based models it is more common to hold them constant, even though opposite examples exist; e.g., Xu (2010) proposes a Laplace model with constant hyperparameters, while Yi and Xu (2008) and Habier et al. (2011) estimate the hyperparameters of a Student’s t model with an MCMC algorithm (see Table 1). However, there seems not to be a single work proposing an EM algorithm for hyperparameter estimation of a Student’s t model; Carbonetto and Stephens (2011), for instance, present a variational Bayes MAP-estimation algorithm but resort to importance sampling for those hyperparameters. The absence of examples in the literature corroborates our own experimentation (results not shown), suggesting that the hyperparameters of a Student’s t model cannot be estimated with an EM algorithm; therefore, since we are committed to proceeding with EM estimation, we adopt the common fashion of including the optional hyperprior layer in the Laplace model, but exclude it from the Student’s t model.
Since the tuning of the model becomes harder the more parameters there are to adjust, it is reasonable to try to manage with as few as possible. To this end we have decided to hold the degrees of freedom ν of the Student’s t prior constant and to estimate or adjust only the scale parameters of the prior densities (λ2 is the inverse scale of the Laplace prior, while τ2 is a scale parameter of the Student’s t prior) along with the prior probability π that a SNP is linked to the trait. Given that we estimate the model parameters with an EM algorithm, it is in any case less important to determine the shape of the prior densities, since the only information that passes to the posterior is the expected or maximum value of the density. This is one of the key respects in which sampling-based MCMC and optimization-based MAP estimation differ from each other: when the estimate is generated by sampling from a posterior density, the shape of the prior density is of crucial importance, since the posterior is formed as the product of the prior and likelihood functions. In MAP estimation, however, one needs to be more concerned about the behavior of the expected or maximum points of the product function than about the actual shape of the function. Even though the model itself is the same regardless of the estimation method, this difference in the behavior of the estimation algorithm has to be taken into account when specifying the hyperparameter values. Further, as noted in the previous paragraph, it seems that the different behavior of the algorithms may also favor different model structures.
Hyperparameter values
To optimize the behavior of the estimation algorithm, the predetermined hyperparameters of the Student’s t distribution are selected in a way that the fully conditional posterior expectation of the effect variance, $E(\sigma_j^2 \mid \beta_j)$, is mainly determined by the square of the effect size βj. By setting the degrees of freedom ν = 2, the posterior expectation becomes $E(\sigma_j^2 \mid \beta_j) = \beta_j^2 + 2\tau^2$, so if we choose a small value for the scale τ2, the estimate of the variance always stays positive, but is shrunk strongly toward zero if βj is small ($\beta_j^2 \ll \tau^2$), while left practically intact when βj is large ($\beta_j^2 \gg \tau^2$). Since τ2 is assumed to be somewhat data specific and to affect the results, its value requires tuning.
Under the Laplace model we give the rate parameter λ2 of the exponential density a conjugate gamma hyperprior. The conditional posterior expectation of λ2 is $E(\lambda^2 \mid \cdot) = (\kappa + p)/(\xi + \sum_j \sigma_j^2/2)$; since p is very large, the impact of the shape parameter κ on the posterior expectation is negligible, and therefore the shape parameter κ of the Gamma(κ, ξ) density is set to one. The rate parameter ξ does affect the posterior expectation of λ2, and hence its value has to be tuned.
The indicator Γ = diag(γ) has a Bernoulli prior with a prior probability π = P(γj = 1) for the SNP j contributing to the trait. The value given for the probability π also represents our prior assumption of the proportion of the SNP markers that are linked to the trait. Loosely speaking, although the marker density and the distance of the markers to the putative QTL have their effects, it is quite reasonable to say that π is our prior assumption of the number of QTL affecting the trait divided by the number of SNP markers in the data. The probability π can be assumed either known or unknown, the latter approach inserting an additional layer into the hierarchical model (the “prior for π” box of layer 6 in Figure 1). The indicator affects the shrinkage of the marker effects concurrently with the shrinkage generated by the Student’s t or Laplace priors of the effect sizes βj, and therefore the value of π affects the selection of the hyperparameters of the Student’s t and Laplace priors. When the probability π is assumed known, as in our Student’s t model, it has to be tuned simultaneously with the other hyperparameters (τ2 in our case). Under the Laplace model the probability π is estimated with a conjugate beta prior, either an uninformative uniform Beta(1, 1) or an informative Beta(a, b) density. The informative beta prior embodies our a priori assumed number of QTL by setting a to the number of markers assumed to be linked to the trait (nqtl) and b to the number of markers assumed not to be linked (i.e., b = p − nqtl, p being the number of markers). In the latter case the assumed number of QTL (nqtl) is the parameter to be tuned.
The population intercept β0 and the logarithm of the residual variance $\sigma_0^2$ have uninformative uniform prior densities, so $p(\beta_0) \propto 1$ and $p(\log \sigma_0^2) \propto 1$, which can be interpreted as conjugate $\beta_0 \sim N(0, \infty)$ and $\sigma_0^2 \sim \text{Inv-}\chi^2(0, 0)$ priors. The polygenic effect u has a multivariate normal prior distribution, with mean vector 0 and variance $\sigma_u^2 A$, where the additive genetic relationship matrix A is a constant given by the known pedigree, and $\sigma_u^2$ is the additive variance of the polygenes. The variance $\sigma_u^2$ has an Inv-χ2(2, 0.1) prior, whose main purpose is to hold the variance estimate positive.
As depicted in Figure 1, the model parameters β0, $\sigma_0^2$, γ, β, and u, located at layer 3, are considered a priori independent of each other. The prior independence of the indicator and the effect size, as suggested by Kuo and Mallick (1998), leads to the most straightforward parameterization of a mixture prior for the effects, which, in conjunction with the conjugacy acquired by the hierarchical formulation of the prior densities, enables an easy derivation of a closed-form fully conditional posterior distribution for every parameter and latent variable in the model. The fully conditional posterior densities are given in the Appendix, and the derivations of the densities are detailed in Supporting Information, File S1.
Binary response
In the case of a binary phenotype the above general linear model (1) is not valid. In a linear regression model, the expected value of the response variable equals the linear predictor, $E(y) = \beta_0 + X\Gamma\beta + Zu$ in model (1). The value range of the linear predictor is the whole real axis, while the expected value of a binary variable w, being also the probability of the positive outcome, $E(w_i) = P(w_i = 1)$, has to lie between zero and one. However, we can easily bypass the problem by introducing a continuous, normally distributed latent variable y, such that the binary variable

$$w_i = \begin{cases} 1, & \text{if } y_i > 0 \\ 0, & \text{if } y_i \le 0. \end{cases}$$

Now the latent variable yi is given by the model Equation 1 with a residual $\varepsilon_i \sim N(0, 1)$, and hence the expected value of the binary variable becomes

$$E(w_i) = P(y_i > 0) = \Phi(E(y_i)),$$

where Φ(⋅) denotes the standard normal cumulative distribution function, $E(y_i)$ being the linear predictor of model (1). The probability of the binary variable, given the expected value of the latent variable, is Bernoulli with success probability $\Phi(E(y_i))$. The latent variable parameterization of the binary phenotype corresponds to a generalized linear model with the probit link function (Albert and Chib 1993).

The likelihood of the binary phenotype is the Bernoulli$(\Phi(E(y_i)))$ distribution, while the normally distributed response y is interpreted as a latent variable with a normal prior density given by the likelihood function corresponding to the linear model in (1). Since the augmentation of the latent variable adds a layer to the original hierarchical model (layer 1 in Figure 1), the other parameters (except the residual variance, which has been fixed to unity) and their fully conditional posterior densities are the same as in the continuous-phenotype case.
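As an illustration of this data augmentation, the conditional expectation of the latent variable given the binary observation is a truncated normal mean (the quantity used later in the GEM algorithm); a small NumPy/SciPy sketch, with function and variable names of our own choosing:

```python
import numpy as np
from scipy.stats import norm

def latent_expectation(mu, w):
    """E[y_i | w_i, mu_i] for y_i ~ N(mu_i, 1) truncated to
    y_i > 0 when w_i = 1 and to y_i <= 0 when w_i = 0."""
    mu = np.asarray(mu, dtype=float)
    pos = mu + norm.pdf(mu) / norm.cdf(mu)           # truncated to (0, inf)
    neg = mu - norm.pdf(mu) / (1.0 - norm.cdf(mu))   # truncated to (-inf, 0]
    return np.where(w == 1, pos, neg)

mu = np.array([-1.0, 0.0, 2.0])                      # linear predictors
print(latent_expectation(mu, np.array([1, 0, 1])))
```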
Polygenic model
For a reference, we have considered a Bayesian version of G-BLUP, a classical animal model with a realized relationship matrix, where the genetic effects are assigned to individual animals, not to genetic markers,

$$y = \mathbf{1}\beta_0 + Zu + \varepsilon, \qquad (2)$$

where u now represents the additive genetic values of the individuals. Genetic marker data are incorporated into the traditional animal model in the form of a realized relationship matrix G, estimated from the marker data, which substitutes for the pedigree-based numerator relationship matrix A. We generated the genomic relationship matrix with the second method described in VanRaden (2008). Within the classical framework genomic breeding values are commonly estimated with known variance components, while in the Bayesian approach the variance components are estimated simultaneously with the genomic breeding values (Hallander et al. 2010). Contrary to the classical framework, Bayesian inference is always based on variances that are estimated from the data, and hence are up-to-date and specific to the analyzed trait, letting also the uncertainty of the variance components be incorporated into the breeding values. Even though, e.g., ASREML (Gilmour et al. 2009) estimates the variance components from the data, and hence satisfies the up-to-date criterion, the variances are not estimated simultaneously with the breeding values; instead, the preestimated variance components are considered constant while the breeding values are estimated. The likelihood of the data under the animal model is simply multivariate normal with mean $\beta_0 + Zu$ and covariance $\sigma_0^2 I$. Priors for the genetic values u and the population intercept β0 are a conjugate multivariate normal $u \sim N(0, \sigma_G^2 G)$ and a uniform density, respectively, G being the genomic relationship matrix. The variances $\sigma_0^2$ and $\sigma_G^2$ have inverse-χ2 priors, an uninformative Inv-χ2(0, 0) and a flat Inv-χ2(2, M) with large M, respectively. The latter differs from the prior density proposed for the additive genetic variance of the polygene $\sigma_u^2$ of the multilocus association model (1), since here the polygene has to cover all of the genetic variance, not only a small fraction, and hence the variance component needs to be able to reach substantially higher values.
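For concreteness, here is a sketch of computing a marker-based relationship matrix in the spirit of VanRaden (2008); it follows our reading of his second method (each marker centered by twice its allele frequency and weighted by the inverse of its expected variance), with allele frequencies estimated from the data and monomorphic markers assumed to be absent:

```python
import numpy as np

def genomic_relationship(M):
    """G from an (n x p) 0/1/2 genotype matrix: markers centered by 2*p_j
    and weighted by 1/(2*p_j*(1 - p_j)), then averaged over markers."""
    n, p = M.shape
    freq = M.mean(axis=0) / 2.0                     # reference-allele frequencies
    W = M - 2.0 * freq                              # centered genotypes
    weights = 1.0 / (2.0 * freq * (1.0 - freq))     # per-marker scaling
    return (W * weights) @ W.T / p

rng = np.random.default_rng(2)
M = rng.integers(0, 3, size=(50, 1000)).astype(float)
G = genomic_relationship(M)
print(G.shape, np.diag(G).mean())                   # diagonal ~ 1 on average
```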
Parameter Estimation
Since we know the fully conditional posterior density of every parameter and latent variable, we could easily implement a Gibbs sampler to sample from those distributions. However, as the fully conditional posteriors are known distributions and hence the conditional expectations and maximums are known, we can just as easily use the distributions in implementing a fast algorithm of our choice to find a MAP estimate of the posterior density.
Here we build a generalized expectation-maximization algorithm to compute a posterior maximum of the parameter vector of the multilocus association model (1). The EM algorithm was originally designed for imputation of missing data or estimation of hidden variables, and it operates by iteratively updating the hidden variables to their expected values (E-step) and subsequently the parameter vector to its maximum-likelihood value, given the values assigned to the latent variables (M-step). Later the algorithm was adopted for parameter estimation, updating part of the parameter vector to its expected value and the rest to its maximum-likelihood value; in the Bayesian context these become the conditional posterior expectation and the conditional posterior maximum, respectively. To enable handling of large marker sets the iterative updating in our algorithm is done one parameter at a time, conditionally on the other parameters remaining fixed. This practice is a form of a generalized EM and can be seen as a further extension of the expectation-conditional-maximization (ECM) algorithm, as we update all of the parameters individually, not only the ones to be maximized.
The partition of the variables into parameters and hidden variables is somewhat arbitrary, although often variances and other nuisance parameters are integrated out from the posterior by updating them to their conditional expectations. Under a Gaussian model the classification is even more vague, since the expected and the maximum values of the location parameters are the same (symmetric posterior density), and the corresponding values for the scale parameters with an inverse-χ2 posterior become equivalent after a slight modification of the prior hyperparameters. Thus, since it is not clear, or even interesting, which parameters are updated to their conditional maximums and which to their conditional expectations, we base our method on an alternative description of the EM algorithm (Neal and Hinton 1999), regarding both of the steps as maximization procedures of the same objective function. Under this description, the generalized expectation-maximization (GEM) algorithm corresponds to seeking to increase the objective function instead of attempting to maximize it. That is, we do not guarantee that the chosen arguments maximize the objective function, but know that the value of the function will increase with every update. The generalized algorithm has been proved to converge to the same estimate as the standard EM algorithm, although possibly more slowly (Neal and Hinton 1999). A similar algorithm was implemented for the animal model (2).
GEM algorithm for a MAP estimate
1. Set initial values for the parameter vectors. We use zeros for β0, β, and u; small positive values, namely 0.1, for the variances; and 0.5 for the indicators γ.
2. In the case of a binary phenotype, update the values of the latent variable y by replacing the current values yi with the expected values of the truncated normal distributions (Equation A11),

$$y_i \leftarrow \begin{cases} \mu_i + \dfrac{\varphi(\mu_i)}{\Phi(\mu_i)}, & \text{if } w_i = 1 \\[2mm] \mu_i - \dfrac{\varphi(\mu_i)}{1 - \Phi(\mu_i)}, & \text{if } w_i = 0, \end{cases}$$

where $\mu_i = \beta_0 + x_i\Gamma\beta + z_iu$ is the linear predictor, and φ(⋅) and Φ(⋅) denote the standard normal density function and the distribution function, respectively.
3. Maximize the posterior distributions of β0 and βj (for all j) by substituting the fully conditional expectations for the current values of the parameters, one at a time. According to (A1) and (A2) we set

$$\beta_0 \leftarrow \frac{1}{n}\,\mathbf{1}'(y - X\Gamma\beta - Zu)$$

and

$$\beta_j \leftarrow \frac{\gamma_j\, x_j' r_j}{\gamma_j^2\, x_j'x_j + \sigma_0^2/\sigma_j^2}, \qquad r_j = y - \mathbf{1}\beta_0 - X\Gamma\beta + \gamma_j x_j\beta_j - Zu,$$

where $r_j$ is the residual excluding the contribution of marker j.
4. Update the error variance $\sigma_0^2$ to its conditional expectation. Since the expected value of an Inv-χ2(ν, τ2) distribution is ντ2/(ν − 2), we get from (A3) the conditional expectation of the posterior distribution of $\sigma_0^2$ that replaces the existing value:

$$\sigma_0^2 \leftarrow \frac{(y - \mathbf{1}\beta_0 - X\Gamma\beta - Zu)'(y - \mathbf{1}\beta_0 - X\Gamma\beta - Zu)}{n - 2}.$$
5. Update the effect variances $\sigma_j^2$ (for all j) to their conditional expectations. Under the Student’s t prior model the fully conditional posterior distribution of $\sigma_j^2$ is an inverse-χ2, as expressed in (A4); hence we get

$$\sigma_j^2 \leftarrow \frac{\nu\tau^2 + \beta_j^2}{\nu - 1}.$$

The value set for τ2 has to be small to create the desired shrinking effect; the smaller the value is, the more shrinkage there will be. Under the Laplace prior model the precision, or inverse of the variance parameter $\sigma_j^2$, has an inverse-Gaussian fully conditional posterior distribution (A5), and its expected value equals

$$E\!\left(\frac{1}{\sigma_j^2}\,\Big|\,\cdot\right) = \sqrt{\frac{\lambda^2}{\beta_j^2}} = \frac{\lambda}{|\beta_j|}.$$
6. Replace the additive variance of the polygenes, $\sigma_u^2$, with its conditional expectation, given by the Inv-χ2 distribution in (A6),

$$\sigma_u^2 \leftarrow \frac{u'A^{-1}u + 0.2}{N}$$

for the Inv-χ2(2, 0.1) prior, and N > 2.
7. Maximize the polygenic effects u by replacing the current values with the conditional expectations. From (A7) we get

$$u \leftarrow \left(Z'Z + \frac{\sigma_0^2}{\sigma_u^2}A^{-1}\right)^{-1} Z'(y - \mathbf{1}\beta_0 - X\Gamma\beta).$$
8. Update the indicators γj one at a time. First compute the fully conditional posterior probability of γj = 1,

$$P(\gamma_j = 1 \mid \cdot) = \frac{\pi\,\ell_j}{\pi\,\ell_j + 1 - \pi}, \qquad \ell_j = \exp\!\left\{-\frac{1}{2\sigma_0^2}\Big[(r_j - x_j\beta_j)'(r_j - x_j\beta_j) - r_j'r_j\Big]\right\},$$

with $r_j$ as in step 3, as expressed in (A8), and then substitute this fully conditional expectation for the current value of γj, $\gamma_j \leftarrow P(\gamma_j = 1 \mid \cdot)$.
9. If the hyperparameters λ and π are estimated, update the values to their conditional expectations. The expected value of a Gamma(κ, ξ) density, the parameter ξ being the inverse scale or rate, is κ/ξ. Hence, from (A9) we get

$$\lambda^2 \leftarrow \frac{\kappa + p}{\xi + \sum_{j=1}^{p}\sigma_j^2/2}.$$

The expected value of a Beta(a, b) density is a/(a + b), so, from (A10),

$$\pi \leftarrow \frac{a + \sum_{j=1}^{p}\gamma_j}{a + b + p},$$

which under the informative Beta(a, p − a) prior becomes

$$\pi \leftarrow \frac{a + \sum_{j=1}^{p}\gamma_j}{2p},$$

where p is the number of SNP markers, and a can be considered as the number of segregating QTL (nqtl).
The steps are repeated until convergence.
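Putting the steps together, the following is a minimal NumPy sketch of the GEM loop for the Student’s t model with indicators but without a polygene (variant III), for a continuous phenotype. It is our simplified reading of the steps above (with ν = 2 by default), not the authors’ Matlab implementation from File S2.

```python
import numpy as np

def gem_student_t(y, X, tau2=0.01, nu=2.0, pi=0.01, n_iter=50):
    """GEM sketch: Student's t prior, indicators, no polygene (variant III)."""
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    gamma = np.full(p, 0.5)                  # indicator expectations
    sigma2_e, sigma2 = 0.1, np.full(p, 0.1)  # residual and effect variances
    xtx = (X ** 2).sum(axis=0)
    r = y - beta0 - X @ (gamma * beta)       # residual under current parameters

    for _ in range(n_iter):
        # step 3: intercept to its fully conditional mean (flat prior)
        r += beta0
        beta0 = r.mean()
        r -= beta0
        for j in range(p):
            xj = X[:, j]
            r += gamma[j] * beta[j] * xj     # residual excluding marker j
            # step 3: effect size to its fully conditional posterior mean
            beta[j] = (gamma[j] * (xj @ r)) / (gamma[j] ** 2 * xtx[j]
                                               + sigma2_e / sigma2[j])
            # step 5: effect variance, Inv-chi2 fully conditional expectation
            sigma2[j] = (nu * tau2 + beta[j] ** 2) / (nu - 1.0)
            # step 8: indicator, Bernoulli fully conditional expectation
            log_ratio = (-0.5 / sigma2_e) * (np.sum((r - xj * beta[j]) ** 2)
                                             - np.sum(r ** 2))
            gamma[j] = 1.0 / (1.0 + (1.0 - pi) / pi * np.exp(-log_ratio))
            r -= gamma[j] * beta[j] * xj
        # step 4: residual variance, Inv-chi2 expectation under a flat log prior
        sigma2_e = np.sum(r ** 2) / (n - 2.0)
    return beta0, beta, gamma, sigma2_e

# e.g., with y and X from the simulation sketch of model (1) above:
# b0_hat, beta_hat, gamma_hat, s2e_hat = gem_student_t(y, X)
```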
Example Analyses
In our example analyses we have considered the predictive performance and behavior of the model with the two alternative shrinkage priors, Student’s t and Laplace, as well as the importance of the hierarchical formulation of the latter. Further, we have studied the necessity of the components of the model under the different prior densities. Therefore, in addition to the proposed model (1), we have considered a model without the indicator variable (i.e., γ = 1), a model without the polygenic component (i.e., u = 0), and a model lacking both of these components. We refer to these variants as I, II, III, and IV, respectively, so that variant I is the full model (1), variant II lacks the indicator, variant III lacks the polygene, and variant IV lacks both. To examine the importance of the hierarchical definition of the Laplace prior for the model behavior we have implemented a nonhierarchical variant of the Laplace prior model as our own GEM version of the MCMC algorithm proposed by Hans (2009), with the prior density for the regression coefficients βj being βj | λ ∼ Laplace(0, λ) and the hyperprior for the parameter λ being λ ∼ Gamma(κ, ξ) (note that here the hyperprior is assigned to λ instead of λ2). To compare the accuracy of the estimates given by the different models we have computed the genomic breeding values for the prediction set individuals and examined the Pearson’s correlation coefficient between the true and the estimated breeding values under the model varieties. The estimated breeding value is simply the linear predictor of model (1). The possible bias of the estimated breeding values was measured with the regression coefficient (cov(TBV, GEBV)/var(GEBV)) of the true breeding values on the estimated ones. In the absence of environmental covariates, the heritability of the trait was estimated indirectly from the observed phenotypic variance and the estimate of the residual variance.
We have tested our method with two data sets, the first of which is simulated data introduced in the XII QTL-MAS Workshop 2008 (Lund et al. 2009) and is freely available at the workshop homepage, http://www.computationalgenetics.se/QTLMAS08/QTLMAS/DATA.html. We selected this particular data set to be used in our example analysis since the data have been used extensively in different studies (e.g., Usai et al. 2009; Hallander et al. 2010; Shepherd et al. 2010), enabling an easy way to get some idea of the performance of our method in comparison to other methods proposed. The second data set is real pig (Sus scrofa) data, provided by the GENETICS journal to be used for benchmarking of genomic selection methods. The data have been described in detail by Cleveland et al. (2012).
Simulated data analysis
The XII QTL-MAS data set consists of 5865 individuals from seven generations of half-sib families, simulating a typical livestock breeding population (see Lund et al. 2009 for details). All of the individuals have information on 6000 biallelic SNP loci, evenly distributed over six chromosomes of length 100 cM each. Since SNPs with minor allele frequency <0.05 within the learning set were discarded, the actual number of markers in the analysis was 5726. The first four generations of the data, 4665 individuals, have both marker information and a phenotypic record and function as a learning set, while generations five to seven, 1200 individuals, function as a prediction or a test set.
There are 48 simulated QTL in the data set. The cumulative effect of the simulated QTL equals the genetic value of the individuals, while the phenotypes of the individuals have been obtained as the sum of the individuals’ genetic value and a random residual drawn from a normal distribution with mean zero and a variance set to produce heritability value 0.3 (Lund et al. 2009). The advantage of using a simulated data set in the example analysis is the availability of the true genetic values of the individuals and true effects and locations of the simulated trait loci. The accuracy of the estimates produced in the example analysis can therefore be examined directly by comparing the genetic values or true breeding values (TBV) and the genomic breeding value estimates (GEBV). Equally, the estimated effects and locations of the loci can be directly compared to the real ones.
The first set of analyses comprises estimation of the breeding values with all of the 13 above-mentioned model variants: 4 variants for each of the three prior densities and the animal model (2) as a reference. The prior hyperparameters of the Student’s t prior model were set to τ2 = 0.01 and π = 0.0052, the latter expressing a prior assumption of 30 segregating QTL (nqtl = 30). The Laplace models included the additional hyperprior layer, so the corresponding hyperparameters were estimated simultaneously with the model parameters, with hyperpriors λ2 ∼ Gamma(1, 1) and π ∼ Beta(1, 1) for the hierarchical version and λ ∼ Gamma(1, 100) and π ∼ Beta(30, p − 30) for the nonhierarchical version. In the animal model (2) the scale of the prior of the additive genetic variance was set to 1600.
The results of the analyses, summarized in Table 2, suggest that the Student’s t model benefits from the addition of a zero-point mass in the prior density, as the correlation r between the true and the estimated breeding values over all generations increased from 0.83 to 0.90 when the indicator variable was included in the model (from 0.83 to 0.85 if there was no polygenic component in the model).
The hierarchical Laplace model seems to actually suffer from the addition of the indicator variable, as the correlation decreased slightly, from 0.89 to 0.88, when the indicator variable was present. The estimated proportion of markers contributing to the phenotype, or the value of π, in the hierarchical Laplace model was ≃0.8, so the majority of the markers were considered linked to the trait. The estimated value of the parameter λ was ≃75 in all of the hierarchical model variants. Contrary to the hierarchical version, the nonhierarchical Laplace model requires the indicator variable. In this case the estimated proportion of contributing markers was <2% (π = 0.018). There was a striking improvement in the performance of the nonhierarchical model after the addition of the indicator variable: the value of the correlation coefficient grew from 0.72 to 0.89 and that of the regression coefficient from 0.68 to 1.00. The estimated value of λ was ≃37 under both of the models with the indicator variable and ≃44 under the models without an indicator.
The polygenic component u appears to be beneficial when the Student’s t prior has been used with an indicator variable, but has no impact when the prior has been used without an indicator or with either of the Laplace priors. Since the heritability calculated from data in the QTL-MAS data set is 0.32, it seems that all of the models but the nonhierarchical Laplace model without an indicator are capable of estimating it quite accurately. As expected, the results indicate superiority of the multilocus association model over the animal model. However, the Bayesian animal model with unknown variance components performed remarkably well when compared to a frequentist G-BLUP, the latter getting a correlation value of 0.75, when the correct variance components were given.
Even though these results are based on only one data set, and hence are special cases, and also best-case scenarios in the sense that the prior hyperparameter values have been selected to yield the best accuracy, they provide an important starting point for a comparison of the model performances by reproducing the procedure performed in the literature concerning this particular data set (e.g., Lund et al. 2009; Usai et al. 2009; Shepherd et al. 2010). The correlation values obtained by Usai et al. (2009) for a G-BLUP model, a Student’s t model without an indicator variable or a polygene, and a frequentist Laplace model without an indicator variable or a polygene are equal to our corresponding correlation values (0.75, 0.83, and 0.89, respectively). This is noteworthy, since their estimates were acquired by either MCMC simulation or a LARS algorithm, both requiring several hours of run time, while ours were acquired by an EM algorithm that takes only a few minutes. Shepherd et al. (2010) observed a correlation value of 0.88 under their nonhierarchical Laplace model with indicator variables, which, somewhat surprisingly, is slightly inferior to the correlation of 0.89 we obtained with our corresponding model.
The breeding value estimates produced by the Student’s t model tended to be biased downward (regression coefficient b ≃ 0.8), while the hierarchical Laplace model produced slightly upward-biased values (b = 1.1 with an indicator and 1.4 without one). The most unbiased values were obtained with the nonhierarchical Laplace model with an indicator variable and with the Bayesian G-BLUP (b = 1.00 and 1.06, respectively). The breeding values acquired by the frequentist G-BLUP, with the correct variance components, are biased downward with b = 0.88. Since the bias of the estimated breeding values is important mainly when comparing breeding values estimated with different methods, which is not advised in any case, we do not concentrate on the bias of the estimates in our subsequent analyses of the data replicates.
Simulated data replicates
To further examine the model performance in a less data-specific situation, with the influence of sampling variation diminished, we generated 99 replicates of the QTL-MAS data set of approximately the same heritability to have a total of 100 phenotype sets (99 plus the original one). The phenotypic value of a given individual in the QTL-MAS data being the sum of the genetic value of that individual and a random residual (Lund et al. 2009), the replicated phenotypes were obtained by simply resampling the residuals from a normal density $N(0, \operatorname{var}(\mathrm{TBV})(1/h^2 - 1))$, where var(TBV) denotes the observed variance of the genetic values and the heritability h2 = 0.3. The 100 replicated data sets were analyzed with the most promising and/or interesting model variants.
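The resampling itself is a one-liner; a sketch with variable names of our own (tbv holding the true breeding values):

```python
import numpy as np

def replicate_phenotypes(tbv, h2=0.3, rng=np.random.default_rng()):
    """New phenotypes = TBV + residuals drawn so that heritability ~ h2."""
    sd = np.sqrt(np.var(tbv) * (1.0 / h2 - 1.0))
    return tbv + rng.normal(0.0, sd, size=len(tbv))
```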
The Student’s t model with both indicator and polygenic components (variant I) was the most accurate model variant in the preliminary analysis and hence was selected as a matter of course for closer inspection. As the prior hyperparameters τ2 and π of the Student’s t model are given, one objective is to consider the potential influence of the selected hyperparameters on the predictive performance of the model. We therefore carried out an analysis with five different parameter combinations; the results of the analyses, shown in Table 3, comprise the correlation between the true and genomic breeding values within the prediction set individuals in the original QTL-MAS data, in addition to the mean, maximum, minimum, and variance of the correlation in the set of 100 analyses.
The parameter combinations comprise three values for the number of QTL, namely 30, 60, and 100, in addition to three values covering three orders of magnitude, from $10^{-1}$ to $10^{-3}$, for the scale parameter τ2. Even though the correlation observed within the original data set varies between 0.86 and 0.90 depending on the hyperparameter values, the range of the mean values is much less wide, only from 0.84 to 0.85. Also the maximum and minimum values show less variation between different priors. To examine the effect of the polygenic component, the Student’s t model with an indicator, but the polygene u set to zero (variant III), was tested with the same parameter combinations as the model including both of the components. On average the impact of the polygene in the replicated data sets is less clear than in the individual original QTL-MAS data, even though the highest of the observed correlations were slightly larger when the polygene was included into the model, 0.90 and 0.89, respectively, and the means of the correlations were actually lower in the polygene model, 0.85 and 0.86, respectively (Table 3). Corresponding to the observations with the polygene model, the mean and maximum values of the correlations were not influenced by the selection of the hyperparameters, contrary to the correlation within the original QTL-MAS data set.

The Student’s t model without an indicator or a polygenic component (variant IV) was tested by analyzing the replicated data sets with four alternative values for the scale parameter τ2, covering seven orders of magnitude from $10^{-4}$ to $10^{-10}$. Since here the Student’s t prior alone has to provide the shrinkage of the marker effects, the values given for the scale parameter τ2 are substantially smaller than in the indicator model, leading to more intensive shrinkage. This model was the least sensitive to the hyperparameter values, as all of the observed correlations are the same while the hyperparameter τ2 varies between $10^{-6}$ and $10^{-10}$. On average, the performance of the three models was less divergent than it appeared to be after the analysis of the single original data set, the highest mean correlations ranging between 0.84 and 0.86 and maximums between 0.88 and 0.90, in contrast to the corresponding values in the original data analysis, ranging between 0.83 and 0.90.
Since the polygenic component did not seem to have any impact on the Laplace prior models, we chose for the next phase the two model variants without polygenes (i.e., u = 0) (variants III and IV) for both hierarchical and nonhierarchical priors. Contrary to the Student’s t model, the Laplace models include the additional hyperprior layer, and hence the potential influence of the given parameter values is transferred from the hyperparameter layer to the hyperprior layer (see Figure 1). Three values, 0.5, 1, and 3, were proposed for the rate parameter ξ of the gamma prior of λ2 in the hierarchical Laplace model without an indicator variable, and the Gamma(1, 1) prior was combined with two extreme beta priors, a uniform Beta(1, 1) and a highly informative Beta(30, p − 30), for the probability π in the model with an indicator (p is the number of markers).

Under the nonhierarchical Laplace model the variant with the indicator variable was clearly more promising, hence earning a more thorough investigation than its counterpart without an indicator. Two gamma priors, Gamma(1, 10) and Gamma(1, 100), were tested for the parameter λ and three beta priors, Beta(1, 1), Beta(30, p − 30), and Beta(100, p − 100), for the indicator under the model variant III (indicator, no polygene), along with a single prior, Gamma(1, 100), under variant IV (no indicator, no polygene). As presented in Table 3, the best correlation was acquired under the hierarchical Laplace model without an indicator, with a Gamma(1, 1) prior for λ2, where the maximum correlation exceeded 0.91, while the mean correlation was 0.89. The model with an indicator remained slightly inferior, as suggested by the preliminary analysis, the mean correlation being 0.88 with a uniform prior for π. The estimated values of the parameter λ varied only slightly within the 100 analyses and were ≃105, 74, and 43 with hyperprior rates 0.5, 1, and 3, respectively (see Table S1). The estimates of the probability π were near 1 under the Beta(1, 1) prior and ≃0.01 under the Beta(30, p − 30) prior. The informative prior for π distracted the model quite badly, resulting in the mean correlation dropping from 0.89 (no indicator) to 0.86 (informative prior for the indicator). Under the nonhierarchical Laplace model a Gamma(1, 100) prior for the parameter λ with a Beta(30, p − 30) prior for the indicator probability π were the best choices, leading to correlation values equal to the best ones from the hierarchical Laplace model. The model seems to be quite robust with regard to the indicator prior, as the prior assumption of 30 or 100 segregating QTL led to very similar results.
Binary data
The binary phenotype model was tested by using the replicates of the QTL-MAS data with dichotomized phenotypic values. A binary phenotype was acquired by simply cutting the data in two in a way that 80% of the learning set individuals get a binary value 0 and 20% get a binary value 1. The selection of the success probability 0.2 was arbitrary, except that we wanted to avoid the extreme variance values 0.25 and 0 for the Bernoulli response variable, with corresponding success probabilities 0.5 and 0 (or 1).
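Such a dichotomization amounts to a quantile cut; a sketch assuming, arbitrarily, that the top 20% of phenotypic values receive the value 1:

```python
import numpy as np

def dichotomize(y, top=0.2):
    """Binary phenotype: 1 for the top fraction of phenotypic values, else 0."""
    return (y > np.quantile(y, 1.0 - top)).astype(int)
```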
The results for the Student’s t model, for both of the Laplace models and for the Bayesian G-BLUP are presented in Table 4. The results show that the hierarchical Laplace model suffers less from the reduced information of the data than the nonhierarchical Laplace model and the Student’s t model.
The mean correlations acquired by the nonhierarchical Laplace models and the best Student’s t models are only 0.78 and 0.77, respectively, while the best mean correlation observed under the hierarchical Laplace model is 0.82. The best-performing Student’s t models are the models with an indicator, with hyperparameters π = 100/p and τ2 = 10−3, while the presence of the polygenic component seems to be of no importance. The nonhierarchical Laplace model performed again poorly without the indicator variable, while the hierarchical model performed better without one. The estimated parameter values under the hierarchical Laplace model were the same as in the continuous phenotype case: λ = 105, 74, and 43 with hyperprior rates 0.5, 1, and 3, respectively, and π either ≃1 or 0.1, depending on the prior. Under the nonhierarchical Laplace model λ = 470 (model variant IV, no indicator), 44 (with indicator, 49 when no indicator was present), and 11 (with indicator) with hyperprior rates 10, 100, and 500, respectively, and π = 0.005 and 0.007 with nqtl = 30 and 100, respectively.
Real data analysis
The real pig data set consists of 3534 animals from a single pig line with genotypes from a 60k SNP chip and a pedigree including the parents and the grandparents of the genotyped animals (Cleveland et al. 2012). A total of 3184 genotyped individuals have a phenotypic record for a trait with predetermined heritability 0.62. As only the SNP markers with minor allele frequency >0.05 were accepted to our analysis, the number of markers in the analysis was 45,317. The data set was analyzed with all of the 13 model variants.
The analysis of this data set provides a challenge concerning the proportion of SNPs to individuals. There is an upper limit to the number of effects with respect to the sample size, and even though the limit depends on the sample size (smaller data sets appear to tolerate more oversaturation) and the genetic architecture of the trait (a data set with an oligogenic trait may be less sensitive to oversaturation than one with a polygenic trait), Hoti and Sillanpää (2006) have suggested a limit of 10 times more predictors than individuals. We have reduced the number of markers in the multilocus association model by applying the sure independence screening method of Fan and Lv (2008) for ultrahigh-dimensional feature spaces. The method is based on ranking the predictors with respect to their marginal correlation with the response variable and selecting either a predetermined proportion of the predictors or the predictors exceeding a predetermined importance measure. We have chosen to take the 10,000 best-ranking SNPs to the multilocus association study.
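A sketch of this screening step (our own implementation of the marginal-correlation ranking, not code from Fan and Lv 2008):

```python
import numpy as np

def sis_screen(X, y, keep=10_000):
    """Rank markers by |marginal correlation with y| and keep the best ones."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:keep]     # column indices of kept SNPs
```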
Contrary to the simulated data set, there are neither true genetic values of the individuals nor true effects of the QTL available, and hence we estimate the accuracy of the predicted breeding values by dividing the correlation between the GEBVs and the phenotypic values by the square root of the heritability of the trait. Since the data do not contain a separate validation population, we compute the result statistics using cross-validation, where the 3184 individuals are randomly partitioned into 10 subsets (10-fold cross-validation) of 318 or 319 individuals.
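A sketch of the fold construction and the accuracy measure (function and variable names are ours; gebv holds the cross-validated predictions):

```python
import numpy as np

def ten_folds(n, rng=np.random.default_rng(3)):
    """Random partition of n individuals into 10 roughly equal folds."""
    return np.array_split(rng.permutation(n), 10)

def accuracy(y, gebv, h2=0.62):
    """Prediction accuracy ~ cor(GEBV, phenotype) / sqrt(heritability)."""
    return np.corrcoef(y, gebv)[0, 1] / np.sqrt(h2)
```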
The selection of the model hyperparameters is done similarly to that in the previous analyses, the only difference being the method of accuracy estimation. As previously, we tried quite a few values and selected the ones producing the best accuracy. Since here we do not have any extra information but the pheno- and genotypes of the individuals, the procedure qualifies as a genuine parameter selection by cross-validation. The selected hyperparameter values under the Student’s t model variants with the indicator (I and III) were τ2 = 1 and π = 0.05, and τ2 = 0.01 under the variants without an indicator (II and IV). The best hyperpriors for the hierarchical Laplace model were λ2 ∼ Gamma(1, 4000) for all variants (I–IV) and π ∼ Beta(1, 1) for the indicator (variants II and IV). For the nonhierarchical Laplace model the corresponding hyperpriors were λ ∼ Gamma(1, 4000) and π ∼ Beta(1000, p − 1000) under the model variants with the indicator (I and III) and λ ∼ Gamma(1, 2000) under the variants without an indicator (II and IV). In all of the models including a polygenic component the prior for the polygenic variance was inverse-χ2(2, 1). In the animal model (2) the scale of the prior of the additive genetic variance was set to 1.3 × 106. Some of the hyperparameter and hyperprior parameter values are remarkably large due to the large variance of the phenotype, the variance being ∼3500 (Cleveland et al. 2012). The accuracy estimates acquired with these hyperparameters are presented in Table 5.
Contrary to the QTL-MAS data set, the phenotypes of the pig data are highly polygenic. The polygenic nature of the data is reflected by the selected values of the a priori assumed number of QTL (500 or 1000), as well as by the relatively high accuracy of the Bayesian G-BLUP (correlation 0.63 with both the Bayesian G-BLUP and the best of the association models, Table 5). Again, the Bayesian version of G-BLUP with simultaneously estimated variance components was slightly more accurate than the frequentist version with predetermined heritability; the correlation and regression coefficients acquired by the frequentist method were 0.62 and 0.84, respectively. With the polygenic data the additional indicator variable (model variants I and III) was not as important as with the oligogenic QTL-MAS data. In fact, the accuracy of the Student’s t model was the same (0.54) regardless of the indicator, while the nonhierarchical Laplace model was only slightly more accurate when the indicator was present (correlation 0.60 with and 0.58 without the indicator). Correspondingly, the hierarchical Laplace model suffered less from the addition of the indicator variable; the accuracy dropped only from 0.63 to 0.62 when the indicator was added. In this set of analyses the Student’s t model performed remarkably badly compared to the Laplace models: the best correlation obtained under the Student’s t model was 0.55, while the hierarchical Laplace model reached a correlation of 0.63. The hierarchical Laplace model was superior to the nonhierarchical model; the best observed correlations under the two models were 0.63 and 0.60, respectively. The estimated value of the hyperparameter λ was 1.56 with all of the hierarchical Laplace model variants, 1.97 with the nonhierarchical variants with an indicator (I and III), and 4.68 with the nonhierarchical variants without an indicator (II and IV). The estimate for the indicator hyperparameter π had value 1 under the hierarchical Laplace model, reflecting the redundancy of the indicator in the hierarchical model version, and was 0.11 under the nonhierarchical model. The polygenic component (model variants I and II) did not improve the model performance, probably due to the large number of markers.
In all analyses the algorithm was iterated until convergence was ascertained from the visualized behavior of the estimates. The estimates converged rapidly: after only 20 GEM iterations when estimating the parameters of the Student’s t model or the Laplace model without an indicator and after 13 or 14 GEM iterations when estimating the animal model parameters. The time needed varied from 12 sec to a few minutes, depending on the model, the polygene being the slowest component to update. The only parameter requiring a substantial time to converge was the probability of a marker being linked to the trait, π, in the hierarchical Laplace model. This hyperparameter tended to converge to a value near unity [with a Beta(1, 1) prior] when given enough time, usually >200 GEM iterations. The GEM algorithm was implemented with Matlab version 7.10.0 (R2010a); the Matlab codes for the parameter estimation and the data replication are provided in File S2. The analyses were performed on a 64-bit Windows 7 desktop computer with a 3.50-GHz Intel i7 CPU and 16.0 GB RAM.
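A numeric convergence check can replace the visual inspection described above. The following Matlab sketch monitors the relative change of the estimate vector; gemUpdate is a placeholder contraction standing in for one sweep of conditional-posterior updates over all model parameters:

% Sketch of the GEM iteration with a numeric convergence criterion.
% gemUpdate is a placeholder for one sweep of conditional-posterior
% updates; here it is a toy contraction with fixed point 2.
gemUpdate = @(th) 0.5*th + 1;
theta = ones(10, 1);                 % initial estimates
maxit = 500; tol = 1e-8;
for it = 1:maxit
  thetaNew = gemUpdate(theta);       % one GEM sweep (E-step + M-steps)
  if norm(thetaNew - theta) / max(1, norm(theta)) < tol
    break;                           % relative change small: converged
  end
  theta = thetaNew;
end
fprintf('converged in %d iterations\n', it);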
Discussion
As declared in the Introduction, the purpose of this article is to (1) find a common denominator for many of the Bayesian multilocus association methods with shrinkage-inducing priors proposed in the literature and build a general model framework explaining the similarities and differences between the proposed models and (2) examine the submodels, especially regarding (a) the different shrinkage prior densities, namely Student’s t and Laplace; (b) the possible advantage of the hierarchical formulation of the prior density for the model performance; and (c) the necessity of the model components (indicator variable, polygenic component, and hyperprior layer) under the different prior densities.
Many of the 12 variants of the multilocus association model considered in the example analyses correspond closely to methods proposed in the literature. As presented in Table 6, the variant IV Student’s t model without a polygene (i.e., u = 0) or an indicator (i.e., γ = 1) evidently equals BayesA (Meuwissen et al. 2001; Xu 2003), while the variant III Student’s t model with an indicator variable γ but without a polygene (u = 0) equals the weighted Bayesian shrinkage regression method (wBSR) proposed by Hayashi and Iwata (2010). Although the latter leads to the same marginal posterior for the marker effects as BayesB (Meuwissen et al. 2001; see Habier et al. 2007 and Knürr et al. 2011 for comments), the two models have different hierarchical structures. In our model the marker effect size β is a priori independent of the indicator γ, and the marker effect is given as the product Γβ, as illustrated in Figure 2A. In the original BayesB the marker effect is given by β alone, since the likelihood does not include the indicator; instead, the indicator operates through the effect variance σ2 (see Figure 2B). The parameterization of BayesB has been criticized by Gianola et al. (2009). It is noteworthy that with our parameterization an MCMC algorithm for the model variant III would consist only of Gibbs steps, while with the original parameterization of BayesB Metropolis–Hastings steps are needed. Hence, even though we do not have direct evidence of the benefit of our parameterization for the GEBV accuracy, estimation-wise the advantage of this model structure is clear.
Figure 2 Different hierarchical structures of the models. (A) The hierarchical structure of our model: the effect size and the indicator are independent, and both contribute to the likelihood. (B) The hierarchical structure of BayesB (Meuwissen et al. 2001): the indicator operates through the effect variance. (C) The structure of the nonhierarchical Laplace model: the effect size and the indicator are independent and both contribute to the likelihood, but there is no latent variable level. (D) The hierarchical structure of emBayesB (Shepherd et al. 2010): the indicator variable should affect only through the marker effect, but seems to be present also in the likelihood (dotted line).
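In our notation the two hierarchies can be summarized schematically as follows; this is merely a gloss of the structures in Figure 2, A and B:

\text{our model:}\quad y_i = \sum_j x_{ij}\gamma_j\beta_j + u_i + e_i, \qquad \beta_j \sim N(0, \sigma_j^2) \ \text{independently of}\ \gamma_j \sim \text{Bernoulli}(\pi);

\text{BayesB:}\quad y_i = \sum_j x_{ij}\beta_j + e_i, \qquad \beta_j \sim N(0, \sigma_j^2), \qquad \sigma_j^2 = 0 \ \text{with probability}\ 1 - \pi.

In the first line the indicator multiplies the effect in the likelihood, so γ has a conjugate Bernoulli full conditional; in the second the point mass at σ2 = 0 is what forces the Metropolis–Hastings steps.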
As noted in Table 6, the variant IV hierarchical Laplace model without an indicator or polygene, with an estimated parameter λ, covers several Bayesian Lasso models introduced by Yi and Xu (2008). A corresponding hierarchical Laplace model with a prespecified hyperparameter has also been studied (Usai et al. 2009; Xu 2010). The fast versions of BayesB, fBayesB (Meuwissen et al. 2009) and emBayesB (Shepherd et al. 2010), correspond closely to the nonhierarchical Laplace model we have considered; however, again the structures are different. The structure of our nonhierarchical Laplace model is consistent with its hierarchical counterpart, as only the latent variable layer is integrated out (see Figure 2C). The hierarchical structure of emBayesB is somewhat cryptic, since the indicator variable should work only through the marker effect but instead seems to be present also in the likelihood (Figure 2D). This may explain the difference in the performances of the two models noted in the previous section. Further, unlike Shepherd et al. (2010), we did not observe the starting values of the algorithm affecting the results.
Our example analyses demonstrate that the Bayesian multilocus association model is a powerful tool for estimating the effects of genetic markers associated with QTL, leading to accurate breeding value estimates for genomic selection. Compared to the animal model, the multilocus association model benefits from the ability to assign effects of different magnitudes to the different marker loci, which is especially important when the trait in question is oligogenic, as is the case in the QTL-MAS data set. In a more polygenic situation, however, the animal model is a competitive alternative, as can be seen in the pig data analyses. We have tested the two most prominent prior densities, Student’s t and Laplace, for the marker effect sizes. Both densities work nicely as shrinkage priors; however, it seems that with a polygenic trait the Laplace prior is clearly more efficient. This is an important observation, since the critique of marker association models and their poor behavior in a polygenic situation (e.g., Daetwyler et al. 2010; Clark et al. 2011) is mainly based on observations of BayesB, that is, a Student’s t model.
An EM algorithm for the Laplace model might be more easily tuned than an algorithm for the Student’s t model, probably because of the additional hyperprior layer. The prevalent approach, one that we have also adopted, is to treat the hyperparameters ν and τ2 of the Student’s t prior as given, while the hyperparameter λ of the Laplace prior is treated as random. We do not know why others have chosen this course or why Carbonetto and Stephens (2011) use importance sampling for the Student’s t model hyperparameters within their otherwise optimization-based algorithm, but our reason is old-fashioned trial and error: the alternative method did not work. We tried several priors for the scale parameter τ2 of the effect variance in the Student’s t model, but none of them led to a reasonable estimate (results not shown). Yi and Xu (2008) managed to estimate the hyperparameters of the Student’s t prior in a model without an indicator variable with uniform hyperprior densities by MCMC. However, the EM algorithm is less flexible in the sense that the variance of the fully conditional posterior density has no impact when a parameter is updated to its fully conditional posterior mean or maximum, contrary to MCMC sampling, where a large posterior variance leaves great room for maneuver for the parameter to be updated. It appeared impossible to select a prior density such that neither the hyperparameters nor the effect variances would completely determine the posterior expectation. Gelman et al. (2004) advise augmenting the model with an additional parameter breaking the dependence between τ2 and the effect variances σj2, which should help a Gibbs sampler not to get stuck, but even this method did not help the EM algorithm. The EM estimation of the prior probability of a marker being linked to the phenotype, π, also proved impossible with our hierarchical model. With any reasonably uninformative hyperprior the estimated value was almost unity, and the only way to obtain something smaller is to propose a highly informative hyperprior, one so informative that it defeats the whole idea of estimating the value. Therefore, the most reasonable approach seems to be to treat the hyperparameters τ2 and π of the Student’s t prior model as given and determine the best values, e.g., by cross-validation.
As mentioned earlier, although the hyperparameter λ of the Laplace distribution was estimated from the data, the need for parameter tuning passes to the next layer of the model hierarchy. This is clearly illustrated by our proposing several gamma hyperpriors for the parameter and finding that different hyperpriors give different results and that different hyperpriors work for different data sets (Tables 3–5). The necessity of tuning the parameters remains even when the parameter λ is analytically integrated out; in fact, the reduced hierarchy leads to increased sensitivity to the hyperpriors (Cai et al. 2011).
The hierarchical formulation of the model has several advantages over its nonhierarchical counterpart. Even if the marginal distributions of the marker effects (β) are mathematically equivalent in the hierarchical and nonhierarchical models, the parameterization and model structure alter the properties and behavior of the model and thus influence the mixing and convergence properties of an estimation algorithm, as well as the values of the actual estimates. Unlike the nonhierarchical version, the hierarchical Laplace model seems to work without an additional indicator variable. A reduced number of variables leads not only to more straightforward implementation and faster estimation, but also to easier and more accurate tuning of the hyperprior parameters. Otherwise there seemed to be no major difference in the performance of the hierarchical and the nonhierarchical Laplace prior models when the phenotype was oligogenic. However, in a more polygenic situation the hierarchical version was clearly more accurate. In contrast to some previous reports (e.g., Hans 2009; O’Hara and Sillanpää 2009; Shepherd et al. 2010), we did not note poor behavior of the algorithm or dependency on the starting values.
An additional benefit of the hierarchical formulation of the Laplace prior density is that, by maintaining conjugacy, it enables an easy and transparent derivation of the fully conditional posterior densities. Although it is possible to express the fully conditional posterior density of the marker effects under the nonhierarchical Laplace model in closed form, the posterior does not represent any known distribution. The derivation of a fully conditional posterior density of a known distribution provides, in addition to the mean and maximum points required by the EM algorithm, a posterior estimate of the variance of the parameters. Although EM estimation is often criticized for not providing variance estimates, in a conjugate model the EM algorithm is capable of providing accuracy estimates similar to those of more complicated MAP-estimation methods such as variational Bayes. As these accuracy estimates are based on the conditional posterior densities instead of the marginal ones, they are not comparable to the accuracies produced by MCMC estimation and tend to strongly underestimate the variance of the parameter estimates (see, e.g., Li and Sillanpää 2012).
There are competing conceptions in the literature of the necessity of the polygenic component. Although many authors have found the impact of the polygene insignificant (e.g., Calus and Veerkamp 2007; Pikkuhookana and Sillanpää 2009), some are convinced of its importance (e.g., de los Campos et al. 2009; Lund et al. 2009). In our example analyses the estimates of the polygenic component are negligible and have no influence on the estimated breeding values. Lund et al. (2009) assumed that in real data the polygenic component may prove more valuable than what is observed with a simulated data set, since in real data the genetic variation is likely to arise from sources other than QTL alleles and the structure of a real population is likely to produce cryptic residual correlations. We did not note this in our real data analyses; however, it is possible that with a less dense marker map the polygenic component would be more important. In addition, Solberg et al. (2009) noted that even though including a polygenic effect has little impact on the accuracy of the GEBVs in the first prediction generations, the deterioration of the accuracy in subsequent generations may be slower if the model uses both marker and pedigree information.
Acknowledgments
The authors thank Petri Koistinen, Zitong Li, and the anonymous referee for helpful insights. This work was supported by research grants from the Academy of Finland and the University of Helsinki’s Research Funds. The authors declare no conflict of interest that might influence the results or their interpretation in this study.
Appendix
Under the multilocus association model (1) we get the following closed-form fully conditional posterior distributions for our parameters (for simplicity, ⋆ denotes the data and all parameters except the one in question):

(A1) \sigma_0^2 \mid \star \sim \text{inverse-}\chi^2\!\left(n, \; \tfrac{1}{n}\sum_{i=1}^n \bigl(y_i - \textstyle\sum_j x_{ij}\gamma_j\beta_j - u_i\bigr)^2\right)

and

(A2) \beta_j \mid \star \sim N\!\left(\frac{\gamma_j \sum_i x_{ij}\bigl(y_i - \sum_{l \neq j} x_{il}\gamma_l\beta_l - u_i\bigr)}{c_j}, \; \varsigma_j^2\right),

where

(A3) c_j = \gamma_j^2 \textstyle\sum_i x_{ij}^2 + \sigma_0^2/\sigma_j^2 \quad \text{and} \quad \varsigma_j^2 = \sigma_0^2/c_j.

The effect variance under the Student’s t prior model and the Laplace prior model, respectively, is

(A4) \sigma_j^2 \mid \star \sim \text{inverse-}\chi^2\!\left(\nu + 1, \; \frac{\nu\tau^2 + \beta_j^2}{\nu + 1}\right)

(A5) 1/\sigma_j^2 \mid \star \sim \text{inverse-Gaussian}\!\left(\sqrt{\lambda^2/\beta_j^2}, \; \lambda^2\right).

The fully conditional posterior of the indicator is

(A6) \gamma_j \mid \star \sim \text{Bernoulli}\!\left(\frac{\pi f(y \mid \gamma_j = 1, \star)}{\pi f(y \mid \gamma_j = 1, \star) + (1 - \pi) f(y \mid \gamma_j = 0, \star)}\right),

where f(y | γj, ⋆) denotes the normal likelihood of the data with the indicator fixed to the given value. The polygenic effect has the fully conditional posterior

(A7) u \mid \star \sim N\!\left(C^{-1}(y - X\Gamma\beta), \; \sigma_0^2\, C^{-1}\right), with C = I + (\sigma_0^2/\sigma_u^2)A^{-1}

and A the numerator relationship matrix, and the polygenic variance has

(A8) \sigma_u^2 \mid \star \sim \text{inverse-}\chi^2\!\left(n + \nu_u, \; \frac{u' A^{-1} u + \nu_u s_u^2}{n + \nu_u}\right), where inverse-\chi^2(\nu_u, s_u^2) is the prior of \sigma_u^2.

Note that in Equation A2 the summation over l ≠ j means that we sum over all markers except the one whose parameters are in question.
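The following Matlab sketch implements the single-site GEM update of one marker effect following Equations A2 and A3 as written above; the data and all current parameter values are illustrative stand-ins:

% Sketch of the single-site GEM update of one marker effect (A2-A3).
rng(2);
n = 100; p = 50; j = 3;
X = double(randi(3, n, p) - 1);        % synthetic SNP genotypes 0/1/2
b = zeros(p, 1); gam = ones(p, 1);     % current effects and indicators
u = zeros(n, 1);                       % current polygenic effects
sigma0_2 = 1; sigmaj_2 = 0.1;          % residual and effect variances
y = 1.5*X(:,j) + randn(n, 1);          % phenotype with one simulated QTL
% residual with marker j removed (the summation over l ~= j in A2)
rj = y - X*(gam.*b) + X(:,j)*gam(j)*b(j) - u;
denom = gam(j)^2 * (X(:,j)'*X(:,j)) + sigma0_2/sigmaj_2;
mu_j  = gam(j) * (X(:,j)'*rj) / denom; % conditional posterior mean (A2)
s2_j  = sigma0_2 / denom;              % conditional posterior variance (A3)
b(j)  = mu_j;                          % GEM update: move to the mean

The conditional variance s2_j is the by-product that a conjugate model provides for free, as discussed above in connection with the accuracy estimates.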
Since the prior hyperparameters λ and π are also estimated in the Laplace model, there are two additional conditional posteriors to consider. For convenience, the posterior is given for λ2 instead of λ. With a Gamma(\alpha, \xi) prior (shape and rate) for \lambda^2 and a Beta(a, b) prior for \pi, these are

(A9) \lambda^2 \mid \star \sim \text{Gamma}\!\left(\alpha + p, \; \xi + \tfrac{1}{2}\sum_{j=1}^p \sigma_j^2\right)

(A10) \pi \mid \star \sim \text{Beta}\!\left(a + \sum_{j=1}^p \gamma_j, \; b + p - \sum_{j=1}^p \gamma_j\right).
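In the GEM algorithm these hyperparameters are updated to their conditional posterior means, as in the following Matlab sketch; the current effect variances and indicator values are simulated stand-ins:

% Sketch of the hyperparameter updates (A9-A10) to the conditional
% posterior means, with Gamma(alpha, xi) and Beta(a, bb) hyperpriors.
rng(3);
p = 1000;
sigma2 = -0.1*log(rand(p, 1));        % stand-in effect variances sigma_j^2
gam    = rand(p, 1) < 0.05;           % stand-in indicator values gamma_j
alpha = 1; xi = 4000; a = 1; bb = 1;  % hyperprior parameters
lambda2 = (alpha + p) / (xi + sum(sigma2)/2);  % Gamma posterior mean (A9)
piHat   = (a + sum(gam)) / (a + bb + p);       % Beta posterior mean (A10)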
The fully conditional posterior of the latent variable in the binary trait case is a truncated normal density

(A11) y_i \mid \star \sim N_{\pm}\!\left(\sum_j x_{ij}\gamma_j\beta_j + u_i, \; 1\right),

where N_+ denotes the positive side (y_i > 0) of the normal distribution, used when the observed binary response is 1, and N_- the negative side (y_i \le 0), used when the observed response is 0.
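In the E-step the latent variable is replaced by the mean of this truncated normal, which has the standard closed form shown in the following Matlab sketch; the linear predictor m is illustrative:

% Sketch of the E-step for the binary-trait liability (A11): the latent
% variable is replaced by the mean of the truncated normal. The residual
% variance is fixed to 1.
m = 0.3;                                % linear predictor for individual i
phi = @(t) exp(-t.^2/2) / sqrt(2*pi);   % standard normal density
Phi = @(t) 0.5 * erfc(-t / sqrt(2));    % standard normal cdf
Eplus  = m + phi(m) / Phi(m);           % E[y_i | y_i > 0], observed 1
Eminus = m - phi(m) / (1 - Phi(m));     % E[y_i | y_i <= 0], observed 0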
The fully conditional posterior density of the regression coefficients βj under the nonhierarchical Laplace prior model is not a known distribution, but it can still be expressed in closed form (Hans 2009):

(A12) p(\beta_j \mid \star) = w_j^- \, N_-\!\left(\beta_j \mid \mu_j^-, s_j^2\right) + w_j^+ \, N_+\!\left(\beta_j \mid \mu_j^+, s_j^2\right),

where N_+(\cdot \mid \mu, \sigma^2) denotes the positive side of the normal density function with parameters \mu and \sigma^2 and N_-(\cdot \mid \mu, \sigma^2) the negative side, respectively, while

\mu_j^{\pm} = \tilde{\beta}_j \mp \lambda s_j^2 \quad \text{and} \quad s_j^2 = 1/\omega_{jj},

with

\tilde{\beta}_j = \hat{\beta}_j - \sum_{l \neq j} (\omega_{jl}/\omega_{jj})(\beta_l - \hat{\beta}_l),

\hat{\beta} corresponding to the ordinary least-squares solution for β and \omega_{ij} corresponding to the ijth element of the inverse of the covariance matrix of the least-squares estimator, \sigma_0^2 (X'X)^{-1}. The weights are

w_j^{\pm} \propto \Phi\!\left(\pm\mu_j^{\pm}/s_j\right) / \varphi\!\left(\mu_j^{\pm}/s_j\right), \quad \text{normalized so that } w_j^- + w_j^+ = 1,

where φ(⋅) and Φ(⋅) denote the standard normal density function and the distribution function, respectively.
Footnotes
Edited by Dirk-Jan de Koning and Lauren M. McIntyre
- Received January 26, 2012.
- Accepted April 15, 2012.
- Copyright © 2012 by the Genetics Society of America