## Abstract

In this article, a unified Markov chain Monte Carlo (MCMC) framework is proposed to identify multiple quantitative trait loci (QTL) for complex traits in experimental designs, based on a composite space representation of the problem that has fixed dimension. The proposed unified approach includes the existing Bayesian QTL mapping methods using reversible jump MCMC algorithm as special cases. We also show that a variety of Bayesian variable selection methods using Gibbs sampling can be applied to the composite model space for mapping multiple QTL. The unified framework not only results in some new algorithms, but also gives useful insight into some of the important factors governing the performance of Gibbs sampling and reversible jump for mapping multiple QTL. Finally, we develop strategies to improve the performance of MCMC algorithms.

MANY complex traits are controlled by multiple genetic [quantitative trait loci (QTL)] and environmental factors. Mapping QTL is the process of estimating the number of QTL, their genomic positions, and genetic effects conditional on the observed phenotypic data and marker data. This is essentially a problem of model selection (*e.g.*, Broman and Speed 2002; Sillanpää and Corander 2002). QTL mapping is complicated by the fact that the number of QTL and hence the dimensionality of the parameter space are unknown. Recently, the Bayesian methods and Markov chain Monte Carlo (MCMC) algorithms have been applied to jointly infer the number of QTL, their genomic positions, and genetic effects. The reversible-jump MCMC algorithm introduced by Green (1995) can move between models of different dimension and has become an almost routine tool in Bayesian QTL mapping (Hoeschele 2001). Using the reversible-jump MCMC method, we can in principle jointly infer the genetic model of a complex trait and the associated genetic parameters, including the number, positions, and genetic effects of the identified QTL. Recently, a variety of reversible-jump algorithms have been conducted to map QTL in both experimental designs (Satagopan and Yandell 1996; Sillanpää and Arjas 1998, 1999; Stephens and Fisch 1998; Yi and Xu 2000; Gaffney 2001) and pedigrees (Heath 1997; Uimari and Hoeschele 1997; Xu and Yi 2000; Uimari and Sillanpää 2001; Yi and Xu 2001).

The reversible-jump MCMC algorithm is a very general and widely applicable technique (Green 1995, 2003). It appears to be suited for implementing model selection procedures across a wide range of possible genetic architectures. However, this flexible method has been deemed somewhat “difficult” to understand, cumbersome to conduct, and difficult to tune. It also has been noted that the reversible-jump MCMC is usually subject to poor mixing and slow convergence. Therefore, there seems to be a need for further methodological work on improving the efficiency of reversible jump. The improved frameworks have been established recently for conventional statistical models (Godsill 2001; Brooks* et al.* 2003; Green 2003). It is clear that Bayesian QTL mapping can benefit from renewed research efforts.

For conventional linear models, a variety of MCMC methods have been proposed for variable selection, including the variable selection algorithms of Smith and Kohn (1996) and Kuo and Mallick (1998), the MCMC model combination (MC^{3}) technique of Raftery* et al.* (1997), the Gibbs variable selection of Dellaportas* et al.* (2002), and the stochastic search variable selection of George and McCulloch (1993). For certain situations, these different methods have their own advantages. To date, however, they have been rarely applied to the area of mapping QTL (but see Broman and Speed 2002; Yi* et al.* 2003b). The variable selection methods, originally derived from diverse procedures, have been recently shown to relate closely to Green's reversible-jump MCMC (Clyde 1999; Ntzoufras 1999; Godsill 2001; Dellaportas* et al.* 2002). Godsill (2001) recently introduced a composite model space framework that embraces not only all of the above variable selection methods, but also the reversible jump. The composite space method, which is a modification of the product space used by Carlin and Chib (1995), provides an interesting viewpoint on model selection, since it allows MCMC simulation to be performed, at least conceptually, on a fixed dimension space. Under the composite space representation, the Bayesian variable selection methods described above and the reversible-jump algorithm can be shown to derive straightforwardly from a general framework. These relationships between the methods can aid our understanding of MCMC model selection procedures and may assist in the development of improved procedures.

In this study, we propose a composite space presentation for the multiple-QTL model and develop a unified MCMC framework for exploring the posterior of the composite space. The proposed unified approach includes the existing Bayesian QTL mapping methods using reversible-jump MCMC algorithm as special cases. We also show that a variety of Bayesian variable selection methods using Gibbs sampling can be applied to map multiple QTL. The unified framework sheds some light upon the important factors governing the performance of Gibbs sampling and reversible jump for mapping multiple QTL. We also develop strategies to improve the performance of MCMC algorithms.

## THE MULTIPLE-QTL MODEL

We consider a mapping population derived from two or multiple inbred lines. Suppose that the quantitative trait under investigation is affected by *l* loci (QTL). If no epistasis is assumed, the observed phenotypic value of individual *i*, *y _{i}*, can be described by the linear model
1where μ is the population mean,

**x**

*denotes the genotype indicator of the*

_{ij}*j*th QTL for individual

*i*,

**β**

*is the vector of genetic effects associated with the*

_{j}*j*th QTL, and

*e*is the residual error assumed to follow

_{i}*N*(0, σ

^{2}). The definitions of

**x**

*and*

_{ij}**β**

*depend on the experimental design. For an F*

_{j}_{2}cross, for example, we have that where

*a*and

_{j}*d*are the additive and dominance effects of the

_{j}*j*th QTL, respectively.

The above model is typically used in Bayesian mapping implemented via the reversible-jump MCMC algorithm. In this model, the number of QTL is treated as a random variable, and thus the total number of possible effects is unknown. In practical implementation of reversible-jump MCMC, we usually assume that the random variable *l* has an upper bound *K*. Model (1) can be rewritten as
2where γ* _{j}* is an indicator variable that denotes that the

*j*th QTL is included (γ

*= 1) in the model or excluded from the model (γ*

_{j}*= 0). Note that the number of QTL does not explicitly appear in model (2). This parameter equals the number of 1's in .*

_{j}Model (2) is similar to that used in Bayesian variable selection for the linear regression model (*e.g.*, Kuo and Mallick 1998). The idea of adding the indicator variable in the model facilitates setting up MCMC algorithms. As in the linear regression model, we treat *K* as known and thus in model (2) the total number of possible effects is fixed.

The choice of the constant *K* depends on the method and the aim in the analysis. In marker analysis, each marker is treated as a potential QTL and thus *K* equals the number of markers (Ball 2001; Broman and Speed 2002; Xu 2003; Yi* et al.* 2003b). In QTL mapping, one does not know *a priori* how many QTL to expect for a given trait. We here propose two methods for choosing *K*: (1) As in almost all existing Bayesian mapping methods, we assume that there are at most *K* QTL in the entire genome, and (2) we assume that there are at most *K _{c}* QTL on the

*c*th chromosome. Then we have

*K*= ∑

*. As an extreme case, we could assume that each marker interval is associated with a QTL and thus*

_{c}K_{c}*K*is identical to the number of marker intervals on the

_{c}*c*th chromosome. The assumption that there is at most one QTL on a marker interval is not a fundamental requirement for the proposed method. Generally, the value of

*K*can be smaller than the number of marker intervals. The value of

*K*should account for the data information and the previous results obtained by using other QTL mapping methods. As is seen later, alternatively, we can use particular prior distributions on the parameters in the model to relieve the influence of

*K*on the performance of the proposed algorithms. In particular, these prior distributions account for the sample size

*n*, the marker information, and the upper bound of QTL

*K*.

In the following sections, we first propose a composite space representation of the problem for mapping multiple QTL based on model (2). We then discuss the specifications of the prior distributions on the unknowns. Finally, we develop a unified MCMC framework for exploring the composite model space.

## THE COMPOSITE SPACE FOR THE MULTIPLE-QTL MODEL

In QTL studies, we observe the phenotypic trait and a set of marker genotypes. Assume that marker linkage maps have been developed on the basis of the observed marker data so that the locations of the markers on each chromosome are known *a priori*. Our aim is to jointly infer the number of QTL, their genomic positions, and genetic effects. This can be viewed essentially as a problem of model selection. In model (2), the number of QTL is determined by the vector of indicator variables . Hereafter, we call the vector **γ** the model index, which indicates which QTL are present in the model.

In marker analysis, model (2) is essentially a usual linear regression model in that each coefficient *x _{ij}* is observed. In QTL mapping, the coefficients in the model, , are unobservable, and the locations of

*K*QTL, , are also unknown. Denote and

**θ**= (

**β**, μ, σ

^{2}). We partition (

**λ**,

**x**,

**θ**) into (

**λ**,

_{γ}**x**,

_{γ}**θ**) and (

_{γ}**λ**,

_{−γ}**x**,

_{−γ}**θ**), representing the unknowns included (γ

_{−γ}*= 1) or excluded (γ*

_{j}*= 0) from the model, respectively, where*

_{j}**θ**= (

_{γ}**β**, μ, σ

_{γ}^{2}) and

**θ**=

_{−γ}**β**. Hereafter, we call (

_{−γ}**γ**,

**λ**,

**x**,

**θ**) the composite space for the multiple-QTL model. For the detailed description about the composite space for model uncertainty problems, the reader is referred to Godsill (2001)(2003).

Under model (2), the likelihood function for a particular **γ** depends only upon the parameters (**x _{γ}**,

**θ**) used by that model,

_{γ}*i.e.*, 3

We assume that the prior distribution of (**γ**, **λ**, **x**, **θ**) can be partitioned as
4

The full posterior distribution of the composite model space (**γ**, **λ**, **x**, **θ**) can now be expressed as
5

Note that here we have suppressed the notation for conditional on the observed marker data.

In the above posterior distribution, *p*(**γ**) is the prior distribution of the model index. *p*(**λ _{γ}**,

**x**,

_{γ}**θ**|

_{γ}**γ**) is the joint prior distribution of the used unknowns, which can be partitioned into three components: 6

The prior for the unused unknowns, *p*(**λ _{−γ}**,

**x**,

_{−γ}**θ**|

_{−γ}**γ**,

**λ**,

_{γ}**x**,

_{γ}**θ**), may be called “pseudo-prior.” It is reasonable to assume that (

_{γ}**λ**,

_{−γ}**x**,

_{−γ}**θ**) is

_{−γ}*a priori*independent of (

**λ**,

_{γ}**x**,

_{γ}**θ**). The pseudo-prior can be factorized into three components: 7

_{γ}In Equations 6 and 7, *p*(**λ _{γ}**|

**γ**) and

*p*(

**λ**|

_{−γ}**γ**) are the prior distributions of the locations of QTL.

*p*(

**x**|

_{γ}**λ**) and

_{γ}*p*(

**x**|

_{−γ}**λ**) are the probability distributions of QTL genotype indicators, which can be calculated using multipoint methods (Jiang and Zeng 1997).

_{−γ}*p*(

**θ**|

_{γ}**γ**,

**x**) is the prior distribution of the used parameters, which may depend on

_{γ}**x**.

_{γ}*p*(

**θ**|

_{−γ}**γ**) is the prior distribution of the unused genetic effects.

The key feature of the composite model space is that the dimension remains fixed even when the model index **γ** or the number of QTL changes. This remarkable feat is achieved by augmenting the varying dimensional space (**γ**, **λ _{γ}**,

**x**,

_{γ}**θ**) to the fixed dimensional space (

_{γ}**γ**,

**λ**,

**x**,

**θ**). Simulation of

*p*(

**γ**,

**λ**,

**x**,

**θ**|

**y**) then can be addressed via standard MCMC algorithms for a distribution of fixed dimension (Godsill 2001). Thus convergence properties of these algorithms are inherited from standard MCMC theory. Furthermore, the composite space approach provides a method to use the important parameters for models other than the current model for efficient proposal design (Godsill 2003; Green 2003).

## PRIOR SPECIFICATIONS

The statistical properties of the Bayesian approach rest squarely on the specification of the prior distributions on the unknowns. This is especially true in mapping multiple QTL across the entire genome. In this section, we discuss the prior distribution of the composite model space for the multiple-QTL model.

For the specification of the model index, most Bayesian variable selection implementations have used independence priors of the form 8

Under this prior, each QTL enters the model independently of the other QTL, with probability *p*(γ* _{j}* = 1) = 1 −

*p*(γ

*= 0) =*

_{j}*w*. In QTL mapping, a reasonable reduction may be to set

_{j}*w*≡

_{j}*w*, yielding 9where

*l*is the number of QTL equal to the number of 1's in

**γ**. The hyperparameter

*w*is the prior expected proportion of QTL included in the model. In particular, setting

*w*=

^{1}/

_{2}yields the popular uniform prior 10which gives the same prior weight to all models and is widely used as noninformative prior in variable selection problems (Ntzoufras 1999). However, this prior actually puts most of its weight near models with

*K*/2 QTL (Chipman

*et al.*2001). Alternatively, we could put a Poisson prior with a predetermined mean

*L*on the number of QTL or the number of 1's in

**γ**,

*i.e*., 11

The locations of QTL are assumed to be independent *a priori* and uniformly distributed across the entire genome or the corresponding regions. If we suppose that there are at most *K _{c}* QTL on the

*c*th chromosome,

*K*QTL could be uniformly distributed on this chromosome. Since it is usually difficult to distinguish multiple QTL on a marker interval (

_{c}*e.g.*, Lynch and Walsh 1998), it may be reasonable to assume that there is at most one QTL on a marker interval although this is not a fundamental requirement for our method. Furthermore, this assumption can limit the model space.

The prior for the overall mean μ is normally distributed with mean η_{0} and variance τ^{2}_{0}. We could choose η_{0} = 0 or and . We choose an inverse Γ(*a*, *b*) as the prior of σ^{2}. Gaffney (2001) suggested *a* = 3 and , which has prior mean and variance equal to *s*^{2}_{y}/2. Alternatively, we could take *p*(σ^{2}) ∝ 1/σ^{2} or *p*(σ^{2}) ∝ 1.

We could use three types of prior distributions for the genetic effect **β**. First, we could use a normal prior for each vector of genetic effects, *i.e.*, **β*** _{j}* ∼

*N*(0, ∑),

*j*= 1, … ,

*K*, where the prior mean of zero reflects indifference between positive and negative values, and ∑ is the prior covariance matrix. The covariance matrix ∑ could be chosen to be diagonal. In this prior specification, the prior distribution for each QTL is identical and is independent of γ

*. Most Bayesian mapping methods have used this type of prior distribution. This prior has been used by Kuo and Mallick (1998) in Bayesian variable selection for a linear regression model. Second, we could use the prior*

_{j}*p*(

**β**

*|γ*

_{j}*) = (1 − γ*

_{j}*)*

_{j}*N*(0, ∑) + γ

*(0,*

_{j}N*c*

^{2}∑), where

*c*

^{2}is a predetermined constant. This prior has been used by George and McCulloch (1993) and Dellaportas

*et al.*(2002) for a linear regression model. Third, we could use

*p*∼

*N*, where

*c*

^{2}is a hyperparameter. This prior has been used extensively in Bayesian variable selection for the conventional linear model.

The prior distributions on the model index **γ** and the genetic effects **β** may be the most critical factors influencing the performance of the algorithms and thus deserve careful attention. The hyperparameter *w* or *L* in the prior of **γ** controls the expected proportion of genetic effects and the number of QTL included in the model. The prior covariance matrix or the hyperparameter *c* in the prior of **β** controls the expected size of genetic effects included in the model. Small *w* or *L* and large prior variance or *c* would concentrate the prior on parsimonious models with large effects, and large *w* or *L* and small prior variance or *c* would concentrate on saturated models with small effects. The reasonable choices of *c* and *w* would account for the sample size *n*, the marker information, and the upper bound of QTL *K*. For the conventional linear model, Fernandez* et al.* (2001) recommended *c* = max{*n*, *K*^{2}}. George and Foster (2000) proposed treating *c* and *w* as unknown parameters and using empirical Bayes estimates of *c* and *w* based on the data. Finally, we could consider hyperprior distributions on *w* or *L* and ∑ or *c* (*e.g.*, Chipman* et al.* 2001; Gaffney 2001).

## POSTERIOR CALCULATION AND EXPLORATION

In this section, we develop a unified MCMC framework for simulating from the posterior distribution, *p*(**γ**, **λ**, **x**, **θ**|**y**), which includes the existing reversible-jump algorithms as special cases and also provides some new methods for mapping multiple QTL. It is seen that standard Gibbs samplers applied to the composite model space produce several well-known Bayesian variable selection methods developed for the linear regression model, while a more sophisticated Metropolis-Hastings (M-H) approach produces a version of the reversible-jump algorithm. We also propose strategies to improve efficiency of Bayesian mapping.

The full conditional posterior distributions for **θ _{γ}** and

**θ**are given by 12 13

_{−γ}The full conditional posterior distributions for elements of **θ _{γ}**,

*e.g.*, μ,

**β**, and σ

_{γ}^{2}, can be easily derived from Equation 12. These posteriors have standard forms and thus can be easily sampled (

*e.g.*, Gelman

*et al.*1995). It can be seen that the parameters unused in the model do not influence the posterior of

**θ**. Since the unused parameters do not contribute to the likelihood, the posterior of

_{γ}**θ**is identical to its prior. For some algorithms developed later, the values of

_{−γ}**θ**are used to update the model index and thus

_{−γ}**θ**needs to be generated from the pseudo-prior. The pseudo-prior is likely to influence the performance of these algorithms and hence it should be specified with caution.

_{−γ}Since the position λ* _{j}* is highly dependent on

**x**

*, we jointly update λ*

_{j}*and*

_{j}**x**

*. The joint full conditional posterior distribution for the location and genotype indicator of the*

_{j}*j*th QTL is 14where

**λ**

_{−}

*(*

_{j}**x**

_{−}

*) represents all elements of*

_{j}**λ**(

**x**) except λ

*(*

_{j}**x**

*). This posterior is not a standard distribution, and thus the M-H algorithm is needed to update λ*

_{j}*and*

_{j}**x**

*jointly. We first propose a new location λ*

_{j}^{*}

_{j}from

*q*, and then generate genotype indicator

**x**

^{*}

_{j}at this new location for all individuals from the posterior . The proposals for the new location and the genotype indicator are then accepted or rejected simultaneously with probability 15

(Yi and Xu 2001). Two schemes can be employed to propose a new location, λ^{*}_{j}: (1) local move, propose λ^{*}_{j} from Uniform(λ* _{j}* −

*d*, λ

*+*

_{j}*d*), where

*d*is a predetermined tuning parameter; and (2) long range move, propose λ

^{*}

_{j}uniformly from the corresponding region (Gaffney 2001). Note that under the conjugate prior, the parameter

**θ**can be integrated out from the posterior distribution (14). Therefore, the joint full conditional posterior distribution becomes 16which is independent of

**θ**.

The full conditional posterior distribution of *x _{ij}* is given by
17where

**x**

_{−}

*represents all elements of*

_{ij}**x**except

*x*. This posterior is a discrete distribution and thus easily sampled.

_{ij}Note that when the *j*th QTL is not included in the model, the posteriors of the genetic effects, location, and genotype indicators for this QTL are identical to the corresponding priors. The values of these unknowns are required to update the indicator variable of the QTL. Therefore, we first describe the methods for updating the model index on the basis of the values of these unknowns sampled from their priors. However, sampling from the priors does not update or make use of our current knowledge about these unknowns and hence cannot possibly produce an optimal sampler.

The standard MCMC procedures, Gibbs sampler and M-H algorithm, can be applied to update the model index **γ**. Several different methods can be developed as follows.

### Method I:

The full conditional posterior distribution of the indicator variable γ* _{j}* is given by
18where

**γ**

_{−}

*represents all elements of*

_{j}**γ**except γ

*. This posterior is a Bernoulli distribution and thus easily sampled. The sampling can be implemented sequentially or in random order.*

_{j}This Gibbs sampler includes several Bayesian variable selection methods as special cases, depending on the prior specifications of **β*** _{j}* (Ntzoufras 1999; Dellaportas

*et al.*2002). Kuo and Mallick (1998) use a prior distribution

*p*(

**β**

*), which is independent of γ*

_{j}*so that*

_{j}*p*(

**β**

*|γ*

_{j}*= 1) =*

_{j}*p*(

**β**

*|γ*

_{j}*= 0). Then, the third term on the right-hand side of (18) can be omitted. Similar to George and McCulloch (1993), Dellaportas*

_{j}*et al.*(2002) use a mixture of normal distribution for model parameters,

*i.e.*,

*p*(

**β**

*|γ*

_{j}*) = (1 − γ*

_{j}*)*

_{j}*N*(0, ∑) + γ

*(0,*

_{j}N*c*

^{2}∑), so that

*p*(

**β**

*|γ*

_{j}*= 1) =*

_{j}*N*(0,

*c*

^{2}∑) and

*p*(

**β**

*|γ*

_{j}*= 0) =*

_{j}*N*(0, ∑). Then, the third term should remain in (18).

We also can apply the M-H algorithm to update the model index **γ** conditional on other unknowns. The M-H algorithm is not based on sampling directly from the full conditional, but on a proposal for a move from **γ** to **γ′**, followed by acceptance or rejection of this proposal. Although the M-H sampler can in principle update multiple components of **γ** simultaneously, we discuss only the simplest strategy where only one component in **γ** is proposed; thus at each iteration we actually propose to add or delete one QTL. We assume that the *j*th element of **γ** is proposed with probability *q*(**γ′**; **γ**); then the acceptance probability, using the standard M-H algorithm, is given by min(1, *r*), where the acceptance ratio *r* is
19where **γ** = (γ* _{j}* = 1 −

*s*,

**γ**

_{−}

*),*

_{j}**γ′**= (γ

*=*

_{j}*s*,

**γ**

_{−}

*), and*

_{j}*s*= 1 or 0 corresponding to adding or deleting one QTL, respectively.

The proposals *q*(**γ′**; **γ**) can be set to *p*_{a} or *p*_{d}/(*l* + 1), depending on *s* = 1 or 0, where *l* is the number of 1's in **γ**, which equals the current number of QTL, and *p*_{a} and *p*_{d} are constants satisfying *p*_{a} + *p*_{d} = 1. This proposal scheme is equivalent to that commonly used in Bayesian QTL mapping. Alternatively, we can set [*p*(**γ**)*q*(**γ**; **γ′**)]/[*p*(**γ′**)*q*(**γ′**; **γ**)] = 1 (Gaffney 2001). Under our composite model space, however, two new schemes can be developed, borrowing the idea of variable selection: (1) We pick one of the *K* variables (QTL) at random and either delete or add it if it is currently or not, respectively, in the model; thus we have that *q*_{1}(**γ′**; **γ**) = *q*_{1}(**γ**; **γ′**) = 1/*K*, or (2) we can update γ* _{j}* for all

*j*= 1, … ,

*K*sequentially or in random order; thus we have that

*q*

_{1}(

**γ′**;

**γ**) =

*q*

_{1}(

**γ**;

**γ′**) = 1. Under both these schemes, the move proposal probability cancels from the acceptance probability ratio (19).

The above M-H algorithm is equivalent to a reversible-jump algorithm with reflecting boundaries at 0 and *K* QTL. To describe this relationship, we assume *s* = 1, which corresponds to adding one QTL into the model. The reversible jump can proceed to generate a new location and the genotype indicators at the new location from the priors and the associated effects **β*** _{j}* from

*p*(

**β**

*|γ*

_{j}*= 0). Then, the acceptance ratio is given, using the reversible-jump algorithm of Green (1995)(2003), by (19). This reversible-jump algorithm has been widely used in Bayesian QTL mapping (*

_{j}*e.g.*, Heath 1997; Sillanpää and Arjas 1998, 1999; Stephens and Fisch 1998; Yi and Xu 2000).

### Method II:

The above algorithm is conditional on the value of **β*** _{j}* sampled from the pseudo-prior when

**β**

*is proposed to add into the model. More efficient MCMC algorithms by using blocking strategies can be devised to yield improved performance. Under the linear model (2), we can choose the pseudo-prior for*

_{j}**β**

*to be the conditional posterior for*

_{j}**β**

*with γ*

_{j}*= 1; that is, set where*

_{j}**θ**

_{−βj}means all elements of

**θ**except

**β**

*. The sampling step for γ*

_{j}*then reduces to 20*

_{j}(Godsill 2001). This approach is equivalent to a sampling scheme, which first draws γ* _{j}* from

*p*(γ

*|*

_{j}**γ**

_{−}

*,*

_{j}**x**,

**θ**

_{−βj},

**y**) and then draws

**β**

*from*

_{j}*p*(

**β**

*|γ*

_{j}*=*

_{j}*s*,

**γ**

_{−}

*,*

_{j}**λ**,

**x**,

**θ**,

**y**). This scheme actually draws jointly for (γ

*,*

_{j}**β**

*). This blocking procedure can be viewed as equivalent to that used by Geweke (1996).*

_{j}A Metropolis-Hastings version of the above blocking procedure can be easily designed. Assume that the *j*th element of **γ** is proposed with probability *q*(**γ′**; **γ**). The acceptance ratio is given by
21where **γ** = (γ* _{j}* = 1 −

*s*,

**γ**

_{−}

*),*

_{j}**γ′**= (γ

*=*

_{j}*s*,

**γ**

_{−}

*), and*

_{j}*s*= 1 or 0 corresponding to adding or deleting one QTL, respectively.

Using the identity , the acceptance ratio (21) for *s* = 1 then becomes
22where **γ** = (γ* _{j}* = 0,

**γ**

_{−}

*) and*

_{j}**γ′**= (γ

*= 1,*

_{j}**γ**

_{−}

*).*

_{j}This M-H algorithm is equivalent to the reversible-jump algorithm, which proceeds to generate a new location and the genotype indicators at the new location from the priors and the associated effects **β*** _{j}* from the full conditional posterior,

*p*(

**β**

*|*

_{j}**γ′**,

**x**,

**θ**,

_{γ}**y**). This reversible-jump algorithm is similar to that developed by Yi and Xu (2001)(2002).

### Method III:

Under the linear model (2), in fact, all parameters **θ** can be integrated out from the conditional posterior distribution (18), *i.e*.,
23

Therefore, **γ** can be updated independent of **θ**. This method is equivalent to that by Smith and Kohn (1996) for the conventional linear regression model. Broman and Speed (2002) have applied this method to marker selection in a backcross design. This approach is equivalent to a blocking scheme, which first draws **θ _{γ}** from the full conditional posterior

*p*(

**θ**|γ

_{γ}*= 1,*

_{j}**γ**

_{−}

*,*

_{j}**x**,

**y**) and then draws γ

*from*

_{j}*p*(γ

*=*

_{j}*s*|

**γ**

_{−}

*,*

_{j}**λ**,

**x**,

**θ**,

**y**). This equivalence can be seen more explicitly from the following M-H version.

Assume that a proposal for a move from **γ** = (γ* _{j}* = 1 −

*s*,

**γ**

_{−}

*) to*

_{j}**γ′**= (γ

*=*

_{j}*s*,

**γ**

_{−}

*) with probability*

_{j}*q*(

**γ′**;

**γ**); the acceptance ratio is given, using the standard M-H procedure, by 24

Using the identity *p*(γ* _{j}* =

*s*|

**γ**

_{−}

*,*

_{j}**x**,

**y**) =

*p*(γ

*=*

_{j}*s*,

**θ**|

_{γ}**γ**

_{−}

*,*

_{j}**x**,

**y**)/

*p*(

**θ**|

_{γ}**γ**,

**x**,

**y**), the acceptance ratio then becomes 25

This is exactly the acceptance ratio for the reversible-jump sampler, which proceeds to generate a new location and the genotype indicators at the new location from the priors and all the associated parameters **θ _{γ}** from the full conditional posterior,

*p*(

**θ**|

_{γ}**γ**,

**x**,

**y**). Such a scheme can be viewed as equivalent to the MC

^{3}method of Raftery

*et al.*(1997) for linear regressions. Note that the method developed by Gaffney (2001), in which all the associated effects

**β**are sampled from

_{γ}*p*(

**β**|

_{γ}**γ**,

**x**, μ, σ

^{2},

**y**), is close to the above approach.

The major difference among these three methods is the proposal distribution on the genetic effects. Proposing the new genetic effects from the prior does not seem to be the best choice. It places an extraordinary burden on prior specification for the effects (Gaffney 2001). In methods II and III, the corresponding parameters are integrated out from the posteriors, or equivalently the blocking strategies are used. Since the acceptance probabilities are independent of parameter values, the samplers would lead to excellent exploration of model space (Godsill 2001). This advantage results from the use of the full conditional posterior as reversible-jump proposals. This would suggest that reversible-jump proposals should be designed to approximate as close as possible the full conditionals.

### General formula and improved strategies:

We can derive a general formula that includes the algorithms discussed above as special cases. Consider a proposal from the current state of the composite space (**γ**, **λ**, **x**, **θ**) to a new state (**γ′**, **λ′**, **x′**, **θ′**) with the proposal distribution *q(***γ′**, **λ′**, **x′**, **θ′**; **γ**, **λ**, **x**, **θ**). Using the standard M-H procedure, the acceptance probability for this proposal is given by
26

The proposal can be split into three components:

The first component *q*_{1} proposes a move to a new model index **γ′**. The second term *q*_{2} is the proposal for the unknowns used by model **γ′**. The third term is the proposal probability for the remaining unused unknowns, which is chosen to equal the pseudo-prior *p*. The acceptance ratio then reduces to
27

This is exactly the acceptance ratio for the reversible-jump sampler with the proposal distribution factored into two components, *q*_{1}(.) and *q*_{2}(.) (Green 1995, 2003; Godsill 2001, 2003). This derivation of reversible jump is obtained purely from an application of the standard M-H method to fixed-dimensional composite model space. We see that the acceptance probability is independent of the value of any parameters that are unused by both models *k* and *k*′. Hence sampling of these unused unknowns is only a “conceptual” step, which need not be performed in practice. The aim of including these unused parameters is to build a fixed-dimensional model space.

The performance of the above M-H sampler is determined by the proposal distributions *q*_{1}(.) and *q*_{2}(.). The optimal choice of proposal *q*_{2} should be the full conditional *p* . This scheme produces an M-H sampler with the posterior model *p*(**γ**|**y**) as the target distribution and thus leads to excellent exploration of model space (Godsill 2001). Unfortunately, this full conditional is not available analytically. We have to design a proposal that approximates as closely as possible the full conditional. As in all existing Bayesian mapping methods, we use three sequential steps to propose the values of unknowns and then have the factorization
28

Conditional on (**γ**, **λ _{γ}**,

**x**), model (2) is a conventional linear model and thus

_{γ}*q*

_{23}can be taken to be the full conditional posterior

*p*(

**θ**|

_{γ}**γ**,

**x**,

**y**), which results in the acceptance probability independent of

**θ**. Sampling (

_{γ}**λ**,

_{γ}**x**) is a special problem in QTL mapping. Therefore, performance of MCMC mapping procedures should depend highly on the specifications of

_{γ}*q*

_{21}and

*q*

_{22}. In all the previous algorithms, the location λ

^{′}

_{j}and the genotypes

**x**

^{′}

_{j}are proposed from their priors. This sampling scheme may be suboptimal since each locus is chosen with equal probability no matter which one has weak or strong linkage evidence. Sampling λ

^{′}

_{j}from the prior also means that the information about the

*j*th QTL is totally lost as soon as we delete this QTL from the model; this usually causes low acceptance probability and greatly influences the mixing behavior. To improve performance of reversible jump, it may be desirable to choose a location with stronger linkage evidence. The proposal

*q*

_{21}then has unequal probability over the genome. Lee and Thomas (2000) developed a method to propose a location by scanning the unoccupied regions of the entire genome for evidence of linkage of the trait residuals. Although the method of Lee and Thomas (2000) has greatly improved the acceptance ratio, it largely increases computational load. With the composite model space approach, we are able to design an algorithm in which the values for any locus can be retained until this locus is next visited. An efficient scheme could be designed as follows: If the

*j*th QTL was ever included in the model, the last location of this QTL is directly taken; otherwise, a new location is sampled from the prior. Clearly this easy-to-use method makes use of our current knowledge about the QTL locations and thus should improve the performance of MCMC algorithms.

## DISCUSSION

Mapping multiple QTL can be viewed essentially as a problem of model selection (*e.g.*, Broman and Speed 2002; Sillanpää and Corander 2002). A variety of Bayesian model selection procedures have been developed for conventional statistical models (see Chipman* et al.* 2001; Godsill 2001; Dellaportas* et al.* 2002). Although some of these procedures, *e.g.*, reversible-jump algorithm, have been applied to map multiple QTL, others have not yet. To date, most applications of reversible jump have conducted proposals on an *ad hoc* basis. Therefore, there is a need for further methodological work on improving the reversible-jump algorithms for mapping QTL. This article presents a unified MCMC framework for mapping multiple QTL in experimental designs, based on a composite space representation of the QTL model. We show that various Bayesian model selection procedures can be modified to map multiple QTL. We also demonstrate that the composite space approach leads directly to the reversible-jump algorithm. The results add to the overall understanding of the reversible-jump and the Bayesian model selection procedures for QTL mapping and lead to new classes of Bayesian mapping algorithms that combine the benefits of several different schemes within the composite model space.

The main difficulty with the existing reversible-jump algorithms is that the acceptance probability is too low. Another major challenge remains to ascertain convergence of the reversible-jump sampler and obtain a rapidly converging sampler. This is especially true in searching for multiple QTL across the entire genome. With the existing reversible-jump algorithms, the information about a QTL is totally lost as soon as we delete this QTL from the model; this greatly influences the acceptance probability and the mixing behavior. The key to the proposed composite space approach is that transdimensional problems can be considered by looking only at a fixed dimension space. As Godsill (2003) notes, therefore, convergence properties of the reversible-jump algorithm are inherited from the standard M-H scheme on the composite model space. A further advantage is that in principle the parameters for models other than the current model can be stored and then used for an efficient proposal design when a model is revisited. In this study, we develop strategies to achieve this advantage, which may improve the MCMC procedures for mapping QTL.

The proposed unified procedure includes various Bayesian model and variable selection methods. These methods are derived from a single framework, and thus may be incorporated into a unified computer program. Future work includes full-scale simulation studies and real data analyses, which can verify the mathematical derivations involved in the theory and test the efficiency of the proposed methods. The statistical properties of the Bayesian approach rest squarely on the specification of the prior distributions on the unknowns. Substantial effort will be devoted to prior selection and investigating the robustness of the proposed methods.

In this study, we ignored gene-by-environmental interactions and gene-by-gene interactions (epistatic effects). A growing number of experiments provide strong evidence of the presence of gene-by-environment interactions and epistasis for many complex traits. Thus it is important to include potential interactions into the proposed methods. Recently, Yi and Xu (2002) and Yi* et al.* (2003a) developed reversible-jump algorithms for searching for complex epistatic QTL across the entire genome. The composite space sampler proposed can in principle be extended to include interactions and thus may improve efficiency of detecting complex interacting QTL.

## Acknowledgments

This work was supported by the National Institutes of Health (NIH) (NIH RO1ES09912, NIH RO1 DK056366, and NIH P30DK056336) and an Obesity-Related Pilot/Feasibility Studies grant at University of Alabama at Birmingham (528176).

## Footnotes

Communicating editor: J. B. Walsh

- Received January 6, 2004.
- Accepted February 25, 2004.

- Genetics Society of America