## Abstract

The problem of identifying complex epistatic quantitative trait loci (QTL) across the entire genome continues to be a formidable challenge for geneticists. The complexity of genome-wide epistatic analysis results mainly from the number of QTL being unknown and the number of possible epistatic effects being huge. In this article, we use a composite model space approach to develop a Bayesian model selection framework for identifying epistatic QTL for complex traits in experimental crosses from two inbred lines. By placing a liberal constraint on the upper bound of the number of detectable QTL we restrict attention to models of fixed dimension, greatly simplifying calculations. Indicators specify which main and epistatic effects of putative QTL are included. We detail how to use prior knowledge to bound the number of detectable QTL and to specify prior distributions for indicators of genetic effects. We develop a computationally efficient Markov chain Monte Carlo (MCMC) algorithm using the Gibbs sampler and Metropolis-Hastings algorithm to explore the posterior distribution. We illustrate the proposed method by detecting new epistatic QTL for obesity in a backcross of CAST/Ei mice onto M16i.

MANY complex human diseases and traits of biological and/or economic importance are determined by multiple genetic and environmental influences (Lynch and Walsh 1998). Mounting evidence suggests that interactions among genes (epistasis) play an important role in the genetic control and evolution of complex traits (Cheverud 2000; Carlborg and Haley 2004). Mapping quantitative trait loci (QTL) is a process of inferring the number of QTL, their genomic positions, and genetic effects given observed phenotype and marker genotype data. From a statistical perspective, two key problems in QTL mapping are model search and selection (*e.g.*, Broman and Speed 2002; Sillanpää and Corander 2002; Yi 2004). Traditional QTL mapping methods utilize a statistical model, which estimates the effects of only one QTL whose putative position is scanned across the genome (*e.g.*, Lander and Botstein 1989; Jansen and Stam 1994; Zeng 1994). Extensions of this approach can allow for main and epistatic effects at two or perhaps a few QTL at a time and employ a multidimensional scan to detect QTL. However, such an approach neglects potential confounding effects from additional QTL and requires prohibitive corrections for multiple testing. Non-Bayesian model selection methods combine simultaneous search with a sequential procedure such as forward or stepwise selection and apply criteria such as *P*-values or modified Bayesian information criterion (BIC) to identify well-fitting multiple-QTL models (Kao *et al.* 1999; Carlborg *et al.* 2000; Reifsnyder *et al.* 2000; Bogdan *et al.* 2004). These methods, although appealing in their simplicity and popularity, have several drawbacks, including: (1) the uncertainty about the model itself is ignored in the final inference, (2) they involve a complex sequential testing strategy that includes a dynamically changing null hypothesis, and (3) the selection procedure is heavily influenced by the quantity of data (Raftery *et al.* 1997; George 2000; Gelman *et al.* 2004; Kadane and Lazar 2004).

Bayesian model selection methods provide a powerful and conceptually simple approach to mapping multiple QTL (Satagopan *et al.* 1996; Hoeschele 2001; Sen and Churchill 2001). The Bayesian approach proceeds by setting up a likelihood function for the phenotype and assigning prior distributions to all unknowns in the problem. These induce a posterior distribution on the unknown quantities that contains all of the available information for inference of the genetic architecture of the trait. Bayesian mapping methods can treat the unknown number of QTL as a random variable, which has several advantages but results in the complication of varying the dimension of the model space. The reversible jump Markov chain Monte Carlo (MCMC) algorithm, introduced by Green (1995), offers a powerful and general approach to exploring posterior distributions in this setting. However, the ability to “move” between models of different dimension requires a careful construction of proposal distributions. Despite the challenges of implementation of reversible jump algorithms, effective approaches for mapping multiple noninteracting QTL have been developed (Satagopan and Yandell 1996; Heath 1997; Thomas *et al.* 1997; Uimari and Hoeschele 1997; Sillanpää and Arjas 1998; Stephens and Fisch 1998; Yi and Xu 2000; Gaffney 2001). Bayesian model selection methods employing the reversible jump MCMC algorithm have been proposed to map epistatic QTL in inbred line crosses and outbred populations (Yi and Xu 2002; Yi *et al.* 2003, 2004a,b; Narita and Sasaki 2004). However, the complexity of the reversible jump steps increases computational demand and may prohibit improvements of the algorithms.

Recently, Yi (2004) proposed a unified Bayesian model selection framework to identify multiple nonepistatic QTL for complex traits in experimental designs, based upon a composite space representation of the problem. The composite space approach, which is a modification of the product space concept developed by Carlin and Chib (1995), provides an interesting viewpoint on a wide variety of model selection problems (Godsill 2001). The key feature of the composite model space is that the dimension remains fixed, allowing for MCMC simulation to be performed on a space of fixed dimension, thus avoiding the complexities of reversible jump. In Yi (2004), the varying dimensional space is augmented to a fixed dimensional space (the composite model space) by placing an upper bound on the number of detectable QTL. In the composite model space, latent binary variables indicate whether each putative QTL has a nonzero effect. The resulting hierarchical model can vastly simplify the MCMC search strategy.

In this work we extend the composite model space approach to include epistatic effects. We develop a framework of Bayesian model selection for mapping epistatic QTL in experimental crosses from two inbred lines. We show how to incorporate prior knowledge to select an upper bound on the number of detectable QTL and prior distributions for indicator variables of genetic effects and other parameters. A computationally efficient MCMC algorithm using a Gibbs sampler or Metropolis-Hastings (M-H) algorithm is developed to explore the posterior distribution on the parameters. The proposed algorithm is easy to implement and allows more complete and rapid exploration of the model space. We first describe the implementation of this algorithm and then illustrate the method by analyzing a mouse backcross population.

## A BAYESIAN MODEL SELECTION FRAMEWORK FOR QTL MAPPING

We consider experimental crosses derived from two inbred lines. In QTL studies, the observed data consist of phenotypic trait values, **y**, and marker genotypes, **m**, for individuals in a mapping population. We assume that markers are organized into a linkage map and restrict attention to models with, at most, pairwise interactions. We partition the entire genome into *H* loci, **ζ** = {ζ_{1}, … , ζ* _{H}*}, and assume that the possible QTL occur at these fixed positions. This introduces only a minor bias in estimating the position of QTL when

*H*is large. When the markers are densely and regularly spaced, we set

**ζ**to the marker positions; otherwise,

**ζ**includes not only the marker positions but also points between markers. In general, the genotypes,

**g**, at loci

**ζ**are unobservable except at completely informative markers, but their probability distribution, p(

**g**|

**ζ**,

**m**), can be inferred from the observed marker data using the multipoint method (Jiang and Zeng 1997). This probability distribution is used as the prior distribution of QTL genotypes in our Bayesian framework.

The problem of inferring the number and locations of multiple QTL is equivalent to the problem of selecting a subset of **ζ** that fully explains the phenotypic variation. Although a complex trait may be influenced by multitudes of loci, our emphasis is on a set of at most *L* QTL with detectable effects. Typically *L* will be much smaller than *H*. Let **λ** = {λ_{1}, … , λ* _{L}*} (∈{ζ

_{1}, … , ζ

*}) be the current positions of*

_{H}*L*putative QTL. Each locus may affect the trait through its marginal (main) effects and/or interactions with other loci (epistasis). The phenotype distribution is assumed to follow a linear model, 1where

**μ**is the overall mean,

**β**denotes the vector of all possible main effects and pairwise interactions of

*L*potential QTL,

**X**is the design matrix, and

**e**is the vector of independent normal errors with mean zero and variance σ

^{2}. The number of genetic effects depends on the experimental design, and the design matrix

**X**is determined from those genotypes

**g**at the current loci

**λ**by using a particular genetic model (see appendix A for details of the Cockerham genetic model used here).

There is prior uncertainty about which genetic effects should be included in the model. As in Bayesian variable selection for linear regression (*e.g.*, George and McCulloch 1997; Kuo and Mallick 1998; Chipman *et al.* 2001), we introduce a binary variable γ for each effect, indicating that the corresponding effect is included (γ = 1) or excluded (γ = 0) from a model. Letting **Γ** = diag(**γ**), the model becomes 2This linear model defines the likelihood, *p*(**y**|**γ**, **X**, **θ**), with **θ** = (μ, **β**, σ^{2}), and the full posterior can be written as 3Specifications of priors *p*(**γ**, **λ**, **g**, **θ**|**m**) and posterior calculation are given in subsequent sections.

The vector **γ** determines the number of QTL (see appendix B). Hereafter, we denote the included positions of QTL by **λ _{γ}**. The vector (

**γ**,

**λ**) comprises a model index that identifies the genetic architecture of the trait. A natural model selection strategy is to choose the most probable model (

_{γ}**γ**,

**λ**) on the basis of its marginal posterior,

_{γ}*p*(

**γ**,

**λ**|

_{γ}**y**,

**m**) (George and Foster 2000). For genome-wide epistatic analysis, however, no single model may stand out, and thus we average over possible models when assessing characteristics of genetic architecture, with the various models weighted by their posterior probability (Raftery

*et al.*1997; Ball 2001; Sillanpää and Corander 2002).

## PRIOR DISTRIBUTIONS

The above Bayesian model selection framework provides a conceptually simple and general method to identify complex epistatic QTL across the entire genome. However, its practical implementation entails two challenges: prior specification and posterior calculation. In this section, we first propose a method to choose an upper bound for the number of QTL and then describe the prior specifications for the model index and other unknowns.

### Choice of the upper bound *L*:

We suggest first specifying the prior expected number of QTL, *l*_{0}, on the basis of initial investigations with traditional methods, and then determining a reasonably large upper bound, *L*. We assign the prior probability distribution for the number of QTL, *l*, to be a Poisson distribution with mean *l*_{0}. The value of *L* can be selected to be large enough that the probability Pr(*l* > *L*) is very small. On the basis of a normal approximation to the Poisson distribution, we could take *L* as *l*_{0} + 3.

### Prior on γ**:**

For the indicator vector **γ**, we use an independence prior of the form 4where *w _{j}* =

*p*(γ

*= 1) is the prior inclusion probability for the*

_{j}*j*th effect. We assume that

*w*equals the predetermined hyperparameter

_{j}*w*

_{m}or

*w*

_{e}, depending on the

*j*th effect being main effect or epistatic effect, respectively. Under this prior, the importance of any effect is independent of the importance of any other effect and the prior inclusion probability of main effect is different from that of epistatic effect.

The hyperparameters *w*_{m} and *w*_{e} control the expected numbers of main and epistatic effects included in the model, respectively; small *w*_{m} and *w*_{e} would concentrate the priors on parsimonious models with few main effects and epistatic effects. Instead of directly specifying *w*_{m} and *w*_{e}, it may be better to first determine the prior expected numbers of main-effect QTL, *l*_{m}, and all QTL, *l*_{0} ≥ *l*_{m} (*i.e.*, main-effect and epistatic QTL), and then solve for *w*_{m} and *w*_{e} from the expressions of the prior expected numbers. It is reasonable to require that *w*_{m} ≥ *w*_{e}, which requires some adjustment below when *l*_{m} = 0.

As shown in appendix B, the prior expected number of main-effect QTL can be expressed as 5and the prior expected number of all QTL as 6where *K* is the number of possible main effects for each QTL and *K*^{2} is the number of possible epistatic effects for any two QTL.

The prior expected number of main-effect QTL, *l*_{m}, could be set to the number of QTL detected by traditional nonepistatic mapping methods, *e.g.*, interval mapping or composite interval mapping (Lander and Botstein 1989; Zeng 1994). The prior expected number of all QTL, *l*_{0}, should be chosen to be at least *l*_{m}. The number of QTL detected by traditional epistatic mapping methods, *e.g.*, two-dimensional genome scan, could provide a rough guide for choosing *l*_{0}. From Equations 5 and 6, we obtain 7and 8

We note above that if no main-effect QTL is detected by traditional nonepistatic mapping methods and *l*_{m} = 0, then *w*_{m} = 0. In this case, we suggest making all weights equal, , and using (6) to obtain 9

### Prior on λ:

When there is no prior information concerning QTL locations, these could be assumed to be independent and uniformly distributed over the *H* possible loci. Thus, given *l*_{0} the prior probability that any locus is included becomes *l*_{0}/*H*. In practice, it may be reasonable to assume that any intervals of a given length (*e.g.*, 10 cM) contain at most one QTL. Although this assumption is not necessary, it can substantially reduce the model space and thus accelerate the search procedure.

### Prior on β:

We propose the following hierarchical mixture prior for each genetic effect, 10where **x _{·}**

*= (*

_{j}*x*

_{1}

*, … ,*

_{j}*x*)

_{nj}^{T}is the vector of the coefficients of β

*, and*

_{j}*c*is a positive scale factor. Many suggestions have been proposed for choice of

*c*for variable selection problems of linear regression (

*e.g.*, Chipman

*et al.*2001; Fernandez

*et al*. 2001). In this study, we take

*c*=

*n*, which is a popular choice and yields the BIC if the prior inclusion probability for each effect equals 0.5 (

*e.g.*, George and Foster 2000; Chipman

*et al.*2001).

In this prior setup, a point mass prior at 0 is used for the genetic effect β* _{j}* when γ

*= 0, effectively removing β*

_{j}*from the model. If γ*

_{j}*= 1, the prior variances reflect the precision of each β*

_{j}*and are invariant to scales changes in the phenotype and the coefficients. The value*

_{j}^{−1}varies for different types of genetic effects. For a large backcross population with no segregation distortion, for example,

^{−1}/

*n*≈ for marginal effects and [1 − (1 − 2

*r*)

^{2}]/16 for epistatic effects, with

*r*the recombination fraction between two QTL, under Cockerham's model (Zeng

*et al.*2000).

### Priors on μ and σ^{2}:

The prior for the overall mean μ is *N*. We could empirically set We take the noninformative prior for the residual variance, *p*(σ^{2}) ∝ 1/σ^{2} (Gelman *et al.* 2004). Although this prior is improper, it yields a proper posterior distribution for the unknowns and so can be used formally (Chipman *et al.* 2001).

## MARKOV CHAIN MONTE CARLO ALGORITHM

To develop our MCMC algorithm, we first partition the vector of unknowns (**λ**, **g**, **θ**) into (**λ _{γ}**,

**g**,

_{γ}**θ**) and (

_{γ}**λ**,

_{−γ}**g**,

_{−γ}**θ**), representing the unknowns included or excluded from the model, respectively, where

_{−γ}**λ**and

_{γ}**g**(

_{γ}**λ**and

_{−γ}**g**) are the positions and the genotypes of QTL included (excluded), respectively,

_{−γ}**β**(

_{γ}**β**) represent the genetic effects included (excluded),

_{−γ}**θ**= (

**β**, μ, σ

^{2}),

**θ**= (

_{γ}**β**, μ, σ

_{γ}^{2}), and

**θ**=

_{−γ}**β**. Similarly,

_{−γ}**X**(

_{γ}**X**) represent the model coefficients included (excluded), which are determined by

_{−γ}**g**and

**γ**.

We suppress the dependence on the observed marker data below. For a particular **γ** the likelihood function depends only upon the parameters (**X _{γ}**,

**θ**) used by that model,

_{γ}*i.e.*, 11The prior distribution of (

**λ**,

**γ**,

**g**,

**θ**) can be partitioned as 12The full posterior distribution for (

**γ**,

**λ**,

**g**,

**θ**) can now be expressed as 13From (13), we can derive the conditional posterior distributions 1415and 16It can be seen that the unused parameters do not affect the conditional posterior of (

**λ**,

_{γ}**g**,

_{γ}**θ**) and thus do not need to be updated conditional on

_{γ}**γ**. Since the unused parameters do not contribute to the likelihood, the posterior of (

**λ**,

_{−γ}**g**,

_{−γ}**θ**) is identical to its prior. From (16), the conditional posterior of

_{−γ}**γ**depends on (

**λ**,

_{−γ}**g**,

_{−γ}**θ**) and thus the update of

_{−γ}**γ**requires generation of the corresponding unused parameters in the current model. These properties lead us to develop MCMC algorithms as described below. We first briefly describe the algorithms for updating

**θ**,

_{γ}**g**, and

_{γ}**λ**and then develop a novel Gibbs sampler and Metropolis-Hastings algorithm to update the indicator variables for main and epistatic effects, respectively.

_{γ}Conditional on **γ**, **X _{γ}**, and

**λ**, the parameters μ, σ

_{γ}^{2}, and

**β**can be sampled directly from their posterior distributions, which have standard form (Gelman

_{γ}*et al.*2004). Conditional on

**γ**,

**λ**, and

_{γ}**θ**, the posterior distribution of each element of

_{γ}**g**is multinomial and thus can be sampled directly as well (Yi and Xu 2002). We adapt the algorithm of Yi

_{γ}*et al.*(2003) to our model to update locations

**λ**: (1)

_{γ}**λ**is restricted to the discrete space

**ζ**= {ζ

_{1}, … , ζ

*}, and (2) any intervals of some length δ include at most one QTL. To update λ*

_{H}*, therefore, we propose a new location λ*

_{q}^{*}

_{q}for the

*q*th QTL uniformly from 2

*d*most flanking loci of λ

*, where*

_{q}*d*is a predetermined integer (

*e.g.*,

*d*= 2), and then generate genotypes at the new location for all individuals. The proposals for the new location and the genotypes are then jointly accepted or rejected using the Metropolis-Hastings algorithm.

At each iteration of the MCMC simulation, we update all elements of **γ** in some fixed or random order. For the indicator variable of a main effect, we need to consider two different cases: a QTL is currently (1) in or (2) out of the model. For (1), the QTL position and genotypes were generated at the preceding iteration. For (2), we sample a new QTL position from its prior distribution and generate its genotypes for all individuals. An epistatic effect involves two QTL, hence three different cases: (1) both QTL are in, (2) only one QTL is in, and (3) both QTL are out of the model. Again, the new QTL position(s) and genotypes are sampled as needed.

We update γ* _{j}*, the indicator variable for an effect, using its conditional posterior distribution of γ

*, which is Bernoulli, 17where*

_{j}**x**

_{i}**is the vector of the coefficients of**

_{·}**β**for the

*i*th individual,

*w*= pr(γ

*= 1) is the prior probability that β*

_{j}*appears in the model, σ*

_{j}^{2}

_{βj}is the prior variance of β

*(see Equation 10),*

_{j}**γ**

_{−γj}means all the elements of

**γ**except for γ

*, and*

_{j}**θ**

_{−βj}represents all the elements of

**θ**except for β

*. We can sample γ*

_{j}*directly from (17) or update γ*

_{j}*with probability min(1,*

_{j}*r*), where .

The effect β* _{j}* was integrated from (17). We can generate β

*as follows. If γ*

_{j}*is sampled to be zero, β*

_{j}*= 0. Otherwise, β*

_{j}*is generated from its conditional posterior 18where and*

_{j}## POSTERIOR ANALYSIS

The MCMC algorithm described above starts from initial values and updates each group of unknowns in turn. Initial iterations are discarded as “burn-in.” To reduce serial correlation, we thin the subsequent samples by keeping every *k*th simulation draw and discarding the rest, where *k* is an integer. The MCMC sampler sequence is a random draw from the joint posterior distribution *p*(**γ**, **λ _{γ}**,

**g**,

_{γ}**θ**|

_{γ}**y**), and thus the embedded subsequence is a random sample from its marginal posterior distribution

*p*(

**γ**,

**λ**|

_{γ}**y**), which is used to infer the genetic architecture of the complex trait. For genome-wide epistatic analysis, no single model may stand out, and we may average over all possible models to assess genetic architecture. Bayesian model averaging provides more robust inferences about quantities of interest than any single model since it incorporates model uncertainty (Raftery

*et al.*1997; Ball 2001; Sillanpää and Corander 2002).

The most important characteristic may be the posterior inclusion probability of each possible locus ζ* _{h}*, estimated as 19where ξ

*is the binary indicator that QTL*

_{q}*q*is included or excluded from the model. Thus, we can obtain the cumulative distribution function per chromosome, defined as for any position

*x*on chromosome

*c*. It is worth noting that the cumulative distribution function defined here can be >1 if the corresponding chromosome contains more than one QTL. Both

*p*(ζ

*|*

_{h}**y**) and

*F*(

_{c}*x*|

**y**) can be graphically displayed and show evidence of QTL activity across the whole genome. Commonly used summaries include the posterior probability that a chromosomal region contains QTL, the most likely position of QTL (the mode of QTL positions), and the region of highest posterior density (HPD) (

*e.g.*, Gelman

*et al.*2004). To take the prior specifications,

*p*(ζ

*), into consideration, we can use the Bayes factor to show evidence for inclusion of ζ*

_{h}*against exclusion of ζ*

_{h}*(Kass and Raftery 1995), 20In a similar fashion, we can compute the Bayes factor comparing a chromosomal region containing QTL to that excluding QTL.*

_{h}We can estimate the main effects at any locus or chromosomal intervals Δ, 21The heritabilities explained by the main effects can also be estimated. In epistatic analysis, we need to estimate two types of additional parameters, the posterior inclusion probability and the size of epistatic effects, both involving pairs of loci. These two types of unknowns can be estimated with natural extensions of (19) and (21), respectively.

## EXAMPLE

We illustrate the application of our Bayesian model selection approach by an analysis of a mouse cross produced from two highly divergent strains: M16i, consisting of large and moderately obese mice, and CAST/Ei, a wild strain of small mice with lean bodies (Leamy *et al.* 2002). CAST/Ei males were mated to M16i females, and F_{1} males were backcrossed to M16i females, resulting in 54 litters and 421 mice (213 males, 208 females) reaching 12 weeks of age. All mice were genotyped for 92 microsatellite markers located on 19 autosomal chromosomes. The marker linkage map covered 1214 cM with average spacing of 13 cM. In this study, we analyze FAT, the sum of right gonadal and hindlimb subcutaneous fat pads. Prior to QTL analysis, the phenotypic data were linearly adjusted by sex and dam and standardized to mean 0 and variance 1, although this transformation may result in the possibility of destroying true biological interaction (Jansen 2003). We used the Cockerham genetic model (appendix A), in which the coefficients of main effects are defined as 0.5 and −0.5 for the two genotypes, CM and MM, where C and M represent the CAST/Ei and M16i alleles, respectively.

We partitioned each chromosome with a 1-cM grid, resulting in 1214 possible loci across the genome. A nonepistatic and an epistatic QTL model were evaluated. For all analyses, the MCMC started with no QTL and ran for 4 × 10^{5} cycles after discarding the first 2000 burn-ins. The chain was thinned by one in *k* = 20, yielding 2 × 10^{4} samples for posterior Bayesian analysis.

An initial interval map scan revealed three significant QTL (LOD > 3.2) on chromosomes 2, 13, and 15 (Figure 1), explaining 20.7, 4.9, and 5.1% of the phenotypic variance, respectively.

Under the nonepistatic analysis, epistatic effects are always excluded from the model and thus putative QTL are chosen only on the basis of their main effects. As described earlier, we took the number of significant QTL detected in the interval mapping as the prior number of main-effect QTL (*l*_{m}). To check prior sensitivity, we reran the algorithm for *l*_{m} = 1, 6. The upper bound of the number of QTL was calculated as *L* ≈ *l*_{m} + 3, or *L* = 9, 4, and 14 for *l*_{m} = 3, 1, and 6, respectively. Therefore, the prior probabilities of inclusion for each main effect were *w*_{m} = 1 − [1 − (*l*_{m}/*L*)]^{1/}* ^{K}* =

^{1}/

_{3},

^{1}/

_{4}, and

^{3}/

_{7}, respectively. Figure 2, top, displays the posterior probability of inclusion for each locus across the genome. Note the similarity to Figure 1, with clear evidence of QTL and flat profiles on other chromosomes. The peaks on chromosomes 2, 13, and 15 overlap those identified by interval mapping. The graphs of the cumulative distribution function, displayed in Figure 2, bottom, show that the posterior inclusion probability of each chromosome is close to 1 for chromosomes 2, 13, and 15. The results show that, at least in this data set, detection of large-effect QTL is not sensitive to the choice of

*l*

_{m}. However, larger

*l*

_{m}tend to pick up more small-effect QTL as expected. The profiles of the Bayes factor are depicted in Figure 3. For the three choices of

*l*

_{m}, the regions on chromosomes 2, 13, and 15 show strong evidence for being selected, and other regions show a very low Bayes factor.

The epistatic analysis took *l*_{m} = 3, the number of QTL detected in the nonepistatic analyses, as the prior expected number of main-effect QTL. Three values, *l*_{0} = 4, 6, and 8, were chosen as the prior expected number of all QTL under the epistatic model. The upper bound of the number of QTL, *L*, was thus *L* = 10, 14, and 17, respectively. From Equations 7 and 8, the prior inclusion probabilities were 0.30, 0.21, and 0.18 for main effects and 0.017, 0.025, and 0.027 for epistatic effects, for the three values of (*l*_{0}, *L*), respectively. The profiles of the posterior inclusion probability for each locus across the genome and the cumulative posterior probability for each chromosome are depicted in Figure 4, top and bottom, respectively. It can be seen that the three different prior specifications of (*l*_{0}, *L*) provided fairly similar profiles of the posteriors, indicating that the posterior inference may be not very sensitive toward the small or mediate change of *l*_{0}. As expected, the choice of a smaller prior expected number of QTL tended to provide smaller posteriors, especially for infrequently arising loci. However, the identification of frequent arising loci remained the same. The profiles of the Bayes factor are depicted in Figure 5. The three choices of *l*_{m} provided similar profiles of the Bayes factor, especially for infrequently arising loci.

As shown in Figures 4 and 5, the epistatic analyses detected the same regions on chromosomes 2, 13, and 15 as the nonepistatic analyses. In addition to those on chromosomes 2, 13, and 15, our epistatic analyses found strong evidence of QTL on chromosomes 1, 18, and 19 with high cumulative probabilities (close to 1) and suggestive evidence of QTL on chromosomes 7 and 14. In the nonepistatic analyses, these chromosomes were found to have weak main effects and hence were detected in the epistatic model mainly due to epistatic interactions.

The profiles of the location-wise main effects and the variances explained by the main effects are depicted in Figure 6. For the three prior specifications, the posterior inferences were essentially identical. Therefore, we reported only the summary statistics for *l*_{0} = 6 (see Tables 1 and 2). For the HPD regions on chromosomes 2, 13, and 15, the posterior inclusion probabilities are close to 1, and the corresponding Bayes factors are high. The estimated main effects were −0.856, 0.371, and −0.342 and explained 18.4, 3.5, and 3.1% of the phenotypic variance, respectively. For the HPD regions on chromosomes 1, 18, and 19, the posterior inclusion probabilities were ∼82, 88, and 70%, and the corresponding Bayes factors were ∼28, 47, and 12, respectively. In these HPD regions, the average main effects were weak and explained low proportions of the phenotypic variance. However, our epistatic analyses detected strong epistatic interactions associated with the HPD regions on chromosomes 1, 18, and 19. As shown in Table 2, the strongest epistasis is the interaction between chromosomes 1 and 18. This epistatic effect was estimated to be 0.936 and explained 5.6% of the phenotypic variance. The posterior inclusion probability of this epistasis was 81.9%. The region of chromosome 19 was found to interact with chromosomes 15 and 7. The interaction between the regions of chromosomes 19 and 15 was 0.604 and explained 2.5% of the phenotypic variance. The epistatic analyses also revealed interactions among chromosomes 2, 13, and 15. For example, the interaction between the HPD regions on chromosomes 2 and 13 was included in the model with probability of ∼60% and explained ∼2.5% of the phenotypic variance.

## DISCUSSION

The Bayesian model selection approach provides a comprehensive solution to mapping multiple epistatic QTL across the entire genome using the posterior distribution as a selection criterion. MCMC algorithms based on the composite model space representation mix rapidly, thus ensuring that high-probability models are visited frequently and quickly, resulting in good inference from relatively short runs. The Bayesian framework provides a robust inference of genetic architecture that incorporates model uncertainty by averaging over all possible models (Raftery *et al.* 1997; Ball 2001; Sillanpää and Corander 2002).

One of the most challenging statistical problems presented by QTL mapping is that the number of QTL is unknown. Most previous Bayesian mapping methods treat QTL models as models of varying dimension and employ the reversible jump MCMC algorithm to explore the posterior. Although such a framework is very general and powerful (Green 1995), it is difficult to implement efficient search strategies. The key idea of the proposed Bayesian approach is to turn varying dimensional space of multiple-QTL models into fixed dimensional model space by using a fixed but large set of known loci, **ζ**, and putting a constraint on the upper bound of the number of detectable QTL. In this setting, posterior simulation then can be achieved with a relatively simple Gibbs sampler or M-H algorithm (Godsill 2001; Yi 2004). The algorithm proposed herein is easier to implement than the reversible jump method and it reduces the computational time of model search, an essential feature for the practical analysis of complex genetic architectures.

A prerequisite of the proposed method is a reasonable choice of the upper bound of the number of detectable QTL. A minimal requirement is that the predetermined upper bound is greater than the true number of QTL with high probability. As an extreme case, we could take the total number of loci (*H*) as the upper bound. Since the number of detectable QTL is usually much less than *H*, such a choice is unlikely to be optimal. The suggestion made here utilizes the expected number of QTL and the prior probability distribution of the number of QTL to determine the upper bound. The expected number of QTL could be roughly estimated using standard genome scans. In practice, one could experiment with several values of the expected number of QTL and investigate their impact on the posterior inference. In high-dimensional problems, specifying the prior distributions on both the model space and parameters is perhaps the most difficult aspect of Bayesian model selection. We propose a novel method for elicitation of prior distribution on the indicator variables. Instead of directly specifying the prior inclusion probabilities *w*_{m} and *w*_{e}, the expected numbers of main-effect QTL and all QTL can first be given incorporating previous results and then are used to determine *w*_{m} and *w*_{e}. Here we have fixed *w*_{m} and *w*_{e} but we could relax this by treating *w*_{m} and *w*_{e} as unknown model parameters and assigning priors (Kohn *et al.* 2001).

A major difficulty of genome-wide epistatic analysis is created by the huge size of the model space. Strategies to reasonably reduce the model space, such as our proposed composite model space approach, can improve the performance of the MCMC algorithms and enhance our ability to detect complex epistatic QTL. We partition the entire genome into intervals by a number of points and restrict putative QTL to these fixed points, reducing loci to a discrete space. Additional speedup is achieved by computing the conditional probability of the genotypes given the marker data on this fixed (but dense) grid of possible locations before the MCMC procedure starts.

Several other strategies of reducing the model space could be incorporated into the proposed approach to improve the procedure. We could adopt a two-stage search method, first searching for main-effect QTL and second searching for epistatic effects of these and additional epistatic QTL given the already detected main-effect QTL. The positions and main effects of the QTL detected in the first stage should be updated in the second stage since inclusion of epistatic effects may yield more accurate estimation of the positions and the effects. Alternatively, we could selectively ignore some genetic effects. Even with a moderate number of detectable QTL, the epistatic models must accommodate many potential genetic effects. In a backcross population, for example, there are a total of *L*(*L* + 1)/2 (= 210, if *L* = 20, say) possible effects, but many may be negligible. To see this, categorize putative QTL into three types: (1) QTL with main effects (main-effect QTL), (2) QTL with weak main effects but epistatic effects with other main-effect QTL, and (3) QTL with weak main effects but epistatic effects among themselves. Letting the numbers of these three types of QTL be *L*_{1}, *L*_{2}, and *L*_{3} (*L* = *L*_{1} + *L*_{2} + *L*_{3}), respectively, and ignoring the main effects of (2) and (3) QTL, the number of possible effects reduces to *L*_{1}(*L*_{1} + 1)/2 + *L*_{1}*L*_{2} + *L*_{3}(*L*_{3} − 1)/2 (= 115, if *L*_{1} = 10, *L*_{2} = 5, and *L*_{3} = 5). These three types of QTL can be detected either simultaneously or conditionally with a three-stage approach.

A number of extensions of the basic model are possible within this framework. The simplicity of the MCMC search enhances the overall flexibility of this approach and enables one to consider analysis in more complex settings. Extensions to binary or ordinal traits, inclusion of fixed- or random-effect covariates, and gene-by-environment interactions are feasible. In principle, the composite space method can be directly applied to identify higher-order interactions. However, the dramatic increase in the size of model space is likely to limit the performance of the MCMC algorithm. We regard the methods proposed here as a step toward achieving more efficient and comprehensive analysis of complex genetic architectures. There are many opportunities to extend and improve upon this general approach.

## APPENDIX A:

### THE MODIFIED COCKERHAM EPISTATIC MODEL FOR BACKCROSS AND INTERCROSS POPULATIONS

For a mapping population with *K* + 1 genotypes per locus, there are *K* marginal effect degrees of freedom (d.f.) for each locus and *K*^{2} interaction-effect d.f. for any two loci. The design matrix **X** for model (1) has *KL* main-effect coefficients, *x _{iqk}*, and

*K*

^{2}

*L*(

*L*− 1)/2 epistatic effect coefficients,

*x*

_{iqq}_{′}

*, obtained from the genotypes at the corresponding loci by using a particular epistatic model. The main and epistatic effects are denoted by β*

_{k}*and β*

_{qk}

_{qq}_{′}

*, respectively.*

_{k}For a backcross population, there are two segregating genotypes denoted by *b _{q}b_{q}*,

*B*at locus

_{q}b_{q}*q*. For the commonly used Cockerham epistatic model (Kao and Zeng 2002), the coefficients are defined as where

*z*denotes the number of alleles

_{iq}*B*. For an intercross derived from two inbred lines, there are three segregating genotypes denoted by

_{q}*b*,

_{q}b_{q}*B*, and

_{q}b_{q}*B*at locus

_{q}B_{q}*q*. For the commonly used Cockerham epistatic model, the coefficients are defined as For the Cockerham model, β

_{q}_{1}and β

_{q}_{2}correspond to additive and dominance effects of QTL

*q*, respectively; and β

_{qq}_{′1}, β

_{qq}_{′2}, β

_{qq}_{′3}, and β

_{qq}_{′4}are the epistatic effects between loci

*q*and

*q*′, called additive-by-additive, additive-by-dominance, dominance-by-additive, and dominance-by-dominance effects, respectively. The Cockerham model keeps the same interpretation of main effects with or without epistatic effects. However, main effects should always be interpreted with caution in the presence of epistatic interactions.

## APPENDIX B:

### THE PRIOR EXPECTED NUMBER OF QTL INCLUDED IN THE MODEL

We define ξ* _{q}* as the binary variable to indicate inclusion (ξ

*= 1) or exclusion (ξ*

_{q}*= 0) of QTL*

_{q}*q*. QTL

*q*is included into the model when and only when at least one of the genetic effects associated with QTL

*q*is included. Therefore, we have where

*K*is the number of possible main effects for each locus,

*K*

^{2}is the number of possible epistatic effects for any two loci, and γ

*and γ*

_{qk}

_{qq}_{′}

*are the indicators of main and epistatic effects, respectively. The actual number of QTL then equals . The prior expected number of all QTL is the expectation of the actual number of QTL and thus can be derived as*

_{k}If we consider only main effects, then QTL *q* is included into the model when at least one of the main effects of QTL *q* is included. The binary indicator variable of QTL *q* then becomes . Therefore, the prior expected number of main-effect QTL is

## Acknowledgments

N.Y. and D.B.A. were supported by National Institutes of Health (NIH) grants NIH RO1GM069430, NIH RO1ES09912, NIH RO1 DK056366, NIH P30DK056336, and an obesity-related pilot/feasibility studies grant at the University of Alabama (Birmingham) (528176). G.A.C. was supported by NIH GM070683. B.S.Y. was supported by NIH/National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) 5803701, NIH/NIDDK 66369-01, American Diabetes Association 7-03-IG-01, and U.S. Department of Agriculture Cooperative State Research, Education and Extension Service grants to the University of Wisconsin (Madison) (C.J. and B.S.Y.). This research is a contribution of the University of Nebraska Agricultural Research Division (Lincoln, NE; journal series no. 14858) and the North Carolina Agricultural Research Service and was supported in part by funds provided through the Hatch Act.

## Footnotes

Communicating editor: J. B. Walsh

- Received December 29, 2004.
- Accepted April 4, 2005.

- Genetics Society of America