Genetics, Vol. 165, 867-883, October 2003, Copyright © 2003

Bayesian Model Choice and Search Strategies for Mapping Interacting Quantitative Trait Loci

Nengjun Yia,b, Shizhong Xud, and David B. Allisona,b,c
a Department of Biostatistics, University of Alabama, Birmingham, Alabama 35294
b Section on Statistical Genetics, University of Alabama, Birmingham, Alabama 35294
c Clinical Nutrition Research Center, University of Alabama, Birmingham, Alabama 35294
d Department of Botany and Plant Sciences, University of California, Riverside, California 92521

Corresponding author: Nengjun Yi, University of Alabama, Birmingham, AL 35294-0022., nyi{at}ms.soph.uab.edu (E-mail)

Communicating editor: G. CHURCHILL


*  ABSTRACT
*TOP
*ABSTRACT
*METHODS
*SIMULATION STUDIES AND REAL...
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

Most complex traits of animals, plants, and humans are influenced by multiple genetic and environmental factors. Interactions among multiple genes play fundamental roles in the genetic control and evolution of complex traits. Statistical modeling of interaction effects in quantitative trait loci (QTL) analysis must accommodate a very large number of potential genetic effects, which presents a major challenge to determining the genetic model with respect to the number of QTL, their positions, and their genetic effects. In this study, we use the methodology of Bayesian model and variable selection to develop strategies for identifying multiple QTL with complex epistatic patterns in experimental designs with two segregating genotypes. Specifically, we develop a reversible jump Markov chain Monte Carlo algorithm to determine the number of QTL and to select main and epistatic effects. With the proposed method, we can jointly infer the genetic model of a complex trait and the associated genetic parameters, including the number, positions, and main and epistatic effects of the identified QTL. Our method can map a large number of QTL with any combination of main and epistatic effects. Utility and flexibility of the method are demonstrated using both simulated data and a real data set. Sensitivity of posterior inference to prior specifications of the number and genetic effects of QTL is investigated.


MOST complex traits important to evolution, animal and plant breeding, and medical genetics are influenced by multiple genetic and environmental inputs. It has been speculated that epistases (interactions among two or more genetic loci) are ubiquitous in the genetic control and evolution of complex traits (CHEVERUND and ROUTMAN 1995 Down; FRANKEL and SCHORK 1996 Down; LYNCH and WALSH 1998 Down; WADE 2001 Down). However, the contribution of epistasis to genetic variance components is difficult to estimate by traditional quantitative genetic approaches, except when clones are available (WU 1996 Down). With the availability of saturated linkage maps of molecular markers, it is possible to identify genes that significantly contribute to the variation of complex traits and to evaluate not only the marginal (main) effects of the identified quantitative trait loci (QTL) but also the epistatic effects (KAO et al. 1999 Down; SEN and CHURCHILL 2001 Down; YI and XU 2002 Down). An in-depth understanding of the epistatic interactions is important in the genetic study of complex traits. It has been described from both empirical and theoretical aspects that the potential improvement in power in discovering and localizing genes can be gained by allowing epistatic interaction when it is present (GAUDERMAN and THOMAS 2000 Down; KAO and ZENG 2002 Down). Although most gene-mapping studies have not incorporated epistatic effects into the models, epistatic interactions have been described in recent studies in different species (e.g., LARK et al. 1995 Down; FIJNEMAN et al. 1996 Down; ROUTMAN and CHEVERUD 1997 Down; YU et al. 1997 Down; FERNANDEZ et al. 2000 Down; SUGIYAMA et al. 2001 Down; DONG et al. 2003 Down).

Statistical methodology for epistatic modeling in gene mapping is in its infancy. Early methods of QTL mapping were built mainly on single-QTL models in which one QTL is fitted to the model at a time and only marginal QTL effects are detected (e.g., LANDER and BOTSTEIN 1989 Down; JANSEN and STAM 1994 Down; ZENG 1994 Down). These approaches are not directly extendable to analyzing epistatic effects. To evaluate epistatic interactions, multiple-QTL models must be used and the marginal effects of multiple putative QTL and associated epistatic effects must be modeled simultaneously. To date, statistical methods for mapping multiple QTL with epistasis were developed mainly on the basis of non-Bayesian methods. Most of these methods either first conduct a search for single-gene main effects and then search for interactions among those loci with significant individual effects, which may make sense in some contexts (e.g., CULVERHOUSE et al. 2002 Down), or fit two QTL at a time and the two-dimensional searches along the genome are used to detect QTL and estimate two-locus interactions (e.g., HALEY and KNOTT 1992 Down; WANG et al. 1999 Down). KAO et al. 1999 Down and ZENG et al. 1999 Down developed a multiple-interval mapping (MIM) approach to mapping multiple QTL and estimating epistasis in backcross designs. They used a stepwise model selection approach to progressively add or delete a QTL as well as an epistatic effect in the model. Recently, CARLBORG et al. 2000 Down proposed a genetic algorithm to reduce the computational demand for simultaneous mapping of multiple interactive QTL. JANNINK and JANSEN 2001 Down and BOER et al. 2002 Down described a statistical method to map multiple epistatic QTL with one-dimensional genome searches.

There has been a surge of interest in Bayesian approaches to mapping multiple QTL. Earlier Bayesian approaches to QTL mapping were developed to estimate the locations and the effect parameters for multiple QTL with a prespecified number of QTL (STEPHENS and SMITH 1993 Down; SATAGOPAN et al. 1996 Down; UIMARI et al. 1996 Down). Recently, SEN and CHURCHILL 2001 Down described a Bayesian method to map interacting QTL, but still assumed a known QTL number. However, the number of QTL is in fact unknown, and hence the dimensionality of the parameter space is also unknown. GREEN 1995 Down introduced the reversible jump Markov chain Monte Carlo (MCMC) algorithm, a variant of the Metropolis-Hastings algorithm for problems where the dimension of the parameter space is itself one of the unknowns. Using the reversible jump MCMC method, we can treat the number of QTL as a random variable and generate its posterior distribution. The Bayesian methods implemented via the reversible jump MCMC have been applied to map QTL in both line-crossing designs (SATAGOPAN and YANDELL 1996 Down; SILLANPAA and ARJAS 1998 Down, SILLANPAA and ARJAS 1999 Down; STEPHENS and FISCH 1998 Down; YI and XU 2000 Down) and pedigree data (HEATH 1997 Down; UIMARI and HOESCHELE 1997 Down; XU and YI 2000 Down; UIMARI and SILLANPAA 2001 Down; YI and XU 2001 Down; CORANDER and SILLANPAA 2002 Down). Several of these approaches have been successfully applied to real data analyses (SATAGOPAN and YANDELL 1996 Down; DAW et al. 1999 Down; HURME et al. 2000 Down; YUAN et al. 2000 Down; BINK et al. 2002 Down). YI and XU 2002 Down extended the reversible jump algorithm to map interacting QTL. Their method included all possible epistatic effects in the model and applied the reversible jump only to the number of QTL. With a large or moderate number of putative QTL, however, including all possible epistatic effects must accommodate a very large number of potential parameters, which causes problems in determination of the genetic model.

In this article, we treat the number of QTL as a random variable. Under the current number of QTL, we introduce indicator variables to represent the number of effects and the actual effects presented in the model. We use the methodology of the Bayesian model and variable selection to develop strategies for searching for multiple QTL with complex epistatic patterns and estimating the associated parameters simultaneously. Incorporation of epistasis not only necessitates determining the genetic model conditional on the number of QTL, but also necessitates a new algorithm for inferring the number of QTL since adding or deleting QTL will depend on the presence of either main or epistatic effects. An efficient reversible jump MCMC algorithm is proposed to update the number of QTL under complicated epistatic models. We also develop a reversible jump MCMC method to infer main and epistatic effects conditional on the number of QTL.


*  METHODS
*TOP
*ABSTRACT
*METHODS
*SIMULATION STUDIES AND REAL...
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

Epistatic model:
We describe the method primarily for a mapping population with only two segregating genotypes, e.g., a backcross, double haploid lines (DHLs), or recombinant inbred lines (RILs). Assume that a complex trait is affected by l QTL in a mapping population. Among the l QTL, some may exhibit only main effects, some may show both main effects and epistasis, and others may show only epistasis. For a continuously distributed trait, the observed phenotypic value of individual i, yi, can be described by the following linear model,

(1)

where n is the number of individuals in the mapping population, µ is the overall mean, l is the number of QTL on the genome, aq is the main effect of putative QTL q, bq1q2 is the epistatic effect between putative QTL q1 and q2, xiq is the indicator variable denoting the genotype of putative QTL q for individual i and is defined by 0.5 or -0.5 for the two genotypes in the mapping population (KAO et al. 1999 Down; ZENG et al. 1999 Down), ei is the residual error assumed to follow N(0, {sigma}2e), {gamma}q is a binary indicator variable for main effect of putative QTL q, taking value one if QTL q has main effect and zero otherwise, and {gamma}q1q2 is a binary indicator variable for epistatic effect between QTL q1 and q2, taking value one if QTL q1 and q2 interact and zero otherwise. Hereafter, we call these binary indicator variables effect indicators. Apparently, an effect indicator {gamma} taking value zero is equivalent to setting zero to the corresponding effect or removing the corresponding effect from the model. As will be seen, however, the introduction of the effect indicator variables facilitates our description and analysis of the problem. Finally, it is worth noting that each QTL included in model (1) has at least one nonzero effect indicator.

In practice, we observe the phenotypic values and a set of marker genotypes , where K is the number of markers. For a mapping population with only two segregating genotypes, mik takes one of two values, e.g., 0 and 1, denoting two different genotypes. Assume that marker linkage maps have been developed on the basis of observed marker data so that the locations of markers on each chromosome are known a priori. The observed marker data are used to infer the probability distribution of the unobserved putative QTL. Since it may be difficult to distinguish multiple QTL on a marker interval (e.g., LYNCH and WALSH 1998 Down), we assume that there is at most one QTL on any marker interval although this is not a fundamental requirement for our method. For a dense linkage map, this assumption is reasonable.

Based on the observed data, our aim is to search all types of QTL and to infer their locations, main effects, and epistatic effects simultaneously. There are several special statistical problems, which complicate mapping multiple interacting QTL. First, one does not know a priori how many QTL to expect for a given trait. Therefore, the number of QTL has to be treated as an unknown parameter and inferred along with other parameters in the model. Under the epistatic model, the l putative QTL are chosen on the basis of either their significant main effects or their epistatic effects; this complicates inference about the number of QTL. Second, the locations of multiple QTL are unknown. Therefore, one needs to search for QTL on the entire genome. Third, mapping multiple interacting QTL requires accommodating a very large number of potential genetic effects, even when one assumes only a moderate number of QTL. For example, if there are 10 QTL, 10 parameters would be required to model the main effects, and 45 parameters to model digenic epistasis. In fact, many of these possible effects, particularly the epistatic effects, are not significantly different from zero and thus should be removed from the model. However, one does not know a priori which effects are significantly different from zero. Moreover, in the traditional frequentist framework, such a backward selection procedure model would increase the type I error rate if not accommodated in some way (POPE and WEBSTER 1972 Down). In this article, we develop a Bayesian approach to search all types of QTL and infer the number, locations, and main and epistatic effects simultaneously.

Prior distributions and joint posterior distribution:
One can view the problem of mapping epistatic QTL as one of model determination and variable selection. Conditional on the observed data y and M, we seek to determine the number of genes and associated effect indicator variables and to estimate the locations of identified genes and the model parameters. Hereafter, we denote the number of QTL by l, locations by {lambda}, vector of the effect indicators by {gamma}, main effects by a, epistatic effects by b, genotypic indicator for the genotypes of putative QTL by x, and all model parameters by In some situations, the genotypes of some markers and the phenotypic values of some individuals are missing. For convenience of description, all marker genotypes and phenotypic values are assumed known, although they can be imputed (see the next section).

We employ a Bayesian framework in which statistical inference is based on the joint posterior distribution of all unknowns. The joint posterior distribution of all unknowns, (l, {gamma}, {lambda}, {theta}, x), can be written as

(2)

The first term in this equation is the conditional distribution of the phenotypic data given all unknowns and has the form

where The second term p(x|l, {lambda}, M) is the probability distribution of QTL genotypic indicator conditional on the number, locations of QTL and marker genotypes. Assuming that there is at most one QTL on any marker interval, p(x|l, {lambda}, M) can be factorized into the products where mLiq and mRiq represent the left and the right flanking marker genotypes, respectively. Methods with which to calculate p(xiq|{lambda}iq, mLiq, mRiq) in different experimental designs are described in JIANG and ZENG 1997 Down.

The third term p(l, {gamma}, {lambda}, {theta}) in Equation 2 is the joint prior distribution of the parameters (l, {gamma}, {lambda}, {theta}). Assuming prior conditional independence of the parameters, we can factorize the joint prior distribution into the following products:

The prior distribution of l, p(l), is chosen to be uniform (0, lmax), or Poisson(lmean), for a prespecified integer lmax or Poisson mean lmean. To specify the conditional prior for the vector of QTL positions {lambda}, p({lambda}|l), we assume that (1) a QTL is uniformly distributed across the entire genome and (2) there is at most one QTL on a marker interval. Under these two assumptions, the conditional prior distribution of QTL locations can be calculated as

where p({lambda}1) is uniform across the entire genome, and p({lambda}q|{lambda}1, ... , {lambda}q-1) is uniform across the regions of the unoccupied marker intervals (without QTL). The prior distribution p({gamma}|l) reflects our subjective belief on the genetic effects conditional on the number of QTL. When no information regarding the genetic architecture of the trait is available, p({gamma}|l) can be decomposed into the products of all elements, i.e., and a natural choice for p({gamma}q) and p({gamma}q1q2) is uniform at two states of 0 and 1. This prior specification gives the same prior weight to all models conditional on the number of QTL and is widely used as a noninformative prior in variable selection problems (NTZOUFRAS 1999 Down). The priors for the overall mean µ and the QTL effects aq and bq1q2 are assumed to be independently normal, i.e., µ ~ N({eta}0, {tau}20), aq ~ N({eta}, {tau}2), and bq1q2 ~ N({eta}, {tau}2), with prespecified prior means {eta}0 and {eta} and variances {tau}20 and {tau}2. The prior for {sigma}2e is assumed to be uniform on a predefined interval. The lower and upper bounds for {sigma}2e are usually set to zero and the phenotypic variance present in the data, respectively.

Posterior computations:
The calculation of the above joint posterior distributions is analytically intractable, and thus the MCMC approach is required to obtain observations from the joint posterior distribution. We use the reversible jump MCMC method of GREEN 1995 Down to generate samples from the joint posterior distribution. Specifically, our reversible jump MCMC algorithm proceeds as follows:

a.Update the missing marker genotypes and phenotypic values.

b.Update the model parameters ({theta}).

c.Update the genotypic indicators (x).

d.Update the locations of QTL ({lambda}).

e.Update the effect indicator ({gamma}): add or delete a main or epistatic effect.

f.Add one new QTL with main effect or epistatic effects between existing QTL, or delete an existing QTL.

g.Add two new QTL with only an epistatic effect between themselves, or delete two existing QTL.

It is noted that deleting an effect and deleting one or two QTL from the model may result in some QTL having all corresponding effect indicators equal to zero. If it occurs, we remove such QTL from the model. A complete pass through these steps is referred to as a sampling iteration. The algorithm starts from an initial point (l(0), {gamma}(0), {lambda}(0), {theta}(0), x(0)) and then proceeds to update each of the groups of the unknowns in turn until a certain criterion of convergence is reached (BROOKS and GIUDICI 2000 Down). Once the sampler converges, the posterior samples {(l(t), {gamma}(t), {lambda}(t), {theta}(t), x(t)):t = 1, 2, ...} are random drawings from the joint posterior distribution p(l, {gamma}, {lambda}, {theta}, x|y, M), and samples of individual parameters can be regarded as random drawings from the appropriate marginal posterior distribution. These posterior samples are used to draw inferences about parameters of interest.

Except for the effect indicator vector and the number of QTL, other unknowns do not result in a change in the dimension of the parameter space. Therefore, these unknowns can be updated using traditional Metropolis-Hastings (M-H) algorithms (GELMAN et al. 1995 Down). Conditional on l, {gamma}, and x, model (1) is a conventional linear model, and thus the model parameters {theta} can be generated using the Gibbs sampler (GELMAN et al. 1995 Down; YI and XU 2002 Down). Given l, {gamma}, x, and {theta}, the missing phenotypic values can be sampled from normal distributions (GELMAN et al. 1995 Down). The missing marker genotypes and the genotypic indicators of QTL can be updated from discrete distributions using the Gibbs sampler (YI and XU 2000 Down). We use our previous method developed in YI and XU 2001 Down, YI and XU 2002 Down to update the locations of QTL, but with the assumption of there being at most one QTL at any marker interval under consideration. To update {lambda}q, we propose a new location {lambda}*q from uniform ({lambda}q - d, {lambda}q + d), where d is a predetermined tuning parameter, and generate genotypic indicators at the new location for all individuals. The proposals for the new location and the genotypic indicators are then accepted or rejected using the M-H algorithm (YI and XU 2002 Down). Our updating scheme updates the locations and the genotypic indicators of QTL jointly and results in a reasonable acceptance rate and mixing behavior (UIMARI and SILLANPAA 2001 Down; YI and XU 2001 Down, YI and XU 2002 Down). Updating the effect indicator vector and the number of QTL results in a change in the dimension of the parameter space and thus needs reversible jump steps. We here develop novel reversible jump steps for updating these two parameters as follows.

The effect indicator vector {gamma} represents which effects are present in the model. Conditional on other unknowns, updating {gamma} is a variable selection problem and thus a variety of Bayesian variable selection methods can be applied to update {gamma} (NTZOUFRAS 1999 Down). We here develop a simple reversible jump MCMC algorithm. Assume that the current number of QTL is l, and a total of l(l + 1)/2 elements are in the vector {gamma}, corresponding to possible QTL effects for model (1), including l main effects and l(l - 1)/2 digenic epistasis. To update {gamma}, we randomly select one of all l(l + 1)/2 elements and then propose a change for this element. If the element associated with the epistasis between QTL q'1 and q'2 is selected, for example, then we propose to change {gamma} to {gamma}' where with all other components remaining the same. In other words, we propose to add the epistasis bq'1q'2 into the model if the effect bq'1q'2 is out of the current model, or delete the epistasis bq'1q'2 from the model if the effect bq'1q'2 is included in the current model. The proposal can be accepted or rejected according to the reversible jump MCMC method (GREEN 1995 Down). A detailed description of the reversible jump algorithm for updating {gamma} can be found in Appendix A.

To detect multiple QTL with any combination of main and epistatic effects, we can add four types of QTL into the current model. These four types of QTL include a new QTL with only main effect, a new QTL having only epistatic effects with the current QTL, a new QTL having both main effect and epistatic effects with the current QTL, and two new QTL with only epistatic effects between themselves. We develop two novel reversible jump steps to search for all four types of QTL. Specifically, we use step f to search the first three types of QTL and step g to search the fourth type of QTL. With these two steps, we are able to identify QTL with complex epistatic patterns.

In step f, we randomly decide whether to propose (a) adding one new QTL with main effect or epistatic effects between existing QTL or (b) deleting an existing QTL with equal probability. Let j(l + 1|l) and j(l - 1|l) = 1 - j(l + 1|l) be the probabilities of choosing either move type. Of course, j(l - 1|l) = 0 if l = 0 and j(l + 1|l) = 0 if l = lmax, where lmax is the maximum value allowed for l, and otherwise we choose j(l - 1|l) = j(l|l - 1) = 0.5, for l = 2, 3, ... , lmax - 1. Adding one QTL requires generating additional parameters for the new QTL, i.e., new location, genotypic indicators, effect indicators, and selected new effects. A detailed description for generating these additional parameters and the acceptance probability are given in Appendix B. To delete one QTL, we randomly select one QTL among the existing QTL and then calculate the relevant acceptance probability. The acceptance probability for deleting QTL is also derived in Appendix B.

In step g, we make a random choice between attempting to add two new QTL with only epistatic effects or deleting two existing QTL, with probabilities j(l + 2|l) and j(l - 2|l) = 1 - j(l + 2|l), respectively. Similar to step f, we choose j(l - 2|l) = j(l|l - 2) = 0.5, for l = 2, 3, ..., lmax - 2, and j(l - 2|l) = 0 if l <= 1 and j(l + 2|l) = 0 if l >= lmax - 1. To add two QTL, we need to generate two new locations on the genome, genotypic indicators at these two locations for all individuals, and an epistatic effect for these two QTL. A detailed sampling scheme for generating these unknowns and the acceptance probability are given in Appendix C. The acceptance probability for deleting two QTL is also given in Appendix C.


*  SIMULATION STUDIES AND REAL DATA ANALYSIS
*TOP
*ABSTRACT
*METHODS
*SIMULATION STUDIES AND REAL...
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

Simulation studies:
The applicability of the proposed method was demonstrated by analyzing three simulation experiments. Each experimental population was a backcross containing 300 segregating individuals. In designs I and II, one chromosome of length 150 cM was simulated, and 16 codominant markers were placed on the simulated chromosome with positions shown at the numerical labels of the horizontal axis of Fig 1. Zero and three QTL were simulated in designs I and II, respectively. The overall mean and the residual variance were set to µ = 0 and . In design II, three QTL were placed at 25, 63, and 103 cM, respectively, and the nonzero main and epistatic effects of the three loci were a1 = -0.70, a2 = 0.75, and b23 = 1.0, where aq is the main effect of QTL q, and bq1q2 is the epistatic effect between QTL q1 and q2. Therefore, the third QTL had no main effect but interaction with the second QTL. The proportions of phenotypic variance explained by the nonzero effects were ~9, 10, and 8%, respectively. In design III, four chromosomes with length 100 cM each were simulated, and 11 codominant markers were placed on each chromosome with positions shown at the numerical labels of the horizontal axis of Fig 3. We simulated seven QTL with complex epistatic patterns controlling the expression of a quantitative trait. The locations of the simulated QTL, the genetic effects (main and epistatic effects), and the heritability explained by each effect are shown in Table 1. The overall proportion of phenotypic variance explained by all QTL was ~55%. The overall mean and the residual variance were µ = -1 and respectively. To evaluate the ability of our method for handling missing markers and missing phenotypic values, we randomly generated missing markers of 15% and missing phenotypic values of 10% in design III.



View larger version (26K):
In this window
In a new window
Download PPT slide
 
Figure 1. Profiles of posterior QTL intensity for design II with respect to the different priors for the number and genetic effects of QTL. The true positions of the three simulated QTL are indicated by the arrows on the horizontal axis. *The prior variances for all QTL effects are 1.5. **The prior variances for all QTL effects are 2.5. For other cases, the prior variances for all QTL effects are 1.0.



View larger version (19K):
In this window
In a new window
Download PPT slide
 
Figure 2. Posterior distributions of the number of QTL for design III under the epistatic and nonepistatic models.



View larger version (20K):
In this window
In a new window
Download PPT slide
 
Figure 3. Profiles of posterior QTL intensity for design III under the epistatic (left side) and nonepistatic models (right side). The true positions of the seven simulated QTL are indicated by the arrows on the horizontal axes.


 
View this table:
In this window
In a new window

 
Table 1. Simulation of seven-QTL model: true locations, main effects, and epistatic effects of seven simulated QTL

Both designs I and II were replicated five times and the averaged results were reported. For all analyses, the MCMC were started with no QTL in the model, and the same prior distributions for the overall mean and the effect indicator vector were used. We tried to run the MCMC sampler using other starting values for the number of QTL and found the results to be consistent. The prior for the overall mean was N(0, 2). The residual variance took uniform(0, V), where V represents the phenotypic variance observed in each design. The prior for the indicator vector {gamma} was taken as uniform on the space of {gamma} as described earlier. The tuning parameter of proposals in the random-walk Metropolis-Hastings algorithm for updating QTL locations was chosen to be 2.0 cM. Different prior distributions for the number of QTL and the genetic effects were chosen for the analyses of three designs and are described later.

In each analysis, the MCMC sampler was run for 3 x 105 cycles after discarding the first 2000 cycles for the burn-in period. The chain was thinned (one iteration in every 15 cycles was saved) to reduce serial correlation in the stored samples so that the total number of samples kept in the post-Bayesian analysis was 2 x 104 (GELMAN et al. 1995 Down). The stored sample was used to infer the parameters of interest. We now describe our results for each experimental design.

Zero-QTL model: In design I, we tested whether any spurious loci were detected when there is no true QTL. The priors for all main and epistatic effects were chosen to be N(0, 0.5). The prior distribution of l was taken as uniform(0, 6) and Poisson distribution with three different prior means (see Table 2). For all analyses, we checked the plots of the changes in the number of QTL against the number of the iterations (not shown here). These plots showed that the MCMC sampler mixes well, with excursions into high values being short lived. The estimated posterior probability distributions for the number of QTL, averaged over five replicates, are given in Table 2. The posterior probability p(l = x|y, M) (x = 0, 1, 2, ...) was obtained by counting the number of samples in which the number of QTL is l, divided by the total of number of samples. For all prior specifications, the posterior probabilities of l = 0 have the highest value. Table 2 also gives Bayes factors B(0, 1) (= p(l = 0|y, M)/p(l = 1|y, M) x p(l = 1)/p(l = 0)), comparing a model with no QTL against a model with one QTL (SATAGOPAN and YANDELL 1996 Down; GAFFNEY 2001 Down). Evidently, the results supported a model with no QTL.


 
View this table:
In this window
In a new window

 
Table 2. Analysis of zero-QTL model: estimates of the posterior distribution of the number of QTL, its expectation, and Bayes factor

Three-QTL model: In the analyses of design II, we were primarily interested in checking for sensitivity to the choice of prior specifications for the number and genetic effects of QTL. The prior distribution for the number of QTL was taken as uniform(0, 10) and Poisson distribution with four different prior means (see Table 3). The priors for all main and epistatic effects were chosen to be N(0, {tau}2) and three different values for {tau}2 were considered (see Table 3). The prior distributions of the number and genetic effects of QTL appear in the acceptance ratios (see Appendix A A–C) and thus may influence the posterior distributions of the parameters of interest and the mixing behavior of the chain.


 
View this table:
In this window
In a new window

 
Table 3. Analysis of three-QTL model: estimates of the posterior distribution of the number of QTL, its expectation, and Bayes factors

As observed in the analyses of design I, the plots of the changes in the number of QTL against the number of the iterations showed that the MCMC sampler mixes well, changing frequently but always centralized around the posterior mode of the QTL number (not shown here). The estimated posterior probability distributions for the number of QTL are given in Table 3. In all analyses, the posterior mode of the number of QTL coincided with the simulated value and the posterior mean approximately equaled the true value. The posterior mode of the number of QTL was not affected by the choice of prior mean for the number of QTL and prior variance for genetic effects. However, the posterior probabilities were affected by these prior specifications. When taking Poisson distribution as the prior for the number of QTL, increasing the prior mean favored a higher number of QTL. Similarly, reducing the prior variance for genetic effects also favored a higher number of QTL. Table 3 also gives Bayes factors B(3, 2) and B(3, 4) for all analyses. Apparently, Bayes factors did not seem to be affected by the prior of l, but were affected by the choice of prior variance of genetic effects. However, all analyses supported a model with three QTL.

QTL locations were estimated using the posterior QTL intensity function (SILLANPAA and ARJAS 1998 Down). For convenience, we divided each chromosome into many small intervals of equal length (bins), say 1 cM, and then calculated the frequency of hits by the QTL in each interval from all posterior samples (SILLANPAA and ARJAS 1998 Down). A more accurate method to calculate the posterior QTL intensity can be found in HOTI et al. 2002 Down. For the different priors of the number and genetic effects of QTL, the QTL intensity profiles are depicted in Fig 1. These profiles were fairly similar and thus these priors did not seem to affect the posterior inference of QTL positions. Apparently, these profiles had three obvious peaks, each close to a true QTL. Therefore, three QTL were localized with high precision. The posterior inference about the effect indicators and the genetic effects were not sensitive to the priors of the number and genetic effects of QTL.

Seven-QTL model: In design III, we primarily evaluated the ability of the proposed method for mapping complex epistatic genes. The prior for the number of QTL was taken as uniform(0, 20). The priors for all main and epistatic effects were chosen to be N(0, 1). We tried to run the MCMC sampler using other prior distributions for the number of QTL and the main and epistatic effects and found the results to be similar. When Poisson with a small prior mean was taken as prior distribution for the number of QTL, however, the chain did not mix well and converged slowly. For uniform or Poisson distributions with suitable prior means, the chains mixed well.

The estimated posterior probability distribution for the number of QTL is given in Fig 2. It is immediately apparent from Fig 2 that the estimated posterior mode of the number of QTL coincides with the simulated value and the estimated posterior mean approximately equals the true value. The posterior probability of the true value of the QTL number was much greater than those of other values. Therefore, the number of QTL was estimated accurately using our method.

The profiles of posterior QTL intensity are shown in Fig 3 (left side). The regions of 95% highest posterior density (HPD; GELMAN et al. 1995 Down) are given in Table 4. As shown in Fig 3 and Table 4, the profiles of posterior QTL intensity are concentrated around the true locations of the simulated QTL and the HPD regions include the true locations. The modes of the posterior distribution of QTL locations are given in Table 4. It can be seen that, for most of the identified QTL, the estimated positions were close to the true values. As expected, the third and fourth QTL, which have only an epistatic effect, were localized slightly inaccurately compared with other QTL.


 
View this table:
In this window
In a new window

 
Table 4. Analysis of seven-QTL model: 95% high posterior density regions of QTL locations, posterior modes of QTL locations, main and epistatic effects, and heritability explained by each effect

Inference for the QTL effects and the proportion of phenotypic variance explained by each effect was obtained conditional on the estimated number of QTL (estimated mode) and the estimated loci falling into the corresponding 95% HPD regions. Although the empirical marginal posterior distribution for each effect and heritability can be depicted, for simplicity we report only the posterior means and standard deviations. These estimates were calculated to average over all models conditional on l = 7 with the effect set to zero if the effect was not selected. Table 4 gives the estimated posterior means and posterior standard deviations for the main and epistatic effects of the identified QTL, as well as the heritability of each effect. Compared with Table 1, we can see that most of the effects and the heritabilities of the effects were estimated accurately. Table 4 also gives the posterior probability that each effect is included in the model (i.e., that the corresponding indicator is nonzero). For all simulated main and epistatic effects, the corresponding posterior probabilities were estimated to be >90%. However, the posterior probabilities for all nonexisting effects were small.

The estimates of the overall mean and the residual variance were -0.911 and 1.052, respectively. The posterior standard deviations for these estimates were 0.137 and 0.102, respectively. Therefore, the overall mean and the residual variance were estimated with high precision.

We reanalyzed the simulated data using the nonepistatic model, where only main effects of QTL are fit in the model. The priors for the number, locations, and main effects of QTL and the tuning parameter for updating the locations of QTL were set to be the same as those in the above analysis. The estimated posterior probabilities for the number of QTL are given in Fig 2. Obviously, the estimated posterior mode was much smaller than the true number of QTL. The posterior mean was estimated to be 3.14, which was also smaller than the true number of QTL. Therefore, it is immediately apparent that some QTL remained undetected when ignoring epistasis. The profiles of the posterior QTL intensity are depicted in Fig 3 (right side). Clearly, all QTL with only epistatic effects were not detected in this analysis. Surprisingly, the two QTL simulated on chromosome one were actually not detected, although these two QTL have main effects. It was found that the main effects were estimated with bias. Finally, the estimates of the overall mean and the residual variance were -0.653 and 1.672, respectively. As expected, the residual variance was estimated to be greater than the true value because the epistatic variation was absorbed into the residual variance.

Real data analysis:
Data from the North American Barley Genome Mapping Project (TINKER et al. 1996 Down) were reanalyzed using the proposed Bayesian method. Seven traits were investigated in the project: heading, yield, maturity, height, lodging, kernel weight, and test weight. We represent only the result of heading here. The DH (double haploid) population contained 145 lines (n = 145) and each was grown in a range of environments. A total of 127 mapped markers (K = 127) covering 1500 cM of the genome along seven linkage groups were used in the analysis. The average phenotypic values across the environments were calculated for each line and these average values were treated as the original phenotypic values (yi) for the analysis. These phenotypic values were further standardized using where is the mean and s is the standard deviation of y. The standardized records were subject to the Bayesian analysis.

The priors for the number, locations, and main and epistatic effects of QTL and the tuning parameter for updating the locations of QTL were set to be the same as those in the analysis of design III described above. The length of the chain was also set to be the same as that in the simulated studies. The posterior probability distribution of the number of QTL is depicted in Fig 4, showing that the variation of the trait is most likely contributed by seven or eight loci with an equal chance (~30%), and a slight chance (~18%) contributed by nine loci. The estimated posterior expectation of the number of QTL was 7.98. The posterior probability that there are at least seven loci was ~92%. Therefore, a conservative conclusion is that at least seven loci are contributing to the genetic variation. In contrast, using the interval mapping method (LANDER and BOTSTEIN 1989 Down) and the composite interval mapping method (ZENG 1994 Down), TINKER et al. 1996 Down detected only three QTL.



View larger version (12K):
In this window
In a new window
Download PPT slide
 
Figure 4. Posterior distribution of the number of QTL for real data.

The posterior QTL intensity is shown in Fig 5. The 95% HPD regions and the posterior modes of the locations of the seven QTL are given in Table 5. Seven QTL were localized at six chromosomes with width of 95% HPD from 20 to 38 cM. The three regions, on chromosomes one, four, and seven, respectively, were declared by TINKER et al. 1996 Down as significant. But TINKER et al. 1996 Down failed to detect the other four regions. This illustration suggests (but does not unequivocally demonstrate) that our method is more powerful than certain existing approaches in discovering QTL.



View larger version (15K):
In this window
In a new window
Download PPT slide
 
Figure 5. Profiles of posterior QTL intensity for real data.


 
View this table:
In this window
In a new window

 
Table 5. Real data analysis: 95% high posterior density regions of QTL locations, posterior modes of QTL locations, main and epistatic effects, and heritability explained by each effect

Conditional on the estimated number of QTL (l = 7) and the estimated loci falling into the corresponding 95% HPD regions, we calculated the main and epistatic effects and the heritability explained by each effect (Table 5). Four QTL exhibited negative main effects, and the other three QTL showed positive main effects. The posterior standard deviations for the estimates of the main effects were small, indicating that the estimates were fairly precise. Table 5 also gives the posterior probability that each effect is included in the model. The estimated posterior probability that each main effect is included in the model was >85%. It seems that there were four nonzero epistatic effects, i.e., b14, b35, b45, and b46, whose heritabilities are 2.5, 1, 0.9, and 1.1%, respectively. The posterior probability including the strongest epistatic effect, b14 in the model, was ~85%, and other posterior probabilities were <55%. Therefore, we can conclude that there is at least one epistatic effect for heading in this DH population. The proportion of phenotypic variance explained by each main effect ranged from 5.3 to 11.3%. However, the overall proportion of phenotypic variance explained by the seven QTL was ~60%.


*  DISCUSSION
*TOP
*ABSTRACT
*METHODS
*SIMULATION STUDIES AND REAL...
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

Epistatic variance can be an important source of variation for complex traits. However, detecting complex epistatic genes is difficult primarily due to analytical methods, insufficient sample sizes, and study designs that exclude epistasis (FRANKEL and SCHORK 1996 Down). In this article, we develop a Bayesian method implemented via the reversible jump MCMC algorithm to search for epistatic QTL across the entire genome and to jointly infer the number of QTL, their genomic positions, and their significant main and epistatic effects. The proposed method was shown, via simulation, to be effective in detecting a large number of QTL with complex epistatic patterns. We investigated sensitivity of posterior inference to prior specifications of the number and genetic effects of QTL. These prior distributions do not affect the posterior mode of the number of QTL and the positions and genetic effects of QTL, but affect the posterior distribution of the number of QTL. The Bayes factor for the number of QTL is relatively insensitive to the choice of prior distribution of the number of QTL, but sensitive to the prior distributions of genetic effects. In this aspect, therefore, the proposed method retains the properties of the existing reversible jump MCMC algorithms for mapping nonepistatic QTL (SATAGOPAN and YANDELL 1996 Down; THOMAS et al. 1997 Down; GAFFNEY 2001 Down). Recently, GAFFNEY 2001 Down proposed adjusting the prior variances of genetic effects as the number of QTL changes and showed that such a strategy can reduce sensitivity to the priors of genetic effects. We plan to incorporate the priors of GAFFNEY 2001 Down into our procedure in the future.

Our approach differs from the existing reversible jump algorithms for QTL mapping in several essential ways. First, most of the reversible jump algorithms have been developed on the basis of nonepistatic models. In nonepistatic models, only an effect parameter is associated with one QTL and thus it is relatively easy to implement the reversible jump for adding one QTL or deleting one QTL. In epistatic models, however, many epistatic effects are involved when adding or deleting a QTL. In our previous study (YI and XU 2002 Down), we generated all possible effects corresponding to the proposed QTL when one new QTL is proposed to be added. Although such a method has been shown to work well when the number of QTL is small, e.g., l = 2 or 3, the acceptance rate for adding one QTL may be drastically reduced as the number of QTL increases. In the present study, the number of effects and the actual effects are first proposed when adding one QTL. The proposed effects are then generated from their fully conditional posterior distribution. We introduced an indicator vector {gamma} to represent the number of effects and the actual effects presented in the model. The number of effects and the epistatic effects included in the model are proposed from the prior distribution of {gamma}. With such an algorithm, we have a chance at each step to pick a "good" combination of effects for the new QTL in arbitrarily complicated situations. The acceptance rate for adding or deleting one QTL in our method is typically 3–5%. Importantly, the acceptance rates are rarely influenced by the true number of QTL and the pattern of epistasis. These properties make our algorithm generally useful for mapping QTL under complicated epistatic models. Second, our method can search QTL with only epistatic effects between themselves, which is achieved in step g. Since this type of QTL does not depend on other types of QTL, we search them in a separate step, which simplifies the MCMC algorithm. In our simulation studies, the acceptance rate for step g is typically 0.5–1%. Third, our approach uses a new reversible jump step to determine the genetic models in terms of the number of effects. We have treated this problem as a variable selection problem. Model updating is performed by local moves where single terms are added or deleted from the model. When a variable is added or deleted from the model, all remaining parameters are unchanged. Such a procedure has been successfully applied to model determination and variable selection in a log-linear model (DELLAPORTAS and FORSTER 1999 Down) and a general linear model (DELLAPORTAS et al. 2002 Down). When an effect is proposed to be added to the model, we generate a value for the parameter corresponding to the proposed effect variable from its fully conditional posterior distribution. The acceptance rates for these moves are typically ~25–30%. For our complex simulated data sets and the real data, these proportions are satisfactory and show that our proposal based on the local moves and the fully conditional posterior distribution is sensible.

The aim of our approach is similar to the MIM method proposed by KAO et al. 1999 Down and reviewed by ZENG et al. 1999 Down. The idea of both our approach and the MIM approach is to determine the genetic model of a complex trait and to estimate the associated genetic parameters. The essential difference between our approach and the MIM approach lies in the statistical methodologies that the two approaches use. Our method is Bayesian implemented via modern MCMC algorithms whereas the MIM uses an expectation-maximization algorithm to evaluate the likelihood and a stepwise search procedure to build the model. Advantages of the Bayesian approaches to QTL mapping over the non-Bayesian methods have been elucidated in a number of articles (e.g., HOESCHELE et al. 1997 Down; SILLANPAA and ARJAS 1998 Down; YI and XU 2000 Down; HOESCHELE 2001 Down). These advantages are more obvious in the context of mapping interacting QTL. In our Bayesian approach, we can add or delete two QTL at a time and thus uncover epistatic QTL without main effects. For maximum likelihood, however, it is difficult to perform such a search (ZENG et al. 1999 Down). Our procedure can infer the posterior distributions for all parameters of interest as well as the Bayes factors for the number of QTL, which facilitates model selection and comparison (GAFFNEY 2001 Down). In the non-Bayesian methods, many statistical tests are required to determine the genetic model, and hence a high statistical threshold must be adopted to avoid false positives among those tests. A high threshold would drastically reduce the power of detecting QTL and finding significant epistasis (JANNINK and JANSEN 2001 Down). We aim to derive and present posterior distributions of the parameters of interest and not just approximate "best estimates." Inferences about any particular parameter of interest can be obtained by either averaging over possible models or using a single selected model. Therefore, the Bayesian approach can provide more robust inferences than non-Bayesian methods (BALL 2001 Down; SILLANPAA and CORANDER 2002 Down). The common aim of the non-Bayesian methods is to find a single "optimal" model regarded as the number, locations, and effects of QTL and then to make inferences as if the selected model were the true model. However, this ignores uncertainty about the model itself.

Our method appears to work well when tested using both the complex simulated data and real data set. However, we do not want to claim that the proposed algorithm is absolutely the optimal one. Improvements in several aspects can be made, e.g., computational efficiency, acceptance rates for adding or deleting QTL, and mixing behavior. The proposed reversible jump algorithm is computationally intensive. For analysis of a single data set of average size, our current program takes several hours, which makes a full-scale simulation study with many replicates of each choice of parameters a difficult task. However, the time required could be greatly reduced by optimizing the program codes and using a more efficient sampler (GAFFNEY 2001 Down). One possible way to improve the acceptance rate for adding QTL is to choose the location(s) of the new QTL to target the more "important" regions of the genome, rather than randomly selecting location(s) from the unoccupied regions. In the context of human nuclear families, LEE and THOMAS 2000 Down developed a method to propose a location by scanning the unoccupied regions of the entire genome for evidence of linkage of the trait residuals, although they did not discuss epistasis. They showed that their method has greatly improved the acceptance rate. On the other hand, we can apply the multiple-try MCMC to first select randomly several locations from the unoccupied regions and then choose the most important one among these locations (LIU et al. 2000 Down). The multiple-try MCMC may compromise Lee-Thomas's genome-scanning method and the randomly choosing method as used here in terms of statistical efficiency and computational load. Finally, it is also important to extend our procedure to handle experimental designs with three segregating genotypes (e.g., F2) and include higher-order epistatic effects. For such a mapping population, we need to include dominance, additive-by-dominance, dominance-by-additive, and dominance-by-dominance effects into the model. In principle, our algorithm could be easily extended to accommodate these additional parameters. Under these extensions, however, the number of parameters is largely increased. Further work will be devoted to improving computational efficiency and investigating the mixing behavior of our reversible jump algorithm.


*  ACKNOWLEDGMENTS

We are very grateful to Dr. Alfred Bartolucci, the associate editor, and two anonymous reviewers for their helpful comments and criticisms. This work was supported in part by the National Institutes of Health (NIH) no. GM55321; the U.S. Department of Agriculture National Research Initiative Competitive Grants Program 00-35300-9245 to S.X.; and NIH grants R01ES009912, P41RR006009, R01DK054298, and P30DK56336 to D.B.A.

Manuscript received March 10, 2003; Accepted for publication May 19, 2003.


*  APPENDIX A
*TOP
*ABSTRACT
*METHODS
*SIMULATION STUDIES AND REAL...
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

UPDATING THE INDICATOR VECTOR OF QTL EFFECTS {gamma} AND CORRESPONDING EFFECTS
Assume that the current number of QTL is l and that there are a total of l(l + 1)/2 possible QTL effects for model (1), including l main effects and l(l - 1)/2 digenic epistasis. We randomly select one of the l(l + 1)/2 possible effects. If the epistasis between QTL q'1 and q'2 is picked, then we propose to change {gamma} to {gamma}', where with all other components remaining the same.

If , i.e., the epistasis bq'1q'2 is currently out of the model, we propose to add this epistasis into the model; this needs to generate an additional effect parameter from a proposal density q(bq'1q'2). We take the following fully conditional posterior distribution as the proposal density

where and {eta} and {tau}2 are prior mean and variance of QTL effects, respectively. The proposal move is accepted with probability min(1, r), where

and . If the proposed move is accepted, update {gamma} by {gamma}' and add the effect bq'1q'2 into the model; otherwise leave the state unchanged.

If i.e., the epistasis bq'1q'2 is in the current model, we propose deleting this epistasis from the model. The proposal move is accepted with probability min(1, r), where

where {theta}' is {theta} with bq'1q'2 removed, and q(bq'1q'2) takes the form

where If the proposed move is accepted, update {gamma} by {gamma}' and delete the effect bq'1q'2 from the model; otherwise the state remains unchanged.


*  APPENDIX B
*TOP
*ABSTRACT
*METHODS
*SIMULATION STUDIES AND REAL...
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

ADDING OR DELETING ONE QTL
We first describe the sampling scheme for generating unknowns for the added QTL, i.e., new location, new genotype indicators for all individuals, effect indicators, and selected new effects, and then derive the acceptance probabilities for adding and deleting one QTL.

The unknowns for new QTL are generated as follows:

  1. Sample a location {lambda}* uniformly from all unoccupied intervals on the whole genome.

  2. Sample the genotype indicator for all individuals from the conditional distribution

    (3)

    A total of l + 1 possible new effects are associated with this proposed QTL, including one main effect and l epistatic effects between the proposed QTL and all existing QTL. In most situations, however, not all these effects are significant. Therefore, we first generate a random integral k between 1 and l + 1 and then pick k effects of l + 1 possible new effects at random. The proposed new effects, denoted by ß* = (ß*1, ... , ß*k), can be sampled sequentially from the conditional distributions,

These conditional distributions are univariate normal and thus can be easily sampled from.

The change in the number of QTL from l to l + 1, together with the proposed parameters, is accepted with probability min(1, r), where the acceptance ratio is

where {gamma}' is the indicator vector of QTL effects after adding the proposed QTL, {theta}' = ({theta}, ß*), x' = (x, x*), and

Deleting a QTL is simply the reverse process. A QTL is randomly chosen among the existing QTL. The chosen QTL, together with all corresponding parameters, is then proposed to delete from the model with probability min(1, r), where

where ß* is the main and epistatic effects of the deleted QTL, and {theta}', {gamma}', and x' are {theta}, {gamma}, and x, respectively, with the items corresponding to the deleted QTL removed.


*  APPENDIX C
*TOP
*ABSTRACT
*METHODS
*SIMULATION STUDIES AND REAL...
*DISCUSSION
*APPENDIX A
*APPENDIX B
*APPENDIX C
*LITERATURE CITED

ADDING OR DELETING TWO EPISTATIC QTL
In step g, we need to propose two QTL with only an epistatic effect between themselves. The associated parameters can be generated as follows:

  1. Sample two locations, {lambda}*1 and {lambda}*2, uniformly from all unoccupied intervals on the whole genome.

  2. Sample the genotype indicators and , for the two proposed locations, respectively, from the conditional distributions