Abstract
A complex binary trait is a character that has a dichotomous expression but with a polygenic genetic background. Mapping quantitative trait loci (QTL) for such traits is difficult because of the discrete nature and the reduced variation in the phenotypic distribution. Bayesian statistics are proved to be a powerful tool for solving complicated genetic problems, such as multiple QTL with nonadditive effects, and have been successfully applied to QTL mapping for continuous traits. In this study, we show that Bayesian statistics are particularly useful for mapping QTL for complex binary traits. We model the binary trait under the classical threshold model of quantitative genetics. The Bayesian mapping statistics are developed on the basis of the idea of data augmentation. This treatment allows an easy way to generate the value of a hypothetical underlying variable (called the liability) and a threshold, which in turn allow the use of existing Bayesian statistics. The reversible jump Markov chain Monte Carlo algorithm is used to simulate the posterior samples of all unknowns, including the number of QTL, the locations and effects of identified QTL, genotypes of each individual at both the QTL and markers, and eventually the liability of each individual. The Bayesian mapping ends with an estimation of the joint posterior distribution of the number of QTL and the locations and effects of the identified QTL. Utilities of the method are demonstrated using a simulated outbred full-sib family. A computer program written in FORTRAN language is freely available on request.
THE overwhelming amount of molecular data provides a large opportunity to locate genes controlling the expression of quantitative traits. Currently, a variety of statistical methods are available for mapping quantitative trait loci (QTL). Early methods of QTL mapping were developed on the basis of the maximum-likelihood or least-squares method under a single QTL model (e.g., Lander and Botstein 1989; Haley and Knott 1992). Yet it is now known that when multiple QTL are present in the same linkage group, the single QTL model can lead to biased estimates of QTL positions and effects (e.g., Haley and Knott 1992). In theory, effects of multiple QTL can be simultaneously included in the model, but this is difficult to implement in practice because even the number of QTL is an unknown parameter. Jansen (1993) and Zeng (1994) developed the idea of composite interval mapping in which mapping in a particular interval is combined with multiple regression on markers in other chromosomal regions to absorb effects of other QTL. Recently, Kao et al. (1999) developed a multiple interval mapping (MIM) approach particularly designed for mapping multiple QTL. All these approaches provide only point estimates for number, locations, and effects of QTL. The critical values for significance tests and interval estimates of the parameters have to be established using a repeated sampling technique, e.g., a permutation test (Churchill and Doerge 1994) or bootstrapping (Visscheret al. 1996b).
Bayesian methods of QTL mapping have been developed, in particular, for detection of multiple QTL (Satagopan and Yandell 1996; Satagopanet al. 1996; Heath 1997; Uimari and Hoeschele 1997; Stephens and Fisch 1998; SillanpÄÄ and Arjas 1998, 1999). In the Bayesian analysis, rather than maximizing a likelihood function, inferences are based on the joint posterior distribution of all unknown variables given the prior distribution of all unknowns and the observed data. The introduction of iterative simulation methods, such as the data augmentation and the more general Markov chain Monte Carlo (MCMC) algorithm (Tanner and Wong 1987; Gelfand and Smith 1990), which provide a Monte Carlo approximation to the required multiple integration, has brought the Bayesian method into the mainstream of QTL mapping. For complicated pedigree data, as commonly seen in animal breeding, Bayesian QTL mapping was demonstrated by Hoeschele and VanRanden (1993a,b), implemented via MCMC by Thaller and Hoeschele (1996) for single markers, by Uimari et al. (1996) for multiple linked markers, and by Uimari and Hoeschele (1997) for multiple linked QTL. In plants, Bayesian mapping has been seen in Satagopan et al. (1996), where a prespecified number of QTL were assumed first, and then a Bayes factor approach was used to decide the most probable number of QTL. Using the reversible jump MCMC (Green 1995), researchers can even treat the number of QTL as an unknown variable and generate its posterior distribution (Satagopan and Yandell 1996; Heath 1997; Stephens and Fisch 1998; SillanpÄÄ and Arjas 1998, 1999). This full Bayesian treatment has further revolutionized QTL mapping and opened a new horizon in quantitative genetics.
Almost all the Bayesian mapping methods mentioned above are designed for normally distributed traits. Many traits of biological interest and/or economical importance, however, show a dichotomous or binary phenotype, but are not inherited in a simple Mendelian manner. The genetic architectures of these characters are generally complex, involving multiple interacting genetic factors. Furthermore, the expression of the phenotype is often sensitive to environmental factors. As a consequence, these traits are usually explained by the concept of the threshold model (Falconer and Mackay 1996; Lynch and Walsh 1998), which assumes a latent continuous variable (called the liability) underlying a binary trait. The binary phenotype and the continuous liability are linked through a fixed threshold. One can treat the liability as an unobservable quantitative trait. Genes controlling complex binary traits can be treated as quantitative trait loci and handled using the QTL mapping approach.
QTL mapping for the liability of a binary trait is more complicated than for a regular quantitative trait. Although considerable progress has been made over the past few years, the development of new statistical methodology for binary traits still poses a great challenge. In human linkage studies, a nonparametric approach has been proposed (Kruglyak and Lander 1995). Under the threshold model, parametric methods of QTL mapping based on a generalized linear model (GLM) have been developed in line crosses (Hackett and Weller 1995; Xu and Atchley 1996; Visscheret al. 1996a; Rebai 1997; Rao and Xu 1998; Xuet al. 1998). Yi and Xu (1999a,b) recently developed a random model approach to QTL mapping for complex binary traits. Because of the additional complexity added between the phenotype and the liability, these methods were developed on the basis of either a single QTL model or with some approximation. A Bayesian mapping method has not been available for binary traits and such a method is ideal for handling problems with this level of complexity. Therefore, the purpose of this study is to explore the application of Bayesian mapping to binary and multiple-ordered categorical traits.
STATISTICAL METHODS
The threshold model and liability: Let si and yi (i = 1, ⋯ ⋯ ⋯ , n) be the binary phenotype and the underlying liability, respectively, of the ith individual in a full-sib family of interest. We are interested in developing a QTL mapping algorithm in a full-sib family because a full-sib family represents the simplest form of an outbred or open-pollinated population, and the method can be easily extended to natural populations. The threshold model assumes that there is a fixed threshold in the scale of liability, t, which determines the binary phenotype of an individual by comparing yi with t. If yi > t, we assign si = 1, and otherwise si = 0. The liability yi is treated as a usual quantitative character and is thus described by the linear model,
The observables are the binary phenotypic values
From Bayes' theorem, the joint posterior density of the unobservables, {Y, l, λ, θ, M, Z}, given the binary data S, is
Prior distributions: Assuming prior independence of the locations and effects of QTL, the joint prior density p(l, λ, θ) can be factored into the following product:
In updating marker genotypes, we use the prior probabilities of the complete marker genotype at each marker locus for each individual. These prior probabilities are calculated using the simplified multipoint method of Rao and Xu (1998) in which genotypes of all markers, including the marker in question, are used to infer the prior probabilities of the allelic inherence of the current marker. This treatment guarantees that a marker genotype sampled is compatible with observed data. When a marker is already fully informative, each genotype is uniquely identified, and the multipoint prior will automatically force the marker locus not to be updated. When the marker information content is low, using the multipoint prior can increase the efficiency of MCMC compared with the two-point prior (using flanking markers only).
Conditional posterior distributions: To implement the MCMC algorithm, conditional posterior distributions of the unknowns are needed. From the joint posterior density given in (2), the conditional posterior density of each unknown can be derived by treating other elements in the list of unknowns as constants and selecting the terms involving the item of interest. When this leads to the kernel of a standard density, e.g., normal distribution, Gibbs sampling is applied to draw samples for that distribution. Otherwise, sampling needs to be done by using other techniques, e.g., the Metropolis-Hastings algorithm.
To facilitate the development of the Gibbs sampler, the unobserved liability is included as an unknown nuisance parameter in the model. This approach, known as data augmentation, has been used in Bayesian analysis under the polygenic model of threshold traits (Sorensenet al. 1995), but not in the context of QTL mapping. The conditional posterior distribution of the underlying variable yi given the binary phenotype is a truncated normal. The probability density of this truncated normal distribution is given in appendix a. The algorithm for simulating a truncated normal variable is given by Devroye (1986).
Given the liability Y, the number l, and genotype Z of QTL, the posterior distributions for the elements of θ can be derived using the standard linear model theory under a normal distribution. If normal priors are chosen for the regression coefficients θ, these posterior distributions are also normal with a mean and variance as given in appendix a. Therefore, the augmentation approach allows the use of existing MCMC algorithms for normally distributed variables. Conditional on all other parameters, marker and QTL genotypes of each individual are independent and therefore can be updated locus by locus and individual by individual (Jansenet al. 1998). Expressions of the conditional posterior probabilities are derived using Bayes' theorem, which is given in appendix a. Unfortunately, there are no explicit expressions for the conditional posterior distributions of the number l and locations λ of QTL. Updating of l and λ must be achieved using the Metropolis-Hastings algorithm.
MCMC algorithm: A Markov chain Monte Carlo method is used to generate the joint posterior distribution of all unknowns given in Equation 2. The idea of MCMC is to simulate a random walk in the space of all unknowns that converges to a stationary distribution (Gelmanet al. 1995). The stationary distribution represents the posterior distribution of the unknowns. Various approaches have been suggested to conduct MCMC. Two commonly used approaches are the Gibbs sampler and the Metropolis-Hastings algorithms. The reversible jump MCMC is an extension of the Metropolis-Hastings sampler, permitting posterior samples to be collected from posterior distributions with varying dimensions (Green 1995; Richardson and Green 1997). The proposed MCMC algorithm is carried out in an alternating conditional sampling fashion. In other words, each iteration of the sampling cycles through all elements of unknowns {Y, θ, l, λ, M, Z} represents the drawing of each unknown conditional on the current values of all other unknowns. The algorithm starts from an initial point (Y0, θ0, l0, λ0, M0, Z0) and proceeds to modify each of the unknowns in turn. To set initial values for M0 and Z0, we first calculate the probabilities of marker and QTL genotypes using the simplified multipoint method (Rao and Xu 1998) and then sample randomly a value from the probability distribution as the initial points of marker and QTL genotypes. Given the initial values of (θ0, l0, λ0, M0, Z0), we generate Y0 from the corresponding truncated normal distributions. The Gibbs sampler approach is adopted to update Y, θ, M, and Z because the conditional posterior distributions of Y, θ, M, and Z have standard forms (appendix a). Updating λ is implemented using the Metropolis-Hastings algorithm. Updating of the QTL number l requires a change in the dimension of the model and thus needs a reversible jump step. This has been accomplished by SillanpÄÄ and Arjas (1998, 1999) and is directly adopted in this study. The MCMC process is briefly summarized as follows:
Step 1: Update the liability Y individual by individual using (A1) and (A2);
Step 2: Update coefficients β and
Step 3: Update the number l of QTL and their locations λ.
Step 4: Update the QTL genotypes individual by individual and locus by locus using (A6);
Step 5: Update marker genotypes individual by individual and locus by locus using (A7).
In step 3, we make a random choice among three move types: (1) modify the QTL locations; (2) add one new QTL to the model; and (3) delete one QTL from the model, with probabilities pm, pa, and pd = 1 − pm − pa, respectively. Of course, pa = 0 if l = L and pd = 0 if l = 0, and otherwise we choose pm = pa = pd = 1/3, for 0 < l < L.
As in SillanpÄÄ and Arjas (1998, 1999), we do not fix the order of QTL when updating the QTL locations. Elements of λ are modified one at a time using the Metropolis-Hastings algorithm. For the jth QTL, a proposal
To add a QTL, we must sample a new location, new genetic effects, and new QTL genotypes for each individual. First, we sample a chromosome with a probability proportional to the length of the chromosome. Once a chromosome is chosen, a location of the new QTL λl+1 is proposed from the uniform density on the sampled chromosome. We then sample QTL genotypes for the newly added QTL for each individual from
To delete a QTL, an existing QTL is selected at random and the relevant acceptance probability is calculated. The form of the acceptance probability is also given in appendix b. If the proposal is accepted, we delete a QTL from the model. Otherwise, the QTL number remains unchanged.
SIMULATION STUDIES
Designs of simulations: The Bayesian method was evaluated empirically by analyzing simulated data. We simulated two chromosomes of length 100 cM and 70 cM, respectively. Eleven and 8 codominant markers were respectively placed on the two chromosomes with a marker distance of 10 cM. Four equally frequent alleles were simulated at each marker locus. Marker genotypes were observed for parents and all the offspring. In each design, we simulated an outbred full-sib family with 300 individuals. Three QTL were simulated to control the expression of a binary trait. There were exactly four alleles at each QTL (each parent has two distinguished alleles and the two parents are not related). Therefore, the QTL were fully informative. We used the size of the variance explained by each QTL to control the polymorphism of the QTL. If the variance is zero, the polymorphism of the QTL does not mean anything, although the four alleles are distinguished. On the other hand, if the variance is very large, but all four alleles are identical, we get the equivalent result as zero variance. Therefore, we decided to control only the variance and let the QTL be fully informative. The locations and effects of the three simulated QTL are given in Table 1. The value of the liability of each individual took the sum of the overall mean, values of QTL additive and dominance effects, and an environmental error sampled from a standardized normal distribution. The observed binary phenotype was set to be sj = 1 if the corresponding liability exceeds t = 0, and sj = 0 otherwise. We designed two simulation experiments. In design I, the overall mean of the liability was set at 0.0, which generates a trait incidence of 50%. In design II, the mean was set at −0.95, leading to a trait incidence of 20%.
The true locations and genetic effects of the three simulated QTL
For comparison, the liability of each individual was also reported and analyzed as if it were observed. Therefore, we analyzed two data sets produced from the same group of sibs, the binary data and the normally distributed data. When the normally distributed data were analyzed, we used the same program by suppressing the liability-generating subroutine and replacing the liabilities by the reported normal observations. In addition, we added a subroutine to generate the environmental error because
We modified the maximum-likelihood (ML) method of Xu and Atchley (1996) so that it can handle the data with such a full-sib family structure and compared our Bayesian results with the ML analysis. The logistic function was replaced by the probit function. The ML was implemented using an EM algorithm derived by S. Xu (unpublished results).
For all MCMC analyses, the same initial values and priors were used. The initial value for the QTL number was set at 2 and the corresponding locations were at 50.0 cM of chromosome 1 and 40.0 cM of chromosome 2, respectively. The prior Poisson mean of the number of QTL was μ = 2 and the maximum number of QTL was L = 6. The starting values for all regression parameters were 0.0. The priors for the regression parameters were normal with mean 0.0 and variance 10.0. The prior for the QTL locations was uniform over the whole genome. The tuning parameter of the proposal distribution for QTL locations was chosen to be 2.0 cM. Finally, the proposal distributions for the allelic and dominance effects were normal with means 0.0 and variance 1.0 in cases where the addition of a new QTL to the model was proposed.
In each of the MCMC analyses, we ran a single long chain with 106 cycles of simulations. The first 200 samples (burn-in period) were discarded and the chain was thinned (saved one iteration in every 50 cycles) to reduce serial correlation in the stored samples so that the total number of samples kept in the analysis was 2 × 104.
RESULTS
Before we present the result of Bayesian mapping, we first give the results of the ML analysis for the binary data. Figure 1 shows the likelihood profiles along the two chromosomes for both designs. For design I (Figure 1, a and b), we observed one peak at position 26 cM in chromosome 1, overlapping with the true location of the first simulated QTL (at 25 cM). The estimated effects of this QTL are
Likelihood-ratio profiles of ML mapping and empirical distributions of the estimated QTL position obtained by 1000 bootstrap samples from the simulated binary data in design I (a and b) and design II (c and d). The solid curves are the likelihood-ratio profiles and the histograms are the bootstrap frequencies. The left y-axis corresponds to the likelihood-ratio statistic and the right y-axis corresponds to the bootstrap frequency. The true locations of the simulated QTL are indicated with an arrow (↑).
For design II (Figure 1, c and d), a major peak was observed at 24 cM in chromosome 1 and the corresponding effects were estimated to be
The ML analyses of QTL mapping do not provide confidence intervals for the estimated QTL locations and effects. Confidence intervals would have to be determined by a resampling technique separately. We adopted the bootstrap method of Visscher et al. (1996b) to construct the confidence intervals for the estimated QTL locations. We used 1000 bootstrap samples to simulate the distributions of the locations (see Figure 1). The bootstrap means (the standard deviations) are 27.46 (10.08) cM and 25.19 (13.88) cM for design I, and 24.10 (13.57) cM and 32.2581 (27.71) cM for design II, for the two chromosomes, respectively.
In the MCMC analyses, we used the QTL intensity function of SillanpÄÄ and Arjas (1998, 1999) to detect the number and locations of QTL. The interval length was chosen to be 1 cM long. The approximate posterior QTL intensities for both the binary and the normal data in design I are shown in Figure 2. The fact that the two approximate posterior QTL intensities both have three peaks around the true locations of the three simulated QTL supports a three-QTL model and the true model is indeed of three QTL in design I. Comparing the shapes of the QTL intensities for the binary and normal data analyses, we can see that binary data analysis does lose some information, but the information retained is still sufficient to detect all the simulated QTL.
Figure 3 depicts the approximate posterior QTL intensities for both the binary and the normal data in design II. For chromosome 1, the QTL intensity graphs are concentrated around the first QTL locations for both the binary and the normal data. The second, the weakest QTL in chromosome 1, remained undetected for both the binary and the normal data, and this result was practically the same as that obtained from the ML analysis of the binary data in design II. For chromosome 2, the two approximate posterior QTL intensities both have one peak and apparently support one QTL residing at this chromosome. However, the modes of the intensities for the normal and binary data differ by ~9 cM from the simulated true location in chromosome 2.
Histograms of the posterior QTL intensity for binary data (top) and normally distributed data (bottom) in design I, respectively. Simulated true QTL locations are indicated with an arrow (↑).
Histograms of the posterior QTL intensity for binary data (top) and normally distributed data (bottom) in design II, respectively. Simulated true QTL locations are indicated with an arrow (↑).
Empirical posterior distribution of QTL number and the posterior mean
The approximate posterior distributions for the number of QTL, obtained from the two designs, are presented in Table 2. For design I, the posterior means are essentially the same for the binary data and the normal data and coincide with the simulated number of QTL. As expected, the posterior variance of QTL number for the normal data is smaller than that for the binary data. Finally, the posterior mode of the QTL number overlaps with the true number for both types of data in design I. In the analysis of design II, the estimated posterior distributions for the number of QTL appear to have shifted to the left by 1 compared with the simulated number of QTL, for the two types of data. Again, as in design I, the posterior means and modes are essentially the same for both the binary data and the normal data.
Consider next the estimations of QTL effects. The estimates are reliable only in chromosome regions in which the posterior QTL intensity or the posterior density of QTL locations is sufficiently high (SillanpÄÄ and Arjas 1998, 1999; Stephens and Fisch 1998). The highest posterior region attempts to capture a comparatively small region of the parameter space that contains most of the posterior probability mass. The chromosome regions with sufficiently high posterior QTL intensity are given in Table 3. We used only the posterior samples, in which QTL locations fall into the regions described in Table 3, to estimate the QTL effects. The posterior distributions of the QTL effects are presented graphically in Figure 4 for the three identified QTL for design I. The point estimates and the estimation errors of locations and effects of QTL for the two designs are presented in Table 3. In most cases the estimations of the QTL effects are reliable. As expected, the posterior variances of the normal data are smaller than those obtained for the binary data. For design I, the estimated QTL locations are very close to the corresponding true values and the standard errors are relatively small compared to design II, which has low heritabilities and skewed trait incidence.
DISCUSSION
We have presented here a Bayesian QTL mapping for complex binary traits. The methodology can be generalized to multiple-ordered categorical traits (appendix c). The most obvious advantage of the Bayesian method over existing ML is the ability to investigate the distributions of parameter estimates. Among the parameters of interest, the number of QTL may be the most important one. It is the Bayesian method that provides an easy way to estimate this parameter. The threshold model that the Bayesian mapping is based on is not new to QTL mapping for categorical traits. Bayesian mapping for normally distributed traits is also available. In this article we combine both techniques to develop the Bayesian mapping for binary traits. The main point of the threshold model is that by introducing an underlying normal variable into the problem, the binary response is connected to the normal linear model via the probit function. The major advantage of the threshold model as applied to Bayesian mapping is that once the underlying liability is generated, all other unknowns have conditional posterior distributions identical to those already given in Bayesian analysis of normal data.
Since the liability is a hypothetical variable, the interpretation of categorical data with a threshold model can be delicate. However, there are a number of ways to test the general validity of the model (Lynch and Walsh 1998, Chapter 25). In the threshold model, the liability is usually assumed normal. In reality, the nature of the underlying variable is always unknown. In Bayesian analysis, some kind of distribution must be assigned to this hypothetical variable and normal distribution is the natural choice. In addition, Tan et al. (1999) recently showed that the normal assumption for the liability is robust to obvious departure from normality.
The key focus of QTL mapping is on making inferences about the number of QTL, their locations, and effects. There are a number of advantages in arriving at inferential statements by using a Bayesian approach over the traditional methods. First, Bayes' method provides a complete posterior distribution for the number of QTL, their locations, and the corresponding effects. As a consequence, interval estimates of the parameters can be obtained straightforwardly. In contrast, ML only produces point estimates of these parameters. Confidence intervals would have to be determined separately, for example, by employing bootstrap (Visscheret al. 1996b) or other sampling-based methods. Second, it is usually difficult to determine the correct number of QTL using traditional methods. It has been shown that an incorrect specification for the number of QTL can lead to distortion of estimates of locations and effects when ML and least squares (LS) are used. Bayesian mapping allows one to include the number of QTL as an unknown in the analysis and thus avoids this distortion. Third, a common problem with traditional methods is how to choose the appropriate critical value of the statistical test for declaration of the presence of QTL. With the reversible jump MCMC, the number and the locations of QTL can be characterized by the posterior probability distribution of the number of QTL and the posterior QTL intensity. One can even calculate the posterior probability that some particular chromosomal region contains at least one QTL (SillanpÄÄ and Arjas 1998, 1999). Finally, the Bayes method has the inherent flexibility introduced by its incorporation of multiple levels of randomness and the resultant ability to combine information from different sources. Therefore, the Bayesian approach could be extended to allow more complicated models for more complicated data structures.
The highest posterior QTL intensity interval, Bayesian estimates of QTL locations, and allelic and dominance effects
An essential element of our full Bayesian mapping for binary traits is its ability to move between different values of the QTL number. The proposed method performed well for the simulated data. The mixing property of the MCMC algorithm does not seem to be overly sensitive to the choice of the initial values of the unknowns. For example, when we started with l0 = 6, after 100 iterations l quickly dropped to 3 and subsequently behaved the same as when we started with l0 = 2. A similar conclusion has been obtained by Heath (1997). However, the mixing property seems to be greatly affected by the proposal distribution for QTL effects when a new QTL is added to the model. Therefore, this proposal distribution should be chosen with care (Heath 1997; Stephens and Fisch 1998; SillanpÄÄ and Arjas 1999). To ensure sufficient mixing, the single MCMC chain must be sufficiently long. Although the method is computationally very intensive, it does not need repeated analyses of resampled data, as required in ML for the permutation test. On a Sun SPARC 5 workstation, our analyses with a MCMC chain of 106 took ~6.5 hr for normal data and 8 hr for binary data, respectively. A major implementation issue in MCMC is to determine the effective sample sizes. This issue is related to the assessment of convergence of the MCMC sampler, the serial correlation between the samples, and the burn-in period. When analyzing real data, one can examine time series graphs of simulated sequence and calculate the Monte Carlo variance to obtain estimates of the effective sample sizes of all parameters (Geyer 1992). In our simulation studies, it is difficult to calculate series correlation because the dimension keeps changing from one cycle to another. When the dimension changes, the identities of the QTL also change. Therefore, we empirically determined the burn-in period, the length of the MCMC chain, and the interval length of subsampling to reduce the serial correlation.
Approximate posterior distributions of maternal allelic effects ( ), paternal allelic effects (
), and dominance effects (δj) for j = 1, 2, 3, for design I. Simulated true values of QTL effects are indicated with an arrow (↑). The solid curves represent QTL mapping for binary data: (a) the first QTL determined from interval ~20–30 cM of chromosome 1; (b) the second QTL determined from interval ~67–82 cM of chromosome 1; (c) the QTL on chromosome 2 determined from interval ~19–30 cM. The dotted curves represent QTL mapping for normal data: (a) the first QTL determined from interval ~21–28 cM of chromosome 1; (b) the second QTL determined from interval ~70–80 cM of chromosome 1; (c) the QTL on chromosome 2 determined from interval ~21–28 cM.
The Bayesian procedure presented in this study is based on known marker linkage phases in the parents. When the linkage phases are not known, they must be inferred first from marker genotypes of the parents and the offspring. If grandparents are also genotyped, the linkage phases can be accurately reconstructed; otherwise, a relatively large number of offspring for each family are required (Knottet al. 1996). Alternatively, one can treat linkage phases as random variables in the Bayesian analysis, as done by SillanpÄÄ and Arjas (1999). When the family size is too small, inference of the parental linkage phases will be subject to large error and stochastic resampling is certainly required. When the mapping population contains many small families, accurate inferences of parental linkage phases are almost impossible and other statistical models may be considered, such as the IBD-based random model approach (Xu and Atchley 1995). Under the random model approach, one does not need to know the number of alleles and the parental linkage phases.
As an alternative to the approach of the liability augmentation, one can directly use the probit relationship between the binary phenotype and the model effects to simulate the model effects θ by using the Metropolis-Hastings algorithms, i.e., replacing
As mentioned previously, the major advantage of Bayesian mapping is the ability to handle multiple QTL with multiple effects, including epistatic effects. Epistatic effects can be important in phenotypic evolution. Although we did not add the epistatic effects in our Bayesian model presented in this study, it is not difficult to do so. When a new QTL is proposed, its interactive effects with all existing QTL should be proposed and, if accepted, the epistatic effects should be included in the model. A QTL is finally added to the model if at least one effect caused by the QTL is accepted. This will involve additional reversible jumps on the dimension of the model even if the number of QTL remains the same.
Throughout the study, we update the genotype individual by individual and locus by locus. In general, this kind of single-site update does not always lead to an irreducible sampler because of strong dependency of close relatives and strong dependency of adjacent loci. In practice, when we deal with complicated pedigree data and closely linked markers, block updating more than one individual and more than one locus is required for appropriate mixing of the sampler (e.g., Heath 1997; SillanpÄÄ and Arjas 1999). In our study, because the pedigree is simple and the marker loci are not close enough, we did not find any insufficient mixing in genotype sampling. A Bayesian mapping for complicated pedigree data is now under investigation in this laboratory, where a block updating strategy is being incorporated in the sampler.
Acknowledgments
We thank Dr. D. D. Gessler for his helpful comments on the manuscript. We also thank two anonymous reviewers for their critical comments on an earlier version of the manuscript. This research was supported by the National Institutes of Health Grant GM55321-03 and the U.S. Department of Agriculture National Research Initiative Competitive Grants Program 97-35205-5075 to S.X.
APPENDIX A: CONDITIONAL POSTERIOR DISTRIBUTIONS
Conditional posterior distribution of the liability yi: Conditional on θ, Zi, and si, the liability yi follows a truncated normal distribution. Depending on the binary phenotypic value si, we have
Conditional posterior distributions of the regression coefficients: Given the liability Y, the number l, and the genotype Z of QTL, the posterior distribution for θ can be derived using the standard normal linear model theory. If normal priors are chosen for the regression coefficients θ, these posterior distributions are given by
Conditional posterior distributions of the QTL and marker genotypes: The conditional posterior distribution of the QTL genotype Zij is a discrete distribution over the possible genotypes. In model (1), the QTL genotype Zij takes one of four values. Thus, for instance, the conditional posterior distribution that an individual takes a genotype zij = (1, 0, 0, 0)T is given by
The conditional posterior distribution of the marker genotype Mij is dependent only on the genotypes of relevant flanking loci (marker or QTL). Taking the prior of the marker into consideration, we can obtain
APPENDIX B: ACCEPTANCE PROBABILITIES
Updating QTL locations: To modify the location λj of the jth QTL, a proposal
Add one new QTL: Given the liability Y, the acceptance probability is the same as that of the normal trait, except that the normal observables are replaced by the simulated values of liability. As in SillanpÄÄ and Arjas (1998), the acceptance probability is min{1, α}, where
Delete one QTL: If the jth existing QTL is proposed to be removed from the model, the acceptance probability is min{1, α}, where
APPENDIX C: GENERALIZATION TO MULTIPLE-ORDERED CATEGORICAL TRAITS
The method described in the text can be generalized to multiple-ordered categorical traits. Suppose now that the observed phenotypic value si takes one of c ordered categories, 1, ⋯ ⋯ ⋯ , c. A set of fixed thresholds, t1 < t2 < ċ ċ ċ < tc−1, in the scale of the liability determine the observed categories. Let t0 = −∞ and tc = +∞. We observe si, where si = k if tk−1 < yi ≤ tk (k = 1, ⋯ ⋯ ⋯ , c). The thresholds t1, ⋯ ⋯ ⋯ , tc−1 are unknown and need to be estimated. To ensure that the parameters are identifiable, it is necessary to impose one restriction on the thresholds. Without loss of generality, we take t1 = 0 (Albert and Chib 1993) and estimate t = (t2, ⋯ ⋯ ⋯ , tc−1). Assuming that t and (l, λ, θ) are independently distributed a priori, the joint posterior distribution can be written as
It is clear that the conditional posterior distributions of θ, M, and Z are the same as those specified in the binary model. For the liability associated with the kth observation, we have
The MCMC algorithm for the binary case described in the text is now generalized to multiple-ordered categorical traits. In brief, we only need to modify the likelihood and generated t. Eventually, the liability is generated according to a doubly truncated normal rather than a singly truncated one. Updating of other parameters remains the same as in binary data analysis.
Footnotes
-
Communicating editor: T. F. C. Mackay
- Received August 23, 1999.
- Accepted March 6, 2000.
- Copyright © 2000 by the Genetics Society of America