- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Sillanpää, M. J.
- Articles by Arjas, E.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Sillanpää, M. J.
- Articles by Arjas, E.
Bayesian Mapping of Multiple Quantitative Trait Loci From Incomplete Inbred Line Cross Data
Mikko J. Sillanpääa and Elja Arjasaa Rolf Nevanlinna Institute, University of Helsinki, FIN-00014, Helsinki, Finland
Corresponding author: Mikko J. Sillanpää, Rolf Nevanlinna Institute, Research Institute of Mathematics, Statistics, and Computer Science, P.O. Box 4, FIN-00014 University of Helsinki, Finland, mjs{at}rolf.helsinki.fi (E-mail).
Communicating editor: Z-B. ZENG
| ABSTRACT |
|---|
A novel fine structure mapping method for quantitative traits is presented. It is based on Bayesian modeling and inference, treating the number of quantitative trait loci (QTLs) as an unobserved random variable and using ideas similar to composite interval mapping to account for the effects of QTLs in other chromosomes. The method is introduced for inbred lines and it can be applied also in situations involving frequent missing genotypes. We propose that two new probabilistic measures be used to summarize the results from the statistical analysis: (1) the (posterior) QTL-intensity, for estimating the number of QTLs in a chromosome and for localizing them into some particular chromosomal regions, and (2) the locationwise (posterior) distributions of the phenotypic effects of the QTLs. Both these measures will be viewed as functions of the putative QTL locus, over the marker range in the linkage group. The method is tested and compared with standard interval and composite interval mapping techniques by using simulated backcross progeny data. It is implemented as a software package. Its initial version is freely available for research purposes under the name Multimapper at URL http://www.rni.helsinki.fi/~mjs.
WHEN two purely homozygous, inbred, very genetically divergent lines are crossed, all offspring (F1 generation) are genetically identical, being heterozygous at each locus. In the haplotypes of the F1 individuals, the locus next to each quantitative trait locus (QTL) has the same allele as it had in the parental haplotype. This is because, in this ideal case, parents are homozygous at each locus and recombination events cannot change haplotypic arrangements. Therefore, linkage disequilibrium (nonrandom allelic association) in this group is maximal.
When an F2 or backcross generation is produced, linkage disequilibrium is reduced slightly but still remains at a high level. The degree of reduction depends on the distance and on the recombination fraction between the considered QTL and the nearby marker locus. If mating is continued till F3 and the succeeding generations, disequilibrium area surrounding a QTL is reduced further in each generation. This is why the backcross or F2 intercross data from inbred lines is particularly suitable for QTL mapping.
The commonly used QTL mapping methods for plants and animals introduced recently use offspring data from divergent inbred lines in backcross or F2 intercross design. The reason for using such a design is to maximize linkage disequilibrium and the amount of heterozygosity (information content) in meioses. Examples of such methods are interval mapping (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
In interval mapping (![]()
![]()
The composite interval mapping procedure introduced by ![]()
![]()
![]()
![]()
![]()
![]()
![]()
The purpose of this paper is to present an approach for high resolution QTL mapping in inbred line crosses from incomplete data when the phenotypic distribution is assumed to be Normal. Our modeling approach is based on regression and the assumption that individual QTL effects are additive. The method belongs to the general framework of variable dimensional Bayesian models (e.g., ![]()
![]()
![]()
![]()
![]()
MCMC methods (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
Bayesian inference for gene mapping has been considered by ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
Bayesian QTL mapping in outbred livestock population, using granddaughter design, has been studied by ![]()
![]()
![]()
![]()
![]()
The contents of this paper are as follows. Next, we describe our statistical model. The results from simulation experiments are described thereafter. The final section contains a discussion of the method and of the experiences we had. In APPENDIX A we outline the MCMC algorithm used in the estimation.
| MODEL |
|---|
QTL search is usually concentrated on a given chromosome (or chromosomal segment). Therefore everything in the following is with respect to such a chosen fixed linkage group (chromosome) unless the contrary is stated. Let y = (y1,y2, ..., yNind) denote the vector of known phenotypes (missing phenotypes are not considered here), where Nind is the number of individuals in the experiment and yi is the phenotype of the ith individual. Suppose that the observations yi are distributed according to a normal law resulting from some design in inbred linecross data. We shall view the unknown number of QTLs, denoted by Nqtl, as an unobserved random variable. Denote the QTL locations by l = (l1,l2, ..., lNqtl) and let
be an Nqtl x Nind matrix, where the qth column xq = (xq1,xq2, ...,xqNind)' is the QTL genotype vector in location lq, with element xqi referring to the ith individual. Let G* and G be the corresponding complete and incomplete (observed) marker information respectively; G* and G are taken to be Nind x N matrices, where N is the number of markers. Let I be the chromosomal interval with the first and the last markers of the chromosome as endpoints. A fixed marker map (i.e., known recombination fractions between markers whose order is known), denoted by m, is assumed known before the analysis.
We denote complete and incomplete (observed) marker information in other chromosomes respectively by G*o and Go. Let X*o be a subset of the complete marker information in other chromosomes, consisting of selected columns (marker genotype vectors) of G*o . Using an obvious set notation, X*o
G*o . Similarly, let Xo be a subset of incomplete marker information, Xo
Go. We assume that X*o is chosen to correspond to known background control information (a selected set of markers that are hoped to be close to influential QTLs outside the interval I). Again, we arrange X*o into the form of an Nind x Nbc matrix and denote its (i,k)th element by X*ik , corresponding to the genotype of the ith individual's kth background control. Here, Nbc is the number of background controls.
In the chosen design (i.e., experimental cross), let Ngen be the number of possible genotypes and let
= (
1, ...,
Ngen) be the vector including all possible genotypes at any (marker or QTL) locus, ordered so that all homozygote genotypes come before heterozygotes. In this setting, genotypes AB and BA are considered to be the same. The regression parameters are the following: a is the regression intercept (mean value), bq = (bq1, ..., bqNgen) is a vector of regression coefficients (classification variables) where bqj is the regression coefficient for the qth QTL genotype
j at location lq,
2 = Var(ei) is the residual variance and C = (ckj) is an Nbc x Ngen matrix of regression coefficients ckj for background controls. We consider the following statistical model for y:
![]() |
(1) |
Here 1{xqi=
j} is the indicator variable (dummy), which takes value one if the ith individual's qth QTL genotype xqi at location lq is
j, and otherwise its value is zero. Similarly, 1{X*ik=
j} is the indicator variable taking value one if the ith individual's marker genotype in the kth background control is
j and otherwise its value is zero. Here we assume that the residuals ei are independent and normally distributed according to N(0,
2). In order to maintain the traditional way of considering gene effects one can make the following convention. In the case of three possible genotypes, say
= (AA,BB,AB), the constraint bq2 = -bq1 will produce an additive effect for homozygotes and make bq3 correspond to a dominance effect of a heterozygote for each q. For the background control parameters, the constraint ck2 = -ck1 for k = 1, ..., Nbc will have a similar interpretation. In case of backcross, the corresponding constraints are bq1 = 0 and ck1 = 0, respectively.
We use the shorthand notation
= (a,b1, ..., bNqtl,
2,C) and
= (
,
,l,G*,X*o,Nqtl) . Notation A* ~ A means that A* is consistent with A in cases where A is incomplete (observed) and A* is complete information. From Bayes' theorem, we get p(
|y,G,Xo,m) =
p(y,
,G,Xo|m) , where the joint density p(y,
, G,Xo|m) can be factored into a likelihood and a (joint) prior density as p(y,
,G,Xo|m) = p(y,G,Xo|
,m)p(
|m). Here the likelihood can be further written into the form of the product p(y,G,Xo|
, m) = p(G|
,m)p(Xo|G,
,m)p(y|
,m,G,Xo) = 1{G*~G}1{X*o~Xo}p(y|
,m) . In this, 1{G*~G} and 1{X*o~Xo} are indicators taking values one and zero depending on whether the complete genotypes G* in the chromosome and in the background control sites X*o (in the other chromosomes) are consistent with their observed incomplete counterparts or not.
As for the joint prior distribution p(
|m), given the marker map m, we make the following (conditional) independence assumptions:
- Given Nqtl, the vector
consisting of the parameters of the linear model (1) is independent of the other coordinates of
, i.e., of (
,l,G*,X*o) , as well as of m; - X*o , the vector consisting of the true background control genotypes in other chromosomes, is independent of (
,l,G*,Nqtl), the true marker and QTL-information in the considered chromosome as well as of the marker map m; - the true genotypes at marker locations, G*, are conditionally independent of the number of QTLs (Nqtl) and their locations (l) given the marker map m.
Then the joint prior density function can be factored further and be presented in the product form
![]() |
(2) |
The density p(X*o) is not conditioned on the fixed marker map, because the prior for genotypes can be thought not to be dependent on the marker order or distances. The posterior density of
is then proportional to the joint density
![]() |
(3) |
![]() |
(4) |
In specifying prior densities it is important to note that the number of possible alleles and genotypes (Ngen) depends on the (experimental) design in question. Therefore, also the prior densities p(
|G*,l,m,Nqtl), p(G*|m) and p(X*o) should reflect the design. Crosses between inbred lines have two alleles, forming two (three) different genotypes in backcross (F2 intercross) designs.
Let M*s be the complete genotype vector in the sth marker position, i.e., the sth column in G*, and let M*s,i be its ith component. In case there are some unobserved genotypic data, i.e., G*\\G
0, we consider the following conditional independence structure for the prior p(G*|m): we assume that p(M*s|G*-s,m) = p(M*s|M*s-1,M*s+1,m) =
Nindi=1 p(M*s,i|M*s-1,i, M*s+1,i,m) , where G*-s includes all the other columns in G* except the sth. In a backcross design, the prior for an individual i with complete genotype information G* can be computed as the product
![]() |
(5) |
![]() |
(6) |
In the case of F2 intercross, for each individual the transition probabilities p(M*s+1,i|M*s,i,m) from position s to s + 1 can be arranged into the 3 x 3 matrix containing all possible transitions between states AA, BB, and AB
![]() |
(7) |
The prior distribution of Nqtl, the number of QTLs, is here assumed to be truncated Poisson, where the Poisson mean
and the maximum number Nqtlmax of QTLs are fixed control parameters such that 0
Nqtlmax. The upper bound Nqtlmax is introduced for computational reasons.
In the following, we shall use the generic term "object" for any marker or QTL in the chromosome. The prior distribution of QTL genotypes at locations l is assumed to have the following product form:
![]() |
(8) |
Here, G*qiL and G*qiR are the genotypes of the left and the right flanking object (marker or QTL) for the qth QTL in individual i, chosen among the complete set of the markers in the chromosome and the QTLs at positions l1,.,lq-1. When the location of a QTL and the corresponding flanking object genotypes are known, the QTL genotype is independent of the genotypes of other objects (markers or QTLs) in this list. We denote by rq = (rq1, rq2) the resulting recombination fractions between the QTL at lq and the corresponding flanking objects. As is often done in QTL mapping applications, the same recombination rates for male and female meioses are assumed also here. The algorithms for constructing the probabilities of different QTL genotypes (last term in Equation 8) under backcross and F2 designs can be found in APPENDICES B1 and B2, respectively. In this construction, Haldane's map function is used for converting the distance between lq and the left flanking object to a corresponding recombination fraction for each q. Haldane's formula rq2 =
then gives the recombination fraction between lq and the right flanking object, with rqfm being the recombination fraction between the two flanking objects of the qth QTL. Note that Equation 8 allows more than a single QTL within the same marker interval. Apart from being more general than models in which at most one QTL is allowed, this feature is thought to improve the mixing properties of the MCMC sampling algorithm.
An obvious choice for the priors of all the QTL locations is the uniform distribution on the considered chromosome, corresponding to the assumption that no prior knowledge concerning the QTL loci is available. However, if some `non-data dependent' knowledge has been obtained, for example, using cytogenetical methods (e.g., physical exclusion mapping), one can specify a prior which has most of its mass on some narrower chromosomal area. (Note that the uniform prior densities here do not cause any integrability problems, because all chromosomes are of finite length.) We shall assume equal prior probabilities for background control genotypes and use uniform prior distributions for all regression parameters. The natural range for the residual variance, for example, is between zero and the phenotypic variance.
The main features of the proposed model are summarized graphically in Figure 1. Our main interest is in the number of QTLs and in their positions in the considered chromosome. In order to arrive at a meaningful description of the results from the estimation we consider the QTLs as forming a nonhomogenous Poisson process over the chromosome. The results of the statistical analysis can then be expressed in an intuitive and coincise manner in terms of the corresponding estimated intensity. In practice, we divide the chromosome into intervals (bins)
1,
2, ...,
Nbins of equal length (according to the Haldane distance). The interval length 
j
chosen by the analyst reflects the resulting mapping resolution. Denote the number of MCMC cycles (sampling iterations) by Ncycs, and let
![]() |
(9) |
j obtained from the Monte Carlo simulation. Here
N(k)qtlq=1 1{l(k)q
j} is the number of QTLs in
j in round k of the simulation. The product
^j
j
gives then an obvious approximation of the posterior expected number of QTLs in interval
j. (Note that some bins might occasionally contain more than just one QTL during the same iteration cycle.) We combine the estimates
^j into a single QTL-intensity function by writing
^(s) =
j
^j1{s
j} , that is,
^(s) =
^j for s
j .
|
For assessing phenotypic effects in the backcross design, let Dj(d) be the cumulative distribution function (c.d.f.) associated with the phenotypic effect of a putative QTL in bin
j. There will then be one such c.d.f. for each bin. In the backcross design, let
![]() |
(10) |

j, b(k)q1-µ(k)q
d} , and 1{l(k)q
j, b(k)q3-µ(k)q
d} , respectively. Here µ(k)q =
. In the constrained model (where the constraint bq2 = -bq1 is assumed for each q) the corresponding indicators are 1{l(k)q
j, b(k)q1
d} and 1{l(k)q
j, b(k)q3
d} . For each empirical c.d.f. we determine its median and the 2.5- and 97.5-percent quantiles. The statistics are then drawn as functions of j, to get curves as shown in Figure 2. The estimates are stable when the denominator Ncycs
^j
j
in Equation 10 is large. In practice one should therefore concentrate on bins j for which
^j is not too small. But these are precisely those bins which are most likely to contain a QTL anyway.
|
| SIMULATION ANALYSIS |
|---|
In order to test the performance of this method, a data set was simulated by the QTL Cartografer software (![]()
|
First, the two data sets were analyzed by using the IM and CIM/05 methods. The background controls for the analyses were chosen by standard stepwise regression (QTL Cartografer software). The background control markers for the complete data were marker two in chromosome 1, markers zero, one, and six in chromosome 2, and marker two in chromosome 3. In the analyses where 30 percent of genotypes were missing, the background controls were marker two and three in chromosome 1, markers one and six in chromosome 2 and marker three in chromosome 3. The same background controls were also used in the corresponding Bayesian analyses. In the CIM, "window width" was 10 cM (i.e., the background controls less than 10 cM from the analyzed interval were not included in the model).
Several test runs were made in order to carefully adjust the range of the proposal distributions (i.e., parameters that control the maximal step size) of the Metropolis-Hastings algorithm. Such a range is specified for each of the following: (1) QTL locations, (2) regression mean, (3) residual variance, and regression coefficients of both (4) the QTLs and (5) background control genotypes. These control parameters influence directly the rejection rates; if they are chosen carelessly the chain may not convergence to the correct limiting distribution within a reasonable time. The values used in the final analyses are given in Table 2.
|
The (C-program implementing a Metropolis-Hastings-Green) chain was run 500,000 rounds for all analyses in chromosomes 1 and 2 and 1,000,000 and 1,500,000 rounds respectively for the complete and incomplete data in chromosome 3. Computations were made on an UltraSparc Model 170 workstation, with running times varying between 1 hr and 30 min, and 6 hr and 20 min, depending on the chromosome and on other work load on the computer. The initial value for the number of QTLs was three, and the corresponding locations were 20.0 cM, 50.0 cM, and 80.0 cM in all MCMC runs. The Poisson mean (hyperparameter) was set to
= 2 and the maximum number of QTLs (in the analyzed chromosome) to Nqtlmax = 3. The prior of the residual variance was chosen to be uniform over the range [0.0, 0.89], the right endpoint being equal to the phenotypic variance estimate from the data. The prior of the intercept was taken to be uniform over [-100, 100], and that for QTL and background control genotypic regression coefficients uniform over [-2, 2]. In all cases the chosen ranges were certain to cover all realistic parameter values. Finally, the prior for QTL locations was uniform over [0, 100].
Results:
The likelihood ratio statistic (LRS) curves (in base 10 logarithmic scale) from IM and CIM/05 runs, and the Bayesian posterior QTL intensities in the considered three chromosomes, are shown on the left side of Figure 2 (complete data) and 3 (incomplete data). The phenotypic effect estimates given by the IM and CIM/05 methods, as well as the curves consisting of the pointwise (i.e., in different locations of the putative QTLs) medians and the 2.5- and 97.5-percent quantiles of the posterior distributions of phenotypic effects are shown in the same figures on the right. For an obvious reason, the phenotypic effect estimates deserve serious consideration only in those chromosomal regions in which the statistical analysis suggests that there actually might be a QTL. Judging by the level of the QTL-intensity, we have shaded such areas in the figures.
Approximate posterior probability distributions of the number of QTLs in different chromosomes, for both the complete and the incomplete data, are given in Table 3 and Table 4. The posterior expectation of the number of QTLs in each chromosome (shown in Table 3 and Table 4) coincides with the area under the corresponding QTL-intensity curve (in Figure 2 and Figure 3).
|
|
|
Perhaps the most natural question to ask in this context is the following: "What is the (posterior) probability that some particular chromosomal area, say
, contains at least one QTL, given the evidence in the data?" Denoting the number of QTLs in
by N
qtl , one can calculate various MCMC approximations of that probability as shown in Equation 11 below, where
(s) is the QTL-intensity at point s, and
^(s) is its estimator:
![]() |
(11) |
The last approximation, where the integral on the right is actually an expression for the (posterior) expected number of QTLs in
, is reasonable only if the integral is small. Table 5 gives a few such approximations for some chromosomal areas, based on either the complete or the incomplete data. These regions represent a moderate to high posterior QTL-intensity region surrounding the mode. Making the interval
longer will obviously always increase both the probability P(N
qtl
1 | data) and the expectation E(N
qtl | data) , but this will be at the cost of less accurate localization of the genes. In other words, given the evidence in the data, there is always a trade-off between accuracy and probability, just as in confidence intervals in classical statistical inference. In this sense, Table 5 represents only one possible summary of our findings. We have not made an attempt to establish a standard for forming such intervals or corresponding cut-off points here. Indeed, we think that the intensity curve in itself is the best summary of the information concerning the number of QTLs and their locations in the chromosome as obtained from the statistical analysis.
|
We do not display empirical threshold values (corresponding to permutation tests) for IM and CIM in Figure 2 and Figure 3 because they were not the main issues here. However, the LOD score 3.0 would correspond to the likelihood ratio statistic 3.0 x
(e)
13.82 , which can be used as a rule-of-thumb threshold when examining the figures (even though it was originally derived for monogenic traits in human). In practice the CIM method needs a somewhat higher threshold than IM (![]()
![]()
In Figure 2, the IM and CIM curves, apart from CIM in chromosome 3, contain enough evidence (in the sense that the LRS is greater than the threshold value 13.82) of QTL activity. From Figure 2 it can be seen that our method performs well in all chromosomes, by giving well localized high intensities for QTLs close to their true locations. Note that the highest QTL-intensities are often found near the modes of CIM or IM curves. The weakest QTLs in chromosomes 1 and 2 remained undetected by any of these methods, apparently because of their small phenotypic effects. All three methods estimated the phenotypic effect of the most influential QTLs very accurately, but IM and CIM/05 do not give confidence intervals for the estimates [unless they are estimated separately with a permutation test (![]()
In the case of incomplete data, the mode of the intensity given by our method differs by 12 cM from the simulated true location in chromosome 3 (Figure 3). This is apparently a consequence of reduced genotype information in the incomplete data set, as well as of the weak phenotypic effect attributable to this particular QTL. Note in this case the low maximal level of the posterior QTL-intensity, compared to the levels reached in the other two chromosomes.
For a more explicit comparison of the three methods, one-lod-support intervals (see ![]()
(e) units in the LRS scale). An alternative would have been to estimate confidence intervals for the QTL locations, by applying a permutation test (see ![]()
![]()
|
Considering first the complete data set and the first chromosome with the LRS evaluated at every 0.1 cM, we obtained the one-lod-support interval [23.0 cM, 29.2 cM] using the IM, and [21.6 cM, 27.3 cM] using the CIM method. The latter almost coincides with the interval
= [21 cM, 28 cM] (see Table 5 and Table 6), obtained from Figure 2 so that it contains practically all moderate to high posterior QTL-intensity values. All these intervals covered the true QTL at 24.15 cM. The posterior probability that the region
contains at least one QTL is 0.63. The true phenotypic effect of the second QTL (at 72.21 cM) in chromosome 1 was so small that this QTL was not detected by any of the three methods discussed here.
At the "left end" of chromosome 2, we obtained the one-lod-support interval [2.9 cM, 9.0 cM] when using the IM, and [10.0 cM, 11.9 cM] when using the CIM method. The latter interval is actually so narrow that the true QTL at 7.77 cM falls just outside it. The posterior QTL-intensity in this region is concentrated almost completely on the interval
= [4 cM, 10 cM] which again contains the true QTL. At the "right end" of chromosome 2, the LRS curve arising from CIM is bimodal, resulting in two overlapping one-lod-support intervals, [53.6 cM, 59.6 cM] and [52.9 cM, 67.3 cM]. The shorter CIM interval is so narrow that it does not contain the true QTL at 61.67 cM, but the wider one does. In this case, a natural choice for a Bayesian credible region would be the interval
= [53 cM, 68 cM], which is very close to that obtained by CIM and for which the posterior probability of there being at least one QTL in the region is 0.64. The IM method failed to reveal any statistically significant QTL activity in this part of chromosome 2.
In chromosome 3, we obtained the one-lod-support interval [0.0 cM, 39.8 cM] with the IM method. With the CIM method, we could not determine a corresponding support interval because the maximum LRS value was less than the critical value LOD 3.0. Because of the small phenotypic effect of the simulated QTL, the Bayesian method did not localize this QTL as well as in chromosomes 1 and 2. Perhaps a natural interval to suggest, if at all, would be [0 cM, 37 cM], which would be nearly identical to the interval suggested by the IM-analysis and which would cover 90.2 percent of the total area under the QTL-intensity curve in that chromosome.
Whenever possible, all three methods estimated the locations of the QTLs fairly well (see Table 6). In all these analyses the Bayesian estimates (posterior medians) of individual phenotypic effects were close to the true values, and they were practically the same as what was obtained when applying the CIM method. In this respect, the IM method turned out to be much inferior.
The results from analysing the incomplete data were largely similar, see Table 5 and Table 6, and Figure 3 for details. The main difference was that the one-lod-support intervals become somewhat wider, as did the corresponding Bayesian credible regions. The differences were not very large, however.
| DISCUSSION |
|---|
We have presented here a novel method for high resolution mapping of multiple QTLs and for the estimation of their phenotypic effects, using the general framework of Bayesian variable dimensional models. For the estimation of the parameters (see APPENDIX A) we use the Metropolis-Hastings algorithm with "reversible jumps" (![]()
The main advantage in using Bayesian, instead of the more traditional frequentist inferential methods, is that they enable the analyst to quantify probabilistically the uncertainty involved in each claim made about QTLs, without needing to use problematic mental constructs such as "the relative frequency of incorrect decisions made in a long sequence of trials repeated under similar conditions." The transformation of the prior distribution into the posterior through Bayes' formula, corresponds directly to the natural intuition of "learning from the data." Depending on the goals of the study, the specification of the prior can be "neutral" if the goal is to present a statistical summary of the information in the data, or, if available, it can also reflect an expert's prior knowledge about unobserved quantities, in which case the posterior will be a synthesis of such expert knowledge and empirical evidence coming from the data. Another major advantage of the Bayesian approach is the relative ease by which missing data (such as missing genotypes) problems can be handled, together with the estimation of all other unobservables. Finally, the application of MCMC methods gives considerable freedom in building large hierarchical statistical models corresponding to the analyst's perception of the underlying genetic structures and dependencies.
The results of the statistical analysis are summarized by two new measures: the posterior QTL-intensity, considered as a function of its location in the chromosome, and the posterior distribution of the phenotypic effect of the corresponding putative QTL. Such probabilistic summary measures seem to correspond directly to the immediate objectives of QTL mapping, that is, localizing the important QTLs in different chromosomes and estimating their effects on the phenotype(s). In particular, in situations where one can expect that there are several QTLs in the considered chromosomal area, the posterior QTL-intensity captures the essential information about their number and positions in an easily interpretable probabilistic form. In this way we can avoid completely the difficult inferential problems concerning the "correct" threshold values of a LOD score (or a LRS) which arise in testing multiple QTL hypotheses. Moreover, our method does not appear to produce false positives easily, as one can expect to get a low QTL-intensity in regions where there is no, or is only little, QTL activity.
The performance of our method was compared to interval mapping (IM) and composite interval mapping (CIM) by using a simulated backcross population of 250 offspring. A second data set was obtained by randomly deleting 30 percent of the marker genotypes in the complete set.
In the execution of the MCMC sampling we used an overparametrized regression model which has one extra coefficient for each QTL and for each background control locus. Therefore the model intercept and the genotypic coefficients are not identifiable as such, but their contrasts (phenotypic effects) are. We also tested the alternative updating scheme used by ![]()
The MCMC methods require some amount of computational effort and will, in practice, need the capacity of a workstation. To ensure sufficient mixing, we performed some relatively long test runs before a final QTL analysis. Another possibility is to apply some diagnostic tools, such as CODA (![]()
|
An important issue which has so far come up only implicitly in our analysis is that the number of QTLs and their phenotypic effects cannot be considered in isolation from each other. The idea that any gene with a non-zero effect on phenotypes is a QTL which in principle could be detected from data seems like an idealization far from reality, and as a consequence, "the correct number of QTLs" will exist as an objectively defined quantity only in simulated data. Here we have controlled this problem by bounding the number of QTLs in each chromosome by a given constant. An alternative would be to modify our method by paying attention only to influential QTLs, in the sense that their phenotypic effect exceeds some given threshold T. The only change this would make into our formulas is that, in Equation 9 and Equation 10, we would have to add into indicators the restriction |b(k)q2 - b(k)q1|
T .
Another question we have not discussed is how to choose the markers which are to be used as covariates in the analysis. Instead of using the results from a preliminary stepwise regression analysis as we have done here, one could "learn" from Bayesian QTL analyses of other chromosomes and choose certain markers from high posterior QTL-intensity areas as QTL representatives (background controls). Alternatively, one could choose covariates from amongst a set of candidate genes in case such genes are available. A final possibility would be to consider the entire genome in a single variable dimensional QTL analysis, always using the "current" QTLs as controls (![]()
![]()
An initial version of the program source code (written in C language) is freely available for research purposes from Rolf Nevanlinna Institute's web page (http://www.rni.helsinki.fi/~mjs). A more user-friendly documented version is currently being developed. The present framework will be extended later to cover outbred linecross and (human) pedigree data. Epistatic effects (interactions between QTLs) and multiple trait analysis would also be worth considering in the future.
| ACKNOWLEDGMENTS |
|---|
M.S. thanks KARI AURANEN, JUKKA RANTA and DARIO GASBARRA for many useful discussions about Metropolis-Hastings algorithms, and MATTI TASKINEN for helpful hints in the programming and computer work. We are also grateful to OUTI SAVOLAINEN, LEENA PELTONEN, CLAUS VOGL, JANNE PITK¨ANIEMI, PEKKA UIMARI and four anonymous referees for their constructive comments on an earlier version of the paper. This work was supported by a research grant from the Academy of Finland.
Manuscript received April 7, 1997; Accepted for publication November 17, 1997.
| APPENDIX A: ESTIMATION OF MODEL PARAMETERS VIA MARKOV CHAIN MONTE CARLO |
|---|
Markov chain Monte Carlo (MCMC) methods enable one to calculate expectations with respect to the posterior in an approximate manner in situations where computation by traditional methods, because of the high dimension of the parameter space, would be complicated or impossible. In MCMC, each expectation is approximated by a sample average where the sample is drawn by simulation from an ergodic Markov chain constructed in such a way that its limiting distribution coincides with the posterior. The fact that the number of QTLs has not been fixed in advance, and is actually estimated in conjunction with the other model parameters, leads us to consider this problem within the general framework of variable dimensional parameter estimation.
We have preferred to use the Metropolis-Hastings algorithm (M-H) instead of its special case, the Gibbs sampler. The main motivation behind this choice has been that the M-H algorithm is so easy to implement, without needing to work with analytically involved full conditional distributions. An additional bonus of M-H is its greater flexibility and the relative ease by which it can be extended to more complex designs and pedigree structures.
For algorithmic reasons (mixing properties), we do not use any constraints for genotypic coefficients, even if this kind of overparametrization represents an uncommon modeling practice. Genotypic coefficients are here not identifiable as such, but their contrasts can be estimated.
Following ![]()
, the maximal number of QTLs allowed and the ranges of the uniform priors, are all given by the analyst (see simulation analysis). Initial values of the regression parameters are generated to be close to the center of their prior range. Initial values for the QTL genotypes and missing marker genotypes are generated from their priors. Figure 1, showing the hierarchical structure of the model, may help to understand the relationships between the parameters in the following updating scheme.
Step 1:
We consider here three different move types: (1.1) modify the location(s) and configuration(s) of existing QTL(s), (1.2) add one QTL to the model, and (1.3) delete one QTL from the model. These move types have proposal probabilities pm, pa, and pd, respectively, such that pa = 1{Nqtl<Nqtlmax}c, pd = 1{Nqtl>0}c , and pm = 1 - pa - pd . Here c is a given positive constant in [0,
]. In step 1.1 we do not fix the order of QTLs unlike in ![]()
Step 1.1:
N(t)qtl: = N(t-1)qtl . The following cycle is repeated for each QTL, q = 1, ..., N(t)qtl : A new proposal for a QTL location, lnewq , is sampled from a symmetric uniform density around the previous value. The proposal is accepted with probability
When nothing else is changed, the likelihood remains unchanged even if the proposal is accepted and therefore these two likelihoods cancel from the acceptance probability expression. If a proposal is accepted then l(t)q = lnewq , and otherwise l(t)q = l(t-1)q . New QTL genotypes are proposed separately for each individual so that xnewqi is sampled from p(xqi | G*(t-1)qi,l, G*(t-1)qi,r, rq) (constructed as show in APPENDIX B). Individual proposals are accepted with probabilities {1,
}, where the likelihoods Li1 = p(yi |
(t-1), xnewqi,
(t-1)-qi, l(t), X*(t-1)i,o, N(t)qtl, m) and Li2 = p(yi |
(t-1), x(t-1)qi,
(t-1)-qi, l(t), X*(t-1)i,o, N(t)qtl, m) for individual i are evaluated at the new and the old QTL genotypes, respectively. If the proposal for individual i is accepted then x(t)qi = x(new)qi , and otherwise x(t)qi = x(t-1)qi . Here
(t-1)-qi includes all except the qth QTL genotypes in individual i in round t for QTLs having indices lower than q and in round t - 1 for higher indices.
Step 1.2:
Add one new QTL. A proposal (N(t)qtl = N(t-1)qtl + 1) for the number of QTLs is made. The new QTL location, l(N(t-1)qtl+1) , is proposed from the uniform density on I. The QTL genotypes at location l(N(t-1)qtl+1) are proposed from p(x(N(t-1)qtl+1)|G*(t-1),x(t-1)1,., x(t-1)Nqtl, l(t-1), l(N(t-1)qtl+1),m) . The regression coefficients of the new QTL genotypes are drawn from their priors. The proposal is accepted with probability
(N(t-1)qtl+1),m) and L2 = p(y |
(N(t-1))qtl, m) are evaluated at the new and the old parameter value, respectively. If the proposal is accepted, then N(t)qtl = N(t-1)qtl + 1 , and the new QTL location, the corresponding genotypes and the regression coefficients are all accepted simultaneously; otherwise N(t)qtl = N(t-1)qtl . The term
in the Hastings ratio comes from the product
The term in the denominator is squared only because we are not fixing the order of the QTLs (p. 23 in ![]()
Step 1.3:
Delete one QTL. The proposal is made that the number of QTLs is decreased from N(t-1)qtl by one, each of the deletions being equally likely. The deletion is accepted with probability
(N(t-1)qtl+1),m) and L2 = p(y|
(N(t-1))qtl,m) are evaluated at the new and the old parameter value, respectively. If the proposal is accepted, then N(t)qtl = N(t-1)qtl - 1 , and otherwise N(t)qtl = N(t-1)qtl .
Step 2:
The marker genotype proposals G*new , denoted by G*i,new for individual i, in the chromosome are sampled for all individuals from the distribution, where all the consistent genotypes are considered as equally likely. p(G*new | m) is evaluated for all the individual according to Equation 5. The marker genotype proposals are accepted for each individual i separately with probability
[Note that p(M*1,i) for each individual i in Equation 5 cancels from the Hastings ratio]. If the proposals for individual i are accepted, then G*(t)i = G*i,new , and otherwise G*(t)i = G*(t-1)i .
Step 3:
New proposals for the regression parameters are sampled from the symmetric uniform densities around their previous values (random walk). Denoting the likelihoods L1 = p(y|
new,
(t),l(t),X*(t-1)o,m,N(t)qtl) and L2 = p(y|
(t-1),
(t),l(t), X*(t-1)o, m,N(t)qtl) , the proposals (in the prior range) are accepted simultaneously with probability min{1,L1/L2}. If new regression parameter values are accepted, then
(t) =
new, and otherwise
(t) =
(t-1).
Step 4:
The genotype proposals for the background control markers in other chromosomes are sampled directly from the uniform prior density p(X*o) for all the individuals. All proposals which are consistent with the incomplete observations are considered as equally likely. The proposals are accepted, separately for each individual i, with probability min{1,L1/L2}, where the corresponding likelihoods L1 = p(yi |
(t),
(t)i,l(t),X*newi,o,m,N(t)qtl) and L2 = p(yi|
(t),
(t)i,l(t),X*(t-1)i,o, m,N(t)qtl) are evaluated at the new and the old parameter values, respectively. If the proposals for individual i are accepted, then X*(t)i,o = X*newi,o , and otherwise X*(t)i,o = X(t-1)i,o .
Step 5:
Go back to the start, Step 1, until a prespecified number of cycles has been reached.
| APPENDIX B |
|---|
This appendix contains pseudo code algorithms for calculating the conditional probabilities of different QTL genotypes given the flanking objects (Equation 8) for each individual i in backcross (APPENDIX B1) and F2 intercross (APPENDIX B2) designs. The main idea of the algorithms is described in detail in the backcross case.
B1:
- for i = 1, ..., Nind
- total = 0
- for j = 1, ..., Ngen














I
contains at least one QTL calculated for different areas 
N(k)qtl from a simulation trial of 1,000,000 iterations with complete data set (Chromosome 3).