- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Sen, S.
- Articles by Churchill, G. A.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Sen, S.
- Articles by Churchill, G. A.
A Statistical Framework for Quantitative Trait Mapping
aunak Sena and
Gary A. Churchilla
a The Jackson Laboratory, Bar Harbor, Maine 04609
Corresponding author: Gary A. Churchill, The Jackson Laboratory, 600 Main St., Bar Harbor, ME 04609., garyc{at}jax.org (E-mail)
Communicating editor: S. TAVARÉ
| ABSTRACT |
|---|
We describe a general statistical framework for the genetic analysis of quantitative trait data in inbred line crosses. Our main result is based on the observation that, by conditioning on the unobserved QTL genotypes, the problem can be split into two statistically independent and manageable parts. The first part involves only the relationship between the QTL and the phenotype. The second part involves only the location of the QTL in the genome. We developed a simple Monte Carlo algorithm to implement Bayesian QTL analysis. This algorithm simulates multiple versions of complete genotype information on a genomewide grid of locations using information in the marker genotype data. Weights are assigned to the simulated genotypes to capture information in the phenotype data. The weighted complete genotypes are used to approximate quantities needed for statistical inference of QTL locations and effect sizes. One advantage of this approach is that only the weights are recomputed as the analyst considers different candidate models. This device allows the analyst to focus on modeling and model comparisons. The proposed framework can accommodate multiple interacting QTL, nonnormal and multivariate phenotypes, covariates, missing genotype data, and genotyping errors in any type of inbred line cross. A software tool implementing this procedure is available. We demonstrate our approach to QTL analysis using data from a mouse backcross population that is segregating multiple interacting QTL associated with salt-induced hypertension.
THE problem of identifying the genetic factors underlying complex and quantitative traits has a long history. The idea of using the association between a discrete trait (marker) and a continuously variable phenotype to establish linkage of a quantitative trait locus (QTL) first appeared in the work of ![]()
![]()
![]()
![]()
![]()
Multiple-QTL models are an improvement over single-QTL models because of their ability to separate linked QTL on the same chromosome and to detect interacting QTL that may otherwise be undetected. They provide increased power to detect QTL and can eliminate biases in estimates of effect size and location that can be introduced by using an inappropriate single-QTL model (![]()
![]()
![]()
![]()
![]()
![]()
![]()
For pragmatic reasons, we chose to approach the problem of mapping multiple QTLs from a Bayesian perspective. The Bayesian framework provides a clear picture of the probabilistic structure of the QTL mapping problem. In particular, the treatment of unknown quantities (the QTL locations, the QTL genotypes, and the phenotypic means and variances) is straightforward. However, we depart from a strictly Bayesian analysis at several points as indicated below. Bayesian approaches to linkage analysis and QTL mapping have been described by others (![]()
![]()
![]()
![]()
![]()
![]()
Central to our approach is the observation that, by conditioning on the unobserved QTL genotypes, the problem of QTL mapping can be divided into two simple and statistically independent parts: the genetic model part, which relates the QTL genotypes to the phenotype, and the linkage part, in which the locations of the QTL are determined. This observation is not new (![]()
![]()
In the sections that follow, we outline our approach using a general notation that highlights the probabilistic structure of the QTL mapping problem. This structure is used to devise an efficient computational strategy for Bayesian inference. We point out the relationship of the log posterior distribution of QTL location (LPD) to the LOD score and show how the former leads to a simple method for constructing confidence regions for QTL locations. This is followed by a description of our software tools. After discussing possible variations of the basic theme, we demonstrate our approach with an example of a complete QTL data analysis. Many of the technical details can be found in Appendix AF.
| A FRAMEWORK FOR QTL INFERENCE |
|---|
Heuristic and motivation:
If two inbred strains raised in a common environment show markedly different values for a phenotype, it is implicit that there is a genetic basis for the difference. Association between the genotypes and phenotypes in progeny derived from a cross between such strains will provide information about the genetic basis for the trait. However, there are at least two components of "noise" in the data that can obscure the genetic effects. First is the environmental variation that is inherent in most quantitative phenotypes. Second is the incomplete nature of the genotype information, which can only be observed at the typed markers. Marker genotypes may also contain missing values and errors. Typically, an investigator is interested in knowing how many genes contribute to the trait, where they are located, and how they act. Statistical approaches are needed to address these questions.
Consider the simple case in which the trait is affected by allelic variation at a single gene. In the presence of environmental variation, the phenotype can be viewed as a "noisy" version of a biallelic locus (the QTL) whose position in the genome is unknown. If the genotype of the QTL could be known, the effects of the QTL could be estimated by simply looking at the distribution of phenotypes within the groups of individuals defined by the QTL alleles. Furthermore, the QTL position could be localized by mapping the biallelic QTL relative to the typed markers. Thus the unknown QTL genotypes are key.
If we have complete genotype information on a dense set of markers, a one-dimensional genome scan can be performed by regressing the phenotype on each of the markers. The more a marker explains the phenotypic variance, the more likely it is to be close to a QTL. This belief can be quantified by plotting the LOD score or F-statistic obtained from regressing the phenotype on the marker. In practice the markers may be widely spaced across the genome and some marker genotypes will be missing or in error. Still we can imagine having access to a dense set of completely genotyped markers; we call them pseudomarkers. Using linkage information in the available marker data we can infer the genotypes of the pseudomarkers. There will be some uncertainty regarding the pseudomarker genotypes so we may decide to construct several, say 10, versions of this ideal genotype data each of which is consistent with the observed marker genotypes. Variation in the imputed genotypes reflects the uncertainty in our knowledge of the true complete genotypes. If the typed markers are dense, the variation will be negligible. As in the case when we had complete genotype information, we can now regress the phenotype on each of the pseudomarkers and repeat this process for each of the imputed versions. We now have not 1, but 10 sets of LOD scores. The strength of evidence in favor of a QTL being near any given pseudomarker can be quantified by averaging the 10 LOD scores. For technical reasons explained below, the average is not an arithmetic mean of the LOD scores.
This approach extends to the simultaneous mapping of multiple QTL. Suppose that there are two QTL influencing the trait. Using the same pseudomarkers, LOD scores are computed for each pair of pseudomarkers for a two-dimensional genome scan. Interactions between QTL can be accommodated. We implemented pairwise and triplet genome scans and illustrate their application below. Higher-order genome scans could become computationally prohibitive. We argue below that the pairwise scans should be sufficient for the analysis of data with many segregating QTL provided that interactions are limited to pairwise effects and that there are no groups of three or more tightly linked QTL. When these conditions are met, the results of single and pairwise genome scans can be combined into a single statistical model that describes the simultaneous effects of multiple QTL on a quantitative phenotype. The last step represents a compromise that is necessary because a full search for four or more QTL is computationally impractical. The best approach to search for multiple QTL models remains an open problem.
Data structures and notation:
Suppose that we have data on n animals or plants derived from an inbred line cross. Denote the quantitative trait measurements by y = (y1, y2, ... , yn)', and denote the genotyping data by the n x k matrix m = (mij), where the rows correspond to individuals and the columns correspond to markers. The quantities y and m are the observed data. Assume that the chromosome of origin, order, and genetic distance between the markers is known. In practice, these quantities may have to be estimated.
The genetic model, denoted by H, is a description of the distribution of phenotypes given the QTL genotypes. If there are p contributing QTL and the trait values are normally distributed within the QTL genotype classes, a general linear model may be used to describe the relationship of the phenotype to the QTL genotypes. The parameters of the genetic model are denoted by µ. The locations of the QTL are denoted by the p-dimensional vector
. The QTL genotypes are denoted by the n x p matrix g = (gij). The rows of g correspond to individuals and columns correspond to the loci. The quantities µ,
, and g are the unobserved data. Note that the meaning and dimensionality of the unobserved data structures depend on the genetic model, H.
Theory:
Our goal is to make inferences about the genetic model parameters (µ) and the QTL locations (
) given the observed data. We use Bayesian statistical theory because it provides a convenient and mathematically consistent method for describing uncertainties in the form of posterior distributions. As outlined above, we combine information from the marker genotypes and the phenotypes to reconstruct the unknown QTL genotypes. Multiple imputed versions of the QTL genotypes are then used to compute approximations to the posterior densities of interest p(
|y, m) and p(µ|y, m). Imputed genotypes can also be used to compute the marginal probability of the data pH(m, y), which is useful for making model comparisons.
To develop our arguments we begin by looking at the joint distribution of all of the observed and unobserved data. Under the assumption of no ascertainment, the joint distribution can be factorized as
![]() |
(1) |
A proof is provided in APPENDIX A. This factorization implies that, conditional on the QTL genotypes, g, the genetic model part of the problem involving (y, µ) can be solved independently from the linkage part of the problem involving (m,
). Fig 1 represents this conditional independence and highlights the central role of the unobserved QTL genotypes.
|
This decomposition of the problem into two parts conditional on the unobserved QTL genotypes suggests that we should begin by obtaining the posterior distribution of the QTL genotypes. In Appendix B we show that the posterior distribution of the QTL genotypes after integrating out the parameters µ and
can be expressed as
![]() |
(2) |
The first term indicates how compatible a phenotype is with the QTL genotypes. The second term measures the compatibility of the QTL genotypes with the observed marker data.
Sampling QTL genotypes:
Expression (2) suggests an efficient computational approach for simulating from the posterior distribution of the QTL genotypes. We can first simulate samples from p(g|m) and then weight each sampled genotype by p(y|g). The idea is that the genotypes that are most compatible with the observed marker data are most likely to turn up in the simulation from p(g|m). Among those genotypes, the ones that are most compatible with the phenotypes will get the largest weights. Details are provided in Appendix B.
We want to consider models with multiple QTL, including cases in which there is linkage and/or interaction among the QTL. If there are p QTL in the model, the location
will have p components and the QTL genotypes will constitute an n x p matrix. Furthermore, the locations are not known a priori and thus we want to scan through all possible locations to search for QTL. These considerations suggest that we should simulate genotypes at all possible locations in the genome from their joint distribution given the marker data. In practice we generate genotypes on a discrete grid of locations spanning the genome, which we refer to as the pseudomarker grid. For a given p-tuple of pseudomarker locations u = (u1, ... , up), the ith realization of genotypes is an n x p matrix denoted as ri(u).
A weighted sample of QTL genotypes is generated by the following steps:
- Select a regularly spaced grid G of pseudomarker locations and create q realizations of the pseudomarkers by sampling from the distribution p(g|m,
= G). The notation
= G is used to indicate that the entire grid of pseudomarkers is simulated as a joint distribution. We assume that there is no crossover interference and that genetic distances between the markers are known. With these assumptions, a simple Markov chain sampling scheme can be used to generate the pseudomarker genotypes (LANDER and GREEN 1987 ).
- For each p-tuple of locations u in each pseudomarker realization (i = 1, ... , q), calculate the weight under the assumed genetic model H:

(3) If the prior on the QTL locations is uniform, the term p(
= u) is constant for all locations u and can be ignored.
Note that the weight function is model dependent whereas the pseudomarker generation is not. This is convenient for exploring the space of models as it reduces the amount of computation required when a new model is considered. For normally distributed data under the assumption that the prior distribution on the genetic model parameters is of the order of one observation, the weights are approximately proportional to n-v/2RSS-n/2, where n is the sample size, v is the model dimension, and RSS is the residual sum of squares obtained by regressing phenotypes on the QTL genotypes. Genotypes that explain more of the variation in the phenotype get a bigger weight. Additionally, the dimension of the genetic model is penalized. This penalty becomes important when different genetic models are compared. A derivation of normal model weights is provided in Appendix C.
Estimating QTL locations:
In Appendix B we show that the posterior distribution of the QTL location is proportional to the average weight of all pseudomarker realizations at that location
![]() |
(4) |
In interval mapping, inference about QTL locations is based on a likelihood-ratio test, which when expressed on the scale of base 10 logarithm, is called the LOD score. The LOD score at location
can be shown to be equal to
![]() |
(5) |
The logarithm (base 10) of the posterior distribution of the QTL locations (LPD) is
![]() |
(6) |
We expressed the LOD score and LPD in this form to illustrate their similarity. Details are provided in Appendix D. By comparing (5) and (6) we see that the LOD score takes a maximum over the genetic model parameters whereas the LPD carries out an averaging operation. In most situations when a uniform prior on the QTL locations is used, LOD(
) and LPD(
) will be approximately equal to each other up to an additive constant. In Fig 2, on the basis of the hypertension data discussed below, we compare the LOD score, the LPD, and the ![]()
|
The LPD raised to the power of 10 is the posterior density of the QTL location and hence can be used to construct confidence intervals (![]()
![]()
|m, y) is a discrete probability distribution. This is a reasonable approximation to a continuous distribution of locations provided that the grid density exceeds our ability to resolve the QTL locations. In practice even a very coarse grid (10 cM) is quite effective. A denser grid (2 cM) is preferable for localization of QTL once they have been assigned to a chromosome. The discrete nature of the grid makes the computation of a highest posterior density (HPD) region straightforward. For example, on a chromosome with two QTL, the weights for each pair of pseudomarkers can be normalized and ranked from highest to lowest. A 1 -
HPD region is constructed by including the pairs with highest weights in the set until the sum of the weights first exceeds 1 -
. Bayesian confidence intervals based on the LPD have the desired long-run frequency coverage in large samples. See ![]()
Substantive prior information can be incorporated into the LPD through the additive term log10p(
). This is analogous to the process of accumulating evidence by adding LOD scores (![]()
Estimating QTL model parameters:
In Appendix B we show that the posterior distribution of the model parameters can be expressed as
![]() |
(7) |
where the summation on u is over all p-tuples of pseudomarker locations. The first term in the summation is the "complete data" posterior distribution of the model parameters given the phenotypes and the QTL genotypes. The second is the weight of the QTL genotypes. Thus (7) is a weighted mixture of complete data posterior densities. Posterior means and variances of the model parameters, E(µ|y, m) and V(µ|y, m), are computed by the method of iterated expectation as detailed in Appendix B.
We depart from a strictly Bayesian approach here. Suppose that we are entertaining a model with six QTL. The summation over u in (7) would range over all sextuples of pseudomarkers in the genome and could be prohibitive to compute. In practice we take advantage of the partitioning of the genome into chromosomes. We estimate the model parameters one (or two) chromosomes(s) at a time and restrict the summation over u just to the chromosome(s) of interest. For example, if a chromosome is assumed to contain two QTL, we use a summation over all pairs of pseudomarkers on the chromosome to estimate the effects associated with the two QTL. If two unlinked QTL have an interaction term in the model, the two must be considered simultaneously and the summation will run over all pairs on the two chromosomes of interest. By estimating the parameters associated with small subsets (of size one, two, or three) separately we can significantly reduce the amount of computation with negligible effects on the results. When we are estimating effects of one set of QTL, the other QTL may be represented by including marker genotypes as covariates in the regression (see section on covariates below) as in CIM and MQM. Alternatively, we can simply ignore the other QTL. In practice both approaches yield essentially identical point estimates. This is a consequence of the independent assortment of chromosomes, which results in approximate orthogonality between unlinked locations in the genome. The standard errors will generally be smaller when conditioning.
Model scanning and model selection:
In practice, the genetic model H, which includes the number of QTL and their interactions, is not known and has to be chosen on the basis of the data. The problem of selecting an appropriate model is challenging and we cannot offer a complete formal solution. The model selection problem is fundamental to multiple QTL analysis. ![]()
![]()
. ![]()
The Bayes factor (![]()
![]() |
(8) |
The marginal probability of the data under a model H is approximately equal to the average of all the pseudomarker weights

(see Appendix E), where s is the number of p-tuples of pseudomarker locations. In practice we limited the application of Bayes factors to making decisions about subsets of the QTL in a model. For example, we may compare a two-QTL model on a given chromosome to a single-QTL model on that chromosome. We can compare a two-QTL model with an interaction term to a two-QTL model with only additive effects. In these cases, the summation on u can be restricted to the chromosome of interest. Bayes factors can present computational difficulties (![]()
Our data analysis consists of a model scanning step followed by model selection. It represents a departure from the Bayesian approach to model selection. We carry out single and pairwise genome scans and select only those regions (or pairs) that exceed stringent permutation testing thresholds (![]()
![]()
Permutation testing for the pairwise genome scan requires a bit of explanation. We are seeking pairs of loci that together contribute significantly to the observed phenotype distribution. Thus we base our test on a comparison of the full model, including an interaction effect, to a null model with no genetic effects. The significance of this overall test is assessed by permutation analysis. If a pair is found to be significant it is necessary to make some secondary checks to ensure that the pair is actually representing two QTL. First one examines the size of the interaction by comparing the full model to an additive model (with no interaction effects). If there is a significant interaction, the pair represents two interacting QTL. If the interaction is not significant, each member of the pair should be individually significant. In this case the pair represents two additive QTL. In ![]()
Stepwise procedures for model selection using the BIC
of ![]()
![]()
Prior distributions:
All Bayesian analyses depend on prior distributions. The influence of the prior distribution decays with increasing sample size and, for most problems, vanishes asymptotically. For sample sizes that are typical in most QTL studies (50250 individuals), the prior distribution on the model parameters is not likely to have a large effect on the posterior distributions. However, it has a more tangible impact on the Bayes factors used for model selection. For example, the Bayes factors are not well defined if (improper) reference priors are used for the genetic model parameters.
In our analyses we used proper priors whose weight is approximately equal to that of one observation. This assumption leads to the penalty term of n-v/2 in the weight function, where v is the dimension of the genetic model (see Appendix C). Using proper priors also helps stabilize numerical computations when the phenotypes are not normally distributed.
Implementation:
We implemented a basic set of computational tools in the MATLAB computing environment (The Mathworks, http://www.mathworks.com). We chose MATLAB for prototyping convenience, but any computing environment with tools for regression analysis and programming would suffice. All MATLAB functions used for this article are available at http://www.jax.org/research/churchill/ under software. The MATLAB environment provides a variety of functions that can be used for preliminary analysis and manipulation of the data.
Pseudomarker generation is implemented in the function IMPUTE, which takes marker genotypes and map positions as input and generates a three-dimensional array of imputed genotypes (the first dimension is individuals, the second dimension is pseudomarker positions, and the third dimension is replications). This array is used repeatedly in subsequent analysis steps.
Weights for imputed QTL genotypes are computed by genome scan functions. These functions can be applied to the whole genome or restricted to chromosomes of interest. A one-dimensional scan is performed using the function MAINSCAN, which produces a LPD profile assuming a single-QTL model at each pseudomarker location. MAINSCAN will produce essentially identical results to a Mapmaker/QTL analysis. A two-dimensional scan can be carried out using PAIRSCAN. This function assumes a two-QTL model and computes the weights both with and without an interaction effect. It scans through all pairs of marker loci and produces a two-dimensional LPD profile. The functions PLOTMAINSCAN and PLOTPAIRSCAN are used to plot the results from the scans. Traditionally, scanning functions have plotted the LOD score. Our functions plot the proportion of variance explained. This is approximately a linear multiple of the LOD score (![]()
![]()
After the genome scans have been carried out we can obtain estimates of the QTL model parameters. For a QTL that is not linked to or interacting with any other QTL, the function ONEESTIMATE computes estimates of the posterior mean and standard error of the effect size. This function, applied on one chromosome at a time, provides an estimate of the effect size of a QTL on that chromosome. For linked and/or interacting QTL, we provide functions TWOESTIMATE and THREEESTIMATE. Scanning and estimation functions for nonnormally distributed phenotypes have been written and are not in the testing phase.
Bayes factors for model comparisons are obtained from the output of the scan functions since the marginal distribution of the data under a genetic model is the average of the pseudomarker weights. Bayes factors are calculated on selected genomic regions of interest. For example, to compare a single-QTL model on chromosome 1 to a two-QTL additive model on chromosome 1, we will compare the average pseudomarker weight on chromosome 1 (obtained from MAINSCAN) to the average pseudomarker weight on chromosome 1 for an additive model (obtained from PAIRSCAN).
Model selection is carried out using the permutation tests on one-dimensional and two-dimensional scans followed by secondary tests. Loci can be "flagged" by the functions FLAG and FLAG2. The former uses Bayes factors for secondary tests while the latter uses likelihood ratios.
For localization of a QTL or a pair of QTL, we construct a dense pseudomarker grid on the chromosome and repeat the imputation and scanning steps on that chromosome. Then we plot the results using LOCALIZE or PAIRLOCALIZE to plot the posterior distribution of the QTL or QTL pair.
We are currently constructing examples, including analysis scripts that illustrate the steps involved in applying these software modules. Results will be posted on our web site. In our experience we find that each QTL data set is unique and requires a tailored analysis. Thus we prefer an interactive software environment that allows the analyst to work with the data.
| EXTENSIONS |
|---|
We consider two general classes of extensions to the basic framework, those that alter the genetic model and those that alter the linkage model. Changes to the genetic model can be implemented by programming a new weight function. Changes to the linkage model affect only the code that simulates the pseudomarker genotypes. Modularity in software design as well as in data analysis that the framework provides is an important advantage.
Extensions of the genetic model:
For normally distributed phenotypes, the weights are based on a normal regression model. More general distributions can be accommodated by calculating the weights assuming a generalized linear model (![]()
![]()
![]()
In general the weight function for any model is
![]() |
(9) |
For generalized linear models, under the assumption that the prior distribution on the genetic model parameters is of the order of a single observation, this works out to be approximately

where dev
-2 log(p(y|g, µ =
)) - log(p(µ =
)) is the observed unscaled deviance (![]()
is the posterior mean of µ, and v is the dimension of the genetic model. If the phenotype data follow a Poisson distribution, the weight function is

where k is the number of genotype classes, ni is the number of observations in the ith class, and yi. is the sum of the observations in the ith class. For binomial data, the weight function will be

where mi. is the total of the size parameters of the binomial observations in each group and yi. is the sum of the observations in the ith class.
Many complex trait studies involve measurement of multiple related phenotypes. If two phenotypic measurements are affected by the same set of genes, then it can be more efficient to consider a multivariate analysis (![]()
![]()
![]()

where S is the residual covariance matrix of the multivariate ANOVA and v is the model dimension (Appendix C). Models in multivariate space are more complex and some of the nice interpretations that apply to univariate normal phenotypes do not carry over. Although multivariate phenotyping comes with the promise of greater power to detect QTL, there are some costs. Unless two phenotypes are affected by the same biochemical pathway, adding a phenotype into the analysis may add genes and interactions to the list of genes affecting the (multivariate) phenotype. This may complicate the analysis of complex traits where a large number of genes are known to be affecting the trait of interest. We recommend that multivariate trait analysis be used with caution.
An interesting mutlivariate data structure is presented by ![]()
![]()
![]() |
(10) |
where WBH denotes the weight function corresponding to a binomial distribution and WNH to that of a normal distribution.
In QTL experiments covariates are often collected in addition to the phenotypes of interest. The logic behind measuring covariates is to measure and adjust for environmental factors that influence phenotypic expression. These might be blocking factors in an agricultural field trial or cage number in a mouse cross. The covariate may also be another phenotype, in which case some care must be exercised in the interpretation of results. We assume that there is no direct association between the QTL that affect the trait of interest and the covariate. If this is not the case, the phenotype and the covariate should be treated as a multivariate phenotype. We denote the covariate by x and any unknown parameters governing the distribution of x are denoted by
. It can be shown by calculations similar to that in Appendix A that the appropriate weight function is
![]() |
(11) |
The practical implication of (11) is that the weights are now based on a regression analysis of the phenotype on the covariates and the QTL genotypes.
The use of marker loci as covariates in QTL analysis was suggested by ![]()
![]()
Extensions of the linkage model:
Missing marker data are common in QTL experiments, sometimes due to difficulties with typing but also as a result of selective genotyping. Following ![]()
![]()
Let R denote the missing marker data pattern, and let mobs and mmis denote the observed and missing marker data, respectively. It can be shown that the form of the posterior distributions of the QTL locations
, the genetic model parameters µ, and the QTL genotypes g given the observed data R, y, and mobs is unchanged. The marginal distribution of the observed data under the genetic model H is
![]() |
(12) |
Thus to compute the Bayes factor for comparing different models, we only need to calculate pH(y|mobs), other terms being independent of the model.
Selective genotyping is a practical device for reducing the cost of QTL mapping experiments (![]()
Genotyping errors were considered by ![]()
Backcross and intercross designs have been widely utilized in quantitative trait mapping studies. However, there are alternative approaches and there is interest in developing new cross designs that might improve the resolution of QTL mapping studies. Designs that utilize recombinant inbred lines, congenic or isogenic lines, repeated backcrossing, and advanced intercross lines are some examples (![]()
Finally, we note that it is possible to incorporate models of crossover interference into the pseudomarker simulation module. This may substantially increase the computational burden because the conditional independence assumptions that simplify the computations under a no-interference model cannot be used (![]()
| EXAMPLE |
|---|
We illustrate the application of our approach by a reanalysis of a hypertension cross described in ![]()
101.6 mm of Hg and the standard deviation of the blood pressure was
8.4 mm of Hg. A total of 174 markers were typed. Initially only the extremes of the backcross population were genotyped in a standard selective genotyping design. Following an initial analysis of the data, additional genotyping was carried out in regions of interest on all mice.
|
We used a 10-cM pseudomarker grid with 16 imputations for the initial scanning step. The model refinement and localization were done using a 2-cM pseudomarker grid with 256 imputations. The initial genome scan using MAINSCAN revealed two significant QTL on chromosomes 1 and 4 and a suggestive peak on chromosome 15. In Fig 4 the proportion of the variance explained is graphed by location on the genome. Also shown is the 5% cutoff based on a permutation test. QTL on chromosomes 1 and 4 explain
6 and 13.5% of the variance, respectively. The next biggest QTL is on chromosome 15, which explains
5.5% of the variance. The Bayes factor estimates for comparing the null model of no genetic effects to a single-QTL model are <1 for all chromosomes except chromosomes 1, 4, and 15 (the respective values are 37.3, 1.1 x 105, and 1.7). Fig 5 shows the result of a two-dimensional scan using two-QTL models fitted using PAIRSCAN. We find that the QTL on chromosomes 1 and 4 together explain
22% of the variance. There is evidence for interaction between loci on chromosomes 6 and 15. The Bayes factor for interaction vs. no interaction is 20.4. The interaction explains
6% of the variance above an additive model. There is also evidence for interacting QTL on chromosomes 7 and 15, which explains
5% of the variance over an additive model (the Bayes factor is 11.4). The Bayes factor indicates that the evidence for the 6 x 15 interaction is stronger than that of the 7 x 15 interaction even though the size of the estimated effects is about the same (see Fig 8). A look at the localization plots (see Fig 7) reveals that the localization information is stronger with the 6 x 15 interaction than with the 7 x 15 interaction. The combined results of these scans suggest that we should look closely at chromosomes 1 and 4 to determine if there may be multiple QTL on either of these chromosomes and that we should examine the simultaneous effects of loci on chromosomes 6, 7, and 15 to sort out the nature and extent of the interactions.
|
|
|
|
|
We carried out a TRIPLESCAN restricted to chromosomes 6, 7, and 15. There is no evidence for a three-way interaction among these markers. The Bayes factor for the three-way interaction model vs. a model with all three two-way interactions is 0.2. There is also no evidence for an interaction between chromosomes 6 and 7. Thus we conclude that the two pairwise interactions detected using PAIRSCAN are sufficient to describe the joint effect of these loci. The 6 x 15 interaction was reported by ![]()
Now we turn our attention to chromosome 1, which showed some evidence for two QTL on the basis of the output from MAINSCAN and PAIRSCAN. The Bayes factor comparing a two-locus additive model on chromosome 1 to a single-locus model on chromosome 1 is 0.41. This is inconclusive evidence but it favors the single-QTL model. The localization plot assuming a two-QTL model on chromosome 1 seems to indicate two QTL, but not overwhelmingly. Additional experimentation may be needed to provide stronger evidence for the presence of two QTL on chromosome 1. We note that ![]()
The final model that we arrived at through this analysis includes five loci on chromosomes 1, 4, 6, 7, and 15. The localization information for these loci is given in Fig 7. The main effects for QTL explain, respectively, 6, 13.5, 3, 0, and 5.5% of the variance. There are two two-way interactions in the model. The interaction between QTL on chromosomes 6 and 15 explains
6% and the one between QTL on chromosomes 7 and 15 explains
6.5% of the variance. Taken together the model explains
41% of the variance. The effect sizes are summarized in Fig 8.
| DISCUSSION |
|---|
We analyzed the stochastic dependencies between the different observed and unobserved data structures in the QTL mapping problem. By conditioning on the unobserved QTL genotypes it is possible to decompose the QTL mapping problem into the linkage part and the genetic model part. This decomposition is used to develop a computational strategy that uses multiple imputations of a pseudomarker grid to approximate integrals needed to perform Bayesian inference. Our software provides a set of flexible extensible tools suitable for the analysis of QTL data.
The problem of model selection remains a thorny issue in theoretical statistics and presents a serious challenge in the analysis of QTL data. For this reason we emphasized exploratory tools over formal model selection procedures in our analysis. We provide a visual representation of two-QTL models and two-way interactions in the QTL mapping problem.
Most of the QTL analysis methods proposed so far have relied on linear (or generalized linear) regression models. Recently, classification and regression trees (CART) models were proposed for QTL data (T. SPEED, personal communication) because they have a richer interaction space. We note here that by using appropriate weight functions, WH(y, g), where the model, H, can be a regression tree, we can include CART in our framework.
Multiple-QTL models were considered by ![]()
![]()
![]()
We have already mentioned the close connection between the LOD score and the LPD. Notwithstanding their similarities, the two can be numerically divergent. There is an important conceptual difference between the LOD score and the LPD. The LOD score is designed to be a test for linkage wherein the location of the QTL (
) is a nuisance parameter and µ is the parameter of interest. Indeed, the maximum of the LOD score is used as the test statistic for linkage. The LPD is designed with the reciprocal purpose of localization where the location of the QTL (
) is the parameter of interest. The parameter µ is the nuisance parameter and is integrated out.
The Monte Carlo error in approximating the posterior distributions by the pseudomarker algorithm is inversely proportional to the square root of the number of imputations. Unlike MCMC methods, we do not have to worry about proper mixing of the Markov chains. Accuracy of the calculations also depends on the density of the pseudomarker grid. For initial genome scans we find that 20 pseudomarker sets at a spacing of
10 cM are adequate. For fine mapping, we recommend a 2-cM pseudomarker map on selected chromosomes with
200 replicated data sets. Adjustment for individual situations is quite easy. More replications may be required when the proportion of missing genotypes is high. The investigator will have to vary the density of the pseudomarker map and the number of replicated data sets according to his/her needs. The best way to know what is reasonable for the map is to start with a sparse pseudomarker set and repeat the analysis. If the pictures differ markedly, more replications are needed. To control Monte Carlo fluctuations we also adopted some additional devices explained in greater detail in APPENDIX F. Another option to controlling Monte Carlo noise (which we have not currently implemented) would be to use smoothing techniques on the LPD since it is approximately piecewise quadratic (![]()
We presented a QTL mapping framework in the context of analyzing data from inbred line crosses. However, the idea of using multiple imputation of complete genotype information is applicable to more general gene-mapping situations, including complex human pedigrees. How it performs in practice requires further investigation.
Finally, it is pertinent to point out some of the inherent limitations of QTL mapping studies. The number of QTL that we can reliably detect in a cross depends on the number of individuals available, the numbers of segregating QTL, and the strength of their effects. If there are interactions between QTL that are ignored, then we may fail to find those QTL. For example, with 100 progeny from a backcross, we can expect to observe
25 individuals in each genotype class for two unlinked QTL. With three unlinked QTL, the number is
12. If the three QTL are linked, some genotype classes may be unrepresented in the data or may have very small numbers. Thus, it may be impossible to detect and characterize interactions of order 3 or higher and genes that act through high-order interactions may not be detected. It has been shown that model misspecification can lead to misleading results (![]()
| ACKNOWLEDGMENTS |
|---|
We thank Drs. B. Paigen, F. Sugiyama, and K. Broman for sharing data with us. This work has benefited enormously from the critical comments of Dr. K. Broman and two anonymous referees. The American Heart Association provided financial support through a grant to G.A.C.
Manuscript received July 29, 2000; Accepted for publication June 4, 2001.
| APPENDIX A |
|---|
FACTORIZATION OF THE JOINT DISTRIBUTION
The joint distribution of the observed and missing data structures is
![]() |
(A1) |
The first equality is an application of the definition of conditional probability and the second follows from the assumption that the phenotype depends on the location of the gene only through the genotype of the QTL and the genetic model. The third equality follows from the assumption that the distribution of the QTL genotype depends on the genetic model only through the position of the QTL relative to the markers. In the final equality we assume that the distribution of the marker genotypes is independent of the location of the QTL and the disease model and regroup the terms. This final assumption will be true if there is no ascertainment. If individuals have been selected based on their phenotype and the phenotypes of nonselected individuals are not included in the analysis this result will not hold.
| APPENDIX B |
|---|
POSTERIOR DISTRIBUTIONS
An expression for the distribution of the QTL genotypes given the observed data can be derived from the full joint distribution using elementary operations as
![]() |
(B1) |
The expression (B1) suggests that we obtain a representation of the posterior distribution of g by sampling from p(g|m,
) and then weighting the samples by W(g) = p(y|g). The weights are importance sampling weights and the idea is that p(g|m) has a flatter distribution than p(g|m, y) and therefore is a good proposal distribution to use. Summing the weights is a numerical approximation to integration with respect to p(
)d
. It is possible to use an irregularly spaced list of
values. In that case we will have to weigh each point suitably as in a numerical trapezoidal integration formula. A numerical integration over the QTL location space is reasonable because the shape of the log-likelihood in large samples is piecewise quadratic and hence smooth (![]()
The posterior distribution of the QTL locations is




















