Abstract
We describe a fast hierarchical Bayesian method for mapping quantitative trait loci by haplotype-based association, applicable when haplotypes are not observed directly but are inferred from multiple marker genotypes. The method avoids the use of a Monte Carlo Markov chain by employing priors for which the likelihood factorizes completely. It is parameterized by a single hyperparameter, the fraction of variance explained by the quantitative trait locus, compared to the frequentist fixed-effects model, which requires a parameter for the phenotypic effect of each combination of haplotypes; nevertheless it still provides estimates of haplotype effects. We use simulation to show that the method matches the power of the frequentist regression model and, when the haplotypes are inferred, exceeds it for small QTL effect sizes. The Bayesian estimates of the haplotype effects are more accurate than the frequentist estimates, for both known and inferred haplotypes, which indicates that this advantage is independent of the effect of uncertainty in haplotype inference and will hold in comparison with frequentist methods in general. We apply the method to data from a panel of recombinant inbred lines of Arabidopsis thaliana, descended from 19 inbred founders.
AS the power of haplotypic association has become better appreciated, studies using inferred multiallelic loci (i.e., haplotypes or pairs of haplotypes) are becoming more common. This is because single-nucleotide polymorphisms (SNPs), which are the most commonly used type of marker, are very susceptible to a loss of power to detect QTL, due to a mismatch in allele frequencies between the SNP and the causative variant. While multiallelic markers contain more information and have greater power than SNPs for QTL mapping, they are more costly and cumbersome. Consequently a major analytical advance has been the combination of multiple SNP marker information, either to infer haplotypes as in many human association studies or to infer the mosaic of ancestral founder haplotypes in synthetic populations descended from multiple founder strains. The latter scenario includes crosses between inbred strains of mice or rats or inbred accessions of plants.
However, there are two potential difficulties with haplotypic association. First, in a fixed-effects framework, a parameter is estimated for each haplotype, which is undesirable when the number of haplotypes is large. In a synthetic population descended from N inbred strains, up to N haplotypes may segregate; for the mouse collaborative cross (Threadgill et al. 2002) N = 8 and for the Arabidopsis thaliana multiparent advanced generation intercross (MAGIC) population of recombinant inbred lines, N = 19 (Kover et al. 2009). For complex traits, where many QTL are expected to segregate, multiple QTL mapping only exacerbates problems with the numbers of parameters.
Second, one must account for uncertainty in the inference of haplotypes, which depends on the marker density and how well one can distinguish between all founders at a locus. At some loci the founders' haplotypes may be identical, for example, in crosses descended from inbred strains of mice.
These problems are well known in haplotype association mapping involving human populations, where in general fixed-effects regression modeling is used. Consequently methods have been developed to reduce the number of haplotype groups at a marker locus, using hierarchical clustering and Bayesian partitioning algorithms (Molitor et al. 2003; Durrant et al. 2004; Bardel et al. 2006; Morris 2006; Tzeng et al. 2006; Waldron et al. 2006; Igo et al. 2007; Liu et al. 2007; Tachmazidou et al. 2007; Knight et al. 2008).
Bayesian methods are increasingly the approach of choice for QTL mapping, particularly for multiple QTL mapping and the modeling of interactions (Yi and Shriner 2008). The hierarchical Bayesian framework can accommodate more complicated models with more parameters, even when there are many more parameters than observations (Meuwissen et al. 2001; Xu 2003). The Bayesian approach has an additional advantage when the inferred haplotypes are not all identifiable. Reliable estimates of haplotype effects can be determined because the shrinkage effect of the prior distribution restricts the posterior. However, these methods must be fast if these complex analyses are to be practical.
In a hierarchical model the key problem is how to model the distribution of the variance attributable to a QTL and its prior. Meuwissen et al. (2001) consider a hierarchical Bayesian random-effects model (HBREM) for observed multiallelic marker loci. They choose normal priors centered at zero for the individual genotype effects, with different variances for each locus. The prior distributions for the variance parameters are scaled inverse chi square, with parameters chosen to give the mean and variance preestimated from the data. However, this prior has a tiny probability of a QTL effect being equal to zero, whereas that is clearly very likely in a genome scan. Hence they also showed an alternative prior, a mixture of a point mass at zero and a scaled inverse chi-square distribution, which gave better results.
Xu (2003) considers a noninformative Jeffrey's prior on the locus variance. The model fits all markers simultaneously and can detect large-effect QTL with little noise at other markers, despite the negligible probability of zero locus variance. However, the model is limited to markers with two or three possible genotypes. Wang et al. (2005) extend this approach to inferred genotypes, but still with only two or three possible genotypes per locus and the method is very computationally intensive.
Yi and Xu (2008) argue that the noninformative Jeffrey's prior on the locus variance induces constant shrinkage on the haplotype effects and that it would be preferable to vary shrinkage according to the data. They compare exponential and scaled inverse chi-square priors on the locus variance, using hyperparameters with vague hyperpriors. They also consider a second prior on the haplotype effects (first proposed by Park and Casella 2008), of a normal distribution with variance proportional to the residual error variance. The four models performed equally when tested on populations with only two genotypes segregating at a locus.
There are several frequentist approaches to dealing with haplotype uncertainty in QTL mapping. One is to perform a fixed-effects multiple linear regression or generalized linear regression of the phenotype, treating the haplotype probabilities at the locus as the design matrix (Haley and Knott 1992; Mott et al. 2000). Another is to use multiple imputation to draw samples of haplotypes from the haplotype probabilities (Sen and Churchill 2001). A third is to use the EM algorithm to estimate the haplotypes (Excoffier and Slatkin 1995; Hawley and Kidd 1995; Long et al. 1995; Qin et al. 2002; Lin et al. 2005, 2008; Lin and Zeng 2006; Zeng et al. 2006). An alternative is data expansion, where instead of multiple imputation, the data set is expanded by drawing 10–20 replicate haplotype pairs for every individual from their inferred probability distribution, assigning the same value of the response variable to each, and analyzing the expanded data set. However, this may alter the characteristics of the data, such as the haplotype frequencies.
In a Bayesian setting, haplotype uncertainty can be accommodated either by including the predictor variables as unknowns in the updating procedure or by multiple imputation. In a fully Bayesian treatment, the unknown haplotype pair assignments are assigned priors and estimated along with the model parameters. However, Markov chain Monte Carlo (MCMC) is then needed to fit the model, updating the parameters on the basis of the haplotype pairs and then updating the haplotype pairs on the basis of the parameters. Updating the haplotype assignments by MCMC is slow and suffers from the label-switching problem among others (Jasra et al. 2005), so an alternative approach would be preferable.
In this article, we present a new HBREM for QTL mapping applicable to observed or inferred haplotypes. It does not require costly MCMC techniques, since the joint posterior distribution factorizes. It parameterizes the variance terms in the model, focusing on the proportion of the variance due to the QTL. We compare its performance with that of the frequentist fixed-effects model for both observed and inferred multiallelic loci. We show first that the posterior mode of the proportion of variance due to a locus is a better outcome measure than two standard Bayesian test statistics and second that the Bayesian estimates of the individual haplotype effects are much more accurate than the corresponding frequentist estimates. Finally we analyze real data from A. thaliana recombinant inbred lines descended from 19 parental lines.
METHODS
Model:
To be as general as possible, we use a notation for modeling the haplotypic effects that applies equally to an outbred diploid population (e.g., humans) in which two distinct haplotypes are present in an individual, giving K haplotype pairs, and a population of recombinant inbred lines (RIL) in which only one distinct haplotype is present in an individual, giving K haplotypes. Therefore “haplotype pair” should be read as “haplotype” in the case of RILs. We model a single locus at a time for direct comparison with the frequentist fixed effects model, although the extension to a multilocus model may be more relevant for many data sets.
Consider a locus where K haplotype pairs segregate. We first assume the haplotypes are known and then generalize to the more common situation where they are inferred, as is the case whenever only unphased SNP data are available or when the number of haplotypes is larger than the number of SNP alleles.
Several methods can be used to estimate the probability distribution of the haplotype pairs that could be formed from multiple adjacent markers and the choice is left to the user. In human populations, PHASE is commonly used (Stephens et al. 2001; Li and Stephens 2003; Stephens and Donnelly 2003). In synthetic animal and plant populations descended from known inbred strains the probability of descent from each ancestral haplotype can be estimated by a hidden Markov model (HMM) (Mott et al. 2000).
For a continuously distributed trait and known haplotype pairs, the observed phenotypic value yn for individual n is described by the linear regression modelwhere μ is the phenotypic mean, T = [T1, … , TK] is the vector of effect parameters for the haplotype pairs at the locus such that
, and xn = [xn1, … , xnK], n = 1, 2, … , N are indicator vectors, such that for individual n with haplotype pair m, xnm = 1 and xnk = 0, k ≠ m. The residual error terms
where σ2 is the total phenotypic variance and κε[0, 1] is the proportion of the total phenotypic variance due to the locus. In the case of diploid populations we fit haplotype pairs rather than individual haplotypes to allow for nonadditive effects.
Multiple imputation has traditionally been used when only a small fraction of the genotype data are missing. Complete data sets are created by sampling the missing genotypes from their probability distribution. But for inferred haplotypes, all of the data are missing and there are very many possible data sets, each with small probability. Therefore, imputing from a small number of sampled data sets is unlikely to sample the space adequately. However, since the probability distribution pn is known, instead we sample data sets of haplotype pairs from this distribution, drawing a new haplotype pair for each individual each time a draw of the model parameters is made. This averages the distributions of the parameters over a wide range of possible data sets and effectively treats the inferred haplotype pair probabilities as a posterior distribution rather than a prior in the MCMC method. In some situations updating the haplotypes in a fully Bayesian way would be better, but it is computationally intensive and any advantage gained depends on the quality of the prior. If the prior estimates of the haplotype probabilities are accurate, then little benefit would be gained by updating them in the model.
We use multiple imputation, sampling haplotype pairs from the probabilities pn. We assume the haplotype pairs do not depend on the model parameters or the phenotype. This is certainly reasonable for the majority of loci, since the phenotype depends only on genotypes at QTL, which will not be true for most loci for any given phenotype. We also assume that individuals are unrelated and hence that their haplotype distributions are independent, as for standard association mapping.
Prior distributions:
We choose uninformative priors for the parameters. The phenotypic mean μ could take any real value and the total variance σ2 could take any positive real value, and these values would not necessarily depend on each other. Hence μ and σ2 were assumed to be independent, such that π(μ) ∝ 1 for με(–∞, ∞) (the location-invariant Jeffrey's prior for the normal distribution) and σ2 ∼ Inv – χ2[ν = 0] (the scale-invariant Jeffrey's prior for a variance with degrees of freedom ν).
The phenotypic effect parameters Tk, k = 1, … , K for the haplotype pairs are draws from a distribution that could take any real value and should ideally sum to zero over the sample because they are deviations from the overall mean. This may not apply in individual samples of data, however, since the sample frequencies of the haplotype pairs will vary, and it would also lead to dependency between the parameters. Instead, a weaker condition is imposed in the prior, that 𝔼(Tk) = 0. The prior distribution of the effect parameters was chosen to follow a normal distribution, the distribution with the greatest entropy, which must have variance κσ2 by the definition of κ,
The proportion of variance κ due to the effect of the locus is the main parameter of interest. The proportion at any particular locus has a high prior of being at or near zero, but with the possibility of being very large. Thus the natural choice of prior for a proportion would appear to be a Beta(α, β) distribution. However, when we implemented this prior for the single-locus model, the posterior distribution of κ was found to be highly sensitive to the choice of the hyperparameters α, β (data not shown). Therefore the hyperparameters would need to be included in an additional level of the hierarchical model and estimated simultaneously from the data, following the approach of Yi and Xu (2008). Moreover, such a prior on κ would shrink the estimates of smaller effects and potentially reduce power to detect them.
Ideally κ should give an accurate estimate of the proportion of phenotypic variance due to the locus, large or small. Hence we choose a flat, noninformative priorso that the data dominate the posterior distribution of κ. The conventional model has two independent parameters corresponding to the QTL variance and the residual variance. Our reparameterization models the dependency between the variance parameters, while allowing complete factorization of the joint posterior distribution when the haplotypes are known.
Posterior computation:
Conditional on the haplotype pairs, the joint posterior distribution π(T, μ, σ2, κ | y, X) can be factorized completely,(1)where X is the NxK matrix with rows corresponding to the indicator vectors xn. (The derivation of this result follows the factorization of a similar model, which could also be used for this problem, in Chap. 5, section 5.4 of Gelman et al. 2004, and is given in full in the appendix.) The conditional posterior distributions of the parameters are calculated by integrating the joint posterior. First define
and
. Then we find
(2)
(3)
where
and
where Nk is the number of individuals with haplotype pair k and
is the mean phenotypic value for haplotype pair k.
The factorization (1) can be used to make a series of draws numbered d = 1, 2, … , of the parameters from the joint posterior distribution, as follows. First draw κ(d) from π(κ | y, X). Then σ2(d) can be drawn from π(σ2 | κ(d), y, X), μ(d) from π(μ | σ2(d), κ(d), y, X), and from π(Tk | μ(d), σ2(d), κ(d), y, X). It is straightforward to sample μ and Tk from the normal distributions (3) and (2). To draw a value θ from a scaled inverse χ2(ν, ρ2) distribution, first draw a value x from a
-distribution and then calculate θ = νρ2/x. The distribution π(κ | y, X) does not have a recognizable standard form, but for all values of X, y, and κ the value of the posterior density can be calculated, up to a normalizing constant. Hence we computed the distribution via the approximation described in section 11.1 (Chap. 11) of Gelman et al. (2004), by calculating the density function on a dense grid of values of κ given X and y and normalizing the set of density values to sum to 1. From this, the approximate cumulative distribution function (cdf) on the grid of points was calculated and draws of the parameter κ were made using the inverse-cdf method, as described in section 1.9 of Chap. 1 of Gelman et al. (2004).
For inferred haplotypes, we have the joint posterior π(T, μ, σ2, κ, X | y), including X as unknown, and a distribution for X. Hence we can write
(4)
We already have a complete factorization (1) for the joint posterior π(T, μ, σ2, κ | X, y). Hence this can be extended to include uncertainty in the haplotype assignments, avoiding MCMC sampling. To make a draw d from the joint posterior for inferred haplotypes, first draw a set of haplotype pairs X(d) from π(X) and then use X(d) and y to draw the parameters as before, drawing a new sample of haplotype pairs for each new draw of the parameters.
Comparison of models and hypothesis testing:
In the Bayesian literature there is a very wide range of statistics or outcome measures reported, with no apparent standard other than the Bayes factor and no established significance thresholds. We aim to choose statistics for hypothesis testing that will be familiar to many readers. We investigated three statistics for the Bayesian model: the mode of the posterior distribution of κ, i.e., the proportion of variance due to the QTL, log10 of an approximation to the Bayes factor based on the Bayesian information criterion (BIC) (Raftery 1986), and the difference in values of the deviance information criterion (DIC) (Spiegelhalter et al. 2002). For the frequentist regression model we used logP, the –log10 P-value of the standard ANOVA F test (Mott et al. 2000). Since we found the power of the posterior mode of κ was superior to the other Bayesian statistics, we focus on the comparison of that statistic with logP; details of the other two statistics are included in supporting information, File S1 and their performance in File S2, including Figure S1, Figure S4, Figure S5, Figure S6, and Figure S7.
Posterior mode of κ:
The posterior distribution of κ is asymmetric and appears to be unimodal in general, so the posterior mode is the best point estimate of κ. However, the distribution of κ is not of standard form, so the mode was calculated via an approximation to a Beta(α, β) distribution. The parameters α and β were estimated using the mean and variance of the posterior sample. The mode of the posterior of κ was estimated as the mode of the resulting Beta and was set to zero if it was negative and set to 1 if it was >1.
ANOVA F test:
The frequentist fixed-effects ANOVA F test is calculated by linear regression of the full fixed-effects QTL model and the null model, using the R lm() and anova() functions. For inferred haplotypes, the linear regression for the QTL model, , fits the expected haplotype pair value for each individual.
Software:
Software is freely available on request from the authors. A module to calculate significance thresholds is under development and will be added to the package. The analysis is written in C code, but embedded in the R Happy package (version 2.4).
SIMULATION STUDY:
We conducted a simulation study to evaluate the performance of this model and accuracy of individual haplotype effect estimates, both when the haplotypes were known and when they were inferred. To simulate data, we used haplotype information from a real data set of 527 A. thaliana RILs (Kover et al. 2009). These RILs are a MAGIC, descended from 19 founder accessions (inbred lines), such that the genome of each RIL is a homozygous mosaic of the founder haplotypes. They are homozygous at almost all loci, so the haplotype pairs are interpreted as haplotypes. A HMM that used multiple SNP genotype information computed the probability vectors pn in 1255 SNP marker intervals using the program HAPPY (Mott et al. 2000). For the simulation study, QTL were simulated at a subset of loci, which were selected on the basis of properties of their probability vectors to cover a wide range of situations.
To simulate phenotypes for known haplotypes X, a draw was made from pn for each individual independently and the resulting haplotypes were treated as the true haplotypes and fixed for all simulated sets of phenotypes. To simulate a data set with inferred haplotypes, a new draw of X(d) was made for each simulated set of phenotypes and then the analysis was done using pn only. A normally distributed phenotype y was simulated for a biallelic susceptibility locus accounting for a proportion Q of the total phenotypic variance, according to X at a given locus.
To simulate representative QTL, we first characterized the information content at each locus. We reasoned that power at a locus is likely to depend on two main factors: the QTL allele frequency f and, for inferred haplotypes, the uncertainty of the inference, which depends on pn. Therefore, at every locus (defined here to be the interval between adjacent genotyped SNPs) we calculated the sample-average relative entropysuch that
when all haplotypes are known and
when all haplotypes are equally likely. We also calculated the mean square error (MSE) of the estimated haplotype effects from their true simulated values under the QTL model (MSEQTL) and also from the values under the null model (MSEnull) for both the regression estimates and the Bayesian estimates.
At each simulated locus, 1000 simulations at each of the QTL allele frequencies f = 1, 2, 4, and 8 were performed. QTL effect sizes were parameterized in terms of 100Q, the percentage of the total phenotypic variance attributable to the locus. For known haplotypes, we simulated QTL with 100Q = 1, 2, 3, 4, 5, 6, 7, and 8 and for inferred haplotypes 100Q = 2, 4, 6, 8, 10, 20, and 40.
To compare the power of the statistics as outcome measures for QTL mapping, it was necessary to first determine significance thresholds for the Bayesian statistics since there appear to be no established thresholds, only genomewide rule-of-thumb thresholds for the Bayes factor. Thresholds were established via simulation under the null hypothesis of no genetic effect for the model with known X and then applied to the case of inferred X. Thresholds were computed for two different significance levels, the nominal 5% level and also genomewide, where one false positive in the genome equated to a nominal significance level of 0.08%. To get the adjusted threshold at a locus for the inferred haplotype data simulations, we simply took the relevant quantile of the distribution simulated under the null hypothesis at that locus.
RESULTS
Results for known haplotypes:
The power and type I error rates were calculated as the proportion of simulations where the statistic exceeded the relevant threshold, considered as a function of the number of haplotype pairs carrying the QTL variant (Figure 1). Figure 1 shows that mode(κ) and logP have virtually identical power. At the genomewide level, QTL explaining ≥8% of the phenotypic variance would be detected with virtually 100% power and at the nominal 5% level, QTL explaining ≥5% of the phenotypic variance would be detected with virtually 100% power.
Power to detect a QTL when haplotypes are known, as a function of the QTL effect size as a percentage of the total phenotypic variance. Results are presented for a range of QTL allele frequencies, f = the number of founder accessions out of 19, for the nominal 5% (A–D) and genomewide 0.08% (E–H) significance thresholds. The statistics are mode(κ) (red) and logP (blue).
The slight loss of power for low f in the Bayesian model was because it estimates the QTL variance independently of the sample haplotype pair frequencies, as if it were a balanced design with equal haplotype pair frequencies. The prior distribution for Tk implicitly defines the variance independently of the data, which is modeled in the likelihood. This led the two estimates of the QTL variance to differ, setting up a conflict between the prior and the likelihood. The posterior parameter estimates are a compromise depending on the relative amounts of information in the likelihood and the prior. The effect was reduced for common QTL alleles because then the sample QTL frequency was more often closer to that for a balanced design.
However, the Bayesian and regression methods were very different when we compared the accuracy of the estimates of individual haplotype effects. Figure 2 shows the distributions of MSEQTL for the estimates of the individual haplotype pair effects from the Bayesian and regression methods for a QTL allele carried by 4 of 19 haplotype pairs. The MSEs for the regression estimates are much larger and are independent of effect size of the QTL or QTL allele frequency (Figure S2 and Figure S3 in File S2). See File S2 for more information.
MSE of estimates of individual haplotype pair effects. (A) Comparison of histograms of MSEQTL for regression (blue) and Bayesian (red) estimates, with (B) comparison of MSEQTL between methods for each simulated data set. Data are for a 5% QTL carried on 4 of 19 haplotype pairs.
Results for inferred haplotypes:
The power and type I error rates were calculated using the significance thresholds calculated for known X. Figure 3 shows the mean type I error rates averaged over f, as a function of relative entropy .
Type I error rates when haplotypes are inferred, as a function of the sample-average entropy at a locus for the 5% significance threshold. Locus error rates were calculated from 1000 simulated data sets at each locus. The statistics are mode(κ) (red), logBF (light blue), DICdiff (pink), and logP (dark blue).
The logP had the correct error rate independent of , but at the nominal 5% significance level mode(κ) was increasingly conservative as
increases. The logBF statistic was also conservative, whereas the DICdiff statistic was extremely anticonservative (see File S1 and File S2 for details of logBF and DICdiff). Hence mode(κ) was the preferred Bayesian statistic on the basis of performance and interpretation. It is also directly available as a model parameter with a posterior distribution.
The effect of the entropy, on mode(κ) is due to the Bayesian method properly taking into account the uncertainty in individuals' haplotype assignments. Whichever “true” haplotypes were selected to simulate the phenotype, the proportion of correct draws from the posterior will inevitably be lower at high entropy loci. Incorrect draws lead to incorrect parameter estimates, such that the individual haplotype estimates are closer to the overall mean than their true values, leading to shrinkage in κ.
The weighted regression used for the F test does not properly allow for the haplotype uncertainty, but simply uses the expectations of individuals' haplotype effects, pn · T, which is difficult to interpret in terms of haplotype pairs. If all haplotype pairs have very similar probabilities with only very small differences between them, the Bayesian model will have much less information and lower power whereas the F test will be relatively unaffected. However, in the weighted regression, when the haplotype pair probabilities are almost equal (as when there is high entropy in the haplotype assignments), this results in the least-squares estimates being unstable and taking very large nonsensical values.
In contrast, the Bayesian approach prevents this occurring by restricting the parameter values to the support of the posterior distributions. The Bayesian estimates of Tk are shrunk consistently and predictably as the uncertainty in the haplotypes increases. This is demonstrated in Figure 4, by the relationship between MSEQTL and MSEnull for the Bayesian estimates as entropy increases, which indicates that the estimates are being shrunk toward the null model as the level of information in the genetic data falls. The corresponding plots for the regression estimates show no such relationship.
Effect of haplotype uncertainty on individual effect estimates, from the regression and Bayesian analyses in relation to the simulated values for the QTL model (MSEQTL) and the null model (MSEnull). (A) MSEQTL of regression estimates; (B) MSEnull of regression estimates; (C) MSEQTL of Bayesian estimates; (D) MSEnull of Bayesian estimates. (Note log scale of y-axes for the regression MSEs.) The box (or box-and-whisker) plots show the spread of the distribution of points generated by the simulation for the different entropy levels. The box shows the central 50% of the distribution, between the 25 and 75% quartiles, with the middle bar representing the median or 50% quartile. The whiskers extend to the farthest data point that is no more than 1.5 times the length of the box away from the box. Data points farther away are plotted individually.
The relative type I error rates of the statistics were reflected in their power levels (Figure 5). For low locus entropy, they have similar power, as observed when the haplotype pairs are known or observed. The logP had the highest power for high entropy levels, due to the fact that it does not properly take into account the uncertainty in the haplotypes. If the significance thresholds for both statistics are adjusted with increasing entropy to maintain the size of the test (i.e., fix the proportion of false positives), the power of mode(κ) matches the power of logP and even exceeds it for small QTL effect sizes.
Power to detect a QTL for inferred haplotypes for constant and adjusted thresholds, as a function of the sample-average entropy at a locus. Results are presented for a range of QTL effect sizes as a percentage of the total phenotypic variance, for the nominal 5% (A–E) and genomewide 0.08% (F–J) significance levels. Power is reported for the statistics mode(κ) and logP for both constant thresholds [red, mode(κ); blue, logP] and adjusted thresholds [pink, mode(κ); light blue, logP].
Real data analysis of Arabidopsis QTL:
We illustrate the relative performance of logP and mode(κ) with analysis of QTL in a population of A. thaliana MAGIC lines (Kover et al. 2009). We analyzed two phenotypes, days to germination and days to bolting time (Figures 6 and 7). (Bolting time was calculated as the number of days between germination and noting the first flower bud.)
Real data analysis of the phenotype, days to germination. Horizontal lines represent significance thresholds calculated via simulation: nominal 5% (dashed line) and genomewide 0.08% (solid line). Vertical lines represent the boundaries between chromosomes.
Real data analysis of the phenotype, days to bolting time. Horizontal lines represent significance thresholds calculated via simulation: nominal 5% (dashed line) and genomewide 0.08% (solid line). Vertical lines represent the boundaries between chromosomes.
At the genomewide threshold, there were one QTL for days to germination and three significant QTL for days to bolting time, with both statistics identifying the same QTL at this threshold. The Bayesian statistic, mode(κ), has much less background noise across the genome than logP and additionally has the advantage of indicating the size of the effect of the QTL.
There is some population structure in the MAGIC lines, which potentially violates the assumptions of independence in both Bayesian and frequentist analyses. Certain lines have “cousins” that share ∼25% of the genome by descent. This could lead to long-range linkage disequilibrium (LD) across chromosomes and “ghost” peaks in the association analysis. However, the LD patterns observed in the data show little sign of this phenomenon (Kover et al. 2009). To confirm this, we reran our analysis on the data with cousin lines removed and the results were unchanged (Figure S8 and Figure S9 in File S2).
DISCUSSION
Our Bayesian single-locus hierarchical random-effects model has a complete factorization of the joint posterior distribution so does not require a costly MCMC step. This gives it comparable analysis times to the frequentist ANOVA analysis. We extended the factorization to allow for uncertainty in inferred haplotypes, using a version of multiple imputation. This extension provides an exact solution that fully takes into account the degree of uncertainty in the haplotypes. When haplotypes are known, the analysis of 1000 loci and 500 subjects runs in <2 min, and when haplotypes are inferred, the same analysis runs in <20 min. The increased running time is due to the calculation of the posterior distribution of κ, for each draw of haplotypes in the inferred haplotype model.
When the haplotype data are known, Bayesian and frequentist methods had virtually identical power, but the frequentist estimates of the haplotype effects were much less accurate than the Bayesian estimates. The Bayesian model had lower power for QTL whose alleles were carried on haplotypes observed at low frequency, but this was due to the choice of prior for the haplotype pair effects. Altering the prior to estimate the haplotype pair frequencies from the data might fix this problem.
When the haplotypes are inferred, the null distributions of several Bayesian summary statistics were affected, in different ways, by the degree of uncertainty in the haplotype assignments at a locus, leading to different type I error rates at different loci in the genome. We find that the higher accuracy of the individual haplotype pair effect estimates from the Bayesian model remains for the inferred haplotype data, although the estimates from both methods are affected by the uncertainty in haplotype assignment. Moreover, the error in the Bayesian estimates was easily explained by the loss of information, with effect estimates shrunk toward the null model as the uncertainty increased, whereas the regression estimates became increasingly unpredictable and unreliable. Because the Bayesian method is more accurate than the frequentist even when haplotypes are known, this shows that the difference will persist whatever frequentist method is used to deal with haplotype uncertainty, e.g., EM instead of the regression method used here.
Our preferred Bayesian summary statistic is κ, the proportion of variance due to the locus, as estimated by the mode of the posterior distribution. In addition to its natural interpretation as the effect size of the QTL, κ is technically superior to the other Bayesian statistics we considered. First, it is less conservative than the approximate Bayes factor. Second, the DIC statistic is highly anticonservative and hence invalid at the 5% level. This is due to the use of plug-in likelihoods with the multiple imputation approach, which do not properly take into account the uncertainty in the haplotype inference, in contrast to κ.
The F test also ignores uncertainty in the inferred haplotypes (as measured by the entropy) and consequently has much greater error in the estimation of the haplotype pair effects, although higher power at loci with higher entropy. The advantage of the F test is that it has a well-defined constant significance threshold for any given size of test. The mode of κ-statistic loses power relative to the F test as entropy increases, when a constant significance threshold is applied, because the type I error rate decreases. However, if the significance threshold for the mode of κ-statistic is adjusted as entropy increases to maintain the type I error rate, it matches the power of the F test and even exceeds it for small QTL effect sizes. Also, despite the effect of entropy on the null distribution, we found that the Bayesian model detected the same QTL at the genomewide significance level, even when not adjusted, when analyzing a real data set. This emphasizes the importance of high-quality methods for inferring haplotypes, such that the majority of loci in the genome have low levels of haplotype entropy.
Our model can be extended to fit all loci simultaneously, with a flat Dirichlet prior distribution for the loci proportions of variance. The Dirichlet distribution imposes natural constraints on the variance parameters, as the Uniform[0, 1] prior on the locus proportion of variance does for the single-locus model. Indeed, forcing the sum of the variances for all loci to not exceed the total phenotypic variance may act as a shrinkage prior without the need for an additional level of the hierarchical model. However, the single-locus analog of the Dirichlet distribution is the Beta distribution and our model is sensitive to the choice of parameters for a Beta prior. Hence it may be necessary to include these hyperparameters as unknowns in the model with flat hyperpriors, as described by Yi and Xu (2008). The multilocus model would require MCMC techniques to sample from the joint posterior, similarly to the conventional model in Chapter 15 of Gelman et al. (2004). Further research is needed into the added value of the multilocus model over our single-locus model for inferred genetic data to justify the additional computing time it would require.
APPENDIX: FACTORIZATION OF THE JOINT POSTERIOR
Let N be the number of individuals in the sample and let K be the number of genotypes at the genetic locus such that for k = 1, … , K, there are Nk individuals with genotype k, where the Nk are observed (fixed) such that . Let μ be the overall mean of the observed phenotype y, let σ2 be the total phenotypic variance, and let κ be the proportion of the total phenotypic variance attributable to the genetic locus. Let Tk be the effect on the phenotype of genotype k. Let T = [T1, … , TK]T and let L(y, μ, T, σ2, κ, X) be the likelihood function of the data and the parameters. Then using Bayes' rule, the (unnormalized) joint posterior distribution of all the parameters given the data is proportional to the likelihood multiplied by the priors, such that
(A1)
Inspection of the joint posterior reveals that the posterior distribution of T depends on all of the other parameters, μ, σ2, κ and also that conditional on these parameters, the posterior distributions of the Tk's do not depend on each other. Hence we first consider the conditional distribution of Tk, where
Conditional posterior distribution of Tk:
To start the factorization we can write(A2)where π (μ, σ2, κ | y, X) does not depend on T. Hence
where
and
. This is the standard form of a normal distribution, so
To continue the factorization in (A2), we must find π(μ, σ2, κ | y, X) by integrating out Tk, k = 1, … , K from the joint posterior (A1):
By comparison with the probability density function (pdf) of a normal distribution, adding and subtracting the extra term required to complete the square,Hence
Conditional posterior distribution of μ given σ2, κ:
We have(A3)where π(σ2, κ | y, X) does not depend on μ. Hence
This is the standard form of a normal distribution, so
To complete the expression in (A3), we must find π(σ2, κ | y, X) by integrating out μ from the joint density π(μ, σ2, κ | y, X):
By comparison with a normal pdf, completing the square as before,
Hence
Conditional posterior distribution of σ2 given κ:
Write(A4)where π(κ | y, X) does not depend on σ2. Hence as before,
which is the standard form of a scaled inverse χ2-distribution,
where
Posterior distribution of κ:
To find an expression for the marginal posterior distribution of κ, we must integrate out σ2 from π(σ2, κ | y, X),for κ ∈ [0, 1]. This distribution is not of standard form, but it is continuous in κ with no singularities in the interval (0, 1). At κ = 1 the density is zero, while at κ = 0, λk = Nk and the density
is finite. Hence the posterior is proper.
Combining the results in (A1), (A2), and (A3), a factorization exists of the joint posterior distribution such that
Footnotes
Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.109.113183/DC1.
Communicating editor: H. Zhao
- Received December 16, 2009.
- Accepted December 23, 2009.
- Copyright © 2010 by the Genetics Society of America