Abstract
We describe a Bayesian approach to analyzing multilocus genotype or haplotype data to assess departures from gametic (linkage) equilibrium. Our approach employs a Markov chain Monte Carlo (MCMC) algorithm to approximate the posterior probability distributions of disequilibrium parameters. The distributions are computed exactly in some simple settings. Among other advantages, posterior distributions can be presented visually, which allows the uncertainties in parameter estimates to be readily assessed. In addition, background knowledge can be incorporated, where available, to improve the precision of inferences. The method is illustrated by application to previously published datasets; implications for multilocus forensic match probabilities and for simple association-based gene mapping are also discussed.
DEPARTURES from gametic (or linkage) and Hardy-Weinberg (HW) equilibria can provide clues about aspects of population histories and mating behavior (see, e.g., Lewontin 1974) and can be useful in locating disease genes (Jorde 1995; Federet al. 1996; Nielsenet al. 1998). They also play an important role in the forensic use of DNA profile evidence. Match probability calculations either rely on assumptions of equilibrium (National Research Council 1996) or else allow for patterns of departures that hold in simplified population models (Weir 1994; Balding and Nichols 1995; Ayres and Overall 1999). It is important that the validity of such assumptions in actual populations is verified empirically, as far as is feasible.
Traditional statistical treatments usually focus on testing hypotheses of equilibrium, with recent developments involving randomization tests (e.g., Zaykinet al. 1995; Slatkin and Excoffier 1996). Although they may form a useful first step, such hypothesis tests represent a limited form of statistical inference, since the tests concern only whether or not the data are consistent with equilibrium, rather than directly assessing how large are the departures from precise equilibrium that inevitably exist in real populations (e.g., Smith 1970). In forensic applications, for example, a hypothesis of equilibrium may be rejected with a sufficiently large sample, whereas a forensic scientist may nevertheless believe that the magnitude of the departure is sufficiently small that the hypothesis of equilibrium, though strictly false, is adequate for the application at hand.
Point estimation methods for disequilibrium parameters have been developed (see, e.g., Weir 1979). Here, we propose a Markov chain Monte Carlo (MCMC) method to investigate probability distributions for gametic disequilibrium measures given the data. This extends previous work (Ayres and Balding 1998; Shoemakeret al. 1998) on assessing departures from HW.
Perhaps the most important advantage of our approach is interpretability: the questions of interest are answered directly in terms of probabilities that can conveniently be presented graphically via probability density curves, providing an immediate yet detailed assessment of the variability associated with an estimate. A further advantage is that, since the approach is likelihood based, it is statistically powerful and can incorporate a wide range of modeling assumptions. Previous treatments assume random union of gametes (RUG) to infer population haplotype proportions from genotype data (e.g., Excoffier and Slatkin 1995). Although we also implement the RUG model in the following analyses, we note that other models can be readily applied, such as those that incorporate inbreeding measures.
The choice of prior distribution is sometimes seen as a barrier to the implementation of direct probability, or Bayesian, methods. We introduce a class of hierarchical prior distributions for the haplotype proportions, which allows the scientist some flexibility either to incorporate relevant background information, if desired, or to adopt a relatively “vague” prior.
We illustrate our method by analyzing samples of genotypes at two unlinked loci and at three linked loci. We also briefly discuss its application to forensic identification, and to haplotype data and simple disequilibrium gene mapping. Computer programs (C code) for the MCMC algorithms are available from the authors on request.
METHODS
Measures of gametic disequilibrium: Genetic equilibrium corresponds to statistical independence, and many authors (see, e.g., Weir 1979) measure gametic disequilibrium in terms of the differences between population haplotype proportions and the values that would be expected under equilibrium, given the allele proportions. Following this approach, for a two-locus haplotype consisting of alleles Ai and Bj, we introduce the notation
The range of Dij depends on pi and qj, which makes cross-locus and cross-population comparisons difficult. To alleviate this problem, Lewontin (1964) defined the normalized difference, with range [-1, 1], by
Moreover, in practice the range of values of D′ consistent with gametic equilibrium is not readily apparent and can vary from locus to locus. Under equilibrium, each Dij, and hence D′, takes value zero. However, just as a χ2 goodness-of-fit statistic is unlikely to be very close to zero even when the model is valid, so estimates of D′ based on data from equilibrium populations are unlikely to be very close to zero (furthermore, variances of D′ are difficult to calculate; see Zapataet al. 1997). Insight into whether or not the data are consistent with gametic equilibrium can be gained by reanalyzing them with the alleles at each locus randomly permuted, thus simulating samples from a population in equilbrium with the same pi and qj as the population under study (e.g., Slatkin and Excoffier 1996). However, if the primary aim of an analysis is to test the hypothesis of equilibrium, then the likelihood-ratio (LR) statistic or a Bayes factor would be more appropriate than D′.
Measures of gametic disequilibrium not based on Dij have also been proposed. Smouse (1974) specifies a log-linear model for the hij, with allele-specific parameters ai and bj, and an interaction term cij that can be employed as an alternative to Dij. Weir (1996, pp. 127-133) details a closely related multiplicative model and extends the analysis to genotype data.
Here, we focus on D′ as a summary measure of gametic disequilibrium (together with an extension D′′, introduced below). This measure is widely used and, although it suffers from the interpretability drawbacks described above, there seems to be no univariate measure that avoids such difficulties. When interest focuses on gametic disequilibrium due to linkage, such as in “simple” genetic mapping, then a natural criterion for choosing between disequilibrium measures is correlation with physical distance and Devlin and Risch (1995) find that D′ has good properties in that setting.
Random union of gametes model: When only genotype counts are available, a model is required to relate the hij to genotype proportions, which then implies a model for the Dij. For two loci at which the population proportion of genotype AiAi′BjBj′ is denoted pii′jj′ (with i ≤ i′ and j ≤ j′), perhaps the simplest plausible model assumes RUG:
The log-likelihood for a random sample of genotypes is obtained by substituting (3) into the multinomial log-likelihood function,
Modeling background information: Here, we are primarily concerned not with point estimates but with the full joint distribution of the hij, and hence of the dij and D′, given the genotype data. This requires a probability model for the hij prior to observing the data. Perhaps the simplest such model is given by the (multivariate) uniform distribution, which may be interpreted as corresponding to no background information about haplotype proportions. However, a uniform prior for the hij does not correspond to an uninformative prior for D′, and the level of informativeness is fixed and cannot be controlled. Moreover, the uniform-on-haplotypes prior does not encapsulate the fact that haplotypes are composed of alleles and hence, for example, the h1j, j ≠ 1 and the hi1, i ≠ 1 are informative about p1 and q1 and thus may well be informative about h11.
Suppose that information was available in advance, perhaps from surveys in other populations, which indicated that pi and qj were likely to be close to, say, αi and βj, respectively. (Conceptually, αi and βj might be thought of as metapopulation allele proportions.) A tractable family of prior distributions for the hij would then be the Dirichlet family with parameters λαiβj, where λ is a constant, so that each hij has prior expectation and variance given by
—Prior (dotted curves) and posterior (solid curves) densities for the overall disequilibrium measure D′, based on two data sets simulated under the RUG model, each with n = 200 genotypes and with three and six alleles at each locus. The “true” values of D′ underlying the simulated data are 0.081 and 0.253. The posterior curves are obtained by Gaussian kernel density estimation using 5000 MCMC algorithm outputs. The three prior densities for the hij are Dirichlet with parameters IJαiβj (A and B), 2IJαiβj (C and D), and 4IJαiβj (E and F), conditional on the αi and the βj, which are each (multivariate) uniform. Dot-dashed curves show average posterior densities from 50 random permutations of the genotypes at each locus, mimicking 50 samples from populations in gametic equilibrium.
Medians and equal-tailed 90% intervals of the prior and posterior distribution for D′ shown in Figure 1
The sum of the Dirichlet parameters provides a measure of the information conveyed by the distribution. Choosing λ so that the average of the λαiβj is one would give a distribution that has the same information content as the uniform (for which all the parameters equal one) and may provide a reasonable vague prior for the hij.
This framework for specifying a prior distribution for hij does not require that αi and βj be specified precisely. Instead, they can be assigned probability distributions, leading to a hierarchical prior model. Below, we adopt independent uniform distributions for the αi and βj, although background information could in practice be incorporated into more informative distributions.
MCMC algorithm: We implement an MCMC stochastic simulation algorithm for genotype data to approximate the joint distribution of the hij, and hence of the gametic disequilibrium measures, under the RUG model and the hierarchical prior distribution described above. The MCMC algorithm adopted is of the Metropolis-Hastings type (Metropoliset al. 1953; Hastings 1970). At each iteration of the algorithm, a decision is made whether to keep the current vector of parameter values or reject it in favor of a new vector. The accept/reject decision is made in such a way that the proportion of iterations at which the current vector lies in any region of the parameter space approximates the probability that the true parameter vector lies in that region, with the approximation becoming more accurate as the number of iterations increases. Further details of the MCMC algorithm are given in the appendix.
—Prior (dotted curves, corresponding to the prior distribution in Figure 1 with λ = IJ, where I = 8 and J = 6) and posterior (solid curves) densities for the overall disequilibrium measure D′ for samples of Maoris (1091 individuals) and Samoans (139 individuals) at STR loci THO1 and TPOX. Allele labels are numbers of repeats. The dotdashed curves show posterior densities averaged over 50 random permutations of the genotypes. Data provided by John Buckleton (Institute of Environmental Science and Research, New Zealand).
Figure 1 shows the posterior density curves for D′, approximated via the MCMC algorithm, given two samples of two-locus genotypes simulated under the RUG model with D′ = 0.081 (three alleles) and D′ = 0.253 (six alleles). Three prior distributions were employed, shown as dotted curves. Key quantiles of the prior and posterior distributions are given in Table 1.
Even with a reasonably large sample size (200 individuals), D′ is a difficult parameter to estimate. This is because the data bear directly on the population genotype proportions, whereas differences between allele and haplotype proportions are the quantities of interest. This difficulty is reflected by the posterior curves of Figure 1, which support a rather broad range of values for D′ and display some sensitivity to the choice of prior. However, in each case the posterior median is close to the true value and usually closer than the corresponding ML-based estimates (0.058 and 0.315), for which the sampling variance is difficult to calculate. Moreover, since D′ is univariate it is relatively easy to plot both prior and posterior density curves and hence assess visually the effect of the prior from the plots. Background information, when available, can be incorporated via the prior and may be invaluable in situations of little data and/or many alleles.
Also shown in Figure 1 are density curves averaged over 50 random permutations of the alleles, mimicking 50 samples from populations in gametic equilibrium with the same allele proportions. The data with three alleles at each locus are clearly consistent with equilibrium, but those with six alleles are not. These results are in accord with the P values 0.56 and 0.00 obtained from an LR-based permutation test for gametic disequilibrium (Slatkin and Excoffier 1996).
Figure 1 corresponds to a single simulated dataset. We also applied the MCMC method (for the prior with λ = IJ) to 100 datasets of size n = 1000, simulated with I = J = 3. The underlying hij were such that D′ = 0.404. For each dataset we calculated the posterior median: the 100 estimated posterior medians had mean 0.403 and standard deviation 0.039. These values compare favorably with the MLE-based estimates Dˆ′,for which the mean and standard deviation over these 100 simulated datasets were 0.407 and 0.040.
RESULTS
Two unlinked loci used for forensic identification: The MCMC method was applied to the genotypes at two unlinked forensic short tandem repeat (STR) loci, THO1 and TPOX, for samples of Maoris (n = 1091) and Samoans (n = 139) resident in New Zealand. Eight alleles were observed for locus THO1 and six for TPOX (additional alleles observed in other populations are ignored here, although they could be incorporated into the analysis if desired).
Figure 2 shows prior (λ = IJ) and posterior curves for the overall measure D′ together with a curve obtained from 50 random permutations of the data (mimicking equilibrium). There is a substantial overlap of these curves, suggesting that both samples are consistent with gametic equilibrium in the underlying populations; these conclusions are in agreement with P values obtained from the LR-based permutation test (0.42 and 0.13).
A full multilocus match probability involves correlations of genes both within and between individuals (in the latter case, between the defendant and an alternative possible source of the crime scene DNA). In current practice (see, e.g., Evett and Weir 1998), adjustment is often made for between-individual correlations at a single locus. However, multilocus forensic match probabilities are usually obtained by taking the product of the single-locus probabilities, thereby assuming independence between loci. Strong gametic disequilibrium may invalidate this assumption.
—Posterior densities (solid curves) for pairwise disequilibrium coefficients D2 from a sample of n = 96 genotypes, at loci LF261 (four alleles), LF168 (four alleles), and LF347 (five alleles) of the MOYO strains of the A. aegypti mosquito data of Yan et al. (1997). Point estimates (from ML estimates) are 0.492, 0.220, and 0.242. Dot-dashed curves are posterior densities averaged over 50 random permutations of the genotypes. The prior densities (dotted curves) are based on a Dirichlet prior for the hij, with parameters 40αiβjγk, conditional on the αi, βj, and γk, which are each (multivariate) uniform.
Although THO1 and TPOX are unlinked, gametic disequilibrium may nevertheless arise (due to founder effects, selection, or drift) and affect multilocus forensic match probability calculations involving these loci. It is therefore important to investigate levels of gametic disequilibrium, and a selection of marginal posterior density curves for the Dij is shown in Figure 3. All the posterior distributions support values close to zero, encouraging optimism that the effect of gametic disequilibrium on two-locus forensic match probabilities involving these loci may indeed be negligible.
Although these results tend to support current practice, note that we have not simultaneously taken all relevant correlations into account. In particular, other forms of assocation may invalidate the independence of genes assumption in the match probability (see Ayres 2000 for the case of between-locus dependence due to inbreeding). Also, forensic identification involves many loci, often >10, whereas here we have considered only the two-locus case.
—Posterior median (▪) and central 90% posterior intervals (—) for D′ in two populations, for the Xq25-Xq28 SNP data of Taillon-Miller et al. (2000); marker pairs analyzed here correspond to those presented in Table 2 of Taillon-Miller et al. (2000). A multivariate uniform prior was used for the hij.
Three linked loci: The MCMC algorithm for approximating the joint posterior of the haplotype proportions is readily extended to three loci. For forensic applications, we may be interested in investigating the difference hijk - piqjrk, which can be readily obtained from the MCMC output. For other problems, simultaneous estimation of the pairwise disequilibrium measures may be of more interest. However, multilocus systems impose additional constraints on the Dij. For three diallelic loci, Robinson et al. (1991) describe a new pairwise normalized measure
Yan et al. (1997) analyzed multilocus genotype data from chromosome 3 of the MOYO strain of the mosquito Aedes aegypti. For the three RFLP loci LF261, LF168, and LF347, they reported that the values of D′ for all three pairs were significant at the 1% level. We have calculated posterior density curves for D2 on the basis of their original data, as well as curves based on random permutations (Figure 4).
Our results suggest strong disequilibrium between LF 261 and LF168, since the curve based on the randomly permuted data has little overlap with that based on the observed data. In contrast, Figure 4 suggests little or no disequilibrium between the other two pairs of loci. The latter conclusion differs from that of Yan et al. (1997). It is, however, consistent with the relative map positions of the loci—LF168 is situated between LF261 and LF347, much closer to LF261 than to LF347. Although the MCMC results correctly identify LF168 and LF261 as the closest pair, disequilibrium between the other marker pairs is so weak that the ordering of all three markers on the basis of the joint posterior distribution of the D2 cannot be achieved with confidence, the correct order being assigned a probability of 42%.
Posterior summaries of D′ for the analyses of Figure 5
Haplotype data: In some cases haplotype counts may be available, simplifying the direct probability approach. For example, for three-locus haplotypes hijk and a hierarchical prior, we have
A method for sampling from this distribution is given in the appendix, together with a summary of implications for disequilibrium mapping. However, we focus below on the simpler case when, for two loci, a straightforward (nonhierarchical) Dirichlet distribution with parameters kij is implemented for the hij. When the likelihood is multinomial, the posterior distribution for the hij will again be Dirichlet with parameters nij + kij, and a sample from this distribution can be obtained by standard random number generation (see, e.g., Appendix A of Gelmanet al. 1995).
Taillon-Miller et al. (2000) analyzed several pairs of single nucleotide polymorphism (SNP) markers in the human Xq25-Xq28 region for three populations [general European (CEPH), Finnish, and Sardinian]. They found significant P values (P < 0.001, from a χ2 test for gametic equilibrium) for markers separated by up to ∼900 kb. They also found that, in general, point estimates of disequilibrium measures (such as D′) did not differ greatly between the large outbred population (represented by the CEPH sample) and the genetically isolated populations of Finland and Sardinia. These results were consistent with an STR analysis of similar populations (Eaveset al. 2000), though both conflict with the suggestion that genetically isolated populations tend to exhibit higher levels of disequilibrium and are therefore more useful for disease gene mapping (see, e.g., Wrightet al. 1999). In summarizing the conclusions of Taillon-Miller et al. (2000) and Eaves et al. (2000), Boehnke (2000) argues that the levels of disequilibrium observed appeared slightly stronger in the isolates than in the general mixed populations.
We have reanalyzed the CEPH and Finnish SNP data of Taillon-Miller et al. (2000), implementing a multivariate uniform prior for hij for each pair of markers analyzed. This distribution imposes a prior belief that none of the alleles is very rare [the implied prior for the pi and qj is Beta(2,2)], which is reasonable for these markers as they have been selected on the basis of polymorphism. Results are shown in Figure 5: posterior medians and 90% intervals for the two populations are plotted against physical distance.
The measure of variability provided by our MCMC approach allows more careful comparison of the levels of disequilibrium across the populations analyzed. For almost all of the marker pairs given in Table 2, the posterior 90% intervals from the two populations overlap substantially, indicating that there is little evidence of any difference across the populations. This is quantified in Table 2, which gives the posterior probability for each marker pair that D′ is larger in the Finnish population than in the CEPH population: these probabilities exceed 90% for only a handful of markers, and in no case exceed 97%. (Note the values of D′ across closely linked markers are not independent.)
Our results therefore quantify the observation made by Taillon-Miller et al. (2000) that disequilibrium levels were similar across the populations. The data provide little evidence that gametic disequilibrium is higher in the Finnish population than in the general European population.
DISCUSSION
The direct probability, or Bayesian, approach developed here permits interpretable visual answers to the question of interest about disequilibrium parameters. Moreover, it can readily incorporate complex models and background knowledge about a population, when available. For a discussion of the advantages of Bayesian approaches to problems in genetics, see Shoemaker et al. (1999). We have also developed a family of hierarchical prior distributions that allow the scientist some flexibility in specifying background knowledge.
Zapata et al. (1997) note that point estimates of D′ij are frequently reported without a corresponding measure of variability (such as the standard error), which can complicate comparisons over loci and populations. However, the calculation of Var(D′ij) is complicated by the different rescaling of positive and negative values in the definition of D′ij. Zapata et al. derived an approximation to Var(D′ij) for biallelic loci only. Our direct probability approach provides an approximation not just for the variance of (multiallelic) D′ij and D′ but for their entire posterior distributions. A particular advantage is that the posterior intervals obtained can be directly interpreted in terms of probabilities, unlike standard confidence intervals that are routinely provided for some point estimates, which do not have such a direct interpretation.
There are no theoretical limits to the number of loci that can be analyzed simultaneously. However, for a fixed sample size, the information contained in the data decreases as the number of loci increases, and, as for hypothesis testing, useful inferences are usually not feasible for more than about three loci.
APPENDIX
MCMC algorithm for genotype data: Metropolis-Hastings algorithms are methods for generating a sample from an arbitrary probability distribution Π (with probability density function π) by constructing a Markov chain whose stationary distribution is Π. If the current state of the chain is x, a candidate new state x′ is chosen with probability density q(x′|x). The candidate is accepted with probability
For the algorithm implemented here, π is the joint posterior density function of the hij. For two loci, each candidate x′ differs from x at a randomly chosen pair of the hij, say hrs and hwz. A proposal value h′rs is chosen uniformly in the interval
Slow convergence and poor mixing can arise in the presence of many alleles and/or loci. No difficulties were experienced with the examples discussed here that could not reasonably be overcome by choosing suitably large values for k and b. A burn-in of b = 30,000 iterations was found to be adequate for the two-locus algorithm (50,000 for three loci), with every k = 200th (300) iteration output (these values having been determined by the inspection of sequential and autocorrelation plots of the output for initial runs). The output of each run underlying the figures and tables was analyzed with the MCMC diagnostic computer package CODA (Bestet al. 1995), which indicated satisfactory convergence and mixing properties. For each application of the MCMC algorithm to the data, 5000 values were output (1000 for the permuted data).
MCMC algorithm for haplotype data: For three-locus haplotype data, assuming a multinomial log likelihood for the hijk (given hyperparameters α, β, and γ), together with a Dirichlet prior distribution, after observing the nijk the hijk have a Dirichlet distribution with parameters nijk + λαiβjγk. A posterior sample from p({hijk}|α, β, γ, N) can therefore be readily obtained using standard methods for the Dirichlet distribution (see, e.g., Appendix A of Gelmanet al. 1995). To obtain a distribution for the hijk that does not involve α, β, and γ, we can employ (6) together with a method of simulating from p(α, β, γ|N). A number of approaches are available, and details of an MCMC algorithm are given here.
The target distribution for the MCMC algorithm is p(α, β, γ|N), the probability density function of the hyperparameters α, β, and γ given the data N. The likelihood p(N|α, β, γ) is of a standard form known as the multinomial-Dirichlet (Smith and Bernardo 1994, p. 135), and by Bayes theorem we can write
A suitable Metropolis-Hastings algorithm can proceed as follows: first select a locus l, chosen uniformly at random. Suppose for notational convenience that α is the hyperparameter vector corresponding to the chosen locus l, then choose two elements of α, say αv and αw. The proposal
This algorithm, and modifications of it, can be useful in the location of disease loci via simple disequilibrium mapping. Briefly, under the assumption of a single disease mutation that arose sometime in the past, loci closest to the disease locus should exhibit higher levels of disequilibrium than those that are far away (e.g., Jorde 1995). Devlin and Risch (1995) detailed the use of point estimates of disequilibrium for inferring the closest of a number of biallelic markers (identifying the marker with the highest observed disequilibrium value as the closest). The direct probability approach, implementing the methods outlined above, adds a degree of interpretability to simple disequilibrium mapping, assigning probabilities to the event that a marker is closest to the disease locus—see Ayres (1998) for further details.
Multiallelic three-locus normalized measures: The following bounds apply to the Dij for two loci in a threelocus system (disequilibrium measures for the other locus pairs are denoted Dik and Djk),
Acknowledgments
We thank the following for kindly providing the data analyzed in this study: John Buckleton (human STR data), Guiyun Yan (mosquito data), Patty Taillon-Miller, and Pui-Yan Kwok (SNP data). We thank Laurent Excoffier for helpful comments on an earlier draft, as well as two anonymous referees. Work was supported in part by the UK Biotechnology and Biological Sciences Research Council, under grant 45/G09617.
Footnotes
-
Communicating editor: G. A. Churchill
- Received June 20, 1999.
- Accepted September 19, 2000.
- Copyright © 2001 by the Genetics Society of America