Abstract
We describe a Bayesian approach to analyzing multilocus genotype or haplotype data to assess departures from gametic (linkage) equilibrium. Our approach employs a Markov chain Monte Carlo (MCMC) algorithm to approximate the posterior probability distributions of disequilibrium parameters. The distributions are computed exactly in some simple settings. Among other advantages, posterior distributions can be presented visually, which allows the uncertainties in parameter estimates to be readily assessed. In addition, background knowledge can be incorporated, where available, to improve the precision of inferences. The method is illustrated by application to previously published datasets; implications for multilocus forensic match probabilities and for simple associationbased gene mapping are also discussed.
DEPARTURES from gametic (or linkage) and HardyWeinberg (HW) equilibria can provide clues about aspects of population histories and mating behavior (see, e.g., Lewontin 1974) and can be useful in locating disease genes (Jorde 1995; Federet al. 1996; Nielsenet al. 1998). They also play an important role in the forensic use of DNA profile evidence. Match probability calculations either rely on assumptions of equilibrium (National Research Council 1996) or else allow for patterns of departures that hold in simplified population models (Weir 1994; Balding and Nichols 1995; Ayres and Overall 1999). It is important that the validity of such assumptions in actual populations is verified empirically, as far as is feasible.
Traditional statistical treatments usually focus on testing hypotheses of equilibrium, with recent developments involving randomization tests (e.g., Zaykinet al. 1995; Slatkin and Excoffier 1996). Although they may form a useful first step, such hypothesis tests represent a limited form of statistical inference, since the tests concern only whether or not the data are consistent with equilibrium, rather than directly assessing how large are the departures from precise equilibrium that inevitably exist in real populations (e.g., Smith 1970). In forensic applications, for example, a hypothesis of equilibrium may be rejected with a sufficiently large sample, whereas a forensic scientist may nevertheless believe that the magnitude of the departure is sufficiently small that the hypothesis of equilibrium, though strictly false, is adequate for the application at hand.
Point estimation methods for disequilibrium parameters have been developed (see, e.g., Weir 1979). Here, we propose a Markov chain Monte Carlo (MCMC) method to investigate probability distributions for gametic disequilibrium measures given the data. This extends previous work (Ayres and Balding 1998; Shoemakeret al. 1998) on assessing departures from HW.
Perhaps the most important advantage of our approach is interpretability: the questions of interest are answered directly in terms of probabilities that can conveniently be presented graphically via probability density curves, providing an immediate yet detailed assessment of the variability associated with an estimate. A further advantage is that, since the approach is likelihood based, it is statistically powerful and can incorporate a wide range of modeling assumptions. Previous treatments assume random union of gametes (RUG) to infer population haplotype proportions from genotype data (e.g., Excoffier and Slatkin 1995). Although we also implement the RUG model in the following analyses, we note that other models can be readily applied, such as those that incorporate inbreeding measures.
The choice of prior distribution is sometimes seen as a barrier to the implementation of direct probability, or Bayesian, methods. We introduce a class of hierarchical prior distributions for the haplotype proportions, which allows the scientist some flexibility either to incorporate relevant background information, if desired, or to adopt a relatively “vague” prior.
We illustrate our method by analyzing samples of genotypes at two unlinked loci and at three linked loci. We also briefly discuss its application to forensic identification, and to haplotype data and simple disequilibrium gene mapping. Computer programs (C code) for the MCMC algorithms are available from the authors on request.
METHODS
Measures of gametic disequilibrium: Genetic equilibrium corresponds to statistical independence, and many authors (see, e.g., Weir 1979) measure gametic disequilibrium in terms of the differences between population haplotype proportions and the values that would be expected under equilibrium, given the allele proportions. Following this approach, for a twolocus haplotype consisting of alleles A_{i} and B_{j}, we introduce the notation
The range of D_{ij} depends on p_{i} and q_{j}, which makes crosslocus and crosspopulation comparisons difficult. To alleviate this problem, Lewontin (1964) defined the normalized difference, with range [1, 1], by
Moreover, in practice the range of values of D′ consistent with gametic equilibrium is not readily apparent and can vary from locus to locus. Under equilibrium, each D_{ij}, and hence D′, takes value zero. However, just as a χ^{2} goodnessoffit statistic is unlikely to be very close to zero even when the model is valid, so estimates of D′ based on data from equilibrium populations are unlikely to be very close to zero (furthermore, variances of D′ are difficult to calculate; see Zapataet al. 1997). Insight into whether or not the data are consistent with gametic equilibrium can be gained by reanalyzing them with the alleles at each locus randomly permuted, thus simulating samples from a population in equilbrium with the same p_{i} and q_{j} as the population under study (e.g., Slatkin and Excoffier 1996). However, if the primary aim of an analysis is to test the hypothesis of equilibrium, then the likelihoodratio (LR) statistic or a Bayes factor would be more appropriate than D′.
Measures of gametic disequilibrium not based on D_{ij} have also been proposed. Smouse (1974) specifies a loglinear model for the h_{ij}, with allelespecific parameters a_{i} and b_{j}, and an interaction term c_{ij} that can be employed as an alternative to D_{ij}. Weir (1996, pp. 127133) details a closely related multiplicative model and extends the analysis to genotype data.
Here, we focus on D′ as a summary measure of gametic disequilibrium (together with an extension D′′, introduced below). This measure is widely used and, although it suffers from the interpretability drawbacks described above, there seems to be no univariate measure that avoids such difficulties. When interest focuses on gametic disequilibrium due to linkage, such as in “simple” genetic mapping, then a natural criterion for choosing between disequilibrium measures is correlation with physical distance and Devlin and Risch (1995) find that D′ has good properties in that setting.
Random union of gametes model: When only genotype counts are available, a model is required to relate the h_{ij} to genotype proportions, which then implies a model for the D_{ij}. For two loci at which the population proportion of genotype A_{i}A_{i}_{′}B_{j}B_{j}_{′} is denoted p_{ii}_{′}_{jj}_{′} (with i ≤ i′ and j ≤ j′), perhaps the simplest plausible model assumes RUG:
The loglikelihood for a random sample of genotypes is obtained by substituting (3) into the multinomial loglikelihood function,
Modeling background information: Here, we are primarily concerned not with point estimates but with the full joint distribution of the h_{ij}, and hence of the d_{ij} and D′, given the genotype data. This requires a probability model for the h_{ij} prior to observing the data. Perhaps the simplest such model is given by the (multivariate) uniform distribution, which may be interpreted as corresponding to no background information about haplotype proportions. However, a uniform prior for the h_{ij} does not correspond to an uninformative prior for D′, and the level of informativeness is fixed and cannot be controlled. Moreover, the uniformonhaplotypes prior does not encapsulate the fact that haplotypes are composed of alleles and hence, for example, the h_{1}_{j}, j ≠ 1 and the h_{i}_{1}, i ≠ 1 are informative about p_{1} and q_{1} and thus may well be informative about h_{11}.
Suppose that information was available in advance, perhaps from surveys in other populations, which indicated that p_{i} and q_{j} were likely to be close to, say, α_{i} and β_{j}, respectively. (Conceptually, α_{i} and β_{j} might be thought of as metapopulation allele proportions.) A tractable family of prior distributions for the h_{ij} would then be the Dirichlet family with parameters λα_{i}β_{j}, where λ is a constant, so that each h_{ij} has prior expectation and variance given by
The sum of the Dirichlet parameters provides a measure of the information conveyed by the distribution. Choosing λ so that the average of the λα_{i}β_{j} is one would give a distribution that has the same information content as the uniform (for which all the parameters equal one) and may provide a reasonable vague prior for the h_{ij}.
This framework for specifying a prior distribution for h_{ij} does not require that α_{i} and β_{j} be specified precisely. Instead, they can be assigned probability distributions, leading to a hierarchical prior model. Below, we adopt independent uniform distributions for the α_{i} and β_{j}, although background information could in practice be incorporated into more informative distributions.
MCMC algorithm: We implement an MCMC stochastic simulation algorithm for genotype data to approximate the joint distribution of the h_{ij}, and hence of the gametic disequilibrium measures, under the RUG model and the hierarchical prior distribution described above. The MCMC algorithm adopted is of the MetropolisHastings type (Metropoliset al. 1953; Hastings 1970). At each iteration of the algorithm, a decision is made whether to keep the current vector of parameter values or reject it in favor of a new vector. The accept/reject decision is made in such a way that the proportion of iterations at which the current vector lies in any region of the parameter space approximates the probability that the true parameter vector lies in that region, with the approximation becoming more accurate as the number of iterations increases. Further details of the MCMC algorithm are given in the appendix.
Figure 1 shows the posterior density curves for D′, approximated via the MCMC algorithm, given two samples of twolocus genotypes simulated under the RUG model with D′ = 0.081 (three alleles) and D′ = 0.253 (six alleles). Three prior distributions were employed, shown as dotted curves. Key quantiles of the prior and posterior distributions are given in Table 1.
Even with a reasonably large sample size (200 individuals), D′ is a difficult parameter to estimate. This is because the data bear directly on the population genotype proportions, whereas differences between allele and haplotype proportions are the quantities of interest. This difficulty is reflected by the posterior curves of Figure 1, which support a rather broad range of values for D′ and display some sensitivity to the choice of prior. However, in each case the posterior median is close to the true value and usually closer than the corresponding MLbased estimates (0.058 and 0.315), for which the sampling variance is difficult to calculate. Moreover, since D′ is univariate it is relatively easy to plot both prior and posterior density curves and hence assess visually the effect of the prior from the plots. Background information, when available, can be incorporated via the prior and may be invaluable in situations of little data and/or many alleles.
Also shown in Figure 1 are density curves averaged over 50 random permutations of the alleles, mimicking 50 samples from populations in gametic equilibrium with the same allele proportions. The data with three alleles at each locus are clearly consistent with equilibrium, but those with six alleles are not. These results are in accord with the P values 0.56 and 0.00 obtained from an LRbased permutation test for gametic disequilibrium (Slatkin and Excoffier 1996).
Figure 1 corresponds to a single simulated dataset. We also applied the MCMC method (for the prior with λ = IJ) to 100 datasets of size n = 1000, simulated with I = J = 3. The underlying h_{ij} were such that D′ = 0.404. For each dataset we calculated the posterior median: the 100 estimated posterior medians had mean 0.403 and standard deviation 0.039. These values compare favorably with the MLEbased estimates Dˆ′,for which the mean and standard deviation over these 100 simulated datasets were 0.407 and 0.040.
RESULTS
Two unlinked loci used for forensic identification: The MCMC method was applied to the genotypes at two unlinked forensic short tandem repeat (STR) loci, THO1 and TPOX, for samples of Maoris (n = 1091) and Samoans (n = 139) resident in New Zealand. Eight alleles were observed for locus THO1 and six for TPOX (additional alleles observed in other populations are ignored here, although they could be incorporated into the analysis if desired).
Figure 2 shows prior (λ = IJ) and posterior curves for the overall measure D′ together with a curve obtained from 50 random permutations of the data (mimicking equilibrium). There is a substantial overlap of these curves, suggesting that both samples are consistent with gametic equilibrium in the underlying populations; these conclusions are in agreement with P values obtained from the LRbased permutation test (0.42 and 0.13).
A full multilocus match probability involves correlations of genes both within and between individuals (in the latter case, between the defendant and an alternative possible source of the crime scene DNA). In current practice (see, e.g., Evett and Weir 1998), adjustment is often made for betweenindividual correlations at a single locus. However, multilocus forensic match probabilities are usually obtained by taking the product of the singlelocus probabilities, thereby assuming independence between loci. Strong gametic disequilibrium may invalidate this assumption.
Although THO1 and TPOX are unlinked, gametic disequilibrium may nevertheless arise (due to founder effects, selection, or drift) and affect multilocus forensic match probability calculations involving these loci. It is therefore important to investigate levels of gametic disequilibrium, and a selection of marginal posterior density curves for the D_{ij} is shown in Figure 3. All the posterior distributions support values close to zero, encouraging optimism that the effect of gametic disequilibrium on twolocus forensic match probabilities involving these loci may indeed be negligible.
Although these results tend to support current practice, note that we have not simultaneously taken all relevant correlations into account. In particular, other forms of assocation may invalidate the independence of genes assumption in the match probability (see Ayres 2000 for the case of betweenlocus dependence due to inbreeding). Also, forensic identification involves many loci, often >10, whereas here we have considered only the twolocus case.
Three linked loci: The MCMC algorithm for approximating the joint posterior of the haplotype proportions is readily extended to three loci. For forensic applications, we may be interested in investigating the difference h_{ijk}  p_{i}q_{j}r_{k}, which can be readily obtained from the MCMC output. For other problems, simultaneous estimation of the pairwise disequilibrium measures may be of more interest. However, multilocus systems impose additional constraints on the D_{ij}. For three diallelic loci, Robinson et al. (1991) describe a new pairwise normalized measure
Yan et al. (1997) analyzed multilocus genotype data from chromosome 3 of the MOYO strain of the mosquito Aedes aegypti. For the three RFLP loci LF261, LF168, and LF347, they reported that the values of D′ for all three pairs were significant at the 1% level. We have calculated posterior density curves for D^{2} on the basis of their original data, as well as curves based on random permutations (Figure 4).
Our results suggest strong disequilibrium between LF 261 and LF168, since the curve based on the randomly permuted data has little overlap with that based on the observed data. In contrast, Figure 4 suggests little or no disequilibrium between the other two pairs of loci. The latter conclusion differs from that of Yan et al. (1997). It is, however, consistent with the relative map positions of the loci—LF168 is situated between LF261 and LF347, much closer to LF261 than to LF347. Although the MCMC results correctly identify LF168 and LF261 as the closest pair, disequilibrium between the other marker pairs is so weak that the ordering of all three markers on the basis of the joint posterior distribution of the D^{2} cannot be achieved with confidence, the correct order being assigned a probability of 42%.
Haplotype data: In some cases haplotype counts may be available, simplifying the direct probability approach. For example, for threelocus haplotypes h_{ijk} and a hierarchical prior, we have
A method for sampling from this distribution is given in the appendix, together with a summary of implications for disequilibrium mapping. However, we focus below on the simpler case when, for two loci, a straightforward (nonhierarchical) Dirichlet distribution with parameters k_{ij} is implemented for the h_{ij}. When the likelihood is multinomial, the posterior distribution for the h_{ij} will again be Dirichlet with parameters n_{ij} + k_{ij}, and a sample from this distribution can be obtained by standard random number generation (see, e.g., Appendix A of Gelmanet al. 1995).
TaillonMiller et al. (2000) analyzed several pairs of single nucleotide polymorphism (SNP) markers in the human Xq25Xq28 region for three populations [general European (CEPH), Finnish, and Sardinian]. They found significant P values (P < 0.001, from a χ^{2} test for gametic equilibrium) for markers separated by up to ∼900 kb. They also found that, in general, point estimates of disequilibrium measures (such as D′) did not differ greatly between the large outbred population (represented by the CEPH sample) and the genetically isolated populations of Finland and Sardinia. These results were consistent with an STR analysis of similar populations (Eaveset al. 2000), though both conflict with the suggestion that genetically isolated populations tend to exhibit higher levels of disequilibrium and are therefore more useful for disease gene mapping (see, e.g., Wrightet al. 1999). In summarizing the conclusions of TaillonMiller et al. (2000) and Eaves et al. (2000), Boehnke (2000) argues that the levels of disequilibrium observed appeared slightly stronger in the isolates than in the general mixed populations.
We have reanalyzed the CEPH and Finnish SNP data of TaillonMiller et al. (2000), implementing a multivariate uniform prior for h_{ij} for each pair of markers analyzed. This distribution imposes a prior belief that none of the alleles is very rare [the implied prior for the p_{i} and q_{j} is Beta(2,2)], which is reasonable for these markers as they have been selected on the basis of polymorphism. Results are shown in Figure 5: posterior medians and 90% intervals for the two populations are plotted against physical distance.
The measure of variability provided by our MCMC approach allows more careful comparison of the levels of disequilibrium across the populations analyzed. For almost all of the marker pairs given in Table 2, the posterior 90% intervals from the two populations overlap substantially, indicating that there is little evidence of any difference across the populations. This is quantified in Table 2, which gives the posterior probability for each marker pair that D′ is larger in the Finnish population than in the CEPH population: these probabilities exceed 90% for only a handful of markers, and in no case exceed 97%. (Note the values of D′ across closely linked markers are not independent.)
Our results therefore quantify the observation made by TaillonMiller et al. (2000) that disequilibrium levels were similar across the populations. The data provide little evidence that gametic disequilibrium is higher in the Finnish population than in the general European population.
DISCUSSION
The direct probability, or Bayesian, approach developed here permits interpretable visual answers to the question of interest about disequilibrium parameters. Moreover, it can readily incorporate complex models and background knowledge about a population, when available. For a discussion of the advantages of Bayesian approaches to problems in genetics, see Shoemaker et al. (1999). We have also developed a family of hierarchical prior distributions that allow the scientist some flexibility in specifying background knowledge.
Zapata et al. (1997) note that point estimates of D′_{ij} are frequently reported without a corresponding measure of variability (such as the standard error), which can complicate comparisons over loci and populations. However, the calculation of Var(D′_{ij}) is complicated by the different rescaling of positive and negative values in the definition of D′_{ij}. Zapata et al. derived an approximation to Var(D′_{ij}) for biallelic loci only. Our direct probability approach provides an approximation not just for the variance of (multiallelic) D′_{ij} and D′ but for their entire posterior distributions. A particular advantage is that the posterior intervals obtained can be directly interpreted in terms of probabilities, unlike standard confidence intervals that are routinely provided for some point estimates, which do not have such a direct interpretation.
There are no theoretical limits to the number of loci that can be analyzed simultaneously. However, for a fixed sample size, the information contained in the data decreases as the number of loci increases, and, as for hypothesis testing, useful inferences are usually not feasible for more than about three loci.
APPENDIX
MCMC algorithm for genotype data: MetropolisHastings algorithms are methods for generating a sample from an arbitrary probability distribution Π (with probability density function π) by constructing a Markov chain whose stationary distribution is Π. If the current state of the chain is x, a candidate new state x′ is chosen with probability density q(x′x). The candidate is accepted with probability
For the algorithm implemented here, π is the joint posterior density function of the h_{ij}. For two loci, each candidate x′ differs from x at a randomly chosen pair of the h_{ij}, say h_{rs} and h_{wz}. A proposal value h′_{rs} is chosen uniformly in the interval
Slow convergence and poor mixing can arise in the presence of many alleles and/or loci. No difficulties were experienced with the examples discussed here that could not reasonably be overcome by choosing suitably large values for k and b. A burnin of b = 30,000 iterations was found to be adequate for the twolocus algorithm (50,000 for three loci), with every k = 200th (300) iteration output (these values having been determined by the inspection of sequential and autocorrelation plots of the output for initial runs). The output of each run underlying the figures and tables was analyzed with the MCMC diagnostic computer package CODA (Bestet al. 1995), which indicated satisfactory convergence and mixing properties. For each application of the MCMC algorithm to the data, 5000 values were output (1000 for the permuted data).
MCMC algorithm for haplotype data: For threelocus haplotype data, assuming a multinomial log likelihood for the h_{ijk} (given hyperparameters α, β, and γ), together with a Dirichlet prior distribution, after observing the n_{ijk} the h_{ijk} have a Dirichlet distribution with parameters n_{ijk} + λα_{i}β_{j}γ_{k}. A posterior sample from p({h_{ijk}}α, β, γ, N) can therefore be readily obtained using standard methods for the Dirichlet distribution (see, e.g., Appendix A of Gelmanet al. 1995). To obtain a distribution for the h_{ijk} that does not involve α, β, and γ, we can employ (6) together with a method of simulating from p(α, β, γN). A number of approaches are available, and details of an MCMC algorithm are given here.
The target distribution for the MCMC algorithm is p(α, β, γN), the probability density function of the hyperparameters α, β, and γ given the data N. The likelihood p(Nα, β, γ) is of a standard form known as the multinomialDirichlet (Smith and Bernardo 1994, p. 135), and by Bayes theorem we can write
A suitable MetropolisHastings algorithm can proceed as follows: first select a locus l, chosen uniformly at random. Suppose for notational convenience that α is the hyperparameter vector corresponding to the chosen locus l, then choose two elements of α, say α_{v} and α_{w}. The proposal
This algorithm, and modifications of it, can be useful in the location of disease loci via simple disequilibrium mapping. Briefly, under the assumption of a single disease mutation that arose sometime in the past, loci closest to the disease locus should exhibit higher levels of disequilibrium than those that are far away (e.g., Jorde 1995). Devlin and Risch (1995) detailed the use of point estimates of disequilibrium for inferring the closest of a number of biallelic markers (identifying the marker with the highest observed disequilibrium value as the closest). The direct probability approach, implementing the methods outlined above, adds a degree of interpretability to simple disequilibrium mapping, assigning probabilities to the event that a marker is closest to the disease locus—see Ayres (1998) for further details.
Multiallelic threelocus normalized measures: The following bounds apply to the D_{ij} for two loci in a threelocus system (disequilibrium measures for the other locus pairs are denoted D_{ik} and D_{jk}),
Acknowledgments
We thank the following for kindly providing the data analyzed in this study: John Buckleton (human STR data), Guiyun Yan (mosquito data), Patty TaillonMiller, and PuiYan Kwok (SNP data). We thank Laurent Excoffier for helpful comments on an earlier draft, as well as two anonymous referees. Work was supported in part by the UK Biotechnology and Biological Sciences Research Council, under grant 45/G09617.
Footnotes

Communicating editor: G. A. Churchill
 Received June 20, 1999.
 Accepted September 19, 2000.
 Copyright © 2001 by the Genetics Society of America