Decomposing Multilocus Linkage Disequilibrium
 Root Gorelick^{1} and
 Manfred D. Laubichler
 ^{1} Corresponding author: Department of Biology, Arizona State University, P.O. Box 871501, Tempe, AZ 85287-1501. Email: cycad{at}asu.edu
Abstract
We present a mathematically precise formulation of total linkage disequilibrium between multiple loci as the deviation from probabilistic independence and provide explicit formulas for all higher-order terms of linkage disequilibrium, thereby combining J. Dausset et al.'s 1978 definition of linkage disequilibrium with H. Geiringer's 1944 approach. We recursively decompose higher-order linkage disequilibrium terms into lower-order ones. Our greatest simplification comes from defining linkage disequilibrium at a single locus as allele frequency at that locus. At each level, decomposition of linkage disequilibrium is mathematically equivalent to number-theoretic compositions of positive integers; i.e., we have converted a genetic decomposition into a mathematical decomposition.
A precise measurement of linkage disequilibrium is required for studying virtually any phenomenon in multilocus population genetics. This is especially true for explicit multilocus models that investigate the contributions of physiological epistasis to additive genetic variance (Cheverud and Routman 1995; Wagner et al. 1998; Wagner and Laubichler 2000). Linkage disequilibrium is usually defined as the deviation from probabilistic independence between alleles at two different loci. This deviation from independence can have different causes, such as a lack of independent segregation or recombination, or any number of other evolutionary forces. The presence of linkage disequilibrium (gametic disequilibrium) is thus an indication that either stochastic (e.g., drift) or deterministic (e.g., selection, gene flow) evolutionary forces have been acting on a population (Hedrick 2000; Ardlie et al. 2002).
The classical definition of linkage disequilibrium, D, follows the probability theory definition of deviation from independence. Independence of two events, B and C, means that Pr(BC) = Pr(B) · Pr(C), where Pr is probability and BC is the joint occurrence of B and C, so that the deviation from independence is measured as D = Pr(BC) – Pr(B) · Pr(C). Changing notation slightly to let A_{k(i)} designate the kth allele at the ith locus gives the linkage disequilibrium between the alleles at two loci, D_{2}, as D_{2} = Pr(A_{k(1)}A_{k(2)}) – Pr(A_{k(1)}) · Pr(A_{k(2)}), where A_{k(1)}A_{k(2)} represents the joint occurrence of A_{k(1)} and A_{k(2)} in a single haploid gamete. In most modern interpretations of probability theory, the primitive concept of “probability” is interpreted as a relative frequency; therefore, Pr(A_{k(1)}) is the same as the frequency of allele k at locus 1.
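As a minimal sketch of the two-locus definition, the coefficient can be computed directly from a sample of haploid gametes by treating each probability as a relative frequency. The function name `two_locus_d`, the tuple encoding of gametes, and the sample counts are illustrative assumptions, not part of the original article.

```python
def two_locus_d(gametes, allele1, allele2):
    """Return D_2 = Pr(A1 A2) - Pr(A1) * Pr(A2) for one allele at each of two loci.

    gametes: list of (allele_at_locus_1, allele_at_locus_2) tuples,
             one tuple per sampled haploid gamete (hypothetical encoding).
    """
    n = len(gametes)
    # Single-locus allele frequencies (relative frequencies in the sample).
    p1 = sum(1 for g in gametes if g[0] == allele1) / n
    p2 = sum(1 for g in gametes if g[1] == allele2) / n
    # Frequency of gametes carrying both alleles together.
    p12 = sum(1 for g in gametes if g == (allele1, allele2)) / n
    return p12 - p1 * p2


# Invented sample of 10 gametes: AB and ab are over-represented.
sample = [("A", "B")] * 4 + [("A", "b")] * 1 + [("a", "B")] * 1 + [("a", "b")] * 4
d2 = two_locus_d(sample, "A", "B")  # approximately 0.4 - 0.5 * 0.5 = 0.15
```

A positive value indicates that the two alleles occur together more often than independence would predict; a sample in which the four gamete types are equally frequent gives zero.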
The quintessential examples of linkage disequilibrium are coadapted gene complexes, in which several loci are tightly linked because they provide a large selective advantage if they occur together. In these cases, linkage disequilibrium is maintained by selection. Although coadapted gene complexes are implicit in Wright's shifting-balance hypothesis (Wright 1931), have been used to explain outbreeding depression (Dobzhansky 1948; Lynch 1991), and are frequently cited as evolutionary hypotheses (Palopoli and Wu 1996; Rawson and Burton 2002), the linkage disequilibrium of these purported coadapted gene complexes is almost never quantified. This is particularly surprising given the well-cited article by Geiringer (1944), in which she provides most of the algorithm for computing higher-order linkage disequilibrium coefficients. In this article, we complete and simplify Geiringer's formulation and then show how the sums of products of those coefficients equal the definition of (total) linkage disequilibrium as the deviation from probabilistic independence given by Dausset et al. (1978).
Methodologically, we follow Geiringer's lead and decompose higher-order linkage disequilibrium into lower-order linkage disequilibrium terms. In other words, we take a top-down approach to defining multilocus linkage disequilibrium, rather than the bottom-up approach followed by virtually everyone since Geiringer (1944). Lewontin (1974) is typical of the bottom-up approach. There are very few other top-down decomposition approaches; examples include Bulmer's (1980) decomposition of multilocus epistasis and Wagner and Laubichler's (2000) character decomposition approach in population genetics.
In this article, we first define linkage disequilibrium at a single locus as the allele frequency at this locus, which greatly simplifies notation. Second, we extend the definition of linkage disequilibrium to multiple loci by invoking compositions of positive integers. Our decomposition of multilocus linkage disequilibrium is entirely consistent with the standard definitions for two loci, as well as its previous extensions to three, four, and six loci (Geiringer 1944; Bennett 1954; Hastings 1984). Third, we show how this definition is entirely consistent with the notion of linkage disequilibrium as the deviation from probabilistic independence.
DECOMPOSITION OF MULTILOCUS LINKAGE DISEQUILIBRIUM
Define the one-locus coefficient of linkage disequilibrium, D_{1}, as D_{1}(A_{k(i)}) = Pr(A_{k(i)}). This definition may appear paradoxical, but it dramatically simplifies notation for the decomposition of multilocus linkage disequilibrium. In elementary algebra we have the analogous problem of defining the algebraic expression x^{n} when n = 0 (Lakoff and Núñez 2000). Note that our definition of a locus encompasses protein-coding loci, quantitative trait loci, and even single nucleotides.
Following Hastings (1984), the formulas for two- and three-locus linkage disequilibrium, in which D_{1}(A_{k(i)}) is substituted for Pr(A_{k(i)}), are defined as
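In this notation the two formulas can be sketched as follows. The two-locus line restates the classical definition given in the introduction with D_{1} in place of each allele frequency. The three-locus line is the standard expansion over every way of splitting three loci into groups, and it is offered here as an assumption about the intended form rather than a quotation. A_{(i)} is shorthand for A_{k(i)}.

```latex
\begin{align*}
D_2\bigl(A_{(1)}A_{(2)}\bigr)
  &= \Pr\bigl(A_{(1)}A_{(2)}\bigr)
   - D_1\bigl(A_{(1)}\bigr)\,D_1\bigl(A_{(2)}\bigr), \\
D_3\bigl(A_{(1)}A_{(2)}A_{(3)}\bigr)
  &= \Pr\bigl(A_{(1)}A_{(2)}A_{(3)}\bigr)
   - D_1\bigl(A_{(1)}\bigr)\,D_2\bigl(A_{(2)}A_{(3)}\bigr)
   - D_1\bigl(A_{(2)}\bigr)\,D_2\bigl(A_{(1)}A_{(3)}\bigr) \\
  &\quad
   - D_1\bigl(A_{(3)}\bigr)\,D_2\bigl(A_{(1)}A_{(2)}\bigr)
   - D_1\bigl(A_{(1)}\bigr)\,D_1\bigl(A_{(2)}\bigr)\,D_1\bigl(A_{(3)}\bigr).
\end{align*}
```

In every product the subscripts of the D terms sum to the number of loci (1 + 1 = 2 in the first line; 1 + 2 = 3 and 1 + 1 + 1 = 3 in the second).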
Let D_{n} be the coefficient of linkage disequilibrium between n loci. Then the pattern here is that D_{n} = Pr(A_{k(1)}A_{k(2)} ... A_{k(n)}) minus all possible products of lower-order linkage disequilibrium coefficients, such that each term has all of its subscripts adding up to n. The key to writing down an explicit formula for D_{n} is that the phrase “all possibilities of the subscripts adding up to n” refers to partitions of the positive integer n (Andrews 1976). A partition π of a positive integer n is a set of positive integers that adds up to n; i.e., π = {n_{1}, n_{2}, ..., n_{m}} such that n_{1} + n_{2} + ... + n_{m} = n.
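The integer decompositions referred to here can be enumerated with a short sketch. It lists the partitions of n (order ignored) and, for comparison, the compositions of n named in the abstract (order retained). The function names `partitions` and `compositions` are illustrative choices, and how either enumeration maps onto particular groupings of loci is not addressed by this code.

```python
def partitions(n, max_part=None):
    """Yield each partition of n as a non-increasing tuple of positive integers."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    # Choose the largest part first, then partition the remainder
    # using parts no larger than the one just chosen.
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def compositions(n):
    """Yield each composition of n: an ordered tuple of positive integers summing to n."""
    if n == 0:
        yield ()
        return
    for first in range(1, n + 1):
        for rest in compositions(n - first):
            yield (first,) + rest


parts_of_4 = list(partitions(4))
# [(4,), (3, 1), (2, 2), (2, 1, 1), (1, 1, 1, 1)]
comps_of_3 = list(compositions(3))
# [(1, 1, 1), (1, 2), (2, 1), (3,)]
```

The single-part decomposition (n,) appears in both lists, and the number of compositions of n is 2^{n-1}, which grows much faster than the number of partitions.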
The only way to decompose n into a single positive integer is c = (n). Therefore, we can also write the highest-order coefficient of linkage disequilibrium as
Equation 1 has never been written explicitly for general multilocus linkage disequilibrium, even though special cases have been given by Geiringer (1944), Bennett (1954), and Hastings (1984). The only explicit definition previously given for multilocus linkage disequilibrium is due to Dausset et al. (1978),
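Reading total linkage disequilibrium as the deviation from probabilistic independence described in the introduction, a minimal sketch for any number of loci subtracts the product of the single-locus allele frequencies from the joint gamete frequency. That this matches the exact form used by Dausset et al. (1978) is an assumption based on that description; the name `total_ld` and the sample data are hypothetical.

```python
from math import prod


def total_ld(gametes, alleles):
    """Return Pr(A_1 A_2 ... A_n) minus the product of Pr(A_i) over all n loci.

    gametes: list of tuples, one allele per locus for each sampled haploid gamete.
    alleles: tuple naming the allele of interest at each locus, in locus order.
    """
    n_gametes = len(gametes)
    target = tuple(alleles)
    # Frequency of gametes carrying the chosen allele at every locus.
    joint = sum(1 for g in gametes if tuple(g) == target) / n_gametes
    # Allele frequency at each locus taken separately.
    marginals = [
        sum(1 for g in gametes if g[i] == a) / n_gametes
        for i, a in enumerate(target)
    ]
    return joint - prod(marginals)


# Invented three-locus sample with only two gamete types, i.e., complete association.
sample3 = [("A", "B", "C")] * 4 + [("a", "b", "c")] * 4
total = total_ld(sample3, ("A", "B", "C"))  # 0.5 - 0.5**3 = 0.375
```

With two loci this reduces to the classical D_{2}.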
We are now ready to derive the relationship between total linkage disequilibrium, as defined by Dausset et al. (1978), and the highest-order coefficient D_{n}. In Equation 3, substitute Σ_{all compositions c of n} [Π_{n_{i} ∈ c} D_{n_{i}}(...)] for Pr(A_{k(1)} ... A_{k(n)}) (see Equation 2), yielding
DISCUSSION
We have converted the genetics problem of decomposing linkage disequilibrium into the mathematical problem of decomposing positive integers into their additive parts, all while maintaining the convenient heuristic definition of total linkage disequilibrium as the deviation from independence. Unlike Geiringer (1944), we can write down an explicit formula for multilocus linkage disequilibrium because we invoke partitions of integers and define D_{1}(A) = Pr(A), thereby merging her notion of linkage disequilibrium with those of Dausset et al. (1978).
One immediate consequence of our decomposition approach is that the single highest-order coefficient of linkage disequilibrium, D_{n}, cannot be examined in isolation. Because
Multilocus definitions of linkage disequilibrium have not been used very often in empirical studies because of the large number of inputs and linkage disequilibrium coefficients that must be analyzed (2^{n} – 1). Currently, even third-order linkage disequilibrium is seldom measured (Thomson and Baur 1984). However, explicit terms for multilocus linkage disequilibrium are of theoretical importance.
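One reading of the 2^{n} – 1 count is that, for a fixed choice of one allele per locus, there is one coefficient for every non-empty subset of the n loci (single loci included, since D_{1} is an allele frequency). The sketch below enumerates those subsets under that assumption; `locus_subsets` is an illustrative name.

```python
from itertools import combinations


def locus_subsets(n_loci):
    """List every non-empty subset of loci 1..n_loci, smallest subsets first."""
    loci = range(1, n_loci + 1)
    return [
        subset
        for size in range(1, n_loci + 1)
        for subset in combinations(loci, size)
    ]


subsets3 = locus_subsets(3)
# [(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
n_coefficients = {n: len(locus_subsets(n)) for n in (2, 3, 4, 10)}
# {2: 3, 3: 7, 4: 15, 10: 1023}
```

The count doubles, plus one, with each added locus, which illustrates why empirical studies rarely go beyond two or three loci.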
One important theoretical application is the analysis of multilocus epistasis. Cheverud and Routman (1995) developed a two-locus model of physiological epistasis that has been further refined by Wagner et al. (1998). To analyze the evolutionary consequences of epistasis in these models, one has to first define linkage disequilibrium for a subset of the loci. Thus, to extend models of physiological epistasis to multiple loci, we must first define linkage disequilibrium for that subset of loci, which we have just done. Models of multilocus epistasis will be crucial in debates over what factors maintain coadapted gene complexes, increase additive genetic variance, and foster speciation (Goodnight 1988, 1995; Wade and Goodnight 1998).
Acknowledgments
We thank Phil Hedrick, Tom Dowling, and two anonymous reviewers for their helpful comments.
Footnotes

Communicating editor: M. W. Feldman
 Received November 26, 2003.
 Accepted December 18, 2003.
 Copyright © 2004 by the Genetics Society of America