Decomposing Multilocus Linkage Disequilibrium
- Root Gorelick¹ and
- Manfred D. Laubichler
- ¹ Corresponding author: Department of Biology, Arizona State University, P.O. Box 871501, Tempe, AZ 85287-1501. E-mail: cycad{at}asu.edu
Abstract
We present a mathematically precise formulation of total linkage disequilibrium between multiple loci as the deviation from probabilistic independence and provide explicit formulas for all higher-order terms of linkage disequilibrium, thereby combining J. Dausset et al.'s 1978 definition of linkage disequilibrium with H. Geiringer's 1944 approach. We recursively decompose higher-order linkage disequilibrium terms into lower-order ones. Our greatest simplification comes from defining linkage disequilibrium at a single locus as allele frequency at that locus. At each level, decomposition of linkage disequilibrium is mathematically equivalent to number theoretic compositions of positive integers; i.e., we have converted a genetic decomposition into a mathematical decomposition.
A precise measurement of linkage disequilibrium is required for studying virtually any phenomenon in multilocus population genetics. This is especially true for explicit multilocus models that investigate the contributions of physiological epistasis to additive genetic variance (Cheverud and Routman 1995; Wagner et al. 1998; Wagner and Laubichler 2000). Linkage disequilibrium is usually defined as the deviation from probabilistic independence between alleles at two different loci. This deviation from independence can have different causes, such as a lack of independent segregation or recombination, or any number of other evolutionary forces. The presence of linkage disequilibrium (gametic disequilibrium) is thus an indication that either stochastic (e.g., drift) or deterministic (e.g., selection, gene flow) evolutionary forces have been acting on a population (Hedrick 2000; Ardlie et al. 2002).
The classical definition of linkage disequilibrium, D, follows the probability theory definition of deviation from independence. Independence of two events, B and C, means that Pr(BC) = Pr(B) · Pr(C), where Pr is probability and BC is the joint occurrence of B and C, so that the deviation from independence is measured as D = Pr(BC) – Pr(B) · Pr(C). Changing notation slightly to let Ak(i) designate the kth allele at the ith locus gives the linkage disequilibrium between the alleles at two loci, D2, as D2 = Pr(Ak(1)Ak(2)) – Pr(Ak(1)) · Pr(Ak(2)), where Ak(1)Ak(2) represents the joint occurrence of Ak(1) and Ak(2) in a single haploid gamete. In most modern interpretations of probability theory, the primitive concept of “probability” is interpreted as a relative frequency; therefore, Pr(Ak(1)) is the same as the frequency of allele k at locus 1.
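As a small numerical illustration of the two-locus definition, the following sketch computes D2 from a table of haploid gamete frequencies. The frequencies, the allele labels, and the function name `two_locus_d` are hypothetical and serve only as an example.

```python
def two_locus_d(gametes, allele1, allele2):
    """Return D2 = Pr(a1 a2) - Pr(a1) * Pr(a2) for one allele at each of two loci.

    gametes maps (allele_at_locus_1, allele_at_locus_2) to a relative frequency.
    """
    p_joint = gametes.get((allele1, allele2), 0.0)
    # Marginal allele frequencies are obtained by summing over the other locus.
    p1 = sum(f for (a, _), f in gametes.items() if a == allele1)
    p2 = sum(f for (_, b), f in gametes.items() if b == allele2)
    return p_joint - p1 * p2


# Hypothetical gamete frequencies (they sum to 1).
gametes = {("A", "B"): 0.4, ("A", "b"): 0.1, ("a", "B"): 0.2, ("a", "b"): 0.3}

d_AB = two_locus_d(gametes, "A", "B")  # coupling gamete
d_Ab = two_locus_d(gametes, "A", "b")  # repulsion gamete
```

With these made-up frequencies the coupling and repulsion gametes give values of equal magnitude and opposite sign, as expected for two biallelic loci.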
The quintessential examples of linkage disequilibrium are coadapted gene complexes, in which several loci are tightly linked because they provide a large selective advantage if they occur together. In these cases, linkage disequilibrium is maintained by selection. Although coadapted gene complexes are implicit in Wright's shifting-balance hypothesis (Wright 1931), have been used to explain outbreeding depression (Dobzhansky 1948; Lynch 1991), and are frequently cited as evolutionary hypotheses (Palopoli and Wu 1996; Rawson and Burton 2002), the linkage disequilibrium of these purported coadapted gene complexes is almost never quantified. This is particularly surprising given the well-cited article by Geiringer (1944), in which she provides most of the algorithm for computing higher-order linkage disequilibrium coefficients. In this article, we complete and simplify Geiringer's formulation and then show how the sums of products of those coefficients equal the definition of (total) linkage disequilibrium as the deviation from probabilistic independence given by Dausset et al. (1978).
Methodologically, we follow Geiringer's lead and decompose higher-order linkage disequilibrium into lower-order linkage disequilibrium terms. In other words, we take a top-down approach to defining multilocus linkage disequilibrium, rather than the bottom-up approach followed by virtually everyone since Geiringer (1944). Lewontin (1974) is typical of the bottom-up approach. There are very few other top-down decomposition approaches; examples include Bulmer's (1980) decomposition of multilocus epistasis and Wagner and Laubichler's (2000) character decomposition approach in population genetics.
In this article, we first define linkage disequilibrium at a single locus as the allele frequency at this locus, which greatly simplifies notation. Second, we extend the definition of linkage disequilibrium to multiple loci by invoking compositions of positive integers. Our decomposition of multilocus linkage disequilibrium is entirely consistent with the standard definitions for two loci, as well as its previous extensions to three, four, and six loci (Geiringer 1944; Bennett 1954; Hastings 1984). Third, we show how this definition is entirely consistent with the notion of linkage disequilibrium as the deviation from probabilistic independence.
DECOMPOSITION OF MULTILOCUS LINKAGE DISEQUILIBRIUM
Define the one-locus coefficient of linkage disequilibrium, D1, as D1(Ak(i)) = Pr(Ak(i)). This definition may appear paradoxical, but it dramatically simplifies notation for the decomposition of multilocus linkage disequilibrium. In elementary algebra we have the analogous problem of defining the algebraic expression x^n when n = 0 (Lakoff and Núñez 2000). Note that our definition of a locus encompasses protein-coding loci, quantitative trait loci, and even single nucleotides.
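Following Hastings (1984), the formulas for two- and three-locus multilocus linkage disequilibrium, in which D1(Ak(i)) was substituted for Pr(Ak(i)), are defined as
\[ D_2(A_k(1)A_k(2)) = \Pr(A_k(1)A_k(2)) - D_1(A_k(1))\,D_1(A_k(2)) \]
and
\[ \begin{aligned} D_3(A_k(1)A_k(2)A_k(3)) = {} & \Pr(A_k(1)A_k(2)A_k(3)) - D_1(A_k(1))\,D_2(A_k(2)A_k(3)) \\ & - D_1(A_k(2))\,D_2(A_k(1)A_k(3)) - D_1(A_k(3))\,D_2(A_k(1)A_k(2)) \\ & - D_1(A_k(1))\,D_1(A_k(2))\,D_1(A_k(3)). \end{aligned} \]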
Let Dn be the coefficient of linkage disequilibrium between n loci. Then the pattern here is that Dn = Pr(Ak(1)Ak(2) ... Ak(n)) minus all possible products of lower-order linkage disequilibrium coefficients, such that each term has all of its subscripts adding up to n. The key to writing down an explicit formula for Dn is that the phrase “all possibilities of the subscripts adding up to n” refers to partitions of the positive integer n (Andrews 1976). A partition π of a positive integer n is a set of positive integers that adds up to n; i.e., π = {n1, n2, ..., nm} such that n1 + n2 + ... + nm = n.
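To make the number-theoretic objects concrete, the following sketch enumerates the ways of writing n as a sum of positive integers, both with order ignored (partitions) and with order counted (compositions). The function names are hypothetical, and the code only illustrates the counting; it is not part of the derivation.

```python
def partitions(n, largest=None):
    """Yield partitions of n as non-increasing tuples of positive integers."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    # Choose the largest part first so that each partition appears once.
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def compositions(n):
    """Yield compositions of n: ordered tuples of positive integers summing to n."""
    if n == 0:
        yield ()
        return
    for first in range(1, n + 1):
        for rest in compositions(n - first):
            yield (first,) + rest


parts_of_3 = list(partitions(3))
comps_of_3 = list(compositions(3))
```

For n = 3 the partitions are (3), (2, 1), and (1, 1, 1), whereas the compositions also distinguish (1, 2) from (2, 1). In both cases the single-part decomposition (n) appears exactly once.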
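The only way to decompose n into a single positive integer is c = (n), and the corresponding term is Dn itself. Therefore, moving the lower-order products to the other side, we can also write the highest-order coefficient of linkage disequilibrium as one term of a sum over all compositions,
\[ \Pr(A_k(1)A_k(2)\cdots A_k(n)) = \sum_{\text{all compositions } c \text{ of } n} \; \prod_{n_i \in c} D_{n_i}(\ldots). \tag{2} \]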
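Equation 1 has never been written explicitly for general multilocus linkage disequilibrium, even though special cases have been given by Geiringer (1944), Bennett (1954), and Hastings (1984). The only explicit definition previously given for multilocus linkage disequilibrium is the total linkage disequilibrium of Dausset et al. (1978), the deviation of the joint probability from the product of the allele frequencies,
\[ \mathcal{D}_n = \Pr(A_k(1)A_k(2)\cdots A_k(n)) - \prod_{i=1}^{n} \Pr(A_k(i)). \tag{3} \]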
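We are now ready to derive the relationship between the total linkage disequilibrium of Equation 3, $\mathcal{D}_n$, and the coefficients Dn. In Equation 3, substitute $\sum_{\text{all compositions } c \text{ of } n} \prod_{n_i \in c} D_{n_i}(\ldots)$ for Pr(Ak(1) ... Ak(n)) (see Equation 2), yielding
\[ \mathcal{D}_n = \sum_{\text{all compositions } c \text{ of } n} \; \prod_{n_i \in c} D_{n_i}(\ldots) - \prod_{i=1}^{n} D_1(A_k(i)) = \sum_{c \neq (1,1,\ldots,1)} \; \prod_{n_i \in c} D_{n_i}(\ldots). \]
The last equality holds because the composition (1, 1, ..., 1) contributes exactly the product of the one-locus coefficients, D1(Ak(i)) = Pr(Ak(i)).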
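The relationship can be checked numerically for three loci. The sketch below assumes the three-locus expansion in which D3 is the joint probability minus every product of lower-order coefficients; the haplotype frequencies and the helper `marginal` are hypothetical, introduced only for this illustration.

```python
from itertools import product


def marginal(freqs, loci, alleles):
    """Probability that a gamete carries the given alleles at the given loci."""
    return sum(f for hap, f in freqs.items()
               if all(hap[i] == a for i, a in zip(loci, alleles)))


# Hypothetical haplotype frequencies for three biallelic loci (alleles 0 and 1).
weights = [6, 1, 2, 3, 1, 2, 1, 4]
freqs = {hap: w / 20 for hap, w in zip(product((0, 1), repeat=3), weights)}

target = (1, 1, 1)  # the allele examined at each locus

# One-locus coefficients are allele frequencies: D1 = Pr(A).
d1 = [marginal(freqs, (i,), (target[i],)) for i in range(3)]

# Two-locus coefficients: D2 = Pr(pair) - D1 * D1.
d2 = {}
for i, j in ((0, 1), (0, 2), (1, 2)):
    d2[i, j] = marginal(freqs, (i, j), (target[i], target[j])) - d1[i] * d1[j]

# Three-locus coefficient: joint probability minus all lower-order products.
p_joint = marginal(freqs, (0, 1, 2), target)
d3 = (p_joint
      - d1[0] * d2[1, 2] - d1[1] * d2[0, 2] - d1[2] * d2[0, 1]
      - d1[0] * d1[1] * d1[2])

# Total disequilibrium as the deviation from independence.
total = p_joint - d1[0] * d1[1] * d1[2]

# The same quantity rebuilt from every product except D1 * D1 * D1.
rebuilt = d3 + d1[0] * d2[1, 2] + d1[1] * d2[0, 2] + d1[2] * d2[0, 1]
```

Up to floating-point rounding, `total` and `rebuilt` agree for any choice of frequencies, because the identity is algebraic rather than a property of the example data.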
DISCUSSION
We have converted the genetics problem of decomposing linkage disequilibrium into the mathematical problem of decomposing positive integers into their additive parts, all while maintaining the convenient heuristic definition of total linkage disequilibrium as the deviation from independence. Unlike Geiringer (1944), we can write down an explicit formula for multilocus linkage disequilibrium because we invoke partitions of integers and define D1(A) = Pr(A), thereby merging her notion of linkage disequilibrium with those of Dausset et al. (1978).
One immediate consequence of our decomposition approach is that the single highest-order coefficient of linkage disequilibrium, Dn, cannot be examined in isolation. Because
Multilocus definitions of linkage disequilibrium have not been used very often in empirical studies because of the large number of inputs and linkage disequilibrium coefficients that must be analyzed (2^n – 1). Currently, even third-order linkage disequilibrium is seldom measured (Thomson and Baur 1984). However, explicit terms for multilocus linkage disequilibrium are of theoretical importance.
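The growth in the number of coefficients can be illustrated by listing the subsets of loci involved. The sketch below assumes one coefficient per nonempty subset of loci for a fixed choice of allele at each locus; the function name `locus_subsets` is hypothetical.

```python
from itertools import combinations


def locus_subsets(n):
    """Return every nonempty subset of loci 1..n as a tuple of locus indices."""
    loci = range(1, n + 1)
    return [s for size in range(1, n + 1) for s in combinations(loci, size)]


# Number of subsets, and hence coefficients under the stated assumption.
counts = {n: len(locus_subsets(n)) for n in (2, 3, 6)}
```

Under this assumption the count roughly doubles with each added locus, which is consistent with how rarely coefficients beyond the second order are estimated in practice.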
One important theoretical application is the analysis of multilocus epistasis. Cheverud and Routman (1995) developed a two-locus model of physiological epistasis that has been further refined by Wagner et al. (1998). To analyze the evolutionary consequences of epistasis in these models, one has to first define linkage disequilibrium for a subset of the loci. Thus, to extend models of physiological epistasis to multiple loci, we must first define linkage disequilibrium for that subset of loci, which we have just done. Models of multilocus epistasis will be crucial in debates over what factors maintain coadapted gene complexes, increase additive genetic variance, and foster speciation (Goodnight 1988, 1995; Wade and Goodnight 1998).
Acknowledgments
We thank Phil Hedrick, Tom Dowling, and two anonymous reviewers for their helpful comments.
Footnotes
Communicating editor: M. W. Feldman
- Received November 26, 2003.
- Accepted December 18, 2003.
- Copyright © 2004 by the Genetics Society of America