## Abstract

A new measure of directional linkage disequilibrium is developed for detecting epistatic selection on interacting genes. Simulations show that by orienting the direction of linkage disequilibrium on the basis of the ancestral-derived status of alleles, the new measure indeed improves the power to detect a positive fitness interaction between two new mutations.

LINKAGE disequilibrium (LD) measures the nonrandom association of alleles at distinct loci. Because strong LD tends to be eroded in a recombining genome, significant deviation from random combination (linkage equilibrium) may be indicative of close proximity of a pair of loci along a chromosome (linkage) or fitness interactions among cosegregating variants (statistical epistasis). For a system of two loci with two alleles each (say, *A*/*a* and *B*/*b*), there are four possible combinations of alleles (or haplotypes), and a simple form of LD is expressed as the excess haplotype frequency over the expectation under linkage equilibrium (Lewontin and Kojima 1960),(for symbols and notations, see Table 1). Likewise, it can be expressed by different pairs of alleles; *e.g.*,andHere, it should be noted that(1)The relation (1) suggests that the direction of LD may be defined arbitrarily, depending on which of the allelic combinations we focus on. Classically, it has been proposed that alleles be labeled according to their abundance in the population, such that *p _{A}* >

*p*and

_{a}*q*>

_{B}*q*(Langley and Crow 1974; Langley

_{b}*et al.*1974). Alleles so defined, negative

*D*should commonly be observed at equilibrium if deleterious mutations interact synergistically, because we would then expect to find excess deleterious (hence rare) variants on common genetic backgrounds.

_{AB}The present study proposes an orientation of directional LD based on ancestor–descendant relationships of alleles and illustrates the applicability of the new directional measure in detecting epistatic interactions among cosegregating variants. We first focus on a process eventually leading to the joint fixation of two novel mutations *a* and *b*, which originated independently from their parental alleles *A* and *B*, respectively. If we further assume that the substitution process is facilitated by a positive fitness interaction between the two mutant alleles, we would observe an excess frequency of the double mutant *ab* and hence a transient formation of positive *D _{ab}* (=

*D*) in the process of double fixation. By properly utilizing the knowledge about ancestor–descendant relationships of the alleles, we expect that the directional departure from the neutral expectation would be more effectively detected. In contrast, when the appropriate information is not available, we have to rely solely on the absolute value, here denoted |

_{AB}*D*| (= |

*D*| = |

_{AB}*D*| = |

_{Ab}*D*| = |

_{aB}*D*|), while ignoring the direction of LD. The power of detecting epistatic effects would then be largely diminished.

_{ab}To characterize the biased nature inherent in the transient LD, we here introduce a directional measure of LD, defined as(2)The statistic (2), which specifically quantifies the relative excess of the mutant alleles in the coupling phase, also represents the correlation of allele frequencies between the two loci (*i.e.*, −1 ≤ *r** ≤ 1). Its squared value is therefore equivalent to an oft-used measure of LD,(3)(Hill and Robertson 1968), but we emphasize that by adopting the definition (2), the direction of LD is determined strictly by discriminating the ancestral *vs.* derived alleles at each of the loci.

In the following, we show by stochastic simulations that the statistical power to detect positive epistatic interactions is indeed enhanced by applying the directional measure *r**. Two forms of fitness interactions are considered:(4)and(5)(*s* > 0 in both cases). In the coadaptation model, two mutations *a* and *b* are individually neutral but together form a coadapted haplotype *ab* (*e.g.*, Takahasi and Tajima 2005), whereas in the compensation model, two individually deleterious mutations compensate each other when combined together (*e.g.*, Kimura 1985; Innan and Stephan 2001). Simulations proceed forward in time and follow haplotype frequency changes in a panmictic population of *N* diploids with discrete generations. Stochastic dynamics are simulated by the multinomial model of random genetic drift, where the expected frequencies in the next generation are given by the standard two-locus theory with selection (as represented by Equations 4 and 5) and recombination (at rate *c*). Each run of simulations starts from an initial population where only the first mutation *a* is segregating at frequency *p*_{0} in the absence of the allele *b*; namely, (*P _{AB}*,

*P*,

_{Ab}*P*,

_{aB}*P*) = (1 −

_{ab}*p*

_{0}, 0,

*p*

_{0}, 0) holds at the onset. A single copy of the second mutation

*b*is then introduced at frequency 1/(2

*N*) on a randomly chosen chromosome; no new mutations are introduced afterward (as in Takahasi and Tajima 2005). Subsequent evolution is traced until either of the mutant alleles is lost (

*i.e.*, either

*p*= 0 or

_{a}*q*= 0 is reached) or both reach fixation (

_{b}*P*= 1). To study the conditional process that ends up in the joint fixation of the two mutations, we concentrate only on the latter situation and discard the former. Simulations are repeated until 10

_{ab}^{6}sample paths leading to double fixation are generated for each set of parameters. On the basis of this collection, we study the temporal changes in LD after the introduction of the second mutation, assuming that the ancestral-derived status of the alleles is either known or unknown. To simulate the distribution of LD for situations without the ancestral information, we take the positive square root of the squared correlation (3),(6)which also represents an unsigned version of the directional measure (2). Further, to evaluate the power of detecting epistatic selection, the 5% significance levels with or without the ancestral information are determined on the basis of the distributions of LD under neutrality. The neutral expectations are obtained simply by assuming

*s*= 0 in the above simulations. This implies that also for the neutral process, we focus on the distribution conditional on double fixation. Below we report the results for

*p*

_{0}= 0.1, but qualitatively similar conclusions hold as long as the initial frequency

*p*

_{0}is not too high (

*e.g.*,

*p*

_{0}= 0.2; see below). Similarly, whereas the following analysis assumes a relatively small population size (

*N*= 100), this assumption is made solely to save computational time, and equivalent results apply equally to larger populations (assuming similar ranges of population scaled parameters

*Ns*and

*Nc*; results not shown).

Simulations show that conditional on double fixation, positive *r** may be formed shortly after the introduction of the second mutation *b*, even when the allelic combinations are selectively neutral (*i.e.*, *s* = 0 in the above simulations). As evolution proceeds, recombination breaks up the temporarily established positive *r** under neutrality, and the biased distribution rapidly becomes symmetrical about *r** = 0 (Figure 1A, distribution in solid lines). In contrast, when the mutations interact epistatically, positive *r** persists for a longer period of time despite the counteracting effect of recombination (Figure 1A, shaded distribution), until either of the mutant alleles becomes fixed. During this phase of evolution, the incorporation of allelic status into the analysis would allow us to successfully discriminate positive *r** caused by epistatic interactions from symmetrically distributed *r** under neutrality (Figure 1A). In contrast, the distinction is less clear when the unsigned measure |*r*| is used in the absence of the appropriate ancestral information (Figure 1B). By applying the directional measure *r**, the power of detecting positive epistatic interactions may be gained as much as 1.5-fold (Figures 2 and 3). Even under weak epistasis (*s* = 0.01), relatively high levels of detection power are retained for some time after the second mutational event, although the power quickly drops off and approaches the standard level of 5% when the ancestral information is lacking. It seems that higher LD is produced in the compensation model because of selection disfavoring the haplotypes *Ab* and *aB* that potentially reduce the absolute value of LD.

Overall, our simulations show that for moderate levels of linkage (*c* = 0.02 or 0.05 in the above examples), the power of detecting positive epistatic interactions may be substantially enhanced by explicitly taking the ancestral-derived status of alleles into account. They also suggest that when the linkage is too tight (*e.g.*, *c* ≤ 0.01), the relative power may not be increased because conditional on double fixation, positive *r** tends to persist even under neutrality. When the rates of recombination are much higher, discernible levels of LD would not arise in any case. We have also assumed a fixed value *p*_{0} = 0.1 for the initial frequency of the allele *a*. When the first mutation is initially more abundant (*e.g.*, *p*_{0} ≫ 0.2), the amount of LD will remain low during evolution, as this assumption implies that the evolutionary increase of the allele *a* is led primarily by random genetic drift in the absence of the second mutation *b*. Accordingly, the role played by the positive epistatic interaction in driving the joint increase of the two mutations, which is the prerequisite for the generation of significant LD, would be of minor importance (*cf.* Takahasi and Tajima 2005). We also found that analogous measures of directional LD (such as the simple measure of *D _{ab}*) successfully distinguished the peculiar pattern of LD under epistatic selection from the neutral expectation; in the above examples, the statistical power of

*D*was almost comparable to or slightly less than the corresponding power based on our new measure

_{ab}*r**. However,

*D*′ (Lewontin 1964) was an exception; because the absolute value of

*D*′ becomes unity when at least one of four possible haplotypes is absent, its distribution is more easily concentrated at the boundaries, even under neutrality.

Whereas the above analysis clearly illustrates the potential advantage of the directional measure, its direct application seems impractical because the analysis is based on a number of conditions that are generally unknown in practice; specifically, the simulations are conditioned on double fixation, and the performance of each LD measure is evaluated by conditioning on the time since the second mutational event (as in Figures 2 and 3). For practical purposes, we may wish to base our analysis on some property that can be directly observed or at least be inferred from the available data. To exemplify this possibility, we have conducted an additional series of simulations and investigated the relative power of *r** by conditioning on the observed frequencies of mutant alleles at a pair of loci (Tables 2 and 3).

In this new series of simulations, we consider a more realistic situation where mutant alleles are repeatedly introduced at both loci and the two-locus frequency dynamics are fully simulated without postulating a fixed initial frequency *p*_{0} as in the above case. In so doing, the recurrent mutation rates are assumed so low that when an allele is lost at one of the loci, a new mutant allele is immediately introduced to ensure that the alleles always segregate at the two loci. This procedure, which saves a considerable amount of computational time by skipping the waiting period for the next mutational event, does not affect our results because we are here interested only in those situations where the two loci are both polymorphic so that some degree of LD may persist between the loci. Moreover, in contrast to the above analysis conditioned on double fixation, the new simulations also make use of the mutant alleles that are eventually lost from the population when computing the distributions of LD. This is because in practice, we do not know which of the segregating alleles will be fixed in the future generations.

Every generation the mutant allele frequencies (*p _{a}* and

*q*) in addition to the LD measures (

_{b}*r** and |

*r*|) were recorded. The allele-frequency spaces were then partitioned (somewhat arbitrarily) into 10 classes with an equal interval of 0.1, and the pattern of LD was studied separately for each class of mutant allele frequencies. Excluding the two terminal classes 0.0 ∼ 0.1 and 0.9 ∼ 1.0, there are 36 [= 8 × (8 + 1)/2] two-locus frequency configurations. For each of these possible combinations of frequency classes, we first investigated the null distributions of LD (conditional on given allele-frequency combinations), assuming neutrality. On the basis of a total of 10

^{6}data sets collected separately for each of the 36 configurations, the 5% significance level conditional on mutant allele frequencies was independently determined. As depicted in Figure 4, the significance levels based on the directional measure

*r** are substantially affected by the two-locus frequency configuration. When the mutant allele frequencies at the two loci are nearly equal [as for the frequency combination (0.8 ∼ 0.9, 0.8 ∼ 0.9) in Figure 4], the 5% significance level grows rapidly as the linkage between the loci becomes tight, and even under neutrality it reaches the maximum value of unity roughly when

*Nc*≤ 1. In contrast, when the frequency difference between the loci is large [as for the frequency combinations (0.8 ∼ 0.9, 0.1 ∼ 0.2) in Figure 4], the neutral threshold is kept low for a wide range of recombination rates.

In a similar manner, simulations with the selection models (4) and (5) were then conducted to evaluate the power of our new measure *r** relative to its unsigned version |*r*|. As shown in Tables 2 and 3, the power of detecting epistatic interactions may be gained substantially by adding the ancestral information, especially when at one of the loci the mutant allele has attained a relatively high frequency (0.6 ∼ 0.9), while it is kept low (0.2 ∼ 0.4) at the other locus. Under these unbalanced frequency configurations, the detection power of the unsigned measure |*r*| may be negligibly small, whereas relatively high detection power is retained by using the directional measure *r**. Consequently, when the coadaptation model of epistatic selection is assumed, the maximum increase in power reaches almost 50-fold with the parameter values studied here (*s* = 0.05 and *c* = 0.02; Table 2). Under the compensation model, the detection power may be enhanced even more substantially in spite of a smaller selection coefficient (*s* = 0.02 with *c* = 0.02; Table 3). Although we here show only a limited set of simulation results to demonstrate the improved performance of the directional measure *r**, its superiority over the unsigned measure |*r*| was confirmed for a much wider range of parameters under the two selection models.

Note that the test described above involves some additional parameters and conditions that must also be inferred from the available data (*e.g.*, ancestor–descendant relationship of alleles, linkage phase in the case of diploid organisms, recombination rate, and so on). It seems almost certain that the misinference of any of these factors would cause a bias in the LD-based test. Especially, an erroneous specification of the allelic status might occur when an appropriate outgroup sequence is not available or at sites with high mutation rates. Such misinference would cause a reduction in the relative performance of *r**; that is, the advantage of using a directional measure is reduced. Indeed, additional simulations have found that when the ancestral-derived status of the alleles is incorrectly assigned at one of the two loci, the detection power of *r** is almost completely lost. Because the detection power decreases linearly with the rate of erroneous inference, the relationship between the error rate and the relative power is as in Figure 5, which illustrates the results for the coadaptation model; the compensation model yields essentially the same pattern. These results imply that the availability of reliable outgroup information is a key to an apt application of our directional measure.

Moreover, as with any other LD-based methods, an incorrect inference of the linkage relationship may cause a serious error in the above test, since the expected level of LD under neutrality is closely related to the degree of linkage between loci (*i.e.*, recombination rate). This is especially so when the difference in mutant allele frequencies between the two loci is small [as for (0.8 ∼ 0.9, 0.8 ∼ 0.9); see Figure 4] such that the threshold level changes radically depending on the population scaled recombination rate; a precise estimation of the (scaled) recombination rate would then be a prerequisite to carry out a proper test. However, when the frequency difference is large, an erroneous estimation of the recombination rate would not be as serious an issue since the neutral threshold is nearly invariant across a wide range of recombination rates [as for (0.8 ∼ 0.9, 0.1 ∼ 0.2); see Figure 4]; in other words, sufficient power would be retained even if we accept a conservative estimate of recombination rate. Notably, this is exactly when the relative advantage of using the directional measure *r** becomes rather substantial, as demonstrated above (see Tables 2 and 3).

The present study demonstrates that when LD-based methods are used to detect epistatic interactions among cosegregating variants, there is a potential advantage in properly polarizing the direction of LD by additionally considering the ancestor–descendant relationships of alleles. Besides the difficulties associated with the ancestral inference, there still remain many practical problems as discussed above, but they are equally shared by any other LD-based tests, and it seems a general tendency that the test performance is improved by the directional measure. Since the relative performance of *r** depends on the frequencies of the interacting variants (as indicated in Tables 2 and 3), in practice we may first wish to look for a pair of loci (or nucleotide sites) that shows a considerable amount of directional LD for a given combination of mutant allele frequencies at the two loci. By applying this primary screening to available polymorphism data, it would then be easier to choose the candidate pairs of loci that may be subject to further functional assays. Although there have been only a few suggested examples of positive interlocus interactions among naturally occurring molecular variants (*e.g.*, Rawson and Burton 2002; Caicedo *et al.* 2004), we expect that the effective search for epistatic interactions in the wild would be much facilitated by explicitly incorporating the ancestral-derived status of the segregating alleles into consideration.

## Acknowledgments

The authors thank two anonymous reviewers for comments. This work was partly supported by grants from the Japan Society for the Promotion of Science.

## Footnotes

↵1 These authors contributed equally to this work.

Communicating editor: M. W. Feldman

- Received December 10, 2007.
- Accepted April 21, 2008.

- Copyright © 2008 by the Genetics Society of America