Originally published as Genetics Published Articles Ahead of Print on February 1, 2006.

Genetics, Vol. 172, 2647-2663, April 2006, Copyright © 2006
doi:10.1534/genetics.105.050179

The Hitchhiking Effect on Linkage Disequilibrium Between Linked Neutral Loci

* Section of Evolutionary Biology, Biocenter, University of Munich, 82152 Planegg-Martinsried, Germany and {dagger} Department of Computer Sciences and {ddagger} Section of Evolution and Ecology, University of California, Davis, California 95616

1 Corresponding author: University of Munich, Grosshaderner Strasse 2, 82152 Planegg, Germany.
E-mail: stephan{at}zi.biologie.uni-muenchen.de

Manuscript received August 27, 2005. Accepted for publication January 30, 2006.

ABSTRACT

We analyzed a three-locus model of genetic hitchhiking with one locus experiencing positive directional selection and two partially linked neutral loci. Following the original hitchhiking approach by Maynard Smith and Haigh, our analysis is purely deterministic. In the first half of the selected phase after a favored mutation has entered the population, hitchhiking may lead to a strong increase of linkage disequilibrium (LD) between the two neutral sites if both are <0.1s away from the selected site (where s is the selection coefficient). In the second half of the selected phase, the main effect of hitchhiking is to destroy LD. This occurs very quickly (before the end of the selected phase) when the selected site is between both neutral loci. This pattern cannot be attributed to the well-known variation-reducing effect of hitchhiking but is a consequence of secondary hitchhiking effects on the recombinants created in the selected phase. When the selected site is outside the neutral loci (which are, say, <0.1s apart), however, a fast decay of LD is observed only if the selected site is in the immediate neighborhood of one of the neutral sites (i.e., if the recombination rate r between the selected site and one of the neutral sites satisfies Formula). If the selected site is far away from the neutral sites (say, r > 0.3s), the decay rate of LD approaches that of neutrality. Averaging over a uniform distribution of initial gamete frequencies shows that the expected LD at the end of the hitchhiking phase is driven toward zero, while the variance is increased when the selected site is well outside the two neutral sites. When the direction of LD is polarized with respect to the more common allele at each neutral site, hitchhiking creates more positive than negative linkage disequilibrium. Thus, hitchhiking may have a distinctively patterned LD-reducing effect, in particular near the target of selection.


GENETIC drift is recognized as a fundamental stochastic force shaping the polymorphism within populations and divergence between species of both neutral and selected variants at a locus (FISHER 1930; WRIGHT 1931; KIMURA 1983). Remarkably similar patterns to those caused by genetic drift are predicted by theoretical models incorporating stochastically varying selection (GILLESPIE 1994). In 1974 Maynard Smith and Haigh analyzed a simple and obvious extension of the genetic drift models, incorporating the stochastic coupling due to linkage with another locus undergoing strong directional selection. Addressing the apparent uniformity of allozyme polymorphism across species, MAYNARD SMITH and HAIGH (1974) focused on the "hitchhiking effect" on allelic frequencies and heterozygosity.

Since this seminal work the hitchhiking effect associated with positive directional selection has been extended to address observations in the emerging field of molecular population genetics (AGUADÉ et al. 1989; STEPHAN and LANGLEY 1989; BEGUN and AQUADRO 1992). These studies revealed low levels of DNA sequence polymorphism in Drosophila in genomic regions of low crossing over and led to theoretical analyses of the hitchhiking effect on nucleotide diversity (KAPLAN et al. 1989) and on the frequency spectrum of polymorphisms (BRAVERMAN et al. 1995). Whether formulated in terms of gamete frequencies or in the coalescent framework, the genetic models dealt with a pair of loci: a selected and a linked neutral locus with a defined rate of recombination between them. Independent of whether single or recurrent hitchhiking events were analyzed (as a stochastic process or a deterministic approximation), the conclusion has consistently been that hitchhiking should have profound effects on expected heterozygosity and allele frequencies at the neutral locus, if selection is strong and linkage is tight.

In 1977 Thomson considered the impact of hitchhiking on linkage disequilibrium (LD). Most of her analysis focused on the association between the selected alleles and those at a single neutral locus. Thomson also considered the impact of hitchhiking (of a heterotic polymorphism) on the LD between alleles at two linked neutral loci. On the basis of numerical examples she concluded that hitchhiking creates LD between neutral loci within the same genomic domain in which it affects levels of heterozygosity. The impact of simple directional selection of rare variants rapidly going to fixation was not considered by THOMSON (1977). Except for this initial study and a few rather targeted applications (ROBINSON et al. 1991; GROTE et al. 1998), no attempts have been made to extend this simple two-locus hitchhiking model of MAYNARD SMITH and HAIGH (1974) to multiple loci. This is curious, as the hitchhiking effect has played a major role in molecular population genetics for >15 years.

On the basis of the analyses of Thomson and her colleagues, it has been believed that hitchhiking affects not only polymorphisms at individual sites (or loci), but also the association between polymorphisms. Using a three-locus model with one selected and two neutral loci, she showed that hitchhiking that has a strong effect on heterozygosity can also generate strong LD between the neutral loci. Without paying close attention to Thomson's writing, the important role of hitchhiking in generating LD has been reiterated in textbooks and publications by many authors. It was not until recently that a few authors reported quite the opposite results (GILLESPIE 1997; KIM and STEPHAN 2002; KIM and NIELSEN 2004). Analyzing "shift" models that are similar to the hitchhiking model described above, GILLESPIE (1997, p. 297) concluded that "linked selection can reduce variation without building up high levels of linkage disequilibrium, contrary to our intuition." These latter studies focused on average effects observable in simulated data. In small-sample coalescent simulations, KIM and NIELSEN (2004) found increased LD between alleles at two neutral loci on the same side of the selected locus at the time of fixation and reduced LD across the site of selection. Furthermore, they provide heuristic arguments to explain this pattern. These different and somewhat contradictory views of the relationship between genetic hitchhiking and LD motivated us to pursue an analytic investigation of the question. We followed the deterministic approach of MAYNARD SMITH and HAIGH (1974), extending their model to three loci as did THOMSON (1977). To analyze this model, we used the framework of BARTON and TURELLI (1991), which provides a natural setting for the investigation of the directional hitchhiking, yielding transparent mathematical expressions that illuminate the rather surprising dynamics of LD under the hitchhiking effect.


THE THREE-LOCUS HITCHHIKING MODEL
We consider a three-locus model with two neutral loci and one selected locus. For each locus, we assume that there are only two allele types, denoted by 0 and 1. The selected locus may be between the two neutral ones or on either side of them (see Figure 1). We denote by L and R the left and right neutral loci, respectively, and by S the selected locus. The corresponding recombination fractions between loci are rLR, etc. We assume positive directional selection according to the following fitness scheme:

Formula
where s is the selection coefficient of the selected allele (type 1) and h the dominance coefficient. Note that we follow here the notation of MAYNARD SMITH and HAIGH (1974), not the definition of KAPLAN et al. (1989). Effective population size Ne is assumed to be very large (Formula), such that a deterministic analysis of the model is appropriate.


Figure 1
View larger version (3K):
In this window
In a new window
Download PPT slide
 
FIGURE 1.—

Three possible configurations. The selected locus is denoted by S, whereas the left and the right neutral loci are denoted by L and R, respectively. The recombination rate rLR between L and R is given by adding or subtracting rLS and rRS, depending on the configuration.

 

The system of full recursions:

To derive the recursions for the marginal (type 1) allele frequencies pL, pR, and pS at the three loci and the LDs (central moments) of the second (CLR, CLS, CRS) and third (CLRS) orders measured with respect to type 1 alleles, we follow the approach of BARTON and TURELLI (1991). In that approach, coefficients that appear in the recursions are recombination rates and generalized selection coefficients, the latter denoted by Formula and Formula, where X and Y are nonempty subsets of loci. (Generalized selection coefficients are defined as coefficients that appear in expressing the relative fitness in terms of certain quantities related to LDs.) In our case, nonzero selection coefficients that appear in the recursions at time t are

Formula
For h = Formula, these simplify to

Formula

Following BARTON and TURELLI (1991), we define rLRS = rLS,R + rRS,L + rLR,S and use rX,Y to denote the rate of recombination events that partition the loci into two nonempty sets X and Y. In our work, we ignore double-crossover recombination events, so we define rX,Y = 0 if the partition X, Y corresponds to a double-crossover event. For example, if the selected locus S is between the neutral loci, then rLS,R = rRS, rRS,L = rLS, and rLR,S = 0. Further, we assume that recombination rates are additive. The recombination rate rLR is therefore given by adding or subtracting rLS and rRS, depending on the configuration (see Figure 1).

To simplify notation, we hereafter omit writing the dependence on time t. We define qS = 1 – pS, qL = 1 – pL, and qR = 1 – pR. Marginal allele frequencies satisfy the following recursions:

Formula 1(1)

Formula 2(2)

Formula 3(3)
The LDs satisfy the recursions

Formula 4(4)

Formula 5(5)

Formula 6(6)

Formula 7(7)
where

Formula 8(8)

Formula 9(9)

Formula 10(10)

Formula 11(11)
with

Formula 11
The above general recursions apply to all three configurations shown in Figure 1. Depending on the particular configuration being considered, recombination rates rX,Y need to be defined appropriately.

The system of truncated recursions:

We explore the behavior of the recursion Equations 17 in the region Formula 11 and Formula 11 (see MAYNARD SMITH and HAIGH 1974). Here, r may be any recombination parameter appearing in the Equations 17. Keeping only the terms linear in s or r leads to the following set of recursions:

Formula 12(12)

Formula 13(13)

Formula 14(14)

Formula 15(15)

Formula 16(16)

Formula 17(17)

Formula 18(18)
These equations agree to first order in r and s with those of THOMSON (1977) (compare her Equations 30iii, 30iv, and 31). Since the hitchhiking effect can be best observed when Formula 18 (such that simultaneously Formula 18 holds), we may approximate the truncated recursions by the following ordinary differential equations (ODEs; see MAYNARD SMITH and HAIGH 1974):

Formula 19(19)

Formula 20(20)

Formula 21(21)

Formula 22(22)

Formula 23(23)

Formula 24(24)

Formula 25(25)
Here we have introduced time t (in generations) into the equation for pS and parameterized the other quantities by pS (which is a monotonically increasing function of t).

Structure of equations:

Several features of the dynamical system become apparent from these two sets of equations. Most importantly, selection acts on the alleles at the neutral loci indirectly and in a strictly hierarchical fashion: on the marginal neutral allele frequencies via the pairwise LDs CLS and CRS, on the LD between the neutral sites by the third moment, and on the latter by a fourth-order term (i.e., the product of the two pairwise LDs CLS and CRS).

Analytical solutions when the selected locus is between the neutral loci:

The ODEs for the LDs are first-order linear differential equations. The ODEs for the pairwise LDs CLS and CRS are homogeneous. The equations for CLR and CLRS contain the higher-order moments as inhomogeneous terms that act as "driving forces" of the dynamics. However, except for this impact of the higher-order terms, the equations are decoupled and can be solved successively. We have the following results.

The frequency of the selected allele at locus S is

Formula 26(26)
whereas marginal allele frequencies at the neutral loci are

Formula 27(27)

Formula 28(28)
where

Formula 29(29)
which corresponds to the heterozygosity at the selected locus. The LDs CLS(t) and CRS(t) can be written as

Formula 30(30)

Formula 31(31)

Given these solutions for CLS(t) and CRS(t), the coupled ODEs (24) and (25) admit simple exact solutions when the selected locus is between the two neutral loci. More specifically, the third-order LD is given by

Formula 32(32)
and the LD between the neutral loci can be written as

Formula 33(33)
where rLR = rLRS = rLS + rRS.

Analytical solutions when the selected locus is outside the neutral loci:

Solutions to the ODEs (19)–(23) do not depend on whether the selected locus is inside or outside the two neutral loci. For example, the allele frequency pS(t) and the LDs CLS(t) and CRS(t) are given by (26), (30), and (31), respectively, in all cases. However, the dynamics of the ODEs (24) and (25) for CLR(t) and CLRS(t), respectively, depend crucially on the position of the selected locus S with respect to the neutral loci L and R. As we elaborate presently, the dynamics of CLR(t) when S is between L and R exhibit radically different behavior than when S is outside.

In what follows, suppose that S is to the right of R, which implies rLS = rLRS = rLR + rRS. The case where S is to the left of L can be handled in a similar vein, with rRS replaced with rLS. Now, the ODE (25) for the third-order LD CLRS(t) does not admit a closed-form solution; a general solution can be obtained in terms of the incomplete beta function B(z; x, y), defined as Formula 33. However, noting that B(z; 1 – 2rRS/s, 1 + 2rRS/s) {approx} z if Formula 33, we obtain the following simple approximate solution:

Formula 34(34)
Further, using this solution and the approximation B[z; 2 – 2rRS/s, 1 + 2rRS/s] {approx} z2 for Formula 34, we obtain the following approximate solution to (24):

Formula 35(35)
Note the striking resemblance of (34) and (35) to (32) and (33), respectively. For rRS = 0, the two sets of equations agree exactly, and hence our solutions for different regions form one continuous solution for the entire domain. The only difference between the two sets of equations is that, in (34) and (35), an extra factor Formula 35 appears together with [pS(t) – pS(0)]/HS(0). This simple difference leads to important observable differences in the dynamics of the LDs.

Comparison of approximate analytic solutions with numerical solutions to the full recursions:

We have written a computer program to solve the full recursions (1)–(11) numerically. Comparison of our analytic solutions (33) and (35) with numerical solutions to the full recursions are shown in Tables 1 and 2, respectively. As these tables show, our analytic solutions are a good approximation to the exact dynamics. In obtaining our analytic solution (35) for the case in which the selected locus is outside the neutral loci, recall that we assumed Formula 35 to approximate the incomplete beta function. As expected, Table 2 shows that (35) becomes less accurate as rRS increases, but it is a good approximation as long as Formula 35.


View this table:
In this window
In a new window

 
TABLE 1

Comparison of the exact CLR(t) (obtained by solving the full recursions numerically) with the analytic approximation (33) when the selected locus is between the two neutral loci

 

View this table:
In this window
In a new window

 
TABLE 2

Comparison of the exact CLR(t) with the analytic approximation (35) when the selected locus is to the right of locus R

 


THE DYNAMICS OF LD
Here we consider the dynamics of the LD between the two neutral loci. We utilize our analytic solutions from the previous section to study several important aspects of the dynamics.

Vanishing LD:

In the domain Formula 35, the term –sCLSCRS dominates the recursion for the third moment [see (18)] and hence influences also the LD between the neutral sites. If the selected site is between the two neutral sites and linkage is sufficiently tight (rLR < 0.1s), |CLR| quickly increases after the favored mutation has entered the population and, after transiently reaching a peak, rapidly decays to zero before the selected phase ends. To show this, define tf as the time satisfying pS(tf) = 1 – pS(0). Henceforward, we loosely refer to this time as the fixation time. Using (26), one can show that

Formula 36(36)
We wish to show that, if the selected locus is between the two neutral loci, then CLR(tf) {approx} 0 for all possible initial conditions of interest. Common to all initial conditions is that pS(0) = 1/(2N), with N being the population size. For N = ~104–106, Formula 36, and therefore

Formula 37(37)
which implies that (33) at tf can be written approximately as follows:

Formula 38(38)

We use {000, 001, 010, 011, 100, 101, 110, 111} to denote gametic types. Their frequencies are denoted by {f000, f001, f010, f011, f100, f101, f110, f111}, which are related to the marginal frequencies and the LDs as follows:

Formula 38
Using these relations, we obtain

Formula 39(39)
At t = 0, a new favored mutation occurs on only one of the following gametic types: 000, 001, 100, or 101. For instance, if the mutation occurs on a gamete of type 101, f010(0) = f011(0) = f110(0) = 0 and f111(0) = pS(0) = 1/(2N). More generally, only one of f010(0), f111(0), f011(0), f110(0) is supposed to be nonzero. This implies that the right-hand side of (39) must be zero. Hence, the coefficient of exp(–rLRtf) in (38) is exactly zero. If the approximation (37) were not used, then the coefficient of exp(–rLRtf) in CLR(tf) would not be exactly zero, but would still be very small. Note that exp(–rLRtf) may not be very small. For example, exp(–rLRtf) = 0.452813 for pS(0) = 0.00005, rLR = 0.00002, and s = 0.001. This shows that, for small recombination rates (Formula 39), selection rather than recombination is the dominant force that causes CLR(t) to vanish before the fixation time. For large recombination rates, the contribution exp(–rLRt) from recombination should dominate over selection effects.

This behavior may be explained as follows. Under the scenario of tight linkage and strong selection, a low-frequency gamete on which the favored mutation landed is quickly dragged into intermediate to high frequency. If the recombination rates are nonzero, this gamete may undergo recombination, thereby creating the two types of single recombinants that also carry the selected allele and thus increase in frequency. This reduces the LD between L and R created by the hitchhiking effect in the first half of the selected phase. This hitchhiking effect on the recombinants is stronger, the greater the linkage between the selected site and the two neutral sites is, and thus also the product CLSCRS.

Figure 2a shows another important observation: LD may vanish very quickly in the selected phase, while relative heterozygosity approaches a finite (i.e., nonzero) equilibrium value. Thus, LD does not vanish because of the variation-reducing effect of hitchhiking per se, but as a consequence of secondary hitchhiking effects on the recombinants created in the selected phase (described above).


Figure 2
View larger version (17K):
In this window
In a new window
Download PPT slide
 
FIGURE 2.—

Example trajectories of CLR(n) are shown on the left-hand side, and those of normalized heterozygosity Het(n)/Het(0) are shown on the right-hand side, where solid (resp. dashed) lines correspond to locus L (resp. R). (a) The selected locus is at the midpoint between the neutral loci, i.e., rLS = rRS = 0.01s. (b) The selected locus is to the right of locus R and rRS = 0.01s. (c) The selected locus is to the right of locus R and rRS = 0.03s. (d) The selected locus is to the right of locus R and rRS = 0.3s. Exact recursions (1)–(11) were used, with s = 0.01, rLR = 0.02s, pS(0) = 0.00005, pL(0) = 0.38, pR(0) = 0.41, and CLR(0) = 0.0242. At generation n = 3982, pS(n) = 1 – pS(0). Initial conditions were chosen so that CLR(0) != 0, for two reasons: to demonstrate that, when the selected locus is between the neutral loci (as in case a), CLR(n) quickly vanishes in the selected phase even if the initial value CLR(0) is not zero and to contrast the effect of recombination (see case d) with that of selection.

 
A selected mutation occurring outside the two neutral sites on a low-frequency gamete may also lead to a transient peak of CLR, if both neutral polymorphic sites are <0.1s recombination distances away from the selected site (see Figure 2, b and c). Although this peak vanishes faster than under neutral conditions (i.e., with the selected site far away from the neutral sites, as in Figure 2d), the decay rate is not as high as when the favored mutation occurs between the neutral sites (Figure 2a). We analyze this behavior in more detail below.

Shown in Figure 3 is a plot of CLR(n) for varying position of the selected locus. As in Figure 2, the distance between the two neutral loci is fixed at rLR = 0.0002, and the same set of initial conditions is used. Note that this plot is symmetric about the plane r = 0; we return to this point later in the article. We stress that our conclusions described above do not depend on the particular values of neutral marginal allele frequencies pL(0) and pR(0) used for illustration. Even for low values of pL(0) and pR(0), for example, the same conclusions hold.


Figure 3
View larger version (22K):
In this window
In a new window
Download PPT slide
 
FIGURE 3.—

The LD CLR between the neutral loci as a function of the position of the selected locus, for rLR = 0.0002 and s = 0.01. Here, r is the position of the selected locus S, and the value r = 0 corresponds to the midpoint between the neutral loci. Locus L is fixed at r = –0.0001, whereas locus R is fixed at r = 0.0001. The same set of initial conditions as in Figure 2 was used for all r. Exact recursions (1)–(11) were used to generate this plot.

 
An alternative illustration of the above discussion is provided in Figure 4, which shows pairwise LD plots for a region containing 100 neutral loci and a single selected locus. The two plots shown correspond to two different time points. The selected locus is located in the middle of the region and LD values below a cutoff value are not plotted.


Figure 4
View larger version (10K):
In this window
In a new window
Download PPT slide
 
FIGURE 4.—

Truncated pairwise LD plots for a region consisting of 100 neutral loci and 1 selected locus located in the middle. The recombination rate between any two adjacent loci is 0.4s/100, and hence the recombination rate between the leftmost and the rightmost neutral loci is 0.4s. We used s = 0.01, pS(0) = 0.00005, pR(0) = pL(0) = 0.5, and CLR(0) = 0. The plots were truncated so that LD values <1% of the largest value 0.0577 were not included. (a) Pairwise LD plot at t = tf/2, where the time of fixation tf satisfies pS(tf) = 1 – pS(0). (b) Pairwise LD plot at t = tf.

 

The maximum LD at tf:

Consider the case in which the selected locus S is to the right of locus R. Viewing CLR(tf) as a function of rRS, whether a local optimum in the domain Formula 39 exists depends on initial conditions. The example shown in Figure 3 has a local maximum at rRS/s {approx} 0.039. Differentiating the analytic solution (35), it is possible to determine whether there is a critical point rRS = r*RS that satisfies

Formula 40(40)
Suppose that, at t = 0, the new gamete carrying a selected allele (of type 1) at locus S is of type ij1, with i being the allele at locus L and j the allele at locus R. Then, given that [{delta}i1pL(0)][{delta}j1pR(0)] > CLR(0), where {delta}ab is 1 if a = b or 0 if a != b, we obtain

Formula 41(41)
and

Formula 41
The value of r*RS in (41) may be very large for some initial conditions. In such a case, as our analytic solution (35) is valid only in the domain Formula 41, all we can say for sure is that CLR(tf) has no critical point in the domain Formula 41 [i.e., CLR(tf) is either a monotonically increasing or a monotonically decreasing function of rRS in that domain]. Further, if [{delta}i1pL(0)][{delta}j1 pR(0)] ≤ CLR(0), there is no real-valued r*RS such that our approximate analytic solution (35) satisfies (40). Hence, noting that CLR(tf) approaches Formula 41 as rRS/s increases, we conclude that, in the domain Formula 41,

Formula 41
where

Formula 42(42)
and Formula 42. Note that the maximum possible value of X is Formula 42. The critical point rLS = r*LS for the case in which the selected locus is to the left of locus L is also given by (41).

Invariance of CLR and CLRS when the selected locus is between the two neutral loci:

When the favored mutation occurs between the two neutral sites, the dynamics of the system of truncated recursions for CLR and CLRS do not depend on the position of the selected locus. This can be seen immediately from Equations 17 and 18, which depend only on the sum rLS + rRS and not on any individual recombination parameter. We show in the APPENDIX that this invariance also (nearly) holds for the system of full recursions.


INITIAL CONDITIONS AND THE PARAMETER SPACE OF THE MODEL
Here we assume that the selected locus is to the right of locus R. For such a case, recall that the LD (measured with respect to type 1 alleles) between the neutral loci is given by (35). We use ijk to denote gametic types, with i being for locus L, j for locus R, and k for locus S. By the gamete of origin, we mean the new gamete at t = 0 carrying a selected allele (of type 1) at locus S. We include ij in superscript [i.e., we write Formula 42] if the gamete of origin is of type ij1. Using (35), we obtain

Formula 43(43)
where, as before, {delta}ab is 1 if a = b or 0 if a != b, and {alpha}(t, y) is defined as

Formula 44(44)

Recall that marginal (type 1) allele frequencies pL(t) and pR(t) at the neutral loci are given by (27) and (28), respectively. For Formula 44 and Formula 44, we can use the approximation

Formula 44
from which it follows that

Formula 44
where {alpha}(tf, rRS/s) is defined as in (44). Similar to Formula 44, we use Formula 44 and Formula 44 to denote pL(t) and pR(t), respectively, if the gamete of origin is of type ij1. It is straightforward to show that

Formula 45(45)

Formula 46(46)

Frequency-averaged LD and the range of the hitchhiking effect:

In what follows, we use xijk to denote the frequency of the gametic type ijk at time t = 0 and define xij· = xij0 + xij1. The type of the gamete of origin could be any of 001, 011, 101, and 111. Suppose that the probability of the gamete of origin being of type ij1 is equal to the frequency of the gametic type ij0 just before time t = 0 (note that this frequency is equal to xij·). Then, the average value of CLR(t) with respect to this probability is given by Formula 46, which we call a frequency-averaged LD. We show below that, contrary to people's common intuition, the effect of selection on such an averaged LD does not depend on haplotype diversity xij· at t = 0.

Using (43), we can show that Formula 46 is given by

Formula 46
If CLR(0) = 0, then Formula 46 for all t. For CLR(0) != 0, we define

Formula 46
Note that rLR need not be much smaller than s for our analytic solution (35) to be valid (recall that only Formula 46 is required). Assuming Formula 46, we can ignore genetic drift and regard Formula 46 as the behavior of LD under neutrality. Thus, A(t, rRS/s) can be viewed as the ratio of the frequency-averaged LD in the presence of selection to that in the absence of selection. At the time tf of fixation, pS(tf) = 1 – pS(0) and therefore

Formula 47(47)
For given rRS/s, A(tf, rRS/s) depends only on pS(0) = 1/(2N); it has no dependence on other initial conditions. A plot of A(tf, rRS/s) is shown in Figure 5 for pS(0) = 0.00005.


Figure 5
View larger version (6K):
In this window
In a new window
Download PPT slide
 
FIGURE 5.—

A plot of A(tf, rRS/s) for pS(0) = 0.00005. The function A(tf, rRS/s), defined in (47), captures the effect of selection on both relative frequency-averaged LD and relative frequency-averaged heterozygosity [see (47) and (48)].

 
We now compare A(tf, rRS/s) with relative frequency-averaged heterozygosity. Let us focus on the right neutral locus R and define Formula 47. For HR(0) = 2pR(0)[1 pR(0)] != 0, one can use (46) to show that

Formula 48(48)
For pS(0) = 1/(2N), this is Formula 48, which is equivalent to Equation 14d of STEPHAN et al. (1992) (the factor 2 in the exponent of that formula needs to be replaced by 4 because of the different definition of the selection coefficient). In the absence of genetic drift, Formula 48(tf) = HR(0) for s = 0. Therefore, (48) can be regarded as the ratio of the frequency-averaged heterozygosity in the presence of selection to that in the absence of selection. Surprisingly, this ratio is exactly equal to the analogous ratio for LD shown in (47). The function A(tf, rRS/s) plays a special role in the sense that it encodes the effect of selection on two different frequency-averaged quantities.

For site heterozygosity, the hitchhiking effect is generally profound only when, provided that Formula 48, the recombination distance r between the selected and neutral sites satisfies r < 0.1s (MAYNARD SMITH and HAIGH 1974). The term determining this effect is (2N)–4r/s, assuming that the initial frequency pS(0) of the selected allele is 1/(2N). Our above analysis shows that the range of a substantial reduction of LD due to hitchhiking [determined by Formula 48] is exactly equal to that for variation [determined by (2N)–4r/s] (see Figure 5).

Characterization of equivalent initial conditions:

We now find the set of all initial conditions that lead to the same value of CLR at the time of fixation; i.e., CLR(tf) = c, where c is some fixed constant.

To be concrete, suppose that the gamete of origin is of type 001, in which case x101 = x011 = x111 = 0 and x001 = pS(0) = 1/(2N). First, note that x000 + x010 + x100 + x110 + pS(0) = 1 implies

Formula 49(49)
which defines a tetrahedron {Delta} as depicted in Figure 6a. Second, using (43), one can show that Formula 49 implies

Formula 50(50)
which defines a surface {Xi} in a three-dimensional Euclidean space with (x110, x010, x100) as coordinates. The intersection of surface {Xi} with tetrahedron {Delta}, illustrated in Figure 6b, corresponds to the set of initial conditions such that Formula 50. A case in which the gamete of origin is of type other than 001 can be handled in a similar vein.


Figure 6
View larger version (18K):
In this window
In a new window
Download PPT slide
 
FIGURE 6.—

Illustration of initial conditions leading to the same Formula 50. (a) The illustrated tetrahedron {Delta} corresponds to the set of all possible initial conditions, given that the gamete of origin is of type 001. {Delta} is defined by (49), and a point in {Delta} corresponds to a particular initial condition. (b) The intersection of {Delta} and surface {Xi}, defined by (50), corresponds to the set of all initial conditions leading to a fixed value c of Formula 50. For visual clarity, only one face F of {Delta} is shown. The intersection of {Delta} and {Xi} is the part of {Xi} below F.

 

Probability distributions of CLR(tf):

Recall that CLR(t) for t > 0 depends on initial conditions. In what follows, we regard initial gametic frequencies as being random and consider the probability distribution of CLR(tf). The squared correlation coefficient R2(tf) is addressed later in the DISCUSSION. We assume that all initial gametic frequency configurations x000, x010, x100, x110 are equally likely and satisfy x000 + x010 + x100 + x110 + pS(0) = 1. Under this assumption of uniform distribution, it is possible to compute the probability distribution Formula 50[CLR(tf) < c] for fixed rLR/s and rRS/s. The key idea is to utilize the characterization of equivalent initial conditions described above. More precisely, as c changes, the surface defined by CLR(tf) = c changes in a smooth fashion, sweeping out a region in three dimensions. The probability Formula 50[CLR(tf) < c] is equal to the volume of the region corresponding to CLR(tf) < c inside {Delta}, normalized by the total volume of {Delta}.

Our main result, illustrated in Figure 7, is

Formula 50
where Formula 50 is defined below. Let

Formula 50
Cases with c > 0 and c < 0 are treated separately below.


Figure 7
View larger version (30K):
In this window
In a new window
Download PPT slide
 
FIGURE 7.—

Probability distributions of Formula 50 for uniformly distributed initial gametic frequency configurations. These plots were generated using our analytic formulas for Formula 50[Formula 50 < c], with rLR/s = 0.02 and 1/(2N) = 0.00005. As rRS/s increases, Formula 50[Formula 50 < c] for all ij become identical. (a) Formula 50 and Formula 50. (b) Formula 50 and Formula 50.

 

For c > 0:

If c ≤ b/4, Formula 50, and c ≤ ab(1 a), then

Formula 50
If c ≤ b/4, Formula 50, and c > ab(1 a), then

Formula 50
If either c > b/4 or Formula 50, then Formula 50[C < c] = 1.

For c = –|c| < 0:

If |c| ≤ (1 – a)2b/4, then

Formula 50
If |c| > (1 – a)2b/4, then Formula 50[C < –|c|] = 0.

Polarization:

Polarized LDs are measured with respect to major alleles. To determine the polarized LD C{omega}(tf) between the neutral loci, we compute

Formula 51(51)
the main point being that C{omega}(tf) = CLR(tf) if {sigma} > 0 and C{omega}(tf) = –CLR(tf) if {sigma} < 0. [Recall that CLR(t) is measured with respect to type 1 alleles.]

First, for rLR = rRS = 0, note that

Formula 52(52)
Second, for fixed initial marginal frequencies, we need to determine for what values of rLR and rRS the sign of {sigma} changes. Using (45) and (46), we can obtain the following results:

For i = 0, {sigma} {approx} 0 if pL(0) > Formula 52 and

Formula 53(53)

For i = 1, {sigma} {approx} 0 if qL(0) > Formula 53 and

Formula 54(54)

For j = 0, {sigma} {approx} 0 if pR(0) > Formula 54 and

Formula 55(55)

For j = 1, {sigma} {approx} 0 if qR(0) > Formula 55 and

Formula 56(56)

Combined with (52), these equations determine completely whether C{omega}(tf) = CLR(tf) or C{omega}(tf) = –CLR(tf) for given parameter values. For example, suppose that (i, j) = (0, 0). If pL(0) < Formula 56 and pR(0) < Formula 56, then there is no real-valued solution to the condition {sigma} = 0, and therefore C{omega}(tf) = CLR(tf) for all values of rLR and rRS. If pL(0) > Formula 56 and pR(0) < Formula 56, or if pL(0) < Formula 56 and pR(0) > Formula 56, then {sigma} changes sign as illustrated in Figure 8, a and b, where

Formula 56
and

Formula 56
Note that u takes its minimum value at pL(0) = 1 and that Formula 56 as Formula 56. Similarly, v takes its minimum value at pR(0) = 1 and Formula 56 as Formula 56. If both pL(0) > Formula 56 and pR(0) > Formula 56, then there are three possibilities, depicted in Figure 8, c–e.


Figure 8
View larger version (5K):
In this window
In a new window
Download PPT slide
 
FIGURE 8.—

The sign of {sigma} = [Formula 56(tf) – Formula 56(tf)] x [Formula 56(tf) – Formula 56(tf)] for (i, j) = (0, 0). The polarized LD C{omega}(tf) between the neutral loci is CLR(tf) if {sigma} > 0 or –CLR(tf) if {sigma} < 0. The rLR,rRS domain is partitioned into two, three, or four blocks, depending on pL(0) and pR(0). Note that {sigma} tends to be positive in the neighborhood of (rLR, rRS) = (0, 0). The size and shape of this neighborhood depends on u and v, given by Formula 56 and Formula 56. (a) pL(0) > Formula 56 and pR(0) < Formula 56. (b) pL(0) < Formula 56 and pR(0) > Formula 56. (c) pL(0) = pR(0) > Formula 56. (d) pL(0) > pR(0) > Formula 56. (e) pR(0) > pL(0) > Formula 56.

 

Regions of positive C{omega}(tf):

To determine the regions of positive C{omega}(tf), we need to know how the sign of Formula 56 depends on rRS; (43) implies that the sign of Formula 56 does not depend on rLR. For concreteness, suppose that (i, j) = (0, 0), in which case we can obtain the following results from using (43):
  1. If CLR(0) ≥ 0, then Formula 56 ≥ 0 for all rRS, and therefore the sign of C{omega}(tf) is completely determined by that of {sigma}.
  2. If CLR(0) < 0, then Formula 56 changes sign as illustrated in Figure 9, where

    Formula 56

Note that w takes its minimum value of zero at |CLR(0)| = pL(0)pR(0) and that it increases monotonically as |CLR(0)|/[pL(0)pR(0)] decreases. The polarized LD C{omega}(tf) is positive if and only if Formula 56 and {sigma} are either both positive or both negative. Examples are shown in Figure 10.


Figure 9
View larger version (4K):
In this window
In a new window
Download PPT slide
 
FIGURE 9.—

The sign of Formula 56 for CLR(0) < 0. If CLR(0) ≥ 0, then Formula 56 is nonnegative for all rRS and rLR. If CLR(0) < 0, however, the sign of Formula 56 can change at rRS = w, where Formula 56. In general, Formula 56 tends to be positive near (rLR, rRS) = (0, 0).

 

Figure 10
View larger version (4K):
In this window
In a new window
Download PPT slide
 
FIGURE 10.—

Examples of the sign of the polarized LD C{omega}(tf) when the gamete of origin is of type 001 and CLR(0) < 0. Note that C{omega}(tf) is positive if and only if Formula 56 and {sigma} are either both positive or both negative. In general, C{omega}(tf) tends to be positive near (rLR, rRS) = (0, 0). (a) pL(0) > pR(0) > Formula 56 and w < u. (b) pL(0) > Formula 56, pR(0) < Formula 56, and w > u.

 
As shown in Figure 8, {sigma} tends to be positive in the neighborhood of (rLR, rRS) = (0, 0). The size and shape of this neighborhood depend on u and v. Likewise, as shown in Figure 9, Formula 56 tends to be positive in the neighborhood of (rLR, rRS) = (0, 0), with the size of the neighborhood depending on w. As a consequence, the polarized LD C{omega}(tf) also tends to be positive near (rLR, rRS) = (0, 0).

More generally, if the gamete of origin is of type ij1, the sign of Formula 56 (respectively, {sigma}) can be analyzed using (43) [respectively, (52)–(56)]. For all ij, the polarized LD C{omega}(tf) tends to be positive near (rLR, rRS) = (0, 0).

An exact symmetry when the selected locus is outside the two neutral loci:

Suppose that the selected locus is outside the two neutral loci and that geometric configuration and recombination fractions are fixed. Let {pS(0), pL(0), pR(0), CLS(0), CRS(0), CLR(0), CLRS(0)} and {p'S(0), p'L(0), p'R(0), C'LS(0), C'RS(0), C'LR(0), C'LRS(0)} denote two different sets of initial conditions. At generation n > 1, we use "prime" to refer to the marginal allele frequencies and LDs obtained using the second set of initial conditions. In the APPENDIX, we show that if CLS(0) = C'RS(0) and CRS(0) = C'LS(0), while pS(0) = p'S(0), CLR(0) = C'LR(0), and CLRS(0) = C'LRS(0), then the system of full recursions (1)–(11) implies

Formula 56
for all n ≥ 1. This is an exact symmetry result that holds for an arbitrary dominance coefficient h.

An application of this general result is the explanation of the symmetry of Figure 3 with respect to reflection about the r = 0 plane, for those regions corresponding to the selected locus being outside the neutral loci. Note that what is depicted in Figure 3 is different from the obviously symmetric case in which initial conditions CLS(0) and CRS(0) get exchanged when locus S is reflected about r = 0. In that figure, initial conditions remain fixed, while the geometric configuration of the loci and recombination fractions change upon reflection. That situation is related to changing initial conditions as described above, while keeping the geometric configuration and recombination fractions fixed.


DISCUSSION
To understand the forces that shape genomic variation in natural populations and the divergence between species, observed patterns must be compared to predictions of the models that faithfully represent the mechanisms through which such forces may work. While much of the natural selection of organismic phenotypes may be effectively approximated by deterministic single-locus equations, interactive and stochastic forces are thought to play a significant role. Until recently genetic drift has been considered the primary stochastic process determining the temporal, geographic, and genomic distribution of the vast majority of DNA sequence polymorphism and divergence. Gillespie has repeatedly demonstrated and emphasized fundamental differences between constant-fitness models and stochastically varying selection, despite their superficial similarities (GILLESPIE 1994). Recently emerging results of surveys of genomic regions of low crossing over per physical length indicate that linked selection rather than genetic drift can dominate the levels of polymorphism within populations (AGUADÉ et al. 1989; STEPHAN and LANGLEY 1989; BEGUN and AQUADRO 1992). The hitchhiking effect not only reduces the average level of heterozygosity in the surrounding genomic regions, but it also leaves a skewed frequency spectrum (BRAVERMAN et al. 1995). The early study by THOMSON (1977) indicated that linked selection can create linkage disequilibrium. Several subsequent articles have addressed specific cases (ROBINSON et al. 1991; GROTE et al. 1998) or noted some temporal and spatial patterns (KIM and NIELSEN 2004). Here we have demonstrated that the hitchhiking effect involves a number of strong and surprisingly distinct dynamics and patterns of linkage disequilibrium. We believe that the approach we have taken to address the impact of selection can be extended further to address more complex selection schemes and genetic interactions.

The technological capacity of molecular population genomics is increasing rapidly. For example, the HapMap Project (INTERNATIONAL HAPMAP CONSORTIUM 2005) provides extensive genotypic survey results on >1 million SNPs in almost 300 individual humans. At this scale of observation one can anticipate much more powerful inferences about the role of direct selection, linked selection, crossing over, gene conversion, mutation, and geographic demography. Indeed, on the basis of such new data, genomic variation in the rate of crossing over has been proposed as the primary determinant of the patterns of linkage disequilibrium in human populations (MCVEAN et al. 2004).

Several representations/notations have been developed to analyze the dynamics of multilocus systems (BÜRGER 2000). Through a series of articles Barton and Turelli have elaborated and applied their method on the basis of the explicit representations of the moments of allele frequencies (BARTON and TURELLI 1987, 1991; TURELLI and BARTON 1990, 1994). Their representation proved surprisingly tractable and transparent in the analysis of the hitchhiking effect on linkage disequilibrium.

Our analysis begins with the full representation of the three-locus dynamics using the notation of Barton and Turelli. These equations suggest the familiar approximation, "truncated equations," in which Formula 56 and small higher-order terms can be dropped. The truncated equations immediately expose much of the fundamental structure and their differential analogs, ordinary differential equations, allow approximate analytic solutions. Comparisons of these ODE dynamics with those of the Barton and Turelli representation indicate that the approximations remain quite accurate as long as Formula 56 (see Tables 1 and 2). Particularly fortuitous and important is the role of the three-locus LD CLRS in driving the dynamics of the LD CLR between the two linked neutral loci and the dependence of CLRS on the product of the two-locus LDs CLS and CRS.

This systematic investigation of the dynamics of LD under hitchhiking reveals four important features. First, and quite generally, hitchhiking indeed generates LD during the initial half of the hitchhiking time course. As Figure 3 shows LD (positive in this instance) reaches a maximum shortly before the originally rare selected allele reaches 0.5. This result is consistent with Thomson's analysis of hitchhiking caused by the dynamics of an initially rare allele under balancing selection in that its frequency reaches an equilibrium closer to 0.5 than to 1.0. But what is truly surprising is that from several important perspectives the hitchhiking effect on LD is one of reduction. In Figure 3 it is obvious that in the second half of the hitchhiking period the large peak of LD (positive in this case) decreases rapidly. Figure 2 shows several configurations of initial conditions and demonstrates that the decline in the magnitude of LD is not attributable to decline in the heterozygosity at the two neutral loci. A second and striking result is that preexisting LD is completely destroyed when the selected locus is situated between the neutral sites. This geometric relationship produces a striking pattern when all pairwise associations are plotted together as in Figure 4. This is probably the mechanism behind the pattern noted by KIM and NIELSEN (2004). This LD-reducing effect of hitchhiking is also evident when the selected site is outside the neutral pair since much of the LD generated during the initial phase is destroyed in the latter phase. A third unexpected property of the hitchhiking on LD is that the averaging over the frequencies of the gametes with which the rare selected variant can be associated indicates that the net effect of hitchhiking would be to reduce preexisting average LD. This is despite the fact that hitchhiking does tend to increase the variance in LD (see below). Note in Figure 3 that there is considerably increased LD in both regions flanking the two neutral sites (i.e., when the selected site is outside and is close to the two neutral sites). When the rare favored allele appears on two of the other three haplotypes (10 or 01) the