Abstract
We analyzed a three-locus model of genetic hitchhiking with one locus experiencing positive directional selection and two partially linked neutral loci. Following the original hitchhiking approach by Maynard Smith and Haigh, our analysis is purely deterministic. In the first half of the selected phase after a favored mutation has entered the population, hitchhiking may lead to a strong increase of linkage disequilibrium (LD) between the two neutral sites if both are <0.1s away from the selected site (where s is the selection coefficient). In the second half of the selected phase, the main effect of hitchhiking is to destroy LD. This occurs very quickly (before the end of the selected phase) when the selected site is between both neutral loci. This pattern cannot be attributed to the well-known variation-reducing effect of hitchhiking but is a consequence of secondary hitchhiking effects on the recombinants created in the selected phase. When the selected site is outside the neutral loci (which are, say, <0.1s apart), however, a fast decay of LD is observed only if the selected site is in the immediate neighborhood of one of the neutral sites (i.e., if the recombination rate r between the selected site and one of the neutral sites satisfies ). If the selected site is far away from the neutral sites (say, r > 0.3s), the decay rate of LD approaches that of neutrality. Averaging over a uniform distribution of initial gamete frequencies shows that the expected LD at the end of the hitchhiking phase is driven toward zero, while the variance is increased when the selected site is well outside the two neutral sites. When the direction of LD is polarized with respect to the more common allele at each neutral site, hitchhiking creates more positive than negative linkage disequilibrium. Thus, hitchhiking may have a distinctively patterned LD-reducing effect, in particular near the target of selection.
GENETIC drift is recognized as a fundamental stochastic force shaping the polymorphism within populations and divergence between species of both neutral and selected variants at a locus (Fisher 1930; Wright 1931; Kimura 1983). Remarkably similar patterns to those caused by genetic drift are predicted by theoretical models incorporating stochastically varying selection (Gillespie 1994). In 1974 Maynard Smith and Haigh analyzed a simple and obvious extension of the genetic drift models, incorporating the stochastic coupling due to linkage with another locus undergoing strong directional selection. Addressing the apparent uniformity of allozyme polymorphism across species, Maynard Smith and Haigh (1974) focused on the “hitchhiking effect” on allelic frequencies and heterozygosity.
Since this seminal work the hitchhiking effect associated with positive directional selection has been extended to address observations in the emerging field of molecular population genetics (Aguadé et al. 1989; Stephan and Langley 1989; Begun and Aquadro 1992). These studies revealed low levels of DNA sequence polymorphism in Drosophila in genomic regions of low crossing over and led to theoretical analyses of the hitchhiking effect on nucleotide diversity (Kaplan et al. 1989) and on the frequency spectrum of polymorphisms (Braverman et al. 1995). Whether formulated in terms of gamete frequencies or in the coalescent framework, the genetic models dealt with a pair of loci: a selected and a linked neutral locus with a defined rate of recombination between them. Independent of whether single or recurrent hitchhiking events were analyzed (as a stochastic process or a deterministic approximation), the conclusion has consistently been that hitchhiking should have profound effects on expected heterozygosity and allele frequencies at the neutral locus, if selection is strong and linkage is tight.
In 1977 Thomson considered the impact of hitchhiking on linkage disequilibrium (LD). Most of her analysis focused on the association between the selected alleles and those at a single neutral locus. Thomson also considered the impact of hitchhiking (of a heterotic polymorphism) on the LD between alleles at two linked neutral loci. On the basis of numerical examples she concluded that hitchhiking creates LD between neutral loci within the same genomic domain in which it affects levels of heterozygosity. The impact of simple directional selection of rare variants rapidly going to fixation was not considered by Thomson (1977). Except for this initial study and a few rather targeted applications (Robinson et al. 1991; Grote et al. 1998), no attempts have been made to extend this simple two-locus hitchhiking model of Maynard Smith and Haigh (1974) to multiple loci. This is curious, as the hitchhiking effect has played a major role in molecular population genetics for >15 years.
On the basis of the analyses of Thomson and her colleagues, it has been believed that hitchhiking affects not only polymorphisms at individual sites (or loci), but also the association between polymorphisms. Using a three-locus model with one selected and two neutral loci, she showed that hitchhiking that has a strong effect on heterozygosity can also generate strong LD between the neutral loci. Without paying close attention to Thomson's writing, the important role of hitchhiking in generating LD has been reiterated in textbooks and publications by many authors. It was not until recently that a few authors reported quite the opposite results (Gillespie 1997; Kim and Stephan 2002; Kim and Nielsen 2004). Analyzing “shift” models that are similar to the hitchhiking model described above, Gillespie (1997, p. 297) concluded that “linked selection can reduce variation without building up high levels of linkage disequilibrium, contrary to our intuition.” These latter studies focused on average effects observable in simulated data. In small-sample coalescent simulations, Kim and Nielsen (2004) found increased LD between alleles at two neutral loci on the same side of the selected locus at the time of fixation and reduced LD across the site of selection. Furthermore, they provide heuristic arguments to explain this pattern. These different and somewhat contradictory views of the relationship between genetic hitchhiking and LD motivated us to pursue an analytic investigation of the question. We followed the deterministic approach of Maynard Smith and Haigh (1974), extending their model to three loci as did Thomson (1977). To analyze this model, we used the framework of Barton and Turelli (1991), which provides a natural setting for the investigation of the directional hitchhiking, yielding transparent mathematical expressions that illuminate the rather surprising dynamics of LD under the hitchhiking effect.
THE THREE-LOCUS HITCHHIKING MODEL
We consider a three-locus model with two neutral loci and one selected locus. For each locus, we assume that there are only two allele types, denoted by 0 and 1. The selected locus may be between the two neutral ones or on either side of them (see Figure 1). We denote by L and R the left and right neutral loci, respectively, and by S the selected locus. The corresponding recombination fractions between loci are rLR, etc. We assume positive directional selection according to the following fitness scheme:where s is the selection coefficient of the selected allele (type 1) and h the dominance coefficient. Note that we follow here the notation of Maynard Smith and Haigh (1974), not the definition of Kaplan et al. (1989). Effective population size Ne is assumed to be very large (
), such that a deterministic analysis of the model is appropriate.
Three possible configurations. The selected locus is denoted by S, whereas the left and the right neutral loci are denoted by L and R, respectively. The recombination rate rLR between L and R is given by adding or subtracting rLS and rRS, depending on the configuration.
The system of full recursions:
To derive the recursions for the marginal (type 1) allele frequencies pL, pR, and pS at the three loci and the LDs (central moments) of the second (CLR, CLS, CRS) and third (CLRS) orders measured with respect to type 1 alleles, we follow the approach of Barton and Turelli (1991). In that approach, coefficients that appear in the recursions are recombination rates and generalized selection coefficients, the latter denoted by and
, where X and Y are nonempty subsets of loci. (Generalized selection coefficients are defined as coefficients that appear in expressing the relative fitness in terms of certain quantities related to LDs.) In our case, nonzero selection coefficients that appear in the recursions at time t are
For h =
, these simplify to
Following Barton and Turelli (1991), we define rLRS = rLS,R + rRS,L + rLR,S and use rX,Y to denote the rate of recombination events that partition the loci into two nonempty sets X and Y. In our work, we ignore double-crossover recombination events, so we define rX,Y = 0 if the partition X, Y corresponds to a double-crossover event. For example, if the selected locus S is between the neutral loci, then rLS,R = rRS, rRS,L = rLS, and rLR,S = 0. Further, we assume that recombination rates are additive. The recombination rate rLR is therefore given by adding or subtracting rLS and rRS, depending on the configuration (see Figure 1).
To simplify notation, we hereafter omit writing the dependence on time t. We define qS = 1 − pS, qL = 1 − pL, and qR = 1 − pR. Marginal allele frequencies satisfy the following recursions:(1)
(2)
(3)The LDs satisfy the recursions
(4)
(5)
(6)
(7)where
(8)
(9)
(10)
(11)with
The above general recursions apply to all three configurations shown in Figure 1. Depending on the particular configuration being considered, recombination rates rX,Y need to be defined appropriately.
The system of truncated recursions:
We explore the behavior of the recursion Equations 1–7 in the region and
(see Maynard Smith and Haigh 1974). Here, r may be any recombination parameter appearing in the Equations 1–7. Keeping only the terms linear in s or r leads to the following set of recursions:
(12)
(13)
(14)
(15)
(16)
(17)
(18)These equations agree to first order in r and s with those of Thomson (1977) (compare her Equations 30iii, 30iv, and 31). Since the hitchhiking effect can be best observed when
(such that simultaneously
holds), we may approximate the truncated recursions by the following ordinary differential equations (ODEs; see Maynard Smith and Haigh 1974):
(19)
(20)
(21)
(22)
(23)
(24)
(25)Here we have introduced time t (in generations) into the equation for pS and parameterized the other quantities by pS (which is a monotonically increasing function of t).
Structure of equations:
Several features of the dynamical system become apparent from these two sets of equations. Most importantly, selection acts on the alleles at the neutral loci indirectly and in a strictly hierarchical fashion: on the marginal neutral allele frequencies via the pairwise LDs CLS and CRS, on the LD between the neutral sites by the third moment, and on the latter by a fourth-order term (i.e., the product of the two pairwise LDs CLS and CRS).
Analytical solutions when the selected locus is between the neutral loci:
The ODEs for the LDs are first-order linear differential equations. The ODEs for the pairwise LDs CLS and CRS are homogeneous. The equations for CLR and CLRS contain the higher-order moments as inhomogeneous terms that act as “driving forces” of the dynamics. However, except for this impact of the higher-order terms, the equations are decoupled and can be solved successively. We have the following results.
The frequency of the selected allele at locus S is(26)whereas marginal allele frequencies at the neutral loci are
(27)
(28)where
(29)which corresponds to the heterozygosity at the selected locus. The LDs CLS(t) and CRS(t) can be written as
(30)
(31)
Given these solutions for CLS(t) and CRS(t), the coupled ODEs (24) and (25) admit simple exact solutions when the selected locus is between the two neutral loci. More specifically, the third-order LD is given by(32)and the LD between the neutral loci can be written as
(33)where rLR = rLRS = rLS + rRS.
Analytical solutions when the selected locus is outside the neutral loci:
Solutions to the ODEs (19)–(23) do not depend on whether the selected locus is inside or outside the two neutral loci. For example, the allele frequency pS(t) and the LDs CLS(t) and CRS(t) are given by (26), (30), and (31), respectively, in all cases. However, the dynamics of the ODEs (24) and (25) for CLR(t) and CLRS(t), respectively, depend crucially on the position of the selected locus S with respect to the neutral loci L and R. As we elaborate presently, the dynamics of CLR(t) when S is between L and R exhibit radically different behavior than when S is outside.
In what follows, suppose that S is to the right of R, which implies rLS = rLRS = rLR + rRS. The case where S is to the left of L can be handled in a similar vein, with rRS replaced with rLS. Now, the ODE (25) for the third-order LD CLRS(t) does not admit a closed-form solution; a general solution can be obtained in terms of the incomplete beta function B(z; x, y), defined as . However, noting that B(z; 1 − 2rRS/s, 1 + 2rRS/s) ≈ z if
, we obtain the following simple approximate solution:
(34)Further, using this solution and the approximation B[z; 2 − 2rRS/s, 1 + 2rRS/s] ≈ z2 for
, we obtain the following approximate solution to (24):
(35)Note the striking resemblance of (34) and (35) to (32) and (33), respectively. For rRS = 0, the two sets of equations agree exactly, and hence our solutions for different regions form one continuous solution for the entire domain. The only difference between the two sets of equations is that, in (34) and (35), an extra factor
appears together with [pS(t) − pS(0)]/HS(0). This simple difference leads to important observable differences in the dynamics of the LDs.
Comparison of approximate analytic solutions with numerical solutions to the full recursions:
We have written a computer program to solve the full recursions (1)–(11) numerically. Comparison of our analytic solutions (33) and (35) with numerical solutions to the full recursions are shown in Tables 1 and 2, respectively. As these tables show, our analytic solutions are a good approximation to the exact dynamics. In obtaining our analytic solution (35) for the case in which the selected locus is outside the neutral loci, recall that we assumed to approximate the incomplete beta function. As expected, Table 2 shows that (35) becomes less accurate as rRS increases, but it is a good approximation as long as
.
Comparison of the exact CLR(t) (obtained by solving the full recursions numerically) with the analytic approximation (33) when the selected locus is between the two neutral loci
Comparison of the exact CLR(t) with the analytic approximation (35) when the selected locus is to the right of locus R
THE DYNAMICS OF LD
Here we consider the dynamics of the LD between the two neutral loci. We utilize our analytic solutions from the previous section to study several important aspects of the dynamics.
Vanishing LD:
In the domain , the term −sCLSCRS dominates the recursion for the third moment [see (18)] and hence influences also the LD between the neutral sites. If the selected site is between the two neutral sites and linkage is sufficiently tight (rLR < 0.1s), |CLR| quickly increases after the favored mutation has entered the population and, after transiently reaching a peak, rapidly decays to zero before the selected phase ends. To show this, define tf as the time satisfying pS(tf) = 1 − pS(0). Henceforward, we loosely refer to this time as the fixation time. Using (26), one can show that
(36)We wish to show that, if the selected locus is between the two neutral loci, then CLR(tf) ≈ 0 for all possible initial conditions of interest. Common to all initial conditions is that pS(0) = 1/(2N), with N being the population size. For N = ∼104–106,
, and therefore
(37)which implies that (33) at tf can be written approximately as follows:
(38)
We use {000, 001, 010, 011, 100, 101, 110, 111} to denote gametic types. Their frequencies are denoted by {f000, f001, f010, f011, f100, f101, f110, f111}, which are related to the marginal frequencies and the LDs as follows:Using these relations, we obtain
(39)At t = 0, a new favored mutation occurs on only one of the following gametic types: 000, 001, 100, or 101. For instance, if the mutation occurs on a gamete of type 101, f010(0) = f011(0) = f110(0) = 0 and f111(0) = pS(0) = 1/(2N). More generally, only one of f010(0), f111(0), f011(0), f110(0) is supposed to be nonzero. This implies that the right-hand side of (39) must be zero. Hence, the coefficient of exp(−rLRtf) in (38) is exactly zero. If the approximation (37) were not used, then the coefficient of exp(−rLRtf) in CLR(tf) would not be exactly zero, but would still be very small. Note that exp(−rLRtf) may not be very small. For example, exp(−rLRtf) = 0.452813 for pS(0) = 0.00005, rLR = 0.00002, and s = 0.001. This shows that, for small recombination rates (
), selection rather than recombination is the dominant force that causes CLR(t) to vanish before the fixation time. For large recombination rates, the contribution exp(−rLRt) from recombination should dominate over selection effects.
This behavior may be explained as follows. Under the scenario of tight linkage and strong selection, a low-frequency gamete on which the favored mutation landed is quickly dragged into intermediate to high frequency. If the recombination rates are nonzero, this gamete may undergo recombination, thereby creating the two types of single recombinants that also carry the selected allele and thus increase in frequency. This reduces the LD between L and R created by the hitchhiking effect in the first half of the selected phase. This hitchhiking effect on the recombinants is stronger, the greater the linkage between the selected site and the two neutral sites is, and thus also the product CLSCRS.
Figure 2a shows another important observation: LD may vanish very quickly in the selected phase, while relative heterozygosity approaches a finite (i.e., nonzero) equilibrium value. Thus, LD does not vanish because of the variation-reducing effect of hitchhiking per se, but as a consequence of secondary hitchhiking effects on the recombinants created in the selected phase (described above).
Example trajectories of CLR(n) are shown on the left-hand side, and those of normalized heterozygosity Het(n)/Het(0) are shown on the right-hand side, where solid (resp. dashed) lines correspond to locus L (resp. R). (a) The selected locus is at the midpoint between the neutral loci, i.e., rLS = rRS = 0.01s. (b) The selected locus is to the right of locus R and rRS = 0.01s. (c) The selected locus is to the right of locus R and rRS = 0.03s. (d) The selected locus is to the right of locus R and rRS = 0.3s. Exact recursions (1)–(11) were used, with s = 0.01, rLR = 0.02s, pS(0) = 0.00005, pL(0) = 0.38, pR(0) = 0.41, and CLR(0) = 0.0242. At generation n = 3982, pS(n) = 1 − pS(0). Initial conditions were chosen so that CLR(0) ≠ 0, for two reasons: to demonstrate that, when the selected locus is between the neutral loci (as in case a), CLR(n) quickly vanishes in the selected phase even if the initial value CLR(0) is not zero and to contrast the effect of recombination (see case d) with that of selection.
A selected mutation occurring outside the two neutral sites on a low-frequency gamete may also lead to a transient peak of CLR, if both neutral polymorphic sites are <0.1s recombination distances away from the selected site (see Figure 2, b and c). Although this peak vanishes faster than under neutral conditions (i.e., with the selected site far away from the neutral sites, as in Figure 2d), the decay rate is not as high as when the favored mutation occurs between the neutral sites (Figure 2a). We analyze this behavior in more detail below.
Shown in Figure 3 is a plot of CLR(n) for varying position of the selected locus. As in Figure 2, the distance between the two neutral loci is fixed at rLR = 0.0002, and the same set of initial conditions is used. Note that this plot is symmetric about the plane r = 0; we return to this point later in the article. We stress that our conclusions described above do not depend on the particular values of neutral marginal allele frequencies pL(0) and pR(0) used for illustration. Even for low values of pL(0) and pR(0), for example, the same conclusions hold.
The LD CLR between the neutral loci as a function of the position of the selected locus, for rLR = 0.0002 and s = 0.01. Here, r is the position of the selected locus S, and the value r = 0 corresponds to the midpoint between the neutral loci. Locus L is fixed at r = −0.0001, whereas locus R is fixed at r = 0.0001. The same set of initial conditions as in Figure 2 was used for all r. Exact recursions (1)–(11) were used to generate this plot.
An alternative illustration of the above discussion is provided in Figure 4, which shows pairwise LD plots for a region containing 100 neutral loci and a single selected locus. The two plots shown correspond to two different time points. The selected locus is located in the middle of the region and LD values below a cutoff value are not plotted.
Truncated pairwise LD plots for a region consisting of 100 neutral loci and 1 selected locus located in the middle. The recombination rate between any two adjacent loci is 0.4s/100, and hence the recombination rate between the leftmost and the rightmost neutral loci is 0.4s. We used s = 0.01, pS(0) = 0.00005, pR(0) = pL(0) = 0.5, and CLR(0) = 0. The plots were truncated so that LD values <1% of the largest value 0.0577 were not included. (a) Pairwise LD plot at t = tf/2, where the time of fixation tf satisfies pS(tf) = 1 − pS(0). (b) Pairwise LD plot at t = tf.
The maximum LD at tf:
Consider the case in which the selected locus S is to the right of locus R. Viewing CLR(tf) as a function of rRS, whether a local optimum in the domain exists depends on initial conditions. The example shown in Figure 3 has a local maximum at rRS/s ≈ 0.039. Differentiating the analytic solution (35), it is possible to determine whether there is a critical point rRS = r*RS that satisfies
(40)Suppose that, at t = 0, the new gamete carrying a selected allele (of type 1) at locus S is of type ij1, with i being the allele at locus L and j the allele at locus R. Then, given that [δi1 − pL(0)][δj1 − pR(0)] > CLR(0), where δab is 1 if a = b or 0 if a ≠ b, we obtain
(41)and
The value of r*RS in (41) may be very large for some initial conditions. In such a case, as our analytic solution (35) is valid only in the domain
, all we can say for sure is that CLR(tf) has no critical point in the domain
[i.e., CLR(tf) is either a monotonically increasing or a monotonically decreasing function of rRS in that domain]. Further, if [δi1 − pL(0)][δj1 − pR(0)] ≤ CLR(0), there is no real-valued r*RS such that our approximate analytic solution (35) satisfies (40). Hence, noting that CLR(tf) approaches
as rRS/s increases, we conclude that, in the domain
,
where
(42)and
. Note that the maximum possible value of X is
. The critical point rLS = r*LS for the case in which the selected locus is to the left of locus L is also given by (41).
Invariance of CLR and CLRS when the selected locus is between the two neutral loci:
When the favored mutation occurs between the two neutral sites, the dynamics of the system of truncated recursions for CLR and CLRS do not depend on the position of the selected locus. This can be seen immediately from Equations 17 and 18, which depend only on the sum rLS + rRS and not on any individual recombination parameter. We show in the appendix that this invariance also (nearly) holds for the system of full recursions.
INITIAL CONDITIONS AND THE PARAMETER SPACE OF THE MODEL
Here we assume that the selected locus is to the right of locus R. For such a case, recall that the LD (measured with respect to type 1 alleles) between the neutral loci is given by (35). We use ijk to denote gametic types, with i being for locus L, j for locus R, and k for locus S. By the gamete of origin, we mean the new gamete at t = 0 carrying a selected allele (of type 1) at locus S. We include ij in superscript [i.e., we write ] if the gamete of origin is of type ij1. Using (35), we obtain
(43)where, as before, δab is 1 if a = b or 0 if a ≠ b, and α(t, y) is defined as
(44)
Recall that marginal (type 1) allele frequencies pL(t) and pR(t) at the neutral loci are given by (27) and (28), respectively. For and
, we can use the approximation
from which it follows that
where α(tf, rRS/s) is defined as in (44). Similar to
, we use
and
to denote pL(t) and pR(t), respectively, if the gamete of origin is of type ij1. It is straightforward to show that
(45)
(46)
Frequency-averaged LD and the range of the hitchhiking effect:
In what follows, we use xijk to denote the frequency of the gametic type ijk at time t = 0 and define xij· = xij0 + xij1. The type of the gamete of origin could be any of 001, 011, 101, and 111. Suppose that the probability of the gamete of origin being of type ij1 is equal to the frequency of the gametic type ij0 just before time t = 0 (note that this frequency is equal to xij·). Then, the average value of CLR(t) with respect to this probability is given by , which we call a frequency-averaged LD. We show below that, contrary to people's common intuition, the effect of selection on such an averaged LD does not depend on haplotype diversity xij· at t = 0.
Using (43), we can show that is given by
If CLR(0) = 0, then
for all t. For CLR(0) ≠ 0, we define
Note that rLR need not be much smaller than s for our analytic solution (35) to be valid (recall that only
is required). Assuming
, we can ignore genetic drift and regard
as the behavior of LD under neutrality. Thus, A(t, rRS/s) can be viewed as the ratio of the frequency-averaged LD in the presence of selection to that in the absence of selection. At the time tf of fixation, pS(tf) = 1 − pS(0) and therefore
(47)For given rRS/s, A(tf, rRS/s) depends only on pS(0) = 1/(2N); it has no dependence on other initial conditions. A plot of A(tf, rRS/s) is shown in Figure 5 for pS(0) = 0.00005.
We now compare A(tf, rRS/s) with relative frequency-averaged heterozygosity. Let us focus on the right neutral locus R and define . For HR(0) = 2pR(0)[1 − pR(0)] ≠ 0, one can use (46) to show that
(48)For pS(0) = 1/(2N), this is
, which is equivalent to Equation 14d of Stephan et al. (1992) (the factor 2 in the exponent of that formula needs to be replaced by 4 because of the different definition of the selection coefficient). In the absence of genetic drift,
(tf) = HR(0) for s = 0. Therefore, (48) can be regarded as the ratio of the frequency-averaged heterozygosity in the presence of selection to that in the absence of selection. Surprisingly, this ratio is exactly equal to the analogous ratio for LD shown in (47). The function A(tf, rRS/s) plays a special role in the sense that it encodes the effect of selection on two different frequency-averaged quantities.
For site heterozygosity, the hitchhiking effect is generally profound only when, provided that , the recombination distance r between the selected and neutral sites satisfies r < 0.1s (Maynard Smith and Haigh 1974). The term determining this effect is (2N)−4r/s, assuming that the initial frequency pS(0) of the selected allele is 1/(2N). Our above analysis shows that the range of a substantial reduction of LD due to hitchhiking [determined by
] is exactly equal to that for variation [determined by (2N)−4r/s] (see Figure 5).
Characterization of equivalent initial conditions:
We now find the set of all initial conditions that lead to the same value of CLR at the time of fixation; i.e., CLR(tf) = c, where c is some fixed constant.
To be concrete, suppose that the gamete of origin is of type 001, in which case x101 = x011 = x111 = 0 and x001 = pS(0) = 1/(2N). First, note that x000 + x010 + x100 + x110 + pS(0) = 1 implies(49)which defines a tetrahedron Δ as depicted in Figure 6a. Second, using (43), one can show that
implies
(50)which defines a surface Ξ in a three-dimensional Euclidean space with (x110, x010, x100) as coordinates. The intersection of surface Ξ with tetrahedron Δ, illustrated in Figure 6b, corresponds to the set of initial conditions such that
. A case in which the gamete of origin is of type other than 001 can be handled in a similar vein.
Illustration of initial conditions leading to the same . (a) The illustrated tetrahedron Δ corresponds to the set of all possible initial conditions, given that the gamete of origin is of type 001. Δ is defined by (49), and a point in Δ corresponds to a particular initial condition. (b) The intersection of Δ and surface Ξ, defined by (50), corresponds to the set of all initial conditions leading to a fixed value c of
. For visual clarity, only one face F of Δ is shown. The intersection of Δ and Ξ is the part of Ξ below F.
Probability distributions of CLR(tf):
Recall that CLR(t) for t > 0 depends on initial conditions. In what follows, we regard initial gametic frequencies as being random and consider the probability distribution of CLR(tf). The squared correlation coefficient R2(tf) is addressed later in the discussion. We assume that all initial gametic frequency configurations x000, x010, x100, x110 are equally likely and satisfy x000 + x010 + x100 + x110 + pS(0) = 1. Under this assumption of uniform distribution, it is possible to compute the probability distribution ℙ[CLR(tf) < c] for fixed rLR/s and rRS/s. The key idea is to utilize the characterization of equivalent initial conditions described above. More precisely, as c changes, the surface defined by CLR(tf) = c changes in a smooth fashion, sweeping out a region in three dimensions. The probability ℙ[CLR(tf) < c] is equal to the volume of the region corresponding to CLR(tf) < c inside Δ, normalized by the total volume of Δ.
Our main result, illustrated in Figure 7, iswhere
is defined below. Let
Cases with c > 0 and c < 0 are treated separately below.
Probability distributions of for uniformly distributed initial gametic frequency configurations. These plots were generated using our analytic formulas for ℙ[
< c], with rLR/s = 0.02 and 1/(2N) = 0.00005. As rRS/s increases, ℙ[
< c] for all ij become identical. (a)
and
. (b)
and
.
For c > 0:
If c ≤ b/4, , and c ≤ ab(1 − a), then
If c ≤ b/4,
, and c > ab(1 − a), then
If either c > b/4 or
, then ℙ[C < c] = 1.
For c = −|c| < 0:
If |c| ≤ (1 − a)2b/4, thenIf |c| > (1 − a)2b/4, then ℙ[C < −|c|] = 0.
Polarization:
Polarized LDs are measured with respect to major alleles. To determine the polarized LD Cω(tf) between the neutral loci, we compute(51)the main point being that Cω(tf) = CLR(tf) if σ > 0 and Cω(tf) = −CLR(tf) if σ < 0. [Recall that CLR(t) is measured with respect to type 1 alleles.]
First, for rLR = rRS = 0, note that(52)Second, for fixed initial marginal frequencies, we need to determine for what values of rLR and rRS the sign of σ changes. Using (45) and (46), we can obtain the following results:
For i = 0, σ ≈ 0 if pL(0) >
and
(53)
For i = 1, σ ≈ 0 if qL(0) >
and
(54)
For j = 0, σ ≈ 0 if pR(0) >
and
(55)
For j = 1, σ ≈ 0 if qR(0) >
and
(56)
Combined with (52), these equations determine completely whether Cω(tf) = CLR(tf) or Cω(tf) = −CLR(tf) for given parameter values. For example, suppose that (i, j) = (0, 0). If pL(0) < and pR(0) <
, then there is no real-valued solution to the condition σ = 0, and therefore Cω(tf) = CLR(tf) for all values of rLR and rRS. If pL(0) >
and pR(0) <
, or if pL(0) <
and pR(0) >
, then σ changes sign as illustrated in Figure 8, a and b, where
and
Note that u takes its minimum value at pL(0) = 1 and that
as
. Similarly, v takes its minimum value at pR(0) = 1 and
as
. If both pL(0) >
and pR(0) >
, then there are three possibilities, depicted in Figure 8, c–e.
The sign of σ = [(tf) −
(tf)] × [
(tf) −
(tf)] for (i, j) = (0, 0). The polarized LD Cω(tf) between the neutral loci is CLR(tf) if σ > 0 or −CLR(tf) if σ < 0. The rLR,rRS domain is partitioned into two, three, or four blocks, depending on pL(0) and pR(0). Note that σ tends to be positive in the neighborhood of (rLR, rRS) = (0, 0). The size and shape of this neighborhood depends on u and v, given by
and
. (a) pL(0) >
and pR(0) <
. (b) pL(0) <
and pR(0) >
. (c) pL(0) = pR(0) >
. (d) pL(0) > pR(0) >
. (e) pR(0) > pL(0) >
.
Regions of positive Cω(tf):
To determine the regions of positive Cω(tf), we need to know how the sign of depends on rRS; (43) implies that the sign of
does not depend on rLR. For concreteness, suppose that (i, j) = (0, 0), in which case we can obtain the following results from using (43):
If CLR(0) ≥ 0, then
≥ 0 for all rRS, and therefore the sign of Cω(tf) is completely determined by that of σ.
If CLR(0) < 0, then
changes sign as illustrated in Figure 9, where
Note that w takes its minimum value of zero at |CLR(0)| = pL(0)pR(0) and that it increases monotonically as |CLR(0)|/[pL(0)pR(0)] decreases. The polarized LD Cω(tf) is positive if and only if and σ are either both positive or both negative. Examples are shown in Figure 10.
The sign of for CLR(0) < 0. If CLR(0) ≥ 0, then
is nonnegative for all rRS and rLR. If CLR(0) < 0, however, the sign of
can change at rRS = w, where
. In general,
tends to be positive near (rLR, rRS) = (0, 0).
Examples of the sign of the polarized LD Cω(tf) when the gamete of origin is of type 001 and CLR(0) < 0. Note that Cω(tf) is positive if and only if and σ are either both positive or both negative. In general, Cω(tf) tends to be positive near (rLR, rRS) = (0, 0). (a) pL(0) > pR(0) >
and w < u. (b) pL(0) >
, pR(0) <
, and w > u.
As shown in Figure 8, σ tends to be positive in the neighborhood of (rLR, rRS) = (0, 0). The size and shape of this neighborhood depend on u and v. Likewise, as shown in Figure 9, tends to be positive in the neighborhood of (rLR, rRS) = (0, 0), with the size of the neighborhood depending on w. As a consequence, the polarized LD Cω(tf) also tends to be positive near (rLR, rRS) = (0, 0).
More generally, if the gamete of origin is of type ij1, the sign of (respectively, σ) can be analyzed using (43) [respectively, (52)–(56)]. For all ij, the polarized LD Cω(tf) tends to be positive near (rLR, rRS) = (0, 0).
An exact symmetry when the selected locus is outside the two neutral loci:
Suppose that the selected locus is outside the two neutral loci and that geometric configuration and recombination fractions are fixed. Let {pS(0), pL(0), pR(0), CLS(0), CRS(0), CLR(0), CLRS(0)} and {p′S(0), p′L(0), p′R(0), C′LS(0), C′RS(0), C′LR(0), C′LRS(0)} denote two different sets of initial conditions. At generation n > 1, we use “prime” to refer to the marginal allele frequencies and LDs obtained using the second set of initial conditions. In the appendix, we show that if CLS(0) = C′RS(0) and CRS(0) = C′LS(0), while pS(0) = p′S(0), CLR(0) = C′LR(0), and CLRS(0) = C′LRS(0), then the system of full recursions (1)–(11) impliesfor all n ≥ 1. This is an exact symmetry result that holds for an arbitrary dominance coefficient h.
An application of this general result is the explanation of the symmetry of Figure 3 with respect to reflection about the r = 0 plane, for those regions corresponding to the selected locus being outside the neutral loci. Note that what is depicted in Figure 3 is different from the obviously symmetric case in which initial conditions CLS(0) and CRS(0) get exchanged when locus S is reflected about r = 0. In that figure, initial conditions remain fixed, while the geometric configuration of the loci and recombination fractions change upon reflection. That situation is related to changing initial conditions as described above, while keeping the geometric configuration and recombination fractions fixed.
DISCUSSION
To understand the forces that shape genomic variation in natural populations and the divergence between species, observed patterns must be compared to predictions of the models that faithfully represent the mechanisms through which such forces may work. While much of the natural selection of organismic phenotypes may be effectively approximated by deterministic single-locus equations, interactive and stochastic forces are thought to play a significant role. Until recently genetic drift has been considered the primary stochastic process determining the temporal, geographic, and genomic distribution of the vast majority of DNA sequence polymorphism and divergence. Gillespie has repeatedly demonstrated and emphasized fundamental differences between constant-fitness models and stochastically varying selection, despite their superficial similarities (Gillespie 1994). Recently emerging results of surveys of genomic regions of low crossing over per physical length indicate that linked selection rather than genetic drift can dominate the levels of polymorphism within populations (Aguadé et al. 1989; Stephan and Langley 1989; Begun and Aquadro 1992). The hitchhiking effect not only reduces the average level of heterozygosity in the surrounding genomic regions, but it also leaves a skewed frequency spectrum (Braverman et al. 1995). The early study by Thomson (1977) indicated that linked selection can create linkage disequilibrium. Several subsequent articles have addressed specific cases (Robinson et al. 1991; Grote et al. 1998) or noted some temporal and spatial patterns (Kim and Nielsen 2004). Here we have demonstrated that the hitchhiking effect involves a number of strong and surprisingly distinct dynamics and patterns of linkage disequilibrium. We believe that the approach we have taken to address the impact of selection can be extended further to address more complex selection schemes and genetic interactions.
The technological capacity of molecular population genomics is increasing rapidly. For example, the HapMap Project (International HapMap Consortium 2005) provides extensive genotypic survey results on >1 million SNPs in almost 300 individual humans. At this scale of observation one can anticipate much more powerful inferences about the role of direct selection, linked selection, crossing over, gene conversion, mutation, and geographic demography. Indeed, on the basis of such new data, genomic variation in the rate of crossing over has been proposed as the primary determinant of the patterns of linkage disequilibrium in human populations (McVean et al. 2004).
Several representations/notations have been developed to analyze the dynamics of multilocus systems (Bürger 2000). Through a series of articles Barton and Turelli have elaborated and applied their method on the basis of the explicit representations of the moments of allele frequencies (Barton and Turelli 1987, 1991; Turelli and Barton 1990, 1994). Their representation proved surprisingly tractable and transparent in the analysis of the hitchhiking effect on linkage disequilibrium.
Our analysis begins with the full representation of the three-locus dynamics using the notation of Barton and Turelli. These equations suggest the familiar approximation, “truncated equations,” in which and small higher-order terms can be dropped. The truncated equations immediately expose much of the fundamental structure and their differential analogs, ordinary differential equations, allow approximate analytic solutions. Comparisons of these ODE dynamics with those of the Barton and Turelli representation indicate that the approximations remain quite accurate as long as
(see Tables 1 and 2). Particularly fortuitous and important is the role of the three-locus LD CLRS in driving the dynamics of the LD CLR between the two linked neutral loci and the dependence of CLRS on the product of the two-locus LDs CLS and CRS.
This systematic investigation of the dynamics of LD under hitchhiking reveals four important features. First, and quite generally, hitchhiking indeed generates LD during the initial half of the hitchhiking time course. As Figure 3 shows LD (positive in this instance) reaches a maximum shortly before the originally rare selected allele reaches 0.5. This result is consistent with Thomson's analysis of hitchhiking caused by the dynamics of an initially rare allele under balancing selection in that its frequency reaches an equilibrium closer to 0.5 than to 1.0. But what is truly surprising is that from several important perspectives the hitchhiking effect on LD is one of reduction. In Figure 3 it is obvious that in the second half of the hitchhiking period the large peak of LD (positive in this case) decreases rapidly. Figure 2 shows several configurations of initial conditions and demonstrates that the decline in the magnitude of LD is not attributable to decline in the heterozygosity at the two neutral loci. A second and striking result is that preexisting LD is completely destroyed when the selected locus is situated between the neutral sites. This geometric relationship produces a striking pattern when all pairwise associations are plotted together as in Figure 4. This is probably the mechanism behind the pattern noted by Kim and Nielsen (2004). This LD-reducing effect of hitchhiking is also evident when the selected site is outside the neutral pair since much of the LD generated during the initial phase is destroyed in the latter phase. A third unexpected property of the hitchhiking on LD is that the averaging over the frequencies of the gametes with which the rare selected variant can be associated indicates that the net effect of hitchhiking would be to reduce preexisting average LD. This is despite the fact that hitchhiking does tend to increase the variance in LD (see below). Note in Figure 3 that there is considerably increased LD in both regions flanking the two neutral sites (i.e., when the selected site is outside and is close to the two neutral sites). When the rare favored allele appears on two of the other three haplotypes (10 or 01) the final LD is strongly negative. Thus the average (weighted by the frequencies of the four gametes) will remain at zero if there is no LD and tend toward zero if initially different from zero. The rate of approach to zero is greater than or equal to that expected in the absence of hitchhiking.
The fourth notable LD hitchhiking effect is on the expected LD when this association is polarized by the marginal allele frequencies. Langley and Crow (1974) noted that with molecular polymorphism data the sign of LD is typically arbitrary. They proposed to orient LD such that it reflected the deviation for the expected most frequent gametic type and demonstrated that under quadratic stabilizing selection this measure of LD, denoted Cω, is negative. Under hitchhiking the average Cω tends to be positive. This can be understood as the consequence of the fact that the neutral alleles at each site on the initially selected haplotype tend to rise to frequencies >0.5 and the LD between those alleles is positive. A bias in the distribution of Cω either regionally or across the genome could be interpreted as evidence that hitchhiking is shaping LD. Thus the frequency-averaged hitchhiking effect on LD is to drive it to zero. But as shown in Figure 11a there is a bias with respect to marginal frequencies at the two neutral sites; Cω(tf) tends to be positive for small r/s. And, of course, there is a broad range of r in which the variance of LD is increased when the selected site is outside the two neutral sites. Figure 11b shows that the projection of the probability distribution of the squared correlation coefficient R2(tf) also has a peak for small r/s, near 0.02.
Probability distributions of the polarized LD Cω(tf) and the squared correlation coefficient R2(tf), obtained from numerical simulations using rLR/s = 0.02 and 1/(2N) = 0.00005. The polarized LD Cω(tf) tends to be positive for small rRS/s. (a) ℙ[Cω(tf) < c]. (b) ℙ[R2(tf) < c].
The genomic scale over which hitchhiking has a significant effect on heterozygosity and the frequency spectrum has been considered previously. Beyond the obvious inherent in the approximation, Stephan et al. (1992) showed that the reduction in heterozygosity was approximately proportional to 1 − (2N)−4r/s. Simulations of Braverman et al. (1995) indicated a similar scale and shape to the skewness in the frequency spectrum as measured by Tajima's D (also see Durrett and Schweinsberg 2005). In studying the expectation of the linkage disequilibrium caused by hitchhiking we note a striking common function A(t, r/s) that relates the averages of both heterozygosity and LD to what they would be in a large population in the absence of hitchhiking. We are tempted to speculate that this simple function may be fundamental to average dynamics of other moments of allele frequencies under the hitchhiking scenario.
While the effect of hitchhiking on the average CLR is to drive it toward zero, this is clearly not expected for , or the absolute value of CLR. We have not obtained an analytic expression for such expectations but simulated results such as those shown in Figures 3, 4, and 11 indicate that the hitchhiking effect on magnitude of CLR between neutral sites on the same side of the selected site can be substantial.
Our results were derived using a deterministic three-locus model of hitchhiking. Similar results hold for the pseudo-hitchhiking model (Gillespie 2000 and J. Gillespie, unpublished results). We have compared both models. The recursion equations of the pseudo-hitchhiking model are a good approximation of the dynamics of the three-locus model if the selected locus is outside the two neutral loci and the distance between the selected locus and either one of the neutral loci is much larger than the distance between the two neutral loci. In this parameter region, LD predicted by both models decays more quickly than under neutrality. How might these conclusions about the theoretical hitchhiking dynamics of LD influence the interpretation of population genomic polymorphism and divergence? Certainly it seems to inform any effort to identify regions in the genomes of natural populations in which there has been very recent selected substitution of newly arising mutations or otherwise rare variants. Hitchhiking may not increase LD in the neighborhood of a selected site as it has been widely thought; rather it can decrease it especially when the neutral sites are on opposite sides of the selected locus (see Figure 4). More generally LD that is built up by hitchhiking shortly after the occurrence of a favored mutation is quickly destroyed (even before fixation is reached). As a consequence, genomic regions around targets of recent positive directional selection are expected to exhibit a lack of LD, which is not simply due to the variation-reducing force of hitchhiking. This local dip in the magnitude of LD may be of use in the localization of targets of positive selection in the genome. Given the current debate of how various variation-reducing forces can be distinguished (in particular, bottlenecks from selective sweeps; Glinka et al. 2003; Haddrill et al. 2005), there is merit in attempting to include the specific pattern of LD predicted by these analyses into the methods for identifying targets of selection by selective sweeps (e.g., Kim and Stephan 2002; Kim and Nielsen 2004). Because populations that have undergone population size bottlenecks should show elevated genomewide levels of LD, regions lacking LD around targets of selection may be more easily distinguishable from the rest of the loci than when statistics that are solely based on the reduction of variation are used.
We have not attempted to extend our results to situations in which recurring and genomically randomly distributed hitchhiking events occur. The significant impediment to the analysis of the effect of such recurrent hitchhiking on heterozygosity may be the impact of simultaneous events within the same genomic region. But if selection is strong and events are sufficiently rare such occurrences may be negligible (Kaplan et al. 1989; Durrett and Schweinsberg 2005). While this issue of the dynamic interaction of simultaneous linked hitchhiking events may well remain for the analysis of the impact of hitchhiking on LD, there is clearly a second considerable issue. While in large populations the heterozygosity does not change in between hitchhiking events, that is not true of LD, which, of course, decays in magnitude at rate r. If the rate of the recurrent and randomly distributed hitchhiking events were sufficiently rare and there were no other force causing LD, the results given above are applicable, since LD would decay to zero throughout the genomic region before the next event. Given that LD is, in fact, commonly present on some scale in the various studied species, further analysis and/or simulations are warranted to make a general prediction of the genomic pattern.
APPENDIX
Quasi-invariance (embedded selected locus case):
Here we examine in more detail the case where the selected locus is between the neutral loci. More exactly, we wish to keep the sum rLS + rRS fixed to some value, say ρ, and consider varying rLS and rRS while satisfying that condition. To avoid being long-winded, we call this kind of translation of the selected locus a constrained S-translation. We wish to show that the dynamics of certain linkage disequilibria are quasi-invariant, as we clarify presently, under the constrained S-translation.
First, note that the dynamics of pS do not depend at all on the position of the selected locus. Then, as rLR = rLS + rRS for the case under consideration, recursions (6) and (10) do not change under the constrained S-translation. Since rLS,R = rRS and rRS,L = rLS, we have rLS,R + rRS,L = rRS + rLS and rLRS = rLR,S + rRS + rLS. If no double crossovers are allowed, then rLR,S is identically zero. Note that and that the sum g(rLS) + g(rRS) does not change under the constrained S-translation. Therefore, recursions (7) and (11) do not change under the constrained S-translation, as long as rLR,S does not depend on where in between the neutral loci the selected locus is located.
We now turn to the product CLSCRS. For ease of notation, we define(A1)Then, (4), (5), (8), and (9) imply
The product CLSCRS satisfies the recursion
The quantity f(rLS; k)f(rRS; k) can be written as
where
Under the restriction that rLS + rRS = ρ, the maximum value of rLSrRS is ρ2/4, whereas the minimum is 0. Since γ(k) is positive definite, the maximum variation of CLS(n)CRS(n), as the selected locus moves between the neutral loci, can be obtained by comparing CLS(n)CRS(n) at rLSrRS = 0 with that at rLSrRS = ρ2/4. Define maximal relative variation ε(n) as
It is straightforward to show that
where “…” represents terms proportional to ρm, m ≥ 4. In the case of directional selection, γ(k)/(α(k) + ρβ(k)) is of order 1 for all values of s, h, pS(k), and ρ. Therefore, ε(n) = O(ρ2n), and we conclude that relative variation increases as time passes.
The dynamics of CLR and CLRS are almost (or quasi-) invariant under the constrained S-translation in the following sense: the range of ρ in which selection has observable influence on the dynamics of CLR and CLRS is where . In that case, it is possible to maintain
throughout the entire period from the initial generation to the fixation generation. We would then observe almost no variation in CLR or CLRS as the location of the selected locus is varied between the neutral loci. For large ρ, selection has little influence on CLR and CLRS, so their dynamics should be approximately invariant under translation of the selected locus.
An exact symmetry:
Suppose that the selected locus is outside the two neutral loci and that recombination fractions rLS, rRS, rLR, rLRS, rLR,S, rRS,L, and rLS,R appearing in the system of full recursions (1)–(11) are fixed. In what follows, the dominance coefficient h is assumed to be arbitrary. Let {pS(0), pL(0), pR(0), CLS(0), CRS(0), CLR(0), CLRS(0)} and {p′S(0), p′L(0), p′R(0), C′LS(0), C′RS(0), C′LR(0), C′LRS(0)} denote two different sets of initial conditions. At generation n > 1, we use “prime” to refer to the allele frequencies and LDs obtained using the second set of initial conditions.
We first consider the second-order LDs involving the selected locus.
Lemma 1. Suppose that CLS(0) = C′RS(0) and CRS(0) = C′LS(0). Then, for all n ≥ 1,
Proof. This result follows from induction on n. Recall thatwhere the function f(r, k) is defined as in (A1). Similarly,
If CLS(0) = C′RS(0) and CRS(0) = C′LS(0), then
Suppose that the claim is true for all 1 ≤ n ≤ k. Then, for n = k + 1,
where the third line follows from the induction hypothesis. ▪
Using the above lemma, we can obtain the following result regarding the third-order LD and the LD between the neutral loci:
Proposition 1. Suppose that pS(0) = p′S(0), CLS(0) = C′RS(0), CRS(0) = C′LS(0), CLR(0) = C′LR(0), and CLRS(0) = C′LRS(0). Then,for all n ≥ 1.
Proof. First, note that pS(0) = p′S(0) implies pS(n) = p′S(n), , and
for all n ≥ 1. Therefore, since C′LSC′RS = CLSCRS by Lemma 1, we obtain
which is equivalent to (10), and
which is equivalent to (6). Similarly,
and
are equivalent to (11) and (7), respectively. Hence, C′LR and C′LRS satisfy exactly the same set of recursions as do CLR and CLRS. Since C′LR(0) = CLR(0) and C′LRS(0) = CLRS(0), it thus follows that C′LR(n) = CLR(n) and C′LRS(n) = CLRS(n) for all n ≥ 1. ▪
Acknowledgments
We thank Michael Turelli for teaching us the multilocus formalism and John Gillespie for supplying us with an unpublished manuscript on pseudo-hitchhiking. We acknowledge support from the following sources: the Volkswagen-Foundation grant I/78 815 (W.S.), the National Science Foundation grants ELA-0220154 (Y.S.S. and C.H.L.) and IIS-0513910 (Y.S.S. and C.H.L.), and the National Human Genome Research Institute (National Institutes of Health) grants 5R01HG002107-03 and 5R01HG002942-02 (C.H.L.).
Footnotes
Communicating editor: J. Wakeley
- Received August 27, 2005.
- Accepted January 30, 2006.
- Copyright © 2006 by the Genetics Society of America