Originally published as Genetics Published Articles Ahead of Print on February 1, 2006.
Genetics, Vol. 172, 2647-2663, April 2006, Copyright © 2006
doi:10.1534/genetics.105.050179
The Hitchhiking Effect on Linkage Disequilibrium Between Linked Neutral Loci
Wolfgang Stephan*,1,
Yun S. Song
and
Charles H. Langley
* Section of Evolutionary Biology, Biocenter, University of Munich, 82152 Planegg-Martinsried, Germany and
Department of Computer Sciences and
Section of Evolution and Ecology, University of California, Davis, California 95616
1 Corresponding author: University of Munich, Grosshaderner Strasse 2, 82152 Planegg, Germany.
E-mail: stephan{at}zi.biologie.uni-muenchen.de
Manuscript received August 27, 2005.
Accepted for publication January 30, 2006.
ABSTRACT
We analyzed a three-locus model of genetic hitchhiking with one locus experiencing positive directional selection and two partially linked neutral loci. Following the original hitchhiking approach by Maynard Smith and Haigh, our analysis is purely deterministic. In the first half of the selected phase after a favored mutation has entered the population, hitchhiking may lead to a strong increase of linkage disequilibrium (LD) between the two neutral sites if both are <0.1s away from the selected site (where s is the selection coefficient). In the second half of the selected phase, the main effect of hitchhiking is to destroy LD. This occurs very quickly (before the end of the selected phase) when the selected site is between both neutral loci. This pattern cannot be attributed to the well-known variation-reducing effect of hitchhiking but is a consequence of secondary hitchhiking effects on the recombinants created in the selected phase. When the selected site is outside the neutral loci (which are, say, <0.1s apart), however, a fast decay of LD is observed only if the selected site is in the immediate neighborhood of one of the neutral sites (i.e., if the recombination rate r between the selected site and one of the neutral sites satisfies
). If the selected site is far away from the neutral sites (say, r > 0.3s), the decay rate of LD approaches that of neutrality. Averaging over a uniform distribution of initial gamete frequencies shows that the expected LD at the end of the hitchhiking phase is driven toward zero, while the variance is increased when the selected site is well outside the two neutral sites. When the direction of LD is polarized with respect to the more common allele at each neutral site, hitchhiking creates more positive than negative linkage disequilibrium. Thus, hitchhiking may have a distinctively patterned LD-reducing effect, in particular near the target of selection.
GENETIC drift is recognized as a fundamental stochastic force shaping the polymorphism within populations and divergence between species of both neutral and selected variants at a locus (FISHER 1930; WRIGHT 1931; KIMURA 1983). Remarkably similar patterns to those caused by genetic drift are predicted by theoretical models incorporating stochastically varying selection (GILLESPIE 1994). In 1974 Maynard Smith and Haigh analyzed a simple and obvious extension of the genetic drift models, incorporating the stochastic coupling due to linkage with another locus undergoing strong directional selection. Addressing the apparent uniformity of allozyme polymorphism across species, MAYNARD SMITH and HAIGH (1974) focused on the "hitchhiking effect" on allelic frequencies and heterozygosity.
Since this seminal work the hitchhiking effect associated with positive directional selection has been extended to address observations in the emerging field of molecular population genetics (AGUADÉ et al. 1989; STEPHAN and LANGLEY 1989; BEGUN and AQUADRO 1992). These studies revealed low levels of DNA sequence polymorphism in Drosophila in genomic regions of low crossing over and led to theoretical analyses of the hitchhiking effect on nucleotide diversity (KAPLAN et al. 1989) and on the frequency spectrum of polymorphisms (BRAVERMAN et al. 1995). Whether formulated in terms of gamete frequencies or in the coalescent framework, the genetic models dealt with a pair of loci: a selected and a linked neutral locus with a defined rate of recombination between them. Independent of whether single or recurrent hitchhiking events were analyzed (as a stochastic process or a deterministic approximation), the conclusion has consistently been that hitchhiking should have profound effects on expected heterozygosity and allele frequencies at the neutral locus, if selection is strong and linkage is tight.
In 1977 Thomson considered the impact of hitchhiking on linkage disequilibrium (LD). Most of her analysis focused on the association between the selected alleles and those at a single neutral locus. Thomson also considered the impact of hitchhiking (of a heterotic polymorphism) on the LD between alleles at two linked neutral loci. On the basis of numerical examples she concluded that hitchhiking creates LD between neutral loci within the same genomic domain in which it affects levels of heterozygosity. The impact of simple directional selection of rare variants rapidly going to fixation was not considered by THOMSON (1977). Except for this initial study and a few rather targeted applications (ROBINSON et al. 1991; GROTE et al. 1998), no attempts have been made to extend this simple two-locus hitchhiking model of MAYNARD SMITH and HAIGH (1974) to multiple loci. This is curious, as the hitchhiking effect has played a major role in molecular population genetics for >15 years.
On the basis of the analyses of Thomson and her colleagues, it has been believed that hitchhiking affects not only polymorphisms at individual sites (or loci), but also the association between polymorphisms. Using a three-locus model with one selected and two neutral loci, she showed that hitchhiking that has a strong effect on heterozygosity can also generate strong LD between the neutral loci. Without paying close attention to Thomson's writing, the important role of hitchhiking in generating LD has been reiterated in textbooks and publications by many authors. It was not until recently that a few authors reported quite the opposite results (GILLESPIE 1997; KIM and STEPHAN 2002; KIM and NIELSEN 2004). Analyzing "shift" models that are similar to the hitchhiking model described above, GILLESPIE (1997, p. 297) concluded that "linked selection can reduce variation without building up high levels of linkage disequilibrium, contrary to our intuition." These latter studies focused on average effects observable in simulated data. In small-sample coalescent simulations, KIM and NIELSEN (2004) found increased LD between alleles at two neutral loci on the same side of the selected locus at the time of fixation and reduced LD across the site of selection. Furthermore, they provide heuristic arguments to explain this pattern. These different and somewhat contradictory views of the relationship between genetic hitchhiking and LD motivated us to pursue an analytic investigation of the question. We followed the deterministic approach of MAYNARD SMITH and HAIGH (1974), extending their model to three loci as did THOMSON (1977). To analyze this model, we used the framework of BARTON and TURELLI (1991), which provides a natural setting for the investigation of the directional hitchhiking, yielding transparent mathematical expressions that illuminate the rather surprising dynamics of LD under the hitchhiking effect.
THE THREE-LOCUS HITCHHIKING MODEL
We consider a three-locus model with two neutral loci and one
selected locus. For each locus, we assume that there are only
two allele types, denoted by 0 and 1. The selected locus may
be between the two neutral ones or on either side of them (see
Figure 1). We denote by L and R the left and right neutral loci,
respectively, and by S the selected locus. The corresponding
recombination fractions between loci are
rLR, etc. We assume
positive directional selection according to the following fitness
scheme:
where
s is the selection
coefficient of the selected allele (type 1) and
h the dominance
coefficient. Note that we follow here the notation of
MAYNARD SMITH and HAIGH (1974),
not the definition of
KAPLAN et al. (1989). Effective population
size
Ne is assumed to be very large (

), such that a deterministic analysis of the model is appropriate.

View larger version (3K):
In this window
In a new window
Download PPT slide
|
FIGURE 1. Three possible configurations. The selected locus is denoted by S, whereas the left and the right neutral loci are denoted by L and R, respectively. The recombination rate rLR between L and R is given by adding or subtracting rLS and rRS, depending on the configuration.
|
|
The system of full recursions:
To derive the recursions for the marginal (type 1) allele frequencies
pL,
pR, and
pS at the three loci and the LDs (central moments)
of the second (
CLR,
CLS,
CRS) and third (
CLRS) orders measured
with respect to type 1 alleles, we follow the approach of
BARTON and TURELLI (1991).
In that approach, coefficients that appear in the recursions
are recombination rates and generalized selection coefficients,
the latter denoted by

and

, where
X and
Y are nonempty
subsets of loci. (Generalized selection coefficients are defined
as coefficients that appear in expressing the relative fitness
in terms of certain quantities related to LDs.) In our case,
nonzero selection coefficients that appear in the recursions
at time
t are
For
h =

, these simplify to
Following BARTON and TURELLI (1991), we define rLRS = rLS,R + rRS,L + rLR,S and use rX,Y to denote the rate of recombination events that partition the loci into two nonempty sets X and Y. In our work, we ignore double-crossover recombination events, so we define rX,Y = 0 if the partition X, Y corresponds to a double-crossover event. For example, if the selected locus S is between the neutral loci, then rLS,R = rRS, rRS,L = rLS, and rLR,S = 0. Further, we assume that recombination rates are additive. The recombination rate rLR is therefore given by adding or subtracting rLS and rRS, depending on the configuration (see Figure 1).
To simplify notation, we hereafter omit writing the dependence on time t. We define qS = 1 pS, qL = 1 pL, and qR = 1 pR. Marginal allele frequencies satisfy the following recursions:
 | (1) |
 | (2) |
 | (3) |
The LDs satisfy the recursions
 | (4) |
 | (5) |
 | (6) |
 | (7) |
where
 | (8) |
 | (9) |
 | (10) |
 | (11) |
with
The above general recursions apply to all three
configurations shown in
Figure 1. Depending on the particular
configuration being considered, recombination rates
rX,Y need
to be defined appropriately.
The system of truncated recursions:
We explore the behavior of the recursion
Equations 1
7 in the region

and

(see
MAYNARD SMITH and HAIGH 1974).
Here,
r may be any recombination parameter appearing in the
Equations 1
7. Keeping only the terms linear in
s or
r leads to the following set of recursions:
 | (12) |
 | (13) |
 | (14) |
 | (15) |
 | (16) |
 | (17) |
 | (18) |
These equations agree to first order in
r and
s with those of
THOMSON (1977) (compare her
Equations 30iii,
30iv, and 31). Since the hitchhiking effect can be best observed
when

(such that simultaneously

holds), we may approximate the truncated recursions by the following ordinary differential
equations (ODEs; see
MAYNARD SMITH and HAIGH 1974):
 | (19) |
 | (20) |
 | (21) |
 | (22) |
 | (23) |
 | (24) |
 | (25) |
Here we have introduced
time
t (in generations) into the equation for
pS and parameterized
the other quantities by
pS (which is a monotonically increasing
function of
t).
Structure of equations:
Several features of the dynamical system become apparent from
these two sets of equations. Most importantly, selection acts
on the alleles at the neutral loci indirectly and in a strictly
hierarchical fashion: on the marginal neutral allele frequencies
via the pairwise LDs
CLS and
CRS, on the LD between the neutral
sites by the third moment, and on the latter by a fourth-order
term (
i.e., the product of the two pairwise LDs
CLS and
CRS).
Analytical solutions when the selected locus is between the neutral loci:
The ODEs for the LDs are first-order linear differential equations.
The ODEs for the pairwise LDs
CLS and
CRS are homogeneous. The
equations for
CLR and
CLRS contain the higher-order moments
as inhomogeneous terms that act as "driving forces" of the dynamics.
However, except for this impact of the higher-order terms, the
equations are decoupled and can be solved successively. We have
the following results.
The frequency of the selected allele at locus S is
 | (26) |
whereas marginal allele frequencies
at the neutral loci are
 | (27) |
 | (28) |
where
 | (29) |
which corresponds to the heterozygosity
at the selected locus. The LDs
CLS(
t) and
CRS(
t) can be written
as
 | (30) |
 | (31) |
Given these solutions for CLS(t) and CRS(t), the coupled ODEs (24) and (25) admit simple exact solutions when the selected locus is between the two neutral loci. More specifically, the third-order LD is given by
 | (32) |
and the LD between the neutral loci can be
written as
 | (33) |
where
rLR =
rLRS =
rLS +
rRS.
Analytical solutions when the selected locus is outside the neutral loci:
Solutions to the ODEs (
19)(
23) do not depend on whether
the selected locus is inside or outside the two neutral loci.
For example, the allele frequency
pS(
t) and the LDs
CLS(
t) and
CRS(
t) are given by (
26), (
30), and (
31), respectively, in all
cases. However, the dynamics of the ODEs (
24) and (
25) for
CLR(
t)
and
CLRS(
t), respectively, depend crucially on the position
of the selected locus S with respect to the neutral loci L and
R. As we elaborate presently, the dynamics of
CLR(
t) when S
is between L and R exhibit radically different behavior than
when S is outside.
In what follows, suppose that S is to the right of R, which implies rLS = rLRS = rLR + rRS. The case where S is to the left of L can be handled in a similar vein, with rRS replaced with rLS. Now, the ODE (25) for the third-order LD CLRS(t) does not admit a closed-form solution; a general solution can be obtained in terms of the incomplete beta function B(z; x, y), defined as
. However, noting that B(z; 1 2rRS/s, 1 + 2rRS/s)
z if
, we obtain the following simple approximate solution:
 | (34) |
Further,
using this solution and the approximation
B[
z; 2 2
rRS/
s,
1 + 2
rRS/
s]
z2 for

, we obtain the following approximate solution to (
24):
 | (35) |
Note the striking
resemblance of (
34) and (
35) to (
32) and (
33), respectively.
For
rRS = 0, the two sets of equations agree exactly, and hence
our solutions for different regions form one continuous solution
for the entire domain. The only difference between the two sets
of equations is that, in (
34) and (
35), an extra factor

appears together with [
pS(
t)
pS(0)]/
HS(0). This simple difference leads to important observable
differences in the dynamics of the LDs.
Comparison of approximate analytic solutions with numerical solutions to the full recursions:
We have written a computer program to solve the full recursions
(
1)(
11) numerically. Comparison of our analytic solutions
(
33) and (
35) with numerical solutions to the full recursions
are shown in
Tables 1 and
2, respectively. As these tables show,
our analytic solutions are a good approximation to the exact
dynamics. In obtaining our analytic solution (
35) for the case
in which the selected locus is outside the neutral loci, recall
that we assumed

to approximate the incomplete beta function. As expected,
Table 2 shows that
(
35) becomes less accurate as
rRS increases, but it is a good
approximation as long as

.
View this table:
In this window
In a new window
|
TABLE 1 Comparison of the exact CLR(t) (obtained by solving the full recursions numerically) with the analytic approximation (33) when the selected locus is between the two neutral loci
|
|
View this table:
In this window
In a new window
|
TABLE 2 Comparison of the exact CLR(t) with the analytic approximation (35) when the selected locus is to the right of locus R
|
|
THE DYNAMICS OF LD
Here we consider the dynamics of the LD between the two neutral
loci. We utilize our analytic solutions from the previous section
to study several important aspects of the dynamics.
Vanishing LD:
In the domain

, the term
sCLSCRS dominates the recursion for the third moment
[see (
18)] and hence influences also the LD between the neutral
sites. If the selected site is between the two neutral sites
and linkage is sufficiently tight (
rLR < 0.1
s), |
CLR| quickly
increases after the favored mutation has entered the population
and, after transiently reaching a peak, rapidly decays to zero
before the selected phase ends. To show this, define
tf as the
time satisfying
pS(
tf) = 1
pS(0). Henceforward, we loosely
refer to this time as the
fixation time. Using (
26), one can
show that
 | (36) |
We wish
to show that, if the selected locus is between the two neutral
loci, then
CLR(
tf)

0 for all possible initial conditions of
interest. Common to all initial conditions is that
pS(0) = 1/(2
N),
with
N being the population size. For
N =

10
410
6,

, and therefore
 | (37) |
which implies that (
33) at
tf can be written
approximately as follows:
 | (38) |
We use {000, 001, 010, 011, 100, 101, 110, 111} to denote gametic types. Their frequencies are denoted by {f000, f001, f010, f011, f100, f101, f110, f111}, which are related to the marginal frequencies and the LDs as follows:
Using these relations, we obtain
 | (39) |
At
t = 0, a new
favored mutation occurs on only one of the following gametic
types: 000, 001, 100, or 101. For instance, if the mutation
occurs on a gamete of type 101,
f010(0) =
f011(0) =
f110(0)
= 0 and
f111(0) =
pS(0) = 1/(2
N). More generally, only one of
f010(0),
f111(0),
f011(0),
f110(0) is supposed to be nonzero.
This implies that the right-hand side of (
39) must be zero.
Hence, the coefficient of exp(
rLRtf) in (
38) is exactly
zero. If the approximation (
37) were not used, then the coefficient
of exp(
rLRtf) in
CLR(
tf) would not be exactly zero, but
would still be very small. Note that exp(
rLRtf) may not
be very small. For example, exp(
rLRtf) = 0.452813 for
pS(0) = 0.00005,
rLR = 0.00002, and
s = 0.001. This shows that,
for small recombination rates (

), selection rather than recombination is the dominant force that
causes
CLR(
t) to vanish before the fixation time. For large
recombination rates, the contribution exp(
rLRt) from
recombination should dominate over selection effects.
This behavior may be explained as follows. Under the scenario of tight linkage and strong selection, a low-frequency gamete on which the favored mutation landed is quickly dragged into intermediate to high frequency. If the recombination rates are nonzero, this gamete may undergo recombination, thereby creating the two types of single recombinants that also carry the selected allele and thus increase in frequency. This reduces the LD between L and R created by the hitchhiking effect in the first half of the selected phase. This hitchhiking effect on the recombinants is stronger, the greater the linkage between the selected site and the two neutral sites is, and thus also the product CLSCRS.
Figure 2a shows another important observation: LD may vanish very quickly in the selected phase, while relative heterozygosity approaches a finite (i.e., nonzero) equilibrium value. Thus, LD does not vanish because of the variation-reducing effect of hitchhiking per se, but as a consequence of secondary hitchhiking effects on the recombinants created in the selected phase (described above).

View larger version (17K):
In this window
In a new window
Download PPT slide
|
FIGURE 2. Example trajectories of CLR(n) are shown on the left-hand side, and those of normalized heterozygosity Het(n)/Het(0) are shown on the right-hand side, where solid (resp. dashed) lines correspond to locus L (resp. R). (a) The selected locus is at the midpoint between the neutral loci, i.e., rLS = rRS = 0.01s. (b) The selected locus is to the right of locus R and rRS = 0.01s. (c) The selected locus is to the right of locus R and rRS = 0.03s. (d) The selected locus is to the right of locus R and rRS = 0.3s. Exact recursions (1)(11) were used, with s = 0.01, rLR = 0.02s, pS(0) = 0.00005, pL(0) = 0.38, pR(0) = 0.41, and CLR(0) = 0.0242. At generation n = 3982, pS(n) = 1 pS(0). Initial conditions were chosen so that CLR(0) 0, for two reasons: to demonstrate that, when the selected locus is between the neutral loci (as in case a), CLR(n) quickly vanishes in the selected phase even if the initial value CLR(0) is not zero and to contrast the effect of recombination (see case d) with that of selection.
|
|
A selected mutation occurring outside the two neutral sites
on a low-frequency gamete may also lead to a transient peak
of
CLR, if both neutral polymorphic sites are <0.1
s recombination
distances away from the selected site (see
Figure 2, b and c).
Although this peak vanishes faster than under neutral conditions
(
i.e., with the selected site far away from the neutral sites,
as in
Figure 2d), the decay rate is not as high as when the
favored mutation occurs between the neutral sites (
Figure 2a).
We analyze this behavior in more detail below.
Shown in Figure 3 is a plot of CLR(n) for varying position of the selected locus. As in Figure 2, the distance between the two neutral loci is fixed at rLR = 0.0002, and the same set of initial conditions is used. Note that this plot is symmetric about the plane r = 0; we return to this point later in the article. We stress that our conclusions described above do not depend on the particular values of neutral marginal allele frequencies pL(0) and pR(0) used for illustration. Even for low values of pL(0) and pR(0), for example, the same conclusions hold.

View larger version (22K):
In this window
In a new window
Download PPT slide
|
FIGURE 3. The LD CLR between the neutral loci as a function of the position of the selected locus, for rLR = 0.0002 and s = 0.01. Here, r is the position of the selected locus S, and the value r = 0 corresponds to the midpoint between the neutral loci. Locus L is fixed at r = 0.0001, whereas locus R is fixed at r = 0.0001. The same set of initial conditions as in Figure 2 was used for all r. Exact recursions (1)(11) were used to generate this plot.
|
|
An alternative illustration of the above discussion is provided
in
Figure 4, which shows pairwise LD plots for a region containing
100 neutral loci and a single selected locus. The two plots
shown correspond to two different time points. The selected
locus is located in the middle of the region and LD values below
a cutoff value are not plotted.

View larger version (10K):
In this window
In a new window
Download PPT slide
|
FIGURE 4. Truncated pairwise LD plots for a region consisting of 100 neutral loci and 1 selected locus located in the middle. The recombination rate between any two adjacent loci is 0.4s/100, and hence the recombination rate between the leftmost and the rightmost neutral loci is 0.4s. We used s = 0.01, pS(0) = 0.00005, pR(0) = pL(0) = 0.5, and CLR(0) = 0. The plots were truncated so that LD values <1% of the largest value 0.0577 were not included. (a) Pairwise LD plot at t = tf/2, where the time of fixation tf satisfies pS(tf) = 1 pS(0). (b) Pairwise LD plot at t = tf.
|
|
The maximum LD at tf:
Consider the case in which the selected locus S is to the right
of locus R. Viewing
CLR(
tf) as a function of
rRS, whether a
local optimum in the domain

exists depends on initial conditions. The example shown in
Figure 3 has a local maximum at
rRS/
s 
0.039. Differentiating the analytic
solution (
35), it is possible to determine whether there is
a critical point
rRS =
r*
RS that satisfies
 | (40) |
Suppose that, at
t = 0, the new gamete carrying
a selected allele (of type 1) at locus S is of type
ij1, with
i being the allele at locus L and
j the allele at locus R. Then,
given that [
i1
pL(0)][
j1
pR(0)] >
CLR(0),
where
ab is 1 if
a =
b or 0 if
a
b, we obtain
 | (41) |
and
The value of
r*
RS in (
41) may be very large for
some initial conditions. In such a case, as our analytic solution
(
35) is valid only in the domain

, all we can say for sure is that
CLR(
tf) has no critical point
in the domain

[
i.e.,
CLR(
tf)
is either a monotonically increasing or a monotonically decreasing
function of
rRS in that domain]. Further, if [
i1
pL(0)][
j1
pR(0)]
CLR(0), there is no real-valued
r*
RS such that
our approximate analytic solution (
35) satisfies (
40). Hence,
noting that
CLR(
tf) approaches

as
rRS/
s increases, we conclude that, in the domain

,
where
 | (42) |
and

. Note that the maximum possible value of
X is

. The critical point
rLS =
r*
LS for the case in which the selected locus is to the left of locus
L is also given by (
41).
Invariance of CLR and CLRS when the selected locus is between the two neutral loci:
When the favored mutation occurs between the two neutral sites,
the dynamics of the system of truncated recursions for
CLR and
CLRS do not depend on the position of the selected locus. This
can be seen immediately from
Equations 17 and
18, which depend
only on the sum
rLS +
rRS and not on any individual recombination
parameter. We show in the
APPENDIX that this invariance also
(nearly) holds for the system of full recursions.
INITIAL CONDITIONS AND THE PARAMETER SPACE OF THE MODEL
Here we assume that the selected locus is to the right of locus
R. For such a case, recall that the LD (measured with respect
to type 1 alleles) between the neutral loci is given by (
35).
We use
ijk to denote gametic types, with
i being for locus L,
j for locus R, and
k for locus S. By
the gamete of origin, we
mean the new gamete at
t = 0 carrying a selected allele (of
type 1) at locus S. We include
ij in superscript [
i.e., we write

] if the gamete of origin is of type
ij1. Using (
35), we obtain
 | (43) |
where, as before,
ab is 1 if
a =
b or 0 if
a
b, and

(
t,
y) is defined as
 | (44) |
Recall that marginal (type 1) allele frequencies pL(t) and pR(t) at the neutral loci are given by (27) and (28), respectively. For
and
, we can use the approximation
from which it follows that
where

(
tf,
rRS/
s) is defined as
in (
44). Similar to

, we use

and

to denote
pL(
t) and
pR(
t), respectively, if
the gamete of origin is of type
ij1. It is straightforward to
show that
 | (45) |
 | (46) |
Frequency-averaged LD and the range of the hitchhiking effect:
In what follows, we use
xijk to denote the frequency of the
gametic type
ijk at time
t = 0 and define
xij· =
xij0 +
xij1. The type of the gamete of origin could be any of 001,
011, 101, and 111. Suppose that the probability of the gamete
of origin being of type
ij1 is equal to the frequency of the
gametic type
ij0 just before time
t = 0 (note that this frequency
is equal to
xij·). Then, the average value of
CLR(
t)
with respect to this probability is given by

, which we call a
frequency-averaged LD. We show
below that, contrary to people's common intuition, the effect
of selection on such an averaged LD does not depend on haplotype
diversity
xij· at
t = 0.
Using (43), we can show that
is given by
If
CLR(0)
= 0, then

for all
t. For
CLR(0)

0, we define
Note
that
rLR need not be much smaller than
s for our analytic solution
(
35) to be valid (recall that only

is required). Assuming

, we can ignore genetic drift and regard

as the behavior of LD under neutrality.
Thus,
A(
t,
rRS/
s) can be viewed as the ratio of the frequency-averaged
LD in the presence of selection to that in the absence of selection.
At the time
tf of fixation,
pS(
tf) = 1
pS(0) and therefore
 | (47) |
For given
rRS/
s,
A(
tf,
rRS/
s) depends only on
pS(0) = 1/(2
N); it has no dependence
on other initial conditions. A plot of
A(
tf,
rRS/
s) is shown
in
Figure 5 for
pS(0) = 0.00005.

View larger version (6K):
In this window
In a new window
Download PPT slide
|
FIGURE 5. A plot of A(tf, rRS/s) for pS(0) = 0.00005. The function A(tf, rRS/s), defined in (47), captures the effect of selection on both relative frequency-averaged LD and relative frequency-averaged heterozygosity [see (47) and (48)].
|
|
We now compare
A(
tf,
rRS/
s) with relative frequency-averaged
heterozygosity. Let us focus on the right neutral locus R and
define

. For
HR(0) = 2
pR(0)[1
pR(0)]

0, one can use (
46) to show that
 | (48) |
For
pS(0) = 1/(2
N),
this is

, which is equivalent to
Equation 14d of
STEPHAN et al. (1992) (the factor 2 in the
exponent of that formula needs to be replaced by 4 because of
the different definition of the selection coefficient). In the
absence of genetic drift,

(
tf)
=
HR(0) for
s = 0. Therefore, (
48) can be regarded as the ratio
of the frequency-averaged heterozygosity in the presence of
selection to that in the absence of selection. Surprisingly,
this ratio is exactly equal to the analogous ratio for LD shown
in (
47). The function
A(
tf,
rRS/
s) plays a special role in the
sense that it encodes the effect of selection on two different
frequency-averaged quantities.
For site heterozygosity, the hitchhiking effect is generally profound only when, provided that
, the recombination distance r between the selected and neutral sites satisfies r < 0.1s (MAYNARD SMITH and HAIGH 1974). The term determining this effect is (2N)4r/s, assuming that the initial frequency pS(0) of the selected allele is 1/(2N). Our above analysis shows that the range of a substantial reduction of LD due to hitchhiking [determined by
] is exactly equal to that for variation [determined by (2N)4r/s] (see Figure 5).
Characterization of equivalent initial conditions:
We now find the set of all initial conditions that lead to the
same value of
CLR at the time of fixation;
i.e.,
CLR(
tf) =
c,
where
c is some fixed constant.
To be concrete, suppose that the gamete of origin is of type 001, in which case x101 = x011 = x111 = 0 and x001 = pS(0) = 1/(2N). First, note that x000 + x010 + x100 + x110 + pS(0) = 1 implies
 | (49) |
which
defines a tetrahedron

as depicted in
Figure 6a. Second, using
(
43), one can show that

implies
 | (50) |
which
defines a surface

in a three-dimensional Euclidean space with
(
x110,
x010,
x100) as coordinates. The intersection of surface

with tetrahedron

, illustrated in
Figure 6b, corresponds to
the set of initial conditions such that

. A case in which the gamete of origin is of
type other than 001 can be handled in a similar vein.
Probability distributions of CLR(tf):
Recall that
CLR(
t) for
t > 0 depends on initial conditions.
In what follows, we regard initial gametic frequencies as being
random and consider the probability distribution of
CLR(
tf).
The squared correlation coefficient
R2(
tf) is addressed later
in the DISCUSSION. We assume that all initial gametic frequency
configurations
x000,
x010,
x100,
x110 are equally likely and
satisfy
x000 +
x010 +
x100 +
x110 +
pS(0) = 1. Under this assumption
of uniform distribution, it is possible to compute the probability
distribution

[
CLR(
tf) <
c] for fixed
rLR/
s and
rRS/
s. The key idea is to utilize the
characterization of equivalent initial conditions described
above. More precisely, as
c changes, the surface defined by
CLR(
tf) =
c changes in a smooth fashion, sweeping out a region
in three dimensions. The probability

[
CLR(
tf) <
c] is equal to the volume of the region
corresponding to
CLR(
tf) <
c inside

, normalized by the total
volume of

.
Our main result, illustrated in Figure 7, is
where

is defined below. Let
Cases with
c > 0 and
c < 0 are treated separately
below.
For c > 0:
If
c
b/4,

, and
c
ab(1
a), then
If
c
b/4,

, and
c >
ab(1
a), then
If
either
c >
b/4 or

, then

[
C <
c] = 1.
For c = |c| < 0:
If |
c|

(1
a)
2b/4, then
If |
c| > (1
a)
2b/4, then

[
C < |
c|] = 0.
Polarization:
Polarized LDs are measured with respect to major alleles. To
determine the polarized LD
C
(
tf) between the neutral loci, we
compute
 | (51) |
the main
point being that
C
(
tf) =
CLR(
tf) if

> 0 and
C
(
tf) =
CLR(
tf)
if

< 0. [Recall that
CLR(
t) is measured with respect to
type 1 alleles.]
First, for rLR = rRS = 0, note that
 | (52) |
Second, for fixed initial marginal
frequencies, we need to determine for what values of
rLR and
rRS the sign of

changes. Using (
45) and (
46), we can obtain
the following results:
- For i = 0,
0 if pL(0) >
and
 | (53) |
- For i = 1,
0 if qL(0) >
and
 | (54) |
- For j = 0,
0 if pR(0) >
and
 | (55) |
- For j = 1,
0 if qR(0) >
and
 | (56) |
Combined with (52), these equations determine completely whether C
(tf) = CLR(tf) or C
(tf) = CLR(tf) for given parameter values. For example, suppose that (i, j) = (0, 0). If pL(0) <
and pR(0) <
, then there is no real-valued solution to the condition
= 0, and therefore C
(tf) = CLR(tf) for all values of rLR and rRS. If pL(0) >
and pR(0) <
, or if pL(0) <
and pR(0) >
, then
changes sign as illustrated in Figure 8, a and b, where
and
Note
that
u takes its minimum value at
pL(0) = 1 and that

as

. Similarly,
v takes its minimum value at
pR(0) = 1 and

as

. If both
pL(0) >

and
pR(0) >

, then there are three possibilities, depicted in
Figure 8, ce.
Regions of positive C
(tf):
To determine the regions of positive
C
(
tf), we need to know
how the sign of

depends on
rRS; (
43) implies that the sign of

does not depend on
rLR. For concreteness, suppose
that (
i,
j) = (0, 0), in which case we can obtain the following
results from using (
43):
- If CLR(0)
0, then
0 for all rRS, and therefore the sign of C
(tf) is completely determined by that of
.
- If CLR(0) < 0, then
changes sign as illustrated in Figure 9, where
Note
that
w takes its minimum value of zero at |
CLR(0)| =
pL(0)
pR(0)
and that it increases monotonically as |
CLR(0)|/[
pL(0)
pR(0)]
decreases. The polarized LD
C
(
tf) is positive if and only if

and

are either both positive
or both negative. Examples are shown in
Figure 10.
As shown in
Figure 8,

tends to be positive in the neighborhood
of (
rLR,
rRS) = (0, 0). The size and shape of this neighborhood
depend on
u and
v. Likewise, as shown in
Figure 9,

tends to be positive in the neighborhood
of (
rLR,
rRS) = (0, 0), with the size of the neighborhood depending
on
w. As a consequence, the polarized LD
C
(
tf) also tends to
be positive near (
rLR,
rRS) = (0, 0).
More generally, if the gamete of origin is of type ij1, the sign of
(respectively,
) can be analyzed using (43) [respectively, (52)(56)]. For all ij, the polarized LD C
(tf) tends to be positive near (rLR, rRS) = (0, 0).
An exact symmetry when the selected locus is outside the two neutral loci:
Suppose that the selected locus is outside the two neutral loci
and that geometric configuration and recombination fractions
are fixed. Let {
pS(0),
pL(0),
pR(0),
CLS(0),
CRS(0),
CLR(0),
CLRS(0)} and {
p'
S(0),
p'
L(0),
p'
R(0),
C'
LS(0),
C'
RS(0),
C'
LR(0),
C'
LRS(0)} denote two different sets of initial conditions. At
generation
n > 1, we use "prime" to refer to the marginal
allele frequencies and LDs obtained using the second set of
initial conditions. In the
APPENDIX, we show that if
CLS(0)
=
C'
RS(0) and
CRS(0) =
C'
LS(0), while
pS(0) =
p'
S(0),
CLR(0)
=
C'
LR(0), and
CLRS(0) =
C'
LRS(0), then the system of full recursions
(
1)(
11) implies
for
all
n 
1. This is an exact symmetry result that holds for an
arbitrary dominance coefficient
h.
An application of this general result is the explanation of the symmetry of Figure 3 with respect to reflection about the r = 0 plane, for those regions corresponding to the selected locus being outside the neutral loci. Note that what is depicted in Figure 3 is different from the obviously symmetric case in which initial conditions CLS(0) and CRS(0) get exchanged when locus S is reflected about r = 0. In that figure, initial conditions remain fixed, while the geometric configuration of the loci and recombination fractions change upon reflection. That situation is related to changing initial conditions as described above, while keeping the geometric configuration and recombination fractions fixed.
DISCUSSION
To understand the forces that shape genomic variation in natural
populations and the divergence between species, observed patterns
must be compared to predictions of the models that faithfully
represent the mechanisms through which such forces may work.
While much of the natural selection of organismic phenotypes
may be effectively approximated by deterministic single-locus
equations, interactive and stochastic forces are thought to
play a significant role. Until recently genetic drift has been
considered the primary stochastic process determining the temporal,
geographic, and genomic distribution of the vast majority of
DNA sequence polymorphism and divergence. Gillespie has repeatedly
demonstrated and emphasized fundamental differences between
constant-fitness models and stochastically varying selection,
despite their superficial similarities (
GILLESPIE 1994). Recently
emerging results of surveys of genomic regions of low crossing
over per physical length indicate that linked selection rather
than genetic drift can dominate the levels of polymorphism within
populations (
AGUADÉ et al. 1989;
STEPHAN and LANGLEY 1989;
BEGUN and AQUADRO 1992). The hitchhiking effect not only reduces
the average level of heterozygosity in the surrounding genomic
regions, but it also leaves a skewed frequency spectrum (
BRAVERMAN et al. 1995).
The early study by
THOMSON (1977) indicated that linked selection
can create linkage disequilibrium. Several subsequent articles
have addressed specific cases (
ROBINSON et al. 1991;
GROTE et al. 1998)
or noted some temporal and spatial patterns (
KIM and NIELSEN 2004).
Here we have demonstrated that the hitchhiking effect involves
a number of strong and surprisingly distinct dynamics and patterns
of linkage disequilibrium. We believe that the approach we have
taken to address the impact of selection can be extended further
to address more complex selection schemes and genetic interactions.
The technological capacity of molecular population genomics is increasing rapidly. For example, the HapMap Project (INTERNATIONAL HAPMAP CONSORTIUM 2005) provides extensive genotypic survey results on >1 million SNPs in almost 300 individual humans. At this scale of observation one can anticipate much more powerful inferences about the role of direct selection, linked selection, crossing over, gene conversion, mutation, and geographic demography. Indeed, on the basis of such new data, genomic variation in the rate of crossing over has been proposed as the primary determinant of the patterns of linkage disequilibrium in human populations (MCVEAN et al. 2004).
Several representations/notations have been developed to analyze the dynamics of multilocus systems (BÜRGER 2000). Through a series of articles Barton and Turelli have elaborated and applied their method on the basis of the explicit representations of the moments of allele frequencies (BARTON and TURELLI 1987, 1991; TURELLI and BARTON 1990, 1994). Their representation proved surprisingly tractable and transparent in the analysis of the hitchhiking effect on linkage disequilibrium.
Our analysis begins with the full representation of the three-locus dynamics using the notation of Barton and Turelli. These equations suggest the familiar approximation, "truncated equations," in which
and small higher-order terms can be dropped. The truncated equations immediately expose much of the fundamental structure and their differential analogs, ordinary differential equations, allow approximate analytic solutions. Comparisons of these ODE dynamics with those of the Barton and Turelli representation indicate that the approximations remain quite accurate as long as
(see Tables 1 and 2). Particularly fortuitous and important is the role of the three-locus LD CLRS in driving the dynamics of the LD CLR between the two linked neutral loci and the dependence of CLRS on the product of the two-locus LDs CLS and CRS.
This systematic investigation of the dynamics of LD under hitchhiking reveals four important features. First, and quite generally, hitchhiking indeed generates LD during the initial half of the hitchhiking time course. As Figure 3 shows LD (positive in this instance) reaches a maximum shortly before the originally rare selected allele reaches 0.5. This result is consistent with Thomson's analysis of hitchhiking caused by the dynamics of an initially rare allele under balancing selection in that its frequency reaches an equilibrium closer to 0.5 than to 1.0. But what is truly surprising is that from several important perspectives the hitchhiking effect on LD is one of reduction. In Figure 3 it is obvious that in the second half of the hitchhiking period the large peak of LD (positive in this case) decreases rapidly. Figure 2 shows several configurations of initial conditions and demonstrates that the decline in the magnitude of LD is not attributable to decline in the heterozygosity at the two neutral loci. A second and striking result is that preexisting LD is completely destroyed when the selected locus is situated between the neutral sites. This geometric relationship produces a striking pattern when all pairwise associations are plotted together as in Figure 4. This is probably the mechanism behind the pattern noted by KIM and NIELSEN (2004). This LD-reducing effect of hitchhiking is also evident when the selected site is outside the neutral pair since much of the LD generated during the initial phase is destroyed in the latter phase. A third unexpected property of the hitchhiking on LD is that the averaging over the frequencies of the gametes with which the rare selected variant can be associated indicates that the net effect of hitchhiking would be to reduce preexisting average LD. This is despite the fact that hitchhiking does tend to increase the variance in LD (see below). Note in Figure 3 that there is considerably increased LD in both regions flanking the two neutral sites (i.e., when the selected site is outside and is close to the two neutral sites). When the rare favored allele appears on two of the other three haplotypes (10 or 01) the