P Pfaffelhuber, A Lehnert, W Stephan, Linkage Disequilibrium Under Genetic Hitchhiking in Finite Populations, Genetics, Volume 179, Issue 1, 1 May 2008, Pages 527–537, https://doi.org/10.1534/genetics.107.081497
Abstract
The model of genetic hitchhiking predicts a reduction in sequence diversity at a neutral locus closely linked to a beneficial allele. In addition, it has been shown that the same process results in a specific pattern of correlations (linkage disequilibrium) between neutral polymorphisms along the chromosome at the time of fixation of the beneficial allele. During the hitchhiking event, linkage disequilibrium on either side of the beneficial allele is built up whereas it is destroyed across the selected site. We derive explicit formulas for the expectation of the covariance measure D and standardized linkage disequilibrium
The detection of targets of positive selection using polymorphism data is an important research topic. There are two major patterns in DNA data that help to identify these targets. First, the fast fixation of a beneficial allele causes a reduction of neutral diversity at closely linked neutral loci and a distortion of the site-frequency spectrum. Second, the fast fixation of the beneficial allele causes an increased level of linkage disequilibrium (LD) around the selected site. Both patterns have been used to construct statistical tests to reject neutrality (Hudson et al. 1994; Kelly 1997; Depaulis and Veuille 1998; Fay and Wu 2000; Kim and Nielsen 2004).
While the diversity-reducing effect of genetic hitchhiking is well described on a quantitative level (e.g., Maynard Smith and Haigh 1974; Kaplan et al. 1989; Stephan et al. 1992; Barton 1998; Etheridge et al. 2006), investigations of patterns of LD only started with Kim and Nielsen (2004), using numerical simulations. Analytical expressions for measures of LD after a selective sweep have been obtained by Stephan et al. (2006), who use differential equations to derive an expression for the covariance measure D [defined in (2)] between a pair of neutral alleles linked to a beneficial allele. This study was complemented by a genealogical (i.e., backward in time) perspective in Pfaffelhuber and Studeny (2007) and McVean (2007).
The aim of this article is threefold: first, we describe a genealogical perspective of the joint genealogy of two neutral loci linked to a beneficial allele at the time of its fixation, which is accurate for large selection coefficients. Second, using the genealogical perspective, we derive an explicit analytic expression for standardized LD
In our genealogical perspective we rely on the frequently used assumption that the genealogy at the selected site is exactly star-like at the end of the selective sweep. We show that genetic hitchhiking can lead to perfectly associated (i.e.,
MODELS AND MEASURES OF LINKAGE DISEQUILIBRIUM
Maynard Smith and Haigh (1974) argued that neutral variants that are partially linked to the beneficial allele at t = 0 increase in frequency together with the beneficial allele. We extend this model to two neutral loci following Stephan et al. (2006). We have to take two possible geometries for the selected and the two neutral loci into account; see Figure 1. Either (a) the neutral loci are on the same side of the selected site or (b) the selected locus lies between the two neutral loci. Throughout we assume that mutation rates are sufficiently small that at most two alleles are segregating at each neutral locus. At the selected S-locus we call b the wild-type and B the beneficial allele. For the other loci, we call the alleles L, ℓ at the first and R, r at the second neutral locus. The neutral loci are called the L/ℓ- and R/r-loci or, in short, the L- and R-loci.
During reproduction, recombination events might occur. If a recombination event occurs between two loci, they have different ancestors. Taking the recombination probability per generation between the two loci as r and measuring time in units of 2N generations, a recombination event splits the ancestry of the two loci at rate ρ := 2Nr. These scaled recombination rates between all pairs of loci are given in Figure 1. Note that ρSR = ρSL + ρLR for geometry a and ρLR = ρLS + ρSR for geometry b.
Let us denote the allelic frequencies at the neutral loci by qL, qℓ, qR, qr, qLR, qLr, qℓR, qℓr; e.g., qLR gives the fraction of the total population carrying both the L-allele at the L-locus and the R-allele at the R-locus.
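To make the notation concrete, the following Python sketch computes two common LD measures from the four haplotype frequencies. The article's own definition (2) is not reproduced in this section, so the sketch assumes the standard forms D = qLR − qL qR and r² = D²/(qL qℓ qR qr); the function name `ld_measures` and the example frequencies are illustrative only.

```python
def ld_measures(q_LR, q_Lr, q_lR, q_lr):
    """Return (D, r2) from the four two-locus haplotype frequencies.

    Assumes the standard definitions D = q_LR - q_L * q_R and
    r2 = D**2 / (q_L * q_l * q_R * q_r); these are not copied from the article.
    """
    total = q_LR + q_Lr + q_lR + q_lr
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    q_L = q_LR + q_Lr                 # marginal frequency of allele L
    q_R = q_LR + q_lR                 # marginal frequency of allele R
    D = q_LR - q_L * q_R              # covariance measure of LD
    denom = q_L * (1.0 - q_L) * q_R * (1.0 - q_R)
    # r2 is undefined when either locus is monomorphic
    r2 = D * D / denom if denom > 0 else float("nan")
    return D, r2


# Perfect association: only the haplotypes LR and lr are present.
D_perfect, r2_perfect = ld_measures(0.5, 0.0, 0.0, 0.5)
# Linkage equilibrium: all four haplotypes equally frequent.
D_indep, r2_indep = ld_measures(0.25, 0.25, 0.25, 0.25)
```

In the first example the two loci are perfectly associated, so r² reaches its maximum of 1; in the second, D and r² are both 0.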
Usually, data are obtained from samples only while these equations are based on population frequencies. As a consequence, measures for LD need to be corrected for finite sample size (Hudson 1985). Denoting allelic frequencies in the sample by
The star-like approximation:
To approximate polymorphism patterns at the end of the selective sweep we use a genealogical perspective and introduce the star-like approximation. In this approximation we assume throughout that the selective sweep is so short that no new neutral mutations occur during fixation of the beneficial allele.
We proceed in three steps. First, we consider the selected site only; then we add a single neutral locus; finally, we add a second neutral locus. The latter approximation allows us to derive explicit expressions for
The genealogy at the selected site:
Consider a sample of beneficial alleles taken from the population at time T, the time of fixation. Clearly, there is a single haploid individual at time 0 that is the ancestor of all individuals in the sample. In our analysis we make the assumption that this individual at time 0 is in fact the most recent common ancestor of all possible samples. Consequently, the genealogy at the selected site is star-like.
The assumption of a star-like genealogy at the selected site is frequently used in the analysis of selective sweeps (Maynard Smith and Haigh 1974; Fay and Wu 2000; McVean 2007). Moreover, it has been shown that it is accurate as long as log(α) is large (Durrett and Schweinsberg 2004).
The genealogy at a linked neutral locus:
The joint genealogy at two linked neutral loci:
To derive expressions for LD between two neutral sites we have to extend the star-like approximation. During the selective phase several recombination events might happen. To distinguish them, we speak, e.g., of an SL-recombination event if it falls between the S- and the L-locus.
For both geometries we divide the time of the selective sweep into two halves. Toward the end of the selective phase, we assume that no recombinations to the wild-type background occur. The only events that occur in this phase are LR-recombination events, whose effect is that the alleles at the two loci become linked to different beneficial alleles; see Figure 3. The probability that the ancestries of the alleles at the L- and R-loci do not split in this second half is approximately
For the joint genealogy of both neutral loci during the starting phase of the selective sweep we have to distinguish between geometries a and b. We set p□ := p(ρ□) for □ = SL, LR, SR, LS. Let us first consider geometry a, where the selected locus lies outside the two neutral loci; see also Figure 4. All cases are listed in Table 1.
Case | Event | Probability |
---|---|---|
(i) | No recombination event | pSL pLR |
(ii) | An LR-recombination event makes the allele at the R-locus escape the sweep without the allele at the L-locus. | pSL(1 − pLR) |
(iii) | By an SL-recombination event the line escapes the sweep and the alleles at the L- and the R-locus stay linked. | (1 − pSL)pLR |
(iv) | An SL-recombination event brings the alleles at the L- and R-loci linked into the wild-type background; here, the ancestry of both alleles is split by an LR-recombination. | ℙ[(iv) or (v)] = (1 − pSL)(1 − pLR) |
(v) | An LR- and an SL-recombination event bring first the allele at the R-locus and then the allele at the L-locus into the wild-type background. | joint with (iv) |
All events are described backward in time.
Consider line (i) as an example. We assume that all recombination events that split the alleles at the two loci such that both remain in the beneficial background already occurred in the late phase of the sweep. Hence, all recombination events automatically bring at least one allele to the wild-type background and both alleles stay linked in the beneficial background only if neither an SL- nor an LR-recombination event occurs. Since recombination events between both pairs occur independently and the probability that no recombination event brings an allele in scaled recombination distance ρ to the wild-type background is p(ρ), it follows that case (i) has probability pSLpLR.
Observe that the effect of both lines (iv) and (v) is that the alleles at both loci are unlinked in the wild-type background. To produce one of these events there must be one SL- and one LR-recombination event. In line (iv) the first recombination event (backward in time) occurs between S and L and the second between L and R, while in line (v) the order is reversed. Altogether, one of the two events happens if and only if there is both an SL- and an LR-recombination event, which gives the stated probability (1 − pSL)(1 − pLR).
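The case probabilities for geometry a can be tabulated in a few lines of Python. This is only a sketch of the bookkeeping in Table 1: the values of pSL and pLR are passed in directly, because the explicit form of p(ρ) is not given in this section, and the function name and example numbers are hypothetical.

```python
def geometry_a_cases(p_SL, p_LR):
    """Probabilities of the Table 1 cases for one linked L-R pair (geometry a).

    p_SL and p_LR are the probabilities that no SL- (respectively LR-)
    recombination event moves an allele to the wild-type background.
    """
    return {
        "(i)": p_SL * p_LR,                       # no recombination event
        "(ii)": p_SL * (1.0 - p_LR),              # only the R-allele escapes
        "(iii)": (1.0 - p_SL) * p_LR,             # both escape, still linked
        "(iv) or (v)": (1.0 - p_SL) * (1.0 - p_LR),  # both escape, unlinked
    }


# Hypothetical values, chosen only to illustrate the table.
cases_a = geometry_a_cases(p_SL=0.8, p_LR=0.9)
total_a = sum(cases_a.values())
```

Because the SL- and LR-recombination events are treated as independent, the four probabilities are the cells of a 2 × 2 product table and therefore sum to 1.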
The genealogy for geometry b can be obtained similarly. Figure 5 and Table 2 give all the details. Observe that for geometry b it is not possible that an allele at the L- and one at the R-locus are linked in the wild-type background at t = 0.
Line | Event | Probability |
---|---|---|
(i) | No recombination event | pLS pSR |
(ii) | An SR-recombination event makes the allele at the R-locus escape the sweep without the allele at the L-locus. | pLS(1 − pSR) |
(iii) | An LS-recombination event makes the allele at the L-locus escape the sweep without the allele at the R-locus. | (1 − pLS)pSR |
(iv) | An LS-recombination event followed by an SR-recombination event brings the alleles at the L- and the R-locus into the wild-type background. | ℙ[(iv) or (v)] = (1 − pLS)(1 − pSR) |
(v) | Same as (iv) but in reverse order of the LS- and SR-recombination events. | joint with (iv) |
Again, by the star-like approximation, the ancestry of each line of a finite sample behaves independently of the other lines.
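The independence of lines suggests a simple way to draw the Table 2 outcomes for a finite sample. The sketch below is an illustration under stated assumptions, not the authors' procedure: pLS and pSR are supplied as plain numbers (the form of p(ρ) is not given here), each sampled line is classified independently, and the function name and seed are hypothetical.

```python
import random


def sample_geometry_b(n, p_LS, p_SR, seed=0):
    """Draw the Table 2 outcome independently for each of n sampled lines."""
    rng = random.Random(seed)
    counts = {"(i)": 0, "(ii)": 0, "(iii)": 0, "(iv) or (v)": 0}
    for _ in range(n):
        # Each allele escapes to the wild-type background with
        # probability 1 - p, independently of the other allele.
        l_escapes = rng.random() > p_LS
        r_escapes = rng.random() > p_SR
        if not l_escapes and not r_escapes:
            counts["(i)"] += 1            # both stay in the beneficial background
        elif r_escapes and not l_escapes:
            counts["(ii)"] += 1           # only the R-allele escapes
        elif l_escapes and not r_escapes:
            counts["(iii)"] += 1          # only the L-allele escapes
        else:
            counts["(iv) or (v)"] += 1    # both escape, in either order
    return counts


# Hypothetical parameter values for a sample of n = 20 lines.
counts_b = sample_geometry_b(n=20, p_LS=0.8, p_SR=0.9)
```

For large n the relative frequencies approach the probabilities in Table 2; for n = 20 they fluctuate noticeably from one seed to the next.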
RESULTS
We are now in a position to obtain analytical results on measures of LD at the end of a selective sweep.
𝔼[D(T)]:
σ̂D²:
Assuming that the population was in neutral equilibrium when the sweep started, we predict the pattern of LD for α = 1000, n = 20, a per-site mutation rate of θ = 0.005, and ρ = 0.025 between two adjacent bases shown in Figures 6 and 7. Note that selection coefficients on the order of α = 1000 are observed in practice (Beisswanger et al. 2006). Significant amounts of LD build up on each side of the selected site, but there is no LD for a pair of polymorphisms from both sides of the selected site. In Figure 7 we assume that two neutral polymorphisms have a fixed distance and consider the dependence of LD on their distance to the selected site. We see here that the finite sample size has a profound effect on the level of LD. Moreover, even for ρLR = 50 a twofold increase of LD relative to neutral expectations can be expected if both neutral loci are at a distance of 2 kb from the selected site.
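The conversion between physical distance and scaled recombination rate used in this example is plain arithmetic. The sketch below assumes that recombination rates add linearly along the sequence, as in the additivity relations for Figure 1; the per-base value 0.025 comes from the text, while the function name is hypothetical.

```python
RHO_PER_BASE = 0.025  # scaled recombination rate between adjacent bases (from the text)


def rho_for_distance(bases, rho_per_base=RHO_PER_BASE):
    """Scaled recombination rate across a stretch of the given length in bases."""
    return bases * rho_per_base


rho_200 = rho_for_distance(200)     # 0.2 kb
rho_2000 = rho_for_distance(2000)   # 2 kb
```

With these parameters a distance of 0.2 kb corresponds to ρLR = 5 and a distance of 2 kb to ρLR = 50.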
The big effect of the finite sample size (n = 20 in the numerical example) close to the selected site on
Furthermore, using the star-like approximation we can compute fT/2, gT/2, and hT/2. For example, consider a linked pair of one allele at the L-locus and one allele at the R-locus in the beneficial background at time T/2. One way for it to be heterozygous at both loci is that the ancestors at time 0 form a linked pair of one allele at the L-locus and one at the R-locus, and that these are heterozygous. The probability for this event (which is denoted a11 below) is for geometry a given as
As a second example consider a23 for geometry a. This is the probability that one linked and one unlinked pair of one allele at the L- and one allele at the R-locus each, taken from the beneficial background at time T/2, have four different ancestors at time 0. Either the ancestral line of exactly one allele at the L-locus stays in the beneficial background [probability 2pSL(1 − pSL)] and both alleles at the R-locus escape the sweep [probability (1 − pLR)(1 − pSR)] or both alleles at the L-locus are linked to a wild-type allele at the beginning of the sweep [probability (1 − pSL)²] and the linked pair is split by an LR-recombination event [probability (1 − pLR)].
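Read literally, the two alternatives in this description add up to a23 = 2pSL(1 − pSL)(1 − pLR)(1 − pSR) + (1 − pSL)²(1 − pLR). The Python sketch below only transcribes that verbal decomposition; it is not taken from the article's equations, the three probabilities are supplied as inputs, and the function name is hypothetical.

```python
def a23_geometry_a(p_SL, p_LR, p_SR):
    """a23 for geometry a, assembled from the two alternatives in the text."""
    # Exactly one L-allele stays in the beneficial background,
    # and both R-alleles escape the sweep.
    one_L_stays = 2.0 * p_SL * (1.0 - p_SL) * (1.0 - p_LR) * (1.0 - p_SR)
    # Both L-alleles reach the wild-type background,
    # and the linked pair is split by an LR-recombination event.
    both_L_escape = (1.0 - p_SL) ** 2 * (1.0 - p_LR)
    return one_L_stays + both_L_escape


# Hypothetical values, for illustration only.
a23_example = a23_geometry_a(p_SL=0.5, p_LR=0.5, p_SR=0.5)
```

Two limiting cases give a quick check: if pSL = 1 no L-allele ever leaves the beneficial background and a23 = 0, while if pSL = 0 the expression reduces to 1 − pLR.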
Simulations:
We use the program SSW (Kim and Stephan 2002) to simulate data under a selective sweep and compare these simulations to our predictions for
The heuristics of Hudson (1985) that 𝔼[r2] and
For data analysis it is most important to see how long such a pattern can be observed. In Figure 9 we analyze the pattern of LD in the sweep region at three time points: before fixation when the frequency of the beneficial allele is 0.95 (which is for the given parameters the time t ≈ T − 0.01N), at the time of fixation, and 0.1N generations afterward. Two observations can be made here. First, LD between both sides of the selected site is destroyed only at the very end of a selective sweep. Second, while LD for closely linked (0.2 kb, which equals ρLR = 5) neutral variants is still elevated after 0.1N generations, the effect of selection on LD completely vanishes for more distant (2 kb, which equals ρLR = 50) neutral loci. A closer analysis reveals that the decay of LD is fastest directly after the selective sweep (see supporting online supplemental material).
DISCUSSION
Recently several statistical tests to infer selection using patterns of LD have been developed (Hudson et al. 1994; Depaulis and Veuille 1998; Sabeti et al. 2002; Toomajian et al. 2003; Kim and Nielsen 2004; Hanchard et al. 2006; Wang et al. 2006). The heuristics behind these tests are as follows: if a beneficial allele enters the population and increases in frequency, neutral variants increase in frequency by genetic hitchhiking. Recombination did not have much time during the selective sweep to break up linkage between these neutral polymorphisms. As a consequence, we see alleles that have both high frequency—typical for old alleles under neutrality—and long-range associations with other alleles, which is typical for young alleles (Sabeti et al. 2006).
In a simulation study, Jensen et al. (2007) carry out a power analysis of the test developed in Kim and Nielsen (2004). They show that distinct patterns of LD vanish within 0.1N generations after fixation of the beneficial allele. Such a signal is too weak to produce significant results using the overall pattern of LD. However, using the increased level of LD between tightly linked polymorphisms it might be possible to distinguish recurrent sweeps from neutrality or other demographic scenarios, for example, population bottlenecks.
On a fine scale, the effect of genetic hitchhiking on LD at the time of fixation can be described as follows (see also Figure 6): on either side of the beneficial allele, correlations between existent polymorphisms are built up, leading to long-range LD. Between the two sides of the beneficial allele LD is destroyed. This destruction can be explained heuristically: the observation of polymorphisms on any side of the beneficial allele (assuming no new mutations in the sweep) requires a recombination event between the beneficial allele and the neutral polymorphisms. By this recombination event a large haplotype is introduced into the population, leading to strong LD on each side of the beneficial allele. The existence of two neutral polymorphisms on both sides of the beneficial allele requires two independent recombination events, one on each side of the beneficial allele. By the independence of these events, LD vanishes when the beneficial allele fixes.
Looking at the pattern of LD at the end of a selective sweep, one might be tempted to conclude that there must be a hotspot of recombination at the selected site. This has been investigated by Reed and Tishkoff (2006), who indeed found that hitchhiking may confound tests for recombination hotspots. However, only hitchhiking can reduce sequence diversity, which helps to make a clear distinction between genetic hitchhiking and recombination hotspots (McVean 2007).
In our study, we use the star-like approximation for the genealogy at the selected site to describe patterns of LD. This approximation is already implicit in the analysis of Maynard Smith and Haigh (1974) and it still inspires new methods for data analysis (e.g., Nielsen et al. 2005). Our star-like approximation of the joint genealogy at the two neutral loci is a slight but crucial modification of the approach of McVean (2007). On the one hand, McVean does not describe splits in the wild-type background [see line (iv) in Figure 4] but implicitly accounts for these events. On the other hand, he ignores splits in the beneficial background that are shown in Figure 3. As a consequence, his star-like approximation becomes less accurate with increasing distance of both neutral loci. In addition, McVean's approximation is incompatible with the results on D obtained in Stephan et al. (2006). Like McVean we see a big effect of a finite sample size on patterns of LD. Since we use larger selection coefficients α, we could not reproduce his finding that neutral mutations that are more recent than the beneficial allele lead to a significant reduction in LD.
Generally, the star-like approximation gives a good approximation for
The star-like approximation has been criticized, and it has been proposed to replace it by the genealogy of a Yule process (Durrett and Schweinsberg 2004; Pfaffelhuber et al. 2006). The corresponding Yule approximation for the joint genealogy of two neutral loci was obtained by Pfaffelhuber and Studeny (2007). However, the star-like approximation is still useful. First, as shown by Durrett and Schweinsberg (2004), the star-like approximation for a single neutral locus gives correct results if log(α) is large. Second, by the independence of all lines during the selective sweep, it allows for explicit calculations. In particular, using the star-like approximation, it is possible to obtain not only predictions for standardized LD or second moments of D [see (13) and (14)] but also higher moments of D.
Recently, the model of selective sweeps has been extended to the case of multiple origins of the beneficial allele—so-called soft selective sweeps (Hermisson and Pennings 2005). Such multiple origins may, for example, be due to recurrent mutation to the beneficial allele during its fixation. Together with new mutants at the selected site, new ancestral haplotypes are imported into the beneficial background. As a consequence, statistical tests based on haplotype structure, i.e., LD, have most power to detect soft selective sweeps (Pennings and Hermisson 2006). Moreover, the coalescent at the selected locus was also derived (Pennings and Hermisson 2006): given that the frequency of the beneficial allele is x, an ancestral line escapes the sweep with rate θb(1 − x)/(2x), where θb is the scaled mutation rate to the beneficial allele. Extending the analysis of genealogies to pairs of neutral loci close to the selective sweep, we believe that an analysis of LD under soft selective sweeps is feasible, shedding new light on the distinction between classical and soft selective sweeps.
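The escape rate quoted above for soft sweeps is easy to evaluate numerically. The following Python sketch simply codes the expression θb(1 − x)/(2x) as stated in the text; the function name and the example values of θb and x are hypothetical.

```python
def soft_sweep_escape_rate(theta_b, x):
    """Rate theta_b * (1 - x) / (2 * x) at which an ancestral line escapes the sweep.

    theta_b is the scaled mutation rate to the beneficial allele and
    x is the current frequency of the beneficial allele (0 < x <= 1).
    """
    if not 0.0 < x <= 1.0:
        raise ValueError("x must lie in (0, 1]")
    return theta_b * (1.0 - x) / (2.0 * x)


# Hypothetical values: the rate grows as the beneficial allele becomes rare.
rate_half = soft_sweep_escape_rate(theta_b=2.0, x=0.5)
rate_rare = soft_sweep_escape_rate(theta_b=2.0, x=0.01)
rate_fixed = soft_sweep_escape_rate(theta_b=2.0, x=1.0)
```

The rate vanishes at fixation (x = 1) and is largest early in the sweep, when the beneficial allele is still rare.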
APPENDIX
Two relationships are important for the derivation of our results on LD: first, a genealogical interpretation of
Linkage disequilibrium and pairwise heterozygosities:
To interpret
Population and sample measures of linkage disequilibrium:
Footnotes
Communicating editor: R. Nielsen
Acknowledgement
We thank Joachim Hermisson for fruitful discussion and we are grateful to Pleuni Pennings and Nick Barton for helpful comments on our work. W.S. thanks the Deutsche Forschungsgemeinschaft for support (grant STE 325/7). P.P. thanks the BMBF for financial support via FRISYS (Kennzeichen 0313921).
References
Barton, N.,
Beisswanger, S., W. Stephan and D. DeLorenzo,
Depaulis, F., and M. Veuille,
Durrett, R., and J. Schweinsberg,
Etheridge, A., P. Pfaffelhuber and A. Wakolbinger,
Fay, J. C., and C. I. Wu,
Hanchard, N., K. Rockett, C. Spencer, G. Coop, M. Pinder et al.,
Hermisson, J., and P. S. Pennings,
Hudson, R., K. Bailey, D. Skarecky, J. Kwiatowski and F. Ayala,
Hudson, R. R.,
Jensen, J. D., K. R. Thornton, C. D. Bustamante and C. F. Aquadro,
Kaplan, N., R. Hudson and C. H. Langley,
Kelly, J.,
Kim, Y., and R. Nielsen,
Kim, Y., and W. Stephan,
Maynard Smith, J., and J. Haigh,
McVean, G. A.,
Nielsen, R., S. Williamson, Y. Kim, M. Hubisz, A. Clark et al.,
Ohta, T., and M. Kimura,
Pennings, P., and J. Hermisson,
Pfaffelhuber, P., and A. Studeny,
Pfaffelhuber, P., B. Haubold and A. Wakolbinger,
Reed, F. A., and S. A. Tishkoff,
Sabeti, P., D. R. J. Higgins, H. Levine, D. Richter, S. Schaffner et al.,
Sabeti, P., S. Schaffner, B. Fry, J. Lohmuller, P. Varilly et al.,
Song, Y. S., and J. S. Song,
Stephan, W., T. H. E. Wiehe and M. W. Lenz,
Stephan, W., Y. Song and C. H. Langley,
Strobeck, C., and K. Morgan,
Toomajian, C., R. Ajioka, L. Jorde, J. Kushner and M. Kreitman,
Wang, E., G. Komada, P. Baldi and R. Moyzis,