
Originally published as Genetics Published Articles Ahead of Print on December 28, 2006.

Genetics, Vol. 175, 1395-1406, March 2007, Copyright © 2007
doi:10.1534/genetics.106.062828


The Structure of Linkage Disequilibrium Around a Selective Sweep

Gil McVean1

Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom

1 Address for correspondence: Department of Statistics, 1 S. Parks Rd., Oxford OX1 3TG, United Kingdom.
E-mail: mcvean{at}stats.ox.ac.uk

Manuscript received June 30, 2006. Accepted for publication December 24, 2006.


    ABSTRACT
 TOP
 ABSTRACT
 TWO-LOCUS IDENTITIES AND A...
 MODELING GENEALOGIES UNDER A...
 NEUTRAL LOCI SEPARATED BY...
 NEUTRAL LOCI ON THE...
 INCORPORATING NEUTRAL MUTATIONS...
 STOCHASTIC SIMULATION
 DISCUSSION
 ACKNOWLEDGEMENTS
 LITERATURE CITED
 
The fixation of advantageous mutations by natural selection has a profound impact on patterns of linked neutral variation. While it has long been appreciated that such selective sweeps influence the frequency spectrum of nearby polymorphism, it has only recently become clear that they also have dramatic effects on local linkage disequilibrium. By extending previous results on the relationship between genealogical structure and linkage disequilibrium, I obtain simple expressions for the influence of a selective sweep on patterns of allelic association. I show that sweeps can increase, decrease, or even eliminate linkage disequilibrium (LD) entirely depending on the relative position of the selected and neutral loci. I also show the importance of the age of the neutral mutations in predicting their degree of association and describe the consequences of such results for the interpretation of empirical data. In particular, I demonstrate that while selective sweeps can eliminate LD, they generate patterns of genetic variation very different from those expected from recombination hotspots.


SELECTIVE sweeps, in which a beneficial mutation is swept to fixation in a population by natural selection, have a profound impact on patterns of linked genetic variation through what is known as the hitchhiking effect (MAYNARD SMITH and HAIGH 1974). Although simple in concept, studies of the process continue to uncover novel and unusual properties that have direct implications for the detection of such events from empirical data. For example, the realization that the interaction of hitchhiking with recombination can lead to an excess of high-frequency-derived mutations (FAY and WU 2000) gave novel insights into the well-known fact that hitchhiking can lead to a bias toward low-frequency polymorphism (FU and LI 1993; BRAVERMAN et al. 1995). Recently, studies of the effects of selective sweeps on patterns of linkage disequilibrium (LD) have also identified characteristic, and perhaps surprising, patterns (KIM and STEPHAN 2002; PRZEWORSKI 2002; KIM and NIELSEN 2004; REED and TISHKOFF 2005; STEPHAN et al. 2006). For example, while sweeps can lead to an increase in LD while they are still in progress (HUDSON et al. 1994; SABETI et al. 2002), when the beneficial mutation has reached fixation, LD across the selected site is eliminated (KIM and NIELSEN 2004; STEPHAN et al. 2006). Interpreting empirical patterns of genetic variation in the light of such observations is therefore potentially confusing and raises important questions. For example, are positions at which LD is observed to break down rapidly the result of selective sweeps or recombination hotspots? Indeed, it has been demonstrated that for certain population genetic methods selective sweeps may be falsely interpreted as hotspots of recombination (REED and TISHKOFF 2005).

The aim of this article is to provide an intuitive interpretation of the effects of selective sweeps on patterns of LD, through considering the relationship between LD and the structure of the underlying genealogical history. Previous work has shown that there is a direct quantitative relationship between the magnitude of LD observed between a pair of neutral mutations and the correlation structure of the underlying genealogy (MCVEAN 2002). By using the conventional approximation that strong selective sweeps lead to short, star-like genealogies at the selected site, this theory is extended to examine the correlation structure between the genealogies of neutral loci either separated by or adjacent to the selected site. Comparison with the results of stochastic simulation demonstrates that this theory predicts the qualitative and, to some extent, quantitative, behavior of LD around a selective sweep. In addition, the theory identifies the importance of the age of neutral mutations (relative to the selected one) in determining patterns of LD and predicts large differences in the nature of the breakdown of LD around a selective sweep and a recombination hotspot.


    TWO-LOCUS IDENTITIES AND A GENEALOGICAL INTERPRETATION OF LD
Informally, LD between neutral alleles at two loci arises because of correlations in the genealogical history of the two loci. Put another way, if the time to the most recent common ancestor (MRCA) for a pair of chromosomes at a given position, x, on the genome is informative about the time to the MRCA for the same pair of chromosomes at another genomic position, y (relative to any other pair of chromosomes), the alleles at the two loci are expected to show significant LD. However, different statistical measures of LD focus on different aspects of such correlation. Here we focus on one widely used two-locus measure of LD for biallelic loci, the square of the correlation coefficient in allelic state, $r^2$ (HILL and ROBERTSON 1968). For a pair of biallelic loci, with alleles 0 and 1 at locus x and also 0 and 1 at locus y, the statistic is defined as

$$r^2 = \frac{D_{11}^2}{f_x(1-f_x)\,f_y(1-f_y)}, \qquad D_{11} = f_{11} - f_x f_y \tag{1}$$
Here, $f_{11}$ is the sample frequency of the 11 haplotype and $f_x$ is the marginal sample frequency of the "1" allele at locus x. Note that for biallelic loci the value of $D_{11}^2$ does not depend on which allele is assigned the value 1. Consequently, in what follows the subscript for D is omitted.
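As a concrete illustration of the statistic, the following minimal sketch computes $r^2$ from a sample of two-locus haplotypes, assuming the standard definition $D = f_{11} - f_x f_y$. The function name `r_squared` and the representation of haplotypes as `(x, y)` tuples of 0/1 alleles are illustrative choices, not part of the original article.

```python
from collections import Counter

def r_squared(haplotypes):
    """Return r^2 for a sample of two-locus haplotypes.

    Each haplotype is an (x, y) tuple of 0/1 alleles (an assumed encoding).
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    f11 = counts[(1, 1)] / n                      # frequency of the 11 haplotype
    fx = sum(x for x, _ in haplotypes) / n        # frequency of allele 1 at locus x
    fy = sum(y for _, y in haplotypes) / n        # frequency of allele 1 at locus y
    if fx in (0.0, 1.0) or fy in (0.0, 1.0):
        raise ValueError("both loci must be polymorphic in the sample")
    d = f11 - fx * fy                             # coefficient of linkage disequilibrium
    return d * d / (fx * (1 - fx) * fy * (1 - fy))

sample = [(1, 1), (1, 1), (0, 0), (0, 0), (1, 0), (0, 0)]
# Relabeling the alleles at locus x changes the sign of D but not r^2.
relabeled = [(1 - x, y) for x, y in sample]
print(r_squared(sample), r_squared(relabeled))
```

The relabeled sample shows why the allele labels are arbitrary: swapping 0 and 1 at one locus flips the sign of D, which is squared in the statistic.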

Ideally, we wish to calculate the expected value of $r^2$ between alleles at the two loci, conditioning on observing at least one of each allele at each of the two loci in a sample of size n sequences:

$$E\left[r^2 \mid 0 < f_x < 1,\ 0 < f_y < 1\right] \tag{2}$$
There is, unfortunately, no simple expression for this expectation, although recent advances have been made in its numerical evaluation (SONG and SONG 2007). However, it is possible to derive expressions for a related quantity, called $\sigma_d^2$:

$$\sigma_d^2 = \frac{E[D^2]}{E[f_x(1-f_x)\,f_y(1-f_y)]} \tag{3}$$
(OHTA and KIMURA 1971). After this point the conditioning on segregation at the two loci will be implicit. It can be shown through Monte Carlo simulation (HUDSON 1985; MCVEAN 2002) that Equation 3 is a good approximation to the expectation of $r^2$ (i.e., Equation 2) for large sample sizes and when rare variants are excluded.
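The distinction between the expectation of a ratio (Equation 2) and the ratio of expectations (Equation 3, the Ohta and Kimura quantity) can be made concrete with a small sketch. It assumes each replicate is a list of `(x, y)` haplotypes with 0/1 alleles. The function names and the three toy replicates are hypothetical, chosen only to show that the two summaries differ in general.

```python
def two_locus_summary(haplotypes):
    """Return (D^2, fx(1-fx)fy(1-fy)) for one sample of (x, y) 0/1 haplotypes."""
    n = len(haplotypes)
    fx = sum(x for x, _ in haplotypes) / n
    fy = sum(y for _, y in haplotypes) / n
    f11 = sum(1 for x, y in haplotypes if x == 1 and y == 1) / n
    d = f11 - fx * fy
    return d * d, fx * (1 - fx) * fy * (1 - fy)

def sigma_d2_and_mean_r2(replicates):
    """Compare the ratio of averages with the average of per-sample r^2.

    Replicates that are monomorphic at either locus are skipped, mirroring
    the conditioning on segregation at both loci.
    """
    summaries = [two_locus_summary(h) for h in replicates]
    kept = [(d2, het) for d2, het in summaries if het > 0]
    sigma_d2 = sum(d2 for d2, _ in kept) / sum(het for _, het in kept)
    mean_r2 = sum(d2 / het for d2, het in kept) / len(kept)
    return sigma_d2, mean_r2

replicates = [
    [(1, 1), (1, 1), (0, 0), (0, 0)],  # intermediate frequencies, perfect association
    [(1, 0), (0, 0), (0, 1), (0, 0)],  # singletons on different chromosomes
    [(1, 1), (0, 0), (0, 0), (0, 0)],  # singletons on the same chromosome
]
sigma_d2, mean_r2 = sigma_d2_and_mean_r2(replicates)
print(sigma_d2, mean_r2)
```

Because the ratio of averages weights each replicate by its heterozygosity product, replicates with rare variants contribute little to it, which is consistent with the approximation being better when rare variants are excluded.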

Previous work (STROBECK and MORGAN 1978; HUDSON 1985) showed that the statistic Formula 3 can be rewritten in terms of two-locus identity coefficients:

Formula 4(4)
To understand the two-locus identity coefficients, consider sampling four chromosomes at random with replacement from a population and labeling them i, j, k, and l. The three terms on the right-hand side of Equation 4 are, respectively, the probability that sequences i and j are identical in state at both sites x and y, the probability that sequences i and j are identical at locus x and that sequences i and k are identical at site y, and the probability that sequences i and j are identical at site x and sequences k and l are identical at site y. These three configurations, which are referred to as A, B, and C, respectively, are central to the following discussion and are represented in Figure 1A. A similar expression applies to the sample statistic where the chromosomes are drawn (with replacement) from the sample (HUDSON 1985). In small samples it is therefore possible that i, j, etc., are not distinct.
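The three identity coefficients can be computed exactly for a given sample by enumerating all draws with replacement. The sketch below does this for configurations A, B, and C and checks the combination A − 2B + C against $D^2$. Equation 4 is not reproduced in this text, so the constant used in the check (a factor of 4 for biallelic loci) is obtained here by direct algebra and may be presented differently in the original expression. All function names are illustrative.

```python
from itertools import product

def identity_coefficients(haplotypes):
    """Exact two-locus identity probabilities under sampling with replacement."""
    n = len(haplotypes)
    idx = range(n)

    def same_x(a, b):
        return haplotypes[a][0] == haplotypes[b][0]

    def same_y(a, b):
        return haplotypes[a][1] == haplotypes[b][1]

    # A: the same pair (i, j) compared at both loci.
    phi_a = sum(same_x(i, j) and same_y(i, j)
                for i, j in product(idx, repeat=2)) / n**2
    # B: pairs (i, j) and (i, k) share one chromosome.
    phi_b = sum(same_x(i, j) and same_y(i, k)
                for i, j, k in product(idx, repeat=3)) / n**3
    # C: pairs (i, j) and (k, l) share no chromosome by construction.
    phi_c = sum(same_x(i, j) and same_y(k, l)
                for i, j, k, l in product(idx, repeat=4)) / n**4
    return phi_a, phi_b, phi_c

def d_squared(haplotypes):
    """Squared coefficient of linkage disequilibrium for 0/1 haplotypes."""
    n = len(haplotypes)
    fx = sum(x for x, _ in haplotypes) / n
    fy = sum(y for _, y in haplotypes) / n
    f11 = sum(1 for x, y in haplotypes if x == 1 and y == 1) / n
    return (f11 - fx * fy) ** 2

sample = [(1, 1), (1, 1), (0, 0), (0, 0), (1, 0), (0, 0)]
phi_a, phi_b, phi_c = identity_coefficients(sample)
print(phi_a - 2 * phi_b + phi_c, 4 * d_squared(sample))
```

Because the draws are with replacement, the indices i, j, k, and l need not be distinct, which is the small-sample effect noted above.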


FIGURE 1.— (A) Two-locus configurations relating to Equation 4 showing the three ways in which two chromosomes at each of two loci can be sampled. (B) The model of a selective sweep. Chromosomes (bottom bars) at a locus where there has been a recent and complete selective sweep (shaded triangle) are related to each other through a star-like genealogy. However, recombination events (dotted arrow) during the selected phase allow lineages to escape to the ancestral background. In the model, neutral mutations (circles) occur on the portion of the genealogy older than the selected mutation, the neutral phase. In the example shown, only a single lineage recombines, such that the two neutral mutations are in perfect association; $r^2 = 1$.

 
The key point about Equation 4 is that the expectation of Formula 4 can be written in terms of the expectation of these two-locus identity coefficients. Under the infinite-sites model, in which each polymorphism observed is the result of a single mutation event within the sample's history, it is possible to relate the two-locus identities to the expectations of genealogical properties at the two loci (MCVEAN 2002). For example,

Formula 5(5)
where Tx is the total time in the genealogy (i.e., the sum of the branch lengths) at locus x and Formula 5 is the coalescence time for sequences i and j at locus x. By obtaining similar expressions for the other two-locus identities and also the denominator of Equation 3, it was shown that

$$\sigma_d^2 = \frac{\rho(t_{ij}^x, t_{ij}^y) - 2\rho(t_{ij}^x, t_{ik}^y) + \rho(t_{ij}^x, t_{kl}^y)}{CV_x^{-1}\,CV_y^{-1} + \rho(t_{ij}^x, t_{kl}^y)} \tag{6}$$
(MCVEAN 2002), where $\rho(t_{ij}^x, t_{kl}^y)$ is the Pearson correlation coefficient between the coalescence time for sequences i and j at locus x and the coalescence time for sequences k and l at locus y and $CV_x$ is the coefficient of variation in the time to the most recent common ancestor (MRCA) for a pair of randomly sampled chromosomes at locus x, $CV_x = \sqrt{\mathrm{Var}(t_{ij}^x)}/E[t_{ij}^x]$. Note that there are three correlations in Equation 6, relating to the three sample configurations (see Equation 4 and Figure 1A).
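A short numerical check helps fix ideas. The sketch assumes Equation 6 has the form (ρ_A − 2ρ_B + ρ_C) / (1/(CV_x CV_y) + ρ_C), where ρ_A, ρ_B, and ρ_C are the correlations in pairwise coalescence time for configurations A, B, and C. It also assumes the standard neutral, constant-size results for those correlations as functions of R = 4N_e r, with a coefficient of variation of one. Neither assumption is spelled out in this section. Under them the expression reduces to the classical result (10 + R)/(22 + 13R + R²). The function names are illustrative.

```python
def sigma_d2(rho_a, rho_b, rho_c, cv_x=1.0, cv_y=1.0):
    """Evaluate the assumed form of Equation 6 from three correlations and two CVs."""
    return (rho_a - 2.0 * rho_b + rho_c) / (1.0 / (cv_x * cv_y) + rho_c)

def neutral_correlations(R):
    """Correlations in pairwise coalescence time under the standard neutral model.

    R = 4 * Ne * r is the population-scaled recombination rate between the loci.
    Returns (rho_A, rho_B, rho_C) for configurations A, B, and C.
    """
    denom = R * R + 13.0 * R + 18.0
    return (R + 18.0) / denom, 6.0 / denom, 4.0 / denom

for R in (0.0, 1.0, 10.0, 100.0):
    value = sigma_d2(*neutral_correlations(R))
    classical = (10.0 + R) / (22.0 + 13.0 * R + R * R)
    print(R, value, classical)
```

Passing a coefficient of variation below one (as expected under population growth) lowers the value returned by `sigma_d2`, and a value above one raises it, in line with the demographic effects described in the text.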

The most important implication of Equation 6 is that it provides a quantitative approach for relating patterns of LD to features of the underlying genealogical history. Demographic history, for instance, influences LD both through its effect on the correlation structure of genealogies and through its effect on the coefficient of variation in time to the MRCA: population growth reduces the coefficient of variation, thus reducing LD, while population bottlenecks increase it, thus increasing LD. The theory can also be extended to consider more complex situations, for example, the case of a series of island populations connected by migration (WAKELEY and LESSARD 2003). In the next section, the theory is extended to the case of a pair of neutral loci linked to a site that has undergone a complete selective sweep in which the beneficial mutation has just reached fixation in the population.


    MODELING GENEALOGIES UNDER A SELECTIVE SWEEP
Looking back in time, a neutral locus on a single lineage at some genetic distance Formula 6 from a selected site (where r is the genetic map distance in Morgans and Ne is the effective population size, assumed to be diploid) can either recombine away from the selected mutation before its removal from the population, with probability p, or not, with probability q = 1 – p. The probability of "escape" is a function of the recombination rate and the frequency trajectory of the selected mutation, itself a random variable determined by the scaled selection coefficient Formula 6. By approximating the trajectory of the selected mutation by that of the deterministic expectation, it has been previously shown that

Formula 7(7)
(MAYNARD SMITH and HAIGH 1974; KAPLAN et al. 1989; STEPHAN et al. 1992; DURRETT and SCHWEINSBERG 2004). Implicit within this formula is an expression for the age of the selected mutation:

Formula 8(8)
As for all expressions relating to age, this is expressed in units of 2Ne generations. When there is more than a single lineage to consider (i.e., a sample of size n > 1), the shape of genealogy under the selected mutation has to be considered. However, if a selective sweep is sufficiently strong, this genealogy can be approximated as a star phylogeny (MAYNARD SMITH and HAIGH 1974; KIM and STEPHAN 2002) with the age of the common ancestor, Formula 8, taken from Equation 8 (Figure 1B). Although this approximation can be criticized (BARTON 1998; DURRETT and SCHWEINSBERG 2004; ETHERIDGE et al. 2006), it nevertheless has proved very useful in analytical treatments of hitchhiking, because of the resulting independence between lineages in whether they recombine away from the selected mutation.

A further simplifying assumption, Formula 8, is also made, where Formula 8 is the time until the MRCA for a sample of n chromosomes. Under the standard neutral model, Formula 8. Looking back in time, the history of the sample can therefore be divided into two phases (Figure 1B). During the first "selection phase" the only events that can occur are recombination events that move neutral loci from the background of the selected allele to that of the ancestral, wild-type allele. The end of the selection phase is marked by the origin of the selected mutation, at which point all chromosomes carrying the selected allele coalesce immediately and the selected allele is removed. Subsequently, in the "neutral phase," the history of the remaining lineages follows that of the standard neutral model. In the extreme, the selection phase can be considered instantaneous with respect to the timescale of the neutral coalescent process (i.e., Formula 8) and therefore any mutations segregating must have occurred on the portion of the genealogy that predates the origin of the selected mutation. Under this assumption, if no lineages have recombined to the ancestral background at a given distance from the selected site, there will be no polymorphism in the sample.

By dividing the history of the sample into these two phases it can be seen that the effect of the selective sweep on patterns of LD is determined by how it influences the configuration of chromosomes found at the start of the neutral phase (just further back in time than the origin of the selected mutation). In particular, we need to calculate the transition probabilities that describe how each of the initial configurations, A, B, and C, is distributed at the start of the neutral phase. For example, consider configuration A where the selected site separates the two neutral loci (Figure 2). Depending on the distribution of recombination events that move a neutral locus from the selected to the ancestral background, this initial configuration can be transformed into any of 10 possible states at the end of the selected phase. The removal of the selected mutation subsequently transforms these 10 configurations, through coalescence of those still carrying the selected mutation, to any of configurations A, B, and C or to ones where one or both of the neutral loci coalesce (indicated by O in Figure 2). Details of the probabilities of each transition are given in APPENDIXES A and B.
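The bookkeeping for the case in which the selected site separates the two neutral loci can be sketched by brute-force enumeration. This is an illustration under stated assumptions, not a reproduction of the appendix tables:

- Each of the four sampled locus copies escapes the sweep independently, with probability `p1` at locus x and `p2` at locus y.
- Every escaped copy lands on its own ancestral chromosome.
- All copies that do not escape sit on the single ancestor of the selected mutation.
- Two copies of the same locus that both fail to escape have coalesced (state O).

The names `CONFIGS`, `classify`, and `transitions` are hypothetical.

```python
from itertools import product

# Chromosomes compared at locus x and at locus y in each starting configuration.
CONFIGS = {
    "A": (("i", "j"), ("i", "j")),
    "B": (("i", "j"), ("i", "k")),
    "C": (("i", "j"), ("k", "l")),
}

def classify(x_labels, y_labels):
    """Name the configuration formed by the chromosomes carrying each lineage."""
    if x_labels[0] == x_labels[1] or y_labels[0] == y_labels[1]:
        return "O"  # both copies at one locus stayed on the sweep and coalesced
    shared = len(set(x_labels) & set(y_labels))
    return {2: "A", 1: "B", 0: "C"}[shared]

def transitions(config, p1, p2):
    """Distribution over A, B, C, and O at the start of the neutral phase."""
    x_pair, y_pair = CONFIGS[config]
    lineages = [("x", c, p1) for c in x_pair] + [("y", c, p2) for c in y_pair]
    probs = {"A": 0.0, "B": 0.0, "C": 0.0, "O": 0.0}
    for escapes in product((True, False), repeat=4):
        weight = 1.0
        labels = []
        for (locus, chrom, p), escaped in zip(lineages, escapes):
            weight *= p if escaped else 1.0 - p
            # Assumption: an escaped copy occupies a chromosome of its own;
            # every copy that does not escape is on the sweep's single ancestor.
            labels.append((locus, chrom) if escaped else "ancestor")
        probs[classify(labels[:2], labels[2:])] += weight
    return probs

for name in CONFIGS:
    print(name, transitions(name, 0.1, 0.2))
```

Under these assumptions the outcome depends only on how many copies escape at each locus, so the three starting configurations yield the same distribution. Configuration B arises only when exactly one copy escapes at each locus, leaving the two copies that did not escape together on the ancestor of the selected mutation.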


FIGURE 2.— Transition probabilities for the two-stage model of a selective sweep. The initial configuration (type A), where the selected mutation (solid circle) separates the two neutral loci (triangles), can be transformed into one of 10 different configurations at the end of the selection phase (the open circle indicates the ancestral, unselected mutation). Probabilities for each transition are shown in terms of pi and qi, respectively, the probability of a recombination event occurring in interval i during the selective sweep and the probability of no recombination (Formula 8). The removal of the selected mutation induces coalescence between any chromosomes still carrying the selected mutation. The configurations at the start of the neutral phase can be classified into types A, B, or C, corresponding to the three configurations in Equation 4, or into type O in which at least one of the two loci has coalesced.

 

APPENDIX A Transition probabilities when the selected mutation separates the neutral loci

 

APPENDIX B Transition probabilities when the selected mutation is adjacent to the neutral loci

 
Once the transition probabilities to each possible state at the start of the neutral phase have been calculated, it is a simple matter to obtain expressions for the necessary genealogical statistics. In particular, for each starting configuration we can write the expectation of the product of the coalescence time at the two neutral loci as a function of these transition probabilities. For example,

Formula 9(9)
where Formula 9 is the probability that configuration A in the sampled chromosomes (all of which carry the selected mutation) results in configuration B at the start of the neutral phase. The subscript S on the left-hand side indicates that the expectation refers to the selected allele, while the subscript W on the right-hand side indicates that these expectations refer to the wild-type allele (i.e., the standard neutral expectations). Under the standard neutral model these quantities are known for different configurations of chromosomes. In particular,

Formula 10(10)
(PLUZHNIKOV and DONNELLY 1996; MCVEAN 2002). Expressions similar to Equation 9 can be obtained for the other initial configurations B and C. Note that it is not necessary to include in Equation 9 a term for transitions to state O, as the expected product of coalescence times for this state is zero under the assumption Formula 10.

Finally, because the configurations can be thought of as relating to subsamples (with replacement) from a sample of n sequences, there is a possibility that sequences i, j, k, and l may not be distinct (the same sequence could be picked twice). A simple correction has to be made to the expectations,

Formula 11(11)
where n is the sample size (HUDSON 1985; MCVEAN 2002).


    NEUTRAL LOCI SEPARATED BY THE SELECTED SITE
First, consider the case of two loci separated by the selected site and distant from it by recombination distances of Formula 11 and Formula 11, respectively, such that the probabilities of a lineage escaping the selective sweep are Formula 11 and Formula 11, respectively. By considering the probability of recombination in each interval it can be shown that

Formula 12(12)
(see APPENDIX A). Consequently

Formula 13(13)
It follows that whatever the values of Formula 13 and Formula 13

Formula 14(14)
In other words, LD across the selected site (as measured by Formula 14) is zero or at least no greater than background levels caused by finite sample size. This result agrees with previous findings (KIM and NIELSEN 2004; STEPHAN et al. 2006) obtained by simulation and analysis of deterministic models of selection. It is worth noting that a deterministic model (in which drift during the selection phase is ignored) is equivalent to assuming that no coalescent events occur during this period, the same assumption as is made here.

However, it is also worth noting that while LD may be zero, there is actually nonzero correlation in coalescence time. For example, if Formula 14 and Formula 14, it can be shown that

Formula 15(15)
It is perhaps surprising that there should be nonzero correlation in the time to the MRCA at the two neutral loci, but yet no LD. The nonzero correlation arises because lineages that escape the sweep will have low, though nonzero, correlations in the time to the MRCA resulting from the neutral part of their ancestry. For example, Equation 15 is derived by noting that when the recombination rate is low, the most probable configuration that arises in which both neutral loci escape the sweep is configuration B (this is true for all initial configurations). However, each initial configuration requires exactly the same set of recombination events to occur to reach configurations B and C at the start of the neutral phase, so the resulting correlation structure is the same for each initial configuration, and there is no LD.


    NEUTRAL LOCI ON THE SAME SIDE OF THE SELECTED LOCUS
Now consider a pair of loci that are both on the same side of the selected site, with the nearer (or proximal), x, being at recombination distance Formula 15 and the more distant (or distal), y, being at a recombination distance Formula 15 from x. In this situation the different initial configurations have different probabilities of resulting in each configuration at the start of the neutral phase. For example, configuration A can escape the sweep through a single recombination, while configuration C requires a minimum of two recombination events to escape the sweep. By considering the effect of recombination events occurring in each part of each chromosome during the selection phase (see APPENDIX B) it follows that for configuration A

Formula 16(16)
For configuration B

Formula 17(17)
While for configuration C

Formula 18(18)
The mean and variance of the time to coalescence at each locus are

Formula 19(19)
These results can be used to derive numerical expressions for Equation 6 for various parameter values (Figure 3). However, several important features of the results can be identified. First, when Formula 19 it follows that

Formula 20(20)
Under this approximation, Equation 6 evaluates to zero. However, when Formula 20, such that Formula 20, it is also critical to account for the finite sample size, such that i, j, k, and l are not necessarily distinct. Under these conditions a good approximation for the expected LD is

Formula 21(21)
Equation 21 predicts that conditional on observing polymorphism at the linked neutral loci there will be perfect correlation (i.e., Formula 21) between the alleles if there is no recombination between them (Figure 3). This result can be understood by noting that the most probable way in which polymorphism will be observed if Formula 21 is if a single lineage escapes the selective sweep. Any neutral mutations must occur during the neutral phase, in which only two lineages will be present (the lineage leading to the MRCA of the selected mutation and the escaped lineage), leading to perfect association (in effect the mutations will occur on the same branch of the unrooted genealogy, as in Figure 1B). Another prediction of Equation 21 is that the magnitude of LD decreases rapidly as the recombination rate between the neutral loci increases. Indeed for moderate to large sample sizes it should decrease below that expected for an identical pair of neutral sites unaffected by a sweep (Figure 3). From a genealogical perspective, any recombination events occurring between the two neutral loci will rapidly lead to a breakdown in the correlation of the genealogies at the two positions. Informally, the effect can also be understood in terms of allele frequency. When Formula 21, polymorphism at the proximal locus is most likely to be in the form of a singleton (i.e., one chromosome differs from all the others). Recombination between the proximal and the distal loci will allow nonsingleton polymorphism at the distal locus and this is likely to show weak LD with the singleton allele at the proximal locus.


FIGURE 3.— The effect of a nearby selective sweep on LD between a pair of linked neutral loci. Numerical evaluation of Equation 6 is shown with the correction for finite sample size in the case where both neutral loci are on the same side of the selected site. Formula 21 is shown as a function of the recombination rate, R = 4Ner, between the neutral loci (x-axis) and the probability of escape for the proximal locus being 0.001 (solid line), 0.1 (dark-shaded line), and 0.5 (light-shaded line). The dotted line shows the evaluation of Equation 6 under the neutral model. The probability of escape for the distal locus is given by Formula 21 with Formula 21.

 
As the recombination rate between the proximal neutral locus and the selected site increases, the impact of the selective sweep diminishes and the LD between the neutral loci approaches that expected under the neutral model. However, the two key features of the pattern remain. First, if the neutral loci are very closely linked, LD is generally increased relative to the neutral expectation. Second, weakly linked neutral loci show a small decrease in LD relative to the neutral case (Figure 3). Both features can be explained by the above reasoning.


    INCORPORATING NEUTRAL MUTATIONS YOUNGER THAN THE SELECTED MUTATION
So far, it has been assumed that the time to the origin of the beneficial mutation is approximately zero, such that any polymorphism found in the sample has to be older than the selected mutation. However, when the probability of a lineage escaping the selective sweep by recombination is low the expected time in genealogies in which no recombination occurs is considerable relative to the total expected time in the genealogy. Consequently, when Formula 21 it is relatively likely that polymorphism observed in a sample that has experienced a selective sweep may be more recent than the selected mutation. From the genealogical perspective, considering such recent mutations is equivalent to setting Formula 21. Because no coalescent events occur during the selected phase, the only influence of a nonzero value of Formula 21 is to increase the expected coalescence time (it has no effect on the correlations in coalescence time or variance) and consequently decrease the coefficient of variation in coalescence time, thus reducing LD. When the neutral loci are either side of the selected site LD is low anyway, so inclusion of recent mutation has little or no impact on LD. However, when the two neutral loci are on the same side of the selected mutation recent mutation can have a considerable impact on LD, because neutral mutations older than the selected one will typically show strong LD if they are themselves tightly linked (as described above). To get an idea for the importance of including recent mutations, note that when Formula 21, typically at most one lineage will escape the sweep and the contribution of the neutral phase to the expected time in the genealogy of the sample is ~Formula 21. Under these same conditions the total length of the genealogy within the selected phase is Formula 21. Consequently, the probability that an observed neutral mutation at the proximal locus is older than the selected mutation is ~Formula 21. 
In humans the average recombination rate is Formula 21 in European populations (MYERS et al. 2005), so that a polymorphism 5 kb from the selected site will have only a 50% probability of being older than the selected mutation.

Figure 4 shows that inclusion of recent mutations has a marked effect on Formula 21. When the recombination rate between the neutral loci is zero, mutations older than the selected one are predicted to show (and do show) monotonically decreasing LD as a function of increasing Formula 21. However, when recent mutations are considered, LD very close to the selected site is near zero when Formula 21 is small. LD increases as Formula 21 increases, exceeding the neutral expectation at intermediate values of Formula 21. Finally, as Formula 21 approaches one, the expected LD decreases toward neutral expectation. The nonmonotonic relationship between the distance of the neutral loci from the selected site and the strength of LD is actually more marked in the simulations (see below) than in the theoretical predictions. Qualitatively similar patterns are predicted when the neutral loci are only partially linked (data not shown).


FIGURE 4.— The effects of a nearby selective sweep on a pair of completely linked neutral mutations (A) when only neutral mutations within the neutral phase are considered and (B) when neutral mutations can also occur during the selected phase. In each plot the solid line and shaded circles indicate numerical evaluation of Equation 6 and the dotted line and solid circles indicate the results of stochastic simulation carried out under the assumed model (i.e., no coalescence is allowed during the selective phase). Note the dramatic effect of including recent neutral mutations; there is a monotonic decline of LD between old neutral mutations as they get further from the selected site. However, because recent mutations have little or no LD, their inclusion results in LD between the neutral loci maximizing at some distance from the selected site. For each point, $10^5$ simulations were carried out with a sample size of 20 and with a recombination rate between the neutral loci of zero. (A) Formula 21 and (B) Formula 21 (note that under the simulation scheme the two series do not agree for Formula 21 because of the additional time in the tips of the trees in B).

 

    STOCHASTIC SIMULATION
To examine the accuracy of the results obtained here, Monte Carlo simulations were performed under two different models for the selective sweep. In series A, the effects of a selective sweep were simulated under the approximate model used as the basis of the analytical results. Specifically, the genealogical history is divided into two phases: a phase of duration Formula 21 during which the only events that can occur are recombination events that move lineages from the selected to the wild-type background, a point of instant coalescence between all lineages still carrying the selected allele, and a neutral phase. In series B, fully stochastic models of selective sweeps were simulated using the program SelSim (SPENCER and COOP 2004). Briefly, the method first simulates a stochastic trajectory for the selected mutation backward in time using a diffusion approximation (COOP and GRIFFITHS 2004) and then subsequently performs a structured coalescent simulation conditional on the trajectory. By performing the two series of simulations it is possible to examine both the accuracy of Equation 6 as an approximation to the expectation of Formula 21 and the accuracy of the approximate model for selective sweeps. For efficiency, simulations were carried out by placing mutations uniformly on the simulated genealogies at loci x and y, and the ith simulation was assigned a weight given by the product of the total branch lengths at each site, Formula 21. Expected values of Formula 21 are estimated from the weighted average over ≥$10^5$ simulations for each parameter combination.
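The weighting step described above can be sketched as follows. Each simulated genealogy contributes one value of the LD statistic, from a single mutation placed on each tree, and is weighted by the product of the total branch lengths at the two loci. The function name and the argument layout are assumptions for illustration. The reading that the weight stands in for the chance that both loci are polymorphic when mutation is rare is an interpretation, not a statement from the article.

```python
def weighted_mean_ld(ld_values, total_lengths_x, total_lengths_y):
    """Weighted average of per-simulation LD values.

    ld_values[i] is the statistic from simulation i; the weight of simulation i
    is the product of the total branch lengths at loci x and y in that simulation.
    """
    weights = [tx * ty for tx, ty in zip(total_lengths_x, total_lengths_y)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all simulations have zero weight")
    return sum(w * v for w, v in zip(weights, ld_values)) / total

# Two toy simulations: the second genealogy is longer, so it counts for more.
estimate = weighted_mean_ld([1.0, 0.0], [1.0, 3.0], [2.0, 2.0])
print(estimate)
```

Weighting in this way avoids discarding simulations in which randomly placed mutations would rarely arise, since every genealogy yields a usable pair of mutations.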

Where the selected site separates the two neutral loci, the extent of association between the neutral loci in the series A simulations was, as predicted, no higher than background (data not shown). When the selected site does not separate the neutral loci, the results are highly sensitive to assumptions about the duration of the selective phase (Figure 4, A and B; note that there is no recombination between the neutral loci). In Figure 4A it was assumed that the age of the selected mutation was negligible compared to the age of the neutral genealogy, Formula 21. In Figure 4B, the age of the selected mutation was fixed at Formula 21, the average obtained by fully stochastic simulation with S = 400 and a sample size of 20. There are two key features of these results. First, in both cases Equation 6 typically overestimates the expected value of r², although the expression is accurate when the probability of escape is low. The second key point is the difference the inclusion of recent neutral mutations makes. As predicted, mutations older than the selected one do typically show very strong LD. However, when the probability of escaping the selective sweep is very low, recent neutral mutations make the majority contribution to LD, such that the average value of r² is very low.

Figure 5 shows the comparison between the analytical results and the average value of r² calculated from the fully stochastic simulations. These give qualitatively the same results as those obtained under the approximate model of a selective sweep. When the selected mutation separates the neutral mutations there is no LD between them (Figure 5A), irrespective of the level of diversity observed. When the selected site does not separate the neutral loci, the LD between linked neutral loci is zero when the proximal locus is very close to the selected site, increases beyond its neutral expectation as the probability of escape increases, and then decreases back to the neutral expectation. This feature is seen both when the neutral loci are completely linked (Figure 5B) and when they are only partially linked (Figure 5C). The most notable difference between the two series of simulations is that in series A the approximation was a considerable overestimate of the true LD, whereas in series B it is typically a slight underestimate. In the absence of a selective sweep Equation 6 is typically an overestimate of the expected value of r², as it also is when the approximate model is used as the basis of stochastic simulation (Figure 4). The most likely explanation for the underestimate in Figure 5 is that the genealogy under the selected mutation is not star shaped, and hence there can be significant LD between neutral mutations that occur during the selective phase. Indeed, as the sample size increases, the approximation of a star-like genealogy in the selective phase becomes progressively worse (DURRETT and SCHWEINSBERG 2004).


FIGURE 5.— The effects of a selective sweep on patterns of LD if the selected site is either separating (A) or adjacent to (B and C) the neutral loci. In each plot the shaded line indicates the prediction of Equation 6 allowing for finite sample size but not for recent mutations, the solid line indicates the prediction of Equation 6 allowing for finite sample size and for recent mutations, and the solid circles and dotted line show the values obtained by fully stochastic simulation. The configuration relating to each plot is shown in the top right corner (triangles, neutral loci; circles, selected loci). (A) The selected site is at the midpoint between the two neutral loci, which are separated by the recombination fraction shown. (B) The selected site is adjacent to the neutral loci at the recombination fraction indicated beyond the proximal locus and the two neutral loci are completely linked. (C) The same as B except the two neutral loci are separated by a recombination fraction of 4Nₑr = 5. In all cases 10⁶ simulations were performed with a sample size of 20 and a scaled selection coefficient of 4Ns = 400 using the SelSim package (SPENCER and COOP 2004).

 
In summary, the stochastic simulations demonstrate that the combination of Equation 6 and the approximate model of a selective sweep provides a reasonably accurate quantitative prediction of the effects of selective sweeps on the average value of r². They do not, of course, predict the full distribution, and the approximation gets progressively worse for weaker selection coefficients (data not shown). Informally, the approximation appears to be valuable for S > 100.
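For readers who wish to compute the statistic being averaged, the following is a minimal sketch of r² for a single pair of biallelic loci, using the standard definition r² = D² / [p_A(1 − p_A) p_B(1 − p_B)]. The 0/1 coding of alleles and the function name are assumptions made for illustration; they are not part of the simulation software used here.

```python
from typing import Sequence, Tuple


def r_squared(haplotypes: Sequence[Tuple[int, int]]) -> float:
    """r^2 between two biallelic loci from 0/1-coded two-locus haplotypes."""
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus x
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus y
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic in the sample")
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    return d * d / denom
```

A sample containing only the haplotypes 11 and 00 in equal numbers gives r² = 1, whereas a sample with all four haplotypes in equal numbers gives r² = 0.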


    DISCUSSION
The results presented here provide a detailed understanding of the effects of selective sweeps on patterns of linkage disequilibrium, particularly for the case where a mutation of large effect has recently reached fixation in the population. Although previous theoretical and simulation-based studies have demonstrated some of the patterns described, the genealogical perspective taken provides an intuitive approach to understanding key features of the process. In particular, two key features can be identified.

Selective sweeps can eliminate LD:
If a selective sweep is sufficiently strong and recent, such that the genealogy of the sample at the selected site can be approximated as a star (i.e., all lineages coalesce at the same time), all LD between neutral loci separated by the selected site is eliminated. As previously noted (KIM and NIELSEN 2004), there is a simple genealogical explanation for this observation. In effect, the genealogical interpretation of LD implies that significant LD will occur when the coalescent time for a pair of chromosomes at one position on a chromosome is informative about the coalescent time for the same pair of chromosomes at another position (relative to the coalescent time of all other pairs of chromosomes). Within a star-like genealogy all pairs of chromosomes coalesce at the same time. Consequently the coalescent time for a given pair at one point is uninformative about the coalescent time at any other point for the same pair (i.e., there is no variance in coalescence time within the star), and there is no LD. Moving away from the selected site recombination events will allow linked neutral sites to revert to the neutral distribution of genealogies. However, such "recovery" from the star-like genealogy happens independently on the two sides of the selected site. Consequently, the coalescent time for a pair of chromosomes on one side of the selected site will always be uninformative about the coalescent time for the same pair of chromosomes on the other side.

What is the implication of this result for understanding patterns of variation? The most obvious issue is that selective sweeps, through abolishing LD, may create patterns that look like recombination hotspots. Indeed, it has been shown that one statistical test for hotspots does have an elevated false positive rate at selective sweeps (REED and TISHKOFF 2005). However, it should be noted that the patterns of genetic variation (and underlying genealogies) associated with a hotspot and those associated with a selective sweep are strikingly different. In humans, hotspots are typically short (1–2 kb) regions where there is a very rapid breakdown in LD, and there are many "detectable" recombination events and no distortion to the distribution of marginal genealogies (i.e., no distortion to the frequency distribution of neutral variation) (JEFFREYS et al. 2001). In contrast, a selective sweep of considerable strength will affect the density and frequency distribution of polymorphism over considerable distances. For example, a scaled selection coefficient of Formula 21 (a selection coefficient of ~1% in humans) will affect the frequency distribution of polymorphism up to a genetic distance of at least Formula 21 on either side (this is the distance at which there is a 50% chance of a lineage escaping the sweep). In humans, the average recombination rate is ~Formula 21 in European populations (MYERS et al. 2005), such that a region some 140 kb in size should be strongly affected. In short, even if a sweep does influence LD in such a way as to resemble a hotspot, the sweep is also likely to lead to unusual patterns of variation that are indicative of a selective sweep.

One way to ask the question of whether selective sweeps can create false hotspots is to ask whether, conditioning on seeing polymorphism at given genetic distances on either side of the selected mutation, the evidence for historical recombination is greater or less than under the neutral model. Table 1 shows how selective sweeps influence the probability of seeing all four possible haplotypes relative to the neutral case. Under the infinite-sites model such data sets are direct evidence for recombination (HUDSON and KAPLAN 1985). The patterns are quite striking: sweeps lead to a dramatic decrease in the probability of observing all four haplotypes relative to the neutral model. This is true whether all mutations are considered or just those >10% in frequency. In short, selective sweeps do not lead to any increase in the evidence for recombination. The reported bias to one method for detecting hotspots (REED and TISHKOFF 2005) therefore is likely to result from the fact that this method uses a nongenealogical model for patterns of variation. Analysis of data sets simulated with selective sweeps indicates that coalescent-based estimators of the recombination rate show no such local increase in estimated rate. Rather, the depression in the opportunity for recombination at such sites also leads to a slight decrease in average estimated rate (Figure 6).
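The criterion used above (observing all four haplotypes at a pair of biallelic sites) can be sketched as a simple check. This is an illustrative sketch only: the 0/1 coding with 1 as the derived allele, the function names, and the optional frequency filter are assumptions, and the code is not the procedure used to produce Table 1.

```python
from typing import Iterable, Sequence, Tuple

Haplotype = Tuple[int, int]
ALL_FOUR = {(0, 0), (0, 1), (1, 0), (1, 1)}


def shows_all_four_haplotypes(sample: Iterable[Haplotype]) -> bool:
    """True if all four two-locus haplotypes occur in the sample."""
    return {tuple(h) for h in sample} >= ALL_FOUR


def derived_frequencies(sample: Sequence[Haplotype]) -> Tuple[float, float]:
    """Frequencies of the allele coded 1 at each of the two loci."""
    n = len(sample)
    return (sum(a for a, _ in sample) / n, sum(b for _, b in sample) / n)


def fraction_with_all_four(replicates: Sequence[Sequence[Haplotype]],
                           min_freq: float = 0.0) -> float:
    """Fraction of replicates showing all four haplotypes.

    Only replicates in which both derived alleles exceed min_freq are counted.
    """
    kept = [s for s in replicates
            if len(s) > 0 and min(derived_frequencies(s)) > min_freq]
    if not kept:
        raise ValueError("no replicate passes the frequency filter")
    return sum(shows_all_four_haplotypes(s) for s in kept) / len(kept)
```

Applying `fraction_with_all_four` to replicates simulated with and without a sweep, with `min_freq` set to 0 or to 0.1, would give a comparison of the kind summarized in Table 1.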


TABLE 1 Probability of observing an incompatibility across a selective sweep

 

FIGURE 6.— The effect of selective sweeps on estimates of the recombination rate. For data sets previously simulated with a selective sweep (the position of which is indicated by the vertical bar) and constant recombination rate (R = 10; indicated by the dotted line) (REED and TISHKOFF 2005), a model of variable rate recombination was fitted using the reversible-jump MCMC method of MCVEAN et al. (2004), using a block penalty of 5. Four series of data sets were analyzed, each of 100 replicates, with S = 4Nₑs = 10, 100, 177.8, and 562.3, respectively (θ = 10 and n = 100 for each). Each chart shows the median (solid line) and quartiles (shaded lines) of the distribution of the estimated rate. In no case is there a tendency to overestimate the recombination rate.

 
Selective sweeps can increase (and decrease) LD:
While LD between neutral loci is eliminated by a selective sweep at an intervening site, if the selected site does not separate the neutral loci LD can be increased or decreased depending on their proximity to the selected site. A further complication is that the age of the neutral mutations relative to the selected one has critical consequences for the magnitude of LD. If both neutral loci are closely linked to the selected site, mutations older than the selected one will typically show strong LD and younger mutations will typically have little or no LD. When both features are combined the result is a nonmonotonic relationship between the proximity of a pair of neutral loci to a selected one and the strength of LD.

What are the implications of these results for the interpretation of empirical patterns of genetic variation? Previous work has suggested that incorporating information on LD does not greatly improve the power of statistical approaches to identifying selective sweeps (KIM and NIELSEN 2004). This result is understandable given the complexity of the patterns described. One possibility is that incorporating information about the age of linked neutral polymorphism (for example, by comparison with related populations in which no sweep is thought to have occurred) may increase the power to detect selection. In particular, sweeps will lead to series of old SNPs at low frequency and in strong LD interleaved with series of young SNPs at low frequency and in very low LD. Of course, inferences about the age of a mutation within the population that has experienced selection will be confounded by the effect of the sweep.

One argument against using patterns of LD directly to make inferences about selective sweeps is that their effects on LD can all be understood in terms of the generation of a star-like genealogy at the selected site. Consequently, the most powerful methods for detecting selective sweeps will be those that are most powerful at detecting local star-like genealogies with short times to the MRCA (KIM and STEPHAN 2002; KIM and NIELSEN 2004; NIELSEN et al. 2005). For example, of existing methods to detect recent, complete selective sweeps, perhaps the most powerful is one that compares models with and without a local star-like genealogy at a putatively selected site using only the allele-frequency distribution (NIELSEN et al. 2005). However, what the results presented here show is that selective sweeps can induce unusual patterns of association between neutral mutations near selected sites, a feature that is currently not considered in this method. In effect, the results suggest that there may be additional information about selective sweeps in the way genetic variation recovers around a selected locus; however, it remains to be seen whether such recovery differs systematically from cases where star-like genealogies have occurred by chance or through population bottlenecks.


    ACKNOWLEDGEMENTS
I thank Nick Barton, Alison Etheridge, Rasmus Nielsen, Jay Taylor, and two anonymous reviewers for discussion and comments on the manuscript and Wolfgang Stephan for providing the original inspiration for this work.


    LITERATURE CITED

BARTON, N. H., 1998 The effect of hitch-hiking on neutral genealogies. Genet. Res. 72: 123–133.

BRAVERMAN, J. M., R. R. HUDSON, N. L. KAPLAN, C. H. LANGLEY and W. STEPHAN, 1995 The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics 140: 783–796.

COOP, G., and R. C. GRIFFITHS, 2004 Ancestral inference on gene trees under selection. Theor. Popul. Biol. 66: 219–232.

DURRETT, R., and J. SCHWEINSBERG, 2004 Approximating selective sweeps. Theor. Popul. Biol. 66: 129–138.

ETHERIDGE, A. M., P. PFAFFELHUBER and A. WAKOLBINGER, 2006 An approximate sampling formula under genetic hitchhiking. Ann. Appl. Probab. 16: 685–729.

FAY, J. C., and C. I. WU, 2000 Hitchhiking under positive Darwinian selection. Genetics 155: 1405–1413.

FU, Y. X., and W. H. LI, 1993 Statistical tests of neutrality of mutations. Genetics 133: 693–709.

HILL, W. G., and A. ROBERTSON, 1968 Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38: 226–231.

HUDSON, R. R., 1985 The sampling distribution of linkage disequilibrium under an infinite allele model without selection. Genetics 109: 611–631.

HUDSON, R. R., and N. L. KAPLAN, 1985 Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111: 147–164.

HUDSON, R. R., K. BAILEY, D. SKARECKY, J. KWIATOWSKI and F. J. AYALA, 1994 Evidence for positive selection in the superoxide dismutase (Sod) region of Drosophila melanogaster. Genetics 136: 1329–1340.

JEFFREYS, A. J., L. KAUPPI and R. NEUMANN, 2001 Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat. Genet. 29: 217–222.

KAPLAN, N. L., R. R. HUDSON and C. H. LANGLEY, 1989 The "hitchhiking effect" revisited. Genetics 123: 887–899.

KIM, Y., and R. NIELSEN, 2004 Linkage disequilibrium as a signature of selective sweeps. Genetics 167: 1513–1524.

KIM, Y., and W. STEPHAN, 2002 Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics 160: 765–777.

MAYNARD SMITH, J., and J. HAIGH, 1974 The hitch-hiking effect of a favourable gene. Genet. Res. 23: 23–35.

MCVEAN, G. A., 2002 A genealogical interpretation of linkage disequilibrium. Genetics 162: 987–991.

MCVEAN, G. A., S. R. MYERS, S. HUNT, P. DELOUKAS, D. R. BENTLEY et al., 2004 The fine-scale structure of recombination rate variation in the human genome. Science 304: 581–584.

MYERS, S., L. BOTTOLO, C. FREEMAN, G. MCVEAN and P. DONNELLY, 2005 A fine-scale map of recombination rates and hotspots across the human genome. Science 310: 321–324.

NIELSEN, R., S. WILLIAMSON, Y. KIM, M. J. HUBISZ, A. G. CLARK et al., 2005 Genomic scans for selective sweeps using SNP data. Genome Res. 15: 1566–1575.

OHTA, T., and M. KIMURA, 1971 Linkage disequilibrium between two segregating nucleotide sites under the steady flux of mutations in a finite population. Genetics 68: 571–580.

PLUZHNIKOV, A., and P. DONNELLY, 1996 Optimal sequencing strategies for surveying molecular genetic diversity. Genetics 144: 1247–1262.

PRZEWORSKI, M., 2002 The signature of positive selection at randomly chosen loci. Genetics 160: 1179–1189.

REED, F. A., and S. A. TISHKOFF, 2005 Positive selection can create false hotspots of recombination. Genetics 172: 2011–2014.

SABETI, P. C., D. E. REICH, J. M. HIGGINS, H. Z. LEVINE, D. J. RICHTER et al., 2002 Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832–837.

SONG, Y. S., and J. S. SONG, 2007 Analytic computation of the expectation of the linkage disequilibrium coefficient r². Theor. Popul. Biol. 71: 49–60.

SPENCER, C. C., and G. COOP, 2004 SelSim: a program to simulate population genetic data with natural selection and recombination. Bioinformatics 20: 3673–3675.

STEPHAN, W., T. WIEHE and M. W. LENZ, 1992 The effect of strongly selected substitutions on neutral polymorphism: analytical results based on diffusion theory. Theor. Popul. Biol. 41: 237–254.

STEPHAN, W., Y. S. SONG and C. H. LANGLEY, 2006 The hitchhiking effect on linkage disequilibrium between linked neutral loci. Genetics 172: 2647–2663.

STROBECK, C., and K. MORGAN, 1978 The effect of intragenic recombination on the number of alleles in a finite population. Genetics 88: 829–844.

WAKELEY, J., and S. LESSARD, 2003 Theory of the effects of population structure and sampling on patterns of linkage disequilibrium applied to genomic data from humans. Genetics 164: 1043–1053.

Communicating editor: R. NIELSEN



