## Abstract

The model of genetic hitchhiking predicts a reduction in sequence diversity at a neutral locus closely linked to a beneficial allele. In addition, it has been shown that the same process results in a specific pattern of correlations (linkage disequilibrium) between neutral polymorphisms along the chromosome at the time of fixation of the beneficial allele. During the hitchhiking event, linkage disequilibrium on either side of the beneficial allele is built up whereas it is destroyed across the selected site. We derive explicit formulas for the expectation of the covariance measure *D* and standardized linkage disequilibrium between a pair of polymorphic sites. For our analysis we use the approximation of a star-like genealogy at the selected site. The resulting expressions are approximately correct in the limit of large selection coefficients. Using simulations we show that the resulting pattern of linkage disequilibrium is quickly—*i.e*., in <0.1*N* generations—destroyed after the fixation of the beneficial allele for moderately distant neutral loci, where *N* is the diploid population size.

The detection of targets of positive selection using polymorphism data is an important research topic. There are two major patterns in DNA data that help to identify these targets. First, the fast fixation of a beneficial allele causes a reduction of neutral diversity at closely linked neutral loci and a distortion of the site-frequency spectrum. Second, the fast fixation of the beneficial allele causes an increased level of linkage disequilibrium (LD) around the selected site. Both patterns have been used to construct statistical tests to reject neutrality (Hudson *et al*. 1994; Kelly 1997; Depaulis and Veuille 1998; Fay and Wu 2000; Kim and Nielsen 2004).

While the diversity-reducing effect of genetic hitchhiking is well described on a quantitative level (*e.g*., Maynard Smith and Haigh 1974; Kaplan *et al*. 1989; Stephan *et al*. 1992; Barton 1998; Etheridge *et al*. 2006), investigations of patterns of LD only started with Kim and Nielsen (2004), using numerical simulations. Analytical expressions for measures of LD after a selective sweep have been obtained by Stephan *et al*. (2006), who use differential equations to derive an expression for the covariance measure *D* [defined in (2)] between a pair of neutral alleles linked to a beneficial allele. This study was complemented by a genealogical (*i.e*., backward in time) perspective in Pfaffelhuber and Studeny (2007) and McVean (2007).

The aim of this article is threefold: first, we describe a genealogical perspective of the joint genealogy of two neutral loci linked to a beneficial allele at the time of its fixation, which is accurate for large selection coefficients. Second, using the genealogical perspective, we derive an explicit analytic expression for standardized LD [defined in (3)] at the end of a selective sweep. Our main result is given in (10). Third, we use simulations to see in which time frame before and after fixation we can observe a specific pattern of LD.

In our genealogical perspective we rely on the frequently used assumption that the genealogy at the selected site is exactly star-like at the end of the selective sweep. We show that genetic hitchhiking can lead to perfectly associated (*i.e*., σ̂_{D}^{2} = 1) alleles close to the selected site if both neutral loci are on the same side of the beneficial allele. If they are on different sides, LD is eliminated during the sweep. Interestingly, standardized LD in a finite sample is much higher than in the whole population. All results on σ̂_{D}^{2} at the time of fixation of the beneficial allele can be obtained from the explicit expressions that are found in Equation 10. Finally, our simulations show that the pattern of LD changes drastically shortly before and after fixation of the beneficial allele.

## MODELS AND MEASURES OF LINKAGE DISEQUILIBRIUM

If a new beneficial allele *B* enters a population of *N* sexually reproducing diploid individuals, it might increase in frequency until it fixes in the population. If the fitness advantage of each copy of the *B*-allele is *s* and , the frequency *X* of the beneficial allele in the population can be described by the differential equation

$$\frac{dX_t}{dt} = \alpha X_t (1 - X_t), \qquad X_0 = \varepsilon \tag{1}$$

(see, *e.g*., Kaplan *et al*. 1989; Stephan *et al*. 1992), where α := 2*Ns* and time is measured in 2*N* generations. The process stops at time *T* = 2 log(1/ε − 1)/α when *X*_{T} = 1 − ε. In the following, we choose ε = 1/α since the fixation time of a beneficial allele is ∼2 log(α)/α if genetic drift is taken into account (Hermisson and Pennings 2005). In particular, we set

*T*:= 2 log(α)/α.
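As a quick numerical illustration of the sweep duration, the following sketch evaluates the logistic trajectory and compares the exact stopping time with the approximation 2 log(α)/α. It assumes the deterministic model d*X*/d*t* = α*X*(1 − *X*) with *X*_{0} = ε; the function names `sweep_frequency` and `sweep_duration` are illustrative and not taken from any published code.

```python
import math

def sweep_frequency(t, alpha, eps):
    """Logistic solution of dX/dt = alpha*X*(1-X) with X(0) = eps.

    Time t is measured in units of 2N generations.
    """
    return eps / (eps + (1.0 - eps) * math.exp(-alpha * t))

def sweep_duration(alpha, eps):
    """Time at which the logistic trajectory reaches frequency 1 - eps."""
    return 2.0 * math.log(1.0 / eps - 1.0) / alpha

alpha = 1000.0          # scaled selection coefficient 2Ns
eps = 1.0 / alpha       # starting frequency, as chosen in the text
T_exact = sweep_duration(alpha, eps)
T_approx = 2.0 * math.log(alpha) / alpha
x_mid = sweep_frequency(T_exact / 2.0, alpha, eps)   # frequency halfway through
x_end = sweep_frequency(T_exact, alpha, eps)         # frequency at the end
print(T_exact, T_approx, x_mid, x_end)
```

For α = 1000 the two durations differ only in the fourth significant digit, which is why the text can work with *T* := 2 log(α)/α.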

Maynard Smith and Haigh (1974) argued that neutral variants that are partially linked to the beneficial allele at *t* = 0 increase in frequency together with the beneficial allele. We extend this model to two neutral loci following Stephan *et al*. (2006). We have to take two possible geometries for the selected and the two neutral loci into account; see Figure 1. Either (a) the neutral loci are on the same side of the selected site or (b) the selected locus is in the middle of both neutral loci. Throughout we assume that mutation rates are sufficiently small that at most two alleles are segregating at both loci. At the selected *S*-locus we call *b* the wild-type and *B* the beneficial allele. For the other loci, we call the alleles *L*, ℓ at the first and *R*, *r* at the second neutral locus. The neutral loci are called the *L*/ℓ- and *R*/*r*-loci or, in short, the *L*- and *R*-loci.

During reproduction, recombination events might occur. If a recombination event occurs between two loci, they have different ancestors. Taking the recombination probability per generation between the two loci as *r* and measuring time in units of 2*N* generations, a recombination event splits the ancestry of the two loci at rate ρ := 2*Nr*. These scaled recombination rates between all pairs of loci are given in Figure 1. Note that ρ_{SR} = ρ_{SL} + ρ_{LR} for geometry a and ρ_{LR} = ρ_{LS} + ρ_{SR} for geometry b.

Let us denote the allelic frequencies at the neutral loci by *q*_{L}, *q*_{ℓ}, *q*_{R}, *q*_{r}, *q*_{LR}, *q*_{Lr}, *q*_{ℓR}, *q*_{ℓr}; *e.g*., *q*_{LR} gives the fraction of the total population carrying both the *L*-allele at the *L*-locus and the *R*-allele at the *R*-locus.

Several statistics have been proposed to measure correlations, *i.e*., LD, between two loci. Two of them are

$$D := q_{LR} - q_L q_R, \qquad r^2 := \frac{D^2}{q_L\, q_\ell\, q_R\, q_r}. \tag{2}$$

Usually, data are obtained from samples only while these equations are based on population frequencies. As a consequence, measures for LD need to be corrected for finite sample size (Hudson 1985). Denoting allelic frequencies in the sample by *q̂*_{L}, *q̂*_{R}, …, we obtain sample measures of LD by exchanging population frequencies with sample frequencies in (2), which results in the sample measures *D̂* and *r̂*^{2}.

While one can obtain moments of the random variable *D* for various demographic scenarios, even the expectation of *r*^{2} is hard to obtain under a standard neutral model. (Note, however, the recent advances in Song and Song 2007.) It was argued by Hudson (1985) that standardized LD, introduced by Ohta and Kimura (1969),

$$\sigma_D^2 := \frac{\mathbb{E}[D^2]}{\mathbb{E}[q_L\, q_\ell\, q_R\, q_r]}, \tag{3}$$

provides a good approximation of 𝔼[*r*^{2}] as long as low-frequency variants are ignored.
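The two-locus statistics can be made concrete with a small sketch. It assumes the standard definitions *D* = *q*_{LR} − *q*_{L}*q*_{R} and *r*^{2} = *D*^{2}/(*q*_{L}*q*_{ℓ}*q*_{R}*q*_{r}); the function name `ld_measures` and the argument order are illustrative choices.

```python
def ld_measures(q_LR, q_Lr, q_lR, q_lr):
    """Return (D, r2) from the four haplotype frequencies at two diallelic loci.

    q_lR and q_lr stand for the haplotypes carrying the lowercase-l allele
    at the L-locus. The frequencies must sum to one.
    """
    total = q_LR + q_Lr + q_lR + q_lr
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    q_L = q_LR + q_Lr              # marginal frequency of allele L
    q_R = q_LR + q_lR              # marginal frequency of allele R
    D = q_LR - q_L * q_R           # covariance measure
    denom = q_L * (1.0 - q_L) * q_R * (1.0 - q_R)
    r2 = D * D / denom if denom > 0 else float("nan")
    return D, r2

# Perfect association: only haplotypes LR and lr are present.
D_perfect, r2_perfect = ld_measures(0.5, 0.0, 0.0, 0.5)
# Linkage equilibrium: all four haplotypes equally frequent.
D_indep, r2_indep = ld_measures(0.25, 0.25, 0.25, 0.25)
```

Note that standardized LD is a ratio of expectations over replicate populations, so it cannot be computed from a single set of frequencies in this way; the sketch covers only *D* and *r*^{2}.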

#### The star-like approximation:

To approximate polymorphism patterns at the end of the selective sweep we use a genealogical perspective and introduce the star-like approximation. In this approximation we assume throughout that the selective sweep is so short that no new neutral mutations occur during fixation of the beneficial allele.

We proceed in three steps. First, we consider the selected site only; then we add a single neutral locus; finally, we add a second neutral locus. The latter approximation allows us to derive explicit expressions for 𝔼[*D*] and σ̂_{D}^{2} at the end of the selective sweep in Equations 5 and 10.

##### The genealogy at the selected site:

Consider a sample of beneficial alleles taken from the population at time *T*. Apparently, there is a single haploid individual at time 0 that is the ancestor of all individuals in the sample. In our analysis we make the assumption that this individual at time 0 is in fact the *most recent* common ancestor of all possible samples. Consequently, the genealogy at the selected site is star-like.

The assumption of a star-like genealogy at the selected site is frequently used in the analysis of selective sweeps (Maynard Smith and Haigh 1974; Fay and Wu 2000; McVean 2007). Moreover, it has been shown that it is accurate as long as log(α) is large (Durrett and Schweinsberg 2004).

##### The genealogy at a linked neutral locus:

If DNA sequences did not recombine, the whole chromosome would share the same ancestry with the beneficial allele. However, by recombination, common ancestry is broken up. Let us consider the allele at a single neutral locus linked to the selected site carrying the beneficial allele *B*. It might be that an ancestor of this allele was linked to a wild-type allele *b* and only by recombination merged with a beneficial allele *B*. Following ancestral lines this means that the ancestral line changes its background from the beneficial to the wild-type background. Assuming that ρ is the scaled recombination rate between the beneficial and the neutral locus and the frequency of the beneficial allele is *X*, the instantaneous rate of changing backgrounds is ρ(1 − *X*). The probability that the ancestral line does not change backgrounds is thus (recall α := 2*Ns*)

$$p(\rho) := \exp\Big(-\rho \int_0^T (1 - X_t)\,dt\Big) \approx \exp\Big(-\frac{\rho}{\alpha}\,\log\alpha\Big) \tag{4}$$

(Kaplan *et al*. 1989; Barton 1998). This event is shown in Figure 2 in case (i). With probability 1 − *p*(ρ) there was a recombination event and the neutral allele is linked to a wild-type one at time *t* = 0; this happened to line (ii) in Figure 2. We also say that the line escaped the sweep (backward in time). By the star-like approximation, each line of a finite sample escapes the sweep independently of the others. It has been shown that other events, *e.g*., back-recombination into the beneficial background, occur only with low probability (Durrett and Schweinsberg 2004; Etheridge *et al*. 2006). Hence, we ignore such events here.
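The probability of not changing backgrounds can be checked numerically. The sketch below integrates the rate ρ(1 − *X*_{t}) along the logistic trajectory with ε = 1/α and compares the result with the closed form exp(−ρ log(α)/α). Both the closed form and the names `p_no_escape` and `p_no_escape_numeric` are assumptions made for illustration.

```python
import math

def p_no_escape(rho, alpha):
    """Closed-form approximation exp(-rho*log(alpha)/alpha)."""
    return math.exp(-rho * math.log(alpha) / alpha)

def p_no_escape_numeric(rho, alpha, steps=20000):
    """Midpoint-rule evaluation of exp(-rho * int_0^T (1 - X_t) dt).

    X_t is the logistic trajectory started at eps = 1/alpha and stopped
    at T = 2*log(1/eps - 1)/alpha.
    """
    eps = 1.0 / alpha
    T = 2.0 * math.log(1.0 / eps - 1.0) / alpha
    dt = T / steps
    integral = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        x = eps / (eps + (1.0 - eps) * math.exp(-alpha * t))
        integral += (1.0 - x) * dt
    return math.exp(-rho * integral)

# Example values: alpha = 1000 and a neutral locus at scaled distance rho = 50.
closed = p_no_escape(50.0, 1000.0)
numeric = p_no_escape_numeric(50.0, 1000.0)
print(closed, numeric)
```

The two values agree closely because the integral equals log(1/ε − 1)/α exactly for the logistic curve, which is nearly log(α)/α when α is large.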

##### The joint genealogy at two linked neutral loci:

To derive expressions for LD between two neutral sites we have to extend the star-like approximation. During the selective phase several recombination events might happen. To distinguish them, we speak, *e.g*., of an *SL*-recombination event if it falls between the *S*- and the *L*-locus.

For both geometries we divide the time of the selective sweep into two halves. Toward the end of the selective phase, we assume that no recombinations to the wild-type background occur. The only events that occur in this phase are *LR*-recombination events to the effect that the alleles at both loci are linked to different beneficial alleles; see Figure 3. The probability that the ancestries of the alleles at the *L*- and *R*-loci do not split in this second half is approximately

$$\exp\Big(-\rho_{LR}\int_{T/2}^{T} X_t\,dt\Big) = \exp\Big(-\rho_{LR}\int_{0}^{T/2} (1 - X_t)\,dt\Big) \approx \exp\Big(-\rho_{LR}\int_{0}^{T} (1 - X_t)\,dt\Big) = p(\rho_{LR})$$

[recall (4)], where we used the fact that the contribution of times when *t* > *T*/2 to the last integral is small. This case is shown for line (i) of Figure 3. With probability 1 − *p*(ρ_{LR}) the alleles at the *L*- and *R*-loci have different ancestors, which both carry the beneficial allele at time *T*/2 as shown in line (ii) of Figure 3. In the latter case the ancestral lines of the alleles at the *L*- and the *R*-locus independently escape the sweep as in the case of a single neutral locus in Figure 2.

For the joint genealogy of both neutral loci during the starting phase of the selective sweep we have to distinguish between geometries a and b. We set *p*_{□} := *p*(ρ_{□}) for □ = *SL*, *LR*, *SR*, *LS*. Let us first consider geometry a, where the selected locus is outside both neutral loci; see also Figure 4. All cases are listed in Table 1.

Consider line (i) as an example. We assume that all recombination events that split the alleles at the two loci such that both remain in the beneficial background already occurred in the late phase of the sweep. Hence, all recombination events automatically bring at least one allele to the wild-type background and both alleles stay linked in the beneficial background only if neither an *SL*- nor an *LR*-recombination event occurs. Since recombination events between both pairs occur independently and the probability that no recombination event brings an allele in scaled recombination distance ρ to the wild-type background is *p*(ρ), it follows that case (i) has probability *p _{SL}p_{LR}*.

Observe that the effect for both lines (iv) and (v) is that the alleles at both loci are unlinked in the wild-type background. To produce one of these events there must be one *SL*- and one *LR*-recombination event. In line (iv) the first recombination event (backward in time) occurs between *S* and *L* and the second only between *L* and *R*, while in line (v) the order is reversed. Altogether, either of the two events happens if and only if there is both an *SL*- and an *LR*-recombination event, which results in the given probability of (1 − *p*_{SL})(1 − *p*_{LR}).

The genealogy for geometry b can be obtained similarly. Figure 5 and Table 2 give all the details. Observe that for geometry b it is not possible that an allele at the *L*- and one at the *R*-locus are linked in the wild-type background at *t* = 0.

Again, by the star-like approximation, the ancestry of each line of a finite sample behaves independently of the other lines.
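The early-phase cases for geometry a can be tabulated in a few lines. The sketch assumes that *SL*- and *LR*-recombination events occur independently and that *p*(ρ) = exp(−ρ log(α)/α). The probabilities for line (i) and for lines (iv) and (v) are as stated in the text; the two intermediate cases are inferred from the same independence argument and are therefore an assumption, as are all names used here.

```python
import math

def p(rho, alpha):
    """Probability that a line at scaled distance rho stays in the beneficial background."""
    return math.exp(-rho * math.log(alpha) / alpha)

def early_phase_cases_geometry_a(rho_SL, rho_LR, alpha):
    """Fate of a linked L/R pair during the early phase, geometry a (order S, L, R)."""
    p_SL, p_LR = p(rho_SL, alpha), p(rho_LR, alpha)
    return {
        # line (i): neither an SL- nor an LR-recombination event
        "both_stay_beneficial": p_SL * p_LR,
        # inferred: only an LR-event, so R escapes while L stays
        "only_L_stays": p_SL * (1.0 - p_LR),
        # inferred: only an SL-event, so both escape and remain linked
        "linked_in_wild_type": (1.0 - p_SL) * p_LR,
        # lines (iv) and (v): both kinds of events occur
        "unlinked_in_wild_type": (1.0 - p_SL) * (1.0 - p_LR),
    }

cases = early_phase_cases_geometry_a(rho_SL=20.0, rho_LR=5.0, alpha=1000.0)
total = sum(cases.values())
```

A useful consistency check is that the *R*-allele stays in the beneficial background with probability *p*_{SL}*p*_{LR}, which equals *p*(ρ_{SL} + ρ_{LR}) = *p*_{SR}, in line with the single-locus result.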

## RESULTS

We are now in a position to obtain analytical results on measures of LD at the end of a selective sweep.

#### 𝔼[*D*(*T*)]:

Writing *D*(0) and *D*(*T*) for the LD measures at the beginning and end of the selective sweep, we obtain (using the star-like approximation)

$$\mathbb{E}[D(T)] = p_{LR}\,\big(p_{LR} - p_{SL}\,p_{SR}\big)\,\mathbb{E}[D(0)] \qquad \text{and} \qquad \mathbb{E}[D(T)] = 0 \tag{5}$$

for geometries a and b, respectively. Note that (5) agrees approximately with Equation 47 in Stephan *et al*. (2006) for large values of α.

To derive (5), consider a pair of one allele at the *L*- and one allele at the *R*-locus at the end of the sweep. In the case that the alleles are linked (*i.e*., taken from the same individual at the end of the sweep) we denote the probability that both have the same ancestor (*i.e*., their ancestors are linked) at the beginning of the sweep by *d*. Moreover, if the alleles are unlinked at the end of the sweep, we denote the probability that their ancestors are linked at the beginning of the sweep by *e*. Assuming no new mutations during the sweep,

$$\mathbb{E}[q_{LR}(T)] = d\,\mathbb{E}[q_{LR}(0)] + (1-d)\,\mathbb{E}[q_L(0)\,q_R(0)], \qquad \mathbb{E}[q_L(T)\,q_R(T)] = e\,\mathbb{E}[q_{LR}(0)] + (1-e)\,\mathbb{E}[q_L(0)\,q_R(0)],$$

such that

$$\mathbb{E}[D(T)] = (d - e)\,\mathbb{E}[D(0)], \tag{6}$$

which leaves us with the task to compute *d* and *e* for geometries a and b. We have

$$e = p_{SL}\,p_{SR} \ \ \text{(geometry a)}, \qquad e = p_{LS}\,p_{SR} \ \ \text{(geometry b)}, \tag{7}$$

because the ancestors of a pair of alleles that are unlinked at the end of the sweep can be linked at the beginning of the sweep only if neither of the two ancestral lines recombines out of the sweep. Moreover, for *d*, we have two cases: either the two linked alleles split between *T* and *T*/2 like line (ii) in Figure 3 or they do not. If they split, the probability for a common ancestor is the same as for the unlinked case, *e*. If the two alleles do not split between *T* and *T*/2, there must not be a recombination event separating them between *T*/2 and 0. So,

$$d = (1 - p_{LR})\,e + p_{LR}\cdot p_{LR} \ \ \text{(geometry a)}, \qquad d = (1 - p_{LR})\,e + p_{LR}\,p_{LS}\,p_{SR} \ \ \text{(geometry b)}. \tag{8}$$

Combining (7) and (8) with (6) shows (5).
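The verbal derivation above translates directly into a short calculation. The sketch below is one reading of that argument and should be treated as an assumption: *e* is taken as the probability that neither line escapes, and *d* combines a late-phase split (probability 1 − *p*_{LR}) with the requirement of no separating recombination in the early phase. The function name `ld_retention` is hypothetical.

```python
def ld_retention(p_S_left, p_LR, p_SR, geometry):
    """Return d - e, the factor relating E[D(T)] to E[D(0)].

    p_S_left is p_SL for geometry a and p_LS for geometry b.
    """
    e = p_S_left * p_SR                      # neither line escapes the sweep
    if geometry == "a":
        stay_linked = p_LR                   # no LR-event in the early phase
    elif geometry == "b":
        stay_linked = p_S_left * p_SR        # neither an LS- nor an SR-event
    else:
        raise ValueError("geometry must be 'a' or 'b'")
    d = (1.0 - p_LR) * e + p_LR * stay_linked
    return d - e

# Geometry a with rho_SR = rho_SL + rho_LR, so that p_SR = p_SL * p_LR.
p_SL, p_LR = 0.9, 0.8
factor_a = ld_retention(p_SL, p_LR, p_SL * p_LR, "a")
# Geometry b with rho_LR = rho_LS + rho_SR, so that p_LR = p_LS * p_SR.
p_LS, p_SR = 0.9, 0.8
factor_b = ld_retention(p_LS, p_LS * p_SR, p_SR, "b")
```

Under this reading the factor for geometry a simplifies to *p*_{LR}^{2}(1 − *p*_{SL}^{2}) when *p*_{SR} = *p*_{SL}*p*_{LR}, and it vanishes for geometry b, matching the statement that LD is eliminated across the selected site.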

#### σ̂_{D}^{2}:

To formulate our result on σ̂_{D}^{2}, we need the three quantities (9) for 0 ≤ *t* ≤ *T*. At the end of the selective sweep, we show that if the sample size *n* is large enough such that terms of order 1/*n*^{2} can be ignored, we have (10), which is stated in terms of the three quantities (9) at the beginning of the sweep. Moreover, ζ and χ are corrections according to the finite sample size for geometry a and are given by (11). If the population at time 0 is in equilibrium, both loci mutate with probability *u* and we set θ := 4*Nu*. Ohta and Kimura (1969) have shown that (12)

Assuming that the population was in neutral equilibrium when the sweep started, we predict the pattern of LD for α = 1000, *n* = 20, a per-site mutation rate of θ = 0.005, and ρ = 0.025 between two adjacent bases shown in Figures 6 and 7. Note that selection coefficients in the order of α = 1000 are observed in practice (Beisswanger *et al*. 2006). Significant amounts of LD build up on each side of the selected site, but there is no LD for a pair of polymorphisms from both sides of the selected site. In Figure 7 we assume that two neutral polymorphisms have a fixed distance and consider the dependence of LD on their distance to the selected site. We see here that the finite sample size has a profound effect on the level of LD. Moreover, even for ρ_{LR} = 50 a twofold increase of LD relative to neutral expectations can be expected if both neutral loci are in a 2-kb distance from the selected site.

The big effect of the finite sample size (*n* = 20 in the numerical example) close to the selected site on σ̂_{D}^{2} can be seen from (10). Note that for ρ_{SL} ≈ ρ_{SR} ≈ ρ_{LR} ≈ 0 we have *p*_{SL} ≈ *p*_{SR} ≈ *p*_{LR} ≈ 1 and we find that σ̂_{D}^{2} ≈ 1. However, mutations in the region close to the selected site are rarely observed.

To derive (10) we show that for geometry a (13) and for geometry b (14). These results, together with (A4) and , then imply (10).

There is a close relationship between σ_{D}^{2} and pairwise heterozygosities as described in the appendix. There are three measures for pairwise heterozygosity we have to take into account. Consider two pairs of one allele at the *L*- and one allele at the *R*-locus each, taken from the population at time *T*. Let *f*_{T}, *g*_{T}, and *h*_{T} be the probabilities that both pairs are heterozygous if both pairs are linked, only one pair is linked, and both pairs are unlinked, respectively; see also the appendix. The quantities *f*_{0}, *g*_{0}, and *h*_{0} are defined analogously for the population at time 0. Moreover, we take *f*_{T/2}, *g*_{T/2}, and *h*_{T/2} as the corresponding pairwise heterozygosities if the two pairs of one allele at the *L*- and one allele at the *R*-locus each are taken from the *beneficial* background at time *T*/2.

To obtain (13) and (14) we consider a sample taken at time *T*. First, splits of linked alleles at the *L*- and *R*-loci in the beneficial background are generated between *T* and *T*/2. For both geometries, we obtain

$$f_T = p_{LR}^2\, f_{T/2} + 2 p_{LR}(1 - p_{LR})\, g_{T/2} + (1 - p_{LR})^2\, h_{T/2}, \qquad g_T = p_{LR}\, g_{T/2} + (1 - p_{LR})\, h_{T/2}, \qquad h_T = h_{T/2}. \tag{15}$$

To see this, consider two linked pairs of alleles at the *L*- and *R*-loci as an example. These are heterozygous at both the *L*- and the *R*-locus if none of them splits (which occurs with probability *p*_{LR}^{2}), one of them splits [probability 2*p*_{LR}(1 − *p*_{LR})], or both split [probability (1 − *p*_{LR})^{2}] and the resulting pairs of *L*- and *R*-loci are heterozygous.
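The late-phase step, in which each linked pair independently splits with probability 1 − *p*_{LR}, can be written as a small function. This is a sketch of the argument in the preceding paragraph; the name `late_phase_heterozygosities` and the tuple layout are illustrative assumptions.

```python
def late_phase_heterozygosities(f_half, g_half, h_half, p_LR):
    """Map heterozygosities at time T/2 (beneficial background) to time T.

    Each linked pair stays linked with probability p_LR and otherwise
    behaves like an unlinked pair; unlinked pairs are unaffected.
    """
    f_T = (p_LR ** 2 * f_half
           + 2.0 * p_LR * (1.0 - p_LR) * g_half
           + (1.0 - p_LR) ** 2 * h_half)
    g_T = p_LR * g_half + (1.0 - p_LR) * h_half
    h_T = h_half
    return f_T, g_T, h_T

# Two limiting cases with arbitrary example inputs.
no_splits = late_phase_heterozygosities(0.3, 0.2, 0.1, p_LR=1.0)
all_split = late_phase_heterozygosities(0.3, 0.2, 0.1, p_LR=0.0)
```

With *p*_{LR} = 1 nothing changes, and with *p*_{LR} = 0 all three quantities collapse to the unlinked value, as expected when every linked pair is broken up.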

Furthermore, using the star-like approximation we can compute *f*_{T/2}, *g*_{T/2}, and *h*_{T/2}. For example, consider two linked pairs of one allele at the *L*- and one allele at the *R*-locus each in the beneficial background at time *T*/2. One possibility that they are heterozygous at both loci is that their ancestors at time 0 are two linked pairs of one *L*- and one *R*-locus each and these are heterozygous. The probability for this event (which is denoted *a*_{11} below) is for geometry a given as (1 − *p*_{SL}^{2})*p*_{LR}^{2} since at least one *SL*-recombination event and no *LR*-recombination event must occur. In other words, one of the two lines is like (iii) in Figure 4 while the other is either like (i) or (iii) in the same figure. For geometry b this probability is 0 because it is not possible to have a linked pair of one *L*- and one *R*-locus in the wild-type background at time 0 as can be seen from Figure 5.
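The value of *a*_{11} for geometry a can be cross-checked by enumerating the fates of the two lines. The sketch assumes four mutually exclusive fates per linked pair with the probabilities implied by independent *SL*- and *LR*-events; the labels follow the line numbering of Figure 4 as far as the text describes it, and the fate labelled `"ii"` is an inferred case.

```python
from itertools import product

def a11_geometry_a(p_SL, p_LR):
    """Probability that two linked pairs trace back to two distinct linked pairs."""
    fates = {
        "i": p_SL * p_LR,                       # linked, beneficial background
        "ii": p_SL * (1.0 - p_LR),              # split, only L stays (inferred)
        "iii": (1.0 - p_SL) * p_LR,             # linked, wild-type background
        "iv_v": (1.0 - p_SL) * (1.0 - p_LR),    # split, both in wild-type
    }
    total = 0.0
    for (x, px), (y, py) in product(fates.items(), repeat=2):
        both_linked = x in ("i", "iii") and y in ("i", "iii")
        # Two lines that both stay in the beneficial background share the
        # single founder haplotype and cannot be heterozygous.
        same_founder = (x == "i" and y == "i")
        if both_linked and not same_founder:
            total += px * py
    return total

value = a11_geometry_a(0.9, 0.8)
closed_form = (1.0 - 0.9 ** 2) * 0.8 ** 2
```

The enumeration reproduces the product of "at least one *SL*-event among the two lines" and "no *LR*-event on either line".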

As a second example consider *a*_{23} for geometry a. This is the probability that one linked and one unlinked pair of one allele at the *L*- and one allele at the *R*-locus each, taken from the beneficial background at time *T*/2, have four different ancestors at time 0. Either the ancestral line of exactly one allele at the *L*-locus stays in the beneficial background [probability 2*p*_{SL}(1 − *p*_{SL})] and both alleles at the *R*-locus escape the sweep [probability (1 − *p*_{LR})(1 − *p*_{SR})] or both alleles at the *L*-locus are linked to a wild-type allele at the beginning of the sweep [probability (1 − *p*_{SL})^{2}] and the linked pair is split by an *LR*-recombination event [probability (1 − *p*_{LR})].

Altogether we have

$$(f_{T/2},\, g_{T/2},\, h_{T/2})^{\top} = A\,(f_0,\, g_0,\, h_0)^{\top} \tag{16}$$

with *A* = (*a*_{ij})_{1≤i,j≤3}. For geometry a, *A* has the form

For geometry b, *LS*- and *SR*-recombination events occur independently, leading to

Combining (A3), (15), and (16) we can write for both geometries

For geometry a

which shows (13). For geometry b we have similarly

which gives (14).

#### Simulations:

We use the program SSW (Kim and Stephan 2002) to simulate data under a selective sweep and compare these simulations to our predictions for σ̂_{D}^{2} from (10). We changed the program to set ε = 1/α in (1). The parameter values in our simulations coincide with those taken for Figures 6 and 7. We consider a 20-kb stretch of DNA in a sample of *n* = 20 taken at the time a beneficial mutation with α = 1000 has fixed. Here, the sweep region where levels of polymorphism are reduced by at least 50% consists of ∼10 kb.

The heuristics of Hudson (1985) that 𝔼[*r*^{2}] and σ̂_{D}^{2} coincide approximately if we ignore low-frequency variants are also valid at the end of a selective sweep (consult the supporting online supplemental material to see numerical results). Moreover, in Figure 8 we compare simulated data to predictions from (10) for *n* = 20. Here, we divide the 20-kb stretch of DNA into 100 bins of 0.2 kb each and measure LD between SNPs of two different bins. In Figure 8A, we use adjacent bins while Figure 8B shows results for bins that are 2 kb apart. We see that LD is highly elevated for the closely linked pair, as is also predicted by (10). The fit between simulated data and our predictions is worse for smaller values of α and larger *n* (see supporting online supplemental material). While the deviation from the numerical results is as large as 25% in Figure 8B, *i.e*., for α = 1000, it increases to 30% for α = 500 and decreases to 20% for α = 2000 with the same values for ρ_{LR}/α, respectively. The worse fit of the analytical results for the larger sample size can also be explained. The genealogy of larger samples is more complex and thus may differ from the star-like approximation in several ways.

For data analysis it is most important to see how long such a pattern can be observed. In Figure 9 we analyze the pattern of LD in the sweep region at three time points: before fixation when the frequency of the beneficial allele is 0.95 (which is for the given parameters the time *t* ≈ *T* − 0.01*N*), at the time of fixation, and 0.1*N* generations afterward. Two observations can be made here. First, LD between both sides of the selected site is destroyed only at the very end of a selective sweep. Second, while LD for closely linked (0.2 kb, which equals ρ_{LR} = 5) neutral variants is still elevated after 0.1*N* generations, the effect of selection on LD completely vanishes for more distant (2 kb, which equals ρ_{LR} = 50) neutral loci. A closer analysis reveals that the decay of LD is fastest directly after the selective sweep (see supporting online supplemental material).
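A rough back-of-the-envelope calculation helps to see why the two distances behave so differently after fixation. The sketch below uses the textbook deterministic decay of the expected covariance measure *D* under recombination alone, exp(−ρ*t*) with *t* in units of 2*N* generations. This is only an illustration: it ignores drift and new mutations, it concerns *D* rather than standardized LD, and it is not a substitute for the SSW simulations. The function name `ld_decay_factor` is hypothetical.

```python
import math

def ld_decay_factor(rho, generations_over_N):
    """Fraction of expected D remaining after recombination alone.

    rho is the scaled recombination rate 2Nr; generations_over_N is the
    elapsed time in units of N generations (0.1 means 0.1*N generations).
    """
    t = generations_over_N / 2.0      # convert to units of 2N generations
    return math.exp(-rho * t)

close_pair = ld_decay_factor(5.0, 0.1)     # 0.2 kb apart, rho_LR = 5
distant_pair = ld_decay_factor(50.0, 0.1)  # 2 kb apart, rho_LR = 50
print(round(close_pair, 3), round(distant_pair, 3))
```

Under these simplifying assumptions roughly three quarters of the association survives for the closely linked pair, while <10% survives for the distant pair, which is qualitatively in line with the simulated pattern.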

## DISCUSSION

Recently several statistical tests to infer selection using patterns of LD have been developed (Hudson *et al*. 1994; Depaulis and Veuille 1998; Sabeti *et al*. 2002; Toomajian *et al*. 2003; Kim and Nielsen 2004; Hanchard *et al*. 2006; Wang *et al*. 2006). The heuristics behind these tests are as follows: if a beneficial allele enters the population and increases in frequency, neutral variants increase in frequency by genetic hitchhiking. Recombination did not have much time during the selective sweep to break up linkage between these neutral polymorphisms. As a consequence, we see alleles that have both high frequency—typical for old alleles under neutrality—and long-range associations with other alleles, which is typical for young alleles (Sabeti *et al*. 2006).

In a simulation study, Jensen *et al*. (2007) carry out a power analysis of the test developed in Kim and Nielsen (2004). They show that distinct patterns of LD vanish within 0.1*N* generations after fixation of the beneficial allele. Such a signal is too weak to produce significant results using the overall pattern of LD. However, using the increased level of LD between tightly linked polymorphisms it might be possible to distinguish recurrent sweeps from neutrality or other demographic scenarios, for example, population bottlenecks.

On a fine scale, the effect of genetic hitchhiking on LD at the time of fixation can be described as follows (see also Figure 6): on either side of the beneficial allele, correlations between existent polymorphisms are built up, leading to long-range LD. Between the two sides of the beneficial allele LD is destroyed. This destruction can be explained heuristically: the observation of polymorphisms on any side of the beneficial allele (assuming no new mutations in the sweep) requires a recombination event between the beneficial allele and the neutral polymorphisms. By this recombination event a large haplotype is introduced into the population, leading to strong LD on each side of the beneficial allele. The existence of two neutral polymorphisms on both sides of the beneficial allele requires two independent recombination events, one on each side of the beneficial allele. By the independence of these events, LD vanishes when the beneficial allele fixes.

Looking at the pattern of LD at the end of a selective sweep, one might be tempted to conclude that there must be a hotspot of recombination at the selected site. This has been investigated by Reed and Tishkoff (2006), who indeed found out that hitchhiking may confound tests for recombination hotspots. However, only hitchhiking can reduce sequence diversity, which helps to make a clear distinction between genetic hitchhiking and recombination hotspots (McVean 2007).

In our study, we use the star-like approximation for the genealogy at the selected site to describe patterns of LD. This approximation is already implicit in the analysis of Maynard Smith and Haigh (1974) and it still inspires new methods for data analysis (*e.g*., Nielsen *et al*. 2005). Our star-like approximation of the joint genealogy at the two neutral loci is a slight but crucial modification of the approach of McVean (2007). On the one hand, McVean does not describe splits in the wild-type background [see line (iv) in Figure 4] but implicitly accounts for these events. On the other hand, he ignores splits in the beneficial background that are shown in Figure 3. As a consequence, his star-like approximation becomes less accurate with increasing distance of both neutral loci. In addition, McVean's approximation is incompatible with the results on *D* obtained in Stephan *et al*. (2006). Like McVean we see a big effect of a finite sample size on patterns of LD. Since we use larger selection coefficients α, we could not reproduce his finding that neutral mutations that are more recent than the beneficial allele lead to a significant reduction in LD.

Generally, the star-like approximation gives a good approximation for σ̂_{D}^{2} at the end of a selective sweep. It predicts correctly the increase of LD close to the selected site and the elimination of LD between both sides of the selected site. The slight underestimation of LD by the star-like approximation (see Figure 8) can also be explained: coalescence events during the selective phase lead to more complex scenarios than star-like genealogies at the selected site. In particular, these events are responsible for the fact that the star-like approximation creates a genealogy that is too long, which then leads to an overestimation of the number of recombination events (*i.e*., an underestimation of LD) under the star-like hypothesis. Even more complex genealogies appear if we take back-recombinations into the beneficial background into account (Barton 1998). Although such events have been shown to appear with low probability (Etheridge *et al*. 2006), the star-like approximation underestimates LD because back-recombinations can lead to common ancestry at the beginning of the selective sweep.

The star-like approximation was criticized and proposed to be replaced by the genealogy of a Yule process (Durrett and Schweinsberg 2004; Pfaffelhuber *et al*. 2006). The corresponding Yule approximation for the joint genealogy of two neutral loci using a Yule process was obtained by Pfaffelhuber and Studeny (2007). However, the star-like approximation is still useful. First, as shown by Durrett and Schweinsberg (2004) the star-like approximation for a single neutral locus gives correct results if log(α) is large. Second, by the independence of all lines during the selective sweep, it allows for explicit calculations. In particular, using the star-like approximation, it is possible to obtain not only predictions for standardized LD or second moments of *D*—see (13) and (14)—but also higher moments of *D*.

Recently, the model of selective sweeps has been extended to the case of multiple origins of the beneficial allele—so-called *soft selective sweeps* (Hermisson and Pennings 2005). Such multiple origins may, for example, be due to recurrent mutation to the beneficial allele during its fixation. Together with new mutants at the selected site, new ancestral haplotypes are imported into the beneficial background. As a consequence, statistical tests based on haplotype structure, *i.e*., LD, have most power to detect soft selective sweeps (Pennings and Hermisson 2006). Moreover, the coalescent at the selected locus was also derived (Pennings and Hermisson 2006): given that the frequency of the beneficial allele is *x*, an ancestral line escapes the sweep with rate θ_{b}(1 − *x*)/(2*x*), where θ_{b} is the scaled mutation rate to the beneficial allele. Extending the analysis of genealogies to pairs of neutral loci close to the selective sweep, we believe that an analysis of LD under soft selective sweeps is feasible, shedding new light on the distinction between classical and soft selective sweeps.

## APPENDIX

Two relationships are important for the derivation of our results on LD: first, a genealogical interpretation of σ_{D}^{2} and second, the difference between measures of LD in the population and in a sample. Neither is restricted to the analysis of selective sweeps.

#### Linkage disequilibrium and pairwise heterozygosities:

Consider two diallelic loci (one called the *L*-locus and the other the *R*-locus) in a population with (random) allele frequencies *q*_{L}, *q*_{R}, *q*_{ℓ}, *q*_{r}, *q*_{LR}, … . Using (2) we write (A1) such that .

To interpret the quantities in (A1) as pairwise heterozygosities, we need three quantities. Consider two pairs of one allele at the *L*- and one allele at the *R*-locus each. Each pair of alleles is either linked or unlinked, *i.e*., the allele at the *L*-locus lies on the same chromosome as the allele at the *R*-locus or they are located on different chromosomes from different individuals. Two pairs of alleles might both be linked, or one is linked while the other is unlinked, or both are unlinked. Denote by *f* the probability that two linked pairs of alleles at the *L*- and the *R*-locus are heterozygous at both the *L*- and the *R*-locus. Moreover, *g* denotes the probability that two pairs of alleles, where the first pair is linked and the second pair is unlinked, are heterozygous when picked randomly from the population. Third, *h* denotes the probability that two pairs of unlinked alleles are heterozygous.

For the allelic frequencies, some relationships hold; *e.g*., *q*_{Lr} = *q*_{L} − *q*_{LR}, *q*_{ℓR} = *q*_{R} − *q*_{LR}, *q*_{ℓr} = 1 − *q*_{L} − *q*_{R} + *q*_{LR}, and *D* = *q*_{ℓr} − *q*_{ℓ}*q*_{r}. These allow us to write (A2) such that (A3). Note that Strobeck and Morgan (1978) and Hudson (1985) use pairwise *homo*zygosities to derive σ_{D}^{2}, while we use pairwise *hetero*zygosities.
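The link between LD and pairwise heterozygosities can be illustrated for fixed haplotype frequencies. The sketch below computes *f*, *g*, and *h* by direct enumeration, assuming sampling with replacement from a very large population, and checks the identity *f* − 2*g* + *h* = 4*D*^{2}. Whether this matches the exact form of (A2) and (A3) is an assumption; the function name `pairwise_heterozygosities` is illustrative.

```python
def pairwise_heterozygosities(q_LR, q_Lr, q_lR, q_lr):
    """Return (f, g, h, D) for given haplotype frequencies.

    f: two linked pairs differ at both loci.
    g: one linked pair and one unlinked pair differ at both loci.
    h: two unlinked pairs differ at both loci.
    """
    q_L = q_LR + q_Lr
    q_l = 1.0 - q_L
    q_R = q_LR + q_lR
    q_r = 1.0 - q_R
    D = q_LR - q_L * q_R
    # Two haplotypes differ at both loci: (LR, lr) or (Lr, lR), in either order.
    f = 2.0 * (q_LR * q_lr + q_Lr * q_lR)
    # A haplotype versus independently drawn L- and R-alleles of the other type.
    g = (q_LR * q_l * q_r + q_Lr * q_l * q_R
         + q_lR * q_L * q_r + q_lr * q_L * q_R)
    # All four alleles drawn independently.
    h = 4.0 * q_L * q_l * q_R * q_r
    return f, g, h, D

f, g, h, D = pairwise_heterozygosities(0.4, 0.1, 0.2, 0.3)
```

Taking expectations over replicate populations, this identity gives 𝔼[*D*^{2}] = 𝔼[*f* − 2*g* + *h*]/4 and 𝔼[*q*_{L}*q*_{ℓ}*q*_{R}*q*_{r}] = 𝔼[*h*]/4, so standardized LD can be expressed through the three heterozygosities.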

#### Population and sample measures of linkage disequilibrium:

Usually, we are given data from a finite sample and want to compute the amount of LD. Using the sample frequencies *q̂*_{L}, *q̂*_{R}, …, we define the corresponding sample quantities as in (A1). By a calculation analogous to (A2), where *f̂*, *ĝ*, and *ĥ* are the corresponding pairwise heterozygosities in the sample. Between *f*, *g*, and *h* and *f̂*, *ĝ*, and *ĥ*, we have the relationships

For example, two linked pairs of one allele at the *L*- and one allele at the *R*-locus each, taken at random (with replacement) from a sample, are heterozygous if we did not pick the same individual twice and the resulting two different lines are heterozygous at both loci.

Let *I* be the identity matrix and set

such that . Assuming that the sample size is large enough such that terms of order 1/*n*^{2} can be ignored, the above reasoning and some matrix algebra gives (A4)

## Acknowledgments

We thank Joachim Hermisson for fruitful discussion and we are grateful to Pleuni Pennings and Nick Barton for helpful comments on our work. W.S. thanks the Deutsche Forschungsgemeinschaft for support (grant STE 325/7). P.P. thanks the BMBF for financial support via FRISYS (Kennzeichen 0313921).

## Footnotes

Communicating editor: R. Nielsen

- Received September 3, 2007.
- Accepted February 22, 2008.

- Copyright © 2008 by the Genetics Society of America