The allele frequency of a neutral variant in a population is pushed either upward or downward by directional selection on a linked beneficial mutation (“selective sweeps”). DNA sequences sampled after the fixation of the beneficial allele thus contain an excess of rare neutral alleles. This study investigates the allele frequency distribution under selective sweep models using analytic approximation and simulation. First, given a single selective sweep at a fixed time, I derive an expression for the sampling probabilities of neutral mutants. This solution can be used to estimate the time of the fixation of a beneficial allele from sequence data. Next, I obtain an approximation to mean allele frequencies under recurrent selective sweeps. Under recurrent sweeps, the frequency spectrum is skewed toward rare alleles. However, the excess of high-frequency derived alleles, previously shown to be a signature of single selective sweeps, disappears with recurrent sweeps. It is shown that, using this approximation and multilocus polymorphism data, genomewide parameters of directional selection can be estimated.
THE origin and spread of mutations that increase their carriers' reproductive success are fundamental processes of Darwinian evolution. In contrast to deleterious and neutral mutations, beneficial mutations in plants and animals occur too infrequently to be observed directly in laboratories. As a result, we know much less about beneficial mutations (positive selection) than about neutral or deleterious mutations (purifying selection). Many basic facts about beneficial mutations—their rate, fitness effects, genomic locations, and molecular properties—remain unknown. Recent developments in population genetic theory, however, have suggested some promising approaches to obtaining these quantities. Given that beneficial mutations occur very rarely, basic information about them can be obtained only by inferring past evolutionary events of positive selection. As in other problems of evolutionary inference, the dominant approach is to search present-day sequence polymorphism and diversity for the footprint of past events, in this case, for the signature of genetic hitchhiking.
Genetic hitchhiking (or “selective sweeps”) is the sudden loss of genetic variation at neutral loci when a new beneficial allele arises nearby and is fixed in the population (Maynard Smith and Haigh 1974; Kaplan et al. 1989; Stephan et al. 1992; Barton 2000). Recent selective sweeps are expected to generate specific patterns of sequence variation, including a local reduction of variation (Kim and Stephan 2002), a skew in the allele frequency distribution (Braverman et al. 1995; Fay and Wu 2000), and an increase in linkage disequilibrium (Thomson 1977; Przeworski 2002; Kim and Nielsen 2004). Developments of hitchhiking theory, along with the rapid increase of DNA sequence data, have led to the discovery of many episodes of recent positive selection in natural populations (for example, Enard et al. 2002; Clark et al. 2004; Schlenke and Begun 2004).
It is more difficult, however, to measure the cumulative effect of recurrent selective sweeps throughout the genome. The main effect of recurrent selective sweeps is to reduce the standing level of variation, which is well understood (Kaplan et al. 1989; Wiehe and Stephan 1993; Gillespie 2000). In particular, since recombination reduces the hitchhiking effect, a positive correlation between local recombination rate and sequence diversity is predicted under recurrent sweeps. Such a correlation has been observed in Drosophila melanogaster (Begun and Aquadro 1992) and was used to estimate the intensity of directional selection, αλ, where α is the scaled selection coefficient and λ is the frequency of hitchhiking events (Wiehe and Stephan 1993; Stephan 1995; Andolfatto 2001). However, this calculation assumes that selective sweeps alone create the correlation between recombination and genetic variation. In reality, the constant removal of deleterious mutations along the genome by purifying selection (“background selection”; Charlesworth et al. 1993) also reduces the level of variation and contributes to the relationship between recombination and variation (Hudson and Kaplan 1995; Charlesworth 1996). Therefore, fitting observed levels of variation to selective sweep models yields only an overestimate of αλ.
A second signature of the hitchhiking effect is evident in the distribution of neutral allele frequencies. With partial recombination, the hitchhiking effect causes neutral variants at intermediate frequencies to increase close to fixation or to decrease close to extinction. A sample of DNA sequences affected by selective sweeps thus contains fewer segregating alleles of intermediate frequency than expected under neutral equilibrium (Braverman et al. 1995). Background selection, on the other hand, only slightly changes the frequency spectrum (Golding 1997; Przeworski et al. 1997). Thus, it may be possible to isolate the effect of directional selection from that of purifying selection by examining frequency spectra throughout the genome. For example, the observation in Andolfatto and Przeworski (2001) that regions of the D. melanogaster genome with reduced crossing over harbor more rare variants can be used to support the hypothesis that recurrent selective sweeps affect neutral loci throughout the genome. Examination of frequency spectra may further allow us to estimate parameters of directional selection. This requires an analytic solution for sampling probabilities under selective sweep models.
Using a simple model of hitchhiking and a well-known diffusion approximation, this study first derives an approximation to allele frequency distribution for a given time after one selective sweep event. This solution then leads to an approximation under recurrent selective sweeps. These approximations, along with coalescent simulations, are used to investigate patterns of polymorphism under recurrent selective sweeps and to calculate maximum-likelihood estimates of the parameters of directional selection.
To investigate the frequency spectrum after recurrent selective sweeps, the coalescent-with-recombination algorithm of Kim and Stephan (2002) is used with modifications (described below). A genealogy (ancestral recombination graph) is constructed for a sample of n sequences. Time is counted backward into the past in units of either 4N generations (during the neutral phase) or single generations (during the selective phase). Each sequence is M nucleotides (sites) long and recombination between adjacent sites occurs with probability ρ per generation. The ancestral recombination graph starts with n edges at time 0. Recombination and coalescent events occur according to probabilities given in Kim and Stephan (2002).
Simulations are performed under two different models: a single selective sweep at a given time or recurrent selective sweeps. The genealogy for the single-sweep model has one selective phase, which begins at time τ. This simulation is identical to that in Kim and Stephan (2002), except that it uses the simulated trajectory of beneficial allele frequency (see below). For the recurrent sweep model, the construction of a genealogy begins with the neutral phase, which lasts for tn units (∼4Ntn generations). tn is exponentially distributed with mean 1/Λ = 1/(4NMλ), where λ is the rate of strongly selected substitutions per generation per site. When the neutral phase ends, the selective phase begins, followed by another neutral phase. This process is repeated until the height of the ancestral recombination graph (cumulative time from the present) reaches Tlimit = 5 (20N generations). For each selective phase, the location of the beneficial allele is chosen randomly between 1 and M.
Before each selective phase begins, the time-dependent frequency of two genetic backgrounds—the beneficial allele (B) and the ancestral allele (b)—is predetermined by a forward simulation as described by Innan and Kim (2004). It is modeled such that the beneficial allele starting from a single copy is fixed in a diploid population of size N in which the fitnesses of genotypes bb, Bb, and BB are 1, 1 + 2hs, and 1 + 2s, respectively. For these simulations, I used N = 105, but the outcome of the selective sweep simulations depends little on N as long as α = 2Ns is constant (data not shown).
The allele frequency distribution is observed either at fixed distances from the location of selection (single-sweep model) or from a short (1 or 2 kb long) segment in the center of the chromosome (recurrent sweep model). For the latter, marginal coalescent trees for sites M/2 − 499 to M/2 + 500 (for a 1-kb segment) are extracted from the recombination graph, and mutations are added on trees with probability θ per 4N generations. From the kth replicate of this simulation, Skj sites segregating j mutant alleles are generated (k = 1, … , K; j = 1, … , n − 1). Because mutations occur with variable tree lengths, the total number of segregating sites, , may change for different k. The sampling probability for j mutants per site, Pnj, is estimated by (for a 1-kb segment) from all K replicates. Since the frequency spectrum (pn1, pn2, … , pnn−1) is defined as the distribution of allele frequency conditional on polymorphism at the site, pnj is estimated from simulations by .
We consider single-nucleotide polymorphisms (SNPs) in n homologous DNA sequences sampled from a population of random-mating N diploids. The mutation rate is assumed to be very low (i.e., the infinite-sites model is assumed) such that the mutant allele can be unambiguously distinguished from the ancestral allele by using a close outgroup sequence. The probability that a given site is polymorphic with k (= 1, … , n − 1) mutant alleles is defined as Pnk. We denote by Pn0 the probability that the site is monomorphic. Under the model of neutral evolution in a constant-sized population, Pnk = 4Nμ/k = θ/k (Ewens 2004). Kim and Stephan (2002) obtained the corresponding sampling probability after a selective sweep. However, their solution assumes that the sequences are sampled immediately after the fixation of the beneficial mutation. Below, a general solution of Pnk for arbitrary time since the fixation event is derived.
First, Pnk after a single selective sweep is investigated. I assume that the fixation of a beneficial allele occurred t generations ago and consider a neutral locus at a distance from the site of selection measured by the recombination fraction r. The selective advantage of the beneficial mutation is given by s, and genic selection is assumed (h = 0.5). I attempt to obtain Pnk as a function of t. The number of sites segregating a neutral mutant in the frequency interval [x, x + dx] at the time when the beneficial mutation occurs is given by ϕ(x)dx. If the beneficial mutation occurs on a chromosome carrying this neutral variant, x increases to y + (1 − y)x on average, where y is the mean increase of the identity by descent in the population due to hitchhiking (Gillespie 2000), approximated by (4Ns)−r/s (Kim and Nielsen 2004). This event happens with probability xϕ(x)dx (Fay and Wu 2000). If the beneficial mutation occurs in repulsion phase with the variant, x decreases to (1 − y)x on average, with probability (1 − x)ϕ(x)dx. This radical change of neutral allele frequency may be modeled as though it occurs instantaneously, as the sojourn time of the beneficial mutation [≈2 log(4Ns)/s generations] is very short relative to 4N, the coalescent timescale of the neutral model. If ϕ(x) = θ/x, the expected distribution of mutant allele frequencies under neutral equilibrium, and r ≪ s, this hitchhiking effect will produce a skewed distribution in which high-frequency mutants are as frequent as low-frequency mutants but intermediate-frequency mutants are rare (Fay and Wu 2000; Kim and Stephan 2002). However, this U-shaped pattern decays quickly due to mutation and genetic drift: Low-frequency mutants will increase due to new mutational input, while high-frequency mutants will drift either to lower-frequency classes or to fixation. At t generations after the fixation of the beneficial allele, Pnk is approximately(1)for 0 < k < n. E[Xa(1 − X)b | z, t] is the expectation of Xa(1 − X)b, where X is the frequency of a neutral allele that has drifted for t generations starting at frequency z. The first two integral terms of Equation 1 represent the probability of sampling k copies of a neutral allele that was already segregating before the hitchhiking event occurred. The third term represents the contribution by neutral variants that entered the population after the sweep. E[Xa(1 − X)b | z, t] can be obtained using the diffusion approximation (Kimura 1955) (see appendix). After rescaling time by τ = t/(4N), the explicit solution of Equation 1 is given by(2)whereandignoring terms on the order of 1/N (a sufficiently large N is assumed). Coefficients are defined in the appendix.
Figure 1 shows the predicted sampling probabilities for eight sequences with various combinations of time and map distance. The predictions from Equation 2 agree well with simulation results, although Equation 2 predicts a slightly greater excess of extreme-frequency classes (frequencies 1 and n − 1 in the sample), particularly for τ = 0 (see also Kim and Stephan 2002). This is likely due to the assumption that sweeps are instantaneous, which results in ignoring the effect of genetic drift at neutral loci and the stochastic decay of linkage disequilibrium between the beneficial and neutral alleles during the period of the sweep. In the approximation above, y represents only the deterministic increase in the identity by descent. However, in reality and in the simulations y is a random variable due to genetic drift. Figure 2A shows the distribution of y observed in the forward simulation of a selective sweep. This empirically determined distribution of y can be used to predict the hitchhiking effect on the sampling probability at τ = 0 (Figure 2B, shaded bars). It is shown that including the variance of y produces a sampling probability distribution that is less skewed and is closer to the result of coalescent simulation than the prediction by Equation 2. In addition to the fluctuation of y, the genetic drift of the neutral allele along the lineages that escape the hitchhiking-induced coalescence (the frequency of this genetic background is 1 − y at the end of sweep) may further reduce the skew of frequency distribution.
To connect the current solution to earlier studies, the frequency spectrum obtained here may be transformed into conventional summary statistics of the frequency spectrum—Tajima's D (Tajima 1989a) and Fay and Wu's H (Fay and Wu 2000). Both statistics are expected to be zero under the neutral equilibrium and negative after a single selective sweep. Tajima's D compares the number of segregating sites (S) and the average number of pairwise differences (; Tajima 1983) in the sample. For a given S, the prediction of under the current model iswhere pni = Pni/(Pn1 + ⋯ + Pnn−1). Then, the expected Tajima's D might be obtained from Equation 38 of Tajima (1989a), using instead of . Fay and Wu's H is the difference between and , the expected homozygosity of mutant alleles summed over polymorphic sites (Fay and Wu 2000). I normalize Fay and Wu's H by using H = instead of . Mean H might be predicted by , where
Equation 2 can be used to examine how long the skewed distribution of allele frequency lasts. It was shown that the expected heterozygosity is recovered to the before-sweep level approximately in 4N generations after the sweep: (Kim and Stephan 2000). The expectations of relative heterozygosity (), Tajima's D, and Fay and Wu's H were calculated for a single sweep with α = 1000 and n = 15 and are plotted in Figure 3 for various map distances and times after the sweep. The decay of Tajima's D is only slightly faster than the decay of heterozygosity. However, the decay of Fay and Wu's H is much faster, as demonstrated in Przeworski (2002). H becomes even positive after τ = 0.1 in a position-dependent way. This reversal ( > ) was first observed in Kim and Stephan (2000) and is further examined in the next section. The spatial pattern of Fay and Wu's H through time is rather complicated. It was argued that, to produce negative Tajima's D, the recombination fraction from the selective target scaled by the selection coefficient (r/s) should not be too small or too large. The optimal r/s that generates the most negative H shifts upward with τ (this is also true if unnormalized H = − is used; result not shown).
Recurrent selective sweeps:
From the sampling probability for a single, isolated sweep given by Equation 1, we can obtain Pnk under a model of recurrent selective sweeps. Consider a chromosome M nucleotides long. Recombination occurs with probability ρ between any two adjacent nucleotide sites per generation. It is assumed that each nucleotide site mutates into a beneficial allele with selection coefficient s (no dominance). All sites on the chromosome have equal probability, λ, of fixing a beneficial mutation each generation (λ ≈ 4Nus, where u is the rate of new beneficial mutation per site). It is assumed that the site at which the variation is observed (focal site) is located ML nucleotides away from the left end and MR nucleotides from the right end (M = ML + MR). Under this model, the amount and pattern of variation at the focal site depends on the hitchhiking effect of recurrent substitutions on the chromosome (Wiehe and Stephan 1993; Braverman et al. 1995; Gillespie 2000). Averaging over time, an equilibrium distribution of mutant allele frequency will be established. Assume that, under this equilibrium, a neutral mutant at the focal site is found in the frequency interval [x, x + dx] with probability . Let be the corresponding sampling probability. Again, it is assumed that the substitution of a beneficial mutation occurs instantaneously. Going backward in time, the waiting time until the last fixation of a beneficial mutation is given by τ, which is exponentially distributed with mean 1/(4MNλ) = 1/Λ. Since equilibrium is assumed, the allele frequency distributions both at present and at time τ are given by . Then,(3)whereand
As n increases should approach . Then, with sufficiently large n, we may approximate Equation 3 byTherefore, the sampling probabilities under the model of recurrent sweeps are given by(4)where , and Q is a (n − 1) × (n − 1) matrix whose element in ith row and jth column is . The frequency spectrum (pn1, pn2, … , pnn−1) is defined as sampling distribution conditional on polymorphism. Therefore, under recurrent sweeps, it is approximated by(5)
Figure 4A shows that sampling probabilities given by Equation 4 are in good agreement with results from simulation of recurrent sweeps. As in Figure 1, the approximations of and are slightly larger than those observed in the simulations. However, the frequency spectrum observed in the same set of simulated data shows the opposite result (Figure 4B): the analytic approximation () predicts less skew of frequency spectrum than that observed in the simulation. This discrepancy occurs due to the negative correlation between the number of segregating sites in the sample and the skew of the frequency spectrum. For example, if the last hitchhiking event occurs very recently and near the center, the simulated data will have smaller S and greater skew of the frequency spectrum. As the sampling probability estimated from simulation is not normalized by S, it is expected to be less skewed than the frequency spectrum.
Equation 5 is transformed to summary statistics as described in the section of single sweeps. Table 1 shows that the predicted Tajima's D is less negative than the average observed in simulations, as expected from comparisons of the analytic and simulation results shown in Figure 4B. Despite this bias, the predicted Tajima's D is still useful for tracking the general change of frequency spectrum with changing parameters. First, as observed in Braverman et al. (1995), Tajima's D decreases linearly with the reduction of variability (measured by , where is Watterson's estimate of θ; Watterson 1975) relative to the neutral equilibrium (Figure 5). This result is in agreement with Figure 3 in which the heterozygosity and Tajima's D return to equilibrium approximately at the same rate after a single sweep. However, the slope of the linear relationship between and Tajima's D changes with scaled map length of the sequence (R = 4NMr) relative to the strength of selection (α). With smaller R/α, more negative Tajima's D is obtained for the same reduced level of variation. This was observed also in the simulation (Table 1, cases 2–4).
Fay and Wu's H with recurrent selective sweeps is positive ( > ) for all parameter sets examined, including those in Table 1. Figure 4B shows that, with a severe skew of the frequency spectrum caused by recurrent hitchhiking events, more mutant alleles are found at high frequency than at intermediate frequency. However, this skew does not translate into a negative H because, compared to the frequency spectrum under neutrality (curves in Figure 4B), there is no excess of high-frequency mutants. On the other hand, the proportion of low-frequency mutants is much greater than that under the neutral expectation. A recent study using coalescent simulation also found that Fay and Wu's H becomes positive under recurrent sweeps (Haddrill et al. 2005). One may argue that this results from an excessively large proportion of singletons, particularly mutant alleles with frequency 1, relative to high-frequency derived alleles after selective sweeps. However, in all cases of recurrent sweeps in Table 1, Fay and Wu's H is still positive even if singletons are removed from the data sets.
This is an interesting result as previous studies found that the excess of high-frequency derived alleles (negative H) is an important signature of selective sweeps (Fay and Wu 2000; Kim and Stephan 2002; Przeworski 2002). However, these studies modeled recent selective sweeps with no further events in the past. This “single-sweep” model predicts a coalescent tree with long inner branches—i.e., most gene lineages find a very recent common ancestor, but one or two lineages escape the early coalescence via recombination [see Fay and Wu's (2000) Figure 2B]. Mutations mapped on this tree will be found at either very low or high frequency in the sample. In the recurrent sweep model, however, the probability of finding such a tree will be small because gene lineages that have escaped the coalescence at the most recent hitchhiking event are still subject to coalescence at hitchhiking events further back in the past. To confirm this explanation, I performed new simulations in which, after the first hitchhiking event at time tn (distributed exponentially with mean 1/Λ), no more selective sweeps in the past were allowed. The mean values of H from those simulations are negative (Table 1, cases 7–9).
Finally, I examined the effect of the dominance of beneficial mutations on the frequency spectrum. For similar degrees of reduction in , Tajima's D becomes slightly more negative with increasing dominance (Table 1, cases 10–12).
PARAMETER ESTIMATIONS OF DIRECTIONAL SELECTION
In this section, I examine whether the analytic approximation obtained above can be used to estimate parameters of directional selection from sequence data. First, Equation 2 may allow the inference of the time since the hitchhiking event, if only one selective sweep event has occurred in the recent past. In Kim and Stephan (2002), the composite likelihood of the data under the selective sweep model was calculated assuming that sequences were sampled immediately after the fixation of the beneficial allele (τ = 0). Here, Equation 2 is used to perform a likelihood-ratio test as described in Kim and Stephan (2002) but with τ being an unknown variable. Przeworski (2003) also estimated τ using a different maximum-likelihood approach, a rejection-sampling method based on three summary statistics: S, Tajima's D, and the number of distinct haplotypes.
The test was applied to simulated data sets of 15 sequences (10 kb long, 4Nρ = 0.05, θ = 0.005). Fixation of the beneficial mutation occurs in the middle of the sequence at time τ. Three values of τ were used: 0, 0.05, and 0.1. Table 2 shows the mean and standard deviation of the estimate of sweep time (τ) as well as those of the strength (α) and location (X) of the beneficial allele. Different tests were performed depending on whether the parameters θ and X are known or unknown and whether derived/ancestral states of alleles are distinguished. If the true value of θ is assumed to be unknown, Watterson's estimate of θ from data itself is used for the composite likelihood. Estimation of τ is poor in general: standard deviations of are greater than means in many cases. Using the true value of θ slightly improves the accuracy of . It is sometimes possible to infer the location of the target of directional selection from external information (for example, Enard et al. 2002). However, having the true value of X does not improve the parameter estimation of and . Not distinguishing derived vs. ancestral states of alleles generally results in poorer estimates, especially with larger τ. When τ = 0, mean ranged from 0.002 to 0.018. Given that the coalescence of gene lineages due to hitchhiking occurs mainly when the frequency of the beneficial allele is low, the behavior of the genealogy during much of the selective phase is similar to that under neutrality. Therefore, the estimate of τ when τ = 0 should reflect the length of the selective phase. The length of the selective phase in this case is ∼(1/α)ln(2α) ≈ 0.007, in agreement with the range of obtained.
Next, I asked if Equations 4 and 5 can be used to estimate genomic parameters of directional selection under a simple model of recurrent selective sweeps. I assume that selective sweeps occur at a rate λ per generation per nucleotide, regardless of recombination rates that may vary along the genome. The strength of directional selection, α, is constant. Since the distribution of allele frequency captured in the analytic approximation is only the average over many realizations of selective sweeps for a given parameter set, it will be essential to use data from many independently evolving loci. Assume that the data set consists of L such loci. The recombination rate for each locus is assumed known. Let Sij be the number of nucleotide sites with j mutant alleles from the ith locus (j = 0, … , ni − 1; Si0 is defined to be the number of monomorphic sites). A composite likelihood based on the sampling probability given by Equation 4 can then be defined as(6)Note that the likelihood is given by a function of 4Nλ, not λ, because the effect of recurrent sweeps depends on the frequency of hitchhiking events in the coalescent time scale (4N). The composite likelihood based on the frequency spectrum is identical except that pij is used instead of and terms with j = 0 disappear. Remember that the sampling probability can be calculated only when the scaled mutation rate, θ, of the locus is known, while the frequency spectrum does not depend on θ. The estimates of α and 4Nλ (or λ if N is known) may be found by maximization of the above function.
This approach was tested against simulated data sets of L = 30, ni = n = 15, and θ = 0.01. One data set is produced by simulating 30 loci individually and then combining them. The ith locus was simulated with a recombination rate given by 4Nρi = (i + 1) × 10−3 (i = 1, … , L) and with chromosome length Mi = 0.8α/(4Nρi). (Here each “chromosome” models an independently evolving segment on the actual chromosome. Mi is approximately the size of the region affected by a beneficial mutation of strength α; if the scaled recombination rate is >0.4α, the heterozygosity decreases at most by 6%.) Polymorphism is observed in a 2-kb-long segment at the center of each chromosome. A total of 1000 replicate data sets were generated with α = 1000 and 4Nλ = 4 × 10−5. The first four replicates were used to obtain the profile of the composite likelihood using the frequency spectrum in the parameter space of α and 4Nλ (Figure 6). The profile for each replicate has a plateau over the space defined by a product of α and 4Nλ. This suggests that the strength and the rate of selective sweeps may not be estimated separately from frequency spectra of multilocus genomic data. The same conclusion was drawn previously, using the heterozygosity of data: the effect of hitchhiking is given by the product αλ (Wiehe and Stephan 1993; Stephan 1995). However, it is not simple to estimate the composite parameter αλ using Equation 3 because this equation is not expressed by this single variable.
It is often possible to obtain a separate estimate of the rate of substitutions driven by directional selection in the genome (e.g., Smith and Eyre-Walker 2002). If λ is given by an external source, the estimate of α might be obtained. Figure 7 shows the distribution of when the composite likelihood of the above data was maximized using the correct value of 4Nλ (4 × 10−5). The composite likelihood based on sampling probability, assuming the correct θ is known, yields better estimates of α, as expected. Using only the frequency spectrum lowers the accuracy of . But it still gives reasonably unbiased estimates (mean = 1045). When the same analyses were performed without distinguishing ancestral/derived alleles, the distribution of changed slightly (from 1020[mean] ± 495[SD] to 1025 ± 521 using the sampling probability; from 1045 ± 647 to 1083 ± 783 using the frequency spectrum). This suggests that genetic information regarding the intensity of directional selection is contained in the excess of rare alleles rather than in the excess of high-frequency derived alleles. The error in the estimation of α shown in Figure 7 is quite large. Adding more independent loci will reduce this error. For example, by increasing L from 30 to 60 (each chromosome is duplicated in the new data set), improves to 1045 ± 315 using the sampling probability and to 968 ± 392 using the frequency spectrum (distinguishing ancestral/derived alleles; based on 500 replicates).
A skew in the site frequency spectrum, i.e., the excess of rare alleles and high-frequency derived alleles compared to the expectation under neutral equilibrium, is well known to be characteristic of the genetic variation resulting from positive directional selection (Braverman et al. 1995; Simonsen et al. 1995; Fay and Wu 2000). The theoretical basis of this effect has been studied for single recent selective sweeps (Kim and Stephan 2002; Kim and Nielsen 2004), in which case one can make a simplifying assumption that genetic drift after the fixation of a beneficial allele is negligible. Relatively little progress has been made for the theory of frequency spectrum under recurrent selective sweeps. This study appears to represent the first step toward a complete mathematical analysis of the frequency spectrum under recurrent selective sweeps.
There are two sources of the rare alleles in DNA sequences after a selective sweep. First, a selective sweep may generate a star-like genealogy with many short outer branches. Mutations mapping onto those branches will be found at low frequency in the sample. Second, when only one selective sweep has occurred in the recent past, one or two lineages may escape coalescence during the hitchhiking event and generate long inner branches. Alleles descended along such an “escaped” branch will also be found in low frequency in the sample. Both classes of rare alleles contribute to generate negative values of Tajima's D. However, there are qualitative differences between the two classes. Only the second class is associated with an excess of high-frequency derived alleles and high linkage disequilibrium (Fay and Wu 2000; Kim and Nielsen 2004). The results of this study suggest that negative Tajima's D generated under recurrent selective sweeps must be due mainly to the first class of rare alleles, since no excess of high-frequency derived alleles is predicted or observed in the simulations (Table 1). A neutral variant that has increased to a high frequency due to one hitchhiking event is likely to be dragged to fixation during subsequent hitchhiking events. An excess of high-frequency variants thus cannot be maintained under recurrent sweeps. Low-frequency variants, in contrast, are constantly replenished by new mutations and can occur with recurrent sweeps. This result raises concerns about the interpretation of Fay and Wu's H-test as an attempt to detect positive selection from sequence polymorphism data: a failure to observe a negative H should not be regarded as a failure to detect recent directional selection, particularly if the locus under test is believed to have experienced multiple adaptive substitutions (e.g., with excess nonsynonymous relative to synonymous substitutions). It is, however, possible to observe an excess of high- compared to intermediate-frequency derived alleles under high rates of recurrent sweeps (Figure 4). We may need to develop another summary statistic that can conveniently capture this feature of recurrent selective sweeps.
The analytic approximations obtained here can be used to estimate genomic parameters of directional selection from multilocus data. However, it should be stressed that these approximations are based on a simple model of recurrent selective sweeps. In this model, selective sweeps are assumed to occur in a random-mating population of constant size and with no deleterious mutation. In reality, the frequency spectrum may be affected by both spatial and temporal changes of population structure. One possible way of removing these confounding factors and isolating the effect of selective sweeps alone is to examine whether the frequency spectrum is more skewed in genomic regions of lower recombination. Recurrent selective sweeps are expected to produce this kind of strong correlation. For example, Andolfatto and Przeworski (2001) observed more negative Tajima's D in regions of lower crossing over in the D. melanogaster genome and argued that this correlation supports the model of recurrent selective sweeps. We may thus estimate the intensity of selection in the Drosophila genome by finding values that produce a similar profile of Tajima's D over recombination rate. However, background selection, which was argued to be prevalent in D. melanogaster (Hudson and Kaplan 1995; Charlesworth 1996), makes it difficult to apply the current model to their data.
There are two potentially opposing effects of background selection on the frequency spectrum under the recurrent sweep model. First, because deleterious alleles segregate at low frequency, background selection may further skew the frequency spectrum in the direction of negative Tajima's D (Charlesworth et al. 1995; Bachtrog 2004), particularly in regions of low recombination. In this case, the intensity of directional selection is likely to be overestimated by fitting the current model to the correlation between recombination and Tajima's D. However, it is known that background selection can produce substantially negative Tajima's D only when deleterious mutations are weakly selected and the mutation rate is high (Golding 1997; Przeworski et al. 1997; Bachtrog 2004). For example, purifying selection in Drosophila appears to be too strong to generate significantly negative Tajima's D (Andolfatto and Przeworski 2001). The second effect of background selection is to reduce the short-term effective population size, also strongly in regions of low recombination. In this case, the effect of selective sweeps on a genealogy diminishes because, for given λ, the rate of selective sweeps on the coalescent timescale, 4Nλ, becomes smaller. Thus, Tajima's D may become less negative when background selection is added. The per generation substitution rate λ itself will also decrease because the efficacy of directional selection on beneficial mutations decreases by interference from purifying selection at linked loci (Peck 1994; Barton 1995; Kim and Stephan 2000). Therefore, the intensity of directional selection required to explain the observed pattern of the frequency spectrum may increase when background selection is added. In Drosophila, this second effect of background selection is likely to be more important than the first since the reduction of effective population size by background selection is predicted to be substantial (Hudson and Kaplan 1995; Charlesworth 1996).
In real populations, other factors may make it difficult to apply the current method to estimate the intensity of selection. It is well known that demographic history and population structure affect the allele frequency spectrum. Recent population expansion, for example, generates an excess of rare alleles, mimicking the effect of recurrent selective sweeps (Tajima 1989b; Fu 1997). For example, Haddrill et al. (2005) showed that a simple model of population bottleneck is the most parsimonious explanation for the pattern of the frequency spectrum along the genome of D. melanogaster. Although it might still be possible to confirm selective sweeps by observing a positive correlation between recombination rate and Tajima's D, if selective sweeps occur along with demographic changes, the estimate of αλ using the above approximation may have substantial error. Another simplifying assumption of the current model is that a selective sweep starts with one copy of a new beneficial mutation. One recent study suggests, however, that adaptive substitutions starting from standing genetic variation might be as common as those from a new beneficial mutation (Hermisson and Pennings 2005). Selective sweeps starting with multiple copies of a beneficial mutation may produce frequency spectra that are quite different from those expected under the standard model of hitchhiking (Innan and Kim 2004; Hermisson and Pennings 2005).
APPENDIX: CALCULATION OF E[Xa(1 − X)b | z, t]
Consider a neutral allele with initial frequency z reaching frequency X after t generations of random genetic drift. An approximate solution for the nth moment of X, E[Xn] = E[Xn | z, t], can be obtained from the following diffusion approximation first described by Kimura (1955):(A1)His solution is expressed as a sum of infinite series. Unfortunately, this series converges very slowly for small t (<N), the range this study mainly considers. However, from an examination of the exact solutions to the above equation for small n [e.g., , , and ], it can be easily shown that the solution for arbitrary n takes a formwhere is a polynomial function of z. Inserting E[Xn] and E[Xn−1] of the above form into Equation A1, we obtain(A2)Since for t = 0,(A3)E[Xn | z, t] is thus given by obtaining by recursion using Equations A2 and A3 and . Most importantly, this exact solution has finite terms. can be written as . Coefficients are readily obtained from the above recursion.
E[Xa(1 − X)b | z, t] is now obtained fromwhere can be written as . Therefore, . The table of is available upon request.
I thank Wolfgang Stephan, Allen Orr, Andrea Betancourt, Daven Presgraves, Naoyuki Takahata, and two anonymous reviewers for helpful comments on the manuscript. This work was funded by National Science Foundation grant DEB-0449581.
Communicating editor: N. Takahata
- Received July 18, 2005.
- Accepted November 29, 2005.
- Copyright © 2006 by the Genetics Society of America