Abstract
The theory of genetic hitchhiking predicts that the level of genetic variation is greatly reduced at the site of strong directional selection and increases as the recombinational distance from the site of selection increases. This characteristic pattern can be used to detect recent directional selection on the basis of DNA polymorphism data. However, the large variance of nucleotide diversity in samples of moderate size imposes difficulties in detecting such patterns. We investigated the patterns of genetic variation along a recombining chromosome by constructing ancestral recombination graphs that are modified to incorporate the effect of genetic hitchhiking. A statistical method is proposed to test the significance of a local reduction of variation and a skew of the frequency spectrum caused by a hitchhiking event. This method also allows us to estimate the strength and the location of directional selection from DNA sequence data.
THE level of genetic variation at a neutral locus can be influenced by natural selection at linked loci. The substitution of a strongly selected beneficial mutation produces a “hitchhiking” effect on the frequency of neutral alleles at linked loci (Maynard Smith and Haigh 1974; Kaplanet al. 1989; Stephanet al. 1992). Neutral variants are either lost or fixed along with the ancestral or beneficial allele at the selected locus unless recombination breaks down the association between neutral and selected alleles during the substitution process. As a result, genetic variation around the site of directional selection is greatly reduced by this hitchhiking event or “selective sweep.” Selection against recurrent deleterious mutations also reduces variation at linked loci (Charlesworthet al. 1993). This mechanism, known as “background selection,” causes the continuous removal of linked sequences along with deleterious mutations, resulting in a reduced effective population size.
To elucidate the relative contributions of selective sweeps and background selection in shaping the positive correlation between genetic variation and recombination (Begun and Aquadro 1992), recent investigations focused on the features of genetic variation where the two mechanisms make different predictions. A hitchhiking event produces an excess of rare alleles (Bravermanet al. 1995; Fu 1997) and highfrequencyderived alleles (Fay and Wu 2000) in a sample of DNA sequences. It can also greatly reduce differentiation among subdivided populations (Stephanet al. 1998). These mechanisms are more efficient when the level of recombination between neutral and selected loci is reduced. Therefore, efforts to discover such phenomena were undertaken mainly for polymorphism data from regions of low recombination.
Another unique feature of genetic hitchhiking is the expected pattern of genetic variation along a recombining chromosome, i.e., in regions of intermediate to high recombination rates. The reduction of genetic variation is greatest at the site of directional selection, but not as great at distant sites due to recombination. Therefore, it produces a “valley” of expected heterozygosity along the sequence. This pattern was used to demonstrate recent episodes of directional selection in populations (Benassiet al. 1999; Wanget al. 1999; Fullertonet al. 2000; Nurminskyet al. 2001).
Although the expected spatial pattern of variation along a chromosome caused by hitchhiking is straightforward, it is not certain whether it can be detected in a sample of DNA sequences. The size of the area affected by a single hitchhiking event can be very large if selection is strong or recombination rate is low. On the other hand, for relatively weak selection and high recombination rates, the size of the area might be sufficiently small to be detected in a survey of a gene of moderate length. However, the large variance of nucleotide diversity in a DNA sample makes it difficult to distinguish the pattern caused by a weak hitchhiking effect from a similar pattern generated randomly under neutral evolution with recombination. In the presence of recombination, different regions on a sequence have different genealogies whose sizes can differ considerably. Therefore, a local reduction of variation in a certain region of a recombining chromosome can happen by chance without hitchhiking events.
In this article, we investigate the pattern of genetic variation resulting from a single hitchhiking event on a recombining chromosome. A likelihoodbased statistical test is developed to evaluate the significance of a local reduction of variation. It is also examined if the strength and location of directional selection can be estimated from DNA sequence data.
COALESCENT SIMULATION
This study requires a coalescent simulation in which both intragenic recombination and directional selection take place during the ancestry of a DNA sample. The ancestral recombination graph (ARG) described by Griffiths and Marjoram (1997) allows the realization of intragenic recombination for any length of a DNA sequence. We modified the ARG to incorporate the effect of genetic hitchhiking. Figure 1 illustrates an ARG with hitchhiking for a sample of three sequences. The ancestral history of the sample is divided into neutral phases and a selective phase.
During a neutral phase the ARG is constructed as described by Griffiths and Marjoram (1997), with the following modifications. Edges of the ARG represent ancestral sequences at a given time (T) in the past measured in units of 2N generations, where N is the effective number of individuals in a diploid population. If there are k edges at a given time (k = n at T = 0), either coalescent or recombination events can occur with rates k(k − 1)/2 and kR/2, respectively. R is 4N times the recombination rate between both ends of the sequence. Therefore, if the sequence is L nucleotides long, R = 4NLρ, where ρ is the pernucleotide recombination rate. Each of the k edges is labeled by a pair of integers (I_{i}, J_{i}) (i = 1, … , k). This pair of integers delimits the region within which sequences ancestral to sample sequences are found. Therefore, recombination outside this region can be ignored. At T = 0, (I_{i}, J_{i}) = (1, L) for all n edges. At a coalescence event, two randomly chosen edges (for example, the lth and mth edges) join to a new edge, which is then labeled by (Min(I_{l}, I_{m}), Max(J_{l}, J_{m})). At a recombination event, an edge is chosen randomly and a random uniform integer, U, is drawn between 1 and L − 1. If an edge labeled by (I, J) was chosen, it joins to two parental edges only if I ≤ U < J. Then, the two parental edges are labeled as (I, U) and (U + 1, J). If U < I or > J, no change is made at the edge. This procedure is necessary to minimize k in the simulation.
The selective phase is the period when a substitution of a beneficial mutation that causes a hitchhiking effect takes place. The beneficial allele B has a genic selective advantage s over the parent allele b. This substitution occurs at a site M nucleotides away from the left end of the sequence and the fixation of B is completed at T = τ. The allele frequency of B, x, is assumed to change deterministically from 1 − ζ to ζ. Therefore, x at T = τ + t is given by
The coalescent for each nucleotide site (or the “marginal tree”) is embedded in the ARG. The marginal tree is extracted as described in Griffiths and Marjoram (1997). The number of edges, k, at a given time changes stochastically during the construction of the ARG. Theoretically, k eventually hits 1, and the ARG is completed. However, if R »10, k fluctuates at high numbers, and waiting until k = 1 in the simulation is then practically impossible. We therefore stop the construction of the ARG at an arbitrary point, T_{limit}, which is chosen to be a large number relative to the mean time to the most recent common ancestor (MRCA) for a marginal tree. If a marginal tree cannot find the MRCA until T_{limit}, the remaining branches of the tree are forced to coalesce at T_{limit}. This procedure has a negligible effect on our results since most nucleotide sites find their MRCAs before T_{limit}. T_{limit} = 7.0 was used in this study. Source codes written in C for the simulation of ARGs and other procedures described in this article are available upon request.
PATTERNS OF GENETIC VARIATION ALONG A CHROMOSOME WITH HITCHHIKING
The simulated patterns of sequence polymorphism are obtained by introducing mutations into the marginal tree for each nucleotide site. To verify that the simulation procedure generates the correct ancestral genealogy expected under the model of hitchhiking, nucleotide diversities at many fixed sites along the sequence were summarized over 50,000 replicates of the ARG for a set of parameters (Figure 2). The simulation results agreed well with the expectation on the basis of the analytic solutions by Stephan et al. (1992) and Kim and Stephan (2000). Without the selective phase, the mean and variance of coalescent times of marginal trees agreed well with the expectation under the standard neutral model (data not shown). The number of shifts from one MRCA to another along the sequence was also observed and compared to the prediction (Equation 5 of Wiuf and Hein 1999). With R = 20 (2N = 10^{5}, L = 10^{4}, and ρ = 10^{−8}), the observed numbers of shifts for sample sizes 2 and 10, each averaged over 200 replicates, were 13.72 and 19.74, respectively. The corresponding expectations are 13.33 and 19.64, respectively.
We assume that the derived allele can be distinguished from the ancestral allele, which is defined to
be the allele at the root of the marginal tree. If more than one mutant is segregating at one site, all mutant alleles are classified as the derived allele and not distinguished from each other. To examine the pattern of variation, three different estimators [
A local reduction or valley of heterozygosity (
Compared to neutrality (Figure 3a), the relative level of
The stronger the hitchhiking effect, the larger is the region that is expected to be affected. To find the relationship between the mean length of the region of reduced variation and the parameter values of the hitchhiking model, we generated 200 simulated datasets for a fixed combination of N, ρ, s, τ, and φ. A 1kblong window moves from the left end of the 40kblong sequence with an increment of one nucleotide and calculates
Examples 6 and 7 show that almost identical W_{φ/2}'s are obtained with the same values of Nρ and α, but with different ρ and s values. Therefore, for a given τ, Nρ and α are the two principal parameters governing the pattern of variation caused by a hitchhiking event. We compared the mean W_{φ/2} to the theoretical prediction, E[W_{φ/2}] (Table 1), which is based on the expectation of
After a selective sweep, the level of genetic variation is slowly restored due to new neutral mutations. Therefore, with given values of Nρ and α, W_{φ/2} should become smaller with increasing τ, as is indeed observed in examples 11–13, for which τ = 0.001, 0.05, and 0.2, respectively. The level of variation around the site of selection, which is zero immediately after the sweep, should be characterized by τ (Wiehe and Stephan 1993). As the site of selection is expected to be found at the center of the largest segment of reduced variation (defined above), we examined the average nucleotide diversity,
So far the patterns of genetic variation based on the segregation at single sites were examined. Another important aspect of sequence variation is the association of polymorphisms between neighboring loci. We calculated r^{2} (Hill and Robertson 1968) for all pairs of segregating sites in each simulated set and obtained the average,
STATISTICAL TEST OF A LOCAL SIGNATURE CAUSED BY GENETIC HITCHHIKING
In the following, a maximumlikelihood method is developed to examine the significance of a local reduction of genetic variation and to estimate the strength of directional selection. The probability of observing a certain frequency of derived alleles at a site after a recent hitchhiking event can be obtained by previously used analytic approximations. Under neutrality, the expected number of sites where the derived variant is in the frequency interval [p, p + dp] in the population is given by
As it is currently unrealistic to have polymorphic data from a reasonably large sample of long continuous sequences, we apply the test to a region for which only short segments are sequenced, interspaced with larger nucleotide stretches for which no data are available. That is, we consider a survey in which 11 1kblong segments distributed over a 40kb region are sequenced. The distances between segments are uniformly 2.9 kb. The sample size is 10 for all segments. Simulated data used for Table 1 (examples 9 and 11–13) were reused, but only sites from the 11 segments were included. The maximum composite likelihood under the neutral model (L_{0}) and that under the hitchhiking model (L_{1}) were obtained for each simulated dataset. Then, the likelihood ratio is given by L_{1}/L_{0}. L_{0} is a function of φ and L_{1} is a function of N, μ, ρ, s, and the location of the selected locus, X. X is allowed to vary in the middle 10kb region of the sequence (15 kb < X < 25 kb); i.e., we consider a situation where a candidate region for the site of selection has already been inferred. It is difficult practically to allow all these parameters to vary freely until a unique combination that maximizes L_{1} is found. Therefore, we chose only s and X as free variables and assumed that separate estimates of N, μ, and ρ are available. Thus, in one test (test A), the same values of N, μ, and ρ specified in the simulation are used in the calculation of L_{0} and L_{1}. In the other test (test B), to be conservative, we let the mutation rate be inferred from the data by using the average heterozygosity (
The null distribution of likelihood ratios was obtained by applying tests to datasets generated under the neutral model (200 replicates corresponding to example 9 of Table 1). To reduce the problem of local optima, eight different initial guesses of X between 15 and 25 kb, with s = 0.01, were used to start the maximization procedure using Powell's method (Presset al. 1992; program written by D. L. Swofford and kindly provided by J. P. Huelsenbeck). The null distribution of log(L_{1}/L_{0}) for test A/option 1 is shown in Figure 5. All test methods (tests A and B and options 1 and 2) produced almost identical null distributions (data not shown). About 30% of the log(L_{1}/L_{0}) values are negative. This may suggest a failure of obtaining the global maximum of L_{1} for some null datasets, since L_{1}, with two more free variables, should be larger than L_{0}. However, increasing the number of initial guesses did not improve the likelihood ratios (data not shown). The negative values are more likely due to the restricted range of s and the assumption of τ = 0. As α should be large enough to generate a hitchhiking effect (also ε = 1/α should remain small), s was not allowed to be <6 × 10^{−5} (α = 60). Only when s is very small, the hitchhiking model with τ = 0 can fit neutral data in which a local reduction of variation is not found. Therefore, there was a limit in maximizing L_{1}.
Table 2 summarizes the power of the test and the point estimates of s and X for each hitchhiking model. Power is the proportion of replicates that produce log(L_{1}/L_{0}) values greater than the 95th percentile of the corresponding null distribution. Test A yielded very high power of rejecting neutral evolution, even for larger values of τ. The main reason for obtaining large likelihood ratios from test A is that it uses the “true” standing level of variation (φ) for the calculation of L_{1} and L_{0}. As the average heterozygosity has been reduced below φ due to selective sweep, the neutral model based on the true value of φ cannot fit the data. The negligible differences in the power between options 1 and 2 (except at τ = 0.2) indicate that the additional information obtained by distinguishing between ancestral and derived alleles contributed little in test A, whereas a significant reduction of heterozygosity played a major role in increasing the likelihood ratio.
On the other hand, a reduction of heterozygosity is not a major factor for increasing the likelihood ratio in test B. To obtain a higher likelihood ratio in this test for a given number of segregating sites, the spatial distribution of those sites along the sequence and the allele frequency spectrum should be close to the expectation under the hitchhiking model. As expected, the power of test B is smaller than that of test A (Table 2). However, it is still high (84–97%) for small values of τ (0.001 and 0.05). Power declines as τ increases, since the spatial pattern and the frequency spectrum of segregating sites approach those under neutrality as time passes after the selective sweep. Tests using option 1 yield higher power than those using option 2 for τ = 0.001 and 0.05, which means that the skew toward highfrequencyderived alleles at segregating sites is observed as described by P_{n,k} (Figure 4). For τ = 0.2, however, both tests A and B had higher power with option 2. This is obvious from the fact that, at τ = 0.2, the proportion of highfrequencyderived alleles is lowered below its level under neutrality (Figure 3c). Therefore, distinguishing the derived allele from the ancestral allele in these tests has an advantage for detecting very recent hitchhiking events only. However, it should be noted that our analytic prediction of the frequency spectrum is based on the assumption of τ = 0. Complete solutions for P_{n,k} for any value of τ may make option 1 still useful for detecting more distant hitchhiking events.
Maximum (composite)likelihood estimates of s and X were also obtained. Test A with option 1 produced the most unbiased estimates of s, although the accuracy is quite low for all combinations of the test methods (Table 2). Joint estimates of s and X using test A with option 1 for datasets generated under neutrality and under hitchhiking with s = τ = 0.001 are shown in Figure 6. From the neutral data, joint estimates were clustered in the parameter space of small s (close to the lower limit) and X between the “sequenced” segments. This can be expected since the hypothesized valley due to hitchhiking should be sufficiently narrow to fit between the sequenced segments where the level of polymorphism is high. On the other hand, the joint estimates from the hitchhiking datasets were centered around the true value (s = 0.001, X = 20 kb). It is also shown that estimates of X tend to cluster on the sequenced segments in this case.
To further investigate the performance of the composite likelihoodratio test, we produced additional but shorter (10kb) sequences by simulation. Unlike in the previous analysis, polymorphism data are assumed to be obtained from the entire continuous region and X is also allowed to vary over the entire region. Only option 1 is used. To make results comparable to the previous cases, R was adjusted to be 800 by setting ρ = 4 × 10^{−8}. In the simulation of selective sweeps, selection occurs at position 5 kb with τ = 0.001, but with various s values (Table 3). With s = 0.001 (α = 1000), the powers of tests A and B were 0.97 and 0.915, respectively. These can be compared with 0.995 and 0.97 from the previous analysis (Table 2, s = 0.001 and τ = 0.001). Considering a slight reduction of the surveyed region (11–10 kb) where informative segregating sites were observed, the power of the tests with this new scheme appears to remain as high as in the previous one using discontinuous regions. With decreasing s, the power of detecting the hitchhiking event and the accuracy of the parameter estimates decrease, as expected (Table 3). However, power declined also when s increased from 0.002 to 0.005. Examination of simulated data showed that, with s = 0.005, the number of segregating sites is highly reduced and those sites are frequently found clustered on one side of the site of selection. This produced very low likelihood ratios in both tests A and B. This effect should disappear if a wider region of the chromosome is surveyed.
We also conducted a few tests to assess the effect of uncertainty in prior estimates of N and ρ. Tests A and B were performed for a dataset described above (Table 3, s = 0.001) but with a prior estimate of N = 10^{5}, fivefold lower than the true value. New null distributions of the likelihood ratio were obtained accordingly. The power of tests A and B decreased to 0.935 and 0.865, respectively, from 0.97 and 0.915 (Table 3) due to the incorrect assumption of N. Average ŝ decreased slightly (8.5 × 10^{−4} from 1.15 × 10^{−3} for test A and 5.1 × 10^{−4} from 7.1 × 10^{−4} for test B). Next, we used the correct value of N but a fivefold lower prior estimate of ρ (8 × 10^{−9}). Average ŝ decreased about fivefold (1.78 × 10^{−4} and 1.19 × 10^{−4} for tests A and B, respectively) as expected. Power decreased to 0.855 and 0.78 for tests A and B, respectively.
DISCUSSION
In this study, coalescent simulations using the ancestral recombination graph (Hudson 1983; Griffiths and Marjoram 1997) were used to investigate patterns of nucleotide sequence polymorphism under the models of neutrality and genetic hitchhiking. The modification of the ARG to incorporate the effect of hitchhiking is analogous to the twolocus coalescent model of Braverman et al. (1995). In fact, our ARG with hitchhiking reduces to their model if the recombination break point is fixed rather than uniformly distributed between a neutral locus and the site under selection. Many simulation and theoretical studies based on this twolocus model generated valuable knowledge about genetic hitchhiking (Kaplanet al. 1989; Bravermanet al. 1995; Barton 1998; Fay and Wu 2000). However, the study of variation at a single neutral site under hitchhiking cannot provide information about spatial patterns of variation, which is shaped by genealogical correlation between many consecutive sites. The ARG allows such a study by generating coalescent trees for all sites simultaneously.
Local reduction of genetic variation without the corresponding reduction of interspecific divergence was used as evidence of past directional selection in maize (Wanget al. 1999), Drosophila (Benassiet al. 1999; Nurminskyet al. 2001), and human (Nachman and Crowell 2000; Fullertonet al. 2000). However, these studies did not address the possibility of observing such patterns due to stochastic change of genetic variation along recombining chromosomes under neutrality. Table 1 shows that, in a chromosome with Nρ = 10^{−3}, segments of reduced variation that are >1 kb can be frequently found under neutrality. The average length of segments sharing the same ancestral history becomes longer as Nρ becomes smaller (Wiuf and Hein 1999). For species with relatively small effective population sizes, such as human, Nρ can be «10^{−3}. In such a case, a very cautious interpretation of the data is warranted when a sudden drop of heterozygosity over a few kilobases is observed in these species.
To address this problem, we developed a composite likelihoodratio test to detect the local signature of genetic hitchhiking along a recombining chromosome, where the null distribution of variation is obtained by neutral coalescent simulations with recombination. The composite likelihood under the hitchhiking model is based on the probability of observing a certain ratio of segregating variants, P_{n,k}, for each site. P_{n,k} is a function of N, μ, ρ, s, and X. In test A, we assumed that the actual values of N, μ, and ρ are already known. The recombination rate per nucleotide can be determined independently for some species for which both physical and genetic maps are available. The effective population size and the mutation rate might be obtained from polymorphism and divergence data from adjacent chromosomal regions, if the standing levels of diversity and divergence are uniform in those regions and hitchhiking events do not occur frequently; i.e., the standing level is determined mainly by neutrality or background selection (Kim and Stephan 2000). Therefore, test A can be used only for species such as Drosophila and human where a considerable amount of population genetic information has been obtained. In the sense that it requires information from the other loci, test A is not much different from the HudsonKreitmanAguadé (HKA) test (Hudsonet al. 1987). However, unlike the HKA test, test A uses information contained in the frequency spectrum at segregating sites and allows the parameters of the hitchhiking model to be estimated.
Test B relaxes the assumption that φ, the level of variation in the region immediately before the hitchhiking event happened, is known. In the most conservative treatment, φ is given as the average nucleotide diversity observed in the data to be tested. Therefore, test B is the method of choice when information on other loci is not available. However, due to the incorrectness of φ, the estimation of s is poorer in test B than in test A.
A similar approach to detect the signature of hitchhiking has recently been proposed by Galtier et al. (2000). Their method detects diversityreducing events such as hitchhiking or bottleneck and estimates the time and the strength of those events from DNA sequence polymorphism. However, they assumed no recombination within the region to be tested and needed several independent regions to distinguish selective sweeps from bottlenecks. Our approach is different in that it detects a characteristic spatial pattern shaped by recombination around the site of directional selection. It is similar, however, in that it also detects skews in the frequency spectrum of segregating sites, as Galtier et al. (2000) detect “distortions” in the shape of gene genealogies. Unfortunately, a relationship between the force that distorts the genealogy and the strength of directional selection was not given in their study.
Knowledge about the strength and the rate of directional selection in natural populations is fundamental in evolutionary biology. Previously, Wiehe and Stephan (1993) and Stephan (1995) obtained a rough estimate of αν, where ν is the rate of strongly selected substitutions per nucleotide, using the positive correlation of variation and recombination in Drosophila melanogaster. Separate estimation of α and ν might be achieved by surveying large areas of a genome for signatures of hitchhiking events, using the method proposed in this article. According to the assumption that standing variation in the region tested is not influenced by other hitchhiking events, this method is expected to be most useful in regions of high recombination rates (Kim and Stephan 2000). It should be noted, however, that the inference of s and X may be accompanied by a considerable amount of error (Figure 6). Figure 3 and Table 1 show that, for a given set of parameter values, the size of the valley of reduced nucleotide diversity varies significantly and the center of the valley may drift away from the site of selection. This stochasticity is especially serious for populations with small effective sizes (Figure 3d). This might explain the observation by Wang et al. (1999) that within a 1.1kb region of highly reduced variation no fixed differences were found between maize and teosinte.
Acknowledgments
We thank two reviewers for valuable suggestions. This work was supported by funds from the University of Munich and the Deutsche Forschungsgemeinschaft (STE 325/41).
APPENDIX
Consider a neutral locus where a mutant, A, is segregating. The substitution of a beneficial allele, B, for the wildtype allele, b, is assumed to occur at a linked locus at recombination fraction r away from the neutral locus. p_{1} is defined as the frequency of A among chromosomes carrying the B allele when the frequency of B increased to the value of near fixation, 1 − ε. Then,
Footnotes

Communicating editor: Y.X. Fu
 Received May 10, 2001.
 Accepted November 19, 2001.
 Copyright © 2002 by the Genetics Society of America