Detecting a Local Signature of Genetic Hitchhiking Along a Recombining Chromosome
Yuseob Kim (a, b) and Wolfgang Stephan (a)
a Department of Evolutionary Biology, University of Munich, 80333 Munich, Germany
b Department of Biology, University of Rochester, Rochester, New York 14627
Corresponding author: Wolfgang Stephan, University of Munich, Luisenstr. 14, 80333 Munich, Germany. E-mail: stephan{at}zi.biologie.uni-muenchen.de
Communicating editor: Y.-X. FU
| ABSTRACT |
|---|
The theory of genetic hitchhiking predicts that the level of genetic variation is greatly reduced at the site of strong directional selection and increases as the recombinational distance from the site of selection increases. This characteristic pattern can be used to detect recent directional selection on the basis of DNA polymorphism data. However, the large variance of nucleotide diversity in samples of moderate size imposes difficulties in detecting such patterns. We investigated the patterns of genetic variation along a recombining chromosome by constructing ancestral recombination graphs that are modified to incorporate the effect of genetic hitchhiking. A statistical method is proposed to test the significance of a local reduction of variation and a skew of the frequency spectrum caused by a hitchhiking event. This method also allows us to estimate the strength and the location of directional selection from DNA sequence data.
THE level of genetic variation at a neutral locus can be influenced by natural selection at linked loci. The substitution of a strongly selected beneficial mutation produces a "hitchhiking" effect on the frequency of neutral alleles at linked loci (![]()
To elucidate the relative contributions of selective sweeps and background selection in shaping the positive correlation between genetic variation and recombination (![]()
Another unique feature of genetic hitchhiking is the expected pattern of genetic variation along a recombining chromosome, i.e., in regions of intermediate to high recombination rates. The reduction of genetic variation is greatest at the site of directional selection, but not as great at distant sites due to recombination. Therefore, it produces a "valley" of expected heterozygosity along the sequence. This pattern was used to demonstrate recent episodes of directional selection in populations (![]()
Although the expected spatial pattern of variation along a chromosome caused by hitchhiking is straightforward, it is not certain whether it can be detected in a sample of DNA sequences. The size of the area affected by a single hitchhiking event can be very large if selection is strong or recombination rate is low. On the other hand, for relatively weak selection and high recombination rates, the size of the area might be sufficiently small to be detected in a survey of a gene of moderate length. However, the large variance of nucleotide diversity in a DNA sample makes it difficult to distinguish the pattern caused by a weak hitchhiking effect from a similar pattern generated randomly under neutral evolution with recombination. In the presence of recombination, different regions on a sequence have different genealogies whose sizes can differ considerably. Therefore, a local reduction of variation in a certain region of a recombining chromosome can happen by chance without hitchhiking events.
In this article, we investigate the pattern of genetic variation resulting from a single hitchhiking event on a recombining chromosome. A likelihood-based statistical test is developed to evaluate the significance of a local reduction of variation. We also examine whether the strength and location of directional selection can be estimated from DNA sequence data.
| COALESCENT SIMULATION |
|---|
This study requires a coalescent simulation in which both intragenic recombination and directional selection take place during the ancestry of a DNA sample. The ancestral recombination graph (ARG) described by ![]()
During a neutral phase the ARG is constructed as described by ![]()
, where ρ is the per-nucleotide recombination rate. Each of the k edges is labeled by a pair of integers (Ii, Ji) (i = 1, ..., k). This pair of integers delimits the region within which sequences ancestral to the sample sequences are found, so recombination outside this region can be ignored. At T = 0, (Ii, Ji) = (1, L) for all n edges. At a coalescence event, two randomly chosen edges (for example, the lth and mth edges) join to form a new edge, which is then labeled (Min(Il, Im), Max(Jl, Jm)). At a recombination event, an edge is chosen randomly and a uniform random integer, U, is drawn between 1 and L - 1. If an edge labeled (I, J) was chosen, it joins to two parental edges only if I ≤ U < J; the two parental edges are then labeled (I, U) and (U + 1, J). If U < I or U ≥ J, no change is made at the edge. This procedure is necessary to minimize k in the simulation.
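The labeling rules of the neutral phase can be sketched in a few lines of Python. This is only an illustration of the bookkeeping described above, not the authors' simulation code; the function names `coalesce`, `recombine`, and `draw_breakpoint` are hypothetical.

```python
import random


def coalesce(label_l, label_m):
    """Label of the new edge formed when two edges (I, J) coalesce."""
    (i_l, j_l), (i_m, j_m) = label_l, label_m
    return (min(i_l, i_m), max(j_l, j_m))


def recombine(label, u):
    """Split an edge labeled (I, J) at breakpoint u.

    Returns the two parental labels (I, u) and (u + 1, J), or None when the
    breakpoint lies outside the ancestral region and the edge is unchanged.
    """
    i, j = label
    if i <= u < j:
        return (i, u), (u + 1, j)
    return None


def draw_breakpoint(seq_len, rng=random):
    """Uniform random integer U between 1 and L - 1, inclusive."""
    return rng.randint(1, seq_len - 1)
```

For example, `recombine((3, 8), 5)` yields the parental labels `(3, 5)` and `(6, 8)`, while `recombine((3, 8), 8)` leaves the edge unchanged because no ancestral material lies to the right of the breakpoint.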
The selective phase is the period when a substitution of a beneficial mutation that causes a hitchhiking effect takes place. The beneficial allele B has a genic selective advantage s over the parent allele b. This substitution occurs at a site M nucleotides away from the left end of the sequence, and the fixation of B is completed at T = τ. The allele frequency of B, x, is assumed to change deterministically from 1 - ε to ε. Therefore, x at T = τ + t is given by
![]() |
(1) |
(![]()
α = 2Ns and ts, which is the length of the selective phase. The choice of ε does not change the resulting genealogy significantly (![]()
for the simulations. During the selective phase, B and b edges exist, indicating whether an ancestral sequence includes the beneficial allele or not. Therefore, all edges are B edges at the beginning of the selective phase (T = τ). The system of labeling edges is also changed: (I, J) at the end of the neutral phase at T = τ changes to (Min(I, M), Max(J, M)). This change means that recombination between the site of directional selection and the ancestral sequence should be followed during the selective phase. There are four possible events during the selective phase: (1) coalescence between B edges; (2) coalescence between b edges; (3) recombination in a B edge; (4) recombination in a b edge. The probabilities of these four events during the time interval [t, t + δt] are given by
![]() |
(2a) |
![]() |
(2b) |
![]() |
(2c) |
![]() |
(2d) |
respectively, where kB and kb are the numbers of B and b edges at
, respectively. The waiting time,
t, between events is randomly drawn from an exponential distribution with parameter
. Then, one of the four events is allowed to occur according to its probability. This method should be used only when waiting time,
t, is short (<<1/
) such that the change of x(t) between events is negligible. In this study, due to large values of R, values of
t are sufficiently small. With lower values of R, a rejection method such as the one by ![]()
U, the former parental edge must become a B edge, since the beneficial allele has descended from the ancestral sequence in this edge. The other parental edge, however, becomes either a B edge with probability x(t) or a b edge with probability 1 - x(t). Likewise, if M > U, the parental edge with (U + 1, Max(J, M)) becomes a B edge, with the other parental edge becoming either B or b. The same principle is applied to a recombination event in a b edge. The selective phase, which ends when x(t)
or the combined number of B and b edges becomes 1, is followed by another neutral phase where the distinction between B and b edges is erased.
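The mechanics of the selective phase can be illustrated with a short Python sketch. Equations 1 and 2a-2d are not reproduced in this text, so the forms below are assumptions rather than the authors' expressions: the trajectory is the textbook logistic solution for genic selection with time in units of 2N generations, and the event rates are the usual structured-coalescent rates. The parameter `rec_rate_per_edge` is a hypothetical scaled recombination rate per edge that ignores the restriction to ancestral regions. Only the rule for assigning allelic classes after recombination is taken directly from the description above.

```python
import math
import random


def sweep_frequency(t, alpha, eps):
    """Frequency x of allele B at backward time t after the sweep ended.

    Assumed logistic form: x falls from 1 - eps at t = 0 toward eps,
    with alpha = 2Ns and t in units of 2N generations.
    """
    odds = (1.0 - eps) / eps * math.exp(-alpha * t)
    return odds / (1.0 + odds)


def sweep_duration(alpha, eps):
    """Backward time needed for x to fall from 1 - eps to eps."""
    return 2.0 / alpha * math.log((1.0 - eps) / eps)


def event_rates(k_B, k_b, x, rec_rate_per_edge):
    """Rates of the four event types at frequency x (assumed standard forms)."""
    return {
        "coal_B": k_B * (k_B - 1) / (2.0 * x),
        "coal_b": k_b * (k_b - 1) / (2.0 * (1.0 - x)),
        "rec_B": k_B * rec_rate_per_edge,
        "rec_b": k_b * rec_rate_per_edge,
    }


def assign_parent_alleles(child_allele, x, rng=random):
    """Allelic classes of the two parental edges after a recombination.

    The parental edge that carries the selected site keeps the class of the
    recombining edge; the other parental edge is B with probability x and
    b with probability 1 - x.
    """
    other = "B" if rng.random() < x else "b"
    return child_allele, other
```

Multiplying each rate by a short interval δt gives the probability of that event in the interval, which is valid only while x changes negligibly over δt, as noted above.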
The coalescent for each nucleotide site (or the "marginal tree") is embedded in the ARG. The marginal tree is extracted as described in ![]()
| PATTERNS OF GENETIC VARIATION ALONG A CHROMOSOME WITH HITCHHIKING |
|---|
The simulated patterns of sequence polymorphism are obtained by introducing mutations into the marginal tree for each nucleotide site. To verify that the simulation procedure generates the correct ancestral genealogy expected under the model of hitchhiking, nucleotide diversities at many fixed sites along the sequence were summarized over 50,000 replicates of the ARG for a set of parameters (Fig 2). The simulation results agreed well with the expectation on the basis of the analytic solutions by ![]()
, the observed numbers of shifts for sample sizes 2 and 10, each averaged over 200 replicates, were 13.72 and 19.74, respectively. The corresponding expectations are 13.33 and 19.64, respectively.
We assume that the derived allele can be distinguished from the ancestral allele, which is defined to be the allele at the root of the marginal tree. If more than one mutant is segregating at one site, all mutant alleles are classified as the derived allele and are not distinguished from each other. To examine the pattern of variation, three different estimators [π (![]()), θW (![]()), and θH (![]())] were calculated for the simulated sequences. Differences among the three estimators reveal deviations from neutrality (![]()
= 100 or 1000) and
τ = 0.001 or 0.2. As only four examples, randomly chosen from the simulations, are shown for each model, general conclusions cannot be drawn from these figures alone. However, some features of hitchhiking effects on sequence variation could be consistently identified from these examples, and we use them mainly to illustrate these features.
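The three estimators compared here can be computed from the number of derived alleles at each segregating site. The sketch below assumes the standard definitions (pairwise diversity, Watterson's estimator based on the number of segregating sites, and the estimator weighted toward high-frequency derived variants); the citations for the exact forms used are not recoverable from this text, and the function name `diversity_estimators` is hypothetical. Values are totals for the region and would be divided by its length to give per-nucleotide estimates.

```python
def diversity_estimators(derived_counts, n):
    """Return (pi, theta_W, theta_H) for a sample of n sequences.

    derived_counts holds, for each site, the number of sequences carrying
    the derived allele; sites with 0 or n derived copies are ignored.
    """
    counts = [k for k in derived_counts if 0 < k < n]
    pairs = n * (n - 1) / 2.0
    a_n = sum(1.0 / i for i in range(1, n))
    pi = sum(k * (n - k) for k in counts) / pairs
    theta_w = len(counts) / a_n
    theta_h = sum(k * k for k in counts) / pairs
    return pi, theta_w, theta_h
```

A site with a high-frequency derived variant contributes more to θH than to π (for n = 4 and k = 3, 1.5 versus 0.5), which is why an excess of such variants raises θH relative to π.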
A local reduction or valley of heterozygosity (π) along the sequence is the most important pattern expected under the model of genetic hitchhiking. Under neutral evolution (Fig 3A), the stochastic change of π along the sequence occasionally generates deep valleys of variation (for example, regions indicated by *). However, in this case valleys are usually narrow compared to those under hitchhiking (Fig 3, B–D). The stochastic spatial pattern of variation is influenced by R. Using the same parameter values as in Fig 3A but smaller N, deeper and wider valleys were frequently observed (data not shown). With hitchhiking (Fig 3, B–D), a deep valley always appears at or around the site of directional selection. However, the "shape" of the valleys varies considerably among realizations for a given value of s. Valleys are rather asymmetrical around the site of the beneficial mutation, which implies that the shape of the valley may provide imprecise information about the location of the target of selection (see below). The asymmetry gets larger as N decreases. Fig 3D uses the same values of s,
, and
as Fig 3B but a 10 times smaller N. As a result, the stochastic noise in the spatial pattern has been dramatically increased.
Compared to neutrality (Fig 3A), the relative level of θH versus π increased immediately after the hitchhiking event (Fig 3B), as expected by ![]()
τ = 0.2 (Fig 3C), i.e., 0.4N generations after the hitchhiking event, a higher relative level of θH as shown in Fig 3B is not observed. In particular, θH is distinctly lower than π around the site of selection, where the level of nucleotide diversity has only partially recovered since the selective sweep (regions labeled by ). This is consistent with the observation that the excess of high-frequency variants appears suddenly after a hitchhiking event but soon disappears through the fixation of these alleles, reversing the excess of high-frequency mutants (![]()
θW is expected to become larger than π due to hitchhiking (![]()
θW is not as obvious as in the case of θH. However, we could identify regions where θW became distinctly larger than π in Fig 3B (labeled by +). The generality of these observations drawn from the examples of Fig 3 is further investigated by a statistical test applied to larger simulated datasets (see below).
The stronger the hitchhiking effect, the larger is the region that is expected to be affected. To find the relationship between the mean length of the region of reduced variation and the parameter values of the hitchhiking model, we generated 200 simulated datasets for a fixed combination of N, θ, s, ρ, and τ. A 1-kb-long window moves from the left end of the 40-kb-long sequence with an increment of one nucleotide and calculates π at each position. Regions of reduced variation are defined by the centers of windows for which π < θ/2. Therefore, a segment of the affected area is delimited by the centers of the two windows that mark the beginning and the end of the stretch of nucleotides with π < θ/2. The length of the longest of such segments found on the 40-kb sequence is defined as Wθ/2. The mean and standard deviation of Wθ/2 over 200 replicates are shown in Table 1. The proportion, pwithin, of the simulated datasets in which the site of the beneficial mutation is included in the largest segment that defines Wθ/2 is also recorded (Table 1). Examples 1–3, 7, and 8 show that an increase of the mutation rate per nucleotide, given by θ, does not lead to a proportional decrease in Wθ/2. Therefore, the mutation rates used in Table 1 are high enough to "saturate" and reveal the underlying stochastic patterns of coalescent times along the sequence. As expected, Wθ/2 with hitchhiking is significantly larger than that without. The mean of Wθ/2 is roughly proportional to s/ρ, as expected from the solutions of the hitchhiking effect (![]()
![]()
Wθ/2; Wθ/2 decreases with increasing N (examples 6 and 12). However, this effect is not as large as that determined by the parameter s/ρ, in particular for large values of N (Equation 19 in ![]()
![]()
) generations, is longer in large populations, where ε (≈ 1/(2Ns)) is the frequency of the beneficial mutation when it starts increasing deterministically. Looking backward in time, the rate of coalescence for two gene lineages during the selective phase gets sufficiently high only when the product of population size and beneficial allele frequency becomes low. Therefore, the waiting time (in generations) until the coalescence event is longer in large populations. However, the recombination rate per generation is independent of the population size. Therefore, the probability of the recombination event being the first event is higher in a larger population. This explains the smaller effect of a single hitchhiking event in a larger population as shown in Table 1.
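The sliding-window definition of Wθ/2 can be sketched as follows. The code assumes that per-nucleotide heterozygosity is already available as a list (zero at monomorphic sites), uses 0-based positions, and counts the segment length as the number of consecutive window centers below the threshold; these conventions and the name `longest_reduced_segment` are illustrative choices, not details taken from the paper.

```python
def longest_reduced_segment(site_pi, theta, window=1000):
    """Longest run of window centers whose mean diversity is below theta / 2.

    site_pi : per-nucleotide heterozygosity along the sequence
    theta   : expected per-nucleotide diversity under neutrality
    Returns (length, first_center, last_center), or (0, None, None) if no
    window falls below the threshold.
    """
    n_sites = len(site_pi)
    if n_sites < window:
        return 0, None, None
    threshold = theta / 2.0 * window  # compare window sums, not means
    half = window // 2
    running = sum(site_pi[:window])
    best = (0, None, None)
    run_start = None
    for left in range(n_sites - window + 1):
        if left > 0:
            # slide the window one nucleotide to the right
            running += site_pi[left + window - 1] - site_pi[left - 1]
        center = left + half
        if running < threshold:
            if run_start is None:
                run_start = center
            length = center - run_start + 1
            if length > best[0]:
                best = (length, run_start, center)
        else:
            run_start = None
    return best
```

With the returned endpoints, the contribution of one dataset to pwithin is simply whether the selected site M satisfies `first_center <= M <= last_center`.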
Examples 6 and 7 show that almost identical Wθ/2's are obtained with the same values of Nρ and α, but with different ρ and s values. Therefore, for a given τ, Nρ and α are the two principal parameters governing the pattern of variation caused by a hitchhiking event. We compared the mean Wθ/2 to the theoretical prediction, E[Wθ/2] (Table 1), which is based on the expectation of π along the sequence from Equation 13 of ![]()
Wθ/2 is consistently smaller than E[Wθ/2]. This discrepancy occurs partly because the calculation of E[Wθ/2] assumes that the duration of the selective phase is negligible on the timescale of 2N generations. However, the lengths of the selective phase, ts, are 0.076 and 0.017 for examples 6 and 12, respectively, which are not much smaller than τ = 0.05. In the early part of the selective phase (when the frequency of the beneficial mutation is high), the behavior of the genealogy is similar to that in the neutral phase. Therefore, the length of the first neutral phase, i.e., the time since the last hitchhiking event, is effectively longer than τ. It should be noted that the standard deviation of Wθ/2 is considerable, as suggested by Fig 3. Furthermore, in >20 of 200 realizations, the site of the beneficial mutation is not included in the largest segment of reduced variation (see pwithin in Table 1) even with strong hitchhiking (examples 11 and 12). These results again indicate a large amount of stochasticity in the pattern of variation shaped by hitchhiking effects.
After a selective sweep, the level of genetic variation is slowly restored due to new neutral mutations. Therefore, with given values of Nρ and α, Wθ/2 should become smaller with increasing τ, as is indeed observed in examples 11–13, for which τ = 0.001, 0.05, and 0.2, respectively. The level of variation around the site of selection, which is zero immediately after the sweep, should be characterized by τ (![]()
π*, for the middle one-third of this segment. The average values (±standard deviation) of π* for examples 11–13 are 3.88 × 10⁻⁴ (±4.09 × 10⁻⁴), 5.48 × 10⁻⁴ (±3.85 × 10⁻⁴), and 9.85 × 10⁻⁴ (±3.86 × 10⁻⁴), respectively. The corresponding theoretical values obtained by numerical integration of Equation 13 of ![]()
π* suggest that a correct estimation of τ from polymorphism data will be very difficult.
So far the patterns of genetic variation based on the segregation at single sites were examined. Another important aspect of sequence variation is the association of polymorphisms between neighboring loci. We calculated r2 (![]()
, over the entire 40-kb region. Hitchhiking caused an increase in
(Table 1). It might be possible that this increase in
was caused by the excess of rare alleles (i.e., singletons) generated by the hitchhiking effect, because r2 frequently becomes large by chance when the allele frequencies at both loci are extreme. Unfortunately, we could not exclude singletons from the analysis since not many segregating sites are left if singletons are removed from the data generated under the hitchhiking model. However, a visual inspection of the raw hitchhiking data reveals that there are several extensive haplotype structures that are not likely to be created by chance alone. Large stretches of polymorphic sites share an identical pattern of segregation; i.e., there are only two haplotypes observed in such a stretch. We recorded the maximum number, Smax, of such consecutive sites found in each dataset. Table 1 shows that Smax increases with hitchhiking. With strong selection, the increase of Smax is very large (for instance, compare examples 5 and 11). The increase of
and Smax by hitchhiking can be explained by a coalescent argument. Ancestral histories of two neutral loci become identical if no recombination event occurs before the MRCA for both loci is found (![]()
τ increases (examples 11–13), because recombination events during that neutral phase break up associations created in the selective phase.
| STATISTICAL TEST OF A LOCAL SIGNATURE CAUSED BY GENETIC HITCHHIKING |
|---|
In the following, a maximum-likelihood method is developed to examine the significance of a local reduction of genetic variation and to estimate the strength of directional selection. The probability of observing a certain frequency of derived alleles at a site after a recent hitchhiking event can be obtained by previously used analytic approximations. Under neutrality, the expected number of sites where the derived variant is in the frequency interval [p, p + dp] in the population is given by
![]() |
(3) |
(![]()
![]() |
(4) |
(![]()
r/s (Appendix). Here, r is the recombination fraction between the neutral locus and the selected locus and ε is the frequency of the beneficial allele when it begins to increase deterministically. It should be noted that Equation 4 is obtained by assuming deterministic changes of allele frequencies during the selective phase. The probability of observing a site where k derived alleles are found in a sample of size n is given by
![]() |
(5) |
and

where
under the neutral model, and
under the hitchhiking model. Pn,k was found to be sensitive to the choice of
. We used
, which gave the best fit to the simulation results (Fig 4). The likelihood of all data under the model of genetic hitchhiking is obtained by multiplying the probabilities for all nucleotide sites under consideration. This is a composite likelihood because there is a correlation of Pn,k between sites due to shared ancestral histories. Therefore, it should be distinguished from the conventional likelihood-ratio test that is based on exact likelihoods. A statistical test in this analysis thus depends on an empirical distribution of the test statistic obtained by simulation. Composite likelihood is frequently used when the derivation of exact likelihoods is difficult (e.g., ![]()
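Equations 3-5 are not reproduced in this text, so the following Python sketch rests on stated assumptions rather than on the authors' expressions. It takes the standard neutral density θ/p for the frequency p of a derived variant and binomial sampling of n sequences, and checks numerically that the expected number of sites with k derived alleles is then θ/k. It also shows how per-site probabilities combine into a composite log-likelihood. The function names, and the dictionary `site_probability` that maps k to a probability, are hypothetical.

```python
import math


def neutral_site_probability(n, k, theta, steps=20000):
    """Midpoint-rule integral of C(n,k) p^k (1-p)^(n-k) * (theta / p) on (0, 1).

    Requires 1 <= k <= n - 1. Under the assumed neutral density theta / p
    the exact value is theta / k.
    """
    coeff = math.comb(n, k)
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * h
        total += coeff * p ** (k - 1) * (1.0 - p) ** (n - k)
    return theta * total * h


def composite_log_likelihood(derived_counts, site_probability):
    """Sum of per-site log-probabilities, ignoring correlation between sites."""
    return sum(math.log(site_probability[k]) for k in derived_counts)
```

Because linked sites share ancestral histories, this sum is not a true log-likelihood, which is why the significance of the ratio has to be judged against a simulated null distribution.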
As it is currently unrealistic to have polymorphism data from a reasonably large sample of long continuous sequences, we apply the test to a region for which only short segments are sequenced, interspersed with longer nucleotide stretches for which no data are available. That is, we consider a survey in which eleven 1-kb-long segments distributed over a 40-kb region are sequenced. The distances between segments are uniformly 2.9 kb. The sample size is 10 for all segments. Simulated data used for Table 1 (examples 9 and 11–13) were reused, but only sites from the 11 segments were included. The maximum composite likelihood under the neutral model (L0) and that under the hitchhiking model (L1) were obtained for each simulated dataset. Then, the likelihood ratio is given by L1/L0. L0 is a function of θ and L1 is a function of N, µ, ρ, s, and the location of the selected locus, X. X is allowed to vary in the middle 10-kb region of the sequence (15 kb < X < 25 kb); i.e., we consider a situation where a candidate region for the site of selection has already been inferred. It is practically difficult to allow all these parameters to vary freely until a unique combination that maximizes L1 is found. Therefore, we chose only s and X as free variables and assumed that separate estimates of N, µ, and ρ are available. Thus, in one test (test A), the same values of N, µ, and ρ specified in the simulation are used in the calculation of L0 and L1. In the other test (test B), to be conservative, we let the mutation rate be inferred from the data by using the average heterozygosity (π) over all 11 segments of the sequence as the fixed prior estimate of θ. Therefore, the standing level of variation is simply the level observed in the data. But we still used the true value of N for the calculation of
We also assumed either that the derived neutral allele is distinguished from the ancestral allele at each site (option 1) or that they are not distinguished (option 2). For the latter, there are only five ratios of segregating variants in the sample of 10 sequences. Let Qk,n (k = 1, ... , n/2) be the probability of observing a [k(n - k) segregation ratio. Then the likelihood ratios are calculated simply by using
.
The null distribution of likelihood ratios was obtained by applying tests to datasets generated under the neutral model (200 replicates corresponding to example 9 of Table 1). To reduce the problem of local optima, eight different initial guesses of X between 15 and 25 kb, with s = 0.01, were used to start the maximization procedure using Powell's method (![]()
τ = 0. As α should be large enough to generate a hitchhiking effect (also ε should remain small), s was not allowed to be <6 × 10⁻⁵ (α = 60). Only when s is very small can the hitchhiking model with τ = 0 fit neutral data in which a local reduction of variation is not found. Therefore, there was a limit in maximizing L1.
Table 2 summarizes the power of the test and the point estimates of s and X for each hitchhiking model. Power is the proportion of replicates that produce log(L1/L0) values greater than the 95th percentile of the corresponding null distribution. Test A yielded very high power of rejecting neutral evolution, even for larger values of τ. The main reason for obtaining large likelihood ratios from test A is that it uses the "true" standing level of variation (θ) for the calculation of L1 and L0. As the average heterozygosity has been reduced below θ due to the selective sweep, the neutral model based on the true value of θ cannot fit the data. The negligible differences in the power between options 1 and 2 (except at τ = 0.2) indicate that the additional information obtained by distinguishing between ancestral and derived alleles contributed little in test A, whereas a significant reduction of heterozygosity played a major role in increasing the likelihood ratio.
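The definition of power used here translates directly into code. The sketch below takes two lists of log(L1/L0) values, one from neutral replicates and one from hitchhiking replicates, and returns the fraction of the latter that exceed the 95th percentile of the former. The percentile is obtained by linear interpolation between order statistics, which is an assumption, since the text does not say how the percentile was computed; the function names are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0 to 100) by linear interpolation of order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("empty sample")
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight


def power(null_log_ratios, alt_log_ratios, level=95.0):
    """Fraction of alternative replicates above the null percentile cutoff."""
    cutoff = percentile(null_log_ratios, level)
    return sum(v > cutoff for v in alt_log_ratios) / len(alt_log_ratios)
```

With 200 neutral replicates, as used for the null distribution here, the cutoff lies between the 190th and 191st ordered values.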
On the other hand, a reduction of heterozygosity is not a major factor for increasing the likelihood ratio in test B. To obtain a higher likelihood ratio in this test for a given number of segregating sites, the spatial distribution of those sites along the sequence and the allele frequency spectrum should be close to the expectation under the hitchhiking model. As expected, the power of test B is smaller than that of test A (Table 2). However, it is still high (84–97%) for small values of τ (0.001 and 0.05). Power declines as τ increases, since the spatial pattern and the frequency spectrum of segregating sites approach those under neutrality as time passes after the selective sweep. Tests using option 1 yield higher power than those using option 2 for τ = 0.001 and 0.05, which means that the skew toward high-frequency derived alleles at segregating sites is observed as described by Pn,k (Fig 4). For τ = 0.2, however, both tests A and B had higher power with option 2. This follows from the fact that, at τ = 0.2, the proportion of high-frequency derived alleles is lowered below its level under neutrality (Fig 3C). Therefore, distinguishing the derived allele from the ancestral allele in these tests has an advantage for detecting very recent hitchhiking events only. However, it should be noted that our analytic prediction of the frequency spectrum is based on the assumption of τ = 0. Complete solutions for Pn,k for any value of τ may make option 1 still useful for detecting more distant hitchhiking events.
Maximum (composite)-likelihood estimates of s and X were also obtained. Test A with option 1 produced the most unbiased estimates of s, although the accuracy is quite low for all combinations of the test methods (Table 2). Joint estimates of s and X using test A with option 1 for datasets generated under neutrality and under hitchhiking with
are shown in Fig 6. From the neutral data, joint estimates were clustered in the parameter space of small s (close to the lower limit) and X between the "sequenced" segments. This can be expected since the hypothesized valley due to hitchhiking should be sufficiently narrow to fit between the sequenced segments where the level of polymorphism is high. On the other hand, the joint estimates from the hitchhiking datasets were centered around the true value
. It is also shown that estimates of X tend to cluster on the sequenced segments in this case.
To further investigate the performance of the composite likelihood-ratio test, we produced additional but shorter (10-kb) sequences by simulation. Unlike in the previous analysis, polymorphism data are assumed to be obtained from the entire continuous region and X is also allowed to vary over the entire region. Only option 1 is used. To make results comparable to the previous cases, R was adjusted to be 800 by setting
. In the simulation of selective sweeps, selection occurs at position 5 kb with
, but with various s values (Table 3). With
, the powers of tests A and B were 0.97 and 0.915, respectively. These can be compared with 0.995 and 0.97 from the previous analysis (Table 2,
and
). Considering a slight reduction of the surveyed region (1110 kb) where informative segregating sites were observed, the power of the tests with this new scheme appears to remain as high as in the previous one using discontinuous regions. With decreasing s, the power of detecting the hitchhiking event and the accuracy of the parameter estimates decrease, as expected (Table 3). However, power declined also when s increased from 0.002 to 0.005. Examination of simulated data showed that, with
, the number of segregating sites is highly reduced and those sites are frequently found clustered on one side of the site of selection. This produced very low likelihood ratios in both tests A and B. This effect should disappear if a wider region of the chromosome is surveyed.
We also conducted a few tests to assess the effect of uncertainty in prior estimates of N and
. Tests A and B were performed for a dataset described above (Table 3,
) but with a prior estimate of
, fivefold lower than the true value. New null distributions of the likelihood ratio were obtained accordingly. The power of tests A and B decreased to 0.935 and 0.865, respectively, from 0.97 and 0.915 (Table 3) due to the incorrect assumption of N. Average
decreased slightly (8.5 x 10-4 from 1.15 x 10-3 for test A and 5.1 x 10-4 from 7.1 x 10-4 for test B). Next, we used the correct value of N but a fivefold lower prior estimate of
(8 x 10-9). Average
decreased about fivefold (1.78 x 10-4 and 1.19 x 10-4 for tests A and B, respectively) as expected. Power decreased to 0.855 and 0.78 for tests A and B, respectively.
| DISCUSSION |
|---|
In this study, coalescent simulations using the ancestral recombination graph (![]()
Local reduction of genetic variation without the corresponding reduction of interspecific divergence was used as evidence of past directional selection in maize (![]()
, segments of reduced variation that are >1 kb can be frequently found under neutrality. The average length of segments sharing the same ancestral history becomes longer as N
becomes smaller (![]()
can be <<10-3. In such a case, a very cautious interpretation of the data is warranted when a sudden drop of heterozygosity over a few kilobases is observed in these species.
To address this problem, we developed a composite likelihood-ratio test to detect the local signature of genetic hitchhiking along a recombining chromosome, where the null distribution of variation is obtained by neutral coalescent simulations with recombination. The composite likelihood under the hitchhiking model is based on the probability of observing a certain ratio of segregating variants, Pn,k, for each site. Pn,k is a function of N, µ, ρ, s, and X. In test A, we assumed that the actual values of N, µ, and ρ are already known. The recombination rate per nucleotide can be determined independently for some species for which both physical and genetic maps are available. The effective population size and the mutation rate might be obtained from polymorphism and divergence data from adjacent chromosomal regions, if the standing levels of diversity and divergence are uniform in those regions and hitchhiking events do not occur frequently; i.e., the standing level is determined mainly by neutrality or background selection (![]()
Test B relaxes the assumption that θ, the level of variation in the region immediately before the hitchhiking event happened, is known. In the most conservative treatment, θ is given as the average nucleotide diversity observed in the data to be tested. Therefore, test B is the method of choice when information on other loci is not available. However, because this value of θ is inaccurate, the estimation of s is poorer in test B than in test A.
A similar approach to detect the signature of hitchhiking has recently been proposed by ![]()
Knowledge about the strength and the rate of directional selection in natural populations is fundamental in evolutionary biology. Previously, ![]()
![]()

, where
is the rate of strongly selected substitutions per nucleotide, using the positive correlation of variation and recombination in Drosophila melanogaster. Separate estimation of the strength and the rate of directional selection might be achieved by surveying large areas of a genome for signatures of hitchhiking events, using the method proposed in this article. Because the method assumes that standing variation in the region tested is not influenced by other hitchhiking events, it is expected to be most useful in regions of high recombination rates.
| ACKNOWLEDGMENTS |
|---|
We thank two reviewers for valuable suggestions. This work was supported by funds from the University of Munich and the Deutsche Forschungsgemeinschaft (STE 325/4-1).
Manuscript received May 10, 2001; Accepted for publication November 19, 2001.
| APPENDIX |
|---|
Consider a neutral locus where a mutant, A, is segregating. The substitution of a beneficial allele, B, for the wild-type allele, b, is assumed to occur at a linked locus at recombination fraction r away from the neutral locus. p1 is defined as the frequency of A among chromosomes carrying the B allele when the frequency of B increased to the value of near fixation, 1 -
. Then,
[Equation A1]
(![]()
and p2
are the frequencies of A among chromosomes carrying B and b alleles, respectively, when the frequency of B is
and
. Using the approximation given by ![]()
[Equation A2]
where
. Let p0 be the frequency of A when one copy of the B allele first appeared in the population. Assuming that the initial linkage disequilibrium between the two loci does not break down until the frequency of B increases to
and p2
p0, the expectation of p1 is Cp0 if B is initially linked with a and 1 - C(1 - p0) if B is initially linked with A. The former event occurs with probability 1 - p0 and the latter with p0. This leads to the transformation of
0(·) into
1(·).
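The two-case argument at the end of the appendix can be checked numerically. The sketch below takes p0 and C as plain inputs; C is treated as a given constant in [0, 1], which is an assumption made for the example, and the function names are hypothetical.

```python
from typing import List, Tuple


def expected_p1_cases(p0: float, c: float) -> List[Tuple[float, float]]:
    """Return (probability, conditional expectation of p1) for the two cases."""
    if not 0.0 <= p0 <= 1.0:
        raise ValueError("p0 must lie in [0, 1]")
    if not 0.0 <= c <= 1.0:
        raise ValueError("c is assumed to lie in [0, 1]")
    return [
        (1.0 - p0, c * p0),               # B initially linked with a
        (p0, 1.0 - c * (1.0 - p0)),       # B initially linked with A
    ]


def mean_p1(p0: float, c: float) -> float:
    """Expectation of p1 averaged over the two initial-linkage cases."""
    return sum(prob * value for prob, value in expected_p1_cases(p0, c))
```

Expanding (1 - p0)·C·p0 + p0·(1 - C(1 - p0)) gives p0 for any C, so the event leaves the mean frequency of A unchanged and alters only its distribution. With C = 0 the two outcomes are 0 and 1, and with C = 1 both outcomes equal p0.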
| LITERATURE CITED |
|---|
BARTON, N. H., 1998 The effect of hitch-hiking on neutral genealogies. Genet. Res. 72:123-133.
BARTON, N. H., 2000 Genetic hitchhiking. Philos. Trans. R. Soc. Lond. B 355:1553-1562.
BEGUN, D. J. and C. F. AQUADRO, 1992 Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature 356:519-520.
BENASSI, V., F. DEPAULIS, G. K. MEGHLAOUI, and M. VEUILLE, 1999 Partial sweeping of variation at the Fbp2 locus in a west African population of Drosophila melanogaster. Mol. Biol. Evol. 16:347-353.
BRAVERMAN, J. M., R. R. HUDSON, N. L. KAPLAN, C. H. LANGLEY, and W. STEPHAN, 1995 The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics 140:783-796.
CHARLESWORTH, B., M. T. MORGAN, and D. CHARLESWORTH, 1993 The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289-1303.
FAY, J. and C.-I. WU, 2000 Hitchhiking under positive Darwinian selection. Genetics 155:1405-1413.
FU, Y.-X., 1997 Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics 147:915-925.
FULLERTON, S. M., A. G. CLARK, K. M. WEISS, D. A. NICKERSON, and S. L. TAYLOR et al., 2000 Apolipoprotein E variation at the sequence haplotype level: implications for the origin and maintenance of a major human polymorphism. Am. J. Hum. Genet. 67:881-900.
GALTIER, N., F. DEPAULIS, and N. H. BARTON, 2000 Detecting bottlenecks and selective sweeps from DNA sequence polymorphism. Genetics 155:981-987.
GRIFFITHS, R. C. and P. MARJORAM, 1997 An ancestral recombination graph, pp. 257-270, in Progress in Population Genetics and Human Evolution, edited by P. DONNELLY and S. TAVARÉ. Springer-Verlag, New York.
HILL, W. G. and A. ROBERTSON, 1968 Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38:473-485.
HUDSON, R. R., 1983 Properties of the neutral allele model with intragenic recombination. Theor. Popul. Biol. 23:183-201.
HUDSON, R. R., M. KREITMAN, and M. AGUADÉ, 1987 A test of neutral molecular evolution based on nucleotide data. Genetics 116:153-159.
KAPLAN, N. L., R. R. HUDSON, and C. H. LANGLEY, 1989 The "hitchhiking effect" revisited. Genetics 123:887-899.
KIM, Y. and W. STEPHAN, 2000 Joint effects of genetic hitchhiking and background selection on neutral variation. Genetics 155:1415-1427.
KIMURA, M., 1971 Theoretical foundation of population genetics at the molecular level. Theor. Popul. Biol. 2:174-208.
MAYNARD SMITH, J. and J. HAIGH, 1974 The hitch-hiking effect of a favourable gene. Genet. Res. 23:23-35.
NACHMAN, M. W. and S. L. CROWELL, 2000 Contrasting evolutionary histories of two introns of the Duchenne muscular dystrophy gene, Dmd, in humans. Genetics 155:1855-1864.
NURMINSKY, D., D. DE AGUIAR, C. D. BUSTAMANTE, and D. L. HARTL, 2001 Chromosomal effects of rapid gene evolution in Drosophila melanogaster. Science 291:128-130.
PRESS, W. H., S. A. TEUKOLSKY, W. T. VETTERLING and B. P. FLANNERY, 1992 Numerical recipes in C. Cambridge University Press, Cambridge, UK.
RANNALA, B. and M. SLATKIN, 2000 Methods for multipoint disease mapping using linkage disequilibrium. Genet. Epidemiol. 19:S71-S77.
STEPHAN, W., 1995 An improved method for estimating the rate of fixation of favorable mutations based on DNA polymorphism data. Mol. Biol. Evol. 12:959-962.
STEPHAN, W., T. H. E. WIEHE, and M. W. LENZ, 1992 The effect of strongly selected substitutions on neutral polymorphism: analytical results based on diffusion theory. Theor. Popul. Biol. 41:237-254.

.





, and
. Squares represent average heterozygosity at single nucleotide sites averaged over 50,000 replicates of the simulations. The expected
as the recombination rate between a nucleotide site i and the site of selection. Directional selection occurs at position 20 kb with s = 0.001 and 



; hitchhiking models with (b) N = 5 x 105, s = 0.001,
; (c)
; and (d)
. The values of the other parameters are
. Selection occurs at position 20 kb. For each model, four replicates are shown. 



, and 


