Genetic Drift in an Infinite Population: The Pseudohitchhiking Model
John H. Gillespie
 Author email: jhgillespie{at}ucdavis.edu
Abstract
Selected substitutions at one locus can induce stochastic dynamics that resemble genetic drift at a closely linked neutral locus. The pseudohitchhiking model is a one-locus model that approximates these effects and can be used to describe the major consequences of linked selection. As the changes in neutral allele frequencies when hitchhiking are rapid, diffusion theory is not appropriate for studying neutral dynamics. A stationary distribution and some results on substitution processes are presented that use the theory of continuous-time Markov processes with discontinuous sample paths. The coalescent of the pseudohitchhiking model is shown to have a random number of branches at each node, which leads to a frequency spectrum that is different from that of the equilibrium neutral model. If genetic draft, the name given to these induced stochastic effects, is a more important stochastic force than genetic drift, then a number of paradoxes that have plagued population genetics disappear.
THIS article investigates the hypothesis that linked selection rather than genetic drift is the major stochastic force in many natural populations. Certain kinds of linked selection can produce stochastic dynamics that are remarkably like those of genetic drift. If true, this hypothesis may explain a number of paradoxical observations about genetic variation in natural populations.
The ideas presented here have at least four main antecedents. The first, of course, is Maynard Smith and Haigh's (1974) seminal article on “the hitchhiking effect.” Their investigation was prompted by a problem raised in Lewontin (1974): Assuming that protein variation is neutral, “the extent of enzyme polymorphism is surprisingly constant between species.” So constant, in fact, that the effective sizes of most species must be within one order of magnitude of each other. Maynard Smith and Haigh argued that hitchhiking events are like population bottlenecks in their ability to reduce genetic variation to levels that will be similar across species. This article is an exploration of that idea.
The second antecedent is the extensive literature showing that genetic variation is reduced in regions of low recombination (Aguadé et al. 1989; Miyashita 1990; Berry et al. 1991; Begun and Aquadro 1992; Aguadé and Langley 1994). The simplest explanation for this phenomenon is linked selection. While there is an active controversy over the form of this selection (Kaplan et al. 1989; Charlesworth et al. 1993; Charlesworth 1994; Braverman et al. 1995; Gillespie 1997), there is general agreement that some form of linked selection causes the reduction.
The third antecedent is a simulation study that showed that adaptive substitutions can cause the level of genetic variation at a linked neutral locus to be only weakly dependent on the population size (Gillespie 1999). This simulation confirms the basic premise in Maynard Smith and Haigh (1974) that hitchhiking can cause a homogenization of levels of variation across species, but points out that for this to happen, the rate of substitution at the selected locus must be an increasing concave function of population size.
The fourth antecedent came from Will Provine during a conversation in Liberia, Costa Rica, in which he tried to convince me that genetic drift must be a minor force compared to the effects of linked selection. He used an asexual haploid species to make his point, but his arguments carry weight for sexual, diploid species and are a major impetus for the work reported here.
The main goal of this article is to describe the effects of a steady stream of adaptive substitutions at one locus on the dynamics of a linked, neutral locus. A full mathematical treatment of this situation is out of reach. However, it appears that the induced stochastic effects of the substitutions on the neutral locus can be faithfully captured in a one-locus model called the pseudohitchhiking model.
NO CROSSING-OVER
We begin with the study of a neutral locus that is so tightly linked to a selected locus that there is no crossing-over between them. Both the selected and the neutral loci are represented by Watterson's infinite-sites, no-recombination model of a gene (Watterson 1975). Evolution occurs in a finite population of size N subject to the standard assumptions of the Wright-Fisher model.
The mutation rate at the neutral locus is called u and the mutation rate at the selected locus is called v. Each mutation at the selected locus raises the fitness of the homozygote for that mutation by an amount σ over the fitness of the homozygote for the parent allele. The heterozygote fitness is exactly intermediate between the two homozygote fitnesses. In accordance with the assumptions of the shift model (Ohta and Tachida 1990), all fitnesses are measured relative to that of the allele with the most recently fixed site.
The rate of fixation of advantageous mutations at the selected locus, ρ, as a function of the population size is illustrated in Figure 1. These results were obtained from a computer simulation using the same approach and lisp code as described in Gillespie (1999). The values of ρ obtained from this simulation play an important role in what follows. The usual approximation for this rate, 2Nvσ, is also plotted.
Our main interest is in the properties of the neutral locus that is linked to the selected locus. Variation at the neutral locus is measured by the sum of site heterozygosities (SSH),
It is a simple matter to describe the relationship between ssh and N mathematically if we make two simplifying assumptions:
The times of fixations at the selected locus form a Poisson process with rate ρ.
The time required to fix a selected allele is so short relative to the time between substitutions (1/ρ) and the time scale of genetic drift (N) that the fixations may be viewed as occurring instantaneously.
With these two assumptions, the mean time back to the common ancestor of a pair of randomly chosen neutral alleles is
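Under the two assumptions above, a pair of neutral lineages coalesces either by drift or at the next hitchhiking event. The following is a minimal sketch of that competing-rates argument, not the article's own displayed formula. It assumes a per-generation drift coalescence probability of 1/(2N), as for a diploid Wright-Fisher population, and it assumes that ssh equals 2u times the mean pairwise coalescence time, as under an infinite-sites model. The function names are illustrative.

```python
def mean_pairwise_coalescence_time(N, rho):
    """Mean generations back to the common ancestor of two neutral alleles.

    Assumption: drift coalescence occurs at rate 1/(2N) and hitchhiking
    events (which coalesce every lineage when there is no crossing-over)
    occur at rate rho; the two act as competing exponential clocks.
    """
    drift_rate = 1.0 / (2.0 * N)
    return 1.0 / (drift_rate + rho)


def expected_ssh(N, u, rho):
    """Expected sum of site heterozygosities: two lineages, each mutating at rate u."""
    return 2.0 * u * mean_pairwise_coalescence_time(N, rho)


if __name__ == "__main__":
    u, rho = 1e-6, 1e-4
    for N in (1e3, 1e4, 1e5, 1e6, 1e9):
        print(f"N = {N:.0e}  ssh = {expected_ssh(N, u, rho):.6f}")
```

With rho = 0 the sketch returns the familiar 4Nu, and as N grows it levels off at 2u/rho, which is the insensitivity to population size that the article emphasizes.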
In an infinite population, three fates await a neutral allele whose frequency is x_{i} in a given generation:
A selected mutation that ultimately fixes in the population could appear on the same chromosome as one copy of our allele and that copy would then be whisked to fixation by hitchhiking. The probability that a favorable mutation appears on a copy of our allele is just its frequency, x_{i}.
A selected mutation that ultimately fixes in the population could appear on the same chromosome as some other allele. In this case our allele will be eliminated from the population.
No selected mutation that ultimately fixes enters the population, in which case the frequency of our allele remains unchanged.
The frequency of our allele, after a hitchhiking event that may have occurred has run its course, may be summarized as follows:
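The three fates listed above can be sketched as a single per-generation update. This is an illustration under the stated assumptions (at most one hitchhiking event per generation, occurring with probability rho, and no crossing-over), and the function names are hypothetical.

```python
import random


def hitchhike_step(x, rho, rng=random):
    """Return the allele frequency after one generation in an infinite population.

    With probability rho * x the favorable mutation lands on a copy of our
    allele (frequency jumps to 1); with probability rho * (1 - x) it lands
    on another allele (frequency drops to 0); otherwise nothing changes.
    """
    r = rng.random()
    if r < rho * x:
        return 1.0
    if r < rho:
        return 0.0
    return x


def moments_of_change(x, rho):
    """Mean and variance of the one-generation change in frequency."""
    mean = rho * x * (1.0 - x) + rho * (1.0 - x) * (0.0 - x)
    var = rho * x * (1.0 - x) ** 2 + rho * (1.0 - x) * x ** 2
    return mean, var
```

The mean change is zero and the variance works out to rho * x * (1 - x), the same functional form as the variance under genetic drift.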
In a finite population, the variance in the change of
In an infinite population, the stochastic effects of linked selection on a neutral locus can be examined mathematically, but not by using diffusion theory as is often done to study genetic drift. Rather, we use a continuous-time Markov model with discontinuous sample paths. That diffusion theory is not appropriate follows from
Many of the sorts of problems that have been solved for genetic drift have analogs in this new context. For example, consider the stationary distribution of a neutral locus with two alleles that mutate to one another with rate u and are linked to a selected locus with substitutions occurring at rate ρ. As the only force acting between hitchhiking events is mutation, the frequency of one of the alleles is given by
A closely related density, which can be compared to the simulations, is that for the frequency of the unmutated copies of the most recently fixed neutral allele. After a fixation τ generations ago, the frequency of unmutated copies of the allele is
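The frequency of unmutated copies can be explored numerically. The sketch below is an assumption-laden illustration and may not match the article's Equation 5 term for term. It assumes that unmutated copies decay as exp(-u * tau) between hitchhiking events and that the time since the last fixation, tau, is exponentially distributed with rate rho. Under those assumptions the frequency X has distribution function x^(rho/u) on (0, 1). All function names are hypothetical.

```python
import math
import random


def unmutated_frequency(tau, u):
    """Frequency of unmutated copies tau generations after a fixation (assumed decay)."""
    return math.exp(-u * tau)


def cdf(x, u, rho):
    """P(X <= x) when tau ~ Exponential(rho) and X = exp(-u * tau)."""
    return x ** (rho / u)


def density(x, u, rho):
    """Derivative of cdf with respect to x."""
    a = rho / u
    return a * x ** (a - 1.0)


def sample_frequencies(u, rho, n, seed=0):
    """Draw n frequencies by sampling the time since the last hitchhiking event."""
    rng = random.Random(seed)
    return [unmutated_frequency(rng.expovariate(rho), u) for _ in range(n)]


if __name__ == "__main__":
    u, rho = 1e-3, 2e-3
    xs = sample_frequencies(u, rho, 20000)
    print("simulated mean:", sum(xs) / len(xs), "analytic mean:", rho / (rho + u))
```

Under the same assumptions the expected squared frequency is rho / (rho + 2u), which gives a quick homozygosity-like summary to set against simulation output.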
Equation 5 is compared to the two-locus simulations in Figure 3. As the population size grows, the simulations and the equation come into close agreement. The reason that they differ for smaller population sizes is that Equation 5 assumes an infinite population size. Once the population size is large enough, the agreement between the simulations and theory is very good. At the largest population sizes, the two curves begin to diverge. The reason for this appears to be that the assumption that substitutions occur instantaneously breaks down. For example, when v = 5 × 10^{−7} and N = 10,000, the fixation time of a selected allele is 438 generations and the time between substitutions is ~1265 generations. Thus, the fixations are reasonably spaced. However, at N = 20,000, the fixation time is ~504 generations and the time between fixations is ~719 generations. In this case, there is considerable overlap between the substitutions, which violates the timescale assumption and leads to a higher than expected homozygosity. Similarly, we saw in Figure 2 that ssh is too low when the population size is very large.
A proper limit could be obtained from the two-locus model by allowing N → ∞ as v → 0 in such a way that ρ remains constant. Unfortunately, we do not have an explicit formula for the dependency of ρ on N, so we cannot state the conditions required for convergence as N → ∞. However, were we willing to accept the usual approximation,
Another problem that is easily handled concerns the fixation process for the neutral locus. Recall that the origination process is made up of the times of appearance of mutations that ultimately fix in the population and the fixation process is the times that they ultimately fix (Gillespie 1993). The origination process for the neutral model is a Poisson process with rate u, this being so even if there is linked selection. Watterson (1982, 1984) gave some partial results for the neutral fixation process, which is considerably more complicated than the origination process as multiple sites may fix in the same generation. In particular, he was able to show that the number of sites that fix in a particular generation, given that at least one site fixed, is geometrically distributed. He was not able to find the distribution of the times between these fixation episodes.
In an infinite population, the time between fixation events may be written as the sum of two random times. The first of these, Y, is the time until the first appearance of a mutation that will ultimately fix. As the origination process is Poisson, this time is exponentially distributed with rate u. The second, Z, is the time until the next hitchhiking event (which must of necessity fix the mutation). By assumption, Z is exponentially distributed with rate ρ. Thus, the times of the fixation episodes form a renewal process with the time between events, usually called the failure time, being T = Y + Z. As Y and Z are independent, the moments of T are E{T} = 1/u + 1/ρ and Var{T} = 1/u^{2} + 1/ρ^{2}.
The asymptotic index of dispersion for a renewal process is the variance in the failure time divided by the square of the mean failure time (Cox 1962). For our case, this is (u^{2} + ρ^{2})/(u + ρ)^{2}.
The number of mutations that fix, given that at least one fixes, is 1 plus a random number whose distribution is a Poisson randomized by uZ. As a Poisson randomized by an exponential is geometrically distributed, we have that the number of sites that fix at each episode is geometric with mean 1 + u/ρ.
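The renewal argument can be checked numerically. The sketch below assumes only what the text states: Y is exponential with rate u, Z is exponential with rate rho, and the two are independent. The function names are illustrative and the simulation is a sanity check, not part of the original analysis.

```python
import random


def failure_time_moments(u, rho):
    """Mean and variance of T = Y + Z for independent exponentials."""
    mean = 1.0 / u + 1.0 / rho
    var = 1.0 / u ** 2 + 1.0 / rho ** 2
    return mean, var


def index_of_dispersion(u, rho):
    """Variance of the failure time divided by its squared mean."""
    mean, var = failure_time_moments(u, rho)
    return var / mean ** 2


def mean_sites_per_episode(u, rho):
    """One site plus a Poisson(u * Z) count averaged over Z ~ Exponential(rho)."""
    return 1.0 + u / rho


def simulate_failure_times(u, rho, n, seed=0):
    """Draw n independent failure times T = Y + Z."""
    rng = random.Random(seed)
    return [rng.expovariate(u) + rng.expovariate(rho) for _ in range(n)]


if __name__ == "__main__":
    u, rho = 0.01, 0.02
    ts = simulate_failure_times(u, rho, 20000)
    print("simulated mean:", sum(ts) / len(ts))
    print("analytic mean: ", failure_time_moments(u, rho)[0])
    print("index of dispersion:", index_of_dispersion(u, rho))
```

The index of dispersion is 1/2 when u equals rho and approaches 1 when one rate is much larger than the other.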
The final observation concerns the nature of the coalescent. In a finite population, the coalescent will be the usual neutral coalescent until the first hitchhiking event, at which point all of the extant lineages coalesce. The death process for the number of extant lineages, n, is governed by the following transition probabilities:
In this section we have been comparing a two-locus simulation to calculations that flow from a pair of assumptions: the Poisson nature of the selected substitution process and the instantaneous fixation time of selected substitutions. These two assumptions allow us to model the behavior of the neutral locus without any explicit use of the dynamics of the selected locus other than knowledge of the value of ρ. We call this one-locus model the pseudohitchhiking model, the prefix pseudo serving to emphasize that the full hitchhiking dynamics are not part of the one-locus model. We have seen that the properties of the pseudohitchhiking model are very close to those of the two-locus model when the assumptions of the model are met. In particular, we require that the hitchhiking events form an isolated stream of impulses.
CROSSING-OVER
When crossing-over occurs between the selected and neutral loci, selection no longer carries the hitchhiking allele to fixation, which requires a modification of the pseudohitchhiking model. In this section we consider a modification that gives acceptable results for tight linkage. The development of the model itself is quite instructive and provides insights into factors that must be considered in developing a more sophisticated version.
In the first step of the generalization of the pseudohitchhiking model, we allow the frequency of the neutral hitchhiking allele to stop before reaching fixation. When a favorable mutation first enters the population, it is on the same chromosome as only one copy of a neutral allele. The frequency of that copy will increase from 1/2N to some new value, call it y, at the expense of all other copies of the allele and all other alleles, which will have their frequencies multiplied by the factor 1 − y. Thus, after a hitchhiking event has run its course, the frequency of the neutral alleles will have changed according to the following scheme:
The change in x_{i} is
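The verbal description of a partial hitchhiking event translates into a short update rule. The following is a minimal sketch, assuming a deterministic y and that the lucky copy belongs to allele i with probability x_i. The function names are hypothetical.

```python
import random


def apply_hitchhiking_event(freqs, y, rng=random):
    """Apply one hitchhiking event to a list of neutral allele frequencies.

    Every existing frequency is multiplied by (1 - y); the allele carrying
    the lucky copy, chosen with probability equal to its frequency, then
    gains y.
    """
    lucky = rng.choices(range(len(freqs)), weights=freqs, k=1)[0]
    new_freqs = [(1.0 - y) * x for x in freqs]
    new_freqs[lucky] += y
    return new_freqs


def variance_of_change(x, rho, y):
    """Per-generation variance of the change in frequency x.

    With probability rho * x the change is y * (1 - x); with probability
    rho * (1 - x) it is -y * x; the mean change is zero.
    """
    return rho * x * (y * (1.0 - x)) ** 2 + rho * (1.0 - x) * (y * x) ** 2


if __name__ == "__main__":
    print(apply_hitchhiking_event([0.5, 0.3, 0.2], y=0.4, rng=random.Random(3)))
```

The variance simplifies to rho * y^2 * x * (1 - x), so rho * y^2 plays the role that 1/(2N) plays under drift.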
At this point we have reached an impasse: What is the value of y? If we choose to make y a parameter of the model, then its value can be derived from the results of Maynard Smith and Haigh (1974). But y will surely be a random variable reflecting the stochastic dynamics of hitchhiking. In this case, the values of y associated with a sequence of hitchhiking events will form a stationary sequence of independent, identically distributed random variables. Unfortunately, there is no available theory that allows us to derive the distribution of y. Before addressing the problem of adding randomness to the model, we examine the model with a deterministic y and use this as a benchmark to measure further refinements in the model.
Maynard Smith and Haigh (1974) describe the effects of the substitution of a new advantageous mutation at one locus on the frequency of two neutral alleles at a linked locus. The new mutant is originally on the same chromosome as one of the two neutral alleles and causes the frequency of that allele to increase by an amount that is determined by three parameters: the selection coefficient, σ, the rate of recombination, r, and the population size, N. (The latter parameter is relevant only through the assumption that the frequency of the newly arisen selected mutation is 1/2N; there is no genetic drift in their model.) The final frequency of the neutral allele that was linked to the advantageous mutation is
Equation 12 can be rearranged as
Figure 6 presents values of ssh for a two-locus simulation as well as those using Equation 11 with the value of ρ taken from the selected locus in the two-locus simulation and the value of y obtained numerically as described above. The agreement is good, as it must be, for very tight linkage. However, for weaker linkage the two-locus and pseudohitchhiking results diverge significantly. (For very loose linkage they will converge again as hitchhiking becomes unimportant.)
To add randomness, we first note that the values of y form a sequence of independent identically distributed random variables, and thus that
Of course, the full dynamics of the pseudohitchhiking model will depend on the complete distribution of y rather than on just E{y^{2}}. We can examine a complete model by assuming that y is β-distributed and obtaining its mean and variance from the same Maynard Smith and Haigh two-locus simulations that were described in the preceding two paragraphs. This distribution is then used in a direct simulation of the pseudohitchhiking model from which the average value of ssh is recorded. The results of these simulations are illustrated in the fourth curve in Figure 6. There is essentially no difference between these results and those from Equation 16. This is not unexpected as ssh for neutral models depends only on the first- and second-order moments in the change in the neutral allele frequencies and these, in turn, depend only on the first two moments of y. Other properties of the pseudohitchhiking model may well depend on higher-order moments of the process and will require the use of a sequence of random values of y rather than the fixed value E{y^{2}}.
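Fitting a β distribution to a given mean and variance is a standard method-of-moments calculation. The sketch below shows one way to do it; the function names and the example moments are illustrative and are not taken from the article's simulations.

```python
import random


def beta_params_from_moments(mean, var):
    """Return (alpha, beta) of the beta distribution with the given mean and variance."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("variance must be positive and below mean * (1 - mean)")
    k = mean * (1.0 - mean) / var - 1.0
    return mean * k, (1.0 - mean) * k


def draw_y(mean, var, rng=random):
    """Draw one random hitchhiking increment y from the fitted beta distribution."""
    alpha, beta = beta_params_from_moments(mean, var)
    return rng.betavariate(alpha, beta)


def second_moment(mean, var):
    """E{y^2}, the only feature of y that ssh depends on according to the text."""
    return var + mean ** 2


if __name__ == "__main__":
    rng = random.Random(42)
    ys = [draw_y(0.3, 0.02, rng) for _ in range(50000)]
    print("simulated E{y^2}:", sum(v * v for v in ys) / len(ys))
    print("analytic  E{y^2}:", second_moment(0.3, 0.02))
```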
The agreement between the pseudohitchhiking model with random y and the two-locus simulations for larger values of r is still not as good as we would hope. There are two potential sources for the discrepancy. The first is that the dynamics of selected substitutions in the Maynard Smith and Haigh simulations are not identical to those of the two-locus simulations. In the latter, which is an infinite-sites model, there are always several alleles segregating at the selected locus. In fact, sometimes two advantageous alleles with the same fitness will move through the population at the same time. Such dynamics are considerably more complicated than those of the Maynard Smith and Haigh simulations, which always have exactly two segregating alleles at the selected locus. At this time I have no way to assess the impact of this difference.
A second source for the discrepancy concerns the assumption that the frequencies of all of the non-hitchhiking alleles in the pseudohitchhiking model are lowered by the same constant factor 1 − y. When y is deterministic, this is the correct assumption. However, when y is random it is not correct to assume that all of the non-hitchhiking alleles are lowered by the same factor. In fact, the non-hitchhiking alleles should all be lowered by a different random amount, reflecting the various effects of drift and recombination that occur during the hitchhiking event. In some cases, there may even be two separate hitchhiking alleles. It appears to be quite difficult to add this particular element of randomness, although further work may uncover a way.
Although further refinements of the pseudohitchhiking model will be forthcoming, the remainder of this article is concerned with the properties of the model as defined above.
THE COALESCENT
As a first step, consider the genealogy of n alleles sampled from a pseudohitchhiking population with deterministic y and N = ∞. In this case, the only way that a coalescence can occur is if there is a hitchhiking event. The probability of such an event in a particular generation is ρ. If there were an event, then a single copy of one of the alleles in the population increases its frequency to y. The probability that i of the n sampled alleles are descended from that fortunate allele is the binomial probability C(n, i) y^{i}(1 − y)^{n−i}, where C(n, i) is the binomial coefficient.
We can summarize these observations as follows:
The probability that a coalescence does not occur in a particular generation is
When n = 2, the probability of a coalescence is ρy^{2}. Thus, the mean number of mutations separating these two alleles is 2u/(ρy^{2}).
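The coalescent probabilities for deterministic y and an infinite population follow from the binomial sampling argument. The sketch below assumes that a coalescence requires at least two of the n sampled lineages to descend from the hitchhiking copy and that at most one hitchhiking event occurs per generation. The function names are illustrative.

```python
from math import comb


def prob_i_descend(n, i, y):
    """Binomial probability that exactly i of n sampled alleles trace to the lucky copy."""
    return comb(n, i) * y ** i * (1.0 - y) ** (n - i)


def prob_coalescence(n, rho, y):
    """Per-generation probability that at least two sampled lineages coalesce."""
    none_or_one = (1.0 - y) ** n + n * y * (1.0 - y) ** (n - 1)
    return rho * (1.0 - none_or_one)


def prob_no_coalescence(n, rho, y):
    """Complement of prob_coalescence."""
    return 1.0 - prob_coalescence(n, rho, y)


def mean_pairwise_differences(u, rho, y):
    """Two lineages mutating at rate u over a mean coalescence time of 1 / (rho * y^2)."""
    return 2.0 * u / (rho * y * y)


if __name__ == "__main__":
    for n in (2, 5, 10, 50):
        print(n, prob_coalescence(n, rho=1e-3, y=0.3))
```

For n = 2 the coalescence probability reduces to rho * y^2, in agreement with the text.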
The next increment in complexity involves the addition of genetic drift. In any particular generation, a coalescence may be due to the finiteness of the population or to hitchhiking. In the former case, the coalescent can only shrink from n to n − 1 while in the latter case the size of the coalescent can shrink from n to n − i, i = 1 … (n − 1). Thus, the probabilities of all possible transitions are
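The combined death process can be sketched as a table of one-generation transition probabilities. This is an illustration under explicit assumptions rather than a reproduction of the article's displayed probabilities. It takes the drift coalescence probability for n lineages to be n(n − 1)/(4N), treats y as deterministic, and ignores the possibility that drift and hitchhiking both act in the same generation. The function name is hypothetical.

```python
from math import comb


def transition_probs(n, N, rho, y):
    """Map each possible next number of lineages to its one-generation probability.

    A hitchhiking event in which i >= 2 sampled lineages descend from the
    lucky copy merges them into one, leaving n - i + 1 lineages. Drift can
    only take n to n - 1. Requires n >= 2.
    """
    probs = {}
    for i in range(2, n + 1):
        probs[n - i + 1] = rho * comb(n, i) * y ** i * (1.0 - y) ** (n - i)
    probs[n - 1] += n * (n - 1) / (4.0 * N)  # assumed drift contribution
    probs[n] = 1.0 - sum(probs.values())     # no change in this generation
    return probs


if __name__ == "__main__":
    for k, p in sorted(transition_probs(5, N=10000, rho=1e-3, y=0.4).items()):
        print(f"5 -> {k}: {p:.6g}")
```

As N grows the drift term vanishes, so n → n − 1 transitions become rarer relative to multiple mergers, which is the mechanism the text gives for the increasingly negative Tajima's D.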
Figure 7 gives examples of the calculation of Tajima's D using a direct simulation of the pseudohitchhiking model and using a coalescent simulation with the transition probabilities given above. The two approaches give identical answers, as they should. There are two interesting aspects to these results. The first is that Tajima's D becomes more negative with increasing population size. The negativity comes from the fact that a coalescence can involve more than two lineages; the increasing magnitude comes from a decreasing role of genetic drift and, with it, a decreasing frequency of n → n − 1 transitions.
The second interesting aspect of Figure 7 is the increase in the magnitude of D that accompanies increasing sample sizes. Tajima's D for the entire population is −2.2 when N = 10,000. Thus, as the sample size increases, D does approach the population value. However, the approach is not sufficiently fast for D to be a reliable estimator of the population D. D has a dual role, as an estimator and as a test statistic. D is scaled such that |D| > 2 is cause to reject the neutral model. However, it is clear from Figure 7 that even though a population may have a large skew in its frequency spectrum, D will not exhibit a mean value that is close to the significance level for the sort of sample sizes typically used in population studies. Thus, there is reason to doubt that D has sufficient power to distinguish between models with typically sized samples when using only a single locus. Of course, much more power is achieved when loci are combined.
If y is random, then the death process for the coalescent is
NEUTRAL EVOLUTION
As a stochastic force, pseudohitchhiking is very similar to genetic drift. Certain properties of population genetical models should be essentially independent of which of these two stochastic forces is present. The rate of neutral evolution, k = u, is one such property. Although the fact that the frequency of a neutral allele is a martingale under the pseudohitchhiking model makes this calculation entirely trivial, if we take a somewhat long-winded route through the derivation, it will give us some insights into the nature of neutral evolution in very large populations with hitchhiking.
Consider first the probability that a very rare neutral allele with frequency x ≈ 0 never experiences a hitchhiking event. For simplicity, assume that y is constant. The fate of this rare allele over the first few generations is summarized in Table 1.
The probability that the allele is never chosen is
The rate of fixation of neutral alleles can now be written in the suggestive form,
It is instructive to see how the pseudohitchhiking model (in an infinite population) would have fared had it, rather than genetic drift, been the stochastic force used in Kimura and Ohta's (1971) classic article on the neutral theory. That article used two observations, k ≈ 10^{−7} and F ≈ 0.9, to estimate two parameters, u = 10^{−7} and N_{e} = 2.5 × 10^{5}, for a species with one generation per year. Had the pseudohitchhiking model been used, the estimate of u would have remained the same. Using the obvious generalization of Equation 5,
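The arithmetic behind this comparison can be sketched as follows. The relation used is an assumption made for illustration, not the article's displayed formula: homozygosity is taken as F = c/(c + 2u), where c is the pairwise coalescence rate, equal to 1/(2N_e) under drift and assumed here to be ρE{y^{2}} under pseudohitchhiking in an infinite population. The function names are hypothetical.

```python
def homozygosity(c, u):
    """Probability that two alleles coalesce (rate c) before either mutates (rate 2u)."""
    return c / (c + 2.0 * u)


def coalescence_rate_for_F(F, u):
    """Invert homozygosity(): the coalescence rate implied by an observed F."""
    return 2.0 * u * F / (1.0 - F)


if __name__ == "__main__":
    u = 1e-7
    Ne = 2.5e5
    print("F under drift with the quoted N_e:", homozygosity(1.0 / (2.0 * Ne), u))
    print("rate implied by F = 0.9:", coalescence_rate_for_F(0.9, u))
```

Under drift the relation reduces to F = 1/(1 + 4N_e u), which gives roughly 0.91 for the quoted estimates. Under the pseudohitchhiking reading the same observation constrains the product ρE{y^{2}} rather than an effective population size.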
DISCUSSION
The possibility that stochastic effects from linked selection events are a more important stochastic force than genetic drift, i.e., that
Levels of polymorphism at neutral sites would be insensitive to population size. By contrast, when genetic drift is the main stochastic force, ssh = 4Nu is linearly dependent on population size.
If, as seems plausible, ρE{y^{2}} is less variable between species than is N, then levels of variation should be relatively constant between species.
The frequency spectrum of alleles should be skewed from the neutral spectrum in a direction that leads to negative values of Tajima's D. The skew should be more extreme in regions of low recombination.
Assuming the correctness of the underlying model of selection, estimates of such quantities as Ns in large populations (i.e.,
Genetic variation should be proportional to levels of recombination.
Ever since Lewontin raised the issue, population geneticists have wrestled with the apparent lack of sensitivity of levels of variation to the variation in population sizes between species and to the homogeneity of variation between species. Various solutions have been proposed, but few can readily account for the fact that the silent nucleotide site heterozygosities of most diploid species are within one order of magnitude of each other. We have before us a rather simple solution to the problem, and one that does not cause a radical change in our understanding of the stochastic dynamics of populations. Rather, it suggests a reinterpretation of the parameters of our stochastic models and a slight, though important, change in the nature of the coalescent.
Is linked selection a more important force than drift? In regions of low recombination, including mitochondria, the answer is quite possibly in the affirmative. What about regions of the genome with “normal” levels of recombination? In Drosophila, the site heterozygosity away from regions of low recombination is around π = 0.006. Using Equation 17, we have
Of course, the real impact of hitchhiking involves events from many closely linked loci. The effects of hitchhiking events from more distant loci decrease with r. A full quantitative analysis of this combined effect will be discussed in a future publication as there are some complications stemming from the interactions of substitutions at closely linked loci on each other. Nonetheless, even this simple argument suggests that amino acid substitutions themselves could represent the hitchhiking agents required for our theory to be valid.
There is a much more intriguing, though largely unexplored, source of linked perturbations: meiotic drive. While there are some well-known and dramatic cases of male drive elements in natural populations such as segregation-distorter in Drosophila (Hiraizumi et al. 1960) and the t-allele in Mus (Lewontin and Dunn 1960), not much is known about segregation distortion in females, where, because only one of the four products of meiosis makes it into a gamete, it is much more likely to occur. So little is known about female meiotic drive because of the technical problem of disassociating the viability and drive effects of chromosomes. But, if particular chromosomes in nature were driven to higher frequency by a segregation advantage, then all of the alleles on those chromosomes would increase in frequency just as required in our model. Given the attraction of a drift-like stochastic force that is independent of population size, the possibility that chromosomes might experience transient drive should be seriously considered.
The stochastic effect of linked substitutions as captured in the pseudohitchhiking model is remarkably like genetic drift. The mean change in frequency of an allele is zero and the variance in the change is proportional to x(1 − x). Should the domain of genetic drift be extended to include this new force or should it be given another name entirely? When classifying the “factors of evolution,” Wright (1955) used only the second-order moments. Thus, his definition of “random drift” does encompass pseudohitchhiking. Under Wright's classification, our title's phrase “genetic drift in an infinite population” makes perfect sense. If another name should prove useful, “genetic draft,” as suggested to me by Bill Gilliland, is a good candidate as it is close to genetic drift and it continues the hitchhiking idiom by alluding to drafting to gain speed as practiced by bicyclists.
Other forms of linked selection will lead to different dynamics for neutral alleles. Some, like the TIM model (Takahata et al. 1975), will lower the heterozygosity and will skew the frequency spectrum to give D < 0 (Gillespie 1997). Thus, there is room for a great deal of additional work to describe the stochastic effects of linked selection in other contexts. Many of these effects may contribute even more to the divorce of genetic drift and population size.
Acknowledgments
I thank Dick Hudson, Chuck Langley, Ralph Haygood, Masaru Iizuka, and the Davis Evolution discussion Group for their many useful comments on this work. This article is dedicated to my friend and colleague Will Provine in recognition of his important contributions to the history of ideas in population genetics and for his tenacious campaign to consider a wider view of genetic drift. The research reported here was funded in part by National Science Foundation grant DEB9527808.
Footnotes

Communicating editor: R. R. Hudson
 Received June 26, 1999.
 Accepted February 3, 2000.
 Copyright © 2000 by the Genetics Society of America