Abstract
The coalescent process can describe the effects of selection at linked loci only if selection is so strong that genotype frequencies evolve deterministically. Here, we develop methods proposed by Kaplan, Darden, and Hudson to find the effects of weak selection. We show that the overall effect is given by an extension to Price’s equation: the change in properties such as moments of coalescence times is equal to the covariance between those properties and the fitness of the sample of genes. The distribution of coalescence times differs substantially between allelic classes, even in the absence of selection. However, the average coalescence time between randomly chosen genes is insensitive to the current allele frequency and is affected significantly by purifying selection only if deleterious mutations are common and selection is strong (i.e., the product of population size and selection coefficient, Ns > 3). Balancing selection increases mean coalescence times, but the effect becomes large only when mutation rates between allelic classes are low and when selection is extremely strong. Our analysis supports previous simulations that show that selection has surprisingly little effect on genealogies. Moreover, small fluctuations in allele frequency due to random drift can greatly reduce any such effects. This will make it difficult to detect the action of selection from neutral variation alone.
WE develop a diffusion approximation, first introduced by Kaplan et al. (1988), which extends the coalescent to take account of arbitrary forms of selection. Kingman (1982) introduced the coalescent process as a simple description of the genealogical relationships among a set of neutral genes. Although the theory of the coalescent has developed largely independently, it is closely related to the classical concept of identity by descent (Nagylaki 1989). The coalescent extends naturally to describe structured populations, in which genes may be found in different places or embedded in different genetic backgrounds. The effects of selection can easily be included, provided that it is so strong relative to random drift that the frequencies of different genetic backgrounds can be approximated as changing deterministically (i.e., the product of population size and selection coefficient, Ns ⪢ 1; Kaplanet al. 1988; Hudson 1990).
In some cases, assuming that the genetic or spatial structure of a population changes deterministically is a good approximation. For example, when a single favorable mutation arises and spreads, it carries with it any linked variants. The effects of such “selective sweeps” on genetic variability can be accurately described by assuming that the new allele increases exponentially, even though it is subject to strong random fluctuations in the early generations, when it is present in few copies (Maynard Smith and Haigh 1974; Kaplanet al. 1988; Barton 1998). However, the deterministic approximation plainly fails when selection is weak or absent. Consider the relationships between genes that can be of two allelic types. Even if these alleles are neutral, and so represent an arbitrary labeling of the genes, two genes of the same allelic type are likely to be substantially more closely related than are two genes of different type. Moreover, the average relationship between randomly chosen genes depends on the allele frequency, since an allele that happens to have increased by chance will cause a selective sweep just as if it had increased by selection. Although relationships averaged over the distribution of allele frequencies and over allelic classes must be unaffected by the labeling of neutral alleles, relationships do depend on allelic class and on allele frequency (e.g., Slatkin 1996).
We usually do not know which alleles are selected and so can observe relationships only among randomly chosen genes in populations with random genotype frequencies. However, with selection, even the average relationships are distorted and can be calculated using the structured coalescent only when selection is much stronger than drift. Weak selection (Ns ∼ 1) spread across many loci can have significant cumulative effects (McVean and Charlesworth 2000). Even when Ns is large, fluctuations may still be important. For example, Barton and Navarro (2002) showed that the effects of balancing selection at multiple loci are strongly affected by drift even when Ns is extremely large, because each particular genetic background is present in few (if any) copies.
When selection and drift are of comparable strength, a purely coalescentbased approach becomes complicated. Neuhauser and Krone (1997) and Krone and Neuhauser (1997) have shown that certain kinds of selection can be represented by “ancestral graphs,” which are constructed by allowing branching as well as coalescence as one moves back in time, followed by a culling of potential ancestors to generate the genealogy. This method is computationally demanding, especially with strong selection, because of the proliferation of ancestral lineages. Slade (2000a,b, 2001) and Fearnhead (2001) have introduced modifications that make calculations feasible for stronger selection. Nevertheless, this method does not seem likely to lead to a deeper analytical understanding, which would extend to more general kinds of selection and more complex genetics. Donnelly and Kurtz (1999a) and Slade (2001) have also shown how recombination can be included with selection in the algorithm. However, despite these various advances, the method is still computationally intensive for strong selection: for example, with overdominance, only Ns < 10 or so can be simulated (Slade 2000a). Moreover, it is limited to certain kinds of selection: linear frequency dependence or selection on diploids requires branching into three potential ancestors instead of two, and, more generally, a kth order polynomial dependence of haploid fitness on allele frequency requires branching into (k + 2) virtual ancestors. In practice, anything beyond the simplest kind of epistasis or frequency dependence is ruled out.
Other recent simulation techniques follow the state of the whole population backward through time. Donnelly et al. (2001) discuss methods for importance sampling, which start with the wellunderstood neutral process, and apply a bias that represents the action of selection. This is most efficient when selection is weak. Slatkin (2001) begins by reversing the selective process, which should allow stronger selection to be represented accurately. Because the diffusion is reversible with additive selection, the procedure is exact in this case. With nonadditive selection, Slatkin (2001) uses a procedure that approximates the correct backward diffusion. One can also follow the evolution of population allele frequencies back through time and use the MetropolisHastings algorithm to sample the appropriate distribution of frequencies. Under the diffusion approximation, the probability of any particular path is given by a Gaussian distribution of velocities around their deterministic expectation, which approximates the product of Markov transition matrices (Schulman 1981; Rouhani and Barton 1987).
Kaplan et al. (1988) introduced a more direct approach to the problem, by following relationships between genes within and between allelic classes, conditional on the frequencies of those classes in the population. However, their approach has not been taken up. This may be partly because Kaplan et al. (1988) did not specify boundary conditions for their equations, so that they could not be solved using standard software; however, Darden et al. (1989) do provide an alternative numerical algorithm. Barton et al. (2003) give a rigorous justification for Kaplan et al.’s (1988) diffusion equations, including the necessary boundary conditions. In this article, we show that in the absence of selection, the overall average relationship between random genes in a random population is the same as under simple neutrality. This must of course be the case, since labeling a pair of neutral alleles cannot affect the distribution of genealogies. However, this result extends to give a general expression for the effect of selection on the genealogy, which can be seen as an extension of Price’s (1970) equation.
The analysis is of a neutral locus that is linked to a single selected locus that carries two alternative alleles. (Setting the recombination rate to zero allows us to follow genealogies at a single selected locus). Essentially the same equations apply to probabilities of identity in state, assuming infiniteallele mutation at the neutral locus, the mean and higher moments of coalescence time, and the full distribution of coalescence times. The equivalence between these can be seen by noting that the identity in state is the generating function for the distribution of coalescence times. Our numerical results are mainly for the distribution of pairwise coalescence times, but in the last part of the article we consider the distribution of the total length of a large genealogy.
We begin by setting out the diffusion approximation for identities in allelic state; the equations for coalescence times are essentially the same. We then change variables to work with (i) the average over randomly chosen pairs of genes, (ii) differences associated with one or the other allele, and (iii) differences within classes vs. between classes. This change of variables leads to a simple formula for the expectation over the stationary distribution and to approximations for strong mixing between classes and for strong selection. Throughout this first part of the article, two simple examples are used to illustrate the derivations (two genes sampled from either a neutral locus or a locus under balancing selection). In the later sections, a wider range of parameters is explored.
THE MODEL
Consider selection on a single locus that carries two alleles, labeled P, Q. This is linked to a neutral second locus, with recombination rate r. Allele frequencies at the selected locus are p, q at the beginning of the generation. For simplicity, assume that selection acts on haploids; however, detailed assumptions about the life cycle do not affect the diffusion approximation. Numerical examples assume purifying selection with fitnesses of Q, P of 1:1 + s; balancing selection is modeled by assuming frequency dependence such that s is replaced by s(p_{0} p). This is close to a model of overdominance with diploid fitnesses 1  sp_{0}:1:1  sq_{0}, with p_{0} + q_{0} = 1. (The relation between models of overdominance and linear frequency dependence is discussed by Neuhauser 1999.) Mutation then occurs at a rate μp¯ from Q alleles to P alleles and μq¯ in the opposite direction. (In terms of the actual mutation rates μ=μ_{Q}_{→}_{P} +μ_{P}_{→}_{Q}, p¯ = μ_{Q}_{→}_{P}/μ; the equilibrium under mutation alone is p¯.) Where we consider identity in allelic state, mutation to novel alleles occurs at a rate ν at the linked neutral locus. Diploid zygotes are formed by random union and undergo meiosis. Finally, 2N gametes are sampled to found the next generation. (We keep the convention that population size is 2N genomes, corresponding to N diploid individuals.) This discrete time model is defined in more detail by Barton et al. (2003), who also set out the analogous continuoustime Moran model. Our notation is summarized in Table 1.
In the limit where selection, drift, and mutation are weak, we can take a diffusion approximation to this model. We scale selection, mutation, and recombination relative to N, and time relative to 2N, so that T = t/2N, S = Ns, R = Nr, U = Nμ, and V = Nν. [Note that Barton et al. (2003) scale time relative to N generations.] Suppose that we sample n genes. We need to follow the probability f_{j,k} that j genes of type Q, and k genes of type P, are identical in state at the neutral locus. We assume this neutral locus to be subject to infinitealleles mutation at rate ν. f_{j,k} is also the generating function for the distribution of total length of the genealogy and so can be used to find the distribution of the number of segregating sites.
Kaplan et al. (1988, Equation 20) provide a system of diffusion equations for f_{j,k} without recombination. This system is extended to include recombination by Hudson and Kaplan (1988). These give
The terms involving V = Nν represent the steady decay in identity due to mutation at the neutral locus. The positive terms (f_{j}_{1,}_{k}  f_{j,k})/q, (f_{j,k}_{1}  f_{j,k})/p represent the increase in identity due to coalescence within allelic classes. The terms involving differences in identity (f_{j}_{1,}_{k}_{+1}  f_{j}_{,}_{k}), (f_{j}_{+1,}_{k}_{1}  f_{j}_{,}_{k}) represent the movement of genes between allelic classes by mutation of allelic classes and by recombination. Note that mutation from allele Q to allele P dominates recombination when allele P is rare [term U(p¯/p) + R in Equation 1], because most copies of P will in that case have arisen as recent mutations from Q. Finally, the last two terms represent the proportion of populations currently at allele frequency p that derive from populations with a different frequency; this process is approximated as a backward diffusion.
In principle, the full distribution of genealogies can be recovered by assigning a notional mutation rate to each node, {ν_{i}, ν_{ij}, ν_{ijk},...} say, and by following identities among sets of genes that are either in background Q or P (f_{{}_{a}_{},{}_{b}_{,}_{c}_{}}, say). Then, terms such as j(j  1)/2 in Equation 1 separate out into distinct terms, corresponding to different permutations of loci over backgrounds. This gives a set of equations in a very large number of variables and, worse, an extremely large number of f’s: all possible partitions of the lineages must be tracked separately. So, numerical solution is difficult for even three genes. This approach might be useful, however, for deriving simpler equations that describe particular features of the distribution of genealogies.
As detailed in Barton et al. (2003), the probabilities of identity are the minimal positive solutions to the equilibrium version of Equations 1. This implicitly specifies the boundary conditions for the system, but to obtain numerical solutions, we require them explicitly. Consider first small p. The righthand side of Equations 1 is dominated by terms in 1/p. Solving for these terms leads to
Numerical methods for solving Equations 1 and 2 are explained in the appendix. Figure 1 gives a check on these methods for two genes. Three methods for calculating identities within and between neutral allelic classes are compared. First, identities can be calculated conditional on allele frequencies, simulated over 50,000 generations, for a population of 2N = 100 (Figure 1, dots). Allele frequencies were simulated over 50,000 generations, using the exact backward transition matrix calculated for the WrightFisher model. Second, the identities can be calculated by solving the discrete equivalent of Equation 1, which gives exact results for the WrightFisher model. This involves linear equations for a set of three vectors for f_{0,2}, f_{1,1}, f_{2,0}, each of length 2N + 1. The results fit closely with those estimated by simulation and are indistinguishable in Figure 1. Finally, the diffusion approximation (Equation 1) was used. This fits closely over most of the range (compare solid curves with dots). However, it underestimates identities between genes in rare allelic classes (e.g., f_{0,2} for p → 0; Figure 1, top left). The approximation is expected to fail for p ∼ 1/N, since only a small number of copies are involved. A similar comparison for 2N = 1000 shows that the discrepancy is then restricted to a narrower region, as expected.
IDENTITIES, TREE LENGTHS, AND PAIRWISE COALESCENCE TIMES
First, consider probabilities of identity in allelic state under the infinitealleles model. There are substantial differences between identities involving different allelic classes: identities between classes are much lower than those within (compare the lower curve for f_{1,1} with the upper curves for f_{0,2}, f_{2,0}). Identities also vary substantially with allele frequency. Withinclass identity decreases from ∼1 when the allele is present in a few copies, down to (1 + 4Up¯f_{1,1})/(1 + 4Up¯) (Equation 2) when it is frequent enough for the diffusion approximation to hold (p ⪢ 1/N), and then down to a value somewhat greater than the neutral expectation of 1/(1 + 4V) when the allele nears fixation. The betweenclass identity necessarily approaches that within the commonest class near fixation and decreases in between, because there is then a rapid influx into the rarer class by mutation from the commoner class.
Note that there is considerable variation in the identities, and hence in the distribution of coalescence times, in any particular generation of a simulated population (Figure 2, top). This arises from the random history of allele frequencies (Figure 2, middle) and is an extra source of variation, over and above the intrinsic variation in the actual coalescence time. The latter is a sample from a distribution that itself fluctuates with allele frequencies. The smooth curves shown in Figure 1 are the identities conditional on current allele frequency and are an average over the distribution of past fluctuations in allele frequency. Calculation of the variance in pairwise identity as a function of current allele frequencies would require consideration of associations among sets of four genes.
Figure 3 shows the same comparison as in Figure 1, but with balancing selection of strength S_{b} = Ns_{b} = 4 toward an equilibrium point p_{0} = 0.7. Again, the three methods agree to high accuracy, except for identities within rare allelic classes. Even though selection is much stronger than mutation and drift, it has surprisingly little effect in reducing the identities f_{2,0}, f_{1,1}, f_{0,2}. The stationary distribution is now concentrated around p_{0} = 0.7; because identities depend strongly on allele frequency, this might be expected to alter the identity between random pairs of genes, averaged over the stationary distribution. However, balancing selection reduces this average to only E[f¯] = 0.4977, relative to the neutral value of 1/(1 + 4V) = 0.5. This is because the average identity f¯ = q^{2}f_{2,0} + 2pqf_{1,1} + p^{2}f_{0,2} is almost independent of allele frequency (e.g., Figure 2, bottom). We consider this issue in more detail below.
Similar equations can be derived for the probability density of total length of the genealogy, Φ_{j,k}. This is a function of the total length,
The boundary conditions at
Figure 4 shows the solution to Equations 3, for the same parameters as Figure 3. The rate of coalescence between allelic classes, Φ_{1,1}, is necessarily zero, and so the distribution Φ_{1,1} passes through the origin (Figure 4, thick lines). However, when one or the other allele is rare, genes in the rarer class are likely to be descended from genes in the common class relatively recently. Hence, Φ_{1,1} rapidly approaches Φ_{2,0} when P is rare (Figure 4, top left and bottom right). Coalescence times within a rare allelic class are likely to be very short, unless the two genes derive from the common class via recent mutation. Because the mutational flux from common to rare is high, these two possibilities have comparable probability (see two terms ∼1/p in Equation 1). Thus, for small p the distribution Φ_{0,2} is a mixture of a singularity at zero and a component proportional to Φ_{2,0} (Figure 4, bottom curve in top left and bottom right). At intermediate frequency, both withinclass distributions are close to the neutral expectation, Exp[T] (Figure 4, middle).
Figure 5 compares the distribution of coalescence times between randomly sampled pairs of genes,
The relation between Equations 1 and 3 can be understood by noting that the identity in state is the generating function for the distribution of coalescence times, with scaled parameter
These limits can be found directly or from Equations 2. Note that for n = 2 genes, the expected total length of the genealogy is twice the mean pairwise coalescence time, which we denote by τ.
Figure 6 shows how the mean coalescence time changes with allele frequency, with and without balancing selection. The mean coalescence time between genes in different allelic classes is much greater than that within classes, but approaches the same value as within the commoner class as that class approaches fixation. Withinclass coalescence times approach 4Up¯J_{1,1}/(1 + 4Up¯) as the allele becomes rare. This value is determined by a balance between the rapid rate of coalescence within rare classes and the rapid influx of copies by mutation from the commoner class.
A CHANGE OF VARIABLES
To make some approximations, and to understand average identity, it is helpful to change variables, as follows. First, consider the pairwise case. Let
Similar transformations apply for the expected total length, J, and for the distribution of coalescence times, Φ. The average identity among randomly chosen pairs of genes is f¯; Δ is the difference in identity between a gene associated with P and a random partner and a gene associated with Q and a random partner; and θ is the sum of the differences between identities of alleles in the same classes and alleles in distinct classes. Equations 1 transform to
The equations for the distribution of coalescence times are obtained by setting V = 0 and dropping the terms that do not involve f¯, Δ, or θ (i.e., 1 in Equation 8a and 1/pq in Equation 8c). Boundary conditions at τ= 0 are that
Figure 7 shows the transformation for mean coalescence time, to the variables
Figure 8 shows the same comparisons, but for purifying selection. The overall mean
THE NET EFFECT OF SELECTION
A remarkably simple result is obtained by taking the expectation of the average identity over the stationary distribution, E[f¯]. We know that the stationary distribution satisfies the forward diffusion 0 = (2Spq + 2 (p¯  p)U)Ψ∂_{p}((pq/2)Ψ). Integrating the first of Equations 8a, 8b, 8c over the stationary distribution Ψ, we have
Integrating by parts, the third term vanishes, and we have
We immediately see that, in the neutral case (S = 0), E[f¯] is unaffected by the arbitrary labeling: it is given by the standard formula from the neutral theory, 1/(1 + 4V). Selection will perturb average identity by a proportion that depends on E[SpqΔ[p]]; this will be small for purifying selection, since Δ changes sign somewhere in the center. However, it will be large and negative for balancing selection, if the null point of selection coincides with the null point for Δ. Then, both Δ and S will change sign near the center and so E[SpqΔ[p]] is negative throughout. Thus, balancing selection is expected to reduce average identity.
The mean coalescence time is given by the same equation as Equation 9, but with V set to zero, and
Similarly, the distribution of coalescence times, averaged over random pairs of genes and over the distribution of allele frequencies, is
The expected mean coalescence time can be calculated either directly or by using the right side of Equation 11. In numerical calculations, the latter is more accurate, both because it gives what is usually a small deviation from the neutral expectation of 1 and because the regions near fixation (where the stationary distribution diverges for Up¯, Uq¯ < 1/4) do not contribute significantly to the integral (since pq and Δ^{τ} tend to zero at the boundaries). The boundaries do not contribute to the deviation from neutrality because in these regions selection is negligible relative to drift, and there is rapid flow between backgrounds.
The relationships of Equations 9, 10, 11, 12 extend to arbitrary numbers of genes in the sample. To be definite, consider the expected total length of the genealogy, J_{j}_{,}_{k} (Equation 5). However, the same argument applies to other properties of the genealogy, such as the identity in state or the distribution of times to the most recent common ancestor. As mentioned above, one can write down the generating function for the complete density in the same form, and so this result applies for any quantities derived from it.
Summing over the binomial probability density for the number of genes {j, k} sampled from each background, we obtain
The first term n represents the increase in tree length at a rate n, when n lineages are present. The second term represents coalescence at a rate n(n  1)/2; when coalescence occurs, the tree length expected among (n  1) genes replaces that among n genes: (J¯_{n}_{1}  J¯_{n}). The third term represents the perturbation to average tree length caused by selection. It is proportional to
Taking the expectation over the stationary distribution, the last term vanishes, and we obtain a recursion:
We see immediately that in the absence of selection the expected total tree length is given by the neutral formula,
Equation 14 can be interpreted as an instance of Robertson’s (1966) “secondary theorem of natural selection,” which states that the increase in any quantity caused by selection is equal to the covariance between that quantity and fitness. [This was independently developed into a general representation of selection by Price (1970).] In terms of the original variables,
APPROXIMATIONS
There does not appear to be an explicit solution to Equations 1 or 8a, 8b, 8c, even in the absence of selection. We therefore explore several approximations, in the hope that these might extend to more complex situations, which are difficult to investigate numerically. We consider in turn the extremes of no mixing between allelic classes, assuming that change in allele frequency is negligible and assuming that there is rapid mixing between backgrounds by mutation and recombination.
No movement between backgrounds: If there is no recombination or mutation between alleles (U, R = 0) then the three identities given by Equations 1 change independently. The betweenclass identity f_{1,1} clearly tends to zero, while the withinclass identities are those of two separate populations of size p, q, which fluctuate according to a diffusion process. (Strong frequency dependence is required to prevent loss of variation at the selected locus in the absence of mutation: near the boundaries, S → p^{α} as p → 0, α> 1.) However, even this uncoupling into a set of singlevariable equations does not lead to an explicit solution. The identities could be found by a change of timescale in the standard coalescent, but this does not lead to closedform solutions (Donnelly and Kurtz 1999b).
Weak random drift: If random drift is weak relative to mutation, recombination, and selection, then allele frequencies will be close to a deterministic limit. Many treatments of the effects of selection on genealogies assume this limit and apply the standard structured coalescent (e.g., Kaplan et al. 1988, 1989; Hudson and Kaplan 1995; Wakeley 2001). For the model presented here, this limit corresponds to dropping the diffusion terms (i.e., those terms involving ∂_{p} or ∂_{p}_{,}_{p} from Equation 1) and thereby assuming that the population has always had the same allele frequency. This approximation is not explicitly dependent on the strength of selection, but does depend on both mutation and recombination rates, which determine the rate of mixing between allelic classes. Under this approximation, the mean pairwise coalescence time is
A similar expression can be obtained for the identity in allelic state, which allows calculation of higher moments of the distribution of coalescence times. The overall mean,
Figure 10 shows the mean coalescence time as a function of the strength of balancing selection. Here, U = N, μ= 0.5 or 0.25, and so effects are not large: balancing selection has a substantial effect on genealogies only when mutation is low enough that genes rarely move between backgrounds. The mean coalescence time does approach the prediction from Equation 15 for large S (right side of figure), but balancing selection must be extremely strong for the deterministic limit to be accurate. That is, weak random fluctuations in allele frequency can substantially reduce coalescence times. Note that weak disruptive selection (e.g., underdominance) reduces coalescence times slightly, because allele frequencies tend to sweep back and forth between alternative alleles (see left of graph). However, strong underdominance has little effect, because the population is then near fixation.
Figure 11 shows an example in which fluctuations significantly reduce mean coalescence time despite strong selection. With S = Ns = 32, allele frequencies cluster around the equilibrium of p_{0} = 0.7 (bellshaped curve). The coalescence time is almost independent of allele frequency, simply because populations away from equilibrium are recently derived from populations close to equilibrium (thin solid curve). In contrast, the deterministic prediction (dashed line) ignores the diffusion of populations between different allele frequencies and so depends more strongly on allele frequency. Taking the value of this approximation at p_{0} = 0.7 gives a substantial overestimate of mean coalescence time, even under such strong selection.
Figures 12 and 13 show similar results, but for mutation/selection balance rather than for balancing selection. The lines to the right in Figure 12 show the prediction for the deterministic limit (Equation 15); as S ∞, mean coalescence time is reduced by a factor (1 →  (U/S) + O(1/S^{2})) (Charlesworthet al. 1993). This effect of “background selection” against deleterious alleles is reduced substantially by random drift for S = Ns < ∼3. Figure 13 shows the stationary distribution at S = 3, U/S =μ/s = 1/8; as for balancing selection, a moderate degree of fluctuation around the deterministic expectation substantially reduces the effect of selection in reducing mean coalescence time [here, from 1  (U/S) = 0.875 at S → ∞ to 0.908 at S = 3].
Rapid mixing: If mutation and/or recombination are strong relative to drift (U or R ⪢ 1), then there will be little divergence between allelic classes: Δ and θ will be small, and f¯ close to 1/(1 + 4V). First, consider the case with U, S ∼ 1 and R ⪢ 1. From Equations 8a, 8b, 8c, we see that θ is augmented by the term (1  f¯)/pq and dissipated by recombination, and so is O(1/R). Δ is generated by θ and dissipated by recombination and also mutation at the boundaries (pq ∼ 1/R). Hence, Δ ∼ 1/R^{2}. The perturbation to f¯ is therefore only of order 1/R^{2}. Letting f¯ = 1/(1 + 4V) + ϕ the leading terms are
Hence
Note that the term U(p¯  p)(q  p) in the denominator of the expression for Δ is negligible except near the boundaries, when Rpq ∼ 1. The perturbation ϕ to average identity is not given explicitly. However, its expected value, integrated over the stationary distribution, is given by Equation 10. Hence
Because the stationary distribution is known explicitly, this perturbation can readily be calculated. Moreover, it has a simple form when seen as a function of V. This allows us to take the inverse Laplace transform, which shows that the distribution of coalescence times is
Note that the key term E[] in Equation 19 is negative for purifying selection and positive for balancing selection.
If balancing selection is strong enough that the stationary density near the boundaries is negligible, and if R ⪢ U, then the term involving mutation in the denominator is negligible, and
Thus, when linkage is loose the effect of selection is to increase mean coalescence times by an amount proportional to the additive variance in fitness (2S^{2}pq). This can be understood as an inflation in the rate of random genetic drift due to inherited variation in fitness (Hill and Robertson 1966; Santiago and Caballero 1995). With balancing selection, the marginal selection coefficient S = S_{b}(p_{0}  p) is zero at p = p_{0}. However, allele frequency fluctuates around this expectation with variance var(p) ∼ 1/4S_{b} for large S_{b}. Hence, E[S^{2}pq/R^{2}] ∼ S_{b}p_{0}q_{0}/4R^{2}. Note that this is not the same as the limit of Equation 15 as R → ∞, which is 1/4R. Allele frequency fluctuations have a significant effect that cannot be neglected even for strong selection.
We examine the accuracy of these approximations by considering the effect of increasing recombination. Figure 14 shows the increase in mean coalescence time caused by balancing selection, plotted against recombination rate. Mutation is set to a small positive value (U = 0.05), to ensure that a stationary distribution exists. However, mutation has a negligible effect on the results unless linkage is tight (R ∼ U). The increase over the neutral value,
Figure 15 shows a similar plot for the decrease in mean coalescence time caused by purifying selection. Now, mutation is set to a moderately high level (U = 0.5); with weak mutation, the population would almost always be fixed, and effects on linked variation would be negligible. The deterministic limit of Equation 15 (top thick line) now performs poorly. This is because it is based on the assumption that allele frequency is close to the deterministic equilibrium of 1  (U/S), which is never the case for U = 0.5, even when selection is strong. This approximation is expected to be accurate only for U ⪢ 1. The large R approximation of Equation 19 is more accurate (dashed lines), but systematically underestimates the effect by ∼18%. Examination of the differences in mean coalescence time between backgrounds, Δ^{τ}, θ^{τ}, shows that the approximation of Equations 17 breaks down near the margins (q ∼ (1/R)). Since the stationary density is appreciable in this region for R = 10, much larger recombination rates would be needed for accuracy to be improved. As for balancing selection, this approximation is accurate only in cases where the effect on coalescence time is small.
For tight linkage, the effect of purifying selection at first increases with selection, but then decreases (see Figure 15, left side). This can be seen more clearly in Figure 16, which shows the mean coalescence time as a function of S = Ns, with complete linkage. As selection becomes very strong, the rare allele is driven out of the population, and so the genealogy returns toward its form under the neutral coalescent.
Large genealogies: We have concentrated on numerical examples for pairwise coalescence time partly because computations are then much faster, but also because the effects of fluctuations in allele frequency, and hence of selection, are primarily on the deeper parts of the genealogy, when there are just a few ancestral lineages. Here, we use Equation 5 to consider the effect of purifying selection on the expected total length of a larger genealogy. Figure 17 shows how the expected length depends on allele frequency and on the composition of the sample. If five genes of type P are sampled, then the genealogy is much shorter when that allele is rare (line rising steeply from left to right); similarly, if five copies of type Q are sampled, the genealogy is shorter when Q is rare, because coalescence is much more rapid within the rarer class. However, the relationship with allele frequency is quite weak for mixed samples, because coalescence occurs more slowly within each allelic class, and so lineages are likely to move between classes by mutation. In this example, there is purifying selection S = Ns = 2. However, the patterns for a neutral locus, or for other kinds of selection, are similar.
Figure 18 shows the net effect of purifying selection, S = Ns, on the expected total length, for up to n = 50 genes. Mutation rate is U = Nμ= 0.5, and there is no recombination. Equation 14 shows that the net effect of selection can be separated exactly into a sum, the jth term being due to the time when j lineages are present. Thus, Figure 18 shows the total effect on genealogies with 50 genes (top points) and the component effects on genealogies with up to 2, 3, 4, 5,... genes (bottom series of points). As explained above, most of the perturbation due to selection accrues when just a few lineages are present. Overall, the effects are small relative to the expected total tree length under neutrality (8.96 for n = 50 genes). The curves on the right give the simple approximation for large S, that effective population size and hence tree length is reduced by a factor (1  (U/S)). This approximation is accurate only for S so large that the net effect is small.
DISCUSSION
We have used the equations set out by Kaplan et al. (1988; see also Hudson and Kaplan, 1988) to find the effects of weak selection (Ns ∼ 1) on genealogies at a linked neutral locus. These genealogies are produced by the structured coalescent process, in which genes move between the genetic backgrounds defined by the selected locus as a result of both recombination and mutation of the selected alleles. Most previous studies have assumed that allele frequencies evolve deterministically and so are restricted to strong selection (Ns ⪢ 1). Here, we allow allele frequencies at the selected locus to evolve as a diffusion process. This leads to a set of coupled equations, each being the sum of two terms: a diffusion that allows for random changes in allele frequency through time and a recursion that describes the structured coalescent conditional on allele frequency. We find that quite small stochastic fluctuations in the frequencies of alternative genetic backgrounds substantially reduce the effects of selection.
In this article, we have considered only selection on a single biallelic locus and have for the most part sampled two genes from the linked neutral locus. Moreover, we have assumed that the selected locus has reached a stationary state, so that properties of the genealogy such as mean coalescence time can be taken to be functions of allele frequency only. In principle, it is straightforward to extend the method. Nonstationary processes can be described by taking the variables to be functions of time as well as allele frequency and following them using the same backward diffusion as in Equations 1. For example, one might ask about the genealogy immediately after a substitution has occurred by chance: Is the genealogy shortened in the same way as with a selective sweep?
Allowing multiple loci or more alleles under selection requires that the variables be functions of genotype frequencies, and this would greatly slow numerical calculations. More variables would also need to be followed, because genes can find themselves in many more genetic backgrounds. Extension to larger samples of genes at the neutral locus is simpler (Kaplanet al. 1988). Now, we follow the relationship between a set of j genes in background Q and k genes in background P; for a sample of n genes, this requires (n  1)(n + 4)/2 variables. This does not raise substantial difficulties, because one can break up the calculation into sets of equations involving 2, 3,... n genes at a time. Direct solutions of the differential equations (Equation 5) are feasible up to five or so genes, and much larger numbers can be approximated by assuming that coalescence from n genes down to approximately five genes occurs quickly, relative to the timescale of the diffusion process (Figure 9).
Most existing results assume that allele frequencies evolve deterministically (e.g., Kaplan et al. 1988, 1989; Hudson and Kaplan 1995; Wakeley 2001). Our numerical results show that this deterministic approximation is accurate only for quite strong selection: moderate fluctuations, reflected in dispersion of the stationary distribution, are sufficient to reduce effects substantially below those predicted. An obvious extension for the future is to make an expansion in powers of 1/Ns, so as to improve on the deterministic approximation. The opposite approach is to examine the effects of weak selection. Krone and Neuhauser (1997; Theorem 4.26) show that the effect of purifying selection and symmetric mutation on coalescence times is O((Ns)^{2}) (see Figure 12). This result was obtained by showing that various terms, each corresponding to an alternative topology of the ancestral graph, cancel. Under our approach, we see immediately from Equation 19 that the firstorder contribution is zero: to leading order, Δ is an odd function centered on P = 0.5, and so E[SpqΔ] = 0 for a symmetrical stationary distribution. (It is not possible for us to compare with Krone and Neuhauser’s Theorem 4.19, because this gives the probability that two individuals are identical in the sense that they have experienced no mutations that alter their allelic class since they shared a common ancestor. This is not a property of the genealogy alone and is not the same as the classical “identity by descent”).
We see the main advantage of our method as being amenable to analytical approximations that might allow progress in understanding more complex cases. In principle, it would be possible to couple a diffusion process to the structured coalescent so as to generate the distribution of genealogies conditional on observed data (for example, observations of neutral mutations carried by sampled genes). However, this would be computationally impractical for more than a single selected locus and so would restrict attention to very simple hypotheses. Our hope is primarily to find general results that will help us to understand how diverse evolutionary processes influence sampled sequences.
We have shown that the expected change in certain properties of a genealogy caused by selection is equal to the covariance between those properties and fitness. Applying Robertson’s (1966) “secondary theorem” [or Price’s (1970) equation] in this way is unusual in several respects. First, the argument applies to the expected value of a randomly distributed variable. Second, the argument is applied to samples of genes, rather than to individuals. Third, the phenotype of the sample of genes is taken to be a measure of their joint ancestry—in this instance, the expected length of the genealogy. However, one can see that Price’s (1970) arguments do apply with these extensions: a sample of genes can be connected with their offspring in the next generation, and the fitness of the whole sample can be seen as covarying with its genealogical properties. By including the transmission terms in Price’s (1970) equation, one can derive the whole of Equation 14: the expected state of the offspring sample changes as a result of the extra time step (first term) and as a result of coalescence (second term). It may be fruitful to apply this stochastic extension of Price’s equations to other problems.
Recent simulations suggest that the effects of selection at one site on the genealogy are surprisingly weak (Golding 1997; Neuhauser and Krone 1997; Przeworskiet al. 1999; Williamson and Orive 2002). Of course, the frequencies of selected alleles are strongly affected: purifying selection eliminates allelic variation, and balancing selection maintains it. Thus, observations on biased codon usage can give direct estimates of Ns (McVean and Charlesworth 2000), because the frequencies of alternative bases at the third position are themselves under selection through their effects on translation. However, the structure of the genealogy is typically affected very little by selection, and thus, observations of this genealogy, or of closely linked neutral variation, tell us little about the action of selection. This is a serious practical limitation, because we often do not know the actual nucleotides that cause fitness differences and so must base our inferences on synonymous or noncoding variation that we assume to be neutral.
Przeworski et al. (1999) describe the “competition of alleles on a genealogy” whose structure is largely independent of the distribution of the selected alleles. This description is somewhat misleading, because the genealogy does depend strongly on the allelic state of the sample, even when no selection is acting (e.g., Figure 1): genes in the same allelic state are more likely to share a recent common ancestor. However, when genealogies are averaged over the possible allelic configurations and over the stationary distribution of allele frequencies, the result is close to the neutral coalescent. Since we typically do not know which alleles influence fitness, we can usually observe only this average.
To examine the claim that selection has small effects on the genealogy, even at the selected locus, we must distinguish balancing from purifying selection. In the former case, there can be very strong effects provided that mutation rates are extremely low (U ⪡ 1), since this allows time for the two genetic backgrounds to diverge considerably. Such divergence can of course be observed directly, for example, at selfincompatibility loci in plants or the major histocompatibility locus in mammals (Hughes 1999). Extension of this argument to multiple balanced polymorphisms shows that in a sufficiently large population, and with sufficiently stable selection, coalescence times can become extremely long. However, random drift in even large populations greatly reduces this effect (Barton and Navarro 2002), because balancing selection cannot maintain every combination of selected alleles at constant frequency. Similarly, our singlelocus results (Figure 10) show that even with Ns_{b} = 20, the effect on neutral variability can be more than halved. In reality, fluctuations in selection are likely to further reduce the effect of balancing selection below the ideal case of an extremely large population under constant conditions.
Recent discussions have concentrated on the effects of selection against deleterious mutations on genealogical structure. Purifying selection is of most general importance, because it acts on all functional sequences and because its effects can be substantial in aggregate, at least for organisms with a high genomic mutation rate and in regions of low recombination (Charlesworthet al. 1993). Purifying selection on a single site has surprisingly weak effects. For example, Williamson and Orive (2002, Table 4) found that with a total mutation rate U = 5, the expected total length of a genealogy connecting 50 genes is reduced by at most 28%, at S = Ns ∼ 5. (We express U in terms of our model of reversible mutation; Williamson and Orive used 4Nμ_{Q}_{→}_{P} = 4Nμ_{P}_{→}_{Q} = 4Up¯ = 10; their 2Nσ is equivalent to our 2S.) Przeworski et al. (1999) used a much lower mutation rate (U = 0.1) and observed a reduction of at most 0.53% at S = 3. However, because of the computational difficulties of simulations of either the whole population or the ancestral selection graph, the view that purifying selection at a single site has small effects on genealogies is based on quite limited numerical results.
We can identify three distinct reasons why purifying selection should have little effect on genealogies. First, even when selection is strong relative to drift, the effect of a single site is small. With complete linkage and two selected alleles, effective population size is reduced by a factor (1  (μ/s)) for large Ns (Charlesworthet al. 1993). This is a consequence of the fact that if back mutation is negligible, and if the fittest class is to be maintained indefinitely, then only the fraction (1  (μ/s)) of the population that is free of deleterious mutations can contribute in the long term. Since deleterious alleles are likely to be rare at any one site, the effect on neutral variability will be small. Nevertheless, the cumulative effects of many sites can be large. Averaging over a genetic map of length R, the effective population size is reduced by Exp[Σ_{i} μ_{i}/R], a factor that depends on the total mutation rate per map distance (Hudson and Kaplan 1995).
Second, for a given mutation rate, a significant effect is seen only for intermediate selection strengths. For strong selection, deleterious mutations become negligibly rare, while for weak selection, the effect is only of second order in S (Figure 12; Neuhauser and Krone 1997). The maximum possible effect increases with mutation rate (Figure 16), because the frequency of deleterious mutations is increased. However, this effect is offset by the more frequent movement of genes between the two alternative backgrounds, which reduces the covariance between the allele frequency in the sample, and the structure of the genealogy. For fixed and large S, mean coalescence time decreases in proportion to U (Figure 16, right side), but for weak selection, it is almost independent of mutation rate (Figure 16, left side). Presumably, the effect of increased frequency of deleterious alleles and of faster mixing between backgrounds counterbalances when S is small.
Finally, the effect of selection on large genealogies is appreciable only deep in the tree, when just a few lineages segregate. This is because lineages rapidly coalesce down to a small number of ancestors, and so the fluctuations in allele frequency that mediate the effects of selection have little influence during this period. Consider the perturbation to the expected total length of the genealogy, which is plotted against the strength of purifying selection in Figure 18, for U = 0.5. For 50 genes, the greatest reduction below the neutral value is by 4.2%, when S = Ns ∼ 3. About half of this reduction is due to the effects of selection during the time when fewer than six lineages are present. Overall, the proportionate reduction in tree length is about the same for small and large samples. (In this example, pairwise coalescence time is reduced by at most 5.5%.) Statistics such as the length of external branches are expected to be much less sensitive to the effects of selection, an argument supported by the simulations of Przeworski et al. (1999) and Williamson and Orive (2002, Tables 3 and 4).
The methods that we have developed in this article suggest several avenues for future research. First, to what extent does selection distort the topology of the genealogy, rather than just changing its length? An understanding of such distortions is necessary if we are to be able to distinguish the effects of different kinds of selection from their effects on neutral variability. Second, analytical approximations can be developed for strong selection and for large genealogies, which may allow a better understanding of the joint effects of multiple loci on genetic variability. Finally, it may be possible to develop the idea that selection can act on samples of genes, and their genealogical relationships, in the same way that it acts on individual genes and their phenotypes. This suggests an intriguing link between the separate literatures on the stochastic evolution of genealogical relationships and on the deterministic evolution of groups of related individuals.
APPENDIX: NUMERICAL METHODS
Darden et al. (1989) described a numerical method for solving Equation 1 and gave some results for the case of two genes. Because they did not specify boundary conditions, they could not use standard algorithms. Instead, they approximated the differential equations on a discrete grid and thus obtained a set of linear equations that could be solved using matrix methods. This method is similar to solving the exact WrightFisher model for a finite population of 2N genes (see Figure 1).
Since we have specified the boundary conditions (Equations 2), we can solve the equilibrium version of Equation 1 using the builtin algorithms in Mathematica (Wolfram 1996), at least for up to five or so genes. We proceed iteratively. Initially, all identities are set to zero. At the nth stage of the iteration we solve the stationary version of Equation 1 subject to the boundary conditions of Equations 2, using a “shooting method”; the boundary conditions and the contribution from mutation and recombination involve identities that are provided by our trial solution. Thus, for example, to solve for
For a sample of two genes, this procedure is successful after only a few iterations. For larger samples, the differential equations are less stable, and so it is necessary to integrate over a series of separate intervals. The algorithm we used is as follows. Starting at some small ε ⪡ 1, integrate forward as described above. By making a small change in the initial condition, obtain a second solution that is close to the first for small p, but that may diverge as p becomes large. Choose a point p_{1} at which these two trial solutions are close to each other; we thus have an accurate solution for ε < p < p_{1}. Repeat the procedure starting from the opposite boundary (p = 1  ε), obtaining a solution valid for p_{2} < p < 1  ε. If p_{2} < p_{1}, splice the two solutions together at the point where they differ least. Otherwise, repeat the procedure, working in from p_{1} to p_{3}, from p_{2} to p_{4}, and so on until the solutions starting from left and right overlap.
With more than five or so genes, the differential equations become extremely sensitive to the values near the boundaries, and so “shooting methods” fail. This instability arises because the terms due to coalescence grow quadratically with the number of genes and so become extremely large relative to the diffusion terms that smooth the solution. This suggests an alternative approximation. We begin by calculating the identities among up to (say) five genes using the methods described above. To calculate identities among n = 6 genes, we first discard the diffusion terms (i.e., the last two terms in Equation 1). This is equivalent to assuming that allele frequency fluctuations are negligible over the short timescale set by coalescence among the n lineages. This gives a set of linear equations:
Acknowledgments
We are grateful to the Erwin Schrödinger International Institute for Mathematical Physics in Vienna for support during winter 2002/2003, where this manuscript was completed. We also acknowledge support from the Engineering and Physical Sciences Research Council and Biotechnology and Biological Sciences Research Council (grant MMI09726) and thank the referees for their helpful comments.
Footnotes

Communicating editor: W. Stephan
 Received October 30, 2002.
 Accepted November 10, 2003.
 Copyright © 2004 by the Genetics Society of America