- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Barton, N. H.
- Articles by Etheridge, A. M.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Barton, N. H.
- Articles by Etheridge, A. M.
The Effect of Selection on Genealogies
N. H. Bartona and A. M. Etheridgeba Institute of Cell, Animal and Population Biology, University of Edinburgh, Edinburgh EH9 3JT, United Kingdom
b Department of Statistics, University of Oxford, Oxford OX1 3T6, United Kingdom
Corresponding author: N. H. Barton, Animal and Population Biology, University of Edinburgh, W. Mains Rd., Edinburgh EH9 3JT, Scotland., n.barton{at}ed.ac.uk (E-mail)
Communicating editor: W. STEPHAN
| ABSTRACT |
|---|
The coalescent process can describe the effects of selection at linked loci only if selection is so strong that genotype frequencies evolve deterministically. Here, we develop methods proposed by Kaplan, Darden, and Hudson to find the effects of weak selection. We show that the overall effect is given by an extension to Price's equation: the change in properties such as moments of coalescence times is equal to the covariance between those properties and the fitness of the sample of genes. The distribution of coalescence times differs substantially between allelic classes, even in the absence of selection. However, the average coalescence time between randomly chosen genes is insensitive to the current allele frequency and is affected significantly by purifying selection only if deleterious mutations are common and selection is strong (i.e., the product of population size and selection coefficient, Ns > 3). Balancing selection increases mean coalescence times, but the effect becomes large only when mutation rates between allelic classes are low and when selection is extremely strong. Our analysis supports previous simulations that show that selection has surprisingly little effect on genealogies. Moreover, small fluctuations in allele frequency due to random drift can greatly reduce any such effects. This will make it difficult to detect the action of selection from neutral variation alone.
WE develop a diffusion approximation, first introduced by ![]()
![]()
![]()
![]()
![]()
In some cases, assuming that the genetic or spatial structure of a population changes deterministically is a good approximation. For example, when a single favorable mutation arises and spreads, it carries with it any linked variants. The effects of such "selective sweeps" on genetic variability can be accurately described by assuming that the new allele increases exponentially, even though it is subject to strong random fluctuations in the early generations, when it is present in few copies (![]()
![]()
![]()
![]()
We usually do not know which alleles are selected and so can observe relationships only among randomly chosen genes in populations with random genotype frequencies. However, with selection, even the average relationships are distorted and can be calculated using the structured coalescent only when selection is much stronger than drift. Weak selection (Ns
1) spread across many loci can have significant cumulative effects (![]()
![]()
When selection and drift are of comparable strength, a purely coalescent-based approach becomes complicated. ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
Other recent simulation techniques follow the state of the whole population backward through time. ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
The analysis is of a neutral locus that is linked to a single selected locus that carries two alternative alleles. (Setting the recombination rate to zero allows us to follow genealogies at a single selected locus). Essentially the same equations apply to probabilities of identity in state, assuming infinite-allele mutation at the neutral locus, the mean and higher moments of coalescence time, and the full distribution of coalescence times. The equivalence between these can be seen by noting that the identity in state is the generating function for the distribution of coalescence times. Our numerical results are mainly for the distribution of pairwise coalescence times, but in the last part of the article we consider the distribution of the total length of a large genealogy.
We begin by setting out the diffusion approximation for identities in allelic state; the equations for coalescence times are essentially the same. We then change variables to work with (i) the average over randomly chosen pairs of genes, (ii) differences associated with one or the other allele, and (iii) differences within classes vs. between classes. This change of variables leads to a simple formula for the expectation over the stationary distribution and to approximations for strong mixing between classes and for strong selection. Throughout this first part of the article, two simple examples are used to illustrate the derivations (two genes sampled from either a neutral locus or a locus under balancing selection). In the later sections, a wider range of parameters is explored.
| THE MODEL |
|---|
Consider selection on a single locus that carries two alleles, labeled P, Q. This is linked to a neutral second locus, with recombination rate r. Allele frequencies at the selected locus are p, q at the beginning of the generation. For simplicity, assume that selection acts on haploids; however, detailed assumptions about the life cycle do not affect the diffusion approximation. Numerical examples assume purifying selection with fitnesses of Q, P of 1:1 + s; balancing selection is modeled by assuming frequency dependence such that s is replaced by s(p0 - p). This is close to a model of overdominance with diploid fitnesses 1 - sp0:1:1 - sq0, with p0 + q0 = 1. (The relation between models of overdominance and linear frequency dependence is discussed by ![]()
from Q alleles to P alleles and µ
in the opposite direction. (In terms of the actual mutation rates µ = µQ
P + µP
Q,
; the equilibrium under mutation alone is
.) Where we consider identity in allelic state, mutation to novel alleles occurs at a rate
at the linked neutral locus. Diploid zygotes are formed by random union and undergo meiosis. Finally, 2N gametes are sampled to found the next generation. (We keep the convention that population size is 2N genomes, corresponding to N diploid individuals.) This discrete time model is defined in more detail by ![]()
|
In the limit where selection, drift, and mutation are weak, we can take a diffusion approximation to this model. We scale selection, mutation, and recombination relative to N, and time relative to 2N, so that T = t/2N, S = Ns, R = Nr, U = Nµ, and V = N
. [Note that ![]()
. fj,k is also the generating function for the distribution of total length of the genealogy and so can be used to find the distribution of the number of segregating sites.
![]()
![]()
![]() |
(1) |
with f0,1 = f1,0 = 1 by convention. ![]()
The terms involving V = N
represent the steady decay in identity due to mutation at the neutral locus. The positive terms (fj-1,k - fj,k)/q, (fj,k-1 - fj,k)/p represent the increase in identity due to coalescence within allelic classes. The terms involving differences in identity (fj-1,k+1 - fj,k), (fj+1,k-1 - fj,k) represent the movement of genes between allelic classes by mutation of allelic classes and by recombination. Note that mutation from allele Q to allele P dominates recombination when allele P is rare [term U(
/p) + R in Equation 1], because most copies of P will in that case have arisen as recent mutations from Q. Finally, the last two terms represent the proportion of populations currently at allele frequency p that derive from populations with a different frequency; this process is approximated as a backward diffusion.
In principle, the full distribution of genealogies can be recovered by assigning a notional mutation rate to each node, {
i,
ij,
ijk, ...} say, and by following identities among sets of genes that are either in background Q or P (f{a},{b,c}, say). Then, terms such as j(j - 1)/2 in Equation 1 separate out into distinct terms, corresponding to different permutations of loci over backgrounds. This gives a set of equations in a very large number of variables and, worse, an extremely large number of f's: all possible partitions of the lineages must be tracked separately. So, numerical solution is difficult for even three genes. This approach might be useful, however, for deriving simpler equations that describe particular features of the distribution of genealogies.
As detailed in ![]()
![]() |
(2) |
Similarly, as p tends to 1, the terms in 1/q dominate and we obtain analogous boundary conditions to Equation 2.
Numerical methods for solving Equation 1 and Equation 2 are explained in the Appendix. Fig 1 gives a check on these methods for two genes. Three methods for calculating identities within and between neutral allelic classes are compared. First, identities can be calculated conditional on allele frequencies, simulated over 50,000 generations, for a population of 2N = 100 (Fig 1, dots). Allele frequencies were simulated over 50,000 generations, using the exact backward transition matrix calculated for the Wright-Fisher model. Second, the identities can be calculated by solving the discrete equivalent of Equation 1, which gives exact results for the Wright-Fisher model. This involves linear equations for a set of three vectors for f0,2, f1,1, f2,0, each of length 2N + 1. The results fit closely with those estimated by simulation and are indistinguishable in Fig 1. Finally, the diffusion approximation (Equation 1) was used. This fits closely over most of the range (compare solid curves with dots). However, it underestimates identities between genes in rare allelic classes (e.g., f0,2 for p
0; Fig 1, top left). The approximation is expected to fail for p
1/N, since only a small number of copies are involved. A similar comparison for 2N = 1000 shows that the discrepancy is then restricted to a narrower region, as expected.
|
| IDENTITIES, TREE LENGTHS, AND PAIRWISE COALESCENCE TIMES |
|---|
First, consider probabilities of identity in allelic state under the infinite-alleles model. There are substantial differences between identities involving different allelic classes: identities between classes are much lower than those within (compare the lower curve for f1,1 with the upper curves for f0,2, f2,0). Identities also vary substantially with allele frequency. Within-class identity decreases from
1 when the allele is present in a few copies, down to (1 + 4U
f1,1)/(1 + 4U
) (Equation 2) when it is frequent enough for the diffusion approximation to hold (p >> 1/N), and then down to a value somewhat greater than the neutral expectation of 1/(1 + 4V) when the allele nears fixation. The between-class identity necessarily approaches that within the commonest class near fixation and decreases in between, because there is then a rapid influx into the rarer class by mutation from the commoner class.
Note that there is considerable variation in the identities, and hence in the distribution of coalescence times, in any particular generation of a simulated population (Fig 2, top). This arises from the random history of allele frequencies (Fig 2, middle) and is an extra source of variation, over and above the intrinsic variation in the actual coalescence time. The latter is a sample from a distribution that itself fluctuates with allele frequencies. The smooth curves shown in Fig 1 are the identities conditional on current allele frequency and are an average over the distribution of past fluctuations in allele frequency. Calculation of the variance in pairwise identity as a function of current allele frequencies would require consideration of associations among sets of four genes.
|
Fig 3 shows the same comparison as in Fig 1, but with balancing selection of strength Sb = Nsb = 4 toward an equilibrium point p0 = 0.7. Again, the three methods agree to high accuracy, except for identities within rare allelic classes. Even though selection is much stronger than mutation and drift, it has surprisingly little effect in reducing the identities f2,0, f1,1, f0,2. The stationary distribution is now concentrated around p0 = 0.7; because identities depend strongly on allele frequency, this might be expected to alter the identity between random pairs of genes, averaged over the stationary distribution. However, balancing selection reduces this average to only
, relative to the neutral value of 1/(1 + 4V) = 0.5. This is because the average identity
is almost independent of allele frequency (e.g., Fig 2, bottom). We consider this issue in more detail below.
|
Similar equations can be derived for the probability density of total length of the genealogy,
j,k. This is a function of the total length,
, and current allele frequency, p. After a long time, the density approaches a steady state, which satisfies
![]() |
(3) |
At
= 0, we have
![]() |
(4) |
The boundary conditions at
= 0 set the rate of coalescence within allelic classes as being inversely proportional to the frequency of the class. (Recall that time has been scaled relative to 2N). The partial differential equations themselves have essentially the same form as for the identities and represent the movement of genes between allelic classes and the diffusion of populations between allele frequency states.
Fig 4 shows the solution to Equation 3, for the same parameters as Fig 3. The rate of coalescence between allelic classes,
1,1, is necessarily zero, and so the distribution
1,1 passes through the origin (Fig 4, thick lines). However, when one or the other allele is rare, genes in the rarer class are likely to be descended from genes in the common class relatively recently. Hence,
1,1 rapidly approaches
2,0 when P is rare (Fig 4, top left and bottom right). Coalescence times within a rare allelic class are likely to be very short, unless the two genes derive from the common class via recent mutation. Because the mutational flux from common to rare is high, these two possibilities have comparable probability (see two terms
1/p in Equation 1). Thus, for small p the distribution
0,2 is a mixture of a singularity at zero and a component proportional to
2,0 (Fig 4, bottom curve in top left and bottom right). At intermediate frequency, both within-class distributions are close to the neutral expectation, Exp[-T] (Fig 4, middle).
|
Fig 5 compares the distribution of coalescence times between randomly sampled pairs of genes,
, with the neutral formula Exp[-T], for the same example of balancing selection. When allele P is rare, the coalescence time tends to be shorter, while as the frequency approaches the intermediate value where the population is most likely to be found (p
p0 = 0.7), coalescence times become slightly longer than the neutral expectation. As P approaches fixation, coalescence times again become slightly shorter than in the absence of selection and allelic structure. Overall, there is little change: in this example, balancing selection increases mean coalescence time by 13.9%.
|
The relation between Equation 1 and Equation 3 can be understood by noting that the identity in state is the generating function for the distribution of coalescence times, with scaled parameter 4V (i.e.,
. This can be confirmed by integrating the product of Equation 3 with Exp[-4V
] over
. The moments of coalescence times can be recovered by taking differentials of f at V = 0. For example, the expected total length
is given by
![]() |
(5) |
with J1,0 = J0,1 = 0 by convention. As p
0, mean total length tends to
![]() |
(6) |
These limits can be found directly or from Equation 2. Note that for n = 2 genes, the expected total length of the genealogy is twice the mean pairwise coalescence time, which we denote by
.
Fig 6 shows how the mean coalescence time changes with allele frequency, with and without balancing selection. The mean coalescence time between genes in different allelic classes is much greater than that within classes, but approaches the same value as within the commoner class as that class approaches fixation. Within-class coalescence times approach 4U
J1,1/(1 + 4U
) as the allele becomes rare. This value is determined by a balance between the rapid rate of coalescence within rare classes and the rapid influx of copies by mutation from the commoner class.
|
| A CHANGE OF VARIABLES |
|---|
To make some approximations, and to understand average identity, it is helpful to change variables, as follows. First, consider the pairwise case. Let
![]() |
(7) |
;
is the difference in identity between a gene associated with P and a random partner and a gene associated with Q and a random partner; and
is the sum of the differences between identities of alleles in the same classes and alleles in distinct classes. Equation 1 transform to
![]() |
(8a) |
![]() |
(8b) |
![]() |
(8c) |
Each variable is augmented by a diffusion term,
(). As is shown in Equation 9, the average of this diffusion term over the stationary distribution is zero; thus, it shifts the variable without producing a net change. The first equation shows that average identity is augmented by 4pqS
[p], which is the product of the change in allele frequency due to selection, and the difference in identity between genes associated with P rather than with Q. Net identity increases if a higher identity is associated with an allele favored by selection. The crucial quantity, then, is
. The second equation shows that this is reduced by recombination and mutation, especially near the edges, and augmented by a term proportional to
, which is the difference in identity within and between allelic classes. The last equation, for
, shows that it is augmented by coalescence, especially at the boundaries, in proportion to (1 -
)/pq. It is also augmented by a term proportional to
.
The equations for the distribution of coalescence times are obtained by setting V = 0 and dropping the terms that do not involve
,
, or
(i.e., 1 in Equation 8a and Equation 1/pq in Equation 8c). Boundary conditions at
= 0 are that
,
= 0, and
= 1/pq. The expected total length (which is twice the mean pairwise coalescence time for n = 2) is given by the same equations as above, but setting V = 0, and subtracting 1/pq from Equation 8c.
Fig 7 shows the transformation for mean coalescence time, to the variables
, 
, 
(Table 1). The mean coalescence time between randomly chosen pairs of genes varies rather little with allele frequency and increases only moderately with increasing balancing selection (Fig 7, top). This is because the equation for
is driven mainly by the term 1 -
, which leads to the standard solution
. The diffusion terms
(
p
- 4
) redistribute
across the range of allele frequencies, but, as we show below, do not alter its average value. The main driving term is small, because under balancing selection both S and
are zero somewhere in the interior, and so their product is small when allele frequency is concentrated in the interior.
|
Fig 8 shows the same comparisons, but for purifying selection. The overall mean
is affected little by weak selection (S
1, say). With stronger purifying selection, mean coalescence time is substantially reduced when the favorable allele becomes rare (Fig 8, left side of top left), but is slightly increased when the favorable allele is common. The average identity is insensitive to selection when selection is weak because then 
changes sign in the center and so its product with Spq also changes sign; the net effect through the driving term 4Spq
is thus small. As selection becomes stronger, 
becomes positive over a wider region, implying that the mean coalescence times involving genes in the favored allelic class become longer. However, the net effect on the pattern of
is hard to predict, because the diffusion term is strong. The next section sets out a simple result for the mean coalescence time, averaged over the stationary distribution, which necessarily does not include this diffusion term.
|
| THE NET EFFECT OF SELECTION |
|---|
A remarkably simple result is obtained by taking the expectation of the average identity over the stationary distribution, E[
]. We know that the stationary distribution satisfies the forward diffusion
. Integrating the first of Equation 8aEquation 8bEquation 8c over the stationary distribution
, we have
![]() |
(9) |
Integrating by parts, the third term vanishes, and we have
![]() |
(10) |
We immediately see that, in the neutral case (S = 0), E[
] is unaffected by the arbitrary labeling: it is given by the standard formula from the neutral theory, 1/(1 + 4V). Selection will perturb average identity by a proportion that depends on E[Spq
[p]]; this will be small for purifying selection, since
changes sign somewhere in the center. However, it will be large and negative for balancing selection, if the null point of selection coincides with the null point for
. Then, both
and S will change sign near the center and so E[Spq
[p]] is negative throughout. Thus, balancing selection is expected to reduce average identity.
The mean coalescence time is given by the same equation as Equation 9, but with V set to zero, and 
= p(
0,2 -
1,1) + q(
1,1 -
2,0):
![]() |
(11) |
Similarly, the distribution of coalescence times, averaged over random pairs of genes and over the distribution of allele frequencies, is
![]() |
(12) |
The expected mean coalescence time can be calculated either directly or by using the right side of Equation 11. In numerical calculations, the latter is more accurate, both because it gives what is usually a small deviation from the neutral expectation of 1 and because the regions near fixation (where the stationary distribution diverges for U
, U
< 1/4) do not contribute significantly to the integral (since pq and 
tend to zero at the boundaries). The boundaries do not contribute to the deviation from neutrality because in these regions selection is negligible relative to drift, and there is rapid flow between backgrounds.
The relationships of Equation 9Equation 10Equation 11Equation 12 extend to arbitrary numbers of genes in the sample. To be definite, consider the expected total length of the genealogy, Jj,k (Equation 5). However, the same argument applies to other properties of the genealogy, such as the identity in state or the distribution of times to the most recent common ancestor. As mentioned above, one can write down the generating function for the complete density in the same form, and so this result applies for any quantities derived from it.
Summing over the binomial probability density for the number of genes {j, k} sampled from each background, we obtain
![]() |
(13) |
The first term n represents the increase in tree length at a rate n, when n lineages are present. The second term represents coalescence at a rate n(n - 1)/2; when coalescence occurs, the tree length expected among (n - 1) genes replaces that among n genes: (
n-1 -
n). The third term represents the perturbation to average tree length caused by selection. It is proportional to
Jn, which is half the regression of average tree length on the number of P alleles in the sample (i.e., of Jj,k on k). The last term represents the effect of changes in allele frequency through time.
Taking the expectation over the stationary distribution, the last term vanishes, and we obtain a recursion:
![]() |
(14) |
We see immediately that in the absence of selection the expected total tree length is given by the neutral formula,
. The perturbation due to selection during the time when there are j lineages in the genealogy is proportional to 2/j(j - 1), which is the expected time for which j lineages persist before coalescence. This suggests that the effects of selection will primarily be on the deeper parts of the genealogy (i.e., small j); however, it is not clear how
j changes with j. Numerical calculations confirm our intuition that the effects of selection decrease rapidly as j increases. Fig 9 shows the successive perturbations due to selection for up to n = 50 genes, compared with the successive contributions 2/j under neutrality. The
j decrease for j > 10, and so the net perturbation, which is further reduced by the factor 2/j(j - 1), quickly becomes negligible. The sum of the perturbations is -0.29, compared with a neutral expectation of
. Thus, purifying selection reduces total tree length by only 3.3% in this example, and more than half of that effect arises during the time when only two or three lineages are extant. (As explained in the Appendix, we use the insensitivity of the genealogy to allele frequency fluctuations in our calculations of terms involving large numbers of genes.)
|
Equation 14 can be interpreted as an instance of ROBERTSON's (1966) "secondary theorem of natural selection," which states that the increase in any quantity caused by selection is equal to the covariance between that quantity and fitness. [This was independently developed into a general representation of selection by ![]()
, which is equal to the covariance between the fitness of the sample of n genes and the expected total length of the genealogy that relates these genes. (Here, relative fitness of a sample with k copies of allele P is sk, or 2Nsk = 2Sk when rescaled.) Equation 14 shows that this covariance is exactly equal to the increase in the expected total length caused by selection. Because essentially the same equations apply to the identity f, which can be viewed as the generating function of the genealogy, we can see that relations similar to Equation 14 apply to any quantities that are linear functions of the probability density of genealogies (for example, the kth moments of coalescence times).
| APPROXIMATIONS |
|---|
There does not appear to be an explicit solution to Equation 1 or Equation 8aEquation 8bEquation 8c, even in the absence of selection. We therefore explore several approximations, in the hope that these might extend to more complex situations, which are difficult to investigate numerically. We consider in turn the extremes of no mixing between allelic classes, assuming that change in allele frequency is negligible and assuming that there is rapid mixing between backgrounds by mutation and recombination.
No movement between backgrounds:
If there is no recombination or mutation between alleles (U, R = 0) then the three identities given by Equation 1 change independently. The between-class identity f1,1 clearly tends to zero, while the within-class identities are those of two separate populations of size p, q, which fluctuate according to a diffusion process. (Strong frequency dependence is required to prevent loss of variation at the selected locus in the absence of mutation: near the boundaries, S
p-
as p
0,
> 1.) However, even this uncoupling into a set of single-variable equations does not lead to an explicit solution. The identities could be found by a change of timescale in the standard coalescent, but this does not lead to closed-form solutions (![]()
Weak random drift:
If random drift is weak relative to mutation, recombination, and selection, then allele frequencies will be close to a deterministic limit. Many treatments of the effects of selection on genealogies assume this limit and apply the standard structured coalescent (e.g., ![]()
![]()
![]()
![]()
p or
p,p from Equation 1) and thereby assuming that the population has always had the same allele frequency. This approximation is not explicitly dependent on the strength of selection, but does depend on both mutation and recombination rates, which determine the rate of mixing between allelic classes. Under this approximation, the mean pairwise coalescence time is
![]() |
(15) |
A similar expression can be obtained for the identity in allelic state, which allows calculation of higher moments of the distribution of coalescence times. The overall mean, E[
], can be estimated either by integrating over the stationary distribution or by fixing p at its deterministic equilibrium value (![]()
Fig 10 shows the mean coalescence time as a function of the strength of balancing selection. Here, U = N, µ = 0.5 or 0.25, and so effects are not large: balancing selection has a substantial effect on genealogies only when mutation is low enough that genes rarely move between backgrounds. The mean coalescence time does approach the prediction from Equation 15 for large S (right side of figure), but balancing selection must be extremely strong for the deterministic limit to be accurate. That is, weak random fluctuations in allele frequency can substantially reduce coalescence times. Note that weak disruptive selection (e.g., underdominance) reduces coalescence times slightly, because allele frequencies tend to sweep back and forth between alternative alleles (see left of graph). However, strong underdominance has little effect, because the population is then near fixation.
|
Fig 11 shows an example in which fluctuations significantly reduce mean coalescence time despite strong selection. With S = Ns = 32, allele frequencies cluster around the equilibrium of p0 = 0.7 (bell-shaped curve). The coalescence time is almost independent of allele frequency, simply because populations away from equilibrium are recently derived from populations close to equilibrium (thin solid curve). In contrast, the deterministic prediction (dashed line) ignores the diffusion of populations between different allele frequencies and so depends more strongly on allele frequency. Taking the value of this approximation at p0 = 0.7 gives a substantial overestimate of mean coalescence time, even under such strong selection.
|
Fig 12 and Fig 13 show similar results, but for mutation/selection balance rather than for balancing selection. The lines to the right in Fig 12 show the prediction for the deterministic limit (Equation 15); as S
, mean coalescence time is reduced by a factor (1 - (U/S) + O(1/S2)) (![]()
3. Fig 13 shows the stationary distribution at S = 3, U/S = µ/s = 1/8; as for balancing selection, a moderate degree of fluctuation around the deterministic expectation substantially reduces the effect of selection in reducing mean coalescence time [here, from 1 - (U/S) = 0.875 at S
to 0.908 at S = 3].
|
|
Rapid mixing:
If mutation and/or recombination are strong relative to drift (U or R >> 1), then there will be little divergence between allelic classes:
and
will be small, and
close to 1/(1 + 4V). First, consider the case with U, S
1 and R >> 1. From Equation 8a HREF="#FD8b">Equation 8bEquation 8c, we see that
is augmented by the term (1 -
)/pq and dissipated by recombination, and so is O(1/R).
is generated by
and dissipated by recombination and also mutation at the boundaries (pq
1/R). Hence,
1/R2. The perturbation to
is therefore only of order 1/R2. Letting
the leading terms are
![]() |
(16) |
Hence
![]() |
(17) |
Note that the term U(
- p)(q - p) in the denominator of the expression for
is negligible except near the boundaries, when Rpq
1. The perturbation
to average identity is not given explicitly. However, its expected value, integrated over the stationary distribution, is given by Equation 10. Hence
![]() |
(18) |
Because the stationary distribution is known explicitly, this perturbation can readily be calculated. Moreover, it has a simple form when seen as a function of V. This allows us to take the inverse Laplace transform, which shows that the distribution of coalescence times is
![]() |
(19) |
Note that the key term E[ ] in Equation 19 is negative for purifying selection and positive for balancing selection.
If balancing selection is strong enough that the stationary density near the boundaries is negligible, and if R >> U, then the term involving mutation in the denominator is negligible, and
![]() |
(20) |
Thus, when linkage is loose the effect of selection is to increase mean coalescence times by an amount proportional to the additive variance in fitness (2S2pq). This can be understood as an inflation in the rate of random genetic drift due to inherited variation in fitness (![]()
![]()
1/4Sb for large Sb. Hence, E[S2pq/R2]
Sbp0q0/4R2. Note that this is not the same as the limit of Equation 15 as R
, which is 1/4R. Allele frequency fluctuations have a significant effect that cannot be neglected even for strong selection.
We examine the accuracy of these approximations by considering the effect of increasing recombination. Fig 14 shows the increase in mean coalescence time caused by balancing selection, plotted against recombination rate. Mutation is set to a small positive value (U = 0.05), to ensure that a stationary distribution exists. However, mutation has a negligible effect on the results unless linkage is tight (R
U). The increase over the neutral value, E[
] - 1, is plotted on a log scale, because when R is large, small effects must be discerned. As selection increases (Sb = 4, 8, 16, 32, bottom to top), mean coalescence time converges to the deterministic limit (thick line; Equation 15) However, convergence is slow for large R. There, the large R approximation of Equation 19 is accurate (dashed lines to right). However, the approximation is good only for R > 10, in which case mean coalescence time is increased by at most 2.5%. The large R approximation is not helpful for parameters that give a large effect.
|
Fig 15 shows a similar plot for the decrease in mean coalescence time caused by purifying selection. Now, mutation is set to a moderately high level (U = 0.5); with weak mutation, the population would almost always be fixed, and effects on linked variation would be negligible. The deterministic limit of Equation 15 (top thick line) now performs poorly. This is because it is based on the assumption that allele frequency is close to the deterministic equilibrium of 1 - (U/S), which is never the case for U = 0.5, even when selection is strong. This approximation is expected to be accurate only for U >> 1. The large R approximation of Equation 19 is more accurate (dashed lines), but systematically underestimates the effect by
18%. Examination of the differences in mean coalescence time between backgrounds, 
, 
, shows that the approximation of Equation 17 breaks down near the margins (q
(1/R)). Since the stationary density is appreciable in this region for R = 10, much larger recombination rates would be needed for accuracy to be improved. As for balancing selection, this approximation is accurate only in cases where the effect on coalescence time is small.
|
For tight linkage, the effect of purifying selection at first increases with selection, but then decreases (see Fig 15, left side). This can be seen more clearly in Fig 16, which shows the mean coalescence time as a function of S = Ns, with complete linkage. As selection becomes very strong, the rare allele is driven out of the population, and so the genealogy returns toward its form under the neutral coalescent.
|



; thus, U = V = 0.25.
.



, as in 





















, R = 0, S = 3. The thin solid curve gives the exact solution from 






