Abstract
The general theory of the effective size (Ne) for populations under directional selection is extended to cover linkage. Ne is a function of the association between neutral and selected genes generated by finite sampling. This association is reduced by three factors: the recombination rate, the reduction of genetic variance due to drift, and the reduction of genetic variance of the selected genes due to selection. If the genetic size of the genome (L in Morgans) is not extremely small the equation for Ne is
DIRECTIONAL selection generates differences in the reproductive success of individuals, increasing the variance of change in gene frequency and reducing the genetic diversity of neutral alleles. A population, then, behaves for these parameters like an ideal unselected population of size Ne, the effective population size (Wright 1931), which is in general smaller than the actual number of reproductive individuals. If selection acts on a noninherited trait, Ne is simply a function of the variance of the number of progeny per parent, and predictions have been developed for a variety of cases (see review by Caballero 1994).
When differences in fitness are inherited, the effective population size cannot be predicted from the variance of progeny number at a given generation. The drift process is amplified over generations because the random association that originated in a given generation between neutral and selected genes remains in descendants for a number of generations until it is eliminated by segregation and recombination. This problem was first addressed by Robertson (1961), and recently, adequate solutions were given by Woolliams et al. (1993) and Santiago and Caballero (1995) for directional selection on quantitative traits determined by an unlinked system of additive loci. But the drift process is larger when selection acts on a linked set of loci as random associations last longer under linkage. Although previous formulations predict the inbreeding coefficient that is calculated by tracing the paths in a genealogy independently of the existence of linkage, this may be different from the real inbreeding coefficient that represents the probability of identity by descent of genes carried by individuals. The reason for this is that both copies of a neutral gene in the same individual do not have identical probabilities of being transmitted to the following generations, as they are embedded in different chromosomes with different selective genes.
Hudson and Kaplan (1995) and, more generally, Nordborg et al. (1996) have derived expressions for predicting the nucleotide diversity at neutral loci under the background selection model (Charlesworth et al. 1993), which is based on the continuous appearance of linked deleterious mutations in the population. Barton (1995) made similar derivations for the fixation probability of a favorable allele. In parallel, models dealing with the hitchhiking of neutral genes caused by the spread of selectively favorable mutations at linked loci (Maynard-Smith and Haigh 1974) have been refined in recent years (Wiehe and Stephan 1993, and references therein).
Here, we develop a prediction of the effective population size under linkage, extending the argument of Robertson (1961) and Santiago and Caballero (1995). They have basically shown that the effective size of a population is a function of the variance of the cumulative selective values associated with neutral genes. Simplifying for second-order terms, under random mating and Poisson distribution of family size, the equation for the effective population size becomes
Thus, the term Q2 C2 is the variance of the long-term selective values, and with no linkage and weak selection it approximates 4C2. This does not hold, however, under linkage, but the argument can be rebuilt to consider the decline of the association between a neutral gene and selected genes on chromosomes. The problem of the reduction of the effective size under linkage is reduced to the problem of finding the appropriate value of Q2, and the same argument used by Santiago and Caballero (1995) can be followed.
Initially, to make predictions independently of gene frequencies and effects, an infinitesimal model (an infinite number of genes of small effect) is considered, but the predictive equations are also valid for the background selection model of Charlesworth et al. (1993), and we connect our equations with those of Nordborg et al. (1996) for this model. Moreover, we apply the principles of the theory to the selective sweep model (Maynard-Smith and Haigh 1974) by considering a continuous flux of weakly advantageous mutations, instead of rare mutations of strong favorable effect that pass quickly through the population, followed by recovery of variation by neutral mutation (Wiehe and Stephan 1993). Finally, we show that the two categories of estimators of polymorphism, basically mean heterozygosity per site and proportion of segregating sites, are related to the effective size to different extents, but predictions can still be made from effective size theory.
Table 1. Summary of most common notations
DERIVATION OF EXPRESSIONS
The general model: We consider a monoecious diploid population with random mating and a constant number of reproductive individuals, N. Every individual in the population is made up of two haploid homologous complements with ν chromosomes l Morgans (M) long each. (Table 1 shows the most common notation used in this article.) Each complement is referred to as a “gamete” (i.e., there are 2N gametes in the population). It is assumed that there is no genetic correlation between gametes in parents. The mapping function of Haldane (1919), r = [1 − exp(−2x)]/2, is assumed to relate the recombination fraction r and the genetic distance x in Morgans. A large number n of loci uniformly distributed on the chromosomes determines fitness. Allelic effects can be different for different loci, but gene action is additive within loci; that is, the fitness value of the heterozygote is the average value of the corresponding homozygotes. This latter assumption, however, can be removed for some models (see below). Gene effects are multiplicative between loci. This genetic system is at mutation-selection-drift equilibrium with a mean fitness of one; i.e., each parent has two descendants on average, and the genetic variance for fitness of individuals is C2, which is assumed to be small.
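As a minimal sketch of the mapping function assumed above, the following Python function evaluates r = [1 − exp(−2x)]/2 for a genetic distance x in Morgans. The function name and the sample distances are illustrative choices, not part of the original model description.

```python
import math


def haldane_recombination_fraction(x_morgans: float) -> float:
    """Haldane (1919) map function: r = [1 - exp(-2x)] / 2, with x in Morgans."""
    if x_morgans < 0:
        raise ValueError("genetic distance must be non-negative")
    return 0.5 * (1.0 - math.exp(-2.0 * x_morgans))


if __name__ == "__main__":
    # Illustrative distances only; r approaches 0.5 (free recombination) as x grows.
    for x in (0.0, 0.01, 0.1, 0.5, 1.0, 5.0):
        r = haldane_recombination_fraction(x)
        print(f"x = {x:5.2f} M  ->  r = {r:.4f}")
```

For short distances r is close to x, and for long distances it saturates at 0.5, which is why loosely linked and unlinked loci behave alike in the derivation.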
As the effects of loci are multiplicative, the relationship between variance of individuals and the contributions of the n selective loci to variation is
Consider a neutral locus in the middle of a chromosome. We assume that the neutral alleles at this locus are initially produced by mutation, but this is not a necessary assumption of the model. Due to the finite size of the population, the sampling process generates random associations between the neutral alleles and selected loci. The expected change in gene frequency of the neutral allele (S) is the covariance between the frequency of the allele in gametes (p) and the selective value (f) of individuals carrying the gametes, S = cov (p, f) (see Santiago and Caballero 1995, p. 1016). We derive this expected change next.
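The relationship S = cov(p, f) can be illustrated numerically. The sketch below assumes a population covariance taken over gametes (dividing by the number of gametes) and uses invented toy values for p and f; the function name is hypothetical.

```python
def neutral_allele_expected_change(p, f):
    """Covariance over gametes between the neutral-allele indicator p[i] (0 or 1)
    and the selective value f[i] of the individual carrying gamete i.

    Assumption: population covariance (divide by the number of gametes).
    """
    if len(p) != len(f) or len(p) == 0:
        raise ValueError("p and f must be non-empty and of equal length")
    n = len(p)
    mean_p = sum(p) / n
    mean_f = sum(f) / n
    return sum((pi - mean_p) * (fi - mean_f) for pi, fi in zip(p, f)) / n


if __name__ == "__main__":
    # Toy data (invented): the neutral allele happens to sit on fitter gametes.
    p = [1, 1, 0, 0]
    f = [1.1, 1.1, 0.9, 0.9]  # mean fitness is 1
    print(neutral_allele_expected_change(p, f))
```

With mean fitness equal to one, the covariance in this toy case matches the change in allele frequency after weighting gametes by fitness (from 0.5 to 0.55).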
Let pi be the frequency of an allele of the neutral locus in gamete i (pi can be 0 or 1). For the moment, we consider one single selected locus j with additive effects of alleles. Locus j is at a genetic distance of x M from the neutral locus. Let us consider a copy of the neutral allele present in a given individual, and let
Robertson (1961) and Santiago and Caballero (1995) showed that the effective population size can be predicted from the variance of the cumulative selective values associated with the neutral gene, as Ne = N/(1 + Var. of cumulative selective values). The variance of the cumulative selective values due to locus j is
If j were the only locus with effect on fitness in the genome, the effective population size would be
Now consider the n selected loci with different contribution to the variance for fitness. With multiplicative effects among them, the total variance of the cumulative selective values for all the selective loci in the genome is
For large populations
Numerical analysis (data not shown) indicates that the relevant parameter is the product L = νl, that is, the genetic size of the genome. Variations in the distribution of the sizes of the chromosomes do not make much difference if the size of the whole genome is constant. Thus, the first term in the above equation can be dropped by setting ν = 1 and substituting l by L:
Application to particular genetic systems: The above equations for predicting the effective population size are a function of the proportional reduction of the genetic variation (1 − Z) at selected loci. Two processes are involved in the dynamics of the genetic variation of loci: selection and drift. It is generally assumed that drift eliminates variation at a constant rate 1/2Ne. The change in genetic variance due to selection depends on the genetic system. For some models, this change is constant. In particular, phenotypic selection on an infinitesimal model (Bulmer 1980; Santiago 1998) and the background selection model (Charlesworth et al. 1993) erode variation at a rate that is independent of the changes in gene frequency at particular loci.
At equilibrium, the mutational input per generation (Vm) equals the loss of variation by drift and selection. Therefore, the proportion of the expressed genetic variance C2 that is lost by drift and selection per generation is Vm/C2. The remaining fraction of the expressed variation, which is expected to be maintained after one generation of selection, is
Infinitesimal model: All the previous equations apply under the infinitesimal model. Favorable and deleterious mutation models reduce to the infinitesimal model if effects are very small. If the population is small, selection is not very strong, and linkage is not very tight, the equilibrium variance can be approximated by C2 = 2NeVm (Lynch and Hill 1986; see Santiago 1998 for a general equation to predict C2 and Vm under linkage), and Z = 1 − (Vm/C2) ≈ 1 − 1/(2Ne).
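The two approximations quoted for the infinitesimal model, C2 = 2NeVm and Z = 1 − Vm/C2, can be checked against each other with a short sketch. The function names and the numerical values of Ne and Vm are illustrative assumptions.

```python
def equilibrium_fitness_variance(ne: float, vm: float) -> float:
    """Approximate equilibrium variance for fitness, C2 = 2 * Ne * Vm
    (small population, weak selection, loose linkage)."""
    return 2.0 * ne * vm


def variance_retention(vm: float, c2: float) -> float:
    """Fraction Z of the expressed variance kept after one generation, Z = 1 - Vm / C2."""
    if c2 <= 0:
        raise ValueError("C2 must be positive")
    return 1.0 - vm / c2


if __name__ == "__main__":
    ne, vm = 50.0, 1e-4  # invented values for illustration
    c2 = equilibrium_fitness_variance(ne, vm)
    z = variance_retention(vm, c2)
    # Under C2 = 2 Ne Vm, Z reduces to 1 - 1/(2 Ne): drift alone erodes the variance.
    print(c2, z, 1.0 - 1.0 / (2.0 * ne))
```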
Deleterious mutations model: This is equivalent to the background selection model of Charlesworth et al. (1993). Predictions of heterozygosity for this model have been developed by Hudson and Kaplan (1995) and more generally by Nordborg et al. (1996). In what follows we show expressions for Ne that generalize those predictions, including the effect of finite populations. Expressions obtained in the previous section are fully applicable, but we consider now a selection coefficient t against heterozygotes. In this case, the assumption previously made of additive gene action within a locus can be removed, because for the background selection model, the effect on the heterozygote and not on the mutant homozygote is critical. If the frequency of a deleterious allele in a particular generation i is qi, the genetic variance contributed by this locus is proportional to qi (1 − qi) ≈ qi, as the deleterious allele frequency will be generally small, and the expected gene frequency in the next generation is qi+1 ≈ qi − qi(1 − qi)t ≈ qi(1 − t). Therefore, the proportional change in the genetic variance of the selected locus due to selection is qi+1(1 − qi+1)/qi(1 − qi) ≈ qi+1/qi = (1 − t). This is the factor by which genetic variance is changed by selection for this model. This result can also be obtained directly from Equation 9, noting that under mutation-selection balance C2 = Ut (Crow and Kimura 1970) and Vm = Ut2, where U is the total genomic (diploid) mutation rate for detrimental genes. Combining both drift and selection, the reduction of the association between neutral and selected genes in one generation is approximately
Without recombination (L = 0) and t ⪢ 1/2Ne, Equation 6 reduces to Q ′2 = 1/t2. Substituting this and C2 = Ut into Equation 5, Ne ≈ N exp[−U/(2t)]. This equation is identical to the expression of Kimura and Maruyama (1966) and Haigh (1978) for the size of the chromosome class with the lowest number of deleterious mutants. As all the chromosomes in the population will be derived from one member of the best class, the size of this chromosome class is indeed the effective population size given in number of chromosomes. Charlesworth et al. (1993) found the equivalent algebraic solution for the heterozygosity, π = π0 exp(−U/2t), where π0 represents the expected heterozygosity if there is no selection.
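The no-recombination result can be evaluated directly. This sketch computes Ne ≈ N exp[−U/(2t)] and the corresponding heterozygosity ratio π/π0 = exp(−U/2t); the function names and the parameter values are hypothetical, chosen so that t is much larger than 1/2Ne.

```python
import math


def heterozygosity_ratio_no_recombination(u: float, t: float) -> float:
    """pi / pi0 = exp(-U / (2t)) for L = 0 and t >> 1/(2 Ne)."""
    if t <= 0:
        raise ValueError("t must be positive")
    return math.exp(-u / (2.0 * t))


def ne_no_recombination(n: float, u: float, t: float) -> float:
    """Ne ~ N * exp(-U / (2t)): the size of the least-loaded chromosome class."""
    return n * heterozygosity_ratio_no_recombination(u, t)


if __name__ == "__main__":
    n, u, t = 1000, 0.1, 0.05  # invented values; U / (2t) = 1
    ne = ne_no_recombination(n, u, t)
    print(f"Ne ~ {ne:.1f}; check t = {t} against 1/(2 Ne) = {1 / (2 * ne):.5f}")
```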
Favorable mutations model: Assume that the selective values of the three genotypes at any selective locus j are 1, 1 + t, and 1 + 2t, respectively. The predictive equations previously shown do not hold if favorable mutations are not effectively neutral (i.e., t > 1/2Ne, after Kimura 1983), as changes in variance are dependent on the dynamics of the gene frequencies. However, the general principles of the theory are also applicable for large effects if there is a continuous flux of advantageous mutations. Assume that the frequency of the favorable allele in the generation in which the neutral allele of reference appears by mutation is q1. The expected gene frequency of the selected allele in the next generation is q2 ≈ q1 + q1(1 − q1)t and the proportional change in genetic variance due to selection is q2(1 − q2)/q1(1 − q1). Drift also reduces the genetic variance by the factor 1 − 1/2Ne; therefore, Z1 = (1 − 1/2Ne)q2(1 − q2)/q1(1 − q1). Here we consider only the effect of the gamete associated with the neutral gene (i.e., we neglect
ASYMPTOTIC EFFECTIVE SIZE, HETEROZYGOSITY, AND POLYMORPHISM
The parameter Ne that we have derived is the asymptotic effective population size. If a neutral allele appears in the population at a given generation by mutation, the drift process will be initially weak on it, but random associations with selected genes will accumulate over generations making drift increase until an asymptotic value is reached. In the first generation, the magnitude of the drift process on the neutral allele can be quantified by the variance in allele frequency in the first generation. Although this refers only to the particular neutral genes that appeared one generation ago, we refer to it as the effective population size in the first generation,
Figure 1. Reduction of the probability of segregation and the heterozygosity contributed by a locus with a single copy of a neutral gene in the initial generation. The reductions are given as a percentage of the values in the initial generation. The effective population size (Ne,i) associated with that locus in generation i is also plotted. N = 100, L = 1, C2 = 0.02, and t = 0.01 (deleterious mutations model). The last element in each series corresponds to the asymptotic value (both heterozygosity and probability of segregation equal 0).
There is no simple solution for the Q′ terms for consecutive generations (except for infinite generations; i.e., Equation 6). Therefore, numerical methods have to be applied to estimate the values of the partial effective sizes in consecutive generations. If genetic variance for fitness is not large, in the first generation Ne,1 is close to the census size N of the population. In the following generations the effective size drops toward its asymptotic value (see Figure 1). For a new neutral mutation, the decay of genetic variance is 1/2Ne,1 in the first generation, 1/2Ne,2 in the second generation, and so on. A consequence of this cumulative effect of drift on new mutations is that there is not a simple formula to connect asymptotic population size, heterozygosity, and proportion of segregating sites for neutral alleles, as we address next.
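The cumulative effect of generation-specific drift can be sketched numerically. The code below assumes that the variability contributed by a new neutral mutation is multiplied by 1 − 1/(2Ne,i) in generation i, as implied by the decay rates just described. The Ne,i trajectory is invented for illustration and is not derived from the Q′ terms.

```python
def remaining_variability(partial_ne):
    """Fraction of the initial neutral variability left after len(partial_ne)
    generations, assuming a proportional loss of 1 / (2 * Ne_i) in generation i."""
    remaining = 1.0
    for ne_i in partial_ne:
        if ne_i <= 0:
            raise ValueError("effective sizes must be positive")
        remaining *= 1.0 - 1.0 / (2.0 * ne_i)
    return remaining


if __name__ == "__main__":
    # Hypothetical trajectory: Ne,i starts at the census size N = 100 and
    # declines toward an assumed asymptote of 60 (numbers invented).
    declining = [100, 90, 80, 70, 65, 62, 60, 60, 60, 60]
    print("declining Ne,i      :", remaining_variability(declining))
    print("constant census N   :", remaining_variability([100] * 10))
    print("constant asymptotic :", remaining_variability([60] * 10))
```

In this toy trajectory the retained variability lies between the values expected under a constant census size and under a constant asymptotic size, which is the reason no single Ne summarizes the process for young mutations.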
Heterozygosity: Under the infinite sites model, the heterozygosity contributed by a new mutation (i.e., with frequency 1/2N) is 2(1/2N)(1 − 1/2N) ≈ 1/N. Then, with a mutation rate μ per locus and generation, the number of new mutations per generation is 2Nμ, and the input of heterozygosity per generation is about 2μ. The neutral variability generated by these mutations decreases at an increasing rate, which is a function of the consecutive values of Ne,i, so the remaining proportion after τ generations is
Proportion of segregating sites: Under an infinite sites model, the proportion of segregating sites increases by 2Nμ, the number of new mutations per generation. The equilibrium proportion of segregating sites, s, can be obtained by calculating the probability that mutants appearing in previous generations are still segregating in the current one. Looking backward in time, the remaining fraction of the segregating sites produced τ generations ago is a function of the magnitude of the drift process until the current generation. As we have seen, this magnitude is represented by the partial Ne,i values from generation 1 to generation τ and it can be summarized by the harmonic mean Ne,Hτ of these τ values, i.e.,
The proportion of segregating sites is generally much more dependent on N than on Ne because only a small proportion of new mutations segregate for a long period. For example, if there is no selection, the s value for the whole population is approximately 4Nμ ln 2N (see Ewens 1979). The input of segregating sites per generation is 2Nμ. At equilibrium, this is also the number of sites that become monomorphic per generation. Therefore, the proportion of polymorphic loci that become monomorphic per generation is 2Nμ/s = 1/(2 ln 2N), which is a relatively large proportion. For example, for a population of N = 100, about 10% of the segregating sites are lost by drift every generation. This is also illustrated in Figure 1. The probability of segregation of an initially single-copy neutral allele has most of its reduction in the initial generations. Given this large rate of loss of polymorphic loci per generation, it is clear that the proportion of segregating sites is very dependent on the mutations arising few generations ago and, therefore, the Ne,i values of the initial generations have much influence. Because these initial Ne,i values are closer to the census size N than to the asymptotic effective size, Ne, the proportion of segregating sites in the whole population is only slightly dependent on Ne. On the contrary, the rate of loss of heterozygosity per generation is relatively small (1/2N with no selection). For example, for a population of size N = 100, it is only 0.5% per generation. Therefore, the heterozygosity is more dependent on the asymptotic effective population size, Ne. The above arguments indicate that if the asymptotic Ne is much smaller than the census size N, heterozygosity will be more affected by selection than the proportion of segregating sites because the latter depends strongly on N. This dependence of the asymptotic reduction of s, π, and Ne,i on population size predicted under a model of deleterious mutations is shown in Figure 2. 
The larger the population size the stronger the selection as the mutant effects are assumed to be constant (t = 0.01). Reductions of Ne,i and π are close and tend to be equal with large N (strong selection), as previously noted by Charlesworth et al. (1995) for background selection. The proportion of segregating sites is much less affected by the increase in strength of selection.
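The per-generation loss rates quoted above for the neutral case can be reproduced with a few lines. This sketch uses only the expressions given in the text (s ≈ 4Nμ ln 2N, input 2Nμ, and heterozygosity loss 1/2N); the function names and the mutation rate are illustrative.

```python
import math


def neutral_segregating_sites(n: float, mu: float) -> float:
    """Equilibrium proportion of segregating sites without selection, s ~ 4 N mu ln(2N)."""
    return 4.0 * n * mu * math.log(2.0 * n)


def segregating_site_loss_rate(n: float) -> float:
    """Proportion of polymorphic sites lost per generation, 2 N mu / s = 1 / (2 ln 2N)."""
    return 1.0 / (2.0 * math.log(2.0 * n))


def heterozygosity_loss_rate(n: float) -> float:
    """Proportion of heterozygosity lost per generation without selection, 1 / (2N)."""
    return 1.0 / (2.0 * n)


if __name__ == "__main__":
    n = 100
    print(f"segregating sites lost per generation: {segregating_site_loss_rate(n):.3%}")
    print(f"heterozygosity lost per generation:    {heterozygosity_loss_rate(n):.3%}")
```

For N = 100 the first rate is roughly 9.4%, which the text rounds to about 10%, and the second is 0.5%.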
Allele frequency spectrum: The application of the classic theory of Ne provides methods to predict the spectrum of frequencies of neutral genes a number of generations after their appearance (e.g., Crow and Kimura 1970). According to our results, these methods should consider the evolution of the consecutive values of the effective size over the τ generations, but the application would be quite complex. However, the precision is not much affected if the harmonic mean Ne,Hτ of the partial Ne,i values for the τ generations is used as the constant effective population size for the τ generations. The distribution of neutral gene frequencies in the population can then be computed as a combination of distributions for neutral mutations that appeared in the current generation, one generation ago, two generations ago, etc., up to infinity. An illustration of this is given in the next section.
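A minimal sketch of the harmonic mean Ne,Hτ of the partial effective sizes is given here. The function name and the example values of Ne,i are hypothetical.

```python
def harmonic_mean_ne(partial_ne):
    """Harmonic mean of the partial effective sizes Ne,1 ... Ne,tau."""
    if not partial_ne:
        raise ValueError("at least one Ne,i value is required")
    if any(ne_i <= 0 for ne_i in partial_ne):
        raise ValueError("effective sizes must be positive")
    return len(partial_ne) / sum(1.0 / ne_i for ne_i in partial_ne)


if __name__ == "__main__":
    partial_ne = [100, 90, 80, 70, 65, 62, 60]  # invented trajectory
    ne_h = harmonic_mean_ne(partial_ne)
    # The harmonic mean preserves the summed drift rate: sum 1/(2 Ne,i) = tau / (2 Ne,H).
    total_drift = sum(1.0 / (2.0 * ne_i) for ne_i in partial_ne)
    print(ne_h, total_drift, len(partial_ne) / (2.0 * ne_h))
```

The harmonic mean is the natural summary because it keeps the summed per-generation drift rate unchanged, so a constant size Ne,Hτ accumulates approximately the same drift over τ generations.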
Figure 2. Example of the dependence of the asymptotic reduction of proportion of segregating sites (s), heterozygosity (π), and effective population size (Ne) on the number of reproductive individuals (N). C2 = 0.001, t = 0.01 (deleterious mutations model), and L = 0.
EVALUATION OF RESULTS
The above predictions and equations were checked by Monte Carlo simulations. Random mating populations with N diploid individuals were simulated. The selective system was controlled by n loci evenly distributed in linear chromosomes. Further n neutral loci were allocated alternating with the selected loci. The population was initially run for thousands of generations so that the selective system could reach mutation-selection-drift equilibrium. Thereafter, two different sets of runs were carried out according to the objective. In the simulations used to evaluate Ne, alleles from each neutral locus were initially set at frequency 0.5. The population was then simulated for 100–300 generations until the asymptotic effective size was clearly reached. Fifty additional generations were run. At least 200 independent replicates of this process were simulated. The variance (Var_i) of the frequency of the neutral genes was computed for each generation i over loci and replicates. The effective population size at a given generation i was computed as Ne,i = 0.5(0.25 − Var_{i−1})/(Var_i − Var_{i−1}). The observed asymptotic Ne value was computed as the average of the Ne,i values of the 50 additional generations. A different set of simulations was run to evaluate the heterozygosity and the segregation of polymorphic loci. In this case, the neutral genes were introduced as mutants, and the population was run until the equilibrium heterozygosity and polymorphism were reached. The selective value of an individual was calculated as (1 + t)^k for the model of favorable mutations and (1 − t)^k for the model of detrimental mutations, where k is the number of mutants carried by the individual. Every generation the mean fitness of the population was set to 1, and the variance of relative fitnesses of individuals (C2) was computed.
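The estimator of Ne,i from consecutive variances of neutral allele frequency can be sketched as follows. As a sanity check, the example feeds it the variance trajectory expected in an ideal population of size N, Var_i = 0.25[1 − (1 − 1/2N)^i], which is a standard result and not simulation output from this study. Function names are illustrative.

```python
def ne_from_variances(var_prev: float, var_curr: float, pq0: float = 0.25) -> float:
    """Ne,i = 0.5 * (pq0 - Var_{i-1}) / (Var_i - Var_{i-1}),
    with pq0 = 0.25 for an initial neutral allele frequency of 0.5."""
    increment = var_curr - var_prev
    if increment <= 0:
        raise ValueError("variance must increase between generations")
    return 0.5 * (pq0 - var_prev) / increment


def ideal_variance_trajectory(n: int, generations: int, pq0: float = 0.25):
    """Variance of allele frequency in an ideal population of N diploids,
    Var_i = pq0 * [1 - (1 - 1/(2N))^i], for i = 0 .. generations."""
    return [pq0 * (1.0 - (1.0 - 1.0 / (2.0 * n)) ** i) for i in range(generations + 1)]


if __name__ == "__main__":
    variances = ideal_variance_trajectory(n=100, generations=20)
    estimates = [ne_from_variances(variances[i - 1], variances[i]) for i in range(1, 21)]
    print(estimates[:3])  # each estimate should recover N = 100 without selection
```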
Table 2. Simulations (and predictions in parentheses) based on multiplicative gene action with mutants of equal effect, t, on the heterozygote
Table 2 shows some simulations of asymptotic values of Ne, π, and s. Predictions were generally close to simulations. As was explained before, the effective size is progressively reduced over generations until the asymptotic value is reached. A comparison with simulations is made in Figure 3. Predictions of the equilibrium heterozygosity and proportion of segregating sites in Table 2 were made from these values of Ne,i in consecutive generations as explained above. As expected, the absolute reduction of Ne is generally greater than the reduction of heterozygosity and polymorphism (cf. Figure 2) because π and s depend not only on the asymptotic Ne but also on nonasymptotic values, particularly s. A tendency of convergence between the ratios π/π0 and s/s0 with increasing population size is predicted, as noted by Charlesworth et al. (1995) for background selection (see also Figure 2).
Predictions are also accurate when mutations of unequal effects are considered. For example, simulations from Nordborg et al. (1996) with U = 0.4, L = 1 and mutation effects with mean t = 0.04 drawn from a gamma distribution with parameters α = 0.70 and β = 0.032 give an average π/π0 of 0.67. The prediction from Equation 4a is 0.65, and that from the approximation (4b) (assuming no correlation between C2i and Q ′2i) is 0.67, suggesting that the shape of the distribution of effects is not very important for the effective population size.
Finally, Figure 4 represents an example of the agreement between observed and expected allele frequency spectra. The expected frequency distribution, under selection for the whole population of mutations originated τ generations ago, was obtained by using transition matrix methods. The partial Ne,i values for generations 1 to τ were predicted, and the harmonic mean Ne,Hτ of these was used as the constant effective size of mutations originated τ generations ago. Predictions (top line) were made by accumulating all the expected distributions for neutral mutations originated in all the previous generations and in the current one. Simulations (boxes) were very close to these predictions. The bottom line shows the expected distribution that would have been predicted under a pure neutral model without selection. This was calculated assuming the constant effective population size that explains the observed level of heterozygosity in the population.
Figure 3. Three examples of the progressive reduction of Ne,i associated with a new neutral mutation in a population of size N = 100. Simulated (boxes) and predicted (lines) values for 40 generations. The last element in each series corresponds to the asymptotic Ne value. C2 ≈ 0.044 and t = 0.05 (deleterious mutations model). Solid boxes, four chromosomes 1 Morgan long each; open boxes, one chromosome 1 Morgan long; and solid triangles, one chromosome with no recombination.
Figure 4. Example of gene spectrum in a population of 500 individuals under a multiplicative deleterious mutations model and no recombination. C2 = 0.00426, t = 0.01. Only gene frequencies between 0 and 0.02 are represented. Simulated values in boxes. Top line, prediction under selection; bottom line, prediction under a pure neutral model (see text for explanations).
DISCUSSION
The fundamental concept in our analysis is that the parameter Ne, which summarizes the magnitude of the drift process in a genomic region or in the whole genome, is a function of the rate of reduction of the covariance between the neutral genes and the selected system. This reduction depends on three factors: the genetic size of the genome (i.e., the recombination rate), the change of variance of the selected loci due to selection, and the reduction of variance due to drift. At equilibrium, the total rate of reduction is Vm/C2 = 1/2Ne + t for models in which this rate is independent of the gene frequencies (i.e., infinitesimal model or deleterious mutations model). When the effects of the selected loci on fitness are large in relation to Ne, say t ⪢ 1/2Ne, the relative influence of genetic drift is small and predictions become independent of Ne. In this case, there is full agreement with equations from Hudson and Kaplan (1995), Barton (1995), and Nordborg et al. (1996) for background selection (deleterious mutations). As the effect of the genes decreases, with t < 1/2Ne and getting close to the assumptions of the infinitesimal model, the predictions are more dependent on Ne, and the bigger the population, the smaller the ratio Ne/N (see Table 2).
Our predictions of Ne can be made in terms of compound parameters, such as the variance for fitness, C2, and the new input of mutational variance, Vm, without necessarily relying on mutation rates and mutational effects of spontaneous mutations, whose magnitudes are currently debated (e.g., Peck and Eyre-Walker 1997). Houle et al. (1996) have reviewed estimates of C2/Vm for a variety of traits and species, obtaining an average value of 50 for life-history traits. This is an estimate of the average persistence time of detrimental mutations. For viability in Drosophila, C2 is approximately 0.01 for the whole genome (Mukai 1988). Because the genome size of Drosophila melanogaster is about 1.25 M (considering that there is no recombination in males), substituting Vm/C2 = 0.02, C2 = 0.01, and L = 1.25 into Equations 8 and 9, we obtain Ne = 0.67N, which is a considerable reduction in effective size due to inherited differences in viability alone.
A main requisite for our model to work is the continuous flux of genetic variation for fitness in all the chromosome regions. Mutation introduces new variation at neutral sites while selection reduces the genetic variability. This requirement is far from the strong selective sweep model assumed by Wiehe and Stephan (1993), for which the hitchhiking of neutral alleles by favorable mutations can be considered as a two-step process. First, a strongly selected gene passes quickly through the population, wiping out linked variation, and second, polymorphism is recovered by mutation in a period where no hitchhiking occurs. Therefore, our equations for favorable mutations are not applicable to the assumptions made by Wiehe and Stephan (1993). For example, with parameters N = 1000, L = 0, t = 0.2, the average simulated heterozygosity (π/π0) is 0.031, 0.084, and 0.246 when a single selective locus is segregating all the time, one-third of the time, or one-twelfth of the time, respectively. The corresponding predictions obtained with our method are 0.028, 0.033, and 0.041, respectively. Thus, the two latter cases, which include periods of recovery of polymorphism, deviate from the assumptions of our model, and predictions become increasingly inaccurate.
Our derivation follows the arguments of Robertson (1961) and Santiago and Caballero (1995). The variance of long-term selective values (Q2C2) is partitioned into two components, one due to the chromosome carrying the neutral allele of reference (Q ′2) and the other due to the homologous chromosome (Q ″2). With no linkage, large populations and weak selection, both terms approximate a value of 4 and Equation 3, Ne = N/(1 + 4C2), is obtained, as deduced by Robertson (1961) and Santiago and Caballero (1995). However, Barton (1995) and Nordborg et al. (1996) arrived at the conclusion that with no linkage Ne = N/(1 + 2C2). Nordborg et al. (1996, Appendix iii) linked this to the previous expression by arguing that in the former the term C2 is, in fact, C2/2, because it refers to the variance of fitness of families (couples) instead of individuals. This is not correct, however. In the argument of Robertson (1961) and Santiago and Caballero (1995), C2 is the variance of the relative fitness values of couples because these were fixed (monogamous matings) but this was assumed to be the fitness associated with neutral alleles, and all four alleles (in the couple) had the same associated fitness. The model would be equivalent for monoecious populations (Caballero and Santiago 1995).
The reason for the confusion is clear from the derivation in this article. Barton (1995) and Nordborg et al. (1996) considered only the gamete carrying the neutral allele as the determinant for the fitness associated to this allele. This is the same as neglecting Q ″2, as we did, for example, to obtain Equation 5. Now, for large population size (Z ≈ 1), Q ′ ≈ 1/r, and from Equation 5, Ne = N/(1 + C2/2r2), which agrees with Barton's expression. If now r = 0.5, the above expression yields Ne = N/(1 + 2C2). However, neglecting Q″2 is allowed only for moderate or strong linkage because only for r ⪡ 0.5 is Q ′2 ⪢ Q ″2. The intuitive explanation is that with tight linkage, the fitness associated with the neutral allele depends mostly on that of the gamete carrying it and Q ′2 ⪢ Q ″2. However, for very loose linkage or no linkage, the fitness of the homologous gamete is also important: Q ′2 ≈ Q ″2 ≈ 4, and Ne = N/(1 + 4C2).
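The two limiting expressions discussed above can be compared numerically. This sketch evaluates Ne = N/(1 + C2/2r²), which neglects the homologous gamete, against Ne = N/(1 + 4C2) for unlinked loci. The function names and the values of N and C2 are illustrative assumptions.

```python
def ne_gamete_only(n: float, c2: float, r: float) -> float:
    """Ne = N / (1 + C2 / (2 r^2)): large population, homologous gamete neglected.
    Appropriate only for moderate or strong linkage (r well below 0.5)."""
    if not 0.0 < r <= 0.5:
        raise ValueError("r must lie in (0, 0.5]")
    return n / (1.0 + c2 / (2.0 * r * r))


def ne_unlinked(n: float, c2: float) -> float:
    """Ne = N / (1 + 4 C2): no linkage, both gametes contribute."""
    return n / (1.0 + 4.0 * c2)


if __name__ == "__main__":
    n, c2 = 1000.0, 0.01  # invented values
    # At r = 0.5 the gamete-only expression gives N / (1 + 2 C2),
    # which overestimates Ne relative to N / (1 + 4 C2).
    print(ne_gamete_only(n, c2, 0.5), ne_unlinked(n, c2))
    print(ne_gamete_only(n, c2, 0.05))  # tight linkage: a much stronger reduction
```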
To reduce the complexity of the derivation, we have assumed that the recombination rate is constant across the chromosome and that the neutral gene is located in the middle of a chromosome. An equivalent derivation can also be developed for a neutral gene at any location. The location of the neutral gene does not make a big difference unless the gene lies close to a chromosome tip. In this region, the effect of drift is smaller because closely linked selective genes can only (or mainly) be located on one side of the neutral gene, which reduces the random associations with selected genes by up to one-half. As these regions at both tips are very small, their weight on the average Ne value for the whole genome is negligible, and the result for the central location is a very good approximation to the average Ne. This effect indicates that Ne is mainly determined by the strength of selection acting on the region closely linked to the neutral locus. An equivalent conclusion has been reached by Nordborg et al. (1996) under the background selection model.
Regional variations in the frequency of recombination are often observed (see Lichten and Goldman 1995), with a general pattern of reduced recombination in proximal regions (e.g., Nachman and Churchill 1996). Additionally, genetic variability for fitness is not evenly distributed across the genome. The exact magnitude of these deviations is unknown, but the former equations could also be applied if the density of genetic variability for fitness were more or less proportional to the rate of recombination. Otherwise, the computation of the appropriate value of Q ′2 must consider the particular distribution of selected genes and the genetic distances between these genes and the neutral locus.
The reduction of effective size associated with a neutral gene is progressive: The magnitude of the drift process is smaller for new neutral mutations than for old ones, and this process accumulates on neutral genes over generations until an asymptotic value is reached. The consequence is that heterozygosity will always be larger than that expected if all the neutral genes in the population had a constant effective size equal to the asymptotic value (Ne) and, therefore, cannot be formally predicted in the simple way, 4Neμ. The magnitude of the underprediction depends on how quickly the asymptotic Ne value is reached (see Figure 3). For a given genome size or recombination rate, this depends on the rate of reduction of the genetic variance. Under the assumptions of the infinitesimal model, the reduction of the variance in the selected system will be mainly due to drift if selection is weak, the reduction of effective size will be slow, and the difference between the real heterozygosity and that expected from the asymptotic Ne will be the highest. As the effect of selected genes becomes larger, the rate of reduction increases and the asymptotic Ne is reached earlier. In a model of deleterious mutations of large effect (the background selection model), heterozygosity tends to be almost equal to 4Neμ as mutations reach their asymptotic value of Ne in a few generations. Nordborg et al. (1996) developed a prediction of the reduction of heterozygosity due to linked selected loci under background selection (formally π/π0), which is identical to our prediction of the asymptotic Ne/N when drift is not considered. This prediction becomes inexact as the population size or the effect of selected genes decreases (Nordborg et al. 1996), because the effective sizes associated with neutral alleles lie further and further away from the asymptotic value.
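The underprediction described above can be illustrated with a small numerical sketch. It rests on assumptions that are not part of the article: heterozygosity is accumulated with the standard low-mutation-rate argument (each generation adds 2μ of new heterozygosity, which then decays by a factor 1 − 1/(2Ne,i) in the i-th generation after the mutation arose), and the approach of Ne,i to its asymptote is a hypothetical geometric decay. All names and parameter values are illustrative.

```python
def ne_at_age(age, n_census, ne_asymptotic, decay):
    """Hypothetical Ne associated with a mutation in its `age`-th generation.

    Starts at the census size and decays geometrically toward the
    asymptotic value; this functional form is an assumption.
    """
    return ne_asymptotic + (n_census - ne_asymptotic) * decay ** (age - 1)


def expected_heterozygosity(mu, n_census, ne_asymptotic, decay, max_age):
    """Sum over mutation ages of 2*mu times the surviving heterozygosity."""
    total = 0.0
    surviving = 1.0  # fraction of initial heterozygosity still present
    for age in range(1, max_age + 1):
        total += 2.0 * mu * surviving
        ne_i = ne_at_age(age, n_census, ne_asymptotic, decay)
        surviving *= 1.0 - 1.0 / (2.0 * ne_i)
    return total


# Hypothetical illustration values.
MU = 1e-6
N_CENSUS = 1000.0
NE_ASYMPTOTIC = 500.0
MAX_AGE = 50000  # long enough for the surviving fraction to vanish

h_fast = expected_heterozygosity(MU, N_CENSUS, NE_ASYMPTOTIC, 0.9, MAX_AGE)
h_slow = expected_heterozygosity(MU, N_CENSUS, NE_ASYMPTOTIC, 0.999, MAX_AGE)

print("4 * Ne(asymptotic) * mu :", 4.0 * NE_ASYMPTOTIC * MU)
print("fast approach to Ne     :", h_fast)
print("slow approach to Ne     :", h_slow)
print("4 * N * mu              :", 4.0 * N_CENSUS * MU)
```

With a constant Ne the sum reduces to 4Neμ, so any excess over the asymptotic prediction comes from the early generations in which Ne,i is still close to N; the slower the approach, the larger the excess.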
An issue related to those above refers to the reductions in the probability of fixation of advantageous mutations because of their linkage to other selected loci (see Barton 1994). Mutants of very large effect, whose fate is decided in a few generations, are affected by the asymptotic Ne less than mutants of small effect and, therefore, the fixation probability of the former is reduced by a smaller amount (Caballero and Santiago 1998).
The progressive reduction of the effective size associated with mutations can also explain the apparent disconnection between heterozygosity and number of segregating sites under selection, which is the basis of those tests that compare the observed spectrum of gene frequencies with its expectation under a pure neutral model. As we have seen, the proportion of segregating sites depends little on Ne, as it is mostly due to recent mutations, which have associated effective sizes close to the census size of the population and far from the asymptotic value. For very large populations, however, the number of generations that effectively contribute to the proportion of segregating sites is larger. Therefore, the drop of effective size during the initial generations affects it more, making heterozygosity and number of segregating sites similarly reduced. This effect has been described by Charlesworth et al. (1995).
The spectrum of gene frequencies can be approximated from the evolution of the Ne associated with mutations over generations. For mutations that originated τ generations before the current generation, the magnitude of the drift process can be summarized by the harmonic mean (Ne,Hτ) of the Ne,i values from generation i = 1 to τ. The remaining proportion of heterozygosity can be predicted by (1 − 1/(2Ne,Hτ))^τ. Analogously, the proportion of segregating sites can be approximated from Ne,Hτ using the general theory of the effective population size (Equations 17–18). In other words, the spectrum of gene frequencies for mutations that originated τ generations ago is approximately that expected under a neutral model using the appropriate Ne,Hτ. Deviations from the pure neutral spectrum arise when the contributions of all previous generations are accumulated. Different spectra corresponding to different Ne,Hτ values of previous generations (from τ = 1 to ∞) are superimposed, one over the others, building a general spectrum that cannot be explained by a single Ne value under a neutral model (see Figure 4).
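The harmonic-mean summary described above can be sketched as follows. This is an illustration under stated assumptions rather than the computation used for the figures: the trajectory of Ne,i (a geometric decay from the census size to an asymptote) and all names and values are hypothetical. The generation-by-generation product is included only to check the harmonic-mean approximation.

```python
import math
from statistics import harmonic_mean


def ne_trajectory(n_census, ne_asymptotic, decay, tau):
    """Hypothetical Ne,i values for generations i = 1..tau (an assumption)."""
    return [ne_asymptotic + (n_census - ne_asymptotic) * decay ** (i - 1)
            for i in range(1, tau + 1)]


def remaining_heterozygosity(ne_values):
    """Return (Ne,H_tau, (1 - 1/(2 Ne,H_tau)) ** tau) for mutations of age tau."""
    tau = len(ne_values)
    ne_h = harmonic_mean(ne_values)
    return ne_h, (1.0 - 1.0 / (2.0 * ne_h)) ** tau


def exact_remaining_heterozygosity(ne_values):
    """Product of (1 - 1/(2 Ne,i)) over generations, for comparison."""
    return math.prod(1.0 - 1.0 / (2.0 * ne) for ne in ne_values)


# Hypothetical illustration values.
for tau in (1, 10, 100):
    values = ne_trajectory(1000.0, 500.0, 0.9, tau)
    ne_h, approx = remaining_heterozygosity(values)
    exact = exact_remaining_heterozygosity(values)
    print(f"tau = {tau}: Ne,H = {ne_h:.1f}, "
          f"approx = {approx:.5f}, exact = {exact:.5f}")
```

For recent mutations Ne,Hτ stays close to the census size, whereas for old mutations it approaches the asymptotic value, which is why cohorts of different ages follow neutral spectra with different effective sizes.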
When statistical tests are applied to compare predicted and observed spectra of gene frequencies, the finite size of the samples can make the deviations from the neutral model difficult to detect. Observations in natural populations of Drosophila show reduced diversity in regions with low recombination rates (Begun and Aquadro 1992), but most data show no deviations from the neutral spectrum (see Charlesworth et al. 1995). Although virtually any model considering directional selection could account for the observed correlation between nucleotide variation and recombination rate, simple selective sweep models with strong selection cannot explain the statistical agreement with the neutral spectrum (Hudson 1994; Braverman et al. 1995). Predictions using computer simulations reveal that the statistical agreement is consistent with the background selection model (Charlesworth et al. 1995; Hamblin and Aquadro 1996). The general theory that we have described can help to determine the conditions under which the background selection model, alone or combined with weak selective sweep models, could explain the pattern of observed variation.
Finally, some remarks concerning artificial selection can be made. In the general theory of quantitative traits, linkage is usually ignored as farm species generally have several chromosomes, suggesting that the assumption of free recombination is close to reality. Additionally, linkage makes the analytical model more cumbersome: Additive models are complicated by the generation of negative covariances between genes affecting fitness (Bulmer 1980; Santiago 1998). Although our theory takes this effect into account, through the term Z = 1 − Vm/C2, its application to a model in which parents are selected individuals and the genetic values of both “gametes” are negatively correlated is not straightforward. Further insight into these models is necessary to assess the impact of linkage in artificial selection programs.
Acknowledgments
We thank B. Charlesworth, W. G. Hill, and N. Barton for helpful comments. This work was supported by grant PB95-0909-C02-02 from Ministerio de Educación y Cultura (Spain) to E.S. and by grant 64102C605 from Universidad de Vigo to A.C.
Footnotes
- Communicating editor: B. S. Weir
- Received August 19, 1997.
- Accepted April 30, 1998.
- Copyright © 1998 by the Genetics Society of America