Genetics, Vol. 149, 2105-2117, August 1998, Copyright © 1998

Effective Size and Polymorphism of Linked Neutral Loci in Populations Under Directional Selection

Enrique Santiago (a) and Armando Caballero (b)
(a) Departamento Biología Funcional, Universidad de Oviedo, 33071 Oviedo, Spain
(b) Departamento Bioquímica, Genética e Inmunología, Universidad de Vigo, 36200 Vigo, Spain

Corresponding author: Enrique Santiago, Departamento de Biología Funcional, Universidad de Oviedo, 33071 Oviedo, Spain. E-mail: esr@sauron.quimica.uniovi.es

Communicating editor: B. S. WEIR


*  ABSTRACT

The general theory of the effective size ($N_e$) for populations under directional selection is extended to cover linkage. $N_e$ is a function of the association between neutral and selected genes generated by finite sampling. This association is reduced by three factors: the recombination rate, the reduction of genetic variance due to drift, and the reduction of genetic variance of the selected genes due to selection. If the genetic size of the genome ($L$ in Morgans) is not extremely small, the equation for $N_e$ is

$$N_e = N \exp\left[-\frac{C^2}{(1 - Z)L}\right],$$

where $N$ is the number of reproductive individuals, $C^2$ is the genetic variance for fitness scaled by the squared mean fitness, $(1 - Z) = V_m/C^2$ is the rate of reduction of genetic variation per generation, and $V_m$ is the mutational input of genetic variation for fitness. The above predictive equation of $N_e$ is valid for the infinitesimal model and for a model of detrimental mutations. The principles of the theory are also applicable to favorable mutation models if there is a continuous flux of advantageous mutations. The predictions are tested by simulation, and the connection with previous results is found and discussed. The reduction of effective size associated with a neutral mutation is progressive over generations until the asymptotic value (the above expression) is reached after a number of generations. The magnitude of the drift process is, therefore, smaller for recent neutral mutations than for old ones. This produces equilibrium values of average heterozygosity and proportion of segregating sites that cannot be formally predicted from the asymptotic $N_e$, but both parameters can still be predicted by following the drift along the lineage of genes. The spectrum of gene frequencies in a given generation can also be predicted by considering the overlapping of distributions corresponding to mutations that arose in different generations and with different associated effective sizes.


DIRECTIONAL selection generates differences in the reproductive success of individuals, increasing the variance of change in gene frequency and reducing the genetic diversity of neutral alleles. A population, then, behaves for these parameters like an ideal unselected population of size $N_e$, the effective population size (WRIGHT 1931), which is in general smaller than the actual number of reproductive individuals. If selection acts on a noninherited trait, $N_e$ is simply a function of the variance of the number of progeny per parent, and predictions have been developed for a variety of cases (see review by CABALLERO 1994).

When differences in fitness are inherited, the effective population size cannot be predicted from the variance of progeny number at a given generation. The drift process is amplified over generations because the random association that originated in a given generation between neutral and selected genes remains in descendants for a number of generations until it is eliminated by segregation and recombination. This problem was first addressed by ROBERTSON (1961), and recently, adequate solutions were given by WOOLLIAMS et al. (1993) and SANTIAGO and CABALLERO (1995) for directional selection on quantitative traits determined by an unlinked system of additive loci. But the drift process is larger when selection acts on a linked set of loci, as random associations last longer under linkage. Although previous formulations predict the inbreeding coefficient that is calculated by tracing the paths in a genealogy independently of the existence of linkage, this may be different from the real inbreeding coefficient that represents the probability of identity by descent of genes carried by individuals. The reason for this is that both copies of a neutral gene in the same individual do not have identical probabilities of being transmitted to the following generations, as they are embedded in different chromosomes with different selective genes.

HUDSON and KAPLAN (1995) and, more generally, NORDBORG et al. (1996) have derived expressions for predicting the nucleotide diversity at neutral loci under the background selection model (CHARLESWORTH et al. 1993), which is based on the continuous appearance of linked deleterious mutations in the population. BARTON (1995) made similar derivations for the fixation probability of a favorable allele. In parallel, models dealing with the hitchhiking of neutral genes caused by the spread of selectively favorable mutations at linked loci (MAYNARD-SMITH and HAIGH 1974) have been refined in recent years (WIEHE and STEPHAN 1993, and references therein).

Here, we develop a prediction of the effective population size under linkage, extending the argument of ROBERTSON (1961) and SANTIAGO and CABALLERO (1995). They have basically shown that the effective size of a population is a function of the variance of the cumulative selective values associated with neutral genes. Simplifying for second-order terms, under random mating and Poisson distribution of family size, the equation for the effective population size becomes

$$N_e = \frac{N}{1 + Q^2C^2},$$

where $N$ is the number of reproductive individuals, $C^2$ is the genetic variance of fitness of individuals (measured relative to the mean fitness), and $Q$ is the sum of a series of relative terms, the first one being the change of one unit in neutral gene frequency because of new associations created in a given generation and the rest being the remaining fractions of this change in the following generations. For example, for unlinked genes and weak selection, $Q \approx 1 + 1/2 + 1/4 + 1/8 + \ldots = 2$ (ROBERTSON 1961), because the average selective advantages of individuals (and, therefore, the changes in gene frequency of neutral alleles) are expected to be halved each generation in their descendants as a result of segregation and recombination. The complete argument can be found in the derivations leading to Equation 16 in SANTIAGO and CABALLERO (1995).
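As a quick numerical check of the series above, the following Python sketch sums the halving terms and plugs the result into the effective-size expression. It assumes the form $N_e = N/(1 + Q^2C^2)$, which agrees with the statement that $Q^2C^2$ approximates $4C^2$ with no linkage. The function names and the example values (N = 100, C² = 0.02) are illustrative choices, not taken from the derivation itself.

```python
def q_unlinked(generations: int) -> float:
    """Partial sum of 1 + 1/2 + 1/4 + ... over `generations` terms."""
    return sum(0.5 ** i for i in range(generations))


def effective_size(n: float, q: float, c2: float) -> float:
    """Assumed form Ne = N / (1 + Q^2 * C^2) (random mating, Poisson family size)."""
    return n / (1.0 + q * q * c2)


if __name__ == "__main__":
    q = q_unlinked(60)  # the partial sums approach 2 for unlinked genes
    print(q, effective_size(100, q, 0.02))  # illustrative N and C^2
```

With Q = 2 the denominator becomes 1 + 4C², the unlinked result attributed to ROBERTSON (1961).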

Thus, the term $Q^2C^2$ is the variance of the long-term selective values, and with no linkage and weak selection it approximates $4C^2$. This does not hold under linkage, but the argument can be rebuilt to consider the decline of the association between a neutral gene and selected genes on chromosomes. The problem of the reduction of the effective size under linkage thus reduces to finding the appropriate value of $Q^2$, and the same argument used by SANTIAGO and CABALLERO (1995) can be followed.

Initially, to make predictions independently of gene frequencies and effects, an infinitesimal model (an infinite number of genes of small effect) is considered, but predictive equations are also valid for the background selection model of CHARLESWORTH et al. (1993), and we connect our equations with those of NORDBORG et al. (1996) for this model. Moreover, we apply the principles of the theory to the selective sweep model (MAYNARD-SMITH and HAIGH 1974) in considering a continuous flux of weakly advantageous mutations, instead of rare mutations of strong favorable effect passing quickly through the population and further recovery of variation by neutral mutation (WIEHE and STEPHAN 1993). Finally, we show that the two categories of estimators of polymorphism, basically mean heterozygosity per site and proportion of segregating sites, are related to the effective size to different extents, but predictions can still be made from effective size theory.


*  DERIVATION OF EXPRESSIONS

The general model:
We consider a monoecious diploid population with random mating and a constant number of reproductive individuals, $N$. Every individual in the population is made up of two haploid homologous complements with $\nu$ chromosomes, each $l$ Morgans (M) long. (Table 1 shows the most common notation used in this article.) Each complement is referred to as a "gamete" (i.e., there are $2N$ gametes in the population). It is assumed that there is no genetic correlation between gametes in parents. The mapping function of HALDANE (1919), $r = (1 - e^{-2x})/2$, is assumed to relate the recombination fraction $r$ and the genetic distance $x$ in Morgans. A large number $n$ of loci uniformly distributed on the chromosomes determines fitness. Allelic effects can be different for different loci, but gene action is additive within loci; that is, the fitness value of the heterozygote is the average value of the corresponding homozygotes. This latter assumption, however, can be removed for some models (see below). Gene effects are multiplicative between loci. This genetic system is at mutation-selection-drift equilibrium with a mean fitness of one; i.e., each parent has two descendants on average, and the genetic variance for fitness of individuals is $C^2$, which is assumed to be small.


 

 
Table 1. Summary of most common notations

As the effects of loci are multiplicative, the relationship between the variance of individuals and the contributions of the $n$ selective loci to variation is $C^2 = \prod_{j=1}^{n}(1 + c_j^2) - 1$, where $c_j^2$ is the squared coefficient of variation contributed by locus $j$, i.e., the variance for fitness of the locus, scaled such that the average fitness is 1.

Consider a neutral locus in the middle of a chromosome. We assume that the neutral alleles at this locus are initially produced by mutation, but this is not a necessary assumption of the model. Due to the finite size of the population, the sampling process generates random associations between the neutral alleles and selected loci. The expected change in gene frequency of the neutral allele ($S$) is the covariance between the frequency of the allele in gametes ($p$) and the selective value ($f$) of individuals carrying the gametes, $S = \mathrm{cov}(p, f)$ (see SANTIAGO and CABALLERO 1995, p. 1016). We derive this expected change next.

Let $p_i$ be the frequency of an allele of the neutral locus in gamete $i$ ($p_i$ can be 0 or 1). For the moment, we consider one single selected locus $j$ with additive effects of alleles. Locus $j$ is at a genetic distance of $x$ M from the neutral locus. Let us consider a copy of the neutral allele present in a given individual, and let $f'_j$ be the selective value contributed by the selected allele present in the same gamete as the neutral allele and $f''_j$ the selective value contributed by the homologous selected allele in the other gamete. Under random mating, the expected change in gene frequency (i.e., covariance) of the copy of the neutral allele in the first generation ($S_1$) can be partitioned into the change due to the selected allele in the same gamete as the neutral gene ($S'_1$) and the change due to the homologous selected allele in the other gamete in the same individual ($S''_1$),

$$S_1 = S'_1 + S''_1.$$

A fraction of the random associations generated in the first generation will remain in the following generations even if the population is expanded to an infinite size after the first generation. The expected value of the remaining covariances in the second generation depends on two factors: the change in expressed genetic variance of the selected locus and the recombination rate between the selected and the neutral loci. The first factor affects both partial covariances ($S'_1$ and $S''_1$) in an identical way: every generation, the genetic variation of the selected locus is assumed to be reduced by selection and drift by a constant proportion $(1 - Z)$. Thus, both covariances are reduced to a proportion $Z$. In contrast, the decline because of recombination is different for the two partial covariances. The association between the neutral allele and the selective value of the same gamete is maintained with probability $1 - r = (1 + e^{-2x})/2$ (i.e., if they do not recombine). Therefore, the expected partial covariance that remains in generation 2 is

$$S'_2 = S'_1(1 - r)Z.$$

The effect of recombination on the other partial covariance (between the neutral allele and the selected gene in the other gamete) is opposite to the previous one. Recombination incorporates the selected allele of the homologous gamete into the gamete carrying the neutral gene with probability $r$. Therefore, the remaining covariance in generation 2 is

$$S''_2 = S''_1 rZ.$$

In the following generations, both covariances are reduced in the same proportion $(1 - r)Z$ per generation, the selected allele remaining in the same gamete as the neutral gene as the condition for the maintenance of the association, i.e.,

$$S'_3 = S'_2(1 - r)Z, \qquad S''_3 = S''_2(1 - r)Z,$$


and so on. The sum of all these covariances from generation 1 to infinity is the total change in gene frequency over generations due to the association newly created between the neutral allele and the selected locus $j$ in the initial generation. New associations are created between the neutral gene and the selected locus in successive generations until an asymptotic stage is reached. From SANTIAGO and CABALLERO (1995) we note that this asymptotic stage is obtained as the sum of the expected changes in gene frequency over generations, given a change of one unit in the first generation,

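$$Q'_j = \frac{1}{S'_1}\sum_{i=1}^{\infty} S'_i = \frac{1}{1 - (1 - r)Z} = \frac{2}{2 - Z(1 + e^{-2x})} \qquad (1a)$$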

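$$Q''_j = \frac{1}{S''_1}\sum_{i=1}^{\infty} S''_i = 1 + \frac{rZ}{1 - (1 - r)Z} = \frac{2(1 - Ze^{-2x})}{2 - Z(1 + e^{-2x})} \qquad (1b)$$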

[Note that in the derivation of Equation 17 of SANTIAGO and CABALLERO (1995) the term $r$ has a different meaning than in this article, being the correlation of gene frequencies between mates. In the present derivation this term is zero because random mating and large population sizes are assumed.]
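The accumulation of covariances described above can be checked numerically. The sketch below is a minimal illustration, not the authors' code: it adds up the generation-by-generation terms (a factor $(1 - r)Z$ per generation for the same-gamete term, a first factor $rZ$ followed by $(1 - r)Z$ for the other-gamete term) and compares the totals with the closed forms obtained by summing those geometric series. The Haldane mapping is used for $r$; the function names and the values x = 0.1 M and Z = 0.95 are hypothetical choices made for illustration.

```python
import math


def haldane_r(x: float) -> float:
    """Recombination fraction for a map distance of x Morgans (Haldane mapping)."""
    return 0.5 * (1.0 - math.exp(-2.0 * x))


def q_same_gamete(x: float, z: float) -> float:
    """Closed-form sum for the selected allele in the same gamete (Q'_j)."""
    r = haldane_r(x)
    return 1.0 / (1.0 - (1.0 - r) * z)


def q_other_gamete(x: float, z: float) -> float:
    """Closed-form sum for the selected allele in the other gamete (Q''_j)."""
    r = haldane_r(x)
    return 1.0 + r * z / (1.0 - (1.0 - r) * z)


def q_by_summation(x: float, z: float, generations: int = 10000):
    """Add the per-generation terms explicitly, starting from a unit change."""
    r = haldane_r(x)
    q_same, q_other = 1.0, 1.0      # generation 1: one unit each
    same = (1.0 - r) * z            # generation 2, same gamete
    other = r * z                   # generation 2, other gamete
    for _ in range(generations):
        q_same += same
        q_other += other
        same *= (1.0 - r) * z       # later generations decay identically
        other *= (1.0 - r) * z
    return q_same, q_other


if __name__ == "__main__":
    print(q_by_summation(0.1, 0.95))
    print(q_same_gamete(0.1, 0.95), q_other_gamete(0.1, 0.95))
```

Both functions return 2/(2 - Z) for very large x (free recombination), and the other-gamete sum returns 1 at x = 0, which matches the limits quoted in the text.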

ROBERTSON (1961) and SANTIAGO and CABALLERO (1995) showed that the effective population size can be predicted from the variance of the cumulative selective values associated with the neutral gene, as $N_e = N/(1 + Q^2C^2)$. The variance of the cumulative selective values due to locus $j$ is $Q_j^2c_j^2$, and this can again be partitioned into the variance due to the selected allele originally located in the gamete with the neutral gene, and the variance due to the selected allele in the other gamete,

$$Q_j^2c_j^2 = \tfrac{1}{2}Q'^2_jc_j^2 + \tfrac{1}{2}Q''^2_jc_j^2.$$

If j were the only locus with effect on fitness in the genome, the effective population size would be

$$N_e = \frac{N}{1 + \tfrac{1}{2}\left(Q'^2_j + Q''^2_j\right)c_j^2} \qquad (2)$$

From Equation 1a and Equation 1b we note that $Q'_j$ and $Q''_j$ take the value $2/(2 - Z) \approx 2$ when the neutral gene and the selected locus are located in different chromosomes (i.e., $x = \infty$), the population size is large, and selection does not change the genetic variance very quickly (i.e., $Z \approx 1$). Therefore, with no linkage, $Q'^2 = Q''^2 \approx 4$, and Equation 2 yields

$$N_e \approx \frac{N}{1 + 4C^2} \qquad (3)$$

(ROBERTSON 1961; SANTIAGO and CABALLERO 1995). BARTON (1995) and NORDBORG et al. (1996, Appendix iii) arrived at the conclusion that with no linkage Ne = instead of Equation 3, but this is not correct as we discuss.

Now consider the $n$ selected loci with different contributions to the variance for fitness. With multiplicative effects among them, the total variance of the cumulative selective values for all the selective loci in the genome is

$$\prod_{j=1}^{n}\left[1 + \tfrac{1}{2}\left(Q'^2_j + Q''^2_j\right)c_j^2\right] - 1.$$

Therefore, the asymptotic value of Ne is

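$$N_e = N\prod_{j=1}^{n}\left[1 + \tfrac{1}{2}\left(Q'^2_j + Q''^2_j\right)c_j^2\right]^{-1} \qquad (4a)$$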

If $c_j^2$ and $Q_j^2$ values are uncorrelated (i.e., independence between $Z$ and $c^2$ values), Equation 4a reduces to

$$N_e \approx N\exp\left[-\tfrac{1}{2}\left(Q'^2 + Q''^2\right)C^2\right] \qquad (4b)$$
where $Q'^2 = \sum_{j=1}^{n} Q'^2_j/n$ and $Q''^2 = \sum_{j=1}^{n} Q''^2_j/n$.

For large populations $Q''_j$ is nearly 2 when linkage is not very tight (Equation 1b), and it asymptotically tends to 1 as $x$ tends to 0. Thus, the average $Q''^2$ ranges from 1 (complete linkage) to 4 (no linkage). $Q'^2$ approximates $1/r^2$ (Equation 1a) in large populations under weak selection (i.e., $Z \approx 1$), so it may take values much larger than 4 for tight linkage ($r \ll 1/2$). Thus, $Q''^2$ can usually be neglected relative to $Q'^2$, and Equation 4b can be reduced to

$$N_e \approx N\exp\left(-\frac{Q'^2C^2}{2}\right) \qquad (5)$$
without losing much precision. The term $Q'^2$, which refers to the effect of selected genes in the same gamete as the neutral gene, has two components. One is due to selected loci in chromosomes other than that of the neutral locus (with probability $[\nu - 1]/\nu$). The other component is due to loci in the chromosome carrying the neutral gene (with probability $1/\nu$). As the neutral gene is assumed to be located in the middle of the chromosome, the second component can be obtained by integration over one-half of the chromosome length. Thus, using Equation 1a and the above probabilities,

$$Q'^2 = \frac{\nu - 1}{\nu}\left(\frac{2}{2 - Z}\right)^2 + \frac{1}{\nu}\,\frac{2}{l}\int_0^{l/2}\left[\frac{2}{2 - Z(1 + e^{-2x})}\right]^2 dx.$$

Numerical analysis (data not shown) indicates that the relevant parameter is the product $L = \nu l$, that is, the genetic size of the genome. Variations in the distribution of the sizes of the chromosomes do not make much difference if the size of the whole genome is constant. Thus, the first term in the above equation can be dropped by setting $\nu = 1$ and substituting $l$ by $L$:

$$Q'^2 = \frac{2}{L}\int_0^{L/2}\left[\frac{2}{2 - Z(1 + e^{-2x})}\right]^2 dx \qquad (6)$$

The following approximations to the above expression can be made:

$$Q'^2 \approx \frac{2}{Z(1 - Z)L} \approx \frac{2}{(1 - Z)L} \qquad (7)$$

Then, a general expression for $N_e$ with linkage can be obtained by substituting $Q'^2$ from Equation 6 into Equation 5. For $L > 0.2$ or so, using the approximation (7), Equation 5 can be simplified to

$$N_e \approx N\exp\left[-\frac{C^2}{(1 - Z)L}\right] \qquad (8)$$

If selection is weak and linkage is not very tight, i.e., the exponent is smaller than 1 or so, Equation 8 can be expressed in a form closer to the classical equations for the effective population size, $N_e \approx N/[1 + C^2/((1 - Z)L)]$.
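To see how close the simplified exponential prediction is to the averaged same-gamete term, the sketch below integrates $Q'^2_j$ numerically over half the genome and compares the two resulting values of $N_e$. It rests on stated assumptions rather than on the printed equations: the genome-wide average is taken as $(2/L)\int_0^{L/2}[2/(2 - Z(1 + e^{-2x}))]^2dx$, Equation 5 is taken as $N\exp(-Q'^2C^2/2)$, and Equation 8 as $N\exp[-C^2/((1 - Z)L)]$. The midpoint rule, the function names, and the value Z = 0.95 are illustrative choices.

```python
import math


def q_sq_same_gamete(x: float, z: float) -> float:
    """Squared same-gamete term at map distance x (Haldane mapping assumed)."""
    return (2.0 / (2.0 - z * (1.0 + math.exp(-2.0 * x)))) ** 2


def mean_q_sq(genome_length: float, z: float, steps: int = 20000) -> float:
    """Midpoint-rule average of the squared term over [0, L/2]."""
    half = genome_length / 2.0
    h = half / steps
    total = sum(q_sq_same_gamete((k + 0.5) * h, z) for k in range(steps))
    return total * h / half


def ne_from_mean_q_sq(n: float, c2: float, q_sq: float) -> float:
    """Assumed form of Equation 5: N * exp(-Q'^2 * C^2 / 2)."""
    return n * math.exp(-q_sq * c2 / 2.0)


def ne_simplified(n: float, c2: float, z: float, genome_length: float) -> float:
    """Assumed form of Equation 8: N * exp(-C^2 / ((1 - Z) * L))."""
    return n * math.exp(-c2 / ((1.0 - z) * genome_length))


if __name__ == "__main__":
    n, c2, z, length = 100, 0.02, 0.95, 1.0  # Z is an illustrative value
    print(ne_from_mean_q_sq(n, c2, mean_q_sq(length, z)))
    print(ne_simplified(n, c2, z, length))
```

For these inputs the two predictions differ by a few percent, in line with the statement that the simplification is adequate when L is not very small.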

Application to particular genetic systems:
The above equations for predicting the effective population size are a function of the proportional reduction of the genetic variation $(1 - Z)$ at selected loci. Two processes are involved in the dynamics of the genetic variation of loci: selection and drift. It is generally assumed that drift eliminates variation at a constant rate $1/2N_e$. The change in genetic variance due to selection depends on the genetic system. For some models, this change is constant. In particular, phenotypic selection on an infinitesimal model (BULMER 1980; SANTIAGO 1998) and the background selection model (CHARLESWORTH et al. 1993) erode variation at a rate that is independent of the changes in gene frequency at particular loci.

At equilibrium, the mutational input per generation ($V_m$) equals the loss of variation by drift and selection. Therefore, the proportion of the expressed genetic variance $C^2$ that is lost by drift and selection per generation is $V_m/C^2$. The remaining fraction of the expressed variation, which is expected to be maintained after one generation of selection, is

$$Z = 1 - \frac{V_m}{C^2} \qquad (9)$$

This term can be substituted in the previous equations to obtain the appropriate $Q'^2$ and $N_e$ values. For example, the predictive Equation 8 becomes $N_e \approx N\exp[-(C^2)^2/(LV_m)]$.
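The substitution can be verified in a few lines. The sketch below assumes that Equation 8 has the form $N\exp[-C^2/((1 - Z)L)]$, which is the form that turns into the $V_m$ expression quoted above once $Z = 1 - V_m/C^2$ is inserted. The function names and the numerical values are hypothetical and only serve the check.

```python
import math


def retained_fraction(c2: float, vm: float) -> float:
    """Z = 1 - Vm / C^2: fraction of expressed variance kept after one generation."""
    return 1.0 - vm / c2


def ne_from_z(n: float, c2: float, z: float, genome_length: float) -> float:
    """Assumed form of Equation 8 written in terms of Z."""
    return n * math.exp(-c2 / ((1.0 - z) * genome_length))


def ne_from_vm(n: float, c2: float, vm: float, genome_length: float) -> float:
    """Same prediction written in terms of the mutational input Vm."""
    return n * math.exp(-(c2 ** 2) / (genome_length * vm))


if __name__ == "__main__":
    n, c2, vm, length = 100, 0.02, 0.001, 1.0  # illustrative values
    z = retained_fraction(c2, vm)
    print(z, ne_from_z(n, c2, z, length), ne_from_vm(n, c2, vm, length))
```

Both routes give the same number, so either $Z$ or $V_m$ can be used as the input, depending on which quantity is easier to estimate.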

Infinitesimal model: All the previous equations apply under the infinitesimal model. Favorable and deleterious mutation models reduce to the infinitesimal model if effects are very small. If the population is small, selection is not very strong, and linkage is not very tight, the equilibrium variance can be approximated by $C^2 = 2N_eV_m$ (LYNCH and HILL 1986; see SANTIAGO 1998 for a general equation to predict $C^2$ and $V_m$ under linkage), and $Z = 1 - V_m/C^2 \approx 1 - 1/2N_e$.

Deleterious mutations model: This is equivalent to the background selection model of CHARLESWORTH et al. (1993). Predictions of heterozygosity for this model have been developed by HUDSON and KAPLAN (1995) and more generally by NORDBORG et al. (1996). In what follows we show expressions for $N_e$ that generalize those predictions, including the effect of finite populations. Expressions obtained in the previous section are fully applicable, but we consider now a selection coefficient $t$ against heterozygotes. In this case, the assumption previously made of additive gene action within a locus can be removed, because for the background selection model, the effect on the heterozygote and not on the mutant homozygote is critical. If the frequency of a deleterious allele in a particular generation $i$ is $q_i$, the genetic variance contributed by this locus is proportional to $q_i(1 - q_i) \approx q_i$, as the deleterious allele frequency will be generally small, and the expected gene frequency in the next generation is $q_{i+1} \approx q_i - q_i(1 - q_i)t \approx q_i(1 - t)$. Therefore, the proportional change in the genetic variance of the selected locus due to selection is $q_{i+1}/q_i \approx 1 - t$. This is the factor by which genetic variance is changed by selection for this model. This result can also be obtained directly from Equation 9, noting that under mutation-selection balance $C^2 = Ut$ (CROW and KIMURA 1970) and $V_m = Ut^2$, where $U$ is the total genomic (diploid) mutation rate for detrimental genes. Combining both drift and selection, the reduction of the association between neutral and selected genes in one generation is approximately

$$Z \approx (1 - t)\left(1 - \frac{1}{2N_e}\right) \approx 1 - t - \frac{1}{2N_e} \qquad (10)$$

Equation 10 can also be obtained from the formula by KEIGHTLEY and HILL 1988 Down and BURGER et al. 1988 Down, i.e., C2 = . Substituting into Equation 9 we get Equation 10. Thus, the appropriate value of Q'j can be obtained from Equation 1a,

$$Q'_j = \frac{2}{2 - \left(1 - t - 1/2N_e\right)\left(1 + e^{-2x}\right)} \qquad (11)$$

The value of $Q'^2$ is given by Equation 6; if the genome size is not extremely small ($L > 0.2$), using (7) and (10) we get $Q'^2 \approx 2/[(t + 1/2N_e)L]$, and Equation 8 becomes

$$N_e \approx N\exp\left[-\frac{C^2}{\left(t + 1/2N_e\right)L}\right] \qquad (12)$$

When the effect of drift is negligible (i.e., t {Gt} 1/2Ne), then C2 = Ut and Ne = N exp() {approx} , which agrees with the approximation of N. H. BARTON (unpublished results; see p. 671 of CABALLERO 1994 Down; BARTON 1995 Down). This equation can also be derived from Equation 4aEquation 4b of NORDBORG et al. 1996 Down, after substituting our Equation 11, which is in fact the average value of the cumulative effect of a large number of selective loci evenly spaced on the genome.

Without recombination ($L = 0$) and $t \gg 1/2N_e$, Equation 6 reduces to $Q'^2 = 1/t^2$. Substituting this and $C^2 = Ut$ into Equation 5, $N_e \approx N\exp[-U/(2t)]$. This equation is identical to the expression of KIMURA and MARUYAMA (1966) and HAIGH (1978) for the size of the chromosome class with the lowest number of deleterious mutants. As all the chromosomes in the population will be derived from one member of the best class, the size of this chromosome class is indeed the effective population size given in number of chromosomes. CHARLESWORTH et al. (1993) found the equivalent algebraic solution for the heterozygosity, $\pi = \pi_0\exp[-U/(2t)]$, where $\pi_0$ represents the expected heterozygosity if there is no selection.
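In a finite population the prediction for the deleterious mutations model is implicit, because the drift term $1/2N_e$ appears inside the exponent. The sketch below solves it by bisection under two explicit assumptions that are not printed in the surviving text: that the retained fraction is $Z \approx 1 - t - 1/2N_e$, and that Equation 12 reads $N_e = N\exp\{-C^2/[(t + 1/2N_e)L]\}$. The function name and the bisection scheme are illustrative; the inputs N = 100, L = 1, C² = 0.02, and t = 0.01 are the values listed in the caption of Figure 1.

```python
import math


def ne_deleterious(n: float, c2: float, t: float, genome_length: float) -> float:
    """Solve Ne = N * exp(-C^2 / ((t + 1/(2 Ne)) * L)) for Ne by bisection."""

    def predicted(ne: float) -> float:
        return n * math.exp(-c2 / ((t + 1.0 / (2.0 * ne)) * genome_length))

    lo, hi = 1e-9, float(n)   # the solution lies between ~0 and the census size
    for _ in range(200):      # fixed number of halvings, ample for double precision
        mid = 0.5 * (lo + hi)
        if mid - predicted(mid) < 0.0:
            lo = mid          # prediction still above the trial value: move up
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(ne_deleterious(100, 0.02, 0.01, 1.0))
```

Bisection is used because the right-hand side decreases as the trial $N_e$ grows, so the difference between the trial value and the prediction crosses zero exactly once. For a very large census size the drift term vanishes and the solution approaches $N\exp[-C^2/(tL)]$.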

Favorable mutations model: Assume that the selective values of the three genotypes at any selective locus $j$ are $1$, $1 + t$, and $1 + 2t$, respectively. The predictive equations previously shown do not hold if favorable mutations are not effectively neutral (i.e., $t > 1/2N_e$, after KIMURA 1983), as changes in variance are dependent on the dynamics of the gene frequencies. However, the general principles of the theory are also applicable for large effects if there is a continuous flux of advantageous mutations. Assume that the frequency of the favorable allele in the generation in which the neutral allele of reference appears by mutation is $q_1$. The expected gene frequency of the selected allele in the next generation is $q_2 \approx q_1 + q_1(1 - q_1)t$ and the proportional change in genetic variance due to selection is $q_2(1 - q_2)/q_1(1 - q_1)$. Drift also reduces the genetic variance by a factor $1 - 1/2N_e$; therefore, $Z_1 = [q_2(1 - q_2)/q_1(1 - q_1)](1 - 1/2N_e)$. Here we consider only the effect of the gamete associated with the neutral gene (i.e., we neglect $Q''_j$), obtaining $Q'_j$ with the same argument leading to Equation 1a. If the frequency of recombination between the neutral gene and the selected locus $j$ is $r$, and the covariance between them in the first generation is $S'_1$, the expected changes in gene frequency in the following generations are

$$S'_2 = S'_1(1 - r)Z_1, \qquad S'_3 = S'_1(1 - r)^2Z_2,$$



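and so on. $Z_i$ is the proportional change in genetic variance from the initial generation to generation $i$ due to selection and drift. Therefore, the value of $Q'_j$ given an initial frequency $q_1$ for the selected locus when the neutral gene appears is (see Equation 1a)

$$Q'_j(q_1) = 1 + \sum_{i=1}^{\infty}(1 - r)^iZ_i, \qquad Z_i = \frac{q_{i+1}(1 - q_{i+1})}{q_1(1 - q_1)}\left(1 - \frac{1}{2N_e}\right)^i,$$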

where $i$ represents the successive generations and $q_i$ are the sequential frequencies of the selective allele in the consecutive generations, which are obtained as $q_i \approx q_{i-1} + q_{i-1}(1 - q_{i-1})t$. Now, the neutral mutation may appear when the selected allele has any gene frequency ($q_1$) in the range 0 to 1. Thus, the appropriate value of $Q'_j$ is the weighted mean of the $Q'_j(q_1)$ values corresponding to all the possible initial frequencies $q_1$ in the range 0 to 1, the weights being the product of the proportional contribution of the possible initial frequencies to the observed genetic variance and the probability of being at each possible value of the initial frequency. In a deterministic mutation-selection model, the probability of having a particular frequency, $q$, is proportional to $1/[q(1 - q)]$ (CROW and KIMURA 1970), and the contribution of the gene to the variance is proportional to $q(1 - q)$. Therefore, the product of these terms is independent of the gene frequency for large $N_et$. Hence, $Q'_j$ can be approximated as the average value of a number $m$ of $Q'_j(q_1)$ values corresponding to initial frequencies $q_1$ evenly spaced through the spectrum of gene frequencies from 0 to 1,

$$Q'_j \approx \frac{1}{m}\sum_{k=1}^{m}Q'_j(q_{1,k}) \qquad (13)$$

The appropriate $Q'^2$ value of the neutral locus is the average value of the $Q'^2_j$ values corresponding to all the selective loci $j$ in the genome. This average value has to be substituted into Equation 5. Therefore, although it seems difficult to reach a simple algebraic equation to predict $N_e$ for the model of favorable mutations, the principles previously shown can be applied to find numerical approximations.
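The numerical procedure for one selected locus can be sketched as follows. This is one reading of the text rather than the authors' program: it assumes that the term for generation $i + 1$ equals $(1 - r)^i$ times the cumulative change in variance, the latter being $q_{i+1}(1 - q_{i+1})/[q_1(1 - q_1)]$ multiplied by $(1 - 1/2N_e)^i$, with the deterministic recursion $q_i \approx q_{i-1} + q_{i-1}(1 - q_{i-1})t$. Initial frequencies are taken at the midpoints of $m$ equal intervals to avoid the endpoints 0 and 1; that choice, the stopping rule, and all names are hypothetical.

```python
def q_prime_given_start(q1: float, t: float, r: float, ne: float,
                        max_generations: int = 100000, eps: float = 1e-12) -> float:
    """Accumulated same-gamete change for a selected allele starting at q1."""
    total = 1.0                      # unit change in the first generation
    q = q1
    base_var = q1 * (1.0 - q1)       # variance scale in the initial generation
    carry = 1.0                      # (1 - r)^i * (1 - 1/(2 Ne))^i
    for _ in range(max_generations):
        q = q + q * (1.0 - q) * t    # deterministic frequency recursion
        carry *= (1.0 - r) * (1.0 - 1.0 / (2.0 * ne))
        term = carry * q * (1.0 - q) / base_var
        total += term
        if term < eps:               # later terms are negligible
            break
    return total


def q_prime_average(t: float, r: float, ne: float, m: int = 200) -> float:
    """Unweighted mean over m evenly spaced initial frequencies in (0, 1)."""
    starts = [(k + 0.5) / m for k in range(m)]
    return sum(q_prime_given_start(q1, t, r, ne) for q1 in starts) / m


if __name__ == "__main__":
    print(q_prime_average(t=0.05, r=0.1, ne=1000.0))  # illustrative inputs
```

As a sanity check, with $t = 0$ the variance ratio stays at one and the sum collapses to $1/[1 - (1 - r)(1 - 1/2N_e)]$, the constant-$Z$ geometric series with $Z = 1 - 1/2N_e$.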


*  ASYMPTOTIC EFFECTIVE SIZE, HETEROZYGOSITY, AND POLYMORPHISM

The parameter Ne that we have derived is the asymptotic effective population size. If a neutral allele appears in the population at a given generation by mutation, the drift process will be initially weak on it, but random associations with selected genes will accumulate over generations making drift increase until an asymptotic value is reached. In the first generation, the magnitude of the drift process on the neutral allele can be quantified by the variance in allele frequency in the first generation. Although this refers only to the particular neutral genes that appeared one generation ago, we refer to it as the effective population size in the first generation,

(14)
$Q'_1$ can be computed as the average value for all the $Q'_{j1}$ values of selected loci, being obviously equal to one, $Q'_{j1} = (1/S'_1)\sum_{i=1}^{1}S'_i = 1$, so that $Q'_1 = 1$. The value of $Q''_1$ is also 1. As stated before, the asymptotic value of $Q''$ is less than or equal to 2, and after a few generations it will be much smaller than $Q'$ under linkage, so we can ignore it in Equation 14 and henceforth. The magnitude of the drift process on the neutral allele from generation 1 to 2 is analogously quantified by the variance in allele frequency from generation 1 to 2, which we refer to as $N_{e,2}$. This is calculated using $Q'_2$, the average of all the $Q'_{j2}$ (see the accumulation of terms stated before), obtained as $Q'_{j2} = (1/S'_1)\sum_{i=1}^{2}S'_i = 1 + (1 - r)Z$, and

(15)

From generation 2 to 3, $N_{e,3}$ can be calculated using $Q'_3$, which is the average of all the $Q'_{j3}$, obtained as $Q'_{j3} = (1/S'_1)\sum_{i=1}^{3}S'_i = 1 + (1 - r)Z + (1 - r)^2Z^2$, and so on up to infinity, $Q'_{j,\infty} = Q'_j$, when the asymptotic effective population size, $N_{e,\infty} = N_e$ (equations in the previous sections), is reached.

There is no simple solution for the Q' terms for consecutive generations (except for infinite generations; i.e., Equation 6). Therefore, numerical methods have to be applied to estimate the values of the partial effective sizes in consecutive generations. If genetic variance for fitness is not large, in the first generation Ne,1 is close to the census size N of the population. In the following generations the effective size drops toward its asymptotic value (see Figure 1). For a new neutral mutation, the decay of genetic variance is 1/2Ne,1 in the first generation, 1/2Ne,2 in the second generation, and so on. A consequence of this cumulative effect of drift on new mutations is that there is not a simple formula to connect asymptotic population size, heterozygosity, and proportion of segregating sites for neutral alleles, as we address next.
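A minimal numerical sketch of these partial effective sizes is given below. It makes several assumptions that should be read as such: each $N_{e,i}$ is taken to have the same exponential form as the asymptotic prediction, $N\exp(-Q'^2_iC^2/2)$, with $Q'^2_i$ the genome-wide average of the squared partial sums $1 + (1 - r)Z + \ldots + [(1 - r)Z]^{i-1}$; the other-gamete term is ignored; and $Z$ is treated as a fixed input. N = 100, L = 1, and C² = 0.02 follow the caption of Figure 1, whereas Z = 0.95 and all function names are illustrative.

```python
import math


def haldane_r(x: float) -> float:
    """Recombination fraction for a map distance of x Morgans (Haldane mapping)."""
    return 0.5 * (1.0 - math.exp(-2.0 * x))


def partial_q(x: float, z: float, generation: int) -> float:
    """Sum of the first `generation` terms of the series 1 + a + a^2 + ..."""
    a = (1.0 - haldane_r(x)) * z
    return (1.0 - a ** generation) / (1.0 - a)


def partial_ne(n: float, c2: float, z: float, genome_length: float,
               generation: int, steps: int = 2000) -> float:
    """Assumed form Ne,i = N * exp(-mean(Q'_i^2) * C^2 / 2)."""
    h = (genome_length / 2.0) / steps
    mean_sq = sum(partial_q((k + 0.5) * h, z, generation) ** 2
                  for k in range(steps)) / steps
    return n * math.exp(-mean_sq * c2 / 2.0)


if __name__ == "__main__":
    for gen in (1, 2, 5, 10, 50, 1000):
        print(gen, partial_ne(100, 0.02, 0.95, 1.0, gen))
```

The first value is close to the census size and later values fall toward the asymptote, which is the qualitative behavior described for the series of $N_{e,i}$.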



 
Figure 1. Reduction of the probability of segregation and the heterozygosity contributed by a locus with a single copy of a neutral gene in the initial generation. The reductions are given as a percentage of the values in the initial generation. The effective population size ($N_{e,i}$) associated with that locus in generation $i$ is also plotted. $N = 100$, $L = 1$, $C^2 = 0.02$, and $t = 0.01$ (deleterious mutations model). The last element in each series corresponds to the asymptotic value (both heterozygosity and probability of segregation equal 0).

Heterozygosity:
Under the infinite sites model, the heterozygosity contributed by a new mutation (i.e., with frequency $1/2N$) is $2(1/2N)(1 - 1/2N) \approx 1/N$. Then, with a mutation rate $\mu$ per locus and generation, the number of new mutations per generation is $2N\mu$, and the input of heterozygosity per generation is about $2\mu$. The neutral variability generated by these mutations decreases at an increasing rate, which is a function of the consecutive values of $N_{e,i}$, so the remaining proportion after $\tau$ generations is $R_\tau = \prod_{i=1}^{\tau}(1 - 1/2N_{e,i})$. Therefore, the expected heterozygosity at equilibrium ($\pi$) is the sum of the contributions by mutations during all the previous generations,

$$\pi = 2\mu\sum_{\tau=0}^{\infty}R_\tau \qquad (16)$$

If there is no selection, or selection acts on a noninherited trait, there is a single value of $N_e$ for the consecutive generations. Thus, $R_\tau = (1 - 1/2N_e)^\tau$, and substituting this into Equation 16, $\pi = 4N_e\mu$, as expected (CROW and KIMURA 1970, p. 323). Furthermore, when selection is on an inherited trait and the selective effects are large, the consecutive values of $N_{e,i}$ decay very quickly, reaching values close to the asymptotic effective size, $N_e$, in a few generations. Under this condition, heterozygosity is again well approximated by $\pi = 4N_e\mu$. Otherwise, this equation underestimates heterozygosity because the effective size associated with a mutation is larger than the asymptotic $N_e$ for a long period of time. This is illustrated in Figure 1, which shows the expected heterozygosity for consecutive generations of a neutral allele starting with a single copy in the initial generation. It is observed that the heterozygosity has a rate of reduction lower than that of $N_{e,i}$. However, the degree of disassociation between heterozygosity and the asymptotic $N_e$ is much smaller than that between the proportion of segregating sites and the asymptotic $N_e$, as we explain next.
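The summation over past generations can be sketched in Python as follows. The code assumes that the equilibrium heterozygosity is $2\mu$ times the sum of the retained proportions $R_\tau$ starting at $\tau = 0$, which is the form that returns $4N_e\mu$ when the effective size is constant. The early $N_{e,i}$ values are passed explicitly and the tail is closed with the asymptotic rate; the function name and the example series are hypothetical.

```python
def equilibrium_heterozygosity(mu: float, ne_by_generation, ne_asymptotic: float) -> float:
    """2 * mu * sum of R_tau, with R_tau the product of (1 - 1/(2 Ne_i)) terms.

    `ne_by_generation` lists the partial effective sizes of the first generations;
    all later generations are assumed to decay at the asymptotic rate.
    """
    total = 0.0
    remaining = 1.0                  # R_0 = 1 for mutations of the current generation
    for ne_i in ne_by_generation:
        total += remaining
        remaining *= 1.0 - 1.0 / (2.0 * ne_i)
    # Geometric tail: remaining * sum_k (1 - 1/(2 Ne))^k = remaining * 2 Ne
    total += remaining * 2.0 * ne_asymptotic
    return 2.0 * mu * total


if __name__ == "__main__":
    mu = 1e-6
    print(equilibrium_heterozygosity(mu, [], 50.0))                # constant Ne
    print(equilibrium_heterozygosity(mu, [100.0, 80.0, 60.0], 50.0))
```

The second call, in which the early effective sizes exceed the asymptote, returns a value above $4N_e\mu$, illustrating why the asymptotic formula underestimates heterozygosity when the decline of $N_{e,i}$ is slow.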

Proportion of segregating sites:
Under an infinite sites model, the proportion of segregating sites increases by $2N\mu$, the number of new mutations per generation. The equilibrium proportion of segregating sites, $s$, can be obtained by calculating the probability that mutants appearing in previous generations are still segregating in the current one. Looking backward in time, the remaining fraction of the segregating sites produced $\tau$ generations ago is a function of the magnitude of the drift process until the current generation. As we have seen, this magnitude is represented by the partial $N_{e,i}$ values from generation 1 to generation $\tau$, and it can be summarized by the harmonic mean $N_{e,H\tau}$ of these $\tau$ values, i.e., $1/N_{e,H\tau} = (1/\tau)\sum_{i=1}^{\tau}(1/N_{e,i})$. Thus, the probability of segregation in the current generation of mutations that appeared $\tau$ generations ago ($P_\tau$) can be approximated by

(17)

(GALE 1990, p. 108). This equation gives overestimates of P_τ in the long term, say for τ > Ne,Hτ. Therefore, in practice we use this equation until the difference between two consecutive generations, P_τ - P_{τ+1}, is smaller than the expected asymptotic rate of decay, 1/2Ne. After that, the recursive equation P_{τ+1} = P_τ(1 - 1/2Ne) is used. The proportion of segregating sites s can be computed as the sum of the remaining contributions from all the previous generations,

(18)

The proportion of segregating sites is generally much more dependent on N than on Ne because only a small proportion of new mutations segregate for a long period. For example, if there is no selection, the s value for the whole population is approximately 4Nµ ln 2N (see EWENS 1979). The input of segregating sites per generation is 2Nµ. At equilibrium, this is also the number of sites that become monomorphic per generation. Therefore, the proportion of polymorphic loci that become monomorphic per generation is 2Nµ/(4Nµ ln 2N) = 1/(2 ln 2N), which is a relatively large proportion. For example, for a population of N = 100, about 10% of the segregating sites are lost by drift every generation. This is also illustrated in Figure 1. The probability of segregation of an initially single-copy neutral allele has most of its reduction in the initial generations. Given this large rate of loss of polymorphic loci per generation, it is clear that the proportion of segregating sites depends strongly on the mutations that arose a few generations ago and, therefore, the Ne,i values of the initial generations have much influence. Because these initial Ne,i values are closer to the census size N than to the asymptotic effective size, Ne, the proportion of segregating sites in the whole population is only slightly dependent on Ne. On the contrary, the rate of loss of heterozygosity per generation is relatively small (1/2N with no selection). For example, for a population of size N = 100, it is only 0.5% per generation. Therefore, the heterozygosity is more dependent on the asymptotic effective population size, Ne. The above arguments indicate that if the asymptotic Ne is much smaller than the census size N, heterozygosity will be more affected by selection than the proportion of segregating sites because the latter depends strongly on N. This dependence of the asymptotic reduction of s, π, and Ne,i on population size predicted under a model of deleterious mutations is shown in Figure 2.
The larger the population size, the stronger the selection, as the mutant effects are assumed to be constant (t = 0.01). Reductions of Ne,i and π are close and tend to be equal with large N (strong selection), as previously noted by CHARLESWORTH et al. 1995 for background selection. The proportion of segregating sites is much less affected by the increase in strength of selection.
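The two per-generation loss rates compared in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes only the no-selection approximations quoted in the text (s ≈ 4Nµ ln 2N, an input of 2Nµ segregating sites per generation, and a loss of heterozygosity of 1/2N per generation). The function names are illustrative and not part of the original analysis.

```python
import math


def segregating_site_turnover(n):
    """Fraction of segregating sites lost per generation with no selection.

    Ratio of the input per generation (2*N*mu) to the equilibrium value
    (4*N*mu*ln(2N)); the mutation rate cancels.
    """
    return 1.0 / (2.0 * math.log(2 * n))


def heterozygosity_loss(n):
    """Fraction of heterozygosity lost per generation with no selection."""
    return 1.0 / (2.0 * n)


for n in (100, 1000, 10000):
    print(n, round(segregating_site_turnover(n), 4), heterozygosity_loss(n))
```

For N = 100 the first rate is close to 10% and the second is 0.5%, as stated in the text. The first rate falls only logarithmically with N, whereas the second falls in proportion to 1/N.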



Figure 2. Example of the dependence of the asymptotic reduction of proportion of segregating sites (s), heterozygosity (π), and effective population size (Ne) on the number of reproductive individuals (N). C² = 0.001, t = 0.01 (deleterious mutations model), and L = 0.

Allele frequency spectrum:
The application of the classic theory of Ne provides methods to predict the spectrum of frequencies of neutral genes a number of generations after their appearance (e.g., CROW and KIMURA 1970). According to our results, these methods should consider the evolution of the consecutive values of the effective size over those generations, but the application would be quite complex. However, the precision is not much affected if the harmonic mean Ne,Hτ of the partial Ne,i values for the τ generations is used as the constant effective population size for the τ generations. The distribution of neutral gene frequencies in the population can then be computed as a combination of distributions for neutral mutations that appeared in the current generation, one generation ago, two generations ago, etc., up to infinity. An illustration of this is given in the next section.


*  EVALUATION OF RESULTS

The above predictions and equations were checked by Monte Carlo simulations. Random mating populations with N diploid individuals were simulated. The selective system was controlled by n loci evenly distributed in linear chromosomes. A further n neutral loci were placed alternating with the selected loci. The population was initially run for thousands of generations so that the selective system could reach mutation-selection-drift equilibrium. Thereafter, two different sets of runs were carried out according to the objective. In the simulations used to evaluate Ne, alleles from each neutral locus were initially set at frequency 0.5. The population was then simulated for 100–300 generations until the asymptotic effective size was clearly reached. Fifty additional generations were run. At least 200 independent replicates of this process were simulated. The variance (Vari) of the frequency of the neutral genes was computed for each generation i over loci and replicates. The effective population size at a given generation i was computed as Ne,i = . The observed asymptotic Ne value was computed as the average of the Ne,i values of the 50 additional generations. A different set of simulations was run to evaluate the heterozygosity and the segregation of polymorphic loci. In this case, the neutral genes were introduced as mutants, and the population was run until the equilibrium heterozygosity and polymorphism were reached. The selective value of an individual was calculated as (1 + t)^k for the model of favorable mutations and (1 - t)^k for the model of detrimental mutations, where k is the number of mutants carried by the individual. Every generation the mean fitness of the population was set to 1, and the variance of relative fitnesses of individuals (C²) was computed.
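The fitness bookkeeping described at the end of this paragraph can be sketched as follows. This is a minimal illustration rather than the simulation program used in the paper: the function name and argument names are hypothetical, and the use of the population variance (dividing by the number of individuals) is an assumption.

```python
def relative_fitness(mutant_counts, t, favorable=False):
    """Return relative fitnesses (mean 1) and their variance, C2.

    mutant_counts -- number of mutants k carried by each individual
    t             -- effect of each mutant under multiplicative gene action
    favorable     -- True for (1 + t)**k, False for (1 - t)**k
    """
    base = 1.0 + t if favorable else 1.0 - t
    raw = [base ** k for k in mutant_counts]
    mean = sum(raw) / len(raw)
    relative = [w / mean for w in raw]            # mean fitness set to 1
    # Population variance of relative fitness (assumed definition of C2).
    c2 = sum((w - 1.0) ** 2 for w in relative) / len(relative)
    return relative, c2


# Hypothetical individuals carrying 0 to 4 deleterious mutants of effect t = 0.05.
relative, c2 = relative_fitness([0, 1, 1, 2, 3, 4], t=0.05)
print(sum(relative) / len(relative), c2)
```

Because the fitnesses are rescaled to a mean of 1 before the variance is taken, the returned value corresponds to the variance scaled by the squared mean fitness.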

Table 2 shows some simulations of asymptotic values of Ne, π, and s. Predictions were generally close to simulations. As was explained before, the effective size is progressively reduced over generations until the asymptotic value is reached. A comparison with simulations is made in Figure 3. Predictions of the equilibrium heterozygosity and proportion of segregating sites in Table 2 were made from these values of Ne,i in consecutive generations as explained above. As expected, the absolute reduction of Ne is generally greater than the reduction of heterozygosity and polymorphism (cf. Figure 2) because π and s depend not only on the asymptotic Ne but also on nonasymptotic values, particularly s. A tendency of convergence between the ratios π/π0 and s/s0 with increasing population size is predicted, as noted by CHARLESWORTH et al. 1995 for background selection (see also Figure 2).



Figure 3. Three examples of the progressive reduction of Ne,i associated with a new neutral mutation in a population of size N = 100. Simulated (boxes) and predicted (lines) values for 40 generations. The last element in each series corresponds to the asymptotic Ne value. C² ≈ 0.044 and t = 0.05 (deleterious mutations model). Solid boxes, four chromosomes 1 Morgan long each; open boxes, one chromosome 1 Morgan long; and solid triangles, one chromosome with no recombination.


 
Table 2. Simulations (and predictions in parentheses) based on multiplicative gene action with mutants of equal effect, t, on the heterozygote

Predictions are also accurate when mutations of unequal effects are considered. For example, simulations from NORDBORG et al. 1996 with U = 0.4, L = 1 and mutation effects with mean t = 0.04 drawn from a gamma distribution with parameters α = 0.70 and β = 0.032 give an average π/π0 of 0.67. The prediction from Equation 4a is 0.65, and that from the approximation (4b) (assuming no correlation between C_i² and Q'_i²) is 0.67, suggesting that the shape of the distribution of effects is not very important for the effective population size.

Finally, Figure 4 shows an example of the agreement between observed and expected allele frequency spectra. The expected frequency distribution under selection, for the whole population of mutations that originated τ generations ago, was obtained by using transition matrix methods. The partial Ne,i values for generations 1 to τ were predicted, and the harmonic mean Ne,Hτ of these was used as the constant effective size of mutations that originated τ generations ago. Predictions (top line) were made by accumulating all the expected distributions for neutral mutations that originated in all the previous generations and in the current one. Simulations (boxes) were very close to these predictions. The bottom line shows the distribution that would have been predicted under a pure neutral model without selection. This was calculated assuming the constant effective population size that explains the observed level of heterozygosity in the population.
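The accumulation of distributions over cohorts of mutations can be sketched with a standard Wright-Fisher transition matrix. The code below is an illustration under simplifying assumptions and not the authors' procedure: it covers only the pure neutral reference case, uses the same number of gene copies (`two_n`) for every cohort instead of a cohort-specific Ne,Hτ, truncates the sum at a finite number of generations, and assumes one new single-copy mutation per generation. All names are hypothetical.

```python
from math import comb


def wf_transition_matrix(two_n):
    """Matrix[i][j] = Pr(j copies next generation | i copies now), binomial sampling."""
    matrix = []
    for i in range(two_n + 1):
        p = i / two_n
        matrix.append([comb(two_n, j) * p ** j * (1.0 - p) ** (two_n - j)
                       for j in range(two_n + 1)])
    return matrix


def accumulated_spectrum(two_n, generations):
    """Expected number of segregating mutations at each copy number.

    Sums the distributions of cohorts that arose 0, 1, ..., generations - 1
    generations ago, each starting from a single copy.
    """
    matrix = wf_transition_matrix(two_n)
    dist = [0.0] * (two_n + 1)
    dist[1] = 1.0                       # a new mutation enters as a single copy
    spectrum = [0.0] * (two_n + 1)
    for _ in range(generations):
        for j in range(1, two_n):       # segregating classes only
            spectrum[j] += dist[j]
        dist = [sum(dist[i] * matrix[i][j] for i in range(two_n + 1))
                for j in range(two_n + 1)]
    return spectrum


spectrum = accumulated_spectrum(two_n=20, generations=200)
print([round(x, 3) for x in spectrum[1:6]])
```

Replacing the single matrix with one built for each cohort's own effective size would move this sketch toward the superposition of distributions described in the text. How a noninteger effective size is mapped onto a transition matrix is not specified here and would have to be decided separately.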



Figure 4. Example of gene spectrum in a population of 500 individuals under a multiplicative deleterious mutations model and no recombination. C² = 0.00426, t = 0.01. Only gene frequencies between 0 and 0.02 are represented. Simulated values in boxes. Top line, prediction under selection; bottom line, prediction under a pure neutral model (see text for explanations).


*  DISCUSSION

The fundamental concept in our analysis is that the parameter Ne, which summarizes the magnitude of the drift process in a genomic region or in the whole genome, is a function of the rate of reduction of the covariance between the neutral genes and the selected system. This reduction depends on three factors: the genetic size of the genome (i.e., the recombination rate), the change of variance of the selected loci due to selection, and the reduction of variance due to drift. At equilibrium, the total rate of reduction is (1 - Z) = 1/2Ne + t for models in which this rate is independent of the gene frequencies (i.e., infinitesimal model or deleterious mutations model). When the effects of the selected loci on fitness are large in relation to Ne, say t ≫ 1/2Ne, the relative influence of genetic drift is small and predictions become independent of Ne. In this case, there is full agreement with equations from HUDSON and KAPLAN 1995, BARTON 1995, and NORDBORG et al. 1996 for background selection (deleterious mutations). As the effect of the genes decreases, with t < 1/2Ne and getting close to the assumptions of the infinitesimal model, the predictions are more dependent on Ne, and the bigger the population, the smaller the ratio Ne/N (see Table 2).

Our predictions of Ne can be made in terms of compound parameters, such as the variance for fitness, C², and the new input of mutational variance, Vm, and do not necessarily require mutation rates and mutational effects of spontaneous mutations, whose magnitudes are currently under debate (e.g., PECK and EYRE-WALKER 1997). HOULE et al. 1996 have reviewed estimates of C²/Vm for a variety of traits and species, obtaining an average value of 50 for life-history traits. This is an estimate of the average persistence time of detrimental mutations. For viability in Drosophila, C² = 0.01, approximately, for the whole genome (MUKAI 1988). Because the genome size of Drosophila melanogaster is about 1.25 Morgans (considering that there is no recombination in males), substituting (1 - Z) = Vm/C² = 0.02, C² = 0.01, and L = 1.25 into Equation 8 and Equation 9, we obtain Ne = 0.67N, which is a considerable reduction in effective size due to inherited differences in viability alone.

A main requisite for our model to work is the continuous flux of genetic variation for fitness in all the chromosome regions. Mutation introduces new variation at neutral sites while selection reduces the genetic variability. This requirement is far from the strong selective sweep model assumed by WIEHE and STEPHAN 1993, for which the hitchhiking of neutral alleles by favorable mutations can be considered as a two-step process. First, a strongly selected gene passes quickly through the population, wiping out linked variation, and second, polymorphism is recovered by mutation in a period where no hitchhiking occurs. Therefore, our equations for favorable mutations are not applicable to the assumptions made by WIEHE and STEPHAN 1993. For example, with parameters N = 1000, L = 0, t = 0.2, the average simulated heterozygosity (π/π0) is 0.031, 0.084, and 0.246 when a single selective locus is segregating all the time, one-third of the time, or one-twelfth of the time, respectively. The corresponding predictions obtained with our method are 0.028, 0.033, and 0.041, respectively. Thus, the latter two cases, which include periods of recovery of polymorphism, deviate from the assumptions of our model, and predictions become increasingly inaccurate.

Our derivation follows the arguments of ROBERTSON 1961 and SANTIAGO and CABALLERO 1995. The variance of long-term selective values (Q²C²) is partitioned into two components, one due to the chromosome carrying the neutral allele of reference (Q'²) and the other due to the homologous chromosome (Q''²). With no linkage, large populations and weak selection, both terms approximate a value of 4 and Equation 3, Ne = , is obtained, as deduced by ROBERTSON 1961 and SANTIAGO and CABALLERO 1995. However, BARTON 1995 and NORDBORG et al. 1996 arrived at the conclusion that with no linkage Ne = . NORDBORG et al. 1996 (Appendix iii) linked this to the previous expression by arguing that in the former the term C² is, in fact, C²/2, because it refers to the variance of fitness of families (couples) instead of individuals. This is not correct, however. In the argument of ROBERTSON 1961 and SANTIAGO and CABALLERO 1995, C² is the variance of the relative fitness values of couples because these were fixed (monogamous matings), but this was assumed to be the fitness associated with neutral alleles, and all four alleles (in the couple) had the same associated fitness. The model would be equivalent for monoecious populations (CABALLERO and SANTIAGO 1995).

The reason for the confusion is clear from the derivation in this article. BARTON 1995 and NORDBORG et al. 1996 considered only the gamete carrying the neutral allele as the determinant for the fitness associated with this allele. This is the same as neglecting Q''², as we did, for example, to obtain Equation 5. Now, for large population size (Z ≈ 1), Q' ≈ , and from Equation 5, Ne = , which agrees with BARTON's expression. If now r = 0.5, the above expression yields Ne = . However, neglecting Q''² is allowed only for moderate or strong linkage because only for r ≪ 0.5 is Q'² ≫ Q''². The intuitive explanation is that with tight linkage, the fitness associated with the neutral allele depends mostly on that of the gamete carrying it and Q'² ≫ Q''². However, for very loose linkage or no linkage, the fitness of the homologous gamete is also important: Q'² ≈ Q''² ≈ 4, and Ne = .

To reduce the complexity of the derivation, we have considered that the recombination rate is constant across the chromosome and the neutral gene is located in the middle of a chromosome. An equivalent derivation can also be developed for a neutral gene at any location. The location of the neutral gene does not make a big difference unless the gene is close to the chromosome tip. In this region, the effect of drift is smaller because closely linked selective genes can only (or mainly) be located on one side of the neutral gene, which halves the random associations with selected genes. As these regions in both tips are very small, their weight on the average Ne value for the whole genome is irrelevant and the result for the central location is a very good approximation to the average Ne. This effect indicates that Ne is mainly determined by the strength of selection acting on the region closely linked to the neutral locus. An equivalent conclusion has been reached by NORDBORG et al. 1996 under the background selection model.

Regional variations in the frequency of recombination are often observed (see LICHTEN and GOLDMAN 1995), with a general pattern of reduced recombination in proximal regions (e.g., NACHMAN and CHURCHILL 1996). Additionally, the distribution of transcriptional genes throughout the genome does not seem to be uniform (GARDINER 1996), suggesting that the source of genetic variability for fitness is not evenly distributed in the genome. The exact magnitude of these deviations is unknown, but the former equations could also be applied if the density of genetic variability for fitness were more or less proportional to the rate of recombination. Otherwise, the computation of the appropriate value of Q'² must consider the particular distribution of selected genes and genetic distances between these genes and the neutral locus.

The reduction of effective size associated with a neutral gene is progressive: The magnitude of the drift process is smaller for new neutral mutations than for old ones, and this process accumulates on neutral genes over generations until an asymptotic value is reached. The consequence is that heterozygosity will always be larger than that expected if all the neutral genes in the population had a constant effective size equal to the asymptotic value (Ne) and, therefore, cannot be formally predicted in the simple way, 4Neµ. The magnitude of the underprediction depends on how quickly the asymptotic Ne value is reached (see Figure 3). For a given genome size or recombination rate, this relies on the rate of reduction of the genetic variance. Under the assumptions of the infinitesimal model, the reduction of the variance in the selected system will be mainly due to drift if selection is weak, the reduction of effective size will be slow, and the difference between the real heterozygosity and that expected from the asymptotic Ne will be the highest. As the effect t of selected genes becomes larger, the rate of reduction increases and the asymptotic Ne is reached earlier. In a model of deleterious mutations of large effect (the background selection model), heterozygosity tends to be almost equal to 4Neµ as mutations reach their asymptotic value of Ne in a few generations. NORDBORG et al. 1996 developed a prediction of the reduction of heterozygosity due to linked selected loci under background selection (formally π/π0), which is identical to our prediction of the asymptotic Ne/N when drift is not considered. This prediction becomes inexact as the population size or the effect of selected genes decreases (NORDBORG et al. 1996), because the associated effective sizes of neutral alleles become further and further away from the asymptotic value.
A related issue is the reduction in the probability of fixation of advantageous mutations because of their linkage to other selected loci (see BARTON 1994). Mutants of very large effect, whose fate is decided in a few generations, are affected by the asymptotic Ne less than mutants of small effect and, therefore, the fixation probability of the former is reduced by a smaller amount (CABALLERO and SANTIAGO 1998).

The progressive reduction of the effective size associated with mutations can also explain the apparent disconnection between heterozygosity and number of segregating sites under selection, which is the basis of statistical tests of neutrality (e.g., FU 1996). Actually, those tests compare the observed spectrum of gene frequencies with its expectation under a pure neutral model. As we have seen, the proportion of segregating sites is only weakly dependent on Ne, as it is mostly due to recent mutations, which have associated effective sizes close to the census size of the population and far away from the asymptotic value. For very large populations, however, the number of generations that effectively contribute to the proportion of segregating sites is larger. Therefore, the drop of effective size during the initial generations affects it more, making heterozygosity and number of segregating sites similarly reduced. This effect has been described by CHARLESWORTH et al. 1995.

The spectrum of gene frequencies can be approximated from the evolution of Ne associated with mutations over generations. For mutations that originated τ generations before the current generation, the magnitude of the drift process can be summarized by the harmonic mean (Ne,Hτ) of the Ne,i values from generation i = 1 to τ. The remaining proportion of heterozygosity can be predicted by (1 - 1/2Ne,Hτ)^τ. Analogously, the proportion of segregating sites can be approximated from Ne,Hτ using the general theory of the effective population size (Equations 17 and 18). In other words, the spectrum of gene frequencies for mutations that originated τ generations ago is approximately that expected under a neutral model using the appropriate Ne,Hτ. Deviations from the pure neutral spectrum arise when the contributions of all previous generations are accumulated. Different spectra corresponding to different Ne,Hτ values of previous generations (from τ = 1 to ∞) are superimposed, one over the others, building a general spectrum that cannot be explained by a single Ne value under a neutral model (see Figure 4).
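How well the harmonic mean summarizes a varying sequence of effective sizes can be checked directly. The sketch below compares the generation-by-generation product of (1 - 1/2Ne,i) with the single-parameter form (1 - 1/2Ne,Hτ)^τ given in the text. The function names and the example Ne,i values are hypothetical, and the generation-by-generation product is taken here as the reference value.

```python
from statistics import harmonic_mean


def remaining_heterozygosity_exact(ne_values):
    """Product of the per-generation retention factors (1 - 1/(2*Ne,i))."""
    remaining = 1.0
    for ne in ne_values:
        remaining *= 1.0 - 1.0 / (2.0 * ne)
    return remaining


def remaining_heterozygosity_harmonic(ne_values):
    """Same quantity using the harmonic mean as a constant effective size."""
    ne_h = harmonic_mean(ne_values)
    return (1.0 - 1.0 / (2.0 * ne_h)) ** len(ne_values)


# Hypothetical Ne,i values declining from the census size toward an asymptote.
ne_values = [100.0, 80.0, 60.0, 50.0]
exact = remaining_heterozygosity_exact(ne_values)
approx = remaining_heterozygosity_harmonic(ne_values)
print(exact, approx, approx - exact)
```

The two values differ only in terms of second order in 1/2Ne,i, so the harmonic mean form is accurate whenever the partial effective sizes are not very small. The harmonic form is never below the product, since the arithmetic mean of the retention factors is at least their geometric mean.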

When statistical tests are applied to compare predicted and observed spectra of gene frequencies, the finite size of the samples can make the deviations from the neutral model difficult to detect. Observations in natural populations of Drosophila show reduced diversity in regions with low recombination rates (BEGUN and AQUADRO 1992), but most data show no deviations from the neutral spectrum (see CHARLESWORTH et al. 1995). Although virtually any model considering directional selection could account for the observed correlation between nucleotide variation and recombination rate, simple selective sweep models with strong selection cannot explain the statistical agreement with the neutral spectrum (HUDSON 1994; BRAVERMAN et al. 1995). Predictions using computer simulations reveal that the statistical agreement is consistent with the background selection model (CHARLESWORTH et al. 1995; HAMBLIN and AQUADRO 1996). The general theory that we have described can help to determine the conditions for the background selection model, alone or combined with weak selective sweep models, which could explain the pattern of observed variation.

Finally, some remarks concerning artificial selection can be made. In the general theory of quantitative traits, linkage is usually ignored as farm species generally have several chromosomes, suggesting that the assumption of free recombination is close to reality. Additionally, linkage makes the analytical model more cumbersome: Additive models are complicated by the effect of the generation of negative covariances between genes affecting fitness (BULMER 1980; SANTIAGO 1998). Although our theory takes into account this effect, which is included in the term Z = 1 - Vm/C², its application to a model in which parents are selected individuals and the genetic values of both "gametes" are negatively correlated is not straightforward. Further insight into these models is necessary to assess the impact of linkage in artificial selection programs.


*  ACKNOWLEDGMENTS

We thank B. CHARLESWORTH, W. G. HILL, and N. BARTON for helpful comments. This work was supported by grant PB95-0909-C02-02 from Ministerio de Educación y Cultura (Spain) to E.S.