Genetics, Vol. 161, 389-410, May 2002, Copyright © 2002

Population, Evolutionary and Genomic Consequences of Interference Selection

Josep M. Comerona,b and Martin Kreitmana
a Department of Ecology and Evolution, University of Chicago, Chicago, Illinois 60637
b Department of Biological Sciences, University of Iowa, Iowa City, Iowa 52242

Corresponding author: Josep M. Comeron, University of Iowa, 433 Biology Bldg., Iowa City, IA 52242., josep-comeron{at}uiowa.edu (E-mail)

Communicating editor: N. TAKAHATA


*  ABSTRACT
*TOP
*ABSTRACT
*MATERIALS AND METHODS
*RESULTS
*DISCUSSION
*APPENDIX
*LITERATURE CITED

Weakly selected mutations are most likely to be physically clustered across genomes and, when sufficiently linked, they alter each others' fixation probability, a process we call interference selection (IS). Here we study population genetics and evolutionary consequences of IS on the selected mutations themselves and on adjacent selectively neutral variation. We show that IS reduces levels of polymorphism and increases low-frequency variants and linkage disequilibrium, in both selected and adjacent neutral mutations. IS can account for several well-documented patterns of variation and composition in genomic regions with low rates of crossing over in Drosophila. IS cannot be described simply as a reduction in the efficacy of selection and effective population size in standard models of selection and drift. Rather, IS can be better understood with models that incorporate a constant "traffic" of competing alleles. Our simulations also allow us to make genome-wide predictions that are specific to IS. We show that IS will be more severe at sites in the center of a region containing weakly selected mutations than at sites located close to the edge of the region. Drosophila melanogaster genomic data strongly support this prediction, with genes without introns showing significantly reduced codon bias in the center of coding regions. As expected, if introns relieve IS, genes with centrally located introns do not show reduced codon bias in the center of the coding region. We also show that reasonably small differences in the length of intermediate "neutral" sequences embedded in a region under selection increase the effectiveness of selection on the adjacent selected sequences. Hence, the presence and length of sequences such as introns or intergenic regions can be a trait subject to selection in recombining genomes. In support of this prediction, intron presence is positively correlated with a gene's codon bias in D. melanogaster. Finally, the study of temporal dynamics of IS after a change of recombination rate shows that nonequilibrium codon usage may be the norm rather than the exception.


THE general concept of effective population size (Ne), due to WRIGHT 1931 Down, WRIGHT 1938 Down, is usually associated with species-specific factors such as inbreeding, unequal numbers of the two sexes, variance in mating success, temporal variation in population size, population subdivision, and pervasive selection. According to theory, these factors are expected to have a genome-wide influence on the standing crop of both weakly selected and selectively neutral mutations because evolutionary parameters governing these mutations, Nes ({alpha}) and Neµ (ß), respectively, include effective population size (see Table 1). However, polymorphism levels are not constant across the genome but rather are correlated with recombination rates, suggesting that additional factors may be influencing Ne (AGUADE et al. 1989 Down; STEPHAN and LANGLEY 1989 Down, STEPHAN and LANGLEY 1998 Down; BEGUN and AQUADRO 1992 Down; MARTIN-CAMPOS et al. 1992 Down; LANGLEY et al. 1993 Down; AGUADE and LANGLEY 1994 Down; AQUADRO et al. 1994 Down; STEPHAN 1994 Down; NACHMAN 1997 Down; DVORAK et al. 1998 Down; KRAFT et al. 1998 Down; NACHMAN et al. 1998 Down; PRZEWORSKI et al. 2000 Down). This has led to theoretical investigations of the effects of linkage and selection on neutral variation levels.


 
View this table:
In this window
In a new window

 
Table 1. Variables and parameters

Further investigation on whether Ne can be viewed as varying across a genome takes advantage of its consequences on the effectiveness of weak selection. Theory predicts that the evolutionary dynamics of mutations whose selective effects are on the order of the reciprocal of population size (i.e., {alpha} {cong} 0.25–2.5) are expected to be very sensitive to small shifts in Ne (OHTA and KIMURA 1971 Down; OHTA 1972 Down, OHTA 1995 Down; LI 1987 Down). Two lines of evidence are congruent with Ne changing across genomes in Drosophila (HILTON et al. 1994 Down). The first comes from studies on the biased usage of synonymous codons (codon bias), a paradigmatic example of weak selection for translational efficiency in many organisms (GRANTHAM et al. 1981 Down; IKEMURA 1981 Down; BENNETZEN and HALL 1982 Down; GROSJEAN and FIERS 1982 Down; KURLAND 1987 Down; SHARP and LI 1987 Down, SHARP and LI 1989 Down; SHIELDS et al. 1988 Down; BULMER 1991 Down; MORIYAMA and HARTL 1993 Down; AKASHI 1994 Down, AKASHI 1995 Down, AKASHI 1996 Down; EYRE-WALKER 1996 Down; COMERON and KREITMAN 1998 Down; COMERON et al. 1999 Down; MCVEAN and VIEIRA 2001 Down). Codon bias increases with recombination rates in Drosophila melanogaster, indicating an increase in the effectiveness of selection (and hence larger Ne) in regions of high recombination (KLIMAN and HEY 1993 Down; COMERON et al. 1999 Down), an effect that is also observed using the complete D. melanogaster genome (J. M. COMERON, unpublished data). Second, in interspecific comparisons in the D. melanogaster species complex, the rate of replacement substitutions (Ka) is significantly smaller in regions of high recombination than in regions of low recombination (HILTON et al. 1994 Down; COMERON and KREITMAN 2000 Down), suggesting stronger purifying selection against deleterious amino acid changes in genes located in regions of high recombination.

Several models of strong selection ({alpha} >> 1) have been proposed to explain why Ne is reduced in regions of low recombination: (i) the hitchhiking (HH) and pseudo-HH (pHH) models, which invoke frequent positive Darwinian selection (MAYNARD SMITH and HAIGH 1974 Down; KAPLAN et al. 1989 Down; GILLESPIE 2000 Down); (ii) the background selection (BGS) model, which assumes a high and constant rate of deleterious mutations (CHARLESWORTH et al. 1993 Down; CHARLESWORTH 1994 Down, CHARLESWORTH 1996 Down; HUDSON and KAPLAN 1995 Down); (iii) the joint effects of positively and negatively selected mutations (KIM and STEPHAN 2000 Down); and (iv) models based on fluctuating selection (GILLESPIE 1997 Down). The fixation of a favorable allele (i.e., selective sweep or draft under the HH and pHH models, respectively) alters the probability of fixation of linked selected mutations present in the population, decreasing the probability of fixation of other advantageous mutations and increasing the chance of fixation of deleterious mutations. A similar outcome will be produced by the steady elimination of strongly deleterious alleles and the indirect uniform reduction of Ne under BGS. In all of these models, mutations under strong directional selection make linked mutations behave as if they are evolving under a smaller population size (ROBERTSON 1961 Down). The indirect effects of selected mutations on linked variation will be greatest in the absence of recombination and will decrease with increasing recombination.

However, strong selection is not a requirement for this effect. HILL and ROBERTSON 1966 Down, using simulations of a two-locus model, showed that the probability of fixation of a favorable mutation at one locus becomes reduced due to the segregation of alleles at a second selected locus. This so-called Hill-Robertson effect (FELSENSTEIN 1974 Down; LEWONTIN 1974 Down) is caused by the presence of selected mutations in the genetic background and the consequent increment of stochastic sampling effects (i.e., variance of offspring number and genetic drift) in finite populations (see also ROBERTSON 1961 Down; BIRKY and WALSH 1988 Down). An increase in genetic drift under a Hill-Robertson scenario has the consequence of reducing the intensity of selection and it is regarded as equivalent to a reduction of Ne (ROBERTSON 1961 Down; HILL and ROBERTSON 1966 Down; FELSENSTEIN 1974 Down; KLIMAN and HEY 1993 Down). These effects of interference apply mostly to mutations under moderate/weak selection because these mutations have much longer sojourn times than strongly selected mutations and this enhances the opportunity for weakly selected mutations to influence one another's fate.

Weakly selected mutations, taken individually, are not expected to have a measurable effect on population parameters or on the tree topology of linked neutral mutations (GOLDING 1997 Down; NEUHAUSER and KRONE 1997 Down; PRZEWORSKI et al. 1999 Down). Similarly, interference between pairs of weakly selected mutations should also be negligible because this interaction is expected to be a second-order magnitude effect. We are, however, interested in investigating interference due to segregating mutations under weak selection, which we call Interference Selection (IS), because for many organisms weakly selected mutations are both abundant and physically clustered within genomes, and hence the magnitude of IS interactions becomes measurably large. First, a large fraction of the mutations segregating in populations of many species may be weakly selected. Synonymous mutations, for example, are an abundant source of weakly selected mutations in Drosophila and other species. Weakly selected mutations may also be common in many species among amino acid replacement changes (OHTA 1995 Down; ZENG et al. 1998 Down) and in regulatory regions (LUDWIG et al. 1998 Down). Second, weakly selected mutations are likely to be physically clustered in the coding regions of genes and in regulatory regions, generating genomic regions with a high density of segregating weakly selected mutations. Reduced physical distance and hence recombination between these mutations creates the opportunity for multiple interference interactions. Nevertheless, it is not obvious whether IS can be rationalized simply in terms of Ne because the magnitude of its effect will depend jointly on the mutation rate to selected alleles, the intensity of selection, and the recombination rate between mutations under selection.

LI 1987 Down, using simulations, showed that weakly selected mutations under complete linkage can interfere with one another to cause shifts in the frequency of favorable mutations at equilibrium. Later, COMERON et al. 1999 Down allowed the numbers of selected sites, selection coefficients, and recombination rates to vary across ranges of values thought to be biologically realistic for species such as Drosophila. This multilocus study showed that the level of codon bias (i.e., used as a proxy for the effectiveness of selection and Ne) decreases when either the recombination rate decreases or the number of weakly selected sites increases for different intensities of weak selection. MCVEAN and CHARLESWORTH 2000 Down further investigated both codon bias and level of polymorphism in selected sites under IS, confirming that these traits are affected by IS not only for absolute linkage but for a range of recombination rates observed in outcrossing species. These studies proposed the Hill-Robertson effect [the small-scale Hill-Robertson (COMERON et al. 1999 Down) or weak-selection Hill-Robertson (MCVEAN and CHARLESWORTH 2000 Down) effect] and a reduction in Ne to explain the results. These two studies, however, analyzed only the consequences of IS on the sites under selection, where the outcome is the composite effect of both selection on the mutations themselves and IS (which at the same time alters the effectiveness of selection).

TACHIDA 2000 Down studied evolutionary effects that selection at many sites has on interlaced neutral sites with total linkage. Here we extend this approach with the analysis of many population and evolutionary parameters under IS both on selected mutations and on adjacent neutral mutations, using a broad range of numbers of selected sites, recombination rates, and weak selection coefficients. The effects of IS on adjacent neutral variability are key to gaining insight into the mechanism and evolutionary effects of IS and to appraising the importance of IS as a viable evolutionary force.

We also investigate two other consequences of IS under low rates of recombination. First, because genes (exons and regulatory regions) are embedded in a matrix of generally less severely constrained DNA, IS may occur at sites with well-defined boundaries along the DNA. Therefore, we study the expected consequences of IS along such intervals under selection. We hypothesize that the effects of IS should not be uniform across a gene, and we use simulations to generate predictions that can be tested with Drosophila genomic data. Second, neutral sequences located between groups or clusters of selected sites under weak/moderate selection (e.g., introns within coding regions of genes) can be viewed as modifiers of recombination and hence can alter the effectiveness of selection (COMERON and KREITMAN 2000 Down; COMERON 2001 Down). We investigate whether this indirect selection on the presence and length of intermediate "neutral" sequences is plausible. If true, this makes indels in introns and other noncoding regions potential targets for selection.


*  MATERIALS AND METHODS
*TOP
*ABSTRACT
*MATERIALS AND METHODS
*RESULTS
*DISCUSSION
*APPENDIX
*LITERATURE CITED

Forward computer simulations:
A Wright-Fisher model was simulated with N diploid individuals (2N chromosomes) as previously described (COMERON et al. 1999 Down; see also LI 1987 Down) and with the following modifications. Every chromosome is composed of two adjacent stretches of sites, each of length L, with one stretch evolving neutrally (neutral sequence) and one evolving under weak selection (selected sequence). In each generation the total number of mutations and recombination events are drawn from a Poisson distribution with mean 4LßN and 3L{rho}N, respectively (see Table 1 for definitions). The number of recombination events per meiosis is not restricted (assuming no chiasma interference) to avoid underestimating the effect of recombination in long sequences when {rho}N is high. The mutation process allows only two allelic states at a site, and reversible mutation is permitted, mimicking the mutation process between preferred (p) and unpreferred (u) codons. Mutation rates from p to u and vice versa are w and v, respectively, where is the mutational bias. Unless otherwise indicated, we applied a mutation rate of , with . The previous assumption of only two allelic states (LI 1987 Down; COMERON et al. 1999 Down; MCVEAN and CHARLESWORTH 2000 Down; TACHIDA 2000 Down), one favorable and the other deleterious, is a reasonable approximation for weakly selected mutations at synonymous sites in organisms like Drosophila where G/C-ending codons are the preferred state (SHIELDS et al. 1988 Down; AKASHI 1994 Down, AKASHI 1995 Down). Based on preliminary studies (Appendix), our simulated populations were composed of 1000 chromosomes; a sample size of 12 sequences was used to estimate population genetic parameters.

The fitness of each individual is based only on the selected sequence. The selection differential between p (preferred allele) and u (unpreferred allele) in the selected sequence is +s; fitness is multiplicative over sites and mutations are semidominant in their effect on fitness. Each new generation is obtained by first choosing N individuals (2N chromosomes) with probability proportional to their relative fitness. The next generation is constituted by randomly pairing the 2N chromosomes (each composed by the selected and adjacent neutral sequence), which are possibly mutated and/or recombined to form N new diploid individuals.

As indicated in RESULTS, the time to reach base composition equilibrium under a mutation-selection-drift (MSD) balance and IS (e.g., codon usage) when may take on the order of {cong}100–250 N generations. Accordingly, our study of IS at equilibrium begins after a minimum of 250 N generations to assure base composition equilibrium. Each independent population realization was analyzed every N generations for a minimum of 1000 N generations. Population parameters were estimated in 20 independent samples. All estimates of population and evolutionary parameters (heterozygosity, nucleotide diversity, frequency skew, fixation rates) as well as the frequency of preferred codons (P) were obtained by studying the same number of sites, 250, regardless of the total number of sites in the sequence when L >= 250. These studied sites were homogeneously distributed across the sequence, unless explicitly indicated, to assure an average estimate of the parameters across the region. In every simulation both the selected and neutral sequences were analyzed. The ranges of parameter values we investigated for recombination, selection, and length were 0 <= {rho}N <= 0.4, 0.25 <= {alpha}N <= 2.5, and 125 <= L <= 2500. The ranges of recombination rates under study are representative of most eukaryotes, including D. melanogaster. Assuming Ne {cong} 1 x 106 for D. melanogaster (ANDOLFATTO and PRZEWORSKI 2000 Down) and using laboratory-based rates of crossing over, the complete D. melanogaster genome would satisfy {rho}N < 0.05, {rho}N < 0.1 after taking into account gene conversion (HILLIKER and CHOVNICK 1981 Down; ANDOLFATTO and NORDBORG 1998 Down; ANDOLFATTO and PRZEWORSKI 2000 Down). D. melanogaster genomic regions with {rho}N <= 0.004 may contain ~15% of genes using rates of crossing over; these genomic regions are defined by the cytological bands 1A–2C/20C–20F (X chromosome), 21A/38B–40F/41A–44B/60D–60F (chromosome 2), 61A–61B/76B–80F/81A–84E (chromosome 3), and the complete fourth chromosome. When the contribution of gene conversion to the total recombination is taken into account, {rho}N <= 0.004 may apply to >10% of D. melanogaster genes.

To evaluate the relative change in the effectiveness of selection on the selected sequences caused by varying the parameters (changing recombination rates and L and presence and length of intermediate regions) we compared the estimated value of the parameter {alpha} on the basis of the observed P. Following directly from the probability of fixation of p and u, and P at equilibrium under the infinitely many sites model and free recombination (WRIGHT 1931 Down; CROW and KIMURA 1970 Down; EWENS 1979 Down), we have for diploid organisms

(see LI 1987 Down; BULMER 1991 Down), when ß << 1. We call this the single-site model based on MSD (SS-MSD).

Linkage disequilibrium (LD) was estimated as the average over all pairwise comparisons of polymorphic sites by using D' (LD-D'; LEWONTIN 1964 Down); coupling gametes were composed of the most frequent alleles (LANGLEY et al. 1974 Down). We used LD-D' because this measure is the least affected by the frequency of the mutations compared to measures such as D (LEWONTIN and KOJIMA 1960 Down) or Zns (HILL and ROBERTSON 1968 Down; KELLY 1997 Down) in conditions of reduced or no recombination. However, LD-D' is not entirely independent of the frequency of the mutations (LEWONTIN 1988 Down), and so we also studied the extent of LD within predefined frequency classes. We avoided using sequence-based estimates of recombination such as Hudson's C (Chud; HUDSON 1987 Down) because its mean is influenced not only by the frequency of mutations but also by the number of segregating sites when this number is not large, a number that is changed by IS. To make comparable estimates of LD between sequences of different length, all estimates of LD were obtained using the polymorphisms detected within 250 contiguous sites; otherwise, the average distance between polymorphisms increases with L, biasing the comparisons of LD.

Analyses of the D. melanogaster genome:
We studied the complete D. melanogaster genome (ADAMS et al. 2000 Down; Version 2 (October 2000), http://www.ebi.ac.uk/genomes). The analyses were performed using all 9172 genes with empirical data supporting both their presence and specified exon/intron structure (i.e., with complete mRNA information and no evidence of multiply spliced forms). The study of intergenic sequences was carried out using only those intergenic sequences (6271) that are flanked by two genes with empirical data supporting both their presence and specified gene structure.

Heterogeneous codon bias across exons:

Two groups of genes were investigated. The first group (659 genes) was composed of all genes with a single long exon (>1000 bp or >333 amino acids). The second group (187 genes) included all genes with long coding regions (>333 amino acids) interrupted by introns and satisfying two criteria: (i) all (one or more) introns should be centrally located, dividing the coding region into two comparable regions (i.e., introns located between 30 and 70% of the relative total length of the coding region), and (ii) at least one intron should be >100 bp. The synonymous codon usage bias was measured using the frequency of GC-ending codons (GC3), the frequency of GC-ending codons in four-fold degenerate amino acids (GC4), and the frequency of preferred codons in D. melanogaster (AKASHI 1995 Down). As indicated, GC3 is a good measure of codon bias in D. melanogaster and at the same time is equivalent to the in silico study of IS with two alleles.

Codon bias and the proportion of selected sites in a gene:

The analysis was carried out using all 7499 complete genes (out of 9172) with introns. As a proxy for the relative number (or density) of selected sites in a gene, we used the proportion of the length of the coding region (PLCR) in a gene, measured as the ratio between the length of the coding region and the length of the coding region plus the total length of the introns.

The recombination rate for each gene in the D. melanogaster genome was estimated as previously described (see COMERON et al. 1999 Down for details) on the basis of the cytological position associated to each genomic scaffold sequence. All statistical analyses were carried out using Statistica for Windows 5.1 (1997).


*  RESULTS
*TOP
*ABSTRACT
*MATERIALS AND METHODS
*RESULTS
*DISCUSSION
*APPENDIX
*LITERATURE CITED

Effects of IS on population and evolutionary parameters at selected and adjacent neutral sequences
Effectiveness of selection: We investigated the effectiveness of selection on weakly selected mutations by analyzing the proportion of preferred mutations at equilibrium (P; LI 1987 Down; COMERON et al. 1999 Down; MCVEAN and CHARLESWORTH 2000 Down). In contrast to neutral polymorphism, for which measures such as heterozygosity ({theta}) are nearly linearly related to Ne, the relationship between P and the selection parameter {alpha} (Nes) is strongly nonlinear. For instance, a 5% increase in P represents a 50, 21, and 27% increase in {alpha} when the original P is 0.5, 0.6, and 0.9. Therefore, although our simulations measure shifts in P with changes of parameters affecting IS, the magnitude of these shifts is better reflected by the change in the parameter {alpha} needed to account for the results under a no-interference model (SS-MSD). As Fig 1 shows, the effectiveness of selection, as measured by {alpha}, decreases as the recombination rate decreases, and this effect increases with L (see also COMERON et al. 1999 Down and MCVEAN and CHARLESWORTH 2000 Down for the effect of IS on measures of codon bias and P). In addition, the relative reduction in the effectiveness of selection due to IS increases with the strength of selection acting on individual mutations ({alpha}N < 2.5).



View larger version (14K):
In this window
In a new window
Download PPT slide
 
Figure A1. Effect of the size of simulated populations on evolutionary consequences of interference selection (IS) due to weakly selected mutations. Results are shown for L = 2500 completely linked sites. Solid lines depict the frequency of preferred sites (P) and dashed lines depict Tajima's D (n = 12). Solid and open triangles show the results for {alpha}N = 1 and {alpha}N = 2.5, respectively.

Polymorphism levels: We studied the effect of IS on polymorphism levels, as measured by heterozygosity, in selected ({theta}s) and neutral ({theta}n) sequences. Under single-site models of weak selection (SS-MSD), the expectations are clear. A general reduction of the intensity of selection ({alpha}) predicts a relative increase of {theta}s, making {theta}s closer to {theta}n. On the other hand, a reduction in Ne will cause a direct reduction in {theta}n. The expected net consequence of reducing Ne, hence {alpha}, for mutations under SS-MSD is a reduction of {theta}s because the reduction of {theta}n is always greater than the expected increase of selected polymorphism due to a reduced selection, although this decrease of {theta}s is not expected to be proportional to the reduction of Ne. For strong selection, a moderate reduction of Ne would not alter {theta}s.

Our results (Fig 2) show that {theta}s is below the levels expected on the basis of the imposed strength of selection acting on these mutations under SS-MSD. For all combinations of selection intensity ({alpha}N) and recombination rates ({rho}N), {theta}s decreases as the number of sites under selection (L) increases (see also MCVEAN and CHARLESWORTH 2000 Down). The relative impact that increasing L has in reducing heterozygosity is similar for different selection intensities when (e.g., {theta}s for is 70–75% of that for , both for and ), and this impact decreases, but is still noticeable, for very weak selection and high recombination (e.g., and ). For , we also studied whether an even higher recombination rate would completely eliminate IS. The results show that does eliminate most IS on {theta}s when L <= 125 compared to SS-MSD expectations, but IS is still detectable for larger L.



View larger version (65K):
In this window
In a new window
Download PPT slide
 
Figure 2. Relative level of polymorphism (heterozygosity; {theta}) in selected ({theta}s) and adjacent neutral sequences ({theta}n). Results are shown for different recombination rates ({rho}N), scaled selection coefficients per site ({alpha}N), and number of selected sites (L). Horizontal lines indicate the expected values under a single-site model and mutation-selection-drift balance (SS-MSD). Sample size (n) = 12. () Selected ({theta}s). ({blacksquare}) Neutral ({theta}n). (A) {rho}N = 0, (B) {rho}N = 0.004, (C) {rho}N = 0.4.

Heterozygosity is also reduced in the adjacent neutral sequences as a result of linkage to sites under weak selection. The most extreme reductions in neutral variation are observed for (Fig 2A), where increasing either L or {alpha}N in the selected sequence substantially reduces {theta}n. The impact that the L has on {theta}n increases with the intensity of selection. For example, when , {theta}n is ~75 and ~50% of that observed when for and , respectively. When and L is large, there is a tendency to observe similar levels of polymorphism in selected ({theta}s) and adjacent neutral mutations ({theta}n), which are most evident for (when L <= 2500), implying that linked selectively neutral sites and sites under weak selection may not be distinguishable by this criterion. Increasing L reduces {theta}n even when the recombination rate is high (Fig 2B and Fig C). This observed reduction in {theta}n is similar for different selection intensities when recombination is highest ({rho}N = 0.4), likely reflecting a recombination rate threshold for IS.

Divergence and divergence to polymorphism ratio: Under SS-MSD equilibrium the rate of fixation or divergence of selected mutations rapidly decreases with increasing selection intensity {alpha}. We focused on the rate of divergence when {rho}N = 0 to illustrate the effect of IS on this evolutionary parameter (Fig 3A). As expected, selection acting at linked sites does not influence divergence for neutral mutations. The fixation rate of mutations under selection increases with L for any given {alpha}N, indicative of a reduction in the effectiveness of selection due to IS.



View larger version (18K):
In this window
In a new window
Download PPT slide
 
Figure 3. Divergence (fixation rate) and divergence:polymorphism (Div/Pol) ratio under IS in selected and adjacent neutral sequences for {rho}N = 0. Results are shown for different scaled selection coefficients per site ({alpha}N) and different number of selected sites (L). For all cases, values are relative to neutral expectations. (A) Fixation rate of selected sequences relative to adjacent neutral sequences. Dashed line indicates the expected fixation rate under a single-site model and mutation-selection-drift balance (SS-MSD). (B) Div/Pol ratio of selected sequences. Dashed line indicates the expected Div/Pol ratio under SS-MSD. (C) Div/Pol ratio in neutral sequences adjacent to sequences with L selected sites. n = 12.

The SS-MSD model also predicts that the divergence:polymorphism ratio (Div/Pol) for weakly selected mutations decreases with increasing {alpha} because weak selection has stronger effects in reducing the rate of fixation than the level of polymorphism. Fig 3B shows the Div/Pol ratio for the region containing mutations under selection again for . For the case of , Div/Pol decreases with selection but to a lesser degree than that expected for a SS-MSD case. For the intermediate case of , Div/Pol is only barely affected by selection. More exceptional is the situation in which the number of sites under selection is moderate to large (e.g., ): In these cases Div/Pol increases not only relative to single-site expectations but also relative to neutral expectations. This trend results from two opposing effects of IS on selected mutations, increasing divergence and reducing polymorphism. Neutral sites (Fig 3C) show a consistent increase in the Div/Pol ratio with increasing IS, caused by the effect that IS has in reducing levels of linked neutral polymorphism.

Mutation frequency spectrum: The SS-MSD models predict that, as {alpha} increases, weakly selected mutations will become less abundant and allele frequencies at polymorphic sites will decrease compared to neutral expectations. Fig 4 plots Tajima's D statistic, a measure of the skew of allele frequency compared to neutral frequency spectrum, for selected and neutral sequences under complete linkage. In the selected sequence, a more negative Tajima's D is observed with increasing selection intensity, as expected (see TACHIDA 2000 Down). As L increases, and hence IS, a further excess of rare mutations is seen compared to single-site expectations for any given {alpha}N. This trend does not hold, however, when {alpha}N and L are large, (i.e., ), where Tajima's D becomes unaffected or even less negative. This is, however, not surprising because Tajima's D statistic is not entirely independent of the number of segregating sites: It tends toward zero as the number of segregating sites in a sample becomes small for any given (nonneutral) frequency of variants. Therefore, IS increases the relative frequency of rare variants (hence it induces a negative Tajima's D) but IS also decreases the number of segregating sites, thus biasing Tajima's D estimates closer to zero when IS and its reduction of the number of segregating sites are severe.



View larger version (27K):
In this window
In a new window
Download PPT slide
 
Figure 4. Tajima's D statistic under IS in selected and adjacent neutral sequences for {rho}N = 0. Results are shown for different scaled selection coefficients per site ({alpha}N) and different numbers of selected sites (L). n = 12.

The frequency spectrum of neutral mutations departs from the neutral equilibrium expectation, showing an excess of low frequency alleles when IS occurs in the adjacent selected sequence. Tajima's D becomes more negative as the number of selected sites or {alpha}N on these sites increases. When IS is strongest, the skew toward low frequencies becomes similar for both selected and neutral mutations, a trend we have also encountered for heterozygosity.

IS also influences the allele frequency of variants in the selected sequences when recombination is highest while the frequency spectrum of neutral variants in adjacent sequences remains mostly unaffected. The skew toward low frequency variants in the selected sequences increases with L, although to a lesser degree compared to {rho}N <= 0.004, and this effect intensifies with increasing {alpha}N.

IS and linkage disequilibrium: HILL and ROBERTSON 1966 Down showed that as recombination decreases (hence IS increases) there is an increment in repulsion associations between mutations under selection. MCVEAN and CHARLESWORTH 2000 Down further showed that this negative LD decreases with the recombinational distance between selected sites. Here, we investigated how the intensity of selection and the number of sites under selection—two contributors to IS—each influence the extent of LD, as measured by D' (LD-D'; LEWONTIN 1964 Down). Again, we studied both selected and neutral mutations to better infer the mechanism influencing LD. Fig 5A shows that negative LD increases with L, for both intermediate- and no-recombination cases. The increment in LD is detected not only in the selected sequence but also in the adjacent neutral sequence. When L (and therefore IS) is small, LD-D' is clearly different for the selected and neutral sequences. But for larger values of L the magnitude of LD becomes more similar for the two classes of mutations. Increasing the intensity of selection {alpha}N increases LD (with repulsion associations) in the selected sequences to a greater extent than in the adjacent neutral sequences (see Fig 5B for ). When recombination is very high , LD-D' also varies in the selected sequences for different {alpha}N, but not in the adjacent neutral sequences.



View larger version (19K):
In this window
In a new window
Download PPT slide
 
Figure 5. Linkage disequilibrium measured by D' (LD-D') in selected (solid lines) and adjacent neutral sequences (dashed lines) for {rho}N = 0 and {rho}N = 0.004. (A) LD-D' for different numbers of sites (L) in the selected sequence and {alpha}N = 1. (B) LD-D' for different scaled selection coefficients per site ({alpha}N) for L = 2500.

However, the conditions that increase negative LD are the very same conditions for which there is an excess of low-frequency variants (with more negative Tajima's D estimates). Because most measures of LD correlate, to some degree, with the frequency of the mutations under study (LEWONTIN 1988 Down; see MATERIALS AND METHODS), the observation of greater negative LD-D' in the same situations in which Tajima's D is also more negative might well be two faces of the same phenomenon: IS skews the frequency of mutations in the population. To assess whether the increment in D' with IS is only the result of an increasing skew in the mutational frequency spectrum, we studied the average LD-D' for different average Tajima's D classes (i.e., as a proxy of different frequency classes); our interest centered on the results within each frequency class. Investigation of adjacent neutral sequences allowed us to eliminate the direct effects that selection has on the frequency spectrum of the selected mutations. Results, shown in Fig 6, show that LD-D' becomes increasingly negative for each frequency class as selection intensity {alpha}N (and IS) increases. Therefore, the observed increase in LD with IS, as measured by D', is not only the consequence of a shift in the frequency spectrum.



View larger version (34K):
In this window
In a new window
Download PPT slide
 
Figure 6. Linkage disequilibrium measured by D' (LD-D') in neutral sequences adjacent to selected sequences in different Tajima's D classes for different scaled selection coefficients per site ({alpha}N). L = 2500 and {rho}N = 0.

Temporal dynamics after a change of recombinational environment: Genome-wide and/or gene-specific changes in recombination rates may be common in many evolutionary systems, and so it is important to study the time needed to reach new equilibria. For instance, in Drosophila there is extensive gene order shuffling within chromosomal arms between species (LEMEUNIER and ASHBURNER 1976 Down; PINSKER and SPERLICH 1984 Down; PAPACEIT and PREVOSTI 1989 Down; WHITING et al. 1989 Down; SEGARRA and AGUADE 1992 Down; KRESS 1993 Down; LOZOVSKAYA et al. 1993 Down; GALLEGO et al. 1999 Down), which most likely causes frequent changes in recombination rates (CHARLESWORTH 1994 Down).

We studied the number of generations needed to reach equilibrium after a change in the recombination rate under IS conditions. For simplicity we assumed a population at equilibrium under the initial conditions and the instantaneous fixation of a randomly chosen allele in a new recombinational environment. Such a situation might apply to a gene located near the breakpoint of a chromosomal rearrangement that was quickly driven to fixation (possibly by natural selection). In accord with expectations, most population and evolutionary parameters reach their new equilibria much sooner than codon usage (Fig 7A and Fig B). Under neutrality, few Ne generations ({cong}4Ne for diploid individuals) after a hitchhiking event are sufficient to achieve near-equilibrium levels of polymorphism (PERLITZ and STEPHAN 1997 Down). Under IS the time required for weakly selected mutations to reach values close to those at equilibrium for parameters such as {theta}s or Tajima's D increases with {alpha}N, requiring {cong}20–40N generations when while this time is close to 4N generations when {alpha}N <= 0.5.



View larger version (49K):
In this window
In a new window
Download PPT slide
 
Figure 7. Temporal dynamics in selected sequences across 250 N generations after a change of recombination environment (see text for details). Black lines indicate the change from total linkage ({rho}N = 0) to high recombination ({rho}N = 0.4) and gray lines from high recombination to total linkage. The frequency of preferred codons (P), polymorphism levels ({theta}s, n = 12), Tajima's D (n = 12), and fixation rates are shown for L = 2500. Each value represents the average of 10 independent simulations.

Multilocus parameters, such as those that estimate codon usage, take a very large number of generations to reach a new equilibrium following perturbation. Indeed, codon usage requires {cong}100–250N or ~1–2.5/µ generations to reach the new equilibrium. This required time is not strongly dependent on the number of sites under selection, but it is dependent, as expected, on µ (data not shown). Two other features of codon bias evolution following a change of recombination environment were also observed. First, the change in codon bias is faster when the number of preferred mutations is increasing (i.e., changing from low to high recombination) than when the number is decreasing. Second, the weaker the selection the longer the period required to reach the new base composition equilibrium in either direction. These two features can be easily explained by three factors: (i) The average time to fixation of weakly advantageous mutations is shorter than that expected for weakly deleterious mutations, (ii) the speed to fixation of preferred mutations increases with {alpha}, and (iii) mutational pressure toward unpreferred mutations is higher when the ancestral sequence has higher P.

We also observed a tendency for population and evolutionary parameters to "overshoot" their equilibrium values when a sequence changes from no recombination to a high recombination rate (stronger for than for ), but this effect is not detectable in the opposite direction. This overshoot creates a transient situation more nearly resembling a neutral equilibrium. Under the conditions we investigated, population parameters are closest to the neutral expectations {cong}4–5N generations after the fixation of a single sequence in an environment with high recombination. For instance, in the case of , {theta}s changes through time from zero (right after the fixation event) to the new IS equilibrium under high recombination (after {cong} 40N generations) with an intermediate state 50% higher than that finally observed at equilibrium.

IS and its effect across regions under uniform selection: Consider an interval of a recombining genome in which segregating sites under selection result in IS, and further assume this interval is embedded in a region containing few additional mutations under selection. Since the magnitude of the IS effect acting at a particular site in this interval will be governed by the interactions between the segregating sites located on both sides of that site, it is reasonable to expect that IS will be stronger at sites embedded in the middle of a region under selection than at sites located close to an edge of this region. This situation may apply to many protein-coding regions in eukaryotic genomes, but it would be pertinent to any group of physically clustered sites under weak selection surrounded by largely unconstrained sites. Here, we investigate the magnitude of the "center" vs. "edge" effect for plausible rates of recombination and selection. The issue under scrutiny is whether IS differs measurably between the center and edge of a region under selection when both mutation rates and selection coefficients are uniformly distributed across the region.

We studied population and evolutionary parameters across sequences with 2500 sites under uniform selection, recombination, and mutation, with emphasis on a lateral and the central region of 250 sites each. Two indicators of IS are depicted in Fig 8 for the central and lateral regions: the proportion of preferred codons (P) and the divergence to polymorphism (Div/Pol) ratio. As expected, no effect is seen for the no-recombination case . For intermediate recombination rates, IS differs between regions, with the central region showing stronger IS (central regions have lower P and higher Div/Pol ratio than lateral regions). This heterogeneous distribution of IS across regions (the center effect) decreases when recombination is very high, but it can still be seen for high recombination when selection intensity is weak .



View larger version (24K):
In this window
In a new window
Download PPT slide
 
Figure 8. Comparison between central (dashed line) and lateral (continuous line) regions of sequences under selection for different recombination rates ({rho}N) and scaled selection coefficients per site ({alpha}N). Two indicators of IS are depicted: frequency of preferred codons (P) and divergence:polymorphism ratio (Div/Pol). The complete sequence under selection has 2500 sites and the lateral and central regions have 250 sites each. Selection coefficients and recombination and mutation rates are uniformly distributed across the entire sequence.

The effect of neutral sequences between regions under selection: We studied whether or not small changes in the overall recombination rate between two regions under selection, caused only by a change in the physical distance between them, have a detectable effect on the overall IS. Here, the simulation procedure allowed us to generate a variable number of neutral sites (i.e., middle or "spacer" sequence) between two sequences under selection. The two regions under selection were identical, with equal numbers of selected sites, selection coefficients per site, and mutation and recombination rates per site. Mutation and recombination rates per site in the spacer sequence are the same as in the flanking selected sequences, but the mutations were selectively neutral. Thus, with respect to IS the presence and length of the intermediate neutral region alters only the number of recombination events between the two selected regions; it does not change directly any parameter on the flanking selected sequences.

We studied intermediate rates of recombination for the case of a neutral sequence located between two sequences each of 500 selected sites. Fig 9A depicts the relative change of the effectiveness of selection (i.e., {alpha} based on P) caused by the presence and length of the spacer sequence. The results show that the length of an intermediate neutral region has a detectable effect. In all cases, longer spacers lead to an increase of the effectiveness of selection (a reduction in IS) on the adjacent selected mutations. Serving as illustration, for the case of and the presence of a 1000-bp-long region in the middle of the selected sequence is equivalent to a relative increase of 7.4% in the overall fitness associated with the selected sequences (i.e., a gain of 2.3% preferred codons). A substantial fraction of the potential increment in fitness in regions of moderate to high recombination is achieved with short/intermediate sequences (<1000 bp), while for regions of more severely restricted recombination longer sequences are required to produce an equivalent increment in fitness. The maximum relative gain in fitness is higher for than for for the two rates of recombination investigated, as expected (see Fig 1).



View larger version (24K):
In this window
In a new window
Download PPT slide
 
Figure 9. Consequence on IS of the presence and length of an intermediate neutral sequence. Simulations are based on two sequences under selection of 500 sites each separated by an intermediate neutral sequence of Linterm sites (see text for details). Results are shown for 0 <= Linterm <= 1000 sites and for Linterm = {infty} (inf.). Open and solid diamonds depict {alpha}N = 2.5 and {alpha}N = 1.0, respectively. (A) Selected region: relative effectiveness of selection ({alpha}) per site based on the frequency of preferred sites (P). Solid and dashed lines are {rho}N = 0.0004 and {rho}N = 0.004, respectively. (B) Neutral region: Tajima's D statistic (solid lines) and linkage disequilibrium D' (LD-D'; dashed lines) for {rho}N = 0.0004. n = 12.

Empirical tests of IS based on D. melanogaster's genome
Distribution of codon bias within genes: As indicated in our simulations, IS is expected to be stronger in the center of regions under IS than in the margins of these regions. This leads to the first test prediction: Codon usage bias, a measure of the effectiveness of selection, will be lower in the middle of coding regions of genes than in the amino- or carboxy-terminal regions. Comparing codon bias levels within genes eliminates expression level and gene length as factors that can alter codon usage (MORIYAMA and POWELL 1998 Down; COMERON et al. 1999 Down; DURET and MOUCHIROUD 1999 Down). It also controls for any possible heterogeneity in mutational tendencies, either across the genome or associated with transcription rates. We have not attempted to control for heterogeneity in amino acid constraints within genes, but we expect this to obscure rather than to enhance the predicted IS effect (see also below).

We restricted our attention to the set of genes in the D. melanogaster genome composed of single long exons (>333 amino acids; see MATERIALS AND METHODS), a total of 659 genes. The frequency of GC at the third position of codons (GC3) was used as a measure of codon usage bias (SHIELDS et al. 1988 Down; AKASHI 1994 Down, AKASHI 1995 Down; see MATERIALS AND METHODS). For each gene, we measured GC3 in three sections of 100 codons, the first and last 100 codons (proximal and distal regions, respectively) and 100 codons in the middle of the gene (central region). As shown in Fig 10A, average GC3 frequencies differ significantly across the three regions (Friedman ANOVA test, , P < 1 x 10-6), with a lower GC3 in the central region. A similar result is obtained when the average GC3 of the two lateral sections is compared to the central section or when each lateral region is compared separately to the central section (, respectively). On average, the lower GC3 content in the central region of coding regions is equivalent to a reduction in {alpha} of ~10% on synonymous mutations compared to lateral regions.



View larger version (12K):
In this window
In a new window
Download PPT slide
 
Figure 10. GC content at the third position of codons (GC3), as a measure of codon bias, across the same gene in D. melanogaster. (A) Central and lateral (5' and 3') regions of 659 genes constituted by a single long exon (>333 amino acids). (B) Central and lateral (5' and 3') regions of 187 genes (>333 amino acids) containing only centrally located introns. Central and lateral regions are 100 codons long. The central region of genes without introns shows a significantly reduced GC3 and other measures of codon bias, compared to the lateral regions of the same genes (see text for details).

IS simulations also show that the intensity of IS increases with the length of the gene region (and hence number of sites) subject to weak selection. This leads to the second test prediction: The relative reduction in codon bias in the center of a gene will be positively correlated with the length of the coding region. Consistent with this prediction, we find a highly significant positive correlation between the length of the coding region and the difference of GC3 between lateral and central regions in the same set of 659 genes analyzed above (Spearman's correlation ). Quantitatively similar results are obtained with the frequency of preferred codons and with GC content at fourfold degenerate sites; data not shown.

According to our simulations, the presence of neutrally evolving sequences placed in the center of a region subject to weak selection can relieve the IS effect. This leads to the third test prediction: Centrally located introns will ameliorate the effect of IS in the central region of genes that contain them. To test this prediction, we compared codon bias in these same 659 genes, which lack introns, with comparable genes with introns located in the center of the coding regions (see MATERIALS AND METHODS). Fig 10B shows the results for the 187 genes obtained from the genome database satisfying these criteria. For these genes, there is no apparent reduction of GC3 in the middle of the coding regions. Accordingly, we do not detect significant heterogeneity of GC3 between the three regions or a difference between the GC3 content of central and the two lateral regions (P > 0.15). In addition, the central sections of coding regions of genes with central intron(s) have a significantly higher GC3 content than the equivalent central sections in genes without introns (Mann-Whitney U-test, ) whereas both lateral sections show similar frequencies (P = 0.61 and P = 0.09). Therefore, the lower GC3 frequency in the middle of the coding region in genes without introns cannot be the result of general relaxed selection on codon bias in the central part of coding regions.

Proportion of selected sites in a gene and codon bias: According to our simulations, the presence of neutral sequences embedded in a region under selection causes an increase in the effectiveness of selection on adjacent selected sequences, the length of such neutral sequences being positively correlated with the increment of the effectiveness of selection. We investigated, therefore, a fourth test prediction: Codon bias will be positively correlated with measures of a gene's intron length and number. As a first approximation, we studied the relationship between measures of codon bias (e.g., GC3) and total intron length in the set of all genes with confirmed intron/exon structure. There is a weak positive relationship between GC3 and total intron length, both using all introns (R = 0.040, P = 0.0007) and after eliminating the small fraction of introns with detectable remnants of TE elements (R = 0.041, P = 0.0005).

We also studied the relationship between codon bias and measures of the proportion of sites under selection in a gene. As a simple measure of the density of sites under selection in a gene, we used the PLCR in a gene when embedded introns are included (see MATERIALS AND METHODS). The prediction under IS is again explicit: Codon bias (as measured by GC3) will decrease as PLCR increases. The analysis of all 7499 genes with introns reveals a significantly negative relationship between GC3 and PLCR (R = -0.136, P < 1 x 10-6); equivalent results are obtained using other measures of codon bias. Fig 11, a display of GC3 when genes are grouped with respect to PLCR into five sets of equal sample size, shows that the effect may be stronger when PLCR is medium/high.



View larger version (16K):
In this window
In a new window
Download PPT slide
 
Figure 11. Relationship between the proportion of the length of the coding region (PLCR) in a gene and GC3 in D. melanogaster (see MATERIALS AND METHODS). The five PLCR classes divide the data set of 7499 genes into subsets of equivalent size (n {cong} 1500 genes). The analysis of all genes with introns in D. melanogaster shows a significantly negative relationship between PLCR and GC3 (and other measures of codon bias; R = -0.136, P < 1 x 10-6; see text for details).

Gene length may have a confounding effect on the relationship between GC3 and PLCR because the length of coding region is negatively related to codon bias (MORIYAMA and POWELL 1998 Down; COMERON et al. 1999 Down; DURET and MOUCHIROUD 1999 Down). Gene length is, for obvious reasons, positively correlated with PLCR (R = 0.11, P < 1 x 10-6). Multiple linear regression analysis corroborates that GC3 is negatively associated with PLCR (partial r = -0.094, P < 1 x 10-6) after controlling for the length of the coding region (multiple r = 0.19, P < 1 x 10-6). Thus, there is a significant negative relationship between GC3 and PLCR not attributable to gene length.

Gene length and intron presence: IS predictions of the favorable consequences of intermediate or spacer sequences forecast that intron length will increase with the length of the coding region. The average length of introns increases with the total length of the coding region (R = 0.219, P < 1 x 10-6). This relationship is not attributable to differences in either recombination rates or gene expression levels and remains significant (P < 1 x 10-6) after controlling for these variables. A greater number of introns are also observed in long genes (R = 0.53, P < 1 x 10-6) although this observation could be connected to causes other than IS.

Intergenic distance and gene length: In addition to having longer (and a greater number of) introns in relation to the length of a coding region, is there also evidence for greater intergenic distance as a function of length of coding regions of adjacent genes? This result is expected under a scenario where longer intergenic regions are favored when, otherwise, IS between adjacent genes would be enhanced, i.e., when the lengths of the neighboring coding regions increase. To address this seventh test prediction, we investigated the length of intergenic regions separating well-defined genes (see MATERIALS AND METHODS). The results, displayed in Fig 12, reveal a positive relationship between the length of the 6271 intergenic sequences investigated and the length of the flanking coding regions (R = 0.097, P < 1 x 10-6); a positive relationship is also observed in regions of high recombination (>3 x 10-8/bp/generation; R = 0.104, P < 1 x 10-6, n = 2367). Alternative explanations to this observation based on functional considerations might also be proposed, such as genes with longer coding regions, if they are functionally more complex, might require tighter gene regulation (and hence longer noncoding regions). But we are unaware of any explicit empirical support for this class of alternative explanations. If our interpretation of this correlation is correct, it would suggest that IS between adjacent genes might not be negligible in most of the range of recombination rates in Drosophila. This relationship is not an indirect consequence of the effect that recombination rates might have on both parameters: The length of the intergenic regions decreases with increasing recombination rates (R = -0.034, P = 0.008) but no relationship is detected between the length of coding regions and recombination (P > 0.40).



View larger version (23K):
In this window
In a new window
Download PPT slide
 
Figure 12. Relationship between the total length of two adjacent coding regions and the size of the intergenic sequence between these two genes in D. melanogaster. The analysis of 6271 intergenic sequences flanked by well-defined genes (see text) shows a positive relationship between the length of these intergenic sequences and the length of flanking coding sequences (R = 0.097, P < 1 x 10-6).

<