Abstract
The effects on nucleotide variation of adaptations to temperate habitats and of the possible bottleneck associated with the origin of European populations of Drosophila melanogaster should be detectable in DNA sequences given the short time elapsed relative to the species population size. We surveyed nucleotide variation in 109 fragments distributed across the X chromosome in a European population of D. melanogaster to detect the footprint of positive selection. Fragments were located primarily in large noncoding regions. Multilocus tests based on Tajima's D statistic revealed a significant departure from neutral expectations in a stationary panmictic population, with an important contribution from both positive and negative D values. A positive relationship between Tajima's D values and distance to coding region was detected, with a comparative excess of significantly negative D values in the subset of fragments closer to coding regions. Also, there was a significant heterogeneity in the polymorphism to divergence ratio, with 12 fragments contributing 42% to the test statistic. Moreover, these fragments were comparatively closer to coding regions. These findings would imply positive selection events, and thus selective sweeps, during the species expansion to Europe.
CHANGES in the biotic and abiotic environment of a species often lead to bursts of adaptive changes, which result from the action of selection on mutations conferring an advantage in the new environment (Gillespie 1991). Since expansions of the range of a species are generally accompanied by new environmental challenges, the study of DNA variation in derived populations offers an opportunity to unveil the footprint of positive selection. However, range expansions are often associated with founder events, which will also affect variation in DNA sequences. Thus, levels and patterns of variation in derived populations will reflect not only the selective events in their recent past but also the demographic history associated with the range expansion of the species. Although both effects are superimposed, multilocus surveys of variation might shed light on their relative importance, given that demographic events are expected to affect similarly all regions studied, whereas the effect of selection is locus specific.
Drosophila melanogaster is a cosmopolitan species originating in Central Africa that moved from tropical to temperate zones after the last glaciation (i.e., 10,000–15,000 years ago; Lachaise et al. 1988). Surveys of nucleotide variation at individual loci on the X chromosome revealed levels of variation generally lower in non-African than in African populations (Begun and Aquadro 1993; Andolfatto 2001). Although this observation is consistent with a stochastic loss of variation during the out-of-Africa expansion of the species (David and Capy 1988), it could be also explained by selective sweeps associated with its adaptation to temperate zones (Maynard Smith and Haigh 1974; Kaplan et al. 1989). Indications of both demographic and selective footprints on nucleotide variation have emerged from the joint analysis of the patterns of variation detected in independent surveys of one or a few regions in derived populations of D. melanogaster (Andolfatto and Przeworski 2000; Przeworski et al. 2001) and more recently from the single-population multilocus survey of sequence variation in ancient and derived populations of this species (Glinka et al. 2003).
Here, we have surveyed nucleotide variation in 109 fragments of noncoding intergenic regions (according to Release 2 of the D. melanogaster genome sequence) distributed across the X chromosome of D. melanogaster in a sample of 13 sequences from a single European population. The fragments studied were located, unlike in previous surveys, in rather large noncoding regions. This has allowed us to have fragments at diverse distances from coding regions. Since most of the fragments studied are in regions with normal or high rates of recombination, their distance to coding regions might greatly affect their potential to reveal selective events (Kaplan et al. 1989). Indeed, most adaptations might be mediated by changes in the conformation of proteins due to amino acid replacements and/or by changes in the amount of protein produced and, therefore, most adaptive changes would occur in coding regions and their close flanking regions. The footprint of positive selection on linked variation (linked selection) would thus be more prevalent in noncoding fragments the closer they were to coding regions. Our analysis of the possible relationship between nucleotide variation and physical distance to coding regions has allowed us to ascertain the role of positive selection in shaping nucleotide variation in the population studied.
MATERIALS AND METHODS
Drosophila strains:
Isofemale lines of D. melanogaster were established upon collection in Sant Sadurní d'Anoia (Catalonia, Spain) in September 2001. F1 individuals were used to obtain isochromosomal lines for the X chromosome by a series of crosses with the balancer stock FM6, y31d sc8 dm B / l(1)X10. Thirteen of these X-isochromosomal lines were used in the present study. One D. simulans isofemale line from the same collection was also used.
Choosing fragments:
Release 2 of the D. melanogaster genome sequence (http://flybase.org) was used to choose 109 fragments of noncoding regions (and to design the corresponding amplification and sequencing primers) distributed across the X chromosome. These regions and fragments were chosen according to the following criteria: (i) noncoding regions as determined by GeneSeen inspection (http://flybase.org), (ii) noncoding region length >50 kb, (iii) amplification-fragment size ranging between 600 and 1400 bp, (iv) absence of homonucleotide runs longer than six in the database sequence, and (v) unique sequence in the genome confirmed by BLAST analysis (http://flybase.org). Condition ii had to be relaxed in some cases. PCR-amplification oligonucleotides were generally 20-nucleotides long and their melting temperature (TM) varied between 44° and 66°, with that of oligonucleotides from each pair differing by at most 3°. In the case of long fragments, additional internal primers were designed for sequencing. In some cases, additional primer pairs or internal primers were designed for successful amplification and sequencing in D. simulans. The list of the successful amplification primer pairs and conditions is available from the authors. All data concerning the location of fragments, distance between fragments, and distance to the closest coding region and coding transcription units are according to Release 3.1.Od (December 2003) of the D. melanogaster genome sequence.
DNA extraction, amplification, and sequencing:
Genomic DNA from individuals of each line was obtained using a quick DNA extraction procedure (protocol 48 in Ashburner 1989). Three fragments were generally amplified together, and the products were treated with Exo-SapIT (Amersham Biosciences, Piscataway, NJ) and used directly as templates for sequencing with the ABI PRISM version 3.0 kit (Applied Biosystems, Foster City, CA) according to manufacturer's conditions. After ethanol precipitation, sequencing reactions were separated on an ABI PRISM 3730 sequencer (Perkin-Elmer, Norwalk, CT). Chromatograms were in all cases visually inspected, and all polymorphic sites checked both in each line and across lines. For over one-fourth of the fragments, a second and independent PCR reaction was used for sequencing; in all cases, both amplifications yielded the same sequence, which supports the lack of PCR artifacts in our experimental procedure. Fragments were sequenced on both strands.
Sequence analysis:
Sequences were assembled with the SeqEd 1.03 program (Applied Biosystems), multiply aligned with the Clustal_X program (Thompson et al. 1997), and the multiple alignments were edited with the MacClade version 3.06 program (Maddison and Maddison 1992). The DnaSP version 3.99 program (Rozas et al. 2003) was used to estimate nucleotide diversity and different summary statistics. Multilocus tests of neutrality were performed with a program kindly provided by S. Ramos-Onsins, except for the Hudson-Kreitman-Aguadé (HKA; Hudson et al. 1987) multilocus test (program HKA distributed by Jody Hey through http://lifesci.rutgers.edu/~heylab).
The significance of Tajima's D test statistic (Tajima 1989), and of Fu's Fs statistic (Fu 1997), was obtained by computer simulations of the neutral process in a stationary panmictic population (henceforth referred to as the neutral model) and under other demographic scenarios. The simulations (1000 replicates) were performed under recombination and conditioning on the number of segregating sites. Estimates of the recombination population parameter (R) were obtained for each fragment using the Hey and Kliman estimates (Hey and Kliman 2002) of the recombination rate (r) and considering Ne = 106 for D. melanogaster (Kreitman 1983; Andolfatto and Przeworski 2000). Two-tailed tests considering the median value were performed on these statistics given the lack of any prior hypothesis about the direction of departure from neutral predictions. The P-values associated with individual two-tailed tests were combined in a single test as indicated (Sokal and Rohlf 1995; Przeworski et al. 2001). One-tailed tests were performed whenever information on the direction of the deviation was useful (e.g., for graphical purposes).
Given the out-of-Africa range expansion of the species, a simple bottleneck model was considered. The model assumes that the ancestral population was in equilibrium, that its size decreased from N to αN at time Tb, and that the derived population recovered to size N at time T0. Times Tb and T0 are in units of 3N generations (given that the fragments are located on the X chromosome). The severity of the bottleneck (Sb) is determined by the reduction in population size [N/(αN)] and the duration of the bottleneck (Tb − T0) (Fay and Wu 1999). Sb is approximately proportional to its duration and size reduction [(Tb − T0)α−1] (Fay and Wu 1999). Indeed, the distribution of summary statistics such as Tajima's D is similar for bottlenecks with the same severity (e.g., for a bottleneck that is 10 times as long but causes a 10-fold lower size reduction). Severities (Sb) ranging from 3.3 × 103 to 3.3 × 10−5 were considered, with α values varying from 10−1 to 10−5, and duration (Tb − T0) from 3.3 × 10−6 to 3.3 × 10−2 (i.e., from 10 to 105 generations). As mentioned above, a severity equal to 0.33, for example, would correspond to different combinations of number of generations and size reduction: 105 generations, 106–105 individuals; 104 generations, 106–104 individuals; 103 generations, 106–103 individuals; and so on. Five Tb values ranging from 0.01 to 0.05 (i.e., from 30,000 to 150,000 generations ago, or 3,000 to 15,000 years ago) were considered in an effort to encompass possible scenarios for European-derived D. melanogaster populations (David and Capy 1988; Lachaise et al. 1988). For certain parameter sets, the assumption of equal sizes for the ancestral and derived populations was relaxed, and ratios of derived to ancestral population size (β) equal to 0.5 and 0.25 were considered.
Both a parametric (Pearson's correlation coefficient) and a nonparametric (Spearman's coefficient of rank correlation) approach were used to evaluate the relationship between Tajima's D values (or their associated probabilities) and physical distance to coding regions.
Multilocus tests of neutrality, as well as correlation analysis, require independence of the different data points. Variation in the fragments studied was considered to have evolved independently given the average distance between contiguous regions and their general location in regions with normal (or high) levels of recombination (see below).
RESULTS
Fragments studied:
The distribution of the 109 fragments studied across the X chromosome is shown in Figure 1. The average distance between contiguous fragments was 199.4 kb, with only eight cases where the distance was <50 kb. According to Release 3.1.Od of the D. melanogaster genome sequence, 83 fragments were in intergenic regions, 25 within generally long introns, and only 1 was in a coding region. Figure 2 shows the distribution of the fragments studied according to their distance to the closest coding region. For the 108 noncoding fragments, this distance varied between 1 kb (2 fragments) and 52 kb (1 fragment) with an average distance of 14.5 kb. The length of the noncoding region in these 108 fragments varied between 9.7 and 177.1 kb, with an average length of 45.5 kb.
Location of the 109 sequenced fragments (diamonds) across the X chromosome. The location of fragments >14.7 kb apart from their closest coding regions (see text) is shown as triangles. The distribution of the recombination rate across the chromosome (Hey and Kliman 2002) is also represented (solid line).
Distribution of the 109 sequenced fragments according to their distance to the closest coding region as indicated in Release 3.1.Od (December 2003) of the Drosophila melanogaster genome sequence. Shading differentiates noncoding intergenic regions (open bars) from intronic (shaded bars) and coding (solid bar) regions.
Levels of polymorphism and divergence:
A total of 1030 nucleotide polymorphic sites and 253 indels were detected in the 109 fragments surveyed. Table 1 gives a summary of different measures of nucleotide variation in the complete data set, while the detailed information is provided in supplemental Table S1 at http://www.genetics.org/supplemental/. The average number of segregating sites was 9.4, with an average of 3.3 singletons per fragment. Only 2 fragments harbored no variation, and 8 had one single segregating site. Average nucleotide diversity was 0.0038.
Nucleotide polymorphism and divergence
The minimum number of recombination events inferred from the presence of all four gametic classes (RM, Hudson and Kaplan 1985) was 1.1 per fragment, and most fragments with two or more parsimony informative sites presented at least one recombination event (56 of 81). This observation indicates that recombination has played an important role in the history of the fragments surveyed, as expected from the recombination rates estimated from the comparison of the physical and genetic maps (Hey and Kliman 2002; Figure 1). There was also evidence in the data for recombination between fragments, even in the case of the two fragments <20 kb apart.
A positive but not significant correlation between nucleotide diversity and rate of recombination was detected (Pearson's R = 0.166, P = 0.084); this weak correlation vanished when fragments in regions of low recombination (r < 1 cM/Mb) were excluded from the analysis. This lack of a significant correlation was also observed in previous similar analyses (Przeworski et al. 2001; Glinka et al. 2003) and could reflect the superimposed effects of selection and recent demographic events in European populations.
The homologous sequence of 85 fragments was obtained in D. simulans. These fragments constituted a random sample of the 109-fragment set relative to both recombination rate and nucleotide variation (with an average number of segregating sites equal to 9.3). The average nucleotide divergence between species was 0.051. This estimate is much lower than previously reported values for synonymous sites (0.102 in Moriyama and Powell 1996) but similar to that reported for noncoding regions (0.062 for a European population; Glinka et al. 2003).
Under the neutral model, levels of polymorphism and divergence should be correlated since both are a function of the neutral mutation rate. The multilocus HKA test yielded a significant result (χ2 = 142.56, d.f. 84, P < 0.001), indicating the uncoupling of the polymorphism and divergence levels, and therefore a larger than expected variance in the polymorphism to divergence ratio. The larger variance might be due to the action of selection in a subset of regions, but also to violations of the model assumptions. However, the uneven contribution of the fragments to the test result (with 12 fragments accounting for ∼42% of the χ2 value) would point to selection rather than to demographic history as the causing force and more specifically to directional selection given the reduced level of polymorphism in 8 of these fragments.
Pattern of polymorphism:
Tajima's D statistic (Tajima 1989) was calculated for each of the 107 polymorphic fragments, and its significance was established by computer simulation based on the coalescent process under panmixia and recombination. Individual tests yielded a significant departure of the neutral model in 34 cases, with 19 showing a significantly negative D value and 15 a significantly positive D value (Figure 3).
Distribution of the one-tailed probabilities, P(D ≤ Dobs), associated with Tajima's D values calculated for each of the 107 polymorphic fragments sequenced.
Multilocus tests based on the mean and variance of Tajima's D across the 107 studied fragments were also performed. Although the multilocus test based on the mean value of D yielded a nonsignificant result (, P = 0.532), that based on its variance was highly significant [Var(D) = 1.147, P < 0.001]. To rule out the possible nonindependence of some of the data points (despite the general evidence for recombination both within and between fragments), the multilocus test on Tajima's D variance was also performed (i) excluding fragments that were physically rather close (either <50 or <100 kb) and (ii) excluding fragments that were located in regions with moderate or low rates of recombination (<2 or 1 cM/Mb). These tests were in all cases highly significant (P < 0.001). To further explore the significantly high variance of Tajima's D across the 107-fragment set, we obtained the combined probability of the P-values associated with individual (two-tailed) tests (Sokal and Rohlf 1995). The test yielded a highly significant result (P = 1.407 × 10−23), indicating an excess of significant D values (either positive or negative). Figure 3 clearly shows that both positive and negative D values contributed to this excess. Indeed, the combined probabilities calculated separately for positive and negative D values were very low (2.76 × 10−10 and 1.85 × 10−15, respectively), indicating an important and rather equivalent contribution from positive and negative D values to the detected deviation from the neutral model.
Frequency spectrum of polymorphisms and distance to coding regions:
The fragments surveyed differed in their distance to coding regions (Figure 2). The putative relationship between Tajima's D values and physical distance to the closest coding region was, therefore, examined. A positive relationship was detected (Pearson's R = 0.204, P = 0.035; Spearman's R = 0.159, P = 0.0054), pointing to an effect of distance on patterns of variation. A similar result was obtained when the one-tailed P-values (Figure 3) associated with the empirical D values [P(D ≤ Dobs)] were considered (Pearson's R = 0.190, P = 0.05; Spearman's R = 0.246, P = 0.0115). Excluding fragments that were physically rather close and/or fragments with moderate or low rates of recombination did not affect the results (not shown). No effect was detected either when the only fragment in a coding region was excluded from the analysis or when distance to the closest transcription unit was considered (results not shown).
To further study the effect of distance on patterns of variation, fragments were grouped according to their distance to the closest coding region. The distance corresponding to the average D value (∼14.7 kb) was used to divide the 107-fragment data set into two subsets comprising the 63 and 44 fragments located below and above that distance, respectively. Relative to the complete data set (), the average D values were
(for fragments closer to coding regions) and
(for more distant fragments). This relative shift toward negative and positive values supports the detected relationship with distance. Indeed, the numbers of significantly positive and negative D values differed significantly between the 63-fragment subset (5 and 14, respectively) and the 44-fragment subset (10 and 5, respectively), with a comparative excess of negative values in the 63-fragment subset (χ2 = 5.536, d.f. 1, P < 0.0186; Figure 4).
Distribution of the one-tailed probabilities, P(D ≤ Dobs), associated with the empirical Tajima's D values. (A) The 63 fragments <14.7 kb apart from the closest coding region; (B) the 44 fragments >14.7 kb apart from the closest coding region.
Pattern of polymorphism and demography:
Although the positive relationship detected between Tajima's D and distance to coding regions and the result of the HKA test would point to selection rather than to demography, we wanted to know to what extent the pattern of variation detected, and more specifically the distribution of Tajima's D, might reflect the bottleneck possibly associated with the range expansion of the species. Multilocus coalescent simulations were thus performed for bottlenecks of different severity and time of onset (see materials and methods). Simulations were first performed assuming that the ancestral and present populations had similar sizes. In each case, multilocus tests based on the mean and variance of Tajima's D across the 107 studied fragments were performed; also, the two-tailed P-values associated with the observed Tajima's D statistics were obtained, and their combined probability was calculated (Table 2). Severe (Sb ≥ 3.3) as well as weak (Sb ≤ 0.03) bottlenecks were in all cases incompatible with the data. For intermediate-severity bottlenecks (Sb = 0.33), nonsignificant results for both the test based on the mean and that based on the variance of this summary statistic were obtained only for Tb > 0.02, with marginally significant results for Tb = 0.02. However, the combined two-tailed probability of individual D values was negligible for large Tb values (Tb ≥ 0.04) and had some appreciable but low value for times of onset equal to 0.02 and 0.03 (i.e., 6000 and 9000 years ago). Moreover, the analysis under the bottleneck scenario of other aspects of the data as summarized by the Fs statistic (Fu 1997) yielded highly significant results (Table 2) for most parameter combinations except for an intermediate-severity bottleneck (Sb = 0.33) of rather recent onset (Tb = 0.02). Similar, or more extreme, results were obtained when the assumption of equal sizes for the ancestral and derived populations was relaxed (Table 2; results not shown) and when simulations were conditioned on θ (results not shown). Given the uncertainty about the relationship between the population sizes in the ancestral and derived populations, and the low probability values obtained for both Tajima's D and Fu's Fs statistics even for the most favorable parameter combinations, our analysis would indicate that a rather recent bottleneck of moderate severity might have affected, but cannot fully explain, the pattern of X chromosome variation in the derived population studied.
Multilocus probabilities of Tajima's D and Fu's Fs empirical values under various demographic scenarios
DISCUSSION
European populations of D. melanogaster are derived populations that originated some 15,000 years ago (Lachaise et al. 1988). The effects on nucleotide variation of adaptations to temperate habitats and of the possible bottleneck associated with their origin should still be detectable in DNA sequences given the short time elapsed relative to the species population size (∼106; Braverman et al. 1995; Simonsen et al. 1995; Kim and Stephan 2002; Przeworski 2002). Theoretical studies have shown that Tajima's D statistic (Tajima 1989), which contains information on the number of segregating sites and on their frequency spectrum, is among the most powerful statistics to detect the short-lived signature of positive selection (Simonsen et al. 1995; Kim and Stephan 2002; Przeworski 2002). Multilocus tests based on this statistic were, therefore, used to differentiate the genome-wide effects of a bottleneck from the locus-specific effects from recent selective events. Moreover, multilocus tests based on Fu's Fs statistic that captures other facets of the data and is also very powerful (Fu 1997) were used to further evaluate possible bottleneck scenarios.
In our multilocus analysis, the variance of the empirical D values distribution is much larger than expected under the neutral model (P < 0.001). There is indeed an excess of both significantly positive and significantly negative D values, which clearly indicates that the frequency spectrum of polymorphisms is not affected similarly in all fragments studied. The distribution of Tajima's D values for the 107 fragments studied departs significantly from neutral expectations for a constant-size population. Our analysis of the empirical distributions of both Tajima's D and Fu's Fs values under a simple bottleneck model yielded significant results under all parameter combinations (Table 2). Some appreciable probability values were, however, obtained for a rather recent bottleneck of moderate severity. These results together with the significant decoupling between levels of polymorphism and divergence detected by the multi-locus HKA test reveal that selection has contributed to shape variation in the European population studied. Moreover, the average distance to coding regions for the 12 fragments with the largest contribution to the HKA statistic (11.6 kb) was below that for the 85 fragments analyzed (15.2).
Our observation of a positive relationship between the empirical D values (or their associated probabilities) and the distance to coding regions of the corresponding fragments is not expected either under the neutral model in a stationary population or under a bottleneck or any other demographic model. In both situations, the different D values among regions would result from the stochasticity associated with the evolutionary process and would be distributed randomly across the genome. On the other hand, the extent of the footprint left by directional selection on linked variation is inversely related to the rate of recombination (Maynard Smith and Haigh 1974; Kaplan et al. 1989; Charlesworth et al. 1993). In regions with normal and high rates of recombination, like most of those studied here, the effect of directional selection should rapidly decrease with distance to the target of selection. Under the simplifying assumption that targets of selection were restricted to coding regions and their flanking regulatory regions, the levels and patterns of variation in noncoding regions would be less affected by selective events the larger their distance to coding regions was. Even in a more realistic scenario, one would expect an uneven distribution of targets of selection, with the major proportion still in coding regions or in their vicinity.
The relative excess of significantly negative D values in fragments close to coding regions and the significant correlation between D values and distance to coding regions would therefore imply the action of selection. More specifically, it would point to selection affecting the frequency spectrum of polymorphisms toward an excess of rare variants. Positive selection driving advantageous mutations to fixation would be the best candidate contributing to explain this observation, since the pattern of variation after a selective sweep is characterized by an excess of rare variants that result in negative Tajima's D values. Indeed, strong purifying selection does not generally affect the frequency spectrum of polymorphisms (Charlesworth et al. 1993, 1995). On the other hand, weakly deleterious mutations could cause an excess of rare variants and therefore negative Tajima's D values (Tachida 2000; Gordo et al. 2002). However, weak selection has a less pronounced effect on the frequency spectrum than positive selection, which makes its effect detectable only in rather large samples.
The action of positive selection should affect both the level and pattern of variation. However, no correlation between nucleotide diversity and distance to coding regions was detected. Moreover, only a weak correlation between nucleotide diversity and rates of recombination (mainly attributable to the few fragments in regions of low recombination) and no correlation between Tajima's D and rates of recombination were detected. It could be argued that the first result might partly reflect a more short-lived signal of positive selection on the level than on the pattern of variation. Both observations might, however, result from the interaction between positive selection and demographic events. Indeed, such an interaction might have obscured the effects of advantageous fixations, with only the footprint left by very advantageous mutations being detectable. Further dissection of the possible interaction should, however, await further theoretical and analytical development, but most importantly (Andolfatto 2001) the results of similar single-population multilocus surveys in autosomal regions of D. melanogaster.
In summary, our multilocus survey of nucleotide variation in noncoding regions across the X chromosome has revealed that the pattern of variation as summarized by Tajima's D statistic has a large variance. Moreover, it has revealed a decoupling between levels of polymorphism and divergence and a positive relationship between Tajima's D empirical values and the physical distance of the corresponding fragments to the closest coding region. Although a rather recent bottleneck of moderate severity in the history of the European population studied could contribute to explain the large variance of Tajima's D, the detected polymorphism to divergence heterogeneity and the positive correlation detected would imply selective events during the range expansion of D. melanogaster to temperate zones and more specifically in the adaptation of the European populations to the new habitat.
Acknowledgments
We thank Anna Fernández-Robles, David Salguero, and Gema Blasco for excellent technical assistance and Serveis Científico-Tècnics from Universitat de Barcelona for automatic sequencing facilities. We also thank members of our group for fly collection; A. Blanco and A. Vilella for computer program implementation help; and J. Rozas, C. Segarra, and the anonymous reviewers for comments on the manuscript. We especially thank S. Ramos-Onsins for sharing unpublished software for multilocus tests of neutrality. This work was supported by grants QLRT-2001-00004 from the European Community; BMC2001-2909 from Comisión Interdepartamental de Ciencia y Tecnología, Spain; 2001SGR-00101 from Comissió Interdepartamental de Recerca i Innovació Tecnològica, Generalitat de Catalunya, Spain; and by special support (Distinció per la Promoció de la Recerca Universitària) from Generalitat de Catalunya to M.A.
Footnotes
- Received March 18, 2004.
- Accepted April 23, 2004.
- Genetics Society of America