Drosophila melanogaster is an important model organism in evolutionary genetics, yet little is known about the population structure and the demographic history of this species within sub-Saharan Africa, which is thought to contain its ancestral range. We surveyed nucleotide variation at four 1-kb fragments in 240 individual lines representing 21 sub-Saharan and 4 Palearctic population samples of D. melanogaster. In agreement with recent studies, we find a small but significant level of genetic differentiation within sub-Saharan Africa. A clear geographic pattern is observed, with eastern and western African populations composing two genetically distinct groups. This pattern may have resulted from a relatively recent establishment of D. melanogaster in western Africa. Eastern populations show greater evidence for long-term stability, consistent with the hypothesis that eastern Africa contains the ancestral range of the species. Three sub-Saharan populations show evidence for cosmopolitan introgression. Apart from those cases, the closest relationships between Palearctic and sub-Saharan populations involve a sample from the rift zone (Uganda), suggesting that the progenitors of Palearctic D. melanogaster might have come from this region. Finally, we find a large excess of singleton polymorphisms in the full data set, which is best explained by a combination of population growth and purifying selection.
DROSOPHILA melanogaster, the common pomace fly or vinegar fly, is an important model organism in many areas of evolutionary genetics. In particular, considerable research has focused on detecting the action of positive selection from patterns of nucleotide variation in this species (e.g., Harr et al. 2002; Kauer et al. 2002; Orengo and Aguadé 2004; Bauer DuMont and Aquadro 2005; Ometto et al. 2005; Pool et al. 2006). However, a deviation from the null hypothesis of neutral evolution in an equilibrium population might be caused either by a departure from selective neutrality or by a nonequilibrium population history. For example, negative values of the Tajima's D statistic are commonly interpreted as evidence of a recent selective sweep, yet demographic forces such as population growth, migration, and severe bottlenecks are equally capable of generating such a result (Tajima 1989). Other proposed tests of neutral evolution, such as Fay and Wu's H (Fay and Wu 2000) and the likelihood method of Kim and Stephan (2002), are also strongly influenced by demography (e.g., Depaulis et al. 2003; Jensen et al. 2005). Therefore, the inference of selection from population genetic data relies critically on an accurate understanding of demographic history.
In the case of D. melanogaster, our demographic knowledge is surprisingly limited. It is generally accepted that D. melanogaster had an ancestral range in Africa and has reached some parts of the world quite recently. On the basis of current species distributions, polymorphic characters, and ecological information, Lachaise et al. (1988) suggest an Afrotropical origin for D. melanogaster, specifically in western or central equatorial Africa. Studies of molecular polymorphism have supported the African origin of D. melanogaster. For example, Begun and Aquadro (1993) found that DNA sequence variation at seven X-linked loci was more than twice as high in a Zimbabwe sample compared to a sample from the United States. Additional studies involving microsatellites (e.g., Kauer et al. 2002) and DNA sequence variation (e.g., Ometto et al. 2005) have also found higher genetic variation in sub-Saharan than in cosmopolitan populations.
The expansion of D. melanogaster from sub-Saharan Africa into northern Africa and Eurasia is thought to have occurred on the order of 10,000 years ago. Lachaise et al. (1988) hypothesized that D. melanogaster might have crossed the central Sahara between 6500 and 9500 years ago, a period when this region may have been less arid. On the basis of the number of apparently new mutations in cosmopolitan samples, Baudry et al. (2004) estimate a time of 6400 years since the out-of-Africa bottleneck, although this estimate may represent the time since population size recovery. Thornton and Andolfatto (2006) recently applied a Bayesian inference method to infer bottleneck parameters from multilocus nucleotide variation in Zimbabwe and Netherlands populations, obtaining a point estimate of 16,000 years for the time since the bottleneck began (assuming 10 generations/year and an inferred initial effective population size of 2.4 × 106). However, we do not know where within sub-Saharan Africa the progenitors of Palearctic D. melanogaster originated, and therefore the suitability of Zimbabwe as a proxy for the cosmopolitan source population is uncertain.
Populations of D. melanogaster from Europe, northern Africa, and the Middle East typically show low levels of genetic differentiation (e.g., Baudry et al. 2004; Dieringer et al. 2005; Schlötterer et al. 2006). In contrast, Asian populations appear to be more structured (Baudry et al. 2004; Schlötterer et al. 2006), but are still thought to share a single, common origin with other Palearctic populations. The subsequent expansion of D. melanogaster to other parts of the world (such as the Americas and Australia) is thought to have occurred within the past few hundred years (David and Capy 1988), primarily because D. melanogaster is absent from the earliest insect collections in these regions. The primary source for New World populations of D. melanogaster appears to be Europe, but varying levels of direct African ancestry have also been inferred (Caracristi and Schlötterer 2003).
Rather little is known about the genetic structure and population history of D. melanogaster within sub-Saharan Africa. Although the divergence time between D. melanogaster and D. simulans has been estimated at 2–3 million years (MY) (Lachaise et al. 1988; Russo et al. 1995), Tamura et al. (2004) recently suggested a divergence time of 5.4 MY (after accounting for codon bias at 62 genes). According to the biogeographic hypothesis of Lachaise et al. (1988), this speciation event left D. melanogaster occupying a central African range west of the African Rift, with D. simulans inhabiting the eastern African coast and/or islands such as Madagascar. Both D. melanogaster and D. simulans eventually shifted from wild-living to human commensal species (although D. melanogaster is more strictly commensal), but the timing of these transitions is unknown (Lachaise and Silvain 2004).
Today, D. melanogaster can be found across the majority of sub-Saharan Africa, but relatively few studies of molecular variation have sampled more than one or two sub-Saharan populations. Early studies of sequence variation, using Kenya and Zimbabwe samples, found a very low level of differentiation between these populations (Begun and Aquadro 1995), a finding supported by subsequent work (e.g., Caracristi and Schlötterer 2003; Haddrill et al. 2005).
In contrast, studies of silent variation among Adh slow haplotypes (Bénassi and Veuille 1995) and length variation in trinucleotide coding repeats (Michalakis and Veuille 1996) found substantial differentiation between an eastern (Malawi) and a western African (Ivory Coast) population. In both of these studies, the greatest divergence was found between Malawi, on the one hand, and Ivory Coast and the cosmopolitan samples on the other hand. The loci sampled in these two studies were located on the second chromosome, and the high level of African differentiation observed might have been influenced by the In(2L)t inversion. This inversion, like several others, shows strong differences in frequency between sub-Saharan populations (Lemeunier and Aulard 1992; Aulard et al. 2002), which might be related to the action of selection.
Three recent studies of molecular polymorphism have included more than two sub-Saharan population samples of D. melanogaster. First, a study of sequence polymorphism at four X-linked protein-coding genes included two eastern (Kenya, Zimbabwe) and two western (Ivory Coast, Niger) populations, along with several cosmopolitan samples (Baudry et al. 2004). The authors reported significant FST values (0.050–0.072) between most pairs of sub-Saharan samples, with the exception of Kenya–Zimbabwe (0.026) and Niger–Zimbabwe (0.016), concluding that small but significant differentiation exists within sub-Saharan Africa. The Kenya sample was found to possess many of the common cosmopolitan haplotypes, and the authors proposed that cosmopolitan D. melanogaster may have originated in eastern Africa. Dieringer et al. (2005) conducted a microsatellite analysis that included 10 sub-Saharan population samples from five different countries. These sub-Saharan samples were found to be significantly different from northern African samples (which were more similar to European populations). A low level of subdivision within sub-Saharan Africa was suggested, but a clear geographic pattern did not emerge, and genetic distance was more highly correlated with collection date than with geographic distance. Haddrill et al. (2005) surveyed nucleotide variation at 10 X-linked noncoding fragments in Gabon, Kenya, and Zimbabwe samples from sub-Saharan Africa, along with The Netherlands and Pennsylvania cosmopolitan samples. These populations were found to group in three genetic clusters: Gabon, Kenya–Zimbabwe, and Netherlands–Pennsylvania. Compared with the eastern African populations (Kenya and Zimbabwe), the western African population (Gabon) had lower variation and greater linkage disequilibrium, which might suggest a relatively recent founding for this population. However, demographic inferences based on a few population samples are necessarily tentative, and the limited geographic sampling of molecular variation from sub-Saharan D. melanogaster has not provided a clear understanding of the history of these populations or of the spatial pattern of genetic structure.
Here we report an analysis of nucleotide variation from 25 population samples of D. melanogaster (Table 1), 21 of which are sub-Saharan (Figure 1). Data were obtained from a total of 240 isofemale lines at four 1-kb X-linked fragments, each located in a highly recombining, long intergenic region. The broad geographic sampling allowed us to identify a distinct geographic pattern of genetic differentiation and to make inferences regarding the history of D. melanogaster in sub-Saharan Africa and the founding of Palearctic populations.
MATERIALS AND METHODS
DNA sequence data were obtained from up to 10 isofemale lines from each of the population samples of D. melanogaster listed in Table 1 and from a single male D. simulans (from an isofemale line collected in Gisakura, Rwanda, in 2005). DNA was isolated from a homozygous X chromosome line or from a single male fly representing each isofemale line, using a method adapted from a protocol from G. Rubin (http://www.bio.com/protocolstools/protocol.jhtml?id=p82).
Four X-linked loci were amplified and sequenced; each was at least 10 kb from any annotated gene and located in a region of high recombination (Kindahl 1994; Hey and Kliman 2002). The loci studied are designated by their cytological locations: 4F2, 8A4, 11A5, and 12F1. These locations correspond approximately to genetic map positions 11, 24, 38, and 49 cM, respectively. Further information concerning the precise location, neighboring genes, and PCR and sequencing primers for these loci can be found in supplemental Table S1 at http://www.genetics.org/supplemental/.
PCR products were purified using shrimp alkaline phosphotase and exonuclease I enzymes. Sequencing reactions using ABI BigDye ddNTPs were read by an ABI 3730 capillary sequencer at the Biotechnology Resource Center of Cornell University (http://www.brc.cornell.edu). Sequence reads were aligned using Sequencher 4.2. Chromatograms were visually inspected at all putatively polymorphic sites. Data consisted of both single- and double-stranded sequence coverage, and as expected, no evidence of heterozygosity was observed. Final alignments were made using the CLUSTALW algorithm as implemented in MegAlign 5.08 and manually corrected.
HKA tests (Hudson et al. 1987) were conducted using an online, multilocus implementation (http://lifesci.rutgers.edu/∼heylab/HeylabSoftware.htm). Most DNA summary statistics, including Tajima's D (Tajima 1989) and the FST of Hudson et al. (1992), were calculated using DnaSP 4.0 (Rozas et al. 2003). FST distance trees were constructed using the neighbor-joining algorithm implemented in MEGA 3.1 (Kumar et al. 2004). The nearest neighbor statistic (Snn) of Hudson (2000) tests how often a sequence's nearest neighbor (most similar sequence) is from the same geographic sample. Snn matrices and corresponding P-values (from 10,000 permutations) for each locus were obtained using libsequence (Thornton 2003). The four locus-specific Snn P-values obtained for each population comparison were combined using the method described by Bailey and Gribskov (1998). Analysis of molecular variance (AMOVA; Excoffier et al. 1992) and Mantel matrix correspondence tests (Mantel 1967) for the relationship between genetic and geographic distances were performed using Arlequin 3.0 (Excoffier et al. 2005).
In addition to the summary statistics listed above, genetic differentiation was also examined using the Bayesian clustering method Structure 2.1 (Pritchard et al. 2000; Falush et al. 2003), which estimates the proportion of each individual's ancestry that derives from each of K genetic subpopulations. The analysis was run using the linkage model, which takes into account the genetic distances between sites. Since the relative positions for the linkage model only need to be proportional to genetic distances, the physical distances between sites were used as a proxy. Structure was run for a burn-in length of 1,000,000 replicates and a run length of 1,000,000 replicates. Results were graphically depicted using Distruct (Rosenberg 2004).
Unlike most of the analyses performed, Structure does not rely on a specific mutation model and can therefore accommodate both substitution and indel polymorphism. We included indel regions and sites with more than two segregating nucleotides in the Structure data, according to the following rules, which were intended to maximize the information retained while not counting an indel more than once: (1) Sites with singleton insertions were omitted; (2) singleton deletions were coded as missing data; (3) for nonsingleton indels of 1 bp, the deletion was coded as a fifth character state; and (4) for nonsingleton indels >1 bp, the indel region was collapsed to represent a single variable site, and each unique haplotype within the indel region was considered a separate allele.
Linkage disequilibrium was assessed by estimating the population recombination rate (ρ). For each population at each locus, ρ was estimated using the method of Hudson (2001), as implemented in LDhat 2.0 (http://www.stats.ox.ac.uk/∼mcvean/LDhat/). Log likelihoods for values of ρ from 0 to 1000 (per 1-kb locus, estimated with an interval size of 1) were summed across loci to obtain a composite likelihood surface and point estimate of ρ for each population. Because the estimation of ρ depends on the presence of variation, higher values of θ may lead to higher estimates of ρ. To remove this effect, estimates of ρ were divided by π (an estimator of θ), as suggested by Haddrill et al. (2005).
A simulation approach was used to test whether population growth could account for the number of singleton polymorphisms observed in the full data. The coalescent simulator ms (Hudson 2002) was utilized to simulate sets of four genetically unlinked loci resembling those examined in this study, each with a sample size of 240, a length of 768 bp, 110 segregating sites, and a population recombination rate (ρ) of 150/locus. The above length and number of segregating sites reflect the portion of our data for which ancestral states could be inferred from the D. simulans sequence, averaged across the four loci. ρ = 150 is a conservative estimate for the level of recombination, since the recombination rate for the loci studied is likely to be >3 × 10−8 recombination events/bp/generation (Kindahl 1994; Hey and Kliman 2002), the effective population size is ∼2.4 × 106 (Thornton and Andolfatto 2006), the full length of each locus is actually >1000 bp, and therefore ρ (3NerL) should be at least 3 × (2.4 × 106) × (3 × 10−8) × 1000 = 216.
A total of 4202 consensus sites were sequenced, of which 3222 were deletion free among the 240 D. melanogaster chromosomes sampled. These deletion-free sites contained a total of 506 variable sites, of which 471 were biallelic. Consistent with previous studies, we find that sub-Saharan populations are consistently more variable than cosmopolitan populations. As shown in Table 2, the sub-Saharan sample with the lowest nucleotide diversity (Niger: 0.0072) is substantially more variable than the most diverse Palearctic sample (Tunisia: 0.0039). Within sub-Saharan Africa, highly variable populations exist both in the east [e.g., Kenya–Malindi (KM), Malawi (Mw), Uganda (Ug), and both Zimbabwe samples] and in the west [e.g., Nigeria, (Ng) Cameroon–Oku].
Most sub-Saharan samples (17 of 21) gave negative values for Tajima's D (Table 2), indicating an excess of rare alleles compared to the neutral equilibrium expectation (Tajima 1989). On the lower end, average D-values between −0.8 and −0.9 were observed for both eastern (Kenya–Malindi and Uganda) and western (Gabon and Nigeria) populations. Excluding the Congo–Brazzaville (CB) sample (discussed below), a total of 17 of 80 single-locus D-values from sub-Saharan populations were significantly negative (falling below the 2.5% cutoff of D-values from neutral simulations). The negative D-values observed in these populations could reflect demographic histories involving population growth and/or purifying selection at the sequenced loci. Cosmopolitan populations gave both positive and negative D-values, with a high variance among loci (Table 2; supplemental Table S2 at http://www.genetics.org/supplemental/), consistent with expectations for bottlenecked populations (e.g., Haddrill et al. 2005). With only four loci, the information concerning variance in Tajima's D is clearly limited, but it was notable that all eastern African populations except Eritrea had V[D] < 0.15, while 9 of 12 western African populations had V[D] > 0.15.
The anomalously high variance in D observed in the CB sample is most likely a product of nonrandom sampling from this population. This sample did not consist of true isofemale lines and was included only to assess whether this population contained an appreciable quantity of unique variation. It did not, and was excluded from most subsequent analyses. Also excluded from most analyses was the Cameroon–Mbengwi (CW) sample, which was collected within 100 km of both the Cameroon–Oku (CO) and the Cameroon–Nkouondja (CN) samples. The CW sample was included primarily to test for fine-scale genetic structure, but none was observed, and levels of variation were similar (Table 2). Since CN and CO had larger sample sizes, most analyses were restricted to these samples.
None of the sequenced loci showed evidence for recent positive selection, and in general patterns of variability were consistent among loci (locus-specific summary statistics are presented in supplemental Table S2 at http://www.genetics.org/supplemental/). Locus 8A4 showed moderately lower variability (and divergence) overall, but no sub-Saharan population showed dramatically reduced variability at any locus. Multilocus HKA tests (Hudson et al. 1987) performed for each population sample resulted in no rejections of the neutral model (in fact, there were no P-values <0.25). Although some single-locus values of Tajima's D were observed to reject the null model (supplemental Table S2 at http://www.genetics.org/supplemental/), none of the cosmopolitan D-values fell outside the 95% confidence interval of empirical D-values obtained by Ometto et al. (2005) for 241 variable loci in a European population of D. melanogaster. For sub-Saharan populations (aside from the CB sample), all of the moderately negative D-values that had fallen outside the neutrally simulated distribution (supplemental Table S2 at http://www.genetics.org/supplemental/) fell within the 95% confidence interval of 253 empirical D-values from a Zimbabwe population (Ometto et al. 2005).
High pairwise FST values (generally >0.2) were observed between cosmopolitan and sub-Saharan populations (Table 3). High FST values were also observed between China (Ch) and other cosmopolitan samples, which is consistent with recent findings (Baudry et al. 2004; Schlötterer et al. 2006) of genetic structure among Asian populations of D. melanogaster. Comparisons among the other cosmopolitan samples (France, Israel, and Tunisia) gave no evidence for differentiation.
Within sub-Saharan Africa, most pairwise FST values were <0.1, but many of these were statistically significant (Table 3). In general, FST values >0.03 often had permutation test P-values <0.05 (depending on the sample sizes and the configuration of variation in the populations compared). Snn P-values were significant in about the same number of pairwise comparisons as those for FST, with mostly the same comparisons identified by both statistics (supplemental Table S3 at http://www.genetics.org/supplemental/). FST and Snn values were included primarily to provide a quantitative estimate of the differentiation observed between any given pair of populations, and the P-values listed in Table 3 and supplemental Table S3 (http://www.genetics.org/supplemental/) do not reflect corrections for multiple testing. However, given that 174 of 253 pairwise FST comparisons had P-values <0.05 (Table 3), while an average of 13 false positives would be expected in 253 tests at the 0.05 significance level, it seems that a large majority of significant results in these tables are likely to indicate genuine population structure.
Some sub-Saharan populations, particularly Eritrea (Er) and South Africa (SA), showed a consistent pattern of elevated differentiation when compared to other sub-Saharan samples (Table 3; supplemental Table S3 at http://www.genetics.org/supplemental/). An examination of locus-specific FST matrices and neighbor-joining population distance trees revealed particularly close relationships between each of these populations and cosmopolitan samples at certain loci (4F2 and 12F1 for Eritrea; 4F2, 8A4, and 11A5 for South Africa; locus-specific FST matrices are given in supplemental Table S4 at http://www.genetics.org/supplemental/), suggesting the possibility of a cosmopolitan admixture. More moderate elevations in FST were also observed for the Gabon (Ga), Kenya–Naivasha (KN), Niger (Nr), and Rwanda (Rw) populations. Gabon also showed potential evidence of cosmopolitan admixture, having a very close relationship to the cosmopolitan populations at locus 12F1 only (supplemental Table S4 at http://www.genetics.org/supplemental/). While the heterogeneous pattern of locus-specific cosmopolitan introgression that we infer seems contrary to the expectations of neutral gene flow, it is very similar to the pattern observed by Kauer et al. (2003) in D. melanogaster from Harare, Zimbabwe, and could indicate selection against immigrant genotypes at specific loci. While it can be difficult to decisively differentiate the hypotheses of cosmopolitan admixture vs. cosmopolitan co-ancestry, the high levels of linkage disequilibrium (low ρ/π estimates) inferred for these three populations (Table 4) are consistent with a history of recent admixture. The KN, Nr, and Rw samples showed no obvious evidence of large-scale cosmopolitan admixture at any locus, and their differentiation might be attributable to some combination of local population size reductions and restricted gene flow.
In contrast to the more differentiated sub-Saharan populations discussed above, Uganda showed particularly close relationships to many other sub-Saharan samples (Table 3; supplemental Table S2 and Figure S1 at http://www.genetics.org/supplemental/). While Uganda showed modest differentiation from most other eastern African populations, it showed essentially no differentiation from any of the Cameroon and Nigeria samples (despite a geographic separation of ∼2000–2500 km) and fairly close relationships to the other western African samples as well. A second group of sub-Saharan populations with no detectable differentiation was found within eastern Africa, composed of the Kenya–Malindi, Malawi, Zimbabwe–Kariba (ZK) and Zimbabwe–Sengwa (ZS) samples (which span a range of ∼2100 km).
When 17 sub-Saharan populations were analyzed (all except CB, CW, Er, and SA), a significant relationship was observed between genetic distance (measured as FST) and geographic distance (P = 0.00023; squared correlation coefficient = 0.26, Mantel test). However, the exceptions to this relationship mentioned above (closely related populations separated by large geographic distances) suggested that a simple isolation by distance model might not be a complete explanation for patterns of genetic differentiation in sub-Saharan D. melanogaster. A neighbor-joining population tree based on FST values also suggested a geographic pattern for differentiation among the sub-Saharan samples (Figure 2). Specifically, the eastern populations (KM, KN, Mw, Rw, ZK, and ZS) grouped separately from the western populations, with Uganda occupying an intermediate position. No relationship between genetic distance and geographic distance was detected within the eastern group of 6 populations (P = 0.40, Mantel test) or within the western group of 11 populations (including Uganda; P = 0.48). Thus, the relationship between genetic and geographic distances detected for 17 sub-Saharan populations appears to have been driven primarily by differentiation between eastern and western populations. Correspondingly, we note that comparisons between eastern and western sub-Saharan populations showed greater differentiation (mean FST = 0.062 for 66 pairwise comparisons) than comparisons among eastern populations (mean FST = 0.025 for 15 pairwise comparisons) or among western populations (mean FST = 0.011 for 55 pairwise comparisons).
The Mantel test was also applied to test for a correlation between genetic distance and collection date in sub-Saharan D. melanogaster, such as that reported by Dieringer et al. (2005). No such correlation was found (P = 0.11 for a simple Mantel test; P = 0.34 after accounting for the correlation between genetic differentiation and geographic distance). Most of the samples analyzed in this study were collected within the past few years (Table 1), and this might have prevented such a correlation from being found. However, the close relationships observed between the newer Kenya–Malindi and Malawi samples and both of the older Zimbabwe samples would argue against any recent and rapid genetic change at the loci examined.
Structure analysis was applied first to the entire data set. For K = 2 subpopulations, the cosmopolitan populations were placed largely in the cluster shaded in Figure 3A. Most sub-Saharan populations had ancestry in both clusters: western populations showed approximately equal ancestry in both clusters, while most eastern populations had a majority of ancestry in the unshaded clusters (Figure 3A). When the number of subpopulations was set to K = 3, an increase in the log likelihood of the data from −20,188 to −18,894 was observed. Although cosmopolitan samples were placed primarily in one cluster, the geographic pattern involving the other two clusters was not clear (supplemental Figure S2 at http://www.genetics.org/supplemental/).
To clarify the pattern of genetic differentiation within sub-Saharan Africa, Structure was run using 17 sub-Saharan samples (excluding CB, CW, Er, and SA). With K = 2 subpopulations, eastern samples tended to have similar levels of ancestry in both clusters, while western samples had a larger portion of ancestry in the unshaded cluster in Figure 3B. A significant correlation was observed between longitude and cluster membership (R2 = 0.53; P = 0.0009; supplemental Figure S3 at http://www.genetics.org/supplemental/), reinforcing the geographic basis of this differentiation.
Although runs with up to 10 subpopulations were performed in both cases, Structure gave little support for differentiation beyond K = 3 for the full data set and K = 2 for the sub-Saharan data set. While log-likelihood scores generally continued to rise for larger values of K, every run with K-values greater than those specified above produced one or more nearly empty clusters (having <0.1 of each population's ancestry), and further geographic patterns were not evident. Thus, there seemed to be no justification for inferring additional genetic clusters on the basis of the Structure results.
AMOVA (Excoffier et al. 1992) was performed using cosmopolitan (4 populations) and sub-Saharan (17 populations) groupings. These groupings were found to account for 9.9% of variation, with another 9.6% of variation found among populations within these groups, and 80.5% of variation occurring within populations. A second AMOVA was performed on the sub-Saharan populations only, using eastern (6 populations) and western (11 populations) groupings. In this case, 8.6% of variation in sub-Saharan Africa was between the eastern and western groups, compared to just 1.9% of variation between populations within those groups, while 89.5% of variation was found within populations.
Relative antiquity of eastern and western sub-Saharan populations:
The ancestral range of D. melanogaster is thought to be within sub-Saharan Africa (Lachaise et al. 1988). However, we do not know whether D. melanogaster has occupied a broad geographic range within sub-Saharan Africa for a long period of time, or if this species might have recently expanded within sub-Saharan Africa from a smaller ancestral range. Therefore, we consider the patterns of polymorphism expected for ancient vs. recently founded populations. First, we expect recently established populations to be less variable than ancient populations, due to the loss of variation during a founder event. Second, we expect the population bottlenecks associated with founder events to alter the allele-frequency spectra of recently established populations (e.g., Wall et al. 2002; Depaulis et al. 2003; Haddrill et al. 2005). Specifically, recently founded populations may have higher average Tajima's D-values (due to the loss of rare alleles during a moderate bottleneck). Also, we expect recently founded populations to have a higher variance among loci in Tajima's D, due to the stochastic effects of a founder-event-associated bottleneck. An exception to these criteria for Tajima's D would be if a recently founded population had suffered a very extreme bottleneck (such that nearly all variation was lost at most loci), since this might also produce consistently negative Tajima's D-values. However, all of our sub-Saharan samples are sufficiently variable that a severe bottleneck does not appear to be a plausible demographic history for any of them. Finally, we expect recently founded populations to have increased linkage disequilibrium (e.g., Reich et al. 2001; Wall et al. 2002; Thornton and Andolfatto 2006), because not every allelic combination present in the ancestral population will be transferred to the new population.
These expectations were quantified using three summary statistics: nucleotide diversity (π), maximum Tajima's D (the highest D-value observed at any locus for a given population, thus incorporating the expected increase both in D and its variance for recently founded populations), and ρ/π (the population recombination rate, estimated as described in materials and methods and divided by π to account for the effect of variability on its estimation). Thus, we expect ancient populations with comparatively stable demographic histories to have high π, low maximum D, and high ρ/π.
The pattern produced by any one of these three statistics would not have yielded a clear geographic hypothesis for the ancestral range of D. melanogaster. However, when sub-Saharan populations were ordered according to the sum of their ranks for the three statistics, a geographic pattern was evident (Table 4). The five populations that showed the best evidence for ancient and stable histories (Ug, KM, ZK, ZS, and Mw) were all from eastern Africa. Thus, in spite of evolutionary and sampling variance affecting the estimation of these statistics, a hypothesis for the ancestral range of this species could be generated.
Source of Palearctic D. melanogaster:
Relationships between sub-Saharan and cosmopolitan populations were investigated with the goal of inferring which sub-Saharan populations might resemble the source (prebottleneck) population that gave rise to Palearctic D. melanogaster. As mentioned above, three sub-Saharan populations (Er, Ga, and SA) displayed evidence of recent cosmopolitan admixture. Aside from these samples, 16 sub-Saharan populations were ranked according to three statistics illustrating differentiation from the Palearctic populations (Table 5). FST, Snn, and the number of shared polymorphisms all revealed Uganda as the sub-Saharan population most closely related to the Palearctic samples.
Relationships between sub-Saharan and Palearctic populations were also depicted using neighbor-joining population distance trees on the basis of FST. When the nonadmixed sub-Saharan populations were analyzed along with the Palearctic samples, the sample branching off closest to the Palearctic samples was Uganda (Figure 2). Aside from Er and SA, Ug and KN were also the two sub-Saharan populations with the highest level of ancestry in the predominantly cosmopolitan cluster from the K = 3 run of Structure.
Excess of singleton polymorphisms and evaluation of potential causes:
For a neutral equilibrium population evolving according to the infinite-sites model, theory predicts that the number of singleton polymorphisms observed in a sample should be θ, regardless of the sample size (e.g., Fu 1995). For our full data set, the expected number of singletons (mutations present in 1 of 240 individuals) would be 83.6 if θ is estimated from the number of segregating sites (S = 506). In contrast to this prediction, we observed 279 singleton sites (55.1% of all segregating sites observed were singletons, compared to 16.5% expected on the basis of θs). Rechecking of chromatograms suggested that this singleton excess was not a result of base-calling errors. Many singletons were verified by high-quality sequence on both strands, and we do not expect to observe PCR misincorporations in sequence amplified from a large pool of genomic DNA, so we conclude that this singleton excess represents a real characteristic of polymorphism in D. melanogaster.
The D statistic of Fu and Li (1993) specifically tests for deviations from the expected number of singleton polymorphisms (external mutations). For our full data set, Fu and Li's D was −7.69 (indicating an excess of singletons), and the null model was rejected (P < 0.001 using simulations with recombination). The observed deviation could indicate a departure from selective neutrality or from demographic equilibrium. We investigated the ability of three potential causes to generate the observed singleton excess: population structure, population growth, and negative selection (constraint).
Population structure was observed in our data, and it is possible that such structure might have helped to generate the observed singleton excess. To account for this possibility, we assessed the proportion of singleton polymorphisms in groups of populations that contain no detectable structure. Two groups of four populations each were tested: one from eastern Africa (the “East4” sample: KM, Mw, ZK, ZS) and one from western Africa [the “West4” sample: Cameroon–Mbalang–Djalango (CD), CN, CO, Ng]. In these two groupings of 40 individuals each, the proportions of singleton polymorphisms were 0.527 for the East4 sample (119 singletons of 226 segregating sites) and 0.502 for the West4 sample (109 singletons of 217 segregating sites), compared to an expectation of 0.235 based on θs. These results were compared to random subsamples of the data in which 40 of 240 individuals were chosen without respect to their geographic origin (thus sampling across known population structure). The 10,000 random resamplings had a mean proportion of singletons of 0.532. Both the East4 and West4 values were well within the 95% confidence interval of resampled values: 42.0% of resampled values were below the East4 grouping's proportion of singletons, while 18.3% fell below the West4 value. Since no significant difference could be detected between structured and unstructured subsamples, we conclude that population structure has little effect on the proportion of singletons in our data.
To assess the potential of population growth to account for the observed singleton excess, we used the coalescent simulator ms (Hudson 2002) to generate data sets resembling our own under a model of instantaneous growth (see materials and methods). As shown in Table 6, a variety of population growth parameter combinations give a proportion of singletons that is similar to or much greater than that observed in our data. For example, growth occurring 0.01333 coalescent units ago (∼10,000 years ago) produces slightly >50% singletons for a wide range of growth rates (between 100- and 1,000,000-fold growth). More recent growth tends to produce greater singleton excess, with >98% singletons in the most extreme cases. Thus, while the true history of any D. melanogaster population is likely to be more complex than the growth models simulated (involving, for example, population structure and population bottlenecks), a demographic history that involves recent growth appears to have considerable power to generate a singleton excess like the one observed.
Although population growth could be a sufficient explanation for the observed singleton excess, we also tested for evidence of negative selection. The four sequenced fragments are from long intergenic regions, but previous studies have reported evidence for constraint on noncoding DNA in D. melanogaster (Bergman and Kreitman 2001; Andolfatto 2005; Kern and Begun 2005), and the high rate of recombination in these regions could allow purifying selection to act efficiently. Deleterious mutations may thus be kept at low frequency and be observed as singletons in our sample. If this has occurred, one might expect regions of our sequenced loci that have a stronger singleton excess to also have reduced divergence between species (both patterns being generated by selective constraint). Upon breaking our loci into 100-bp windows, we found evidence for an inverse correlation between divergence (using D. simulans as an outgroup) and proportion of singleton polymorphisms (Figure 4; R2 = 0.19; P = 0.015), thus providing evidence that purifying selection is at least partially responsible for the singleton excess in our data. However, the proportion of singletons shown in Figure 4 seems to “bottom out” between 0.3 and 0.4, which is still well above the expected value of 0.165. The window containing the lowest proportion of singletons (0.316) also had the highest divergence (0.212), which is suggestive of a rather low level of selective constraint. Since even this window contains approximately twice as many singletons as expected on the basis of θ, it seems that purifying selection alone may not be a sufficient explanation for the observed singleton excess. Instead, the most plausible explanation might be a combination of population growth and negative selection.
An analysis of DNA sequence variation at four 1-kb X-linked fragments was performed for 25 population samples of D. melanogaster to explore genetic structure and historical relationships involving sub-Saharan populations. The primary structure detected within sub-Saharan Africa differentiated eastern from western populations (with Uganda having a closer relationship to the western samples). Evidence for east–west structure included pairwise FST values, Mantel test results indicating a significant correlation between genetic and geographic distance within all of sub-Saharan Africa but not within the eastern and western population groups, neighbor-joining population distance trees depicting the separation between eastern and western populations, Structure analysis of sub-Saharan populations showing a correlation between longitude and the cluster membership of populations, and AMOVA analysis indicating that considerably more variation exists between eastern and western population groupings than among populations within those groups. These findings are consistent with previous studies of inversion frequencies (Lemeunier and Aulard 1992; Aulard et al. 2002) and studies of molecular variation that had found significant differentiation between one (or two) eastern and one western African population sample (e.g., Bénassi and Veuille 1995; Michalakis and Veuille 1996; Haddrill et al. 2005).
Additional differentiation within sub-Saharan Africa was observed for populations that showed close relationships to cosmopolitan populations at one or more loci. Specifically, low locus-specific FST values in cosmopolitan comparisons and high levels of linkage disequilibrium provided evidence for cosmopolitan admixture affecting South Africa at three loci, Eritrea at two loci, and Gabon at one locus. Thus, cosmopolitan admixture was inferred for a minority of sub-Saharan samples, but the disparate locations of these three samples, along with previous reports of cosmopolitan admixture in Brazzaville, Congo (Vouidibio et al. 1989; Capy et al. 2000), and Harare, Zimbabwe (Kauer et al. 2003), suggest that cosmopolitan genotypes have been introduced at multiple sub-Saharan sites.
Apart from cases of likely cosmopolitan admixture, we were interested in assessing which sub-Saharan population showed the closest relationship with the Palearctic populations. This population could represent the best contemporary proxy for the source population that gave rise to the original Palearctic population of D. melanogaster, and its location might suggest the approximate geographic route by which D. melanogaster first left sub-Saharan Africa. On the basis of FST, Snn, shared polymorphisms, population distance trees, and Structure cluster membership, it was clear that Uganda had the closest genetic affinity with the Palearctic samples. While such a result could be attributable to cosmopolitan admixture in Uganda, the high variation and low linkage disequilibrium observed in this sample argue against such an explanation.
Lachaise et al. (1988) suggested that D. melanogaster might have expanded northward from western Africa between 6000 and 9500 years ago, crossing a Saharan region that was less arid at the time. Although we found that western African populations were very closely related to our Uganda sample, it was Uganda that showed the closest relationship to the Palearctic populations. If we are correct in concluding that the source population that gave rise to Palearctic D. melanogaster originated from the equatorial rift zone, it may not be necessary to invoke a crossing of the central Sahara. Instead, the northward expansion of D. melanogaster might have occurred via the Nile Valley and/or the Red Sea coastal area.
By analyzing some contrasting predictions for polymorphism in populations with relatively stable demographic histories vs. those that may have experienced recent founder events, we identified five populations from eastern Africa that showed the best evidence for ancient and stable histories. Specifically, Uganda, Kenya–Malindi, Zimbabwe–Kariba, Zimbabwe–Sengwa, and Malawi were found to have, on average, higher variation, lower and less variable Tajima's D-values, and less linkage disequilibrium than the other sub-Saharan populations studied (Table 4). These results could suggest that the ancestral range of D. melanogaster was within eastern Africa (either east of the rift or within the rift zone), which is consistent with the hypotheses of Veuille et al. (2004) and Haddrill et al. (2005). However, the geographic breadth of these five populations, along with the existence of unsampled regions within sub-Saharan Africa, prevents us from drawing precise conclusions regarding the geographic origin of D. melanogaster. Importantly, our data do not rule out an ancestral range that includes the western rift mountains, which could be compatible with an origin in the “East Central refugia” of tropical forest, a possibility suggested by Lachaise et al. (1988) and Lachaise and Silvain (2004). Alternatively, given that D. melanogaster has greater resistance to desiccation (van Herrewege and David 1997) and to high and low temperatures (Stanley et al. 1980) than any of its close relatives, this species may have evolved in a relatively less humid region of eastern Africa. It is also possible that the geographic range of this species has changed through time in response to climate change and other factors.
It is interesting that although the Uganda population shows its closest relationships to be with western African populations, it was identified along with four other eastern African populations in our analysis of putatively ancient population samples. If the hypothesis of an eastern African origin for D. melanogaster is correct, the above results might suggest that western populations were founded comparatively recently and were derived from a population very similar to our Uganda sample. Clearly, this proposed westward expansion from the equatorial rift zone did not result in a substantial loss of variability for most western populations. Therefore, this expansion may have involved a large number of founders and/or a high rate of subsequent migration between founder and source populations.
In light of the above discussion, our results suggest the following as a reasonable hypothesis for the history of D. melanogaster and related species. While most other subgroups within the D. melanogaster species group are endemic to East Asia, the nine described species of the D. melanogaster subgroup are native to sub-Saharan Africa and nearby islands (Lachaise et al. 1988, 2000). Therefore, it is thought that the ancestor of the melanogaster subgroup reached Africa from Eurasia via the Arabian landmass sometime after the first land connection was established, <20 million years ago (MYA) (Figure 5A).
Regarding the earliest speciation event within the D. melanogaster subgroup, earlier studies (e.g., Lachaise et al. 1988; Russo et al. 1995) suggested that this branching separated the ancestor of D. erecta and D. orena from the remaining members of this clade (the D. melanogaster/D.simulans and the D. teissieri/D. yakuba species pairs thus forming a clade). However, recent DNA studies have supported a different topology in which the D. erecta/D. orena and the D. teissieri/D. yakuba species pairs form a clade, while the D. melanogaster/D. simulans lineage would have diverged from these taxa earlier (Kopp and True 2002; Kastanis et al. 2003; Ko et al. 2003). The estimated timing of this speciation event, ∼10–15 MYA (Tamura et al. 2004), suggests the possibility that the emergence of the East African Rift (Girdler 1991) might have played a role in the isolation of these taxa (Figure 5B).
Restricting our focus to the D. melanogaster/D. simulans lineage, aridification of the rift zone was suggested by Lachaise et al. (1988) as a potential vicariant event separating populations that would become D. melanogaster (west of the rift) from those that would become D. simulans (east of the rift or on Indian Ocean islands). Recent studies of DNA sequence polymorphism in D. simulans have suggested that populations from Madagascar are more genetically diverse than populations from continental Africa (Dean and Ballard 2004; Veuille et al. 2004; Baudry et al. 2006). If, as these data imply, D. simulans originated on Madagascar or other Indian Ocean islands, there would be no need to invoke the rift valley as a barrier separating this species from D. melanogaster. Rather, the simplest explanation for the separation of these two species would be that the ancestor of D. simulans dispersed to Madagascar (or other islands), while populations that remained on the mainland became D. melanogaster (Figure 5C). Thus, it is not necessary to place D. melanogaster west of the rift to explain its divergence from D. simulans, and our analysis supports an ancestral range for D. melanogaster in eastern Africa.
More recently, we infer that D. melanogaster expanded in two directions from its ancestral range in eastern Africa, perhaps spreading from a point of origin near our Uganda sample. Expanding westward, D. melanogaster extended its range into central and western Africa (Figure 5D). The newly founded sub-Saharan populations incurred little loss of variation, and only mildly altered allele frequencies and linkage disequilibrium. Also, perhaps associated with this same wave of expansion, D. melanogaster moved north, eventually reaching northern Africa and Eurasia (Figure 5D). This expansion gave rise to the first Palearctic populations and was associated with a more severe loss of variation, as documented by several multilocus studies (e.g., Kauer et al. 2002; Ometto et al. 2005; Thornton and Andolfatto 2006). The inferred geographic expansion of D. melanogaster, north and west from the equatorial rift zone, might have been facilitated by the domestication of this species, which could be compatible with the hypothesis of a relatively recent domestication event in the Nilo-Saharan region (Lachaise and Silvain 2004). D. simulans also expanded its range, founding mainland populations in eastern and central Africa and also spreading beyond sub-Saharan Africa (Figure 5D).
The results presented here should provide an improved framework for the testing of specific demographic hypotheses in this important model species. For a variety of evolutionary analyses, including studies of adaptive differences among populations, this demographic knowledge should help researchers to select the most appropriate sub-Saharan populations to test hypotheses of interest. For example, we have shown that our Uganda sample may be a better proxy for the Palearctic source population than Zimbabwe (which has commonly been used for this purpose), so this sample should be particularly useful in comparisons between cosmopolitan and sub-Saharan populations. Increased understanding of demography in sub-Saharan D. melanogaster should also aid in the interpretation of data from evolutionary studies by allowing researchers to place their results in a stronger historical context.
We are especially grateful to those individuals who collected flies on our behalf (listed in Table 1) and those who facilitated J.E.P.'s travel and collection in Cameroon (particularly the Wildlife Conservation Society). This work was supported by a National Science Foundation Doctoral Dissertation Improvement Grant (DEB-0411730) to J.E.P. and C.F.A. and by a National Institutes of Health grant (GM36431) to C.F.A. Additional support related to fly collection was provided by the Cornell Graduate School and Sigma Xi of Cornell.
This article is dedicated to the memory of Daniel Lachaise, whose contributions to the field of Drosophila evolutionary biology helped to lay the groundwork for numerous studies, including this one.
Communicating editor: M. K. Uyenoyama
- Received March 29, 2006.
- Accepted August 10, 2006.
- Copyright © 2006 by the Genetics Society of America