We analyze patterns of nucleotide variability at 15 X-linked loci and 14 autosomal loci from a North American population of Drosophila simulans. We show that there is significantly more linkage disequilibrium on the X chromosome than on chromosome arm 3R and much more linkage disequilibrium on both chromosomes than expected from estimates of recombination rates, mutation rates, and levels of diversity. To explore what types of evolutionary models might explain this observation, we examine a model of recurrent, nonoverlapping selective sweeps and a model of a recent drastic bottleneck (e.g., founder event) in the demographic history of North American populations of D. simulans. The simple sweep model is not consistent with the observed patterns of linkage disequilibrium nor with the observed frequencies of segregating mutations. Under a restricted range of parameter values, a simple bottleneck model is consistent with multiple facets of the data. While our results do not exclude some influence of selection on X vs. autosome variability levels, they suggest that demography alone may account for patterns of linkage disequilibrium and the frequency spectrum of segregating mutations in this population of D. simulans.
A fundamental question in population genetics is the relative importance of natural selection vs. neutral and/or demographic factors in shaping genome-wide patterns of sequence variability (Lewontin 1974; Kimura 1983). Distinguishing between selective and neutral/demographic effects on genome variability requires multiple independent loci scattered throughout the genome. Although there are polymorphism data from dozens of loci in Drosophila (see, e.g., Moriyama and Powell 1996; Andolfatto and Przeworski 2000; Andolfatto 2001; Przeworskiet al. 2001), these data are difficult to interpret because many of the loci are sampled in different populations. As a result, one is never quite sure to what extent patterns of variation at a collection of loci are the byproduct of the particular sampling schemes used. In particular, in the absence of independent knowledge about the demographic history of a species, it becomes very difficult, if not impossible, to draw inferences about the role of natural selection. Much of this ambiguity can be eliminated by sequencing the same lines from the same populations at multiple loci, yet it is only recently that consistently sampled Drosophila data have been gathered from many loci (Begun and Whitley 2000; Begunet al. 2000; Andolfatto and Przeworski 2001).
In one such study, Begun and Whitley (2000) collected polymorphism data from 15 X-linked loci and 14 loci on chromosome arm 3R in a California population of Drosophila simulans. Their goal was to distinguish between different explanations for the empirical observation that levels of variability are positively correlated with rates of meiotic crossing over in Drosophila (Aguadéet al. 1989; Begun and Aquadro 1992; Aquadroet al. 1994). Two major theories have been proposed for this pattern; both describe the effects of natural selection at linked sites. The hitchhiking or selective sweep model posits that alleles driven to fixation by positive selection reduce levels of variation at linked neutral sites (Maynard Smith and Haigh 1974; Kaplanet al. 1989), while the background selection model considers the variation-reducing effect of strong purifying selection (Hudson and Kaplan 1995; Charlesworthet al. 1993). Both models predict a greater reduction of variability in areas of reduced recombination, so both are potential explanations for the positive correlation between diversity levels and recombination rates.
Begun and Whitley (2000) attempt to distinguish between background selection and positive selection models by comparing levels of variability on the X chromosome and an autosome (Aquadroet al. 1994). If deleterious mutations tend to be partially recessive (Crow and Simmons 1983; Houleet al. 1997), purifying selection is expected to be more efficient on the X chromosome relative to the autosomes, due to haploidy in males. The background selection model therefore predicts that, if all other factors are comparable, there should be less of a reduction in levels of variation on the X chromosome than on the autosomes (Charlesworthet al. 1993; Charlesworth 1996). Begun and Whitley (2000), however, find reduced levels of diversity on the X chromosome, even when X-linked diversity is multiplied by 4/3 (to correct for the fewer number of X chromosomes, assuming equal male and female effective population sizes). They conclude that the background selection model is incompatible with their data and suggest instead that greater hitchhiking effects due to positive selection on the X relative to the autosomes might account for their pattern. This will occur if positively selected alleles are sometimes recessive; haploidy on the X in males would then facilitate X-linked selective sweeps.
Considering what we know about the demographic history of D. simulans populations, it may also be relevant to consider nonselective explanations for Begun and Whitley’s observation. D. simulans is a human commensal; it is thought to have originated in tropical Africa and may have colonized the Americas as recently as a few hundred years ago (David and Capy 1988; Lachaiseet al. 1988). A contraction in population size (i.e., a population bottleneck or founder event) during the initial colonization may have had a large impact on the patterns of variation in samples from migrant populations. Indeed, variability in New World populations appears to be lower than in African populations, as expected after a population size contraction (Hamblin and Veuille 1999; Andolfatto 2001). In addition, the average Tajima’s D (a commonly used summary of the frequency spectrum of segregating mutations, cf. Tajima 1989a) at two X-linked loci is higher in non-African than in African populations; this also is suggestive of a population contraction, with an associated loss of low frequency alleles in migrant populations (Fay and Wu 1999; Hamblin and Veuille 1999).
In this article, we revisit Begun and Whitley’s data, by explicitly considering both a model of recurrent selective sweeps and a recent bottleneck model; we also examine additional aspects of the data besides levels of diversity. Braverman et al. (1995) showed that a model of recurrent selective sweeps predicts a sharp excess of rare variants relative to the expectations of the standard constant-sized Wright-Fisher neutral model (called hereafter the null model). In our analyses, we study both D (Tajima 1989a) and a summary of linkage disequilibrium (i.e., the nonrandom association between alleles at different nucleotide sites). We measure linkage disequilibrium by estimating (under the null model) the population recombination parameter ρ (= 4Nr, where N is the effective population size and r is the sex-averaged recombination rate per base pair per generation) from the sequence data at each locus (CHRM, cf. Wall 2000). The parameter r is estimated from a comparison of D. simulans physical and genetic maps; we can then estimate N as . Low estimated values of Nˆρ indicate high levels of linkage disequilibrium and vice versa. A similar approach was used by Andolfatto and Przeworski (2000) using data from D. melanogaster and D. simulans. They found that N estimated from linkage disequilibrium and rˆ was much smaller than the standard estimate of N based on levels of variability and an estimate of the mutation rate, which was interpreted as a genome-wide excess of linkage disequilibrium in the two species. Our study presents the advantages of consistently sampled data and more accurate estimates of ρ.
A list of definitions for the symbols used in this paper (in approximate order of introduction) can be found in Table 1.
Loci and samples: We consider 29 loci collected from a single D. simulans population (Wolfskill Orchard, California), reported by Begun and Whitley (2000). Sequences were obtained from GenBank and aligned by eye. We consider only biallelic single-nucleotide base substitutions; multiple substitutions, insertion-deletion mutations, and other overlapping mutational events (e.g., substitutions overlapping with deletions) are excluded from analyses. Excluding multiple substitutions should not lead to a bias in our frequency spectrum analyses and should be conservative for our estimates of ρ. The alternative, i.e., considering multiple substitutions as multiple mutations with missing information, would lead to underestimates of H and RM due to the missing information and thus to underestimates of ρ (see below).
Estimating r: Previous methods for estimating r have fit high-order polynomial curves to the available genetic and physical map data for D. simulans (Trueet al. 1996; Andolfatto and Przeworski 2000). Here, we assume a composite model, where the rate is constant across much of the chromosome, but is reduced near the telomeres and centromeres (cf. Charlesworth 1996; Andolfatto and Przeworski 2001). Cytological map positions are obtained from Flybase (http://flybase.bio.indiana.edu; Dec 1, 2000), and the DNA content for each band is estimated from the D. melanogaster values in Heino et al. (1994; i.e., we assume D. melanogaster and D. simulans have similar amounts of DNA per band). There is no recombination in males, so we use sex-averaged rates.
X chromosome: Genetic map positions for 14 marker loci are taken from Sturtevant (1929). Note that these marker loci and the autosomal ones are distinct from the loci considered in the polymorphism analysis. The X chromosome is divided into telomeric (I), middle (II), and centromeric (III) segments (see Figure 1a). We model genetic distance as a linear function of physical distance from the white locus (2.9 Mb, 4.1 cM) to the fused locus (20.2 Mb, 59.4 cM; region II, Pearson R2 > 0.99). The recombination rate estimate over this interval is 3.2 cM/Mb. Recent recombination measurements (Takano-Shimizu 1999) have estimated lower rates of recombination near the telomere (i.e., region I) than measured by Sturtevant (1929). Combining the genetic map data of Sturtevant (1929) and Takano-Shimizu (1999), and assuming that genetic distance increases with the square of physical distance from the telomere (0.0 Mb) to the white locus (2.9 Mb), a rate of 1.5 cM/Mb is estimated for Pgd (2.1 Mb). All other X-linked loci from Begun and Whitley (2000) fall within the boundaries of region II.
Chromosome 3R: Genetic map positions for 13 marker loci were obtained from Ohnishi and Voelker (1979). D. simulans and D. melanogaster differ by a single inversion on this chromosomal arm (breakpoints 84F1 and 93F6-7; Lemeunier and Ashburner 1976). This difference has been accounted for in estimates of physical distances between genetic markers. The chromosome arm is divided into two regions (see Figure 1b). The first region is defined as the centromeric region (I) and extends from the centromere to delta (5.4 Mb, 64 cM). We define a second region (II, Figure 1b) that encompasses the rest of the chromosome arm from delta to Acph-1 (22.2 Mb, 134 cM), which is an estimated 16.8 Mb in length. For region II, we model genetic distance as a linear function of physical distance (Pearson R2 > 0.99). Under this model, the recombination rate is estimated to be 4.2 cM/Mb.
If we assume that genetic distance increases with the square of physical distance from the centromere (0.0 Mb) to delta (5.4 Mb), we estimate a rate of 2.2 cM/Mb for the miranda locus (4.9 Mb). We localized the pitchoune locus (93F16-94A1) outside of the distal break-point (93F6-7) of the fixed 3R inversion difference between D. melanogaster and D. simulans by in situ hybridization on polytene chromosomes (using a method modified from Sniegowski and Charlesworth 1994). Since pitchoune lies outside the inversion, it resides ∼15.5 Mb from the centromere (i.e., region II). The remaining 12 loci on chromosome 3R considered for polymorphism analyses are also in region II.
Estimating Nρ: We proceed assuming that the true r is known for each locus. Two summaries of linkage disequilibrium are H, the observed number of distinct haplotypes, and RM, the minimum number of inferred recombination events (Hudson and Kaplan 1985). We calculate Pr(H, RM|ρ), or equivalently assuming the null model. Likelihoods are calculated for each locus separately, as well as for all X-linked loci and all autosomal loci (see below). Here “lik” refers to the likelihood, and the joint likelihood at a collection of loci is found by multiplying the likelihoods at the individual loci. This is reasonable since none of the loci are closely linked to each other and thus can be considered as evolutionarily independent. Define Nx and Na as the effective population sizes of the X and the autosomes, respectively. Define Nˆρx and Nˆρa as the maximum-likelihood estimates for the X-linked and autosomal loci; Nˆρx is the value of Nx that maximizes Πall X-linked locilik(Nx|H, RM), while Nˆρa is the value of Na that maximizes Πall autosomal locilik(Na|H, RM). We choose to summarize the data before performing maximum likelihood because of computational constraints; maximum likelihood on the full data is computationally infeasible even for recombination rates one-tenth of those considered here. As it is, the likelihoods for this article took several months of computing time on a pair of 600 Mhz Pentium III processors.
The likelihoods were estimated from simulations that assume a neutral infinite-sites model and use the protocol of Hudson (1993). This method differs slightly from standard coalescent simulations, which generate random genealogies and then place mutations on the branches with rate θ/2. (θ= 4Nμ, where N is the effective population size and μ is the mutation rate per base pair per generation.) Instead, we generate random genealogies and then place the observed number of mutations, S, on the tree. One motivation for this procedure is that S is observed, while θ must be estimated (Hudson 1993). Both methods produce similar results in other contexts (e.g., Wall 2000; Wall and Hudson 2001). A minimum of 2 × 105 replicates were run for a large number of different Nx and Na values at each locus. The particular values used and estimated likelihoods are available on request from the authors.
Null simulations: We compare the selective sweep and bottleneck models (described below) with a constantsized, panmictic, neutral coalescent model (Hudson 1983). Here, we consider it more appropriate to use standard coalescent simulations (with fixed mutation rate, as opposed to a fixed number of segregating sites) for the selective sweep simulations, because they have a large variance in tree sizes. To ensure comparability, we also use standard coalescent simulations for the bottleneck and null simulations. To estimate θ, the population mutation rate, we split the data into three classes of sites (introns, synonymous sites, and nonsynonymous sites) and assume a fixed neutral (population) mutation rate for each class. θ is then estimated for each class using Watterson’s (1975) estimator, separately for the X and 3R.
Selective sweep simulations: We consider a model where recurrent, nonoverlapping favorable alleles arise at sites linked to a neutral locus. The model assumes that beneficial mutations are selected immediately upon introduction into the population. Our methods follow those of Braverman et al. (1995), but we run coalescent simulations with a fixed mutation rate as opposed to the “fixed S” methodology (Hudson 1993; Bravermanet al. 1995). Also, our implementation incorporates intragenic recombination within the neutral locus, except during the selective phases. The lack of intragenic recombination during the selective phases makes little difference to our results, so long as the neutral locus is relatively short (M. Przeworski, unpublished results). On average, it produces more linkage disequilibrium than a model that always has intragenic recombination (thus is conservative for our purposes). Note that there is a typo in Equation 1 of Braverman et al. (1995), describing the increase in frequency of the favored allele. Instead, we use Equation 3a in Stephan et al. (1992), which approximates the increase from frequency ε to 1 - ε. We implement our simulations with ε = 1/2N (as in Bravermanet al. 1995). Selection is additive, and we arbitrarily assume the selection coefficient of one beneficial allele to be s = 0.005. We obtain similar results for other values of s (results not shown; see also discussion).
Denote the decrease in diversity due to hitchhiking by (1 -δθ). We choose three plausible values for δθ: 0.85, 0.75, and 0.65 for the autosomes and 0.65, 0.55, and 0.45 for the X. The autosomal values were chosen to be close to the observed ratio of non-African to African diversity levels (Andolfatto 2001), while the X chromosome values were chosen to produce a wide range of X-linked to autosomal diversity ratios. The motivation here is that African populations of D. simulans may be roughly at equilibrium, whereas non-African populations might have reduced variation due to local adaptation. This line of reasoning suggests a range of plausible values for δθ. Note also that data from other loci suggest that the ratio of X to autosomal levels of diversity may be higher than was observed by Begun and Whitley (2000; 0.79 for all non-African data, cf. Andolfatto 2001). θ is estimated for each locus as in the null simulations and then divided by δθ. The rate of selective sweeps Λr in Bravermanet al. 1995) was estimated by simulation to produce the desired decrease (1 - δθ) in θ. The actual values used are listed in Table 2. In almost all of the 3R simulations and for larger values of Nx, Λr is small enough that the probability of a second benefical mutation arising while a sweep is still ongoing is <0.05. See the discussion for more on the applicability of the model.
Bottleneck simulations: It is relatively straightforward to incorporate changes in population size into the coalescent framework (e.g., Tajima 1989b; Slatkin and Hudson 1991). For the autosomes, we consider a model where the effective population size is constant in the past at 5 × 106. Then, at time T0, the population size immediately crashes to a bottleneck size Nb, after which it increases exponentially to a current effective population size of 5 × 106. For a given value of T0, we choose Nb so that the reduction in diversity caused by a bottleneck equals the autosomal values of (1 - δθ) (see above). As before, we assume a fixed mutation rate for introns, synonymous sites, and nonsynonymous sites, and we estimate these rates from the data (cf. Watterson 1975) before dividing by δθ. Since one of our goals is to examine whether bottlenecks have a stronger effect on the X, we use the estimated autosomal mutation rates for the X chromosome as well, after multiplying by nr to correct for the chromosomal differences in effective population size. The scaled time T0 (in coalescent time units) is similarly divided by nr for the X relative to 3R, but the population sizes estimated from the patterns of linkage disequilibrium (described below) are assumed to freely vary. This need arises because θ values for X-linked loci must be close to Watterson’s (1975) estimate of θ for simulations to be comparable (i.e., levels of diversity in the simulations should be close to what is observed in the data). However, it is better to allow Nx and Na (as estimates of linkage disequilibrium) to vary freely, so that we can see what effect bottlenecks have on linkage disequilibrium. We present results for the following parameter combinations: (a) T0 = 2000 generations ago, δθ = 0.85, and nr = 0.6; (b) T0 = 2000 generations ago, δθ = 0.75, and nr = 0.7; (c) T0 = 1.2 × 105 generations ago, δθ = 0.85, and nr = 0.7. T0 values were chosen to correspond to recent (∼200 years ago) or ancient (∼12,000 years ago) colonization of the Americas, δθ was chosen to be close to the ratio of non-African to African autosomal diversity levels in D. simulans (0.76, cf. Table 3 in Andolfatto 2001), and nr was chosen so that the relative levels of variability on the X and the autosomes would be close to what was observed by Begun and Whitley (2000). These values are shown in Table 3. See the discussion for more on the sensitivity of the results to the particular parameter values chosen and whether the particular nr values used are plausible.
Frequency spectrum of segregating mutations: We use D (Tajima 1989a) to test whether the observed frequency spectrum of segregating mutations is compatible with the expectations under bottlenecks or selective sweeps. We consider D, the average D value for the X-linked loci and the 3R loci, separately and tabulate both the average simulated D value and the proportion of simulations that have D greater than or equal to what is observed (see results). A total of 5000 replicates were run for each model and parameter combination. We present results for only the most conservative values of Nx and Na.
Estimating μ: We take a value of 1.5 × 10-9 per site per generation for the neutral mutation rate at silent sites. This estimate is based on average per year divergence at synonymous sites in various Drosophila species comparisons (Sharp and Li 1989; Li 1997; McVean and Vieira 2001), assuming an average of 10 generations per year for D. simulans (see Andolfatto and Przeworski 2000 for a discussion). The neutral mutation rate is only loosely related to our analyses, but it provides a connection between observed levels of diversity and an estimate of the effective population size under the null model. We assume that mutation rates do not vary significantly among chromosomes. This assumption is supported by comparisons of average divergence at synonymous sites on the X and on chromosome 3R (Begun and Whitley 2000).
Credibility intervals for Nθx/Nθa: To assess what range of X to autosomal diversity levels is consistent with the data of Begun and Whitley (2000), we calculate approximate credibility intervals for Nθx/Nθa from the observed numbers of segregating sites. We take a neutral mutation rate of per site per generation (see above). As with the bottleneck simulations, we assume fixed population mutation rates for synonymous sites, nonsynonymous sites, and introns and estimate these (cf. Watterson 1975) from the autosomal loci. We then assume that these parameters (multiplied by Nθx/Nθa) apply to the X-linked loci as well and calculate Here S refers to the total number of inferred segregating sites summed over all X-linked loci. We employ the standard χ2 approximation for -2 ln(L1/L0) to obtain approximate 95% credibility intervals, where L0 is the maximum likelihood and L1 is the likelihood at an alternative parameter value.
Likelihood-based statistics: To quantify how consistent the actual data are with the null model, we employ a likelihood-ratio test. We calculate Nˆρx and Nˆρa from the actual data as described earlier. Then, for Nx = N0 and Na = N1, we calculate The significance levels for R are determined by simulation for a range of N0 and N1 values. We simulate 104 replicates of the 29 loci with Nx = N0 and Na = N1. Then, we calculate Nˆρx0 and Nˆρa1 for each replicate, where Nˆρx0 is the value of Nx that maximizes and Nρa1 is the value of Na that maximizes The collection of values provides a simulated distribution of R(N0, N1) values, from which we tabulate how often the simulated R values are greater than or equal to the actual R value. For each parameter combination, we also calculate what proportion of trials have estimated population sizes Nˆρx0 and Nˆρa1 equal to the values estimated from the actual data. Define R*(N0, N1) is the likelihood of the actual effective population size estimates (using H and RM) when data are generated under the null model (with Nx = N0 and Na = N1). From our simulations, we plot the value of R* as a function of N0 and N1.
Ideally, we would like to perform the same analyses under the selective sweep and bottleneck models, but calculating the relevant likelihoods is computationally prohibitive. Instead, we use R* again, with all likelihoods calculated assuming the null model, even though the simulated data are generated under a different model. As before, R* is a measure of how likely it is for the simulated data to produce the actual estimated population sizes. A total of 104 replicates are run for each parameter combination. Other ad hoc statistics were considered; they all produced similar results (results not shown).
Excess linkage disequilibrium on all chromosomes: Table 4 shows the estimates of Nˆρ on the basis of H, RM, and rˆ for each locus. One observation that is immediately apparent is that these values are systematically less than estimates of N based on estimates of the neutral mutation rate and observed levels of polymorphism. For example, if we take for silent sites (see methods) and θ= 0.030 per synonymous base pair (estimated from all of the autosomal loci considered in this article, cf. Watterson 1975), then we obtain the estimate Nˆθa = 5.0 × 106. The corresponding estimate from the X-linked loci is Nˆθx = 2.5 × 106. In contrast, 26 out of 29 loci have Nˆρ estimates at least an order of magnitude less than the corresponding Nˆθ estimate. This discrepancy between Nˆρ (estimated from linkage disequilibrium) and Nˆθ (estimated from levels of diversity) has been noted before with different Drosophila data and slightly different methodology (Andolfatto and Przeworski 2000) and implies that there is more linkage disequilibrium in nucleotide polymorphism data than expected from standard estimates of μ, r, and θ. Note that the low values in Table 4 are not the result of the particular properties of . In fact, simulations (under the standard equilibrium neutral model) show that for the small sample sizes considered here, CHRM is biased upward, suggesting that the Nˆρ values in Table 4 might on average be overestimates (J. D. Wall, unpublished results).
Contrasting patterns between X and autosomes: Because patterns of variation vary greatly from locus to locus even when the underlying parameters are the same, the precision of the estimate of N can be greatly increased by combining information from multiple loci. Figure 2 shows the relative log likelihoods of N for all of the X-linked loci (the curve on the left) and all of the autosomal loci (the curve on the right). For ease of comparison, the curves have been normalized so that their maxima are at 0. It is striking how distinct the two likelihood curves are: Nˆρa (= 3.2 × 105) is more than six times Nˆρx (= 0.5 × 105). The horizontal line in Figure 2 shows the ∼95% credibility intervals (using the standard asymptotic approximations for maximum likelihood) for the chromosome-specific estimates of N; the two intervals do not overlap. A nonparametric rank order test shows that the locus-specific Nˆρ estimates for the X-linked loci are indeed less than the autosomal estimates (Table 4, Mann-Whitney U-test; P < 0.002).
Note that since males carry only one X chromosome, we do not necessarily expect Nx to equal Na. If male and female effective population sizes are equal, then 4Nx = 3Na. However, there are many possible factors that may cause the two effective population sizes to be unequal (Crow and Morton 1954; Caballero 1995; Charlesworth 2001). If chromosomal effective population sizes are proportional to silent site values (cf. Watterson 1975), then, from the data analyzed here, Nˆθx/Nˆθa ≈ 0.50. When all sites are considered (with different rates for synonymous sites, nonsynonymous sites, and introns), then Nˆθx/Nˆθa ≈ 0.59. The ∼95% credibility interval for Nθx/Nθa based on all sites (see methods) is 0.43-0.69. Under neutrality, Nθx/Nθa = 0.50 is unexpected, regardless of how biased the gender-specific population sizes are (Caballero 1995). This observation is in part what led Begun and Whitley (2000) to conclude that natural selection must be acting to reduce the levels of variation on the X relative to the autosomes. Our results demonstrate that the difference in the levels of linkage disequilibrium between the X and 3R (Nˆρx/Nˆρa = 0.16) is substantially greater than the difference in their diversity levels.
Figure 3a shows the P value of R (see methods) as a function of Nx and Na. For all population sizes where 2Nx ≥ Na, the actual value of R is significantly too large. This suggests that there is significantly more linkage disequilibrium on the X than on 3R, even after correcting for the differences in effective population sizes suggested by diversity levels. For the same population sizes, Figure 3b shows the proportion of trials for which Nˆρx0 = 0.5 × 105 and Nˆρa1 = 3.2 × 105 (see methods). The different shading categories were chosen so that in Figure 3, a and b were as similar in appearance as possible; the lightest areas on both graphs represent areas of parameter space that are compatible with the data. For all population sizes where 2Nx ≥ Na, the value of R* is quite small (i.e., R* < 8.0 × 10-4). If instead we repeat the rank order test with the null hypothesis that 2Nx = Na, then the two chromosomes are still significantly different (Table 4, Mann-Whitney U-test; P < 0.01).
The chromosomal difference in diversity levels (Begun and Whitley 2000), the chromosomal difference in levels of linkage disequilibrium (this study), and the overall high levels of linkage disequilibrium (Andolfatto and Przeworski 2000; this study) are all ways in which D. simulans data do not conform to the expectations of the standard equilibrium neutral model. We now examine how two simple alternative models (a recurrent selective sweep model and a recent bottleneck model) are expected to affect the levels of diversity and linkage disequilibrium on different chromosomes. Though the true history of D. simulans populations is likely to be much more complex, these models should help us gain insight into the effects of demography and natural selection on the patterns of segregating variation.
Sweep model: We considered all combinations of δθ = 0.85, 0.75, and 0.65 for 3R and δθ = 0.65, 0.55, and 0.45 for the X. All nine sets of simulations produced very similar results, and we display only a representative pair of them here. Figure 4 shows the value of R* as a function of Nx and Na. Figure 4a has δθ = 0.85 for 3R and δθ = 0.45 for the X, while the corresponding δθ values in Figure 4b are 0.75 and 0.55, respectively. We find that recurrent selective sweeps do not lead to striking increases in levels of linkage disequilibrium, as measured (see also Przeworski 2002). In particular, the increase in average Nˆρx0 and Nˆρa1 in the sweep simulations is no more than what is expected from the decrease in levels of diversity. In other words, the estimated ratio of the number of recombination events to the number of mutation events, , does not vary much when data are generated under either the null model or the recurrent selective sweep model. Exploratory simulations suggest that this observation might hold under a wider range of sample sizes and relative recombination rates than considered for the D. simulans data (results not shown). Thus, this simple model for repeated episodes of positive selection seems to explain neither the overall high levels of linkage disequilibrium nor the chromosomal difference in levels of linkage disequilibrium.
Previous work has shown that recurrent selective sweeps lead to a strong skew in the frequency spectrum toward an excess of rare variants (Bravermanet al. 1995). The D. simulans data of Begun and Whitley (2000), on the other hand, show no marked skew in the frequency spectrum. The average value of Tajima’s D for the 15 X-linked loci is 0.205, while the average value for the 14 loci on 3R is -0.021. Table 5 shows what proportions of the simulations have average D values (D) greater than or equal to what is actually observed. Under recurrent hitchhiking, the average simulated D is negative, as expected. The actual D for the X-linked loci is significantly too high (P < 0.004, one-tailed test), while the true D for the autosomal loci is not unusual.
Bottleneck model: Due to computational constraints, we consider only a few parameter combinations. Figure 5 shows R* as a function of Nx and Na for three different examples. As can be seen, recent bottlenecks are consistent with much higher effective population sizes. Equivalently, recent bottlenecks cause an increase in observed levels of linkage disequilibrium. For example, two out of three (Figure 5, a and b) are consistent with Na = 5.0 × 106, while one out of three (Figure 5b) is consistent with Nx = 2.5 × 106. In addition, at least for some parameter values (e.g., those of Figure 5b), the average ratio of Nˆρx0/Nˆρa1 is much larger than what was estimated under the null model (0.16) and closer to the expectation from levels of diversity (i.e., 0.59). Since there are fewer X chromosomes than autosomes, a bottleneck is more severe for the X (i.e., the minimal population size is smaller). This leads to both a greater increase in linkage disequilibrium and a greater reduction in levels of variability on the X relative to the autosomes. In principle, a recent bottleneck might explain both the chromosomal differences in levels of linkage disequilibrium and the overall high levels of linkage disequilibrium, but it remains to be seen whether the parameter values required are plausible (see discussion).
The effect of a bottleneck on the frequency spectrum is complex. For results from a similar model, see Fay and Wu (1999). During and immediately after a bottleneck, a deficiency of rare variants is expected, but as time passes, the accumulation of recent mutations leads to an excess of low frequency segregating sites. For the small values of T0 considered here, bottlenecks are expected to lead to positive Tajima’s D values (more so for the X than the autosomes). Table 5 shows the average of the simulated D values, as well as the proportion of simulated D values greater than or equal to the actual values. As is expected under a recent bottleneck, the actual D is higher for the X-linked loci than it is for the autosomal loci. In all cases, the actual D values for both the X and 3R are within the middle 95% of the simulated distribution, though D for the 3R loci is close to being significantly too low.
This study analyzes sequence data from a North American population of D. simulans and documents that the high observed levels of linkage disequilibrium and the chromosomal differences in levels of linkage disequilibrium are not expected under the standard null model. Both demographic and selective departures from the null model are possible explanations, and we considered two of these alternatives to the null model. We describe below some of the difficulties associated with assessing whether these models are appropriate.
The bottleneck model: Not much is known about the demographic history of North American populations of D. simulans, but as a human commensal, D. simulans is unlikely to have arrived in North America before humans did. The first people in the Americas are thought to have crossed via the Bering Strait ∼14,000-15,000 years ago (see, e.g., Joneset al. 1994). If we assume an average of 10 generations a year, then T0 = 2000 generations ago and 1.2 × 105 generations ago correspond to 200 years ago and 12,000 years ago, respectively. However, it seems improbable that D. simulans crossed via the Bering Strait, since this would require travel through thousands of miles of harsh Arctic weather and dependence on humans with a low population density. Thus, it may be more likely that D. simulans was introduced into the Americas after the European conquest ∼500 years ago. No one knows when D. simulans first started crossing the Atlantic as stowaways on ships, but it seems plausible that at first the number of migrants was limited. Both the volume of traffic and the cargo composition changed slowly over time; at some point in the past, successful migration to the Americas must have been possible but difficult. So, independent of genetic data, a recent bottleneck in the history of American populations of D. simulans seems to be a reasonable demographic model. We chose to model a single founder event, followed by rapid population growth. Perhaps a more realistic model would have many founder events, spread out over time (continuing to the present day). However, the earliest migrants might have contributed a disproportionally large amount to the gene pool of the new population; the newly founded population may have had ample opportunity to grow, since 500 years ago there were many settled human communities in the Americas. If so, later migrants would then be less important, since they would contribute proportionally very little to the genetic makeup of the population.
In summary, our simple bottleneck model probably captures some fundamental element of the population history of North American D. simulans. Assuming that ancestral populations were close to mutation-drift equilibrium, a simple bottleneck model can, at least qualitatively, account for three essential features of the Californian D. simulans data: (1) a genome-wide increase in levels of linkage disequilibrium; (2) more linkage disequilibrium on the X than on the autosomes; and (3) a skew in the frequency spectrum toward more common variants on the X relative to the autosomes.
However, this does not necessarily mean that a bottleneck is a sufficient explanation for the patterns of variation in the data analyzed in this article. The effect that a bottleneck has on levels of diversity, linkage disequilibrium, and the frequency spectrum is quite sensitive to many unknown parameters. Exploratory simulations suggest that decreasing (1 - δθ) or T0 (while keeping the other parameters constant) leads to a greater increase in linkage disequilibrium, while decreasing nr leads to more of an effect on the X relative to the autosomes. Also, if the current effective population size and T0 are larger, there is little effect on levels of linkage disequilibrium. For example, if the current N is 1 × 109, then T0 must be quite small (e.g., T0 ≤ 4 × 103 generations) for a bottleneck to have an appreciable effect on estimates of linkage disequilibrium (results not shown).
Perhaps more worrisome is the fact that the ratio of effective sizes for the X and the autosomes in the ancestral population (nr) must be low (i.e., ≤0.75) to be consistent with the observed ratio of diversities in the Californian population (i.e., ≤0.69, the approximate upper bound for Nˆθx/Nˆθa). In the bottleneck simulations we present (Table 2, Figure 5), we assume 0.6 ≤ nr ≤ 0.7. In other words, we assume that the male effective population size is greater than or equal to the female effective population size. This situation may be unlikely for Drosophila where sexual selection is expected to reduce the effective population size of males relative to females (Crow and Morton 1954). On the other hand, Charlesworth (2001) pointed out that if females are generally in poor breeding condition (as observed by Boulètreau 1987 in an European population), then nr will be reduced. This may counter the effects of sexual selection in males. In fact, for non-African D. melanogaster, Charlesworth (2001) estimates nr = 0.73 and 0.64, with or without sexual selection on males, respectively. Thus, the prebottleneck nr may have been low if the founding population was non-African (e.g., European). Unfortunately, we have little information about the relative variance in male and female reproductive successes in Drosophila populations or the origin of this particular North American population of D. simulans. Under the simplest assumption of equal numbers of males and females (i.e., nr = 0.75), a severe and recent bottleneck can still produce an X/autosome ratio of diversities that is consistent with the findings of Begun and Whitley (2000); if T0 = 2000 generations and δθ = 0.75, then Nˆθx/Nˆθa = 0.684.
Positive selection models: An alternative to a purely demographic explanation is that natural selection for adaptation has influenced the observed patterns of variation. D. simulans originated in Africa (David and Capy 1988; Lachaiseet al. 1988), and some adaptive evolution must have occurred while populations coped with different environments and colder climates. We modeled natural selection by simulating recurrent, nonoverlapping selective sweeps linked to a neutral locus. Although under certain assumptions (discussed in Begun and Whitley 2000) this model can reduce X-linked (relative to autosomal) levels of diversity, our simulations show that it is inconsistent with other facets of the data: A simple selective sweep model leads to an excess of rare variants and no appreciable increase in levels of linkage disequilibrium. In contrast, the data show no skew in the frequency spectrum, high levels of linkage disequilibrium on 3R, and extremely high levels of linkage disequilibrium on the X.
We chose the simple recurrent sweep model partly because it has been carefully studied before (e.g., Kaplanet al. 1989; Bravermanet al. 1995) and partly because it is reasonably easy to implement in the coalescent context (where many replicates can be run quickly). However, it is not clear whether the model is appropriate for Drosophila data. For example, our simulations assume that selection is additive, even though one of the main arguments for greater hitchhiking effects on the X invokes dominance effects (see, e.g., Begun and Whitley 2000). This facet is not a major concern, since any process with recurrent, rapid fixation of new alleles is likely to produce a similar pattern in sequence data (i.e., a skew in the frequency spectrum toward rare alleles and no increase in levels of linkage disequilibrium, using the methods in this article).
Another concern is the frequency of selective sweeps. We have chosen simulation parameters that allow few overlapping sweeps. We calculate [similar to (6) in Bravermanet al. 1995] that the probability that a second selective sweep starts before a given one has finished is >0.05 for values of Nx ≤ 2.4 × 105 and Na ≤ 1.5 × 105. Most of these overlaps consist either of new beneficial alleles arising after an older beneficial allele has already swept to high frequency (but not fixed) and/or two beneficial alleles that are not tightly linked to each other; in both cases, the two sweeps are essentially independent. In general, if s and δθ are fixed, then multiple sweeps are more likely to overlap as N decreases. This happens because a selective sweep with a given value of s has an effect on standing levels of linked neutral diversity that is only weakly dependent on N, while sweeps take longer (in units of scaled time) in smaller populations, so are more likely to overlap. Note that we fixed δθ so that the effect of selection would be comparable across different values of N. If instead we were to fix the rate of introduction of advantageous alleles, then there would be more sweeps as N increases, and δθ would decrease with increasing N; because we have no prior knowledge regarding Λr, this implementation does not seem to be appropriate. Since our goal is to determine whether recurrent selective sweeps can produce the excess of linkage disequilibrium that is observed (given the proposed reduction in X-linked vs. autosomal diversity), the relevant question is whether larger values of Nx and Na are compatible with the data. The answer to this question is still no; Figure 4 shows that R* is very small when both Nx and Na are large (i.e., when the nonoverlapping sweep assumption is met).
The problem of overlapping sweeps might be exacerbated if the rate of selective events over time is not constant or the strength of selection is weaker. The general effects of a recurrent selective sweep model on the frequency spectrum and levels of linkage disequilibrium are not very sensitive to s, as long as s ≥ 0.002 (results not shown). However, for smaller selection coefficients (e.g., s < 0.002) and the small population sizes considered here, the simple selective sweep model becomes inappropriate due to the large number of overlapping selective events. Also, if natural selection is being driven by adaptation to new environments, then the rate of introduction of favorable alleles might depend heavily on the location and movement of populations and would be much higher at some times than at others. Without any independent source of information on the relevant parameters, we have no idea how often selective sweeps may have overlapped and interfered with each other over time. We also have no idea how multiple competing sweeps (perhaps in a subdivided population) affect levels of variation, the frequency spectrum, or patterns of linkage disequilibrium, or for that matter how sweeps in a subdivided population behave. For any of these models to be viable explanations of the data, they would need to increase levels of linkage disequilibrium on both chromosomes (though much more on the X than the autosomes). They would also need to be able to cause a decrease in levels of variability (on the X) without causing a skew in the frequency spectrum toward rare variants. This seems unlikely unless many of the sweeps are ongoing. Further work will explore how such models affect patterns of sequence polymorphism.
Another possibility is that adaptive evolution operated on standing variation, instead of newly arising mutations. If so, the rate of adaptation on the X might actually be slower than the rate on the autosomes (Orr and Betancourt 2001). Nothing is known yet about the predictions of such a model regarding levels of diversity, the frequency spectrum, or levels of linkage disequilibrium on different chromosomes. But, as before, it is unlikely that this model could decrease levels of diversity without affecting the frequency spectrum, unless many selective events have not yet led to fixation of the favored type.
Finally, natural selection might operate in a way that is fundamentally different from the simple directional selection models discussed above. However, Gillespie (1997) examined a range of selective models and found that all had similar effects on levels of diversity and the frequency spectrum (see, e.g., his Figure 3). This makes it less likely that any of them are consistent with the observed frequency spectra and levels of diversity (leaving aside the issue of whether they are consistent with the observed levels of linkage disequilibrium). An alternative put forward by Begun and Whitley (2000) is that the rapid changes in frequency (without fixation) of X-linked meiotic drive or sexually antagonistic alleles may also account for reduced levels of variability on the X relative to the autosomes. Nothing is known about how such a model would affect the frequency spectrum or levels of linkage disequilibrium.
Conclusions: Any evolutionary model that seeks to be a sufficient explanation for the North American D. simulans data must simultaneously be consistent with the observed levels of diversity, frequency spectra, and levels of linkage disequilibrium on the X and autosomes. A simple bottleneck model can do so, but only if nr ≤ 0.75 and the population size reduction was severe and recent. It is not clear how reasonable these conditions are. On the other hand, a simple hitchhiking model can be rejected because it is inconsistent with both the observed frequency spectra and levels of linkage disequilibrium.
The relative role of natural selection in shaping patterns of D. simulans genetic variation remains unknown. More work needs to be done to explore how other models of natural selection affect patterns of variability. These models might examine, e.g., adaptation in structured populations, natural selection in variable environments (cf., Gillespie 1991, 1997; Orr and Betancourt 2001), and/or interference between multiple favorable alleles. This work will give us a sense of which models are plausible for D. simulans data.
It will be much easier to test D. simulans evolutionary models once sequence polymorphism data from other (predominantly African) populations are gathered. These data might allow one to infer whether migration to the Americas occurred primarily from Europe or from Africa and would help us construct a reasonable demographic null model. Only by explicitly considering demography will we be able to start deciphering the contribution of natural selection for adaptation to different populations of D. simulans.
We thank B. Charlesworth and two anonymous reviewers for helpful suggestions on an earlier version of this manuscript. J.D.W. and M.P. were supported by National Science Foundation Postdoctoral Fellowships in Bioinformatics. P.A. was supported by a European Molecular Biology Organization Postdoctoral Fellowship.
Communicating editor: H. Ochman
- Received December 10, 2001.
- Accepted June 12, 2002.
- Copyright © 2002 by the Genetics Society of America