Abstract
Previous multilocus surveys of nucleotide polymorphism have documented a genome-wide excess of intralocus linkage disequilibrium (LD) in Drosophila melanogaster and D. simulans relative to expectations based on estimated mutation and recombination rates and observed levels of diversity. These studies examined patterns of variation from predominantly non-African populations that are thought to have recently expanded their ranges from central Africa. Here, we analyze polymorphism data from a Zimbabwean population of D. melanogaster, which is likely to be closer to the standard population model assumptions of a large population with constant size. Unlike previous studies, we find that levels of LD are roughly compatible with expectations based on estimated rates of crossing over. Further, a detailed examination of genes in different recombination environments suggests that markers near the telomere of the X chromosome show considerably less linkage disequilibrium than predicted by rates of crossing over, suggesting appreciable levels of exchange due to gene conversion. Assuming that these populations are near mutation-drift equilibrium, our results are most consistent with a model that posits heterogeneity in levels of exchange due to gene conversion across the X chromosome, with gene conversion being a minor determinant of LD levels in regions of high crossing over. Alternatively, if levels of exchange due to gene conversion are not negligible in regions of high crossing over, our results suggest a marked departure from mutation-drift equilibrium (i.e., toward an excess of LD) in this Zimbabwean population. Our results also have implications for the dynamics of weakly selected mutations in regions of reduced crossing over.
THE parameters θ (= 4Neμ), where Ne is the species effective population size and μ is the mutation rate per generation, and ρ (= 4Ner), where r is the sex-averaged crossing-over rate per generation, influence the amount and the pattern of genomic sequence variability. Under a Wright-Fisher neutral model, hereafter the standard neutral model, these two parameters are both proportional to Ne. Ne cannot be estimated directly from polymorphism data; instead, it must be estimated indirectly using estimates of θ and μ or estimates of ρ and r. Under the standard neutral model, both methods (i.e., using
Contrary to these expectations, recent studies of human and Drosophila data have revealed that the two methods for estimating Ne yield disparate results. In humans, the estimate of Ne from
The Drosophila data analyzed thus far have consisted predominantly of non-African samples. Both D. melanogaster and D. simulans are human commensals; they are thought to have originated in sub-Saharan Africa and only relatively recently colonized other locations (Lachaiseet al. 1988). Non-African populations may have experienced a recent contraction in population size or a recent burst of selection associated with adaptation to new environments (e.g., Aquadroet al. 1994; Hamblin and Veuille 1999; Andolfatto 2001; Kaueret al. 2002; Wallet al. 2002). Both of these scenarios are expected to reduce levels of variability in non-African populations relative to African ones, as observed (Begun and Aquadro 1993; Andolfatto 2001; Kaueret al. 2002); in addition, either one may have caused an increase in levels of intragenic LD and might explain the observed discrepancy between different estimates of Ne.
Interestingly, patterns of variation in human populations share some similarities with Drosophila (Aquadroet al. 2001). Non-African populations have less variability than sub-Saharan African ones (e.g., Frisseet al. 2001; Stephenset al. 2001), and a similar demographic model has been proposed for humans (e.g., Tishkoffet al. 1996): an African origin, a recent population bottleneck in non-African populations, and recent population growth. Unlike the comparable Drosophila studies, human surveys of genetic variation have usually contained a substantial sample from sub-Saharan African populations. It is tempting to speculate that this difference in sampling schemes may explain part of the difference in patterns of LD in the human and Drosophila data. Further suggestive evidence of the sensitivity to sampling scheme comes from studies of different human populations, which find substantially higher levels of LD in non-African populations compared with African ones (Frisseet al. 2001).
In addition to the effects of population history on patterns of LD, the nature of the meiotic recombination mechanism itself is an issue. In current models of meiotic recombination (e.g., Szostaket al. 1983), all recombination events involve some exchange of segments of DNA between homologous chromosomes. Such exchanges explain the phenomenon of gene conversion (the nonreciprocal transfer of an allele from one homolog to another). Whether or not these exchanges are accompanied by the reciprocal exchange of flanking markers (i.e., crossing over) reflects alternative outcomes of Holliday junction resolution. For clarity, we refer to recombination events without the exchange of flanking markers as “gene conversion,” and “crossing over” refers to recombination events with the exchange of flanking markers. Estimates of the recombination rate based on the large-scale comparison of genetic and physical maps ignore gene conversion, which contributes little to the rate of exchange between distant markers. However, at smaller physical scales (i.e., several kilobases), depending on parameters, gene conversion may contribute substantially to the total rate of genetic exchange (Andolfatto and Nordborg 1998). Thus, the interpretation of patterns of LD in the context of demographic models depends to some extent on gene conversion parameters.
Frequent gene conversion in humans may account for inflated estimates of ρ (over small physical scales) relative to large-scale map-based estimates of r in African populations (Frisseet al. 2001; Przeworski and Wall 2001). However, incorporating gene conversion would make the observed LD excess in Drosophila even more unusual (Andolfatto and Przeworski 2000; Wallet al. 2002). Since gene conversion and demography both affect patterns of LD, estimates of gene conversion parameters from nucleotide polymorphism data may be extremely inaccurate since the true population history is unknown.
In an attempt to distinguish between possible explanations for the discordant estimates of Ne in Drosophila, we analyze sequence polymorphism data at multiple loci in Zimbabwean population samples of D. melanogaster. This population is more likely than non-African populations to be closer to equilibrium since it is nearer to D. melanogaster’s ancestral range and may not have experienced drastic changes in size or ecology in its recent history.
The physical scale of LD should increase as the rate of recombination decreases. Interestingly, a recent survey of two gene regions at the tip of the X chromosome of D. melanogaster, where rates of meiotic crossing over are greatly reduced, suggested that the physical scale of LD is not larger than that in other regions of the genome (Langleyet al. 2000). Jensen et al. (2002) observed a similar pattern on chromosome 4 of D. melanogaster and D. simulans, which usually lacks crossing over during female meiosis (Hawleyet al. 1993). These patterns were interpreted as evidence for high levels of gene conversion in these regions of reduced crossing over. Here we explore these claims by quantifying levels of LD at many loci over a broad range of crossing-over rates on the X chromosome. Using these data, we investigate several alternative models of how gene conversion and crossing over may be associated.
METHODS
Previously published data: We consider all sequence polymorphism studies that include a sample size (n) of seven or more D. melanogaster sequences from Zimbabwe and 10 or more segregating sites (S). We adopt these minimal size restrictions because estimates of ρ are highly biased and inaccurate when both n and S are small (see justification in results). Fifteen previously published X chromosome data sets fit this size requirement: su(s) and su(wa) (Langleyet al. 2000); 6-phosphogluconate dehydrogenase, vinculin, and zeste (Pgd, vinc, and z; Andolfatto and Przeworski 2001); glucose-6-phosphate dehydrogenase (G6pd; Eaneset al. 1996); shaggy (sgg; C. Schlötterer, personal communication); period/CG2650, 100G10.2, sxy4, frag.3, frag.4, and CG3592 (Harret al. 2002); runt (Labateet al. 1999); and vermilion (v; Begun and Aquadro 1995). Nine autosomal gene regions fit the size requirement: Acp26Aa and Acp26Ab (Tsauret al. 1998), which were analyzed as one gene region; Acp36DE (Begunet al. 2000); Alcohol dehydrogenase (S.-C. Tsaur, unpublished data); Esterase-6 promoter region (Est6-p; Odgerset al. 2002); Hexokinase C (Hex-C; Duvernell and Eanes 2000); In(2L)t proximal breakpoint [In(2L)t-PBP; Andolfatto and Kreitman 2000]; Phosphogluconate mutase (Pgm; Verrelli and Eanes 2000); Heat-shock protein 70Bb (Hsp70Bb; Masideet al. 2002); and bicoid (Baineset al. 2002). All loci were surveyed in lines from one or both of two Zimbabwean population samples (Sengwa Wildlife Reserve and Harare, respectively) first described by Begun and Aquadro (1993).
Data collection: We collected nucleotide variability data for eight additional gene regions: yellow (y, 2017 bp, n = 49); Fasciclin-2 (Fas2, 588 bp, n = 17); spaghetti squash (sqh, 572 bp, n = 16); Hyperkinetic (Hk, 563 bp, n = 21); dusky (dy, 567 bp, n = 20); licorne (lic, 615 bp, n = 23); rutabega (rut, 548 bp, n = 22), and 1049 bp of the white gene (in two regions of 546 and 503 bp, respectively, separated by 3.9 kb, n = 16), which we analyzed as one region. We also expanded the snf1A data set reported in Andolfatto and Przeworski (2001). For the snf1A gene region, we increased the sample size from 13 to 25 individuals and the total length of the surveyed DNA from 514 to 2189 bp (in four separate regions of 591, 514, 488, and 596 bp, spread over 9.33 kb). These four regions surrounding the snf1A gene were analyzed as one region.
Details of the sequencing strategy for each gene region can be found at http://helios.bto.ed.ac.uk/evolgen/andolfatto/zimbabweLD. We PCR amplified each region from genomic DNA of single male flies (one male from each isofemale line), precipitated the products with polyethylene glycol (PEG-8000) to remove primers, and sequenced them on both strands using Big-Dye (Applied Biosystems, Foster City, CA) or DYEnamic ET (Amersham, Arlington Heights, IL) sequencing kits. One of our PCR/sequencing primers overlapped with a 63-bp deletion in region 4 of snf1A in individual zs53. For this region, this allele was amplified with flanking primers, cloned into pCR4-TOPO (Invitrogen, San Diego), and three independent clones were sequenced. For all loci surveyed, the lines used either are the same 13 individuals surveyed in Andolfatto and Przeworski (2001) from the Sengwa Wildlife Reserve, Zimbabwe or included additional alleles from a sample of 12 lines from Harare, Zimbabwe (kindly provided by C.-I Wu) and up to 24 lines collected from Victoria Falls, Zimbabwe (kindly provided by B. Ballard). Aligned and annotated sequences are available in nexus file format at the website http://helios.bto.ed.ac.uk/evolgen/andolfatto/zimbabweLD. A summary of polymorphic variation at each locus can be found at the website http://helios.bto.ed.ac.uk/evolgen/andolfatto/zimbabweLD. appendix a summarizes information about each of the loci used in this study.
Analysis of levels of variability: Our first goal is to estimate the effective population size (Ne) for each locus independently. For this purpose, we estimate
A second goal of this analysis is to compare joint estimates of Ne from
Estimating Nˆθ from
Linkage disequilibrium analysis: The goal of this analysis is to estimate Ne from levels of LD. We choose to summarize LD by an estimate of the population recombination rate ρ= 4Ner, where Ne is the effective population size, and r is the sex-averaged rate of recombination (cf. Andolfatto and Przeworski 2000; Wallet al. 2002). For these analyses, we use all biallelic mutations, including both single-nucleotide polymorphisms (SNPs) and simple insertion-deletion mutations. All multiply hit sites or overlapping mutational events are excluded.
We first employ an estimator,
Several other summary estimators of ρ have also been proposed (e.g., Hudson 1987, 2001; Hey and Wakeley 1997). We have chosen the estimator proposed by Hudson (2001),
A considerable fraction of polymorphisms segregating among autosomal genes (43/279) are amino acid replacement substitutions. If these are under weak purifying selection (Andolfatto 2001), they may segregate at lower frequencies on average, potentially affecting estimates of ρ. We estimated heterozygosities for each site [
Assessing the behavior of estimators under neutrality and alternative models: To test the properties of the two estimators of ρ under the standard neutral model, we run simulations with sample sizes (n) of 7, 13, or 50, with ρ= 3θ. We consider a range of θ values and calculate the average and median values of
Directional selection at linked sites is known to shape patterns of genetic variation (Hill and Robertson 1966) and is thought to explain the observed positive correlation between levels of diversity and the estimated crossing-over rate (Begun and Aquadro 1992; Aquadroet al. 1994). Both linkage to positively selected alleles (i.e., hitchhiking or selective sweeps, cf. Maynard Smith and Haigh 1974) or deleterious alleles (i.e., background selection, cf. Charlesworthet al. 1993) have been advanced as possible explanations.
Background selection due to strongly deleterious mutations can be modeled as a simple reduction in effective population size (Hudson and Kaplan 1994). This assumption is equivalent to varying θ (and ρ) in simulations of the standard neutral model. However, under background selection models that posit more weakly deleterious mutations, this assumption may not be valid (Charlesworthet al. 1995; Gordoet al. 2002). This situation is beyond the scope of our study and will be a topic of future work.
We do examine the effects of recurrent hitchhiking due to advantageous mutations by running simulations under a model of recurrent, nonoverlapping selective sweeps (Przeworski 2002). The program used was kindly provided by M. Przeworski. We consider a locus of fixed size (n = 50, number of sites = 2000, θ= 15, ρ= 45) and determine how the average and median values of Nˆθ and Nˆρ, as estimated assuming the standard neutral model, change as a function of the rate of selective sweeps (Λ in Przeworski 2002). We fix the selection coefficient s = 0.002 and the species population size Ne = 4 × 106. The choice of other parameter values yielded similar results (results not shown).
Incorporating the effects of gene conversion: Estimates of Nˆρ based on rates of crossing over (rˆ) may be systematically underestimated because they ignore gene conversion. Gene conversion was implemented into the standard coalescent with recombination (cf. Hudson 1990; Wiuf and Hein 2000; Przeworski and Wall 2001). Gene conversion is defined here as exchange of short tracts of DNA between homologous chromosomes with no associated crossing over between flanking markers. Thus, in this implementation, gene conversion and crossing over are modeled as mechanistically independent, and gene conversion exchanges associated with crossing over are ignored. For all models, we assume the rate of crossing over is rˆ and that the distribution of gene conversion tract lengths is geometric with a mean of 352 bp (Hillikeret al. 1994).
We consider three models of how gene conversion (GC) and crossing-over (CO) rates may be associated. In the first model (GC1), we assume that two-thirds of recombination events are GC, and thus the rate of GC, rˆGC = 2rˆ between adjacent sites, where rˆ is the estimated local rate of crossing over for a particular gene region. In a second model (GC2), we assume that the rate of GC is constant across a chromosome; rˆGC = 2rˆMAX, where rˆMAX is the highest estimated rate of crossing over on the chromosome (rˆMAX = 2.27 × 10-8/generation between adjacent sites on the X chromosome). In a third model (GC3), we assume that the total rate of recombination between adjacent sites is constant across a chromosome and that the rate of gene conversion between adjacent sites is rˆGC = (rˆMAX - rˆ), where rˆ is the estimated rate of crossing over for the locus under study. GC1 and GC2 are equivalent for genomic regions with the highest rates of crossing over and diverge in regions of reduced crossing over. Likewise, GC3 is equivalent to a model with no GC for regions with the highest rates of crossing over. These models are meant to be illustrative rather than quantitative and other possible models of recombination are addressed in the discussion. Nˆρ is estimated as above (from
How do gene conversion rates in our models compare to data on gene conversion from Drosophila? On the basis of the data from the genes maroon-like and rosy, it is thought that the rate of gene conversion (γ, see Andolfatto and Nordborg 1998) is on the order of 10-5 events/generation and that 50% or more of recombination events are gene conversions without an associated exchange of flanking markers (Finnerty 1976; Hilliker and Chovnick 1981). Under GC1 and GC2, the sex-averaged GC rate is 1.5 × 10-5/generation in regions of the genome with the highest rates of crossing over, which is similar to the rate that has been measured at rosy and maroon-like (∼10-5, Finnerty 1976; Hilliker and Chovnick 1981). The ratio of GC:CO has been estimated to be ∼1 at maroon-like (Finnerty 1976) and ∼4 for rosy (Hilliker and Chovnick 1981). In GC1 and GC3 models, the ratio of GC to CO rates at the rosy locus (rˆ = 0.7 cM/Mb) would be ∼2 and corresponds to a GC rate of 6 × 10-6/generation. In the GC2 model, the GC:CO ratio at rosy would be ∼6. For maroon-like (rˆ = 1.36 cM/Mb), the GC:CO ratio = 2 under GC1 (rate ∼10-5/generation) and would be ∼0.7 under GC3 (rate ∼3 × 10-6/generation). In the GC2 model, the GC:CO ratio at maroon-like would be ∼3. Thus, our assumed rates of gene conversion, and the extent of association between gene conversion and crossing over, are roughly in agreement with the limited data on rates of gene conversion in Drosophila.
Bias of under the standard neutral model
RESULTS
Performance of ρ estimators and choice of data: With a choice of two similarly performing ρ estimators,
—Comparison of estimates of (per locus) for the 24 X-linked and 9 autosomal polymorphism data sets used in this study.
We have used these results as a guide to select data for our study (see methods) and accordingly we have limited the data sets considered here to those that have n ≥ 7 and S > 10 (appendix a). Only 4 of our selected data sets have n = 7 and for each of these
Figure 1 compares estimates of
Differences in sample size and the number of segregating sites are not the only possible causes of the systematic difference between estimators. Regions of reduced recombination in D. melanogaster have lower levels of variability due to hitchhiking and/or background selection (Aquadroet al. 1994; Charlesworth 1996). If background selection can be approximated as a simple reduction in effective population size, the expected behavior of ρ estimators will be similar to that under the neutral model (cf. Table 1). However, the observation that these regions also have more low-frequency polymorphisms suggests that recurrent selective sweeps may also influence polymorphism patterns in these regions (Andolfatto and Przeworski 2001).
Since one of our objectives is to compare Nˆρ and Nˆθ estimates across a gradient of crossing-over rates, we investigated the properties of the two ρ estimators under a recurrent hitchhiking model (Table 2). More frequent hitchhiking is expected to decrease levels of variability at closely linked sites (Maynard Smith and Haigh 1974; Kaplanet al. 1989) and thus decrease Nˆθ. As can be seen in Table 2, the two ρ estimators respond differently to increasing rates of hitchhiking.
Effect of selective sweeps on estimates of Nθ and Nρ
Further investigation of the behavior of these ρ estimators under alternative models is beyond the scope of this study. In the analyses that follow, we focus on
Comparing levels of diversity and linkage disequilibrium: In a neutral panmictic population, we expect that Nˆρ ≈ Nˆθ. Table 3 lists estimates of Nˆ for each data set and these are plotted in cytological order in Figure 2. With the exception of the tip of the X chromosome (the first 10 points cover cytological bands 1B1 to 3B3 on the X chromosome), values of Nˆρ based on
Further, due to the inherent stochasticity of the evolutionary process, individual estimates of Nˆθ and Nˆρ are not particularly accurate. We can achieve greater precision by estimating the relative likelihoods for different Nˆ values using multiple loci jointly. Figure 3a shows the joint-likelihood curve of Nˆρ (based on
Several interesting observations emerge from these results. First, the estimate of Nˆθ for the X chromosome is larger than that for the autosomes. This trend has been noted before on the basis of levels of nucleotide (Andolfatto 2001) and microsatellite variability (Kaueret al. 2002). On the basis of the confidence intervals for
Estimates of Ne (in millions) for each locus
—Point estimates of N based on linkage disequilibrium ( , open circles) and levels of diversity (
, shaded circles). A total of 24 X-linked and 9 autosomal loci are plotted in order of cytological position and in order of appearance in appendix a and Table 3). The assumed mutation rate is 1.5 × 10-9/site/generation.
Interestingly, joint estimates of Nˆρ and Nˆθ are significantly different from each other on both the X and the autosomes, given our assumed mutation rate
—(a) Joint maximum-likelihood estimates of Nˆρ (curves) and Nˆθ (points with ∼95% confidence intervals). The assumed mutation rate is 1.5 × 10-9/generation. Note that confidence intervals for Nˆθ assume no recombination. The dotted line indicates the approximate 95% confidence bounds using the standard chi-square approximation. (b) Comparison of joint-likelihood estimates of Nˆθ and Nˆρ in Zimbabwean D. melanogaster and a California population of D. simulans (Begun and Whitley 2000; Wallet al. 2002). Approximate 95% confidence interval bars on estimates are indicated.
Joint-likelihood estimates of Nˆρ and Nˆθ for 15 X-linked and 14 autosomal loci from the California population of D. simulans are shown for comparison (Figure 3b). It is interesting to note in Figure 3b that ratios of Nˆρ/ Nˆθ for both the X and the autosomes are much larger for the Zimbabwean population of D. melanogaster (0.4 and 0.2, respectively) than for the California population of D. simulans (0.02 and 0.07, respectively; cf. Wallet al. 2002). While uncertainty in the mutation rate (∼2-fold) may explain the discrepancy between Nˆρ and Nˆθ in Zimbabwean D. melanogaster, it is unlikely to explain the >10-fold difference between these estimates of N in the D. simulans population in Figure 3b. This suggests that something is fundamentally different in the evolutionary history of these two populations or species. In particular, the Zimbabwe population of D. melanogaster may be closer to mutation-drift equilibrium, while the California population of D. simulans clearly is not.
Behavior of estimates of ρ as a function of r: A positive correlation between
The insensitivity of
—Relationships between (a) per 100 bp, (b)
per 100 silent sites, (c)
, (d) Nˆρ (
), and (e) Nˆρ (
)/Nˆθ (
) vs. sex-averaged rˆ. Estimates for 24 X-linked loci are plotted as shaded circles; estimates for 9 autosomal loci are plotted as open circles. Spearman correlation coefficients (R) and two-tailed P values are given for X-linked loci only. The dotted line in e indicates the neutral equilibrium expectation for Nρ/Nθ.
Could a bias in
—Joint maximum-likelihood estimates of Nˆρ (curves) andNˆθ (points with ∼95% confidence intervals) under gene conversion (model GC1, see methods). The dotted line indicates the approximate 95% confidence bounds using the standard chi-square approximation.
Gene conversion and LD patterns in regions of high crossing over: Our estimates of recombination rate are based on large-scale comparisons of genetic and physical maps that ignore the contribution of gene conversion to the total rate of recombination. At small physical scales (i.e., <1 kb in Drosophila), the contribution of gene conversion to the overall rate of recombination might be substantial (Andolfatto and Nordborg 1998). The precise relationship between crossing over and gene conversion is not well understood in Drosophila, but the consensus view is that they are associated processes that result from Holliday junction formation and resolution during meiotic recombination (Carpenter 1984).
Given that gene conversion is a potentially important contributor to the overall rate of recombination, we reexamine the relationship between levels of LD relative to levels of variability (measured as joint Nˆ for loci with rˆ > 1 cM/Mb), assuming a model that includes both crossing over and gene conversion (e.g., GC1; see methods). In general, we find that, if gene conversion contributes substantially to the overall rate of recombination in regions of high crossing over, there is clear evidence for an excess of LD given levels of diversity in the Zimbabwe data. As an illustration, we present results for gene conversion model GC1 in Figure 5 (this model assumes that two-thirds of recombination events are gene conversions without associated crossing over; see methods). Under this model, NˆρX = 0.33 million for the X chromosome (∼95% confidence interval of 0.24-0.50) and NˆρA = 0.15 million for the autosomes (∼95% confidence interval of 0.04-0.20). Since Nˆρ and Nˆθ differ by more than an order of magnitude under this model, this discrepancy is not likely to be explained by uncertainty in the mutation rate (about twofold, see methods). Thus, our results suggest either that the level of genetic exchange due to gene conversion is close to negligible in regions of high crossing over or that this Zimbabwean population departs substantially from the predictions of mutation-drift equilibrium in the direction of a excess of LD. Note that estimates of Nˆρ for D. simulans (see Figure 3b) did not include gene conversion, so the claim that the Zimbabwean population is closer to a population that is at mutation-drift equilibrium is still a valid one.
DISCUSSION
LD, population history, and gene conversion: Previous studies have found Nˆρ ⪡ Nˆθ for many genes in both D. melanogaster and D. simulans, which has been interpreted as a genome-wide excess of LD in these two species (Andolfatto and Przeworski 2000; Wallet al. 2002). Here, we find that Nˆρ (based on rates of crossing over) are much closer to Nˆθ estimates in Zimbabwean populations of D. melanogaster (Figures 2 and 3), suggesting that the latter populations may be closer to expectations under mutation-drift equilibrium (with respect to LD patterns). The marked contrast in LD patterns observed between this and earlier studies (e.g., Figure 3b) indicates that the differences between Nˆρ and Nˆθ estimates in non-African populations of D. melanogaster and D. simulans are not due to systematic errors in the methodology or estimates of r and μ. Rather, these findings point to differences in the history of the populations (or species) studied. In particular, Zimbabwean populations of D. melanogaster may have a more stable demographic history than non-African populations, which has led to large differences between these populations in patterns of LD. This may be expected since Zimbabwean D. melanogaster populations are closer to the species’ ancestral range (Lachaiseet al. 1988).
Our results may also shed some light on the importance of gene conversion (i.e., recombination without the exchange of flanking markers) relative to crossing over in Drosophila. Estimates of r based on the large-scale comparisons of genetic and physical maps essentially measure only the crossing-over rate; when there is gene conversion, they may considerably underestimate the actual rate of genetic exchange (Andolfatto and Nordborg 1998). In contrast to Drosophila, previous studies of human sequence polymorphism data have found that Nˆρ (based on rates of crossing over) were generally much larger than Nˆθ (Frisseet al. 2001; Przeworski and Wall 2001), which was interpreted as evidence that gene conversion may be having a significant impact on LD patterns in humans. If the same situation were true in D. melanogaster, we also might expect to generally observe Nˆρ ⪢ Nˆθ.
In fact, estimates of Nˆρ based on rates of crossing over are lower than Nˆθ in Zimbabwean D. melanogaster populations, suggesting that, if these populations are near a mutation-drift equilibrium, levels of exchange due to gene conversion cannot be high in regions of high crossing over. Alternatively, if gene conversion is an important force in regions of high crossing over, we must conclude that these populations depart from equilibrium model expectations. For example, assuming that gene conversion contributes as much to intragenic recombination as crossing over implies that there is actually a marked excess of LD in these populations (Figure 5). Distinguishing between these alternatives is not possible without better estimates of gene conversion parameters in regions of high crossing over (see below).
Uncertainty about the relative importance of gene conversion does not affect our conclusion that population history is an important determinant of patterns of LD in Drosophila. Earlier studies, primarily on non-African Drosophila populations, have ignored the possible impact of gene conversion on LD patterns; including gene conversion would make the observed departures from the standard neutral model even more extreme (Andolfatto and Przeworski 2000; Wallet al. 2002). Thus, even if Zimbabwean populations of D. melanogaster also depart from the predictions of mutation-drift equilibrium, the discrepancy is smaller than that in previous studies, which focused on non-African populations (see Figure 3b).
Patterns of LD across a recombination gradient: Estimates of the local effective population size based on patterns of diversity (Nˆθ) are strongly positively correlated with estimated rates of crossing over in Drosophila (Begun and Aquadro 1992; Aquadroet al. 1994; Andolfatto and Przeworski 2001), as expected under background selection and/or recurrent selective sweep models. It is less clear why we observe a negative correlation between Nˆρ and rˆ (Figure 4d) or why Nˆρ/Nˆθ are strongly negatively correlated with rˆ (Figure 4e).
Could either background selection or recurrent hitchhiking explain the patterns observed in Figure 4, d and e? If background selection due to strongly deleterious mutations can be reasonably well approximated by a reduction in effective population size (Hudson and Kaplan 1994), it can be modeled using the standard neutral model by invoking a smaller local effective population size (i.e., smaller ρ and θ, but with ρ/θ remaining the same) with decreasing r. Thus, we expect Nˆθ and Nˆρ to be reduced on average by roughly the same amount in regions of lower crossing over. Contrary to this expectation, the ratio Nˆρ/Nˆθ increases as rˆ decreases (Figure 4e). Since the background selection model predicts a strongly positive correlation between Nˆρ and rˆ (and since the median of
If mutations are only slightly deleterious, background selection may not be well approximated by a simple reduction in effective population size (e.g., Charlesworthet al. 1995; McVean and Charlesworth 2000; Tachida 2000; Comeron and Kreitman 2002; Gordoet al. 2002). Under such a model, deleterious alleles will be in repulsion disequilibrium (Hill and Robertson 1966), as may be linked neutral variation (Comeron and Kreitman 2002). However, it is not clear how such a model would affect our estimates of
Simple recurrent selective sweep models are also known to skew the frequency spectrum of segregating polymorphisms toward an excess of rare variants (e.g., Bravermanet al. 1995). A significant skew toward rare polymorphisms in regions of reduced crossing over in this Zimbabwean D. melanogaster population suggests that some form of hitchhiking may be operating (Andolfatto and Przeworski 2001). We have performed a set of preliminary simulations to examine the effects of recurrent selective sweeps on estimators of ρ (Table 2). As the rate of selective sweeps increases, the level of polymorphism decreases, yielding smaller Nˆθ estimates on average. However, selective sweeps also affect levels of LD, and the average Nˆρ value (as estimated by
Evidence for heterogeneity in gene conversion rates across the genome: As discussed earlier, our rˆ estimates do not reflect the true rate of genetic exchange when there is gene conversion. In particular, if the relative level of exchange due to gene conversion were higher in regions of reduced crossing over, then Nˆρ/Nˆθ would be negatively correlated with rˆ, as observed in Figure 4e. Langley and colleagues have suggested that such a model may account for the rapid rate of decay of pairwise LD with physical distance in regions of reduced crossing over at the tip of the X chromosome (Langleyet al. 2000). A similarly rapid decay in LD is observed on the fourth chromosome of D. melanogaster (Jensenet al. 2002). Since this chromosome lacks crossing over under normal conditions, high levels of gene conversion may also account for lower than expected levels of LD.
Patterns of LD on the X chromosome under four models of gene conversion
Data on gene conversion in Drosophila are sparse. To date, perhaps the best estimation of gene conversion rates in D. melanogaster is the detailed study of the rosy locus (e.g., Hilliker and Chovnick 1981; Hillikeret al. 1994). The authors estimate that up to 80% of meiotic recombination events may be resolved as gene conversion events at rosy, but it is not clear how general these results are. For example, data from the maroon-like locus suggest that gene conversions may account for ∼50% of recombination events (Finnerty 1976). Since rosy and maroon-like are both in regions of intermediate levels of crossing over (rosy, rˆ = 0.7 cM/Mb; maroon-like, rˆ = 1.36 cM/Mb), this degree of association may not generalize to regions of the genome with higher rates of crossing over. This said, the only measurement of gene conversion in a region of high crossing over, though not easily interpreted, suggests that the contribution of gene conversion is not negligible (rudimentary; see Finnerty 1976).
How might GC and CO covary across the genome? To answer this question, we investigated the effect of three possible models of gene conversion on estimates of Nˆρ across the X chromosome. In the first model (GC1), rates of gene conversion and crossing over positively covary. In the second model (GC2), gene conversion rates are constant across the X chromosome. In the third model (GC3), gene conversion and crossing over negatively covary (see methods for details). We measure the “fit” of a gene conversion model on the basis of two features of the data. The first is that, if the total recombination rate (the sum of GC + CO) is adequately accounted for at each gene, estimates of Nˆρ/ Nˆθ should be roughly equal in regions of high and low crossing over (i.e., the negative correlation between Nˆρ/ Nˆθ and rˆ should disappear). A second measure of fit is the absolute value of Nˆρ/Nˆθ. Assuming that this Zimbabwe population is near a mutation-drift equilibrium, we expect Nˆρ/Nˆθ to be ∼1.
In Table 4, we list estimates of Nˆρ/Nˆθ for yellow, su(s), su(wa), snf1a, Pgd, and vinculin (see appendix a for details) under these models and compare them to the average estimate in regions of the highest crossing over on the X chromosome (rˆ = 2.27 cM/Mb). As can be seen, while the model in which GC and CO positively covary (GC1) reduces Nˆρ/Nˆθ estimates in regions of reduced crossing over, the negative correlation between Nˆρ/Nˆθ and rˆ remains. Models GC2 and GC3 are better fits to the data because they yield estimates of Nˆρ/Nˆθ that are similar in regions of high and low crossing over. Overall, the model in which gene conversion and crossing-over rates negatively covary (GC3) appears to be the most reasonable explanation because it explains the correlation between Nˆρ/Nˆθ and rˆ without the need to invoke a large LD excess in this population. However, data on gene conversion from the rudimentary locus (reviewed above) suggest that our GC3 model may underestimate the impact of gene conversion on patterns of LD in regions of high crossing over. If this is the case, our results suggest a departure from equilibrium expectations in this population.
The above analysis is intended to be only illustrative as many other models of gene conversion are possible. For example, some form of heterozygosity-dependent gene conversion may be operating (Stephan and Langley 1992; Langleyet al. 2000). If so, regions with reduced crossing over, which tend to have less variability (see Figure 4b), may have higher rates of gene conversion and/or longer gene conversion tracts (see Langleyet al. 2000 for a discussion). Note that, in our models, we have assumed that the distribution of gene conversion tract lengths is the same in regions of high and low crossing over. A heterozygosity-dependent gene conversion model has not yet been explored enough to make quantitative predictions, but it has the potential to explain the negative Nˆρ/Nˆθ since it is qualitatively similar to our GC3 model. Further distinguishing between the models we have presented, and confidently inferring whether Zimbabwean D. melanogaster populations are near a mutation-drift equilibrium, will require a more detailed investigation of the gene conversion parameters in regions of high and low crossing over.
Implications for weak selection models and recombination-associated mutation biases: While gene conversion may explain lower than expected LD in regions of reduced crossing over, it does not pose a problem for the interpretation of the strongly positive correlation between Nˆθ and rˆ. Models of recurrent hitchhiking (Bravermanet al. 1995) or background selection (Charlesworth 1996) invoke relatively strong selection to account for this correlation (i.e., selection that will affect relatively large chromosomal segments). The effects of these selection models on levels of variability are primarily determined by the crossing over rate, since gene conversion contributes little to the total recombination rate at distances much greater than the average gene conversion tract length (Andolfatto and Nordborg 1998). However, gene conversion may be relevant to models that posit weaker selection, such as models invoking weak selection on synonymous codon usage or some fraction of amino acid replacement mutations (e.g., Li 1987; McVean and Charlesworth 2000; Tachida 2000; Comeron and Kreitman 2002). Weakly selected mutations are predicted to interfere with the efficacy of selection at closely linked sites, a phenomenon called weak Hill-Robertson interference (wHRi). Our results suggest that gene conversion may limit the magnitude of wHRi, particularly in regions of low crossing over, where wHRi effects are predicted to be the most important.
Finally, we note that a correlation between synonymous site G + C base composition and rates of crossing over has been documented in D. melanogaster (Kliman and Hey 1993). This pattern has been interpreted as evidence of reduced efficacy of weak selection for preferred (i.e., G or C ending) codons in regions of low crossing over due to Hill-Robertson interference in these regions. This view has recently been challenged on the basis of the observation that G + C content of noncoding DNA also appears to correlate with rates of crossing over (Maraiset al. 2001; Marais and Piganeau 2002), suggesting a recombination-mediated mutation bias [the biased gene conversion (BGC) hypothesis; Birdsell 2002; Marais 2003]. Invoking the BGC hypothesis to explain the correlation between G + C content and rates of crossing over presupposes that gene conversion and crossing over positively covary (similar to our model GC1). However, our results based on LD patterns suggest that gene conversion and crossing over are not associated in this way, at least at the tip of the X chromosome in Drosophila. The BGC hypothesis may not explain differences in base composition in regions of high vs. low crossing over if rates of gene conversion are constant across the genome or negatively covary with rates of crossing over.
It is worth noting that G + C content of noncoding DNA in telomeric regions of Drosophila appears to be higher than that in centromeric regions (Maraiset al. 2001; Marais and Piganeau 2002), despite both being regions of reduced crossing over (Charlesworth 1996). Thus, our results may be consistent with the BGC hypothesis if high rates of gene conversion occur in telomeric regions but not in other regions of reduced crossing over. However, noncoding regions on chromosome 4 of D. melanogaster (which lacks crossing over) have the lowest G + C content in the genome (Maraiset al. 2001; G. Marais, personal communication), yet polymorphism data from the fourth chromosome suggest considerable levels of meiotic exchange (Jensenet al. 2002). Assuming that all genetic exchange on chromosome 4 is due to gene conversion, we estimate
Summary statistics for each gene region used in this study
Acknowledgments
We thank D. Bachtrog, B. Charlesworth, G. Marais, M. Przeworski, and S. Wright for helpful discussions and comments on this manuscript. We are grateful to M.-L. Wu, C.-I Wu, and L. Partridge and her lab for sending us flies. Special thanks go to M. Przeworski for the use of her recurrent hitchhiking program and to B. Ballard for making new African population collections available to the community. P.A. was supported by a Royal Society of Edinburgh/Scottish Executive Personal Research Fellowship. J.D.W. was supported in part by a National Science Foundation Postdoctoral Fellowship in Bioinformatics. This research was funded by the Biotechnology and Biological Sciences Research Council.
Footnotes
-
Communicating editor: W. Stephan
- Received March 10, 2003.
- Accepted July 25, 2003.
- Copyright © 2003 by the Genetics Society of America