Previous multilocus surveys of nucleotide polymorphism have documented a genome-wide excess of intralocus linkage disequilibrium (LD) in Drosophila melanogaster and D. simulans relative to expectations based on estimated mutation and recombination rates and observed levels of diversity. These studies examined patterns of variation from predominantly non-African populations that are thought to have recently expanded their ranges from central Africa. Here, we analyze polymorphism data from a Zimbabwean population of D. melanogaster, which is likely to be closer to the standard population model assumptions of a large population with constant size. Unlike previous studies, we find that levels of LD are roughly compatible with expectations based on estimated rates of crossing over. Further, a detailed examination of genes in different recombination environments suggests that markers near the telomere of the X chromosome show considerably less linkage disequilibrium than predicted by rates of crossing over, suggesting appreciable levels of exchange due to gene conversion. Assuming that these populations are near mutation-drift equilibrium, our results are most consistent with a model that posits heterogeneity in levels of exchange due to gene conversion across the X chromosome, with gene conversion being a minor determinant of LD levels in regions of high crossing over. Alternatively, if levels of exchange due to gene conversion are not negligible in regions of high crossing over, our results suggest a marked departure from mutation-drift equilibrium (i.e., toward an excess of LD) in this Zimbabwean population. Our results also have implications for the dynamics of weakly selected mutations in regions of reduced crossing over.
THE parameters θ (= 4Neμ), where Ne is the species effective population size and μ is the mutation rate per generation, and ρ (= 4Ner), where r is the sex-averaged crossing-over rate per generation, influence the amount and the pattern of genomic sequence variability. Under a Wright-Fisher neutral model, hereafter the standard neutral model, these two parameters are both proportional to Ne. Ne cannot be estimated directly from polymorphism data; instead, it must be estimated indirectly using estimates of θ and μ or estimates of ρ and r. Under the standard neutral model, both methods (i.e., using and or and rˆ) should yield similar estimates of Ne (Hudson 1987).
Contrary to these expectations, recent studies of human and Drosophila data have revealed that the two methods for estimating Ne yield disparate results. In humans, the estimate of Ne from and , over short physical distances (i.e., within several kilobases), is substantially lower than the estimate from and rˆ. This suggests less intragenic linkage disequilibrium (LD) than expected under the standard neutral model, given estimated rates of crossing over (Frisseet al. 2001; Przeworski and Wall 2001). In Drosophila melanogaster and D. simulans the opposite pattern is observed: this approach suggests that there is an excess of LD relative to the standard neutral model expectations (Andolfatto and Przeworski 2000; Wallet al. 2002). Thus it appears that at least one of the assumptions of the standard neutral model is incorrect for both taxa. Possible explanations include more complex population histories (e.g., changes in effective population size over time, or geographic structure), the widespread effects of natural selection (at the loci studied, or at closely linked loci), or systematic errors in estimating μ, r, or ρ. By determining how various population genetic models differentially affect levels of diversity and levels of LD, we might infer which types of models are most plausible for each species (Wall 2001; Wallet al. 2002).
The Drosophila data analyzed thus far have consisted predominantly of non-African samples. Both D. melanogaster and D. simulans are human commensals; they are thought to have originated in sub-Saharan Africa and only relatively recently colonized other locations (Lachaiseet al. 1988). Non-African populations may have experienced a recent contraction in population size or a recent burst of selection associated with adaptation to new environments (e.g., Aquadroet al. 1994; Hamblin and Veuille 1999; Andolfatto 2001; Kaueret al. 2002; Wallet al. 2002). Both of these scenarios are expected to reduce levels of variability in non-African populations relative to African ones, as observed (Begun and Aquadro 1993; Andolfatto 2001; Kaueret al. 2002); in addition, either one may have caused an increase in levels of intragenic LD and might explain the observed discrepancy between different estimates of Ne.
Interestingly, patterns of variation in human populations share some similarities with Drosophila (Aquadroet al. 2001). Non-African populations have less variability than sub-Saharan African ones (e.g., Frisseet al. 2001; Stephenset al. 2001), and a similar demographic model has been proposed for humans (e.g., Tishkoffet al. 1996): an African origin, a recent population bottleneck in non-African populations, and recent population growth. Unlike the comparable Drosophila studies, human surveys of genetic variation have usually contained a substantial sample from sub-Saharan African populations. It is tempting to speculate that this difference in sampling schemes may explain part of the difference in patterns of LD in the human and Drosophila data. Further suggestive evidence of the sensitivity to sampling scheme comes from studies of different human populations, which find substantially higher levels of LD in non-African populations compared with African ones (Frisseet al. 2001).
In addition to the effects of population history on patterns of LD, the nature of the meiotic recombination mechanism itself is an issue. In current models of meiotic recombination (e.g., Szostaket al. 1983), all recombination events involve some exchange of segments of DNA between homologous chromosomes. Such exchanges explain the phenomenon of gene conversion (the nonreciprocal transfer of an allele from one homolog to another). Whether or not these exchanges are accompanied by the reciprocal exchange of flanking markers (i.e., crossing over) reflects alternative outcomes of Holliday junction resolution. For clarity, we refer to recombination events without the exchange of flanking markers as “gene conversion,” and “crossing over” refers to recombination events with the exchange of flanking markers. Estimates of the recombination rate based on the large-scale comparison of genetic and physical maps ignore gene conversion, which contributes little to the rate of exchange between distant markers. However, at smaller physical scales (i.e., several kilobases), depending on parameters, gene conversion may contribute substantially to the total rate of genetic exchange (Andolfatto and Nordborg 1998). Thus, the interpretation of patterns of LD in the context of demographic models depends to some extent on gene conversion parameters.
Frequent gene conversion in humans may account for inflated estimates of ρ (over small physical scales) relative to large-scale map-based estimates of r in African populations (Frisseet al. 2001; Przeworski and Wall 2001). However, incorporating gene conversion would make the observed LD excess in Drosophila even more unusual (Andolfatto and Przeworski 2000; Wallet al. 2002). Since gene conversion and demography both affect patterns of LD, estimates of gene conversion parameters from nucleotide polymorphism data may be extremely inaccurate since the true population history is unknown.
In an attempt to distinguish between possible explanations for the discordant estimates of Ne in Drosophila, we analyze sequence polymorphism data at multiple loci in Zimbabwean population samples of D. melanogaster. This population is more likely than non-African populations to be closer to equilibrium since it is nearer to D. melanogaster’s ancestral range and may not have experienced drastic changes in size or ecology in its recent history.
The physical scale of LD should increase as the rate of recombination decreases. Interestingly, a recent survey of two gene regions at the tip of the X chromosome of D. melanogaster, where rates of meiotic crossing over are greatly reduced, suggested that the physical scale of LD is not larger than that in other regions of the genome (Langleyet al. 2000). Jensen et al. (2002) observed a similar pattern on chromosome 4 of D. melanogaster and D. simulans, which usually lacks crossing over during female meiosis (Hawleyet al. 1993). These patterns were interpreted as evidence for high levels of gene conversion in these regions of reduced crossing over. Here we explore these claims by quantifying levels of LD at many loci over a broad range of crossing-over rates on the X chromosome. Using these data, we investigate several alternative models of how gene conversion and crossing over may be associated.
Previously published data: We consider all sequence polymorphism studies that include a sample size (n) of seven or more D. melanogaster sequences from Zimbabwe and 10 or more segregating sites (S). We adopt these minimal size restrictions because estimates of ρ are highly biased and inaccurate when both n and S are small (see justification in results). Fifteen previously published X chromosome data sets fit this size requirement: su(s) and su(wa) (Langleyet al. 2000); 6-phosphogluconate dehydrogenase, vinculin, and zeste (Pgd, vinc, and z; Andolfatto and Przeworski 2001); glucose-6-phosphate dehydrogenase (G6pd; Eaneset al. 1996); shaggy (sgg; C. Schlötterer, personal communication); period/CG2650, 100G10.2, sxy4, frag.3, frag.4, and CG3592 (Harret al. 2002); runt (Labateet al. 1999); and vermilion (v; Begun and Aquadro 1995). Nine autosomal gene regions fit the size requirement: Acp26Aa and Acp26Ab (Tsauret al. 1998), which were analyzed as one gene region; Acp36DE (Begunet al. 2000); Alcohol dehydrogenase (S.-C. Tsaur, unpublished data); Esterase-6 promoter region (Est6-p; Odgerset al. 2002); Hexokinase C (Hex-C; Duvernell and Eanes 2000); In(2L)t proximal breakpoint [In(2L)t-PBP; Andolfatto and Kreitman 2000]; Phosphogluconate mutase (Pgm; Verrelli and Eanes 2000); Heat-shock protein 70Bb (Hsp70Bb; Masideet al. 2002); and bicoid (Baineset al. 2002). All loci were surveyed in lines from one or both of two Zimbabwean population samples (Sengwa Wildlife Reserve and Harare, respectively) first described by Begun and Aquadro (1993).
Data collection: We collected nucleotide variability data for eight additional gene regions: yellow (y, 2017 bp, n = 49); Fasciclin-2 (Fas2, 588 bp, n = 17); spaghetti squash (sqh, 572 bp, n = 16); Hyperkinetic (Hk, 563 bp, n = 21); dusky (dy, 567 bp, n = 20); licorne (lic, 615 bp, n = 23); rutabega (rut, 548 bp, n = 22), and 1049 bp of the white gene (in two regions of 546 and 503 bp, respectively, separated by 3.9 kb, n = 16), which we analyzed as one region. We also expanded the snf1A data set reported in Andolfatto and Przeworski (2001). For the snf1A gene region, we increased the sample size from 13 to 25 individuals and the total length of the surveyed DNA from 514 to 2189 bp (in four separate regions of 591, 514, 488, and 596 bp, spread over 9.33 kb). These four regions surrounding the snf1A gene were analyzed as one region.
Details of the sequencing strategy for each gene region can be found at http://helios.bto.ed.ac.uk/evolgen/andolfatto/zimbabweLD. We PCR amplified each region from genomic DNA of single male flies (one male from each isofemale line), precipitated the products with polyethylene glycol (PEG-8000) to remove primers, and sequenced them on both strands using Big-Dye (Applied Biosystems, Foster City, CA) or DYEnamic ET (Amersham, Arlington Heights, IL) sequencing kits. One of our PCR/sequencing primers overlapped with a 63-bp deletion in region 4 of snf1A in individual zs53. For this region, this allele was amplified with flanking primers, cloned into pCR4-TOPO (Invitrogen, San Diego), and three independent clones were sequenced. For all loci surveyed, the lines used either are the same 13 individuals surveyed in Andolfatto and Przeworski (2001) from the Sengwa Wildlife Reserve, Zimbabwe or included additional alleles from a sample of 12 lines from Harare, Zimbabwe (kindly provided by C.-I Wu) and up to 24 lines collected from Victoria Falls, Zimbabwe (kindly provided by B. Ballard). Aligned and annotated sequences are available in nexus file format at the website http://helios.bto.ed.ac.uk/evolgen/andolfatto/zimbabweLD. A summary of polymorphic variation at each locus can be found at the website http://helios.bto.ed.ac.uk/evolgen/andolfatto/zimbabweLD. appendix a summarizes information about each of the loci used in this study.
Analysis of levels of variability: Our first goal is to estimate the effective population size (Ne) for each locus independently. For this purpose, we estimate assuming a constant mutation rate, μ (see below). We multiply Nˆθ for loci on the X chromosome by 4/3 to make them comparable to estimates for autosomal loci (which makes a number of assumptions, see below). We employ two estimators of θ: , which is based on the number of single nucleotide polymorphisms observed in the sample (Watterson 1975); and , which is the average pairwise divergence between sampled chromosomes per site (cf. Li 1997, p. 238). For both of these methods, we use the total number of silent mutations (including multiply hit sites) and for , we also use the Jukes-Cantor correction for multiple hits as implemented in DnaSP version 3.59 (http://www.ub.es/dnasp/).
A second goal of this analysis is to compare joint estimates of Ne from (Nˆθ) to joint estimates of Ne from LD (see below). Since a correlation between variability and recombination rates has been well documented in D. melanogaster (Begun and Aquadro 1992; Aquadroet al. 1994; Andolfatto and Przeworski 2001), we limit this analysis to genomic regions with rˆ > 1 cM/Mb (i.e., those regions least likely to be affected by selection at linked sites). To do this, we calculated averages of and for silent sites (weighted by the number of silent sites surveyed). This approach has the advantage that and are unbiased estimates of θ; however, confidence intervals must be obtained by coalescent simulations for each locus. We instead estimated θ ( ) and confidence intervals using a likelihood-based approach based on Equation 12 of Hudson (1990) as implemented by Wright et al. (2003). For each locus, this approach maximizes the likelihood recursion equation (1) where (2) and (3) over a range of θ values, where n is the sample size and S is the observed number of silent polymorphisms (again including multiple hits). Assuming that each genomic region investigated is evolutionarily independent, likelihoods can be combined to produce a joint likelihood surface for all loci, the maximum of which is . Confidence limits for are estimated using the standard χ2 approximation. This approach assumes no recombination within loci, but independence among loci. Simulations confirm that the bias of this estimator is not large and that confidence limits for estimated by this method are wider than those in the presence of intralocus recombination (S. Wright, personal communication).
Estimating Nˆθ from requires knowledge of the mutation rate for silent sites. By comparing estimates of Nˆθ, or estimating Nˆθ from a joint estimate of , we implicitly assume that the mutation rate at silent sites is constant among loci. We assume that μ= 1.5 × 10-8/silent site/year and that D. melanogaster populations undergo an average of 10 generations/year, yielding a site/generation. Our estimate of the mutation rate depends on many assumptions (reviewed in Andolfatto and Przeworski 2000), although several estimates based on independent approaches are in close agreement with the above figure. Further, the true mutation rate is unlikely to be >3 × 10-9/site/generation (Andolfatto and Przeworski 2000; McVean and Vieira 2001). We discuss the sensitivity of our results to assumptions about the mutation rate. Where we compare X-linked and autosomal loci, it is worth noting that we have assumed that the ratio of their effective sizes is as expected under a neutral model with equal effective population sizes for males and females. In fact, the appropriate scaling of the X and autosomes is uncertain given that natural populations experience factors that alter the relative variance in reproductive success for males and females (Charlesworth 2001). For most analyses, we consider X-linked and autosomal loci separately, so this is not an issue.
Linkage disequilibrium analysis: The goal of this analysis is to estimate Ne from levels of LD. We choose to summarize LD by an estimate of the population recombination rate ρ= 4Ner, where Ne is the effective population size, and r is the sex-averaged rate of recombination (cf. Andolfatto and Przeworski 2000; Wallet al. 2002). For these analyses, we use all biallelic mutations, including both single-nucleotide polymorphisms (SNPs) and simple insertion-deletion mutations. All multiply hit sites or overlapping mutational events are excluded.
We first employ an estimator, , proposed by Wall (2000). To estimate , we summarize the data using the observed number of distinct haplotypes (H) and the minimum number of inferred recombination events (RM, cf. Hudson and Kaplan 1985) and estimate via simulation the value of ρ that maximizes the likelihood of obtaining the observed values of H and RM (cf. Wall 2000). The simulations use a generalization of the methodology of Hudson (1993), which allows noncontiguous but linked regions to be analyzed. We estimate rˆ for each locus on the basis of comparisons of physical and genetic maps as described in Charlesworth (1996) and Andolfatto and Przeworski (2001). We assume that r is known exactly and estimate Nˆρ ( ) over increments of 3.3 × 105 or smaller, using a minimum of 2 × 105 replicates for each parameter value for each locus (cf. Wallet al. 2002). Nˆρ is estimated for each locus separately and also jointly for loci with rˆ > 1 cM/Mb, assuming that each locus is evolutionarily independent (cf. Wallet al. 2002). As for diversity-based estimates, we multiply Nˆρ for loci on the X chromosome by 4/3 to make them comparable to estimates for autosomal loci. Approximate 95% confidence intervals for joint estimates are found by making the standard asymptotic maximum-likelihood assumptions. We caution that there is no evidence that these assumptions are appropriate in this context, but the intervals serve as a useful diagnostic.
Several other summary estimators of ρ have also been proposed (e.g., Hudson 1987, 2001; Hey and Wakeley 1997). We have chosen the estimator proposed by Hudson (2001), , which is a composite-likelihood estimator of ρ based on pairwise LD among all pairs of polymorphic mutations in a sample. This estimator performs similarly to and is substantially better than other estimators (see Wall 2000 and Hudson 2001 for details). For one of our X-linked data sets, frag4, the estimate of is extremely large (i.e., >10,000). We exclude this locus from correlation tests and thus tests involving use only 23 of 24 loci.
A considerable fraction of polymorphisms segregating among autosomal genes (43/279) are amino acid replacement substitutions. If these are under weak purifying selection (Andolfatto 2001), they may segregate at lower frequencies on average, potentially affecting estimates of ρ. We estimated heterozygosities for each site [ , where n is the sample size and p is the polymorphic site frequency] and found that the mean heterozygosities of replacement (mean = 0.29) and silent (mean = 0.32) polymorphisms are not significantly different (P = 0.25, Student’s t-test with unequal variances; P = 0.27, two-tailed Mann-Whitney U-test). We have thus chosen to include replacement polymorphisms; excluding them would considerably reduce the number of polymorphisms for some loci, possibly resulting in an upward bias in estimates of ρ (see results). The fraction of replacement polymorphisms among X-linked loci was considerably smaller (9/884) and thus their effect on estimates of ρ, if any, should be negligible. Note that part of the difference in the number of amino acid replacement polymorphisms detected on X-linked vs. autosomal genes reflects the fact that about two times more coding DNA was surveyed for the latter. The remainder of the difference probably reflects differences in the efficacy of selection against amino acid polymorphisms on the two chromosomes (see Andolfatto 2001).
Assessing the behavior of estimators under neutrality and alternative models: To test the properties of the two estimators of ρ under the standard neutral model, we run simulations with sample sizes (n) of 7, 13, or 50, with ρ= 3θ. We consider a range of θ values and calculate the average and median values of divided by the actual value of ρ. A total of 104 replicates were run for (and 2 × 103 replicates for ) under each parameter combination. Results of similar simulations for n = 50 for ρ= 4θ and ρ=θ are reported by Hudson (2001). We expect, on the basis of estimates of the neutral mutation rate and rates of crossing over, that ρ/θ will vary considerably across the genome, with ρ< 3θ for yellow, su(s), and su(wa) and ρ> 3θ for most loci in regions of high rates of crossing over. In addition to the simulations above, we have also investigated the distribution of for parameters (n and ) that best match those expected for the yellow locus under ρ ≈ θ/4 and ρ ≈ 3θ. These results have been posted at the site http://helios.bto.ed.ac.uk/evolgen/andolfatto/zimbabweLD.
Directional selection at linked sites is known to shape patterns of genetic variation (Hill and Robertson 1966) and is thought to explain the observed positive correlation between levels of diversity and the estimated crossing-over rate (Begun and Aquadro 1992; Aquadroet al. 1994). Both linkage to positively selected alleles (i.e., hitchhiking or selective sweeps, cf. Maynard Smith and Haigh 1974) or deleterious alleles (i.e., background selection, cf. Charlesworthet al. 1993) have been advanced as possible explanations.
Background selection due to strongly deleterious mutations can be modeled as a simple reduction in effective population size (Hudson and Kaplan 1994). This assumption is equivalent to varying θ (and ρ) in simulations of the standard neutral model. However, under background selection models that posit more weakly deleterious mutations, this assumption may not be valid (Charlesworthet al. 1995; Gordoet al. 2002). This situation is beyond the scope of our study and will be a topic of future work.
We do examine the effects of recurrent hitchhiking due to advantageous mutations by running simulations under a model of recurrent, nonoverlapping selective sweeps (Przeworski 2002). The program used was kindly provided by M. Przeworski. We consider a locus of fixed size (n = 50, number of sites = 2000, θ= 15, ρ= 45) and determine how the average and median values of Nˆθ and Nˆρ, as estimated assuming the standard neutral model, change as a function of the rate of selective sweeps (Λ in Przeworski 2002). We fix the selection coefficient s = 0.002 and the species population size Ne = 4 × 106. The choice of other parameter values yielded similar results (results not shown).
Incorporating the effects of gene conversion: Estimates of Nˆρ based on rates of crossing over (rˆ) may be systematically underestimated because they ignore gene conversion. Gene conversion was implemented into the standard coalescent with recombination (cf. Hudson 1990; Wiuf and Hein 2000; Przeworski and Wall 2001). Gene conversion is defined here as exchange of short tracts of DNA between homologous chromosomes with no associated crossing over between flanking markers. Thus, in this implementation, gene conversion and crossing over are modeled as mechanistically independent, and gene conversion exchanges associated with crossing over are ignored. For all models, we assume the rate of crossing over is rˆ and that the distribution of gene conversion tract lengths is geometric with a mean of 352 bp (Hillikeret al. 1994).
We consider three models of how gene conversion (GC) and crossing-over (CO) rates may be associated. In the first model (GC1), we assume that two-thirds of recombination events are GC, and thus the rate of GC, rˆGC = 2rˆ between adjacent sites, where rˆ is the estimated local rate of crossing over for a particular gene region. In a second model (GC2), we assume that the rate of GC is constant across a chromosome; rˆGC = 2rˆMAX, where rˆMAX is the highest estimated rate of crossing over on the chromosome (rˆMAX = 2.27 × 10-8/generation between adjacent sites on the X chromosome). In a third model (GC3), we assume that the total rate of recombination between adjacent sites is constant across a chromosome and that the rate of gene conversion between adjacent sites is rˆGC = (rˆMAX - rˆ), where rˆ is the estimated rate of crossing over for the locus under study. GC1 and GC2 are equivalent for genomic regions with the highest rates of crossing over and diverge in regions of reduced crossing over. Likewise, GC3 is equivalent to a model with no GC for regions with the highest rates of crossing over. These models are meant to be illustrative rather than quantitative and other possible models of recombination are addressed in the discussion. Nˆρ is estimated as above (from ) from simulations with gene conversion to estimate the likelihoods.
How do gene conversion rates in our models compare to data on gene conversion from Drosophila? On the basis of the data from the genes maroon-like and rosy, it is thought that the rate of gene conversion (γ, see Andolfatto and Nordborg 1998) is on the order of 10-5 events/generation and that 50% or more of recombination events are gene conversions without an associated exchange of flanking markers (Finnerty 1976; Hilliker and Chovnick 1981). Under GC1 and GC2, the sex-averaged GC rate is 1.5 × 10-5/generation in regions of the genome with the highest rates of crossing over, which is similar to the rate that has been measured at rosy and maroon-like (∼10-5, Finnerty 1976; Hilliker and Chovnick 1981). The ratio of GC:CO has been estimated to be ∼1 at maroon-like (Finnerty 1976) and ∼4 for rosy (Hilliker and Chovnick 1981). In GC1 and GC3 models, the ratio of GC to CO rates at the rosy locus (rˆ = 0.7 cM/Mb) would be ∼2 and corresponds to a GC rate of 6 × 10-6/generation. In the GC2 model, the GC:CO ratio at rosy would be ∼6. For maroon-like (rˆ = 1.36 cM/Mb), the GC:CO ratio = 2 under GC1 (rate ∼10-5/generation) and would be ∼0.7 under GC3 (rate ∼3 × 10-6/generation). In the GC2 model, the GC:CO ratio at maroon-like would be ∼3. Thus, our assumed rates of gene conversion, and the extent of association between gene conversion and crossing over, are roughly in agreement with the limited data on rates of gene conversion in Drosophila.
Performance of ρ estimators and choice of data: With a choice of two similarly performing ρ estimators, (Wall 2000) and , (Hudson 2001), we first set out to investigate how these estimators perform under the standard neutral model in the context of parameters relevant to the available data (e.g., sample size, the number of segregating sites, etc.) In Table 1, we summarize the performance of the two estimators as a function of the number of alleles sampled (n) and the amount of variation (θ, proportional to the number of segregating sites). Both estimators show considerable upward bias when the sample size is small and/or θ is small; however, the median of is generally biased downward under these conditions. Overall, appears to be more biased than ; this is particularly pronounced when the number of segregating sites is small (even for large sample sizes). On the basis of these simulations, we conclude that should be used cautiously for 7 ≤ n ≤ 13 unless θ is in the vicinity of 10 or larger and probably not at all for samples of <7 alleles. Since seems to be particularly sensitive to small θ, it should be used cautiously when there are <13 alleles unless θ≥ 10 and probably not at all if θ< 5.
We have used these results as a guide to select data for our study (see methods) and accordingly we have limited the data sets considered here to those that have n ≥ 7 and S > 10 (appendix a). Only 4 of our selected data sets have n = 7 and for each of these ( is calculated from Stot; see appendix a). The 10 data sets that have 7 < n ≤ 13 have . Possible effects of biases of the estimators on our results are noted in the analyses and discussion that follow.
Figure 1 compares estimates of and for all loci fitting the above restrictions. and are strongly positively correlated (R = 0.78, P < 0.001, n = 33, Spearman rank correlation test, two-tailed). However, are also systematically larger than , as might be expected given the results in Table 1. appendix b (http://helios.bto.ed.ac.uk/evolgen/andolfatto/zimbabweLD) summarizes correlations among n, rˆ, (per locus), and estimates of (per locus) for the X chromosome data.
Differences in sample size and the number of segregating sites are not the only possible causes of the systematic difference between estimators. Regions of reduced recombination in D. melanogaster have lower levels of variability due to hitchhiking and/or background selection (Aquadroet al. 1994; Charlesworth 1996). If background selection can be approximated as a simple reduction in effective population size, the expected behavior of ρ estimators will be similar to that under the neutral model (cf. Table 1). However, the observation that these regions also have more low-frequency polymorphisms suggests that recurrent selective sweeps may also influence polymorphism patterns in these regions (Andolfatto and Przeworski 2001).
Since one of our objectives is to compare Nˆρ and Nˆθ estimates across a gradient of crossing-over rates, we investigated the properties of the two ρ estimators under a recurrent hitchhiking model (Table 2). More frequent hitchhiking is expected to decrease levels of variability at closely linked sites (Maynard Smith and Haigh 1974; Kaplanet al. 1989) and thus decrease Nˆθ. As can be seen in Table 2, the two ρ estimators respond differently to increasing rates of hitchhiking. is strongly positively correlated with and generally yields Nˆρ that are slightly smaller than Nˆθ. In contrast, the median is only weakly correlated with estimates of , and Nˆρ from are typically much larger than estimates of Nˆθ; this difference is larger as the rate of selective sweeps increases.
Further investigation of the behavior of these ρ estimators under alternative models is beyond the scope of this study. In the analyses that follow, we focus on for several reasons. First, allows us to easily combine information from multiple loci in a likelihood framework. Second, is not severely biased upward under parameters that best match our available data (i.e., most are small samples with few segregating sites); in fact, the median estimate is often below the true ρ. Finally, using is not likely to result in Nˆρ > Nˆθ in regions of reduced crossing over if recurrent hitchhiking is reducing levels of variability in these regions.
Comparing levels of diversity and linkage disequilibrium: In a neutral panmictic population, we expect that Nˆρ ≈ Nˆθ. Table 3 lists estimates of Nˆ for each data set and these are plotted in cytological order in Figure 2. With the exception of the tip of the X chromosome (the first 10 points cover cytological bands 1B1 to 3B3 on the X chromosome), values of Nˆρ based on are slightly lower than the values of Nˆθ(13/14 on the X have Nˆρ < Nˆθ). For Nˆρ based on , the pattern is less striking, particularly for the X (7/14 have Nˆρ < Nˆθ; data not shown); however, the pattern is equally strong for the two estimators on the autosomes (9/9 loci and 8/9 loci have Nˆρ < Nˆθ, based on and , respectively). It is difficult to draw solid conclusions from this type of analysis because different sample sizes and numbers of segregating sites for each locus (and different averages for these loci on the X and autosomes) complicate the interpretation of comparisons.
Further, due to the inherent stochasticity of the evolutionary process, individual estimates of Nˆθ and Nˆρ are not particularly accurate. We can achieve greater precision by estimating the relative likelihoods for different Nˆ values using multiple loci jointly. Figure 3a shows the joint-likelihood curve of Nˆρ (based on ) for the X-linked and autosomal loci with rˆ > 1 cM/Mb (i.e., loci in areas of high recombination that are least likely to be affected by selection at linked sites). The maximum-likelihood estimate of Nˆρ for the X chromosome is NˆρX = 1.7 million, with an ∼95% confidence interval of 1.1-2.3 million. The corresponding estimate of NˆθX is 4.0 million (based on the ), with a 95% confidence interval of 3.1-5.1 million. For the autosomes, NˆρA = 0.4 million, with an ∼95% confidence interval of 0.2-0.9 million. The corresponding estimate of NˆθA is 1.8 million with a 95% confidence interval of 1.3-2.9 million.
Several interesting observations emerge from these results. First, the estimate of Nˆθ for the X chromosome is larger than that for the autosomes. This trend has been noted before on the basis of levels of nucleotide (Andolfatto 2001) and microsatellite variability (Kaueret al. 2002). On the basis of the confidence intervals for , we can reject the standard model with equal population sizes for males and females (i.e., an X:autosome ratio of 3/4). Several factors may contribute to this pattern, including a large variance in reproductive success for males (i.e., sexual selection), background selection, and autosome-specific inversion polymorphisms (Charlesworth 1996, 2001; Andolfatto 2001; Kaueret al. 2002). Interestingly, the difference in effective sizes is even more pronounced for estimates of Ne based on linkage disequilibrium; NˆθX/NˆθA = 2.2 whereas NˆρX/ NˆρA = 4.1. More LD on the autosomes (reflected in the higher NˆρX/NˆρA ratio) is not predicted under either background selection or sexual selection models and therefore points to possible effects of inversion polymorphisms. Inversions are both much more common and at higher frequencies on the autosomes of D. melanogaster (Lemeunier and Aulard 1992). Thus inversion polymorphisms may be reducing variability on the autosomes by increasing the effects of linked selection (Andolfatto 2001). In addition, the associated reduction in the effective rate of recombination (crossing over is suppressed in inversion heterozygotes) and rapid changes in inversion frequency (Andolfattoet al. 1999) can both increase levels of LD on the autosomes relative to levels of diversity. Additional data from autosomal genes may allow us to distinguish among alternative hypotheses.
Interestingly, joint estimates of Nˆρ and Nˆθ are significantly different from each other on both the X and the autosomes, given our assumed mutation rate . This suggests that Zimbabwe populations, despite our a priori prediction, show an excess of LD relative to expectations under the neutral model. However, while confidence intervals for Nˆθ are overestimated (since we have assumed no intragenic recombination), a trivial explanation for the discrepancy between Nˆθ and Nˆρ in Zimbabwean D. melanogaster might be error in the estimated mutation rate (e.g., Nˆθ would be half as large if we assume the true mutation rate is twofold higher; see methods). Given that the mutation rate is uncertain, the apparent LD excess in this population based on estimated rates of crossing over is not entirely convincing.
Joint-likelihood estimates of Nˆρ and Nˆθ for 15 X-linked and 14 autosomal loci from the California population of D. simulans are shown for comparison (Figure 3b). It is interesting to note in Figure 3b that ratios of Nˆρ/ Nˆθ for both the X and the autosomes are much larger for the Zimbabwean population of D. melanogaster (0.4 and 0.2, respectively) than for the California population of D. simulans (0.02 and 0.07, respectively; cf. Wallet al. 2002). While uncertainty in the mutation rate (∼2-fold) may explain the discrepancy between Nˆρ and Nˆθ in Zimbabwean D. melanogaster, it is unlikely to explain the >10-fold difference between these estimates of N in the D. simulans population in Figure 3b. This suggests that something is fundamentally different in the evolutionary history of these two populations or species. In particular, the Zimbabwe population of D. melanogaster may be closer to mutation-drift equilibrium, while the California population of D. simulans clearly is not.
Behavior of estimates of ρ as a function of r: A positive correlation between and rˆ is expected for two reasons: first, (an estimate of 4Ner) should increase proportionally with rˆ; second, the gradient in rˆ strongly positively covaries with empirical estimates of diversity ( ) in Drosophila (Begun and Aquadro 1992; Aquadroet al. 1994; Andolfatto and Przeworski 2001) and, hence, with the apparent effective population size of the genomic region (Nˆθ, which under the standard neutral model is inversely proportional to the expected linkage disequilibrium for a given r). Here, we focus on data from the X chromosome because we lack data for autosomal loci in regions of reduced crossing over. For the 24 gene regions on the X chromosome, (per site) is positively correlated with rˆ in our data set [Figure 4a; , R = 0.44, P < 0.05, 24 loci; , R = 0.43, P < 0.05, 23 loci (see methods); Spearman rank correlation test, two-tailed]. As expected on the basis of previous results, and Nˆθ for the 24 X-linked loci considered here also show a strong positive correlation with rˆ (Figure 4b; based on , R = 0.62, P = 0.001, 24 loci; as above). However, a striking aspect of the results presented in Figure 2 is that estimates of Nˆρ are rather large near the telomere of the X chromosome (i.e., loci to the far left of the graph), despite the fact that the crossing-over rate and levels of diversity are considerably reduced relative to other regions on the X (appendix a and Table 3). Under the standard neutral model, a measure that should be independent of the effective population size is ρ/θ (= 4Ner/4Neμ), which we estimate as (Hudson 1987; Andolfatto and Przeworski 2000). Since we assume that the mutation rate is constant across loci, we expect estimates of to positively covary with rˆ. We detect no such correlation (Figure 4c; , 24 loci, P > 0.05; , R = 0.13, 23 loci, P > 0.05; as above).
The insensitivity of to changes in rˆ (Figure 4c) suggests that the correlation between and rˆ (Figure 4a) is driven by the relationship between and rˆ (Figure 4b). In fact, Nˆρ and rˆ are significantly negatively correlated [Figure 4d; Nˆρ ( ), R =-0.51, 24 loci, P = 0.01; Nˆρ ( ), R =-0.27, 23 loci, P > 0.05; as above], the opposite trend to that expected on the basis of the relationship between Nˆθ and rˆ. In other words, there is less diversity in regions of low crossing over (Figure 4b), but also less LD than expected given levels of diversity and estimated rates of crossing over. This can also be seen in Figure 4e, which plots Nˆρ/Nˆθ (estimated from and π) as a function of rˆ. Under the standard neutral model, Nρ/Nθ should be ≈1 and independent of r. While this prediction is roughly true for the 11 X-linked loci with rˆ > 1 cM/Mb (Nˆρ/Nˆθ ≤ 1), loci with rˆ < 0.5 cM/Mb show considerably higher values of Nˆρ/Nˆθ. Overall, there is a strong negative correlation between Nˆρ/Nˆθ and rˆ (Figure 4e; R =-0.79, 24 loci, P < 0.001; as above), implying that regions of reduced crossing over have considerably less LD than predicted by rates of crossing over under the standard neutral model.
Could a bias in estimates explain the observed correlations? In simulations of the neutral model (Table 1), we noted a slight upward bias in mean estimates with decreasing θ. However, our correlation result (which employs a nonparametric test) depends more on median estimates of , which were typically lower than the actual ρ in these simulations (see Table 1). Additional simulations with parameters appropriate for the yellow locus (n = 49, S = 18, ρ ≈ θ/4) confirmed that the median is downward biased (see methods). Thus, bias in is not an explanation for the negative correlation between Nˆρ/Nˆθ and rˆ.
Gene conversion and LD patterns in regions of high crossing over: Our estimates of recombination rate are based on large-scale comparisons of genetic and physical maps that ignore the contribution of gene conversion to the total rate of recombination. At small physical scales (i.e., <1 kb in Drosophila), the contribution of gene conversion to the overall rate of recombination might be substantial (Andolfatto and Nordborg 1998). The precise relationship between crossing over and gene conversion is not well understood in Drosophila, but the consensus view is that they are associated processes that result from Holliday junction formation and resolution during meiotic recombination (Carpenter 1984).
Given that gene conversion is a potentially important contributor to the overall rate of recombination, we reexamine the relationship between levels of LD relative to levels of variability (measured as joint Nˆ for loci with rˆ > 1 cM/Mb), assuming a model that includes both crossing over and gene conversion (e.g., GC1; see methods). In general, we find that, if gene conversion contributes substantially to the overall rate of recombination in regions of high crossing over, there is clear evidence for an excess of LD given levels of diversity in the Zimbabwe data. As an illustration, we present results for gene conversion model GC1 in Figure 5 (this model assumes that two-thirds of recombination events are gene conversions without associated crossing over; see methods). Under this model, NˆρX = 0.33 million for the X chromosome (∼95% confidence interval of 0.24-0.50) and NˆρA = 0.15 million for the autosomes (∼95% confidence interval of 0.04-0.20). Since Nˆρ and Nˆθ differ by more than an order of magnitude under this model, this discrepancy is not likely to be explained by uncertainty in the mutation rate (about twofold, see methods). Thus, our results suggest either that the level of genetic exchange due to gene conversion is close to negligible in regions of high crossing over or that this Zimbabwean population departs substantially from the predictions of mutation-drift equilibrium in the direction of a excess of LD. Note that estimates of Nˆρ for D. simulans (see Figure 3b) did not include gene conversion, so the claim that the Zimbabwean population is closer to a population that is at mutation-drift equilibrium is still a valid one.
LD, population history, and gene conversion: Previous studies have found Nˆρ ⪡ Nˆθ for many genes in both D. melanogaster and D. simulans, which has been interpreted as a genome-wide excess of LD in these two species (Andolfatto and Przeworski 2000; Wallet al. 2002). Here, we find that Nˆρ (based on rates of crossing over) are much closer to Nˆθ estimates in Zimbabwean populations of D. melanogaster (Figures 2 and 3), suggesting that the latter populations may be closer to expectations under mutation-drift equilibrium (with respect to LD patterns). The marked contrast in LD patterns observed between this and earlier studies (e.g., Figure 3b) indicates that the differences between Nˆρ and Nˆθ estimates in non-African populations of D. melanogaster and D. simulans are not due to systematic errors in the methodology or estimates of r and μ. Rather, these findings point to differences in the history of the populations (or species) studied. In particular, Zimbabwean populations of D. melanogaster may have a more stable demographic history than non-African populations, which has led to large differences between these populations in patterns of LD. This may be expected since Zimbabwean D. melanogaster populations are closer to the species’ ancestral range (Lachaiseet al. 1988).
Our results may also shed some light on the importance of gene conversion (i.e., recombination without the exchange of flanking markers) relative to crossing over in Drosophila. Estimates of r based on the large-scale comparisons of genetic and physical maps essentially measure only the crossing-over rate; when there is gene conversion, they may considerably underestimate the actual rate of genetic exchange (Andolfatto and Nordborg 1998). In contrast to Drosophila, previous studies of human sequence polymorphism data have found that Nˆρ (based on rates of crossing over) were generally much larger than Nˆθ (Frisseet al. 2001; Przeworski and Wall 2001), which was interpreted as evidence that gene conversion may be having a significant impact on LD patterns in humans. If the same situation were true in D. melanogaster, we also might expect to generally observe Nˆρ ⪢ Nˆθ.
In fact, estimates of Nˆρ based on rates of crossing over are lower than Nˆθ in Zimbabwean D. melanogaster populations, suggesting that, if these populations are near a mutation-drift equilibrium, levels of exchange due to gene conversion cannot be high in regions of high crossing over. Alternatively, if gene conversion is an important force in regions of high crossing over, we must conclude that these populations depart from equilibrium model expectations. For example, assuming that gene conversion contributes as much to intragenic recombination as crossing over implies that there is actually a marked excess of LD in these populations (Figure 5). Distinguishing between these alternatives is not possible without better estimates of gene conversion parameters in regions of high crossing over (see below).
Uncertainty about the relative importance of gene conversion does not affect our conclusion that population history is an important determinant of patterns of LD in Drosophila. Earlier studies, primarily on non-African Drosophila populations, have ignored the possible impact of gene conversion on LD patterns; including gene conversion would make the observed departures from the standard neutral model even more extreme (Andolfatto and Przeworski 2000; Wallet al. 2002). Thus, even if Zimbabwean populations of D. melanogaster also depart from the predictions of mutation-drift equilibrium, the discrepancy is smaller than that in previous studies, which focused on non-African populations (see Figure 3b).
Patterns of LD across a recombination gradient: Estimates of the local effective population size based on patterns of diversity (Nˆθ) are strongly positively correlated with estimated rates of crossing over in Drosophila (Begun and Aquadro 1992; Aquadroet al. 1994; Andolfatto and Przeworski 2001), as expected under background selection and/or recurrent selective sweep models. It is less clear why we observe a negative correlation between Nˆρ and rˆ (Figure 4d) or why Nˆρ/Nˆθ are strongly negatively correlated with rˆ (Figure 4e).
Could either background selection or recurrent hitchhiking explain the patterns observed in Figure 4, d and e? If background selection due to strongly deleterious mutations can be reasonably well approximated by a reduction in effective population size (Hudson and Kaplan 1994), it can be modeled using the standard neutral model by invoking a smaller local effective population size (i.e., smaller ρ and θ, but with ρ/θ remaining the same) with decreasing r. Thus, we expect Nˆθ and Nˆρ to be reduced on average by roughly the same amount in regions of lower crossing over. Contrary to this expectation, the ratio Nˆρ/Nˆθ increases as rˆ decreases (Figure 4e). Since the background selection model predicts a strongly positive correlation between Nˆρ and rˆ (and since the median of is downward biased with decreasing θ), we can safely reject this model as an explanation for the patterns observed in Figure 4, d and e.
If mutations are only slightly deleterious, background selection may not be well approximated by a simple reduction in effective population size (e.g., Charlesworthet al. 1995; McVean and Charlesworth 2000; Tachida 2000; Comeron and Kreitman 2002; Gordoet al. 2002). Under such a model, deleterious alleles will be in repulsion disequilibrium (Hill and Robertson 1966), as may be linked neutral variation (Comeron and Kreitman 2002). However, it is not clear how such a model would affect our estimates of . In particular, the resulting negative skew in the frequency spectrum of linked neutral variants (Tachida 2000; Comeron and Kreitman 2002; Gordoet al. 2002) leads to an increase in the number of haplotypes (H) per polymorphic mutation, but the minimum number of recombination events (RM) will decrease. Thus, the combined effect on is difficult to predict. Likewise, the performance of under departures from the neutral model has also not been investigated.
Simple recurrent selective sweep models are also known to skew the frequency spectrum of segregating polymorphisms toward an excess of rare variants (e.g., Bravermanet al. 1995). A significant skew toward rare polymorphisms in regions of reduced crossing over in this Zimbabwean D. melanogaster population suggests that some form of hitchhiking may be operating (Andolfatto and Przeworski 2001). We have performed a set of preliminary simulations to examine the effects of recurrent selective sweeps on estimators of ρ (Table 2). As the rate of selective sweeps increases, the level of polymorphism decreases, yielding smaller Nˆθ estimates on average. However, selective sweeps also affect levels of LD, and the average Nˆρ value (as estimated by ) is reduced by roughly the same amount as the average Nˆθ value (Table 2). Note that this does not mean that selective sweeps act as a simple reduction in the effective population size, but merely that they tend to decrease estimates by roughly as much as they decrease θ estimates (though we also note that responds differently; see results). In summary, it is clear from our simulations (Table 2 and supplementary information available at http://helios.bto.ed.ac.uk/evolgen/andolfatto/zimbabweLD) that a recurrent selective sweep model cannot explain the strong negative correlation between Nˆρ/Nˆθ and rˆ (Figure 4e).
Evidence for heterogeneity in gene conversion rates across the genome: As discussed earlier, our rˆ estimates do not reflect the true rate of genetic exchange when there is gene conversion. In particular, if the relative level of exchange due to gene conversion were higher in regions of reduced crossing over, then Nˆρ/Nˆθ would be negatively correlated with rˆ, as observed in Figure 4e. Langley and colleagues have suggested that such a model may account for the rapid rate of decay of pairwise LD with physical distance in regions of reduced crossing over at the tip of the X chromosome (Langleyet al. 2000). A similarly rapid decay in LD is observed on the fourth chromosome of D. melanogaster (Jensenet al. 2002). Since this chromosome lacks crossing over under normal conditions, high levels of gene conversion may also account for lower than expected levels of LD.
Data on gene conversion in Drosophila are sparse. To date, perhaps the best estimation of gene conversion rates in D. melanogaster is the detailed study of the rosy locus (e.g., Hilliker and Chovnick 1981; Hillikeret al. 1994). The authors estimate that up to 80% of meiotic recombination events may be resolved as gene conversion events at rosy, but it is not clear how general these results are. For example, data from the maroon-like locus suggest that gene conversions may account for ∼50% of recombination events (Finnerty 1976). Since rosy and maroon-like are both in regions of intermediate levels of crossing over (rosy, rˆ = 0.7 cM/Mb; maroon-like, rˆ = 1.36 cM/Mb), this degree of association may not generalize to regions of the genome with higher rates of crossing over. This said, the only measurement of gene conversion in a region of high crossing over, though not easily interpreted, suggests that the contribution of gene conversion is not negligible (rudimentary; see Finnerty 1976).
How might GC and CO covary across the genome? To answer this question, we investigated the effect of three possible models of gene conversion on estimates of Nˆρ across the X chromosome. In the first model (GC1), rates of gene conversion and crossing over positively covary. In the second model (GC2), gene conversion rates are constant across the X chromosome. In the third model (GC3), gene conversion and crossing over negatively covary (see methods for details). We measure the “fit” of a gene conversion model on the basis of two features of the data. The first is that, if the total recombination rate (the sum of GC + CO) is adequately accounted for at each gene, estimates of Nˆρ/ Nˆθ should be roughly equal in regions of high and low crossing over (i.e., the negative correlation between Nˆρ/ Nˆθ and rˆ should disappear). A second measure of fit is the absolute value of Nˆρ/Nˆθ. Assuming that this Zimbabwe population is near a mutation-drift equilibrium, we expect Nˆρ/Nˆθ to be ∼1.
In Table 4, we list estimates of Nˆρ/Nˆθ for yellow, su(s), su(wa), snf1a, Pgd, and vinculin (see appendix a for details) under these models and compare them to the average estimate in regions of the highest crossing over on the X chromosome (rˆ = 2.27 cM/Mb). As can be seen, while the model in which GC and CO positively covary (GC1) reduces Nˆρ/Nˆθ estimates in regions of reduced crossing over, the negative correlation between Nˆρ/Nˆθ and rˆ remains. Models GC2 and GC3 are better fits to the data because they yield estimates of Nˆρ/Nˆθ that are similar in regions of high and low crossing over. Overall, the model in which gene conversion and crossing-over rates negatively covary (GC3) appears to be the most reasonable explanation because it explains the correlation between Nˆρ/Nˆθ and rˆ without the need to invoke a large LD excess in this population. However, data on gene conversion from the rudimentary locus (reviewed above) suggest that our GC3 model may underestimate the impact of gene conversion on patterns of LD in regions of high crossing over. If this is the case, our results suggest a departure from equilibrium expectations in this population.
The above analysis is intended to be only illustrative as many other models of gene conversion are possible. For example, some form of heterozygosity-dependent gene conversion may be operating (Stephan and Langley 1992; Langleyet al. 2000). If so, regions with reduced crossing over, which tend to have less variability (see Figure 4b), may have higher rates of gene conversion and/or longer gene conversion tracts (see Langleyet al. 2000 for a discussion). Note that, in our models, we have assumed that the distribution of gene conversion tract lengths is the same in regions of high and low crossing over. A heterozygosity-dependent gene conversion model has not yet been explored enough to make quantitative predictions, but it has the potential to explain the negative Nˆρ/Nˆθ since it is qualitatively similar to our GC3 model. Further distinguishing between the models we have presented, and confidently inferring whether Zimbabwean D. melanogaster populations are near a mutation-drift equilibrium, will require a more detailed investigation of the gene conversion parameters in regions of high and low crossing over.
Implications for weak selection models and recombination-associated mutation biases: While gene conversion may explain lower than expected LD in regions of reduced crossing over, it does not pose a problem for the interpretation of the strongly positive correlation between Nˆθ and rˆ. Models of recurrent hitchhiking (Bravermanet al. 1995) or background selection (Charlesworth 1996) invoke relatively strong selection to account for this correlation (i.e., selection that will affect relatively large chromosomal segments). The effects of these selection models on levels of variability are primarily determined by the crossing over rate, since gene conversion contributes little to the total recombination rate at distances much greater than the average gene conversion tract length (Andolfatto and Nordborg 1998). However, gene conversion may be relevant to models that posit weaker selection, such as models invoking weak selection on synonymous codon usage or some fraction of amino acid replacement mutations (e.g., Li 1987; McVean and Charlesworth 2000; Tachida 2000; Comeron and Kreitman 2002). Weakly selected mutations are predicted to interfere with the efficacy of selection at closely linked sites, a phenomenon called weak Hill-Robertson interference (wHRi). Our results suggest that gene conversion may limit the magnitude of wHRi, particularly in regions of low crossing over, where wHRi effects are predicted to be the most important.
Finally, we note that a correlation between synonymous site G + C base composition and rates of crossing over has been documented in D. melanogaster (Kliman and Hey 1993). This pattern has been interpreted as evidence of reduced efficacy of weak selection for preferred (i.e., G or C ending) codons in regions of low crossing over due to Hill-Robertson interference in these regions. This view has recently been challenged on the basis of the observation that G + C content of noncoding DNA also appears to correlate with rates of crossing over (Maraiset al. 2001; Marais and Piganeau 2002), suggesting a recombination-mediated mutation bias [the biased gene conversion (BGC) hypothesis; Birdsell 2002; Marais 2003]. Invoking the BGC hypothesis to explain the correlation between G + C content and rates of crossing over presupposes that gene conversion and crossing over positively covary (similar to our model GC1). However, our results based on LD patterns suggest that gene conversion and crossing over are not associated in this way, at least at the tip of the X chromosome in Drosophila. The BGC hypothesis may not explain differences in base composition in regions of high vs. low crossing over if rates of gene conversion are constant across the genome or negatively covary with rates of crossing over.
It is worth noting that G + C content of noncoding DNA in telomeric regions of Drosophila appears to be higher than that in centromeric regions (Maraiset al. 2001; Marais and Piganeau 2002), despite both being regions of reduced crossing over (Charlesworth 1996). Thus, our results may be consistent with the BGC hypothesis if high rates of gene conversion occur in telomeric regions but not in other regions of reduced crossing over. However, noncoding regions on chromosome 4 of D. melanogaster (which lacks crossing over) have the lowest G + C content in the genome (Maraiset al. 2001; G. Marais, personal communication), yet polymorphism data from the fourth chromosome suggest considerable levels of meiotic exchange (Jensenet al. 2002). Assuming that all genetic exchange on chromosome 4 is due to gene conversion, we estimate to be ∼20 (on the basis of the data of Jensenet al. 2002 and assuming a gene conversion tract length of 352 bp), which is in close agreement with estimates at the tip of the X chromosome (see Figure 4c). This corresponds to a conversion rate on the order of 10-5 events/generation, which is remarkably similar to direct estimates at rosy and maroon-like. Thus, our results suggesting high rates of gene conversion in regions of reduced crossing over do not appear to be restricted to the tip of the X chromosome. A more detailed investigation of the physical scale of LD in other genomic regions may shed further light on the association of gene conversion with crossing over and its implications for weak selection models and the evolution of base composition.
We thank D. Bachtrog, B. Charlesworth, G. Marais, M. Przeworski, and S. Wright for helpful discussions and comments on this manuscript. We are grateful to M.-L. Wu, C.-I Wu, and L. Partridge and her lab for sending us flies. Special thanks go to M. Przeworski for the use of her recurrent hitchhiking program and to B. Ballard for making new African population collections available to the community. P.A. was supported by a Royal Society of Edinburgh/Scottish Executive Personal Research Fellowship. J.D.W. was supported in part by a National Science Foundation Postdoctoral Fellowship in Bioinformatics. This research was funded by the Biotechnology and Biological Sciences Research Council.
Communicating editor: W. Stephan
- Received March 10, 2003.
- Accepted July 25, 2003.
- Copyright © 2003 by the Genetics Society of America