Genetics, Vol. 165, 1289-1305, November 2003, Copyright © 2003

Linkage Disequilibrium Patterns Across a Recombination Gradient in African Drosophila melanogaster

Peter Andolfattoa,b and Jeffrey D. Wallc,d
a Department of Zoology, University of Toronto, Toronto, Ontario M5S 3G5, Canada,
b Institute of Cell, Animal and Population Biology, University of Edinburgh, Edinburgh EH9 3JT, United Kingdom,
c Program in Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089
d Department of Human Genetics, University of Chicago, Chicago, Illinois 60637

Corresponding author: Peter Andolfatto, Ramsay Wright Bldg., 25 Harbord St., University of Toronto, Toronto, Ontario M5S 3G5, Canada., pandolfatto{at}zoo.utoronto.ca (E-mail)

Communicating editor: W. STEPHAN


*  ABSTRACT
*TOP
*ABSTRACT
*METHODS
*RESULTS
*DISCUSSION
*LITERATURE CITED

Previous multilocus surveys of nucleotide polymorphism have documented a genome-wide excess of intralocus linkage disequilibrium (LD) in Drosophila melanogaster and D. simulans relative to expectations based on estimated mutation and recombination rates and observed levels of diversity. These studies examined patterns of variation from predominantly non-African populations that are thought to have recently expanded their ranges from central Africa. Here, we analyze polymorphism data from a Zimbabwean population of D. melanogaster, which is likely to be closer to the standard population model assumptions of a large population with constant size. Unlike previous studies, we find that levels of LD are roughly compatible with expectations based on estimated rates of crossing over. Further, a detailed examination of genes in different recombination environments suggests that markers near the telomere of the X chromosome show considerably less linkage disequilibrium than predicted by rates of crossing over, suggesting appreciable levels of exchange due to gene conversion. Assuming that these populations are near mutation-drift equilibrium, our results are most consistent with a model that posits heterogeneity in levels of exchange due to gene conversion across the X chromosome, with gene conversion being a minor determinant of LD levels in regions of high crossing over. Alternatively, if levels of exchange due to gene conversion are not negligible in regions of high crossing over, our results suggest a marked departure from mutation-drift equilibrium (i.e., toward an excess of LD) in this Zimbabwean population. Our results also have implications for the dynamics of weakly selected mutations in regions of reduced crossing over.


THE parameters {theta} (= 4Neµ), where Ne is the species effective population size and µ is the mutation rate per generation, and {rho} (= 4Ner), where r is the sex-averaged crossing-over rate per generation, influence the amount and the pattern of genomic sequence variability. Under a Wright-Fisher neutral model, hereafter the standard neutral model, these two parameters are both proportional to Ne. Ne cannot be estimated directly from polymorphism data; instead, it must be estimated indirectly using estimates of {theta} and µ or estimates of {rho} and r. Under the standard neutral model, both methods (i.e., using and or and ) should yield similar estimates of Ne (HUDSON 1987 Down).

Contrary to these expectations, recent studies of human and Drosophila data have revealed that the two methods for estimating Ne yield disparate results. In humans, the estimate of Ne from and , over short physical distances (i.e., within several kilobases), is substantially lower than the estimate from and . This suggests less intragenic linkage disequilibrium (LD) than expected under the standard neutral model, given estimated rates of crossing over (FRISSE et al. 2001 Down; PRZEWORSKI and WALL 2001 Down). In Drosophila melanogaster and D. simulans the opposite pattern is observed: this approach suggests that there is an excess of LD relative to the standard neutral model expectations (ANDOLFATTO and PRZEWORSKI 2000 Down; WALL et al. 2002 Down). Thus it appears that at least one of the assumptions of the standard neutral model is incorrect for both taxa. Possible explanations include more complex population histories (e.g., changes in effective population size over time, or geographic structure), the widespread effects of natural selection (at the loci studied, or at closely linked loci), or systematic errors in estimating µ, r, or {rho}. By determining how various population genetic models differentially affect levels of diversity and levels of LD, we might infer which types of models are most plausible for each species (WALL 2001 Down; WALL et al. 2002 Down).

The Drosophila data analyzed thus far have consisted predominantly of non-African samples. Both D. melanogaster and D. simulans are human commensals; they are thought to have originated in sub-Saharan Africa and only relatively recently colonized other locations (LACHAISE et al. 1988 Down). Non-African populations may have experienced a recent contraction in population size or a recent burst of selection associated with adaptation to new environments (e.g., AQUADRO et al. 1994 Down; HAMBLIN and VEUILLE 1999 Down; ANDOLFATTO 2001 Down; KAUER et al. 2002 Down; WALL et al. 2002 Down). Both of these scenarios are expected to reduce levels of variability in non-African populations relative to African ones, as observed (BEGUN and AQUADRO 1993 Down; ANDOLFATTO 2001 Down; KAUER et al. 2002 Down); in addition, either one may have caused an increase in levels of intragenic LD and might explain the observed discrepancy between different estimates of Ne.

Interestingly, patterns of variation in human populations share some similarities with Drosophila (AQUADRO et al. 2001 Down). Non-African populations have less variability than sub-Saharan African ones (e.g., FRISSE et al. 2001 Down; STEPHENS et al. 2001 Down), and a similar demographic model has been proposed for humans (e.g., TISHKOFF et al. 1996 Down): an African origin, a recent population bottleneck in non-African populations, and recent population growth. Unlike the comparable Drosophila studies, human surveys of genetic variation have usually contained a substantial sample from sub-Saharan African populations. It is tempting to speculate that this difference in sampling schemes may explain part of the difference in patterns of LD in the human and Drosophila data. Further suggestive evidence of the sensitivity to sampling scheme comes from studies of different human populations, which find substantially higher levels of LD in non-African populations compared with African ones (FRISSE et al. 2001 Down).

In addition to the effects of population history on patterns of LD, the nature of the meiotic recombination mechanism itself is an issue. In current models of meiotic recombination (e.g., SZOSTAK et al. 1983 Down), all recombination events involve some exchange of segments of DNA between homologous chromosomes. Such exchanges explain the phenomenon of gene conversion (the non-reciprocal transfer of an allele from one homolog to another). Whether or not these exchanges are accompanied by the reciprocal exchange of flanking markers (i.e., crossing over) reflects alternative outcomes of Holliday junction resolution. For clarity, we refer to recombination events without the exchange of flanking markers as "gene conversion," and "crossing over" refers to recombination events with the exchange of flanking markers. Estimates of the recombination rate based on the large-scale comparison of genetic and physical maps ignore gene conversion, which contributes little to the rate of exchange between distant markers. However, at smaller physical scales (i.e., several kilobases), depending on parameters, gene conversion may contribute substantially to the total rate of genetic exchange (ANDOLFATTO and NORDBORG 1998 Down). Thus, the interpretation of patterns of LD in the context of demographic models depends to some extent on gene conversion parameters.

Frequent gene conversion in humans may account for inflated estimates of {rho} (over small physical scales) relative to large-scale map-based estimates of r in African populations (FRISSE et al. 2001 Down; PRZEWORSKI and WALL 2001 Down). However, incorporating gene conversion would make the observed LD excess in Drosophila even more unusual (ANDOLFATTO and PRZEWORSKI 2000 Down; WALL et al. 2002 Down). Since gene conversion and demography both affect patterns of LD, estimates of gene conversion parameters from nucleotide polymorphism data may be extremely inaccurate since the true population history is unknown.

In an attempt to distinguish between possible explanations for the discordant estimates of Ne in Drosophila, we analyze sequence polymorphism data at multiple loci in Zimbabwean population samples of D. melanogaster. This population is more likely than non-African populations to be closer to equilibrium since it is nearer to D. melanogaster's ancestral range and may not have experienced drastic changes in size or ecology in its recent history.

The physical scale of LD should increase as the rate of recombination decreases. Interestingly, a recent survey of two gene regions at the tip of the X chromosome of D. melanogaster, where rates of meiotic crossing over are greatly reduced, suggested that the physical scale of LD is not larger than that in other regions of the genome (LANGLEY et al. 2000 Down). JENSEN et al. 2002 Down observed a similar pattern on chromosome 4 of D. melanogaster and D. simulans, which usually lacks crossing over during female meiosis (HAWLEY et al. 1993 Down). These patterns were interpreted as evidence for high levels of gene conversion in these regions of reduced crossing over. Here we explore these claims by quantifying levels of LD at many loci over a broad range of crossing-over rates on the X chromosome. Using these data, we investigate several alternative models of how gene conversion and crossing over may be associated.


*  METHODS
*TOP
*ABSTRACT
*METHODS
*RESULTS
*DISCUSSION
*LITERATURE CITED

Previously published data:
We consider all sequence polymorphism studies that include a sample size (n) of seven or more D. melanogaster sequences from Zimbabwe and 10 or more segregating sites (S). We adopt these minimal size restrictions because estimates of {rho} are highly biased and inaccurate when both n and S are small (see justification in RESULTS). Fifteen previously published X chromosome data sets fit this size requirement: su(s) and su(wa) (LANGLEY et al. 2000 Down); 6-phosphogluconate dehydrogenase, vinculin, and zeste (Pgd, vinc, and z; ANDOLFATTO and PRZEWORSKI 2001 Down); glucose-6-phosphate dehydrogenase (G6pd; EANES et al. 1996 Down); shaggy (sgg; C. SCHLÖTTERER, personal communication); period/CG2650, 100G10.2, sxy4, frag.3, frag.4, and CG3592 (HARR et al. 2002 Down); runt (LABATE et al. 1999 Down); and vermilion (v; BEGUN and AQUADRO 1995 Down). Nine autosomal gene regions fit the size requirement: Acp26Aa and Acp26Ab (TSAUR et al. 1998 Down), which were analyzed as one gene region; Acp36DE (BEGUN et al. 2000 Down); Alcohol dehydrogenase (S.-C. TSAUR, unpublished data); Esterase-6 promoter region (Est6-p; ODGERS et al. 2002 Down); Hexokinase C (Hex-C; DUVERNELL and EANES 2000 Down); In(2L)t proximal breakpoint [In(2L)t-PBP; ANDOLFATTO and KREITMAN 2000 Down]; Phosphogluconate mutase (Pgm; VERRELLI and EANES 2000 Down); Heat-shock protein 70Bb (Hsp70Bb; MASIDE et al. 2002 Down); and bicoid (BAINES et al. 2002 Down). All loci were surveyed in lines from one or both of two Zimbabwean population samples (Sengwa Wildlife Reserve and Harare, respectively) first described by BEGUN and AQUADRO 1993 Down.

Data collection:
We collected nucleotide variability data for eight additional gene regions: yellow (y, 2017 bp, n = 49); Fasciclin-2 (Fas2, 588 bp, n = 17); spaghetti squash (sqh, 572 bp, n = 16); Hyperkinetic (Hk, 563 bp, n = 21); dusky (dy, 567 bp, n = 20); licorne (lic, 615 bp, n = 23); rutabega (rut, 548 bp, n = 22), and 1049 bp of the white gene (in two regions of 546 and 503 bp, respectively, separated by 3.9 kb, n = 16), which we analyzed as one region. We also expanded the snf1A data set reported in ANDOLFATTO and PRZEWORSKI 2001 Down. For the snf1A gene region, we increased the sample size from 13 to 25 individuals and the total length of the surveyed DNA from 514 to 2189 bp (in four separate regions of 591, 514, 488, and 596 bp, spread over 9.33 kb). These four regions surrounding the snf1A gene were analyzed as one region.

Details of the sequencing strategy for each gene region can be found at http://helios.bto.ed.ac.uk/evolgen/andolfatto/zimbabweLD. We PCR amplified each region from genomic DNA of single male flies (one male from each isofemale line), precipitated the products with polyethylene glycol (PEG-8000) to remove primers, and sequenced them on both strands using Big-Dye (Applied Biosystems, Foster City, CA) or DYEnamic ET (Amersham, Arlington Heights, IL) sequencing kits. One of our PCR/sequencing primers overlapped with a 63-bp deletion in region 4 of snf1A in individual zs53. For this region, this allele was amplified with flanking primers, cloned into pCR4-TOPO (Invitrogen, San Diego), and three independent clones were sequenced. For all loci surveyed, the lines used either are the same 13 individuals surveyed in ANDOLFATTO and PRZEWORSKI 2001 Down from the Sengwa Wildlife Reserve, Zimbabwe or included additional alleles from a sample of 12 lines from Harare, Zimbabwe (kindly provided by C.-I Wu) and up to 24 lines collected from Victoria Falls, Zimbabwe (kindly provided by B. Ballard). Aligned and annotated sequences are available in nexus file format at the website http://helios.bto.ed.ac.uk/evolgen/andolfatto/zimbabweLD. A summary of polymorphic variation at each locus can be found at the website http://helios.bto.ed.ac.uk/evolgen/andolfatto/zimbabweLD. TABLE AA summarizes information about each of the loci used in this study.

Analysis of levels of variability:
Our first goal is to estimate the effective population size (Ne) for each locus independently. For this purpose, we estimate {theta} = /4 assuming a constant mutation rate, µ (see below). We multiply {theta} for loci on the X chromosome by 4/3 to make them comparable to estimates for autosomal loci (which makes a number of assumptions, see below). We employ two estimators of {theta}: W, which is based on the number of single nucleotide polymorphisms observed in the sample (WATTERSON 1975 Down); and {pi}, which is the average pairwise divergence between sampled chromosomes per site (cf. LI 1997 Down, p. 238). For both of these methods, we use the total number of silent mutations (including multiply hit sites) and for {pi}, we also use the Jukes-Cantor correction for multiple hits as implemented in DnaSP version 3.59 (http://www.ub.es/dnasp/).

A second goal of this analysis is to compare joint estimates of Ne from ({theta}) to joint estimates of Ne from LD (see below). Since a correlation between variability and recombination rates has been well documented in D. melanogaster (BEGUN and AQUADRO 1992 Down; AQUADRO et al. 1994 Down; ANDOLFATTO and PRZEWORSKI 2001 Down), we limit this analysis to genomic regions with > 1 cM/Mb (i.e., those regions least likely to be affected by selection at linked sites). To do this, we calculated averages of W and {pi} for silent sites (weighted by the number of silent sites surveyed). This approach has the advantage that W and {pi} are unbiased estimates of {theta}; however, confidence intervals must be obtained by coalescent simulations for each locus. We instead estimated {theta} (ML) and confidence intervals using a likelihood-based approach based on Equation 12 of HUDSON 1990 Down as implemented by WRIGHT et al. 2003 Down. For each locus, this approach maximizes the likelihood recursion equation

(1)

where

(2)

and

(3)

over a range of {theta} values, where n is the sample size and S is the observed number of silent polymorphisms (again including multiple hits). Assuming that each genomic region investigated is evolutionarily independent, likelihoods can be combined to produce a joint likelihood surface for all loci, the maximum of which is ML. Confidence limits for ML are estimated using the standard {chi}2 approximation. This approach assumes no recombination within loci, but independence among loci. Simulations confirm that the bias of this estimator is not large and that confidence limits for estimated by this method are wider than those in the presence of intralocus recombination (S. WRIGHT, personal communication).

Estimating {theta} from requires knowledge of the mutation rate for silent sites. By comparing estimates of {theta}, or estimating {theta} from a joint estimate of , we implicitly assume that the mutation rate at silent sites is constant among loci. We assume that µ = 1.5 x 10-8/silent site/year and that D. melanogaster populations undergo an average of 10 generations/year, yielding a = 1.5 x 10-9/silent site/generation. Our estimate of the mutation rate depends on many assumptions (reviewed in ANDOLFATTO and PRZEWORSKI 2000 Down), although several estimates based on independent approaches are in close agreement with the above figure. Further, the true mutation rate is unlikely to be >3 x 10-9/site/generation (ANDOLFATTO and PRZEWORSKI 2000 Down; MCVEAN and VIEIRA 2001 Down). We discuss the sensitivity of our results to assumptions about the mutation rate. Where we compare X-linked and autosomal loci, it is worth noting that we have assumed that the ratio of their effective sizes is as expected under a neutral model with equal effective population sizes for males and females. In fact, the appropriate scaling of the X and autosomes is uncertain given that natural populations experience factors that alter the relative variance in reproductive success for males and females (CHARLESWORTH 2001 Down). For most analyses, we consider X-linked and autosomal loci separately, so this is not an issue.

Linkage disequilibrium analysis:
The goal of this analysis is to estimate Ne from levels of LD. We choose to summarize LD by an estimate of the population recombination rate {rho} = 4Ner, where Ne is the effective population size, and r is the sex-averaged rate of recombination (cf. ANDOLFATTO and PRZEWORSKI 2000 Down; WALL et al. 2002 Down). For these analyses, we use all biallelic mutations, including both single-nucleotide polymorphisms (SNPs) and simple insertion-deletion mutations. All multiply hit sites or overlapping mutational events are excluded.

We first employ an estimator, W00, proposed by WALL 2000 Down. To estimate W00, we summarize the data using the observed number of distinct haplotypes (H) and the minimum number of inferred recombination events (RM, cf. HUDSON and KAPLAN 1985 Down) and estimate via simulation the value of {rho} that maximizes the likelihood of obtaining the observed values of H and RM (cf. WALL 2000 Down). The simulations use a generalization of the methodology of HUDSON 1993 Down, which allows noncontiguous but linked regions to be analyzed. We estimate for each locus on the basis of comparisons of physical and genetic maps as described in CHARLESWORTH 1996 Down and ANDOLFATTO and PRZEWORSKI 2001 Down. We assume that r is known exactly and estimate {rho} (=/4) over increments of 3.3 x 105 or smaller, using a minimum of 2 x 105 replicates for each parameter value for each locus (cf. WALL et al. 2002 Down). {rho} is estimated for each locus separately and also jointly for loci with > 1 cM/Mb, assuming that each locus is evolutionarily independent (cf. WALL et al. 2002 Down). As for diversity-based estimates, we multiply {rho} for loci on the X chromosome by 4/3 to make them comparable to estimates for autosomal loci. Approximate 95% confidence intervals for joint estimates W00 are found by making the standard asymptotic maximum-likelihood assumptions. We caution that there is no evidence that these assumptions are appropriate in this context, but the intervals serve as a useful diagnostic.

Several other summary estimators of {rho} have also been proposed (e.g., HUDSON 1987 Down, HUDSON 2001 Down; HEY and WAKELEY 1997 Down). We have chosen the estimator proposed by HUDSON 2001 Down, H01, which is a composite-likelihood estimator of {rho} based on pairwise LD among all pairs of polymorphic mutations in a sample. This estimator performs similarly to W00 and is substantially better than other estimators (see WALL 2000 Down and HUDSON 2001 Down for details). For one of our X-linked data sets, frag4, the estimate of H01 is extremely large (i.e., >10,000). We exclude this locus from correlation tests and thus tests involving H01 use only 23 of 24 loci.

A considerable fraction of polymorphisms segregating among autosomal genes (43/279) are amino acid replacement substitutions. If these are under weak purifying selection (ANDOLFATTO 2001 Down), they may segregate at lower frequencies on average, potentially affecting estimates of {rho}. We estimated heterozygosities for each site [ = 2np(1 - p)/(n - 1), where n is the sample size and p is the polymorphic site frequency] and found that the mean heterozygosities of replacement (mean = 0.29) and silent (mean = 0.32) polymorphisms are not significantly different (P = 0.25, Student's t-test with unequal variances; P = 0.27, two-tailed Mann-Whitney U-test). We have thus chosen to include replacement polymorphisms; excluding them would considerably reduce the number of polymorphisms for some loci, possibly resulting in an upward bias in estimates of {rho} (see RESULTS). The fraction of replacement polymorphisms among X-linked loci was considerably smaller (9/884) and thus their effect on estimates of {rho}, if any, should be negligible. Note that part of the difference in the number of amino acid replacement polymorphisms detected on X-linked vs. autosomal genes reflects the fact that about two times more coding DNA was surveyed for the latter. The remainder of the difference probably reflects differences in the efficacy of selection against amino acid polymorphisms on the two chromosomes (see ANDOLFATTO 2001 Down).

Assessing the behavior of estimators under neutrality and alternative models:
To test the properties of the two estimators of {rho} under the standard neutral model, we run simulations with sample sizes (n) of 7, 13, or 50, with {rho} = 3{theta}. We consider a range of {theta} values and calculate the average and median values of divided by the actual value of {rho}. A total of 104 replicates were run for W00 (and 2 x 103 replicates for H01) under each parameter combination. Results of similar simulations for n = 50 for {rho} = 4{theta} and {rho} = {theta} are reported by HUDSON 2001 Down. We expect, on the basis of estimates of the neutral mutation rate and rates of crossing over, that {rho}/{theta} will vary considerably across the genome, with {rho} < 3{theta} for yellow, su(s), and su(wa) and {rho} > 3{theta} for most loci in regions of high rates of crossing over. In addition to the simulations above, we have also investigated the distribution of W00 for parameters (n and ) that best match those expected for the yellow locus under {rho} {cong} {theta}/4 and {rho} {cong} 3{theta}. These results have been posted at the site http://helios.bto.ed.ac.uk/evolgen/andolfatto/zimbabweLD.

Directional selection at linked sites is known to shape patterns of genetic variation (HILL and ROBERTSON 1966 Down) and is thought to explain the observed positive correlation between levels of diversity and the estimated crossing-over rate (BEGUN and AQUADRO 1992 Down; AQUADRO et al. 1994 Down). Both linkage to positively selected alleles (i.e., hitchhiking or selective sweeps, cf. MAYNARD SMITH and HAIGH 1974 Down) or deleterious alleles (i.e., background selection, cf. CHARLESWORTH et al. 1993 Down) have been advanced as possible explanations.

Background selection due to strongly deleterious mutations can be modeled as a simple reduction in effective population size (HUDSON and KAPLAN 1994 Down). This assumption is equivalent to varying {theta} (and {rho}) in simulations of the standard neutral model. However, under background selection models that posit more weakly deleterious mutations, this assumption may not be valid (CHARLESWORTH et al. 1995 Down; GORDO et al. 2002 Down). This situation is beyond the scope of our study and will be a topic of future work.

We do examine the effects of recurrent hitchhiking due to advantageous mutations by running simulations under a model of recurrent, nonoverlapping selective sweeps (PRZEWORSKI 2002 Down). The program used was kindly provided by M. Przeworski. We consider a locus of fixed size (n = 50, number of sites = 2000, {theta} = 15, {rho} = 45) and determine how the average and median values of {theta} and {rho}, as estimated assuming the standard neutral model, change as a function of the rate of selective sweeps ({Lambda} in PRZEWORSKI 2002 Down). We fix the selection coefficient s = 0.002 and the species population size Ne = 4 x 106. The choice of other parameter values yielded similar results (results not shown).

Incorporating the effects of gene conversion:
Estimates of {rho} based on rates of crossing over () may be systematically underestimated because they ignore gene conversion. Gene conversion was implemented into the standard coalescent with recombination (cf. HUDSON 1990 Down; WIUF and HEIN 2000 Down; PRZEWORSKI and WALL 2001 Down). Gene conversion is defined here as exchange of short tracts of DNA between homologous chromosomes with no associated crossing over between flanking markers. Thus, in this implementation, gene conversion and crossing over are modeled as mechanistically independent, and gene conversion exchanges associated with crossing over are ignored. For all models, we assume the rate of crossing over is and that the distribution of gene conversion tract lengths is geometric with a mean of 352 bp (HILLIKER et al. 1994 Down).

We consider three models of how gene conversion (GC) and crossing-over (CO) rates may be associated. In the first model (GC1), we assume that two-thirds of recombination events are GC, and thus the rate of GC, GC = 2 between adjacent sites, where is the estimated local rate of crossing over for a particular gene region. In a second model (GC2), we assume that the rate of GC is constant across a chromosome; GC = 2MAX, where MAX is the highest estimated rate of crossing over on the chromosome (MAX = 2.27 x 10-8/generation between adjacent sites on the X chromosome). In a third model (GC3), we assume that the total rate of recombination between adjacent sites is constant across a chromosome and that the rate of gene conversion between adjacent sites is GC = (MAX - ), where is the estimated rate of crossing over for the locus under study. GC1 and GC2 are equivalent for genomic regions with the highest rates of crossing over and diverge in regions of reduced crossing over. Likewise, GC3 is equivalent to a model with no GC for regions with the highest rates of crossing over. These models are meant to be illustrative rather than quantitative and other possible models of recombination are addressed in the DISCUSSION. {rho} is estimated as above (from W00) from simulations with gene conversion to estimate the likelihoods.

How do gene conversion rates in our models compare to data on gene conversion from Drosophila? On the basis of the data from the genes maroon-like and rosy, it is thought that the rate of gene conversion ({gamma}, see ANDOLFATTO and NORDBORG 1998 Down) is on the order of 10-5 events/generation and that 50% or more of recombination events are gene conversions without an associated exchange of flanking markers (FINNERTY 1976 Down; HILLIKER and CHOVNICK 1981 Down). Under GC1 and GC2, the sex-averaged GC rate is 1.5 x 10-5/generation in regions of the genome with the highest rates of crossing over, which is similar to the rate that has been measured at rosy and maroon-like (~10-5, FINNERTY 1976 Down; HILLIKER and CHOVNICK 1981 Down). The ratio of GC:CO has been estimated to be ~1 at maroon-like (FINNERTY 1976 Down) and ~4 for rosy (HILLIKER and CHOVNICK 1981 Down). In GC1 and GC3 models, the ratio of GC to CO rates at the rosy locus ( = 0.7 cM/Mb) would be ~2 and corresponds to a GC rate of 6 x 10-6/generation. In the GC2 model, the GC:CO ratio at rosy would be ~6. For maroon-like ( = 1.36 cM/Mb), the GC:CO ratio = 2 under GC1 (rate ~10-5/generation) and would be ~0.7 under GC3 (rate ~3 x 10-6/generation). In the GC2 model, the GC:CO ratio at maroon-like would be ~3. Thus, our assumed rates of gene conversion, and the extent of association between gene conversion and crossing over, are roughly in agreement with the limited data on rates of gene conversion in Drosophila.


*  RESULTS
*TOP
*ABSTRACT
*METHODS
*RESULTS
*DISCUSSION
*LITERATURE CITED

Performance of {rho} estimators and choice of data:
With a choice of two similarly performing {rho} estimators, W00 (WALL 2000 Down) and H01, (HUDSON 2001 Down), we first set out to investigate how these estimators perform under the standard neutral model in the context of parameters relevant to the available data (e.g., sample size, the number of segregating sites, etc.) In Table 1, we summarize the performance of the two estimators as a function of the number of alleles sampled (n) and the amount of variation ({theta}, proportional to the number of segregating sites). Both estimators show considerable upward bias when the sample size is small and/or {theta} is small; however, the median of W00 is generally biased downward under these conditions. Overall, H01 appears to be more biased than W00; this is particularly pronounced when the number of segregating sites is small (even for large sample sizes). On the basis of these simulations, we conclude that W00 should be used cautiously for 7 <= n <= 13 unless {theta} is in the vicinity of 10 or larger and probably not at all for samples of <7 alleles. Since H01 seems to be particularly sensitive to small {theta}, it should be used cautiously when there are <13 alleles unless {theta} >= 10 and probably not at all if {theta} < 5.


 
View this table:
In this window
In a new window

 
Table 1. Bias of under the standard neutral model

We have used these results as a guide to select data for our study (see METHODS) and accordingly we have limited the data sets considered here to those that have n >= 7 and S > 10 (TABLE AA). Only 4 of our selected data sets have n = 7 and for each of these W > 8 (W is calculated from Stot; see TABLE AA). The 10 data sets that have 7 < n <= 13 have W >= 6. Possible effects of biases of the estimators on our results are noted in the analyses and discussion that follow.

Fig 1 compares estimates of W00 and H01 for all loci fitting the above restrictions. W00 and H01 are strongly positively correlated (R = 0.78, P < 0.001, n = 33, Spearman rank correlation test, two-tailed). However, H01 are also systematically larger than W00, as might be expected given the results in Table 1. TABLE AA (http://helios.bto.ed.ac.uk/evolgen/andolfatto/zimbabweLD) summarizes correlations among n, , W (per locus), and estimates of (per locus) for the X chromosome data.



View larger version (17K):
In this window
In a new window
Download PPT slide
 
Figure 1. Comparison of estimates of (per locus) for the 24 X-linked and 9 autosomal polymorphism data sets used in this study.

Differences in sample size and the number of segregating sites are not the only possible causes of the systematic difference between estimators. Regions of reduced recombination in D. melanogaster have lower levels of variability due to hitchhiking and/or background selection (AQUADRO et al. 1994 Down; CHARLESWORTH 1996 Down). If background selection can be approximated as a simple reduction in effective population size, the expected behavior of {rho} estimators will be similar to that under the neutral model (cf. Table 1). However, the observation that these regions also have more low-frequency polymorphisms suggests that recurrent selective sweeps may also influence polymorphism patterns in these regions (ANDOLFATTO and PRZEWORSKI 2001 Down).

Since one of our objectives is to compare {rho} and {theta} estimates across a gradient of crossing-over rates, we investigated the properties of the two {rho} estimators under a recurrent hitchhiking model (Table 2). More frequent hitchhiking is expected to decrease levels of variability at closely linked sites (MAYNARD SMITH and HAIGH 1974 Down; KAPLAN et al. 1989 Down) and thus decrease {theta}. As can be seen in Table 2, the two {rho} estimators respond differently to increasing rates of hitchhiking. W00 is strongly positively correlated with and generally yields {rho} that are slightly smaller than {theta}. In contrast, the median H01 is only weakly correlated with estimates of , and {rho} from H01 are typically much larger than estimates of {theta}; this difference is larger as the rate of selective sweeps increases.


 
View this table:
In this window
In a new window

 
Table 2. Effect of selective sweeps on estimates of N{theta} and N{rho}

Further investigation of the behavior of these {rho} estimators under alternative models is beyond the scope of this study. In the analyses that follow, we focus on W00 for several reasons. First, W00 allows us to easily combine information from multiple loci in a likelihood framework. Second, W00 is not severely biased upward under parameters that best match our available data (i.e., most are small samples with few segregating sites); in fact, the median estimate is often below the true {rho}. Finally, using W00 is not likely to result in {rho} > {theta} in regions of reduced crossing over if recurrent hitchhiking is reducing levels of variability in these regions.

Comparing levels of diversity and linkage disequilibrium:
In a neutral panmictic population, we expect that {rho} {cong} {theta}. Table 3 lists estimates of for each data set and these are plotted in cytological order in Fig 2. With the exception of the tip of the X chromosome (the first 10 points cover cytological bands 1B1 to 3B3 on the X chromosome), values of {rho} based on W00 are slightly lower than the values of {theta}(13/14 on the X have {rho} < {theta}). For {rho} based on H01, the pattern is less striking, particularly for the X (7/14 have {rho} < {theta}; data not shown); however, the pattern is equally strong for the two estimators on the autosomes (9/9 loci and 8/9 loci have {rho} < {theta}, based on W00 and H01, respectively). It is difficult to draw solid conclusions from this type of analysis because different sample sizes and numbers of segregating sites for each locus (and different averages for these loci on the X and autosomes) complicate the interpretation of comparisons.



View larger version (19K):
In this window
In a new window
Download PPT slide
 
Figure 2. Point estimates of N based on linkage disequilibrium (W00, open circles) and levels of diversity ({pi}, shaded circles). A total of 24 X-linked and 9 autosomal loci are plotted in order of cytological position and in order of appearance in TABLE AA and Table 3). The assumed mutation rate is 1.5 x 10-9/site/generation.


 
View this table:
In this window
In a new window

 
Table 3. Estimates of Ne (in millions) for each locus

Further, due to the inherent stochasticity of the evolutionary process, individual estimates of {theta} and {rho} are not particularly accurate. We can achieve greater precision by estimating the relative likelihoods for different values using multiple loci jointly. Fig 3A shows the joint-likelihood curve of {rho} (based on W00) for the X-linked and autosomal loci with > 1 cM/Mb (i.e., loci in areas of high recombination that are least likely to be affected by selection at linked sites). The maximum-likelihood estimate of {rho} for the X chromosome is {rho}X = 1.7 million, with an ~95% confidence interval of 1.1–2.3 million. The corresponding estimate of {theta}X is 4.0 million (based on the ML), with a 95% confidence interval of 3.1–5.1 million. For the autosomes, {rho}A = 0.4 million, with an ~95% confidence interval of 0.2–0.9 million. The corresponding estimate of {theta}A is 1.8 million with a 95% confidence interval of 1.3–2.9 million.



View larger version (20K):
In this window
In a new window
Download PPT slide
 
Figure 3. (a) Joint maximum-likelihood estimates of {rho} (curves) and {theta} (points with ~95% confidence intervals). The assumed mutation rate is 1.5 x 10-9/generation. Note that confidence intervals for {theta} assume no recombination. The dotted line indicates the approximate 95% confidence bounds using the standard chi-square approximation. (b) Comparison of joint-likelihood estimates of {theta} and {rho} in Zimbabwean D. melanogaster and a California population of D. simulans (BEGUN and WHITLEY 2000 Down; WALL et al. 2002 Down). Approximate 95% confidence interval bars on estimates are indicated.

Several interesting observations emerge from these results. First, the estimate of {theta} for the X chromosome is larger than that for the autosomes. This trend has been noted before on the basis of levels of nucleotide (ANDOLFATTO 2001 Down) and microsatellite variability (KAUER et al. 2002 Down). On the basis of the confidence intervals for , we can reject the standard model with equal population sizes for males and females (i.e., an X:autosome ratio of 3/4). Several factors may contribute to this pattern, including a large variance in reproductive success for males (i.e., sexual selection), background selection, and autosome-specific inversion polymorphisms (CHARLESWORTH 1996 Down, CHARLESWORTH 2001 Down; ANDOLFATTO 2001 Down; KAUER et al. 2002 Down). Interestingly, the difference in effective sizes is even more pronounced for estimates of Ne based on linkage disequilibrium; {theta}X/{theta}A = 2.2 whereas {rho}X/{rho}A = 4.1. More LD on the autosomes (reflected in the higher {rho}X/{rho}A ratio) is not predicted under either background selection or sexual selection models and therefore points to possible effects of inversion polymorphisms. Inversions are both much more common and at higher frequencies on the autosomes of D. melanogaster (LEMEUNIER and AULARD 1992 Down). Thus inversion polymorphisms may be reducing variability on the autosomes by increasing the effects of linked selection (ANDOLFATTO 2001 Down). In addition, the associated reduction in the effective rate of recombination (crossing over is suppressed in inversion heterozygotes) and rapid changes in inversion frequency (ANDOLFATTO et al. 1999 Down) can both increase levels of LD on the autosomes relative to levels of diversity. Additional data from autosomal genes may allow us to distinguish among alternative hypotheses.

Interestingly, joint estimates of {rho} and {theta} are significantly different from each other on both the X and the autosomes, given our assumed mutation rate ( = 1.5 x 10-9/generation). This suggests that Zimbabwe populations, despite our a priori prediction, show an excess of LD relative to expectations under the neutral model. However, while confidence intervals for {theta} are overestimated (since we have assumed no intragenic recombination), a trivial explanation for the discrepancy between {theta} and {rho} in Zimbabwean D. melanogaster might be error in the estimated mutation rate (e.g., {theta} would be half as large if we assume the true mutation rate is twofold higher; see METHODS). Given that the mutation rate is uncertain, the apparent LD excess in this population based on estimated rates of crossing over is not entirely convincing.

Joint-likelihood estimates of {rho} and {theta} for 15 X-linked and 14 autosomal loci from the California population of D. simulans are shown for comparison (Fig 3B). It is interesting to note in Fig 3B that ratios of {rho}/{theta} for both the X and the autosomes are much larger for the Zimbabwean population of D. melanogaster (0.4 and 0.2, respectively) than for the California population of D. simulans (0.02 and 0.07, respectively; cf. WALL et al. 2002 Down). While uncertainty in the mutation rate (~2-fold) may explain the discrepancy between {rho} and {theta} in Zimbabwean D. melanogaster, it is unlikely to explain the >10-fold difference between these estimates of N in the D. simulans population in Fig 3B. This suggests that something is fundamentally different in the evolutionary history of these two populations or species. In particular, the Zimbabwe population of D. melanogaster may be closer to mutation-drift equilibrium, while the California population of D. simulans clearly is not.

Behavior of estimates of {rho} as a function of r:
A positive correlation between and is expected for two reasons: first, (an estimate of 4Ner) should increase proportionally with ; second, the gradient in strongly positively covaries with empirical estimates of diversity () in Drosophila (BEGUN and AQUADRO 1992 Down; AQUADRO et al. 1994 Down; ANDOLFATTO and PRZEWORSKI 2001 Down) and, hence, with the apparent effective population size of the genomic region ({theta}, which under the standard neutral model is inversely proportional to the expected linkage disequilibrium for a given r). Here, we focus on data from the X chromosome because we lack data for autosomal loci in regions of reduced crossing over. For the 24 gene regions on the X chromosome, W00 (per site) is positively correlated with in our data set [ Fig 4A; W00, R = 0.44, P < 0.05, 24 loci; H01, R = 0.43, P < 0.05, 23 loci (see METHODS); Spearman rank correlation test, two-tailed]. As expected on the basis of previous results, and {theta} for the 24 X-linked loci considered here also show a strong positive correlation with (Fig 4B; based on {pi}, R = 0.62, P = 0.001, 24 loci; as above). However, a striking aspect of the results presented in Fig 2 is that estimates of {rho} are rather large near the telomere of the X chromosome (i.e., loci to the far left of the graph), despite the fact that the crossing-over rate and levels of diversity are considerably reduced relative to other regions on the X (TABLE AA and Table 3). Under the standard neutral model, a measure that should be independent of the effective population size is {rho}/{theta} (= 4Ner/4Neµ), which we estimate as / (HUDSON 1987 Down; ANDOLFATTO and PRZEWORSKI 2000 Down). Since we assume that the mutation rate is constant across loci, we expect estimates of / to positively covary with . We detect no such correlation (Fig 4C; W00/{pi}, R = 0.01, 24 loci, P > 0.05; H01/{pi}, R = 0.13, 23 loci, P > 0.05; as above).




View larger version (38K):
In this window
In a new window
Download PPT slide
 
Figure 4. Relationships between (a) W00 per 100 bp, (b) {pi} per 100 silent sites, (c) W00/{pi}, (d) {rho} (W00), and (e) {rho} (W00)/{theta} ({pi}) vs. sex-averaged . Estimates for 24 X-linked loci are plotted as shaded circles; estimates for 9 autosomal loci are plotted as open circles. Spearman correlation coefficients (R) and two-tailed P values are given for X-linked loci only. The dotted line in e indicates the neutral equilibrium expectation for N{rho}/N{theta}.

The insensitivity of / to changes in (Fig 4C) suggests that the correlation between and (Fig 4A) is driven by the relationship between and (Fig 4B). In fact, {rho} and are significantly negatively correlated [ Fig 4D; {rho} (W00), R = -0.51, 24 loci, P = 0.01; {rho} (H01), R = -0.27, 23 loci, P > 0.05; as above], the opposite trend to that expected on the basis of the relationship between {theta} and . In other words, there is less diversity in regions of low crossing over (Fig 4B), but also less LD than expected given levels of diversity and estimated rates of crossing over. This can also be seen in Fig 4E, which plots {rho}/{theta} (estimated from W00 and {pi}) as a function of . Under the standard neutral model, N{rho}/N{theta} should be {cong}1 and independent of r. While this prediction is roughly true for the 11 X-linked loci with > 1 cM/Mb ({rho}/{theta} <= 1), loci with < 0.5 cM/Mb show considerably higher values of {rho}/{theta}. Overall, there is a strong negative correlation between {rho}/{theta} and (Fig 4E; R = -0.79, 24 loci, P < 0.001; as above), implying that regions of reduced crossing over have considerably less LD than predicted by rates of crossing over under the standard neutral model.

Could a bias in W00 estimates explain the observed correlations? In simulations of the neutral model (Table 1), we noted a slight upward bias in mean estimates with decreasing {theta}. However, our correlation result (which employs a nonparametric test) depends more on median estimates of W00, which were typically lower than the actual {rho} in these simulations (see Table 1). Additional simulations with parameters appropriate for the yellow locus (n = 49, S = 18, {rho} {cong} {theta}/4) confirmed that the median W00 is downward biased (see METHODS). Thus, bias in W00 is not an explanation for the negative correlation between {rho}/{theta} and .

Gene conversion and LD patterns in regions of high crossing over:
Our estimates of recombination rate are based on large-scale comparisons of genetic and physical maps that ignore the contribution of gene conversion to the total rate of recombination. At small physical scales (i.e., <1 kb in Drosophila), the contribution of gene conversion to the overall rate of recombination might be substantial (ANDOLFATTO and NORDBORG 1998 Down). The precise relationship between crossing over and gene conversion is not well understood in Drosophila, but the consensus view is that they are associated processes that result from Holliday junction formation and resolution during meiotic recombination (CARPENTER 1984 Down).

Given that gene conversion is a potentially important contributor to the overall rate of recombination, we reexamine the relationship between levels of LD relative to levels of variability (measured as joint for loci with > 1 cM/Mb), assuming a model that includes both crossing over and gene conversion (e.g., GC1; see METHODS). In general, we find that, if gene conversion contributes substantially to the overall rate of recombination in regions of high crossing over, there is clear evidence for an excess of LD given levels of diversity in the Zimbabwe data. As an illustration, we present results for gene conversion model GC1 in Fig 5 (this model assumes that two-thirds of recombination events are gene conversions without associated crossing over; see METHODS). Under this model, {rho}X = 0.33 million for the X chromosome (~95% confidence interval of 0.24–0.50) and {rho}A = 0.15 million for the autosomes (~95% confidence interval of 0.04–0.20). Since {rho} and {theta} differ by more than an order of magnitude under this model, this discrepancy is not likely to be explained by uncertainty in the mutation rate (about twofold, see METHODS). Thus, our results suggest either that the level of genetic exchange due to gene conversion is close to negligible in regions of high crossing over or that this Zimbabwean population departs substantially from the predictions of mutation-drift equilibrium in the direction of a excess of