Abstract
We analyze nucleotide polymorphism data for a large number of loci in areas of normal to high recombination in Drosophila melanogaster and D. simulans (24 and 16 loci, respectively). We find a genomewide, systematic departure from the neutral expectation for a panmictic population at equilibrium in natural populations of both species. The distribution of sequencebased estimates of 2Nc across loci is inconsistent with the assumptions of the standard neutral theory, given the observed levels of nucleotide diversity and accepted values for recombination and mutation rates. Under these assumptions, most estimates of 2Nc are severalfold too low; in other words, both species exhibit greater intralocus linkage disequilibrium than expected. Variation in recombination or mutation rates is not sufficient to account for the excess of linkage disequilibrium. While an equilibrium island model does not seem to account for the data, more complicated forms of population structure may. A proper test of alternative demographic models will require loci to be sampled in a more consistent fashion.
THE standard assumptions of the neutral theory of molecular variation (Kimura 1983) are that the vast majority of mutations are neutral and that genes are sampled from a panmictic population at equilibrium. Under this model, the population recombination parameter, C = 4Nc, and the population mutation parameter, θ = 4Nμ, can be estimated from sequence polymorphism data (where N is the effective population size of the species, c the rate of recombination per base pair per generation, and μ the rate of mutation per base pair per generation). The ratio is then an estimate of the recombination rate between adjacent bases, scaled to the mutation rate per base pair. Estimates of these same parameters, c and μ, can be obtained from genetic and physical map data and from nucleotide divergence between closely related species, respectively. This allows for a direct comparison between the two methods of estimation (Hudson 1987). If the standard neutral model is correct, the two methods should yield similar results. This prediction is our point of departure.
Several authors have pointed out that estimates of C/θ for specific loci are different from those expected from independent estimates of c and μ (Hudson 1987; Leichtet al. 1995; Eaneset al. 1996; Hey and Wakeley 1997; Hassonet al. 1998; Andolfatto and Kreitman 2000). Low estimated values of C/θ have been interpreted as reflecting strong linkage disequilibrium (e.g., Schaeffer and Miller 1993; Kirby and Stephan 1996). Higher than expected levels of linkage disequilibrium have also been noted for areas of normal recombination using a different approach (Miyashitaet al. 1993; Miyashita and Langley 1994).
Excess linkage disequilibrium is often interpreted as reflecting the action of natural selection at the locus (e.g., Miyashitaet al. 1993; Schaeffer and Miller 1993; Kirby and Stephan 1996). However, this pattern can also result from demographic departures from model assumptions, e.g., population structure (Li and Nei 1974; Ohta 1982). A characterization of background levels of linkage disequilibrium may help distinguish between these two explanations. Demography is a force that affects the entire genome, while natural selection has a local effect that is not expected to be the same for unlinked regions. Thus, a locus shaped by natural selection is expected to show a pattern of polymorphism that differs from most other loci. Here, we summarize the results for 24 independent loci in Drosophila melanogaster and 16 in D. simulans. We find that values are systematically too small to be consistent with the theoretical predictions of the standard neutral theory, given reasonable estimates of c and μ. In other words, levels of linkage disequilibrium (as measured by ) are almost always significantly higher than expected.
METHODS
We use polymorphism data sets for regions of normal to high recombination (>5 × 10^{−9} per base pair per generation) that have more than three segregating sites (24 genes in D. melanogaster and 16 genes in D. simulans). We include biallelic single nucleotide polymorphisms but not mutations that overlap deletions as they represent incomplete information. The sequences for most loci can be obtained from GenBank at http://www.ncbi.nlm.nih.gov/Entrez/. The data sets analyzed are available upon request to P. Andolfatto. Polymorphism data sets used in this study are the following: Acp26A (Aguadéet al. 1992; Tsauret al. 1998), Acp70A (Cicera and Aguadé 1997), Adh (Kreitman 1983; Sumner 1991; S. C. Tsaur, unpublished results), Boss (Ayala and Hartl 1993), Dpp (Richteret al. 1997), E(eve) (Ludwig and Kreitman 1995), Est6 (Cooke and Oakeshott 1989; Karotamet al. 1993; Hasson and Eanes 1996), G6pd (Eaneset al. 1993), Gld (Hamblin and Aquadro 1996, 1997), Hsp83 (Hasson and Eanes 1996), In(3L)P (Wesley and Eanes 1994), In(2L)t (Andolfattoet al. 1999; Andolfatto and Kreitman 2000), Mlc1 (Leichtet al. 1995), Pgd (Begun and Aquadro 1994), prune (Simmonset al. 1994), period (Kliman and Hey 1993), Ref(2)P (Wayneet al. 1996), Rh3 (Ayalaet al. 1993), runt (Labateet al. 1999), Sod (Hudsonet al. 1997), Top2 (Palopoli and Wu 1996) and Tpi (Hassonet al. 1998), white (Kirby and Stephan 1995), Vermilion (Begun and Aquadro 1995), and zeste and Yp2 (Hey and Kliman 1993). CecC (Clark and Wang 1997) and Amyd (Inomataet al. 1995) were chosen as representative genes from their respective clusters. In(2L)t and In(3L)P are polymorphic inversions in D. melanogaster. Here, the labels refer to sequences spanning the breakpoint sites on standard chromosomes (proximal and distal breakpoint, respectively). For D. simulans, In(2L)t refers to the homologue of the D. melanogaster In(2L)t proximal breakpoint region. AdhFast haplotypes were excluded from the analysis, as were inverted alleles from Adh, Dpp, Est6, Hsp83, In(2L)t, and In(3L)P samples. For Sod, we used region 2021, which is ~12 kb upstream of the Sod coding region (Hudsonet al. 1997).
Recombination estimates: For each locus, laboratory estimates of the regional rate of crossing over, c, are obtained as follows: for every chromosomal arm, polynomial curves were fitted to plots of cumulative genetic distance as a function of cumulative physical distance. The derivative of the polynomial at a given physical map position is taken to be the recombination rate (Ashburner 1989, pp. 453–457; Trueet al. 1996; Comeronet al. 1999). Laboratory estimates are not available for the second chromosome of D. simulans; we use the rates for the homologous region of D. melanogaster as surrogates for Adh, E(eve), In(2L)t, and Top2.
To obtain an estimate of C, we need estimates of c and N (we refer to this estimate of C as C_{map}). Since males do not recombine in D. melanogaster and D. simulans (Ashburner 1989, p. 476), the population parameter C is (1/2)4Nc = 2Nc for autosomal genes and (2/3)3Nc = 2Nc for Xlinked genes. The recombination rate c is taken to be the crossingover rate estimated from laboratory crosses (see discussion). To estimate N, we equate the observed θ_{w} at silent and noncoding sites of each locus with 4Nμ (or 3Nμ for Xlinked loci). Under the standard neutral model, θ_{w} (Watterson 1975) is an unbiased estimate of the population mutation rate, θ. Our estimate of the mutation rate is obtained from , the estimated rate of divergence per year (which depends on , the estimated time to the common ancestor of the melanogaster and obscura species groups), and , the estimated number of generations per year. If million years (my) and , our estimate of μ is 3 × 10^{−9}/bp/generation. (We discuss the validity of these assumptions in detail in the discussion.) Given these estimates for d, g, and T, the average across loci is roughly 10^{6} for D. melanogaster and 2 × 10^{6} for D. simulans.
Hudson's (1987) estimator of C (henceforth referred to as C_{hud}) was calculated using a modification of a program kindly provided by J. Wakeley. Ideally, one would like to use all the information in the data. However, fulllikelihood approaches (e.g., Griffiths and Marjoram 1996) are not computationally feasible for high levels of recombination (Wall 2000). C_{hud} was chosen among many estimators of C because, under the standard neutral model, it can easily be related to the amount of linkage disequilibrium in the sample. Specifically, C_{hud} is a moment estimator obtained from the relationship expected between the sample variance of the number of pairwise differences (a measure of linkage disequilibrium) and the population recombination rate (Hudson 1987). In our implementation, any value of C_{hud} ≥ 10,000 is taken to be 10,000. Similarly, if there is more association between sites than expected under the standard neutral model with no recombination, C_{hud} is set to zero. Note that C_{hud} is much more dependent on model assumptions than is our estimate of c and can only be interpreted as an estimate of 4Nc under a very restricted set of models.
Coalescent simulations of the standard neutral model: Coalescent simulations (Hudson 1990) of a panmictic population were run using modifications of programs kindly provided by R. Hudson and J. Wall. The simulations assume a WrightFisher population at equilibrium from which samples are drawn randomly (referred to as the “standard neutral model” or SNM). Mutations are neutral; each new mutation occurs at a previously unmutated site. Parameters for the simulations are the sample size, the sequence length in base pairs, and C_{map}. Ten thousand replicates are run for each parameter set.
Although we refer to our model as the “standard neutral model,” standard coalescent simulations use the parameter θ and treat S as a random variable. Here, we generate genealogies and then place the observed number of mutations (at both synonymous and nonsynonymous sites), S, on the tree (Hudson 1993). We take this approach because the number of segregating sites is observable, while we have little information about the population parameter θ (Hudson 1993).
Note that S refers to the number of segregating sites, not the number of mutations, so we are effectively ignoring multiple hits. This choice is conservative for our purposes, since multiple hits will tend to decrease linkage disequilibrium. There are nine data sets in D. melanogaster with visible multiple hits (at most three sites) and seven in D. simulans (Adh has six sites with more than two alleles; the six other data sets have fewer than three). Results are virtually unchanged if we consider S to be the number of inferred mutations instead of the number of segregating sites (results not shown).
For each locus, given the observed number of segregating sites and our estimate of C, we ask what proportion P of simulated runs have a value of C_{hud} smaller than or equal to the observed value. That is, for locus i, P_{i} = Pr(simulated C_{hud} ≤ observed C_{hud} SNM, C_{map}). If loci are independent, C = C_{map}, and the standard neutral model is accurate, the distribution of P values across loci should be uniformly distributed between 0 and 1. We test for a departure from uniformity by using the fact that, for n data sets, should be χ^{2} distributed with 2n d.f. (Fisher 1954, as cited in Sokal and Rohlf 1995). The independence of loci is a crucial assumption of our multilocus analysis. We verified that it is valid by running 1000 simulations with two loci separated by C = 80 (the sample size was 20, and intragenic recombination was C = 10). No correlation was detected between loci (results not shown).
Heterogeneity in selective constraints: We ran coalescent simulations to test the effect of a nonuniform distribution of mutations on estimates of C_{hud}. We model this spatial heterogeneity as variation in mutation rate. Coalescent simulations are run with the same parameters as for the panmictic, uniform mutation case. The sequence is divided into three parts, spanning onefourth of the length of the sequence, onehalf, and onefourth, respectively (Figure 1). This case is meant to represent an exon flanked by two introns. Since on average one site out of four is silent in a coding region, the mutation rate in “introns” is taken to be fourfold higher than in the middle half. This model assumes that introns and silent sites have similar levels of constraint (cf. Moriyama and Powell 1996; Li 1997). In practice, at the end of each simulated run, the total sequence is divided into segments with the same genealogy and the same mutation rate. Given a constant rate of mutation across base pairs, the probability that a mutation will be placed in segment i is given by where n is the number of segments, L_{i} is the length of segment i, and T_{i} is the total branch length of the genealogy for segment i. To add stepwise variation in mutation rates, we weight these probabilities by the “relative mutation rate” for each interval (yaxis in Figure 1).
Population subdivision: We also ran coalescent simulations for a symmetric island model. Since geographic subdivision increases the extent of linkage disequilibrium (Li and Nei 1974; Ohta 1982), it is not obvious that the loci can be treated as independent (Nei and Maruyama 1975; Robertson 1975). To verify the assumption of independence, we simulated two loci separated by C = 80, with an intragenic rate of recombination of C = 10, 4Nm = 0.1, and a sample size of 20; whether sampling from one or both demes, no correlations were detected (results not shown).
If we assume a symmetric twodeme model, observed values of F_{ST} (Wright 1951) can be used to estimate the amount of gene flow between populations (Hudsonet al. 1992b). Estimates vary from roughly 4Nm = 0.5 to 2 for D. melanogaster and 0.75 to 1.5 for D. simulans, depending on which loci are used (m is the number of migrants per deme per generation; Irvinet al. 1998). We run coalescent simulations with 4Nm = 0.2 to 1. Samples are either drawn equally (or close to equally) from both demes or only from one deme. Each parameter set is run 10,000 times. To gain insight as to how the variance in outcomes depends on the number of islands, we also run a fiveisland model with 4Nm = 1 and 4Nm = 3, sampling from only one deme.
To evaluate the fit of a symmetric island model, we use both C_{hud} and B' (Wall 1999). The statistic B' was developed as a onetailed test of the standard neutral model (Wall 1999). B' is the number of pairs of adjacent segregating sites that partition the sample in the same way (Wall 1999). A partition of the sample consists of two disjoint subsets, whose union is the sample; each subset is composed of individuals with the same allele at that site. B' can be thought of as a measure of linkage disequilibrium among adjacent segregating sites. The expectation of B' should be higher under a geographic subdivision model than under a panmictic one for the same level of recombination (Wall 1999). Given the true recombination rate, B' is the most powerful existing test for rejecting the panmictic neutral model when simulations are run for a twoisland model and sampling is from a single locality (Wall 1999). The probabilities reported for B' correspond to the proportion of runs with a value greater than or equal to the observed B', i.e., P(B') = Pr(simulated B' ≥ observed B'  SNM, C = C_{map}). A low value of P(B') indicates high levels of linkage disequilibrium. Simulations for B' were run with the total number of (inferred) mutations (which in all cases equaled the sum of the number of segregating sites and the number of multiple hits). Sites with multiple hits were treated as multiple mutations with missing information. For each site, there are only three ways the missing information can be filled in, depending on which of the three alleles is ancestral. This leads to a range of values for B', from which the most conservative one is taken.
In some cases, we wish to demonstrate that there is too little linkage disequilibrium in the data; this corresponds to the other tail of the B' statistic or Pr(simulated B ≤ observed B' SNM, C = C_{map}). Note that this probability is not 1 − Pr(simulated B' ≥ observed ′ ≥B' ≥′) since ′ ≥B' ≥′ is discrete. Using the number of inferred mutations is no longer conservative, so we rerun the simulations with the number of segregating sites to calculate this tail. We do not examine the other tail of ′ ≥C′ ≥_{hud} (′ ≥i.e.′ ≥, large values) because, for small samples, ′ ≥C′ ≥_{hud} is expected to be much larger than the true mean (Hudson 1987). Thus, there is little power to detect an unusually large value of ′ ≥C′ ≥_{hud}.
RESULTS
C_{hud}/θ_{w} is consistently too low: Figures 2 and 3 present two estimates of the number of recombination events per mutation for each locus. The first (represented by squares) is based on direct laboratory measurements of the crossingover rate. This estimate is c/2μ if the gene is autosomal and 2c/3μ if it is Xlinked, where c is the recombination rate per base pair. C_{hud}/θ_{w} (shown with circles) is estimated from sequence polymorphism data. θ_{w} is based on the number of (silent and noncoding) segregating sites, not on the number of mutations; i.e., multiple hits are ignored. (This is conservative for our purposes since it leads to a smaller value of θ_{w} than if the estimate were based on the number of mutations.) C_{hud} is a measure of linkage disequilibrium (see methods). Since both C and θ are scalar multiples of the effective population size under standard neutral assumptions, dividing C_{hud} by θ_{w} should make the ratio independent of the effective population size under the null model. This is of use because we do not have an estimate of the effective population size that is independent of genetic diversity levels. Scaling C_{hud} to θ_{w} could also be important if background selection is reducing the effective population size for some loci (Charlesworthet al. 1995). As can be seen by eye in Figures 2 and 3, C_{hud}/θ_{w} is almost always smaller than c/2μ (or 2c/3μ if Xlinked).
One reason for using C_{hud} is that the median is known to be above the true mean (Hudson 1987)—in contrast to, e.g., Hey and Wakeley's estimator of the recombination rate γ (Hey and Wakeley 1997; Wall 2000). Under the standard neutral model and taking estimates of c and μ to be the true rates, we expect that the median of C_{hud}/θ_{w} will be >c/2μ (or 2c/3μ for Xlinked loci). This was confirmed by simulation (results not shown). Since the loci are independent, we can use a signs test with probability onehalf that C_{hud}/θ_{w} is above c/2μ (or 2c/3μ for Xlinked loci). Such a test is highly conservative, yet significant for both species (P = 0.0032 for D. melanogaster and P = 0.019 for D. simulans, onetailed). Laboratory estimates of the crossingover rate are not available for chromosome 2 of D. simulans. If the rates from D. melanogaster are used as surrogates, an additional four data sets can be considered (for 16 genes, P = 0.038, onetailed); the true rates in D. simulans are probably higher (see Trueet al. 1996).
P values are not uniformly distributed: Of interest is not only the direction of the discrepancy between C_{hud}/θ_{w} and c/2μ but also the magnitude of the difference. To quantify this, we ran coalescent simulations under the assumption that C = C_{map}. Tabulated in Tables 1 and 2 for each locus are the proportion P of 10,000 simulated data sets with a C_{hud} value smaller than or equal to the observed one. Recall that under the null model, the distribution of P values across loci should be uniform; this is exceedingly unlikely: P < 10^{−15} and P < 10^{−9} for D. melanogaster and D. simulans, respectively. We might expect an excess of high P values since many of our assumptions are conservative. Instead we observe too many low P values (Figure 4): for 13 of the 24 D. melanogaster loci, P(C_{hud}) < 0.05. Similarly, in D. simulans, 7 out of 16 loci have P values below 0.05. We conclude that either C_{hud} is too small, i.e., there is too much linkage disequilibrium given our assumed recombination rate, or θ is too large, i.e., there is too much diversity given our assumed mutation rate.
DISCUSSION
Recombination rates: Our results rely on the assumption that the laboratory rates of crossing over, which are interpolated from measurements over large distances, are not overestimates. The correlation between diversity levels and recombination rates (Begun and Aquadro 1992; Charlesworth 1996) suggests that the relative rates in different regions are well characterized, but the absolute rates could be in error. Multiple lines of evidence suggest that they are not: first, four direct intragenic measurements of recombination in D. melanogaster estimate rates that are equal to or above those inferred from larger distances (see references in Lindsley and Zimm 1992 for rudimentary, white, lozenge, and rosy). As an illustration, for white, the intralocus estimate of recombination is the same as the regional estimate of crossing over (2.5 × 10^{−8}/bp/generation), yet C_{hud}/θ_{w} is 18fold smaller than predicted in the sample from D. melanogaster. Second, differences between laboratory and natural conditions such as age and temperature appear to have only minor effects on the rate of recombination (Ashburner 1989, pp. 460–461). Genetic background seems to have small effects as well: recombination rates estimated from F_{1} progeny of laboratory strains and wild lines of D. melanogaster (Brooks and Marks 1986) are in close agreement with rates estimated from laboratory strains. To explain our observations, laboratory rates would have to be systematic overestimates of the true rate, rather than randomly erroneous, since for 40 loci, C_{hud} is almost always lower than expected. Yet the laboratory estimates ignore the contribution of gene conversion, which, on the scale of a gene, should be on the order of crossing over (Andolfatto and Norborg 1998). Thus, if anything, the laboratory measurements of crossing over are more likely to be twofold underestimates of the total rate of genetic exchange on an intralocus scale.
A concern in Drosophila melanogaster is the presence of highfrequency chromosomal inversions in natural populations (Lemeunier and Aulard 1992), since heterozygotes experience reduced recombination. The ensuing reduction in the population recombination rate is at most twofold (if inversions are at 50% frequency and with complete inhibition of recombination in heterozygotes); this is not sufficient to account for the skew in Figure 4a. The presence of inversions in the samples themselves could be expected to generate linkage disequilibrium. With the exception of Boss and Mlc1, inverted chromosomes were excluded when the locus is close to a breakpoint of a common inversion. The existence of a highfrequency inversion may still affect the extent of association between sites on standard haplotypes close to the breakpoints (this is analogous to sampling from one deme in a subdivided population). Farther from the inversion breakpoints, gene conversion between chromosomal rearrangements is likely to be high enough (Chovnick 1973; Andolfattoet al. 1999) to homogenize differences between arrangement classes for genes. In support of this, a test for subdivision (Hudsonet al. 1992a) between inverted and standard chromosomes for Est6, located in the middle of In(3L)P, was not significant (P = 0.5871). Most salient to our observation, inversions are rare and at low frequency in D. simulans and on the X chromosome of D. melanogaster (Lemeunier and Aulard 1992), which also have significantly low values of C_{hud} (for the eight loci on the X chromosome in D. melanogaster, a uniformity test yields P < 10^{−9}). Note finally that autosomal inversion heterozygosity can also increase rates of recombination on the X chromosome, which would make laboratory estimates of c for Xlinked loci underestimates of recombination rates in natural populations (Schultz and Redfield 1951; Sniegowskiet al. 1994).
Mutation rates: A second assumption is that the neutral mutation rate per base pair per generation is . This assumption enters into the signs test directly and as a means to estimate N for our simulations (see methods). In what follows, we review what is known about the mutation rate in Drosophila and discuss the sensitivity of our results to this parameter.
Synonymous divergence estimates vary across loci from 0.37 × 10^{−8}/bp/year to 2.98 × 10^{−8} (with an average of 1.56 × 10^{−8}; Li 1997, p. 191). These estimates rely on my for the time to the common ancestor of the melanogaster and obscura species groups (Throckmorton 1975). While could be in error (e.g., Russoet al. 1995 suggest mya), the substitution rates estimated for D. melanogaster are consistent with several independent estimates from Hawaiian Drosophila (Rowan and Hunt 1991; Kambyselliset al. 1995; Fleischeret al. 1998) as well as with the results of mutation accumulation experiments (which do not rely on the same assumptions about the substitution process or the time to divergence). On the basis of the pooled results of three spontaneous mutation rate experiments, Harada et al. (1993) estimated the allozyme bandmorph mutation rate to be 7.5 × 10^{−7} per generation for an average protein size of 400 amino acids. Assuming that roughly onethird of changes are detected (Keightley and EyreWalker 1999), this rate corresponds to 2 × 10^{−9}/bp/generation (with an upper 95% confidence limit of 4.75 × 10^{−9}).
Comparing pergeneration rates to peryear rates is complicated by the fact that we do not know the annual number of generations in the wild. In the laboratory, the generation time is ~14.5 days (i.e., 25 generations a year) at 20° (Ashburner 1989, pp. 193–194). Generation time at 20° can be as long as 23 days in some species of Drosophila (~16 generations a year) although it is generally shorter for temperatures between 20° and 30°. Both species considered here, which are now cosmopolitan, are thought to have spent most of their evolutionary history in warm climates (Lachaiseet al. 1988). In summary, the plausible range of mutation rates per generation varies from 0.6 × 10^{−9} (the average synonymous substitution estimate with 26 generations a year) to 4.75 × 10^{−9}. There is a genomewide excess of linkage disequilibrium for any mutation rate within this range (results not shown).
Variation in mutation rates: If our mutation rates and laboratory estimates of recombination are roughly correct, then C_{hud} is unexpectedly low; i.e., there is a genomewide excess of linkage disequilibrium. Several departures from the standard neutral model assumptions could potentially generate an excess of linkage disequilibrium. In our simulations, mutations are placed uniformly along the sequence while in actual data sets, mutations are clustered, presumably because of heterogeneity in selective constraints. Thus, the average distance between segregating sites may be smaller in actual data than in simulated data, perhaps resulting in more linkage disequilibrium.
To test this possibility, we ran coalescent simulations with the same parameters as before (see methods) but with a simple model of variation in mutation rates (see Figure 1). The results for D. melanogaster are presented in Table 3. To verify that our model produces at least as much spatial heterogeneity as observed in the data, we tabulated the longest distance (D_{max}) between any two consecutive polymorphisms in the actual data set (cf., Goss and Lewontin 1996). For an equal number of segregating sites, a greater D_{max} indicates more spatial heterogeneity of mutations. We kept track of the proportion of the simulated runs with D_{max} greater than or equal to the observed D_{max} when rates vary fourfold. As indicated in column three of Table 3, the actual data show much less heterogeneity than does our model. Yet a fourfold variation in mutation rates has very little effect on our results: for D. melanogaster, for instance, a uniformity test on the distribution of P(C_{hud}) still yields P < 10^{−10}. While actual data is not heterogeneous in exactly the same way as predicted by our model, these simulations (and simulations with an eightfold variation in mutation rates; results not shown) suggest that even strong heterogeneity does not alter our qualitative conclusions. We can therefore rule out variation in mutation rates as being an important factor.
Variation in recombination rates: C_{hud} does not use information about the distance between segregating sites, only the extent of association among them. Smallscale variation in recombination rate should look similar to variation in mutation rates, with the history of sites within a locus more or less correlated than expected given the physical distance between them. In fact, for an infinite sites model, variation in recombination rates can be implemented in the same way as is done for variation in mutation rates (see methods). The simulated sequence can be thought of as a sequence in genetic rather than physical distance. Variation in recombination rates affects the translation of the genetic distance into physical distance by a factor similar to the one used to model variation in mutation rates. For example, if a particular interval has a low rate of recombination, then the number of mutations placed on it will be larger than expected given the (genetic) distance. So if recombination rates vary to the same extent as modeled for mutation rates, they should have only minor effects on C_{hud}.
On a larger scale, if the Drosophila genome were a collection of few hotspots and many coldspots for recombination, rates averaged over larger distances could yield systematic overestimates of the mean rate. Since we observe few data sets where C_{hud} is above the laboratory rate, recombination rates at these hotspots would have to be several orders of magnitude above those observed at most loci. Whether this is plausible is unclear. Recombination hotspots of this magnitude have been reported in fungi, humans, and mice (Lichten and Goldman 1995; Jeffreyset al. 1999) as well as in maize (e.g., Dooner and MartinezFerez 1997) but, to our knowledge, not in Drosophila.
Departures from the demographic assumptions of the standard neutral model: Departures from the demographic assumptions of a panmictic equilibrium can also generate an excess of linkage disequilibrium (and possibly decrease C_{hud}). If there is population structure, for example, the effective population recombination rate should be decreased relative to panmixia since haplotypes in different subpopulations will not have a chance to recombine as often. In the data analyzed here, sampling schemes vary greatly across loci, with a third sampled entirely outside of Africa. If samples are partitioned by sampling location, populationspecific estimates of C_{hud} are sometimes higher and sometimes lower than are estimates based on total samples, so there is no clear effect of pooling different populations (results not shown). However, few samples include a large number of sequences from multiple populations, so this approach is not very informative.
Instead, we try to assess the fit of the data to simple demographic models. Several researchers have argued that populations sampled at present are the result of the recent admixture of previously subdivided populations (e.g., Richteret al. 1997; Hassonet al. 1998). This claim usually stems from the observation of two or more highly diverged haplotypes (e.g., Hudsonet al. 1997; Labateet al. 1999). Very recent admixture (i.e., several hundred years ago) should look similar to sampling two demes at the present. This suggests an island model where samples are drawn from both subpopulations. But it is also possible that most or all samples are drawn from only one historical deme. D. melanogaster and D. simulans are thought to have an African origin (Hale and Singh 1987; Lachaiseet al. 1988) but the number of demes is unknown. Most sampled localities may have been founded from one historical deme.
We evaluate the fit of a symmetric twoisland model by considering changes in the distribution of P(C_{hud}) and P(B') as the migration parameter 4Nm varies from 0.2 to 1. These values of 4Nm are lower than what is suggested by most F_{ST} values. For all parameter values we tried, sampling from both demes was a worse fit to the data than sampling entirely from one (results not shown). Thus, we present results only for the latter case. As expected, as the migration rates decrease, there is more linkage disequilibrium in the simulated data. As a result, low values of P(C_{hud}) and P(B') become more likely. In Table 4, we report the probability U that the distributions of P(C_{hud}) and P(B') are uniform (as they should be if the null model is correct). In both species, many C_{hud} values are still too low for 4Nm > 1 (U = 0.006 for D. melanogaster and 0.004 for D. simulans). For 4Nm ≤ 1, there are too many high P(B') values, i.e., too little linkage disequilibrium in the data relative to the predictions of the model. Many of the assumptions made in this analysis are no longer conservative when considering the other tail (i.e., too little linkage disequilibrium). However, this analysis does suggest that a simple island model will only account for the low values of C_{hud} if migration rates are lower than suggested by levels of population differentiation and that the model can explain some aspects of linkage disequilibrium (C_{hud}) but not others (B'). When a summary of the frequency spectrum (Tajima 1989) is considered along with aspects of linkage disequilibrium, an island model is found to be an even worse fit (results not shown). Under a fiveisland model where only one island is sampled, the qualitative results are the same (results not shown).
It may not be surprising that an equilibrium island model is an inadequate demographic model, as the history of the species is likely to have been much more complex. In particular, it has been argued that nonAfrican populations have experienced a population bottleneck, e.g., with the European colonization of the Americas (David and Capy 1988; Begun and Aquadro 1993, 1994, 1995). A variant on this model further assumes that populations were subdivided in Africa and that nonAfrican populations are the result of recurrent founder events from different historical demes (Hamblin and Veuille 1999). Either version predicts a reduction in variation outside of Africa relative to African populations. For D. melanogaster, the data in support of this hypothesis are equivocal: loci on the X show a reduction in variability in American populations relative to those of Zimbabwe (Begun and Aquadro 1993), while loci on the autosomes are equally likely to be more or less variable outside of Africa (P. Andolfatto, unpublished results). For D. simulans, there does appear to be a reduction in variability outside Africa (Irvinet al. 1998), although population structure in Africa may be a confounding factor (P. Andolfatto, unpublished results).
Interestingly, for loci in D. melanogaster, C_{hud} values appear to be higher when estimated on the basis of African populations alone than when estimated from nonAfrican populations (such a comparison is possible only for Acp26a, Adh, In(2L)t, and Vermilion). This is not true of the two suitable loci in D. simulans (G6pd and Vermilion). These findings might suggest that African D. melanogaster populations are closer to linkage equilibrium than are nonAfrican ones. However, samples from Africa tend to be smaller than those from outside Africa, and we expect the median C_{hud} value to be larger for smaller samples, even in the absence of population structure (Hudson 1987). A proper test of this demographic model will require more consistently sampled data and a large number of sequences for multiple populations.
Natural selection: An alternative to a demographic explanation is that natural selection has acted on or near many of the loci. (These two alternatives are not mutually exclusive.) While a demographic departure from model assumptions may be a more parsimonious explanation for a systematic trend, it remains possible that natural selection is pervasive and commonly leads to increased linkage disequilibrium. In addition, it should be kept in mind that many of the loci analyzed here were collected because of prior evidence for selection.
One mode of selection thought to be prevalent in the Drosophila genome are “selective sweeps” in which a rare advantageous allele is rapidly fixed in the population (Maynard Smith and Haigh 1974). What effect selective sweeps would have on C_{hud} is unknown. Along with a skew in the frequency spectrum, a prediction of this model is a reduction in variability near the selected site (Bravermanet al. 1995). However, the loci with a significant skew in the frequency spectrum (Tajima 1989) do not have noticeably lower levels of variation than other genes (significance is assessed with simulations that condition on the laboratorybased estimate of C). Of the five such loci in D. melanogaster (Dpp, Hsp83, Ref2p, Tpi, and Vermilion) and three in D. simulans (Period, Runt, and Tpi), two and two, respectively, have θ_{w} above the median of loci (results not shown). The scenario just outlined assumes that a variant is selected while rare and swept to fixation in a panmictic population. A variation on this model is transient selection, where a variant is only swept to intermediate frequency (Hudsonet al. 1994). Other types of selection that could generate linkage disequilibrium include local adaptation (e.g., Stephanet al. 1998) and “traffic models” where multiple, linked beneficial mutations are selected for simultaneously (Kirby and Stephan 1996). These models may not lead to a detectable reduction in variability; none has yet been modeled explicitly (but see Gillespie 1997).
Epistatic interactions between nearby sites can also generate considerable linkage disequilibrium. In this type of model, recombinant haplotypes are selected against, reducing the effective frequency of crossing over. As an example, Adh protein production in D. melanogaster has been shown to be determined by epistatic interactions among multiple polymorphisms (Stam and Laurie 1996). Compensatory interactions between sites involved in the maintenance of secondary premRNA structure might also generate linkage disequilibrium (e.g., Kirbyet al. 1995). Intralocus epistasis would have to be pervasive to account for the genomewide departure reported here.
In conclusion, unless it can be demonstrated that laboratorybased estimates of the crossingover rate are systematic overestimates of the recombination rate, any theory for patterns of variability in the genome will have to accommodate a genomewide excess of linkage disequilibrium (as measured by C_{hud}). If this excess is best explained by a demographic departure from the standard neutral model, this has important implications for both parameter estimation (such as estimates of θ) and for inferences from patterns of polymorphism at a particular locus. An alternative interpretation is that the mutation rates are much higher than indicated by levels of divergence at silent sites and noncoding DNA for any gene. This would point to a different but equally serious problem with the neutral theory.
Acknowledgments
We thank B. Charlesworth, M. Foote, M. Hamblin, R. Hudson, M. Kreitman, and J. Wall for helpful discussions and comments on an earlier version. This manuscript was significantly improved by comments from A. Clark, C. Langley, and an anonymous reviewer. R. Hudson, J. Wakeley, and J. Wall provided computer programs, S.C. Tsaur provided unpublished Adh data, and J. Comeron and J. True provided genetic and physical map data.
Footnotes

Communicating editor: A. G. Clark
 Received July 24, 1999.
 Accepted May 25, 2000.
 Copyright © 2000 by the Genetics Society of America