Abstract
Simulations of positive directional selection, under parameter values appropriate for approximating human genetic diversity and rates of recombination, reveal that the effects of strong selective sweeps on patterns of linkage disequilibrium (LD) mimic the pattern expected with recombinant hotspots.
In several cases, the local distribution of meiotic recombination in humans is nonuniform and concentrated into small regions, on the order of 1 kbp in size, termed recombinant hotspots (for reviews see Petes 2001; Arnheim et al. 2003; de Massy 2003; Wall and Pritchard 2003; Kauppi et al. 2004). These recombinant hotspots appear to be a common feature in the human genome (Crawford et al. 2004; McVean et al. 2004) and are rarely shared between humans and closely related species (Wall et al. 2003; Ptak et al. 2004, 2005; Winckler et al. 2005). This suggests recombinant hotspots may be rapidly evolving and species specific. These hotspots contribute to the “haplotype block” pattern of genetic regions with high linkage disequilibrium (LD; nonindependent associations between alleles at different positions) separated by boundaries of low LD, which is actively being characterized to optimize marker choice for association mapping studies (International HapMap Consortium 2003, 2005; Tishkoff and Verrelli 2003). Knowledge of the distribution of LD is critical to mapping the genetic basis of complex phenotypes (Weiss and Clark 2002). Methods have been developed to detect recombinant hotspots from DNA sequence data, which utilize patterns of LD to infer the existence of these hotspots (e.g., Chakravarti et al. 1984; Li and Stephens 2003; Zhang et al. 2004; for review see Stumpf and McVean 2003). In particular, Li and Stephens (2003) developed a coalescent-based “product of approximate conditionals” model, which uses the distribution of haplotypes to estimate the likelihood of the underlying recombination rate.
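To make the definition of LD above concrete, the following is a minimal sketch of two standard pairwise LD statistics, D and r², computed from a sample of haplotypes. It is an illustration only: the function name `ld_stats` and the 0/1 allele coding are assumptions introduced here, and the analyses in this note rely on the haplotype-based model of Li and Stephens (2003), not on pairwise statistics.

```python
def ld_stats(haplotypes, i, j):
    """Return (D, r2) between biallelic sites i and j.

    haplotypes: list of equal-length sequences of alleles coded 0/1
    (hypothetical coding; 1 marks the allele being tracked at each site).
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n   # frequency of allele 1 at site i
    p_b = sum(h[j] for h in haplotypes) / n   # frequency of allele 1 at site j
    # Frequency of haplotypes carrying allele 1 at both sites
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b                      # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r2 is undefined when either site is monomorphic in the sample
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Toy samples (hypothetical): complete association vs. independence
linked = [(0, 0), (0, 0), (1, 1), (1, 1)]
unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(ld_stats(linked, 0, 1))
print(ld_stats(unlinked, 0, 1))
```

In the first toy sample the alleles at the two sites always co-occur, so r² is at its maximum of 1; in the second all four haplotypes are equally frequent, so D and r² are both 0.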
Positive directional selection, in which a new mutant rises in frequency and quickly fixes in a population (i.e., a selective sweep), can be rapid on an evolutionary timescale and/or population specific (for reviews see Andolfatto 2001; Aquadro et al. 2001; Schlötterer 2002a; Bamshad and Wooding 2003). One predicted effect of this type of selection on a sample of DNA sequences is an increase in LD in regions flanking the site undergoing selection (Kim and Stephan 2002; Przeworski 2002), but a reduction of LD across the site of selection (Kim and Nielsen 2004). This dual pattern may not be intuitive at first, so consider that reducing the genealogical history of a sequence reduces the number of recombination events, thus generally increasing LD in the region. However, for linked genetic variation to be present immediately after a selective sweep, it must either be a new mutation, and therefore rare and contribute little to overall LD, or have experienced recombination during the sweep with the target of selection. This suggests that haplotypes with LD between polymorphic alleles that span the target of selection will not persist beyond the fixation of the selected allele. Here we do not address gene conversion, which could preserve LD over the site of selection. This pattern of two regions of high LD, separated by low LD, is similar to the pattern of LD expected with a recombinant hotspot. Furthermore, the speed and species specificity of selective sweeps may also mimic the species-specific distribution of recombinant hotspots.
To explore the possible effect of selective sweeps on the inference of recombinant hotspots, we simulated positive directional selection of varying intensities (using SelSim 2.1, Spencer and Coop 2004) and applied hotspot detection software (Hotspotter 1.0, Li and Stephens 2003) to the resulting simulated sequences. The parameter values for the simulations were picked to approximate a 10-kbp sequence of DNA from a human population sample. We found that for strong positive selection (σ ≈ 100, where σ ≡ 2Ns, N is the diploid population size, and s is the relative strength of selection), a locally elevated recombination rate can falsely be inferred in the region of selection with statistical significance 22% of the time, corresponding to an excess of 16 percentage points over the false positive rate (FPR) from neutral simulations (Figure 1). This selective elevation over the neutral FPR is highly significant (P = 1 × 10⁻⁷; see Figure 1 legend). However, as selection becomes even stronger (σ ≥ 300), there is a rapid drop in the FPR, probably due to a loss in power associated with a paucity of genetic variation remaining immediately after a strong selective sweep. The patterns of LD resulting from positive selection can produce locally elevated estimated rates of recombination that are similar to the relative rates reported in the literature (e.g., Figure 2; cf. Jeffreys and Neumann 2002; Wall et al. 2003; McVean et al. 2004; Ptak et al. 2004; Verrelli and Tishkoff 2004; Winckler et al. 2005). The geometric mean of estimated recombination rates at the site of selection, from 100 replicates at σ = 100, is 13.25 times higher than the background rate. Furthermore, this elevated FPR can persist for up to N generations (0.5 × 2N generations) after the selective sweep has ended (Figure 3). Assuming an effective human population size of 10,000 and an average generation time of 25 years, this corresponds to a maximum persistence time of ∼250,000 years.
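The unit conversions in the preceding paragraph can be checked with a short sketch. The population size and generation time are the values assumed in the text; the function names are hypothetical and introduced here only for illustration.

```python
N_E = 10_000            # effective (diploid) population size assumed in the text
GENERATION_YEARS = 25   # average generation time assumed in the text


def selection_coefficient(sigma, n=N_E):
    """Invert sigma = 2*N*s to recover the selection coefficient s."""
    return sigma / (2 * n)


def persistence_years(time_in_2n_units, n=N_E, gen_years=GENERATION_YEARS):
    """Convert a time measured in units of 2N generations into years."""
    return time_in_2n_units * 2 * n * gen_years


print(selection_coefficient(100))  # s implied by sigma = 100 at N = 10,000
print(persistence_years(0.5))      # 0.5 x 2N generations expressed in years
```

Under these assumptions, σ = 100 corresponds to s = 0.005, and 0.5 × 2N generations corresponds to 250,000 years, matching the persistence time quoted above.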
Figure 1. A plot of the relative FPR of inferring significantly elevated local recombination rates along a simulated sequence of recombining DNA. The increase (in percentage) under a selective sweep scenario is plotted relative to the FPR from neutral simulations. The position of the site under positive selection was fixed at 0.45. Each population sample consists of 100 sequences. The recombination parameter ρ ≡ 4Nr (for a diploid, where r is the per-generation recombination rate) was uniform and set to a value of 10 over the total region. The mutation parameter θ ≡ 4Nμ (where μ is the per-generation mutation rate) was also set to 10 for the region. A stochastic model of positive selection was used and the fixation of the selected allele was modeled as just completing in the sampled generation. One hundred replicates at each value of selection, σ, were generated and analyzed assuming a single hotspot 0.1 units wide of fixed location (0–0.1, 0.1–0.2,…, 0.9–1). A significantly elevated recombination rate was called when the lower bound of the 95% confidence interval of the local estimated recombination rate was higher than the background recombination rate estimate (this includes both hotspots and “warm” spots). In general, FPR excesses higher than 4 (counts out of 100) are significantly elevated (P < 0.05), assuming a Poisson distribution of false positives with a mean equal to the neutral rate, which is an average of six false positives.
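The significance criterion in the Figure 1 legend can be sketched as an upper-tail Poisson calculation. This is a minimal illustration, assuming the tail is taken as P(X ≥ k) with the neutral mean of six false positives per 100 replicates; the function names are hypothetical, and the exact tail convention used for the reported P-values is not stated in the text.

```python
import math

NEUTRAL_MEAN = 6  # average neutral false positives per 100 replicates (from the legend)


def poisson_upper_tail(k, mean):
    """Return P(X >= k) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below


def smallest_significant_excess(mean=NEUTRAL_MEAN, alpha=0.05):
    """Smallest excess over the neutral mean whose upper-tail probability is below alpha."""
    excess = 0
    while poisson_upper_tail(mean + excess, mean) >= alpha:
        excess += 1
    return excess


print(smallest_significant_excess())
```

With a mean of 6, the first count whose upper-tail probability falls below 0.05 is 11, an excess of 5, which agrees with the legend's statement that excesses higher than 4 are significant.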
Figure 2. An example of the inferred relative recombination rate from a sample simulated under the conditions described in Figure 1 with σ = 100. The recombination rate of each window, with a width one-tenth of the total sequence, was estimated relative to that of the remaining region. This example illustrates how a false hotspot of recombination may be inferred. In this case, the “hotspot” is at position 0.45 and has a local recombination rate estimate 49 times higher than that of the surrounding sequence. The upper and lower bounds of the 95% confidence interval are also plotted, and a significantly elevated point estimate is represented by a solid circle.
Figure 3. A plot of the excess, over neutrality, of the FPR of inferring a significantly elevated recombination rate at the position of selection vs. time in units of 2N generations. Solid circles denote a significantly elevated FPR relative to neutral simulations (see Figure 1 legend).
We do not mean to imply that true recombinant hotspots do not exist in humans; they have certainly been verified by experimental means (e.g., Hubert et al. 1994; Cullen et al. 1995; Smith et al. 1998; Yip et al. 1999; Jeffreys et al. 2001). But we do suggest caution when inferring the existence of hotspots solely on the basis of patterns of LD. The transient nature of positive selection, both over time and between populations, may easily mimic the rapidly evolving nature of recombination in primates. When a hotspot is inferred, it may be useful to also address the relative levels of genetic variation compared to levels of divergence (Ptak et al. 2004) to help rule out past positive selection, particularly since recombination may be associated with a mutagenic process (Rattray et al. 2002; Hellmann et al. 2003) and selective sweeps can quickly remove genetic variation. However, recombinant hotspots and selective sweeps may be linked at a more basic level. There is evidence that hotspot crossover asymmetry can result in a form of meiotic drive (Jeffreys and Neumann 2002), which itself is a “selfish” form of positive selection (for review see Reed et al. 2005 and references therein). This crossover asymmetry predicts that a derived recombination-suppressing allele will eventually fix in the population (Jeffreys and Neumann 2002), resulting in the co-occurrence of both a recombinant hotspot and a progressing selective sweep.
The possibility exists that inferred recombinant hotspots in gene regions that also appear to have undergone positive selection (e.g., Verrelli and Tishkoff 2004) are not due to nonuniform densities of meiotic recombination, but may simply be a by-product of positive selection. In the same vein, if recent positive selection plays a significant role, the rate of hotspot sharing between species inferred from LD analysis (Ptak et al. 2005) may be underestimated, and short-scale LD may be lower than expected (e.g., Pritchard and Przeworski 2001). If selective sweeps do make a significant contribution to the patterns of LD in the human genome, then a better understanding of the effects of positive selection may have important implications for projects that characterize LD for association studies, particularly to the extent that selective pressures may have varied among human populations (e.g., Hamblin and Di Rienzo 2000; Schlötterer 2002b; Akey et al. 2004; Storz et al. 2004).
Acknowledgments
We thank Yuseob Kim and Michael Li for helpful suggestions. We also thank two anonymous reviewers for their feedback on a previous version of this manuscript. This work was supported by a Burroughs Wellcome Fund and David and Lucile Packard Career Awards to S.A.T. F.A.R. was partially supported by the Center for Bioinformatics and Computational Biology, University of Maryland.
Footnotes
Communicating editor: M. Nordborg
- Received October 8, 2005.
- Accepted December 23, 2005.
- Copyright © 2006 by the Genetics Society of America