There has been considerable recent interest in understanding the way in which recombination rates vary over small physical distances, and the extent of recombination hotspots, in various genomes. Here we adapt, apply, and assess the power of recently developed coalescent-based approaches to estimating recombination rates from sequence polymorphism data. We apply full-likelihood estimation to study rate variation in and around a well-characterized recombination hotspot in humans, in the β-globin gene cluster, and show that it provides similar estimates, consistent with those from sperm studies, from two populations deliberately chosen to have different demographic and selectional histories. We also demonstrate how approximate-likelihood methods can be used to detect local recombination hotspots from genomic-scale SNP data. In a simulation study based on 80 100-kb regions, these methods detect 43 out of 60 hotspots (ranging from 1 to 2 kb in size), with only two false positives out of 2000 subregions that were tested for the presence of a hotspot. Our study suggests that new computational tools for sophisticated analysis of population diversity data are valuable for hotspot detection and fine-scale mapping of local recombination rates.
RECOMBINATION plays a central role in shaping patterns of molecular genetic diversity in natural populations. There is growing evidence of extensive variation in recombination rates over scales as small as kilobases, although neither the mechanism nor the pattern of this variation is well understood (Daly et al. 2001; Jeffreys et al. 2001; Petes 2001; Nachman 2002; Schneider et al. 2002). In humans, the traditional approach to estimating recombination rates has been through pedigree studies, but these have resolution only over scales of centimorgans (or megabases in physical distance at the human genome-wide average recombination rate). Even in organisms where breeding experiments are more straightforward, the landscape of fine-scale variation in the recombination rate remains relatively uncharted.
Recently, laboratory-based studies have mapped recombination hotspots within several human loci from analysis of crossovers detected in sperm, for example, to intervals of 1–2 kb within the major histocompatibility (MHC) class II region (Jeffreys et al. 2001; Jeffreys and Neumann 2002). However, achieving high resolution using laboratory-based methods is technically difficult and costly, and the recombination rate estimates are specific to males. As there are substantial differences in sex-specific recombination rates at the centimorgan scale (Kong et al. 2002), we should not expect fine-scale recombination rate estimates from sperm studies to be fully informative of the evolutionary process of recombination, which averages across males and females, and over long spans of time.
Another source of information about local recombination rates is population genetics data. These data do not simply enumerate a direct count of the recombination events, but have the advantage of reflecting the evolutionary process. While qualitative information about underlying recombination rates can sometimes be obtained from standard pairwise measures of linkage disequilibrium (LD), quantitative estimation of recombination rates from polymorphism data is a challenging statistical problem that has only recently begun to attract considerable attention (e.g., Griffiths and Marjoram 1996a; Kuhner et al. 2000; Wall 2000; Fearnhead and Donnelly 2001, 2002; Hudson 2001; McVean et al. 2002). At one end of a spectrum of statistical approaches are simpler methods, which ignore much of the information in the sample and base inference on one or a few summary statistics (Wall 2000). At the other extreme, so-called full-likelihood methods utilize all of the information in the data, but at the expense of being highly computationally intensive (Griffiths and Marjoram 1996a; Kuhner et al. 2000; Fearnhead and Donnelly 2001). An intermediate class of computational methods approximates the true likelihood in various ways (Hudson 2001; Fearnhead and Donnelly 2002; McVean et al. 2002; Li and Stephens 2003). To date, virtually all statistical methods have assumed that the recombination rate is constant over the interval being studied.
With the advent of surveys of molecular genetic variation on genomic scales, particularly but not exclusively in humans, computational population-based methods should enable fine-scale recombination rate variation to be characterized across the human and other genomes, providing information that may be crucial not only for disentangling the molecular, demographic, and selective factors generating linkage disequilibrium, but also for association mapping of complex diseases (Kruglyak 1999; Jorde 2000; Ott 2000; Pritchard and Przeworski 2001; Reich et al. 2001).
Here we adapt, apply, and assess two methods to study different aspects of fine-scale variation in recombination rates from population data. We first apply full-likelihood estimation (Fearnhead and Donnelly 2001) to study fine-scale rate variation around and inside one of the most well-characterized recombination hotspots in the human genome. The β-globin gene complex on human chromosome 11p (Figure 1) includes a recombination hotspot that was originally identified by Chakravarti et al. (1984). In that study, a pattern of random association between 5′ and 3′ haplotypes for the β-globin complex, based on restriction fragment length polymorphisms (RFLPs), was replicated in four populations, suggesting a recombination hotspot 5′ to the β-globin gene within a 9.1-kb interval marked by a TaqI RFLP 5′ to the δ-globin gene (Figure 1). Chakravarti et al. (1984) estimated recombination rates to be elevated 3–30 times above the genome-wide average. Recent single-sperm typing of ∼5000 informative meioses gave an estimate for the male recombination fraction across the hotspot of 80 × 10−5 kb−1 [95% confidence interval (C.I.): 9 × 10−5–160 × 10−5] and detected no recombination events over the adjacent 5′ 90-kb region (Schneider et al. 2002). (The human genome-wide average recombination rate is ∼1 × 10−5 kb−1.) There have also been six observations to date in families of recombination with flanking exchange in the β-globin cluster (Smith et al. 1998). The precision with which these crossovers can be localized to the hotspot varies, being limited by the location of informative single-nucleotide polymorphisms (SNPs) in the parental chromosomes. However, location within the hotspot can be confirmed for most, and none is definitely outside it. A recent population-based analysis, adopting different statistical approaches from those described here, narrowed the recombination hotspot to a 2-kb region just 5′ of the β-globin gene (Wall et al. 2003).
The availability of molecularly phased population data from a worldwide sample of populations, and of direct estimates of recombination rates across the hotspot from sperm studies, is ideal for examining both the accuracy of full-likelihood estimation and its sensitivity to different population demographic histories. We analyze samples from two populations, The Gambia and the United Kingdom, deliberately chosen on the basis of their different demographic histories and selective pressures. Another valuable feature of these data for the present analyses is that there are no additional uncertainties associated with allelic (haplotype) inference by statistical methods.
There is growing evidence that recombination hotspots may be widespread across the human genome, although to date relatively few have been characterized. It is thus of considerable interest to be able to detect hotspots (defined here as small regions where the recombination rate is increased considerably relative to the local background rate) on genomic scales, and such information would also be invaluable for the design and analysis of disease studies. Again, one potential source of information is population data, with the increasing availability of population surveys of genetic diversity over genomic scales. Because of the computational burden involved, it does not seem practicable to adapt full-likelihood methods to this problem. Here we present and study a new method for detecting hotspots on the basis of an approximate-likelihood approach (Fearnhead and Donnelly 2002), which can closely mimic full-likelihood inference. Informally, the approach is to separately analyze small subregions of the genome, and then to detect hotspots by comparing the likelihood curves for each subregion against that for an underlying background rate.
MATERIALS AND METHODS
β-Globin DNA sequences:
The haplotype sequences were reported in Harding et al. (1997) and consist of data from 31 chromosomes from The Gambia and 4b chromosomes from Oxfordshire, United Kingdom. The chromosomes were sequenced in a 3-kb region that encompasses the β-globin gene (see Figure 1). Haplotypes were determined experimentally. Polymorphisms at sites 318 and 320 are associated with length variation rather than with nucleotide substitution and were excluded from our analysis. All other polymorphic sites were included.
Within the coalescent framework, the optimal approach to analyzing population data is via full-likelihood statistical methods: these methods use all of the information in the data. Calculating the likelihood surface is usually intractable, as it involves integrating over all possible genealogical histories that are consistent with the data. There are a number of computational approaches to approximating the full-likelihood surface, via importance sampling (Griffiths and Tavaré 1994; Fearnhead and Donnelly 2001) or Markov chain Monte Carlo (Kuhner et al. 1995); see Stephens and Donnelly (2000) for a review.
Approximating the full-likelihood surface in the presence of recombination is a particularly challenging statistical problem. All the computational methods can produce an arbitrarily accurate approximation to the full-likelihood surface given infinite computing time. This is not necessarily the case for practical amounts of computing time, especially for some available methods, for which the approximations can be poor (Fearnhead and Donnelly 2001).
We use the infs program of Fearnhead and Donnelly (2001) to estimate the local recombination rate for the β-globin region. This is an importance sampling method for approximating the likelihood surface of the recombination and mutation rates, under a coalescent model, and has been demonstrated to be substantially more efficient than alternative methods [orders of magnitude more efficient than the methods of Griffiths and Marjoram (1996a) and Kuhner et al. (2000), but it can still take days of computing time to analyze even small data sets accurately, particularly when the recombination rate is large]. The model assumes a panmictic, constant-sized population, and an infinite-sites mutation model (which excludes the possibility of repeat mutation), but has been shown in simulation studies to perform reasonably well when the true demographic scenario differs from this (Fearnhead and Donnelly 2001). We return in the discussion to possible consequences of departures from these assumptions. The method is based on (i) simulating a set of possible genealogical histories for the data, (ii) calculating the likelihood surface for each of these histories, and (iii) approximating the full-likelihood surface by a suitable weighted average of the surfaces for each history.
The computational cost of full-likelihood inference has led to a number of approximate-likelihood methods being developed (Hudson 2001; Fearnhead and Donnelly 2002; McVean et al. 2002; Li and Stephens 2003). Such approximate methods seem essential for analysis of data over large genomic regions.
The approximate marginal-likelihood (AML) method of Fearnhead and Donnelly (2002) is based on an approximation to the likelihood surface for the data at sites whose minor allele frequency is above a certain threshold. The simulation results in Fearnhead and Donnelly (2002) suggest that if only singleton mutations are removed, then the AML can be a very close approximation to the full likelihood, but with a computational saving of 1–3 orders of magnitude. For detecting the hotspots from the simulated data, we used the program sequenceLD (available from www.maths.lancs.ac.uk/~fearnhea/software) to calculate the AML surface.
The pairwise likelihood (PL), which has also been called the composite likelihood, is a method devised by Hudson (2001) and extended by McVean et al. (2002). It involves multiplying together the likelihood surface for all pairs of segregating sites. This is a different approximation from that of the AML, but its computational cost is much smaller. In particular it can be used to analyze data with a large number of segregating sites (which the AML cannot). We adapted the program LDhat (available from www.stats.ox.ac.uk/~mcvean) to calculate the PL surface for the recombination rate for region II of the β-globin gene (see Figure 1).
Both the full- and the approximate-likelihood methods estimate the parameters as the values that maximize the corresponding likelihood surface. Obtaining confidence intervals is more difficult, as the standard asymptotic results for the distribution of either the estimators or the likelihood-ratio statistic do not necessarily hold due to the dependence between the data from different chromosomes. Currently, the only available theoretical results concerning estimators of the population-scaled recombination rate are to do with their consistency (Fearnhead 2003).
For the case of full likelihood and AML, simulation results suggest that confidence intervals with approximately the correct coverage probabilities can be obtained using the usual chi-square approximation to the generalized likelihood-ratio statistic (Fearnhead and Donnelly 2001, 2002). Less seems to be known about obtaining confidence intervals from the PL.
One alternative to estimating recombination rates is to try to detect historical recombination events from population data. A recent method, that of Myers and Griffiths (2003), uses a dynamic programming approach to provide a lower bound on the minimum number of recombination events in each subset of SNPs in the region. The approach is nonparametric, in the sense that it is not based on any model for the evolution of the population (although as currently implemented it does assume no repeat mutations). In contrast, all the statistical methods described above do require an evolutionary model [most are based on the standard coalescent, although several, for example, those of McVean et al. (2002), Fearnhead and Donnelly (2002), and Li and Stephens (2003), explicitly allow for repeat mutation], but their output is also different, in providing an estimate of the underlying recombination rate.
We applied the full-likelihood approach for estimating recombination rates from population data to previously reported sequence haplotypes (see materials and methods) from a 3-kb region (Figure 1) extending across the 3′ boundary of the hotspot. This data set is just within what can be practicably analyzed using the full-likelihood approach. Comparison of full-likelihood estimates with direct estimates from sperm typing, and with other population-based estimates, allows an assessment of the accuracy of the method. Comparison of estimates across two different population samples allows an empirical check of the robustness of the estimates to population demographic history.
To analyze rate variation, we divided the 3-kb region into a number of small subregions and separately estimated the recombination rate in each subregion. We used four subregions (see Figure 1). The choice of region I (as the first 510 bp) was based in part on the need for a region small enough to allow reliable estimation of the likelihood surface, and in part on the lack of polymorphism across the remainder of the sequence within the hotspot (effectively our region II), suggesting that recombination rates across region II would be better estimated by comparing haplotypes between regions I and III–IV. Region III consists of two exons and one intron. We excluded it from our analysis because it contains only two SNPs, one being the sickle cell mutation (position 917). There are no nonsynonymous mutations in region IV in our data.
A summary of the pattern of polymorphism in each sample is given in Table 1. The polymorphism data are consistent with a larger effective population size for the African compared to the non-African population, as has been observed elsewhere (e.g., Frisse et al. 2001). Previous analyses of this region have shown that the pattern of polymorphism is consistent with selective neutrality (Harding et al. 1997).
Figure 2 gives the joint-likelihood surfaces in the population mutation and recombination parameters for regions I and IV in both samples. These convey the information about recombination rates in the sequence data to which they relate. Table 2 gives point estimates and confidence intervals obtained from the likelihood surfaces as described in materials and methods.
In light of the relatively high level of variation in the region, one possible explanation for the high ρ estimates for region I is that it happens to have a deep genealogy by chance, and hence a greater opportunity for recombination events to happen and to be detected. The full-likelihood method incorporates inferences about the depth of the genealogy when producing estimates of the recombination rate. So given a fixed number of recombination events, evidence for a deep genealogy will reduce (as it should) the estimate of ρ. Furthermore, changing the estimate of θ will affect how deep the genealogy is likely to be, and thus varying θ enables us to assess the sensitivity of our estimate of ρ to this effect. Here, even varying θ by a factor of 2 has little effect on the estimates of ρ (data not shown).
To estimate the population recombination rate in region II, we assumed that the recombination rates in regions I, III, and IV were given by the respective full-likelihood estimates. The recombination rate between any pair of sites that straddle region II can then be parameterized solely in terms of the recombination rate within region II. We estimated this parameter using an extension of the PL method described in the materials and methods section. For each pair of sites that straddle the region we calculated the likelihood of the recombination rate in region II. Multiplying these likelihoods together for all such pairs gives us a pairwise likelihood that we maximized to estimate the recombination rate in region II. We converted the point estimates of ρ into estimates of the recombination rate per kilobase, using values of 15,000 and 9500 for the effective population size, Ne, of The Gambia and the United Kingdom, respectively. These values were obtained from the analysis of Harding et al. (1997), but we note that there are now many estimates of the human effective population size in the region 10,000–20,000 (Przeworski et al. 2000), and so the exact values used for Ne are not crucial to our main conclusions, nor are they a factor when comparing recombination rate estimates between regions within a population.
While the exact numerical estimates have associated uncertainty, two features are striking. The analysis suggests a difference of around two orders of magnitude in recombination rates in two short sequences separated by <1 kb. Further, the estimated rate in regions I and II is 30–50 times higher than the genome-wide average, suggesting that this 897-bp region is “hotter” than all but one of the recently characterized hotspots in the MHC (Jeffreys et al. 2001). The recombination rate per kilobase in regions I and II is similar to the average (male) rate per kilobase across the entire hotspot as estimated directly from sperm typing. The data are not consistent with all of the recombination activity across the known hotspot occurring within the 897 bp we studied, so that elevated recombination rates must also occur elsewhere in the 9.1-kb region. Each of these results is consistent with a recently published population-based analysis based on different samples and approximate-likelihood estimation techniques (Wall et al. 2003).
The AML method, which allows for repeat mutation, applied to regions I and II together gave similar results to those presented here. Single-sperm analysis (Schneider et al. 2002) has established that recombination rates are low across region III, and inclusion or exclusion of it made little difference to recombination rate estimates (data not shown). Similar results are obtained if different choices of subregions are made.
We now contrast the information available from the preceding coalescent analysis with that from some other current approaches. Figure 3 plots commonly used summary measures (e.g., Abecasis et al. 2001; Jeffreys et al. 2001; Dawson et al. 2002; Wall and Pritchard 2003) of pairwise LD for each sample. These are much less informative than the full-likelihood analysis described above: neither the qualitative pattern nor the quantitative extent of sharp variation revealed below is apparent from such plots.
On the same polymorphism samples we also applied the method of Myers and Griffiths (2003), which detects historical recombination events (see materials and methods). In the Gambian sample, the method revealed that there must have been at least nine recombination events in the history of the sample. Of these nine, all potentially occurred within the hotspot. Three can be definitively localized to within the hotspot and a further three could be localized to an area that includes, but is slightly larger than, the hotspot. None of the nine events is definitively localized outside the hotspot. In the United Kingdom sample there must have been at least three historical recombination events, of which two can be definitively localized within the hotspot, and the third localized to an area slightly larger than the hotspot with none definitively localized outside the hotspot. The results of the detection method are shown graphically in Figure 4.
The detection approach has the advantage of not relying on an evolutionary model, and the method we have used extends the well-known four-gamete test to “find” more historical recombination events than other approaches find. [Hudson and Kaplan's (1985) RM, the number of recombination events detected through the four-gamete test, is 2 for each data set.] Further, the results are consistent with, and complement, our estimation results, and the evidence for a hotspot is much clearer from the detection plot (Figure 4) than from plots of pairwise summaries of LD such as those in Figure 3.
It is important to remember that the detection approach provides a different kind of information from the estimation methods. Only a small proportion of actual historical recombination events will leave a conclusive trace even when, as here, resequencing provides all available SNPs in the data. Adopting the sperm-based rate estimate of 80 × 10−5 kb−1 (Schneider et al. 2002), assuming the rate to be the same in females, and Ne of 15,000 and 9500 for The Gambia and the United Kingdom, theory (Griffiths and Marjoram 1996b) would predict an expectation of 170 historical recombination events in the hotspot (regions I and II) in the Gambian sample and 110 in the United Kingdom sample. As another comparison, our coalescent approach estimates an average of 150 and 60 recombination events, respectively, in the Gambian and United Kingdom samples for region I alone.
Detection of hotspots:
We now consider the problem of detecting recombination hotspots using SNP data over large genomic regions. We present results from a detailed simulation study aimed at evaluating how feasible it is to detect recombination hotspots using the AML method. For definiteness, we based our simulation study on a model appropriate for human populations. The extent to which this model, and the results from the simulation study, are applicable to other organisms is discussed later. The parameter values for our model are given in terms of the population-scaled mutation rate, θ, and recombination rate, ρ. For a diploid population of effective population size Ne and probabilities of mutation and recombination per generation of u and r, respectively, these are defined by θ = 4Neu and ρ = 4Ner.
We simulated samples of 50 haplotypes, defined by SNPs, over 100-kb regions, using a coalescent model, which assumes a panmictic population, constant population size, and no selection. We assumed a finite-sites mutation model, which allows for repeat mutation, with each of the 100,000 bases treated as a separate site. At each site we used a two-allele mutation model, with mutations switching between these alleles. We modeled variable mutation rates by assuming that 99.9% of sites mutated at a rate equivalent to θ = 1 kb−1; the remaining sites had a mutation rate drawn from a gamma distribution with mean and standard deviation of θ = 0.1 base−1 (the mean rate is thus 100 times the background rate). Because of these hypervariable sites, the overall value of θ is 1.1 kb−1. Very little is known about the pattern of mutation rate variation in humans. Recurrent mutation can be misinterpreted as a signal of recombination, so that recombination hotspots should be more difficult to detect in the presence of recurrent mutation. Our average θ value of ∼1 kb−1 is consistent with available data for humans (Li and Sadler 1991; Cargill et al. 1999). The explicit inclusion of recurrent mutation at each base, and the presence of hypervariable sites, should move our assessment of the power to detect hotspots in the direction of being conservative.
We assumed a background recombination rate of ρ = 0.2 kb−1 and used four models for recombination hotspots: (A) no hotspot, (B) a 1-kb hotspot with ρ = 50 kb−1, (C) a 1-kb hotspot with ρ = 10 kb−1, and (D) a 2-kb hotspot with ρ = 10 kb−1. We did not allow for gene conversion.
The average recombination rate in the human genome is 1.2 cM/Mb (Kong et al. 2002). It is not currently known how much of this is due to hotspots, nor what their density is, so it is not clear in simulations how to apportion overall recombination rates to “background” and “hotspots.” Jeffreys et al. (2001) estimated the background recombination rate in the major MHC to be 0.04 cM/Mb, but extrapolation of this to the whole genome seems unwarranted at this stage. We have thus chosen a middle course, with the background rate in our simulations corresponding to one-third to one-sixth of the genome-wide average, depending on choice of Ne. The size of hotspots we assume is consistent with those detected in the MHC (Jeffreys et al. 2001). The hotspot rates correspond to 10 cM/Mb and 50 cM/Mb, or 50 and 250 times the background rate we use; the hotspots characterized in the MHC have rates varying from 0.4 to 140 cM/Mb (Jeffreys et al. 2001). The SNP density in our simulated data was 1 kb−1 on average.
Ascertainment effects can be an important factor in analyzing SNP data. Of course, different SNP discovery studies use different experimental designs. We modeled SNP ascertainment under one of the simplest, and perhaps most common, scenarios, that in which a SNP is discovered if it is polymorphic in a sample of two chromosomes. Thus samples of 52 sequences were simulated, with 2 randomly chosen sequences constituting a panel. Sites that were polymorphic in the panel were identified as SNPs. Our sample of sequences consisted of the remaining 50 sequences defined by their alleles at these SNPs. (Note that only our simulation method incorporated SNP ascertainment. The AML method for detecting hotspots does not use details of SNP ascertainment.)
Our approach to detecting hotspots was to first calculate the AML for a set of subregions that span the region of interest. The computational cost of calculating the AML surface depends primarily on the number of SNPs the subregion contains, and we chose each subregion to contain six SNPs (which gives a good compromise between the accuracy of inferences from the AML surface and the computational cost of calculating the AML surface). We allowed the subregions to overlap and had a new subregion starting after every third SNP.
For each subregion we calculated the AML surface using 100,000 iterations of the program sequenceLD, which took an average of 4 hr on a 900-MHz PC. The surface was calculated over a grid of 101 ρ values, equally spaced over the interval 0–10 kb−1, and a single θ value, 0.0015 base−1 (in practice the choice of θ value makes negligible difference, data not shown). When calculating the AML we included all SNPs that were segregating in the sample, and thus expect the surface to be very close to the full-likelihood surface for the data (ignoring any ascertainment).
After calculating the AML surfaces, for each subregion we calculated a likelihood-ratio test statistic for the presence of a hotspot. For subregion i, the statistic is calculated by the following:
Estimate the background recombination rate. The estimate, ρ̂b, is the ρ value that minimizes the product of AML surfaces from all other subregions.
Let li(ρ) be the log of the AML of region i evaluated at ρ. The likelihood-ratio statistic for a hotspot is
We detect a hotspot in a subregion i if LRi > c for a suitably chosen cutoff value c.
To make an informed choice of c requires knowledge of the distribution of the likelihood-ratio (LR) statistic under the null hypothesis of no hotspot. Standard asymptotic theory would suggest that the statistic takes the value 0 with probability 1/2 and otherwise has a χ21 distribution. We used the empirical distribution of the LR statistics in non-hotspot regions to test whether their actual distribution is close to this asymptotic distribution. A quantile-quantile plot is given in Figure 5, which suggests that the asymptotic distribution approximately holds. (The empirical distribution has too much mass, ∼0.7, on the value 0.)
We chose c = 10 for the results that we present. Under the asymptotic distribution, this gives a false-positive rate of 0.0008 per subregion. On average there is 1 SNP/kb, and therefore ∼35 subregions for each 100-kb region. Under the conservative assumption that each subregion is independent, this equates to an approximate probability of 0.028 of a false positive when analyzing a 100-kb region.
We analyzed 10 data sets simulated for each of the four hotspot models. The LR values for model C, the model with the smallest hotspot, are shown in Figure 6, and a summary of the results is given in Table 3. There were no false positives for any of the 40 data sets. In total 80% of the hotspots were detected. Surprisingly the size or strength of the hotspot seems to have no obvious effect on the power of the method. More important seems to be the density of SNPs in and near the hotspot. Half the hotspots that were not detected were due to not having SNPs near the hotspot. In these three cases, there were 20- to 30-kb regions that incorporated the hotspot that contained 0 or 1 SNP. Detection efficiency decreases further for weaker hotspots. For example, in further simulations, 6 of 10 hotspots with a rate of 4 kb−1 (20 times our assumed background rate) were detected, and none of 5 hotspots with a rate of 2 kb−1 (10 times our assumed background rate) were detected.
To test the robustness of our method to model misspecification, we repeated our analysis but with data simulated under a model of recent exponential growth. Our model was for a constant population size followed by exponential growth by a factor of 500 over the last 1600 generations (for human populations, such a model is suggested by the results of Sherry et al. 1994; Rogers and Jorde 1995). We again used four hotspot models, with the same mutation and recombination models as above, and with the values of θ and ρ being for the population before the exponential growth.
The results of our study are given in Table 4. They are notably worse than the results for the constant population size. This time there were two false positives, although this is still consistent with our putative 2.8% rate of false positives. The overall rate of detecting hotspots is now 63%. (There is significant evidence for different rates of detection for the two models; P = 0.001.)
As an empirical complement to our simulation study, we applied our detection method to population data from 50 individuals over 216 kb of the HLA region covering six recently characterized hotspots (Jeffreys et al. 2001). The available data is unphased, so we applied our method as described above to haplotypes estimated from the genotype data by the program PHASE (Stephens et al. 2001; Stephens and Donnelly 2003). (The estimated haplotypes were kindly provided by Matthew Stephens.)
As before, we chose subregions that contain six SNPs, here with the last SNP of one subregion being the first SNP of the next subregion. To deal with the presence of multiple hotspots, we inferred the background rate using only the AML curves for subregions that had not currently been inferred as hotspot regions. In practice this involved iterating the calculation of background rates and the detection of hotspots until no new hotspots were detected.
Figure 7 shows that all six of the characterized hotspots are detected, and in particular that the method has resolution to distinguish between closely separated hotspots in this region. In addition there are two further putative hotspots, one ∼5 kb 3′ of the TAP2 hotspot (right-hand side of Figure 7) and the other between DNA1 and DNA2 (the two leftmost known hotspots in Figure 7). It is hard to assess whether these are false positives or additional hotspots. In the case of the former, the sperm data in Jeffreys et al. (2000) is uninformative for recombination 3′ of the TAP2 hotspot, and Jeffreys et al. (2001) conjecture that the TAP2 hotspot may be part of a cluster. In the latter case, our subregion covers a 464-bp region to which two crossovers were localized in the sperm data of Jeffreys et al. (2001)(see their Figure 2c). However, the small number of crossover events means that it is impossible to say from the sperm data whether this is an additional hotspot or a false positive.
We have demonstrated that new computational tools for calculating the full-likelihood surface, or an accurate approximation of it, from population data are highly valuable for detection and analysis of local recombination hotspots. We have presented results for two related but distinct problems: the estimation of recombination rates and the detection of recombination hotspots.
The results of our first study show that recombination rates vary by around two orders of magnitude between a 897-bp region and a 1725-bp region that are separated by 445 bp. Furthermore the estimated rate in the hotspot region is 30–50 times greater than the genome-wide average, suggesting that this region is hotter than all but one of the recently characterized hotspots in the HLA region. Comparisons with sperm studies, and other analyses of data from this region, confirm full-likelihood estimation of recombination rates to be reliable.
Diversity in the human β-globin gene cluster has been intensively studied over many years, not least because many mutations in it cause a range of mild to severe inherited hemoglobinopathies (Weatherall and Clegg 1996). On the basis of these studies we can be sure that none of the diversity represented in the United Kingdom sample causes functional variation. We also know that it is inappropriate to assume neutrality for the sample from The Gambia. One of the best known of human mutations, Hemoglobin S (HbS), causes sickle-cell anemia in homozygous individuals and has been elevated in frequency in some populations, mainly in sub-Saharan Africa, by the advantage conferred on HbS heterozygotes who are protected against severe malarial mortality. The HbS allele is present in the sample from The Gambia, at a frequency of 3 out of 31 chromosomes. The recent rise of HbS alleles to polymorphic frequencies by malarial selection in Africa is evident from the presence of characteristic RFLP haplotypes that span the recombination hotspot (Flint et al. 1998; Webster et al. 2003). Consequently, our data set from The Gambia might have been expected to underestimate the number of recombination events compared with expectations under neutrality, with our estimate based on the United Kingdom sample, and with the estimate based on sperm typing, but this does not seem to be a big effect in practice.
Estimation of recombination rates from polymorphism data necessarily relies on an underlying evolutionary model, in our case the so-called standard neutral model. This model does not provide an exact account of the evolutionary history of the sequences we have examined, but we conclude that departures from these assumptions have not greatly influenced our results. These departures include the possibility for recurrent mutation, malarial selection in The Gambia, and a likely bottleneck in the ancestral population for the United Kingdom sample (Reich et al. 2001). Recurrent mutation is most likely at CpG sites but no evidence for their hypermutability has been found in any of the polymorphism data available to us from either β-globin (Harding et al. 1997) or other genes of the complex (Webster et al. 2003). Rate estimates for these regions from the approximate methods of Fearnhead and Donnelly (2002) and McVean et al. (2002), which explicitly allow repeat mutation, are similar (data not shown) to those from the full-likelihood approach reported here. Our computational methods have been shown (Fearnhead and Donnelly 2001) to be reasonably robust to departures from underlying demographic assumptions. Selection as heterozygous advantage has an effect equivalent to changing demographic assumptions and introducing population structure (Nordborg 2001). Furthermore, power studies (C. Spencer, personal communication) for detection of selection in LD data show that regions containing polymorphisms held at intermediate frequencies by balancing selection provide similar information about recombination rates as do neutral regions. In marked contrast, fixations by selective sweeps do leave clear signals of extensive LD that would lead to underestimates of recombination rates. The latter is not the case in our data.
But over and above these theoretical arguments, we are considerably encouraged by the fact that when applied to two populations, United Kingdom and The Gambia, deliberately chosen to have been subject to different demographic histories and selection regimes, the full-likelihood approach gave similar rate estimates across populations within and outside the hotspot. This complements earlier simulation studies (Fearnhead and Donnelly 2001) in showing that at least for these real data sets, full-likelihood estimation of recombination rates is robust to aspects of population history and deviations from neutrality.
It is unclear to what extent homologous recombination fails to resolve as a crossover with flanking exchange, but instead resolves as short tracts (of a few hundred or so base pairs) of gene conversion. Several population analyses have suggested gene conversion may play an important role (Frisse et al. 2001; Przeworski and Wall 2001), and data on gene conversion processes in humans are beginning to become available (Jeffreys and May 2004). The RFLP haplotypes, LD patterns, and single-sperm typing data for the β-globin complex are mainly informative of recombination with flanking exchange, rather than gene conversion. On the other hand, gene conversion events either wholly within or overlapping one boundary of the regions we have considered would be expected to be reflected in increased rate estimates using our approach. The quantitative consequences of gene conversion for coalescent-based estimates of recombination rate are not yet well understood and warrant further investigation.
The results of the second study show that approximate-likelihood methods are powerful at detecting recombination hotspots: >70% of the hotspots were correctly detected, with only two false positives from >2000 subregions that were tested for the presence of a hotspot. The results were encouraging even for data simulated under a model of recent population growth, but analyzed under a constant population size model. We also applied the method to population data across the HLA region in which six hotspots have recently been characterized. All the known hotspots were detected, and the method successfully distinguished hotspots that were physically close. We also found evidence for a fourth hotspot in the HLA-DNA region and a hotspot 5 kb 3′ of the TAP2 hotspot. For our simulation study we fixed the sample size to be 50 chromosomes. Larger sample sizes, and more importantly denser SNP maps, should improve the power of the method to detect hotspots. Our simulation study was based on haplotype data. If unphased genotype data were available, one natural approach would be first to estimate the haplotypes statistically (Stephens et al. 2001; Stephens and Donnelly 2003) and then to apply the methods described above. This is the approach we used for the HLA region, so those results speak to the accuracy of hotspot detection when haplotype phase needs to be estimated. We note that our method for detecting hotspots explicitly allows for repeat mutations.
We considered the formal problem of testing for the presence of a hotspot and suggest a test that declares a hotspot when the likelihood-ratio statistic exceeds a threshold c, determined by simulation. In some applications a less formal approach may be appropriate, and putative hotspots could be studied in order of decreasing value of the likelihood-ratio statistic, or quantile-quantile plots of observed values of the statistic against those generated by simulation could be used to detect outliers indicating putative hotspots.
We simulated data under assumptions plausible for human populations, in view of the current interest, and available data, for that particular organism. Other organisms have different effective population sizes and different values for the probabilities of mutation and recombination per meiosis, u and r. For example, Drosophila melanogaster has an effective population size that is ∼10 times greater than that of humans, but has similar values of u and r (Li and Sadler 1991; Nachman 2002). This means that the values of θ and ρ per kilobase that we chose for our model would be appropriate for θ and ρ values per 100 bases of D. melanogaster, and our results for 100 kb of human DNA would correspond to results for 10 kb of D. melanogaster DNA. For other organisms the appropriate scaling of genomic regions would be different. (As we use a finite-sites mutation model, our model does not strictly scale; for example, we will be modeling 100 bases of D. melanogaster DNA with a 1000-sites model, but in practice this will not greatly affect results: the only effect this has is on the amount of repeat mutation, and experience suggests that, at least for scalings of one or maybe two orders of magnitude, any effect of increased repeat mutation would be small. For example, the analysis of LPL data in Fearnhead and Donnelly 2002 gave almost identical results even when some sites were allowed to mutate either 10 or 100 times more frequently than normal.)
On the one hand, population-based estimates of recombination rates rest on simplistic models for the evolution of diversity and the uncertainty surrounding even the most sophisticated estimates cannot be dismissed. Uncertainty of this order is not a feature of pedigree or sperm-typing studies, although we note that quantifying uncertainty in rate estimates is not straightforward for any pedigree-, sperm-, or population-based methods. On the other hand, it is feasible and practicable to apply computational methods to diversity data sampled from anywhere in the human genome, or indeed from any genome, and substantial information will accumulate. It will be interesting to see the future contribution of coalescent methods toward detection of recombination hotspots and robust estimation of recombination rates in hotspots, compared with the results of faster algorithms based on summary statistics. Our study demonstrates that computationally intensive inference methods applied to polymorphism data provide practical and valuable tools for learning about fine-scale variation in recombination rates.
- Received August 22, 2003.
- Accepted April 29, 2004.
- Genetics Society of America