A Coalescent Model of Recombination Hotspots
 Carsten Wiuf1 and
 David Posada
 1 Corresponding author: Variagenics, 60 Hampshire St., Cambridge, MA 02139. Email: wiuf{at}birc.dk
Abstract
Recent experimental findings suggest that the assumption of a homogeneous recombination rate along the human genome is too naive. These findings point to blockstructured recombination rates; certain regions (called hotspots) are more prone than other regions to recombination. In this report a coalescent model incorporating hotspot or blockstructured recombination is developed and investigated analytically as well as by simulation. Our main results can be summarized as follows: (1) The expected number of recombination events is much lower in a model with pure hotspot recombination than in a model with pure homogeneous recombination, (2) hotspots give rise to large variation in recombination rates along the genome as well as in the number of historical recombination events, and (3) the size of a (nonrecombining) block in the hotspot model is likely to be overestimated grossly when estimated from SNP data. The results are discussed with reference to the current debate about blockstructured recombination and, in addition, the results are compared to genomewide variation in recombination rates. A number of new analytical results about the model are derived.
THE process of recombination in humans has been intensively debated over the last years. Various recent findings suggest that the standard model assuming a flat rate of recombination along a chromosome is too crude an approximation to the actual recombination process acting on the human genome and that the standard model does not adequately explain the findings. In Daly et al. (2001), Jeffreys et al. (2001), Johnson et al. (2001), and Gabriel et al. (2002) it is argued that recombination tends to happen more often in certain regions, socalled hotspots, of a chromosome than in other regions, giving rise to long islands of nonrecombining or virtually nonrecombining genetic material.
If the above reports are true, our understanding of the recombination process as an evolutionary force must be adjusted accordingly: Modeling of recombination and interpretation of recombination patterns plays an important role in the analysis of genetic data. In this report we develop a model, the coalescent with recombination hotspots, which can be used for simulation and analysis of genetic data. Simulation of genetic data is an important tool for investigating and testing hypotheses about how genetic data have been shaped and is a useful way of gaining intuition about and insight into the consequences of evolutionary processes.
The coalescent with recombination hotspots is an extension of Kingman's (1982) coalescent and of the coalescent with recombination in various forms, the coalescent with uniform recombination rate and multilocus coalescent models (Hudson 1983; Hudson and Kaplan 1985; see Griffiths 1981 for a twolocus model). The idea is that recombination breakpoints are not chosen randomly along the chromosomes but are concentrated in certain regions of the chromosomes. One way to model this is to choose centers of recombination activity (i.e., hotspots) according to some point process (e.g., a Poisson process) and let recombination events happen at a rate descending from the centers. In the following a model along these lines is developed. In the next section an informal description of the model is presented followed by a mathematical treatment with comparisons to the standard model. A scheme for simulation of sequence samples and histories is described. Some familiarity with the coalescent with recombination is required.
This report is intended to be methodological, where issues of relevance to the analysis of data are addressed. The new model is compared to the coalescent model with uniform recombination rate through simulations of various summary statistics. Of special interest are the consequences of ignoring hotspot recombination and how hotspot recombination affects the genomewide variation in recombination rates. Various issues relating to inference in the hotspot model are raised in the discussion.
A MODEL OF RECOMBINATION HOTSPOTS
Think of an entire chromosome as being represented by a line and the gene or region we are interested in as being represented by the interval (0, 1), as illustrated in Figure 1. Throughout we use “gene” in a loose sense, letting it be short for an arbitrary but fixed region in the genome. Thus, a gene might be several thousand kilobases long. For a gene L nucleotides long each nucleotide takes up 1/L of the interval. However, for mathematical convenience we think of the gene as a continuous stretch of points.
The model we outline below is fairly general. In mathematical exposition we restrict the model to a simpler model that still has most of the flexibility of the general model. Choose hotspots according to some point process. Perhaps the simplest and most sensible process in this connection is a Poisson process with intensity λ> 0, so that on average there are λx hotspots in a chromosomal segment of length x, and in particular there are λ hotspots on average in the gene (0, 1). In this fashion hotspots are scattered throughout the chromosome and different genes will have different numbers of hotspots, but different copies of the same chromosome will have the exact same number and the exact same locations of hotspots. In Figure 2 hotspots are labeled x_{j}, j =±1, 2,.... If we knew the exact locations of the hotspots, e.g., from experiments, these would not have to be modeled stochastically. In the absence of such knowledge the point process reflects our prior information or expectation of how hotspots are distributed throughout the chromosome.
Recombination crossovers happen around a particular hotspot, x_{j}, with a rate, c_{j}, per generation and when a crossover occurs the breakpoint is chosen according to a distribution, g_{j}(x), around x_{j}. We choose to call x_{j} a hotspot, though a more correct terminology might be a “center of recombination activity.” Unless the distribution g_{j}(x) is closely centered around x_{j}, few recombinations would be at x_{j} precisely. We say that recombination happens around x_{j} if the breakpoint is chosen from g_{j}(x). The rates c_{j} could be chosen from some distribution, e.g., a Γ, or be constant for all j, c_{j} = c. In the former case we talk about rate heterogeneity, and in the latter about rate homogeneity. Similarly, g_{j}(x) might vary with j or be independent of j, g_{j}(x) = g(x). For example, g_{j}(x) could be normal with variance
Two hotspots are in the example given in Figure 3, one at x_{1} = 0.2 and the other at x_{2} = 0.65, and g_{j}(x), j = 1, 2, are normal with variances
Since g_{j}(x) is proportional to the probability by which recombination happens at distance x from the hotspot x_{j}, the sum of the curves in Figure 3 represents the overall rate of recombination in a given point (Figure 4). If g_{j}(x) is sufficiently narrow around hotspots little overlap with other hotspots occurs, resulting in truly distinguishable peaks.
The following interpretation of the model is intuitive: A hotspot x_{j} can be thought of as a specific site or segment that is required for recombination to take place; however, the breakpoint itself might not be at the hotspot or fully determined by the hotspot, but just located somewhere randomly in the vicinity. It is worth stressing that at present published data are sparse and there is little evidence for choosing one model [i.e., the point process x_{j}, j =±1,2...,the recombination rates c_{j}, and the breakpoint distributions g_{j}(x)] in favor of another.
MATHEMATICAL EXPOSITION
In the following we focus on establishing some results about the rate of recombination between two sites and show how the rate affects the number of segregating sites, S_{n}, and the number of recombination events, R_{n}, in a sample's history as compared to the standard model with flat recombination rate. We have simplified the general model of recombination hotspots presented in the previous section to ease presentation and computations. However, some of the results hold more generally. To be specific, we assume that g_{j}(x) does not depend on j, i.e., g_{j}(x) = g(x) for all j, and that the distance between hotspots is Gamma distributed, Γ(m, λ), m > 0 (hereafter referred to as the “gamma process”). If m = 1, then the gamma process is a Poisson process with rate λ. Allowing m ≠ 1 introduces interference: If m > 1, hotspots tend to be pushed away from each other, whereas if 0 < m < 1, they tend to cluster. Further, we assume that the distribution of rates, c_{j}, has expectation c. Table 1 provides an overview of the notation.
If an event happens around x_{j} the probability that the breakpoint is in (z, z + dz) is g(z – x_{j})dz, where dz denotes a small segment around z, say, of the length of a nucleotide. Summing over all hotspots x_{j} gives the rate r_{z}dz by which recombination happens per generation in a particular site z. Here r_{z} is given by
The rate, r_{z}_{1}_{z}_{2}, of recombination between any two sites, z_{1} and z_{2}, can be found from (1) by integration,
The variances of r_{z} and r_{z}_{1}_{z}_{2} are more involved and do not have simple closed expressions. In special cases they can be found, though (see the appendix).
Simulation of sample histories: Consider a diploid population of size N; i.e., there are 2N chromosomes. It follows from standard arguments (e.g., Hudson 1990) that for N large and c_{j} small, c_{j} ≈ 0, and the time (going into the past) until a gene has been created by a recombination event is exponential with parameter ρ_{01}/2, Exp(ρ_{01}/2). Here ρ_{z}_{1}_{z}_{2} = 4Nr_{z}_{1}_{z}_{2} and time is measured in units of 2N generations. Also define γ = 4Nc, the scaled average recombination rate per hotspot.
The location of the breakpoint z is given by the density, h(z):
Simulation of sample histories can be performed in the following way. Assume that there are n genes in the sample. Let k count the number of genes that at a given time in the past carries ancestral material (see Figure 5 for an illustration).

Simulate c_{j} and x_{j} according to the gamma process with parameters m and λ. Calculate h(z) from Equation 5 and compute ρ_{01}.

Simulate the next event according to the coalescent with flat recombination rate ρ_{01} and sample size k.

If the event is a coalescence event choose two genes at random to coalesce; otherwise choose one gene to recombine. The breakpoint, z, is chosen according to Equation 5. Update k.

Stop when k = 1.
Simulation of hotspots (step 1) is straightforward once the position of the first hotspot in the region is determined: The length between hotspots is Γ(m, λ), which can be simulated using standard algorithms. The location of the first hotspot in the region can be simulated using a rejection algorithm. Details are given in the appendix.
Time spent on computation in step 3 can be reduced if a lookup table for h(z) is constructed. A lookup table takes the form of a dense grid of points and for each point, v, the corresponding point, z, on the gene is determined such that the probability of a recombination breakpoint between 0 and z is v, i.e., P(break between 0 and z) =
It is straightforward to include uniform recombination and/or gene conversion in addition to recombination by hotspots. Uniform recombination can either be included in the breakpoint distribution h(z), adjusting r_{z} accordingly, or be accounted for as a separate type of event. Thus, in a simulation three types of events are possible: coalescence, hotspot recombination events, and uniform recombination events. Gene conversion is best treated as a different type of event, because it often involves two breakpoints instead of one (Wiuf and Hein 2000). Both extensions seem realistic in light of how genetic data presently are conceived. Infinitesite mutation can be simulated at the same time as the genealogy, as described by previous authors (e.g., Hudson 1990 and references therein).
Number of segregating sites: Of interest is the distribution of (A) the number of segregating sites, S_{n}, given a particular outcome, x_{j}, of the gamma process and (B) the same number averaged over all possible outcomes of the gamma process. In both cases we average over all possible outcomes of c_{j}. B relates to the genomewide variation, whereas A relates to the variation within a single gene. Assume the mutation rate is θ = 4Nu for the whole gene (0, 1), where u is the mutation rate per gene per generation. As shown in Hudson (1990), S_{n} is Poissondistributed Po(θL_{n}/2), where L_{n} is the total branch length measured in 2N generations of the genealogy relating n sequences. The distribution of L_{n} depends on whether we consider a particular outcome x_{j} (A) or whether we average over all such outcomes (B).
The expectation of S_{n} under both A and B can be obtained easily, because it depends on the genealogy for a given site only (see, e.g., Hudson 1990). For A we find
Number of recombination events: The same line of argumentation as in the previous paragraph applies to the number of recombination events, R_{n}, in a sample's history. We show results for the expectation and variance of R_{n} without detailed proofs, again distinguishing between A and B.
For A we have
Equations 11, 12, 13 hold only if g(x) is a continuous distribution; e.g., if the breakpoint is exactly in x_{j}, then E_{x}(R_{n}) and E(R_{n}) are lower than the values given in Equations 11 and 12, respectively. In Figure 6, left, the second breakpoint hits the location of the first and, thus, does not count in R_{n}. In contrast, both events count in Figure 6, right. When all breakpoints are exactly in x_{j}, j =±1, 2..., the model is effectively a multilocus model rather than an infinitesite model. Griffiths (1991) discussed a twolocus model with recombination and found a recursion for the expected number of recombination events. This recursion can be solved explicitly for small n. Also Simonsen and Churchill (1997) discussed a twolocus model and found analytic expressions for the expectation and variance of the number of recombination events for n = 2. Their result for the expectation agrees with that of Griffiths (1991) for n = 2. In a sense, the number of events is overcounted in both articles, because events involving loci that already have found a most recent common ancestor are counted (Figure 7 gives an example). However, Griffiths' (1991) recursion can still be applied after changing the boundary conditions (see the appendix for details). Let ϵ_{n}(y) be the expected number of recombination events between two loci with scaled rate y. Then
SIMULATION RESULTS
In this section we investigate various quantities by simulation. Particularly, we are interested in the expectation and variance of R_{n} and the variances of S_{n} and ρ_{01}. The expectation of S_{n} is independent of the recombination process and is given in Equation 7. The expectation of ρ_{01} is found from Equation 4. In addition, we present simulated results for Hudson and Kaplan's (1985) lower bound, R_{M}, on the number of recombination events in a sample history and Myers and Griffiths' (2003) haplotypebased bound, H_{M}. It is always true that H_{M} ≥ R_{M} and that H_{M} is lower than the true minimum (Myers and Griffiths 2003). Both statistics give indications to what the overall rate of recombinations might be.
The coalescent with recombination hotspots and homogeneous recombination rate was implemented in a program by one of us. A program was implemented to calculate E(R_{n}) in Equation 18 with Gammadistributed rates, γ_{j} =γZ_{j}, Z_{j} ∼ Γ(ζ, ζ). (This is analogous to how heterogeneity in mutation rates is modeled; see, e.g., Yang 1996.) Tables 2 and 3 show the mean and variance, respectively, of R_{n} for various combinations of the two rates: the hotspot rate ρ (varying λ, m, and γ) and the uniform rate ν = 4Nv, where v is the rate per gene per generation. The density g(x) was either a normal, N(0, σ^{2}), or a uniform, U(–α, α). Values of α and ρ were chosen such that the respective distributions had the same variance, i.e.,
The tables show some interesting trends. First of all, Table 2 shows that E(R_{n}) is considerably lower in a model with pure block recombination than in a model with pure homogeneous recombination. This is true irrespective of whether or not interference is assumed or whether all hotspots have the same recombination rate or the rate varies (ζ = 1). The expectation E(R_{n}) is the sum of the expectations for pure block recombination and for pure homogeneous recombination. Note that rate heterogeneity lowers the expectation compared to the case of rate homogeneity. This is in line with the observation in Equation 19 for n = 2. If g(x) has low variance, the expectation for pure block recombination can be seen as an approximate “effective” number of recombinations, because mutations are unlikely to separate close recombination breakpoints. For m = 1, the probability of no hotspots within the gene is exp(–λ), which evaluates to 37% for λ = 1 and 2% for λ = 4. For m = 4, the probability of no hotspots is exp(–λ) {1 + ¾λ + ¼λ + 1/24λ^{3}}, which is 20% for λ = 4 and 0.1% for λ = 16.
Table 3 shows that variation in R_{n} is largely induced by variation in the number of hotspots; as λ increases the rate becomes more uniform and the variance decreases, both for m = 1 and m = 4. As an example compare the three values for ρ^{2} = 10^{–4} in the first column. This is further emphasized by the observation that Var(R_{n}) seems to be (slowly) decreasing in ρ^{2} > 0, which is as expected because the recombination rate becomes more uniform with higher ρ^{2}. Interference decreases variation, because the variance in the number of hotspots decreases with increasing m and λ/m fixed, and the recombination rate becomes more uniform. Heterogeneity, on the other hand, increases variation, because the recombination rate becomes more variable.
The same trends in Table 3 are seen in Table 4: Var(S_{n}) decreases with increasing ρ^{2} and also with increasing λ, for both m = 1 and m = 4. However, note that for ρ^{2} = 0 the variance of S_{n} attains its largest value because trees for individual nucleotides are more correlated than those for ρ^{2} > 0. We do not see the same for R_{n}:If ρ^{2} = 0, some recombination events break between the same nucleotides and might therefore not count in R_{n} (cf. Figure 6). Interference and heterogeneity affect the variance similarly to what was seen in Table 3, though less dramatically.
Table 5 summarizes the genomewide variation, Var(ρ_{01}), in recombination rates in the hotspot model. The variation is mainly due to variation in the number of hotspots and very little to the value of ρ^{2}.If λ is large the variance approaches 0, but it can be substantial for small values of λ. It appears that the variance is roughly inversely proportional to λ.
Table 6 shows summary statistics of R_{M} and H_{M}. The inferred minimum number of recombinations is lower in the hotspot model than in the uniform model: We tend to underestimate the amount of historical recombinations more in the hotspot model than in the uniform model. This is the case even if ρ^{2} > 0 (results not shown), because recombination events around a hotspot are detected by R_{M} or H_{M} only if there is an accumulation of mutations in the region around the hotspot. The actual number of recombination events is in general much higher (cf. Table 2).
DISCUSSION
We have developed an extension of the coalescent with recombination that in a simple way accounts for heterogeneity in recombination rate and hotspots. Rate heterogeneity can be modeled in ways other than the one proposed here: Basically, what is required is a relationship, stochastic or deterministic, between physical and genetic distance. The standard coalescent model of uniform rate suggests a linear relationship between the two distances. This results in a oneparameter model. In contrast the model proposed here has a stochastic relationship between the two distances, in the sense that the rate is different for each realization of a gene's history. The model can be summarized as having four main parameters: an intensity, λ, that controls the number of hotspots; a measure of interference, m; a recombination rate, γ, per hotspot; and a parameter that regulates the size, ρ^{2}, of the hotspot. Two of these, σ^{2} (or more correctly,
Fisher et al. (1947) introduced the gamma process in the context of chromatid interference. Their model is called the χ^{2} model. The χ^{2} model is not conceptually identical to the model presented in this article: In our model the gamma process determines the location of hotspots. These locations are the same for all individuals and are potential breakpoints for recombination. In contrast, in the χ^{2} model the gamma process determines the location of chromatid crossovers, actual recombination breakpoints. These vary from individual to individual, both in number and in location.
We showed how various summary statistics are affected by hotspot recombination compared to uniform recombination. One important message is that in the hotspot model we tend to underestimate the number of recombination events more in a sample history than in a model of homogeneous recombination. From an inference point of view this is extremely unsatisfactory: There are more parameters to estimate in the hotspot model than in the uniform model and reliable estimates might therefore be difficult to achieve. For one thing, it is well known that estimators of the recombination rate are often biased downward (Wall 2000), suggesting that these estimators are even more likely to be biased downward in the hotspot model. As a further issue, the extra parameters in the hotspot model make inference computationally intractable. Even in the standard model, maximumlikelihood estimation is computationally nontrivial (Fearnhead and Donnelly 2001), and many approaches rely on summary statistics and/or simplifications of the likelihood (Wall 2000; Fearnhead and Donnelly 2002, and references therein).
Statistically, it is not known whether the parameters can be estimated consistently. Fearnhead (personal communication) shows that in certain cases it is possible to estimate the flat recombination rate consistently as the length of the genomic region increases. Similarly, it might be expected that λ, m, γ, and ρ^{2} can be estimated consistently. Fearnhead's (personal communication) proof does not directly apply in the present situation because he works with the standard oneparameter coalescent model and the proof makes explicit use of this. As a final comment along these lines, if data are analyzed under the standard model alone, a systematic downward bias is expected; this was clearly demonstrated in Table 2.
We calculated summary statistics for the recombination rate from 22 genes spread throughout the human genome (from Table 1 in Nachman 2001). We found an average rate of 2.08 cM/Mb with a variance of 1.63. If we assume the effective population size is N = 10^{4} and a gene of ∼10^{4} nucleotides, then estimates of the expectation and variance of ρ_{01} are given by E(ρ_{01}) = 8.32 and Var(ρ_{01}) = 26.08, respectively. Compared with the values in Table 5, these point to a fairly high level of variation in recombination rate. For example, the variance (26.08 × 10^{2}/8.32^{2} = 37.7) is consistent with some interference and homogeneity in rates (e.g., m = 4, λ = 4) or with no interference and heterogeneity in rates (e.g., m = 1, λ = 5.2, assuming an inverse proportional relationship between Var(ρ_{01}) and λ; see Table 5).
Table 6 also brings a message to researchers spending effort on inferring haplotype maps of human chromosomes. Hotspots (or block end and start points) are inferred from singlenucleotide polymorphism (SNP) patterns, but as shown in Table 6 these patterns lead to gross underestimation of the true number of recombination events. The statistic R_{M} infers at most one recombination between any two SNPs:
Acknowledgments
We thank L. Subrahmanyan for helpful and useful discussions. We are grateful to two anonymous reviewers who provided many fruitful comments and suggestions. The program used for simulation can be obtained from http://www.uvigo.es/dposada.
APPENDIX
Expectation of r_{z}: Upon taking expectation of r_{z}, Equation 1 can be written
Variance of r_{z}_{1}_{z}_{2} and ρ_{z}_{1}_{z}_{2}: If g(x) puts all probability mass in the hotspot, the variance of r_{z}_{1}_{z}_{2} can be found. Note that in this particular case
Simulation of the first hotspot in a gene: McPeek and Speed (1995) state the distribution of the position of the first hotspot in a gene,
Recursion for ϵ_{n}(y): Griffiths (1991) provides the following recursion for calculation of ϵ_{n}(y) for given values of n and y. Let a, b, and c ≥ 0, and let n = a + b + c. The number c is the number of genes where both loci are ancestral to the sample, a is the number where locus 1 is ancestral, and b is the number where locus 2 is ancestral. The recursion is
Footnotes

Communicating editor: J. Hey
 Received August 2, 2002.
 Accepted February 3, 2003.
 Copyright © 2003 by the Genetics Society of America