## Abstract

Background selection, the effects of the continual removal of deleterious mutations by natural selection on variability at linked sites, is potentially a major determinant of DNA sequence variability. However, the joint effects of background selection and genetic recombination on the shape of the neutral gene genealogy have proved hard to study analytically. The only existing formula concerns the mean coalescent time for a pair of alleles, making it difficult to assess the importance of background selection from genome-wide data on sequence polymorphism. Here we develop a structured coalescent model of background selection with recombination and implement it in a computer program that efficiently generates neutral gene genealogies for an arbitrary sample size. We check the validity of the structured coalescent model against forward-in-time simulations and show that it accurately captures the effects of background selection. The model produces more accurate predictions of the mean coalescent time than the existing formula and supports the conclusion that the effect of background selection is greater in the interior of a deleterious region than at its boundaries. The level of linkage disequilibrium between sites is elevated by background selection, to an extent that is well summarized by a change in effective population size. The structured coalescent model is readily extendable to more realistic situations and should prove useful for analyzing genome-wide polymorphism data.

GENOME-WIDE surveys of DNA sequence variability within populations in a variety of species are providing evidence that the amount and pattern of neutral or nearly neutral variability in given regions of the genome are affected by selection at sites that are linked to those under observation (Wright and Andolfatto 2008; Charlesworth *et al.* 2009; McVicker *et al.* 2009; Sella *et al.* 2009; Cutter and Choi 2010). While these effects are strongest in genomic regions or genomes with low frequencies of genetic recombination, where all sites are closely linked (Betancourt *et al.* 2009; Kaiser and Charlesworth 2009; Seger *et al.* 2010), they can also be detected in regions with “normal” levels of genetic recombination (Andolfatto 2007; Shapiro *et al.* 2007; Haddrill *et al.* 2011). In addition, the extent of adaptation at the sequence level, as measured by codon usage and the level of selective constraint on nonsynonymous sites, is reduced when recombination is infrequent (Betancourt and Presgraves 2002; Hey and Kliman 2002; Presgraves 2005; Haddrill *et al.* 2007; Larracuente *et al.* 2008; Betancourt *et al.* 2009).

These observations, which date back to the early findings in Drosophila that silent site variability at a locus is correlated with the local rate of recombination for the region in which it is situated (Aguadé *et al.* 1989; Begun and Aquadro 1992), suggest strongly that selection at sites genetically linked to those under observation is influencing the evolutionary process at the latter. Two main processes have been proposed as the source of this influence. The first is hitchhiking by positively selected mutations (Maynard Smith and Haigh 1974) or “selective sweeps” (Berry *et al.* 1991). Evidence in favor of this interpretation has been compiled by Stephan (1995) and Andolfatto (2007), among others (reviewed by Sella *et al.* 2009). The second involves hitchhiking effects caused by the continual introduction of new, deleterious variants by mutation at sites across the genome and their elimination by selection, as first discussed by Fisher (1930). Weakly selected or neutral variants on a haplotype that carries a closely linked mutation, which is sufficiently strongly selected against that it is virtually certain to be eliminated from the population, will also be removed from the population. This results in a reduction in the mean coalescent time for neutral variants and in the efficacy of selection on weakly selected variants; this process has become known as “background selection” (Charlesworth *et al.* 1993).

Several analytical and simulation studies of the effects of background selection on neutral variability (Charlesworth *et al.* 1993, 1995; Hudson and Kaplan 1994, 1995; Nordborg *et al.* 1996; Fu 1997; Neuhauser and Krone 1997; Santiago and Caballero 1998; Gordo *et al.* 2002; Williamson and Orive 2002; Zeng *et al.* 2006; Wakeley 2008b; O’Fallon *et al.* 2010) and on the properties of weakly selected variants (Charlesworth 1994; Stephan *et al.* 1999; Zeng and Charlesworth 2010) have been conducted. The effect of background selection on the mean coalescent time for a pair of alleles, caused by mutations at a large number of sites subject to sufficiently strong selection that the allele frequencies behave approximately deterministically, is especially well understood (Hudson and Kaplan 1995; Nordborg *et al.* 1996). The relevant formula has been used to predict patterns of variability as a function of local recombination rate in *Drosophila* (Hudson and Kaplan 1995; Charlesworth 1996), humans (Cai *et al.* 2009; McVicker *et al.* 2009), and *Caenorhaditis elegans* (Cutter and Choi 2010; Rockman *et al.* 2010). In addition, there is increasing evidence that background selection may affect patterns of variation and the efficacy of selection in and around a single gene, although the effects are small and detectable only with genome-wide data (Loewe and Charlesworth 2007; McVicker *et al.* 2009; Hammer *et al.* 2010).

In addition to its effect on the level of neutral variability, as determined by the mean coalescent time for a pair of alleles, background selection can also distort the shape of the neutral gene genealogy, in the direction of increasing the length of external branches relative to the rest of the tree, resulting in an excess of rare variants relative to standard neutral expectation, especially when selection against deleterious mutations is relatively weak (Charlesworth *et al.* 1993, 1995; Fu 1997; Santiago and Caballero 1998; Gordo *et al.* 2002; Williamson and Orive 2002; Zeng *et al.* 2006). This effect has proved hard to study analytically, and most results have been obtained either by coalescent models that assume no recombination (Charlesworth *et al.* 1995; Fu 1997; Neuhauser and Krone 1997; Gordo *et al.* 2002; Zeng *et al.* 2006; Wakeley 2008b; O’Fallon *et al.* 2010) or by forward-in-time computer simulations (Williamson and Orive 2002; Kaiser and Charlesworth 2009; Zeng and Charlesworth 2010).

Given the evidence from genomic data that background selection against nonsynonymous variants in coding sequences may have significant local effects, it is important to develop efficient methods to predict its effects on both levels of variability and patterns of departure of allele frequency spectra from neutral expectations. To initiate progress toward this end, we have developed a structured coalescent model of background selection that includes recombination. The purpose of this article is to describe this model and its implementation in a computer program that generates numerical predictions of several biologically informative genealogical statistics, as well as to check the validity of the method against the results of forward simulations. The results are at present quite limited in their range, but the approach should be readily extendable to more realistic situations.

## Theory

### A background selection model

In this section, we formulate a background selection model forward in time, under the assumption that the population is at mutation–selection equilibrium, so that genotype frequencies at sites under selection are constant over generations. The results are used in the next section to construct a structured coalescent model that takes into account the joint effects of background selection and recombination on local gene genealogies. Consider a haploid population of constant size *N*. We focus on a “deleterious region” where the total rate of mutation to deleterious alleles follows a Poisson distribution with a mean of *U* mutations per generation per individual. The mutation rate is uniform across the deleterious region, and back mutation does not occur. We represent the deleterious region by the interval [0, 1]. A haplotype that carries *i* deleterious mutations is referred to as an *i*-haplotype. We further assume that all mutations in the deleterious region have the same fitness effect, *s* (0 < *s* < 1). The selected sites in a haplotype interact multiplicatively, so that the fitness of an *i*-haplotype is (1 – *s*)* ^{i}* relative to wild type.

We assume a discrete-generation model as shown in Figure 1. The adults in generation *t* produce infinitely many gametes, without fertility differences. Mutation, selection, and recombination then operate sequentially on the resulting infinite population, causing changes in the frequency of *i*-haplotypes, which are denoted by *f _{i}* and

*g*at different stages. Note that1Under the assumption that the system is at equilibrium,

_{i}*f*and

_{i}*g*remain constant over generations. Genetic drift operates through population size regulation, which reduces the population size to

_{i}*N*at random with respect to genotype. We assume

*N*to be sufficiently large that the distribution of haplotype frequencies with respect to selected sites remains approximately unchanged by the population size regulation. Consequently, drift affects only neutral sites within the region in question.

The order of the evolutionary forces in the life cycle is to some extent arbitrary and is chosen mainly for theoretical convenience. However, if all the evolutionary forces are weak, so that second-order terms in their effects can be neglected, their effects on haplotype frequencies are approximately additive (Ewens 2004, Chaps. 4 and 5), and the dynamics of the model are approximately independent of the order of the forces.

A key feature in the life cycle is that recombination does not alter the distribution of haplotype frequencies (compare the two populations immediately before and after the recombination step in Figure 1). A derivation is given below, but heuristically this result follows from previous analyses of selection models with an infinite population size and multiplicative fitnesses (Kimura and Maruyama 1966; Charlesworth 1990; Shnol and Kondrashov 1993; Johnson 1999). These show that the genotypic composition of an infinite population at equilibrium between selection and mutation is independent of the presence or absence of recombination, which should therefore have no effect on *f _{i}*.

To establish the above result, we first note that previous work with zero recombination has shown that2where λ* _{f}* =

*U*(1 –

*s*)/

*s*and λ

*=*

_{g}*U*/

*s*are the mean numbers of mutations that a haplotype carries before and after the mutation step in the life cycle (Haigh 1978; Gordo

*et al.*2002). In other words,

*f*and

_{i}*g*follow Poisson distributions. Note that mutation increases the average number of mutations that an individual carries from λ

_{i}*to λ*

_{f}*. As a result, the mean fitness of the population is lowered from exp{−*

_{g}*U*(1 – s)} before mutation to exp(−

*U*) after mutation. On the other hand, selection changes the mean number of mutations back to λ

*by allowing fitter individuals to contribute more offspring to the gene pool of the next generation, thus restoring the mean fitness to the premutation level.*

_{f}Recombination is formulated as follows. To produce a new haplotype in the postrecombination population, two haplotypes are sampled randomly with replacement from the postselection population. With probability 1 – *R*, no recombination occurs, in which case one of the two chosen haplotypes is randomly selected to be present in the postrecombination population. If recombination occurs (with probability *R*), it is assumed to involve a single crossover event whose breakpoint falls randomly within the interval [0, 1]. Two recombinants are produced and one of them is randomly selected to be retained in the postrecombination population. The recombination rate parameter *R* is therefore equivalent to the genetic map length of the focal genomic region, whose size is so small that the chance of double crossover events is negligible.

Suppose that an *i*-haplotype and a *j*-haplotype are involved in a recombination event with breakpoint *x* (0 < *x* < 1). To construct the recombinants, it is necessary to determine the numbers of mutations on both sides of the breakpoint. We assume that, for a randomly chosen *i*-haplotype and a given recombination breakpoint *x*, the number of mutations to the left of the breakpoint, denoted by *L _{i}*, follows a binomial distribution with index

*i*and parameter

*x*,3where 0 ≤

*l*≤

*i*.

Equation 3 effectively assumes that the mutations carried by an *i*-haplotype are uniformly distributed on the interval [0, 1]. This is reasonable, because the model is spatially homogeneous in the sense that the mutation rate to deleterious alleles is uniform and each deleterious mutation has the same fitness effect regardless of its location.

Under the above recombination model, the frequency of *i*-haplotypes in the postrecombination population, denoted by *f _{i}**, is4where5(see the

*Appendix*for a derivation).

Equations 4 and 5 can be understood by noting that an *i*-haplotype in the postrecombination population is either a nonrecombinant or a recombinant with probabilities 1 – *R* and *R*, respectively. If it is a nonrecombinant, it must be a descendant of an *i*-haplotype in the postselection population. If it is a recombinant, the chance that the breakpoint falls between *x* and *x* + *dx* is *dx*. Given *x*, *l* of the *i* mutations, which are to the left of *x*, are inherited from the first parent with *j* (≥*l*) mutations, and the remaining *i* – *l* to the right of *x* come from the other parent with *k* (≥*i* – *l*) mutations.

To understand the spatial distribution of mutations in an *i*-haplotype taken randomly from the postrecombination population, we note that6(see the *Appendix* for a derivation).

Equation 6 shows that, among the recombinants with *i* mutations, the proportion that are formed by recombination events with a breakpoint between *x* and *x* + *dx* is *dx* (*i.e.*, it is a uniform distribution). Given *x*, the proportion that *l* of the *i* mutations are found in the region [0, *x*] is *B*(*l* | *i*, *x*). These properties are the same as those leading to Equation 3. Hence the mutations carried by a recombinant are uniformly distributed on the interval [0, 1]. Because the same must be true for the mutations in a nonrecombinant, we can conclude that the uniformity assumption with respect to the spatial distribution of mutations is preserved in the postrecombination population.

Equations 4–6 show that recombination in our model does not alter the distribution of haplotype frequencies, as expected from the heuristic arguments given above. However, it should be noted that the derivation relies on the assumption of multiplicative fitness effects, in an infinite population at mutation–selection equilibrium. The result is untrue under models with synergistic interactions between mutations at different sites (Kimura and Maruyama 1966; Charlesworth 1990; Shnol and Kondrashov 1993).

### A structured coalescent model

Suppose that we sample a random individual from the finite-size adult population in generation *t* + 1 (the bottom population in Figure 1). The probability that this haplotype carries *i* mutations is *f _{i}*. To construct the coalescent model, we trace the ancestry of this haplotype step by step according to the life cycle in Figure 1. First, we note that the focal haplotype must have been present in the postrecombination population. A haplotype in the postrecombination population can be either a nonrecombinant or a recombinant, with probabilities 1 –

*R*and

*R*, respectively. If the focal haplotype is a nonrecombinant, then it must have been present in the postselection population as an

*i*-haplotype. If the focal haplotype is a recombinant, it contains genetic material from two different individuals in the postselection population. We need to reconstruct the two parental haplotypes, so that we can trace the history of ancestry farther backward. From Equation 6, we can deduce that, for a recombinant, the breakpoint can be determined by drawing

*x*from a uniform distribution on [0, 1] and that the number of mutations inherited from the first parental haplotype in [0,

*x*] can be determined by drawing

*l*from a binomial distribution with index

*i*and parameter

*x*and the remaining

*i*–

*l*mutations are inherited from (

*x*, 1] in the other parental haplotype. These segments are marked by the thick lines in Figure 2 and are referred to as “real” segments because they are directly inherited by the recombinant under investigation. On the other hand, both parental haplotypes contain “imaginary” segments that are nonancestral to those in the sample (thin lines in Figure 2). It is, however, necessary to keep track of the numbers of mutations in both the real and the imaginary segments because, as shown below, they jointly determine the rate that a haplotype moves between different genetic backgrounds and the rate that two haplotypes coalesce.

Consider the first parental haplotype in Figure 2, which has an imaginary part to the right of breakpoint *x*. Because this haplotype is in the postselection population, using the fact that the spatial distribution of mutations is uniform, it follows that the number of mutations in the imaginary segment follows a Poisson distribution with mean (1 – *x*)λ* _{f}*. Similarly, the number of mutations in the imaginary segment in the second parental haplotype follows a Poisson distribution with mean

*x*λ

*. Thus, by sampling from these Poisson distributions, we can determine the total number of mutations in these haplotypes (Figure 2).*

_{f}We can see from Figure 2 that, to construct the parental haplotypes of a recombinant, we need to assign mutations to different regions. In other words, a recombination event brings location information to the haplotypes in the structured coalescent model. To manage this information, we define the genetic background of a haplotype, , as7In Equation 7, **x** = (*x*_{0}, *x*_{1}, . . . , *x _{B}*) and

**y**= (

*y*

_{1},

*y*

_{2}, . . . ,

*y*), with 0 =

_{B}*x*

_{0}<

*x*

_{1}< . . . <

*x*= 1 and

_{B}*y*≥ 0. The

_{b}*x*(0 <

_{b}*b*<

*B*) denote the breakpoints caused by recombination events that generated this haplotype, and the

*y*denote the numbers of mutations in (

_{b}*x*

_{b}_{–1},

*x*] (1 ≤

_{b}*b*≤

*B*). We further define the size of a genetic background as the total number of mutations carried by the focal haplotype:8For example, in Figure 2, the genetic background of the sampled haplotype is = {

**x**= (0, 1),

**y**= (

*i*)}, whereas that of the left-hand parental haplotype is = {

**x**= (0,

*x*, 1),

**y**= (

*l*,

*q*

_{1})}.

The above steps enable us to reconstruct the parental haplotypes in the postselection population that gave birth to the haplotypes under investigation. To go farther backward, we note that any haplotypes that exist in the postselection population must have survived selection and therefore must have been present in the postmutation population. Suppose that we have an *i*-haplotype in the postmutation population. This haplotype may have received new mutations. Because the mutational process is irreversible, the parent of the *i*-haplotype necessarily has no more than *i* mutations. Specifically, the probability that the focal haplotype has a parent with *j* (≤*i*) mutations is9(see the *Appendix* for a derivation). Therefore, *j* can be determined by sampling from a binomial distribution with index *i* and parameter 1 – *s*. Once *j* is specified, the parental haplotype can be constructed by randomly purging *i* – *j* mutations from the *i*-haplotype.

The methods described so far can determine the haplotypes of the postreproduction population that are ancestors of one of the extant haplotypes in our sample. Repeating these procedures for other haplotypes in the sample, we can obtain a set of haplotypes in the postreproduction population that are ancestral to the individuals in our sample. The next task is to determine the probability that some of these postreproduction haplotypes were born to the same parents in the previous generation.

Suppose that we have two haplotypes in the postreproduction population whose genetic backgrounds are denoted by ^{(1)} and ^{(2)}, respectively. Because reproduction and mutation happen in two separate steps in the life cycle, and the two focal haplotypes are in the population before mutation takes place, if |^{(1)}| ≠ |^{(2)}|, it is impossible for the two haplotypes to have a common parent in the previous generation. When |^{(1)}| = |^{(2)}|, there are two possible situations. First, the two focal haplotypes may have incompatible genetic backgrounds and therefore cannot share a common parent in the previous generation. An example is illustrated in Figure 3A. The first haplotype has one mutation in (0, *x*]. Consequently its parent must also have had one mutation in this region. However, this is incompatible with the fact that the second haplotype does not carry any mutation in (0, *x*]. In contrast, the two haplotypes with genetic backgrounds ^{(1)} and ^{(2)} depicted in Figure 3B may coalesce into the same parent with genetic background ^{(A)}. This is because a mutation is located somewhere in (0, *y*] in the first haplotype, whereas none exists in (0, *x*] in the second haplotype (*x* < *y*). Therefore, if the two haplotypes share the same parent, the parental haplotype cannot have any mutation in (0, *x*]. This confines the mutation in the first haplotype to (*x*, *y*], because it must have inherited the region (0, *x*] from its parent. Applying similar arguments to (*y*, 1], we conclude that the parental haplotype cannot have any mutation in this region, and this requirement restricts the mutation in the second haplotype to (*x*, *y*]. Thus, we finish the construction of the potential location of the mutation in the parental haplotype, denoted by ^{(A)}.

More generally, Figure 3 illustrates the fact that, for the postreproduction population, coalescent events can happen only among haplotypes with compatible genetic backgrounds. For two haplotypes with genetic backgrounds ^{(1)} and ^{(2)}, the computer algorithm given in supporting information, File S1 can simultaneously determine whether they are compatible and, if they are, the genetic background ^{(A)} of their parental haplotype in the previous generation.

To calculate the coalescent probability, we need to know the number of individuals with a given genetic background in the finite adult population of the previous generation. Making use of the fact that the model is spatially homogeneous, we can extend the argument that leads to Equation 3 and assume that, among the *i*-haplotypes, the proportion with a given genetic background follows a multinomial distribution10where *I*() is a function that takes the value of one if the condition in the parentheses is true and the value of zero otherwise.

Suppose that we have two haplotypes in the postreproduction population, with genetic backgrounds ^{(1)} and ^{(2)}. Equation 10 implies that, in the adult population of the previous generation, the number of individuals with genetic background ^{(}^{z}^{)} is *Nf _{i}P_{i}*(

^{(}

^{z}^{)}), and the probability that a particular one of these is the parent of the

*z*th descendant haplotype is 1/[

*Nf*(

_{i}P_{i}^{(}

^{z}^{)})] (

*z*= 1 or 2). For a coalescent event to occur, the two sets of potential parents must have overlapped, and one of the individuals in the intersection must have given birth to the two descendant haplotypes in question. The situations where the two parental sets do or do not overlap correspond to the cases where the two descendants have compatible or incompatible genetic backgrounds (

*e.g.*, Figure 3). The overlapping region between the two parental sets, denoted by

^{(A)}, can be found by the computer algorithm given in File S1. Using Equation 10, we note that the number of individuals in the intersection in the previous generation is

*Nf*(

_{i}P_{i}^{(A)}). Hence, the probability that the two descendant haplotypes were born to the same parent in the previous generation is11Note that, for two incompatible genetic backgrounds,

*P*(

_{i}^{(A)}) = 0 and thus

*P*

_{CA}= 0 (

*e.g.*, Figure 3A).

This calculation follows the spirit of the work of Hudson and Kaplan (1994), in the sense that the deleterious mutations are not pinpointed to specific sites in the deleterious region. Rather, the mutations are merely confined to segments in the deleterious region (*i.e.*, genetic backgrounds), so that haplotypes with the same genetic background can be considered together, enabling the construction of coalescent processes. However, the model of Hudson and Kaplan (1994) considered only the effect of background selection on a pair of alleles at a neutral site, linked to a nonrecombining deleterious region. Here we establish the spatial distribution of deleterious mutations and introduce , so that all the neutral sites in the deleterious region can be considered simultaneously for an arbitrary sample size.

It is, however, difficult to calculate the probability of coalescent events that involve more than two individuals. Implementing a simulation algorithm that exactly follows the dynamics described above is, therefore, not straightforward. To solve the problem, we resort to the following well-established approximation techniques.

### Continuous-time approximations

We approximate the processes using the standard rescaling techniques that underlie the coalescent framework (Kingman 1982; Hudson 1990). We assume that *U*, *s*, *R*, and 1/*N* are so small that their second-order terms can be neglected (*i.e.*, the evolutionary forces are weak). Under this assumption, the possibility that two or more of the three possible events (*i.e.*, recombination, mutation, and coalescence) occur in the same generation is in the order of 1/*N*^{2} and can be ignored. As a result, recombination, mutation, and coalescence can be regarded as three independent Poisson processes. We define three scaled parameters: θ = *NU*, γ = *Ns*, and ρ = *NR*. As in the neutral coalescent, time is measured in units of *N* generations. Going backward in time, recombination splits a haplotype into two parental haplotypes at rate ρ. For mutation, we note from Equation 9 that12 where *O*(*s*^{2}) represents terms of the second order in *s*. Therefore, the next mutation event arrives at an *i*-haplotype at rate *i*γ and changes it to an (*i* − 1)-haplotype by randomly removing one of the *i* mutations. Finally, as in the standard neutral coalescent, we assume that a coalescent event involves only two haplotypes. Under the rescaling, the rate at which two compatible haplotypes coalesce into a common parent is *NP*_{CA} (see Equation 11).

Suppose that we have a sample of *n* haplotypes with genetic backgrounds ^{(1)}, ^{(2)}, . . . , ^{(}^{n}^{)}. The total scaled rates for recombination and mutation events are *r*_{r} = *n*ρ and *r*_{m} = , respectively. For the total coalescent rate, we need to examine all the *n*(*n* – 1)/2 pairs to identify those that involve two haplotypes with compatible genetic backgrounds, so that a *NP*_{CA} value can be calculated for each of these compatible pairs. The total coalescent rate, *r*_{c}, is given by the sum of the *NP*_{CA} values. Then, the scaled time to the next event follows an exponential distribution with rate *r*_{t} = *r*_{r} + *r*_{m} + *r*_{c}. Given that an event has occurred, the probability that it is due to a particular process can be calculated by dividing the rate of the process of interest by *r*_{t}. For instance, the chance that the first haplotype undergoes a mutational change conditional on an event occurring is γ|^{(1)}|/*r*_{t}. After randomly removing one mutation from this individual, we recalculate the event rates to determine the time of the next event. This process is repeated until the first time when only one haplotype remains. The resulting relationship between the haplotypes in the whole history of ancestry defines an ancestral recombination graph (ARG). Local gene genealogies at different neutral sites in the deleterious region can be extracted from the ARG using standard methods (Hudson 1990).

## Materials and Methods

### Coalescent simulations

We have written a computer program that implements the structured coalescent model with the continuous-time approximations outlined. The algorithm uses the neutral coalescent algorithm implemented in the program *ms* (Hudson 2002). Our program requires four parameters: *n*, θ, γ, and ρ. To set up the initial haplotypes, *n* random numbers are drawn from a Poisson distribution with mean λ = *U*/*s* (*i.e.*, we assume λ = λ* _{g}* ≈ λ

*, which is a good approximation under the continuous-time model). Because these haplotypes are taken randomly from an equilibrium population, by using the arguments leading to Equations 3 and 10, we can assume that the mutations are uniformly distributed on these haplotypes [*

_{f}*i.e.*,

^{(}

^{z}^{)}has the form {

**x**= (0, 1),

**y**= (|

^{(}

^{z}^{)}|)} for

*z*= 1, . . . ,

*n*]. To reconstruct the coalescent history of a sample, for each haplotype in the ARG, the program keeps track of its genetic background and the ancestral information about the extant haplotypes in the sample it contains (

*e.g.*, Figure 2). For each set of parameter values presented in the

*Results*section, 10

^{5}simulation replicates were performed.

### Forward-in-time simulations

To check whether the structured coalescent model provides good approximations, we implemented a forward-in-time simulation algorithm that follows the life cycle given in Figure 1. A haploid Wright–Fisher population of constant size *N* is simulated. Each site in a haplotype has two states, *A*_{0} and *A*_{1}, representing the wild type and the mutant type, respectively. In each generation, the number of new mutations experienced by a haplotype is determined by drawing a random number from a Poisson distribution with mean *U*. A haplotype with *i* mutations has fitness (1 – *s*)* ^{i}*. To produce a haplotype in the next generation, we first randomly sample two haplotypes from the current generation, with the probability of sampling a particular haplotype being proportional to its fitness. With probability

*R*, the parental haplotypes undergo a recombination event. The location of the breakpoint is determined by drawing from a uniform distribution on [0, 1]. Two recombinants are constructed and one of the two is randomly chosen for retention. These steps are repeated

*N*times to generate the

*N*individuals in the new generation.

Each replicate of the simulation starts from a mutation-free population. This population is allowed to evolve for 20*N* generations so that statistical equilibrium is achieved. Random samples are then taken every 10*N* generations. To compare with the results obtained from the coalescent simulations, the gene genealogies at five evenly spaced sites in the simulated region are recorded (*i.e.*, at *X*_{1} = 0, *X*_{2} = 0.25, *X*_{3} = 0.5, *X*_{4} = 0.75, and *X*_{5} = 1). The forward simulation algorithm is much more computationally demanding than the coalescent algorithm, especially when *N* is large. Unless stated otherwise, we used an *N* value of 5000 in the forward simulations and obtained 2000 random samples for each set of parameter values.

### Statistics of interest

We focus on four genealogy-based statistics: *T*_{2}, η_{t}, ζ_{e}, and *r _{xy}*.

*T*

_{2}refers to the time to the most recent common ancestor of a sample of size 2 at a given nucleotide site. Assuming neutral evolution and measuring time in units of

*N*generations (the same scaling is applied to the other time-related statistics of interest), the expectation of

*T*

_{2}is

*E*(

*T*

_{2}

^{(neu)}) = 1 (Wakeley 2008a, p. 76). Under background selection and recombination, previous analyses have obtained13where

*u*is the mutation rate to deleterious alleles at a nucleotide site and

*r*is the recombination rate between the

_{w}*w*th selected site and the focal neutral site (Hudson and Kaplan 1995; Nordborg

*et al.*1996). In the absence of recombination (

*i.e.*

*r*≡ 0), Equation 13 reduces to exp(−

_{w}*U*/

*s*) (Charlesworth

*et al.*1993; Hudson and Kaplan 1994). Note that

*E*(

*T*

_{2}

^{(bgs)})

*N*is the expected number of generations needed for two randomly sampled alleles to coalesce into a common ancestor. Hence, an effective population size under background selection can be defined as

*N*

_{e}(

*T*

_{2}) =

*E*(

*T*

_{2}

^{(bgs)})

*N*.

*T*_{2} is closely related to the widely used measure of DNA sequence variability, π, the nucleotide site diversity, which is defined as the probability that two randomly sampled haplotypes have different variants at the nucleotide site under investigation (Tajima 1983). Under the infinite-sites model of mutation (Kimura 1969), π is determined by *T*_{2} through the equation π = 2*Nu*^{(neu)}*E*(*T*_{2}), where *u*^{(neu)} is the neutral mutation rate per site (Tajima 1983). Thus, we can define a measure of the diversity-reducing effect of background selection as14Note that *E*[*B*(*T*_{2})] = *E*(*T*_{2}^{(bgs)}) and *N*_{e}(*T*_{2}) = *E*[*B*(*T*_{2})]*N*.

The second statistic is η_{t}, the total length of all the branches in the genealogy of a sample of size *n* at the focal nucleotide site. Under the standard neutral coalescent model,15(Wakeley 2008a, p. 76). η_{t} is related to *S*, the number of segregating sites observed in the sample, whose expectation is given by *E*(*S*) = *Nu*^{(neu)}*E*(η_{t}) under the infinite-sites model (Watterson 1975; Hudson 1990). Defining *a _{n}* = 1 + 1/2 + . . . + 1/(

*n*– 1), η

_{t}is also related to the alternative measure of sequence variability, θ

_{W}, given by θ

_{W}=

*S*/

*a*, whose expected value is

_{n}*Nu*

^{(neu)}

*E*(η

_{t})/

*a*(Watterson 1975). Hence we can define a second measure of the diversity-reducing effect of background selection as

_{n}Third, we look at ζ_{e}, the proportion of external branches (*i.e.*, branches leading to extant haplotypes in the sample) in the genealogy at the focal site. Specifically, ζ_{e} = η_{e}/η_{t}, where η_{e} is the total length of all the external branches. The neutral expectation of η_{e} is *E*(η_{e}^{(neu)}) = 2 (Fu and Li 1993). However, the moments of ζ_{e} are unknown, so that we obtained the neutral expectation of ζ_{e}, denoted by *E*(ζ_{e}^{(neu)}), via coalescent simulation. Because mutations on an external branch can be inherited only by the haplotype descending from this branch, these mutations must be at low frequencies in the sample. In particular, with the infinite-sites assumption, mutations on external branches lead to singletons (*i.e.*, segregating sites where the mutant variant is present in only one individual). Let *S*_{e} be the number of singletons. The proportion of segregating sites that are singletons is *S*_{e}/*S* = (*Nu*^{(neu)}η_{e})/(*Nu*^{(neu)}η_{t}) = η_{e}/η_{t} = ζ_{e}. Hence, ζ_{e} is a major determinant of the relative abundance of low-frequency variants in the sample. We therefore define a third relative diversity measure as

Note that, under neutrality with a stable population size, *E*[*B*(*T*_{2})] = *E*[*B*(η_{t})] = *E*[*B*(ζ_{e})] = 1. However, *T*_{2}, η_{t}, and ζ_{e} are known to respond differently to the influence of background selection (Charlesworth *et al.* 1993, 1995; Fu 1997; Gordo *et al.* 2002; Williamson and Orive 2002; Zeng *et al.* 2006; Kaiser and Charlesworth 2009; Seger *et al.* 2010). This is the basis of using tests such as Tajima’s *D* (Tajima 1989) and Fu and Li’s *D* (Fu and Li 1993) to detect the presence of background selection. For example, the power of Tajima’s *D* is mainly determined by the difference between *T*_{2} and η_{t}, whereas the power of Fu and Li’s *D* is mainly determined by ζ_{e}.

The last statistic of interest, *r _{xy}*, is the correlation in

*T*

_{2}between two different sites at positions

*x*and

*y*in the deleterious region (0 ≤

*x*<

*y*≤ 1).

*r*can be viewed as a measure of linkage disequilibrium (LD). Under neutrality, we have18where ρ

_{xy}*= (*

_{xy}*y*–

*x*)ρ (there is a linear relationship between recombination frequency and physical distance between

*x*and

*y*) (Griffiths 1981; McVean 2002).

We carried out the following analysis to understand the effects of background selection on patterns of LD, which have not been examined previously. First, in the simulations (both forward and coalescent), we kept track of the local genealogies at five evenly spaced sites in the deleterious region (*i.e.*, at *X*_{1} = 0, *X*_{2} = 0.25, *X*_{3} = 0.5, *X*_{4} = 0.75, and *X*_{5} = 1). Because of the spatial symmetry of the model, it suffices to estimate *r _{xy}* for only six pairs of sites:

*X*

_{1}

*X*

_{2},

*X*

_{1}

*X*

_{3},

*X*

_{1}

*X*

_{4},

*X*

_{1}

*X*

_{5},

*X*

_{2}

*X*

_{3}, and

*X*

_{2}

*X*

_{4}. We denote these six pairwise

*r*values in the order shown above by

_{xy}*r*(

_{i}*i*= 1, . . . , 6). Let

**r**= (

*r*

_{1}, . . . ,

*r*

_{6}). When we have two such

**r**vectors, referred to as

**r**

^{(1)}and

**r**

^{(2)}, we can define the distance between

**r**

^{(1)}and

**r**

^{(2)}as19Suppose that

**r**

^{(1)}was obtained from the background selection model with parameters θ

^{(1)}, γ

^{(1)}, and ρ

^{(1)}. To understand the effects of background selection, we divided the interval (0, 10ρ

^{(1)}) into an evenly spaced grid. For each point in the grid, we obtained an

**r**vector under neutrality using Equation 18 and calculated its distance from

**r**

^{(1)}. Let ρ* be the value in the grid that produced the pattern closest to

**r**

^{(1)}, as measured by Equation 19. Then ρ* can be viewed as the effective recombination rate under background selection. Hence we can define a second effective population size as

*N*

_{e}(

*r*) = (ρ*/ρ

_{xy}^{(1)})

*N*. In contrast to

*N*

_{e}(

*T*

_{2}), which measures the reduction in nucleotide diversity brought about by background selection,

*N*

_{e}(

*r*) measures the effects of background selection on LD. Finally, we define

_{xy}## Results

We present results obtained from both forward and coalescent simulations. The questions of interest are (i) the effects of background selection on patterns of neutral diversity and LD and (ii) whether the structured coalescent model provides a good approximation. For the latter question, instead of generating and examining sequence variability, we focus on several genealogy-based statistics, which are the ultimate determinants of sequence-based statistics (see above). The advantage of using genealogy-based statistics is that they are less variable than sequence-based statistics (the latter contains the additional variation brought about by the mutation process), allowing a more accurate assessment of the performance of the coalescent model.

### The effects of background selection on *B*(*T*_{2})

The only general analytic result regarding the joint effects of background selection and recombination on neutral diversity is the expected value of *T*_{2}^{(bgs)}, given by Equation 13. We can deduce three results from this formula. First, background selection reduces neutral diversity (as measured by π) because *E*[*B*(*T*_{2})] < 1 (Equation 14). Second, the extent of reduction in diversity is spatially inhomogeneous, in that the reduction is more extreme in the interior of the deleterious region relative to the boundaries. As a result, the effects of background selection cannot be fully summarized as a simple reduction in *N*_{e} for the whole region (Nordborg *et al.* 1996; Loewe and Charlesworth 2007). Third, within the limit of validity of Equation 13, *E*(*T*_{2}^{(bgs)}) and equivalently *E*[*B*(*T*_{2})] depend exclusively on the ratios between *U*, *s*, and *r _{w}*, but not the absolute values of these parameters.

The simulation results, obtained from both the forward and the coalescent approaches, confirm the first two predictions (Table 1). To evaluate the third prediction, we compare the first three pairs of results in Table 1 where the ratios of *U*/*s* and *R*/*U* are the same within each pair. With weaker selection, values calculated from Equation 13 are much lower than those obtained from simulations (*e.g.*, the first case in Table 1). The agreement improves as γ, the scaled selection coefficient, becomes larger (*e.g.*, the second case in Table 1). This pattern holds regardless of the presence or absence of recombination. In other words, Equation 13 tends to underestimate *E*[*B*(*T*_{2})] (*i.e.*, overestimate the reduction in diversity) when selection is weak, as observed in previous simulations (*e.g.*, Nordborg *et al.* 1996). This is probably caused by the fact that the derivation of Equation 13 replies on the assumption that, in the coalescent history of a haplotype, the time that it was loaded with more than one deleterious mutation is very short compared to the time it spent as a 0- or 1-haplotype, and consequently, coalescent events take place only among 0- or 1-haplotypes (Hudson and Kaplan 1994, 1995). Because the rate that mutations are purged is proportional to γ (Equation 12), Equation 13 is more accurate when γ is larger.

Encouragingly, the structured coalescent model appears to be more accurate than Equation 13, in the sense that the coalescent results agree more closely with those obtained from forward simulations, in terms of both the mean and the percentiles of *B*(*T*_{2}) (Table 1). However, the mean values of *B*(*T*_{2}) produced by the structured coalescent model are sometimes higher than those produced by the forward simulations. This is probably because, in the continuous-time approximation, we assume that only one event can happen at any one time and that each coalescent event involves only two haplotypes. However, in a finite population, especially one with a small population size, the chance of having more than one event may be sizeable. Furthermore the product of *Nf _{i}P_{i}*() (Equation 11) can be small, which increases the chance of having coalescent events involving multiple haplotypes. These factors may affect the accuracy of coalescent models (Fu 2006). In fact, for the case with scaled parameters θ = γ = 6 and ρ = 3 (the third case in Table 1), when forward simulations were conducted using either a smaller haploid population size of 2500 or a larger one of 20,000, the mean values of

*B*(

*T*

_{2}) at

*x*= 0.5 (the center of the deleterious region) were 0.587 and 0.610, respectively, showing a convergence to the value of 0.609 obtained from the coalescent model. Thus, the discrepancies shown in Table 1 may be unimportant for species with large effective population sizes such as

*Drosophila melanogaster*whose diploid

*N*

_{e}is ∼10

^{6}(Kreitman 1983) and should not undermine the potential usefulness of the coalescent model.

### The effects of background selection on *B*(η_{t}) and *B*(ζ_{e})

In addition to reducing nucleotide site diversity (π), it is known that, in the absence of recombination, background selection also distorts neutral genealogies, such that external branches take up a larger proportion of the genealogy, and this effect tends to be greater when selection is weaker (Charlesworth *et al.* 1993, 1995; Fu 1997; Gordo *et al.* 2002; Williamson and Orive 2002; Zeng *et al.* 2006; Kaiser and Charlesworth 2009; Seger *et al.* 2010). To see whether the structured coalescent model can accurately describe these aspects of variability, we investigate the properties of two other summary statistics, η_{t} and ζ_{e}, which are major determinants of the number of segregating sites and the relative abundance of low-frequency variants. In Tables 2 and 3, we present values of *B*(η_{t}) and *B*(ζ_{e}), which are defined as the values of η_{t} and ζ_{e} under background selection relative to their neutral expectations (Equations 16 and 17), so that deviations from unity indicate the direction of departure from neutrality. Over the combinations of parameter values we have examined, the coalescent model offers very good approximations to both *B*(η_{t}) and *B*(ζ_{e}); this is true for both the mean and the percentiles. Again, as with *B*(*T*_{2}), the properties of *B*(η_{t}) and *B*(ζ_{e}) depend on absolute values of the scaled parameters, not just their ratios.

Several patterns emerge from the comparison of Tables 1–3, where data were generated using the same set of parameter values. First, from Tables 1 and 2, it can be seen that *E*[*B*(*T*_{2})] and *E*[*B*(η_{t})] are reduced to roughly the same extent by background selection, although the reduction in *E*[*B*(*T*_{2})] tends to be slightly more extreme, especially when selection is weaker. This is in agreement with the observation that Tajima’s *D*, which depends on the difference between *T*_{2} and η_{t}, tends to have a negative mean under background selection (Charlesworth *et al.* 1995; Fu 1997; Gordo *et al.* 2002; Williamson and Orive 2002; Zeng *et al.* 2006). As expected, with strong selection and/or a high level of recombination, the difference between *E*[*B*(*T*_{2})] and *E*[*B*(η_{t})] is very small and can be difficult to detect (*e.g.*, the last case in Tables 1 and 2).

Second, as is the case for *B*(*T*_{2}), the mean values of *B*(η_{t}) and *B*(ζ_{e}) are spatially inhomogeneous when ρ > 0 (Tables 2 and 3). *E*[*B*(η_{t})] is smaller in the center than at the boundaries (Table 2), suggesting that there will be on average fewer segregating neutral variants in the interior of the region. The reverse is true for *E*[*B*(ζ_{e})], whose values tend to be larger in the center but smaller at the boundaries (Table 3). However, an examination of η_{e}, the total length of all the external branches, shows that *E*(η_{e}^{(bgs)}) < *E*(η_{e}^{(neu)}) and that *E*(η_{e}^{(bgs)}) is smaller in the interior of the region than at the boundaries. Therefore, the increase in *E*[*B*(ζ_{e})] in the center of the region is caused by a more rapid decline in η_{t} relative to η_{e}. With *E*[*B*(ζ_{e})] > 1 and its spatial pattern, we expect a higher proportion of low-frequency variants in the interior of the region. Therefore, in addition to the known result that Fu and Li’s *D*, which is determined by ζ_{e}, should have a negative mean value under background selection, which was found previously using a model with zero recombination (Charlesworth *et al.* 1995; Fu 1997; Gordo *et al.* 2002; Williamson and Orive 2002; Zeng *et al.* 2006), the results in Tables 2 and 3 show that the mean value should also be more negative in the center than at the boundaries, although the difference may be small and may not be easily detectable with sequence variability.

We further explored the effects of various levels of recombination on *E*[*B*(η_{t})] and *E*[*B*(ζ_{e})] (Figure 4). As expected, recombination increases neutral diversity (as measured by *E*[*B*(η_{t})]), to levels well above that observed in the absence of recombination (the dashed line at the bottom). Similarly, *E*[*B*(ζ_{e})] gets closer to the neutral expectation of unity with increasing rates of recombination. However, even with a fairly high level of recombination (θ = 25 and ρ/θ = 4; triangles in Figure 4A), the values of *E*[*B*(η_{t})] and *E*[*B*(ζ_{e})] at the boundary of the deleterious region, where the influence of background selection is minimal, are still 18% lower and 3% higher than the neutral expectation, respectively (Figure 4A), suggesting that, even in highly recombining regions of the genome, the effect of background selection should not be overlooked. The spatial pattern displayed in Figure 4 extends the results of previous studies that are based on *E*[*B*(*T*_{2})] (Nordborg *et al.* 1996; Loewe and Charlesworth 2007) and lends further support to the conclusion that the effect of background selection is greater in the center of the region. In particular, the parameters used to generate Figure 4B are comparable to those thought to be “typical” of a gene in the *D. melanogaster* genome (Loewe and Charlesworth 2007) and hence may be relevant to the intragenic spatial patterns of codon usage bias found in *D. melanogaster* (Comeron and Kreitman 2002; Comeron and Guthrie 2005; Loewe and Charlesworth 2007).

Figure 4 shows that the results obtained from the structured coalescent model agree well with those obtained from the forward simulations. The discrepancies are probably due to the artifact of using a small population size in the forward simulations, as explained above for the case of *B*(*T*_{2}). For instance, in the case where θ = 7.5, γ = 15, and ρ/θ = 0.25 (circles in Figure 4B), the values of *E*[*B*(η_{t})] at *x* = 0 (the bottom-left circles) obtained from the forward simulations with *N* = 5000 and 20,000 are 0.679 and 0.692, respectively, with the latter being very close to the value of 0.699 obtained under the coalescent model. Hence we can conclude that the structured coalescent model can accurately predict patterns of diversity for all levels of recombination.

### Patterns of linkage disequilibrium under background selection

The effects of background selection on patterns of LD have not previously been examined. As an initial attempt, we used *r _{xy}*, the correlation in

*T*

_{2}between two different sites

*x*and

*y*, as a measure of LD. As shown in Table 4, the structured coalescent model can accurately capture this aspect of neutral variability. Comparing these results with neutral expectations (Equation 18), we find that background selection increases the level of correlation (

*i.e.*, higher LD) between sites [compare

*r*

_{xy}^{(bgs)}with the neutral expectations obtained with ρ = 3 in Table 4]. This is probably caused by a reduction in the effectiveness of recombination, as a result of a shorter expected coalescent time in the presence of background selection {recall that

*E*[

*B*(

*T*

_{2})] < 1 and

*E*[

*B*(η

_{t})] < 1}. Using the method described below Equation 19, the effective recombination rate for the set of parameters in Table 4 was estimated to be ρ* = 1.98. It can be seen that the

*r*

_{xy}^{(neu)}values obtained using ρ* and Equation 18 agree closely with those obtained under the background selection model (Table 4). Note that, with ρ* = 1.98,

*B*(

*r*) = 0.66 (Equation 20), which is identical to the

_{xy}*E*[

*B*(

*T*

_{2})] value at the boundary (see the third case in Table 1). In other words, if we estimate

*N*

_{e}from either nucleotide diversity at the boundary of the region (

*i.e.*, the value of π at

*x*= 0, which is governed by

*T*

_{2}) or patterns of LD [

*i.e.*, the six pairwise

*r*

_{xy}^{(bgs)}values], we should obtain a similar estimate of

*N*

_{e}≈ 0.66

*N*. This result is supported by additional simulations using a wide range of parameter combinations (Table 5). However, we have examined only one particular measure of LD, and more research is needed to examine the effects of background selection on LD.

## Discussion

The results that we have described above show that our structured coalescent approach provides a rapid algorithm for computing the neutral genealogies of sites situated within a group of sites subject to sufficiently strong purifying selection that they can be treated as at equilibrium under mutation and selection (the conditions for this assumption to be valid are discussed in Nordborg *et al.* 1996). In contrast to previous coalescent results (Charlesworth *et al.* 1995; Fu 1997; Neuhauser and Krone 1997; Wakeley 2008b; O’Fallon *et al.* 2010), we allow for the occurrence of crossing over within the region in question. Our forward simulations show that the method provides extremely accurate numerical results. While we have assumed a haploid model for convenience, the results apply equally to a diploid model when there is semidominance with respect to the effects of mutations on fitness, with the appropriate changes in parameter values, or to a diploid model with sufficiently strong selection that deleterious mutations with partially dominant effects present at low frequency in the population, so that their elimination exclusively involves heterozygotes (Kimura and Maruyama 1966; Simmons and Crow 1977; Charlesworth 1990). The assumption of haploidy is thus not especially restrictive.

Since we can compute the genealogies of neutral sites at arbitrary locations within a set of selected sites, we can investigate any desired property of a neutral site as a function of its location within the region subject to purifying selection, as well as correlations between the properties of linked neutral sites within the region. This allows predictions to be made concerning the extent of the reduction caused by background selection in the expected pairwise coalescent time, *T*_{2}^{(bgs)}, as a function of location within the region (Table 1). *T*_{2}^{(bgs)} can be regarded as a measure of the local effective population size and determines the expected level of pairwise neutral diversity under the infinite sites model (Tajima 1983). In addition, statistics such as the mean total size of the genealogy at a given location, when contrasted with the mean pairwise coalescent time, and the proportion of the genealogy contributed by external branches, ζ_{e}^{(bgs)}, can be used to assess the extent of distortion of the genealogy in favor of longer external branches, which earlier studies of a nonrecombining region have shown to be produced by background selection, especially when selection is relatively weak (Charlesworth *et al.* 1993, 1995; Fu 1997; Gordo *et al.* 2002; Williamson and Orive 2002; Zeng *et al.* 2006; Kaiser and Charlesworth 2009; Seger *et al.* 2010). This distortion will be accompanied by an excess of rare, derived neutral variants compared with what is observed in the absence of background selection. Our results show that these effects are observed even in the presence of recombination (Tables 2 and 3 and Figure 4) and that they are strongest in the middle of the region subject to selection. The ζ_{e}^{(bgs)} statistic seems to show this effect particularly clearly.

If the region subject to purifying selection is taken to represent a coding sequence, then our results suggest that not only is the local effective population size most strongly reduced in the middle of a noninterrupted stretch of coding sequence (as found previously by Loewe and Charlesworth 2007), but also this will be associated with a stronger excess of rare variants at neutral sites in the middle of genes compared with their end or with intergenic or intronic sequences. It should shortly be possible to test this prediction quantitatively against data from large-scale genome-wide resequencing projects, which are capable of detecting very small effects when data from large numbers of genes are combined. There is already evidence for reductions in noncoding nucleotide site diversity in human populations in the proximity of genes compared with more remote regions (McVicker *et al.* 2009; Hammer *et al.* 2010).

Intuitively, the reduction in effective population size for a small genomic region caused by background selection might be expected to cause a corresponding increase in the level of linkage disequilibrium among pairs of closely linked neutral sites. This has not previously been investigated theoretically. We have used our program to calculate the correlations between the genealogies at different sites within the region subject to purifying selection; the results show that these are well predicted by Equation 18 for the standard neutral case, if the population scaled recombination rate ρ is calculated using the effective size under background selection at the boundary of the region.

The results presented here are obviously very limited in scope and need to be extended in several ways. First, our results for genealogies need to be translated into predictions concerning observable DNA sequence diversity statistics; this is a relatively straightforward matter of adding neutral mutations onto the genealogies, as is standard practice in coalescent theory (reviewed by Wakeley 2008a). Second, a distribution of mutational effects on fitness needs to be included in the model, in view of the evidence that the selection coefficients against deleterious nonsynonymous mutations follow a wide distribution (Loewe and Charlesworth 2006; Keightley and Eyre-Walker 2007; Boyko *et al.* 2008). Third, instead of considering a single genomic region, the division of the genome into different regions with differing strengths of purifying selection, such as noncoding and coding sequences, needs to be modeled. Fourth, the effects of nonreciprocal recombination need to be included, since this is of major importance over short genetic distances in eukaryotes and across the whole genome in bacteria. Fifth, the effects of population size change and population structure need to be modeled. Some of these extensions, notably the inclusion of a distribution of fitness effects, present more of a challenge than others. The ultimate goal is to have a flexible and rapid set of programs that allows predictions to be made about the properties of neutral genetic diversity across the genome.

## Acknowledgments

We thank Nick Barton and Matt Hartfield for valuable discussions. This study made use of the resources provided by the Edinburgh Compute and Data Facility (ECDF) (http://www.ecdf.ed.ac.uk/). The ECDF is partially supported by the e-Science Data, Information and Knowledge Transformation (eDIKT) initiative (http://www.edikt.org.uk). K.Z. acknowledges support from a Biomedical Personal Research Fellowship awarded by the Royal Society of Edinburgh and the Caledonian Research Foundation.

## Appendix

### Derivation of Equation 5

First, we note that(A1)Similarly, we have(A2)Therefore Equation 5 can be rewritten as(A3)

### Derivation of Equation 6

Using the second equation in Equation A3, we have

(A4)### Derivation of Equation 9

Because mutation is unidirectional, the *i*-haplotypes in the postmutation population must have been produced by premutation haplotypes with *j* (≤*i*) mutations. Noting that the proportion of *j*-haplotypes in the premutation population is given by *f _{j}* and that the number of new mutations (

*i.e.*,

*i*–

*j*) follows a Poisson distribution with mean

*U*, we have

## Footnotes

*Communicating editor: N. Takahata*

- Received May 12, 2011.
- Accepted June 19, 2011.

- Copyright © 2011 by the Genetics Society of America