## Abstract

It is well known that most new mutations that affect fitness exert deleterious effects and that natural populations are often composed of subpopulations (demes) connected by gene flow. To gain a better understanding of the joint effects of purifying selection and population structure, we focus on a scenario where an ancestral population splits into multiple demes and study neutral diversity patterns in regions linked to selected sites. In the background selection regime of strong selection, we first derive analytic equations for pairwise coalescent times and *F*_{ST} as a function of time after the ancestral population splits into two demes and then construct a flexible coalescent simulator that can generate samples under complex models such as those involving multiple demes or nonconservative migration. We have carried out extensive forward simulations to show that the new methods can accurately predict diversity patterns both in the nonequilibrium phase following the split of the ancestral population and in the equilibrium between mutation, migration, drift, and selection. In the interference selection regime of many tightly linked selected sites, forward simulations provide evidence that neutral diversity patterns obtained from both the nonequilibrium and equilibrium phases may be virtually indistinguishable for models that have identical variance in fitness, but are nonetheless different with respect to the number of selected sites and the strength of purifying selection. This equivalence in neutral diversity patterns suggests that data collected from subdivided populations may have limited power for differentiating among the selective pressures to which closely linked selected sites are subject.

IT was pointed out a long time ago that most mutations that affect the phenotype must be deleterious, because the biological system has been under selection for billions of years to improve its performance (*e.g.*, Muller 1950). Evidence collected from both mutation accumulation experiments and analyses of large-scale DNA sequence variation is consistent with the prevalence of deleterious mutations and the relative rarity of advantageous ones (reviewed by Eyre-Walker and Keightley 2007; Halligan and Keightley 2009; Charlesworth 2012a). It is therefore unsurprising that the effects of selection against deleterious mutations are an integral part in the study of many fundamental questions in evolutionary genetics, such as the evolution of sex and recombination (Otto and Lenormand 2002), the consequences of inbreeding and self-fertilization (Charlesworth and Charlesworth 1987), and the origin and nature of genetic variation underlying diseases (Wright *et al.* 2003).

An important line of research is to understand the effects of selection against recurrent deleterious mutations on patterns of genetic diversity at linked neutral sites. This has been extensively studied in the *background selection regime* (Charlesworth *et al.* 1993) in which deleterious mutations are assumed to be sufficiently strongly selected against that their frequencies stay very close to the deterministic limit expected in an infinitely large population (Kimura and Maruyama 1966; Haigh 1978). It is well established that, in a panmictic population, background selection causes a reduction in diversity and an excess of low-frequency variants in linked neutral regions relative to neutral expectations (Charlesworth *et al.* 1995; Hudson and Kaplan 1995; Nordborg *et al.* 1996; Zeng and Charlesworth 2011; Nicolaisen and Desai 2012, 2013), regardless of whether the population is in equilibrium or has experienced recent changes in population size (Zeng 2013). Given the ubiquity of deleterious mutations and the fact that diversity patterns under background selection can be similar to those expected in the presence of recurrent selective sweeps, it has been suggested that a model incorporating selection against recurrent deleterious mutations may be a better null hypothesis than the standard neutral model in the study of genome-wide polymorphism patterns (Cutter and Payseur 2013; Comeron 2014).

The effects of background selection in a subdivided population are less well understood. Existing knowledge is mainly based on the theoretical work of Nordborg (1997) and Charlesworth *et al.* (1997). Their model assumes that the population is subdivided into two demes (or subpopulations; these two terms are used interchangeably) and is in statistical equilibrium between mutation, selection, drift, and migration. Analytical equations were derived for and the expected coalescent time for a pair of alleles sampled either from two different demes or from the same deme, respectively. These are then used to examine how background selection affects *F*_{ST}, the standard measure of population differentiation, which can be defined as (1)(*cf*. Slatkin 1991) (for other definitions, see Charlesworth 1998 or Holsinger and Weir 2009). Under this model, the effects of background selection on pairwise coalescent times and *F*_{ST} can be approximated as a reduction in the effective population size, denoted by *N*_{e}, such that the effective number of migrants is reduced relative to neutrality, leading to an increase in *F*_{ST} (Charlesworth *et al.* 1997; Nordborg 1997). Because, all else being equal, the *F*_{ST}-elevating effect of background selection is expected to increase as local recombination rates decrease, the fact that most species studied to date show substantial variation in recombination rate across their genomes (Smukowski and Noor 2011) implies that background selection alone can generate large “genomic islands” that have significantly higher differentiation levels compared to the rest of the genome (Charlesworth 1998). Indeed, the importance of excluding background selection as a confounding factor in the search for genes underlying local adaptation and speciation, which can also lead to genomic islands (Wu 2001; Beaumont 2005), has been highlighted in recent reviews (Noor and Bennett 2009; Cruickshank and Hahn 2014).

Although the model of Nordborg (1997) and Charlesworth *et al.* (1997) has provided important insights into the joint effects of background selection and population subdivision, it has several limitations. First, the equilibrium assumption means that it is not applicable to the study of populations that consist of demes derived recently from a common ancestral population (*e.g.*, *Drosophila melanogaster* and humans). Second, the effect of background selection is modeled as a reduction in *N*_{e}. However, analyses of models with panmixia have shown that this approximation does not capture the fact that background selection causes an excess of low-frequency variants in the site-frequency spectrum (SFS) (Charlesworth *et al.* 1995; Zeng and Charlesworth 2011; Nicolaisen and Desai 2012, 2013). It is unclear how the SFS is affected under the joint influence of population structure and background selection. Finally, the model provides predictions about expected diversity levels for individual neutral sites. But with the ever-increasing availability of large-scale data sets, it is desirable to be able to generate predictions for statistics that make use of multiple linked sites (*e.g.*, measures of linkage disequilibrium, the number of haplotypes), so that information in the data can be used more effectively.

In contrast to the background selection regime, the *interference selection regime* considers models with a large number of tightly linked selected sites (McVean and Charlesworth 2000; Comeron and Kreitman 2002). Interference selection is therefore an important factor for the study of asexual or highly self-fertilizing organisms as well as large genomic regions in sexually reproducing organisms that lack recombination (*e.g.*, the fourth chromosome of *D. melanogaster*). Models in this regime generally behave very differently from those in the background selection regime (reviewed by Neher 2013), and analytical results derived under background selection are typically not applicable (Kaiser and Charlesworth 2009; O’Fallon *et al.* 2010; Seger *et al.* 2010; Zeng and Charlesworth 2010). An interesting result obtained recently by Good *et al.* (2014) is that, in an equilibrium panmictic population, diversity patterns in linked neutral regions are determined by the variance in fitness at the selected sites, such that models that have different parameter values (*i.e.*, different numbers of selected sites and selection coefficients) but nevertheless identical variance in fitness will produce indistinguishable neutral diversity patterns. This *equivalence* between models implies that there is a limit to how much neutral diversity patterns can inform us about selection in their vicinity. It is of considerable interest to examine whether such a limit also exists when the population is subdivided.

In this study, we analyze models in both the background selection and interference selection regimes to address the questions raised above. In both cases, we focus on a demographic scenario where a panmictic ancestral population splits into multiple demes and examine diversity patterns for a period starting from the nonequilibrium phase immediately after the split, until the establishment of a new equilibrium. For the background selection regime, we first derive analytical results for both and as a function of time since the split under a two-deme model. We then construct a flexible coalescent simulator that can accommodate biologically important features including multiple demes, (nonconservative) migration, changes in population size, variation in selection coefficients against deleterious mutations across sites, and recombination. The reliability of these approximations is evaluated by checking them against extensive forward simulations. For the interference selection regime, because of a lack of a known solution of the Good *et al.* (2014) model in subdivided populations, we present forward simulation results. Specifically, we consider models that have different parameter values but identical variance in fitness in the panmictic ancestral population and ask whether they also have similar neutral diversity patterns following the emergence of population structure.

## The Model

Our analysis in both the background selection and interference selection regimes is concerned with a haploid population with a diploid phase in the life cycle where recombination may take place (*e.g.*, Zeng and Charlesworth 2011). Forward in time, the model initially contains a single panmictic ancestral population, which later splits instantly into *K* demes (*K* ≥ 2). Define *t* as the number of generations before the present and let *T* represent the time when the ancestral population gave rise to the demes (*T* ≥ 1). The size of the ancestral population is denoted by *N*_{0}(*t*) (*t* > *T*), whereas the sizes of the demes are denoted by *N _{k}*(

*t*) (

*k*= 1, 2, …,

*K*). In each generation, a proportion

*m*(

_{kl}*t*) of individuals in deme

*k*are immigrants from deme

*l*(

*k*≠

*l*). We focus on a genomic region with linked selected sites. New mutations arise at the

*i*th selected at rate

*u*per generation. Mutation is unidirectional, from ancestral (the wild type) to derived (deleterious). The fitness of an individual with a mutation at the

_{i}*i*th site is reduced by a fraction 1 –

*s*(multiplicative fitness), where

_{i}*s*is the selection coefficient. Mutations are assumed to accumulate in a similar manner at the neutral sites in the region.

_{i}We can describe the life cycle, using the parameters defined above. An individual in the next generation in deme *k* is a migrant from deme *l* (*l* ≠ *k*) with probability *m _{kl}*(

*t*). A migrant is created from a parent or a pair of parents in the deme of origin, depending on whether recombination occurs; otherwise, it is created from parent(s) in the current deme. Parents are chosen according to their relative fitnesses within the deme of their origin. The number of new deleterious mutations arising in each chromosome is Poisson distributed with mean

*U*= Σ

*u*; neutral mutations are added similarly. The number of crossovers between two parental chromosomes is Poisson distributed with mean

_{i}*R*, where

*R*is the total genetic distance between the two ends of the genomic region of interest. This life cycle corresponds to fertility selection among adults, followed by reproduction and juvenile migration.

## Approximations in the Background Selection Regime

In this section, we derive analytical results and construct a coalescent simulator, assuming that purifying selection is sufficiently strong that the frequencies of deleterious mutations are maintained close to their deterministic limit (*i.e.*, *u _{i}*/

*s*at the

_{i}*i*th selected site), which requires

*N*(

_{k}*t*)

*s*1 and

_{i}*u*

_{i}*s*(

_{i}*i*= 1, 2, …, and

*k*= 0, 1, …,

*K*) (see

*Results*for more discussion on the condition under which the approximations are expected to be valid).

### Recursion equations for a two-deme model

Assume that the ancestral population split into *K* = 2 demes at time *T* in the past. We focus on a single neutral site embedded in a genomic region with linked selected sites. We seek to find the effects of selection at the selected sites on the distribution of coalescent times at the neutral site. The restrictive assumptions set out here are later relaxed to allow multiple neutral sites and more than two demes when the coalescent simulator is constructed. We refer to the single neutral site as the focal site. The genetic distance between the focal site and the *i*th selected site is denoted by *r _{i}*.

First, consider the panmictic ancestral population. With strong selection, it has been shown that the selected sites behave approximately independently (Hudson and Kaplan 1995; Johnson 1999; Nicolaisen and Desai 2013). This property can be used to derive the effective population size in the presence of background selection backward in time. Specifically, if we follow the ancestry of a chromosome sampled from the present-day population, the number of deleterious mutations linked to the focal site changes because of mutation and recombination (*e.g.*, figure 2 of Zeng 2013). Let *P*(*D _{i}*,

*t*) be the probability that, at time

*t*in the past, the focal site in an ancestral lineage (chromosome) had variant

*D*at selected site

_{i}*i*, where

*D*= 0 or 1 depending on whether the wild type or the mutant type occurs at the site. Nicolaisen and Desai (2013) have shown that (2)where

_{i}*λ*=

_{i}*u*/

_{i}*s*. The statistical independence between selected sites suggests that the joint probability that an ancestral lineage carried variants

_{i}*D*(

_{i}*i*= 1, 2, …, )

*t*generations ago, denoted by can be calculated as (3)Note that, when

*s*, the random variable

_{i}*D*is approximately Poisson distributed with mean

_{i}*λ*at time

_{i}*t*= 0. Thus, Π

*(*

_{i}P*D*, 0) is equivalent to the well-known result that the number of deleterious mutations that a randomly sampled chromosome carries follows a Poisson distribution with mean

_{i}*λ =*∑

*(Johnson 1999).*

_{i}λ_{i}Two ancestral lineages in the ancestry of a sample can coalesce only when they are in the same *genetic background* (*i.e.*, having exactly the same set of variants at the selected sites) (Hudson and Kaplan 1995; Zeng and Charlesworth 2011), which occurs with probability When this happens, the two lineages coalesce with probability 1/*N*_{gb}(*t*), where *N*_{gb}(*t*) is the total number of individuals in that genetic background. To obtain *N*_{gb}(*t*), we note that, with strong selection, the selected sites remain approximately independent and the frequencies of mutants remain close to *u _{i}*/

*s*, even in the presence of changes in the population size (Zeng 2013), suggesting that Finally, the probability that a pair of ancestral chromosomes coalesce at time

_{i}*t*,

*P*

_{c}(

*t*), can be interpreted as 1/

*N*

_{e0}(

*t*), where

*N*

_{e0}(

*t*) is the effective population size of the ancestral population. Put together, we arrive at the result of Nicolaisen and Desai (2013): (4)Rearranging Equation 4, the effective population size at time

*t*is approximately (5)Letting

*t*approach infinity, we obtain (6)which is the classic result derived by Hudson and Kaplan (1995) and Nordborg

*et al.*(1996) (

*C*= lim

_{t}_{→∞}

*B*(

*t*)). In other words, the classic result approximates the effects of background selection by using a neutral model with a reduced

*N*

_{e}and therefore does not predict that the shape of the SFS should deviate from neutral expectations. In contrast, Equation 5 shows that

*N*

_{e0}(0) =

*N*

_{0}(0) and

*N*

_{e0}(

*t*) declines as time proceeds backward (see Supporting Information, Figure S1). Put differently, forward in time, Equation 5 approximates the effects of background selection by using a neutral model with a growing population size, which explains the characteristic excess of low-frequency variants (Charlesworth

*et al.*1995; Zeng and Charlesworth 2011; Nicolaisen and Desai 2013). Hence, we expect Equation 5 to provide better approximations when there is a significant chance for coalescent events to occur at a stage where

*N*

_{e0}(

*t*) differs substantially from

*N*

_{e0}(∞) (

*e.g.*, when the sample size is large; Figure S1).

When the ancestral population splits, because the juvenile populations of the two demes that exist before migration are propagated in the same way as a new generation is created in the ancestral population, the dynamics (backward in time) of a chromosome sampled randomly from either deme are determined by Equation 3. Therefore, we can use Equation 5 to model the coalescent process for a sample of chromosomes. For a chromosome sampled from deme *k* after migration, it originates from the juvenile population of either deme *k* or deme *l* with probability 1 – *m _{kl}* and

*m*, respectively (assuming, for simplicity, that the migration rate is constant over time). In this case, Equation 3 can still be employed to determine the distribution of variants at the linked selected sites because = [1 –

_{kl}*m*]Π

_{kl}*(*

_{i}P*D*,

_{i}*t*) +

*m*Π

_{kl}*(*

_{i}P*D*,

_{i}*t*) = Π

*(*

_{i}P*D*,

_{i}*t*). The same heuristic argument can be used to show that Equation 3 is approximately true for any generation after the split. In other words, the population structure introduced by the split does not change the frequency of the mutant variant at each selected site (see also Maruyama 1972), nor does it affect the statistical independence between selected sites. Therefore, the probability that two ancestral chromosomes coalesce in deme

*k*at time

*t*before the present is given by Equation 4 with

*N*

_{0}(

*t*) replaced by

*N*(

_{k}*t*). Thus, the effective population size of deme

*k*(

*k*= 1, 2) under the influence of background selection can be modeled as (7)Equations 5 and 7 suggest that the effects of background selection on diversity patterns at the focal site can be approximated by neutral models with time-dependent population (deme) sizes. Thus, we can derive recursion equations to describe the coalescent process, using the neutral framework developed by Teshima and Tajima (2002). As is the case for most coalescent models (Hudson 1990), we assume the population (deme) sizes [

*i.e.*,

*N*

_{e}

*(*

_{k}*t*)] to be sufficiently large and the migration rates [

*i.e.*,

*m*(

_{kl}*t*)] to be sufficiently small that the chances of the following events occurring are negligible: (1) a coalescent event involving more than two ancestral lineages, (2) more than one coalescent event happening in a generation, and (3) an individual migrating multiple times in a generation.

First, consider the ancestral population. The probability that there are *n* ancestral lineages at time *t*, *Q _{n}*(

*t*), satisfies (8)for

*t*>

*T*, with the initial condition

*Q*(

_{n}*T*) = 1, if

*n*is equal to the number of ancestral lineages at time

*T*, and 0 otherwise. The first term on the right-hand side describes a scenario where the

*n*lineages at time

*t*– 1 persist to time

*t*due to a lack of coalescence, whereas the second term considers the possibility of a coalescent event at time

*t*, reducing the number of lineages from

*n*+ 1 to

*n*.

To obtain recursions for *t* ∈ [1, *T*], during which the population is composed of two demes, we define (1) *Q _{a}*

_{,}

_{n}_{–}

*(*

_{a}*t*), the probability that there are

*a*and

*n – a*ancestral lineages in demes 1 and 2, respectively, at time

*t*[(

*a*,

*n*–

*a*) is referred to as a sampling configuration], and (2)

*M*

_{a}_{,}

*(*

_{b}*t*|

*n*), the probability that migration changes the sampling configuration from (

*a*,

*n – a*) to (

*b*,

*n – b*).

Using Equation 7, we can modify the result of Teshima and Tajima (2002), so that it takes into account the effects of background selection on the coalescent process at the focal site, (9)where (10)The three terms on the right-hand side of Equation 9 describe the following events, respectively: (1) There is no coalescence in either deme and migration changes the sampling configuration from (*b*, *n* – *b*) at time *t* – 1 to (*a*, *n* – *a*) at time *t*; (2) migration changes the sampling configuration from (*b*, *n* + 1 – *b*) to (*a* + 1, *n* – *a*), followed by a coalescent event in deme 1, leading to (*a*, *n* – *a*); (3) migration changes the sampling configuration from (*b*, *n* + 1 – *b*) to (*a*, *n* + 1 – *a*), followed by a coalescent event in deme 2, leading to (*a*, *n* – *a*). Equation 10 is the same as the corresponding equation on p. 83 of Teshima and Tajima (2002) and can be derived by noting that the number of migrants among the *a* chromosomes in deme 1 follows a binomial distribution with parameters *a* and *m*_{12}(*t*), and that in deme 2 is also binomially distributed with parameters *n* – *a* and *m*_{21}(*t*).

Assume that *a*_{0} and *b*_{0} chromosomes are sampled randomly at *t* = 0 (the present) from demes 1 and 2, respectively. Appealing to Equations 8 and 9, we can calculate *f*(*n*, *t | a*_{0}, *b*_{0}), the probability that the number of ancestral lineages reduces from *n* to *n* – 1 at time *t* given the initial sampling configuration (*a*_{0}, *b*_{0}), as (11)with the initial condition that *Q _{a}*

_{,}

*(0) = 1, if*

_{b}*a*=

*a*

_{0}and

*b*=

*b*

_{0}, and 0 otherwise. Following Teshima and Tajima (2002), we have dropped terms of order 1/[

*N*

_{e}

*(*

_{k}*t*)]

^{2}or higher in Equation 11, which include events such as migration followed by coalescence in the same generation or a coalescent event involving more than two ancestral lineages. Equation 11 can be solved numerically to obtain properties of the coalescent process at the focal site under the joint influence of background selection and population subdivision. For a two-deme model (

*K*= 2), the complexity of the recursion equation, measured by the number of

*Q*

_{a}_{,}

*(*

_{b}*t*) for a given

*t*, is (

*n*+ 4)(

*n*– 1)/2 for a sample of size

*n*. Moreover,

*f*(

*a*,

*t | a*

_{0},

*b*

_{0}), where

*a*<

*n*, depends on all

*f*(

*b*,

*t | a*

_{0},

*b*

_{0}), where

*a*<

*b*≤

*n*. Thus, evaluating the recursion equation becomes complex even for a modest

*n*. In this case, conducting coalescent simulations is a more efficient alternative (see below).

The theory developed above can be used to calculate and and hence *F*_{ST} (Equation 1), as a function of *T*, the time since the split of the ancestral population. We need only to consider a sample of size 2, for which there are three possible sampling configurations at *t* = 0: (*a*_{0}, *b*_{0}) ∈ Equation 11 becomes (12) and can be expressed in terms of *f*(2, *t* | *a*_{0}, *b*_{0}) as (13)It is possible to weigh the contribution of the demes differently in the definition of (*e.g.*, Hudson *et al.* 1992). Our choice is partly for simplicity and partly because we used models with two equal-sized demes in most of our simulations.

We illustrate the calculation of and using a model where *N*_{0}(*t*) ≡ *aN*, *N*_{1}(*t*) ≡ *c*_{1}*N*, *N*_{2}(*t*) ≡ *c*_{2}*N*, *m*_{12}(*t*) ≡ *m*_{12}, and *m*_{21}(*t*) ≡ *m*_{21}. The migration probabilities *M _{a}*

_{,}

*(*

_{b}*t*| 2) become time independent and are denoted by

*M*

_{a}_{,}

*. A matrix with elements*

_{b}*M*

_{a}_{,}

*(0 ≤*

_{b}*a*,

*b*≤ 2) can be defined as (14)Equation 9 reduces to (15)which can be iterated numerically by using the initial condition

*Q*

_{a}_{,}

*(0) = 1, if*

_{b}*a*=

*a*

_{0}and

*b*=

*b*

_{0}, and 0 otherwise. The values of

*Q*

_{a}_{,}

*(*

_{b}*t*) can then be fed into Equation 12 to obtain

*f*(2,

*t*|

*a*

_{0},

*b*

_{0}), which can in turn be used to evaluate Equation 13.

### The fast approximation for a two-deme model

It is possible to obtain explicit analytical formulas for predicting and at the focal site by making an additional assumption: If *t* has become sufficiently large when the two ancestral lineages coalesce that *B*(*t*) ≈ *C* (see Figure S1), then the effects of background selection on the coalescent process can be well approximated by a neutral model with appropriately reduced population (deme) sizes. This is based on the *fast approximation* proposed by Nordborg (1997). It requires that the rate at which the two chromosomes of interest coalesce is much slower than changes in the distribution of variants at the linked selected sites (governed by Equation 3).

Consider, as an example, a special case of the model that leads to Equations 14 and 15 in which *N*_{0}(*t*) ≡ *aN*, *N*_{1}(*t*) ≡ *N*_{2}(*t*) ≡ *N*, and *m*_{12}(*t*) ≡ *m*_{21}(*t*) ≡ *m*. Under the fast approximation, the effects of background selection can be approximated by a neutral model in which an ancestral population of size *aCN* gave rise to two demes of size *CN* in generation *T* before the present, with migration rates *m*_{12} = *m*_{21} = *m*. Thus, we can apply theoretical results derived for this neutral “isolation with migration” model (Nath and Griffiths 1993; Wakeley 1996; Teshima and Tajima 2002; Wilkinson-Herbots 2008). Following Wilkinson-Herbots (2008; see section 3.1 therein), we scale time in units of *CN* generations and define *τ*_{e} = *T*/(*CN*), = = *M*_{e} = *CNm*, *D =* (16*M*_{e})^{2} + 1, and (16)The background selection-aware versions of equations 22 and 23 of Wilkinson-Herbots (2008) can be written as (17)Letting *τ*_{e} go to infinity, we recover the well-known result for a symmetric two-deme model (Nordborg 1997): (18)That is, the effects of background selection on *F*_{ST} are equivalent to a reduction in the scaled migration rate from *M* = *Nm* in the absence of recurrent deleterious mutations at linked sites to *M*_{e} in the present model. As a result, *F*_{ST} is increased under background selection relative to neutrality (Charlesworth *et al.* 1997).

### The coalescent simulator

The two approximations developed above consider a single neutral site embedded in a genomic region with selected sites. Here we relax this restriction by extending the coalescent model of Zeng and Charlesworth (2011; Zeng 2013). This extension relies on a similar set of assumptions used to obtain the recursion equations. We demonstrate the key features of the coalescent process by focusing on the following case: (1) a genomic region with *L* sites, all of which are subject to purifying selection (*i.e.*, *L* = ); (2) the selection coefficient *s*, the mutation rate *u*, and the recombination rate *r* are all constant across sites; and (3) the population is in equilibrium and consists of two demes of size *N*, connected by symmetric migration occurring at rate *m* per generation. Thus, the total mutation and recombination rates for the region concerned are *U* = *uL* and *R* = *rL*, respectively. The number of deleterious variants a randomly sampled chromosome carries is Poisson distributed with mean *λ* = *U*/*s*. Time, denoted by *τ*, is scaled in units of *N* generations, with *τ* = 0 representing the present. The dynamics of the coalescent are determined by the following scaled parameters: *γ* = *Ns*, *ρ* = *Nr*, and *M* = *Nm*. Specifically, to generate a random genealogy for a sample of *n* chromosomes, we first determine the number of deleterious mutations that each chromosome carries at *τ* = 0, which can be done by sampling from Poisson(*λ*). Backward in time, four types of events can occur: mutation (loss of deleterious mutation), recombination, migration, and coalescent. Under the assumptions of the model, these events constitute four independent, competing Poisson processes with rates *μ*_{m}, *μ*_{r}, *μ*_{mig}, and *μ*_{c}, respectively. The time to the next event can be determined by sampling from an exponential distribution with rate *μ* = *μ*_{m} + *μ*_{r} + *μ*_{mig} + *μ*_{c}. Then, the probability that this event is a mutation event is equal to *μ*_{m}/*μ*, proportional to its contribution to *μ*; the same applies to the other events.

The rate at which a deleterious mutation is lost is *γ*. Therefore, *μ*_{m} = *iγ* when there are in total *i* mutations among all the *n* ancestral lineages at that time. If the next event is a mutation event, one of the *i* mutations is chosen at random for removal. The rates to the next migration or recombination events are *μ*_{mig} = *nM* and *μ*_{r} = *nρL*, respectively. When migration happens, a randomly chosen chromosome is moved to the other deme. For a recombination event, we pick one chromosome at random and split it into two ancestral chromosomes. In this process, the deleterious mutations originally carried by the affected chromosome are randomly distributed to the two ancestral chromosomes, using the algorithm of Zeng and Charlesworth (2011). This redistribution of mutations during recombination affects the coalescent probability between lineages in the coalescent process and is a major feature that distinguishes the current method from that of Hudson and Kaplan (1995).

The calculation of the total coalescent rate is the most time-consuming, because it requires conducting pairwise comparisons for all chromosomes existing at the time, which in turn involves examining whether the physical distributions of the deleterious mutations carried by each pair of chromosomes (determined by the mutation and recombination events that have affected them thus far) are compatible (*i.e.*, whether it is possible for them to have been born to the same parent) and, if so, the physical distribution of these mutations in the ancestral chromosome into which these two chromosomes may merge (similar to the arguments leading to Equation 4; for details, see Zeng and Charlesworth 2011 and Zeng 2013). Let *μ*_{c}* _{ij}* be the coalescent rate between chromosomes

*i*and

*j*(see equation 11 of Zeng and Charlesworth 2011). Then

*μ*

_{c}= ∑

*μ*

_{c}

*for all*

_{ij}*i*≠

*j*(

*μ*

_{c}

*= 0 if the two chromosomes in a pair are from different demes and/or the physical locations of deleterious mutations mean that they could not have descended from the same parent in the previous generation). Given that the next event is a coalescent, we sample*

_{ij}*i*and

*j*from a distribution with probability density

*μ*

_{c}

*/*

_{ij}*μ*

_{c}and merge them into their corresponding ancestral chromosome.

Introducing neutral regions into the coalescent model is straightforward, because their effects on the coalescent process are equivalent to increasing the recombination rates between selected sites. It is also possible to accommodate multiple selection coefficients, variation in recombination rate, changes in population size, multiple demes, asymmetrical migration rates, and merging and splitting of populations. We present data obtained from some of these more complex models in *Results*.

## Simulation Methods

We used forward simulations to (1) test the validity of the three approximations that we have developed in the background selection regime and (2) study neutral diversity patterns in the interference selection regime. In addition to and we also considered the following summary statistics: (1) the site-frequency spectrum; (2) , the total branch length; and (3) , the ratio of the sum of all external branches (*i.e.*, those leading to the sampled chromosomes) to . *E*() is proportional to the expected number of segregating sites in the sample, and *E*() is the expected proportion of singletons (*i.e.*, polymorphic sites where the derived allele appears only once). All statistics were calculated using data from neutral sites in the simulated region.

### Forward simulation

The algorithm, implemented in a Java package named *Forwards*, is an extended version of that used by Zeng and Charlesworth (2010, 2011). It is based on the life cycle described above. In all cases, we initialized the simulation with a panmictic population of size *N*_{0} that was free of deleterious mutations. This population was allowed to evolve for at least 10*N*_{0} generations to reach statistical equilibrium before splitting into multiple demes. Because forward simulations are computationally expensive, we used *N*_{0} = 1000 in most cases and kept track of the local genealogies at only a handful of sites across the simulated region. The results reported below were typically based on 100 replicates of simulations.

### Coalescent simulation

The coalescent simulator has been implemented in a computer program called *msbgs*. This program is based on *ms* (Hudson 2002). To improve efficiency, efforts have been spent on enhancing the memory management algorithm and eliminating unnecessary comparisons between ancestral chromosomes when calculating *µ*_{c}. A small-scale comparison suggests that *msbgs*, written in C, is up to two orders of magnitude more efficient than a previous nonoptimized implementation and is often comparable to *ms* in speed (data not shown). The data presented below were typically based on 10^{4} replicates of simulations.

### Data availability

User-friendly versions of the computer programs used in this study have been made available at http://zeng-lab.group.shef.ac.uk.

## Result

### The accuracy of the approximation methods in predicting pairwise coalescent times and *F*_{ST} under background selection

We assess the accuracy of the three new methods [the recursion (Equation 13), the fast approximation (Equation 17), and the coalescent simulator] by comparing their predictions to results obtained from forward simulations. In addition, we are interested in determining the condition under which the approximations are expected to perform well. Previous analysis of the behavior of Muller’s ratchet in randomly mating populations of constant size *N* without recombination has found that the ratchet is unlikely to click when *n*_{0}*s* 1, where *n*_{0} is the number of individuals in the mutation-free class [*i.e.*, *N*Π* _{i}P*(

*D*= 0, 0)] (Gordo

_{i}*et al.*2002; Jain 2008; Neher and Shraiman 2012). More generally,

*n*

_{0}can be expressed as

*CN*, so that it can accommodate recombination and variation in selection coefficients. In the special case where the total deleterious mutation rate is

*U*, the selection coefficient is

*s*, and the total recombination rate is

*R*,

*CN*reduces to

*N*e

^{−}

^{U}^{/(}

^{s}^{+}

^{R}^{/2)}(Nicolaisen and Desai 2013). Thus, we examine whether

*n*

_{0}

*s*1 is also a good indicator of the validity of the new approximations. Specifically, for the rest of this study, we define

*n*

_{0}as

*CN*

_{min}, where

*N*

_{min}is the smallest population size among all the demes and the ancestral population in a particular model.

We begin with cases without recombination. Figure 1, A and B, is based on models where the ancestral population splits into 2 and 10 equal-sized demes, respectively. For the 10-deme case, we conducted neutral simulations in which the population (deme) sizes were reduced by a fraction of *C*, equivalent to the fast approximation. The analytical equations and the coalescent simulator capture the effects of background selection on the two pairwise coalescent times ( and ) and *F*_{ST} well. There is little difference in performance between the three approximations during both the nonequilibrium phase after the split of the ancestral population and the equilibrium phase. This suggests that the effects of background selection on these three measures of diversity can be well summarized by a reduction in *N*_{e}, as assumed by the fast approximation. This is not surprising because the pairwise coalescent times are typically long compared to the time it takes for *B*(*t*) (Equation 5) to converge to *C* (Nicolaisen and Desai 2012, 2013). It therefore explains the observation that obtained under the fast approximation and the recursion is always very similar, whereas which is typically much smaller, can be substantially different in some cases (Figure S2, Figure S3, and Figure S4). Compared to and *F*_{ST} converges to its new equilibrium value at a much faster rate (Figure 1 and Figure 2; Figure S2, Figure S3, and Figure S4). Additional comparisons also suggest that *F*_{ST} patterns obtained from models with different parameter values are often more similar to each other than when and are compared (data not shown). These results imply that *F*_{ST} may be less informative about ancestral events than other aspects of the data [*e.g.*, *π*_{B} (between-deme diversity) and *π*_{S} (within-deme diversity), which are proportional to and respectively].

Based on a large number of parameter combinations (*e.g.*, Figure S2, Figure S3, and Figure S4), we find that the three methods generally provide fairly accurate predictions when *n*_{0}*s* ⪆ 5. When *n*_{0}*s* < 1, all three deviate significantly from the forward simulations, with always being overestimated, but there is no consistent pattern regarding their abilities in predicting With *n*_{0}*s* < 1, the recursion and the coalescent simulator tend to provide *F*_{ST} predictions that are closer to the forward simulation results.

In Figure 2, we consider a model with recombination, two types of selected sites that are subject to different intensities of purifying selection, and nonconservative migration between the two demes (*i.e.*, *N*_{1}*m*_{12} ≠ *N*_{2}*m*_{21}). Because Equation 17, which requires symmetric migration, no longer applies, fast approximation results were obtained by simulating a neutral model with reduced deme sizes. Background selection depresses both and with the effects on the latter being more pronounced. The net effect is an elevation of *F*_{ST}. This phenomenon is strongest in the center of the selected region (position 6501 in Figure 2) and weakens as we move farther into the flanking neutral regions (*e.g.*, position 1 in Figure 2). Overall, the new methods agree with the forward simulations very well.

### The joint effects of background selection and population subdivision on the site-frequency spectrum

When the sample size increases, so that more coalescent events took place in the recent past when *B*(*t*) deviates significantly from *C* (Nicolaisen and Desai 2012, 2013), we expect the coalescent simulator to perform better than the fast approximation. We illustrate this point, in Figure 3, using a two-deme model, by comparing the SFS obtained from forward simulations with those obtained from conducting coalescent simulations using either the full background selection model or a neutral model with the deme size reduced from *N* to *CN* (the fast approximation). A total of 30 alleles were sampled, with each deme contributing 15. As expected, the SFS shifts toward low-frequency variants under the influence of background selection. However, the effects on intermediate- and high-frequency variants seem complex. In this particular example, background selection makes the characteristic uptick in the middle of the SFS less visible and even causes a slight excess of high-frequency variants. Although the fast approximation fails to capture these features, predictions generated by the coalescent simulator are fairly accurate (Figure 3).

Not surprisingly, when coalescence within demes is negligible (which tends to occur in the more recent past), as is the case when there are many demes and only one allele is taken from each deme, background selection has little effect on the shape of the SFS (Figure 4; see Figure S5 for another example). In fact, as the number of demes *K* increases, the SFS converges to that expected under a panmictic population of constant size *N*_{e_large}* _{_K}* =

*CNK*[1 + 1/(2

*CNmK*)], where

*N*is the deme size and

*m*≡

*m*(

_{kl}*cf*. Chap. 6 of Wakeley 2008). This “scattered” sampling scheme (reviewed in Chap. 6 of Wakeley 2008) can potentially be exploited in the analysis of data from structured populations. It should, however, be noted that, due to

*N*

_{e_large}

*’s dependency on*

_{_K}*C*, the scattered sampling scheme does not remove background selection’s influence on local diversity levels. Thus, everything else being equal, regions with more frequent recombination are expected to have more polymorphic sites (Figure S6).

In Table 1, we further explore how diversity patterns depend on the parameters. Cases 1 and 2 have identical *u*, *s*, *r* (and thus, identical *C*), and *M* (= *Nm*), meaning that they are equivalent under the fast approximation; the same applies to cases 4 and 5. However, forward and coalescent simulations suggest that these models have different average total branch lengths (proportional to the number of segregating sites) and different average proportions of the tree accounted for by external branches (proportional to the number of singletons). In contrast, models with identical *θ*, *γ*, *ρ*, and *M* behave in a similar manner (case 1 *vs.* case 3; case 4 *vs.* case 6). These data thus provide further evidence of the inaccuracy of the fast approximation. They also suggest that, although Equations 5 and 9 depend on *u*, *s*, *r*, and *m*, it is the scaled counterparts of these parameters (*i.e.*, *θ*, *γ*, *ρ*, and *M*) that determine the behavior of the model.

### Equivalence of neutral diversity patterns in the interference selection regime

We have thus far focused exclusively on cases where selection is strong (*i.e.*, *n*_{0}*s* 1). However, for the study of large nonrecombining genomic regions (*e.g.*, in asexual and highly self-fertilizing organisms), it is important to explore the interference selection regime. In this case, *U*, the total deleterious mutation rate, can become large due to tight linkage, such that *U*/*s* 1, and *n*_{0}*s* approaches 0. Good *et al.* (2014) have shown that, in an equilibrium panmictic population of size *N*, neutral diversity patterns are determined by the variance in fitness, denoted by *σ*^{2}. Specifically, for a given *N*, different combinations of *U* and *s* that lead to the same composite parameter *Us*^{2} will have not only the same *σ*^{2}, but also indistinguishable neutral diversity patterns. Here we use forward simulations to explore models with the same *σ*^{2} in the ancestral population (which can be calculated numerically using the Python package provided by Good *et al.* 2014) and follow their dynamics after the emergence of population structure and gene flow.

There is evidence for the existence of the equivalence between models with respect to neutral diversity patterns, both in the nonequilibrium phase following the divergence from an ancestral population and later in equilibrium. This has been observed for models with *Ns* < 1 (Figure 5) and those with *Ns* substantially >1 (Figure 6), as well as in the following cases: (1) models with selection coefficients and sampling schemes different from those shown in Figure 5 (Figure S7 and Figure S8), (2) selection coefficients sampling from a distribution of fitness effects (Figure 6), (3) more than two demes (Figure 5B; Figure S9), and (4) a two-deme model with nonconservative migration (*i.e.*, *N*_{1}*m*_{12} ≠ *N*_{2}*m*_{21} and *m*_{12} ≠ 0 and *m*_{21} ≠ 0; Figure S10). However, we found a few exceptions, for instance, in a two-deme model in which only one of the migration rates is nonzero (Figure S11). In this case, for a fairly long period after the split, neutral diversity patterns produced by the different selection models clearly overlap before they eventually diverge from one another, suggesting that these models could still be very hard to distinguish for a substantial amount of time.

We have also investigated the effects of the scattered sampling strategy on the SFS in the many-deme limit. Since conducting forward simulations in the presence of a large number of demes is very time-consuming, only two pairs of models have been attempted (Figure 7). Interestingly, although the shape of the SFS remains similar for models with the same *σ*^{2} in the ancestral population (*i.e.*, models 1 and 2; models 3 and 4), there is an excess of both low- and high-frequency variants. This U-shaped SFS, which is characteristic for panmictic populations under interference selection (Neher and Hallatschek 2013), is markedly different from those observed in the background selection regime (*cf*. Figure 4), suggesting some fundamental differences between the two regimes.

## Discussion

### Advantages and limitations of the three approximations in the background selection regime

The above analyses show that *F*_{ST} values at neutral sites are expected to be higher in genomic regions that either have lower recombination frequencies (see Jackson *et al.* 2015 for a recent analysis in *D. melanogaster*) or are more tightly linked to functionally important elements (Figure 2). This heterogeneity may lead to false positives in genomic scans for outlier loci with unusually high *F*_{ST} (Noor and Bennett 2009; Cruickshank and Hahn 2014). A possible use of the new approximations is to generate baseline *F*_{ST} values in the presence of background selection for comparison. An illustrative example is a recent study by Comeron (2014) in which, by applying analytical results developed for panmictic populations (Equation 6) to a *D. melanogaster* data set, it was found that a model with background selection and recombination (induced by both crossing over and gene conversion) provides a much better explanation of patterns of diversity (as measured by *π*, nucleotide diversity, in putatively neutral regions) than variation in recombination rates alone. Furthermore, Comeron (2014) identified possible targets of recent positive selection (*i.e.*, loci with lower diversity levels than predicted) and balancing selection (*i.e.*, loci with higher than expected diversity levels). From previous studies of single populations (Charlesworth 1996, 2012b; McVicker *et al.* 2009; Cutter and Choi 2010; Rockman *et al.* 2010; Flowers *et al.* 2012; Comeron 2014), it is clear that applying these methods requires knowledge of the deleterious mutation rate, the distribution of fitness effects of new mutations, the recombination rate (both crossover and gene conversion), and genome annotations that specify the physical locations of various functional elements. Additionally, for subdivided populations, migration rates and the time since the demes derived from the ancestral population are also needed. With the rapid accumulation of large-scale genomic resources, estimates of these parameters are becoming available to an increasingly large array of taxa (see a recent survey by Corbett-Detig *et al.* 2015). On the other hand, by examining the fit between model predictions and observations [*e.g.*, *π*_{B} (between-deme diversity), *π*_{S} (within-deme diversity), or *F*_{ST}], the new methods may assist the estimation of some of the parameters (*cf*. Corbett-Detig *et al.* 2015).

For predicting *F*_{ST}, the three methods behave similarly (Figure 1 and Figure 2). Hence, the computationally more expensive coalescent simulator is probably not the method of choice. Since none of the methods are expected to work well in the interference regime, they should not be applied to genomic regions with severely restricted recombination rates. The fast approximation is the easiest to compute and can make use of known results derived for neutral models. In cases where analytical results for corresponding neutral models are unavailable, neutral simulations can be employed (see Figure 2). However, because *C* varies across putatively neutral sites in the presence of recombination (Equation 6), each focal site requires its own set of simulations (Figure 2), which may be computationally prohibitive. The recursion can readily accommodate complicating factors such as time-dependent migration rates and/or varying population sizes. (These factors often defy the derivation of analytical equations under neutrality and hence the applicability of the fast approximation.) For a model with *K* demes, the complexity of the recursion, measured by the number of *Q _{k=a}*

_{,}

_{l=}_{2−}

*(*

_{a}*t*) (

*i.e.*, the chance that, at time

*t*in the past,

*a*of the two chromosomes were in deme

*k*and the rest were in deme

*l*), is

*K*(

*K*+ 1)/2. Hence, it should be applicable to models with a modest

*K*.

The coalescent simulator possesses several advantages over the other two methods. First, it is the only one that can generate large samples with multiple linked sites and can therefore be used to assess the joint impact of population structure and background selection on statistics such as the number of segregating sites, the number of haplotypes, and measures of linkage disequilibrium, which are often included in inference methods (*e.g.*, Duchen *et al.* 2013). Second, the coalescent simulator is the most flexible among the three with respect to incorporating biologically important factors. In addition to those presented above, it should be possible to include variation in recombination rates (Hellenthal and Stephens 2007) and gene conversion (Wiuf and Hein 2000).

An important area for future research is the generation of predictions about genome-wide diversity patterns for samples containing more than two chromosomes. It is known that the standard implementation of the *neutral* ancestral recombination graph suffers from reduced efficiency when the recombination rate is high (*e.g.*, as a result of simulating a very large genomic region) (Marjoram and Wall 2006; Staab *et al.* 2015). The efficiency and reliability of the coalescent simulator with background selection also decline as a function of the recombination rate (see Zeng 2013 for a method for assessing this potential problem), meaning that it is not applicable to genome-level data sets. A solution proposed for the neutral case is the sequential Markov coalescent model (McVean and Cardin 2005). This approach requires that each genomic position has the same distribution of marginal genealogies, which does not apply in the presence of background selection (*e.g.*, Figure 2) and is therefore probably not applicable (data collected from a small-scale experiment agree with this intuition; K. Zeng, unpublished results). Another possibility is to employ a sliding-window approach that involves dividing the genome into *linkage blocks* whereby sites within a block are in strong linkage disequilibrium and are treated as if they were completely linked, whereas different blocks are treated as though they were unlinked (Good *et al.* 2014; Weissman and Hallatschek 2014). Because the dependence of diversity levels on the population size under models in the background selection regime is much stronger than under the models in the interference selection regime considered by Good *et al.* (2014) and Weissman and Hallatschek (2014), ignoring the diversity-reducing effects of unlinked blocks on the focal block as in those previous studies is probably undesirable. Fortunately, the effects of unlinked sites are known and can be modeled as a reduction in *N*_{e} in the focal block (*e.g.*, equation 4 of Charlesworth 2012a). A third possibility is to carry out neutral simulations for each site, assuming that the effective population size at time *t* is determined by Equations 5 and 7 (*cf*. Nicolaisen and Desai 2013). However, this is likely to be computationally demanding because Equations 5 and 7 vary across sites, and therefore it is necessary to carry out simulations for each neutral site in turn. Moreover, it cannot make predictions about patterns of linkage disequilibrium between sites, which often contain useful information about demography. In sum, the development and assessment of methods for generating genome-wide predictions warrant systematic investigations in the future.

### Equivalence in the interference selection regime

In contrast to the background selection regime, interference selection is induced by a large number of tightly linked selected sites (reviewed by Neher 2013). Some of the parameters we used in the simulations may not be unrealistic for the fourth chromosome of *D. melanogaster* (*e.g.*, models 1 and 2 in Figure 6; see Kaiser and Charlesworth 2009). Our simulations provide evidence that, in subdivided populations, neutral diversity patterns obtained under models with fewer selected sites under stronger selection can be very similar to those obtained under models with more selected sites under weaker selection (*e.g.*, Figure 5). In addition, similar diversity patterns can also be generated by models in which selection coefficients follow different distributions (*e.g.*, Figure 6). The requirement for this *equivalence* to occur is that, assuming a common demographic scenario (defined by the size of the ancestral population, the sizes of the demes, and the migration rates), the models have identical values of the compound parameter *U* where = ∫*s*^{2}*ρ*(*s*)*ds*, with *ρ*(*s*) being the density function of the distribution of fitness effects of new mutations. Thus, although not a proof, our simulations suggest that neutral diversity patterns collected from subdivided populations may have limited ability in differentiating different selection models in their nearby regions, as first pointed out by Good *et al.* (2014) in their analysis of panmictic populations in equilibrium.

The above results point to the importance of further theoretical developments to establish the precise condition under which the equivalence between models occurs. By developing a mapping between the strong and weak selection regimes, Good *et al.* (2014) showed that the equivalence can be exploited such that methods developed in the background selection regime can be used to generate predictions for models in the interference selection regime. If such a mapping could be established in the presence of population subdivision, then the background selection theory developed above may have application outside its intended parameter region (see Figure S12 for an exploratory experiment). Nevertheless, the fact that the scattered sampling method does not alleviate the effect of selection on the SFS, as it does in the background selection regime (Figure 4 *vs.* Figure 7), suggests that there are important differences between the two regimes, which may imply that developing a mapping for subdivided populations may be more challenging than for panmictic populations. A further difficulty is that the transition from the background selection regime to the interference regime is not infinitely sharp (compare Figure 6 and Figure S13). Therefore, even if a mapping can be developed, there may be a “gray area” between the two regimes that requires a different kind of theoretical treatment. Overall, these considerations indicate that more analyses to further our understanding of the interference selection regime are warranted.

## Acknowledgments

We thank Mark Beaumont, Brian Charlesworth, and two anonymous reviewers for helpful comments. We acknowledge support from the Biotechnology and Biological Sciences Research Council (BB/K000209/1) and the Natural Environment Research Council (NE/L005328/1). This project made use of the computational resources provided by the University of Sheffield’s high-performance computer cluster, Iceberg.

## Footnotes

*Communicating editor: M. A. Beaumont*Supporting information is available online at www.genetics.org/content/early/2015/09/30/genetics.115.178558/suppl/DC3.

- Received June 4, 2015.
- Accepted September 24, 2015.

- Copyright © 2015 by the Genetics Society of America