## Abstract

Compared to a neutral model, purifying selection distorts the structure of genealogies and hence alters the patterns of sampled genetic variation. Although these distortions may be common in nature, our understanding of how we expect purifying selection to affect patterns of molecular variation remains incomplete. Genealogical approaches such as coalescent theory have proven difficult to generalize to situations involving selection at many linked sites, unless selection pressures are extremely strong. Here, we introduce an effective coalescent theory (a “fitness-class coalescent”) to describe the structure of genealogies in the presence of purifying selection at many linked sites. We use this effective theory to calculate several simple statistics describing the expected patterns of variation in sequence data, both at the sites under selection and at linked neutral sites. Our analysis combines a description of the allele frequency spectrum in the presence of purifying selection with the structured coalescent approach of Kaplan *et al.* (1988), to trace the ancestry of individuals through the distribution of fitnesses within the population. We also derive our results using a more direct extension of the structured coalescent approach of Hudson and Kaplan (1994). We find that purifying selection leads to patterns of genetic variation that are related but not identical to a neutrally evolving population in which population size has varied in a specific way in the past.

PURIFYING selection acting simultaneously at many linked sites (“background selection”) can substantially alter the patterns of molecular variation at these sites and at linked neutral sites (Hill and Robertson 1966; Kaplan *et al.* 1988; Hudson and Kaplan 1994, 1995; McVean and Charlesworth 2000; Gordo *et al.* 2002; O’Fallon *et al.* 2010; Seger *et al.* 2010). In recent years, evidence from sequence data points to the general importance of these selective forces among many linked variants in microbial and viral populations and on short distance scales in the genomes of sexual organisms (Comeron *et al.* 2008; Hahn 2008; Seger *et al.* 2010). In these situations, existing theory does not fully explain patterns of molecular evolution (Hahn 2008).

It is difficult to incorporate negative selection at many linked sites into genealogical frameworks such as coalescent theory, because these frameworks typically rely on characterizing the space of possible genealogical trees *before* considering the possibility of mutations at various locations on these trees. When selection operates, the probabilities of particular trees cannot be defined independently of the mutations, and the approach breaks down (Tavare 2004; Wakeley 2009).

Despite this difficulty, a number of productive approaches have been developed to predict how negative selection influences patterns of molecular variation and to infer selection pressures from data. Charlesworth *et al.* (1993) introduced the background selection model and showed that strong purifying selection reduces the effective population size relevant for linked neutral sites (Charlesworth 1994; Charlesworth *et al.* 1995). However, weaker selection also distorts patterns of variation, in a way that cannot be completely described by a neutral model with any effective population size (McVean and Charlesworth 2000; Comeron and Kreitman 2002), a phenomenon often referred to as Hill–Robertson interference (Hill and Robertson 1966). Several theoretical frameworks have been developed to analyze this situation. The ancestral selection graph of Neuhauser and Krone (1997) and Krone and Neuhauser (1997) provides an elegant formal solution to the problem, but unfortunately it requires extensive numerical calculations (Przeworski *et al.* 1999). These limit the intuition we can draw from this method and make it impractical as the basis for inference from most modern sequence data. An alternative approach is based on the structured coalescent and views the population as subdivided into different fitness classes, tracing the genealogies of individuals as they move between classes. This approach was first introduced by Kaplan *et al.* (1988) and further developed by Hudson and Kaplan (1994, 1995) in the case in which fluctuations in the size of each fitness class can be neglected. This structured coalescent approach has been been the basis for computational methods developed by Gordo *et al.* (2002), Seger *et al.* (2010), and Zeng and Charlesworth (2011), and analytical approaches such as those of Barton and Etheridge (2004), Hermisson *et al.* (2002), and O’Fallon *et al.* (2010).

In this article, we build on the structured coalescent framework by introducing the idea of a “fitness-class coalescent.” Rather than considering the coalescence process in real time, we treat each fitness class as a “generation” and trace how individuals have descended by mutations through fitness classes, moving from one generation to the next by subsequent mutations. We show that the coalescent probabilities in this fitness-class coalescent can be computed using an approach based on the Poisson random field (PRF) method of Sawyer and Hartl (1992) or, equivalently, can be derived as an extension of the structured coalescent approach of Hudson and Kaplan (1994).

Our fitness-class coalescent theory can be precisely mapped to a coalescence theory in which certain quantities (*e.g.*, coalescence times) have different meanings than in the traditional theory. We can then invert this mapping to determine the structure of genealogies and calculate statistics describing expected patterns of genetic variation. This approach requires certain approximations, but it also has several advantages. Most importantly, we are able to derive relatively simple analytic expressions for coalescent probabilities and distributions of simple statistics such as heterozygosity. Consistent with earlier work, we find that the effects of purifying selection are broadly similar to an effective population size that changes as time recedes into the past. Our analysis makes this intuition precise and quantitative: we can compute the exact form of this time-varying effective population size, as defined by the rate of pairwise coalescence. We also show that this intuition has important limitations: for example, different pairs of individuals have different time-varying effective population size histories, meaning that in principle it is possible to distinguish selection from changing population size. Our approach also makes it possible to calculate the diversity of selected alleles themselves, which may be important when selection is common (Williamson and Orive 2002).

We begin in the next section by describing the fitness-class coalescent idea that underlies our approach. We then describe the details of our model and analyze two ways to implement the fitness-class coalescent. The first relies on the PRF method of Sawyer and Hartl (1992) to describe the frequency distribution of distinct lineages within each fitness class. We show how this lineage structure can be used to compute coalescence probabilities in each fitness class. The second approach is based on tracing the ancestry of individuals in the order that events occur as described by Hudson and Kaplan (1994) and implemented numerically by Gordo *et al.* (2002). We show how we can sum over all possible ancestral paths to compute equivalent coalescence probabilities in each fitness class. The two approaches provide different and complementary intuitive pictures of the process and depend on various approximations in somewhat different ways.

After computing coalescence probabilities with both approaches, we show how these probabilities can be used to analyze the structures of genealogies, and we calculate various statistics describing genetic variation in these populations, which we compare to numerical simulations. We then discuss the relationship between our results, neutral theory, and earlier work on selection, and we explore how various approximations limit our approach. The most important of these approximations is that we neglect fluctuations in the size of each fitness class, analogous to earlier work (Hudson and Kaplan 1994), which restricts our analysis to the case of strong selection (relative to inverse population size). This approximation also means that we neglect Muller’s ratchet. We describe this and related approximations and describe their regime of validity in the *Discussion*. Finally, in the appendices we explore these approximations in more detail and describe how they inform the relationship between our work and earlier approaches.

## The Fitness-Class Coalescent

In this section, we outline the main ideas underlying our fitness-class coalescent approach. We begin our analysis by considering the balance between mutations at many linked sites and negative selection against the mutants, which leads to an equilibrium distribution of fitnesses within a population (Haigh 1978). We illustrate this in Figure 1, for the case in which all deleterious mutations have the same fitness cost. Each individual is characterized by the number *k* of deleterious mutations it contains. Each fitness class *k* contains many distinct lineages, each of which arose from deleterious mutations in more-fit individuals, as illustrated in Figure 2. Neutral mutations also occur, but we consider these later.

Hudson and Kaplan (1994) observed that individuals move between fitnesses by deleterious mutations and that when two individuals are in the same fitness class they could be from the same lineage and hence coalesce. Our fitness-class coalescent exploits this observation to define an effective genealogical process that completely bypasses the ancestral process in real time. Instead, we treat each fitness class as a generation, and we count time in deleterious mutations: each deleterious mutation moves us from one generation to the next. In this way, we can trace the ancestry of individuals through the fitness distribution. For example, there is some probability that two individuals chosen from fitness class *k* are genetically identical (*i.e.*, come from the same lineage). If not, they each arose from mutations within fitness class *k* − 1. If both those mutations occurred in individuals in the same lineage in fitness class *k* − 1, we say the two individuals “coalesced” in class *k* − 1. If not, they came from different mutations from class *k* − 2, and could have coalesced there, and so on. In this way, we can construct a fitness-class coalescent tree describing the relatedness of two individuals, as illustrated in Figure 2.

In this article we show that the probability that two randomly chosen individuals who are currently in fitness classes *k* and *k*′ coalesce in class *k* − ℓ, *n _{k}* is the population size of fitness class

*k*,

*s*is an effective selection pressure against these individuals, and

_{k}This coalescent probability is inversely proportional to the population size of the fitness class, *n _{k}*

_{−ℓ}, and the effective selection coefficient within that class,

*s*

_{k}_{−ℓ}, modified by the combinatoric coefficient

*k*− ℓ has size

*n*

_{k}_{−ℓ}, so the coalescence probability per real generation is

*s*

_{k}_{−ℓ}generations in that class, so the total coalescence probability in this class has the form

*k*− ℓ at the same time. In other words, the probability that coalescence occurs in a class equals the inverse population size of the class times the number of generations lineages spend together in that class. In the following sections of this article we derive Equation 1 in the two alternative ways mentioned in the Introduction: by explicitly considering the lineage frequency distribution and by following the path summation method of Hudson and Kaplan (1994) and Gordo

*et al.*(2002).

Our approach of treating mutation events as timesteps, and computing coalescence probabilities at each timestep, allows us to make a precise mapping to coalescence theory in which certain quantities have a different meaning than in the traditional theory. In this framework, we can calculate a simple analytic expression for the probability that two lineages sampled from particular fitness classes will coalesce in any other fitness class. These fitness-class coalescence probabilities allow us to explicitly calculate the structure of genealogies in this “mutation time.” We can then compute the distribution of any statistic describing expected sequence variation by averaging over the fitness classes our original individuals come from. For a statistic *x* that depends on genealogies between two individuals, for example, we write expressions of the form*H*(*k*, *k*′) describes the probability that two individuals sampled at random from the population come from classes *k* and *k*′, respectively.

From the form of these expressions and our simple result for the coalescence probabilities, we can immediately see the main effect of selection on the structure of genealogies. The discussion following Equation 1 implies that the effect of negative selection is similar to that of an effective population size that changes as time recedes into the distant past—*i.e.*, some *N*_{e}(*t*). This intuition has been suggested by earlier work (see, *e.g.*, Seger *et al.* 2010). As we will see, our analysis describes the precise form of *N*_{e}(*t*): it follows the distribution *n _{k}*

_{−ℓ}as ℓ increases further to the past, modified by the coefficient

*N*

_{e}(

*t*). As is clear from Equation 3, these different histories are averaged according to the distribution

*H*(

*k*,

*k*′). While it is the average

*N*

_{e}(

*t*) between pairs that determines the distribution of pairwise statistics, this suggests that statistical power may exist in larger samples to distinguish negative selection from neutral population expansion. We explore these general conclusions of our analysis in detail in the

*Discussion*.

Note that in the standard neutral coalescent, one first calculates the distribution of coalescence times and then imagines mutations occurring as a Poisson process throughout the coalescent tree, with rates proportional to branch lengths. In our fitness-class coalescent, by contrast, the coalescence times *are* the mutations. To avoid confusion, from here on we will to the effective generations in our model as “steps,” and refer to the fitness-class coalescent “times” as the “steptimes.” We reserve the word time to refer to the actual coalescent time, measured in actual generations.

After determining a fitness-class coalescent tree, we can invert our mapping to determine the structure of genealogies in real time. We will do this by calculating how the steptime in our fitness-class coalescent model translates into an actual time in generations. This will allow us to relate the distribution of branch lengths in steptimes to an actual coalescent tree in generations. We can then treat neutral mutations as is usually done in the standard coalescent: as a Poisson process with probabilities proportional to branch lengths.

Our fitness-time coalescent requires a number of approximations that limit its applicability. Most importantly, we neglect Muller’s ratchet, and more generally we ignore the effects of fluctuations in the size of each fitness class. We discuss these approximations in more detail below. We find that within a broad and biologically relevant parameter regime they lead to systematic but small corrections to our results. Despite these limitations, our approach also has several advantages relative to previous work. The fitness-time coalescent approach makes many otherwise difficult analytic calculations tractable, allows us to compute the diversity at the selected sites in addition to linked neutral sites, and may offer a useful basis for practical methods of coalescent simulation and inference.

## Model

We imagine a finite haploid population of constant size *N*. Each haploid genome has a large number of sites, which begin in some ancestral state and mutate at a constant rate. Each mutation either is neutral or confers some fitness disadvantage *s* (where, by convention, *s* > 0). We assume an infinite-sites framework, so there is negligible probability that two mutations segregate simultaneously at the same site. We assume that there is no epistasis for fitness and that each deleterious mutation carries fitness cost *s*, so that the fitness of an individual with *k* deleterious mutations is *w _{k}* = (1 −

*s*)

*. Since we assume that*

^{k}*s*≪ 1, we often approximate

*w*by 1 −

_{k}*sk*.

The population dynamics are assumed to follow the diffusion limit of the standard Wright–Fisher model. That is, we assume that deleterious mutations occur at a genome-wide rate *U*_{d} per individual per generation (with deleterious mutations assumed to be decoupled from selection). We define θ_{d}/2 ≡ *NU*_{d} as the per-genome scaled deleterious mutation rate. Similarly, neutral mutations occur at a rate *U*_{n} per individual per generation, and we analogously define θ_{n}/2 ≡ *NU*_{n}. We assume that each newly arising mutation occurs at a site at which there are no other segregating polymorphisms in the population (the infinite-sites assumption).

We focus exclusively on the case of perfect linkage, where we imagine that all the sites we are considering are in an asexual genome or within a short enough distance in a sexual genome that recombination can be entirely neglected. Although our model is defined for haploids, this assumption means that our analysis also applies to diploid populations provided that there is no dominance (*i.e.*, being homozygous for the deleterious mutation carries twice the fitness cost as being heterozygous). In this case, our model is equivalent to that considered by Hudson and Kaplan (1994).

We believe that this is the simplest possible model that is based on a concrete picture of mutations at individual sites that can describe the effects of a large number of linked negatively selected sites on patterns of genetic variation. It is essentially equivalent to the model described by Charlesworth *et al.* (1993) and Hudson and Kaplan (1994), which has formed the basis for much of the analysis of background selection (Charlesworth *et al.* 1993; Gordo *et al.* 2002; Seger *et al.* 2010).

Our analysis will develop a fitness-class coalescent theory that involves tracing the ancestry of individuals as they change in fitness by acquiring deleterious mutations. To do this, we need to first understand the distribution of fitnesses within the population. Since in our model all deleterious mutations have the same fitness cost *s*, we can classify individuals on the basis of their Hamming class, *k*, relative to the wild type (which by definition has *k* = 0). That is, individuals in class *k* have *k* deleterious mutations more than the most-fit individuals in the population. Note that not all individuals in class *k* have the same set of *k* deleterious mutations. Furthermore, *k* refers only to the number of *deleterious* mutations an individual has; individuals with the same *k* can have different numbers of neutral mutations. We normalize fitness such that by definition all individuals in class *k* = 0 have fitness 1. Individuals in class *k* then have fitness 1 − *ks* (Figure 1).

Haigh (1978) showed that the balance between mutation and selection leads to a steady state in which the fraction of the population in fitness class *k*, which we call *h _{k}*, is given by a Poisson distribution with mean

*U*

_{d}/

*s*,

*U*

_{d}and that

Throughout our analysis, we assume that the population exists in this steady-state mutation–selection balance. In particular, we neglect the fact that in a finite population there will be fluctuations around this *h _{k}*. This approximation is central to our approach, and we make it in subtly different ways in both our lineage-structure and our sum of ancestral paths calculations of the fitness-class coalescence probabilities. It will typically be valid in the bulk of the fitness distribution when selection is strong (

*Ns*≫ 1); our analysis is limited to this strong selection case and breaks down when

*Ns*≲ 1. We discuss this approximation in more detail in the

*Discussion*and in

*Appendix B*. We note that this approximation also implies that we assume that Muller’s ratchet can be neglected. We will return to the question of the importance of Muller’s ratchet in more detail in the

*Discussion*.

We will later need to understand the distributions of timings, *k* − 1 to class *k*. We can calculate this by noting that the probability that an individual in class *k* arose from a mutation in an individual in class *k* − 1 rather than a reproduction event from an individual in class *k* is*h _{k}*, and noting that these mutation events are a Poisson process, we find

*et al.*(2002) following the approach of Hudson and Kaplan (1994).

## Lineage Structure and the Fitness-Class Coalescence Probabilities

In general, the individuals in a particular fitness class *k* will not be genetically identical. Rather, there will be a number of different lineages within this class, each lineage created by a deleterious mutation from class *k* − 1. We now consider the structure of lineage diversity among individuals within a given fitness class in the mutation–selection balance. Note that for our purposes here, we only consider deleterious mutations in defining lineages; we consider the diversity at neutral sites separately below.

Consider a fitness class *k*, which has an overall frequency *h _{k}* (Figure 1B). The frequency

*h*is maintained by a stochastic process in which the class is constantly receiving new individuals from class

_{k}*k*− 1 due to deleterious mutations. In our infinite-alleles model, each such mutation creates a lineage which is an allele that is unique within the population. Each lineage fluctuates in frequency for a while before eventually dying out, perhaps after acquiring additional mutations that found new lineages in fitness class

*k*+ 1. At any given moment, there is some frequency distribution of lineages in each class

*k*(see Figure 2). While the identity of these lineages changes over time, there is a probability distribution that at any moment there is a given frequency distribution of lineages. In steady state, this probability distribution does not change with time.

New lineages are founded in class *k* at a rate θ* _{k}*/2, where

*k*at a per capita rate

*s*as the

_{k}*effective selection coefficient*against an allele in class

*k*, because it is the rate at which any particular lineage in class

*k*loses individuals, and we define

*k*with a frequency between

*a*and

*b*(in the total population) is Poisson distributed with mean

*k*is simply

*k*both come from the same lineage is

We note that the PRF result involves various implicit approximations, and is valid within a specific parameter regime. Most importantly, we neglect fluctuations in the sizes of each fitness class. This has two main effects. First, it means that we neglect the corresponding fluctuations in the distribution of lineage frequencies *f _{k}*(

*x*). Second, it means we are implicitly neglecting the fact that, given a lineage of size

*x*exists in class

*k*, the actual

*h*is on average not at its steady state value (

_{k}*e.g.*, if a high-frequency lineage exists,

*h*will tend to be larger). We explain these approximations in detail in Appendix B, and describe an alternative branching process formulation for the lineage structure that corrects for the second effect described above.

_{k}### The fitness-class coalescent probabilities

We can now calculate the degree of relatedness between two individuals sampled from the population. Our goal is to understand the probability distribution of the fitness-class coalescence steptimes for two individuals chosen at random from the population. We begin by calculating the coalescence probability in each step.

First, imagine that by chance we pick two individuals from the same fitness class *k*. If the two individuals are from the same lineage, they coalesce within this class. In this case, they are genetically identical and the coalescence steptime is 0. If not, we want to calculate the probability they coalesce in class *k* − 1, *A* in class *k* was founded by a mutation from class *k* − 1 a time *t*_{1} ago, and the lineage of individual *B* in class *k* was founded by a mutation a time *t*_{2} ago, the probability the two individuals came from a common lineage in class *k* − 1 is*t*_{1} and *t*_{2}, *x*/*h _{k}* is the probability one of the individuals came from a lineage of size

*x*given that the lineage exists,

*f*(

_{k}*x*) is the probability that the lineage exists, and

*k*− 1 changes in frequency from

*x*to

*y*in time |

*t*

_{2}−

*t*

_{1}| (where

*y*could be 0, corresponding to a lineage that has already mutated back to class

*k*− 2 by the time the second individual mutates to class

*k*− 1). The forms of

*Q*and

*G*are described in

*Appendix A*.

If the two individuals coalesced in this first step, the coalescent steptime is 1. If not (which occurs with probability *i.e.*, in the mutations that took them from class *k* − 2 to *k* − 1),

So far we have imagined that both individuals that we originally selected from the population came from the same class *k*. This will not generally be true. Rather, when we pick two individuals at random, they will come from classes *k* and *k*′ with probability*k* ≤ *k*′. We define *k* and *k*′ coalesce in class *k* − ℓ. Note that

From the set of coalescence probabilities Equation 13, we can calculate the probability distribution of coalescence steptimes between two individuals. We describe these steptimes by the distribution of classes in which coalescence occurs; given that we pick two individuals from classes *k* and *k*′ (with *k* ≤ *k*′ by convention) the probability that they coalesce in class *k* − ℓ is simply*Appendix A*.

### Computing the coalescence probabilities

We now have a formal structure describing the structure of coalescent genealogies in the presence of negative selection. It remains, however, to evaluate the coalescent probabilities in each step by evaluating the integrals in Equation 13. We explain the details of this calculation in *Appendix A*. We find*k*, *k*′, and ℓ but not on the population parameters,

Equation 15 is the complete solution for coalescent probabilities in the nonconditional approximation. This general form for the coalescence probabilities makes intuitive sense. *Nh _{k}*

_{−ℓ}is the population size of class

*k*− ℓ, and

*k*− ℓ before mutating away. Since the per-generation coalescent probability in a population of size

*n*is proportional to

*k*− ℓ is approximately proportional to one over the population size of this class times the number of generations individuals spend in this class. The additional 1 in the denominator captures the fact that the individuals might mutate away from the class before coalescing there (which reduces the average time they spend in the class together). The numerical factor multiplying this basic scaling,

*i.e.*, the

*dt*

_{1}and

*dt*

_{2}integrals). It reflects the probability that the ancestors of the two individuals we are considering were both in class

*k*− ℓ at the same time, since they could not otherwise coalesce there.

From this result, we can also form an intuitive picture of the shape of genealogies in the presence of negative selection. We have just seen that the coalescence probability per actual generation depends on the parameters as *h _{k}*

_{−ℓ}. For example, if we imagine sampling two individuals from the same below-average fitness class, the probability distribution of their genealogies is like having a population size that initially increases and then decreases as we look backward in time. Of course, this analogy only goes so far. Most importantly, the coalescent steptimes are related to the statistics describing genetic diversity in a different way from how normal coalescent times are usually related to these statistics. We return to this point in the section on the structure of genealogies below.

## A Sum of Ancestral Paths Approach

We have just computed the fitness-class coalescence probabilities by considering the lineage structure within each fitness class. Kaplan *et al.* (1988) proposed a somewhat different way to look at the same problem: they considered a sample of individuals and, without explicitly describing lineage structure, computed the relative probabilities that the next event to occur backward in time would involve a mutation or coalescent event. For example, if two individuals are in the same fitness class, the next event could be either coalescence within that class or a mutation event. The rates at which these events occur determines their relative probabilities.

In its original form, this approach used diffusion equations to account for fluctuations in the frequencies of each fitness class *h _{k}*. Barton and Etheridge (2004) used this framework to provide a complete solution for the effect of selection at a single site on the structure of genealogies. However, it has not yet proven possible to solve these equations in the more general case of selection at many linked sites. Instead, Hudson and Kaplan (1994) made progress by neglecting fluctuations in the frequencies

*h*, the same approximation that is central to our approach. Using this approximation, they derived a recursion relation for the mean time to a common ancestor, their Equation 12. Gordo

_{k}*et al.*(2002) used this equation as the basis for a coalescent simulation.

Recursion relations of the Hudson and Kaplan (1994) form can be solved numerically, and have been used to generate data describing coalescent statistics, but have not yet led to an analytic description of the structure of genealogies. We now demonstrate that these numerical methods are equivalent to our lineage-based formalism above, by showing that the Hudson and Kaplan (1994) approach can be used to derive identical analytical formulas for the fitness-class coalescent probabilities. We refer to this as a “sum of ancestral paths” approach, because it relies on summing over all possible paths of individual ancestry through the fitness distribution. The equivalence of this approach to our lineage-structure calculations means that our analytical results in this article match earlier numerical and simulation results based on the Hudson and Kaplan (1994) formulation.

In order to calculate the coalescence probabilities for a sample of two individuals, we consider the set of all possible ancestral paths these individuals may have followed. Each path is represented by an ordered set of events, backward in time. These events may either be deleterious mutation events, which move one of the ancestral lineages to the previous fitness class, or coalescence events, which merge the two ancestral lineages. In order for two individuals to coalesce in class *k* − ℓ, each ancestral lineage must undergo a series of deleterious mutation events, bringing them from their initial classes to class *k* − ℓ. The lineages must then coalesce before any additional deleterious mutations occur. For example, in order for two individuals sampled from class *k* to coalesce in class *k* − 1, the first event, backward in time, must be a deleterious mutation. This mutation can occur in either individual. After this event, one of the ancestral lineages is still in class *k*, while the other is in class *k* − 1. The second event, backward in time, must be a deleterious mutation event in the ancestral lineage that remains in class *k*. Both ancestral lineages are now in class *k* − 1. Finally, the third event must be a coalescent event. Note that there are a total of two paths, since either individual may have been the first to mutate.

The probability of any particular ancestral path is the product of the probability of each event in the path. We saw above that deleterious mutations occur in an individual in class *k* at rate *sk*. If the two individuals are in different classes, they are not able to coalesce as the next event. Thus the probability of each possible event is simply:*Nh _{k}*. Therefore, we have

*et al.*(2002), derived from the framework of Hudson and Kaplan (1994).

Using these probabilities, we can easily calculate the probability of any particular path. In general, in order for two individuals sampled from classes *k*′ and *k* to coalesce in class *k* − ℓ, the ancestral paths must consist of some order of *k*′ − *k* + 2ℓ events which include *k*′ − *k* + ℓ deleterious mutation events in the ancestral lineage that began in *k*′, and ℓ deleterious mutation events in the ancestral lineage that began in *k*. The path must then conclude with a final coalescent event. Note that there are a total of

We can carry out this sum in the general case by dividing up the *k* − ℓ. Each case leads to a different path probability, and these probabilities can be exactly summed. We carry out this calculation in detail in Appendix A. We find that to leading order in

We note that in deriving this result, we have made the same approximations we used in our lineage structure based approach. Thus the results from the PRF method and the sum of ancestral paths are exactly equivalent in the regime where they are valid. However, there are subtle differences in the results to higher orders of the approximations, which provide useful intuition about the process. For example, in the sum of ancestral paths approach it is more natural to calculate

## The Structure of Genealogies and Statistics of Genetic Diversity

We can now use the coalescence probabilities described above to calculate the structure of genealogies in the presence of negative selection. We can then use these genealogies to calculate various statistics describing the genetic diversity within the population. We know the coalescent probabilities in each step of our fitness-class coalescent process, so in principle we can calculate the probability of any genealogy relating an arbitrary number of individuals using methods analogous to those used in standard neutral coalescent theory. This would then allow us to calculate the distribution of any statistic describing the genetic diversity among these individuals, again using methods analogous to neutral coalescent theory.

Here we will focus on the simplest genealogical relationship: the distribution of the time to the most recent common ancestor of two individuals, which demonstrates the main ideas in the simplest context. This allows us to calculate the distribution of the per-site heterozygosity π. This is the only statistic relevant to a sample of two individuals. In larger samples, the coalescent probabilities between any pair of sampled individuals are independent of those between any other pair that does not share the same most recent common ancestor, so the distribution of per-site heterozygosity we expect within such a sample is closely related to the ensemble distribution of π we calculate here.

In our fitness-class coalescent framework, it is natural to consider diversity at the negatively selected sites separately from diversity at linked neutral sites. We focus first on the distribution of coalescent steptimes and π* _{d}*, the per-site heterozygosity at negatively selected sites alone, ignoring neutral mutations. We will then turn to the connection between steptimes and actual times in generations, which will enable us to calculate the distribution of neutral diversity, including the per-site heterozygosity at neutral sites π

*. In analyzing data, we will of course typically not know*

_{n}*a priori*which sites are neutral and which are negatively selected. In such a situation, we merely add up the expected diversity at neutral sites and negatively selected sites, so that the total expected per-site heterozygosity is π = π

_{d}+ π

_{n}.

### Distribution of steptimes and π_{d}

We begin by imagining that we sample two individuals at random from the *same* fitness class *k*. If they coalesce in class *k* − ℓ, they each acquired ℓ different deleterious mutations to reach class *k*. Thus the number of negatively selected sites at which they will be polymorphic is twice their coalescent steptime, π_{d} = 2ℓ. We therefore have_{d} = 2ℓ) is the probability π_{d} = 2ℓ.

More generally, if two individuals sampled from classes *k* and *k*′ coalesce in class *k* − ℓ, we have π_{d} = 2ℓ + *k*′ − *k*. This means we have*k* and *k*′ to find the distribution of π_{d} among individuals sampled at random from the population. We find*k* or π_{d}/2. Note that in practice we only have to evaluate the sum over *k* from 0 to a multiple of *U*_{d}/*s*, since *H*(*k*, *k*′) will be negligible for larger *k*.

These results for the distributions of genealogy lengths and of π_{d} involve several sums. However, all the terms in these sums are straightforward and the numerical evaluations of their values are simple and fast. In Figure 4 we show a representative example of the predicted distribution of the per-site heterozygosity at negatively selected sites, ρ(π_{d}), compared to simulation results. We explore the significance of the shape of the distribution ρ(π_{d}), how this distribution depends on the parameter values, and the source of the small but systematic deviations between the theoretical predictions and the simulation results in the Discussion.

### The relationship between steptimes and time in generations

So far we have focused on the genealogies measured in steptimes, which allowed us to calculate the distribution of heterozygosity among negatively selected sites. We would now like to relate the steptimes to actual times in generations. To do this, we consider the probability that a coalescence event occurred at time *t*, given two individuals sampled from classes *k* and *k*′ that coalesced in class *k* − ℓ, ψ(*t*|*k*, *k*′, ℓ). We compute this distribution in File S5, and find*A* ≡ *k*′+*k* −*i* and

Note that when *Nh _{k}*

_{−}

*(*

_{t}s*k*− ℓ) ≫ 1 (the same condition required to neglect fluctuations in

*h*, see

_{k}*Appendix B*), this expression can be simplified; we find

*s*(

*k*− ℓ) = 0. In this case, we must use the more complex expression Equation (25) (or in the case when the coalescence time within the 0-class can be neglected compared to the time taken to descend from the 0-class, the simpler expression described in Equation (39) below).

Averaging over the possible values of *k*, *k*′, and ℓ, we find the overall distribution of actual coalescent time between two randomly chosen individuals,*H*(*k*, *k*′), *t*|*k*, *k*′, ℓ) are as given above. However, as we will see below, in calculating neutral diversity we will typically find it easier to work directly with ψ(*t*|*k*, *k*′, ℓ) rather than this unconditional distribution for ψ(*t*).

### The neutral heterozygosity π_{n}

From the distributions of real times to a common ancestor described above, we can calculate the distribution of π_{n}, the neutral heterozygosity. Since the neutral mutations occur as a Poisson process with rate *U*_{n}, and there are a total of 2*t* generations in which these mutations can occur, π_{n} follows a Poisson distribution with mean 2 *U*_{n}*t*, where *t* is drawn from the distribution of coalescence times, Equation 27. We have

We note that an alternative way to compute neutral heterozygosity is to further extend the sum of ancestral paths approach which we used above to provide an alternative derivation of the coalescence probabilities. In this formulation, we do not make any connection to real times. However, this approach provides an alternative way to compute the distribution of neutral heterozygosity, ρ(π_{n}). We carry out this computation in File S6, and show that it leads to results identical to our analysis above.

### The total heterozygosity π

To calculate the distribution of total heterozygosity π = π_{n} + π_{d}, we must account for the fact that π_{d} and π_{n} are not independent: large *π _{d}* means a large coalescent steptime and hence makes a large π

_{n}more likely. The distribution of π

_{d}is given by ρ(π

_{d}) above. Above we found ψ(

*t*|

*k*,

*k*′, ℓ), which implies that

_{d}= 2ℓ +

*k*−

*k*′, this implies

*e.g.*, when analyzing the joint distribution of synonymous and nonsynonymous variation). It implies a particular relationship between the observed diversity at selected sites and the reduction in linked neutral variation.

In many situations, however, we will not know which alleles are selected and which are neutral. In this case, we want to understand the distribution of total heterozygosity π, which is given by_{n}), since it involves analogous sums. In Figure 6, we compare this predicted distribution of total heterozygosity to simulations. As with the other aspects of heterozygosity, we find good general agreement to the simulations, with the slight systematic errors that are consistent with the effects of Muller’s ratchet.

### The mean pairwise heterozygosity

Above we have calculated the distribution of heterozygosity for both neutral and deleterious mutations, as well as total heterozygosity. It is straightforward to average these results to calculate the mean pairwise heterozygosity for both neutral and deleterious mutations; the mean total pairwise heterozygosity is simply the sum of these. In Figure 7 and Figure 8 we show how this mean heterozygosity depends on population size, mutation rate, and selection strength, for neutral and deleterious mutations respectively. We see that the dependence of 〈π_{d}〉 on the population size is fairly weak. While it increases roughly linearly with *N* in the weak selection regime, this quickly saturates and for *Ns* substantially greater than 1 the mean heterozygosity becomes almost independent of population size. The dependence on *U*_{d}/*s*, by contrast, is much stronger. The dependence of 〈π_{n}〉 on the parameters is also interesting: this depends weakly on the parameters for small *N* or *U*_{d}/*s*, but for larger *N* becomes roughly linear. These results make intuitive sense, particularly in light of the “mutation-time” approximation that we introduce in the Discussion, where we discuss these figures in more detail.

### Statistics in larger samples

The distributions of π_{n} and π_{d} described above are very different from the distributions of heterozygosity expected in the absence of selection. We could certainly measure the distribution of pairwise heterozygosity from a sample of many individuals from a population, and use this to infer the action of selection. However, it may also be useful to understand the expected distribution of other statistics describing the variation in larger samples. One statistic often used to describe variation in larger samples is the total number of segregating sites among a sample of *n* individuals, *S _{n}*. Here we describe how our framework allows us to calculate the distribution of

*S*

_{3}; similar methods can be used to calculate the distribution of

*S*for larger

_{n}*n*. As we will see, it is unwieldy to calculate closed form expressions for these quantities in our framework, so here we merely lay out a prescription for calculating

*S*

_{3}.

We first consider the distribution of *k*, *k*′, and *k*″ in Figure 9. Two of these three lineages coalesced in class *k*_{1}. We call the steptime at which two of the three lineages coalesced τ_{3} (see Figure 9). We next need to calculate the distribution of τ_{2}, the total steptime to common ancestry of the three individuals. This time of course cannot be smaller than τ_{3}. Given values of τ_{3} and τ_{2}, it is clear from Figure 9 that the total number of segregating negatively selected sites is

Calculating the joint distribution of τ_{2} and τ_{3} is tedious, because we must sum over all possible orderings of the coalescence events, but it can be computed using either our lineage structure method or the sum of ancestral paths approach. The basic result is analogous to our results for the coalescence steptime between a pair of individuals: coalescence probabilities within a given class are proportional to the inverse size of that class times the number of real generations the ancestors of given individuals typically spend in that class, times a factor that reflects the time that the ancestors of sampled individuals are present in each class at the same time.

Given a particular value of

However, while this analysis provides a prescription for calculating the distribution of *S*_{n} in a specific parameter regime we refer to as the “mutation-time” regime, but the complexities of the general calculation are tangential to the ideas behind our framework, so we do not pursue them further here. However, these issues will be important to explore in future work aiming to use this framework for data analysis, and our approach here can be used as the basis for genealogical simulations. Further, since our methods allow us to quickly compute the probability of a given genealogical history and to draw a particular genealogy from the appropriate distribution, they may provide a useful basis for importance sampling or MCMC methods to infer selection pressures from data.

## Numerical Simulations of the Genetic Diversity

We compare the predictions of our fitness-class coalescence analysis to Monte Carlo simulations of the Wright-Fisher model. In our simulations, we consider a population of constant size *N* and we keep track of the frequencies of all genotypes over successive, discrete generations. In each generation, *N* individuals are sampled with replacement from the preceding generation, according to the standard Wright-Fisher multinomial sampling procedure (Ewens 2004) in which the chance of sampling an individual is determined by its fitness relative to the population mean fitness.

In our simulations, each genotype is characterized by the set of sites at which it harbors deleterious mutations and the set of sites at which it harbors neutral mutations. In each generation, a Poisson number of deleterious mutations are introduced, with mean *NU _{d}*, and a Poisson number of neutral mutations are introduced, with mean

*NU*; each new mutation is ascribed to a novel site, indexed by a random number. The mutations are distributed randomly and independently among the individuals in the population (so that a single individual might receive multiple mutations in a given generation). The simulations record the time (in generations) at which each distinct genotype was first introduced.

_{n}Starting from a monomorphic population, all simulations were run for at least *N* generations (whichever was larger), to ensure relaxation both to the steady-state mutation–selection equilibrium and to the PRF equilibrium of allelic frequencies within each fitness class. The final state of the population—*i.e.*, the frequencies of all surviving genotypes—was recorded at the last generation. To produce the empirical distributions of π_{d} and π_{n} shown in Figure 4 and Figure 5, we averaged across at least 300 independent populations for each parameter set.

Our simulations allow for random fluctuations in the frequencies of each fitness class, and for Muller’s ratchet. In most of the parameter regimes we explored, the ratchet proceeded during the simulation, so that the least-loaded class at the end of each simulation typically contained anywhere from no deleterious mutations (typical for *U*_{d}/*s* = 2) to ∼10 (typical for *U*_{d}/*s* = 4). We see that despite these effects, our theory agrees well with the simulations, although there are small systematic errors that are consistent with effects of the ratchet. Generally speaking these errors increase as we increase *U*_{d}/*s*, but become less severe for larger *N* or *s*. We consider these effects of Muller’s ratchet in more detail in the *Discussion*.

## Discussion

In recent years, both experimental studies and sequence data have pointed to the general importance of selective forces among many linked variants in microbial and viral populations and on short distance scales in the genomes of sexual organisms (Hahn 2008). Our analysis provides a framework for understanding how one particular type of selection—pervasive purifying (*i.e.*, negative) selection against deleterious mutations—affects the structure of genetic variation at the negatively selected sites themselves and at linked neutral loci. This type of selection is presumably widespread in many populations, in which there is a selective pressure to maintain existing genotypes, and mutations away from these genotypes at a variety of loci are deleterious.

A variety of earlier work has addressed aspects of this problem, as described in the Introduction. The key insight of our approach is that instead of following the true ancestral process, we develop a *fitness-class* genealogical approach that focuses on how individuals “move” through the fitness distribution. Here each mutation plays the role of a reproductive event that moves individuals through the fitness distribution, and each fitness class is a generation in which coalescence can occur with some probability. We calculate this probability using a simple approximation based on the PRF model of Sawyer and Hartl (1992), rather than by considering the actual reproductive process within that class. By extending formulas originally computed by Hudson and Kaplan (1994), we showed that these coalescent probabilities can also be computed using a summation of ancestral paths based on the structured coalescent described by Kaplan *et al.* (1988). Hence the conclusions from our analysis also describe the simulations of Gordo *et al.* (2002) and are consistent with all other results based on this structured coalescent approach. Our work is also closely related to recent work in a continuous-fitness model by O’Fallon *et al.* (2010), which uses a similar framework to analyze the weak-selection regime but not the *Ns* ≫ 1 situation we study here. We explore the relationship between our analysis and earlier work in more detail in *Appendix C*.

Our approach leads to simple expressions for the coalescent probability at each step in our fitness-class genealogical process. This makes it a complete effective coalescent theory: using these probabilities, we can calculate the probability that a sample of individuals has any particular ancestral relationship. Our coalescent probabilities are different from those in the standard Kingman coalescent (Kingman 1982), so the structure of genealogies has a different form.

Of course, since our process is an effective rather than an actual coalescent, the relationship between a fitness-class genealogy and the expected statistics of genetic variation given that genealogy is different than in the standard neutral coalescent. Given a particular genealogy measured in steptimes, the numbers of deleterious mutations *are* the coalescent times, and to calculate the statistics of neutral variation we have to make use of the relationship between steptimes and actual coalescence times. This contrasts with the Kingman coalescent, where numbers of neutral mutations are typically Poisson-distributed variables with means proportional to coalescence times (Wakeley 2009). However, we can account for these differences by starting with the distribution of fitness-class genealogies and then converting these genealogies into actual coalescence times.

In this article, we have used this fitness-class approach to calculate simple statistics describing genetic variation, in particular the distribution of pairwise heterozygosity. This leads to analytic expressions for the quantities of interest, although these expressions involve sums that are most easily calculated numerically. These are easy to compute, do not become harder to evaluate in larger populations, and hence are more efficient to evaluate than either simulations or calculations within the ancestral selection graph.

### An intuitive picture of the structure of genealogies

The most important aspect of our analysis is not the specific results for heterozygosity, which match the conclusions of earlier simulations. Rather, the fitness-class coalescent approach allows us to draw several important general conclusions about how negative selection distorts the structure of genealogies. For two individuals drawn from particular fitness classes, the effect of negative selection is similar to that of an effective population size that changes as time recedes into the past. This is consistent with suggestions from earlier work (*e.g.*, the simulation study of Williamson and Orive 2002 and the work of Seger *et al.* 2010). However, this is not a population size that decreases in a simple way into the past. Our analysis shows the exact form of this time-dependent population size. Further, it is clear from our analysis that this is not the only effect of negative selection on genealogies. There are two key complications. First, the statistics of genetic variation (particularly at the deleterious sites themselves) depend on the structure of genealogies differently in our fitness-class coalescent than in the standard neutral coalescent. Second, the time-varying rate of coalescence between a pair of individuals depends on the fitness classes they were sampled from. In other words, different pairs of individuals have a different time-varying effective population size. This suggests that genetic diversity cannot be represented by a single time-varying effective *N*_{e}(*t*) for the whole population, which means that it may be possible to develop statistical tests to distinguish negative selection from population size. All of these general intuitive conclusions about the structure of genealogies in our fitness-class coalescent are illustrated in Figure 10.

We now pause to make this intuitive picture of the shape of typical genealogies more precise. In general the probability that two individuals will coalesce within class *k* has the form *n _{k}* is the population size of that class,

*sk*is the effective selection pressure against individuals within that class, and

*A*is a constant that depends on which classes the lineages began in, but not on any of the population parameters. We have seen that each lineage spends on average 1/(

*sk*) generations in class

*k*. Thus we can think of each individual as seeing a historical effective population size as shown in Figure 10C: it starts in some class

*k*with size

*n*and spends 1/

_{k}*sk*generations in that class before moving to class

*k*− 1, and so on.

If we sample two individuals, however, they will not always be in the same class at the same time. This effect reduces the coalescence probabilities in each class, as captured by the factor *A*/2. This factor is the average fraction of the 1/(*sk*) generations each lineage spends in class *k* that the two lineages spend there together. Alternatively, we can think of this factor as consisting of two parts: *A* is the probability that the two lineages are ever in the same class at the same time, and 1/(2*sk*) is the average amount of time that they coexist in the class if they coexist at all (they each spend on average 1/(*sk*) generations there, but on average overlap for only half this time if they overlap at all). While the two lineages are in the class at the same time, the per-generation coalescent probability is 1/*n _{k}*.

This logic implies that genealogies in the presence of purifying selection look like neutral genealogies with a specific type of historical population-size dependence. Imagine, for example, that we picked two individuals from the same fitness class *k*. They each spend on average 1/(*sk*) generations in class *k*, and during that time they have a probability *sk*) generations. If they fail to coalesce, they then move to class *k* − 1, where they spend *k* − 2, and so on.

So far, this picture of a time-dependent population size is rather crude, but we can make it more precise. Specifically, we can write the coalescence probability between two individuals sampled from class *k* and *k*′ as a function of time in generations as*N*_{e}(*t*), as the inverse probability of coalescence at time *t* given that coalescence has not yet occurred,*N*_{e}(*t*) is defined as usual as the inverse of the probability that the two individuals will coalesce at time *t* given that they have not yet done so.

We illustrate this precise time-dependent population size *N*_{e}(*t*) in Figure 10D. We see that for two individuals sampled from the same fitness class, *N*_{e}(*t*) typically increases into the recent past and then decreases into the more distant past. This reflects the fact that the two individuals are becoming less likely to be in the same fitness class in the recent past, but that as time recedes into the distant past they are likely to be in the highly fit classes that have smaller *n _{k}*. For two individuals sampled from classes near but not identical to each other,

*N*

_{e}(

*t*) starts high and then drops before exhibiting a pattern similar to that among individuals sampled from the same class. This reflects the fact that it takes at least a short time before the two individuals have any chance of being in the same class. Finally, for two individuals sampled from more distant classes,

*N*

_{e}(

*t*) simply declines into the past, both because longer ago they were more likely to be in the same class and more likely to be in the small classes near the high-fitness tail.

Averaging over the whole population, Figure 10D shows the precise time-dependent population size *N*_{e}(*t*) for two randomly sampled individuals. This average *N*_{e}(*t*) initially stays roughly constant as time recedes into the past before decreasing thereafter. For these two randomly sampled individuals, selection is indistinguishable from this particular historically varying population size. The distribution of coalescence times between this pair of individuals looks the same as neutral coalescent histories with this specific population-size history. The deleterious mutation rates and selection pressures matter only in that they determine the form of this population size history. We note that the average *N*_{e}(*t*) shown in Figure 10D implies that recent branches of genealogies will typically be longer relative to ancient branches than we would expect under neutrality. Thus background selection will lead to an excess of low-frequency variants, and hence lead to negative values of Tajima’s *D*, consistent with expectations from previous work (Charlesworth *et al.* 1995; Fu 1997; Gordo *et al.* 2002).

However, a key difference from a neutral population of time-varying size is that, as is clear in Figure 10D, pairs of individuals do not typically come from the same fitness class. Rather, they come at random from different parts of the fitness distribution, and those that come from different places have ancestries characterized by different historically varying population sizes. The total distribution of ancestry is the sum of all of these. In other words, the genetic variation within the population is like that in a population in which some individuals had one type of historical population size history, while others had another. If we restrict ourselves to pairwise statistics such as π, the average *N*_{e}(*t*) across pairs of individuals will accurately describe the genetic diversity. However, when we consider appropriately defined statistics in larger samples, the fact that there is no single *N*_{e}(*t*) for the whole population could be important. It remains an interesting question for future work to explore how to exploit this fact to develop statistical tests to distinguish the effects of purifying selection from that of a historically varying effective population size.

### Approximations underlying our approach

Our analysis relies on several key approximations. First, both our lineage structure and our sum of ancestral paths methods assume that we can neglect fluctuations in the total frequency *h _{k}* of each class. Related to this approximation, we have also implicitly assumed that the probability a lineage in class

*k*reaches a frequency close to

*h*can be neglected. In

_{k}*Appendix B*, we analyze these approximations in detail and show that they will hold in class

*k*whenever

*Nh*≫ 1. In practice, this condition will often break down in the high- and low-fitness tails of the fitness distribution. Fortunately, provided it holds in the bulk of the distribution in which most individuals will be sampled (which will typically be true provided

_{k}sk*Ns*≫ 1), our approach is still a good approximation. We have also made several other more technical approximations in computing the fitness-class coalescent probabilities. We discuss these in detail in File S1 and File S4.

Our final and most important approximation is that we assume that Muller’s ratchet can be neglected. The ratchet occurs when *h*_{0} fluctuates to 0, so we can think of this approximation as an extreme aspect of neglecting fluctuations in the sizes of each fitness class. This approximation can sometimes be problematic; we discuss it in detail below.

Although we have focused primarily on situations in which selection is weak compared to total deleterious mutation rates, our approach is also valid regardless of whether *s* is strong or weak compared to *U*_{d}. However, when selection is sufficiently strong (*Ns* ≫ 1 and *U*_{d}/*s* < 1), then an effective population size approximation accurately describes the patterns of genetic variation, as we describe below. Thus our methods are primarily useful for situations in which selection is weak compared to mutation rates.

### Relationship with an effective population size approximation

Charlesworth *et al.* (1993) considered how selection against many linked deleterious mutations affects linked neutral diversity in a model identical to ours. These authors found that when selection is sufficiently strong, the shape of genealogies and hence the statistics of variation at linked neutral sites are identical to the neutral case, with a reduced effective population size. We refer to this as the effective population size (EPS) approximation.

The idea behind the EPS approximation is that when selection is strong, deleterious mutations are quickly eliminated from the population by selection. Thus if we sample individuals from the population, they must have very recently descended from individuals within the class of individuals that had no deleterious mutations (the 0-class). The EPS approximation assumes that the time for this to happen can be neglected and that individuals never coalesce before it does. These individuals then coalesce within the 0-class as a neutral process with effective population size equal to the size of that 0-class, which is

The EPS approximation is valid provided that the neutral coalescence time within the 0-class, *t*_{neut}, is large compared to the time it takes for a typical individual to have descended from the 0-class, *t*_{desc}. We know *k* ∼ *U*_{d}/*s*, we have that*Ns* > 1 and *U*_{d} < *s*. However, whenever *U*_{d} becomes much larger than *s*, it will typically break down even in enormous populations, as has been suggested by Nordborg *et al.* (1996) and Kaiser and Charlesworth (2009).

Our analysis describes the effects of background selection beyond the EPS approximation. We do not assume that the coalescence time through the fitness distribution is small compared to the coalescence times within the 0-class or that coalescence cannot occur among individuals carrying deleterious mutations. It is precisely these two effects that lead to distortions away from the neutral expectations, making it impossible to describe genealogies using neutral theory with a revised effective population size. Although our analysis is a generalization of the EPS approximation, it is not inconsistent with it. However, we have focused primarily on situations in which the EPS approximation breaks down, and coalescence times through the fitness distribution are large compared to those in the 0-class, because this is the situation in which our approach is most useful.

Note also that in many situations it may be the case that there are many linked weakly selected mutations *and* many linked strongly selected mutations. In such circumstances, the process we consider and the EPS approximation can act simultaneously, each for different classes of mutations. Imagine we had one class of mutations with fitness cost *s*_{1}, which occur with mutation rate *U*_{1}, where *U*_{1} < *s*_{1} and *Ns*_{1} ≫ 1 so that the EPS approximation applies. At the same time, imagine another class of mutations with fitness cost *s*_{2}, which occur with mutation rate *U*_{2}, where *U*_{2} ≫ *s*_{2} so that the EPS approximation breaks down for these mutations. In this case, the genetic diversity we expect to see will be characteristic of our fitness-class coalescent theory (with *U*_{d} = *U*_{2} and *s* = *s*_{2}), but with a reduced effective population size

### A “mutation–time” approximation

We have seen that our analysis accounts for two effects missing from the EPS approximation: coalescence events outside the 0-class and the time it takes for individuals to have descended from the 0-class. Whenever *U*_{d}/*s* and *N* are both sufficiently large, the former effect can be neglected while the latter is still important, because the number of lineages in each fitness class becomes large and hence coalescence events are very unlikely to occur outside of the 0-class. This leads to an approximation that we can think of as a generalization of the EPS approximation. Rather than considering primarily the diversity generated within the most-fit background, we focus instead on the diversity that accumulates while lineages move between different less-fit backgrounds. Hence we term this approach a mutation–time approximation (MTA) for short. In this approximation, we assume that all individuals coalesce within the 0-class, as with the EPS approximation. However, unlike the EPS approximation, we consider the time it took for individuals to descend from the 0-class in addition to the coalescence time within the 0-class. This approximation is valid for large *N* (when even *Nh*_{1} is enormous compared to 1/*s*) so that coalescence always occurs in the 0-class.

In this mutation–time approximation our results become much simpler and provide a useful intuitive picture of the structure of genealogies and genetic variation. Consider the deleterious heterozygosity π_{d} of two individuals sampled from fitness classes *k* and *k*′. In this approximation, these two individuals always coalesce in the 0-class so we always have π_{d} = *k* + *k*′. Since two individuals are sampled from classes *k* and *k*′ with probability *H*(*k*, *k*′), the distribution of π_{d} in the population as a whole is extremely simple: we have

This simple approximation makes it clear why the distribution of π_{d} looks the way it does, and explains how it varies with *U*_{d}/*s* and with *N*, both in this mutation–time approximation and more generally. For large *N*, when coalescence outside the 0-class can be neglected, two individuals from class *k* and *k*′ have π_{d} = *k* + *k*′. Thus the distribution of π_{d} has roughly the same shape as the distribution of fitness within the population. The mean π_{d} is 2*U*_{d}/*s*, since the average individual comes from class *k* = *U*_{d}/*s*. Smaller and larger π_{d} are less likely; the distribution of fitness in the population has variance equal to the mean, so the variance of the distribution of π_{d} is also roughly equal to its mean. As *N* gets smaller, there is sometimes coalescence outside of the 0-class. This reduces π_{d} given *k* and *k*′. Hence as we reduce *N*, the distribution of π_{d} shifts somewhat leftward, with a peak somewhat below 2*U*_{d}/*s*, and has slightly more variance relative to the mean since there is a less definite correspondence between *k*, *k*′, and π_{d}. Since π_{n} is determined by π_{d}, this also explains why the distribution of π_{n} has the peaked form we observe and how it depends on *U*_{d}/*s* and *N* (note that for π_{n} the coalescence time within the 0-class, which increases linearly with *N*, must also be included). All of these intuitive expectations are reflected in our results, as shown in Figure 4, Figure 5, Figure 7, and Figure 8. Note for example that in Figure 4, the peak of π_{d} is slightly below 2*U*_{d}/*s* (reflecting the finite population size) and has variance about equal to its mean; we have verified that as *N* increases, the shape of the distribution remains roughly the same, but the mean increases toward 2*U*_{d}/*s* and the variance decreases slightly.

More complex statistics of sequence variation are similarly straightforward to calculate in the mutation–time approximation. When considering larger samples, the genetic diversity is determined by the fitness classes these individuals come from, which is always simple since the probability a given individual is sampled from fitness class *k* is just the Poisson-distributed *h _{k}*. This approximation may therefore prove useful in developing simple and intuitive expressions for various statistics. For example, we can use this approximation to calculate a simple expression for the distribution of the total number of segregating negatively selected sites in a sample of size

*n*,

*k*that sum to

_{i}*x*. We find

_{d}looks as it does. We note, however, that as we increase the sample size

*n*, the population size

*N*must be even larger for this MTA to hold.

We can also calculate the distributions of actual coalescence times and hence the distributions of statistics describing neutral diversity in the mutation–time approximation. Consider the distribution of the real coalescence time between two individuals chosen from classes *k* and *k*′. In the mutation–time approximation where the coalescence time within the 0-class can be neglected, the actual coalescence time is*k* and *k*′, we have_{n} in the usual way,

We can immediately see that the average coalescence time in this MTA is_{d}, the shape of this distribution of π_{n} is primarily determined by the shape of *H*(*k*, *k*′). In this case, the peak in *h _{k}* at

*k*=

*U*

_{d}/

*s*leads to a peak in the distribution of real times and hence a peak in the distribution of π

_{n}. The width of the distribution of π

_{n}is somewhat wider, however, since even given individuals coming from fitness classes near the mean, there is a broad distribution of possible real times, and a broad distribution of π

_{n}even given a particular real time.

This average heterozygosity would correspond to an effective population size of_{n} nor its relationship to other statistics describing the genetic diversity. For smaller values of *N* where the mutation–time approximation breaks down, the average π_{n} would be somewhat lower than the MTA predicts and its distribution somewhat broader.

### Muller’s ratchet

We have neglected Muller’s ratchet throughout our analysis and assumed that the fitness distribution *h _{k}* is fixed. Yet Muller’s ratchet will certainly occur and in some circumstances could have a significant impact on genetic diversity (Charlesworth and Charlesworth 1997; Gordo

*et al.*2002; Seger

*et al.*2010). Thus this is a potentially important omission from our theory. In this section we discuss some of the complications associated with Muller’s ratchet that are important to keep in mind when considering our approach. We discuss the parameter regimes where neglecting Muller’s ratchet should be reasonable and those where it is likely to cause more serious problems. We provide rough estimates of how large we expect these problems to be and suggest a few possible ways in which future work might incorporate Muller’s ratchet into our general framework.

Muller’s ratchet causes several related problems within our theoretical framework. First, it causes the values of *h _{k}* to change with time and means they may not always follow a Poisson distribution. This changes the distribution of lineage frequencies within each class, and hence changes the coalescence probabilities. After a “click” of the ratchet, the whole distribution

*h*shifts in a complicated way, eventually reaching a new state where it is shifted left (so the class that was originally at frequency

_{k}*h*is now at frequency

_{k}*h*

_{k}_{−1}, and so on). In a similarly complex way, the PRF distribution of lineage frequencies in class

*k*shifts from

*f*to

_{k}*f*

_{k}_{−1}, and so on. This naturally changes the coalescence probabilities in each class. Fortunately, since the coalescence probabilities in class

*k*are generally very similar to those in classes

*k*+ 1 or

*k*− 1, this effect is unlikely to lead to major inaccuracies provided the ratchet does not click many times within a coalescent time. This is true except when we start considering coalescence in classes close to the 0-class, where the

*k*-dependence becomes significant. This can be thought of as an additional problem associated with Muller’s ratchet and is associated with the fact that the ratchet shifts the whole fitness distribution. This effect is easiest to see with an example: imagine we sample two individuals within the

*k*-class and that these individuals did not coalesce before their ancestors were both in the 0-class. At the time (in the past) when these individuals’ ancestors were in the 0-class, this current 0-class might have been the 1-class or 2-class (or higher). Thus these two individuals within the 0-class might not coalesce until, for example, their ancestors were in what is currently the “−2-class.” This clearly means that we might in fact have π

_{d}> 2

*k*, which our analysis assumes is impossible. In fact, we observe precisely this effect in simulations, and it is the reason why we commonly observe systematic deviations where the simulated values of π

_{d}are larger than our theory predicts.

From this discussion it is clear that the key factor in determining whether Muller’s ratchet can reasonably be neglected is how many times the ratchet clicks in a coalescence time. We have seen above that an average individual coalesces through the fitness distribution in a time at most of order _{d}, and neglecting it is a reasonable first approximation. In practice we find in our simulations that for the parameter regimes we consider, π_{d} is at most of order 2 larger than our theoretical predictions, which would correspond roughly to the effect of a single click of the ratchet during a typical coalescence time.

The discussion above suggests a way to incorporate Muller’s ratchet within our theoretical framework, albeit in an *ad hoc* way. The ratchet shifts the distribution *h _{k}* underneath the fitness-class coalescent process. The details of this shift are complicated, but on average every click of the ratchet shifts the distribution one step to the left. We can define

*k*

_{min}to be the number of deleterious mutations (relative to the optimal genotype) in the most-fit individual at any given time. For the case where

*h*replaced by

_{k}*k*

_{min}increases over time. This increase is a random process, but has some average rate, leading to an average

*k*

_{min}(

*t*). As we look backward in time during the fitness-class coalescent process, the value of

*k*

_{min}is decreasing due to Muller’s ratchet. This suggests a simple approximation: we replace the actual value of

*k*with an “effective” value of

*k*that accounts for the fact that

*k*

_{min}decreases as we look backward in time. For each step through the fitness distribution, we imagine that

*k*

_{min}has decreased by the appropriate amount, and hence the effective value of

*k*in the new fitness class is decreased by <1 compared to the old fitness class. When

*h*is on average shifted from the Poisson form (Gessler 1995). To incorporate the ratchet into our analysis in this situation, we first must recalculate the relevant coalescence probabilities given the expected average form of

_{k}*h*, and then carry out the above program. These and other methods to account for Muller’s ratchet remain an interesting topic for future work.

_{k}Despite the potential relevance of Muller’s ratchet in practical situations, we note that it does not affect our results in the standard coalescent limit. As is apparent from our general expressions for the coalescence probabilities, the structure of our fitness-class coalescent theory does not depend on all three parameters *N*, *U*_{d}, and *s* independently. Rather, it depends only on the combinations *NU*_{d} and *Ns*. Thus our theory makes sense in the standard limit where *NU*_{d} and *Ns* are held constant while we take *N* → ∞. In this limit, Muller’s ratchet does not occur. Whether this means that we can neglect the ratchet for large but finite *N* depends on the convergence properties of the coalescent limit. This is a difficult limit to explore with simulations, because it requires large population sizes. However, we have used simulations to verify in a few cases in which, as expected, increasing *N* while keeping *NU*_{d} and *Ns* constant does not change the predicted structure of genealogies but decreases some of the systematic differences between theoretical predictions and the simulations that are suggestive of the effect of the ratchet. Note that while this ratchet-free limit does not change the structure of genealogies in our fitness-class coalescent, the distribution of real coalescent times does change, since all real time scales are proportional to *s*. Thus, as might be expected, we must also take *NU*_{n} constant as *N* → ∞ if we wish neutral diversity to also remain unaffected in this limit.

Note that this ratchet-free limit, while fairly standard in coalescent theory, is somewhat different from the mutation–time approximation we discussed above. Of course, we can easily imagine a population that is large enough that the mutation–time approximation applies and *then* take the standard coalescent limit.

### Conclusion

Our fitness-class coalescent approach provides a framework in which we can compute distributions of genealogical structures in situations in which many linked negatively selected sites distort patterns of genetic variation. We have used this framework to calculate the distributions of a few simple statistics describing sequence variation. It remains for future work to use this fitness-class coalescent approach to compute a wide array of statistics to better understand the details of how purifying selection on many linked sites distorts patterns of genetic variation. The eventual goal will be to use our results to help interpret the increasing amounts of sequence data that seem to point to the importance of negative selection on many linked sites.

## Acknowledgments

We thank Daniel Fisher and John Wakeley for many useful discussions, which inspired our fitness-class coalescent approach. M.M.D. acknowledges support from the James S. McDonnell Foundation and the Harvard Milton Fund. A.M.W. thanks the Princeton Center for Theoretical Science at Princeton University, where she was a fellow during some of her work on this article. Many of the computations in this article were run on the Odyssey cluster supported by the FAS Sciences Division Research Computing Group at Harvard University. L.E.N. is supported by the Department of Defense through the National Defense Science and Engineering Graduate Fellowship Program, and also acknowledges support from an National Science Foundation graduate research fellowship. J.B.P. acknowledges support from the James S. McDonnell Foundation, the Alfred P. Sloan Foundation, the David and Lucille Packard Foundation, the Burroughs Wellcome Fund, Defense Advanced Research Projects Agency (HR0011-05-1-0057), and the U.S National Institute of Allergy and Infectious Diseases (2U54AI057168).

## Appendix A: The Fitness-Class Coalescent Probabilities

### PRF lineage-structure approach

In the main text, we used our PRF lineage-structure approach to write an integral expression for the probability *k* and *k*′ coalesce in class *k* − ℓ, Equation 13, above. In this appendix, we evaluate this integral to calculate the coalescent probabilities.

Equation 13 depends on the transition probability for the change in the frequency of a lineage from *x* to *y* in a time |*t*_{1} − *t*_{2}| in class *k* − ℓ, *G _{k}*

_{−ℓ}(

*y*→

*x*),|

*t*

_{2}−

*t*

_{1}|). This transition probability was calculated by Kimura (1955) and can be expressed as an infinite sum of Gegenbauer polynomials. Fortunately, it appears in the context of an integral

*y*over

*G*

_{k}_{−ℓ}. Hence this integral is given by the deterministic result for the change in the frequency of the lineage,

*x*integral can be evaluated using standard asymptotic methods; we find

*h*described in

_{k}*Appendix B*. Plugging in this result, we find

To make further progress, we must understand *k* and *k*′ originally mutated from class *k* − ℓ to class *k* − ℓ + 1. In general, *t*_{1} and *t*_{2} are not independent, since for the two lineages to have coalesced in class *k* − ℓ they must not have coalesced in any earlier classes, which makes them less likely to have been in those classes at the same time. In File S1, we analyze these distortions and their effects on the coalescence probabilities. Here we make use of a simpler approximation: since the coalescence probability in each step will turn out to be small, conditioning on not coalescing in a particular class does not shift the distribution of mutation timings much. We therefore neglect the complications associated with the probability distributions of the mutant timings conditional on noncoalescence. We refer to this as the nonconditional approximation and discuss its validity further in File S1.

In the nonconditional approximation, the times *t*_{1} and *t*_{2} are independent, *k* − ℓ is still there when the ancestor of the second individual mutated into that class. Thus *k* − ℓ at the same time, while

### Sum of ancestral paths approach

In the main text, we considered the probability of any particular ancestral path in the history of a sample of two individuals. In this section, we sum over the probabilities of all possible ancestral paths to compute the fitness-class coalescence probabilities. First, we consider sampling two individuals from the same fitness class *k*. For these two individuals to coalesce in class *k*, the first event must be a coalescent event. Using the event probabilities computed in the main text, we find *k* − 1, the first event must be a deleterious mutation event. Since both individuals’ ancestral lineages are currently in class *k*, the probability that the first event is a deleterious mutation event is *k* − 1 and one in class *k*. The next event must be a deleterious mutation in the latter, which occurs with probability

We can continue to extend this logic to subsequent fitness classes. For example, for coalescence to occur in class *k* − 2, there are six possible paths. We can label them as AABBc, BBAAc, ABABc, ABBAc, BABAc, and BAABc, where A corresponds to a mutation in the first individuals’ ancestral lineage, B corresponds to a mutation in the second individuals’ ancestral lineage, and c corresponds to a coalescent event. We can calculate the probability of each path. For example,*k* − 1 class at the same time. This distorts the probability of mutations at that step, since coalescence could also have occurred. For paths of this type, we have

It is informative to consider the form of this result. The *k* − 2, given that they existed in class *k* − 2 at the same time. The remaining factors represent the probability that the two ancestral lineages existed at the same time in class *k* − 2. This consists of a leading-order term

We can continue on to consider the probability of coalescence in class *k* − 3. There are now a total of *k* − 1 and *k* − 2 (*e.g.*, ABABABc), in class *k* − 1 only (*e.g.*, ABAABBc), in class *k* − 2 only (*e.g.*, AABBABc), or in neither (*e.g.*, AAABBBc). The probability of each type of path is identical, except for a distortion factor *k* − *i* in which the two ancestral lineages were together at the same time. The probabilities can be calculated as before and summed to yield *k*′ and *k*.

In File S4, we describe the details of carrying out this summation over all possible paths to determine the coalescent probabilities. We find*k* ≤ *k*′ by convention. The form of this solution is intuitive. The factor *k* − ℓ, given that the two ancestral lineages existed in this class at the same time. The remaining factors reflect the probability that the two lineages are together in class *k* − ℓ at some point. This consists of a leading-order term, which is identical to the *l* + 1 terms in the correction, each of which is known and calculable.

Provided that 2*Nh _{k}sk* ≫ 1, we can neglect the higher-order terms in Equation 56. This is equivalent to calculating the probability of coalescence in a given class, without considering the possibility that coalescence events could have occurred in previous classes. Thus it converts our expression for

The condition 2*Nh _{k}sk* ≫ 1 is the condition we are already assuming in treating the frequencies of each class,

*h*as constant (see

_{k}*Appendix B*). Thus the results from the PRF method and the sum of ancestral paths are exactly equivalent in the regime where they are valid. We discuss the correspondence between approximations in the sum of ancestral paths method as compared to the PRF method in more detail in File S4.

## Appendix B: Fluctuations in *h*_{K}

_{K}

Throughout our analysis, we have neglected fluctuations in the frequencies of each fitness class *h _{k}*. This approximation was necessary to write our PRF expressions for lineage structure,

*f*(

_{k}*x*), which depend on

*h*. Similarly, it was necessary for us to compute the probabilities of each possible ancestral event in our sum of ancestral paths method. In this appendix, we examine this approximation in detail and analyze its regime of validity.

_{k}Fluctuations in the fitness-class frequencies affect the coalescence probability within class *k* in three different ways. First, fluctuations in *h _{k}*

_{−1}affect the rate at which mutations enter class

*k*. When

*h*

_{k}_{−1}is larger than average, more mutations occur. Within the PRF method, this means that there will be more small lineages than the steady-state

*f*(

_{k}*x*) accounts for, which reduces the coalescence probability. In the sum of ancestral paths method, this means that the probability of mutation events increases relative to the probability of coalescence events, which similarly reduces the coalescence probability. When

*h*

_{k}_{−1}is smaller than average, fewer mutations occur, and the reverse is true.

Second, fluctuations in *h _{k}* affect the coalescence rates within this class. Consider the case in which

*h*is larger than average. Within the PRF method, this means that the probability that two individuals randomly sampled from class

_{k}*k*come from a given lineage of size

*x*is less than our assumption of

*h*is smaller than average, the reverse is true.

_{k}The third effect of fluctuations is specific to the PRF method, in which we assumed that the probability that two individuals in class *k* come from a lineage of frequency *x* (given that the lineage exists) is *x* in fitness class *k* does not affect the expected frequency of the class *h _{k}*. This is not strictly true: given that there exists a lineage at high frequency, it is likely that

*h*is larger than average, and vice versa. In other words, there is a correlation between the size of a lineage and the frequency of the class, so the probability that two individuals picked from a class come from a lineage of frequency

_{k}*x*is not precisely

*x*is large, this expression overestimates the probability that two individuals are from the same lineage, since given that those high-frequency lineages exist,

*h*will be larger than average. Similarly (although less dramatically), when

_{k}*x*is small our expression underestimates the probability that two individuals are from the same lineage.

Note that this third effect of fluctuations is distinct from the second effect above. The second effect describes fluctuations in *h _{k}* that are uncorrelated to the frequency of a particular lineage. It thus applies to both the PRF and the sum of ancestral paths methods; it reflects the general fact that when

*h*is larger, coalescence is less likely. The third effect, on the other hand, reflects the fact that if we assume that we sample an individual from a lineage of size

_{k}*x*, this biases the value of

*h*. Since our sum of ancestral paths method never makes any references to lineages, this third effect of fluctuations applies only to the PRF method.

_{k}These three effects all depend on the size of the fluctuations relative to the average size of the each fitness class. Thus neglecting fluctuations will be a good approximation provided that the fluctuations in *h _{k}* are small compared to

*h*. To determine when this will hold, we note that each lineage in class

_{k}*k*can reach, at most, a maximum size of order

*Nh*. This means that, provided that

_{k}*h*should be negligible provided that this condition holds.

_{k}To make this intuition more precise, we must calculate the variance in *h _{k}* and compare it to

*h*. In principle this information is contained in our PRF expressions, but it is much simpler to compute using a continuous-time branching process method. That is, rather than use a diffusion approximation to describe the dynamics of each lineage, we use a continuous-time branching process. As before, we imagine that new lineages in class

_{k}*k*are created at a rate θ

*/2. In steady state there will be some time-independent probability that there are*

_{k}*n*total individuals across all the lineages in the class,

*P*(

*n*). Note that on average we must have

*n*/

*N*=

*h*, and that

_{k}*P*(

*n*) contains information on the fluctuations in the

*h*. We first compute the generating function for

_{k}*P*(

*n*),

*k*subscripts. Note that this calculation is based on a continuous-time branching process, in which individuals have a different distribution of offspring number than in a Wright–Fisher process, leading to a transient distribution of the frequencies of individual lineages that is half as large as in the Wright–Fisher model for lineages of substantial frequency. Thus to make comparisons with the Wright–Fisher process, we have to take θ → 2θ (as we would in comparing Wright–Fisher to Moran models), as described by Desai and Fisher (2007).

Equation 59 describes the fluctuations in the size of an individual fitness class: the mean, variance, and higher moments of *n* can be easily computed by taking derivatives of *H*(*z*). Thus we can immediately compute Var(*h _{k}*)/

*h*using standard generating function methods. We find that in fact the fluctuations in

_{k}*h*are indeed negligible provided that

_{k}*Ns*≫ 1, our approach will still be a good approximation.

All three effects of fluctuations in *h _{k}* described above are negligible in the same parameter regime,

*Nh*≫ 1. However, the fact that the third effect applies only to our PRF result obscures the precise relationship between our two approaches and the relationship to earlier work. Further, relaxing this approximation provides a useful comparison of the subtle differences between the assumptions underlying the approaches. Thus we describe here an alternative approach to understanding the lineage structure in a fitness class that allows us to account for these correlations between the size of a lineage,

_{k}sk*x*, and the frequency of the fitness class,

*h*.

_{k}We first note that, in his original calculation of the neutral ESF, Ewens (1972) used a diffusion result, *f*(*x*), roughly analogous to our PRF expression to describe the probability that there exists a lineage with frequency *x* in the population at a given time. However, Ewens’ *f*(*x*) was derived as the solution to the diffusion approximation to the *K*-allele Wright–Fisher process, in the limit of infinite alleles. This process explicitly imposes the constraint that the sum of all lineages in the population at a given time must add to 1. This means that there is no correlation between the size of a lineage and the total number of individuals in the population.

The PRF calculation of the lineage structure does not involve this explicit constraint. This is what makes it possible to compute a simple analytical expression for *f _{k}*(

*x*). This lack of constraint means that the PRF result admits fluctuations in

*h*, which lead to corresponding correlations between

_{k}*x*and

*h*. We could partially avoid this by defining γ

_{k}*=*

_{k}*Nh*, rather than

_{k}s_{k}*Nh*, as we have so far. This would effectively mean that each lineage is assumed to be diffusing between 0 and

_{k}*h*rather than between 0 and 1 and forbid any lineage from reaching a frequency larger than

_{k}*h*. Thus it reduces the discrepancies associated with the correlations between

_{k}*x*and

*h*. However, even with this redefinition, there is no constraint that the lineages in a given class all add to precisely

_{k}*h*, and so correlations still exist.

_{k}To correct exactly for the effects of correlations between *x* and *h _{k}*, we extend the continuous-time branching process model introduced above. We now imagine that there are

*B*sites in the genome, each of which can mutate to create a new lineage in class

*k*. In the large-

*B*limit, each distinct lineage in class

*k*arose from a mutation at a different site in the genome (and we later make the infinite-sites assumption

*B*→ ∞, which makes this exactly true). The rate at which new mutations found lineages in class

*k*due to mutations at a specific one of these

*B*sites is

*n*mutations at a particular site

*i*in class

*k*is

*k*subscripts and we have taken θ → 2θ to match to the Wright–Fisher model as described above.

If we define *n _{i}*

_{,}

*to be the total number of mutants at site*

_{k}*i*in class

*k*, we have that

*=*

_{k}*Nh*). We now imagine that we sample some number

_{k}*m*individuals from class

*k*. The probability that they are all from the same lineage is

*then*dividing by the average σ

*. In other words, we explicitly account for the correlations between*

^{m}*x*and

*h*.

_{k}We can rewrite Equation 63 using the identity*B* sites are independent, we find*m* times with respect to *x* results in an expression for *B* → ∞, and neglecting higher-order terms in *s*, we find

If we were to use the original PRF result to calculate the probability that two individuals sampled simultaneously from class *k* are from the same lineage, we would find *h _{k}* yields the modified probability

*h*. All of the formulae quoted in the main text and shown in the figures incorporate this correction, which appropriately handles the correlations between the frequency of an individual lineage and the size of the fitness class.

_{k}## Appendix C: Relation to Previous Work

In this appendix we compare our analysis to related work and summarize the key approximations that we and others have used. We have presented two main approaches to calculating coalescence probabilities in this article. The first approach is based on the lineage structure within each fitness class, described using a PRF-based method. The second approach involves summing over all possible ancestral paths, based on the structured coalescent framework introduced by Kaplan *et al.* (1988) and Hudson and Kaplan (1994, 1995). We show in this article that both approaches involve closely related approximations and yield equivalent expressions for the coalescence probabilities.

Historically, attempts to describe the coalescent process in the presence of selection go back to the structured coalescent introduced by Kaplan *et al.* (1988). These authors considered a sample of individuals from given fitness classes and computed the relative probabilities that the next event to occur backward in time would involve a mutation or coalescent event, without explicitly describing lineage structure. In their original work, Kaplan *et al.* (1988) used a full stochastic description of the frequencies of each fitness class, in which one keeps track of the probability distribution of these frequencies to account for selection. They derived diffusion equations for the transition probabilities between states. This approach is very general, but as a result is complex and requires numerical evaluation. Barton and Etheridge (2004) developed this diffusion approach to compute the effect of selection on genealogies in a system in which selection acts only on a single locus.

Hudson and Kaplan (1994) later simplified their original structured coalescent approach to describe the case in which fluctuations in the frequencies of fitness classes can be neglected. In this deterministic approximation, they showed that one can compute very simple expressions for the relative probabilities of the next event to occur backward in time in the history of a sample. In this manner, Hudson and Kaplan (1994) were able to generate a simple recursion relation for the mean time to a common ancestor, their Equation 12. Gordo *et al.* (2002) used this equation as the basis for a coalescent simulation, and Zeng and Charlesworth (2011) recently extended this method to describe the joint effects of recombination and background selection.

Recursion relations of the Hudson and Kaplan (1994) form can be solved numerically and have been used to generate data describing coalescent statistics, but have not yet led to an analytic description of the structure of genealogies in the presence of negative selection at many linked sites. In this article we have shown that one can sum over ancestral paths within this framework, to derive analytical formulas for the coalescence probabilities that are equivalent to those computed from our lineage-based formalism. This equivalence means that our analytical results in this article match earlier numerical and simulation results based on the Hudson and Kaplan (1994) formulation. However, like the Hudson and Kaplan (1994) framework, neither of our approaches in this article account for fluctuations in the frequencies of fitness classes.

In reality, the frequency of each fitness class will fluctuate due to genetic drift. As we have described in *Appendix B*, these fluctuations are substantial in classes whose deterministic size is small compared to the inverse of the effective selection pressure against individuals in that class, *Nh _{k}sk* < 1. This leads to important effects on the structure of genealogies if most fitness classes through the bulk of the fitness distribution fluctuate substantially. This will occur whenever

*Ns*≲ 1, so fluctuations must therefore be taken into account for small

*Ns*. While the diffusion approach of Kaplan

*et al.*(1988) in principle provides a complete solution to this problem for all values of

*Ns*, this formalism and the related results of Barton and Etheridge (2004) are computationally strenuous. A need for further work on accurate but more analytically tractable approaches that are able to account for the frequency fluctuations remains.

We note that the work of O’Fallon *et al.* (2010) and of Hermisson *et al.* (2002) introduced analytical approaches valid for the case of *Ns* ∼1, although these methods are not based on a model related to the ideas of Kaplan *et al.* (1988). We also note that the problem of fluctuating fitness-class sizes has been considered in the case of other problems (for example, forward selection; Coop and Griffiths 2004), but a detailed discussion is outside the scope of this work (Table C1).

Neglecting the fluctuations in fitness-class frequencies is in principle reasonable when *Ns* ≫ 1. However, we note that even when *Ns* ≫ 1, the sizes of the smallest fitness classes near the tails of the distribution may still fluctuate substantially. Muller’s ratchet is one aspect of this general effect. Recently Seger *et al.* (2010) extended the simulation scheme of Gordo *et al.* (2002) to address this problem by first doing a forward-time simulation, recording the fluctuations in the classes (including Muller’s ratchet) from this simulation, and then putting these fluctuations into a backward simulation by hand. Our methods do not account for these effects. They are therefore less general than the work of Seger *et al.* (2010), and break down due to fluctuation effects more quickly as *Ns* decreases. On the other hand, our analysis does not rely on forward simulations and is able to compute simple analytic expressions for coalescence probabilities.

We also note that although we consider the large *Ns* approximation, our approach has a broader range of applicability than the effective population-size approximation, which assumes that the coalescence time is dominated by the time to coalescence within the most-fit class. For the EPS approximation to be valid requires that this latter time *Ns* ≫ 1. Thus we can easily have *Ns* ≫ 1, yet

## Footnotes

*Communicating editor: N. A Rosenberg*

- Received September 9, 2011.
- Accepted November 29, 2011.

- Copyright © 2012 by the Genetics Society of America