## Abstract

Positive selection distorts the structure of genealogies and hence alters patterns of genetic variation within a population. Most analyses of these distortions focus on the signatures of hitchhiking due to hard or soft selective sweeps at a single genetic locus. However, in linked regions of rapidly adapting genomes, multiple beneficial mutations at different loci can segregate simultaneously within the population, an effect known as clonal interference. This produces a subtle interplay between hitchhiking and interference effects, which leaves a unique signature of rapid adaptation on genetic variation both at the selected sites and at linked neutral loci. Here, we introduce an effective coalescent theory (a “fitness-class coalescent”) that describes how positive selection at many perfectly linked sites alters the structure of genealogies. We use this theory to calculate several simple statistics describing genetic variation within a rapidly adapting population and to implement efficient backward-time coalescent simulations, which can be used to predict how clonal interference alters the expected patterns of molecular evolution.

Beneficial mutations drive long-term evolutionary adaptation, and despite their rarity they can dramatically alter the patterns of genetic diversity at linked sites. Extensive work has been devoted to characterizing these signatures in patterns of molecular evolution and using them to infer which mutations have driven past adaptation.

When beneficial mutations are rare and selection is strong, adaptation progresses via a series of selective sweeps. A single new beneficial mutation occurs in a single genetic background and increases rapidly in frequency toward fixation. This is known as a “hard” selective sweep, and it purges genetic variation at linked sites and shortens coalescence times near the selected locus (Maynard Smith and Haigh 1974). Most statistical methods used to detect signals of adaptation in genomic scans are based on looking for signatures of these hard sweeps (Sabeti *et al.* 2006; Nielsen *et al.* 2007; Akey 2009; Novembre and Di Rienzo 2009; Pritchard *et al.* 2010).

Hard selective sweeps are the primary mode of adaptation in small- to moderate-sized populations in which beneficial mutations are sufficiently rare. However, in larger populations where beneficial mutations occur more frequently, many different mutant lineages can segregate simultaneously in the population. If the loci involved are sufficiently distant that recombination occurs frequently enough between them, their fates are independent and adaptation will proceed via independent hard sweeps at each locus. However, in largely asexual organisms such as microbes and viruses, and on shorter distance scales within sexual genomes, selective sweeps at linked loci can overlap and interfere with one another. This is referred to as clonal interference, or Hill–Robertson interference in sexual organisms (Hill and Robertson 1966; Gerrish and Lenski 1998). These interference effects can dramatically change both the evolutionary dynamics of adaptation and the signatures of positive selection in patterns of molecular evolution. We illustrate them schematically in Figure 1.

We and others have characterized the evolutionary dynamics by which a population accumulates beneficial mutations in the presence of clonal interference (Gerrish and Lenski 1998; Ridgway *et al.* 1998; Rouzine *et al.* 2003; Desai and Fisher 2007; Hallatschek 2011; Good *et al.* 2012). Many recent experiments in a variety of different systems have confirmed that these interference effects are important in a wide range of laboratory populations of microbes and viruses (de Visser *et al.* 1999; Miralles *et al.* 1999; Bollback and Huelsenbeck 2007; Desai *et al.* 2007; Kao and Sherlock 2008). These theoretical and experimental developments have recently been reviewed by Park *et al.* (2010) and Sniegowski and Gerrish (2010).

Although this earlier theoretical work has provided a detailed characterization of evolutionary dynamics in the presence of clonal interference, it does not make any predictions about the patterns of genetic variation within an adapting population. In this article, we address this question of how clonal interference alters the structure of genealogies, and how this affects patterns of molecular evolution both at the sites underlying adaptation and at linked neutral sites. Our work is related to earlier analysis of the same situation by Kim and Stephan (2003), who described the effects of multiple overlapping selective sweeps on fixation rates of beneficial mutations and some aspects of the variation at linked neutral sites, building on earlier models of recurrent hitchhiking (Kaplan *et al.* 1989; Wiehe and Stephan 1993). We consider here a more general description of how clonal interference alters the structure of genealogies, which can be used to predict the distribution of any statistic describing genetic variation both at positively selected and linked neutral sites. This has become particularly relevant in light of recent advances that now make it possible to sequence individuals and pooled population samples from microbial adaptation experiments (Gresham *et al.* 2008; Kao and Sherlock 2008; Barrick and Lenski 2009; Barrick *et al.* 2009).

We note that much recent work in molecular evolution and statistical genetics has analyzed related scenarios where adaptation involves multiple mutations, motivated by recent theoretical work (Orr and Betancourt 2001; Ralph and Coop 2010) and empirical data from *Drosophila* (Sella *et al.* 2009) and humans (Coop *et al.* 2009; Hernandez *et al.* 2011) that suggests that simple hard sweeps may be rare. This includes most notably analysis of the effects of “soft sweeps,” where recurrent beneficial mutations occur at a single locus, or selection acts on standing variation at this locus (Hermisson and Pennings 2005; Pennings and Hermisson 2006a,b). Soft sweeps drive multiple genetic backgrounds to moderate frequencies, leaving several deeper coalescence events and hence a weaker signature of reduced variation in the neighborhood of the selected locus than a hard sweep (Przeworski *et al.* 2005).

In contrast to the situation we analyze here, both hard and soft sweeps refer to the action of selection at a single locus. We consider instead a case more analogous to models in quantitative genetics, where selection acts on a large number of loci that all affect fitness. In other words, our analysis of clonal interference can be thought of as a description of polygenic adaptation, where selection favors the individuals who have beneficial alleles at multiple loci. Recent work has argued for the potential importance of polygenic adaptation from standing genetic variation (Pritchard and Di Rienzo 2010; Pritchard *et al.* 2010), loosely analogous to the case where soft sweeps act at many loci simultaneously (Chevin and Hospital 2008; Hancock *et al.* 2010). Our analysis in this article, by contrast, describes polygenic adaptation via multiple new mutations of similar effect at many loci, where each locus has a low enough mutation rate that it would undergo a hard sweep in the absence of the other loci.

As with hard and soft sweeps, the signatures of this form of adaptation on nearby genomic regions are determined by how it alters the structure and timing of coalescence events. In this article, we therefore focus on computing how clonal interference alters the structure of genealogies. This involves two basic effects. On the one hand, mutations at the many loci occur and segregate simultaneously, interfering with each other’s fixation. This preserves some deeper coalescence events, as in a soft sweep. On the other hand, since the mutations occur at different sites, multiple beneficial mutations can also occur in the same genetic background and hitchhike together. This tends to shorten coalescence times, making the signature of adaptation somewhat more like a “hard sweep.” Together, these effects lead to unique patterns of genetic diversity characteristic of clonal interference.

Our analysis of these effects is based on the fitness-class coalescent we previously used to describe the effects of purifying selection on the structure of genealogies (Walczak *et al.* 2012). This in turn is closely related to the structured coalescent model of Hudson and Kaplan (1994). We begin in the next section by describing our model and summarizing our earlier analysis of the rate and dynamics of adaptation in the presence of clonal interference, which describes the distribution of fitnesses within the population (Desai and Fisher 2007). We then show how one can trace the ancestry of individuals as they “move” between different fitness classes via mutations (our fitness-class coalescent approach). We compute the probability that any set of individuals coalesce when they are within the same fitness class. This leads to a description of the probability of any possible genealogical relationship between a sample of individuals from the population. Finally, we show how the distortions in genealogical structure caused by clonal interference alter the distributions of simple statistics describing genetic variation at the selected loci as well as linked neutral loci. We also use our approach to implement coalescent simulations analogous to those previously used to describe the action of purifying selection (Gordo *et al.* 2002; Seger *et al.* 2010), based on the structured coalescent method of Hudson and Kaplan (1994). These coalescent simulations can be used to analyze in detail how this form of selection alters the structure of genealogies.

Our results provide a theoretical framework for understanding the patterns of genetic diversity within rapidly evolving experimental microbial populations. Our analysis may also have relevance for understanding how pervasive positive selection alters patterns of molecular evolution more generally, but we emphasize that our work here focuses entirely on asexual populations or on diversity within a short genomic region that remains perfectly linked over the relevant time scales. In the opposite case of strong recombination, adaptation will progress via independent hard selective sweeps at each selected locus. Further work is required to understand the effects of intermediate levels of recombination, where the approach recently introduced by Neher *et al.* (2010) may provide a useful starting point.

## Model and Evolutionary Dynamics

### Model

We consider a finite haploid asexual population of constant size *N*, in which a large number of beneficial mutations are available, each of which increases fitness by the *same* amount *s*. We define *U*_{b} as the total mutation rate to these mutations. We neglect deleterious mutations and beneficial mutations with other selective advantages. We have previously shown that the dynamics in rapidly adapting populations are dominated by beneficial mutations of a specific fitness effect (Desai and Fisher 2007; Fogle *et al.* 2008; Good *et al.* 2012), so this model is a useful starting point, but we return to discuss these assumptions further in the *Discussion*. We also assume that there is no epistasis for fitness, so the fitness of an individual with *k* beneficial mutations is *w*_{k} = (1 + *s*)^{k} ≈ 1 + *sk*. This is the same model of adaptation we have previously considered (Desai and Fisher 2007) and is largely equivalent to models used in most related theoretical work on clonal interference (Rouzine *et al.* 2003, 2008; Park *et al.* 2010). We later also consider linked neutral sites with total mutation rate *U*_{n}, but for now we focus on the structure of genealogies and neglect neutral mutations.
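As a quick numerical check of the approximation *w*_{k} ≈ 1 + *sk*, the following minimal sketch compares it with the multiplicative form (1 + *s*)^{k}. The function names and the value *s* = 0.01 are illustrative assumptions, not part of the model definition.

```python
def fitness(k, s):
    """Multiplicative fitness w_k = (1 + s)**k of an individual with k beneficial mutations."""
    return (1.0 + s) ** k


def fitness_linear(k, s):
    """First-order approximation w_k ~ 1 + s*k, accurate when s*k is small."""
    return 1.0 + s * k


s = 0.01  # hypothetical selective advantage per mutation
for k in (1, 5, 20):
    exact = fitness(k, s)
    approx = fitness_linear(k, s)
    rel_err = abs(exact - approx) / exact
    print(f"k={k:2d}  exact={exact:.5f}  linear={approx:.5f}  relative error={rel_err:.3%}")
```

The relative error grows with *sk*, so the linear form is most reliable when the number of mutations separating two individuals is small compared to 1/*s*.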

To analyze expected patterns of genetic variation, we must also make specific assumptions about how mutations occur at particular sites. We consider a perfectly linked genomic region that has a total of *B* loci at which beneficial mutations can occur. We assume that these mutations occur at rate *μ* per locus, for a total beneficial mutation rate *U*_{b} = *μB*. We later take the infinite-sites limit, *B* → ∞, while keeping the overall beneficial mutation rate *U*_{b} constant. Each mutation is assumed to confer the same fitness advantage *s*, where *s* ≪ 1. We also assume throughout that selection is strong compared to mutations, *s* ≫ *U*_{b}, which allows us to use our earlier results in Desai and Fisher (2007) as a basis for our analysis. Analysis of the opposite case where *s* < *U*_{b} remains an important topic for future work, which could be based on alternative models of the dynamics such as the approach of Hallatschek (2011). Although our model is defined for haploids, our analysis also applies to diploid populations provided that there is no dominance (*i.e.*, being homozygous for the beneficial mutation carries twice the fitness benefit as being heterozygous).

This model is the simplest framework that captures the effects of positive selection on a large number of independent loci of similar effect. However, the dynamics of adaptation in this model can be complex. Beginning from a population with no mutations at the selected loci, there is first a transient phase while variation at these loci initially increases. There is then a steady-state phase during which the population continuously adapts toward higher fitness. Finally, adaptation will eventually slow down as the population approaches a well-adapted state. In this article, we focus on the second phase of rapid and continuous adaptation, which has been the primary focus of previous work by us and others (Desai and Fisher 2007; Rouzine *et al.* 2008; Park *et al.* 2010; Hallatschek 2011). Our goal is to understand how this continuous rapid adaptation alters the structure of genealogies and hence patterns of genetic variation. We begin in the next subsection by summarizing the relevant aspects of our earlier results for the distribution of fitness within the population.

### The distribution of fitness within the population

In our model in which all beneficial mutations confer the same advantage, *s*, the distribution of fitnesses within the population can be characterized by the fraction of the population, *ϕ*_{k}, that has *k* beneficial mutations more or less than the population average. We refer to this as “fitness-class *k*.”

When *N* and *U*_{b} are small, it is unlikely that a second beneficial mutation will occur while another is segregating. Hence adaptation proceeds by a succession of selective sweeps. In this regime, beneficial mutations destined to survive drift arise at rate *NU*_{b}*s* and then fix in […], so that between sweeps *ϕ*_{0} = 1 and *ϕ*_{k} = 0 for *k* ≠ 0.

In larger populations, however, new mutations continuously arise before the older mutants fix. Thus the population maintains some variation in fitness even while it adapts. The distribution of fitnesses within the population is determined by the balance between two effects. On the one hand, new mutations arise at the high-fitness “nose” of this distribution, generating new mutants more fit than any other individuals in the population. This increases the variation in fitness in the population. (While new mutations occur throughout the fitness distribution, the mutations essential to maintaining variation are those that arise at the nose and generate new most-fit individuals.) On the other hand, selection destroys less-fit variants, increasing the mean fitness and decreasing the variation in fitness within the population. This is illustrated in Figure 2.

We showed in previous work that this balance between mutation and selection leads to a constant steady-state distribution of fitnesses within the population, measured relative to the current (and constantly increasing) mean fitness (Desai and Fisher 2007). In this steady-state distribution, the fraction of individuals with *k* beneficial mutations relative to the current mean in the population is approximately […], where *C* is an overall normalization constant that will not matter for our purposes. Note that the distribution *ϕ*_{k} is approximately Gaussian.

This distribution *ϕ*_{k} is cut off above some finite maximum *k*, which corresponds to the nose of the distribution, the most-fit class of individuals. We define the *lead* of the fitness distribution, *qs*, as the difference between the mean fitness and the fitness of these most-fit individuals (so *q* is the maximum value of *k*; the most-fit individuals have *q* more beneficial mutations than the average individual). In Desai and Fisher (2007), we showed that *q* ≈ 2 ln[*Ns*]/ln[*s*/*U*_{b}].

Above we have implicitly defined […], *i.e.*, for the mean fitness to increase by the lead of the fitness distribution. This takes *q* establishment times, so this “nose-to-mean” time, *τ*_{nm}, is *q* times the establishment time […]. We note that no single mutant sweeps to fixation in this time: rather, a whole set of mutants comprising a new fitness class at the nose comes to dominate the population a time *τ*_{nm} later.
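To give a feel for the magnitudes involved, the sketch below evaluates the lead *q* and the nose-to-mean time for one parameter set. It assumes the strong-selection expression *q* ≈ 2 ln(*Ns*)/ln(*s*/*U*_{b}) from Desai and Fisher (2007); the establishment time is passed in as a free input, and all numerical values and function names are hypothetical.

```python
import math


def lead_in_mutations(N, s, U_b):
    """Lead q ~ 2 ln(N s) / ln(s / U_b), assumed from Desai and Fisher (2007).

    Only meaningful in the strong-selection regime s >> U_b with N*s > 1.
    """
    if not (s > U_b > 0 and N * s > 1):
        raise ValueError("requires s > U_b > 0 and N*s > 1")
    return 2.0 * math.log(N * s) / math.log(s / U_b)


def nose_to_mean_time(q, establishment_time):
    """Nose-to-mean time: q establishment times (establishment_time is an input)."""
    return q * establishment_time


# Hypothetical parameters chosen so that N*s = 1e5 and s/U_b = 1e3.
q = lead_in_mutations(N=1e7, s=0.01, U_b=1e-5)
print(f"q = {q:.3f}")
print(f"nose-to-mean time = {nose_to_mean_time(q, establishment_time=100.0):.1f} generations")
```

Because *q* depends only logarithmically on *N*, even very large changes in population size shift the lead by only a few mutations.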

## The Fitness-Class Coalescent Approach

We now wish to understand the patterns of genetic variation within a rapidly adapting population in the clonal interference regime. To do so, we use a fitness-class coalescent method in which we trace how sampled individuals descended from individuals in less-fit classes, moving between classes by mutation events. In each fitness class there is some probability of coalescence events. To calculate these coalescence probabilities, we must first understand the clonal structure within each fitness class: this we now consider.

### Clonal structure

Each fitness class is first created when a new beneficial mutation occurs in the current most-fit class, creating a new most-fit class at the nose of the fitness distribution (see inset of Figure 2). This new clonal mutant lineage fluctuates in size due to the effects of genetic drift and selection before it eventually either goes extinct or establishes (*i.e.*, reaches a large enough size that drift becomes negligible). After establishing, the lineage begins to grow almost deterministically. Concurrently additional mutations occur at the nose of the distribution, also founding new mutant lineages within this most-fit class. This process is illustrated in Figure 3A.

We wish to understand the frequency distribution of these new clonal lineages, each founded by a different beneficial mutation. In our infinite-sites model, each such lineage is genetically unique. We can gain an intuitive understanding of this frequency distribution with a simple heuristic argument. After it establishes, the size of the current most-fit class, *n*_{q−1}(*t*), grows approximately deterministically according to the formula […]. New mutations occur in this class at rate *U*_{b}*n*_{q−1}(*t*), creating even more-fit individuals. Each new mutation has a probability *qs* of escaping genetic drift to form a new established mutant lineage. Thus the ℓth established mutant lineage at the nose on average occurs at roughly the time *t*_{ℓ} that satisfies […]. Solving for *t*_{ℓ} and then noting that the size, *n*_{ℓ}, of the ℓth established lineage will be proportional to […].

The analysis above describes the clonal structure created as a new fitness class is formed, advancing the nose. After approximately one establishment time, the mean fitness of the population will have increased by *s*, and the growth rates of all the fitness classes we have described will decrease correspondingly. Thus we can strictly use only the calculations above up to some finite number of mutations, ℓ_{max}, after which all growth rates will have decreased due to the advance of the mean fitness of the population. Mutations will continue to occur after this time, but their frequency distribution will be slightly different. Fortunately, in the strong selection regime we consider (*s* ≫ *U*_{b}), the total contribution of all mutations after this point to the total size of the class is small compared to the contributions of the mutations that occur while this class is at the nose (Desai and Fisher 2007; Brunet *et al.* 2008b). These studies have also shown that these later-occurring mutations almost never fix. Thus the ancestries of most samples of individuals will not include any such mutations, and they will not strongly affect genealogical structure or accumulate in the long term. We therefore neglect this cutoff to the number of mutations that occur at the nose, as well as the contribution of later mutations. This approximation will break down for very large samples. However, the errors it introduces can be shown to be relatively small even when considering quantities such as the time to the most recent common ancestor of the whole population. We note, however, that whenever *U*_{b} is greater than or of order *s*, we expect this approximation to break down and beneficial mutations that occur slightly away from the nose to become important.

Another important aspect of the dynamics that simplifies the behavior is that despite the changing growth rate of the fitness class as a whole, the *frequencies* of the established lineages within the class remain fixed. In other words, the clonal structure within the class remains “frozen” after it is initially created, rather than fluctuating with time (see Figure 3B). As we show, this and the neglect of late-arising mutations are good approximations in the regimes we consider here.
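The heuristic argument for the typical clone-size distribution can be sketched numerically. The code below is an illustration under explicitly assumed forms, none of which are quoted from the text: the feeding class grows as *n*₀ exp[(*q* − 1)*st*], each new mutant establishes with probability *qs*, and an established lineage then grows at rate *qs*, so a lineage founded at time *t*_{ℓ} has relative size exp(−*qs* *t*_{ℓ}). Under these assumptions the frozen frequencies fall off as ℓ^{−*q*/(*q*−1)}. The function name and parameter values are hypothetical.

```python
import math


def heuristic_lineage_frequencies(q, s, U_b, n_lineages, n0=1.0):
    """Frozen frequencies of the first n_lineages established lineages at the nose.

    Assumed forms (illustrative only): feeding class size n0*exp((q-1)*s*t),
    establishment probability q*s, lineage growth rate q*s.
    """
    if q <= 1:
        raise ValueError("the heuristic requires q > 1")
    r = (q - 1) * s  # assumed growth rate of the feeding class
    sizes = []
    for ell in range(1, n_lineages + 1):
        # Expected number of established lineages by time t is
        #   integral_{-inf}^{t} U_b * q*s * n0 * exp(r*u) du = U_b*q*s*n0*exp(r*t)/r.
        # Setting this equal to ell and solving for t gives t_ell.
        t_ell = math.log(ell * r / (U_b * q * s * n0)) / r
        # A lineage founded later has had less time to grow at rate q*s.
        sizes.append(math.exp(-q * s * t_ell))
    total = sum(sizes)
    return [x / total for x in sizes]


freqs = heuristic_lineage_frequencies(q=4, s=0.01, U_b=1e-5, n_lineages=50)
print("largest lineage frequency:", round(freqs[0], 4))
print("ratio of first to second lineage:", round(freqs[0] / freqs[1], 4))
```

This deterministic picture deliberately ignores fluctuations in the establishment times, so it cannot produce the anomalously early mutations that occasionally dominate a class.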

While our heuristic analysis provides a good picture of the typical frequency distribution of clonal lineages within each fitness class, it misses a crucial effect. Occasionally a new mutation at the nose will, by chance, occur anomalously early. This single mutant lineage can then dominate its fitness class. These events are quite rare, but when they do occur this single lineage can purge a substantial fraction of the total genetic diversity within the population. As we see, these events together with less-rare but still early mutations are essential to understanding the structure of genealogies within the population, as they lead to a substantial probability of “multiple merger” coalescent events.

To capture these effects, we must carry out a more careful stochastic analysis of the clonal structure within each fitness class. As before, we focus on the clonal structure created when that class was at the nose of the fitness distribution, since it remains “frozen” thereafter. To do so, we note that the population size at the nose can be written as […], where *ν*_{i}(*t*) reflects the stochastic effects of a clone generated from mutations at site *i* (of *B* total possible sites). At late enough times, the distribution of *ν*_{i} becomes time independent, as shown previously (Desai and Fisher 2007). This time-independent *ν*_{i} summarizes the combined effect of all the stochastic dynamics of mutations at this site that are relevant for the long-term dynamics. We showed that the generating function of *ν*_{i} is […], and we have defined […]. The generating function *G*_{i}(*z*) for the size of the clonal lineage founded at each possible site contains all of the relevant information about the lineage frequency distribution, including the stochastic effects described above. Below we use it to calculate coalescence probabilities within our fitness-class coalescent approach, which we now turn to.

### Tracing genealogies

To calculate the structure of genealogies, we take a fitness-class approach analogous to the one we used to analyze the case of purifying selection (Walczak *et al.* 2012). We first consider sampling several individuals from the population. These individuals come from some set of fitness classes with probabilities given by the frequencies of those fitness classes, *ϕ*_{k}. We note that in the purifying selection case, fluctuations in the *ϕ*_{k} due to genetic drift were a potential complication in determining these sampling probabilities. Here, these fluctuations are much less important provided that *U*_{b}/*s* ≪ 1. We note, however, that fluctuations in different *ϕ*_{k} are correlated due to the stochasticity at the nose. Furthermore, averages of *ϕ*_{k} are far larger than their median values due to rare fluctuations. Such fluctuations, which we discuss in detail elsewhere (Fisher 2013), may lead to some slight corrections to our results. But for most purposes, the “typical” values of the *ϕ*_{k} (*i.e.*, the average *ϕ*_{k} excluding these rare fluctuations) are what matter: thus we make the simple approximation that the probability of sampling one individual from class *k*_{1} and a second from class *k*_{2} is simply *ϕ*_{k₁}*ϕ*_{k₂}, with *ϕ*_{k} as given in Equation 2. Analogous formulas apply for larger samples.

Each sampled individual comes from a specific fitness class *k* and belongs to a specific clonal lineage within that class. This clonal lineage was created when this fitness class was at the nose of the distribution, approximately […], by a new mutation in an individual from what is now fitness class *k*−1. That individual in turn belonged to some clonal lineage within class *k*−1, which in turn was created when that class was at the nose by a new mutation in an individual from what is now fitness class *k*−2, and so on.

We now describe the probability of a genealogy relating a sample of several individuals. Imagine, for simplicity, that we sampled two individuals that both happened to be in the same fitness class, *k*. If these individuals were from the same clonal lineage within that class, then they are genetically identical at all the *B* positively selected sites. We say they coalesced in class *k*. If these individuals were not from the same clonal lineage within the class, then they both descended from individuals, in what is now fitness class *k*−1, that got distinct beneficial mutations. If the individuals in which these mutations occurred are from the same clonal lineage within class *k*−1, we say the sampled individuals coalesced in class *k*−1. If so, they differ at two of the *B* positively selected sites. If not, they descended from individuals, in what is now fitness class *k*−2, that got distinct beneficial mutations, and so on. We can apply similar logic to larger samples or when the individuals were sampled from different fitness classes. We illustrate this fitness-class coalescent process in Figure 4.
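The step-by-step logic for two individuals sampled from the same class can be written as a short Monte Carlo sketch. It assumes a single pairwise coalescence probability per class, here called `d21` and given a purely hypothetical value, and it treats the number of less-fit classes as unbounded for simplicity.

```python
import random


def sample_pairwise_differences(d21, rng, max_steps=100000):
    """Selected-site differences between two individuals sampled from the same class.

    Tracing back one class at a time: the pair coalesces in the current class
    with probability d21; otherwise both ancestors sit one class lower, which
    adds two differing sites (one distinct mutation per lineage).
    """
    differences = 0
    for _ in range(max_steps):
        if rng.random() < d21:
            return differences
        differences += 2
    raise RuntimeError("no coalescence within max_steps")


rng = random.Random(0)
d21 = 0.25  # hypothetical per-class pairwise coalescence probability
draws = [sample_pairwise_differences(d21, rng) for _ in range(20000)]
print("mean number of differing selected sites:", sum(draws) / len(draws))
print("fraction coalescing in the sampled class:", draws.count(0) / len(draws))
```

Under these assumptions the number of classes traversed before coalescence is geometric, so the mean number of differences is 2(1 − `d21`)/`d21`.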

Given that a sample of individuals coalesced in some lineage in fitness class *k*, it remains to determine when this coalescence event (or events) occurred. To do so, we note that each lineage in class *k* was originally founded by a single mutant individual approximately […]. Provided that *U*_{b}/*s* ≪ 1 holds, the typical variation in coalescence times within a class will be small compared to […].

We note that the probability a sample of individuals comes from the same clonal lineage is the same in each fitness class, since the clonal structure of the class was always determined when that class was at the nose of the distribution (nevertheless, conditional on some individuals coalescing in a class, the probability of additional coalescence events is substantially altered; see below). In addition, the coalescence probabilities do not depend on when the mutations occurred in the ancestral lineages of each sampled individual, since all clonal lineages were founded when a class was at the nose of the fitness distribution. These are major simplifications compared to the case of purifying selection, where the relative timings of mutations and the differences in clonal structure in different classes are important complications (Desai *et al.* 2012; Walczak *et al.* 2012).

To use the fitness-class coalescent approach to calculate the probability of a given genealogical relationship among a sample of individuals from the population, it remains only to calculate the probabilities that arbitrary subsets of these individuals coalesced within each fitness class. In the next section, we use the above-described clonal structure to compute these fitness-class coalescence probabilities.

### Fitness-class coalescence probabilities

We begin our calculation of the fitness-class coalescence probabilities by considering the probability that *H* individuals coalesce to 1 in a given class. We call this probability *D*_{H1}. This coalescence event will occur if and only if all *H* of these individuals are members of the same clonal lineage. The probability an individual is sampled from a clone of size *ν* is *ν*/*σ*, so summing over all possible clones we have *D*_{H1} = ⟨Σ_{i} (*ν*_{i}/*σ*)^{H}⟩, where the angle brackets denote an average over the stochastic clone sizes. In the *Appendix* we use the expression for the distribution of *ν* from Equation 10, and take the *B* → ∞ limit, to find […].

More generally, we can consider the probability that *H* individuals coalesce into *K* in a given fitness class, with *h*_{1} individuals coalescing into lineage 1, *h*_{2} individuals coalescing into lineage 2, and so on, up to *h*_{K} individuals coalescing into lineage *K* (note that *h*_{1} + *h*_{2} + ⋯ + *h*_{K} = *H*). In the *Appendix*, we show that this probability is […]. This is the probability not only that *H* individuals coalesced into *K* lineages, but that they did so in a specific configuration {*h*_{j}}. For example, if we have four individuals coalescing into two, this could occur by three of them coalescing into one and the other lineage not coalescing, or alternatively by two pairwise coalescence events. These different topologies affect some aspects of molecular evolution such as the polymorphism frequency spectrum. To compute these quantities, we must work with the full coalescence probabilities in Equation 15.

However, the specific coalescence configurations do not affect non-topology-related quantities such as the total branch length, the time to the most recent common ancestor, or any statistics that depend on these quantities (*e.g.*, the total number of segregating sites). These quantities depend only on *H* and *K*. Thus it is useful to sum the probabilities of all possible configurations {*h*_{j}} that lead to a particular *K*. We call this total probability of *H* individuals coalescing to *K* lineages *D*_{HK}. We have […], where {*h*_{j}} is constrained to values such that *h*_{1} + *h*_{2} + ⋯ + *h*_{K} = *H*.

To compute *D*_{HK}, we first make the definition […] *f*(*H*, *K*), […]. In the *Appendix*, we show that […] *f*(*H*, *K*) using a simple contour integral, […] *f*(*H*, *K*) for arbitrary *H* and *K* by noting that […] *D*_{HK}. To give a few examples, we find […] *f*(*H*, *K*) and evaluate any arbitrary *D*_{HK}. We note that in the large-*H* limit, one can directly obtain *f*(*H*, *K*) using saddlepoint evaluation of the contour integral defined above.
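The distinction drawn above between a specific configuration {*h*_{j}} and the coarser outcome “*H* individuals coalesce into *K* lineages” can be made concrete by enumerating the configurations. The sketch below lists them as non-increasing tuples, so lineage labels are treated as interchangeable; that convention, and the function name, are illustrative assumptions rather than the labeling used in the Appendix.

```python
def configurations(H, K, max_part=None):
    """Yield every split of H individuals among K ancestral lineages.

    Each configuration is a non-increasing tuple (h_1, ..., h_K) with
    h_j >= 1 and h_1 + ... + h_K = H.
    """
    if max_part is None:
        max_part = H
    if K == 0:
        if H == 0:
            yield ()
        return
    # The first part must leave at least one individual for each remaining lineage.
    for first in range(min(max_part, H - (K - 1)), 0, -1):
        for rest in configurations(H - first, K - 1, first):
            yield (first,) + rest


for config in configurations(4, 2):
    print(config)
```

For four individuals coalescing into two lineages this prints the two cases described in the text: a three-way merger plus an untouched lineage, and two pairwise mergers. Summing configuration probabilities over such a list is what collapses the full description to a single probability per (*H*, *K*) pair.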

Note that the case of rapid adaptation, for which clonal interference is pervasive, corresponds to the case where *q* is reasonably large (conversely, *q* = 1 corresponds to sequential selective sweeps, and our analysis does not apply in this limit). In the large-*q* regime, *D*_{21} is small. In neutral coalescent theory, the probability of a three-way coalescence event would then be even smaller, of order *D*_{21}^{2}. Here, by contrast, *D*_{31} ∼ *D*_{21}, so “multiple-merger” coalescence events are not uncommon. This is a signature of the fact that occasionally a fitness class is dominated by a single large clone, as described above. When this happens, that clone dominates the structure of genealogies, as any ancestral lineages we trace through the fitness distribution are very likely to have originated from this single large lineage, and hence coalesce within this fitness class. Although these anomalously large clones are rare, they are sufficiently common that they are responsible for a significant fraction of the total coalescence events, and they are responsible for the tendency of genealogies to take on a more “star-like” shape.
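To see why a dominant clone makes multiple mergers common, the sketch below evaluates the same-clone probability Σ_{i} (*ν*_{i}/*σ*)^{H} for fixed clone sizes within a single class, taking *σ* to be the total size of the class (an assumption) and sampling with replacement. The clone sizes are hypothetical, and no average over the stochastic size distribution is attempted.

```python
def same_clone_probability(clone_sizes, H):
    """Probability that H individuals drawn with replacement from one class
    all belong to the same clone: sum_i (nu_i / sigma)**H."""
    sigma = sum(clone_sizes)
    return sum((nu / sigma) ** H for nu in clone_sizes)


equal_clones = [1.0] * 10             # ten clones of identical size
dominated = [90.0] + [1.0] * 10       # one clone holds 90% of the class

for label, sizes in (("equal", equal_clones), ("dominated", dominated)):
    d21 = same_clone_probability(sizes, 2)
    d31 = same_clone_probability(sizes, 3)
    print(f"{label:9s}  pairwise={d21:.5f}  three-way={d31:.5f}  pairwise^2={d21 ** 2:.5f}")
```

With equal clone sizes the three-way probability equals the square of the pairwise one, as in a neutral population, whereas in the dominated class the two probabilities are of the same order.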

## Genealogies and Patterns of Genetic Variation

From the results above for the probabilities of all possible coalescence events in each fitness class, we can calculate the probability of any genealogy relating an arbitrary set of sampled individuals. From these genealogies, we can in turn calculate the probability distribution of any statistic describing the expected patterns of genetic diversity in the sample.

We begin by neglecting neutral mutations and calculating the structure of genealogies in fitness-class space. That is, we consider individuals sampled from some set of fitness classes. We trace their ancestries backward in time as they “advance” from one fitness class to the next, via mutational events, and calculate the probability that they coalesce in a particular set of earlier-established classes. Since each step in the fitness-class coalescent tree corresponds to a beneficial mutation, this immediately gives us the pattern of genetic diversity at the positively selected sites. We later consider how these fitness-class genealogies correspond to genealogies in real time and use this to derive the expected patterns of diversity at linked neutral sites.

### The distribution of heterozygosity at positively selected sites

We first describe the simplest possible case, a sample of two individuals. If we sample two individuals at random from the population, the first comes from class *k*_{1} and the second from class *k*_{2} with probability […] *π*_{b}, will be (*k*_{1} − ℓ) + (*k*_{2} − ℓ) = *k*_{1} + *k*_{2} − 2ℓ.

We can now calculate the average *π*_{b} given *k*_{1} and *k*_{2} by noting that […] *k* coalesce within that class (in which case they have *π*_{b} = 0), we have […] *k*_{1} and *k*_{2} to find the overall average. Since we saw in Equation 2 that *k*_{1} and *k*_{2} are approximately normally distributed with variance […] *q*, the second term (corresponding to heterozygosity between individuals sampled from the same class) is approximately 2*q*, while the first term is approximately […] *Ns* the factor […]

We can use a similar approach to compute the full probability distribution of *π*_{b}. We have […] *k*_{1} and *k*_{2} to get the unconditional distribution of *π*_{b}. In Figure 5, A and B, we compare these theoretical predictions for the overall distribution of pairwise heterozygosity with the results of full forward-time Wright–Fisher simulations, for two representative parameter combinations. We see that the distribution of heterozygosity has a nonzero peak and that the agreement with simulations is generally good.
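The pairwise argument can be checked numerically with a minimal sketch. It assumes only what the text states: the fitter lineage first steps back to the other lineage's class, and from then on the pair coalesces in each class with the same probability *D*_{21}. The names `sample_pi_b`, `mean_pi_b`, and `d21`, and the numerical value 0.2, are illustrative choices rather than quantities taken from the article.

```python
import random

def sample_pi_b(k1, k2, d21, rng):
    """Draw one pairwise heterozygosity at positively selected sites.

    k1, k2: fitness classes of the two sampled individuals.
    d21: assumed per-class pairwise coalescence probability (must be > 0).
    """
    # The fitter lineage steps back to the other's class; each step is a
    # mutation carried by only one of the two individuals.
    pi_b = abs(k1 - k2)
    # In a shared class the pair either coalesces there, or both lineages
    # step back one class, which adds two more differences.
    while rng.random() >= d21:
        pi_b += 2
    return pi_b

def mean_pi_b(k1, k2, d21):
    """Mean of the geometric process above."""
    return abs(k1 - k2) + 2.0 * (1.0 - d21) / d21

rng = random.Random(1)
draws = [sample_pi_b(3, 1, 0.2, rng) for _ in range(100_000)]
print(sum(draws) / len(draws), mean_pi_b(3, 1, 0.2))
```

The simulated mean and the closed form should agree to within sampling error. A histogram of `draws` gives the conditional distribution of *π*_{b} for the chosen pair of classes; averaging over classes drawn from the fitness distribution would give the unconditional one.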

We emphasize that our results for *P*(*π*_{b}) describe the *ensemble* distribution of heterozygosity. That is, if we picked a single pair of individuals from each of many *independent populations*, this is the distribution of *π*_{b} one would expect to see. It is *not* the population distribution: if we were to pick many pairs of individuals from the same population, the *π*_{b} of these pairs would not be independent because much of the coalescence within individual populations occurs in rare classes that are dominated by a single lineage for which *D*_{21} is much higher than its average value. Thus if we measured the average *π*_{b} within each population by taking many samples from it, the distribution of this […]

### Statistics in larger samples

We can compute the average and distribution of statistics describing larger samples in a fashion analogous to that used for pairs. For example, consider the total number of segregating positively selected sites among a sample of three individuals, which we call *S*_{3b}. These three individuals are sampled (in order) from classes *k*_{1}, *k*_{2}, and *k*_{3}, respectively, with probability […] *k*, by conditioning on the coalescence possibilities within class *k* we find that the average total number of segregating positively selected sites is […] 〈*S*_{kkk}〉, we find […] *k*_{1}, *k*_{2}, and *k*_{3} using the properties of differences of Gaussian random variables, as above. Alternatively, as in samples of size two, in large populations we can make the rough approximation that all sampled individuals come from the mean fitness class. Analogous calculations can be used to find the average number of segregating positively selected sites in still larger samples.

In Figure 6 we illustrate some of these predictions (in practice generated from coalescent simulations; see below) for samples of size 2, 3, and 10, and compare these to the results of forward-time Wright–Fisher simulations. We note that the agreement is generally good.

We can apply similar thinking to describe the distribution of the total number of segregating selected sites. First consider this distribution for a sample of size 3, all of which happen to be sampled from the same fitness class *k*, *S*_{kkk}. We have […] *z*^{γ} and sum over *γ* to pass to generating functions, […] *H* individuals all chosen from the *same* fitness class *k*, which we call *S*_{H}, has the distribution […] *H* individuals chosen at random from arbitrary fitness classes. The general case becomes quite unwieldy to compute analytically, because we must average over all fitness classes in which internal coalescence events can occur. Computing these averages for the case of a sample of size 3, we find that the generating function for the distribution of the total number of segregating positively selected sites among a sample of three individuals sampled from classes *k*_{1}, *k*_{2}, and *k*_{3} is given by […]

Analogous expressions can be computed for larger samples, but these involve ever more complex combinatorics. One may also wish to compute other statistics describing genetic variation in larger samples, such as the allele frequency spectrum. While in principle it is possible to calculate analytic expressions for any such statistic using methods similar to those described above, in practice it is easier to use our fitness-class coalescent probabilities to implement coalescent simulations, and then use these simulations to compute any quantity of interest. We describe these coalescent simulations in a later section. Alternatively, for large populations we can make use of the rough approximation that all individuals are always sampled from the mean fitness class; we explore some consequences of this approximation further in a section below.

### Time in generations and neutral diversity

Thus far we have focused on the fitness-class structure of genealogies and the genetic variation at positively selected sites. We now describe the correspondence between our fitness-class coalescent genealogy and the genealogy as measured in actual generations. Fortunately, this correspondence is extremely simple: each clonal lineage was originally created by mutations when that fitness class was at the nose of the fitness distribution. Thus if we define the current mean fitness to be class *k* = 0, the current nose class will be at approximately *k* = *q*, and some arbitrary class *k* will have been created at the nose approximately […] ([…] *et al.* 2012).

The simple approximation of neglecting the variations in time of establishment of the fitness classes allows us to make a straightforward deterministic correspondence between the fitness-class coalescent genealogy and the coalescence times. We can then compute the expected patterns of genetic diversity at linked neutral sites: the number of neutral mutations on a genealogical branch of length *T* generations is Poisson distributed with mean *U*_{n}*T*. From this we can compute the distribution of statistics describing neutral variation (*e.g.*, the neutral heterozygosity *π*_{n} or total number of neutral segregating sites in a sample *S*_{n}) from the corresponding statistics describing the variation at the positively selected sites. We illustrate these theoretical predictions for the distribution of neutral heterozygosity *π*_{n} in Figure 5, C and D, and compare these predictions to the results of full forward-time Wright–Fisher simulations. In Figure 6 we also show our predictions (generated using the coalescent simulations described below) for the mean number of segregating neutral sites in samples of size 2, 3, and 10, compared to the results of forward-time Wright–Fisher simulations. We note that the agreement is good across the parameter regime we consider, although there are some systematic deviations for smaller values of *U*_{b}/*s* where our approximations are expected to be less accurate.
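The mapping from fitness classes to neutral diversity can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used for the figures. It assumes that the nose advances by one class every `gens_per_step` generations (a hypothetical parameter standing in for the establishment time), so that a class *k* sat at the nose (*q* − *k*) steps ago. It also assumes that two lineages coalescing within a class share an ancestor dating from when that class was at the nose.

```python
import random

def poisson(mean, rng):
    """Poisson draw: count unit-rate exponential arrivals before `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def class_age(k, q, gens_per_step):
    """Generations since class k was at the nose (assumed constant advance
    of one class per gens_per_step generations, nose currently at class q)."""
    return (q - k) * gens_per_step

def neutral_heterozygosity(t_coal, u_n, rng):
    """Neutral differences between two individuals whose lineages coalesce
    t_coal generations ago; the two branches have total length 2 * t_coal."""
    return poisson(2.0 * u_n * t_coal, rng)

rng = random.Random(0)
# Two individuals from the mean class (k = 0) whose lineages coalesce in class -3.
t = class_age(-3, q=5, gens_per_step=100.0)
print(t, neutral_heterozygosity(t, u_n=1e-3, rng=rng))
```

Repeating the last two lines over many simulated fitness-class genealogies would give an estimate of the distribution of *π*_{n}.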

### Time to the most recent common ancestor

Thus far we have considered the coalescence events at each mutational step separately: this is necessary to describe the full structure of genealogies. However, another important quantity of interest is the time to the most recent common ancestor—*i.e.*, the coalescence time of the entire sample. We begin by considering this time measured in mutational steps, and then describe how this relates to the coalescence time measured in generations.

We can derive relatively simple expressions for the number of mutational steps to coalescence of an entire sample by directly calculating the probability of coalescence events over several steps at once. To do so, we note that since the dynamics at each mutational step are identical, the generating function […] *i* that occurred ℓ mutational steps ago, can be derived iteratively. Equation 10 gives the generating function for ℓ = 1. Then, since any of the *B* possible further mutations on the […] *H* individuals sampled from the same fitness class, *J*(*H*). The cumulative distribution of *J* is given by […] *F*(*H*, ℓ) using methods identical to those used to calculate the fitness-class coalescence probabilities above and find […] *J*(*H*) more directly from the fitness-class coalescence probabilities in a single step, by conditioning on the coalescence events that can happen in the first step in a similar way to that we used to compute 〈*π*_{b}〉 and 〈*S*_{3b}〉.

In the large-*q* limit, the ratios of these coalescence times (measured in mutational steps) in samples of different sizes are independent of *q*: […] *Discussion*. For large *H* we find […] (54), which is also in agreement with the Bolthausen–Sznitman coalescent (Goldschmidt and Martin 2005). These results suggest that there is a *q*-independent limiting process: we discuss this briefly below. We also note that the distribution of times to coalescence for large *H* is quite different from that in the neutral case: the between-population variation in *J*(*H*)/〈*J*(2)〉 is only of order unity, compared to its mean of log log *H*. In contrast, for the neutral coalescent, the time to the last common ancestor of the whole population has a mean of 2〈*J*(2)〉 and random variations of the same order.
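For reference, the corresponding ratios in the Bolthausen–Sznitman coalescent itself can be computed exactly with a short recursion. The sketch below uses the standard merger rates of that coalescent: with *b* lineages present, some set of *k* of them merges at total rate *b*/[*k*(*k* − 1)], so the total rate is *b* − 1. Time is measured in units of the mean pairwise coalescence time. The function name `mean_tmrca` is an illustrative choice, and the recursion describes the limiting process, not the finite-*q* expressions of this section.

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def mean_tmrca(b):
    """Expected time to the most recent common ancestor of b lineages in the
    Bolthausen-Sznitman coalescent (a pair coalesces at rate 1)."""
    if b == 1:
        return Fraction(0)
    total_rate = b - 1  # sum over k = 2..b of b / (k (k - 1))
    expected = Fraction(1, total_rate)  # mean wait until the next merger
    for k in range(2, b + 1):
        # Total rate at which some k of the b lineages merge into one.
        rate_k = Fraction(b, k * (k - 1))
        expected += rate_k / total_rate * mean_tmrca(b - k + 1)
    return expected

for h in (2, 3, 4, 10):
    print(h, mean_tmrca(h) / mean_tmrca(2))
```

For samples of size 3 and 4 the recursion gives ratios of 5/4 and 25/18, respectively.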

As with other aspects of genealogical structures, it is straightforward to convert these expressions for the coalescence times measured in mutational steps to the time in generations to the most recent common ancestor of a sample, *T*_{MRCA}(*H*). Specifically, *J* = ℓ corresponds to the case where the most recent ancestor occurs ℓ mutational steps ago, so if the sampled individuals were from class *k* the time to the most recent common ancestor is […] *τ*_{nm} is the characteristic time scale of the coalescent, as claimed above.

Thus far we have considered the most recent common ancestor of *H* individuals all sampled from the same fitness class *k*. However, in general we sample individuals from a variety of different classes. In this case, we must sum over all possible internal coalescence events, until we reach a state where all remaining ancestral lineages are together in the same fitness class. This quickly becomes unwieldy in larger samples. In practice, it is easier to compute times to the most recent common ancestor in these cases using coalescent simulations based on our fitness-class coalescent approach, which we describe below.

As with other statistics described above, however, there is a simple approximation that is asymptotically correct for large populations: we can simply assume that all individuals are sampled from the mean fitness class. This approximation relies on the fact that most individuals sampled randomly from the population will have fitnesses close to the mean: within of order […] *τ*_{nm}. As this is the time scale on which typical coalescent events take place, treating all the individuals as if they were in the dominant fitness class is a reasonable rough approximation. In this approximation, the results for the times to the most recent common ancestor for samples of size *H* can be simply obtained from the single-fitness-class results above. We find […]

### The frequency of individual mutations

An alternative way to compute many of the coalescent properties is to consider the fraction of the population with a particular mutation, which is closely related to the site-frequency spectrum. The frequency of a given mutation at a particular site is determined by when that mutation occurred relative to others in its fitness class. But its frequency at later times is also strongly affected by whether (and when) later mutations occur in its genetic background at each subsequent mutational step.

We first consider how the frequency of a particular mutation changes with time due to successive mutations in its lineage. If at one time the mutation has frequency *g* in the nose population, then a time […] (*i.e.*, after ℓ further steps have occurred), it will have some frequency, *f*, in the current nose population. The probability density of *f* can be found by comparing the statistics of the relative number of descendants, *ν*, of the fraction *g* of the initial nose population that has the mutation in question with the relative number of descendants […] *g*, of the initial nose population that does not have the mutation in question. Specifically, *f* = *ν*/*σ*. By definition, the conditional probability density of *f* is given by […] *ν* and *ω* enforces the *δ*-function, and *η*_{ℓ} = (1 − 1/*q*)^{ℓ} as defined earlier. The integral can be done straightforwardly to obtain […] *H* individuals coalescing ℓ steps in the past and hence the variances in the coalescent times of *H* individuals can be computed.

To compute the distribution of the fraction of the current nose class that are descendants of a particular mutation that occurred ℓ steps in the past, we can simply set […] *B* ≫ 1, we obtain […] *f*^{H}. Summing over all *B* sites and using the standard integrals of powers of *f* and 1 − *f* expressed in terms of gamma functions, we obtain immediately the same result we had found above: […]

In the limit of large *q*, the exponent *η* that parameterizes the time difference, […] *q*: only the nose-to-mean time that it takes for the new mutants to dominate the population matters. In this limit, a single mutational step occurs in a time that is a very small fraction, *ε* = 1/*q*, of the nose-to-mean time *τ*_{nm}. The conditional probability of going from *g* to *f* in this step is […] *f* − *g*, as one would expect in the limit of a small time step. But it also corresponds to a probability per unit time of a jump from *g* to *f* of […] *g*) or not containing the mutation (frequency 1 − *g*) increasing in size by a factor between 1 + *h* and 1 + *h* + *dh* with rate […] *ε* providing a small *h* cutoff). This corresponds to a continuous-time birth process in a subpopulation of (large) size *n* with rate per individual to give birth to *k* offspring, […]

### Coalescent simulations

We can use the fitness-class coalescence probabilities in Equation 15 to implement an algorithm for coalescent simulations along the lines of Gordo *et al.* (2002), using the structured coalescent framework of Hudson and Kaplan (1994). Specifically, to describe the diversity in a sample of *n* individuals, we first randomly sample their fitness classes independently from the distribution *ϕ*_{k}. We then start with the individual in the most-fit class and trace back its ancestry as it steps through successive classes within the fitness distribution. When that individual enters a class with other individuals, we use Equation 15 to determine the probabilities of all possible coalescence events in that class. We then continue to trace back the ancestry of the sample further through the distribution, allowing for coalescence events at each step according to the appropriate probabilities. We continue this procedure until all individuals have coalesced.

This simple coalescent algorithm produces a fitness-class coalescent tree drawn from the appropriate probability distribution of genealogies. We can then compute any statistic of interest describing this genealogy. By repeating this algorithm, we can obtain the probability distribution of the statistic. In practice this is a highly efficient procedure, since the coalescent simulations are extremely fast and the computational time required scales only with the size of the sample rather than the size of the population.
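The control flow of this algorithm can be sketched as follows. The coalescence probabilities of Equation 15 are not reproduced in this section, so the sketch substitutes a stand-in: lineages sharing a class are partitioned by a two-parameter Chinese restaurant process with discount 1 − 1/*q* and zero concentration. This choice is an assumption made for illustration; it allows multiple mergers and gives a pairwise coalescence probability of 1/*q*. The function names are hypothetical, and `crp_partition` should be replaced by a sampler for Equation 15 in any real use.

```python
import random
from collections import defaultdict

def crp_partition(lineages, alpha, rng):
    """Partition lineages with a Chinese restaurant process (discount alpha,
    concentration 0). ASSUMED stand-in for the Equation 15 probabilities."""
    blocks = []
    for n, lin in enumerate(lineages):
        if n == 0:
            blocks.append([lin])
            continue
        # Join block j with probability (len(block_j) - alpha) / n,
        # otherwise open a new block (probability len(blocks) * alpha / n).
        r = rng.random() * n
        for block in blocks:
            r -= len(block) - alpha
            if r < 0:
                block.append(lin)
                break
        else:
            blocks.append([lin])
    return blocks

def simulate_genealogy(sample_classes, q, rng):
    """Trace a sample back through fitness classes.

    Returns a list of (class, merged_lineages) coalescence events, where each
    lineage is the frozenset of sampled individuals descended from it.
    """
    alpha = 1.0 - 1.0 / q
    waiting = defaultdict(list)  # class -> sampled lineages not yet reached
    for i, k in enumerate(sample_classes):
        waiting[k].append(frozenset([i]))
    k = max(sample_classes)      # start from the most-fit sampled class
    active, events = [], []
    while len(active) > 1 or waiting:
        active += waiting.pop(k, [])  # individuals sampled from this class
        merged = []
        for block in crp_partition(active, alpha, rng):
            if len(block) > 1:
                events.append((k, block))
            merged.append(frozenset().union(*block))
        active = merged
        k -= 1                   # surviving lineages step back one class
    return events

rng = random.Random(2)
sigma = 1.5                      # hypothetical width of the class distribution
classes = [round(rng.gauss(0.0, sigma)) for _ in range(6)]
for k, block in simulate_genealogy(classes, q=5.0, rng=rng):
    print(k, [sorted(lineage) for lineage in block])
```

Each event records the class in which a merger occurred, so statistics such as *π*_{b} or the number of segregating selected sites can be read off by counting mutational steps along the branches.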

### Comparison to simulations

Our coalescent simulations represent an algorithmic implementation of our fitness-class coalescent, using all of the analytical expressions for the sampling and coalescence probabilities described above. Thus these coalescent simulations rely on all of the approximations underlying our method. To test the validity of these approximations and the accuracy of our fitness-class coalescent method, we compared the predictions of these coalescent simulations to full forward-time Wright–Fisher simulations of our model. These comparisons are illustrated in Figure 5 and Figure 6 and in Table 1.

We implemented our Wright–Fisher simulations assuming a population of constant size *N*, in which each generation consisted of a mutation and a selection step. In the mutation step, we independently chose the number of beneficial and neutral mutations within each extant genotype from the appropriate multinomial distribution. Each new mutation was assigned a unique index and all unique genotypes were tracked. In the selection step, we sampled *N* individuals with replacement from the previous generation, using a multinomial sampling weight adjusted for selective differences between individuals relative to the population mean fitness (Ewens 2004).
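A stripped-down version of one such generation is sketched below. It simplifies the procedure just described in several ways, all of which are assumptions made for brevity. It tracks only the fitness class of each individual (the number of beneficial mutations it carries), not neutral mutations or unique genotypes. It allows at most one new beneficial mutation per individual per generation. It uses sampling weights exp[*s*(*k* − 〈*k*〉)], which is one possible way to weight by fitness relative to the mean.

```python
import math
import random

def wright_fisher_step(classes, s, u_b, rng):
    """One generation: a mutation step followed by a selection step.

    classes: list giving the number of beneficial mutations in each individual.
    s, u_b: selective effect and beneficial mutation rate (illustrative values).
    """
    # Mutation step (simplified: at most one new mutation per individual).
    mutated = [k + 1 if rng.random() < u_b else k for k in classes]
    # Selection step: resample N individuals with replacement, weighted by
    # fitness relative to the population mean (assumed exponential form).
    mean_k = sum(mutated) / len(mutated)
    weights = [math.exp(s * (k - mean_k)) for k in mutated]
    return rng.choices(mutated, weights=weights, k=len(mutated))

rng = random.Random(4)
population = [0] * 200
for _ in range(200):
    population = wright_fisher_step(population, s=0.05, u_b=0.01, rng=rng)
print(sum(population) / len(population), max(population))
```

The printed values are the mean and maximum fitness class after 200 generations; their difference is a rough empirical analog of the lead *q*.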

## Discussion

We have developed a fitness-class coalescent method to calculate how positive selection on many linked sites alters the structure of genealogies. This has allowed us to calculate how clonal interference shapes the patterns of genetic diversity in rapidly adapting populations. Our approach moves away from the traditional method of calculating the structure of genealogies in real time. Rather, we treat each mutational step from one fitness class to the next as an “effective generation” and trace how a sample of individuals descended by mutations through these fitness classes. In each effective generation we calculated the total probability of all possible coalescence events, Equation 15. This allows us to calculate the structure of genealogies in this fitness-class space, which corresponds directly to the genetic diversity at positively selected sites. We then converted this fitness-class coalescent to the genealogy in real time to calculate the expected patterns of neutral diversity.

We have shown that we can use this approach to compute analytic expressions for the distributions of several simple statistics describing patterns of molecular evolution. However, it is often easiest to compute expected patterns of variation using backward-time coalescent simulations, which explicitly implement the fitness-class coalescent algorithm using the distribution of the fraction of the population in each fitness class *ϕ*_{k} and the coalescence probabilities in Equation 15 to simulate genealogies. These coalescent simulations are extremely efficient, and in practice it is usually faster to run millions of these backward-time simulations than it is to numerically evaluate the sums over fitness classes involved in the corresponding exact analytic expressions. These coalescent simulations also have the advantage of being very similar in spirit to structured coalescent simulations that describe the effects of purifying selection (see, *e.g.*, Gordo *et al.* 2002 and Seger *et al.* 2010), so they can in principle be used for parameter estimation and inference in analogous ways.

Our analysis throughout this article is very similar in spirit to the fitness-class coalescent method we previously used to describe how purifying selection at many linked sites alters the structure of genealogies and patterns of molecular evolution (Desai *et al.* 2012; Walczak *et al.* 2012). However, there are two important technical differences. First, in the case of purifying selection, fluctuations in the frequencies of each fitness class *ϕ _{k}* due to genetic drift can be substantial in certain parameter regimes. These fluctuations are particularly important near the nose of the distribution, where they can lead to effects such as Muller’s ratchet. Although individuals are unlikely to be sampled from this nose, they are very likely to coalesce there. Neglecting these fluctuations was therefore an important approximation that substantially restricted the regime of validity of our analysis. By contrast, in the case of positive selection, fluctuations in the sizes of each fitness class are negligible (except at the nose) across a broad range of relevant parameter values. Furthermore, fluctuations at the nose are much less important for patterns of diversity than in the case of purifying selection, because individuals are unlikely to either be sampled there or to coalesce there. This reflects a fundamental difference between the neutral and purifying selection processes and the rapid adaptation dynamics analyzed here. For the former, genetic drift plays a key role in driving the fluctuations, while for the latter, genetic drift is almost irrelevant: the fluctuations are dominated by the stochasticity in the timings of the beneficial mutations that occur near the nose of the fitness distribution.

A second key simplification of our analysis of positive selection, compared to the purifying selection case, is that the clonal structure of each fitness class becomes effectively “frozen” once that class is no longer at the nose of the fitness distribution. This means that coalescence probabilities are identical in all fitness classes, which stands in contrast to the case of purifying selection, where the clonal structure within all classes is constantly changing. This also avoids the need to carefully analyze the timing and order of mutation events in the history of a sample and simplifies the mapping between our fitness-class coalescent genealogy and the genealogy measured in real time.

Our results demonstrate how positive selection on many linked sites distorts the structure of genealogies away from neutral expectations. We show several examples of these selected genealogies, for various different parameter values, in Figure 7. The most striking qualitative conclusion of our analysis is that multiple merger events, where several ancestral lineages coalesce into one in a single effective generation, occur with comparable probabilities to pairwise coalescence events. We note that these events are multiple mergers within a single effective generation in our fitness-class coalescent and hence are not actually multiple mergers within a single real generation. However, these events happen very close together in real time compared to the other relevant time scales, so they appear as effectively instantaneous. This leads to a more “starlike” shape of genealogical trees. This signature is characteristic of the action of positive selection; our analysis here illustrates how starlike we expect genealogies to be (and how many deeper coalescence events are preserved) given the interplay between interference and hitchhiking effects characteristic of this rapid adaptation regime. It may prove useful in future work to analyze this specific situation in the context of more general models of the coalescent with multiple mergers (Pitman 1999).

We note that the characteristic time scale of the coalescence is the nose-to-mean time, *τ*_{nm}, which is the time that the collection of new mutants at the nose takes to dominate the population. In units of this time, trees for different values of *q* become statistically similar for large *q*. One striking feature, which occurs roughly once each *τ*_{nm}, is the coalescence of a substantial fraction of all the (remaining) lineages at a single time step: this is caused by one new beneficial mutation occurring so much earlier than typical that its descendants represent a substantial fraction of the population in the nose. Examples of this can be seen in Figure 7. Another perhaps-surprising feature of the genealogies in large samples is that some aspects are *less* variable from one population to another than neutral coalescent trees, while other aspects are more variable. In the recent past, for times much shorter than the mean coalescence time of pairs of individuals, neutral coalescent trees tend to be rather similar, while the multiple-coalescence events that characterize the positively selected genealogies cause larger variations between populations. In contrast, the time to the last common ancestor of large samples is broadly distributed for neutral trees but narrowly distributed (at least asymptotically) for positively selected trees.

Because individuals are unlikely to be sampled from near the nose of the distribution, the initial coalescence events in the history of the sample are typically in the bulk of the fitness distribution. Since these coalescence events happened well in the past when these classes were at the nose of the distribution, the terminal branches in the genealogies of a sample are likely to be longer compared to internal branches than we would expect under neutrality. In other words, recent branches of genealogies are longer relative to more ancient branches. This effect is qualitatively similar to the situation in which effective population size declines as time recedes into the past: this has long been recognized as a general signature of the effects of both purifying and positive selection. It leads to an excess of singleton mutations in the site-frequency spectrum and the negative values of Tajima’s *D* that we have observed. However, clonal interference mitigates these effects relative to a hard selective sweep.

Our results also demonstrate that even when beneficial mutations are rare compared to neutral mutations, *U*_{b} ≪ *U*_{n}, positively selected sites can still contribute a significant fraction of the total genetic variation observed in a population. For example, in a sample of two individuals the total heterozygosity at positively selected sites is typically several times *q*. The typical neutral heterozygosity, on the other hand, is of order *π*_{n} ∼ *U*_{n}*τ*_{nm}. Thus even when *U*_{n} ≫ *U*_{b}, *π*_{b} is often comparable to or even greater than *π*_{n}. This is consistent with the general observation in microbial evolution experiments that a substantial fraction of observed mutations are beneficial (Gresham *et al.* 2008; Kao and Sherlock 2008; Barrick and Lenski 2009; Barrick *et al.* 2009). The fact that positively selected sites can be a significant fraction of the polymorphisms emphasizes the importance of understanding the patterns of diversity at these sites, which have distinct patterns compared to linked neutral variation and hence may provide important signatures in sequence data of adaptation that involves clonal interference.

Our predictions for the structure of the fitness-class genealogies depend on the population size, mutation rate, and strength of selection only through the combinations log[*Ns*] and log[*U*_{b}/*s*]. The time scales in generations are also proportional to the inverse of the strength of selection. Thus the patterns of genetic variation in an adapting population depend only very weakly (logarithmically) on population size and mutation rate in the large-*q* regime, where clonal interference is pervasive, suggesting that there is limited power to infer these parameters from patterns of molecular evolution. This is a consequence of the fact that the evolutionary dynamics are also only very weakly dependent on these parameters in the clonal interference regime.
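The weakness of this logarithmic dependence is easy to see numerically. The sketch below assumes the leading-order expression *q* ≈ 2 log[*Ns*]/log[*s*/*U*_{b}] from Desai and Fisher (2007), which is not restated in this section. The function name `lead_q` and the parameter values are illustrative.

```python
import math

def lead_q(n, s, u_b):
    """Leading-order lead of the nose, q ~ 2 ln(N s) / ln(s / U_b).
    Assumed form (Desai and Fisher 2007); valid only when s >> U_b."""
    return 2.0 * math.log(n * s) / math.log(s / u_b)

for n in (1e6, 1e9, 1e12):
    print(f"N = {n:.0e}: q = {lead_q(n, 0.01, 1e-5):.2f}")
```

With *s* = 0.01 and *U*_{b} = 10^{−5}, a millionfold increase in *N* (from 10^{6} to 10^{12}) raises *q* only from about 2.7 to about 6.7 under this expression.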

We have seen that in the large-*q* limit of our model, the ratios of the number of mutational steps to the most recent common ancestors in samples of different sizes are exactly equivalent to those expected in the Bolthausen–Sznitman coalescent (Bolthausen and Sznitman 1998). This is identical to the limiting behavior of these ratios in several very different models of selection recently studied by Brunet, Derrida, and others (Brunet *et al.* 2006, 2007, 2008a; Berestycki *et al.* 2013; Brunet and Derrida 2012); see Derrida and Brunet (2013) for a recent review. The reason for this equivalence between very different models remains unclear, but it suggests a degree of universality: an interesting topic for future work. We emphasize, however, that the times to the most recent common ancestors in our model reduce to the Bolthausen–Sznitman ratios only when measured in mutational steps and only when all individuals are sampled from the same fitness class. The ratios of times to the most recent common ancestors, measured in generations, have a different form. Nevertheless, in the limit of very large *q*, almost all the individuals will have fitness much closer to the mean than to the nose. As the rate of coalescence is proportional to the difference between the mean and the nose, the approximation of sampling only from the largest fitness class is asymptotically good. The modifications of the Bolthausen–Sznitman ratios are then simply determined by adding the nose-to-mean time (which turns out to be equal to the mean pairwise correlation time) to all the coalescent times.

Our analysis in this article has focused on the simplest possible model of positive selection on a large number of linked sites, and we have neglected many potential complications. For example, we have assumed that epistatic interactions between mutations can be neglected and that the total potential supply of beneficial mutations is not significantly depleted over the course of adaptation. This is consistent with our focus on rapidly adapting populations in the large-*q* clonal interference regime. As a population approaches a fitness peak, these approximations will likely fail and the dynamics of adaptation and patterns of genetic variation may either become more complex or return to the regime where further adaptation is driven by isolated selective sweeps. We have also focused exclusively on beneficial mutations that all have the same fitness effect *s* and have neglected both deleterious mutations and beneficial mutations that confer different fitness effects. This is justified by earlier work by us and others that suggests that in rapidly adapting populations, clonal interference ensures that evolution is dominated by beneficial mutations that confer a specific fitness advantage (Fogle *et al.* 2008; Rouzine *et al.* 2008; Good *et al.* 2012). However, we have recently analyzed the evolutionary dynamics within a population in a model that explicitly allows for a distribution of fitness effects of beneficial mutations (Good *et al.* 2012). We and others have also analyzed the case where a mix of both beneficial and deleterious mutations is possible (Rouzine *et al.* 2003, 2008; Goyal *et al.* 2012). Those works describe the variation in fitness within populations in these more complex models and hence could form the basis for a more complex version of the fitness-class coalescent method we have used here. This generalized fitness-class coalescent would admit the possibility of mutational steps of various different sizes and toward both lower and higher fitness.

An alternative approach by one of us allows for beneficial mutations to have a variety of different effects, without making reference to fitness classes (Fisher 2013). As long as the distribution of fitness effects of potential beneficial mutations falls off faster than a simple exponential for large *s*, the dynamics in large populations are dominated by mutations with *s* close to some value, […] ([…] *et al.* 2012; Fisher 2013). In this case, most properties of the dynamics on time scales longer than the nose-to-mean time *τ*_{nm} are quite universal (and more strongly so when […] *τ*_{nm} is also the time scale of the coalescence, this suggests that the coalescent statistics should also be universal. The continuous-time results quoted above for the evolution of the frequency of a subpopulation emerge naturally in this more general analysis and indeed correspond to the universal limit of asymptotically large populations (Fisher 2013). In the alternative regime where the distribution of fitness effects of potential beneficial mutations falls off more slowly than exponentially, mutations can jump from the bulk of the distribution to the lead. These play an important role in the dynamics and cause *q* to remain small even for asymptotically large populations (Desai and Fisher 2007). The behavior is then less universal, but this situation is likely to be relevant in real populations, especially in the initial stages of adaptation to a new environment. The effects of the distribution of fitness effects of beneficial mutations, of initial transient dynamics, and of large numbers of deleterious mutations are interesting topics for future research.

The final simplification of our analysis is its focus on purely asexual populations: we have neglected the effects of recombination. Thus our results are primarily applicable to interpreting the patterns of genetic variation in asexual microbial evolution experiments, although they may also be relevant to sexual organisms on short genomic distance scales within which recombination is rare on the relevant time scales. We note, however, that our results provide an essential ingredient for predicting the effects of infrequent recombination on the evolutionary dynamics. Specifically, we can use our predictions for the genetic variation between a pair of individuals sampled from the population to predict the distribution of fitnesses of recombinant offspring resulting from sex between these individuals. This in turn determines how rare recombination alters the evolutionary dynamics and the distribution of fitnesses within the population. It may then prove possible to calculate how these shifts in evolutionary dynamics alter the patterns of genetic diversity in the population. These extensions of our approach to analyze the effects of recombination on both evolutionary dynamics and patterns of molecular evolution are an important direction for future research.

## Acknowledgments

We thank Katya Kosheleva, Richard Neher, Boris Shraiman, Thierry Mora, Lauren Nicolaisen, Benjamin Good, Elizabeth Jerison, and John Wakeley for many useful discussions. M.M.D. acknowledges support from the James S. McDonnell Foundation, the Alfred P. Sloan Foundation, and the Harvard Milton Fund. D.S.F. acknowledges support from the National Science Foundation via DMS-1120699.

## Appendix: Coalescence Probabilities

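In this appendix, we carry out the calculations of coalescence probabilities in detail. Consider *H* individuals who coalesce into *K* lineages, with *h*_{1} individuals coalescing into lineage 1, *h*_{2} individuals coalescing into lineage 2, and so on, up to *h*_{K} individuals coalescing into lineage *K*. We note that

$$\sum_{j=1}^{K} h_j = H.$$

We first assume that the *H* individuals coalesce into *K* lineages at a *specific* set of *K* sites (out of the total of *B*) in the genome: call these sites 1–*K* in the genome, for concreteness. We also assume for now that the *H* individuals coalesce in a *specific* way into these *K* lineages (*i.e.*, individual 3 coalesces into the lineage at site 5, etc.). We denote the frequency of the lineage at site *j* in the genome by *f*_{j}, so that

$$f_j = \frac{\nu_j}{\sum_{i=1}^{B} \nu_i},$$

and we call *A* the probability that the *H* individuals coalesce into the *K* lineages at these specific sites according to the specific configuration {*h*_{j}}.

Given these definitions, we have

$$A = \left\langle \prod_{j=1}^{K} f_j^{h_j} \right\rangle .$$

We make use of the identity

$$\frac{1}{\sigma^{H}} = \frac{1}{\Gamma(H)} \int_0^\infty dy \, y^{H-1} e^{-y\sigma},$$

where we define *σ* as the sum of the *ν*_{j}, and separate out the *ν*_{j} that correspond to the lineages we are considering. Note that the *ν*_{j} are independent of each other. Thus one obtains

$$A = \frac{1}{\Gamma(H)} \int_0^\infty dy \, y^{H-1} \left[ \prod_{j=1}^{K} \left\langle \nu_j^{h_j} e^{-y\nu_j} \right\rangle \right] \prod_{j=K+1}^{B} \left\langle e^{-y\nu_j} \right\rangle .$$

Taking *B* large, so that (*B* − *K*)/*B* ≈ 1, the product over the remaining *B* − *K* sites is well approximated by the product over all *B* sites, which is $\langle e^{-y\sigma} \rangle$. In the same large-*B* approximation, the averages over the *K* selected lineages can be evaluated explicitly, and the *dy* integral then yields a Γ function.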
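As a hedged illustration of how these steps close, the following worked sketch assumes (this is an assumption introduced here for concreteness, not a formula restated from this appendix) that the *ν*_{j} are independent and identically distributed with $\langle e^{-z\nu_j} \rangle = \exp(-z^{\alpha}/B)$, where $\alpha = 1 - 1/q$. The symbol $\alpha$ is shorthand introduced only for this sketch. Under that assumption, and keeping only the leading order in $1/B$, the integral can be done in closed form:

```latex
\begin{align}
\left\langle e^{-y\sigma} \right\rangle
  &= \left\langle e^{-y\nu_j} \right\rangle^{B} = e^{-y^{\alpha}}, \\
\left\langle \nu_j^{h} e^{-y\nu_j} \right\rangle
  &= (-1)^{h} \frac{d^{h}}{dy^{h}} e^{-y^{\alpha}/B}
  \approx \frac{\alpha \, \Gamma(h-\alpha)}{B \, \Gamma(1-\alpha)} \, y^{\alpha-h}
  \qquad (h \ge 1), \\
A &\approx \frac{1}{\Gamma(H)}
  \left[ \frac{\alpha}{B \, \Gamma(1-\alpha)} \right]^{K}
  \prod_{j=1}^{K} \Gamma(h_j-\alpha)
  \int_0^\infty dy \, y^{K\alpha-1} e^{-y^{\alpha}} \\
  &= \frac{\alpha^{K-1} \, \Gamma(K)}{B^{K} \, \Gamma(H) \, \Gamma(1-\alpha)^{K}}
  \prod_{j=1}^{K} \Gamma(h_j-\alpha).
\end{align}
```

The power of $y$ in the integrand follows from $\sum_j h_j = H$, and the substitution $u = y^{\alpha}$ turns the remaining integral into $\Gamma(K)/\alpha$. As a consistency check under the same assumption, setting $H = 2$ and $K = 1$ and multiplying by the $B$ possible sites gives $\Gamma(2-\alpha)/\Gamma(1-\alpha) = 1 - \alpha = 1/q$ for the probability that a pair of individuals coalesces.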

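So far we have considered the probability of this coalescence event involving *K* lineages at a specific set of *K* sites on the genome. We now want to sum over all the possible sets of *K* sites on the genome at which this could occur. In the large-*B* limit, there are a total of *B*^{K}/*K*! of these. We define *E* to be the probability of this coalescence event involving *K* lineages at *any* set of *K* sites on the genome. We have

$$E = \frac{B^{K}}{K!} A .$$

Now so far we have assumed that specific individuals coalesce into specific lineages. But given a set {*h*_{j}} there are a total of

$$\frac{H!}{\prod_{j=1}^{K} h_j!}$$

ways of assigning the individuals to the lineages. Multiplying *E* by this factor gives the probability of *H* individuals coalescing into *K* lineages in a specific configuration {*h*_{j}}.

To compute *D*_{HK}, we must sum this probability over all configurations {*h*_{j}} that satisfy the constraint that the *h*_{j} add up to *H*. We first define *f*(*H*, *K*) as the constrained sum over configurations that appears in *D*_{HK}. The constraint makes *f*(*H*, *K*) awkward to evaluate directly. However, we can define its generating function as a power series in an auxiliary variable, with the sum over *H* starting at *H* = 0, where we define *f*(*H*, *K*) = 0 for *H* < *K*. Substituting the definition of *f*(*H*, *K*) into the generating function removes the constraint, so that the sums over the *h*_{j} are now independent and the generating function factorizes into a product over the *K* lineages. Recognizing the Taylor series in each factor, we obtain the generating function in closed form, from which *f*(*H*, *K*) and hence *D*_{HK} follow. Setting *K* = 1 recovers the result for *D*_{H1} quoted in Equation 14.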
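The generating-function argument above can be checked numerically. The sketch below is illustrative only: it assumes (an assumption made here, not a formula restated from this appendix) that the lineage sizes have $\langle e^{-z\nu_j} \rangle = \exp(-z^{\alpha}/B)$ with $\alpha = 1 - 1/q$ and $q > 1$. Under that assumption the sum over configurations reduces to $D_{HK} = \frac{H}{K\alpha}\,[z^{H}]\,\bigl(1 - (1-z)^{\alpha}\bigr)^{K}$, where $[z^{H}]$ denotes the coefficient of $z^{H}$. All function names are hypothetical.

```python
import math


def series_one_minus_power(alpha, n_max):
    """Taylor coefficients w_0..w_n_max of w(z) = 1 - (1 - z)**alpha."""
    coeffs = [0.0]
    c = 1.0  # coefficient of z**0 in (1 - z)**alpha
    for n in range(1, n_max + 1):
        c *= (n - 1 - alpha) / n  # ratio of successive binomial-series terms
        coeffs.append(-c)
    return coeffs


def truncated_product(a, b, n_max):
    """Multiply two power series, keeping terms up to z**n_max."""
    out = [0.0] * (n_max + 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            if i + j > n_max:
                break
            out[i + j] += a_i * b_j
    return out


def merger_probabilities(H, q):
    """Return {K: D_HK} for K = 1..H.

    ASSUMPTION (not from the source): alpha = 1 - 1/q with q > 1, and
    D_HK = H / (K * alpha) * [z**H] (1 - (1 - z)**alpha)**K.
    """
    alpha = 1.0 - 1.0 / q
    w = series_one_minus_power(alpha, H)
    power = [1.0] + [0.0] * H  # w(z)**0
    probs = {}
    for K in range(1, H + 1):
        power = truncated_product(power, w, H)  # now holds w(z)**K
        probs[K] = H / (K * alpha) * power[H]
    return probs


def single_lineage_probability(H, q):
    """Closed form for K = 1 under the same assumption."""
    alpha = 1.0 - 1.0 / q
    return math.gamma(H - alpha) / (math.gamma(H) * math.gamma(1.0 - alpha))


if __name__ == "__main__":
    for K, p in merger_probabilities(H=4, q=5.0).items():
        print(K, round(p, 6))
```

Two properties are worth checking on any such sketch: the probabilities over *K* = 1, ..., *H* should sum to 1 (the case *K* = *H* corresponds to no coalescence among the sampled individuals), and the *K* = 1 entry should agree with the single-lineage closed form. Truncated power-series multiplication is used instead of enumerating configurations {*h*_{j}} because the number of configurations grows rapidly with *H*.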

## Footnotes

*Communicating editor: W. Stephan*

- Received October 24, 2012.
- Accepted November 26, 2012.

- Copyright © 2013 by the Genetics Society of America