## Abstract

Mounting evidence suggests that natural populations can harbor extensive fitness diversity with numerous genomic loci under selection. It is also known that genealogical trees for populations under selection are quantifiably different from those expected under neutral evolution and described statistically by Kingman’s coalescent. While differences in the statistical structure of genealogies have long been used as a test for the presence of selection, the full extent of the information that they contain has not been exploited. Here we demonstrate that the shape of the reconstructed genealogical tree for a moderately large number of random genomic samples taken from a fitness diverse, but otherwise unstructured, asexual population can be used to predict the relative fitness of individuals within the sample. To achieve this we define a heuristic algorithm, which we test *in silico*, using simulations of a Wright–Fisher model for a realistic range of mutation rates and selection strength. Our inferred fitness ranking is based on a linear discriminator that identifies rapidly coalescing lineages in the reconstructed tree. Inferred fitness ranking correlates strongly with actual fitness, with a genome in the top 10% ranked being in the top 20% fittest with false discovery rate of 0.1–0.3, depending on the mutation/selection parameters. The ranking also enables us to predict the genotypes that future populations inherit from the present one. While the inference accuracy increases monotonically with sample size, samples of 200 nearly saturate the performance. We propose that our approach can be used for inferring relative fitness of genomes obtained in single-cell sequencing of tumors and in monitoring viral outbreaks.

MOST mutations are believed to have minimal effects on the fitness of the organism and much of the analysis of the genomic data on populations (see Excoffier and Heckel 2006 for a review of methods) has been based on the neutral hypothesis, according to which the dynamics of genetic polymorphisms and the overall genetic diversity of the population are governed by the neutral *drift*, *i.e.*, stochastic fluctuations in allele frequency arising from the intrinsic stochasticity in offspring number. The neutral model assumes that deleterious mutations are eliminated by selection fast enough to not significantly contribute to population diversity and beneficial mutations are rare enough to produce only occasional adaptive *sweeps*, where the population is taken over by the offspring of the adaptive genotype, transiently suppressing neutral genetic diversity. Statistical properties of genealogies generated by neutral dynamics in asexual populations are understood in great detail (Hein *et al.* 2005; Wakeley 2008) in terms of Kingman’s *coalescent* process (Kingman 1982), which follows the ancestors of the present population back in time as far as the *most recent common ancestor* (MRCA). The neutral coalescent (Hein *et al.* 2005; Wakeley 2008) forms the basis for estimating mutation and recombination rates and provides the null hypothesis in tests for the presence of selection (Tajima 1989; Fu and Li 1993).

Yet, as advances in sequencing have made it possible to obtain quantitative data on genetic diversity, numerous studies have reached the conclusion that nonneutral polymorphisms are ubiquitous in populations across the spectrum of life: from viruses (Coffin *et al.* 1995; Novella *et al.* 1995; Moya *et al.* 2004; Neher and Leitner 2010; Batorsky *et al.* 2011) and bacteria (Barrick *et al.* 2009) to flies (Sella *et al.* 2009) to mitochondria (Seger *et al.* 2010) and cells in cancerous tumors (Merlo *et al.* 2006). In addition, laboratory evolution experiments in bacteria (Lenski *et al.* 1991; Miralles *et al.* 1999) and yeast (Kao and Sherlock 2008; Lang *et al.* 2011) have demonstrated directly that large asexual populations contain numerous subclones that are continuously generated by mutation and compete for fixation. Thus, large asexual populations cannot be assumed selectively neutral.

The presence of selection affects the shape of genealogical trees, often giving them an asymmetric and “comb-like” appearance that is strikingly different from that of the neutral trees generated by Kingman’s coalescent (Hein *et al.* 2005; Wakeley 2008; Seger *et al.* 2010; Trevor *et al.* 2011). An example of such “genealogical anomalies”—*i.e.*, large deviations from neutral genealogical structure (Maia *et al.* 2004)—is provided by the recent study (Seger *et al.* 2010) of mitochondrial diversity in three distinct populations of whale lice, *Cyamus ovalis*, where the authors demonstrate that the observed genealogies are statistically consistent with a nonneutral model with frequent mutations of small selective effect.

Our analysis is based on a similar model of asexual evolutionary dynamics driven by small deleterious and beneficial mutations. In Figure 1 we show schematically a sample of continuous genealogy for a fixed-size population governed by Wright–Fisher dynamics (Hein *et al.* 2005; Wakeley 2008), incorporating genetic drift, mutation, and natural selection. The example in Figure 1 covers the period over which the offspring of one of the genomes (Figure 1, top) spread over the whole population (Figure 1, bottom). We ask, given a sample of genomes from the “present time” population (Figure 1, red circles), can one predict the genetic future of the population? Or, more specifically, can one identify, within the present sample, the closest relatives of the future population, *i.e.*, individuals that are on, or closest to, the genealogical backbone of the future population? Since long-term survival is correlated with fitness, this task is closely related to the problem of identifying the fitter fraction of the present-day sample.

Here, we demonstrate that the anomalous structure of the genealogical tree reconstructed for a sample of genomes can serve not only as the evidence of selection, but also as the basis for inferring the relative fitness ranking of sampled individuals and their proximity in sequence space to the fittest genomes. Information pertinent to this inference is contained in the pattern of coalescence for different lineages: in a nutshell, lineages that undergo several coalescence events much before others are relatively fit, while the less fit lineages do not merge with the rest (going backward in time) until later. Below we provide the simulation-based evidence supporting this scenario.

Our study builds upon considerable recent progress in the theoretical understanding of natural selection and drift dynamics in fitness-diverse asexual populations (Tsimring *et al.* 1996; Rouzine *et al.* 2003, 2008; Desai and Fisher 2007; O’Fallon *et al.* 2010; Sniegowski and Gerrish 2010; Good *et al.* 2012; Goyal *et al.* 2012; Walczak *et al.* 2012) and the emerging description of corresponding genealogies (Bolthausen and Sznitman 1998; Brunet *et al.* 2007; Berestycki 2009; O’Fallon *et al.* 2010; Seger *et al.* 2010; Desai *et al.* 2013; Walczak *et al.* 2012; Neher and Hallatschek 2013; Neher 2013). We focus on the asexual case and address how this approach might be extended to the analysis of recombining populations in the *Discussion*.

We focus on the regime where numerous beneficial or deleterious mutations segregate simultaneously and the population is formed by many clones with different fitness values. In this regime, sometimes referred to as clonal interference (Miralles *et al.* 1999; Desai and Fisher 2007), competition between clones and the linkage between mutations play a key role in evolutionary dynamics. This regime is realized in large populations with high mutation rates. Precise conditions depend on the distribution of fitness effects of mutations and have been discussed in many recent articles (Rouzine *et al.* 2003, 2008; Desai and Fisher 2007; Brunet *et al.* 2008; Sniegowski and Gerrish 2010). For example, in the case where only beneficial mutations are present, the condition for being in the interference regime is given by *Nμ*_{b} > 1/log(*Ns*), where *μ*_{b} is the beneficial mutation rate (Desai and Fisher 2007; Brunet *et al.* 2008; Rouzine *et al.* 2008). This is basically the condition that new beneficial mutations get established in the population at a rate faster than they can “sweep” the population (see supporting information, File S1, for additional discussion). In the case of purifying selection where only deleterious mutations are present, it can be shown (Rouzine *et al.* 2003, 2008; Walczak *et al.* 2012) that the required condition is where *μ*_{d} is the deleterious mutation rate, *s* is the deleterious effect of mutations, and *N* is the population size.

Quite generally, when the population is formed by several clones with different fitness values, the fate of any new mutation depends not only on its own selective effect, but also on the fitness of the genotype on which it occurs (Good *et al.* 2012). As a result, the MRCA of such a fitness-diverse population is with high probability among the very fittest of its generation (O’Fallon *et al.* 2010). In return, the pattern of genealogical coalescence is controlled by the time it takes for surviving lineages to converge, as they are tracked back in time, on the leading edge of the fitness distribution at previous times.

This article is organized as follows. After formulating the model, we provide examples of genealogies, illustrating their anomalous shape compared to the neutral coalescent, and demonstrate the correlation between the ancestral *weight*, defined as the fraction of the present-day sample constituted by the descendants of the ancestor, and the mean fitness of the those descendants. We then define a fitness-ranking score based on the suitably integrated ancestral weights along the reconstructed lineage of each individual in the sample. Applying the ranking to numerous samples (for populations with the same and with different mutation/selection parameters) and comparing each realization to the true fitness known from the forward simulation, we demonstrate the ability of the proposed algorithm to infer the relative fitness of sampled genomes and to identify genotypes that are likely to survive into the future. The *Discussion* addresses possible applications and generalizations of the proposed inference method.

## Model and Methods

### Model of evolutionary dynamics

Consider an asexual population of size *N* that evolves with nonoverlapping generations under the influx of deleterious and beneficial mutations. New mutations arise at the rate *μ* + *μ*_{0} (per genome per generation) with a fraction *εμ* being beneficial, (1 − *ε*)*μ* deleterious, and the remainder *μ*_{0} being neutral. For simplicity we assume both beneficial and deleterious mutations to have the same effect *s* ≪ 1 and to change the fitness of individual *i* carrying that mutation additively: *F _{i}* →

*F*±

_{i}*s*. As in the Wright–Fisher model, natural selection acts by biasing the probability of an individual genome to appear in the next generation, which is taken to be proportional to exp(

*f*) with being the individual fitness relative to the mean fitness of the population which in general is a function of time.

_{i}We carried out 10^{3} simulations of 2 × 10^{5} generations for several plausible parameter combinations in the range of *μ* = 10^{−4}−10^{−2}, *s* = 10^{−3}−10^{−2}, with *ε* taking values 0, 0.1, and 1, and *μ*_{0} = 10*μ* and *N* = 64,000. In File S1, we study the degree of clonal diversity and interference for the set of parameters that we have simulated and show that it explores a broad range in the clonal interference regime.

The genealogical trees were constructed in two ways. We recorded the genealogies in the course of the forward simulation, providing exact ancestries of any sample in the population. In addition, an inferred genealogy of random samples (between 30 and 500 genomes) was constructed using standard neighbor-joining/UPGMA-derived methods (Durbin 1998) is detailed in File S1. In File S1, we present the performance of the tree reconstruction method for different parameter values and show that it satisfactorily reconstructs the genealogical trees. For higher mutation rates (*e.g.*, *μ* = 5 × 10^{−3} and *μ* = 10^{−2}) where there are tens to hundreds of differences between a typical pair of genomes, even setting the neutral mutation rate equal to *μ* would be sufficient for an accurate reconstruction of the trees.

#### Fitness distribution and distortion in the shape of genealogical trees

In the parameter range considered, simulated populations exhibit substantial fitness diversity with fitness variance in the order of arising from ∼10−10^{3} simultaneously segregating nonneutral polymorphisms. Figure 2, A and B, shows examples of the population-wide fitness distribution for two different mutation rates (see File S1 for additional examples). In general, genetic diversity in the population is an increasing function of *μ*/*s*. For the highest mutation rate and lowest selection coefficients considered, *μ* = 10^{−2} and *s* = 10^{−3}, the population exhibits extensive genetic diversity and is formed by many small clones (Figure 2B), whereas for the lower mutation rates, as in Figure 2A, the population typically includes larger clones.

Figure 2, D and E, shows typical examples of genealogical trees constructed for random samples of size *n* = 30 drawn from the populations corresponding to Figure 2, A and B, respectively. The fitness of sampled genomes, which we know from the forward simulation, is visualized using color. Also shown are ancestral weights along some of the lineages. This weight, *w _{i}*, is defined as the number of genomes in the present time sample that are direct descendants of lineage

*i*. For example, each leaf at the bottom has weight

*w =*1, while the lineage at the root has the full weight of the sample

*n*= 30. For the sake of comparison, we also show a typical genealogical tree for a neutrally evolving population in Figure 2F.

One immediately notes two well-known differences distinguishing Figure 2D and 2E from Figure 2F. Genealogies from fitness-diverse populations (i) have long terminal legs and are compressed toward the MRCA root of the tree and (ii) exhibit strong asymmetry of branching. These anomalies are quantified in Figure 3. Figure 3A presents distributions of pairwise coalescent times in the population, *τ _{ij}*, for {

*i*,

*j*} genome pairs for several parameter sets. In Kingman’s coalescent,

*τ*has an exponential distribution (with mean

_{ij}*N*) (Hein

*et al.*2005; Wakeley 2008) and most lineages in a genealogical tree coalesce at early times (looking backward). In contrast, the bulk of coalescence in a population under selection is significantly delayed compared to the total coalescent time—an effect corresponding to the comb-like appearance of the trees.

The asymmetry of branching is quantified in Figure 3B, which presents the distribution of weights at the level just below the root, where there are only two ancestral lineages left in the tree. The strong bias toward extreme values of *w* in populations under selection is contrasted with *w*-independent distribution predicted and observed in the neutral case (see File S1 for additional characteristics that quantify differences between the shapes of trees).

## Results

### Correlation between ancestral weight and offspring fitness

Let us consider the whole population and trace the surviving lineages back in time, identifying all ancestors of the present-day population *t* generations in the past. Figure 4A shows the distribution of the ancestral fitness (relative to the mean for that generation) at several time points in the past. This distribution becomes progressively shifted toward higher fitness values compared to the distribution for the whole population (O’Fallon *et al.* 2010). In the limit of large times, this distribution converges to the nonextinction probability as a function of the fitness in the ancestral population (Neher *et al.* 2010; O’Fallon *et al.* 2010; Neher and Hallatschek 2013).

Let us consider the time in the past when there are still a large number of ancestors (*e.g.*, ∼10^{3} in the population of *N* = 64,000, which under conditions corresponding to simulations in Figure 4A occurs at *t* ≃ 100). Figure 4B shows the scatter plot of the weight of ancestors *vs.* their fitness advantage. Note that, by collapsing the points on the fitness axis, one gets the histogram shown in Figure 4A. We observe a strong positive correlation between the weight and the fitness of an ancestor. Higher-fitness individuals in the past generations are not only more likely to survive, but, conditioned on survival, they also leave more offspring. Thus the weight of the ancestor, which can be determined from a reconstructed genealogical tree, can be used as a proxy for ancestral fitness: a quantity that one does not expect to know directly, except in the case of computer simulations. In File S1 we provide plots of average ancestral fitness conditioned on its weight for various time points and parameter sets and confirm that the positive correlation between the weight and the fitness of ancestors holds quite generally. This correlation decreases as the time shifts farther into the past.

Next, we examine the correlation between the weight of an ancestor and the fitness of its surviving progeny. Consider a sample of genomes with size *n* and the corresponding genealogical tree. One expects genomes that are derived from relatively high-fitness ancestors to belong to higher-fitness classes at the present time. Since ancestral fitness correlates with weight, we expect higher-weight ancestors to produce, on average, higher-fitness descendants. To see this, let us consider an ancestor *i*, with weight *w _{i}*, that existed some

*t*generations in the past. We examine the fitness of the

*w*offspring in the sample descending from that ancestor. In particular, we focus on the mean, and the variance, over the

_{i}*w*offspring (subscript

_{i}*d*refers to descendants). Let us denote the average of these quantities over random samples of genomes and over population replicas by and Note that and depend on the time

*t*, namely, how far back in the genealogy one is considering.

In Figure 4, C and D, we show and at two different time points in the past for samples of size *n* = 100 (see File S1 for other parameter sets). In both cases, the mean fitness of the derived genomes is an increasing function of the weight of their ancestor. Consider a time close to the root of a tree such that a lineage can have a weight that is a significant portion of the sample size (*e.g.*, right plot in Figure 4C). As expected, the value of for such high-weight ancestors is close to zero (remember that *f _{i}* was defined relative to the population mean, so that the average of

*f*over the whole sample is zero). At the same time for ancestors with

_{i}*w*approaching

*n*. Interestingly, for the lineages that still have a small weight late in the coalescence process, the value of is clearly negative.

High-fitness genomes typically merge first in a tree and form high-weight ancestors. To make this point clear, consider the distribution of the pairwise coalescent time, *τ _{ij}*, shown in Figure 3A. Averaging

*τ*over all {

_{ij}*i*,

*j*} pairs of genomes in a population gives the mean coalescent time

*T*

_{2}. Now, consider the average of

*τ*conditioned on the fitness of the two genomes and denote it by

_{ij}*t*

_{2}(

*f*,

_{i}*f*). Figure 4D shows a heat map of

_{j}*t*

_{2}(

*f*,

_{i}*f*)/

_{j}*T*

_{2}. For two genomes both with high fitness, the average coalescent time is <

*T*

_{2}. The reason is that such genomes are likely to be relatively recent lineages emanating from the “nose” of the distribution. In other words, the chance of sampling identical or similar sequences is greater for fitter samples than for less fit samples, since fitter samples have shorter average pairwise coalescent time. This observation is the key to the proposed fitness inference method.

### Relative fitness inference based on the reconstructed genealogy

Above we have reviewed different ways in which the shape of the genealogical trees for populations under selection differs from the expectation of neutral theory. We have also demonstrated the correlation between ancestral weights and the fitness of the descendants. We showed that sampled genomes that belong to high-fitness classes typically have shorter coalescent time compared to unfit genomes. We now show that this insight can be converted into a method for inferring relative fitness of genomes within the sample.

To that end, let us consider a randomly chosen set of *n* genomes from a population and use standard phylogenetic tree-building methods (see File S1) to approximately reconstruct the genealogy of the sample. The accuracy of the reconstructed genealogy compared to the actual genealogy, known exactly from the forward simulation of population dynamics, is discussed in File S1. It increases with the neutral mutation rate *μ*_{0}: in the biologically plausible regime of *μ*_{0}/*μ* ≈ 10 considered here, it proves more than adequate to enable meaningful inference.

Next, based on the reconstructed tree, we associate with each leaf *i* = 1, … , *n* a *fitness-proxy score* (FPS), *φ _{i}*, defined by its lineage within the tree. Specifically, we define

*φ*as a linear discriminator in the form (1)where {

_{i}*a*(

_{k}*i*)} is the lineage of genome

*i*, starting with the genome itself as

*a*

_{0}(

*i*) and running the length,

*m*, of the lineage (

_{i}*i.e.*, the number of nodes) until the root of the tree. When the ancestral lineage

*a*

_{k}_{−1}(

*i*) with weight coalesces at internal node

*k*, it forms a new ancestral lineage

*a*(

_{k}*i*) with weight (see File S1 for an illustration of this notation on the example of a particular tree). The time of formation of the corresponding internal node is denoted by The parameter

*T*

_{2}is the estimate of the average pairwise coalescent time, obtained from the sampled genomes. Finally, Θ(

*x*) is a “soft step” function (a.k.a. Fermi function): Θ(

*x*) = (1 + exp(

*β*(

*x*/

*x*

_{∗}−1)))

^{−1}parameterized by the position of the step

*x*

_{∗}and its characteristic width

*β*. If

*β*≫ 1, the function Θ(

*x*) changes abruptly from one to zero as

*x*becomes >

*x*

_{∗}, so that where

*a*

_{∗}is the oldest ancestor in the lineage with For

*β*∼ 1 the FPS is defined by a weighted sum of ancestral weights (see File S1 for details).

The logic behind our heuristic choice of the specific form of *φ _{i}* is to exploit the correlation between the offspring fitness and ancestral weights. Note that, at least on the high-fitness/high-weight end of the distribution, this correlation decreases as

*t*becomes large compared to

_{a}*T*

_{2}. The reason for this is that for long times in the past, even the lineages originating from high-fitness ancestors spread all over the fitness distribution at the present time. Hence, we choose

*x*

_{∗}< 1: specifically the results below were obtained with

*x*

_{∗}= 0.5 and

*β*= 5, but in File S1 we examine the performance of the ranking algorithm as a function of the parameters and demonstrate that nearly optimal performance for the present form of the FPS is achieved for a broad range of

*x*

_{∗}and

*β*. Critically, normalization of

*t*to the characteristic time of coalescence for the sample,

_{a}*T*

_{2}, eliminates the need to know the evolutionary parameters of the population, such as

*μ*or

*N*.

We rank genomes according to their *φ _{i}* score and compare this ranking with the actual fitness of each genome. In addition to inferring relative fitness, it is useful to know how genetically close a genome with a given rank is to the fittest one in the sample. Hence, for each genome we define

*d*as the average of its Hamming distance to the fittest 10% of genomes in the sample. Figure 5, A and B, shows the results of the ranking for two

_{i}*n*= 200 samples from the populations that already appeared in Figure 2, A and B. We observe a correlation between FPS ranking and the actual fitness in general and the tendency (quantified below) for the fittest genomes of the sample to show in the top ranks. In addition, high-ranked genomes that do not belong to high-fitness classes still have small

*d*values, indicating that they are genetically close to the fittest genotypes. In other words, even if a high-ranked genome is not fit, typically it has only recently branched off from a fit clone and, compared to a randomly chosen unfit sequence, shares greater sequence similarity with fittest genotypes.

_{i}The above observations are confirmed and quantified by repeating and averaging the analysis for 8000 independent population samples and different sets of parameters. Specifically, Figure 6 shows the fitness distribution of the top-ranked genomes for the two parameter sets used in Figure 5. The results clearly indicate that the top-ranked genomes tend to be among fitter genotypes in the population. In addition, Figure 7A shows mean fitness conditional on the FPS ranking and Figure 7B shows the mean rank conditional on actual fitness (normalized by *σ*) for two different values of *μ*. Figure 7C shows mean distance from the fittest conditional on the FPS ranking (for four different values of *μ*), with distance normalized to Δ_{10%} defined as the average *d _{i}* among the fittest 10%. Remarkably, we observe that

*d*/Δ

_{10%}for the highest-ranked genomes approaches one, indicating good convergence, in the sense of Hamming distance, of the top-ranked genomes to the fittest set. Further analysis of the algorithm’s performance, as well as additional parameter sets including the case of purifying selection (

*ε*= 0), can be found in File S1.

As already mentioned, we are interested in the set of evolutionary parameters for which many mutations segregate simultaneously and the population is formed by numerous clones with different fitness values. The opposing limit, which occurs for small population size, *N*, or mutation rate, *μ*, corresponds to the regime of selective sweeps/successive mutations. In this latter regime, the population is typically formed by only a few clones and the fitness diversity is relatively low. Moreover, for smaller values of the parameters *N* and *μ*, the inference of genealogical trees becomes less accurate as the genetic diversity between sampled sequences decreases. Therefore, we expect the performance of the algorithm to deteriorate for small population size, *N*, or mutation rate, *μ*. In File S1, we show that for smaller values of the quantity θ = *N**μ*, particularly for θ < 1, the performance of the fitness inference algorithm deteriorates. We also show that the performance of the algorithm deteriorates as the fitness diversity in the population, represented by *σ*/*s*, decreases. Note that the quantity *σ*/*s* provides a measure for the number of different fitness classes in the population.

As we see in Figure 5A, high-ranked genomes that do not belong to fittest classes still tend to have small genetic distance to fittest individuals (also note in Figure 2, D and E, the genomes with blue color located close to the mostly orange/red clusters on the right side of the trees). This is because the Hamming distance is dominated by neutral mutations *μ*_{0} ≫ *μ* and is less susceptible to fluctuations compared to fitness, which is defined by a much smaller number of nonneutral mutations. To the extent that genetic relatedness is defined by the distance, the latter is essential for identifying within the sample the closest relatives of future populations. Taking advantage of ready accessibility of evolutionary future within our simulations, we have directly tested the ability of our approach to identify, within the sample, the genotypes that are closer to those of future populations. For each sampled genome, we define as the average of its Hamming distance to all of the genomes in the current population that are ancestors of the population in a generation about one genetic turnover time in the future (we know these ancestors from the forward simulation). We choose this turnover time to be the first time in the future when <1% of individuals from the current population have any descendant left. In each case we normalized the distances by defined as the average of the smallest 10% of values of Figure 7D shows conditional on the FPS ranking. We again observe that