## Abstract

While bacteria divide clonally, horizontal gene transfer followed by homologous recombination is now recognized as an important contributor to their evolution. However, the details of how the competition between clonality and recombination shapes genome diversity remains poorly understood. Using a computational model, we find two principal regimes in bacterial evolution and identify two composite parameters that dictate the evolutionary fate of bacterial species. In the divergent regime, characterized by either a low recombination frequency or strict barriers to recombination, cohesion due to recombination is not sufficient to overcome the mutational drift. As a consequence, the divergence between pairs of genomes in the population steadily increases in the course of their evolution. The species lacks genetic coherence with sexually isolated clonal subpopulations continuously formed and dissolved. In contrast, in the metastable regime, characterized by a high recombination frequency combined with low barriers to recombination, genomes continuously recombine with the rest of the population. The population remains genetically cohesive and temporally stable. Notably, the transition between these two regimes can be affected by relatively small changes in evolutionary parameters. Using the Multi Locus Sequence Typing (MLST) data, we classify a number of bacterial species to be either the divergent or the metastable type. Generalizations of our framework to include selection, ecologically structured populations, and horizontal gene transfer of nonhomologous regions are discussed as well.

Bacterial genomes are extremely variable, comprising both a consensus “core” genome, which is present in the majority of strains in a population, and an “auxiliary” genome, comprising genes that are shared by some but not all strains (Medini *et al.* 2005; Tettelin *et al.* 2005; Hogg *et al.* 2007; Lapierre and Gogarten 2009; Touchon *et al.* 2009; Dixit *et al.* 2015; Marttinen *et al.* 2015).

Multiple factors shape the diversification of the core genome. For example, point mutations generate single-nucleotide polymorphisms (SNPs) within the population that are passed on from mother to daughter. At the same time, stochastic elimination of lineages leads to fixation of polymorphisms, which effectively reduces population diversity. The balance between point mutations and fixation determines the average number of genetic differences between pairs of individuals in a population, often denoted by θ.

During the last two decades, exchange of genetic fragments between closely related organisms has also been recognized as a significant factor in bacterial evolution (Guttman and Dykhuizen 1994; Milkman 1997; Falush *et al.* 2001; Thomas and Nielsen 2005; Studier *et al.* 2009; Touchon *et al.* 2009; Vos and Didelot 2009; Dixit *et al.* 2015). Transferred fragments are integrated into the recipient chromosome via homologous recombination. Notably, recombination between pairs of strains is limited by the divergence in transferred regions. The probability of successful recombination of foreign DNA into a recipient genome decays exponentially with *δ*, the local divergence between the donor DNA fragment and the corresponding DNA on the recipient chromosome (Vulić *et al.* 1997; Majewski 2001; Thomas and Nielsen 2005; Fraser *et al.* 2007; Polz *et al.* 2013). Segments with divergence *δ* greater than divergence have negligible probability of successful recombination. In this work, we refer to the divergence as the transfer efficiency. is shaped at least in part by the restriction modification (RM), the mismatch repair (MMR) systems, and the biophysical mechanisms of homologous recombination (Vulić *et al.* 1997; Majewski 2001). The transfer efficiency imposes an effective limit on the divergence among subpopulations that can successfully exchange genetic material with each other (Vulić *et al.* 1997; Majewski 2001).

In this work, we develop an evolutionary theoretical framework that allows us to study in broad detail the nature of competition between recombinations and point mutations across a range of evolutionary parameters. We identify two composite parameters that govern how genomes diverge from each other over time. Each of the two parameters corresponds to a competition between vertical inheritance of polymorphisms and their horizontal exchange via homologous recombination.

First is the competition between the recombination rate ρ and the mutation rate μ. Within a coevolving population, consider a pair of strains diverging from each other. The average time between consecutive recombination events affecting any given small genomic region is where is the average length of transferred regions. The total divergence accumulated in this region due to mutations in either of the two genomes is If the pair of genomes is likely to become sexually isolated from each other in this region within the time that separates two successive recombination events. In contrast, if frequent recombination events would delay sexual isolation resulting in a more homogeneous population. Second is the competition between the population diversity θ and If one expects spontaneous fragmentation of the entire population into several transient sexually isolated subpopulations that rarely exchange genetic material between each other. In contrast, if unhindered exchange of genetic fragments may result in a single cohesive population.

Using computational models, we show that the two composite parameters identified above, and determine qualitative evolutionary dynamics of bacterial species. Furthermore, we identify two principal regimes of this dynamics. In the divergent regime, characterized by a high local genomic regions acquire multiple mutations between successive recombination events and rapidly isolate themselves from the rest of the population. The population remains mostly clonal where transient sexually isolated subpopulations are continuously formed and dissolved. In contrast, in the metastable regime, characterized by a low and a low ), local genomic regions recombine repeatedly before ultimately escaping the pull of recombination (hence the name metastable). At the population level, in this regime all genomes can exchange genes with each other resulting in a genetically cohesive and temporally stable population. Notably, our analysis suggests that only a small change in evolutionary parameters can have a substantial effect on evolutionary fate of bacterial genomes and populations.

We also show how to classify bacterial species using the conventional measure of the relative strength of recombination over mutations, (defined as the ratio of the number of SNPs brought by recombinations and those generated by point mutations in a pair of closely related strains), and our second composite parameter Based on our analysis of the existing Multi Locus Sequence Typing (MLST) data, we find that different real-life bacterial species belong to either divergent or metastable regimes. We discuss possible molecular mechanisms and evolutionary forces that decide the role of recombination in a species’ evolutionary fate. We also discuss possible extensions of our analysis to include adaptive evolution, effects of ecological niches, and genome modifications such as insertions, deletions, and inversions.

## Computational Models

We consider a population of coevolving bacterial strains. The population evolves with nonoverlapping generations and in each new generation each of the strains randomly chooses its parent (Gillespie 2010). As a result, the population remains constant over time. Strain genomes have length Individual base pairs acquire point mutations at a constant rate μ and recombination events are attempted at a constant rate ρ (see Figure 1A). The mutations and recombination events are assumed to have no fitness effects (later, we discuss how this assumption can be relaxed). The probability of a successful integration of a donor gene decays exponentially, with the local divergence *δ* between the donor and the recipient. Table 1 lists all important parameters in our model.

Unlike point mutations that occur anywhere on the genome, genomic segments involved in recombination events have a well-defined starting point and length. To understand the effect of these two factors, below we introduce three variants of a model of recombination with increasing complexity illustrated in Figure 1B. In the first and the only mathematically tractable model, we fix both length and start/end points of recombined segments. In the second model, recombined segments have a fixed length but variable starting/ending positions. Finally, in the most realistic third model, recombined segments have variable lengths [drawn from an exponential distribution with an average of 5000 bp (Dixit *et al.* 2015)] and variable starting/ending positions. *Prima facie*, these three models appear quite distinct from each other, potentially leading to divergent conclusions about the distribution of diversity on the genome. In particular, one might assume that the first model in which different segments recombine (and evolve) completely independently from each other would lead to significantly different evolutionary dynamics than the other two models. This assumption was not confirmed by our numerical simulations. Indeed, later in the manuscript we demonstrate (see Figure 7 below) that all three variants of the model have rather similar evolutionary dynamics. In what follows, we first present our mathematical description and simulations of the first model and then compare and contrast it to other two models.

The effective population sizes of real bacteria are usually large (Tenaillon *et al.* 2010). This prohibits simulations with realistic parameters wherein genomes of individual bacterial strains are explicitly represented. In what follows (recombination model 1) we overcome this limitation by employing an approach we had proposed earlier Dixit *et al.* (2015). It allows us to simulate the evolutionary dynamics of only two genomes (labeled *X* and *Y*), while representing the rest of the population using evolutionary theory (Dixit *et al.* 2015). *X* and *Y* start diverging from each other as identical twins at time (when their mother divides). We denote by the sequence divergence of the transferable unit (or gene) between *X* and *Y* at time *t* and by the genome-wide divergence.

Based on population–genetic and biophysical considerations, we derive the transition probability (*a* for after and *b* for before) that the divergence in any gene changes from to in one generation (Dixit *et al.* 2015). There are two components to the probability, *M* and *R*. Point mutations in either of two strains, represented by occur at a rate per base pair per generation and increase the divergence in a gene by Hence when (1)(2)We assume, without loss of generality, that recombination from donor strain *D* replaces a gene on strain *X*. Unlike point mutations, after a recombination, local divergence between *X* and *Y* can change suddenly, taking values either larger or smaller than the current divergence (see Figure 2 for an illustration) (Dixit *et al.* 2015). We have the probabilities (Dixit *et al.* 2015),(3)In Equation 3, is the normalization constant.

## Computational Analysis of the Simplified Model of Recombination

### Evolutionary dynamics of local divergence has large fluctuations

Figure 3 shows a typical stochastic evolutionary trajectory of the local divergence of a single gene in a pair of genomes simulated using We have used realistic values of and (Fraser *et al.* 2007; Dixit *et al.* 2015). Mutation and recombination rates (per generation) in real bacteria are extremely small (Dixit *et al.* 2015). To keep the simulation times manageable, mutation and recombination rates used in our simulations were orders of magnitude higher compared to those observed in real bacteria ( per base pair per generation and per base pair per generation, ) (Ochman *et al.* 1999; Wielgoss *et al.* 2011), while keeping the ratio of the rates realistic (Touchon *et al.* 2009; Vos and Didelot 2009; Didelot *et al.* 2012; Dixit *et al.* 2015). Alternatively, one may interpret it as one time step in our simulations being considerably longer than a single bacterial generation.

As seen in Figure 3, the time evolution of is noisy; mutational drift events that gradually increase the divergence linearly with time (red) are frequently interspersed with homologous recombination events (green if they increase and blue if they decrease it) that suddenly change the divergence to typical values seen in the population (see Equation 3). Eventually, either through the gradual mutational drift or a sudden recombination event, increases beyond the integration barrier set by the transfer efficiency, Beyond this point, this particular gene in our two strains splits into two different sexually isolated subclades. Any further recombination events in this region would be limited to their subclades and thus would not further change the divergence within this gene. At the same time, the mutational drift in this region will continue to drive the two strains further apart indefinitely.

In Figure 4, we plot the time evolution of and its ensemble average (as % difference). We have used and and , respectively. As seen in Figure 4, when is large in any local genomic region, multiple mutations are acquired between two successive recombination events. Consequently, individual genes escape the pull of recombination rapidly and increases roughly linearly with time at a rate For smaller values of the rate of change of in the long term decreases as many of the individual genes repeatedly recombine with the population. However, even then the fraction of genes that have escaped the integration barrier slowly increases over time, leading to a linear increase in with time albeit with a slope different from Thus, the repeated resetting of individual s after homologous recombination (see Figure 3) generally results in a that increases linearly with time.

At the shorter timescale, the trends in genome divergence are opposite to those at the longer timescale. At a fixed θ, a low value of implies faster divergence and vice versa. When recombination rate is high, genomes of strains quickly “equilibrate” with the population and the genome-wide average divergence between a pair of strains reaches the population average diversity (see the red trajectory in Figure 4). From here, any new mutations that increase the divergence are constantly wiped out through repeated recombination events with the population.

Computational algorithms that build phylogenetic trees from multiple sequence alignments often rely on the assumption that the sequence divergence faithfully represents the time that has elapsed since their Most Recent Common Ancestor (MRCA). However, Figure 3 and Figure 4 serve as a cautionary tale. Notably, after just a single recombination event, the local divergence at the level of individual genes does not at all reflect time elapsed since divergence but rather depends on statistics of divergence within a recombining population [see Dixit *et al.* (2015) for more details]. At the level of genomes, when is large (*e.g.*, the blue trajectory in Figure 4), the time since MRCA of two strains is directly correlated with the number of mutations that separate their genomes. In contrast, when is small (see pink and red trajectories in Figure 4), frequent recombination events repeatedly erase the memory of the clonal ancestry. Nonetheless, individual genomic regions slowly escape the pull of recombination at a fixed rate. Thus, the time since MRCA is reflected not in the total divergence between the two genomes but in the fraction of the length of the total genomes that has escaped the pull of recombination. One will have to use a very different rate of accumulation of divergence to estimate evolutionary time from genome-wide average divergence.

### Quantifying metastability

How does one quantify the metastable behavior described above? Figure 4 suggests that high rates of recombination prevent pairwise divergence from increasing beyond the typical population divergence at the whole-genome level. Thus, for any set of evolutionary parameters, μ, ρ, θ, and the time it takes for a pair of genomes to diverge far beyond the typical population diversity θ can serve as a quantifier for metastability.

In Figure 5, we plot the number of generations (in units of the effective population size ) required for the ensemble average of the genome-wide average divergence between a pair of genomes to exceed (twice the typical intrapopulation diversity) as a function of and Analyzing the ensemble average (represented by dashed lines in Figure 4) allows us to avoid the confounding effects of small fluctuations in the stochastic time evolution of around this average. Note that in the absence of recombination, it takes generations before exceeds The four cases explored in Figure 4 are marked with green diamonds in Figure 5.

We observe two distinct regimes in the behavior of as a function of and In the divergent regime, after a few recombination events, the divergence at the level of individual genes quickly escapes the integration barrier and increases indefinitely. Consequently, increases linearly with time (see *e.g.*, in Figure 4 and Figure 5) and reaches within generations. In contrast, for smaller values of in the metastable regime it takes an extremely long time for to reach In this regime, genes get repeatedly exchanged with the rest of the population and remains nearly constant over long periods of time (see *e.g.*, in Figure 4 and Figure 5). Notably, near the boundary region between the two regimes a small perturbation in the evolutionary parameters could change the evolutionary dynamics from divergent to metastable and vice versa.

Do the conclusions about the transition between divergent and metastable dynamics depend on the particular choice of ? In the Figure A2, we show that in fact the transition is independent of and is fully determined by the two evolutionary nondimensional parameters and identified in this study.

### Population structure: the distribution of pairwise divergences of genomes within a population

Can we understand the phylogenetic structure of the entire population by studying the evolutionary dynamics of just a single pair of strains?

Given a sufficient amount of time, every pair of genomes in our model would diverge indefinitely (see Figure 4). However, in a finite population of size the average probability of observing a pair of strains whose MRCA existed *t* generations ago is exponentially distributed, (here and below we use the bar to denote averaging over multiple realizations of the coalescent process, or long-time average over population dynamics) (Kingman 1982; Higgs and Derrida 1992; Serva 2005). Thus, it becomes more and more unlikely to find such a pair in a finite-sized population.

Let to denote the probability distribution of for all pairs of genomes in a given population, while stands for the same distribution averaged over long time or multiple realizations of the population. One has.(4)In Equation 4, is the probability that a pair of strains in a population snapshot shared their MRCA *t* generations ago and is the probability that a pair of strains have diverged by at time *t*. Given that is the average of independent realizations of we can approximate as a Gaussian distribution with average and variance Here and below angular brackets and the subscript *G* denote the average of a quantity over the entire genome.

Unlike the time- or realization-averaged distribution only the instantaneous distribution is accessible from genome sequences stored in databases. Notably, even for large populations these two distributions could be significantly different from each other. Indeed, in any given population is extremely noisy due to multiple peaks from clonal subpopulations and does not resemble its smooth long-time average profile (Higgs and Derrida 1992; Serva 2005). In Figure 6, we show for the four cases shown in Figure 4 (also marked by green diamonds in Figure 5). We fixed the population size to We changed by changing the recombination rate ρ. The solid lines represent a time snapshot obtained by numerically sampling in a Fisher–Wright population of size The dashed black line represents the time average

In the divergent regime of Figure 5, the instantaneous snapshot distribution has multiple peaks corresponding to divergence distances between several spontaneously formed clonal subpopulations present even in a homogeneous population. These subpopulations rarely exchange genetic material with each other because of a low recombination frequency ρ. In this regime, the time averaged distribution has a long exponential tail and, as expected, does not agree with the instantaneous distribution

Notably, in the metastable regime the exponential tail shrinks into a Gaussian-like peak. The width of this peak relates to fluctuations in around its mean value which in turn are dependent on the total number of genes *G*. Moreover, the difference between the instantaneous and the time averaged distributions diminishes as well. In this limit, all strains in the population exchange genetic material with each other. Consequently, the population becomes genetically cohesive and temporally stable.

## Comparison Between Three Models of Recombination

So far, we presented results from a simplified model of recombination (model 1, see Figure 1). Employing this model allowed us to develop a mathematical formalism to describe evolutionary dynamics of a pair of bacterial genomes in a coevolving population. It also allowed us to investigate how genomes diversify across a range of evolutionary parameters in a computationally efficient manner. However, in real bacteria, transfer events have variable lengths and partially overlap with each other (Milkman 1997; Falush *et al.* 2001; Vetsigian and Goldenfeld 2005; Dixit *et al.* 2015).

Here, we systematically study the similarities and differences between the three progressively more realistic models described in section *Computational Models* and (illustrated in Figure 1b). To directly compare results across different types of simulations, we ran each of the three simulations for the four parameter sets used in Figure 4. See Appendix for details of the simulations.

The metastability/divergent transition (see Figure 5 above) is based on the dynamics of the ensemble average We studied how depends on the nature of recombination with an explicit simulation of coevolving strains each with bp. Figure 7a shows the time evolution of the ensemble average estimated from the explicit simulations. The three colors represent three different models of recombination. Notably, is insensitive to whether recombination tracks are of variable length or overlapping with each other. Since metastability depends on the conclusions about metastability obtained using recombination model (1) can be generalized to more realistic models (2) and (3).

Can the effects of allowing overlapping recombination tracks be seen in population structure? To investigate this, we looked at the stochastic fluctuations in around Intuitively, overlapping recombination events are expected to homogenize highly divergent genetic fragments in the population. As a result, we expect smaller within-population variation, *i.e.*, smaller fluctuations in around We tested this by studying the expected distribution of pairwise genome-wide divergences within a population (note the above discussion of difference between average and the distribution within a sample population) for the three models of recombination.

We only consider the case where As seen in Figure 4 and Figure 7a, in the metastable state the divergence, virtually does not increase as a function of *t* at long times (the rate of increase is extremely slow). Thus, the variance in largely represents the variance in around its ensemble average In Figure 7b, we show for the three different models of recombination. Indeed, the variance in is much smaller when overlapping recombination events are allowed [models (2) and (3) compared to model (1)]. The effect of varying the length of recombined segments appears to be minimal.

## Application to Real-Life Bacterial Species

Where are real-life bacterial species located on the divergent–metastable diagram? Instead of population genetic studies of bacteria usually quantify the relative strength of recombination over mutations as the ratio of the number of SNPs brought in by recombination relative to those generated by point mutations in a pair of closely related strains (Guttman and Dykhuizen 1994; Vos and Didelot 2009; Dixit *et al.* 2015). In our framework, is defined as where is the rate of successful recombination events and is the average divergence in transferred regions. Both and depend on the evolutionary parameters (see Appendix for a detailed description of our calculations). is closely related (but not equal) to the inverse of used in our previous plots.

In Figure 8, we replotted the “phase diagram” shown in Figure 5 in terms of and , and approximately placed several real-life bacterial species on it. To this end, we estimated θ from the MLST data (Jolley and Maiden 2010) (see Appendix for details) and used values that were determined previously by Vos and Didelot (2009). As a first approximation, we assumed that the transfer efficiency is the same for all species considered and is given by used in Fraser *et al.* (2007). However, as mentioned above, the transfer efficiency depends in part on the RM and the MMR systems. Given that these systems vary a great deal across bacterial species including minimal barriers to recombination, for example as observed in *Helicobacter pylori* (Falush *et al.* 2001), or different combinations of multiple RM systems reported in Oliveira *et al.* (2016). We note that *H. pylori* appears divergent even with minimal barriers to recombination, probably because of its ecologically structured population that is dependent on human migration patterns (Thorell *et al.* 2017). One expects that transfer efficiency might also vary across bacteria. Further work is needed to collect the extent of this variation in a unified format and location. One possible bioinformatics strategy is to use the slope of the exponential tail in SNP distribution ( in our notation) to infer the transfer efficiency as described in Dixit *et al.* (2015).

Figure 8 confirms that both and are important evolutionary parameters and suggests that each of them alone cannot fully classify a species as either divergent or metastable. Notably, there is a sharp transition between the divergent and the metastable phases implying that a small change in or can change the evolutionary fate of the species. And finally, one can see that different bacterial species use diverse evolutionary strategies straddling the divide between these two regimes.

Can bacteria change their evolutionary fate? There are multiple biophysical and ecological processes by which bacterial species may move from the metastable to the divergent regime and vice versa in Figure 5. For example, if the effective population size remains constant, a change in mutation rate changes both as well as θ. A change in the level of expression of the MMR genes, changes in types or presence of MMR, SOS, or RM systems, or loss or gain of co-infecting phages, could all change or the rate of recombination (Vulić *et al.* 1997; Oliveira *et al.* 2016), thus changing the placement of the species on the phase diagram shown in Figure 8.

Adaptive and ecological events should be inferred from population genomics data only after rejecting the hypothesis of neutral evolution. However, the range of behaviors consistent with the neutral model of recombination-driven evolution of bacterial species was not entirely quantified up till now, leading to potentially unwarranted conclusions as illustrated in Krause and Whitaker (2015). Consider *Escherichia coli* as an example. Known strains of *E. coli* are usually grouped into five or six different evolutionary subclades (groups A, B1, B2, E1, E2, and D). It is thought that interclade sexual exchange is lower compared to intraclade exchange (Didelot *et al.* 2012; Dixit *et al.* 2015). Ecological niche separation and/or selective advantages are usually implicated as initiators of such putative speciation events (Polz *et al.* 2013). In our previous analysis of 32 fully sequenced *E. coli* strains, we estimated and (Dixit *et al.* 2015), implying that *E. coli* resides deeply in the divergent regime in Figure 8. Thus, based on the analysis presented above, one expects *E. coli* strains to spontaneously form transient sexually-isolated subpopulations even in the absence of selective pressures or ecological niche separation.

## Extending the Framework to Incorporate Selection and Other Factors Modulating Recombination

Throughout this study, we used two assumptions that allowed efficient mathematical analysis: (i) exponentially decreasing probability of successful integration of foreign DNA, and (ii) exponentially-distributed pairwise coalescent time distribution of a neutrally evolving well-mixed population. Here, we discuss how to relax these assumptions within our framework.

A wide variety of barriers to foreign DNA entry exist in bacteria (Thomas and Nielsen 2005). For example, bacteria may have multiple RM systems that either act in combination or are turned on and off randomly (Oliveira

*et al.*2016). Moreover, rare nonhomologous/illegitimate recombination events can transfer highly diverged segments between genomes (Thomas and Nielsen 2005) potentially leading to homogenization of the population. Such events can be captured by a weaker-than-exponential dependence of the probability of successful integration on local genetic divergence (see Appendix for a calculation with nonexponential dependence of the probability of successful integration on the local sequence divergence). One can incorporate these variations within our framework by appropriately modifying in the framework.Bacteria belong to ecological niches defined by environmental factors such as availability of specific nutrient sources, host–bacterial interactions, and geographical characteristics. Bacteria in different niches may rarely compete with each other for resources and consequently may not belong to the same effective population and may have lowered frequency of DNA exchange compared to bacteria sharing the same niche. How can one capture the effect of ecological niches on genome evolution? Geographically- and/or ecologically-structured populations exhibit a coalescent structure (and thus a pairwise coalescence time distribution) that depends on the nature of niche separation (Takahata 1991; Wakeley 2004). Within our framework, niche-related effects can be incorporated by accounting for pairwise coalescent times of niche-structured populations (Takahata 1991; Wakeley 2004) and niche-dependent recombination frequencies. For example, one can consider a model with two or more subpopulations with different probabilities for intra- and interpopulation DNA exchange describing geographical or phage-related barriers to recombination.

While most point mutations are thought to have insignificant fitness effect, the evolution of bacterial species may be driven by rare advantageous mutations (Majewski and Cohan 1999). Recombination is thought to be essential for bacterial evolution to minimize the fitness loss due to Muller’s ratchet (Takeuchi *et al.* 2014) and to minimize the impact of clonal interference (Cooper 2007). Thus, it is likely that both recombination frequency and transfer efficiency are under selection (Takeuchi *et al.* 2014; Lobkovsky *et al.* 2016; Iranzo *et al.* 2016). How could one include fitness effects in our theoretical framework? Above, we considered the dynamics of neutrally evolving bacterial populations. The effective population size is incorporated in our framework only via the coalescent time distribution and consequently the intraspecies diversity . Neher and Hallatschek (2013) recently showed that while pairwise coalescent times in adaptive populations are not exactly exponentially distributed, this distribution has a pronounced exponential tail with an effective population size weakly related to the actual census population size and largely determined by the variance of mutational fitness effects (Neher and Hallatschek 2013). To modify the recombination kernel , one needs to know the three-point coalescence distribution for strains *X*, *Y*, and the donor strain *D* [Dixit *et al.* (2015) for details]. Once such a three-point coalescence distribution is available in either analytical or even numerical form, our results could be straightforwardly generalized for adaptive populations (assuming most genes remain neutral). We expect the phase diagram of a thus modified adaptive model to be similar to its neutral predecessor considered here, given that the pairwise coalescent time distribution in an adaptive population has an exponential tail as well (Neher and Hallatschek 2013), and for our main results to remain qualitatively unchanged.

### Conclusions

While recombination is now recognized as an important contributor to patterns of genome diversity in many bacterial species (Guttman and Dykhuizen 1994; Milkman 1997; Falush *et al.* 2001; Thomas and Nielsen 2005; Touchon *et al.* 2009; Vos and Didelot 2009; Dixit *et al.* 2015), its effect on population structure and stability is still heavily debated (Fraser *et al.* 2007; Wiedenbeck and Cohan 2011; Doolittle 2012; Polz *et al.* 2013; Shapiro *et al.* 2016). In this work, we explored three models of gene transfers in bacteria to study how the competition between mutations and recombinations affect genome evolution. Analysis of each of the three models showed that recombination-driven bacterial genome evolution can be understood as a balance between two competing processes. We identified the two dimensionless parameters and that dictate this balance and result in two qualitatively different regimes in bacterial evolution, separated by a sharp transition.

The two competitions give rise to two regimes of genome evolution. In the divergent regime, recombination is insufficient to homogenize genomes leading to a temporally unstable and sexually fragmented species. Notably, understanding the time course of divergence between a single pair of genomes allows us to study the structure of the entire population. Species in the divergent regime are characterized by multi-peaked clonal population structure. On the other hand, in the metastable regime, individual genomes repeatedly recombine genetic fragments with each other leading to a sexually cohesive and temporally stable population. Notably, real bacterial species appear to belong to both of these regimes as well as in the crossover region separating them from each other.

## Acknowledgments

We thank Kim Sneppen, Erik van Nimwegen, Daniel Falush, Nigel Goldenfeld, Eugene Koonin, and Yuri Wolf for fruitful discussions and comments that lead to an improved manuscript. We also thank the three reviewers and the editor for their detailed reading and valuable comments.

## Appendix

### from Computer Simulations

To compare the three models of recombination, we performed three types of explicit simulations of a Fisher–Wright population of coevolving strains. The three simulations had different modes of gene transfers as indicated in Figure 1b. Each strain had bp. Each base pair was represented either by a 0 (wild-type) or 1 (mutated). The mutation rate was fixed at per base pair per generation. We varied the recombination rate and per base pair per generation. θ was fixed at and was fixed at These parameters are identical to the ones used in Figure 4 of the main text. We note that, given the low population diversity (), we can safely neglect back mutations. Note that in all three simulations, on an average, a total of 5 kbp of genome was transferred in a successful transfer event, thereby allowing us to directly compare the three simulations.

We started the simulations with identical genomes. We ran a Fisher–Wright simulation for generations to ensure that the population reached a steady state. In each generation, children chose their parents randomly. This ensured that the population size remained constant over time. Mutation and recombination events were attempted according to the corresponding rates. Note that it is nontrivial to keep track of the divergence between individual pairs over time since one or both of the strains in the pair may either be stochastically eliminated. To study the time evolution of the ensemble average of the divergence, at the end of the simulation, we collected the pairwise coalescent times *t* between all pairs of strains as well as the genomic divergences between them. Note that due to the stochastic nature of mutations and recombination events, is a random variable. We estimated the ensemble average by binning the pairwise coalescent times in intervals of generations (1/10th of the population size) and taking an average over all in each bin. The ensemble average thus estimated represents the average over multiple realizations of the coalescent process. Mathematically, the ensemble average is given by.(A1)Here, is the probability that the genomes of two strains whose most recent common ancestor was *t* generations ago have diverged by We note that the variance in is expected to be small since it is an average over a large number of genes. These results are plotted in Figure 7.

### Behavior of in the Long-Time Limit

In Figure A1 we show how increases with *t* over a longer range of times. We note that it is exponentially rarer to find a pair of strains in a population that have diverged beyond generations, where is the population size.

### Transition Between Metastable and Divergent Dynamics Does not Depend on the Choice of

In Figure 5 in the main text, we showed the transition between metastable and divergent evolutionary dynamics. There, we fixed and varied θ and ρ to scan the space of nondimensional parameters and However, our results do not depend on this particular value of To show this, we recalculated Figure 5 by randomly sampling θ (between 0.51 and 3%), (between 0.5 and 5%), and ρ (between and per base pair per generation) while keeping the mutation rate constant at per base pair per generation. In Figure A2 below, we plot the time required for the ensemble average genome-wide divergence to reach an atypical value of From Figure A2, it is clear that the time taken to reach is indeed determined by the two dimensionless constants and , and not by the particular choice of the value of

### Estimating from Model Parameters

As mentioned in the main text, is defined in a pair of strains as the ratio of SNPs brought in by recombination events and the SNPs brought in by point mutations. Clearly, will depend on a strain-to-strain comparison; however, usually it is reported as an average over all pairs of strains. How do we compute in our framework? We have.(A2)Thus, to compute we need two quantities. First, we need to compute the rate of successful recombinations We can calculate as(A3)where is the success probability that a gene that has diverged by *δ* will have a successful recombination event. The integration over exponentially-distributed pairwise coalescent times averages over the population can be computed from Equation 3 by integrating over all possible scenarios of successful recombinations. We have(A4)where and are normalized divergences and is the distribution of local divergences at time *t*. In practice, can only be estimated by analyzing statistics of distribution of SNPs on the genomes of closely related strain pairs where both clonally inherited and recombined parts of the genome can be identified (Didelot *et al.* 2012; Dixit *et al.* 2015). Here, we limit the time-integration in Equation A3 to times

Second, we need to compute the average divergence in transferred segments, We have(A5)where is the average divergence after a recombination event if the divergence before transfer was *δ*.

### Computing θ from MLST Data

Except for *E. coli*, where we used our previous analysis (Dixit *et al.* 2015) (we used and ), we downloaded MLST sequences of multiple organisms from the MLST database (Jolley and Maiden 2010). For each of the seven genes present in the MLST database, we performed a pairwise alignment between strains. For a given pair of strains, we evaluated the nucleotide difference in each gene and estimated the average *q* over these seven pairwise differences. The θ for the species was estimated as an average of *q* over all pairs of strains.

### Nonexponential Dependence of on Local Sequence Divergence

In the main text, we showed that when decays exponentially with the local divergence, the time evolution of local divergence shows metastability. When the recombination rate is low, a few recombination events take place that change to typical values in the population before the local region eventually escapes the integration barrier, leading to a linear increase in (see Figure 3). When the recombination rate is high, the number of recombination events before the eventual escape from the integration barrier increases drastically leading to metastable behavior.

Here, we suggest that weaker-than-exponential dependence of can lead to a time evolution of local divergence that never escapes the integration barrier, leading to a genetically homogeneous population independent of the recombination rate ρ.

While it is difficult to carry out analytical calculations for a finite θ and following Doroghazi and Buckley (2011), we consider the limit when μ and ρ are finite. The time evolution of in the limit when decays exponentially with divergence is given by (see Equation 3).(A6)In Equation A6, is the number of SNPs (as opposed to SNP density used in the main text). As was shown in the main text, the evolution of described by Equation A6 is a random walk that repeatedly resets to zero before eventually escaping to The number of resetting events depends on as defined in the main text (see low values in Figure 5).

A generalization to nonexponential dependence of the success probability is straightforward,(A7)where is the probability of successful integration. How weak should the integration barrier be so that the time evolution described by Equation A7 can never escape the pull of recombination? In other words, what are the conditions on that ensure that the time evolution of local divergence described by Equation A7 results in a random walk that resets to zero infinitely many times?

If the random walk resets infinitely many times, it has a well-defined stationary distribution as Note that the random walk described by an exponentially decaying does not have a well-defined stationary distribution since as regardless of the rate of recombination and the transfer efficiency. Let us assume that is such that there exists a well-defined stationary distribution. We define as the probability that in the stationary state. We can write balance equations in the stationary state(A8).(A9)Rearranging.(A10)Since from Equation A9 and Equation A10 we have for an arbitrary (denoting ).(A11)Thus, as long as the functional in Equation A11 is equal to 1 (or ), the walk remains localized. Equation A11 is a surprisingly simple result and is valid for any

Let us consider a specific case where A power-law dependence in is weaker than the exponential decay used in the main text, potentially allowing transfers between distant bacteria. Let us examine the self-consistency condition. We have.(A12)Taking logarithms and using the Abel–Plana formula(A13)*if * The integral (and thus the sum) tends to when Here, is the hypergeometric function. Thus, when a well-defined stationary distribution exists and as long as and regardless of ρ and the population remains genetically cohesive. When we expect behavior similar to the exponential case studied in the main text, viz. a divergent *vs.* metastable transition depending on the competition between forces of recombinations and mutations. We believe that these conclusions will also hold true when θ is finite.

## Footnotes

*Communicating editor: J. Wall*

- Received November 9, 2016.
- Accepted July 18, 2017.

- Copyright © 2017 by the Genetics Society of America