## Abstract

Inferring the rate of homologous recombination within a bacterial population remains a key challenge in quantifying the basic parameters of bacterial evolution. Due to the high sequence similarity within a clonal population, and unique aspects of bacterial DNA transfer processes, detecting recombination events based on phylogenetic reconstruction is often difficult, and estimating recombination rates using coalescent model-based methods is computationally expensive, and often infeasible for large sequencing data sets. Here, we present an efficient solution by introducing a set of mutational correlation functions computed using pairwise sequence comparison, which characterize various facets of bacterial recombination. We provide analytical expressions for these functions, which precisely recapitulate simulation results of neutral and adapting populations under different coalescent models. We used these to fit correlation functions measured at synonymous substitutions using whole-genome data on *Escherichia coli* and *Streptococcus pneumoniae* populations. We calculated and corrected for the effect of sample selection bias, *i.e.*, the uneven sampling of individuals from natural microbial populations that exists in most datasets. Our method is fast and efficient, and does not employ phylogenetic inference or other computationally intensive numerics. By simply fitting analytical forms to measurements from sequence data, we show that recombination rates can be inferred, and the relative ages of different samples can be estimated. Our approach, which is based on population genetic modeling, is broadly applicable to a wide variety of data, and its computational efficiency makes it particularly attractive for use in the analysis of large sequencing datasets.

- bacteria
- homologous recombination
- population diversity
- sample selection bias
- sample ages
- adapting populations
- Bolthausen–Sznitman coalescent

BACTERIA can receive DNA fragments from their environment by different mechanisms, and integrate them into their genome in a set of processes collectively known as horizontal gene transfer (HGT) (Thomas and Nielsen 2005). While the importance of HGT in bacterial evolution is increasingly appreciated (Koonin and Wolf 2008; Shapiro *et al.* 2012; Oren *et al.* 2014; Ravenhall *et al.* 2015; Rosen *et al.* 2015), quantifying its impact across bacterial genomes, and in diverse environmental samples, remains a key challenge (Maynard Smith 1991; Soucy *et al.* 2015). In particular, a prevalent form of HGT involves homologous recombination of fragments that bear a high degree of sequence similarity to the recipient genome (Andam and Gogarten 2011; Williams *et al.* 2012). Such transfer events are particularly difficult to detect, since they leave no obvious marks, and are indistinguishable based on nucleotide composition, yet they are likely to represent the majority of HGT events in bacteria (Fraser *et al.* 2007).

Three major mechanisms of HGT—transformation, conjugation, and transduction—mediate the passage of external DNA into bacterial cells. Once inside the cell, DNA fragments can be integrated into the genome either by homologous or nonhomologous recombination mechanisms (Thomas and Nielsen 2005). In homologous recombination, the fragment usually recombines into a homologous genomic locus replacing the existing DNA at the recipient locus. DNA transfer by homologous recombination is similar to gene conversion in eukaryotes—it is unidirectional, and changes occur only in the recipient locus. In nonhomologous recombination (also known as illegitimate recombination), a comparatively rare event, the fragment is inserted directly into the genome without replacing DNA. Homologous recombination breaks genetic linkage within a bacterial population, thereby alleviating clonal interference as well as Muller’s ratchet, and is therefore expected to increase the rate at which bacteria evolutionarily adapt to their environment. Theoretical analyses have shown that a principal determinant of the speed of evolutionary adaptation is the recombination rate (Cohen *et al.* 2005; Neher *et al.* 2010; Weissman and Barton 2012; Neher and Hallatschek 2013; Weissman and Hallatschek 2014). Measuring these rates, and other population genetic parameters, is therefore crucial for describing the evolutionary dynamics of different bacterial species.

To infer the parameters and rates of the HGT process, mathematical models based on coalescent theory (Kingman 1982; Hudson 1983) have been used to study the genealogy of a sample of sequences in the presence of gene conversion or homologous recombination (Wiuf 2000; Wiuf and Hein 2000; McVean *et al.* 2002; Didelot and Falush 2007; Touchon *et al.* 2009). Using a coalescent model with gene-conversion, computationally intensive methods have been developed that allow estimation of recombination rates on a whole-genome scale (Didelot *et al.* 2010; Ansari and Didelot 2014). Other methods detect recombination events based on regional differences of single nucleotide polymorphism (SNP) density (Croucher *et al.* 2015) or inferred phylogeny (Marttinen *et al.* 2012), and are often used to establish a clonal phylogeny for a sample based on the vertically inherited regions. Patterns of SNP density combined with Markov Chain Monte Carlo simulations have also been used to infer genome-wide recombination parameters (Dixit *et al.* 2015). One caveat is that phylogenetic methods are particularly sensitive to errors in the reference phylogeny. When recombination occurs as frequently as mutations accumulate, which appears to be the case in many species (Vos and Didelot 2009), it is not clear whether a reference phylogeny can be accurately constructed based on sequence data alone.

In natural populations, the genealogical trees reconstructed from the sampled sequences can exhibit substantial departure from neutrality. Kingman’s coalescent (KC) allows only pairwise mergers—at most two ancestral lines can merge at each coalescence event—and the branches of the coalescent tree are distributed evenly (Kingman 1982). In spatially expanding or panmictic adapting populations, multiple mergers of ancestral lines occurring at a single step are not rare in the genealogical trees, and (Brunet *et al.* 2007) showed that a special case of the multiple-merger coalescent, the Bolthausen–Sznitman coalescent (BSC) (Bolthausen and Sznitman 1998), describes their genealogies. More recently, (Desai *et al.* 2013) and (Neher and Hallatschek 2013) showed that, in rapidly adapting populations, exponential amplification of fit lineages leads to multiple mergers, and the resulting genealogies are well described by the BSC.

Here, we present an analytically tractable framework for inferring homologous recombination from sequence data that is applicable for any exchangeable coalescent model, which subsumes KC and BSC models. We introduce new population-genetic quantities based on correlations of substitutions within a population, and use these to accurately infer recombination parameters and provide consistency tests of the model. We study the impact of selection on these quantities, and demonstrate a smooth transition from KC to BSC predictions with the increase of selection strength. Importantly, our method does not rely on the construction of phylogenetic trees, and, because it is analytically tractable, it yields results as fast as sequence data can be read into memory. We demonstrate the power of this method using large datasets of whole-genome bacterial sequences.

## Materials and Methods

### Neutral population models

We consider a generalized Moran model that evolves a population of *N* individuals in continuous time with overlapping generations (Moran 1958), and allows individuals to have multiple offspring at each reproduction event. We choose an arbitrary unit of time to measure all rates in the model. Reproduction events occur with rate *G* at exponentially distributed times. At each event, exactly one individual reproduces, yielding new individuals that replace randomly chosen individuals out of the existing individuals in the population, not including the parent. The value is a random variable that can take values from with probability distribution To obtain the classic Moran model, one chooses and and it is well known that, in this case, the resulting coalescent genealogies are given by KC statistics (Kingman 1982). Alternatively, one can choose which we will call the *Schweinsberg model*, in which case the coalescent genealogies are given by BSC statistics (Schweinsberg 2012).

### Mutations, recombination, and fragment size distribution

Mutations and DNA transfer occur with uniform rates throughout the population, and across genomes. We model circular genomes of length *L* with an “alphabet” of size *a* letters, where the usual choice is corresponding to the four bases of DNA. Each site on a genome mutates with rate *μ* (per generation) to any of the remaining letters with equal probability. DNA transfer events, in which an external piece of DNA is imported and homologously recombined into the genome, occur at a rate *R* (per generation) in each genome. We define as the transfer rate per site, where *L* is the length of the genome. Each time a transfer occurs into a given genome (the “recipient”), one of the *N* individuals is chosen randomly to be the “donor,” and a randomly chosen fragment of size *f* is copied from the donor, replacing the existing piece at the identical location in the recipient.

The fragment size *f* is a random variable determined by a probability distribution function, We define the rate at which a site is affected by recombination, where is the average fragment size. We call the parameter *r*, the *recombination coverage rate*, since it measures the fraction of the genome that is “covered” by horizontally transferred pieces of DNA newly acquired in each generation. For two sites separated by distance *l* along the genome, the coverage rate *r* at each site can be partitioned into two contributions: the rate at which only the single site is covered by a transfer, (one-site transfer), and the rate at which both sites are covered, (two-site transfer), given by(1)(2)where When *l* is smaller than the minimal size of transferred fragments, these functions depend only on with and All simulations shown here used a constant fragment size setting where is the Kronecker delta function; all analytical results were derived in terms of and hence, they are generally applicable for any fragment size distribution.

### Genome sequences and population diversity

Each genome sequence is specified as for over an alphabet of size *a*, with For a pair of genomes, and we define the substitution sequence For notational convenience, we number the possible pairs of sequences and write the substitution sequence of the *k*-th pair as which we call the *substitution variable*, and which takes values . The substitution variable can be thought of as a set of observations indexed by the variables *i* and *k*, and we can average over the indices in various ways. For example, we define the average sequence distance between the *k*-th pair of genomes as where we let denote the average over sequence positions *i*. The population average pairwise distance, which we will also call the *population diversity*, is given by(3)where the bar denotes the average over all genome pairs *k* across the population. The variance of pairwise distances is written as(4)Equivalently, and often more conveniently, we consider to be a random variable that depends on *i* and *k*. Since we will be computing averages and correlations of *S* over the indices, we take *i* and *k* to be random variables with a uniform distribution over their respective values. In this notation, we have where is the conditional random variable, and denotes expectation. The population diversity is given by The variance of pairwise distances is written as

It is important to emphasize that *S*, as defined here, depends explicitly on the collection of genomes that make up the population at any given time. Throughout the paper, any quantity that is defined using *S* (*e.g.*, *d*, and the correlation functions below) is therefore a random variable whose distribution is determined by all possible realizations of the stochastic population dynamics process. In simulations, we average over many runs to calculate the expectations of these quantities, while our analytical results are constructed explicitly to compute their expectations over the stochastic dynamics.

### Population genetic parameters

We define several additional quantities, which will be useful when analyzing recombination and diversity at the population level. We consider a pair of individuals that coalesced at time *t* ago. Their *mutational divergence*, , corresponds to the fraction of the genome that accumulated mutations on the two lineages since coalescence. In the neutral models that we consider here, the mean coalescence time is generations (Appendix B), and the *mean mutational divergence* is given by the well-known population genetic parameter quantity Analogously, we define the pairwise *recombinational divergence*, which is the average number of recombination events per site that occurred on two lineages. The *mean recombinational divergence* is then given by the quantity Since each event covers on average base pairs, we define the mean *recombination coverage* as which measures the fraction of the genome affected by recombination since divergence. We also define and as the one-site and two-site recombinational coverage, where

### Correlation functions

We consider a pair of sites separated by distance *l* along the genome, denoted *i* and We define and to be the substitution variables at any two such sites, and treat *X* and *Y* as random variables, as above. This allows us to define three distinct correlation functions, each of which involves all pairs of sites separated by a distance of *l* along the genome (see Figure 1). The *mutation correlation function*,(5)assesses within each pair of genomes the correlation of mutations across all pairs of sites *i* and the *structure correlation function*,(6)measures how strongly pairs of sites are correlated across the population; and the *rate correlation function*,(7)quantifies the correlation of evolutionary rates at pairs of sites along the genome sequence. Note that, due to the circular genomes, the averaging along the genome sequence goes once around the genome, for using the convention that is taken modulo *L*.

### Adapting population model

To simulate an adapting population (Figure 3), we keep track of individuals’ fitness, and allow reproduction events to occur with rates proportional to individuals’ fitness, choosing the individual that is replaced at each event randomly and uniformly across the population. To model fitness effects, we consider the genome as a series of *L* “codons,” each of which contains one selective site and one neutral site (mimicking aspects of real codons, in which the first 2 bp are under stronger selective pressures than the third one). Mutations at neutral sites occur with rate *μ* per site, as before. Mutations at selective sites occur at rate per site, and change the genome’s fitness by where the parameter *s* is a constant during the simulation, and the randomly chosen sign of *s* determines whether it is a beneficial or deleterious mutation. Recombination events are modeled as above.

### Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

## Results

We initially study populations that evolve neutrally in the presence of recombination, both by simulations and theoretical analysis, and later determine how ongoing adaptation would influence the results. We then show that the theory can be used to explain mutational correlations in whole genome data, and can be applied in a straightforward manner to infer recombination rates, and other key population genetic parameters. Table 1 summarizes the key parameters and variables that are used throughout the text.

### Effect of homologous recombination on genetic diversity

We simulated neutral dynamics of a population of sequences that mutate and recombine DNA fragments, and measured their average pairwise distance—or population diversity, *d*—at different recombination rates. We used two different neutral models that yield either KC or BSC trees for a population of size *N*, using the mutation rate *μ* (per site per generation), recombination rate *γ* (per site per generation), and transferred fragment size We define the recombination coverage rate, which measures the fraction of the genome that is “covered” by transferred pieces at each generation. We also define the mutational divergence, which, under neutral evolution corresponds to the fraction of the genome that accumulated mutations in a pair of sequences since divergence, and the recombination coverage, the fraction of the genome covered by recombination since divergence (see *Materials and Methods*).

To study how DNA transfer impacts diversity for different population sizes, we performed comparisons at fixed *θ* by choosing Figure 2A shows a linear decrease in diversity with increasing recombination, as well as the expected collapse of the values of *d* obtained across different population sizes at constant mutational divergence *θ*. The decrease in diversity occurs because each recombination event replaces a piece of one genome with a homologous piece of another. DNA transfer within a bacterial population can therefore only decrease the number of alleles segregating at any single site on DNA. However, as seen here and shown analytically below, the slope of the decrease in *d* is very shallow, hence population diversity is insensitive to large fold changes in *r*.

While a greater recombination rate has only a weak effect on diversity, it has a stronger effect on the distribution of pairwise distances within the population. To see this, we plotted the variance of pairwise distances as a function of *r* (Figure 2A). When recombination coverage rates are low, the population exhibits clusters of similar sequences (Higgs and Derrida 1992), resulting in high values of Increasing *r* decreases due to transfers between clusters, which reduce the differences between clusters, and create an increasingly isotropic distribution of sequences. While *d* exhibits only moderate decrease as a function of *r*, decreases much more rapidly with *r*. For fixed *θ*, the variance exhibits strong dependence on *N*, which collapses onto a single curve as a function of the recombination coverage *ρ* (Figure 2A, inset). These results are consistent with homologous recombination acting as an efficient cohesive force within bacterial populations (Hanage *et al.* 2006; Fraser *et al.* 2007, 2009), since a small increase of *r* is able to destroy clusters while only marginally reducing population diversity.

### Genomic correlations and the mutational covariance identity

Inferring transfer rates from population measurements requires more informative quantities than *d* and The standard population genetic approach is to explore the pattern of linkage disequilibrium (LD) across genomic regions, and relate it to recombination rates (McVean *et al.* 2002; McVean *et al.* 2004; Ansari and Didelot 2014). However, when more than two alleles can segregate at a locus (*e.g.*, in large, diverse populations), minority and majority alleles cannot be unambiguously defined, and LD must be generalized to account for multiple alleles, which becomes mathematically cumbersome. Moreover, variations in substitution rates across the genome can confound LD measurements, and a framework that accounts for rate variations is particularly useful in bacteria.

To address these issues, we introduce three correlation functions that capture different effects of homologous recombination on the structure of genetic diversity in the population (see Figure 1 and *Materials and Methods*). To avoid arbitrarily assigning a reference sequence, or the related issues of majority and minority alleles mentioned above, we consider all possible pairwise comparisons of sequences, which we denote by the substitution variable where *k* indexes the pair of sequences, and *i* is the genomic position; if the two sequences are identical at position *i*, otherwise We consider any two loci that are *l* base pairs apart in a randomly chosen sequence pair, and denote the substitution variable at these loci by and We measure the correlation between *X* and *Y* across loci within each sequence pair, and then average over all pairs to obtain the mutational correlation function, We measure the correlation between *X* and *Y* across the population, and then average over loci to obtain the structure correlation function, Lastly, we compute the average substitution rates at the two loci across the population, and measure the correlation of substitution rates along the genome using the rate correlation function, Figure 1 indicates pictorially how these functions are obtained by correlating or averaging substitution variables conditionally along the sequence or across the population. The tabular structure suggests that a relation should exist among these functions. Indeed, it is easy to check by substitution that(8)a relation that we call the mutational covariance identity. In Appendix A, we show how this identity follows from the law of total covariance applied to the variables *X* and *Y*. Additionally, the non-negativity of yields the inequality,

The three correlation functions are shown for different recombination rates *γ* in Figure 2, B and C. An important aspect of these measurements is that, since we simulate neutrally evolving sequences, the correlations exist only at population level, *i.e.*, in the substitution variables that involve pairwise comparisons. Individual sequences in all cases are purely random, uncorrelated sequences of four letters. Since the process of reproduction builds correlations between pairs of sequences, while mutation breaks them, parameters related to reproduction, mutation, and recombination can all influence the shape of correlation functions.

In the absence of recombination, the mutational correlation, , is identically zero, as expected, since mutations occur randomly throughout each sequence. DNA transfer causes pairs of sequences to become identical over random blocks of size thereby inducing local correlations that decay as a function of *l* at a rate determined by *γ*; that is, the higher the recombination rate, the faster the decay of However, the dependence of the magnitude of on *γ* is nonmonotonic: correlation is low for both low and high values of *γ*, with a maximum at intermediate values (see also Figure 2D), which we discuss below. The qualitative behavior of the rate correlation, is similar to that of both as a function of *l* (Figure 2B, inset) and *γ* (Figure 2D), but with smaller overall magnitude. By definition, is constructed to detect correlation of substitution rates, and, indeed, despite the fact that mutation rates in simulations are constant across all loci, DNA transfer reduces diversity over blocks of size hence, it introduces a correlation length scale for substitution rates.

In contrast, the structure correlation, is positive and constant in the absence of recombination, where its value is determined by drift-mutation balance, *i.e.*, reproduction events build structure correlation while mutations destroy it (Figure 2C). Recombination causes to decay with *l* at a rate determined by *γ*. To see why increasing recombination rate reduces the magnitude of structure correlation, we consider two sites separated by distance *l*. The recombination coverage rate, *r*, at a single site can be partitioned into the rate at which transferred fragments overlap only the one site but not the other, and the rate at which fragments span both sites, where (see *Materials and Methods*). For a given pair of sequences, the rate of one-site transfer is since four sites are involved, while two-site transfers occur between the pair of sequences with rate These two-site transfers will annihilate differences between the pair of individuals at both sites and build correlation, while any one-site transfer will break associations and reduce correlation. Since is determined mainly by one-site transfers that destroy correlation, thus it decays monotonically with both *l* (Figure 2C) and *γ* (Figure 2D).

We plotted in Figure 2D the dependence of all three correlation functions as a function of the recombination rate *γ* using the value *i.e.*, for proximate sites along the genome. The results indicate that the pair and behave qualitatively similarly, as do the pair and These pair relations could be anticipated from the definitions in Equations 5–7, since both and are calculated by taking covariance across the genome sequence and expectation over the population, whereas both and involve the (co)variance over the population and the expectation across the genome. In each case, however, covariance and expectation are taken in different orders, leading to four distinct quantities. Taking expectation before covariance reduces the overall magnitude of the correlation since averaging destroys transient correlations, hence and

Indeed, the difference can be interpreted as measuring the portion of mutational correlations that result from equilibrium fluctuations within the population. By the mutational covariance identity, and, knowing that and are monotonically decreasing, and have different half-maximal positions, explains the pronounced peak of as a function of *γ*. Lastly, since is the variance in pairwise distances, it is determined by the recombination coverage rate, while as discussed above is determined by For this reason, the correlations measured by persist for higher values of *γ* by a factor of than the clusters measured by

We note that all of the dependencies that were measured for correlation functions, genetic diversity *d*, and variance shown in Figure 2 were qualitatively very similar in both KC and BSC models. This opens the question of how these functional forms depend quantitatively on the coalescent process and model parameters, which is the subject of the following section.

### Dependence of correlation functions and genetic diversity on the rate of recombination

Mathematical analysis of the models presented above was carried out in the context of a general, exchangeable coalescent model defined by a set of coalescence rates with which any subset of *k* out of *b* ancestral lines coalesce into a single common ancestor (see Appendix B). Exact solution for the correlation functions could be obtained numerically as the solution of a system of linear equations with coefficients that are linear combinations of *N*, *μ*, *γ*, and (see Appendix D). The exact solution is shown in solid lines in all panels of Figure 2, indicating excellent agreement between simulations and calculations.

An analytically tractable form, however, could be obtained only by approximating the full model dynamics. To this end, we introduced a mean-field approximation for the one-site transfer process, which affects the allelic state of exactly one of the substitution variables *X* and *Y* of two sites. Instead of accounting for one-site transfers explicitly, as we do in the full model, we approximate their effect as a mutational event that changes the substitution variable to its expected value *d* (see Appendix E). This approximation yields tractable expressions given below, which are shown as dashed lines in Figure 2. Generally, the agreement between approximate and full solutions is excellent for the KC model, while deviations are more pronounced under the BSC model, though mainly for mutational and rate correlation functions.

The mean-field expression for average pairwise distance is(9)where indicating that recombination has a very weak effect on the overall diversity (Figure 2A), since *r*, like *μ*, is orders of magnitude smaller than 1. Taking the mutational correlation is(10)where we define the one-site and two-site recombination coverage, respectively and with and where (KC) or (BSC). This expression confirms the basic intuition that two-site transfers build mutational correlation, while one-site transfers destroy correlation. Since it also predicts that initially increases with increasing *γ*, goes through a maximum, and eventually decays to zero. Taking above, we compute the value of *γ* at which is maximized (see Figure 2B), and obtain which has two regimes: if then while if then Calculation of the rate correlation yields(11)where which verifies that the overall shape of rate correlation is identical to mutational correlation, while its magnitude is smaller (see Appendix E).

The structure correlation takes the form(12)indicating that one-site transfer reduces correlation, and that two-site transfer plays a negligible role. The principal difference between and is their sensitivity to one- or two-site transfers: is strongly determined by while is insensitive to it. Lastly, the variance can be found by the large *l* limit of In this limit, two loci are sufficiently far that a two-site transfer cannot occur; hence, and Substituting in (12), we find(13)which confirms our prediction that the values of *γ* at which or are half-maximal (Figure 2B) differ by a factor of Moreover, we recapitulate the collapse of curves shown in Figure 2A (inset) as a function of *ρ*. Importantly, these results indicate that while the effect of recombination on *d* is negligible compared to mutation (since Equation 9), its impact on the fluctuations and correlations is substantial, and comparable to that of mutation (since Equation 13).

We generalized the model above to account for recombination barriers, which can limit the efficiency of transfer depending on the divergence of donor and recipient sequences (Fraser *et al.* 2007). Assuming that transfer rates decay exponentially as a function of sequence divergence, we simulated and analytically calculated the impact on transfer rates within the population, and derived the relevant correction to the mean-field approximation (see Appendix H). This analysis demonstrates how recombination barriers quantitatively affect the magnitude of correlations in DNA sequences, and is useful in refining intuition about the basic structure of the theory. When applying our method to bacterial genomic datasets, however, our inference procedure described below accounts for recombination barriers in its basic assumptions, and does not require further corrections.

### Mutational correlations in adapting populations

While our analysis can be applied for any exchangeable coalescent, genealogies at neutral sites are often linked to non-neutral sites that can undergo selection, which may violate the condition of exchangeability. In particular, in adapting populations in which fitness changes result from specific mutations that modulate individuals’ reproductive rates, exchangeability does not hold in general. To assess the magnitude of this effect on our predictions, we simulated an explicit sequence model with fitness effects in which both neutral and non-neutral sites exist within each genome (see *Materials and Methods*). Correlation functions were computed using the neutral sites, and, in Figure 3, we show the effects of selection on In the presence of DNA transfer, exhibits the characteristic decay that we observed for neutral models, across all values of the selection strength *s* (Figure 3A). Selection, however, diminishes the magnitude of correlations, leading to a more shallow decay.

We compared the simulation results of for adapting populations with our predictions based on the exact solution for either KC or BSC models. Since the analytical results require knowledge of which is model-dependent, we infer from the measured population diversity, *d*, according to Equation 26. We find that of adapting populations lies between the two curves defined by the KC and BSC solutions (Figure 3A). As the selection strength increases, the adapting population transitions smoothly from the KC to the BSC curve (Figure 3B). This is consistent with previous results that showed how BSC statistics are obtained in the strong selection limit (Neher *et al.* 2013). These limiting coalescence statistics are model independent (*e.g.*, KC describes both Moran and Wright-Fisher models); hence, the KC and BSC statistics are expected to provide the lower and upper bounds on correlation functions in many different models. We verified this to be the case using a substantially different adaptation model, in which fitness was controlled by a single locus (data not shown).

The effect of changing the recombination rate *γ* on in adapting populations is shown in Figure 3C, which shows a nonmonotonic behavior similar to that seen in neutral populations (Figure 2D). We also observed a smooth transition from BSC to the KC statistics with increasing *γ*. For low recombination rates, the population is subject to strong selection, and is correctly predicted using the BSC model. As recombination rates increase, recombination decouples the selective loci, breaking linkage between selective sites and their neutral neighbors, and thus reduces the effects of linked-selection on the statistics at neutral sites. Deviations from BSC statistics are more readily detected using (Figure 3, D and E), where, for either increasing selection strength or decreasing recombination rate, is noticeably lower than the analytical predictions.

### Parameter inference with sample selection bias

We have presented calculations of the correlation functions for randomly sampled individuals from a single large population. In principle, their functional forms could now be fit to measured correlations using bacterial sequencing data, provided that the sampled sequences constitute random individuals from the entire relevant population. In reality, there are strong biases associated with most sampling procedures. First, samples can often consist of very closely related strains, particularly when considering pathogen samples from outbreaks, or strain samples from specific geographic locations, which represent only a fraction of the total diversity within a species (Maynard Smith 1991). Second, sequencing of cultured samples involves selection bias for clones that are able to grow on specific media, which further reduces the sampled diversity. Third, different strains are often identified and grouped based on specific markers (*e.g.*, resistance), or phenotypes (*e.g.*, growth on selective media), which may or may not reflect their actual phylogenetic relationships, especially in the context of extensive recombination. Since all of these effects bias the sample composition in largely unknown ways, and thus most samples may not accurately reflect the genetic composition of the bulk population, accurate measurement of population genetic parameters must account for inherent sampling biases. Here, we present an approach to accurately estimate bulk population parameters from biased samples consisting of closely related sequences.

We have shown that recombination reduces only very slightly the diversity of the bulk population (Equation 9 and Figure 2A). However, in a sample of closely related sequences (where relatedness is measure by the coalescent time of the strictly vertical tree of cell divisions), recombination increases the diversity of the sample by importing DNA fragments from more distant sequences. We denote the average diversity of the imported fragments by *d*, which we take to be larger than the diversity within the vertically inherited portion of the sampled sequences. To calculate the average diversity within the sample, including both vertically inherited and horizontally acquired sequences, we consider a pair of sampled sequences that coalesce at time *t* ago. Vertically inherited portions have a mutational divergence while recombined portions have an expected divergence of *θ*, and corresponding diversity *d*. The recombination coverage for the given pair is and, as shown in Appendix F, for closely related sequences (*i.e.*, where ) that recombine fragments from a much larger population (*i.e.*, where ), the diversity of the sample is given by(14)The correlation functions and in the mean-field approximation can be expressed in terms of the function which is analytically derived in Appendix E, and takes the form(15)where is the bulk population’s recombinational divergence, *i.e.*, the average number of recombination events per site since coalescence. This function expresses the probability that two sites separated by distance *l* both have a substitution in a randomly chosen pair of individuals, and the above expression is the analytical result for the bulk population. When the same function is computed within a sample of closely related sequences, denoted by we obtain(16)which is accurate for Remarkably, we see that the quantity depends only on the bulk population parameters *θ* and *φ*, and on the mean fragment size and does not involve the sample-specific quantities and Fitting this functional form therefore allows us to infer the bulk population parameters, despite the biases inherent in natural population sampling.

As shown from simulation results (Figure 4A), the above expression exhibits a crossover between two regimes. For a substitution at both sites is most likely the result of an external two-site transfer; hence, we have which exhibits a hyperbolic decay (blue curve). As *l* increases, the sites are increasingly likely to have experienced a one-site transfer, which results in the linear decrease of , with a slope of (red line). We note that the expression computed for the biased sample (16) involves the fragment size while the same expression for the bulk population (15) does not. Intuitively, in a sample of closely related sequences, recombination coverage is small ( per site) because the coalescence time is short; hence, the boundaries of externally transferred fragments are relatively clear, and the mean fragment size can be deduced. In the bulk population, coalescence times are long (), and recombination coverage is large (), which results in overlapping fragments reducing the ability to infer fragment sizes. Sampling bias is therefore not exclusively a nuisance, since it presents the opportunity to infer the mean size of transferred fragments.

To test how well one can infer bulk population parameters based on strongly biased samples, for each simulated population we sampled a small number of closely related sequences. As shown in Figure 4B, in these biased samples, the diversity (filled circles) is a small fraction of the bulk population’s diversity Depending on the recombination rate, these samples represent from ∼ to of the total diversity, *i.e.*, We calculated for the samples, and by fitting Equation 16 to the data, we inferred the bulk population parameters *θ* and *φ*, and which are shown in Figure 4B. These results indicate that over a realistic range of sample diversity and population transfer rates one can accurately infer bulk population parameters using strongly biased samples.

### Using correlated mutations to infer recombination rates and global diversity of bacterial populations

We used whole genome sequences from a recently emerged multidrug-resistant *Escherichia coli* clone, sequence type 131 (ST131) containing 185 isolates (Price *et al.* 2013; Petty *et al.* 2014; Ben Zakour *et al.* 2016), and 1216 *Streptococcus pneumoniae* isolates in a longitudinal pneumococcal carriage study (Chewapreecha *et al.* 2014), two species that are well known for horizontal transfer and recombination (Lorenz and Wackernagel 1994). For each dataset, and several subtypes or clades, we calculated sample diversity and correlation functions, and inferred the bulk population parameters (Table 2) by fitting (see Appendix G for details of sequence analysis). As shown in Figure 5, all correlation functions of synonymous substitutions exhibit a decay pattern in both species, indicating the presence of recombination. The close agreement between measurements (circles) and predictions of the structure and rate correlation functions (solid green and blue lines) indicates that our model captures the essential features of the underlying transfer process across different clades of both species.

Parameter fitting yielded values of mean fragment size mutational divergence *θ*, and recombinational divergence *φ* (Table 2). For each sample, we directly measure the sample diversity, and using the fitted value of *θ* we calculate the global diversity, (Equation 9), and the sample’s recombination coverage (Equation 14). From the inferred parameters, we calculate and use the identity to obtain the value of The sample’s mutational divergence provides a measure of the age of the sample since coalescence, and can be converted to generations, if the mutation rate *μ* is known, *i.e.*, We also calculate the ratio which gives the relative rate of recombination to mutation in the population. Another useful quantity is the ratio of the number of SNPs that are brought into a sample by recombination to the number of SNPs that are due to *de novo* mutations. This ratio is given by which is the bulk population’s recombinational coverage, and thus it does not depend on the sample itself.

Table 2 lists the inferred values of the model parameters. As shown above, our inference method is applicable for closely related samples of sequences, in which and and, indeed, these conditions hold for the inferred values, indicating that the samples consist of closely related sequences that exchange DNA fragments with a much more diverse population. In *E. coli*, when considering all sequences as a single sample, we infer the bulk population’s mutational divergence as ∼ while sequences within the sample differ by Yet the sample’s mutational divergence, indicates that only ∼ of the genome has mutated since the coalescence, such that the vast majority of the sample’s diversity has been acquired from external transfer events. Similarly, using the sample of all *S. pneumoniae* sequences, the bulk population exhibits divergence, while sequences within the sample differ by and the sample’s mutational divergence, indicates again that diversity has been acquired largely from external sequences. The high population diversities also imply that the recombination barrier within species might not be particularly strong, since recombination evidently can still occur between sequences with divergence as large as 10%.

When analyzing separate clades within each species, we infer bulk population divergence levels, *θ*, that are relatively similar, with in *E.coli*, and in *S. pneumoniae*, despite large variations in sample diversity among clades. This result is consistent with the separate isolates having access to a single shared gene pool via recombination. Likewise, the average sizes of recombination fragments are consistent across clades in both species, ranging from 0.5 to 2 kb, which is comparable to the length of a typical gene. One exception is clade BC4-6B of *S. pneumoniae*, which has an inferred size of 23.5 kb. It should be noted that our fitting process did not assume any particular distribution for sizes of recombination fragments, provided that *l* is much smaller than transferred fragment sizes, a condition that is satisfied here since fitting was performed for *l* < 150 bp.

Notably, the bulk population’s recombination coverage, *ρ*, ranges from 73 to 240 for clades of *E. coli*, and from 48 to 2200 for clades of *S. pneumoniae*, indicating that, in each species, two randomly chosen sequences would have recombined so extensively since divergence that transfers would cover each recombining region many times over. As mentioned above, the value of *ρ* inferred for each sample is equal to the ratio of the number of polymorphisms due to recombination *vs.* the number of *de novo* mutations. In both species, indicating that the diversity introduced by recombination dominates sample diversity, which is consistent with previous results (Price *et al.* 2013; Chewapreecha *et al.* 2014; Petty *et al.* 2014). Yet, since *ρ* varies substantially from clade to clade, it suggests that real variability exists between subpopulations as far as the portions of the shared gene pool that each is able to access. Indeed, the inferred recombinational divergence, *φ*, shows substantial variation among clades in both species, with for clades of *E. coli* and for clades of *S. pneumoniae*. Since mutational divergence is relatively consistent across clades, while varies widely from 0.7 to 3.87 in *E. coli*, and from 0.65 to 6.55 in *S. pneumoniae*, it appears the variation in *φ* is not related to differences in population size, and likely arises from differences in access to the shared gene pool.

Lastly, we observe large variation across clades of the sample mutational divergence which provides a measure of each sample’s age since coalescence. Within *S. pneumoniae*, varies from to *i.e.*, nearly two orders of magnitude, while, in *E. coli*, varies from to or one order of magnitude. The entire collection of *S. pneomoniae* sequences is in fact substantially older than any given clade, with while in *E. coli* the sample as a whole has which is very similar to the age of its oldest clades. These ages can be converted to generations using known values of the mutation rates in each species. For example, in *E. coli* (Lee *et al.* 2012), and, using this value, the age of the entire sample is found to be 55,000 generations. While generation times cannot easily be assessed in bacteria, for *E. coli*, we conservatively assume one generation per day, using which we estimate that the sampled strains diverged ∼150 years ago. We note that, since laboratory measurements of *μ* may not reflect the mutation rates in natural environments—and variation in mutation rates could exist between strains, conditions, and genomic regions—mapping mutational divergence to generations remains an open problem in the field.

## Discussion

We presented a mutational correlation-based methodology for quantifying recombination rates in bacterial populations, which we showed can provide accurate estimates using whole-genome data. Our analytical calculation of the correlation functions enables accurate and rapid fitting to infer DNA recombination parameters, as well as a self-consistency check on the model provided by comparing the predicted correlation functions with measurements (see Figure 5). By decomposing the mutational correlation function into a sum of several distinct terms (Equation 8), we provided an intuitive description of how various types of correlations are related within a population, partitioning observed correlations into meaningful components. We showed by simulations and theory how the correlation functions depend on the population genetic parameters (Figure 2).

We compared our inferred parameters to those of previous work. In *E. coli*, we found that the ratio of recombinational to mutational events inferred by our models is less than that inferred in Touchon *et al.* (2009) (2.5), more than estimated in Didelot *et al.* (2012) (1.0), and much more than that determined by Dixit *et al.* (2015) (0.31). However, as we noted in our clade-by-clade analysis, these values can vary substantially even between clades of the same species. Since previous studies used different sets of strains, it is not surprising that there exists a range of estimates, and, what is remarkable, given the different methodologies and datasets, is that the various methods are all within an order of magnitude of each other. Our approach has the distinct advantage that it is extremely rapid and computationally efficient, and can therefore handle very large datasets with ease. Since our method is formulated using a population genetics model, it provides meaningful connections between a large body of theoretical work, and measurable quantities such as mutational correlation functions. Importantly, we identified sample selection bias, which is prevalent in most datasets, as a potential source of errors in parameter estimation, and by analyzing the correlation functions for samples of closely related sequences, we showed how to account for its effects.

In our analysis of *S. pneumoniae*, using the genomic sequences from Chewapreecha *et al.* (2014), we found the ratio ranges from 0.3 to 6.6 across clades, while measured across the set of all clades These values are larger than the previous estimates, which range from 0.1 to 0.35, which were performed by determining recombination events by contiguous clusters of SNPs (Chewapreecha *et al.* 2014). As noted in previous studies (Croucher *et al.* 2011, 2015; Chewapreecha *et al.* 2014), calling recombination events on a locus-by-locus basis is confounded by overlapping recombination events, which often cannot be detected, and is strongly affected by the age of the sample. For this reason, such methods are typically conservative, and their estimates are in effect lower bounds. Despite the order of magnitude difference in our parameter estimates, our method recapitulates the basic biological finding of the previous work, which showed that encapsulated pneumococci have consistently lower rates than nonencapsulated strains. This result is seen in the BC3-NT clade (Table 2), where serotype 14 isolates are encapsulated (), while the NT isolates are nonencapsulated (). In this regard, our analysis attributes a much bigger effect to encapsulation, which we infer reduces recombination rates by more than eightfold, whereas the previous analysis inferred a less than twofold effect.

The distinct correlation signature of recombination in bacterial genomes enables us to infer several additional quantities that are of broad interest for studies of microbial population structure and phylogeny. First, using closely related genome samples, we are able to infer the bulk population’s diversity, *d*, and divergence, *θ*. Remarkably, our inferred divergence for the bulk population of *E. coli* () closely matches the species’ diversity measured across a collection of global samples ( Dixit *et al.* 2015), which provides additional confirmation of the methodology. Second, our approach provides an estimate of the age of each sample without comparison to an outgroup. Existing methods attempt to partition SNPs into internal to a sample (*de novo* mutations) *vs.* external SNPs (arising from recombination), and thus encounter difficulties with older samples or overlapping recombination fragments. In contrast, our method compares the decay of mutational correlations with the standing diversity of the sample, and uses the predicted quantitative relations to infer the various rates. Without partitioning SNPs, and thus avoiding the known technical difficulties, we are able to infer the rate of *de novo* mutations in a sample, *i.e.*, its mutational divergence and therefore its relative age.

Within the populations assayed here, we find a wide range of sample ages, *e.g.*, the oldest *S. pneumoniae* clade is ∼80 times the age of the youngest clade. We do not find any meaningful correlation between a sample’s age, and the recombination parameters ( *ρ*) or diversity (*θ*, *d*) for the bulk population with which it exchanges DNA, indicating that our method correctly models the overall scaling of divergence with time and appropriately removes such effects. Thus, we believe our method provides a reliable tool for comparing sample ages, although such comparisons inherently assume a relatively constant molecular clock, *i.e.*, a value of *μ* that does not vary substantially between samples of the same species. At the same time, we expect that comparisons between species may be far less meaningful, since neutral evolutionary rates can differ substantially due to basic molecular differences and environmental factors that influence mutation rates.

Finally, given the prevalence of recombination in bacteria, our analysis can be used to quantify the utility (or futility) of inferring a phylogenetic tree for a given sample of DNA sequences. Construction of a tree based on sequence similarity requires some portion of the sequence to contain vertically inherited (*i.e.*, clonal) information. Each recombination event partially degrades the vertical signal in the data, and the recombination coverage determines the proportion of the total sequence that has been degraded. Once reaches a value of 1, we expect no remaining vertical signal in the data; hence, such a value indicates that phylogenetic reconstruction of the sample would be futile. The values of inferred from the samples we analyzed are all relatively low, ranging from ∼0.1 to 1.0% coverage (Table 2), indicating that there exists a vertical signal in the data. However, reliable tree construction additionally requires a sufficient number of *de novo* mutations that track the vertical signal, quantified by the value of times the genome size *L*. For certain samples, the number of SNPs may be too low to construct a meaningful tree. Additionally, even when a sufficient number of SNPs is expected to be present, one must reliably identify the SNPs that correspond to the clonal signal, which typically constitute a small fraction of the sample diversity,

For example, for the largest clade of the *E. coli* sequences (clade C, Table 2), indicating that, for each pair of sequences, > of the genome has been vertically inherited, which seems like good news for tree building. However, for this sample, meaning that between any pair of sequences <10 single nucleotide differences correspond to the vertical signal. These few crucial differences must be recognized from among 1000 differences that exist between any pair of individuals in the sample, the majority of which are due to recombination events. Finding these few informative needles in the haystack of SNPs is the major challenge for sequence-based phylogeny, and, while the problem may be quite difficult, at least in such cases one is relatively certain that a vertical signal exists in the data.

Much more severe, and possibly insurmountable, problems exist in establishing reliably the phylogeny of sequences from the bulk population of a bacterial species, where the relevant recombinational coverage is measured by *ρ*. Across the populations that we analyzed, indicating that, for any pair of sequences sampled from the bulk population, their genomes have been covered many times over by recombinational transfers since coalescence. We expect no vertical signal would remain in the data. Since the value of *ρ* represents a genome-wide average, it is of course possible that some genomic regions could exhibit extremely low recombination rates, and thus have effectively much lower values of *ρ*, which may enable phylogenetic inference in the bulk population, assuming for these regions. In the best case scenario, one would again be faced with identifying a tiny fraction of SNPs that capture the vertical signal, only now in a much smaller portion of the genome. Our results therefore indicate that phylogenetic inference in bacteria may, in most cases, be an extremely challenging problem, and our methodology provides the basic measures that quantify the (f)utility of tree building in any given case.

In conclusion, our new approach for inferring within-population recombination rates, based on correlation functions, provides a framework for measurements in a wide range of sequencing datasets. Further generalization of our method could incorporate variability in recombination rates across genomes, as well as modeling explicitly population subdivision and community structures. We expect that this framework, and its generalizations, will find fruitful application in inferring recombination rates within species, as well as providing a useful starting point for analyses of much more complex sets of data such as metagenomic samples.

## Acknowledgments

We wish to thank Sergei Maslov, Erik van Nimwegen, Jane Carlton, Alexander Grosberg, Charles Peskin, Matthew Rockman, Wei-Hsiang Lin, and Long Qian for valuable discussions, as well as three anonymous referees for their comments on the manuscript. This work was funded by a Human Frontier Science Program Young Investigators’ grant.

## Appendix

### A. Covariance Decomposition and the Law of Total Covariance

The law of total covariance states that(17)As in *Materials and Methods*, we define random variables for two sites separated by distance *l* as and where *i* is a site, and *k* is a sequence pair. We note that since the expectation is taken over the same set of sites in both cases. The expression can therefore be written equivalently as Together with Equation 5 and the law of total covariance (setting ), we have(18)Similarly, using Equations 6 and 7 (setting ), we obtain(19)The total covariance, is the covariance of the substitution variables at any pair of sites separated by distance *l* (*i.e.*, over all sites and all sequence pairs). It can be computed either by conditioning on the sequence pair or by conditioning on the site; either way the result is of course the same, which yields the covariance decomposition(20)While such a decomposition would be possible for any two-dimensional arrangement of random variables (*e.g.*, magnetic spins on a lattice), one interesting feature in our population genetics application is that a variance appears on the left-hand side rather than a covariance. This happens because the random variables and turn out to be identical, which is due to the circular genome. The same result also holds to excellent approximation for a very large genome, with corrections of order Therefore we obtain the mutational covariance inequality

### B. Coalescence Rates

The general exchangeable coalescent model is defined by a set of coalescence rates, , with which any subset of *k* out of *b* ancestral lines coalesce into a single common ancestor. These rates satisfy a self-consistency relation (see Pitman 1999; Sagitov 1999; Schweinsberg 2000; Brunet *et al.* 2008). For the Moran model, we have for all since is the total number of pairs that could coalesce; and, since only pairwise mergers are possible, for all (Moran 1958). For the Schweinsberg model, we obtain(22)and, since its genealogy follows the BSC, we have(23)For convenience, we take for the Moran model, and for the Schweinsberg model, so that both models have identical Setting *G* arbitrarily only changes the unit of all times, and does not change any results below. Explicit coalescence computations are not made for the adapting population model; instead, simulation results in the limits of weak and strong selection are compared with theoretical predictions of the KC and BSC models. This requires matching the overall coalescence rates of simulations with theory, which we accomplish by measuring directly in simulations.

### C. Calculation of Population Diversity

We consider substitutions at a single site on a randomly chosen pair of genomic sequences and as shown in Figure 6A, and analyze the dynamics of where *p* is the probability that these two sequences are identical at that site. We trace their history back one instant and compute the rates and outcomes of different events that affect the sequences, including coalescence due to reproduction, DNA transfer, and mutation.

First, a coalescence event in which one of or reproduces and replaces the other yields identity at the site, indicated by a star in Figure 6A corresponding to the state and occurs with rate The change in the state vector is then Second, an internal DNA transfer, which occurs between the pair of sequences in question (Figure 6A), yields the same result as coalescence, and thus the same Indeed, such an internal DNA transfer is analogous to a reproduction event in the sense that a piece of DNA “reproduces” from the donor sequence, and replaces its counterpart in the recipient sequence. The rate of such events is An external DNA transfer, on the other hand, in which the piece of DNA is copied from a third sequence or does not change the probability that the two sequences are identical at the site, and thus does not change the state vector, *i.e.*, Third, we assume mutations occur independently at each site, and uniformly along the sequence. A mutation causes a change that can be represented as a linear operator acting on The overall rate for mutations for two sites is and the linear operator is given by(24)Summing the possible changes due to each event multiplied by their rates, we find(25)Setting the derivative to zero, we obtain the steady-state solution for (26)where For the case this yields the well-known expression for heterozygosity in the Moran model.

### D. Calculation of Correlation Functions

We consider a pair of sites, *a* and *b*, separated by distance *l*, in a randomly chosen pair of genomic sequences, and The sequences can differ at 0, 1, or 2 sites, and we write the probability distribution of these three possible *states* as where the components are non-negative numbers summing to 1. More generally, a pair of sites can be configured across 2, 3, or 4 sequences, as shown in Figure 6B. Due to the effect of genetic linkage, the distribution of states depends on the configuration. We therefore denote by the distribution of states for 2, 3, or 4 randomly chosen sequences.

The correlation functions and as well as are determined by the values of four different terms (see Equations 4–7):(27)(28)(29)(30)where For large *N*, we obtain(31)Given that the maximal size of transferred segments, is far smaller than the genomic length (), is determined largely by the flat tail of for which we denote by (32)This corresponds to loci that are sufficiently far such that they are not affected by two-site transfers, which means and therefore Similar approximation yields(33)Based on Equations 4–7, the calculations of and are reduced to the analysis of (34)(35)(36)(37)To analyze the dynamics of we randomly choose *i* sequences, trace their history back one instant and compute the rates and outcomes of different events that could affect the sequences. These events include coalescence by reproduction, DNA transfers, and mutations. Mutations have the same effect on for as follows. Since two pairs of sites are involved, the total rate of mutations is and the linear operator of mutations acting on can be written as follows:(38)where

#### Dynamics equation for

Possible events that could affect beside mutations are shown in Figure 6C. Coalescence events occur at rate causing identity at both sites, and the state vector becomes hence, For DNA transfer events, as shown in Figure 6C, we need to consider (i) one-site internal transfers, (ii) one-site external transfers, (iii) two-site internal transfers, and (iv) two-site external transfer. (i) A one-site internal transfer, which occurs at rate results in a coalescence at one site, while leaving the other unchanged. The state vector becomes where *d* is the probability that two sequences differ at the unchanged site; hence, (ii) An external one-site transfer creates a configuration that involves three different sequences, and the state vector becomes thus This occurs with rate since each site can receive DNA from any of the sequences not in the given pair. (iii) A two-site internal transfer causes identity at both sites, *i.e.*, Lastly, (iv) the two-site external transfer does not change the state vector, since sequence labels are exchangeable. Multiplying the possible changes by their rates yields

#### Dynamics equation for

Possible events that could affect are illustrated in Figure 6D. Coalescence can affect the configuration in three different ways: (i) all three sequences could coalesce simultaneously, yielding identity at both sites; hence, (ii) the two-site sequence , and one of the two single-site sequences or , could coalesce, causing identity at one site, and leaving the other site unchanged; the state vector becomes and or (iii) the two single-site sequences and could coalesce onto an ancestral sequence that carries both sites; hence, the state vector becomes and The rates for these events are and respectively.

Since an internal transfer involves choosing two sequences (one as the donor and the other as the recipient), there exist two different types of transfers (Figure 6D): ones between the two-site sequence and one of the two single-site sequences or , and the others between the two single-site sequences and . The results of these transfers are exactly the same as those of coalescing two sequences discussed above (ii and iii, see above), such that they change the state vector to and and occur with rates and respectively. An external one-site transfer from an external sequence to one of the two single-site sequences changes only the sequence labels, but not the configuration, thus On the other hand, when an external one-site transfer occurs into the two-site sequence, the configuration becomes that of hence, with rate Multiplying rates by changes, we find

(41)#### Dynamics equation for

Possible events that could affect are illustrated in Figure 6E. Since four sequences are involved, each of which carries only one site, four different types of coalescent events are possible: (i) simultaneous coalescence of all four sequences, which yield identity at both sites and (ii) simultaneous coalescence of three sequences, producing identity at one of the two sites; hence, (iii) coalescence of two sequences with sites at the same locus, yielding identity at one of the two sites; hence, and (iv) coalescence of two sequences with two different sites, leading to an ancestral sequence that carries both sites; the state vector then becomes and The rates of these events are and respectively.

For an internal DNA transfer, the two chosen sequences may carry their single site either at the same locus, or at different loci, and, therefore, there are two types of internal DNA transfers (Figure 6E). The results of these two types of transfer on the state vector are exactly the same as for coalescence of two sequences discussed in types (iii) and (iv) above. The rates are and respectively. Since in the configuration of each of the four sequences contains only one site and sequence labels are exchangeable, external transfers do not change the state vector, Multiplying these rates by changes, we find

(42)#### Exact solution for steady-state values of

Combining the above, we express the equations as a system of nonhomogeneous coupled linear differential equations:(43)in which *I* is a identity matrix, and describe transitions from and to and respectively, with and and *S* is the operator for reproduction and transfer, which has the following form:(44)The tridiagonal form of *S* leads to a block tridiagonal matrix of the Kronecker product, which can be written in a general form(45)In the block matrix *A*, the off-diagonal blocks are scalar matrices, and the diagonal blocks are sums of scalar matrices and the mutation operator, *M*, which is diagonalizable. Importantly, *A* is a block diagonally dominant matrix in the norm induced by the -norm, which guarantees that *A* is nonsingular (Feingold and Varga 1962). The inverse of *A* can be computed efficiently using an algorithm given by (Ran and Huang 2006), which yields the steady-state solution of Equation 43. Using the steady-state solution for in Equations 27–33 and Equations 4–7 yields the exact solution for correlation functions and variance. The resulting solutions precisely predict the simulation results for both the Moran model and the Schweinsberg model, as shown in Figure 2. Since their exact expression is unwieldy, we seek an approximation that yields a simpler form for further analysis in the following section.

### E. Mean-Field Approximation and Solutions

We note that the one-site external transfers make the transitions between a cyclic graph (Figure 6F, dotted arrows), and thus complicate the solutions of the linear equations (Equations 40–42). One possible approximation is thus to remove the transitions and and to account for them implicitly as mutations. As shown in Figure 7, an external transfer involves three sequences with four possible genealogical structures. After the external one-site transfers, the genealogical relationships of the two sequences in question will change (Figure 7), but the coalescent time will only change in two of the four genealogical structures, which happens with probability If the coalescent time does not change [as in cases (i) and (ii), Figure 7], exchangeability implies that there will be no change in the probability distribution for the pair of sequences in question. If the coalescent time changes [as in cases (iii) and (iv), Figure 7], the probability distribution changes, and our mean-field approximation is to assume that the two sites will then differ with probability *d*, *i.e.*, the average pairwise distance in the population. Accordingly, the operator for these transfers, can be written as(46)We can combine these external one-site transfers and point mutations to form an effective mutation operator, which is for or for The explicit external one-site transfers appearing in Equation 43 are removed, and replaced by the appropriate mutation operator, yielding the simplified solutions given below.

#### Solution for the Moran model

We find the solutions of which we will use later for the solutions of correlation functions, in the mean-field approximation for the Moran model as follows:(47)(48)(49)Given that and we can simplify the solutions further:(50)(51)(52)Using the mean-field solution above for and Equation 34, we obtain the mean-field solution for (53)Similarly, using the mean-field solution for and Equation 36, we find the solution for (54)where

For using the solutions for and and Equation 35, we obtain(55)Lastly, we compute the variance

(56)#### Solution for the Schweinsberg model

We find the solutions for in the mean-field approximation for the Schweinsberg model as follows:(57)(58)(59)for

Given the solutions above, we find(60)(61)(62)where

#### Mean-field approximation and coalescent theory

The mean-field results for can equivalently be obtained using coalescent theory, which we illustrate here for the case of Given two pairs of sites on a pair of sequences, a coalescent event could involve both of the two pairs with rate where the last term is the rate of coalescence due to two-site transfers between the two sequences, yielding the state or it could involve only one of the two pairs with rate which is the rate of coalescence due to internal one-site transfers between the two sequences, yielding the state The coalescent time *t* thus follows an exponential distribution with rate Given the coalescence time, one can compute by propagating forward in time the process of mutations and external one-site transfers for a time *t* starting from the two ancestral states, and We note that external two-site transfers that occur during this time impact both sites, and are thus equivalent to exchanging one of the sequences for a different individual in the population. Since we study an exchangeable coalescent, these events do not change the distribution of coalescent times, and therefore have no impact on the calculation. We define a combined mutational operator that includes both mutations and external one-site transfers, and use it to propagate forward in time, while taking the expectation over the coalescent time:(63)One can check that this is the same equation as the steady-state equation for in the mean-field approximation (see Equation 40). Computing the matrix inverse and applying to and therefore yields the same expression for given in Equation 47.

### F. Analytical Forms for Parameter Inference in Biased Samples

We consider a pair of individuals with coalescence time We assume that *t* is sufficiently short such that each pair of sites in the genome separated by a distance has been affected by, at most, one mutational or recombination event. For portions of the genome that were not affected by recombination, the average per site divergence of the pair of sequences is given by The probability that a single site is affected by recombination is Since the donor is chosen randomly from the entire population of size *N*, we can neglect transfers between the given pair of individuals, which have probability After a transfer, the expected divergence in the recombined region is the bulk population diversity, *d*. Accounting for the rates and effects of mutation and recombination events, we calculate the expected diversity of the pair of sequences as(64)Recalling that and where we consider *N* to represent the size of the bulk population with which sequences can recombine (*e.g.*, that of an entire species), we can assume that since typically we have while is usually on the order of a kilobase or longer. These assumptions are later tested for self-consistency once fitting has been performed. Thus, we have so that which allows us to approximate by considering only the contribution of recombination:(65)We now consider the calculation of *i.e.*, the probability that substitutions have occurred at both sites, where we distinguish the value of the population, which corresponds to the expectation for two sequences chosen at random from the entire population of size *N*, and that of the sample, which we denote as and is calculated for the pairs of individuals within the given sample. Since *t* is sufficiently short, the probability that two or more events occur at the two sites in question is negligible. Thus, the only event that can introduce substitutions at both sites over the timescale *t* is a two-site transfer from the bulk population into the sample, which occurs with probability We obtain(66)and, substituting the expression for yields(67)We define(68)and, given that we obtain(69)If the distance between sites *l* is smaller than typical transferred fragment sizes, we have and using which, we find

### G. Application to Bacterial Sequence Analysis

We obtained a curated list of 185 *E. coli* sequence type 131 (ST131) isolates from two recent studies (Price *et al.* 2013; Petty *et al.* 2014; Ben Zakour *et al.* 2016), and a total list of 1216 *S. pneumoniae* isolates, which consisted of all isolates from the seven largest clusters in a longitudinal pneumococcal carriage study (Chewapreecha *et al.* 2014).

For each species, we applied a reference-based approach to generate whole-genome alignments from Illumina read data. We used *E. coli* strain EC958 (EMBL accession code HG941718), and *S. pneumoniae* strain ATCC700669 (EMBL accession code FM211187), as the reference genomes, and mapped read pairs against them using SMALT version 0.7.6 (https://sourceforge.net/projects/smalt/) with default settings. The resulting alignment was then processed with Samtools (Li *et al.* 2009) and FreeBayes (Garrison and Marth 2012) to generate a consensus sequence. To call bases at each position, we required at least two reads spanning in each direction (*i.e.*, for a total minimum read depth of four), and a ≥75% consensus on the major allele. The base quality at the site had to be ≥50, and the mapping quality had to be ≥30, on the Phred scale. When a called base was an SNP, in addition, we required it to be supported by both FreeBayes and Samtools with quality scores ≥30. All other sites that failed to pass these criteria, as well as insertions and deletions, were masked as gaps. Finally, for each clade in a species, the resulting consensus sequences were combined to generate a whole-genome alignment.

To avoid ambiguity in assigning distances along the genome due to the potential for genome rearrangement, the resulting whole-genome alignment was split into multiple gene alignments, and each gene alignment was further split into multiple alignment blocks with a fixed length of 300 bp. In each block, we removed sequences in which gaps constituted >2% of the total length. After filtering out such sequences, we additionally excluded alignment blocks consisting of less than five sequences. Using the filtered alignment blocks, we compared DNA sequences pairwise. For each pair of sequences *k*, we obtained a substitution profile at the third base of each codon. To reduce the impact of selection, we masked positions containing nonsynonymous substitutions, and did not use them in the analysis, but preserved their genome coordinate so that physical distances remained unchanged.

Using the substitution profiles, we calculated the sample diversity, and the correlation functions and To infer population parameters, we fit using the first 50 data points (codons) in the same manner as we tested using simulation results (Appendix F and Figure 4). To predict we used the relations in Equations 34–36 with the fitted form of and used where *q* was determined by the coalescent structure of the sample (*e.g.*, for BSC statistics). When fitting *q* (dashed blue and green lines in Figure 5), we performed linear regression using the relation with the fitted form and the measured values of

We note that, in principle, one could infer the full distribution of transferred fragment sizes by using the structure correlation function. One can invert Equation 12 to obtain in terms of and, by differentiating, obtain In practice, however, accurately measuring the curvature of the correlation decay would require a very large amount of data, which we found to be well beyond the sizes of available datasets (data not shown). Nevertheless, as we discussed above, the mean fragment size can be efficiently inferred using currently available data.

### H. Recombination Barriers and Population Structure

In the above analysis and simulations, the transfer process was unconstrained, such that each genome can recombine with any other sequence within the population with equal rate. In reality, however, barriers exist that prevent free genetic exchange among individuals within a bacterial population. For example, the mismatch-repair system inhibits interspecies recombination, and reduces recombination frequencies among distantly related individuals (Fraser *et al.* 2007). Moreover, the classical pathway for homologous recombination involves RecA-mediated homology recognition between donor and recipient sequences (Kuzminov 1999). In several bacterial species, laboratory studies have found a log-linear relationship between sequence divergence and the frequency of recombination (Fraser *et al.* 2007).

#### Recombination barriers in simulation

To assess the magnitude of recombination barriers on our estimation of transfer rates, we modified our simulations as follows. After randomly choosing a pair *k* of genomes for recombination, we calculate their pairwise distance, and we allow recombination to occur with probability where *b* is the strength of the recombination barrier that reduces transfer efficiency. During the simulations, we record the total number of successful transfer events across the population. This process limits recombination among distantly related individuals, and thereby reduces the overall rate of successful transfers within the population (Figure 8).

We compared the overall rates of successful transfers, measured directly from simulations with our estimations based on the mean-field approximation. Due to the recombination barrier, and its effect on the correlation functions, inference of transfer rates based on fitting of measurements from the bulk population will underestimate the rates of successful transfers by a factor of an increasing function of the product of population diversity and the recombination barrier (Figure 8; and see derivation below). Thus, a significant deviation of our estimate requires both high population diversity and a large recombination barrier, and one can correct the estimates if the value of *b* is known for a given species. However, our inference procedure for fitting genomic samples (Appendix F) implicitly accounts for transfer efficiency in the way it models the sample diversity, which avoids having to explicitly measure and correct for recombination barriers in different datasets.

#### Rate of successful transfers in the population

Here, we calculate the average rate of successful transfers, given the rate of attempted transfers, *γ*. For any sequence X, transfers occur from donor sequences, D, whose coalescence time with X is a random variable, *t*, which is exponentially distributed with rate The average divergence between sequence X and D is then given by and the probability that transfer is successful is therefore Integrating over *t* yields the mean rate of successful transfers,(71)where measures the effective strength of the recombination barrier.

#### The impact of transfer efficiency on the mean-field approximation

As detailed in Appendix E, we obtained the mean-field approximation by studying the distribution of possible phylogenetical trees involved in an external one-site transfer (Figure 7). A recombination barrier changes this distribution—transfers are more likely to occur among closely related sequences rather than among distant ones. It thus overweights the first two types of trees shown in Figure 7, and underweights the last two trees, which changes the value of *w*, and leads to an underestimate of the rate of successful transfers.

To correct this bias, we study the effect of transfer efficiency on the last two trees in Figure 7. We consider a pair of sequences, X and Y, and an external one-site transfer from a donor sequence D to X. When we trace the three lineages backward in time, there exist two sequential mergers among them, and we let and denote the coalescent times of the first and second mergers, respectively. We compute the probability of observing tree (iii) with specified values of and The first merger occurs at between branches D and Y, while all other possible mergers (D-X, X-Y, or D-X-Y) do not occur; the associated probability is thus After the first merger, there remains time until the second merger, during which the sample contains two lineages; hence, the associated probability is Multiplying the two probabilities, and using the identity the probability of observing the given tree is(72)Since the two trees (iii) and (iv) have the same topology, their probability is the same. In both cases, the divergence between D and X, which is given by determines the transfer efficiency. We can therefore calculate the value of *w*, which is the probability of observing one-site transfers as trees (iii) and (iv), by integrating over the coalescence times:(73)(74)With no recombination barrier (*i.e.*, ), we obtain the original probability In the mean-field expressions for the correlation functions (Appendix E), the population recombination rate occurs only in the combination of parameters Using the original mean-field solution to fit the population transfer rate (*i.e.*, using ) would thus yield a value which can be written as indicating that we would underestimate the rate of successful transfers, by a factor In the Moran model, by noting we find the factor to be while in the Schweinsberg model the factor is

## Footnotes

*Communicating editor: J. Lawrence*

- Received March 21, 2016.
- Accepted December 15, 2016.

- Copyright © 2017 by the Genetics Society of America