## Abstract

As multi-individual population-scale data become available, more complex modeling strategies are needed to quantify genome-wide patterns of nucleotide usage and associated mechanisms of evolution. Recently, the multivariate neutral Moran model was proposed. However, it was shown insufficient to explain the distribution of alleles in great apes. Here, we propose a new model that includes allelic selection. Our theoretical results constitute the basis of a new Bayesian framework to estimate mutation rates and selection coefficients from population data. We apply the new framework to a great ape dataset, where we found patterns of allelic selection that match those of genome-wide GC-biased gene conversion (gBGC). In particular, we show that great apes have patterns of allelic selection that vary in intensity—a feature that we correlated with great apes’ distinct demographies. We also demonstrate that the AT/GC toggling effect decreases the probability of a substitution, promoting more polymorphisms in the base composition of great ape genomes. We further assess the impact of GC-bias in molecular analysis, and find that mutation rates and genetic distances are estimated under bias when gBGC is not properly accounted for. Our results contribute to the discussion on the tempo and mode of gBGC evolution, while stressing the need for gBGC-aware models in population genetics and phylogenetics.

THE field of molecular population genetics is currently being revolutionized by progress in data acquisition. New challenges are emerging as new lines of inquiry are posed by increasingly large population-scale sequence data (Casillas and Barbadilla 2017). Mathematical theory describing population dynamics was developed before molecular sequences were available (*e.g.*, Fisher 1930; Wright 1931; Moran 1958; Kimura 1964); now that ample data are available to perform statistical inference, many models have been revisited. Recently, the multivariate Moran model with boundary mutations was developed and applied to exome-wide allele frequency data from great ape populations (Schrempf and Hobolth 2017). However, drift and mutation are not fully sufficient to explain the observed allele counts (Schrempf and Hobolth 2017). It was hypothesized that other forces, such as directional selection and GC-biased gene conversion (gBGC), may also play a role in shaping the distribution of alleles in great apes.

Directional selection and gBGC have different causes but similar signatures: under directional selection, the advantageous allele increases as a consequence of differences in survival and reproduction among different phenotypes; under gBGC, GC alleles are systematically preferred. gBGC is a recombination-associated segregation bias that favors GC-alleles (hereafter, strong alleles) over AT-alleles (hereafter, weak alleles) during the repair of mismatches that occur within heteroduplex DNA during meiotic recombination (Marais 2003). gBGC has been studied in several taxa including mammals (Duret and Galtier 2009; Romiguier *et al.* 2010; Lartillot 2013), birds (Webster *et al.* 2006; Weber *et al.* 2014; Smeds *et al.* 2016; Corcoran *et al.* 2017), reptiles (Figuet *et al.* 2015), plants (Muyle *et al.* 2011; Serres-Giardi *et al.* 2012; Clément *et al.* 2017; Liu *et al.* 2018), and fungi (Pessia *et al.* 2012; Lesecque *et al.* 2013; Liu *et al.* 2018). However, apart from some studies in human populations (Katzman *et al.* 2011; Glémin *et al.* 2015; Pouyet *et al.* 2018), a population-level perspective of the intensity and diversity of patterns of gBGC among closely related populations is still lacking.

Several questions remain open regarding the tempo and mode of gBGC evolution. For example, the effect of demography on gBGC is still controversial. While theoretical results and studies in mammals and birds advocate a positive relationship between the effective population size and the intensity of gBGC (Nagylaki 1983; Romiguier *et al.* 2010; Weber *et al.* 2014), Galtier *et al.* (2018) failed to detect such relationship between animal phyla. These results open the question as to which extent demography shapes the intensity of gBGC in closely *vs.* distantly related species/populations. Another aspect that is not completely understood is the impact of GC-bias on the base composition of genomes (Phillips *et al.* 2004; Romiguier *et al.* 2013). In particular, the individual and joint effect of gBGC and mutations shaping the substitution process remains elusive. Here, we address these two questions by revisiting great ape data (Prado-Martinez *et al.* 2013) with a Moran model that accounts for allelic selection.

The Moran model (Moran 1958) has a central position in describing the evolution of a population in that it models the dynamics of allele frequency changes in a finite haploid population. Recently, an approximate solution for the multivariate Moran model with boundary mutations (*i.e.*, low mutation rates) was derived (Schrempf and Hobolth 2017). In particular, the stationary distribution was shown useful to infer population parameters from allele frequency data (Schrempf *et al.* 2016; Schrempf and Hobolth 2017). Here, we present the Moran model with boundary mutations and allelic selection, derive the stationary distribution, and build a Bayesian framework to estimate population parameters. While De Maio *et al.* (2013) had previously proposed a Moran model with allelic selection, we introduce further assumptions on the mutation scheme that permit us to mechanistically describe the relative importance and impact of the population processes mediating the base composition of genomes and expected divergence.

Other approaches making use of allele frequency data to estimate mutation rates and selection coefficients have been proposed in the literature. Glémin *et al.* (2015) proposed a method to quantify gBGC from the derived allele frequency spectra that incorporates polarization errors, takes spatial heterogeneity into account, and jointly estimates mutation bias. The number of derived alleles is modeled by a Poisson distribution on the mutation rates among weak, strong, and neutral alleles (Muyle *et al.* 2011). Our approach differs from that of Glémin *et al.* (2015) as it does not require polarized data or need to account for polarization errors. In addition, our method makes use of the information given by the fixed sites—information that is usually discarded by other methods (Glémin *et al.* 2015 included).

Furthermore, our application to great apes shows that most great apes have patterns of allelic selection consistent with gBGC. Our results suggest further that demography plays a major role in determining the intensity of gBGC among great apes, as the intensity of the obtained selection coefficients correlates significantly with the effective population size of great apes. We also show that not accounting for GC-bias may considerably distort the reconstructed evolutionary process, as mutation and substitution rates are estimated under bias.

## Methods

### The multivariate Moran model with allelic selection

The modeling framework defined in this work builds on the model described by Schrempf *et al.* (2016), which, according to some proposed terminology (Vogl and Bergman 2015; Schrempf and Hobolth 2017), can be addressed as the multivariate Moran model with boundary mutations (hereafter, ). Here, we describe the and allelic selection (hereafter, ). The multivariate Moran model can be also referred to as a polymorphism-aware phylogenetic model (PoMo) if we consider the four-variate case (De Maio *et al.* 2013, 2015; Schrempf *et al.* 2016), *i.e.*, representing the four nucleotide bases (Figure 1).

Consider a haploid population of *N* individuals and a single locus with *K* alleles: and are two possible alleles. The evolution of this population over time is described by a continuous-time Markov chain with a discrete character-space defined by *N* and *K*, each of which represents a specific assortment of alleles. Two types of state can be defined: if all the individuals in a populations have the same allele, the population is monomorphic , *i.e.*, the *N* individuals have the allele ; differently, if two alleles are present in the population, the population is polymorphic , meaning that *n* individuals have the allele and have the allele . is therefore the frequency of allele in the population.

Allele trajectories are given by the rate matrix Q. Time is accelerated by a factor of *N*, and, therefore, instead of describing Moran dynamics in terms of Moran events (Moran 1958), we developed a continuous version in which time is measured as the coalescent in generation time (in units of *N*).

Drift is defined by the neutral Moran model: the transition rates of the allelic frequency shifts depend only on the allele frequency, and are therefore equal regardless of allele increases or decreases in the population (Durrett 2008).(1)We accommodated mutation and selection in the framework of the neutral Moran model by assuming them to be decoupled (Baake and Bialowons 2008; Etheridge *et al.* 2010).

Mutation is incorporated based on a boundary mutation model, in which mutations occur only in the boundary states. The boundary mutations assumption is met if the mutation rates are small (and *N* is not too large). More specifically, by comparing the expectations of the diffusion equation with the polymorphic diversity under the Moran model, Schrempf *et al.* (2016) established that should be lower than 0.1. In fact, most eukaryotes fulfill this condition [see Lynch *et al.* (2016) for a review of mutation rates]. Another assumption of our boundary mutation model is that the polymorphic states can only be biallelic. However, this assumption is not a significant constraint as tri-or-more allelic sites are rare in sequences with low mutation rates.

We employed the strategy used by Burden and Tang (2016), and separated our model into a time-reversible and a flux part. We wrote the mutation rates as the entries of a specific mutation model, the general time-reversible model (GTR) (Tavaré 1986): , where ρ represents the exchangeabilities between any two alleles ,and π the allele base composition (Equation 2). Here, we restricted ourselves to the GTR, as this model simplifies obtaining formal results (Burden and Tang 2016). Because π has free parameters, and ρ includes the exchangeabilities for all the possible pairwise combinations of *K* alleles, we ended up having free parameters in the GTR-based boundary mutation model.

Until now, we have essentially described the model proposed by Schrempf *et al.* (2016); this work extends this model by including allelic selection. We modeled allelic selection by defining relative selection coefficients σ: the selection coefficient of an arbitrary allele (A in our experiments) is fixed to 0. The selection coefficients defined this way guarantee that our multi-allelic model behaves neutrally only under the condition that all the selection coefficients are the same and equal to 0. Defining the fitness as the probability that an offspring of allele is replaced with probability (Durrett 2008), we can formulate the component of allelic selection alongside with drift, and thus among the polymorphic states (Equation 2).

Altogether, the instantaneous rate matrix Q of the multivariate Moran model with boundary mutations and allelic selection can be defined as(3)where *u* and *v* represent a frequency change in the allele counts (though *N* remains constant). The diagonal elements are defined by the mathematical requirement such that the respective row sum is 0.

As the parameters of the population size, mutation rate, and selection coefficients are confined, it is possible to scale them down to a small value of *N* while keeping the overall dynamics unchanged (Appendix A). The virtual population size *N* becomes a parameter describing the number of bins the allele frequencies can fall into. As a result, we can think of *N* either as a population size or a discretization scheme.

### The stationary distribution

The stationary distribution of a Markov process can be obtained by computing the vector ψ satisfying the condition (Appendix B). ψ is the normalized stationary vector, and has the solution(2)*k* is the normalization constant(4)where A is the alphabet of the *K* alleles , representing the monomorphic states, and all the possible pairwise combinations of A representing the types of polymorphic states , , …, . For example, for the four-multivariate case, we write A as the alphabet of the four nucleotide bases and as all the possible pairwise combinations of the four nucleotide bases . For a population of size *N*, we have possible states, four of which are monomorphic (Figure 1).

### Expected number of Moran events

From .Q. and ψ, we can compute the expected number of Moran events (mutations, drift, and selection) or the expected divergence per unit of time (in generations) under the model (Appendix C):(5)The quantity (5) can also be interpreted as the overall rate of the model. The expected number of Moran events for the neutral model can be easily calculated by letting . To compare the Moran distance with the standard models of evolution, we recalculated the Moran distance to account only for substitutions events : we corrected by the probability of a mutation and subsequent fixation under the Moran model (Appendix D)

(6)### Bayesian inference with stationary distribution

We can define a likelihood function based on the stationary distribution for a set of *S* independent sites in *N* individuals by taking the product of over counts of monomorphic and polymorphic sites , thus:(7)We employed a Bayesian approach: we define the prior distributions independently, a Dirichlet prior for π and an exponential prior for ρ and σ; a Dirichlet and multiplier proposals were set for the aforementioned parameters with tuning parameters guaranteeing a target acceptance rate of 0.234 (Roberts *et al.* 1997). We employed the Metropolis-Hastings algorithm (Hastings 1970) for each conditional posterior in a Markov chain Monte Carlo (MCMC) sequence to obtain random samples from the posterior. The algorithm was coded in the R statistical programing language (R Core Team 2015): the packages MCMCpack and expm were integrated in our code to obtain samples from the Dirichlet density and to compute the matrix exponential, respectively (Martin *et al.* 2011; Goulet *et al.* 2017).

### Application: great ape population data

The stationary distribution of the four-multivariate model was employed to infer the distribution of allele frequencies, selection coefficients, and mutation rates from fourfold degenerate sites of exome-wide population data from great apes (Prado-Martinez *et al.* 2013). We used 11 populations with up to 27 diploid individuals, totaling ∼2.8 million sites per population (Table 1). Data preparation follows the pipeline described in De Maio *et al.* (2015). Estimates of the Watterson’s *θ* genetic diversity is <0.003 for all the studied populations (Schrempf *et al.* 2016), validating the boundary mutations assumption of 0.1.

### Data availability

Data and R scripts necessary to confirm the findings of this article are available on GitHub (https://github.com/pomo-dev/pomo_selection). Supplemental material available at FigShare: https://doi.org/10.6084/m9.figshare.8180960.

## Results

### Simulations and algorithm validation

To validate the analytical solution for the stationary distribution of the multivariate Moran model, we compare it to the numerical solution obtained by calculating the probability matrix of for large enough *t*. We confirmed that the numerical solution converges to the analytical solution (Supplemental Material, Figure S1).

We validated the Bayesian algorithm for estimating population parameters from the stationary distribution by performing simulations (Figures S2–S5 and Table S2). Our algorithm efficiently recovers the true population parameters from simulated allele counts. We tested the algorithms for different numbers of sites (, and ) and state-spaces ( 5, 10, and 50). The number of sites does not increase the computational time substantially and is not a limiting factor for genome-wide analysis. In contrast, the size of the state-space influences the computational time. For larger state-spaces, *N*, more iterations are needed to obtain converged, independent, and mixed MCMC chains during the posterior estimation.

### Patterns of allelic selection in great apes

To test the role of allelic selection defining the distribution of alleles in the great apes, we compared the neutral model and the model with allelic selection (). Using the predictive stationary distribution and the observed allele counts, we computed the Bayes factors (BF) favoring the more complex model (*i.e.*, log BF favors the model with allelic selection) for all populations. It is clear that fits the data considerably better for most of the studied great apes (log BF , Table 1). The only exception is the Eastern gorilla population, for which a lower log BF was obtained (log BF , Table 1).

We have also corroborated our BF by inspecting the fit of the predictive distribution of and with the allele counts (Figure S6, A–K). The allele counts for the polymorphic states are not symmetrical; generally, one allele is preferred, and thus also the polymorphic states that have it in higher proportions. As expected, we observed that better reproduces the skewed distribution of allele counts among great apes.

We further investigated the patterns of allelic selection in great apes by analyzing the posterior distribution of the relative selection coefficients of C, G, and T ( was set to 0) under . A general pattern of allelic selection is observed in great apes: the selection coefficients of C and G are similar (meaning that their posterior distributions largely overlap), but different from the selection coefficient of T, which, in turn, overlaps 0 (approximately equal to the selection coefficient of A) (Figure 2). The only exception is Eastern gorillas, for which the selection coefficients are all only slightly higher than 0 and rather similar (Figure 2). This result corroborates the relatively low BF found for evidence of allelic selection in the Eastern gorilla population.

We further explored this result in order to check if the patterns of GC-bias found among great apes can be associated with gBGC. We correlate the GC-bias per chromosome (by averaging the scaled and ) with the chromosome size and recombination rate in the non-African human population (Figure S7), for which these data are particularly well characterized (Jensen-Seaman 2004). We found a significant positive correlation between the GC-bias and recombination rate (Spearman’s *ρ* = 0.57, *P* = 0.006), but a negative correlation with the chromosome length (Spearman’s *ρ* = −0.52, *P* = 0.014).

Although the patterns of selection among great apes are similar qualitatively, they differ quantitatively. For example, the Central chimpanzees have patterns of GC-bias ∼6.17/4.05 (, Table S3 and Figure 2; scaled according to Equation 11), while the closely related population of Western chimpanzees shows less strong patterns (∼2.21/2.30). Likewise, the GC-bias content in African and non-African human populations contrasts: 4.47/2.86 and 1.83/1.76, respectively. These results show that the patterns of allelic selection vary greatly among great apes, even among closely related populations.

It has been hypothesized that gBGC is a compensation mechanism for the mutational bias that exists in favor of the weak alleles, A and T (Duret and Galtier 2009; Philippe *et al.* 2011)—the AT/GC toggling effect. We observed that mutation rates from strong to weak alleles are more frequent (by a factor of 2.80; 3.26 if the stationary frequencies are accounted for), while no mutational bias was found between alleles of the same type (1.02; 0.98 if the stationary frequencies are accounted for; Table S3). As the estimated selection coefficients have a clear pattern of GC-bias in most great apes, we can conclude that our analyses are congruent with the expectations of the AT/GC toggling effect.

Furthermore, we compared our method with that of Glémin *et al.* (2015), by considering only two alleles [the strong (S) and weak (W) alleles] using human allele counts from the first human chromosome, divided into 51 regions of 1 million sites (data taken from Glémin *et al.* 2015). We compared estimates of the gBGC rate coefficient as predicted by our model and that of Glémin *et al.* (2015) ( and *B*, respectively), and observed that they are negatively correlated (Spearman’s , value = 0.012). Interestingly, *B* correlates significantly with our estimates of (the mutation rate of weak to strong alleles; , value = 0.001). We have further checked the influence of the fixed sites in our estimates of gBGC, and, as expected, we observed that correlates positively with the percentage of monomorphic sites (, value = 0.012); intriguingly, *B* is negatively correlated (, value = 0.001). Scatter plots of the mentioned correlation tests can all be found in Figure S8.

### and the total rate of mutation and selection in great apes

It is widely known that the intensity of mutation and selection reflect population demography. To check whether the estimated mutation and selection coefficients among great ape populations may be explained by demography, we tested the correlation between the total rate of mutation and selection and (obtained from Tenesa *et al.* (2007); Prado-Martinez *et al.* (2013)). Positive correlations between the total mutation and selection rates and the effective population size were obtained (Figure 3): Pearson’s correlation coefficient of 0.57 (*P* = 0.089) and 0.89 (*P* < 0.001), respectively. These correlations were obtained using independent contrasts (Felsenstein 1985) accounting for the great apes phylogeny as predicted in Prado-Martinez *et al.* (2013).

This result shows that plays an important role in determining the intensity of selection. In particular, it becomes clear that the different patterns of GC-bias found among great apes are due, in part, to different demographies. For example, Central chimpanzees have the highest GC-bias among the studied great apes, and they are indeed the population that was estimated with the largest (30,000; Prado-Martinez *et al.* 2013). Eastern gorillas showed the opposite pattern; this population had no evidence of GC-bias (with very homogeneous selection coefficients) and congruently Prado-Martinez *et al.* (2013) estimated its as only 2000—the lowest of the studied populations.

### Comparing the expected number of substitutions in great apes

We calculated the expected number of substitutions under and to evaluate the impact of allelic selection (in particular, GC-bias) in the evolutionary process. With Equation 6, we calculated and using the posterior estimates of the respective model parameters. We observe that, for most great ape populations, the expected number of substitutions is lower when allelic selection is accounted for (Table 2); Eastern gorillas are an exception, and the opposite pattern was observed. We also calculated the ratio between the expected number of substitutions in both models (*i.e.*, ), and we obtained minor (99.8% in Bornean orangutans) to major (82.1% in bonobos) deviations; the average difference is −7.3% (Table 2). These results suggest that not accounting for GC-bias may distort the reconstructed evolutionary process by overestimating the expected number of substitutions.

We complement this result by comparing the posterior distribution of the mutations rates in and . Because we wanted to identify the mutational types that may be estimated differently between these models, we calculated the relative difference between the mutation rate from allele to allele using the following ratio: . If for a certain mutation rate , then this mutation rate is being underestimated in when compared with (and vice versa if ); if the mutation rates are equally estimated in both models.

We observed a systematic bias among great apes. While weak-to-weak and strong-to-strong mutation rates are generally nondeferentially estimated in both models (most of their *r* overlap 1, Figure 4) the strong-to-weak and weak-to-strong mutation rates are generally biased in . In particular, we obtained that weak-to-strong mutation rates are augmented, while mutations rates from strong-to-weak alleles are deprecated (Figure 4), which suggests that not accounting for GC-bias may bias the estimation of mutation rates. Eastern gorillas behave differently by not showing significant differences between the estimated mutations rates (all overlap 1, Figure 4).

## Discussion

In this work, we built on the multivariate Moran model with boundary mutations and allelic selection to explain the population processes shaping the observed distribution of alleles. We obtained new formulae to characterize this model. In particular, we derived the stationary distribution and the rate of the process. In addition, we built a Bayesian framework to estimate population parameters (mutation rates and selection coefficients) from population data. This work accomplishes tasks set by Schrempf and Hobolth (2017), who observed derivations from neutrality without having a model in place to enlighten the causes.

### Variable patterns of gBGC among great apes

A genome-wide application to the great apes provides important insight into the strength and magnitude of GC-bias patterns and also the impact of gBGC in the evolutionary process. To our knowledge, this is the first work giving a population perspective of the patterns of GC-bias in nonhuman populations.

Here, we focus on GC-bias because it is a genome-wide effect. Mathematically speaking, it is difficult to disentangle gBGC from directional selection: they may have different biological explanations, but represent the exact same process modeling-wise (*i.e.*, one allele is preferred over the others). Therefore, existing signatures of directional selection are most likely canceling out, when several site-histories (∼2.8 million sites in our case) are summarized to perform inferences.

In agreement with previous studies in mammals and humans (Spencer *et al.* 2006; Capra *et al.* 2013; Lartillot 2013; Lachance and Tishkoff 2014; Glémin *et al.* 2015), we found that gBGC is weak on average. Indeed, among great apes, the effect of GC-bias is 2.75±1.27 (value obtained by averaging scaled and ), consistent with the nearly neutral scenario (Ohta and Gillespie 1996; Vogl and Bergman 2015). Other studies provided estimates of the scaled conversion coefficient in coding regions: Lynch (2010) estimated as 0.82 in humans and Lartillot (2013) adopted a phylogenetic approach that predicted scaled conversion coefficients <1 in all apes. Our estimates are comparatively higher; however, ours methods and those of Lynch (2010) and Lartillot (2013) have different underlying assumptions. In particular, our method employs the Moran model, which has a rate of genetic drift twice as fast as the Wright-Fisher model. Therefore, we expect to estimate selection coefficients that are twice as high as those in the studies cited.

We found no quantitative agreement between our estimates of the gBGC rate coefficient and those derived from the method of Glémin *et al.* (2015). In addition, we found that our model attributes to mutation what Glémin *et al.* (2015) attributes to gBGC. This might be a consequence the use of monomorphic sites in our method. Indeed, unlike those of Glémin *et al.* (2015), our estimates of gBGC correlate positively with the percentage of fixed sizes. In general, the gBGC rate coefficient should promote greater fixation by boosting the purging of polymorphic sites (at least for low mutation rates, as those observed in humans). On the other hand, Glémin *et al.* (2015) also considered a varying GC-content, which may explain why their estimates of gBGC do not correlate with the percentage of fixed sites. We have preliminary evidence showing that monomorphic sites can significantly impact estimates of population parameters. Nevertheless, a more comprehensive model accounting for both fixed states and variable GC-content would be necessary to disentangle their relative contribution to explaining allele counts.

The patterns of GC-bias we found in great apes are in concordance with the well-known process of gBGC. As expected, we observed that the larger the recombination rate or the lower the chromosome length, the higher the GC-effect. Evidently, recombination promotes gBGC; however, a negative association between gBGC and chromosome size is expected [in most organisms, small chromosomes undergo more recombination per unit of physical distance than large chromosomes (Kaback *et al.* 1989)]. We performed these analyses in non-African humans, for which these data are available; however, we are confident that the patterns of GC-bias found in great apes are due to gBGC.

It has been hypothesized that GC-bias is a compensation mechanism for the mutational bias that exists in favor of the weak alleles, A and T (Duret and Galtier 2009; Galtier *et al.* 2009; Philippe *et al.* 2011). Congruent with this expectation, we observed that mutation rates from strong to weak alleles are higher, but rather similar between alleles of the same type. Interestingly, as we have demonstrated, this symmetric manner by which mutations and selection act in great apes leads the number of substitutions to decrease on average. This suggests that AT/GC toggling may increase population variability by promoting more polymorphic sites; however, further studies would be necessary to clarify this prediction.

### Intensity of gBGC and demography in great apes

Glémin *et al.* (2015) hypothesized that differences in GC-bias intensity among human populations were due to the effects of demography. We also advance that demography regulates the intensity of gBGC in great apes. We obtained a positive correlation between the total rate of selection and in great apes. An important conclusion of our study is that patterns of gBGC can change rapidly due to demography, even among closely related populations. In fact, most of the studied populations are known to have diverged <0.5 MYA (Prado-Martinez *et al.* 2013).

Here, we showed that GC-bias determines the genome-wide base composition of genomes in a factor proportional to [or in the true dynamic]. Therefore, by either changing or *s*, we are able to change the AT/GC composition of genomes. Because we were able to correlate with the intensity of allelic selection (Pearson’s *ρ* = 0.89), we are convinced that demography plays a major role determining the base composition of great ape genomes. Studies using life history traits (*i.e.*, body size) in mammals (Romiguier *et al.* 2010) and ancestral reconstructions of the effective population size in birds (Weber *et al.* 2014) also advocated for correlations between and GC-content [although not as strong as that found here; *ρ* ∼ in Weber *et al.* (2014)].

In contrast, Galtier *et al.* (2018) did not find this correlation in a data set covering 31 species of distinct metazoa phyla (including vertebrates, insects, molluscs, crustaceans, echinoderms, tunicates, annelids, nematodes, nemertians, and cnidarians). This is most likely happening because aspects of the recombination landscape, such as genome-wide recombination rate, length of gene conversion tracts, and repair biases, may also affect the intensity of gBGC (Duret and Galtier 2009; Lesecque *et al.* 2014; Galtier *et al.* 2018). As the recombination landscape varies significantly across species, but not so much across closely related populations (*e.g.*, the karyotype is very conserved among great apes, with humans having 46 diploid chromosomes whereas other great apes having 48), we expected stronger correlations between the intensity of gBGC and demography.

Knowing to what extent variations in or *s* determine the base composition of genomes will require further study. In particular, determining *s* experimentally in different populations/species would help assess the real impact of gBGC. If we assume that *s* varies slightly among closely related populations/species, then we might attribute different intensities of GC-bias almost solely to demographic effects, which simplifies the task of accommodating gBGC in population models.

### gBGC calls for caution in molecular and phylogenetic analyses

The effects of gBGC in molecular analysis have been described extensively in the literature [reviewed in Romiguier and Roux (2017)]. We complement these results by showing how GC-bias affects the base composition of genomes, and how the mutation rates and genetic distances may be biased if GC-bias is not properly accounted for. In particular, we observed that mutation rates from weak-to-strong and strong-to-weak alleles are systematically over and underestimated, respectively.

The idea that gBGC may distort the reconstructed evolutionary process comes mainly from phylogenetic studies. For example, it is hypothesized that gBGC may promote substitution saturation (Romiguier and Roux 2017). We have shown that the number of substitutions may be significantly overestimated if we do not account for GC-bias, meaning that gBGC may indeed promote branch saturation. Based on this and other gBGC-related complications [*e.g.*, GC-bias promotes incomplete lineage sorting (Hobolth *et al.* 2011)], some authors advocate that only GC-poor markers should be used for phylogenetic analysis (McCormack *et al.* 2012; Romiguier *et al.* 2013). Contradicting this approach, our results show that we may gain more inferential power if GC-bias is accounted for when estimating evolutionary distances.

Here, we have not performed phylogenetic inference, but previous applications of the Moran model to phylogenetic problems (*i.e.*, PoMo) (De Maio *et al.* 2015; Schrempf *et al.* 2016) show that it can be done. Therefore, a necessary future work would be to test the effect of allelic selection (or, more specifically, GC-bias) in phylogeny reconstruction; in particular, it would be of major interest to determine how much of its signal can account for the increase in accuracy of tree estimation.

Recently, a nucleotide substitution process that accounts for gBGC was proposed by Lartillot (2013). In this latter model, a scaled conversion coefficient is used to correct substitution rates in a manner similar to that used here to calculate the expected number of substitutions for the Moran distance (*i.e.*, assessing the relative fixation probabilities under GC-bias, File S3). Therefore, these models should perform similarly, with the exception that PoMo should be able to disentangle the contribution of selection and mutation to the observed diversity, as it additionally accounts for polymorphic sites.

### Conclusions

Despite widespread evidence of gBGC in several taxa, several questions remain open regarding the role of gBGC in determining the base composition of genomes. In this work, we provide a mechanistic model and theoretical results that allow quantification of patterns of gBGC in nonhuman closely related populations, providing a new layer of understanding of the tempo and mode of gBGC evolution in vertebrate genomes.

In addition, our multivariate Moran model with allelic selection makes a significant contribution to the endeavor of estimating population parameters from multi-individual population-scale data. Importantly, our analysis showed that gBGC may significantly distort estimates of population parameters and genetic distances, highlighting that gBGC-aware models should be used when employing molecular phylogenetics and population genetics analyses. We stress that, although our application to great apes show evidence of GC-bias, our framework can be employed more generally to estimate patterns of nucleotide usage and associated mechanisms of evolution.

## Acknowledgments

We thank Dominik Schrempf, Claus Vogl, and Sylvain Glémin for helpful discussions, and the three anonymous reviewers for comments improving the manuscript. This work was funded by the Vienna Science and Technology Fund (WWTF) through project MA16-061. G.J.S. received funding from the European Research Council under the European Unions Horizon 2020 research and innovation program under grant agreement no. 714774.

## Appendix A

### Virtual Population Size

Consider two populations, *A* and with different population size, *N* and *M*, respectively. We want to mimic the dynamics of population *A*, relying on the population parameters of a population of different size (larger or smaller). Both populations have the same number of monomorphic states (equaling the number of alleles *K*) and so we assume them equally frequent in both populations. The number of polymorphic states differs: there are polymorphic states in population *A*, while has . Because we cannot make polymorphic states equivalent, we assume that the sum of polymorphic states for each pairwise comparison of the *K* alleles should be equal in both populations. These conditions can be written in the following system of equations(8)As we have derived an estimator of the site frequency spectrum, we can write this conditions for the multivariate Moran model with boundary mutations and selection as(9)This system has conditions and parameters and therefore cannot be solved. However, we know that the entries of *π* are constrained in and should sum up to 1 in both populations, therefore we make the additional assumption that . In addition, and by definition, the reference allele is considered to evolve neutrally in both systems, which permits the conclusion that the normalization constants *k* and are equal. Simplifying,(10)we obtain that the population parameters of population can be expressed in terms of the parameters of population *A*(11)This expression looks tedious, but the neutral case can be very intuitive. In this scenario, mutation rates of populations *A* and change by a factor that is simply the ratio of two harmonic numbers, each of which determined by the population size of the respective population. Intuitively, if then , meaning that, in order to compensate the smaller number of individuals, *M* (*i.e.* stronger effect of genetic drift), mutation rates are augmented in population . Figure 5 depicts the effect of the effective population size on the mutation rates and selection coefficients.

## Appendix B

### Proof of the Stationary Vector

Let ψ be a stationary vector of with and being the elements of the stationary vector corresponding to the states and , respectively. In the multivariate Moran model with low mutation rates and selection, mutation is occurring only in the boundary states, permitting the monomorphic states to communicate with the polymorphic states, while drift and selection are both acting among the polymorphic states. The detailed balance conditions can be defined and lead to equations for the monomorphic and the polymorphic states. In the boundary states, an allele is either fixed or absent (, *i.e.* is fixed), for which we may write(12)while, between the polymorphic states, the general condition is valid(13)Condition (13) can be rewritten in the recursive form(14)and then combined with Equation 12(15)The product can be further simplified by recognizing that, for , , while for , the rates inside the product are just the combined rate of drift and selection [according to expression (2)]. We can now rewrite Equation 14 in order to the element of the stationary vector of Q(16)Because and , we obtain a possible solution for the monomorphic states of the stationary distribution by making in Equation 16(17)The stationary solution for the polymorphic states can be obtained from Equation 16 by noting that and (18)The stationary distribution obtained here can be related with the stationary vector of the neutral boundary multivariate Moran model. We observe that, when , we obtain the solution computed by Schrempf *et al.* (2016) for the multivariate Moran model with drift only

## Appendix C

### Proof of the Expected Number of Moran Events per Unit of Time

To assess the impact of allelic selection in branch length estimation (or the total rate of the process), we computed the expected number of events per unit of time for the multivariate Moran model with selection(20)Where ψ is the stationary vector and the diagonal elements of Q. Equation 20 can be solved by observing that a monomorphic state can be escaped only by mutation, while a polymorphic state can be escaped only by selection and drift(21)The stationary vector is known from Equations 17 and 18(22)where *k* is the normalization constant defined in Equation 4. Expression (22) can be further simplified by observing that the sum in *n* results in doubling every element. Therefore, the expected number of events can be simplified to

## Appendix D

### Proof of the Moran Distance in Number Substitutions

The Moran distance accounts for several events (mutation, driftnewpage, and selection), and differs from the standard evolutionary distances because they are calculated in terms of the expected number of substitutions .(24)where *s* is the probability of a substitution. *s* can be calculated multiplying the probability *m* of an event being a mutation, by the probability *h* of that mutation getting fixed in the population(25)where represents all the possible pairwise permutations without repetition of *K* alleles.

#### Solving

The probability of an event being a mutation is simply the ratio between the rate of mutation and the total rate (*i.e.*, the rate of mutation plus the rate of drift and selection). In stationarity, we know that the total rate is simply the expected number of events of the Moran model and follows Equation 23. The rate of a mutation is the rate of escaping the monomorphic state , from which we can write(26)We can see that differs from only due to the selection coefficient in the numerator.

#### Solving

According to Kluth and Baake (2013), the fixation probability of an allele with fitness is for the Moran model(27)As we are using the multivariate Moran model, we have to extend the denominator of (27) to account for the different possible combinations of two selection coefficients. In particular, we may have(28)We further redefine the denominators in order to make them equal

(29)#### Solving *s*

*s*

The probability of a substitution under the multivariate Moran model with boundary mutations and selection can be computed as(30)We see that , which is an expected consequence of stationarity. We can now generalize for all the substitution types by using Equation 25(31)The relationship between the Moran distance in events and substitutions can be defined based on Equation 24,(32)This quantity can be evaluated for neutral regimes: *i.e.* . We obtain the probability of a substitutions under the neutral Moran model, and it matches the result computed by Schrempf *et al.* (2016):

## Footnotes

Supplemental material available at FigShare: https://doi.org/10.6084/m9.figshare.8180960.

*Communicating editor: M. Beaumont*

- Received September 13, 2018.
- Accepted May 20, 2019.

- Copyright © 2019 by the Genetics Society of America

Available freely online through the author-supported open access option.