## Abstract

The large amount and high quality of genomic data available today enable, in principle, accurate inference of evolutionary histories of observed populations. The Wright-Fisher model is one of the most widely used models for this purpose. It describes the stochastic behavior in time of allele frequencies and the influence of evolutionary pressures, such as mutation and selection. Despite its simple mathematical formulation, exact results for the distribution of allele frequency (DAF) as a function of time are not available in closed analytical form. Existing approximations build on the computationally intensive diffusion limit or rely on matching moments of the DAF. One of the moment-based approximations relies on the beta distribution, which can accurately describe the DAF when the allele frequency is not close to the boundaries (0 and 1). Nonetheless, under a Wright-Fisher model, the probability of being on the boundary can be positive, corresponding to the allele being either lost or fixed. Here we introduce the beta with spikes, an extension of the beta approximation that explicitly models the loss and fixation probabilities as two spikes at the boundaries. We show that the addition of spikes greatly improves the quality of the approximation. We additionally illustrate, using both simulated and real data, how the beta with spikes can be used for inference of divergence times between populations with comparable performance to an existing state-of-the-art method.

ADVANCES in sequencing technologies have revolutionized the collection of genomic data, increasing both the volume and quality of available sequenced individuals from a large variety of populations and species (Romiguier *et al.* 2014; Gudbjartsson *et al.* 2015). These data, which may involve up to millions of single-nucleotide polymorphisms (SNPs), contain information about the evolutionary history of the observed populations. There has been a great focus in the recent years on inferring such histories, and to this end, one of the most widely used models is the Wright-Fisher model (Gautier *et al.* 2010; Sirén *et al.* 2011; Malaspinas *et al.* 2012; Pickrell and Pritchard 2012; Gautier and Vitalis 2013; Steinrücken *et al.* 2014; Terhorst *et al.* 2015).

The Wright-Fisher model characterizes the evolution of a randomly mating population of finite size in discrete nonoverlapping generations. The model describes the stochastic behavior in time of the number of copies (frequency) of alleles at a locus. The frequency is influenced by a series of factors, such as random genetic drift, mutations, migrations, selection, and changes in population size. When inferring the evolutionary history of a population, the effects of the different factors have to be untangled. Mutation, migration, and selection affect the allele frequency in a deterministic manner (Ewens 2004). We collectively refer to these as *evolutionary pressures*. The frequency also varies from one generation to the next as a result of random sampling of a finite-sized population (*random genetic drift*). Mutations and migrations result in linear changes of the sampling probability, while selection is a nonlinear pressure (Kimura 1964; Crow and Kimura 1970) and therefore is more difficult to study analytically.

A crucial step for carrying out statistical inference in the Wright-Fisher model is determination of the *distribution of allele frequency* (DAF) as a function of time, conditional on an initial frequency. Even though the Wright-Fisher model has a very simple mathematical formulation, no tractable analytical form exists for the DAF (Ewens 2004). Therefore, various approximations have been developed, ranging from purely analytical to purely numerical. They generally either build on the diffusion limit of the Wright-Fisher model or rely on matching moments of the true DAF. Both types of approximations have been used successfully for inference of populations divergence times (Sirén *et al.* 2011; Gautier and Vitalis 2013), populations admixture (Pickrell and Pritchard 2012), SNPs under selection (Gautier *et al.* 2010), and selection coefficients from time-serial data (Malaspinas *et al.* 2012; Steinrücken *et al.* 2014; Terhorst *et al.* 2015).

Wright (1945) was the first to use the diffusion approximation to determine the stationary DAF. Kimura (1955) solved the diffusion limit and found the time-dependent distribution for pure genetic drift, and Crow and Kimura (1956) extended the solution to include linear evolutionary pressures. However, these approaches contain infinite sums, making their use cumbersome in practice. After decades dominated by inference based on the dual coalescent process (Rosenberg and Nordborg 2002; Hoban *et al.* 2012), diffusion has recently received increasing attention, and researchers have started to investigate other ways to solve analytically or approximate the diffusion equation (McKane and Waxman 2007; Waxman 2011; Malaspinas *et al.* 2012; Song and Steinrücken 2012; Zhao *et al.* 2013; Steinrücken *et al.* 2013, 2014).

Moment-based approximations are less ambitious in that they aim at fitting mathematically convenient distributions by equating the first moments of the true DAF. Such approximations typically use either the normal distribution (Nicholson *et al.* 2002; Coop *et al.* 2010; Gautier *et al.* 2010; Pickrell and Pritchard 2012; Terhorst *et al.* 2015) or the beta distribution (Balding and Nichols 1995, 1997; Sirén *et al.* 2011; Sirén 2012). The rationale behind the use of these distributions is twofold. First, they are motivated by the diffusion limit: the normal distribution is the resulting DAF when drift is small (Nicholson *et al.* 2002), while the beta distribution is the stationary DAF under linear evolutionary pressures (Wright 1945; Crow and Kimura 1956). Second, they are entirely determined by their mean and variance. One major difference between the two is their support. Because the normal distribution is defined over the whole real line, it needs to be truncated to [0,1] (Nicholson *et al.* 2002; Coop *et al.* 2010; Gautier *et al.* 2010). The truncated normal distribution has two atoms at 0 and 1 (corresponding to the allele being lost or fixed) containing the densities in the intervals and , respectively. However, the truncation procedure leads to a variance that no longer matches the variance of the true DAF (Gautier and Vitalis 2013). Alternatively, the full distribution can be applied for intermediary frequencies only, when the probabilities of lying outside the 0 and 1 boundaries are small and therefore can be ignored (Pickrell and Pritchard 2012; Terhorst *et al.* 2015). Unlike the normal distribution, the beta distribution has the interval as support, but because of its continuous nature, the probabilities at the boundaries will always be 0. Under a Wright-Fisher model, the loss and fixation events have a positive probability. The beta distribution provides a good fit for intermediary frequencies but fails at capturing the nonzero boundary probabilities, as illustrated for pure genetic drift in Figure 1, A–C. When time is small, most of the probability mass is found close to the initial value (Figure 1A). As time becomes larger, the allele frequency drifts away from , and more and more probability accumulates at the boundaries (Figure 1, B and C).

Here we propose an accurate extension of the beta distribution under linear evolutionary pressures called *beta with spikes* that explicitly models the probabilities at the boundaries. We show that the addition of spikes greatly improves the fit to the true DAF. We use simulation experiments and published chimpanzee exome data to demonstrate that beta with spikes can be used for inference of population divergence times under pure genetic drift with performance comparable to that of a state-of-the-art diffusion-based method (Gautier and Vitalis 2013) and less computational burden. We additionally discuss how beta with spikes can be used in future development to account for variable population size and selection.

## Beta with Spikes Approximation

Consider a diploid randomly mating population of size 2*N* and a biallelic locus with alleles *A*_{1} and *A*_{2}. Under a Wright-Fisher model, the count of one of the alleles, *A*_{1}, at the discrete generation *t* is a random variable . Let be the corresponding allele frequency. The evolution of is shaped by random genetic drift and evolutionary pressures that affect the sampling probability (Ewens 2004). We capture the joint effect of the evolutionary pressures in , a polynomial in the allele frequency . Conditional on , follows a binomial distribution (Ewens 2004) (1)Here we consider only linear evolutionary pressures, such as mutation and migration. Then takes the form (2)Because represents the sampling probability in equation (1), we must have , for all . From this we find that *a* and *b* satisfy . The case where , for which , for all , has no biological meaning, and we therefore assume that .

Under pure genetic drift, . If mutations happen with probabilities *u* (from to ) and *v* (from to ), then and . Migration can be modeled, for example, by assuming that individuals can migrate away from the population under study and that there is an influx of individuals from a large population with constant frequency . Then, with probabilities and , individuals migrate from and to the population under study, respectively. We have and . Mutation and migration can be modeled jointly, resulting in and . In the following, we treat the general linear case.

We are interested in the DAF conditional on as a function of the generation *t* (3)For simpler notation, we leave out the explicit condition on and implicit condition on population size and evolutionary pressures.

Under the beta approximation, the DAF is (Balding and Nichols 1995) (4)where is the beta function. The two shape parameters of the beta distribution are entirely determined by its mean and variance (5)Therefore, in order to fit to *f*, we need to calculate and . These can be obtained in closed analytical form (see Supporting Information, File S1 for a full derivation). The mean is entirely determined by the initial frequency and the parameters *a* and *b* of the linear evolutionary pressures, while the variance also depends on the population size. Under pure genetic drift (), we have (6)When , we get (7)In the limit of infinite population size, the preceding formulas are equivalent to the mean and variance obtained by Sirén (2012). We note that this equation corrects some typographic errors in Sirén (2012), as confirmed by correspondence with the author (see also File S1).

To account for loss and fixation probabilities, we surround the beta distribution with two spikes (8)where is the Dirac delta function, and . The incorporation of the loss and fixation probabilities into the DAF by means of Dirac delta functions has also been used to obtain a complete solution of the diffusion equation (McKane and Waxman 2007).

To fit to *f*, we need to determine the mean and variance of conditional on polymorphism () and the probabilities and of loss and fixation, respectively. Given , , , and , the conditional mean and variance can easily be calculated (see File S1). The shape parameters and are calculated as in equation (5), where the mean and variance are replaced by the conditional mean and variance, respectively. Therefore, we only require a means of calculating the loss and fixation probabilities to fully specify the beta with spikes approximation. We use a recursive approach in which we calculate the probabilities for by relying on the approximated . We additionally assume that *a* and *b* are small to obtain the following approximation for loss and fixation probabilities (see File S1 for a full derivation): (9)Figure 1, D–F depicts the beta with spikes approximation for the same cases as in Figure 1, A–C. When time is small (Figure 1, A and D), the beta and beta with spikes distributions are equivalent, but as the time becomes larger, the advantage of adding the spikes becomes evident.

To investigate the approximation quality of the beta with spikes, we calculated the Hellinger distance between the true DAF and the beta and beta with spikes for a range of initial frequencies , times *t* and parameters *a* and *b* (see File S1 for details). The Hellinger distance lies between 0 and 1, with 0 indicating a perfect match between the two distributions. Figure 1, G and H, shows that the addition of spikes drastically improves the fit of the beta approximation to the true DAF under a Wright-Fisher model. It is apparent from the figure that the beta distribution approximates well the true DAF when it is not close to the boundaries: either the initial frequency is close to 0.5 and the time is not too large or the parameters *a* or *b* are large enough to keep the allele frequency away from 0 and 1. The beta with spikes has a more consistent behavior because the corresponding Hellinger distance does not vary as much across the different parameters as it does for the beta distribution.

## Inference of Divergence Times

To further illustrate the advantage of incorporating the spikes, we inferred divergence times between populations using both simulated data and exome sequencing data from three chimpanzee subspecies (Bataillon *et al.* 2015).

Populations are represented as successive descendants of a single ancestral population. We assume that after each split, the new populations evolve in isolation (no migration) under pure genetic drift. A rooted tree (Figure 2) can be used to describe the joint history of several present populations, located at the leaves, while the common ancestral population is represented as the root. The data consist of *I* independent SNPs for *J* populations in the present: the (arbitrarily defined) reference (*A*_{1}) allele count in a sample of size () for each locus and population .

Conditional on the topology (*i.e.*, tree without branch lengths), we inferred the scaled branch lengths by numerically maximizing the likelihood of the data.

### Likelihood of the data

Assuming Hardy-Weinberg equilibrium, the probability of observing alleles in a sample of size given the population allele frequency is given by the binomial distribution (10)However, the allele frequencies are unobserved, and the likelihood of the data for SNP *i* is obtained by integrating over the unknown allele frequencies (11)where is the joint distribution of the values at the leaves. The likelihood is a function of the scaled branch lengths, denoted here as Θ, and *π*, the unknown DAF at the root. The joint distribution is, in turn, an integral over the allele frequencies in the ancestral populations, represented as internal nodes in the tree. We approximate the integrals with sums by discretizing the allele frequencies. The discretized joint distribution is then obtained using a peeling algorithm (Felsenstein 1981), where the transition probabilities on each branch are given by the DAF as a function of the branch length (see File S1 for details). Because the SNPs are assumed to be independent, the full likelihood is a product over the SNPs (12)Because SNP data contain only polymorphic sites, we further condition the preceding likelihood on polymorphic data as follows: (13)where (14)Here and are data corresponding to site *i* where the allele was lost or fixed, respectively, in the samples from all populations (15)We treat *π*, the root DAF, as a nuisance distribution, which we assume to be a beta distribution. For a given topology (*i.e.*, tree without branch lengths), the most likely branch lengths and shape parameters of *π* are recovered by numerically maximizing the likelihood conditional on polymorphism.

### Simulated data

Using the topology depicted in Figure 2, we simulated multiple data sets containing independent SNPs under a Wright-Fisher model, given an ancenstral frequency sampled from *π*, the root DAF, which we set to be a beta distribution. We used two different scenarios, labeled I and II, summarized in Table 1. Scenario I has a uniform *π* and large sample sizes, while scenario II is built to produce data that resemble the chimpanzee exome data analyzed later. For this, we used the chimpanzee sample sizes and scaled branch lengths and root DAF as inferred by the beta with spikes on the chimpanzee data (Table 2).

For each simulated data set, we estimated the branch lengths using both the beta and beta with spikes, as described previously. We additionally ran Kim Tree (Gautier and Vitalis 2013) using the default settings. Kim Tree is a method designed for inference of divergence times between populations evolving under pure genetic drift. It uses Kimura’s solution to the diffusion limit for the DAF (Kimura 1955) and relies on a Bayesian Markov chain Monte Carlo (MCMC) approach. Here we use the posterior means of the branch lengths reported by this method as point estimates.

All methods estimated the branches leading to populations 1 and 2 well (Figure 3). Beta with spikes estimates the branch lengths more accurately and with lower variance than the beta approximation (Figure S1). Despite the fact that the spikes’ probabilities do not perfectly match the true loss and fixation probabilities (Figure 1, E and F), this seems to have little effect on the accuracy of branch-length estimation for beta with spikes. For both scenarios, the branch leading to population 2 and the inner branch from the root to population 4 have similar lengths, but the beta approximation and Kim Tree provide a worse estimate for the inner branch. This could be due to the fact that there are no data available resulting directly from the evolution on that branch, making the estimation problem harder. A similar result was obtained by Gautier and Vitalis (2013), who used trees with the same topology. Interestingly, beta with spikes recovers the inner branch much more accurately than either beta or Kim Tree. When measuring the accuracy of the inferred lengths as an average over all four branches (Table S1), it is clear that beta with spikes outperforms Kim Tree for both scenarios.

### Chimpanzee data

The chimpanzee data analyzed here consisted of allele counts of autosomal synonymous SNPs obtained from exome sequencing of the Eastern, Central, and Western chimpanzee subspecies (Bataillon *et al.* 2015) for 11, 12, and 6 individuals, respectively. From the original data set containing 59,905 synonymous SNPs, we filtered the SNPs in which there were missing data, obtaining a total of 42,063 SNPs. We inferred the scaled branch lengths (Figure 4 and Table 2) using beta, beta with spikes, and Kim Tree on the full data set and on 50 smaller data sets containing only 10,000 randomly sampled SNPs. Beta with spikes and Kim Tree infer comparable branch lengths, with the exception of the branch leading to the Western chimpanzee subspecies. We additionally report in Table 2 the likelihood of the full data calculated using beta with spikes for the branch lengths inferred using the three methods and the ones reported in the original study (Bataillon *et al.* 2015). Bataillon *et al.* (2015) used an approximate Bayesian computation (ABC) approach to fit a demographic model to the synonymous SNPs. Their results are consistent with the results obtained here for the branches leading to the Eastern and Central chimpanzees. However, we obtained very different estimates for the remaining two branches. The likelihoods in Table 2 show that the branch lengths obtained by beta with spikes have the highest likelihood. This does not indicate which of the estimates is most correct because the methods use different modeling approaches. However, it does show that the differences between beta with spikes and beta/Kim Tree/ABC are a direct result of the modeling approach and not merely an artifact of the likelihood numerical optimization being trapped in a local optimum. The discrepancy between the results of beta with spikes and those of ABC is, perhaps, not surprising because the difference in inferred branch lengths seems to correlate with the goodness of fit of the ABC demographic model to the observed data. Bataillon *et al.* (2015) reported that their inferred demographic model shows a very good fit for the Central chimpanzees (absolute and relative difference in inferred branch length 0.017 and 0.386, respectively), a slightly less good fit for the Eastern chimpanzees (absolute and relative difference in inferred branch length 0.051 and 0.386, respectively), and a poorer fit for the Western chimpanzees (absolute and relative difference in inferred branch length 1.319 and 2.217, respectively).

### Data availability

The beta, beta with spikes approximations, inference of divergence times, and simulation under a Wright-Fisher model were implemented in Python 2.7. The code is freely available at https://github.com/paula-tataru/SpikeyTree.

## Discussion

We have developed a new approximation to the DAF as a function of time, conditional on an initial frequency, under a Wright-Fisher model with linear evolutionary pressures. Our work provides an accurate extension of the beta approximation (Balding and Nichols 1995, 1997; Sirén *et al.* 2011; Sirén 2012). As noted by Gautier and Vitalis (2013), the beta distribution ignores the possibility of loss or fixation of alleles. We addressed this issue by explicitly modeling the loss and fixation probabilities as two spikes at the boundaries. We showed that the addition of the spikes improves the quality of the approximation and results in more accurate inference of divergence times between populations that have been evolving under pure genetic drift. The DAF obtained as a solution of the diffusion equation is exactly the DAF expected under the Wright-Fisher model when the population size is large and the evolutionary pressures are not too strong, while the beta distribution is motivated only by the stationary DAF under linear evolutionary pressures. We therefore expected the beta with spikes to provide a less accurate approximation to the true DAF than the diffusion limit. Nevertheless, we showed that our method can infer divergence times more accurately than Kim Tree (Gautier and Vitalis 2013), a software built for inference of divergence times using Kimura’s solution to the diffusion limit (Kimura 1955). We would like to note here that the use of likelihood that is explicitly conditioning on polymorphic data [equation (13)] could potentially be the cause of beta with spikes outperforming Kim Tree.

### Computational complexity

The advantage of beta with spikes becomes clearer when one considers its computational complexity. Diffusion methods rely on heavy computations, such as calculations of Gegenbauer polynomials (Gautier and Vitalis 2013), spectral decomposition of large matrices (Steinrücken *et al.* 2013, 2014), or matrix inverse (Zhao *et al.* 2013). In contrast, beta with spikes requires operations that are performed in constant time per iteration. Perhaps the most expensive evaluation is the beta function used in the loss and fixation probabilities, but very efficient approximations exist for this (Abramowitz and Stegun 1964). The difference in computational complexity is noticeable when comparing the running times of beta with spikes, implemented in Python 2.7, and Kim Tree, implemented in Fortran. For the chimpanzee data set of 42,063 SNPs, beta with spikes ran in just under 5 min, while Kim Tree took almost an hour, even though Python 2.7 is a programming language that is less efficient than Fortran. We also note that the two inference methods are inherently different because here we used a numerical optimization procedure, while Kim Tree uses a Bayesian MCMC approach. Additionally, unlike Kim Tree, beta with spikes is a recursive method, and its accuracy and speed depend on the number of iterations performed (see File S1 for accuracy results for different numbers of iterations).

### Extensions

We end this section by discussing possible extensions of the beta with spikes approximation and how these can be used in inference problems. Throughout this paper, we assumed that the population size is constant. Given its recursive formulation, beta with spikes lends itself naturally to incorporating variable population size without any increase in computational complexity. This can then be used for inference of population size backward in time, similar to methods relying on the coalescent with recombination (Li and Durbin 2011; Sheehan *et al.* 2013; Schiffels and Durbin 2014). A recently published method (Liu and Fu 2015) illustrates that allele frequency data, summarized as site frequency spectra, can be used efficiently for inference of variable population size backward in time. Even though Liu and Fu (2015) assumed that sites are independent and did not use linkage information, their method can handle larger data sets than the method of Li and Durbin (2011), which leads to more accurate inference of population sizes for the recent past. The results obtained by Liu and Fu (2015) indicate that beta with spikes could be used successfully for such demographic inference.

Another extension of the presented approximation would be to incorporate selection, which is a nonlinear evolutionary pressure. In the recent years, there has been a great focus on inference of selection coefficients from time-series data under a Wright-Fisher model (Malaspinas *et al.* 2012; Bank *et al.* 2014; Steinrücken *et al.* 2014; Foll *et al.* 2015; Terhorst *et al.* 2015). A newly developed statistical method aims at modeling the evolution of multilocus alleles under a Wright-Fisher model with selection (Terhorst *et al.* 2015) by fitting a multivariate normal distribution from the first moments of the DAF. Using the approach of Terhorst *et al.* (2015) for moment calculation, beta with spikes can be extended to nonlinear evolutionary pressures. Terhorst *et al.* (2015) did not treat the loss and fixation probabilities. However, because selection is expected to drive allele frequencies toward the boundaries faster than pure genetic drift, including the explicit spikes becomes crucial.

## Acknowledgments

It is a pleasure to thank Thomas Mailund for helpful discussions. We also thank the two anonymous reviewers and the associate editor for their constructive suggestions and comments that helped to improve the manuscript. This work was supported, in part, by the European Research Council under the European Union’s Seventh Framework Program (FP7/20072013, ERC grant number 311341) and the Danish Research Council (grant number DFF-4002-00382).

## Footnotes

*Communicating editor: Y. S. Song*Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.179606/-/DC1

- Received June 19, 2015.
- Accepted August 22, 2015.

- Copyright © 2015 by the Genetics Society of America