## Abstract

The large state space of gene genealogies is a major hurdle for inference methods based on Kingman’s coalescent. Here, we present a new Bayesian approach for inferring past population sizes, which relies on a lower-resolution coalescent process that we refer to as “Tajima’s coalescent.” Tajima’s coalescent has a drastically smaller state space, and hence it is a computationally more efficient model, than the standard Kingman coalescent. We provide a new algorithm for efficient and exact likelihood calculations for data without recombination, which exploits a directed acyclic graph and a correspondingly tailored Markov Chain Monte Carlo method. We compare the performance of our Bayesian Estimation of population size changes by Sampling Tajima’s Trees (BESTT) with a popular implementation of coalescent-based inference in BEAST using simulated and human data. We empirically demonstrate that BESTT can accurately infer effective population sizes, and it further provides an efficient alternative to the Kingman’s coalescent. The algorithms described here are implemented in the R package phylodyn, which is available for download at https://github.com/JuliaPalacios/phylodyn.

MODELING gene genealogies from an alignment of sequences (timed and rooted bifurcating trees reflecting the ancestral relationships among sampled sequences), is a key step in coalescent-based inference of evolutionary parameters such as effective population sizes. In the neutral coalescent model without recombination, observed sequence variation is produced by a stochastic process of mutation acting along the branches of the gene genealogy (Watterson 1975; Kingman 1982a), which is modeled as a realization of the coalescent point process at a neutral nonrecombining locus. In the coalescent point process, the rate of coalescence (the merging of two lineages into a common ancestor at some time in the past) is a function that varies with time, and it is inversely proportional to the effective population size at time *t*, (Kingman 1982b; Slatkin and Hudson 1991; Donnelly and Tavaré 1995). Our goal is to infer , which we will refer to as the “effective population size trajectory.”

Multiple methods have been developed to infer using the standard coalescent model with or without recombination. Some of these methods infer from summary statistics such as the sample frequency spectrum (SFS) (Bhaskar *et al.* 2015; Terhorst *et al.* 2017); however, the SFS is not a sufficient statistic for inferring (Sainudiin *et al.* 2011). Other methods have been proposed that directly use molecular sequence alignments at a completely linked locus, *i.e.*, without recombination (Griffiths and Tavaré 1996; Kuhner and Smith 2007; Minin *et al.* 2008; Li and Durbin 2011; Drummond *et al.* 2012; Gill *et al.* 2013; Palacios and Minin 2013). Our approach is of this type. Still other methods account for recombination across larger genomic segments (Li and Durbin 2011; Sheehan *et al.* 2013; Schiffels and Durbin 2014; Palacios *et al.* 2015). In spite of their variety, all these methods must contend with two major challenges: (1) choosing a prior distribution or functional form for , and (2) integrating over the large hidden state space of genealogies. For example, several previous approaches have assumed exponential growth (Griffiths and Tavaré 1996; Kuhner *et al.* 1998; Kuhner and Smith 2007), in which case the estimation of is reduced to the estimation of one or two parameters. In general, the functional form of is unknown and needs to be inferred. A commonly used naive nonparametric prior on is a piecewise linear or constant function defined on time intervals of constant or varying sizes (Heled and Drummond 2008; Sheehan *et al.* 2013; Schiffels and Durbin 2014). The specification of change points in such time-discretized effective population size trajectories is inherently difficult because it can lead to runaway behavior or large uncertainty in . These difficulties can be avoided by the use of Gaussian process priors in a Bayesian nonparametric framework, allowing accurate and precise estimation (Gill *et al.* 2013; Palacios and Minin 2013; Lan *et al.* 2015; Palacios *et al.* 2015). More precisely, the autocorrelation modeled with the Gaussian process avoids runaway behavior and large uncertainty in .

The second challenge for coalescent-based inference of is the integration over the hidden state space of genealogies for large sample sizes. Given molecular sequence data Y at a single nonrecombining locus and a mutation model with parameter μ, current methods rely on calculating the marginal likelihood function by integrating over all possible coalescence and mutation events. Under the infinite sites mutation model without intralocus recombination (Watterson 1975), this integration requires a computationally expensive importance sampling technique or Markov Chain Monte Carlo (MCMC) techniques (Griffiths and Tavaré 1994a; Stephens and Donnelly 2000; Hobolth *et al.* 2008; Wu 2010; Gronau *et al.* 2011). Moreover, a maximum likelihood estimate of cannot be explicitly obtained; instead, it is obtained by exploring a grid of parameter values (Tavaré 2004). For finite sites mutation models, current methods approximate the marginal likelihood function by integrating over all possible genealogies via MCMC methods [Equation 1; Kuhner (2006); Drummond *et al.* (2012)]. In both cases, the marginal likelihood may be expressed as(1)in which Pr is used to denote both the probability of discrete variables and the density of continuous variables. The integral above involves an -dimensional integral over coalescent times and a sum over all possible tree topologies with *n* leaves. Therefore, these methods require a very large number of MCMC samples, and exploration of the posterior space of genealogies continues to be an active area of research (Kuhner *et al.* 1998; Rannala and Yang 2003; Drummond *et al.* 2012; Whidden and Matsen 2015; Aberer *et al.* 2016).

Current methods rely on the Kingman *n*-coalescent process to model the sample’s ancestry. However, the state space of genealogical trees grows superexponentially with the number of samples, making inference computationally challenging for large sample sizes. In this study, we develop a Bayesian nonparametric model that relies on Tajima’s coalescent, a lower-resolution coalescent process with a drastically smaller state space than that of Kingman’s coalescent. In particular, we approximate the posterior distribution , where corresponds to the Tajima’s genealogy of the sample (see Figure 1A and *Tajima’s genealogies*), has a Gaussian process prior with precision hyperparameter *τ* that influences the smoothness of the resulting trajectory, and mutations occur according to the infinite sites model of Watterson (1975). This results in a new efficient method for inferring called **B**ayesian **E**stimation by **S**ampling **T**ajima’s **T**rees (BESTT), with a drastic reduction in the state space of genealogies. Using simulated data, we show that BESTT can accurately infer effective population size trajectories and that it provides a more efficient alternative than Kingman’s coalescent models.

Next, we start with an overview of BESTT, detail our representation of molecular sequence data, and define the Tajima coalescent process. We then introduce a new augmented representation of sequence data as a directed acyclic graph (DAG). This representation allows us to both calculate the conditional likelihood under the Tajima coalescent model and to sample tree topologies compatible with the observed data. We then provide an algorithm for likelihood calculations and develop an MCMC approach to efficiently explore the state space of unknown parameters. Finally, we compare our method to other methods implemented in Bayesian Evolutionary Analysis Sampling Trees (BEAST) (Drummond *et al.* 2012) and estimate the effective population size trajectory from human mitochondrial DNA (mtDNA) data. We close with a discussion of possible extensions, and limitations of the proposed model and implementation.

## Materials and Methods

### Overview of BESTT

Our objective in the implementation of BESTT is to estimate the posterior distribution of model parameters by replacing Kingman’s genealogy with Tajima’s genealogy . A Tajima’s genealogy does not include labels at the tips (Figure 1): we do not order individuals in the sample but label only the lineages that are ancestral to at least two individuals (that is, we only label the internal nodes of the genealogy). Replacing Kingman’s genealogy by Tajima’s genealogy in our posterior distribution exponentially reduces the size of the state space of genealogies (Figure 1B). To compute , the conditional likelihood of the data conditioned on a Tajima’s genealogy, we assume the infinite sites model of mutations, and leverage a DAG representation of sequence data and genealogical information. Note that the overall likelihood, Equation 1, will differ only by a combinatorial factor from the corresponding likelihood under the Kingman coalescent. Our DAG represents the data with a gene tree (Griffiths and Tavaré 1994a), constructed via a modified version of the perfect phylogeny algorithm of Gusfield (1991). This provides an economical representation of the uncertainty, and conditional independences induced by the model and the observed data.

Under the infinite sites mutation model, there is a one-to-one correspondence between observed sequence data and the gene tree of the data (Gusfield 1991) (*Summarizing sequence data Y as haplotypes and mutation groups* and *Representing as a gene tree*). We further augment the gene tree representation with the allocation of the number of observed mutations along the Tajima’s genealogy to generate a DAG (*An augmented data representation using directed acyclic graphs*). The conditional likelihood is then calculated via a recursive algorithm that exploits the auxiliary variables defined in the DAG nodes, marginalizing over all possible mutation allocations (*Computing the conditional likelihood*). We approximate the joint posterior distribution via an MCMC algorithm using Hamiltonian Monte Carlo for sampling the continuous parameters of the model and a novel Metropolis–Hastings algorithm for sampling the discrete tree space.

### Summarizing sequence data Y as haplotypes and mutation groups

Let the data consist of *n* fully linked haploid sequences or alignments of nucleotides at *s* segregating sites sampled from *n* individuals at time (the present). Note that any labels we affix to the individuals are arbitrary in the sense that they will not enter into the calculation of the likelihood. We further assume the infinite sites mutation model of Watterson (1975) with mutation parameter *μ* and known ancestral states for each of the sites. Then, we can encode the data into a binary matrix Y of *n* rows and *s* columns with elements , where 0 indicates the ancestral allele.

To calculate the Tajima’s conditional likelihood , we first record each haplotype’s frequency and group repeated columns to form *mutation groups*; a mutation group corresponds to a shared set of mutations in a subset of the sampled individuals. We record the cardinality of each mutation group (*i.e.*, the number of columns that show each mutation group). In Figure 2A, there are two columns labeled “b,” corresponding to two segregating sites that have the exact same pattern of allelic states across the sample. Further, two individuals carry the derived allele of mutation group b, so in this case the frequency of haplotype 7 and the cardinality of mutation group b are both equal to 2. Likewise, haplotype 4 has frequency 1 and carries five mutations that are split into mutation groups “a,” “f,” and “g” (the latter is not shown in Figure 2A, but appears in Figure 2B) of respective cardinalities 1, 3, and 1. We denote the number of haplotypes in the sample as *h*, the number of mutation groups as *m*, and the representation of Y as haplotypes and mutation groups as .

### Representing as a gene tree

(Figure 2A) can alternatively be represented as a gene tree or perfect phylogeny (Gusfield 1991; Griffiths and Tavaré 1994b). This representation relies on our assumption of the infinite sites mutation model in which, if a site mutates once in a given lineage, all descendants of that lineage also have the mutation *and no other individuals carry that mutation*. The gene tree is a graphical representation of the haplotypes (as tips) arranged by their patterns of shared mutations. The haplotype data summarized in Figure 2A correspond to the gene tree given in Figure 2B. Details of the correspondence between haplotype data and gene tree are listed below, and an additional example is given in Figure E1 (Appendix E).A *gene tree* for a matrix of *h* haplotypes and *m* mutation groups is a rooted tree T with *h* leaves and at least *m* edges, such that (Figure 2B):

Each row of corresponds to exactly one leaf of T. The black numbers at leaf nodes in Figure 2B are the haplotype frequencies.

Each mutation group of is represented by exactly one edge of T, which is labeled accordingly (letters in Figure 2, A and B). The red numbers along edges in Figure 2B give the cardinality of each mutation group (

*i.e.*, the number of segregating sites in this group; see Figure 2A). Some external edges (edges subtending leaves) may not be labeled, indicating that they do not carry additional mutations to their parent edge. This happens when the other edges emanating from the parent node necessarily correspond to other mutation groups.Edges are placed in the gene tree in such a way that each “path” from the root to a leaf fully describes a haplotype. Edges corresponding to shared mutations between several haplotypes are closest to the root. For example, in Figure 2B, haplotype 4 corresponds to the leaf at which one arrives starting from the root and going along edges a, g, and f; in contrast, haplotype 7 corresponds to the leaf at which one arrives going from the root along edge b. Thus, the labels and the numbers associated with the edges along the unique path from the root to a leaf exactly specify a row of .

Dan Gusfield’s perfect phylogeny algorithm (Gusfield 1991) transforms the sequence data into a gene tree and this transformation is one-to-one. We note that the perfect phylogeny T or gene tree is not the same as the genealogy g. While a genealogy is a bifurcating tree of individuals of the sample, the gene tree is a multifurcating tree of haplotypes.

### Tajima’s genealogies

Our method of computing the probability of the recoded data, , uses ranked tree shapes rather than fully labeled histories. We refer to these ranked tree shapes as Tajima’s genealogies but note they have also been called *unlabeled rooted trees* (Griffiths and Tavaré 1995) and *evolutionary relationships* (Tajima 1983). In Tajima’s genealogies, only the internal nodes are labeled and they are labeled by their order in time. Tajima’s genealogies encode the minimum information needed to compute the probability of data , which consists of nested sets of mutations, without any approximations. In Figure 1A for example, it matters only that mutation group e occurs on a subgroup of the individuals who carry mutation group “a,” and that this is different from the subgroups carrying c, d, and f. No other labels matter because individuals are exchangeable in the population model we assume.

This represents a dramatic coarsening of tree space compared to the classical leaf-labeled binary trees of Kingman’s coalescent. The number of possible ranked tree shapes for a sample of size *n* corresponds to the *n*-th term of the sequence A000111 of Euler zig-zag numbers (Disanto and Wiehe 2013) whereas the number of labeled binary tree topologies is . As can be seen from Figure 1B, this provides a much more efficient way to integrate over the key hidden variable, the unknown gene genealogy of the sample, when computing likelihoods.

We model this hidden variable using the *vintaged and sized coalescent* (Sainudiin *et al.* 2015), which corresponds exactly to this coarsening of Kingman’s coalescent. As can be seen in Figure 1A, we assign vintages/labels 2 through *n* starting at the root of the tree and moving toward the present, so that the node created by the final splitting event, which is also the first coalescence event looking back in the ancestry of the sample, is labeled *n*. We write for the time of node *k*, measured from the present back into the past. We set to be the present time. Then during the interval the sample has exactly *k* extant ancestors, for .

The coarsening of the tree topology does not change the law of the times between two coalescence events. Thus, conditional on the effective population size trajectory and the time at which the number of ancestors to the sample decreases to *k*, the distribution of the time during which the sample has *k* ancestors is given by(2)(Slatkin and Hudson 1991), where . Writing the density at of the vector of coalescence times as a product of conditional densities, we obtain(3)We use a lower triangular matrix denoted F to represent Tajima’s genealogies; see Appendix A. The probability of a ranked tree shape was derived independently in Sainudiin *et al.* (2015) and Palacios *et al.* (2015). Specifically, for every ranked tree shape F with *n* leaves,(4)where *c* is the number of cherries in F (*i.e.*, nodes subtending two leaves; in Figure A1A). Note that this probability is independent of the effective population size trajectory since the choice of the pair of lineages that coalesce during an event is independent of (recall that in Kingman’s coalescent, the coalescing pair is chosen uniformly at random among all possible pairs). Since the distribution of Tajima’s genealogies conditional on can be factored as the product of the probability of the ranked tree shape F and the coalescent times density, we arrive at

### An augmented data representation using directed acyclic graphs

A key component of BESTT is the calculation of the conditional likelihood . We compute the conditional likelihood recursively over a DAG, D. Our DAG exploits the gene tree representation T of the data (Figure 2B), incorporates the branch length information of the Tajima’s genealogy (Figure 2C), and facilitates the recursive allocation of mutations to the branches of . Here, we detail the construction of the DAG.

We construct the DAG using three pieces of information: the observed gene tree T, a given Tajima’s genealogy and a latent allocation of mutations along the branches of the Tajima’s genealogy (Figure 3). An allocation refers to a possible mapping (compatible with the data) of the observed numbers of mutations (red numbers in Figure 2B) to branches in the Tajima’s genealogy. Figure 3A shows one possible mapping for the Tajima’s genealogy in Figure 2C; usually this mapping is not unique. Our construction of D enables an efficient recursive consideration of all possible allocations of mutations along when computing the conditional likelihood .

#### Constructing the DAG D:

The graph structure of our DAG (Figure 2D) with nodes Z and edges *E* is constructed from a gene tree T. The number of internal nodes in the DAG D is the same as the number of internal nodes in T. However, sister leaf nodes in T with the same number of descendants are grouped together in D and leaf nodes descending from edges with no mutations are treated as singletons grouped together in D. For example, the leaves in Figure 2B subtending from edges *i* and *j* are grouped into in Figure 2D, as they both have haplotype frequency 2. However, the leaves subtending from the e and f edges are not grouped (and correspond to and in the DAG Figure 2D) since they have respective haplotype frequencies 2 and 1. We label the root node of D as and increase the index *i* of each node from top to bottom, moving left to right. For , we assign a directed edge if the node in T corresponding to is connected to the node in T corresponding to . The index set of internal nodes in D is denoted by and the index set of leaf nodes is denoted by .

#### Information carried by the nodes in D:

Each node in D represents a vector, , which includes number of descendants, number of mutations, and latent allocation of mutations. Although the number of descendants and number of mutations are part of the observed data, the allocation of mutations can be seen as a random variable; for ease of exposition, we use capital letters to denote all three types of information. We define the vector as follows:where denotes the number of descendants of (*i.e.*, of sampled sequences subtended by) node , denotes the number of mutations separating from its parent node, and denotes the allocation of mutations along (described in detail below). The number of descendants is thus the number of individuals/sequences descending from node (this information is part of T). For internal nodes, records the cardinality of a mutation group, represented as a red number along the edge of T in Figure 2B, where *i* is the index of the parent node of . Leaf nodes in D may correspond to more than one leaf node in T, namely any sister nodes with the same number of descendants. In this case, is a vector with the cardinalities of the corresponding mutation groups (see for example node in Figure 3B). To keep the DAG construction simple, we only allow groupings of leaf nodes and not of internal nodes with identical descendants carrying identical numbers of mutations. We note that, in principle, it would be possible to compress the number of internal nodes of the DAG by exploiting all the symmetries observed in the data.

#### Allocation of mutation groups along :

The latent allocation variables determine a possible correspondence between subtrees in and nodes in D; in particular, indicates the branches in that subtend the subtrees corresponding to nodes if are child nodes of .

Allocations of mutations to branches are usually not unique and computation of the conditional likelihood requires summing over all possible allocations. In Figure 3A we show one such possible allocation of the mutation groups of the gene tree in Figure 2B along the Tajima’s genealogy in Figure 2C. For example, mutation group a in Figure 2B with cardinality 1 (number in red) is a mutation observed in seven individuals (sum of black numbers of leaves descending from edge marked a). This same mutation group, a, is shown as a red number 1 in Figure 3A allocated to branch . If is an internal node, the number of mutations is denoted as a vector of length 1. If is a leaf node, can be a vector of length >1. Details on notation for allocations can be found in Appendix B.

### Computing the conditional likelihood

Under the infinite sites mutation model, mutations are superimposed independently on the branches of as a Poisson process with rate *μ*. To compute , we marginalize over the latent allocation information in the DAG D; that is, we sum over all possible mappings of mutations in T to branches in as follows:where , , denotes the index of the parent of node *i* in D, and we set because it is assumed that there are no mutations above the root node and the length of the root branch . Writing ℒ for the tree length of (*i.e.*, the sum of the lengths of all branches of ) and factoring out a global factor (due to the Poisson distribution of mutations across the genealogy) from each of the above products over , we havewhere is the set of all permutations of divided into groups of different sizes. The number of different permutations of the *k* values of divided into groups of sizes is(6)For example, assume that and with branch lengths . In this case, because there will be three branches with two mutations, because there will be one branch with no mutations, and because there will be two branches with three mutations. The number of permutations of mutations groups divided into groups with cardinalities of sizes is .

The conditional likelihood is calculated via a backtracking algorithm (Appendix C). The algorithm marginalizes the allocations by traversing the DAG from the tips to the root. The pseudocode and an example can be found in Appendix C.

### The case of unknown ancestral states

Up to now, we have assumed that the ancestral state was known at every segregating site. The representation of the data Y that we use in this case records the cardinalities of each mutation group and the genealogical relations between these groups, but does not assign labels to the sequences. Hence, in the terminology of Griffiths and Tavaré (1995), our data corresponds to an *unlabeled rooted gene tree*.

When the ancestral types are not known, the data (now denoted ) may be represented as an unlabeled *unrooted* gene tree. By the remark following Equation 1 in Griffiths and Tavaré (1995), if *s* is the number of segregating sites, then there are at most unlabeled rooted gene trees that correspond to the unrooted gene tree of the observed data . By the law of total probability [see also Equation 10 in Griffiths and Tavaré (1995)], the conditional likelihood of can be written as the sum over all compatible unlabeled rooted gene trees of the probability of conditionally on . That is:(7)where each of the corresponds to a unique unlabeled rooted gene tree compatible with the unrooted gene tree and denotes the number of those unlabeled rooted gene trees. In the following sections, we shall assume that the ancestral type at each site is known.

### Bayesian inference of the effective population size trajectory

Our posterior distribution of interest is(8)where has a Gaussian process prior with mean **0** and covariance function (Rasmussen and Williams 2006). This specification ensures is nonnegative. In our implementation, we assume a regular geometric random walk prior, that is, at *B* regularly spaced time points in withThe parameter *τ* is a length-scale parameter that controls the degree of smoothness of the random walk. We place a γ prior with parameters and on *τ*, reflecting our lack of prior information in terms of high variance about the smoothness of the logarithm of the effective population size trajectory.

We approximate the posterior distribution of model parameters via an MCMC sampling scheme. Model parameters are sampled in blocks within a random scan Metropolis-within-Gibbs framework. Our algorithm initializes with the corresponding Tajima genealogy of the UPGMA estimated tree implemented in phangorn (Schliep 2011). Given an initial genealogy, our algorithm initializes and *τ* with the method of Palacios and Minin (2012) implemented in phylodyn (Karcher *et al.* 2017). We then proceed to generate: (1) a sample of the vector of effective population sizes and precision parameter as described in *Split Hamiltonian Monte Carlo updates of *, (2) a sample of the vector of coalescent times as described in *Hamiltonian Monte Carlo updates of coalescent times* and *Local updates of coalescent times* where we modify a single coalescent time, and (3) a sample of ranked tree shape as described in *Metropolis–Hastings updates for ranked tree shapes* in each iteration. To summarize the effective population size trajectory, we compute the posterior median and 95% credible intervals pointwise at each grid point in , were is the maximum time to the most recent common ancestor sampled.

#### Metropolis–Hastings updates for ranked tree shapes:

There is a large body of literature on local transition proposal distributions for Kingman’s topologies (Kuhner *et al.* 1998; Rannala and Yang 2003; Drummond *et al.* 2012; Whidden and Matsen 2015; Aberer *et al.* 2016). In this paper, we adapted the local transition proposal of Markovtsova *et al.* (2000) to Tajima’s topologies. We briefly describe the scheme below and provide a pseudocode algorithm in Appendix C (Algorithm 3).

Given the current state of the chain , we propose a new ranked tree shape in two steps. For step 1, we first sample a coalescent interval uniformly at random, where . Note that we will never select the interval at the top of the tree (see Figure A1A). Given *k*, we focus solely on the coalescent events at times and . For step 2, there are two possible scenarios. Case A: the lineage created at time , labeled *k*, coalesces at time (first row of Figure 4A). Case B: lineage *k* does not coalesce at time (Figure 4B). In Case A, we choose a new pair of lineages at random to coalesce at time from the three lineages subtending *k* and (excluding *k*), and we coalesce the remaining lineage with *k* at ( and in Figure 4). In Case B, we invert the order of the coalescent events; that is, the two lineages descending from *k* are set to coalesce at time and lineages descending from are set to coalesce at time ( a in Figure 4). Note that the numerical labels are included to clarify the picture: lineages subtending both Case A and Case B can be either labeled (if there is a vintage subtending that lineage) or not (if there is a singleton). The transition probability is given by the product of the probabilities of the two steps. The new ranked tree shape is accepted with probability given by the Metropolis–Hastings ratio defined below:(9)We note that our proposal can result in the same ranked tree shape. However, we tested alternative proposals that precluded this event and we did not find any notable difference in the overall performance of the MCMC algorithm.

#### Split Hamiltonian Monte Carlo updates of :

To make efficient joint proposals of γ and *τ*, we use the Split Hamiltonian Monte Carlo method proposed by Lan *et al.* (2015). Conditioned on , the target density becomes . This is the same target density implemented in Karcher *et al.* (2017) for fixed coalescent times t.

#### Hamiltonian Monte Carlo updates of coalescent times:

Given the current state , we propose a new vector of coalescent times with target density by numerically simulating a Hamilton system with Hamiltonian(10)where s is the momentum vector assumed to be normally distributed. The system evolves according to:(11)We use the *leapfrog* method (Neal 2011) with step size ϵ and a *p* Poisson with mean 10 distributed number of steps to simulate the dynamics from time to . Each leapfrog step of size *ϵ* follows the trajectory:(12)For our implementation, we set the mass matrix , the identity matrix. We simulate the Hamiltonian dynamics of the logarithm of times to avoid proposals with negative values. Solving the equations of the Hamilton system requires calculating the gradient of the logarithm of the target density with respect to the vector of log coalescent times. The gradient of the log conditional likelihood (score function) is calculated at every marginalization step in the algorithm for the likelihood calculation.

At the beginning of *Bayesian inference of the effective population size trajectory*, we described how we assume a regular geometric random walk prior on at *B* regularly spaced time points in . Ideally, the window size *T* must be at least , the time to the most recent common ancestor (TMRCA). However, is not known. Our initial values of coalescent times t are obtained from the UPGMA implementation in phangorn (Schliep 2011) with times properly rescaled by the mutation rate, and we set . We initially discretize the time interval into *B* intervals of length . As we generate new samples of t, we expand or contract our grid according to the current value of by keeping the grid interval length fixed to , effectively increasing or decreasing the dimension of γ.

#### Local updates of coalescent times:

In addition to Hamiltonian Monte Carlo (HMC) updates of coalescent times, we propose a move of a single coalescent time (excluding the TMRCA ) chosen uniformly at random and sampled uniformly in the intercoalescent interval; that is, we choose and . This is a symmetric proposal and the corresponding Metropolis–Hastings acceptance probability is(13)While these updates may seem unnecessary in light of the Hamiltonian updates of coalescence times (*Hamiltonian Monte Carlo updates of coalescent* *times*), we observed better performance of our MCMC sampler by including this additional proposal. One reason may be our choice of M in *Hamiltonian Monte Carlo updates of coalescent* *times* that does not account for the geometric structure of the posterior distribution of coalescent times. However, a better choice of M comes with higher computational burden than a simple local update of coalescent times.

#### Multiple independent loci:

Thus far, we have assumed our data consist of a single linked locus of *s* segregating sites. We can extend our methodology to *l* independent loci with segregating sites for . In this case, our data consist of *l* aligned sequences with elements , where 0 indicates the ancestral allele as before. We then jointly estimate the Tajima’s genealogies , precision parameter *τ*, and vector of log effective population sizes γ through their posterior distribution:(14)In Equation 14, we enforce that all loci follow the same effective population size trajectory but every locus can have its own mutation rate .

### Data availability

The methods presented in this article are implemented in the open source R package phylodyn (https://github.com/JuliaPalacios/phylodyn). The data files necessary to reproduce both the simulated and real data analyses are provided with the R package.

## Results

### The performance of BESTT in applications to simulated data

We tested our new method, BESTT, on simulated data under four different demographic scenarios. Note that in this section, is rescaled to the coalescent timescale, meaning that is the pairwise rate of coalescence at time *t* in the past relative to the rate at the present time zero. We simulated genealogies under four different population size trajectories:

A period of exponential growth followed by constant size:(15)

A trajector with instantaneous growth:(16)

Exponential growth: .

A constant trajectory: .

Given a genealogy of length , where is the intercoalescent length while there are *j* lineages, we drew the total number of mutations (segregating sites) *s* according to a Poisson distribution with parameter . We then placed the mutations uniformly at random along the branches of the genealogy. For each of the *s* mutations, we assigned the mutant type to individuals descending from the branch where the mutation occurred and the ancestral type otherwise.

We summarize our posterior inference by the posterior median and 95% Bayesian credible intervals (BCIs) after 200 thousand iterations, and thinned every 10 iterations with 100 iterations of burn in. Our initial number of change points for was set to 50 over the time interval , where is the initialized time to the most recent common ancestor; however, over the course of MCMC iterations, this number could increase or decrease according to the posterior distribution of .

We assess the accuracy and precision of our estimates using the sum of relative errors (SRE)(17)where is the estimated effective population size trajectory at time . Second, we computed the mean relative width (MRW) as(18)where corresponds to the 97.5% upper limit and corresponds to the 2.5% lower limit of the estimated posterior distribution of . In addition, we measured how well the 95% credible intervals cover the truth and compute the envelope measure, ENV:(19)We first simulated three data sets of individuals with an average number of 100 segregating sites under different types of population size trajectories: constant, exponential growth, and instantaneous growth. Results are depicted in the first column of Figure 8. Posterior medians and 95% credible intervals are shown as black curves and gray shaded areas, respectively. The trajectories used to simulate the data are depicted as dashed lines. Figure 8 shows that our BESTT method recovers the constant and exponential growth trajectories very well, but the instantaneous growth scenario is less accurate and with high uncertainty (wide credible intervals). In all three cases, our envelope measure is > 95%. Performance measures on all simulations are summarized in Table 1.

We analyzed the effect of increasing the number of segregating sites, the number of samples, and the number of independent genealogies on posterior inference with BESTT. In all three cases, we expect our method to better recover the truth. Figure 5 shows our results on simulated data under a population size trajectory with instantaneous growth (Equation 16) of individuals with 31, 63, and 120 segregating sites. As expected, our method recovers the truth with higher precision (MRW) and accuracy (SRE) when we increase the number of segregating sites. Increasing the number of segregating sites may result in more constraints in the gene tree. For , there are 7936 possible ranked tree shapes; however, for the data sets simulated with 31, 63, and 102 segregating sites, there are only , , and ranked tree shapes compatible with their corresponding gene trees. These numbers were estimated by importance sampling (Cappello and Palacios 2019).

As another performance assessment, we simulated data sets from a population size trajectory with instantaneous growth with varying numbers of samples. We simulated data sets with , 25, and 35 samples with 215 expected numbers of segregating sites. Our results depicted in Figure 6 show that our method performs better in terms of SRE and MRE when the number of samples increases. Similarly, precision (MRW) and accuracy (SRE) increases when inference is done from a larger number of independent data sets. Finally, Figure 7 shows our results from 1, 5, and 10 data sets simulated from 1, 5, and 10 independent genealogies of 10 individuals with a population size trajectory of growth followed by a constant period (Equation 15). As expected, our method’s performance substantially increases by increasing the number of independent data sets.

### Comparison to other methods

To our knowledge, there is no other method for inferring (variable) effective population size over time from haplotype data that assumes the infinite sites mutation and a nonparametric prior on ; therefore, we cannot directly compare our method to others. Moreover, our method is the only one that explicitly averages over Tajima genealogies instead of Kingman genealogies. BEAST (Drummond *et al.* 2012) is a program for analyzing molecular sequences that uses MCMC to average over the Kingman tree space and it is therefore a good reference for comparison to our method. We compared our results to the Extended Bayesian Skyline Plot method (EBSP) (Heled and Drummond 2008) and the Skygrid method (Gill *et al.* 2013) implemented in BEAST.

Since the infinite sites mutation model is not implemented in BEAST, we first converted our simulated sequences of 0s and 1s to sequences of nucleotides by sampling *s* ancestral nucleotides uniformly on and assigning one of the remaining three types uniformly at random to be the mutant type. This corresponds to a simulation of the Jukes–Cantor mutation model (Jukes and Cantor 1969) that is currently implemented in BEAST.

We compare the results of BESTT to those of BEAST EBSP and Skygrid (Drummond *et al.* 2005, 2012) in Figure 8. We note that results from BEAST are generated from 10 million iterations and thinned every 1000 iterations, while results from BESTT are generated from 200 thousand iterations.

We compared our point estimates from all methods to the ground truth for each simulation (Table 2). In two cases, BESTT has better envelope than BEAST. For the exponential growth simulation (Figure 8, second row) the BEAST EBSP result has better SRE and MRW, however, the credible intervals are uneven with very wide intervals at the ends. In all cases, the BEAST Skygrid results have wider credible intervals. For the instantaneous growth simulation (Figure 8, third row), BEAST EBSP did not generate many simulations beyond the time point 0.06; for this reason, we recomputed the performance statistics for the overlapping time interval . In this interval, BESTT outperforms both methods implemented in BEAST in terms of envelope and SRE. The last column of Figure 8 shows the posterior distribution of the TMRCA. For the case of constant population size, the true value of the TMRCA is contained in the 95% BCI estimated with BESTT but it is not contained in the 95% BCIs estimated with the two methods implemented in BEAST. In the exponential growth simulation, the true TMRCA is contained in the 95% BCIs estimated with the three methods and the instant growth method; the true TMRCA is not contained in the 95% BCIs of the three methods.

We note that BEAST Bayesian Skygrid (Gill *et al.* 2013) is a more comparable method to BESTT since it assumes Gaussian process priors on log like BESTT.

### Computational performance of BESTT

BESTT approximates the posterior distribution (a) , where is a Tajima’s genealogy instead of (b) , where g is a Kingman’s genealogy and is the labeled data, to estimate . These two posterior distributions are the same when every individual of the sample has its own private mutation group and no shared mutation groups. Otherwise, the number of Tajima’s trees compatible with observed data Y, *i.e.*, Tajima’s trees such that , is smaller than the number of Kingman’s trees compatible with observed labeled data (Cappello and Palacios 2019). That is, we are required to estimate the posterior of a smaller number of trees. For this reason, we argue that Tajima’s coalescent is a more efficient model than Kingman’s coalescent for estimating the posterior distribution of . However, a single conditional likelihood calculation requires the sum over all possible allocation of mutation groups to branches of . Our algorithm only accounts for allocations constrained by the DAG and the ranked tree shape of . For the data depicted in Figure 2, A and B and of Figure 2C, there are only eight different possible allocation paths of all mutation groups to branches. In Appendix C, we detail how our algorithm finds these paths. The number of paths depends on the number of subtrees with the same family size path in the DAG and in the ranked tree shape. In the best case, our algorithm will find a path in , where is the number of nodes in the gene tree. In general, the number of allocation paths will be much smaller than the number of labeled trees compatible with a ranked tree shape. In our implementation, we estimate posterior (a) with MCMC. The main difference between our MCMC algorithm and the MCMC algorithm implemented in BEAST is the tree topology sampler. While our MCMC algorithm explores the space of ranked tree shapes with local move proposals of ranked tree shapes, BEAST explores the space of labeled, ranked tree shapes with local move proposals of labeled trees. A formal assessment of the efficiency of our MCMC algorithm and its comparison to the MCMC implementation in BEAST is beyond the scope of this manuscript and subject of future research.

### Inferring human population demography from mtDNA

We selected samples of mtDNA at random from 107 Yoruban individuals available from the 1000 Genomes Project phase 3 (1000 Genomes Project Consortium *et al.* 2015). We retained the coding region, bp, according to the rCRS reference of human mtDNA (Anderson *et al.* 1981; Andrews *et al.* 1999) and removed 38 insertions/deletions. Of the 260 polymorphic sites, we retained 240 sites compatible with the infinite sites mutation model. The final file is available in https://github.com/JuliaPalacios/phylodyn. To encode our data as 0s and 1s, we use the inferred root sequence **RSRS** of Behar *et al.* (2012) to define the ancestral type at each site. To rescale our results in units of years, we assumed a mutation rate per site per year of (Rebolledo-Jaramillo *et al.* 2014). We compare our results with the Extended Bayesian Skyline method (Drummond *et al.* 2012) implemented in BEAST in Figure 9. When applying BEAST, we assumed the Jukes–Cantor mutation model. Both methods detect an inflection point ∼20 KYA followed by exponential growth. The mean TMRCA inferred for these YRI mtDNA samples with BESTT is ∼170 KYA with a 95% BCI of , while the mean TMRCA inferred with BEAST is ∼160 KYA with a 95% BCI of . In Appendix D, we include two more comparisons of BESTT and BEAST.

## Discussion

The size of emergent sequencing data sets prohibits the use of standard coalescent modeling for inferring evolutionary parameters. The main computational bottleneck of coalescent-based inference of evolutionary histories lies in the large cardinality of the hidden state space of genealogies. In the standard Kingman coalescent, a genealogy is a random labeled bifurcating tree that models the set of ancestral relationships of the samples. The genealogy accounts for the correlated structure induced by the shared past history of organisms and explicit modeling of genealogies is fundamental for learning about the past history of organisms. However, the genomic era is producing large data sets that require more efficient approaches that efficiently integrate over the hidden state space of genealogies.

In this manuscript, we show that a lower-resolution coalescent model on genealogies, Tajima’s coalescent, can be used as an alternative to the standard Kingman coalescent model. In particular, we show that the Tajima coalescent model provides a feasible alternative that integrates over a smaller state space than the standard Kingman model. The main advantage in Tajima’s coalescent is modeling of the ranked tree topology as opposed to the fully labeled tree topology, as in Kingman’s coalescent.

*A priori*, the cardinality of the state space of ranked tree shapes is much smaller than the cardinality of the state space of labeled trees. However, in this manuscript we show that when the Tajima coalescent model is coupled with the infinite sites mutation model, the space of ranked tree shapes is constrained by the data and the reduction on the cardinality of the hidden state space of Tajima’s trees is even more pronounced than expected.

To leverage the constraints imposed by the data and the infinite sites mutation model, we apply Dan Gusfield’s perfect phylogeny algorithm (Gusfield 1991) to represent sequence alignments as a gene tree. We exploit the gene tree representation for conditional likelihood calculations and for exploring the state space of ranked tree shapes.

For the calculation of the likelihood of the data conditioned on a given Tajima’s genealogy, we augment the gene tree representation of the data with the Tajima’s genealogy and map observed mutations to branches. We define a DAG with the augmented gene tree. This new representation as a DAG allows for calculating the likelihood as a backtracking algorithm that transverses the gene tree from the leaves to the root. Our implementation’s computational bottleneck lies in the likelihood calculation. Given a Tajima’s genealogy, our likelihood algorithm sums over all possible allocations of mutation groups to branches. Although this number is generally much smaller than the number of labeled genealogies, our algorithm can be further optimized. In future studies, we will explore a sum-product type of algorithm for the likelihood calculation. In the present implementation, we are able to infer effective population size trajectories from samples of size on a regular personal laptop computer within few hours.

Our statistical framework draws on Bayesian nonparametrics. We place a flexible geometric random walk process prior on the effective population size that allows us to recover population size trajectories with abrupt changes in simulations. The inference procedure proposed in this manuscript relies on MCMC methods with three large Gibbs block updates of: coalescent times, effective population size trajectory, and ranked tree shape topology. We use Hamiltonian Monte Carlo updates for continuous random variables—coalescent times and effective population size—and a Metropolis–Hastings sampler for exploring the space of ranked tree shapes. For exploring the genealogical space, Markovtsova *et al.* (2000) suggest a joint local proposal for both coalescent times and topology. Here, we restrict our attention to the topology alone. A future line of research includes the development of a joint local proposal of coalescent times and ranked tree shapes. We also envision that a joint sampler of coalescent times and effective population size trajectories should improve mixing and convergence.

Our method does not model recombination, population structure, or selection. It assumes completely linked and neutral segments from individuals from a single population, and the infinite sites mutation model. While this model is a good approximation for some human molecular data, it is not appropriate for modeling molecular data from other organisms such as pathogens and viral populations. Finally, haplotype data of many organisms are usually sparse with few unique haplotypes presented at high frequencies. Since our algorithm exploits molecular data at the haplotype level, our proposed method is ideally suited for this scenario where the space of ranked tree shapes is drastically smaller than the space of labeled topologies.

## Acknowledgments

We acknowledge the developers of R-ape, R-phangorn, and R-phylodyn, which facilitated our implementations, and thank the editor and two anonymous referees whose suggestions considerably improved the manuscript. This research was supported in part by National Institutes of Health grant R01 GM-131404 and funding from the Alfred P. Sloan Foundation to J.A.P, and by National Science Foundation CAREER award DBI-1452622 to S.R. A.V. was supported in part by the Chaire Modélisation Mathématique et Biodiversité program of VEOLIA Environnement, École Polytechnique, Museum National d’Histoire Naturelle, Fondation X. A.V. and J.A.P. were supported by the France-Stanford Center for Interdisciplinary Studies.

## Appendix

### Appendix A: Matrix Representation of a Ranked Tree Shape

Our algorithms exploit the following encoding of a ranked tree shape by a triangular matrix of size , which we denote by F (Figure A1). Recall that, by convention, and .

First, we declare that if . Next, the number of lineages through time is encoded on the diagonal of F: for *i* in . Finally, for , the entry denotes the number of lineages that do not coalesce in the time interval ; in particular, and for every *i* in , denotes the number of singletons (*i.e.*, external branches that have not coalesced) in the time interval (Figure A1). Other statistics of the ranked tree shape can be expressed in terms of the corresponding matrix F. Among them, the number *c* of cherries is equal to the number of times that the number of singletons decreases by two between lines *i* and , since such an event means that the coalescence separating these two epochs was that of two external branches. That is,

### Appendix B

#### Detailed allocation of mutation groups along

The latent allocation random variables are constrained by the information in the Tajima’s genealogy . In a given , every subtree is labeled by its ranking from past to present (Figure A1). Subtree *i* is subtended by branch with length , for . We will assume that , the length of the root branch, is 0. Let *c* be the number of cherries (nodes with two leaves) in ; the two branches of a given cherry share the same label . The actual label of external branches is arbitrary but, for ease of exposition in our figures, we first label the cherries’ branches from left to right by ; singleton branches are labeled from left to right by (Figure 2C). As mentioned before, the length of is the number of the corresponding sister nodes in T that were grouped together in forming node . In this case, denotes a collection of vectors of branch labels in subtending the child node subtrees of node . corresponds to the branch subtending from the leftmost child node of on D, corresponds to the branch subtending from the next child node of , etc., and corresponds to the branch subtending from the rightmost child node of on D. Observe that, since we group some of the leaf nodes in T into a single node in D, any may be a vector of branch labels; for example and in Figure 3B.

### Appendix C

#### Algorithms for conditional likelihood calculation

The following two algorithms detail the calculation of . Y is encoded in , the observed data as a tree structure. Each node in has number of descendants (or lineages) and mutation information attached to it. Tajima’s genealogy is encoded as , which contains the ranked tree shape , and , which contains the vector of coalescent times t multiplied by the mutation rate *μ*.

**Algorithm 1** Calculate Likelihood procedure

**Input:**,**Output:**Log Likelihood1: Initiate to be the set of leaf nodes of with at least one descendant. Initiate and to be zero. Initiate

*current_path*to be empty.2: Call CalcLL_recursive(

*LL, index, current_path, Fpath, times, Genetree*).3:

**return**

**Algorithm 2** CalcLL_recursive(, , *current_path*, , , ) procedure

1:

**if**{When a complete path node is found}**then**2:

**for**in**do**3: Calculate log likelihood based on and number of mutations of in

*current_path*.4: Accumulate to total log likelihood

5:

**end for**6:

**else**7:

**for**in**do**8: Check compatibility of the , according to the given .

9:

**if**is compatible with**then**10: Update by assigning it to the current step in

11: Update . If a has been mapped entirely, remove from pool, update its parent , and potentially add parent node to if parent node has not been entirely assigned.

12: Append this to

*current_path*13: Call CalcLL_recursive(

*LL, index*+ 1,*current_path, Fpath, Genetree*)14: Restore previous , , and

*current_path*15:

**end if**16:

**end for**17:

**end if**

To illustrate algorithms 1 and 2, we use our examples of Figure 2 and Figure 3. Algorithm 1 initiates the *pool* with nodes . Then, Algorithm 2 cycles through this list. Assume the first node is . This node has descendants and the *ancestry* is with sizes: . On the other hand, the first coalescent event (from present to past) labeled 16 in (Figure 3A) has ancestry with sizes: . Therefore, this node is *compatible*. The node is removed from the pool, its parent node added to the pool, and is assigned to the path. At this time *current_path* = *Z*_{8} and the *pool* becomes: . The algorithm then cycles through this list and picks . This node has size ancestry . On the other hand the second coalescent event labeled 15 has size ancestry: . Since the node size ancestry is contained in the second coalescent event’s size ancestry, this node is *compatible*. The current path becomes *current_path* = *Z*_{8} − *Z*_{10} and the pool becomes . We continue this procedure until we reach the path *current_path =* .

Once a path is found, the algorithm backtracks the path until there is one compatible node and the path continues to grow. A sequence of backtracking and growing is the following:

In the first sequence of steps , the path decreases. This happens because there are no alternative compatible paths at the point when the sequence starts to grow until step 10. At step 10, the algorithm does not find a compatible way to keep growing the path so the algorithm starts to backtrack again until step 12. From steps 12 to 18, the algorithm grows the path until a new complete path has been found. A complete path has the correspondence of coalescent events to nodes in the gene tree. The first element of the path corresponds to the coalescent event at time , the second element of the path corresponds to the second coalescent event at time . The last element of the path is when all sequences coalesce at time . In this example, the algorithm finds eight paths. Once the paths are found, the algorithm computes the likelihood and the result is the sum of the likelihoods of the eight paths.

#### Markovian proposal of ranked tree shapes

The following algorithm generates a new ranked tree shape from a Markovian proposal and outputs the corresponding transition probabilities. This proposal is used in *Metropolis–Hastings updates for ranked tree shapes*.

**Algorithm 3** Transition proposals for ranked tree shapes

**Input: Output:**, ,1. Set .

2. Sample with uniform discrete probability a coalescent event

*k*from the set Set .3.

**If**lineage*k*coalesces at time (Figure 4, Case A)

**If**the lineages coalescing at time are singletons (Figure 4, Case A, lineages 1 and 2 in )No sampling required to distinguish between two singletons. Set .

Update : merge one singleton with the lineage coalescing at (excluding lineage

*k*) in , then merge the second singleton at time with lineage*k*.Compute the probability of restoring the ordering of to .

**Else**Sample one of the lineages coalescing at time with uniform discrete probability. Set .

Update : merge the sampled lineage with the one coalescing at time in . At time , merge the lineage not sampled with the new lineage

*k*.Compute the probability of restoring the ordering of to .

**Else**(Figure 4, Case B)

Swap the coalescent events. Lineages descending from *k* are now set to coalesce at time and lineages previously descending from are now set to coalesce at time . Set and .

4. , .

### Appendix D

We replicated the BEAST EBSP analysis of the 35 Yoruban individuals from the 1000 Genomes Project phase 3 using the whole mtDNA coding region consisting of 15,409 sites. In both cases, we assumed the Jukes–Cantor mutation model (Jukes and Cantor, 1969). Figure D1 shows the comparison between EBSP inference from the 240 segregating sites retained in *Inferring human population demography from mtDNA* that are compatible with the infinite sites mutation model assumption. In both cases, we recover very similar trajectories.

In addition, we compared our results with BEAST Bayesian Skyline Plot (BSP) (Drummond and Rodrigo, 2000). For our reduced data set of 240 segregating sites, we could not generate valid inference of with Metropolis–Hastings acceptance probability > 0. Instead, we were able to generate results with BEAST BSP from the complete data set of 15,409 sites. The comparison of our method from 240 segregating sites to BEAST BSP from 15,409 sites is depicted in Figure D2.

### Appendix E

In Figure E1A, we show the data from Figure 2A with an additional haplotype (10) with frequency 1 and an additional column grouped with mutation group h (not shown in the table). In Figure E1B we show the corresponding perfect phylogeny. This new perfect phylogeny has a new tip with black label 1 (frequency) subtending from a branch with 0 mutations (red label). The path from the leaf to the root shows that this haplotype has a unique mutation corresponding to mutation group *a*. We note that mutation group labels carry no information. We incorporate the labels in the figure for ease of exposition. Since mutation group h has now multiplicity 2, the branch labeled h has now a red label 2.

## Footnotes

*Communicating editor: G. Coop*

- Received May 28, 2019.
- Accepted September 6, 2019.

- Copyright © 2019 by the Genetics Society of America

Available freely online through the author-supported open access option.