## Abstract

General genealogical processes such as Λ- and Ξ-coalescents, which respectively model multiple and simultaneous mergers, have important applications in studying marine species, strong positive selection, recurrent selective sweeps, strong bottlenecks, large sample sizes, and so on. Recently, there has been significant progress in developing useful inference tools for such general models. In particular, inference methods based on the site frequency spectrum (SFS) have received noticeable attention. Here, we derive a new formula for the expected SFS for general Λ- and Ξ-coalescents, which leads to an efficient algorithm. For time-homogeneous coalescents, the runtime of our algorithm for computing the expected SFS is where *n* is the sample size. This is a factor of faster than the state-of-the-art method. Furthermore, in contrast to existing methods, our method generalizes to time-inhomogeneous Λ- and Ξ-coalescents with measures that factorize as and respectively, where *ζ* denotes a strictly positive function of time. The runtime of our algorithm in this setting is We also obtain general theoretical results for the identifiability of the Λ measure when *ζ* is a constant function, as well as for the identifiability of the function *ζ* under a fixed Ξ measure.

WHEN summarizing sequence data from *n* individuals, a natural and often-used statistic is the site frequency spectrum (SFS), where is simply the number of sites at which *k* of *n* individuals carry the mutant (or the derived) allele. Despite being only numbers, the SFS still contains a surprising amount of information about the history and structure of the population from which the individuals were sampled. Indeed, for neutrally evolving populations that are well modeled by Kingman’s coalescent (Kingman 1982), the expected value of the SFS was first computed for populations of constant size (Fu 1995), extended to populations of variable size (Griffiths and Tavaré 1998; Polanski *et al.* 2003; Polanski and Kimmel 2003), and has since been used as a statistic for demographic inference in numerous studies (*e.g.*, Nielsen 2000; Gutenkunst *et al.* 2009; Coventry *et al.* 2010; Gravel *et al.* 2011; Excoffier *et al.* 2013; Bhaskar *et al.* 2015; Kamm *et al.* 2015; Gao and Keinan 2016).

Yet not all populations are well modeled by Kingman’s coalescent. In fact, Kingman’s coalescent can be viewed as a special case of a broader class of coalescent processes called Λ-coalescents (Pitman 1999; Sagitov 1999). While Kingman’s coalescent permits only pairwise mergers of lineages, Λ-coalescents allow two or more lineages to merge simultaneously in a single coalescence event. Such events arise when a single individual has many offspring (Möhle and Sagitov 2001; Eldon and Wakeley 2006), under models of recurrent selective sweeps (Durrett and Schweinsberg 2004, 2005), in populations undergoing continuous strong selection (Neher and Hallatschek 2013; Schweinsberg 2015), and in many other models. Λ-Coalescents can further be seen as special cases of a broader class of coalescents called Ξ-coalescents (Schweinsberg 2000). In Ξ-coalescents, more than one merger event can occur simultaneously, resulting in simultaneous multiple mergers. While Ξ-coalescents have received less attention than Λ-coalescents in the literature, they still arise in certain models of selection (Huillet 2014), models of selective sweeps (Durrett and Schweinsberg 2005), models with repeated strong bottlenecks (Birkner *et al.* 2009), and for certain diploid mating models (Möhle and Sagitov 2003). Also, since Ξ-coalecents generalize Λ-coalescents, any results presented about Ξ-coalescents immediately pertain to Λ-coalescents.

More formally, time-homogeneous Ξ-coalescents are governed by a measure on the set Furthermore, we consider time-inhomogeneous Ξ-coalescents with measures that decompose into a time-independent part and a strictly positive function of time, where represents (for historical reasons) the inverse intensity. That is, the coalescent is now governed by the measure For example, for Kingman’s coalescent, the point mass at zero, and corresponds to the scaled effective population size at time *t*. For other models, does not necessarily correspond to the population size, but has an interpretation specific to the model. For example, Neher and Hallatschek (2013) show empirically that the rate of coalescence events in a model of continuous strong selection is a nonlinear function of the population size and the first two moments of the distribution of mutational effects. For a review of the mechanics of Λ-coalescents, see Pitman (1999) and for a review of Ξ-coalescents, see Schweinsberg (2000). For an alternative perspective, see Donnelly and Kurtz (1999) and Birkner *et al.* (2009) for a lookdown construction of particle systems with general reproduction mechanisms.

As mentioned above, the expected SFS for Kingman’s coalescent is well understood and can, in fact, be computed for an arbitrary *ζ* in time (Polanski and Kimmel 2003). For Λ- and Ξ-coalescents, however, the expected SFS can be computed only for constant *ζ* and the method for Λ-coalescents takes time (Birkner *et al.* 2013a) and the method for Ξ-coalescents takes time exponential in *n* as a sum over partitions of the first *n* numbers must be performed (Blath *et al.* 2015). Here we present a method that can compute the expected SFS for time-inhomogeneous Λ- and Ξ-coalescents with arbitrary *ζ* in time. In the case where *ζ* is a constant function, our method can compute the expected SFS in time given the rate matrix **Q** of the ancestral process, which is defined more precisely below. We also prove some results about the sample size needed to make Λ identifiable for popular classes of Λ measures for constant *ζ*, as well as results about the sample size needed to make *ζ* identifiable for a fixed

There has also been some related work on determining the asymptotic behavior of the expected SFS as In this setting, Berestycki *et al.* (2007, 2014) derive some simple formulas for time-homogeneous Λ-coalescents that come down from infinity. For finite *n*, however, these asymptotic formulas can be rather inaccurate. Indeed, even for Birkner *et al.* (2013a) show that for some Λ-coalescents, there is a sizable discrepancy between the asymptotic formulas and the SFS obtained by simulation, highlighting the need for finite-sample calculations. Nevertheless, such asymptotic results highlight some interesting properties of Λ-coalescents and are reviewed in Berestycki (2009).

The remainder of this article is organized as follows. We first present our main results about the computation of the SFS for time-inhomogeneous coalescents and discuss the practical runtime of our implementation. We also investigate the variation in the empirical SFS and study the ability to infer the underlying model, using the empirical SFS. Then, we prove some identifiability results about general coalescents. We conclude with a discussion on the implications of our results.

## Main Theoretical Results on the Expected SFS

Here we present our theoretical results on the expected SFS for a general Ξ-coalescent with a measure of the form These results lead to an -time algorithm for computing the expected SFS and can be improved to if *ζ* is a constant function. Briefly, we use subsampling arguments to show that the expected SFS can be computed from where denotes the expected time to the most recent common ancestor for sample size Then, we show how to compute using a spectral decomposition of the rate matrix **Q** of the ancestral process (also known as the block-counting process) of the time-homogeneous coalescent corresponding to More specifically, **Q** is a lower triangular matrix, where is the instantaneous rate at which *i* unlabeled lineages merge to form *j* unlabeled lineages when For example, for Kingman’s coalescent,Using this notation, we are now ready to state our main result. The rest of this section provides lemmas that contain formulas for the matrices in *Theorem 1*, as well as a proof of those lemmas and *Theorem 1*.

**Theorem 1.** *Consider an arbitrary time-inhomogeneous* Ξ-*coalescent governed* *by a measure* *such that the expected time* *to the first coalescence* *for a sample of size* *k is finite for* *Let* *Then*, *there* *exists a universal matrix* *that does not depend on the measure and a* *matrix* *that depends on* Ξ *but not* *ζ*, *such that**where* *is the population-scaled mutation rate*. *Furthermore*, *this allows* * **to be* *computed in* * **time*.

Computing the matrix **L** in *Theorem 1* is costly. For time-homogeneous coalescents, it is possible to compute directly, resulting in the following corollary:

**Corollary 1.** *In the same setting as Theorem 1*, *if ζ is a constant function*, *then* *can be computed in* *time*.

In what follows, *Lemmas 1* and *2* provide formulas to compute the universal matrix **A**, while *Lemmas 3* and *4* provide formulas to compute **L**, which is related to the spectral decomposition of the rate matrix **Q**. The expected first coalescence times can be computed as (Polanski and Kimmel 2003; Bhaskar *et al.*, 2015)Note that since **A** and **L** do not depend on *ζ*, the SFS depends on time and the inhomogeneity of the coalescent process only through the first coalescence times

**Lemma 1.** *Let * *denote the antisingleton entries* (*i.e.*, *entries where exactly one individual has the ancestral allele and all other individuals* *have the derived allele*) *of the SFS for samples of sizes* *Then*,*where the entries of* *are given by*

*Proof*. We use induction to show that(1)Using exchangeability and a subsampling argument similar to that of Kamm *et al.* (2015, lemma 2), we obtain, for (2)which follows from removing an individual uniformly at random from a sample of size *k*. Now, define the *level* of as and note that (1) holds for level 1, *i.e.*, for on the left-hand side. Assume that (1) holds for level Then,where the first equality holds by the recursion (2) and the second equality holds by the inductive hypothesis, by noting that and are both one level below □

The following lemma relates to

**Lemma 2.** *Let γ, *

*and θ be defined as above*.

*Then*,

*where*

*is bidiagonal with*

*and*

*for*

*and*

*Proof*. As in the *Proof* of *Lemma 1*, we employ a subsampling argument. Consider a sample of size The only way that a subsample of size *k* can have a different time to the most recent common ancestor is if the removed individual is a singleton after all of the other lineages have coalesced. The probability that we remove that singleton to form our subsample is Then, the expected amount of time during which there is one singleton and all of the other individuals have coalesced scaled by the mutation rate is exactly the antisingleton entry. Thus,for When there are only two lineages, so the total branch length is the antisingleton entry. Thus, Rewriting this as a matrix equation for completes the *Proof*. □

By combining *Lemmas 1* and *2*, we obtain the universal matrix We now show how to compute the Ξ-dependent matrix **L**. First, we establish the following result on the decomposition of the rate matrix **Q**; this result was also obtained by Möhle and Pitters (2014, equation 2.3) for the Bolthausen–Sznitman coalescent.

**Lemma 3.** *Fix an arbitrary* Ξ-*coalescent* *with* * **for* *where* * **Let* *denote the rate matrix of the ancestral process corresponding to* (*that is*, *the process counting the number of extant* *lineages at time t*). *Then*,*where * *with* * **being the Kronecker delta that equals* 1 *if* *and* 0 *otherwise*, *and**Proof*. By the construction of **U**,which implies that Then, since **U** is triangular and has strictly positive diagonal entries, it is invertible. Therefore, □

The following result relates and

**Lemma 4.** *Let * *and* * **be defined as above*. *Fix an arbitrary* Ξ *measure* *and a strictly positive function ζ*. *Now consider a time-inhomogeneous coalescent governed* *by* *If* * **for* *then**where* *is the diagonal matrix* * **with* * **denoting the first* *column of* *and* * **denotes the submatrix of* *in rows and columns* 2 *through n*.

*Proof*. Note that Therefore,where the third equality follows from the fact that **Q** is lower triangular and hence so is its exponential. Now, since **U** is lower triangular, its inverse is as well. Therefore, we may ignore the value of Letting but with note that Then we haveNow, note that for all *i* by *Lemma 3* and induction. This implies or Using this identity, we can rewrite the above expression for aswhere Collecting these equations over in matrix form leads to the desired result. □

Using *Lemma 4*, we now see that the matrix **L** from *Theorem 1* is simply *Lemma 3* provides a recursion to compute **U**, and **D** may be computed by noting that and then since we have

*Proof of Theorem 1*. Combining *Lemmas 1*, *2*, and *4*, we obtain the equations in *Theorem 1*. For the runtime, note that each of the entries of **U** requires computations, and so computing **U** is The matrices composing **A** are known in closed form, however, and constructing **D** requires filling only entries, each requiring computations for a total of To then obtain the SFS from simply requires iterated matrix vector products taking time. The overall procedure thus requires

□

**Lemma 5.** *For coalescents of the form * *where* *ζ* *is a constant* *function*, * **can be computed recursively from* * **and* **Q** *as follows*:*Proof*. The formulas follow immediately from the homogeneity of the process, recursing on the number of individuals, and noting that the probability that the first coalescence event for a sample of size *k* results in *k* lineages merging down to *l* lineages is

□

*Proof of Corollary 1*. Use *Lemma 5* to compute in time. Then, by *Theorem 1*, which also takes time to compute.

□

**Remark 1.** *Other than computing* **U**, *the algorithm presented in Theorem 1 is* * **Thus*, *for the Bolthausen–Sznitman coalescent* (*Bolthausen and Sznitman* 1998) *or Kingman*’*s* *coalescent*, *where* **U** *is known in closed form* (*Möhle and Pitters* 2014, *theorem* *1.1* *and appendix*), *the SFS can be computed in* * **time even for* *nonconstant ζ*.

**Remark 2.** *The above results can easily be extended to a coalescent where both ζ and* *depend on t*, *as long as* *is piecewise constant*. *For example*, *in the recent past the population may evolve according to a β-coalescent*, *whereas for t* *greater than some* * **the population may evolve according to Kingman*’*s* *coalescent*. *By setting ζ appropriately in Theorem 1*, *one may obtain a* “*truncated SFS*” (*Kamm**et al.* 2015) *for each different* *Then*, *using the truncated SFS for each epoch and* *the same machinery as in* *Kamm**et al.* (2015), *one may compute the full SFS*. *The same techniques also allow one to consider multiple populations*, *with each* *population perhaps evolving according to its own* Ξ *measure*.

## Numerical Results

We implemented *Theorem 1* and *Corollary 1* in Mathematica, and the notebook is available upon request. We can compute the SFS for an arbitrary coalescent for a sample of size in ∼1 sec and a sample of size in a matter of minutes on a laptop computer, which is orders of magnitude faster than the >1 hr reported for a sample size of using the current state-of-the-art method (Blath *et al.* 2015). Furthermore, Blath *et al.* (2015) consider only specific Ξ measures where the number of simultaneous multiple mergers is restricted. Our method has the same runtime for all Ξ measures (after computing the rate matrix and the vector of first coalescence times). See Figure 1 for runtime *vs.* sample size. Furthermore, as noted above, if the spectral decomposition of the rate matrix **Q** is known, then the algorithm is We also present runtimes for the Bolthausen–Sznitman coalescent [which has a closed-form solution for the spectral decomposition (Möhle and Pitters 2014)] in Figure 1.

As long as the rate matrix **Q** of the ancestral process can be found exactly, our method is numerically stable. This is the case for popular Λ-coalescents such as point-mass coalescents and *β*-coalescents, as well as point mass Ξ-coalescents. If the rate matrix must be evaluated numerically, however, high-precision computation may be needed to avoid potential numerical problems due to catastrophic cancellation.

Using simulations, we now investigate the variation in the empirical SFS across independent realizations of the coalescent process and study the ability to infer the underlying model, using the empirical SFS. We consider three different *ζ*’s, illustrated in Figure 2. Due to the association with population sizes in the case of Kingman’s coalescent, we refer to *ζ* as the history or population size history. However, we caution that depending on the finite population size model, *ζ* may not represent the population size, but some other biologically relevant parameter. We consider a constant size history, a bottleneck history that undergoes a temporary 10-fold size reduction, and a growth history with repeated population doublings. For each *ζ*, we consider *β*-coalescents with Note that corresponds to the Bolthausen–Sznitman coalescent, while corresponds to the Kingman coalescent. For each of the nine distinct values of we simulated independent trees with leaves.

In Figure 3, we examine the observed variation in branch lengths across independent realizations of the coalescent process, from which we can deduce the variation in the observed SFS. Specifically, assume that each tree sampled from the coalescent process has the same mutation rate, and, without loss of generality, assume that time has been scaled such that the mutation rate is 1. Let be the sum of branch lengths with *k* leaves and recall that is the *k*th entry of the empirical SFS on the *n* observed individuals. Then, and In Figure 3, we plot for each simulated tree, as well as its expected value Defining for this case of we also plot an estimate of the standard deviation where is the empirical expectation. Now, if we sum the branch lengths and mutations over *m* independent trees [so then and ], then and describe the limiting behavior of both and as by the central limit theorem, and

A recent inconsistency result (Koskela *et al.* 2015, theorem 1) shows that a Λ measure cannot be inferred from a single tree (), even as Indeed, we see in Figure 3 that the branch lengths of a single tree can deviate substantially from For most *k* (say, ), typically or given a single tree. That is, for a single tree, branches subtending more than a few leaves are either not observed or much larger than the expected branch length. However, smaller *k* (especially the singletons, ) have smaller relative standard deviation and thus will tend to have lower relative error as *m* increases.

In the case of Kingman’s coalescent, *ζ* is inferred by minimizing the Kullback-Leibler (KL) divergence between a normalized version of the empirical SFS and a normalized version of the expected SFS (*e.g.*, Bhaskar *et al.* 2015, equation 10). Recall that the KL divergence, between two discrete probability distributions, is where is defined to be 0. We investigate how KL divergence behaves as a function of the number *m* of independent trees simulated in the case of Λ-coalescents. Let be the expected SFS under model and the corresponding branch lengths summed over the first *m* simulated trees. Define as the true probability distribution of derived alleles under scenario and as the conditional distribution of derived alleles, given the first *m* trees simulated under In Figure 4, we plot the KL divergence as a function of *m*, for every considered above (that is, *ζ* is constant, bottleneck, or growth, and *α* is 1, 1.5, or 2). In this case, we see that minimizing identifies the true scenario with access to only a moderate number of independent trees (between 10 and 100).

Figure 4 is encouraging, as not too many independent trees are needed to distinguish between the different scenarios Unfortunately, in some cases it may be impossible to even sample two independent trees (J. Koskela, personal communication). For example, in the model of Birkner *et al.* (2013b), a multiple-merger event happens over a single “generation,” which can cause the multiple merger to affect unlinked sites, resulting in correlated coalescence times. However, in other models, multiple-merger events may affect the genome only locally, and thus trees from unlinked sites are independent. For example, in the selective sweep model of Durrett and Schweinsberg (2005), multiple mergers are caused by selective sweeps taking place over “generations,” and a site experiences a multiple merger if where respectively parameterize the population size, selection strength, and recombination distance to the selected site. Thus, the independence of unlinked trees is not necessarily determined by the Λ or Ξ measure itself, but instead by the prelimiting model.

## Identifiability Results

Before attempting to infer *ζ* or Ξ in practice, it is important to know whether such inference is possible using the SFS. For instance, when inferring *ζ*, if two different functions and produce the same SFS, then it is impossible to distinguish between the two using only the SFS. In such a case, we say that *ζ* is not identifiable. For Kingman’s coalescent if one allows *ζ* to be an arbitrary positive function that produces a finite SFS, then *ζ* is not identifiable (Myers *et al.* 2008). *ζ* is identifiable in the case of Kingman’s coalescent, however, if one restricts *ζ* to be from a set of biologically realistic functions (technically, a set of functions with only a finite number of oscillations) (Bhaskar and Song 2014, theorem 11). We show that a similar result holds for all coalescents of the form where is fixed.

In general it is impossible to infer Ξ from the SFS if Ξ is not restricted. There has been some interest, however, in the case of distinguishing between a subset of Λ-coalescents (Eldon *et al.* 2015). We prove some results about the identifiability of the measure for various subsets of Λ measures when *ζ* is a constant function. We also consider the question posed by Eldon *et al.* (2015) of whether the SFS can distinguish between exponential growth under Kingman’s coalescent and a class of Λ-coalescents with constant *ζ*, and we show that indeed it is possible to distinguish between these cases with a surprisingly small number of samples. We note that our identifiability results require knowledge of the exact expected SFS, whereas Eldon *et al.* (2015) focus on the case where the expected SFS is approximated using an empirical SFS, which is what occurs in practice.

Throughout this section we assume that one has the exact expected SFS (*i.e.*, the object computed by *Theorem 1*).

### Identifiability of *ζ* for fixed Ξ measure

Before proceeding to the results and proofs, we first introduce some notation. Let denote the set of piecewise defined functions with at most *K* pieces made from some function family ℱ. Furthermore, let denote the sign-change complexity of ℱ. Informally, is the supremum of the number of times crosses 0 over functions which is related to the number of oscillations each is allowed to have [see Bhaskar and Song 2014, definition 4, for a formal definition of ]. We will also write for the number of 0 entries in in the spectral decomposition of **Q** for a coalescent on *n* individuals governed by Furthermore, denote by χ the space of Ξ measures such that for all *k*. That is, χ is the set of Ξ measures where for any sample size there is positive probability of a single pairwise merger. If we are considering only Λ-coalescents, then χ contains all Λ measures except for the star coalescent. We now present our main identifiability results and a conjectured bound on

Our main result on the identifiability of *ζ* is the following theorem.

**Theorem 2.** *For an arbitrary* Ξ-*coalescent governed by the measure* *where* *is fixed*, *suppose* * **and* * **Then for each expected SFS* * **there exists a unique* * **consistent with*

First, note that in the case of Kingman’s coalescent, for all *n*, and so in some sense, Kingman’s coalescent is optimal in terms of the number of samples needed to ensure that a certain model space is identifiable. For the Bolthausen–Sznitman coalescent, for all *n*, which follows from the spectral decomposition (Möhle and Pitters 2014, theorem 1.1). For the point mass Λ-coalescent with mass at for all odd entries of are 0 and so for thus implying that larger samples (relative to Kingman’s coalescent or the Bolthausen–Sznitman coalescent) are needed for this coalescent to ensure that a given model space is identifiable. We suspect that is the worst case among all Ξ-coalescents in χ for identifiability, resulting in the following conjecture:

**Conjecture 1.** *For all* *and* ,

If this conjecture is true, then the bound on the sample size needed to have identifiability in *Theorem 2* can be simplified to

### Identifiability of the Λ measure for a constant *ζ*

We also have the following results for Λ-coalescents about the identifiability of the Λ measure.

**Theorem 3.** *Consider the set of point-mass* Λ-*coalescents*: * **If* Λ *is* *restricted to be in this set and* *then the expected* *SFS* * **uniquely determines* Λ.

**Theorem 4.** *Consider the set of β*-*coalescents*: * **If* Λ *is restricted to be in this set and* *then the expected SFS* * **uniquely determines* Λ.

**Theorem 5.** *Consider the set of coalescents * *that is*, *Kingman*’*s* *coalescent with exponential growth*, *point-mass coalescents*, *or* *β*-*coalescents*. *If* Λ *is restricted to be in this set and* *then the expected SFS* * **uniquely determines* Λ.

*Theorem 5* gives a positive theoretical answer to the question of whether the SFS can distinguish between exponential growth and multiple-merger coalescents. Using the techniques presented below, it is straightforward to obtain similar results for other subsets of Λ-coalescents.

### Proofs of the identifiability results

The following lemma is used in proving the theorems in this section and may be of independent interest, as it shows that given the SFS for *n* individuals one can compute the expected time to most recent common ancestor for sample sizes or vice versa.

**Lemma 6.** *For all* Λ- *and* Ξ-*coalescents*, *there is a bijection between the* *expected SFS* * **and the expected times* * **to the most recent common ancestor*.

*Proof*. Combine *Lemmas 1* and *2* to see that with **B** and **C** being universal. Then, since **B** is upper triangular and all of its diagonal entries are nonzero, it is invertible. Furthermore, since **C** is bidiagonal and the diagonal entries are all nonzero, it is also invertible. Therefore, is invertible and since and are related through an invertible matrix, the transformation is bijective.

□

To prove *Theorem 2* we use the following lemma.

**Lemma 7.** *Let * *For all* *and all* **Λ** *other than* * (**i.e.*, *the star coalescent*), *the sequence* * **is strictly increasing*.

*Proof*. Consider a sample of size and a subsample of size *k*. Without loss of generality, assume individual is removed to produce the subsample. The time to the first event is the same for both samples unless the first event involves only individual and one lineage from . That is, the total rate when there are lineages is equal to the total rate when there are *k* lineages plus *k* times the rate at which exactly a particular pair of individuals coalesce. Formally,By assumption, and so the total rates must be strictly increasing.

□

We now prove *Theorem 2*. Our proof relies heavily on the proof of the corresponding result for Kingman’s coalescent (Bhaskar and Song 2014, theorem 11). We essentially show that this setting satisfies the same hypotheses as the Kingman’s coalescent case and then use that result to complete our proof.

*Proof of Theorem 2*. By *Lemma* 6, the SFS is uniquely determined by Then, furthermore, note that from *Lemma 3* the matrix **U** is invertible since it is triangular with all nonzero entries along the diagonal. Then, by the same argument as in Bhaskar and Song (2014, equation 12), we know that if the model space is not identifiable, then for each *k* not corresponding to a zero in **D** (contributing to ), must be the root of the Laplace transform of two different functions in the model space. By *Lemma 7*, these are all distinct, resulting in roots. Then, by taking sufficiently large, we obtain a contradiction via the generalized version of Descarte’s rule of signs (Bhaskar and Song 2014, theorem 4) and the theorem is proved.

□

We now prove *Theorems 3*, *4*, and *5*. The idea is to explicitly calculate the for the first few *k* for each allowed Λ measure and then use *Lemma 6* to show that if Λ is uniquely determined by the first few then it is uniquely determined by

*Proof of Theorem 3. * for all Λ in the set of possible Λ’s. Consider Using *Lemma 5* we see that(3)This is a monotonically decreasing function of and so Λ is uniquely determined by Then, appealing to *Lemma 6*, we see that Λ is uniquely determined by the SFS for

□

*Proof of Theorem 4*. A calculation similar to (3) gives for where This is a monotonically increasing function of and the claim follows from the same argument as in the proof of *Theorem 3*.

□

*Proof of Theorem 5*. Suppose that two distinct Λ-coalescents within the set of allowed models produce the same expected SFS for Then, by *Lemma 6*, they would have the same values of and By *Theorems 3* and *4*, we know that the Λ measures cannot both be point-mass coalescents or *β*-coalescents. From Bhaskar and Song (2014, corollary 8), we also know that the Λ measures cannot both be Kingman’s coalescent with different exponential growth parameters. There are thus three cases. They are all straightforward, albeit tedious.

*Case 1*. One Λ measure is a point-mass coalescent and the other is a *β*-coalescent. Letting (without loss of generality), we can explicitly compute for the point-mass coalescent and the *β*-coalescent, using the same recursive idea as in the *Proof* *of* *Theorem 3*. Let denote the probability that when there are *m* lineages, exactly *k* of them are involved in the next coalescence event. Then, by *Lemma 5* In particular, for the point-mass coalescent this implies Now, recalling the expression of in (3) and letting(4)implies Plugging this into we see that for the point-mass coalescent,(5)A similar calculation for the *β*-coalescent shows that(6)with under the *β*-coalescent. Equating (5) and (6) and solving for *t* results in the solution or But, if then we see that the star coalescent, which corresponds to for the *β*-coalescent, which is not in the set of allowed *β*-coalescents. If we see that which corresponds to Kingman’s coalescent, and for the *β*-coalescent, which again is not in the set of allowed *β*-coalescents. Therefore, a point-mass coalescent and a *β*-coalescent with cannot have the same and simultaneously.

*Case 2*. One Λ measure is a point-mass coalescent and the other is Kingman’s coalescent with exponential growth. Without loss of generality, assume that for the point-mass Λ-coalescent. The exponential-growth Kingman’s coalescent model considered here has where is the exponential integral (Bhaskar *et al.* 2015, supplemental material equation 5). Then, the constraint implies where Furthermore, assuming this constraint and applying *Theorem 1* to Kingman’s coalescent, we obtain(7)(8)Now, in addition to if the two coalescents have the same values of and then the right-hand sides of (4) and (7) must agree, while the right-hand sides of (5) and (8) must agree. This implies(9)whereHowever, by *Lemma 8* in the *Appendix*, there is no such that (9) holds.

*Case 3*. One Λ measure is a *β*-coalescent and the other is Kingman’s coalescent with exponential growth. If these two coalescents produce the same values of and then we must have in (6), and equating (6) and (8) implies(10)whereHowever, by *Lemma 9* in the *Appendix*, there is no such that (10) holds.

Since each of the three cases results in a contradiction, we see that no such Λ measures exist, proving the identifiability claim.

□

## Discussion

We have presented an efficient algorithm for computing the SFS for a very general class of coalescents. While Λ- and Ξ-coalescents seem to be primarily used in practice to model the genealogies of marine species (Árnason 2004; Hedgecock and Pudovkin 2011), these coalescents also model a wide range of other phenomena, including continuous strong positive selection (Neher and Hallatschek 2013), recurrent selective sweeps (Durrett and Schweinsberg 2004, 2005), strong bottlenecks (Birkner *et al.* 2009), and many others. Perhaps one of the reasons these coalescents are less widely used than Kingman’s coalescent is because efficient inference tools have not yet been developed to the same extent.

Multiple-merger coalescents have also attracted some interest recently in the context of extremely large sample sizes (Bhaskar *et al.* 2014). In such cases the sample size is too large for the assumption of only pairwise mergers of lineages imposed by Kingman’s coalescent to be biologically plausible, and indeed using Kingman’s coalescent to model such populations causes biases in inference (Bhaskar *et al.* 2014). It should be possible to extend the results presented in this article to discrete-time coalescents, such as the “exact coalescent” (Fu 2006) corresponding to the coalescent arising from the discrete-time Wright–Fisher process, or to any of the discrete-time random-mating models considered by Eldon and Wakeley (2006).

We also presented some encouraging identifiability results. While it is impossible in the general case to infer the inverse intensity function *ζ* or the measure of a Λ-coalescent from the SFS, for many biologically important cases identifiability does indeed hold. The method we presented for proving that the Λ measure is identifiable for constant *ζ* is powerful, but straightforward and should make it easy to prove whether the measure is identifiable for other sets of Λ- or Ξ-coalescents. While we considered the identifiability of Λ only for fixed, constant *ζ* and the identifiability of *ζ* for fixed Λ or Ξ, it would be interesting to see whether identifiability results can still be obtained for some model spaces while allowing both Λ and *ζ* to vary. It would also be interesting to extend our identifiability results for the Λ measure to some of the biologically relevant Ξ-coalescents.

Our identifiability results generally assumed access to the expected SFS. In practice, one observes a finite number of sites and so one has only a noisy estimate of the SFS. Our simulation study shows that, given a moderate number of independent trees, the empirical SFS is accurate enough to distinguish for some simple models. However, the effects of noisy data are still largely unknown, especially in cases where convergence to the expected SFS is not guaranteed. The accuracy of inferring *ζ* with the empirical SFS has been studied for Kingman’s coalescent (Terhorst and Song 2015), and it would be interesting to extend these results to general Λ-coalescents and to the inference of the Λ-measure itself; the results presented here should make such an analysis more tractable.

## Acknowledgments

We thank Jere Koskela for helpful discussion on convergence to the expected SFS. This research is supported in part by National Institutes of Health (NIH) grant R01-GM108805, NIH training grant T32-HG000047, and a Packard Fellowship for Science and Engineering.

## Appendix

Here we present two lemmas that are used in *Theorem 5*. Proofs are tedious but straightforward.

**Lemma 8.** *For* *where*

**Lemma 9.** *For* *where*

In what follows, let It is clear that for all Additionally,

(A1)which follows from the definition of and a change of variables.

*Proof of Lemma 8*. First, by noting that it is easy to see that the denominator is strictly negative for We now show that the numerator is strictly positive for First, by rearranging terms we see that(A2)Then, noteApplying this inequality to the negative term on the right-hand side of (12), we seewhich is >0 for any since and for □

*Proof of Lemma 9*. The denominator of (10) is equal to which is strictly positive for by definition of Furthermore, the numerator is strictly negative for by noting the following:Therefore, (10) holds. □

## Footnotes

*Communicating editor: N. H. Barton*

- Received October 26, 2015.
- Accepted February 10, 2016.

- Copyright © 2016 by the Genetics Society of America