# Statistical Properties of the Site-Frequency Spectrum Associated with Λ-Coalescents

^{*}Institut für Mathematik, Johannes-Gutenberg-Universität, 55099 Mainz, Germany^{†}Institut für Mathematik, Technische Universität Berlin, 10623 Berlin, Germany

- 1Corresponding author: Institut für Mathematik, Technische Universität Berlin, Strasse des 17 Juni 136, 10623 Berlin, Germany. E-mail: eldon{at}math.tu-berlin.de

## Abstract

Statistical properties of the site-frequency spectrum associated with Λ-coalescents are our objects of study. In particular, we derive recursions for the expected value, variance, and covariance of the spectrum, extending earlier results of Fu (1995) for the classical Kingman coalescent. Estimating coalescent parameters introduced by certain Λ-coalescents for data sets too large for full-likelihood methods is our focus. The recursions for the expected values we obtain can be used to find the parameter values that give the best fit to the observed frequency spectrum. The expected values are also used to approximate the probability a (derived) mutation arises on a branch subtending a given number of leaves (DNA sequences), allowing us to apply a pseudolikelihood inference to estimate coalescence parameters associated with certain subclasses of Λ-coalescents. The properties of the pseudolikelihood approach are investigated on simulated as well as real mtDNA data sets for the high-fecundity Atlantic cod (*Gadus morhua*). Our results for two subclasses of Λ-coalescents show that one can distinguish these subclasses from the Kingman coalescent, as well as between the Λ-subclasses, even for a moderate (maybe a few hundred) sample size.

- Λ-coalescents
- site-frequency spectrum
- covariance
- Atlantic cod
- pseudolikelihood

LARGE offspring number population models have recently been proposed as appropriate models with which to investigate high-fecundity natural populations. Some marine populations may belong to the class of high-fecundity populations, including Pacific oysters (*Crassostrea gigas*) (Beckenbach 1994; Li and Hedgecock 1998; Boudry *et al.* 2002), white sea bream (*Diplodus sargus*) (Planes and Lenfant 2002), and Atlantic cod (*Gadus morhua*) (Árnason 2004). Oysters feature in Williams (1975)’s elm and oyster model as an example of a high-fecundity population. Indeed, high-fecundity populations are discussed at length by Williams (1975) when comparing the benefits of sexual *vs.* asexual reproduction. Avise *et al.* (1988) compare genetic distances for mitochondrial (mt)DNA variation for three vertebrate species, american eels (*Anguilla rostrata*), hardhead catfish (*Arius felis*), and red-winged blackbirds (*Agelaius phoeniceus*), and conclude that historical effective population sizes may have been much lower than current census size. Low effective population size compared to census population size observed for certain marine populations in particular (*e.g.*, Hedgecock *et al.* 1992), and reviewed by Hedgecock and Pudovkin (2011), may be evidence of high variance in offspring distribution. Indeed, Hedrick (2005) observes that low effective population size results from high variance in reproductive success in a population with large census size. High fecundity may also be a way for certain marine organisms with broadcast spawning to compensate for high mortality rate among juveniles, thus exhibiting type III survivorship curves.

Multiple-merger coalescent processes, so-called Λ- and Ξ-coalescents, arise naturally from large offspring number models (Donnelly and Kurtz 1999; Sagitov 1999, 2003; Möhle and Sagitov 2001; Eldon and Wakeley 2006; Sargsyan and Wakeley 2008; Birkner and Blath 2009; Huillet and Möhle 2011a). A key assumption to obtain a multiple-merger coalescent process from a large offspring number model is that the number of offspring of an individual can be up to the order of the population size with nonnegligible probability. In contrast to the classical Kingman coalescent framework, which has served as the standard null model in population genetics since its introduction (Kingman 1982), Λ-coalescents allow multiple instead of only binary collisions of ancestral lineages, in which the group of lineages coalescing in a given instance may include anywhere from two to all of the current active lineages. Selective sweeps (Durrett and Schweinsberg 2004, 2005; Coop and Ralph 2012), or large-scale spatial extinction and recolonization events (Barton *et al.* 2010, 2013), can also give rise to multiple-merger coalescent processes. In some cases, one naturally encounters the even more general Ξ-coalescents (Schweinsberg 2000), allowing *simultaneous* multiple mergers at a time, *i.e.*, when distinct groups of ancestral lineages coalesce at the same time to distinct ancestors. Models of severe bottlenecks (Birkner *et al.* 2009) or diploidy (Möhle and Sagitov 2003) and recombination in conjunction with distinct loci can give rise to Ξ-coalescents. By way of example, a Ξ-coalescent admitting up to quadrifold simultaneous mergers is obtained from a diploid multilocus model with a skewed (high variance) offspring distribution (Birkner *et al.* 2013). Möhle and Sagitov (2001) provide a full mathematical classification of coalescent limits (as population size tends to infinity) of Cannings’ exchangeable population models.

The statistical methodology to test for the presence and particularities of the effects of multiple mergers, or to distinguish between different underlying Λ- and Ξ-coalescents within a family of models, is, however, still largely missing. Hedgecock and Pudovkin (2011) argue that large offspring number models are the right framework in which to study high-fecundity marine populations. Some large offspring number models (*e.g.*, Sagitov 2003; Eldon and Wakeley 2006) introduce new parameters (*coalescence* parameters) determining the size and intensity of large offspring number events, *i.e.*, when individuals have very many offspring. Estimating the coalescence parameters from genetic data may give us a way to distinguish between different population models. Indeed, Hedgecock and Pudovkin (2011) call for the development of such inference methods, in particular to infer parameters, to assess the quality of the estimators, and to construct hypothesis tests and confidence bounds.

On one hand are summary-statistics-based estimators for the parameter of certain classes of Λ-coalescents (Eldon 2011) in the infinite-sites model, such as the number of segregating sites, which may not represent a sufficient statistic, but are computationally efficient, and could be used to obtain prior information. On the other hand, there are full-likelihood-based point estimators, derived via relatively complex recursions—which are, in principle, amenable to Monte Carlo methods—taking all available information into account, but whose practical treatment becomes computationally prohibitive even for medium-sized data sets (Birkner and Blath 2008; Birkner *et al.* 2011). A small empirical simulation study for the latter can be found in Steinrücken (2009). For both types of estimators, an analysis of their statistical properties in the Λ case is still largely missing.

A summary statistic that contains much of the essential (in particular, topological) information of the underlying genealogical tree of the data is the *site-frequency spectrum* (SFS) (however, one may want to consult, *e.g.*, Myers *et al.* 2008 for theoretical limitations). The spectrum is a key quantity when applying coalescent theory in inference; see, *e.g.*, Wakeley (2007) for an overview and a discussion of the relation of the SFS with various other (simpler) summary statistics. The statistical properties of the SFS under the Kingman coalescent have been investigated in several studies (Fu 1995; Griffiths and Tavaré 1998; Živković and Wiehe 2008; Achaz 2009; Sainudiin *et al.* 2011). Extending in particular the results of Fu (1995) to Λ-coalescents is our aim. Recursions for the expected value, the variance, and the covariance of the number of derived mutations in the SFS are derived for a general Λ-coalescent.

Substantial theoretical research has been undertaken to identify the asymptotic behavior of the frequency spectrum associated with Λ-coalescents, culminating in the recent convergence results, as sample size tends to infinity, contained in Berestycki *et al.* (2013). However, so far it is not known how large samples need to be for the spectrum to be well approximated by the limiting asymptotic results of Berestycki *et al.* (2013) or, indeed, whether the approximation is equally good for all values of the relevant coalescent parameter. This question is addressed using simulations.

Finally, as an illustration we apply our methods to site-frequency spectra obtained for Atlantic cod (Árnason *et al.* 2000; Sigurgíslason and Árnason 2003; Árnason 2004) and fit Λ-coalescents to the data. The problem of distinguishing between coalescent classes by means of observing a realization of the SFS is also addressed using simulations.

## Setup and Previously Known Results

### Notation

A formal definition of the SFS is our starting point. We assume the *infinitely many-sites model* (Kimura 1969; Watterson 1975) of population genetics, where the genealogy of a sample of size *n* is described by an *n*-Λ-coalescent. Mutations occur at a rate *θ*/2 on the coalescent branches. Suppose further that we can distinguish between mutant and wild type, *e.g.*, with the help of an outgroup. Then, the SFS of an *n* sample is given bywhere the denote the number of sites at which variants appear *i* times in our sample. The are also referred to as the number of *derived* mutations in *i* copies, when the ancestral state is known. The vector **ξ**^{(}^{n}^{)} is called the *unfolded* site-frequency spectrum of the *n* sample.

In the case where the ancestral state (variant) is unknown, one often considers for *i* = 1, … , , which is the number of sites at which the less frequent variant (the minor allele) is present in *i* sequences, and := **1**_{(}_{n}_{even)}*n*/2 + **1**_{(}_{n}_{odd)}(*n*−1)/2, where **1**_{(}_{A}_{)} = 1 if event *A* is true and zero otherwise. The observed polymorphisms become “few and far between” with corresponding small expected values as *i* increases, at least for the Atlantic cod data we investigate. Thus, for for some suitable , we lump them together into one class labeled as . Considering only the first few classes and lumping the rest also speeds up the computations.

Then, is called the *folded* site-frequency spectrum of an *n* sample.

A Λ*-coalescent* is a partition-valued (one can think of enumerated DNA sequences and their ancestral relationships) exchangeable coalescent process determined by a finite measure Λ on [0, 1] (Donnelly and Kurtz 1999; Pitman 1999; Sagitov 1999). For a comprehensive overview see, *e.g.*, Berestycki (2009). If there are currently *n* blocks in the partition (*i.e.*, *n* active ancestral lineages), any particular set of *k* blocks merges into one at rate (1)Certain special cases of a Λ-coalescent include the following, some of which we use in our examples below:

One recovers the Kingman coalescent for (2)

*i.e.*when the measure Λ is concentrated at the point 0.The Beta(2 −

*α*,*α*)-coalescent (Schweinsberg 2003) is a Λ-coalescent characterized by the Λ-measure (3)with

*α*∈ (0, 2),*i.e.*, when the measure Λ is associated with the Beta-distribution with parameters 2 −*α*and*α*. The limiting case*α*= 2 (in the sense of weak convergence of measures) corresponds to the Kingman coalescent.Eldon and Wakeley (2006) considered purely atomic Λ-coalescents of the types (4)

and (5)

with

*ψ*∈ [0, 1], where*ψ*= 0 gives the Kingman coalescent.

In the application examples that follow, we are mainly concerned with coalescents (3) and (4). Thus, the parameter *α* (1 < *α* ≤ 2) refers exclusively to coalescent (3), and the parameter *ψ* (0 < *ψ* < 1) to coalescent (4).

For *α*, we focus our attention on the parameter range 1 < *α* < 2, since this corresponds to Beta-coalescents that “come down from infinity”, *i.e.*, admit a unique most-recent common ancestor even in the infinite-sample limit (Pitman 1999; Schweinsberg 2000). It can be argued whether this property is necessary from a biological point of view. However, our empirical results for real data sets yield so far only estimates of *α* between 1 and 2. In addition, the coalescent process derived from Schweinsberg’s (2003) population model converges to coalescent (3) only when 1 ≤ *α* < 2. For 0 < *α* < 1, the coalescent process is a discrete-time Ξ-coalescent (Schweinsberg 2003).

### Related asymptotic results

In inference, one often relies on asymptotic results. The asymptotic behavior of the site- and allele-frequency spectrum of Λ-coalescents for (very) large sample sizes (as *n* → ∞), as well as related asymptotic results concerning the distributions of tree lengths or the “speed of coming down from infinity”, *i.e.*, when the sample size is infinite (a mathematical peculiarity), how long it takes the ancestral lineages to become finite in number, is an active field of mathematical research. The concept of coming down from infinity may yet have real implications for inference. By way of example, the point-mass coalescent (4) does not come down from infinity, which implies that external branches become more dominating as sample size increases, regardless of the value of *ψ*. On the other hand, the Beta-coalescent (3) does come down from infinity, as does the Kingman coalescent.

Important asymptotic results regarding the SFS are obtained by Berestycki *et al*, (2013, Theorem 3), who provide convergence results (in the almost sure sense), as sample size tends to infinity, of the scaled SFS for an important class of Λ-coalescents, including the Beta-coalescent (3), as is now explained. We say that the measure Λ has (strong) *α*-regular variation at zero iffor some *α* ∈ (1, 2) and *A* > 0. This holds for the Beta(2 − *α*, *α*)-coalescent from (3). Then, the almost sure asymptotic behavior of each as *n* → ∞ is determined by the behavior of *f*(*x*) as *x* ↓ 0 for a large class of Λ-coalescents.

**Theorem 1** (Berestycki *et al.* 2013). *Assume that* Λ *has* (*strong*) *α-regular variation at zero for some* 1 < *α* < 2 *and A* > 0. *Then*, *for the site-frequency spectrum of an n sample*, *where the underlying genealogy is governed by a* Λ*-coalescent and mutation rate θ*/2,*in the almost sure sense*, *for i* = 1, … , *n*, *where*

We formulate **Theorem 1** with a mutation rate *θ*/2 instead of *θ* as in Berestycki *et al.* (2013) to be consistent with Fu (1995) and other literature. The asymptotic results should be of direct relevance for us in the case of large sample sizes. However, the rate of convergence of the asymptotic results in terms of sample size *n* and the parameter *α* seems to be unresolved. The issue of rate of convergence is investigated below, using simulations.

From a statistical point of view an ideal result would be an (asymptotic) result for the joint distribution of the normalized SFS for general Λ-coalescents, and this may well soon be within reach. For the Kingman case, this has already been achieved: There is asymptotic normality for the joint distribution of tree lengths. Kersting and Stanciu (2013) obtain an asymptotic normality result for the scaled branch lengths associated with the Kingman coalescent, from which Kersting and Stanciu deduce that the asymptotic joint distribution of the frequency spectrum is independent Poisson.

As a partial analog, Kersting *et al.* (2013) prove convergence of the total length of *external* branches of the Beta-coalescent (3) in distribution to a stable random variable with index *α*. Their result indicates that the joint distribution of the branch lengths associated with a Λ-coalescent will not be asymptotically normal, but might instead be a multivariate stable distribution. However, the rate of convergence is not known in the results of Kersting *et al.* (2013).

## The Expected Value, Variance, and Covariance of Derived Mutations in the SFS Associated with a Λ-Coalescent

In this section we compute the expected value , the variance , and the covariance of the SFS associated with Λ-coalescents. While exact solutions for a finite sample size *n* appear quite hard to obtain due to the multiple-merger property of Λ-coalescents, we derive recursions for these quantities. Fu (1995) (cf. Theorem S1 in Supporting Information) obtained closed-form expressions for these quantities when associated with the Kingman coalescent.

As above, let denote the random number of derived mutations in *i* copies in an *n* sample with mutation rate *θ*/2 and genealogy governed by a Λ-coalescent. Before stating our first result, we recall some simple properties of the block-counting process associated with a Λ-coalescent from Birkner and Blath (2008). The block-counting process (*Y _{t}*) simply counts the number of ancestral lineages present each time. Recall the corresponding rates

*λ*

_{n}_{,}

*from (1). When we refer to number of “leaves” (*

_{k}*n*), we mean the initial sample size. Thus, a “leaf” refers to one sampled (DNA) sequence. By using leaves rather than sequences we emphasize that we are thinking of the sequences as vertices in a graph describing the ancestral relations of the leaves. Let (

*Y*)

_{t}

_{t}_{≥0}be the block-counting process of our Λ-coalescent, which is a continuous-time Markov chain on ℕ with jump rates

The total jump rate away from state *i* is . By (6)we denote the transition probabilities of the embedded discrete skeleton chain. Let (7)be the expected amount of time that *Y _{t}*, starting from

*n*, spends in

*m*. Decomposing according to the first jump of

*Y*gives a recursion for

_{t}*g*(

*n*,

*m*),

### The expected value of derived mutations

For 1 < *k* ≤ *n*, as in Fu (1995), an important quantity is the probabilitythat *in a* Λ*-coalescent starting from n leaves*, *conditioned that there are at some point in time exactly k branches*, *a given one of these k branches* (*e.g.*, *the first*, *if we think of some ordering*) *subtends exactly b leaves* (see Figure 1).

A recursion for *p*^{(}^{n}^{)}[*k*, *b*] (Equation A2), given in the *Appendix* (Proposition A1), can be solved numerically, separately for each *k*. Equation (A2) allows us to derive a recursion for the expected frequency spectrum associated with Λ-coalescents, given in the following proposition.

**Proposition 1**. *Under the above assumptions*, *we have*, *for i* = 1, … , *n* − 1,

The proof proceeds by decomposing according to the “level” (the values taken by the lineage-counting process *Y _{t}*) of the coalescent tree where a mutation arises—see the

*Appendix*for details.

**Proposition 1**and its proof are a natural extension of the arguments of Griffiths and Tavaré (1998) to the multiple-merger case.

The expression in **Proposition 1** can be used to define a quantity (*ϕ _{n}*(

*i*)), the “expected normalized frequency spectrum”, which describes the shape of the genealogy and does not depend on

*θ*, as follows. Let denote the random total length of branches subtending

*i*leaves, for 1 ≤

*i*<

*n*. Then, clearly, , withIf we now let

*B*

^{(}

^{n}^{)}denote the random total length of the genealogy started from

*n*leaves, and by we denote the random total number of segregating sites in a sample of size

*n*gene copies (DNA sequences), we haveIf

*T*

_{ℓ}denotes the random time during which there are ℓ active lineages, for 2 ≤ ℓ ≤

*n*, we haveWe define

*ϕ*as (9)Loosely speaking, the probability that a mutation, under the infinitely many-sites assumption (Kimura 1969; Watterson 1975), with known ancestral types, appears

*i*times in a sample of size

*n*is

*ϕ*(

_{n}*i*). The quantity

*ϕ*(

_{n}*i*) is a quotient of expectations that can be interpreted as size biasing—it should not be confused with averaging over the quotient of these quantities,

*e.g.*, obtained from an empirical study, which has a different interpretation and will, at least for finite sample sizes, take different values. However, one can apply

*ϕ*(

_{n}*i*) as the main ingredient in a pseudolikelihood approach to infer coalescence parameters, as we do below.

### The covariance of derived mutations

To compute covariances one needs to compute the probability that two edges (ancestral lineages) drawn at random without replacement (ordered sampling) from the past, say from *Y*-states *k* and ℓ ≤ *n*, produce *i* (resp.) *j* subtended leaves. To this end, we need to distinguish three situations, namely whether our two edges are taken from the same “layer” (that is, when *Y* is in the same state *k* for both edges) or, if this is not the case, whether one of our two edges is a descendant of the other (“nested case”) or not (“unnested case”). In our calculations the leaves are *unlabeled*.

#### Case 1: Two edges in the same layer:

For 2 ≤ *k* ≤ *n* let be the probability that in an *n*-Λ-coalescent, conditioned on there being *k* ≥ 2 ancestral lineages at some time in the past, if we draw two of the *k* lines at random without replacement from these *k* (ordered sample), the first drawn line has *i* and the second *j* subtended leaves (see Figure 2). A recursion for (Proposition A2) and proof are given in the *Appendix*.

#### Case 2: Two edges in different layers, unnested case:

For 2 ≤ *k* < ℓ ≤ *n* let be the probability that in an *n*-Λ-coalescent, conditioned on there being *k* lines at some time in the past and ℓ lines at some (other) time in the past, if we draw one of the *k* lines at random and independently one of the ℓ lines at random (ordered sample), the first drawn line has *i* and the second *j* subtended leaves, and the second is not a descendant of the first (see Figure 3). A recursion for (Proposition A3) is obtained in a similar way to that for , and is given in the *Appendix* along with a proof.

#### Case 3: Two edges in different layers, nested case:

For 2 ≤ *k* < ℓ ≤ *n* let be the probability that in an *n*-Λ-coalescent, conditioned on there being *k* lines at some time in the past and ℓ > *k* lines at some more recent time in the past, if we draw (ordered sample) one of the *k* lines at random [(*k*, 1), say, see Figure 4] and independently one of the ℓ lines [(ℓ, 1), say, see Figure 4] at random, the first drawn line has *i* and the second *j* subtended leaves, and the second (ℓ, 1) is a descendant of the first [(*k*, 1), so necessarily *i* ≥ *j*, and otherwise the probability is zero; see Figure 4]. The recursion for is given in Proposition A4 in the *Appendix*, with the proof following similar arguments to proofs of Propositions A2 and A3.

Unfortunately, the computational complexity of the recursions for (Equation A4), (Equation A6), and (Equation A8), as they stand, is *O*(*n*^{5}), thus restricting numerical computations to relatively small values of *n*.

#### Second moments and covariance of the frequency spectrum:

Now we can put our recursions for , , and together to obtain the covariances . Our main result on the site-frequency spectrum associated with Λ-coalescents, an expression for in terms of , , , *p*^{(}^{n}^{)}[*k*, *i*], and *g*(*n*, *m*), is now stated—a proof is in the *Appendix*.

**Theorem 2**. *For* 1 ≤ *i*, *j* < *n*, 2 ≤ *i* + *j* ≤ *n*, *we have*

In Figure 5 and Figure 6 we graph the covariances (Equation A17) of the branch lengths and , where denotes the total length of branches subtending *i* of *n* leaves, and the coalescent process is the Beta(2−*α*, *α*)-coalescent. The variances and covariances become completely dominated by the variance of the length of external branches when multiple mergers are more prominent in the genealogy, *i.e.*, when *α* is close to 1 (Figure 5). This means that the variances and covariances of the SFS will be dominated by the variance in the number of singletons, when associated with Λ-coalescents, and multiple mergers dominate the genealogy, as would occur in a high-fecundity population with frequent occurrence of large offspring number events. In the Beta-coalescent (Equation 3), large offspring number events become more prominent as *α* approaches 1 and less prominent as *α* approaches 2. And even when *α* is more “modest” at 1.5, the covariances are still quite small compared to the variances. Figure 6 shows the covariances on their own scale, *i.e.*, with the variances and the diagonal covariances set at zero. These results suggest that one may model the site-frequency spectrum of an *n*-Λ-coalescent as composed of uncorrelated (but not independent) components for *n* large enough.

For fixed *n*, the covariances between and seem to be larger than the covariances off the diagonals. As for the Kingman case, this can be explained by considering the first branching event in the coalescent history forward in time after the root. This has substantial probability to be a binary split, say in subfamilies of size *i* and *n* − *i*, which yields a positive correlation between and .

Extending our results to the case of Ξ-coalescents and multiple loci is a natural follow-up task, but lies beyond the scope of the current discussion and will be part of future research.

## Simulation Studies and Data Analysis

### Reliability of the asymptotic results of Berestycki *et al.* (2013)

The asymptotic results obtained by Berestycki *et al.* (2013) (see **Theorem 1**) can be used to obtain estimates of *α* provided the asymptotic results are “close” to the true value *and* one has an estimate of *θ*. Ideally, such inference should take into account correlation between loci. Now we address the question of how large the sample must be for the SFS to be well approximated by the asymptotic results of Berestycki *et al.* (2013) and, indeed, whether one can trust the asymptotic results for all values of *α*.

Figure 7 compares the asymptotic results in **Theorem 1** to simulated values for sample size *n* = 500 (left column) and *n* = 1000 (right column) and with *α* varying as shown. Even though we have recursions to obtain the exact expected values for finite *n*, we were not able, with present computer power, to compute covariances for *n* ≥ 100. The estimated expected values and the ones from the recursions are in agreement (results not shown). Even for *n* = 10^{4} (Figure 1 and Figure 2), which would be a rather large sample, the asymptotic results (Figure 7, shaded bars) match the simulated ones (Figure 7, open bars) rather poorly when *α* < 1.5, but the correspondence improves as *α* increases, and at *α* = 1.5 the match is already quite good. For more common sample sizes *n* ≤ 10^{3} (Figure 3 and Figure 7) the match is not very good for *α* ≤ 1.5. Relying on asymptotic results to estimate *α* might thus give misleading results, even when sample size is quite large, as demonstrated in Figure 8, which compares the asymptotic results for the singletons to the scaled (with *n ^{α}*

^{−2}) exact expected values as sample size

*n*and

*α*vary as shown. The poor fit of the asymptotic results to exact values of the SFS, at least for lower values of

*α*, suggests that exact results for the SFS, if only in the form of recursions, may be a better choice for inference methods.

## Inference Methods and Data Analysis

Before we discuss our results on the analysis of the Atlantic cod data, we introduce our two inference methods, comparing observed and expected values using an ℓ^{2} distance and a pseudolikelihood approach.

### The ℓ^{2} distance

The simplest way to find a value of *α* or *ψ* that gives the best fit to the data would be to compare the expected values of the frequency spectrum to the observed one and report the parameter value that minimizes the sum of squares (Equation 10). To avoid having to jointly estimate the mutation rate, we can define the scaled frequency spectrum asThe observed are compared to the ratio of expected valuesAlthough clearly one should compare to its actual expected value, simulation results indicate that our approximation is quite good (results not shown). A natural distance measure to compare and is the ℓ^{2} distance (10)A similar distance measure is the *G _{ξ}* statistic proposed by Fu (1996), in which each term in the sum is weighted by the variance of . The computational cost of computing the variances for a Λ-coalescent prevents us from considering an identical statistic for Λ-coalescents. A drawback of Fu’s (1996)

*G*statistic is that it depends on the mutation rate

_{ξ}*θ*. Ideally, one would want to apply a statistic that did not depend on

*θ*. Instead, a statistic that reflects the underlying genealogy is desirable, since the processes that one usually tries to learn about in population genetics affect the topology and the branch lengths of the genealogy of a sample in one way or another. Determining statistical significance levels of the ℓ

^{2}distance is postponed for now; however, one may apply the approach of Fu (1996) to determine significance levels and statistical power.

### The pseudolikelihood approach

A more fundamental statistical inference procedure than simply applying the ℓ^{2} statistic is a likelihood inference. Ideally, in our case, to construct a likelihood function, one would have a way of writing down the probability of observing a mutation *i* times in a sample in terms of the coalescence parameters. However, these probabilities depend on the coalescence parameters in a way that is hard to make explicit, and therefore we approximate them with the quantity *ϕ _{n}*(

*i*) (Equation 9). We interpret the observed site-frequency spectrum as an observation from a multinomial distribution, in which the probability of each class is estimated using

*ϕ*(

_{n}*i*). For 1 ≤

*i*<

*n*,

*ϕ*(

_{n}*i*) is the expected number of mutations in

*i*copies scaled by the expected total number of segregating sites, which corresponds to the expected total number of branches subtending

*i*leaves scaled by the expected total size of the genealogy. Since we can compute

*ϕ*(

_{n}*i*) only numerically, Fisher information cannot be computed easily, and it would not be

*a priori*clear whether the variance and confidence bounds derived therefrom via classical statistical theory would be valid. Finally, we refer to our likelihood approach as a pseudolikelihood inference.

Now we state our pseudolikelihood function. Let *s* denote the total number of segregating sites, and let **s** = (*s*_{1}, … , *s _{k}*, ) be the observed folded frequency spectrum, where denotes the collapsed class. Write

*ϕ*:=

_{i}*ϕ*(

_{n}*i*), and define . The pseudolikelihood function

*L*

**(**

_{s}*π*) takes a multinomial form, in which

*π*denotes the appropriate coalescence parameter (

*ψ*or

*α*): (11)The form of

*L*can be seen from the way we view the data

**s**: as an observation from a multinomial distribution in which the probability of each class is estimated with

*ϕ*. It suffices to consider the pseudo-log-likelihood functionAs in the case of the statistics and , this pseudolikelihood approach does not require an estimate of

_{i}*θ*. In Table 1 we do show that the pseudolikelihood performs as it should, when applied to the expected site-frequency spectrum of different coalescent processes.

Table 1 shows maximum-likelihood estimates of the coalescence parameters *α* and *ψ*, when the data are the expected values of the site-frequency spectrum, obtained for given values of the appropriate coalescent parameter, denoted by *π* in Table 1 and computed using our recursion (**Proposition 1**). Thus, *π* = *α* when *π* ∈ (1, 2) and *π* = *ψ* when *π* ∈ (0, 1). For each combination of sample size *n* and *π*, the log-likelihood values are compared with the ones obtained for the Kingman coalescent log *L*(0). Highest log-likelihood values are always obtained for the correct parameter value even for sample size *n* as low as 10. And for modest sample sizes, one should be able to correctly identify the Kingman coalescent (*π* = 0 in Table 1).

In Figures 14–16 in supporting information, File S1, we estimate the distribution of the maximum-likelihood estimator for different values of *n*, *θ*, and the coalescence parameters *α* and *ψ*. As one expects, the performance of the pseudolikelihood is better for larger values of *n* and *θ*.

A natural question to address is, How well do the probabilities *ϕ _{n}*(

*i*) (Equation 9) actually fit with the correct values ? In Figures 5 and 6 in File S1 we address this issue via simulations. The agreement between

*p*and is best when the number

_{i}*n*of leaves (sample size) is large and

*α*is closer to 2 than to 1.

The C code written to carry out the computations is available upon request.

### Atlantic cod mtDNA data analysis

An application of the recursion (**Proposition 1**) for the exact expected values for the site-frequency spectrum is now illustrated using data on Atlantic cod obtained from various localities in the North Atlantic (Árnason *et al.* 2000; Sigurgíslason and Árnason 2003; Árnason 2004). The shallow gene genealogies observed among the haplotypes sampled, coupled with very high fecundity of Atlantic cod, led Árnason (2004) to argue that the Kingman coalescent may not be an appropriate null model for Atlantic cod. Later studies (Eldon 2011; Steinrücken *et al.* 2013) support that conclusion by finding a better match between the data and the Beta-coalescent than obtained for the Kingman coalescent. Eldon (2011) employ the asymptotic results of Berestycki *et al.* (2013) on the site-frequency spectrum obtained from Árnason (2004), while Steinrücken *et al.* (2013) employ full-likelihood methods on the sequence data of Atlantic cod data sets of smaller size than in Árnason (2004).

The Atlantic cod mtDNA data sets we analyze are summarized in Table 2. The sequence data from the Faroe Islands (Sigurgíslason and Árnason 2003) consist of a 250-bp segment from the mitochondrial *cytochrome* b gene in addition to 242- and 74-bp segments for a total of 566 bp. The data from Greenland and Iceland described in Árnason *et al.* (2000) consist of the 250-bp segment and so do the combined data described in Árnason (2004).

In Figure 9 and Figure 10 the ℓ^{2} distances comparing and obtained for the different Atlantic cod data sets are compared. The corresponding parameter estimates are reported in Table 3. For the larger data sets one observes much “sharper” curves than for the smaller ones as one would expect, since the larger data sets carry more information. The difference in ℓ^{2} distance between the coalescent processes is also biggest for the larger data sets, with the Beta-coalescent showing the best fit for all the data sets, although the difference between the two Λ-coalescents is not always large. However, based on this simple parameter estimation method, one would clearly prefer one of the two Λ-coalescents over the Kingman coalescent.

The maximum-likelihood estimates (Table 4) agree quite well with the ℓ^{2}-distance approach in Table 3. Steinrücken *et al.* (2013) apply full-likelihood methods on small Atlantic cod data sets, including the Faroe Islands data set (Sigurgíslason and Árnason 2003) and the Greenland subsample of Árnason *et al.* (2000), to obtain estimates of *α*. Their estimates of for the Greenland subsample and for the Faroe Islands sample agree with our estimates, in particular the Faroe Islands sample.

The Beta-coalescent yields the highest likelihood for all the data sets (Table 4). The fit to the data set reported in Árnason (2004) is, in particular, much better than for the Kingman coalescent, as measured by the ℓ^{2} distance. The superior fit of the Beta-coalescent is further illustrated in Figure 11, in which the expected site-frequency spectrum of the Beta-coalescent and the Kingman coalescent is compared to the observed one of Árnason (2004). A similar graph (Figure 4) for the Faroe Islands data of Sigurgíslason and Árnason (2003) is given in File S1. Comparing Akaike’s (1973) information criterion (Table 5) between models tells the same story (except for the Greenland data).

## Discussion

Inferring coalescence parameters of Λ-coalescents from large samples is our main focus. To this end we derive recursions for the expected values and covariances of the site-frequency spectrum associated with a Λ-coalescent. The multiple-merger property of Λ-coalescents makes obtaining closed-form solutions, as done by Fu (1995) for the Kingman coalescent, prohibitively hard. The value of the recursions in obtaining point estimates of coalescent parameters for two specific Λ-coalescents is then illustrated in two ways, utilizing mtDNA data on Atlantic cod.

Asymptotic results play a key role in inference. Knowing for what regions of the parameter space the asymptotic results approximate the true values “well enough”, in some sense, in other words knowing the rate of convergence, is also quite important for inference. We investigate the rate of convergence for recent asymptotic results, as sample size *n* → ∞, of Berestycki *et al.* (2013) regarding the scaled site-frequency spectrum associated with Λ-coalescents, since we do not have an expression for the error in the asymptotic results. Our results show that the rate of convergence, in terms of sample size, is quite slow for low values of the coalescence parameter *α*. Thus, exact inference methods appear preferable.

A simple distance measure is employed on the scaled frequency spectrum, removing the need to jointly estimate the mutation rate. The observed spectrum is compared to the expected values for different values of the coalescence parameters, and the value that minimizes the distance is reported. A straightforward pseudolikelihood approach is also applied, where the probabilities of mutations to be in different classes are estimated from the recursions for the expected values of the spectrum. These two different approaches give very similar estimates for the two Λ-coalescents we consider. In addition, they are shown to be capable of distinguishing Λ-coalescents from the Kingman coalescent, as well as distinguishing between the two Λ-coalescents when sample size is not too small. Both approaches also show promise in distinguishing between different Λ-coalescents, and they do not require estimates of the mutation rate. Our results open up new possibilities to construct test statistics to test for “neutrality” when the underlying population model admits large offspring numbers, *i.e.*, in highly fecund natural populations such as Atlantic cod (Árnason 2004). One could, by way of example, consider linear weighted combinations of the site-frequency spectrum to construct tests for neutrality in high-fecundity populations, in the spirit of Achaz (2009). The present discussion does not address the statistical power of different statistics one could construct using our recursions and their performance under various scenarios, such as population expansion. This will be the subject of future work. Our aim here is to illustrate the wide applicability of the recursions we obtain for the mean and variance of the site-frequency spectrum associated with Λ-coalescents.

Λ-Coalescents are essentially single-locus models, as they admit only one merger each time. Models for multiple loci would naturally involve Ξ-coalescents admitting simultaneous multiple mergers. Intuitively, one might think that multilocus inference methods would be better able to distinguish between different coalescent processes and, indeed, the underlying population models, than single-locus approaches. A natural question to address is the sample size, as well as the number of loci, one would recommend to identify the appropriate coalescent process. Such issues are truly relevant, not least in part by advances in DNA sequencing technology. Our partial response is that our results indicate that one may need only moderate sample sizes to distinguish between Λ-coalescents. The question of the number of loci required is still very much open, not least because the development of ancestral recombination graphs admitting simultaneous multiple mergers is only in its early stages (Eldon and Wakeley 2008; Birkner *et al.* 2013).

A key distinguishing feature of the site-frequency spectrum drawn from a large offspring number population is the excess of singletons compared to the spectrum one expects in an ordinary Wright–Fisher population. When data contain sequencing errors, some authors have proposed analyzing the data without the singletons, leading to corresponding test statistics that exclude the singletons (Achaz 2008). At present we do not address the issue of sequencing errors, but we point out that for many natural high-fecundity populations, and with advances in DNA sequencing techniques, this may not be an issue. However, the issue of sequencing errors remains an important subject for future analysis.

The role of mtDNA as a suitable genetic marker for studying population history in general has been subject to some criticism (Ballard and Whitlock 2004; Bazin *et al.* 2006; Balloux 2010). A single genetic marker may not be sufficient to infer population history, due to stochasticity in the effects of a population’s demography on the genome (Ballard and Whitlock 2004). Natural selection may be continuously acting on (at least) parts of the mitochondrial genome of Atlantic cod, thus biasing any inference made on supposedly “neutral” segments of the nonrecombining mitochondrial genome (Maynard Smith and Haigh 1974; Gillespie 2000; Durrett and Schweinsberg 2005). In addition, we have not compared the data to models of population growth, which may mimic the effects of high fecundity. However, overfishing of Atlantic cod may have significantly diminished cod stocks throughout its range in the North Atlantic, which suggests that population expansion may not be an appropriate model for Atlantic cod. In addition, as mtDNA is inherited maternally, and female cods may lay millions of eggs annually (May 1967; Árnason 2004), our fit with Λ-coalescents might represent sweepstakes reproduction. Whatever the main reasons for the observed patterns of genetic diversity observed (see Árnason 2004, in particular the discussion on pp. 1882 and 1883), we stress that our purpose is simply to illustrate our inference methods, rather than reach a firm conclusion on the main mechanism shaping mtDNA diversity in Atlantic cod.

The two specific Λ-coalescents we consider will certainly not (one hopes) be the only Λ-coalescents ever applied to highly fecund natural populations. Recent work (Huillet and Möhle 2011a,b; Möhle 2011) introduces specific Λ-coalescent processes not considered in the present discussion. It remains an important exercise to compare all those new processes to actual data. In conclusion, we emphasize that even though one finds that one particular coalescent process fits the data better than another one, this does not automatically mean that one has found the correct model. Biological knowledge and insight remain key factors in identifying appropriate population models for natural populations. By way of example, the haploid population models underlying the two Λ-coalescents we focus on—(4) and (3)—are very different. In the model underlying (3), the key assumption is about the distribution of the number of viable (potential) offspring each individual contributes to the population (Schweinsberg 2003). In the model underlying (4), a single parent contributes a fixed number of offspring in each reproduction event (Eldon and Wakeley 2006). Such assumptions about the population model must form an integral part of any inference study.

## Acknowledgments

We thank two referees for insightful comments that improved the presentation. M. Birkner was supported in part by Deutsche Forschungsgemeinschaft (DFG) grant BI 1058/2-1. J. Blath and B. Eldon were supported in part by DFG grant BL 1105/3-1.

## Appendix

### A Recursion for *p*^{(}^{n}^{)}[*k*, *b*]

^{n}

Assume we start the block-counting process in some state *Y*_{0} = *n* and denote its distribution by ℙ* _{n}*. Then, the probability that the first jump takes it to state

*n*′ <

*n*, conditionally on the event that it hits state

*k*≤

*n*′, is given by (A1)due to the Markov property of

*Y*. This well-known observation will be useful later.

Decomposing according to the first jump of *Y*, starting from *n*, yields the following recursion for *p*^{(}^{n}^{)}[*k*, *b*].

Proposition A1. *For* 1 < *k* ≤ *n*, *we have* (A2)*with the obvious boundary conditions p*^{(}^{n}^{)}[*n*, *b*] = *δ*_{1}_{b} and p^{(}^{n}^{)}[*k*, *b*] = 0 *if b* > *n* − (*k −* 1).

The boundary conditions allow us to replace the second indicator function on the right-hand side (rhs) by **1**_{(}_{b}_{≤}_{n}_{−(}_{k}_{−1))}. In the case of the Kingman coalescent [indicated by an additional superscript, “(*K*)”], *p*^{(}^{n}^{)}[*k*, *b*] becomesand Fu (1995) obtains a closed-form solution: A proof of Equation (A2) now follows.

*Proof*. Conditionally on *Y* hitting *k*, the first jump can take *Y* to any *n*′ ∈ {*k*, … , *n* − 1}. The probability of seeing a jump from *n* to *n*′, conditionally on hitting *k*, has probability , by (A1). Then, thinking “forward in time from *n*′ lineages”, either the initial (*n* − *n*′ + 1) split occurred to one of the [then necessarily *b* − (*n* − *n*′)] lineages subtended to the one we are interested in or it occurred to one of the (then necessarily *n*′ − *b*) others.

### Proof of Proposition 1

*Proof*. Let be the length of the time interval during which there are *k* lineages (which is possibly 0), and consider, in an *n*-Λ-coalescent,

One now obtains

(A3)### A Recursion for

**Proposition A2.** *For* 2 ≤ *k* < *n and i*, *j* ≥ 1; *i* + *j* ≤ *n* − *k* + 2, *we have* (A4)*and* (*as boundary condition*) (A5)*Proof*. Decompose according to the first jump of *Y*, conditionally on hitting *k*. As in the proof of Proposition A1, the probability of jumping to some state *m* ∈ {*k*, … , *n* − 1} is , by (12). Thinking “forward in time from *m* lineages”, the initial (*n* − *m* + 1) split occurred to one of the *i* − (*n* − *m*) lineages subtended from the first sampled edge, or to one of the *j* − (*n* − *m*) lineages subtended from the second sampled edge, or to one of the *m* − *i* − *b* others.

One can derive a similar recursion for “labeled” leaves, where the order of the sample plays a role. For example, if we distinguish between individuals in our sample, we obtain for , where the round brackets indicate that we refer to specific sets of *i* resp. *j* labeled leaves, the recursionfor *n* > 2 [noting that if *i* < 1 or *j* < 1 or *i* + *j* ≥ *n* − *k* + 1].

### A Recursion for

**Proposition A3**. *For* 2 ≤ *k* < ℓ ≤ *n we have* (A6)*with boundary conditions* (A7)*Proof*. As for the previous recursions, we decompose according to the first jump of *Y*, conditionally on hitting ℓ. Due to the Markov property we do not need to condition on hitting *k* as well. This second conditioning is “hidden” in the terms and in particular in the boundary terms (Equation A7). The rest of the argument is as usual.

### A Recursion for

**Proposition A4**. *For* 2 ≤ *k* < ℓ ≤ *n we have* (A8)*with boundary conditions*

The proof follows from arguments similar to the previous proofs.

The recursions for (Equation A4), (Equation A6), and (Equation A8) are (strictly) recursive in *n* and can be solved numerically in a straightforward way, starting from their respective boundary conditions (A5), (A7), and (A9). For numerical checking of implementations it may be useful to note that by definition,

### Proof of Theorem 2

Consider an *n*-Λ-coalescent, and let, for 2 ≤ *k* ≤ *n*, ℓ ∈ [*k*], (A12)Now we think of labeled lines, as opposed to the unlabeled lines in the previous recursions. By symmetry, does not depend on ℓ ∈ [*k*] := {1, … , *k*}. In fact (A13)The factor *g*(*n*, *k*)/*g*(*k*, *k*) is the probability that the block-counting process *Y*, starting from *n* lines, ever hits the state of *k* lines and thus removes the conditioning from the definition of *p*^{(}^{n}^{)}[*k*, ⋅]. Similarly, for 2 ≤ *k* ≤ *n*, ℓ, ℓ′ ∈ [*k*], ℓ ≠ ℓ′, (A14)and, for 2 ≤ *k* < *k*′ ≤ *n*, ℓ ∈ [*k*], ℓ′∈[*k*′], (A15)The random variables counting the number of (derived) mutations in *i* copies (for 1 ≤ *i* < *n*) can be expressed in terms of as follows. Write for the number of mutations that occur on the ℓth edge while there are *k* blocks in our *n*-Λ-coalescent, with if *k* blocks are never realized. Then (A16)*Proof*. As in (A12), let denote the random number of leaves—of *n*—subtended to block ℓ when there are *k* blocks, with if *k* blocks are never realized.

Distinguishing cases as before, we obtainThe three expected values on the right-hand side can now be expressed in terms of the total jump rates of our block-counting process *Y* away from states *k* resp. *k*′. Indeed, for the first expectation on the rhs note that the time while there are *k* blocks is the same for both edges 1 and 2, but mutations are collected *independently*. Thus, the expected number of mutations can be computed as the second moment of an exponential random variable:The second expectation is given by the second moment of a mixture of a Poisson random variable with an exponential distribution:The last expectation is (due to independence obtained from the memoryless property of coalescent jump times) just the product of the two individual expected values; *i.e.*,Altogether, we arrive atfrom which the result follows with the help of (A13), (A14), and (A15).

Let denote the total length of branches subtending *i* of *n* leaves. The covariance of and now follows immediately from previous results since

## Footnotes

*Communicating editor: J. Hermisson*

- Received May 26, 2013.
- Accepted August 28, 2013.

- Copyright © 2013 by the Genetics Society of America