## Abstract

The sample frequency spectrum (SFS), which describes the distribution of mutant alleles in a sample of DNA sequences, is a widely used summary statistic in population genetics. The expected SFS has a strong dependence on the historical population demography and this property is exploited by popular statistical methods to infer complex demographic histories from DNA sequence data. Most, if not all, of these inference methods exhibit pathological behavior, however. Specifically, they often display runaway behavior in optimization, where the inferred population sizes and epoch durations can degenerate to zero or diverge to infinity, and show undesirable sensitivity to perturbations in the data. The goal of this article is to provide theoretical insights into why such problems arise. To this end, we characterize the geometry of the expected SFS for piecewise-constant demographies and use our results to show that the aforementioned pathological behavior of popular inference methods is intrinsic to the geometry of the expected SFS. We provide explicit descriptions and visualizations for a toy model, and generalize our intuition to arbitrary sample sizes using tools from convex and algebraic geometry. We also develop a universal characterization result which shows that the expected SFS of a sample of size *n* under an arbitrary population history can be recapitulated by a piecewise-constant demography with only epochs, where is between and The set of expected SFS for piecewise-constant demographies with fewer than epochs is open and nonconvex, which causes the above phenomena for inference from data.

THE sample frequency spectrum (SFS), also known as the site or allele frequency spectrum, is a fundamental statistic in population genomics for summarizing the genetic variation in a sample of DNA sequences. Given a sample of *n* sequences from a panmictic (*i.e.*, randomly mating) population, the SFS is a vector of length of which the *k*th entry corresponds to the number of segregating sites, each with *k* mutant (or derived) alleles and ancestral alleles. The SFS provides a concise way to summarize *n* sequences of arbitrary length into just numbers, and is frequently used in empirical population genetic studies to test for deviations from equilibrium models of evolution. For instance, the SFS has been widely used to infer demographic history where the effective population size has changed over time (Nielsen 2000; Gutenkunst *et al.* 2009; Gravel *et al.* 2011; Keinan and Clark; 2012; Excoffier *et al.* 2013; Bhaskar *et al.* 2015) and to test for selective neutrality (Kaplan *et al.* 1989; Achaz 2009). In fact, many commonly used population genetic statistics for testing neutrality, such as Watterson’s (Watterson 1975), Tajima’s (Tajima 1983), and Fu and Li’s (Fu and Li 1993) can be expressed as linear functions of the SFS (Durrett 2008).

In the coalescent framework (Kingman 1982a,b,c), the *unnormalized expected* SFS for a random sample of *n* genomes drawn from a population is obtained by taking the expectation of the SFS over the distribution of sample genealogical histories under a specified population demography. In this work, we will be concerned with well-mixed, panmictic populations with time-varying historical population sizes, evolving according to the neutral coalescent process with the infinite-sites model of mutation. The coalescent arises as the continuum limit of a large class of discrete models of random mating—such as the Wright–Fisher, Moran, and Cannings exchangeable family of models (Möhle and Sagitov 2001)—by a suitable rescaling of time and taking the population size to infinity. The infinite-sites model postulates that every mutation in the genealogy of a sample occurs at a distinct site and is commonly employed in population genetic studies for organisms with low population-scaled mutation rates, such as humans. The SFS also appears in the context of statistical modeling as a vector of probabilities. In particular, the *normalized expected* SFS defined by normalizing the entries of so that they sum to one, gives the probability that a mutation chosen at random is present in *k* out of *n* sequences in the sample. Unless stated otherwise, we use the term expected SFS to refer to the unnormalized quantity

The expected SFS is strongly influenced by the demographic history of the population, and extensive theoretical and empirical work has been done to characterize this dependence (Fu 1995; Wakeley and Hey 1997; Polanski *et al.* 2003; Marth *et al.* 2004; Chen 2012; Jouganous *et al.* 2017; Kamm *et al.* 2017). Fu (1995) showed that under the infinite-sites model for a panmictic population with constant size and no selection, the expected SFS is given by where denotes the population-scaled mutation rate. When the population size is variable, however, the formula for the expected SFS depends on the entire population size history. In particular, Polanski and Kimmel (2003) (in equations 13–15) showed that the expected SFS under a time-varying population size is given by with being an -by- invertible matrix that only depends on *n* (formula presented in Appendix) and where denotes the expected time to the first coalescence event in a random sample of size *m* drawn from the population at present. For any time-varying population size function the quantity is given by the following expression:

### Pathologies of SFS-based inference algorithms

Let us consider a hypothetical scenario. Suppose we would like to learn about the population history of a group of finches on a remote island. Fossil evidence indicates that the island experienced many generations with ample resources leading to a large, roughly constant population size. Then, some catastrophe occurred, rendering the island’s resources scarce, leading to a small constant population size until the present. We are given four haplotypes from the population and we hope to infer the following parameters for a demographic model based on the history described above:

How big was the population during the epoch of plenty?

How big was the population during the epoch of scarcity?

When did the catastrophe occur, marking the break point?

First, we compute the SFS for the four haplotypes we collected. (Our choice of a sample size of four is for simplicity in this example, but the principles apply for larger samples.) We count singleton (appearing in only one of the haplotypes), doubleton, and tripleton mutations. We do not attempt to track nonsegregating sites. Now we have the SFS, a vector of three real numbers.

Next, we ask ourselves: would we expect to obtain this SFS for some particular set of parameters, based on our model? If the answer is yes, then that set of parameters is our best guess. In Figure 1, the green region describes the set of SFS we would expect for various parameters under this model. Blue dots indicate measured SFS. When the blue dots land in the green region, we simply infer the parameters corresponding to that point. The red crosses are the expected SFS computed for those parameters, so they coincide with the blue dots.

What if the answer is no? That is, what if the SFS we measured would not be expected for any choice of parameters in our population history model? We have two options to interpret this situation: (1) statistical noise is making the SFS appear inconsistent with the model, or (2) our model is mis-specified. Let us suppose that noise is the culprit. Our strategy is then to look for the *closest* SFS that would be expected in our model, and infer the parameters associated with that one.

This runs into two problems: First off, the parameters inferred in this way are often nonsensical. In Figure 1, the blue dots outside of the green region are connected by dotted lines to the closest SFS vectors in the green region. Naturally, these mainly lie on the boundary of the green region. The problem is that the boundary points (with one exception that we will discuss later) do not actually correspond to achievable expected SFS vectors! Those points correspond to population size histories where one of the epochs is infinity or zero.

The second problem: Even though there is, in general, a unique closest SFS to a given point outside of the green region, the process of finding the closest point is *highly sensitive* to noise. Specifically, if you change the quantities in the vector by a small amount, the resulting closest point may change by a large amount. The reason for this is that the set is *nonconvex*, meaning that not all of the straight lines between points in the green region lie inside the green region. As a consequence, some of the blue dots point to the left-hand green region, while others nearby point to the right-hand green region. Sensitivity to noise is a big problem for inference. Any demographic inference method would manifest these pathologies; indeed, the commonly used (Gutenkunst *et al.* 2009), fastsimcoal2 (Excoffier *et al.* 2013), and fastNeutrino (Bhaskar *et al.* 2015) all encounter these issues.

If we hypothesize that the model may be mis-specified, we need to support this assertion. The following question will arise: How *far away* is our measured SFS from the type of SFS that we would expect under the rejected population model? Furthermore, we may be asked to offer an alternative hypothesis, *i.e.*, is there another model that actually does allow for an SFS equal to or near the one that we measured? Both of these questions require an understanding of the set of all possible SFS.

### Minimal demographic complexity for SFS reconstruction

Let us slightly change our finch example. Suppose we have no *a priori* assumptions regarding the demographic history. Instead, we are only interested in determining whether the SFS is consistent with a null hypothesis of a single panmictic population under neutrality. If the measured SFS is equal to the expected SFS for some demography, we may be asked to produce the *simplest* demography with the expected SFS we want. Work by Myers *et al.* (2008) implies that there are infinitely many population size histories with a given expected SFS, as long as we allow the demographies to be arbitrarily complicated. Bhaskar and Song (2014) (two of the authors of this article) demonstrated that when we constrain ourselves to a simpler family of population size histories, we may have a unique function achieving the desired expected SFS.

Now suppose that the SFS does not equal the expected SFS for any demography. Again, we would need to quantify how far away it is from being achieved by some demography. This is an intimidating task. How can we be certain to find the SFS corresponding to every demography without leaving any SFS vectors out? After all, the space of possible population size histories is infinite-dimensional! Our hope is to understand the *shape* of the set of all possible SFS vectors so we know that we have covered everything when we reject the null hypothesis.

For the small example of a sample size of four, we have demonstrated a sequence of constraints placed on SFS vectors in Figure 2. The vectors of interest have three coordinates corresponding to singleton, doubleton, and tripleton mutations. Note that any vector of probabilities must be nonnegative and must sum to one. This means we are constrained to the triangle with vertices and We can ignore the third coordinate since it will always be one minus the others. This triangle is depicted in yellow in Figure 2. One might naively hope that every one of these probability vectors is achievable as the expected SFS of some demography.

A result proved by Sargsyan and Wakeley (2008) is that SFS vectors must be nonincreasing—this means that we are left with the triangle with vertices and This is depicted in blue in Figure 2. They further proved that the SFS is convex. This implies that the second coordinate is less than the average of the other two. This further cuts down our possibilities to the triangle with vertices and depicted in red in Figure 2. If we want SFS vectors for population size histories with two constant pieces, we are further constrained to the green region, which we will describe algebraically later.

We will be able to completely describe the shape of all SFS for a sample size of four using algebraic formulas for the boundary. In fact, we will show that to find all possible SFS for a sample size of four it is sufficient to consider piecewise-constant functions with at most three constant pieces. Furthermore, we will use tools from convex and algebraic geometry to extend our intuition from this small case study to the SFS for all sample sizes.

### Summary of main results

Studying the geometry of the set of expected SFS will address both of the areas discussed above:

Explaining the pathologies in SFS-based inference.

Describing the full set of SFS for a fixed sample size.

In this way, we can help researchers understand why fitting parameters to certain demographic models will lead to runaway behavior. We also enable researchers to reject a null hypothesis of a single panmictic population under neutrality.

Our main result is Theorem 8 which focuses on piecewise-constant demographies. It shows that for every sample size *n*, there is a crucial threshold in demographic complexity, which we denote . If we are fitting to a demographic model with fewer than constant pieces, then the set of all SFS will be nonconvex and we must expect pathological behavior as described above. However, once we allow for constant pieces , we get the full set of SFS for all demographies. Proving that this set is convex is left for later work.

## Piecewise-Constant Demographies

In this section, we will define two sets: one of them will be the set of expected SFS for piecewise-constant population size histories. As described in the Introduction, this is an important set for inference. The other set is the set of expected coalescence vectors; this is not as commonly used as the SFS, but it helps us build a strong understanding of the SFS. This is because it is related to the SFS by a simple transformation and yet it is much easier to formulate.

Let be the set of piecewise-constant population size functions with *k* pieces. Any population size function in is described by positive numbers, representing the *k* population sizes and the time points when the population size changes. Let which we call the SFS “manifold,” denote the set of all expected SFS vectors for a sample of size *n* that can be generated by population size functions in (Note that the sets and are not technically manifolds; they would be more accurately described as semialgebraic sets. However, for expository purposes, we use the widely known term manifold.) Similarly, let called the -coalescence manifold, denote the set of all vectors giving the expected first coalescence times of samples of size for population size functions in Let and respectively, be equal to the normalization of all points in and by their norms (*i.e.*, the sums of their coordinates). Note that both manifolds live in and their normalized versions live in the -dimensional simplex this is the set of nonnegative vectors in whose coordinates sum to 1.

Now that we have defined our basic objects of study, we can describe the remainder of the article: First, we provide a complete geometric picture of the SFS manifold describing the expected SFS for samples of size under piecewise-constant population size functions with an arbitrary number *k* of pieces. We make explicit the map between regions of the demographic model space and the corresponding probability vectors, and this will foreshadow some of the difficulties with population size inference in practice. Next, we develop a characterization of the space of expected SFS for arbitrary population size histories. In particular, we show that for any sample size *n*, there is a finite integer such that the expected SFS for a sample of *n* under any population size history can be generated by a piecewise-constant population size function with at most epochs. Stated another way, we show that the SFS manifold contains the expected SFS for all possible population size histories, no matter how complicated their functional forms. We establish bounds on that are linear in *n* and along the way prove some interesting results regarding the geometry of the general SFS manifold.

Before proceeding further, we state a proposition regarding the structure of the map from to which we will call the vector of transformed break points is denoted by and defined below, while the vector of population sizes in the *k* epochs is denoted by It turns out that we can formulate the expected coalescence times as polynomial functions of the *x* and *y* variables. Two different ways of writing those functions down will give us two perspectives on their shape. All proofs of the results presented in this article are deferred to the Appendix.

* Proposition 1. Fix a piecewise-constant population size function in with epochs where and which has constant population size value in the epoch for . Let for where (corresponding to time ), and define (corresponding to time ) for convenience. The vectors where and for all j, (uniquely) identify the population size functions in and they satisfy both of the following equations:*(2)(3)

*where is the expected first coalescence time for a sample of size m, as defined in (1).*

These two formulations provide two different ways of looking at the coalescence manifold

In (2), the left-hand matrix called has each column of the same form with two parameters; this indicates they all live in a two-dimensional surface. Imagine, for example, the surface of the earth. There are two degrees of freedom: north–south and east–west. Here, too, specifying the value of each column, regardless of the value of

*n*, is dependent on two numbers. Explicitly, each column is given by for some inputs*a*and*b*.Additionally, the vector has all positive entries. That means that, when we combine columns from our surface, they will not cancel in unexpected ways due to negative coefficients. The set of positive combinations of a set of points is called a cone, and it is very nicely behaved geometrically. This means that the vector is contained in the cone over the surface described by the columns of

In (3), the left-hand matrix called has each column of the same form with one parameter; this indicates they all live on a curve. Like a train on a track, this has one degree of freedom, only forward–backward. Explicitly, each column is given by for some input

*a*.The vector on the left-hand side has entries with possibly negative coordinates. So the vector is contained in the linear span of the curve described by the columns of Unfortunately, a linear span is not quite as nicely behaved as a cone. Still, this formulation gains the simplicity of having one degree of freedom instead of two.

Proposition 1 gives us the algebraic mappings that will serve as our objects of interest. Since the SFS manifold is simply a linear transformation of the coalescence manifold, we will use these maps as our entry into understanding the SFS manifold.

## The SFS Manifold

### A toy model

The first in-depth study will involve the set of all possible expected SFS for a sample size of four. We choose for a number of reasons: First, the cases of sample sizes of two and three are not interesting. When we only have two haplotypes, there is only one entry in the SFS vector, *i.e.*, singletons. The resulting set of possible expected SFS is just the set of all positive numbers. When we have three haplotypes, it is only slightly better. Because there must be fewer doubletons than singletons, the possible expected SFS is somewhere in the wedge between 0 and from the origin; this turns out to be the only constraint.

Second, when the SFS manifold lives in which can be nicely visualized, and the normalized SFS manifold lives in the triangle with vertices and Finally, as observed in Proposition 1, the most interesting phenomena in SFS manifolds of any dimension are fundamentally phenomena of curves and surfaces. These are already captured in the case.

For the sake of completeness, we begin by formally describing the coalescence manifolds for the trivial cases of and

**Proposition 2.** *We list some basic results on the coalescence manifolds with sample size n and k population epochs*, *for small values of *

The manifold for all

*n*.The manifold for all

The manifold for all

Note that from (2) and (3) for it follows that for In words, rescaling the population sizes in each epoch by a constant *a* also rescales the first coalescence times by *a*. This implies that every point in the coalescence manifold generates a full ray contained in the coalescence manifold. Another consequence is that the normalized coalescence manifold is precisely the intersection of the coalescence manifold with the simplex

With that justification, we begin to consider the normalized coalescence manifold living in the simplex. As stated in Proposition 2, is a ray, which implies that is a single point. We now characterize the set Again, this is the set of possible SFS for two-epoch, piecewise-constant population size histories considered as a subset of all vectors summing to one.

**Proposition 3.** *The manifold describing normalized expected times to first coalescence for sample size 4 and two population epochs*, *is a two-dimensional subset of the 2-simplex which can be described as the union of the point with the interiors of the convex hulls of two curves and * The curves are parametrized as follows:where denotes

This set has some highly unpleasant geometry. First of all, the set is nonconvex; topologically, it is also neither closed nor open because most of the boundary is excluded with the exception of the point The set is visualized in Figure 3A.

To precisely illustrate the geometry of we will consider how contours in the domain map to contours in the image. Specifically, we plot the images of lines with fixed values of respectively fixed values of to in the 2-simplex. The resulting contours are pictured in Figure 4.

Finally, we consider how the map *χ* acts on the boundaries of the domain. To aid visualization, we limit the inputs to and since all rescalings of and by the same positive constant while keeping fixed map to the same normalized coalescence vector. The resulting map is illustrated in Figure 5.

We note that the map fails to be one-to-one within the domain only when this is also in the preimage of the point The inverse function theorem implies that on the complement of the map is a homeomorphism (a map that preserves topological features like number of components). This is consistent with our observation that the two rectangles in Figure 5A correspond to the two envelopes in Figure 5C. Now, we consider demographies with more than two epochs. This proposition implies that any expected SFS for a sample size of four coming from a single panmictic population under neutrality, regardless of the true population size history, is equal to the expected SFS for some piecewise-constant history with only three pieces. It also shows that all of these SFS vectors live inside of the convex hull of one curve.

**Proposition 4.** *For all values the manifold and is the interior of the convex hull of the following curve*:As we can see from Proposition 4, is open and convex; however, we lose one useful property of the normalized map Specifically, let be given by noting that for Under this definition is generically one-to-one (*i.e.*, one-to-one away from a set of measure zero). Meanwhile, the analogous construction mapping the three-epoch demography with break points and population sizes to the corresponding normalized coalescence vector has two-dimensional preimages, generically. For this reason, contour images do not lend themselves to easy description.

However, as a heuristic, we can choose a distinguished member of this preimage with nice properties. In the orange region adjacent to depicted in Figure 6, every preimage contains a limit demography with first and third epochs set to zero, and second epoch set to one. This can be thought of as a demography with a population boom in the second epoch. In the blue region adjacent to the line segment from to every preimage contains a limit demography with second epoch set to zero. This corresponds to a demography with a population bottleneck in the second epoch. Because the set of demographies mapping to each point is two dimensional, this does not describe all demographies characterized by a chosen SFS, but it does give us intuition for the types of demographies to expect.

We can also describe the image of the map on the boundaries of our domain. The easiest way to visualize the map is first to understand how the time variables affect the value of the columns of and to view the *y* variables as specifying points in the convex hull of those three columns. The boundaries of the square map the columns (after rescaling to the simplex) as follows:The case of is the most interesting: when we fix and we obtain the boundary curve Note that corresponds to a second epoch of length zero. The intuition is that very short population booms at the second epoch lead to coalescence vectors close to The maps encoded by a general column of correspond to the interior of the orange region in Figure 7A. Adding in convex combinations of points gives the lined region, which is the remainder of this is discussed more rigorously in the Appendix. When the number of epochs *k* steps higher, all columns of still map to the same region of the simplex, so will still be contained in this convex hull. The region is depicted in Figure 7A.

As mentioned earlier, the SFS manifold is merely a linear transformation of however, since it is of interest in its own right, we include the formulas for analogous to those derived in this section.

**Proposition 5.** *The following hold for the normalized SFS manifold*: is the union of with the convex hulls of two curves:Here, also, denotes Finally, for all *k*, and is the convex hull of whereVisualizations of and may be found in Figure 3B and Figure 7B.

### General properties

In this section, we examine the constant defined earlier as the smallest index for which for all *k*. The tools for the proofs in this section come from algebraic geometry (for the derivation of the lower bound) and convex geometry (for the upper bound).

The gist of the algebraic geometry argument is that, under the formulation, the manifold can be seen to be part of another manifold built by a sequence of well-understood algebraic constructions. Details of this perspective are reserved for the proofs section in the Appendix.

Two concrete consequences follow from this observation:

The ability to compute all equations satisfied by using computer algebra.

A formula for the dimension of the coalescence and SFS manifolds.

While the former is harder to explain without more setup, the latter can be simply stated: the dimension of the normalized coalescence manifold is 0 when we have the constant demography If we allow *k* constant pieces, the manifold has dimension unless is greater than the dimension of the simplex In that case, it has dimension

**Proposition 6.** *The dimension of is given by*:In particular, is a proper subset of for

While Proposition 6 is useful for analyzing individual coalescence manifolds, it also leads to the observation that since the inclusions are proper until that index. It is worth remarking that a slightly weaker lower bound of follows immediately from the identifiability result of corollary 7 in Bhaskar and Song (2014), which states that for a piecewise-constant population size function with *k* pieces, the expected SFS of a sample of size suffices to uniquely identify the function.

We will illustrate how these algebraic ideas can be applied in the next case we have not seen, namely sample size

* Example 7. Note that by Proposition 2. We will use the new ideas above to describe for higher values of k*.

Since the normalized coalescence manifold has dimension we know that has dimension 2 inside of the 3-simplex; therefore, we anticipate that it will satisfy one equation, matching its codimension. The degree of the algebraic variety implies that this polynomial should have degree 8. Indeed, when we compute this equation using computer algebra software Macaulay2 (Grayson and Stillman 2002), we obtain a huge degree-8 polynomial with 105 terms, whose largest integer coefficient is Finally, is full-dimensional in the 3-simplex, so it will satisfy no algebraic equations relative to the simplex. It would be defined instead by the inequalities determining its boundary.

The convex geometry argument is more elementary. As we noted, the formulation is contained in the convex hull over the surface described by a general column of Because the columns are related, our selection of points in the surface is not unrestricted. For this reason, it is not obviously equal to the convex hull. However, once we fix some collection of values to be input in the formula for we can use convex geometry for the resulting polytope. In particular, we use Carathéodory’s theorem [Carathéodory (1907) or Barvinok (2002), theorem 2.3], which states that for *X* a subset of every can be represented as a positive combination of vectors for some

The argument, roughly, allows us to construct any point in that convex hull, with as few as points. This allows us to place the point in for Since no new SFS are generated by using more than epochs, we learn that is bounded above by

Combining the two bounds obtained in this section, we have the main theorem described in the Introduction.

**Theorem 8.** *For any integer there exists a positive integer such that for all* Furthermore, satisfiesAdditionally, is nonconvex for all values of

This allows us to express the SFS from any piecewise-constant demography as coming from a demography with relatively few epochs. Because the SFS is an integral over the demography, the SFS from a general measurable demography can be uniformly approximated by a piecewise-constant demography with sufficiently many epochs. Our results imply that it can be precisely obtained by a demography with at most epochs.

### Data availability

The authors affirm that all data necessary for confirming the conclusions of the article are present within the article, figures, and tables.

## Discussion

In this work, we characterized the manifold of expected SFS generated by piecewise-constant population histories with *k* epochs, while giving a complete geometric description of this manifold for the sample size and epochs. This special case is already rich enough to shed light on the issues that practitioners can face when inferring population demographies from SFS data using popular software programs. While we demonstrated these issues in Figure 1 using the fastNeutrino program (Bhaskar *et al.* 2015), the issues we point out are *inherent* to the geometry of the SFS manifold and not specific to any particular demographic inference software. Our simulations showed that the demographic inference problem from SFS data can be fraught with interpretability issues, due to the sensitivity of the inferred demographies to small changes in the observed SFS data. These results can also be viewed as complementary to recent pessimistic minimax bounds on the number of segregating sites required to reliably infer ancient population size histories (Terhorst and Song 2015; Baharian and Gravel 2018).

Our investigation of piecewise-constant population histories also lets us show a general result that the expected SFS for a sample of size *n* under *any population history* can also be generated by a piecewise-constant population history with at most epochs. This result could have potential applications for developing nonparametric statistical tests of neutrality. Most existing tests of neutrality using classical population genetic statistics such as Tajima’s *D* (Tajima 1989) implicitly test the null hypothesis of selective neutrality and a constant effective population size (Stajich and Hahn 2004). We have characterized the expected SFS of samples of size *n* under arbitrary population histories in terms of the expected SFS under piecewise-constant population histories with at most epochs. As a result, the Kullback–Leibler (KL) divergence of an observed SFS to the expected SFS under the best-fitting, piecewise-constant population history with at most epochs is also equal (up to a constant shift) to the negative log-likelihood of the observed SFS under the best-fitting population size history *without any constraints on its form*. (This assumes the commonly used Poisson random field model where sites being analyzed are unlinked.) One can then use the KL divergence inferred by existing parametric demographic inference programs to create rejection regions for the null hypothesis of selective neutrality without having to make any parametric assumption on the underlying demography. Such an approach would also obviate the need for interpreting the inferred demography itself, since the space of piecewise-constant population histories is only being used to compute the best possible log-likelihood under any single population demographic model. This approach could serve as an alternative to recent works which first estimate a parametric demography using genome-wide sites, and then perform a hypothesis test in each genomic region using simulated distributions of SFS statistics like Tajima’s *D* under the inferred demography (Rafajlović *et al.* 2014). We leave the exploration of such tests for future work.

## Acknowledgments

We thank Simon Gravel, Jeremy Berg, Laura Hayward, Yuval Simons, and the referees for their careful reading of our manuscript and for providing us with helpful comments. We also thank the Simons Institute for the Theory of Computing, where some of this work was carried out while the authors were participating in the “Evolutionary Biology and the Theory of Computing” program. This research is supported in part by a Math+X Research Grant, a National Science Foundation grant DMS-1149312 (CAREER), a National Institutes of Health grant R01 GM-109454, and a Packard Fellowship for Science and Engineering. Y.S.S. is a Chan Zuckerberg Biohub investigator.

## Appendix

### Formula for A_{n}

Recall that the SFS can be related to times to first coalescence by the formula The formula for is given recursively in Polanski and Kimmel (2003) (equations 13–15) by the following formulas (with variable names changed for clarity):

### Proof of Proposition 1

First, we reduce the integral expression for to a finite sum; then we make appropriate manipulations until we arrive at the desired expressions.

Coalescence in the Wright–Fisher model is an inhomogeneous Poisson process with parameter Therefore, the probability density of first coalescence at time *T* is:Let To compute the expected time to first coalescence, we have the integral:Substituting variables, note that Therefore, the integral becomes:where

The population size is a piecewise-constant function, whose value is if As specified in Proposition 1, and is the vector of population sizes. Observe that is also piecewise constant. In particular,Let for brevity. The resulting formula is:We turn the integral into a sum of integrals on the constant epochs:We now make the substitution Note that the old restriction becomes the new constraint Our formula for the is now:Noting the linear form of this expression, we factor as a matrix multiplication:Combining the first three matrices yields (2); combining the first two and last two separately yields (3). □

### Proof of Proposition 2

We justify each equation in turn:

As mentioned in the Introduction, this is a classical result in population genetics, and can be derived directly from (3).

The inclusion is immediate, so we need only show that any satisfies Using (2),

*a*is written as a sum of products of strictly positive numbers; soFirst, we show that is the interior of the open cone spanned by and Fix (for

*a*positive) and consider

When the second vector approaches when the first vector approaches The vectors are in the interior of that cone for all other permissible values of and To show that note that for larger values of *k*, the same cone of vectors are produced. In particular, yields

Clearly, the second coordinate of all vectors is bounded between zero and one.

### Proof of Proposition 3

First we observe that and are normalizations of the curves defined by parameterizations and where *t* is constrained to the open interval

Now we claim that the definition in terms of the map is equivalent to the definition in terms of these two curves. We can use the first formulation of *χ* to prove this:When the image is the point as stated. When we can use the left-hand expression to view the image as a point on the line segment between and the curve When the right-hand expression can be used to write the image as a point on the line segment between *X* and This means that the image of *χ* is contained in the regions and point specified.

To show that the reverse inclusion holds, we fix a point *P* in the interior of the convex hull of . By convexity, the line segment from *X* to *P* is contained in the region; continue in the direction until the line intersects the curve. This must occur because all points in the region are further from the bounding line than *X*. The point of intersection *q* is specified as for some By convexity, there exists some *ρ* such that Fixing and shows that *P* is in the image of *χ*. The same argument holds with slight variation for

### Proof of Proposition 4

The strategy to prove the equality of and the cone over comes in two steps:

Show that the columns of are always contained in the region

*R*whose boundary isDivide the convex hull of

*R*into two regions and show that each of these regions are included in

First we demonstrate that the regions map precisely into *R*. We have already shown in the main text that the boundaries of map to the boundaries of *R* under the mapping defined by where *S* is the sum of the coordinates. We compute the Jacobian of this map explicitly in Macaulay2 (Grayson and Stillman 2002). The result is:Plainly, this is nowhere zero in our domain. The inverse function theorem then implies that the interior is contained in the image of the boundaries. This accomplishes Step 1 of our proof.

For Step 2, we divide the image into two regions:

The triangle defined by vertices and including the two edges and

The remainder of the convex hull of

*R*—explicitly, the interior of the region bounded by and the line segment

To show that the triangle is included, let and let vary. The third column then sits arbitrarily close to and the first column traces out Set and toggle and to obtain the full span, including the interior of the triangle, and the line segment Set and the first column sits at while the third column traces out This catches the missing line segment.

For the remainder of the convex hull, fix a point *P* in this region. This point lies on a line segment between and some point *Q* in Suppose it is equal to Set We can choose *ɛ* and so that the second column is arbitrarily close to *P*. Furthermore, observe that the first column is approximately equal to the point on corresponding to and the third column is approximately the point on corresponding to Choosing and points us to

### Proof of Proposition 5

This is a direct application of the linear map computed as in Polanski and Kimmel (2003):

### Proof of Proposition 6

To prove the result about dimension, we show that is a relatively open subset of a certain algebraic variety. Because the relevant operations are native to projective geometry, we transport our objects of interest in the obvious way to projective space. The same scaling properties that allow us to focus on the simplex also lead to good behavior in projective space.

**Lemma 9.** *For the Zariski closure of is the affine cone over where*:

The symbol denotes the projective curve defined by mapping to

The symbol is the projective point

The operation J denotes the join of algebraic varieties.

The operation denotes the

*i*-th secant variety. Following Harris (2013), the*i*-th secant variety is the union of*i*-dimensional planes generated by points in the variety.

*Proof of Lemma 9*. The variety is the image of the following map:where and are not simultaneously zero, and *λ* is unrestricted.

Define the map sending toWe can recast the expression in (3) as the composition Based on this formulation, the set is clearly contained in To demonstrate the equality of the Zariski closures, we only need to show that the dimensions match and that the variety is irreducible. Both joins and secants have the property that irreducible inputs yield irreducible outputs, so the variety of interest is irreducible. The image of *ϕ* is open in and the map *ψ* has deficient rank on a set of positive codimension. Therefore, the composition of has full dimension. This proves the Lemma. □

The *i*-th secant variety of an irreducible nondegenerate curve in has projective dimension given by (Harris 2013, exercise 16.16). The curve is a toric transformation of a coordinate projection of the rational normal curve. The rational normal curve is nondegenerate and both of these operations preserve that property. This means our secant variety has projective dimension The join with a point adds 1 to the dimension of the variety, while the operation of passing to the affine cone adds 1 to the dimension of the variety and the ambient space. However, normalizing to the -simplex subtracts 1 from both variety and ambient space again. This means that assuming that

### Proof of Upper Bound in Theorem 8

Suppose a point c is in By definition, this implies that there is a point such that (2) yieldsSince the point c is in the cone over the *q* columns of the matrix, Carathéodory’s theorem implies that it is also in the cone over some of the columns. Therefore, we can replace the vector with so that all but (or fewer) are zero.

Passing to the expression in (3), this gives us:Since at most of the are nonzero, at most of the entries of the vector at right are nonzero. We delete the columns of the *X* matrix corresponding to zero entries except the first column. A new sequence may then be obtained from the ratio between the first entries in adjacent columns. The new sequence is obtained by taking the sequence of partial sums of the vector.

### Proof of Nonconvexity in Theorem 8

To prove this final result, we combine two properties already proven:

The manifold is a proper subset of for all (from Proposition 6).

The manifold is contained in the convex hull of (This follows from Equation 2.)

Since contains and is properly contained in the convex hull of it cannot be convex.

## Footnotes

*Communicating editor: S. Ramachandran*

- Received March 29, 2018.
- Accepted July 30, 2018.

- Copyright © 2018 by the Genetics Society of America