## Abstract

The advent of the genomic era has necessitated the development of methods capable of analyzing large volumes of genomic data efficiently. Being able to reliably identify bottlenecks—extreme population size changes of short duration—not only is interesting in the context of speciation and extinction but also matters (as a null model) when inferring selection. Bottlenecks can be detected in polymorphism data via their distorting effect on the shape of the underlying genealogy. Here, we use the generating function of genealogies to derive the probability of mutational configurations in short sequence blocks under a simple bottleneck model. Given a large number of nonrecombining blocks, we can compute maximum-likelihood estimates of the time and strength of the bottleneck. Our method relies on a simple summary of the joint distribution of polymorphic sites. We extend the site frequency spectrum by counting mutations in frequency classes in short sequence blocks. Using linkage information over short distances in this way gives greater power to detect bottlenecks than the site frequency spectrum and potentially opens up a wide range of demographic histories to blockwise inference. Finally, we apply our method to genomic data from a species of pig (*Sus cebifrons*) endemic to islands in the center and west of the Philippines to estimate whether a bottleneck occurred upon island colonization and compare our scheme to Li and Durbin’s pairwise sequentially Markovian coalescent (PSMC) both for the pig data and using simulations.

MUCH can be learned about the demographic history of a species or population from sequence variation. Bottlenecks—short periods where the population size is drastically reduced before recovering in size—are commonly detected demographic events. Because bottlenecks are often associated with range shifts caused by Pleistocene fluctuations in climate (*e.g.*, Moura *et al.* 2014), they are of particular interest to researchers studying contemporary biogeographic patterns. Bottlenecks lead to a sudden, strong increase in the rate of coalescence that may have different effects on the shape of genealogies. A strong bottleneck may trap all lineages, leading to a star-shaped genealogy. Alternatively, some lineages may “escape” weaker bottlenecks leading to longer basal branches, each connected to a clade of lineages that coalesced during the bottleneck (Figure 1). Thus, strong bottlenecks lead to an excess of rare variants, while moderate bottlenecks produce an excess of intermediate-frequency variants. These opposing effects make it difficult to reliably diagnose bottlenecks from simple summary statistics of nucleotide diversity such as Fu’s *D* (Fu and Li 1993) and Tajima’s *D* (Tajima 1989). The problem is exacerbated because signals of bottlenecks can be confounded by other population processes such as geographic structure or selective sweeps (*e.g.*, Nielsen and Beaumont 2009).

Given the stochasticity of the coalescent in general and the similarity in signal between selective sweeps and bottlenecks at a single locus (*e.g.*, Galtier *et al.* 2000), integrating information across multiple loci is imperative to robust demographic inference. For example, Li and Durbin’s (2011) pairwise sequentially Markovian coalescent (PSMC) method models transitions in the times to the most recent common ancestor along an individual genome and from this infers a trajectory of population size changes. The PSMC relies on an approximation to the coalescent and, because it makes use of long-range linkage information, requires fairly well-assembled reference genomes.

At the other extreme, a number of popular methods of demographic inference are based on the site frequency spectrum (SFS)—the number of sites in a sample of *n* sequences with *k* copies of the derived allele (*e.g.*, Gutenkunst *et al.* 2009). The SFS throws away all linkage information and therefore results in a drastic loss of information. Although Bhaskar and Song (2014) showed that for the piecewise constant models of population size change, the SFS is a sufficient statistic given enough data, Terhorst and Song (2015) show that the error of SFS-based estimates converges at rate where *s* is the number of segregating sites. Both articles emphasize that this could be remedied by incorporating linkage information.

An alternative set of methods bases inference on many short loci (blocks of sequence) without requiring the long-range linkage information necessary for the PSMC. Considering linked sites within sequence “blocks” exploits the demographic information contained in the distribution of genealogical branches while still avoiding the need to model the ancestral recombination graph. This class of methods assumes that intrablock recombination can be ignored (Yang 2002; Hey and Nielsen 2004). Using short-read sequencing technologies, data sets containing large numbers of short blocks can easily be obtained for any organism, for example in the form of low-coverage fragmented genome assemblies, restriction site associated DNA (RAD), or transcriptome data (*e.g.*, Davey *et al.* 2011; McCormack *et al.* 2013; Hearn *et al.* 2014).

Using outgroup information to polarize mutations and assuming an infinite-sites mutation model, polymorphic sites in a nonrecombining alignment can be summarized as counts of mutations on each possible genealogical branch without loss of information. Each possible combination of mutation counts is a unique mutational configuration. Lohse *et al.* (2011) showed how the generating function (GF) of genealogies can be used to derive the probability of mutational configurations for a large range of demographic models. The probability of a mutational configuration is obtained by taking successive derivatives of the GF with respect to all relevant “dummy” variables, each corresponding to a different branch of the genealogy. This gives an efficient, maximum-likelihood scheme for estimating model parameters and comparing models from arbitrarily many sequence blocks. Although such likelihood calculations have been used to fit models of divergence and admixture from triplet samples (Hearn *et al.* 2014; Lohse and Frantz 2014), they fail for large numbers of samples (*n* > 4) because both derivation of the GF and the sheer number of possible mutational configurations become unmanageable (Lohse *et al.* 2015). Our main motivation for the present study was to develop a simple and general summary of blockwise data that (i) removes the need for phase information and (ii) captures short-range linkage information, allowing more powerful inferences than the SFS.

We first briefly describe the GF of genealogies under a bottleneck model and outline how it can be automatically generated and solved (using *Mathematica*). We then consider a new summary of blockwise data that is an extension of the SFS and summarizes polymorphic sites within blocks as counts of singletons, doubletons, etc. (*i.e.*, we lump branches with the same number of descendants, in effect generating an SFS for each block). We compare the power of this new scheme to both the likelihood-based analysis for full mutational configurations (for small samples) and the genome-wide SFS and use simulations to investigate the sensitivity of the new method to intrablock recombination and population structure. Finally, we apply our method to an example data set of Visayan warty pig (*Sus cebifrons*) genomes from the Phillippines and compare our inferences with PSMC analyses on the same data.

## Materials and Methods

### Outline of the model

We consider a sample of *n* lineages from a diploid, panmictic population with a current effective size We follow Galtier *et al.* (2000) in assuming a simple model of an instantaneous bottleneck that can be characterized by two parameters: a bottleneck start time, and a strength parameter, (Figure 1). We assume that the bottleneck is instantaneous, so that no mutations occur during it and its only effect is a sudden burst in coalescence that is measured by an imaginary time Both and are scaled in generations.

Note that this two-parameter model is simpler than bottleneck models that consider step changes in in real time (which requires at least three parameters, *e.g.*, Marth *et al.* 2004). Our motivation for assuming an instantaneous bottleneck is threefold. First, bottleneck duration and intensity are often confounded in practice; *i.e.*, it is generally hard to distinguish weak and long from short and strong bottlenecks (Marth *et al.* 2004). Second, an instantaneous bottleneck captures drastic events, *e.g.*, the colonization of new areas by a small founder population. Finally, instantaneous bottleneck histories extend to more general models of coalescence in which multiple mergers occur as a continuous process rather than an instantaneous event (*e.g.*, Barton *et al.* 2010; Coop and Ralph 2012).

We can describe the history of a sample of *n* lineages as a Markov process: going backward in time, pairs of lineages coalesce until when there is a sudden burst of coalescence due to the bottleneck. Within the bottleneck, coalescence proceeds normally, but since we ban mutations, there is no growth in genealogical branches during (Figure 1). can be thought of as the time necessary for the same amount of coalescence had the population size not changed.

We apply the general recursion for the GF of genealogical branches developed by Lohse *et al.* (2011, equation 4) to the bottleneck model. We denote the vector of all possible branches and label branches by the individuals they are connected to. The GF of the distribution of branch lengths is defined as where are dummy variables corresponding to For the bottleneck model, we need only to track coalescence events and transitions between three phases: before (1), during (2), and after the bottleneck (3). Analogous to derivations of the GF for histories involving discrete splits or admixture between populations (Lohse *et al.* 2011), we first write down a GF for a model in which transitions between phases ( and ) are exponentially distributed with rates and (Equation 1). The GF recursion for each phase is identical apart from the Λ terms specifying the rates at which lineages enter the next phase and the fact that mutations (and hence *ω* terms) are banned during the bottleneck (). We denote the GF of interest (where event times are discrete) as and note that can be obtained by multiplying by and inverting once for each Λ parameter. Looking backward in time, Ω denotes the configuration of the sample before the first coalescence event *i* and the configuration after it, (1)where and is the sum of the that increase during that interval. For the first event, these correspond to the terminal branches; *i.e.*, We have automated Equation 1 in *Mathematica*.

### Calculating likelihoods from the GF

The general method for computing the probability of observing a particular mutational configuration, which is defined as counts of mutations on and can be interpreted as the likelihood, has been described in detail previously (Lohse *et al.* 2011, equation 1) and involves taking higher-order derivatives of the GF of genealogical branches. We assume a mutation rate per branch of and tabulate exact probabilities only up to a maximum number of mutations per branch. Probabilities for configurations involving mutations per branch are combined to avoid having to distinguish very unlikely configurations.

Although our automation allows us to generate the GF for the bottleneck model for rather large samples of individuals (*n*), inverting (with respect to Λ) and solving the GF are slow for We employ a set of combinatorial strategies (detailed in Lohse *et al.* 2015) to speed up these steps (and see Figure 2). In particular, we make use of the fact that lineages are exchangeable. This means that the GF (and the likelihood) can be written as a sum over unlabeled tree shapes (*sensu* Felsenstein 1978, 2003) that define equivalence classes of identically distributed genealogies. For each tree shape, we can condition on a single arbitrary labeling of individuals by setting terms in the GF that are incompatible with it to zero. The full GF can be written as a sum of the GFs for the set of equivalence class representatives, each weighted by the size of its class (the number of ways the sample labels can be permuted on the tree shape, Figure 2). Second, we use *Mathematica* to successively solve the GF for increasingly large sample configurations, starting with terms for pairs of lineages, inserting this solution into the GF for and so on.

### Blockwise SFS simplification

The number of possible mutational configurations quickly becomes unmanageable as the number of genealogical branches increases: distinguishing mutations on each of the branches and allowing for a maximum of mutations per branch, there are mutational configurations for each equivalence class ( stems from the configurations involving 0 and mutations per branch).

We introduce a simple summary of blockwise data—which we term the blockwise SFS (bSFS)—to address this issue. Instead of distinguishing mutations on all genealogical branches, we combine branches (and their corresponding *ω* variables) with the same number of descendants so that the vector of branch lengths, now distinguishes only singleton, doubleton branches, etc. (*i.e.*, ). In data terms, we reduce the mutational configuration of a block to corresponding to counts of singletons, doubletons, etc. in each block. Note also that the genome-wide expected SFS () is given by

The bSFS simplification has two advantages: first, it reduces the number of branches and hence mutational configurations substantially. Because we distinguish only types of mutations, the number of mutational configurations goes down from to For example, with and there are 390,625 configurations in the full scheme, but only 625 defined via the bSFS. Second, the bSFS simplification does not require phased data because we no longer distinguish in which sampled lineage a mutation occurred. However, despite these simplifications, the GF and hence the probabilities of can be computed analytically only for rather limited sample sizes ().

Many current methods for detecting population size changes are based on the SFS (*e.g.*, Polanski and Kimmel 2003; Marth *et al.* 2004; Gutenkunst *et al.* 2009; Chen 2012). Because the bSFS considers the joint distribution of linked polymorphisms, it is, all else being equal, more powerful than the SFS. However, given that analytic likelihood calculations under the bSFS are feasible only for small *n*, it is important to compare their power to that of the SFS for larger *n*. In the *Appendix* we obtain the SFS () under the bottleneck model for any *n*, a result that we anticipate will be useful for researchers with access exclusively to single-nucleotide polymorphism (SNP) data.

### Power analyses

We measured power by calculating the expected difference in support () between the true model and a null model of constant population size for three schemes: (i) the full information for or 3 ( = 3), (ii) the bSFS scheme for (unfolded, = 3, but 6 for the singleton category; and folded, = 6), and (iii) the SFS for The full scheme for could be considered equivalent to the PSMC in the case where a continuous genome is not available.

Throughout, our choice of parameters was motivated by the pig data example, including genome size ( Gb), (), and mutation rate ( per base and generation). We explore a broad parameter space, split into four sets: we alter either or from 0.2 to 1.2, while fixing the other parameter to 0.2 or 0.8. This covers a broad range of bottleneck strengths and ages. For example, for pigs, it would include bottlenecks between 20,000 and 120,000 years ago. We additionally investigate the power to detect more recent ( ) and very strong ( ) bottlenecks that have a high probability of “trapping” all lineages.

Given the assumption of a constant *μ* per site, the scaled mutation rate, can be thought of as the length of blocks. Since we generally have no independent knowledge of *θ* in practice, we conditioned on the expected number of segregating sites per block, and adjusted *θ* to correspond to per block for a pairwise sample. Because the expected total length of the genealogy decreases with increasing and decreasing younger and stronger bottlenecks correspond to larger *θ*, *i.e.*, longer blocks (Supporting Information, Table S1;). Note that our assumption of no intralocus recombination is unaffected by conditioning on because both *S* and the number of recombination events in a block depend on the total length of the genealogy.

We quantified the accuracy to estimate a particular parameter, using Fisher information (Edwards 1972; Lohse and Frantz 2014). This measures the sharpness of the curve near the maximum and shows a linear relationship with the number of loci. Assuming parameter estimates are away from the boundaries, the inverse of Fisher information gives a lower bound on the expected variance of parameter estimates (Rao 1945). We used this to calculate the expected standard deviation for and across parameter space (and checked these using simulations).

### Robustness to population structure and recombination

Previous studies have shown that simple summary statistics (Tajima’s *D*, Fu and Li’s *D*, etc.) cannot distinguish between demographic size changes and population structure (Nielsen and Beaumont 2009; Städler *et al.* 2009; Chikhi *et al.* 2010). In particular, intermediate migration rates and local sampling, *i.e.*, when all samples come from a single deme that is part of a much larger metapopulation, have very similar effects on these statistics as demographic size changes. This can be understood considering the separation of timescales in the structured coalescent (Wakeley 1998). Looking backward in time, during the initial “scattering” phase, lineages in the same deme either coalesce or migrate to unsampled demes. During the “collecting” phase, coalescence is much slower because lineages first need to migrate back into an occupied deme before they can coalesce. Thus, a genealogy of a sample from a single deme may look like a bottlenecked genealogy (rapid initial coalescence “during the bottleneck,” followed by slow coalescence of the remaining lineages that “survived the bottleneck”). We explored the sensitivity of our method to population structure, using simulated data.

We used *ms* (Hudson 2002) to simulate data under an island model with symmetric migration at rate We simulated samples of four individuals in a metapopulation of 10 equally sized demes (*d*). Sampling was either local (all samples from the same deme) or scattered (all from a different deme). We chose a single metapopulation (corresponding to a per deme *θ* of 0.1).

We also used simulations to assess whether ignoring intralocus recombination biases parameter values. We simulated data for assuming a bottleneck at time with strength and a range of recombination rates: 0, 0.025, 0.125, 0.25, 1.25, 2.5, 5, 12.5, or 25 crossovers per generation per base pair. Assuming this corresponds to = 0, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10. These ratios fall either side of the recombination estimates for mammals of 1 cM/Mb; *i.e.*, (Tortereau *et al.* 2012).

For each parameter combination, we simulated 1,000,000 loci to obtain the expected frequencies of mutational configurations for the bSFS scheme. We maximized the under both a null model of constant population size and a bottleneck history and calculated between the two scenarios.

### Application to island pigs

We analyzed genomic data from three Visayan warty pigs, *S. cebifrons*, a species endemic to the Philippines. Each individual genome was sequenced to 10× coverage and aligned to the *S. scrofa* reference genome (Ssc10.2) (Groenen *et al.* 2012; Frantz *et al.* 2013, 2014; Bosse *et al.* 2015). We divided the reference genome of *S. scrofa* into nonoverlapping 1000-bp blocks. To ensure enough coverage to call all heterozygous sites in each block and to remove possible copy number variants (Paudel *et al.* 2013), we filtered out blocks with an average coverage <7× or higher than twice the genome-wide average, using the pileup format in SAMtools v0.1.12 (Li *et al.* 2009). Clusters of two or more SNPs in a 10-bp window were filtered out as well as SNPs within 3 bp of an indel. We removed blocks for which <90% of sites were covered and excluded sites with a coverage <4× (Gronau *et al.* 2011). Finally, we selected only blocks that passed the above filtering criteria in all samples.

This procedure left 1,103,026 aligned 1000-bp blocks (∼50% of the genome, with an average of 2.75 mutations per block). Given the mean divergence (1.8%) of the only usable outgroup, the African common warthog (*Phacochoerus africanus*), we used the folded bSFS for five samples (combining singletons with quadrupletons and doubletons with tripletons) to avoid biases due to mispolarization. To make use of all six alleles (two per individual), we averaged the counts of mutational configurations across all sample combinations, *i.e.*, after omitting one allele from each individual in turn. We used all blocks to obtain point estimates and computed confidence intervals of parameters and between models, using a simple correction: linkage between blocks, measured by the pairwise correlation coefficient in the number of segregating sites between blocks, dropped below 0.05 at a distance of 404 blocks (Figure S1) so we rescaled by a factor of 1/404. Finally, we fitted the bottleneck model to the folded genome-wide SFS for all six samples (using the same correction to rescale ).

We conducted a PSMC analysis (Li and Durbin 2011) on the same three individuals, using the parameters and 64 time intervals, and performed 100 bootstrap replicates for each sample. To calibrate parameter estimates, we assumed a generation time of 5 years and a mutation rate of per base and generation (see Groenen *et al.* 2012; Frantz *et al.* 2013). Finally, we compared our method to the PSMC using simulations. We used Heng Li’s modification of msHOT (https://github.com/lh3/foreign/tree/master/msHOT-lite) and ms to PSMC parser (https://github.com/lh3/psmc/blob/master/utils/ms2psmcfa.pl) to simulate 100 diploid sequences of 10 Mb under a variety of bottleneck parameters with an ratio of 0.4, and

### Data availability

The pig data are available at the European Nucleotide Archive (http://www.ebi.ac.uk/ena/, accession no. PRJEB9326). A Mathematica notebook to fit the bottleneck model to data using the bSFS is available upon request.

## Results

### Power analyses

Our power analyses highlight several features of the blockwise inference method (Figure 3). First, inferences based on blockwise data have high power to correctly detect a bottleneck history (measured by ) in absolute terms. We plot for data sets consisting of 2000 unlinked blocks, but because scales linearly with the number of blocks, the *y*-axes in Figure 3 can be rescaled for any number of blocks. Unsurprisingly, power decreases with bottleneck age and increases with bottleneck strength (Figure 3). Given data from 2000 unlinked blocks, even very weak bottlenecks would be detectable. For example, for a relatively recent and weak bottleneck ( and ), for the bSFS scheme for and for the full scheme for which both indicate a significantly ( likelihood-ratio test with 2 d.f.) better fit than a null model of constant size.

Second, blockwise likelihood computations are substantially more powerful than the SFS and the relative improvement is particularly large for young bottlenecks. We have highlighted the area of parameter space where the bSFS scheme for is the only scheme that can still detect a bottleneck (Figure 3, bottom). As increases, the SFS scheme for overtakes the full scheme for in terms of power. However, the bSFS scheme for remains substantially more powerful throughout.

Interestingly, the power to detect bottlenecks with the SFS does not decrease monotonically with respect to the start time, While very recent bottlenecks () are easiest to detect, older bottlenecks () are easier to detect than intermediate bottlenecks at (Figure 3, top right). In contrast, for all blockwise schemes, power decreases monotonically with start time.

To compare the different inference schemes in terms of their accuracy of parameter estimates, we calculated the expected standard deviation () of each parameter. The analytical results for match the empirical SD across replicate simulated data sets and scale with the number of loci () as (Figure 4, Figure S3, and Figure S4). Across schemes, increases as bottlenecks get older and weaker. is always less for blockwise schemes, apart from weak, recent bottlenecks (Figure 4, left) where is large and similar for both the blockwise schemes and the SFS with 20 individuals (the SFS for 4 individuals is even worse).

To get a sense of the extra information captured by the bSFS, we plotted the probability of mutational configurations (for ) (Figure 5; *cf*. Figure S5 for the SFS). This highlights two features of the bSFS. First, even for short blocks () there are many more mutational configurations than site frequency classes. For example, for a bottleneck of strength at time there are 19 configurations with probability Second, the probabilities of mutational configurations depend on the bottleneck parameters in a nonlinear way.

The marginal distributions of the total length of each type of *n*-ton branch also illustrate the complex effects bottlenecks have on genealogies (Figure 6). These can be found from the GF by differentiating with respect to the corresponding and evaluating at All distributions have a discontinuity at (Figure 6, dotted line) and at multiples of corresponding to varying numbers of coalescences occurring within the bottleneck. None of this complexity is captured by the genome-wide SFS that depends only on the mean of each distribution.

### Robustness to population structure and recombination

With local sampling, population structure is difficult to distinguish from a bottleneck that always has greater support than a model of constant size (Table 1, top row). This is true for a wide range of migration rates, *i.e.*, –0.5. At the limit of low *M*, the chance of migration out of the focal deme is minuscule and so the individual deme functions as an isolated, constant size population. At the limit of high *M*, the whole metapopulation functions as a single panmictic population. Where there is support, we find that estimates for are very close to zero (see Table S2 for parameter estimates) as expected if the bottleneck mimics the scattering phase of the structured coalescent. Conversely, with scattered sampling, we never observe support for a bottleneck, irrespective of *M*; *i.e.*, is close to zero and *θ* estimates become very large as *M* decreases (not shown).

To investigate how well data generated under the island model (local sampling) actually fit a bottleneck history, we computed the expected frequency of mutational configurations under the (mis)inferred bottleneck parameters. Using as a measure of model fit, we compared the bottleneck model and a null model of constant population size. Although a bottleneck history fits data simulated under population structure (with intermediate *M*) better than a null model, the fit is poor in absolute terms. In particular, we find that more singleton mutations are seen in the data than predicted under a bottleneck history (Figure S6). Thus, it may in principle be possible to distinguish the two models.

We find that with realistic levels of recombination, parameter estimates under the bSFS are essentially unbiased. For is increasingly overestimated, while *θ*, and between the bottleneck and the null model are underestimated. SD of parameter estimates using 2000 blocks is low for < 1 (Figure S7). These results are encouraging and suggest that inferences based on the bSFS are robust to realistic levels of recombination.

### Application to island pigs and comparison with the PSMC

A bottleneck history fitted the *S. cebifrons* data significantly better than a model of constant using the folded bSFS for (Δ*lnL* = 110 after correcting for linkage). We inferred a strong bottleneck () ∼141 KYA and an ancestral of While both *T* parameter estimates have narrow confidence intervals even after correction for linkage between blocks, there is more information about the time () than the strength () of the bottleneck (Table 2). Parameter estimates from the genome-wide SFS (for ) agree, but have much wider confidence intervals (Table 2).

Our estimate of also broadly coincides with the period of the lowest population size in the PSMC analyses of the three *S. cebifrons* genomes (Figure 8A). Applying the PSMC to data simulated under the bottleneck history inferred for the pigs (Table 2) gives a shorter period of reduced population size than the empirical plot (Figure 8A *vs.* Figure 8B). The PSMC plot generated from the simulated data also does not include the apparent difference in population size seen in the empirical plot; *i.e.*, after the bottleneck (looking backward in time) is approximately twofold greater than before. Applying the PSMC to data simulated under an instantaneous bottleneck produces a trajectory of gradual decline and rise of (Figure 8B). This is true for a range of bottleneck times and strengths (Figure S8). Reassuringly, the period of smallest coincides with the time of the bottleneck. An unexpected feature of all PSMC plots from simulated data is that they show a marked increase of prior (looking backward in time) to the bottleneck start time (Figure 8B and Figure S8).

## Discussion

Researchers have long been interested in using genetic data to infer population bottlenecks associated with landmark events in the history of species and populations, for example, the colonization of new regions or range contractions (reviewed in Gattepaille *et al.* 2013). Indeed, the demographic history of model organisms such as humans and *Drosophila* has been characterized as a series of expansions and contractions that coincide with the spread of populations across the globe (*e.g.*, Haddrill *et al.* 2005; Voight *et al.* 2005; Sjödin *et al.* 2012). However, the genetic resources for model organisms far exceed those available for most species. Thus there is a need for inference methods that do not require high-quality reference genomes, phase information, and/or large samples of individuals. Maximum-likelihood estimation based on the blockwise SFS makes optimal use of the information contained in short blocks of sequence from just a handful of individuals—data that can readily be obtained for any organism with current short-read sequencing technology. Furthermore, and in contrast to the PSMC framework (Li and Durbin 2011) that qualitatively documents changes in population size, our method explicitly assesses the statistical support for a bottleneck model over a null model (see Table S3 for the pros and cons of alternative demographic inference methods).

### Power of the blockwise SFS

We find that our blockwise method has higher power to detect bottlenecks (Figure 3 and Figure 4) than the SFS. Perhaps surprisingly, the bSFS scheme for 5 lineages almost always contained more information about past bottlenecks than the genome-wide SFS for 20 lineages, a more typical sample size for SFS-based analyses (see Bhaskar and Song 2014, for specific bounds on sample sizes under various piecewise population size change models). Although our strategy of exploiting the symmetries of the coalescent by partitioning the GF of branch lengths into a sum over unlabeled tree shapes makes it possible to compute the probability of full mutational configurations for nontrivial sample sizes, in practice, the number of mutational configurations still explodes catastrophically for *n* > 5 (Lohse *et al.* 2015). By combining branches with the same number of descendants, the bSFS scheme introduces a further simplification that substantially decreases the dimensionality of mutational configurations and is a tractable compromise between information and complexity. It still exploits the information contained in the joint distribution of closely linked SNPs (Figure 5) and, since topologies are defined by the presence or absence of mutation categories, also retains topology information (Figure 2). On the other hand, the bSFS simplification avoids unnecessary computational complexity and removes the need for phase information. Because our likelihood calculations for the bSFS assume no intralocus recombination, they are restricted to relatively short blocks, which rarely contain sufficient mutational information to fully resolve all branches anyway.

Runs of homozygosity contain valuable information about demography (*e.g.*, Harris and Nielsen 2013) and some of the power of the blockwise schemes undoubtedly stems from the fact that some of this information is included in the frequency of monomorphic blocks that are almost always the most probable configuration (Figure 5). For example, for very recent, strong bottlenecks ( and ), >50% of blocks are monomorphic for whereas under the null model of constant size only 20% of blocks are monomorphic. This is despite the fact that we have fixed the block length to correspond to one pairwise difference per block on average in both cases.

Although we have focused on bottleneck histories because they are biologically interesting and analytically tractable, we emphasize that the bSFS simplification can be used to summarize data and compute likelihoods for any sampling scheme and demographic model. Just like the SFS, the bSFS also extends to multiple populations (Lohse *et al.* 2015). For larger samples and/or more complex models, for which analytic likelihood calculations become intractable, the bSFS could form the basis for alternative inference methods. In particular, it should be possible to approximate the bSFS using simulations and devise a composite-likelihood inference method analogous to that of Excoffier *et al.* (2013).

### When can we expect to be able to detect bottlenecks?

Our power analyses indicate that young and strong bottlenecks are easiest to identify. This makes intuitive sense given that these have the strongest effect on genealogies (greatest compression of tree length). Looking backward in time, the older the bottleneck, the more likely it is that lineages have already coalesced by the time the bottleneck “starts.” Even very large samples are expected to coalesce to two lineages within generations (Tajima 1983; Nordborg 1998), suggesting that there is little information to infer bottlenecks approaching this age or older (as we find, Figure 3) and that larger samples of individuals add only information about recent bottlenecks (see also Terhorst and Song 2015).

A potential issue of our power analyses is that we have adjusted the length of sequence blocks (*i.e.*, *θ*) to a fixed average number of segregating sites per block irrespective of the severity of the bottleneck. This ignores the fact that genomes have a finite length and so overestimates the power to infer very strong and/or recent bottlenecks. However, even after correcting for this by normalizing by a factor, the power to detect bottlenecks still remains highest for recent and strong bottlenecks (Figure S2). In the limit of infinitely young and strong bottlenecks and assuming a finite genome, it would of course be impossible to sample enough independent blocks for inference. However, we can still detect surprisingly recent bottlenecks: *e.g.*, assuming (2000 years ago for pigs), and a typical pig and *μ*, blocks would need to be ∼2.5 kb long to contain one pairwise difference on average, and only nine such blocks would be sufficient to reject a null model of no population size change ( = 0.34 for a single locus, nine loci would give a signficant likelihood-ratio test with 2 d.f., and *α* = 0.05).

Our finding that local sampling of a structured population mimics a bottlenecked population mirrors other recent analyses that conclude that population structure and demography can be hard to distinguish (*e.g.*, Nielsen and Beaumont 2009). However, we identified an excess of singletons in samples from structured compared to bottlenecked populations that could, in principle, be exploited to distinguish the two. Conversely, and in agreement with others (Wakeley 1999; Städler *et al.* 2009; Chikhi *et al.* 2010; Leblois *et al.* 2014; Mazet *et al.* 2015), we find that scattered sampling (one individual per deme) minimizes the problem of confounding size change and structure.

Finally, we show that for realistic recombination rates, blockwise inferences are robust to intralocus recombination (Figure 7). If the ratio of recombination to mutation rate is high, is overestimated, underestimated, and there is less power to detect a bottleneck overall. Because recombination within blocks generates a mosaic of correlated genealogies, this is expected, given the decrease in variance in branch lengths across loci (Wall 2003).

### Application to island pigs and comparison with the PSMC

The instantaneous bottleneck (but constant ) we consider is, in a sense, the inverse of the scenario assumed by the PSMC and other piecewise models that reconstructs demography as a trajectory of changing Given these opposing assumptions, it is reassuring how well the bottleneck time we infer for *S. cebifrons* ( KY) agrees with the PSMC analysis of the same data set (Figure 8A). In both cases, the inferred reduction in population size occurred long after the species divergence between *S. cebifrons* and *S. barbatus* ∼300 KYA (Frantz *et al.* 2013) and so was not associated with the initial colonization of the Philippines by the ancestor of *S. cebifrons*. This is not to say that we can rule out any colonization bottleneck, because most of the signal of older bottlenecks would have been erased by the more recent demographic event we infer. It is likely that the inferred bottleneck coincides with the colonization of the central/western islands of Panay, Cebu, and Negros by *S. cebifrons*. This would support the findings of Lucchini *et al.* (2005) that this species is genetically and morphologically distinct from its sister species *S. philippensis* that is found all over the eastern and northern Philippines. Unfortunately, genomic data do not exist for *S. philippensis* to validate this interpretation. Finally, our result is not consistent with a bottleneck due to anthropogenic activities (*i.e.*, it is too old). This does not mean that this critically endangered species has been unaffected by human activities but rather suggests that its genetic diversity was already reduced prior to any human contact and supports current efforts for careful management of zoo populations (Bosse *et al.* 2015).

We found that the PSMC is unable to reconstruct abrupt changes in (Figure S8): for example, instantaneous bottlenecks 2000 or 20,000 years ago are reconstructed as a gradual dip in over a period of 5000 or 50,000 years, respectively. The PSMC also artificially infers an increase in prior to the bottleneck. This increase in was also apparent when applying the PSMC to data simulated under a range of bottleneck scenarios including the history we inferred for the pigs (Figure 8B). We therefore recommend caution when interpreting PSMC plots.

### Conclusion

We have developed a method for inferring demographic size change from short sequence blocks for small numbers of samples. We show that the new method has higher power than inferences based on the genome-wide SFS for much larger samples. The central idea of the new framework is to summarize mutational configurations as blockwise site frequency spectra. This both results in little loss of information and enables the analysis of nontrivial samples of individuals and unphased data. We stress that the bSFS scheme could be used for inference under any model of population history. For example, one could use blockwise data to fit scenarios of population expansion, serial bottlenecks, or more general models of multiple merger coalescence (Coop and Ralph 2012). We also envisage extensions of this framework to explicitly compare the demographic histories of multiple, codistributed species.

## Acknowledgments

We thank Nick Barton, Mike Hickerson, two reviewers, and associate editor Noah Rosenberg for comments on earlier versions of this manuscript and Martien Groenen and Hendrik-Jan Megens for access to the genome sequences of *Sus cebifrons* ahead of publication. This work was supported by funding from the United Kingdom (UK) Natural Environment Research Council (NERC) to Graham Stone and K.L. (NE/J010499/1) and a UK NERC fellowship to K.L. (NE/I020288/1).

## Appendix: Calculating the Site Frequency Spectrum

We denote the time during which there are *p* lineages present, Griffiths and Tavaré (1998, equation 1.3) have shown that for any model in which lineages are statistically exchangeable the expected site frequency can be computed from by considering the probability that branches that exist during have *i* descendants:

We denote the vector of coalescence intervals as

We can write the GF of as a sum over all possible ways that a sample can enter and exit the bottleneck and distinguish only the branches (or rather the corresponding *ω* variables) associated with each interval : (A2)Here, and are the numbers of ancestral lineages that enter and exit the bottleneck at respectively; *n* is the number of sampled lineages; and and give the GF before, during, and after the bottleneck, as in Equation 1. Multiplying by and inverting with respect to and gives the GF under the bottleneck model, where event times are discrete. is obtained from this as (A3)The SFS can be obtained by inserting into Equation A1 above.

## Footnotes

*Communicating editor: N. A. Rosenberg*Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.179861/-/DC1.

- Received June 24, 2015.
- Accepted September 1, 2015.

- Copyright © 2015 Bunnefeld, Frantz, and Lohse

Available freely online through the author-supported open access option.