Abstract
The advent of the genomic era has necessitated the development of methods capable of analyzing large volumes of genomic data efficiently. Being able to reliably identify bottlenecks—extreme population size changes of short duration—not only is interesting in the context of speciation and extinction but also matters (as a null model) when inferring selection. Bottlenecks can be detected in polymorphism data via their distorting effect on the shape of the underlying genealogy. Here, we use the generating function of genealogies to derive the probability of mutational configurations in short sequence blocks under a simple bottleneck model. Given a large number of nonrecombining blocks, we can compute maximum-likelihood estimates of the time and strength of the bottleneck. Our method relies on a simple summary of the joint distribution of polymorphic sites. We extend the site frequency spectrum by counting mutations in frequency classes in short sequence blocks. Using linkage information over short distances in this way gives greater power to detect bottlenecks than the site frequency spectrum and potentially opens up a wide range of demographic histories to blockwise inference. Finally, we apply our method to genomic data from a species of pig (Sus cebifrons) endemic to islands in the center and west of the Philippines to estimate whether a bottleneck occurred upon island colonization and compare our scheme to Li and Durbin’s pairwise sequentially Markovian coalescent (PSMC) both for the pig data and using simulations.
MUCH can be learned about the demographic history of a species or population from sequence variation. Bottlenecks—short periods where the population size is drastically reduced before recovering in size—are commonly detected demographic events. Because bottlenecks are often associated with range shifts caused by Pleistocene fluctuations in climate (e.g., Moura et al. 2014), they are of particular interest to researchers studying contemporary biogeographic patterns. Bottlenecks lead to a sudden, strong increase in the rate of coalescence that may have different effects on the shape of genealogies. A strong bottleneck may trap all lineages, leading to a star-shaped genealogy. Alternatively, some lineages may “escape” weaker bottlenecks leading to longer basal branches, each connected to a clade of lineages that coalesced during the bottleneck (Figure 1). Thus, strong bottlenecks lead to an excess of rare variants, while moderate bottlenecks produce an excess of intermediate-frequency variants. These opposing effects make it difficult to reliably diagnose bottlenecks from simple summary statistics of nucleotide diversity such as Fu’s D (Fu and Li 1993) and Tajima’s D (Tajima 1989). The problem is exacerbated because signals of bottlenecks can be confounded by other population processes such as geographic structure or selective sweeps (e.g., Nielsen and Beaumont 2009).
The effect of a bottleneck on a genealogy. (Left) Genealogy for a population of constant size. (Center) Genealogy under a weak bottleneck at time with four surviving lineages. (Right) Genealogy under a strong bottleneck at
with only two surviving lineages. (A) We assume an instantaneous bottleneck that produces a sudden burst of coalescence. (B) Bottleneck strength is measured by
the time necessary for the same amount of coalescence in a population of constant size.
Given the stochasticity of the coalescent in general and the similarity in signal between selective sweeps and bottlenecks at a single locus (e.g., Galtier et al. 2000), integrating information across multiple loci is imperative to robust demographic inference. For example, Li and Durbin’s (2011) pairwise sequentially Markovian coalescent (PSMC) method models transitions in the times to the most recent common ancestor along an individual genome and from this infers a trajectory of population size changes. The PSMC relies on an approximation to the coalescent and, because it makes use of long-range linkage information, requires fairly well-assembled reference genomes.
At the other extreme, a number of popular methods of demographic inference are based on the site frequency spectrum (SFS)—the number of sites in a sample of n sequences with k copies of the derived allele (e.g., Gutenkunst et al. 2009). The SFS throws away all linkage information and therefore results in a drastic loss of information. Although Bhaskar and Song (2014) showed that for the piecewise constant models of population size change, the SFS is a sufficient statistic given enough data, Terhorst and Song (2015) show that the error of SFS-based estimates converges at rate where s is the number of segregating sites. Both articles emphasize that this could be remedied by incorporating linkage information.
An alternative set of methods bases inference on many short loci (blocks of sequence) without requiring the long-range linkage information necessary for the PSMC. Considering linked sites within sequence “blocks” exploits the demographic information contained in the distribution of genealogical branches while still avoiding the need to model the ancestral recombination graph. This class of methods assumes that intrablock recombination can be ignored (Yang 2002; Hey and Nielsen 2004). Using short-read sequencing technologies, data sets containing large numbers of short blocks can easily be obtained for any organism, for example in the form of low-coverage fragmented genome assemblies, restriction site associated DNA (RAD), or transcriptome data (e.g., Davey et al. 2011; McCormack et al. 2013; Hearn et al. 2014).
Using outgroup information to polarize mutations and assuming an infinite-sites mutation model, polymorphic sites in a nonrecombining alignment can be summarized as counts of mutations on each possible genealogical branch without loss of information. Each possible combination of mutation counts is a unique mutational configuration. Lohse et al. (2011) showed how the generating function (GF) of genealogies can be used to derive the probability of mutational configurations for a large range of demographic models. The probability of a mutational configuration is obtained by taking successive derivatives of the GF with respect to all relevant “dummy” variables, each corresponding to a different branch of the genealogy. This gives an efficient, maximum-likelihood scheme for estimating model parameters and comparing models from arbitrarily many sequence blocks. Although such likelihood calculations have been used to fit models of divergence and admixture from triplet samples (Hearn et al. 2014; Lohse and Frantz 2014), they fail for large numbers of samples (n > 4) because both derivation of the GF and the sheer number of possible mutational configurations become unmanageable (Lohse et al. 2015). Our main motivation for the present study was to develop a simple and general summary of blockwise data that (i) removes the need for phase information and (ii) captures short-range linkage information, allowing more powerful inferences than the SFS.
We first briefly describe the GF of genealogies under a bottleneck model and outline how it can be automatically generated and solved (using Mathematica). We then consider a new summary of blockwise data that is an extension of the SFS and summarizes polymorphic sites within blocks as counts of singletons, doubletons, etc. (i.e., we lump branches with the same number of descendants, in effect generating an SFS for each block). We compare the power of this new scheme to both the likelihood-based analysis for full mutational configurations (for small samples) and the genome-wide SFS and use simulations to investigate the sensitivity of the new method to intrablock recombination and population structure. Finally, we apply our method to an example data set of Visayan warty pig (Sus cebifrons) genomes from the Phillippines and compare our inferences with PSMC analyses on the same data.
Materials and Methods
Outline of the model
We consider a sample of n lineages from a diploid, panmictic population with a current effective size We follow Galtier et al. (2000) in assuming a simple model of an instantaneous bottleneck that can be characterized by two parameters: a bottleneck start time,
and a strength parameter,
(Figure 1). We assume that the bottleneck is instantaneous, so that no mutations occur during it and its only effect is a sudden burst in coalescence that is measured by an imaginary time
Both
and
are scaled in
generations.
Note that this two-parameter model is simpler than bottleneck models that consider step changes in in real time (which requires at least three parameters, e.g., Marth et al. 2004). Our motivation for assuming an instantaneous bottleneck is threefold. First, bottleneck duration and intensity are often confounded in practice; i.e., it is generally hard to distinguish weak and long from short and strong bottlenecks (Marth et al. 2004). Second, an instantaneous bottleneck captures drastic events, e.g., the colonization of new areas by a small founder population. Finally, instantaneous bottleneck histories extend to more general models of coalescence in which multiple mergers occur as a continuous process rather than an instantaneous event (e.g., Barton et al. 2010; Coop and Ralph 2012).
We can describe the history of a sample of n lineages as a Markov process: going backward in time, pairs of lineages coalesce until when there is a sudden burst of coalescence due to the bottleneck. Within the bottleneck, coalescence proceeds normally, but since we ban mutations, there is no growth in genealogical branches during
(Figure 1).
can be thought of as the time necessary for the same amount of coalescence had the population size not changed.
We apply the general recursion for the GF of genealogical branches developed by Lohse et al. (2011, equation 4) to the bottleneck model. We denote the vector of all possible branches and label branches by the individuals they are connected to. The GF of the distribution of branch lengths
is defined as
where
are dummy variables corresponding to
For the bottleneck model, we need only to track coalescence events and transitions between three phases: before (1), during (2), and after the bottleneck (3). Analogous to derivations of the GF for histories involving discrete splits or admixture between populations (Lohse et al. 2011), we first write down a GF for a model in which transitions between phases (
and
) are exponentially distributed with rates
and
(Equation 1). The GF recursion for each phase is identical apart from the Λ terms specifying the rates at which lineages enter the next phase and the fact that mutations (and hence ω terms) are banned during the bottleneck (
). We denote the GF of interest (where event times are discrete) as
and note that
can be obtained by multiplying
by
and inverting once for each Λ parameter. Looking backward in time, Ω denotes the configuration of the sample before the first coalescence event i and
the configuration after it,
(1)where
and
is the sum of the
that increase during that interval. For the first event, these correspond to the terminal branches; i.e.,
We have automated Equation 1 in Mathematica.
Calculating likelihoods from the GF
The general method for computing the probability of observing a particular mutational configuration, which is defined as counts of mutations on
and can be interpreted as the likelihood, has been described in detail previously (Lohse et al. 2011, equation 1) and involves taking higher-order derivatives of the GF of genealogical branches. We assume a mutation rate per branch of
and tabulate exact probabilities only up to a maximum number of mutations
per branch. Probabilities for configurations involving
mutations per branch are combined to avoid having to distinguish very unlikely configurations.
Although our automation allows us to generate the GF for the bottleneck model for rather large samples of individuals (n), inverting (with respect to Λ) and solving the GF are slow for We employ a set of combinatorial strategies (detailed in Lohse et al. 2015) to speed up these steps (and see Figure 2). In particular, we make use of the fact that lineages are exchangeable. This means that the GF (and the likelihood) can be written as a sum over unlabeled tree shapes (sensu Felsenstein 1978, 2003) that define equivalence classes of identically distributed genealogies. For each tree shape, we can condition on a single arbitrary labeling of individuals by setting terms in the GF that are incompatible with it to zero. The full GF can be written as a sum of the GFs for the set of equivalence class representatives, each weighted by the size of its class (the number of ways the sample labels can be permuted on the tree shape, Figure 2). Second, we use Mathematica to successively solve the GF for increasingly large sample configurations, starting with terms for pairs of lineages, inserting this solution into the GF for
and so on.
Combinatorial strategies to speed up likelihood calculations. For there are three possible unlabeled topologies or tree shapes (left, center, and right). Because samples from the same population are exchangeable, the generating function can be written as a sum over unlabeled tree shapes (each defining an equivalence class of identically distributed genealogies). The size of each class (given above each shape) depends on the number of ways the sample labels can be permuted. The blockwise SFS introduces a further simplification (bottom) by lumping all branches with the same number of descendants (shaded line, singleton; shaded dashed line, doubleton; solid dashed line, tripleton; solid line, quadrupleton).
Blockwise SFS simplification
The number of possible mutational configurations quickly becomes unmanageable as the number of genealogical branches increases: distinguishing mutations on each of the branches and allowing for a maximum of
mutations per branch, there are
mutational configurations for each equivalence class (
stems from the configurations involving 0 and
mutations per branch).
We introduce a simple summary of blockwise data—which we term the blockwise SFS (bSFS)—to address this issue. Instead of distinguishing mutations on all genealogical branches, we combine branches (and their corresponding ω variables) with the same number of descendants so that the vector of branch lengths, now distinguishes only singleton, doubleton branches, etc. (i.e.,
). In data terms, we reduce the mutational configuration of a block
to
corresponding to counts of singletons, doubletons, etc. in each block. Note also that the genome-wide expected SFS (
) is given by
The bSFS simplification has two advantages: first, it reduces the number of branches and hence mutational configurations substantially. Because we distinguish only types of mutations, the number of mutational configurations goes down from
to
For example, with
and
there are 390,625 configurations in the full scheme, but only 625 defined via the bSFS. Second, the bSFS simplification does not require phased data because we no longer distinguish in which sampled lineage a mutation occurred. However, despite these simplifications, the GF and hence the probabilities of
can be computed analytically only for rather limited sample sizes (
).
Many current methods for detecting population size changes are based on the SFS (e.g., Polanski and Kimmel 2003; Marth et al. 2004; Gutenkunst et al. 2009; Chen 2012). Because the bSFS considers the joint distribution of linked polymorphisms, it is, all else being equal, more powerful than the SFS. However, given that analytic likelihood calculations under the bSFS are feasible only for small n, it is important to compare their power to that of the SFS for larger n. In the Appendix we obtain the SFS () under the bottleneck model for any n, a result that we anticipate will be useful for researchers with access exclusively to single-nucleotide polymorphism (SNP) data.
Power analyses
We measured power by calculating the expected difference in support () between the true model and a null model of constant population size for three schemes: (i) the full information for
or 3 (
= 3), (ii) the bSFS scheme for
(unfolded,
= 3, but 6 for the singleton category; and folded,
= 6), and (iii) the SFS for
The full scheme for
could be considered equivalent to the PSMC in the case where a continuous genome is not available.
Throughout, our choice of parameters was motivated by the pig data example, including genome size ( Gb),
(
), and mutation rate (
per base and generation). We explore a broad parameter space, split into four sets: we alter either
or
from 0.2 to 1.2, while fixing the other parameter to 0.2 or 0.8. This covers a broad range of bottleneck strengths and ages. For example, for pigs, it would include bottlenecks between 20,000 and 120,000 years ago. We additionally investigate the power to detect more recent (
) and very strong (
) bottlenecks that have a high probability of “trapping” all lineages.
Given the assumption of a constant μ per site, the scaled mutation rate, can be thought of as the length of blocks. Since we generally have no independent knowledge of θ in practice, we conditioned on
the expected number of segregating sites per block, and adjusted θ to correspond to
per block for a pairwise sample. Because the expected total length of the genealogy decreases with increasing
and decreasing
younger and stronger bottlenecks correspond to larger θ, i.e., longer blocks (Supporting Information, Table S1;). Note that our assumption of no intralocus recombination is unaffected by conditioning on
because both S and the number of recombination events in a block depend on the total length of the genealogy.
We quantified the accuracy to estimate a particular parameter, using Fisher information (Edwards 1972; Lohse and Frantz 2014). This measures the sharpness of the curve near the maximum and shows a linear relationship with the number of loci. Assuming parameter estimates are away from the boundaries, the inverse of Fisher information gives a lower bound on the expected variance of parameter estimates (Rao 1945). We used this to calculate the expected standard deviation
for
and
across parameter space (and checked these using simulations).
Robustness to population structure and recombination
Previous studies have shown that simple summary statistics (Tajima’s D, Fu and Li’s D, etc.) cannot distinguish between demographic size changes and population structure (Nielsen and Beaumont 2009; Städler et al. 2009; Chikhi et al. 2010). In particular, intermediate migration rates and local sampling, i.e., when all samples come from a single deme that is part of a much larger metapopulation, have very similar effects on these statistics as demographic size changes. This can be understood considering the separation of timescales in the structured coalescent (Wakeley 1998). Looking backward in time, during the initial “scattering” phase, lineages in the same deme either coalesce or migrate to unsampled demes. During the “collecting” phase, coalescence is much slower because lineages first need to migrate back into an occupied deme before they can coalesce. Thus, a genealogy of a sample from a single deme may look like a bottlenecked genealogy (rapid initial coalescence “during the bottleneck,” followed by slow coalescence of the remaining lineages that “survived the bottleneck”). We explored the sensitivity of our method to population structure, using simulated data.
We used ms (Hudson 2002) to simulate data under an island model with symmetric migration at rate We simulated samples of four individuals in a metapopulation of 10 equally sized demes (d). Sampling was either local (all samples from the same deme) or scattered (all from a different deme). We chose a single metapopulation
(corresponding to a per deme θ of 0.1).
We also used simulations to assess whether ignoring intralocus recombination biases parameter values. We simulated data for assuming a bottleneck at time
with strength
and a range of recombination rates: 0, 0.025, 0.125, 0.25, 1.25, 2.5, 5, 12.5, or 25
crossovers per generation per base pair. Assuming
this corresponds to
= 0, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10. These ratios fall either side of the recombination estimates for mammals of 1 cM/Mb; i.e.,
(Tortereau et al. 2012).
For each parameter combination, we simulated 1,000,000 loci to obtain the expected frequencies of mutational configurations for the bSFS scheme. We maximized the under both a null model of constant population size and a bottleneck history and calculated
between the two scenarios.
Application to island pigs
We analyzed genomic data from three Visayan warty pigs, S. cebifrons, a species endemic to the Philippines. Each individual genome was sequenced to 10× coverage and aligned to the S. scrofa reference genome (Ssc10.2) (Groenen et al. 2012; Frantz et al. 2013, 2014; Bosse et al. 2015). We divided the reference genome of S. scrofa into nonoverlapping 1000-bp blocks. To ensure enough coverage to call all heterozygous sites in each block and to remove possible copy number variants (Paudel et al. 2013), we filtered out blocks with an average coverage <7× or higher than twice the genome-wide average, using the pileup format in SAMtools v0.1.12 (Li et al. 2009). Clusters of two or more SNPs in a 10-bp window were filtered out as well as SNPs within 3 bp of an indel. We removed blocks for which <90% of sites were covered and excluded sites with a coverage <4× (Gronau et al. 2011). Finally, we selected only blocks that passed the above filtering criteria in all samples.
This procedure left 1,103,026 aligned 1000-bp blocks (∼50% of the genome, with an average of 2.75 mutations per block). Given the mean divergence (1.8%) of the only usable outgroup, the African common warthog (Phacochoerus africanus), we used the folded bSFS for five samples (combining singletons with quadrupletons and doubletons with tripletons) to avoid biases due to mispolarization. To make use of all six alleles (two per individual), we averaged the counts of mutational configurations across all sample combinations, i.e., after omitting one allele from each individual in turn. We used all blocks to obtain point estimates and computed confidence intervals of parameters and between models, using a simple correction: linkage between blocks, measured by the pairwise correlation coefficient in the number of segregating sites between blocks, dropped below 0.05 at a distance of 404 blocks (Figure S1) so we rescaled
by a factor of 1/404. Finally, we fitted the bottleneck model to the folded genome-wide SFS for all six samples (using the same correction to rescale
).
We conducted a PSMC analysis (Li and Durbin 2011) on the same three individuals, using the parameters
and 64 time intervals, and performed 100 bootstrap replicates for each sample. To calibrate parameter estimates, we assumed a generation time of 5 years and a mutation rate of
per base and generation (see Groenen et al. 2012; Frantz et al. 2013). Finally, we compared our method to the PSMC using simulations. We used Heng Li’s modification of msHOT (https://github.com/lh3/foreign/tree/master/msHOT-lite) and ms to PSMC parser (https://github.com/lh3/psmc/blob/master/utils/ms2psmcfa.pl) to simulate 100 diploid sequences of 10 Mb under a variety of bottleneck parameters with an
ratio of 0.4,
and
Data availability
The pig data are available at the European Nucleotide Archive (http://www.ebi.ac.uk/ena/, accession no. PRJEB9326). A Mathematica notebook to fit the bottleneck model to data using the bSFS is available upon request.
Results
Power analyses
Our power analyses highlight several features of the blockwise inference method (Figure 3). First, inferences based on blockwise data have high power to correctly detect a bottleneck history (measured by ) in absolute terms. We plot
for data sets consisting of 2000 unlinked blocks, but because
scales linearly with the number of blocks, the y-axes in Figure 3 can be rescaled for any number of blocks. Unsurprisingly, power decreases with bottleneck age and increases with bottleneck strength (Figure 3). Given data from 2000 unlinked blocks, even very weak bottlenecks would be detectable. For example, for a relatively recent and weak bottleneck (
and
),
for the bSFS scheme for
and
for the full scheme for
which both indicate a significantly (
likelihood-ratio test with 2 d.f.) better fit than a null model of constant size.
The expected difference in support () between the bottleneck model (“Bottleneck”) and a null model of constant population size (“Constrained”) as a function of bottleneck strength (
left) and start time (
right). Plots are based on analytic results for the likelihood and assuming 2000 unlinked sequence blocks with one pairwise difference per block on average. The horizontal lines in the bottom panels indicate significance (at
) in a likelihood-ratio test with 2 d.f.
Second, blockwise likelihood computations are substantially more powerful than the SFS and the relative improvement is particularly large for young bottlenecks. We have highlighted the area of parameter space where the bSFS scheme for is the only scheme that can still detect a bottleneck (Figure 3, bottom). As
increases, the SFS scheme for
overtakes the full scheme for
in terms of power. However, the bSFS scheme for
remains substantially more powerful throughout.
Interestingly, the power to detect bottlenecks with the SFS does not decrease monotonically with respect to the start time, While very recent bottlenecks (
) are easiest to detect, older bottlenecks (
) are easier to detect than intermediate bottlenecks at
(Figure 3, top right). In contrast, for all blockwise schemes, power decreases monotonically with start time.
To compare the different inference schemes in terms of their accuracy of parameter estimates, we calculated the expected standard deviation () of each parameter. The analytical results for
match the empirical SD across replicate simulated data sets and scale with the number of loci (
) as
(Figure 4, Figure S3, and Figure S4). Across schemes,
increases as bottlenecks get older and weaker.
is always less for blockwise schemes, apart from weak, recent bottlenecks (Figure 4, left) where
is large and similar for both the blockwise schemes and the SFS with 20 individuals (the SFS for 4 individuals is even worse).
Expected standard deviation in (= 0.2, left) and
(= 0.8, right) as a function of
and
respectively. As in Figure 3, we assume 2000 unlinked blocks.
To get a sense of the extra information captured by the bSFS, we plotted the probability of mutational configurations (for ) (Figure 5; cf. Figure S5 for the SFS). This highlights two features of the bSFS. First, even for short blocks (
) there are many more mutational configurations than site frequency classes. For example, for a bottleneck of strength
at time
there are 19 configurations with probability
Second, the probabilities of mutational configurations depend on the bottleneck parameters in a nonlinear way.
The probability of mutational configurations defined by the bSFS for as a function (left) of
(
= 0.8) and (right) of
(
= 0.2). The key gives configurations as counts of singletons, doubletons, tripletons, and quadrupletons, respectively; e.g., 1010 denotes blocks containing one singleton and one tripleton. Only configurations with probabilities >0.01 (for at least one point in parameter space) are shown.
The marginal distributions of the total length of each type of n-ton branch also illustrate the complex effects bottlenecks have on genealogies (Figure 6). These can be found from the GF by differentiating with respect to the corresponding and evaluating at
All distributions have a discontinuity at
(Figure 6, dotted line) and at multiples of
corresponding to varying numbers of coalescences occurring within the bottleneck. None of this complexity is captured by the genome-wide SFS that depends only on the mean of each distribution.
The length distributions of each branch type for two recent bottlenecks of different strengths and a constant size model under the blockwise-SFS scheme for five lineages. The dotted line indicates the bottleneck start time ( = 0.2).
Robustness to population structure and recombination
With local sampling, population structure is difficult to distinguish from a bottleneck that always has greater support than a model of constant size (Table 1, top row). This is true for a wide range of migration rates, i.e., –0.5. At the limit of low M, the chance of migration out of the focal deme is minuscule and so the individual deme functions as an isolated, constant size population. At the limit of high M, the whole metapopulation functions as a single panmictic population. Where there is support, we find that estimates for
are very close to zero (see Table S2 for parameter estimates) as expected if the bottleneck mimics the scattering phase of the structured coalescent. Conversely, with scattered sampling, we never observe support for a bottleneck, irrespective of M; i.e.,
is close to zero and θ estimates become very large as M decreases (not shown).

To investigate how well data generated under the island model (local sampling) actually fit a bottleneck history, we computed the expected frequency of mutational configurations under the (mis)inferred bottleneck parameters. Using as a measure of model fit, we compared the bottleneck model and a null model of constant population size. Although a bottleneck history fits data simulated under population structure (with intermediate M) better than a null model, the fit is poor in absolute terms. In particular, we find that more singleton mutations are seen in the data than predicted under a bottleneck history (Figure S6). Thus, it may in principle be possible to distinguish the two models.
We find that with realistic levels of recombination, parameter estimates under the bSFS are essentially unbiased. For
is increasingly overestimated, while
θ, and
between the bottleneck and the null model are underestimated. SD of parameter estimates using 2000 blocks is low for
< 1 (Figure S7). These results are encouraging and suggest that inferences based on the bSFS are robust to realistic levels of recombination.
Application to island pigs and comparison with the PSMC
A bottleneck history fitted the S. cebifrons data significantly better than a model of constant using the folded bSFS for
(ΔlnL = 110 after correcting for linkage). We inferred a strong bottleneck (
) ∼141 KYA and an ancestral
of
While both T parameter estimates have narrow confidence intervals even after correction for linkage between blocks, there is more information about the time (
) than the strength (
) of the bottleneck (Table 2). Parameter estimates from the genome-wide SFS (for
) agree, but have much wider confidence intervals (Table 2).
Our estimate of also broadly coincides with the period of the lowest population size in the PSMC analyses of the three S. cebifrons genomes (Figure 8A). Applying the PSMC to data simulated under the bottleneck history inferred for the pigs (Table 2) gives a shorter period of reduced population size than the empirical plot (Figure 8A vs. Figure 8B). The PSMC plot generated from the simulated data also does not include the apparent difference in population size seen in the empirical plot; i.e.,
after the bottleneck (looking backward in time) is approximately twofold greater than before. Applying the PSMC to data simulated under an instantaneous bottleneck produces a trajectory of gradual decline and rise of
(Figure 8B). This is true for a range of bottleneck times and strengths (Figure S8). Reassuringly, the period of smallest
coincides with the time of the bottleneck. An unexpected feature of all PSMC plots from simulated data is that they show a marked increase of
prior (looking backward in time) to the bottleneck start time (Figure 8B and Figure S8).
Discussion
Researchers have long been interested in using genetic data to infer population bottlenecks associated with landmark events in the history of species and populations, for example, the colonization of new regions or range contractions (reviewed in Gattepaille et al. 2013). Indeed, the demographic history of model organisms such as humans and Drosophila has been characterized as a series of expansions and contractions that coincide with the spread of populations across the globe (e.g., Haddrill et al. 2005; Voight et al. 2005; Sjödin et al. 2012). However, the genetic resources for model organisms far exceed those available for most species. Thus there is a need for inference methods that do not require high-quality reference genomes, phase information, and/or large samples of individuals. Maximum-likelihood estimation based on the blockwise SFS makes optimal use of the information contained in short blocks of sequence from just a handful of individuals—data that can readily be obtained for any organism with current short-read sequencing technology. Furthermore, and in contrast to the PSMC framework (Li and Durbin 2011) that qualitatively documents changes in population size, our method explicitly assesses the statistical support for a bottleneck model over a null model (see Table S3 for the pros and cons of alternative demographic inference methods).
Power of the blockwise SFS
We find that our blockwise method has higher power to detect bottlenecks (Figure 3 and Figure 4) than the SFS. Perhaps surprisingly, the bSFS scheme for 5 lineages almost always contained more information about past bottlenecks than the genome-wide SFS for 20 lineages, a more typical sample size for SFS-based analyses (see Bhaskar and Song 2014, for specific bounds on sample sizes under various piecewise population size change models). Although our strategy of exploiting the symmetries of the coalescent by partitioning the GF of branch lengths into a sum over unlabeled tree shapes makes it possible to compute the probability of full mutational configurations for nontrivial sample sizes, in practice, the number of mutational configurations still explodes catastrophically for n > 5 (Lohse et al. 2015). By combining branches with the same number of descendants, the bSFS scheme introduces a further simplification that substantially decreases the dimensionality of mutational configurations and is a tractable compromise between information and complexity. It still exploits the information contained in the joint distribution of closely linked SNPs (Figure 5) and, since topologies are defined by the presence or absence of mutation categories, also retains topology information (Figure 2). On the other hand, the bSFS simplification avoids unnecessary computational complexity and removes the need for phase information. Because our likelihood calculations for the bSFS assume no intralocus recombination, they are restricted to relatively short blocks, which rarely contain sufficient mutational information to fully resolve all branches anyway.
Runs of homozygosity contain valuable information about demography (e.g., Harris and Nielsen 2013) and some of the power of the blockwise schemes undoubtedly stems from the fact that some of this information is included in the frequency of monomorphic blocks that are almost always the most probable configuration (Figure 5). For example, for very recent, strong bottlenecks ( and
), >50% of blocks are monomorphic for
whereas under the null model of constant size only 20% of blocks are monomorphic. This is despite the fact that we have fixed the block length to correspond to one pairwise difference per block on average in both cases.
Although we have focused on bottleneck histories because they are biologically interesting and analytically tractable, we emphasize that the bSFS simplification can be used to summarize data and compute likelihoods for any sampling scheme and demographic model. Just like the SFS, the bSFS also extends to multiple populations (Lohse et al. 2015). For larger samples and/or more complex models, for which analytic likelihood calculations become intractable, the bSFS could form the basis for alternative inference methods. In particular, it should be possible to approximate the bSFS using simulations and devise a composite-likelihood inference method analogous to that of Excoffier et al. (2013).
When can we expect to be able to detect bottlenecks?
Our power analyses indicate that young and strong bottlenecks are easiest to identify. This makes intuitive sense given that these have the strongest effect on genealogies (greatest compression of tree length). Looking backward in time, the older the bottleneck, the more likely it is that lineages have already coalesced by the time the bottleneck “starts.” Even very large samples are expected to coalesce to two lineages within generations (Tajima 1983; Nordborg 1998), suggesting that there is little information to infer bottlenecks approaching this age or older (as we find, Figure 3) and that larger samples of individuals add only information about recent bottlenecks (see also Terhorst and Song 2015).
A potential issue of our power analyses is that we have adjusted the length of sequence blocks (i.e., θ) to a fixed average number of segregating sites per block irrespective of the severity of the bottleneck. This ignores the fact that genomes have a finite length and so overestimates the power to infer very strong and/or recent bottlenecks. However, even after correcting for this by normalizing by a factor,
the power to detect bottlenecks still remains highest for recent and strong bottlenecks (Figure S2). In the limit of infinitely young and strong bottlenecks and assuming a finite genome, it would of course be impossible to sample enough independent blocks for inference. However, we can still detect surprisingly recent bottlenecks: e.g., assuming
(2000 years ago for pigs),
and a typical pig
and μ, blocks would need to be ∼2.5 kb long to contain one pairwise difference on average, and only nine such blocks would be sufficient to reject a null model of no population size change (
= 0.34 for a single locus, nine loci would give a signficant likelihood-ratio test with 2 d.f., and α = 0.05).
Our finding that local sampling of a structured population mimics a bottlenecked population mirrors other recent analyses that conclude that population structure and demography can be hard to distinguish (e.g., Nielsen and Beaumont 2009). However, we identified an excess of singletons in samples from structured compared to bottlenecked populations that could, in principle, be exploited to distinguish the two. Conversely, and in agreement with others (Wakeley 1999; Städler et al. 2009; Chikhi et al. 2010; Leblois et al. 2014; Mazet et al. 2015), we find that scattered sampling (one individual per deme) minimizes the problem of confounding size change and structure.
Finally, we show that for realistic recombination rates, blockwise inferences are robust to intralocus recombination (Figure 7). If the ratio of recombination to mutation rate is high, is overestimated,
underestimated, and there is less power to detect a bottleneck overall. Because recombination within blocks generates a mosaic of correlated genealogies, this is expected, given the decrease in variance in branch lengths across loci (Wall 2003).
The effect of recombination () on
and
(top) and on θ and
(assuming 2000 unlinked blocks) between a bottleneck model and a null model (bottom). The horizontal dotted lines indicate the true parameters used for simulations:
and
The vertical dotted line at 0.4 is a reasonable
for mammals (assuming r = 1 ×
and μ = 2.5 ×
). Figure S7 shows the standard deviation of estimates obtained from 100 replicate data sets, each consisting of 2000 blocks.
Application to island pigs and comparison with the PSMC
The instantaneous bottleneck (but constant ) we consider is, in a sense, the inverse of the scenario assumed by the PSMC and other piecewise models that reconstructs demography as a trajectory of changing
Given these opposing assumptions, it is reassuring how well the bottleneck time we infer for S. cebifrons (
KY) agrees with the PSMC analysis of the same data set (Figure 8A). In both cases, the inferred reduction in population size occurred long after the species divergence between S. cebifrons and S. barbatus ∼300 KYA (Frantz et al. 2013) and so was not associated with the initial colonization of the Philippines by the ancestor of S. cebifrons. This is not to say that we can rule out any colonization bottleneck, because most of the signal of older bottlenecks would have been erased by the more recent demographic event we infer. It is likely that the inferred bottleneck coincides with the colonization of the central/western islands of Panay, Cebu, and Negros by S. cebifrons. This would support the findings of Lucchini et al. (2005) that this species is genetically and morphologically distinct from its sister species S. philippensis that is found all over the eastern and northern Philippines. Unfortunately, genomic data do not exist for S. philippensis to validate this interpretation. Finally, our result is not consistent with a bottleneck due to anthropogenic activities (i.e., it is too old). This does not mean that this critically endangered species has been unaffected by human activities but rather suggests that its genetic diversity was already reduced prior to any human contact and supports current efforts for careful management of zoo populations (Bosse et al. 2015).
Comparison of our method with the PSMC. (Left) Shown are PSMC results for the three Sus cebifrons individuals (SCEB01 from Panay and SCEB02 from Negros). Estimates from the full data and 100 bootstrap replicates are depicted by darker and lighter lines, respectively. The gray bars indicate our estimates of and
(plus 95% C.I.) from the folded bSFS for
(Right) Using the inferred pig parameters (Table 2), we simulated 100 diploid sequences of 10 Mb with
We found that the PSMC is unable to reconstruct abrupt changes in (Figure S8): for example, instantaneous bottlenecks 2000 or 20,000 years ago are reconstructed as a gradual dip in
over a period of 5000 or 50,000 years, respectively. The PSMC also artificially infers an increase in
prior to the bottleneck. This increase in
was also apparent when applying the PSMC to data simulated under a range of bottleneck scenarios including the history we inferred for the pigs (Figure 8B). We therefore recommend caution when interpreting PSMC plots.
Conclusion
We have developed a method for inferring demographic size change from short sequence blocks for small numbers of samples. We show that the new method has higher power than inferences based on the genome-wide SFS for much larger samples. The central idea of the new framework is to summarize mutational configurations as blockwise site frequency spectra. This both results in little loss of information and enables the analysis of nontrivial samples of individuals and unphased data. We stress that the bSFS scheme could be used for inference under any model of population history. For example, one could use blockwise data to fit scenarios of population expansion, serial bottlenecks, or more general models of multiple merger coalescence (Coop and Ralph 2012). We also envisage extensions of this framework to explicitly compare the demographic histories of multiple, codistributed species.
Acknowledgments
We thank Nick Barton, Mike Hickerson, two reviewers, and associate editor Noah Rosenberg for comments on earlier versions of this manuscript and Martien Groenen and Hendrik-Jan Megens for access to the genome sequences of Sus cebifrons ahead of publication. This work was supported by funding from the United Kingdom (UK) Natural Environment Research Council (NERC) to Graham Stone and K.L. (NE/J010499/1) and a UK NERC fellowship to K.L. (NE/I020288/1).
Appendix: Calculating the Site Frequency Spectrum
We denote the time during which there are p lineages present, Griffiths and Tavaré (1998, equation 1.3) have shown that for any model in which lineages are statistically exchangeable the expected site frequency
can be computed from
by considering the probability that branches that exist during
have i descendants:

We denote the vector of coalescence intervals as
We can write the GF of as a sum over all possible ways that a sample can enter and exit the bottleneck and distinguish only the branches (or rather the corresponding ω variables) associated with each interval
:
(A2)Here,
and
are the numbers of ancestral lineages that enter and exit the bottleneck at
respectively; n is the number of sampled lineages; and
and
give the GF before, during, and after the bottleneck, as in Equation 1. Multiplying by
and inverting with respect to
and
gives the GF under the bottleneck model,
where event times are discrete.
is obtained from this as
(A3)The SFS can be obtained by inserting
into Equation A1 above.
Footnotes
Communicating editor: N. A. Rosenberg
Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.179861/-/DC1.
- Received June 24, 2015.
- Accepted September 1, 2015.
- Copyright © 2015 Bunnefeld, Frantz, and Lohse
Available freely online through the author-supported open access option.