Abstract
Advances in empirical population genetics have made apparent the need for models that simultaneously account for selection and demography. To address this need, we here study the Wright–Fisher diffusion under selection and variable effective population size. In the case of genic selection and piecewise-constant effective population sizes, we obtain the transition density by extending a recently developed method for computing an accurate spectral representation for a constant population size. Utilizing this extension, we show how to compute the sample frequency spectrum in the presence of genic selection and an arbitrary number of instantaneous changes in the effective population size. We also develop an alternate, efficient algorithm for computing the sample frequency spectrum using a moment-based approach. We apply these methods to answer the following questions: If neutrality is incorrectly assumed when there is selection, what effects does it have on demographic parameter estimation? Can the impact of negative selection be observed in populations that undergo strong exponential growth?
ADVANCES in empirical population genetics have pointed out the need for models that simultaneously account for selection and demography. Studies on samples from various species including humans (e.g., Williamson et al. 2005; Tennessen et al. 2012) and Drosophila melanogaster (Glinka et al. 2003; Duchen et al. 2013) have shown that demographic processes, such as population size changes, shape in large part the patterns of polymorphism among genomes and estimated the impact of selection on top of such underlying neutral conditions. Thus far, most theoretical articles considered selective and demographic forces independently of each other for the sake of simplicity (e.g., Stephan and Li 2007).
Theoretical studies of neutral models of time-varying population size have been accomplished within the diffusion and the coalescent frameworks. Kimura (1955a) derived the transition density of the Wright–Fisher (WF) diffusion with a constant population size that characterizes the neutral evolution of allele frequencies over time. Shortly thereafter, Kimura (1955b) noted how to rescale time to generalize this result to a deterministically changing population size. Nei et al. (1975) derived the average heterozygosity under this general condition by applying a differential equation method, before studies on time-varying population size started to utilize the coalescent. Watterson (1984) derived the probability distribution and the moments of the total number of alleles in a sample using models of one or two sudden changes in population size. Slatkin and Hudson (1991) considered the distribution of pairwise differences in exponentially growing populations, before Griffiths and Tavaré (1994) provided the coalescent for arbitrary deterministic changes in population size. The allele frequency spectrum, which is the distribution of the number of times a mutant allele is observed in a sample of DNA sequences, has been utilized in many theoretical and empirical studies. It can be further distinguished into the allelic spectrum and the sample frequency spectrum (SFS) according to whether absolute or relative frequencies are meant. Fu (1995) derived the first- and second-order moments of the allelic spectrum for a constant population size, which has been generalized to time-varying population size by Griffiths and Tavaré (1998) and Živković and Wiehe (2008). Although deterministic fluctuations in population size are commonly considered for the interpretation of biological data, studies have also examined stochastic changes in population size (e.g., Kaj and Krone 2003).
The mathematical modeling of natural selection is mostly carried out within the diffusion framework, whereas coalescent approaches have proved to be analytically challenging (e.g., Krone and Neuhauser 1997). Fisher (1930) derived the equilibrium solution for the allelic spectrum of a population, which became particularly useful when Sawyer and Hartl (1992) modeled the frequencies of mutant sites via a Poisson random field approach. Kimura (1955c) employed a perturbation approach to obtain a series representation of the transition density that is accurate for scaled selection coefficients smaller than one. However, as noted in Williamson et al. (2005), an appropriate use of this result with respect to the analysis of whole-genome data is even difficult for a constant population size. In a recent article, Song and Steinrücken (2012) devised an efficient method with which to accurately compute the transition density of the WF diffusion with recurrent mutations and general diploid selection. This nonperturbative approach that can be applied to scaled selection coefficients substantially greater than one finds the eigenvalues and the eigenfunctions of the diffusion generator and leads to an explicit spectral representation of the transition density. The results for this biallelic case have been extended to an arbitrary number of alleles by Steinrücken et al. (2013). The process dual to this multiallelic diffusion has been analyzed earlier by Barbour et al. (2000). While providing theoretical insight, their approach does not straightforwardly allow computation of the transition density.
In recent years, several researchers have started to investigate the combined effect of natural selection and demography. The majority of these studies have utilized finite difference schemes to enable tractable computation. Williamson et al. (2005) employed such a scheme to obtain a numerical solution of the SFS for a model with genic selection and one instantaneous population size change. The authors applied this result within a likelihood-based method to infer population growth and purifying selection at nonsynonymous sites across the human genome. Evans et al. (2007) investigated the forward diffusion equation with genic selection and deterministically varying population size and incorporated the effect of point mutations via a suitable boundary condition. They derived a system of ordinary differential equations (ODEs) for the moments of the allelic spectrum, but had to resort to a numerical scheme to make their results applicable. Gutenkunst et al. (2009) considered population substructure and selection to obtain the joint allele frequency spectrum of up to three populations by approximating the associated diffusion equation by a finite difference scheme as well. Lukić and Hey (2012) applied spectral methods that even account for a fourth population in the otherwise same setting as Gutenkunst et al. (2009). Recently, and again with respect to a single population, Zhao et al. (2013) provided a numerical method with which to solve the diffusion equation for random genetic drift that can incorporate the forces of mutation and selection. The authors illustrated the accuracy of their discretization approach by determining the probability of fixation in the presence of selection for both an instantaneous population size change and a linear increase in population size. In general, such methods require an appropriate discretization of grid points, which may depend strongly on the parameters. This makes it difficult, however, to predict if a particular discretization will produce accurate results.
In this study, we use the polynomial approach by Song and Steinrücken (2012) to obtain the transition density for genic selection and instantaneous changes in population size. First, we focus on a single time period during which the population has a different size relative to a fixed reference population size. We compute the eigenvalues and the eigenfunctions of the diffusion operator with respect to the modified drift term of the underlying diffusion equation. Similarly to a constant population size, the eigenfunctions are given as a series of orthogonal functions. The eigenvalues and eigenfunctions facilitate a spectral representation of the transition density describing the change in allele frequencies across this time period. Such transition densities for single time periods can then be folded over various instantaneous population size changes to obtain the overall transition density for such a multi-epoch model with genic selection. After illustrating the applicability of this approach, we derive the SFS by means of the transition density. While the transition density proves useful for the analysis of time-series data that are mostly gathered from species with short generation times as bacteria (e.g., Lenski 2011) but also from species with long generation times (Steinrücken et al. 2014), the SFS can also be applied to whole-genome data collected at a single time point. As an alternative approach to employing the transition density for the SFS, we modify the moment-based approach by Evans et al. (2007) to efficiently compute allele frequency spectra for genic selection, point mutations, and piecewise changes in population size.
We then employ a maximum-likelihood method with which to estimate the demographic and selective parameters of a given bottleneck model. After examining the accuracy of parameter estimation, we discuss how the estimates change when selection is ignored or a simpler demographic model is assumed. We investigate the demography of an African population of D. melanogaster (Duchen et al. 2013), allowing for selection coefficients that either are constant or vary according to a given distribution of fitness effects. Furthermore, we answer another, important question arising in human population genetics (Tennessen et al. 2012): Can the impact of negative selection be observed in populations that undergo strong exponential growth? We investigate how strong selection would have to be to leave a signature in the SFS.
The Transition Density for Genic Selection and Piecewise-Constant Population Sizes with K Epochs
Model and notation
We assume that the diploid effective population size changes deterministically, with denoting the size at time t. Here, time is measured in units of
generations, where
is a fixed reference population size. Unless stated otherwise, the initial population size will be used as the reference population size in the various numerical examples. In the diffusion limit, the relative population size
converges to a scaling function, which we denote by
.
We assume the infinitely-many-sites model (Kimura 1969) with and
denoting the ancestral and derived allelic types, respectively. The relative fitnesses of
and
genotypes over the
genotype are, respectively, given by
and
. The population-scaled selection coefficient is denoted by
. The frequency of the derived allele
at time t is denoted by
. Let f be a twice continuously differentiable, bounded function over
. The backward generator of a time-inhomogeneous one-dimensional WF diffusion process on
is denoted by
, which acts on f as
(1)where the diffusion and drift terms are given by
and
, respectively. While selection operates on a natural time scale as represented by the drift term, changes in population size require an appropriate rescaling of time within the diffusion term. Thus, the relative strength of natural selection and genetic drift is time inhomogeneous. This prohibits classical time-rescaling approaches and introduces considerable challenges in obtaining analytic results. To gain insights, we here focus on the case in which ρ is piecewise constant. In this case, the diffusion and drift terms differ by a constant factor within each piece, thus simplifying the analysis.
Throughout, we assume that ρ has K constant pieces (or epochs) in the time interval . The change points are denoted by
, and for convenience we define
and
. Then, for
, with
, we assume
, where
is some positive constant. For the epoch
, the diffusion term is thus given by
and the corresponding generator is denoted by
. The scale density
(Karlin and Taylor 1981, Chap. 15) for the epoch is given by
while the speed density
is given (up to a constant) by

Given real-valued functions f and g on that satisfy appropriate boundary conditions and are square integrable with respect to some real positive density h, we use
to denote

The transition density within each epoch 
For the epoch , let the transition density be denoted by
, where
,
, and
. Under the initial condition
, the spectral representation of
is given by
(3)where
and
are the eigenvalues and eigenfunctions of
, respectively. That is,
It can be shown that the eigenvalues are all real and nonpositive. Furthermore,
with
as
. The associated eigenfunctions
form an orthogonal basis of
, the space of real-valued functions on
that are square integrable with respect to the speed density
, defined in (2).
Song and Steinrücken (2012) recently developed a method for finding and
in the case of
. We give a brief description of their method and modify it accordingly to incorporate an arbitrary
. Let
denote the diffusion generator under neutrality (i.e.,
). The eigenfunctions of
are modified Gegenbauer polynomials
(cf. Appendix), and the corresponding eigenvalues are
, with
(4)Similar to Song and Steinrücken (2012), define
as
(5)Then,
form an orthogonal system with respect to the weight function
. By directly applying the full generator
to
, we observe that
are not eigenfunctions of
. Instead, we obtain
(6)where
. However, since both
and
are orthogonal with respect to the same weight function
and
form a basis of
, we can represent
as a linear combination of
:
(7)Furthermore, the fact that
is an eigenfunction of
with eigenvalue
implies that
and
satisfy the equation
(8)where
is as defined in (4) and
are known constants that depend on σ and m (cf. Song and Steinrücken 2012 for details).
The transition density expansion (3) can be obtained by numerically solving the eigensystem (8). Denote the infinite-dimensional matrix on the left-hand side of (8) by . The eigenvalues
of
correspond (up to a sign) to the eigenvalues of
, and the associated eigenvectors
of
determine the eigenfunctions of
via (7). Let
denote the
matrix obtained by taking the first D rows and D columns of
, and let
and
denote the eigenvalues and eigenvectors of
, respectively. The truncated eigensystem
(9)can then be used to approximate (8). This finite-dimensional linear system can be easily solved numerically. Since the truncated versions of the eigenvalues and eigenvectors converge rapidly as D increases, an accurate approximation of the transition density (3) can be efficiently obtained. The truncation level D required for convergence is higher when modeling a large population compared to the basic selection model and lower when the population size is small. The reason for this is that the necessary truncation level depends on the effective strength of selection, which is higher in large populations and lower in small populations. Therefore, for a fixed selection coefficient s, large populations are computationally more demanding than small populations. Furthermore, we observed that positive selection coefficients require higher values for D than negative ones.
The transition density for the entire period
with K epochs
Suppose and
. The transition density
for the entire period
is obtained by combining the transition densities for the K epochs as
(10)where
denotes the allele frequency at the change point
. Using (3), we can write (10) as
(11)where
is an infinite-dimensional column vector, while
and
are infinite-dimensional matrices defined as
and
In general,
is not a diagonal matrix since
and
are not orthogonal with respect to
if
. In Appendix, we show that the entry
of
is given by
(12)Note that the last line of (12) does not depend on n or m, so it needs to be computed only once. The overall computational time for evaluating
scales linearly with the number K of epochs.
To better understand the joint impact of selection and demography on the transition density, we consider two scenarios, where is simply denoted as
. Figure 1 illustrates the density in a scenario in which the selection coefficient is fixed and various K-epoch demographic models are considered. In comparison to the case of a constant population size (cf. Figure 1A), an instantaneous expansion (cf. Figure 1B) narrows the distribution around the mean, whereas an additional phase of a reduced population size (cf. Figure 1C) increases the variance relative to a population of a constant size. Figure 2 illustrates the same scenarios with a fixed transition time and varying selection coefficients. Note that all theoretical results and the corresponding applications in this article were implemented in Mathematica. The implementation is available from the authors upon request.
Transition densities for various transition times τ and a fixed selection coefficient . In all cases, we set
and
. (A) A single-epoch model (
), a constant population size with
. (B) A two-epoch model (
), with an instantaneous expansion (
,
). (C) A three-epoch model (
), with a population bottleneck followed by an expansion (
,
,
).
Transition densities for various selection coefficients σ and a fixed transition time . In all cases, we set
and
. (A) A single-epoch model (
), a constant population size with
. (B) A two-epoch model (
), with an instantaneous expansion (
,
). (C) A three-epoch model (
), with a population bottleneck followed by an expansion (
,
,
).
The Sample Frequency Spectrum
The transition density approach
The transition density derived in the previous section can be employed to obtain the SFS of a sample. Consider a sample of size n obtained at time . The probability that the
allele with frequency x at time
is observed b times in the sample is (Griffiths 2003)
(13)For piecewise-constant population size models with K epochs, a spectral representation of
can be found via (11) and evaluating (13) involves computing the integral
. For
, using (2), (5), and (7), we obtain
(14)where
is the confluent hypergeometric function of the first kind. The descending factorials
are defined in Appendix.
The SFS is the probability distribution on the number b of mutant alleles in a sample of size n taken at time τ, conditioned on segregation. For
,
is given by
(15)In (15), the SFS at a single site is obtained by averaging over sample paths. This is equivalent to the frequency spectrum distribution over a large number of independent mutant sites in the Poisson random field model of Sawyer and Hartl (1992). Using (11), (12), (13), and (14), we can approximate (15) numerically. If it is unknown which allele is derived, a folded version of (15) can be obtained as
, where
denotes the Kronecker delta.
A moment-based approach
As detailed above, the transition density can be employed to obtain the SFS. However, the specific solution for the transition density is not required to obtain the less complex and thus computationally less demanding SFS. Here, we utilize the work of Evans et al. (2007) to develop an efficient algorithm for computing the allele frequency spectrum in the case of genic selection and piecewise-constant population sizes.
Suppose mutations arise at rate (per sequence per
generations) and according to the infinitely-many-sites model (Kimura 1969). Evans et al. (2007) use the forward diffusion equation to describe population allele frequency changes and introduce mutations by an appropriate boundary condition. Slightly modifying their notation, we use
to denote the expected number of sites where the mutant allele has a frequency in
, with
, at time t. The forward equation is
(16)where the diffusion term
, the drift term
, and the scaled selection coefficient σ and the population size function
are defined as before. The influx of mutations is incorporated into this process via the boundary conditions
(17)The resulting polymorphic sites follow the dynamics of (16) thereafter. Note that this differs from the diffusion process studied in the previous section, as the influx of mutations is now explicitly modeled.
Again, it is analytically more practical to consider the corresponding backward equation, which is obtained by setting . This substitution transforms the forward equation for
into a backward equation for
, which is essentially given by (1) up to the sign of the drift term. Evans et al. (2007) derived a coupled system of ODEs for the moments
:
(18)
(19)where
. A similar system of ODEs was derived and solved by Kimura (1955a) for a neutral scenario with a constant population size and without mutations. For
, the above system is finite and can be solved explicitly (Živković and Stephan 2011). In the case of selection (
), on the other hand, the system is infinite and obtaining an explicit solution for an arbitrary ρ is a challenging problem, even if the system is truncated by setting
for
.
From now on, assume for
and rewrite the truncated system of ODEs in matrix form as
(20)where
,
,
are D-dimensional column vectors, and
and
are
matrices with entries
for
. The formal solution of (20) cannot be written in terms of a matrix exponential but only as a Peano–Baker series (Baake and Schlägel 2011) for arbitrary ρ, which can be numerically quite demanding. Therefore, we focus on the case of piecewise constant ρ and develop an efficient method to solve the truncated system of ODEs.
We first consider (i.e., a constant population size), for which the solution of (20) takes the form of a matrix exponential given by
(21)Let
, and
, respectively, denote the eigenvalues, row eigenvectors, and column eigenvectors of
. Then, (21) implies
(22)It is intractable to find closed-form expressions of
, and
, but, for a given truncation level D, they can be computed numerically. Depending on the details of the model under consideration, it might be more efficient to solve (21) numerically rather than applying the more analytic form given in (22).
We now investigate the equilibrium solution of (22), since it can be applied as an initial condition in a model in which the population size remains constant over a longer period of time before instantaneous population size changes occur. Assuming that all alleles are monomorphic at time zero, i.e., , and letting
, we obtain the moments at equilibrium as
For D sufficiently large, this result is numerically close to the exact solution
. The latter can also be obtained as follows. The equilibrium population frequency spectrum is given by (Fisher 1930)
(23)The sampled version can be easily found via binomial sampling as in (13):
(24)For
, the moments
of
are given by
where
is the incomplete gamma function.
Now, consider the piecewise-constant model with K epochs in the time interval defined earlier. For
,
(25)which can be solved as in (21). For
,
(26)where
, for
, is recursively given by
The initial condition
is chosen as either the equilibrium solution described above or the zero vector, which corresponds to the case of all loci being monomorphic at time
.
The accuracy of the above framework depends on how fast the truncated moments converge to zero as D increases. Similar to the transition density approach, the truncated moments converge faster for negative than for positive σ, and for instantaneous declines compared to instantaneous expansions. For a large positive σ, a higher truncation level D may be required to achieve the desired accuracy. Finally, the allelic spectrum
, for
, of a sample of size n taken at time τ can be obtained from the moments
by using the relationship
(27)The SFS
at time τ is then given by
(28)Substituting the truncated moments obtained from (26) into (27) provides numerical approximations of (27) and (28).
The joint impact of a population bottleneck and selection on the SFS is illustrated in Figure 3 for various points in time. As expected, negative and positive selection result in a skew of the SFS toward low- and high-frequency derived variants, respectively, when compared to a model without selection, across all sampling times. Moreover, this skew varies in intensity at different points in time. In the neutral demographic model (cf. Figure 3B), the relative frequency of singletons at time is higher than at time
, whereas under the same demographic model with negative selection (cf. Figure 3C), this relation is inverted. This is because the amount of singletons that is caused by demographic forces decreases after the expansion from
to
, while negative selection is still increasing the low-frequency derived classes in this time interval.
(A) The relative population size, , is initially 1 and changes instantaneously to
and 5 at times
and
, respectively. The SFS of a sample of size 20 are plotted for this demography (B) without selection, (C) negative selection of
, and (D) positive selection of
. The times of sampling are illustrated in A and the bars are accordingly displayed from the left to the right. Truncation levels D = 100 and D = 500 were respectively applied for (C) negative and (D) positive selection, while the SFS was explicitly calculated for (B) neutrality.
Applications
Here, we discuss biologically relevant questions that can be addressed using our theoretical framework. This section consists of the following parts:
We first consider models with negative selection and bottlenecks of medium strength at different time points. We examine the SFS under such models and try to estimate the demographic parameters while taking selection into account. We also carry out demographic inference ignoring selection. Whereas the former demonstrates how well the demographic and selective parameters can be estimated jointly, the latter mimics the common practice of assuming genome-wide polymorphic sites as putatively neutral (due to the difficulty of jointly estimating the impact of selection and demography using existing tools). We finally examine the consequences of assuming a too simple underlying demography on parameter estimation.
We then analyze an African sample of D. melanogaster to investigate its demographic history and possible selective effects.
Finally, we examine a model of strong exponential population growth (mimicking human evolution) and superimpose negative selection of various strengths to understand if and when selection can be inferred for such a model.
Throughout, the first population size change will occur after the allele frequencies have reached an equilibrium according to (24).
Joint inference of population bottleneck and purifying selection
A maximum likelihood approach:
Under the assumption that the considered sites are independent, the log-likelihood of a model ℳ given data is
, where
is the observed number of sites at which the derived allele occurs i times in the sample, and
is the probability that the derived allele occurs i times in the sample at a segregating site under model ℳ (e.g., Wooding and Rogers 2002). Recall that
can be obtained via either the transition density or the moment-based approach. The latter is preferable here, since the transition density is not explicitly required.
Consider the bottleneck model illustrated in Figure 4. Note that the present relative size is fixed to 1; i.e., here the present population size is used as the reference population size
. First, we consider the scenario where the ancestral population size
prior to the bottleneck is allowed to vary. In this case, the model has five free parameters:
, the initial population size;
, the population size during the bottleneck;
, the duration of the bottleneck;
, the time since recovery from the bottleneck; and σ, the scaled selection coefficient. We then also consider the scenario where the ancestral population size is the same as the present population size, i.e.,
, resulting in a model with four free parameters.
The population is constant in size before being instantaneously changed to relative size at time zero. Then, another jump to relative population size
follows at time
, before a sample is taken at time
.
We adopted a grid search in our estimation procedure, with and
. For the five-parameter model,
was chosen from the range
. In total, 110,000 grid points were chosen in the selected case and 10,000 in the neutral case. Note that the grid search also accounts for models of one or two successive instantaneous population expansions. For the four-parameter model, 11,000 grid points were chosen in the selected case and 1000 in the neutral case. The grid points are summarized in Table 1.
Estimation of bottleneck and selection parameters:
We first evaluated the SFS for a sample of size in the following 12 scenarios, all with
and
:
constant population size (i.e.,
).
bottleneck models with
,
, and
.
First, to test how well the demographic and selective parameters can be estimated jointly from sampled data, we focused on the bottleneck demography with and considered two scenarios: The neutral case (
) and the selected case with
. To mimic the limited availability of independent polymorphic sites across the genome, we sampled 10,000 sites according to the SFS for the two chosen scenarios and repeated this procedure 200 times. For each of these 200 data sets, we maximized the log-likelihood over the grid of parameter values described earlier, assuming (A1) neutrality when the true model has
, (A2) neutrality when the true model has
, (A3) presence of selection when the true model has
, and (A4) presence of selection when the true model has
.
The estimated parameters are shown in Table 2. For inference under correct model assumptions (A1 and A3), the median estimates are equal to the true parameters. When selection is ignored although present in the data set (A2), the ancestral population size () and the duration of the bottleneck (
) are underestimated, whereas the bottleneck size (
) and the time since the bottleneck (
) are accurately estimated. When the true model is neutral but the inference procedure allows for selection (A4), a neutral demographic model is accurately inferred. We calculated likelihood-ratio statistics for each of the 200 data sets to compare the two nested models of selection and neutrality. The null hypothesis of neutrality can be rejected at the
significance level with a power of
.
We further analyzed all 12 scenarios using the expected SFS directly, assuming that the amount of data are sufficiently large such that the observed SFS closely approximates the expected value. Our goal in this case is to study the effect of model misspecification on parameter estimation; specifically, assuming selection when the true model is neutral or assuming neutrality when there is selection. In the former case, the maximum likelihood estimates (MLEs) always coincided with the true parameters. Therefore, it is useful to allow for selection in an analysis even when putatively neutral regions are considered. In the latter case, our results are summarized in Table 3. For a constant population size, two rather old instantaneous expansions are estimated. For the bottleneck models, ignoring selection leads to the largest errors for the most recent bottleneck and and the least recent bottleneck and
, for which an instantaneous expansion is estimated. The time since the bottleneck was robustly estimated in many cases.
To assess the impact of assuming a slightly simplified model for parameter estimation, we carried out an analogous study in which the ancestral population size was incorrectly assumed to equal the current size
, while the true model had
and
. For the resampling analysis, we considered the same bottleneck scenarios as before with
or
, and maximized the log-likelihood values over a grid in the parameter space (as described earlier) for each of the 200 simulated data sets each containing 10,000 polymorphic sites. The parameter estimates are shown in Table 4. The time since the bottleneck (
) is accurately estimated irrespective of correct or wrong assumptions regarding selection. Incorrectly assuming
results in either an overestimation of the duration of the bottleneck (
) in most of the cases (A1–A3) or an inference of selection when
(A4). Selection was poorly estimated even under (A3).




Again, we also analyzed all 12 scenarios under the assumption that the observed SFS is a close approximation to the expected value, to study the effect of model misspecification on parameter estimation. The results are shown in Table 5. The biases caused by incorrectly assuming are largest for the scenario that captures the youngest bottleneck (
). Here, not only the selection coefficients are strongly misestimated but also the time since the bottleneck (
) is largely underestimated. In all the other scenarios, at least the time since the bottleneck (
) is accurately estimated. The estimation accuracy of the other demographic parameters and selection coefficients increases with bottleneck age and the concomitant decreasing impact of the ancestral population size on the SFS. In summary, we note that assuming a too simplistic demographic model can lead to large errors in parameter estimation.




Testing a data set of D. melanogaster:
Here, we apply our method to analyze a data set that has been recently used to estimate the joint demographic history of several populations of D. melanogaster (Duchen et al., 2013). The data set consists of 12 sequences from a Zimbabwe population comprising 197 noncoding loci, and within each locus there are between 1 and 41 segregating sites (3234 polymorphic sites in total). We focused on the effects of weak selection and used all segregating sites in our analysis, treating them as independent. We note that whereas the 197 loci are scattered over the genome, at least tens of thousands of bases apart, the sites within each locus are tightly linked and hence not independent. We have tried a bootstrap resampling procedure to study the effect of assuming independence, but the strong stochasticity among the small subsets of presumably independent sites, which were generated by sampling one site from each locus, prevented a reliable inference.
The empirical SFS of the data shows an uptick of high-frequency derived alleles (cf. Figure 5A). As explained in Discussion, this is likely to be caused by ancestral misidentification, not by positive selection. This effect is also unlikely to be caused by linkage, since the uptick is still observed in the previously mentioned subsamples of widely separated sites. To assess the effect of presumably misoriented sites on inference, we compare results for the unfolded SFS with those obtained from a partly folded version, where only singletons and doubletons are folded with their high-frequency counterparts, since these classes appear to be affected the most (cf. Baudry and Depaulis 2003).
(A) SFS for the observed data and the most likely selective and neutral parameter estimates from left to right. (B) The same as A except that the allelic classes 1 and 2 were respectively folded with 11 and 10.
We carried out our analysis based on the bottleneck model of the previous section allowing the current and the ancestral population size to differ. To account for varying selection pressures across the genome, sites are usually subdivided into various genomic categories (e.g., exons, introns, UTRs), often assuming a constant selection coefficient for each category. Alternatively, or even combined with such a categorization, selection coefficients are assumed to follow some distribution; a gamma distribution (Kimura 1979) is a popular choice due to its flexibility to fit empirical data. Since neutrality and purifying selection are considered to be prevalent in intronic and intergenic regions of African Drosophila, we focused on negative selection coefficients in our analysis. A noncoding data set can be classified as a single functional category. Therefore, we analyzed the data set first by either assuming constant selection or neutrality, followed by an analysis where the selection coefficients were allowed to vary according to a given distribution.
We initially computed an MLE for the unfolded and the partly folded SFS under the constant selection and the neutral bottleneck model on the coarse parameter grid given in Table 1. For each model, we investigated the accuracy of the parameter estimates via parametric bootstrap, using 200 bootstrap samples each consisting of 3234 polymorphic sites. We obtained rather narrow confidence intervals for the selection coefficient and the time since the bottleneck, whereas the other details of the bottleneck were less confidently estimated. To improve the parameter estimates, we further refined the grid as follows: Nine values for were chosen from the range
, 20 values for σ from
, 10 values for
from
, 25 values for
from
, and 25 values for
from
. This gives in total 1,125,000 parameter combinations for selection and 56,250 for neutrality. As before, the ratio of two consecutive values in each parameter range was kept roughly constant. Focusing on rescaled time
instead of
relies on the observation that
and
correlate strongly and has the advantage that unlikely combinations of
and
can be omitted. More values were chosen for time parameters, since these are more sensitive than the population size parameters.
The MLEs are given in Table 6 and both versions of the SFS are illustrated in Figure 5. The analysis based on the partly folded SFS shows a better fit than the unfolded version, since negative selection combined with any demographic model is incompatible with the uptick of high-frequency derived variants in the empirical SFS. Interestingly, a neutral model was inferred for the unfolded SFS, while the model with selection fits better for the partly folded version. Since an excess of high-frequency derived variants favors demographic models that capture a strong population decline, a much smaller estimate of the bottleneck population size () was obtained for the unfolded SFS. In accordance with the previous section, the time since the bottleneck (
) was robustly estimated in both cases, as illustrated by the 10 and 100 most likely parameter estimates. However, partially folding the SFS led to a smaller estimate
. A further refinement of the grid barely changed the estimates
and
. The estimates of bottleneck duration (
) and ancestral population size (
) appeared to be strongly correlated.
We now relax the assumption of a fixed σ for all sites and allow a distribution of fitness effects by introducing gamma-distributed selection coefficients. For , the probability density of the gamma distribution with shape and rate parameters α and β is given by
, where
denotes the gamma function. The allelic spectrum for gamma-distributed selection coefficients is then obtained by integrating the allelic spectrum for constant selection coefficients given by (27) against a gamma distribution, i.e.,
(29)The SFS for gamma-distributed selection coefficients is then given by

Even when the allelic spectrum is in equilibrium and the population size is constant, the integral in (29) cannot be solved explicitly, so we needed to employ numerical integration. Previous studies (e.g., Boyko et al. 2008; Racimo and Schraiber 2014) on the distribution of fitness effects in the presence of population size changes first inferred a demographic history using putatively neutral sites and then estimated the parameters α and β based on that fixed demography. Since we do not have a separately inferred demographic model here, we considered several σ values along a variety of demographic parameter combinations. We used a coarser grid for the demographic parameters due to the larger number of σ values needed for the numerical integration step, which adds additional computational burden. While the evaluation of the allelic spectrum takes less than half a second for a given σ value with high numerical precision, the numerical integration over the range of σ values according to (29) takes a few seconds. Thus, to further reduce computational cost, we restricted the analysis to exponentially distributed selection coefficients by setting and compared the MLEs for various values of β. See Table 7 for results. The MLE was found for
, so the average σ equals
. This finding and the associated demographic estimates are consistent with the result found for a fixed selection coefficient. However, this result may change if one allows for more general shape and rate parameters.
A model of human exponential population growth
We now demonstrate the utility of our method to investigate population-size histories containing epochs of exponential growth in combination with selection. To this end, we adopted the following demographic history of a sample of African human exomes that had been estimated by Tennessen et al. (2012) as a modification of a model by Gravel et al. (2011). The population had an ancestral size of 7310 individuals until 5920 generations ago (assuming a generation time of 25 years), when it increased instantaneously in size to 14,474 individuals. After this increase, the population remained constant in size until 205 generations ago, when it started to grow exponentially until reaching 424,000 individuals at present. The relative population size function for this model can be described by (30)where c is the ratio of population sizes after and before the instantaneous expansion, which can be dated arbitrarily, so we set the time of this expansion to zero. R is the scaled exponential growth rate,
is the time at which the expansion started, and τ is the time of sampling (the present). Times are given in units of
, where the reference population size
is the initial size before time zero (the ancestral size). Since the theoretical framework presented above assumes a history of piecewise constant population sizes, the phase of exponential growth in this model had to be adequately discretized to obtain a suitable piecewise approximation. The following piecewise function can be chosen to approximate the exponential growth phase via a geometric growth function,
(31)with times
,
. Here, the number of population size changes during the phase of exponential growth is given by
Varying the growth rate δ determines the number of discretization intervals used.
The SFS (28) of the discretized version is obtained straightforwardly from (26) and (27). For the demographic parameters given above, we computed the SFS for various sample sizes up to 200 and we used , which was chosen large enough to provide reasonably fast computation times but sufficiently small to provide a good approximation of the exponential growth model. In the neutral case, the goodness of the approximation can be verified via the explicit solution of the SFS (Živković and Stephan 2011), which can be applied to the continuous and the discretized model. As shown in Figure 6A, where a sample size of
is chosen, the spectra of both continuous and piecewise-constant models agree very well with each other; the percentage error is 0.57% based on the
-norm, while the Kullback–Leibler divergence is about
.
(A) Log–log plots for the SFS of the continuous and the discretized version of the estimated human African demography and neutral evolution. (B) Log–log plots for the SFS of the discretized version under various selection coefficients. The selection coefficients in the legend are ordered from top to bottom according to the function values of the high-frequency derived alleles. The sample size is given by in both subfigures and a truncation level D = 300 was applied in B.
Using our method, selection can then be incorporated into the piecewise-constant population-size model. The effect of various negative selection coefficients (scaled with respect to the ancestral population size) is illustrated again for sample size in Figure 6B, and the same trend can be observed for smaller sample sizes as well. It is probably not surprising that the resolution in distinguishing the selective and the neutral model rises with σ. More interestingly, differences between the neutral and the selective models are apparently more pronounced among derived alleles in intermediate to high frequency. Therefore, for large data sets where intermediate- to high-frequency derived alleles are present in sufficient numbers, one may focus more strongly on these allelic classes than on low-frequency derived ones for the statistical analysis of purifying selection.
Discussion
In this article, we extended the approach of Song and Steinrücken (2012) to develop a method for finding the transition density of a WF diffusion under genic selection and piecewise-constant effective population sizes. It can be used to obtain the SFS, but explicit knowledge of the transition density is actually not required for the computation of the SFS. To that end, we revisited and simplified the moment-based method by Evans et al. (2007) in the case of a constant population size and utilized the result to obtain an efficient method for computing the SFS for a model with piecewise-constant population sizes.
The transition density for a variable population size can be incorporated into a hidden Markov model framework to analyze time series genetic data, as done by Steinrücken et al. (2014) in the case of a constant population size. However, in this article we focused on biological questions that can be investigated using the SFS and sampling at a single time point. The SFS has been employed into a maximum likelihood framework that can be applied to simultaneously infer selection coefficients and the parameters of a multi-epoch demographic model. The importance of methods that enable the joint estimation of selective and demographic parameters becomes particularly apparent in large populations, for which the scaled selection coefficient can take considerable values across large regions of the genome, so that demography and selection cannot be estimated independently.
We tested our inference method on simulated data, generated by sampling a large number of sites from the SFS of a bottleneck model for a range of selection strengths. In our parameter estimation procedure, we assumed the same model as the one used in simulations, as well as a slightly less complex model. We demonstrated that our method can accurately estimate the parameters in the majority of the bottleneck scenarios, but less so when the simpler model is assumed. The time since the bottleneck was retrieved in most of the cases even when assuming the simpler model or when the data sets simulated with selection were analyzed under neutrality. This result is encouraging for the many published demographic estimates that have been obtained assuming neutrality, but further investigation is warranted to consider more realistic models, e.g., including phases of exponential growth. Our results encourage the application of not too simple demographic models anyway.
In the African Drosophila sample, no or barely any negative selection was inferred when the possible impact of misoriented sites was ignored. To account for ancestral misidentification while maintaining sufficient information for inference, we applied a partly folded spectrum, where only the first two classes were folded with the corresponding last two classes. Using this partly folded spectrum, a negative selection coefficient of about was estimated, irrespective of assuming constant or exponentially distributed selection coefficients.
Our analyses were performed based on the bottleneck model illustrated in Figure 4. The maximum number of piecewise changes that can be incorporated into a demographic model is a function of sample size (cf. Bhaskar and Song 2014 for the neutral case), so more elaborate demographic models would have been barely accessible for this data set, especially given the limited amount of segregating sites. It indeed turned out to be difficult to pinpoint the ancestral population size and the duration of the bottleneck, whereas the time since the bottleneck was again robustly estimated. Comparing both versions of the SFS obtained using our parameter estimates and the ones given in Duchen et al. (2013), we obtained an improved goodness-of-fit to the observed SFS from the data and date the bottleneck as about half as old (in rescaled, but also in calendar time) based on the partly folded SFS. This discrepancy is not surprising, since primarily summary statistics of the SFS were used in their study while accounting for linkage to some extent.
We also applied a grid search to test if weak positive selection could explain the uptick of high-frequency derived variants in the unfolded empirical SFS. However, we did not obtain estimates being plausible from a biological point of view. When, as in this example, an excess of low- and high-frequency derived variants is simultaneously observed in comparison to a standard neutral model, unrealistically large estimates for σ are needed to explain the data. Positive selection on its own (and of some appreciable strength) causes a decline of low-frequency derived variants and an excess of high-frequency derived alleles, whereas an expansion (as embedded in the bottleneck model) acts in the opposite way. Therefore, both forces have to severely counteract each other so that the requirements of both ends of the SFS can be met.
We analyzed an example of exponential human population growth (Tennessen et al. 2012) to see the effect of purifying selection in the context of this model. As illustrated in Figure 6B for a sample of size 200 and various selection coefficients, intermediate- and high-frequency derived variants are more affected by exponential growth and negative selection than the low-frequency derived ones. A plausible explanation is that both exponential growth and negative selection enforce an increase of low-frequency derived variants until these classes are saturated and their impact can be observed in the complimentary high-frequency allelic classes. In general, this example illustrates nicely that even more elaborate models that include various phases of exponential growth and population declines can be computationally efficiently treated via an appropriate discretization of phases of continuous population size change, using the methods presented in this article.
Acknowledgments
We thank valuable comments and suggestions from two reviewers. D.Z. thanks Anand Bhaskar, Steven N. Evans, and Andreas Wollstein for helpful discussions. We thank the generous support of the Simons Institute for the Theory of Computing, where much of this work was carried out while we were participating in the 2014 program on “Evolutionary Biology and the Theory of Computing.” Y.S.S. thanks the Miller Institute for providing a Research Professorship while this article was completed. This research is supported in part by Deutsche Forschungsgemeinschaft grant STE 325/14 from the Priority Program 1590 (D.Z., W.S.), the Volkswagen Foundation grant I/84232 (D.Z.), a National Institutes of Health grant R01-GM094402 (M.S., Y.S.S.), and a Packard Fellowship for Science and Engineering (Y.S.S.).
Appendix
Here, we derive the expression shown in (12). Using (2), (5), and (7), note that

Without loss of generality, assume . [If
, the integral in (A1) is trivial to evaluate using orthogonality.] Since
is a polynomial of order
, its jth derivative vanishes for
. Using integration by parts recursively
times, we obtain
Note that the summand for
in the previous equation is equal to zero and will be omitted in the remainder. Since
, we have
so that
(A2)The modified Gegenbauer polynomials are defined as
where
is the Gauss hypergeometric function,
, and
,
. Applying this definition, we obtain
Note that the sums are finite, since
for integers
. It is simple to show that
By applying this result we obtain, after some algebra,
(A3)Finally, combining (A3), (A2), and (A1) yields the desired result.
Footnotes
Communicating editor: J. Wakeley
- Received January 29, 2015.
- Accepted April 9, 2015.
- Copyright © 2015 by the Genetics Society of America