- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Bollback, J. P.
- Articles by Nielsen, R.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Bollback, J. P.
- Articles by Nielsen, R.
Genetics, Vol. 179, 497-502, May 2008, Copyright © 2008
doi:10.1534/genetics.107.085019
Estimation of 2Nes From Temporal Allele Frequency Data
Jonathan P. Bollback*,1,
Thomas L. York
and
Rasmus Nielsen*
* Department of Biology and Evolutionary Biology, University of Copenhagen, 2100 Copenhagen Ø, Denmark and
Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853
1 Corresponding author: School of Biological Sciences, Institute of Evolutionary Biology, University of Edinburgh, King's Bldgs., W. Mains Rd., Edinburgh EH9 3JT United Kingdom.
E-mail: j.p.bollback{at}ed.ac.uk
We develop a new method for estimating effective population sizes, Ne, and selection coefficients, s, from time-series data of allele frequencies sampled from a single diallelic locus. The method is based on calculating transition probabilities, using a numerical solution of the diffusion process, and assuming independent binomial sampling from this diffusion process at each time point. We apply the method in two example applications. First, we estimate selection coefficients acting on the CCR5-
32 mutation on the basis of published samples of contemporary and ancient human DNA. We show that the data are compatible with the assumption of s = 0, although moderate amounts of selection acting on this mutation cannot be excluded. In our second example, we estimate the selection coefficient acting on a mutation segregating in an experimental phage population. We show that the selection coefficient acting on this mutation is
0.43.
THE vast majority of analyses of selection are based on samples of molecular data obtained at a single point in time. However, in a few cases, time series of allele frequencies are available. Examples of such data are ancient DNA (aDNA) data in humans (HUMMEL et al. 2005), viral population data (SHANKARAPPA et al. 1999), and data on experimentally evolved populations such as Drosophila (BURI 1956), bacterial (WOODS et al. 2006), or viral/phage populations (WICHMAN et al. 1999, 2005; HOLDER and BULL 2001; BOLLBACK and HUELSENBECK 2007). Time-series data contain much more information regarding selection than samples obtained at a single point in time, because the expected changes in allele frequencies through time are closely related to the strength of selection. The objective of this article is to develop a statistical approach for estimating selection coefficients, and testing hypotheses regarding selection coefficients, that can take advantage of the information from a time series of allele frequencies.
The method for estimating selection coefficients from allele frequency data presented here is a natural extension of existing methods for estimating the effective population size, Ne, from this type of data. A number of methods for estimating Ne, in the absence of selection, have been developed. The first such methods were moments-based estimators (KRIMBAS and TSAKAS 1971; NEI and TAJIMA 1981; POLLAK 1983; WAPLES 1989). Unfortunately, these methods suffer from a number of biases, such as an upward bias in the estimate of Ne with low-frequency alleles.
WILLIAMSON and SLATKIN (1999) developed a maximum-likelihood approach for estimating Ne from changes in allele frequencies using a hidden Markov model (HMM). This approach assumes a Wright–Fisher model of neutral evolution. We can think of the model as an HMM with state space on the set of possible allele frequencies, transition probabilities among states given by the Wright–Fisher Markov chain, and with emission probabilities obtained as the sampling probabilities arising when taking a smaller sample of gene copies from the population (ANDERSON et al. 2000). Given such a model, the likelihood of Ne can be maximized with respect to the observed allele frequencies sampled at a number of different time points. This approach allows for samples that are irregularly spaced in time (i.e., unsampled generations), but in its original form by WILLIAMSON and SLATKIN (1999), it was, for computational reasons, restricted to diallelic markers.
ANDERSON et al. (2000) extended the method to the case of multiple alleles using a somewhat computationally intensive Monte Carlo approach that relies on importance sampling to evaluate the likelihood. WANG (2001) further developed this approach to increase the speed of the likelihood estimation of Ne and included a simulation study showing that the behavior of likelihood-based methods is superior to that of the moments-based estimators.
More recently, BERTHIER et al. (2002) developed a method for estimating Ne from two time-point samples that relies on an underlying coalescent model. This method was extended to multiple time points by BEAUMONT (2003). This method is an improvement over previous likelihood methods in that the computation of the likelihood can be faster when many generations separate the samples. The speed of these approaches was improved considerably by ANDERSON (2005), who, rather than using Markov chain Monte Carlo techniques, developed a Monte Carlo importance-sampling approach. This method has the nice property that the accuracy of the estimator can more easily be established, as was not the case with the previous methods.
In this article we expand on these methods to estimate both 2Ne and the selection coefficient, s, from temporal samples of diallele frequency data. In contrast to previous approaches, we use the diffusion process as the underlying Markov process describing changes in allele frequencies. This allows the method to be computationally efficient even for large population sizes. In the following we present the theory and demonstrate the method on two common types of data that are being collected today, aDNA and experimental evolution studies.
Theory:
The trajectory of an allele through time can be modeled as a Markov process (see, e.g., EWENS 2004). One set of models assumes discrete time and overlapping (e.g., Moran models) or nonoverlapping (e.g., Wright–Fisher models) generations. In the limit of large population sizes, all of these models can be described by a common diffusion process, X(t)
[0, 1], with transition probabilities described by the backward Kolmogorov equations
![]() | (1) |
![]() | (2) |
![]() | (3) |
![]() | (4) |
![]() | (5) |
![]() | (6) |
![]() | (7) |
![]() | (8) |
Using this approach, the likelihood function for the parameters s and Ne can then be calculated for a time series in linear time using standard dynamic programming algorithms, if the integral on the right-hand side of Equation 5 can be solved.
Numerical approximations:
We numerically evaluate
to calculate the sampling probability (Figure 1). The numerical approximation consists of two steps. First, the transition probabilities of the process are evaluated by numerically solving Equation 1 using the Crank–Nicolson method (CRANK and NICOLSON 1947). Briefly, derivatives are approximated by finite differencing, and implicit and explicit time steps are alternated, leading to an easily soluble sparse linear system of equations and giving both numerical stability and accuracy, which is second order in the size of the time step. We then evaluate the integral in Equation 6, using numerical integration based on quadrature using the midpoint rule. The same grid of values used for the Crank–Nicolson approximation is used for the numerical integration. To ensure a smooth likelihood surface we use a fixed grid for all parameter values.
|
Grid size and spacing:
The adequacy of the numerical approximations to Equation 1 will depend on the grid sizes used and the spacing between points—when the density is concentrated at the exit barriers, X(t) = 0 and X(t) = 1, the approximation may be poor. Therefore, we used exponentially spaced grid points to increase the number of grid points near the boundaries, while decreasing the number of central points, when the numerical solutions did not converge. The position of the n grid points can be computed in the following way. First, the points from i = 1, i = 2, ..., i = n/2, starting near the boundary at zero, can be calculated as
![]() | (9) |
is the spacing parameter. Second, the position of the remaining points, [0.5, 1), is simply the reflection of the points calculated in Equation 9. To avoid spurious differences in the likelihood calculations, due to the choice of grid points, we used a fixed grid for all time points and parameter values.
CCR5-
32:
In the first application we use a data set consisting of time-series allele frequency data for the CCR5-
32 locus (see Figure 2; HUMMEL et al. 2005). Briefly, HUMMEL et al. (2005) determined the frequency of this mutation from ancient human remains and an extant representative population of northern Europeans: samples were collected at five time points dating back to 900 B.C. (Figure 2). In these analyses we assume that a human generation is 20 years. We used 500 grid points in the numerical approximations with an exponential spacing (
= 0.005) to determine the midpoints.
|
Bacteriophage MS2:
In the second application we utilize frequency trajectory data from a recent experimental study of adaptation of the bacteriophage MS2 by BOLLBACK and HUELSENBECK (2007). Briefly, this study selected populations of MS2 for growth at elevated temperatures. The authors determined the frequency of a number of beneficial mutations throughout the time course of the experiment (every
10 passages). We apply our method to one of the nonsynonymous mutations in their experimental line 2: C206U (Figure 3). We make the assumption that each selective passage in their study consisted of 2.5 generations; C206U was tracked for
100 generations, or 40 serial passages. We used 500 grid points in the numerical approximation. Due to the fairly small number of generations and little time spent at the boundaries of the process, unequal spacing was not needed and equal spacing on the grid was used.
|
Numerical convergence and grid point spacing:
Determining the number of grid points and the spacing of those points affects the precision of the numerical approximations. To this end we have evaluated a simple case in which we were able to analytically calculate the log likelihood and compare this to the numerical approximation. We performed the check using a number of different grid points (n = 10, 100, 200, 500, 1000, 2000) and two methods of grid point spacing (equal and unequal). Our evaluations of convergence under two types of spacing (Figure 4) show that at low numbers of grid points (n = 10 and n = 100) the difference between the expected and the observed value is large regardless of the method of spacing (7.1–9.4% and 0.7–1.1%, respectively). As the number of grid points increases, the difference declines dramatically to <0.5% and reaches an error rate of 0.04% at the largest number of points. For the test data set little difference was observed for different values of
. As the number of grid points increases, the numerical routines for solving the set of differential equations become burdensome so a choice of grid points for a particular analysis will be a trade-off between computational burden and precision. For these reasons the applications of our method used values of 400 or 500 as the error rate was deemed to be sufficiently small for these values.
|
CCR5-
32:
The CCR5 protein is a chemokine receptor. This receptor is a coreceptor target for human immunodeficiency virus (HIV) and simian immunodeficiency virus, and possibly other related viruses (MUMMIDI et al. 2000; PATERLINI 2002). An allele with a 32-amino-acid deletion, named CCR5-
32, has been determined to be at low frequency in the human population and its origin has been estimated to be at least 700–3500 years ago (STEPHENS et al. 1998) with empirical observations of at least 2900 years ago (HUMMEL et al. 2005). Because of the age of the mutation it has been argued that it is extremely unlikely to be a neutral mutation. As a result, CCR5-
32 has been hypothesized to have been under selection from the bubonic plague or smallpox with the low frequency being explained by intermittent temporal selection or balancing selection (for reviews see DE SILVA and STUMPF 2004; STUMPF and WILKINSON-HERBOTS 2004; GALVANI and NOVEMBRE 2005). Recently, it has been demonstrated that homozygous individuals are completely resistant to HIV infection (heterozygous individuals exhibit lower infection rates and longer disease progressions) (MCNICHOLL et al. 1997). However, the recent origin of HIV is unlikely to explain its persistence over such a long period of time although the ongoing epidemic may affect the frequency in the future. NOVEMBRE et al. (2005), using the current allelic distribution of CCR5-
32 in Europe, modeled the historical spread of this mutation to determine its origin, rate of spread, and selective value. They found that, depending on whether selection is uniform or varying across Europe, the most likely origin of the allele was in the north or northwest, the rate of spread exceeded 100 km per generation, and the intensity of selection was >10% (NOVEMBRE et al. 2005).
We have applied our method to estimate s from an ancient DNA data set (HUMMEL et al. 2005) consisting of samples gathered from multiple time points in Europe dating from 2900 years ago to present. Our estimates of s for CCR5-
32 had a 95% confidence interval of –0.09–0.01 with a maximum value very close to zero (s
–0.0005), suggesting that the mutation is either neutral or at best slightly beneficial. The upper end of the confidence interval is close to the lower end of the confidence interval for previous estimates based on the analysis of frequency data and linkage disequilibrium (s
0.05–0.35; STEPHENS et al. 1998; SLATKIN 2001). However, NOVEMBRE et al. (2005) found support for s << 0.02 when the dispersal rate was <75 km, which is more consistent with studies of historical and modern dispersal in Europe. These values are also consistent with our estimates of s (Figure 5).
|
We should warn against a too strong interpretation of our results because the samples are clearly not obtained from an idealized panmictic population that has gene flow with other populations. In addition, there are the usual caveats regarding human aDNA data. A full discussion of problems regarding human aDNA is beyond the scope of this article (for review, see GILBERT et al. 2005).
Bacteriophage MS2:
Experimental evolution studies of microbial populations typically follow the change in beneficial mutations through time as a matter of course (e.g., WICHMAN et al. 1999; HOLDER and BULL 2001; BOLLBACK and HUELSENBECK 2007). The data from these types of studies are ideal for the method presented here for a number of reasons. First, they are performed in a controlled manner in which the population size is known, kept fairly constant, and generally large. Second, they are able to sample mutations throughout the bout of selection with relative ease. Third, the mutations are more often than not known to be under selection. Finally, the selective conditions are kept constant through time.
We apply our method to the experimental MS2 bacteriophage data of BOLLBACK and HUELSENBECK (2007). Using the trajectory for the mutation C206U (Figure 3) we evaluate 2Nes. We performed the numerical HMM integration over a reasonable set of population sizes (N = 1 x 107–2 x 108) that included the experimental value (5 x 107; BOLLBACK and HUELSENBECK 2007) and selection coefficients (2Nes
0–1 x 109; s
0–5). Figure 6 shows the likelihood surface for C206U, plotting the log of the population size, 2N, against the log of 2Ns. The maximum observed value (shown as a plus sign in Figure 6) indicates a population size of 3.89 x 107 and a selection coefficient of 0.427 (95% C.I.: 0.386–0.819). The best supported population size estimate is very close to the experimental values (N = 5 x 107; BOLLBACK and HUELSENBECK 2007). However, because of the extremely large population sizes and strong selection, the mutation's trajectory is strongly deterministic and little information exists to estimate the population effective size; the 95% confidence interval spans all of the values evaluated as expected. The estimate of s is reasonable considering the fitness gains (w – 1 = 3) and number of beneficial mutations (n = 4) observed in the experimental population (BOLLBACK and HUELSENBECK 2007): our estimate of s suggests that C206U accounts for 13–27% of the total fitness gain in the population.
|
Practical limitations and assumptions:
Two assumptions of our method merit discussion. First, in the applications presented, we ignore recurrent mutation. However, recurrent mutation is not likely to significantly affect allele frequencies in most cases, except those with weak selection, a high mutation rate, and large populations. The MS2 populations, for example, meet two of these conditions (BOLLBACK and HUELSENBECK 2007)—high mutation rate and large population size—but the strong selection experienced is expected to overwhelm any input from mutation. Second, our method assumes that the samples are taken from a panmictic population; highly structured populations will clearly have an effect on the estimation of 2Nes, particularly for recessive mutations. Unfortunately, sampling from structured populations cannot easily be accommodated in the current framework and merits future work.
Conclusions:
We show here that estimation of 2Nes is possible from time series of allele frequency data. The nature of the data naturally presents some limitations. The estimates of 2Ne and s will in many cases be very correlated. Additionally, as Ne becomes large, the trajectory of the allele frequency through time becomes approximately deterministic and there should be little or no power to estimate Ne (see Figure 6). Nonetheless, even in such cases the method will provide estimates of s that take into account the uncertainty associated with the estimation of population allele frequencies from sample allele frequencies. The greatest strength of the method, however, is in the cases where 2Nes is moderate and joint estimates of Ne and s can be done. In such cases, the method can also be used to test the hypothesis of s = 0. An example of such an application was given for the CCR5-
32 data.
ANDERSON, E. C., 2005 An efficient Monte Carlo method for estimating Ne from temporally spaced samples using a coalescent-based likelihood. Genetics 170: 955–967.
ANDERSON, E. C., E. G. WILLIAMSON and E. A. THOMPSON, 2000 Monte Carlo evaluation of the likelihood for Ne from temporally spaced samples. Genetics 156: 2109–2118.
BEAUMONT, M. A., 2003 Estimation of population growth or decline in genetically monitored populations. Genetics 164: 1139–1160.
BERTHIER, P., M. A. BEAUMONT, J.-M. CORNUET and G. LUIKART, 2002 Likelihood-based estimation of the effective population size using temporal changes in allele frequencies: a genealogical approach. Genetics 160: 741–751.
BOLLBACK, J. P., and J. P. HUELSENBECK, 2007 Clonal interference is alleviated by high mutation rates in large populations. Mol. Biol. Evol. 24: 1397–1406.
BURI, P., 1956 Gene frequency in small populations of mutant Drosophila. Evolution 10: 367–402.[CrossRef]
CRANK, J., and P. NICOLSON, 1947 A practical method for numerical evaluation of solutions of partial differential equations of the heat conduction type. Proc. Camb. Philos. Soc. 43: 50–67.
DE SILVA, E., and M. P. STUMPF, 2004 HIV and the CCR5-
32 resistance allele. FEMS Microbiol. Lett. 241: 1–12.[CrossRef][Medline]
DURBIN, R., S. EDDY, A. KROGH and G. MITCHISON, 1998 Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.
EWENS, W. J., 2004 Mathematical Population Genetics, Ed. 2. Springer-Verlag, New York.
GALVANI, A. P., and J. NOVEMBRE, 2005 The evolutionary history of the CCR5-
32 HIV-resistance mutation. Microbes Infect. 7: 302–309.[CrossRef][Medline]
GILBERT, M. T., H. J. BANDELT, M. HOFREITER and I. BARNES, 2005 Assessing ancient DNA studies. Trends Ecol. Evol. 20: 541–544.[CrossRef][Medline]
HOLDER, K. K., and J. J. BULL, 2001 Profiles of adaptation in two similar viruses. Genetics 159: 1393–1404.
HUMMEL, S., D. SCHMIDT, B. KREMEYER, B. HERRMANN and M. OPPERMANN, 2005 Detection of the CCR5-
32 HIV resistance gene in Bronze Age skeletons. Genes Immun. 6: 371–374.[CrossRef][Medline]
KRIMBAS, C. B., and S. TSAKAS, 1971 The genetics of Dacus oleae. V. Changes of esterase polymorphism in a natural population following insecticide control—Selection or drift? Evolution 25: 454–460.[Medline]
MCNICHOLL, J. M., D. K. SMITH, S. H. QARI and T. HODGE, 1997 Host genes and HIV: the role of the chemokine receptor gene CCR5 and its allele (
32 CCR5). Emerg. Infect. Dis. 3: 261–271.[Medline]
MUMMIDI, S., M. BAMSHAD, S. S. AHUJA, E. GONZALEZ, P. M. FEUILLET et al., 2000 Evolution of human and non-human primate CC chemokine receptor 5 gene and mRNA. Potential roles for haplotype and mRNA diversity, differential haplotype-specific transcriptional activity, and altered transcription factor binding to polymorphic nucleotides in the pathogenesis of HIV-1 and simian immunodeficiency virus. J. Biol. Chem. 275: 18946–18961.
NEI, M., and F. TAJIMA, 1981 Genetic drift and estimation of effective population size. Genetics 98: 625–640.
NOVEMBRE, J., A. P. GALVANI and M. SLATKIN, 2005 The geographic spread of the CCR5
32 HIV-resistance allele. PLoS Biol. 3: e339.[CrossRef][Medline]
PATERLINI, M. G., 2002 Structure modeling of the chemokine receptor CCR5: implications for ligand binding and selectivity. Biophys. J. 83: 3012–3031.
POLLAK, E., 1983 A new method for estimating the effective population size from allele frequency changes. Genetics 104: 531–548.
SHANKARAPPA, R., J. B. MARGOLICK, S. J. GANGE, A. G. RODRIGO, D. UPCHURCH et al., 1999 Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J. Virol. 73: 10489–10502.
SLATKIN, M., 2001 Simulating genealogies of selected alleles in a population of variable size. Genet. Res. 78: 49–57.[CrossRef][Medline]
STEPHENS, J. C., D. E. REICH, D. B. GOLDSTEIN, H. D. SHIN, M. W. SMITH et al., 1998 Dating the origin of the CCR5-
32 AIDS-resistance allele by the coalescence of haplotypes. Am. J. Hum. Genet. 62: 1507–1515.[CrossRef][Medline]
STUMPF, M. P., and H. M. WILKINSON-HERBOTS, 2004 Allelic histories: positive selection on a HIV-resistance allele. Trends Ecol. Evol. 19: 166–168.[CrossRef][Medline]
WANG, J., 2001 A pseudo-likelihood method for estimating effective population size from temporally spaced samples. Genet. Res. 78: 243–257.[Medline]
WAPLES, R. S., 1989 A generalized approach for estimating effective population size from temporal changes in allele frequency. Genetics 121: 379–391.
WICHMAN, H., J. MILLSTEIN and J. BULL, 2005 Adaptive molecular evolution for 13,000 phage generations: a possible arms race. Genetics 170: 19–31.
WICHMAN, H. A., M. R. BADGETT, L. A. SCOTT, C. M. BOULIANNE and J. J. BULL, 1999 Different trajectories of parallel evolution during viral adaptation. Science 285: 422–424.
WILLIAMSON, E. G., and M. SLATKIN, 1999 Using maximum likelihood to estimate population size from temporal changes in allele frequencies. Genetics 152: 755–761.
WOODS, R., D. SCHNEIDER, C. L. WINKWORTH, M. A. RILEY and R. E. LENSKI, 2006 Tests of parallel molecular evolution in a long-term experiment with Escherichia coli. Proc. Natl. Acad. Sci. USA 103: 9107–9112.
Communicating editor: N. TAKAHATA
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Bollback, J. P.
- Articles by Nielsen, R.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Bollback, J. P.
- Articles by Nielsen, R.









. Transition probabilities, px,i = f(x; p, t) from x to 1 are numerically solved using the Crank–Nicholson method (




