Genetics, Vol. 166, 351-372, January 2004, Copyright © 2004

The Allele Frequency Spectrum in Genome-Wide Human Variation Data Reveals Signals of Differential Demographic History in Three Large World Populations

Gabor T. Martha, Eva Czabarkaa, Janos Murvaia, and Stephen T. Sherrya
a National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894

Corresponding author: Gabor T. Marth, Boston College, 140 Commonwealth Ave., Chestnut Hill, MA 02467., marth{at}bc.edu (E-mail)

Communicating editor: L. EXCOFFIER


*  ABSTRACT
*TOP
*ABSTRACT
*Modeling the distribution of...
*Demographic history
*METHODS
*RESULTS
*DISCUSSION
*APPENDIX
*LITERATURE CITED

We have studied a genome-wide set of single-nucleotide polymorphism (SNP) allele frequency measures for African-American, East Asian, and European-American samples. For this analysis we derived a simple, closed mathematical formulation for the spectrum of expected allele frequencies when the sampled populations have experienced nonstationary demographic histories. The direct calculation generates the spectrum orders of magnitude faster than coalescent simulations do and allows us to generate spectra for a large number of alternative histories on a multidimensional parameter grid. Model-fitting experiments using this grid reveal significant population-specific differences among the demographic histories that best describe the observed allele frequency spectra. European and Asian spectra show a bottleneck-shaped history: a reduction of effective population size in the past followed by a recent phase of size recovery. In contrast, the African-American spectrum shows a history of moderate but uninterrupted population expansion. These differences are expected to have profound consequences for the design of medical association studies. The analytical methods developed for this study, i.e., a closed mathematical formulation for the allele frequency spectrum, correcting the ascertainment bias introduced by shallow SNP sampling, and dealing with variable sample sizes provide a general framework for the analysis of public variation data.


THE analysis of statistical distributions of genetic variations has a rich history in classical population genetic studies (CROW and KIMURA 1970 Down), and recent genome-scale data collection projects have positioned the field to apply, challenge, and improve traditional theory by examining data from thousands of loci simultaneously. The two most frequently studied distributions of nucleotide sequence variation are the marker density (MD), or mismatch distribution (LI 1977 Down; ROGERS and HARPENDING 1992 Down; i.e., the distribution of the number of polymorphic sites observed when a collection of sequences of a given length are compared), and the allele frequency spectrum (AFS; EWENS 1972 Down; i.e., the distribution of diallelic polymorphic sites according to the number of chromosomes that carry a given allele within a sample). The latter distribution is immediately applicable to the genotype data produced by projects that are characterizing a large subset of currently available single-nucleotide polymorphisms (SNPs) with measures of individual allele counts (genotypes) for three ethnic populations (http://snp.cshl.org/allele_frequency_project/). In addition to data availability, the AFS has other, analytical advantages over MD data, most notably its independence from the effects of recombination or mutation rate heterogeneity as we show below.


*  Modeling the distribution of allele frequency
*TOP
*ABSTRACT
*Modeling the distribution of...
*Demographic history
*METHODS
*RESULTS
*DISCUSSION
*APPENDIX
*LITERATURE CITED

Prior study of the AFS has been restricted to properties of summary statistics such as Tajima's D (TAJIMA 1989 Down), or the proportion of rare- to medium-frequency alleles (FU and LI 1993 Down). There has been very little analysis of the general shape of observed spectral distributions. The analytical shape of the AFS, under a stationary history of constant effective population size, was derived by FU 1995 Down who showed that, within n samples, the expected number of mutations of size i is inversely proportional to i. Important properties of the coalescent process under deterministically changing population size have been derived in publications of GRIFFITHS and TAVARE 1994A Down, GRIFFITHS and TAVARE 1994B Down and TAVARE et al. 1997 Down. These results show that, for the purposes of genealogy, varying population size can be treated by appropriate scaling of the coalescent time. Applying these results to obtain a formula for the allele frequency spectrum is not trivial, however, because mutations occur in nonscaled time. More recently, WOODING and ROGERS (2002) derived a method called the matrix coalescent that overcomes these difficulties and calculates the AFS under arbitrarily changing population size histories. Their approach solves the problem for the general case, but leads to an involved computational procedure requiring numerical matrix inversion. In this study, we have taken a different approach. By extending Fu's result from a stationary population history to a more general shape, a profile of demographic history characterized by an arbitrary number of epochs such that the effective population size is constant within each epoch, we have arrived at a very simple, easily computable formula for the AFS. The price we pay is the lack of generality of arbitrary shapes. In many practical situations, however, these shapes can be approximated by a piecewise constant effective size profile. The advantage is a formulation that permits very rapid generation of AFS under a large number of competing histories for accurate data fitting and hypothesis testing. This result is applicable when the sites under consideration are selected randomly and the number of successfully genotyped samples is identical at each site. For the data set we are considering both of these assumptions are violated. First, the sites in question were selected for the population allele frequency characterization of a large subset of SNPs from a genome-wide map (SACHIDANANDAM et al. 2001 Down) of SNPs discovered by computational means, in large mining efforts in the public (ALTSHULER et al. 2000 Down; MULLIKIN et al. 2000 Down; LANDER et al. 2001 Down; MARTH et al. 2003 Down) and private (VENTER et al. 2001 Down) domains, numbering millions of sites. Common in these efforts is that SNP discovery was carried out in samples of a small number of chromosomes (two or three). The samples used in the discovery phase were different from the samples used in the consequent genotype characterization experiments, and they represented an unknown mixture of ethnicities. Second, because of genotyping failures, the number of successful genotypes varies from site to site, raising the question of how to compare allele counts across these sites. In this work, we propose methods to deal with these practical problems. The resulting suite of tools enables us to analyze the shape of the AFS observed in the data directly and to evaluate competing scenarios of demographic history on the basis of how well they fit the observations.


*  Demographic history
*TOP
*ABSTRACT
*Modeling the distribution of...
*Demographic history
*METHODS
*RESULTS
*DISCUSSION
*APPENDIX
*LITERATURE CITED

The reconstruction of human demographic history is of direct biological and anthropological interest. Additionally, the history of effective population size has a profound effect on important quantities such as the extent of linkage disequilibrium and is therefore important for medical association studies. There have been many attempts for demographic inference from contemporary molecular data representing different molecular mutation systems such as mitochondrial DNA polymorphisms (DI RIENZO and WILSON 1991 Down; ROGERS and HARPENDING 1992 Down; SHERRY et al. 1994 Down; INGMAN et al. 2000 Down), microsatellites (DI RIENZO et al. 1998 Down; KIMMEL et al. 1998 Down; REICH and GOLDSTEIN 1998 Down; RELETHFORD and JORDE 1999 Down; GONSER et al. 2000 Down; ZHIVOTOVSKY et al. 2000 Down), and, more recently, SNPs in nuclear DNA (HARDING et al. 1997 Down; CLARK et al. 1998 Down; CARGILL et al. 1999 Down; ZHAO et al. 2000 Down; REICH et al. 2001 Down; SACHIDANANDAM et al. 2001 Down; YU et al. 2001 Down). For both global samples of human diversity, or specific subpopulations, practically all possible simple shapes of population history have been proposed: constant effective size (stationary history), growth relative to an ancestral effective size (population expansion), size reduction (collapse), and bottleneck (a phase of size reduction followed by a phase of growth or recovery); see Fig 1. These claims as well as the underlying data have been reviewed by various authors (HARPENDING and ROGERS 2000 Down; WALL and PRZEWORSKI 2000 Down; JORDE et al. 2001 Down; ROGERS 2001 Down; PTAK and PRZEWORSKI 2002 Down; TISHKOFF and WILLIAMS 2002 Down). It is generally agreed that variation patterns in mitochondrial DNA show rapid expansion of effective size in all human populations. Results in microsatellite data are less unanimous about which populations experienced expansion or what the magnitude and starting time of such demographic events were. Recent studies of SNP data sets in nuclear DNA propose the possibility of a population collapse to explain reduced haplotype diversity (CLARK et al. 1998 Down; REICH et al. 2001 Down, REICH et al. 2002 Down; GABRIEL et al. 2002 Down), especially in samples of European ancestry, a hypothesis consistent with our observations in the current data set.



View larger version (16K):
In this window
In a new window
Download PPT slide
 
Figure 1. Example of a three-epoch, piecewise constant, bottleneck-shaped population history profile. The ancestral effective population size (N3) is followed by an instant reduction of effective size (N2). The duration of this epoch is T2 generations. This is followed by a stepwise increase of effective population size to N1, T1 generations before the present.


*  METHODS
*TOP
*ABSTRACT
*Modeling the distribution of...
*Demographic history
*METHODS
*RESULTS
*DISCUSSION
*APPENDIX
*LITERATURE CITED

Allele frequency spectrum under stepwise constant effective population size:
We show that, for a population evolving under the Wright-Fisher model, and under selective neutrality, the expectation for the number of mutations {Psi}i of size i, within a sample of n chromosomes under a demographic history of multi-epoch, piecewise constant effective population size is

(1)

where µ is the (constant) per-locus mutation rate, Nm is the effective population size in epoch m, Tm is the corresponding epoch duration, and the normalized epoch boundary time. A detailed derivation of this result is given in the Appendix. The normalized distribution of these expectations according to the frequency is the allele frequency spectrum:

(2)

It is sometimes useful to consider the "full" allele frequency spectrum, Pfulln(i), considering sizes 0 and n, i.e., when all samples carry the ancestral or the derived allele, respectively. We have verified the accuracy of the complete allele frequency spectrum derived from this formulation by coalescent simulations (supplemental Figure S1 at http://www.genetics.org/supplemental/). Three important properties of the allele frequency spectrum are clear from Equation 1. First, the expectation for a given frequency is linear under simultaneous scaling of all effective population sizes and epoch durations (i.e., as long as Tm and Nm are multiplied by the same constant for each m), hence the relative frequency spectrum remains unchanged. This fact can be exploited to reduce the number of parameters that characterizes a given demographic model under consideration. Second, the expected number of mutations of a given size for more than one nucleotide site is simply the sum of the individual expectations, without regard to any possible correlation among the site genealogy of proximal sites. Therefore, our results for the expected number of segregating sites as well as the allele frequency spectrum are also valid for polymorphisms at a single locus of arbitrary sequence length, without regard to possible recombination within the locus, or for polymorphisms collected from throughout the genome. This latter consideration allows us to apply the theoretical expectations derived here for the data set examined, without regard to the amount and structure of linkage between the sites represented within the set. Third, the allele frequency spectrum is independent of the actual value of the per-nucleotide, per-generation mutation rate, as long as this rate is uniform for every site considered.

Minor allele frequency spectrum (folded spectrum):
In situations where allele frequency is determined experimentally by counting the two alternative alleles within a sample of n chromosomes, it is uncertain which of the two alleles is the mutant allele. In such situations, instead of the true frequency, we work with the frequency of the less frequent (or minor) allele (FU 1995 Down). The distribution of minor allele frequency is described by the folded spectrum defined as

(3)

By this definition, if n is even, n(n/2) = 2Pn(n/2), i.e., twice the value we would expect to measure, leading to a "doubling effect." This fact needs to be taken into account during the interpretation of measured data. Because in many data sets available for analysis the ancestral allelic state is currently unknown, the folded spectrum is important in practice.

Numerical calculation of the allele frequency spectrum:
Frequency spectrum calculations were implemented in the C programming language. Some care must be taken when calculating the expected spectrum, because computing Equation 1 requires the evaluation of alternating sums, a source of numeric instability when the individual terms are close in value. Instability can be avoided by accurate calculation of each term. The higher the sample size, the more accurately each term has to be evaluated. We do not have a systematic way to predict the accuracy requirement as a function of sample size, hence we determined the accuracy requirement for a given sample size by trial and error. In our implementation, we have used high-accuracy numeric libraries with settable numeric precision. Our experience has been that, up to a sample size n = 100, a numeric precision of 100 decimal places was sufficient for our calculations. Evaluation of the allele frequency spectrum for a sample size of 1000 required a numerical precision of ~500 decimal places.

Correcting ascertainment bias:
To describe the situation where polymorphic sites discovered in a set of samples are genotyped in a second, independently drawn set of samples for frequency characterization we divide the two independent groups of samples into a "discovery" group consisting of k samples and a "genotyping" group consisting of n samples. The discovery process is modeled by considering only those sites within the n + k samples that are polymorphic (i.e., are of size between 1 and k - 1) within the discovery group of depth k and discarding those sites that are monomorphic in this group, as these sites would not be considered for subsequent genotyping. The conditional probability, Pn|k(i), that a site is of size i within the n genotyping samples given that it is polymorphic in the k discovery samples is:

(4)

It is possible that a site that appears polymorphic within the k discovery samples is monomorphic within the n genotyping samples. As a result, the conditional probabilities Pn|k(0) and Pn|k(n) are typically nonzero, and one has to renormalize after the transformation to get the AFS. It is easy to verify that Equation 4 is also valid for calculating the folded conditional spectrum n|k(i), as defined in Equation 3, provided that both folded spectra k(i) and n+k(i) are available. This property makes it possible to account for the ascertainment bias when only the folded allele frequency distributions are available. For the sake of completeness, we include the conditional spectrum for the important special case, k = 2, i.e., ascertainment within a pair of chromosomes:

(5)

It is easy to show that under a stationary history the spectrum is a linear function of i, and the folded spectrum is constant (Fig 2A).



View larger version (51K):
In this window
In a new window
Download PPT slide
 
Figure 2. Ascertainment bias. (a) Folded spectra under stationary history, at various values of "discovery sample" size k (METHODS). (b) Allele frequency spectra predicted under competing scenarios of population history (conditioned on pairwise ascertainment k = 2). Equilibrium history, N1 = 10,000; expansion, N1 = 20,000, T1 = 3000, N2 = 10,000; collapse, N1 = 2000, T1 = 500, N2 = 10,000; bottleneck history, N1 = 20,000, T1 = 3000, N2 = 2000, T2 = 500, N3 = 10,000. (a and b) Sample size n = 41.

We point out that our method of ascertainment bias correction improves on an earlier method based on using the measured discrete allele frequency as an estimator for the overall allele frequency within the population (SHERRY et al. 1997 Down; see supplemental Figure S2 at http://www.genetics.org/supplemental/).

Reduction of allele frequency counts to equivalent counts at a lower sample size:
Often allele frequency data are the result of genotyping a target number, nt, of individuals at a collection of polymorphic sites. Because of genotyping failures, however, the actual number of genotypes available at different locations is smaller and often varies from site to site. At sites where an identical number, n, of successfully determined chromosomal allelic states are available we denote the distribution of allele counts by Cn(i) and the corresponding probability distribution obtained by normalizing these counts by Pn(i). Sites with different numbers of successful genotypes are not directly comparable. To enable joint analysis of allele counts observed at all sites genotyped in the experiment, we have devised a procedure that, given an observed distribution of allele frequencies among samples, produces an equivalent distribution at a lower sample size, m. This is achieved by, first, considering all possible choices of m subsamples selected from the total n available samples, in such a way that each choice is equally likely and, second, requiring that the total number of observations remains the same. Under these assumptions, the "equivalent" allele counts, m(i), for m subsamples are

(6)


(7)

Note that this procedure does not allow one to generate a higher sample size distribution on the basis of a lower sample size distribution. Also note that, even if the higher sample size distribution was a relative allele frequency spectrum, the resulting lower sample size distribution will contain nonzero terms for size 0 and for size m. Clearly, the first case is the result of the possibility that the omission of n - m chromosomes left us with 0 mutant alleles, and the second is that only mutant alleles remained. This results in a slight reduction of the total number of relative counts as compared to the original observations. To obtain the AFS, one omits sizes 0 and m in Equation 7 and renormalizes. It is easy to verify that the equivalence reduction also works for the folded allele frequency distribution.

We point out that our reduction procedure is not equivalent to frequency binning, a procedure sometimes employed to compare allele counts available at different samples sizes. Aggregating discrete allele frequency data on the basis of a nominal allele frequency c/n, the ratio of allele counts and the sample size, results in data distortion stemming from two sources. First, for a given sample, the inherent base frequency is fn = n-1. In general, only window sizes that are integer multiples of fn will preserve the uniform appropriation of allele sizes into frequency bins. This may be impossible if multiple sample sizes are present in the data. Second, sites with identical nominal allele frequencies but different sample sizes are not equivalent; e.g., a site with a minor allele count of 1 in 3 samples is clearly not equivalent to a site with a minor allele count of 10 in 30 samples. Distortions from both sources are most pronounced at lower sample sizes. Our equivalence reduction procedure is a technique of data aggregation that is free of such distortions. This point is further illustrated in supplemental Figure S3 at http://www.genetics.org/supplemental/, where we compared the AFS resulting from simple binning of all available data for the European samples to the AFS we obtain by the equivalence data reduction procedure presented here.

Coalescent simulations and tabulation of linkage disequilibrium:
We used coalescent simulations to verify the accuracy of our allele frequency spectrum calculations (supplemental Figure S1), to tabulate measures of linkage disequilibrium, and to tabulate distributions of mutation age. To perform these simulations, we have implemented a widely used, direct coalescent algorithm (HUDSON 1991 Down). The simulation software was first implemented in Perl for rapid coding and error checking and then reimplemented in C++ for increased computational speed. To verify the direct formula, we have run coalescent simulations under a variety of population history scenarios, tabulated the allele frequency spectra, and compared them to the computed predictions. To verify the conditional spectrum calculations, we have simulated n + k chromosomes within a common genealogy, designated k samples as the discovery group, and n samples as the genotyping, or frequency measurement, group. Of all the sites that were polymorphic within the n + k samples, we discarded those sites that were monomorphic within the k discovery samples and kept the remaining sites. We then tabulated the allele frequency counts at these sites among the n genotyping samples.

Expectations for the extent of linkage disequilibrium were generated according to a previously published method (KRUGLYAK 1999 Down). For each population, we used the best-fitting three-epoch model for the coalescent simulations, with samples size n = 100. Marker allele frequencies were restricted to the range between 0.25n and 0.75n. For each value of recombination fraction, we tabulated r2, a commonly used measure of linkage disequilibrium defined as

(8)

where A and a denote the mutant and the ancestral alleles at the first marker location, and B and b are the alternative alleles at the second marker location. The quantities pA, pa, pB, and pb are the corresponding allele frequency measurements, and pAB is the measured frequency of the haplotype defined by the combination of allele A at the first marker position and B at the second marker position. Finally, marker age was tabulated by registering the time of occurrence for each of the mutations during the simulations.

Model fitting to observed allele frequency spectra:
The primary objective of the fitting experiments is to determine the distribution of the posterior probability of the model parameters given the observed data: P(model|data). With the help of our closed formula for the direct calculation of the AFS we were able to generate the expected AFS for a complete, high-resolution, multidimensional grid overlaid on the parameter space that we intended to explore. This direct approach yielded the likelihood distribution, P(data|model), computed at each grid point. Given that there is no sensible way to assign an "informed" prior distribution to the model parameters, the distribution of the likelihood function is equivalent to the posterior distribution and can be used in ranking competing parameters. We point out that an alternative method of achieving the same goal is to use a Markov-chain Monte Carlo (MCMC) technique to obtain the posterior distribution (GRIFFITHS and TAVARE 1994A Down; KUHNER et al. 1995 Down). We opted for the direct method because it was simple but computationally feasible, by its nature avoided the convergence issues usually associated with MCMC, and allowed us to evaluate the likelihood function at every grid point, for each of the three population-specific AFS analyzed.

Stepwise constant models of one, two, and three epochs were considered. For each model class defined by the number of epochs, a vector of parameters describing the model was considered, including the effective population size and the duration of the epoch (expressed in terms of generations). We have sampled each effective size parameter, Ni, between 1000 and 150,000 in steps of 1000 up to 30,000 and in steps of 5000 beyond 30,000, and each epoch duration parameter, Ti, between 100 and 50,000 in steps of 100 up to 10,000 and in steps of 500 beyond 10,000. Because of the scaling equivalence of the relative distribution discussed earlier, we fixed the ancestral size (the effective size of the epoch farthest in the past) parameter at 10,000, for each model class. We have generated the unbiased allele frequency spectra by direct calculation using Equation 1, for a sample size of m + 2, where m = 41 is the (common) sample size after data reduction, and k = 2 is the discovery size. We then computed the conditional spectrum using Equation 4. Finally, we folded the spectrum using the definition given in Equation 3. To quantify the degree of fit between a given model and the observations we have used the likelihood of the observed data conditioned on the model:

(9)

For generating the likelihood surface for the European bottleneck size vs. duration we used the {chi}2 metric defined as

(10)

In the above notations, ci is the observed number of sites of size i, c is the number of total sites, pi is the predicted (relative) probability of size i, and m is the common sample size to which all observations were reduced using the equivalence data reduction procedure outlined earlier.

Comparison between models with different epoch numbers:
Models within the same structure (same epoch number) could be directly compared on the basis of any of the three goodness-of-fit metrics discussed above. Models with different numbers of epochs were compared using methods of normal hypothesis testing for nested models (OTT 1991 Down), on the basis of the likelihood of the data given each of the two models compared. The quantity 2 ln({lambda}) = 2 ln(P(data|model1)/P(data|model2)) is asymptotically {chi}2 distributed, with degrees of freedom equal to the difference in the number of parameters characterizing the models (i.e., adding one extra epoch increases the number of parameters by two). The larger this quantity, the more significant the improvement that was achieved by the introduction of the extra epoch. If the quantity is small, the improvement in data fit does not warrant the introduction of the extra parameters.


*  RESULTS
*TOP
*ABSTRACT
*Modeling the distribution of...
*Demographic history
*METHODS
*RESULTS
*DISCUSSION
*APPENDIX
*LITERATURE CITED

Modeling allele frequency:
We considered a diploid population whose demographic history was described by a series of epochs such that the effective population size was stepwise constant within each epoch (e.g., Fig 1) and showed that the expected number of samples carrying a mutant allele can be described by a closed, easily computable mathematical formulation (see METHODS). We derived a method for incorporating the same frequency ascertainment bias into AFS models that was introduced into real data by the sampling strategies used during SNP discovery and for revealing the strategies's consequent effect on SNP population frequency (METHODS). We illustrate the effect of this bias under different values of ascertainment sample size (Fig 2A). As expected, the bias toward sample enrichment for common polymorphisms is strongest when SNPs are discovered in a pair of chromosomes, and it gradually disappears as discovery sample size increases. Under a stationary population history, the folded spectrum under ascertainment in two chromosomes is a constant function of frequency (METHODS), and deviations from a horizontal line signal a nonstationary history that is easy to detect and interpret. In Fig 2B, we contrast the ascertainment bias-corrected, minor allele frequency spectra for notable, competing scenarios of demographic history. When a population expands, an increasing number of chromosomes simultaneously incur new mutations, which results in an overabundance of rare alleles in the spectrum. Conversely, a population collapse is a rapid loss of chromosomes, and the alleles present at high frequency are more likely to be carried by surviving chromosomes than are their rare counterparts. For that reason a collapse generates an overrepresentation of common alleles. Finally, AFS under a bottleneck history (a reduction of effective size followed by a phase of recovery) carries the signature of both the phase of collapse (a valley at intermediate frequencies) and that of growth (elevated signal at low frequencies).

We report a procedure to transform allele counts at a given sample size to a lower, target sample size (METHODS). Using this equivalence sample size reduction procedure, allele count observations at all sites can be reduced to the equivalent counts at a lower, "common denominator" sample size, as illustrated in Fig 3. This procedure is useful for analyzing allele counts at sites where the number of available genotypes is variable either because a fraction of attempted genotyping experiments failed or when merging data sets in which the attempted sample sizes are different. In such cases one selects a target sample size and applies the reduction procedure to transform allele counts observed at higher sample sizes to the equivalent counts at this lower target sample size. It is then possible to fit the resulting single AFS containing the contribution of all available data instead of fitting multiple, often sparse spectra, one for each sample size present in the data.



View larger version (81K):
In this window
In a new window
Download PPT slide
 
Figure 3. Sample size reduction. Folded, normalized allele frequency distribution for each sample size (n = 42, ... , 84) present in the European allele count data (gray) is shown. The allele frequency spectra obtained using the equivalence sample size reduction technique (METHODS) are also shown for various equivalence sample sizes (m = 21, 31, and 41; green).

Minor allele frequency spectra observed in samples representing different world populations show differential demographic histories:
The SNP Consortium (http://snp.cshl.org), an organization formed primarily for the discovery of a large set of human SNPs, has made well over 1 million polymorphic sites available in the public domain (SACHIDANANDAM et al. 2001 Down). Most of these SNPs were discovered by comparing sequencing read fragments from multi-ethnic, anonymous, whole-genome shotgun subclone libraries to the public genome reference sequence (SACHIDANANDAM et al. 2001 Down); i.e., the vast majority of the SNPs were found in a discovery size of two chromosomes (k = 2). Quasi-random subsets of these candidate sites were then selected for frequency characterization in samples representing European-American, African-American, and East Asian populations (for sample identifiers see http://snp.cshl.org/allele_frequency_project/panels.shtml). In this study, we chose the largest data set of allele frequency counts resulting from genotypes provided by Orchid Biosciences, of 42 individuals (84 chromosomes) drawn from each of the three populations (http://snp.cshl.org/allele_frequency_project/). Experimental results were reported for 33,538 sites. For a significant fraction of the sites genotyping was unsuccessful for one or more of the populations attempted. In some other cases, although genotyping was successful, all samples carried the same allele and hence the site could not be confirmed as polymorphic. For the purpose of our study, we restricted our attention to those sites where (1) genotyping from each of the three sample groups was successful (genotyping for a given population was considered successful if genotype data were obtained for at least half the population samples, i.e., 21 individuals, even if only one of the alternative alleles was seen in that population) and (2) the site was polymorphic within at least one of the three population samples. Of the total 21,407 sites that were successfully genotyped in all three populations the European samples were polymorphic at 18,660 sites, the African samples at 20,587 sites, and the Asian samples at 17,369 sites. At a given site, the total number of alleles counted varied between 42 (the minimum number possible, in case only 21 diploid individuals were successfully genotyped within a population) and 84, the maximum possible if all 42 individuals within a population sample were successfully genotyped. To use all the data available, we have applied our equivalence sample size reduction procedure (METHODS) to convert the allele count data to a common denominator sample size. Because the identity of the ancestral and the mutant allele was not known, we used the allele counts of the less frequent (or minor) allele, giving rise to a folded spectrum (METHODS). To avoid the "doubling" effect associated with folding the allele frequency spectrum when the sample size is an even number, as described in METHODS and in particular by Equation 3, we chose the common denominator sample size as m = 41, i.e., the first odd number below the (even) sample size 42. The unfolded spectrum hence lies between 1 and 40 (sizes 0 and 41 indicate monomorphisms). Accordingly, the folded spectrum lies between minor allele sizes 1 and 20, for each of the three population-specific sample groups (Fig 4, first column). The allele frequency data used in our analysis are available through our web site: www.ncbi.nlm.nih.gov/IEB/Research/GVWG/AFS-2003/.



View larger version (65K):
In this window
In a new window
Download PPT slide
 
Figure 4. Model fitting to folded AFS observed in population-specific genotype data reduced to common sample size, m = 41. (a) European spectrum. (b) Asian spectrum. (c) African-American spectrum. First column, observed allele frequency spectrum (black), best-fitting three-epoch theoretical model prediction (green), and prediction under stationary effective size (red); second column, breakdown of mutations according to age within each frequency class of the best-fitting model spectra [color bands correspond to a range of 1000 generations (e.g., black band, 1–1000 generations; red band, 1001–2000 generations)]; third column, distribution of mutation times (generations in the past) at each frequency, based on 1 million simulation replicates. Notched box: 25%, median, 75%. Whiskers: min/max values. Open square: mean value. Open circle: 5%, 95% values.

To assess the signals of population history within these observed distributions, we generated allele frequency spectra as predicted under competing scenarios of population history of varying complexity: stationary history (one epoch), expansion or collapse (two epoch), and all possible shapes of three-epoch histories (METHODS). For a given set of model parameters, we generated the corresponding theoretically predicted, ascertainment bias-corrected minor allele frequency spectrum and evaluated the degree of fit between the prediction and the observations (METHODS). For each population-specific data set and for each model structure (number of epochs), we determined the best-fitting model parameters and the corresponding measures of goodness of fit. By definition of the likelihood function used for data fitting, the best-fitting model parameters are the maximum-likelihood parameter estimates for that model class (Table 1).


 
View this table:
In this window
In a new window

 
Table 1. Results of fitting multi-epoch models of allele frequency spectrum to population-specific observed allele frequency data

The normalized observed allele frequency distributions for each population group and the corresponding best-performing distributions within each model class are shown in Fig 4. In all three population-specific spectra, stationary history is a poor descriptor of the data, both by visual inspection and by examination of the fit values in Table 1. The best-fitting two-epoch model for all three spectra is that of expansion (Table 1). In the European (Fig 4A) and in the Asian (Fig 4B) samples the best-fitting three-epoch model is one of a bottleneck-shaped history. In the European data, the curve fit produced by the bottleneck profile is a very significant improvement over that produced by histories of expansion. In the Asian data, the improvement is still significant but to a lesser degree. The best-fitting three-epoch models in African-American data (Fig 4C) represent a two-step population increase of moderate size.

In addition to the best-fitting models, a range of parameter values produced comparably good fit to the observations. We have examined parameter sets that produced likelihood values that were at least 90% of the value obtained for the best-fitting three-epoch parameter set. Analysis of these "close to optimal" parameter values in the European data shows that both the size (N, effective number of individuals) and duration (T, generations) of the recovery phase was within a narrow range (N1 = 19,000–21,000, T1 = 2700–3000). Parameters of the bottleneck phase were in a wider range (N2 = 1000–4000 and T2 = 200–1300), with several alternative pairs available: longer but less severe bottlenecks or shorter, more severe bottlenecks. Given the potential interest in a possible bottleneck in the history of European populations, we further investigated the strength of the bottleneck signal by fixing the recovery size and duration parameters (N1 = 20,000, T1 = 3000) and varying the bottleneck size N2 and duration T2 in fine increments (20). For each parameter combination, we evaluated the goodness of fit to the European spectrum as measured by the {chi}2 statistics and reported the resulting probability surface in Fig 5. The best-fitting parameter combinations (ones not rejected by the {chi}2 test even at the 99.8% level) lie on a slightly curved line between the following pairs: effective size of 1040 during the bottleneck for 240 generations and effective size 2320 for 560 generations. The most likely model, at this resolution, is a bottleneck effective size of 1560 for 360 generations. These values and the ratio of effective population size and bottleneck duration being nearly constant in a large region are in good agreement with previous reports (REICH et al. 2001 Down). In the Asian data (Fig 4B), all parameters including those characterizing the bottleneck phase were within a tight range: N2 = 3000–5000, T2 = 600–1000, N1 = 24,000–26,000, and T1 = 3000–3200. Similarly narrow ranges were observed for the African-American data (Fig 4C): N2 = 16,000, T2 = 13,000–15,000, N1 = 26,000–30,000, and T1 = 2000–2600.



View larger version (28K):
In this window
In a new window
Download PPT slide
 
Figure 5. Bottleneck size and duration in the European samples. The probability surface of the effective size and the duration of a bottleneck are shown. Size of the ancestral epoch is fixed at N3 = 10,000, size of the present epoch is fixed at 20,000, and the duration of the present epoch is fixed at T1 = 3000. Parameter regions indicated by shading fall into the same bin of significance. Note that the P values indicated are the direct {chi}2 probabilities (i.e., 1 minus the tail probability).


*  DISCUSSION
*TOP
*ABSTRACT
*Modeling the distribution of...
*Demographic history
*METHODS
*RESULTS
*DISCUSSION
*APPENDIX
*LITERATURE CITED

Significance of the allele frequency analysis methods presented here:
Equation 1 (METHODS) provides a simple and rapid way to generate expected distributions of allele frequency under stepwise constant models of effective population size history. This procedure is orders of magnitude faster than tabulating simulation replicates, especially for large sample sizes, permitting fast generation of model spectra to explore large parameter spaces at high resolution. The method of ascertainment bias calculation we have presented permits the interpretation of allele frequency spectra measured at polymorphic sites selected from existing variation resources. Our procedure of equivalence sample size reduction enables the analysis of realistic data sets with genotyping failures. All three of the above procedures are firmly rooted within the coalescent framework. Model calculations directly correspond to experimentally observable quantities, without referencing directly unobservable quantities such as the overall population frequency of alleles. The data-fitting methodology is conceptually simple and allows direct comparison of the degree of fit between each of the three population samples examined, at each grid point (parameter combination).

Differential population histories in the three sample sets:
On the basis of the goodness of fit between models and observations (Table 1), a history of stationary population size can be confidently rejected for all three sets of samples. Introduction of even very simple dynamics into the history has dramatically improved data fit. There were large differences among the allele frequency spectra observed in the three populations (Fig 4 and Table 1). Clearly, the shapes of the European and the Asian spectra are closer to each other than either is to the shapes of the African spectra. On the basis of the three-epoch models, both the European and the Asian data are best explained by bottleneck-shaped histories, whereas the best-fitting third-order model for the African-American data is a continued expansion. The results of hierarchical model testing (METHODS) in Table 1 show that the inclusion of the third epoch did not significantly improve the fit to the African-American data. However, the bottleneck history is a dramatic improvement over the best-fitting two-epoch growth models in both the European and Asian data. Considering the range of models that produced close to optimal fit values, but using a fixed, 20-year generation time, the European bottleneck represented a 2.5- to 10-fold decline in population size, lasting 200–1300 generations [4–26 thousand years (KY)]. This was followed by a phase of 5- to 20-fold population expansion, starting 2700–4300 generations (54–86 KY) ago. The Asian bottleneck represented a 2- to 3-fold decline for 600–1000 generations (12–20 KY), followed by 5- to 8-fold growth starting 3000–4200 generations (60–84 KY) ago. The best-fitting models for the African-American data represent uninterrupted growth of effective population size, with the expansion clearly starting earlier than is evident in our European or the Asian data.

Earlier mitochondrial and microsatellite studies report data that are predominantly consistent with expansion-type histories of effective population size. The main evidence that points to expansion is negative values of Tajima's D and an excess of low-frequency alleles. The start of such expansion is estimated between 30 and 130 KYA (HARPENDING and ROGERS 2000 Down). Nuclear data, especially in samples of non-African origin, seem to show a different pattern, an excess of common variants (HEY 1997 Down; CLARK et al. 1998 Down; REICH et al. 2001 Down, REICH et al. 2002 Down). Simulation results have suggested that a bottleneck-shaped history of effective population size consisting of a phase of collapse followed by a recent phase of size recovery can reconcile this seeming contradiction between observations from different mutation systems (FAY and WU 1999 Down; HEY and HARRIS 1999 Down). These studies characterize bottleneck-shaped histories by a size expansion ratio (in our notation N1/N2) and a bottleneck severity index (in our notation T2/N2) and consider moderate bottlenecks where the expansion ratio is 20 and the severity index is in the range of 0.25 and 4.0. Our own estimates (expansion ratio 5–20 for Europeans, 5–8 for Asians, and severity index of ~0.2 for both populations) are in general agreement with these values and signify bottlenecks on the less severe end of the spectrum. Our estimates for the start of the recovery phase (54–86 KYA for Europeans, 60–84 KYA for Asians) are well within the range of the mitochondrial and microsatellite estimates. The fact that our best-fitting two-epoch models indicate expansion-type histories for all three populations we examined is also consistent with conclusions from mitochondrial and microsatellite data. A valuable reality check of an inferred demographic model is its implied pairwise nucleotide diversity value, {theta}. Although our data-fitting analysis of the relative spectrum does not provide absolute estimates for {theta}, these values can be obtained on the basis of the best-fitting models by fixing the ancestral size N3 and mutation rate µ. For each of the three populations, we use a common ancestral effective size of 10,000 and common mutation rate of 2 x 10-8 [a value that lies between recent, prominent estimates for average per-nucleotide, per-generation human mutation rate (NACHMAN and CROWELL 2000 Down; KONDRASHOV 2003 Down )]. This leads to an estimate of {theta} = 7.88 x 10-4 for the European model, in good agreement with previously reported values for other genome-wide data sets (SACHIDANANDAM et al. 2001 Down; VENTER et al. 2001 Down; MARTH et al. 2003 Down). The prediction from the Asian data is slightly higher, 8.24 x 10-4. The pairwise {theta} predicted by the best-fitting model for the African-American data is 10.29 x 10-4, significantly higher than that observed within the European and Asian samples, and in agreement with the general consensus that nucleotide diversity is higher in sub-Saharan samples than in non-African data (RELETHFORD and JORDE 1999 Down; PRZEWORSKI et al. 2000 Down; JORDE et al. 2001 Down; TISHKOFF and WILLIAMS 2002 Down). All three estimates are well within realistic values, lending further credence to the validity of our model parameters.

A bottleneck-shaped history was also our best-fitting three-epoch model structure for MD distributions observed in overlap fragments of public genome clone data (MARTH et al. 2003 Down). However, the parameter estimates are significantly different between these two studies. Our estimates from MD data indicated a less severe bottleneck of nearly identical duration and a shorter phase of recovery of more modest size as compared to the AFS in the European samples. Multiple factors may contribute to these differences. First, the DNA samples for the two studies came from different donors. Second, some fraction of the large-insert clones sequenced for the construction of the public genome reference sequence originate from libraries that are not of European origin [although there appears to be an overrepresentation of European sequences (WEBER et al. 2002 Down), presumably due to the origin of a single bacterial artificial chromosome library with the largest contribution]. If indeed an appreciable fraction of the data represents sub-Saharan DNA, the resultant MD in these mixed data could indicate a less severe bottleneck than would have been evident in a distribution containing only European data.

To understand the consequences of the differential histories that best describe the three population-specific data sets, we have partitioned the corresponding frequency spectra according to the age of the mutations (METHODS) that gave rise to the polymorphisms (Fig 4, second column). According to these tabulations, 35.9% of the European polymorphisms originated in <10,000 generations, as did a similar fraction, 34.9%, in the Asian model. In contrast, only 29.6% of the African mutation are younger than 10,000 generations. This indicates that the bottleneck events that explain the European and Asian data have eliminated a large fraction of the polymorphisms that predated these events, and a larger fraction of current polymorphisms are of a more recent origin as compared to the African data. This effect is most visible at the common end of the spectrum: only a negligible fraction of the common African SNPs are young, but an appreciable fraction of common European and Asian SNPs have originated <10,000 generations ago and have drifted to high population frequency. Finally, the third column of Fig 4 shows the average age of SNPs at given frequencies, confirming that SNPs at a higher frequency are expected to be older than SNPs at lower frequencies. Also, in each frequency class, the expected age of African SNPs is substantially higher than that of European or Asian SNPs, corroborating earlier observations noting the more ancient origins of African SNPs.

The differential demographic histories of the three populations examined also have important consequences for the extent of allelic association in the human genome, when the different populations are considered. To illustrate this point, we have carried out coalescent simulations, taking into account the individual best-fitting histories, and tabulated the average extent of linkage disequilibrium (LD) between markers separated by different values of recombination fraction (for a fixed value of per-nucleotide, per generation recombination rate, the recombination fraction translates into physical distance), as shown in Fig 6. Similar demographic histories distilled from the Asian and European samples result in similar values of LD at a given marker distance. LD is predicted to decay more rapidly (roughly twice as fast) for the best-fitting demographic history for the African-American samples, in agreement with previous reports (REICH et al. 2001 Down). Differences in the extent of allelic association within the genome are expected to have profound consequences for medical association studies.



View larger version (24K):
In this window
In a new window
Download PPT slide
 
Figure 6. The average extent of linkage disequilibrium, as predicted by the best-fitting, three-epoch demographic models for the three population samples. Values of r2 and the corresponding values of recombination fraction are shown for each of the three populations. On the right-hand side, we have indicated the equivalent physical distances assuming a genome average per-nucleotide, per-generation recombination rate, (METHODS).

Caveats and open problems:
Clearly, our multi-epoch, stepwise models of demographic history represent simplified versions of the "true" demographic past. Nevertheless, our three-epoch models go beyond the majority of previous studies that explore even simpler models of past population dynamics such as expansion vs. collapse or are restricted to the rejection of stationary effective size on the basis of summary statistics. Consideration of the third-order dynamics in this study allowed us to reveal a phase of bottleneck in the history characterizing the European and the Asian samples, permitting reconciliation of the signals of recent population growth apparent in mitochondrial and microsatellite data with realistic, observed values of nucleotide diversity.

Although the signal of differential history is undeniable in the data, the effect is confounded by the fact that the discovery and genotyping data sets were not drawn from a single population. SNP discovery was performed in shotgun sequences from ethnically diverse libraries (with ethnic association of individual reads unknown) aligned to the public genome reference sequence (SACHIDANANDAM et al. 2001 Down), presumably representing a mixture of ethnicities, with a bias toward clones from European donors (WEBER et al. 2002 Down). Polymorphic sites generated by this effort were then selected for genotyping in ethnically well-defined samples. It has been previously noted that collections of samples from multiple ethnicities contain a surplus of rare SNPs when measured in the same mixed collection (PTAK and PRZEWORSKI 2002 Down). However, it is unclear what the allele frequency of the same SNPs is when measured separately, within subpopulations. If the ethnicity of the discovery and the genotyping samples were known, one could estimate the effect of the ascertainment bias with models of population subdivision using coalescent simulation (PLUZHNIKOV et al. 2002 Down). The effect of ascertainment bias between ethnically mismatched or undefined samples is the subject of future investigation.

Additionally, internal population substructure can also distort the frequency spectrum (PRZEWORSKI 2002 Down; PTAK and PRZEWORSKI 2002 Down). Unfortunately, the little amount of information that was available concerning sample origin did not permit incorporation of this effect into our models in a meaningful fashion. Specifically, we did not take into account in our models the effects of recent admixture in the African-American samples. Although the AFS in these samples are best modeled by population growth, it carries a slight but noticeable dip at medium minor allele frequencies, a feature present in a more pronounced form in both the European (Fig 4A) and the Asian (Fig 4B) spectra. This potentially signifies the contribution of European ancestral lineages on the background of African lineages (RYBICKI et al. 2002 Down) in the AFS signal.

We must also acknowledge that the current shape of human variation structure is the result of a combination of neutral and nonneutral (selective) forces. The current state of the art in recognizing the effects of selection in variation data has been reviewed recently (BAMSHAD and WOODING 2003 Down). Positive selection resulting in genetic hitchhiking can mimic the effects of population expansion in that it gives rise to an excess of low-frequency alleles (KAPLAN et al. 1989 Down; BRAVERMAN et al. 1995 Down). Recent efforts have been aimed at detecting loci that exhibit signatures of positive selection (CARGILL et al. 1999 Down; SUNYAEV et al. 2000 Down; AKEY et al. 2002 Down; PAYSEUR et al. 2002 Down). However, the exact proportion of genes that have been targets of strong positive selection within our evolutionary past is unclear (BAMSHAD and WOODING 2003 Down). It is also unclear, in general, how far the effects of hitchhiking extend beyond the locus under selection (WIEHE 1998 Down). Given that only a few percent of the human genome represents coding DNA, and that not all genes are expected to be targets of positive selection, we speculate that the distortion due to selective forces on the AFS in our data set of >20,000 randomly selected genomic loci is small when compared to the global effects of drift modulated by long-term demography.

Conclusion:
The allele frequency spectrum is an excellent data source for modeling demographic history because of its independence of the effects of recombination and local, or sequence composition-specific variations of mutation rates and because the experimental determination of the allele frequency spectrum requires measurement of allelic states only at single-nucleotide positions, instead of sequencing of long stretches of contiguous DNA. The emergence of population-specific genotype sets on the genome scale provides sufficient data for the direct comparison of model-predicted and observed spectra with great resolution. This permits us to improve on previous conclusions drawn on the strength of summary statistics, on the basis of data from a handful of loci. Recent advances in allele frequency modeling should provide us with exciting, new tools to explore our demographic past and explain human haplotype structure. Accurate reconstruction of the history of world populations should also help us to detect and interpret differences that must be taken into account during the development of general resources for medical use such as the recently initiated human Haplotype Map Project (CARDON and ABECASIS 2003 Down; CLARK 2003 Down; WALL and PRITCHARD 2003 Down).


*  ACKNOWLEDGMENTS

The authors are indebted to Andrew Clark for useful comments on the manuscript. We also thank Ravi Sachidanandam for kindly providing earlier versions of the allele frequency data set analyzed in this study.

Manuscript received April 15, 2003; Accepted for publication September 4, 2003.


*  APPENDIX
*TOP
*ABSTRACT
*Modeling the distribution of...
*Demographic history
*METHODS
*RESULTS
*DISCUSSION
*APPENDIX
*LITERATURE CITED

THE EXPECTED NUMBER OF SEGREGATING SITES IN A SAMPLE DRAWN FROM A POPULATION CHARACTERIZED BY A PIECEWISE CONSTANT, MULTI-EPOCH HISTORY OF EFFECTIVE SIZE
Model:
We consider a population of a given organism evolving under the Wright-Fisher model and under selective neutrality. Let us select a specific site in the genome of the organism. Furthermore, let us randomly draw n DNA samples from this population. Without regard to recombination, the samples possess a unique tree-shaped genealogy at the selected site (the site genealogy). Such a genealogy can be described within the framework of the coalescent: starting with n samples in the present and, through a series of coalescent events (pairs of samples finding their common ancestors), this number reduces to 1, the most recent common ancestor (MRCA), or the root of the genealogy at that site (site root). At a given time, the process is said to be in state j, if at that time the current number of samples is j. This process is Markovian, in that the length of time until the next coalescent event depends only on the current state and is independent of the previous states. Due to molecular mutation processes, the nucleotide observed at the site under consideration might be different in different individuals. Let us assume that, at any given site, only two possible nucleotides are observed (diallelic variations). Accordingly, an individual carries either the allele that was present in the site root (also known as the ancestral allele) or a mutant or derived allele. Let us further assume that the mutant allele is the result of a single mutation event (infinite-sites assumption) within an ancestral sample of the site genealogy. Under this assumption, the number of samples that carry the derived allele is identical to the number of descendants of that ancestor within the site genealogy. Conversely, the derived allele is found in exactly i samples if and only if the ancestor in which the mutation occurred gave rise to i descendants. Under the further assumption of a constant-rate mutation process (HUDSON 1991 Down), the likelihood that a given mutation is of size i is related to the number of ancestral nodes with i descendants within the site genealogy and to the "life span" of these ancestors. As Fu shows in a seminal work (FU 1995 Down), this likelihood can be expressed with the length of time the site genealogy spends in state k, i.e., while the number of ancestor samples within the genealogy is exactly k. Under the further assumption of constant effective population size N, Fu then derives an explicit formula for the expected length of time in state k, leading to a simple result for the expected number of mutations of a given size within n samples (FU 1995 Down).

Our final goal is to extend this result from constant to merely piecewise constant population size. To this end, we use a standard continuous approximation according to which the probability density function of the length of time t spent in state k within the genealogy is exponential under a constant population size, and for a diploid population,

Using this approximation, we derive the expectation for the length of time spent in state k, under piecewise constant population history of an arbitrary number of epochs. Under the assumption of a constant-rate mutation process, this allows us to compute the expectation for the number of mutations of size i, denoted by {Psi}i, observed at a single site, at sites having identical site genealogies (DNA without recombination), or at a collection of sites with completely independent site genealogies. Because the distributions are identical for every site, the result is also valid for a collection of sites.

Conventions and useful identities:
We use the convention that the value of an empty product is 1 and the value of an empty sum is 0. The probability density function of a random variable X is denoted by fX and its cumulative density function by FX. The variable X conditioned on the event Y is denoted by X|Y. Next, we briefly state three lemmas to aid further derivations. In the following we assume that the ai are different.

LEMMA 1. For every value of x, for each 1 <= l <= n,

(A1)

Proof. Let

we need to show that f(x) {equiv} 0. For r: l <= r <= n we have that

Since f(x) is of degree at most n - l and it has at least n - l + 1 different zeros, necessarily f(x) {equiv} 0. Q.E.D.

LEMMA 2. For k, i: 1 <= k < i <= n we have

(A2)

Proof.

ßk,k+1 = 0, and for i > k + 1

where

LEMMA 3. For s < k < i <= n:

(A3)

Proof. From Lemma 2,

LEMMA 4.


Proof. Using Lemma 1,

Constant effective population size:
First, we cons