- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Marth, G. T.
- Articles by Sherry, S. T.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Marth, G. T.
- Articles by Sherry, S. T.
The Allele Frequency Spectrum in Genome-Wide Human Variation Data Reveals Signals of Differential Demographic History in Three Large World Populations
Gabor T. Martha, Eva Czabarkaa, Janos Murvaia, and Stephen T. Sherryaa National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
Corresponding author: Gabor T. Marth, Boston College, 140 Commonwealth Ave., Chestnut Hill, MA 02467., marth{at}bc.edu (E-mail)
Communicating editor: L. EXCOFFIER
| ABSTRACT |
|---|
We have studied a genome-wide set of single-nucleotide polymorphism (SNP) allele frequency measures for African-American, East Asian, and European-American samples. For this analysis we derived a simple, closed mathematical formulation for the spectrum of expected allele frequencies when the sampled populations have experienced nonstationary demographic histories. The direct calculation generates the spectrum orders of magnitude faster than coalescent simulations do and allows us to generate spectra for a large number of alternative histories on a multidimensional parameter grid. Model-fitting experiments using this grid reveal significant population-specific differences among the demographic histories that best describe the observed allele frequency spectra. European and Asian spectra show a bottleneck-shaped history: a reduction of effective population size in the past followed by a recent phase of size recovery. In contrast, the African-American spectrum shows a history of moderate but uninterrupted population expansion. These differences are expected to have profound consequences for the design of medical association studies. The analytical methods developed for this study, i.e., a closed mathematical formulation for the allele frequency spectrum, correcting the ascertainment bias introduced by shallow SNP sampling, and dealing with variable sample sizes provide a general framework for the analysis of public variation data.
THE analysis of statistical distributions of genetic variations has a rich history in classical population genetic studies (![]()
![]()
![]()
![]()
| Modeling the distribution of allele frequency |
|---|
Prior study of the AFS has been restricted to properties of summary statistics such as Tajima's D (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
| Demographic history |
|---|
The reconstruction of human demographic history is of direct biological and anthropological interest. Additionally, the history of effective population size has a profound effect on important quantities such as the extent of linkage disequilibrium and is therefore important for medical association studies. There have been many attempts for demographic inference from contemporary molecular data representing different molecular mutation systems such as mitochondrial DNA polymorphisms (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
|
| METHODS |
|---|
Allele frequency spectrum under stepwise constant effective population size:
We show that, for a population evolving under the Wright-Fisher model, and under selective neutrality, the expectation for the number of mutations
i of size i, within a sample of n chromosomes under a demographic history of multi-epoch, piecewise constant effective population size is
![]() |
(1) |
where µ is the (constant) per-locus mutation rate, Nm is the effective population size in epoch m, Tm is the corresponding epoch duration, and
the normalized epoch boundary time. A detailed derivation of this result is given in the Appendix. The normalized distribution of these expectations according to the frequency is the allele frequency spectrum:
![]() |
(2) |
It is sometimes useful to consider the "full" allele frequency spectrum, Pfulln(i), considering sizes 0 and n, i.e., when all samples carry the ancestral or the derived allele, respectively. We have verified the accuracy of the complete allele frequency spectrum derived from this formulation by coalescent simulations (supplemental Figure S1 at http://www.genetics.org/supplemental/). Three important properties of the allele frequency spectrum are clear from Equation 1. First, the expectation for a given frequency is linear under simultaneous scaling of all effective population sizes and epoch durations (i.e., as long as Tm and Nm are multiplied by the same constant for each m), hence the relative frequency spectrum remains unchanged. This fact can be exploited to reduce the number of parameters that characterizes a given demographic model under consideration. Second, the expected number of mutations of a given size for more than one nucleotide site is simply the sum of the individual expectations, without regard to any possible correlation among the site genealogy of proximal sites. Therefore, our results for the expected number of segregating sites as well as the allele frequency spectrum are also valid for polymorphisms at a single locus of arbitrary sequence length, without regard to possible recombination within the locus, or for polymorphisms collected from throughout the genome. This latter consideration allows us to apply the theoretical expectations derived here for the data set examined, without regard to the amount and structure of linkage between the sites represented within the set. Third, the allele frequency spectrum is independent of the actual value of the per-nucleotide, per-generation mutation rate, as long as this rate is uniform for every site considered.
Minor allele frequency spectrum (folded spectrum):
In situations where allele frequency is determined experimentally by counting the two alternative alleles within a sample of n chromosomes, it is uncertain which of the two alleles is the mutant allele. In such situations, instead of the true frequency, we work with the frequency of the less frequent (or minor) allele (![]()
![]() |
(3) |
By this definition, if n is even,
n(n/2) = 2Pn(n/2), i.e., twice the value we would expect to measure, leading to a "doubling effect." This fact needs to be taken into account during the interpretation of measured data. Because in many data sets available for analysis the ancestral allelic state is currently unknown, the folded spectrum is important in practice.
Numerical calculation of the allele frequency spectrum:
Frequency spectrum calculations were implemented in the C programming language. Some care must be taken when calculating the expected spectrum, because computing Equation 1 requires the evaluation of alternating sums, a source of numeric instability when the individual terms are close in value. Instability can be avoided by accurate calculation of each term. The higher the sample size, the more accurately each term has to be evaluated. We do not have a systematic way to predict the accuracy requirement as a function of sample size, hence we determined the accuracy requirement for a given sample size by trial and error. In our implementation, we have used high-accuracy numeric libraries with settable numeric precision. Our experience has been that, up to a sample size n = 100, a numeric precision of 100 decimal places was sufficient for our calculations. Evaluation of the allele frequency spectrum for a sample size of 1000 required a numerical precision of
500 decimal places.
Correcting ascertainment bias:
To describe the situation where polymorphic sites discovered in a set of samples are genotyped in a second, independently drawn set of samples for frequency characterization we divide the two independent groups of samples into a "discovery" group consisting of k samples and a "genotyping" group consisting of n samples. The discovery process is modeled by considering only those sites within the n + k samples that are polymorphic (i.e., are of size between 1 and k - 1) within the discovery group of depth k and discarding those sites that are monomorphic in this group, as these sites would not be considered for subsequent genotyping. The conditional probability, Pn|k(i), that a site is of size i within the n genotyping samples given that it is polymorphic in the k discovery samples is:
![]() |
(4) |
It is possible that a site that appears polymorphic within the k discovery samples is monomorphic within the n genotyping samples. As a result, the conditional probabilities Pn|k(0) and Pn|k(n) are typically nonzero, and one has to renormalize after the transformation to get the AFS. It is easy to verify that Equation 4 is also valid for calculating the folded conditional spectrum
n|k(i), as defined in Equation 3, provided that both folded spectra
k(i) and
n+k(i) are available. This property makes it possible to account for the ascertainment bias when only the folded allele frequency distributions are available. For the sake of completeness, we include the conditional spectrum for the important special case, k = 2, i.e., ascertainment within a pair of chromosomes:
![]() |
(5) |
It is easy to show that under a stationary history the spectrum is a linear function of i, and the folded spectrum is constant (Fig 2A).
|
We point out that our method of ascertainment bias correction improves on an earlier method based on using the measured discrete allele frequency as an estimator for the overall allele frequency within the population (![]()
Reduction of allele frequency counts to equivalent counts at a lower sample size:
Often allele frequency data are the result of genotyping a target number, nt, of individuals at a collection of polymorphic sites. Because of genotyping failures, however, the actual number of genotypes available at different locations is smaller and often varies from site to site. At sites where an identical number, n, of successfully determined chromosomal allelic states are available we denote the distribution of allele counts by Cn(i) and the corresponding probability distribution obtained by normalizing these counts by Pn(i). Sites with different numbers of successful genotypes are not directly comparable. To enable joint analysis of allele counts observed at all sites genotyped in the experiment, we have devised a procedure that, given an observed distribution of allele frequencies among samples, produces an equivalent distribution at a lower sample size, m. This is achieved by, first, considering all possible choices of m subsamples selected from the total n available samples, in such a way that each choice is equally likely and, second, requiring that the total number of observations remains the same. Under these assumptions, the "equivalent" allele counts,
m(i), for m subsamples are
![]() |
(6) |
![]() |
(7) |
Note that this procedure does not allow one to generate a higher sample size distribution on the basis of a lower sample size distribution. Also note that, even if the higher sample size distribution was a relative allele frequency spectrum, the resulting lower sample size distribution will contain nonzero terms for size 0 and for size m. Clearly, the first case is the result of the possibility that the omission of n - m chromosomes left us with 0 mutant alleles, and the second is that only mutant alleles remained. This results in a slight reduction of the total number of relative counts as compared to the original observations. To obtain the AFS, one omits sizes 0 and m in Equation 7 and renormalizes. It is easy to verify that the equivalence reduction also works for the folded allele frequency distribution.
We point out that our reduction procedure is not equivalent to frequency binning, a procedure sometimes employed to compare allele counts available at different samples sizes. Aggregating discrete allele frequency data on the basis of a nominal allele frequency c/n, the ratio of allele counts and the sample size, results in data distortion stemming from two sources. First, for a given sample, the inherent base frequency is fn = n-1. In general, only window sizes that are integer multiples of fn will preserve the uniform appropriation of allele sizes into frequency bins. This may be impossible if multiple sample sizes are present in the data. Second, sites with identical nominal allele frequencies but different sample sizes are not equivalent; e.g., a site with a minor allele count of 1 in 3 samples is clearly not equivalent to a site with a minor allele count of 10 in 30 samples. Distortions from both sources are most pronounced at lower sample sizes. Our equivalence reduction procedure is a technique of data aggregation that is free of such distortions. This point is further illustrated in supplemental Figure S3 at http://www.genetics.org/supplemental/, where we compared the AFS resulting from simple binning of all available data for the European samples to the AFS we obtain by the equivalence data reduction procedure presented here.
Coalescent simulations and tabulation of linkage disequilibrium:
We used coalescent simulations to verify the accuracy of our allele frequency spectrum calculations (supplemental Figure S1), to tabulate measures of linkage disequilibrium, and to tabulate distributions of mutation age. To perform these simulations, we have implemented a widely used, direct coalescent algorithm (![]()
Expectations for the extent of linkage disequilibrium were generated according to a previously published method (![]()
![]() |
(8) |
where A and a denote the mutant and the ancestral alleles at the first marker location, and B and b are the alternative alleles at the second marker location. The quantities pA, pa, pB, and pb are the corresponding allele frequency measurements, and pAB is the measured frequency of the haplotype defined by the combination of allele A at the first marker position and B at the second marker position. Finally, marker age was tabulated by registering the time of occurrence for each of the mutations during the simulations.
Model fitting to observed allele frequency spectra:
The primary objective of the fitting experiments is to determine the distribution of the posterior probability of the model parameters given the observed data: P(model|data). With the help of our closed formula for the direct calculation of the AFS we were able to generate the expected AFS for a complete, high-resolution, multidimensional grid overlaid on the parameter space that we intended to explore. This direct approach yielded the likelihood distribution, P(data|model), computed at each grid point. Given that there is no sensible way to assign an "informed" prior distribution to the model parameters, the distribution of the likelihood function is equivalent to the posterior distribution and can be used in ranking competing parameters. We point out that an alternative method of achieving the same goal is to use a Markov-chain Monte Carlo (MCMC) technique to obtain the posterior distribution (![]()
![]()
Stepwise constant models of one, two, and three epochs were considered. For each model class defined by the number of epochs, a vector of parameters describing the model was considered, including the effective population size and the duration of the epoch (expressed in terms of generations). We have sampled each effective size parameter, Ni, between 1000 and 150,000 in steps of 1000 up to 30,000 and in steps of 5000 beyond 30,000, and each epoch duration parameter, Ti, between 100 and 50,000 in steps of 100 up to 10,000 and in steps of 500 beyond 10,000. Because of the scaling equivalence of the relative distribution discussed earlier, we fixed the ancestral size (the effective size of the epoch farthest in the past) parameter at 10,000, for each model class. We have generated the unbiased allele frequency spectra by direct calculation using Equation 1, for a sample size of m + 2, where m = 41 is the (common) sample size after data reduction, and k = 2 is the discovery size. We then computed the conditional spectrum using Equation 4. Finally, we folded the spectrum using the definition given in Equation 3. To quantify the degree of fit between a given model and the observations we have used the likelihood of the observed data conditioned on the model:
![]() |
(9) |
For generating the likelihood surface for the European bottleneck size vs. duration we used the
2 metric defined as
![]() |
(10) |
In the above notations, ci is the observed number of sites of size i, c is the number of total sites, pi is the predicted (relative) probability of size i, and m is the common sample size to which all observations were reduced using the equivalence data reduction procedure outlined earlier.
Comparison between models with different epoch numbers:
Models within the same structure (same epoch number) could be directly compared on the basis of any of the three goodness-of-fit metrics discussed above. Models with different numbers of epochs were compared using methods of normal hypothesis testing for nested models (![]()
) = 2 ln(P(data|model1)/P(data|model2)) is asymptotically
2 distributed, with degrees of freedom equal to the difference in the number of parameters characterizing the models (i.e., adding one extra epoch increases the number of parameters by two). The larger this quantity, the more significant the improvement that was achieved by the introduction of the extra epoch. If the quantity is small, the improvement in data fit does not warrant the introduction of the extra parameters.
| RESULTS |
|---|
Modeling allele frequency:
We considered a diploid population whose demographic history was described by a series of epochs such that the effective population size was stepwise constant within each epoch (e.g., Fig 1) and showed that the expected number of samples carrying a mutant allele can be described by a closed, easily computable mathematical formulation (see METHODS). We derived a method for incorporating the same frequency ascertainment bias into AFS models that was introduced into real data by the sampling strategies used during SNP discovery and for revealing the strategies's consequent effect on SNP population frequency (METHODS). We illustrate the effect of this bias under different values of ascertainment sample size (Fig 2A). As expected, the bias toward sample enrichment for common polymorphisms is strongest when SNPs are discovered in a pair of chromosomes, and it gradually disappears as discovery sample size increases. Under a stationary population history, the folded spectrum under ascertainment in two chromosomes is a constant function of frequency (METHODS), and deviations from a horizontal line signal a nonstationary history that is easy to detect and interpret. In Fig 2B, we contrast the ascertainment bias-corrected, minor allele frequency spectra for notable, competing scenarios of demographic history. When a population expands, an increasing number of chromosomes simultaneously incur new mutations, which results in an overabundance of rare alleles in the spectrum. Conversely, a population collapse is a rapid loss of chromosomes, and the alleles present at high frequency are more likely to be carried by surviving chromosomes than are their rare counterparts. For that reason a collapse generates an overrepresentation of common alleles. Finally, AFS under a bottleneck history (a reduction of effective size followed by a phase of recovery) carries the signature of both the phase of collapse (a valley at intermediate frequencies) and that of growth (elevated signal at low frequencies).
We report a procedure to transform allele counts at a given sample size to a lower, target sample size (METHODS). Using this equivalence sample size reduction procedure, allele count observations at all sites can be reduced to the equivalent counts at a lower, "common denominator" sample size, as illustrated in Fig 3. This procedure is useful for analyzing allele counts at sites where the number of available genotypes is variable either because a fraction of attempted genotyping experiments failed or when merging data sets in which the attempted sample sizes are different. In such cases one selects a target sample size and applies the reduction procedure to transform allele counts observed at higher sample sizes to the equivalent counts at this lower target sample size. It is then possible to fit the resulting single AFS containing the contribution of all available data instead of fitting multiple, often sparse spectra, one for each sample size present in the data.
|
Minor allele frequency spectra observed in samples representing different world populations show differential demographic histories:
The SNP Consortium (http://snp.cshl.org), an organization formed primarily for the discovery of a large set of human SNPs, has made well over 1 million polymorphic sites available in the public domain (![]()
![]()
|
To assess the signals of population history within these observed distributions, we generated allele frequency spectra as predicted under competing scenarios of population history of varying complexity: stationary history (one epoch), expansion or collapse (two epoch), and all possible shapes of three-epoch histories (METHODS). For a given set of model parameters, we generated the corresponding theoretically predicted, ascertainment bias-corrected minor allele frequency spectrum and evaluated the degree of fit between the prediction and the observations (METHODS). For each population-specific data set and for each model structure (number of epochs), we determined the best-fitting model parameters and the corresponding measures of goodness of fit. By definition of the likelihood function used for data fitting, the best-fitting model parameters are the maximum-likelihood parameter estimates for that model class (Table 1).
|
The normalized observed allele frequency distributions for each population group and the corresponding best-performing distributions within each model class are shown in Fig 4. In all three population-specific spectra, stationary history is a poor descriptor of the data, both by visual inspection and by examination of the fit values in Table 1. The best-fitting two-epoch model for all three spectra is that of expansion (Table 1). In the European (Fig 4A) and in the Asian (Fig 4B) samples the best-fitting three-epoch model is one of a bottleneck-shaped history. In the European data, the curve fit produced by the bottleneck profile is a very significant improvement over that produced by histories of expansion. In the Asian data, the improvement is still significant but to a lesser degree. The best-fitting three-epoch models in African-American data (Fig 4C) represent a two-step population increase of moderate size.
In addition to the best-fitting models, a range of parameter values produced comparably good fit to the observations. We have examined parameter sets that produced likelihood values that were at least 90% of the value obtained for the best-fitting three-epoch parameter set. Analysis of these "close to optimal" parameter values in the European data shows that both the size (N, effective number of individuals) and duration (T, generations) of the recovery phase was within a narrow range (N1 = 19,00021,000, T1 = 27003000). Parameters of the bottleneck phase were in a wider range (N2 = 10004000 and T2 = 2001300), with several alternative pairs available: longer but less severe bottlenecks or shorter, more severe bottlenecks. Given the potential interest in a possible bottleneck in the history of European populations, we further investigated the strength of the bottleneck signal by fixing the recovery size and duration parameters (N1 = 20,000, T1 = 3000) and varying the bottleneck size N2 and duration T2 in fine increments (20). For each parameter combination, we evaluated the goodness of fit to the European spectrum as measured by the
2 statistics and reported the resulting probability surface in Fig 5. The best-fitting parameter combinations (ones not rejected by the
2 test even at the 99.8% level) lie on a slightly curved line between the following pairs: effective size of 1040 during the bottleneck for 240 generations and effective size 2320 for 560 generations. The most likely model, at this resolution, is a bottleneck effective size of 1560 for 360 generations. These values and the ratio of effective population size and bottleneck duration being nearly constant in a large region are in good agreement with previous reports (![]()
|
| DISCUSSION |
|---|
Significance of the allele frequency analysis methods presented here:
Equation 1 (METHODS) provides a simple and rapid way to generate expected distributions of allele frequency under stepwise constant models of effective population size history. This procedure is orders of magnitude faster than tabulating simulation replicates, especially for large sample sizes, permitting fast generation of model spectra to explore large parameter spaces at high resolution. The method of ascertainment bias calculation we have presented permits the interpretation of allele frequency spectra measured at polymorphic sites selected from existing variation resources. Our procedure of equivalence sample size reduction enables the analysis of realistic data sets with genotyping failures. All three of the above procedures are firmly rooted within the coalescent framework. Model calculations directly correspond to experimentally observable quantities, without referencing directly unobservable quantities such as the overall population frequency of alleles. The data-fitting methodology is conceptually simple and allows direct comparison of the degree of fit between each of the three population samples examined, at each grid point (parameter combination).
Differential population histories in the three sample sets:
On the basis of the goodness of fit between models and observations (Table 1), a history of stationary population size can be confidently rejected for all three sets of samples. Introduction of even very simple dynamics into the history has dramatically improved data fit. There were large differences among the allele frequency spectra observed in the three populations (Fig 4 and Table 1). Clearly, the shapes of the European and the Asian spectra are closer to each other than either is to the shapes of the African spectra. On the basis of the three-epoch models, both the European and the Asian data are best explained by bottleneck-shaped histories, whereas the best-fitting third-order model for the African-American data is a continued expansion. The results of hierarchical model testing (METHODS) in Table 1 show that the inclusion of the third epoch did not significantly improve the fit to the African-American data. However, the bottleneck history is a dramatic improvement over the best-fitting two-epoch growth models in both the European and Asian data. Considering the range of models that produced close to optimal fit values, but using a fixed, 20-year generation time, the European bottleneck represented a 2.5- to 10-fold decline in population size, lasting 2001300 generations [426 thousand years (KY)]. This was followed by a phase of 5- to 20-fold population expansion, starting 27004300 generations (5486 KY) ago. The Asian bottleneck represented a 2- to 3-fold decline for 6001000 generations (1220 KY), followed by 5- to 8-fold growth starting 30004200 generations (6084 KY) ago. The best-fitting models for the African-American data represent uninterrupted growth of effective population size, with the expansion clearly starting earlier than is evident in our European or the Asian data.
Earlier mitochondrial and microsatellite studies report data that are predominantly consistent with expansion-type histories of effective population size. The main evidence that points to expansion is negative values of Tajima's D and an excess of low-frequency alleles. The start of such expansion is estimated between 30 and 130 KYA (![]()
![]()
![]()
![]()
![]()
![]()
![]()
0.2 for both populations) are in general agreement with these values and signify bottlenecks on the less severe end of the spectrum. Our estimates for the start of the recovery phase (5486 KYA for Europeans, 6084 KYA for Asians) are well within the range of the mitochondrial and microsatellite estimates. The fact that our best-fitting two-epoch models indicate expansion-type histories for all three populations we examined is also consistent with conclusions from mitochondrial and microsatellite data. A valuable reality check of an inferred demographic model is its implied pairwise nucleotide diversity value,
. Although our data-fitting analysis of the relative spectrum does not provide absolute estimates for
, these values can be obtained on the basis of the best-fitting models by fixing the ancestral size N3 and mutation rate µ. For each of the three populations, we use a common ancestral effective size of 10,000 and common mutation rate of 2 x 10-8 [a value that lies between recent, prominent estimates for average per-nucleotide, per-generation human mutation rate (![]()
![]()
= 7.88 x 10-4 for the European model, in good agreement with previously reported values for other genome-wide data sets (![]()
![]()
![]()
predicted by the best-fitting model for the African-American data is 10.29 x 10-4, significantly higher than that observed within the European and Asian samples, and in agreement with the general consensus that nucleotide diversity is higher in sub-Saharan samples than in non-African data (![]()
![]()
![]()
![]()
A bottleneck-shaped history was also our best-fitting three-epoch model structure for MD distributions observed in overlap fragments of public genome clone data (![]()
![]()
To understand the consequences of the differential histories that best describe the three population-specific data sets, we have partitioned the corresponding frequency spectra according to the age of the mutations (METHODS) that gave rise to the polymorphisms (Fig 4, second column). According to these tabulations, 35.9% of the European polymorphisms originated in <10,000 generations, as did a similar fraction, 34.9%, in the Asian model. In contrast, only 29.6% of the African mutation are younger than 10,000 generations. This indicates that the bottleneck events that explain the European and Asian data have eliminated a large fraction of the polymorphisms that predated these events, and a larger fraction of current polymorphisms are of a more recent origin as compared to the African data. This effect is most visible at the common end of the spectrum: only a negligible fraction of the common African SNPs are young, but an appreciable fraction of common European and Asian SNPs have originated <10,000 generations ago and have drifted to high population frequency. Finally, the third column of Fig 4 shows the average age of SNPs at given frequencies, confirming that SNPs at a higher frequency are expected to be older than SNPs at lower frequencies. Also, in each frequency class, the expected age of African SNPs is substantially higher than that of European or Asian SNPs, corroborating earlier observations noting the more ancient origins of African SNPs.
The differential demographic histories of the three populations examined also have important consequences for the extent of allelic association in the human genome, when the different populations are considered. To illustrate this point, we have carried out coalescent simulations, taking into account the individual best-fitting histories, and tabulated the average extent of linkage disequilibrium (LD) between markers separated by different values of recombination fraction (for a fixed value of per-nucleotide, per generation recombination rate, the recombination fraction translates into physical distance), as shown in Fig 6. Similar demographic histories distilled from the Asian and European samples result in similar values of LD at a given marker distance. LD is predicted to decay more rapidly (roughly twice as fast) for the best-fitting demographic history for the African-American samples, in agreement with previous reports (![]()
|
Caveats and open problems:
Clearly, our multi-epoch, stepwise models of demographic history represent simplified versions of the "true" demographic past. Nevertheless, our three-epoch models go beyond the majority of previous studies that explore even simpler models of past population dynamics such as expansion vs. collapse or are restricted to the rejection of stationary effective size on the basis of summary statistics. Consideration of the third-order dynamics in this study allowed us to reveal a phase of bottleneck in the history characterizing the European and the Asian samples, permitting reconciliation of the signals of recent population growth apparent in mitochondrial and microsatellite data with realistic, observed values of nucleotide diversity.
Although the signal of differential history is undeniable in the data, the effect is confounded by the fact that the discovery and genotyping data sets were not drawn from a single population. SNP discovery was performed in shotgun sequences from ethnically diverse libraries (with ethnic association of individual reads unknown) aligned to the public genome reference sequence (![]()
![]()
![]()
![]()
Additionally, internal population substructure can also distort the frequency spectrum (![]()
![]()
![]()
We must also acknowledge that the current shape of human variation structure is the result of a combination of neutral and nonneutral (selective) forces. The current state of the art in recognizing the effects of selection in variation data has been reviewed recently (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
Conclusion:
The allele frequency spectrum is an excellent data source for modeling demographic history because of its independence of the effects of recombination and local, or sequence composition-specific variations of mutation rates and because the experimental determination of the allele frequency spectrum requires measurement of allelic states only at single-nucleotide positions, instead of sequencing of long stretches of contiguous DNA. The emergence of population-specific genotype sets on the genome scale provides sufficient data for the direct comparison of model-predicted and observed spectra with great resolution. This permits us to improve on previous conclusions drawn on the strength of summary statistics, on the basis of data from a handful of loci. Recent advances in allele frequency modeling should provide us with exciting, new tools to explore our demographic past and explain human haplotype structure. Accurate reconstruction of the history of world populations should also help us to detect and interpret differences that must be taken into account during the development of general resources for medical use such as the recently initiated human Haplotype Map Project (![]()
![]()
![]()
| ACKNOWLEDGMENTS |
|---|
The authors are indebted to Andrew Clark for useful comments on the manuscript. We also thank Ravi Sachidanandam for kindly providing earlier versions of the allele frequency data set analyzed in this study.
Manuscript received April 15, 2003; Accepted for publication September 4, 2003.
| APPENDIX |
|---|
THE EXPECTED NUMBER OF SEGREGATING SITES IN A SAMPLE DRAWN FROM A POPULATION CHARACTERIZED BY A PIECEWISE CONSTANT, MULTI-EPOCH HISTORY OF EFFECTIVE SIZE
Model:
We consider a population of a given organism evolving under the Wright-Fisher model and under selective neutrality. Let us select a specific site in the genome of the organism. Furthermore, let us randomly draw n DNA samples from this population. Without regard to recombination, the samples possess a unique tree-shaped genealogy at the selected site (the site genealogy). Such a genealogy can be described within the framework of the coalescent: starting with n samples in the present and, through a series of coalescent events (pairs of samples finding their common ancestors), this number reduces to 1, the most recent common ancestor (MRCA), or the root of the genealogy at that site (site root). At a given time, the process is said to be in state j, if at that time the current number of samples is j. This process is Markovian, in that the length of time until the next coalescent event depends only on the current state and is independent of the previous states. Due to molecular mutation processes, the nucleotide observed at the site under consideration might be different in different individuals. Let us assume that, at any given site, only two possible nucleotides are observed (diallelic variations). Accordingly, an individual carries either the allele that was present in the site root (also known as the ancestral allele) or a mutant or derived allele. Let us further assume that the mutant allele is the result of a single mutation event (infinite-sites assumption) within an ancestral sample of the site genealogy. Under this assumption, the number of samples that carry the derived allele is identical to the number of descendants of that ancestor within the site genealogy. Conversely, the derived allele is found in exactly i samples if and only if the ancestor in which the mutation occurred gave rise to i descendants. Under the further assumption of a constant-rate mutation process (![]()
![]()
![]()
Our final goal is to extend this result from constant to merely piecewise constant population size. To this end, we use a standard continuous approximation according to which the probability density function of the length of time t spent in state k within the genealogy is exponential under a constant population size, and for a diploid population,

Using this approximation, we derive the expectation for the length of time spent in state k, under piecewise constant population history of an arbitrary number of epochs. Under the assumption of a constant-rate mutation process, this allows us to compute the expectation for the number of mutations of size i, denoted by
i, observed at a single site, at sites having identical site genealogies (DNA without recombination), or at a collection of sites with completely independent site genealogies. Because the distributions are identical for every site, the result is also valid for a collection of sites.
Conventions and useful identities:
We use the convention that the value of an empty product is 1 and the value of an empty sum is 0. The probability density function of a random variable X is denoted by fX and its cumulative density function by FX. The variable X conditioned on the event Y is denoted by X|Y. Next, we briefly state three lemmas to aid further derivations. In the following we assume that the ai are different.
LEMMA 1. For every value of x, for each 1
l
n,
![]() |
(A1) |
Proof. Let

we need to show that f(x)
0. For r: l
r
n we have that

Since f(x) is of degree at most n - l and it has at least n - l + 1 different zeros, necessarily f(x)
0. Q.E.D.
LEMMA 2. For k, i: 1
k < i
n we have
![]() |
(A2) |
Proof.

ßk,k+1 = 0, and for i > k + 1

where
LEMMA 3. For s < k < i
n:
![]() |
(A3) |
Proof. From Lemma 2,
LEMMA 4.

Proof. Using Lemma 1,
Constant effective population size:
First, we cons
















(METHODS).

