We propose a novel approximate-likelihood method to fit demographic models to human genomewide single-nucleotide polymorphism (SNP) data. We divide the genome into windows of constant genetic map width and then tabulate the number of distinct haplotypes and the frequency of the most common haplotype for each window. We summarize the data by the genomewide joint distribution of these two statistics—termed the HCN statistic. Coalescent simulations are used to generate the expected HCN statistic for different demographic parameters. The HCN statistic provides additional information for disentangling complex demography beyond statistics based on single-SNP frequencies. Application of our method to simulated data shows it can reliably infer parameters from growth and bottleneck models, even in the presence of recombination hotspots when properly modeled. We also examined how practical problems with genomewide data sets, such as errors in the genetic map, haplotype phase uncertainty, and SNP ascertainment bias, affect our method. Several modifications of our method served to make it robust to these problems. We have applied our method to data collected by Perlegen Sciences and find evidence for a severe population size reduction in northwestern Europe starting 32,500–47,500 years ago.
A major goal of evolutionary genetics is to infer the demographic history of a population. This is traditionally done by fitting a population genetic model to sequence data taken from a sample of individuals. The population genetic model often includes parameters allowing for changes in population size or population structure with or without migration. Such parameters are interesting in their own right, but are critical to define a proper “null model” that can be used to find “unusual” genes that may be targets of positive or negative selection (Jensen et al. 2005). Additionally, a proper demographic model is important for assessing genomewide patterns of positive and negative selection (Boyko et al. 2008; Lohmueller et al. 2008).
Methods have been developed that make full use of sequence data to infer demographic parameters (Griffiths and Tavaré 1994; Kuhner et al. 1995). These methods are computationally intensive and are impractical for all but the smallest data sets. Thus, researchers have turned to methods based on summary statistics (reviewed in Marjoram and Tavaré 2006). Summary statistics can be quickly calculated from the data and then be used to infer model parameters using either a likelihood or approximate Bayesian computation (ABC) framework (for example, Wall 2000a; Fagundes et al. 2007). The key for successful application of this approach is to find summaries of the data that contain enough information about the demographic parameters of interest. One of the most successfully used summary statistics for population genetic inference, the site frequency spectrum (SFS) (Nielsen 2000; Adams and Hudson 2004; Caicedo et al. 2007; Hernandez et al. 2007b), is a sufficient statistic for the full data if the single-nucleotide polymorphisms (SNPs) are unlinked. However, in reality, all SNPs are not unlinked. The amount of information in the data lost by ignoring the correlations among SNPs, or linkage disequilibrium (LD), in demographic inference is an open question, but recent theoretical work suggests that it may be nonnegligible (Myers et al. 2008).
An additional complication to using the SFS for demographic inference is that many genomewide genetic variation data sets in humans contain SNPs that were discovered through sequencing a small number of individuals. The discovered SNPs were then genotyped in a larger set of individuals, sometimes in a different population than was used for SNP discovery. Since this SNP discovery process will lead to preferential sampling of intermediate-frequency alleles, the SFS computed from SNP genotype data will differ substantially from the true SFS (Nielsen et al. 2004; Clark et al. 2005). Progress has been made to analytically correct the SFS for ascertainment bias when the SNP discovery process is known (Nielsen et al. 2004), but often this is not the case. More problematic is the situation where SNPs were discovered by resequencing individuals in one population, but then are genotyped in a second population. It remains an open question as to how well the SNPs discovered in the first population are representative of genetic diversity in the second population. Several authors have suggested that statistics based on combinations of multiple SNPs, or haplotypes, will be less susceptible to ascertainment bias than single-SNP frequencies or heterozygosities (Conrad et al. 2006). However, while this suggestion is encouraging, as yet there has not been extensive investigation into the precise ascertainment conditions under which this is true.
It is known that haplotype patterns and LD can be affected by both recombination and demographic history (Pritchard and Przeworski 2001), making these measures useful statistics for inference. Many recent studies have assumed a demographic model (often the standard neutral model) and then used either LD or haplotype patterns to estimate recombination rates (Wall 2000a; Hudson 2001; Li and Stephens 2003; McVean et al. 2004; Myers et al. 2005). Other studies have taken the opposite approach and assumed that the recombination rate is known and then used LD or haplotype patterns to estimate demographic parameters (Reich et al. 2001; Innan et al. 2005; Voight et al. 2005; Schaffner et al. 2005; Leblois and Slatkin 2007; Tenesa et al. 2007). The way in which haplotype information has been used for demographic inference is quite variable among studies. For example, Reich et al. (2001) examined how well several different demographic models predicted the observed decay of pairwise LD in humans, rather than estimating the model parameters. Francois et al. (2008) and Thornton and Andolfatto (2006) used ABC to estimate model parameters in Arabidopsis and Drosophila, respectively. However, summaries based on the distribution of the number of haplotypes were only one of several summary statistics considered, and it is unclear how much information came from the haplotype information vs. the other single-SNP diversity measures. While Anderson and Slatkin (2007) and Leblois and Slatkin (2007) developed methods that use haplotype information exclusively to fit a population split followed by growth model, their model is quite restrictive and allows inference only of one free parameter, the number of founding lineages. Thus, there has not been a systematic investigation as to the utility of haplotype information for inference in general, parameter-rich models, such as those involving population expansions and bottlenecks.
In this article we propose an approximate-likelihood method to estimate parameters in complex demographic models from genomewide SNP genotype (rather than full resequencing) data, using the joint distribution of the number of haplotypes and frequency of the most common haplotype in windows across the genome. We provide extensive simulations evaluating the performance of our method for growth and bottleneck models. These results indicate that a great deal of information regarding demographic history is captured by these two summary statistics. We also extensively test the robustness of our method to many practical problems with genetic data sets in humans. Specifically, we show that for many realistic SNP discovery protocols and levels of population divergence, our method is relatively robust to SNP ascertainment bias. We also found that our method is sensitive to recombination rate variation across the genome (as many haplotype-based summaries will be), and we incorporate a model of recombination rate variation into the inference scheme. Finally, since haplotype phase is often ambiguous, we provide a practical approach to circumvent this problem. We applied our method to genomewide SNP genotype data generated by Perlegen Sciences (Hinds et al. 2005). Using the CEU sample (consisting of individuals from Utah with northwestern European ancestry), we find evidence for a recent population bottleneck in northwestern Europe.
We summarize the genomewide data by the joint distribution of two haplotype statistics calculated from windows across the genome. Our method requires that we have a genetic map of the organism in question. Using this map, we divide the genome into windows of fixed genetic map distance, cwindow. The parameter cwindow is tunable to the diversity and recombination rates of the organism under study. We chose to divide the genome into nwindow nonoverlapping windows using genetic map distance so that each window will have the same expected amount of recombination within it and, consequently, the same expected number of haplotypes (Wall 2000a).
In many genomewide SNP data sets, some parts of the genome will have a small number of SNPs while other areas will contain many SNPs. In principle, while this could be due to mutation rate variation across the genome, variations in the time to the most recent common ancestor, or random chance, another likely explanation is ascertainment bias—some parts of the genome were more extensively screened for SNPs than others and consequently have more SNPs. Thus, we do not want our method to use any information about the number of SNPs within a given window. To ensure that all windows of the genome have the same number of SNPs, we select a subset of nsnp SNPs for each window of the genome. Again, nsnp is a function of the size of the windows as well as the SNP density. Another complicating factor is that SNPs may not have been discovered from the population under study, but instead from a second population. Since rare SNPs are more likely to be population specific, and consequently not equally ascertained in all populations, we include only those SNPs with minor allele frequency (MAF) ≥10%.
Having selected a subset of intermediate-frequency SNPs from each window, we can compute the number of distinct haplotypes as well as the count of the most common haplotype in a sample of n chromosomes. The HCN statistic is the genomewide joint distribution of these two statistics. Specifically, let , where Xij denotes the number of windows having i haplotypes where the most common haplotype has count j out of n. In principle, ; however, in practice, we bin intervals in the HCN statistic for the inference so that fewer simulation replicates will be needed to obtain an accurate estimate of the expected HCN (see below), and thus there are fewer than bins in the HCN statistic. Ideally, we wish to integrate over all possible sets of nsnp SNPs within each window when constructing the HCN statistic. However, this is not computationally feasible, so we generate 10 random matrices, each using a different randomly selected set of nsnp SNPs from each window. We then average these 10 matrices as our final X matrix to be used for inference. This is done to reduce Monte Carlo variance resulting from selecting a single set of SNPs. An example of the HCN statistic for several demographic models is shown in Figure 1.
We chose to use the number of haplotypes as a summary statistic because it is a sufficient statistic for the population mutation rate in the infinite-alleles model (Ewens 1972) and has been shown by simulation to be informative about population history (Depaulis and Veuille 1998; Innan et al. 2005). The count of the most common haplotype was also suggested as a test statistic in the infinite-alleles model (Ewens 1973) and has been found to be correlated with haplotype homozygosity (Zeng et al. 2007 and data not shown). The joint distribution of these two statistics performs better at distinguishing among demographic models than using either summary on its own (Figure 1). For example, the number of haplotypes is more informative about overall population size than is the count of the most common haplotype (compare Ncur = 10,000 to Ncur = 5000), as expected, since larger populations have a higher population recombination rate, than smaller populations, resulting in a larger number of haplotypes per window (Wall 2000a). Note that because we selected nsnp SNPs with MAF ≥10% per window, the fact that the larger population has a higher value of does not inflate the observed number of haplotypes per window. A recent bottleneck results in an intermediate number of haplotypes, but the stronger signature of the bottleneck is the excess proportion of windows where the most common haplotype is at unusually high frequency. These patterns are due to an elevated rate of coalescence during the bottleneck, which, for some simulated windows, results in there being fewer lineages available to recombine. A recent population expansion also results in an intermediate number of haplotypes, but without an increase in the number of windows where the frequency of the most common haplotype is unusually high.
The HCN statistic contains no information about how different haplotypes within a window are from each other. To add this information, we also considered another summary statistic Hpair, the distribution across the genome of the average number of pairwise differences between haplotypes. For all pairs of haplotypes within a given window, we simply counted the number of SNPs (which could range from 0 to nsnp) where the two haplotypes differed and counted the average. Hpair is the vector giving the number of windows having a given number of average pairwise differences. We show (supporting information, File S1 and Figure S1) that this statistic is not robust to SNP ascertainment bias and do not use it in further analyses.
We consider two different single-population demographic models. These models and their associated parameters are shown in Figure 2. Figure 2A shows a two-epoch model that is used for modeling population growth. Here there are three parameters to estimate: the current population size, Ncur; the ancestral population size, Nmid; and the time that growth has occurred, tcur. Figure 2B shows a three-epoch model that has five free parameters: the current population size, Ncur; the population size during the bottleneck, Nmid; the ancestral population size, Nanc; the time when the bottleneck started (going backward in time), tcur; and the duration of the bottleneck, tmid. All times are in units of generations. We note that although these models (and all models in population genetics) are arbitrary simplifications of the true demographic history, the hope is that they capture some essential features of population history.
Fitting models to data:
Since the observed HCN statistic follows a multinomial distribution, we fit demographic models to the data using an approximate-likelihood approach (see Weiss and Von Haeseler 1998; Wall 2000a; Fearnhead and Donnelly 2002; Plagnol and Wall 2006). We define , where pij is the probability that a window has i haplotypes where the most common haplotype is at count j. The approximate-likelihood function for the demographic parameters (Θ) can be written as(1)We use the coalescent with recombination (Hudson 1983, 2002) to find pij for the demographic parameter combination of interest. We simulate z replicates using the demographic parameters of interest (Θ) and ρ = 4Ncurcwindow. We estimate the matrix as the proportion of simulation replicates falling in a particular bin of the HCN statistic. Formally, define the indicator function to be equal to 1 if simulation replicate w has i haplotypes and the count of the most common haplotype is j, and equal to 0 otherwise. Then(2)
Since we select nsnp SNPs from each window, does not explicitly enter into these simulations. Therefore, instead of setting an arbitrary value of , and then randomly selecting nsnp SNPs, we use the “fixed S approach” (Hudson 1993) to add mutations onto the ancestral recombination graph (ARG). Specifically, nsnp mutations are randomly placed onto each simulated ARG such that these SNPs will have MAF ≥10%. To reduce Monte Carlo error, this process is repeated 10 times for each ARG. Each time, we evaluate and increment the appropriate bin of . Note that we record 10 different matrices and after the desired number of simulation replicates, we keep the average of the 10 matrices as our final matrix. This is an approximate-likelihood function since we are approximating using simulations, rather than calculating it exactly, and we also treat all windows of the genome as being mutually independent.
We optimize the approximate-likelihood function described above using a grid search since we are approximating the likelihoods by simulation and the simulation variance may be nonnegligible, misleading deterministic hill-climbing approaches. The number of grid points and number of simulation replicates used to maximize the approximate-likelihood function vary among analyses and are given below.
Since variation in recombination rates at a fine scale can affect the HCN statistic, we have added a model of recombination hotspots into our inference method. We describe the parameters used for specific instances below. Since each window of the genome corresponds to the same genetic map distance (cwindow), the number of base pairs per window will differ among windows. In our simulations to find , we select the size of the window in base pairs (denoted L) from the observed distribution of physical distances. We then set r, the per base pair recombination rate, to be constant across the window such that rL will give cwindow. We then simulate an ARG in the normal manner with recombination rate rL, but then, similar to the method used by Li and Stephens (2003), we model hotspots by changing the relationship between physical and genetic distance. Informally, the parts of the window where hotspots occur are assigned fewer base pairs and consequently have a lower probability of a SNP occurring in them than windows with lower recombination rates.
Simulations to evaluate performance:
We tested the performance of our method by simulating data under three different demographic models: (1) ancient population growth, (2) recent population growth, and (3) a recent population bottleneck. The parameter values for these models are shown in Figures 3–6⇓⇓⇓. These models were chosen because of their relevance to human demographic history. For each model we simulated data sets (500 for models assuming uniform recombination rates and 100 for models with recombination hotspots), each consisting of 2000 independent 250-kb windows in a sample of 40 chromosomes where μ = 1 × 10−8/bp/generation. Note that when generating test data sets, we placed a Poisson number of mutations onto the genealogies in the usual fashion (Hudson 1983), rather than using the fixed S approach as we did for the simulations used to estimate . We selected a subset of 20 SNPs (nsnp = 20) with MAF ≥10% from each window and constructed the observed HCN statistic for each data set. We repeated this process 10 times for each data set and used the average HCN statistic for inference. For computational efficiency, we performed the coalescent simulations to estimate over a grid of parameters (3780 and 20,580 parameter combinations for the growth and bottleneck models, respectively) once for each demographic and recombination model and stored the values to be used on subsequent data sets. The grid points used for each parameter are shown as the breaks in Figures 3–6⇓⇓⇓. For each grid point we used 104 coalescent simulations to approximate the likelihood. Using a representative data set, we then selected at least the top 103 grid points and ran an additional 105 replicates, and for points near the maximum-likelihood estimates (MLEs), we ran an additional 106 replicates. Due to computational constraints, these grids were not as dense as those used to estimate parameters in the Perlegen data.
For all data sets and demographic models, the total amount of recombination within each window simulated was 0.25 cM (i.e., cwindow = 0.25 cM). However, we also considered a model with five recombination hotspots present at random locations throughout each window. Each hotspot was 2 kb in size. The recombination rate (centimorgans per base pair) of each hotspot was drawn from a gamma distribution (shape = 0.5, scale = 2 × 10−6). We then rescaled the recombination rate of each hotspot such that 80% of the total amount of recombination in the window occurs within hotspots. We then assumed this model of recombination hotspots when inferring demographic parameters. The test data sets were generated using the program msHOT (Hellenthal and Stephens 2007).
Our method assumes that all windows in the genome are independent of each other. To assess the performance of our method when the windows are not independent, we simulated an additional 500 data sets using the same bottleneck model with cM. Here the 2000 windows within each data set were from 300 independent sets of 6 or 7 contiguous windows. All windows were treated as independent in the inference.
While for most of the simulations we assumed that there was no error in estimated recombination rates (i.e., ), we also determined what effect errors in the estimated genetic map had on our ability to accurately infer demographic parameters. Specifically, we simulated data sets under the same bottleneck model described above, but here instead of having cM, we drew cwindow for each window from a gamma distribution (shape = 10, scale = 0.025). From this distribution, ∼10% of windows have cwindow < 0.155 and ∼10% of windows have cwindow > 0.355. We then inferred demographic parameters when incorrectly fixing cM. We also correctly incorporated errors into the genetic map by drawing for each simulation replicate from the same gamma distribution used to generate the data.
Due to differences between the true HCN statistic and the HCN statistic constructed from phase-inferred haplotypes (Figure S2 and File S1), it is important to incorporate the phasing process into the inference. To do this, we suggest using the same phasing method that was used on the actual data to “phase” the simulated data used to estimate . Unfortunately, many phasing algorithms currently in use are computationally intensive and it would be nearly impossible to run these methods on the millions of coalescent simulation replicates used to find . For this reason, we examined the use of the computationally efficient parsimony phasing algorithm proposed by Clark (1990). If there are no individuals heterozygous at zero or one of the nsnp SNPs within a window or if there are genotypes that show no relation to known phased haplotypes, we arbitrarily assigned phase to a random individual and then used these two haplotypes to infer the rest. While this process may seem arbitrary, it can be done consistently both in the observed and in the simulated data sets. To make the method as computationally efficient as possible, we used only one ordering of the individuals. We assessed the performance of this approach by treating the simulated haplotypes in the test data sets as diploid genotypes and “phased” them using the parsimony method. For each simulation replicate to estimate we also phased the simulated data using the same parsimony method.
Since we found that the HCN statistic constructed using SNPs that were discovered in a SNP discovery sample ≥8 chromosomes was very similar to the HCN statistic with complete SNP ascertainment (Figure S3, Figure S4, Figure S5, Table S1, and File S1), we examined how ascertainment bias affected parameter estimates. Specifically, for the bottleneck model described above, we simulated a genotype sample of n = 40 and an additional SNP discovery sample of 6 chromosomes. Since the Perlegen SNPs were discovered using a multiethnic panel (Hinds et al. 2005), we included a SNP discovery sample of an additional 6 chromosomes from a second population (Ncur = 10,000) that 5000 generations ago split from the population that underwent the bottleneck. To construct the HCN statistic from these simulated data sets, we considered only SNPs with MAF in the genotype sample that were variable in the 12-chromosome ascertainment panel. To infer parameters, we assumed there was no ascertainment bias (i.e., we used the same lookup tables for that were described above that assumed complete SNP ascertainment).
Analysis of Perlegen data:
We applied our method to fit a bottleneck model to the CEU population genotyped by Perlegen Sciences (Hinds et al. 2005). We chose to use this population since there is previous evidence of a bottleneck in this population (e.g., Marth et al. 2004; Voight et al. 2005), and all SNPs that were discovered by the Perlegen resequencing arrays were later genotyped in the CEU sample, without regard to LD status. We note that HapMap phase II specifically did not genotype SNPs that were in high LD in the Perlegen study, and this ascertainment criterion complicates the analysis of those data (see discussion). We considered only autosomal (not X chromosome or mtDNA) SNPs with MAF in both the CEU and the African-American samples. Since our simulations of ascertainment bias suggest that SNPs needed to have been discovered from discovery sample sizes >2 chromosomes, we used only those SNPs that were discovered in Perlegen's resequencing arrays of the multiethnic diversity panel (type “A” SNPs). There were 615,415 SNPs that fit both of these criteria. We used Clark's parsimony phasing algorithm to phase haplotypes in both the real data and the simulation replicates to generate . For each population and in each window of the genome, we selected 10 random subsets of nsnp SNPs and constructed 10 different HCN statistics. We then used the average HCN statistic for inference.
We then set cwindow = 0.25 cM and nsnp = 20. We used the LDhat genetic map (International Hapmap Consortium 2007) to define windows since the deCODE map (based on pedigrees; Kong et al. 2002) does not have sufficient resolution for the scale of 0.25 cM (Myers et al. 2005). Since the quality of the genetic map used to divide the genome into windows can affect the inference, we drew from a gamma distribution (shape = 10, scale = 0.025) to model errors in the genetic map. This distribution has a mean of 0.25 and a variance of 0.00625. For the CEU data, nwindow = 8833.
We used a hotspot model similar to that of Schaffner et al. (2005; termed the “Schaffner hotspot model”). All hotspots had width of 2 kb. For each simulated window, hotspots occurred at random intervals drawn from a gamma distribution (shape = 0.3, scale = 8500/0.3), giving a mean spacing of 8.5 kb (variance of ∼2.41 × 108). Then the recombination rate (cM/2 kb) of each hotspot was drawn from another gamma distribution (shape = 0.3, scale ), where L is the physical size of the simulated window. In practice, L was drawn from the empirical distribution of physical distances for the nwindow windows. We then rescaled the recombination rate in the hotpots such that 88% of occurs within hotspots. The amount of recombination occurring outside of hotspots is then equal to . Figure S6 and Figure S7 show that this hotspot model matches the mean, the standard deviation, and the overall distribution (tabulated across all windows of the genome) of the observed inter-SNP genetic map distances quite well.
In addition to the Schaffner hotspot model described above, we also directly used the estimated fine-scale LDhat genetic map (International Hapmap Consortium 2007) as a guide to how recombination rates vary within windows (termed the “empirical hotspot” model). To do this, for each simulation replicate to estimate , we selected one of the 8833 windows at random and used the corresponding LDhat genetic map to delimit the relationship between genetic and physical distance for that replicate. We smoothed the map by allowing the recombination rate to change only at points >500 bp and >0.0001 cM apart. We note that this hotspot model does not match the observed inter-SNP genetic distances as well as the Schaffner hotspot model does (Figure S6 and Figure S7).
The grid to optimize the five-dimensional approximate-likelihood function consisted of 85,536 points for the Schaffner hotspot model and 101,088 points for the empirical hotspot model. We used 12,500 simulation replicates for all points, 105 replicates for at least the top 4000 points, and finally 106 replicates for at least the top 500 points. We found approximate 95% confidence intervals (C.I.'s) for single parameters using asymptotic theory (i.e., the C.I. included points <1.92 log-likelihood units from the maximum), with linear interpolation of the profile-likelihood curve to find points not directly simulated.
Performance on simulated data:
Figure 3 shows the distribution of the approximate MLEs of the three growth model parameters for simulated data sets under ancient growth (solid bars) and recent growth (open bars). In both cases, the method is relatively unbiased for all three parameters. For ancient growth, Ncur is estimated most accurately and tcur the least. For recent growth, all three parameters are equally accurate, although for any given parameter, the MLE is the true value ∼40% of the time. Note that the variance in the distribution of MLEs for tcur is much higher for ancient growth as compared to recent growth (making it the least precise as well as the least accurate). Table 1 shows that for both growth scenarios in 100% of the data sets, the true parameter values were within the asymptotic 95% C.I.'s (<3.9 log-likelihood units) around the MLEs. Additionally, in >95% of the data sets, the one-dimensional 95% C.I.'s for all three parameters from the profile-likelihood curves contain the true parameter values.
We also estimated the five parameters for a bottleneck model in simulated data sets. Figure 4 shows the distribution of the MLEs for the five bottleneck parameters. For the case of uniform recombination and cM, the mode of the distribution of the MLEs for each parameter is at the true value of the parameter. The distribution of MLEs is tightest for Nmid/Ncur and tcur and broadest for Nanc/Ncur. This suggests that the recent bottleneck greatly alters haplotype patterns such that its timing and severity can be accurately estimated, but so much so that less information about the ancestral, prebottleneck population size (Nanc/Ncur) remains. Furthermore, the method appears to be relatively unbiased since it over- and underestimates the true parameter value roughly equally. Table 1 shows that 99.8% of the time, the true parameter values are within the asymptotic 95% C.I.'s (within 5.5 log-likelihood units around the MLEs).
We next evaluated whether our method could accurately estimate demographic parameters in the presence of recombination hotspots (see methods for the recombination hotspot model used). Figure 4 shows that when properly modeling recombination hotspots, we are able to accurately estimate the five bottleneck model parameters. Note that the distributions of the MLEs for all parameters have larger variances than for the uniform recombination case. This pattern is due to the extra noise added by recombination hotspots. If a window of the genome has a low number of haplotypes and/or a high count of the most common haplotype, this could be due to demography (which is the only factor considered in the uniform recombination model) or due to SNPs falling in recombination coldspots. Consistent with this observation, Table 1 shows that a smaller proportion of the parameter space (99.65% vs. 99.87%) is >5.5 log-likelihood units from the MLEs when there are recombination hotspots, as compared to uniform recombination. Notably, however, the method still appears to be unbiased, and for all 100 data sets, the true parameter values are <5.5 log-likelihood units from the MLEs.
The simulations described above assumed that the windows in each data set were independent. In practice, the windows may be contiguous along the genome and thus are not independent. We examined the performance of our method on simulated data sets where some of the windows were linked. Figure 4 shows that the distributions of the MLEs for certain parameters have greater variance than when the windows are unlinked, which is not surprising since linked windows contain less information than independent windows. As shown in Table 1, in all cases except when errors in the genetic map are ignored or there is SNP ascertainment bias (see below), the true parameter values for >99% of the test data sets are within the asymptotic 95% C.I.'s (<3.9 and <5.5 log-likelihood units from the MLEs for growth and bottleneck models, respectively). This result suggests that the asymptotic C.I.'s may actually be conservative, since the true parameter values are contained within the interval >95% of the time. When examining data sets where some of the regions are linked, we find that for 99.4% of the time, the true values are <5.5 log-likelihood units from the MLEs (Table 1). Since in many cases, the 95% C.I.'s for individual parameters based on the profile log-likelihood curve also appeared conservative (Table 1), we assessed their coverage in the data sets with linkage. For each of the five parameters, the true parameter value is <1.92 log-likelihood units from the MLE in at least 96.8% of the data sets. These results suggest that for the level of nonindependence among windows, size of data sets, and parameter grid considered here, the asymptotic 95% C.I.'s remain conservative.
The above simulations assumed that . In practice, is estimated from a genetic map, based either on patterns of LD or on pedigrees. We next evaluated the performance of our method when is drawn from a gamma distribution to mimic errors in the estimated genetic map. We first assumed that cM when running the inference (i.e., we ignored the errors in the genetic map). Figure 5 shows that our method performs poorly compared to the case where the genetic map is known with certainty. In particular, it overestimates tcur and tmid. Due to the fact that some windows in the simulated data sets will have low recombination rates, these windows will have very few haplotypes and a high frequency of the most common haplotype because < 0.25 cM. Since we did not account for this in the inference, the method assumes that these low-diversity windows were due to a stronger (or longer) bottleneck. Table 1 shows that the true parameter values are nowhere near the MLEs in this case. If, however, during the inference, for each window is drawn from the same gamma distribution that generated the data, the method performs substantially better (Figure 5), although not quite as well as when cM. Likewise, 99.6% of the time, the true parameter values are <5.5 log-likelihood units from the MLEs. Note that, similar to what was seen for the case of recombination hotspots, on average, a smaller proportion of the parameter space (99.72% vs. 99.87%) is >5.5 log-likelihood units from the MLEs when and cwindow follow a gamma distribution instead of being fixed at 0.25 cM.
To properly correct for errors introduced from inferring haplotype phase, we decided to phase the simulations used to estimate using the same method as that used on the real data. We evaluated the performance of this strategy using Clark's phasing algorithm (Clark 1990) on simulated data sets. Figure 6 shows the distribution of the MLEs for the five bottleneck parameters. This strategy works reasonably well and for each parameter the mode of the distribution of the MLEs is at the true parameter value. Note that the distributions of the MLEs for the data sets phased using Clark's phasing algorithm are broader than those when haplotype phase in known with certainty. Additionally, a smaller proportion of the parameter space is excluded (>5.5 log-likelihood units from the MLEs) when using Clark's phasing algorithm as compared to known phase data (99.43% vs. 99.87%; Table 1). This finding illustrates that, compared to having phase-known haplotypes, some information is lost when computationally inferring haplotypes. However, the method appears reasonably unbiased, and in all simulated data sets the true parameter values are <5.5 log-likelihood units from the MLEs (Table 1).
To determine if we could accurately estimate bottleneck parameters in the presence of SNP ascertainment bias, we simulated data sets where the nsnp = 20 SNPs for each window were picked from those SNPs with MAF in the genotype sample and were variable in the 12-chromosome SNP ascertainment sample. Figure 6 shows that for Nmid/Ncur, tcur, and tmid, our method performs very well even in the presence of SNP ascertainment bias. The distributions of the MLEs for Ncur and Nanc/Ncur are more variable than when there is no ascertainment bias, and the modes of their distributions are not at the true parameter values, suggesting that MLEs of these parameters are less reliable in the presence of ascertainment bias. The 95% C.I.'s constructed from the profile-likelihood curves remain conservative for Nmid/Ncur, tcur, and tmid, but for Ncur the 95% C.I. is no longer conservative (i.e., the true value is within the 95% C.I. only 91.8% of the time). Additionally, the five-dimensional 95% C.I. is also slightly anticonservative (Table 1). For larger data sets (consisting of 10,000 independent regions) the C.I.'s for Ncur and Nanc/Ncur and the five-dimensional C.I. become even more anticonservative and the C.I. for tmid becomes slightly anticonservative (not shown), likely due the fact that the ascertainment model is misspecified, which has a stronger effect as the size of the data set increases.
Inference of bottleneck parameters for the Perlegen CEU population:
We fit the five-parameter bottleneck model to the Perlegen CEU population. Table 2 shows the MLEs and ∼95% C.I.'s for the five parameters when using both the Schaffner model of recombination hotspots and the empirical hotspot model based on the LDhat genetic map. Figure S8 shows that for both hotspot models, the HCNs generated using the MLE parameter estimates match the observed CEU HCN quite well. The current population size is estimated to be ∼10,000 when using both recombination models. There was a severe population size reduction (∼4.2–6.6% of the current size) lasting 260–552 generations (see Figure S9 for the two-dimensional profile-likelihood surface). The bottleneck ended (forward in time) ∼1017–1437 generations ago.
On the basis of the two-dimensional profile-likelihood surface (Figure S10), we estimate that the bottleneck began (tcur + tmid) 1500 generations ago—37,500 years, assuming 25 years/generation when using either hotspot model. Notably, the oldest start time within the asymptotic 95% C.I. (3 log-likelihood units, for 2 d.f.) is 1600 generations, or 40,000 years for the Schaffner hotspot model, and 1900 generations (47,500 years) for the empirical hotspot model.
One potential concern with using SNPs from the Perlegen SNP discovery project is that it contains a lot of missing data, resulting in some fraction of SNPs having been discovered in a smaller sample. To determine what effect this had on the inference of the bottleneck parameters, we examined the depth of the SNP discovery panel for the 615,415 SNPs used in constructing the HCN statistic used for demographic inference in the Perlegen CEU sample. We found that 11.8% of these SNPs were discovered by comparing fewer than eight chromosomes. We removed these SNPs and recomputed the HCN statistic from the Perlegen CEU data (now with 8174 windows) and reestimated the bottleneck parameters using the Schaffner recombination hotspot model. We found identical MLEs for the parameters as in our original analysis.
We have proposed a flexible approximate-likelihood method to estimate demographic parameters using haplotype summary statistics. We have shown that accurate estimates of demographic parameters can be made using genomewide SNP data sets of practical size. To the best of our knowledge, this is one of the first studies to estimate parameters in a demographic model in a likelihood framework using haplotype patterns from genomewide SNP genotype data. Furthermore, we have addressed many complications that arise in the analysis of genomewide data, such as recombination rate heterogeneity, errors in the estimated genetic map, haplotype phase uncertainty, and ascertainment bias. Provided that good genetic maps are available, our method could be applied to SNP data from other species, such as dogs and cattle, to estimate domestication bottleneck parameters.
One of the major disadvantages of our approach is its dependence on accurately modeling the distribution of recombination rates across the genome. Our simulations have shown that errors in the genetic map can cause poor performance. Therefore, we suggest applying our method only when there is an accurate genetic map for the species in question. We suggest incorporating a distribution on to allow for errors in the estimated genetic map. However, as the quality and resolution of genetic maps continue to improve, the utility and accuracy of our method will also continue to increase. For species where recombination rates vary at the fine scale, it is crucial to incorporate some model of recombination hotspots. Here for the Perlegen CEU data, we have implemented a parametric model as well as empirical estimates based on LD patterns. A similar influence of the assumed recombination rate on the demographic parameter estimates was noted in Thornton and Andolfatto (2006), who used the variance in the number of haplotypes across windows as one of their summary statistics. We recommend, as done in Thornton and Andolfatto (2006), using different recombination models and then comparing the final results to assess how dependent the estimates are on the assumed recombination model. While the dependence of our method on accurate estimates of the recombination rate is not ideal, we point out that many previous methods in molecular evolution and population genetics are dependent on accurate estimates of the mutation rate and will be biased if erroneous estimates are used.
We also assume that the genetic map remains constant over time and is the same across populations. Recombination hotspots do not appear in the same locations in chimps and humans despite a high level of sequence identity (Ptak et al. 2005; Winckler et al. 2005). It has therefore been speculated that recombination hotspots are not permanent features of the genome and evolve on a timescale of at least tens of thousands of years (Jeffreys et al. 2005). However, it appears that the timescale over which many hotspots evolve is older than 100,000 years, and because this is long enough to alter patterns of LD, temporal changes in hotspots on this timescale will not have such a severe impact on our method. Hotspots that evolve over shorter timescales, or are population specific, may have a larger effect on our method. This effect is hard to quantify since the prevalence of rapidly evolving or population-specific hotspots, other than the existence of a few examples (Jeffreys et al. 2005; Clark et al. 2007), remains largely unknown. Encouragingly, a recent article (Hellenthal et al. 2006) found that genotype-specific recombination events would not substantially affect LD patterns, boding well for our method.
We have found that the HCN statistic constructed from computationally phase-inferred data differs from the true HCN. Simply treating the phase-inferred haplotypes as the true haplotypes will likely give biased parameter estimates. Thus, when analyzing data from unrelated individuals, it is important to consider errors induced in the phasing process. We suggest doing this by inferring phase for the coalescent simulations used to estimate HCN. Our simulations suggest that using Clark's phasing algorithm works well for this purpose. However, some information is lost by this procedure (Figure 6; Table 1), and we therefore recommend, where available, using data from trios, where haplotype phase can be inferred with great accuracy, to maximize the information in the data.
It has been suggested (Conrad et al. 2006) that haplotype statistics may be less susceptible to SNP ascertainment bias than statistics based upon SNP frequencies. Here we have extensively investigated whether this holds true for the HCN and Hpair statistics under a variety of demographic and ascertainment conditions. Encouragingly, we found that the HCN statistic is reasonably robust to SNP ascertainment bias provided that the SNP discovery sample is sufficiently deep. The reason for this is that we focus on subsets of common SNPs, rather than on rare SNPs. However, the Hpair statistic was very susceptible to ascertainment bias (Figure S1 and File S1), suggesting that all haplotype statistics are not equally affected by ascertainment bias, and it will be necessary to explicitly evaluate, as we have done here, whether ascertainment bias affects a particular haplotype statistic.
The sizes of the SNP discovery and genotype samples play an important role in determining the effect of ascertainment bias on the HCN statistic. Interestingly, generating the HCN from SNPs ascertained uniformly using two chromosomes of known ethnicity (as done by Keinan et al. 2007) would result in a very different HCN statistic from that expected without ascertainment bias (Figure S3, Figure S4, and Figure S5), which would result in biased inference. Although not considered here, it is in principle possible to estimate the expected HCN statistic conditional on this particular ascertainment strategy, and application of such an estimate would reduce this bias.
We have found that SNP discovery sample sizes of at least 12 total chromosomes should be sufficient to result in the HCN statistic from ascertained SNPs to match the expected HCN when considering genotype samples of 40 or 120 chromosomes. Furthermore, we have shown that our method can reliably infer several bottleneck parameters when SNPs were ascertained in this manner. As the SNP discovery sample size increases, performance of our method will continue to improve and become closer to that for the case of no ascertainment bias, since a larger SNP discovery sample will capture more of the SNPs in the genotype sample. These results are especially encouraging since Perlegen's SNP discovery effort used 20–50 chromosomes, where ∼12 chromosomes were of African-American ancestry, ∼12 chromosomes were of European ancestry, and the remainder were of Mexican-American, Asian-American, and Native American ancestry (Collins et al. 1998; Hinds et al. 2005).
Furthermore, we have found that the size of the SNP discovery sample is more important than whether or not the SNP ascertainment had been done in a particular population (see Figure S5). This suggests that SNPs ascertained in the Perlegen resequencing survey could be used to estimate demographic parameters in other populations not represented in the SNP discovery panel. It is worth noting that the two populations in our simulation study shown in Figure S5 split 5000 generations ago (125,000 years, assuming 25 years/generation) with no subsequent migration and are thus more differentiated than many actual non-African populations that could be studied empirically.
It is important to note that the type of ascertainment bias studied here is due to the preferential genotyping of common SNPs. In all the analyses presented here, we assume that the genotyped SNPs were selected without regard to physical or genetic distance or LD patterns. Such an assumption is reasonable for the analyses of the Perlegen data presented here since Perlegen attempted to genotype all of the SNPs found in their SNP discovery process (Hinds et al. 2005). The assumption is not valid, however, for many of the “SNP chips” that preferentially selected SNPs on the basis of physical distance (in the case of Affymetrix 500K) or LD patterns (in the case of the Illumina platform; Eberle et al. 2007). Since the SNPs on these platforms are not a random subset of the total variation, using our method on such data will likely give misleading results. In principle, it should be possible to modify our method to model the SNP selection process in the inference, which would allow our method to be applied to the large-scale SNP genotype data sets that have been collected, such as the HGDP data set (Jakobsson et al. 2008; Li et al. 2008).
For the analysis of the CEU data, we used two different models of recombination rate variation. Overall, the results using both models are qualitatively similar, suggesting that our method is somewhat robust to minor misspecification of the recombination hotspot model. We also find that the one-dimensional 95% C.I.'s from the profile-likelihood curves overlap for all five parameters estimated (Table 2; Figure S11). Nevertheless, the five-dimensional 95% C.I.'s do differ between the two recombination models, mainly due to the fact that tcur is greater under the empirical hotspot model than under the Schaffner hotspot model.
The time we inferred that the CEU bottleneck began (∼37,500 years ago) is too recent to coincide with the accepted dates for the out-of-Africa bottleneck, which is believed to have occurred 40,000–80,000 years ago (Reed and Tishkoff 2006). Thus, our estimate may coincide with an additional bottleneck associated with the founding of Europe, which likely took place 30,000–40,000 years ago (Barbujani and Goldstein 2004). Alternatively, our estimated start time of the bottleneck may represent an average time over several bottlenecks, including the out-of-Africa bottleneck and a more recent bottleneck, perhaps associated with the Last Glacial Maximum that began ∼18,000 years ago (Barbujani and Goldstein 2004). Further work considering multiple European populations and multiple bottlenecks may help resolve this question.
How do our estimates of the bottleneck parameters for the CEU match with published estimates? Voight et al. (2005) may not be directly comparable to our study since their analysis considered a southern European sample and ours used individuals with northwestern European ancestry, and differences in haplotype diversity between these two regions have been noted (Lao et al. 2008; Auton et al. 2009). Nevertheless, our estimates of tmid and Nmid/Ncur fall within their confidence regions. Their bottleneck start times (40,000 years) and current and ancestral population size estimates (∼10,000) also agree with ours. Our estimate of the time the bottleneck began is also consistent with that found using the decay of pairwise LD. Reich et al. (2001) found evidence for a bottleneck 800–1600 generations ago (our MLE is 1500 generations). We find evidence for a more severe bottleneck than previously estimated (Adams and Hudson 2004; Marth et al. 2004; Keinan et al. 2007), which could reflect the importance of considering LD-based information in the inference since these studies are all based on the frequency spectrum, and the other study that considers a summary of LD (Voight et al. 2005) cannot reject such a severe bottleneck. Alternatively, there could be some other important factors in European population history not captured by these simple bottleneck models that may affect the frequency spectrum and LD patterns differently. Finally, we cannot exclude the possibility that we have overestimated the bottleneck intensity due to greater heterogeneity in the recombination rate than what was included in our hotspot models. In short, improved confidence in the fine-scale genetic map will allow definitive ability to discriminate between these alternative scenarios.
While we have shown that haplotype statistics can be used to estimate demographic parameters from SNP genotype data, and it has been shown that the site frequency spectrum (SFS) will give misleading results when applied to genotype data without a correction for ascertainment bias (Nielsen et al. 2004; Clark et al. 2005), one important question is whether haplotype summary statistics will provide additional information that is important for inference when full genomewide resequencing data are available and it is possible estimate the SFS accurately. We examined whether the number of haplotypes and the count of the most common haplotype can discriminate between two different demographic models that have similar SFSs. We focused on a demographic model that included ancestral population structure since previous studies found that ancestral structure can result in an excess of long-range LD (Wall 2000b; Plagnol and Wall 2006). Specifically, we found that for certain subsets of the parameter space (the ms command lines giving the parameters used to generate Figure 7 are given as File S1), population growth with ancestral structure can create a similar SFS to that expected under population growth without ancestral structure. Close inspection of Figure 7 reveals a very slight uptick in the proportion of high-frequency derived SNPs in the population growth with structure SFS as compared to the growth without structure SFS, which is the expected signal of ancestral population structure. However, this effect is very subtle and in practice may be attributed to misidentification of the derived allele, rather than ancestral population structure (Hernandez et al. 2007a). Note that while the magnitudes of growth in the structure and panmictic cases are different, the growth with structure case still has an excess of low-frequency SNPs (Figure 7), which would often be interpreted as evidence for population growth. The inset in Figure 7 shows the count of the most common haplotype vs. the number of haplotypes for 10,000 windows simulated under the two demographic models described above. Note the growth with structure model has an excess of windows where the most common haplotype is at higher frequency and an excess of windows with a fewer number of haplotypes compared to the pure growth model. Thus, this is a case where two demographic models that cannot be readily differentiated on the basis of the SFS can be distinguished easily using haplotype patterns. The reason for this is as follows: population growth results in an excess of low-frequency SNPs, and for the parameters used here, population structure results in an excess of both low-frequency and high-frequency derived alleles. The resulting SFSs in Figure 7 have been affected by both these forces, but the excess of low-frequency SNPs is the predominant feature. Since ancestral population structure results in some genealogies having longer internal branches, any mutations occurring on these branches will be in LD with each other, leading to fewer distinct haplotypes in the sample and the most common haplotype occurring at higher frequency. Additionally, since the SFS treats all SNPs as being independent, haplotype patterns capture more information regarding the local genealogy within a window than the SFS does. In other words, many of the high-frequency derived SNPs in the sample are clustered in certain windows, but this pattern is missed in the SFS since it treats all SNPs as being exchangeable. Notably, summaries of the SFS performed on a local scale may be better at distinguishing these two models (Figure S12).
It is important to point out that the HCN statistic has been designed to be used on ascertained SNP data, where the number of SNPs in a particular window of the genome is affected by the ascertainment process. Consequently, we deliberately did not use the number of SNPs in constructing the HCN statistic. To analyze full-resequencing data in a haplotype framework, a more powerful approach would also make use of information about the number of SNPs in each window (Innan et al. 2005). The HCN statistic can be modified to include this information, suggesting that haplotype patterns based on full-resequencing data will be even more informative than described here. Thus statistics, like the HCN statistic, that capture information about the local genealogy of a region of the genome will remain relevant for demographic inference even when ascertainment bias is no longer an issue.
The example above suggests that combining the SFS and the HCN statistic may present a powerful approach to distinguish between complex demographic scenarios. Further work combining the two statistics for demographic inference is ongoing. Another possible extension of our method would be to jointly model two populations in an isolation–migration framework (proposed by Nielsen and Wakeley 2001), where the data are summarized by the HCN statistic for shared and population-specific haplotypes. Finally, instead of using standard coalescent simulations to find the expected HCN statistic for a given demographic scenario, we could approximate the coalescent using the sequentially Markov coalescent (McVean and Cardin 2005; Marjoram and Wall 2006). Doing so would reduce the computational burden of the method and would also allow for greater values of cwindow to be used.
We thank J. Novembre, R. Hernandez, A. Boyko, A. Auton, J. Degenhardt, and M. Wright for helpful discussions and suggestions throughout the project and A. Auton and two reviewers for constructive comments on a previous version of the manuscript. K.E.L. was supported by a National Science Foundation Graduate Research Fellowship. This work was funded by National Institutes of Health grant RO1 HG003229 to A.G.C. and C.D.B.
Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.108.099275/DC1.
Communicating editor: N. Takahata
- Received December 5, 2008.
- Accepted February 19, 2009.
- Copyright © 2009 by the Genetics Society of America