In recent years approximate Bayesian computation (ABC) methods have become popular in population genetics as an alternative to full-likelihood methods to make inferences under complex demographic models. Most ABC methods rely on the choice of a set of summary statistics to extract information from the data. In this article we tested the use of the full allelic distribution directly in an ABC framework. Although the ABC techniques are becoming more widely used, there is still uncertainty over how they perform in comparison with full-likelihood methods. We thus conducted a simulation study and provide a detailed examination of ABC in comparison with full likelihood in the case of a model of admixture. This model assumes that two parental populations mixed at a certain time in the past, creating a hybrid population, and that the three populations then evolve under pure drift. Several aspects of ABC methodology were investigated, such as the effect of the distance metric chosen to measure the similarity between simulated and observed data sets. Results show that in general ABC provides good approximations to the posterior distributions obtained with the full-likelihood method. This suggests that it is possible to apply ABC using allele frequencies to make inferences in cases where it is difficult to select a set of suitable summary statistics and when the complexity of the model or the size of the data set makes it computationally prohibitive to use full-likelihood methods.
THE genetic patterns observed today in most species are the result of complex histories, which include demographic events such as population admixture, expansions, and/or collapses. The detection and quantification of such events relies on the fact that different scenarios leave a specific genetic signature in present-day populations, as well as on knowledge from other sources (e.g. ecology, biogeography, archeology) to define plausible models to explain such patterns. Recent population genetic modeling has seen the development of a number of statistical approaches that aim at extracting as much information as possible from the full allelic distributions (Griffiths and Tavaré 1994; Wilson and Balding 1998; Beaumont 1999; Beerli and Felsenstein 2001; Chikhi et al. 2001; Storz et al. 2002). These approaches aim at computing the likelihood L(θ), i.e., the probability PM(D | θ) of generating the observed data D under some demographic model M, defined by a set of parameters θ = (θ1, … , θk). In Bayesian statistics, the posterior density is used to make inferences as it reflects the probability of the parameters given the data, and it is obtained through the relationship P(θ | D) ∝ L(θ)P(θ), where P(θ) summarizes prior knowledge (or lack thereof) regarding θ before the data are observed (Beaumont and Rannala 2004). For most demographic models there are no explicit likelihood functions or the likelihood cannot be derived analytically. Therefore, full-likelihood approaches rely on methods that explore the parameter space efficiently, such as importance sampling (IS) (Stephens and Donnelly 2000) and Markov chain Monte Carlo (MCMC) (Beerli and Felsenstein 2001; Nielsen and Wakeley 2001; Beaumont 2003). However, these methods are highly computer intensive, their implementation into complex and realistic models is difficult, and, at the moment, their applicability to analyze large data sets is reduced (Hey and Machado 2003; Hey and Nielsen 2004). 
This has led to the development of methods that try to approximate the likelihood, such as approximate Bayesian computation (ABC) (Beaumont et al. 2002; Marjoram et al. 2003), composite likelihood (Hudson 2001; Nielsen et al. 2005), and product of approximate conditionals (PAC) (Li and Stephens 2003; Cornuet and Beaumont 2007; Roychoudhury and Stephens 2007).
The principle of ABC methods is to use simulations across a wide range of parameter values within a model to find the parameter values that generate data sets that match the observed data most closely (Beaumont et al. 2002). In most studies, the allele frequency data are summarized by means of summary statistics (Fu and Li 1997; Tavaré et al. 1997; Weiss and von Haeseler 1998; Pritchard et al. 1999; Tallmon et al. 2004; Thornton and Andolfatto 2006). ABC algorithms do not require an explicit likelihood function and are based on a rejection scheme to obtain an approximate sample from the joint posterior distribution. Briefly, this involves four steps: (i) simulation of data sets with different parameter values drawn from the prior distributions; (ii) computation of a set of summary statistics for each data set; (iii) comparison of the observed and simulated summary statistics using a distance metric, e.g., Euclidean distance; and (iv) rejection of the parameters that generated distant data sets. The posterior distribution reflects PM(θ | d(Ss, So) < δ), where d(Ss, So) stands for the distance between the observed and the simulated summary statistics, and δ is an arbitrary threshold. The choice of δ (and of the number of simulations) reflects to some extent a balance between computability and accuracy (Beaumont et al. 2002; Marjoram et al. 2003). In most ABC implementations the value of δ is set as a quantile (the tolerance level Pδ) from the empirical distance distribution found for a given observed data set, and typical values range from 10^−5 to 10^−2 (e.g., Estoup et al. 2004; Becquet and Przeworski 2007; Fagundes et al. 2007; Pascual et al. 2007; Bonhomme et al. 2008; Cox et al. 2008). The quality of the ABC inference is expected to depend on the summary statistics, the distance metric, and the tolerance level Pδ used.
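The four-step scheme can be sketched in a few lines of Python. Everything here is a toy illustration, not the admixture model of this study: the normal model, the uniform prior, and the single mean-based summary statistic are hypothetical choices made only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)

def summarize(data):
    # Toy summary statistic: the sample mean (a hypothetical choice).
    return np.array([data.mean()])

def simulate(theta, n=100):
    # Toy model: normal observations with unknown mean theta.
    return rng.normal(theta, 1.0, size=n)

def abc_rejection(observed, n_sims=20000, tolerance_quantile=0.01):
    s_obs = summarize(observed)
    thetas = rng.uniform(-5.0, 5.0, size=n_sims)        # (i) draw from the prior
    stats = np.array([summarize(simulate(t)) for t in thetas])  # (ii) summaries
    dists = np.linalg.norm(stats - s_obs, axis=1)       # (iii) distance metric
    delta = np.quantile(dists, tolerance_quantile)      # delta from P_delta
    return thetas[dists <= delta]                       # (iv) keep close draws

observed = rng.normal(2.0, 1.0, size=100)
posterior_sample = abc_rejection(observed)
```

With a tolerance quantile of 0.01, the 1% of simulations closest to the observed summary are kept as an approximate posterior sample.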
As noted by some authors, one potential problem is that it may be difficult or even impossible to define a suitable set of sufficient summary statistics (Marjoram et al. 2003).
Here, we show that it is possible to use the full allele frequency distribution directly. The posterior distribution is thus approximated by PM(θ | d(Do, Ds) < δ), where Do and Ds stand for the observed and simulated allele frequency data, respectively. The advantage of this approach over the use of summary statistics is clear when δ decreases toward zero and the number of simulations increases to infinity, as the accepted points tend to the correct joint posterior distribution (Marjoram et al. 2003). Marjoram et al. (2003) noted that such a rejection scheme might be inefficient when the data are high dimensional, and this approach has so far been used only by Plagnol and Tavaré (2004) to infer the times of lineage splits based on fossil records. In this study we show that an ABC algorithm using the allele frequencies can approximate the results of a full-likelihood method in a reasonably complex model involving three populations and admixture. Note that the allele frequencies can be viewed as summary statistics of the individual genotypes. From this perspective they are sufficient, since they contain all the information in the genotypes at each locus. We do not refer to this approach as based on summary statistics to avoid confusion with most ABC approaches, which usually use functions of the allele frequency distribution as summaries of the data.
We implemented this ABC approach for an admixture model identical to that of Chikhi et al. (2001) (Figure 1). These authors developed an MCMC approach based on the IS scheme of Griffiths and Tavaré (1994), which is implemented in the LEA software (Langella et al. 2001). Currently, LEA is the only Bayesian full-likelihood method based on allele frequencies available to estimate admixture proportions, and for ease of comparison we used the same model in the ABC framework. Note that there is another full-likelihood method to estimate admixture (Wang 2003), but it is based on maximization of the likelihood and hence comparisons are not straightforward (but see Excoffier et al. 2005). For comparison purposes, we also developed an ABC algorithm using typical summary statistics (ABC_SUMSTAT). Our main interest was to explore the performance of the ABC using the full allele frequency distribution to determine whether it can provide reasonable estimates compared to the full-likelihood (LEA) and summary statistics-based approaches. To summarize, in this study we (i) propose and validate with simulated data a new ABC inference method using allele frequency data, (ii) compare these results with those obtained with a full-likelihood method (LEA) and a traditional ABC method, and finally (iii) explore some general issues regarding the ABC approach, namely the choice of the distance metric, the tolerance level, the number of simulations, and the use of a regression step.
MATERIALS AND METHODS
The admixture model:
The model is represented in Figure 1. It assumes that two independent parental populations, P1 and P2, of size N1 and N2, mixed some time T in the past with respective proportions p1 and p2 (= 1 − p1), creating a hybrid population H of size Nh. At the time of admixture, the gene frequency distributions of P1 and P2 are represented by the two vectors x1 and x2, respectively, and that of the hybrid population by p1x1 + p2x2. After admixture, P1, P2, and H evolve independently (with no migration) by pure drift (no mutations) until the present time. The time since admixture T (in generations) is scaled by the effective size of each population, and the corresponding drift times are called t1 = T/N1, t2 = T/N2, and th = T/Nh. This model is the same as in Thompson (1973). Flat priors were used for p1, t1, t2, and th. The priors for x1 and x2 were independent uniform Dirichlet distributions with k parameters D(1, … , 1), where k is the number of alleles per locus observed across all present-day populations. These priors reflect independent parental populations with allele frequencies generated according to a K-allele mutation model with K = k (Ewens 2004). Note that Wang (2003) and Choisy et al. (2004) criticized this admixture model because it ignores the correlation of the allele frequencies in the parental populations due to common ancestry. These authors propose alternative priors, but since they are not implemented in LEA, we kept the uniform Dirichlet in the ABC.
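A single draw from the priors just described might look as follows in Python; the upper bound `t_max` on the flat drift-time priors is an assumption (a bound of 0.2 is used later in the comparisons), and `draw_admixture_priors` is our own illustrative name.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_admixture_priors(k, t_max=0.2):
    """Draw one parameter set from the priors of the admixture model.

    k is the number of alleles observed at a locus; t_max is an assumed
    upper bound for the flat priors on the scaled drift times T/N_i.
    """
    p1 = rng.uniform(0.0, 1.0)                 # admixture proportion
    t1, t2, th = rng.uniform(0.0, t_max, 3)    # drift times t1, t2, th
    x1 = rng.dirichlet(np.ones(k))             # parental frequencies, D(1,...,1)
    x2 = rng.dirichlet(np.ones(k))
    xh = p1 * x1 + (1.0 - p1) * x2             # hybrid frequencies at admixture
    return p1, (t1, t2, th), x1, x2, xh

p1, drift, x1, x2, xh = draw_admixture_priors(k=5)
```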
ABC without summary statistics (ABC_ALL_FREQ):
The ABC using the allele frequencies is referred to as ABC_ALL_FREQ. The rejection algorithm was divided into two parts as follows: (i) given a particular tolerance level Pδ and a particular data set Do, the corresponding tolerance δ* is obtained from the distance distribution of a first set of simulations (typically 10^4 or 10^5); and (ii) a large number of simulations (>10^6) are then performed, keeping all parameter values for which d(Do, Ds) < δ*. The division of the algorithm into two parts can reduce the computation time because there is no need to simulate the samples for all the loci and populations at each step of the second part. Whenever the cumulative distance after simulating the j*th population at the l*th locus exceeds the tolerance (i.e., Σ djl > δ*, with the sum taken over the populations and loci simulated so far), the parameter set is rejected. In the first part, the tolerance was computed as the 0.1 or 0.001 quantile of the distance distributions, obtained after 10^4 or 10^5 simulations for the single-locus and multilocus case, respectively. In the second part, 10^6 simulations were performed for the single-locus and 10^8 for the multilocus case. The influence of the number of simulation steps was investigated by repeating the analysis with 10^6, 10^7, and 10^8 simulations for the multilocus case with 10 loci. The choice of the tolerance level was also investigated by looking at the results for Pδ-values between 10^−5 and 10^−3.
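The early-rejection idea in part (ii) can be sketched as follows; the `observed` mapping and the `simulate_pop` callable are hypothetical interfaces used only for illustration, and a plain Euclidean distance stands in for the metrics described below.

```python
import numpy as np

def accept_with_early_rejection(observed, simulate_pop, delta_star):
    """Accumulate per-population, per-locus distances and stop as soon as
    the running total exceeds the tolerance delta_star.

    observed: dict mapping (locus, population) -> allele-count vector.
    simulate_pop: callable (locus, population) -> simulated allele counts.
    Both are hypothetical interfaces used only for illustration.
    """
    total = 0.0
    for key, obs in observed.items():
        sim = simulate_pop(*key)
        total += np.linalg.norm(obs - sim)   # per-population, per-locus distance
        if total > delta_star:               # partial sum already too large:
            return False                     # reject without simulating the rest
    return True
```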
Distance metrics (Euclidean and GST):
Most studies to date use a Euclidean distance. Here, two distance metrics were used to compute distances between simulated and observed data. For simplicity we focus on single-locus data from one population, which is represented by a vector Do = (o1, … , ok), where oi is the absolute frequency of allele i, i = (1, … , k), and k is the total number of alleles observed across the three populations (i.e., some oi can be zero in some populations, but not all). The two distance measures used here were (i) a standardized Euclidean distance, dE = √(Σi (oi − si)²), where oi and si are the absolute allele frequencies of the observed and simulated data sets, respectively, and (ii) the “genetic distance” GST = (HT − H̄S)/HT, where HT is the expected heterozygosity (He) when the two vectors Do and Ds are pooled, and H̄S is the average He computed for Do and Ds (Nei 1986). The rationale behind the use of the GST distance comes from the fact that we were using allele frequencies, not summary statistics. We refer to the first as Euclidean and the second as GST. The distance is computed for each population and locus independently, and the total distance was defined as the sum over loci and over populations, dT = Σl Σj djl, where j = 1, 2, h refers to the populations and l = 1, … , m to the loci.
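Both metrics are straightforward to compute from absolute allele counts; the sketch below uses our own function names and assumes the per-population, per-locus counts are given as NumPy arrays.

```python
import numpy as np

def expected_het(counts):
    """Expected heterozygosity He = 1 - sum(p_i^2) from absolute counts."""
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def euclidean_distance(obs, sim):
    # Euclidean distance between vectors of absolute allele frequencies.
    return float(np.sqrt(np.sum((obs - sim) ** 2)))

def gst_distance(obs, sim):
    """G_ST-style distance (H_T - mean H_S) / H_T, where H_T is computed
    from the pooled observed + simulated counts."""
    h_t = expected_het(obs + sim)
    h_s = 0.5 * (expected_het(obs) + expected_het(sim))
    return (h_t - h_s) / h_t

# One population at one locus; the total distance sums these values
# over populations and loci.
obs = np.array([15.0, 5.0, 30.0])
```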
As pointed out by Marjoram et al. (2003), using high-dimensional data may reduce the acceptance rate and compromise the efficiency of the rejection algorithm. However, since all the loci share the same demographic history and mutation process (K-allele model), i.e., the loci and alleles are exchangeable under this admixture model, the full allelic distribution can be viewed as a high-dimensional unordered data set [as in label-switching problems (Stephens 2000)]. In these cases, due to exchangeability, the likelihood does not depend on the order of the elements. Therefore, it is possible to increase the acceptance rate by using permutations of the allele and locus labels to minimize the total distance between observed and simulated data sets. These approaches are described below.
Sort the allele frequencies:
Let us assume a single-population model in which we observe a single-locus data set with three alleles whose absolute frequencies are given by the vector Do = (15, 5, 30) (hence a sample size of 25 diploids or 50 genes). Let us further assume that the following simulated data set is obtained: Ds = (5, 30, 15). Given that the alleles are exchangeable, it is possible to permute the allele labels and find an exact match between Ds and Do. In practice, the minimal distance was found by sorting the absolute allele frequencies. Since our model has three populations P1, P2, and H, and since the labels of alleles in the three populations are not independent, the SORT algorithm sorts the alleles according to the allele frequencies of the three populations pooled together. We defined two algorithms, referred to as SORT when alleles were sorted and as SIMPLE when the data sets were compared directly. We compared the SORT and SIMPLE algorithms in the single-locus case, using the Euclidean distance.
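A minimal sketch of the SORT scheme: alleles are ranked by the counts pooled over populations, and the same permutation is applied to every population so that allele labels stay consistent. Data and function names are our own illustrative choices.

```python
import numpy as np

def sort_alleles(pops):
    """Reorder allele labels by the counts pooled over all populations,
    applying the same permutation to every population (the SORT scheme).

    pops: array of shape (n_pops, k) of absolute allele counts.
    """
    pooled = pops.sum(axis=0)
    order = np.argsort(pooled)      # one shared permutation of allele labels
    return pops[:, order]

obs = np.array([[15, 5, 30]])       # observed counts, one population
sim = np.array([[5, 30, 15]])       # simulated counts with permuted labels
```

After sorting, the two data sets above match exactly, so the distance between them is zero.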
Reorder the loci:
Consider a vector where each element contains the data of one locus, say Do = (o1, … , ol) with l exchangeable loci, i.e., the order of the labels is irrelevant. When comparing this vector with a simulated one, say Ds = (s1, … , sl), there is a one-to-one correspondence, but it is arbitrary to compare oi with si. Therefore, the labels of the loci were permuted to minimize the distance between the observed and the simulated data sets. The best solution requires the evaluation of all l! possible permutations, which may become impractical for the number of simulations performed here. Instead, we used a heuristic to approximate the minimal distance (but see Stephens 2000 for a discussion of efficient algorithms applied to similar label-switching problems in mixture models). In the first iteration, simulated locus s1 is compared with all observed loci oi, and the one with minimal distance is selected, say o3. In the second iteration, locus s2 is compared with all loci except o3. This procedure was repeated until all loci were reordered. Note that different loci may have different numbers of alleles and sample sizes. From a practical perspective, though, it is convenient to compare observed and simulated data sets with the same number of alleles and sample sizes. This ensures a one-to-one relation between the observed and the simulated allele frequencies for each locus in the regression step (see below). This means that we constrained the permutations of locus labels so that only loci with the same number of alleles and sample sizes were compared. In real data sets the sample size can differ across loci due to missing data and/or use of data from different studies. This implies that for some data sets (e.g., when all loci have different sample sizes), there is only one permutation of the labels satisfying the constraints.
In these cases it is possible to minimize the distance between simulated and observed data sets by grouping the loci according to their number of alleles and then resampling the allele frequencies to obtain the same sample size across all loci within each group. The reference sample size is set for each population as the smallest among loci within each group. For each locus, the alleles of each population are resampled without replacement from the original data set. Note that different resampled data sets may be obtained from the same original data set. The effect of this procedure on the estimates depends on the amount of missing data, and thus the analysis should be repeated with different resampled data sets to assess this effect. For the multilocus case we focused on unlinked biallelic markers. This choice was made partly because of the increasing amount of data from genomic biallelic markers such as SNPs, e.g., HapMap (Frazer et al. 2007), RAPD, and RFLP (e.g., Parra et al. 1998), and partly because in this case all loci have the same number of alleles. The performance of reordering the loci was compared with that of a procedure in which the loci were paired at random, without minimizing the distance, by analyzing the data sets of 10 biallelic loci with ti = 0.001 under both approaches. Note that a similar labeling problem was met by Rosenblum et al. (2007), who used summary statistics and sorted the loci according to the values of one of the latter.
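Both procedures, the greedy locus matching and the without-replacement subsampling used to equalize sample sizes within a group, can be sketched as follows; the data and function names are our own toy choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def greedy_reorder(obs_loci, sim_loci):
    """Greedy heuristic: match each simulated locus to the closest remaining
    observed locus, approximating the label permutation of minimal distance."""
    remaining = list(range(len(obs_loci)))
    order, total = [], 0.0
    for sim in sim_loci:
        dists = [np.linalg.norm(sim - obs_loci[i]) for i in remaining]
        best = int(np.argmin(dists))
        total += dists[best]
        order.append(remaining.pop(best))   # this observed locus is now taken
    return order, total

def resample_counts(counts, target_n):
    """Draw target_n gene copies without replacement from a locus with
    absolute allele counts `counts` (subsampling to a common sample size)."""
    genes = np.repeat(np.arange(len(counts)), counts)   # expand to gene copies
    sub = rng.choice(genes, size=target_n, replace=False)
    return np.bincount(sub, minlength=len(counts))

# Two biallelic loci with unequal sample sizes, downsampled to the smallest:
loci = [np.array([30, 20]), np.array([18, 22])]
target = min(int(c.sum()) for c in loci)
resampled = [resample_counts(c, target) for c in loci]
```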
Regression step:
Beaumont et al. (2002) showed that performing a weighted local linear regression on the parameters obtained during the rejection step improves the estimation results. The regression assumes that, at least locally, there is a linear relation between the mean value of the accepted parameters and the accepted summary statistics. In this case, the predictor variables were the allele frequencies (or summary statistics for the ABC_SUMSTAT), and the response variables were the parameters of interest (p1, t1, t2, th). However, as can happen with some summary statistics, the relation between the allele frequencies and the parameters of interest is not necessarily linear. For p1 the linear assumption appears to be valid, as the allele frequencies among populations are linearly correlated with p1, at least immediately after the admixture event. However, for t1, t2, and th the mean value of the parameters does not change according to a linear relation with the allele frequencies. In a stable population there is a positive relation between the variance of the allele frequencies and drift. We thus performed two different regressions: (i) an independent regression, applied to p1, and (ii) a multiresponse quadratic regression, applied to t1, t2, and th. In the first, because within each locus in a population the allele frequencies are correlated (i.e., they sum to one), the regression was performed discarding from each locus the most frequent allele across populations. In the second, the allele frequencies were squared and t1, t2, and th were considered altogether in a single linear model, using dummy variables to code each parameter (Neter et al. 1985). The linear model becomes Y = α + Σj βjXj + βd1D1 + βd2D2 + ε, with j = 1, … , n, where n = na × l (na alleles and l loci) and ε is the error. Y is a vector with the m accepted parameters pooled together, Y = (t1(1), … , t1(m), t2(1), … , t2(m), th(1), … , th(m)).
Each Xj is a vector with the corresponding m accepted squared allele frequencies, repeated for each of the three responses, Xj = (xj(1), … , xj(m), xj(1), … , xj(m), xj(1), … , xj(m)), where xj(i) is the squared allele frequency of the jth allele in the ith accepted simulation. The D dummy variables are coded with values 0 or 1 to identify each parameter. To estimate t1 the two dummy variables are equal to 0 and the model becomes Y = α + Σj βjXj + ε. To estimate t2, D1 = 1 and D2 = 0, and the model becomes Y = α + βd1 + Σj βjXj + ε. Finally, to estimate th the two dummy variables are equal to 1, and the model becomes Y = α + βd1 + βd2 + Σj βjXj + ε. The estimated βd1 and βd2 reflect the difference in the intercept among the three ti parameters. In both regressions, the accepted parameters were weighted according to the corresponding distances using the Epanechnikov kernel, as in Beaumont et al. (2002). The parameters were transformed to avoid posterior values outside the prior distribution limits, following the transformation of Hamilton et al. (2005).
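The dummy-variable coding can be illustrated with an ordinary least-squares fit on synthetic accepted points; unlike the actual regression step, this sketch omits the Epanechnikov distance weights and the parameter transformation, and all numbers are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic accepted sample: m simulations, each with 4 squared allele
# frequencies as predictors and three drift parameters t1, t2, th.
m, n_freq = 200, 4
x_sq = rng.uniform(0.0, 1.0, size=(m, n_freq)) ** 2
t = rng.uniform(0.0, 0.2, size=(m, 3))       # columns: t1, t2, th

# Stack the three responses into one vector Y and replicate the predictors;
# the dummy columns D1, D2 code which parameter each row belongs to
# (t1: D1 = D2 = 0; t2: D1 = 1, D2 = 0; th: D1 = D2 = 1).
Y = t.T.ravel()                              # (t1..., t2..., th...)
X = np.tile(x_sq, (3, 1))
d1 = np.repeat([0.0, 1.0, 1.0], m)
d2 = np.repeat([0.0, 0.0, 1.0], m)
design = np.column_stack([np.ones(3 * m), X, d1, d2])

# Ordinary least squares: beta = [alpha, beta_1..beta_4, beta_d1, beta_d2];
# beta_d1 and beta_d2 shift the intercept between the three parameters.
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
```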
ABC with summary statistics (ABC_SUMSTAT):
To determine whether our ABC approach was comparable to a summary statistics-based approach, we also developed an ABC algorithm with 14 summary statistics, which is referred to as ABC_SUMSTAT. The summary statistics were chosen on the basis that they should contain information about the parameters of interest in the admixture model. Namely, we used (i) the expected heterozygosity (He) for each population and over all populations, (ii) the number of alleles na of each population, (iii) the number of private alleles np of each population, and (iv) the three pairwise FST and the overall FST. As in Beaumont et al. (2002), the distance metric considered was a Euclidean distance between the standardized observed and simulated summary statistics, d = √(Σi (Soi − Ssi)²), i = 1, … , s, where s is the number of summary statistics. For the multilocus case, we used the mean of each summary statistic across loci. The summary statistics were standardized by subtracting the mean and dividing by the standard deviation of the simulations performed in the first part of the rejection algorithm, 10^4 for the single-locus and 10^5 for the multilocus case. The rejection step was done with the same number of simulation steps and the same tolerance values as for ABC_ALL_FREQ. A local weighted regression between the standardized summary statistics and the accepted parameters was performed as in Beaumont et al. (2002), transforming the parameters as in Hamilton et al. (2005).
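As an illustration with our own toy counts, the per-population statistics can be computed as follows; the sketch covers 12 of the 14 statistics for a single locus (the overall He and overall FST across the three populations are omitted) and uses a GST-style estimator of FST.

```python
import numpy as np

def he(counts):
    # Expected heterozygosity He = 1 - sum(p_i^2).
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def n_alleles(counts):
    return int(np.sum(counts > 0))

def private_alleles(counts, others):
    # Alleles present in this population but absent from all others.
    present = counts > 0
    absent_elsewhere = others.sum(axis=0) == 0
    return int(np.sum(present & absent_elsewhere))

def pairwise_fst(a, b):
    # (H_T - mean H_S) / H_T between two populations (a G_ST-style F_ST).
    h_t = he(a + b)
    h_s = 0.5 * (he(a) + he(b))
    return (h_t - h_s) / h_t

# Toy allele counts for P1, P2, and H at one triallelic locus.
pops = np.array([[20.0, 30.0, 0.0],
                 [5.0, 25.0, 20.0],
                 [12.0, 28.0, 10.0]])
stats = ([he(p) for p in pops]
         + [n_alleles(p) for p in pops]
         + [private_alleles(pops[i], np.delete(pops, i, axis=0))
            for i in range(3)]
         + [pairwise_fst(pops[i], pops[j]) for i, j in [(0, 1), (0, 2), (1, 2)]])
```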
Simulated data sets:
The data sets used to test the performance of the different methods were simulated for each locus according to the demographic model depicted in Figure 1, using the coalescent. More specifically, the simulation algorithm was as follows:
Sample the parameters p1, t1, t2, and th from the prior distributions.
Sample the ancestral allele frequencies x1 and x2 from a uniform Dirichlet D(1, … , 1). The allele frequency of population H at the time of admixture is set to xh = p1 × x1 + (1 − p1) × x2.
Sample the coalescent times, for each population independently, from an exponential distribution until the admixture event at time ti, where i = 1, 2, h. At each coalescent event the number of lineages decreases by one. The lineages that remain at time ti are designated as founder lineages.
Sample the allelic state of the founders from the ancestral allele frequencies x1, x2, and xh.
Starting from the founder lineages to the present-day samples, lineages are randomly picked and duplicated for every coalescent event, until the present-day sample size is reached (Beaumont 2003).
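The steps above can be sketched for a single population as follows. This is a simplified illustration in our own notation (e.g., coalescent waiting times are drawn on the same scale as the drift time ti), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_population(x_founder, t_drift, n_sample):
    """One population: run coalescent events backward until the admixture
    time, draw founder allele states from the founding frequencies, then
    duplicate lineages forward to restore the sample size."""
    k, t = n_sample, 0.0
    while k > 1:
        t += rng.exponential(2.0 / (k * (k - 1)))   # waiting time, rate k(k-1)/2
        if t >= t_drift:
            break                                    # reached the admixture event
        k -= 1                                       # one coalescence
    founders = rng.choice(len(x_founder), size=k, p=x_founder)
    lineages = list(founders)
    while len(lineages) < n_sample:                  # forward duplication
        lineages.append(lineages[rng.integers(len(lineages))])
    return np.bincount(lineages, minlength=len(x_founder))

p1 = 0.7
x1, x2 = rng.dirichlet(np.ones(2)), rng.dirichlet(np.ones(2))
xh = p1 * x1 + (1.0 - p1) * x2
sample_h = simulate_population(xh, t_drift=0.01, n_sample=50)  # 50 gene copies
```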
This algorithm was used in the two ABC approaches and to generate all the data sets analyzed. Samples from 10^8 simulations with 10 independent biallelic loci were saved in a database, and this database was used to perform the rejection step, in the multilocus case, for all the ABC approaches.
Comparison of approximate (ABC) and full-likelihood (LEA) methods:
The relative performance of the different approaches (see Table 1), including the full-likelihood method, was evaluated with samples generated with the following set of parameter values. Two different levels of drift, namely ti = 0.001 and ti = 0.01 (i = 1, 2, h), were used, assuming that the three populations evolved under the same conditions (i.e., t1 = t2 = th). For each level of drift, we simulated 50 gene copies (25 diploid individuals) typed at 1, 5, and 10 unlinked loci, with an admixture proportion p1 = 0.7. For the single-locus data sets, we simulated loci with 2, 5, and 10 alleles. For the multilocus data sets, only biallelic markers were simulated. For each combination of parameters we simulated 50 independent data sets. These simulated data sets were then given as input to LEA, ABC_SUMSTAT, and ABC_ALL_FREQ.
For LEA we ran one MCMC chain for each generated sample, with 10^5 steps and a thinning interval of 5, as suggested in Chikhi et al. (2001). At each step, the likelihood was estimated with 500 iterations of the importance sampler. The density estimation of the posterior distribution was performed discarding the first 10% of the chain (burn-in). For the 10-locus case, we conducted convergence analysis by comparing the results obtained with longer runs (10^6 steps) and found no difference. Thus, 10^5 steps were used in all the analyses, as LEA was clearly the slowest of the methods tested. The only difference between the full-likelihood and the approximate methods is in the priors for the ti's. LEA assumes uniform but improper priors (with no upper bound). In the ABC, the priors are also uniform but we defined an upper bound at 0.2, since the data sets were all generated with much smaller ti values. Of course, for the analysis of real data sets, for which the ti's are unknown, higher bounds can be allowed. Nevertheless, to avoid any bias in the comparison, we conditioned the sample from the posterior obtained with LEA such that ti ≤ 0.2, and the posterior of interest is thus PM(p1, t1, t2, th | D, ti ≤ 0.2). The different methods were compared by looking at properties of the full posterior distributions and at point estimates. We measured (i) the mean integrated square error of each data set, MISE = (1/n) Σi (θi − θ)², which reflects the posterior density around the real parameter value θ, where the θi are the n accepted points used to obtain the posterior, and (ii) the relative root mean square error (RRMSE) of the median, which is the square root of the mean square error divided by the true value, RRMSE = (1/θ)√[(1/n) Σi (θ̂i − θ)²], where the θ̂i are the posterior medians and n is the total number of data sets analyzed. The confidence intervals for the RRMSE of each parameter were obtained with a nonparametric bootstrap, using 1000 iterations.
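The two error measures, in our notation and assuming the true value θ is known (as it is for simulated data), can be written as:

```python
import numpy as np

def mise(accepted, true_value):
    """Mean integrated square error of one posterior sample: the average
    squared deviation of the n accepted points from the true value."""
    accepted = np.asarray(accepted, dtype=float)
    return float(np.mean((accepted - true_value) ** 2))

def rrmse(point_estimates, true_value):
    """Relative root mean square error of point estimates (e.g. posterior
    medians) over repeated data sets, divided by the true value."""
    est = np.asarray(point_estimates, dtype=float)
    return float(np.sqrt(np.mean((est - true_value) ** 2)) / true_value)

# Posterior medians of p1 from four hypothetical replicate data sets:
medians = np.array([0.68, 0.72, 0.70, 0.75])
```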
Note that the RRMSE was computed as the mean of 50 independent data set analyses, and thus it should be considered indicative and not an absolute estimate of the error. In total, we simulated 500 different data sets that were analyzed by all the different methods, for a total of 2100 analyses (excluding the regression step results). The programs are available upon request. The rejection step for the three ABC programs was written in C. The regression analysis was performed in R using the lm function (R Development Core Team 2008). The locfit function implemented in R was used to estimate the density of the marginal posteriors (Loader 2007).
Analysis of a human data set:
We applied the ABC methods to a data set published by Parra et al. (1998) and previously analyzed with LEA by Chikhi et al. (2001). The original data set consists of nine nuclear loci (restriction site and Alu polymorphisms) typed in populations from Europe, Africa, the United States (African-Americans from different cities), and Jamaica. The aim was to estimate the admixture proportions in African-Americans and in Jamaica, using the European and African data as parentals. Most loci were biallelic, with the exception of one locus that was triallelic. We focused on the Jamaican sample (average n = 185.8) as the hybrid (H) and considered that the samples from parental populations P1 and P2 correspond to all the samples pooled together from Europe (average n = 292.4) and Africa (average n = 387.6), respectively. The allele frequencies are the same as in Table 3 of Chikhi et al. (2001). Two versions of the ABC_ALL_FREQ were used, since the data set had loci with different numbers of alleles and different sample sizes. In the first, the original data set was used and no permutations were performed to minimize the distance between the observed and the simulated data sets. In the second, we created a resampled data set by resampling the allele frequencies until all loci had the same sample size in each population. This allowed us to use permutations to find the minimal distance between the simulated and the observed data sets. The original data set and the resampled data set were analyzed with all the methods. For p1, a flat prior between zero and one was assumed. For the ti's, three different flat priors were tested, varying the upper limit as 0.2, 0.5, or 1. For the three ABC approaches we performed 10^7 simulations in the rejection step with a tolerance level of 0.1% (Pδ = 0.001), and the regression step was applied as in the simulation study.
The effect of resampling was assessed by repeating the analysis with 10 different resampled data sets using ABC_ALL_FREQ with the GST distance. For LEA, three independent MCMC runs were performed with 10^6 steps each.
RESULTS
The posterior distributions obtained for the single-locus case with the ABC and full-likelihood methods are compared in Figure 2, for three representative runs with different numbers of alleles and a tolerance level Pδ = 0.001 (1000 simulated data sets accepted out of 10^6 simulations). Figure 2, together with the associated Table 2, shows that full-likelihood and ABC methods produce similar results. As expected, increasing the number of alleles leads to narrower posteriors around the true parameter values. For all methods, the p1 RRMSE decreased when drift decreased and when the number of alleles increased from two to five (Table 2). Thus, better p1 estimates were obtained when drift was limited and when the locus had a higher number of alleles, as described previously (Chikhi et al. 2001; Wang 2003; Excoffier et al. 2005). The RRMSE ratio of each ABC method over LEA ranged from 0.99 to 1.36, showing that some ABC methods have near-identical RRMSE values to LEA. Among the ABC methods, smaller errors were obtained with ABC_ALL_FREQ using the GST distance. The MISE results showed a slightly different pattern, with LEA exhibiting increasingly better results as the number of alleles increased (supplemental Table 1). The RRMSE of the ti's also decreased with increasing numbers of alleles. In most repetitions the posteriors of the ti's had a mode close to zero, as seen in the examples in Figure 2, but a median close to 0.1, which is also the median of the prior, confirming that the ti's are difficult to estimate (Chikhi et al. 2001; Wang 2003). In general, th exhibited the smallest RRMSE whereas t2 exhibited the largest error values. This is probably due to the fact that P2 contributed less to the hybrid population and hence provided less genetic information (Wang 2003). An apparently surprising result was that in most cases the RRMSE was slightly larger for LEA than for the ABC methods (ABC_ALL_FREQ GST and ABC_SUMSTAT).
However, the RRMSE confidence intervals overlapped considerably, suggesting no significant differences among methods. Regarding the relative performance of sorting the alleles, i.e., SIMPLE vs. SORT, the latter exhibited lower RRMSE and MISE values and no bias, with both the rejection and the regression steps (Table 2). Thus, for the multilocus case we considered only the SORT approach.
The posterior distributions obtained with the approximate and full-likelihood methods for the multilocus data are represented in Figure 3, for the p1 parameter. As with single-locus data, the different methods produced similar distributions. Increasing the number of loci produced more accurate and precise distributions, reducing the RRMSE and MISE (Tables 3 and 4). For p1, the ABC point estimates were close to the ones obtained with LEA, producing nearly identical RRMSE values (Table 3). Note that in some cases the RRMSE was slightly smaller with the rejection step of the ABC methods. For instance, the RRMSE ratio for p1 varied between 0.99 and 1.02 for the rejection step of ABC_ALL_FREQ with the GST distance and between 0.98 and 1.06 for ABC_SUMSTAT. However, the ABC posteriors tended to be wider than the full-likelihood ones, as reflected by the higher MISE for the ABC methods (Table 4). LEA provided the posterior distributions with the smallest MISE, but ABC_SUMSTAT and ABC_ALL_FREQ with GST approximated those values reasonably well with the regression step. Note that the difference between the full likelihood and the ABC was typically higher with 10 loci, suggesting that LEA is better at using the additional information brought by new loci. For the ti's, the smallest average MISE was obtained with LEA and ABC_SUMSTAT. Focusing on ABC_ALL_FREQ, the GST distance metric tended to provide estimates with a smaller error than the Euclidean. Also, reordering the loci to minimize the distance of each simulation led to posteriors with higher density close to the true parameter values and closer to the ones obtained with the full likelihood.
Effect of tolerance, regression, and number of simulation steps:
The three ABC methods behaved similarly when the tolerance level varied, with lower RRMSE and MISE values as the tolerance level decreased (Figure 4). Although the ABC rejection step reached RRMSE values similar to LEA's for p1, the MISE did not approach the values of the full-likelihood method. The performance of the ABC methods approached LEA's only when the regression step was performed, and in this case the error decreased significantly over the rejection step even for the highest tolerance levels considered here. For the drift parameters, the situation was slightly different for the ABC_ALL_FREQ methods, since the regression did not lead to major improvements over the rejection. Note that the effect of the regression on the RRMSE was not clear for p1, as the RRMSE increased above that of the rejection step for the lowest tolerance level. This was also observed by Beaumont et al. (2002), who suggested that it was potentially caused by the limited number of points used to perform the regression (<500). However, this explanation may not apply here, since at least 1000 points were used and the MISE did not increase for the lower tolerance values. Increasing the total number of simulations from 10^6 to 10^8 did not lead to major differences, given the same tolerance level (Pδ). As long as 1000 points were accepted with Pδ = 10^-3, the parameters were reasonably well estimated after the regression step (not shown), suggesting that 1 million simulations were enough to obtain approximate results.
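The rejection step with a tolerance Pδ can be sketched as follows; this is a generic illustration of the standard ABC rejection algorithm, with toy prior draws and distances in place of the article's simulation machinery:

```python
import numpy as np

def abc_rejection(params, distances, p_delta):
    """Basic ABC rejection step: keep the fraction p_delta of simulations
    whose distance to the observed data set is smallest. `params` is an
    (n_sims, n_params) array of values drawn from the prior; `distances`
    holds the corresponding simulated-vs-observed distances."""
    n_accept = max(1, int(round(p_delta * len(distances))))
    keep = np.argsort(distances)[:n_accept]
    return params[keep], distances[keep]

# With 10^6 simulations and P_delta = 10^-3, 1000 points would be accepted,
# matching the setting discussed in the text; here a small toy run is used.
rng = np.random.default_rng(0)
params = rng.uniform(0.0, 1.0, size=(10000, 1))  # toy prior draws for p1
distances = rng.exponential(size=10000)          # toy distances
accepted, d = abc_rejection(params, distances, 0.01)
print(accepted.shape)  # (100, 1)
```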
Human data set (admixture in Jamaica):
As shown in Figure 5, the posteriors for p1 obtained with LEA had a high density around 0.07 (0.025–0.124; here and below, the 0.05 and 0.95 quantiles of the posteriors are given inside parentheses), suggesting a limited contribution of Europeans to the Jamaican gene pool. For t1 the posteriors had higher density around 0.2 (0.07–0.61). However, these posteriors were similar to the priors, suggesting that the data contain limited information about t1. For t2 and th the posteriors were clearly different from the priors and supported drift values close to zero (0.0016–0.0734 for t2 and 0.0004–0.0412 for th). As discussed by Chikhi et al. (2001), this is suggestive of a recent admixture event.
The ABC methods returned point estimates for p1 similar to LEA's, although the posteriors were less precise (0.013–0.320 for GST, 0.015–0.235 for Euclidean, and 0.009–0.260 for ABC_SUMSTAT). ABC_ALL_FREQ produced the posterior closest to the full-likelihood results. For the ti's, the ABC posteriors were very wide and approached LEA's results only qualitatively; i.e., they pointed to higher drift in Europe, limited drift in Africa, and even less drift in Jamaica. For the ti's, ABC_SUMSTAT returned estimates closer to LEA's than ABC_ALL_FREQ did. The analysis of the resampled data set led to results identical to LEA's, and almost no differences were found with the ABC methods after the regression step. As expected, for ABC_ALL_FREQ the rejection step performed better with the resampled data set. The analysis of different resampled data sets returned similar posteriors, suggesting that the effect of resampling was limited in this case (supplemental Figure 1). In contrast, reanalyzing the data sets while varying the upper limit of the ti priors significantly affected the p1 posteriors. Better estimates were obtained with lower upper limits (Figure 6). The reason is that the true ti values are most likely close to zero, and hence reducing the upper limit of the ti prior led the ABC methods to explore the most likely region of the parameter space more often.
Altogether, our simulations and the real data set analysis show that ABC using the full allelic distribution (ABC_ALL_FREQ) can be used to estimate parameters under a relatively complex demographic model. The results obtained here were similar to those obtained using summary statistics (ABC_SUMSTAT) and were comparable to those obtained with a full-likelihood method also based on allele frequency data. The ABC methods produced broader posterior distributions but did not appear to be biased (Tables 3 and 4). In principle, by increasing the number of simulations to infinity (or a very large number), the ABC based on allele frequencies should produce results identical to LEA's, whereas this would not necessarily be the case with the summary statistics, due to the inevitable loss of information when summarizing the data (Marjoram et al. 2003). In practice, and given the number of simulations performed (between 10^6 and 10^8), LEA tended to produce better results than the ABC algorithms, although it was at least 10 times slower as the number of loci increased.
Focusing on the rejection step, the two ABC approaches (ABC_ALL_FREQ and ABC_SUMSTAT) generated posterior distributions with point estimates close to the true value and similar to LEA's. However, with 10 loci, even when the number of simulations was increased up to 10^8 and the tolerance level Pδ was lowered to 10^-5, the posteriors were still wider than with LEA (Table 4). These results confirm the relatively poor efficiency of the rejection scheme when dealing with large data sets. This is potentially more problematic for the ABC_ALL_FREQ scheme, as the dimensionality increases quickly with the size of the data set. Several approaches were tested here to minimize this problem: (i) sorting the allele frequencies, (ii) reordering the loci, and (iii) using different distance metrics; all three improved the estimates.
A major improvement was observed for p1 when the locally weighted regression was applied, leading to posteriors close to LEA's even with 10^6 simulations (Pδ = 0.001). For the ti's, the regression step improved the posteriors of ABC_SUMSTAT but not of ABC_ALL_FREQ. This is most likely because the allele frequency data do not properly fit the assumptions of the linear regression; in particular, it is not clear that the relation between allele frequency and drift is linear. In fact, better estimates were obtained by assuming a linear relation between the time since admixture and the square of the allele frequencies, probably because it is the variance of allele frequencies that increases with drift. Drift significantly affected the estimation of p1, and even with 10 biallelic loci the posteriors for p1 were rather wide with ti = 0.01 (10 generations of drift for Ne = 1000). This is in agreement with Choisy et al. (2004), Excoffier et al. (2005), and Wang (2006) and confirms that methods that do not account for drift when estimating demographic parameters will tend to provide misleadingly precise values.
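The regression adjustment can be sketched, in the spirit of the local-linear scheme of Beaumont et al. (2002), for one parameter and one summary (the function and toy data are ours, simplified from the multivariate version actually used): accepted parameter values are regressed on the summaries with Epanechnikov weights and shifted to the value predicted at the observed summary.

```python
import numpy as np

def regression_adjust(theta, s, s_obs, distances, delta):
    """Local-linear regression adjustment (Beaumont et al. 2002, sketched
    for a single parameter and summary). `theta` holds accepted parameter
    values, `s` the corresponding summaries, `delta` the acceptance
    threshold used to define the Epanechnikov kernel weights."""
    t = distances / delta
    w = np.clip(1.0 - t ** 2, 0.0, None)          # Epanechnikov weights
    X = np.column_stack([np.ones_like(s), s - s_obs])
    XtW = X.T * w                                  # X^T diag(w)
    alpha, beta = np.linalg.solve(XtW @ X, XtW @ theta)
    return theta - beta * (s - s_obs)              # project to s = s_obs

# If theta depends exactly linearly on the summary, every adjusted value
# collapses onto the regression prediction at the observed summary.
s = np.linspace(0.3, 0.7, 51)
theta = 2.0 + 3.0 * s                              # toy linear relation
dist = np.abs(s - 0.5)
adjusted = regression_adjust(theta, s, 0.5, dist, dist.max() * 1.01)
print(adjusted.min(), adjusted.max())              # both ~3.5
```

Regressing on the square of the allele frequencies, as mentioned above for the ti's, amounts to passing s**2 instead of s to such a routine.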
Overall, the simulation study results show that ABC_SUMSTAT provides good approximations to the full likelihood and is probably easier to use than ABC_ALL_FREQ, despite the potential problem of choosing the summary statistics (but see Joyce and Marjoram 2008). However, in the analysis of the human data set, ABC_ALL_FREQ produced the p1 posteriors closest to LEA's. This suggests that there may be situations where using the allele frequencies is suitable and provides better estimates. For the real data set, LEA produced much more precise posterior distributions, which contrasts with the results obtained in the simulation study, where the ABC schemes approached the full-likelihood method reasonably well. Potential explanations for these differences are the influence of factors not taken into account in the simulation study, such as the sample size (larger in the real data set), the contribution of the parental populations (set to p1 = 0.7 in the simulation study), and the effective size of the populations (set to be equal in the simulation study). The differences may also be related to the priors and to the exploration of the parameter space. As seen in the simulation study, the drift since admixture affects the estimates of p1, and thus the prior uncertainty on the ti's is expected to influence the posteriors. The ABC rejection scheme explores the parameter space randomly, whereas the full-likelihood MCMC method will tend to remain in the region of the most likely parameter values after the burn-in period. In the human data set, the results point to limited drift in P2 and PH (t2 and th close to zero), and thus changing the upper limit of the ti prior could affect the ABC efficiency. This was indeed observed when the human data set analysis was repeated with different ti upper limits: the precision of the p1 posterior distributions tended to increase, approaching LEA's, as the uncertainty about the ti's decreased.
This points to the importance of the exploration of the parameter space during the rejection scheme and of choosing informative priors for drift when trying to estimate the contribution of the parental populations. It is noteworthy that the ABC framework may provide a simple way to assess whether a data set fits the model. The idea is to compare the distance distribution of the real data set with the distance distributions of data sets generated under the admixture model, allowing us to assess whether the real data produced, on average, larger distances than expected under the model. We found that the human data set distances were well within the ones obtained under the model (supplemental Figure 2). As a counterexample, we also simulated data sets under two alternative models, namely (i) one panmictic population and (ii) three independent populations. The data sets from the latter tended to return larger distances than expected under the admixture model, whereas those from the former returned only slightly larger distances. This suggests a simple way to determine whether a model is acceptable for a particular data set. Note that similar principles are used for model choice with ABC (e.g., Estoup et al. 2004; Fagundes et al. 2007).
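The model check described above can be sketched as follows; this is a crude illustration with synthetic distance distributions (the function name and the choice of comparing medians are our simplifications), not the exact procedure used for supplemental Figure 2:

```python
import numpy as np

def distance_fit_score(real_distances, model_distances):
    """Fraction of model-generated distance medians at least as large as the
    median distance of the real data set. Values near zero indicate that
    the real data produced larger distances than expected under the model,
    i.e., a poor fit."""
    real_median = np.median(real_distances)
    return float(np.mean(np.asarray(model_distances) >= real_median))

# Toy example: the "real" data either resemble the model-generated data
# sets (good fit) or produce systematically larger distances (poor fit).
rng = np.random.default_rng(1)
model = rng.normal(1.0, 0.1, size=1000)          # medians under the model
good = distance_fit_score(rng.normal(1.0, 0.1, 200), model)
poor = distance_fit_score(rng.normal(2.0, 0.1, 200), model)
print(good, poor)
```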
This study confirms that a simple rejection scheme can become inefficient when dealing with high-dimensional data, such as full allelic distributions with many alleles and loci. However, we found that ABC_ALL_FREQ was able to deal with a large number of biallelic loci, such as SNPs, by using heuristic approaches to match the observed and simulated data. Our results suggest that the efficiency of the rejection step depends on the distance metric chosen (e.g., GST and Euclidean), on the minimization of the distance between the simulated and the observed data sets (e.g., SORT vs. SIMPLE), and on the exploration of the parameter space (e.g., the effect of ti uncertainty on the p1 estimates). Regarding the choice of distance metrics, little has been done to assess objectively how to select them. In the simulation study the error was lower when using the GST distance, but in the real data set the Euclidean distance provided the posteriors closest to the full-likelihood method. Thus, despite the generally better performance of GST, the choice seems to be data dependent. One way to predict which distance metric should be preferred might be to look at the correlation between the parameter values sampled from the priors during the rejection scheme and the corresponding distances. In our simulations, we found higher correlations for the p1 parameter with the GST distance (supplemental Figure 3). This suggests that GST may be more efficient at capturing small variations of p1 and that these correlations might be used to select the most suitable distance metric. While the ABC rejection step was much quicker than LEA, our results clearly show that to produce identical results the number of simulations required would be computationally prohibitive. Our simulations also confirm that the regression step is crucial to obtain posteriors close to the full likelihood at a relatively low computational cost.
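The correlation-based heuristic for choosing a distance metric can be sketched as follows. The two "metrics" here are synthetic stand-ins (one informative, one noisy), not actual GST and Euclidean computations; the point is only the selection rule:

```python
import numpy as np

def metric_correlation(param_draws, distances):
    """Pearson correlation between prior draws of a parameter and the
    resulting simulated-vs-observed distances, used as a heuristic score:
    the metric whose distances track the parameter more strongly should
    discriminate better among its values during rejection."""
    return float(np.corrcoef(param_draws, distances)[0, 1])

# Toy comparison: both metrics respond to how far p1 is from its (assumed)
# true value of 0.3, but metric B is swamped by noise.
rng = np.random.default_rng(2)
p1 = rng.uniform(0.0, 1.0, 5000)
signal = np.abs(p1 - 0.3)
metric_a = signal + rng.normal(0.0, 0.02, 5000)   # informative metric
metric_b = signal + rng.normal(0.0, 1.00, 5000)   # noisy metric
print(abs(metric_correlation(p1, metric_a)) > abs(metric_correlation(p1, metric_b)))
```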
Therefore, further improvements to the ABC approach using allele frequencies are possible, either by increasing the efficiency of the rejection scheme or by investigating different regression models. Our results suggest that it is mainly at the level of the rejection step that further improvements can be achieved. For instance, recent approaches that explore the parameter space efficiently by spending most of the time in the most likely regions can be used, such as sequential approaches (Sisson et al. 2007; Beaumont et al. 2008) and MCMC without likelihoods (Marjoram et al. 2003). Another promising procedure for reducing the dimensionality of the data sets is principal component analysis (PCA) of the allele frequencies. This has proved useful for extracting information from such data (Novembre et al. 2008) and could be used in a rejection–regression scheme. Other generalized linear regression models and/or nonlinear approaches can also be investigated and, as described by Blum and François (2008), can substantially improve the efficiency of ABC algorithms.
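A PCA reduction of simulated allele frequencies, as suggested above, could be sketched like this (an illustrative SVD-based implementation with toy data, not a tested part of our pipeline): the components estimated from the simulated table would also be used to project the observed frequencies before the rejection–regression steps.

```python
import numpy as np

def pca_reduce(freq_matrix, n_components):
    """Project an (n_sims, n_freqs) matrix of simulated allele frequencies
    onto its first principal components via SVD. Returns the scores, the
    components, and the column means needed to project new data (e.g., the
    observed frequencies) into the same reduced space."""
    mean = freq_matrix.mean(axis=0)
    U, S, Vt = np.linalg.svd(freq_matrix - mean, full_matrices=False)
    components = Vt[:n_components]                 # (n_components, n_freqs)
    scores = (freq_matrix - mean) @ components.T   # (n_sims, n_components)
    return scores, components, mean

rng = np.random.default_rng(3)
sims = rng.uniform(0.0, 1.0, size=(200, 40))       # toy frequency table
scores, components, mean = pca_reduce(sims, 5)
print(scores.shape, components.shape)              # (200, 5) (5, 40)
```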
In summary, our results confirm that ABC methods are very flexible and easy to implement, provided that it is possible to simulate data sets under the desired demographic models. Although full-likelihood methods provide more accurate and precise results and should thus be preferred over ABC approaches when feasible, ABC methods can provide good estimates at an acceptable computational cost when dealing with large data sets or with complex models. For problems in which the choice of summary statistics is not obvious, the full allelic distribution could potentially be used to obtain approximate posterior density estimates.
We thank P. Fernandes for making available the bioinformatics resources at the Instituto Gulbenkian de Ciência and for his help in their use. We also thank B. Parreira for her suggestions and help concerning the regression step. We acknowledge two anonymous reviewers and the editor for very useful comments, particularly the suggestion to analyze the distance distributions to test if a data set fits the model and the suggestion to select the distance metrics on the basis of the correlation of distance and parameter values. This work was supported by grant SFRH/BD/22224/2005 to V.S. from “Fundação Ciência e Tecnologia” (FCT, Portuguese Science Foundation). L.C. is funded by the FCT Project PTDC_BIA-BDE_71299_2006 and by “Institut Français de la Biodiversité,” “Programme Biodiversité des îles de l'Océan Indien” grant no. CD-AOOI-07-003. Part of this work was carried out and written during visits between Toulouse and Lisbon that were funded by the “Actions Luso-Françaises”/“Acções Integradas Luso-Francesas” (F-42/08). M. M. Coelho and B. Crouau-Roy are also thanked for making these visits possible. We also thank the Egide Alliance Programme (project no. 12130ZG to L.C. and M.B.) for funding visits between Toulouse and Reading.
Communicating editor: M. Stephens
Received October 31, 2008.
Accepted January 21, 2009.
- Copyright © 2009 by the Genetics Society of America