- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Chikhi, L.
- Articles by Beaumont, M. A.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Chikhi, L.
- Articles by Beaumont, M. A.
Estimation of Admixture Proportions: A Likelihood-Based Approach Using Markov Chain Monte Carlo
Lounès Chikhia,b, Michael W. Bruforda,c, and Mark A. Beaumonta,da Institute of Zoology, Regent's Park, London NW1 4RY, United Kingdom,
b School of Biological Sciences, Queen Mary and Westfield College, University of London, London E1 4NS, United Kingdom,
c School of Biosciences, Cardiff University, Cardiff CF10 3TL, United Kingdom
d School of Animal and Microbial Sciences, University of Reading, Reading RG6 6AJ, United Kingdom
Corresponding author: Lounès Chikhi, Department of Biology, University College London, Wolfson House, 4 Stephenson Way, London NW1 2HE, United Kingdom., l.chikhi{at}ucl.ac.uk (E-mail)
Communicating editor: D. CHARLESWORTH
| ABSTRACT |
|---|
When populations are separated for long periods and then brought into contact for a brief episode in part of their range, this can result in genetic admixture. To analyze this type of event we considered a simple model under which two parental populations (P1 and P2) mix and create a hybrid population (H). After that event, the three populations evolve under pure drift without exchange during T generations. We developed a new method, which allows the simultaneous estimation of the time since the admixture event (scaled by the population size ti = T/Ni, where Ni is the effective population size of population i) and the contribution of one of two parental populations (which we call p1). This method takes into account drift since the admixture event, variation caused by sampling, and uncertainty in the estimation of the ancestral allele frequencies. The method is tested on simulated data sets and then applied to a human data set. We find that (i) for single-locus data, point estimates are poor indicators of the real admixture proportions even when there are many alleles; (ii) biallelic loci provide little information about the admixture proportion and the time since admixture, even for very small amounts of drift, but can be powerful when many loci are used; (iii) the precision of the parameters' estimates increases with sample size (n = 50 vs. n = 200) but this effect is larger for the ti's than for p1; and (iv) the increase in precision provided by multiple loci is quite large, even when there is substantial drift (we found, for instance, that it is preferable to use five loci than one locus, even when drift is 100 times larger for the five loci). Our analysis of a previously studied human data set illustrates that the joint estimation of drift and p1 can provide additional insights into the data.
DURING their history, populations can be separated for long periods and then brought into contact for a brief episode in part of their range, resulting in genetic admixture (![]()
![]()
![]()
![]()
![]()
![]()
The interest for admixture estimation and admixed populations thus ranges from evolutionary to more applied issues. The study of admixed populations can provide information on (i) the inheritance of complex genetic disease and, in particular, the mapping of the genes involved (![]()
![]()
Even though one could use admixture methods to estimate the relative contributions of subspecies meeting in hybrid zones, it is important to stress that the studies of hybrid zones and of admixed populations are often quite different. Whereas hybrid zone studies deal with spatial phenomena, admixture studies usually disregard this aspect and concentrate on the estimation of admixture proportions (see, for instance, ![]()
Recent theoretical advances substantially improved the ability to use genetic information from present-day populations to draw inferences about past demographic events (e.g., ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
One could naively use coalescent-based simulations to estimate how often a particular allelic configuration is observed. Practically, however, this is not possible because the number of possible genealogies becomes astronomical very quickly. As a consequence, even for moderate sample sizes (n > 10), the likelihood is impossible to evaluate by direct simulation. One could also consider using an analytical approach to derive an expression for the likelihood. Unfortunately, this expression is practically impossible to solve as soon as the number of alleles and the sample size become large (see, however, METHODS). ![]()
![]()
In this article, we apply a full-likelihood and coalescent-based approach to the admixture problem. We derive the likelihood function and compare results from this analytical approach with approximations obtained using the method of ![]()
50). In this study we chose to simulate data sets that are closer to those currently available (i.e., total sample sizes = 150 and 600, see below). We tested the performance of our method on a wide range of parameters for both sample sizes and with either biallelic loci (similar to many allozymes) or 10-allele loci (similar to microsatellite or mtDNA data). Finally, we applied the method to a published human data set.
| METHODS |
|---|
The model:
The admixture model shown in Fig 1 assumes that two independent parental populations, P1 and P2, of size N1 and N2, mixed some time T in the past (measured in generations) with respective proportions p1 and p2 (= 1 - p1), creating a hybrid population H of size Nh. At the time of hybridization, the gene frequency distributions of P1 and P2 are, respectively, the two vectors x1 and x2, and that of the hybrid population is p1x1 + p2x2. After admixture, P1, P2, and H evolve independently (with no migration) by pure drift (no mutations) until the present time. Even though T, the time since admixture (in generations) is the same for the three populations, the time scaled by the effective size of each population can be different for the three populations and is thus called t1 = T/N1, t2 = T/N2, and th = T/Nh. The parameters of the model are thus p1, t1, t2, th, x1, x2. Note that ![]()
|
A Bayesian approach:
We are interested in making inferences about a parameter (or a set of parameters)
of a statistical model by using the information provided by the observation of the data, D. This is given by a probability density function (pdf), which describes the probability distribution of
given the data p(
|D). We can use Bayes' theorem to write
![]() |
(1) |
The first term is the pdf of
before the data are obtained and is therefore called the prior as opposed to p(
|D), which is the posterior. Practically, p(
) summarizes our belief, knowledge, or lack of knowledge about
. The second term represents the probability of observing the data under the statistical model. Seen as a function of
, p(D|
) (= L(
)) is the likelihood function (![]()
|D) up to this multiplicative constant. When
is a set of parameters, we can obtain the distribution of any specific parameter by averaging across all others, and this is called the marginal pdf.
By taking a Bayesian (or full-likelihood) approach we consider that all relevant information about the parameter(s) is contained in the posterior pdf, and we are thus interested in the complete distribution rather than in point estimates. However, summary statistics such as point estimates can convey convenient information about the pdf for comparison and are provided as well. For instance, the standard deviation (SD) is given when useful because it is a commonly used measure of dispersion. However, dispersion is better described using the width between the 5 and 95% quantiles. This is often referred to as the 95% credible or equal-tail probability interval (CI or ETPI, respectively). To avoid confusion with the 95% confidence interval we use ETPI. For approximately symmetric distributions, using the mean, the median, or the mode provides very similar results. However, for distributions that are highly skewed toward small values, the mode can be very difficult to estimate. This proved particularly true for the ti's. We therefore decided to use the median that is the most widely used point estimate (see ![]()
The Bayesian procedure requires that we provide a prior on all parameters of the model. Although this step may be difficult in some problems, since it involves some subjectivity (![]()
The full likelihood:
However, the posterior p(p1, t1, t2, th, x1, x2 | D) [corresponding to p(
|D) in Equation 1] is not available in a closed form. The likelihood function p(D | p1, t1, t2, th, x1, x2) can be written as (see ![]()
![]() |
(2) |
where

a1, a2, and ah are the sample frequency counts in present-day samples of P1, P2, and H; f1, f2, and fh are the founder frequency counts in P1, P2, and H; c1, c2, and ch are the number of coalescences in the genealogical history; and n1, n2, and nh are the sample sizes of P1, P2, and H.
The first term (A) was first derived by ![]()
![]()
![]()
![]()
It is, however, computationally expensive to estimate this likelihood directly, because the number of allelic configurations among the founders that is compatible with the data can be very large. An alternative approach is therefore to use sampling methods to estimate the likelihood. Equation 2 can be rewritten in a more general form as
![]() |
(3) |
where G represents all possible genealogies and consists of a sequence of c coalescent events going back from time 0 to time T and where the allele frequency count among the lineages is recorded at each event. Following the notation of ![]()
) using standard methods of simulation from the coalescent (![]()
![]()
![]()
![]()
![]()
![]()
![]()
The method of Griffiths and Tavaré:
To circumvent the problem of analyzing all possible genealogies and allelic configurations, ![]()
![]()
![]()
![]()
![]() |
(4) |
Thus (2) as can be approximated by simulating K times from p*(G|c)p(c|
) and estimating (4) as
![]() |
(5) |
for all realized G and c. In fact, the scheme of Griffiths and Tavaré always guarantees that p(D|G) = 1, because the genealogical history is constructed backward from the data as described below.
More specifically, the G and c are sampled according to the following scheme. If we call Sk the state with k lineages, the state Sk-1 is chosen (going backward in time) according to the transition probabilities
![]() |
(6) |
(![]()
![]()
![]()
![]()
![]()

] by the probability of observing this founding state, which is a multinomial draw from the ancestral parental frequencies. If a chain has more coalescent events than n - k (i.e., giving rise to fewer than k founders), p(G |
) = 0 by construction.
This chain is run a reasonably large number of times and the likelihood is averaged across these runs. A comparison of simulated vs. analytical results on small data sets and comparing results obtained with different numbers of runs on larger data sets shows that 500 runs is large enough to estimate the likelihood when drift only is considered (see Appendix and ![]()
To summarize, the method of Griffiths and Tavaré allows us to calculate the likelihood p(D | p1, t1, t2, th, x1, x2) for specific values of p1, t1, t2, th, x1, and x2. Since we are interested in obtaining the posterior distribution p(p1, t1, t2, th, x1, x2 | D) [equivalent to p(
|D) in Equation 1] and, in particular, some of the marginals such as p(p1 | D), we need a method to sample from the posterior distribution. Markov chain Monte Carlo (MCMC) is a sampling-based method that enables us to do so.
Markov chain Monte Carlo methodology:
In Monte Carlo simulations, samples Xi (i = 1 ... n) of a random variable X are drawn from a distribution
(.) and then used to evaluate functions of X. When the distribution of interest is impossible to evaluate either because no closed form is known or because it is difficult to sample from, it is possible to construct a Markov chain having
(.) as its equilibrium distribution. One method to do so is by using the Metropolis-Hastings algorithm (![]()
![]()
![]() |
(7) |
Note that we need only to be able to estimate
(.) at some specific values and up to a multiplicative constant (i.e., provided by the IS scheme above). If the candidate state is not accepted the chain remains in its current state and a new candidate state is randomly chosen from the proposal distribution. Provided that some conditions are met (irreducibility of the chain; e.g., ![]()
(.) once equilibrium is reached. Practically the choice of q(.|Xt) is crucial if one wants the chain to reach equilibrium in a reasonable amount of time (see below). We applied the MCMC algorithm to the parameter space defined by our admixture model, i.e., p1, t1, t2, th, x1, and x2.
Different proposal (or updating) distributions were tested during the development of the method. We finally updated p1 by taking a normal random deviate around p1 with a standard deviation 0.05. We also found it efficient to update p1 10% of the time rather than at every step. The other parameters were updated the rest of the time. A lognormal distribution with mean ti and standard deviation s =
on a log scale was used for t1, t2, and th, where nloc is the number of loci. The ancestral parental allelic frequencies were updated by first selecting an allele at random, thus defining a partition of two sets of alleles: the allele itself and all the others. A ß-distribution with parameters v and w was then used to update the chosen allele frequency. v was chosen to be 1 while 1/(1 + w) was equated to the smallest frequency of the partition (see Appendix in CIOFI et al. 1999).
Testing for convergence and analysis of the output:
A key issue in MCMC simulation is to determine when equilibrium has been reached, i.e., when to stop the simulation to have a reasonable approximation of the posterior or likelihood curve. This is a serious problem, since even very long runs that appear to have converged may in fact be misleading (see ![]()
![]()
![]()
![]()
. ![]()
5% that observed within chains) are a good indication that equilibrium is reached (see ![]()
We ran 10 independent chains for independent loci for each of the three tested values of ti. This was done for loci with 10 alleles and a sample size of 200 genes per population (see next paragraph for the exact procedure). In all cases, we found that running the chain for 50,000 steps was enough to produce values of the statistic <1.1 (see Fig 2, which represents 10 runs for p1 and th for ti = 0.001 and ti = 0.1). We did not need to repeat the diagnostic analysis for the smaller sample size (n = 50, see below) or number of alleles since equilibrium is reached more quickly.
|
For each run 10,000 points were collected for all parameters of the model (i.e., 1 point every 5 steps). Following ![]()
Unless otherwise stated, all statistical analyses were performed using the R language (![]()
![]()
![]()
Simulating according to the model:
To test the method, we simulated data sets according to the model following a coalescent methodology. The two ancestral allele frequency distributions, x1 and x2, of the parental populations were simulated from two independent flat Dirichlet distributions. The allele frequency distributions of the hybrid population xh were then calculated as p1x1 + p2x2. We simulated the number of founders for the three populations under pure drift using a standard coalescent methodology over the intervals t1, t2, and th, respectively. The genetic types of the founders of the three populations were then sampled from x1, x2, and xh. For each of the populations, a lineage was chosen randomly and duplicated until the sample size was reached. The output of these simulations was fed into a program implementing our method. Because of the huge amount of calculations involved by MCMC methods we had to limit the parameter combinations that could be analyzed. All simulations were thus performed with p1 = 0.3 and by considering the same sample size for the three populations. However, the effect of sample size was investigated by using two different sample sizes (n = 50 and n = 200; i.e., 150 and 600 genes from the three populations in total, respectively). Three (scaled) times since admixture were used in the simulations. For simplicity, again, the same value was used for the three ti (i.e., the three populations were of the same size; see, however, discussion for a test of the effect of dissimilar sizes). We used ti = 0.001, 0.01, 0.1, which for an effective size of 1000 corresponds to 1, 10, and 100 generations of drift, respectively. For each parameter combination 20 independent loci were simulated (i.e., corresponding to 20 independent runs of the coalescent process). We also tested the importance of the number of alleles by using loci with either 2 or 10 alleles. For the 2-allele loci, 10 loci were simulated to reduce the time of analysis. Note that, for real data sets, there are no limitations whatsoever on the sample sizes. They can vary from locus to locus and population to population (see, for instance, the human data set analyzed). Loci with different numbers of alleles can also be used.
To summarize the principle of our approach, we used Bayes' theorem to rewrite the posterior pdf as a function of a prior and a likelihood. The likelihood was estimated at specific values of the parameter space using Griffiths and Tavaré's algorithm and a MCMC was run to obtain samples from the whole distribution. Finally, we used simulated data sets to test the accuracy of the method.
| RESULTS |
|---|
Estimation of admixture proportions from single-locus data:
Fig 3 represents the results obtained for the 20 loci of the 10-allele simulations. It shows the effect of the sample size and ti on the estimation of the admixture parameter p1. The results are also summarized in Table 1 while those of the 2-allele simulations are summarized in Table 2. The numbers given in both tables represent the averages of the medians of each of the pdf's of the independent loci and the width of the 95% ETPI across the 20 loci.
|
|
|
For all sample sizes and numbers of alleles, the pdf's widen as the time since admixture increases. This is because the genetic information about the admixture event is gradually eroded by subsequent genetic drift in the three populations. For ti = 0.001 (n = 50, 10 alleles) the average SD across the 20 loci of p1's posterior pdf's is 0.184 and increases to 0.198 for ti = 0.01 and to 0.222 for ti = 0.1. The 95% ETPI averaged across loci can be rather large and ranges from 0.71 to 0.81 as ti goes from 0.001 to 0.1 (for the n = 50, 10-allele case, Table 1). This indicates that the number of values that can be regarded as unlikely is in fact limited when only 1 locus is used (
2030% of the p1 values).
The effect of increasing sample size can be seen by comparing the left and right sides of Fig 3 and Table 1 and Table 2. For ti = 0.001 the average width of the 95% ETPI decreases from 0.71 to 0.56 and the average SD from 0.184 to 0.144 when the sample size goes from 50 to 200. For n = 200 the 95% ETPI reaches a value
0.70 and the average SD becomes 0.184 only for ti somewhere between 0.01 and 0.1 (Table 1). In other words, the precision is higher for n = 200 than for n = 50, even when drift is 10100 times as large. Note that the effect of sample size seems particularly strong for small ti values (95% ETPI of 0.56 vs. 0.70 for ti = 0.001, as opposed to 0.77 vs. 0.81 for ti = 0.1). It is thus worth increasing the sample size only if ti is <0.01. This means that, as drift increases, the amount of information that can be extracted about p1 is quite limited even with large samples. In such cases, the only solution is to increase the number of loci (see below).
The most dramatic factor affecting the estimation of p1 seems to be the number of alleles (Fig 4 and Table 2). Two-allele loci seem to provide little information on the admixture proportion even for very small values of ti. It is clearly preferable to have a single 10-allele locus after 100 times more generations of drift than one biallelic locus. With 2-allele loci it is practically impossible to exclude any value of p1 as can be seen from the 95% ETPIs (Table 2), which cover nearly 95% of the possible values of p1 even with n = 200 (i.e., as one would expect if there were no data).
|
As should be clear from Fig 3 and Fig 4, single point estimates such as those provided in Table 1 and Table 2 should be used with caution. Even though these values indicate that the method is reliable (the estimates of p1 are very close to the real value for both ti = 0.001 and 0.01 and differ only moderately for ti = 0.1), some single-locus pdf's can point to very different values (Fig 3). For instance, when drift is important (ti = 0.1), as many as 11 of the 20 pdf's had a median >0.5, 6 of which were >0.6 and 2 of which were >0.7 (for n = 50). For n = 200 there were, respectively, six, three, and two loci. Clearly, the whole distribution or the 95% ETPI should be used in place of the point estimates for p1.
When drift increases, it is possible for at least one of the populations to become fixed for one allele. In such cases the absence of polymorphism means that the corresponding population had either a very small size or a very large T. As a result large values of ti become equally likely and the MCMC cannot reach equilibrium for the corresponding ti. In practice this is easily overcome by introducing a prior on the distribution of the corresponding ti. We come back to this point in the analysis of the human data set. We observed this effect in the two-allele case for a few loci (1 locus for n = 200 and 7 loci for n = 50). As a consequence, the averages presented take into account only the parameters for which a posterior pdf was available (i.e., between 7 and 10 loci depending on the parameter combination).
Estimation of time since admixture from single-locus data:
Fig 5 shows the effect of drift on the estimation of th. As for p1, the pdf's 95% ETPIs increase as ti increases, reaching values
0.30.4 for ti = 0.1 (in the 10-allele case, Table 1) and even 3.5 for the 2-allele cases (Table 2). In the 10-allele cases, large samples (n = 200) provide more information than smaller ones (n = 50) even when drift is 10 times as large. However, we do not observe a greater effect of the sample size for small ti, which is similar to that observed for p1. Note that in the 2-allele cases, where the amount of information is very limited, increasing the sample size has virtually no effect (Table 2) and we therefore focus on the 10-allele cases.
|
Another difference from p1 pdf's is that the median is a rather poor point estimator of ti for small values of ti whereas it is reasonable for ti = 0.1 (Table 1). It is possible that because the pdf's of the ti are highly skewed toward zero, the maximum-likelihood estimate (MLE) should be preferred. However, regardless of the choice of a point estimator, the distributions are very wide: the 95% ETPIs are of the same order of magnitude for all ti's and are therefore more than two orders of magnitude larger than the real ti value for ti = 0.001. The simplest solution is probably to follow the full-likelihood approach and consider the whole distribution rather than point estimates. Indeed, the pdf's obtained for ti = 0.001 and ti = 0.1 are clearly different (Fig 5) even though summary statistics miss the differences.
Even if one uses the whole distribution one may wonder why these 95% intervals are so large and similar for ti = 0.001 and 0.01. A possible reason is that, for small ti values, the effect of sampling is no longer negligible as compared to drift. Indeed, in the 10-allele cases, when n = 200 the width of the 95% ETPI is much more reduced for small ti's than when n = 50 (Table 1). Thus, increasing the sample size does provide information on the ti's even if it does not have a great effect on p1. Note that pdf's for t1 are usually larger than those for t2, simply because p1 < 0.5 (i.e., there is more genetic information on parental population 2). Even though reasonable amounts of information can be extracted from single-locus data, it appears that this is true when the admixture event is recent and both the sample size and number of alleles are large.
Estimation of parameter values from multilocus data:
Multilocus estimation was performed for the 10-allele case using the data from 5 loci together. This was done for one sample size (n = 50) and three values of ti, namely 0.001, 0.01, and 0.1. We used the data from the first 5 independent loci analyzed for the parameter combination (i.e., 5 of the 20 loci represented in Fig 3, a and c, respectively) to compare the single- and multiple-locus results. The likelihood for multiple loci is estimated using Griffiths and Tavaré's algorithm by multiplying the likelihoods for individual loci at each step of the MCMC. In other words, loci are assumed to be independent.
Fig 6A shows the three pdf's obtained from the five-loci data together (represented by the solid lines). For comparison, the five single-locus pdf's obtained for ti = 0.001 are represented by dashed lines. The increase in information is such that it is clearly better to use five loci than one locus even when the drift is >100 times as large. Indeed, the 95% ETPI of the five single-locus pdf's varies between 0.58 and 0.74 for p1 whereas it is 0.23, 0.27, and 0.47 when ti = 0.001, 0.01, and 0.1, respectively. In other words, the uncertainty on the real value of p1 is divided by 2 to 3 depending on whether the amount of drift is equal or 10 times larger. Even when drift is 100 times larger it is still better to have five loci than one locus. To put this into perspective, in a population whose effective size is 1000, admixture will be better estimated with five loci after 100 generations of drift than it would have been with one locus such as mitochondrial DNA just one generation after the admixture event. The effect on ti is even larger with a reduction of the 95% ETPI by a factor 3 to 5. Fig 6A also shows another solid line, which was obtained using the information from five biallelic loci together for ti = 0.001. Clearly, the pdf is not distinguishable from the pdf's obtained for single loci having 10 alleles. This indicates that biallelic loci such as allozyme loci can provide reasonable amounts of information on admixture events when they are used jointly. If, as appears here, five biallelic loci are approximately equivalent to one 10-allele locus, then studies using 40 allozymes such as those produced in the last decades might be comparable to studies currently using 510 microsatellites. This comparison is certainly very rough, but shows that precise estimates of admixture proportions can be estimated with very easily obtained genetic markers.
|
Fig 6B shows the apparently flat distributions obtained with single-locus data for ti = 0.001. The distribution for five loci (solid line) shows an improvement but apart from pointing toward low values of ti the pdf is still flat. As was said earlier, information on the amount of drift is limited because of the inherent stochastic behavior of the coalescent and of other sources of variation. Note that the appearance of flatness is increased by the scale used to represent the six curves (see Fig 2B where the ti's are represented on another scale).
It might be thought that a convenient way of combining the information across loci would be to multiply the posterior pdf's across loci (instead of running the multiple loci data) and then renormalizing. However, these pdf's are marginals and the correct procedure would be to multiply the full pdf (i.e., across all parameters) and then take the marginal. Using the full pdf from the n independent loci is impractical because of the very high dimensional density estimation that would be involved. Therefore, it is necessary to run the MCMC simulations with all loci simultaneously. We found that with multiple loci, the MCMC runs take longer to converge than expected on the basis of the number of loci. For instance, we ran the five-loci data for 350,000 instead of 250,000 steps. Even though multiple-loci data take longer to analyze, the data produced justify this extra analysis time. Also, when only single-locus data are available the whole pdf should be used.
Effect of dissimilar sizes:
Even though the present method does not require the populations to have the same size, all previous data sets were simulated under this assumption (the three ti were always equal, see METHODS). It is necessary to test the method when the parental and hybrid populations have been subject to dissimilar amounts of drift and assess the effect on the estimation of the parameters. To do this we simulated data sets where the two parental populations were always of the same size and the hybrid was either 10100 times smaller (t1 = t2 = 0.001, th = 0.01 or 0.1) or larger (th = 0.001 and t1 = t2 = 0.01 or 0.1). For each of the four parameter combinations, 3 independent loci were simulated and analyzed. Fig 7 shows some of these results for p1 and th. Even though the number of loci analyzed is limited, a number of features are apparent. First, the precision on p1 (Fig 7, a and c) seems to be mostly dependent on the population that has drifted most (i.e., the largest value of ti) whether it is the hybrid (Fig 7A) or parental (Fig 7C) population. Indeed the 95% ETPI averaged across the 3 loci is 0.68 for th = 0.01 and 0.81 for th = 0.1. These values are within the values observed for the 20 loci when the three ti's were equal to 0.01 and 0.1, respectively (Table 1). Second, Fig 7B and Fig D shows that when there is a 100-fold ratio between th and both t1 and t2 the posterior pdf's become clearly different. When the ratio is 10 then the difference in the posterior pdf's is not as obvious and would certainly require multilocus or large sample data to be visible. This is because the pdf's on ti are wide even for small ti's as was shown before (Fig 5 and Table 1). Third, a comparison of Fig 7, a and c indicates that the pdf's for p1 are thinner when the hybrid is subject to little drift than when it is the parental populations that are subject to little drift. This is surprising because in our simulations two out of three populations experience large amounts of drift in the "large hybrid" cases instead of only one in the "small hybrid" cases.
|
| APPLICATION TO A HUMAN DATA SET |
|---|
We applied the method to a data set published by ![]()
![]()
![]()
![]()
![]()
We applied the method to the Jamaican sample because it is more likely to fit our model than the other samples. The allele frequencies in the three populations are given in Table 2 (average sample size: Europeans, n = 292; Africans, n = 388; Jamaicans, n = 186 chromosomes; Table 3). We ran the data for the nine loci independently first (for 50,000 steps) to check for loci that could have a very different behavior, perhaps indicating selection. Then we ran the data for the nine loci together for 300,000 steps. To check for convergence, we ran the chain six times, from different starting points. We also ran one "long" chain for 600,000 steps. The first 50,000 steps of each chain (100,000 for the long one) were discarded and the analysis was done with the rest of the points after thinning, resulting in 50,000 points per run (100,000 for the long run). Each multilocus run took
1 week on a Pentium 500 Mhz under Linux.
|
The single-locus data analysis was performed for all loci; but for two loci (FY-null and ICAM), at least one parental population was fixed for one of the alleles (Table 3). The absence of polymorphism despite the large sample sizes means the data are as likely to have been generated by any large value of ti and the MCMC will not reach equilibrium for the corresponding ti's, which move to larger and larger values. Practically, one can introduce a prior on the distribution of ti during the analysis (a possibility could be to use a flat prior between 0 and some reasonable value such as ti = 1 or 10) and then use the marginals of the other parameters of interest. A simpler solution is to use directly the marginals obtained from the run. Note that this situation disappears when all loci are analyzed together because large ti's become unlikely. The seven remaining loci showed similar results, with GC, the three-alleles locus, showing thinner and slightly shifted pdf's with regard to the others (not shown).
When all loci are used together the estimates of admixture proportions in the Jamaican sample are very similar to those obtained by ![]()
7% (Fig 8A; see also ![]()
|
By using a Bayesian approach, we obtain estimates that integrate across all possible gene frequency distributions in the parental populations. The genealogical approach allows us to take explicitly into account both drift in the three populations and the sampling process. As a consequence all factors that contribute to variance in the estimates are taken into account. Not all of these factors of variation are taken into account by the two other methods. This explains why our SD is larger than theirs. This also means that these methods underestimate the true variance and therefore provide the user with a misleading impression of precision. We are currently testing different methods of admixture estimation (including that of Long) on simulated data sets and find that our method usually performs best (lower mean square error and more accurate interval estimation; L. CHIKHI, R. A. NICHOLS, M. W. BRUFORD and M. A. BEAUMONT, unpublished results).
Therefore, our results, while supporting the point estimates given by ![]()
![]()
![]()
Fig 8 shows the pdf's obtained for p1 and th in the European, African, and Jamaican populations, respectively. The most striking result is the fact that th's pdf indicates much smaller values for the Jamaican (95% ETPI: 0.000320.05243) as compared to both the African (95% ETPI: 0.000680.08560) and particularly European (95% ETPI: 0.049170.67681) populations. Given the inverse relationship between ti and the size of the populations, this result is the opposite to what one would expect. This can be interpreted in different ways. First, one could observe that the distributions of the ti overlap and therefore may not necessarily indicate different effective population sizes. This interpretation is not satisfactory because the overlap is very limited at least between the Jamaican and the European populations. This can be tested by looking at the joint distribution of the three pairs of ti's. We find indeed that t1 > th (P = 0.0035) and t1 > t2 (P = 0.0134) whereas there is no significant difference between th and t2. Also, we showed with simulated data sets that clear differences in the ti's pdf's appear only when the differences are large (see Fig 7B and Fig D).
Another interpretation is that the data were not generated according to the model. A number of assumptions of the model are certainly not met. One could argue that gene flow from European to Jamaican has taken place during the last 200 years and this may explain why the Jamaican population seems to have the largest effective size. However, the fact that three other methods used give similar values for p1 indicates that this particular assumption should not be problematic. Indeed, the two methods used by ![]()
![]()
In admixture studies, the choice of the parental populations is often crucial. In most cases the exact parental populations cannot be identified with certainty (in fact it may not even be clear whether a "hybrid" population is really admixed). In the present case, the Jamaican population is admixed and the parental populations are known to be European and African. Any pair of samples from both continents would be as good as any other if the level of population differentiation within continents were low. That is unlikely to be the case, and ![]()
![]()
![]()
![]()
![]()
![]()
![]()
A posteriori the data point to a larger misrepresentation of European gene frequencies than African. A likely explanation is that, if only some slave owners contributed disproportionately to the Jamaican gene pool, the present-day European sample would appear to have undergone a greater degree of drift from the ancestral population. Indeed, taking a sample representing England, Ireland, and Germany may represent more variability than there really was in the more limited number of European ancestors of Jamaicans.
In conclusion, our method indicates that the European samples used, and perhaps the African samples as well, are unlikely to be representative of the parental populations of the Jamaican population. The effect on the final admixture estimate is difficult to predict. One way to test this is by using the information that is currently available on the geographic origin of the African slaves and European slave owners who settled in Jamaica. Samples should then be obtained from these areas in proportion to their known contributions. ![]()
| DISCUSSION AND CONCLUSION |
|---|
A number of methods have been developed to estimate admixture proportions since BERNSTEIN's (1931) seminal paper (see review by ![]()
![]()
![]()
![]()
![]()
![]()
The method presented here takes into account most sources of variation (sampling and drift in all populations and uncertainty over the parental allelic frequencies) that affect the estimation of the parameters. It also allows for the populations to have different sizes and therefore experience different amounts of drift (![]()
It is important to note, though, that two important sources of variation were not taken into account by our method: gene flow and mutations. Only Bertorelle and Excoffier's method considers the latter. The introduction of mutations to our model is theoretically possible (![]()
![]()
![]()
![]()
It is clear that gene flow will also affect the estimates of admixture. Clearly, our method should be used only when there are good reasons to believe that gene flow was limited in comparison to the original admixture event (see ![]()
![]()
Another source of error in the estimation of admixture proportions is the problem of nonrepresentative sampling. As for gene flow, effort should be made to use our method (and actually any method) only if there is good evidence that the parental populations are identified. We suggested in the analysis of the Jamaican data set that the robustness of admixture estimates should be checked by using different combinations of parental populations, using historical and/or biological information to identify them.
Having identified some of the weaknesses that may impair our method, it should be clear, though, that when the population conforms to expectation (i.e., according to our simplified version of the world, Fig 1) our method provides reliable results both for pl and the ti when large sample sizes and multiple-loci data are used. At the same time our study shows that, even when populations do evolve according to the model, single-locus data, such as those obtained with mitochondrial DNA, should be used only with extreme caution. In any case the whole distribution should be given rather than point estimates since we have seen how they can be weak representations of the data.
Likelihood-based methods such as those developed by Griffiths and Tavaré or Felsenstein's group are computer intensive and it is often difficult to test them accurately on a wide set of parameters as is done for classical coalescent-based methods, although it is easier to do now than when the methods were developed. This means that (i) comparison with other methods is difficult and (ii) most likelihood and MCMC methods published so far were not thoroughly tested (see ![]()
![]()















