- THIS ARTICLE
-
Abstract
- Full Text (PDF)
-
All Versions of this Article:
genetics.105.054130v1
173/3/1679 most recent - Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Wang, J.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Wang, J.
Originally published as Genetics Published Articles Ahead of Print on April 19, 2006.
Genetics, Vol. 173, 1679-1692, July 2006, Copyright © 2006
doi:10.1534/genetics.105.054130
A Coalescent-Based Estimator of Admixture From DNA Sequences
Jinliang Wang1
Institute of Zoology, Zoological Society of London, London NW1 4RY, United Kingdom
1 Address for correspondence: Institute of Zoology, Regent's Park, London NW1 4RY, United Kingdom.
E-mail: jinliang.wang{at}ioz.ac.uk
>ABSTRACT
METHODS
RESULTS
DISCUSSION
APPENDIX A: THE EXPECTED...
APPENDIX B: THE ETBLG...
ACKNOWLEDGEMENTS
LITERATURE CITED
A variety of estimators have been developed to use genetic marker information in inferring the admixture proportions (parental contributions) of a hybrid population. The majority of these estimators used allele frequency data, ignored molecular information that is available in markers such as microsatellites and DNA sequences, and assumed that mutations are absent since the admixture event. As a result, these estimators may fail to deliver an estimate or give rather poor estimates when admixture is ancient and thus mutations are not negligible. A previous molecular estimator based its inference of admixture proportions on the average coalescent times between pairs of genes taken from within and between populations. In this article I propose an estimator that considers the entire genealogy of all of the sampled genes and infers admixture proportions from the numbers of segregating sites in DNA sequence samples. By considering the genealogy of all sequences rather than pairs of sequences, this new estimator also allows the joint estimation of other interesting parameters in the admixture model, such as admixture time, divergence time, population size, and mutation rate. Comparative analyses of simulated data indicate that the new coalescent estimator generally yields better estimates of admixture proportions than the previous molecular estimator, especially when the parental populations are not highly differentiated. It also gives reasonably accurate estimates of other admixture parameters. A human mtDNA sequence data set was analyzed to demonstrate the method, and the analysis results are discussed and compared with those from previous studies.
OVER the past 70 years, many statistical methods have been developed and applied to estimating the genetic compositions of admixed/hybrid populations, using genetic marker data (for recent reviews see BEAUMONT 2003; CHOISY et al. 2004; EXCOFFIER et al. 2005). The primary interest is to infer, from the amount and pattern of genetic variation revealed by markers, the proportional contributions of two or more potential parental populations to the gene pool of an admixed population (CHAKRABORTY 1986). Estimating such admixture proportions helps in understanding the evolutionary history of populations (e.g., CHIKHI et al. 2002; WEN et al. 2004), in genetic epidemiological investigations (CHAKRABORTY and WEISS 1986, 1988), and in assessing the risk of diseases in human populations. In conservation biology, knowledge of admixture proportions helps in making informed management of endangered species in the wild.
Most methods available use allele frequency data to estimate admixture proportions, exploiting the genetic characteristic of an admixed population that its allele frequencies should be intermediate between those of the parental populations (CAVALLI-SFORZA and BODMER 1971; BERTORELLE and EXCOFFIER 1998). The main differences among these methods are whether or not to take genetic drift into account and how to select (e.g., CHAKRABORTY et al. 1992) and treat allele frequency data statistically. Traditional methods are usually moment estimators that ignore the genetic drift that occurred to the parental and hybrid populations since the admixture event (e.g., GLASS and LI 1953; ROBERTS and HIORNS 1965; ELSTON 1971; LONG 1991; CHAKRABORTY et al. 1992), while recent ones are usually likelihood or Baysian estimators (e.g., THOMPSON 1973; CHIKHI et al. 2001; WANG 2003), allowing the joint estimation of admixture proportions and genetic drift. A flexible method based on some summary statistics and approximate Bayesian computation (BEAUMONT et al. 2002; MARJORAM et al. 2003) has been proposed recently, which estimates admixture proportion, genetic drift, and mutation parameters simultaneously from linked or unlinked microsatellite markers (EXCOFFIER et al. 2005).
Molecular markers, such as DNA sequences and microsatellites that are now used widely, provide us not only allele frequency information, but also deep genealogical information revealed by the molecular diversity of sampled genes. Most of the above methods do not use such molecular information and assume that mutations are absent since the admixture event, causing two potential problems. One is that discarding molecular information may result in a loss of estimation precision, especially when the mutation rate is high for markers such as large DNA sequences. The other is that when admixture is ancient and thus mutations are not negligible, these methods may fail to deliver an estimate or give rather poor estimates. Realizing these problems, BERTORELLE and EXCOFFIER (1998) developed a novel estimator that uses both allele frequency and molecular information and explicitly takes mutations into account in estimating admixture proportions. The estimator was shown to be less biased and, in some situations, to yield more precise estimates of admixture proportions (BERTORELLE and EXCOFFIER 1998; WANG 2003). Later on, the estimator was extended by DUPANLOUP and BERTORELLE (2001) to allow three or more parental populations contributing to the admixture.
In this article I develop a new molecular estimator under a well-defined admixture model (BERTORELLE and EXCOFFIER 1998; WANG 2003; EXCOFFIER et al. 2005) and compare it with the previous molecular estimator by using both simulated and real data. While the previous molecular estimator bases its inference on the average coalescent times between two genes taken from within and between populations, the current one considers the entire genealogy of the sampled genes and infers admixture from the numbers of segregating sites in DNA sequence samples. By considering the genealogy of all sequences rather than pairs of sequences, this new estimator also allows the joint estimation of other interesting parameters such as admixture time, divergence time, population size, and mutation rate as well as admixture proportions.
ABSTRACT
>METHODS
RESULTS
DISCUSSION
APPENDIX A: THE EXPECTED...
APPENDIX B: THE ETBLG...
ACKNOWLEDGEMENTS
LITERATURE CITED
The admixture model:
Several recent studies adopt the admixture model proposed by BERTORELLE and EXCOFFIER (1998), with slight modifications (CHIKHI et al. 2001; WANG 2003; CHOISY et al. 2004; EXCOFFIER et al. 2005). In this study, I use the admixture model of EXCOFFIER et al. (2005), as illustrated in Figure 1. I assume an ancestral population, P0, splits into two parental populations, P1 and P2, which evolve separately for TD generations. At that point of time, a hybrid population, Ph, is instantaneously created by combining genes of proportions p1 and p2 = 1 p1 taken at random from parental populations P1 and P2, respectively. After the admixture event, the three populations evolve in isolation for a period of TA generations, when a sample of individuals is taken from each population for examining some markers. I also assume, as is implicit in all previous admixture models, that neither direct nor indirect selection is associated with the markers surveyed, and the markers are from diploid and autosomal loci. With 2N (N, the effective population size) replaced by N, however, the method applies to haploid markers (such as mtDNA) as well.
|
The above admixture model is characterized by seven parameters, which are the effective sizes of the ancestral (N0), parental (N1 and N2), and admixed (Nh) populations; the times (in the unit of generations) of divergence (TD) and admixture (TA); and the admixture proportion p1. The seven parameters are denoted by set
= {N0, N1, N2, Nh, TA, TD, p1}. Without an external standard, however, it is impossible to estimate the absolute values of population size and time. They are therefore rescaled by the mutation rate (µ) of markers as
k = 4Nkµ (k = 0, 1, 2, h),
k = Tkµ (k = A, D), and the estimable parameters are denoted by
= {
0,
1,
2,
h,
A,
D, p1}.
The mutation model:
To utilize molecular information and account for mutations explicitly in estimating admixture, a suitable mutation model must be specified to describe the mutational process of markers. Herein I use DNA sequences as markers, which are assumed to follow the infinite-site model of mutations (KIMURA 1969). Under this model, a locus is composed of so many sites that no more than one mutation occurs at any site in the genealogy of the sampled sequences. I also assume a constant-rate neutral mutation process, in which each offspring sequence differs from its parental one by an average of µ mutations. Under these assumptions, the number of mutations in a sample of DNA sequences is identical to the number of nucleotides that are polymorphic (segregating) in the sample. Therefore, the expected number of segregating sites in a sample is simply the product of µ and the expected total branch length of the genealogy of the sample (see, e.g., TAJIMA 1983; HUDSON 1990).
Expected numbers of segregating sites:
In this section, I derive the expected total branch length of the genealogy (ETBLG) (denoted by
) of a sample of DNA sequences given parameter set
. The expected number of segregating sites in a sample is simply the product of
and mutation rate µ under the infinite-site model. In the next section, the observed numbers of segregating sites are fitted to these expected values to obtain least-squares estimates of
. Suppose n1, n2, and nh sequences of a given locus are sampled at random from the current P1, P2, and Ph populations, respectively. The n = n1 + n2 + nh sequences can be arranged to constitute seven composite (artificial) samples. Samples 1, 2, and 3 contain sequences solely from populations P1, P2, and Ph, respectively. Samples 4, 5, and 6 are obtained by merging samples 1 and 2, 1 and 3, and 2 and 3, respectively. Sample 7 contains all of the n sequences. The sample sizes (number of sequences) for samples 1, 2, ... , 7 are therefore n1, n2, nh, n1 + n2, n1 + nh, n2 + nh, and n1 + n2 + nh, respectively.
Expected total branch length of a genealogy:
For convenience in deriving the ETBLG of a sample of sequences, time is measured backward hereafter. The current time when the sample was taken is designated as generation zero and the time T generations ago is referred to as generation T. Consider the ETBLG of sample 1,
1, conditional on parameters
. The genealogy can be partitioned into two segments, the first being formed by the coalescent process in population P1 during time interval [0, TA + TD], while the second is formed by the coalescent process in population P0 during time interval [TA + TD,
]. The expected total branch length of segment one can be derived, as shown in APPENDIX A, as
![]() | (1) |
and
are the rising and falling factorial functions, respectively.
For the second segment, the initial number of sequences, j (n1
j
2), at time TA + TD is a random variable. The probability of an initial i sequences at time zero coalescing into j sequences at time T in a population with effective size N is
![]() | (2) |
j
2) sequences are left extant at time TA + TD in population 1. Given j at time TA + TD and the effective size of the ancestral population N0, the expected total branch length of the second segment of genealogy (e.g., HUDSON 1990) is
. Summing over all possible values of j gives the expected total branch length of the second segment of genealogy
![]() | (3) |
![]() | (4) |
,
irrespective of the parameter values of
, and N0, as is expected because the most recent common ancestor (MRCA) must be found in the first segment of genealogy. In such a case, (4) reduces to
, the expected total branch length of the genealogy of n1 sequences from a population of a constant size N1 (e.g., HUDSON 1990). It can be shown that when N0 = N1, (4) also reduces to
irrespective of TA and TD, as expected.
Similarly, the ETBLG of sample 2 conditional on parameter set
,
2, is calculated by the right side of (4), replacing N1 by N2 and n1 by n2, respectively.
The derivation for the ETBLG of sample 3 is much more complicated. The genealogy is again partitioned into two segments. The first segment is formed by the coalescent process of an initial nh sequences in population Ph during interval [0, TA]. The expected total branch length of this segment is calculated by (1) with i = nh, N = Nh, and T = TA. The second segment is formed by the coalescent processes in populations P1 and/or P2 during time interval [TA, TA + TD] and then in ancestral population P0 during time interval [TA + TD,
]. The probability of i = nh sequences coalescing into j = m sequences at time T = TA in population Ph with effective size N = Nh is calculated by (2). According to the sources of the m extant sequences at time TA, three cases are distinguishable.
- Case 1: The m extant sequences at time TA come exclusively from population P1. When this case occurs, with probability
, the expected total branch length of the second segment of genealogy given m can be calculated by the sum of (1) and (3), replacing i, N, and T by m, N1, and TD, respectively.
- Case 2: The m extant sequences at time TA come exclusively from population P2. When this case occurs, with probability
where
, the expected total length of the second segment of genealogy given m can be calculated by the sum of (1) and (3), replacing i, N, and T by m, N2, and TD, respectively.
- Case 3: Among the m extant sequences at time TA, m1 (0 < m1 < m) sequences come from population P1 and m2 = m m1 from P2. When this case occurs, with the binomial probability of
, the second segment of genealogy can be further partitioned into three subsegments. Subsegment 1 is formed by the coalescent process of the initial m1 sequences at time TA in population P1 during interval [TA, TA + TD]. The expected total branch length of this subsegment can be derived, shown in APPENDIX A, as
- Case 2: The m extant sequences at time TA come exclusively from population P2. When this case occurs, with probability
![]() | (5) |
]. Suppose m3 and m4 sequences are extant at time TA + TD in populations P1 and P2, respectively, with probabilities
and
, respectively, calculated by (2). The expected total branch length of the segment of genealogy for the given initial m3 + m4 sequences in population P0 during the time interval [TA + TD,
] is (e.g., HUDSON 1990)
. Considering all possible values of m3 and m4 leads to the expected total branch length of the third subsegment of genealogy,
. Summing over the three subsegments gives the expected total branch length of the second segment of genealogy in case 3.
Summing over the three cases yields the expected total branch length of genealogy for segment 2. Adding the expected total branch lengths of segments 1 and 2 gives the ETBLG of sample 3,
![]() | (6) |
and
. It can be shown that, when Nk =
(k = 0, 1, 2, h) and nh = 2, (6) reduces to
, which is twice the expected coalescent time between a pair of sequences from the admixed haploid population derived by BERTORELLE and EXCOFFIER (1998). Using an approach similar to the derivation of (6), I also obtained, as shown in APPENDIX A, the equations for the ETBLGs of samples 47.
Expected number of segregating sites:
Under the infinite-site model assumed above, the expected number of segregating sites (ES) in a sample is simply the product of mutation rate µ and ETBLG of the sample. The expected number of segregating sites of the kth sample conditional on parameter set
, ESk, can then be calculated by the equation for
k by replacing Nj, TA, and TD with
j/4,
A, and
B, respectively, where k = 1, 2, ... , 7 and j = 0, 1, 2, h.
Estimation of parameters:
Suppose, for a single locus with (unknown) mutation rate µ, the number of segregating sites in sample k (k = 1, 2, ... , 7) is observed to be OSk. Estimates of the parameters
= {
0,
1,
2,
h,
A,
D, p1} can be obtained by fitting these observed to the expected numbers of segregating sites by a least-squares approach,
![]() | (7) |
) is a complicated function of the seven parameters and no closed form of solution is possible, some numerical methods have to be adopted for the estimation. The first derivatives of f(
) with respect to each of the seven parameters can be obtained and used in the multidimensional NewtonRaphson algorithm for the estimates of
from (7). However, the computation is intensive, especially when sample sizes are large, because both function (7) and its derivatives are not trivial to compute. Further, such an algorithm is sometimes fooled by a local rather than a global minimum of f(
). Having tried several methods, I finally choose to use Powell's quadratically convergent method (PRESS et al. 1996) with slight modifications. This algorithm does not require the computation of derivatives, and with the modification it updates only one of the parameters in most iterations so that only part of (7) needs to be recalculated. For example, updating
1 does not alter ES2 but changes only parts of the calculations of ESj for j = 3, 4, ... , 7. Therefore, the algorithm coupled with storing/reusing the computational results for different parts of ESj could reduce computational burden tremendously. To speed up computation, this algorithm occasionally updates multiple parameters simultaneously along an optimal direction determined by collecting and using information of previous iterations. Some comparative analyses of simulated data indicated that Powell's algorithm is less often stuck on a local minimum than the NewtonRaphson algorithm. In the RESULTS shown below, each simulated data set is analyzed in five independent replicates, each with a randomly chosen set of starting parameter values. The final estimates are those from the replicate with the minimum value of f(
). To analyze an empirical data set, more starting points can be used to obtain more reliable estimates.
The computational load of (7) increases rapidly with the numbers of sequences from the three populations. Furthermore, it is difficult to calculate (2) quickly and accurately because it is a series having terms of large values and alternating signs. To avoid large numerical errors in calculating (2) and thus (7) for large genealogies (a sample of
100 sequences), I conduct computations using high precision of hundreds of significant digits (depending on sample size). An alternative option is to adopt a Markov chain Monte Carlo method proposed by GRIFFITHS and TAVARÉ (1994) as in O'RYAN et al. (1998).
Multiple loci:
For multiple independent loci, it is inappropriate to use simply the average numbers of segregating sites over loci as data in the estimation. Distinctive loci may have different mutation rates because of the differences in mutation rate per base pair and/or in the sequence length. Different loci may also have different sample sizes, resulting in different ETBLGs and thus different expected numbers of segregating sites. The above methodology can be extended to use multilocus data jointly in estimating the parameters
= {
0,
1,
2,
h,
A,
D, p1}. In addition, the relative mutation rate of each locus can be estimated simultaneously.
Suppose a number of L unlinked loci have been surveyed, with locus l (l = 1, 2, ... , L) having a mutation rate µl and a sample size of nj,l sequences from population Pj (j = 1, 2, h). Without loss of generality, I scale parameters Nj (j = 0, 1, 2, h), TA, TD, and µl (l = 2, 3, ... , L) by µ1, the mutation rate of the first locus. The set of parameters to be estimated now becomes
= {
0,
1,
2,
h,
A,
D, p1,
2,
3, ... ,
L}, where
j = 4Njµ1 (j = 0, 1, 2, h),
A =TAµ1,
D =TDµ1, and
l = µl/µ1 (l = 2, 3, ... , L). The least-squares function for multiple loci becomes
![]() | (8) |
2 is calculated using parameters {
0,
1,
2,
h,
A,
D, p1} and then multiplied by
l.
The value of p1 and the relative values of
j (j = 0, 1, 2, h),
A, and
D obtained from (8) are independent of the locus chosen to scale the parameters. This is because mutation rate does not affect the genealogies and acts only as a multiplier with ETBLGs in determining the expected numbers of segregating sites. Using a locus with a smaller (larger) mutation rate to scale the parameters causes just a proportional decrease (increase) in the estimates of
j (j = 0, 1, 2, h),
A, and
D and has no affect on the estimates of p1. This can be checked easily by simulations.
Simulations:
Monte Carlo simulations were run to generate data sets with known parameters. These data were then analyzed by the newly developed estimator to check its quality of estimates, to investigate its statistical properties, and to compare it with a previous molecular estimator. Although quite a few admixture estimators are available, only the one of BERTORELLE and EXCOFFIER (1998) is a molecular estimator designed to use molecular information and take mutations into account. Therefore I confine my comparison to this molecular estimator in this study.Following the coalescence approach (HUDSON 1990) and the admixture model (Figure 1), the genealogies of the n1 + n2 + nh DNA sequences from the current three populations were reconstructed until the MRCA was found. Poisson-distributed mutations were then imposed on the reconstructed gene tree. Recombination was assumed to be absent and mutations were assumed to follow the infinite-site model. Data for different loci were generated independently, and monomorphic loci with no segregating sites were discarded. The sequence data were then processed to extract information for different estimators. For the current estimator, the number of segregating sites in each of the seven composite samples was obtained. For Bertorelle and Excoffier's molecular estimator, the mean coalescence time (scaled by mutation rate) was estimated by the mean number of site differences between pairs of sequences.
Several statistics are adopted to measure the quality of estimates from simulated data. First, the applicability (denoted as Appl%) of an estimator is calculated as the percentage of replicates in which admixture proportion estimates can be made successfully and the estimates are in the legitimate range of [0, 1] (CHOISY et al. 2004). Second, the mean and root mean square errors (the square root of mean squared errors, denoted by RMSE) of estimates across replicates are calculated. Third, "factor 2" is calculated as the proportion of replicates in which the estimated value is within the interval bounded by values equal to 50 and 200% that of the true value (EXCOFFIER et al. 2005). This measurement overlaps with RMSE in telling how close the estimates are to the true parameter value, but it is less affected by extreme outliers of the distribution of estimates. For most combinations of parameters, 1000 replicates were conducted.
Analysis of an empirical data set:
For demonstration, the estimator proposed herein was applied to the analysis of a published data set from MCLEAN et al. (2003). They sequenced the hypervariable segments I (HVSI, 364 bp in length) and II (HVSII, 343 bp in length) of the mtDNA from 47 Sierra Leoneans, 12 EuropeanAmericans, 12 rural Gullah-speaking AfricanAmericans, 12 urban AfricanAmericans living in Charleston, South Carolina, and 12 Jamaicans. Assuming that AfricanAmerican populations are admixtures by Europeans and Africans (e.g., PARRA et al. 1998), the mtDNA data can be analyzed by the coalescent estimator to infer the European genetic contributions to the gene pool of each of the three AfricanAmerican populations and the admixture time, divergence time, and genetic drift (population size) of each parental and admixed population involved. Sites in HVSI and HVSII sequences with missing or ambiguous information were eliminated, resulting in 295 and 275 unambiguous sites for HVSI and HVSII, respectively, utilized in the analysis. Due to the absence of recombination in human mtDNA, HVSI and HVSII are effectively a single locus. Sequences for the two loci are thus combined to form single-locus data before being analyzed by the two molecular estimators of admixture.ABSTRACT
METHODS
>RESULTS
DISCUSSION
APPENDIX A: THE EXPECTED...
APPENDIX B: THE ETBLG...
ACKNOWLEDGEMENTS
LITERATURE CITED
Simulations:
Many factors are important in determining the quality of admixture estimates, including the true parameters (defined by the genetic model in Figure 1) being estimated and the marker information content influenced by the number of loci, the number of individuals genotyped, and the polymorphism of each locus (e.g., WANG 2003; CHOISY et al. 2004; EXCOFFIER et al. 2005). The factor combinations are prohibitively too numerous to consider in a simulation study. Here I choose to present the estimation results in some hopefully typical scenarios.
The performances of the current and previous molecular estimators in estimating admixture proportions for the scenarios of a short or long divergence time (TD = 2500 or 10,000 generations); short, intermediate, or long admixture time (TA = 50, 500, or 5000 generations); and small or moderate admixture (p1 = 0.05 or 0.20) are summarized in Table 1. Sample size for each population is assumed to be either 20 or 40, and the number of loci is assumed to be 1, 5, or 10 with the same mutation rate of 0.001 per DNA sequence per generation. When the divergence time is short (TD = 2500) so that parental populations are not highly differentiated (TD/N = 0.5) when admixture occurs, the Appl% of the mY estimator is only
7090% for the case of a single locus, and 1030% of p1 estimates from this estimator are either smaller than zero or greater than one. Although the Appl% of the mY estimator improves with an increasing amount of marker information (mainly number of loci), it is still <90% for slight admixture even if 5 loci are used. Note that the increase in Appl% of the mY estimator with an increasing TA is an artifact, because a large TA results in the estimates of p1 biased toward 0.5 and thus in fewer negative estimates. In contrast, the new estimator gives the estimates of p1 that are always in the legitimate range of [0, 1]. Compared with the mY estimator, the new estimator is generally much less biased and has much smaller RMSE.
|
When the divergence time is long (TD = 10,000 generations) so that parental populations are highly differentiated (TD/N = 2) before they contribute to the admixture, the performances of the two estimators become similar. The mY estimator has an Appl% close to 100%, except when a single locus is used in estimating small admixture proportions. The new estimator is less biased than the mY estimator, especially when admixture is small and TA is large. The main merit of molecular estimators in comparison with traditional estimators is that mutations after the admixture events can be accounted for so that ancient admixture can be inferred accurately. The current estimator allows almost unbiased estimation of admixture proportions even if TA = 5000 (TA/N = 1) when the divergence time is long and multilocus data are available. In contrast, the mY estimator gives estimates of p1 biased toward 0.5 when admixture is ancient.
In addition to admixture proportions, the new estimator can also provide estimates of other interesting parameters. Table 2 summarizes the properties of the estimates of
0,
1,
2,
h,
A,
D, and relative mutation rates (
l). It can be seen that
1,
2, and µl are very well estimated with small biases and RMSEs, while
h is the most difficult parameter to estimate. This is understandable because information about
h comes from the coalescent events in the hybrid population during time interval [0, TA] only, and these events are too few when nh and TA/Nh are small to allow accurate estimates of
h. Indeed, the quality of
h-estimates increases with an increasing sample size and admixture time, as shown in Table 2. Similarly, the quality of
A estimates is dependent on the number of coalescent events in the three populations during interval [0, TA] and thus increases with an increasing admixture time (TA) and a decreasing population size (N1, N2, Nh). In contrast,
D is more accurately estimated with a decreasing admixture time (TA). Small TA means fewer coalescent events during interval [0, TA] and more coalescent events during time interval [TA, TA + TD] and thus more information about
D. For similar reasons,
0 is better estimated with a smaller value of TA + TD. The estimates of all parameters are improved substantially by increasing the number of loci and sample sizes.
|
I adopt the parametric bootstrapping technique to assess the uncertainties of admixture estimates from the new estimator. This is rendered possible because all the parameters fully defining the admixture model in Figure 1 can be estimated by the new estimator. Parametric bootstrapping is more appropriate than nonparametric bootstrapping (BERTORELLE and EXCOFFIER 1998) because the latter tends to yield too conservative estimates of uncertainties when the number of resampling units (loci, sequences) is small. Due to the heavy computational burden of the current estimator, however, it is difficult to evaluate the performance of the parametric bootstrapping procedure using extensive simulations. Table 3 lists the uncertainty estimates, which are average upper and lower limits of 95% confidence intervals (C.I.95%) and coverage (frequency of the true parameter value being covered by the estimated C.I.95%), for the cases of a single locus and five loci. In each case, 100 replicate data sets are simulated, and each data set is analyzed for point and C.I.95% estimates using 500 bootstrapping samples. The parameter values used in generating the simulated data sets are
k = 20 (k = 0, 1, 2, h),
A = 0.5,
D = 10, µ = 0.001, and 20 sequences for each locus from each population. As can be seen, the true parameter value is included in the estimated 95% confidence intervals in
95% of the replicates for both the single-locus and the five-loci cases. The confidence intervals for five loci are much narrower than those for a single locus, as is expected. In accordance with the results listed in Table 2,
h is the most difficult parameter to estimate, as indicated by the extremely large confidence intervals.
|
Admixture analysis of human populations:
The mtDNA sequence data from MCLEAN et al. (2003) are analyzed by BERTORELLE and EXCOFFIER's (1998) mY estimator and the new estimator. Parametric bootstrapping and nonparametric bootstrapping are adopted for the new and mY estimators, respectively, to ascertain the uncertainties of the estimates using 1000 samples of size identical to the original samples. The estimates of the European contributions to each of the three admixed AfricanAmerican and Jamaican populations are listed in Table 4.
|
The European contributions to the three admixed populations are estimated to be <7% from the new coalescent estimator and are in close agreement with previous estimates (PARRA et al. 2001; MCLEAN et al. 2003). MCLEAN et al. (2003) calculated admixture proportions from the frequencies of haplotypes composed of three HVS restriction site polymorphisms (RSPs). These RSPs are chosen because they are substantially differentiated between African and European populations and are thus highly informative for admixture analysis. Furthermore, a large number of 1396 individuals from the same five populations as the mtDNA sequence data analyzed herein are assayed for the RSPs. The estimated European contributions from their analyses are 0.030, 0.069, and 0.027 for the Gullahs, Charleston AfricanAmericans, and the Jamaicans, respectively. From large samples (
100 sequences per population) of HVSI data, the European contributions to the three admixed populations were estimated to be
0% using the highly informative haplogroup H frequencies (PARRA et al. 2001) and were estimated to be 0.065 and 0.129 for the Charleston and Jamaican populations, respectively, from both haplogroup H and L frequencies (PARRA et al. 1998). In general, these estimates are much lower than those inferred from many informative nuclear markers (PARRA et al. 1998), indicating that European females contributed little to the admixtures. The sex-biased admixtures, with European males contributing substantially greater than European females, were confirmed by analyzing the Y Alu polymorphic (YAP) informative markers (PARRA et al. 1998). It is encouraging that with a sample size as small as 12 sequences, the new coalescent estimator yields similar results. The mY estimator yields estimates of European contributions to the Gullah or Jamaican population that are low and roughly compatible with estimates from the other estimators and data, but estimates of European contribution to the Charleston population (0.5) that is much larger than other estimates and is even larger than the estimate from nuclear markers (0.12, PARRA et al. 1998). The estimated European contribution to the Jamaican population was 0.063, which is not surprising given the simulation result that the mY estimator often yields negative estimates when admixture proportion is low (Table 1).
For both molecular estimators and all of the three admixed populations, the 95% confidence intervals for the admixture estimates determined by parametric and nonparametric bootstrapping are quite broad. This is perhaps not surprising because the data set is small with effectively a single locus and only 12 sequences from a population. More loci and larger sample sizes are required to obtain more precise admixture estimates.
In addition to admixture proportions, the new estimator also gives the estimates of divergence times, admixture times, population sizes, and relative mutation rates. These estimates are listed in Table 4. Because human mtDNA does not recombine (PAKENDORF and STONEKING 2005), HVSI and -II sequences were obtained from the same individuals (MCLEAN et al. 2003), and the sample sizes are very small, the genealogical information that can be extracted from the data set is quite limited and the analysis results need to be interpreted with caution. Divergence time between the Africans and Europeans is estimated to be 1.9 on average. In the literature, the estimates of mutation rate for HVSs are quite diverse, the median value being
0.1/site/million years (MY) (PAKENDORF and STONEKING 2005; SANTOS et al. 2005). The total mutation rate for HVSI and -II is therefore
70/MY or 0.0021/generation if the generation interval is taken to be 30 years. The absolute divergence time is thus estimated to be 1.9/(70/MY) = 27,143 years, which is roughly in agreement with the estimate of <60,000 years from phylogenetic analysis of mtDNA control regions (e.g., WATSON et al. 1997; QUINTANA-MURCI et al. 1999). Subsequent migration after the split of Eurasian from African population could reduce the divergence and thus lead to an underestimation of divergence time (WANG 2003).
The point estimates of admixture time are quite variable from the three admixture analyses. Using the mutation rate estimate of 70/MY, the admixture time is estimated to be 6, 1628, and 2801 years ago for the Gullah, Charleston, and Jamaican populations, respectively. The first estimate is too small while the last two estimates are too high compared with the historical evidence that the AmericanAfrican populations were formed 150 years ago (PARRA et al. 1998). For all of the three admixed populations, however, the 95%C.I.'s are fairly consistent and well include the admixture time of 150 years.
The parental and ancestral population sizes (
1,
2,
0) are well estimated while the admixed population size (
h) is poorly estimated, as indicated by the corresponding widths of 95%C.I.'s. On average, the African population size is estimated to be 8 and 37 times larger than that of European and admixed populations, respectively, and the European population is 4 times larger than admixed populations. The results seem to be plausible and are at least qualitatively in agreement with previous studies.
ABSTRACT
METHODS
RESULTS
>DISCUSSION
APPENDIX A: THE EXPECTED...
APPENDIX B: THE ETBLG...
ACKNOWLEDGEMENTS
LITERATURE CITED
The differences between the current and previous molecular estimators of admixture are in close analogy with those between Watterson's and Tajima's estimators of
from DNA sequence data. WATTERSON (1975) showed that, under the infinite-site mutation model, the product of mutation rate and expected total branch length of the genealogy of a sample of sequences gives the expected number of segregating sites of the sample. From this relationship, he derived the estimator
, where OS is the observed number of segregating sites in a sample of n sequences. Tajima's estimator,
T, is given by the average number of nucleotide differences between two sequences (TAJIMA 1983). It is well known that both estimators are unbiased, but
W is generally more efficient than
T because it has a smaller variance, and the difference increases with sample size (LI 1997; WANG 2005). In the special case of a single WrightFisher population (say, parental population 1), the admixture model has just one parameter (
) and my admixture estimator (7) reduces to
W, as is expected. The previous molecular estimator of admixture (BERTORELLE and EXCOFFIER 1998) uses the mean number of nucleotide differences between pairs of sequences as information and is thus quite similar in this respect to
T.
Some assumptions made in deriving the current estimator can be relaxed without affecting much of the validity of the estimator. Although the current estimator assumes diploid nuclear markers, it applies to maternally (mtDNA) or paternally (Y chromosome) inherited markers as well. The only difference is in the explanation of the parameter
that corresponds to different effective sizes. It is assumed that there is no recombination at a locus. However, I use means rather than variances of the number of segregating sites as data in the estimator, and therefore it should apply to loci with recombination. Like Watterson's estimator, the current admixture estimator should actually give better estimates when there is recombination, although the estimated uncertainties might become too exaggerative. The current estimator also assumes two parental populations contributing to the admixture. It is straightforward to extend the method to the case of three or more parental populations. However, the computational burden increases very rapidly with the number of parental populations. Even with two parental populations as assumed in this study, the estimator's computational load increases so rapidly with sample sizes that it can cope only with samples of a few hundred sequences on an ordinary PC. Further refinements of the computational algorithms are necessary before the estimator is extended to more complicated situations such as three or more parental populations.
There is also room for methodological improvements of the current estimator. In (7), the expected numbers of segregating sites of the seven composite samples are obviously nonindependent, because each original sample of sequences appears in four of the seven composite samples. Ideally, their variance and covariance structure should be incorporated into a general least-squares framework to obtain estimates of the seven parameters. However, it is extremely difficult to derive this variance and covariance matrix analytically, and computation of the matrix numerically by simulations is too CPU demanding to be realistic. Although (7) ignored this variance and covariance structure, it should provide unbiased estimates as verified by simulations. The estimator's precision may be improved by the proper weighting based on the variance and covariance matrix.
A software package, Molecular Estimator of Admixture (MEAdmix), computing the admixture estimator and finding the confidence intervals by parametric bootstrapping, is available for free download from http://www.zoo.cam.ac.uk/ioz/software.htm.
ABSTRACT
METHODS
RESULTS
DISCUSSION
>APPENDIX A: THE EXPECTED...
APPENDIX B: THE ETBLG...
ACKNOWLEDGEMENTS
LITERATURE CITED
, these i genes will coalesce into their MRCA, and the ETBLG is
generations (e.g., HUDSON 1990). When T is a fixed definite number, however, the MRCA may or may not be found in the interval of [0, T] and j (1
j
i) lineages may be left extant at generation T. The probability of j distinct lineages extant at time T,
, is given by (2) (TAVARÉ 1984). Given i, j, N, and T, the expected time, ETm, during which there are m (j
m
i) distinct lineages in a genealogy is derived as follows:- ETm for m = i: Conditional on i, j < i, N, and T, the probability of the time interval (in generations), Ti, during which there are i distinct lineages is
ETi is thus obtained by the integration
which, after some algebra, simplifies to
where
(A1)
.
- ETm for j < m < i: Similarly, the expected time (in generations) during which there are m distinct lineages (j < m < i), given i, j < i 1, N, and T, is obtained by integration:
It can be simplified to
where
(A2)
when r = m and
otherwise, and
.
- ETm for m = j: The expected time during which there are m = j distinct lineages, given i, j
i, N, and T, is obtained by integration,
which leads to
where
(A3)
when k = j and
otherwise, and
. When no coalescent events occur in the interval of [0, T] (i.e., j = i), (A3) reduces to
, as is expected.
The ETBLG in the interval [0, T] can be calculated using (A1A3), in two separate cases. In the first case, the part of the genealogy after the MRCA is found, branch length T1, is irrelevant and excluded. In the second case, T1 is included in the ETBLG. Cases 2 and 1 apply when the MRCA lineage is and is not to be included in another genealogy involving other genes formed after T. The ETBLG conditional on i, j, N, T is
![]() | (A4) |
![]() | (A5) |
Considering all possible values of j given i, N, and T, the ETBLG in case 1 is
![]() |
, (1) further reduces to
(e.g., HUDSON 1990), as is expected. Similarly, the ETBLG in case 2 is
![]() |
ABSTRACT
METHODS
RESULTS
DISCUSSION
APPENDIX A: THE EXPECTED...
>APPENDIX B: THE ETBLG...
ACKNOWLEDGEMENTS
LITERATURE CITED
![]() | (B1) |
, which is twice the expected coalescent time between a sequence from parental population 1 and a sequence from parental population 2.
![]() | (B2) |
and
. As is expected, (B2) reduces to
, twice the average coalescent time between a sequence from parental population 1 and a sequence from the admixed population, when n1 = nh = 1 and Nk = N (k = 0, 1, 2, h). For sample 6,
6 is calculated by the right side of (B2) by exchanging
and
.
![]() | (B3) |
and
.
ABSTRACT
METHODS
RESULTS
DISCUSSION
APPENDIX A: THE EXPECTED...
APPENDIX B: THE ETBLG...
>ACKNOWLEDGEMENTS
LITERATURE CITED
ABSTRACT
METHODS
RESULTS
DISCUSSION
APPENDIX A: THE EXPECTED...
APPENDIX B: THE ETBLG...
ACKNOWLEDGEMENTS
>LITERATURE CITED
BEAUMONT, M. A., 2003 Conservation genetics, pp. 775780 in Handbook of Statistical Genetics, Ed. 2, edited by D. J. BALDING, M. BISHOP and C. CANNINGS. John Wiley & Sons, Chichester, England.
BEAUMONT, M. A., W. ZHANG and D. J. BALDING, 2002 Approximate Bayesian computation in population genetics. Genetics 162: 20252035.
BERTORELLE, G., and L. EXCOFFIER, 1998 Inferring admixture proportions from molecular data. Mol. Biol. Evol. 15: 12981311.[Abstract]
CAVALLI-SFORZA, L. L., and W. F. BODMER, 1971 The Genetics of Human Populations. W. H. Freeman, San Francisco.
CHAKRABORTY, R., 1986 Gene admixture in human populations: models and predictions. Yearb. Phys. Anthropol. 29: 143.[CrossRef]
CHAKRABORTY, R., and K. M. WEISS, 1986 Frequencies of complex diseases in hybrid populations. Am. J. Phys. Anthropol. 70: 489503.[CrossRef][Medline]
CHAKRABORTY, R., and K. M. WEISS, 1988 Admixture as a toll for finding linked genes and detecting that difference from allelic association between loci. Proc. Natl. Acad. Sci. USA 85: 91199123.
CHAKRABORTY, R., M. I. KAMBOH, M. NWANKWO and R. E. FERRELL, 1992 Caucasian genes in American blacks: new data. Am. J. Hum. Genet. 50: 145155.[Medline]
CHIKHI, L., M. W. BRUFORD and M. A. BEAUMONT, 2001 Estimation of admixture proportions: a likelihood-based approach using Markov chain Monte Carlo. Genetics 158: 13471362.
CHIKHI, L., R. A. NICHOLS, G. BARBUJANI and M. A. BEAUMONT, 2002 Y genetic data support the Neolithic demic diffusion model. Proc. Natl. Acad. Sci. USA 99: 1100811013.
CHOISY, M., P. FRANCK and J. M. CORNUET, 2004 Estimating admixture proportions with microsatellites: comparison of methods based on simulated data. Mol. Ecol. 13: 955968.[CrossRef][Medline]
DUPANLOUP, I., and G. BERTORELLE, 2001 Inferring admixture proportions from molecular data: extension to any number of parental populations. Mol. Biol. Evol. 18: 672675.
ELSTON, R. C., 1971 The estimation of admixture in racial hybrids. Ann. Hum. Genet. 35: 917.[Medline]
EXCOFFIER, L., A. ESTOUP and J. M. CORNUET, 2005 Bayesian analysis of an admixture model with mutations and arbitrarily linked markers. Genetics 169: 17271738.
GLASS, B., and C. C. LI, 1953 The dynamics of racial intermixturean analysis based on the American Negro. Am. J. Hum. Genet. 5: 119.[Medline]
GRIFFITHS, R. C., and S. TAVARÉ, 1994 Simulating probability distributions in the coalescent. Theor. Popul. Biol. 46: 131159.[CrossRef]
HUDSON, R. R., 1990 Gene genealogies and the coalescent process, pp. 144 in Oxford Surveys in Evolutionary Biology, edited by D. J. FUTUYMA and J. D. ANTONOVICS. Oxford University Press, New York.
KIMURA, M., 1969 The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics 61: 893903.
LI, W. H., 1997 Molecular Evolution. Sinauer Associates, Sunderland, MA.
LONG, J. C., 1991 The genetic structure of admixed populations. Genetics 127: 417428.[Abstract]
MARJORAM, P., J. MOLITOR, V. PLAGNOL and S. TAVARE, 2003 Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA 100: 1532415328.
MCLEAN, D. C., I. SPRUILL, S. GEVAO, E. Y. S. MORRISON, O. S. BERNARD et al., 2003 Three novel mtDNA restriction site polymorphisms allow exploration of population affinities of African Americans. Hum. Biol. 75: 147161.[Medline]
O'RYAN, C., E. H. HARLEY, M. W. BRUFORD, M. A. BEAUMONT, R. K. WAYNE et al., 1998 Microsatellite analysis of genetic diversity in fragmented South African buffalo populations. Anim. Conserv. 1: 8594.[CrossRef]
PAKENDORF, B., and M. STONEKING, 2005 Mitochondrial DNA and human evolution. Annu. Rev. Genomics Hum. Genet. 6: 165183.[CrossRef][Medline]
PARRA, E. J., A. MARCINI, J. AKEY, J. MARTINSON, M. A. BATZER et al., 1998 Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63: 18391851.[CrossRef][Medline]
PARRA, E. J., R. A. KITTLES, G. ARGYROPOULOS, C. L. PFAFF, K. HIESTER et al., 2001 Ancestral proportions and admixture dynamics in geographically defined African Americans living in South Carolina. Am. J. Phys. Anthropol. 114: 1829.[CrossRef][Medline]
PRESS, W. H., S. A. TEUKOLSKY, W. T. VETTERLING and B. P. FLANNERY, 1996 Numerical Recipes in Fortran 77, Ed. 2. Cambridge University Press, Cambridge, UK.
QUINTANA-MURCI, L., O. SEMINO, H. J. BANDELT, G. PASSARINO, K. MCELREAVEY et al., 1999 Genetic evidence of an early exit of Homo sapiens sapiens from Africa through eastern Africa. Nat. Genet. 23: 437441.[CrossRef][Medline]
ROBERTS, D. F., and R. W. HIORNS, 1965 Methods of analysis of the genetic composition of a hybrid population. Hum. Biol. 37: 3843.[Medline]
SANTOS, C., R. MONTIEL, B. SIERRA, C. BETTENCOURT, E. FERNANDEZ et al., 2005 Understanding differences between phylogenetic and pedigree-derived mtDNA mutation rate: a model using families from the Azores Islands (Portugal). Mol. Biol. Evol. 22: 14901505.
TAJIMA, F., 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437460.
TAVARÉ, S., 1984 Lines of descent and genealogical processes, and their applications in population genetics models. Theor. Popul. Biol. 26: 119164.[CrossRef][Medline]
THOMPSON, E. A., 1973 The Icelandic admixture problem. Ann. Hum. Genet. 37: 6980.[Medline]
WANG, J., 2003 Maximum-likelihood estimation of admixture proportions from genetic data. Genetics 164: 747765.
WANG, J., 2005 Estimation of effective population sizes from data on genetic markers. Philos. Trans. R. Soc. B 360: 13951409.[CrossRef][Medline]
WATSON, E., P. FORSTER, M. RICHARDS and H. J. BANDELT, 1997 Mitochondrial footprints of human expansions in Africa. Am. J. Hum. Genet. 61: 691704.[Medline]
WATTERSON, G. A., 1975 On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256276.[CrossRef][Medline]
WEN, B., H. LI, D. R. LU, X. F. SONG, F. ZHANG et al., 2004 Genetic evidence supports demic diffusion of Han culture. Nature 431: 302305.[CrossRef][Medline]
Communicating editor: J. B. WALSH
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
-
All Versions of this Article:
genetics.105.054130v1
173/3/1679 most recent - Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Wang, J.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Wang, J.















