# A Coalescent-Based Estimator of Admixture From DNA Sequences

- Jinliang Wang
^{1}

- 1
*Address for correspondence:*Institute of Zoology, Regent's Park, London NW1 4RY, United Kingdom. E-mail: jinliang.wang{at}ioz.ac.uk

## Abstract

A variety of estimators have been developed to use genetic marker information in inferring the admixture proportions (parental contributions) of a hybrid population. The majority of these estimators used allele frequency data, ignored molecular information that is available in markers such as microsatellites and DNA sequences, and assumed that mutations are absent since the admixture event. As a result, these estimators may fail to deliver an estimate or give rather poor estimates when admixture is ancient and thus mutations are not negligible. A previous molecular estimator based its inference of admixture proportions on the average coalescent times between pairs of genes taken from within and between populations. In this article I propose an estimator that considers the entire genealogy of all of the sampled genes and infers admixture proportions from the numbers of segregating sites in DNA sequence samples. By considering the genealogy of all sequences rather than pairs of sequences, this new estimator also allows the joint estimation of other interesting parameters in the admixture model, such as admixture time, divergence time, population size, and mutation rate. Comparative analyses of simulated data indicate that the new coalescent estimator generally yields better estimates of admixture proportions than the previous molecular estimator, especially when the parental populations are not highly differentiated. It also gives reasonably accurate estimates of other admixture parameters. A human mtDNA sequence data set was analyzed to demonstrate the method, and the analysis results are discussed and compared with those from previous studies.

OVER the past 70 years, many statistical methods have been developed and applied to estimating the genetic compositions of admixed/hybrid populations, using genetic marker data (for recent reviews see Beaumont 2003; Choisy *et al*. 2004; Excoffier *et al*. 2005). The primary interest is to infer, from the amount and pattern of genetic variation revealed by markers, the proportional contributions of two or more potential parental populations to the gene pool of an admixed population (Chakraborty 1986). Estimating such admixture proportions helps in understanding the evolutionary history of populations (*e.g*., Chikhi *et al*. 2002; Wen *et al*. 2004), in genetic epidemiological investigations (Chakraborty and Weiss 1986, 1988), and in assessing the risk of diseases in human populations. In conservation biology, knowledge of admixture proportions helps in making informed management of endangered species in the wild.

Most methods available use allele frequency data to estimate admixture proportions, exploiting the genetic characteristic of an admixed population that its allele frequencies should be intermediate between those of the parental populations (Cavalli-Sforza and Bodmer 1971; Bertorelle and Excoffier 1998). The main differences among these methods are whether or not to take genetic drift into account and how to select (*e.g.*, Chakraborty *et al*. 1992) and treat allele frequency data statistically. Traditional methods are usually moment estimators that ignore the genetic drift that occurred to the parental and hybrid populations since the admixture event (*e.g.*, Glass and Li 1953; Roberts and Hiorns 1965; Elston 1971; Long 1991; Chakraborty *et al*. 1992), while recent ones are usually likelihood or Baysian estimators (*e.g.*, Thompson 1973; Chikhi *et al*. 2001; Wang 2003), allowing the joint estimation of admixture proportions and genetic drift. A flexible method based on some summary statistics and approximate Bayesian computation (Beaumont *et al*. 2002; Marjoram *et al*. 2003) has been proposed recently, which estimates admixture proportion, genetic drift, and mutation parameters simultaneously from linked or unlinked microsatellite markers (Excoffier *et al*. 2005).

Molecular markers, such as DNA sequences and microsatellites that are now used widely, provide us not only allele frequency information, but also deep genealogical information revealed by the molecular diversity of sampled genes. Most of the above methods do not use such molecular information and assume that mutations are absent since the admixture event, causing two potential problems. One is that discarding molecular information may result in a loss of estimation precision, especially when the mutation rate is high for markers such as large DNA sequences. The other is that when admixture is ancient and thus mutations are not negligible, these methods may fail to deliver an estimate or give rather poor estimates. Realizing these problems, Bertorelle and Excoffier (1998) developed a novel estimator that uses both allele frequency and molecular information and explicitly takes mutations into account in estimating admixture proportions. The estimator was shown to be less biased and, in some situations, to yield more precise estimates of admixture proportions (Bertorelle and Excoffier 1998; Wang 2003). Later on, the estimator was extended by Dupanloup and Bertorelle (2001) to allow three or more parental populations contributing to the admixture.

In this article I develop a new molecular estimator under a well-defined admixture model (Bertorelle and Excoffier 1998; Wang 2003; Excoffier *et al*. 2005) and compare it with the previous molecular estimator by using both simulated and real data. While the previous molecular estimator bases its inference on the average coalescent times between two genes taken from within and between populations, the current one considers the entire genealogy of the sampled genes and infers admixture from the numbers of segregating sites in DNA sequence samples. By considering the genealogy of all sequences rather than pairs of sequences, this new estimator also allows the joint estimation of other interesting parameters such as admixture time, divergence time, population size, and mutation rate as well as admixture proportions.

## METHODS

#### The admixture model:

Several recent studies adopt the admixture model proposed by Bertorelle and Excoffier (1998), with slight modifications (Chikhi *et al*. 2001; Wang 2003; Choisy *et al*. 2004; Excoffier *et al*. 2005). In this study, I use the admixture model of Excoffier *et al*. (2005), as illustrated in Figure 1. I assume an ancestral population, P_{0}, splits into two parental populations, P_{1} and P_{2}, which evolve separately for *T*_{D} generations. At that point of time, a hybrid population, P_{h}, is instantaneously created by combining genes of proportions *p*_{1} and *p*_{2} = 1 − *p*_{1} taken at random from parental populations P_{1} and P_{2}, respectively. After the admixture event, the three populations evolve in isolation for a period of *T*_{A} generations, when a sample of individuals is taken from each population for examining some markers. I also assume, as is implicit in all previous admixture models, that neither direct nor indirect selection is associated with the markers surveyed, and the markers are from diploid and autosomal loci. With 2*N* (*N*, the effective population size) replaced by *N*, however, the method applies to haploid markers (such as mtDNA) as well.

The above admixture model is characterized by seven parameters, which are the effective sizes of the ancestral (*N*_{0}), parental (*N*_{1} and *N*_{2}), and admixed (*N*_{h}) populations; the times (in the unit of generations) of divergence (*T*_{D}) and admixture (*T*_{A}); and the admixture proportion *p*_{1}. The seven parameters are denoted by set ω = {*N*_{0}, *N*_{1}, *N*_{2}, *N*_{h}, *T*_{A}, *T*_{D}, *p*_{1}}. Without an external standard, however, it is impossible to estimate the absolute values of population size and time. They are therefore rescaled by the mutation rate (μ) of markers as θ_{k} = 4*N _{k}*μ (

*k*= 0, 1, 2,

*h*), τ

_{k}=

*T*μ (

_{k}*k*=

*A*,

*D*), and the estimable parameters are denoted by Ω = {θ

_{0}, θ

_{1}, θ

_{2}, θ

_{h}, τ

_{A}, τ

_{D},

*p*

_{1}}.

#### The mutation model:

To utilize molecular information and account for mutations explicitly in estimating admixture, a suitable mutation model must be specified to describe the mutational process of markers. Herein I use DNA sequences as markers, which are assumed to follow the infinite-site model of mutations (Kimura 1969). Under this model, a locus is composed of so many sites that no more than one mutation occurs at any site in the genealogy of the sampled sequences. I also assume a constant-rate neutral mutation process, in which each offspring sequence differs from its parental one by an average of μ mutations. Under these assumptions, the number of mutations in a sample of DNA sequences is identical to the number of nucleotides that are polymorphic (segregating) in the sample. Therefore, the expected number of segregating sites in a sample is simply the product of μ and the expected total branch length of the genealogy of the sample (see, *e.g.*, Tajima 1983; Hudson 1990).

#### Expected numbers of segregating sites:

In this section, I derive the expected total branch length of the genealogy (ETBLG) (denoted by Δ) of a sample of DNA sequences given parameter set ω. The expected number of segregating sites in a sample is simply the product of Δ and mutation rate μ under the infinite-site model. In the next section, the observed numbers of segregating sites are fitted to these expected values to obtain least-squares estimates of Ω.

Suppose *n*_{1}, *n*_{2}, and *n*_{h} sequences of a given locus are sampled at random from the current P_{1}, P_{2}, and P_{h} populations, respectively. The *n* = *n*_{1} + *n*_{2} + *n*_{h} sequences can be arranged to constitute seven composite (artificial) samples. Samples 1, 2, and 3 contain sequences solely from populations P_{1}, P_{2}, and P_{h}, respectively. Samples 4, 5, and 6 are obtained by merging samples 1 and 2, 1 and 3, and 2 and 3, respectively. Sample 7 contains all of the *n* sequences. The sample sizes (number of sequences) for samples 1, 2, … , 7 are therefore *n*_{1}, *n*_{2}, *n*_{h}, *n*_{1} + *n*_{2}, *n*_{1} + *n*_{h}, *n*_{2} + *n*_{h}, and *n*_{1} + *n*_{2} + *n*_{h}, respectively.

##### Expected total branch length of a genealogy:

For convenience in deriving the ETBLG of a sample of sequences, time is measured backward hereafter. The current time when the sample was taken is designated as generation zero and the time *T* generations ago is referred to as generation *T*. Consider the ETBLG of sample 1, Δ_{1}, conditional on parameters ω. The genealogy can be partitioned into two segments, the first being formed by the coalescent process in population P_{1} during time interval [0, *T*_{A} + *T*_{D}], while the second is formed by the coalescent process in population P_{0} during time interval [*T*_{A} + *T*_{D}, ∞]. The expected total branch length of segment one can be derived, as shown in appendix a, as(1)where *i = n*_{1} is the initial number of sequences, *N = N*_{1} is the effective population size, *T = T*_{A} + *T*_{D} is the time interval, and and are the rising and falling factorial functions, respectively.

For the second segment, the initial number of sequences, *j* (*n*_{1} ≥ *j* ≥ 2), at time *T*_{A} + *T*_{D} is a random variable. The probability of an initial *i* sequences at time zero coalescing into *j* sequences at time *T* in a population with effective size *N* is(2)(Tavaré 1984). Inserting *i* = *n*_{1}, *T* = *T*_{A} + *T*_{D}, and *N* = *N*_{1} into (2) gives the probability that *j* (*n*_{1} ≥ *j* ≥ 2) sequences are left extant at time *T*_{A} + *T*_{D} in population 1. Given *j* at time *T*_{A} + *T*_{D} and the effective size of the ancestral population *N*_{0}, the expected total branch length of the second segment of genealogy (*e.g*., Hudson 1990) is . Summing over all possible values of *j* gives the expected total branch length of the second segment of genealogy(3)where *i = n*_{1}, *N = N*_{1}, *T = T*_{A} + *T*_{D}. The ETBLG is the sum of (1) and (3),(4)When , irrespective of the parameter values of , and *N*_{0}, as is expected because the most recent common ancestor (MRCA) must be found in the first segment of genealogy. In such a case, (4) reduces to , the expected total branch length of the genealogy of *n*_{1} sequences from a population of a constant size *N*_{1} (*e.g*., Hudson 1990). It can be shown that when *N*_{0} = *N*_{1}, (4) also reduces to irrespective of *T*_{A} and *T*_{D}, as expected.

Similarly, the ETBLG of sample 2 conditional on parameter set ω, Δ_{2}, is calculated by the right side of (4), replacing *N*_{1} by *N*_{2} and *n*_{1} by *n*_{2}, respectively.

The derivation for the ETBLG of sample 3 is much more complicated. The genealogy is again partitioned into two segments. The first segment is formed by the coalescent process of an initial *n*_{h} sequences in population P_{h} during interval [0, *T*_{A}]. The expected total branch length of this segment is calculated by (1) with *i* = *n*_{h}, *N* = *N*_{h}, and *T* = *T*_{A}. The second segment is formed by the coalescent processes in populations P_{1} and/or P_{2} during time interval [*T*_{A}, *T*_{A} + *T*_{D}] and then in ancestral population P_{0} during time interval [*T*_{A} + *T*_{D}, ∞]. The probability of *i* = *n*_{h} sequences coalescing into *j* = *m* sequences at time *T* = *T*_{A} in population P_{h} with effective size *N* = *N*_{h} is calculated by (2). According to the sources of the *m* extant sequences at time *T*_{A}, three cases are distinguishable.

Case 1: The

*m*extant sequences at time*T*_{A}come exclusively from population P_{1}. When this case occurs, with probability , the expected total branch length of the second segment of genealogy given*m*can be calculated by the sum of (1) and (3), replacing*i*,*N*, and*T*by*m*,*N*_{1}, and*T*_{D}, respectively.Case 2: The

*m*extant sequences at time*T*_{A}come exclusively from population P_{2}. When this case occurs, with probability where , the expected total length of the second segment of genealogy given*m*can be calculated by the sum of (1) and (3), replacing*i*,*N*, and*T*by*m*,*N*_{2}, and*T*_{D}, respectively.Case 3: Among the

*m*extant sequences at time*T*_{A},*m*_{1}(0 <*m*_{1}<*m*) sequences come from population P_{1}and*m*_{2}=*m − m*_{1}from P_{2}. When this case occurs, with the binomial probability of , the second segment of genealogy can be further partitioned into three subsegments. Subsegment 1 is formed by the coalescent process of the initial*m*_{1}sequences at time*T*_{A}in population P_{1}during interval [*T*_{A},*T*_{A}+*T*_{D}]. The expected total branch length of this subsegment can be derived, shown in appendix a, as

(5)where *i = m*_{1} is the initial number of sequences, *N = N*_{1} is the effective size, and *T* = *T*_{D} is the length of time. Subsegment 2 is formed by the coalescent process of the initial *m*_{2} sequences at time *T*_{A} in population P_{2} during interval [*T*_{A}, *T*_{A} + *T*_{D}]. The expected total branch length of this subsegment can be calculated similarly by (5) replacing *i*, *N*, and *T* by *m*_{2}, *N*_{2}, and *T*_{D}, respectively. The third subsegment is formed by the coalescent process in ancestral population P_{0} during the time interval [*T*_{A} + *T*_{D}, ∞]. Suppose *m*_{3} and *m*_{4} sequences are extant at time *T*_{A} + *T*_{D} in populations P_{1} and P_{2}, respectively, with probabilities and , respectively, calculated by (2). The expected total branch length of the segment of genealogy for the given initial *m*_{3} + *m*_{4} sequences in population P_{0} during the time interval [*T*_{A} + *T*_{D}, ∞] is (*e.g*., Hudson 1990) . Considering all possible values of *m*_{3} and *m*_{4} leads to the expected total branch length of the third subsegment of genealogy, . Summing over the three subsegments gives the expected total branch length of the second segment of genealogy in case 3.

Summing over the three cases yields the expected total branch length of genealogy for segment 2. Adding the expected total branch lengths of segments 1 and 2 gives the ETBLG of sample 3,(6)where and . It can be shown that, when *N _{k}* = (

*k*= 0, 1, 2,

*h*) and

*n*

_{h}= 2, (6) reduces to , which is twice the expected coalescent time between a pair of sequences from the admixed haploid population derived by Bertorelle and Excoffier (1998).

Using an approach similar to the derivation of (6), I also obtained, as shown in appendix a, the equations for the ETBLGs of samples 4–7.

##### Expected number of segregating sites:

Under the infinite-site model assumed above, the expected number of segregating sites (ES) in a sample is simply the product of mutation rate μ and ETBLG of the sample. The expected number of segregating sites of the *k*th sample conditional on parameter set Ω, ES_{k}, can then be calculated by the equation for Δ_{k} by replacing *N _{j}*,

*T*

_{A}, and

*T*

_{D}with θ

_{j}/4, τ

_{A}, and τ

_{B}, respectively, where

*k*= 1, 2, … , 7 and

*j*= 0, 1, 2,

*h*.

#### Estimation of parameters:

Suppose, for a single locus with (unknown) mutation rate μ, the number of segregating sites in sample *k* (*k* = 1, 2, … , 7) is observed to be OS_{k}. Estimates of the parameters Ω = {θ_{0}, θ_{1}, θ_{2}, θ_{h}, τ_{A}, τ_{D}, *p*_{1}} can be obtained by fitting these observed to the expected numbers of segregating sites by a least-squares approach,(7)where ES_{k} is calculated as shown above. Since *f*(Ω) is a complicated function of the seven parameters and no closed form of solution is possible, some numerical methods have to be adopted for the estimation. The first derivatives of *f*(Ω) with respect to each of the seven parameters can be obtained and used in the multidimensional Newton–Raphson algorithm for the estimates of Ω from (7). However, the computation is intensive, especially when sample sizes are large, because both function (7) and its derivatives are not trivial to compute. Further, such an algorithm is sometimes fooled by a local rather than a global minimum of *f*(Ω). Having tried several methods, I finally choose to use Powell's quadratically convergent method (Press *et al*. 1996) with slight modifications. This algorithm does not require the computation of derivatives, and with the modification it updates only one of the parameters in most iterations so that only part of (7) needs to be recalculated. For example, updating θ_{1} does not alter ES_{2} but changes only parts of the calculations of ES_{j} for *j* = 3, 4, … , 7. Therefore, the algorithm coupled with storing/reusing the computational results for different parts of ES_{j} could reduce computational burden tremendously. To speed up computation, this algorithm occasionally updates multiple parameters simultaneously along an optimal direction determined by collecting and using information of previous iterations. Some comparative analyses of simulated data indicated that Powell's algorithm is less often stuck on a local minimum than the Newton–Raphson algorithm. In the results shown below, each simulated data set is analyzed in five independent replicates, each with a randomly chosen set of starting parameter values. The final estimates are those from the replicate with the minimum value of *f*(Ω). To analyze an empirical data set, more starting points can be used to obtain more reliable estimates.

The computational load of (7) increases rapidly with the numbers of sequences from the three populations. Furthermore, it is difficult to calculate (2) quickly and accurately because it is a series having terms of large values and alternating signs. To avoid large numerical errors in calculating (2) and thus (7) for large genealogies (a sample of ≥100 sequences), I conduct computations using high precision of hundreds of significant digits (depending on sample size). An alternative option is to adopt a Markov chain Monte Carlo method proposed by Griffiths and Tavaré (1994) as in O'Ryan *et al*. (1998).

#### Multiple loci:

For multiple independent loci, it is inappropriate to use simply the average numbers of segregating sites over loci as data in the estimation. Distinctive loci may have different mutation rates because of the differences in mutation rate per base pair and/or in the sequence length. Different loci may also have different sample sizes, resulting in different ETBLGs and thus different expected numbers of segregating sites. The above methodology can be extended to use multilocus data jointly in estimating the parameters Ω = {θ_{0}, θ_{1}, θ_{2}, θ_{h}, τ_{A}, τ_{D}, *p*_{1}}. In addition, the relative mutation rate of each locus can be estimated simultaneously.

Suppose a number of *L* unlinked loci have been surveyed, with locus *l* (*l* = 1, 2, … , *L*) having a mutation rate μ_{l} and a sample size of *n _{j}*

_{,l}sequences from population P

_{j}(

*j*= 1, 2,

*h*). Without loss of generality, I scale parameters

*N*(

_{j}*j*= 0, 1, 2,

*h*),

*T*

_{A},

*T*

_{D}, and μ

_{l}(

*l*= 2, 3, … ,

*L*) by μ

_{1}, the mutation rate of the first locus. The set of parameters to be estimated now becomes Ω = {θ

_{0}, θ

_{1}, θ

_{2}, θ

_{h}, τ

_{A}, τ

_{D},

*p*

_{1}, λ

_{2}, λ

_{3}, … , λ

_{L}}, where θ

_{j}= 4

*N*μ

_{j}_{1}(

*j*= 0, 1, 2,

*h*), τ

_{A}=

*T*

_{A}μ

_{1}, τ

_{D}=

*T*

_{D}μ

_{1}, and λ

_{l}= μ

_{l}/μ

_{1}(

*l*= 2, 3, … ,

*L*). The least-squares function for multiple loci becomes(8)where OS

_{j,l}and ES

_{j,l}are observed and expected numbers of segregating sites for locus

*l*in composite sample

*j*(

*j*= 1, 2, … , 7). ES

_{j,1}is calculated as before, while ES

_{j,l}for locus

*l*with

*l*≥ 2 is calculated using parameters {θ

_{0}, θ

_{1}, θ

_{2}, θ

_{h}, τ

_{A}, τ

_{D},

*p*

_{1}} and then multiplied by λ

_{l}.

The value of *p*_{1} and the relative values of θ_{j} (*j* = 0, 1, 2, *h*), τ_{A}, and τ_{D} obtained from (8) are independent of the locus chosen to scale the parameters. This is because mutation rate does not affect the genealogies and acts only as a multiplier with ETBLGs in determining the expected numbers of segregating sites. Using a locus with a smaller (larger) mutation rate to scale the parameters causes just a proportional decrease (increase) in the estimates of θ_{j} (*j* = 0, 1, 2, *h*), τ_{A}, and τ_{D} and has no affect on the estimates of *p*_{1}. This can be checked easily by simulations.

#### Simulations:

Monte Carlo simulations were run to generate data sets with known parameters. These data were then analyzed by the newly developed estimator to check its quality of estimates, to investigate its statistical properties, and to compare it with a previous molecular estimator. Although quite a few admixture estimators are available, only the one of Bertorelle and Excoffier (1998) is a molecular estimator designed to use molecular information and take mutations into account. Therefore I confine my comparison to this molecular estimator in this study.

Following the coalescence approach (Hudson 1990) and the admixture model (Figure 1), the genealogies of the *n*_{1} + *n*_{2} + *n*_{h} DNA sequences from the current three populations were reconstructed until the MRCA was found. Poisson-distributed mutations were then imposed on the reconstructed gene tree. Recombination was assumed to be absent and mutations were assumed to follow the infinite-site model. Data for different loci were generated independently, and monomorphic loci with no segregating sites were discarded. The sequence data were then processed to extract information for different estimators. For the current estimator, the number of segregating sites in each of the seven composite samples was obtained. For Bertorelle and Excoffier's molecular estimator, the mean coalescence time (scaled by mutation rate) was estimated by the mean number of site differences between pairs of sequences.

Several statistics are adopted to measure the quality of estimates from simulated data. First, the applicability (denoted as Appl%) of an estimator is calculated as the percentage of replicates in which admixture proportion estimates can be made successfully and the estimates are in the legitimate range of [0, 1] (Choisy *et al*. 2004). Second, the mean and root mean square errors (the square root of mean squared errors, denoted by RMSE) of estimates across replicates are calculated. Third, “factor 2” is calculated as the proportion of replicates in which the estimated value is within the interval bounded by values equal to 50 and 200% that of the true value (Excoffier *et al*. 2005). This measurement overlaps with RMSE in telling how close the estimates are to the true parameter value, but it is less affected by extreme outliers of the distribution of estimates. For most combinations of parameters, 1000 replicates were conducted.

#### Analysis of an empirical data set:

For demonstration, the estimator proposed herein was applied to the analysis of a published data set from McLean *et al*. (2003). They sequenced the hypervariable segments I (HVSI, 364 bp in length) and II (HVSII, 343 bp in length) of the mtDNA from 47 Sierra Leoneans, 12 European–Americans, 12 rural Gullah-speaking African–Americans, 12 urban African–Americans living in Charleston, South Carolina, and 12 Jamaicans. Assuming that African–American populations are admixtures by Europeans and Africans (*e.g.*, Parra *et al*. 1998), the mtDNA data can be analyzed by the coalescent estimator to infer the European genetic contributions to the gene pool of each of the three African–American populations and the admixture time, divergence time, and genetic drift (population size) of each parental and admixed population involved. Sites in HVSI and HVSII sequences with missing or ambiguous information were eliminated, resulting in 295 and 275 unambiguous sites for HVSI and HVSII, respectively, utilized in the analysis. Due to the absence of recombination in human mtDNA, HVSI and HVSII are effectively a single locus. Sequences for the two loci are thus combined to form single-locus data before being analyzed by the two molecular estimators of admixture.

## RESULTS

#### Simulations:

Many factors are important in determining the quality of admixture estimates, including the true parameters (defined by the genetic model in Figure 1) being estimated and the marker information content influenced by the number of loci, the number of individuals genotyped, and the polymorphism of each locus (*e.g.*, Wang 2003; Choisy *et al*. 2004; Excoffier *et al*. 2005). The factor combinations are prohibitively too numerous to consider in a simulation study. Here I choose to present the estimation results in some hopefully typical scenarios.

The performances of the current and previous molecular estimators in estimating admixture proportions for the scenarios of a short or long divergence time (*T*_{D} = 2500 or 10,000 generations); short, intermediate, or long admixture time (*T*_{A} = 50, 500, or 5000 generations); and small or moderate admixture (*p*_{1} = 0.05 or 0.20) are summarized in Table 1. Sample size for each population is assumed to be either 20 or 40, and the number of loci is assumed to be 1, 5, or 10 with the same mutation rate of 0.001 per DNA sequence per generation. When the divergence time is short (*T*_{D} = 2500) so that parental populations are not highly differentiated (*T*_{D}/*N* = 0.5) when admixture occurs, the Appl% of the *m _{Y}* estimator is only ∼70–90% for the case of a single locus, and 10–30% of

*p*

_{1}estimates from this estimator are either smaller than zero or greater than one. Although the Appl% of the

*m*estimator improves with an increasing amount of marker information (mainly number of loci), it is still <90% for slight admixture even if 5 loci are used. Note that the increase in Appl% of the

_{Y}*m*estimator with an increasing

_{Y}*T*

_{A}is an artifact, because a large

*T*

_{A}results in the estimates of

*p*

_{1}biased toward 0.5 and thus in fewer negative estimates. In contrast, the new estimator gives the estimates of

*p*

_{1}that are always in the legitimate range of [0, 1]. Compared with the

*m*estimator, the new estimator is generally much less biased and has much smaller RMSE.

_{Y}When the divergence time is long (*T*_{D} = 10,000 generations) so that parental populations are highly differentiated (*T*_{D}/*N* = 2) before they contribute to the admixture, the performances of the two estimators become similar. The *m _{Y}* estimator has an Appl% close to 100%, except when a single locus is used in estimating small admixture proportions. The new estimator is less biased than the

*m*estimator, especially when admixture is small and

_{Y}*T*

_{A}is large. The main merit of molecular estimators in comparison with traditional estimators is that mutations after the admixture events can be accounted for so that ancient admixture can be inferred accurately. The current estimator allows almost unbiased estimation of admixture proportions even if

*T*

_{A}= 5000 (

*T*

_{A}/

*N*= 1) when the divergence time is long and multilocus data are available. In contrast, the

*m*estimator gives estimates of

_{Y}*p*

_{1}biased toward 0.5 when admixture is ancient.

In addition to admixture proportions, the new estimator can also provide estimates of other interesting parameters. Table 2 summarizes the properties of the estimates of θ_{0}, θ_{1}, θ_{2}, θ_{h}, τ_{A}, τ_{D}, and relative mutation rates (λ_{l}). It can be seen that θ_{1}, θ_{2}, and μ_{l} are very well estimated with small biases and RMSEs, while θ_{h} is the most difficult parameter to estimate. This is understandable because information about θ_{h} comes from the coalescent events in the hybrid population during time interval [0, *T*_{A}] only, and these events are too few when *n*_{h} and *T*_{A}/*N*_{h} are small to allow accurate estimates of θ_{h}. Indeed, the quality of θ_{h}-estimates increases with an increasing sample size and admixture time, as shown in Table 2. Similarly, the quality of τ_{A} estimates is dependent on the number of coalescent events in the three populations during interval [0, *T*_{A}] and thus increases with an increasing admixture time (*T*_{A}) and a decreasing population size (*N*_{1}, *N*_{2}, *N*_{h}). In contrast, τ_{D} is more accurately estimated with a decreasing admixture time (*T*_{A}). Small *T*_{A} means fewer coalescent events during interval [0, *T*_{A}] and more coalescent events during time interval [*T*_{A}, *T*_{A} + *T*_{D}] and thus more information about τ_{D}. For similar reasons, θ_{0} is better estimated with a smaller value of *T*_{A} + *T*_{D}. The estimates of all parameters are improved substantially by increasing the number of loci and sample sizes.

I adopt the parametric bootstrapping technique to assess the uncertainties of admixture estimates from the new estimator. This is rendered possible because all the parameters fully defining the admixture model in Figure 1 can be estimated by the new estimator. Parametric bootstrapping is more appropriate than nonparametric bootstrapping (Bertorelle and Excoffier 1998) because the latter tends to yield too conservative estimates of uncertainties when the number of resampling units (loci, sequences) is small. Due to the heavy computational burden of the current estimator, however, it is difficult to evaluate the performance of the parametric bootstrapping procedure using extensive simulations. Table 3 lists the uncertainty estimates, which are average upper and lower limits of 95% confidence intervals (C.I.95%) and coverage (frequency of the true parameter value being covered by the estimated C.I.95%), for the cases of a single locus and five loci. In each case, 100 replicate data sets are simulated, and each data set is analyzed for point and C.I.95% estimates using 500 bootstrapping samples. The parameter values used in generating the simulated data sets are θ_{k} = 20 (*k* = 0, 1, 2, *h*), τ_{A} = 0.5, τ_{D} = 10, μ = 0.001, and 20 sequences for each locus from each population. As can be seen, the true parameter value is included in the estimated 95% confidence intervals in ∼95% of the replicates for both the single-locus and the five-loci cases. The confidence intervals for five loci are much narrower than those for a single locus, as is expected. In accordance with the results listed in Table 2, θ_{h} is the most difficult parameter to estimate, as indicated by the extremely large confidence intervals.

#### Admixture analysis of human populations:

The mtDNA sequence data from McLean *et al*. (2003) are analyzed by Bertorelle and Excoffier's (1998) *m _{Y}* estimator and the new estimator. Parametric bootstrapping and nonparametric bootstrapping are adopted for the new and

*m*estimators, respectively, to ascertain the uncertainties of the estimates using 1000 samples of size identical to the original samples. The estimates of the European contributions to each of the three admixed African–American and Jamaican populations are listed in Table 4.

_{Y}The European contributions to the three admixed populations are estimated to be <7% from the new coalescent estimator and are in close agreement with previous estimates (Parra *et al*. 2001; McLean *et al*. 2003). McLean *et al*. (2003) calculated admixture proportions from the frequencies of haplotypes composed of three HVS restriction site polymorphisms (RSPs). These RSPs are chosen because they are substantially differentiated between African and European populations and are thus highly informative for admixture analysis. Furthermore, a large number of 1396 individuals from the same five populations as the mtDNA sequence data analyzed herein are assayed for the RSPs. The estimated European contributions from their analyses are 0.030, 0.069, and −0.027 for the Gullahs, Charleston African–Americans, and the Jamaicans, respectively. From large samples (∼100 sequences per population) of HVSI data, the European contributions to the three admixed populations were estimated to be ∼0% using the highly informative haplogroup H frequencies (Parra *et al*. 2001) and were estimated to be 0.065 and 0.129 for the Charleston and Jamaican populations, respectively, from both haplogroup H and L frequencies (Parra *et al*. 1998). In general, these estimates are much lower than those inferred from many informative nuclear markers (Parra *et al*. 1998), indicating that European females contributed little to the admixtures. The sex-biased admixtures, with European males contributing substantially greater than European females, were confirmed by analyzing the Y Alu polymorphic (YAP) informative markers (Parra *et al*. 1998). It is encouraging that with a sample size as small as 12 sequences, the new coalescent estimator yields similar results.

The *m _{Y}* estimator yields estimates of European contributions to the Gullah or Jamaican population that are low and roughly compatible with estimates from the other estimators and data, but estimates of European contribution to the Charleston population (0.5) that is much larger than other estimates and is even larger than the estimate from nuclear markers (0.12, Parra

*et al*. 1998). The estimated European contribution to the Jamaican population was −0.063, which is not surprising given the simulation result that the

*m*estimator often yields negative estimates when admixture proportion is low (Table 1).

_{Y}For both molecular estimators and all of the three admixed populations, the 95% confidence intervals for the admixture estimates determined by parametric and nonparametric bootstrapping are quite broad. This is perhaps not surprising because the data set is small with effectively a single locus and only 12 sequences from a population. More loci and larger sample sizes are required to obtain more precise admixture estimates.

In addition to admixture proportions, the new estimator also gives the estimates of divergence times, admixture times, population sizes, and relative mutation rates. These estimates are listed in Table 4. Because human mtDNA does not recombine (Pakendorf and Stoneking 2005), HVSI and -II sequences were obtained from the same individuals (McLean *et al*. 2003), and the sample sizes are very small, the genealogical information that can be extracted from the data set is quite limited and the analysis results need to be interpreted with caution. Divergence time between the Africans and Europeans is estimated to be 1.9 on average. In the literature, the estimates of mutation rate for HVSs are quite diverse, the median value being ∼0.1/site/million years (MY) (Pakendorf and Stoneking 2005; Santos *et al*. 2005). The total mutation rate for HVSI and -II is therefore ∼70/MY or 0.0021/generation if the generation interval is taken to be 30 years. The absolute divergence time is thus estimated to be 1.9/(70/MY) = 27,143 years, which is roughly in agreement with the estimate of <60,000 years from phylogenetic analysis of mtDNA control regions (*e.g.*, Watson *et al*. 1997; Quintana-Murci *et al*. 1999). Subsequent migration after the split of Eurasian from African population could reduce the divergence and thus lead to an underestimation of divergence time (Wang 2003).

The point estimates of admixture time are quite variable from the three admixture analyses. Using the mutation rate estimate of 70/MY, the admixture time is estimated to be 6, 1628, and 2801 years ago for the Gullah, Charleston, and Jamaican populations, respectively. The first estimate is too small while the last two estimates are too high compared with the historical evidence that the American–African populations were formed 150 years ago (Parra *et al*. 1998). For all of the three admixed populations, however, the 95%C.I.'s are fairly consistent and well include the admixture time of 150 years.

The parental and ancestral population sizes (θ_{1}, θ_{2}, θ_{0}) are well estimated while the admixed population size (θ_{h}) is poorly estimated, as indicated by the corresponding widths of 95%C.I.'s. On average, the African population size is estimated to be 8 and 37 times larger than that of European and admixed populations, respectively, and the European population is 4 times larger than admixed populations. The results seem to be plausible and are at least qualitatively in agreement with previous studies.

## DISCUSSION

In this article, I show that DNA sequence data can be utilized more efficiently in admixture inferences by considering the entire genealogy of all sampled sequences rather than the genealogy of pairs of sequences. In comparison with a previous molecular estimator (Bertorelle and Excoffier 1998), the new estimator provides better estimates of admixture proportions, which are always in the legitimate range of [0, 1] and have usually higher accuracy and precision, especially when divergence time is short and/or admixture time is long (Table 1). In addition, it allows reasonably good estimation of other important parameters of the admixture model, such as the divergence time, admixture time, and population sizes (Table 2). These parameters are scaled by the mutation rate of the markers, but their relative values are still meaningful in understanding the admixture events. When marker mutation rates are known, the absolute values of divergence and admixture times and population sizes can be easily calculated from the estimates. Other advantages of the new estimator are that it can use information from multiple loci with different mutation rates in estimating admixture and relative mutation rates jointly and that it automatically accounts for variable sample sizes both among and within loci because its inferences are based on the genealogy of an entire sample rather than pairs of sequences. The simulation results shown in Tables 1 and 2 assumed equal sample sizes among populations and loci and equal mutation rates among loci. When either or both of these two quantities vary, the new estimator is expected to perform even better than the previous molecular estimator (Bertorelle and Excoffier 1998). Furthermore, the previous molecular estimator assumes an equal size of all populations involved in the admixture. The assumption is now redundant in the new estimator and all population sizes can be estimated jointly with admixture proportions.

The differences between the current and previous molecular estimators of admixture are in close analogy with those between Watterson's and Tajima's estimators of from DNA sequence data. Watterson (1975) showed that, under the infinite-site mutation model, the product of mutation rate and expected total branch length of the genealogy of a sample of sequences gives the expected number of segregating sites of the sample. From this relationship, he derived the estimator , where OS is the observed number of segregating sites in a sample of *n* sequences. Tajima's estimator, θ_{T}, is given by the average number of nucleotide differences between two sequences (Tajima 1983). It is well known that both estimators are unbiased, but θ_{W} is generally more efficient than θ_{T} because it has a smaller variance, and the difference increases with sample size (Li 1997; Wang 2005). In the special case of a single Wright–Fisher population (say, parental population 1), the admixture model has just one parameter (θ) and my admixture estimator (7) reduces to θ_{W}, as is expected. The previous molecular estimator of admixture (Bertorelle and Excoffier 1998) uses the mean number of nucleotide differences between pairs of sequences as information and is thus quite similar in this respect to θ_{T}.

Some assumptions made in deriving the current estimator can be relaxed without affecting much of the validity of the estimator. Although the current estimator assumes diploid nuclear markers, it applies to maternally (mtDNA) or paternally (Y chromosome) inherited markers as well. The only difference is in the explanation of the parameter θ that corresponds to different effective sizes. It is assumed that there is no recombination at a locus. However, I use means rather than variances of the number of segregating sites as data in the estimator, and therefore it should apply to loci with recombination. Like Watterson's estimator, the current admixture estimator should actually give better estimates when there is recombination, although the estimated uncertainties might become too exaggerative. The current estimator also assumes two parental populations contributing to the admixture. It is straightforward to extend the method to the case of three or more parental populations. However, the computational burden increases very rapidly with the number of parental populations. Even with two parental populations as assumed in this study, the estimator's computational load increases so rapidly with sample sizes that it can cope only with samples of a few hundred sequences on an ordinary PC. Further refinements of the computational algorithms are necessary before the estimator is extended to more complicated situations such as three or more parental populations.

There is also room for methodological improvements of the current estimator. In (7), the expected numbers of segregating sites of the seven composite samples are obviously nonindependent, because each original sample of sequences appears in four of the seven composite samples. Ideally, their variance and covariance structure should be incorporated into a general least-squares framework to obtain estimates of the seven parameters. However, it is extremely difficult to derive this variance and covariance matrix analytically, and computation of the matrix numerically by simulations is too CPU demanding to be realistic. Although (7) ignored this variance and covariance structure, it should provide unbiased estimates as verified by simulations. The estimator's precision may be improved by the proper weighting based on the variance and covariance matrix.

A software package, Molecular Estimator of Admixture (MEAdmix), computing the admixture estimator and finding the confidence intervals by parametric bootstrapping, is available for free download from http://www.zoo.cam.ac.uk/ioz/software.htm.

## APPENDIX A: THE EXPECTED TOTAL BRANCH LENGTH OF A GENEALOGY

Suppose *i* genes at a locus are sampled randomly from a Wright–Fisher population of *N* diploid individuals. Looking backward in time, the current time when the sample was taken is designated as generation zero and the time *T* generations ago is referred to as generation *T*. As *T* → ∞, these *i* genes will coalesce into their MRCA, and the ETBLG is generations (*e.g.*, Hudson 1990). When *T* is a fixed definite number, however, the MRCA may or may not be found in the interval of [0, *T*] and *j* (1 ≤ *j* ≤ *i*) lineages may be left extant at generation *T*. The probability of *j* distinct lineages extant at time *T*, , is given by (2) (Tavaré 1984). Given *i*, *j*, *N*, and *T*, the expected time, ET_{m}, during which there are *m* (*j* ≤ *m* ≤ *i*) distinct lineages in a genealogy is derived as follows:

ET

_{m}for*m = i*: Conditional on*i*,*j < i*,*N*, and*T*, the probability of the time interval (in generations),*T*, during which there are_{i}*i*distinct lineages isET_{i}is thus obtained by the integrationwhich, after some algebra, simplifies to(A1)where .ET

_{m}for*j < m < i*: Similarly, the expected time (in generations) during which there are*m*distinct lineages (*j < m < i*), given*i*,*j < i −*1,*N*, and*T*, is obtained by integration:It can be simplified to(A2)where when*r = m*and otherwise, and .ET

_{m}for*m*=*j*: The expected time during which there are*m*=*j*distinct lineages, given*i*,*j*≤*i*,*N*, and*T*, is obtained by integration,which leads to(A3)where when*k = j*and otherwise, and . When no coalescent events occur in the interval of [0,*T*] (*i.e.*,*j*=*i*), (A3) reduces to , as is expected.

The ETBLG in the interval [0, *T*] can be calculated using (A1–A3), in two separate cases. In the first case, the part of the genealogy after the MRCA is found, branch length *T*_{1}, is irrelevant and excluded. In the second case, *T*_{1} is included in the ETBLG. Cases 2 and 1 apply when the MRCA lineage is and is not to be included in another genealogy involving other genes formed after *T*. The ETBLG conditional on *i*, *j*, *N*, *T* is(A4)if *T*_{1} is excluded and is(A5)if *T*_{1} is included.

Considering all possible values of *j* given *i*, *N*, and *T*, the ETBLG in case 1 iswhich, after some algebra, reduces to (1) in the text. As *T* → ∞, (1) further reduces to (*e.g.*, Hudson 1990), as is expected. Similarly, the ETBLG in case 2 iswhich reduces to (5) in the text.

## APPENDIX B: THE ETBLG OF SAMPLES 4–7

The expected total branch length of the genealogy (ETBLG) for sample 4 can be derived, using the approach adopted in deriving (6) in the text, as(B1)When *n*_{1} = *n*_{2} = 1 and *N _{k}* =

*N*(

*k*= 0, 1, 2), (B1) reduces to , which is twice the expected coalescent time between a sequence from parental population 1 and a sequence from parental population 2.

The ETBLG for sample 5 is(B2)where and . As is expected, (B2) reduces to , twice the average coalescent time between a sequence from parental population 1 and a sequence from the admixed population, when *n*_{1} = *n*_{h} = 1 and *N _{k}* =

*N*(

*k*= 0, 1, 2,

*h*). For sample 6, Δ

_{6}is calculated by the right side of (B2) by exchanging and .

The ETBLG of sample 7 is(B3)where and .

## Acknowledgments

I thank David C. McLean for sending me the human mtDNA sequence data that were analyzed by my new admixture estimator and Laurent Excoffier, Brigitte Pakendorf, Bruce Walsh, and two anonymous referees for critical reading and constructive comments on earlier versions of this manuscript.

## Footnotes

Communicating editor: J. B. Walsh

- Received December 1, 2005.
- Accepted April 9, 2006.

- Copyright © 2006 by the Genetics Society of America