Genetics, Vol. 149, 1539-1546, July 1998, Copyright © 1998

Inference of Population History Using a Likelihood Approach

Gunter Weissa and Arndt von Haeselera
a Institute of Zoology, University of Munich, D-80333 Munich, Germany

Corresponding author: Arndt von Haeseler, Institute of Zoology, University of Munich, Luisenstrasse 14, D-80333 Munich, Germany, arndt{at}zi.biologie.uni-muenchen.de (E-mail).

Communicating editor: S. TAVARÉ


*  ABSTRACT
*TOP
*ABSTRACT
*THEORETICAL BACKGROUND
*APPLICATION TO SEQUENCE DATA
*DISCUSSION
*LITERATURE CITED

We introduce an approach to revealing the likelihood of different population histories that utilizes an explicit model of sequence evolution for the DNA segment under study. Based on a phylogenetic tree reconstruction method we show that a Tamura-Nei model with heterogeneous mutation rates is a fair description of the evolutionary process of the hypervariable region I of the mitochondrial DNA from humans. Assuming this complex model still allows the estimation of population history parameters, we suggest a likelihood approach to conducting statistical inference within a class of expansion models. More precisely, the likelihood of the data is based on the mean pairwise differences between DNA sequences and the number of variable sites in a sample. The use of likelihood ratios enables comparison of different hypotheses about population history, such as constant population size during the past or an increase or decrease of population size starting at some point back in time. This method was applied to show that the population of the Basques has expanded, whereas that of the Biaka pygmies is most likely decreasing. The Nuu-Chah-Nulth data are consistent with a model of constant population.


IN the past decade much effort has been put into collecting and sequencing DNA from human populations from all over the world (CAVALLI-SFORZA et al. 1994 Down; KOGELNIK et al. 1997 Down; HANDT et al. 1998 Down). While much of the interest in these problems comes from medical genetics, population geneticists also address questions about the origin of our species, its spread over the world, and relationships between contemporary populations. The noncoding hypervariable region I (HVRI) of mitochondrial DNA (mtDNA) is especially suited for this intraspecies analysis. This is due to the maternal inheritance of mtDNA, the absence of recombination, and the high mutation rate of this part of the genome. By now more than 4000 individual HVRI sequences have been published (HANDT et al. 1998 Down).

There are different approaches to analyzing this sort of data: Networks are a suggestive way to visualize the relationship of sequences in a sample (BANDELT et al. 1995 Down). Tree reconstruction methods can give insights into the mode of evolution of the particular genomic region studied (TAMURA and NEI 1993 Down). If one intends to extract information about historical population events, coalescent theory provides a framework with which to incorporate parametric models of population history explicitly (DONNELLY and TAVARE 1995 Down). Coalescent models describe the evolution of a sample of DNA sequences in terms of stochastic processes. Thus, they allow application of statistical techniques for parameter estimation and model testing.

Here we study a class of expansion models that include three parameters: the initial population size, the point back in time when exponential growth (positive or negative) began, and the ratio of current to initial population size. Other authors (ROGERS and HARPENDING 1992 Down; GRIFFITHS and TAVARE 1994B Down; WAKELEY and HEY 1997 Down) estimated the parameters of a so-called sudden expansion model assuming an infinitely many sites model (WATTERSON 1975 Down).

We argue that a Tamura-Nei model with {Gamma} -distributed rates is more appropriate for describing the evolution of HVRI sequences (TAMURA and NEI 1993 Down). Under this complex mutation model it is difficult to derive analytical formulae and infer population history parameters by a method of moments approach. Therefore, we make use of a likelihood ratio statistic that is based on two summary statistics, the mean pairwise sequence difference and the number of variable positions in a sample of DNA sequences. More precisely, for a whole range of population histories we compute the corresponding likelihoods given the data. A history is considered implausible if its likelihood is much smaller than that of the most probable population history. Since we are not aware of any analytical expression for the likelihood function under this complex model, we will resort to Monte Carlo simulations. We demonstrate the applicability of our approach by analyzing the demographic history of three human populations, namely the Basques, the Nuu-Chah-Nulth, and the Biaka pygmies.


*  THEORETICAL BACKGROUND
*TOP
*ABSTRACT
*THEORETICAL BACKGROUND
*APPLICATION TO SEQUENCE DATA
*DISCUSSION
*LITERATURE CITED

The coalescent (KINGMAN 1982A Down, KINGMAN 1982B Down, KINGMAN 1982C Down) is an efficient and elegant way to describe the ancestral relationship of a sample of homologous DNA sequences in terms of probabilities. While some coalescent models are analytically tractable, a major feature of the coalescent is that it gives an exact framework to base the simulation on (HUDSON 1991 Down; DONNELLY and TAVARE 1995 Down). To generate a set of simulated DNA sequences we have to model two processes: First, the reproduction process (constant size or expansion model, panmictic or substructured population, etc.) determines shape and branch lengths of a sample genealogy; second, the model of the mutation process (infinitely or finitely many sites model, uniform or site-dependent mutation rate) specifies the way mutations are distributed along the sequences in the genealogy. The impact of the various assumptions of different reproduction models on genetic diversity measures, such as pairwise difference distributions, mean pairwise difference, or number of variable positions in a sample, was investigated by several authors (TAJIMA 1989B Down; SLATKIN and HUDSON 1991 Down; ROGERS and HARPENDING 1992 Down; MARJORAM and DONNELLY 1993 Down). Most of this theoretical work and therefore most of the data analysis has been done under the infinitely many sites model assumption (WATTERSON 1975 Down). However, most HVRI sequence samples from different populations are not consistent with this model, which does not allow for any recurrent or parallel mutation. It is argued that this mutation process model serves as a useful approximation (ROGERS 1992 Down). Others claim that their method is "insensitive to the mutational process" (ROGERS et al. 1996 Down). On the other hand, reports that describe the substantial effects of different mutation processes on various genetic diversity measures abound (LUNDSTROM et al. 1992A Down, LUNDSTROM et al. 1992B Down; BERTORELLE and SLATKIN 1995 Down; ARIS-BROSOU and EXCOFFIER 1996 Down; TAJIMA 1996 Down). In agreement with the latter results we put forward that the specification of the model of the mutation process is crucial for the inference of the parameters of a population history. Consequently, we study the mutation process of a segment of the HVRI of human mtDNA.

The mutation process of HVRI sequences:
Since DNA sequence evolution generally does not adhere to the infinitely many sites model, it is essential to investigate the process of sequence evolution in greater detail. We suggest decoupling the estimation of mutation parameters from the analysis of a specific population sample. This approach requires a reasonably large number of sequences from a widespread distribution of different populations. To model the mutation process we assume that evolution at a given site follows a time-continuous, time-homogeneous Markov chain. The most general model we allow for is the so-called Tamura-Nei model (1993). This model is most conveniently summarized in the rate matrix R (Scheme 1), where the omitted diagonal elements are set such that the elements in each row add up to zero. The entry ri,j (i,j = A,C,G,T, i != j)

describes the substitution rate of nucleotide i to nucleotide j. The model comprises parameters for the base frequencies {pi}A, {pi}C, {pi}G, {pi}T, a transition/transversion parameter {kappa}, and a pyrimidine/purine transition parameter {xi}. Note that other models are special cases of the Tamura-Nei model. If {xi} = 1, the Hasegawa Kishino Yano (HKY) model (HASEGAWA et al. 1985 Down) is defined. If we, moreover, assume homogeneous base frequencies, the HKY model reduces to KIMURA's two-parameter model (KIMURA 1980 Down). Finally, letting {kappa} = 0.5, the Jukes-Cantor model is obtained (JUKES and CANTOR 1969 Down). To take rate heterogeneity into account, we assume that each site evolves according to the same rate matrix R, but that the actual rate of a given site follows a {gamma}-distribution

with mean 1 and variance 1/{alpha} (UZZEL and CORBIN 1971 Down; WAKELEY 1993 Down).

For a specific model the corresponding mutation parameters are estimated by employing a phylogenetic approach. That is, a random sample of lineages is drawn from the collection of sequences. Subsequently, a maximum likelihood tree using the PUZZLE program is reconstructed and the mutation parameters are estimated from the tree (STRIMMER and VON HAESELER 1996 Down). To ensure that the estimates are not affected by the sample and by the population dynamics, the entire procedure is repeated several times for different random samples of lineages. This phylogenetic approach produces parameter estimates of the mutation process that are virtually independent of the demographic history.

However, as outlined above, we still have the choice between a variety of complex models. To find out the most appropriate model of evolution for HVRI sequences, we employed GOLDMAN'S suggestion (1993) of a likelihood ratio test (e.g., COX 1961 Down, COX 1962 Down). That is, for a sample of lineages a maximum likelihood PUZZLE tree is estimated assuming a HKY model of sequence evolution, say. Let L0 be the corresponding maximum likelihood value. Then, assuming a Tamura-Nei model the maximum likelihood tree with likelihood L1 is estimated. Unfortunately, the usual {chi}2 approximation for the distribution of {Delta} = -2 ln() under H0 cannot be used in the phylogenetic context (GOLDMAN 1993 Down). Therefore, {Delta} is compared to the empirical distribution, which is obtained as follows: Sequences were generated under hypothesis H0 (e.g., the estimated ML tree under HKY), using the program Seq-Gen (RAMBAUT and GRASSLY 1997 Down). For each simulated data set L0/, L1 and {Delta} are computed as described for the true data. H0 are computed as described for the true data. H0 is rejected if the observed {Delta}-value of the data falls in the upper 5% tail of the empirical distribution. Usually, 100 sets of sequences are generated. If H0 is rejected, it is assumed that the parameters of the evolutionary model from H1 are more appropriate for describing sequence evolution.

Starting with the most simple model (e.g., JUKES-CANTOR 1969) one can gradually increase the complexity of the models until H0 is no longer rejected or until we cannot increase the complexity of the model (in our case the Tamura-Nei model with {gamma}-distributed rates is the most complex model).

The entire procedure was applied to estimate the best evolutionary model for the 360-bp region [corresponding to positions 16024 through 16383 (ANDERSON et al. 1981 Down)] of HVRI from human mtDNA, from which 2298 different lineages are collected (HANDT et al. 1998 Down). In the HVRI region a pronounced transition/transversion ratio as well as rate heterogeneity is observed (KOCHER and WILSON 1991 Down; HASEGAWA et al. 1993 Down; WAKELEY 1993 Down). Thus, the infinitely many sites model is violated.

To pick the most appropriate model of sequence evolution, 10 x 25 random lineages were samples from the collection (HANDT et al. 1998 Down), and for each sample the test procedure was performed. For all 10 repeats the Tamura-Nei model with {gamma}-distributed rates was the best model as indicated by the likelihood of the tree. All simpler models were rejected on the 5% level, except for the HKY model with {gamma}-distributed rates, which was rejected only in 1 out of the 10 cases. Since the Tamura-Nei model with {gamma}-distributed rates always provided a higher likelihood, this model is assumed in the following. Finally, the model parameters (cf. Scheme 1) were estimated from 50 sets of 50 lineages that were randomly chosen from the HVRI sequences collection (HANDT et al. 1998 Down). For each set, parameters were estimated from the maximum likelihood PUZZLE tree. Table 1 shows the results. Our estimates are in good agreement with those from similar studies, even though direct comparisons are complicated by the different regions studied. For example, we obtained one {alpha} of 0.26 and other estimates ranging from 0.11 to 0.47 (TAMURA and NEI 1993 Down; WAKELEY 1993 Down). Similarly, we inferred a value of 15.5 for the transition/transversion ratio {kappa}, which corresponds with other observations (HORAI and HAYASAKA 1990 Down; TAMURA and NEI 1993 Down). For the following analysis of population genetics data we used the Tamura-Nei model with {gamma}-distributed rates and the parameter values from column three of Table 1.


 
View this table:
In this window
In a new window

 
Table 1. Estimated parameters of the Tamura-Nei model with {Gamma} -distributed rates for HVRI sequences of human mtDNA

A class of population history models:
The basic model of population history is the Wright-Fisher model, assuming a panmictic population of constant size with nonoverlapping generations (FISHER 1930 Down; WRIGHT 1931 Down). Several other models, including population structure (with or without migration) or models with variable population sizes, have been proposed (HUDSON 1991 Down; SLATKIN and HUDSON 1991 Down; ROGERS and HARPENDING 1992 Down).

We will investigate a class of expansion model where a Wright-Fisher population at equilibrium started to grow or decrease exponentially at a certain time {tau} in the past to the current population size. This model is defined by three parameters: {theta} = 2N0µ is the population parameter at equilibrium in the past. Here, N0 denotes the initial effective population size and µ is the mutation rate per sequence and generation. {tau} is the time when the population size started to change, where {tau} is measured in units of 1/µ. Finally, {rho} defines the ratio of current to initial population size. We get the basic model of constant size as a special case of the expansion model by setting {rho} equal to one. Then the time parameter {tau} remains unspecified. Figure 1 sketches the three scenarios.



View larger version (5K):
In this window
In a new window
Download PPT slide
 
Figure 1. Three types of population histories included in the class of models considered. The left part shows an exponential increase in population size that started at time {tau} in the past; the right part shows an exponential decrease. In the middle a model of constant population size is indicated. The x-axis represents time from present backward and the y-axis scaled population size.

Inference of population history parameters:
In the following, we assume that the mutation process is known. Thus, the evolution of a sample of sequences is fully characterized by the three parameters {theta}, {tau}, and {rho}. To estimate these three parameters a likelihood method is reasonable. Instead of conditioning the likelihood of the parameter set on the full sequence data as the elegant approach of GRIFFITHS and TAVARE 1994A Down, GRIFFITHS and TAVARE 1994B Down does, we summarize the data by the mean pairwise difference K and the number of variable positions S of the sample. While we do not know yet to what extent this restriction affects the estimates of ({theta},{tau},{rho}), there are several reasons to proceed in that manner: First, we are more interested in detecting major demographic signals in the data than in giving precise joint estimates (together with the inherent large variances of these estimates). Second, by conditioning on both K and S, we make implicit use of the relationship between the mean pairwise difference and the number of variable positions, which proved useful in TAJIMA's D-statistic (1989a). Assuming the infinitely many sites model TAJIMA 1989B Down showed that changes in population size have different effects on K and S. Third, the reduction of the data simplifies and speeds up our method of evaluating the likelihood via simulations.

The set of parameters that maximizes the likelihood lik({theta},{tau},{rho}|k,s) defines the most probable population history within the class of models described in the previous paragraph. To rate the plausibility of a parameter set ({theta}0,{tau}0,{rho}0) we use the likelihood ratio lik({theta}0,{tau}0,{rho}0|k,s)/LA, where LA is the maximum likelihood value within the considered class of population histories.

The remaining question is how to compute the likelihoods for a given data set. To the best of our knowledge analytical formulae for the likelihood functions based on a complex model of sequence evolution are not available, even in the case of a population of constant size. Therefore, we will use computer simulations to determine the likelihood value of the parameter sets.

Simulations:
Coalescent theory provides an efficient way to simulate the evolution of a sample of sequences under various population histories. The coalescent is a stochastic process that counts the number of distinct ancestors in a genealogy of a sample of size n as one moves back in time. If the current population size N is large and time is measured in N generations, the coalescent provides a good approximation to the ancestral process of the sample (KINGMAN 1982B Down, KINGMAN 1982C Down; TAVARE 1984 Down; DONNELLY and TAVARE 1995 Down). The coalescent starts with initial state n. Each coalescent event of two ancestors in the sample reduces the number of distinct ancestors by exactly one; i.e., the state value decreases by one. After n - 1 coalescent events the process ends in state one and the most recent common ancestor of the sample is found. The continuous time approximation of the coalescent ensures that the jump size of the process is exactly one; i.e., only one coalescent even can occur at one time. The times Sj, j = n, ... , 2, when the process changes state (coalescent times) are random quantities that depend on the population history. If the sampled sequences stem from a Wright-Fisher population of constant size, the times Tj = Sj - Sj+1 (Sn+1 = 0) follow an exponential distribution with mean 2/j(j - 1), j = n, ... , 2.

If the population size varies deterministically, the coalescent approximation still holds, but the coalescent times must be rescaled appropriately (GRIFFITHS and TAVARE 1994B Down; DONNELLY and TAVARE 1995 Down): Let Nv(x) denote the population size at time x in the past. The relative size function v(x) for the class of models under consideration (Figure 1) is given by

(1)

Furthermore, we define the population size intensity function {Lambda}:

(2)

When population size varies, the sequence Sn, Sn-1, ... , S2 is still Markovian. Although the times Tj are not independent anymore, it is straightforward to simulate the coalescent times (GRIFFITHS and TAVARE 1994B Down):

For j = n, ... , 2 the times Sj are determined by recursively solving the equation

(3)
where Sn+1 is set equal to zero and Ej is an exponential random variable with mean 2/j(j - 1). When the n - 1 coalescent times are determined, the genealogy of the sample is produced by randomly merging two ancestral lineages at each coalescent time.

Sequences according to the genealogy are then produced as follows: We generate an ancestral sequence at the root of the genealogy according to the stationary base composition in Table 1. This sequence then evolves along the genealogy under the Tamura-Nei model with rate heterogeneity. Thus, one simulated data set is produced.

The approximate likelihood value lik({theta},{tau},{rho}|k,s) for a real data set with mean pairwise difference k and s variable positions is based on j = 1, ... , B simulations according to the specified parameters ({theta}, {tau}, {rho}). For a simulated data set j we computed the mean pairwise difference kj and the number of variable positions sj. Because the mean pairwise difference is virtually a continuous variable, we introduce the indicator variable

(4)
where {delta} is a small positive number that defines an interval with k as the center. Thus, the indicator variable equals one for simulations that are reasonably close to our summary statistics observed in the data. The choice of {delta} is a compromise between the efficiency of the simulations and the precision of the approximation. In our investigation {delta} = 0.2 proved to be useful when B = 25,000. The likelihood value of a parameter set is then approximated by

(5)

We determine the likelihood values on a grid of parameter combinations to approximate the maximum likelihood value under the most general hypothesis and the specific subset of parameters defined by the null hypothesis.


*  APPLICATION TO SEQUENCE DATA
*TOP
*ABSTRACT
*THEORETICAL BACKGROUND
*APPLICATION TO SEQUENCE DATA
*DISCUSSION
*LITERATURE CITED

We demonstrate the features of our method by analyzing published data sets from three populations: The Basques are linguistic isolates living on both sides of the border between France and Spain (BERTRANPETIT et al. 1995 Down); the Nuu-Chah-Nulth (NCN) are a native-American tribe peopling the western coast of Vancouver Island (Canada) and the Olympic Peninsula of Washington state (USA) (WARD et al. 1991 Down); and the Biaka pygmies live as hunter-gatherers in the Central African Republic (VIGILANT et al. 1991 Down). Sample sizes, n, mean pairwise sequence differences, k, and numbers of variables positions, s, of these three data sets are given in Table 2.


 
View this table:
In this window
In a new window

 
Table 2. Sample sizes, mean pairwise sequence differences, and numbers of variable positions of the analyzed data sets

Assuming the model of evolution as specified in Table 1, we determined for each population the likelihood values for various parameter combinations according to Equation 5. The number of simulations B was equal to 25,000 and {delta} = 0.2. Thereby, {rho} was set equal to 10z, z being an integer. In this notation z = 0 reflects the case of constant population size over time. If z > 0 the population size is increasing, and if z < 0, it is decreasing. The likelihood values determined for {tau} were a multiple of one-fourth and {theta} a multiple of one-half.

The parameter set that yields the highest likelihood (LpopA ) is regarded as the most probable population history. Since this point estimate does not reflect the uncertainty inherent in the stochastic models, we rate each parameter set ({theta}0,{tau}0,{rho}0) using the likelihood ratio . Figure 2 Figure 3 Figure 4 show graphical representation of these results. Each panel refers to the specified value of {rho}. The abscissa and ordinate represent the parameters {tau} and {theta}, respectively. (Note that if {rho} = 1, {tau} is not defined and the panel is only one dimensional.) The different gray levels reflect the likelihood ratio of the corresponding parameter sets. Dark colors represent high values of the likelihood ratio (see the legend in Figure 2). If the usual {chi}2 theory applies, then plain white color would indicate parameter combinations that are rejected on the 5% level. Conversely, colored boxes belong to a 95% confidence set.



View larger version (58K):
In this window
In a new window
Download PPT slide
 
Figure 2. Result of the likelihood analysis of the Basque data (BERTRAN-PETIT et al. 1995). Dark color represent high values of the likelihood ratio (see legend). The parameter combination {rho} = 100, {tau} = 2.25, and {theta} = 1 yielded the highest likelihood ratio. If the usual {chi}2 theory applies, colored boxes belong to a 95% confidence set.



View larger version (59K):
In this window
In a new window
Download PPT slide
 
Figure 3. Result of the likelihood analysis of the Nuu-Chah-Nulth data (WARD et al. 1991 Down). The parameter combination {rho} = 1, {theta} = 7.5 yielded the highest likelihood ratio. For further description see legend of Figure 2.



View larger version (76K):
In this window
In a new window
Download PPT slide
 
Figure 4. Result of the likelihood analysis of the Biaka pygmie data (VIGILANT et al. 1991 Down). The parameter combination {rho} = 0.1, {tau} = 1, and {theta} = 18 yielded the highest likelihood ratio. For further description see the legend of Figure 2.

Figure 2 represents the analysis of the Basques. Parameters sets that favor a recent expansion (small {tau}) from a small population (small {theta}) receive high support from the data. Models of population decrease or constant population size get virtually no support. The corresponding panels are plain white and therefore omitted from Figure 2. Thus, we conclude that this data set is consistent with a model exponential growth. The analysis further suggests that the Basques stem from a rather small founding population (as measured by mitochondrial variability) that started to increase in size one to four mutational time units ago.

The picture of the Nuu-Chah-Nulth data set (Figure 3) shows a different result. Neither models of population decrease nor increase were rejected. The most probable parameter set is that of constant size and {theta} = 7.5. Figure 3 indicates also high support for a very recent expansion ({tau} <= 0.5). However, this result suggests that further analysis of this data set is reasonable under the constant size assumption.

The analysis of the Biaka pygmies reveals an interesting result (Figure 4). Models that assume a moderate decrease in population size ({rho} = 0.1) obtain a high support. Also, the long-term constant size model is conceivable. The decrease could be the result of an expansion of other populations into the habitat of the Biaka pygmies, pushing them back to more restricted areas.


*  DISCUSSION
*TOP
*ABSTRACT
*THEORETICAL BACKGROUND
*APPLICATION TO SEQUENCE DATA
*DISCUSSION
*LITERATURE CITED

Effects of variable population size on genetic diversity measures are confounded with the effects of rate variation in the sequences (TAJIMA 1989B Down; SLATKIN and HUDSON 1991 Down; LUNDSTROM et al. 1992A Down; ROGERS and HARPENDING 1992 Down; MARJORAM and DONNELLY 1993 Down; BERTORELLE and SLATKIN 1995 Down; ARIS-BROSOU and EXCOFFIER 1996 Down; TAJIMA 1996 Down). Here, we try to resolve the interaction by studying first the appropriate model of sequence evolution and subsequently employing this model in a coalescent framework.

The large amount of available HVRI sequence data (HANDT et al. 1998 Down) enabled us to infer this model of sequence evolution for this particular region of the human genome. This was done independently from population dynamics by using a likelihood-based tree reconstruction method (STRIMMER and VON HAESELER 1996 Down). Fixing the model of sequence evolution and thereby separating the dynamics of the mutation process from the dynamics of population history makes inference of population history parameters possible. Note that we do not include the uncertainty inherent in estimating the mutation process parameters. This may lead to underestimation of the variability in the estimates of population history parameters.

We studied a class of population histories that allows for positive or negative growth of a population starting from a population at equilibrium in the past. Even though these models are more general than the frequently used constant-size assumption, it is obvious that they are simplistic and do not describe the evolution of a population in its full complexity. Nevertheless, the application section shows that this approach detects signals of major changes in population size. Moreover, we rated demographic scenarios by their likelihood ratio. This yields a set of plausible parameter combinations consistent with the data. On the basis of these parameters, further questions, such as the estimation of the time to the most recent common ancestor, can be addressed (TAJIMA 1983 Down; GRIFFITHS and TAVARE 1994C Down; FU 1996B Down; TAVARE et al. 1997 Down).

The proposed likelihood approach makes implicit use of the relationship of the mean pairwise difference and the number of variable positions. Therefore, it is far more applicable than "mismatch analysis" (ROGERS 1995 Down; ROGERS et al. 1996 Down), which is restricted to data sets that show smooth and unimodal pairwise difference distributions. While the restriction of the data to two summary statistics has its advantage in speeding up the evaluation of likelihood values via simulations and concurrently retaining information about population history, the drawback of the data summary is a loss in the precision of parameter estimates. We note that inclusion of additional information, like patterns of segregating sites (FU 1996A Down; WAKELEY and HEY 1997 Down), should result in sharper estimates. Conditioning the likelihood on the entire information in the data as proposed by GRIFFITHS and TAVARE 1994A Down, GRIFFITHS and TAVARE 1994B Down would reduce the variability of the estimates even further. However, the Markov Chain, Monte Carlo-based calculation of the likelihood is computationally intensive when applied to our complex model of sequence evolution.

In principle, the likelihood ratio approach is applicable to other classes of population history. The extension to demographic scenarios that are different from the exponential growth model is straightforward. This could be done by appropriately altering the rescaling of the coalescent times (Equation 1 and Equation 2). Also, simple subpopulation models with migration (WAKELEY and HEY 1997 Down) can be explored by conditioning on suitable summary statistics.

With the prospective large amount of available DNA sequence data, the refinement of our understanding of the evolution of different parts of our genome, and the development of applicative methods of analysis, we will be able to improve our knowledge of the history of our species in the near future.


*  ACKNOWLEDGMENTS

The authors are grateful to KORBINIAN STRIMMER for helpful comments on the use of PUZZLE and SONJA MEYER for providing the HVRI data collection and discussing the estimation of mutation parameters. We thank ELLEN BAAKE and SVANTE PÄÄBO for fruitful discussions and improving the manuscript. Critical comments from two anonymous referees and the editor are also gratefully acknowledged. This work was supported by a grant from the Deutsche Forschungsgemeinschaft (DGF) to A.v.H. The simulation program described here is available upon request (arndt@zi.biologie.uni-muenchen.de, gweiss@zi.biologie.uni-muenchen.de).

Manuscript received July 28, 1997; Accepted for publication March 23, 1998.


*  LITERATURE CITED
*TOP
*ABSTRACT
*THEORETICAL BACKGROUND
*APPLICATION TO SEQUENCE DATA
*DISCUSSION
*LITERATURE CITED

ANDERSON, S., A. T. BANKIER, B. G. BARELL, M. H. L. DE BRUIJN, and A. R. COULSON et al., 1981  Sequence and organization of the human mitochondrial genome. Nature 290:457-465[Medline].

ARIS-BROSOU, S. and L. EXCOFFIER, 1996  The impact of population expansion and mutation rate heterogeneity on DNA sequence polymorphism. Mol. Biol. Evol. 13:494-504[Abstract].

BANDELT, H. J., P. FORSTER, B. C. SYKES, and M. B. RICHARDS, 1995  Mitochondrial portraits of human populations using median networks. Genetics 141:743-753[Abstract].

BERTORELLE, G. and M. SLATKIN, 1995  The number of segregating sites in expanding human populations, with implications for estimates of demographic parameters. Mol. Biol. Evol. 12:887-892[Abstract].

BERTRANPETIT, J., J. SALA, F. CALAFELL, P. A. UNDERHILL, and P. MORAL et al., 1995  Human mitochondrial DNA variation and the origin of Basques. Am. J. Hum. Genet. 59:63-81.

CAVALLI-SFORZA, L. L., P. MENOZZI and A. PIAZZA, 1994 The History and Geography of Human Genes. Princeton University Press, Princeton.

COX, D. R., 1961  Tests of separate families of hypotheses. Proc. 4th(Berkeley Symp. (Univ. California Press) 1):105-123.

COX, D. R., 1962  Further results on tests of separate families of hypotheses. J. R. Stat. Soc. B 24:406-424.

DONNELLY, P. and S. TAVARÉ, 1995  Coalescents and genealogical structure under neutrality. Annu. Rev. Genet. 29:401-421[Medline].

FISHER, R. A., 1930 The Genetical Theory of Natural Selection. Clarendon, Oxford.

FU, Y.-X., 1996a  New statistical tests of neutrality for DNA samples from a population. Genetics 143:557-570[Abstract].

FU, Y.-X., 1996b  Estimating the age of the common ancestor of a DNA sample using the number of segregating sites. Genetics 144:829-838[Abstract].

GOLDMAN, N., 1993  Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182-198[Medline].

GRIFFITHS, R. C. and S. TAVARÉ, 1994a  Simulating probability distributions in the coalescent. Theor. Popul. Biol. 46:131-159.

GRIFFITHS, R. C. and S. TAVARÉ, 1994b  Sampling theory for neutral alleles in a varying environment. Phil. Trans. R. Soc. Lond. B 344:403-410[Medline].

GRIFFITHS, R. C. and S. TAVARÉ, 1994c  Ancestral inference in population genetics. Stat. Sci. 9:307-319.

HANDT, O., S. MEYER, and A. VON HAESELER, 1998  Compilation of human mtDNA control region sequences. Nucleic Acids Res. 26:126-130[Abstract/Free Full Text].

HASEGAWA, M., H. KISHINO, and K. YANO, 1985  Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174[Medline].

HASEGAWA, M., A. DI RIENZO, T. D. KOCHER, and A. C. WILSON, 1993  Toward a more accurate time scale for the human mitochondrial DNA tree. J. Mol. Evol. 37:347-354[Medline].

HORAI, S. and K. HAYASAKA, 1990  Intraspecific nucleotide sequence differences in the major noncoding region of human mitochondrial DNA. Am. J. Hum. Genet. 46:828-842[Medline].

HUDSON, R. R., 1991 Gene genealogies and the coalescent process, pp. 1–44 in Oxford Surveys in Evolutionary Biology, edited by D. FUTUYAMA and J. ANTONOVICS. Oxford University Press, Oxford.

JUKES, T. H., and C. R. CANTOR, 1969 Evolution of protein molecules, pp. 21–132 in Mammalian Protein Metabolism, edited by H. N. MUNRO. Academic Press, New York.

KIMURA, M., 1980  A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120[Medline].

KINGMAN, J. F. C., 1982a  The coalescent. Stoch. Proc. Applns. 13:235-248.

KINGMAN, J. F. C., 1982b  On the genealogy of large populations. J. Appl. Probab. 19A:27-43.

KINGMAN, J. F. C., 1982c Exchangeability and the evolution of large populations, pp. 97–112 in Exchangeability in Probability and Statistics, edited by G. KOCH and F. SPIZZICHINO. North-Holland Publishing Company, Amsterdam.

KOCHER, T. D., and A. C. WILSON, 1991 Sequence evolution of mitochondrial DNA in humans and chimpanzees: control region and protein coding regions, pp. 391–413 in Evolution of Life: Fossils, Molecules and Culture, edited by S. OSAWA and T. HONIO. Springer Verlag, Tokyo.

KOGELNIK, A. M., M. T. LOTT, M. D. BROWN, S. B. NAVATHE, and D. C. WALLACE, 1997  MITOMAP: an update on the status of the human mitochondrial genome database. Nucleic Acids Res. 25:196-199[Abstract/Free Full Text].

LUNDSTROM, R., S. TAVARÉ, and R. H. WARD, 1992a  Modelling the evolution of the human mitochondrial genome. Math. Biosci. 112:319-335[Medline].

LUNDSTROM, R., S. TAVARÉ, and R. H. WARD, 1992b  Estimating substitution rates from molecular data using the coalescent. Proc. Natl. Acad. Sci. 89:5961-5965[Abstract/Free Full Text].

MARJORAM, P. and P. DONNELLY, 1993  Pairwise comparisons of mitochondrial DNA sequences in subdivided populations and implications for early human evolution. Genetics 136:673-683[Abstract].

RAMBAUT, A. and N. C. GRASSLY, 1997  Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235-238[Abstract/Free Full Text].

ROGERS, A. R., 1992  Error introduced by the infinite sites model. Mol. Biol. Evol. 9:1181-1184[Medline].

ROGERS, A. R., 1995  Genetic evidence for a pleistocene population explosion. Evolution 49:608-615.

ROGERS, A. R., A. E. FRALEY, M. J. BAMSHAD, W. S. WATKINS, and L. B. JORDE, 1996  Mitochondrial mismatch analysis is insensitive to the mutational process. Mol. Biol. Evol. 13:895-902[Abstract].

ROGERS, A. R. and H. HARPENDING, 1992  Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9:552-569[Abstract].

SLATKIN, M. and R. R. HUDSON, 1991  Pairwise comparison of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics 129:555-562[Abstract].

STRIMMER, K. and A. VON HAESELER, 1996  Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964-969.

TAJIMA, F., 1983  Evolutionary relationships of DNA sequences in finite populations. Genetics 105:437-460[Abstract/Free Full Text].

TAJIMA, F., 1989a  Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585-595[Abstract/Free Full Text].

TAJIMA, F., 1989b  The effect of change in population size on DNA polymorphism. Genetics 123:597-601[Abstract/Free Full Text].

TAJIMA, F., 1996  The amount of DNA polymorphism maintained in a finite population when the neutral mutation rate varies among sites. Genetics 143:1457-1465[Abstract].

TAMURA, K. and M. NEI, 1993  Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10:512-526[Abstract].

TAVARÉ, S., 1984  Line-of-descent and genealogical processes, and their applications in population genetics models. Theor. Pop. Biol. 26:119-164[Medline].

TAVARÉ, S., D. J. BALDING, R. C. GRIFFITHS, and P. DONNELLY, 1997  Inferring coalescence times from DNA sequence data. Genetics 145:505-518[Abstract].

UZZEL, T. and K. W. CORBIN, 1971  Fitting discrete probability distributions to evolutionary events. Science 172:1089-1096[Abstract/Free Full Text].

VIGILANT, L., M. STONEKING, H. HARPENDING, K. HAWKES, and A. C. WILSON, 1991  African populations and the evolution of mitochondrial DNA. Science 253:1503-1507[Abstract/Free Full Text].

WAKELEY, J., 1993  Substitution rate variation among sites in hypervariable region 1 of human mitochondrial DNA. J. Mol. Evol. 37:613-623[Medline].

WAKELEY, J. and J. HEY, 1997  Estimating ancestral population parameters. Genetics 145:847-855[Abstract].

WARD, R. H., B. L. FRAZIER, K. DEW-JAGER, and S. PÄÄBO, 1991  Extensive mitochondrial diversity within a single Amerindian tribe. Nat. Acad. Sci. 88:8720-8724.

WATTERSON, G. A., 1975  On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7:256-276[Medline].

WRIGHT, S., 1931  Evolution in Mendelian populations. Genetics 16:97-159[Free Full Text].