Abstract
This study provides a comprehensive survey of the complex pattern of nucleotide substitution in the control region of human mtDNA, which is of central importance to the studies of human evolution. A total of 1229 different hypervariable region I (HVRI) and 385 different hypervariable region II (HVRII) sequences were analyzed using a complex substitution model. Moreover, we suggest a new method to assign relative rates to each site in the sequence. Estimates are based on maximumlikelihood methods applied to randomly selected subsets of sequences. Our results indicate that the rate of substitution in HVRI is approximately twice as high as in HVRII and that this difference is mainly due to a higher frequency of pyrimidine transitions in HVRI. However, rate heterogeneity is more pronounced in HVRII.
SEQUENCES from the noncoding control region of mitochondrial DNA are widely used to address questions concerning genetic variation within species (Cannet al. 1987; Garner and Ryder 1996; Watsonet al. 1996; Goldberg and Ruvolo 1997). Their high evolutionary rate and their maternal inheritance make them exceptionally suitable for analyzing population history. Hypervariable regions I and II (HVRI, HVRII) in the control region of human mtDNA have been studied extensively to infer, for example, aspects of historical biogeography and the time since the most recent common ancestor of human mtDNAs (Cannet al. 1987; Hasegawa and Horai 1991; Vigilantet al. 1991; Ward et al. 1991, 1993; Pesoleet al. 1992; Stonekinget al. 1992; Watsonet al. 1996; Kringset al. 1997). Although >4000 HVRI and 900 HVRII sequences of humans from all over the world have been determined, knowledge of the substitution pattern is still far from complete. However, it is now clear that human control region sequences evolve according to a complex pattern that makes analysis difficult. For example, base composition is not uniform, transitions occur in greater frequencies than transversions, the number of pyrimidine transitions in the Lstrand exceeds the number of purine transitions, and substitution rates vary among sites (Aquadro and Greenberg 1983; Kocher and Wilson 1991; Tamura and Nei 1993; Wakeley 1993).
Probably the most enigmatic feature of HVR sequence evolution is the variation of rates among sites. Until recently, the importance of accounting for rate heterogeneity to obtain unbiased estimates of the transitiontransversion ratio, unbiased dating of speciation events, and correct reconstruction of phylogenies has not been recognized (Wakeley 1993, 1994, 1996; Yang 1996). Especially in a population genetics context, a model that describes the substitution process is explicitly necessary. Choosing an inappropriate model may lead to misinterpretation of the data. For example, mutational hot spots (i.e., positions at which substitutions accumulate predominantly) could mimic population expansion (Lundstrom et al. 1992a; Bertorelle and Slatkin 1995; ArisBrosou and Excoffier 1996). Ignoring the existence of heterogeneous mutation rates may yield biased estimates of measures of genetic diversity and/or parameters of population history (Lundstrom et al. 1992a; Bertorelle and Slatkin 1995; ArisBrosou and Excoffier 1996; Deng and Fu 1996; Tajima 1996; Misawa and Tajima 1997). For example, the estimation of the time to the most recent common ancestor strongly depends on both the population genetics model and the estimated parameters therein. Therefore, population genetics inference benefits from a better understanding of the substitution process of a genomic region.
On the other hand, the findings of pedigree analyses have generated much confusion about the frequency of mutations that affect the HVR (Howellet al. 1996; Parsonset al. 1997). From these studies, mutation rate estimates have been obtained that are ∼20fold higher than those derived from phylogenetic studies. The question of whether mutational hot spots can account for this enormous discrepancy in mutation rates has become a hotly debated issue (Pääbo 1996; Jazinet al. 1998; Parsons and Holland 1998). To gauge the contribution of hot spots to the high rate estimate, we need to know the rate for each sequence position. Beyond being of use for the interpretation of the apparently conflicting mutation rate estimates, knowledge of sitespecific rates is of great benefit to sequence analyses in general, as it allows the refinement of phylogenetic models and a more precise interpretation of population sequence data. This study provides a comprehensive survey of the nucleotide substitution pattern of HVRI and HVRII with special attention to variation among sites.
DATA
We used a publicly available collection of aligned human mitochondrial controlregion sequences that comprised 4079 HVRI and 969 HVRII sequences of individuals from all over the world (Handtet al. 1998). It can be retrieved via the worldwide web at URL http://www.eva.mpg.de/hvrbase/. From this data collection we extracted all sequences that were sequenced without ambiguities in the range from 16024 to 16382 in HVRI and from 57 to 371 in HVRII according to the numbering of Anderson et al. (1981). These sequence parts were chosen because they were the largest continuously determined subregions in the majority of sequences in the collection. The alignments of the two regions contained only a few gaps. The HVRI part enclosed seven gaps of varying length at positions 16104.1, 16169.1, 16174.1, 16183.1–16183.4, 16227.1, 16259.1, and 16366.1, whereas five gaps at positions 65.1, 190.1, 294.1, 302.1–302.4, and 310.1–310.2 were observed in the alignment of the HVRII sequences. Positions in the alignment that show a gap were excluded from the analysis. Because we pursued a phylogenetic approach, we reduced the set of sequences such that each sequence type was represented only once in our set. This led to a final data set comprising 1229 different HVRI and 385 different HVRII sequences.
MODEL AND METHODS
Model of sequence evolution: To quantify the substitution process and rate heterogeneity among sites, we used the TamuraNei (Tamura and Nei 1993) model assuming Γdistributed rates. It has been suggested (Weiss and von Haeseler 1998) that this model is the most appropriate to describe the evolutionary process of human HVR sequences. The TamuraNei (1993) model with Γdistributed rates includes the parameters π_{A}, π_{C}, π_{G}, π_{T}, κ, τ, and α that have to be estimated from the data. π= (π_{A}, π_{C}, π_{G}, π_{T}) is the equilibrium distribution of base frequencies, the parameter κ adjusts for the transitiontransversion ratio, and τ describes the ratio of pyrimidine transitions to purine transitions. The shape parameter of the Γdistribution α is inversely related to the extent of rate heterogeneity among sites.
Parameter estimation: The equilibrium distribution of base frequencies, π, was estimated from the data by averaging the base composition of all sequences. This estimate should be very similar to the maximumlikelihood estimate (Goldmann 1993). Estimation of the transitiontransversion parameter κ, the pyrimidinepurine transition parameter τ, and the rateheterogeneity parameter α was done using a phylogenetic approach and a subsampling procedure. More precisely, we drew a random sample of different sequences from the data set containing either HVRI or HVRII sequences. From this random sample a tree was constructed and the parameters κ, α, and τ were estimated from the tree using approximate maximum likelihood and discrete Γdistribution with eight categories as implemented in the PUZZLE program (Strimmer and von Haeseler 1996). From biological considerations, one should expect a continuum of rates among sites (Uzzel and Corbin 1971; Kocher and Wilson 1991). However, maximumlikelihood calculations with the continuous Γdistribution involve intensive computations and are feasible only for data sets up to six sequences (Yang 1996). We repeated the entire estimation procedure 150 times for different random samples of a given size to obtain parameter estimates that are not affected by the sample but are representative for all HVRI or HVRII sequences, respectively. Subsequently, we averaged the values of the parameter estimates from the subsamples. To investigate the effect of sample size on the estimates
Sitespecific rates: In the following we assumed that the parameters κ, τ, and α are known or estimates are given, for example, from the approach described in the previous section. To obtain estimates of the sitespecific rates, we used a discretized Γdistribution (Yang 1993). Here, the range of possible rates was divided into eight categories such that each category was equiprobable under the Γdistribution. Within each category we computed the median rate. Thus, we assumed that each site evolved according to one of these eight rates. The following procedure was applied to estimate sitespecific rates: For a random sample of 50 different sequences a maximumlikelihood tree was computed. On the basis of this tree, the likelihood of a specific site was computed for each of the eight rates using PUZZLE (Strimmer and von Haeseler 1996). The rate that yielded the highest likelihood value was assigned to this site. From a Bayesian point of view this corresponds to choosing the rate with the maximal posterior probability, because the way of discretizing the Γdistribution resulted in a uniform prior distribution. Thus, for each site in the alignment of 50 sequences, a rate was computed. The entire procedure was repeated 50 times for different random samples. Therefore we obtained for each site 50 estimates of its specific rate. The relative rate for each site was simply the average value.
RESULTS
Figure 1 displays the averages of the transitiontransversion parameter κ, the pyrimidinepurine transition parameter τ, and the rateheterogeneity parameter α for HVRI and HVRII as a function of the number of different sequences sampled (sample size). With increasing sample size all estimates decrease and show few changes for samples of size 60 or more; this is also reflected in the decrease of the variance of the sample mean. The observation that 60 or more sequences are needed to reduce the bias in parameter estimates is due to the fact that the sequences are very similar, and so there is little information regarding the parameters. For more divergent sequences we do not expect such a strong relationship between subsample size and bias.
In HVRI, the estimate of the transitiontransversion parameter κ decreases from 20.0 to 15.7 with increasing sample size, whereas κ for HVRII varies from 7.1 to 7.6, more or less independently of the sample size. A similar picture emerges for the estimation of the pyrimidinepurine transition parameter τ. In HVRI, the estimated value of τ decreases from 2.5 to 1.75, and the estimated value of τ in HVRII varies between 1.07 and 1.18.
Because larger subsamples reflect the transitiontransversion ratio and the pyrimidinepurine transition ratio with smaller standard derivation than smaller subsamples, we suppose that the estimated values of κ and τ for the larger subsamples are closer to the true values. Small samples may contain none or too few transversions, and thus κ is overestimated. If κ were not inferred correctly, then τ could not be inferred correctly either.
To estimate the rateheterogeneity parameter, α samples of size 10 were too small, whereas, for samples of size 20 and larger, estimates of α are close to 0.26 and 0.13 in HVRI and HVRII, respectively. Table 1 summarizes the averaged values for samples of size 80 and 150 repeats. The estimated transitiontransversion parameter κ in HVRI is approximately twice as high as the corresponding HVRII value. Accordingly, the estimate of the pyrimidinepurine transition parameter τ from HVRI is higher than the estimate of τ from HVRII. The smaller value of the estimated rateheterogeneity parameter α in HVRII indicates that the mutation rate of this region is more heterogeneous than in HVRI. Calculation of the expected number of substitutions from the TamuraNei (1993) rate matrix reveals that the expected number of transversions is approximately the same for both HVRs. The two regions differ mainly in the number of pyrimidine transitions, leading to the higher pyrimidinepurine transition ratio and higher transitiontransversion ratio in HVRI. The total number of substitutions is twice as high in HVRI as in HVRII (8.14:4.08). The parameter estimates in Table 1 provide a comprehensive model of HVRI and HVRII evolution. On the basis of this model, we estimated sitespecific rates, summarized in Figure 2. As expected from the estimate of rate heterogeneity in HVRII, sites evolve either with a small relative rate or with a high rate. The fastest positions in HVRII evolve more than six times faster than the average substitution rate of this region. In HVRI the fastest positions evolve with relative rates of 4.8. The proportion of sites in HVRII with relative rates <0.001 (virtually not variable) is 0.54. This is almost twice as high as the value of 0.28 that we found for HVRI.
DISCUSSION
Applying maximumlikelihood methods to randomly selected subsets of different sequences, we estimated from the subsets the parameters of the TamuraNei (1993) model with rate heterogeneity. The importance of simultaneous parameter estimation, especially if rate heterogeneity exists among sites, has become increasingly clear (Wakeley 1994, 1996; Swoffordet al. 1995). Our estimates agree by and large with results from other studies, even though direct comparison is complicated by the different parts of the HVR studied. Moreover, assessing the significance of the differences to published estimates is demanding, because this requires determination of the variances of our estimation procedure. In principle, this could be done by one of the usual resampling techniques (Efron and Tibshirani 1993). However, the amount of iterations necessary is computationally very expensive. Hasegawa and Horai (1991) analyzed a smaller sample composed of three data sets that cover different positions of the control region. They estimated transitiontransversion ratios between 14.5 and 27.0, depending on the data set. Other estimates of the transitiontransversion ratio range from 12 to 37 (Horai and Hayasaka 1990; Hasegawa and Horai 1991; Kocher and Wilson 1991; Vigilantet al. 1991; Pesoleet al. 1992; Tamura and Nei 1993). Tamura and Nei (1993) obtained by parsimony analysis transitiontransversion ratios of 18.8, 12.2, and 15.7 for HVRI, HVRII, and both (HVRI + HVRII), respectively. Our estimated ratios (Table 1) are slightly lower than Tamura and Nei's (1993), even though it is known that estimates of substitution parameter using parsimony are likely to be underestimates. It is possible that in Tamura and Nei's (1993) data set transversions are underrepresented and thus the transitiontransversion ratio is overestimated. For example, Ward et al. (1991), who examined the transitiontransversion ratio from a sample of 28 sequences of HVRI, found no transversions at all. This fits well with our observation that the transitiontransversion parameter for HVRI decreases with increasing sample size. Estimates of the rate heterogeneity parameter α have been reported to be 0.11 for the entire control region (Kocher and Wilson 1991; Tamura and Nei 1993) and 0.47 for HVRI (Wakeley 1993). No separate estimates of α are published for HVRII, but it is known that HVRII has a higher heterogeneity of rates than HVRI (ArisBrosou and Excoffier 1996). Our estimate of 0.26 for α in HVRI is substantially smaller than Wakeley's estimate of 0.47. This is due to the different data set and the bias inherent in the parsimonybased inference of rate heterogeneity (Wakeley 1993).
Up to now, estimates for sitespecific rates have been derived only for HVRI by counting the numbers of substitutions in a most parsimonious tree (Hasegawaet al. 1993; Wakeley 1993). Hasegawa et al. (1993) estimated the rates from only 14 HVRI sequences, studying the whole HVRI region, whereas Wakeley (1993) estimated rates from positions 16130 to 16379 in 322 sequences. Positions that experienced more than five substitutions were classified as fast. Table 2 lists positions that are, according to our approach, at least twice as fast as the average rate in HVRI. The higher our rate estimate, the higher the concordance with positions that have been classified as fast by either Wakeley (1993) and/or Hasegawa et al. (1993). Eight of the nine positions with rates >4 are also fast evolving according to the other studies. Both studies classified five positions as fast, for which we estimated moderate rates between 1 and 2 (16209, 16290, 16291, 16298, and 16304). Position 16234 has been classified as fast by Wakeley (1993) and Hasegawa et al. (1993), for which we found a rate close to 1. However, no position has been classified as fast by either Wakeley (1993) or Hasegawa et al. (1993), for which we found a relative rate of <1.
Our approach, however, detected quite a few positions that are evolving moderately rapidly. The deviations are due to the smaller data sets analyzed by Wakeley (1993) and Hasegawa et al. (1993). Moreover, they used different estimation methods. The parsimony method is known to be biased, because a parsimony reconstruction of the internal nodes of the tree gives the minimum number of changes required at a site, and the amount of rate heterogeneity therefore is underestimated (Wakeley 1993). Our results suggest that maximumlikelihood methods in combination with the subsampling strategy are an approach to detect rapidly evolving sites. Moreover, this approach has the advantage of rates estimated relative to the mean substitution rate. These rates are easier to interpret than actual numbers of substitutions and furthermore simplify comparison of rates among sites. According to our results, the proportion of almost invariable sites, 54% in HVRI and 28% in HVRII, is considerably smaller than the Hasegawa and Horai (1991) estimate of 70%. From neutral theory (Kimura 1983), the existence of almost invariable sites suggests that these sites are subject to functional constraints. Although the control region is noncoding, it is known to contain the main regulatory elements for transcription and replication. It is the binding site for numerous molecules such as DNA and RNA polymerases and other transcription and regulatory factors and thus may well be subjected to various evolutionary pressures (Sacconeet al. 1991). HVRII is probably the more important functional part of the control region because it contains the origin of heavy strand replication (positions 110–440; Chang and Clayton 1985).
The major regulatory features of HVRII that lie within the region studied here are three conserved sequence blocks (CSBs) that have been suggested to serve as control sequences involved in the transition from primer RNA synthesis to DNA synthesis (Walberg and Clayton 1981), the RNase MRP cleaving site (RMC; Clayton 1991), and two mitochondrial transcriptionfactor binding sites (TFB; Clayton 1991). Looking at the rates of these regulatory elements, we see that the majority of the positions in the CSBs (216–235, 299–315, and 346–363) and the RMC (317–321) evolve much more slowly than an average site of HVRII, with the exception of position 357 in CSBIII, which has a rate of 2.29. The TFBs (233–260 and 276–303) show four sites (236, 247, 295, and 297) that evolve more than twice as fast as the average rate, while the rest of the positions show moderate variability (range, 0.0001–1.28). The finding that the mitochondrial transcription factor has flexible sequence specificity (Fisheret al. 1989) might explain the slightly higher average variability found in the TFBs compared to the CSBs and the RMC. Positions 16104–16106 map the trinucleotide stoppoint for the 3′ ends of the 7S DNA strands (SP; Dodaet al. 1981). For these positions, as well as for a possible control element (CE; Ohnoet al. 1991; 16194–16208), most rates are close to 0. The terminationassociated sequence TAS (Dodaet al. 1981; 16157–16172), which is a putative template stop signal for the elongation of the Dloop strands, shows more variability than the functional regions mentioned above. This analysis provides the interesting insight that functionally important regions do not necessarily have a small substitution rate.
Another question of interest is how well our sitespecific estimates correlate with positions that were variable in pedigree analyses (Howellet al. 1996; Parsonset al. 1997). The rates for these 10 positions are given in Table 3. Surprisingly, position 94 is invariant in our collection of 969 HVRII sequences (Handtet al. 1998). For positions 16092 and 234, the rates are <1. The 7 remaining positions show rates of >2. Among these are 3 positions with the highest rate of 6.22. Altogether our results suggest that the substitutions observed in family studies occur preferentially but not exclusively at sites with elevated rates. Even though diminishing the discrepancy (Pääbo 1996; Jazinet al. 1998; Parsons and Holland 1998) of rates estimated from family and phylogenetic studies, our observations do not fully explain the ∼20fold higher rate of the latter. But one should be aware that this time we exploited a method where rates are collected in eight categories. Thus the category representing the highest rate lumps together all sites that are extremely fast. That is to say, an upper limit for the relative rate at a position is introduced. Therefore, it is possible that the numerical values for some of the fast positions are underestimated. The discrete Γapproach (Yang 1993) yielded eight nonequidistant rates. Especially small evolutionary rate factors are very close to each other. Hence, for some site patterns the likelihood values do not differ substantially between neighboring categories. Because the final rate at a position is an average over many subsamples, possible inaccuracies in the rate assignment for a single subsample should have no substantial effect. Certainly these problems could be circumvented if a continuous Γdistribution were applied. Unfortunately, this is at present computationally unfeasible.
In this article we used maximumlikelihood methods to coestimate the parameters of the TamuraNei (1993) model including rate heterogeneity. By using a purely phylogenetic approach, we regarded the sequences as an interspecies data set. Therefore, the analyses were based on a restricted set of sequences, where each sequence type of the data collection was represented only once. This restriction may result in a loss of information about the mutational process, because clearly the human mtDNA variation is shaped by demographic processes as well. Obviously, it would be desirable to estimate substitution and demographic parameters simultaneously by using a population genetics model. To this end, coalescence theory (Kingman 1982a,b,c) provides an excellent tool to describe the ancestral relationship of a sample of sequences under various demographic scenarios (Donnelly and Tavaré 1995). The mutational process has been studied under demographic models of limited complexity (Lundstrom et al. 1992a,b). However, accurately modeling the dynamics of the worldwide population may result in an overly parameterrich model. On the other hand, it seems worthwhile to put further effort into creating a more exact picture of the relative rates, as knowledge of the sitespecific rates combined with the strength of the pedigree approach (Howellet al. 1996; Parsonset al. 1997) provides a powerful tool to unravel the absolute mutation rate in HVR. Our study is a step in this direction, as relative rates for the HVR have been estimated on the basis of a survey of a large amount of data. Our parameter estimates and rate estimates can be used to refine models of molecular structure and function as well as methods of phylogenetic inference. Thus, a better understanding of the forces and mechanisms that affect sequence evolution can be obtained and more sound conclusions about history of sequences and populations can be drawn.
Acknowledgments
We express our special thanks to Korbinian Strimmer, Roland Fleißner, and Svante Pääbo for stimulating discussions. We also thank Simon Tavaré and two anonymous referees for helpful comments on the manuscript. Financial support from the Deutsche Forschungsgemeinschaft is greatly appreciated.
Footnotes

Communicating editor: S. Tavaré
 Received August 6, 1998.
 Accepted April 2, 1999.
 Copyright © 1999 by the Genetics Society of America