High mutation rate in mammalian mitochondrial DNA generates a highly divergent pool of alleles even within species that have dispersed and expanded in size recently. Phylogenetic analysis of 277 human mitochondrial genomes revealed a significant (P < 0.01) excess of rRNA and nonsynonymous base substitutions among hotspots of recurrent mutation. Most hotspots involved transitions from guanine to adenine that, with thymine-to-cytosine transitions, illustrate the asymmetric bias in codon usage at synonymous sites on the heavy-strand DNA. The mitochondrion-encoded tRNAThr varied significantly more than any other tRNA gene. Threonine and valine codons were involved in 259 of the 414 amino acid replacements observed. The ratio of nonsynonymous changes from and to threonine and valine differed significantly (P = 0.003) between populations with neutral (22/58) and populations with significantly negative Tajima's D values (70/76), independent of their geographic location. In contrast to a recent suggestion that the excess of nonsilent mutations is characteristic of Arctic populations, implying their role in cold adaptation, we demonstrate that the surplus of nonsynonymous mutations is a general feature of the young branches of the phylogenetic tree, affecting also those that are found only in Africa. We introduce a new calibration method of the mutation rate of synonymous transitions to estimate the coalescent times of mtDNA haplogroups.
MITOCHONDRIAL DNA (mtDNA) encodes for 13 proteins, two ribosomal genes, and 22 tRNAs that are essential in the energy production of the human cell. Variation in the sequence of mtDNA has provided significant insights into the maternal history of anatomically modern humans (Giles et al. 1980; Denaro et al. 1981), complementing the paternal legacy of the Y chromosome (Underhill et al. 2000). Studies based on restriction fragment length polymorphism (RFLP) of the coding and direct sequencing of the noncoding control region have formed the basis of a hierarchical classification of distinct geographic and ethnic affinities (Torroni et al. 1993, 1996; Chen et al. 1995; Watson et al. 1997; Macaulay et al. 1999; Forster et al. 2001). Studies addressing sequence variation in the mtDNA coding region have suggested that natural selection has significantly shaped the course of human mtDNA evolution (Cann et al. 1984; Nachman et al. 1996; Ingman and Gyllensten 2001; Mishmar et al. 2003; Moilanen et al. 2003; Moilanen and Majamaa 2003; Elson et al. 2004; Ruiz-Pesini et al. 2004). These studies have disagreed, however, upon whether the distribution of specific human mtDNA clades or haplogroups is due to an adaptation to different climates or if their distribution is a function of random genetic drift assisted by purifying selection that eliminates nonsynonymous changes. In an attempt to clarify this disagreement and to study the mode of natural selection in mtDNA variation in human populations, we provide here a phylogenetic analysis of a global sample of mtDNAs and investigate the position, chemical nature, and geographic distribution of recurrent and frequent mutations in the coding region.
MATERIALS AND METHODS
The ascertainment set comprised 277 individuals from the five continents, including genomic DNA samples from 129 Africans (10 Biaka Pygmy, 15 Mbuti Pygmy, two Lisongo, six San, two Mandenka, four Ethiopian Jews, nine Sudanese, one Eritrean, one Ghanan, three Herero, one Ovambo, one Pedi, one Sotho, two Tswana, two Zulu, 10 Fulbe, 10 Mossi, 10 Rimaibe, one Berta, one Tuareg, 37 Dominicans), 43 Asians (one Arab, one Kazak, one Druze, four Bedouin, one Sepharadim, one Yemenite Jew, two Pathan, five Sindhi, two Burushaski, one Baluchi, one Brahui, two Makran, two Hazara, one Tamil, two Cambodians, one Hmong, one Atayal, one Ami, four Han Chinese, five Japanese, four Koreans), 76 Europeans (five Northern Europeans, 12 Italians, one Greek, two Finns, two Ashkenazi, one Georgian, 17 Hungarians, three Icelanders, three Czechs, one Sardinian, five Basque, one Iberian, 23 Dutch), 13 Oceanians (four New Guineans, three Melanesians, six Australian Aborigines), and 16 Native Americans (one Auca, one Guarani, five Brazilian Indians, three Colombian Indians, two Mayan, one Piman, one Muskogee, one Navaho, one Quechua) (supplemental Table 1 at http://www.genetics.org/supplemental/). A subset of 103 sequences from these populations has been reported elsewhere (Shen et al. 2005). DNA was extracted using the QIAamp DNA blood kit (QIAGEN, Valencia, CA). Immortalized cell lines have been established for all individuals with the exception of the 17 Hungarians, the 23 Dutch, the 37 Dominicans, the 10 Mossi, the 10 Rimaibe, and the 10 Fulbe.
PCR and DNA sequencing:
The 41 primer pairs used for bidirectional sequencing of mtDNA nucleotides 435–16,023, the PCR conditions, and the determined complete coding region sequence information for 277 individual samples are available at http://insertion.stanford.edu/primers_mitogenome.html. Amplicons were purified with QIAGEN QIAquick spin columns and sequenced with the Applied Biosystems (Foster City, CA) Dye Terminator Cycle sequencing kit and a model 3700 DNA sequencer.
Phylogenetic and statistical analyses:
An unrooted tree from a median-joining network (Bandelt et al. 1999) was drawn and labeled following existing mtDNA haplogroup nomenclature (Torroni et al. 1996, 2001; Macaulay et al. 1999; Kivisild et al. 2002, 2004; Salas et al. 2002; Yao et al. 2002; Kong et al. 2003, 2004; Shen et al. 2005). The tree was rooted using nuclear inserts of mtDNA retrieved from human genomic sequence and the consensus sequence of the three chimpanzee mitochondrial genomes. The accession numbers, mtDNA positional range, and identity (ID) measures of the genomic contigs containing the inserts that were used for rooting are as follows: NT_006713.14 (bp 341–2697; ID 94%); NT_009237.17 (bp 521–2976; ID 94%); NT_006316.15 (bp 2899–3050; ID 94%); NT_077913.3 (bp 3914–9756; ID 98%); and NT_034772.5 (bp 10,269–15,487; ID 94%). The assembled sequence of the inserts is available at http://insertion.stanford.edu/mtDNA.html. The GenBank accession numbers of the two Pan troglodytes and one Pan paniscus sequences that were used are D38113, X93335, and D38116, respectively. Haplogroup divergence estimates ρ and their error ranges were calculated as averages of the distances from the tips to the most recent common ancestor of the haplogroup (Forster et al. 1996; Saillard et al. 2000). Two separate measures of nonsynonymous (N) to synonymous (S) substitution ratios were used: first, the MN/MS ratio estimates the number of mutational changes inferred from the phylogenetic tree (Figure 1), and second, the dN/(dS + constant) refers as in Mishmar et al. (2003) to the ratio of the average pairwise distances of N and S changes in the given sample. Statistical significance was determined from binomial or χ2 probabilities. Disease-implicated substitutions were excluded from these analyses. For interspecies comparisons, mammalian mtDNA sequences were retrieved from the Mitochondriome website (http://bighost.area.ba.cnr.it/mitochondriome/Mt_chordata.htm).
Tests for positive selection:
Seven primate taxa, namely Homo sapiens, P. troglodytes, Gorilla gorilla, Papio hamadryas, Hylobates lar, Pongo pygmaeus, and Macaca sylvanus were chosen from GenBank (gi|17981852, gi|5835121, gi|5835149, gi|5835638, gi|5835820, gi|5835163, gi|14010693) and aligned using clustalW (Thompson et al. 1994) to test for the historic occurrence of positive, directional selection on the 13 coding regions of the primate mitochondrion using the program codeml of the PAML package (Yang 2002). In these tests, maximum-likelihood ratios of nonsynonymous-to-synonymous mutations (ω) exceeding 1 are consistent with the hypothesis of positive selection, while values close to 1 indicate selective neutrality, and values converging on 0 suggest strong purifying selection. We conducted both lineage and site-specific tests. For the lineage-specific tests, we used a model in which all lineages have the same ω (hereafter referred to as M0) and compared that with a model in which ω is estimated for each lineage (hereafter referred to as M1). To test for the action of selection among amino acid sites within a specific lineage, we compared a model that allows for heterogeneity in ω among sites, but not among lineages, with a model that allows for variation in ω along a predefined lineage (as in Yang and Nielsen 2002). We assumed the following unrooted phylogeny (troglodytes, ((((macaca, papio), hylobates), pongo), gorilla), troglodytes), human). However, results of our analyses were robust to minor fluctuations in the tree.
The deepest splits of the phylogeny constructed from 277 mtDNA complete coding region sequences (Figure 1) were sustained by African mtDNAs, which belonged to previously defined haplogroups L0–L5 (Torroni et al. 2001; Salas et al. 2002; Mishmar et al. 2003; Kivisild et al. 2004; Shen et al. 2005). A number of new subclades were identified among these (see Figure 1. Haplogroup sharing between distinct geographic regions was generally low. All European sequences could be assigned to clades N1, X, W, HV, TJ, and U (Torroni et al. 1996; Macaulay et al. 1999; Finnilä et al. 2001; Herrnstadt et al. 2002). Asian, Amerindian, Oceanian, and Australian Aborigine sequences belonged to region-specific haplogroups nested within macro-clades M and N (Kivisild et al. 2002; Yao et al. 2002; Kong et al. 2003, 2004; Friedlaender et al. 2005). All Australian Aborigine M sequences (two from this study and one from Ingman et al. (2000) share six mutations that define the new haplogroup M42. The majority of Australian N and R sequences (Ingman and Gyllensten 2003) belong to clades S and P defined by transitions at nucleotide positions (np) 8404 and 15607, respectively (Forster et al. 2001; Friedlaender et al. 2005).
The most parsimonious root of the mtDNA tree using nuclear inserts of mtDNA and the chimpanzee consensus sequence as outgroups appeared between haplogroup L0 and the rest of the phylogeny (Figure 1). Extensive interspecies homoplasy and mutational saturation was highlighted by the fact that for more than one-third (417/1292) of the variable sites, regardless of their phylogenetic position on the tree, the derived allele among humans corresponded to the chimpanzee allele. In agreement with noncoding region information (Aquadro and Greenberg 1983), a high ratio (21.5 on average, 34.8 in synonymous positions) of transitions to transversions was observed in the coding region (577–16023).
Interspecies calibration of the molecular clock over the complete mtDNA sequence (Ingman et al. 2000; Mishmar et al. 2003) is problematic because of saturation of transitions at silent positions and the effect of selection on the fixation rate of amino acid replacement mutations (Ho et al. 2005). Assuming 6 million years for the human–chimp species split (Goodman et al. 1998) and 6.5 million years for the most recent common ancestor of their mtDNA lineages (Mishmar et al. 2003), we estimated the average transversion rate at synonymous and rRNA positions as 2.1 × 10−9 and 4.1 × 10−10/year/position, respectively. Using the observed relative rates of different substitution types in humans (Table 1), the average transition rate at 4212 synonymous positions is 3.5 × 10−8 (SD 0.1 × 10−8)/year/position. Over all genes in mtDNA this would be equivalent to accumulation of one synonymous transition/6764 (SD 140) years on average. The coalescent date of the human mitochondrial DNA tree using this rate is 160,000 (SD 22,000) years. This coalescent date is broadly consistent with the dates of the Homo sapiens fossils recognized so far from Ethiopia (Clark et al. 2003; White et al. 2003; McDougall et al. 2005). The most recent common ancestor of all the Eurasian, American, Australian, Papua New Guinean, and African lineages in clade L3 dates to 65,000 ± 8000 years while the average coalescent time of the three basic non-African founding haplogroups M, N, and R is 45,000 years. These estimates, bracketing the time period for the recent out-of-Africa migration (Stringer and Andrews 1988), are younger than those based on calibrations involving all coding region sites (Ingman et al. 2000; Mishmar et al. 2003) but are still in agreement with the earliest archaeological signs of anatomically modern humans outside Africa (Mellars 2004). The differences between the date estimates of previous studies are most likely due to the overrepresentation of possibly slightly deleterious nonsynonymous mutations in the younger branches of the tree (Elson et al. 2004) that introduces a bias to the coalescent approach if all the sites of the coding region are used.
Of the 1788 mutations depicted in the tree, 1758 occurred at 1292 variable sites in the coding region between np 577 and 16023. Consistent with previous reports (Mishmar et al. 2003; Moilanen and Majamaa 2003; Elson et al. 2004), there was a significant excess of synonymous mutations in all genes coded by mtDNA, especially among those positions that defined the deeper branches of the tree (Tables 2 and 3). In contrast to Ruiz-Pesini et al. (2004), we did not observe any significant regional (climatic) differences in the rate of nonsynonymous changes for mtDNA haplogroups. This discrepancy likely results from the fact that Ruiz-Pesini et al. (2004) compared region-specific haplogroups of different diversity levels: e.g., the “old” paragroup L in Africans vs. “young” Arctic haplogroups (Table 4). Populations of Asian, European, and West African origin showed significantly negative Tajima's D values (Table 3), consistent with selection, population growth, and/or population subdivision (Ray et al. 2003). That population substructure accounts at least for part of the deviation from neutrality is obvious from the observation that it decreases upon partitioning of West Africans into a sample from Burkina Faso and one from the Dominican Republican.
Significant (P < 0.05) mutational bias toward specific (NNG to NNA and NNU to NNC) codon usage was observed in 27 of 32 pairs of codons that differed by a transition in the third codon position (Table 5). This relative preference of G-to-A and T-to-C mutations (per existing nucleotide pool in the light strand) extends over the nonsilent positions and is characteristic of the noncoding D-loop region (Aquadro and Greenberg 1983; Malyarchuk and Rogozin 2004). However, the general strand bias, known to be reversed in some Metazoan genera, can be related to asymmetric mutational constraints involving deaminations of A and C nucleotides during the replication and/or transcription processes (Hassanin et al. 2005). Importantly, the ND6 gene, encoded by the heavy strand showed the opposite mutational bias, suggesting that the differences of codon usage in human mtDNA might be primarily a function of strand asymmetry rather than of differences in the tRNA pools, as generally expected (Tanaka and Ozawa 1994). Notably, 16/18 nucleotide positions (Table 6) that had undergone five or more recurrent changes involved the transition of guanine to adenine in the light strand.
Of the 1292 variable sites, 288 (22.2%) had mutated recurrently. Unexpectedly, the hotspots that had mutated five or more times were predominantly within mitochondrial rRNA (P < 5 × 10−15) and showed a significantly higher ratio of nonsynonymous-to-silent mutations (90:32 hits, respectively) than polymorphic sites with lower recurrence (608:1004) (Table 6). Finally, these hotspots of mutational activity included positions where the human-derived allele predominates in the mammalian consensus sequences (e.g., np 709, 3010, 10398, and 13928), implying the effect of site-specific positive selection. Among the six nonsynonymous substitutions that have recurred four or more times, five involved threonine (P < 6.1 × 10−7). Overall, threonine and valine codons were involved in 259 of the 414 amino acid replacements observed on the tree.
Lineage-specific tests failed to detect significant positive selection along any unique lineage in the seven-taxon phylogeny of primates. A model fixing a single ratio of ω to all lineages (M0) could not be rejected in favor of a model of different ω's on specified lineages (M1). The ω estimated across all lineages in the phylogeny was 0.35. A test of the previous model against a model enforcing neutral selection, where ω is expected to be equal to 1, showed that these data do not deviate significantly from neutrality (M0 was rejected in favor of model where ω = 1; P ≈ 0, d.f. = 1). Further tests for lineage-specific variation in ω, including a model that assigned a different ω to the human lineage from the remaining primates, did not fit the data as well as M1 did. However, site-specific model testing revealed significant positive selection across regions of the primate mitochondrion. A model enforcing a single ω ratio on all codon sites was rejected in favor of a model allowing for three ratios across sites with three site classes (P ≈ 0, d.f. = 5). The three-ratio model identified 16 codon sites to be under significant (posterior probabilities > 0.95; dN/dS = 2.02) positive selection (Table 7). Among these, four codon sites appeared to be among the nonsynonymous sites with recurrent mutation (particularly no. 114 in the ND3 gene, np 10398 with seven recurrences) in human–human comparisons (Table 6).
A majority of the mitochondrial disease-related mutations have been detected in the tRNA genes, and they mainly affect the secondary structure of the molecule (McFarland et al. 2004 and references therein). Our global survey of natural variation in the tRNA genes showed a sevenfold excess of tRNAThr mutations (N = 28) over other tRNA genes (P < 10−19). This finding would suggest, at first glance, that this gene might have become nonfunctional in the mitochondrion and that its encoded tRNA needs to be imported from the nucleus. Evidence suggesting nuclear tRNA import into mitochondria in marsupials has been obtained previously (Dörner et al. 2001). However, plotting the observed mutations in the tRNAThr gene against the mammalian consensus sequences (Helm et al. 2000) showed that none of the mutations that we have observed in 277 humans fell within the 100% conserved regions of the tRNA (Figure 2). Most pathological mutations affecting tRNAs cluster in highly conserved regions (McFarland et al. 2004), as illustrated in the case of tRNALeu in Figure 2. Four private mutations changed the nucleotide that is >90% conserved in mammalian tRNAThr, while most of the frequent and recurrent mutations in the data set affected the minor fraction of the sites that are not highly conserved in mammalian species. This argues against the proposition that human mitochondrial tRNAThr has lost function. A large fraction (12/28) of the mutations affecting tRNAThr occurred at three positions, two of which have a different allele in consensus humans as compared to the 31 mammalian species analyzed by Helm et al. (2000). Similarly, a mutational hotspot at position 5821 in tRNACys showed a majority allele in humans different from that found in consensus mammalians (Figure 2). Surprisingly, in the latter tRNA we observed two parallel mutations at position 5814 that have been previously reported pathogenic. Yet, because this position is not highly conserved in other mammalian species, its pathological role has to be questioned. No other tRNA site that has been confirmed as associated with mitochondrial disease was found to be variable in our data set. The only mutational hotspot (12,172) affecting the human allele that matches the allele conserved in >90% of mammalian tRNA-s was found in tRNAHis.
A comparison of the amino acid substitutions in the mtDNA-encoded proteins in humans, primates, carnivores, and artiodactyls revealed that substitutions between threonine and alanine are significantly overrepresented in humans while changes between methionine and leucine are most common in other mammalian species (Table 8). The direction of threonine and valine substitution with other amino acids was significantly different among populations with neutral and significantly negative Tajima's D values, respectively (Table 3), and among haplogroups: in H1 sequences sampled broadly from Europe and the Near East, 7 of 11 nonsynonymous mutations resulted in the replacement of threonine and valine with alanine and isoleucine, while only three mutations resulted in a change toward threonine or valine (Figure 1). In contrast to this pattern, in haplogroup V sequences from Finland (Finnilä et al. 2001), where populations continued to rely largely on hunting and fishing for subsistence even after the first contacts with farmers, six of seven replacement polymorphisms resulted in a change to threonine and valine, and none resulted in the replacement of the latter two amino acids (P < 0.01). Similarly, L3 sequences of West African origin showed a significantly lower (P < 0.001) ratio of gains to losses of threonine and valine residues (13/14) than haplogroup L0 sequences from East and South Africa (22/2; Figure 1). East Asian sequences showed an increase of valine codons (9 mutations to and 2 from valine codons), but also a significant (P < 0.01) decrease of threonine, with 11 mutations from and 5 to threonine codons, while Native Americans had 1 mutation from and 7 to threonine, respectively. Over all haplogroups and genes, the direction of amino acid change was significantly (P < 0.02) biased toward replacement of isoleucine and methionine to valine, even when considering the transitional preferences observed in the mitochondrial D-loop (Tamura and Nei 1993). The strand-specific mutational biases are unlikely to explain this pattern because of the observed excess of mutations involving valine codons (8/16 in our data set and 13/16 in the list of ND6 polymorphic sites in MITOMAP) in the ND6 gene that is encoded by the opposite strand.
In phylogenetic analysis of human mitochondrial DNA-coding region sequences, two different spheres of character evolution can be distinguished (Ho et al. 2005; Penny 2005). First, within our species, at the population level, a relatively low level of parallel mutations—as compared to the mtDNA control-region-based phylogenies—enables the reconstruction of the unrooted tree from individual sequences without significant ambiguity. This tree is determined by a substantial fraction of amino acid replacement mutations whose proportion to synonymous substitutions increases from the average of 0.37 in “older” clades to 0.62 in “younger” ones. In the second sphere, a high level of homoplasy with chimpanzees, affecting at least one-third of the variable sites in humans, complicates detailed phylogenetic analyses at the interspecies level. Approximately 930 synonymous mutations that can be observed between human and chimpanzee mtDNA represent only the visible component of variation between the species while the effective ratio of nonsynonymous-to-silent mutations is expectedly significantly less than the observed value of 0.2 due to the hidden load of synonymous mutations. These differences between the two spheres imply that, even among the substitutions that define the deepest branches of the human mtDNA tree, a significant excess of nonsynonymous mutations have not yet been eliminated by purifying selection—assuming, of course, that they are generally deleterious, after all.
More than half of the amino acid replacements observed in the human mtDNA tree involved threonine and valine codons. Adaptive correlation with the elevated mutability in the mitochondrion-encoded tRNAThr, in principle, could be considered as one explanation for the excess of mutations involving threonine codons. However, none of the highly conserved sites in the tRNAThr gene was found to be different in humans from that of the consensus mammalians and, instead, the excessive variability in this gene could be ascribed largely to the presence of three hotspot positions. Furthermore, no such general molecular phenomenon or the characteristic G-to-A and T-to-C mutational bias on the light strand of mtDNA would explain the pattern of differences of amino acid replacement directions that were observed among human populations.
One factor that could explain, theoretically at least, the different amino acid replacement patterns observed among populations and between humans and other mammals is diet. Threonine and valine, essential amino acids that must be taken in the diet, are abundant in meats, fish, peanuts, lentils, and cottage cheese, but deficient in most grains. Alternatively, or in combination with dietary restriction, other constraints of selection on slightly deleterious positions during the phases of population expansion and contraction may be involved. Because of the specific compositional bias in mtDNA induced by characteristic mutational preferences different from those observed in the nuclear genome, additional inter- and intraspecies comparisons of mtDNA-encoded amino acid replacement patterns should be examined to gain deeper insights into the nonsynonymous character evolution in metazoan mitochondria, particularly in taxa with shifted strand symmetry (Hassanin et al. 2005).
Tests of neutrality based on the comparisons of the ratio of nonsynonymous and synonymous mutations across all sites can detect only major effects of purifying (KN/KS approaches 0) or directional selection (KN/KS is significantly >1), which affect simultaneously a large number of codon positions. Consistent with previous studies (Cann et al. 1984; Nachman et al. 1996; Ingman and Gyllensten 2001; Mishmar et al. 2003; Moilanen et al. 2003; Moilanen and Majamaa 2003; Elson et al. 2004; Ruiz-Pesini et al. 2004) human mtDNA-encoded proteins did not provide evidence of directional selection. However, several hotspots of mutational activity included nonsilent substitutions susceptible to site-specific positive selection. Comparing the mtDNA protein-encoding genes from several primates (Macaca, Papio, Hylobates, Pongo, Gorilla, and Pan) with human ones, we discovered significant positive selection in several regions, generally nonmatching, however, with the codons displaying a high KN/KS ratio in human–human comparisons. This difference might be explained by the dynamic polarity of the amino acid replacements at the intra- and interspecies levels whereby the constraint of selection is determined in each lineage by the ancestral state of each codon position.
In conclusion, we have provided new evidence for nonrandom processes affecting the evolution of the human mtDNA-encoded proteins. The potential role of selection in affecting fixation probabilities at different nonsilent positions undermines the appropriateness of using the average mitochondrial clock over all sites in dating events in human population history. Despite the evidence of departures from neutrality and high levels of homoplasy at the interspecies level, the phylogenetic approach for analyzing mtDNA sequence data at the intraspecies level remains viable because the reconstruction of the basic branches is robust and the excess of nonsynonymous substitutions affects mainly the terminal branches of the tree.
We thank Richard Villems for useful comments. This work was supported by National Institutes of Health grants GM28428, GM63883, and GM55273, European Commission research grant QLG2-CT-2002-90455, Progetto Consiglio Nazionale delle Ricerche-Ministero dell Istruzione, dell Universita e della Ricerca Genomica Funzionale-Legge 449/97, and Telethon-Italy E.0890. We thank Rita Horvath for providing DNA samples.
↵2 These authors contributed equally to this work.
↵3 Present address: Department of Systems Biology, Harvard Medical School, Boston, MA 02115.
↵4 Present address: Institute of Functional Genomics, University of Regensburg, Josef-Engert-Strasse 9, 93053 Regensburg, Germany.
Communicating editor: L. Excoffier
- Received March 30, 2005.
- Accepted September 5, 2005.
- Copyright © 2006 by the Genetics Society of America