- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Pesole, G.
- Articles by Saccone, C.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Pesole, G.
- Articles by Saccone, C.
A Novel Method for Estimating Substitution Rate Variation Among Sites in a Large Dataset of Homologous DNA Sequences
Graziano Pesolea and Cecilia Sacconeba Dipartimento di Fisiologia e Biochimica Generali, Università di Milano, 20133 Milano, Italy
b Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, 70126 Bari, Italy
Corresponding author: Graziano Pesole, Dipartimento di Fisiologia e Biochimica Generali, Università di Milano, via Celoria 26, 20133 Milano, Italy., graziano.pesole{at}unimi.it (E-mail)
Communicating editor: F. TAJIMA
| ABSTRACT |
|---|
We present here a novel method to estimate the site-specific relative variability in large sets of homologous sequences. It is based on the simple idea that the more closely related are the compared sequences, the higher the probability of observing nucleotide changes at rapidly evolving sites. A simulation study has been carried out to support the reliability of the method, which has been applied also to analyzing the site variability of all available human sequences corresponding to the two hypervariable regions of the mitochondrial D-loop.
THE specific variability of different nucleotide sites during evolution is a general property of DNA sequences. In coding sequences, it is well known that the average evolutionary dynamics of the three codon positions are rather different, with the third codon positions evolving much faster than the first and the second codon positions, the latter evolving slower than the former. This rate pattern is expected as most of the changes in the third codon positions are synonymous, i.e., do not change the coded amino acid, whereas all changes in the second codon positions are nonsynonymous and generally introduce drastic changes in the physical-chemical properties of the coded amino acid. Furthermore, some rate heterogeneity can be observed within each of the three codon positions depending on the specific functional constraints acting on each site. Thus, first and second positions in codons for amino acids critical for the biochemical activity of the protein are usually totally invariant or strongly conserved. Because of the remarkable rate heterogeneity between synonymous (third codon positions) and nonsynonymous (first and second codon positions) sites, it is advisable to analyze the two groups of sites separately.
In noncoding sequences, usually endowed with regulatory activity, a higher level of rate heterogeneity is observed with respect to coding sequences, with some sites extremely conserved and others highly variable and prone to accept insertion/deletion events. Reliable estimates of genetic distances between homologous sequences and their phylogenetic relationships can be obtained under the condition that the mathematical model used in the analysis adequately fits the sequence data (![]()
![]()
![]()
Indeed, the estimate of the nucleotide substitution rate and the phylogenetic reconstructions may be strongly affected if rate heterogeneity is not taken into account in the mathematical model used in the analysis. Evidence has been provided that shows that variation in substitution rate among sites can be suitably described by discrete or continuous gamma distribution models (![]()
![]()
![]()
Different approaches have been used so far to estimate the substitution rate at each site. A first approach counts the substitution events along a phylogenetic tree estimated by using maximum parsimony (![]()
![]()
![]()
![]()
![]()
![]()
![]()
A major drawback of all of the above approaches is their dependence on a phylogenetic tree. In particular, in the case of a maximum-parsimony (MP) estimated tree the longer the branches in the tree, the more underestimated the substitution rate. ML-based approaches add to the predetermination of the tree the a priori determination of the substitution rate parameters. These approaches are computationally very intensive and thus can be conveniently applied only to relatively small datasets; therefore, several randomly selected subsets of the whole dataset are often used in the analysis (![]()
We propose here a new method that provides the relative rate for each individual site from all pairwise genetic distances calculated between the sequences under examination without any assumption of either the structure of the tree or the rate categories. It is based on the simple idea that the closer the two compared sequences are the higher is the probability of observing nucleotide changes of rapidly evolving sites. A similar argument is also the basis of the tree-independent method proposed by ![]()
![]()
More than 100 pathogenic mtDNA base substitution mutations have been identified in a variety of degenerative diseases (![]()
| MATERIALS AND METHODS |
|---|
The substitution rate at a given sequence position can be calculated ideally by observing that particular position for a suitable amount of time and then measuring the number of changes per unit of time. Of course, this is not possible when dealing with extant sequences that are the product of unknown past evolutionary processes.
However, if we assume that a given position has a specific rate that is independent from the particular lineage considered, one could ideally equally calculate its rate by carrying out a suitable number of pairwise comparisons of sequences whose divergence is just one unit of time. Of course the unit of time we define here, dt, has to be sufficiently small to minimize the possibility of multiple substitutions occurring within its time span. In other words, if the position i evolves twice as fast as the position j, in independent pairwise comparisons of homologous sequences whose divergence time is dt we should expect to observe twice as many nucleotide changes in position i than in position j. When extending this principle to real sequences we can expect that the more variable a given position, the more changes are observed in closely related sequences.
Let us consider a nucleotide multialignment of N sequences and L sites. Following the above reasoning the relative variability of the ith site can be simply given by
![]() |
(1) |
where
ij = 1 or
ij = 0 depending on the observation or not of a nucleotide substitution in the jth pairwise comparison, Kj being the overall genetic distance. To calculate genetic distances as correctly as possible, we used the stationary Markov model of ![]()
![]()
According to Equation 1 the relative contribution of an observed pairwise change in a given position to its measure of variability will be inversely proportional to the corresponding pairwise genetic distance. The absolute value of
i being dependent on the particular dataset considered, its relative value can be defined as

where
i ranges from 0 for invariant sites to 1 when
i =
max. The number of compared sequences is critical as it can be reasonably expected that the higher the number of sequences considered, the more reliable the estimate of the relative variability of the individual positions in a multialignment.
The SiteVar software that implements the above-described method, written in C language and running under the Unix operating system, is available from the authors upon request.
We tested the reliability of the above-described method through a simulation procedure. To carry out the simulation we used the program Seq-gen (![]()
![]()
![]()
![]()
In the Seq-gen sequence simulations we assumed the GTR model for sequence evolution and used as input trees T50 and T500 to generate datasets of 50 and 500 sequences 1000 nucleotides (nt) long, denoted S50 and S500, respectively. A specific model implemented in the Seq-Gen software, assuming three classes of relative variability, was used to mimic site-specific rate heterogeneity. In this way the simulated sequences contained three classes of sites, each accounting for one-third of the total sequence length, which we denoted as p1, p2, and p3, and evolving at three different relative rates. Two different rate ratios were used in the simulation, namely p1:p2:p3 = 1:2:3 or 1:5:10. Four different average rates of substitution were used in the simulations along T50 and T500 trees, namely 5, 10, 25, and 50%/myr (denoted as R1, R2, R5, and R10), which included the average rate of nucleotide substitution of the D-loop Hvr1 corresponding to
10%/myr (![]()
The BASEML program, implemented in the PAML package (![]()
To test the method on real data we determined site-by-site variability of the two hypervariable regions of the D-loop (Hvr1 and Hvr2). For the Hvr1 region, spanning from position 16,024 to 16,382, we used a dataset of 1308 sequences, whereas for the Hvr2 region, spanning from position 57 to 371, we used a smaller dataset of 458 sequences. In both cases we used all the available sequences in the HVRbase collection (![]()
| RESULTS |
|---|
Fig 1 shows the results obtained by our model applied to the S50 and S500 sequence datasets, whose evolution was simulated to make sites fit in three different classes of variability and to which four different average rates of nucleotide substitution were applied, denoted as R1, R2, R5, and R10 (see MATERIALS AND METHODS for further details on the simulation). It is remarkable that the model determines quite accurately site relative variability as set up in the simulation both in the case of a small rate difference (relative variability of sites 1:2:3) and in the case of a large rate difference (relative variability of sites 1:5:10 for both S50 and S500 datasets). As expected, a higher level of accuracy was obtained with the S500 dataset for all rates, with a slight underestimate at fastest average rates (i.e., R5 and R10) mainly in the S50 dataset. All in all, the simulation demonstrates the remarkable accuracy of the proposed methodology in the determination of the relative nucleotide substitution rate variation among sites.
|
On the contrary, the analysis carried out with BASEML software on only the S50 dataset, due to the higher computational needs of this approach, has shown quite surprisingly that site-specific rates were appropriately estimated qualitatively only with p1 < p2 < p3 since observed rate ratios were remarkably lower than expected (Fig 2).
|
We then determined the relative variability of sites of the Hvr1 and Hvr2 regions of the human mitochondrial D-loop by analyzing all available sequences from the HVRbase collection (![]()
![]()
![]()
![]()
|
The Hvr1 region, contained in the ETAS domain, corresponds to the 3' end of the D-loop where the newly synthesized H-strand stops. Of the 359 Hvr1 sites considered, 125 (35%) proved to be invariant, the relative variability of the remaining sites being between 0.001 and 1 (position 16,223). The most variable tract, corresponding to a 136-nt region (positions 16,16516,300), was only present in some primates but not in other mammalian D-loops and has been defined as "insertion sequence" (IS; ![]()
The Hvr2 region, in the CSB domain, corresponding to the 5' end of the D-loop, contains the main regulatory elements of the mitochondrial genome: the two promoters (HSP and LSP) and the origin of replication of the H-strand. Of the 315 Hvr2 sites considered, 210 (67%) proved to be invariant, the relative variability of the remaining sites being between 0.003 and 1 (position 146) as against a global average variability of 0.026.
Table 1 shows the occurrence of the different substitution patterns observed in the aligned Hvr1 and Hvr2 sequences and their average relative variability. In both regions the CT and AG transition sites greatly outnumber transversion sites with CT more frequent than AG patterns and the relative variability of the former higher than that of the latter, particularly in Hvr1.
|
Table 2 lists all hypervariable sites (relative variability >0.2) found in both Hvr1 and Hvr2. Most of the sites that were hypervariable in Hvr1 were also hypervariable in other studies carried out on this region (![]()
![]()
![]()
![]()
|
| DISCUSSION |
|---|
The knowledge of the evolutionary dynamics of a specific nucleotide sequence may greatly contribute to the elucidation of its structure-function relationships. In particular, the level of constraints acting on each individual site is generally correlated to its functional activity. The same information may be also particularly useful in evolutionary studies both for the measurement of nucleotide substitution rate and for the reconstruction of the phylogeny (![]()
![]()
![]()
Indeed, evolutionary models generally used for evolutionary analyses may provide misleading results if the specific assumptions made by the model (e.g., equal probability of transitions and transversions, rate homogeneity among analyzed sites, etc.) are not fulfilled by real sequence data. The better we know the specific evolutionary dynamics of the set of homologous sequences under examination the higher is the chance of obtaining reliable inference for the use of more suitable evolutionary models.
In recent years the study of human genetic variability, in particular of the main noncoding regulatory region of mtDNA (D-loop) for its remarkable level of intraspecific diversity, has produced important breakthroughs in the elucidation of the origin of modern man. The study of mtDNA genetic variability in humans could also contribute to assessing the functional significance of mtDNA mutations, possibly associated with the wide variety of mitochondrial degenerative diseases, aging, and cancer (![]()
Recently, ![]()
The strong rate heterogeneity of the D-loop hypervariable regions explains the 20-fold difference in the estimate of the mutation rate for the mtDNA D-loop obtained by phylogenetic or family studies (![]()
![]()
Despite the fact that ML models using gamma-distributed rates are considered the most reliable in assessing rate variation among sites, if the assumed gamma model does not comply with the actual evolutionary process, inaccurate results will be obtained. This is particularly true when discrete rate-class models are used as in our simulation study where the observed discrepancy between the expected site-specific rates and those inferred by BASEML software are due to the mismatch between the simulation model and the assumed gamma distribution. Indeed, we do not know a priori the distribution of rate variation among sites for the real datasets under analysis and the continuous-rate models are computationally unfeasible for large datasets. The simple method we propose, which does not require the knowledge or assumption of a phylogenetic tree or of a substitution model and has proved to be rather reliable in a simulation study, has the great advantage of allowing the analysis of very large sequence datasets. Indeed, this feature is particularly important for the expected rapid growth of sequence data from intraspecific studies aimed at both population genetic analyses and at the study and diagnosis of genetic diseases. For both topics, particularly the latter, a site-by-site elucidation of the evolutionary dynamics can provide very useful information.
| ACKNOWLEDGMENTS |
|---|
This work has been supported by Ministero dell'Universitá e della Ricerca Scientifica e Tecnologica (MURST), Italy (PRIN project "Bioinformatics and Genomics"), and by programma Biotecnologie legge 95/95 (MURST 5%).
Manuscript received July 18, 2000; Accepted for publication October 10, 2000.
| LITERATURE CITED |
|---|
ANDERSON, S., A. T. BANKIER, B. G. BARREL, M. H. L. D. BRUIJIN, and A. R. COULSON et al., 1981 Sequence and organization of the human mitochondrial genome. Nature 290:457-465[Medline].
ATTIMONELLI, M., N. ALTAMURA, R. BENNE, A. BRENNICKE, and J. M. COOPER et al., 2000 MitBASE: a comprehensive and integrated mitochondrial DNA database. The present status. Nucleic Acids Res. 28:148-152
EXCOFFIER, L. and Z. YANG, 1999 Substitution rate variation among sites in mitochondrial hypervariable region I of humans and chimpanzees. Mol. Biol. Evol. 16:1357-1368[Abstract].
FELSENSTEIN, J. and G. A. CHURCHILL, 1996 A Hidden Markov Model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93-104[Abstract].
GU, X. and J. ZHANG, 1997 A simple method for estimating the parameter of substitution rate variation among sites. Mol. Biol. Evol. 14:1106-1113[Abstract].
HANDT, O., S. MEYER, and A. VON HAESELER, 1998 Compilation of human mtDNA control region sequences. Nucleic Acids Res. 26:126-129
HASEGAWA, M., A. DI RIENZO, T. D. KOCHER, and A. C. WILSON, 1993 Toward a more accurate time scale for the human mitochondrial DNA tree. J. Mol. Evol. 37:347-354[Medline].
JAZIN, E., H. SOODYALL, P. JALONEN, E. LINDHOLM, and M. STONEKING et al., 1998 Mitochondrial mutation rate revisited: hot spots and polymorphism. Nat. Genet. 18:109-110[Medline].
KOGELNIK, A. M., M. T. LOTT, M. D. BROWN, S. B. NAVATHE, and D. C. WALLACE, 1998 MITOMAP: a human mitochondrial genome database1998 update. Nucleic Acids Res. 26:112-115
LANAVE, C., G. PREPARATA, C. SACCONE, and G. SERIO, 1984 A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20:86-93[Medline].
LAUDER, I. J., H. J. LIN, J. Y. LAU, T. S. SIU, and C. L. LAI, 1993 The variability of the hepatitis B virus genome: statistical analysis and biological implications. Mol. Biol. Evol. 10:457-470[Abstract].
MEYER, S., G. WEISS, and A. VON HAESELER, 1999 Pattern of nucleotide substitution and rate heterogeneity in the hypervariable regions I and II of human mtDNA. Genetics 152:1103-1110
MICHIKAWA, Y., F. MAZZUCCHELLI, N. BRESOLIN, G. SCARLATO, and G. ATTARDI, 1999 Aging-dependent large accumulation of point mutations in the human mtDNA control region for replication. Science 286:774-779
PAGEL, M., 1999 Inferring the historical patterns of biological evolution. Nature 401:877-884[Medline].
PARSONS, T. J., D. S. MUNIEC, K. SULLIVAN, N. WOODYATT, and R. ALLISTON-GREINER et al., 1997 A high observed substitution rate in the human mitochondrial DNA control region. Nat. Genet. 15:363-368[Medline].
PESOLE, G., G. DELLISANTI, G. PREPARATA, and C. SACCONE, 1995 The importance of base composition in the correct assessment of genetic distances. J. Mol. Evol. 41:1124-1127.
RAMBAUT, A. and N. C. GRASSLY, 1997 Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235-238
SACCONE, C., C. LANAVE, G. PESOLE, and G. PREPARATA, 1990 Influence of base composition on quantitative estimates of gene evolution. Methods Enzymol. 183:570-583[Medline].
SACCONE, C., G. PESOLE, and E. SBISÀ, 1991 The main regulatory region of mammalian mitochondrial DNA: structure-function model and evolutionary pattern. J. Mol. Evol. 33:83-91[Medline].
SACCONE, C., M. ATTIMONELLI, C. LANAVE, G. PESOLE and E. SBISÀ, 2000 Mitochondrial DNA and human diversity: a detailed analysis of the D-loop region, pp. 6978 in The Origin of Humankind, edited by M. ALOISI et al. IOS Press, Venice.
SBISÀ, E., F. TANZARIELLO, A. REYES, G. PESOLE, and C. SACCONE, 1997 Mammalian mitochondrial D-loop region structural analysis: identification of new conserved sequences and their functional and evolutionary implications. Gene 205:125-140[Medline].
VAN DE PEER, Y. and R. DE WACHTER, 1993 TREECON: a software package for the construction and drawing of evolutionary trees. Comput. Appl. Biosci. 9:177-182
WAKELEY, J., 1993 Substitution rate variation among sites in hypervariable region I of human mitochondrial DNA. J. Mol. Evol. 37:613-623[Medline].
WALLACE, D. C., 1999 Mitochondrial diseases in man and mouse. Science 283:1482-1488
YANG, Z., 1994a Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39:105-111[Medline].
YANG, Z., 1994b Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-314[Medline].
YANG, Z., 1996 Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol. Evol. 9:367-372.
YANG, Z., 1997 PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555-556
YANG, Z. and S. KUMAR, 1996 Approximate methods for estimating the pattern of nucleotide substitution and the variation of substitution rates among sites. Mol. Biol. Evol. 13:650-659[Abstract].
YANG, Z. and T. WANG, 1995 Mixed model analysis of DNA sequence evolution. Biometrics 51:552-561[Medline].
ZHARKIKH, A., 1994 Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol. 39:315-329[Medline].
This article has been cited by other articles:
![]() |
N. Galtier, D. Enard, Y. Radondy, E. Bazin, and K. Belkhir Mutation hot spots in mammalian mitochondrial DNA Genome Res., February 1, 2006; 16(2): 215 - 222. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. L. Kosakovsky Pond and S. D. W. Frost A Simple Hierarchical Approach to Modeling Distributions of Substitution Rates Mol. Biol. Evol., February 1, 2005; 22(2): 223 - 234. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Reyes, E. Nevo, and C. Saccone DNA Sequence Variation in the Mitochondrial Control Region of Subterranean Mole Rats, Spalax ehrenbergi Superspecies, in Israel Mol. Biol. Evol., April 1, 2003; 20(4): 622 - 632. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Meyer and A. von Haeseler Identifying Site-Specific Substitution Rates Mol. Biol. Evol., February 1, 2003; 20(2): 182 - 189. [Abstract] [Full Text] [PDF] |
||||
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Pesole, G.
- Articles by Saccone, C.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Pesole, G.
- Articles by Saccone, C.






