- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Tillier, E. R. M.
- Articles by Collins, R. A.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Tillier, E. R. M.
- Articles by Collins, R. A.
High Apparent Rate of Simultaneous Compensatory Base-Pair Substitutions in Ribosomal RNA
Elisabeth R. M. Tilliera and Richard A. Collinsaa Department of Molecular and Medical Genetics, University of Toronto, Toronto, Ontario, Canada M5S 1A8
Corresponding author: Richard A. Collins, Department of Molecular and Medical Genetics, Faculty of Medicine, University of Toronto, 1 Kings College Circle, Toronto, Ontario, Canada M5S 1A8, rick.collins{at}utoronto.ca (E-mail).
Communicating editor: G. BRIAN GOLDING
| ABSTRACT |
|---|
We present a model for the evolution of paired bases in RNA sequences. The new model allows for the instantaneous rate of substitution of both members of a base pair in a compensatory substitution (e.g., A-U
G -C) and expands our previous work by allowing for unpaired bases or noncanonical pairs. We implemented the model with distance and maximum likelihood methods to estimate the rates of simultaneous substitution of both bases,
d, vs. rates of substitution of individual bases,
s in rRNA. In the rapidly evolving D2 expansion segments of Drosophila large subunit rRNA, we estimate a low ratio of
d/
s, indicating that most compensatory substitutions involve a G -U intermediate. In contrast, we find a surprisingly high ratio of
d/
s in the core small subunit rRNA, indicating that the evolution of the slowly evolving rRNA sequences is modeled much more accurately if simultaneous substitution of both members of a base pair is allowed to occur approximately as often as substitution of individual bases. Using simulations, we have ruled out several potential sources of error in the estimation of
d/
s. We conclude that in the core rRNA sequences compensatory substitutions can be fixed so rapidly as to appear to be instantaneous.
ALTHOUGH ribosomal RNA genes have been extensively used for inferring the phylogeny of distantly related sequences (for reviews see ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
To address the complexities imposed by RNA secondary structure, several authors have recently proposed different probability-based models for the evolution of paired bases in RNA sequences (![]()
![]()
![]()
![]()
![]()
U-A) are still assumed to be the result of two independent events; their probabilities in a small amount of time are essentially nil and are therefore disregarded (![]()
![]()
![]()
In contrast, we have developed a model for the evolution of double-stranded RNA sequences that allows for the complete evolutionary dependence by permitting the simultaneous substitution of both members of a base pair (![]()
![]()
Here, we use our expanded model to estimate the rates of substitution of base pairs and of individual bases, using a large data set of published sequences of small subunit (SSU) rRNA. We cannot estimate absolute values of the rate parameters irrespective of time because, in most cases, the true time of divergence is unknown, but we can estimate their relative rates. Of particular interest is the ratio of the rate of the simultaneous substitution of both members of the base pair to the rate of substitution to and from the G -U transitional intermediate. This quantity addresses the question of whether G -U is necessarily an evolutionarily stable intermediate in the substitution between G -C
A-U pairs. We find that this ratio is substantially greater than zero; in fact, the evolution of rRNA sequences is modeled far more accurately if simultaneous substitution of both members of a base pair is allowed to occur approximately as often as substitution of individual bases.
| THE MODEL |
|---|
The description of the probability of substitution of any base combination to any other is mathematically complicated, involving a 16 x 16 matrix and potentially as many rate parameters as there are types of substitutions. Different authors have used different simplifying assumptions to limit the number of independent rate parameters to three or fewer (![]()
![]()
![]()
![]()
d, from A-U to G -C, even when single base transitions at a rate
s through a G -U intermediate were possible. Because all single transversion mutations would disturb base pairing, all transversions were considered double compensatory substitutions, occurring at an instantaneous rate ß.
To more accurately describe the evolution of real rRNA sequences, the model previously presented (![]()
![]()
7. Substitutions to or from O -T base combinations are allowed at an equal rate
. We call this model the Other RNA Model, or OTRNA. This model reduces to our previous model (![]()
![]()
7 = 0. A diagram of the OTRNA Model is given in Figure 1, and the transition probability matrix, derived as in ![]()
|
|
| ESTIMATING THE RATIO OF DOUBLE TO SINGLE SUBSTITUTION RATES |
|---|
Distance estimate:
The transition probability matrix allows us to estimate the parameters in the model multiplied by the time of divergence from the observed number of differences between any two sequences. The use of a probability model corrects for multiple substitutions at a given site over time ( ![]()
![]() |
(1) |
s as defined in the legend to Figure 1. From Equation 1, we can obtain the new distance measure with this model because
![]() |
(2) |
The approximate variances of these rate estimates due to sampling, as well as the approximate variance for the ratio of any two of these estimates, were also obtained. For example, the variance for
d/
s is given by
![]() |
(3) |
To estimate
d/
s, we considered a large data set consisting of 473 bacterial and archaebacterial SSU rRNA [from the Ribosomal Database Project (RDP); ![]()
![]()
d/
s. In order to limit the number of comparisons, and because there are wide variations in the base-pair composition of the sequences, suggesting that the base composition of the data set is not at equilibrium, we only considered the pair-wise comparisons of sequences that had similar base-pair composition (i.e., where the base-pair frequencies of A-U, G -U and G -C did not differ by more than 5% between the two sequences in the comparison). The results from the 1044 comparisons are shown in Table 1 and Figure 3.
|
|
The estimated
d/
s ratio (1.45 ± 0.77 standard deviation, SD) is substantially greater than zero, and even larger than one. The sample SD is given but it is an underestimate of the true SD and cannot be used to test the significance of that mean. This is because the 1044 estimates are not independent, because the same sequence can be used more than once for different comparisons and because the phylogeny existing between these sequences is disregarded. The variance in the estimate is extremely high at low distances (<0.05) due to the lack of a sufficient number of substitutions to provide an accurate estimate. If the assumption of the OTRNA Model is met by these data, our finding that the rate of apparent simultaneous substitutions of base pairs is greater than the rate of single transition substitutions suggests that a large proportion of double transitions do not go through a G -U intermediate or that the G -U intermediate is lost before reaching fixation.
The other interesting rate that can now be estimated with the new model is the instantaneous rate of substitution from unpaired to paired bases (and vice versa) relative to the rate of double transversions. This ratio, ß/
, is estimated to be about 1.15 (SD 0.78). The variance in this estimate is high, and the quantity is difficult to estimate due to the rarity of transversions. Another difficulty with this quantity is that
is a mixed rate to and from the various O -T base pairs, with most due to transversions and one due to transitions (by the C -A pair).
Possible sources of error in the distance estimate of
d/
s:
Inadequate data set:
The data set used was large (473 sequences) and quite varied, consisting of sequences from the Eubacteria and Archaebacteria and covering the whole possible range of genetic distances. Because the variance in the estimate of
d/
s increases substantially with lower distances, and to determine whether there was any difference in the estimate of
d/
s from Eukaryotic sequences, we examined two more data sets. The first consisted of 61 Eubacterial sequences, all enteric bacteria aligned to the E.coli sequence. The second set consisted of 65 Eukaryotic sequences, all fungal ascomycetes, aligned to the S. cerevisiae sequence and reference secondary structure (![]()
d/
s for the bacterial data set was 1.53 (SD 0.8) and 1.24 (SD 0.64) for the ascomycetes (Table 1). Again, the variance in the estimate is high at the lower distances (not shown). These estimates do not vary substantially from the one previously found for the larger data set of 473 sequences (1.45, SD 0.77), increasing our confidence that
d/
s is greater than zero across all kingdoms.
Lack of consideration of the phylogeny in the distance estimate of
d/
s:
Because the estimation procedure used here does not take into consideration the phylogeny, it was necessary to determine whether the approach was valid. This question was addressed by computer simulations. 200 simulations were performed using the RNA Model as in the method described previously (![]()
d/
s of 0 (the value expected if
d = 0) or 1, which is closer to the value that was approximated from the data. The value of
d/ß was set to two for these simulations because that is approximately the estimate for this quantity from the real data. The value of
d/
s was then estimated from the sequences by considering all pair-wise comparisons as was done with the real data. The results in Table 2 show that the estimate of
d/
s with this method is indeed a very accurate estimate of the value used to generate the data, even with a small number of sites. Even in the worst-case scenarios of very large distances and few sites, the simulations gave good estimates in situations which would give a high apparent number of double transitions that could possibly lead to an overestimation of
d/
s, suggesting that the pair-wise comparison approach, although crude, is nevertheless a valid one.
|
|
Unknown mode of evolution:
The real mode of evolution for rRNA sequences is of course unknown, and the OTRNA Model of evolution may be not be an accurate model of real evolution. Others (![]()
![]()
![]()
d/
s? We used the ![]()
![]()
![]()
, such that the rate of substitution from Watson-Crick pairs to all others is proportional to 1/
and the reverse substitution rate proportional to
. The rate of simultaneous substitution of both members of a dinucleotide pair is zero. We used the simplest version of the model where G -U is considered unpaired, and all base frequencies are equal. This model will yield a larger proportion of Watson-Crick pairs with increasing values of
. We set
= 3 for our simulations, which leads to expected equilibrium frequencies for A-U and G -C base pairs (
8 and
10) equal to 0.375, for G -U (
9) equal to 0.0417 and all others (
7) equal to 0.208. Any higher value for
leads to a frequency of G -U too low to allow an accurate estimate of
s (not shown). The sequences generated were analyzed with the OTRNA Model to estimate
d/
s. The result from 200 simulations is shown in Table 2, where the estimate of
d/
s is found to be essentially zero (-0.044 SD 0.017). The estimate is actually slightly negative because sequences evolved under the MUSE model show fewer double changes that would be expected to be observed had sequences evolved under our model (even with
d = 0). The simulations therefore show that our method is robust, as they do not overestimate the value of
d/
s, even when the sequences have evolved following a very different model that does not allow for instantaneous double substitutions.
Variable selection against G -U base pairs:
Although the method seems to accurately take into account multiple substitutions at the sites, an uneven amount of selection against G -U base pairs at different sites in the sequence could create a systematic bias in the estimate of
d/
s. If some sites allow G -Us and others do not at all, then there are two types of sites in the molecule; sites at which G -Us are allowed and where transitions proceed mostly through G -U intermediates (type I sites), and sites where all substitutions are double (type II sites). We chose to investigate the degree of error in the estimate resulting from variable selection against G -U pairs using only the RNA Model because the effect would be more severe when O -T base pairs (particularly A- C, the other possible transitional intermediate) are not allowed. At type II sites, this model is equivalent to HASEGAWA's two-parameter model (![]()
d, the rate of transversion is ß, and where we can have A-U, G -C, U-A or C- G rather than A, G, U or C at each site. Treating type I and type II sites separately allows for more accurate estimates of the parameters in the models if the number in each type of site is large enough.
It was possible to analytically determine the expected error in the estimate of
d/
s. This was performed by finding the expression for
d/
s when type I and type II sites are considered separately, and comparing it to the expression for the
d/
s obtained when the sites are all considered together. The relative error (the difference between the overestimate and the true value, divided by the latter) is plotted in Figure 5 against an increasing number of Type II sites and an increasing expected distance Kd between the sequences. The error and its standard deviation were also obtained by simulation. The graph shows that the relative error in the estimate of
d/
s is always less than 10%.
|
Maximum likelihood estimate:
An independent way of estimating
d/
s is with Maximum Likelihood. The likelihood ratio test, in which we compare a general model (either the RNA Model or the OTRNA Model) to a restricted model constructed by setting the
d/
s ratio to a constant. Interestingly, the restricted model, where
d/
s is fixed, does not have the mathematical properties of the more general model as described in ![]()
Ten eubacterial sequences (Escherichia coli, Buchera aphidicola, Citrobacter freundii, Erwinia herbicola, Serratia marcescens, Hafnia alvei, Rahnella aquatilis, Yersinia enterocolitica, Proteus vulgaris, and Plesiomonas shigelloides strain M51; all enteric bacteria), chosen to be similar to the E. coli reference sequence to minimize changes in structure, were analyzed with the Maximum Likelihood approach. Another set of sequences, all fungal ascomycetes (Saccharomyces cerevisiae, Blastomyces dermatitidis, Coccidioides immitis, Aspergillus fumigatus, Aureubasidium pullulans, Podospora anserina, Neurospora crassa, Colletotrichum gloesporioides, Torulaspora delbrueckii, and Schizosaccharomyces pombe ; aligned with the reference secondary structure of S. cereisiae), was also analyzed to determine whether the
d/
s estimate would also be large in the eukaryotic kingdom. The phylogenies and secondary structures for both sets were assumed to be those in the RDP database. The results of the likelihood ratio tests between the restricted models and the general model are plotted in Figure 6 for increasing values of the (fixed)
d/
s. Also plotted in Figure 6 are the results of a similar analysis performed using the OTRNA Model on two data sets obtained by simulation as in ![]()
d/
s equal to either 0 or 2. The 95% confidence level is also shown on the graph, indicating the level below which the restricted model is not statistically different from the general model in explaining the data (i.e., the values when the fixed
d/
s yields a maximum likelihood that is not significantly different when
d/
s is allowed to vary).
|
The curves for the analyses of both the bacterial and eukaryotic sequence data clearly show that
d/
s > 0 because fixing it at zero gives a significantly worse likelihood in all cases. This was observed no matter whether O -T base pairs are considered or not. The fixed
d/
s that yielded the minimum value for the likelihood ratio with the eubacterial data set was at 2.3 using the RNA Model and 2.5 using the OTRNA Model. This is very close to the estimate of 2.3 obtained with the pair-wise distance method described above on this same data set modeled using the OTRNA Model. For the eukaryotic sequences, the minimum value of the likelihood ratio is found to occur at an
d/
s value of 0.9, also close to the value of 1.22 obtained using the pair-wise distance method. It is remarkable that the minima found for these curves are at or below the 95% confidence level, which implies that the
d/
s estimates are quite uniform from branch to branch, such that a uniformly imposed value is statistically justifiable.
The simulation curves in Figure 6 show that the method is valid, as the minima for these curves are very close to the value of
d/
s used in generating the data. Particularly telling is the extremely rapid rise in the likelihood ratio with increasing value of the estimated
d/
s when this ratio was actually set at zero in the simulation. This shows the likelihood method's ability to accurately estimate
d/
s when this quantity is zero. The minima in the curves for the simulated data sets fall well below the 95% confidence level as would be expected with simulated data being not as variable as real sequences, because in this case the only source of variation is due to sampling.
Interestingly, the value of
d/
s is estimated to be slightly higher when the OTRNA Model is used instead of the RNA Model, a phenomenon that is also observed when the distance method is used (data not shown). We would have expected the reverse result, as the OTRNA Model allows for the other transitional intermediate, C-A, whereas the RNA Model does not. The phenomenon may simply be because, when O -T base pairs are included, the relative frequency of G -U pairs is lowered and thus serves to increase the estimate of
d.
Degree of independence and the effective number of sites:
From the value of the rate parameters in the model we can obtain an estimate of the effective number of sites Ke, i.e., the length of an equivalent sequence to the one considered if all sites were evolving independently.
![]() |
(4) |
For the OTRNA Model, we have
![]() |
(5) |
| DISCUSSION |
|---|
The models we have proposed allow for instantaneous rates of simultaneous substitution of both members of a base pair. It is often assumed that such rates, describing the probability of two concurrent unlikely events, are small enough to be negligible (![]()
![]()
![]()
We investigated several potential sources of error in estimating these rates, including the possibility that single base substitutions, over long periods of time, along with a low frequency of G -U base pairs, could lead to an overestimate of the rate of double transitions, particularly when the phylogenetic relationship between the sequences is disregarded. From simulation results (Figure 6 and Table 2), using sequences evolved under our model and that of ![]()
d/
s with several sets of actual sequence data gave very similar results:
d/
s is much greater than zero.
The OTRNA Model, although more general than the previously proposed RNA Model (![]()
![]()
Another simplification of our model is the assumption that substitutions from paired sites to unpaired sites (or to G -U) and vice versa are considered to have the same rate (either
s or
) in either direction. In contrast, the models of ![]()
![]()
The present study and those of ![]()
![]()
![]()
![]()
![]()
![]()
The models proposed here for the double-stranded regions could be used in conjunction with other models for the single-stranded regions of the RNA in distance and maximum likelihood analyses for phylogenetic purposes. ![]()
![]()
![]()
d/
s is indeed low (0.17 SD 1.8), which agrees with the interpretation that the majority of A-U to G -C substitutions in these sequences involves a G -U intermediate. On the other hand, our analysis on the core of SSU rRNA sequences revealed a high
d/
s (approximately
1), suggesting that the majority of compensatory substitutions between Watson-Crick base-pairs does not involve a stable G -U intermediate. Our results provide direct support for the suggestion by ROUSSET et al. that conclusions based on rapidly evolving expansion segments should not be extrapolated to the slowly evolving core segments of rRNA. Although the secondary structure of the expansion segments appears to be conserved in Drosophila species, noncompensatory substitutions may be more tolerated by natural selection and therefore more evolutionarily stable than in the core rRNA. The degree of selective constraint on secondary structure will determine the evolutionary stability of noncompensatory, single substitutions. Indeed, we find that for the Drosophila D2 expansion segments, ki = 0.86, indicating that the members of the base-paired sites in these sequences are on average evolving more independently than in core SSU rRNA sequences (where ki = 0.65). ROUSSET et al. conclude from their study that a model of compensatory substitution that allows G -U intermediates is correct, but we find that such a model is too restrictive if it does not also allow for double substitutions that do not require any stable intermediate.
The rates estimated in this paper are substitution rates and therefore do not yield information on which, if any, of the many evolutionary forces (mutation, selection or drift) is dominant in bringing about the rapid rates of compensatory base-pair substitutions that are observed. High compensatory substitution rates could possibly be due to an increased mutation rate due to the palindromic nature of the stems in the RNA and, thus, in the coding DNA sequence, and to the phenomenon of templated mutations (see ![]()
![]()
![]()
| ACKNOWLEDGMENTS |
|---|
We thank two anonymous reviewers for their constructive comments. Parts of this work were supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) postgraduate scholarship to E.T. Support was also provided by an NSERC research grant to R.C. R.C. is a fellow of the Canadian Institute for Advanced Reserach Program in Evolutionary Biology.
Manuscript received September 29, 1997; Accepted for publication December 17, 1997.
| LITERATURE CITED |
|---|
DIXON, M. T. and D. M. HILLIS, 1993 Ribosomal RNA secondary structure: compensatory mutations and implications for phylogenetic analysis. Mol. Biol. Evol. 10:256-267[Abstract].
GOLDING, G. B., 1987 Nonrandom patterns of mutation are reflected in evolutionary divergence and may cause some of the unusual patterns observed in sequences, pp. 151172 in Genetic Constraints on Adaptive Evolution, edited by V. LOESCHCKE. Springer-Verlag, Berlin.
GUTELL, R. R., 1994 Collection of small subunit (16S- and 16S-like) ribosomal RNA structures. Nucleic Acids Res. 22:3502-3507
HASEGAWA, M., H. KISHINO, and T. YANO, 1985 Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174[Medline].
HILLIS, D. M. and M. T. DIXON, 1991 Ribosomal DNA: molecular evolution and phylogenetic inference. Q. Rev. Biol. 66:411-453[Medline].
JUKES, T. H., and C. R. CANTOR, 1969 Evolution of protein molecules, pp. 21132, in Mammalian Protein Metabolism, edited by H. N. MUNRO. Academic Press, New York.
KIMURA, M., 1985 The role of compensatory neutral mutations in molecular evolution. J. Genet. 64:7-9.
MAIDAK, B. L., G. J. OLSEN, N. LARSEN, R. OVERBEEK, and M. J. MCCAUGHEY et al., 1996 The Ribosomal Database Project. Nucleic Acids Res. 24:82-85
MUSE, S. V., 1995 Evolutionary analyses of DNA sequences subject to constraints on secondary structure. Genetics 139:1429-1439[Abstract].
NOLLER, H. F., 1984 Structure of ribosomal RNA. Annu. Rev. Biochem. 53:119-162[Medline].
OLSEN, G. J. and C. R. WOESE, 1993 Ribosomal RNA: a key to phylogeny. FASEB J. 7:113-123[Abstract].
ROUSSET, F., M. PÉLANDAKIS, and M. SOLIGNAC, 1991 Evolution of compensatory substitutions through a G -U intermediate state in Drosophila rRNA. Proc. Natl. Acad. Sci. USA 88:10032-10036
RZHETSKY, A., 1995 Estimating substitution rates in ribosomal RNA genes. Genetics 141:771-783[Abstract].
SCHONIGER, M. and A. VON HAESELER, 1994 A stochastic model for the evolution of autocorrelated DNA sequences. Mol. Phyl. Evol. 3:240-247[Medline].
STEPHAN, W., 1996 The rate of compensatory evolution. Genetics 144:419-426[Abstract].
TILLIER, E. R. M., 1994 Maximum Likelihood with multi-parameter models of substitution. J. Mol. Evol. 39:409-417.
TILLIER, E. R. M. and R. A. COLLINS, 1995 Neighbor Joining and Maximum Likelihood with RNA sequences: addressing the interdependence of sites. Mol. Biol. Evol. 12:7-15.
VAWTER, L. and W. M. BROWN, 1993 Rates and patterns of base change in the small subunit ribosomal RNA gene. Genetics 134:597-608[Abstract].
WHEELER, W. C. and R. L. HONEYCUTT, 1988 Paired sequence difference in ribosomal RNAs: evolutionary and phylogenetic implications. Mol. Biol. Evol. 5:90-96[Abstract].
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Tillier, E. R. M.
- Articles by Collins, R. A.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Tillier, E. R. M.
- Articles by Collins, R. A.










