- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Email this article to a friend
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Savill, N. J.
- Articles by Higgs, P. G.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Savill, N. J.
- Articles by Higgs, P. G.
RNA Sequence Evolution With Secondary Structure Constraints: Comparison of Substitution Rate Models Using Maximum-Likelihood Methods
Nicholas J. Savill1,a, David C. Hoylea, and Paul G. Higgsaa School of Biological Sciences, University of Manchester, Manchester M13 9PT, United Kingdom
Corresponding author: Paul G. Higgs, School of Biological Sciences, University of Manchester, Oxford Rd., Manchester M13 9PT, United Kingdom., paul.higgs{at}man.ac.uk (E-mail)
Communicating editor: S. YOKOYAMA
| ABSTRACT |
|---|
We test models for the evolution of helical regions of RNA sequences, where the base pairing constraint leads to correlated compensatory substitutions occurring on either side of the pair. These models are of three types: 6-state models include only the four Watson-Crick pairs plus GU and UG; 7-state models include a single mismatch state that combines all of the 10 possible mismatches; 16-state models treat all mismatch states separately. We analyzed a set of eubacterial ribosomal RNA sequences with a well-established phylogenetic tree structure. For each model, the maximum-likelihood values of the parameters were obtained. The models were compared using the Akaike information criterion, the likelihood-ratio test, and Cox's test. With a high significance level, models that permit a nonzero rate of double substitutions performed better than those that assume zero double substitution rate. Some models assume symmetry between GC and CG, between AU and UA, and between GU and UG. Models that relaxed this symmetry assumption performed slightly better, but the tests did not all agree on the significance level. The most general time-reversible model significantly outperformed any of the simplifications. We consider the relative merits of all these models for molecular phylogenetics.
THERE are several classes of RNA molecules where sequences are available over a wide range of species and where multiple sequence alignments are well established, e.g., transfer RNA, 5S ribosomal RNA, small and large subunit ribosomal RNA, and ribonuclease P RNA. Secondary structure is strongly conserved over long time periods, indicating that selection is acting to maintain a structure that is essential for the function of these molecules. The helical regions of the molecule are often quite variable in sequence. This shows that the precise sequence of bases within the helical regions is of relatively little importance as long as the positioning of these regions within the secondary structure is correct.
The mode of evolution within the helical regions is via pairs of compensatory neutral mutations; i.e., a mutation on one side of a pair disrupts the structure and is slightly deleterious, but a second mutation of the other side of the pair restores the pairing ability. Compensatory pair changes form the basis of the comparative method for deducing RNA secondary structures (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
The mathematical theory of compensatory mutations was first discussed by ![]()
![]()
![]()
![]()
RNA sequences are often used in constructing molecular phylogenies (e.g., ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
There is now a large variety of slightly different models. The principle aim of this article is to compare these alternatives to see which is best able to describe real sequence data. Some of these models involve a relatively small number of parameters and make assumptions about the symmetry of the rate matrix. This allows analytical solution of the rate equations in several cases. More complex models are straightforward to construct. Increasing the complexity of a model will, in general, improve the quality of the fit to the data. However, very large numbers of parameters are sometimes not justified because the extra parameters simply fit noise in the data rather than any underlying trends. We therefore require statistical techniques for this model selection process. Comparison of models of differing complexity has been carried out for models of single nucleotide substitution by ![]()
The most important issues to be considered when comparing models are introduced at this point. First, we might expect that the frequency of GC pairs should be equal to that of CG, that the frequency of GU should be equal to that of UG, and that the frequency of AU should be equal to that of UA. We refer to this as "base pair reversal symmetry." We wish to determine whether models that allow arbitrary base pair frequencies fit the data significantly better than models that assume base pair reversal symmetry. Note that the first-mentioned base in the pair is the one closer to the 5' end of the molecule. Thus, for example, it is possible to unambiguously distinguish a GC from a CG pair by this rule, and the equivalence of GC and CG pairs does not follow as a trivial point.
Second, we wish to determine how to treat mismatches. The 6- and 7-state models clearly throw away information by ignoring mismatches or by treating them in a simplified way. However, since the mismatch states are rare, it may be that they give very little phylogenetic information in any case, and it may be difficult to estimate the parameters determining the rates of change to and from mismatch states with any accuracy. Hence it is not clear a priori whether 16-state models give any advantage.
A third important issue is whether double substitutions should be permitted in the rate matrix. It is frequently observed that pairs of closely related species differ by a pair of compensatory substitutions; e.g., a GC in one species is replaced by an AU in the other. Since mutation rates are very low in real organisms it is unlikely that these two changes occurred in a single organism in a single generation. There were presumably some individuals with single mutant genotypes (in this case probably GU) at some point in time. From the population genetics viewpoint the compensatory change can happen in two ways. The first is by fixation of the slightly deleterious mutation (i.e., GU sequences rise to a high frequency in the population) followed by fixation of the second mutation, which is now slightly advantageous (i.e., AU sequences arise and replace the GU sequences). The second method is by the compensatory substitution mechanism discussed by ![]()
![]()
![]()
In phylogenetic studies, there is usually only one sequence available for each species, and there is no information available on minor sequence variants that might exist in the population. The substitution rates therefore represent changes in the consensus sequence of the population and do not represent rates of mutation in individual copies of a gene. Thus it is perfectly reasonable to allow double substitutions in the rate matrix, even though double mutations in single genes probably almost never occur. ![]()
![]()
| MATERIALS AND METHODS |
|---|
Definition of models:
A model is defined by the matrix r, where each element, rij, gives the rate of substitution to state j given that the base pair is currently in state i. The theoretical treatment of rate matrices has been developed in the context of single site models (see, for example, ![]()
![]()
![]()
![]() |
(1) |
The diagonal elements of the matrix must satisfy
![]() |
(2) |
to conserve probability, and this constraint is included in the definition of the models. At large times Pij(t) tends to
j, the equilibrium frequency of state j, irrespective of the initial state i. In some models the equilibrium frequencies are parameters of the model; hence when fitting data we need to apply the constraint
![]() |
(3) |
In other models, the frequencies are defined as functions of other parameters, in which case constraint (3) applies automatically. When comparing models it is useful to have a common time scale. We choose the time scale so that an average of one substitution event per base pair happens in 1 time unit, hence the constraint
![]() |
(4) |
This constraint can be imposed by multiplication of all elements of the matrix by a constant factor. In addition, all the models considered here are time reversible; i.e., they satisfy
![]() |
(5) |
Table 1 shows the models tested and summarizes the parameters involved. The number of free parameters in a model is the number of frequency parameters plus the number of rate parameters minus the number of constraints. The constraints are Equation 4, for all models, and Equation 3, where appropriate. The models are assigned identification codes A, B, C, etc. in order of decreasing numbers of free parameters. In all the models, states 16 refer to the principal paired states in the following order: AU, GU, GC, UA, UG, CG. In 7-state models, state 7 is MM. In 16-state models, states 716 refer to the 10 possible mismatch states in alphabetical order.
|
A general reversible model is the most general matrix of a given number of states that satisfies Equation 5 (![]()
![]()
![]()
4 =
1,
5 =
2, and
6 =
3. This gives model 7B. Another possible simplification is to set all the
parameters corresponding to double substitutions to zero, giving model 7C. Changes to and from the MM state are treated as single substitutions and are not set to zero in 7C. ![]()
ij parameters in 7A are simplified to just 4:
s controls the single substitution rate,
d controls the double substitution rate, ß controls the double transversion rate, and
controls substitutions to and from the mismatch state. Following the usual convention, we define a transition as a substitution from one purine to another, or one pyrimidine to another, and a transversion as a substitution from a purine to a pyrimidine or vice versa.
|
The three models 7B, 7C, and 7D are nested in model 7A; i.e., the simpler models are special cases of the more general model obtained by setting some parameters equal or some parameters to zero. Further simplifications of these models are possible. If the double substitutions are set to zero in 7D, we obtain 7E. If base pair reversal symmetry is imposed on 7D, we obtain 7F. The relationship between all these models is shown in Fig 2, where an arrow indicates that the model at the head of the arrow is nested in the model at the tail.
|
The 6-state models are similar to the 7-state models, except that they lack the MM state. Model 6A is the general reversible 6-state model. Models 6B and 6C are obtained by eliminating the MM state from models 7D and 7F, respectively. Model 6D is obtained by setting double transitions to zero in 6C. These 6-state models form a simple nested series, as shown in Fig 2. Models 6C and 6D were originally proposed by ![]()
In principle we could define a general reversible 16-state model with 134 free parameters; however, we do not believe such a complex model would be practical, and we have not attempted this. To facilitate comparison between the 6- and 7-state models and the 16-state models we have introduced models 16A and 16C, which are similar in spirit to model 7D. The full matrix for 16A is shown in Fig 3. There are 16 frequency parameters for the 16 states. The rate parameters for the 6 principal states are the same as those in 7D. Rates of single substitutions to and from mismatch states are controlled by a parameter
, and rates of single substitutions between mismatch states are controlled by a parameter
. Model 16C further simplifies the treatment of mismatches by setting the frequencies of all 10 mismatches to a single parameter
m. Models 16A and 16C are the only 16-state models that allow a nonzero rate of double substitutions.
|
In the model proposed by ![]()
j if states i and j differ by a single substitution and zero otherwise. To apply Equation 4 we introduce an extra factor µ, so that rij = µ
j, and then scale µ to satisfy (4). This model, termed 16B, is identical to that of ![]()
![]()
![]()
that enables the GU and UG pairs to have intermediate frequencies. The full rate matrix for 16D is given in Fig 4. The equilibrium frequency
XY of a base pair XY is related to the frequencies of the two bases
X and
Y by
XY = 
X
Y
2 if X and Y form a Watson-Crick pair;
XY = 
X
Y
2 if X and Y are GU or UG;
XY = 
X
Y if X and Y form a mismatch. The constant
is determined by
![]() |
(6) |
|
This model reduces to 16E if
= 1 and to 16F if
=
. In addition, model 16E reduces to model 16H if all the four bases have equal frequency and there is no difference between transitions and transversions.
![]()
The maximum-likelihood calculation:
A set of eubacterial small subunit ribosomal RNA sequences was obtained from the rRNA database (![]()
![]()
![]()
![]()
|
The sequence alignments and the positions of the conserved secondary structures given by ![]()
![]()
![]()
![]()
![]()
When testing the models we assumed that the topology of the tree was fixed, but not the branch lengths. For each model we obtained the parameter set that maximized the likelihood of observation of the given sequences on the given tree. Methods of calculating the likelihood are discussed by ![]()
![]()
![]()
When using the six-state models, a decision had to be made as to what to do with the mismatch states that do actually occur in the real sequences. Positions where there are many mismatches do not count as paired sites in the consensus secondary structure in the database, and therefore these sites are not considered. However, there remain sporadic mismatches occurring rather randomly throughout the sequence alignment in positions where almost all the other sequences are properly paired. We did not wish to discard the complete column of data from the sequence alignment, simply because a single sequence in the set had a mismatch at that position. Therefore, any mismatch states that occurred were treated as being the same as the Watson-Crick pair that has the same 5' base; i.e., AA, AC, and AG are treated as AU, while GA and GG are treated as GC, etc. In this way the mismatch states are distributed roughly equally between the four main states.
The maximum-likelihood values of the parameters for each model were calculated as follows. An initial estimate of the state frequencies was obtained by measuring the average frequencies in the data. The initial rate matrix was estimated by calculating the frequency of changes of states between pairs of sequences. The individual rates were estimated by calculating the number of differences of each type between sequence pairs and normalizing using Equation 4. The rates were then averaged over all pairs of sequences. The initial times were estimated by finding the maximum-likelihood divergence time for pairs of sequences given the initial estimation of the rate matrix.
Once initial values of the parameters had been estimated an iterative algorithm was used to calculate the maximum likelihood of the data. The algorithm was iterated until the maximum likelihood converged (typically 10002000 iterations but this depends on the model being studied). At each iteration a small random change was made to a single rate parameter and a single frequency parameter. Pij(t) was solved by numerical integration of Equation 1 and the maximum-likelihood values of the times were calculated by a hill climbing method in time-space. If the new value of the maximum likelihood was greater than the best value found so far, the new parameter values were retained. Otherwise, the changes made to the rates and frequencies were discarded. Checks were made that the algorithm converged to the same optimal parameter values from different starting points. For some of the models we also checked the results from our optimization algorithm against results obtained using a simple simulated annealing algorithm (![]()
![]()
Statistical tests:
Having estimated the maximum-likelihood parameter set for each of the models, we have a likelihood value L for each model, which is the likelihood of the given sequences on the given tree assuming the optimized values of the parameters. In general a larger L indicates a better fit of the model to the data; however, in choosing between models it is not sufficient to select the one with the highest L. There is a tendency for models with more parameters to give higher L values. In fact, in cases where models are nested, the one with the larger number of parameters will always give the higher L. However, the use of additional parameters is sometimes not justified statistically.
A criterion often used to compare models is the Akaike information criterion (AIC; ![]()
The likelihood-ratio test (LRT) makes a direct comparison between two models H0 and H1, where H0 is the simpler model nested within the more general model H1. If L0 is the likelihood of the data according to H0, and L1 is the likelihood of the data according to H1, then the LRT proceeds by calculating the logarithm of the likelihood ratio:
= ln(
). Even if H0 is perfectly valid
will still be positive, since H1 has a larger number of parameters with which to fit the data. Theory shows (![]()
will be distributed according to a
2 distribution with the number of degrees of freedom equal to the difference in the number of parameters between the two models. As a significance test we can calculate the probability P that 2
from the
2 distribution will be greater than the observed value of 2
. A small P indicates that the observed result is unlikely to occur by chance if H0 is an adequate model, and hence that H1 is a significantly better fit to the data. A large P indicates that introduction of the extra parameters in H1 does not significantly improve the fit given by H0.
The proofs of the AIC and LRT rely on the asymptotic assumption, i.e., that there is a very large amount of data. For the case of phylogenetic methods this means that the sequences should be extremely long. ![]()
![]()
for two models as in the LRT. Although the distribution of
cannot be calculated analytically if the asymptotic results do not apply, it can still be simulated numerically. After fitting the real data and calculating the maximum-likelihood parameters according to both models, a large number of sets (typically 100) of simulated sequences are generated using the model H0. Each of the simulated sets is then refitted using both models, and a histogram of
values is obtained for the simulated sets. The significance probability P is the fraction of the simulated sets that have
higher than the observed value for the real data. This test is to be preferred over the other two since it involves no assumptions on the distribution of
; however, it requires very much greater computer time. In cases where the tree topology is not known, it is usual to estimate the maximum-likelihood tree topology for both the real data and each set of simulated data when doing the Cox test. We did not do this. When fitting the simulated data the tree topology was kept fixed while the branch lengths and rate parameters were varied, in the same way as was done for fitting the real data. As long as the same procedure is used for fitting the real and the simulated data, the statistical test is valid.
The adequacy of the most general models cannot be tested by comparison to any of the other models. However, they can be compared to a nonparametric (NP) model (![]()
![]() |
(7) |
where NC is the number of occurrences of the pattern C in the aligned sequences. We carried out the Cox test for model 7A vs. the seven-state NP model.
| RESULTS |
|---|
The values of the log-likelihood and the AIC statistical test as well as the optimal phylogenetic tree times for each model are shown in Table 2. Since the lowest AIC is to be preferred, we see that the general reversible 6-state model 6A is the best of the 6-state models, the general reversible 7-state model is the best of the 7-state models, and model 16A is the best of the 16-state models. Note that AIC values cannot be compared between the three groups of models because the states are different, and the sequence data are treated in a different way. The more possible states there are, the smaller the likelihood of change between any two individual states. Hence models with more states have lower likelihoods and higher AICs, but this does not tell us anything about the relative merits of the three groups of models.
|
Table 3 shows the outcomes of the statistical tests between the model pairs. Tests are carried out for pairs of models linked by arrows in Fig 2. The number of degrees of freedom (d.f.) for each test is also given. The significance values for the LRT are P values obtained from a
2 table. For all but two of the Cox tests, 100 replicates were performed. In many cases there were no simulated
values greater than the real value, and we quote this as a significance of P < 1/100. For cases where there were n simulated values higher than the real value out of m replicates we have quoted P < (n + 1)/m. Where the level of significance P was small and the outcome of the Cox test critical in affecting our final choice of model, we have performed a larger number of replicates to reduce the expected error in the estimated level of significance obtained from the Cox test. This was the case for the comparison of 16A and 7A, where we performed 270 replicates, and for the comparison of 7B and 7A, where we performed 200 replicates.
|
We discuss the 6- and 7-state models together, since the conclusions are very similar. The question of whether base pair reversal symmetry is valid is addressed by the comparison 7B-7A in Table 3. The more general model gives a significantly better fit with P < 0.025 according to the LRT and a marginally significant better fit with P < 0.055 according to the Cox test. The distributions are shown in Fig 6. Once again the more general model has the lower AIC and consequently overall we consider 7A to be a better model than 7B. The same question is also addressed by the comparisons 7F-7D and 6C-6B. In both these cases the more general model has the lower AIC, suggesting that we should not make the assumption of base pair symmetry. However, the two tests for 7F-7D and 6C-6B give P = 4 or 5%, which is only marginally significant. ![]()
|
The comparison 7C-7A tests whether double substitution rates may be set to zero. The answer is clearly no: 7A is a better fit than 7C with very significant P values according to both pairwise tests. Also, the AIC is lower for 7A than 7C; hence all three tests are in agreement. The histogram of
values from Cox's test is shown in Fig 6, in comparison to the 1/2
2 distribution (expected according to the LRT) and to the real value in the data (denoted by an arrow). It can be seen that the real value is completely outside the range of the distribution, hence P is very much <1% and it would require many more than 100 replicates to estimate a true P value. The question of zero vs. nonzero rates of double substitutions is also addressed by the comparison of 7E-7D, in which the parameters
d and ß are set to zero, and 6D-6C in which
d is set to zero (note that ß cannot be zero in the 6-state model, otherwise the states are divided into two inaccessible subsets of three). In both these two comparisons the model with the nonzero rates gives a much better fit (very low P values) and also gives a lower AIC.
Model 7D is of interest because it is the most complex of the models for which an analytical solution is available (![]()
Even though the general model 7A outperforms the alternatives, comparison with the nonparametric model in Table 3 indicates that it is still not an adequate description of the data. A likely reason for this is that we have assumed equal rates of substitution at each site. Relaxing this assumption may give better models but also may increase the number of parameters to fit. Comparison with the NP model is a very stringent test, and it seems unlikely that any reasonably tractable model would ever pass the test when applied to real sequence data. This test is rather unhelpful since it tends to reject models without proposing any better alternative.
One point that can be seen for all the results with the 6- and 7-state models is that the Cox test with numerical simulation of the
distribution always gives very similar results to the much simpler LRT. The conclusion reached on significance is the same in every case, and the simulated distributions differ rather little (if at all) from the 1/2
2 distributions assumed by the LRT. Therefore it would seem that the LRT is an appropriate test for analysis of these sequences despite the original doubts that there may not be sufficient data to be in the asymptotic regime. It was also found that the Cox test on the 16-state models was extremely slow (note that every comparison requires 200 runs of the maximum-likelihood program). For these two reasons we did not perform a Cox test for every pair of 16-state models and decided to rely on the results of the LRT. The Cox test was only performed for the 16D-16A and 16A-16D comparisons, for which the LRT is not valid because the models are not nested.
The two best 16-state models according to the AIC are 16A and 16C. These are the two models that have nonzero rates of double substitution. The model of ![]()
d and ß to zero and
s =
=
). Model 16A is significantly better than 16B by the LRT and the AIC. Model 16C also performs less well than 16A, although it is still better than any of the other 16-state models. In model 16C the frequencies of all the mismatch states are set equal to one another. This result shows that there is a significant improvement in the likelihood if all the mismatch frequencies are allowed to vary independently.
Of the two models 16E and 16F proposed by ![]()
= 13.41 and
= 2.93. The fact that
is closer to 1 than to
indicates that the GU and UG pairs are closer in behavior to the mismatch pairs than to Watson-Crick pairs according to this model.
The AIC suggests that all three models 16D, 16E, and 16F perform less well than 16A and 16C. This cannot be checked by the LRT because the models are not nested. We therefore performed a Cox test between 16D and 16A [in Table 3, with the result that 16A is significantly better than 16D (P < 0.01)]. The simulated distribution of
values is shown in Fig 6, from which it can be seen that the majority of simulated values are negative; i.e., if 16D were the true model, then 16A would usually not fit the data better than 16D. In contrast,
is positive for the real data, indicating that it is better explained by 16A. This test can also be performed in the reverse direction by using 16A to simulate the data. In this case
is negative for the real data, and none of the simulated data give a
as low as this (i.e., P < 100/100). Thus if 16A were the true model then it would be highly unlikely that the difference in likelihood between the two models would be as high as it is. From this point of view we can conclude that 16A is not an entirely adequate description of the real data. Nevertheless it is still better than any of the alternatives considered.
Through use of the Cox test we can compare the performance of models with differing numbers of states. We have performed Cox tests between the best models in each class, i.e., between 16A and 7A, and between 16A and 6A. We have also compared 16A with 7D and 6B, since these two models have exactly the same form of the matrix for the principal states as 16A. The tests give P values between 7 and 59%; i.e., there is no evidence for rejecting 16A. The 7% value for the 16A-7A comparison is the smallest of these four values, which gives some degree of support to choosing 7A as the single best model. It is unfortunate that these tests cannot be performed in the reverse direction. If 7-state models are used to generate the data, then the data will contain MM states that cannot be treated properly in 16-state models or in 6-state models. If 6-state models are used to generate the data, there will be no mismatch states of any kind, hence there will be no point in fitting the simulated data to 7-state and 16-state models. Thus we cannot reject 16A in favor of any of the other models, but, equally, we cannot reject the others in favor of 16A. These tests are therefore not very helpful in deciding how many states to use.
When choosing a rate model for molecular phylogenetics, it is not just the likelihood value that needs to be taken into account, but also the speed of calculation. Sixteen-state models require considerably more time for likelihood calculations; therefore, since we have been unable to demonstrate a clear advantage for any of the 16-state models, we propose not to use them in our future phylogenetic studies. Our preference is for model 7A, from the results of the above tests, and model 7D, since this is the best of the models that is analytically solvable. The analytical solution again allows savings in computer time in ML methods and easy calculations of pairwise distances for use in distance matrix methods.
The ML values of the seven branch lengths of the tree are given in Table 4. Since the rates are normalized so that there is one substitution event per unit time on average, these branch lengths are the mean number of substitution events per base pair on each branch. Here an "event" is either a single or double substitution. The sets of branch lengths are almost identical for the first three 6-state models, but for 6D they are all much larger. This is because 6D disallows double substitutions; hence wherever there have been compensatory changes this counts as two changes in state. In the other models a double substitution usually occurs as a single step. The same effect shows up in the 7-state models, where the branch lengths are apparently much larger in models 7C and 7E where double substitution rates are zero. The times are also larger for 7-state models and corresponding 6-state models (e.g., 7A and 6A) because the 7-state models count changes to and from mismatch states that are not counted in 6-state models. In terms of relative branch lengths, however, there is not much difference between the 6- and 7-state models. It is the relative values of the lengths that are most important, because the absolute values only have a meaning if a molecular clock calibration is used to assign times to the different branch points on the tree (which we have not attempted with these data). For the 16-state models, there is considerable difference in the branch lengths according to the different models. This is another reason why we prefer the 6- and 7-state models to the 16-state models.
|
| DISCUSSION |
|---|
The work in this article was begun as a statistical support for the study of RNA helix evolution by ![]()
|
It can be seen that GC and CG pairs have low mutabilities and high frequencies, that AU and UA pairs have moderate frequencies and mutabilities, and that GU, UG, and MM pairs have low frequencies and high mutabilities. The order of the base frequencies is the same as that of the thermodynamic stability of stacking interactions. This shows that sequences are selected to increase the thermodynamic stability of the secondary structure. It can also be seen that rates of double transitions are large; e.g., AU to GC and UA to CG are actually higher than the rates of the single transitions AU to GU and UA to UG. This shows that the single-step compensatory mutation mechanism is occurring frequently in these sequences. The five sequences used here are a subset of the rRNA-1 set used by ![]()
![]()
The ranking order of the models given by the tests used here applies only to the particular set of sequences used and should therefore be treated with some caution. However, we believe that very similar selective effects are occurring in helical regions of many types of RNA, as was discussed fully by ![]()
Although it is clear from the statistical analysis that general models such as 7A give significantly higher likelihoods than analytically tractable models like 7D, the analysis does not say why this is. However, it is not difficult to see why there should be a lack of symmetry in the rate matrix. First, real molecules may be subject to mutational bias, such that new bases do not arise with equal frequency. Second, there are many different selective effects. As discussed above, we believe that double substitutions are occurring via a single-step compensatory mechanism. The rate at which this occurs is strongly influenced by the fitness of the intermediate state. Double transitions between Watson-Crick pairs occur via GU and UG intermediates, whereas double transversions have to pass via true mismatches like GG or CC. True mismatches have a much greater destabilizing effect on the helix and therefore presumably a much lower fitness. This accounts for the slow rate of double transversions with respect to double transitions. The thermodynamics of stacking interactions is far from symmetric (see ![]()
We became aware of two other models for RNA helix evolution after the analysis of the models in this article was almost complete. M. SCHÖNIGER and A. VON HAESELER (personal communication) have considered another 16-state model that reduces to the HKY model for single sites (![]()
![]()
![]()
![]()
In summary, the results presented here compare a large number of different models to describe RNA sequence evolution. The fact that general reversible models perform best shows that the real sequences are not well described by the simple symmetric assumptions used in some of the previously proposed models. The statistical tests show that we are justified in using large numbers of rate parameters in the model. The helical regions of RNAs are potentially very informative in phylogenetic work. Having a matrix that accurately models the substitution process in these regions should allow reliable phylogenetic inferences to be drawn. There are a large number of potential applications of these rate matrices in constructing phylogenies from RNA sequences.
| FOOTNOTES |
|---|
1 Present address: Department of Mathematics, Heriot-Watt University, Edinburgh EH14 4AS, United Kingdom. ![]()
| ACKNOWLEDGMENTS |
|---|
This work was supported by the UK Biotechnology and Biological Sciences Research Council.
Manuscript received July 17, 2000; Accepted for publication September 11, 2000.
| LITERATURE CITED |
|---|
COX, D. R., 1962 Further results on tests of families of alternate hypotheses. J. R. Stat. Soc. B 24:406-424.
FELSENSTEIN, J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376[Medline].
FELSENSTEIN, J., 1995 Phylip (Phylogeny Inference Package), version 3.5c.
GATESY, J., C. HAYASHI, R. DESALLE, and E. VRBA, 1994 Rate limits for pairing and compensatory change: the mitochondrial ribosomal DNA of antelopes. Evolution 48:188-196.
GAUTHERET, D., D. KONINGS, and R. R. GUTELL, 1995 GU base pairing motifs in ribosomal RNA. RNA 1:807-814[Abstract].
GOLDMAN, N., 1993 Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182-198[Medline].
GUTELL, R. R., 1996 Comparative sequence analysis and the structure of 16S and 23S RNA, pp. 1527 in Ribosomal RNA: Structure, Evolution, Processing, and Function in Protein Biosynthesis, edited by R. A. ZIMMERMANN and A. E. DAHLBERG. CRC Press, Boca Raton, FL.
HASEGAWA, M., H. KISHINO, and T. YANO, 1985 Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174[Medline].
HIGGS, P. G., 1998 Compensatory neutral mutations and the evolution of RNA. Genetica 102(103):91-101.
HIGGS, P. G., 2000 RNA secondary structure: physical and computational aspects. Q. Rev. Biophys. 30(3).
IIZUKA, M. and M. TAKEFU, 1996 Average time until fixation of mutants with compensatory fitness interaction. Genes Genet. Syst. 71:167-173.
KIMURA, M., 1985 The role of compensatory neutral mutations in molecular evolution. J. Genet. 64:7-19.
KIRBY, D. A., S. V. MUSE, and W. STEPHAN, 1995 Maintenance of pre-mRNA secondary structure by epistatic selection. Proc. Natl. Acad. Sci. USA 92:9047-9051
KIRKPATRICK, S., 1984 Optimization by simulated annealing: quantitative studies. J. Stat. Phys. 34:975-986.
KIRKPATRICK, S., C. D. GELATT, and M. P. VECCHI, 1983 Optimization by simulated annealing. Science 220:671-680
KNUDSEN, B. and J. HEIN, 1999 RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 15:446-454
LEIPE, D., and V. SOUSSOV, 1995 NCBI taxonomy browser. http://www.ncbi.nlm.nih.gov/Taxonomy.
LI, W. H., 1997 Molecular Evolution. Sinauer Associates, Sunderland, MA.
LI, W. H. and X. GU, 1996 Estimating evolutionary distances between DNA sequences. Methods Enzymol. 266:449-459[Medline].
LINHART, H., and W. ZUCCHINI, 1986 Model Selection. John Wiley & Sons, New York.
MAIDAK, B. L., J. R. COLE, C. T. PARKER, JR., G. M. GARRITY, and N. LARSEN et al., 1999 A new version of the RDP (ribosomal database project). Nucleic Acids Res. 27:171-173. http://www.cme.msu.edu/RDP/(
MUSE, S., 1995 Evolutionary analyses of DNA sequences subject to constraints on secondary structure. Genetics 139:1429-1439[Abstract].
OLSEN, G. J. and C. R. WOESE, 1993 Ribosomal RNA: a key to phylogeny. FASEB J. 7:113-123[Abstract].
OTSUKA, J., T. NAKANO, and G. TERAI, 1997 A theoretical study of nucleotide changes under a definite functional constraint of forming stable base pairs in the stem regions of ribosomal RNAs. J. Theor. Biol. 184:171-186[Medline].
OTSUKA, J., G. TERAI, and T. NAKANO, 1999 Phylogeny of organisms investigated by the base pair changes in the stem regions of small and large ribosomal subunit RNAs. J. Mol. Evol. 48:218-235[Medline].
ROUSSET, F., M. PELANDAKIS, and M. SOLIGNAC, 1991 Evolution of compensatory substitutions through GU intermediate state in Drosophila rRNA. Proc. Natl. Acad. Sci. USA 88:10032-10036
RZHETSKY, A., 1995 Estimating substitution rates in ribosomal RNA genes. Genetics 141:771-783[Abstract].
SCHÖNIGER, M. and A. VON HAESELER, 1994 A stochastic model for the evolution of autocorrelated DNA sequences. Mol. Phylogenet. Evol. 3:240-247[Medline].
STEPHAN, W., 1996 The rate of compensatory evolution. Genetics 144:419-426[Abstract].
SWOFFORD, D. L., G. J. OLSEN, P. J. WADDELL and D. M. HILLIS, 1996 Phylogenetic inference, pp. 407514 in Molecular Systematics, Ed. 2, edited by D. M. HILLIS. Sinauer Associates, Sunderland, MA.
TILLIER, E. R. M., 1994 Maximum likelihood with multi-parameter models of substitution. J. Mol. Evol. 39:409-417.
TILLIER, E. R. M. and R. A. COLLINS, 1995 Neighbour joining and maximum likelihood with RNA sequences: addressing the interdependence of sites. Mol. Biol. Evol. 12:7-15.
TILLIER, E. R. M. and R. A. COLLINS, 1998 High apparent rate of simultaneous compensatory base-pair substitutions in ribosomal RNA. Genetics 148:1993-2002
VAN DE PEER, Y., A. CAERS, P. DE RIJK, and R. DE WACHTER, 1998 Database on the structure of small ribosomal subunit RNA. Nucleic Acids Res. 26:179-182. http://rrna.uia.ac.be/ssu(
VAWTER, L. and W. M. BROWN, 1993 Rates and patterns of base change in the small subunit ribosomal RNA gene. Genetics 134:597-608[Abstract].
WADDELL, P. J. and M. A. STEEL, 1997 General time-reversible distances with unequal rates across sites. Mol. Phylogenet. Evol. 8:398-414[Medline].
WHEELER, W. C. and R. J. HONEYCUTT, 1988 Paired sequence difference in ribosomal RNAs: evolutionary and phylogenetic implications. Mol. Biol. Evol. 5:90-96[Abstract].
WOESE, C. R., and N. C. PACE, 1993 Probing RNA structure, function and history by comparative analysis, pp. 91117 in The RNA World, edited by R. F. GESTELAND and J. F. ATKINS. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
YANG, Z. H., N. GOLDMAN, and A. FRIDAY, 1994 Comparison of models for nucleotide substitution used in maximum likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316-324[Abstract].
This article has been cited by other articles:
![]() |
R. R. Stocsits, H. Letsch, J. Hertel, B. Misof, and P. F. Stadler Accurate and efficient reconstruction of deep phylogenies from structured RNAs Nucleic Acids Res., September 1, 2009; (2009) gkp600v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Varriale, G. Torelli, and G. Bernardi Compositional properties and thermal adaptation of 18S rRNA in vertebrates RNA, August 1, 2008; 14(8): 1492 - 1500. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Dohrmann, D. Janussen, J. Reitner, A. G. Collins, and G. Worheide Phylogeny and Evolution of Glass Sponges (Porifera, Hexactinellida) Syst Biol, June 1, 2008; 57(3): 388 - 405. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. L. Thorne, S. C. Choi, J. Yu, P. G. Higgs, and H. Kishino Population Genetics Without Intraspecific Data Mol. Biol. Evol., August 1, 2007; 24(8): 1667 - 1677. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. Kosiol, I. Holmes, and N. Goldman An Empirical Codon Model for Protein Sequence Evolution Mol. Biol. Evol., July 1, 2007; 24(7): 1464 - 1479. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. M. F. Merlo, M. Lunzer, and A. M. Dean An empirical test of the concomitantly variable codon hypothesis PNAS, June 26, 2007; 104(26): 10938 - 10943. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. Gowri-Shankar and M. Rattray A Reversible Jump Method for Bayesian Phylogenetic Inference with a Nonhomogeneous Substitution Model Mol. Biol. Evol., June 1, 2007; 24(6): 1286 - 1299. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. M. Kjer, J. J. Gillespie, and K. A. Ober Opinions on Multiple Sequence Alignment, and an Empirical Comparison of Repeatability and Accuracy between POY and Structural Alignment Syst Biol, February 1, 2007; 56(1): 133 - 146. [Full Text] [PDF] |
||||
![]() |
S. L. Kosakovsky Pond, F. V. Mannino, M. B. Gravenor, S. V. Muse, and S. D. W. Frost Evolutionary Model Selection with a Genetic Algorithm: A Case Study Using Stem RNA Mol. Biol. Evol., January 1, 2007; 24(1): 159 - 170. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. L. Kosakovsky Pond, D. Posada, M. B. Gravenor, C. H. Woelk, and S. D. W. Frost Automated Phylogenetic Detection of Recombination Using a Genetic Algorithm Mol. Biol. Evol., October 1, 2006; 23(10): 1891 - 1901. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Baele, J. Raes, Y. Van de Peer, and S. Vansteelandt An Improved Statistical Method for Detecting Heterotachy in Nucleotide Sequences Mol. Biol. Evol., July 1, 2006; 23(7): 1397 - 1405. [Abstract] [Full Text] [PDF] |
||||
![]() |
V. Gowri-Shankar and M. Rattray On the Correlation Between Composition and Site-Specific Evolutionary Rate: Implications for Phylogenetic Inference Mol. Biol. Evol., February 1, 2006; 23(2): 352 - 364. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Dutheil, T. Pupko, A. Jean-Marie, and N. Galtier A Model-Based Approach for Detecting Coevolving Positions in a Molecule Mol. Biol. Evol., September 1, 2005; 22(9): 1919 - 1928. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Sengupta and P. G. Higgs A Unified Model of Codon Reassignment in Alternative Genetic Codes Genetics, June 1, 2005; 170(2): 831 - 840. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. J Telford, M. J Wise, and V. Gowri-Shankar Consideration of RNA Secondary Structure Significantly Improves Likelihood-Based Estimates of Phylogeny: Examples from the Bilateria Mol. Biol. Evol., April 1, 2005; 22(4): 1129 - 1136. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. S. Pedersen, R. Forsberg, I. M. Meyer, and J. Hein An Evolutionary Model for Protein-Coding Regions with Conserved RNA Structure Mol. Biol. Evol., October 1, 2004; 21(10): 1913 - 1922. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Pagel and A. Meade A Phylogenetic Mixture Model for Detecting Pattern-Heterogeneity in Gene Sequence or Character-State Data Syst Biol, August 1, 2004; 53(4): 571 - 581. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Whelan and N. Goldman Estimating the Frequency of Events That Cause Multiple-Nucleotide Changes Genetics, August 1, 2004; 167(4): 2027 - 2043. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. A. Castoe, T. M. Doan, and C. L. Parkinson Data Partitions and Complex Models in Bayesian Analysis: The Phylogeny of Gymnophthalmid Lizards Syst Biol, June 1, 2004; 53(3): 448 - 469. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Iwasa, F. Michor, and M. A. Nowak Stochastic Tunnels in Evolutionary Dynamics Genetics, March 1, 2004; 166(3): 1571 - 1579. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. D. Smith, T. W. H. Lui, and E. R. M. Tillier Empirical Models for Substitution in Ribosomal RNA Mol. Biol. Evol., March 1, 2004; 21(3): 419 - 427. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Jameson, A. P. Gibson, C. Hudelot, and P. G. Higgs OGRe: a relational database for comparative analysis of mitochondrial genomes Nucleic Acids Res., January 1, 2003; 31(1): 202 - 206. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. C. Hoyle and P. G. Higgs Factors Affecting the Errors in the Estimation of Evolutionary Distances Between Sequences Mol. Biol. Evol., January 1, 2003; 20(1): 1 - 9. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Jow, C. Hudelot, M. Rattray, and P. G. Higgs Bayesian Phylogenetics Using an RNA Substitution Model Applied to Early Mammalian Evolution Mol. Biol. Evol., September 1, 2002; 19(9): 1591 - 1601. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Caetano-Anolles Tracing the evolution of RNA structure in ribosomes Nucleic Acids Res., June 1, 2002; 30(11): 2575 - 2587. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. B. Carlini, Y. Chen, and W. Stephan The Relationship Between Third-Codon Position Nucleotide Content, Codon Bias, mRNA Secondary Structure and Gene Expression in the Drosophilid Alcohol Dehydrogenase Genes Adh and Adhr Genetics, October 1, 2001; 159(2): 623 - 633. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Innan and W. Stephan Selection Intensity Against Deleterious Mutations in RNA Secondary Structures and Rate of Compensatory Nucleotide Substitutions Genetics, September 1, 2001; 159(1): 389 - 399. [Abstract] [Full Text] [PDF] |
||||
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Email this article to a friend
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Savill, N. J.
- Articles by Higgs, P. G.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Savill, N. J.
- Articles by Higgs, P. G.


















