- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Goldman, N.
- Articles by Jones, D. T.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Goldman, N.
- Articles by Jones, D. T.
Assessing the Impact of Secondary Structure and Solvent Accessibility on Protein Evolution
Nick Goldmana, Jeffrey L. Thorneb, and David T. Jonesca Department of Genetics, University of Cambridge, Cambridge CB2 3EH, United Kingdom,
b Program in Statistical Genetics, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695-8203
c Department of Biological Sciences, University of Warwick, Coventry CV4 7AL, United Kingdom
Corresponding author: Nick Goldman, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, n.goldman{at}gen.cam.ac.uk (E-mail).
Communicating editor: G. B. GOLDING
| ABSTRACT |
|---|
Empirically derived models of amino acid replacement are employed to study the association between various physical features of proteins and evolution. The strengths of these associations are statistically evaluated by applying the models of protein evolution to 11 diverse sets of protein sequences. Parametric bootstrap tests indicate that the solvent accessibility status of a site has a particularly strong association with the process of amino acid replacement that it experiences. Significant association between secondary structure environment and the amino acid replacement process is also observed. Careful description of the length distribution of secondary structure elements and of the organization of secondary structure and solvent accessibility along a protein did not always significantly improve the fit of the evolutionary models to the data sets that were analyzed. As indicated by the strength of the association of both solvent accessibility and secondary structure with amino acid replacement, the process of protein evolutionboth above and below the species levelwill not be well understood until the physical constraints that affect protein evolution are identified and characterized.
WIDELY used models of sequence evolution provide notoriously poor fits to actual sequence data (e.g., ![]()
![]()
![]()
![]()
![]()
![]()
Although their assumptions may be violated and there may be statistical grounds for rejecting them, simple models of sequence evolution may still be adequate for investigating the questions to which they are often applied. In population genetics, the infinite sites model may be good enough for testing the neutral hypothesis of ![]()
Nevertheless, the limitations of widely used models of sequence evolution often prevent more refined and informative questions from being addressed. A key to modelling and understanding the evolutionary process is identification and characterization of the constraints that evolution "perceives" as proteins diverge. Selective constraints on protein structure would be expected to give rise to associations between patterns of amino acid replacement and structure. In this article, we extend our earlier approach (![]()
There is a tradition of studies (e.g., ![]()
![]()
![]()
![]()
![]()
![]()
![]()
We assume that each site in a protein belongs to one of several categories. The categories need to be predetermined, but the category to which each site belongs does not need to be prespecified. In this study, we have four types of secondary structure:
-helix, ß-sheet, turn, and coil. For each secondary structure type, we further divide sites into those that are relatively exposed to solvent and those that are buried.
Having two accessibility classes, buried (b) and exposed (e), for each of the four secondary structure types (H, E, T, C) gives rise to a model with protein sites that belong to one of eight categories (Hb, He, Eb, Ee, Tb, Te, Cb, Ce). We use a hidden Markov model (HMM) approach (see ![]()
![]()
![]()
![]()
![]()
| MATERIALS AND METHODS |
|---|
BRKALN database:
Fixed parameters of the model were estimated as described below from a database of structure-related amino acid sequence alignments. The BRKALN database maintained by D.T. Jones (unpublished results) contains amino acid sequences classified into families of closely related sequences for which the tertiary structure of at least one member has been experimentally determined. The database is built by extracting nonhomologous sequences from the Brookhaven Protein Databank (PDB; ![]()
![]()
![]()
![]()
Secondary structure assignments and solvent accessibility scores are calculated for the protein of known structure with the DSSP program (![]()
The DSSP assignments that occur in the BRKALN database are classified as follows: H, helix; E, sheet; S and T, turn; all others (i.e., B, ".", G, and I), coil. The decision to classify S and T as a separate turn category instead of as coil (as in ![]()
10%). Relative accessibility was estimated by using a fully extended Gly-X-Gly tripeptide as the reference state. This resulted in approximately half of the sites in known structures of the BRKALN database being classified as buried and half as exposed. There is scope for more refined solvent accessibility schemes to be explored in the future.
Replacement processes:
Each combination of secondary structure and accessibility status is associated with a particular amino acid replacement process. As is the case for most widely used models of nucleotide substitution and amino acid replacement, our models of amino acid replacement are Markovian with respect to time. To specify the process of evolution for a specific replacement category k that is one of the eight in our model, there are parameters
kij , the relative rate of change from type i to j. Slight modifications (![]()
![]()
![]()
![]()
![]()
kij on observed amino acid replacement counts. These counts are made by comparing pairs of sequences in the BRKALN database that are 85% or more identical and by recording the number of times for replacement category k that amino acid type i is observed in one sequence of the pair and type j is observed at the corresponding site in the other.
Hidden Markov model:
Neither the secondary structure nor the solvent accessibility at one site in a protein is independent of that at nearby sites. In fact, secondary structures at adjacent sites are strongly positively correlated. Accessibility status at adjacent sites is also positively correlated, although less strongly so. We adopt a hidden Markov model (HMM) to describe the organization of secondary structure and accessibility status along amino acid sequences. The states of the model correspond to the underlying but unobserved (hence hidden) secondary structure and accessibility. Transitions among the states are modelled with a Markov process.
A very simple HMM would ignore accessibility status and only describe the organization of the four secondary structure types along a protein sequence. A consequence would be that the number of sites in each uninterrupted stretch of a secondary structure would be geometrically distributed. Figure 1AD, compares the length distributions predicted under this simple model for the four secondary structure types with the empirically observed frequencies in the 207 known structures of the BRKALN database. For the secondary structure type coil (Figure 1D), the observed and predicted length distributions are reasonably close. For the other three secondary structure types, the observed distribution and that predicted under this simple HMM are quite different. For example, the geometric distribution necessarily has length one as its mode, but it is impossible for an
-helix to consist of just one residue.
|
To provide a better fit between observed and predicted distributions, our HMM has states for the first, second, etc., positions in a secondary structure. We illustrate this with sheets. To represent the ith position of a sheet for i
{1,2,...,5}, we use Ebi or Eei, respectively, when the accessibility status is buried or exposed. The states Eb6 and Ee6 denote buried and exposed sites in position six or greater of a sheet.
We constrain the HMM to enter a sheet only through either the Eb1 or Ee1 states. Once in state Ebi or Eei (i
{1,2,...,5}), the HMM must continue by progressing to either Ebi+1 or Eei+1 or by leaving the sheet states (and entering a helix, turn, or coil state). In other words, the HMM must progress through the sheet states in an ordered fashion (but may switch among the buried and exposed states) until such time as it leaves the sheet. Once in the states Eb6 or Ee6, the HMM can remain there (with the possibility of switching between these two states) or can leave the sheet. This arrangement of 12 sheet states is illustrated in Figure 2A. The manner of controlling the distributions of secondary structure lengths is described below.
|
The choice of six position-specific sheet states each for buried and exposed was made with a likelihood ratio testing procedure that took into account the number of parameters being estimated and the improvement in goodness-of-fit as the number of position-specific states increased. We use nE to represent the number of position-specific states for each of buried and exposed sheet. Similar considerations suggested that the helix states Hb and He each be expanded into nH = 10 states (Hb1, Hb2,..., Hb10 and He1, He2,..., He10) and that the turn states Tb and Te be expanded to Tb1, Tb2, Te1, and Te2 (nT = 2). Only two coil states (denoted Cb1 and Ce1) are in the HMM (nC = 1) because coil represents a collection of diverse minor secondary structure elements (including but not limited to loops) that together have approximately a geometric length distribution. In summary, this led to 38 hidden states comprising 10 buried and 10 exposed helix states, six buried and six exposed sheet states, two buried and two exposed turn states, and one buried and one exposed coil state.
The model requires that all sites with a particular secondary structure and accessibility status experience the same amino acid replacement process, regardless of relative position within their secondary structure element. For example, a buried site that begins a helix (Hb1) and a buried site in the third position of a helix (Hb3) share the same replacement process. Thus, each of the 38 HMM states corresponds to a particular one of the eight amino acid replacement categories.
The most natural way to estimate transition probabilities (
ij) among the 38 HMM states is to examine amino acid sequences of known structure, count how many times a site in state i is followed by a site in state j, and divide this count by the number of times sites in state i are followed by sites in any category. We felt there were insufficient data in the BRKALN database to make reliable estimates of all transition parameters in this manner. To reduce the number of parameters without a great sacrifice in the utility of the model, we make assumptions that we believe are reasonable albeit not strictly correct. Because these assumptions define the form of transition probabilities in our model, they are described here in detail and are illustrated in Figure 2.
If the current state is Xyi (where X is one of H, E, T, C; y is b or e; and i = 1,2,...,nX), the probability that the next state does not have secondary structure X is assumed to be independent of the current accessibility status (y). This probability is denoted gXi for i = 1,2,...,nX - 1 (and X
C) and 1 - rX for i = nX.
The probabilities gXi can be fixed so that the expected proportions of secondary structure X elements with given lengths between 1 and nX - 1 can take any specified values. If the desired proportion of elements of length i
{1,2,...,nX - 1} is fXi (and defining fX0 = 0 for all X), then setting gXi =
satisfies these length distribution requirements. We have estimated the fXi as the observed proportions of secondary structure elements of type X that have length exactly i in the BRKALN database and applied the above formula for the gXi.
The probabilities rX determine the expected length distributions of secondary structures X for lengths i
nX. Effectively, we model the tail of the length distribution with a geometric distribution. After selecting the value of nX, we estimate the parameters rX by the values that equate the observed mean lengths of secondary structure X elements in the BRKALN database and the expected mean lengths. We note that our estimates of gXi and rX are maximum likelihood estimates for the case where the probabilites of lengths less than nX are each represented by an individual parameter, and the probabilites of lengths nX or greater are determined by attaching a geometric tail to the length distribution. A comparison between the observed and predicted distributions of secondary structure lengths is given in Figure 1DG.
Given that the secondary structure W of the next state differs from the secondary structure X of the current state Xyi, we assume that W and the next accessibility status z = b or e are independent of the value of i. The conditional probabilities of transitions from Xyi to Wz1 are denoted aXy,Wz and were estimated directly from the known structures in the BRKALN database by their observed relative frequencies.
Given that the secondary structure of the next state is identical to that of the current state Xyi, we assume the accessibility status of the next state is independent of i. The probability that the next site is buried, given that the current site is buried and both sites have secondary structure X, is represented by pXb . The complementary probability qXb = 1 - pXb is then the probability that the next site is exposed, given that the current site is buried and both sites have secondary structure X. Similarly, pXe is the probability that the next site is exposed, given that the current site is exposed and both are secondary structure X. We use qXe to represent the complementary probability 1 - pXe . These probabilities are again estimated directly from the observed frequencies in the experimentally determined structures of the BRKALN database.
Relative rates of amino acid replacement:
One way to compare the eight estimated amino acid replacement processes is to determine the relative rate at which sites in the different replacement categories evolve. We normalize rates so that the average site evolves at rate 1. First, note that the rate at which amino acid i is replaced in category k is
j
i
kij , the sum of the replacement rates
kij over all amino acids that are not i. Second, the overall replacement rate for category k is
i
ki
j
i
kij , which accounts for amino acid frequencies by weighting the amino acid-specific rates by
ki , the frequency of amino acid i in category k. Table 1 displays the estimated values of the
ki . Finally, different replacement categories k have different stationary probabilities
k, determined for our HMM by the equilibrium distribution of the matrix
ij. Accounting for the variation in stationary probabilities among replacement categories, the relative rate of replacement for category k is
![]() |
(1) |
|
Relative replacement rates associated with a particular secondary structure averaged over accessibility classifications can also be estimated. For example, the relative rate for helices is
![]() |
(2) |
Similarly, relative rates for each accessibility classification averaged over secondary structures (e.g., R·b) can be estimated. These estimates are shown in Table 2.
|
A concern is that one or a few proteins in the BRKALN database could potentially have an especially large impact on the relative rate estimates. The potential for a small number of protein families to have a great impact exists because the rate estimates are based upon amino acid replacement counts from pairwise sequence comparisons. Protein families with many sites or many sequences will tend to generate higher counts that families consisting of a few short sequences. If these influential protein families tend to evolve via comparatively atypical evolutionary processes, the general applicability of our evolutionary model would be restricted. To investigate this, we employed a nonparametric bootstrap resampling procedure (![]()
A decision that we had to make when analyzing the pairwise replacement count data was how to treat data from sites that had unknown secondary structure or accessibility because they were insertions relative to the experimentally determined structure. One possibility was to ignore these data. Another was to add these replacement count data to the exposed coil (Ce) replacement category because of the observed tendency for insertion and deletion events to affect exposed coils. As the relative rate estimates in Table 2 exhibit, the treatments yield similar outcomes. All other results presented here are based on estimating rates by ignoring sites that are insertions relative to known structures.
Phylogenetic tree:
Typically the form (topology and branch lengths) of the phylogenetic tree relating a set of sequences is unknown but of interest. In this case, it may be treated as a parameter of the problem and estimated using statistical methods. This is the approach that we adopt.
In protein structure prediction problems, it is now realized that evolutionarily related amino acid sequences should not be treated as though they are statistically independent of one another (![]()
![]()
![]()
![]()
![]()
Likelihood calculations:
After estimating the relative rates of replacement
kij and the HMM transition probabilities
ij from the BRKALN database, we can calculate the likelihood for a candidate phylogenetic tree T (representing both topology and branch lengths). We do not re-estimate the
kij or
ij during the likelihood calculations but instead fix them at their estimated values. We denote the aligned data set by S, its length (number of amino acids) by N, the first i columns of the data set by Si, and the ith column itself by si. In the equations below, many of the probabilities are actually conditional upon
kij and
ij but, for the sake of clarity, we omit
kij and
ij when this is feasible. The likelihood of the tree T is given by Pr(S|T), and this is calculated via the terms Pr(Si,ci|T) for each possible secondary structure category ci at site i using the iteration:
![]() |
(3) |
![]() |
(4) |
c1,Ce = 1 if c1 = Ce, and 0 otherwise. This has proved slightly superior to using the stationary distribution (
i) to describe the secondary structure probabilities at the first site of sequences (results not shown), in which case Pr(S1, c1|T) is set equal to Pr(s1|c1,T)
c1 . When completed, the iteration gives the required Pr(S|T) because
![]() |
(5) |
To obtain the maximum likelihood estimate T^, we use numerical optimization algorithms (e.g., ![]()
-helix.
Prediction of protein secondary structure can subsequently be performed with a plug-in approach that fixes the tree at T^ and uses appropriate methods to calculate the posterior distribution Pr(ci|S,T^). This has been described in more detail for a simpler HMM of sequence evolution by ![]()
Model comparisons:
When a statistical comparison indicates that one evolutionary model is superior to a simpler model, this implies that features possessed by the complicated model and absent from the simple model may be evolutionarily important. This approach has been taken in the past to compare models of DNA sequence evolution (![]()
![]()
![]()
In this study, we investigate models that range from the simplicity of sites evolving identically to the relative complexity of the full version of our new HMM in order to test the significance of the different components of the new model. We introduce the following notation to label the model variants. If the number of different secondary structure types recognized is ss, the number of solvent accessibility classes distinguished is acc, and the total number of states in the HMM is hmm, then we can label models as ss/acc/hmm+ (+ indicating the use of the HMM to model dependencies between adjacent sites) or ss/acc/hmm- (- indicating the disabling of the HMM; see below). Using this notation, our new HMM is denoted 4/2/38+. The following models were also considered:
1/1/1-:
This model represents the simplest case, in which no information on protein structure or nonindependence of sites is incorporated (hence, ss = 1 average secondary structure category, acc = 1 average accessibility state, and no meaningful HMM is possible). This corresponds to currently available methods implemented in software such as PAML (![]()
![]()
![]()
![]()
4/2/8+: This model adds to the 1/1/1- model an HMM recognizing four secondary structure elements, each with two accessibility categories. No special allowance is made for the distributions of lengths of secondary structures (which therefore follow a purely geometric distribution under this model).
4/1/19+: This model adds to the 1/1/1- model an HMM recognizing four secondary structure elements and making allowance for the distributions of lengths of these structures but does not recognize different solvent accessibilities.
For all the above models, the methods for estimating model parameters varied slightly from those described above for the 4/2/38+ model because not all of the parameters of that model are meaningful for simpler versions. In all cases, methods analogous to those described above for the 4/2/38+ model were used to estimate appropriated parameters from the BRKALN database (but see above regarding the 1/1/1/D- and 1/1/1J- models); full details are available from the authors.
4/2/38-:
This model is very similar to the 4/2/38+ model, but sites are treated as independent of one another. This independence is achieved by replacing each row of the matrix (
ij) with that matrix's equilibrium distribution (
i).
As with the 4/2/38+ model (above), the 4/2/8+ and 4/2/38- models are implemented with the first site in a protein forced to be an exposed coil. For the 4/1/19+ model, accessibility status is not considered, and we simply treat the first site as a coil.
| ANALYSIS OF EXAMPLE DATA SETS |
|---|
Data sets:
We analyzed 11 data sets that encompass a broad range of genes, organisms, and evolutionary divergence. When possible, we utilized previously published amino acid sequence alignments to reduce the chances that results could be biased by our own alignment procedures or prejudices. In addition, we selected sequences that were sufficiently similar to be likely to share the same secondary structure.
Gapped positions in the alignments were treated as missing information, as they are in the maximum likelihood programs of the PHYLIP package (![]()
ADP-glucose pyrophosphorylase (abbreviated to ADPGP):
Four plant sequences were used, as analyzed by ![]()
HIV-1 gp120 envelope glycoprotein (GP120): One sequence was selected from each of the eight major subtypes (AH) of HIV-1.
HIV-1 p17 matrix protein: Sequences were selected from a number of strains of human immunodeficiency virus type 1 (HIV-1). Two data sets were studied, the first (P17ALL) comprising one sequence from each of the seven subtypes (AD, FH) of HIV-1 for which significantly different p17 sequences have been recognized and the second (P17B) comprising eight sequences from HIV-1 subtype B.
Sucrose synthase (SUSY):
Four dicotyledonous plant sequences were used, as analyzed in a preliminary study of the effects on protein structure on sequence evolution by ![]()
Xylanase (XYLA):
Seven prokaryotic sequences were used, as analyzed in a preliminary study of phylogeny- and likelihood-based methods of protein secondary structure prediction by ![]()
The following five data sets were derived from multiple sequence alignments deposited at the EMBL-European Bioinformatics Institute (EBI) and available electronically from ftp://ftp.ebi.ac.uk/pub/databases/embl/align. In each case, minor alterations were made by hand.
Alcohol dehydrogenase:
Two data sets were formed from the alignment available from the EBI ftp server, file ds14642.dat (see also ![]()
Glutamate dehydrogenase (GDH):
Eleven sequences, with both eubacterial and eukaryotic representatives, were selected from the alignment file ds20281.dat (see also ![]()
G protein
subunit (GPA):
![]()
subunit sequences (file ds15369.dat). We selected 18 Gi
and Go
sequences from mammals, Drosophila melanogaster, and Caenorhabditis elegans.
Phosphoenolpyruvate carboxykinase (PEPCK):
We have used the alignment of 18 Lepidopteran sequences studies and submitted (file ds24063.dat) by ![]()
Model comparisons:
Table 3 contains maximum log-likelihoods obtained when analyzing each of the above 11 data sets with models ranging from the most simple (1/1/1-, 1/1/1D-, 1/1/1J-) to the most complex (4/2/38+). To better understand which specific features of our models are most responsible for improved fits, we performed a series of parametric bootstrap analyses on each data set. Each analysis was a comparison of a relatively simple model (the null hypothesis H0) with a model that considers more aspects of protein structure (the alternative hypothesis HA).
|
We use a likelihood-ratio test with test statistic
l, the maximum log-likelihood for the alternative hypothesis minus the maximum log-likelihood for the null hypothesis. For the data sets analyzed here, this statistic can be computed from the appropriate entries in Table 3. To approximate the distribution of
l under the null hypothesis, 100 simulated data sets are produced. The null model of sequence evolution is used, along with the maximum likelihood topology and branch lengths estimated under the null hypothesis for the original data set, to generate each simulated data set. The simulated data sets have the same number of taxa and are the same length as the original data set. If a residue at a particular site in the data set has unknown type in the original data set because of alignment uncertainty or gaps, the residue at this position is also considered unknown in the simulated data set. For each simulated data set,
l can be calculated via likelihood maximization under the null and alternative hypotheses. If the observed value of
l for the original data set is sufficiently extreme relative to the distribution of simulated values, then the null hypothesis can be rejected. One measure of extremity is given by the proportion of simulated test statistic values that equal or exceed the actual value. This proportion is an estimate of the probability under H0 of realizing a value of
l at least as extreme as that observed for the original data. Sufficiently low values imply rejection of H0. Another measure of extremity is a z-score, calculated by subtracting the mean simulated test statistic value from the actual value and then dividing by the sample standard deviation of the simulated values.
In our parametric bootstrap comparisons, we do not account for uncertainty in parameters governing the relative rates of amino acid replacement or organization of secondary structure and solvent accessibility. Estimates for these parameters are fixed at the values obtained from the BRKALN database. Only the topology and branch lengths are estimated from the simulated data sets. Ideally, the parametric bootstrap procedure would account for the uncertainty in all parameters when comparing models, but this would be computationally expensive.
Parametric bootstrap comparisons between some models are more informative than those between others. By combining evolution with protein structure, four potentially important features that we have incorporated in the 4/2/38+ model are (1) association of secondary structure and amino acid replacement dynamics, (2) association of solvent accessibility and replacement dynamics, (3) regional organization of secondary structure and solvent accessibility along a sequence, and (4) a relatively realistic length distribution of secondary structure elements. Earlier work (![]()
|
Computation times:
Because of the 38 HMM states and the eight replacement categories, our 4/2/38+ model is more computationally demanding than the 1/1/1- model. Our experience is that the computation time required for analyzing data sets is more sensitive to the number of replacement categories than to the number of HMM states. For this reason, one might expect the 4/2/38+ model to require approximately eight times more computation than the 1/1/1- model. In our experience, the actual ratio of CPU times required by these two models varies somewhat between analyses. As an example, in a case where topology was fixed and branch lengths were estimated, analysis of the GPA data set on a Digital Alphastation 500/400 required 199 seconds of CPU time with the 1/1/1- model and 943 seconds with the 4/2/38+ model.
| DISCUSSION |
|---|
An earlier study (![]()
-helix, ß-sheet, and loop) demonstrated that consideration of secondary structure can significantly improve the fit of models to data. The belief that secondary structure is important is reinforced by the results shown in Table 4. All but one of the comparisons in which the null hypothesis makes no distinction between secondary structures and the alternative employs different Markov models of amino acid replacement for different secondary structures strongly reject the null hypothesis (Table 4, rows 14). Further evidence for an association between secondary structure and amino acid replacement is summarized in Table 2. We note the surprising results that the estimated rates for buried helix and sheet residues are greater than those for buried turns and coils and that the estimated rate for exposed helix residues is greater than those for other exposed residues. Because the probability a site has a particular accessibility status is not independent of its secondary structure, these patterns are less clear for the weighted average over buried and exposed sites. Nevertheless, it is still contrary to intuition that
-helix and ß-sheet positions experience more amino acid replacements on average than coil positions.
Differences in amino acid replacement dynamics associated with solvent accessibility status have been explored by ![]()
The comparison of the 4/2/38- and 4/2/38+ models (Table 4, row 6) tests for evidence of correlation of structure categories along sequences. For some data sets, the null hypothesis of no correlation could not be rejected. This could reflect failure of the 4/2/38+ model to incorporate information from residues that are not close in the linear sequence. Methods for detecting long-range interactions in an evolutionary context are currently being developed (D. D. POLLOCK, unpublished results), and we hope that the information they yield can be incorporated into models used for phylogenetic inference. An alternative explanation for failing to reject the null hypothesis of no correlation is that some data sets may have too few sequences or too little evolutionary divergence to provide relative certainty about underlying structural environments. It is difficult to detect regional organization of structure along a sequence if there is a high degree of uncertainty regarding the underlying structural environment at each site.
We have preliminary indications that secondary structure predictions generated from our 4/2/38+ model are superior to those from the 4/2/38- model. For example, predictions of the 4/2/38- model exhibit an abundance of unrealistically short
-helices and ß-sheets (results not shown). It may be the case that incorporating correlation of structure categories will be important for accurate secondary structure prediction, even for data sets where it has little impact on the goodness of fit of models. This possibility will be a focus of future research.
For six of the 11 data sets, we find that the comparison between the 4/2/8+ and 4/2/38+ models does not reject the former. For these data sets, there is no significant evidence in favor of the more complex HMM that attempts to use information regarding the distribution of lengths of secondary structure elements. Mixed results across different data sets (Table 4, row 7) again make it unclear whether this is a failing of our model or is due to insufficient data. Detection of both structural correlation along sequences and information pertaining to the length distributions of secondary structure elements might be assisted by adding more sequences to a data set. This should in effect increase the information pertaining to each alignment position and make the HMM states "less hidden."
Improved models of sequence evolution can lead to improved estimates of both evolutionary relationships (topology) and distances (branch lengths) in phylogenies (e.g., ![]()
![]()
One result of ![]()
![]()
Fortunately, the two studies can be reconciled by realizing that our work is based on experimentally determined structures that are exclusively globular proteins whereas ![]()
![]()
In this article we have described the analysis of amino acid sequences under the assumption that the protein's true structure is unknown. In the case that the tertiary structure has been determined for a protein, the phylogenetic estimation method can be modified to allow the known secondary structure and accessibility information to be used. This would remove some of the sources of uncertainty and should improve the estimation of phylogenies. In this way, experimentally determined structures could directly assist phylogenetics instead of being ingored, or being used indirectly through their influence on average properties of databases of known structures as in this article.
Our amino acid replacement models for the eight structural environment categories are based only on analysis of database sequences. They are not varied to suit specific proteins. A promising approach for tailoring amino acid replacement processes to specific proteins has been developed by ![]()
All models are potentially misled by violations of their assumptions. Our model assumes that all residues of an alignment are related via the same phylogeny, for example, that recombination is absent, but the HIV-1 data sets we have analyzed have potentially been subject to intra- and interspecific recombination (![]()
![]()
![]()
![]()
One of the most important advances in the reconstruction of evolutionary trees has been the consideration of heterogeneity of evolutionary rates among sequence sites (e.g., ![]()
![]()
We believe that characterizing general associations of patterns and rates of sequence evolution with phenotypic features such as protein structure is essential to understanding the process of sequence evolution both within and between populations. It is the relationship between genotype and phenotype that drives much of evolution. Protein structure is fundamental to phenotype and yet little previous effort has been devoted to characterizing its impact on evolution.
| ACKNOWLEDGMENTS |
|---|
EDWARD HOLMES kindly assisted with the HIV-1 data sets. We also thank DAVID POLLOCK, PIETRO LIÒ, and two anonymous referees for their helpful suggestions. N.G. is supported by a Wellcome Trust Fellowship in Biodiversity Research. J.L.T. was supported by National Institutes of Health grant P01-GM45344 and the Alfred P. Sloan Foundation. D.T.J. is supported by a Royal Society University Research Fellowship. Software, parameter estimates, and data sets used for the analysis described above are available from the authors via http://ng-dec1.gen.cam.ac.uk/hmm/contents.html.
Manuscript received October 5, 1997; Accepted for publication February 17, 1998.
| LITERATURE CITED |
|---|
ADACHI, J., and M. HASEGAWA, 1995 MOLPHY: Programs for Molecular Phylogenetics, Ver. 2.3. Institute of Statistical Mathematics, Tokyo.
ASAI, K., S. HAYAMIZU, and K. HANDA, 1993 Prediction of protein secondary structure by the hidden Markov model. CABIOS 9:141-146
BENNER, S. A., I. BADCOE, M. A. COHEN, and D. L. GERLOFF, 1994 Bona fide prediction of aspects of protein conformation: assigning interior and surface residues from patterns of variation and conservation in homologous sequences. J. Mol. Biol. 235:926-958[Medline].
BERNSTEIN, F. C., T. F. KOETZLE, G. J. B. WILLIAMS, E. F. MEYER, and M. D. BRICE et al., 1977 The protein data bank: a computer-based archival file for macromolecular structures. Eur. J. Biochem. 80:319-324[Medline].
BLEASBY, A. J. and J. C. WOOTTON, 1990 Construction of validated, non-redundant composite protein sequence databases. Prot. Eng. 3:153-159
BROWN, M., R. HUGHEY, A. KROGH, I. S. MIAN. K. SJOLANDER et al., 1993 Using Dirichlet mixture priors to derive hidden Markov models for protein families, pp. 4755 in Proceedings of the First International Conference on Intelligent Systems for Molecular Biology, edited by L. HUNTER, D. SEARLS and J. SHAVLIK. AIII Press, Menlo Park, CA.
BRUNO, W. J., 1996 Modelling residue usage in aligned protein sequences via maximum likelihood. Mol. Biol. Evol. 13:1368-1374[Abstract].
CAO, Y., J. ADACHI, A. JANKE, S. PÄÄBO, and M. HASEGAWA, 1994 Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: instability of a tree based on a single gene. J. Mol. Evol. 39:519-527[Medline].
CHOTHIA, C. and A. M. LESK, 1986 The relation between the divergence of sequence and structure in proteins. EMBO J. 5:823-826[Medline].
CHURCHILL, G. A., 1989 Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol. 51:79-94[Medline].
DAYHOFF, M. O., R. V. ECK and C. M. PARK, 1972 A model of evolutionary change in proteins, pp. 8999 in Atlas of Protein Sequence and Structure, edited by M. O. DAYHOFF. National Biomedical Research Foundation, Washington, DC.
DAYHOFF, M. O., R. M. SCHWARTZ and B. C. ORCUTT, 1978 A model of evolutionary change in proteins, pp. 345352 in Atlas of Protein Sequence and Structure, edited by M. O. DAYHOFF. National Biomedical Research Foundation, Washington, DC.
EFRON, B., and R. J. TIBSHIRANI, 1993 An Introduction to the Bootstrap. Chapman and Hall, New York.
FELSENSTEIN, J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376[Medline].
FELSENSTEIN, J., 1985 Phylogenies and the comparative method. Am. Nat. 125:1-15.
FELSENSTEIN, J., 1995 PHYLIP (Phylogenetic Inference Package), Ver. 3.57. Department of Genetics, University of Washington, Seattle.
FELSENSTEIN, J. and G. A. CHURCHILL, 1996 A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93-104[Abstract].
FRIEDLANDER, T. P., J. C. REGIER, C. MITTER, and D. L. WAGNER, 1996 A nuclear gene for higher-level phylogeneticsphosphoenolpyruvate carboxykinase tracks Mesozoic-age divergences within Lepidoptera (Insecta). Mol. Biol. Evol. 13:594-604[Abstract].
GOLDMAN, N., 1993 Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182-198[Medline].
GOLDMAN, N. and Z. YANG, 1994 A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736[Abstract].
GOLDMAN, N., J. L. THORNE, and D. T. JONES, 1996 Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. J. Mol. Biol. 263:196-208[Medline].
GOTOH, O., 1982 An improved algorithm for matching biological sequences. J. Mol. Biol. 162:705-708[Medline].
HANSEN, J. E., O. LUND, J. O. NIELSEN, S. BRUNAK, and J.-E. S. HANSEN, 1996 Prediction of the secondary structure of HIV-1 gp120. Proteins 25:1-11[Medline].
HARVEY, P. H., and M. D. PAGEL, 1991 The Comparative Method in Evolutionary Biology. Oxford University Press, Oxford, UK.
HASEGAWA, M., H. KISHINO, and T. A. YANO, 1985 Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174[Medline].
JONES, D. T., W. R. TAYLOR, and J. M. THORNTON, 1992 The rapid generation of mutation data matrices from protein sequences. CABIOS 8:275-282
JONES, D. T., W. R. TAYLOR, and J. M. THORNTON, 1994 A mutation data matrix for transmembrane proteins. FEBS Letts. 339:269-275[Medline].
JUKES, T. H., and C. R. CANTOR, 1969 Evolution of protein molecules, pp. 21132 in Mammalian Protein Metabolism, edited by H. N. MUNRO. Academic Press, New York.
KABSCH, W. and C. SANDER, 1983 Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers 22:2577-2637[Medline].
KIMURA, M., 1983 The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.
KOSHI, J. M. and R. A. GOLDSTEIN, 1995 Context-dependent optimal substitution matrices. Prot. Eng. 8:641-645.
LÜTHY, R., A. D. MCLACHLAN, and D. EISENBERG, 1991 Secondary structure-based profiles: use of structure-conserving scoring tables in searching protein sequence databases for structural similarities. Proteins 10:229-239[Medline].
NAYLOR, G. J. P. and W. M. BROWN, 1997 Structural biology and phylogeny estimation. Nature 388:527-528[Medline].
OVERINGTON, J., M. S. JOHNSON, A.
ALI, and T. L. BLUNDELL, 1990 Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc. R. Soc. Lond. Ser B 241:132-145[Medline].
ROBERTSON, D. L., B. H. HAHN, and P. M. SHARP, 1995 Recombination in AIDS viruses. J. Mol. Evol. 40:249-259[Medline].
RUSSELL, R. B., M. A. S. SAQI, R. A. SAYLE, P. A. BATES, and M. J. E. STERNBERG, 1997 Recognition of analogous and homologous protein folds: analysis of sequence and structure conservation. J. Mol. Biol. 269:423-439[Medline].
SWOFFORD, D. L., G. J. OLSEN, P. J. WADDELL and D. M. HILLIS, 1996 Phylogenetic inference, pp. 407514 in Molecular Systematics, Ed. 2, edited by D. M. HILLIS, C. MORITZ and B. K. MABLE. Sinauer Associates, Sunderland, MA.
TAYLOR, W. R., 1988 A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28:161-169[Medline].
TELLER, J. K., P. J. BAKER, K. L. BRITTON, P. C. ENGEL, and D. W. RICE et al., 1995 Correlation of intron-exon organisation with the three-dimensional structure in glutamate dehydrogenase. Biochim. Biophys. Acta 1247:231-238[Medline].
THORNE, J. L., N. GOLDMAN, and D. T. JONES, 1996 Combining protein evolution and secondary structure. Mol. Biol. Evol. 13:666-673[Abstract].
TOPHAM, C. M., A. MCLEOD, F. EISENMENGER, J. P. OVERINGTON, and M. S. JOHNSON et al., 1993 Fragment ranking in modelling of protein structure: conformationally constrained substitution tables. J. Mol. Biol. 229:194-220[Medline].
WAKO, H. and T. L. BLUNDELL, 1994 Use of amino acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins. I. Solvent accessibility classes. J. Mol. Biol. 238:682-692[Medline].
WHITE, J. V., C. M. STULTZ, and T. F. SMITH, 1994 Protein classification by stochastic modeling and optimal filtering of amino-acid sequences. Math. Biosci. 119:35-75[Medline].
YANG, Z., 1994 Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-314[Medline].
YANG, Z., 1995 A space-time process model for the evolution of DNA sequences. Genetics 139:993-1005[Abstract].
YANG, Z., 1996 Among-site rate variation and its impact on phylogenetic analysis. TREE 11:367-372.
YANG, Z., 1997 PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555-556
YANG, Z., N. GOLDMAN, and A. FRIDAY, 1994 Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316-324[Abstract].
YANG, Z., I. J. LAUDER, and H. J. LIN, 1995 Molecular evolution of the hepatitis B virus genome. J. Mol. Evol. 41:587-596[Medline].
YOKOYAMA, S. and W. T. STARMER, 1992 Phylogeny and evolutionary rates of G protein
subunit genes. J. Mol. Evol. 35:230-238[Medline].
YOKOYAMA, S. and D. E. HARRY, 1993 Molecular phylogeny and evolutionary rates of alcohol dehydrogenases in vertebrates and plants. Mol. Biol. Evol. 10:1215-1226[Abstract].
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited









