Genetics, Vol. 149, 445-458, May 1998, Copyright © 1998

Assessing the Impact of Secondary Structure and Solvent Accessibility on Protein Evolution

Nick Goldmana, Jeffrey L. Thorneb, and David T. Jonesc
a Department of Genetics, University of Cambridge, Cambridge CB2 3EH, United Kingdom,
b Program in Statistical Genetics, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695-8203
c Department of Biological Sciences, University of Warwick, Coventry CV4 7AL, United Kingdom

Corresponding author: Nick Goldman, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK, n.goldman{at}gen.cam.ac.uk (E-mail).

Communicating editor: G. B. GOLDING


*  ABSTRACT
*TOP
*ABSTRACT
*MATERIALS AND METHODS
*ANALYSIS OF EXAMPLE DATA...
*DISCUSSION
*LITERATURE CITED

Empirically derived models of amino acid replacement are employed to study the association between various physical features of proteins and evolution. The strengths of these associations are statistically evaluated by applying the models of protein evolution to 11 diverse sets of protein sequences. Parametric bootstrap tests indicate that the solvent accessibility status of a site has a particularly strong association with the process of amino acid replacement that it experiences. Significant association between secondary structure environment and the amino acid replacement process is also observed. Careful description of the length distribution of secondary structure elements and of the organization of secondary structure and solvent accessibility along a protein did not always significantly improve the fit of the evolutionary models to the data sets that were analyzed. As indicated by the strength of the association of both solvent accessibility and secondary structure with amino acid replacement, the process of protein evolution—both above and below the species level—will not be well understood until the physical constraints that affect protein evolution are identified and characterized.


WIDELY used models of sequence evolution provide notoriously poor fits to actual sequence data (e.g., GOLDMAN 1993 Down; GOLDMAN and YANG 1994 Down). For example, the infinite sites model of population genetics assumes that distinct mutational events never affect the same site in a DNA sequence. This assumption cannot be reconciled with the data: this model predicts that at most two nucleotide types will be represented in a column of a sequence alignment, yet it is often the case that an actual aligned data set contains one or more alignment columns where three or sometimes all four nucleotide types are represented. For these data sets, the infinite sites model can be rejected by inspection. More sophisticated models (e.g., JUKES and CANTOR 1969 Down; HASEGAWA et al. 1985 Down; YANG 1994 Down) are typically employed when sequences from different species are being compared. These may be superior to the infinite sites model in that they cannot be rejected by inspection, but these also tend to be rejected by goodness-of-fit tests such as the one proposed by GOLDMAN 1993 Down.

Although their assumptions may be violated and there may be statistical grounds for rejecting them, simple models of sequence evolution may still be adequate for investigating the questions to which they are often applied. In population genetics, the infinite sites model may be good enough for testing the neutral hypothesis of KIMURA 1983 Down. In macroevolutionary applications, the JUKES-CANTOR (1969) model may often be sufficient for accurate reconstruction of phylogenies.

Nevertheless, the limitations of widely used models of sequence evolution often prevent more refined and informative questions from being addressed. A key to modelling and understanding the evolutionary process is identification and characterization of the constraints that evolution "perceives" as proteins diverge. Selective constraints on protein structure would be expected to give rise to associations between patterns of amino acid replacement and structure. In this article, we extend our earlier approach (THORNE et al. 1996 Down) for characterizing these associations by increasing the number of secondary structures distinguished by the evolutionary model and by considering solvent accessibility (i.e., whether or not a residue is near the surface of a protein and relatively exposed to solvent). We also add a more realistic description of secondary structure organization along a protein sequence.

There is a tradition of studies (e.g., OVERINGTON et al. 1990 Down; LUTHY et al. 1991 Down; TOPHAM et al. 1993 Down; WAKO and BLUNDELL 1994 Down) that attempts to associate amino acid replacement patterns and protein secondary structure or solvent accessibility, but these studies were performed without direct consideration of evolution and their findings are therefore difficult to apply to evolutionary questions. The work of KOSHI and GOLDSTEIN 1995 Down is more interpretable from an evolutionary perspective but was not aimed at the study of evolution and instead addressed prediction of secondary structure and solvent accessibility. Other recently proposed models incorporate variation of preferred amino acid residues among sites (e.g., BROWN et al. 1993 Down; BRUNO 1996 Down) but are not designed to facilitate study of the relationship between protein structure and evolution. With the approaches described here, associations between protein secondary structure, solvent accessibility, and the pattern of protein evolution can be investigated from within a likelihood inference framework.

We assume that each site in a protein belongs to one of several categories. The categories need to be predetermined, but the category to which each site belongs does not need to be prespecified. In this study, we have four types of secondary structure: {alpha}-helix, ß-sheet, turn, and coil. For each secondary structure type, we further divide sites into those that are relatively exposed to solvent and those that are buried.

Having two accessibility classes, buried (b) and exposed (e), for each of the four secondary structure types (H, E, T, C) gives rise to a model with protein sites that belong to one of eight categories (Hb, He, Eb, Ee, Tb, Te, Cb, Ce). We use a hidden Markov model (HMM) approach (see CHURCHILL 1989 Down; ASAI et al. 1993 Down; WHITE et al. 1994 Down; YANG 1995 Down; FELSENSTEIN and CHURCHILL 1996 Down) to describe organization of these eight categories along a protein sequence, and we describe the length distribution of secondary structures more accurately than did our earlier model. The model that results from these changes is both relatively complicated and comparatively realistic when contrasted with previous explicit models of protein sequence evolution. In the sections that follow, we first describe the model. We then investigate several data sets to understand how much various features of the model improve its fit.


*  MATERIALS AND METHODS
*TOP
*ABSTRACT
*MATERIALS AND METHODS
*ANALYSIS OF EXAMPLE DATA...
*DISCUSSION
*LITERATURE CITED

BRKALN database:
Fixed parameters of the model were estimated as described below from a database of structure-related amino acid sequence alignments. The BRKALN database maintained by D.T. Jones (unpublished results) contains amino acid sequences classified into families of closely related sequences for which the tertiary structure of at least one member has been experimentally determined. The database is built by extracting nonhomologous sequences from the Brookhaven Protein Databank (PDB; BERNSTEIN et al. 1977 Down). Low resolution (>2.6Å) and NMR structures are excluded. Each sequence is compared to the OWL nonredundant protein sequence database (BLEASBY and WOOTTON 1990 Down) with a dynamic programming-based similarity search (GOTOH 1982 Down) to find sequences of >30% identity with the sequence of known structure. Each family found in this manner is aligned via the multiple sequence alignment method of TAYLOR 1988 Down.

Secondary structure assignments and solvent accessibility scores are calculated for the protein of known structure with the DSSP program (KABSCH and SANDER 1983 Down) and are extrapolated across each aligned sequence family by assuming that all residues in an alignment column share the secondary structure and solvent accessibility classification of the homologous residue in the protein with experimentally determined structure. Residues inserted into the middle of a secondary structure element are assigned that secondary structure, and insertions elsewhere are defined as having unknown secondary structure. Using the January 1995 release of PDB and release 25.0 of OWL resulted in the BRKALN database containing 207 families of sequences.

The DSSP assignments that occur in the BRKALN database are classified as follows: H, helix; E, sheet; S and T, turn; all others (i.e., B, ".", G, and I), coil. The decision to classify S and T as a separate turn category instead of as coil (as in THORNE et al. 1996 Down) was made because the DSSP assignments S (bend) and T (turn) yield amino acid replacement patterns and rates that are qualitatively similar to one another. DSSP solvent accessibility values were converted to two states, buried (accessibility <10%) and exposed (accessibility >=10%). Relative accessibility was estimated by using a fully extended Gly-X-Gly tripeptide as the reference state. This resulted in approximately half of the sites in known structures of the BRKALN database being classified as buried and half as exposed. There is scope for more refined solvent accessibility schemes to be explored in the future.

Replacement processes:
Each combination of secondary structure and accessibility status is associated with a particular amino acid replacement process. As is the case for most widely used models of nucleotide substitution and amino acid replacement, our models of amino acid replacement are Markovian with respect to time. To specify the process of evolution for a specific replacement category k that is one of the eight in our model, there are parameters {alpha}kij , the relative rate of change from type i to j. Slight modifications (JONES et al. 1992 Down; GOLDMAN et al. 1996 Down; THORNE et al. 1996 Down) of the approach by DAYHOFF et al. 1972 Down, DAYHOFF et al. 1978 Down are used to base estimates of {alpha}kij on observed amino acid replacement counts. These counts are made by comparing pairs of sequences in the BRKALN database that are 85% or more identical and by recording the number of times for replacement category k that amino acid type i is observed in one sequence of the pair and type j is observed at the corresponding site in the other.

Hidden Markov model:
Neither the secondary structure nor the solvent accessibility at one site in a protein is independent of that at nearby sites. In fact, secondary structures at adjacent sites are strongly positively correlated. Accessibility status at adjacent sites is also positively correlated, although less strongly so. We adopt a hidden Markov model (HMM) to describe the organization of secondary structure and accessibility status along amino acid sequences. The states of the model correspond to the underlying but unobserved (hence hidden) secondary structure and accessibility. Transitions among the states are modelled with a Markov process.

A very simple HMM would ignore accessibility status and only describe the organization of the four secondary structure types along a protein sequence. A consequence would be that the number of sites in each uninterrupted stretch of a secondary structure would be geometrically distributed. Figure 1A–D, compares the length distributions predicted under this simple model for the four secondary structure types with the empirically observed frequencies in the 207 known structures of the BRKALN database. For the secondary structure type coil (Figure 1D), the observed and predicted length distributions are reasonably close. For the other three secondary structure types, the observed distribution and that predicted under this simple HMM are quite different. For example, the geometric distribution necessarily has length one as its mode, but it is impossible for an {alpha}-helix to consist of just one residue.




View larger version (64K):
In this window
In a new window
Download PPT slide
 
Figure 1. —Observed and predicted secondary structure length distributions. Open bars are empirically observed frequencies of secondary structure lengths. Solid bars are predicted frequencies. The predicted frequencies for Figure 1 (A–D) assume a geometric length distribution. The predicted frequencies for Figure 1 (E–G) are those used by the HMM. For coils, the HMM uses the geometric distribution depicted in (D). (A) Observed and geometric distributions of helix lengths (n = 1339). (B) Observed and geometric distributions of sheet lengths (n = 1862). (C) Observed and geometric distributions of turn lengths (n = 4482). (D) Observed and geometric distributions of coil lengths (n = 5113). (E) Observed and HMM distributions of helix lengths (n = 1339). (F) Observed and HMM distributions of sheet lengths (n = 1862). (G) Observed and HMM distributions of turn lengths (n = 4482).

To provide a better fit between observed and predicted distributions, our HMM has states for the first, second, etc., positions in a secondary structure. We illustrate this with sheets. To represent the ith position of a sheet for i {isin} {1,2,...,5}, we use Ebi or Eei, respectively, when the accessibility status is buried or exposed. The states Eb6 and Ee6 denote buried and exposed sites in position six or greater of a sheet.

We constrain the HMM to enter a sheet only through either the Eb1 or Ee1 states. Once in state Ebi or Eei (i {isin} {1,2,...,5}), the HMM must continue by progressing to either Ebi+1 or Eei+1 or by leaving the sheet states (and entering a helix, turn, or coil state). In other words, the HMM must progress through the sheet states in an ordered fashion (but may switch among the buried and exposed states) until such time as it leaves the sheet. Once in the states Eb6 or Ee6, the HMM can remain there (with the possibility of switching between these two states) or can leave the sheet. This arrangement of 12 sheet states is illustrated in Figure 2A. The manner of controlling the distributions of secondary structure lengths is described below.





View larger version (59K):
In this window
In a new window
Download PPT slide
 
Figure 2. —Examples of permitted transitions among the 38 HMM states. Arrows indicate the permitted transitions among the states illustrated and are labelled with the parameter combinations that define the transition probabilities {rho}ij. Thicknesses of arrows are approximately proportional to the values of the corresponding {rho}ij. (A) Twelve sheet states. (B) Four turn states. (C) Two coil states. (D) All permitted transitions from the Eb3 state. The HMM may progress to either of the next sheet states (Eb4 or Ee4) or may leave the sheet and enter a helix, turn, or coil via their first states (Hb1 and He1, Tb1 and Te1, or Cb1 and Ce1).

The choice of six position-specific sheet states each for buried and exposed was made with a likelihood ratio testing procedure that took into account the number of parameters being estimated and the improvement in goodness-of-fit as the number of position-specific states increased. We use nE to represent the number of position-specific states for each of buried and exposed sheet. Similar considerations suggested that the helix states Hb and He each be expanded into nH = 10 states (Hb1, Hb2,..., Hb10 and He1, He2,..., He10) and that the turn states Tb and Te be expanded to Tb1, Tb2, Te1, and Te2 (nT = 2). Only two coil states (denoted Cb1 and Ce1) are in the HMM (nC = 1) because coil represents a collection of diverse minor secondary structure elements (including but not limited to loops) that together have approximately a geometric length distribution. In summary, this led to 38 hidden states comprising 10 buried and 10 exposed helix states, six buried and six exposed sheet states, two buried and two exposed turn states, and one buried and one exposed coil state.

The model requires that all sites with a particular secondary structure and accessibility status experience the same amino acid replacement process, regardless of relative position within their secondary structure element. For example, a buried site that begins a helix (Hb1) and a buried site in the third position of a helix (Hb3) share the same replacement process. Thus, each of the 38 HMM states corresponds to a particular one of the eight amino acid replacement categories.

The most natural way to estimate transition probabilities ({rho}ij) among the 38 HMM states is to examine amino acid sequences of known structure, count how many times a site in state i is followed by a site in state j, and divide this count by the number of times sites in state i are followed by sites in any category. We felt there were insufficient data in the BRKALN database to make reliable estimates of all transition parameters in this manner. To reduce the number of parameters without a great sacrifice in the utility of the model, we make assumptions that we believe are reasonable albeit not strictly correct. Because these assumptions define the form of transition probabilities in our model, they are described here in detail and are illustrated in Figure 2.

If the current state is Xyi (where X is one of H, E, T, C; y is b or e; and i = 1,2,...,nX), the probability that the next state does not have secondary structure X is assumed to be independent of the current accessibility status (y). This probability is denoted gXi for i = 1,2,...,nX - 1 (and X != C) and 1 - rX for i = nX.

The probabilities gXi can be fixed so that the expected proportions of secondary structure X elements with given lengths between 1 and nX - 1 can take any specified values. If the desired proportion of elements of length i {isin} {1,2,...,nX - 1} is fXi (and defining fX0 = 0 for all X), then setting gXi = satisfies these length distribution requirements. We have estimated the fXi as the observed proportions of secondary structure elements of type X that have length exactly i in the BRKALN database and applied the above formula for the gXi.

The probabilities rX determine the expected length distributions of secondary structures X for lengths i >= nX. Effectively, we model the tail of the length distribution with a geometric distribution. After selecting the value of nX, we estimate the parameters rX by the values that equate the observed mean lengths of secondary structure X elements in the BRKALN database and the expected mean lengths. We note that our estimates of gXi and rX are maximum likelihood estimates for the case where the probabilites of lengths less than nX are each represented by an individual parameter, and the probabilites of lengths nX or greater are determined by attaching a geometric tail to the length distribution. A comparison between the observed and predicted distributions of secondary structure lengths is given in Figure 1D–G.

Given that the secondary structure W of the next state differs from the secondary structure X of the current state Xyi, we assume that W and the next accessibility status z = b or e are independent of the value of i. The conditional probabilities of transitions from Xyi to Wz1 are denoted aXy,Wz and were estimated directly from the known structures in the BRKALN database by their observed relative frequencies.

Given that the secondary structure of the next state is identical to that of the current state Xyi, we assume the accessibility status of the next state is independent of i. The probability that the next site is buried, given that the current site is buried and both sites have secondary structure X, is represented by pXb . The complementary probability qXb = 1 - pXb is then the probability that the next site is exposed, given that the current site is buried and both sites have secondary structure X. Similarly, pXe is the probability that the next site is exposed, given that the current site is exposed and both are secondary structure X. We use qXe to represent the complementary probability 1 - pXe . These probabilities are again estimated directly from the observed frequencies in the experimentally determined structures of the BRKALN database.

Relative rates of amino acid replacement:
One way to compare the eight estimated amino acid replacement processes is to determine the relative rate at which sites in the different replacement categories evolve. We normalize rates so that the average site evolves at rate 1. First, note that the rate at which amino acid i is replaced in category k is {Sigma}j!=i{alpha}kij , the sum of the replacement rates {alpha}kij over all amino acids that are not i. Second, the overall replacement rate for category k is {Sigma}i{pi}ki {Sigma}j!=i{alpha}kij , which accounts for amino acid frequencies by weighting the amino acid-specific rates by {pi}ki , the frequency of amino acid i in category k. Table 1 displays the estimated values of the {pi}ki . Finally, different replacement categories k have different stationary probabilities {Psi}k, determined for our HMM by the equilibrium distribution of the matrix {rho}ij. Accounting for the variation in stationary probabilities among replacement categories, the relative rate of replacement for category k is

(1)


 
View this table:
In this window
In a new window

 
Table 1. Estimated amino acid frequencies for the eight replacement categories

Relative replacement rates associated with a particular secondary structure averaged over accessibility classifications can also be estimated. For example, the relative rate for helices is

(2)

Similarly, relative rates for each accessibility classification averaged over secondary structures (e.g., R·b) can be estimated. These estimates are shown in Table 2.


 
View this table:
In this window
In a new window

 
Table 2. Relative rates of amino acid replacement

A concern is that one or a few proteins in the BRKALN database could potentially have an especially large impact on the relative rate estimates. The potential for a small number of protein families to have a great impact exists because the rate estimates are based upon amino acid replacement counts from pairwise sequence comparisons. Protein families with many sites or many sequences will tend to generate higher counts that families consisting of a few short sequences. If these influential protein families tend to evolve via comparatively atypical evolutionary processes, the general applicability of our evolutionary model would be restricted. To investigate this, we employed a nonparametric bootstrap resampling procedure (EFRON and TIBSHIRANI 1993 Down). One hundred resampled data sets were constructed by sampling 207 randomly selected protein families with replacement from the 207 families of the BRKALN database. From each resampled data set, stationary probabilities of the eight replacement categories for our model were estimated. Likewise, replacement counts were generated from each resampled data set. Each resampled data set thereby yielded estimates of relative rates of replacement for each category and the sample standard deviations of these estimates were determined (see Table 2).

A decision that we had to make when analyzing the pairwise replacement count data was how to treat data from sites that had unknown secondary structure or accessibility because they were insertions relative to the experimentally determined structure. One possibility was to ignore these data. Another was to add these replacement count data to the exposed coil (Ce) replacement category because of the observed tendency for insertion and deletion events to affect exposed coils. As the relative rate estimates in Table 2 exhibit, the treatments yield similar outcomes. All other results presented here are based on estimating rates by ignoring sites that are insertions relative to known structures.

Phylogenetic tree:
Typically the form (topology and branch lengths) of the phylogenetic tree relating a set of sequences is unknown but of interest. In this case, it may be treated as a parameter of the problem and estimated using statistical methods. This is the approach that we adopt.

In protein structure prediction problems, it is now realized that evolutionarily related amino acid sequences should not be treated as though they are statistically independent of one another (BENNER et al. 1994 Down; GOLDMAN et al. 1996 Down). In fact, common ancestry induces a correlation structure among sequences that is specified by the phylogenetic tree relating them. It has been shown that ideas developed in comparative evolutionary biology (FELSENSTEIN 1985 Down; HARVEY and PAGEL 1991 Down) for using phylogenetic trees to describe this correlation structure can improve the results of secondary structure prediction algorithms (GOLDMAN et al. 1996 Down). The model presented here provides a statistical basis for a protein secondary structure prediction technique that shares these benefits.

Likelihood calculations:
After estimating the relative rates of replacement {alpha}kij and the HMM transition probabilities {rho}ij from the BRKALN database, we can calculate the likelihood for a candidate phylogenetic tree T (representing both topology and branch lengths). We do not re-estimate the {alpha}kij or {rho}ij during the likelihood calculations but instead fix them at their estimated values. We denote the aligned data set by S, its length (number of amino acids) by N, the first i columns of the data set by Si, and the ith column itself by si. In the equations below, many of the probabilities are actually conditional upon {alpha}kij and {rho}ij but, for the sake of clarity, we omit {alpha}kij and {rho}ij when this is feasible. The likelihood of the tree T is given by Pr(S|T), and this is calculated via the terms Pr(Si,ci|T) for each possible secondary structure category ci at site i using the iteration:

(3)
for i > 1. The terms Pr (si|ci, T) are evaluated using the Markov process replacement models appropriate for each secondary structure ci and the pruning algorithm of FELSENSTEIN 1981 Down. Because the site at the N terminus of a protein tends to be an exposed coil (Ce), we assume this is the case and start the iteration according to:

(4)
where {delta}c1,Ce = 1 if c1 = Ce, and 0 otherwise. This has proved slightly superior to using the stationary distribution ({Psi}i) to describe the secondary structure probabilities at the first site of sequences (results not shown), in which case Pr(S1, c1|T) is set equal to Pr(s1|c1,T){Psi}c1 . When completed, the iteration gives the required Pr(S|T) because

(5)

To obtain the maximum likelihood estimate T^, we use numerical optimization algorithms (e.g., SWOFFORD et al. 1996 Down) to determine the T that maximizes Pr(S|T). A slight improvement in model fit compared to our treatment might be generated via special consideration of secondary structure elements at the extreme carboxyl-terminus of proteins. For example, column N of the alignment should not be the first position of an {alpha}-helix.

Prediction of protein secondary structure can subsequently be performed with a plug-in approach that fixes the tree at T^ and uses appropriate methods to calculate the posterior distribution Pr(ci|S,T^). This has been described in more detail for a simpler HMM of sequence evolution by GOLDMAN et al. 1996 Down and will be explored in the future for the HMM introduced here.

Model comparisons:
When a statistical comparison indicates that one evolutionary model is superior to a simpler model, this implies that features possessed by the complicated model and absent from the simple model may be evolutionarily important. This approach has been taken in the past to compare models of DNA sequence evolution (GOLDMAN 1993 Down; YANG et al. 1994 Down, YANG et al. 1995 Down), indicating the importance of factors such as base composition bias, transition/transversion rate ratio, and rate heterogeneity across DNA sites.

In this study, we investigate models that range from the simplicity of sites evolving identically to the relative complexity of the full version of our new HMM in order to test the significance of the different components of the new model. We introduce the following notation to label the model variants. If the number of different secondary structure types recognized is ss, the number of solvent accessibility classes distinguished is acc, and the total number of states in the HMM is hmm, then we can label models as ss/acc/hmm+ (+ indicating the use of the HMM to model dependencies between adjacent sites) or ss/acc/hmm- (- indicating the disabling of the HMM; see below). Using this notation, our new HMM is denoted 4/2/38+. The following models were also considered:

1/1/1-: This model represents the simplest case, in which no information on protein structure or nonindependence of sites is incorporated (hence, ss = 1 average secondary structure category, acc = 1 average accessibility state, and no meaningful HMM is possible). This corresponds to currently available methods implemented in software such as PAML (YANG 1997 Down) and MOLPHY (ADACHI and HASEGAWA 1995 Down). We have performed some analyses with each of three variants of this model: one derived from the work of DAYHOFF et al. 1978 Down and denoted 1/1/1D-, one derived from JONES et al. 1992 Down and denoted 1/1/1J-, and the third derived from the BRKALN database of this study and denoted simply 1/1/1-.

4/2/8+: This model adds to the 1/1/1- model an HMM recognizing four secondary structure elements, each with two accessibility categories. No special allowance is made for the distributions of lengths of secondary structures (which therefore follow a purely geometric distribution under this model).

4/1/19+: This model adds to the 1/1/1- model an HMM recognizing four secondary structure elements and making allowance for the distributions of lengths of these structures but does not recognize different solvent accessibilities.

For all the above models, the methods for estimating model parameters varied slightly from those described above for the 4/2/38+ model because not all of the parameters of that model are meaningful for simpler versions. In all cases, methods analogous to those described above for the 4/2/38+ model were used to estimate appropriated parameters from the BRKALN database (but see above regarding the 1/1/1/D- and 1/1/1J- models); full details are available from the authors.

4/2/38-: This model is very similar to the 4/2/38+ model, but sites are treated as independent of one another. This independence is achieved by replacing each row of the matrix ({rho}ij) with that matrix's equilibrium distribution ({Psi}i).

As with the 4/2/38+ model (above), the 4/2/8+ and 4/2/38- models are implemented with the first site in a protein forced to be an exposed coil. For the 4/1/19+ model, accessibility status is not considered, and we simply treat the first site as a coil.


*  ANALYSIS OF EXAMPLE DATA SETS
*TOP
*ABSTRACT
*MATERIALS AND METHODS
*ANALYSIS OF EXAMPLE DATA...
*DISCUSSION
*LITERATURE CITED

Data sets:
We analyzed 11 data sets that encompass a broad range of genes, organisms, and evolutionary divergence. When possible, we utilized previously published amino acid sequence alignments to reduce the chances that results could be biased by our own alignment procedures or prejudices. In addition, we selected sequences that were sufficiently similar to be likely to share the same secondary structure.

Gapped positions in the alignments were treated as missing information, as they are in the maximum likelihood programs of the PHYLIP package (FELSENSTEIN 1995 Down). In regions where the alignment was deemed relatively unreliable, we treated the residues responsible for the alignment difficulty as missing information. This was done to reduce the impact of alignment errors on the evaluation of our models.

ADP-glucose pyrophosphorylase (abbreviated to ADPGP): Four plant sequences were used, as analyzed by GOLDMAN and YANG 1994 Down in a study of improved models of DNA nucleotide evolution.

HIV-1 gp120 envelope glycoprotein (GP120): One sequence was selected from each of the eight major subtypes (A–H) of HIV-1.

HIV-1 p17 matrix protein: Sequences were selected from a number of strains of human immunodeficiency virus type 1 (HIV-1). Two data sets were studied, the first (P17ALL) comprising one sequence from each of the seven subtypes (A–D, F–H) of HIV-1 for which significantly different p17 sequences have been recognized and the second (P17B) comprising eight sequences from HIV-1 subtype B.

Sucrose synthase (SUSY): Four dicotyledonous plant sequences were used, as analyzed in a preliminary study of the effects on protein structure on sequence evolution by THORNE et al. 1996 Down.

Xylanase (XYLA): Seven prokaryotic sequences were used, as analyzed in a preliminary study of phylogeny- and likelihood-based methods of protein secondary structure prediction by GOLDMAN et al. 1996 Down.

The following five data sets were derived from multiple sequence alignments deposited at the EMBL-European Bioinformatics Institute (EBI) and available electronically from ftp://ftp.ebi.ac.uk/pub/databases/embl/align. In each case, minor alterations were made by hand.

Alcohol dehydrogenase: Two data sets were formed from the alignment available from the EBI ftp server, file ds14642.dat (see also YOKOYAMA and HARRY 1993 Down). The first (ADHAN) was formed from 16 mammalian, avian, and amphibian sequences, with the homologous sequence from the cod Gadus callarias added by the authors. The second (ADHPL) comprises 13 plant sequences.

Glutamate dehydrogenase (GDH): Eleven sequences, with both eubacterial and eukaryotic representatives, were selected from the alignment file ds20281.dat (see also TELLER et al. 1995 Down).

G protein {alpha} subunit (GPA): YOKOYAMA and STARMER 1992 Down aligned a large number of G protein {alpha} subunit sequences (file ds15369.dat). We selected 18 Gi{alpha} and Go{alpha} sequences from mammals, Drosophila melanogaster, and Caenorhabditis elegans.

Phosphoenolpyruvate carboxykinase (PEPCK): We have used the alignment of 18 Lepidopteran sequences studies and submitted (file ds24063.dat) by FRIEDLANDER et al. 1996 Down.

Model comparisons:
Table 3 contains maximum log-likelihoods obtained when analyzing each of the above 11 data sets with models ranging from the most simple (1/1/1-, 1/1/1D-, 1/1/1J-) to the most complex (4/2/38+). To better understand which specific features of our models are most responsible for improved fits, we performed a series of parametric bootstrap analyses on each data set. Each analysis was a comparison of a relatively simple model (the null hypothesis H0) with a model that considers more aspects of protein structure (the alternative hypothesis HA).


 
View this table:
In this window
In a new window

 
Table 3. Maximum log-likelihoods for the analysis of 11 data sets under seven model variants

We use a likelihood-ratio test with test statistic {Delta}l, the maximum log-likelihood for the alternative hypothesis minus the maximum log-likelihood for the null hypothesis. For the data sets analyzed here, this statistic can be computed from the appropriate entries in Table 3. To approximate the distribution of {Delta}l under the null hypothesis, 100 simulated data sets are produced. The null model of sequence evolution is used, along with the maximum likelihood topology and branch lengths estimated under the null hypothesis for the original data set, to generate each simulated data set. The simulated data sets have the same number of taxa and are the same length as the original data set. If a residue at a particular site in the data set has unknown type in the original data set because of alignment uncertainty or gaps, the residue at this position is also considered unknown in the simulated data set. For each simulated data set, {Delta}l can be calculated via likelihood maximization under the null and alternative hypotheses. If the observed value of {Delta}l for the original data set is sufficiently extreme relative to the distribution of simulated values, then the null hypothesis can be rejected. One measure of extremity is given by the proportion of simulated test statistic values that equal or exceed the actual value. This proportion is an estimate of the probability under H0 of realizing a value of {Delta}l at least as extreme as that observed for the original data. Sufficiently low values imply rejection of H0. Another measure of extremity is a z-score, calculated by subtracting the mean simulated test statistic value from the actual value and then dividing by the sample standard deviation of the simulated values.

In our parametric bootstrap comparisons, we do not account for uncertainty in parameters governing the relative rates of amino acid replacement or organization of secondary structure and solvent accessibility. Estimates for these parameters are fixed at the values obtained from the BRKALN database. Only the topology and branch lengths are estimated from the simulated data sets. Ideally, the parametric bootstrap procedure would account for the uncertainty in all parameters when comparing models, but this would be computationally expensive.

Parametric bootstrap comparisons between some models are more informative than those between others. By combining evolution with protein structure, four potentially important features that we have incorporated in the 4/2/38+ model are (1) association of secondary structure and amino acid replacement dynamics, (2) association of solvent accessibility and replacement dynamics, (3) regional organization of secondary structure and solvent accessibility along a sequence, and (4) a relatively realistic length distribution of secondary structure elements. Earlier work (THORNE et al. 1996 Down) indicated that the effect of secondary structure on amino acid replacement is important. Therefore, we have concentrated on parametric bootstrap analyses that assess the remaining three features. The comparison between a null hypothesis of the 4/1/19+ model and the 4/2/38+ model investigates the effect of solvent accessibility on amino acid replacement. The comparison between a null hypothesis of the 4/2/38- model and the 4/2/38+ model addresses regional organization of secondary structure and solvent accessibility along a sequence. The comparison between a null hypothesis of the 4/2/8+ model and the 4/2/38+ model focuses on modelling realistic length distributions for secondary structure. We found it also of interest to use the 4/1/19+, 4/2/38-, 4/2/8+, and 4/2/38+ models as alternative hypotheses when the 1/1/1- model was the null hypothesis. These comparisons explore how much various combinations of the aforementioned four features improve upon a model that neglects structure. The results of these parametric bootstrap comparisons are shown in Table 4.


 
View this table:
In this window
In a new window

 
Table 4. Results of parametric bootstrap analyses

Computation times:
Because of the 38 HMM states and the eight replacement categories, our 4/2/38+ model is more computationally demanding than the 1/1/1- model. Our experience is that the computation time required for analyzing data sets is more sensitive to the number of replacement categories than to the number of HMM states. For this reason, one might expect the 4/2/38+ model to require approximately eight times more computation than the 1/1/1- model. In our experience, the actual ratio of CPU times required by these two models varies somewhat between analyses. As an example, in a case where topology was fixed and branch lengths were estimated, analysis of the GPA data set on a Digital Alphastation 500/400 required 199 seconds of CPU time with the 1/1/1- model and 943 seconds with the 4/2/38+ model.


*  DISCUSSION
*TOP
*ABSTRACT
*MATERIALS AND METHODS
*ANALYSIS OF EXAMPLE DATA...
*DISCUSSION
*LITERATURE CITED

An earlier study (THORNE et al. 1996 Down) that considered just three structure categories ({alpha}-helix, ß-sheet, and loop) demonstrated that consideration of secondary structure can significantly improve the fit of models to data. The belief that secondary structure is important is reinforced by the results shown in Table 4. All but one of the comparisons in which the null hypothesis makes no distinction between secondary structures and the alternative employs different Markov models of amino acid replacement for different secondary structures strongly reject the null hypothesis (Table 4, rows 1–4). Further evidence for an association between secondary structure and amino acid replacement is summarized in Table 2. We note the surprising results that the estimated rates for buried helix and sheet residues are greater than those for buried turns and coils and that the estimated rate for exposed helix residues is greater than those for other exposed residues. Because the probability a site has a particular accessibility status is not independent of its secondary structure, these patterns are less clear for the weighted average over buried and exposed sites. Nevertheless, it is still contrary to intuition that {alpha}-helix and ß-sheet positions experience more amino acid replacements on average than coil positions.

Differences in amino acid replacement dynamics associated with solvent accessibility status have been explored by KOSHI and GOLDSTEIN 1995 Down, but their significance to protein evolution has not been statistically tested. In our study, the tests represented in Table 4, rows 1, 3, 4, and 5, have a bearing on this question. Rows 1, 3, and 4 represent tests in which the effect of accessibility is studied in combination with other components of the models. The comparison between the 4/1/19+ and 4/2/38+ models (Table 4, row 5) directly investigates the effect of incorporating accessibility into our evolutionary model. For every data set and for all of these tests, the model incorporating solvent accessibility information is strongly preferred.

The comparison of the 4/2/38- and 4/2/38+ models (Table 4, row 6) tests for evidence of correlation of structure categories along sequences. For some data sets, the null hypothesis of no correlation could not be rejected. This could reflect failure of the 4/2/38+ model to incorporate information from residues that are not close in the linear sequence. Methods for detecting long-range interactions in an evolutionary context are currently being developed (D. D. POLLOCK, unpublished results), and we hope that the information they yield can be incorporated into models used for phylogenetic inference. An alternative explanation for failing to reject the null hypothesis of no correlation is that some data sets may have too few sequences or too little evolutionary divergence to provide relative certainty about underlying structural environments. It is difficult to detect regional organization of structure along a sequence if there is a high degree of uncertainty regarding the underlying structural environment at each site.

We have preliminary indications that secondary structure predictions generated from our 4/2/38+ model are superior to those from the 4/2/38- model. For example, predictions of the 4/2/38- model exhibit an abundance of unrealistically short {alpha}-helices and ß-sheets (results not shown). It may be the case that incorporating correlation of structure categories will be important for accurate secondary structure prediction, even for data sets where it has little impact on the goodness of fit of models. This possibility will be a focus of future research.

For six of the 11 data sets, we find that the comparison between the 4/2/8+ and 4/2/38+ models does not reject the former. For these data sets, there is no significant evidence in favor of the more complex HMM that attempts to use information regarding the distribution of lengths of secondary structure elements. Mixed results across different data sets (Table 4, row 7) again make it unclear whether this is a failing of our model or is due to insufficient data. Detection of both structural correlation along sequences and information pertaining to the length distributions of secondary structure elements might be assisted by adding more sequences to a data set. This should in effect increase the information pertaining to each alignment position and make the HMM states "less hidden."

Improved models of sequence evolution can lead to improved estimates of both evolutionary relationships (topology) and distances (branch lengths) in phylogenies (e.g., YANG et al. 1994 Down). NAYLOR and BROWN 1997 Down have recently illustrated adverse effects on phylogenetic tree estimation from mammalian mitochondrial protein sequences when physicochemical properties of amino acids are ignored. We hope that our new (4/2/38+) model may give more reliable phylogenetic estimates than simpler models. It directly incorporates four components representing structural features that have not generally been considered in models of sequence evolution, instead of using physicochemical properties of amino acids as surrogates. Two of these components, encompassing effects of secondary structure and accessibility, give particularly large improvements in the fit of models across the broad range of data sets that were studied.

One result of NAYLOR and BROWN 1997 Down seems to contradict our finding that buried sites evolve more slowly than exposed sites. NAYLOR and BROWN classified sites as hydrophilic or hydrophobic on the basis of the amino acids found in the alignment column at that site. In the data sets we analyzed, hydrophilic sites tend to be exposed to solvent and hydrophobic sites tend to be buried. With a parsimony analysis, NAYLOR and BROWN 1997 Down concluded that hydrophilic sites fit an accepted tree topology better than hydrophobic sites. We attribute this difference in fit (as measured by parsimony techniques) to heterogeneity of rates among sites: slowly evolving sites generate less homoplasy than quickly evolving sites. Therefore, their results could be explained if, in contrast to our results, hydrophobic (buried) sites were evolving more quickly on average than hydrophilic (exposed) sites.

Fortunately, the two studies can be reconciled by realizing that our work is based on experimentally determined structures that are exclusively globular proteins whereas NAYLOR and BROWN 1997 Down studied only integral membrane proteins. Relatively unconstrained sites on the surface of a globular protein are apt to be exposed to solvent and hydrophilic, but the relatively unconstrained sites on the surface of a membrane protein are likely to be lipid accessible and hydrophobic. The evolutionary processes of globular and membrane proteins are apt to differ, and JONES et al. 1994 Down have demonstrated that patterns of amino acid replacement in integral membrane proteins bear little resemblance to those in globular proteins.

In this article we have described the analysis of amino acid sequences under the assumption that the protein's true structure is unknown. In the case that the tertiary structure has been determined for a protein, the phylogenetic estimation method can be modified to allow the known secondary structure and accessibility information to be used. This would remove some of the sources of uncertainty and should improve the estimation of phylogenies. In this way, experimentally determined structures could directly assist phylogenetics instead of being ingored, or being used indirectly through their influence on average properties of databases of known structures as in this article.

Our amino acid replacement models for the eight structural environment categories are based only on analysis of database sequences. They are not varied to suit specific proteins. A promising approach for tailoring amino acid replacement processes to specific proteins has been developed by CAO et al. 1994 Down for an evolutionary model that ignores protein structure. This technique combines parameters estimated from database sequences with amino acid frequencies calculated from the specific sequences being analyzed and potentially could be extended to models that consider protein structure.

All models are potentially misled by violations of their assumptions. Our model assumes that all residues of an alignment are related via the same phylogeny, for example, that recombination is absent, but the HIV-1 data sets we have analyzed have potentially been subject to intra- and interspecific recombination (ROBERTSON et al. 1995 Down). Additionally, although structure is more conserved than sequence (e.g., CHOTHIA and LESK 1986 Down; RUSSELL et al. 1997 Down), it clearly does evolve. For example, there is some evidence that the secondary structure of one of the proteins we have examined (GP120) can vary in different strains of HIV-1 (HANSEN et al. 1996 Down). Our model currently assumes that there has been no change in protein secondary structure or accessibility since sequence divergence. Advanced models that explicitly address the evolution of structure would be of great interest for phylogenetic estimation, structure prediction, and the study of evolutionary processes. Although this would add complexity to our model, modern computational statistical methods may make such developments practical.

One of the most important advances in the reconstruction of evolutionary trees has been the consideration of heterogeneity of evolutionary rates among sequence sites (e.g., YANG 1994 Down, YANG 1996 Down). Although this variability has been statistically modelled, typically with a gamma-distribution, its biological basis has not been well characterized. Our estimates (Table 2) indicate that rate heterogeneity is strongly associated with structural environment. Exposed sites tend to experience ~2x the rate of amino acid replacements experienced by buried sites. A higher rate of replacement for exposed sites is seen for each secondary structure type. The association between accessibility status and replacement rates is a noteworthy feature of protein evolution but has received scant attention in the field of molecular evolution and has not been previously exploited in phylogenetic studies.

We believe that characterizing general associations of patterns and rates of sequence evolution with phenotypic features such as protein structure is essential to understanding the process of sequence evolution both within and between populations. It is the relationship between genotype and phenotype that drives much of evolution. Protein structure is fundamental to phenotype and yet little previous effort has been devoted to characterizing its impact on evolution.


*  ACKNOWLEDGMENTS

EDWARD HOLMES kindly assisted with the HIV-1 data sets. We also thank DAVID POLLOCK, PIETRO LIÒ, and two anonymous referees for their helpful suggestions. N.G. is supported by a Wellcome Trust Fellowship in Biodiversity Research. J.L.T. was supported by National Institutes of Health grant P01-GM45344 and the Alfred P. Sloan Foundation. D.T.J. is supported by a Royal Society University Research Fellowship. Software, parameter estimates, and data sets used for the analysis described above are available from the authors via http://ng-dec1.gen.cam.ac.uk/hmm/contents.html.

Manuscript received October 5, 1997; Accepted for publication February 17, 1998.


*  LITERATURE CITED
*TOP
*ABSTRACT
*MATERIALS AND METHODS
*ANALYSIS OF EXAMPLE DATA...
*DISCUSSION
*LITERATURE CITED

ADACHI, J., and M. HASEGAWA, 1995 MOLPHY: Programs for Molecular Phylogenetics, Ver. 2.3. Institute of Statistical Mathematics, Tokyo.

ASAI, K., S. HAYAMIZU, and K. HANDA, 1993  Prediction of protein secondary structure by the hidden Markov model. CABIOS 9:141-146[Abstract/Free Full Text].

BENNER, S. A., I. BADCOE, M. A. COHEN, and D. L. GERLOFF, 1994  Bona fide prediction of aspects of protein conformation: assigning interior and surface residues from patterns of variation and conservation in homologous sequences. J. Mol. Biol. 235:926-958[Medline].

BERNSTEIN, F. C., T. F. KOETZLE, G. J. B. WILLIAMS, E. F. MEYER, and M. D. BRICE et al., 1977  The protein data bank: a computer-based archival file for macromolecular structures. Eur. J. Biochem. 80:319-324[Medline].

BLEASBY, A. J. and J. C. WOOTTON, 1990  Construction of validated, non-redundant composite protein sequence databases. Prot. Eng. 3:153-159[Abstract/Free Full Text].

BROWN, M., R. HUGHEY, A. KROGH, I. S. MIAN. K. SJOLANDER et al., 1993 Using Dirichlet mixture priors to derive hidden Markov models for protein families, pp. 47–55 in Proceedings of the First International Conference on Intelligent Systems for Molecular Biology, edited by L. HUNTER, D. SEARLS and J. SHAVLIK. AIII Press, Menlo Park, CA.

BRUNO, W. J., 1996  Modelling residue usage in aligned protein sequences via maximum likelihood. Mol. Biol. Evol. 13:1368-1374[Abstract].

CAO, Y., J. ADACHI, A. JANKE, S. PÄÄBO, and M. HASEGAWA, 1994  Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: instability of a tree based on a single gene. J. Mol. Evol. 39:519-527[Medline].

CHOTHIA, C. and A. M. LESK, 1986  The relation between the divergence of sequence and structure in proteins. EMBO J. 5:823-826[Medline].

CHURCHILL, G. A., 1989  Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol. 51:79-94[Medline].

DAYHOFF, M. O., R. V. ECK and C. M. PARK, 1972 A model of evolutionary change in proteins, pp. 89–99 in Atlas of Protein Sequence and Structure, edited by M. O. DAYHOFF. National Biomedical Research Foundation, Washington, DC.

DAYHOFF, M. O., R. M. SCHWARTZ and B. C. ORCUTT, 1978 A model of evolutionary change in proteins, pp. 345–352 in Atlas of Protein Sequence and Structure, edited by M. O. DAYHOFF. National Biomedical Research Foundation, Washington, DC.

EFRON, B., and R. J. TIBSHIRANI, 1993 An Introduction to the Bootstrap. Chapman and Hall, New York.

FELSENSTEIN, J., 1981  Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376[Medline].

FELSENSTEIN, J., 1985  Phylogenies and the comparative method. Am. Nat. 125:1-15.

FELSENSTEIN, J., 1995 PHYLIP (Phylogenetic Inference Package), Ver. 3.57. Department of Genetics, University of Washington, Seattle.

FELSENSTEIN, J. and G. A. CHURCHILL, 1996  A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93-104[Abstract].

FRIEDLANDER, T. P., J. C. REGIER, C. MITTER, and D. L. WAGNER, 1996  A nuclear gene for higher-level phylogenetics—phosphoenolpyruvate carboxykinase tracks Mesozoic-age divergences within Lepidoptera (Insecta). Mol. Biol. Evol. 13:594-604[Abstract].

GOLDMAN, N., 1993  Statistical tests of models of DNA substitution. J. Mol. Evol. 36:182-198[Medline].

GOLDMAN, N. and Z. YANG, 1994  A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:725-736[Abstract].

GOLDMAN, N., J. L. THORNE, and D. T. JONES, 1996  Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. J. Mol. Biol. 263:196-208[Medline].

GOTOH, O., 1982  An improved algorithm for matching biological sequences. J. Mol. Biol. 162:705-708[Medline].

HANSEN, J. E., O. LUND, J. O. NIELSEN, S. BRUNAK, and J.-E. S. HANSEN, 1996  Prediction of the secondary structure of HIV-1 gp120. Proteins 25:1-11[Medline].

HARVEY, P. H., and M. D. PAGEL, 1991 The Comparative Method in Evolutionary Biology. Oxford University Press, Oxford, UK.

HASEGAWA, M., H. KISHINO, and T. A. YANO, 1985  Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174[Medline].

JONES, D. T., W. R. TAYLOR, and J. M. THORNTON, 1992  The rapid generation of mutation data matrices from protein sequences. CABIOS 8:275-282[Abstract/Free Full Text].

JONES, D. T., W. R. TAYLOR, and J. M. THORNTON, 1994  A mutation data matrix for transmembrane proteins. FEBS Letts. 339:269-275[Medline].

JUKES, T. H., and C. R. CANTOR, 1969 Evolution of protein molecules, pp. 21–132 in Mammalian Protein Metabolism, edited by H. N. MUNRO. Academic Press, New York.

KABSCH, W. and C. SANDER, 1983  Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers 22:2577-2637[Medline].

KIMURA, M., 1983 The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.

KOSHI, J. M. and R. A. GOLDSTEIN, 1995  Context-dependent optimal substitution matrices. Prot. Eng. 8:641-645.

THY, R., A. D. MCLACHLAN, and D. EISENBERG, 1991  Secondary structure-based profiles: use of structure-conserving scoring tables in searching protein sequence databases for structural similarities. Proteins 10:229-239[Medline].

NAYLOR, G. J. P. and W. M. BROWN, 1997  Structural biology and phylogeny estimation. Nature 388:527-528[Medline].

OVERINGTON, J., M. S. JOHNSON, A. ALI, and T. L. BLUNDELL, 1990  Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc. R. Soc. Lond. Ser B 241:132-145[Medline].

ROBERTSON, D. L., B. H. HAHN, and P. M. SHARP, 1995  Recombination in AIDS viruses. J. Mol. Evol. 40:249-259[Medline].

RUSSELL, R. B., M. A. S. SAQI, R. A. SAYLE, P. A. BATES, and M. J. E. STERNBERG, 1997  Recognition of analogous and homologous protein folds: analysis of sequence and structure conservation. J. Mol. Biol. 269:423-439[Medline].

SWOFFORD, D. L., G. J. OLSEN, P. J. WADDELL and D. M. HILLIS, 1996 Phylogenetic inference, pp. 407–514 in Molecular Systematics, Ed. 2, edited by D. M. HILLIS, C. MORITZ and B. K. MABLE. Sinauer Associates, Sunderland, MA.

TAYLOR, W. R., 1988  A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28:161-169[Medline].

TELLER, J. K., P. J. BAKER, K. L. BRITTON, P. C. ENGEL, and D. W. RICE et al., 1995  Correlation of intron-exon organisation with the three-dimensional structure in glutamate dehydrogenase. Biochim. Biophys. Acta 1247:231-238[Medline].

THORNE, J. L., N. GOLDMAN, and D. T. JONES, 1996  Combining protein evolution and secondary structure. Mol. Biol. Evol. 13:666-673[Abstract].

TOPHAM, C. M., A. MCLEOD, F. EISENMENGER, J. P. OVERINGTON, and M. S. JOHNSON et al., 1993  Fragment ranking in modelling of protein structure: conformationally constrained substitution tables. J. Mol. Biol. 229:194-220[Medline].

WAKO, H. and T. L. BLUNDELL, 1994  Use of amino acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins. I. Solvent accessibility classes. J. Mol. Biol. 238:682-692[Medline].

WHITE, J. V., C. M. STULTZ, and T. F. SMITH, 1994  Protein classification by stochastic modeling and optimal filtering of amino-acid sequences. Math. Biosci. 119:35-75[Medline].

YANG, Z., 1994  Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-314[Medline].

YANG, Z., 1995  A space-time process model for the evolution of DNA sequences. Genetics 139:993-1005[Abstract].

YANG, Z., 1996  Among-site rate variation and its impact on phylogenetic analysis. TREE 11:367-372.

YANG, Z., 1997  PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555-556[Free Full Text].

YANG, Z., N. GOLDMAN, and A. FRIDAY, 1994  Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. Mol. Biol. Evol. 11:316-324[Abstract].

YANG, Z., I. J. LAUDER, and H. J. LIN, 1995  Molecular evolution of the hepatitis B virus genome. J. Mol. Evol. 41:587-596[Medline].

YOKOYAMA, S. and W. T. STARMER, 1992  Phylogeny and evolutionary rates of G protein {alpha} subunit genes. J. Mol. Evol. 35:230-238[Medline].

YOKOYAMA, S. and D. E. HARRY, 1993  Molecular phylogeny and evolutionary rates of alcohol dehydrogenases in vertebrates and plants. Mol. Biol. Evol. 10:1215-1226[Abstract].