Protein evolution depends on intramolecular coevolutionary networks whose complexity is proportional to the underlying functional and structural interactions among sites. Here we present a novel approach that vastly improves the sensitivity of previous methods for detecting coevolution through a weighted comparison of divergence between amino acid sites. The analysis of the HIV-1 Gag protein detected convergent adaptive coevolutionary events responsible for the selective variability emerging between subtypes. Coevolution analysis and functional data for heat-shock proteins, Hsp90 and GroEL, highlight that almost all detected coevolving sites are functionally or structurally important. The results support previous suggestions pinpointing the complex interdomain functional interactions within these proteins and we propose new amino acid sites as important for interdomain functional communication. Three-dimensional information sheds light on the functional and structural constraints governing the coevolution between sites. Our covariation analyses propose two types of coevolving sites in agreement with previous reports: pairs of sites spatially proximal, where compensatory mutations could maintain the local structure stability, and clusters of distant sites located in functional domains, suggesting a functional dependency between them. All sites detected under adaptive evolution in these proteins belong to coevolution groups, further underlining the importance of testing for coevolution in selective constraints analyses.
UNVEILING the mechanisms of natural selection whereby proteins evolve is one of the fundamental aims of evolutionary genetics studies. The identification of genes showing particular amino acid residues that have undergone adaptive evolution is key in determining functionally or structurally important protein regions. In light of the neutral theory of molecular evolution, mutations are fixed neutrally in proteins (Kimura 1983). Due to their stochastic distribution, only few mutations are beneficial for the biological fitness of the organism and are hence fixed by positive selection (adaptive evolution). However, it is becoming increasingly evident that a significant percentage of genes have undergone adaptive evolution at some stage during their evolutionary past. This body of observed positively selected genes has been growing rapidly during the past decade due to the increase in the number and sensitivity of statistical methods for detecting adaptive evolution.
Methods designed to detect adaptive evolution can be based on Bayesian approaches (Yang et al. 2000) or maximum parsimony (Suzuki and Gojobori 1999; Fares et al. 2002a). None of these methods takes into account the evolutionary interdependence between protein residues. A protein's function is, however, the result of the functional and structural communication between sites. Sites constraints are hence dependent on the interactions with other residues of the molecule. Mutations at either nearby sites or functionally related distant sites in the structure will change the selective constraints. The more complex the coevolution network is for a particular site, the greater the selection coefficient may be against a mutation at that site due to the dramatic effect that this mutation would have on other coevolving protein regions. Testing coevolution between sites is hence an essential step to complement molecular selection analyses, providing more biologically realistic results.
Grouping sites for molecular evolution analyses has been previously attempted (Hughes and Nei 1988; Clark and Kao 1991). Significant progress has been achieved in building more realistic models (Fares et al. 2002a; Suzuki 2004; Berglund et al. 2005), albeit several problems regarding the molecular evolution of proteins are still unresolved. For instance, linear sliding-window methods are one-dimensional based and assume independence between different window regions irrespective of their three-dimensional proximity. Conversely, classification of amino acids in the same group of evolution based on their three-dimensional proximity (three-dimensional sliding window) will ignore the coevolution between functional regions that are spatially distant. Various reports state that residues can form a physically connected network that links distant functional sites in the tertiary protein structure (Süel et al. 2003). In fact, mapping energetic interactions in the PDZ domain family predicts a set of energetically coupled positions for a binding site residue that includes unexpected long-range interactions (Lockless and Ranganathan 1999). Coevolution between clusters of sites, which are not in contact, has also been shown (Pritchard and Dufton 2000). Coevolution between distant sites has been observed in sites proximal to regions with critical functions, where coevolution occurs to maintain the structural characteristics around these regions and consequently to maintain the protein conformational and functional stability (Gloor et al. 2005).
Coevolution of any type has its origin in the covarion hypothesis proposed first by Fitch and Markowitz (1970). This hypothesis states that, at any given time, some sites are invariable due to their functional or structural constraints but, as mutations are fixed elsewhere in the sequence, these constraints may change. Various methods for identifying covariant amino acid pairs at the molecular level have been previously developed (e.g., Korber et al. 1993; Göbel et al. 1994; Shindyalov et al. 1994; Taylor and Hatrick 1994; Tillier and Collins 1995; Chelvanayagam et al. 1997; Pollock and Taylor 1997; Lockhart et al. 1998; Tuffley and Steel 1998; Pollock et al. 1999; Pritchard et al. 2001; Tillier and Lui 2003; Ané et al. 2004; Galtier 2004; Dutheil et al. 2005). The main limitation of many of these methods has been their inability to separate phylogenetic linkage from functional and structural coevolution. In other methods there was not an assessment of the random noise caused by a limited number of sequences in the alignment or by high pairwise distance. Tillier and Lui (2003) attempted to remove the phylogenetic coevolution; however, their analysis was biased toward those positions covarying with at most a few others. Gloor et al. (2005) partially corrected these effects although their method requires alignments of at least 125 sequences to remove stochastic covariation. Further, these methods do not simultaneously take into account the replacement propensity of a site, the background sequence divergence, and the three-dimensional information.
In this work we first develop a novel method for detecting coevolution between sites that allows for more realistic analyses of selective constraints. We then test previously hypothesized and experimentally supported interdomain coevolution in the 90-kDa heat-shock protein (Hsp90), the multimeric Hsp60 (GroEL), and the Gag protein from the human immunodeficiency virus type 1 (HIV-1). We finally apply the new method to uncover coevolution between sites previously detected as under adaptive evolution in GroEL and Gag. We show that coevolution analyses not only provide more realistic results but also highlight the molecular and structural mechanisms shaping the evolution of proteins.
MATERIALS AND METHODS
Coevolution analysis using protein sequences (CAPS) compares the correlated variance of the evolutionary rates at two sites corrected by the time since the divergence of the protein sequences they belong to (Figure 1). Substitutions or conservation at two independent sites cannot be directly compared due to their amino acid composition difference. The method instead compares the transition probability scores between two sequences at these particular sites, using the blocks substitution matrix (BLOSUM) (Henikoff and Henikoff 1992). For each protein alignment the correspondent BLOSUM matrix is applied, depending on the average sequence identity.
Despite the fact that BLOSUM matrices correct for the substitution values due to the estimated divergence between sequence pairs, a given alignment can include sequences whose pairwise distance is significantly divergent from the mean pairwise distance. For instance, an alignment including two highly divergent sequence groups (for example, gene duplication predating speciation) could show an unrealistic pairwise average identity level. In this respect, sequences that diverged a long time ago are more likely to fix correlated mutations at two sites by chance (under a Poisson model) compared to recently diverged sequences. BLOSUM values should be hence normalized by the time of divergence between sequences. BLOSUM values (Bek) are thus weighted for the transition between amino acids e and k using the time (t) since the divergence between sequences i and j:(1)
The assumption made in Equation 1 is that the different types of amino acid transitions (slight to radical amino acid changes) in a particular site follow a Poisson distribution along time. The greater the time is since the divergence between sequences i and j the greater the probability is of having a radical change. A linear relationship is thus assumed between the BLOSUM values and time. Synonymous substitutions per site () are silent mutations, as they do not affect the amino acid composition of the protein. These mutations are therefore neutrally fixed in the gene. Assuming that synonymous sites are not saturated or under constraints, dS is proportional to the time since the two sequences compared diverged. Time (t) therefore is measured as dS. Note that convergent radical amino acid changes at two sites in sequences that have diverged recently have larger weights compared to convergent changes in distantly related sequences.
The next step is the estimation of the mean θ-parameter for each site of the alignment, so that(2)Here S refers to each pairwise comparison, while T stands for the total number of pairwise sequence comparisons, and thus(3)where N is the total number of sequences in the alignment.
The variability of each pairwise amino acid transition compared to that of the site column is estimated as(4)
The mean variability for the corrected BLOSUM transition values is(5)
The coevolution between amino acid sites (A and B) is estimated thereafter by measuring the correlation in the pairwise amino acid variability, relative to the mean pairwise variability per site, between them. We thus use the relative variability rather than the absolute variability to measure the correlation between two sites. This ensures making the covariation independent from the differences in the rates of evolution of the sites compared. This covariation is measured as the correlation between their -values, such as(6)
Here e and k are any two characters at sites A and B. To determine if the correlation coefficient (ρAB) is significant, either a resampling or a simulation analysis can be performed. In the first approach we randomly sample K numbers of pairs of sites and compute Equations 1–6 for each pair. The mean correlation coefficient and its variance are then estimated as(7)Correlation coefficients are then tested for significance under a normal distribution:(8)
The second approach consists of the Monte Carlo simulation of K sequence alignments. Here the coevolution test is conducted for a number of randomly selected pairs of sites in each simulated alignment or for the complete set of pairs of sites in the random data set computing Equations 1–6. An average value of the correlation for the simulated alignments and its variance are obtained utilizing Equation 7. Finally, the real correlation coefficients are tested using Equation 8.
The statistical power of the test is optimized by analyzing sites showing(9)Here, Θ is the parametric value of from Equation 5 and σ is the standard deviation of Θ. Θ is calculated as(10)where L is the length of the alignment. Pairwise comparisons including gaps in any or both sites at any sequence are excluded from the analysis.
Removing the phylogenetic coevolution:
Coevolution between amino acid sites can be the result of their structural, functional, or physical interaction; their phylogenetic convergence; and their stochastic covariation. The analysis of simulated data to test for significance removes stochastic effects. To disentangle functional, structural, and interaction coevolution from phylogenetic coevolution, the method is applied to the complete alignment and to subalignments, where specific phylogenetic clades are removed from the tree. Coevolving amino acid sites that are no longer detected following removal of one of the clades will be classified as phylogenetic coevolving sites as they occur in specific branches of the tree. Conversely, coevolving amino acid sites detected irrespective of the tree clades removed will be considered as functional/structural/interaction coevolving sites since they present correlated changes throughout the phylogenetic tree. Note that the latter condition means that when one amino acid changes, the covarying amino acid has necessarily to change. In the former condition, a change in one site does not always (in all branches) involve a change in the covarying site. In other words, our method detects phylogenetic-independent coevolution. Clades for coevolution analyses are defined in terms of their biological coherence and/or statistical support (defined as bootstrap values). Consequently, phylogenetic clades are specified prior to conducting the coevolutionary analysis and they include sequences that are forming either a well-defined biological cluster or alternatively a cluster supported by a high bootstrap value.
Using the atomic distances as additional information in coevolution analyses:
Spatial proximity between coevolving sites can be used to define their structural or functional interaction. In this method coevolution is not always synonymous with physical interaction but also involves structural and functional coevolution, as has been previously described (Lockless and Ranganathan 1999; Pritchard and Dufton 2000; Süel et al. 2003; Gloor et al. 2005).
The three-dimensional closeness of two sites is estimated as the vectorial distance between their atomic centers (δ). This distance is obtained by comparing the three-dimensional coordinates (X, Y, and Z) of atoms A and B for amino acids i and j:(11)Since each amino acid consists of several atoms, the mean atomic distance () between sites i and j is taken:(12)Here, μ refers to the total number of atoms in the amino acid. The significance of the distance is tested by comparing it to a distribution of K random amino acid pairs sampled from the three-dimensional structure. The reason for conducting a statistical analysis to detect proximal sites is that sites may be considered significantly proximal or distant depending on the shape of the protein. On the other hand, sites that are not in physical contact but are surrounding functionally important sites, and are hence proximal, can present coevolution due to their proximity to important sites (Gloor et al. 2005).
We tested the sensitivity of CAPS using simulated data that allow for the control of the extent of coevolution and the evolutionary history of each site in the alignment. We also compared CAPS with other nonparametric methods that use the information theory or a Bayesian approximation, including the method of Korber [herein called the mutual information criterion (MICK) implemented in our program PIMIC and available on request; Korber et al. 1993] and the method of Tillier and Lui (2003) implemented in the program Dependency, as well as with the parametric method of Pollock et al. (1999) implemented in the program lnLCorr. While parametric methods can be more powerful than nonparametric ones, incorrect assumptions in the model can yield a high number of false positives (Dimmic and Hubisz 2005). We thus compared CAPS to more conservative nonparametric methods and to more powerful parametric methods. We simulated sequence alignments using a model similar to that devised by Pollock et al. (1999). Initially an ancestral sequence of 200 amino acids was generated using the amino acid composition corresponding to the equilibrium residue frequencies in naturally occurring proteins (Jones et al. 1992). This sequence was evolved using a Markov chain Monte Carlo simulation along a phylogenetic tree and, simultaneously, 10 pairs of sites were randomly selected to coevolve. Coevolution was established by forcing correlated variation such as the transition between amino acids at both sites had similar θek as specified in Equation 1. We have also introduced coevolution without forcing this condition but results were unaltered. The phylogenetic trees used for simulations were bifurcated and symmetric (all branches had the same length). The robustness of the coevolution analysis to the sequence phylogenetic divergence was assessed using different levels of background noise and alignment sizes. We fixed the background noise (branch lengths) to evolve the ancestral sequence at 0.1, 0.2, 0.5, and 1 Poisson-corrected substitutions per site. The numbers of sequences tested were 10, 20, and 30 sequences. We simulated and tested multiple simulations for each condition (number of sequences and number of substitutions per site), performing 12 different types of simulation analyses. We measured the sensitivity (SN) of the method as(13)Here TP and FP are the numbers of true and false positives, respectively. We also measured the specificity of the methods as:(14)Here, TN and FN are the numbers of true and false negatives, respectively.
Analysis of real sequences:
We used CAPS to analyze the Gag protein from HIV-1, the 90-kDa heat-shock protein (Hsp90), and the 60-kDa heat-shock protein (GroEL). These proteins are multimeric and organized in functionally connected domains and are hence perfect for testing interdomain coevolution. To show an example of coevolution between sites that have undergone positive selection we used previously published data demonstrating adaptive evolution in GroEL from endosymbiotic bacteria of insects (Fares et al. 2002b, 2004) and the HIV-1 gag gene (Yang et al. 2003).
Gag protein from HIV-1:
The HIV-1 genome is translated to yield both structural and nonstructural proteins (Wain-Hobson et al. 1985). Among these proteins, Gag is a 55-kDa polyprotein that is initially associated with the cell membrane to ease the budding of virus particles from the host cell. Gag is further processed to produce four proteins called matrix (p17), capsid (p24), nucleocapsid (p9), and p6 (Gottlinger et al. 1989). We tested whether coevolution exists between specific Gag proteins or amino acid sites in specific HIV-1 lineages. The HIV-1 group M subtypes are thought to have originated from a single ancestor and are currently described by highly supported clades.
Hsp90 is an ATPase molecular chaperone that assists the conformational maturation of molecules involved in cell-cycle regulation and signal transduction (Pratt 1998; Buchner 1999; Caplan 1999; Mayer and Bukau 1999). Hsp90 is translated as a monomeric protein but its function depends on its dimerization. Several functional domains can be identified in the linear sequence of Hsp90 (supplemental Table 1 at http://www.genetics.org/supplemental/). The importance of the complex intramolecular interactions for the Hsp90 function is poorly understood (Prodromou et al. 1999; Johnson et al. 2000; Chadli et al. 2000). Here we apply CAPS to test and identify interdomain coevolution within Hsp90.
Analysis of the heat-shock protein 60-kDa GroEL:
The ATPase molecular chaperone GroEL is found specifically in bacteria and the organelles of eukaryotic cells (Landry et al. 1993). The multimeric protein GroEL folds 10–15% of slow-folding proteins, which are mostly aggregation prone (Deuerling et al. 1999; Thulasiraman et al. 1999). Each GroEL subunit is organized in three domains; apical, equatorial, and intermediate (Braig et al. 1994, 1995). Several functionally important intradomain regions in GroEL have been previously identified (supplemental Table 2 at http://www.genetics.org/supplemental/). Here we test if coevolution among sites is crucial for the functional and structural stability of GroEL.
Sequence alignments and phylogenetic inferences:
GenBank accession numbers for the gag, hsp90, and groel sequences are provided in supplemental Tables 3, 4, and 5, respectively, at http://www.genetics.org/supplemental/. Protein sequences were aligned (available from the corresponding author upon request) using CLUSTAL X (Jeanmougin et al. 1998). Nucleotide sequences were then aligned, concatenating triplets according to the amino acid sequence alignment.
The phylogeny of the HIV-1 group M subtypes is very well defined (Robertson et al. 2000). Representative sequences were selected in the manner described previously (Travers et al. 2005). For each subtype, all available full-genome sequences were retrieved from the Los Alamos HIV database (http://hiv-web.lanl.gov) and a neighbor-joining tree of each resulting data set was reconstructed using PAUP* 4.0b10 (Swofford 1998). Representative sequences were selected for each subtype on the basis of their spread throughout the subtype tree, resulting in the selection of a diverse range of sequences for each subtype (supplemental Table 3 at http://www.genetics.org/supplemental/). The resulting data set contained 36 taxa.
For Hsp90 we used sequences from unicellular and multicellular eukaryotes comprising a total of 43 taxa. Aligned sequences were subject to phylogenetic analyses using PAUP* 4.0b10. Maximum-likelihood and maximum-parsimony analyses yield the same phylogenetic tree. We used the GroEL phylogenetic tree obtained in a previous work (Fares et al. 2002b).
Analysis of coevolution:
Coevolution analyses were implemented in the program CAPSv1.0 (available from the corresponding author on request). Synonymous substitutions dS were considered to be proportional to the time since the divergence between sequences since no indication of saturation of synonymous sites was observed using SWAPSC v1.0 (Fares 2004). The significance of the correlation coefficients was tested using 10,000 pseudorandom pairs of amino acid sites and a confidence value of (α = 0.001), to minimize type I error. Clades defined in each protein for coevolution analyses are indicated in Figures 2, 3A, and 4A.
Testing the accuracy of CAPS:
The analysis of simulated data sets demonstrates that CAPS is highly sensitive and robust at a wide range of amino acid distances and alignment sizes (Figure 2, A–C). CAPS sensitivity ranged between 65 and 87% in alignments of 10 sequences (Figure 2, A–C). Increasing the alignment length to 20 and 30 sequences yielded sensitivity values between 80 and 90% and between 83 and 98%, respectively.
Although all methods find a high percentage of true coevolutionary amino acid pairs (Figure 2, D–F), the alternate methods to which CAPS was compared also identified large numbers of false positives, showing sensitivity values ranging between 8 and 17% in MICK, between 7 and 15% in Dependency, and between 7 and 8% in lnLCorr (Figure 2A). When the alignment size increased to 20 sequences, the sensitivity of MICK improved at all the distances, ranging between 20 and 15% (Figure 2B). Dependency and lnLCorr presented lower sensitivity values compared to MICK, ranging between 8 and 10% (Figure 2B). Using alignments containing 30 sequences, the sensitivity of MICK, Dependency, and lnLCorr decreased as the average sequence pairwise distance increased (Figure 2C). MICK, Dependency, and lnLCorr seem to require very dense phylogenetic trees as previously suggested (Pollock et al. 1999; Tillier and Lui 2003). Using alignments of 200 sequences does not seem to change the sensitivity of CAPS (data not shown). The sensitivity of CAPS increases with the number of sequences in the alignment (using a multivariate test, F = 9.968, P = 0.016), unlike the other methods, which present no evidence for such an increase (Figure 3A). Conversely, the level of pairwise amino acid divergence negatively affects all the methods except MICK (F = 1.446, P = 0.309) (Figure 3B).
We analyzed the effect of two factors, alignment size and amino acid substitutions per site, on the sensitivity of all the methods in general and of each method individually. A multivariate test demonstrates that neither of these factors alone influences the sensitivity of the methods for detecting coevolution (F = 1.951 and F = 2.301 and P = 0.267 and P = 0.090, for the effects of the alignment size and amino acid distance, respectively). The interaction of both factors, however, has a significant effect on the sensitivity of the four methods taken together (F = 131.938; P ≪ 0.001) or individually (data not shown). In summary, a low number of sequences combined with a high pairwise sequence divergence negatively affect the sensitivity of all the methods for detecting coevolution.
Analysis of the specificity of the methods demonstrates that CAPS is more specific than the alternate methods, although specificity was detected to be significant in all four methods used (data not shown).
Phylogenetic and functional coevolution in Gag from HIV-1:
Following coevolution analysis of the complete gag alignment, we identified 21 groups of coevolving sites, containing 73 unique residues (Figure 4A). Of these 73 residues, 42 were observed to have undergone phylogenetic coevolution, in that removal of a particular clade in the analysis resulted in loss of detection of that site in the subsequent analysis using CAPS. In all lineages coevolving sites were spread throughout the gene with the majority of sites present in the functional p17 and p24 regions (Figure 4B). Interestingly, while subtypes A and G, which evolved from a common ancestor, have the highest number of phylogenetically coevolving sites only three residues (E107, I147, and T186) are shared between them, indicating the presence of both lineage-specific and ancestral coevolution within the gag gene (Figure 4B). Also, numerous examples of convergent coevolution were observed between subtypes (Figure 4B).
Two groups of amino acid sites (R15, K28, R91, A120 and E55, V128) maintained their grouping no matter what clade (defined as subtype) was removed, indicating functional/structural/physical interaction (herein called functional groups, FGs) coevolution between these sites. We observed 8 of the 14 positively selected amino acids identified by Yang et al. (2003) as coevolving within the HIV-1 group M phylogeny (R15, K28, G62, Q69, R91, I138, N252, and T280) with 5 of these (K28, G62, Q69, R91, and T280) having been identified by Yang et al. as undergoing adaptive evolution in separate analyses of subtypes A, B, and C. Of the 8 residues overlapping between our study and that of Yang et al., 3 (R15, K28, and R91) are present in one of the two FGs.
Important functional regions in Hsp90 do coevolve:
The clades defined for the coevolution analyses were those defined as biologically distinguishable organisms, including Hsp90 from endoplasmic reticulum, chloroplast, yeast, plants, protozoan parasites, insects, and high eukaryotes (Figure 5A). The application of the coevolution analysis to Hsp90 identified 34 groups of coevolution (G1–G34; Figure 5B). A significant proportion of the coevolving sites have an important reported biological function (Table 1 and Figure 5B).
We applied a neural network-based analysis implemented in the program CONSEQ (Berezin et al. 2004) to predict the functional or structural importance of residues never tested experimentally. To avoid missing information due to possible shifts in the selective constraints acting on Hsp90 residues in the branch separating unicellular from multicellular eukaryotes, we applied CONSEQ to both data sets (each one comprising a total of 20 sequences). Caution is required when using this approach due to problems in the specificity of the method (Berezin et al. 2004). We found that sites reported to be important in the literature were identified by the method. In addition, in most of the cases, sites predicted to be buried or exposed were correctly identified when compared to the three-dimensional structure. After using this approach, and taking into account the literature published, the average number of functionally or structurally important sites in each group of coevolution was 86.26% (Figure 5B). A careful inspection of Figure 5B reveals that coevolution has occurred within and between functional domains. In fact, coevolution was found between domains involved in protein interaction (PI); in ATP modulatory, ATP amino-, and carboxy-terminal binding regions (ATPM, ATP-Nt, and ATP-Ct); and in dimerization domains (DD) (Table 1).
The mean variance for the amino acid transition in each group of coevolution () ranged between 0.531 and 2.084, whereas the mean correlation () varied between 0.625 and 0.948 (Table 1). The three-dimensional structure is available only for the amino-terminal and middle segments. The analysis of the atomic distances (AD) identified a certain percentage of coevolving residues within each group as spatially close (Table 1). Examples are the pairs of amino acids [(S113, A112) and (E4, G3, and V163)] in the amino-terminal domain (Figure 5C) and the groups of amino acids [(E351, N382), (Q409, F410), and (K445, S446)] in the middle segment (Figure 5D). A number of coevolving groups exhibit coevolution between sites significantly distant but functionally related (Table 1).
Revealing adaptive coevolution in GroEL:
Application of CAPS to GroEL, taking into account the clusters belonging to different bacterial groups and defined in Figure 6A, identified 17 groups of coevolving amino acids (G1–G17; Figure 6B). Seventy-five percent of coevolving sites were previously reported in functional data (supplemental Table 2 at http://www.genetics.org/supplemental/), or by neural network analysis, as functionally or structurally important.
The mean variance for the amino acid transition in each group of coevolution () ranged between 0.739 and 2.477, while the mean correlation () varied between 0.510 and 0.944 (Table 2). All of the groups detected included sites belonging to the apical and equatorial domains with very few sites belonging to the intermediate domain. Most of the sites not identified by functional data or undetected by predicting neural networks presented significantly short distances to functionally or structurally important sites. Examples of spatially close sites were sites [(K132, L134, K425), (A127, K132), and (S424, K425)] in the equatorial domain and sites (T210, G211, V213) in the apical domain (Figure 6C).
Interestingly, a significant proportion of the sites detected as coevolving in this GroEL data set have been previously proposed to be under adaptive evolution (Figure 6B) (Fares et al. 2002b, 2004). Most of these positively selected sites included in the same coevolution group belong to different domains (apical and equatorial domains). Examples of this were found in the coevolution groups G1, G2, G7, G8, and G9 (Figure 6B). Taking into account all these data, the overall percentage of coevolving sites previously identified as key for GroEL function or evolution was 82%. Other positively selected sites were proximal to important sites (for example, sites T210, G211, and V213 in group G1 are in physical contact with sites S201, Y203, and F204 that are involved in GroES and substrate binding; supplemental Table 2 at http://www.genetics.org/supplemental/). An important observation to mention is that not all positively selected sites in GroEL were detected to be under coevolution, which makes the dependence between positive selection and coevolution less likely.
A highly sensitive method:
To assess the validity of the method for detecting coevolution between sites, we asked two questions: How sensitive is the method to distinguish between true and false positives? And, how much does the method improve upon the sensitivity of similar nonparametric and parametric methods previously published? Our simulations indicate that CAPS does indeed identify a high percentage of true correlated pairs even at large pairwise sequence distances. When considering sensitivity, CAPS performs significantly better than the other three methods to which it was compared over all distances and sequence numbers. We also followed the recommendations of Tillier and Lui (2003), increasing the number of sequences in our simulated alignments to a number equivalent to the number of amino acids sites, but the sensitivity of CAPS remained unaltered (data not shown). The importance of the number of sequences in the alignment to obtain accurate coevolution results, especially in mutual information-based methods, has been previously investigated by Gloor et al. (2005). They reported that these methods require alignment sizes 10 times greater than those used in this study. We have shown that CAPS exhibits high sensitivity, using either large or small data sets. Our method therefore has an enormous advantage in the analysis of proteins that have not been sequenced in many organisms, proteins that arose recently in evolution, or proteins in which fast evolution permits obtaining accurate results only at very narrow phylogenetic ranges.
Coevolution within the gag gene phylogeny:
We have identified coevolution between the different functional regions of the gag gene in HIV-1 (Figure 3A). The two groups of amino acid sites exhibiting coevolution regardless of clade removal (R15, K28, R91, A120 and E55, V128) are detected as having a functional or structural dependency and included sites previously detected as having undergone adaptive evolution. Also, 42 of 73 sites were detected as coevolving phylogenetically with most of these coevolving sites being unique to specific lineages, thereby providing evidence for possible selective shifts between subtypes within the HIV-1 group M phylogeny. Interestingly, some of those sites were identified as phylogenetic ancestral coevolution (coevolving residues in the branch leading to the ancestor of two subtypes) or convergent coevolution (between lineages that do not share a recent common ancestor; Figure 4B). These results provide further evidence of the need for a more comprehensive evolutionary analysis of the distinct HIV-1 group M subtypes to obtain a full understanding of past, present, and future HIV-1 dynamical change.
Coevolution between functional domains in Hsp90:
Several studies demonstrate that Hsp90 functions through marked conformational changes that are governed by complex interdomain interactions (e.g., Prodromou et al. 1999; Chadli et al. 2000). Because of this complexity, the number of experimental analyses required to fully understand intramolecular Hsp90 interactions is prohibitively high. Moreover, experimental identification of residues involved in the stability of interdomain interactions due to their spatial proximity to functionally important domains is anything but straightforward. Computational methods are hence instrumental in the detection and a priori identification of regions or domains in which functional interaction should be tested.
CAPS identified 100% of the Hsp90 interdomain interactions proposed in previous studies and uncovered potential interactions that should be tested. We detected coevolution between sites belonging to PI2 and PI3 domains that are involved in binding different protein clients (Table 1). Simultaneous binding of proteins by Hsp90 has been proposed by other authors who suggest a synergistic effect of these proteins in the function of Hsp90 (Chen et al. 1998). An example of simultaneous substrate binding is that provided by the eNOS activation pathway, where the upstream activator of PKB/Akt, PDK1, binds Hsp90 simultaneously to eNOS (Fujita et al. 2002). Coevolution between these domains has interesting implications for the simultaneous and regulated binding of proteins with Hsp90. The method also detects coevolved amino acid sites involved in domain dimerization (DD2 and DD3) and cochaperone binding. Sites belonging to DD2 have been associated with the binding of accessory proteins such as Hop and immunophilines that are essential for the interaction and maturation of the complex Hsp90-Hsc70-client proteins (Chen et al. 1998). Deletion of the region 661–677 from the Hsp90 of the chicken Gallus gallus results in loss of Hsp90 dimerization and diminished interactions with all cofactors (Chen et al. 1998). Within this region, the method detected amino acid S662 (groups 3 and 30), N673 (group 31), and I674 (group 32). Not surprisingly, we also detected coevolution between these sites and others involved in interaction with eNOS, AKT, and glucocorticoid receptors (GRs).
The detected coevolution between sites from the N-terminal domain (A112–S115), involved in nucleotide binding, and sites from the middle segment and C-terminal domain (K445 and S446), involved in binding unfolded polypeptides, correlates with their functional relationships (Johnson et al. 2000). The coevolution between the N-terminal (S115) and the C-terminal (K636) ATP-binding domains supports previous functional data (Söti et al. 2002) and also correlates with the coevolution between the ATP-binding pocket in the N-terminal domain (A112 and D113 in group 17; S115 in group 23) and the ATP modulator domain (E351 in groups 17 and 23). The detected coevolution of sites belonging to distinct binding pockets of accessory elements (e.g., in group 24) supports works pinpointing the temporal regulation and coordinated interaction of cochaperones and Hsp90 (Chen et al. 1998). Other sites have been identified as coevolving with DD2 including, remarkably, the site W296 in which mutation in humans interferes in the interaction between Hsp90 and Akt (Meyer et al. 2003).
Finally, other sites with no reported functional importance to date have been detected in this study to coevolve with sites involved in protein interaction, ATP binding, or dimerization. All of these sites are exposed in the protein as predicted by the neural network analyses and by the three-dimensional model of Hsp90. These sites might establish additional interactions between Hsp90 domains and protein clients or may be responsible for the stabilization of the Hsp90 dimer through intermonomer interactions.
GroEL is under strong selective constraints of coevolution:
GroEL analysis points to three conclusions: Some of the coevolving sites are spatially close, supporting a functional and maybe structural complicity; the main interdomain coevolution took place between the apical and equatorial domains; and amino acid sites under adaptive evolution have also been identified as coevolving between each other.
Previous works have shown that GroEL has been accumulating advantageous mutations to improve its ability to compensate the effect of slightly deleterious and conformational-destabilizing mutations fixed in the proteome of endosymbionts (Fares et al. 2002b, 2004). Studies have even suggested that GroEL has been fixing slightly deleterious mutations as a result of the genetic drift operating in the genome of these bacteria (Herbeck et al. 2003). The fact that the method has detected amino acids coevolving with spatially close functional sites (Figure 5C) and that these sites underwent adaptive evolution suggests the existence of compensatory changes to maintain the overall structural stability of GroEL. This result can be fully explained under the neutral theory of evolution (Kimura 1983). Under this model, changes altering the amino acid side-chain spatial distribution will disrupt the already optimized performance of the protein's contribution to the organism's fitness. These disadvantageous mutations can be compensated only by changes in spatially proximal sites, where mutations will be fixed by adaptive evolution. The joint contribution of both mutations to the fitness of the organism will be neutral (Fukami-Kobayashi et al. 2002).
Advantages and limitations of the method:
Many methods developed to detect coevolution (e.g., Göbel et al. 1994; Neher 1994) have shown an inability to screen out background correlation, particularly in the presence of phylogenetic relatedness between the sequences, and to distinguish between positive and negative correlation (Pollock and Taylor 1997). As we have shown, CAPS is effective in distinguishing background correlation from true correlations. CAPS analysis can be performed with no knowledge of the phylogenetic relationships among sequences. We have, however, shown that the removal of well-defined clades can serve in the identification of structural/functional coevolution.
Coevolution methods have also been used in an attempt to resolve the docking problem (Göbel et al. 1994; Pazos et al. 1997). Despite their elegance, most of these methods failed to distinguish unambiguously between coevolution and phylogenetic noise. Correction of BLOSUM values by the sequence divergence and the consideration of functional and structural coevolution in CAPS allows isolation of the true coevolutionary events. It is important to mention that, unlike other studies, atomic distances are not used here as evidence of coevolution but rather as additional supporting information in the identification of functional and structural coevolution.
As expected, the method does not lack limitations. For example, saturation of synonymous sites can lead to underestimates of the divergence times, although data sets used in this study did not show such effects. The number of sequences in the alignment also poses a problem when sequences are too divergent, although the sensitivity is improved compared to that of previous methods. Further, constant amino acid sites that are very likely to be functionally important cannot be tested for coevolution using CAPS, although this limitation affects all the methods so far. Moreover, our method assumes that the coevolutionary relationship between a pair of sites remains constant through time. This assumption can be simplistic when analyzing alignments including highly divergent sequences. Nonetheless, the dynamic removal of prespecified phylogenetic clades ameliorates this problem.
Finally, even though coevolutionary analyses can be used to identify protein–protein interaction interfaces, CAPS is not designed for such a purpose. The reason is that, while interaction would necessarily involve coevolution, coevolution does not imply physical interaction. Detecting amino acids involved in protein–protein interactions is a more complex problem, requiring the consideration of other parameters such as solvent accessibility, physiochemical amino acid properties, etc. We have shown here that coevolutionary analyses in biologically key molecules add another dimension to selective constraints analyses and provide more interpretable results.
We are grateful to two anonymous reviewers for important comments and insightful suggestions on the manuscript. This work was supported by Science Foundation Ireland.
Communicating editor: S. Yokoyama
- Received November 8, 2005.
- Accepted March 3, 2006.
- Copyright © 2006 by the Genetics Society of America