Abstract
It is well known that the rate of amino acid substitution varies among different proteins and among different sites of a protein. It is, however, unclear whether the extent of rate variation among sites of a protein and the mean substitution rate of the protein are correlated. We used two approaches to analyze orthologous protein sequences of 51 nuclear genes of vertebrates and 13 mitochondrial genes of mammals. In the first approach, no assumptions of the distribution of the rate variation among sites were made, and in the second approach, the gamma distribution was assumed. Through both approaches, we found a negative correlation between the extent of amongsite rate variation and the average substitution rate of a protein. That is, slowly evolving proteins tend to have a high level of rate variation among sites, and vice versa. We found this observation consistent with a simple model of the neutral theory where most sites are either invariable or neutral. We conclude that the correlation is a general feature of protein evolution and discuss its implications in statistical tests of positive Darwinian selection and molecular time estimation of deep divergences.
THE rate of amino acid substitution varies considerably among different proteins as well as different sites of a protein (e.g., Kimura 1983; Nei 1987; Li 1997). It is interesting to study the relationship between these two types of rate variations because they result from the variation of functional constraints at different levels (proteins or amino acid residues) according to the neutral theory of molecular evolution (Kimura and Ohta 1974; Kimura 1983). In a recent study of vertebrate mitochondrial genes, Kumar (1996) observed that the extent of rate variation among nucleotide sites of a gene and the average substitution rate of the gene are negatively correlated. That is, fast evolving genes have low degrees of amongsite rate variation, whereas slowly evolving genes have high degrees of amongsite rate variation. However, the cause of the correlation was unclear at the time. Furthermore, the result was obtained based on the strong assumption that the amongsite rate variation follows a gamma distribution that may or may not hold in all the genes. To examine whether such correlation is a general feature of molecular evolution, a largescale analysis of nuclear and mitochondrial genes with an approach that is free from the assumption of the gamma distribution seems to be necessary. In this article, we analyzed protein sequences for 51 nuclear genes of vertebrates and 13 mitochondrial genes of mammals and found that the correlation holds for both nuclear and mitochondrial genes. We also explain the underlying cause of the correlation and discuss its implications.
DATA AND METHODS
Characterization of amongsite rate variation: Let us assume that the substitution rate (r) at an amino acid site of a protein sequence is an identically and independently distributed random number following an unknown distribution g(r).Let k be the number of substitutions at a given site during a period of evolutionary time T. Assuming that amino acid substitutions follow the Poisson process when the rate is given, k follows the distribution:
Nevertheless, the gamma distribution has been predominantly used in recent years to characterize the amongsite rate variation of proteins (e.g., Uzzell and Corbin 1971; Golding 1983; Tamura and Nei 1993; Yang 1994; Guet al. 1995; Gu and Zhang 1997). That is, the substitution rate (r) at a site is assumed to follow the gamma distribution with the probability density function
Genes used: We randomly chose 51 nuclear genes (see Table 1) for which the orthologous sequences from the human (Homosapiens), mouse (Mus musculus), chicken (Gallus gallus), and African clawed frog (Xenopus laevis) are available in the GenBank. All the protein sequences were downloaded from the HOVERGEN database (Duretet al. 1994). Hedges et al. (1996) examined the orthology of most of the genes for their purpose of estimating the divergence time of primates and rodents. For the remaining genes, we examined the orthology by phylogenetic analysis as described in Hedges et al. (1996). All the sequences used were longer than 100 amino acids. We chose the four species because (1) estimation of the gamma shape parameter α requires at least three sequences, (2) there are many gene sequences available for these species, (3) the phylogeny of the species is known, and (4) the divergence among these species is moderate for most genes so that pairwise distances and substitution rates can be accurately estimated. As shown in Equation 4, to study the mean rate μ, we have to know the total amount of evolutionary time T for a tree. But if we use the same species for all the genes, T will be the same so that we can infer properties of μ from those of μT without knowing the actual value of T.
We also studied 13 mitochondrial genes of 9 mammalian species, including the mouse (GenBank accession no. J01420), rat (Rattus norvegicus; X14848), human (D38112), gibbon (Hylobates lar; X97707), cow (Bos taurus; V00654), whale (Balaenoptera physalus; X61145), opossum (Didelphis virginiana; Z29573), wallaroo (Macropus robustus; Y10524), and platypus (Ornithorhynchus anatinus; X83427). In this case only mammalian species were used because mammalian mitochondrial genes evolve quickly (Brownet al. 1979) and use of distantly related species may increase estimation errors. The unrooted tree of these 9 species has been well established (e.g., Jankeet al. 1997).
Sequence analysis: For the orthologous sequences of each gene, we used the known species tree and estimated the number of substitutions (k) along the tree at each amino acid site by using Equation 7 of Gu and Zhang's (1997) method. We then estimated the mean and variance of k. Estimates of μT and CV(r) were then obtained by using Equations 4 and 6, and they were denoted as D and CV, respectively.
We also estimated the gamma shape parameter (α) for the protein sequences of each gene by using Gu and Zhang's (1997) method, which has been shown to be as accurate as, but much faster than, the likelihood method. Actually, the likelihood method (Yang 1994) and the Gu and Zhang's (1997) method gave essentially the same result for the data sets used in this article (data not shown). In Gu and Zhang's (1997) method, the k value for each site is first estimated. Because k follows the negative binomial distribution when r follows the gamma distribution, the gamma shape parameter α is estimated by maximizing the following likelihood function derived from the negative binomial distribution,
For each gene, by a likelihood ratio test, we examined whether the null hypothesis that all sites have the same substitution rate can be rejected and whether the alternative hypothesis of the gamma distribution is favored. We computed
RESULTS
Nuclear genes: Table 1 shows the estimates for the mean substitution rate and the extent of amongsite rate variation for the 51 nuclear genes examined. The coefficient of variation (CV) of the substitution rate among sites varies from 0.43 to 2.02 with the mean of 1.09 and the median of 1.00 (Figure 2). Figure 3 shows the relationships between CV and D (average number of substitutions per site along the whole tree). We fitted the data by a linear regression of CV and lnD. The coefficient of correlation (R) is −0.95, which is significantly different from 0 (P < 0.0001). If we fit the data by a linear regression of CV and D, R becomes −0.91 (P < 0.0001). Because D = μT and T (total amount of evolutionary time of the tree) is identical for all the genes, the correlation coefficient between CV and D (or ln D) is the same as that between CV and μ (or lnμ). These results therefore show that the mean substitution rate and the extent of amongsite rate variation of a protein are negatively correlated. This observation is independent of any assumption about the underlying distribution of the substitution rate g(r).
We then examined the correlation under the assumption of a gamma distribution for the rate variation among sites. The gamma shape parameter α of the 51 nuclear genes varies from 0.17 to 3.45 with the median of 0.71 (Table 1 and Figure 4). About twothirds of the genes have α values <1, and, therefore, their rate distributions are Lshaped (see Figure 1A). Only 12% of the genes have α values >2. Likelihood ratio tests showed that the null hypothesis of no rate variation among sites can be rejected at the 1% significance level for 48 of the 51 genes and can be rejected at the 5% level for the remaining three genes. Therefore, the alternative hypothesis of the gamma distribution is favored. A common feature of the three genes mentioned above is that they are all short (around 150 amino acids), suggesting that the test might not be powerful enough for these genes. The gamma distance d (number of amino acid substitutions per site) between the mouse and chicken varies from 0.057 to 0.703 for the 51 genes (Table 1), which corresponds to the rate of about 0.1 to 1.2 amino acid substitutions per site per billion (10^{9}) years if we assume that mammals and birds diverged about 300 mya, which is now generally accepted (Benton 1993; Hedgeset al. 1996).
Figure 5 shows the relationship of α and d (between the mouse and chicken) for the 51 nuclear genes. The R value for α and d is 0.68, which is significantly different from 0 (P < 0.0001). Linear regression of α and lnd does not change the R value very much. Because only orthologous sequences were used, d is proportional to the average substitution rate μ (d = 2μt, where t is the time since the divergence of birds and mammals). Therefore, the correlation coefficient between α and μ is also 0.68. Although only the distances between the mouse and chicken are presented in Figure 5, use of pairwise distances among other species gave similar results. Large estimates of α (e.g., α > 2) are known to have substantial sampling errors (Gu and Zhang 1997), and they should have lower weights than other data points in the regression. A simple practice is to give zero weights to them. The correlation coefficient for the remaining 45 genes with α < 2 is now 0.86 (P < 0.0001).
Influence of estimation biases: The above analyses showed that the estimated mean substitution rate and the extent of rate variation among sites are highly correlated. However, to prove that our observations reflect the true properties of protein evolution, we have to show that the correlation is not an artifact due to some biases of the estimation methods we have used. In the estimation of CV and D, no assumptions of the distribution g(r) were made. It is, therefore, very difficult to examine the estimation biases quantitatively. In the following paragraphs, we studied possible biases in the estimation of α and d, where g(r) was assumed to be gamma.
A computer simulation study (J. Zhang, unpublished results) demonstrated that when the number of sequences used is small (e.g., four), estimation of α is subject to asymptotic bias: α is likely to be underestimated for closely related sequences, but overestimated for highly diverged sequences. This is because when the divergence level is low, there is not enough time for substitutions to occur at potentially variable sites, so they are regarded as invariant sites and α is underestimated. When the divergence level is very high, multiple hits cannot be fully corrected. As a result, the numbers of substitutions are underestimated at very fast evolving sites and α is subsequently overestimated. The estimation bias decreases when the number of sequences used increases (Gu and Zhang 1997; J. Zhang, unpublished results). Because the estimation bias of α is likely to create a positive correlation between α and d (or μ), it is worthwhile to examine the influence of the estimation bias on the present study quantitatively by using computer simulation.
The tree of the human, mouse, chicken, and frog (Figure 6) was used as a model tree in the simulation. The branch lengths of the tree were proportional to the divergence times of the species, that is, 100 million yr between the human and mouse (Hedgeset al. 1996), 300 million yr between mammals and birds, and 350 million yr between amniotes and frogs (Ahlberg and Milner 1994). In the computer simulation, we simulated the evolution of 45 independent genes. For each gene, we randomly chose an α value from a uniform distribution in the range of 0.2 to 2 as the gamma shape parameter of the gene, and a p value from a uniform distribution in the range of 0.05 to 0.45 as the proportional difference between the mouse and chicken proteins. These ranges represented the observed variations in the 45 genes, excluding the 6 genes with estimated α values >2. The sequence length used was 450 amino acids, because it is approximately the average length of the proteins that we have examined. The simulation procedure was similar to that described in Zhang and Nei (1997), except that the substitution model used was gamma + Poisson. That is, for each site, the relative substitution rate (r_{R}) was randomly generated from a gamma distribution with the expectation of 1 and the shape parameter α, and the amino acid substitution followed the Poisson process. The tree branch lengths (expected numbers of substitutions per site) were computed for each site separately. For example, the expected distance between the mouse and chicken is d_{M−C}, computed from Equation 9 by using the α and p values that were randomly chosen for this gene, and the average substitution rate for the protein is then μ = d_{M−C}/(2 × 300) substitutions per site per mya, where 300 stands for 300 million yr. The substitution rate at a given site is therefore equal to r = r_{R}μ, where r_{R} is the relative rate at the site. The length of the branch from human to the common ancestor of human and mouse is then 100r substitutions for the site, where 100 stands for 100 million yr. Other branch lengths can be computed similarly. In the simulation, after the four protein sequences at the exterior nodes of the tree were generated, the gamma shape parameter α was estimated by the method of Gu and Zhang (1997), and the gamma distance (d) between the simulated mouse and chicken sequences was computed by using Equation 9 with the estimated α and p. The simulation results with the estimated α >2 were abandoned. When 45 independent genes were simulated and 45 couples of α (<2) and d were obtained, the correlation coefficient R was computed. We conducted 2500 replications of such computer simulations, and the distribution of the 2500 R values is presented in Figure 7.
It is seen from Figure 7 that the estimated α and d are positively correlated, though the true α and d have no correlation at all. The correlation coefficient of the estimated α and d is significantly different from 0 because all 2500 simulated R values are positive. The average correlation coefficient is 0.65. However, only 0.32% of the 2500 simulated R values are greater than 0.86, which is the observed R value for the 45 nuclear genes examined. In other words, the observed R value of 0.86 is significantly greater than 0 at the 0.32% significance level, even when the estimation bias of α is considered. Therefore, the observed positive correlation between α and d of the nuclear proteins seems to have underlying causes other than the estimation bias.
In the above simulation, α and p were randomly chosen from a uniform distribution, respectively. But in reality, neither of them follow a uniform distribution (see Figure 4). It is expected, and is confirmed in a smallscale simulation, that the R values obtained from the simulation will be smaller when the true distributions of the α and p are used. Therefore, the probability of observing an R value of 0.86 or larger, given that there is no correlation between α and d, may have been overestimated in the simulation. In other words, the effect of the estimation bias may be exaggerated in the simulation and the confidence level of 0.32% is a conservative estimate. The assumptions of the Poisson model of amino acid substitution for protein evolution does not seem to have affected our simulation result very much. Because only orthologous proteins are used in the analysis, the evolutionary rate of a protein is expected to be more or less constant among lineages unless there is variation of the mutation rate due to the generation time effect (Ohta 1995; Liet al. 1996), which, however, acts on all the genes to the same extent and does not affect the correlation between CV (or α) and D (or d).
Mitochondrial genes: The estimated CV values of the 13 mitochondrial genes are between 0.66 and 2.05, with the mean of 1.14 and the median of 1.10 (Table 2 and Figure 2). Again, we found that CV and D are highly correlated. The R value is −0.98 (P < 0.0001) for the linear regression between CV and ln D (Figure 8). If we fit the data by a linear regression of CV and D, R becomes −0.92 (P < 0.0001). These results show that the mean substitution rate and the extent of amongsite rate variation are negatively correlated for the mitochondrial genes as well.
The estimated α values for the mitochondrial genes vary from 0.19 to 1.68 with the median of 0.56 (Table 2; Figure 4). Similar to the nuclear genes, about twothirds of the mitochondrial genes have α values <1. The distance d between the human and mouse varies from 0.20 to 1.15 (Table 2), corresponding to the rate of about 1.0 to 5.8 amino acid substitutions per site per billion years under the assumption that the human and mouse diverged 100 mya. It is seen that mammalian mitochondrial genes evolve about 5 to 10 times faster than nuclear genes, as generally believed (Brownet al. 1979).
Figure 9 shows the relationship of α and d (of the human and mouse) for the mitochondrial genes. The correlation coefficient between α and d (or μ) is 0.72 (P < 0.01). Because we have used nine species in the estimation of α, and the estimation bias of α is expected to be quite small when the number of sequences used is no less than eight (Gu and Zhang 1997; J. Zhang, unpublished results), the observed correlation between α and d is unlikely to have been affected by the estimation bias of α. This result was also confirmed by a smallscale computer simulation similar to that performed for the nuclear genes (data not shown). As mentioned before, Kumar (1996) found that the correlation coefficient between α and d for mitochondrial genes is 0.82 by using 1st and 2nd codon positions of the nucleotide sequences. This value is quite close to what we have obtained (0.72), although we used different species and different methods of α estimation. This suggests that the correlation is robust. As in the case of the nuclear genes, the hypothesis of no rate variation among sites was rejected for all 13 mitochondrial genes (P < 0.01).
DISCUSSION
Causes of the negative correlation between the mean rate and the extent of amongsite rate variation: Through the analysis of 64 genes in total, we have shown that the mean substitution rate and the extent of amongsite rate variation of a protein are negatively correlated. We have also excluded the possibility that the correlation is caused by estimation biases or is dependent on the assumption of the gamma distribution of the rate variation among sites. Because the 51 nuclear genes we examined are scattered all over the genome and they encode proteins with a variety of functions, it is unlikely that the correlation is limited to particular chromosomal regions or is restricted to particular groups of proteins. Rather, the correlation is likely to be a general feature of molecular evolution because both nuclear and mitochondrial genes show similar patterns. Nevertheless, all the data we used are from vertebrates, and our conclusion needs further support from genes of other organisms.
The observation of negative correlation can be explained under the neutral theory of molecular evolution (Kimura 1983; see also Ohta 1992), which asserts that protein sequence evolution is under purifying selection and random genetic drift, and that the rate variation is due to the difference in functional constraints among amino acid sites of a protein and among different proteins. In the following paragraphs, we explain the cause of the correlation by using some simple models of neutral evolution.
Let ν be the mutation rate at an amino acid site, which is assumed to be constant for all sites of all proteins. Then the substitution rate is 0 for an invariant site, between 0 and ν for a variable but functionally constrained site, and equal to ν for a neutral site. Let us first consider a simple model in which the functional constraints at all sites of a protein are equal. Then the mean rate μ and CV have no correlation because CV is always 0, and μ can be any value from 0 to ν. Therefore, this model cannot explain our observation of the correlation between CV and μ. Now let us consider another model, in which there are two types of sites, either invariant or neutral. Then the average substitution rate of the protein becomes
In reality, the pattern of the variation of functional constraint among sites is somewhere between the above two extreme models, and the observation of the strong correlation between μ and CV or 1/(CV^{2} + 1) suggests that it is probably much closer to the second model than to the first. This is biologically understandable because, for an amino acid sequence to be functional, there must be some essential sites that are invariant and some other sites that are from mildly variable to almost free of any change. This pattern of functional constraints seems to be the requirement for a functional protein. It is unrealistic that the constraints at all amino acid sites of a protein are more or less equal unless the protein is free of any functional constraint (no function). In Equation 14, we can see that v also influences the relationship of μ and CV. There is evidence that the mutation rate varies among nuclear genes (Wolfeet al. 1989). Therefore, the correlation between μ and CV for the nuclear genes may have been depressed due to the variation of mutation rate among genes. This effect is expected to be small for the 13 mitochondrial genes because they have similar mutation rates (Nedbal and Flynn 1998).
The biological meaning of α: The gamma distribution is predominantly used to fit the rate variation among sites of proteins. Our results from the likelihood ratio test indicated that the gamma distribution model fit the rate variation significantly better than the uniform rate model. But the reason why rate variation follows the gamma distribution is unknown and, therefore, the biological meaning of the shape parameter α is not clear. According to our observation of the strong positive correlation between α and μ, α can be interpreted as a measure of functional conservation of a protein. Small values of α indicate strong constraint, and large values indicate weak constraint. It is therefore expected that fourfold degenerate sites and intron regions of functional genes and pseudogenes have very large α values. For example, we estimated that the α of the glycerol3phosphate dehydrogenase (GPDH, EC 1.1.1.8) of 13 Drosophila and related species (Kwiatowskiet al. 1997) is about 0.10, but the α of the intron 3 and 5 of the Gpdh gene is ~3.0. Actually, the hypothesis of rate constancy over sites cannot be rejected for these intron sequences (P > 0.05, goodness of fit test). However, the above interpretation of α is valid only when the majority of amino acid sites are undergoing neutral evolution (i.e., no positive selection). When strong positive selection is acting on a large number of sites, μ tends to be high and α tends to be small because functionally most important sites remain invariant, whereas positively selected sites have very high substitution rates. So, in this case, α does not measure the functional conservation of a protein. For example, we found that the α value of the sperm lysin of 20 abalone species (Leeet al. 1995) is 0.16, whereas the μ is about 2 to 3 times the mutation rate ν. Here, ν was estimated under the assumption that the mutation rate per amino acid is 2.25 times the mutation rate per nucleotide (Nei 1975, p. 225), which was further assumed to be equal to the rate of synonymous nucleotide substitution. So, if this gene is presented in Figure 5, it will sit at the top left corner, far from the regression line.
It is intriguing that there are no dots sitting in the upper left corner in Figure 5 or the upper right corner in Figure 3. These are the locations where both the mean rate and the amongsite rate variation are high, and where we expect to find proteins consisting of a number of conserved sites and a number of directionally selected sites. The simplest explanation would be that such genes are not included in our sample of 64 genes. Because our sample is random and is not small, the result suggests that those genes that are under strong positive selection are rare in the genome. Of course, we cannot distinguish neutral genes from those in which positive selection operates in a very small number of sites, because in the latter, the overall relationship between α and μ is mainly determined by the sites that are neutral or under purifying selection. In fact, there are no statistical methods that can effectively detect positive selection that only acts on one or several sites except in the case of convergent and parallel evolution (Zhang and Kumar 1997). It has to be noted that all we have studied here is orthologous gene evolution. Because orthologous genes usually have the same function, it is understandable that we did not find signals of strong positive selection (Sharp 1997). Strong selection is more likely to operate on newly duplicated genes during the evolution of novel gene functions (Zhanget al. 1998).
It is also interesting that no dots were found in the lower left corner of Figure 3, where both the mean rate and the rate variation are small. This suggests that even in the most conserved proteins it is unlikely that all sites are equally conserved. More probably, there are still a few nearly neutral sites so that CV becomes high (see Equation 13).
Other implications: Accelerated evolution in some evolutionary lineages has been observed in several proteins (e.g., Stewartet al. 1987; Tucker and Lundrigan 1993; Wallis 1993, 1996; Whitfieldet al. 1993). If the mutation rate is constant among lineages, there are two possible causes of the enhanced substitution rate. One is relaxation of functional constraint, and the other is positive selection. When the rate of nonsynonymous nucleotide substitution is not significantly greater (or even not greater) than the rate of synonymous substitution, it is often difficult to distinguish between the two possible causes of the accelerated evolution (e.g., Li and Gojobori 1983). The relationship between μ and α may provide us some information, because when μ is high, high values of α are expected if functional constraint is reduced, but small values of α are likely if strong positive selection acts on a large number of sites (as in the case of the abalone sperm lysin).
Using the relationship between μ and α, let us examine an interesting case of fast evolution. Whitfield et al. (1993) and Tucker and Lundrigan (1993) reported that the sexdetermining protein SRY evolves very fast in primates and rodents, and the rates of nonsynonymous and synonymous nucleotide substitution are similar. But because the nonsynonymous rate is not significantly higher than the synonymous rate, these authors were not able to determine whether the accelerated evolution is due to positive selection or relaxation of functional constraints. SRY protein is a transcription factor that binds to specific DNA sites to regulate the expression of its downstream genes in the sex determination pathway. SRY consists of Nterminal, HMG, and Cterminal domains (with 58, 78, and 68 amino acids, respectively, in the human sequence). The HMG domain is responsible for DNA binding and is very conservative, whereas the other two domains are highly variable, particularly the C domain. The length of the HMG domain remains the same among placental mammals, but the lengths of the other two domains vary within or among mammalian orders. The estimated α values of the SRY proteins of eight primate species (Whitfieldet al. 1993) are 1.58, 0.24, and 3.73 for the N, HMG, and C domains, respectively. The large values of α for the N and C domains, particularly the latter, suggest that the accelerated evolution in SRY is probably due to low functional significance out of the HMG domain rather than adaptation under strong positive selection. Actually, all clinical mutations in human SRY resulting in phenotypic sex reversal are found in the HMG domain (Hawkinset al. 1992; Goodfellow and LovellBadge 1993; Werneret al. 1995), except for one case in which a nonsense mutation happened in the C domain (Tajimaet al. 1994). Pontiggia et al. (1995) found that SRY proteins from different primate species are very similarin both DNA binding and bending and predicted that they should be able to substitute for each other.
Ignoring rate variation among sites causes underestimation of distances between homologous proteins. For estimating dates of early divergences by the molecular clock, only slowly evolving proteins can be used. Because slowly evolving proteins tend to have high degrees of rate variation among sites, consideration of the rate variation and use of gamma distances seems to be necessary for such estimations. This has been highlighted in recent debates on the age of the most recent common ancestor of presentday organisms (Doolittleet al. 1996; Hasegawaet al. 1996; Miyamoto and Fitch 1996; Fenget al. 1997; Gu 1997). In this case, neglect of the rate variation among sites was found to bias the time estimation at least 0.5–1 billion years. However, it is worth mentioning that use of the gamma distance enlarges the sampling variance so that many genes are needed for obtaining reliable estimates of divergence times.
Acknowledgments
We are grateful to Sudhir Kumar and Patrick Parker for providing us aligned protein sequences of the nuclear genes. We thank Sudhir Kumar, Masatoshi Nei, Tanya Sitnikova, ChungI Wu, and two anonymous reviewers for their comments on an earlier version of this article. This work was supported by the research grants from the National Institutes of Health and the National Science Foundation to M. Nei.
Footnotes

Communicating editor: C.I Wu
 Received December 10, 1997.
 Accepted April 7, 1998.
 Copyright © 1998 by the Genetics Society of America