Genetics, Vol. 149, 1615-1625, July 1998, Copyright © 1998

Correlation Between the Substitution Rate and Rate Variation Among Sites in Protein Evolution

Jianzhi Zhanga and Xun Gua
a Institute of Molecular Evolutionary Genetics and Department of Biology, Pennsylvania State University, University Park, Pennsylvania 16802

Corresponding author: Jianzhi Zhang, 322 Mueller Laboratory, Institute of Molecular Evolutionary Genetics, Pennsylvania State University, University Park, PA 16802, jxz128{at}psu.edu (E-mail).

Communicating editor: C.-I WU


*  ABSTRACT
*TOP
*ABSTRACT
*DATA AND METHODS
*RESULTS
*DISCUSSION
*LITERATURE CITED

It is well known that the rate of amino acid substitution varies among different proteins and among different sites of a protein. It is, however, unclear whether the extent of rate variation among sites of a protein and the mean substitution rate of the protein are correlated. We used two approaches to analyze orthologous protein sequences of 51 nuclear genes of vertebrates and 13 mitochondrial genes of mammals. In the first approach, no assumptions of the distribution of the rate variation among sites were made, and in the second approach, the gamma distribution was assumed. Through both approaches, we found a negative correlation between the extent of among-site rate variation and the average substitution rate of a protein. That is, slowly evolving proteins tend to have a high level of rate variation among sites, and vice versa. We found this observation consistent with a simple model of the neutral theory where most sites are either invariable or neutral. We conclude that the correlation is a general feature of protein evolution and discuss its implications in statistical tests of positive Darwinian selection and molecular time estimation of deep divergences.


THE rate of amino acid substitution varies considerably among different proteins as well as different sites of a protein (e.g., KIMURA 1983 Down; NEI 1987 Down; LI 1997 Down). It is interesting to study the relationship between these two types of rate variations because they result from the variation of functional constraints at different levels (proteins or amino acid residues) according to the neutral theory of molecular evolution (KIMURA and OHTA 1974 Down; KIMURA 1983 Down). In a recent study of vertebrate mitochondrial genes, KUMAR 1996 Down observed that the extent of rate variation among nucleotide sites of a gene and the average substitution rate of the gene are negatively correlated. That is, fast evolving genes have low degrees of among-site rate variation, whereas slowly evolving genes have high degrees of among-site rate variation. However, the cause of the correlation was unclear at the time. Furthermore, the result was obtained based on the strong assumption that the among-site rate variation follows a gamma distribution that may or may not hold in all the genes. To examine whether such correlation is a general feature of molecular evolution, a large-scale analysis of nuclear and mitochondrial genes with an approach that is free from the assumption of the gamma distribution seems to be necessary. In this article, we analyzed protein sequences for 51 nuclear genes of vertebrates and 13 mitochondrial genes of mammals and found that the correlation holds for both nuclear and mitochondrial genes. We also explain the underlying cause of the correlation and discuss its implications.


*  DATA AND METHODS
*TOP
*ABSTRACT
*DATA AND METHODS
*RESULTS
*DISCUSSION
*LITERATURE CITED

Characterization of among-site rate variation:
Let us assume that the substitution rate (r) at an amino acid site of a protein sequence is an identically and independently distributed random number following an unknown distribution g(r). Let k be the number of substitutions at a given site during a period of evolutionary time T. Assuming that amino acid substitutions follow the Poisson process when the rate is given, k follows the distribution:

(1)

The expectation of k, E(k), can be obtained by

(2)
where µ = E(r) is the expectation of r, or mean substitution rate. The variance of k, Var(k), can be obtained by

(3)
where Var(r) is the variance of r. Therefore, the mean and variance of r can be obtained from

(4)
and

(5)

The degree of among-site rate variation is better measured by the coefficient of variation, which is obtained from

(6)

It is interesting to see from Equation 4 and Equation 6 that if we can estimate the mean and variance of k, we will be able to study the mean substitution rate and the extent of among-site rate variation without knowing the underlying rate distribution g(r) (GU and LI 1998 Down).

Nevertheless, the gamma distribution has been predominantly used in recent years to characterize the among-site rate variation of proteins (e.g., UZZELL and CORBIN 1971 Down; GOLDING 1983 Down; TAMURA and NEI 1993 Down; YANG 1994 Down; GU et al. 1995 Down; GU and ZHANG 1997 Down). That is, the substitution rate (r) at a site is assumed to follow the gamma distribution with the probability density function

(7)
where {Gamma} () is the gamma function. There are two parameters in Equation 7: the mean rate µ = E(r) and the shape parameter {alpha}. The extent of among-site rate variation can be measured by {alpha}. Small values of {alpha} indicate a high degree of rate variation among sites, whereas large values indicate a low degree of variation. In fact, under the gamma model, CV(r) = . Some examples of the gamma distribution with different values of µ and {alpha} are given in Figure 1.



View larger version (19K):
In this window
In a new window
Download PPT slide
 
Figure 1. The probability density functions of gamma distributions. (A) The mean (µ) is fixed at 1 and the shape parameter ({alpha}) is presented next to each curve. (B) {alpha} is fixed at 2 and µ of each curve is presented.

Genes used:
We randomly chose 51 nuclear genes (see Table 1) for which the orthologous sequences from the human (Homo sapiens), mouse (Mus musculus), chicken (Gallus gallus), and African clawed frog (Xenopus laevis) are available in the GenBank. All the protein sequences were downloaded from the HOVERGEN database (DURET et al. 1994 Down). HEDGES et al. 1996 Down examined the orthology of most of the genes for their purpose of estimating the divergence time of primates and rodents. For the remaining genes, we examined the orthology by phylogenetic analysis as described in HEDGES et al. 1996 Down. All the sequences used were longer than 100 amino acids. We chose the four species because (1) estimation of the gamma shape parameter {alpha} requires at least three sequences, (2) there are many gene sequences available for these species, (3) the phylogeny of the species is known, and (4) the divergence among these species is moderate for most genes so that pairwise distances and substitution rates can be accurately estimated. As shown in Equation 4, to study the mean rate µ, we have to know the total amount of evolutionary time T for a tree. But if we use the same species for all the genes, T will be the same so that we can infer properties of µ from those of µT without knowing the actual value of T.


 
View this table:
In this window
In a new window

 
Table 1. Substitution rates and among-site rate variations of 51 nuclear genes of vertebrates

We also studied 13 mitochondrial genes of 9 mammalian species, including the mouse (GenBank accession no. J01420), rat (Rattus norvegicus; X14848), human (D38112), gibbon (Hylobates lar; X97707), cow (Bos taurus; V00654), whale (Balaenoptera physalus; X61145), opossum (Didelphis virginiana; Z29573), wallaroo (Macropus robustus; Y10524), and platypus (Ornithorhynchus anatinus; X83427). In this case only mammalian species were used because mammalian mitochondrial genes evolve quickly (BROWN et al. 1979 Down) and use of distantly related species may increase estimation errors. The unrooted tree of these 9 species has been well established (e.g., JANKE et al. 1997 Down).

Sequence analysis:
For the orthologous sequences of each gene, we used the known species tree and estimated the number of substitutions (k) along the tree at each amino acid site by using Equation 7 of GU and ZHANG's (1997) method. We then estimated the mean and variance of k. Estimates of µT and CV(r) were then obtained by using Equation 4 and Equation 6, and they were denoted as D and CV, respectively.

We also estimated the gamma shape parameter ({alpha}) for the protein sequences of each gene by using GU and ZHANG's (1997) method, which has been shown to be as accurate as, but much faster than, the likelihood method. Actually, the likelihood method (YANG 1994 Down) and the GU and ZHANG's (1997) method gave essentially the same result for the data sets used in this article (data not shown). In GU and ZHANG's (1997) method, the k value for each site is first estimated. Because k follows the negative binomial distribution when r follows the gamma distribution, the gamma shape parameter {alpha} is estimated by maximizing the following likelihood function derived from the negative binomial distribution,

(8)
where ki is the number of substitutions at site i, n is the number of amino acids of the sequence, and D = µT is the mean of ki over all sites, as mentioned earlier. Under the gamma model, the unbiased pairwise distances among the orthologous sequences were estimated by using the formula (NEI et al. 1976 Down; OTA and NEI 1994 Down),

(9)
where d is the gamma distance between two protein sequences, p is the proportional difference, and {alpha} is the estimated gamma shape parameter.

For each gene, by a likelihood ratio test, we examined whether the null hypothesis that all sites have the same substitution rate can be rejected and whether the alternative hypothesis of the gamma distribution is favored. We computed

(10)
where {alpha} is the estimated gamma shape parameter and L({infty}) is the likelihood of the null hypothesis. Because the null hypothesis is a special case of the alternative hypothesis with {alpha} being infinity, l is approximately {chi}2 distributed with the number of degrees of freedom being 1.


*  RESULTS
*TOP
*ABSTRACT
*DATA AND METHODS
*RESULTS
*DISCUSSION
*LITERATURE CITED

Nuclear genes:
Table 1 shows the estimates for the mean substitution rate and the extent of among-site rate variation for the 51 nuclear genes examined. The coefficient of variation (CV) of the substitution rate among sites varies from 0.43 to 2.02 with the mean of 1.09 and the median of 1.00 (Figure 2). Figure 3 shows the relationships between CV and D (average number of substitutions per site along the whole tree). We fitted the data by a linear regression of CV and lnD. The coefficient of correlation (R) is -0.95, which is significantly different from 0 (P < 0.0001). If we fit the data by a linear regression of CV and D, R becomes -0.91 (P < 0.0001). Because D = µT and T (total amount of evolutionary time of the tree) is identical for all the genes, the correlation coefficient between CV and D (or ln D) is the same as that between CV and µ (or lnµ). These results therefore show that the mean substitution rate and the extent of among-site rate variation of a protein are negatively correlated. This observation is independent of any assumption about the underlying distribution of the substitution rate g(r).



View larger version (21K):
In this window
In a new window
Download PPT slide
 
Figure 2. Distributions of the CV values of 51 nuclear (solid histograms) and 13 mitochondrial (open histograms) genes.



View larger version (15K):
In this window
In a new window
Download PPT slide
 
Figure 3. Correlation between the extent of among-site rate variation CV and the mean substitution rate D (=µT) in 51 nuclear genes. R is the coefficient of correlation.

We then examined the correlation under the assumption of a gamma distribution for the rate variation among sites. The gamma shape parameter {alpha} of the 51 nuclear genes varies from 0.17 to 3.45 with the median of 0.71 (Table 1 and Figure 4). About two-thirds of the genes have {alpha} values <1, and, therefore, their rate distributions are L-shaped (see Figure 1A). Only 12% of the genes have {alpha} values >2. Likelihood ratio tests showed that the null hypothesis of no rate variation among sites can be rejected at the 1% significance level for 48 of the 51 genes and can be rejected at the 5% level for the remaining three genes. Therefore, the alternative hypothesis of the gamma distribution is favored. A common feature of the three genes mentioned above is that they are all short (around 150 amino acids), suggesting that the test might not be powerful enough for these genes. The gamma distance d (number of amino acid substitutions per site) between the mouse and chicken varies from 0.057 to 0.703 for the 51 genes (Table 1), which corresponds to the rate of about 0.1 to 1.2 amino acid substitutions per site per billion (109) years if we assume that mammals and birds diverged about 300 mya, which is now generally accepted (BENTON 1993 Down; HEDGES et al. 1996 Down).



View larger version (20K):
In this window
In a new window
Download PPT slide
 
Figure 4. Distributions of the {alpha} values of 51 nuclear (solid histograms) and 13 mitochondrial (open histograms) genes.

Figure 5 shows the relationship of {alpha} and d (between the mouse and chicken) for the 51 nuclear genes. The R value for {alpha} and d is 0.68, which is significantly differ-ent from 0 (P < 0.0001). Linear regression of {alpha} and lnd does not change the R value very much. Because only orthologous sequences were used, d is proportional to the average substitution rate µ (d = 2µt, where t is the time since the divergence of birds and mammals). Therefore, the correlation coefficient between {alpha} and µ is also 0.68. Although only the distances between the mouse and chicken are presented in Figure 5, use of pairwise distances among other species gave similar results. Large estimates of {alpha} (e.g., {alpha} > 2) are known to have substantial sampling errors (GU and ZHANG 1997 Down), and they should have lower weights than other data points in the regression. A simple practice is to give zero weights to them. The correlation coefficient for the remaining 45 genes with {alpha} < 2 is now 0.86 (P < 0.0001).



View larger version (17K):
In this window
In a new window
Download PPT slide
 
Figure 5. Correlation between the gamma shape parameter {alpha} and the distance d between mouse and chicken proteins. The dashed line is the regression for 51 genes (both solid and open circles), whereas the solid line is the regression for the 45 genes (solid circles) excluding the 6 genes (open circles) whose {alpha} values are estimated to be greater than 2. The regression equations presented at the bottom of the figure are for the 51 genes, and the equations at the top are for the 45 genes.

Influence of estimation biases:
The above analyses showed that the estimated mean substitution rate and the extent of rate variation among sites are highly correlated. However, to prove that our observations reflect the true properties of protein evolution, we have to show that the correlation is not an artifact due to some biases of the estimation methods we have used. In the estimation of CV and D, no assumptions of the distribution g(r) were made. It is, therefore, very difficult to examine the estimation biases quantitatively. In the following paragraphs, we studied possible biases in the estimation of {alpha} and d, where g(r) was assumed to be gamma.

A computer simulation study (J. ZHANG, unpublished results) demonstrated that when the number of sequences used is small (e.g., four), estimation of {alpha} is subject to asymptotic bias: {alpha} is likely to be underestimated for closely related sequences, but overestimated for highly diverged sequences. This is because when the divergence level is low, there is not enough time for substitutions to occur at potentially variable sites, so they are regarded as invariant sites and {alpha} is underestimated. When the divergence level is very high, multiple hits cannot be fully corrected. As a result, the numbers of substitutions are underestimated at very fast evolving sites and {alpha} is subsequently overestimated. The estimation bias decreases when the number of sequences used increases (GU and ZHANG 1997 Down; J. ZHANG, unpublished results). Because the estimation bias of {alpha} is likely to create a positive correlation between {alpha} and d (or µ), it is worthwhile to examine the influence of the estimation bias on the present study quantitatively by using computer simulation.

The tree of the human, mouse, chicken, and frog (Figure 6) was used as a model tree in the simulation. The branch lengths of the tree were proportional to the divergence times of the species, that is, 100 million yr between the human and mouse (HEDGES et al. 1996 Down), 300 million yr between mammals and birds, and 350 million yr between amniotes and frogs (AHLBERG and MILNER 1994 Down). In the computer simulation, we simulated the evolution of 45 independent genes. For each gene, we randomly chose an {alpha} value from a uniform distribution in the range of 0.2 to 2 as the gamma shape parameter of the gene, and a p value from a uniform distribution in the range of 0.05 to 0.45 as the proportional difference between the mouse and chicken proteins. These ranges represented the observed variations in the 45 genes, excluding the 6 genes with estimated {alpha} values >2. The sequence length used was 450 amino acids, because it is approximately the average length of the proteins that we have examined. The simulation procedure was similar to that described in ZHANG and NEI 1997 Down, except that the substitution model used was gamma + Poisson. That is, for each site, the relative substitution rate (rR) was randomly generated from a gamma distribution with the expectation of 1 and the shape parameter {alpha}, and the amino acid substitution followed the Poisson process. The tree branch lengths (expected numbers of substitutions per site) were computed for each site separately. For example, the expected distance between the mouse and chicken is dM-C, computed from Equation 9 by using the {alpha} and p values that were randomly chosen for this gene, and the average substitution rate for the protein is then µ = dM-C/(2 x 300) substitutions per site per mya, where 300 stands for 300 million yr. The substitution rate at a given site is therefore equal to r = rRµ, where rR is the relative rate at the site. The length of the branch from human to the common ancestor of human and mouse is then 100r substitutions for the site, where 100 stands for 100 million yr. Other branch lengths can be computed similarly. In the simulation, after the four protein sequences at the exterior nodes of the tree were generated, the gamma shape parameter {alpha} was estimated by the method of GU and ZHANG 1997 Down, and the gamma distance (d) between the simulated mouse and chicken sequences was computed by using Equation 9 with the estimated {alpha} and p. The simulation results with the estimated {alpha} >2 were abandoned. When 45 independent genes were simulated and 45 couples of {alpha} (<2) and d were obtained, the correlation coefficient R was computed. We conducted 2500 replications of such computer simulations, and the distribution of the 2500 R values is presented in Figure 7.



View larger version (12K):
In this window
In a new window
Download PPT slide
 
Figure 6. The model tree used in the simulation for nuclear genes.



View larger version (53K):
In this window
In a new window
Download PPT slide
 
Figure 7. The distribution of 2500 R values of the estimates of {alpha} and µ in the simulation. There is no correlation between the true values of {alpha} and µ. The observed R from the 45 nuclear genes is marked.

It is seen from Figure 7 that the estimated {alpha} and d are positively correlated, though the true {alpha} and d have no correlation at all. The correlation coefficient of the estimated {alpha} and d is significantly different from 0 because all 2500 simulated R values are positive. The average correlation coefficient is 0.65. However, only 0.32% of the 2500 simulated R values are greater than 0.86, which is the observed R value for the 45 nuclear genes examined. In other words, the observed R value of 0.86 is significantly greater than 0 at the 0.32% significance level, even when the estimation bias of {alpha} is considered. Therefore, the observed positive correlation between {alpha} and d of the nuclear proteins seems to have underlying causes other than the estimation bias.

In the above simulation, {alpha} and p were randomly chosen from a uniform distribution, respectively. But in reality, neither of them follow a uniform distribution (see Figure 4). It is expected, and is confirmed in a small-scale simulation, that the R values obtained from the simulation will be smaller when the true distributions of the {alpha} and p are used. Therefore, the probability of observing an R value of 0.86 or larger, given that there is no correlation between {alpha} and d, may have been overestimated in the simulation. In other words, the effect of the estimation bias may be exaggerated in the simulation and the confidence level of 0.32% is a conservative estimate. The assumptions of the Poisson model of amino acid substitution for protein evolution does not seem to have affected our simulation result very much. Because only orthologous proteins are used in the analysis, the evolutionary rate of a protein is expected to be more or less constant among lineages unless there is variation of the mutation rate due to the generation time effect (OHTA 1995 Down; LI et al. 1996 Down), which, however, acts on all the genes to the same extent and does not affect the correlation between CV (or {alpha}) and D (or d).

Mitochondrial genes:
The estimated CV values of the 13 mitochondrial genes are between 0.66 and 2.05, with the mean of 1.14 and the median of 1.10 (Table 2 and Figure 2). Again, we found that CV and D are highly correlated. The R value is -0.98 (P < 0.0001) for the linear regression between CV and ln D (Figure 8). If we fit the data by a linear regression of CV and D, R becomes -0.92 (P < 0.0001). These results show that the mean substitution rate and the extent of among-site rate variation are negatively correlated for the mitochondrial genes as well.



View larger version (18K):
In this window
In a new window
Download PPT slide
 
Figure 8. Correlation between CV and D in 13 mammalian mitochondrial genes.


 
View this table:
In this window
In a new window

 
Table 2. Substitution rates and among-site variations of 13 mitochrondrial genes of mammals

The estimated {alpha} values for the mitochondrial genes vary from 0.19 to 1.68 with the median of 0.56 (Table 2; Figure 4). Similar to the nuclear genes, about two-thirds of the mitochondrial genes have {alpha} values <1. The distance d between the human and mouse varies from 0.20 to 1.15 (Table 2), corresponding to the rate of about 1.0 to 5.8 amino acid substitutions per site per billion years under the assumption that the human and mouse diverged 100 mya. It is seen that mammalian mitochondrial genes evolve about 5 to 10 times faster than nuclear genes, as generally believed (BROWN et al. 1979 Down).

Figure 9 shows the relationship of {alpha} and d (of the human and mouse) for the mitochondrial genes. The correlation coefficient between {alpha} and d (or µ) is 0.72 (P < 0.01). Because we have used nine species in the estimation of {alpha}, and the estimation bias of {alpha} is expected to be quite small when the number of sequences used is no less than eight (GU and ZHANG 1997 Down; J. ZHANG, unpublished results), the observed correlation between {alpha} and d is unlikely to have been affected by the estimation bias of {alpha}. This result was also confirmed by a small-scale computer simulation similar to that performed for the nuclear genes (data not shown). As mentioned before, KUMAR 1996 Down found that the correlation coefficient between {alpha} and d for mitochondrial genes is 0.82 by using 1st and 2nd codon positions of the nucleotide sequences. This value is quite close to what we have obtained (0.72), although we used different species and different methods of {alpha} estimation. This suggests that the correlation is robust. As in the case of the nuclear genes, the hypothesis of no rate variation among sites was rejected for all 13 mitochondrial genes (P < 0.01).



View larger version (17K):
In this window
In a new window
Download PPT slide
 
Figure 9. Correlation between {alpha} and the distance d of human and mouse in 13 mammalian mitochondrial genes.


*  DISCUSSION
*TOP
*ABSTRACT
*DATA AND METHODS
*RESULTS
*DISCUSSION
*LITERATURE CITED

Causes of the negative correlation between the mean rate and the extent of among-site rate variation:
Through the analysis of 64 genes in total, we have shown that the mean substitution rate and the extent of among-site rate variation of a protein are negatively correlated. We have also excluded the possibility that the correlation is caused by estimation biases or is dependent on the assumption of the gamma distribution of the rate variation among sites. Because the 51 nuclear genes we examined are scattered all over the genome and they encode proteins with a variety of functions, it is unlikely that the correlation is limited to particular chromosomal regions or is restricted to particular groups of proteins. Rather, the correlation is likely to be a general feature of molecular evolution because both nuclear and mitochondrial genes show similar patterns. Nevertheless, all the data we used are from vertebrates, and our conclusion needs further support from genes of other organisms.

The observation of negative correlation can be explained under the neutral theory of molecular evolution (KIMURA 1983 Down; see also OHTA 1992 Down), which asserts that protein sequence evolution is under purifying selection and random genetic drift, and that the rate variation is due to the difference in functional constraints among amino acid sites of a protein and among different proteins. In the following paragraphs, we explain the cause of the correlation by using some simple models of neutral evolution.

Let {nu} be the mutation rate at an amino acid site, which is assumed to be constant for all sites of all proteins. Then the substitution rate is 0 for an invariant site, between 0 and {nu} for a variable but functionally constrained site, and equal to {nu} for a neutral site. Let us first consider a simple model in which the functional constraints at all sites of a protein are equal. Then the mean rate µ and CV have no correlation because CV is always 0, and µ can be any value from 0 to {nu}. Therefore, this model cannot explain our observation of the correlation between CV and µ. Now let us consider another model, in which there are two types of sites, either invariant or neutral. Then the average substitution rate of the protein becomes

(11)
where f is the proportion of neutral sites in the protein (KIMURA 1983 Down). The variance (Var) and the coefficient of variation (CV) of the rate among sites can be computed as

(12)
and

(13)

By using Equation 11 and Equation 13, we obtained

(14)

Equation 14 shows that, under this model, µ is negatively correlated with CV and is dependent only on CV, when {nu} is given. Use of more complex models (e.g., KIMURA 1979 Down) gives qualitatively similar results. With Equation 14 in mind, we computed the correlation coefficient R for D and 1/(CV2 + 1). It was 0.95 (P < 0.0001) and 0.97 (P < 0.0001) for the nuclear and mitochondrial genes, respectively.

In reality, the pattern of the variation of functional constraint among sites is somewhere between the above two extreme models, and the observation of the strong correlation between µ and CV or 1/(CV2 + 1) suggests that it is probably much closer to the second model than to the first. This is biologically understandable because, for an amino acid sequence to be functional, there must be some essential sites that are invariant and some other sites that are from mildly variable to almost free of any change. This pattern of functional constraints seems to be the requirement for a functional protein. It is unrealistic that the constraints at all amino acid sites of a protein are more or less equal unless the protein is free of any functional constraint (no function). In Equation 14, we can see that v also influences the relationship of µ and CV. There is evidence that the mutation rate varies among nuclear genes (WOLFE et al. 1989 Down). Therefore, the correlation between µ and CV for the nuclear genes may have been depressed due to the variation of mutation rate among genes. This effect is expected to be small for the 13 mitochondrial genes because they have similar mutation rates (NEDBAL and FLYNN 1998 Down).

The biological meaning of {alpha}:
The gamma distribution is predominantly used to fit the rate variation among sites of proteins. Our results from the likelihood ratio test indicated that the gamma distribution model fit the rate variation significantly better than the uniform rate model. But the reason why rate variation follows the gamma distribution is unknown and, therefore, the biological meaning of the shape parameter {alpha} is not clear. According to our observation of the strong positive correlation between {alpha} and µ, {alpha} can be interpreted as a measure of functional conservation of a protein. Small values of {alpha} indicate strong constraint, and large values indicate weak constraint. It is therefore expected that fourfold degenerate sites and intron regions of functional genes and pseudogenes have very large {alpha} values. For example, we estimated that the {alpha} of the glycerol-3-phosphate dehydrogenase (GPDH, EC 1.1.1.8) of 13 Drosophila and related species (KWIATOWSKI et al. 1997 Down) is about 0.10, but the {alpha} of the intron 3 and 5 of the Gpdh gene is ~3.0. Actually, the hypothesis of rate constancy over sites cannot be rejected for these intron sequences (P > 0.05, goodness of fit test). However, the above interpretation of {alpha} is valid only when the majority of amino acid sites are undergoing neutral evolution (i.e., no positive selection). When strong positive selection is acting on a large number of sites, µ tends to be high and {alpha} tends to be small because functionally most important sites remain invariant, whereas positively selected sites have very high substitution rates. So, in this case, {alpha} does not measure the functional conservation of a protein. For example, we found that the {alpha} value of the sperm lysin of 20 abalone species (LEE et al. 1995 Down) is 0.16, whereas the µ is about 2 to 3 times the mutation rate {nu}. Here, {nu} was estimated under the assumption that the mutation rate per amino acid is 2.25 times the mutation rate per nucleotide (NEI 1975 Down, p. 225), which was further assumed to be equal to the rate of synonymous nucleotide substitution. So, if this gene is presented in Figure 5, it will sit at the top left corner, far from the regression line.

It is intriguing that there are no dots sitting in the upper left corner in Figure 5 or the upper right corner in Figure 3. These are the locations where both the mean rate and the among-site rate variation are high, and where we expect to find proteins consisting of a number of conserved sites and a number of directionally selected sites. The simplest explanation would be that such genes are not included in our sample of 64 genes. Because our sample is random and is not small, the result suggests that those genes that are under strong positive selection are rare in the genome. Of course, we cannot distinguish neutral genes from those in which positive selection operates in a very small number of sites, because in the latter, the overall relationship between {alpha} and µ is mainly determined by the sites that are neutral or under purifying selection. In fact, there are no statistical methods that can effectively detect positive selection that only acts on one or several sites except in the case of convergent and parallel evolution (ZHANG and KUMAR 1997 Down). It has to be noted that all we have studied here is orthologous gene evolution. Because orthologous genes usually have the same function, it is understandable that we did not find signals of strong positive selection (SHARP 1997 Down). Strong selection is more likely to operate on newly duplicated genes during the evolution of novel gene functions (ZHANG et al. 1998 Down).

It is also interesting that no dots were found in the lower left corner of Figure 3, where both the mean rate and the rate variation are small. This suggests that even in the most conserved proteins it is unlikely that all sites are equally conserved. More probably, there are still a few nearly neutral sites so that CV becomes high (see Equation 13).

Other implications:
Accelerated evolution in some evolutionary lineages has been observed in several proteins (e.g., STEWART et al. 1987 Down; TUCKER and LUNDRIGAN 1993 Down; WALLIS 1993 Down, WALLIS 1996 Down; WHITFIELD et al. 1993 Down). If the mutation rate is constant among lineages, there are two possible causes of the enhanced substitution rate. One is relaxation of functional constraint, and the other is positive selection. When the rate of nonsynonymous nucleotide substitution is not significantly greater (or even not greater) than the rate of synonymous substitution, it is often difficult to distinguish between the two possible causes of the accelerated evolution (e.g., LI and GOJOBORI 1983 Down). The relationship between µ and {alpha} may provide us some information, because when µ is high, high values of {alpha} are expected if functional constraint is reduced, but small values of {alpha} are likely if strong positive selection acts on a large number of sites (as in the case of the abalone sperm lysin).

Using the relationship between µ and {alpha}, let us examine an interesting case of fast evolution. WHITFIELD et al. 1993 Down and TUCKER and LUNDRIGAN 1993 Down reported that the sex-determining protein SRY evolves very fast in primates and rodents, and the rates of nonsynonymous and synonymous nucleotide substitution are similar. But because the nonsynonymous rate is not significantly higher than the synonymous rate, these authors were not able to determine whether the accelerated evolution is due to positive selection or relaxation of functional constraints. SRY protein is a transcription factor that binds to specific DNA sites to regulate the expression of its downstream genes in the sex determination pathway. SRY consists of N-terminal, HMG, and C-terminal domains (with 58, 78, and 68 amino acids, respectively, in the human sequence). The HMG domain is responsible for DNA binding and is very conservative, whereas the other two domains are highly variable, particularly the C domain. The length of the HMG domain remains the same among placental mammals, but the lengths of the other two domains vary within or among mammalian orders. The estimated {alpha} values of the SRY proteins of eight primate species (WHITFIELD et al. 1993 Down) are 1.58, 0.24, and 3.73 for the N, HMG, and C domains, respectively. The large values of {alpha} for the N and C domains, particularly the latter, suggest that the accelerated evolution in SRY is probably due to low functional significance out of the HMG domain rather than adaptation under strong positive selection. Actually, all clinical mutations in human SRY resulting in phenotypic sex reversal are found in the HMG domain (HAWKINS et al. 1992 Down; GOODFELLOW and LOVELL-BADGE 1993 Down; WERNER et al. 1995 Down), except for one case in which a nonsense mutation happened in the C domain (TAJIMA et al. 1994 Down). PONTIGGIA et al. 1995 Down found that SRY proteins from different primate species are very similar in both DNA binding and bending and predicted that they should be able to substitute for each other.

Ignoring rate variation among sites causes underestimation of distances between homologous proteins. For estimating dates of early divergences by the molecular clock, only slowly evolving proteins can be used. Because slowly evolving proteins tend to have high degrees of rate variation among sites, consideration of the rate variation and use of gamma distances seems to be necessary for such estimations. This has been highlighted in recent debates on the age of the most recent common ancestor of present-day organisms (DOOLITTLE et al. 1996 Down; HASEGAWA et al. 1996 Down; MIYAMOTO and FITCH 1996 Down; FENG et al. 1997 Down; GU 1997 Down). In this case, neglect of the rate variation among sites was found to bias the time estimation at least 0.5–1 billion years. However, it is worth mentioning that use of the gamma distance enlarges the sampling variance so that many genes are needed for obtaining reliable estimates of divergence times.


*  ACKNOWLEDGMENTS

We are grateful to SUDHIR KUMAR and PATRICK PARKER for providing us aligned protein sequences of the nuclear genes. We thank SUDHIR KUMAR, MASATOSHI NEI, TANYA SITNIKOVA, CHUNG-I WU, and two anonymous reviewers for their comments on an earlier version of this article. This work was supported by the research grants from the National Institutes of Health and the National Science Foundation to M. NEI.

Manuscript received December 10, 1997; Accepted for publication April 7, 1998.


*  LITERATURE CITED
*TOP
*ABSTRACT
*DATA AND METHODS
*RESULTS
*DISCUSSION
*LITERATURE CITED

AHLBERG, P. E. and A. R. MILNER, 1994  The origin and early diversification of tetrapods. Nature 368:507-514.

BENTON, M. J., 1993 The Fossil Record, Vol. 2, Chapman and Hall, New York.

BROWN, W. M., M. GEORGE, JR., and A. C. WILSON, 1979  Rapid evolution of animal mitochondrial DNA. Proc. Natl. Acad. Sci. USA 76:1967-1971[Abstract/Free Full Text].

DOOLITTLE, R. F., D.-F. FENG, S. TSANG, G. CHO, and E. LITTLE, 1996  Determining divergence times of the major kingdoms of living organisms with a protein clock. Science 271:470-477[Abstract].

DURET, L., D. MOUCHIROUD, and M. GOUY, 1994  HOVERGEN: a database of homologous vertebrate genes. Nucleic Acids Res. 22:2360-2365[Abstract/Free Full Text].

FENG, D.-F., G. CHO, and R. F. DOOLITTLE, 1997  Determining divergence times with a protein clock: update and reevaluation. Proc. Natl. Acad. Sci. USA 94:13028-13033[Abstract/Free Full Text].

GOLDING, G. B., 1983  Estimation of DNA and protein sequence divergence: an examination of some assumptions. Mol. Biol. Evol. 1:125-142[Abstract].

GOODFELLOW, P. N. and R. LOVELL-BADGE, 1993  SRY and sex determination in mammals. Annu. Rev. Genet. 27:71-92[Medline].

GU, X., 1997  The age of the common ancestor of eukaryotes and prokaryotes: statistical inferences. Mol. Biol. Evol. 14:861-866[Abstract].

GU, X. and J. ZHANG, 1997  A simple method for estimating the parameter of substitution rate variation among sites. Mol. Biol. Evol. 14:1106-1113[Abstract].

GU, X. and W.-H. LI, 1998  Estimation of evolutionary distances under stationary and nonstationary models of nucleotide substitution. Proc. Natl. Acad. Sci. USA 95:5899-5905[Abstract/Free Full Text].

GU, X., Y.-X. FU, and W.-H. LI, 1995  Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol. Biol. Evol. 12:546-557[Abstract].

HASEGAWA, M., W. M. FITCH, J. P. GOGARTEN, L. OLENDZENSKI, and E. HILARIO et al., 1996  Dating the cenancestor of organisms. Science 274:1750-1753[Medline].

HAWKINS, J. R., A. TAYLOR, P. BERTA, J. LEVILLIERS, and B. VAN DER AUWERA et al., 1992  Mutational analysis of SRY: nonsense and missense mutations in XY sex reversal. Hum. Genet. 88:471-474[Medline].

HEDGES, S. B., P. H. PARKER, C. G. SIBLEY, and S. KUMAR, 1996  Continental breakup and the ordinal diversification of birds and mammals. Nature 381:226-229.

JANKE, A., X. XU, and U. ARNASON, 1997  The mitochondrial genome of the wallaroo (Macropus robustus) and the phylogenetic relationship among Monotremate, Marsupialia, and Eutheria. Proc. Natl. Acad. Sci. USA 94:1276-1281[Abstract/Free Full Text].

KIMURA, M., 1979  Model of effectively neutral mutations in which selective constraint is incorporated. Proc. Natl. Acad. Sci. USA 76:3440-3444[Abstract/Free Full Text].

KIMURA, M., 1983 The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge.

KIMURA, M. and T. OHTA, 1974  On some principles governing molecular evolution. Proc. Natl. Acad. Sci. USA 71:2848-2852[Abstract/Free Full Text].

KUMAR, S., 1996  Patterns of nucleotide substitution in mitochondrial protein coding genes of vertebrates. Genetics 143:537-548[Abstract].

KWIATOWSKI, J., M. KRAWCZYK, M. JAWORSKI, D. SKARECKY, and F. J. AYALA, 1997  Erratic evolution of glycerol-3-phosphate dehydrogenase in Drosophila, Chymomyza, and Ceratitis.. J. Mol. Evol. 44:9-22[Medline].

LEE, Y.-H., T. OTA, and V. D. VACQUIER, 1995  Positive selection is a general phenomenon in the evolution of abalone sperm lysin. Mol. Biol. Evol. 12:231-238[Abstract].

LI, W.-H., 1997 Molecular Evolution. Sinauer Associates, Sunderland, MA.

LI, W.-H. and T. GOJOBORI, 1983  Rapid evolution of goat and sheep globin genes following gene duplication. Mol. Biol. Evol. 1:94-108[Abstract].

LI, W.-H., D. L. ELLSWORTH, J. KRUSHKAL, B. H.-J. CHANG, and D. HEWETT-EMMETT, 1996  Rates of nucleotide substitution in primates and rodents and the generation-time effect hypothesis. Mol. Phylogenet. Evol. 5:182-187[Medline].

MIYAMOTO, M. M. and W. M. FITCH, 1996  Constraints on protein evolution and the age of the eubacteria/eukaryote split. Syst. Biol. 45:568-575[Medline].

NEDBAL, M. A. and J. J. FLYNN, 1998  Do the combined effects of the asymmetric process of replication and DNA damage from oxygen radicals produce a mutation-rate signature in the mitochondrial genome? Mol. Biol. Evol. 15:219-223[Medline].

NEI, M., 1975 Molecular Population Genetics and Evolution. North-Holland Publishing Company, Amsterdam.

NEI, M., 1987 Molecular Evolutionary Genetics. Columbia University Press, New York.

NEI, M., R. CHAKRABORTY, and P. A. FUERST, 1976  Infinite allele model with varying mutation rate. Proc. Natl. Acad. Sci. USA 73:4164-4168[Abstract/Free Full Text].

OHTA, T., 1992  The nearly neutral theory of molecular evolution. Annu. Rev. Ecol. Syst. 23:263-286.

OHTA, T., 1995  Synonymous and nonsynonymous substitutions in mammalian genes and the nearly neutral theory. J. Mol. Evol. 40:56-63[Medline].

OTA, T. and M. NEI, 1994  Estimation of the number of amino acid substitutions per site when the substitution rate varies among sites. J. Mol. Evol. 36:642-643.

PONTIGGIA, A., S. WHITFIELD, P. N. GOODFELLOW, R. LOVELL-BADGE, and M. E. BIANCHI, 1995  Evolutionary conservation in the DNA-binding and -bending properties of HMG-boxes from SRY proteins of primates. Gene 154:277-280[Medline].

SHARP, P. M., 1997  In search of molecular Darwinism. Nature 385:111-112[Medline].

STEWART, C.-B., J. W. SCHILLING, and A. C. WILSON, 1987  Adaptive evolution in the stomach lysozymes of foregut fermenters. Nature 330:401-404[Medline].

TAJIMA, T., J. NAKAE, N. SHINOHARA, and K. FUJIEDA, 1994  A novel mutation localized in the 3' non-HMG box region of the SRY gene in 46X, Y gonadal dysgenesis. Hum. Mol. Genet. 3:1187-1189[Free Full Text].

TAMURA, K. and M. NEI, 1993  Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in human and chimpanzees. Mol. Biol. Evol. 10:512-526[Abstract].

TUCKER, P. K. and B. L. LUNDRIGAN, 1993  Rapid evolution of the sex determining locus in Old World mice and rats. Nature 364:715-717[Medline].

UZZELL, T. and K. W. CORBIN, 1971  Fitting discrete probability distributions to evolutionary events. Science 172:1089-1096[Abstract/Free Full Text].

WALLIS, M., 1993  Remarkable high rate of molecular evolution of ruminant placental lactogens. J. Mol. Evol. 37:86-88[Medline].

WALLIS, M., 1996  The molecular evolution of vertebrate growth hormones: a pattern of near-stasis interrupted by sustained bursts of rapid change. J. Mol. Evol. 43:93-100[Medline].

WERNER, M. H., J. R. HUTH, A. M. GRONENBORN, and G. M. CLORE, 1995  Molecular basis of human 46X, Y sex reversal revealed from the three-dimensional solution structure of the human SRY-DNA complex. Cell 81:705-714[Medline].

WHITFIELD, L. S., R. LOVELL-BADGE, and P. N. GOODFELLOW, 1993  Rapid sequence evolution of the mammalian sex-determining gene SRY. Nature 364:713-715[Medline].

WOLFE, K. H., P. M. SHARP, and W.-H. LI, 1989  Mutation rates differ among regions of the mammalian genome. Nature 337:283-285[Medline].

YANG, Z., 1994  Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J. Mol. Evol. 39:306-314[Medline].

ZHANG, J. and S. KUMAR, 1997  Detection of convergent and parallel evolution at the amino acid sequence level. Mol. Biol. Evol. 14:527-536[Abstract].

ZHANG, J. and M. NEI, 1997  Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methods. J. Mol. Evol. 40(Suppl. 1):S139-S146.

ZHANG, J., H. F. ROSENBERG, and M. NEI, 1998  Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc. Natl. Acad. Sci. USA 95:3708-3713[Abstract/Free Full Text].