Abstract
A common assumption in comparative sequence analysis is that the sequences have evolved with the same pattern of nucleotide substitution (homogeneity of the evolutionary process). Violation of this assumption is known to adversely impact the accuracy of phylogenetic inference and tests of evolutionary hypotheses. Here we propose a disparity index, I_{D}, which measures the observed difference in evolutionary patterns for a pair of sequences. On the basis of this index, we have developed a Monte Carlo procedure to test the homogeneity of the observed patterns. This test does not require a priori knowledge of the pattern of substitutions, extent of rate heterogeneity among sites, or the evolutionary relationship among sequences. Computer simulations show that the I_{D}test is more powerful than the commonly used χ^{2}test under a variety of biologically realistic models of sequence evolution. An application of this test in an analysis of 3789 pairs of orthologous human and mouse proteincoding genes reveals that the observed evolutionary patterns in neutral sites are not homogeneous in 41% of the genes, apparently due to shifts in G + C content. Thus, the proposed test can be used as a diagnostic tool to identify genes and lineages that have evolved with substantially different evolutionary processes as reflected in the observed patterns of change. Identification of such genes and lineages is an important early step in comparative genomics and molecular phylogenetic studies to discover evolutionary processes that have shaped organismal genomes.
MOLECULAR sequences are routinely used to reconstruct phylogenetic histories of species and multigene families and to detect nonneutral evolution at the molecular level. Most of these methods assume that the sequences analyzed have evolved with the same process of nucleotide substitution in their evolutionary history (homogeneity assumption). If this assumption is not satisfied, the inferred phylogenetic trees may have erroneous branching patterns and tests of neutral evolutionary hypotheses may become unreliable (Hasegawaet al. 1993; Steelet al. 1993; Funket al. 1995; Galtier and Gouy 1998; Naylor and Brown 1998; RodriguezTrelleset al. 2000; Tarrioet al. 2000). Therefore, it is important to test this assumption for a given set of sequences prior to molecular evolutionary analysis. Knowledge of the violation of the homogeneity assumption would allow the investigators to choose advanced methods of phylogenetic reconstruction (e.g., Lockhartet al. 1994; Galtier and Gouy 1998) or to conduct phylogenetic analyses with the offending sequences removed, if possible. Identification of genes and species with atypical patterns of change is also useful for elucidating the evolutionary mechanisms responsible for the observed differences.
In general, sequences that have evolved with the same substitution process (that is, where the relative probability of change from one state to another is the same in the lineages being compared) are expected to have similar nucleotide (and amino acid) compositions. Therefore, differences in the substitution process among lineages can be detected by comparing the observed patterns of nucleotide frequencies in the extant sequences. In the following, we propose a simple measure, disparity index (I_{D}), to quantify the difference in observed patterns and use it to develop a statistical test. We examine the performance of this test under biologically realistic conditions and compare it to other tests by means of computer simulation as well as empirical data analysis.
DISPARITY INDEX TO MEASURE SUBSTITUTION PATTERN DIVERGENCE
Let X and Y be two DNA sequences of length L each. Let x_{i} be the count of the ith type of nucleotide (i = A, T, C, or G) in sequence X and let y_{i} be the corresponding count in sequence Y. The composition distance between these two sequences can then be defined as
Using Equation 2, we can write (1) as
Equation 10 shows that the expected number of differences between two sequences is simply half the sum over all states of the squared differences of the corresponding base (amino acid) frequencies in the sequences compared. CornishBowden (1977) first presented the statistic given in Equation 1. However, the proof of Equation 10 presented in that work implicitly assumed homogeneity of the evolutionary process and also that the counts x_{i}'s and y_{i}'s are independent. The second assumption is clearly invalid because these counts are correlated due to the common ancestry of sequences X and Y. Our proof (Equations 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) does not require this assumption and holds irrespective of the complexity of the nucleotide (or amino acid) substitution model to be applied to the observed pattern and the extent of amongsite rate heterogeneity among sites. Computer simulations reaffirm this fact over a variety of conditions (Figure 1).
When the two sequences compared do not exhibit the same substitution pattern (heterogeneity scenario), the composition distance obtained using Equation 1 is expected to be larger than that obtained under the homogeneity case. This is because the observed difference in frequency of the same state in two sequences (x_{i} – y_{i}) will then be larger. To show this, we conducted computer simulations for amino acid sequences. In this simulation, the probability of change from one amino acid residue to another was made the same for all residues in the sequence evolution in both lineages (homogeneity case, open circles in Figure 2). In the heterogeneity scenario, the transition probability to a given residue is made increasingly larger in a preselected lineage to effect a larger deviation in the substitution pattern [pattern deviation factor (pdf)], with all other transition probabilities kept equal. pdf is a factor by which the probability of change to a prespecified amino acid (or nucleotide) differs from that expected under the homogeneity case. For s states (s = 4 for nucleotides and s = 20 for amino acids), the probability of substitution to a given state is 1/(s – 1) when all changes are equally likely. A pdf equal to f means that the probability of substitution from any state to this prespecified state is f/(s – 1); f = 1 corresponds to the homogeneous process. The probability of change to any other state is equal and is given by (1 – f/[s – 1])/(s – 2). A higher value of pdf indicates greater heterogeneity in the patterns of substitution.
Figure 2 shows the results of computer simulations for the homogeneity and heterogeneity cases. It is clear that D_{C} is higher when the evolutionary process is heterogeneous. This disparity increases with increasing heterogeneity; we call this difference the disparity index (I_{D}). I_{D} increases when the number of substitutions increases with pdf kept constant (Figure 3A) and when the pdf increases with the number of substitutions kept constant (Figure 3B). The relationship in both cases is explained approximately by a second order power curve, as the frequency difference is squared in the D_{C} formula.
In empirical data analysis, we obtain I_{D} for a given pair of sequences using the equation
MONTE CARLO METHOD FOR TESTING THE HOMOGENEITY ASSUMPTION
We test the homogeneity assumption by calculating the probability of observing a composition distance (D_{CO}) greater than that expected under the null hypothesis of homogeneity, i.e., I_{D} > 0. Because the actual distribution of D_{C} under homogeneity for the given base frequencies and number of differences is not known a priori, we derive it using a Monte Carlo approach. In each replicate of the Monte Carlo method, we start with a random sequence of length L; the expected frequencies are made equal to the average base frequencies computed using the given pair of sequences. Two descendent sequences are then generated by introducing substitutions randomly until the number of differences between the descendent sequences becomes equal to N_{d} for the original pair of sequences. This is done to obtain D_{C} under the homogeneity assumption from the observed data, given the average base frequencies for the original pair of sequences. For effecting a substitution, we randomly select one of the two descendent sequences and then choose a site in this sequence at random. We replace the nucleotide at this site (irrespective of its current base) with another chosen randomly on the basis of the average observed frequencies obtained above. Therefore, the resulting sequences are expected to have the same base frequencies, as the substitutions occur with the same evolutionary process in both lineages. (This scheme is chosen because there is no a priori information on the null pattern of substitution and evolutionary rate heterogeneity among sites or between lineages.) Using the two sequences generated in the current replicate (say b), we compute D_{C,b}. This process is repeated a desired number of times and the proportion of replicates in which D_{CO} is higher than the D_{C,b} (I_{D} > 0) is computed. If this proportion is >95%, we can reject the null hypothesis at the 5% level. As an example, we show the distribution of D_{C} for the amino acid sequence of human and mouse myeloid differentiation primary response proteins in Figure 4. For this pair, D_{CO} = 93, N_{d} = 56, and therefore I_{D} = 37. This I_{D} is >0 at the 5% level as D_{CO} is located on the right of the 95% cutoff point (D_{C} = 92) in the D_{C} distribution.
POWER OF THE I_{D}TEST
To assess the power of the Monte Carlo test in detecting differences in the evolutionary patterns, we conducted computer simulations under biologically diverse conditions. Figure 5A shows the type I error of the I_{D}test at the 5% significance level when the pattern of substitution is homogeneous for three sets of conditions: (1) the JukesCantor (JC; Jukes and Cantor 1969) model with the same evolutionary rate among sites; (2) the HasegawaKishinoYano (HKY + G; Hasegawaet al. 1985) model with biased base composition, transition/transversion rate bias, and extreme rate heterogeneity among sites; and (3) the general timereversible model with rate heterogeneity among sites (GTR + G; reviewed in Nei and Kumar 2000). Results in Figure 5A clearly show that the type I error at the 5% significance level is ∼5%, and thus the test is not conservative. Similar results were obtained in simulations involving unequal rates of evolution between lineages and for protein sequences (results not shown). Given that the type I error could be >5% in some cases (Figure 5A), we recommend that a 1% significance level may be more appropriate.
Figure 5, B and C, shows the power of the I_{D}test in rejecting a false null hypothesis when the sequences compared have actually evolved with different evolutionary processes. The statistical power of the I_{D}test in rejecting the null hypothesis increases with the number of substitutions and sequence length (Figure 5B). For a given sequence length and number of substitutions, its power increases quickly with even small deviations in the evolutionary pattern between sequences (pdf = 2; Figure 5C). Similar results are found when the sequence evolution followed HKY, HKY + G, GTR, and GTR + G models.
Relative power of the I_{D}test: The χ^{2}test is often employed to examine if the base frequencies are similar between sequences. In this case,
The reason for the conservative nature of the classical χ^{2}test is the underlying assumption that the counts are independent. This is not so because the frequencies obtained from homologous sequences are not independent due to the shared evolutionary history. This nonindependence inflates the denominator in the χ^{2}test formula as it incorporates information from all sites, even including those that have not undergone any substitutions. Inclusion of these invariant positions in the denominator makes the χ^{2}value too low, whereas their contribution in the numerator automatically cancels out. This effect is more severe for closely related sequences than for distantly related sequences, as a larger fraction of sites are identical by descent and thus invariant in the former. We conducted computer simulation studies to examine the type I error of the χ^{2}test on the basis of only those sites that had undergone change in one or both lineages (we refer to this as the V^{2}test). In this case, the test became liberal with type I error almost two times the significance level when the null hypothesis is true. One might consider constructing a null distribution for the V^{2}test using the Monte Carlo approach, but it is unclear what the expected V ^{2} is under homogeneity. In any case, D_{C} and V ^{2} are quite similar in form and the expected distribution of D_{C} under homogeneity can be easily constructed. Furthermore, the V ^{2} statistic has no clearcut biological interpretation, unlike the D_{C} statistic.
The problem of observing the same base at a site due to factors such as identity by descent was also considered by Rzhetsky and Nei (1995). They developed a rigorous statistical test of equality of nucleotide (amino acid) frequencies among multiple sequences. Our computer simulations (not shown) in conditions equivalent to those in Figure 5, however, show that the RzhetskyNei test is also a liberal test, which may be due to the violation of some of the assumptions made in their test.
TESTING THE HOMOGENEITY OF MOLECULAR EVOLUTIONARY PATTERNS IN HUMAN AND MOUSE GENES
Human and mouse genome sequencing projects provide DNA sequences of a large number of genes, which gives us an opportunity to examine the homogeneity of patterns of substitution for different genes in human and mouse lineages on a genomewide scale. We assembled a data set consisting of cDNA sequences of 3789 human genes and their mouse orthologs using the July 1999 release of the HOVERGEN database (Duretet al. 1994). Sequence orthology in each case was determined using multigene family trees constructed using the neighborjoining method (Saitou and Nei 1987) from protein sequence alignments and the homogeneity assumption examined for neutral substitutions in individual genes in human and mouse lineages. For this purpose, we use the fourfold degenerate sites, which are known to best reflect the neutral evolutionary patterns, as no nucleotide change in fourfold degenerate sites will alter the amino acid encoded by the codon. We took a stringent approach in identifying fourfold degenerate sites by choosing sites that have potentially remained fourfold degenerate throughout the evolutionary history of human and mouse. This was accomplished by considering a site to be fourfold degenerate only if it was so in both human and mouse codons. With this definition only ∼15% of all sites in a gene were fourfold degenerate, with the average number being ∼220.
We tested the null hypothesis of similarity of the evolutionary process in human and mouse lineages (homogeneity assumption) for each gene by the I_{D}test. Results show that the null hypothesis can be rejected in 41% of the genes at the 5% significance level (Table 1). This indicates that the neutral evolutionary sites are potentially evolving with significantly different substitution patterns between human and mouse lineages. Homogeneityrejected genes are not necessarily evolving faster than other genes because the average proportion of sites different in the two cases was similar (0.36 and 0.32, respectively). As expected, the χ^{2}test was conservative as it rejected the null hypothesis in only 23% of genes at the 5% level and only 14.4% at the 1% level (Table 1). Therefore the χ^{2}test is only onehalf as powerful as the I_{D}test for these data.
Mammalian genomes are mosaics of regions of homogeneous base compositions (see review in Bernardi 2000). These isochores are characterized by their G + C content, which is also reflected in the third codon positions (and fourfold degenerate sites). If homogeneity is rejected in many genes due to shifts in G + C content either in the human sequence or the mouse sequence for a gene, then the average G + C content difference between human and mouse sequences at fourfold degenerate sites (ΔGC4) is expected to be higher for homogeneityrejected genes as compared to the other genes. This was indeed the case, as the average ΔGC4 over homogeneityrejected genes was 12.9%, which is almost three times that observed in all other genes (4.6%). In fact, the I_{D}test for G + C content difference almost always rejects the same genes (40.8% at the 5% level).
Significant differences in G + C content between genes could arise if the G + C content of one of the two genomes has experienced an overall change. This does not appear to be the case as the percentages of G + C content averaged over all genes in fourfold degenerate sites are 59.7 and 58.3%, respectively, for human and mouse genomes. Another possibility is chromosomal rearrangement. Mammalian genomes are also known to rearrange at a high rate (Kumaret al. 2001), which can relocate genes from one isochore to another with substantially different G + C content. Such an event may lead to a directional change in the G + C content of the fourfold degenerate sites in the transposed gene such that it becomes similar to that of its surroundings. As the genetic maps of human and mouse genomes become available from completed gene sequencing projects, we plan to examine the contribution of gene relocation in the observed shifts in patterns of neutral evolution.
We also conducted tests of the homogeneity assumption for zerofold degenerate sites, which are under strong purifying selection because all changes at these nucleotide sites produce a change in the amino acid encoded. The χ^{2}test rejected the null hypothesis in only 0.3% of the cases, which is much lower than that expected by chance alone at the 5% significance level. The I_{D}test rejects the null hypothesis in 12.4% of the genes (Table 1). A similar result was seen in the analysis of protein sequences, in which the null hypothesis was rejected in 12.9% of the cases. These results indicate that protein sequences have evolved with a more homogeneous process than evolutionarily neutral sites because 41% of the genes were rejected in the latter case.
Thus, we have shown the usefulness of the I_{D} statistic as a diagnostic tool to identify pairs of sequences that are evolving with significantly different substitution patterns. In molecular phylogenetics, the ability to identify such sequence pairs prior to evolutionary tree reconstruction using the I_{D}test is potentially useful for deciding on whether or not to use phylogenetic reconstruction methods that relax the homogeneity assumption (e.g., Lockhartet al. 1994; Galtier and Gouy 1998). However, it is worth noting that the parameter richness of such sophisticated methods can be a hindrance in obtaining results with high statistical confidence (reviewed in Nei and Kumar 2000). Alternatively, investigators may choose to remove sequences that do not satisfy the homogeneity assumption using the I_{D}test and use simpler models to make more robust phylogenetic estimations. In general, the I_{D}test will be useful as a diagnostic tool to screen for lineages and genes evolving with atypical patterns of change.
Acknowledgments
We thank S. Blair Hedges, Michael Douglas, Marlis Douglas, Tom Dowling, Mark Miller, and Philip Hedrick for comments on an earlier draft of this article; Sankar Subramanian for help with cDNA sequence alignments; and Michael Rosenberg for invaluable help with the simulation study. We also thank two anonymous reviewers and Dr. Marcy Uyenoyama for many insightful comments and making the derivation of Equation 10 more concise. This work was supported by research grants to S.K. from the National Institutes of Health (HG02096), National Science Foundation (DBI9983133), and BurroughsWellcome Fund (BWF1001311). Methods described in this work are available in the computer software MEGA2 (http://www.megasoftware.net).
Footnotes

Communicating editor: M. K. Uyenoyama
 Received February 8, 2001.
 Accepted April 6, 2001.
 Copyright © 2001 by the Genetics Society of America