Disparity Index: A Simple Statistic to Measure and Test the Homogeneity of Substitution Patterns Between Molecular Sequences
Sudhir Kumar and Sudhindra R. Gadagkar
Department of Biology, Arizona State University, Tempe, Arizona 85287-1501
Corresponding author: Sudhir Kumar, Life Sciences A 371, Department of Biology, Arizona State University, Tempe, AZ 85287-1501. E-mail: s.kumar{at}asu.edu
Communicating editor: M. K. UYENOYAMA
| ABSTRACT |
|---|
A common assumption in comparative sequence analysis is that the sequences have evolved with the same pattern of nucleotide substitution (homogeneity of the evolutionary process). Violation of this assumption is known to adversely impact the accuracy of phylogenetic inference and tests of evolutionary hypotheses. Here we propose a disparity index, ID, which measures the observed difference in evolutionary patterns for a pair of sequences. On the basis of this index, we have developed a Monte Carlo procedure to test the homogeneity of the observed patterns. This test does not require a priori knowledge of the pattern of substitutions, extent of rate heterogeneity among sites, or the evolutionary relationship among sequences. Computer simulations show that the ID-test is more powerful than the commonly used χ²-test under a variety of biologically realistic models of sequence evolution. An application of this test in an analysis of 3789 pairs of orthologous human and mouse protein-coding genes reveals that the observed evolutionary patterns in neutral sites are not homogeneous in 41% of the genes, apparently due to shifts in G + C content. Thus, the proposed test can be used as a diagnostic tool to identify genes and lineages that have evolved with substantially different evolutionary processes as reflected in the observed patterns of change. Identification of such genes and lineages is an important early step in comparative genomics and molecular phylogenetic studies to discover evolutionary processes that have shaped organismal genomes.
MOLECULAR sequences are routinely used to reconstruct phylogenetic histories of species and multigene families and to detect nonneutral evolution at the molecular level. Most of these methods assume that the sequences analyzed have evolved with the same process of nucleotide substitution in their evolutionary history (homogeneity assumption). If this assumption is not satisfied, the inferred phylogenetic trees may have erroneous branching patterns and tests of neutral evolutionary hypotheses may become unreliable.
In general, sequences that have evolved with the same substitution process (that is, where the relative probability of change from one state to another is the same in the lineages being compared) are expected to have similar nucleotide (and amino acid) compositions. Therefore, differences in the substitution process among lineages can be detected by comparing the observed patterns of nucleotide frequencies in the extant sequences. In the following, we propose a simple measure, disparity index (ID), to quantify the difference in observed patterns and use it to develop a statistical test. We examine the performance of this test under biologically realistic conditions and compare it to other tests by means of computer simulation as well as empirical data analysis.
| DISPARITY INDEX TO MEASURE SUBSTITUTION PATTERN DIVERGENCE |
|---|
Let X and Y be two DNA sequences of length L each. Let xi be the count of the ith type of nucleotide (i = A, T, C, or G) in sequence X and let yi be the corresponding count in sequence Y. The composition distance between these two sequences can then be defined as
$$D_C = \frac{1}{2}\sum_{i}(x_i - y_i)^2 \tag{1}$$
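The expected value of DC can be obtained in the following way. Let us represent sequences X and Y as
$$X = (X_1, X_2, \ldots, X_L), \qquad Y = (Y_1, Y_2, \ldots, Y_L),$$
where $X_k$ and $Y_k$ are the nucleotides at site k of sequences X and Y, respectively.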
For a given nucleotide type i at a given site k, we define
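$$x_{ik} = \begin{cases} 1 & \text{if sequence } X \text{ has nucleotide } i \text{ at site } k \\ 0 & \text{otherwise} \end{cases}, \qquad y_{ik} = \begin{cases} 1 & \text{if sequence } Y \text{ has nucleotide } i \text{ at site } k \\ 0 & \text{otherwise} \end{cases} \tag{2}$$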
Using Equation 2, we can write (1) as
$$D_C = \frac{1}{2}\sum_{i}\left[\sum_{k=1}^{L}(x_{ik} - y_{ik})\right]^2 \tag{3}$$
The expected value is given by
$$E(D_C) = \frac{1}{2}\sum_{i} E\left\{\left[\sum_{k=1}^{L}(x_{ik} - y_{ik})\right]^2\right\} \tag{4}$$
Assuming independence among sites, we get
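$$E(D_C) = \frac{1}{2}\sum_{i}\sum_{k=1}^{L} E\left[(x_{ik} - y_{ik})^2\right] + \frac{1}{2}\sum_{i}\sum_{k=1}^{L}\sum_{l \neq k} E(x_{ik} - y_{ik})\,E(x_{il} - y_{il}) \tag{5}$$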
The first term on the right-hand side is simply the expected number of nucleotides different between the two sequences (Nd), which is determined by the extent of sequence divergence, pattern of evolutionary change, and the extent of evolutionary rate heterogeneity among sites. That is,
$$\frac{1}{2}\sum_{i}\sum_{k=1}^{L} E\left[(x_{ik} - y_{ik})^2\right] = E(N_d) \tag{6}$$
The second term on the right-hand side in (5) can be written as follows, because the summations are over independent sites:
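$$\frac{1}{2}\sum_{i}\sum_{k=1}^{L}\sum_{l \neq k} E(x_{ik} - y_{ik})\,E(x_{il} - y_{il}) = \frac{1}{2}\sum_{i}\sum_{k=1}^{L} E(x_{ik} - y_{ik}) \sum_{l \neq k} E(x_{il} - y_{il}) \tag{7}$$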
When the underlying substitution process is homogeneous, then for a given nucleotide pair (i, j), E(nij) = E(nji), where $n_{i\cdot} = \sum_{j \neq i} n_{ij}$, $n_{\cdot i} = \sum_{j \neq i} n_{ji}$, and nij is the number of sites showing nucleotide i in sequence X and j in sequence Y. Thus,
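$$E(n_{i\cdot}) = E(n_{\cdot i}), \quad \text{so that} \quad E(x_i - y_i) = E(n_{i\cdot} - n_{\cdot i}) = 0 \tag{8}$$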
Therefore,
$$\frac{1}{2}\sum_{i}\sum_{k=1}^{L}\sum_{l \neq k} E(x_{ik} - y_{ik})\,E(x_{il} - y_{il}) = 0 \tag{9}$$
Substituting (6) and (9) into (5), we get
$$E(D_C) = E\left[\frac{1}{2}\sum_{i}(x_i - y_i)^2\right] = E(N_d) \tag{10}$$
where Nd is the number of sites with different nucleotides in sequences X and Y. This proof works for any number of states.
Equation 10 shows that the expected number of differences between two sequences is simply half the sum over all states of the squared differences of the corresponding base (amino acid) frequencies in the sequences compared.
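The equality in Equation 10 can be checked numerically. The following is a minimal sketch, not the simulation used in this study: it assumes a uniform random ancestral sequence and an equal-probability (Jukes-Cantor-like) change applied independently at each site with the same probability `p` in both lineages. The function names, the value of `p`, and the number of replicates are illustrative assumptions.

```python
import random
from collections import Counter

STATES = "ACGT"

def composition_distance(x, y):
    """D_C: half the sum over states of squared count differences."""
    cx, cy = Counter(x), Counter(y)
    return 0.5 * sum((cx[s] - cy[s]) ** 2 for s in set(cx) | set(cy))

def evolve(seq, p, rng):
    """Change each site independently with probability p to a different base."""
    out = []
    for base in seq:
        if rng.random() < p:
            base = rng.choice([b for b in STATES if b != base])
        out.append(base)
    return out

def mean_dc_and_nd(length=200, p=0.2, replicates=2000, seed=1):
    """Average D_C and N_d over replicates of a homogeneous process."""
    rng = random.Random(seed)
    total_dc = 0.0
    total_nd = 0.0
    for _ in range(replicates):
        ancestor = [rng.choice(STATES) for _ in range(length)]
        x = evolve(ancestor, p, rng)  # same process in both lineages
        y = evolve(ancestor, p, rng)
        total_dc += composition_distance(x, y)
        total_nd += sum(a != b for a, b in zip(x, y))
    return total_dc / replicates, total_nd / replicates

if __name__ == "__main__":
    mean_dc, mean_nd = mean_dc_and_nd()
    print(f"mean D_C = {mean_dc:.2f}, mean N_d = {mean_nd:.2f}")
```

Under these assumptions the two averages should agree up to sampling error, so their difference should shrink as the number of replicates grows.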
When the two sequences compared do not exhibit the same substitution pattern (heterogeneity scenario), the composition distance obtained using Equation 1 is expected to be larger than that obtained under the homogeneity case. This is because the observed difference in frequency of the same state in two sequences (xi - yi) will then be larger. To show this, we conducted computer simulations for amino acid sequences. In this simulation, the probability of change from one amino acid residue to another was made the same for all residues in the sequence evolution in both lineages (homogeneity case, open circles in Fig 2). In the heterogeneity scenario, the transition probability to a given residue is made increasingly larger in a preselected lineage to effect a larger deviation in the substitution pattern [pattern deviation factor (pdf)], with all other transition probabilities kept equal. pdf is a factor by which the probability of change to a prespecified amino acid (or nucleotide) differs from that expected under the homogeneity case. For s states (s = 4 for nucleotides and s = 20 for amino acids), the probability of substitution to a given state is 1/(s - 1) when all changes are equally likely. A pdf equal to f means that the probability of substitution from any state to this prespecified state is f/(s - 1); f = 1 corresponds to the homogeneous process. The probability of change to any other state is equal and is given by (1 - f/[s - 1])/(s - 2). A higher value of pdf indicates greater heterogeneity in the patterns of substitution.
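The pattern deviation factor can be made concrete with a short sketch of the substitution probabilities it implies. The function name and arguments are hypothetical. The text does not say what happens when the current state is itself the prespecified state; the sketch assumes that all other states are then equally likely, which is labeled in the code.

```python
def substitution_probs(current, target, states, f=1.0):
    """Probabilities of changing from `current` to each other state.

    `target` is the prespecified state favored by the pattern deviation
    factor f; f = 1 gives the homogeneous (equal-probability) case.
    """
    s = len(states)
    if s < 3:
        raise ValueError("need at least three states")
    if not 0 <= f <= s - 1:
        raise ValueError("f must lie between 0 and s - 1")
    others = [st for st in states if st != current]
    if current == target:
        # Assumption (not stated in the text): uniform over the other states.
        return {st: 1.0 / (s - 1) for st in others}
    to_target = f / (s - 1)
    to_other = (1.0 - to_target) / (s - 2)
    return {st: (to_target if st == target else to_other) for st in others}

if __name__ == "__main__":
    # Nucleotides, pdf = 2, changes toward G favored.
    print(substitution_probs("A", "G", "ACGT", f=2.0))
```

For nucleotides with f = 2, the probability of a change to the favored state is 2/3 and each of the two remaining states receives 1/6, so the probabilities still sum to 1.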
Fig 2 shows the results of computer simulations for the homogeneity and heterogeneity cases. It is clear that DC is higher when the evolutionary process is heterogeneous. This disparity increases with increasing heterogeneity; we call this difference the disparity index (ID). ID increases when the number of substitutions increases with pdf kept constant (Fig 3A) and when the pdf increases with the number of substitutions kept constant (Fig 3B). The relationship in both cases is explained approximately by a second order power curve, as the frequency difference is squared in the DC formula.
In empirical data analysis, we obtain ID for a given pair of sequences using the equation
$$I_D = \frac{1}{2}\sum_{i}(x_i - y_i)^2 - N_d \tag{11}$$
where xi and yi are the counts of ith type of nucleotide (or amino acid) in sequences X and Y, respectively, and the observed Nd is used as an estimator of the DC expected under homogeneity. When the homogeneity assumption is satisfied, E(ID) = 0, because the expected value for the first term is the same as that for the second term (Equation 10).
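As an illustration, the disparity index for a pair of aligned sequences can be computed in a few lines. This is a minimal sketch rather than the MEGA2 implementation; the function names are hypothetical, and the sequences are assumed to be aligned, of equal length, and free of gaps or ambiguous characters.

```python
from collections import Counter

def composition_distance(x, y):
    """D_C: half the sum over states of squared count differences."""
    cx, cy = Counter(x), Counter(y)
    return 0.5 * sum((cx[s] - cy[s]) ** 2 for s in set(cx) | set(cy))

def num_differences(x, y):
    """N_d: number of aligned sites at which the two sequences differ."""
    return sum(a != b for a, b in zip(x, y))

def disparity_index(x, y):
    """I_D = D_C - N_d for two aligned sequences of equal length."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned and of equal length")
    return composition_distance(x, y) - num_differences(x, y)

if __name__ == "__main__":
    x = "AAAACCGT"
    y = "AAAAGGGT"
    print(composition_distance(x, y), num_differences(x, y), disparity_index(x, y))
```

In this toy pair the counts of C and G each differ by 2, giving DC = 4, while only two sites differ, so ID = 2. The same functions apply to amino acid sequences because nothing in them depends on the alphabet.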
| MONTE CARLO METHOD FOR TESTING THE HOMOGENEITY ASSUMPTION |
|---|
We test the homogeneity assumption by calculating the probability of observing a composition distance (DCO) greater than that expected under the null hypothesis of homogeneity, i.e., ID > 0. Because the actual distribution of DC under homogeneity for the given base frequencies and number of differences is not known a priori, we derive it using a Monte Carlo approach. In each replicate of the Monte Carlo method, we start with a random sequence of length L; the expected frequencies are made equal to the average base frequencies computed using the given pair of sequences. Two descendent sequences are then generated by introducing substitutions randomly until the number of differences between the descendent sequences becomes equal to Nd for the original pair of sequences. This is done to obtain DC under the homogeneity assumption from the observed data, given the average base frequencies for the original pair of sequences. For effecting a substitution, we randomly select one of the two descendent sequences and then choose a site in this sequence at random. We replace the nucleotide at this site (irrespective of its current base) with another chosen randomly on the basis of the average observed frequencies obtained above. Therefore, the resulting sequences are expected to have the same base frequencies, as the substitutions occur with the same evolutionary process in both lineages. (This scheme is chosen because there is no a priori information on the null pattern of substitution and evolutionary rate heterogeneity among sites or between lineages.) Using the two sequences generated in the current replicate (say b), we compute DC,b. This process is repeated a desired number of times and the proportion of replicates in which DCO is higher than the DC,b (ID > 0) is computed. If this proportion is >95%, we can reject the null hypothesis at the 5% level. 
As an example, we show the distribution of DC for the amino acid sequence of human and mouse myeloid differentiation primary response proteins in Fig 4. For this pair, DCO = 93, Nd = 56, and therefore ID = 37. This ID is >0 at the 5% level as DCO is located on the right of the 95% cutoff point (DC = 92) in the DC distribution.
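The Monte Carlo procedure described above can be sketched as follows. This is an illustrative reading of the description, not the MEGA2 code: the function names, the default number of replicates, and the use of pooled counts as sampling weights for the average base frequencies are assumptions.

```python
import random
from collections import Counter

def composition_distance(x, y):
    """D_C: half the sum over states of squared count differences."""
    cx, cy = Counter(x), Counter(y)
    return 0.5 * sum((cx[s] - cy[s]) ** 2 for s in set(cx) | set(cy))

def disparity_test(x, y, replicates=1000, seed=None):
    """Return (I_D, proportion of replicates with observed D_C > simulated D_C)."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned and of equal length")
    rng = random.Random(seed)
    length = len(x)
    n_d = sum(a != b for a, b in zip(x, y))
    dc_obs = composition_distance(x, y)
    # Average base frequencies of the pair, expressed as pooled counts.
    pooled = Counter(x) + Counter(y)
    states = sorted(pooled)
    weights = [pooled[s] for s in states]
    exceed = 0
    for _ in range(replicates):
        ancestor = rng.choices(states, weights=weights, k=length)
        desc = [list(ancestor), list(ancestor)]
        diffs = 0
        # Add substitutions until the descendants differ at exactly N_d sites.
        while diffs != n_d:
            which = rng.randrange(2)      # pick one descendant at random
            site = rng.randrange(length)  # pick a site at random
            was_diff = desc[0][site] != desc[1][site]
            # New state is drawn irrespective of the current base.
            desc[which][site] = rng.choices(states, weights=weights)[0]
            is_diff = desc[0][site] != desc[1][site]
            diffs += int(is_diff) - int(was_diff)
        if dc_obs > composition_distance(desc[0], desc[1]):
            exceed += 1
    return dc_obs - n_d, exceed / replicates

if __name__ == "__main__":
    x = "A" * 50 + "C" * 50
    y = "A" * 50 + "G" * 50
    i_d, proportion = disparity_test(x, y, replicates=200, seed=0)
    print(i_d, proportion)
```

Following the rule stated above, homogeneity would be rejected at the 5% level when the returned proportion exceeds 0.95. The loop tracks the number of differing sites incrementally, because a replacement can either create or remove a difference.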
| POWER OF THE ID-TEST |
|---|
To assess the power of the Monte Carlo test in detecting differences in the evolutionary patterns, we conducted computer simulations under biologically diverse conditions. Fig 5A shows the type I error of the ID-test at the 5% significance level when the pattern of substitution is homogeneous for three sets of conditions: (1) the Jukes-Cantor (JC; JUKES and CANTOR 1969) [...] 5%, and thus the test is not conservative. Similar results were obtained in simulations involving unequal rates of evolution between lineages and for protein sequences (results not shown). Given that the type I error could be >5% in some cases (Fig 5A), we suggest that a 1% significance level may be more appropriate.
Fig 5B and C show the power of the ID-test in rejecting a false null hypothesis when the sequences compared have actually evolved with different evolutionary processes. The statistical power of the ID-test in rejecting the null hypothesis increases with the number of substitutions and sequence length (Fig 5B). For a given sequence length and number of substitutions, its power increases quickly with even small deviations in the evolutionary pattern between sequences (pdf = 2; Fig 5C). Similar results are found when the sequence evolution followed HKY, HKY + G, GTR, and GTR + G models.
Relative power of the ID-test:
The χ²-test is often employed to examine if the base frequencies are similar between sequences. In this case,
$$\chi^2 = \sum_{i}\frac{(f_{1i} - f_{2i})^2}{f_{1i} + f_{2i}} \tag{12}$$
is used, where f1i and f2i are the respective counts of the ith state in sequences 1 and 2. Type I errors of the χ²-test at the 5% significance level obtained in computer simulations under the homogeneity assumption are given in Fig 5D. The χ²-test is clearly a conservative test. This conservative nature is also manifested in the power curves for the χ²-test when the null hypothesis is false (Fig 5E and F). In all our simulations, the ID-test was more powerful than the χ²-test (Fig 5; other results not shown).
The reason for the conservative nature of the classical χ²-test is the underlying assumption that the counts are independent. This is not so because the frequencies obtained from homologous sequences are not independent due to the shared evolutionary history. This nonindependence inflates the denominator in the χ²-test formula as it incorporates information from all sites, even including those that have not undergone any substitutions. Inclusion of these invariant positions in the denominator makes the χ²-value too low, whereas their contribution in the numerator automatically cancels out. This effect is more severe for closely related sequences than for distantly related sequences, as a larger fraction of sites are identical by descent and thus invariant in the former. We conducted computer simulation studies to examine the type I error of the χ²-test on the basis of only those sites that had undergone change in one or both lineages (we refer to this as the V2-test). In this case, the test became liberal with type I error almost two times the significance level when the null hypothesis is true. One might consider constructing a null distribution for the V2-test using the Monte Carlo approach, but it is unclear what the expected V2 is under homogeneity. In any case, DC and V2 are quite similar in form and the expected distribution of DC under homogeneity can be easily constructed. Furthermore, the V2 statistic has no clear-cut biological interpretation, unlike the DC statistic.
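For comparison, the classical χ² statistic on state counts can be sketched as below. The sketch assumes two sequences of equal length, for which the two-sample χ² statistic reduces to the sum over states of the squared count difference divided by the summed count. The function name is hypothetical, and the critical value is the standard upper 5% point for 3 degrees of freedom (four nucleotide states).

```python
from collections import Counter

# Upper 5% point of the chi-square distribution with 3 degrees of freedom.
CRITICAL_5PCT_DF3 = 7.815

def chi_square_composition(seq1, seq2):
    """Chi-square statistic comparing state counts of two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be of equal length")
    c1, c2 = Counter(seq1), Counter(seq2)
    return sum((c1[s] - c2[s]) ** 2 / (c1[s] + c2[s]) for s in set(c1) | set(c2))

if __name__ == "__main__":
    seq1 = "A" * 30 + "C" * 20 + "G" * 25 + "T" * 25
    seq2 = "A" * 20 + "C" * 30 + "G" * 25 + "T" * 25
    stat = chi_square_composition(seq1, seq2)
    print(stat, stat > CRITICAL_5PCT_DF3)
```

Every site contributes to the denominator, including sites that are identical in the two sequences, which illustrates why the statistic shrinks for closely related sequences.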
The problem of observing the same base at a site due to factors such as identity by descent was also considered by ![]()
| TESTING THE HOMOGENEITY OF MOLECULAR EVOLUTIONARY PATTERNS IN HUMAN AND MOUSE GENES |
|---|
Human and mouse genome sequencing projects provide DNA sequences of a large number of genes, which gives us an opportunity to examine the homogeneity of patterns of substitution for different genes in human and mouse lineages on a genome-wide scale. We assembled a data set consisting of cDNA sequences of 3789 human genes and their mouse orthologs using the July 1999 release of the HOVERGEN database (DURET et al. 1994) [...] 15% of all sites in a gene were fourfold degenerate, with the average number being 220.
We tested the null hypothesis of similarity of the evolutionary process in human and mouse lineages (homogeneity assumption) for each gene by the ID-test. Results show that the null hypothesis can be rejected in 41% of the genes at the 5% significance level (Table 1). This indicates that the neutral evolutionary sites are potentially evolving with significantly different substitution patterns between human and mouse lineages. Homogeneity-rejected genes are not necessarily evolving faster than other genes because the average proportion of sites different in the two cases was similar (0.36 and 0.32, respectively). As expected, the χ²-test was conservative as it rejected the null hypothesis in only 23% of genes at the 5% level and only 14.4% at the 1% level (Table 1). Therefore the χ²-test is only one-half as powerful as the ID-test for these data.
Mammalian genomes are mosaics of regions of homogeneous base compositions (see review in BERNARDI 2000) [...] |ΔGC4|) is expected to be higher for homogeneity-rejected genes as compared to the other genes. This was indeed the case, as the average |ΔGC4| over homogeneity-rejected genes was 12.9%, which is almost three times that observed in all other genes (4.6%). In fact, the ID-test for G + C content difference almost always rejects the same genes (40.8% at the 5% level).
Significant differences in G + C content between genes could arise if the G + C content of one of the two genomes has experienced an overall change. This does not appear to be the case as the percentages of G + C content averaged over all genes in fourfold degenerate sites are 59.7 and 58.3%, respectively, for human and mouse genomes. Another possibility is chromosomal rearrangement. Mammalian genomes are also known to rearrange at a high rate (![]()
We also conducted tests of the homogeneity assumption for zerofold degenerate sites, which are under strong purifying selection because all changes at these nucleotide sites produce a change in the amino acid encoded. The χ²-test rejected the null hypothesis in only 0.3% of the cases, which is much lower than that expected by chance alone at the 5% significance level. The ID-test rejects the null hypothesis in 12.4% of the genes (Table 1). A similar result was seen in the analysis of protein sequences, in which the null hypothesis was rejected in 12.9% of the cases. These results indicate that protein sequences have evolved with a more homogeneous process than evolutionarily neutral sites because 41% of the genes were rejected in the latter case.
Thus, we have shown the usefulness of the ID statistic as a diagnostic tool to identify pairs of sequences that are evolving with significantly different substitution patterns. In molecular phylogenetics, the ability to identify such sequence pairs prior to evolutionary tree reconstruction using the ID-test is potentially useful for deciding on whether or not to use phylogenetic reconstruction methods that relax the homogeneity assumption (e.g., ![]()
![]()
![]()
| ACKNOWLEDGMENTS |
|---|
We thank S. Blair Hedges, Michael Douglas, Marlis Douglas, Tom Dowling, Mark Miller, and Philip Hedrick for comments on an earlier draft of this article; Sankar Subramanian for help with cDNA sequence alignments; and Michael Rosenberg for invaluable help with the simulation study. We also thank two anonymous reviewers and Dr. Marcy Uyenoyama for many insightful comments and making the derivation of Equation 10 more concise. This work was supported by research grants to S.K. from the National Institutes of Health (HG02096), National Science Foundation (DBI-9983133), and Burroughs-Wellcome Fund (BWF-1001311). Methods described in this work are available in the computer software MEGA2 (http://www.megasoftware.net).
Manuscript received February 8, 2001; Accepted for publication April 6, 2001.
| LITERATURE CITED |
|---|
BERNARDI, G., 2000 Isochores and the evolutionary genomics of vertebrates. Gene 241:3-17.
CORNISH-BOWDEN, A., 1977 Assessment of protein sequence identity from amino acid composition data. J. Theor. Biol. 65:735-742.
DURET, L., D. MOUCHIROUD, and M. GOUY, 1994 HOVERGEN: a database of homologous vertebrate genes. Nucleic Acids Res. 22:2360-2365.
FUNK, D. J., D. J. FUTUYMA, G. ORTI, and A. MEYER, 1995 Mitochondrial DNA sequences and multiple data sets: a phylogenetic study of phytophagous beetles (Chrysomelidae: Ophraella). Mol. Biol. Evol. 12:627-640.
GALTIER, N., and M. GOUY, 1998 Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15:871-879.
HASEGAWA, M., H. KISHINO, and T. YANO, 1985 Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174.
HASEGAWA, M., T. HASHIMOTO, J. ADACHI, N. IWABE, and T. MIYATA, 1993 Early branchings in the evolution of eukaryotes: ancient divergences of entamoeba that lacks mitochondria revealed by protein sequence data. J. Mol. Evol. 36:380-388.
JUKES, T. H., and C. R. CANTOR, 1969 Evolution of protein molecules, pp. 21-132 in Mammalian Protein Metabolism, edited by H. N. MUNRO. Academic Press, New York.
KUMAR, S., S. R. GADAGKAR, A. FILIPSKI, and X. GU, 2001 Determination of the number of conserved chromosomal segments between species. Genetics 157:1387-1395.
LOCKHART, P. J., M. A. STEEL, M. D. HENDY, and D. PENNY, 1994 Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 11:605-612.
NAYLOR, G. J. P., and W. M. BROWN, 1998 Amphioxus mitochondrial DNA, chordate phylogeny, and the limits of inference based on comparisons of sequences. Syst. Biol. 47:61-76.
NEI, M., and S. KUMAR, 2000 Molecular Evolution and Phylogenetics. Oxford University Press, New York.
RODRIGUEZ-TRELLES, F., R. TARRIO, and F. J. AYALA, 2000 Evidence for a high ancestral GC content in Drosophila. Mol. Biol. Evol. 17:1710-1717.
RZHETSKY, A., and M. NEI, 1995 Tests of applicability of several substitution models for DNA sequence data. Mol. Biol. Evol. 12:131-151.
SAITOU, N., and M. NEI, 1987 The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.
STEEL, M. A., P. J. LOCKHART, and D. PENNY, 1993 Confidence in evolutionary trees from biological sequence data. Nature 364:440-442.
TARRIO, R., F. RODRIGUEZ-TRELLES, and F. J. AYALA, 2000 Tree rooting with outgroups when they differ in their nucleotide composition from the ingroup: the Drosophila saltans and willistoni groups, a case study. Mol. Phylogenet. Evol. 16:344-349.
/ß) were as follows, in order from left to right: (1) 0.25, 0.25, 0.25, 0.25,
, 1; (2) 0.05, 0.45, 0.05, 0.45, 0.1, 5; (3) 0.20, 0.30, 0.20, 0.30, 0.2, 2; (4) 0.15, 0.35, 0.15, 0.35, 0.3, 3; (5) 0.10, 0.40, 0.10, 0.40, 0.4, 4; (6) 0.05, 0.45, 0.05, 0.45, 0.5, 5; (7) 0.20, 0.30, 0.20, 0.30, 0.6, 2; (8) 0.15, 0.35, 0.15, 0.35, 0.7, 3; (9) 0.10, 0.40, 0.10, 0.40, 0.8, 4; (10) 0.05, 0.45, 0.05, 0.45, 0.9, 5; (11) 0.05, 0.45, 0.05, 0.45, 1.0, 5. gA, gT, gC, and gG refer to the respective equilibrium frequencies of the four nucleotides, a is the value of the gamma parameter quantifying the extent of rate heterogeneity among sites, and 




C:A 





