- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Morozov, P.
- Articles by Rzhetsky, A.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Morozov, P.
- Articles by Rzhetsky, A.
A New Method for Characterizing Replacement Rate Variation in Molecular Sequences: Application of the Fourier and Wavelet Models to Drosophila and Mammalian Proteins
Pavel Morozov1,a, Tatyana Sitnikova1,b, Gary Churchillc, Francisco José Ayalad, and Andrey Rzhetskyea Columbia Genome Center, Columbia University, New York, New York 10032,
b Eisai Research Institute, GEFA Biology Group, Boston, Massachusetts 02138,
c The Jackson Laboratory, Bar Harbor, Maine 04609,
d New York, New York 10013
e Department of Medical Informatics, Columbia University, New York, New York 10032
Corresponding author: Andrey Rzhetsky, Columbia Genome Center, Columbia University, 1150 St. Nicholas Ave., Unit 109, New York, NY 10032., andrey{at}genome2.cpmc.columbia.edu (E-mail)
| ABSTRACT |
|---|
We propose models for describing replacement rate variation in genes and proteins, in which the profile of relative replacement rates along the length of a given sequence is defined as a function of the site number. We consider here two types of functions, one derived from the cosine Fourier series, and the other from discrete wavelet transforms. The number of parameters used for characterizing the substitution rates along the sequences can be flexibly changed and in their most parameter-rich versions, both Fourier and wavelet models become equivalent to the unrestricted-rates model, in which each site of a sequence alignment evolves at a unique rate. When applied to a few real data sets, the new models appeared to fit data better than the discrete gamma model when compared with the Akaike information criterion and the likelihood-ratio test, although the parametric bootstrap version of the Cox test performed for one of the data sets indicated that the difference in likelihoods between the two models is not significant. The new models are applicable to testing biological hypotheses such as the statistical identity of rate variation profiles among homologous protein families. These models are also useful for determining regions in genes and proteins that evolve significantly faster or slower than the sequence average. We illustrate the application of the new method by analyzing human immunoglobulin and Drosophilid alcohol dehydrogenase sequences.
WHILE variation in the rate of amino acid replacement across sites within protein sequences has been frequently observed, mathematical modeling of this phenomenon remains a challenging problem. The difficulty lies mainly in choosing the right trade-off between the parameter richness of the model, quality of statistical inference (more parameter-rich models tend to provide parameter estimates with larger variance), and the computational cost of applying the model to real data, the optimum balance being usually different for different data sets and available computational facilities. Earlier models of rate variation had few parameters, while the number of model parameters steadily increased in the more recent models (e.g., see ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
Here we introduce a model that describes rate variation along sites as a function of each individual site. The number of parameters is not predetermined, but rather selected separately for each data set, ranging from one to the total number of sites in the sequence minus one. We introduce two model variations: wavelet and Fourier. We illustrate their application to human immunoglobulin and Drosophilid alcohol dehydrogenase sequences.
| Maximum-likelihood calculations assuming unequal rate of substitution across sites |
|---|
The long-term evolution of an individual site within a protein sequence can be conveniently modeled as a Markov chain in continuous time (e.g., ![]()
0). The Markov process X(t) thus has integer outcomes between 1 and 20, and there are well-developed techniques for computing the probability of replacing of amino acid i with amino acid j after time t, Pij(t), which is defined, assuming a time-homogeneous process, as
![]() |
(1) |
where {Pij(t)} = P(t) is a matrix of transition probabilities.
Given a matrix of instantaneous transition probabilities, Q, P(t) is computed as exp{Qt}. Assuming that our Markov process has an equilibrium state, we introduce a row vector
whose entries represent the equilibrium amino acid frequencies associated with Q. By definition, the vector
can be obtained by solving the system of linear equations
Q = 0 and
20i = 1
i = 1.
Q and t will always appear in a likelihood function as a product and thus cannot be estimated separately. We therefore normalize Q (e.g., see ![]()
![]() |
(2) |
Once Q is normalized, t will be measured in terms of the expected number of amino acid replacements per site. For example, for the simplest model of amino acid substitution, the Poisson model (![]()
are equal to 1/20, while all off-diagonal elements in the matrix are different in the more advanced ![]()
![]()
![]()
![]()
![]()
![]()
To illustrate the likelihood computation for a given tree, consider a hypothetical set S of four homologous contemporary protein sequences of length l aligned without gaps. We assume that the correct phylogenetic topology T for these sequences is known (see Fig 1). By definition, the likelihood L of observing the contemporary sequences under our model is Prob{S | T,
} multiplied by an arbitrary positive constant (usually set to 1), where
is a set of specified values of model parameters. The probability of observing data at the ith site of the sequence alignment, given a set of parameter values
i = {Q, t1i, t2i, ... , t6i} and topology T as shown in Fig 1, is computed as (![]()
![]() |
(3) |
where
j is the frequency of the jth amino acid at the root of the tree, and the rest of the notations are as explained in the legend of Fig 1. The total likelihood L is a product of the probabilities of observing data at individual sites, and the maximum-likelihood estimates of the parameters are obtained by finding a set of parameter values that maximize the value of L.
|
Let us consider the topology shown in Fig 1 and assume that the set of branch lengths for the ith site, {t1i, t2i, ... , t6i}, can be expressed as {t1 ci, t2 ci, ... , t6 ci}, where {t1, t2, ... , t6} is the set of expected branch lengths averaged over all l sites, and ci is a nonnegative constant defining the relative replacement rate of the ith site. One can think of introducing l - 1 independent relative rate parameters c1, c2, ... , cl-1 (one parameter for each of l - 1 sequence sites. Below we refer to this model as the unrestricted rates model; the relative rate of the lth site is not independent on other rates and is equal to l - c1 - ... - cl-1), although this is usually considered impractical because the number of free parameters appears large relative to the amount of data, thus making accurate estimation difficult. (There are other similarly formulated models. For example, ![]()
There are several recognized ways to decrease the number of rate variation parameters (e.g., see ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
Here we suggest a different approach: we define ci as a function of site number and a set of real-valued parameters {a1, a2, ... , ak}, ci = f{i; a1, a2, ... , ak}, where
![]() |
(4) |
The denominator of the expression is required to ensure that the average of f(i; a1, a2, ... , ak} over i = 1, 2, ... , l is always equal to 1. {
(i, j)}j=1,k are distinct basis functions, linear combination of which can precisely fit any relative substitution rate profile when k = l - 1; one can also get an approximation of the substitution rate profile when some of the least contributing basis functions are dropped. The function f(i; a1, a2, ... , ak} cannot be observed directly, but rather has to be estimated using the maximum-likelihood approach.
Assuming that the potential readers of this article may not have had a prior exposure to the ideas of decomposition of functions, we first introduce relevant concepts in an intuitive way.
| Basis vectors and functions, function decomposition |
|---|
It might be easier to start by considering decomposition of a vector (![]()
Decomposition of a vector into basis vectors is mathematically indistinguishable from decomposition of a discrete function into basis functions. We can simply view our three-dimensional vector (x, y, z), as a discrete function, f(1) = x, f(2) = y, and f(3) = z, and decompose this function into three basis functions,
1 = (1, 0, 0),
2 = (0, 1, 0), and
3 = (0, 0, 1): f = x
1 + y
2 + z
3. Therefore, discrete basis functions are nothing but compact representations of basis vectors, and there are an infinite number of distinct sets of basis functions. The orthogonality of basis functions is defined by analogy with the orthogonality of basis vectors.
The Fourier version of our model is obtained from (4) by defining
(i, j) as cos(i
j
/l) (see Fig 2), where {
j}j=1,k is a subset of k different integers from {1, 2, ... , l - 1}.
|
![]() |
(5) |
The choice of cosines as basis functions is motivated by the well-known fact that cosines with period ranges from 1 to l form a valid basis of functions for exactly representing any discrete function (not necessarily periodic) on the interval between 1 and l (e.g., see ![]()
The wavelet version of our model is obtained by defining
(i, j) as one of the wavelet functions commonly used in discrete wavelet transforms (see Fig 2). Unlike basis functions in the Fourier series, wavelets are local functions, which is a convenient property for summarizing rate variation at different scales. The choice of wavelets as basis functions also allows for a considerable reduction in computation time required for maximizing the likelihood function compared to the corresponding computations under the Fourier model. In this work we use
(i, j) based on Haar, Daubechies 4, Daubechies 12, and Daubechies 20 wavelets (see Fig 3; ![]()
![]()
|
A wavelet is a univariate real-valued function selected such that it vanishes outside a limited interval and has equal-sized areas below and above zero. The ultimate goal of a discrete wavelet transform is to represent and/or approximate a function given as a set of l values, where l is usually a power of 2. In our case l is the number of sites and the function that we are trying to approximate is the relative substitution rate at each site.
The wavelets are used for constructing an orthogonal basis of functions as described below for Haar wavelets. The basis function in the case of the simplest Haar transform has the shape of a step function (see Fig 2 and Fig 3A) such that for a wavelet defined on interval [0, 1], the value is -1 on interval [0, 1/2], and +1 on interval [1/2, 1]. The basis functions for the Haar wavelet decomposition are generated by scaling and shifting the same step wavelet function such that exactly one function covers the complete interval [1, l], two basis functions cover exactly half of the complete interval (intervals [1, l/2], and [l/2 + 1, l], respectively), four basis functions cover one-fourth of the complete interval (intervals [1, l/4], [l/4 + 1, l/2], [l/2 +1, 3l/4], and [3l/4 + 1, l], respectively), and so on. The early wavelets can contribute to a large range (e.g., left half-right half of the sequence) of variations, while successively more focused wavelets pick out smaller regions and eventually the individual sites. The basis functions with the smallest domain are most abundant (there are l/2 of them) and cover just two data points; such functions together cover the complete interval [1, l]. In addition to step functions, the complete set of Haar basis functions contains one scale function that has value 1 on interval [1, l]. The details of fast computation of a discrete wavelet decomposition can be found in ![]()
One may define the optimality of a set of basis functions
(i, j) as the minimal number of functions from this set required to fit the data with a predefined minimum of the maximum-likelihood value. We suspect that there is no best set of basis functions that would enable one to obtain the most parsimonious description for any arbitrary data set. It is more likely that the Fourier version of the model would be more suitable for data sets with a pronounced periodicity of rates, while the wavelet version will work better for aperiodic and/or sparse rate variation profiles. In this article we illustrate the application of both.
| Choosing the optimum rate variation model |
|---|
When choosing among several alternative models for describing the same data, one intuitively searches for optimum trade-off between the goodness-of-fit of a model and economy of representationthe number of free parameters defined by the model. The general approach to such a problem is to search through the space of all possible models and make certain comparisons between competing models. We discuss these two steps, searching and comparison of models, separately.
Searching through the space of all possible models:
Both the Fourier and wavelet models allow for the generation of a large number of rate variation profiles, ranging from a zero rate parameter profile to the most parameter rich (l - 1 relative rate variation parameters for a set of sequences of length l). Therefore, for each data set it would ideally be of interest to (i) order all possible rate variation profiles by their maximum-likelihood values, and (ii) use an objective scheme to decide what subset of rate variation parameters is needed for an optimum description of the data set.
The searching through the space of different profile models can be done in the following ways.
- Backward elimination search. First, beginning with the "general" wavelet or Fourier profile using all l - 1 rate parameters, successive parameters may be removed one by one by iteratively deleting the parameter with the smallest absolute value and repeating the maximum-likelihood optimization under each reduced model. The process is repeated until the change in likelihood is "small," for example, when compared to a 5% chi-square critical value corresponding to the doubled difference between likelihoods of the "more complex" and "less complex" models (the likelihood-ratio test).
- Forward selection search. One starts with the equal-rate model and increases the number of parameters by one, trying all possible combinations, until the increase in the likelihood is large, say, according to the likelihood-ratio test.
- Markov chain Monte Carlo search. It is possible to substitute/supplement the maximum-likelihood analysis with a Bayesian analysis with a flat initial distribution using the Markov chain Monte Carlo (MCMC) technique for simultaneously searching for the best phylogenetic tree and the best rate variation model [e.g., see "reversible jump" method by
GREEN 1995 and
MAU et al. 1996 ; the MCMC method is described in more detail in the next section]. We did not implement this strategy in the current study.
- Heuristic search. There are numerous heuristics that can be used for the purpose. To decide the order in which cosine functions or wavelets should be added in stepwise computation of Fourier or wavelet profiles, we started by estimating all l - 1 parameters by the least-squares fitting of the cosine Fourier function to the observed profile calculated under the uniform rate model. Given the set of normalized numbers of substitutions per site, xi's, estimated under the equal-rate model, one can compute the set of wavelet or Fourier parameter values that minimizes

- where the number of rate variation parameters, k, is set to l - 1 at the beginning. One then proceeds by iteratively eliminating the parameters with the smallest absolutes, which give the smallest contribution to the resulting rate profile.
Comparison of alternative models:
This can also be addressed in several different ways. For example, rival models, nested or nonnested, may be compared by using one of the approaches built on the likelihood analysis but penalizing parameter-rich models, such as the Akaike information criterion (AIC; ![]()
![]() |
(6) |
where Ni is the number of parameters used in the ith model and Li is the maximum-likelihood value obtained under that model. The idea behind this formula is to penalize an increase in the number of parameters if the addition of each new parameter increases the likelihood value by less than one unit of log-likelihood. The better the fit of the model to the data, the larger the AIC value will be. The AIC tends to favor parameter-rich models as the data sample size increases. A second popular criterion, the Bayesian information criterion (BIC; ![]()
![]() |
(7) |
where n is the sample size. BIC usually tends to choose less parameter-rich models than AIC because in real data analyses log n is usually >2. In our case BIC does not seem to be very useful because under our model different sites within the same sequence are not identically distributed and the sample size is not well defined. If we would believe that in this case the sample size is 1 (as suggested by Z. YANG, personal communication), BIC becomes equivalent to the maximum-likelihood value itself and automatically chooses the most parameter-rich model.
Next, one can use chi-square approximation for distribution of doubled difference between likelihoods of "complex" and "simple" models (the Cox test). This approach gives results very similar to those of AIC and can be misleading for small data sets because chi square distribution may not be an appropriate approximation for distribution of the test statistic. To correct this potential problem with distribution of the test statistic, one can use a parametric bootstrap to estimate distribution of the likelihood-ratio statistic in the Cox test (![]()
![]()
In this study we illustrated application of the majority of the above approaches of model comparison while leaving out only GREEN's (1995) test.
| Tests of biological hypotheses with the Fourier/wavelet model |
|---|
For many biological investigations it is of interest to identify regions within a protein alignment that have evolved significantly slower or faster than the average of the protein. Our model allows for such a test by generating confidence intervals around the relative replacement rates in each site of a protein alignment. It also allows for testing the homogeneity of replacement rates among sites within a protein and testing the identity of replacement rate profiles between two or more sets of homologous proteins. The described methods represent a straightforward application of classical testing theory (e.g., ![]()
First, we outline the test for homogeneity of replacement rates. Denote a set of the maximum-likelihood estimates of k rate variation parameters by a row vector, â. To test the null hypothesis that the replacement rate is the same across sites of a given protein (that is, a = 0 where 0 is a zero vector), we need to compute the Hessian matrix H at the point
= {
, â}. (Vector
contains the maximum-likelihood estimates of the tree branch lengths and the rest of the model parameters not included in â.) Element Hij of the matrix H is defined as the second partial derivative of the logarithm of the likelihood function taken with respect to
i and
j evaluated for a specified set of parameter values. The negated Hessian matrix evaluated at the maximum of the likelihood surface is known to asymptotically tend toward the inverse of the variance-covariance matrix for the vector of the maximum-likelihood estimates of model parameters (e.g., see ![]()
![]()
![]() |
(8) |
asymptotically follows a
2-distribution with k d.f., where Va is a submatrix of the complete variance-covariance matrix corresponding to the vector of Fourier parameters and V-1a is its inverse; the superscript t indicates the transpose of a vector (this test is a special case of the Wald test). Unfortunately, because of the restrictions on the values of rate variation parameters under the wavelet and Fourier models (the sum of all relative rate values across sites should always be equal to l for a data set with l sites, and negative values of relative rates are not allowed), depending on the expected values of the relative rates in the data set under analysis, the actual distribution of the maximum-likelihood estimates may significantly deviate from normal and the utility of the described test is limited.
Alternatively, one can use the likelihood-ratio test for the same purpose. The random variable
![]() |
(9) |
asymptotically follows a
2-distribution with k d.f. under the null model, provided that the more general model has exactly k extra rate variation parameters (a likelihood-ratio test). The numerator and denominator in (9) represent the maximum-likelihood values under the equal-rate and the wavelet/Fourier rate variation models, respectively. As with the Wald test, the actual distribution of the likelihood-ratio test statistic may not be close to the asymptotic
2-distribution for small data sets.
We next describe the comparison of replacement rate profiles from two sets of homologous protein sequences, S1 and S2. The independent wavelet or Fourier parameter estimates under the same rate variation model (i.e., under models with the identical number and type of rate variation parameters), â1 and â2, respectively, for these two data sets are evaluated as
![]() |
(10) |
which approximately follows a
2-distribution with k d.f. under the null hypothesis E(â1) = E(â2), where Va1 and Va2 are the variance-covariance matrices for the vectors â1 and â2, computed from data sets S1 and S2, respectively, in the same manner as described for Va above. Once again, the actual distribution of rate parameter estimates appears to deviate significantly from normal for data sets with significantly nonuniform rate profiles, so one should exercise caution in applying this test to real data. This test statistic has a likelihood-ratio test statistic counterpart, which can be constructed by analogy with Equation 9.
| Computing "confidence intervals" of relative substitution rates with the MCMC technique |
|---|
In this study we applied MCMC technology to compute an analog of confidence intervals for relative rate profiles for a predefined tree topology. We used in this computation a fully parameterized rate variation model, but the procedure is obviously applicable to models with a more moderate number of rate variation parameters. The idea in this case is to use the property of MCMC analysis that the frequencies of model parameter values encountered in a well-designed random walk through parameter space, given a fixed tree topology, give an estimate of the posterior probabilities of these parameter values. In this way, observing the variation of the intermediate profiles in the MCMC random walk, one can compute intervals for rate parameters including the middle 95% of a posterior probability distribution with a flat prior, which we loosely refer to below as "95% confidence intervals."
In MCMC technique the likelihood function is computed exactly in the same way as it is done in the maximum-likelihood analysis, but instead of numerical maximization of the likelihood function, one simulates a random walk through the space of parameter values and, optionally, tree topologies (in our case we limit our analysis to a predefined tree topology, so the tree topology was not changed). The core iteration step of the MCMC calculation samples a new point in the parameter space,
*, given the current value of
and likelihood values at both points. Depending on the ratio of likelihood values at points
* and
, the system accepts the new point or rejects it, staying in the old one. The random walk can be designed in an infinite number of ways: the only restriction to the simulation design provided by MCMC theory (see ![]()
![]()
* and
be time reversible. Time-reversibility is defined as the requirement that the following equation hold true:
* P(
* |
) =
P(
|
*), where P(
* |
) and P(
|
*) are the probabilities of transitions between states
* and
in the Markov chain corresponding to the random walk, and
* and
are the probabilities of states
* and
in the posterior distribution that we are trying to estimate.
Following suggestions formulated by ![]()
* being at state
:
![]() |
(11) |
where
(
* |
) is the conditional probability of sampling
* given
. We designed the simulation such that the probabilities
(
* |
) and
(
|
*) were equal. The second multiplier in expression (11), min(1, L(
*, T)/L(
, T)), is the conditional probability of accepting point
* given that it is already sampled.
By definition, the value of the posterior distribution,
, at point
is a product of the likelihood L(
, T), and the prior probability of observing the current
and T, divided by the prior probability of observing the current data set. Assuming lack of a prior knowledge about the parameter/tree distribution (a uniform prior), we have
*/
= L(
*, T)/L(
, T), and therefore the time-reversibility condition,
* P(
* |
) =
P(
|
*), is indeed satisfied.
We organized our MCMC computation in the following way. The random walk always started at the point of the maximum of likelihood. Each iteration of the simulation included an update cycle through all parameters, updating one parameter at a time and each time deciding whether to accept or reject the updated value; the tree topology was not changed. For each parameter,
i, we defined the maximum absolute change of this parameter,
i, and allowed for its single stochastic update (usually
i was set to be equal to half of the absolute value of the maximum-likelihood estimate of
i). More precisely, first, an equiprobable random decision to decrease or increase the parameter value was made. Second, the absolute amount of change was determined by drawing a random number from a uniform distribution defined at the interval between 0 and
i. If the resulting value of
i was inadmissible (for example, the value of the relative rate became negative),
i retained the old value, and next the parameter was updated. [The latter step was essential to ensure the condition
(
* |
) =
(
|
*)otherwise the probability of moving in the direction of the boundary of the region of allowed values is not generally equal to the probability of moving in the opposite direction.] For the sake of computational efficiency, under the unrestricted-rates model, the change in the value of one of the relative rate parameters was compensated by an equivalent change in the opposite direction in one of the other randomly chosen relative rate parameters. In this way we avoided the problem of renormalizing all relative rates after each change and had to recompute the likelihood values for only the two sites affected. If the new value of
i was admissible, the new point,
*, was accepted with probability min(1, L(
*, T)/L(
, T)). Once all parameters were updated in this way, the resulting state of the system,
, was saved. The confidence intervals for relative substitution rates at each site were computed from 10,000 saved intermediate states of the random walk; the results of the first few iterations that preceded reaching the stationary state of the random walk were discarded.
| Example 1. Variable regions of human immunoglobulins |
|---|
To illustrate the application of the model, we analyzed two sets of human immunoglobulin light chain variable region protein sequences. The first set contained seven light
chain variable region (V
) genes, representing three predominant V
subgroups (indicated in parentheses): O18 (I), L23 (I), L11 (I), A23 (II), O11 (II), L2 (III), and A27 (III) (![]()
chain variable region (V
) genes, representing three frequently expressed V
families (indicated by the first digit): 1c, 1e, 2b2, 2d, 3a, 3e, and 3j (![]()
![]()
![]()
For each data set the wavelet and Fourier parameters were estimated as follows: First, a neighbor-joining tree (![]()
![]()
We used the heuristic method for ordering parameters in both models: starting with the equal-rate model, we computed the substitution rate profile and fit this profile with wavelet and Fourier functions, ordering the resulting parameter values for each function by decreasing absolute value. Once the parameters were ordered, we performed a series of maximum-likelihood analyses starting with the equal-rate model and serially adding wavelet or Fourier parameters as ordered in the previous step, beginning with the parameters of largest absolute value. Because addition of a new parameter to a Fourier function changes the values of the Fourier function in each site of the profile, while the analogous addition to a wavelet function changes values only in a subset of sites, the computation under the wavelet model is generally much faster. We encountered no difficulty in performing estimation while adding wavelet or Fourier parameters up until their maximum number: 91 for immunoglobulin sequence alignments of length 92. The resulting relative substitution rate profiles for the most parameter-rich variants were identical under the Fourier model and wavelet models with the four different mother wavelet basis functions (see Fig 3) and corresponded to the profile obtained under the unrestricted-rates model. The 95% confidence intervals for the relative rates were computed under the unrestricted-rates model for both data sets.
The replacement-rates profiles are shown in Fig 4A and the corresponding confidence intervals are shown in Fig 4B. The regions of high replacement rate coincide with complementarity-determining regions (CDRs), which are the sites of antigen-antibody interaction. The regions of low replacement rate correspond to framework regions (FRs). For each data set the null hypothesis of rate constancy is rejected (P < 0.05) because the confidence intervals exclude rate 1 for at least one of the alignment sites. In contrast, we were unable to reject the null hypothesis that the patterns of rate variation are identical for these two IgV data sets, despite some differences between them. For example, the region between CDR2 and CDR3 (FR3) is slightly more variable in V
sequences than in the V
sequences.
|
The starting (equal-rate model) log-likelihood values were -885.06 and -896.67 for V
and V
data sets, respectively, while the final log-likelihood values of the most parameter-rich profiles were -794.14 and -804.33, respectively. Therefore, for both data sets the maximum-likelihood values under the unrestricted-rates model are >90 units of log-likelihood larger than under the equal-rate model. Using the PAML package (![]()
and V
data sets, respectively, which corresponded to log-likelihood values of -873.73 and -885.13, respectively. Therefore, the maximum-likelihood values under the discrete gamma model are ~80 units of log-likelihood smaller than the corresponding values under the unrestricted-rates model. The profile obtained under the discrete gamma model with eight discrete categories for V
sequences is compared to the wavelet and Fourier model profiles for the same data set in Fig 5.
|
We performed a series of comparisons of different models of rate variation in terms of the optimum (maximum) AIC value. The data showed similar patterns among the different basis functions, so here we present only data for the Haar basis functions in the wavelet decomposition of the V
and V
data sets. For the V
data set the optimum value of AIC was -1705.51 and corresponded to 37-parameter Haar wavelet function (see Fig 6), while the AIC value for the V
data set under the discrete gamma model was -1771.47. For the V
data set the optimum value of AIC was -1724.85 and corresponded to 37-parameter Haar wavelet function, while the AIC value under the discrete gamma model was -1794.26. Therefore, in both of these analyses, AIC favored the wavelet/Fourier model over the discrete gamma model. We obtained essentially the same results as with AIC with the Cox test, assuming that the asymptotic distribution of the test statistic is valid (data not shown). This similarity of results between AIC and the Cox test is not completely unexpected, because both tests use virtually the same test statistic and the same assumption of the asymptotic distribution of the test statistic.
|
To verify the assumptions concerning the distribution of the Cox statistic (logarithm of the maximum-likelihood value under the discrete gamma model subtracted from the logarithm of the maximum-likelihood value under the wavelet model), we implemented the parametric bootstrap version of the Cox test as described by ![]()
is high under the null model and therefore the difference between the maximum-likelihood values for the wavelet/Fourier model and for the discrete gamma model is not significant for the V
data set. We conjecture that the situation is likely to change when a larger number of sequences is considered.
|
To see if the estimates of relative rates are sensitive to the assumed model of amino acid substitutions, we also compared estimates of relative rates obtained under the Poisson model with those computed under the "more realistic" JTT model for the V
data set (see Fig 8). The resulting profiles appeared to be very similar, at least for the data sets that we analyzed (see Fig 8).
|
| Example 2. Drosophila alcohol dehydrogenase genes |
|---|
Drosophila alcohol dehydrogenase (ADH) has been the focus of much interest among evolutionary biologists (![]()
![]()
While ADH presumably performs the same biochemical functions in each of the species groups, the genetic and population genetic contexts are known to differ among them. For example, the Hawaiian species exist in limited geographic ranges and have effective population sizes (Ne) much smaller than the Ne's of the globally distributed species of the melanogaster group (![]()
![]()
![]()
![]()
![]()
It is therefore of interest to compare the patterns of ADH replacement rate variation among the various Drosophilid species groups. ![]()
|
| DISCUSSION |
|---|
Statistical consistency of parameter estimates under the unconstrained-rates and wavelet/Fourier models. The "big bang" model:
Imagine a hypothetical data set with an infinite number of homologous DNA sequences diverged from a common ancestor according to a star-like tree under the Jukes-Cantor model (![]()
We conjecture that under the unconstrained-rates model combined with an arbitrary bifurcating tree, the relative rate parameter estimation has essentially the same properties as under the big bang model, except that the addition of new sequences leads to an increase in the number of the branch length parameters and is followed by a change in tree shape. Namely, an increase in the number of sequences to infinity should be associated with a reduction of the variance of relative rate parameters to an arbitrarily small value, while an increase in the total number of sites in the alignment should be followed by a reduction of the variance of the branch length estimates. According to this conjecture, one can decrease arbitrarily the variances of estimates of all model parameters by simultaneously increasing the number of sites and the number of sequences in the data set.
Strange shape of the relative rate confidence intervals under the unrestricted-rates model and the geometry of the space of the admissible parameter values:
The confidence intervals for relative substitution rates computed w




) Interior nodes corresponding to amino acids (j1, j2, and j3) in the ith site of unknown ancestral sequences. The tji's specify the expected length of the jth branch at the ith site of the protein alignment. Pij(t) in 














