Microarray technology has produced massive expression data that are invaluable for investigating the genomewide evolutionary pattern of gene expression. To this end, phylogenetic expression analysis is highly desirable. On the basis of the Brownian process, we developed a statistical framework (called the E0 model), assuming the independent expression of evolution between lineages. Several evolutionary mechanisms are integrated to characterize the pattern of expression diversity after gene duplications, including gradual drift and dramatic shift (punctuated equilibrium). When the phylogeny of a gene family is given, we show that the likelihood function follows a multivariate normal distribution; the variance-covariance matrix is determined by the phylogenetic topology and evolutionary parameters. Maximum-likelihood methods for multiple microarray experiments are developed, and likelihood-ratio tests are designed for testing the evolutionary pattern of gene expression. To reconstruct the evolutionary trace of expression diversity after gene (or genome) duplications, we developed a Bayesian-based method and use the posterior mean as predictors. Potential applications in evolutionary genomics are discussed.
DNA microarray technology can simultaneously monitor the expression levels of thousands of genes across many experimental conditions or treatments (Brown and Botstein 1999), providing us with unique opportunities to investigate the evolutionary pattern of gene regulation (e.g., Wagner 2000; Guet al. 2002,Guet al. 2002; Enardet al. 2002; Gu and Gu 2003; Cacereset al. 2003; Rifkinet al. 2003). To this end, an appropriate statistical framework is highly desirable because current bioinformatic tools for microarray data analysis such as clustering (e.g., Eisenet al. 1998) are not sufficient for studying the evolution of gene expression. Besides, the high-level background noise could make many data-driven approaches statistically unreliable (Kerr and Churchill 2001).
A conventional approach to tracing the evolutionary change is to classify each (continuous) expression measure into a binary state (expressed or not expressed; Ferkowiczet al. 1998; Forceet al. 1999; Vialeet al. 2000; Prince and Pickett 2002; Wang and Gu 2002; Gu and Gu 2003). In spite of its simplicity, the appropriate cut-off for classification depends on data type and experimental design, which may not be obvious in many cases. For microarray data, in particular, assessing statistical significance of gene expression or expression change is still challenging, due to the huge number of genes and very few replicates (Kerr and Churchill 2001; Quackenbush 2001).
Here we focus on the evolution of gene expression after duplications. When the phylogenetic tree of a gene family can be inferred by the sequence data, the pattern of expression profiles among member genes can be modeled as a stochastic process driven by underlying evolutionary mechanisms. This approach has several advantages: (i) Statistical methods such as the likelihoodratio test can be applied for exploring the evolutionary pattern of gene expression; (ii) evolutionary tracing of expression changes can be predicted by the Bayesian method; (iii) the statistical model can be utilized to study the expression-motif association; and (iv) it may provide a novel approach for phylogenetic reconstruction beyond sequence data.
The goal of this article is to develop a statistical framework for studying the evolution of gene expression after duplications. Since gene expression data can be viewed as continuous characters, the concept of comparative methods for phenotypic evolution is helpful to establish a bridge between massive microarray data and phylogenetic analysis. Starting from a simple random-walk model (Edwards and Cavalli-Sforza 1964; Lynch and Hill 1986), several evolutionary mechanisms are then introduced to characterize the pattern of expression diversity after gene duplication. Since modeling these mechanisms may result in overparameterization, we solve this problem by using the hierarchical modeling approach to derive the likelihood function of gene expressions when the phylogeny is known. The effect of experimental correlation between microarrays is also considered. Moreover, we propose an “empirical” Bayesian framework to infer the ancestral expression levels at internal nodes, providing a useful tool to reconstruct the evolutionary trace of gene expressions. These newly developed methods are applied for the yeast microarray data.
MODELS AND METHODS
Basic Brownian model (B model): In microarray data, the expression level X of a gene is usually measured by the log-transformed signal intensity, after normalization and bias correction. For the (two-way) cDNA microarray, X measures the relative mRNA abundance to a prespecified condition (control), while for the Affymatrix array, X is a good predictor for the absolute mRNA abundance (Quackenbush 2001).
Given the initial expression level x0, the Brownian model demonstrates that the expression level X = x after t evolutionary time units follows a normal distribution with mean x0 and variance σ2t; the density is given by (1) The Brownian model has been used to study the character evolution. Under the mutation-drift model, σ2 can be interpreted as the mutational variance (Lynch and Hill 1986; Felsenstein 1988).
Microarray data usually include multiple measurements under various development/tissue stages or experimental treatments. These measurements can be considered as samples from the same stochastic process, i.e., evolutionary replicates (see below), but the initial expression level differs. We assume that the expression level x0 at the root O (the common ancestor of the gene family) varies among microarray experiments according to a normal distribution N(x0; μ, ρ2), with the following density: (2)
Gene family with two-member genes: We start from a simple two-member gene family (Figure 1a). Let x1 and x2 be the expression levels of two member genes, respectively, and P(x1, x2) be the joint density. Given the initial value (x0) of gene expression at the root O, the change of x1 follows a Brownian process B(x1|x0; σ2t), and the change of x2 follows B(x2|x0; σ2t). If the evolution of gene expression is independent between lineages, referred to as the E0 assumption, we have P(x1, x2|x0) = B(x1|x0; σ1t) B(x2|x0; σ2t). From Equations 1 and 2 one can show the joint density of x1 and x2 is given by (3) that is, x1 and x2 follow a bivariate normal distribution, with the mean vector μ = (μ, μ) and the variance-covariance matrix (4) Under this model, the covariance between x1 and x2 equals the ancestral variance (ρ2) of the root O, indicating that the expression similarity between gene 1 and gene 2 reflects their evolutionary relatedness.
Gene family with three members: For a given three-member gene family (Figure 1b), the joint density of expressions x1, x2, and x3 is derived as follows. Denote the expression level at the ancestral node A by x4. Let T and t be the evolutionary times of nodes O (the root) and A, respectively. Thus, given the initial value (x0) at O, the change of x4 follows B(x|x0; σ2 (T – t)) and the change of x3 follows B(x3|x0; σ2T). Similarly, given the ancestral level x4, the changes of x1 and x2 follow B(x1|x4;σ2t) and B(x2|x4; σ2t), respectively. According to the Markov property, we obtain the joint density P(x1, x2, x3, x4|x0) = B(x3|x0, σ2T) B(x1|x4; σ2t) B(x2|x4; σ2t) B(x4|x0; σ2 (T – t)). Since the ancestral expression x4 is unobservable, we have . After some math, one can show that the joint density (5) is a three-variate normal density N(x1, x2, x3; μ, VB), with the mean vector μ = (μ, μ, μ)′, and the variance-covariance matrix (6) The proof can be outlined as follows. Under the Brownian model, P(x1, x2, x3|x0) is obviously normally distributed. Hence the normality of P(x1, x2, x3) is guaranteed when π(x0) is a normal density. Using the technique of conditional expectations, one can derive μ and VB easily.
Lineage-specific expression diversity (L model): The B model assumes that the expression divergence of a gene family is driven mainly by small and additive genetic drifts (random effects), with a constant rate measured by σ2, or by mutational variance under the driftmutation model (Lynch and Hill 1986). In this sense, the B model can be considered as the “neutral-evolution” model of gene expression or the “expression clock.”
A more general model is the L model, under which the evolutionary rate σ2 of expression diversity is lineage specific. Let be the variance component in the kth branch (k = 1,..., 4) in the three-gene phylogeny (Figure 1c) and tk be the corresponding evolutionary time. Similar to the B model, we have shown that the joint density of x1, x2, and x3, (7) is a three-variate normal density N(x1, x2, x3; μ, VL), with the mean vector μ = (μ, μ, μ)′, and the variance-covariance matrix (8) Obviously, when for all k = 1,..., 4, the L model is reduced to the B model.
Directional trend of expression diversity (D model): The lineage-specific (L) Brownian model can be further extended to the Brownian model with directional trend (the D model). That is, given the initial value x0 at t = 0, the stochastic process of gene expression change after t time units is described as B(x|x0; λt, σ2t), i.e., a normal distribution with mean x0 +λt and variance σ2t; λ is called the trend coefficient or the coefficient of directional selection (Felsenstein 1988). Apparently, the L model is the special case of λ= 0. Since the trend coefficient (λ) of a gene varies among microarray samples, we treat λ as a random variable that follows a normal distribution . The “marginal” Brownian motion is therefore given by (9) i.e., it is a normal distribution with mean and variance σ2t +ω2t2. Note that the variance under the D model is a quadratic function of time t.
The joint density of gene expressions under the D model can be derived by adding lineage-specific directional trends to the L model. For a three-member gene family, Equation 7 can be modified as (10) where and are the mean and variance of directional trends in the kth branch, respectively (Figure 1c). Similar to the derivation of the L model, one can show that P(x1, x2, x3) is a three-variate normal density N(x1, x2, x3; μ, V), with the mean vector μ = (μ1, μ2, μ3)′, where , , and . The variance-covariance matrix VD = VL + WD, where VL is given by Equation 8, and the matrix of directional trend WD is given by (11)
Dramatic shift of expression diversity (S model): The D model we developed above assumes that the change of gene expression is continuous with time t (the gradual evolution). However, a dramatic shift (positively or negatively) in gene expression may happen shortly after gene duplication and then remain little changed, i.e., the process of “punctuated equilibrium” (Hansen and Martins 1996). The S model assumes that after gene duplication the expression level has shifted z and z′ units in two lineages, respectively (Figure 2). It is further assumed that two shift variables (z and z′), which are independent of each other, vary among microarray experiments according to normal distributions N(z; δ, s2) and N(z′; δ′, s′2), respectively.
For a three-gene family, the joint density of x1, x2, and x3 under the S model is derived as follows. After the first gene duplication, the initial expression level (x0) at the root O is immediately shifted to x0 + zO in one lineage and to in another lineage (Figure 2). Hence, the change of x4 follows a Brownian process , and the change of x3 follows . After the second gene duplication, the expression level (x4) at node A is immediately shifted to x4 + zA in one lineage and to in another lineage (Figure 2). In the same manner, we have for the change of x1 and for the change of x2. For simplicity, we assume no gradual drift in each branch. Then, the joint density conditional of x0, zO, , zA, can be written as follows: (12) Apparently, the L model (Equation 7) is a special case when and . The next step is to integrate out all shift variables zA, , zO, and . Letting φ(z) be the normal density for any shift variable z, we have (13) And finally, the initial expression at the root (x0) is integrated out according to Equation 2; i.e., (14) We have shown that P(x1, x2, x3) follows a multivariate normal distribution. The mean vector μ = (μ1, μ2, μ3)′ is given by μ1 = μ + δA + δO, , and ; and the variance-covariance matrix VS = VL + WS, where VL is given by Equation 8 and the shift matrix WS is given by (15) It has been shown that if the gradual drift (D) model is considered, one can show that the variance-covariance matrix is given by VS = VL + WD + WS, where WD is given by Equation 11.
The general likelihood function under the E0 model: The joint density for a three-member gene family can be extended to any n-member gene family when a rooted phylogeny is given. We have shown that the joint density of x = (x1,..., xn) follows a multivariate normal distribution, N(x1,..., xn; μ, V); with the mean vector μ and the variance-covariance matrix V, the density is given by (16) This result can be proved by using the principle of mathematical induction: Given that Equation 16 holds for n = 3, we have shown that if Equation 16 holds for n = k, it must be true for n = k + 1. A complete proof under any phylogenetic tree is given by Z. Zhang and X. Gu (unpublished results). We call it the E0 model because it assumes independent evolution between lineages. The main results are summarized as follows.
The B model: The mean vector μ = (μ,..., μ)′, and the variance-covariance matrix VB is given by VB =ρ2J +2TU, where any ijth element of matrix J is 1, and T is the age of the common ancestor (the root O). The treetopology-related matrix U is defined as Uii = 1 and Uij = 1 – aij (i ≠ j), where aij = tij/T; tij is the age of the common ancestor of genes i and j. More explicitly, the ijth element of VB is given by (17)
The L model: Since the L model allows specific for each branch k, one cannot separate the topology matrix U from other components. Instead, the ijth element of VL is (18) where tk is the evolutionary time of branch k, the subscript notation k ∈ xi runs over all branches in the lineage from the root O to gene xi, and k ∈ (xi, xj) runs over all branches shared by xi and xj since the root O (Figure 3).
The D model: It has been shown that μi, the mean of each xi, is equal to μ plus the sum of mean drifts in branches from the root O to gene xi; i.e., (19) The variance-covariance matrix can be written as VD = VL + WD, where VL is given by Equation 18. The ijth element of matrix WD is given by and (Figure 3). Therefore, VD can be expressed as (20) which is a quadratic function of evolutionary time of each branch.
The S model: Under the general S model, the mean of each xi is the sum of the mean gradual drifts (over branches) and the mean shifts (over the ancestral nodes) from root O to xi; that is, (21) where m ∈ xi runs over all ancestral nodes between the root O and xi (root O included but xi not included). The variance-covariance matrix VS can be expressed as (22) where for each ancestral node m, Sm = sm or for two follow-up branches, respectively. Let Aij be the most recent common ancestral node of xi and xj. Thus, the notation m ∈ (xi, xj) runs over all ancestral nodes between the root O and the node Aij (root O included but Aij not included).
Ancestral gene expression inference: Bayesian approach: Ancestral state reconstruction along a phylogenetic tree is at the center of comparative methods in evolutionary biology, for both morphological and molecular characters (Harvey and Pagel 1991; Yanget al. 1995; Schluteret al. 1997; Golding and Dean 1998). The massive microarray data make it possible to reconstruct an ancestral expression pattern that is useful to trace the evolutionary changes of gene regulation.
Single-node ancestral inference: This method provides a fast Bayesian procedure to infer ancestral expression profile because each time it deals with one ancestral node and then runs over the tree. Let x = (x1,..., xn) be the observed expression pattern and y be the expression level at the ancestral node of interest (Figure 2). According to Bayes' rule, the posterior density P(y|x1,..., xn) is computed as follows: (23) From Equation 16, we know P(x1,..., xn)isan n-variate normal density. To derive the numerator of Equation 23, we use a three-gene family for illustration, where y = x4. Note that P(x1, x2, x3, x4|x0) = B(x3|x0)B(x1|x4) B(x2|x4)B(x4), according to the Markov property. Since , similar to the derivation of P(x1, x2, x3), we show that P(x1, x2, x3, x4) is a 4-variate normal density.
In the general case, let M = n + 1 and regard the ancestral level y as an additional variable xn+1. It has been shown that P(x1,..., xn, y) is an (n + 1)-variate normal density, denoted by N(x1,..., xn, y; μ, VM). The extended variance-covariance matrix VM has the following structure: If 1 ≤ i, j ≤ n, the ijth element of VM is equal to that of V, e.g., Equation 22 under the general S model. For any i, n + 1th element, i = 1,..., n + 1, it is given by (24) and Vn+1,i = Vi,n+1, where the subscript notation k ∈ y runs over all branches in the lineage from the root O to the ancestral node y, and k ∈ (xi, y) runs over all branches shared by xi and y since the root O (Figure 3). For simplicity, we assume μ = (μ,..., μ)′.
Hence, it becomes obvious that the posterior density P(y|x1,..., xn) is a normal density. Let ; cij is the ijth element of C. After some algebra we obtain (25) where is the (posterior) variance of y. That is, the posterior mean of y conditional of x = (x1,..., xn)′ is given by (26) where βi = –ci,n+1/cn+1,n+1 and . Apparently, the posterior mean prediction for the ancestral gene expression is a linear function of current gene expressions.
Joint ancestral inference: To explore the joint evolutionary pattern of expression changes after gene duplications, the single-node method may not be sufficient. Therefore we develop an approach for joint ancestral expression inference. For a gene family with n member genes, there are m ancestral nodes when the phylogenetic tree is given. Let x = (x1,..., xn)′ and y = (y1,..., ym)′ be the vectors of current and ancestral expression levels, respectively, and M = n + m. The (extended) M × M variance-covariance matrix for (y′, x′) is denoted by VM. We have shown that P(y, x) is an M-dimensional multinormal density. It follows that the joint posterior density of ancestral nodes y, (27) is also m × m multinormal, that is, P(y|x) = N(y; μy|x, Σy|x), where μy|x = (μy1|x,..., μym|x)′ is the posterior mean vector of the ancestral nodes, and Σy|x is the m × m posterior variance-covariance matrix of y1,..., ym.
To obtain useful analytical results for numerical calculation, we partition the matrix VM as (28) where H and A are m × n and m × m matrices, respectively. The matrix H is the ancestral-current expression covariances and A is the variance-covariance matrix among ancestral nodes. Thus, the inverse of the matrix VM can be written as (29) where Λxx, Λxy, and Λyy are n × n, m × n, and m × m matrices, respectively. It has been shown that (30)
Implementation and data analysis: Multiple microarray experiments: The microarray data collection for evolutionary analysis can be outlined as follows: (1) Multiple (N) microarray data sets are downloaded from the Stanford microarray database and (2) a relational database is established to extract the expression profiles of any given gene family. A typical data set for a three-gene family is where the kth column represents the expression profile across the gene family in the kth microarray experiment, and the ith row represents the expression profile of gene i across the N microarray experiments. For our interest, we view multiple microarrays as evolutionary replicates, on the basis of the following considerations.
For a given gene family, the whole set of cis- and trans-regulatory elements is diversified following gene duplications. Ideally, these r regulatory elements can be represented as a binary string denoted by h = h1,..., hr. Each duplicate gene has a unique representation of h. These “aligned” regulatory elements, similar to aligned nucleotide sites, can be viewed as the evolutionary replicates from a stochastic process. Microarray samples of cells/tissues at various developmental stages or under experimental treatments show characteristic expression profiles that reflect the differences in h among member genes. Thus, the expression profile of gene i at the kth microarray data can be conceptually written as xi,k =ψk(h), although ψ is yet little known. In this regard, multiple microarray experiments are also evolutionary replicates but the sampling property needs to be addressed carefully.
Likelihood under the i.i.d. assumption: The likelihood function can be easily built under the i.i.d. assumption that microarray experiments are independently, identically distributed, which has been used in many previous evolutionary studies (e.g., Wagner 2000; Guet al. 2002,Guet al. 2002). For N microarray experiments, let xk = (x1,k,..., xn,k) be the expression pattern of an n-member gene family at the kth experiment. When the phylogeny is given, the likelihood for gene expressions can be written as (31) The maximum-likelihood estimates of the parameters can be obtained by the Newton-Raphson iteration method, and the sampling variance of each estimate is approximated by the inverse of the information matrix.
Expression branch lengths: The number of unknown parameters under the E0 model could be larger than the degrees of freedom, which makes the model statistically infeasible. One useful solution is to define the expression branch length Ek for any branch k along the phylogeny, that is, (32) where the subscript k under S2k is for the initial node of the kth branch. Then, the variance-covariance matrix under the general E0 model (Equations 16 and 22) can be simplified as (33) Hence, the statistical framework with parameters (ρ2, Ek, k = 1, 2,...) becomes a general model for constructing useful likelihood-ratio tests. We have implemented our methods using the statistical software package S-plus.
Likelihood under experimental correlations: Hundreds of microarray experiments, say, for the yeast, include timecourse experiments, tissue/developmental stages, cell cycles, stress-response experiments, different environmental conditions, mutants (gene deletions), etc. (Eisenet al. 1998). The overall expression profiles with similar types of conditions or treatments are more similar to each other; e.g., two adjacent sampling points in a time-course assay are usually highly correlated. Because of these experimental correlations, the i.i.d. assumption seems to be unrealistic. That is, the sample of expression profiles of a gene family is not only phylogenetically but also experimentally dependent.
As the first-order approximation, we model the experimental correlation of the microarray data as the overall correlations among microarray experiments. Let D be the N × N matrix of experimental correlations; that is, the diagonal element is 1, while the off-diagonal element is the coefficient of correlation between any two microarray experiments. For the microarray chip that includes in total C genes (e.g., C = 5500 for the yeast genome), D can be estimated by the standard approach over all C genes. Then one can show that the expression profiles of the gene family X = (x1,..., xN)′ follow a multivariate normal distribution, with the large [n × N] × [n × N] variance-covariance matrix V ⊗ D. The likelihood function is therefore given by (34) Apparently, Equation 34 is reduced to Equation 31 when D = I. When the dimension n × N is not large, a numerical algorithm based on the matrix decomposition is developed. Here we show a simple example.
Wolfe and Shields (1997) have identified 452 duplicate genes that were from the yeast genome duplication. Using Chu et al.'s (1998) cDNA microarray data, sevenpoint time-course experiments during the sporulation, we calculated the experimental correlation matrix D (Chuet al. 1998). Obviously, two adjacent time points usually show a high correlation, while it is low between distant points. The variance and covariance elements of V for a duplicate pair are given by V11 =ρ2 + E1, V22 =ρ2 + E2, and V12 =ρ2. These unknown parameters (ρ2, E1, and E2) are then estimated using Equations 31 and 34, respectively. We use the estimated coefficient of correlation to compare their performance. All 452 pairs are plotted in Figure 4, which shows that for the dense-sampled time course, the effect of experimental correlation is nontrivial.
Yeast glutamyl- and glutaminyl-tRNA synthetases gene family: An example: The glutamyl- and glutaminyl-tRNA synthetases (GlnS) family has three member genes (YGL245w, YOR168w, and YOL033w). Phylogenetic analysis has shown that YGL245w and YOR168w are more closely related (Figure 5). The cell-cycle yeast microarray data are used (Eisenet al. 1998).
Under the general E0 model, the likelihood under the i.i.d. assumption results in the maximum-likelihood (ML) estimates of , , Ê2 = 0.062 ± 0.020, Ê3 = 0.099 ± 0.031, and Ê4 = 0.079 ± 0.020. The maximum log-likelihood value is –146.19. Next we consider the likelihood function considering the experimental correlations. We first compute the matrix D of microarray experiments. Using the ML estimates under i.i.d. as initial values, we obtain , , Ê1 = 0.112, Ê2 = 0.068, Ê3 = 0.104, and Ê4 = 0.061. It seems that the likelihood under the i.i.d. assumption is useful for fast and large-scale analysis.
Using the molecular clock approach, we have approximately dated the relative time of first gene duplication (between YGL245w/YOR168w and YOL033w) as 2.2 (to the Escherichia coli/yeast split time), and the second one as 1.27. Thus, under the B model, we have obtained the ML estimates (under i.i.d.) of , , and . The maximum log-likelihood value under the B model is –154.13. Apparently, the likelihood-ratio test shows that the B model, or the “expression clock,” is rejected at the significance level of 0.001; .
As shown in Figure 6, the expression profile of the ancestral ancestor of YGL245w-YOR168w has been inferred by the Bayesian method. Therefore, one can infer lineage-specific changes after gene duplication (the derived characters) from the ancestral expression pattern. For instance, changes of expression level in each lineage after gene duplication will be very informative when other genome data (e.g., putative regulatory motifs) are available.
Yeast gene families: We have conducted a large data analysis for yeast gene families to understand the evolution of expression after gene duplications. Amino acid sequences of yeast gene families, as well as homologous genes from nearly 30 complete genomes, were downloaded from the COG database (http://www4.ncbi.nlm.nih.gov/COG/). The phylogenetic tree of each gene family is inferred by the neighbor-joining method (Saitou and Nei 1987). Furthermore, the duplication time relative to yeast/E. coli split can be (approximately) estimated by using the molecular clock approach under the inferred tree, similar to Gu et al. (2002,2002). In total 276 yeast microarray data are collected for each gene family; the likelihood of Equation 31 is used for the analysis.
For a three-member gene family when the root can be reliably determined, the likelihood-ratio test (LRT) is used to test the expression clock hypothesis (Figure 7). The null hypothesis is E1 = E2. The log of likelihood ratio approximately follows a χ2 with d.f. = 1 so that one can determine the significance level. We conducted the LRT for 60 three-member yeast gene families; for them the relative duplication times estimated from sequence data are treated as known. Overall, 42 (70%) gene families showed that the null hypothesis is rejected at the 0.05 significance level and 39 (65%) gene families at the 0.01 significance level. Therefore, one may conclude that, after gene duplication, an unequal rate for expression divergence in each duplicate gene is a common pattern. Examples in Figure 8 show more dramatic expression changes than sequence substitutions after gene duplications.
In this article we have developed a statistical framework to explore the expression divergence during the gene family evolution. Several data analyses have shown the potential for studying many interesting problems in evolutionary genomics, e.g., the evolution of gene expression specificity, the evolutionary fate of duplicate genes, ancestral expression inference, and the coevolution between expression and regulatory motif or coding sequence. Our method is also useful for understanding the population-quantitative genetic basis of expression evolution after gene duplication.
Comparative methods: Statistical methods for the analysis of continuous morphological data have been studied for decades to deal with phylogenetic-dependent sampling (Harvey and Pagel 1991; Maddison 1994). These methods can be roughly classified into phylogenetically independent contrasts (PIC; Felsenstein 1988), the phylogenetic autocorrelation (PA) method (Cheverudet al. 1985), and phylogenetic generalized least-squares (PGLS; Grafen 1989), as well as many variants; see Rohlf (2001) for a recent detailed review and critiques. Here we briefly discuss the connection between the newly developed method and these existing methods.
We have noted that all these methods share a similar theoretical ground. In fact, PIC, PA, and PGLS methods require the known “expected variance-covariance matrix” V for incorporating the phylogenetic structure. Except for some early studies, the popular assumption is V =σ2C, where the matrix C is determined by the phylogeny and timescales. Essentially, it is equivalent to the B model. Schluter et al. (1997) introduced a simple prior for σ2 to relax the unrealistic assumption. There has been long-term controversy about which one should be used and exactly how it can be applied (Pagel 1993; Rohlf 2001). Rohlf (2001) did not consider any complex form of V, because it is more arbitrary in implementation. The central theme of our study is to model the expected variance-covariance matrix V for exploring the evolutionary pattern of expression divergence, which can be estimated under appropriate statistical procedures. Since the likelihood-ratio test clearly shows a strong lineage-specific mode of expression evolution, the conventional PIC, PA, and PGLS methods that assume a constant Brownian motion may be oversimplified for microarray data analysis.
Inference of the ancestral expression pattern provides an efficient approach to reconstruct the evolutionary trace of expression diversity after gene (or genome) duplications. The so-called maximum-likelihood method for inferring the ancestral state of continuous characters is to minimize the (weighted) sum of differences over all branches (e.g., Schluteret al. 1997). For the three-gene tree (Figure 1b), it is given by (35) Obviously, the estimation requires known evolutionary times t and T. Since the assumption of the constant Brownian model does not hold in general for the microarray data, the ancestral inference in Equation 35 could be highly biased. Our empirical Bayesian approach provides in general more accurate inference for the ancestral expression pattern because the lineage-specific effect is well taken into consideration. Finally, we point out that the Bayesian method of Schluter et al. (1997) is to assume uniform priors for the mean vector and the log of σ2, respectively, which did not specifically address the lineage-specific effect.
Models extended for expression data: In the future, we will study how to implement the methodology of PIC, PA, or PGLS for the microarray data. For instance, the PGLS regression model for gene family expression may improve the efficiency of the expression-motif search (e.g., Bussemakeret al. 2001; Blanchette and Tompa 2002). This is because the regulatory processing of a gene family can be traced back to a single ancestral regulatory modular so that the heterogeneity problem can be partially avoided. For a given microarray experiment, the expression profile of a gene family x = (x1,..., xn)′ can be written as x = Hb + ϵ, where the matrix H represents explanatory variables, e.g., the cis-regulatory motif structure (predicted or known, number of copies, locations, etc.), and some trans-elements that may influence the expression level. The variance-covariance matrix of the error term ϵ, denoted by R, is given by (36) where V is given by Equation 32; ρ2 should be removed since the root state is treated as a fixed value when a single data set is considered.
Similarly, the PA method can be written as x =αSx + ϵ′, where α is the autocorrelation coefficient, S is an n × n (tree-dependent) connection matrix (each row of which sums up to one), and ϵ′ is error terms (Cheverudet al. 1985). Instead of the arbitrary construction of S, we suggest estimating S from R in Equation 36 with (row) normalization. Moreover, for a gene family, it would be very interesting to detect any functional correlation between two microarray experiments. We may modify the PIC method (Felsenstein 1988) to correct not only phylogenetic dependence but also experimental correlation as well as the lineage-specific effect.
Experimental errors and others: We have recognized that the current statistical framework involves several assumptions that need to be examined carefully. The first one is the normal assumption. By default, we assume that fold change of gene expression follows a Brownian-based process during evolution. Further investigation is needed to test the robustness of normal assumption when other data normalization procedures are adopted (Quackenbush 2001). Second, we should consider the sources of experimental errors. Let be the variance of experimental errors. Then the variance-covariance matrix V of Equation 33 should be modified as (37) We study the performance of our method when the “background noise” is strong (Kerr and Churchill 2001). For instance, the LRT tests may be liberal so a more stringent significance level is suggested. Since the variance of experimental errors does not depend on the tree topology, it can be estimated when replicates for a single microarray experiment become the standard protocol.
The author thanks Zhongqi Zhang for assisting in numerical analysis. This work was supported by a National Institutes of Health grant.
Communicating editor: Y.-X. Fu
- Received July 14, 2003.
- Accepted January 22, 2004.
- Copyright © 2004 by the Genetics Society of America