Abstract
Microarray technology has produced massive expression data that are invaluable for investigating the genomewide evolutionary pattern of gene expression. To this end, phylogenetic expression analysis is highly desirable. On the basis of the Brownian process, we developed a statistical framework (called the E_{0} model), assuming the independent expression of evolution between lineages. Several evolutionary mechanisms are integrated to characterize the pattern of expression diversity after gene duplications, including gradual drift and dramatic shift (punctuated equilibrium). When the phylogeny of a gene family is given, we show that the likelihood function follows a multivariate normal distribution; the variancecovariance matrix is determined by the phylogenetic topology and evolutionary parameters. Maximumlikelihood methods for multiple microarray experiments are developed, and likelihoodratio tests are designed for testing the evolutionary pattern of gene expression. To reconstruct the evolutionary trace of expression diversity after gene (or genome) duplications, we developed a Bayesianbased method and use the posterior mean as predictors. Potential applications in evolutionary genomics are discussed.
DNA microarray technology can simultaneously monitor the expression levels of thousands of genes across many experimental conditions or treatments (Brown and Botstein 1999), providing us with unique opportunities to investigate the evolutionary pattern of gene regulation (e.g., Wagner 2000; Guet al. 2002,Guet al. 2002; Enardet al. 2002; Gu and Gu 2003; Cacereset al. 2003; Rifkinet al. 2003). To this end, an appropriate statistical framework is highly desirable because current bioinformatic tools for microarray data analysis such as clustering (e.g., Eisenet al. 1998) are not sufficient for studying the evolution of gene expression. Besides, the highlevel background noise could make many datadriven approaches statistically unreliable (Kerr and Churchill 2001).
A conventional approach to tracing the evolutionary change is to classify each (continuous) expression measure into a binary state (expressed or not expressed; Ferkowiczet al. 1998; Forceet al. 1999; Vialeet al. 2000; Prince and Pickett 2002; Wang and Gu 2002; Gu and Gu 2003). In spite of its simplicity, the appropriate cutoff for classification depends on data type and experimental design, which may not be obvious in many cases. For microarray data, in particular, assessing statistical significance of gene expression or expression change is still challenging, due to the huge number of genes and very few replicates (Kerr and Churchill 2001; Quackenbush 2001).
Here we focus on the evolution of gene expression after duplications. When the phylogenetic tree of a gene family can be inferred by the sequence data, the pattern of expression profiles among member genes can be modeled as a stochastic process driven by underlying evolutionary mechanisms. This approach has several advantages: (i) Statistical methods such as the likelihoodratio test can be applied for exploring the evolutionary pattern of gene expression; (ii) evolutionary tracing of expression changes can be predicted by the Bayesian method; (iii) the statistical model can be utilized to study the expressionmotif association; and (iv) it may provide a novel approach for phylogenetic reconstruction beyond sequence data.
The goal of this article is to develop a statistical framework for studying the evolution of gene expression after duplications. Since gene expression data can be viewed as continuous characters, the concept of comparative methods for phenotypic evolution is helpful to establish a bridge between massive microarray data and phylogenetic analysis. Starting from a simple randomwalk model (Edwards and CavalliSforza 1964; Lynch and Hill 1986), several evolutionary mechanisms are then introduced to characterize the pattern of expression diversity after gene duplication. Since modeling these mechanisms may result in overparameterization, we solve this problem by using the hierarchical modeling approach to derive the likelihood function of gene expressions when the phylogeny is known. The effect of experimental correlation between microarrays is also considered. Moreover, we propose an “empirical” Bayesian framework to infer the ancestral expression levels at internal nodes, providing a useful tool to reconstruct the evolutionary trace of gene expressions. These newly developed methods are applied for the yeast microarray data.
MODELS AND METHODS
Basic Brownian model (B model): In microarray data, the expression level X of a gene is usually measured by the logtransformed signal intensity, after normalization and bias correction. For the (twoway) cDNA microarray, X measures the relative mRNA abundance to a prespecified condition (control), while for the Affymatrix array, X is a good predictor for the absolute mRNA abundance (Quackenbush 2001).
Given the initial expression level x_{0}, the Brownian model demonstrates that the expression level X = x after t evolutionary time units follows a normal distribution with mean x_{0} and variance σ^{2}t; the density is given by
Microarray data usually include multiple measurements under various development/tissue stages or experimental treatments. These measurements can be considered as samples from the same stochastic process, i.e., evolutionary replicates (see below), but the initial expression level differs. We assume that the expression level x_{0} at the root O (the common ancestor of the gene family) varies among microarray experiments according to a normal distribution N(x_{0}; μ, ρ^{2}), with the following density:
Gene family with twomember genes: We start from a simple twomember gene family (Figure 1a). Let x_{1} and x_{2} be the expression levels of two member genes, respectively, and P(x_{1}, x_{2}) be the joint density. Given the initial value (x_{0}) of gene expression at the root O, the change of x_{1} follows a Brownian process B(x_{1}x_{0}; σ^{2}t), and the change of x_{2} follows B(x_{2}x_{0}; σ^{2}t). If the evolution of gene expression is independent between lineages, referred to as the E_{0} assumption, we have P(x_{1}, x_{2}x_{0}) = B(x_{1}x_{0}; σ^{1}t) B(x_{2}x_{0}; σ^{2}t). From Equations 1 and 2 one can show the joint density of x_{1} and x_{2} is given by
Gene family with three members: For a given threemember gene family (Figure 1b), the joint density of expressions x_{1}, x_{2}, and x_{3} is derived as follows. Denote the expression level at the ancestral node A by x_{4}. Let T and t be the evolutionary times of nodes O (the root) and A, respectively. Thus, given the initial value (x_{0}) at O, the change of x_{4} follows B(xx_{0}; σ^{2} (T – t)) and the change of x_{3} follows B(x_{3}x_{0}; σ^{2}T). Similarly, given the ancestral level x_{4}, the changes of x_{1} and x_{2} follow B(x_{1}x_{4};σ^{2}t) and B(x_{2}x_{4}; σ^{2}t), respectively. According to the Markov property, we obtain the joint density P(x_{1}, x_{2}, x_{3}, x_{4}x_{0}) = B(x_{3}x_{0}, σ^{2}T) B(x_{1}x_{4}; σ^{2}t) B(x_{2}x_{4}; σ^{2}t) B(x_{4}x_{0}; σ^{2} (T – t)). Since the ancestral expression x_{4} is unobservable, we have
Lineagespecific expression diversity (L model): The B model assumes that the expression divergence of a gene family is driven mainly by small and additive genetic drifts (random effects), with a constant rate measured by σ^{2}, or by mutational variance under the driftmutation model (Lynch and Hill 1986). In this sense, the B model can be considered as the “neutralevolution” model of gene expression or the “expression clock.”
A more general model is the L model, under which the evolutionary rate σ^{2} of expression diversity is lineage specific. Let
Directional trend of expression diversity (D model): The lineagespecific (L) Brownian model can be further extended to the Brownian model with directional trend (the D model). That is, given the initial value x_{0} at t = 0, the stochastic process of gene expression change after t time units is described as B(xx_{0}; λt, σ^{2}t), i.e., a normal distribution with mean x_{0} +λt and variance σ^{2}t; λ is called the trend coefficient or the coefficient of directional selection (Felsenstein 1988). Apparently, the L model is the special case of λ= 0. Since the trend coefficient (λ) of a gene varies among microarray samples, we treat λ as a random variable that follows a normal distribution
The joint density of gene expressions under the D model can be derived by adding lineagespecific directional trends to the L model. For a threemember gene family, Equation 7 can be modified as
Dramatic shift of expression diversity (S model): The D model we developed above assumes that the change of gene expression is continuous with time t (the gradual evolution). However, a dramatic shift (positively or negatively) in gene expression may happen shortly after gene duplication and then remain little changed, i.e., the process of “punctuated equilibrium” (Hansen and Martins 1996). The S model assumes that after gene duplication the expression level has shifted z and z′ units in two lineages, respectively (Figure 2). It is further assumed that two shift variables (z and z′), which are independent of each other, vary among microarray experiments according to normal distributions N(z; δ, s^{2}) and N(z′; δ′, s′^{2}), respectively.
For a threegene family, the joint density of x_{1}, x_{2}, and x_{3} under the S model is derived as follows. After the first gene duplication, the initial expression level (x_{0}) at the root O is immediately shifted to x_{0} + z_{O} in one lineage and to
The general likelihood function under the E_{0} model: The joint density for a threemember gene family can be extended to any nmember gene family when a rooted phylogeny is given. We have shown that the joint density of x = (x_{1},..., x_{n}) follows a multivariate normal distribution, N(x_{1},..., x_{n}; μ, V); with the mean vector μ and the variancecovariance matrix V, the density is given by
The B model: The mean vector μ = (μ,..., μ)′, and the variancecovariance matrix V_{B} is given by V_{B} =ρ^{2}J +^{2}TU, where any ijth element of matrix J is 1, and T is the age of the common ancestor (the root O). The treetopologyrelated matrix U is defined as U_{ii} = 1 and U_{ij} = 1 – a_{ij} (i ≠ j), where a_{ij} = t_{ij}/T; t_{ij} is the age of the common ancestor of genes i and j. More explicitly, the ijth element of V_{B} is given by
The L model: Since the L model allows specific
The D model: It has been shown that μ_{i}, the mean of each x_{i}, is equal to μ plus the sum of mean drifts in branches from the root O to gene x_{i}; i.e.,
The S model: Under the general S model, the mean of each x_{i} is the sum of the mean gradual drifts (over branches) and the mean shifts (over the ancestral nodes) from root O to x_{i}; that is,
Ancestral gene expression inference: Bayesian approach: Ancestral state reconstruction along a phylogenetic tree is at the center of comparative methods in evolutionary biology, for both morphological and molecular characters (Harvey and Pagel 1991; Yanget al. 1995; Schluteret al. 1997; Golding and Dean 1998). The massive microarray data make it possible to reconstruct an ancestral expression pattern that is useful to trace the evolutionary changes of gene regulation.
Singlenode ancestral inference: This method provides a fast Bayesian procedure to infer ancestral expression profile because each time it deals with one ancestral node and then runs over the tree. Let x = (x_{1},..., x_{n}) be the observed expression pattern and y be the expression level at the ancestral node of interest (Figure 2). According to Bayes' rule, the posterior density P(yx_{1},..., x_{n}) is computed as follows:
In the general case, let M = n + 1 and regard the ancestral level y as an additional variable x_{n}_{+1}. It has been shown that P(x_{1},..., x_{n}, y) is an (n + 1)variate normal density, denoted by N(x_{1},..., x_{n}, y; μ, V_{M}). The extended variancecovariance matrix V_{M} has the following structure: If 1 ≤ i, j ≤ n, the ijth element of V_{M} is equal to that of V, e.g., Equation 22 under the general S model. For any i, n + 1th element, i = 1,..., n + 1, it is given by
Hence, it becomes obvious that the posterior density P(yx_{1},..., x_{n}) is a normal density. Let
Joint ancestral inference: To explore the joint evolutionary pattern of expression changes after gene duplications, the singlenode method may not be sufficient. Therefore we develop an approach for joint ancestral expression inference. For a gene family with n member genes, there are m ancestral nodes when the phylogenetic tree is given. Let x = (x_{1},..., x_{n})′ and y = (y_{1},..., y_{m})′ be the vectors of current and ancestral expression levels, respectively, and M = n + m. The (extended) M × M variancecovariance matrix for (y′, x′) is denoted by V_{M}. We have shown that P(y, x) is an Mdimensional multinormal density. It follows that the joint posterior density of ancestral nodes y,
To obtain useful analytical results for numerical calculation, we partition the matrix V_{M} as
Implementation and data analysis: Multiple microarray experiments: The microarray data collection for evolutionary analysis can be outlined as follows: (1) Multiple (N) microarray data sets are downloaded from the Stanford microarray database and (2) a relational database is established to extract the expression profiles of any given gene family. A typical data set for a threegene family is
For a given gene family, the whole set of cis and transregulatory elements is diversified following gene duplications. Ideally, these r regulatory elements can be represented as a binary string denoted by h = h_{1},..., h_{r}. Each duplicate gene has a unique representation of h. These “aligned” regulatory elements, similar to aligned nucleotide sites, can be viewed as the evolutionary replicates from a stochastic process. Microarray samples of cells/tissues at various developmental stages or under experimental treatments show characteristic expression profiles that reflect the differences in h among member genes. Thus, the expression profile of gene i at the kth microarray data can be conceptually written as x_{i}_{,}_{k} =ψ_{k}(h), although ψ is yet little known. In this regard, multiple microarray experiments are also evolutionary replicates but the sampling property needs to be addressed carefully.
Likelihood under the i.i.d. assumption: The likelihood function can be easily built under the i.i.d. assumption that microarray experiments are independently, identically distributed, which has been used in many previous evolutionary studies (e.g., Wagner 2000; Guet al. 2002,Guet al. 2002). For N microarray experiments, let x_{k} = (x_{1,}_{k},..., x_{n}_{,}_{k}) be the expression pattern of an nmember gene family at the kth experiment. When the phylogeny is given, the likelihood for gene expressions can be written as
Expression branch lengths: The number of unknown parameters under the E_{0} model could be larger than the degrees of freedom, which makes the model statistically infeasible. One useful solution is to define the expression branch length E_{k} for any branch k along the phylogeny, that is,
Likelihood under experimental correlations: Hundreds of microarray experiments, say, for the yeast, include timecourse experiments, tissue/developmental stages, cell cycles, stressresponse experiments, different environmental conditions, mutants (gene deletions), etc. (Eisenet al. 1998). The overall expression profiles with similar types of conditions or treatments are more similar to each other; e.g., two adjacent sampling points in a timecourse assay are usually highly correlated. Because of these experimental correlations, the i.i.d. assumption seems to be unrealistic. That is, the sample of expression profiles of a gene family is not only phylogenetically but also experimentally dependent.
As the firstorder approximation, we model the experimental correlation of the microarray data as the overall correlations among microarray experiments. Let D be the N × N matrix of experimental correlations; that is, the diagonal element is 1, while the offdiagonal element is the coefficient of correlation between any two microarray experiments. For the microarray chip that includes in total C genes (e.g., C = 5500 for the yeast genome), D can be estimated by the standard approach over all C genes. Then one can show that the expression profiles of the gene family X = (x_{1},..., x_{N})′ follow a multivariate normal distribution, with the large [n × N] × [n × N] variancecovariance matrix V ⊗ D. The likelihood function is therefore given by
Wolfe and Shields (1997) have identified 452 duplicate genes that were from the yeast genome duplication. Using Chu et al.'s (1998) cDNA microarray data, sevenpoint timecourse experiments during the sporulation, we calculated the experimental correlation matrix D (Chuet al. 1998). Obviously, two adjacent time points usually show a high correlation, while it is low between distant points. The variance and covariance elements of V for a duplicate pair are given by V_{11} =ρ^{2} + E_{1}, V_{22} =ρ^{2} + E_{2}, and V_{12} =ρ^{2}. These unknown parameters (ρ^{2}, E_{1}, and E_{2}) are then estimated using Equations 31 and 34, respectively. We use the estimated coefficient of correlation
Yeast glutamyl and glutaminyltRNA synthetases gene family: An example: The glutamyl and glutaminyltRNA synthetases (GlnS) family has three member genes (YGL245w, YOR168w, and YOL033w). Phylogenetic analysis has shown that YGL245w and YOR168w are more closely related (Figure 5). The cellcycle yeast microarray data are used (Eisenet al. 1998).
Under the general E_{0} model, the likelihood under the i.i.d. assumption results in the maximumlikelihood (ML) estimates of
Using the molecular clock approach, we have approximately dated the relative time of first gene duplication (between YGL245w/YOR168w and YOL033w) as 2.2 (to the Escherichia coli/yeast split time), and the second one as 1.27. Thus, under the B model, we have obtained the ML estimates (under i.i.d.) of
As shown in Figure 6, the expression profile of the ancestral ancestor of YGL245wYOR168w has been inferred by the Bayesian method. Therefore, one can infer lineagespecific changes after gene duplication (the derived characters) from the ancestral expression pattern. For instance, changes of expression level in each lineage after gene duplication will be very informative when other genome data (e.g., putative regulatory motifs) are available.
Yeast gene families: We have conducted a large data analysis for yeast gene families to understand the evolution of expression after gene duplications. Amino acid sequences of yeast gene families, as well as homologous genes from nearly 30 complete genomes, were downloaded from the COG database (http://www4.ncbi.nlm.nih.gov/COG/). The phylogenetic tree of each gene family is inferred by the neighborjoining method (Saitou and Nei 1987). Furthermore, the duplication time relative to yeast/E. coli split can be (approximately) estimated by using the molecular clock approach under the inferred tree, similar to Gu et al. (2002,2002). In total 276 yeast microarray data are collected for each gene family; the likelihood of Equation 31 is used for the analysis.
For a threemember gene family when the root can be reliably determined, the likelihoodratio test (LRT) is used to test the expression clock hypothesis (Figure 7). The null hypothesis is E_{1} = E_{2}. The log of likelihood ratio approximately follows a χ^{2} with d.f. = 1 so that one can determine the significance level. We conducted the LRT for 60 threemember yeast gene families; for them the relative duplication times estimated from sequence data are treated as known. Overall, 42 (70%) gene families showed that the null hypothesis is rejected at the 0.05 significance level and 39 (65%) gene families at the 0.01 significance level. Therefore, one may conclude that, after gene duplication, an unequal rate for expression divergence in each duplicate gene is a common pattern. Examples in Figure 8 show more dramatic expression changes than sequence substitutions after gene duplications.
DISCUSSION
In this article we have developed a statistical framework to explore the expression divergence during the gene family evolution. Several data analyses have shown the potential for studying many interesting problems in evolutionary genomics, e.g., the evolution of gene expression specificity, the evolutionary fate of duplicate genes, ancestral expression inference, and the coevolution between expression and regulatory motif or coding sequence. Our method is also useful for understanding the populationquantitative genetic basis of expression evolution after gene duplication.
Comparative methods: Statistical methods for the analysis of continuous morphological data have been studied for decades to deal with phylogeneticdependent sampling (Harvey and Pagel 1991; Maddison 1994). These methods can be roughly classified into phylogenetically independent contrasts (PIC; Felsenstein 1988), the phylogenetic autocorrelation (PA) method (Cheverudet al. 1985), and phylogenetic generalized leastsquares (PGLS; Grafen 1989), as well as many variants; see Rohlf (2001) for a recent detailed review and critiques. Here we briefly discuss the connection between the newly developed method and these existing methods.
We have noted that all these methods share a similar theoretical ground. In fact, PIC, PA, and PGLS methods require the known “expected variancecovariance matrix” V for incorporating the phylogenetic structure. Except for some early studies, the popular assumption is V =σ^{2}C, where the matrix C is determined by the phylogeny and timescales. Essentially, it is equivalent to the B model. Schluter et al. (1997) introduced a simple prior for σ^{2} to relax the unrealistic assumption. There has been longterm controversy about which one should be used and exactly how it can be applied (Pagel 1993; Rohlf 2001). Rohlf (2001) did not consider any complex form of V, because it is more arbitrary in implementation. The central theme of our study is to model the expected variancecovariance matrix V for exploring the evolutionary pattern of expression divergence, which can be estimated under appropriate statistical procedures. Since the likelihoodratio test clearly shows a strong lineagespecific mode of expression evolution, the conventional PIC, PA, and PGLS methods that assume a constant Brownian motion may be oversimplified for microarray data analysis.
Inference of the ancestral expression pattern provides an efficient approach to reconstruct the evolutionary trace of expression diversity after gene (or genome) duplications. The socalled maximumlikelihood method for inferring the ancestral state of continuous characters is to minimize the (weighted) sum of differences over all branches (e.g., Schluteret al. 1997). For the threegene tree (Figure 1b), it is given by
Models extended for expression data: In the future, we will study how to implement the methodology of PIC, PA, or PGLS for the microarray data. For instance, the PGLS regression model for gene family expression may improve the efficiency of the expressionmotif search (e.g., Bussemakeret al. 2001; Blanchette and Tompa 2002). This is because the regulatory processing of a gene family can be traced back to a single ancestral regulatory modular so that the heterogeneity problem can be partially avoided. For a given microarray experiment, the expression profile of a gene family x = (x_{1},..., x_{n})′ can be written as x = Hb + ϵ, where the matrix H represents explanatory variables, e.g., the cisregulatory motif structure (predicted or known, number of copies, locations, etc.), and some transelements that may influence the expression level. The variancecovariance matrix of the error term ϵ, denoted by R, is given by
Similarly, the PA method can be written as x =αSx + ϵ′, where α is the autocorrelation coefficient, S is an n × n (treedependent) connection matrix (each row of which sums up to one), and ϵ′ is error terms (Cheverudet al. 1985). Instead of the arbitrary construction of S, we suggest estimating S from R in Equation 36 with (row) normalization. Moreover, for a gene family, it would be very interesting to detect any functional correlation between two microarray experiments. We may modify the PIC method (Felsenstein 1988) to correct not only phylogenetic dependence but also experimental correlation as well as the lineagespecific effect.
Experimental errors and others: We have recognized that the current statistical framework involves several assumptions that need to be examined carefully. The first one is the normal assumption. By default, we assume that fold change of gene expression follows a Brownianbased process during evolution. Further investigation is needed to test the robustness of normal assumption when other data normalization procedures are adopted (Quackenbush 2001). Second, we should consider the sources of experimental errors. Let
Acknowledgments
The author thanks Zhongqi Zhang for assisting in numerical analysis. This work was supported by a National Institutes of Health grant.
Footnotes

Communicating editor: Y.X. Fu
 Received July 14, 2003.
 Accepted January 22, 2004.
 Copyright © 2004 by the Genetics Society of America