- THIS ARTICLE
-
Abstract
- Full Text (PDF)
-
All Versions of this Article:
genetics.104.031500v1
170/4/2003 most recent - Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Email this article to a friend
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Munneke, B.
- Articles by Doerge, R. W.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Munneke, B.
- Articles by Doerge, R. W.
Originally published as Genetics Published Articles Ahead of Print on June 8, 2005.
Genetics, Vol. 170, 2003-2011, August 2005, Copyright © 2005
doi:10.1534/genetics.104.031500
Adding Confidence to Gene Expression Clustering
B. Munneke*,1, K. A. Schlauch,1, K. L. Simonsen*, W. D. Beavis and R. W. Doerge*,2
* Department of Statistics, Purdue University, West Lafayette, Indiana 47907
Center for Biomedical Genomics and Informatics, George Mason University, Manassas, Virginia 20110
National Center for Genome Resources, Santa Fe, New Mexico 87505
Department of Agronomy, Purdue University, West Lafayette, Indiana 47907
2 Corresponding author: Department of Statistics, 1399 Mathematical Sciences Bldg., 150 N. University St., Purdue University, West Lafayette, IN 47907-1399.
E-mail: doerge{at}purdue.edu
It has been well established that gene expression data contain large amounts of random variation that affects both the analysis and the results of microarray experiments. Typically, microarray data are either tested for differential expression between conditions or grouped on the basis of profiles that are assessed temporally or across genetic or environmental conditions. While testing differential expression relies on levels of certainty to evaluate the relative worth of various analyses, cluster analysis is exploratory in nature and has not had the benefit of any judgment of statistical inference. By using a novel dissimilarity function to ascertain gene expression clusters and conditional randomization of the data space to illuminate distinctions between statistically significant clusters of gene expression patterns, we aim to provide a level of confidence to inferred clusters of gene expression data. We apply both permutation and convex hull approaches for randomization of the data space and show that both methods can provide an effective assessment of gene expression profiles whose coregulation is statistically different from that expected by random chance alone.
MICROARRAY technology has been applied experimentally across many biological disciplines; some of the earliest examples in agriculture used expressed sequence tags (ESTs) in studies of plant gene expression (EWING et al. 1999). Similarly, in human studies, cDNA microarray data have been used as the vehicle for the investigation of biologic variation in mammary epithelial cells among breast tumor samples (PEROU et al. 1999). The evolutionary biology of yeast was studied using microarray techniques to compare genome-wide expression patterns in evolved vs. parental strains after 250 generations of growth (FEREA et al. 1999). Since its initial impact the use of microarray technology now includes much broader investigations of regulatory science (DOERGE 2002; CHEUNG et al. 2003; SCHADT et al. 2003; BREM and KRUGLYAK 2005; KIM et al. 2005; RONALD et al. 2005) and reconstruction of genetic networks (KNUDSEN 2002).
Microarray technologies allow researchers to simultaneously monitor cellular activity of many gene transcripts. These experiments produce mRNA expression data in great abundance and provide useful information to pursue the conjectures of functional genomics. Two different approaches for analyzing these data, both of which rely heavily on statistics, include testing each gene for differential expression and/or summarizing gene expression profiles by assessing similarities in their pattern of behavior across treatments or conditions. The statistical issues (e.g., statistical models, hypotheses, multiple comparisons, etc.) involved in testing differential expression have been investigated more thoroughly (see review by CRAIG et al. (2003)) than those statistical issues surrounding the statistical assessment of gene expression profile patterns and their clustering (ZHANG and ZHAO 2000; KERR and CHURCHILL 2001; MUNNEKE 2001; TIBSHIRANI et al. 2001; MCSHANE et al. 2002).
In efforts to understand the underlying data structure resulting from expression studies, data are typically summarized by grouping the expression intensities via similarity of response to various experimental conditions. Statistical methodologies that supply these groups, or clusterings, of gene expression profiles are known as clustering techniques and originate from a well-established area of statistics referred to as multivariate statistics (JOHNSON and WICHERN 1998). The clustering methods involved in the previously mentioned studies tend to be either partition based, such as K-means clustering (TAVAZOIE et al. 1999) and self-organizing maps (TAMAYO et al. 1999), or nested, as with hierarchical clustering (EISEN et al. 1998). Regardless of the clustering method used, an assignment of statistical significance for the resulting partitions is necessary to interpret cluster reliability and biological meaning (ZHANG and ZHAO 2000; KERR and CHURCHILL 2001; MCSHANE et al. 2002).
Several such approaches have been developed. KERR and CHURCHILL (2001) present a bootstrapping technique to assess the stability of profile clusters. Genes are assigned to a set of fixed profiles via their correlation; a correlation coefficient >0.90 is enough to assign two profiles to the same cluster. Bootstrapping is then employed to assess stability of a gene by counting how many times it is assigned to the same fixed profile. ZHANG and ZHAO (2000) also use bootstrapping to generate perturbed data sets. Their reliability measure depends on the number of times genes occur in the set of perturbed clusters. MCSHANE et al. (2002) present a principal components analysis to assess the overall clustering of expression patterns and then test whether gene expression profiles arise from a single multivariate normal distribution. New data sets with artificial experimental error are generated by adding Gaussian white noise to the original expression levels. Two reproducibility indexes are associated with each cluster by computing the number of pairs existing in both the original and the perturbed cluster. A perturbed cluster is matched to the original cluster if it contains a majority of elements in common with the original cluster.
All clustering methods when applied to gene expression data depend on a dissimilarity measure, $d_{ij}$, in such a way that for two genes $g_i$ and $g_j$ the measure obeys three properties (1–3), while a true distance measure also satisfies a fourth (see, e.g., JOHNSON and WICHERN 1998):
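1. $d_{ij} \ge 0$
2. $d_{ii} = 0$
3. $d_{ij} = d_{ji}$
4. $d_{ik} \le d_{ij} + d_{jk}$

[Equation 1: penalized dissimilarity measure (Munneke metric); equation image not available.]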
As an example, consider the three gene expression profiles in Figure 1, taken from the EISEN et al. (1998) yeast sporulation data set. Letting cor denote Pearson's correlation coefficient, we have cor(YBR148W,YDR260C) = 0.991, cor(YDR260C,YKL166C) = 0.998, and cor(YBR148W,YKL166C) = 0.993, rendering all three pairs almost identical. If we let euc denote the Euclidean distance, then euc(YBR148W,YDR260C) = 28.14, euc(YDR260C,YKL166C) = 10.48, and euc(YBR148W,YKL166C) = 38.55, implying that gene YBR148W is more similar to YDR260C than to YKL166C, a result that does not agree with the correlation coefficient result. Employing the Munneke metric (1) on these same data yields d(YBR148W,YDR260C) = 1.77, d(YDR260C,YKL166C) = 1.58, and d(YBR148W,YKL166C) = 1.97. While the correlation coefficient between the two pairs is almost identical, the Munneke metric (1) is able to differentiate the two pairs with respect to relative magnitude, which we believe to be a more intuitive measure of dissimilarity. We realize that many penalty terms can be used (GORDON 1999), but we restrict our attention to the described penalized dissimilarity measure (1) as it distinguishes between induced and repressed gene expression of the same magnitude.
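The contrast between correlation and Euclidean distance drawn above can be illustrated with a minimal sketch. The profiles below are hypothetical values chosen for illustration (they are not the yeast sporulation data), and the function names are our own. Profiles that share a shape but differ in magnitude have a correlation of 1 while their Euclidean distances differ.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical profiles: identical pattern, increasing magnitude.
base = [0.0, 1.0, 3.0, 2.0, 4.0]
doubled = [2.0 * v for v in base]
tripled = [3.0 * v for v in base]

# Correlation cannot separate the pairs; Euclidean distance can.
print(pearson(base, doubled), pearson(base, tripled))
print(euclidean(base, doubled), euclidean(base, tripled))
```

A dissimilarity that combines a pattern term with a magnitude penalty, as the penalized measure does, is intended to sit between these two extremes.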
The joining method complements the dissimilarity measure and is chosen from three of many possible methods: nearest neighbor (single linkage), farthest neighbor (complete linkage), and average linkage (GORDON 1999). Using any measure, the dissimilarity between two groups is assessed by considering all pairs of genes formed by taking one member of each group. Throughout this work we concentrate on agglomerative hierarchical clustering and employ the Munneke dissimilarity measure (1) with these joining methods.
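The three joining methods can be sketched as follows. This is a minimal illustration, assuming that groups are given as lists of profiles and that `d` is any pairwise dissimilarity function supplied by the user; the function names are our own and a simple absolute difference on one-dimensional values stands in for the dissimilarity measure.

```python
def single_linkage(group_a, group_b, d):
    """Nearest neighbor: smallest dissimilarity over all cross-group pairs."""
    return min(d(a, b) for a in group_a for b in group_b)

def complete_linkage(group_a, group_b, d):
    """Farthest neighbor: largest dissimilarity over all cross-group pairs."""
    return max(d(a, b) for a in group_a for b in group_b)

def average_linkage(group_a, group_b, d):
    """Mean dissimilarity over all cross-group pairs."""
    total = sum(d(a, b) for a in group_a for b in group_b)
    return total / (len(group_a) * len(group_b))

# Stand-in dissimilarity for one-dimensional values (illustration only).
def abs_diff(a, b):
    return abs(a - b)

group_a = [0.0, 1.0]
group_b = [3.0, 5.0]
print(single_linkage(group_a, group_b, abs_diff))    # pair (1, 3)
print(complete_linkage(group_a, group_b, abs_diff))  # pair (0, 5)
print(average_linkage(group_a, group_b, abs_diff))   # mean of 3, 5, 2, 4
```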
The amount of random variation found in gene expression studies is known to have numerous sources (KNUDSEN 2002; CRAIG et al. 2003). Given the lack of ability to discriminate how this variation affects clustering results, users of clustering algorithms are growing increasingly careful in interpreting their results. Furthermore, because the technology is expensive, there is also a desire to compare the results across laboratories and experiments. Ideally, this comparison should be possible at the level of cluster analysis, independent of the algorithm, dissimilarity function, or joining method. Our purpose in this work is to rely on the penalized dissimilarity measure (1) and to place a level of confidence on the groupings suggested by hierarchical clustering (or any other clustering algorithm). Our focus is not on the clustering mechanism itself, but rather on deriving a statistical assessment of significance as one traverses down the branches of a hierarchical clustering tree. In essence, we propose a comparison of the original data with a randomization of the data space (i.e., no association between the expression profiles) for the purpose of assessing how probable any given cluster is to have occurred by chance, thus providing researchers with a baseline for comparing results independent of the clustering algorithm used.
Two main features are involved in our proposed method of placing a level of certainty on branches of a clustering tree. The first is the creation of a random representation of the actual expression profiles. The second is a test statistic that summarizes the existing cluster tree structure, whether from the actual data or from the randomized data. Randomization of gene expression information in this context is accomplished using two different methods: permutation tests and convex hulls. Permutation tests (FISHER 1935) have been successfully employed in the quantitative trait loci (QTL) literature (CHURCHILL and DOERGE 1994; DOERGE and CHURCHILL 1996; NETTLETON and DOERGE 2000) to establish experimental thresholds for declaration of statistically significant QTL. The application of permutation methods to gene expression data is analogous to the QTL application in that individual expression measures that are related (associated) through coregulation will lie in the same cluster. However, in certain circumstances (discussed later) permutation methods are less effective in randomizing the data space, and a convex hull (BARBER et al. 1997) approach is employed as an alternative. Finally, a test statistic is required for the purpose of measuring the amount of structure present in a dendrogram via a single number. This statistic distinguishes between the data sets without structure (i.e., reflecting a randomized data space) and the original data space. The underlying goal is to calculate a statistic that complements the dissimilarity measure used to create the dendrogram while clearly distinguishing between random gene expression profiles and clusters and gene expression data that demonstrate true associations.
Randomization:
Permutation method:
Consider the gene expression data to be real-valued t-dimensional vectors, one dimension for each treatment in the experiment. The data vectors represent specific gene expression levels recorded under each of the t treatment conditions. The random gene expression profiles (or the permuted data) are constructed by focusing on one treatment at a time and sampling without replacement from this treatment for every expression observation (gene). Via the permutation, the value of each expression profile is equally likely to be any one of the set of all values observed for that variable over the course of the original experiment. Thus, under the null hypothesis, a random permutation represents an outcome that is as likely to have been observed as the original data, without parametric assumptions of data structure. Furthermore, the permutation method also allows for any inherent association between individual profiles to be broken. While one permutation of the data creates a single randomized data set, repeating this process n times creates a collection of n random data sets, each potentially representing an experiment having no inherent structure among its data points.
Permutation techniques for null model generation are easy to implement and aim to remove much of the association between gene expression profiles (i.e., coregulation). However, not every data set yields permuted data that are free of cluster structure: even after permuting the original data up to 10,000 times it is possible, and has been observed, that the randomized data retain a looser cluster structure. We have found that some gene expression data are aligned in the data space in a manner that prohibits permutation from randomizing the data space efficiently, e.g., a group of expression vectors forming a long thin cloud (i.e., cigar shaped) in two-dimensional space. We are motivated to provide a null data set that has little or no cluster structure, but is still contained within the boundaries of the original data. For this, we introduce an alternate approach of randomly mixing gene expression vectors within their convex hull.
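The per-treatment permutation described above can be sketched as follows. This is a minimal illustration, assuming the data are stored as a list of gene profiles (one list of t values per gene); the function name `permute_by_treatment` and the toy values are our own.

```python
import random

def permute_by_treatment(data, rng):
    """Shuffle each treatment (column) independently across genes (rows).

    Every column of the result holds exactly the values observed for that
    treatment, but the association between treatments within a gene is broken.
    """
    n_genes = len(data)
    n_treatments = len(data[0])
    shuffled_columns = []
    for k in range(n_treatments):
        column = [profile[k] for profile in data]
        rng.shuffle(column)  # sampling without replacement within treatment k
        shuffled_columns.append(column)
    return [[shuffled_columns[k][i] for k in range(n_treatments)]
            for i in range(n_genes)]

# Toy data: 4 genes measured under 3 treatments (hypothetical values).
data = [[1.0, 10.0, 100.0],
        [2.0, 20.0, 200.0],
        [3.0, 30.0, 300.0],
        [4.0, 40.0, 400.0]]
rng = random.Random(1)
null_sets = [permute_by_treatment(data, rng) for _ in range(5)]
```

Repeating the call n times, as in the last line, gives the collection of n randomized data sets used to build the null distribution.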
Convex hull:
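The convex hull of a set G of m gene expression vectors in $\mathbb{R}^t$ is the smallest convex set in $\mathbb{R}^t$ containing all m vectors and is defined as

$$\mathrm{conv}(G) = \left\{ \sum_{i=1}^{m} \lambda_i g_i \;:\; g_i \in G,\ \lambda_i \ge 0,\ \sum_{i=1}^{m} \lambda_i = 1 \right\}.$$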
Permutation vs. convex hull:
As mentioned previously, it is sometimes the case that the permutation approach alone (without the convex hull) generates random data sets with subcluster structure. Consider the gene expression profiles of 58 yeast genes across seven temporal states as shown in Figure 2. These data are part of a large public gene expression experiment of 6118 yeast genes undergoing sporulation (http://cmgm.stanford.edu/pbrown/sporulation/additional/spospread.txt). A single random data set generated by permutations alone is shown in Figure 3, and one generated using the convex hull approach is shown in Figure 4. We believe that the convex hull approach is more reliable than the permutation approach in generating data sets without subcluster structure for original data in which strong cluster structure exists. In such cases, artificial/permuted data generated by permutations will tend to retain some, if not all, of the original subcluster structure. Consider the example of Figure 2, in which each expression vector belongs to one of two clusters. For each experiment, except the first at time 0, the set of possible values is divided into two distinct sets. Thus, the permuted value for each experiment will belong to one of these two distinct sets, generating expression vectors that lie within one of $2^6 = 64$ possible distinct clusters, as is demonstrated in Figure 3. Alternatively, the convex hull approach first generates two random numbers in [0, 1] and then constructs new profiles via linear combinations of original profiles using the two random numbers, making it unlikely that the new profile will lie exactly within one of the original two clusters.
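The idea of mixing profiles within their convex hull can be sketched as below. This is a simplified illustration and not necessarily the exact construction used here: we assume each new profile is a convex combination of two randomly chosen original profiles with a single uniform weight. The function names and toy values are hypothetical. Any such combination lies inside the convex hull of the original data.

```python
import random

def convex_mix_profile(data, rng):
    """Return one random point on the segment joining two original profiles."""
    first, second = rng.sample(data, 2)   # two distinct original profiles
    weight = rng.random()                 # uniform weight in [0, 1)
    return [weight * a + (1.0 - weight) * b for a, b in zip(first, second)]

def convex_mix_null(data, rng):
    """Generate a randomized data set with as many profiles as the original."""
    return [convex_mix_profile(data, rng) for _ in data]

# Toy data with two well-separated clusters (hypothetical values).
data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
        [9.8, 10.0], [10.0, 9.9], [10.1, 10.2]]
null_data = convex_mix_null(data, random.Random(3))
```

Because many mixed profiles fall between the two clusters, the randomized data tend to fill the space inside the hull rather than reproduce the original grouping.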
Test statistic:
As a means of assessing the subgroup architecture within any given gene expression experiment or randomized data set, we construct a test statistic based upon the sum of the branch lengths of a cluster dendrogram. The distribution of the test statistic under the null hypothesis is estimated and used to establish statistically significant results in the original gene expression data when compared to the null distribution. The clusters being tested at each stage are provided by the (hierarchical) clustering of the original data, based upon our penalized dissimilarity measure (Munneke metric) (1) and one of the joining methods mentioned previously. Consider Figure 5, noting that for each branch in the dendrogram a left child subgroup and a right child subgroup exist. The sum of branch lengths is calculated as a function of the sum of the differences between the last join, D0, and each of the last joins for the child clusters, D1 and D2 (Figure 5). The distribution of the sum of the branch lengths below (SLB) the parent node under the null hypothesis is used to assess the statistical significance of the original data dendrogram. SLB focuses on the dissimilarities provided in the clustering of the data for the purposes of assessing the subgroup structure and avoiding criteria such as sum-of-squares.
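The comparison of an observed statistic with its estimated null distribution can be sketched as follows. This is a minimal illustration, assuming the statistic has already been computed for the original data and for each randomized data set; the function names are our own, and the simple proportion used here is only one of several conventions for an empirical P-value.

```python
def empirical_p_value(observed, null_values):
    """Proportion of null statistics at least as large as the observed one."""
    at_least_as_large = sum(1 for value in null_values if value >= observed)
    return at_least_as_large / len(null_values)

def is_significant(observed, null_values, alpha):
    """Declare significance when the empirical P-value falls below alpha."""
    return empirical_p_value(observed, null_values) < alpha

# Hypothetical null statistics from 100 randomized data sets: 0, 1, ..., 99.
null_values = [float(v) for v in range(100)]
print(empirical_p_value(97.5, null_values))       # only 98 and 99 exceed it
print(is_significant(97.5, null_values, 0.05))
print(is_significant(50.0, null_values, 0.05))
```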
We define the SLB statistic using both a dissimilarity measure d and an agglomerative method D as required by hierarchical clustering. Three standard agglomerative methods are defined as
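$$D_{\text{single}}(A, B) = \min_{g_i \in A,\ g_j \in B} d_{ij}, \qquad D_{\text{complete}}(A, B) = \max_{g_i \in A,\ g_j \in B} d_{ij}, \qquad D_{\text{average}}(A, B) = \frac{1}{|A|\,|B|} \sum_{g_i \in A} \sum_{g_j \in B} d_{ij}.$$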
The cluster of genes G is partitioned into the child subgroups $G_0$ and $G_1$, i.e., $G = G_0 \cup G_1$. To calculate the test statistic we denote the partitions of $G_0$ and $G_1$ by extending our notation,
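$$G_0 = G_{00} \cup G_{01}, \qquad G_1 = G_{10} \cup G_{11},$$

and sum the differences between the last join of the parent and the last joins of its two children,

$$\mathrm{SLB}(G) = \left[ D(G_0, G_1) - D(G_{00}, G_{01}) \right] + \left[ D(G_0, G_1) - D(G_{10}, G_{11}) \right].$$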
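The statistic can be sketched in code under one reading of the verbal description given earlier, namely that SLB is the sum of the differences between the last join of the parent, D0, and the last joins of its two children, D1 and D2. The tree representation below is hypothetical: a leaf is a gene name and an internal node is a tuple `(join_height, left, right)`. We also assume a leaf has a last join of 0.

```python
def last_join(node):
    """Height of the last join that formed this cluster (0 for a single gene)."""
    if isinstance(node, tuple):
        return node[0]
    return 0.0

def slb(node):
    """Sum of branch lengths directly below an internal node."""
    height, left, right = node
    return (height - last_join(left)) + (height - last_join(right))

# Hypothetical dendrogram: genes a, b join at 2.0; c, d at 3.0; all at 10.0.
tree = (10.0, (2.0, "a", "b"), (3.0, "c", "d"))
print(slb(tree))       # (10 - 2) + (10 - 3)
print(slb(tree[1]))    # both children are single genes: 2 + 2
```

A large value indicates that the two child clusters are tight relative to the dissimilarity at which they are joined.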
Implementation:
Using the SLB test statistic, we approach a clustering tree one branch point at a time. Starting from the top, SLB is calculated for the first branch point, which divides the total set of gene expression profiles G into two subgroups G0 and G1. Recall that the original data set has a representation in the form of a dendrogram as does each of the randomized data sets. Therefore, the statistic (SLB) is calculated for the original data, as well as for each of the random data sets. If the statistic calculated for the original data is large (i.e., exceeds the 1 − α percentile) relative to the distribution of SLB under the null hypothesis, this suggests that the original data space has a stronger subgroup structure than would be expected from a random association of gene expression profiles. The statistically significant subgroup structure is then accepted for this branch point, a probability (P-value) is assigned, and the algorithm continues by operating on each of the two statistically significant subgroups independently. Note that each additional partitioning of the data is conditional on acceptance of the partition for the preceding branch point. The algorithm self-terminates when all branch points below valid partitions in the original data set are found not to be statistically significant. If further investigation is desired below a specific subgroup, then the algorithm can be reinitiated beginning at the parent node of this subgroup.
Simulation:
As a means of assessing the power of both randomization methodologies, we first relied on data simulation. Hierarchical clustering routines HC and HCASS2 (Fionn Murtagh at Université Louis Pasteur, Strasbourg, http://newb6.u-strasbg.fr/fmurtagh//mda-sw/) were employed along with a standard uniform random number generator for permutation. Normal random variates were produced by the Box-Muller transformation of uniform variates, and two distinct multivariate groups with 50 observations (genes) in each group, at three treatments (or dimensions), were simulated. The distance between the two groups was measured by a signal- (expression intensity) to-noise (variation) ratio. For example, in one dimension the signal-to-noise comparison gives an indication of the level of separation for the means of the two distributions defining the gene clusters. As such, in one dimension, the signal-to-noise ratio can be calculated as |µ1 − µ2|/σ for two normal distributions with means µ1 and µ2, respectively, and a common standard deviation σ. The distance between the two multivariate groups was measured using the signal-to-noise ratio averaged over the t (treatment) dimensions, calculated as

$$\frac{1}{t} \sum_{k=1}^{t} \frac{|\mu_{1k} - \mu_{2k}|}{\sigma_k}$$

[...] α = 0.05. The number of significant clusters and the misclassification rate were based on 1000 repetitions of this process (Table 1).
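The simulation setup can be sketched as follows. This is a minimal illustration, assuming independent normal coordinates within each group; the function names, the seed, and the particular means are our own choices and are not taken from the simulation study itself.

```python
import math
import random

def box_muller(rng):
    """One standard normal variate from two uniform variates."""
    u1 = 1.0 - rng.random()   # shift to (0, 1] so that log(u1) is defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def simulate_group(means, sigmas, n_genes, rng):
    """Simulate n_genes profiles with independent normal coordinates."""
    return [[m + s * box_muller(rng) for m, s in zip(means, sigmas)]
            for _ in range(n_genes)]

def average_signal_to_noise(means_1, means_2, sigmas):
    """Mean over dimensions of |mu_1k - mu_2k| / sigma_k."""
    ratios = [abs(a - b) / s for a, b, s in zip(means_1, means_2, sigmas)]
    return sum(ratios) / len(ratios)

# Two groups of 50 genes at three treatments, unit variance (assumed values).
rng = random.Random(2005)
means_1, means_2, sigmas = [0.0, 0.0, 0.0], [3.0, 3.0, 3.0], [1.0, 1.0, 1.0]
group_1 = simulate_group(means_1, sigmas, 50, rng)
group_2 = simulate_group(means_2, sigmas, 50, rng)
print(average_signal_to_noise(means_1, means_2, sigmas))
```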
Simulation results:
Each simulation resulted in a number of significant clusters (ranging from 1 to 100) for each data set. As the data were simulated under the assumption of two known clusters, any number of significant clusters different from two is incorrect. The degree of incorrectness of a cluster analysis, in this context, is the percentage of genes misclassified (i.e., a gene profile is assigned to a cluster different from the cluster from which it was simulated). The percentage of misclassification is the maximum overlap between two of the discovered clusters and the known true generating clusters. For example, if the two true clusters are G1 = {1, ... , 50} and G2 = {51, ... , 100}, and the discovered cluster is Gdisc.1 = {1, ... , 100} then 50% of the genes are misclassified. If the discovered clusters are Gdisc.1 = {1, ... , 48}, Gdisc.2 = {49, ... , 94}, and Gdisc.3 = {95, ... , 100}, then 8% of the genes are misclassified. A perfect scenario (for this simulation), as the signal-to-noise ratio increases, maintains 100% of the simulations as having two distinct clusters with 0% misclassification. In fact, we find that the average misclassification tends to 0.0, indicating no misclassification, as the true separation between the two generating clusters tends to infinity. Because our simulation study was based upon two three-dimensional gene expression clusters, we allowed the signal-to-noise ratio to increase in a stepwise manner from 3.0 to 6.0, while the variance remained at 1.0. Only the results for the permutation randomization are shown in Table 1. For both randomization methods, as the signal-to-noise ratio increases the ability to identify statistically significant gene clusters increases. The average percentage of the observations misclassified reflects an increase in the power of the test as the dissimilarity between the clusters becomes larger.
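The misclassification percentage can be sketched under one reading of the definition above that reproduces both worked examples: each true cluster is matched to a different discovered cluster so that the total overlap is as large as possible, and every gene outside that overlap counts as misclassified. The function name is our own.

```python
from itertools import permutations

def misclassification_rate(true_clusters, found_clusters):
    """Fraction of genes outside the best one-to-one cluster matching."""
    true_sets = [set(c) for c in true_clusters]
    found_sets = [set(c) for c in found_clusters]
    # Pad with empty clusters so every true cluster can be assigned a partner.
    while len(found_sets) < len(true_sets):
        found_sets.append(set())
    n_genes = sum(len(c) for c in true_sets)
    best_overlap = 0
    for chosen in permutations(range(len(found_sets)), len(true_sets)):
        overlap = sum(len(true_sets[i] & found_sets[j])
                      for i, j in enumerate(chosen))
        best_overlap = max(best_overlap, overlap)
    return 1.0 - best_overlap / n_genes

true_clusters = [range(1, 51), range(51, 101)]
one_cluster = [range(1, 101)]
three_clusters = [range(1, 49), range(49, 95), range(95, 101)]
print(misclassification_rate(true_clusters, one_cluster))
print(misclassification_rate(true_clusters, three_clusters))
```

The exhaustive search over matchings is adequate for the handful of clusters considered here; a larger problem would call for an assignment algorithm.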
Experimental data:
We consider the 517-gene subset of the 8613 genes from human fibroblast cells treated with serum (IYER et al. 1999). These data were used to create the cluster dendrogram in IYER et al.'s (1999) Figure 2. This figure is available at http://genome-www.stanford.edu/clustering/Figure1.jpg and the data are available at http://genome-www.stanford.edu/serum/data.html. The data consist of measurements of mRNA present at 13 time points following the treatment with serum. The clustering in IYER et al.'s (1999) Figure 2 is the result of a hierarchical method, utilizing the cosine of the angle between the gene vectors as the dissimilarity measure, and the agglomerative method. First, we use a permutation-based hierarchical clustering routine with a cosine dissimilarity and average agglomeration method to compare our results directly with those of Iyer. Second, we repeat the test using the Munneke metric (1) and the complete agglomeration method. Third, we repeat the analysis using a significance threshold, 1 − α. Finally, the analysis is repeated using the convex hull method with the Munneke metric and the average agglomerative method.
The results of the permutation-based randomization (based on 1000 permutations) with hierarchical clustering (using the cosine metric) reveal two subgroups. Using a significance level of α = 0.05, the permutation subgroups share common clusters with IYER et al.'s (1999) Figure 2. The first branch point is the most obvious and gives the only statistically significant subgroup structure. The interpretation of the groupings (Figure 6) indicates one group of induced genes (group 1) and one group of repressed genes (group 2). Even when the significance level α is increased to α = 0.40, this grouping remains intact.
We modified the dissimilarity measure by adding a penalty term (Munneke metric) to the cosine dissimilarity measure (1) and repeated the permutation-based analysis. Any change in the number of significant subgroups between the two permutation-based analyses should be a result of the penalization term in the dissimilarity measure. The permutation-based results present essentially the same results for subgroup membership as the initial analysis (using the cosine metric), indicating that the previously described scenario of equally induced and repressed gene expressions is not of concern for these data.
The final analysis was performed using the Munneke metric (1) and the complete linkage agglomeration method. We chose the complete linkage method because it gives the highest resolution of 517 serum genes into clusters when the significance is increased to α = 0.15. Nine cluster subgroups result (Figure 7), highlighting the use of the permutation methodology as an exploratory data analysis tool. All three analyses thus far are based on the empirical distribution of the SLB test statistic, so the results can be compared directly. The penalized dissimilarity measure coupled with the complete joining method discerns the most distinct expression clusterings for the serum data of IYER et al. (1999).
Results based on the convex hull method differ from those based on permutations: using the convex hull approach, the first branch point is not detected with α < 0.025, at which point the subsequent second branch point is also accepted. This may indicate that the convex hull approach provides a stricter criterion for detecting valid cluster structure. All the branches that follow have P-values >0.50, indicating no statistical significance. The relative difference in the results between the permutation randomization and the convex hull randomization is most likely due to the inability of the permutation randomization to remove all valid subcluster structure in the randomized null model data sets. The convex hull approach avoids these issues by guaranteeing, on average, that the randomized data sets have no valid subcluster structure. As a focus of future research, we are considering different criteria to guide the determination of which randomization method to use.
The broad utility of randomization techniques as applied to cluster analysis is not limited to genomics or gene expression applications. The randomization methods presented here in the context of assessing gene expression profiles can be implemented in combination with any clustering technique, thus allowing researchers to identify statistically significant subgroup structures in a group of genes selected for study in a given organism. Furthermore, these gene clusters can then be compared across clustering methods and criteria via their attached level of confidence (or P-value). The resulting subgroups have great potential to suggest genes that may be coregulated under the conditions studied in the experiment. In work by LAN et al. (2003) hierarchical clustering with oblique principal components was used to reduce the dimension of the data space when mapping mRNA abundance as quantitative traits. While no levels of confidence were attached to the clusters, it is easily understood that randomization methods not only fit the application, but also in fact benefit the results.
Several resampling techniques that associate reliability metrics to clustering results have been established. Those mentioned in the Introduction are based on bootstrapping approaches and rely upon the examination and counting of individual elements in the original and perturbed clusters. Through this work we introduced the use of permutation or randomization methods that do not rely upon cluster size and individual elements, but assign one value to each cluster via a test statistic, to assess its "tightness" relative to its subclusters. Additionally, our proposed methods are independent of clustering technique and offer a dissimilarity metric capable of assessing both magnitude and pattern of expression. We believe that our approach is complementary to those mentioned previously, and that it offers a reasonable alternative to assessing reliability of clusters generated by any clustering procedure.
Clustering techniques have been used extensively to explore the results of gene expression experiments. This is due in part to the intuitive visual appeal that clusters provide for discerning patterns in very large and complex data sets. Perhaps more importantly, biologists use the results from clustering to filter and select sets of (candidate) genes for further hypothesis-driven research. If used for such decision making, it is important to determine whether subsets of gene expression patterns are distinguishable or merely random artifacts. In the absence of independent biological replications we have demonstrated how randomization can aid the exploration process that often accompanies microarray experiments.
BARBER, C., D. DOBKIN and H. HUHDANPAA, 1997 The quickhull algorithm for convex hulls. ACM Trans. Math. Softw. 22: 469–483.
BREM, R. B., and L. KRUGLYAK, 2005 The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc. Natl. Acad. Sci. USA 102(5): 1572–1577.
BRØNDSTED, A., 1983 An Introduction to Convex Polytopes (Graduate Texts in Mathematics, No. 90). Springer-Verlag, New York.
CHEUNG, K. J., V. BADARINARAYANA, D. W. SELINGER, D. JANSE and G. M. CHURCH, 2003 A microarray-based antibiotic screen identifies a regulatory role for supercoiling in the osmotic stress response of Escherichia coli. Genome Res. 13(2): 206–215.
CHURCHILL, G. A., and R. W. DOERGE, 1994 Empirical threshold values for quantitative trait mapping. Genetics 138: 963–971.
CRAIG, B. A., M. A. BLACK and R. W. DOERGE, 2003 Microarrays: the technology and analysis. J. Agric. Biol. Environ. Stat. 8(1): 1–28.
DOERGE, R. W., 2002 Mapping and analysis of quantitative trait loci in experimental populations. Nat. Rev. Genet. 3: 43–52.
DOERGE, R. W., and G. A. CHURCHILL, 1996 Permutation tests for multiple loci affecting quantitative character. Genetics 142: 285–294.
EISEN, M. B., P. T. SPELLMAN, P. O. BROWN and D. BOTSTEIN, 1998 Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95: 14863–14868.
EWING, R. M., A. B. KAHLA, O. POIROT, F. LOPEZ, S. AUDIC et al., 1999 Large-scale statistical analyses of rice ESTs reveal correlated patterns of gene expression. Genome Res. 9: 950–959.
FEREA, T. L., D. BOTSTEIN, P. O. BROWN and R. F. ROSENZWEIG, 1999 Systematic changes in gene expression patterns following adaptive evolution in yeast. Proc. Natl. Acad. Sci. USA 96: 9721–9726.
FISHER, R. A., 1935 The Design of Experiments, Ed. 3. Oliver & Boyd, London.
GOOD, I. P., 2000 Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypothesis. Springer, New York.
GORDON, A. D., 1999 Classification. Chapman & Hall, London.
HARTIGAN, J. A., 1975 Clustering Algorithms. Wiley, New York.
IYER, V. R., M. B. EISEN, D. T. ROSS, G. SCHULER, T. MOORE et al., 1999 The transcriptional program in the response of human fibroblasts to serum. Science 283: 83–87.
JOHNSON, R. A., and D. W. WICHERN, 1998 Applied Multivariate Statistical Analysis, Ed. 4. Prentice-Hall, Englewood Cliffs, NJ.
KERR, M. K., and G. A. CHURCHILL, 2001 Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc. Natl. Acad. Sci. USA 98(16): 8961–8965.
KIM, K., M. A. L. WEST, R. W. MICHELMORE, D. A. ST.CLAIR and R. W. DOERGE, 2005 Old methods for new ideas: dissection of the determinants of gene expression levels. Proceedings of the Stadler Genetics Symposium, Columbia, MO.
KNUDSEN, S., 2002 A Biologist's Guide to Analysis of DNA Microarray Data. Wiley, New York.
LAN, H., J. P. STOEHR, S. T. NADLER, K. L. SCHUELER, B. S. YANDELL et al., 2003 Dimension reduction for mapping mRNA abundance as quantitative traits. Genetics 164: 1607–1614.
MANGALAM, H., J. E. STEWART, J. ZHOU, M. WAUGH, K. SCHLAUCH et al., 2001 GeneX: an open source gene expression database and integrated toolset. IBM Syst. J. 40(2): 552–569.
MCSHANE, L., R. RADMACHER, B. FREIDLIN, R. YU, M.-C. LI et al., 2002 Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics 18(11): 1462–1469.
MUNNEKE, B., 2001 Null model methods for cluster analysis of gene expression data. Ph.D. Thesis, Department of Statistics, Purdue University, West Lafayette, IN.
NETTLETON, D., and R. W. DOERGE, 2000 Accounting for variability in the use of permutation testing to detect quantitative trait loci. Biometrics 56: 285–291.
PEROU, C. M., S. S. JEFFREY, M. VAN DE RIJN, C. A. REES, M. B. EISEN et al., 1999 Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. USA 96: 9212–9217.
RONALD, J., J. M. AKEY, J. WHITTLE, E. N. SMITH, G. YVERT et al., 2005 Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Res. 15(2): 284–291.
SCHADT, E. E., S. A. MONKS, T. A. DRAKE, A. J. LUSIS, N. CHE et al., 2003 Genetics of gene expression surveyed in maize, mouse, and man. Nature 422: 297–302.
TAMAYO, P., D. SLONIM, J. MESIROV, Q. ZHU, S. KITAREEWAN et al., 1999 Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96: 2907–2912.
TAVAZOIE, S., J. D. HUGHES, M. J. CAMPBELL, R. J. CHO and G. M. CHURCH, 1999 Systematic determination of genetic network architecture. Nat. Genet. 22: 281–285.
TIBSHIRANI, R., G. WALTHER and T. HASTIE, 2001 Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B Stat. Methodol. 63: 411–423.
ZHANG, K., and H. ZHAO, 2000 Assessing reliability of gene clusters from gene expression data. Funct. Integr. Genomics 1: 156–173.
Communicating editor: J. B. WALSH
This article has been cited by other articles:
J. C. Cushman, R. L. Tillett, J. A. Wood, J. M. Branco, and K. A. Schlauch. Large-scale mRNA expression profiling in the common ice plant, Mesembryanthemum crystallinum, performing C3 photosynthesis and Crassulacean acid metabolism (CAM). J. Exp. Bot., May 1, 2008; 59(7): 1875–1894.
W. R. Swindell. The Association Among Gene Expression Responses to Nine Abiotic Stress Treatments in Arabidopsis thaliana. Genetics, December 1, 2006; 174(4): 1811–1824.