EXCAVATOR: a computer program for efficiently mining gene expression data

Dong Xu; Victor Olman; Li Wang; Ying Xu

doi:10.1093/nar/gkg783

2003 Oct 1;31(19):5582–5589. doi: 10.1093/nar/gkg783

PMCID: PMC206478 PMID: 14500821

Abstract

Massive amounts of gene expression data are generated using microarrays for functional studies of genes, and gene expression data clustering is a useful tool for studying the functional relationship among genes in a biological process. We have developed a computer package EXCAVATOR for clustering gene expression profiles based on our new framework for representing gene expression data as a minimum spanning tree. EXCAVATOR uses a number of rigorous and efficient clustering algorithms. This program has a number of unique features, including capabilities for: (i) data-constrained clustering; (ii) identification of genes with similar expression profiles to pre-specified seed genes; (iii) cluster identification from a noisy background; (iv) computational comparison between different clustering results of the same data set. EXCAVATOR can be run from a Unix/Linux/DOS shell, from a Java interface or from a Web server. The clustering results can be visualized as colored figures and 2-dimensional plots. Moreover, EXCAVATOR provides a wide range of options for data formats, distance measures, objective functions, clustering algorithms, methods to choose the number of clusters, etc. The effectiveness of EXCAVATOR has been demonstrated on several experimental data sets. Its performance compares favorably against the popular K-means clustering method in terms of clustering quality and computing time.

INTRODUCTION

As we are moving into the post genome sequencing era, various high throughput experimental techniques have been developed to characterize biological systems at the genome scale. Unlike traditional approaches, where genes/proteins are generally studied one at a time, high throughput approaches can provide a global view of all the genes in a genome in a relatively fast and cost-effective manner. Among various high throughput methods, microarray technology (1,2) provides a unique approach to simultaneously observe expression changes of thousands of genes under a set of experimental conditions or over a time course. A challenging issue is to effectively ‘mine’ the enormous amount of gene expression data being generated by various research laboratories world wide and to extract biological information hidden in the noisy and unstructured expression data. Computational analysis is often carried out using gene expression profiles, i.e. the expression level versus experimental condition or over a time course. Since genes with the same cellular function or in the same biological pathway often show similar patterns of expression profiles, one can often infer functions of unknown genes based on the known functions of genes with similar expression patterns. One can also assign new players in a biological pathway through identifying genes exhibiting similar expression patterns to the known genes in the pathway. That is why expression profiles clustering is often the first step in biological knowledge discovery from gene expression data. Gene expression data clustering can also be used for disease sub-typing (3) (i.e. to categorize a disease). In this case, instead of clustering expression profiles, patients with a certain disease (e.g. leukemia) can be clustered into several groups according to their expression patterns in a set of related genes. 
Then each group of patients can be given customized medicine to maximize the effectiveness of the treatment while minimizing potential side effects. In this case, one can apply similar methods of clustering gene expression profiles for disease sub-typing.

A number of computer packages have been developed for clustering gene expression patterns, including GeneCluster (4), using weighted voting and K-nearest neighbor algorithms, Cleaver (5), using K-means clustering, and Treeview (6), using hierarchical clustering. Some computer tools, such as the R Packages for Gene Expression Analysis (http://www.stat.uni-muenchen.de/~strimmer/rexpress.html), J-Express (7), Genesis (8) and Stanford’s Xcluster (http://genome-www.stanford.edu/~sherlock/cluster.html), implement a suite of classical algorithms for data clustering such as hierarchical clustering (6), self-organizing maps (9), K-means clustering (10) and principal component analysis (11). While these packages have demonstrated their usefulness in applications, some basic problems remain in terms of the algorithms applied (12). (i) None of these methods can, in general, guarantee a globally optimal clustering for any non-trivial objective function and (ii) most methods, such as K-means and self-organizing maps, depend on the ‘regularity’ of the geometric shape of cluster boundaries. These methods generally do not work well when the clusters cannot be contained in some non-overlapping convex sets. (iii) Given the above two problems, the clustering results by these methods are often sensitive to noise. In addition, none of the methods are designed to detect clusters from a noisy background; rather they are aimed at partitioning a data set into ‘clusters’ regardless of whether the set is purely made of ‘clusters’. The noise-related problem is particularly important in analyzing gene expression data, which are typically very noisy.

To overcome such problems in the existing packages for gene expression data analysis, we have developed a computer system EXCAVATOR (EXpression data Clustering Analysis and VisualizATiOn Resource). Unlike any other gene expression analysis tool, EXCAVATOR represents a set of gene expression data as a minimum spanning tree (MST) (13), a concept from graph theory. The basic idea of a MST-based clustering includes the construction of a MST and clustering by cutting edges on the MST, as illustrated in Figure 1. To construct a MST, we first build a complete graph, where each node of the graph represents a gene and every pair of nodes is connected by an edge. The length of an edge is calculated by a certain measure, e.g. the Euclidean distance between two gene expression profiles. A MST is a tree structure that connects all the nodes together with the minimum total distance. Conceptually, a MST provides a skeleton of the graph. Through this representation, the problem of clustering a multi-dimensional data set is rigorously reduced to a tree partitioning problem without losing any essential information for the purpose of data clustering, as we have mathematically proved (12). This representation has made our data clustering problem much easier to tackle.

Basic idea of MST-based clustering. (a) Data representation using MST. Each node represents a gene in a multi-dimensional space, where the value of each dimension shows the expression level of a given experimental condition or at a particular time point. (b) Clustering of genes by cutting edges on the MST. Cutting the dotted edge and the dashed edge of the MST creates three clusters, i.e. clusters of red, green and blue nodes.
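The MST construction described above can be sketched in a few lines. This is a minimal illustration using Prim's algorithm and the Euclidean distance, not the EXCAVATOR kernel itself; the function names `euclidean` and `prim_mst` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def prim_mst(profiles, dist=euclidean):
    """Return the MST edges (u, v, length) of the complete graph over the
    profiles, in the order in which Prim's algorithm adds them."""
    n = len(profiles)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # shortest known edge from the tree to each node
    parent = [-1] * n       # tree endpoint of that shortest edge
    in_tree[0] = True
    for j in range(1, n):
        best[j] = dist(profiles[0], profiles[j])
        parent[j] = 0
    edges = []
    for _ in range(n - 1):
        # Pick the outside node that is closest to the current tree.
        v = min((j for j in range(n) if not in_tree[j]), key=lambda j: best[j])
        edges.append((parent[v], v, best[v]))
        in_tree[v] = True
        # The new node may offer shorter connections to the remaining nodes.
        for j in range(n):
            if not in_tree[j]:
                d = dist(profiles[v], profiles[j])
                if d < best[j]:
                    best[j] = d
                    parent[j] = v
    return edges
```

This version evaluates every pairwise distance and therefore runs in quadratic time in the number of genes, which matches the complete-graph formulation in the text.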

Based on the MST representation of a set of gene expression data, we have developed a number of rigorous and efficient clustering algorithms. The unique and novel algorithmic techniques of EXCAVATOR include: (i) efficient implementations of clustering algorithms with guaranteed mathematical properties, including global optimality clustering measured by general objective functions; (ii) extracting clusters with guaranteed mathematical properties from a noisy background; (iii) a strong capability in dealing with clusters with complex cluster boundaries. We have implemented a stand-alone package that can run from a user-friendly Java interface or DOS/Unix/Linux command line. In addition, we also developed a server that allows users to run EXCAVATOR remotely through a Web browser. Even researchers with little computer skill can easily analyze gene expression data through the Java interface or a Web browser. EXCAVATOR is freely available to academic users at http://compbio.ornl.gov/structure/excavator/.

The algorithmic aspect of EXCAVATOR has been addressed in our previous publications (12,14,15). In this paper, we will focus on EXCAVATOR from the software perspective, including its design, functionality and comparison against other methods.

MATERIALS AND METHODS

Overview of EXCAVATOR

Figure 2 summarizes the overall design of EXCAVATOR, which consists of a kernel written in C and a graphic user interface (GUI) written in Java. The Java GUI consists of a set of pull-down menus and pop-up panels related to selecting parameters and displaying the output graphics. The connection between the two parts is done through a Java application. Such decomposition allows fast computing with the efficient kernel, while having flexible graphics features at the interface. All the intensive computations, including constructing the MST, MST partition, cluster identification in gene expression profiles and cluster assessment, are done by the kernel. The whole package is bundled using JDK1.3 (http://java.sun.com/j2se/1.3/download.html) and has been tested thoroughly on Sun, DEC and Pentium PC workstations with Windows or Linux. Running on a computer with 1 Gb memory, EXCAVATOR can handle up to 10 000 genes for clustering. There is no restriction on the number of experimental conditions, which has little effect on the memory requirement and computing time. Although a whole-genome microarray may contain tens of thousands of genes, a specific analysis often focuses on only several hundred differentially expressed genes or fewer, which generally takes less than 100 Mb memory and 10 min of computing time or less on a desktop PC. The kernel can be used as a stand-alone tool from the DOS/Unix/Linux command line. Through the interface, a user can input the data and customize the kernel parameters. Java system calls are then made to preprocess the data and feed data into the kernel. After the computing is done, the user can select the graphics for visualization. In addition, all the results are kept in individual output files. The Web interface at the client side is similar to the GUI of the stand-alone version, while all the computation is done on the server side at Oak Ridge National Laboratory. 
EXCAVATOR provides an easy to use way to analyze the data, while providing numerous options for users to choose.

In the following, we illustrate the design and features of EXCAVATOR using a small set of gene expression data (68 genes in total) from the budding yeast Saccharomyces cerevisiae (6), with each gene having 79 conditions (represented by a vector in 79-dimension space). The data set is made up of four clusters which had been previously determined experimentally, i.e. protein degradation, glycolysis, protein synthesis and chromatin.

Data input

The main input to EXCAVATOR is a file containing gene expression data, which can be obtained from public databases or through experimentation. EXCAVATOR recognizes several widely used data formats for gene expression profiles. The most common format (as the default EXCAVATOR input format) represents each gene using one line and all the entries for a gene are separated by tabs, including gene name, gene annotation and a set of expression levels.

Sometimes the gene expression data may not be complete for all data points for every gene. A user can choose how a missing data point should be handled in clustering analysis. The default in EXCAVATOR is to set the missing data to 1 for the ratio between the expression level under the given experimental condition and that of the reference state. The missing data can also be replaced by the average over other genes in the same column of the data series, the average over all the other known data points of the same gene or the average over two neighboring known data points of the same gene.
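As an illustration of the default input format and the per-gene options for missing data, the following sketch parses one tab-separated line and fills the gaps. It assumes that a missing data point appears as an empty field, which the text does not specify; the names `parse_row` and `fill_missing` and the method labels are hypothetical. The column-average option is omitted because it needs the whole data matrix.

```python
def parse_row(line):
    """Split one line into gene name, annotation and expression levels.
    Empty fields are read as missing values (None); this is an assumption."""
    fields = line.rstrip("\n").split("\t")
    name, annotation = fields[0], fields[1]
    values = [float(f) if f.strip() else None for f in fields[2:]]
    return name, annotation, values

def fill_missing(values, method="one"):
    """Replace None entries using one of three per-gene rules."""
    known = [v for v in values if v is not None]
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        if method == "one" or not known:
            # Default: a ratio of 1 relative to the reference state.
            filled[i] = 1.0
        elif method == "gene_mean":
            # Average over all known data points of the same gene.
            filled[i] = sum(known) / len(known)
        elif method == "neighbors":
            # Average over the nearest known points on either side.
            left = next((values[j] for j in range(i - 1, -1, -1)
                         if values[j] is not None), None)
            right = next((values[j] for j in range(i + 1, len(values))
                          if values[j] is not None), None)
            near = [x for x in (left, right) if x is not None]
            filled[i] = sum(near) / len(near)
        else:
            raise ValueError("unknown method: " + method)
    return filled
```

When a missing point lies at either end of the series, the `neighbors` rule here falls back to the single available neighbor; that choice is part of the sketch rather than documented behavior.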

A user can also select only differentially expressed genes and remove other genes from a data analysis. Such a selection can be done using two cut-off values v₁ and v₂. Specifically, if the minimum value among the expression levels of a gene under all conditions is larger than v₁ and/or the maximum value among its expression levels is less than v₂, this gene will be removed.
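The cut-off rule can be written as a small predicate. Because the text says 'and/or', the sketch exposes both readings through a hypothetical `require_both` flag; the function name and the default are assumptions made for illustration.

```python
def is_removed(values, v1, v2, require_both=True):
    """Decide whether a gene is filtered out as not differentially expressed.

    never_low:  the minimum expression level stays above v1.
    never_high: the maximum expression level stays below v2.
    """
    never_low = min(values) > v1
    never_high = max(values) < v2
    if require_both:
        return never_low and never_high
    return never_low or never_high

def select_genes(rows, v1, v2, require_both=True):
    """Keep only the (name, values) pairs that pass the filter."""
    return [(name, values) for name, values in rows
            if not is_removed(values, v1, v2, require_both)]
```

With `require_both=True`, a gene is dropped only when its profile stays strictly between the two cut-offs, so a gene that is strongly up-regulated under a single condition is retained.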

Similarity measure

The similarity measure is used to calculate the distance between gene expression profiles. EXCAVATOR has several options for the similarity measure: 1.0 – correlation coefficient (default); 1.0 – square of correlation coefficient; 1.0 – absolute value of correlation coefficient; Euclidean distance; square of Euclidean distance; sine square of the angle between two vectors of expression profiles; Manhattan distance (16); Mahalanobis distance (17).
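Most of the listed measures follow directly from their definitions. The sketch below is an illustration with hypothetical option names, not the EXCAVATOR implementation. The Mahalanobis distance is left out because it additionally requires a covariance matrix estimated from the data, and the correlation-based measures are undefined for a profile with zero variance.

```python
import math

def pearson(a, b):
    """Correlation coefficient of two equally long profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def cosine(a, b):
    """Cosine of the angle between two profile vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a)
    nb = sum(y * y for y in b)
    return dot / math.sqrt(na * nb)

def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical option names mapped to distance functions.
DISTANCES = {
    "1-cc": lambda a, b: 1.0 - pearson(a, b),
    "1-cc^2": lambda a, b: 1.0 - pearson(a, b) ** 2,
    "1-|cc|": lambda a, b: 1.0 - abs(pearson(a, b)),
    "euclidean": lambda a, b: math.sqrt(sq_euclidean(a, b)),
    "euclidean^2": sq_euclidean,
    # sin^2 of the angle equals 1 - cos^2 of the angle.
    "sin^2": lambda a, b: 1.0 - cosine(a, b) ** 2,
    "manhattan": lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
}
```

Note that the measures based on the square or absolute value of the correlation coefficient treat perfectly anti-correlated profiles as identical, whereas the default measure places them at the maximum distance of 2.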

Clustering methods

EXCAVATOR offers the following MST-based clustering methods, which use the selected similarity measure and objective function and were described in detail in our previous publication (12).

MST hierarchical clustering (default) to minimize hierarchically the sum of the distances between a gene and the center of its cluster.

MST iterative clustering (non-hierarchical) to minimize the sum of the distances between a gene and the center of its cluster, through an iterative procedure. The clustering result, while reaching a local minimum, may not reach the global optimal solution for the objective function.

MST optimal clustering (non-hierarchical) to minimize globally the sum of the distances between a gene and the best representative gene from the cluster. The global optimal solution is guaranteed, but it takes much longer than the other methods.

Single-linkage clustering by simply cutting the longest edges in the MST. It is the fastest method and it works well for obvious clusters. However, the result may not be the desired one for complicated clusters.
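The last and simplest of these methods, cutting the longest edges, can be sketched as follows. The function takes a list of MST edges as `(u, v, length)` tuples and removes the K − 1 longest ones; the name `cut_longest_edges` and the edge representation are assumptions made for illustration.

```python
def cut_longest_edges(n_nodes, mst_edges, k):
    """Split a spanning tree into k clusters by removing its k - 1 longest
    edges. Returns one cluster label per node."""
    if not 1 <= k <= n_nodes:
        raise ValueError("k must be between 1 and the number of nodes")
    # Keep all but the k - 1 longest edges.
    kept = sorted(mst_edges, key=lambda e: e[2])[: len(mst_edges) - (k - 1)]

    # Union-find over the kept edges gives the connected components.
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v, _ in kept:
        parent[find(u)] = find(v)

    # Relabel components 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n_nodes)]
```

For example, a chain of four nodes with edge lengths 1, 5 and 1 splits into two pairs when K = 2, because only the middle edge is removed.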

Number of clusters

EXCAVATOR provides three options to determine the number of clusters.

A user specifies the number of clusters.

A user determines the number of clusters based on the transition profile generated by EXCAVATOR (12). The transition profile T(K), where K is the number of clusters, is calculated based on the minimum value, Q(K), of the objective function for K, i.e.

where we define Q(0) = 0. Typically the highest T(K) value indicates the most ‘natural’ number of clusters. However, the user can specify the number of clusters based on a local maximum of T(K), which may provide an alternative meaningful number of clusters.

EXCAVATOR automatically determines the best number of clusters based on the maximum value of T(K).
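Given a transition profile, the second and third options amount to locating maxima. The sketch below takes already computed T(K) values as input and does not compute T(K) itself; the function names and the numbers in the usage note are hypothetical.

```python
def best_k(profile):
    """Return the K with the highest T(K); profile maps K to T(K)."""
    return max(profile, key=profile.get)

def local_maxima(profile):
    """Return every K whose T(K) exceeds that of both neighbors.
    An end point counts if it exceeds its single neighbor."""
    ks = sorted(profile)
    found = []
    for i, k in enumerate(ks):
        left = profile[ks[i - 1]] if i > 0 else float("-inf")
        right = profile[ks[i + 1]] if i + 1 < len(ks) else float("-inf")
        if profile[k] > left and profile[k] > right:
            found.append(k)
    return found
```

With made-up values such as `{2: 1.2, 3: 4.0, 4: 1.1, 5: 2.5, 6: 0.9}`, the automatic choice is K = 3, while K = 5 is a local maximum that a user might examine as an alternative.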

Constraints

EXCAVATOR allows the user to apply constraints during the clustering process so that certain specified genes will stay in the same cluster. The constraints can be specified using a file in which genes in the same line (separated by spaces or tabs) are required to belong to the same cluster. Another related option is to use a set of ‘seed’ genes (specified in the constraint file) for EXCAVATOR to find additional genes having a similar pattern to that of the seed genes, as demonstrated in Results.
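A constraint file of the kind described above is straightforward to read. The following sketch returns one group of gene names per non-empty line and a lookup from gene to group; the function names are hypothetical and the gene names in the usage note are used only as sample input.

```python
def parse_constraints(text):
    """Each non-empty line lists genes (separated by spaces or tabs)
    that must end up in the same cluster."""
    groups = []
    for line in text.splitlines():
        genes = line.split()  # splits on any run of spaces or tabs
        if genes:
            groups.append(genes)
    return groups

def group_of(groups):
    """Map every constrained gene to the index of its group."""
    return {gene: i for i, genes in enumerate(groups) for gene in genes}
```

For instance, `parse_constraints("PHO5 BAP2\tBAP3\n\nAGP1 TAT1\n")` yields two groups, one with three genes and one with two.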

Cluster identification using ordered representation plot

A feature that sets EXCAVATOR apart from other gene expression data analysis packages is that it can not only partition all the genes in a data set but also identify and extract data clusters from a noisy background. This is achieved by establishing an ordered representation plot (15,18,19) based on the relationship between the data and the result of Prim’s algorithm (20) for constructing a MST, as shown in Figure 3. The ordered representation plot provides a 1-dimensional profile of the edge distances in the MST, in the order in which Prim’s algorithm selects the edges. The plot gives a set of clusters, each corresponding to a ‘valley’ in the ordered representation plot. The statistical significance of a cluster can be assessed by a reliability value using the distances of the edges associated with the valley. All the related information is shown in the Java GUI, as in the example given in Figure 4.

Ordered representation plot. (a) Construction of an ordered representation plot. The construction starts from a complete graph representing all the genes in a data set and establishes the order of edges of the MST through a series of steps (partial solutions) in Prim’s algorithm. An initial partial solution is a singleton set containing an arbitrary node as the root. In each step, the current partial solution is repeatedly expanded by adding the node (not in the current solution, i.e. nodes connected by the dotted lines in the figure) that has the shortest edge to any node in the current solution, until all the nodes are in the current solution. The shortest edge added by each step forms a 1-dimensional profile as the ordered representation plot. (b) The relationship between a cluster and its ordered representation plot. A cluster is generally indicated by the nodes connected through a series of short edges bounded by two long edges (hence forming a valley) on an ordered representation plot. The statistical significance of the cluster can be assessed (18) by the distribution of the edge distances of such a valley on an ordered representation plot based on the null hypothesis presented as a Dirichlet distribution.
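The construction in panel (a) can be sketched as follows, starting from a precomputed distance matrix. The valley detection shown here uses a plain distance cut-off, which is a simplification introduced for illustration; it does not reproduce the statistical assessment of reference (18). The names `ordered_representation`, `valleys`, `cutoff` and `min_size` are hypothetical.

```python
def ordered_representation(dist_matrix, root=0):
    """Run Prim's algorithm from the root and record, for each step, the
    node that was added and the length of the edge that added it."""
    n = len(dist_matrix)
    in_tree = [False] * n
    best = list(dist_matrix[root])  # shortest edge from the tree to each node
    in_tree[root] = True
    order, lengths = [root], []
    for _ in range(n - 1):
        v = min((j for j in range(n) if not in_tree[j]), key=lambda j: best[j])
        order.append(v)
        lengths.append(best[v])
        in_tree[v] = True
        for j in range(n):
            if not in_tree[j] and dist_matrix[v][j] < best[j]:
                best[j] = dist_matrix[v][j]
    return order, lengths

def valleys(order, lengths, cutoff, min_size=2):
    """Group consecutively selected nodes that are joined by edges shorter
    than the cutoff; a long edge closes the current group."""
    groups, current = [], [order[0]]
    for node, d in zip(order[1:], lengths):
        if d < cutoff:
            current.append(node)
        else:
            if len(current) >= min_size:
                groups.append(current)
            current = [node]
    if len(current) >= min_size:
        groups.append(current)
    return groups
```

Plotting `lengths` against the selection step gives the 1-dimensional profile; runs of small values bounded by large ones are the valleys that indicate candidate clusters.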

The ordered representation plot for gene expression profiles. The middle red/green graphics show the expression profiles sorted by Prim’s algorithm, where red indicates up-regulation, green indicates down-regulation and black indicates no significant change in the expression level. To the left of the red/green graphics is the ordered representation plot. The left-most figure shows identified clusters. Each bar indicates a cluster (note that a cluster may have different boundaries with different levels of statistical significance), where darker colors (especially blue) indicate more significant clusters, corresponding to deeper valleys in the ordered representation plot. All the genes in the cluster marked in blue, and only these genes among the data set, are related to chromatin.

Comparison between different clustering results

EXCAVATOR provides many options to satisfy different needs of users. Most data analysis methods work best using the default options. However, in some cases other parameter settings may be more suitable; it may not be obvious which set of parameters is the best for a certain problem. Therefore, it is important to compare clustering results from the same input data set but with different parameters and clustering methods. For this purpose, EXCAVATOR provides a capability to facilitate such comparison and it outputs a value (12) between 0 (most different clustering results) and 1 (identical clustering results) to measure the similarity between two clustering results. If these clustering results are very similar, it means that the clustering is stable and likely to be reliable; otherwise the clustering results may require more manual evaluations. The comparison can also help address the issue of overlapping clusters, where some genes can appear in different clusters at the same time. If different clustering results only differ by a few genes when using different parameters, these genes are likely to be the overlapping genes in different clusters.
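The similarity value used by EXCAVATOR is defined in reference (12) and is not reproduced here. As a stand-in that illustrates the idea of scoring agreement between two clusterings on a 0 to 1 scale, the sketch below computes the Rand index, which is a different, widely used measure: the fraction of gene pairs on which the two results agree about whether the pair shares a cluster.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of gene pairs treated the same way by both clusterings:
    together in both, or apart in both. Equals 1.0 for identical results."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same genes")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

The index depends only on the grouping and not on the label names, so `[0, 0, 1, 1]` and `[1, 1, 0, 0]` are scored as identical. Genes whose pairings change between runs are natural candidates for the overlapping genes discussed above.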

RESULTS

We have tested EXCAVATOR on several applications (12,14,15). In addition, we applied EXCAVATOR to analyze expression data for genes involved in chitin elicitation in Arabidopsis thaliana (21), which led to the discovery of novel genes related to the process. In this paper, we will also compare EXCAVATOR with the popular K-means clustering method in terms of clustering quality and computing time. We will also show two examples of applications of EXCAVATOR with unconventional use of clustering.

Computing time

EXCAVATOR is very efficient in terms of computing time. The computational complexity of the algorithms used in EXCAVATOR was analyzed in our previous paper (12). Here we provide a benchmark on a single-CPU Linux workstation. As shown in Figure 5a for different numbers of genes in four clusters, all the EXCAVATOR algorithms other than the MST optimal clustering are much faster than the K-means method. The CPU time for the MST optimal clustering is not long (<70 s for >500 genes). In this case, as shown in Figure 5b, all the methods finished the clustering in 10 s, i.e. different methods do not make much difference. However, with comparable computing time to the K-means method, EXCAVATOR can deliver better results with well-defined mathematical properties. The computing time plots in Figure 5 have the expected computational complexity for MST-based methods, as discussed in our previous paper (12).

Quality comparison between EXCAVATOR and K-means clustering

Other than offering more features than other gene expression analysis tools, we found that EXCAVATOR generally outperforms the most popular clustering method, K-means (5), in partitioning expression profiles. As shown in our previous paper (12), other than MST iterative clustering, all the other clustering algorithms used in EXCAVATOR mathematically guarantee global optimal solutions of some objective functions. In contrast, the K-means method does not guarantee global optimality of the problem. Practical comparisons of clustering quality between different methods can be tricky and currently there is no gold standard for this purpose. The two common goals or criteria for clustering are: (i) to put elements that are close to each other, with respect to an appropriate distance measure, in the same cluster; (ii) to separate elements that are far apart into different clusters. A clustering method often performs well on one criterion but does poorly on another. Here we try to compare the methods on both criteria using the same data set [the rat CNS gene expression data (23)]. We applied an open source implementation of K-means (http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/index.html). For the first criterion, we applied a method developed by Yeung et al. (24) for validating clustering of gene expression data. This method used a jack-knife approach (25) that applies a clustering algorithm to the data from all but one experimental condition. The remaining condition is used to assess the predictive power of the resulting clusters. The quality of clustering is measured by the ‘2-norm figure of merit’ (24) as a function of the number of clusters. The 2-norm figure of merit FOM(m,n) for n clusters with the mth condition as the left-out is defined as follows:

$$\mathrm{FOM}(m,n) = \sqrt{\frac{1}{N} \sum_{i=1}^{n} \sum_{x \in C_i} \left[ R(x,m) - R_i(m) \right]^2}$$

where N is the total number of genes, C_i is the set of genes in cluster i, R(x,m) is the expression level of gene x at the mth condition and R_i(m) is the average gene expression level at the mth condition for the genes in cluster i. The total FOM for all the conditions is $\mathrm{FOM}(n) = \sum_{m=1}^{M} \mathrm{FOM}(m,n)$, where M is the total number of columns. For a given n, a lower FOM(n) value indicates better clustering quality in terms of the first criterion described above. As shown in Figure 6, for FOM(n), all three algorithms used in EXCAVATOR outperform the K-means method. For the second criterion, we measured the number of genes whose closest neighbors were in different clusters versus the number of clusters. The higher the number of closest neighbors in different clusters, the poorer the clustering quality. Again, all three algorithms used in EXCAVATOR outperform the K-means method using this measure. In short, our results show that EXCAVATOR performs better than K-means as measured by both criteria at the same time.
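Both quality measures can be sketched compactly. The code below follows the definitions in the text and accepts any clustering routine as a callable; the names `fom_left_out`, `total_fom`, `split_neighbors` and `cluster_fn` are hypothetical, and the Euclidean distance is assumed for finding the closest neighbor.

```python
import math

def fom_left_out(data, labels, m):
    """2-norm figure of merit for condition m, given cluster labels that
    were computed without using condition m."""
    clusters = {}
    for row, label in zip(data, labels):
        clusters.setdefault(label, []).append(row[m])
    total = 0.0
    for values in clusters.values():
        mean = sum(values) / len(values)  # R_i(m)
        total += sum((v - mean) ** 2 for v in values)
    return math.sqrt(total / len(data))

def total_fom(data, cluster_fn, n_clusters):
    """Sum of FOM(m, n) over all conditions (jack-knife over columns).
    cluster_fn(rows, n_clusters) must return one label per row."""
    total = 0.0
    for m in range(len(data[0])):
        reduced = [row[:m] + row[m + 1:] for row in data]  # drop column m
        labels = cluster_fn(reduced, n_clusters)
        total += fom_left_out(data, labels, m)
    return total

def split_neighbors(data, labels):
    """Count genes whose closest neighbor lies in a different cluster."""
    count = 0
    for i, a in enumerate(data):
        nearest = min((k for k in range(len(data)) if k != i),
                      key=lambda k: math.dist(a, data[k]))
        if labels[i] != labels[nearest]:
            count += 1
    return count
```

Lower values are better for both functions, which matches the orientation of the two comparison plots.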

A comparison between the EXCAVATOR clustering algorithms and the K-means clustering algorithm using the rat CNS data with 111 genes and nine conditions. (a) The predictive power of the clustering algorithm measured by the 2-norm FOM versus the number of clusters based on a jack-knife approach. (b) The separability quality of clusters measured by the number of genes whose closest neighbors are in different clusters versus the number of clusters. Any of the three EXCAVATOR algorithms, i.e. MST hierarchical (blue long dashed lines), MST iterative (brown dot-dashed lines) or MST global optimal (red dotted lines), outperforms the K-means algorithm (black solid lines) in both measurements. In both figures, the lower the y-value, the better the clustering quality.

Identifying cell cycle-regulated genes

Here we show an example of using EXCAVATOR to identify yeast genes whose transcription levels are cell cycle regulated. There are 6178 genes in the yeast S.cerevisiae, 104 of which are known to be cell cycle regulated (26). It was estimated that about 250 cell cycle-regulated genes might exist (27). The challenge is to identify the remaining 150 or so unknown cell cycle-regulated genes, based on the 104 known ones and the expression profiles of the 6178 genes. This type of problem occurs often in gene expression data analysis, i.e. identifying the rest of the genes associated with a particular biological process, knowing a subset of the genes associated with this process. Our working assumption is that genes that are cell cycle regulated have correlated expression patterns. We have used the gene expression data from http://cellcycle-www.stanford.edu/. In this data set, expression levels were collected at 82 different time points for each gene. The distance between two gene expression profiles, A and B, is defined as 1 – cc(A,B), where cc(A,B) is the correlation coefficient of vectors A and B. We have applied EXCAVATOR to identify the remaining 150 or so cell cycle-regulated genes from the 6178 gene genome. The basic idea is as follows. We first label all the 104 genes as being in the same cluster. We then select a threshold H to remove all edges in the MST with distance >H, in such a way that the cluster containing the 104 genes has approximately 250 genes. This procedure produced a cluster with 263 genes, including the 104 genes. We predict these to be cell cycle-regulated genes. Since there is no experimental data to verify our prediction, we conducted the following computational exercise, trying to estimate how reliable our above prediction is. We performed a data constrained clustering using 52 of the 104 genes as the ‘seeds’, which were randomly selected from the 104 genes. Figure 7a shows how the threshold H affects the composition of our target cluster. 
When the threshold value H reaches 0.2, our target cluster consists of a total of 279 data points, of which 79 are from the 104 gene set. When this value goes beyond 0.2, a large number of junk data points are included in our target cluster. The threshold 0.2 clearly represents a ‘transition point’. The shape of the dashed line and the additional 27 identified genes from the 104 gene set suggest that we should have some confidence in the 263 genes being cell cycle-regulated genes. For the general number of seeds to be included, Figure 7b shows the prediction accuracy when choosing ∼250 genes from all the yeast genes using different numbers of ‘seeds’ from the 104 gene set.

Identification of cell cycle-regulated genes in yeast. (a) Gene identification as a function of threshold H (x-axis) using 52 out of 104 genes as the seeds. The solid line shows the number of genes (y-axis on the left), from the 104 genes, included in our target cluster and the dashed line shows the total number of genes included in this cluster (y-axis on the right). (b) The number of genes from the 104 gene set that are included in the target cluster as a function of number of seeds used, when we chose ∼250 genes in total from all the yeast genes (solid line).
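The threshold procedure can be sketched as follows: remove every MST edge longer than H, then collect all genes that remain connected to at least one seed. Treating the seeds as one cluster regardless of whether they stay connected is our reading of the constraint described in the text; the function name and edge representation are hypothetical.

```python
def seed_cluster(n_nodes, mst_edges, seeds, h):
    """Return the nodes that stay connected to any seed after all MST
    edges with length greater than h are removed."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Keep only edges of length at most h.
    for u, v, d in mst_edges:
        if d <= h:
            parent[find(u)] = find(v)

    # The seeds are constrained to one cluster, so the target cluster is
    # the union of all components that contain a seed.
    seed_roots = {find(s) for s in seeds}
    return [i for i in range(n_nodes) if find(i) in seed_roots]
```

Scanning H from small to large values and recording `len(seed_cluster(...))` reproduces the kind of curve shown in panel (a), where a sharp rise in cluster size marks the transition point.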

To verify the statistical significance of our results, we calculated the P value for finding 27 (i.e. 79 genes minus 52 seeds) correct genes or more out of 52 non-seed (known) genes by chance. For this purpose, we calculated P(i,N,M,Q), i.e. the probability of obtaining i known genes out of M genes in a random sample of volume Q from N total genes is

$$P(i,N,M,Q) = \frac{\binom{M}{i} \binom{N-M}{Q-i}}{\binom{N}{Q}}$$

where N = 6178, M = 52, Q = 211 (i.e. 263 genes minus 52 seeds) and i = 27, 28, 29, …, 52. By summing P(i,N,M,Q) for all i, we obtained P = 1.55 × 10^–15, which means that our result is highly significant statistically.
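The probability described above, for a sample drawn without replacement, corresponds to the hypergeometric distribution, and the tail sum can be evaluated with exact integer binomial coefficients. The sketch below is an illustration under that reading; the function names are hypothetical.

```python
from math import comb

def hypergeom_pmf(i, n_total, m_known, q_sample):
    """Probability that a random sample of q_sample genes, drawn without
    replacement from n_total genes, contains exactly i of the m_known genes."""
    return (comb(m_known, i) * comb(n_total - m_known, q_sample - i)
            / comb(n_total, q_sample))

def tail_probability(i_min, n_total, m_known, q_sample):
    """Probability of obtaining i_min or more of the known genes."""
    upper = min(m_known, q_sample)
    return sum(hypergeom_pmf(i, n_total, m_known, q_sample)
               for i in range(i_min, upper + 1))

# Parameters as given in the text.
p_value = tail_probability(27, n_total=6178, m_known=52, q_sample=211)
```

Because `math.comb` returns exact integers, each term is rounded only once, at the final division, which avoids the loss of precision that intermediate floating-point factorials would cause.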

Cluster identification in gene expression profiles

One of the main problems with existing clustering techniques is that they are generally inadequate in identifying ‘dense’ data clusters from a noisy background as they are designed to partition a data set into ‘clusters’ regardless of whether any ‘clusters’ exist or how many exist. Our cluster identification method using ordered representation plots, as described in Materials and Methods, can solve this problem. We have applied the method to a number of gene expression data sets. The example we show here is a set of 145 differentially expressed genes from yeast under experimental conditions that allow us to identify genes possibly involved in the amino acid transport pathway. Figure 8 shows a part of the ordered representation for the data set. We can clearly see that there is a ‘dense’ cluster in the middle of the figure. Five genes are in this dense cluster: PHO5, BAP2, BAP3, AGP1 and TAT1. Based on our previous knowledge and experimental results (28), we know that these five genes are part of the amino acid transport pathway in yeast. This information, which cannot be obtained by other methods, is very important in understanding the pathway.

The ordered representation plot for gene expression data related to yeast amino acid transport, where the valley in the middle of the plot indicates a ‘dense’ cluster.

SUMMARY

In this paper, we have presented a new software package EXCAVATOR for analyzing gene expression data. Although many software packages are available to analyze gene expression data, EXCAVATOR is unique in many ways. EXCAVATOR is based on a new data representation framework, which provides a solid mathematical foundation for clustering and classification. It has efficient implementations of clustering algorithms with guaranteed mathematical properties (12), including global optimality. It also has strong capabilities in handling data clusters with complex cluster boundaries and overcoming problems caused by background noise. EXCAVATOR also offers a wide range of options and features, including various definitions of distance between expression profiles, different clustering algorithms, data constrained clustering, automatic selection of the most plausible number of clusters, removal of background noise, identification of genes with similar expression profiles to a set of specified seed genes, comparison of clustering results using different parameters and algorithms, etc. The Java GUI interface of the stand-alone version and the Web interface allow a user with little computing experience to use the program easily and produce many graphics to aid the user in evaluating and understanding the results. The Web server can obviate the need to install the software and provide the computing power on the server side. To our knowledge this is the only gene expression data analysis package that provides a Unix/Linux/DOS command line option, a stand-alone version with a GUI and a Web interface together. With all the unique features, we believe EXCAVATOR will become a powerful tool for the gene expression data analysis community.

ACKNOWLEDGEMENTS

We thank Manesh Shah, Yu Chen, Shauna Somerville, Keith M. Goldstein, Jeffrey M. Becker, Robin Zimmer and Morey Parang for helpful discussions and evaluation of the software. This work was supported in part by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, under contract DE-AC05-00OR22725, managed by UT-Battelle LLC. It was also funded in part by the US Department of Energy’s Genomes to Life program (www.doegenomestolife.org) under project ‘Carbon Sequestration in Synechococcus sp.: From Molecular Machines to Hierarchical Modeling’ (www.genomes-to-life.org).

REFERENCES

1. Chu,S., DeRisi,J., Eisen,M., Mulholland,J., Botstein,D., Brown,P.O. and Herskowitz,I. (1998) The transcriptional program of sporulation in budding yeast. Science, 282, 699–705.
2. DeRisi,J.L., Iyer,V.R. and Brown,P.O. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278, 680–686.
3. Golub,T.R., Slonim,D.K., Tamayo,P., Huard,C., Gaasenbeek,M., Mesirov,J.P., Coller,H., Loh,M.L., Downing,J.R., Caligiuri,M.A. et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
4. Tamayo,P., Slonim,D., Mesirov,J., Zhu,Q., Dmitrovsky,E., Lander,E.S. and Golub,T.R. (1999) Interpreting gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, 96, 2907–2912.
5. Raychaudhuri,S., Sutphin,P.D., Chang,J.T. and Altman,R.B. (2001) Basic microarray analysis: grouping and feature reduction. Trends Biotechnol., 19, 189–193.
6. Eisen,M.B., Spellman,P.T., Brown,P.O. and Botstein,D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868.
7. Dysvik,B. and Jonassen,I. (2001) J-Express: exploring gene expression data using Java. Bioinformatics, 17, 369–370.
8. Sturn,A., Quackenbush,J. and Trajanoski,Z. (2002) Genesis: cluster analysis of microarray data. Bioinformatics, 18, 207–208.
9. Tamayo,P., Slonim,D., Mesirov,J., Zhu,Q., Kitareewan,S., Dmitrovsky,E., Lander,E.S. and Golub,T.R. (1999) Interpreting patterns of gene expression with self organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, 96, 2907–2912.
10. Herwig,R., Poustka,A.J., Müller,C., Bull,C., Lehrach,H. and O’Brien,J. (1999) Large-scale clustering of cDNA-fingerprinting data. Genome Res., 9, 1093–1105.
11. Yeung,K.Y. and Ruzzo,W.L. (2001) Principal component analysis for clustering gene expression data. Bioinformatics, 17, 763–774.
12. Xu,Y., Olman,V. and Xu,D. (2002) Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics, 18, 526–535.
13. Graham,R.L. and Hell,P. (1985) On the history of the minimum spanning tree problem. Ann. Hist. Comput., 7, 43–57.
14. Xu,Y., Olman,V. and Xu,D. (2001) Minimum spanning trees for gene expression data clustering. In Miyano,S., Shamir,R. and Takagi,T. (eds), Proceedings of the 12th International Conference on Genome Informatics (GIW). Universal Academy Press, Tokyo, Japan, pp. 24–33.
15. Olman,V., Xu,D. and Xu,Y. (2003) Solving data clustering problem as a string search problem. In Bozdogan,H. (ed.), Proceedings of the Conference on Statistical Data Mining and Knowledge Discovery. Chapman & Hall/CRC, Boca Raton, FL, pp. 417–434.
16. Skiena,S. (1990) Implementing Discrete Mathematics: Combinatorics and Graph Theory with Mathematica. Addison-Wesley, Reading, MA.
17. Bar-Shalom,T. and Fortmann,T.E. (1988) Tracking and Data Association. Academic Press, New York, NY.
18. Olman,V., Xu,D. and Xu,Y. (2003) Identifications of regulatory binding sites using minimum spanning trees. In Proceedings of the 2003 Pacific Symposium on Biocomputing (PSB). World Scientific Pub. Co., Lihue, Hawaii, pp. 327–338.
19. Olman,V., Xu,D. and Xu,Y. (2003) CUBIC: identifications of regulatory binding sites through data clustering. J. Bioinform. Comput. Biol., 1, 21–40.
20. Prim,R.C. (1957) Shortest connection networks and some generalizations. Bell Syst. Tech. J., 36, 1389–1401.
21. Ramonell,K.M., Zhang,B., Ewing,R.M., Chen,Y., Xu,D., Stacey,G. and Somerville,S. (2002) Microarray analysis of chitin elicitation in Arabidopsis thaliana. Mol. Plant Pathol., 3, 301–311.
22. Iyer,V.R., Eisen,M.B., Ross,D.T., Schuler,G., Moore,T., Lee,J.C.F., Trent,J.M., Staudt,L.M., Hudson,J.,Jr, Boguski,M.S. et al. (1999) The transcriptional program in the response of human fibroblasts to serum. Science, 283, 83–87.
23. Wen,X., Fuhrman,S., Michaels,G.S., Carr,D.B., Smith,S., Barker,J.L. and Somogyi,R. (1998) Large-scale temporal gene expression mapping of central nervous system development. Proc. Natl Acad. Sci. USA, 95, 334–339.
24. Yeung,K.Y., Haynor,D.R. and Ruzzo,W.L. (2001) Validating clustering for gene expression data. Bioinformatics, 17, 309–318.
25. Efron,B. (1982) The Jackknife, the Bootstrap and Other Resampling Plans. Society for Industrial and Applied Mathematics, Philadelphia, PA.
26. Spellman,P.T., Sherlock,G., Zhang,M.Q., Iyer,V.R., Anders,K., Eisen,M.B., Brown,P.O., Botstein,D. and Futcher,B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297.
27. Price,C., Nasmyth,K. and Schuster,T. (1991) A general approach to the isolation of cell cycle-regulated genes in the budding yeast, Saccharomyces cerevisiae. J. Mol. Biol., 218, 543–556.
28. Chen,Y., Liu,Y., Goldstein,K.M., Becker,J.M., Xu,Y. and Xu,D. (2003) A computational study on the signal transduction pathway for amino acid and peptide transport in yeast: bridging the gap between high-throughput data and traditional biology. Appl. Genomics Proteomics, 2, 43–50.
