MIClique: An Algorithm to Identify Differentially Coexpressed Disease Gene Subset from Microarray Data

Huanping Zhang; Xiaofeng Song; Huinan Wang; Xiaobai Zhang

doi:10.1155/2009/642524

2010 Jan 20;2009:642524. doi: 10.1155/2009/642524


PMCID: PMC2822236 PMID: 20169000

Abstract

Computational analysis of microarray data has provided an effective way to identify disease-related genes. Traditional methods of disease gene selection from microarray data, such as statistical tests, typically focus on genes that are differentially expressed in different samples and prioritize genes individually. These traditional methods might miss differentially coexpressed (DEC) gene subsets because they ignore the interactions between genes. In this paper, the MIClique algorithm is proposed to identify DEC gene subsets based on mutual information and clique analysis. Mutual information is used to measure the coexpression relationship between each pair of genes in two different kinds of samples. Clique analysis is commonly used in biological networks, where a clique generally represents a biological module of similar function. By applying the MIClique algorithm to real gene expression data, some DEC gene subsets, which are correlated under one experimental condition but uncorrelated under another, are detected from the graphs of a colon dataset and a leukemia dataset.

1. Introduction

Microarray data can provide much useful information for disease gene identification and medical diagnosis because a microarray can measure the expression levels of thousands of genes simultaneously [1]. Among this huge number of genes, only a small fraction show strong correlation with a certain phenotype. Many statistical and supervised methods, such as the t-test and neural networks, are used to mine genes that are differentially expressed under different conditions [2, 3]. However, these gene selection techniques are often based on individual gene prioritization, measuring the correlation of each gene with particular disease types. An individual gene prioritization list does not indicate interaction relationships among genes, so these traditional techniques might miss differentially coexpressed (DEC) gene subsets, which are defined to be highly correlated under one experimental condition but uncorrelated under another [4]. Disease-related differentially coexpressed genes are those that exhibit similar expression patterns in normal samples but share no similarity in disease samples. Figure 1 depicts simulated differentially coexpressed disease genes in normal samples (samples 1–20) and disease samples (samples 21–40). The coexpression pattern in normal samples disappears in disease samples.

Figure 1. Illustration of a differentially coexpressed (DEC) disease gene subset in normal samples and disease samples. The left 20 samples are normal samples and the right 20 samples are disease samples.

Identification of disease-specific DEC gene subsets is very helpful for disease diagnosis and clinical treatment. DEC genes should be analyzed as gene subsets instead of as individual genes. Clustering algorithms are often used to find gene groups that display similar expression profiles [5, 6]. However, DEC genes show highly correlated expression patterns only in one biological state, not across the entire dataset. Biclustering is a method to identify gene subsets exhibiting consistent patterns over a subset of experimental conditions, but it is still not suitable for identifying DEC gene groups because the selected experimental conditions may not belong to the same biological state [7, 8].

Kostka and Spang proposed the first method to investigate DEC gene subsets, using an additive model and a stochastic search algorithm [9]. AlteredExpression is an improved algorithm based on the additive model that detects optimal DEC gene subsets with the best RRV (ratio of residual variance between two different samples) and minimal F-score [10]. Varadan and Anastassiou proposed an approach called Entropy Minimization and Boolean Parsimony (EMBP) to identify gene subsets whose joint expression state predicts the presence or absence of a particular disease with minimum uncertainty [4]. coXpress was developed to identify groups of genes that are differentially coexpressed in different biological states, using a resampling method to calculate a t-value for each clustered group [11]. These methods consider all possible gene subsets by searching the whole dataset, which imposes a huge computational burden as the number of genes increases.

In this paper, the MIClique algorithm is proposed to explore DEC gene subsets in an intuitive way based on mutual information (MI) and clique analysis. Mutual information is used to measure the coexpression relationship between each pair of genes in two different kinds of samples, and the resulting symmetric mutual information matrices are then binarized using two threshold values. The adjacency matrix of a graph is obtained by a logical operation on the binarized matrices, with vertices corresponding to genes and edges corresponding to relationships between genes. Gene cliques detected by MIClique represent DEC gene subsets, which are highly correlated under one experimental condition but uncorrelated under another.

2. Materials and Methods

2.1. Mutual Information (MI)

The interaction relationships among genes are complex and include both linear and nonlinear relationships. Compared with linear similarity measures such as Euclidean distance and Pearson correlation [12, 13], mutual information is a general measure of statistical dependence between variables that can detect any type of functional relationship, and it is widely used in gene expression analysis [14]. To apply MI to gene expression data, the continuous experimental data need to be partitioned into discrete intervals or bins [15]. Entropy and MI are two central concepts of Shannon's theory of information [16]. Table 1 describes the related concepts of MI.

Table 1.

Concepts of entropy and MI defined by Shannon's theory of information.

Concepts of Shannon's theory of information	Descriptions
$H(X) = -\sum_{x} p(x) \log_{2} p(x)$	The uncertainty of a random variable X is measured by its entropy H(X); p(x) is the probability distribution of X
$H(X \mid Y) = -\sum_{x} \sum_{y} p(x, y) \log_{2} p(x \mid y)$	The uncertainty of a random variable X given knowledge of another random variable Y is measured by the conditional entropy H(X ∣ Y)
$H(X, Y) = -\sum_{x} \sum_{y} p(x, y) \log_{2} p(x, y)$	The uncertainty of a pair of random variables X, Y is measured by the joint entropy H(X, Y)
$H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$
$MI(X; Y) = \sum_{x} \sum_{y} p(x, y) \log_{2} \frac{p(x, y)}{p(x) p(y)}$	Given two random variables X and Y, the amount of information that each one of them provides about the other is the mutual information MI(X; Y)
$MI(X; Y) = H(X) + H(Y) - H(X, Y)$


The physical meaning of MI(X; Y) is the reduction in the uncertainty of X due to knowledge of Y (or vice versa). Note that H(X) = MI(X; X), so entropy is the self-information. MI(X; Y) is nonnegative and equals zero if and only if X and Y are statistically independent, meaning that the variables X and Y do not follow any kind of dependence.
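The definitions of entropy and MI can be illustrated with a minimal Python sketch. It assumes equal-width binning and plain empirical frequencies as probability estimates; the paper does not state which binning scheme or estimator was used, so the function names (`discretize`, `entropy`, `mutual_information`), the number of bins, and the toy expression values are all hypothetical.

```python
import math
from collections import Counter

def discretize(values, n_bins):
    """Equal-width binning: map each continuous value to a bin index 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall into bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def entropy(labels):
    """Shannon entropy in bits, estimated from empirical frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    """MI(X; Y) = H(X) + H(Y) - H(X, Y) for two equally long discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# Hypothetical expression profiles of two genes over eight samples.
gene_a = [0.1, 0.4, 0.9, 1.6, 2.5, 3.6, 4.9, 6.4]
gene_b = [2 * v + 1 for v in gene_a]  # perfectly dependent on gene_a

a = discretize(gene_a, 4)
b = discretize(gene_b, 4)
print(mutual_information(a, b), entropy(a))
```

Because `gene_b` is a deterministic function of `gene_a`, both profiles fall into the same bins and the MI equals the entropy of either one, in line with the identity H(X) = MI(X; X). With real data, the estimate depends noticeably on the number of bins and on the number of samples.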

2.2. Clique Enumeration of Graph Theory

Graph theoretical concepts are useful for the description and analysis of relationships in biological systems. Clique analysis is a core component of graph analysis in many biological applications, such as gene expression network analysis, cis-regulatory motif finding, and matching of three-dimensional molecular structures [17]. A clique generally represents a biological module with similar function and biological annotations.

For a simple undirected graph G with a set of vertices and a set of edges, two vertices are called adjacent if they are joined by an edge. The degree of a vertex is the number of edges connected to it; thus the degree of an isolated vertex is zero. The weight of an edge is a value attached to the connection between a pair of vertices, which might represent a cost, a length, a correlation, and so forth. A complete graph is a graph in which every pair of nodes is joined by an edge. A clique is a complete subgraph, so all pairs of vertices in the clique are connected. A maximal clique is a clique not contained in any other complete subgraph. The adjacency matrix of an undirected graph is a symmetric matrix B = (b_ij) in which the entry b_ij = 1 if node i and node j are connected by an edge and 0 otherwise. If the graph is a clique, then B is a matrix with 1 off the diagonal and 0 on the diagonal. If the graph contains a clique, the adjacency matrix of that clique is a submatrix of B. Identification of all maximal cliques in a graph is the problem of clique enumeration [18]. Bioconductor, the open project for the analysis and comprehension of genomic data, provides a large collection of software for working with graphs and cliques [19]. Some social network analysis tools are also efficient in clique analysis [20].
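Clique enumeration can be sketched with the basic Bron-Kerbosch recursion. This is only an illustration of the problem; the paper relies on existing graph software and does not say which enumeration algorithm was used. The function name `maximal_cliques` and the toy edge list are hypothetical.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques of an undirected graph.

    adj maps each vertex to the set of its neighbours (no self-loops).
    Basic Bron-Kerbosch recursion without pivoting.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend r, x: already processed.
        if not p and not x:
            cliques.append(sorted(r))  # r cannot be extended, so it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)

# Toy graph: a triangle a-b-c plus a pendant edge c-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
adj = {v: set() for e in edges for v in e}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

print(maximal_cliques(adj))
```

The triangle and the pendant edge are reported as two separate maximal cliques that overlap in vertex c. The number of maximal cliques can grow exponentially with graph size, which is one reason to keep the graph sparse before enumeration.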

For imperfect systems or experimental data, however, the requirement of complete connectivity for maximal cliques is stringent, so more general notions of cohesive subgroups should be considered, including n-cliques, k-plexes, and k-cores [21]. For an undirected and unweighted graph, a commonly used measure of network cohesion is density, which is the ratio of the number of edges actually present in the graph to the maximum possible number of edges. A large density indicates high interconnectedness and cohesion in the network. The density of a clique is 1.

2.3. The Main Process of MIClique

For a microarray dataset E = (e_ij) of size N × S, involving N genes and S samples, e_ij is the expression value of the ith gene in the jth sample. The sample set is divided into two subsets, S₁ (normal samples) and S₂ (disease samples), so E is correspondingly divided into E₁ (N × S₁) and E₂ (N × S₂). Differentially coexpressed disease genes are those with high mutual information values in normal samples but low MI values in disease samples.

The detailed process of MIClique is as follows.

Step 1 —

Calculate the mutual information of each pair of genes in E₁ and in E₂ to obtain two square symmetric mutual information matrices MI₁ and MI₂, each of size N × N. A large value of MI₁(i, j) means that gene i and gene j are strongly coexpressed in normal samples, while a low value represents weak coexpression.

Step 2 —

Binarize the mutual information matrices by selecting two threshold values T₁ and T₂ (T₁ > T₂) for MI₁ and MI₂, respectively, as follows.

If MI₁(i, j) ≥ T₁, then M₁(i, j) = 1, else M₁(i, j) = 0.

If MI₂(i, j) ≤ T₂, then M₂(i, j) = 1, else M₂(i, j) = 0.

M(i, j) = M₁(i, j) & M₂(i, j).

If i = j then M(i, j) = 0.

The matrices M₁ and M₂ are the binarized mutual information matrices for MI₁ and MI₂. M is a symmetric logical matrix obtained by the “AND” operation on M₁ and M₂. If M(i, j) = 1, gene i and gene j are coexpressed in normal samples but lose this coexpression in disease samples.

Step 3 —

The matrix M can be treated as the adjacency matrix of a graph G with vertices corresponding to genes and edges corresponding to biological interactions. There is an edge between vertices i and j in G if M(i, j) = 1. The DEC disease genes, which present a similar expression pattern in normal samples but suffer a distinct alteration in disease samples, are represented as a completely connected subgraph. The problem of identifying DEC disease gene subsets is thus converted into clique detection based on the adjacency matrix.
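Steps 2 and 3 above can be sketched as follows. The code takes two precomputed MI matrices as nested lists and applies the thresholding and the AND operation exactly as defined in Step 2; the function name `miclique_adjacency` and the 4 × 4 MI values are hypothetical and are not taken from the paper's data.

```python
def miclique_adjacency(mi1, mi2, t1, t2):
    """Return M = M1 AND M2 with a zero diagonal.

    M1(i, j) = 1 if MI1(i, j) >= T1 (coexpressed in normal samples);
    M2(i, j) = 1 if MI2(i, j) <= T2 (not coexpressed in disease samples).
    """
    n = len(mi1)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and mi1[i][j] >= t1 and mi2[i][j] <= t2:
                m[i][j] = 1
    return m

# Hypothetical symmetric MI matrices for four genes.
mi_normal = [
    [3.0, 2.4, 2.3, 0.5],
    [2.4, 3.0, 2.5, 0.7],
    [2.3, 2.5, 3.0, 0.6],
    [0.5, 0.7, 0.6, 3.0],
]
mi_disease = [
    [3.0, 0.6, 0.9, 0.4],
    [0.6, 3.0, 1.1, 0.8],
    [0.9, 1.1, 3.0, 0.5],
    [0.4, 0.8, 0.5, 3.0],
]

M = miclique_adjacency(mi_normal, mi_disease, t1=2.2, t2=1.0)
for row in M:
    print(row)
```

In this toy example, genes 1 and 2 (zero-based indices) are strongly coexpressed in the normal samples, but their MI in the disease samples (1.1) stays above T₂, so no edge is drawn between them. Gene 3 has low MI with every other gene in the normal samples and remains an isolated vertex.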

2.4. Threshold Selection

The selection of the threshold values T₁ and T₂ is very important for the biological interpretation of the results, because different threshold values lead to different results. If T₁ is high and T₂ is low, the graph has few edges and many isolated vertices. As T₁ decreases and T₂ increases, more edges are added to the graph, until it is completely connected. A graph with a large number of isolated vertices will generally contain few or no cliques, whereas too many edges will produce many overlapped cliques, which are also not very informative for data analysis. Proper thresholds lead to a reasonable percentage of isolated vertices and reasonable experimental results. The threshold values depend on the data source, the data type, and so forth, and they can be selected on the basis of the graph density and the percentage of isolated vertices. Figure 2 gives the gene networks obtained by the MIClique algorithm for normalized simulated gene data. The percentage of isolated vertices decreases and the number of edges increases as T₁ decreases and T₂ increases.
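The two selection criteria, graph density and percentage of isolated vertices, can be computed for a grid of candidate thresholds. The sketch below assumes precomputed MI matrices; the function name `graph_stats` and the 4 × 4 MI values are hypothetical, and the three threshold pairs are the ones listed for Figure 2.

```python
def graph_stats(mi1, mi2, t1, t2):
    """Return (density, fraction of isolated vertices) of the MIClique graph."""
    n = len(mi1)
    degree = [0] * n
    edges = 0
    for i in range(n):
        for j in range(i + 1, n):  # each unordered gene pair once
            if mi1[i][j] >= t1 and mi2[i][j] <= t2:
                edges += 1
                degree[i] += 1
                degree[j] += 1
    density = edges / (n * (n - 1) / 2)
    isolated = sum(1 for d in degree if d == 0) / n
    return density, isolated

# Hypothetical symmetric MI matrices for four genes.
mi_normal = [
    [3.0, 2.4, 2.3, 0.5],
    [2.4, 3.0, 2.5, 0.7],
    [2.3, 2.5, 3.0, 0.6],
    [0.5, 0.7, 0.6, 3.0],
]
mi_disease = [
    [3.0, 0.6, 0.9, 0.4],
    [0.6, 3.0, 1.1, 0.8],
    [0.9, 1.1, 3.0, 0.5],
    [0.4, 0.8, 0.5, 3.0],
]

for t1, t2 in [(2.2, 0.8), (2.0, 1.0), (1.8, 1.2)]:
    density, isolated = graph_stats(mi_normal, mi_disease, t1, t2)
    print(f"T1={t1} T2={t2} density={density:.2f} isolated={isolated:.0%}")
```

On this toy input the density rises and the share of isolated vertices falls as T₁ is lowered and T₂ is raised, which is the trend described above. How to trade the two quantities off against each other is left to the analyst.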

Figure 2. Gene networks for simulated gene data with different thresholds. (a) T₁ = 2.2; T₂ = 0.8; (b) T₁ = 2.0; T₂ = 1.0; (c) T₁ = 1.8; T₂ = 1.2.

3. Results and Discussion

Real gene expression data, including a colon dataset and a leukemia dataset, are selected to illustrate the application of the proposed MIClique algorithm [22, 23]. The colon dataset contains the expression levels of the 2000 genes with the highest minimal intensity, selected from 6500 genes, across 62 samples (40 tumor samples and 22 normal samples). The dataset was normalized before further data analysis. The leukemia dataset contains gene expression profiles of two types of acute leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), measured using Affymetrix high-density oligonucleotide arrays. The dataset contains 7129 human genes, 47 cases of ALL (38 B-cell ALL and 9 T-cell ALL), and 25 cases of AML. Only 3374 genes remained after data preprocessing.

3.1. Results of Colon Dataset

Different threshold values are evaluated for the colon dataset. Figure 3 gives the percentage of isolated vertices and the density of the graph (the number of edges present in the graph divided by the maximum possible number of edges, which is $C_{2000}^{2}$). The final thresholds for the colon data are selected as T₁ = 2.2 and T₂ = 1.0. The data are then transformed into a gene network by the MIClique algorithm.

Figure 3. Different threshold values lead to different results for the colon dataset. (a) Percentage of isolated vertices. (b) Density of the graph (number of edges divided by the maximum possible number of edges, which is $C_{2000}^{2}$).

The maximal cliques are detected from this gene network, with a minimum clique size of four. An overlapped clique group with six cliques and eight genes is found. Table 2 lists the gene accession numbers in each clique, and Figure 4 displays the overlapped clique group graphically. These tightly overlapped cliques form a cohesive subgroup. There are eight vertices and 19 edges in the cohesive subgroup, giving a density of 0.68 (the maximum possible number of edges is $C_{8}^{2}$ = 28).
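As a check on the arithmetic, the density of the cohesive subgroup can be written out explicitly. The symbols m (number of edges) and n (number of vertices) are introduced here only for illustration; the count of 19 edges agrees with the number of distinct gene pairs covered by the six cliques of Table 2.

```latex
\[
\text{density} = \frac{m}{C_{n}^{2}} = \frac{2m}{n(n-1)}
  = \frac{2 \times 19}{8 \times 7} = \frac{19}{28} \approx 0.68
\]
```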

Table 2.

Gene accession numbers in each clique identified by MIClique from the colon dataset.

Clique number	Genes in each clique
1	M63391 H64489 R87126 X74295
2	H64489 R87126 T92451 X74295
3	H64489 R87126 X74295 J02854
4	R87126 X74295 X86693 U19969
5	R87126 X74295 J02854 U19969
6	M63391 R87126 X74295 U19969


Figure 4. The cohesive subgroup identified from the colon dataset: the overlapped clique group with six cliques and eight genes.

Figure 5 shows the MI values of the eight genes, where each plot represents the MI matrix in either the normal samples or the disease samples. Each MI value in the matrix is drawn as a square whose color represents the magnitude of the value. The color scale runs from black to white, with black representing the smallest MI value and white the largest. The MI values range from 2.072 to 2.477 in normal samples and from 0.508 to 1.095 in disease samples. This view shows all the MI values in an intuitive way. These eight genes form a differentially coexpressed gene subset, which is a disease-related gene module identified by the MIClique algorithm. Table 3 lists the GenBank accession numbers, the gene symbols, the accession numbers in UniProtKB (UniProt Knowledgebase), and the gene descriptions given by the colon data. UniProtKB is the central hub for the collection of information on proteins, such as amino acid sequence, protein name or description, taxonomic data, and biological ontology [24]. Figure 6 depicts the gene expression profiles of the eight genes in normal and disease samples. As shown in Figure 6, these genes are highly coexpressed in normal samples (samples 1–22), while the coexpression pattern disappears in disease samples (samples 23–62).

Figure 5. Images of the MI matrices for the eight genes in the colon dataset. (a) Normal samples. (b) Disease samples.

Table 3.

Eight differentially coexpressed genes in cohesive subgroup identified from colon dataset.

Accession number	Gene symbol	UniProtKB ID	Gene descriptions
M63391	DESMIN (DES)	P17661	Human desmin gene, complete cds
H64489	CD37	P11049	Leukocyte antigen CD37 (Homo sapiens)
R87126	MYH9_HUMAN	P14105	Myosin heavy chain, nonmuscle (Gallus gallus)
T92451	TPM2	P07951	Tropomyosin, Fibroblast and epithelial muscle-type (Human)
X74295	ITGA7	Q13683	H.sapiens mRNA for alpha 7B integrin
J02854	MYL2	P10916	Myosin regulatory light chain 2, smooth muscle isoform (Human)
X86693	SPARCL1(Hevin)	Q14515	H.sapiens mRNA for hevin like protein
U19969	ZEB1(ZEB)	Q13088	Human two-handed zinc finger protein ZEB mRNA


Figure 6. Differentially coexpressed profiles of the eight genes in the two kinds of samples; samples 1–22 are normal samples and samples 23–62 are disease samples.

Table 4 lists the gene annotations of the eight genes from Gene Ontology (GO), obtained with the AmiGO search tool. GO is a database that supports biologically meaningful annotation of the molecular function, biological process, and cellular component of gene products [25]. As observed in Table 4, some of the genes share common biological functions and are involved in the same biological processes, such as muscle development, calcium ion binding, and regulation of striated muscle contraction. The results of Aigner et al. showed that ZEB1 is associated with human colorectal cancer and that ZEB1 is a key player in the pathologic epithelial to mesenchymal transition (EMT) associated with tumour progression [26]. Claeskens et al. showed that Hevin is downregulated in many cancers and that Hevin may be a potential target for cancer diagnosis and therapy [27]. The results obtained by MIClique for the colon dataset also coincide with those of other researchers. For example, all eight genes are included in the differentially expressed genes selected for the colon dataset by a unified framework [28], and some of these genes are consistent with the results of other researchers [29–31].

Table 4.

GO annotations of eight DEC genes identified from colon dataset by MIClique.

Gene Symbol	Ontology	GO Terms
DESMIN	Biological process	Cytoskeleton organization; muscle contraction; regulation of heart contraction
	Cellular component	Z disc
	Molecular function	Protein binding; structural constituent of cytoskeleton

CD37	Biological process	Protein amino acid N-linked glycosylation
CD37	Cellular component	Plasma membrane; integral to plasma membrane

MYH9	Biological process	Actin cytoskeleton reorganization; actin filament-based movement; angiogenesis; blood vessel endothelial cell migration; cytokinesis; membrane protein ectodomain proteolysis; monocyte differentiation; platelet formation; protein transport; regulation of cell shape
	Cellular component	Cleavage furrow; contractile ring; cytoplasm; cytosol; integrin complex; nucleus; plasma membrane; ruffle; stress fiber
	Molecular function	Actin filament binding; ATPase activity; microfilament motor activity; protein anchor; protein homodimerization activity

TPM2	Biological process	Regulation of ATPase activity
	Cellular component	Muscle thin filament tropomyosin
	Molecular function	Actin binding; structural constituent of muscle

ITGA7	Biological process	Cell-matrix adhesion; muscle organ development; integrin-mediated signaling pathway
ITGA7	Molecular function	Calcium ion binding; protein binding; receptor activity

MYL2	Biological process	Cardiac myofibril assembly; heart contraction; negative regulation of cell growth; regulation of striated muscle contraction; ventricular cardiac muscle morphogenesis
	Cellular component	Sarcomere
	Molecular function	Actin monomer binding; calcium ion binding; myosin heavy chain binding; protein binding; structural constituent of muscle

SPARCL1	Biological process	Signal transduction
SPARCL1	Molecular function	Calcium ion binding

ZEB1	Biological process	Cell proliferation; immune response; negative regulation of transcription from RNA polymerase II promoter; regulation of transcription, DNA-dependent
ZEB1	Molecular function	Transcription coactivator activity; transcription corepressor activity; transcription factor activity; zinc ion binding


3.2. Comparisons with Other Similarity Measures

The definition of the similarity measure is very important for identifying the relationships among genes. Euclidean distance and the correlation coefficient are traditional similarity measures commonly used in gene expression analysis, but both are unsuitable for nonlinear relationships that might exist between the patterns. Euclidean distance also fails to detect simultaneously upregulated or downregulated expression levels when the absolute changes are large in amplitude. Compared with Euclidean distance and the Pearson correlation coefficient, the MI measure yields better performance [32].
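The limitation of a linear measure can be seen in a small constructed example, not taken from the paper's data: for y = x² over a symmetric range, the Pearson correlation coefficient is zero although y is completely determined by x, whereas the MI is clearly positive. The helper names and the toy values below are hypothetical, and MI is estimated from empirical frequencies of already discrete values.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def entropy(labels):
    """Shannon entropy in bits, estimated from empirical frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    """MI(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# Constructed nonlinear dependence: y is a deterministic function of x.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]

print("Pearson r:", pearson(x, y))
print("MI (bits):", mutual_information(x, y))
```

Since y is a function of x, MI(X; Y) equals H(Y) here, so the dependence is fully captured even though the linear correlation vanishes.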

Figures 7 and 8 show the Euclidean distance matrices and the Pearson correlation coefficient matrices, respectively, of the eight genes identified by MIClique from the colon dataset. The Euclidean distance values range from 2.025 to 7.073 in normal samples and from 1.676 to 5.497 in disease samples. The Pearson correlation coefficient values range from 0.151 to 0.946 in normal samples and from 0.242 to 0.891 in disease samples. Neither figure shows any indication of differential coexpression patterns among the eight genes.

Figure 7. Images of the Euclidean distance matrices for the eight genes from the colon dataset. (a) Normal samples. (b) Disease samples.

Figure 8. Images of the Pearson correlation coefficient matrices for the eight genes from the colon dataset. (a) Normal samples. (b) Disease samples.

3.3. Leukemia Data

The samples of the leukemia dataset are divided into two subclasses of disease samples: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The MIClique algorithm is applied to the preprocessed and normalized leukemia dataset with T₁ = 2.2 and T₂ = 0.9. A group of eight DEC genes is identified, which are coexpressed in ALL samples but not in AML samples. The MI values of the eight genes in this DEC group range from 1.944 to 3.348 in ALL samples and from 0.764 to 1.225 in AML samples, with average MI values of 2.550 and 0.934, respectively. Table 5 lists the GenBank accession numbers, gene symbols, and gene descriptions given by the leukemia dataset. In addition, MIClique can identify DEC genes that are correlated in AML but not in ALL. All these DEC genes are helpful for understanding the pathogenesis of leukemia and the biological function of gene modules.

Table 5.

Differentially coexpressed genes correlated in ALL but not in AML in the leukemia dataset.

Accession numbers	Gene symbols	UniProt	Gene descriptions
HG4074-HT4344	FEN1(RAD2)	P39748	Rad2
L41870	RB1	P06400	Retinoblastoma 1 (including osteosarcoma)
U18062	TAF7(TAFII55)	Q15545	Human TFIID subunit TAFII55 mRNA
M92287	CCND3	P30281	Cyclin D3
U28833	RCAN1(DSCR1)	Q9UF15	Down syndrome critical region protein (DSCR1) mRNA
X56468	YWHAQ	P27348	14-3-3 protein tau
X84373	NRIP1(RIP140)	P48552	Nuclear factor RIP140
Z23064	RBMX	P38159	Heterogeneous nuclear ribonucleoprotein G


4. Conclusions

The difference between MIClique and supervised gene selection methods is that the MIClique algorithm evaluates the contributions of genes to phenotype by gene subsets rather than by individual genes. Although the aim of MIClique is not to select discriminative genes between normal and disease tissues, or between different types of disease samples, the identified genes are still very informative for sample classification. For example, most of the genes identified by MIClique from the colon dataset are also differentially expressed genes, which is consistent with the results of other studies.

The results on the colon and leukemia datasets suggest that the MIClique algorithm is effective in identifying DEC genes. DEC gene analysis focuses on the interactions among gene pairs and on disease-related gene networks, which is important for understanding disease pathogenesis and the biological function of gene modules. The MIClique algorithm provides a new and intuitive approach for biological and clinical cancer research.

References

1. Garber K. Genomic medicine: gene expression tests foretell breast cancer’s future. Science. 2004;303(5665):1754–1755. doi: 10.1126/science.303.5665.1754.
2. Troyanskaya OG, Garber ME, Brown PO, Botstein D, Altman RB. Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics. 2002;18(11):1454–1461. doi: 10.1093/bioinformatics/18.11.1454.
3. Chu F, Xie W, Wang L. Gene selection and cancer classification using a fuzzy neural network. In: Proceedings of the Annual Conference of the North American Fuzzy Information Processing Society (NAFIPS ’04), vol. 2; 2004; pp. 555–559.
4. Varadan V, Anastassiou D. Inference of disease-related molecular logic from systems-based microarray analysis. PLoS Computational Biology. 2006;2(6, article e68). doi: 10.1371/journal.pcbi.0020068.
5. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America. 1998;95(25):14863–14868. doi: 10.1073/pnas.95.25.14863.
6. Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL. Model-based clustering and data transformations for gene expression data. Bioinformatics. 2001;17(10):977–987. doi: 10.1093/bioinformatics/17.10.977.
7. Tanay A, Sharan R, Shamir R. Discovering statistically significant biclusters in gene expression data. Bioinformatics. 2002;18(supplement 1):S136–S144. doi: 10.1093/bioinformatics/18.suppl_1.s136.
8. Cheng Y, Church GM. Biclustering of expression data. In: Proceedings of the International Conference on Intelligent Systems for Molecular Biology (ISMB ’00), vol. 8; 2000; pp. 93–103.
9. Kostka D, Spang R. Finding disease specific alterations in the co-expression of genes. Bioinformatics. 2004;20(supplement 1):i194–i199. doi: 10.1093/bioinformatics/bth909.
10. Prieto C, Rivas MJ, Sánchez JM, López-Fidalgo J, De Las Rivas J. Algorithm to find gene expression profiles of deregulation and identify families of disease-altered genes. Bioinformatics. 2006;22(9):1103–1110. doi: 10.1093/bioinformatics/btl053.
11. Watson M. CoXpress: differential co-expression in gene expression data. BMC Bioinformatics. 2006;7, article 509. doi: 10.1186/1471-2105-7-509.
12. Michaels GS, Carr DB, Askenazi M, Fuhrman S, Wen X, Somogyi R. Cluster analysis and data visualization of large-scale gene expression data. Pacific Symposium on Biocomputing. 1998;3:42–53.
13. Daub CO, Steuer R, Selbig J, Kloska S. Estimating mutual information using B-spline functions: an improved similarity measure for analysing gene expression data. BMC Bioinformatics. 2004;5, article 118. doi: 10.1186/1471-2105-5-118.
14. D’Haeseleer P, Liang S, Somogyi R. Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics. 2000;16(8):707–726. doi: 10.1093/bioinformatics/16.8.707.
15. Steuer R, Kurths J, Daub CO, Weise J, Selbig J. The mutual information: detecting and evaluating dependencies between variables. Bioinformatics. 2002;18(supplement 2):S231–S240. doi: 10.1093/bioinformatics/18.suppl_2.s231.
16. Cover T, Thomas J. Elements of Information Theory. New York, NY, USA: Wiley Interscience; 2006.
17. Voy BH, Scharff JA, Perkins AD, et al. Extracting gene networks for low-dose radiation using graph theoretical algorithms. PLoS Computational Biology. 2006;2(7, article e89). doi: 10.1371/journal.pcbi.0020089.
18. Kose F, Weckwerth W, Linke T, Fiehn O. Visualizing plant metabolomic correlation networks using clique-metabolite matrices. Bioinformatics. 2001;17(12):1198–1208. doi: 10.1093/bioinformatics/17.12.1198.
19. Gentleman RC, Carey VJ, Bates DM, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology. 2004;5(10, article R80). doi: 10.1186/gb-2004-5-10-r80.
20. Borgatti SP, Everett MG, Freeman LC. Ucinet for Windows: Software for Social Network Analysis. Harvard, Mass, USA: Analytic Technologies; 2002.
21. Huber W, Carey VJ, Long L, Falcon S, Gentleman R. Graphs in molecular biology. BMC Bioinformatics. 2007;8(supplement 6, article S8). doi: 10.1186/1471-2105-8-S6-S8.
22. Alon U, Barkai N, Notterman DA, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America. 1999;96(12):6745–6750. doi: 10.1073/pnas.96.12.6745.
23. Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–537. doi: 10.1126/science.286.5439.531.
24. The UniProt Consortium. The Universal Protein resource (UniProt). Nucleic Acids Research. 2008;36(1, database issue):D190–D195. doi: 10.1093/nar/gkm895.
25. The Gene Ontology Consortium. The Gene Ontology project in 2008. Nucleic Acids Research. 2008;36(1, database issue):D440–D444. doi: 10.1093/nar/gkm883.
26. Aigner K, Dampier B, Descovich L, et al. The transcription factor ZEB1 (δEF1) promotes tumour cell dedifferentiation by repressing master regulators of epithelial polarity. Oncogene. 2007;26(49):6979–6988. doi: 10.1038/sj.onc.1210508.
27. Claeskens A, Ongenae N, Neefs JM, et al. Hevin is down-regulated in many cancers and is a negative regulator of cell growth and proliferation. British Journal of Cancer. 2000;82(6):1123–1130. doi: 10.1054/bjoc.1999.1051.
28. Shaik JS, Yeasin M. A unified framework for finding differentially expressed genes from microarray experiments. BMC Bioinformatics. 2007;8, article 347. doi: 10.1186/1471-2105-8-347.
29. Li X, Rao S, Wang Y, Gong B. Gene mining: a novel and powerful ensemble decision approach to hunting for disease genes using microarray expression profiling. Nucleic Acids Research. 2004;32(9):2685–2694. doi: 10.1093/nar/gkh563.
30. Xiong M, Fang X, Zhao J. Biomarker identification by feature wrappers. Genome Research. 2001;11(11):1878–1887. doi: 10.1101/gr.190001.
31. Zhang XW, Yap YL, Wei D, Chen F, Danchin A. Molecular diagnosis of human cancer type by gene expression profiles and independent component analysis. European Journal of Human Genetics. 2005;13(12):1303–1311. doi: 10.1038/sj.ejhg.5201495.
32. Priness I, Maimon O, Ben-Gal I. Evaluation of gene-expression clustering via mutual information distance measure. BMC Bioinformatics. 2007;8, article 111. doi: 10.1186/1471-2105-8-111.
