Author manuscript; available in PMC: 2015 May 27.
Published in final edited form as: IASTED Int Conf Comput Syst Biol (2006). 2006 Nov;2006:68–72.

A NEW CLUSTERING METHOD AND ITS APPLICATION TO PROTEOMIC PROFILING FOR COLON CANCER

Yongbin Ou 1, Lan Guo 2, Cun-Quan Zhang 3
PMCID: PMC4445888  NIHMSID: NIHMS686085  PMID: 26029744

Abstract

In this paper, we introduce a new clustering method, quasi-clique merger, and its associated data pretreatment programs. The method constructs non-binary hierarchical trees with a much smaller number of clusters in the output, and it allows overlapping clusters. We applied this new method to cluster 60 human cancer cell lines (the NCI-60) using the previously identified proteomic determinants for chemosensitivity of 5-Fluorouracil (5-FU). All colon cancer cell lines were aggregated into a single cluster, indicating that the eight proteomic markers are potential diagnostic markers of colon cancer. On the same datasets, the results of the new clustering method surpassed those of previous methods.

Keywords: Biological Data Mining, unsupervised hierarchical clustering, overlapping clusters, Microarray Data Analysis, NCI-60, chemosensitivity determinants

1. Introduction

Clustering is one of the most important methods in bioinformatics research, and a variety of clustering algorithms are now available. Although these approaches have clearly demonstrated their usefulness in applications, a number of important questions remain to be addressed, such as "problems related to robustness, uniqueness, and optimality of linear ordering which complicates the interpretation of the resulting hierarchical relationships", problems of "how to determine the optimal number of clusters" (Lukashin & Fuchs, Bioinformatics 2001 (1)), the problem that "none of these algorithms can, in general, rigorously guarantee to produce a globally optimal clustering for non-trivial objective function" (Xu, Olman & Xu, Bioinformatics 2002 (2)), and the observation that "there are no completely satisfactory methods for determining the number of population clusters for any type of cluster analysis…" (SAS/STAT User's Guide (3)).

In this paper, we introduce a new clustering method called "quasi-clique merger" and its associated data pretreatment programs. One of the most significant differences between the new method and existing methods is that the quasi-clique merger method constructs a much smaller hierarchical tree, which highlights the meaningful clusters (whereas the hierarchical trees produced by most existing hierarchical clustering methods are binary). Another special feature of the new method is multi-membership (also called overlapping clustering), a concept recently introduced by Palla et al. (4), Pereira-Leal et al. (5), and Futschik et al. (6).

We applied this new method to cluster 60 human cancer cell lines (the NCI-60) using the previously identified proteomic determinants for chemosensitivity of 5-Fluorouracil (5-FU). All colon cancer cell lines were aggregated into a single cluster, indicating that the eight proteomic markers are potential diagnostic markers of colon cancer. The results based on the new clustering method have surpassed those based on previous methods on the same datasets.

2. The quasi-clique merger algorithm

A graph (network) is one of the most commonly used models for representing real-valued relationships among a set of input items. Let G = (V, E) be a graph with a weight function w: E(G) → R, where w(e) represents the similarity (closeness) of the items u and v for e = uv.

Clustering is a process that detects all denser subgraphs in G and lists their inclusion relations in a hierarchical structure.

The following is the algorithm.

2-1. Subprograms

For a subgraph C, we define the density of C by

$$d(C) = \frac{2\sum_{e \in E(C)} w(e)}{|V(C)|\,(|V(C)|-1)}.$$

For readers familiar with graph theory or computer science: in the unweighted case, where w(e) = 1 for every edge e of G, d(C) = 1 exactly when C induces a clique. For a weighted graph, a subgraph C is called a Δ-quasi-clique if d(C) ≥ Δ for some positive real number Δ. A heuristic procedure is applied here to find quasi-cliques with densities at various levels.
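The density definition can be illustrated with a short Python sketch. The edge-weight representation (a dictionary keyed by `frozenset` vertex pairs), the function name `density`, and the convention that a missing edge has weight 0 are assumptions made for this example only; they are not taken from the authors' program.

```python
from itertools import combinations


def density(weights, cluster):
    """Return d(C) = 2 * (sum of w(e) over edges inside C) / (|C| * (|C| - 1)).

    Assumption: a vertex pair absent from `weights` has weight 0.
    """
    n = len(cluster)
    if n < 2:
        return 0.0  # assumption: a single vertex is given density 0
    total = sum(weights.get(frozenset(pair), 0.0)
                for pair in combinations(cluster, 2))
    return 2.0 * total / (n * (n - 1))


# Toy graph: a triangle a-b-c of weight-1 edges plus a weaker edge c-d.
weights = {
    frozenset(("a", "b")): 1.0,
    frozenset(("a", "c")): 1.0,
    frozenset(("b", "c")): 1.0,
    frozenset(("c", "d")): 0.5,
}
triangle_density = density(weights, {"a", "b", "c"})
whole_density = density(weights, {"a", "b", "c", "d"})
```

In this toy graph the triangle has density 1, so it is a Δ-quasi-clique for every Δ ≤ 1, while adding the loosely attached vertex d lowers the density to 7/12.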

The core of the algorithm is deciding whether or not to add a vertex to a community. For a vertex v not in C, we define the contribution of v to C by

$$c(v,C) = \frac{\sum_{u \in V(C)} w(uv)}{|V(C)|}.$$

A vertex v is added to C if c(v,C) > α d(C), where α is a user-specified parameter.

Algorithm Grow(C,G)

(grow a community C in G)

  • while V(G)\V(C) ≠ Ø
    • begin
    • pick v ∈ V(G)\V(C) such that c(v,C) is a maximum
    • if c(v,C) > α d(C) then add v to C
    • else return
    • end
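The Grow procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: edge weights are stored in a dictionary keyed by `frozenset` pairs, missing edges have weight 0, ties between candidate vertices are broken by sorted vertex order, and all function names are hypothetical.

```python
from itertools import combinations


def w(weights, u, v):
    """Edge weight lookup; assumption: a missing edge has weight 0."""
    return weights.get(frozenset((u, v)), 0.0)


def density(weights, cluster):
    """d(C): twice the internal weight divided by |C| * (|C| - 1)."""
    n = len(cluster)
    if n < 2:
        return 0.0
    total = sum(w(weights, u, v) for u, v in combinations(cluster, 2))
    return 2.0 * total / (n * (n - 1))


def contribution(weights, v, cluster):
    """c(v, C): average weight of the edges joining v to the members of C."""
    return sum(w(weights, u, v) for u in cluster) / len(cluster)


def grow(weights, vertices, cluster, alpha):
    """Greedily add the best outside vertex while c(v, C) > alpha * d(C)."""
    cluster = set(cluster)
    while True:
        outside = sorted(set(vertices) - cluster)  # sorted for determinism
        if not outside:
            return cluster
        best = max(outside, key=lambda v: contribution(weights, v, cluster))
        if contribution(weights, best, cluster) > alpha * density(weights, cluster):
            cluster.add(best)
        else:
            return cluster


weights = {
    frozenset(("a", "b")): 1.0,
    frozenset(("a", "c")): 0.9,
    frozenset(("b", "c")): 0.9,
    frozenset(("c", "d")): 0.2,
}
grown = grow(weights, {"a", "b", "c", "d"}, {"a", "b"}, alpha=0.8)
```

Starting from the seed {a, b}, vertex c is admitted because c(c, C) = 0.9 exceeds α d(C) = 0.8, whereas d is rejected because its contribution to {a, b, c} is only about 0.07.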

Algorithm Decompose(G,w0)

(decompose a graph G into communities using edges with weights at least w0)

  • compute E0 = {e ∈ E(G): w(e) ≥ w0}

  • for each e = uv ∈ E0 in decreasing order of w(e)
    • begin
    • if either u or v is not in any community then
    •   begin
    •   create a new empty community C and add u, v into C
    •   Grow(C,G)
    •   end
    • end
  • repeatedly
    • if for any two communities C1 and C2, |C1 ∩ C2| > β min(|C1|, |C2|), then merge C1 and C2 into a new community C = C1 ∪ C2 (where β is a user-specified parameter).
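A compact Python sketch of Decompose, including the seeding loop and the repeated merging of heavily overlapping communities, is given below. It is an illustration under assumptions rather than the authors' code: the dictionary-of-`frozenset` edge representation, the zero weight for missing edges, the tie-breaking by sorted vertex names, and every function name are hypothetical choices made for the example.

```python
from itertools import combinations


def w(weights, u, v):
    """Edge weight lookup; assumption: a missing edge has weight 0."""
    return weights.get(frozenset((u, v)), 0.0)


def density(weights, c):
    n = len(c)
    return 2.0 * sum(w(weights, u, v) for u, v in combinations(c, 2)) / (n * (n - 1))


def grow(weights, vertices, c, alpha):
    """Grow(C, G): add the best outside vertex while c(v, C) > alpha * d(C)."""
    c = set(c)
    while True:
        outside = sorted(vertices - c)
        if not outside:
            return c
        # |C| is fixed within a round, so maximising the sum maximises c(v, C).
        best = max(outside, key=lambda v: sum(w(weights, u, v) for u in c))
        if sum(w(weights, u, best) for u in c) / len(c) > alpha * density(weights, c):
            c.add(best)
        else:
            return c


def merge_overlapping(communities, beta):
    """Merge C1, C2 while |C1 & C2| > beta * min(|C1|, |C2|) for some pair."""
    communities = [set(c) for c in communities]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(communities)), 2):
            c1, c2 = communities[i], communities[j]
            if len(c1 & c2) > beta * min(len(c1), len(c2)):
                communities[i] = c1 | c2
                del communities[j]  # i < j, so index i stays valid
                merged = True
                break
    return communities


def decompose(weights, vertices, w0, alpha, beta):
    """Seed communities from edges with weight >= w0, heaviest first."""
    vertices = set(vertices)
    strong = sorted((e for e in weights if weights[e] >= w0),
                    key=lambda e: (-weights[e], sorted(e)))
    communities = []
    for e in strong:
        u, v = sorted(e)
        covered = set().union(*communities)
        if u not in covered or v not in covered:
            communities.append(grow(weights, vertices, {u, v}, alpha))
    return merge_overlapping(communities, beta)


# Two triangles joined by one weak edge c-d.
weights = {
    frozenset(("a", "b")): 1.0, frozenset(("a", "c")): 0.9,
    frozenset(("b", "c")): 0.9, frozenset(("d", "e")): 0.8,
    frozenset(("d", "f")): 0.8, frozenset(("e", "f")): 0.7,
    frozenset(("c", "d")): 0.1,
}
communities = decompose(weights, "abcdef", w0=0.7, alpha=0.8, beta=0.5)
```

With these illustrative parameters the two triangles are recovered as separate communities. Because a grown community may absorb vertices that already belong to another one, communities can overlap, and the merging step controls how much overlap is tolerated.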

Algorithm Contraction(G)

Each community becomes a vertex. The weight of an edge is defined by

$$W(C_1,C_2) = \frac{\sum_{e \in E_c} w(e)}{|E_c|}$$

where Ec is the set of crossing edges, defined by Ec = {v1v2: v1 ∈ C1, v2 ∈ C2, v1 ≠ v2}.
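The contraction step can be sketched in Python as below. The sketch assumes that only vertex pairs present in the weight dictionary count as crossing edges, that each community is renamed to its integer index in the input list, and that two communities with no crossing edge receive no edge in the contracted graph; these conventions and the function name `contract` are assumptions for illustration.

```python
from itertools import combinations


def contract(weights, communities):
    """Return the edge weights of the contracted graph.

    Community i becomes vertex i. W(Ci, Cj) is the mean weight of the
    crossing edges v1v2 with v1 in Ci, v2 in Cj and v1 != v2.
    """
    communities = [frozenset(c) for c in communities]
    new_weights = {}
    for i, j in combinations(range(len(communities)), 2):
        # A set of frozensets counts each crossing edge once, even when
        # the two communities overlap.
        crossing = {frozenset((v1, v2))
                    for v1 in communities[i]
                    for v2 in communities[j]
                    if v1 != v2 and frozenset((v1, v2)) in weights}
        if crossing:
            total = sum(weights[e] for e in crossing)
            new_weights[frozenset((i, j))] = total / len(crossing)
    return new_weights


weights = {
    frozenset(("a", "b")): 1.0, frozenset(("a", "c")): 0.9,
    frozenset(("b", "c")): 0.9, frozenset(("d", "e")): 0.8,
    frozenset(("d", "f")): 0.8, frozenset(("e", "f")): 0.7,
    frozenset(("c", "d")): 0.1, frozenset(("b", "d")): 0.3,
}
contracted = contract(weights, [{"a", "b", "c"}, {"d", "e", "f"}])
```

Here the two crossing edges c-d and b-d have weights 0.1 and 0.3, so the single edge of the contracted graph has weight 0.2.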

2-2. The main algorithm

Algorithm Main-Algorithm

(produce a hierarchical clustering tree for a graph G)

  • while E(G) ≠ Ø
    • begin
    • Choose w0 according to some criterion
    • Decompose(G,w0)
    • Contraction(G)
    • store the resulting graph as G
    • end
  • trace the movement of each vertex and produce the hierarchical tree

Note: the choice of w0 depends on the weights of the edges in E(G). Usually w0 is chosen relative to max w(e), the largest edge weight in the current graph.

3. Data pretreatment

The quasi-clique merger algorithm produces far fewer clusters. This is one of the important features of the new method. However, if the weights of most edges fall within a very small interval, the algorithm above may fail to distinguish the small differences and therefore produce only a very small number of clusters. To output an appropriate number of clusters, we apply the following data pretreatment.

3-1. Input

One of the most common input formats is an m×n matrix A = [aij]. Microarray data are usually in this format. The clustering process separates the set of rows into several clusters.

3-2. "Minority rules"

Find the average bj of the j-th column, for every j, and set

$$a_{ij} \leftarrow a_{ij} - b_j.$$

This step maps every average value to zero and therefore highlights the difference between the entries above the average and those below it.
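As a small illustration of this column-centering step, the following Python sketch subtracts each column average from the entries of that column. The function name `center_columns` and the list-of-rows matrix representation are assumptions for the example.

```python
def center_columns(matrix):
    """Subtract the column average b_j from every entry a_ij."""
    n_rows = len(matrix)
    means = [sum(column) / n_rows for column in zip(*matrix)]
    return [[a - b for a, b in zip(row, means)] for row in matrix]


# Three items (rows) measured on two features (columns).
data = [
    [1.0, 10.0],
    [3.0, 10.0],
    [5.0, 13.0],
]
centered = center_columns(data)
```

The column averages here are 3 and 11, so an entry equal to its column average (the 3.0 in the second row) becomes 0, and entries above and below the average receive opposite signs.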

3-3. Similarity – angles between vectors

Each row is considered as a vector vi, and the angle between two vectors is used to measure their difference. Hence, the similarity between two vectors vi and vj is cos θ, where θ is the angle between them; cos θ is determined by the inner product, cos θ = (vi · vj) / (|vi| |vj|).
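The cosine similarity between rows can be turned into edge weights for the graph model of Section 2, as in the following Python sketch. The function names, the `frozenset`-keyed output dictionary, and the decision to raise an error for a zero vector are assumptions of this example, not details given in the paper.

```python
import math


def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_graph(rows):
    """Map each pair of row indices {i, j} to the cosine of their angle."""
    return {frozenset((i, j)): cosine_similarity(rows[i], rows[j])
            for i in range(len(rows))
            for j in range(i + 1, len(rows))}


rows = [
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as row 0
    [0.0, 3.0],   # orthogonal to row 0
    [-1.0, 0.0],  # opposite to row 0
]
graph = similarity_graph(rows)
```

Rows pointing in the same direction receive similarity 1 regardless of their lengths, orthogonal rows receive 0, and opposite rows receive −1.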

3-4. Difference amplifier

$$a_{ij} \leftarrow f(a_{ij})$$

The function f used in our application to proteomic profiling of cancer cell lines is the composition of a rational function (of order −3) and an inverse trigonometric function.

4. Application -- proteomic profiling for cancer cell lines

Assessment of an individual’s predisposition to drugs is essential to achieve the goal of personalized medicine. This approach is needed to allow clinicians to choose a treatment option that includes the most effective therapeutic agents for a given patient while avoiding ineffective agents and unnecessary side effects. In a previous study, we explored proteomic contributions to drug sensitivity and predicted the drug responses of 60 human cancer cell lines (NCI-60 panel) to 118 anti-cancer agents by proteomic profiling (7). The protein expression levels were measured in untreated cells. As the focus was on predicting the response to therapy and not analyzing the molecular consequences of therapy, this study provides a basis for predicting drug responses based on protein markers in the tumors of untreated patients.

It is especially challenging to predict chemosensitivity in a clinical context because drug responses reflect the properties intrinsic to both the target cells and host metabolism (8). Our analysis was limited to the intrinsic properties of cells in culture by modeling the response of the NCI-60 panel of 60 human cancer cell lines, which includes lines derived from leukemias, melanomas, and carcinomas of ovarian, renal, breast, prostate, colon, lung, and central nervous system (CNS) origin. These cell lines have been screened previously for the activity of 118 anti-cancer drugs whose mechanisms of action are putatively known (9). Some of these drugs are currently in routine clinical use for cancer treatment; others are in clinical trials or the late stages of drug development.

We investigated the feasibility of predicting drug responses using protein expression levels. Both the proteomic profiles (10) and drug activity database (9) were generated by the National Cancer Institute (NCI) and are available at the NCI website (http://discover.nci.nih.gov/datasets.jsp). The protein expression database was generated by proteomic assays with 52-antibody, reverse-phase, protein lysate microarrays in each individual cell line (10). We sought to identify important protein markers for predicting responses to the 118 anti-cancer agents in each cell line. Classifiers of the complete range of drug responses (sensitive, intermediate, and resistant) were developed, one for each drug evaluated. The chemosensitivity classifiers were designed to be independent of the cells’ tissue origin.

This study identified the protein markers for predicting chemosensitivity of the 118 agents in 60 cancer cell lines. The markers can, in principle, provide a basis for devising the optimal combination of therapies directed specifically at eliminating the cancer cells while minimizing toxicity to normal cells (11). In addition, the markers can theoretically portray a unique molecular signature for the detection and diagnosis of a cancer (10)–(12). Among the studied drugs, 5-Fluorouracil (5-FU) (NSC 19893) has been included in the treatment combinations for patients with stage III colon cancer (13). Using random forests in the R software package (http://www.r-project.org/), eight protein markers were identified for the prediction of drug response to 5-FU: CDH1, CDH2, KRT8, ERBB2, MSN, MVP, MAP2K1, and MGMT. All of these proteins, except for KRT8, are involved in the pathogenesis of colon cancer. To investigate the feasibility of using these markers to diagnose colon cancer, we performed unsupervised hierarchical clustering on the 60 cancer cell lines using the expression levels of these eight proteins. There were seven colon cancer cell lines in the NCI-60 panel: KM-12, HCT-15, HT29, COLO-205, HCC-2998, HCT-116, and SW-620. All of them were aggregated together by the new clustering method, indicating that the identified protein markers provide a basis not only for the detection and diagnosis of colon cancer, but also for devising the optimal therapeutic combination targeted specifically at eliminating the cancer cells.

Fig. 1 shows the output of the program, in which all colon cancer cell lines are grouped into one cluster. It is noteworthy that our previous research (7) was not able to produce such a cluster using CIMminer (Fig. 2) (http://discover.nci.nih.gov/cimminer/), developed by the National Cancer Institute (14).

Fig. 1.

Fig. 2.

5. Conclusion

The new clustering method (quasi-clique merger and its associated pretreatment) introduced in this paper is based on a graph-theoretical model. The method is designed to search for and merge cliques or clique-like subgraphs in an input dataset, and it is capable of producing more meaningful clusters than many other methods. The application to 60 human cancer cell lines clearly indicates that the results produced by this method surpass those of previous methods on the same datasets.

Acknowledgements

Y. Ou was supported in part by the West Virginia University Research Corporation;

L. Guo was supported in part by NIH under Grant NIH/NCRR P20 RR16440-03;

C.-Q. Zhang was supported in part by the National Security Agency under Grant MDA904-01-1-0022 and by WV EPSCoR under Grant EPS2006-37.

Contributor Information

Yongbin Ou, Department of Mathematics, West Virginia University, Morgantown, WV 26506-6310.

Lan Guo, MBR Cancer Center/Department of Community Medicine, West Virginia University, Morgantown, WV 26506-9300.

Cun-Quan Zhang, Department of Mathematics, West Virginia University, Morgantown, WV 26506-6310.

References

  • 1. Lukashin AV, Fuchs R. Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics. 2001;17(5):405–414. doi: 10.1093/bioinformatics/17.5.405.
  • 2. Xu Y, Olman YV, Xu D. Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics. 2002;18:536–545. doi: 10.1093/bioinformatics/18.4.536.
  • 3. SAS OnlineDoc. SAS/STAT User's Guide. Duluth: University of Minnesota; 1999. (http://www.d.umn.edu/math/docs/saspdf/stat/pdfidx.htm)
  • 4. Palla G, Derenyi I, Farkas I, Vicsek T. Uncovering the overlapping community structure of complex networks in nature and society. Nature. 2005;435(7043):814–818. doi: 10.1038/nature03607.
  • 5. Pereira-Leal JB, Enright AJ, Ouzounis CA. Detection of Functional Modules From Protein Interaction Networks. PROTEINS: Structure, Function, and Bioinformatics. 2004;54:49–57. doi: 10.1002/prot.10505.
  • 6. Futschik ME, Carlisle B. Noise-Robust soft clustering of gene expression timecourse. Journal of Bioinformatics and Computational Biology. 2005;3(4):965–988. doi: 10.1142/s0219720005001375.
  • 7. Ma Y, Ding Z, Qian Y, Shi X, Castranova V, Harner EJ, Guo L. Predicting Cancer Drug Response by Proteomic Profiling. Clinical Cancer Research. 2006 doi: 10.1158/1078-0432.CCR-06-0290. (accepted)
  • 8. Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo MJ, Park J, Scherf U, Lee JK, Reinhold WC, Weinstein JN, Mesirov JP, Lander ES, Golub TR. Chemosensitivity prediction by transcriptional profiling. Proc. Natl. Acad. Sci. U.S.A. 2001:10787–10792. doi: 10.1073/pnas.191368598.
  • 9. Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Tanabe L, Kohn KW, Reinhold WC, Myers TG, Andrews DT, Scudiero DA, Eisen MB, Sausville EA, Pommier Y, Botstein D, Brown PO, Weinstein JN. A gene expression database for the molecular pharmacology of cancer. Nat. Genet. 2000:236–244. doi: 10.1038/73439.
  • 10. Nishizuka S, Charboneau L, Young L, Major S, Reinhold WC, Waltham M, Kouros-Mehr H, Bussey KJ, Lee JK, Espina V, Munson PJ, Petricoin E, III, Liotta LA, Weinstein JN. Proteomic profiling of the NCI-60 cancer cell lines using new high-density reverse-phase lysate microarrays. Proc. Natl. Acad. Sci. U.S.A. 2003:14229–14234. doi: 10.1073/pnas.2331323100.
  • 11. Munagala K, Tibshirani R, Brown PO. Cancer characterization and feature set extraction by discriminative margin clustering. BMC Bioinformatics. 2004:21. doi: 10.1186/1471-2105-5-21.
  • 12. Nishizuka S, Chen ST, Gwadry FG, Alexander J, Major SM, Scherf U, Reinhold WC, Waltham M, Charboneau L, Young L, Bussey KJ, Kim S, Lababidi S, Lee JK, Pittaluga S, Scudiero DA, Sausville EA, Munson PJ, Petricoin EF, III, Liotta LA, Hewitt SM, Raffeld M, Weinstein JN. Diagnostic markers that distinguish colon and ovarian adenocarcinomas: identification by genomic, proteomic, and tissue array profiling. Cancer Res. 2003:5243–5250.
  • 13. Benson AB., III Adjuvant Chemotherapy of Stage III Colon Cancer. Semin. Oncol. 2005:74–77. doi: 10.1053/j.seminoncol.2005.04.016.
  • 14. Weinstein JN, Myers TG, O'Connor PM, Friend SH, Fornace AJ, Jr, Kohn KW, Fojo T, Bates SE, Rubinstein LV, Anderson NL, Buolamwini JK, van Osdol WW, Monks AP, Scudiero DA, Sausville EA, Zaharevitz DW, Bunow B, Viswanadhan VN, Johnson GS, Wittes RE, Paull KD. An information-intensive approach to the molecular pharmacology of cancer. Science. 1997:343–349. doi: 10.1126/science.275.5298.343.
