Bioinformation. 2013 Jan 18;9(2):84–88. doi: 10.6026/97320630009084

A novel harmony search-K means hybrid algorithm for clustering gene expression data

KA Abdul Nazeer 1,*, MP Sebastian 2, SD Madhu Kumar 1
PMCID: PMC3563403  PMID: 23390351

Abstract

Recent progress in bioinformatics research has led to the accumulation of huge quantities of biological data at various data sources. DNA microarray technology makes it possible to simultaneously analyze a large number of genes across different samples. Clustering of microarray data can reveal the hidden gene expression patterns in large quantities of expression data, which in turn offers tremendous possibilities in functional genomics, comparative genomics, disease diagnosis and drug development. The k-means clustering algorithm is widely used for many practical applications. But the original k-means algorithm has several drawbacks. It is computationally expensive and generates locally optimal solutions that depend on the random choice of the initial centroids. Several methods have been proposed in the literature for improving the performance of the k-means algorithm. A meta-heuristic optimization algorithm named harmony search helps find near-globally optimal solutions by searching the entire solution space. The low clustering accuracy of the existing algorithms limits their use in many crucial applications of life sciences. In this paper we propose a novel Harmony Search-K means Hybrid (HSKH) algorithm for clustering gene expression data. Experimental results show that the proposed algorithm produces clusters with better accuracy than the existing algorithms.

Background

Advancements in scientific data collection methods have led to the accumulation of large quantities of data in various biology and life sciences databases. The advent of DNA microarrays facilitated the simultaneous monitoring of the expression levels of thousands of genes across a multitude of samples. Unraveling the hidden patterns in gene expression data offers a distinct opportunity for understanding important biological processes. The information thus obtained can be effectively utilized for important applications like early diagnosis and classification of various diseases. However, the volume and complexity of genomic data pose a big challenge to the task of unearthing the hidden patterns from the massive data banks. Cluster analysis [1] is a first step towards addressing this challenge.

Cluster analysis is one of the major data mining methods that help identify the natural grouping in a set of data items. Clustering is the process of partitioning a given set of objects into disjoint clusters. This is done in such a way that objects in the same cluster are similar, while objects belonging to different clusters differ considerably with respect to their attributes. Clustering of microarray Gene Expression (GE) data helps in understanding gene functions, gene regulation and cellular processes [2]. The genes in the same cluster exhibit similar expression patterns and are likely to be co-regulated. Clustering of tissue samples collected from various people, including healthy persons and those affected by diseases like cancer, based on the similarity in expression patterns, can help in effective classification of unknown samples, which in turn can lead to early diagnosis of diseases [3].

The k-means algorithm [4] is effective in producing clusters for many practical applications. But the computational complexity of the original k-means algorithm is very high, especially for large data sets. Moreover, this algorithm results in locally optimal solutions and produces different clusters depending upon the random choice of the initial centroids. Several methods were proposed in the literature for improving the performance of the k-means clustering algorithm [5–8]. However, none of them has found wide acceptance among users. Clustering can be thought of as an optimization problem that maximizes the intra-cluster similarity and minimizes the inter-cluster similarity among the data points. Attempts were made to develop hybrid algorithms by combining the k-means algorithm with meta-heuristic techniques such as Harmony Search optimization [9–11], in order to obtain globally optimal solutions. This technique makes use of a harmony memory which is initialized with randomly generated feasible solutions to provide a global solution space. The harmony memory is then improvised by generating a new solution from the existing candidate solutions and replacing the current worst solution with the new solution if the latter has a better fitness value. This process is repeated several times and the best solution in the harmony memory is selected as the final solution.
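The generic harmony search loop described above can be sketched as follows. This is a minimal illustration, not the algorithm proposed in this paper: the function name `harmony_search`, the parameters `memory_size`, `iterations` and `hmcr` (a memory-consideration rate, as used in standard harmony search), and the convention that a lower fitness value is better are all assumptions made for the example. A solution is taken to be a list of k distinct data-point indices.

```python
import random

def harmony_search(fitness, n_points, k, memory_size=10, iterations=200,
                   hmcr=0.9, seed=0):
    """Minimise `fitness` over lists of k distinct data-point indices.

    Hypothetical helper for illustration; assumes n_points >= k and
    that a lower fitness value is better.
    """
    rng = random.Random(seed)
    # Harmony memory: randomly generated feasible solutions.
    memory = [rng.sample(range(n_points), k) for _ in range(memory_size)]
    scores = [fitness(solution) for solution in memory]

    for _ in range(iterations):
        # Improvise a new solution: each component is copied from a stored
        # solution with probability hmcr, otherwise drawn at random.
        new = []
        for j in range(k):
            if rng.random() < hmcr:
                new.append(rng.choice(memory)[j])
            else:
                new.append(rng.randrange(n_points))
        if len(set(new)) < k:
            continue  # infeasible: the same point chosen for two clusters

        # Replace the current worst solution if the new one is better.
        new_score = fitness(new)
        worst = max(range(memory_size), key=scores.__getitem__)
        if new_score < scores[worst]:
            memory[worst], scores[worst] = new, new_score

    # The best solution in the harmony memory is the final answer.
    best = min(range(memory_size), key=scores.__getitem__)
    return memory[best], scores[best]
```

In a clustering setting, `fitness` would score a set of candidate centroids, for example by the average distance of the data points to their nearest centroid.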

The inadequate clustering accuracy of the existing methods is a bottleneck in crucial applications in the areas of Biology and Life Sciences. This paper presents a novel and effective Harmony Search-K means Hybrid (HSKH) algorithm for clustering the microarray gene expression data [12–14].

Methodology

The proposed HSKH clustering algorithm consists of two phases. In the first phase, the initial centroids of the clusters are determined using an improved Harmony Search optimization technique. The centroids thus determined are used in the second phase to form the final clusters by repetitively assigning the data points to the clusters with the nearest centroids.

The Improved Harmony Search Algorithm:

In this method the harmony memory is initialized systematically with n candidate solutions, where n is the number of columns in the GE matrix. Each row in the harmony memory corresponds to a specific clustering solution represented by k columns. Each column represents a cluster, and the value in that column is the index of the data point that serves as the centroid of the corresponding cluster.

First, the entries in the first column of the GE matrix (Sample-IDs or Gene-IDs, as the case may be) are sorted by the values of the data items in the second column (the expression levels of the gene in the samples, or of the genes in the sample). The sorted data items are then divided into k equal sets. The index of the data point that falls in the middle of each set is selected as the centroid of that cluster. A set of k such centroids constitutes a candidate solution. The above procedure is repeated for all the n columns of the Gene Expression matrix and the harmony memory is initialized with the n candidate solutions thus obtained. Then the fitness values of all the solutions are evaluated as the Average Distance of Data points to Cluster centroids (ADDC) (see the supplementary material for the equation and explanation).
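The initialization procedure above can be sketched as follows, under stated assumptions. The names `init_harmony_memory` and `addc` are hypothetical. The input `data` is assumed to hold only the numeric expression values (one row per data point, with the row index standing in for the ID column), and the number of rows is assumed to be at least k. The exact ADDC equation is given in the supplementary material and is not reproduced here, so `addc` uses one plausible reading: the mean Euclidean distance from each data point to its nearest centroid.

```python
import math

def init_harmony_memory(data, k):
    """Build one candidate solution (k centroid row indices) per column."""
    n_rows, n_cols = len(data), len(data[0])
    memory = []
    for col in range(n_cols):
        # Row indices ordered by their value in this column.
        order = sorted(range(n_rows), key=lambda r: data[r][col])
        solution = []
        for j in range(k):
            # Boundaries of the j-th of k (near-)equal sets.
            start = j * n_rows // k
            end = (j + 1) * n_rows // k
            # The data point in the middle of the set becomes the centroid.
            solution.append(order[(start + end - 1) // 2])
        memory.append(solution)
    return memory

def addc(data, centroid_indices):
    """Assumed ADDC: mean distance of each point to its nearest centroid."""
    centroids = [data[i] for i in centroid_indices]
    total = sum(min(math.dist(p, c) for c in centroids) for p in data)
    return total / len(data)

# Toy matrix: six data points measured in two columns.
toy = [[1.0, 9.0], [2.0, 8.0], [3.0, 7.0],
       [10.0, 1.0], [11.0, 2.0], [12.0, 3.0]]
toy_memory = init_harmony_memory(toy, k=2)
toy_fitness = [addc(toy, solution) for solution in toy_memory]
```

Because every column yields one candidate solution, the harmony memory built this way is deterministic, unlike the random initialization of the standard harmony search.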

Assigning Data Points to Clusters:

After determining the initial centroids using the Improved Harmony Search algorithm, the data items are assigned to the clusters with the nearest centroids using a variant of the method followed in [6] (Please see supplementary material).
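For orientation, the second phase can be illustrated with the standard k-means iteration started from the given initial centroids. This is a sketch of plain k-means only; the variant of the method in [6] that the paper actually uses is described in the supplementary material and is not reproduced here. The name `kmeans_from_centroids` and the `max_iter` parameter are hypothetical.

```python
import math

def kmeans_from_centroids(data, centroids, max_iter=100):
    """Standard k-means iteration starting from the given centroids."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assign every data point to the cluster with the nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: math.dist(point, centroids[j]))
            for point in data
        ]
        if new_labels == labels:
            break  # assignments no longer change: converged
        labels = new_labels
        # Recompute each centroid as the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids
```

The initial centroids would be the data points indexed by the best solution in the harmony memory, for example `[data[i] for i in best_solution]`.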

Result and Discussion

To evaluate the clustering accuracy of the proposed HSKH algorithm, two standard Gene Expression data sets, namely the Human Fibroblast Serum data [15] and the Rat CNS data [16], were given as inputs to the original k-means algorithm and the proposed HSKH algorithm. The accuracy of clustering is measured using the Silhouette Index [17]. Its value ranges from -1 to 1, with a higher value indicating a better clustering.
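As a reminder of how this metric behaves, the following sketch computes the Silhouette Index as the mean silhouette width over all data points, using the usual definition: for point i, a(i) is the mean distance to the other points in its own cluster, b(i) is the smallest mean distance to the points of any other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i)). The function name `silhouette_index`, the use of Euclidean distance and the convention s(i) = 0 for single-point clusters are assumptions of this example, not details taken from the paper.

```python
import math

def silhouette_index(data, labels):
    """Mean silhouette width over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # s(i) = 0 by convention for a single-point cluster
        # a(i): mean distance to the other members of the same cluster.
        a = sum(math.dist(data[i], data[j]) for j in own if j != i)
        a /= len(own) - 1
        # b(i): smallest mean distance to the members of another cluster.
        b = min(
            sum(math.dist(data[i], data[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(data)
```

Well-separated, compact clusters push the value towards 1, while points assigned to the wrong cluster pull it below 0.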

The results of the experiment are compared with the clustering accuracy of four other methods available in the literature [18]. The methods considered are Self Organizing Maps (SOM) [19], Iterative Fuzzy C Means (IFCM) [20], Variable string length Genetic Algorithm (VGA) [21] and Chinese Restaurant Clustering (CRC) [22]. The clustering accuracies of the algorithms for the two data sets are compared in Figure 1.

Figure 1. Performance Comparison of the Algorithms

It can be seen from the above experiments that the proposed HSKH algorithm performs better than the existing methods. The significant improvement in the clustering accuracy can be attributed to the improved harmony search method used to determine the initial centroids of the clusters. As a result of the refined initial centroids and the efficient assignment of the data points to the different clusters, our algorithm yields better clusters in fewer computational steps.

Conclusion

Clustering is a first step towards retrieving useful knowledge from microarray gene expression data. Accurate clustering is essential in Biology and Life Science applications, as the resulting clusters are used for making crucial inferences on disease diagnosis and drug development. Conventional clustering techniques do not generally give sufficiently good results. The HSKH algorithm proposed in this paper combines the merits of Harmony Search Optimization and the Enhanced K means algorithm, producing a near-globally optimal solution with a significant improvement in the accuracy of clustering compared to the original k-means and other known clustering techniques.

The proposed HSKH algorithm may not be effective in its present form for certain clustering applications where the number of clusters k is not known in advance. But it is not a bottleneck in the case of clustering gene expression data where the concern is to classify the data into a specified number of clusters. Development of a clustering technique with automatic computation of k is suggested as a topic for further research.

Supplementary material

Data 1
97320630009084S1.pdf (162.4KB, pdf)

Footnotes

Citation: Nazeer et al., Bioinformation 9(2): 084-088 (2013)

Articles from Bioinformation are provided here courtesy of Biomedical Informatics Publishing Group