Abstract
Summary
Biclustering is a generalization of clustering used to identify simultaneous grouping patterns in the observations (rows) and features (columns) of a data matrix. Recently, the biclustering task has been formulated as a convex optimization problem. While this convex recasting of the problem has attractive properties, existing algorithms do not scale well. To address this problem and make convex biclustering a practical tool for analyzing larger data, we propose a fast implementation of convex biclustering, called COBRAC, which reduces computing time by iteratively compressing the problem size along the solution path. We apply COBRAC to several gene expression datasets to demonstrate its effectiveness and efficiency. Besides the standalone version of COBRAC, we also developed a web server for online calculation and visualization of downloadable interactive results.
Availability and implementation
The source code and test data are available at https://github.com/haidyi/cvxbiclustr or https://zenodo.org/record/4620218. The web server is available at https://cvxbiclustr.ericchi.com.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
Hierarchical clustering is a fundamental task in many bioinformatics problems ranging from cytogenetics to population genetics. Many clustering algorithms, including the popular k-means method, formulate the clustering task as a nonconvex optimization problem. A serious issue with solving nonconvex optimization problems is the presence of suboptimal local minima, which often leads to suboptimal clustering assignments. This issue vanishes, however, when using the convex clustering formulation introduced by Pelckmans et al. (2005). Solutions to the convex clustering problem trace out a tree organization of the data, from the data points located at the leaves to a root, as a nonnegative tuning parameter γ increases from zero to a finite maximal value. The convex clustering tree has several attractive properties; in particular, the recovered hierarchical clustering is guaranteed to be stable to noise, in the sense that small perturbations in the data do not lead to disproportionately large fluctuations in the output tree (Chi et al., 2017). Convex clustering has proven useful for revealing hierarchical organizations in data from a wide range of applications, from genetics (Chen et al., 2015) to text mining (Weylandt et al., 2020), and has been extended to the biclustering case by Chi et al. (2017).
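The convex clustering problem referenced above is commonly written as the following fused-lasso-type objective (standard notation, not taken verbatim from this article: the xᵢ are the data points, the uᵢ their centroids, and the wᵢⱼ nonnegative fusion weights):

$$
\min_{u_1,\dots,u_n} \; \frac{1}{2}\sum_{i=1}^{n}\lVert x_i - u_i\rVert_2^2 \;+\; \gamma \sum_{i<j} w_{ij}\,\lVert u_i - u_j\rVert_2
$$

At γ = 0 each centroid uᵢ equals its data point xᵢ (every point is its own cluster); as γ grows, the penalty fuses centroids together until all coincide at the grand mean, which yields the tree organization described above.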
From a computational perspective, several optimization algorithms have been proposed for solving the convex clustering problem, ranging from a variety of first-order methods (Hocking et al., 2011; Chi and Lange, 2015; Panahi et al., 2017) to second-order methods (Yuan et al., 2018), as well as a novel algorithmic regularization path approach which approximates the solution path up to an arbitrarily small error with extremely inexact alternating direction method of multipliers updates (Weylandt et al., 2020). Although these methods successfully solve the convex clustering problem, with the exception of the work by Weylandt et al. (2020), none of them is designed to solve the optimization problem with the goal of efficiently generating a final dendrogram.
In this article, we introduce the COBRAC algorithm, which iteratively performs two steps: (i) solve a weighted convex biclustering problem; (ii) compress the problem size. Compared with Weylandt et al. (2020), COBRAC stores the solution over a much smaller set of γ values, leading to significantly smaller memory usage when generating the clustering dendrograms. In addition, the compression procedure in COBRAC can be easily incorporated into other convex clustering algorithms to further accelerate computations. Our contributions to convex biclustering are as follows: (i) we present COBRAC, a fast implementation of convex biclustering that can identify possible biclusters and generate the corresponding row and column clustering dendrograms; (ii) while a similar strategy has been proposed for convex clustering before (Hocking et al., 2011), to the best of our knowledge, COBRAC is the first implementation of convex biclustering that utilizes this compression strategy to generate the full solution path; (iii) we also develop a web server for users to run COBRAC online and visualize the dynamic clustering process along the rows and columns of a matrix.
2 Algorithm
Figure 1A summarizes COBRAC at a high level. COBRAC solves a sequence of weighted convex biclustering problems over an increasing series of tuning parameters and uses the solutions to generate row and column clustering dendrograms. The core idea of COBRAC is to reduce the size of the matrix X iteratively, because once some rows and columns are clustered at a given γ, they remain clustered for all larger values of γ (Chi and Steinerberger, 2019). After compression, the objective function is still a weighted convex biclustering problem; the only differences are the size of X and the weight coefficients. For example, suppose that for a given parameter γ, the rows and columns of X have been clustered into r and c clusters, respectively. Then X is compressed into an r × c matrix for the next larger γ in the series, enabling efficient computation of the entire solution path. For the detailed formulation and derivation, see Supplementary Section S1.
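The compression step can be illustrated with a minimal sketch. The function below is an assumption-laden illustration, not COBRAC's actual implementation: it collapses rows and columns sharing a cluster label into their means and tracks cluster sizes, whereas the real algorithm also rescales the fusion weights (see Supplementary Section S1), which we omit here.

```python
import numpy as np

def compress(X, row_labels, col_labels):
    """Illustrative compression: collapse an n x p matrix into an r x c
    matrix of within-bicluster means, given row/column cluster labels
    (0, ..., r-1 and 0, ..., c-1). Cluster sizes are returned so a
    reweighted problem could be formed on the compressed matrix."""
    r = row_labels.max() + 1
    c = col_labels.max() + 1
    row_sizes = np.bincount(row_labels, minlength=r)
    col_sizes = np.bincount(col_labels, minlength=c)
    Z = np.zeros((r, c))
    for i in range(r):
        for k in range(c):
            # mean over the block of entries belonging to row cluster i
            # and column cluster k
            Z[i, k] = X[np.ix_(row_labels == i, col_labels == k)].mean()
    return Z, row_sizes, col_sizes
```

For instance, a 3 × 3 matrix whose first two rows and first two columns have fused is compressed to a 2 × 2 matrix, and the next, larger-γ problem is solved at that reduced size.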
Fig. 1.
(A) A flowchart of the algorithm. (B) The heatmap of the US president dataset with 44 observations (rows) and 75 features (columns) and clustering dendrogram generated by COBRAC using
3 Implementation
The COBRAC program is written in C/C++, and a Python wrapper is also provided for ease of use. In our implementation, we optimize the dual problem of weighted convex biclustering using the accelerated proximal gradient algorithm FASTA (Goldstein et al., 2014). The input to COBRAC is a data matrix in csv format and a list of γ values at which to run the convex biclustering algorithm. The output of COBRAC is a json file that contains both the solution matrices for the different γ values and the corresponding clustering dendrograms. In addition to the standalone program, we also developed a web server at http://cvxbiclustr.ericchi.com for users to run COBRAC and visualize the results online.
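Preparing these inputs might look like the following sketch. The file names and the choice of grid are illustrative assumptions, not part of COBRAC's documented interface; a log-spaced grid is one common way to cover a solution path from near zero up to the value at which everything fuses.

```python
import numpy as np

# Synthetic stand-in for a real data matrix (e.g. the 44 presidents x
# 75 words example from the paper), saved in the csv format COBRAC reads.
rng = np.random.default_rng(0)
X = rng.normal(size=(44, 75))
np.savetxt("matrix.csv", X, delimiter=",")

# A log-spaced grid of tuning parameters covering several orders of
# magnitude; the endpoints here are arbitrary illustrative choices.
gammas = np.logspace(-2, 3, num=20)
np.savetxt("gamma_list.csv", gammas[None, :], delimiter=",")
```

The standalone program would then be pointed at these two csv files, and its json output (solution matrices plus dendrograms per γ) can be loaded for downstream visualization.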
4 Evaluation
As an example, we run COBRAC on the presidents dataset, which contains log-transformed word counts of the 75 most variable words taken from the aggregated major speeches of the 44 U.S. presidents through mid-2018 (Weylandt et al., 2020). Figure 1B shows that COBRAC identifies a strong biclustering structure among word usage and presidents. To investigate the speed-ups attainable by introducing compression, we also run COBRAC on several genomic expression datasets, including the Lung100 and Lung500 (Lee et al., 2010) and TCGA Breast (Koboldt et al., 2012) datasets. COBRAC achieves a 2.5–5× speed-up in solving the entire solution path over these data (Supplementary Table S1). Detailed descriptions of the data and the experimental set-up are provided in the Supplementary Data. The heatmaps with dendrograms for Lung100 and Lung500 are provided on our web server and in Supplementary Figures S4 and S5. The dendrograms given by COBRAC on Lung100 and Lung500 achieve good clustering performance (Supplementary Table S2) and are also robust to the selection of γ values (Supplementary Section S5). Furthermore, we test the performance of COBRAC using different numbers of CPUs. The results show that parallel computing can further reduce the running time by nearly 40% (Supplementary Fig. S1). All the example data are available on our web server for users to reproduce the results.
Funding
This work was supported, in part, by the National Science Foundation [DMS-1752692] and National Institutes of Health [R01EB026936 and R01GM135928].
Conflict of interest statement. None declared.
Contributor Information
Haidong Yi, Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
Le Huang, Department of Genetics, Curriculum in Bioinformatics & Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.
Gal Mishne, Halıcıoğlu Data Science Institute, University of California, San Diego, La Jolla, CA 92093, USA.
Eric C Chi, Department of Statistics, North Carolina State University, Raleigh, NC 27607, USA.
References
- Chen G.K. et al. (2015) Convex clustering: an attractive alternative to hierarchical clustering. PLoS Comput. Biol., 11, e1004228.
- Chi E.C., Lange K. (2015) Splitting methods for convex clustering. J. Comput. Graph. Stat., 24, 994–1013.
- Chi E.C., Steinerberger S. (2019) Recovering trees with convex clustering. SIAM J. Math. Data Sci., 1, 383–407.
- Chi E.C. et al. (2017) Convex biclustering. Biometrics, 73, 10–19.
- Goldstein T. et al. (2014) A field guide to forward-backward splitting with a FASTA implementation. arXiv preprint arXiv:1411.3406, available online at: https://github.com/tomgoldstein/fasta-matlab.
- Hocking T. et al. (2011) Clusterpath: an algorithm for clustering using convex fusion penalties. In: International Conference on Machine Learning, New York, NY, USA, pp. 745–752.
- Koboldt D.C. et al. (2012) Comprehensive molecular portraits of human breast tumours. Nature, 490, 61–70.
- Lee M. et al. (2010) Biclustering via sparse singular value decomposition. Biometrics, 66, 1087–1095.
- Panahi A. et al. (2017) Clustering by sum of norms: stochastic incremental algorithm, convergence and cluster recovery. In: International Conference on Machine Learning, Sydney, Australia, pp. 2769–2777.
- Pelckmans K. et al. (2005) Convex clustering shrinkage. In: PASCAL Workshop on Statistics and Optimization of Clustering, London, UK.
- Weylandt M. et al. (2020) Dynamic visualization and fast computation for convex clustering via algorithmic regularization. J. Comput. Graph. Stat., 29, 87–96.
- Yuan Y. et al. (2018) An efficient semismooth Newton based algorithm for convex clustering. In: International Conference on Machine Learning, Stockholm, Sweden, pp. 5718–5726.