PLOS Computational Biology. 2022 Oct 3;18(10):e1010577. doi: 10.1371/journal.pcbi.1010577

Fast and interpretable consensus clustering via minipatch learning

Luqin Gan 1,*, Genevera I Allen 2,3
Editor: Isidore Rigoutsos4
PMCID: PMC9560608  PMID: 36191044

Abstract

Consensus clustering has been widely used in bioinformatics and other applications to improve the accuracy, stability and reliability of clustering results. This approach ensembles cluster co-occurrences from multiple clustering runs on subsampled observations. For application to large-scale bioinformatics data, such as to discover cell types from single-cell sequencing data, for example, consensus clustering has two significant drawbacks: (i) computational inefficiency due to repeatedly applying clustering algorithms, and (ii) lack of interpretability into the important features for differentiating clusters. In this paper, we address these two challenges by developing IMPACC: Interpretable MiniPatch Adaptive Consensus Clustering. Our approach adopts three major innovations. We ensemble cluster co-occurrences from tiny subsets of both observations and features, termed minipatches, thus dramatically reducing computation time. Additionally, we develop adaptive sampling schemes for observations, which result in both improved reliability and computational savings, as well as adaptive sampling schemes of features, which lead to interpretable solutions by quickly learning the most relevant features that differentiate clusters. We study our approach on synthetic data and a variety of real large-scale bioinformatics data sets; results show that our approach not only yields more accurate and interpretable cluster solutions, but it also substantially improves computational efficiency compared to standard consensus clustering approaches.

Author summary

Clustering seeks to discover groups in big data with wide applications across scientific domains, especially in bioinformatics. However, for huge and sparse data sets common with genomic sequencing technologies, clustering methods can suffer from unreliable results, lack of interpretability in terms of feature importance, and heavy computational costs. To solve these challenges, we propose an extension of consensus clustering that leverages minipatch learning, an ensemble learning framework with learners trained on tiny subsets of observations and features. With adaptive sampling frameworks on both features and observations, our method is able to achieve higher clustering accuracy and reliability, as well as simultaneously identify scientifically important features that distinguish the clusters. In addition, we offer major computational improvements, with dramatically faster speed than our competitors. Our method is general and widely applicable to data sets from any field, and especially can offer superior performance when dealing with complex sparse and high dimensional data found in bioinformatics.


This is a PLOS Computational Biology Methods paper.

Introduction

Consensus clustering is a widely used unsupervised ensemble method in the domains of bioinformatics, pattern recognition, image processing, and network analysis, among others. This method often outperforms conventional clustering algorithms by ensembling cluster co-occurrences from multiple clustering runs on subsampled observations [1]. However, consensus clustering has many drawbacks when dealing with large data sets typical in bioinformatics. These include computational inefficiency due to repeated clustering of very large data on multiple subsamples, degraded clustering accuracy due to high sensitivity to irrelevant features, as well as lack of interpretability. Consider, for example, the task of discovering cell types from single-cell RNA sequencing data. This data often contains tens-of-thousands of cells and genes, making consensus clustering computationally prohibitive. Additionally, only a small number of genes are typically responsible for differentiating cell types; consensus clustering considers all features and provides no interpretation of which features or genes may be important. Inspired by these challenges for large-scale bioinformatics data, we propose a novel approach of consensus clustering that utilizes tiny subsamples or minipatches as well as adaptive sampling schemes to speed computation and learn important features.

Related work

Several types of consensus functions in ensemble clustering have been proposed, including co-association based functions [2–5], hyper-graph partitioning [6–8], relabeling and voting approaches [9–11], mixture models [12–14], and mutual information [15–17]. Co-association based functions, such as consensus clustering, converge faster and are more applicable to large-scale bioinformatics data sets. Our approach is based on consensus clustering, whose concept is straightforward. In order to achieve evidence accumulation, a consensus matrix is constructed from pairwise cluster co-occurrence, with entries in [0,1]. It is later regarded as a similarity matrix of the observations to obtain the final clustering results [18]. Closely related to our work, numerous variants of consensus clustering with adaptive subsampling strategies on observations have been proposed. For instance, Duarte et al. [19] update the sampling weights of objects with their degrees of confidence, which are derived by clustering the consensus matrix; Parvin et al. [20] compute sampling weights by the uncertainty of object assignments based on the distances of consensus indices to 0.5; and Topchy et al. [21] adaptively subsample objects according to the consistency of clustering assignments in previous iterations. Besides adaptive sampling, Ren et al. [22] overweight the observations with high confusion, and assign the one-shot weights to obtain final clustering results. However, the existing sampling schemes focus on observations only and do not take feature relevance into consideration. As a result, these methods perform poorly on sparse data sets, where only a small set of features can significantly influence cluster assignments. Many clustering methods and pipelines have been proposed that specifically focus on single-cell RNA-seq data [23–27]. A popular approach, SC3 [23], employs consensus clustering by applying dimension reduction to the subsampled data and then applying K-Means.
Satija et al. [25] integrate dropout imputation and dimension reduction with a graph-based clustering algorithm. Another widely used and simple approach is to conduct t-SNE dimension reduction followed by K-Means clustering [28]. Many have discussed the computational challenge of clustering large-scale single-cell sequencing data [28] and have sought to address this via dimension reduction. But clustering based on dimension-reduced data is no longer directly interpretable; that is, one cannot determine which genes are directly responsible for differentiating cell type clusters. The motivation of our approach is not only to propose a computationally fast approach, but also to develop a method that has built-in feature interpretability to discover differentially expressed genes. A series of clustering algorithms have been proposed to add insights on feature importance. Some clustering algorithms conduct sparse feature selection through regularization within clustering algorithms. For example, sparse K-Means (sparseKM), sparse hierarchical clustering (sparseHC) [29] and sparse convex clustering [30,31] facilitate feature selection by solving a lasso-type optimization problem. However, this type of sparse clustering algorithm is often slow and highly sensitive to hyper-parameter choices; thus, such algorithms may face computational challenges for large data. Another class of methods ranks features by their influence on results. The resulting sensitivity to the changes of one feature can be measured by the difference in silhouette widths of clustering results [32], the difference in the entropy of consensus matrices [33], or consistency of graph spectrum [34]. However, feature ranking methods have to measure the importance of each feature separately, which leads to extremely high computational costs.
Additionally, [35] propose a post-hoc feature selection method that solves an optimization problem to determine important features within the standard consensus clustering algorithm; however, this approach suffers from major computational hurdles for large data. Therefore, we are motivated to propose an extension of consensus clustering to greatly improve clustering accuracy, provide model interpretability, and simultaneously ease the computational burden, by incorporating innovative adaptive sampling schemes on both features and observations with minipatch learning.

Contributions

In this paper, we propose a novel methodology as an extension of consensus clustering, which demonstrates major advantages in large-scale bioinformatics data sets. Specifically, we seek to improve computational efficiency, provide interpretability in terms of feature importance, and at the same time improve clustering accuracy. We achieve these goals by leveraging the idea of minipatch learning [36–38], which is an ensemble of learners trained on tiny subsamples of both observations and features. Compared to only subsampling observations in existing consensus clustering ensembles, our approach offers significant computational savings by learning from many tiny data sets. In addition, we develop novel adaptive sampling schemes for both observations and features to concentrate learning on observations with uncertain cluster assignments and on features that are most important for separating clusters. This provides inherent interpretations for consensus clustering and also further improves the computational efficiency of the learning process. We test our novel methods and compare them to existing approaches through extensive simulations and four large real-data case studies from bioinformatics and imaging. Our results show major computational gains with our run time on the same order as that of hierarchical clustering, as well as improved clustering accuracy, feature selection performance, and interpretability.

Methods

Let $X \in \mathbb{R}^{N \times M}$ be the data matrix of interest, with $M$ features measured over $N$ observations. $x_i \in \mathbb{R}^{M}$ is the $M$-dimensional feature vector observed for sample $i$. Our goal is to partition the observations into disjoint homogeneous clusters, which can reflect the underlying data structures and patterns. We propose to extend popular consensus clustering techniques [39] to be able to detect clusters more accurately and computationally efficiently, in high-dimensional noisy data common in bioinformatics [40,41]. We also seek ways to ensure our clusters are interpretable through feature selection. To this end, we propose a number of innovations and improvements to consensus clustering outlined in our Minipatch Consensus Clustering framework in Algorithm 1. Similar to consensus clustering, our approach repeatedly subsamples the data, applies clustering, and records the $N \times N$ co-clustering membership matrix, $V$. It then ensembles all the co-clustering membership information together into the $N \times N$ consensus matrix $S$. This consensus matrix takes values in [0,1] indicating the proportion of times two observations are clustered together; it can be regarded as a similarity matrix for the observations. A perfect consensus matrix includes only entries of 0 or 1, where observations are always assigned to the same clusters; values in between indicate the (un)reliability of cluster assignments for each observation. To obtain final cluster assignments, one can cluster the estimated consensus matrix, which typically yields more accurate clusters than applying the standard, non-ensembled clustering algorithms [1].

While the core of our approach is identical to that of consensus clustering, we offer three major methodological innovations in Steps 1 and 2 of Algorithm 1 that yield remarkably faster, more accurate, and interpretable results. Our first innovation is building cluster ensembles from tiny subsets of both observations and features (by default, n = 25% of N and m = 10% of M), termed minipatches [37–39]. Note that existing consensus clustering approaches form ensembles by subsampling typically 80% of observations and all the features for each ensemble member [42]. For large-scale bioinformatics data where the number of observations and features could be in the tens-of-thousands, repeated clustering of this large data is a major computational burden. Instead, our approach, termed Minipatch Consensus Clustering (MPCC), subsamples a tiny fraction of both observations and features and hence has obvious computational advantages. The computational complexity of MPCC in Algorithm 1 is $O(mn^2T + N^2)$, where $T$ is the total number of minipatches. Since $m$ and $n$ are very small, the dominating term is the $N^2$ computations required to update the consensus matrix. This compares very favorably to existing consensus clustering approaches. If the default of 80% of observations are subsampled in each run, then the time complexity is $O(MN^2T)$, which can be very slow for both large $N$ and large $M$ data sets. On the other hand, our method is comparable in complexity to hierarchical clustering, which is also $O(N^2)$ [43], but is perhaps slower than K-Means, which is $O(N)$ [44]. The proof of the time complexity is in S1 Text.

While MPCC offers dramatic computational improvements over standard consensus clustering, one may ask whether the results will be as accurate. We investigate and address this question from the perspective of how tiny subsamples of observations and, separately, features affect clustering results. First, note that if a tiny fraction of observations is subsampled, then by chance, some of the clusters may not be represented; this is especially the case for large K or for uneven cluster sizes. Existing consensus clustering approaches typically apply a clustering algorithm with fixed K to each subsample, but this practice would prove detrimental to our approach. Instead, we propose to choose the number of clusters in each minipatch adaptively. While there are many techniques in the literature to do so that could be employed with our method [18,45], we are motivated to choose the number of clusters very quickly with nearly no additional computation. Hence, we propose to exclusively use hierarchical clustering on each minipatch and to cut the tree at the h quantile of the dendrogram height to determine the number of clusters and cluster membership. This approach is not only fast but also adaptive to the number of clusters present in the minipatch, and the results change smoothly with cuts at different heights. Our empirical results reveal that this approach performs well on minipatches, and we specifically investigate its utility, sensitivity, and tuning of h in S1 Text; importantly, we find that setting h = 0.95 nearly universally yields the best results, and hence we suggest fixing this value. Additionally, we provide details on hyper-parameters, tuning, and stopping criteria in S1 Text. We also explored other alternatives to determine the number of clusters in a minipatch, including selecting the cluster number with the highest silhouette score and using the oracle number of clusters K. However, the alternatives yield worse performance either in terms of clustering accuracy or computational time. Further details are in S1 Text.
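The quantile-based tree cut described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `cluster_minipatch` is hypothetical, the "h quantile of the dendrogram height" is interpreted here as the h quantile of the merge heights, and SciPy's Ward linkage is applied to precomputed Manhattan distances (SciPy documents Ward as properly defined only for Euclidean distances, so this mirrors the setting reported later in the paper rather than a guaranteed-equivalent implementation).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist


def cluster_minipatch(X_patch, h=0.95):
    """Cluster one minipatch and cut the tree at the h quantile of merge heights.

    Assumption: "dendrogram height quantile" means the h quantile of the
    (n - 1) merge heights stored in the third column of the linkage matrix.
    """
    # Ward linkage on Manhattan (cityblock) distances between observations.
    Z = linkage(pdist(X_patch, metric="cityblock"), method="ward")
    cut = np.quantile(Z[:, 2], h)
    # Observations whose merge height is at most `cut` end up in the same cluster,
    # so the number of clusters is chosen by the cut rather than fixed in advance.
    return fcluster(Z, t=cut, criterion="distance")
```

Because the number of clusters falls out of the cut height, a minipatch that happens to miss a cluster is not forced into a fixed K partition.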

Next, one may ask how subsampling the features in minipatches affects clustering accuracy. Obviously, for high-dimensional data in which only a small number of features are relevant for differentiating clusters, subsampling minipatches containing the correct features would improve results. We address such possibilities in the next section. But if this is not the case, would clustering accuracy suffer? Since we apply hierarchical clustering, which takes distances as input, we seek to understand how far off our distance input can be when we employ sub-samples of features. We consider this theoretically in S1 Text and empirically in the subsequent section. Our analysis and results reveal that while smaller minipatches yield faster computations, there may be a slight trade-off in terms of clustering accuracy. Our empirical results in S1 Text suggest that such a trade-off is generally slight or negligible, so we can typically utilize smaller minipatches.

Algorithm 1: Minipatch Consensus Clustering

Input: $X$, $n$, $m$, $V^{(0)}$, $D^{(0)}$, $h$;

while stopping criteria not met do

    1. Obtain minipatch $X_{I_t,F_t} \in \mathbb{R}^{n \times m}$ by subsampling $n$ observations $I_t \subset \{1,\dots,N\}$ and $m$ features $F_t \subset \{1,\dots,M\}$, without replacement;

        MPCC subsamples uniformly at random;

        MPACC uses the adaptive observation sampling scheme only;

        IMPACC uses both adaptive feature and observation sampling schemes simultaneously;

    2. Obtain estimated clustering result $C^{(t)}$ by fitting hierarchical clustering to $X_{I_t,F_t}$ and cutting the tree at the $h$ height quantile;

    3. Update co-clustering membership matrix $V$ and co-sampling matrix $D$:

        $V^{(t)}(i,i') = V^{(t-1)}(i,i') + \mathbb{I}(C_i^{(t)} = C_{i'}^{(t)})$; $D^{(t)}(i,i') = D^{(t-1)}(i,i') + \mathbb{I}(i \in I_t,\ i' \in I_t)$

end

Calculate consensus matrix $S(i,i') = \dfrac{V^{(T)}(i,i')}{\max(1,\ D^{(T)}(i,i'))}$;

Obtain final clustering result $\hat{\Pi}$ by using $S$ as a similarity matrix;

Output: $S$, $\hat{\Pi}$.
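As a rough illustration of Algorithm 1 with uniform sampling (MPCC), the following sketch accumulates the co-clustering matrix V and co-sampling matrix D over minipatches and then clusters the consensus matrix. Several choices here are assumptions rather than the authors' implementation: the function name `mpcc`, a fixed number of iterations `T` in place of the early stopping criterion described in S1 Text, interpreting the height quantile as the quantile of merge heights, and average linkage on 1 − S for the final clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform


def mpcc(X, K, n_frac=0.25, m_frac=0.10, h=0.95, T=200, seed=0):
    """Sketch of minipatch consensus clustering with uniform random sampling."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    n = max(2, int(n_frac * N))   # observations per minipatch
    m = max(1, int(m_frac * M))   # features per minipatch
    V = np.zeros((N, N))          # co-clustering counts
    D = np.zeros((N, N))          # co-sampling counts

    for _ in range(T):  # assumption: fixed T instead of an early stopping rule
        I_t = rng.choice(N, size=n, replace=False)
        F_t = rng.choice(M, size=m, replace=False)
        patch = X[np.ix_(I_t, F_t)]
        Z = linkage(pdist(patch, metric="cityblock"), method="ward")
        c = fcluster(Z, t=np.quantile(Z[:, 2], h), criterion="distance")
        block = np.ix_(I_t, I_t)
        V[block] += (c[:, None] == c[None, :])  # pairs clustered together
        D[block] += 1                           # pairs sampled together

    S = V / np.maximum(1, D)                    # consensus matrix in [0, 1]

    # Final clustering: treat 1 - S as a dissimilarity (assumed average linkage).
    dist = 1.0 - S
    np.fill_diagonal(dist, 0.0)
    Z_final = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z_final, t=K, criterion="maxclust")
    return S, labels
```

Each iteration clusters only an n × m patch, so the per-iteration cost does not grow with the full N × M matrix; the N × N matrices V and D are the only full-size objects.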

We have introduced minipatch consensus clustering (MPCC) using random subsamples of both features and observations. The advantage of this approach is its computational speed, which is on the order of standard clustering approaches such as hierarchical and spectral clustering, as suggested by our empirical results in the Results section. But one may ask whether clustering results can be improved by perhaps optimally sampling observations and/or features instead of random sampling. Some have suggested such possibilities in the context of consensus clustering [19–22]; we explore it and develop new approaches for this in the following sections.

Minipatch Adaptive Consensus Clustering (MPACC)

One may ask whether it is possible to improve upon minipatch consensus clustering in terms of both speed and clustering accuracy by adaptively sampling observations. For example, we may want to sample observations that are not well clustered more frequently to learn their cluster assignments faster. In the method MiniPatch Adaptive Consensus Clustering (MPACC), we propose to dynamically update sampling weights, with a focus on observations that are difficult to cluster and that are less frequently sampled. In addition, we leverage the adaptive weights by designing a novel observation sampling scheme. Specifically, we propose to update observation weights by adjusted confusion values dynamically, with a default learning rate $\alpha_I = 0.5$. To measure the level of clustering uncertainty, confusion values are derived from the consensus matrix, given by $\mathrm{confusion}_i = \frac{1}{N}\sum_{i'=1}^{N} S(i,i')\,(1 - S(i,i'))$ for observation $i$. A larger confusion value near 0.25 indicates poorer clustering with unstable assignments, and the minimum confusion value 0 suggests perfect clustering. Note that confusions tend to grow with iterations because more consensus values are updated from the initial value 0. Therefore, a large confusion value due to oversampling cannot truly reflect the level of uncertainty. To eliminate bias caused by oversampling and to upweight less frequently sampled observations, we further adjust confusion values by sampling frequencies of observations in previous iterations, as presented in Algorithm 2. The next question is, how do we leverage the weights to dynamically construct minipatches as the number of iterations grows? A straightforward solution is to probabilistically subsample with probability (Prob) proportional to the weights. But the problem with this approach is that the clustering performance will be compromised if we only tend to sample uncertain and difficult observations.
To resolve this drawback, we develop an exploitation and exploration plus probabilistic (EE + Prob) sampling scheme (Algorithm 3). The scheme consists of two sampling stages: a burn-in stage and an adaptive stage. The burn-in stage aims to explore the entire observation space and ensure every observation is sampled several times. During the next adaptive stage, observations with levels of uncertainty greater than a threshold are classified into the high uncertainty set, and the algorithm exploits this set by sampling a $\gamma^{(t)}$ proportion of its observations using probabilistic sampling. Here, $\{\gamma^{(t)}\} \in [0.5,1]$, $t = 1,2,\dots$, is a monotonically increasing sequence that controls the sampling size in the exploitation and exploration steps. Meanwhile, the algorithm explores the rest of the observations with uniform weights to avoid exclusively focusing on difficult observations. We randomly sample the observations that we are confident about because we need to include a fair number of easy-to-cluster observations to construct well-defined clusters in each minipatch so as to better cluster the uncertain ones. We also propose to use the EE + Prob scheme as our adaptive feature sampling scheme, which is discussed in the Interpretable Minipatch Adaptive Consensus Clustering (IMPACC) section.

Algorithm 2: Weight updating in adaptive observation sampling scheme

Input: $S^{(t-1)}$, $w_I^{(t-1)}$, $\{I_l\}_{l=1}^{t-1}$, $\alpha_I$; $S^{(0)} = 0$, $w_I^{(0)} = \frac{1}{N}$;

1. Calculate sample uncertainty $u_i = \frac{1}{N}\sum_{i'=1}^{N} S(i,i')\,(1 - S(i,i')) \times \dfrac{t-1}{\sum_{l=1}^{t-1} \mathbb{I}(i \in I_l)}$;

2. Update observation weight vector $w_I^{(t)} = \alpha_I\, w_I^{(t-1)} + (1 - \alpha_I)\, \dfrac{u}{\sum_{i=1}^{N} u_i}$;

Output: $w_I^{(t)}$.
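The weight update in Algorithm 2 amounts to a few array operations. The sketch below is a minimal illustration; the function name `update_observation_weights` and the argument `sample_counts` (how many times each observation has been sampled so far) are hypothetical, and the guards against division by zero are added assumptions that the algorithm statement does not spell out.

```python
import numpy as np


def update_observation_weights(S, w_prev, sample_counts, t, alpha=0.5):
    """One confusion-based weight update for observations (sketch of Algorithm 2).

    S             : (N, N) consensus matrix from iteration t - 1
    w_prev        : (N,) previous weight vector
    sample_counts : (N,) number of minipatches each observation appeared in
    t             : current iteration index (t >= 2 for a meaningful update)
    """
    # Confusion per observation: mean of S(i, i') * (1 - S(i, i')), in [0, 0.25].
    confusion = (S * (1.0 - S)).mean(axis=1)
    # Frequency adjustment (t - 1) / count upweights rarely sampled observations.
    # Assumption: counts of zero are clipped to one to avoid dividing by zero.
    u = confusion * (t - 1) / np.maximum(1, sample_counts)
    total = u.sum()
    # Assumption: fall back to uniform weights if every uncertainty is zero.
    u_norm = u / total if total > 0 else np.full(len(u), 1.0 / len(u))
    return alpha * w_prev + (1 - alpha) * u_norm
```

For example, with three observations where only the third has consensus values of 0.5 with the others, the third observation's weight rises above the other two while the weights still sum to one.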

In Algorithm 3, $t$ denotes the current iteration count, $E$ denotes the number of burn-in epochs with default value 3, and $w_I^{(t-1)}$ is generated by Algorithm 2. The data-driven threshold $\tau$ for uncertain observations (important features) is set to the 90% quantile of observation weights (mean plus one standard deviation of feature weights).

Relation to existing literature

Several authors have suggested similar weight updating approaches in the consensus clustering literature. Ren et al. [22] also obtain observation weights by confusion values as in our method. The difference is that their method only uses the weighting scheme at the final clustering step rather than for adaptive sampling. On the other hand, similar to our adaptive weight updating scheme, Duarte et al. [19], Topchy et al. [21] and Parvin et al. [20] iteratively update weights depending on clustering history. However, these existing methods utilize probabilistic sampling, so they would largely suffer from biased sampling and inaccurate results by only focusing on hard observations. In contrast, instead of purely probabilistic sampling, we design the EE + Prob sampling scheme to leverage the weights, which is inspired by the exploration and exploitation (EE) scheme from multi-armed bandits [46,47] and also employed for feature selection with minipatches [36]. Compared to the latter, the innovation in our approach is to combine the advantages of probabilistic sampling and exploitation-exploration sampling, which proves to have particular advantages for clustering. Comparisons with other possible sampling schemes proposed in the literature are in S1 Text.

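Algorithm 3: Adaptive Observation (Features) Sampling Scheme: EE + Prob

Input: $t$, $n$, $N$, $E$, $\{\gamma^{(t)}\}$, $w_I^{(t-1)}$, $\tau$; $w_I^{(0)} = 1/N$;

Initialization: $Q = \lceil N/n \rceil$, $\mathcal{I} = \{1,\dots,N\}$;

if $t \le E \cdot Q$ then

    // Burn-in stage

    if $\mathrm{mod}_Q(t) = 1$ then

        // New epoch

        Randomly reshuffle the index set $\mathcal{I}$ and partition it into disjoint sets $\{\mathcal{I}_q\}_{q=0}^{Q-1}$;

    end

    Set $I_t = \mathcal{I}_{\mathrm{mod}_Q(t)}$;

else

    // Adaptive stage

    1. Update observation weights $w_I^{(t)}$ by Algorithm 2;

    2. Create high uncertainty set $\mathcal{H}_I = \{i \in \{1,\dots,N\} : w_{I,i}^{(t)} > \tau_{w_I^{(t)}}\}$;

    3. Exploitation: sample $\min(n,\ \gamma^{(t)}|\mathcal{H}_I|)$ observations $I_{t,1} \subseteq \mathcal{H}_I$ with probability proportional to the weights $w_{I,\mathcal{H}_I}^{(t)}$;

    4. Exploration: sample $(n - \min(n,\ \gamma^{(t)}|\mathcal{H}_I|))$ observations $I_{t,2} \subseteq \{1,\dots,N\} \setminus \mathcal{H}_I$ uniformly at random;

    5. Set $I_t = I_{t,1} \cup I_{t,2}$;

end

Output: $I_t$.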
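The two stages of the EE + Prob scheme can be sketched as follows. This is an illustrative reading of Algorithm 3, not the authors' code: the function names `burn_in_partition` and `ee_prob_sample` are hypothetical, the threshold is taken as the 90% quantile of the weights (the stated default for observations), and rounding $\gamma^{(t)}|\mathcal{H}_I|$ down to an integer is an assumption.

```python
import numpy as np


def burn_in_partition(N, n, rng):
    """Burn-in epoch: shuffle all indices and split them into Q = ceil(N / n) disjoint sets."""
    Q = int(np.ceil(N / n))
    return np.array_split(rng.permutation(N), Q)


def ee_prob_sample(w, n, gamma, rng, q=0.90):
    """Adaptive stage: exploit the high-weight set, explore the rest uniformly."""
    N = len(w)
    tau = np.quantile(w, q)                  # data-driven threshold
    high = np.flatnonzero(w > tau)           # high uncertainty (or importance) set
    rest = np.setdiff1d(np.arange(N), high)

    # Exploitation: weighted draw, without replacement, from the high set.
    n_exploit = min(n, int(gamma * len(high)))   # assumption: round down
    if n_exploit > 0:
        p = w[high] / w[high].sum()
        exploit = rng.choice(high, size=n_exploit, replace=False, p=p)
    else:
        exploit = np.array([], dtype=int)

    # Exploration: uniform draw from the remaining indices.
    n_explore = min(n - n_exploit, len(rest))
    explore = rng.choice(rest, size=n_explore, replace=False)
    return np.concatenate([exploit, explore])
```

Because exploitation and exploration draw from disjoint index sets, the returned minipatch never contains duplicates, and the exploration share guarantees that easy observations keep appearing alongside the uncertain ones.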

Interpretable Minipatch Adaptive Consensus Clustering (IMPACC)

One major drawback of consensus clustering is that it lacks interpretability into important features. This is especially important for high-dimensional data like that in bioinformatics, where we expect only a small subset of features to be relevant for determining clusters. To address this, we develop a novel adaptive feature sampling approach termed Interpretable Minipatch Adaptive Consensus Clustering (IMPACC) that learns important features for clustering and improves clustering accuracy for high-dimensional data. In clustering, two types of approaches to determine important features have been proposed. One is to obtain a sparse solution by solving an optimization problem [29–31], and another one is to rank features by their influence on results [32–34]. However, in data sets with a large number of observations and features, both kinds of methods suffer from significant computational inefficiency. So the question we are interested in is, can we achieve fast, accurate, and reliable feature selection within the consensus clustering process with minipatches? We address this question by proposing a novel adaptive feature weighting method that measures the feature importance in each minipatch and then ensembles the results to increase the weights of the important features. Given these adaptive feature weights, we can then utilize our adaptive sampling scheme proposed in Algorithm 3 to sample important features more frequently. Outlined in Algorithm 4, we propose an adaptive feature weighting scheme by testing whether each feature is associated with the estimated cluster labels on that minipatch. To do so, we use a simple ANOVA test, in part because it is computationally fast and only requires one matrix multiplication. Based on the p-values from these tests, we establish an important feature set, A, and obtain the importance scores as the frequencies of features being classified into this feature set over iterations.
Then the feature sampling weights are dynamically updated with learning rate $\alpha_F$, with a default value 0.5. Therefore, by ensembling feature importance obtained from each iteration, we are able to simultaneously improve clustering accuracy and build model interpretability from resulting feature weights, with minimal sacrifices of computation time. In Algorithm 4, $C^{(t-1)}$ denotes the clustering labels on the $(t-1)$-th minipatch, $\{F_l\}_{l=1}^{t-1}$ denotes the sets of subsampled features in each minipatch up to iteration $t-1$, $\{A_l\}_{l=1}^{t-2}$ denotes the feature supports constructed up to iteration $t-2$, and the p-value cutoff $\eta$ has default value 0.05. We also explored alternative measures of the association between features and cluster labels in a minipatch. These include using a non-parametric ANOVA (a Kruskal-Wallis test), which relaxes normality assumptions, and using a multinomial regression of features to predict cluster assignments, which can account for feature correlations. Both of these approaches, however, have a higher computational burden than using a simple ANOVA test. Our empirical comparison in S1 Text shows that they also yield lower clustering accuracy.

We propose to utilize the same type of EE + Prob sampling scheme (Algorithm 3) given our feature weights to learn the important features for clustering. Such a scheme exploits the important features and samples these more frequently as the algorithm progresses. But it also balances exploring other features to ensure that potentially important features are not missed. Our final IMPACC algorithm then utilizes both adaptive observation sampling and adaptive feature sampling to improve computation efficiency and clustering accuracy while also providing feature interpretability. Utilizing minipatches in consensus clustering allows us to develop these innovative adaptive sampling schemes and be the first to propose feature learning in this context. Even though IMPACC has several hyper-parameters, in practice, our methods are quite robust to parameter choices and generally give strong performance under default parameter settings. Therefore, we are freed from the computationally expensive hyper-parameter tuning process. We include a study on learning accuracy with different hyper-parameters and default values and suggest a data-driven tuning process in S1 Text. Overall, the proposed MPACC with only adaptive sampling on observations is more suitable for data with little or no sparsity; and IMPACC, which adaptively subsamples both observations and features in minipatch learning, can be more useful when dealing with high-dimensional and sparse data sets in bioinformatics. It enhances model accuracy, scalability, and interpretability by focusing on uncertain observations and important features in an efficient manner. Our empirical study in the Results section demonstrates the major advantages of the IMPACC method in terms of clustering quality, feature selection accuracy, and computational savings.

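Algorithm 4: Weight updating in adaptive feature sampling scheme

Input: $C^{(t-1)}$, $\{F_l\}_{l=1}^{t-1}$, $\{A_l\}_{l=1}^{t-2}$, $w_F^{(t-1)}$, $\alpha_F$; $w_F^{(0)} = 1/M$;

1. For each feature $j \in F_{t-1}$, conduct an ANOVA test between feature $j$ and $C^{(t-1)}$, and record the p-value $p_j^{(t-1)}$;

2. Create a feature support $A_{t-1} \subseteq F_{t-1}$: $A_{t-1} = \{j \in F_{t-1} : p_j^{(t-1)} < \eta\}$;

3. Update feature weight vector $w_F^{(t)} \in \mathbb{R}^{M}$ by ensembling feature supports $\{A_l\}_{l=1}^{t-1}$:

    $w_{F,j}^{(t)} = \alpha_F\, w_{F,j}^{(t-1)} + (1 - \alpha_F)\, \dfrac{\sum_{l=1}^{t-1} \mathbb{I}(j \in F_l,\ j \in A_l)}{\max\!\left(1,\ \sum_{l=1}^{t-1} \mathbb{I}(j \in F_l)\right)}$

Output: $w_F^{(t)}$.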
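The feature weighting in Algorithm 4 can be sketched with a per-feature one-way ANOVA followed by a running frequency update. The function names `anova_support` and `update_feature_weights`, and the running count arrays `n_selected` and `n_sampled`, are hypothetical bookkeeping introduced for illustration; the sketch assumes that features with p-values below the cutoff η are the ones placed in the support.

```python
import numpy as np
from scipy.stats import f_oneway


def anova_support(X_patch, labels, eta=0.05):
    """Boolean mask of minipatch features associated with the estimated cluster labels."""
    groups = [X_patch[labels == k] for k in np.unique(labels)]
    if len(groups) < 2:
        # A single cluster carries no information about feature importance.
        return np.zeros(X_patch.shape[1], dtype=bool)
    # f_oneway tests every column at once when given 2-D groups (axis=0).
    _, p = f_oneway(*groups)
    return p < eta   # NaN p-values (e.g. constant features) compare as False


def update_feature_weights(w_prev, n_selected, n_sampled, alpha=0.5):
    """Blend previous weights with the selection frequency of each feature.

    n_selected[j] : times feature j was sampled and fell into a support
    n_sampled[j]  : times feature j was sampled at all
    """
    importance = n_selected / np.maximum(1, n_sampled)   # in [0, 1]
    return alpha * w_prev + (1 - alpha) * importance
```

After each minipatch with feature indices `F_t` and mask `support = anova_support(...)`, the counts would be updated with `n_sampled[F_t] += 1` and `n_selected[F_t[support]] += 1` before calling `update_feature_weights`.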

Results

In this section, we assess the performance of IMPACC and MPCC with application to a high dimensional and high noise synthetic simulation study in Synthetic Data section and four large-scale real data sets in Case Studies on Real Data section, in comparison with several conventional clustering strategies.

Synthetic data

We evaluate the performance of MPCC and IMPACC in terms of clustering accuracy and computation time against widely used competitors, and compare IMPACC's feature selection accuracy with existing sparse feature selection techniques. We propose two kinds of generative models for the synthetic data, with different structures of feature correlation. Here we only show the results of the sparse simulation with autoregressive covariance structure. In addition, we also generate synthetic data based on a real single-cell RNA-seq data set using the splatter single-cell simulation method [48]. The results of the splatter simulation and the simulation with block-diagonal covariance structure conducted in sparse, weakly sparse and non-sparse scenarios are in S1 Text. In the sparse autoregressive simulation study, each data set is created from a mixture of Gaussians with an AR(1) covariance structure, where the covariance between features $j$ and $j'$ can be written as $\sigma_{j,j'} = \rho^{|j-j'|}$. The parameter $\rho$ is set to 0.5. We set the number of observations, features and clusters to be $N = 500$, $M = 5{,}000$, $K = 4$, respectively. In order to better reflect the structure of real bioinformatics data, we design unbalanced cluster sizes, and the numbers of observations in each cluster are 20, 80, 120, 280. The mean vector of the features in cluster $k$ is $\mu = [\mu_k, \mu_0]$, where $\mu_k \in \mathbb{R}^{25}$ and $\mu_0 = 0_{4975}$ are the means of the 25 signal features and 4,975 noise features, respectively. The signal-to-noise ratio (SNR) is defined as the L2-norm of the feature means: $\mathrm{SNR} = \|\mu\|_2$. In order to assess feature selection capability, synthetic data is generated with SNR ranging from 1 to 8. Specifically, the signal features are generated with $\mu_1 = \frac{\mathrm{SNR}}{5} 1_{25}$, $\mu_2 = \left(\frac{\mathrm{SNR}}{5} 1_{13}^T,\ -\frac{\mathrm{SNR}}{5} 1_{12}^T\right)^T$, $\mu_3 = \left(-\frac{\mathrm{SNR}}{5} 1_{13}^T,\ \frac{\mathrm{SNR}}{5} 1_{12}^T\right)^T$, $\mu_4 = -\frac{\mathrm{SNR}}{5} 1_{25}$. Data with a higher SNR has more informative signal features and so is easier to cluster. For all clustering algorithms, we assume the oracle number of clusters K. Hierarchical clustering is applied as the final algorithm in IMPACC and MPCC, with the number of iterations determined by an early stopping criterion, as described in S1 Text. Regular consensus clustering uses exactly the same settings as MPCC, including the number of iterations. Ward's minimum variance method with Manhattan distance is used in all hierarchical clustering related methods. Details on the implementation of competing methods are in S1 Text. In terms of feature selection, IMPACC provides feature importance scores ranging in [0,1], and sparseKM and sparseHC generate sparse feature weights with zero values for unimportant features. We propose two methods to evaluate feature selection accuracy. The oracle method selects the top 25 features (the oracle number of signal features) with the highest importance score in IMPACC or the highest non-zero weights in sparseKM and sparseHC. The data-driven method selects features with importance scores higher than the mean plus one standard deviation of all scores in IMPACC, and selects all the features with non-zero weights in sparseKM and sparseHC [30].
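A data set of the kind described above can be generated without forming the full M × M covariance matrix, because an AR(1) process with unit variance has exactly the covariance $\rho^{|j-j'|}$. The sketch below is an illustration under assumptions: the function name `simulate_ar1_mixture` is hypothetical, the signal features are placed in the first 25 columns, and the sign pattern of the four cluster means (with $\mu_3 = -\mu_2$ and $\mu_4 = -\mu_1$) is an assumed reading of the mean specification.

```python
import numpy as np


def simulate_ar1_mixture(snr, sizes=(20, 80, 120, 280), M=5000,
                         n_signal=25, rho=0.5, seed=0):
    """Gaussian mixture with AR(1) feature covariance and sparse mean shifts."""
    rng = np.random.default_rng(seed)
    N = sum(sizes)
    labels = np.repeat(np.arange(len(sizes)), sizes)

    # AR(1) recursion across features: cov(e_j, e_j') = rho ** |j - j'|.
    z = rng.standard_normal((N, M))
    X = np.empty((N, M))
    X[:, 0] = z[:, 0]
    scale = np.sqrt(1.0 - rho ** 2)
    for j in range(1, M):
        X[:, j] = rho * X[:, j - 1] + scale * z[:, j]

    # Signal means with L2 norm equal to snr (snr / 5 per feature when n_signal = 25).
    a = snr / np.sqrt(n_signal)
    n1 = (n_signal + 1) // 2                      # 13 of the 25 signal features
    half = np.concatenate([np.full(n1, a), np.full(n_signal - n1, -a)])
    # Assumed sign pattern for the four clusters.
    mu = np.vstack([np.full(n_signal, a), half, -half, np.full(n_signal, -a)])

    X[:, :n_signal] += mu[labels]                 # assumption: signal columns come first
    return X, labels
```

With `snr = 4`, each signal feature mean is ±0.8, so the cluster means are small relative to the unit-variance noise in any single feature and only become detectable jointly across the 25 signal features.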

We use the adjusted Rand index (ARI) to evaluate clustering performance and the F1 score to measure feature selection accuracy. The F1 score lies in [0,1] and the ARI is at most 1 (with values near 0 for chance-level agreement); for both, a higher value indicates higher accuracy. The averaged results over 10 repetitions are shown in Fig 1. Overall, IMPACC yields the best clustering performance among all competing methods, with the highest ARI in most of the SNR settings. In terms of feature selection, IMPACC perfectly recovers the informative features, with an F1 score equal to 1 when the SNR is large, and is significantly better than sparseKM and sparseHC. Note that the oracle and data-driven F1 scores are the same for sparseKM and sparseHC because these two methods under-select important features. Additionally, IMPACC achieves major computational advantages compared to sparse feature selection clustering strategies. All computation times are recorded on a laptop with 16GB of RAM (2133 MHz) and a dual-core processor (3.1 GHz). Note that we only show results of the sparse simulation with autoregressive covariance structure in Fig 1, and we include the remaining scenarios in S1 Text. Our methods remain dominant in sparse simulations with block-diagonal structure and in splatter simulations, but IMPACC shows little improvement in the no-sparsity scenario, where all the features are relevant.
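
For reference, both evaluation metrics can be computed with a few lines of standard-library Python. This is a sketch of the usual definitions (pair-counting ARI from the contingency table, and F1 as the harmonic mean of precision and recall over selected feature sets); the function names are hypothetical and this is not the evaluation code used for Fig 1.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting adjusted Rand index between two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both labelings
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def f1_score(selected, truth):
    """F1 score of a selected feature set against the true signal features."""
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(selected)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)
```

Note that the ARI is invariant to relabeling of clusters, so a predicted labeling that swaps cluster names still scores 1.
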

Fig 1. Clustering performance (ARI), feature selection accuracy (F1 score), and computation time on sparse synthetic data sets.

Fig 1

(A) ARI (higher is better) of estimated grouping; (B) computation time in log seconds; (C) F1 score for signal feature estimates with oracle and data-driven selection. IMPACC has superior performance over competing methods in clustering and feature selection accuracy with significant computational savings.

Case studies on real data

We apply our methods to one bulk RNA-seq data set, which measures gene expression in different tumor types, three gold-standard single-cell RNA-seq data sets, and one image data set with known cluster labels; their information is reported in Table 1. The PANCAN bulk RNA-seq data [49] is a benchmark data set obtained from the UCI Machine Learning Repository [50], which contains gene expression of patients with different types of tumor: BRCA, KIRC, COAD, LUAD and PRAD. The cluster labels of the three single-cell RNA-seq data sets are known because the cells come from different developmental stages. The Biase [51] data is generated from 49 single cells composed of 1-cell (zygote), mid-stage 2-cell, and 4-cell mouse embryos. The Goolam [52] data investigates gene expression patterns in the pre-implantation development of mouse embryos, including cells isolated from the 2-cell stage to the 32-cell stage. The Yan [53] data measures gene expression of cells from human pre-implantation embryos and human embryonic stem cells at different passages. In the RNA-seq data, gene expression values are transformed by $x \to \log_2(1 + x)$ before applying clustering algorithms; the image data set [54] is rescaled to lie within the range [0,1]. Note that we do not conduct any prior feature selection before applying clustering algorithms. Using the same settings as in the Synthetic Data section, we compare the learning performance of MPCC and IMPACC with existing methods, using the oracle number of clusters. Details on the implementation of competing methods are in S1 Text.
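
The two preprocessing steps mentioned above can be sketched as follows. The log transform is as stated in the text; the rescaling of the image data is written here as a global min-max scaling, which is an assumption, since the text only says the data are adjusted to lie within [0,1]. Function names are hypothetical.

```python
import math

def log2_transform(matrix):
    """Apply x -> log2(1 + x) to every entry of an expression matrix (list of rows)."""
    return [[math.log2(1.0 + x) for x in row] for row in matrix]

def rescale_unit_interval(matrix):
    """Rescale all entries to [0, 1] using the global minimum and maximum (assumed scheme)."""
    flat = [x for row in matrix for x in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant input: map everything to 0
        return [[0.0 for _ in row] for row in matrix]
    return [[(x - lo) / (hi - lo) for x in row] for row in matrix]
```
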

Table 1. Data sets used in empirical study.

| | PANCAN | Biase | Goolam | Yan | COIL20 |
|---|---|---|---|---|---|
| Data type | RNA-seq | scRNA-seq | scRNA-seq | scRNA-seq | Image |
| Tissue | tumor cells | mouse embryos | mouse embryos | human embryos | |
| # clusters | 5 | 3 | 5 | 7 | 20 |
| # observations | 761 | 49 | 124 | 90 | 1,440 |
| # features | 13,244 | 25,737 | 41,480 | 20,286 | 1,024 |
| % zeros | 14.2% | 50.43% | 68.56% | 38.08% | 34.38% |
| Citation | [49] | [51] | [52] | [53] | [54] |
| Source | Synapse:syn4301332 | GSE57249 | E-MTAB-3321 | GSE36552 | |

Table 2 summarizes the mean over 10 realizations of clustering results on real data sets. IMPACC is either the best or among the top-performing methods on each data set at discovering known clusters, with high ARI scores. It also demonstrates major computational advantages, sometimes even beating hierarchical clustering (for example, on PANCAN). Dimension reduction via tSNE followed by clustering can be faster and more accurate on some of the data sets, but it fails to provide direct interpretability of feature importance. Many analyses conduct inference for differentially expressed genes after clustering, but this suffers from selection bias and inflated false positives [55–57]; thus, a direct way to assess important genes, as in our method, is preferred. Even though the single-cell RNA-seq specific method SC3 has comparable accuracy on the Biase [51] and Yan [53] data sets, such methods select genes with high variance before clustering and do not provide inherent interpretations of important genes. Note that R failed to apply sparseHC to large genomics data due to excessive memory demand, and we only run SC3 and Seurat on the single-cell RNA-seq data sets. Further, even though MPCC has a slightly lower ARI than IMPACC, it still yields better or comparable learning accuracy relative to consensus and standard methods, and it is relatively fast. Additionally, we visualize the consensus matrices of IMPACC and compare them to those of regular consensus clustering in Fig 2. IMPACC produces more accurate consensus matrices, with clearer diagonal blocks of clusters and less noise in the off-diagonal entries.

Table 2. Clustering performance (ARI) and computation time in seconds on real data sets with known cluster labels.

| Method | ARI: PANCAN | ARI: Biase | ARI: Goolam | ARI: Yan | ARI: COIL20 | Time (s): PANCAN | Time (s): Biase | Time (s): Goolam | Time (s): Yan | Time (s): COIL20 |
|---|---|---|---|---|---|---|---|---|---|---|
| IMPACC (HC) | 0.991 | 0.953 | 0.815 | 0.742 | 0.74 | 33.379 | 1.849 | 7.602 | 4.17 | 70.297 |
| IMPACC (Spec) | 0.99 | 0.953 | 0.829 | 0.742 | 0.663 | 33.379 | 1.849 | 7.602 | 4.17 | 70.297 |
| MPCC (HC) | 0.982 | 0.948 | 0.66 | 0.742 | 0.717 | 21.897 | 0.093 | 4.73 | 2.189 | 52.446 |
| MPCC (Spec) | 0.991 | 0.948 | 0.682 | 0.772 | 0.672 | 21.897 | 0.093 | 4.73 | 2.189 | 52.446 |
| Consensus (HC) | 0.754 | 0.953 | 0.452 | 0.834 | 0.673 | 75.377 | 0.121 | 1.797 | 0.967 | 557.623 |
| Consensus (Spec) | 0.774 | 0.953 | 0.684 | 0.763 | 0.67 | 75.377 | 0.121 | 1.797 | 0.967 | 557.623 |
| sparseKM | 0.981 | 1 | 0.459 | 0.736 | 0.441 | 1044.875 | 46.636 | 141.162 | 68.011 | 95.572 |
| sparseHC | | 0.342 | 0.514 | 0.777 | | | 14.883 | 86.904 | 22.97 | |
| Seurat | | 0.66 | 0.447 | 0.548 | | | 0.922 | 1.412 | 0.791 | |
| SC3 | | 0.948 | 0.687 | 0.731 | | | 75.290 | 73.234 | 68.598 | |
| tSNE+KMeans | 0.983 | 0.509 | 0.317 | 0.736 | 0.619 | 7.853 | 0.288 | 1.598 | 0.514 | 133.09 |
| tSNE+HC | 0.991 | 0.948 | 0.3 | 0.671 | 0.685 | 7.864 | 0.287 | 1.598 | 0.514 | 3.543 |
| tSNE+spectral | 0.803 | 0.948 | 0.307 | 0.666 | 0.787 | 12.598 | 0.411 | 1.685 | 0.594 | 3.578 |
| tSNE+KMedoid | 0.98 | 0.948 | 0.354 | 0.641 | 0.727 | 8.008 | 0.287 | 1.6 | 0.516 | 20.255 |
| KMeans | 0.795 | 0.948 | 0.493 | 0.544 | 0.771 | 2.67 | 0.045 | 0.326 | 0.107 | 7.04 |
| HClust | 0.756 | 0.948 | 0.433 | 0.763 | 0.54 | 57.236 | 0.057 | 1.116 | 0.201 | 0.201 |
| Spectral | 0.734 | 0.948 | 0.381 | 0.473 | 0.65 | 4.817 | 0.118 | 0.393 | 0.181 | 2.097 |
| KMedoid | 0.761 | 1 | 0.676 | 0.743 | 0.447 | 58.955 | 0.066 | 1.212 | 0.214 | 12.491 |

The IMPACC method is among the best in terms of clustering performance, with significant improvements in computational cost compared to sparseKM, sparseHC, and consensus clustering. The MPCC method also yields comparable clustering performance and computational speed.

Fig 2. Heatmaps of final consensus matrix derived from IMPACC and consensus clustering respectively, using oracle number of clusters.

Fig 2

Darker color indicates higher consensus value.
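
A consensus matrix of the kind shown in Fig 2 can be sketched as follows, using the standard consensus clustering definition: for each pair of observations, the fraction of runs in which they were assigned to the same cluster among the runs in which both were sampled. The function name and input format are hypothetical, and IMPACC's adaptive sampling of observations and features is not modeled here.

```python
def consensus_matrix(n, runs):
    """Build an n x n consensus matrix.

    runs: iterable of (indices, labels) pairs, one per clustering run, where
    indices are the sampled observation indices and labels their cluster labels.
    """
    together = [[0] * n for _ in range(n)]  # times i and j were co-clustered
    sampled = [[0] * n for _ in range(n)]   # times i and j were sampled together
    for indices, labels in runs:
        for a, i in enumerate(indices):
            for b, j in enumerate(indices):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
```

Pairs that are never sampled together are given a consensus value of 0 in this sketch; a final clustering can then be obtained by applying, for example, hierarchical clustering to one minus the consensus matrix.
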

Interpretability analysis on Yan data set

IMPACC further provides interpretability in terms of feature importance. Since the feature support set in IMPACC is constructed by including features that are expressed differently across clusters, we can identify differentially expressed genes from the final feature importance scores. To conduct feature selection, we propose a data-driven cutoff equal to the mean plus one standard deviation of all scores. Here we focus our interpretability analysis on one realization of IMPACC clustering on the Yan [53] data set. IMPACC selects 466 differentially expressed genes using the data-driven cutoff, and the full list of genes is reported in S1 Table. We plot the gene expression matrix of the top 50 differentially expressed genes determined by IMPACC in Fig 3, with the subgroups defined by the final consensus matrix of IMPACC, using the oracle number of clusters K = 7 (separated by white vertical lines). The important genes selected by IMPACC have significantly different expression among different clusters of cells, especially in the Morulae cluster.

Fig 3. Gene expression matrix of the top 50 differentially expressed genes identified by IMPACC in Yan data set, with subgroups defined by the final consensus matrix of IMPACC.

Fig 3

The Yan [53] data set measures gene expression in human oocytes, early embryos at seven developmental stages and hESC cells. The original paper by Yan et al. [53] found that EPI cells have lower expression of genes involved in gamete generation, germ cell development and reproductive processes, as indicated by the GO terms enriched among genes differentially expressed between EPI cells and other cell lineages in blastocysts. To further evaluate the model interpretability of IMPACC, we perform Gene Ontology (GO) pathway enrichment analysis on the 466 differentially expressed genes determined by our data-driven approach. With a p-value cutoff of 0.05, we identify 26 GO terms, and these pathways are highly related to germ cell development (oogenesis, oocyte development, oocyte differentiation), gamete generation (female gamete generation, DNA methylation involved in gamete generation), and reproduction (regulation of reproductive process, negative regulation of reproductive process, positive regulation of reproductive process, cellular process involved in reproduction in multicellular organism). The top 10 pathways with fold enrichment, p-values, and counts are illustrated in Fig 4. The enriched GO terms reported in Yan et al. [53], the complete list of GO terms based on IMPACC, and the pathway analysis using differentially expressed genes identified by SC3 and sparseKM are in S1 Text and S2 Table. Overall, these results show that IMPACC provides accurate and reliable interpretations of scientifically important genes as well as biologically meaningful GO enrichment analysis; these results match the scientific conclusions of the original paper, in which the cell types are known.
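
To make the quantities reported in Fig 4 concrete, the following sketch computes the fold enrichment and a one-sided over-representation p-value for a single GO term. The text does not state which enrichment tool or test was used, so the hypergeometric test here is an assumption (it is a common choice for this kind of analysis), and the function names and argument names are hypothetical. No multiple-testing adjustment is included.

```python
from math import comb

def fold_enrichment(k, n, K, N):
    """Observed fraction of selected genes in the term divided by the background fraction.

    k: selected genes annotated to the term, n: selected genes,
    K: background genes annotated to the term, N: background genes.
    """
    return (k / n) / (K / N)

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N, K of them annotated."""
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total
```

For instance, `n` would be 466 for the gene list selected by the data-driven cutoff, while `k`, `K` and `N` depend on the annotation database and background gene set used.
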

Fig 4. Top 10 pathways of GO enrichment analysis using the differentially expressed genes identified by IMPACC in Yan data set, with information on adjusted p-values, fold enrichment and count.

Fig 4

Discussion

We have proposed novel and powerful methodologies for consensus clustering using minipatch learning with random or adaptive sampling schemes. We have demonstrated that both MPCC and IMPACC are stable and robust, and that they outperform competing methods in terms of accuracy. Further, our approaches offer significant computational savings, with runtime comparable to hierarchical or spectral clustering. Finally, IMPACC offers interpretable results by discovering features that differentiate clusters. This method is particularly applicable to sparse, high-dimensional data sets common in bioinformatics. Our empirical results suggest that our method might prove particularly important for discovering cell types from single-cell RNA sequencing data. Note that while our methods offer computational advantages over consensus clustering in all settings, they do not seem to offer any dramatic improvement in clustering accuracy for data sets that are neither sparse nor high-dimensional. In future work, one can further optimize computations through memory-efficient management of the large consensus matrix and through hashing or other approximate schemes. Overall, we expect IMPACC to become a critical instrument for clustering analyses of complicated and massive data sets in bioinformatics as well as a variety of other fields.

Supporting information

S1 Text. Fast and Interpretable Consensus Clustering via Minipatch Learning: Supplementary Materials.

(DOCX)

S1 Table. Differentially expressed genes in Yan selected by IMPACC.

(CSV)

S2 Table. SC3’s GO Enrichment Pathway Analysis of Yan data set.

(CSV)

Acknowledgments

The authors would like to thank Zhandong Liu and Ying-Wooi Wan for helpful discussions on single-cell sequencing as well as Tianyi Yao for helpful discussions on minipatch learning.

Data Availability

IMPACC, including source code and a tutorial, is freely available at https://github.com/DataSlingers/IMPACC. All data sets used in empirical study are uploaded to Kaggle: https://www.kaggle.com/ganluqin/impacc-data.

Funding Statement

This study received funding from the National Science Foundation (DMS-1554821, https://www.nsf.gov/) and National Institutes of Health / National Institute of General Medicine (1R01GM140468, https://www.nigms.nih.gov/) received by G.I.A. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Ghaemi R, Sulaiman MN, Ibrahim H, Mustapha N, et al. A survey: clustering ensembles techniques. World Academy of Science, Engineering and Technology. 2009;50:636–645.
  • 2. Fred A. Finding consistent clusters in data partitions. In: International Workshop on Multiple Classifier Systems. Springer; 2001. p. 309–318.
  • 3. Fred AL, Jain AK. In: Object recognition supported by user interaction for service robots. vol. 4. IEEE; 2002. p. 276–280.
  • 4. Kellam P, Liu X, Martin N, Orengo C, Swift S, Tucker A. Comparing, contrasting and combining clusters in viral gene expression data. In: Proceedings of 6th workshop on intelligent data analysis in medicine and pharmocology; 2001. p. 56–62.
  • 5. Azimi J, Mohammadi M, Analoui M, et al. Clustering ensembles using genetic algorithm. In: 2006 International Workshop on Computer Architecture for Machine Perception and Sensing. IEEE; 2006. p. 119–123.
  • 6. Strehl A, Ghosh J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of machine learning research. 2002;3(Dec):583–617.
  • 7. Ng A, Jordan M, Weiss Y. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems. 2001;14:849–856.
  • 8. Karypis G, Kumar V. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing. 1998;20(1):359–392.
  • 9. Dudoit S, Fridlyand J. Bagging to improve the accuracy of a clustering procedure. Bioinformatics. 2003;19(9):1090–1099. doi: 10.1093/bioinformatics/btg038
  • 10. Fischer B, Buhmann JM. Path-based clustering for grouping of smooth curves and texture segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2003;25(4):513–518.
  • 11. Fischer B, Buhmann JM. Bagging for path-based clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2003;25(11):1411–1415. doi: 10.1109/TPAMI.2003.1240115
  • 12. Topchy A, Jain AK, Punch W. A mixture model for clustering ensembles. In: Proceedings of the 2004 SIAM international conference on data mining. SIAM; 2004. p. 379–390.
  • 13. Topchy A, Jain AK, Punch W. Clustering ensembles: models of consensus and weak partitions. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005;27(12):1866–1881. doi: 10.1109/TPAMI.2005.237
  • 14. Analoui M, Sadighian N. Solving cluster ensemble problems by correlation’s matrix & GA. In: Intelligent Information Processing III: IFIP TC12 International Conference on Intelligent Information Processing (IIP 2006), September 20–23, Adelaide, Australia 3. Springer; 2007. p. 227–231.
  • 15. Luo H, Jing F, Xie X. Combining multiple clusterings using information theory based genetic algorithm. In: 2006 International Conference on Computational Intelligence and Security. vol. 1. IEEE; 2006. p. 84–89.
  • 16. Topchy A, Jain AK, Punch W. Combining multiple weak clusterings. In: Third IEEE international conference on data mining. IEEE; 2003. p. 331–338.
  • 17. Azimi J, Abdoos M, Analoui M. A new efficient approach in clustering ensembles. In: International Conference on Intelligent Data Engineering and Automated Learning. Springer; 2007. p. 395–405.
  • 18. Fred AL, Jain AK. Combining multiple clusterings using evidence accumulation. IEEE transactions on pattern analysis and machine intelligence. 2005;27(6):835–850. doi: 10.1109/TPAMI.2005.113
  • 19. Duarte JM, Fred AL, Duarte FJF. Adaptive Evidence Accumulation Clustering Using the Confidence of the Objects’ Assignments. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer; 2012. p. 70–87.
  • 20. Parvin H, Minaei-Bidgoli B, Alinejad-Rokny H, Punch WF. Data weighing mechanisms for clustering ensembles. Computers & Electrical Engineering. 2013;39(5):1433–1450.
  • 21. Topchy A, Minaei-Bidgoli B, Jain AK, Punch WF. Adaptive clustering ensembles. In: Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004. vol. 1. IEEE; 2004. p. 272–275.
  • 22. Ren Y, Domeniconi C, Zhang G, Yu G. Weighted-object ensemble clustering: methods and analysis. Knowledge and Information Systems. 2017;51(2):661–689.
  • 23. Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, et al. SC3: consensus clustering of single-cell RNA-seq data. Nature methods. 2017;14(5):483–486. doi: 10.1038/nmeth.4236
  • 24. Yang Y, Huh R, Culpepper HW, Lin Y, Love MI, Li Y. SAFE-clustering: single-cell aggregated (from ensemble) clustering for single-cell RNA-seq data. Bioinformatics. 2019;35(8):1269–1277. doi: 10.1093/bioinformatics/bty793
  • 25. Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nature biotechnology. 2015;33(5):495–502. doi: 10.1038/nbt.3192
  • 26. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome biology. 2018;19(1):1–5.
  • 27. Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature biotechnology. 2014;32(4):381–386. doi: 10.1038/nbt.2859
  • 28. Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nature Reviews Genetics. 2019;20(5):273–282. doi: 10.1038/s41576-018-0088-9
  • 29. Witten DM, Tibshirani R. A framework for feature selection in clustering. Journal of the American Statistical Association. 2010;105(490):713–726. doi: 10.1198/jasa.2010.tm09415
  • 30. Wang B, Zhang Y, Sun WW, Fang Y. Sparse convex clustering. Journal of Computational and Graphical Statistics. 2018;27(2):393–403.
  • 31. Wang M, Allen GI. Integrative generalized convex clustering optimization and feature selection for mixed multi-view data. Journal of Machine Learning Research. 2021;22(55):1–73.
  • 32. Yu J, Zhong H, Kim SB. An Ensemble Feature Ranking Algorithm for Clustering Analysis. Journal of Classification. 2019; p. 1–28.
  • 33. Dash M, Liu H. Feature selection for clustering. In: Pacific-Asia Conference on knowledge discovery and data mining. Springer; 2000. p. 110–121.
  • 34. Zhao Z, Liu H. Spectral feature selection for supervised and unsupervised learning. In: Proceedings of the 24th international conference on Machine learning; 2007. p. 1151–1157.
  • 35. Liu H, Shao M, Fu Y. Feature selection with unsupervised consensus guidance. IEEE Transactions on Knowledge and Data Engineering. 2018;31(12):2319–2331.
  • 36. Yao T, Allen GI. Feature Selection for Huge Data via Minipatch Learning. arXiv preprint arXiv:2010.08529. 2020.
  • 37. Yao T, LeJeune D, Javadi H, Baraniuk RG, Allen GI. Minipatch Learning as Implicit Ridge-Like Regularization. In: 2021 IEEE International Conference on Big Data and Smart Computing (BigComp). IEEE; 2021. p. 65–68.
  • 38. Toghani MT, Allen GI. MP-Boost: Minipatch Boosting via Adaptive Feature and Observation Sampling. In: 2021 IEEE International Conference on Big Data and Smart Computing (BigComp). IEEE; 2021. p. 75–78.
  • 39. Monti S, Tamayo P, Mesirov J, Golub T. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine learning. 2003;52(1–2):91–118.
  • 40. Hayes DN, Monti S, Parmigiani G, Gilks CB, Naoki K, Bhattacharjee A, et al. Gene expression profiling reveals reproducible human lung adenocarcinoma subtypes in multiple independent patient cohorts. Journal of Clinical Oncology. 2006;24(31):5079–5090. doi: 10.1200/JCO.2005.05.1748
  • 41. Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer cell. 2010;17(1):98–110. doi: 10.1016/j.ccr.2009.12.020
  • 42. Wilkerson MD, Hayes DN. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics. 2010;26(12):1572–1573. doi: 10.1093/bioinformatics/btq170
  • 43. Murtagh F. A survey of recent advances in hierarchical clustering algorithms. The computer journal. 1983;26(4):354–359.
  • 44. Pakhira MK. A linear time-complexity k-means algorithm using cluster shifting. In: 2014 International Conference on Computational Intelligence and Communication Networks. IEEE; 2014. p. 1047–1051.
  • 45. Fred A, Jain AK. Evidence accumulation clustering based on the k-means algorithm. In: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer; 2002. p. 442–451.
  • 46. Bouneffouf D, Rish I. A survey on practical applications of multi-armed and contextual bandits. arXiv preprint arXiv:1904.10040. 2019.
  • 47. Slivkins A. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272. 2019.
  • 48. Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome biology. 2017 Dec;18(1):1–5.
  • 49. Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, et al. The cancer genome atlas pan-cancer analysis project. Nature genetics. 2013;45(10):1113–1120. doi: 10.1038/ng.2764
  • 50. Dua D, Graff C. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml. 2019.
  • 51. Biase FH, Cao X, Zhong S. Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing. Genome research. 2014 Nov 1;24(11):1787–96. doi: 10.1101/gr.177725.114
  • 52. Goolam M, Scialdone A, Graham SJ, Macaulay IC, Jedrusik A, Hupalowska A, Voet T, Marioni JC, Zernicka-Goetz M. Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4-cell mouse embryos. Cell. 2016 Mar 24;165(1):61–74. doi: 10.1016/j.cell.2016.01.047
  • 53. Yan L, Yang M, Guo H, Yang L, Wu J, Li R, Liu P, Lian Y, Zheng X, Yan J, Huang J, et al. Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells. Nature structural & molecular biology. 2013;20(9):1131–1139. doi: 10.1038/nsmb.2660
  • 54. Nene SA, Nayar SK, Murase H. Columbia Object Image Library (COIL-20); 1996.
  • 55. Berk R, Brown L, Buja A, Zhang K, Zhao L. Valid post-selection inference. The Annals of Statistics. 2013;802–837.
  • 56. Fithian W, Sun D, Taylor J. Optimal inference after model selection. arXiv preprint arXiv:1410.2597. 2014.
  • 57. Zhang JM, Kamath GM, David NT. Valid post-clustering differential analysis for single-cell RNA-Seq. Cell systems. 2019 Oct 23;9(4):383–92. doi: 10.1016/j.cels.2019.07.012
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010577.r001

Decision Letter 0

Isidore Rigoutsos, Mark Alber

15 Mar 2022

Dear Ms. Gan,

Thank you very much for submitting your manuscript "Fast and Interpretable Consensus Clustering via Minipatch Learning" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Isidore Rigoutsos, Ph.D.

Associate Editor

PLOS Computational Biology

Mark Alber

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The work by Gan and Allen describes the development of an interesting algorithm that can solve the limitations of consensus clustering when applied to large-scale biological data. The manuscript is well written, organized and largely seems technically sound. However, I have some questions for the authors.

1. Throughout the manuscript, authors claim that in high-dimensional biological data, we expect only a small subset of features to be relevant for determining clusters. Although this can be true in some cases, I can think of many scenarios where this is not true (even in some of the datasets they analyze). For example, in a biopsy of a tumor within a normal tissue, the tumor cells have a vastly different gene expression profile. Even within a normal tissue, let's say bone marrow, we expect to find a wide variety of cell types, each one with many distinct genes expressed and common ones expressed at different levels. This is not to say that their method is not applicable to such datasets but I would expect that the authors consider and discuss the implications in such cases.

2. In at least two instances (lines 141 and 248), authors choose a relatively simple methodology so that the overall algorithm is fast enough. To test that this is the best choice (not in terms of speed but in terms of accuracy), I would expect to see a side-by-side comparison with additional tests/methods and show that speed is not compromising accuracy.

3. Lines 131-151 deal with the problem of accuracy. I am not sure that I understand how this approach specifically addresses the problem of accuracy. By the end of this paragraph, I can understand the methodology but not how it achieves better accuracy (only how it proposes to do so).

4. The biological interpretations are extremely weak (lines 342-355). Authors perform KEGG pathway analysis (not described with which tool, statistical thresholds?) and argue that some pathways are important in one setting vs. the other. This is a subjective analysis and I can argue for the opposite in some cases: Why is insulin secretion important in IMPACC of Table 3 when we are focusing on brain cells? Why is this not a false positive? Authors need to look at the genes within each pathway in more detail. They also need to justify their findings biologically and compare them to the ones reported in the original papers that generated the data they use. In addition, they need to perform some basic correlation analysis of importance weights, pathway fold enrichments (not just p values) among the three tested approaches.

Reviewer #2: Review uploaded as attachment

Reviewer #3: In this manuscript, the authors described a study that combined Minipatch learning and adaptive sampling to improve the current consensus clustering method. The authors showed that the new consensus clustering methods could achieve higher computational efficiency, improved accuracy and better interpretability, on both synthetic and real-world data. In general, the manuscript is very well written and easy to follow. The statistical modeling and validation approaches are sound to me. As consensus clustering is widely used in biological data analysis, a more robust and interpretable clustering method could serve as a very useful tool for the research community. Therefore, I recommend the publication of this work after resolving some minor issues.

1. The authors provided the codes for the implementation of the described methods. However, the documentation for the usage of those codes is lacking, which may make it difficult for other people to use this tool. The authors are encouraged to organize those individual R scripts into an R package (ideally a Bioconductor package) with proper documentation (function help page, tutorial, example and so on.)

2. The authors used way too many “dramatic” in this manuscript. In my opinion, it’s better to refrain from using such words, especially when the improvement of computational efficiency and accuracy in some scenarios is rather comparable to some other clustering methods and the performance also depends on whether the dataset itself is sparse or not, as mentioned by the authors in the discussion part.

3. The authors are encouraged to strengthen the hyper-parameter tuning part and better justify the generalizability of the recommended hyper-parameters. The authors showed that hyper-parameters had a limited impact on model performance and therefore suggested that users could choose the default parameters. However, the authors only tuned and compared the hyper-parameters on two real-world biological datasets (Brain cells and PANCAN). As the users will perhaps have much more diverse biological datasets, the authors may want to test the hyper-parameters in more cases (other types of biological data or synthetic data) to ensure the generalizability of the recommended hyper-parameters

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Attachment

Submitted filename: Review comments.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010577.r003

Decision Letter 1

Isidore Rigoutsos, Mark Alber

25 Jun 2022

Dear Ms. Gan,

Thank you very much for submitting your manuscript "Fast and Interpretable Consensus Clustering via Minipatch Learning" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Isidore Rigoutsos, Ph.D.

Associate Editor

PLOS Computational Biology

Mark Alber

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have adequately addressed most of my comments. However, I still have a minor comment on the interpretation of the biological results. Figure 3 of the revised manuscript shows the differentially expressed genes identified by their method, and Figure 4 shows the relevant pathways. These mainly include genes and pathways that are down-regulated as development progresses. However, in the main manuscript they argue that they "successfully identify 26 GO terms, and these pathways are highly related to the regulation of the reproductive process and cell development" (page 15; line 418). In addition, they mention that in the original publication Yan et al. "discovered that the differential genes between EPI cells and the remaining cells are enriched for GO terms related to transcriptional regulation and germ cell development". Why is germ cell development (e.g. oogenesis and oocyte differentiation) relevant when the biological context is pre-implantation development? I think the authors need to clarify which parts of the biology their methodology captures. One way of doing so would be to argue more fully for the importance of these pathways in this context (it makes sense biologically for the negative regulators of sperm binding to the zona pellucida to be down-regulated after proper fertilization), but I would expect a much more thorough discussion. Another way is to show Venn diagrams comparing the identified pathways with pathways relevant in mouse pre-implantation development. This is a well-studied period of development with many data sets and results publicly available, which the authors can use to justify the robustness of their results or to better showcase their own differential expression analysis.

Reviewer #2: The authors provide a much improved revision, with appropriately moderated claims, which largely addresses the original review criticisms. Residual comments below are minor and should be easy to address (some are just suggestions).

COMMENTS

Several notation inconsistencies still remain in the pseudocode. The authors should go over everything CAREFULLY and fix them. E.g., in Algorithm 1, Cit should read Ci(t).

For all algorithms, initial values of variables should be specified. E.g., in Algorithm 2, specify S(t=0) and wI(t=0).

“Here we only show the results of sparse simulation with autoregressive covariance structure, as it is the best representative of high dimensional bioinformatics data.” Justification for this statement (“best representative of high dimensional bioinformatics data”) should be provided, e.g., via citing work where this is demonstrated. And a rationale for the specific choice of covariance ($\sigma_{j,j'} = \rho^{|j-j'|}$) should be given.
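
For readers unfamiliar with the autoregressive structure named in this comment, the following minimal sketch builds a covariance matrix whose (j, j') entry is rho raised to the power |j − j'|. It is an illustration only and is not taken from the manuscript or the IMPACC repository; the function name `ar1_covariance` and the example values p = 4 and rho = 0.5 are assumptions chosen for the sketch.

```python
def ar1_covariance(p, rho):
    """Return a p x p list-of-lists matrix with entries rho ** |j - k|.

    Hypothetical helper for illustration; not part of IMPACC.
    """
    if p < 1:
        raise ValueError("p must be a positive integer")
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    # Correlation decays geometrically with the distance between feature indices.
    return [[rho ** abs(j - k) for k in range(p)] for j in range(p)]


# Example values are assumptions, not settings reported in the manuscript.
sigma = ar1_covariance(4, 0.5)
for row in sigma:
    print(row)
```

Under this structure, neighbouring features are the most strongly correlated and the correlation fades with index distance, which is the modelling choice the reviewer asks the authors to justify.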

In my opinion, the results of clustering the Splatter-simulated data are more reflective of the algorithm’s performance on real scRNA-seq data and, in fairness, should at the very least be presented in the main text along with the results from the autoregressive model, rather than in the appendix.

One recommendation for Table 2: in addition to displaying the actual measurements, consider using color gradients for the table cells (similar to how you present data in Appendix Figure 21), to aid visual delineation of “good” and “bad” values.

“Clustering followed by dimension reduction via tSNE can have faster and better clustering accuracy for some of the data sets, but they fail to provide interpretability in terms of feature importance”. I disagree with the authors that this is a significant limitation. E.g., one can always run a post-clustering ANOVA (or similar) to prioritize features, if desired.
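
The post-clustering prioritization suggested in this comment can be sketched as a one-way ANOVA F statistic computed per feature against the cluster labels, with features then ranked by that statistic. The sketch below is a hedged illustration of the idea, not the reviewer's or the authors' procedure; the helper names `anova_f` and `rank_features` and the toy data are assumptions.

```python
from collections import defaultdict


def anova_f(values, labels):
    """One-way ANOVA F statistic of one feature across cluster labels."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two clusters and more observations than clusters")
    grand_mean = sum(values) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    # Between-cluster and within-cluster sums of squares.
    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2 for label, g in groups.items())
    ss_within = sum((v - means[label]) ** 2 for label, g in groups.items() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(data, labels):
    """Return feature indices sorted from most to least cluster-separating."""
    p = len(data[0])
    scores = [anova_f([row[j] for row in data], labels) for j in range(p)]
    return sorted(range(p), key=lambda j: scores[j], reverse=True)


# Toy data (assumed): feature 0 separates the two clusters, feature 1 does not.
data = [[0.0, 1.0], [0.1, 0.0], [-0.1, 2.0], [5.0, 1.0], [5.1, 2.0], [4.9, 0.0]]
labels = [0, 0, 0, 1, 1, 1]
ranking = rank_features(data, labels)
print(ranking)
```

Note that such a ranking is computed after the cluster labels are fixed, so it inherits any errors in the clustering, whereas the manuscript describes learning feature importance during the clustering itself.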

For the PANCAN data set, Table 1 claims that it contains five clusters. Where does this number come from?

Reviewer #3: The authors have adequately addressed my previous comments.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010577.r005

Decision Letter 2

Isidore Rigoutsos, Mark Alber

15 Sep 2022

Dear Ms. Gan,

We are pleased to inform you that your manuscript 'Fast and Interpretable Consensus Clustering via Minipatch Learning' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Isidore Rigoutsos

Academic Editor

PLOS Computational Biology

Mark Alber

Section Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed my concerns.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010577.r006

Acceptance letter

Isidore Rigoutsos, Mark Alber

23 Sep 2022

PCOMPBIOL-D-21-02086R2

Fast and Interpretable Consensus Clustering via Minipatch Learning

Dear Dr Gan,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Olena Szabo

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. Fast and Interpretable Consensus Clustering via Minipatch Learning: Supplementary Materials.

    (DOCX)

    S1 Table. Differentially expressed genes in Yan selected by IMPACC.

    (CSV)

    S2 Table. SC3’s GO Enrichment Pathway Analysis of Yan data set.

    (CSV)

    Attachment

    Submitted filename: Review comments.pdf

    Attachment

    Submitted filename: response.pdf

    Attachment

    Submitted filename: reply2.pdf

    Data Availability Statement

    IMPACC, including source code and a tutorial, is freely available at https://github.com/DataSlingers/IMPACC. All data sets used in the empirical study are available on Kaggle: https://www.kaggle.com/ganluqin/impacc-data.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
