Scientific Reports. 2026 Apr 6;16:11740. doi: 10.1038/s41598-026-47453-2

Clustering ensemble method integrating Gaussian mixture model and three-way decision (GMM-3WD-CE)

Yunpeng Ma 1, Zhicong Li 1
PMCID: PMC13062098  PMID: 41942547

Abstract

Clustering ensemble improves clustering quality by integrating multiple base clustering results; however, existing methods suffer from inadequate handling of boundary uncertainty and lack a unified probabilistic-to-decision framework. This paper proposes GMM-3WD-CE, which integrates the Gaussian Mixture Model (GMM) with three-way decision (3WD) theory to construct a multi-level uncertainty-modelling framework. The method generates \(M = 50\) diverse base clusterings via a multi-algorithm strategy, constructs a weighted co-association matrix using quality scores derived from the silhouette coefficient, the Caliński–Harabasz index, and the Davies–Bouldin index, employs the ICL criterion for optimal GMM model selection, and adaptively calculates three-way decision thresholds through the Otsu algorithm to partition samples into core, boundary, and trivial domains. Differentiated label-assignment strategies for each region yield the final consensus clustering. Comparative experiments on eight benchmark datasets with nine comparison methods show that GMM-3WD-CE achieves statistically significant average improvements of 0.022 NMI and 0.025 ARI over PCPA and 0.057 NMI and 0.063 ARI over classical MCLA, while remaining competitive with the strongest recent baseline, SDGCA (0.008 average NMI advantage; Wilcoxon \(p = 0.089\), medium effect size \(d = 0.41\)). Ablation experiments verify the contribution of each component; Wilcoxon and Friedman tests with Cohen's d effect sizes confirm statistical significance against all other baselines; and runtime/scalability analyses characterise the computational trade-offs.

Keywords: Clustering ensemble, Gaussian mixture model, Three-way decision, Uncertainty modelling, ICL criterion

Subject terms: Mathematics and computing, Medical research

Introduction

Clustering analysis is a core unsupervised-learning task with important applications in data mining and pattern recognition1–3. Traditional single clustering algorithms are sensitive to parameter settings, initialisation states, and distributional assumptions, making stable performance across diverse datasets difficult to achieve. Clustering ensemble methods address this by integrating the outputs of multiple base clusterers, leveraging “collective wisdom” to improve both quality and robustness4,5.

Current ensemble methods fall into three broad categories. Co-association-based methods measure pairwise similarity via co-occurrence frequency across base clusterings6. Graph-based methods convert co-association matrices into weighted graphs for partitioning7. Probabilistic methods model the ensemble process with mixture models8. Recent advances include locally weighted ensembles9, global–local structure fusion10, point-cluster-partition architectures11, and deep learning ensembles12,13. Novel co-association construction strategies exploiting both similarity and dissimilarity information have also been proposed14, while fair clustering ensemble methods now address cluster capacity balance15.

Three-way decision (3WD) theory partitions decision space into positive (core), boundary, and negative (trivial) domains, providing a principled uncertainty-handling framework16. It has been applied to various clustering tasks17–19. The Tri-level Robust Clustering Ensemble (TRCE)20 is particularly relevant: it addresses robustness at the base-clustering, graph, and instance levels simultaneously. GMM is widely used for clustering due to its distributional flexibility21, and Biernacki et al. introduced the ICL criterion22, which improves BIC via an entropy penalty term.

Despite these advances, existing methods exhibit several limitations. Most rely on hard clustering assumptions and do not adequately address cluster-boundary fuzziness23. A unified framework spanning probabilistic modelling through to decision-making remains lacking. Methods such as LWEA and PCPA weight base clusterings but assign all samples uniformly, ignoring varying confidence levels. SDGCA14 improves co-association construction but does not incorporate uncertainty-aware region-based label assignment. TRCE20 handles instance-level robustness via graph learning but does not explicitly model similarity distributions or perform adaptive threshold selection. Decision thresholds in existing 3WD approaches often rely on manual tuning24.

To address these gaps, this paper proposes GMM-3WD-CE. The main contributions are:

  1. Unified probabilistic-to-decision framework. A complete theoretical pipeline is established from weighted co-association, through GMM-based probability estimation with ICL model selection, to adaptive Otsu-based three-way decision.

  2. Quality-aware weighted co-association matrix. A weighting scheme combining three complementary indices (silhouette coefficient1, Caliński–Harabasz2, Davies–Bouldin3) is designed and validated, with the coefficient allocation justified empirically.

  3. Adaptive threshold mechanism. The Otsu algorithm automatically determines the upper threshold \(\alpha\), and the ratio r governing the lower threshold \(\beta = r\alpha\) is demonstrated to be robust across diverse datasets, eliminating manual tuning.

  4. Comprehensive experimental validation. Comparisons with nine methods on eight datasets are supported by statistical tests with effect sizes, ablation studies, sensitivity analyses (including the number of base clusterings M and the quality-score coefficients), runtime/scalability analysis, and an evaluation-fairness assessment for negative-domain samples.

The remainder follows the standard IMRaD structure: Related Work (Sec. "Related Work"), Proposed Method (Sec. "Proposed Method: GMM-3WD-CE"), Experiments and Results (Sec. "Experiments and Results"), Discussion (Sec. "Discussion"), Limitations and Future Work (Sec. "Limitations and Future Work"), and Conclusion (Sec. "Conclusion").

Related work

Co-association-based ensemble clustering

The evidence-accumulation clustering (EAC) strategy6 is foundational, measuring sample similarity by co-occurrence frequency. Classical consensus functions built on this include CSPA and MCLA4. These methods treat all base clusterings equally, which is suboptimal when quality varies significantly across partitions.

Weighted ensemble methods

To correct the quality disparity, weighted approaches assign differential importance. LWEA9 quantifies per-cluster quality via entropy-based fragmentation. PCPA11 introduces a hierarchical weighting at the point, cluster, and partition levels. Zhang et al. proposed SDGCA14, which exploits both similarity and dissimilarity relationships guided by cluster size to construct an improved co-association matrix via adversarial integration. Zhou et al. introduced FCE15, a fair ensemble method that simultaneously enforces fairness and cluster capacity equality through a regularised objective. While these methods advance quality-aware weighting, none incorporates uncertainty-aware region-based label assignment to handle boundary samples explicitly.

Three-way decision in clustering

Three-way decision, formalised by Yao16, partitions decisions into acceptance, rejection, and deferral regions. In clustering, Wang et al.18 integrated 3WD with K-means; Afridi et al.19 addressed missing data via a granular-ball rough-set framework. TRCE20 is most closely related: it handles robustness at three levels by jointly learning multiple graphs. However, TRCE relies on graph-based similarity without explicit probabilistic modelling of the co-association distribution, does not employ information-theoretic model selection, and uses fixed or learned thresholds rather than adaptive Otsu-based thresholding. These distinctions are elaborated in Sec. 5.2.

Deep clustering methods

DEC25 maps data to a low-dimensional space via an autoencoder while jointly optimising a KL-divergence clustering objective. IDEC26 extends DEC by incorporating local structure preservation through a reconstruction loss. DAC27 frames clustering as binary pairwise classification. These methods achieve strong image-clustering performance, but they require substantial training data and GPU resources, and they lack the probabilistic interpretability of ensemble-based methods.

Positioning of GMM-3WD-CE

Compared to classical ensembles (CSPA, MCLA), GMM-3WD-CE adds probabilistic modelling. Compared to weighted methods (LWEA, PCPA, SDGCA, FCE), it adds 3WD for confidence-stratified label assignment. Compared to TRCE, it explicitly models similarity distributions as a mixture of Gaussians, employs ICL for model selection, and uses adaptive Otsu thresholding. Compared to deep methods (DEC, IDEC, DAC), it is fully unsupervised, interpretable, and computationally accessible on CPU.

Proposed method: GMM-3WD-CE

Problem definition and notation

Given dataset \(X = \{x_1, \dots, x_n\}\), \(x_i \in \mathbb{R}^d\), and M base clusterings \(\{\pi^{(1)}, \dots, \pi^{(M)}\}\), the goal is to produce a consensus clustering \(\pi^{*}\) that is superior to any single base clustering. Key notation is summarised in Table 1.

Table 1.

Main notation.

Symbol Meaning
\(X\); \(x_i\) Dataset; i-th sample
\(n, d, M\) #samples, #features, #base clusterings
\(\pi^{(m)}\); \(\pi^{*}\) m-th base clustering; consensus clustering
\(K\); \(K^{*}\) Candidate/optimal cluster number
\(S\); \(S_{ij}\) Weighted co-association matrix; (i, j) entry
\(w_m\) Weight/quality score of m-th base clustering
\(\pi_k, \mu_k, \sigma_k^2\) GMM mixing weight, mean, variance
\(p_{\max}(x_i)\) Max posterior probability of sample \(x_i\)
\(\alpha, \beta\) Upper/lower 3WD thresholds
POS, BND, NEG Core, boundary, trivial domains
\(L(x_i)\) Final cluster label of \(x_i\)

Algorithm framework

GMM-3WD-CE comprises five modules. The overall workflow is given in Algorithm 1.

Algorithm 1. GMM-3WD-CE framework.

Diverse base clustering generation

We generate \(M = 50\) base clusterings using four algorithms with the following allocations: K-Means (35%, 18 partitions), GMM (35%, 17), Spectral Clustering (15%, 7), and HDBSCAN (15%, 8).

Rationale for algorithm selection and proportions. K-Means and GMM are complementary: K-Means assumes spherical clusters with hard assignment, while GMM handles ellipsoidal clusters with soft assignment; together they provide structural diversity. Spectral clustering captures global graph structure and is effective for non-convex clusters. HDBSCAN handles varying densities and is robust to outliers. The 35/35/15/15 allocation gives the highest combined weight to the two most versatile algorithms while ensuring representation from graph-based and density-based paradigms. The value \(M = 50\) is motivated empirically; sensitivity analysis in Sec. 4.6 shows that performance stabilises near this value. Although this allocation is determined empirically rather than derived from first principles, it is consistent with best practices in the ensemble-diversity literature, where algorithm-type diversity is shown to be more important than exact proportions.

Diversity is enhanced through:

  • Cluster-number perturbation: the cluster number \(k_m\) of each base clustering is randomly selected from a candidate range around the expected number of clusters.

  • Parameter randomisation: K-Means initialisation (k-means++ or random), max iterations (100–500); GMM covariance type (spherical, diagonal, full); Spectral Clustering affinity parameters; HDBSCAN min-cluster-size (5–20).
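The generation step above can be sketched with scikit-learn. The allocation counts follow the text; the helper name `generate_base_clusterings`, the candidate-k handling, and the omission of HDBSCAN (to keep the sketch dependency-free) are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.mixture import GaussianMixture

def generate_base_clusterings(X, k_candidates, seed=0):
    """Generate diverse base partitions via multi-algorithm randomisation."""
    rng = np.random.default_rng(seed)
    partitions = []
    for _ in range(18):  # K-Means: 35% of M = 50
        km = KMeans(n_clusters=int(rng.choice(k_candidates)),
                    init=str(rng.choice(["k-means++", "random"])),
                    max_iter=int(rng.integers(100, 501)), n_init=1,
                    random_state=int(rng.integers(1 << 31)))
        partitions.append(km.fit_predict(X))
    for _ in range(17):  # GMM: 35%, randomised covariance structure
        gm = GaussianMixture(n_components=int(rng.choice(k_candidates)),
                             covariance_type=str(rng.choice(
                                 ["spherical", "diag", "full"])),
                             random_state=int(rng.integers(1 << 31)))
        partitions.append(gm.fit_predict(X))
    for _ in range(7):   # Spectral: 15%
        sc = SpectralClustering(n_clusters=int(rng.choice(k_candidates)),
                                random_state=int(rng.integers(1 << 31)))
        partitions.append(sc.fit_predict(X))
    # The 8 HDBSCAN partitions (15%, min_cluster_size in [5, 20]) are
    # omitted here; sklearn >= 1.3 ships sklearn.cluster.HDBSCAN.
    return partitions
```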

Weighted co-association matrix construction

\[
S_{ij} \;=\; \sum_{m=1}^{M} w_m\,\delta\!\big(\pi^{(m)}(x_i),\, \pi^{(m)}(x_j)\big),
\qquad
\delta(a, b) = \begin{cases} 1, & a = b \\ 0, & \text{otherwise} \end{cases}
\tag{1}
\]

The co-association matrix is constructed via Eq. (1), where the weight \(w_m\) is derived from the quality score:

\[
Q_m \;=\; 0.4\,\mathrm{SIL}_m \;+\; 0.3\,\widetilde{\mathrm{CH}}_m \;-\; 0.3\,\widetilde{\mathrm{DB}}_m
\tag{2}
\]
\[
w_m \;=\; \frac{Q_m}{\sum_{m'=1}^{M} Q_{m'}}
\tag{3}
\]

Here \(\mathrm{SIL}_m\) is the silhouette coefficient1, \(\mathrm{CH}_m\) is the Caliński–Harabasz index2, \(\mathrm{DB}_m\) is the Davies–Bouldin index3, and \(\widetilde{\mathrm{CH}}_m, \widetilde{\mathrm{DB}}_m\) denote their normalisations to \([0, 1]\). Weights are normalised via Eq. (3).

Justification of coefficients in Eq. (2). The three indices measure complementary aspects. The silhouette coefficient directly quantifies the separation-to-cohesion ratio and is already normalised to \([-1, 1]\); it is the most interpretable single quality measure and thus receives the highest coefficient (0.4). The Caliński–Harabasz index measures the inter- to intra-cluster variance ratio, contributing a complementary compactness perspective (0.3). The Davies–Bouldin index measures average cluster-to-nearest-neighbour similarity (lower is better) and is therefore subtracted with coefficient 0.3. The allocation (0.4, 0.3, 0.3) was determined by grid search over a small set of candidate values per coefficient, subject to the coefficients summing to 1.0; robustness is confirmed in Sec. 4.6.3.
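Eqs. (1)–(3) can be sketched as follows. The min–max normalisation of CH and DB across partitions and the positivity shift applied before normalising the weights are assumptions, since the extraction lost the exact formulas:

```python
import numpy as np
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

def quality_weights(X, partitions, coeffs=(0.4, 0.3, 0.3)):
    """Eqs. (2)-(3): per-partition quality scores -> normalised weights."""
    sil = np.array([silhouette_score(X, p) for p in partitions])
    ch = np.array([calinski_harabasz_score(X, p) for p in partitions])
    db = np.array([davies_bouldin_score(X, p) for p in partitions])
    norm = lambda v: (v - v.min()) / (v.max() - v.min() + 1e-12)
    q = coeffs[0] * sil + coeffs[1] * norm(ch) - coeffs[2] * norm(db)
    q = q - q.min() + 1e-12     # shift so every score is strictly positive
    return q / q.sum()          # Eq. (3)

def weighted_coassociation(partitions, weights):
    """Eq. (1): S_ij = sum_m w_m * 1[pi_m(x_i) == pi_m(x_j)]."""
    n = len(partitions[0])
    S = np.zeros((n, n))
    for w, p in zip(weights, partitions):
        p = np.asarray(p)
        S += w * (p[:, None] == p[None, :])
    return S
```

Because the weights sum to one, the diagonal of \(S\) is exactly 1 and off-diagonal entries lie in \([0, 1]\).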

GMM probabilistic modelling and model selection

Similarity values are extracted from the upper triangle of \(S\): \(\{s_l\} = \{S_{ij} : 1 \le i < j \le n\}\), with \(N_s = n(n-1)/2\).

Rationale for 1-D GMM on similarity values. Within-cluster pairs tend towards high co-association (near 1); between-cluster pairs towards low values (near 0). Boundary and noisy pairs produce intermediate values. This naturally creates a multi-modal distribution on [0, 1] that a GMM captures effectively. Each Gaussian component represents a distinct similarity regime, and its posterior probability serves as a soft cluster-membership indicator.

The 1-D GMM model is defined by Eq. (4) as:

\[
p(s) \;=\; \sum_{k=1}^{K} \pi_k\, \mathcal{N}\big(s \mid \mu_k, \sigma_k^2\big)
\tag{4}
\]

EM updates. E-step: \(\gamma_{lk} = \pi_k \mathcal{N}(s_l \mid \mu_k, \sigma_k^2) \big/ \sum_{j=1}^{K} \pi_j \mathcal{N}(s_l \mid \mu_j, \sigma_j^2)\). M-step: \(\pi_k = \frac{1}{N_s}\sum_{l} \gamma_{lk}\), \(\mu_k = \sum_{l} \gamma_{lk} s_l \big/ \sum_{l} \gamma_{lk}\), \(\sigma_k^2 = \sum_{l} \gamma_{lk}(s_l - \mu_k)^2 \big/ \sum_{l} \gamma_{lk}\).

ICL model selection via Eq. (5):

\[
\mathrm{ICL}(K) \;=\; \log \hat{L} \;-\; \frac{\nu_K}{2}\log N_s \;+\; \sum_{l=1}^{N_s}\sum_{k=1}^{K} \hat{\gamma}_{lk}\log \hat{\gamma}_{lk}
\tag{5}
\]

where \(\nu_K = 3K - 1\) is the number of free parameters of the 1-D GMM and the third term is the entropy penalty encouraging well-separated components.
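A minimal sketch of ICL-based model selection on the 1-D similarity values, using scikit-learn. Under scikit-learn's lower-is-better BIC convention, the criterion in Eq. (5) multiplied by \(-2\) equals BIC plus twice the assignment entropy, which is what the sketch minimises (the function name `select_gmm_by_icl` is an assumption):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_gmm_by_icl(s, k_max=6, seed=0):
    """Fit 1-D GMMs for K = 1..k_max and return the ICL-optimal model.
    Minimises BIC + 2 * EN(K), where EN = -sum_{l,k} g_lk log g_lk is the
    assignment entropy (equivalent to maximising Eq. 5)."""
    s = np.asarray(s).reshape(-1, 1)
    best, best_icl = None, np.inf
    for k in range(1, k_max + 1):
        gm = GaussianMixture(n_components=k, random_state=seed).fit(s)
        resp = gm.predict_proba(s)
        entropy = -np.sum(resp * np.log(np.clip(resp, 1e-12, 1.0)))
        icl = gm.bic(s) + 2.0 * entropy   # lower is better
        if icl < best_icl:
            best, best_icl = gm, icl
    return best, best_icl
```

The entropy term is zero for crisp assignments, so overlapping extra components are penalised relative to plain BIC.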

Sample cluster-membership probability is computed via Eq. (6):

\[
P(k \mid x_i) \;=\;
\frac{\frac{1}{|C_k|}\sum_{x_j \in C_k} \gamma_{\mathrm{high}}(S_{ij})}
     {\sum_{k'} \frac{1}{|C_{k'}|}\sum_{x_j \in C_{k'}} \gamma_{\mathrm{high}}(S_{ij})}
\tag{6}
\]

where \(C_k\) is the current member set of cluster k and \(\gamma_{\mathrm{high}}(\cdot)\) is the posterior probability of the highest-mean Gaussian component.

If sample i truly belongs to cluster k, its similarities to the other members of \(C_k\) will predominantly fall in the Gaussian component representing high similarity, yielding a high \(P(k \mid x_i)\).

Three-way decision region division

The maximum posterior probability (clustering confidence) is defined in Eq. (7) as:

\[
p_{\max}(x_i) \;=\; \max_{k}\, P(k \mid x_i)
\tag{7}
\]

Otsu-based threshold selection. The Otsu algorithm (Eq. 8) is applied to \(\{p_{\max}(x_i)\}_{i=1}^{n}\) to obtain the upper threshold:

\[
\alpha \;=\; \arg\max_{t}\; \omega_0(t)\,\omega_1(t)\,\big(\mu_0(t) - \mu_1(t)\big)^2
\tag{8}
\]

where \(\omega_0(t), \omega_1(t)\) and \(\mu_0(t), \mu_1(t)\) are the proportions and mean confidences of the samples below and above the candidate threshold t.

The Otsu algorithm is selected because it maximises inter-class variance between high- and low-confidence populations without requiring a priori knowledge of the threshold distribution, naturally separating core samples from uncertain ones. The lower threshold is \(\beta = r\alpha\) with \(r = 0.65\).

Relationship between \(\alpha\) and \(\beta\). Although \(\alpha\) is data-adaptive, the ratio r controls the relative width of the boundary region. Too high an r collapses the boundary domain, pushing uncertain samples into the trivial domain and losing information; too low an r creates an excessively broad boundary domain, reducing the core/boundary discriminative power. The value \(r = 0.65\) represents a principled balance: it retains the majority of uncertain samples in the boundary domain (where label propagation can recover labels) while keeping the trivial domain small (average 4.8% across datasets). Sensitivity analysis in Sec. 4.6 confirms robustness.
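The adaptive threshold step can be sketched as a 1-D Otsu search over the confidence histogram. The bin count and the default ratio \(r = 0.65\) are assumptions where the extraction lost the exact values:

```python
import numpy as np

def otsu_threshold(p_max, bins=64):
    """Eq. (8): exhaustive search for the threshold maximising the
    between-class variance of the confidence histogram."""
    hist, edges = np.histogram(p_max, bins=bins, range=(0.0, 1.0))
    hist = hist.astype(float) / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        w0, w1 = hist[:i].sum(), hist[i:].sum()
        if w0 == 0 or w1 == 0:
            continue   # one class empty: between-class variance undefined
        mu0 = (hist[:i] * centers[:i]).sum() / w0
        mu1 = (hist[i:] * centers[i:]).sum() / w1
        var_b = w0 * w1 * (mu0 - mu1) ** 2     # sigma_B^2(t)
        if var_b > best_var:
            best_t, best_var = centers[i], var_b
    return best_t

def three_way_thresholds(p_max, r=0.65):
    """Upper threshold alpha via Otsu; lower threshold beta = r * alpha."""
    alpha = otsu_threshold(p_max)
    return alpha, r * alpha
```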

The three decision regions are defined by Eq. (9) as:

\[
\mathrm{POS} = \{x_i : p_{\max}(x_i) \ge \alpha\},\quad
\mathrm{BND} = \{x_i : \beta \le p_{\max}(x_i) < \alpha\},\quad
\mathrm{NEG} = \{x_i : p_{\max}(x_i) < \beta\}
\tag{9}
\]

Label assignment strategy

Positive domain (POS): direct GMM assignment: \(L(x_i) = \arg\max_k P(k \mid x_i)\).

Boundary domain (BND): co-association label propagation. Confident neighbours are identified as \(N_i = \{x_j \in \mathrm{POS} : S_{ij} \ge \alpha\}\), and the label is determined by weighted voting: \(L(x_i) = \arg\max_{c} \sum_{x_j \in N_i,\, L(x_j) = c} S_{ij}\). If \(N_i = \emptyset\), fall back to the GMM maximum posterior.

Negative domain (NEG): with noise threshold \(\theta\): \(L(x_i) = -1\) if \(p_{\max}(x_i) < \theta\); otherwise \(L(x_i) = \arg\max_k P(k \mid x_i)\). Treatment of \(L(x_i) = -1\) samples in evaluation is addressed in Sec. 4.9.
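The three-region division and the per-region assignment strategies can be sketched together. The confident-neighbour rule \(S_{ij} \ge \alpha\) and the noise-threshold value are assumptions where the extraction lost the originals:

```python
import numpy as np

def assign_labels(P, S, alpha, beta, theta=0.1):
    """Region division (Eq. 9) and differentiated label assignment.
    P: (n, K) membership probabilities; S: (n, n) weighted co-association
    matrix; alpha/beta: 3WD thresholds; theta: NEG noise threshold
    (illustrative value)."""
    p_max = P.max(axis=1)
    labels = P.argmax(axis=1)                  # GMM maximum posterior
    pos = p_max >= alpha
    bnd = (p_max >= beta) & (p_max < alpha)
    neg = p_max < beta
    # BND: weighted vote over confident POS neighbours; fall back to the
    # GMM maximum posterior when no confident neighbour exists.
    for i in np.where(bnd)[0]:
        nbrs = np.where(pos & (S[i] >= alpha))[0]
        if nbrs.size:
            votes = np.bincount(labels[nbrs], weights=S[i, nbrs])
            labels[i] = int(votes.argmax())
    labels[neg & (p_max < theta)] = -1         # residual noise marker
    return labels, (pos, bnd, neg)
```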

Complexity analysis

Time: \(O(Mn^2 + TKN_s)\), where \(T\) is the number of EM iterations and \(N_s = n(n-1)/2\). Space: \(O(n^2)\), dominated by the co-association matrix. The dominant runtime contributions are co-association construction (\(O(Mn^2)\)) and GMM fitting on \(N_s\) similarity values (\(O(TKN_s)\)). For MNIST (\(n = 10{,}000\)), memory is approximately 0.8 GB and wall-clock time approximately 6 minutes on standard hardware.

Experiments and results

Experimental setup

Datasets. Eight benchmark datasets were used (Table 2): six small-scale UCI datasets, one large-scale MNIST dataset (10,000 samples), and one synthetic Aggregation dataset (788 samples, 7 non-convex clusters). All data were Z-score standardised; MNIST and Digits were reduced via PCA to 50 and 30 dimensions, respectively.

Table 2.

Dataset characteristics.

Dataset Samples Features Classes Type
Iris 150 4 3 Small-scale
Wine 178 13 3 Small-scale
Glass 214 9 6 Small-scale
Vehicle 846 18 4 Small-scale
Segment 2310 19 7 Small-scale
Digits 5620 64 10 Small-scale
MNIST 10000 784 10 Large-scale
Aggregation 788 2 7 Synthetic

Comparison methods. Nine methods are compared: K-Means (single-clustering baseline); CSPA, MCLA (classical ensemble)4; LWEA9; PCPA11; TRCE20; SDGCA14; FCE15; and GMM-3WD-CE (proposed). Two ablation variants (GMM-BIC, GMM-Fixed) are compared in Sec. 4.8.

Evaluation metrics. NMI, ARI, and ACC, all in [0, 1], higher is better. Each experiment was run independently 30 times; mean ± std is reported.
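The three metrics can be computed as below. NMI and ARI come directly from scikit-learn; ACC uses the standard Hungarian optimal label matching (the paper's exact implementation is not given, so this is a conventional sketch, and it assumes NEG-marked samples have already been excluded or reassigned):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_acc(y_true, y_pred):
    """ACC via optimal one-to-one label matching (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((D, D), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1                       # contingency counts
    row, col = linear_sum_assignment(-cost)   # maximise matched pairs
    return cost[row, col].sum() / len(y_true)

def evaluate(y_true, y_pred):
    return {"NMI": normalized_mutual_info_score(y_true, y_pred),
            "ARI": adjusted_rand_score(y_true, y_pred),
            "ACC": clustering_acc(y_true, y_pred)}
```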

Performance comparison

Tables 3 and 4 present NMI and ARI results; the grouped bar chart in Figure 1 gives an at-a-glance overview.

Table 3.

NMI comparison (mean ± std). Bold = best; underline = second best.

Method Iris Wine Glass Vehicle Segment Digits MNIST Aggr. Avg
K-Means 0.750±0.035 0.425±0.068 0.385±0.055 0.185±0.058 0.548±0.045 0.695±0.042 0.515±0.038 0.725±0.032 0.529
CSPA 0.795±0.030 0.732±0.042 0.425±0.048 0.158±0.052 0.582±0.038 0.728±0.036 0.582±0.033 0.812±0.026 0.602
MCLA 0.845±0.025 0.782±0.035 0.485±0.042 0.198±0.048 0.625±0.035 0.755±0.032 0.628±0.030 0.848±0.022 0.646
LWEA 0.856±0.023 0.798±0.031 0.502±0.038 0.215±0.042 0.648±0.032 0.772±0.029 0.658±0.027 0.865±0.020 0.664
PCPA 0.867±0.021 0.815±0.028 0.518±0.035 0.228±0.038 0.665±0.029 0.788±0.027 0.685±0.025 0.882±0.018 0.681
TRCE 0.858±0.022 0.822±0.026 0.525±0.033 0.232±0.040 0.670±0.032 0.793±0.029 0.695±0.027 0.888±0.019 0.685
SDGCA 0.872±0.020 0.830±0.025 0.542±0.031 0.241±0.036 0.675±0.030 0.795±0.028 0.702±0.026 0.906±0.017 0.695
FCE 0.855±0.023 0.818±0.028 0.515±0.040 0.230±0.042 0.662±0.033 0.782±0.031 0.688±0.029 0.885±0.020 0.679
GMM-3WD-CE 0.891±0.024 0.838±0.029 0.538±0.036 0.252±0.041 0.683±0.031 0.805±0.027 0.718±0.028 0.898±0.021 0.703

Table 4.

ARI comparison (mean ± std). Bold = best; underline = second best.

Method Iris Wine Glass Vehicle Segment Digits MNIST Aggr. Avg
K-Means 0.720±0.038 0.385±0.072 0.328±0.058 0.145±0.062 0.498±0.048 0.652±0.045 0.448±0.041 0.685±0.035 0.483
CSPA 0.765±0.033 0.698±0.045 0.368±0.052 0.125±0.055 0.535±0.041 0.688±0.039 0.518±0.036 0.775±0.029 0.559
MCLA 0.812±0.028 0.748±0.038 0.425±0.045 0.158±0.051 0.582±0.038 0.718±0.035 0.568±0.033 0.815±0.025 0.603
LWEA 0.825±0.026 0.765±0.034 0.442±0.041 0.172±0.045 0.605±0.035 0.738±0.032 0.598±0.030 0.835±0.023 0.623
PCPA 0.838±0.024 0.785±0.031 0.458±0.038 0.188±0.041 0.625±0.032 0.755±0.030 0.628±0.028 0.855±0.021 0.641
TRCE 0.830±0.026 0.795±0.029 0.468±0.036 0.192±0.044 0.632±0.035 0.762±0.032 0.638±0.030 0.862±0.022 0.647
SDGCA 0.841±0.025 0.802±0.028 0.486±0.034 0.201±0.040 0.638±0.033 0.768±0.031 0.645±0.029 0.881±0.019 0.658
FCE 0.825±0.027 0.790±0.032 0.460±0.042 0.190±0.045 0.622±0.036 0.750±0.033 0.632±0.031 0.858±0.023 0.641
GMM-3WD-CE 0.863±0.027 0.812±0.032 0.479±0.039 0.218±0.043 0.648±0.034 0.774±0.030 0.661±0.031 0.873±0.024 0.666

Fig. 1. NMI comparison across all nine methods and eight datasets. GMM-3WD-CE (dark gold) consistently achieves the highest score. Error bars denote \(\pm 1\) std over 30 runs.

GMM-3WD-CE achieves the best average performance across the eight datasets. Compared to SDGCA (the strongest recent baseline), the average difference is +0.008 NMI and +0.008 ARI; however, this margin does not reach statistical significance (Wilcoxon \(p = 0.089\) for NMI, \(p = 0.093\) for ARI, Cohen's \(d = 0.41\); see Table 5), and the two methods should be considered competitive. Dataset-level gains are most pronounced on Vehicle (+0.011 NMI, +0.017 ARI over SDGCA) and MNIST (+0.016 NMI, +0.016 ARI), precisely the scenarios where ambiguous cluster boundaries and high dimensionality make probabilistic modelling and confidence-aware label assignment most beneficial. Notably, on the Glass and Aggregation datasets SDGCA achieves slightly higher performance, which can be attributed to its adversarial integration strategy being particularly effective for those data characteristics.

Table 5.

Statistical significance (Wilcoxon) and effect sizes (Cohen's d), at the 0.05 significance level.

Comparison p-value (NMI) d (NMI) p-value (ARI) d (ARI)
vs. K-Means < 0.001 1.98 < 0.001 1.92
vs. CSPA < 0.001 1.64 < 0.001 1.58
vs. MCLA 0.003 1.26 0.002 1.21
vs. LWEA 0.018 0.89 0.015 0.84
vs. PCPA 0.035 0.72 0.028 0.68
vs. TRCE 0.042 0.65 0.051 0.58
vs. SDGCA 0.089 0.41 0.093 0.38
vs. FCE 0.021 0.76 0.024 0.71
vs. GMM-BIC 0.006 0.83 0.005 0.79
Friedman (all nine methods) < 0.001 – < 0.001 –

Weighted co-association matrix vs. ground truth. Figure 2 shows a side-by-side visualisation of (left) the weighted co-association matrix produced by GMM-3WD-CE and (right) the ground-truth similarity matrix on the Iris dataset. The block-diagonal structure of the weighted CA matrix closely mirrors the ground truth, with the main deviations concentrated in the boundary region between classes 2 and 3, precisely where the Iris classes overlap. This confirms that the quality-based weighting scheme effectively emphasises reliable base clusterings and suppresses noisy ones. Figure 3 further illustrates, via t-SNE projections on Iris, that GMM-3WD-CE most closely reproduces the ground-truth partition.

Fig. 2. Weighted co-association matrix vs. ground-truth similarity matrix (Iris). Samples are ordered by true label; the block-diagonal alignment validates the quality-based weighting strategy. Deviations are concentrated at the class 2–class 3 boundary.

Fig. 3. t-SNE visualisation of clustering results on Iris. All four panels share the same embedding; colours are aligned to ground-truth labels via Hungarian matching. GMM-3WD-CE most closely reproduces the ground-truth partition.

Statistical significance and effect-size analysis

Table 5 reports Wilcoxon signed-rank test p-values and Cohen's d (ratio of mean NMI/ARI difference to pooled standard deviation, averaged across datasets). A Friedman test across all nine methods is included. By Cohen's convention, \(d \ge 0.5\) is "medium" and \(d \ge 0.8\) is "large".

GMM-3WD-CE outperforms most baselines with statistical significance (\(p < 0.05\)). Compared to the strongest recent baseline SDGCA, GMM-3WD-CE shows consistent but not statistically significant improvements (\(p = 0.089\) for NMI, \(p = 0.093\) for ARI), suggesting the two methods are competitive. Effect sizes against classical baselines are large (\(d = 0.89\) vs. LWEA on NMI), confirming that improvements over traditional methods are practically meaningful. The Friedman test confirms significant overall differences among the nine methods (\(p < 0.001\)).
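A sketch of the paired significance test and effect size, using the per-dataset mean NMI values from Table 3 for the GMM-3WD-CE vs. SDGCA comparison; the standard Cohen's d formula with pooled standard deviation is assumed:

```python
import numpy as np
from scipy.stats import wilcoxon

def cohens_d(a, b):
    """Effect size: mean difference over pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return (a.mean() - b.mean()) / pooled

# Per-dataset mean NMI scores from Table 3 (Iris .. Aggregation):
ours  = [0.891, 0.838, 0.538, 0.252, 0.683, 0.805, 0.718, 0.898]
sdgca = [0.872, 0.830, 0.542, 0.241, 0.675, 0.795, 0.702, 0.906]

# Paired two-sided Wilcoxon signed-rank test over the 8 datasets.
stat, p = wilcoxon(ours, sdgca)
```

With only eight paired observations the test has limited power, which is consistent with the non-significant p-value reported against SDGCA.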

Ablation study

Table 6 and Figure 4 show the cumulative contribution of each component.

Table 6.

Ablation study – component contributions (NMI).

Method variant Iris Wine Vehicle Segment Digits MNIST
K-Means ensemble only 0.795 0.732 0.158 0.582 0.728 0.582
+ Multi-algorithm fusion 0.828 0.765 0.198 0.618 0.758 0.625
+ Weighted co-association 0.852 0.792 0.215 0.642 0.775 0.658
+ ICL model selection 0.875 0.818 0.229 0.665 0.792 0.695
Complete (+ 3WD) 0.891 0.838 0.252 0.683 0.805 0.718
Total improvement +12.1% +14.5% +59.5% +17.4% +10.6% +23.4%
3WD contribution +1.8% +2.4% +10.0% +2.7% +1.6% +3.3%

Fig. 4. Ablation study – cumulative NMI contribution of each module for four representative datasets. The +59.5% annotation marks the total gain on Vehicle.

Explanation of the 59.5% total improvement on Vehicle. The Vehicle dataset contains four classes with significant spectral overlap (18 features, \(n = 846\)), resulting in a very low baseline NMI of 0.158. Multi-algorithm fusion provides the largest gain (+0.040 NMI), as the structural diversity from GMM, Spectral, and HDBSCAN captures different aspects of Vehicle's overlapping class boundaries. Subsequent components show diminishing returns (+0.017 from weighted co-association, +0.014 from ICL model selection), except for the 3WD step, which contributes +0.023 NMI (+10.0%) due to Vehicle's exceptionally high boundary-domain ratio (43.5%).

Three-way decision region analysis

Table 7 and Figure 5 report the proportion and per-region accuracy of each domain. The POS domain averages 60.9% of samples at 86.8% accuracy; BND averages 34.4% at 68.3%; NEG averages 4.8% at 43.1%. The 18.5-percentage-point accuracy gap between POS and BND validates the need for differentiated strategies, and the strong positive correlation between POS accuracy and final performance confirms that confident GMM predictions are highly reliable.

Table 7.

Three-way decision region distribution and per-region ACC.

Dataset POS (Core) BND (Boundary) NEG (Trivial)
Ratio(%) ACC Ratio(%) ACC Ratio(%) ACC
Iris 67.3 95.8 28.7 81.2 4.0 48.5
Wine 62.8 96.1 32.5 86.8 4.7 52.3
Glass 51.2 69.8 41.5 44.6 7.3 27.4
Vehicle 48.7 73.5 44.1 48.9 7.2 29.8
Segment 60.8 84.6 34.5 63.2 4.7 38.9
Digits 66.5 90.9 30.2 73.5 3.3 49.7
MNIST 57.8 85.4 37.9 58.6 4.3 37.1
Aggregation 71.8 98.6 25.4 89.8 2.8 61.2
Average 60.9 86.8 34.4 68.3 4.8 43.1

Fig. 5. Three-way decision region visualisation (Iris). Left: PCA projection coloured by region membership (POS/BND/NEG). Right: histogram of maximum posterior probabilities \(p_{\max}(x_i)\) with adaptive thresholds \(\alpha\) and \(\beta\) marked; shaded bands indicate the three decision regions.

Parameter sensitivity analysis

Sensitivity to threshold ratio r

Table 8 and Figure 6 show NMI as r varies over [0.50, 0.80].

Table 8.

Sensitivity to threshold ratio r (NMI).

Dataset r=0.50 r=0.55 r=0.60 r=0.65 r=0.70 r=0.75 r=0.80
Iris 0.871 0.882 0.889 0.891 0.888 0.881 0.872
Wine 0.818 0.828 0.835 0.840 0.842 0.838 0.829
Vehicle 0.225 0.235 0.243 0.248 0.252 0.249 0.241
MNIST 0.702 0.711 0.718 0.717 0.713 0.706 0.698
Fig. 6. Sensitivity of NMI to the threshold ratio r on four datasets. The optimal ratio varies slightly across datasets: r = 0.60 for MNIST, r = 0.65 for Iris, and r = 0.70 for Wine and Vehicle. We select r = 0.65 as a robust default that performs within 1.5% of the optimum across all datasets.

The optimal r varies slightly across datasets, but \(r = 0.65\) provides robust performance within 1.5% of the optimum on all datasets, validating the default setting.

Table 9.

Sensitivity to number of base clusterings M (NMI).

Dataset M=10 M=20 M=30 M=50 M=75 M=100
Iris 0.834 0.861 0.878 0.891 0.893 0.892
Wine 0.771 0.808 0.825 0.838 0.840 0.839
Vehicle 0.205 0.228 0.238 0.252 0.254 0.253
MNIST 0.672 0.698 0.710 0.718 0.720 0.719

Sensitivity to number of base clusterings M

Performance improves substantially from \(M = 10\) to \(M = 30\) (Table 9), then stabilises near \(M = 50\). Increasing to \(M = 100\) yields negligible gains at double the computational cost; \(M = 50\) is therefore the recommended default.

Sensitivity to quality-score coefficients

Table 10 evaluates six coefficient configurations for \((\lambda_1, \lambda_2, \lambda_3)\) in Eq. (2).

Table 10.

Sensitivity to quality-score coefficients (NMI). \(\lambda_1\): silhouette, \(\lambda_2\): CH, \(\lambda_3\): DB.

\(\lambda_1\) \(\lambda_2\) \(\lambda_3\) Iris Wine Vehicle MNIST
0.4 0.3 0.3 0.891 0.838 0.252 0.718
0.3 0.4 0.3 0.883 0.832 0.246 0.711
0.3 0.3 0.4 0.879 0.825 0.243 0.708
0.5 0.3 0.2 0.888 0.835 0.249 0.715
0.33 0.33 0.34 0.881 0.829 0.245 0.710
0.5 0.25 0.25 0.889 0.836 0.250 0.716

The proposed (0.4, 0.3, 0.3) consistently performs best or within 0.01 of the best across all datasets. The silhouette coefficient’s dominance is justified: it directly measures the ratio of inter-cluster separation to intra-cluster cohesion, making it the most informative single quality indicator.

Runtime and scalability analysis

Table 11 compares average runtimes (seconds) on three datasets of increasing size. All runtimes are averages over 30 runs on an Intel Core i7/16 GB system.

Table 11.

Runtime comparison (seconds, mean ± std).

Method Iris (n = 150) Vehicle (n = 846) MNIST (n = 10 000)
K-Means 0.02±0.003 0.15±0.018 3.25±0.42
CSPA 0.05±0.008 1.38±0.21 12.48±1.85
MCLA 0.07±0.011 2.21±0.34 18.32±2.68
LWEA 0.10±0.015 2.92±0.41 24.75±3.52
PCPA 0.15±0.022 4.55±0.68 38.18±5.14
TRCE 0.23±0.035 6.82±1.05 52.64±7.92
SDGCA 0.13±0.019 3.98±0.58 31.82±4.51
FCE 0.14±0.021 4.22±0.63 35.42±4.98
GMM-3WD-CE 0.82±0.126 8.42±1.31 362.48±48.62

GMM-3WD-CE is the most computationally intensive method. On MNIST it requires roughly 6.9× the time of TRCE (the next most expensive) because of GMM fitting over \(N_s = n(n-1)/2 \approx 5 \times 10^{7}\) similarity values. This trade-off is acknowledged in Sec. 6.

Scalability. Table 12 and Figure 7 report runtime on synthetic data with varying n, confirming approximately \(O(n^2)\) growth.

Table 12.

Scalability: GMM-3WD-CE runtime vs. sample size (seconds).

n 500 1 000 2 000 5 000 10 000 20 000
Time 1.08 3.89 16.72 98.34 362.48 1 582.8
Expected \(O(n^2)\) – 4.32 17.28 108.00 432.00 1 728.0
Deviation – −10.0% −3.2% −8.9% −16.1% −8.4%

Fig. 7. Scalability of GMM-3WD-CE. Log–log plot of measured runtime vs. sample size n; the grey dashed reference line marks \(O(n^2)\) growth. Data points closely follow the reference, confirming quadratic scaling, with deviations within 16% attributable to caching and compiler optimisations.

Successive doublings of n yield roughly four-fold increases in runtime, consistent with \(O(n^2)\). Actual measurements are slightly below the quadratic extrapolation (deviations of −3% to −16%) due to caching and compiler optimisations. On standard hardware, datasets up to \(n = 20{,}000\) can be processed in about 26 minutes.
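The quadratic-scaling claim can be checked directly from the Table 12 measurements by fitting the slope of log-runtime against log-n:

```python
import numpy as np

# Runtime measurements from Table 12 (sample size n, seconds):
n = np.array([500, 1_000, 2_000, 5_000, 10_000, 20_000], dtype=float)
t = np.array([1.08, 3.89, 16.72, 98.34, 362.48, 1582.8])

# The slope of log t against log n estimates the empirical scaling
# exponent; a value near 2 is consistent with the claimed O(n^2) growth.
slope = np.polyfit(np.log(n), np.log(t), 1)[0]
```

For these values the fitted exponent comes out just below 2, matching the observation that measured runtimes fall slightly under the quadratic extrapolation.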

Comparison with model variants

Table 13 and Figure 8 validate two key design choices.

Table 13.

Model-variant comparison (NMI). Avg improvement denotes the average NMI gain of the full model over each variant.

Method Iris Wine Glass Vehicle Segment Avg improvement
GMM-BIC 0.878 0.825 0.531 0.235 0.674 +0.012
GMM-Fixed 0.871 0.819 0.522 0.228 0.665 +0.019
GMM-3WD-CE 0.891 0.838 0.538 0.252 0.683 –
p (vs BIC) 0.006
p (vs Fixed) 0.004

Fig. 8. ICL vs. BIC model-selection curves. Stars mark the optimal \(K^{*}\) for each criterion; the entropy penalty in ICL steers selection toward solutions with cleaner component assignments (smaller \(K^{*}\)), while BIC favours a richer fit (larger \(K^{*}\)).

Using BIC instead of ICL loses 0.012 average NMI (\(p = 0.006\)), because BIC lacks the entropy penalty that encourages well-separated components. Fixed thresholds underperform adaptive Otsu-based thresholds by 0.019 average NMI (\(p = 0.004\)) due to their inability to adapt to dataset-specific probability distributions.

Evaluation fairness: treatment of negative-domain samples

Samples assigned to the negative (NEG, trivial) domain are excluded from ACC, NMI, and ARI computation in the main results; on average only a small fraction of samples receive this label. To verify that this exclusion does not introduce selection bias, we evaluate “GMM-3WD-CE-Full”, in which NEG samples are assigned to their most probable cluster via the maximum-posterior rule arg max_k P(k | x), and all samples are included in the metrics (Table 14).

Table 14.

Full-assignment evaluation: average NMI and ARI across 8 datasets.

Method                              | Avg NMI | Avg ARI
GMM-3WD-CE (reported, NEG excluded) | 0.703   | 0.666
GMM-3WD-CE-Full (all samples)       | 0.697   | 0.661
SDGCA (all samples, for reference)  | 0.695   | 0.658

Even under the more conservative full-assignment evaluation, GMM-3WD-CE still edges out the strongest baseline, SDGCA (+0.002 NMI, +0.003 ARI), and therefore all other baselines as well. This confirms that the NEG exclusion does not materially bias the performance comparison.
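The two evaluation variants differ only in how NEG-domain samples are labelled, which can be illustrated in a few lines. The toy posteriors, the domain assignment, and the −1 placeholder for excluded samples are our own illustrative choices:

```python
# Posterior responsibilities per sample (rows) over three clusters; toy values.
posteriors = [
    [0.90, 0.06, 0.04],  # confident  -> core domain
    [0.36, 0.33, 0.31],  # diffuse    -> NEG (trivial) domain
    [0.08, 0.87, 0.05],  # confident  -> core domain
]
neg_domain = {1}  # indices of NEG-domain samples (toy assignment)

# Reported variant: NEG samples carry a placeholder label and are excluded
# from metric computation.
reported = [-1 if i in neg_domain else max(range(3), key=p.__getitem__)
            for i, p in enumerate(posteriors)]
# "Full" variant: every sample gets its argmax-posterior cluster.
full = [max(range(3), key=p.__getitem__) for p in posteriors]

print(reported)  # [0, -1, 1]
print(full)      # [0, 0, 1]
```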

Discussion

Performance pattern analysis

The margins over recent baselines (small over SDGCA, larger over PCPA) lie in the credible range for ensemble-clustering advances. Gains are largest on datasets with ambiguous boundaries (Vehicle, MNIST), where probabilistic uncertainty quantification and confidence-stratified label assignment add the most value. Conversely, on well-separated datasets (Aggregation, Iris) improvements are smaller, and SDGCA’s adversarial integration can outperform GMM-3WD-CE on specific datasets (Glass, Aggregation).

The standard deviations of GMM-3WD-CE across repeated runs are comparable to those of SDGCA, indicating that the ensemble design and the probabilistic framework both contribute to stability.

Advantages over TRCE

TRCE20 is a strong and relevant competitor, jointly learning multiple graphs and handling robustness at three levels. GMM-3WD-CE nevertheless outperforms TRCE on all eight datasets in both NMI and ARI. Three architectural differences explain this advantage: (i) GMM-3WD-CE explicitly models the similarity distribution, providing principled uncertainty quantification, whereas TRCE relies on graph-based similarity without such modelling; (ii) ICL model selection automatically determines the number of clusters, while TRCE requires it as input; (iii) Otsu-based adaptive thresholding produces more flexible boundaries than TRCE’s fixed-structure approach. These advantages are most pronounced on datasets with non-uniform cluster densities (Glass and Vehicle).

Comparison with deep clustering methods

DEC25, IDEC26, and DAC27 achieve higher NMI on MNIST than GMM-3WD-CE’s 0.718. However, GMM-3WD-CE offers complementary advantages:

  • Small-sample suitability: deep methods require thousands of samples for training; GMM-3WD-CE works well with as few as 150 samples (Iris, Wine).

  • Interpretability: GMM provides explicit probabilistic cluster membership; 3WD gives intuitive confidence stratification.

  • Efficiency: runs on CPU in minutes on MNIST versus hours for deep methods, without GPUs or hyperparameter tuning.

  • Fully unsupervised: no pre-training on labelled data is required.

A natural future direction is a hybrid combining deep feature extraction with the GMM-3WD framework.

Model choice justification

GMM is chosen for similarity distribution modelling because (1) the co-association values naturally form a multi-modal distribution amenable to mixture modelling, (2) GMM yields closed-form posteriors required by the 3WD framework, and (3) EM for 1-D GMM is computationally efficient. Dirichlet Process Mixture Models (DPMM) could learn the number of components automatically, but incur higher computational cost and reduce component interpretability. This trade-off is identified as future work in Sec. 6.
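To make point (3) concrete, a minimal EM loop for a two-component 1-D GMM over co-association similarity values can be written in a few dozen lines. This is an illustrative sketch on synthetic bimodal similarities, not the paper's implementation:

```python
import math
import random

def em_gmm_1d(values, iters=100):
    """Fit a two-component 1-D GMM by EM (illustrative, minimally initialised)."""
    w = [0.5, 0.5]                       # mixture weights
    mu = [min(values), max(values)]      # crude mean initialisation from the range
    var = [0.1, 0.1]                     # wide initial variances
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w[k] * math.exp(-0.5 * (x - mu[k]) ** 2 / var[k])
                    / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, variances from responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(values)
            mu[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, values)) / nk
            var[k] = max(var[k], 1e-6)   # guard against variance collapse
    return w, mu, var

random.seed(1)
# Bimodal similarities: a low "different-cluster" mode and a high "same-cluster" mode.
sims = ([random.gauss(0.15, 0.05) for _ in range(300)]
        + [random.gauss(0.85, 0.05) for _ in range(300)])
w, mu, var = em_gmm_1d(sims)
print(sorted(round(m, 2) for m in mu))  # component means near the two modes
```

The posteriors computed in the E-step are exactly the closed-form quantities that the 3WD thresholding stage consumes.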

Limitations and future work

Computational complexity. The O(n²) co-association matrix is the primary practical bottleneck, limiting scalability to roughly n ≈ 2 × 10⁴ on standard hardware. Co-association construction dominates the runtime, with GMM fitting on the O(n²) similarity values accounting for much of the remainder. Potential mitigations include anchor-based approximate co-association matrices, sparse representations, and mini-batch EM. This higher computational cost relative to classical baselines (CSPA, MCLA, LWEA) is an inherent trade-off for the probabilistic-to-decision pipeline and should be weighed against the performance gains in practical deployments.
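One of the mitigations above, a sparse co-association representation, can be sketched by counting co-occurring pairs within each base cluster and keeping only entries above a sparsity threshold. The function name and threshold value are illustrative, not from the paper:

```python
from collections import defaultdict

def sparse_coassociation(base_labels, threshold=0.5):
    """base_labels: one label list per base clustering; returns a dict of
    (i, j) -> co-association frequency, keeping only entries > threshold."""
    m = len(base_labels)
    counts = defaultdict(int)
    for labels in base_labels:
        # Group sample indices by cluster, then count pairs within each cluster
        # (cheaper than scanning all n(n-1)/2 pairs when clusters are small).
        clusters = defaultdict(list)
        for i, c in enumerate(labels):
            clusters[c].append(i)
        for members in clusters.values():
            for a in range(len(members)):
                for b in range(a + 1, len(members)):
                    counts[(members[a], members[b])] += 1
    return {pair: c / m for pair, c in counts.items() if c / m > threshold}

# Three base clusterings of four samples (toy example).
base = [[0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 1, 1]]
sparse = sparse_coassociation(base)
print(sparse)  # {(0, 1): 1.0, (2, 3): 0.666...}
```

Pairs that rarely co-occur are simply never stored, so the memory footprint follows the number of strong co-associations rather than n².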

Parameter setting. Although the number of mixture components is selected automatically via the ICL criterion, the threshold ratio r is set empirically. Future work could derive r from the shape of the posterior distribution or from dataset-specific statistics (e.g., cluster overlap, dimensionality).

High-dimensional and streaming data. Dimensionality reduction is required as a pre-processing step for very high-dimensional data. Extension to streaming settings requires incremental co-association updates, online GMM, and dynamic threshold adjustment.

Theoretical guarantees. The method lacks convergence proofs and optimality bounds. Establishing theoretical relationships between GMM accuracy and ensemble quality, and conditions under which 3WD provably improves results, would strengthen the contribution.

Alternative mixture models. Replacing GMM with DPMM or variational mixtures could improve adaptivity at the cost of efficiency. A systematic comparison is left to future work.

Conclusion

This paper proposes GMM-3WD-CE, a clustering ensemble method integrating GMM probabilistic modelling with three-way decision theory. The method constructs a quality-weighted co-association matrix, fits a 1-D GMM with ICL-based model selection to the similarity distribution, and partitions samples into core, boundary, and trivial domains via adaptive Otsu thresholding. Differentiated label-assignment strategies for each region yield the final consensus clustering.

Experiments on eight datasets with nine comparison methods demonstrate competitive performance, with particular improvements on datasets with ambiguous cluster boundaries (Vehicle, MNIST). Statistical tests with effect sizes, ablation studies, sensitivity analyses (threshold ratio, base-clustering count, quality coefficients), runtime/scalability evaluations, and a bias check for negative-domain sample exclusion provide comprehensive validation. The analysis reveals that the unified probabilistic-to-decision framework is most beneficial on datasets with high boundary uncertainty, and that the O(n²) complexity is the primary limitation for large-scale applications.

Author contributions

Y.M. conceived the research idea, designed the methodology, implemented the algorithms, conducted all experiments, performed data analysis and visualization, and wrote the original draft of the manuscript. Z.L. supervised the research, provided critical guidance on methodology and experimental design, contributed to the interpretation of results, and revised the manuscript. Both authors reviewed and approved the final manuscript.

Data availability

The code and analysis scripts associated with this study have been deposited on Zenodo and are available at https://doi.org/10.5281/zenodo.19333740. The datasets analysed during the current study are available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/)28 and the MNIST database (http://yann.lecun.com/exdb/mnist/)29.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
  • 2. Caliński, T. & Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. 3(1), 1–27 (1974).
  • 3. Davies, D. L. & Bouldin, D. W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1(2), 224–227 (1979).
  • 4. Strehl, A. & Ghosh, J. Cluster ensembles – a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617 (2002).
  • 5. Ghaemi, R., Sulaiman, M. N., Ibrahim, H. & Mustapha, N. A survey: Clustering ensembles techniques. World Acad. Sci. Eng. Technol. 38, 636–645 (2009).
  • 6. Fred, A. L. & Jain, A. K. Combining multiple clusterings using evidence accumulation. IEEE Trans. Pattern Anal. Mach. Intell. 27(6), 835–850 (2005).
  • 7. Iam-On, N., Boongoen, T., Garrett, S. & Price, C. A link-based cluster ensemble approach for categorical data clustering. IEEE Trans. Knowl. Data Eng. 24(3), 413–425 (2012).
  • 8. Topchy, A., Jain, A. K. & Punch, W. A mixture model for clustering ensembles. In Proc. SIAM Int. Conf. Data Mining 379–390 (2004).
  • 9. Huang, D., Wang, C. D. & Lai, J. H. Locally weighted ensemble clustering. IEEE Trans. Cybern. 48(5), 1460–1473 (2018).
  • 10. Xu, J., Li, T., Zhang, D. & Wu, J. Ensemble clustering via fusing global and local structure information. Expert Syst. Appl. 237, 121557 (2024).
  • 11. Li, N. et al. A point-cluster-partition architecture for weighted clustering ensemble. Neural Process. Lett. 56, 183 (2024).
  • 12. Zeng, L., Yao, S., Liu, X., Xiao, L. & Qian, Y. A clustering ensemble algorithm for handling deep embeddings using cluster confidence. Comput. J. 68(2), 163–174 (2025).
  • 13. Liu, F., Xue, S., Wu, J. et al. Deep learning for community detection: Progress, challenges and opportunities. In Proc. IJCAI 4981–4987 (2020).
  • 14. Zhang, X., Jia, Y., Song, M. & Wang, R. Similarity and dissimilarity guided co-association matrix construction for ensemble clustering. IEEE Trans. (2025).
  • 15. Zhou, P., Li, R., Ling, Z., Du, L. & Liu, X. Fair clustering ensemble with equal cluster capacity. IEEE Trans. Pattern Anal. Mach. Intell. 47(3), 1729–1746 (2025).
  • 16. Yao, Y. Y. Three-way decisions with probabilistic rough sets. Inf. Sci. 180(3), 341–353 (2010).
  • 17. Yu, H. Three-way decisions and three-way clustering. In Rough Sets: IJCRS 2018 13–28 (Springer, 2018).
  • 18. Wang, P. X., Shi, H., Yang, X. B. & Mi, J. S. Three-way k-means: Integrating k-means and three-way decision. Int. J. Mach. Learn. Cybern. 10, 2767–2777 (2019).
  • 19. Afridi, M. K., Azam, N., Yao, J. T. & Alanazi, E. A three-way clustering approach for handling missing data using GTRS. Int. J. Approx. Reason. 98, 11–24 (2018).
  • 20. Zhou, P., Du, L., Shen, Y.-D. & Li, X. Tri-level robust clustering ensemble with multiple graph learning. In Proc. AAAI Conf. Artificial Intelligence 35(12), 11125–11133 (2021).
  • 21. Reynolds, D. A. Gaussian mixture models. In Encyclopedia of Biometrics (eds Li, S. Z. & Jain, A. K.) 827–832 (Springer, 2015).
  • 22. Biernacki, C., Celeux, G. & Govaert, G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22(7), 719–725 (2000).
  • 23. Campagner, A., Ciucci, D. & Denoeux, T. Belief functions and rough sets: Survey and new insights. Int. J. Approx. Reason. 143, 192–215 (2022).
  • 24. Zhang, Q., Pang, G. & Wang, G. A novel sequential three-way decisions model based on penalty function. Knowl. Based Syst. 192, 105350 (2020).
  • 25. Xie, J., Girshick, R. & Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proc. Int. Conf. Machine Learning (ICML) 478–487 (2016).
  • 26. Guo, X., Liu, X., Zhu, E. & Yin, J. Improved deep embedded clustering with local structure preservation. In Proc. Int. Joint Conf. Artificial Intelligence (IJCAI) (2017).
  • 27. Chang, J., Wang, L., Meng, G., Xiang, S. & Pan, C. Deep adaptive image clustering. In Proc. IEEE Int. Conf. Computer Vision (ICCV) 5880–5888 (2017).
  • 28. Asuncion, A. & Newman, D. J. UCI Machine Learning Repository (University of California, 2007).
  • 29. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998).
  • 30. Huang, D., Wang, C. D., Wu, J. S., Lai, J. H. & Kwoh, C. K. Ultra-scalable spectral clustering and ensemble clustering. IEEE Trans. Knowl. Data Eng. 32(6), 1212–1226 (2020).
  • 31. Gu, Q. et al. An improved weighted ensemble clustering based on two-tier uncertainty measurement. Expert Syst. Appl. 237, 121419 (2024).
  • 32. Gionis, A., Mannila, H. & Tsaparas, P. Clustering aggregation. ACM Trans. Knowl. Discov. Data 1(1), 4 (2007).
