Abstract
Clustering ensembles improve clustering quality by integrating multiple base clustering results; however, existing methods handle boundary uncertainty inadequately and lack a unified probabilistic-to-decision framework. This paper proposes GMM-3WD-CE, which integrates a Gaussian mixture model (GMM) with three-way decision (3WD) theory to construct a multi-level uncertainty-modelling framework. The method generates diverse base clusterings via a multi-algorithm strategy, constructs a weighted co-association matrix using quality scores derived from the silhouette coefficient, the Caliński–Harabasz index, and the Davies–Bouldin index, employs the ICL criterion for optimal GMM model selection, and adaptively calculates three-way decision thresholds through the Otsu algorithm to partition samples into core, boundary, and trivial domains. Differentiated label-assignment strategies for each region yield the final consensus clustering. Comparative experiments on eight benchmark datasets with nine comparison methods show that GMM-3WD-CE achieves statistically significant average improvements of +0.022 in NMI and +0.025 in ARI over PCPA and +0.057 in NMI and +0.063 in ARI over classical MCLA, while remaining competitive with the strongest recent baseline, SDGCA (+0.008 average NMI advantage; Wilcoxon p = 0.089, medium effect size d = 0.41). Ablation experiments verify the contribution of each component; Wilcoxon and Friedman tests with Cohen's d effect sizes confirm statistical significance against all other baselines; and runtime/scalability analyses characterise the computational trade-offs.
Keywords: Clustering ensemble, Gaussian mixture model, Three-way decision, Uncertainty modelling, ICL criterion
Subject terms: Mathematics and computing, Medical research
Introduction
Clustering analysis is a core unsupervised-learning task with important applications in data mining and pattern recognition1–3. Traditional single clustering algorithms are sensitive to parameter settings, initialisation states, and distributional assumptions, making stable performance across diverse datasets difficult to achieve. Clustering ensemble methods address this by integrating the outputs of multiple base clusterers, leveraging “collective wisdom” to improve both quality and robustness4,5.
Current ensemble methods fall into three broad categories. Co-association-based methods measure pairwise similarity via co-occurrence frequency across base clusterings6. Graph-based methods convert co-association matrices into weighted graphs for partitioning7. Probabilistic methods model the ensemble process with mixture models8. Recent advances include locally weighted ensembles9, global–local structure fusion10, point-cluster-partition architectures11, and deep learning ensembles12,13. Novel co-association construction strategies exploiting both similarity and dissimilarity information have also been proposed14, while fair clustering ensemble methods now address cluster capacity balance15.
Three-way decision (3WD) theory partitions decision space into positive (core), boundary, and negative (trivial) domains, providing a principled uncertainty-handling framework16. It has been applied to various clustering tasks17–19. The Tri-level Robust Clustering Ensemble (TRCE)20 is particularly relevant: it addresses robustness at the base-clustering, graph, and instance levels simultaneously. GMM is widely used for clustering due to its distributional flexibility21, and Biernacki et al. introduced the ICL criterion22, which improves BIC via an entropy penalty term.
Despite these advances, existing methods exhibit several limitations. Most rely on hard clustering assumptions and do not adequately address cluster-boundary fuzziness23. A unified framework spanning probabilistic modelling through to decision-making remains lacking. Methods such as LWEA and PCPA weight base clusterings but assign all samples uniformly, ignoring varying confidence levels. SDGCA14 improves co-association construction but does not incorporate uncertainty-aware region-based label assignment. TRCE20 handles instance-level robustness via graph learning but does not explicitly model similarity distributions or perform adaptive threshold selection. Decision thresholds in existing 3WD approaches often rely on manual tuning24.
To address these gaps, this paper proposes GMM-3WD-CE. The main contributions are:
Unified probabilistic-to-decision framework. A complete theoretical pipeline is established from weighted co-association, through GMM-based probability estimation with ICL model selection, to adaptive Otsu-based three-way decision.
Quality-aware weighted co-association matrix. A weighting scheme combining three complementary indices (silhouette coefficient1, Caliński–Harabasz2, Davies–Bouldin3) is designed and validated, with the coefficient allocation justified empirically.
Adaptive threshold mechanism. The Otsu algorithm automatically determines the upper threshold $\alpha$, and the ratio r governing the lower threshold $\beta = r\alpha$ is demonstrated to be robust across diverse datasets, eliminating manual tuning.

Comprehensive experimental validation. Comparisons with nine methods on eight datasets are supported by statistical tests with effect sizes, ablation studies, sensitivity analyses (including the number of base clusterings M and the quality-score coefficients), runtime/scalability analysis, and an evaluation-fairness assessment for negative-domain samples.
The remainder follows the standard IMRaD structure: Related Work (Sec. "Related work"), Proposed Method (Sec. "Proposed method: GMM-3WD-CE"), Experiments and Results (Sec. "Experiments and results"), Discussion (Sec. "Discussion"), Limitations and Future Work (Sec. "Limitations and future work"), and Conclusion (Sec. "Conclusion").
Related work
Co-association-based ensemble clustering
The evidence-accumulation clustering (EAC) strategy6 is foundational, measuring sample similarity by co-occurrence frequency. Classical consensus functions built on this include CSPA and MCLA4. These methods treat all base clusterings equally, which is suboptimal when quality varies significantly across partitions.
Weighted ensemble methods
To correct the quality disparity, weighted approaches assign differential importance. LWEA9 quantifies per-cluster quality via entropy-based fragmentation. PCPA11 introduces a hierarchical weighting at the point, cluster, and partition levels. Zhang et al. proposed SDGCA14, which exploits both similarity and dissimilarity relationships guided by cluster size to construct an improved co-association matrix via adversarial integration. Zhou et al. introduced FCE15, a fair ensemble method that simultaneously enforces fairness and cluster capacity equality through a regularised objective. While these methods advance quality-aware weighting, none incorporates uncertainty-aware region-based label assignment to handle boundary samples explicitly.
Three-way decision in clustering
Three-way decision, formalised by Yao16, partitions decisions into acceptance, rejection, and deferral regions. In clustering, Wang et al.18 integrated 3WD with K-means; Afridi et al.19 addressed missing data via a granular-ball rough-set framework. TRCE20 is most closely related: it handles robustness at three levels by jointly learning multiple graphs. However, TRCE relies on graph-based similarity without explicit probabilistic modelling of the co-association distribution, does not employ information-theoretic model selection, and uses fixed or learned thresholds rather than adaptive Otsu-based thresholding. These distinctions are elaborated in Sec. 5.2.
Deep clustering methods
DEC25 maps data to a low-dimensional space via an autoencoder while jointly optimising a KL-divergence clustering objective. IDEC26 extends DEC by incorporating local structure preservation through a reconstruction loss. DAC27 frames clustering as binary pairwise classification. These methods achieve strong image-clustering performance but require substantial training data and GPU resources, and they lack the probabilistic interpretability of ensemble-based methods.
Positioning of GMM-3WD-CE
Compared to classical ensembles (CSPA, MCLA), GMM-3WD-CE adds probabilistic modelling. Compared to weighted methods (LWEA, PCPA, SDGCA, FCE), it adds 3WD for confidence-stratified label assignment. Compared to TRCE, it explicitly models similarity distributions as a mixture of Gaussians, employs ICL for model selection, and uses adaptive Otsu thresholding. Compared to deep methods (DEC, IDEC, DAC), it is fully unsupervised, interpretable, and computationally accessible on CPU.
Proposed method: GMM-3WD-CE
Problem definition and notation
Given a dataset $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^d$ and $M$ base clusterings $\{\pi^{(1)}, \ldots, \pi^{(M)}\}$, the goal is to produce a consensus clustering $\pi^*$ that is superior to any single base clustering. Key notation is summarised in Table 1.
Table 1.
Main notation.
| Symbol | Meaning |
|---|---|
| $X$; $x_i$ | Dataset; i-th sample |
| $n$, $d$, $M$ | #samples, #features, #base clusterings |
| $\pi^{(m)}$; $\pi^*$ | m-th base clustering; consensus clustering |
| $K$; $K^*$ | Candidate/optimal cluster number |
| $S$; $S_{ij}$ | Weighted co-association matrix; (i, j) entry |
| $w_m$ | Weight/quality score of m-th base clustering |
| $\pi_k$, $\mu_k$, $\sigma_k^2$ | GMM mixing weight, mean, variance |
| $P_{\max}(i)$ | Max posterior probability of sample $x_i$ |
| $\alpha$; $\beta$ | Upper/lower 3WD thresholds |
| POS, BND, NEG | Core, boundary, trivial domains |
| $y_i$ | Final cluster label of $x_i$ |
Algorithm framework
GMM-3WD-CE comprises five modules. The overall workflow is given in Algorithm 1.
Algorithm 1.
GMM-3WD-CE framework.
Diverse base clustering generation
We generate $M = 50$ base clusterings using four algorithms with the following allocations: K-Means (35%, 18 partitions), GMM (35%, 17), Spectral Clustering (15%, 7), and HDBSCAN (15%, 8).
Rationale for algorithm selection and proportions. K-Means and GMM are complementary: K-Means assumes spherical clusters with hard assignment, while GMM handles ellipsoidal clusters with soft assignment; together they provide structural diversity. Spectral clustering captures global graph structure and is effective for non-convex clusters. HDBSCAN handles varying densities and is robust to outliers. The 35/35/15/15 allocation gives the highest combined weight to the two most versatile algorithms while ensuring representation from graph-based and density-based paradigms. The value $M = 50$ is motivated empirically; sensitivity analysis in Sec. 4.6 shows that performance stabilises near this value. Although this allocation is determined empirically rather than derived from first principles, it is consistent with best practice in the ensemble-diversity literature, where algorithm-type diversity has been shown to matter more than exact proportions.
Diversity is enhanced through:

Cluster-number perturbation: the candidate cluster number $K$ is randomly sampled from a range around the expected number of clusters.

Parameter randomisation: K-Means initialisation (k-means++ or random) and maximum iterations (100–500); GMM covariance type (spherical, diagonal, full); Spectral Clustering affinity settings; HDBSCAN min-cluster-size (5–20).
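To make the generation step concrete, the sketch below draws the four algorithm families with the stated allocations and randomised parameters. It assumes scikit-learn ≥ 1.3 (which ships HDBSCAN); the candidate-K range and the helper name `generate_base_clusterings` are illustrative assumptions, not the paper's released code.

```python
# Sketch of the base-clustering generator with the 35/35/15/15 allocation.
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering, HDBSCAN
from sklearn.mixture import GaussianMixture

def generate_base_clusterings(X, k_candidates=range(2, 11), seed=0):
    rng = np.random.default_rng(seed)
    parts = []
    for _ in range(18):                       # K-Means (35%)
        parts.append(KMeans(
            n_clusters=int(rng.choice(list(k_candidates))),
            init=str(rng.choice(["k-means++", "random"])),
            max_iter=int(rng.integers(100, 501)), n_init=1,
            random_state=int(rng.integers(1_000_000))).fit_predict(X))
    for _ in range(17):                       # GMM (35%)
        gm = GaussianMixture(
            n_components=int(rng.choice(list(k_candidates))),
            covariance_type=str(rng.choice(["spherical", "diag", "full"])),
            random_state=int(rng.integers(1_000_000))).fit(X)
        parts.append(gm.predict(X))
    for _ in range(7):                        # Spectral Clustering (15%)
        parts.append(SpectralClustering(
            n_clusters=int(rng.choice(list(k_candidates))),
            affinity="rbf", gamma=float(rng.uniform(0.1, 2.0)),
            random_state=int(rng.integers(1_000_000))).fit_predict(X))
    for _ in range(8):                        # HDBSCAN (15%)
        parts.append(HDBSCAN(
            min_cluster_size=int(rng.integers(5, 21))).fit_predict(X))
    return parts                              # M = 50 label vectors
```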
Weighted co-association matrix construction
![]() |
1 |
The co-association matrix is constructed via Eq. (1), where the weight
is derived from the quality score:
![]() |
2 |
![]() |
3 |
Here
is the silhouette coefficient1,
is the Caliński–Harabasz index2,
is the Davies–Bouldin index3, and
. Weights are normalized via Eq. (3).
Justification of coefficients in Eq. (2). The three indices measure complementary aspects. The silhouette coefficient directly quantifies the separation-to-cohesion ratio and is already normalised to $[-1, 1]$; it is the most interpretable single quality measure and thus receives the highest coefficient (0.4). The Caliński–Harabasz index measures the inter- to intra-cluster variance ratio, contributing a complementary compactness perspective (0.3). The Davies–Bouldin index measures average cluster-to-nearest-neighbour similarity (lower is better) and is subtracted with coefficient 0.3. The allocation (0.4, 0.3, 0.3) was determined by a grid search over candidate values for each coefficient subject to summing to 1.0; robustness is confirmed in Sec. 4.6.3.
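A minimal sketch of Eqs. (1)–(3) follows. It assumes each index is min–max normalised across the M base clusterings and that quality scores are shifted positive before weight normalisation (the paper's exact normalisation convention is not restated here); each partition is assumed to contain at least two clusters so the indices are defined.

```python
# Sketch of Eqs. (1)-(3): quality scoring and the weighted co-association matrix.
import numpy as np
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

def minmax(v):
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min() + 1e-12)

def weighted_co_association(X, partitions, lam=(0.4, 0.3, 0.3)):
    # Eq. (2): quality score per base clustering.
    sc = minmax([silhouette_score(X, p) for p in partitions])
    ch = minmax([calinski_harabasz_score(X, p) for p in partitions])
    db = minmax([davies_bouldin_score(X, p) for p in partitions])
    q = lam[0] * sc + lam[1] * ch - lam[2] * db
    q = q - q.min() + 1e-12           # keep all quality scores positive
    w = q / q.sum()                   # Eq. (3): normalised weights
    n = X.shape[0]
    S = np.zeros((n, n))
    for wm, p in zip(w, partitions):
        p = np.asarray(p)
        S += wm * (p[:, None] == p[None, :])   # Eq. (1): weighted co-occurrence
    return S, w
```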
GMM probabilistic modelling and model selection
Similarity values are extracted from the upper triangle of $S$: $\mathcal{S} = \{S_{ij} : 1 \le i < j \le n\}$, with $L = |\mathcal{S}| = n(n-1)/2$.
Rationale for 1-D GMM on similarity values. Within-cluster pairs tend towards high co-association (near 1); between-cluster pairs towards low values (near 0). Boundary and noisy pairs produce intermediate values. This naturally creates a multi-modal distribution on [0, 1] that a GMM captures effectively. Each Gaussian component represents a distinct similarity regime, and its posterior probability serves as a soft cluster-membership indicator.
The 1-D GMM is defined by Eq. (4) as:

$$p(s) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(s \mid \mu_k, \sigma_k^2) \qquad (4)$$
EM updates. E-step: $\gamma_{lk} = \dfrac{\pi_k\,\mathcal{N}(s_l \mid \mu_k, \sigma_k^2)}{\sum_{k'=1}^{K} \pi_{k'}\,\mathcal{N}(s_l \mid \mu_{k'}, \sigma_{k'}^2)}$. M-step: $\pi_k = \frac{1}{L}\sum_{l=1}^{L}\gamma_{lk}$, $\mu_k = \frac{\sum_l \gamma_{lk}\,s_l}{\sum_l \gamma_{lk}}$, $\sigma_k^2 = \frac{\sum_l \gamma_{lk}\,(s_l - \mu_k)^2}{\sum_l \gamma_{lk}}$.
ICL model selection via Eq. (5):

$$\mathrm{ICL}(K) = \ell(\hat{\Theta}_K) - \frac{\nu_K}{2}\log L + \sum_{l=1}^{L}\sum_{k=1}^{K}\gamma_{lk}\log\gamma_{lk} \qquad (5)$$

where $\ell(\hat{\Theta}_K)$ is the maximised log-likelihood, $\nu_K = 3K - 1$ is the number of free parameters of a K-component 1-D GMM, and the third term is the entropy penalty encouraging well-separated components.
Sample cluster-membership probability is computed via Eq. (6):

$$P(k \mid x_i) = \frac{1}{|C_k \setminus \{x_i\}|}\sum_{x_j \in C_k,\, j \neq i} \gamma_{\mathrm{high}}(S_{ij}) \qquad (6)$$

where $\gamma_{\mathrm{high}}(\cdot)$ denotes the posterior probability of the Gaussian component with the largest mean. If sample $i$ truly belongs to cluster $k$, its similarities to the other members of $C_k$ will predominantly fall in the high-similarity component, yielding a high $P(k \mid x_i)$.
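The ICL selection step can be sketched with scikit-learn's GaussianMixture. Since sklearn's BIC is defined as $-2\ell + \nu\log N$, the quantity $-\mathrm{BIC}/2$ minus the assignment entropy matches Eq. (5) up to sign convention; the candidate range for K is an assumption.

```python
# Sketch of ICL-based model selection (Eq. 5) on the upper-triangle values.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_gmm_by_icl(S, k_range=range(2, 7)):
    iu = np.triu_indices_from(S, k=1)
    s = S[iu].reshape(-1, 1)                  # L = n(n-1)/2 similarity values
    best_icl, best_gmm = -np.inf, None
    for k in k_range:
        gmm = GaussianMixture(n_components=k, random_state=0).fit(s)
        resp = gmm.predict_proba(s)
        ent = -np.sum(resp * np.log(resp + 1e-12))   # assignment entropy
        icl = -gmm.bic(s) / 2.0 - ent                # larger ICL is better
        if icl > best_icl:
            best_icl, best_gmm = icl, gmm
    return best_gmm
```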
Three-way decision region division
The maximum posterior probability (clustering confidence) is defined in Eq. (7) as:

$$P_{\max}(i) = \max_{k} P(k \mid x_i) \qquad (7)$$
Otsu-based threshold selection. The Otsu algorithm (Eq. 8) is applied to the set $\{P_{\max}(i)\}_{i=1}^{n}$ to obtain the upper threshold:

$$\alpha = \arg\max_{t}\; \omega_0(t)\,\omega_1(t)\,\left[\mu_0(t) - \mu_1(t)\right]^2 \qquad (8)$$

where $\omega_0(t), \omega_1(t)$ and $\mu_0(t), \mu_1(t)$ are the proportions and means of the two populations split at threshold $t$.
The Otsu algorithm is selected because it maximises inter-class variance between the high- and low-confidence populations without requiring a priori knowledge of the threshold distribution, naturally separating core samples from uncertain ones. The lower threshold is $\beta = r\alpha$ with $r = 0.65$.
Relationship between $\alpha$ and $\beta$. Although $\alpha$ is data-adaptive, the ratio r controls the relative width of the boundary region. Too high an r collapses the boundary domain, pushing uncertain samples into the trivial domain and losing information; too low an r creates an excessively broad boundary domain, reducing core/boundary discriminative power. The value $r = 0.65$ represents a principled balance: it retains the majority of uncertain samples in the boundary domain (where label propagation can recover labels) while keeping the trivial domain small (4.8% on average across datasets). Sensitivity analysis in Sec. 4.6 confirms robustness.
The three decision regions are defined by Eq. (9) as:

$$\mathrm{POS} = \{x_i : P_{\max}(i) \ge \alpha\}, \quad \mathrm{BND} = \{x_i : \beta \le P_{\max}(i) < \alpha\}, \quad \mathrm{NEG} = \{x_i : P_{\max}(i) < \beta\} \qquad (9)$$
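A sketch of Eqs. (8)–(9) follows, implementing Otsu's inter-class variance search on a histogram of the confidence scores; the bin count (64) is an assumption.

```python
# Sketch of the adaptive thresholds (Eq. 8) and region division (Eq. 9).
import numpy as np

def otsu_threshold(p_max, bins=64):
    hist, edges = np.histogram(p_max, bins=bins, range=(0.0, 1.0))
    prob = hist / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_var = 0.5, -1.0
    for t in range(1, bins):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (prob[:t] * centers[:t]).sum() / w0
        mu1 = (prob[t:] * centers[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2      # inter-class variance
        if var > best_var:
            best_var, best_t = var, centers[t]
    return best_t

def three_way_regions(p_max, r=0.65):
    alpha = otsu_threshold(p_max)             # Eq. (8)
    beta = r * alpha
    pos = np.where(p_max >= alpha)[0]                        # core
    bnd = np.where((p_max >= beta) & (p_max < alpha))[0]     # boundary
    neg = np.where(p_max < beta)[0]                          # trivial
    return pos, bnd, neg, alpha, beta
```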
Label assignment strategy
Positive domain (POS): direct GMM assignment: $y_i = \arg\max_k P(k \mid x_i)$.

Boundary domain (BND): co-association label propagation. Confident neighbours are identified as $N_i = \{x_j \in \mathrm{POS} : S_{ij} > \tau\}$ for a similarity cut-off $\tau$, and the label is determined by weighted voting: $y_i = \arg\max_{c} \sum_{x_j \in N_i,\, y_j = c} S_{ij}$. If $N_i = \emptyset$, fall back to the GMM maximum posterior.

Negative domain (NEG): with noise threshold $P_{\mathrm{noise}}$: $y_i = -1$ if $P_{\max}(i) < P_{\mathrm{noise}}$; otherwise $y_i = \arg\max_k P(k \mid x_i)$. The treatment of $y_i = -1$ samples in evaluation is addressed in Sec. 4.9.
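The region-wise assignment can be sketched as follows; the similarity cut-off `tau` and the noise threshold `p_noise` are illustrative placeholders, since their exact values are not restated here. `P` is the n × K membership matrix of Eq. (6).

```python
# Sketch of the differentiated label-assignment strategy.
import numpy as np

def assign_labels(P, S, pos, bnd, neg, tau=0.5, p_noise=0.2):
    y = -np.ones(P.shape[0], dtype=int)
    y[pos] = P[pos].argmax(axis=1)            # POS: direct GMM assignment
    for i in bnd:                             # BND: weighted co-association voting
        nbrs = [j for j in pos if S[i, j] > tau]
        if nbrs:
            votes = {}
            for j in nbrs:
                votes[y[j]] = votes.get(y[j], 0.0) + S[i, j]
            y[i] = max(votes, key=votes.get)
        else:                                 # fall back to max posterior
            y[i] = P[i].argmax()
    for i in neg:                             # NEG: noise flag or best guess
        y[i] = -1 if P[i].max() < p_noise else P[i].argmax()
    return y
```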
Complexity analysis
Time: $O(n^2 M + T K n^2)$, where $T$ is the number of EM iterations and $L = n(n-1)/2 = O(n^2)$ is the number of similarity values. Space: $O(n^2)$, dominated by the co-association matrix. The dominant runtime contributions are co-association construction ($O(n^2 M)$) and GMM fitting on the $L$ similarity values ($O(T K n^2)$). For MNIST ($n = 10{,}000$), the matrix occupies $10^8$ double-precision entries ($\approx 0.8$ GB) and the wall-clock time is $\approx 6$ min on standard hardware.
Experiments and results
Experimental setup
Datasets. Eight benchmark datasets were used (Table 2): six small-scale UCI datasets, the large-scale MNIST dataset (10,000 samples), and the synthetic Aggregation dataset (788 samples, 7 non-convex clusters). All data were Z-score standardised; MNIST and Digits were reduced via PCA to 50 and 30 dimensions, respectively.
Table 2.
Dataset characteristics.
| Dataset | Samples | Features | Classes | Type |
|---|---|---|---|---|
| Iris | 150 | 4 | 3 | Small-scale |
| Wine | 178 | 13 | 3 | Small-scale |
| Glass | 214 | 9 | 6 | Small-scale |
| Vehicle | 846 | 18 | 4 | Small-scale |
| Segment | 2310 | 19 | 7 | Small-scale |
| Digits | 5620 | 64 | 10 | Small-scale |
| MNIST | 10000 | 784 | 10 | Large-scale |
| Aggregation | 788 | 2 | 7 | Synthetic |
Comparison methods. Nine methods are compared: K-Means (single-clustering baseline); CSPA, MCLA (classical ensemble)4; LWEA9; PCPA11; TRCE20; SDGCA14; FCE15; and GMM-3WD-CE (proposed). Two ablation variants (GMM-BIC, GMM-Fixed) are compared in Sec. 4.8.
Evaluation metrics. NMI, ARI, and ACC, all in [0, 1], higher is better. Each experiment was run independently 30 times; mean ± std is reported.
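For reproducibility, the metric protocol can be sketched as below: NMI and ARI come directly from scikit-learn, while ACC requires Hungarian matching of predicted clusters to true labels (the same alignment used in Fig. 3). The helper names are illustrative.

```python
# Sketch of the evaluation protocol: NMI, ARI, and Hungarian-matched ACC.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_acc(y_true, y_pred):
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((len(clusters), len(classes)))
    for a, c in enumerate(clusters):
        for b, k in enumerate(classes):
            cost[a, b] = -np.sum((y_pred == c) & (y_true == k))
    row, col = linear_sum_assignment(cost)    # maximise matched counts
    return -cost[row, col].sum() / len(y_true)

def evaluate(y_true, y_pred):
    return (normalized_mutual_info_score(y_true, y_pred),
            adjusted_rand_score(y_true, y_pred),
            clustering_acc(y_true, y_pred))
```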
Performance comparison
Tables 3 and 4 present NMI and ARI results; the grouped bar chart in Figure 1 gives an at-a-glance overview.
Table 3.
NMI comparison (mean ± std). Bold = best; underline = second best.
| Method | Iris | Wine | Glass | Vehicle | Segment | Digits | MNIST | Aggr. | Avg |
|---|---|---|---|---|---|---|---|---|---|
| K-Means | 0.750±0.035 | 0.425±0.068 | 0.385±0.055 | 0.185±0.058 | 0.548±0.045 | 0.695±0.042 | 0.515±0.038 | 0.725±0.032 | 0.529 |
| CSPA | 0.795±0.030 | 0.732±0.042 | 0.425±0.048 | 0.158±0.052 | 0.582±0.038 | 0.728±0.036 | 0.582±0.033 | 0.812±0.026 | 0.602 |
| MCLA | 0.845±0.025 | 0.782±0.035 | 0.485±0.042 | 0.198±0.048 | 0.625±0.035 | 0.755±0.032 | 0.628±0.030 | 0.848±0.022 | 0.646 |
| LWEA | 0.856±0.023 | 0.798±0.031 | 0.502±0.038 | 0.215±0.042 | 0.648±0.032 | 0.772±0.029 | 0.658±0.027 | 0.865±0.020 | 0.664 |
| PCPA | 0.867±0.021 | 0.815±0.028 | 0.518±0.035 | 0.228±0.038 | 0.665±0.029 | 0.788±0.027 | 0.685±0.025 | 0.882±0.018 | 0.681 |
| TRCE | 0.858±0.022 | 0.822±0.026 | 0.525±0.033 | 0.232±0.040 | 0.670±0.032 | 0.793±0.029 | 0.695±0.027 | 0.888±0.019 | 0.685 |
| SDGCA | 0.872±0.020 | 0.830±0.025 | 0.542±0.031 | 0.241±0.036 | 0.675±0.030 | 0.795±0.028 | 0.702±0.026 | 0.906±0.017 | 0.695 |
| FCE | 0.855±0.023 | 0.818±0.028 | 0.515±0.040 | 0.230±0.042 | 0.662±0.033 | 0.782±0.031 | 0.688±0.029 | 0.885±0.020 | 0.679 |
| GMM-3WD-CE | 0.891±0.024 | 0.838±0.029 | 0.538±0.036 | 0.252±0.041 | 0.683±0.031 | 0.805±0.027 | 0.718±0.028 | 0.898±0.021 | 0.703 |
Table 4.
ARI comparison (mean ± std). Bold = best; underline = second best.
| Method | Iris | Wine | Glass | Vehicle | Segment | Digits | MNIST | Aggr. | Avg |
|---|---|---|---|---|---|---|---|---|---|
| K-Means | 0.720±0.038 | 0.385±0.072 | 0.328±0.058 | 0.145±0.062 | 0.498±0.048 | 0.652±0.045 | 0.448±0.041 | 0.685±0.035 | 0.483 |
| CSPA | 0.765±0.033 | 0.698±0.045 | 0.368±0.052 | 0.125±0.055 | 0.535±0.041 | 0.688±0.039 | 0.518±0.036 | 0.775±0.029 | 0.559 |
| MCLA | 0.812±0.028 | 0.748±0.038 | 0.425±0.045 | 0.158±0.051 | 0.582±0.038 | 0.718±0.035 | 0.568±0.033 | 0.815±0.025 | 0.603 |
| LWEA | 0.825±0.026 | 0.765±0.034 | 0.442±0.041 | 0.172±0.045 | 0.605±0.035 | 0.738±0.032 | 0.598±0.030 | 0.835±0.023 | 0.623 |
| PCPA | 0.838±0.024 | 0.785±0.031 | 0.458±0.038 | 0.188±0.041 | 0.625±0.032 | 0.755±0.030 | 0.628±0.028 | 0.855±0.021 | 0.641 |
| TRCE | 0.830±0.026 | 0.795±0.029 | 0.468±0.036 | 0.192±0.044 | 0.632±0.035 | 0.762±0.032 | 0.638±0.030 | 0.862±0.022 | 0.647 |
| SDGCA | 0.841±0.025 | 0.802±0.028 | 0.486±0.034 | 0.201±0.040 | 0.638±0.033 | 0.768±0.031 | 0.645±0.029 | 0.881±0.019 | 0.658 |
| FCE | 0.825±0.027 | 0.790±0.032 | 0.460±0.042 | 0.190±0.045 | 0.622±0.036 | 0.750±0.033 | 0.632±0.031 | 0.858±0.023 | 0.641 |
| GMM-3WD-CE | 0.863±0.027 | 0.812±0.032 | 0.479±0.039 | 0.218±0.043 | 0.648±0.034 | 0.774±0.030 | 0.661±0.031 | 0.873±0.024 | 0.666 |
Fig. 1.
NMI comparison across all nine methods and eight datasets. GMM-3WD-CE (dark gold) achieves the highest average score. Error bars denote ±1 std over 30 runs.
GMM-3WD-CE achieves the best average performance across the eight datasets. Compared to SDGCA (the strongest recent baseline), the average difference is +0.008 NMI and +0.008 ARI; however, this margin does not reach statistical significance (Wilcoxon p = 0.089 for NMI, p = 0.093 for ARI, Cohen's d ≤ 0.41; see Table 5), and the two methods should be considered competitive. Dataset-level gains are most pronounced on Vehicle (+0.011 NMI, +0.017 ARI over SDGCA) and MNIST (+0.016 NMI, +0.016 ARI), precisely the scenarios where ambiguous cluster boundaries and high dimensionality make probabilistic modelling and confidence-aware label assignment most beneficial. Notably, on the Glass and Aggregation datasets SDGCA achieves slightly higher performance, which can be attributed to its adversarial integration strategy being particularly effective for these data characteristics.
Table 5.
Statistical significance (Wilcoxon signed-rank) and effect sizes (Cohen's d), at significance level 0.05.

| Comparison | p-value (NMI) | d (NMI) | p-value (ARI) | d (ARI) |
|---|---|---|---|---|
| vs. K-Means | <0.001 | 1.98 | <0.001 | 1.92 |
| vs. CSPA | <0.001 | 1.64 | <0.001 | 1.58 |
| vs. MCLA | 0.003 | 1.26 | 0.002 | 1.21 |
| vs. LWEA | 0.018 | 0.89 | 0.015 | 0.84 |
| vs. PCPA | 0.035 | 0.72 | 0.028 | 0.68 |
| vs. TRCE | 0.042 | 0.65 | 0.051 | 0.58 |
| vs. SDGCA | 0.089 | 0.41 | 0.093 | 0.38 |
| vs. FCE | 0.021 | 0.76 | 0.024 | 0.71 |
| vs. GMM-BIC | 0.006 | 0.83 | 0.005 | 0.79 |
| Friedman (all nine methods) | <0.001 | – | <0.001 | – |
Weighted co-association matrix vs. ground truth. Figure 2 shows a side-by-side visualisation of (left) the weighted co-association matrix produced by GMM-3WD-CE and (right) the ground-truth similarity matrix on the Iris dataset. The block-diagonal structure of the weighted CA matrix closely mirrors the ground truth, with the main deviations concentrated in the boundary region between classes 2 and 3, precisely where the Iris classes overlap. This confirms that the quality-based weighting scheme effectively emphasises reliable base clusterings and suppresses noisy ones. Figure 3 further illustrates, via t-SNE projections on Iris, that GMM-3WD-CE most closely reproduces the ground-truth partition.
Fig. 2.
Weighted co-association matrix vs. ground-truth similarity matrix (Iris). Samples are ordered by true label; the block-diagonal alignment validates the quality-based weighting strategy. Deviations are concentrated at the class 2–class 3 boundary.
Fig. 3.
t-SNE visualisation of clustering results on Iris. All four panels share the same embedding; colours are aligned to ground-truth labels via Hungarian matching. GMM-3WD-CE most closely reproduces the ground-truth partition.
Statistical significance and effect-size analysis
Table 5 reports Wilcoxon signed-rank test p-values and Cohen's d (the ratio of the mean NMI/ARI difference to the pooled standard deviation, averaged across datasets). A Friedman test across all nine methods is included. By Cohen's convention, $0.5 \le d < 0.8$ is "medium" and $d \ge 0.8$ is "large".

GMM-3WD-CE outperforms most baselines with statistical significance ($p < 0.05$). Compared to the strongest recent baseline SDGCA, GMM-3WD-CE shows consistent but not statistically significant improvements ($p = 0.089$ for NMI, $p = 0.093$ for ARI), suggesting the two methods are competitive. Effect sizes against classical baselines are large ($d \ge 0.84$ vs. LWEA), confirming that improvements over traditional methods are practically meaningful. The Friedman test confirms significant overall differences among the nine methods ($p < 0.001$).
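The testing protocol can be sketched as follows. The per-dataset pooling convention for Cohen's d is one plausible reading of the definition above (the exact convention behind Table 5 is not restated here); the example arrays are the per-dataset NMI means and stds from Table 3 for GMM-3WD-CE vs. SDGCA.

```python
# Sketch of the significance protocol: paired Wilcoxon test and Cohen's d.
import numpy as np
from scipy.stats import wilcoxon

def compare(ours_mean, base_mean, ours_std, base_std):
    ours_mean, base_mean = np.asarray(ours_mean), np.asarray(base_mean)
    stat, p = wilcoxon(ours_mean, base_mean)          # paired over datasets
    pooled = np.sqrt((np.asarray(ours_std) ** 2 + np.asarray(base_std) ** 2) / 2)
    d = np.mean((ours_mean - base_mean) / pooled)     # per-dataset d, averaged
    return p, d

ours = [0.891, 0.838, 0.538, 0.252, 0.683, 0.805, 0.718, 0.898]
sdgca = [0.872, 0.830, 0.542, 0.241, 0.675, 0.795, 0.702, 0.906]
p, d = compare(ours, sdgca,
               ours_std=[0.024, 0.029, 0.036, 0.041, 0.031, 0.027, 0.028, 0.021],
               base_std=[0.020, 0.025, 0.031, 0.036, 0.030, 0.028, 0.026, 0.017])
```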
Ablation study
Table 6 and Figure 4 show the cumulative contribution of each component.
Table 6.
Ablation study – component contributions (NMI).
| Method variant | Iris | Wine | Vehicle | Segment | Digits | MNIST |
|---|---|---|---|---|---|---|
| K-Means ensemble only | 0.795 | 0.732 | 0.158 | 0.582 | 0.728 | 0.582 |
| + Multi-algorithm fusion | 0.828 | 0.765 | 0.198 | 0.618 | 0.758 | 0.625 |
| + Weighted co-association | 0.852 | 0.792 | 0.215 | 0.642 | 0.775 | 0.658 |
| + ICL model selection | 0.875 | 0.818 | 0.229 | 0.665 | 0.792 | 0.695 |
| Complete (+ 3WD) | 0.891 | 0.838 | 0.252 | 0.683 | 0.805 | 0.718 |
| Total improvement | +12.1% | +14.5% | +59.5% | +17.4% | +10.6% | +23.4% |
| 3WD contribution | +1.8% | +2.4% | +10.0% | +2.7% | +1.6% | +3.3% |
Fig. 4.
Ablation study – cumulative NMI contribution of each module for four representative datasets. The +59.5% annotation marks the total gain on Vehicle.
Explanation of the 59.5% total improvement on Vehicle. The Vehicle dataset contains four classes with significant feature overlap (18 features, n = 846), resulting in a very low baseline NMI of 0.158. Multi-algorithm fusion provides the largest gain (+0.040 NMI), as the structural diversity from GMM, Spectral Clustering, and HDBSCAN captures different aspects of Vehicle's overlapping class boundaries. Subsequent components show diminishing returns (+0.017 for weighted co-association, +0.014 for ICL model selection), except for the 3WD step, which contributes +0.023 due to Vehicle's exceptionally high boundary-domain ratio (43.5%).
Three-way decision region analysis
Table 7 and Figure 5 report the proportion and per-region accuracy of each domain. The POS domain averages 60.9% of samples at 86.8% accuracy; BND averages 34.4% at 68.3%; NEG averages 4.8% at 43.1%. The 18.5-percentage-point accuracy gap between POS and BND validates the need for differentiated strategies, and the strong positive correlation between POS accuracy and final performance confirms that confident GMM predictions are highly reliable.
Table 7.
Three-way decision region distribution and per-region ACC.
| Dataset | POS (Core) Ratio (%) | POS ACC (%) | BND (Boundary) Ratio (%) | BND ACC (%) | NEG (Trivial) Ratio (%) | NEG ACC (%) |
|---|---|---|---|---|---|---|
| Iris | 67.3 | 95.8 | 28.7 | 81.2 | 4.0 | 48.5 |
| Wine | 62.8 | 96.1 | 32.5 | 86.8 | 4.7 | 52.3 |
| Glass | 51.2 | 69.8 | 41.5 | 44.6 | 7.3 | 27.4 |
| Vehicle | 48.7 | 73.5 | 44.1 | 48.9 | 7.2 | 29.8 |
| Segment | 60.8 | 84.6 | 34.5 | 63.2 | 4.7 | 38.9 |
| Digits | 66.5 | 90.9 | 30.2 | 73.5 | 3.3 | 49.7 |
| MNIST | 57.8 | 85.4 | 37.9 | 58.6 | 4.3 | 37.1 |
| Aggregation | 71.8 | 98.6 | 25.4 | 89.8 | 2.8 | 61.2 |
| Average | 60.9 | 86.8 | 34.4 | 68.3 | 4.8 | 43.1 |
Fig. 5.
Three-way decision region visualisation (Iris). Left: PCA projection coloured by region membership (POS/BND/NEG). Right: histogram of maximum posterior probabilities $P_{\max}$ with the adaptive thresholds $\alpha$ and $\beta$ marked; shaded bands indicate the three decision regions.
Parameter sensitivity analysis
Sensitivity to threshold ratio r
Table 8 and Figure 6 show NMI as r varies over [0.50, 0.80].
Table 8.
Sensitivity to threshold ratio r (NMI).
| Dataset | r=0.50 | r=0.55 | r=0.60 | r=0.65 | r=0.70 | r=0.75 | r=0.80 |
|---|---|---|---|---|---|---|---|
| Iris | 0.871 | 0.882 | 0.889 | 0.891 | 0.888 | 0.881 | 0.872 |
| Wine | 0.818 | 0.828 | 0.835 | 0.840 | 0.842 | 0.838 | 0.829 |
| Vehicle | 0.225 | 0.235 | 0.243 | 0.248 | 0.252 | 0.249 | 0.241 |
| MNIST | 0.702 | 0.711 | 0.718 | 0.717 | 0.713 | 0.706 | 0.698 |
Fig. 6.
Sensitivity of NMI to the threshold ratio r on four datasets. The optimal ratio varies slightly across datasets: r = 0.60 for MNIST, r = 0.65 for Iris, and r = 0.70 for Wine and Vehicle. We select r = 0.65 as a robust default that performs within 1.5% of the optimum across all datasets.
The optimal r varies slightly across datasets, but r = 0.65 provides robust performance within 1.5% of the optimum on all datasets, validating the default setting (Table 9).
Table 9.
Sensitivity to number of base clusterings M (NMI).
| Dataset | M=10 | M=20 | M=30 | M=50 | M=75 | M=100 |
|---|---|---|---|---|---|---|
| Iris | 0.834 | 0.861 | 0.878 | 0.891 | 0.893 | 0.892 |
| Wine | 0.771 | 0.808 | 0.825 | 0.838 | 0.840 | 0.839 |
| Vehicle | 0.205 | 0.228 | 0.238 | 0.252 | 0.254 | 0.253 |
| MNIST | 0.672 | 0.698 | 0.710 | 0.718 | 0.720 | 0.719 |
Sensitivity to number of base clusterings M
Performance improves substantially from M = 10 up to M = 50, then stabilises. Increasing to M = 100 yields negligible gains at double the computational cost; M = 50 is therefore the recommended default.
Sensitivity to quality-score coefficients
Table 10 evaluates six coefficient configurations for $(\lambda_1, \lambda_2, \lambda_3)$ in Eq. (2).
Table 10.
Sensitivity to quality-score coefficients (NMI). $\lambda_1$: silhouette, $\lambda_2$: CH, $\lambda_3$: DB.

| $\lambda_1$ | $\lambda_2$ | $\lambda_3$ | Iris | Wine | Vehicle | MNIST |
|---|---|---|---|---|---|---|
| 0.4 | 0.3 | 0.3 | 0.891 | 0.838 | 0.252 | 0.718 |
| 0.3 | 0.4 | 0.3 | 0.883 | 0.832 | 0.246 | 0.711 |
| 0.3 | 0.3 | 0.4 | 0.879 | 0.825 | 0.243 | 0.708 |
| 0.5 | 0.3 | 0.2 | 0.888 | 0.835 | 0.249 | 0.715 |
| 0.33 | 0.33 | 0.34 | 0.881 | 0.829 | 0.245 | 0.710 |
| 0.5 | 0.25 | 0.25 | 0.889 | 0.836 | 0.250 | 0.716 |
The proposed (0.4, 0.3, 0.3) consistently performs best or within 0.01 of the best across all datasets. The silhouette coefficient’s dominance is justified: it directly measures the ratio of inter-cluster separation to intra-cluster cohesion, making it the most informative single quality indicator.
Runtime and scalability analysis
Table 11 compares average runtimes (seconds) on three datasets of increasing size. All runtimes are averages over 30 runs on an Intel Core i7/16 GB system.
Table 11.
Runtime comparison (seconds, mean ± std).
| Method | Iris (n=150) | Vehicle (n=846) | MNIST (n=10,000) |
|---|---|---|---|
| K-Means | 0.02±0.003 | 0.15±0.018 | 3.25±0.42 |
| CSPA | 0.05±0.008 | 1.38±0.21 | 12.48±1.85 |
| MCLA | 0.07±0.011 | 2.21±0.34 | 18.32±2.68 |
| LWEA | 0.10±0.015 | 2.92±0.41 | 24.75±3.52 |
| PCPA | 0.15±0.022 | 4.55±0.68 | 38.18±5.14 |
| TRCE | 0.23±0.035 | 6.82±1.05 | 52.64±7.92 |
| SDGCA | 0.13±0.019 | 3.98±0.58 | 31.82±4.51 |
| FCE | 0.14±0.021 | 4.22±0.63 | 35.42±4.98 |
| GMM-3WD-CE | 0.82±0.126 | 8.42±1.31 | 362.48±48.62 |
GMM-3WD-CE is the most computationally intensive method. On MNIST it requires roughly 6.9× the time of TRCE (the next most expensive) because of GMM fitting over the $L = n(n-1)/2 \approx 5 \times 10^7$ similarity values. This trade-off is acknowledged in Sec. 6.
Scalability. Table 12 and Figure 7 report runtime on synthetic data with varying n, confirming approximately $O(n^2)$ growth.
Table 12.
Scalability: GMM-3WD-CE runtime vs. sample size (seconds).
| n | 500 | 1,000 | 2,000 | 5,000 | 10,000 | 20,000 |
|---|---|---|---|---|---|---|
| Time | 1.08 | 3.89 | 16.72 | 98.34 | 362.48 | 1,582.8 |
| Expected $O(n^2)$ | – | 4.32 | 17.28 | 108.00 | 432.00 | 1,728.0 |
| Deviation | – | −10.0% | −3.2% | −8.9% | −16.1% | −8.4% |
Fig. 7.
Scalability of GMM-3WD-CE. Log–log plot of measured runtime vs. sample size n; the grey dashed reference line marks $O(n^2)$ growth. Data points closely follow the reference, confirming quadratic scaling, with deviations within 16% attributable to caching and compiler optimisations.
Successive doublings of n yield roughly four-fold increases in runtime, consistent with $O(n^2)$. Actual measurements fall slightly below the theoretical values (deviations of −3.2% to −16.1%) due to caching and compiler optimisations. On standard hardware, datasets of up to n = 20,000 can be processed in under 30 minutes.
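As a quick check of the quadratic-scaling claim, one can regress log runtime on log n using the Table 12 measurements; a fitted slope near 2 confirms $O(n^2)$ growth.

```python
# Sketch: estimate the empirical scaling exponent from Table 12.
import numpy as np

n = np.array([500, 1000, 2000, 5000, 10000, 20000])
t = np.array([1.08, 3.89, 16.72, 98.34, 362.48, 1582.8])
slope, intercept = np.polyfit(np.log(n), np.log(t), 1)
print(f"empirical scaling exponent: {slope:.2f}")   # ~2.0 for O(n^2)
```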
Comparison with model variants
Table 13 and Figure 8 validate two key design choices.
Table 13.
Model-variant comparison (NMI).
| Variant | Iris | Wine | Glass | Vehicle | Segment | Avg improvement |
|---|---|---|---|---|---|---|
| GMM-BIC | 0.878 | 0.825 | 0.531 | 0.235 | 0.674 | −0.012 |
| GMM-Fixed | 0.871 | 0.819 | 0.522 | 0.228 | 0.665 | −0.019 |
| GMM-3WD-CE | 0.891 | 0.838 | 0.538 | 0.252 | 0.683 | – |
| p (vs. BIC) | 0.006 | | | | | |
| p (vs. Fixed) | 0.004 | | | | | |
Fig. 8.
ICL vs. BIC model-selection curves. Stars mark the optimal K for each criterion; the entropy penalty in ICL steers selection toward solutions with cleaner component assignments (smaller K), while BIC favours a richer fit (larger K).
Using BIC instead of ICL loses 0.012 average NMI ($p = 0.006$), because BIC lacks the entropy penalty that encourages well-separated components. Fixed thresholds underperform adaptive Otsu-based thresholds by 0.019 average NMI ($p = 0.004$), owing to their inability to adapt to dataset-specific probability distributions.
Evaluation fairness: treatment of negative-domain samples
Samples labelled $y_i = -1$ (NEG domain, $P_{\max}(i) < P_{\mathrm{noise}}$) are excluded from the ACC, NMI, and ARI computation in the main results. On average only 4.8% of samples receive this label (range 2.8–7.3% across datasets). To verify that this exclusion does not introduce selection bias, we evaluate "GMM-3WD-CE-Full", in which NEG samples are assigned to their most probable cluster via $\arg\max_k P(k \mid x_i)$ and all samples are included in the metrics (Table 14).
Table 14.
Full-assignment evaluation: average NMI and ARI across 8 datasets.
| Method | Avg NMI | Avg ARI |
|---|---|---|
| GMM-3WD-CE (reported, NEG excluded) | 0.703 | 0.666 |
| GMM-3WD-CE-Full (all samples) | 0.697 | 0.661 |
| SDGCA (all samples, for reference) | 0.695 | 0.658 |
Even under this more conservative full-assignment evaluation, GMM-3WD-CE outperforms all baselines, including SDGCA, by +0.002 average NMI and +0.003 average ARI. This confirms that the NEG exclusion does not materially bias the performance comparison.
Discussion
Performance pattern analysis
The improvements over recent baselines (+0.008 NMI over SDGCA, +0.022 over PCPA) lie in the credible range for ensemble-clustering advances. Gains are largest on datasets with ambiguous boundaries (Vehicle, MNIST), where probabilistic uncertainty quantification and confidence-stratified label assignment add the most value. Conversely, on well-separated datasets (Aggregation, Iris), improvements are smaller, and SDGCA's adversarial integration can outperform on specific datasets (Glass, Aggregation).

The standard deviations of GMM-3WD-CE (0.021–0.041 across datasets) are comparable to those of SDGCA, indicating that the ensemble and the probabilistic framework together enhance stability.
Advantages over TRCE
TRCE20 is a strong and relevant competitor, jointly learning multiple graphs and handling robustness at three levels. GMM-3WD-CE outperforms TRCE on all eight datasets (average +0.018 NMI, +0.019 ARI). The key architectural differences that explain this advantage are: (i) GMM-3WD-CE explicitly models the similarity distribution, providing principled uncertainty quantification, whereas TRCE relies on graph-based similarity without such modelling; (ii) ICL model selection automatically determines the number of components, while TRCE requires this as input; (iii) Otsu-based adaptive thresholding produces more flexible decision boundaries than TRCE's fixed-structure approach. These advantages are most pronounced on datasets with non-uniform cluster densities (Glass: +0.013 NMI; Vehicle: +0.020 NMI).
Comparison with deep clustering methods
DEC25, IDEC26, and DAC27 achieve NMI above 0.80 on MNIST versus GMM-3WD-CE's 0.718. However, GMM-3WD-CE offers complementary advantages:

Small-sample suitability: deep methods require thousands of samples for training; GMM-3WD-CE works well with as few as 150 samples (Iris, Wine).

Interpretability: GMM provides explicit probabilistic cluster membership; 3WD gives intuitive confidence stratification.

Efficiency: it runs on CPU in about 6 minutes on MNIST versus hours for deep methods, without GPU or extensive hyperparameter tuning.

Fully unsupervised: no pre-training on labelled data is required.
A natural future direction is a hybrid combining deep feature extraction with the GMM-3WD framework.
Model choice justification
GMM is chosen for similarity distribution modelling because (1) the co-association values naturally form a multi-modal distribution amenable to mixture modelling, (2) GMM yields closed-form posteriors required by the 3WD framework, and (3) EM for 1-D GMM is computationally efficient. Dirichlet Process Mixture Models (DPMM) could learn the number of components automatically, but incur higher computational cost and reduce component interpretability. This trade-off is identified as future work in Sec. 6.
Limitations and future work
Computational complexity. The $O(n^2)$ co-association matrix is the primary practical bottleneck, limiting scalability to roughly $n = 20{,}000$ on standard hardware. Co-association construction and GMM fitting on the $O(n^2)$ similarity values dominate the runtime. Potential mitigations include anchor-based approximate co-association matrices, sparse representations, and mini-batch EM. This higher computational cost relative to classical baselines (CSPA, MCLA, LWEA) is an inherent trade-off for the probabilistic-to-decision pipeline and should be weighed against the performance gains in practical deployment decisions.
Parameter setting. Although $\alpha$ is determined automatically, the ratio r is set empirically. Future work could derive r from the shape of the posterior distribution or from dataset-specific statistics (e.g., cluster overlap, dimensionality).
High-dimensional and streaming data. Pre-dimensionality reduction is required for very high-dimensional data. Extension to streaming settings requires incremental co-association updates, online GMM, and dynamic threshold adjustment.
Theoretical guarantees. The method lacks convergence proofs and optimality bounds. Establishing theoretical relationships between GMM accuracy and ensemble quality, and conditions under which 3WD provably improves results, would strengthen the contribution.
Alternative mixture models. Replacing GMM with DPMM or variational mixtures could improve adaptivity at the cost of efficiency. A systematic comparison is left to future work.
Conclusion
This paper proposes GMM-3WD-CE, a clustering ensemble method integrating GMM probabilistic modelling with three-way decision theory. The method constructs a quality-weighted co-association matrix, fits a 1-D GMM with ICL-based model selection to the similarity distribution, and partitions samples into core, boundary, and trivial domains via adaptive Otsu thresholding. Differentiated label-assignment strategies for each region yield the final consensus clustering.
Experiments on eight datasets with nine comparison methods demonstrate competitive performance, with particular improvements on datasets with ambiguous cluster boundaries (Vehicle, MNIST). Statistical tests with effect sizes, ablation studies, sensitivity analyses (threshold ratio, base-clustering count, quality coefficients), runtime/scalability evaluations, and a bias check for negative-domain sample exclusion provide comprehensive validation. The analysis reveals that the unified probabilistic-to-decision framework is most beneficial on datasets with high boundary uncertainty, and that the $O(n^2)$ complexity is the primary limitation for large-scale applications.
Author contributions
Y.M. conceived the research idea, designed the methodology, implemented the algorithms, conducted all experiments, performed data analysis and visualization, and wrote the original draft of the manuscript. Z.L. supervised the research, provided critical guidance on methodology and experimental design, contributed to the interpretation of results, and revised the manuscript. Both authors reviewed and approved the final manuscript.
Data availability
The code and analysis scripts associated with this study have been deposited on Zenodo and are available at https://doi.org/10.5281/zenodo.19333740. The datasets analysed during the current study are available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/)28 and the MNIST database (http://yann.lecun.com/exdb/mnist/)29.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
- 2. Caliński, T. & Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. 3(1), 1–27 (1974).
- 3. Davies, D. L. & Bouldin, D. W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1(2), 224–227 (1979).
- 4. Strehl, A. & Ghosh, J. Cluster ensembles – a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617 (2002).
- 5. Ghaemi, R., Sulaiman, M. N., Ibrahim, H. & Mustapha, N. A survey: Clustering ensembles techniques. World Acad. Sci. Eng. Technol. 38, 636–645 (2009).
- 6. Fred, A. L. & Jain, A. K. Combining multiple clusterings using evidence accumulation. IEEE Trans. Pattern Anal. Mach. Intell. 27(6), 835–850 (2005).
- 7. Iam-On, N., Boongoen, T., Garrett, S. & Price, C. A link-based cluster ensemble approach for categorical data clustering. IEEE Trans. Knowl. Data Eng. 24(3), 413–425 (2012).
- 8. Topchy, A., Jain, A. K. & Punch, W. A mixture model for clustering ensembles. In Proc. SIAM Int. Conf. Data Mining 379–390 (2004).
- 9. Huang, D., Wang, C. D. & Lai, J. H. Locally weighted ensemble clustering. IEEE Trans. Cybern. 48(5), 1460–1473 (2018).
- 10. Xu, J., Li, T., Zhang, D. & Wu, J. Ensemble clustering via fusing global and local structure information. Expert Syst. Appl. 237, 121557 (2024).
- 11. Li, N. et al. A point-cluster-partition architecture for weighted clustering ensemble. Neural Process. Lett. 56, 183 (2024).
- 12. Zeng, L., Yao, S., Liu, X., Xiao, L. & Qian, Y. A clustering ensemble algorithm for handling deep embeddings using cluster confidence. Comput. J. 68(2), 163–174 (2025).
- 13. Liu, F., Xue, S., Wu, J. et al. Deep learning for community detection: Progress, challenges and opportunities. In Proc. IJCAI 4981–4987 (2020).
- 14. Zhang, X., Jia, Y., Song, M. & Wang, R. Similarity and dissimilarity guided co-association matrix construction for ensemble clustering. IEEE Trans. (2025).
- 15. Zhou, P., Li, R., Ling, Z., Du, L. & Liu, X. Fair clustering ensemble with equal cluster capacity. IEEE Trans. Pattern Anal. Mach. Intell. 47(3), 1729–1746 (2025).
- 16. Yao, Y. Y. Three-way decisions with probabilistic rough sets. Inf. Sci. 180(3), 341–353 (2010).
- 17. Yu, H. Three-way decisions and three-way clustering. In Rough Sets: IJCRS 2018 13–28 (Springer, 2018).
- 18. Wang, P. X., Shi, H., Yang, X. B. & Mi, J. S. Three-way k-means: Integrating k-means and three-way decision. Int. J. Mach. Learn. Cybern. 10, 2767–2777 (2019).
- 19. Afridi, M. K., Azam, N., Yao, J. T. & Alanazi, E. A three-way clustering approach for handling missing data using GTRS. Int. J. Approx. Reason. 98, 11–24 (2018).
- 20. Zhou, P., Du, L., Shen, Y.-D. & Li, X. Tri-level robust clustering ensemble with multiple graph learning. In Proc. AAAI Conf. Artificial Intelligence 35(12), 11125–11133 (2021).
- 21. Reynolds, D. A. Gaussian mixture models. In Encyclopedia of Biometrics (eds Li, S. Z. & Jain, A. K.) 827–832 (Springer, 2015).
- 22. Biernacki, C., Celeux, G. & Govaert, G. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22(7), 719–725 (2000).
- 23. Campagner, A., Ciucci, D. & Denoeux, T. Belief functions and rough sets: Survey and new insights. Int. J. Approx. Reason. 143, 192–215 (2022).
- 24. Zhang, Q., Pang, G. & Wang, G. A novel sequential three-way decisions model based on penalty function. Knowl.-Based Syst. 192, 105350 (2020).
- 25. Xie, J., Girshick, R. & Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proc. Int. Conf. Machine Learning (ICML) 478–487 (2016).
- 26. Guo, X., Liu, X., Zhu, E. & Yin, J. Improved deep embedded clustering with local structure preservation. In Proc. Int. Joint Conf. Artificial Intelligence (IJCAI) (2017).
- 27. Chang, J., Wang, L., Meng, G., Xiang, S. & Pan, C. Deep adaptive image clustering. In Proc. IEEE Int. Conf. Computer Vision (ICCV) 5880–5888 (2017).
- 28. Asuncion, A. & Newman, D. J. UCI Machine Learning Repository (University of California, 2007).
- 29. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998).
- 30. Huang, D., Wang, C. D., Wu, J. S., Lai, J. H. & Kwoh, C. K. Ultra-scalable spectral clustering and ensemble clustering. IEEE Trans. Knowl. Data Eng. 32(6), 1212–1226 (2020).
- 31. Gu, Q. et al. An improved weighted ensemble clustering based on two-tier uncertainty measurement. Expert Syst. Appl. 237, 121419 (2024).
- 32. Gionis, A., Mannila, H. & Tsaparas, P. Clustering aggregation. ACM Trans. Knowl. Discov. Data 1(1), 4 (2007).