Heliyon. 2025 Jan 15;11(2):e41953. doi: 10.1016/j.heliyon.2025.e41953

Cluster validity indices for automatic clustering: A comprehensive review

Abiodun M Ikotun a, Faustin Habyarimana a, Absalom E Ezugwu b
PMCID: PMC11787482  PMID: 39897868

Abstract

The cluster validity index is an integral part of clustering algorithms. It evaluates inter-cluster separation and intra-cluster cohesion of candidate clusters to determine the quality of potential solutions. Several cluster validity indices have been suggested for both classical clustering algorithms and automatic metaheuristic-based clustering algorithms. Different cluster validity indices exhibit different characteristics based on the mathematical models they employ in determining the values of the various cluster attributes. Metaheuristic-based automatic clustering algorithms use a cluster validity index as the fitness function in their optimization procedure to evaluate the quality of candidate cluster solutions. A systematic review of the cluster validity indices used as fitness functions in metaheuristic-based automatic clustering algorithms is presented in this study. Identifying, reporting, and analysing the various cluster validity indices is important in classifying the best CVIs for optimum performance of a metaheuristic-based automatic clustering algorithm. This review also includes an experimental study of the performance of some common cluster validity indices on synthetic datasets with varied characteristics, as well as real-life datasets, using the SOSK-means automatic clustering algorithm. The review aims to assist researchers in identifying and selecting the most suitable cluster validity indices (CVIs) for their specific application areas.

Keywords: Clustering, Cluster validity index, Automatic clustering, Metaheuristic algorithms, Optimization algorithms

1. Introduction

Clustering, an unsupervised machine learning technique, is applied to large unlabelled datasets to uncover hidden patterns inherent in the datasets [1]. Clustering algorithms use intra-cluster cohesion and the inter-cluster separation of data objects to partition unlabelled datasets into distinct groups. Cluster validity indices (CVIs) are used to evaluate the quality of the formed clusters.

Cluster validity indices examine the relationships among cluster attributes such as connectedness, cohesion, symmetry, and separation [2]. Different CVIs use different metrics to determine the values of these attributes and therefore exhibit different characteristics depending on the mathematical model employed. They are designed to differentiate between inferior and superior clusterings [3]. In the literature, many cluster validity indices have been reported for evaluating potential clustering solutions of both classical clustering algorithms and metaheuristic-based clustering algorithms.

In data clustering, there are three major criteria for evaluating the potential clustering solutions' quality: external, internal, and relative criteria [4,5]. The internal criteria evaluate the cluster's quality using the dataset's vector qualities such as the data objects' proximity matrix while the external criteria use a pre-specified structure based on the user's intuition which is imposed on the dataset. The basic idea adopted in the relative criteria is based on comparing the resultant cluster structure with other clustering structures obtained using different input parameters within the same algorithm.

According to Ref. [5], the cluster validation approach based on the internal validation criteria is the most used among the three approaches. In validating cluster results using the internal validation criteria, several methods focus on the level of compactness of the object within a cluster and the level of its separateness from other clusters while some other methods called the stability-based validation rely on the clustering algorithm's stability relative to the performance of the different input dataset samples [5].

The dimensionality and density of real-world datasets are known to be high; therefore, pre-identifying the number of clusters in a dataset is difficult. The automatic clustering approach to data clustering seeks to determine the appropriate number of clusters in a dataset with no prior knowledge of its structure, and also to discover the corresponding inherent partitioning structure of the dataset [3]. Automatic clustering problems are expressed as optimization problems, and optimization techniques are used to find their solutions. Cluster validity indices are usually adopted as fitness functions for evaluating the quality of the potential clustering solutions [3]. Based on some objective function defined over a given domain, optimization finds the best available values that best fit the objective [6].

Metaheuristic optimization is categorized as a higher-level optimization technique that employs simple but efficient methods to find solutions to optimization problems [7]. Algorithms based on the metaheuristic optimization approach have become the prevailing means of finding solutions to optimization problems [6]. The majority of modern optimization techniques involve metaheuristics, which serve as a powerful tool for providing solutions to hard optimization problems. Their application in major areas of science, engineering, and industry has been well reported in the literature [6].

Cluster validity indices are used in metaheuristic-based clustering algorithms as fitness functions. The aim is to identify the optimal solutions to the clustering problem based on the data objects' intra-cluster cohesion and inter-cluster separation. Different cluster validity indices exhibit varied characteristics that depend on criteria such as the proximity measure, the cluster prototype type, and the processes involved in measuring intra-cluster cohesion and inter-cluster separation [3]. Cluster validity indices such as the Xie-Beni, Silhouette, Davies-Bouldin, Dunn, and Calinski-Harabasz indices have been used in metaheuristic-based clustering algorithms. In most cases, however, the choice of CVIs for metaheuristic-based clustering is not supported by experimental evidence of their behavioural characteristics.

This systematic study is a focused study on the existing cluster validity indices that have been used in metaheuristic-based automatic clustering algorithms as fitness functions. It presents a systematic review of identified CVIs that have been used in metaheuristic-based automatic clustering algorithms reported in the literature. It discusses the strengths and weaknesses of each of the CVIs in their functionality as fitness functions in metaheuristic-based automatic clustering algorithms. The following research questions were addressed in this review.

  • 1.

    Which of the existing cluster validity indices have been adopted as fitness functions in metaheuristic-based clustering algorithms?

  • 2.

    Which of the cluster validity indices identified in RQ1 are the most frequently used?

  • 3.

    Are there basic criteria for selecting a cluster validity index for any given metaheuristic-based automatic clustering algorithm?

  • 4.

    What factors contribute to CVIs evaluation performance?

  • 5.

    Does the real-life application area of automatic clustering affect the choice of CVIs?

This paper is organized as follows: Section 1 presents the introduction to the study, while Section 2 reports the methodology employed for the systematic study. Section 3 presents the existing related reviews on CVIs in comparison with this current work. Automatic clustering and the various cluster validity indices used in metaheuristic-based automatic clustering algorithms are discussed in Section 4. Section 5 presents the findings from the systematic review as well as the identified application areas. Section 6 presents the experimental studies and a discussion of the findings. The conclusion of the study is presented in Section 7.

2. Research methodology

This study aims to conduct a systematic review of the various internal cluster validity indices that have been used as fitness functions in metaheuristic-based automatic clustering algorithms. In this section, the report on the review methodology adopted in the study is presented. For the systematic literature review, the procedure presented by Ref. [8] was adopted. The details of the selection processes concerning the database search, the search keywords, search techniques, and data sources as well as the inclusion and exclusion criteria for the identification of relevant research papers are presented to buttress the transparency of the selection process.

2.1. Search keywords

To retrieve the most relevant research papers that assist in providing answers to our research questions, keywords that are common to the research purpose were used in the search process. The list of keywords used includes cluster validity indices, automatic data clustering, metaheuristic optimization algorithm, cluster separability measure, cluster evaluation criteria, clustering performance analysis, and cluster validity concepts. The names of the various identified cluster validity indices were also used to find relevant literature that reports on their use in any metaheuristic algorithms for automatic clustering. These keywords were used to search the relevant academic databases for the articles included in the review.

2.2. Article search

The search for the relevant articles was carried out between February 2024 and May 2024. A total of 57,143 articles were identified during the initial automated search of the various databases. Of these, 57,081 articles were filtered out using the electronic databases' advanced search, combining the various keywords with the ‘OR’ and ‘AND’ operators to further streamline the retrieved articles, leaving 90 articles for the review. The citations and references of the retrieved articles were further scanned for more related articles, and 28 articles were added. The PRISMA [9] diagram reflecting the search and selection process is presented in Fig. 1.

Fig. 1. Literature search and selection process PRISMA diagram.

2.3. Academic databases

In searching for the relevant articles, the search focused on credible sources, including conference proceedings, peer-reviewed journals, and edited books indexed in various academic databases. The academic databases used for the extraction of the relevant articles include Springer, IEEE Xplore, Google Scholar, Elsevier, and ACM Digital Library. These repositories index high-quality, SCI-indexed journal publications and top international conferences.

2.4. Article inclusion/exclusion criteria

Each article was evaluated based on the title, abstract, full content, and conclusion to verify if it aligns directly with the review objectives and goals. The details of the inclusion and exclusion criteria presented in Table 1 were used to ensure that the most relevant articles were included in the selection.

Table 1.

Systematic review selection criteria.

Inclusion | Exclusion
The article focused on metaheuristic-based automatic clustering, to ensure that only articles aligned with the research objectives and goal were selected. | Articles on classical clustering and other clustering approaches were not considered.
Articles that used internal cluster validity indices for automatic clustering were included. | Articles that used other modes of validity indices, i.e. external and relative, were excluded.
Conference proceedings, peer-reviewed journals, and edited books published in reputable outlets were included to ensure the use of academic-level sources and the quality of relevant literature. | Non-peer-reviewed articles, reports, and other sources were excluded.
Articles published in English only were included, in keeping with the official language of research articles and to ensure a proper understanding of the article content. | Articles published in any language other than English were excluded.

3. Comparison with existing survey on cluster validity indices

The main differences between existing surveys on cluster validity indices and this systematic review are presented in this section. A substantial body of literature has been published on cluster validity indices, with many studies introducing new indices or improving existing ones. Comparative analyses of some of these cluster validity indices have been reported with a view to evaluating their performance for specific clustering algorithm categories (classical clustering algorithms or nature-inspired metaheuristics) and their performance based on the characteristics of the datasets. For instance, Ref. [10] published a survey of fuzzy clustering validity evaluation methods. The authors in Refs. [3,[11], [12], [13], [14], [15]] present comparative analyses of CVIs based on classical clustering algorithms. Publications reporting comparisons of cluster validity indices include [[16], [17], [18], [19], [20], [21]].

For automatic clustering algorithms, there are reviews and survey studies reported in the literature that discussed some of the clustering similarity measures used in metaheuristics-based automatic clustering [22,23]. The authors in Ref. [24] mentioned 17 validity indices that have been used as fitness functions in metaheuristic-based automatic clustering. The work of [25] mentioned 25 different internal cluster validation measures and eight external cluster validation measures. The performances of 68 cluster validity indices were reviewed by Ref. [26] on 21 real-life and simulated datasets. Their evaluation was based on multivariate chemometric methods for disclosing the mutual relationship among the indices and reporting their effectiveness in terms of accuracy and reliability. Their discussion was based purely on the general performance of the CVIs and not particularly about automatic clustering. They intended to present a survey of most of the CVIs used for crisp clustering comparing their performances from a multivariate chemometric perspective. Table 2 presents a summary of the existing survey and comparative analysis of cluster validity indices comparing them with this systematic study.

Table 2.

Summary of existing cluster validity indices surveys.

Reference Publication year Study focus Impact as of 2024
[27] 1985 The study used an agglomerative process of hierarchical clustering for comparative analysis of CVIs 5290
[28] 1987 The study focused on a comparative study of two internal indices in estimating the true number of clusters in multivariate data to show their effectiveness. 371
[29] 1997 Comparative studies of CVIs for the choice of the correct number of components in a mixture of normal distributions. 126
[14] 2002 Comparative study similar to Milligan & Cooper's work with a focus on choosing the correct number of components using cluster validity indices based on high dimensional empirical binary data 416
[12] 2007 Examined CVIs' correlation with the error rates 310
[11] 2011 Comparison of CVIs using a different methodology that avoids false assumptions based on the correctness of the clustering algorithms 93
[5] 2013 Extensive comparative study of the performance of 30 CVIs 1414
[30] 2021 Compared external and internal cluster validity indices with a similar bounded index range. 3
[15] 2012 Comparison of CVIs using Swarm intelligence-based clustering. 153
[24] 2021 Survey of CVIs for automatic data clustering using ACDE. 14
[31] 2021 Study popular CVIs to determine their suitability or unsuitability for judging the quality of different partitions of the same cardinality. 26
[26] 2024 Compared 68 cluster validity indices using the K-means clustering algorithm and multivariate chemometric methods 1
This work 2024 The study focused on internal validity indices used as fitness functions in metaheuristic-based automatic clustering algorithms using SOSK-means

4. Metaheuristic-based automatic clustering algorithm and clustering validity indices

The problem of automatic clustering heralded a new era in cluster analysis in the late 1990s because of the proliferation of big data which are mostly unlabelled. The automatic clustering algorithms find the optimal number of clusters in a dataset automatically while at the same time grouping the data objects into appropriate clusters [2]. Metaheuristic search algorithms were identified as the techniques mostly used for automatic clustering algorithm implementation [24]. In metaheuristic-based automatic clustering, the clustering problems are treated as optimization problems to minimize the intra-cluster distance and maximize the inter-cluster distances [32].

Several successful implementations of metaheuristic algorithms for automatic clustering problems have been widely reported in the literature [[33], [34], [35], [36], [37], [38], [39], [40], [41]]. A survey on the use of nature-inspired metaheuristic algorithms in finding solutions to automatic clustering problems was conducted by the authors in Ref. [33]. The authors in Ref. [42] classified metaheuristic-based clustering algorithms as search-based, hard partitional clustering algorithms, subdivided into evolutionary-based (e.g., genetic algorithms), swarm intelligence (e.g., particle swarm optimization), and others.

There are two major problems associated with solving automatic clustering problems: finding the optimal number of clusters and correctly identifying all data groups. The clustering task is known to be computationally expensive even for moderately sized problems [32,33,43]. The problem of finding an optimal clustering solution when K>3 is NP-hard. Partitioning N objects into K clusters requires the number of combinations given in Equation (1).

$C(N,K) = \frac{1}{K!}\sum_{i=0}^{K}(-1)^{K-i}\binom{K}{i}\, i^{N}$ (1)

and, to find the optimal number of clusters, the size of the search space is given in Equation (2).

$O(N) = \sum_{K=1}^{N} S(N,K)$ (2)
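
As an illustration of how quickly this search space grows, the following sketch (a direct transcription of Equations (1) and (2), assuming that $S(N,K)$ in Equation (2) denotes the same partition count as $C(N,K)$ in Equation (1)) computes the number of candidate partitions:

```python
# Number of ways to partition N objects into K non-empty clusters (Equation (1))
# and the total number of possible clusterings of N objects (Equation (2)).
from math import comb, factorial

def partitions_into_k(n: int, k: int) -> int:
    """Stirling-number-of-the-second-kind style count from Equation (1)."""
    return sum((-1) ** (k - i) * comb(k, i) * i ** n for i in range(k + 1)) // factorial(k)

def search_space_size(n: int) -> int:
    """Total search space over all possible numbers of clusters, Equation (2)."""
    return sum(partitions_into_k(n, k) for k in range(1, n + 1))

print(partitions_into_k(10, 3))  # 9330 candidate partitions of only 10 objects into 3 clusters
print(search_space_size(10))     # 115975 candidate clusterings in total
```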

Automatic clustering seeks to find an optimal number of clusters within a defined range $[K_{\min}, K_{\max}]$. The automatic clustering problem based on the metaheuristic optimization technique is formulated as an optimization problem given in Equation (3):

Given

$\Omega = \{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_{B(n)}\}$ (3)

as the set of all possible clusterings of a given dataset of N objects, where each element of the set represents a clustering solution and f is the single fitness function (the cluster validity index serves as the fitness criterion). For a single-objective clustering problem, $(\Omega, f)$ is required to find the clustering solution defined in Equation (4), where:

$f(\hat{\mathcal{C}}) = \min\{f(\mathcal{C}) \mid \mathcal{C} \in \Omega\}$ (4)

Such that $f(\mathcal{C})$ is minimized, without loss of generality. For a multi-objective clustering problem, $(\Omega, f_1, f_2, \ldots, f_m)$ is required to find the clustering solution that satisfies Equation (5).

$f(\hat{\mathcal{C}}) = \min\{f_t(\mathcal{C}) \mid \mathcal{C} \in \Omega\}, \quad t = 1, 2, \ldots, m$ (5)

where $f_t,\ t = 1, 2, \ldots, m$ represents the set of m (single) criterion functions. Multi-objective problems usually return multiple optimal solutions, and the principle of Pareto dominance is used to identify them. The Pareto principle states that, given $\mathcal{C}_1, \mathcal{C}_2 \in \Omega$, $\mathcal{C}_1$ is regarded as dominating $\mathcal{C}_2$ if and only if Equations (6) and (7) hold.

$f_t(\mathcal{C}_1) \le f_t(\mathcal{C}_2), \quad \forall\, t \in \{1, 2, \ldots, m\}$ (6)

and

$\exists\, t \in \{1, 2, \ldots, m\}: \; f_t(\mathcal{C}_1) < f_t(\mathcal{C}_2)$ (7)

All Pareto nondominated solutions form the Pareto-optimal set and the objective function values corresponding to this set are called the Pareto-optimal front.
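
A minimal sketch of this dominance test (assuming every criterion is to be minimized, as in Equations (6) and (7); the objective vectors below are purely illustrative):

```python
from typing import Sequence

def dominates(f_c1: Sequence[float], f_c2: Sequence[float]) -> bool:
    """True if objective vector f_c1 Pareto-dominates f_c2 (all criteria minimized)."""
    no_worse = all(a <= b for a, b in zip(f_c1, f_c2))
    strictly_better = any(a < b for a, b in zip(f_c1, f_c2))
    return no_worse and strictly_better

def pareto_front(vectors: list) -> list:
    """Keep only the non-dominated objective vectors (the Pareto-optimal front)."""
    return [v for v in vectors if not any(dominates(u, v) for u in vectors if u is not v)]

# e.g. (compactness, -separation) of three candidate clusterings, both minimized
candidates = [(0.8, 0.30), (0.6, 0.35), (0.9, 0.50)]
print(pareto_front(candidates))  # [(0.8, 0.3), (0.6, 0.35)]
```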

4.1. Cluster validity methods for automatic clustering algorithms

The quality of potential clustering solutions is evaluated using a cluster validity index, which also determines the optimal number of clusters in automatic clustering problems. Specifically, the quality of clusters is typically determined using internal cluster validity measures, external validation methods, and domain-specific evaluation techniques, each of which is explained in detail subsequently. It is noteworthy that while CVIs like the Davies-Bouldin index, Compact-Separated index, Silhouette index, or Dunn index help assess the validity of the number of clusters, they also provide insight into cluster quality by evaluating metrics such as compactness and separation, both of which have been extensively employed in the literature to determine the quality of a clustering task. While the compactness metrics measure how closely related or tight the points are within a cluster (e.g., based on intra-cluster distances), the separation metrics measure how distinct clusters are from each other (e.g., based on inter-cluster distances). Further explanations are provided in the next paragraphs.

Several cluster validity indices have been proposed in literature with new ones being introduced as better alternatives to existing ones. According to Ref. [10], CVI research has become a hot topic. The clustering validity indices are broadly grouped into three categories: the external validity methods, the internal validity methods, and the relative validity methods [10,23,25]. The external validity uses a pre-specified structure based on the user's intuition which is imposed on the dataset. The clustering results are compared with previously known structures obtained using similar parameters based on some external information such as the class labels.

The internal criteria evaluate clusters' quality using the dataset's vector qualities such as the proximity matrix of the data objects. The underlying structure of real-life datasets is usually not known and as such it is difficult to know the correct number of clusters that will be optimal for the dataset. The internal cluster validity methods are mostly used in metaheuristic-based automatic clustering algorithms to estimate the correct number of clusters in each dataset. They do not depend on any prior clustering structure of the dataset. They evaluate clustering results using some defined formulas which are based on various factors such as dataset density, skewed distribution, noise, sub-clusters, and monotonicity of index.

The internal cluster validity methods measure the intra-cluster compactness and the inter-cluster separation. The intra-cluster compactness determines the homogeneity of a single cluster, and the similarity level of data objects within the same cluster, while the inter-cluster separation measures the heterogeneity of the different clusters, measuring how different the data objects in different clusters are to each other. Cluster compactness is commonly measured using intra-cluster distance, within-group dispersion, or variance which are usually required to be minimized [26]. The inter-cluster separation measures how far apart clusters are, and the metrics used for this include the use of the nearest neighbour distance, the farthest neighbour distance, and the distance between the clusters’ centroids. According to Ref. [44], inter-cluster separation plays a more important role in cluster validation than intra-cluster cohesion.
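
For illustration only, one common way to instantiate these two quantities (centroid-based compactness and nearest-centroid separation, under a Euclidean-distance assumption; other definitions discussed in this review differ) is sketched below:

```python
import numpy as np

def compactness_and_separation(X: np.ndarray, labels: np.ndarray) -> tuple:
    """Mean point-to-own-centroid distance and minimum centroid-to-centroid distance."""
    clusters = list(np.unique(labels))
    centroids = np.array([X[labels == k].mean(axis=0) for k in clusters])

    # intra-cluster compactness: average distance of each point to its cluster centroid
    compactness = float(np.mean([np.linalg.norm(x - centroids[clusters.index(k)])
                                 for x, k in zip(X, labels)]))

    # inter-cluster separation: smallest pairwise distance between cluster centroids
    separation = min(float(np.linalg.norm(centroids[i] - centroids[j]))
                     for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
    return compactness, separation
```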

The internal validity indices that have been used in metaheuristic-based automatic clustering algorithms are discussed below. The summary of the identified cluster validity indices is presented in Table 3.

Table 3.

Summary of Cluster Validity Indices that have been applied to Metaheuristic-based automatic clustering algorithms.

SN Cluster Validity Indices Optimum index value rule Strength Weakness
1 Baker-Hubert Gamma index [45] Maximum difference The Baker-Hubert Gamma index is sensitive to the true underlying clustering structure. It effectively distinguishes between random and meaningful clustering by offering a robust measure of how well the clustering algorithm has captured the inherent pattern in the datasets. computationally prohibitive and impractical for most real applications of cluster analysis
2 Ball-Hall index [46] Maximum difference No absolute threshold is used in the measure of the similarity criterion of this technique; the technique is independent of the sequence in which patterns are presented [100]. Capable of finding the correct clustering structure for arbitrarily shaped clusters with high density [101,102]. The use of metrics weighted with respect to cluster as well as component can make clustering interpretation difficult when used for data analysis.
3 Banfield-Raftery index Maximum difference The index incorporates a penalty for the number of parameters in the model, helping the model to prevent overfitting. Also, by using the likelihood function, the index evaluates how well the model fits the data. Lastly, the model can be used with large datasets and complex models. Calculating the index can be extremely compute-intensive, especially for large datasets. Moreover, the effectiveness of the index relies on the correctness of the underlying models.
4 Bayesian Information Criterion Index [48] Minimization BIC supplies computationally inexpensive proxies to otherwise difficult-to-calculate posterior model probabilities [103]. This technique has a strong distribution assumption of parametric likelihood [104]
5 C-Criterion Index [48] Minimum C-Criterion primarily measures the model prediction accuracy with a statistical significance of optimal unbiased estimator of linear combinations of parameters [105] The calculation of the C criterion does not yield a specific value but instead ranks designs by comparing their C criterion vectors [105].
6 Calinski-Harabasz Index [52] First maximum It uses the arrangement of clusters to assess the quality of the clustering solution regardless of the choice of distance measure. The Calinski–Harabasz index is shown to be affected by the data size and level of data overlap. It is regarded as data dependent such that its behaviour may change if different data structures are used for the same datasets [27]. Only applicable to spherical clusters [106]
7 Category Utility Metric [57] There is a reduction of uncertainty due to the communication of category information through some cues [107]. There is the assumption that probability distributions on separate attributes are statistically independent of one another which is, however, not always true because the correlation between attributes often exists [108].
8 Compact-Separated index [62] Minimization technique Efficient in handling clusters with different dimensions, densities, or sizes; produces more good-quality solutions. Computationally intensive and expensive.
9 Condorcet's Criterion [65] Maximization Technique It uses a natural cluster structure without the need to use sampling methods of data that can lead to inaccurate results [109]. It involves handling large matrices of O(n²) complexity. There is a need to fix some initial parameters such as the number of iterations and the similarity threshold [109].
10 COP index [5] Minimum The COP is not affected by the number of clusters and is hardly affected by cluster overlap [5] Only applicable to spherical clusters [106]
11 Davies-Bouldin Index [50] Minimization technique Hardly affected by cluster overlap [5]; demonstrates a good clustering partition. Makes strong assumptions that are not valid in many real situations [110]; too simple to handle data with specific structures such as arbitrarily shaped clusters with dispersed density; only applicable to spherical clusters [106].
12 S_Dbw validity index [21] First Minimum Works well for compact and well-separated clusters; robust to noise [5]. Cannot work with non-convex clusters or clusters with extraordinary, curved geometries; high computational cost [93].
13 Det Ratio index [76] Minimum difference One of the best validity criteria for arbitrarily shaped closed contour clusters [102]. Capable of finding correct clustering structure for arbitrarily shaped clusters with high density [101] Det Ratio index can be highly sensitive to the size and shape of the clusters. More so, it does not explicitly account for the overlap between clusters.
14 Dunn index [72] Maximum Capable of finding the correct clustering structure for arbitrarily shaped clusters with high density [101,102]. Makes strong assumptions that are not valid in many real situations [110]; difficulty handling arbitrarily shaped clusters and clusters with dispersed density due to its general simplicity; computationally expensive and sensitive to noise; only applicable to spherical clusters [106].
15 Gamma index [45] Maximum Suitable for datasets with compactness properties and datasets with multiple densities [111]. Data-dependent, with varied behaviour per data structure [27]; computationally expensive; inefficient with overlapping clusters; difficulties with arbitrarily shaped clusters [97].
16 Generalized Dunn index [44] Maximum Good for validating hyper-spherical/cloud and shell-type clusters [44]. Computationally intensive and expensive [68,112]
17 G-plus index [75] Minimum Capable of finding the correct clustering structure for arbitrarily shaped clusters with high density [101] Computationally expensive. Inefficient with overlapping clusters. Difficulties with arbitrarily shaped clusters [97,112]
18 I-index Maximum I is found to be more consistent and reliable in indicating the correct number of clusters compared with DB, CH, and DI [113] Requires parameter tuning [114]
19 Ksq_DetW index [76] Maximum difference Capable of finding correct clustering structure for arbitrarily shaped clusters with high density [101] Does not allow for direct comparison between clustering algorithms [115]
20 Log_Det_Ratio index [76] Minimum difference Capable of finding the correct clustering structure for arbitrarily shaped clusters with high density [101,102] The Log_Det_Ratio index assumes that clusters are roughly spherical and of similar size. It also focuses more on the compactness of clusters, potentially neglecting other essential aspects of clustering quality such as separation between clusters.
21 Log_SS_Ratio index [78] Minimum difference Capable of finding the correct clustering structure for arbitrarily shaped clusters with high density [101,102] Outliers can significantly affect the within-cluster sum of squares, distorting the measure of cluster compactness.
22 McClain-Rao index [77] Maximum difference Perform relatively well in low dimensions [116]. Performance degrades as the dimension increases [116]. Worst performing CVI [11]
23 Negentropy Increment [80] First Minimum Calculation Simplicity. Satisfactory performance on clusters with heterogeneous orientation, densities, and scales. Assess the correct number of clusters with more reliability than DB, Dunn, and PBM [117] Poor performance with datasets with low number of data points [118].
24 Niva index [82] Minimum Takes advantage of cluster density, size, and shape [82]. The index can often place too much emphasis on certain metrics, such as within-cluster variance, potentially neglecting other important aspects of clustering quality, such as the overall structure or topology of the data.
25 OS-index [83] Minimum Efficient for clusters of different shapes, sizes, and density Poor performance with overlapping clusters
26 PBM index [17] Maximum It favours more compact and fewer clusters. Only capable of identifying compact clusters
27 Point-Biserial Index [85] Maximum Capable of finding correct clustering structure for arbitrarily shaped clusters with high density [101] Sensitivity to varying numbers of clusters or dimensions in datasets [68]
28 Ratkowsky-Lance index [86] Maximum Superior performance in validating clusters in binary datasets [14] Weakness in correct absolute cluster profile identification [14]
29 Ray-Turi index [87] Minimum Demonstrate Superior performance in cluster validation for dynamic connectivity data [119] Exhibit Sensitivity problem [120]
30 Root-mean square standard deviation [121] Minimum Valid for rectangular data [122] Only valid if the method used is average, centroid, or Ward [122]; can only validate well-separated hyper-sphere-shaped clusters [123]
31 Scatter Criterion [89] The Scatter Criterion is relatively simple to understand and compute. It can also be applied to a variety of clustering algorithms, making it versatile in its use. The criterion primarily focuses on within-cluster compactness and does not explicitly consider the separation between clusters.
32 Score function [90] Maximum Good for validating hyper-spheroidal clusters as well as multidimensional and noisy datasets. It can handle single cluster case and sub-cluster hierarchies [114] Restricted to datasets containing hyper-spheroidal clusters
33 Scott-Symons index [76] Minimum Suitable for clusters of different shapes, sizes, and orientations [26]. Where clusters are not well represented, it cannot be properly calculated [26]; not robust to noise [47]
34 SD validity index [4] Minimization Find Optimal Partition independent of the clustering algorithm [70] Sensitive to the geometry of the cluster centres and number of clusters [26]
35 Silhouette Index [94] Maximization Depends only on the actual partition of objects and not on the clustering algorithm. Useful for improving cluster analysis results. For comparison of clustering solution of different clustering algorithms. Suitable for datasets with compactness properties and datasets with multiple densities [111] It is related to specific distance measures and so cannot be used for comparing with clustering results that use different distance measures. Only applicable to spherical clusters [106]
36 Sum of Squared Error [96] Maximum rate of change It provides a clear numerical value that indicates the compactness of clusters. It can be used with various clustering algorithms, such as k-means, hierarchical clustering, and others. It is a versatile measure that can be applied across different methods. The index is highly sensitive to outliers, as they can significantly increase the total error. Calculating the index for very large datasets or high-dimensional data can be computationally expensive. This can limit its practicality for large-scale clustering tasks.
37 SV-Index [97] Maximization technique Independent of the number of objects in a cluster and of data density; less dependent on cluster centroids and average values; efficient handling of clusters of different sizes and densities [97] Calculating the SV-Index can be computationally intensive, as it often requires multiple runs of the clustering algorithm and comparisons between results. This can be time-consuming, especially for large datasets or complex algorithms.
38 Sym-index [18] Maximization technique Efficient at detecting symmetrically shaped clusters [18] Dependent on the underlying clustering algorithm [18]. Only applicable to internally symmetric datasets [106]
39 Tau index [92] Maximization technique Capable of finding the correct clustering structure for arbitrarily shaped clusters with high density [101] High computational cost [112]
40 Trace_W index [92] Maximization technique Capable of finding the correct clustering structure for arbitrarily shaped clusters with high density [101] The index itself may not always provide intuitive insights into the clustering quality, making it challenging to understand the underlying reasons behind its score.
41 Trace_WiB index [98] Maximization technique The Trace_WiB index is normalized, which helps in comparing clustering results across different datasets or clustering methods, providing a more standardized measure of cluster validity. The index might be influenced by the initial conditions, or the clustering algorithm used, leading to variability in results if different algorithms or initializations are applied.
42 Wemmert-Gancarski index Maximization technique Performance stability in all distance measures for synthetic and real datasets Performance sensitivity to noise
43 Xie-Beni index [99] Minimization technique Effective detection of hyper-spherical shaped clusters [18] Decreases monotonically when the number of clusters is very large [97]

Baker-Hubert Gamma index: The Baker-Hubert Gamma index [45] evaluates the correlation between two vectors X and Y of the same size. It adapts the Γ statistic, and its definition is given in Equation (8):

$C = \Gamma = \frac{S^{+} - S^{-}}{S^{+} + S^{-}}$ (8)

where $S^{-} = \sum_{(r,s)\in I_Y}\sum_{(u,v)\in I_X}\mathbf{1}\{d_{uv} > d_{rs}\}$ and $S^{+} = \sum_{(r,s)\in I_Y}\sum_{(u,v)\in I_X}\mathbf{1}\{d_{uv} < d_{rs}\}$

The Baker-Hubert Gamma index has a computational complexity of $O(n^2\log n)$. The pairwise distance calculations between the two vectors make it computationally intensive and unsuitable for large datasets.
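
A purely illustrative sketch of the concordant/discordant counting behind this index (Euclidean distance assumed; the pairwise comparisons make it advisable only for very small datasets):

```python
import numpy as np
from itertools import combinations

def baker_hubert_gamma(X: np.ndarray, labels: np.ndarray) -> float:
    """Gamma = (S+ - S-) / (S+ + S-), comparing within- and between-cluster distances."""
    within, between = [], []
    for i, j in combinations(range(len(X)), 2):
        d = float(np.linalg.norm(X[i] - X[j]))
        (within if labels[i] == labels[j] else between).append(d)

    s_plus = sum(1 for dw in within for db in between if dw < db)   # concordant comparisons
    s_minus = sum(1 for dw in within for db in between if dw > db)  # discordant comparisons
    return (s_plus - s_minus) / (s_plus + s_minus)
```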

Ball-Hall index: The Ball-Hall index [46] measures the mean of the mean dispersion of all the clusters. It is given as shown in Equation (9):

$\text{Ball-Hall}(C) = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{n_k}\sum_{i\in I_k}\left\lVert M_i^{\{k\}} - G^{\{k\}}\right\rVert^{2}$ (9)

It is the average counterpart of the Trace_W index [26]. The computational complexity of the Ball-Hall index is $O(n\cdot d)$, which accounts for the centroid calculation and the variance calculation. The linear complexity makes it relatively efficient for large datasets.
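
A minimal sketch of Equation (9), assuming a NumPy data matrix X and an integer label vector (not the authors' implementation):

```python
import numpy as np

def ball_hall(X: np.ndarray, labels: np.ndarray) -> float:
    """Mean, over clusters, of the mean squared distance of points to their centroid."""
    per_cluster = []
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        per_cluster.append(np.mean(np.sum((members - centroid) ** 2, axis=1)))
    return float(np.mean(per_cluster))
```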

Banfield-Raftery index: The Banfield-Raftery index [47] uses the variance-covariance of each cluster to measure the performance of the clustering result. In the Banfield-Raftery index, the weighted sum of the logarithms of the trace of each cluster's covariance matrix is measured, and it is defined in Equation (10):

$C = \sum_{k=1}^{K} n_k\log\!\left(\frac{\operatorname{Tr}\!\left(WG^{\{k\}}\right)}{n_k}\right)$ (10)

The Banfield-Raftery index is proposed as an alternative to the Trace_W index, using the square of the average distance from the centroids of the clusters instead of the sum-of-squares criterion used in the Trace_W index. It produces a better performance by finding hyper-spherical clusters of varied sizes. The cluster size is measured using the volume occupied and not the number of objects within the cluster. It has a computational complexity of $O(n\cdot d^2 + k\cdot d^3)$, which makes it computationally expensive for high-dimensional datasets and datasets with large numbers of clusters.

Bayesian information criterion (BIC) index: The Bayesian information criterion (BIC) index [48] is a minimization criterion that addresses the overfitting problem in the partitions produced by a clustering algorithm. The definition of BIC is given in Equation (11):

$\mathrm{BIC} = -\ln(L) + v\ln(n)$ (11)

where L represents the likelihood of data generation by the parameters in the model, n represents the number of entities, and v represents the number of free parameters in the Gaussian model. The computational complexity of the Bayesian information criterion is $O(n\cdot d + k\cdot d^2)$. It is efficient for datasets with a moderate number of dimensions and clusters.

C-criterion: The C-criterion [49] is an extension of Condorcet's validity index. It compares the maximum and minimum possible intra-cluster distances with the total intra-cluster distances for a given dataset. The definition is given as represented in Equation (12):

$\sum_{C_i\in C}\ \sum_{\substack{x_j, x_k\in C_i \\ x_j\neq x_k}}\big(s(x_j, x_k) - \gamma\big) \;+\; \sum_{C_i\in C}\ \sum_{x_j\in C_i;\ x_k\notin C_i}\big(\gamma - s(x_j, x_k)\big)$ (12)

The computational complexity is $O(n^2\log n)$ [50,51].

Calinski-Harabasz index: In the Calinski-Harabasz index [52], the cluster's closeness or compactness is measured based on the distance between the cluster's centroid and the data points within the cluster while the cluster's separation from other clusters is measured using the distance from the cluster's centroid to the global centroid [2]. The definition of the Calinski-Harabasz validity index is given as Equation (13):

$\mathrm{CH} = \frac{\operatorname{trace}(S_B)}{\operatorname{trace}(S_W)} \times \frac{n_p - 1}{n_p - k}$ (13)

where $S_W$ is the intra-cluster scatter matrix, $S_B$ is the inter-cluster scatter matrix, k is the number of clusters, and $n_p$ is the number of data objects in a cluster. It is known to be data-dependent, such that its behaviour may change if different data structures are used for the same datasets [27]. The CH index has a linear computational complexity $O(n\cdot d)$, which makes it very efficient for large and high-dimensional datasets [51,52]. The variants of the CH index include the LSSR index [53], the Ratkowsky-Lance (RL) index [54], the RS index [55], and the WCH index [56]. The LSSR is a logarithmic-scale variant that measures the logarithmic ratio of the sum of the inter-cluster squared distances to the sum of the intra-cluster squared distances. The RL variant considers the mean value of the ratios obtained for each dataset object. The RS variant finds the extent to which the differences between clusters differ from each other. The WCH variant accounts for large overlaps among clusters using a correction factor. The CH index uses the arrangement of clusters to assess the quality of the clustering solution regardless of the choice of distance measure.
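
The CH index is available off the shelf in common libraries; a short, hedged example with scikit-learn (the k-means labels here merely stand in for any candidate solution produced by a metaheuristic; higher values indicate better partitions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)   # synthetic dataset
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
print(calinski_harabasz_score(X, labels))                       # higher is better
```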

Category utility metric [57]: The measure of the goodness of a category is evaluated by the Category utility metric. Given a set of entities and the binary category $C = \{c, \bar{c}\}$, the metric is defined in Equation (14).

$CU(C,F) = \left[p(c)\sum_{i=1}^{n} p(f_i\mid c)\log p(f_i\mid c) + p(\bar{c})\sum_{i=1}^{n} p(f_i\mid\bar{c})\log p(f_i\mid\bar{c})\right] - \sum_{i=1}^{n} p(f_i)\log p(f_i)$ (14)

where the n-sized binary feature set is given in Equation (15):

$F = \{f_i\},\quad i = 1, 2, \ldots, n$ (15)

and $p(c)$ represents the prior probability of an entity belonging to the positive category c;

$p(f_i\mid c)$ represents the conditional probability of feature $f_i$ given the positive category c;

$p(f_i\mid\bar{c})$ represents the conditional probability of feature $f_i$ given the negative category $\bar{c}$;

$p(f_i)$ represents the prior probability of feature $f_i$ (Corter and Gluck, 1992; Ezugwu et al., 2020a).

The Category utility metric has a linear computational complexity given as $O(n\cdot d + k\cdot d\cdot m)$, with m representing the average number of possible values per attribute [51,58].

C-index: The definition of the C-index cluster validation method [59] is given in Equations (16), (17), (18), (19).

$CI(C) = \frac{S(C) - S_{\min}(C)}{S_{\max}(C) - S_{\min}(C)}$ (16)

where:

$S(C) = \sum_{C_k\in C}\ \sum_{x_i, x_j\in C_k} d_e(x_i, x_j)$ (17)

and

$S_{\min}(C) = \min^{(n_w)}_{x_i, x_j\in X}\{d_e(x_i, x_j)\}$ (18)

and

$S_{\max}(C) = \max^{(n_w)}_{x_i, x_j\in X}\{d_e(x_i, x_j)\}$ (19)

The overall computational complexity of the C-index is $O(n^2\log n)$ [51,60,61].

Compact-Separated (CS) index: The Compact-Separated (CS) index [62] gives the ratio of the sum of within-cluster scatter to between-cluster separation. Suppose the distance measure V is given as $V(X_i, X_j)$, with the within-cluster scatter associated with $X_i$ and the between-cluster separation with $X_j$; the CS index for a clustering Q is calculated as described in Equations (20), (21).

$CS(Q,V) = \frac{\frac{1}{P}\sum_{i=1}^{P}\left[\frac{1}{|D_n|}\sum_{X_i\in Q_i}\max_{X_j\in Q_i}\{V(X_i, X_j)\}\right]}{\frac{1}{P}\sum_{i=1}^{P}\left[\min_{j\in P,\, j\neq i}\{V(x_i, x_j)\}\right]}$ (20)
$= \frac{\sum_{i=1}^{P}\left[\frac{1}{|Q_i|}\sum_{X_i\in Q_i}\max_{X_j\in Q_i}\{V(X_i, X_j)\}\right]}{\sum_{i=1}^{P}\left[\min_{j\in P,\, j\neq i}\{V(x_i, x_j)\}\right]}$ (21)

where:

$V(X_i, X_j)$ represents the distance between the within-cluster scatter $X_i$ and the between-cluster separation $X_j$; P represents the number of clusters in Q, the number of data points in cluster P is given as $|D_n|$, and the distance of data points from their centroids is given as d. The computational complexity of the CS index is $O(n\cdot d + k^2\cdot d)$ [51,61,63]. According to Ref. [64], the CS index is reported to be more efficient in handling clusters of different densities, sizes, and dimensions. It produces good-quality solutions when compared with the DB index. In terms of execution time, however, it is more computationally intensive. The CS index has the same computational complexity as K-means when the number of clusters is far smaller than the total number of data objects in the dataset.
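
An illustrative sketch of this within-cluster-scatter over between-cluster-separation ratio (Euclidean distance and centroid-based separation are assumptions of this sketch, not the authors' exact implementation; lower values are better):

```python
import numpy as np

def cs_index(X: np.ndarray, labels: np.ndarray) -> float:
    clusters = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in clusters])

    # numerator: per cluster, the mean distance of each point to its farthest co-member
    scatter = []
    for k in clusters:
        members = X[labels == k]
        d = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=-1)
        scatter.append(d.max(axis=1).mean())

    # denominator: per cluster centroid, the distance to the nearest other centroid
    dc = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(dc, np.inf)
    return float(np.sum(scatter) / np.sum(dc.min(axis=1)))
```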

Condorcet's criterion: Condorcet's criterion [49,65] is defined as given in Equation (22).

$\sum_{C_i\in C}\ \sum_{\substack{x_j, x_k\in C_i \\ x_j\neq x_k}} s(x_j, x_k) \;+\; \sum_{C_i\in C}\ \sum_{x_j\in C_i;\ x_k\notin C_i} d(x_j, x_k)$ (22)

Condorcet's criterion has a computational complexity of $O(n\cdot m^2)$, where m is the number of candidates [66,67].

COP Index: The Clustering Outcome Prediction (COP) index measures compactness as the distance between the cluster points and the centroid, while the largest distance between neighbours gives the separation measure [5]. The definition is given in Equation (23).

$COP(C) = \frac{1}{N}\sum_{c_k\in C} |c_k|\,\frac{\frac{1}{|c_k|}\sum_{x_i\in c_k} d_e(x_i, \bar{c}_k)}{\min_{x_j\notin c_k}\max_{x_i\in c_k} d_e(x_i, x_j)}$ (23)

It has an overall computational complexity of $O(n\cdot d + k^2\cdot d)$. In datasets with $k \ll n$, the complexity is approximately linear with respect to the number of data points [51,68].

Davies-Bouldin Index (DB): The Davies-Bouldin Index [50] finds the mean similarity between each cluster and the cluster nearest (most similar) to it. DB is minimized for a better result. The DB index is defined in Equation (24).

$DB = \frac{1}{c}\sum_{i=1}^{c}\max_{i\neq j}\left\{\frac{d(x_i) + d(x_j)}{d(c_i, c_j)}\right\}$ (24)

where i and j represent cluster labels, $d(x_i)$ and $d(x_j)$ represent the dispersions of the respective clusters, c represents the number of clusters, and $d(c_i, c_j)$ represents the distance between cluster centroids. From the study reported by Ref. [69], the DB index is said to be more reliable when the variance on the dataset is equal to 0.16, indicating that it works better on compact clusters. The DB index has a computational complexity of $O(n\cdot d + k^2\cdot d)$, similar to the COP index, which makes the complexity roughly linear for most practical applications where $k \ll n$ [50,51]. Variants of the DBI include DB2, which measures the mean, over all clusters, of the largest ratio of the sum of two clusters' radii to the smallest distance between their centroids.
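
The DB index is likewise available directly in scikit-learn; a small, hedged example (k-means labels again stand in for any candidate partition; lower values are better):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
print(davies_bouldin_score(X, labels))  # lower is better
```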

S_Dbw validity index: The underlying characteristics of the clusters are used by the S_Dbw validity index [21] to validate the clustering algorithm result. The cluster's compactness is measured using the intra-cluster variance, while the clusters' separation is determined based on the inter-cluster density. The definition is given in Equations (25), (26).

$S\_Dbw(n_c) = Scat(n_c) + Dens\_bw(n_c)$ (25)

where

$Dens\_bw(n_c) = \frac{1}{n_c(n_c - 1)}\sum_{i=1}^{n_c}\left(\sum_{\substack{j=1 \\ j\neq i}}^{n_c}\frac{density(u_{ij})}{\max\{density(v_i), density(v_j)\}}\right)$ (26)

where $c_i, c_j$ are clusters with centroids $v_i, v_j$ respectively, and the middle point of the line segment joining them is represented by $u_{ij}$. The computational complexity of the S_Dbw index is given as $O(n\cdot d + k^2\cdot d)$, similar to the COP and DB indices. It exhibits a linear computational complexity with respect to the number of data objects in the dataset and a quadratic computational complexity with respect to the number of clusters [16,51,70].

Det Ratio index: The Det Ratio index (Scott and Symons, 1971) is given in Equation (27).

$\text{Det Ratio} = \frac{\det(T)}{\det(WG)}$ (27)

where WG denotes the within-group (cluster) scatter matrix and T represents the total scatter matrix. The Det Ratio index has a computational complexity of $O(n\cdot d^2 + d^3)$ [27,71].

Dunn index: The Dunn index [72] measures the ratio of the smallest between-cluster distance to the largest within-cluster distance in a partition. It is a maximization problem, and the time complexity is high with respect to the number of data points in the dataset; the computational complexity is given as $O(n^2)$ [16,51,73]. It is also affected by noise. The Dunn index is given in Equation (28).

$\mathrm{Dunn} = \min_{1\le i\le c}\left\{\min_{\substack{1\le j\le c \\ j\neq i}}\left\{\frac{d(c_i, c_j)}{\max_{1\le k\le c} d(X_k)}\right\}\right\}$ (28)

where c represents the number of clusters in the dataset; $d(c_i, c_j)$ represents the distance between clusters $X_i$ and $X_j$, while $d(X_k)$ measures the maximum distance between members of cluster $X_k$. The Dunn index is overly sensitive to noisy clusters [44].
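
A minimal sketch of Equation (28), assuming single-linkage between-cluster distances, cluster diameters as the within-cluster term, and Euclidean distance (all assumptions of this sketch):

```python
import numpy as np

def dunn_index(X: np.ndarray, labels: np.ndarray) -> float:
    clusters = [X[labels == k] for k in np.unique(labels)]

    def min_between(a: np.ndarray, b: np.ndarray) -> float:
        """Smallest distance between a point of cluster a and a point of cluster b."""
        return float(np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).min())

    def diameter(a: np.ndarray) -> float:
        """Largest pairwise distance within a cluster."""
        return float(np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1).max())

    min_separation = min(min_between(clusters[i], clusters[j])
                         for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
    return min_separation / max(diameter(c) for c in clusters)
```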

Gamma Index: The Gamma index [45] is given as shown in Equation (29).

$G(C) = \frac{\sum_{c_k\in C}\sum_{x_i, x_j\in c_k} dl(x_i, x_j)}{n_w\left(\binom{N}{2} - n_w\right)}$ (29)

where $dl(x_i, x_j)$ represents the number of all pairs of objects in X. The Gamma index complexity is given as $O(n^2\cdot d + n^2\log n)$. This complexity is high, making it unsuitable for large datasets [4,45].

Generalized Dunn Index (GDI): This measures the inter-cluster and the intra-cluster distances in dataset partition [44]. The definition is given as shown in Equation (30).

$C = \frac{\min_{k\neq k'}\delta(C_k, C_{k'})}{\max_{k}\Delta(C_k)}$ (30)

where δ and Δ are the measures of inter-cluster distance and intra-cluster distance respectively, with $1\le k\le K$ and $1\le k'\le K$. The Generalized Dunn Index has the same complexity as the Dunn index, that is, $O(n^2)$. The quadratic complexity of the GDI makes it computationally intensive for large datasets [44,51,73].

Ksq_DetW index: This is also written as $K^2|W|$ [74]. The $K^2|W|$ index analyses the determinant of the within-cluster scatter matrix W to evaluate the clusters' compactness. The definition is given in Equation (31).

$C = K^{2}\det(WG)$ (31)

where WG denotes the within-cluster scatter matrix. The computational complexity of the Ksq_DetW index is $O(n\cdot d^2 + d^3)$, the same as that of the Det Ratio index. For high-dimensional data, the $d^3$ term dominates the complexity, making it computationally inefficient [27,71].

G-plus index: The G-plus index examines the rank-order relationship of inter- and intra-cluster distances to evaluate the quality of a clustering. It uses the concept of concordant and discordant pairs: if the intra-cluster distance of a pair is smaller than the inter-cluster distance, the pair is said to be concordant. The definition of the G-plus index [75] is given in Equation (32).

$G^{+} = \frac{2S^{-}}{N_T(N_T - 1)}$ (32)

The computational complexity is given as $O(n^2\cdot d + n^2\log n)$. The quadratic complexity due to the computation and ranking of the pairwise distances makes the G-plus index computationally intensive for large datasets.

Log_Det_Ratio index: Log_Det_Ratio index [76] is the Det_Ratio logarithmic version. Log_Det_Ratio index determines the quality of clusters using the log determinants of the ratio of the between-cluster scatter matrix and the within-cluster scatter matrix. It is defined in Equation (33).

$C = N\log\!\left(\frac{\det(T)}{\det(WG)}\right)$ (33)

The computational complexity is given as $O(n\cdot d^2 + d^3)$ [4,71].

McClain-Rao index: The McClain-Rao index [77] finds the average of the ratio of within-cluster and between-cluster distances. It has a quadratic computational complexity with respect to the number of data objects, given as $O(n^2\cdot d)$. The minimum value gives the best partition. The definition of the McClain-Rao index is given in Equation (34).

$C = \frac{N_B\, S_W}{N_W\, S_B}$ (34)

Log_SS_Ratio index: The Log_SS_Ratio index [78] measures the ratio of the traces of the matrices BG and WG. It compares the within-cluster sum of squares to the between-cluster sum of squares to evaluate how compact and well-separated the clusters are, by taking the logarithm of the ratio between these two measures. It is defined in Equation (35).

$C = \log\!\left(\frac{BGSS}{WGSS}\right)$ (35)

It has a computational complexity of O(n.d) [27,79].

Negentropy Increment: Negentropy Increment [80] measures the normality of clusters instead of the intra-cluster distances and inter-cluster distances. It evaluates the quality of clusters by calculating the negentropy (the distance of a distribution from Gaussian) of the dataset before and after clustering. It is defined as shown in Equation (36).

$NI(C) = \frac{1}{2}\sum_{c_k\in C} p(c_k)\log\lvert\Sigma_{c_k}\rvert - \frac{1}{2}\log\lvert\Sigma_X\rvert - \sum_{c_k\in C} p(c_k)\log p(c_k)$ (36)

The computational complexity is given as O(n.d), [71,81].

NIVA index: The NIVA (Normalized-Intra-cluster and Variance distance) index measures the balance between average intra-cluster distance and the variation in inter-cluster distances to assess the quality of clusters. The definition of the NIVA index [82] is given in Equation (37).

$NIVA(C) = \frac{Compac(C)}{SepxG(C)}$ (37)

$SepxG(C)$ and $Compac(C)$ represent the average separability and average compactness of the cluster C respectively. The NIVA index has a computational complexity of $O(n^2\cdot d)$ [16,71].

OS-index: The Optimal Stability Index (OS-index) [83] evaluates clustering quality by assessing the stability of the clusters based on the compactness within clusters and the separation between clusters. It is given in Equation (38).

$OS(C) = \frac{\sum_{c_k\in C}\sum_{x_i\in c_k} OV(x_i, c_k)}{\sum_{c_k\in C}\frac{10}{|c_k|}\sum_{x_i\in c_k}\max^{(0.1|c_k|)}\{d_e(x_i, c_k)\}}$ (38)

It has a quadratic computational complexity of $O(n^2)$, making it computationally expensive for large datasets [84].

The Pakhira–Bandyopadhyay–Maulik (PBM) index: The PBM index [17] is also called the I index. It finds the distance between the points and their barycentre as well as the distances between the barycentre. The acronym PBM is derived from the initials of the author's names. The index is defined as illustrated in Equations (39), (40), (41), (42).

$C = \left(\frac{1}{k}\times\frac{E_T}{E_W}\times D_B\right)^{2}$ (39)

where

$D_B = \max_{k<k'} d\!\left(G^{\{k\}}, G^{\{k'\}}\right)$ (40)

and

$E_W = \sum_{k=1}^{K}\sum_{i\in I_k} d\!\left(M_i, G^{\{k\}}\right)$ (41)

and

$E_T = \sum_{i=1}^{N} d(M_i, G)$ (42)

Three basic factors are considered in the PBM index: the comparison between the total within-cluster dispersion and the total scatter of the dataset treated as a single cluster; the maximum distance between cluster centroids; and the inverse of the number of clusters. The computational complexity of PBM is given as $O(n\cdot k\cdot d + k^2)$.
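
A hedged sketch of the quantities in Equations (39)-(42), assuming Euclidean distance, a NumPy data matrix X, and integer labels (higher PBM values indicate better partitions):

```python
import numpy as np

def pbm_index(X: np.ndarray, labels: np.ndarray) -> float:
    clusters = np.unique(labels)
    k = len(clusters)
    global_centroid = X.mean(axis=0)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])

    e_t = float(np.sum(np.linalg.norm(X - global_centroid, axis=1)))      # Equation (42)
    e_w = sum(float(np.sum(np.linalg.norm(X[labels == c] - centroids[i], axis=1)))
              for i, c in enumerate(clusters))                            # Equation (41)
    d_b = max(float(np.linalg.norm(centroids[i] - centroids[j]))
              for i in range(k) for j in range(i + 1, k))                 # Equation (40)

    return ((1.0 / k) * (e_t / e_w) * d_b) ** 2                           # Equation (39)
```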

Point-Biserial index: The Point-Biserial index [85] is a correlation-based clustering validity measure based on the pairwise distances between data points within and between clusters. It has a computational complexity of $O(n^2)$, which makes it computationally inefficient for large datasets. The definition of the Point-Biserial index is given in Equations (43), (44).

$C = s_n\times r_{pb}(A,B) = \left(\frac{S_W}{N_W} - \frac{S_B}{N_B}\right)\frac{\sqrt{N_W N_B}}{N_T}$ (43)

where:

$r_{pb}(A,B) = \frac{M_{A1} - M_{A0}}{s_n}\sqrt{\frac{n_{A0}\, n_{A1}}{n^{2}}}$ (44)

$M_{A0}$ represents the mean inter-cluster distance, $M_{A1}$ represents the mean intra-cluster distance, the standard deviation of A is given as $s_n$, while $n_{A0}$ and $n_{A1}$ represent each group's number of elements. The distances between pairs of cluster points are represented by the set A. If a pair of points are in different clusters, the value of B is 0; otherwise, the value is 1.

Ratkowsky-Lance index: The Ratkowsky-Lance index [86] is a centroid-based cluster validity index that calculates the sum-of-squares distances between data points and cluster centroids. It has an approximately $O(n\cdot k)$ computational complexity, which makes it computationally feasible for small and medium-sized datasets; however, it becomes computationally expensive as the number of data points and the number of clusters increase. The definition of the Ratkowsky-Lance index is given in Equations (45), (46).

$C = \sqrt{\frac{\bar{R}}{K}} = \frac{\bar{c}}{\sqrt{k}}$ (45)

where:

$\bar{c}^{2} = \bar{R} = \frac{1}{p}\sum_{j=1}^{p}\frac{BGSS_j}{TSS_j}$ (46)

$BGSS_j$ represents the $j$th diagonal term of the matrix BG.

Ray-Turi index: The definition for the Ray-Turi index [87] is given in Equation (47).

$C = \frac{\frac{1}{N}\, WGSS}{\min_{k<k'}\Delta_{kk'}^{2}}$ (47)

The numerator represents the mean squared distance of all points from the barycentre of their respective clusters while the denominator represents the clusters’ minimum squared distance from each other. It also has an approximate computational complexity of O(n.k) which scales up with larger datasets making it unsuitable for big data clustering applications.

Root-mean-square standard deviation(RMSSTD): The Root-mean-square standard deviation [50] measures the square root of all the attributes’ variance used in the clustering. By this, the RMSSTD measures the homogeneity of clusters in datasets. It also has a computational complexity of O(n.k). The definition is given in Equation (48).

$\mathrm{RMSSTD} = \left[\frac{\sum_{i=1}^{n_c}\sum_{j=1}^{v}\sum_{k=1}^{n_{ij}}(x_k - \bar{x}_j)^{2}}{\sum_{i=1}^{n_c}\sum_{j=1}^{v}(n_{ij} - 1)}\right]^{1/2}$ (48)

R-squared index(RS): The definition of the R-squared index [88] is given in Equation (49).

$RS = \frac{SS_b}{SS_t} = \frac{SS_t - SS_w}{SS_t}$ (49)

It measures the degree of dissimilarity between clusters by calculating the total variance across all data points and within-cluster variance which typically yields a computational complexity of O(n.k). It is computationally expensive for large datasets, especially for high-dimensional ones.

Scatter Criteria: The Scatter Criteria index measures the quality of a clustering solution by evaluating the dispersion of data points within a cluster and the dispersion between clusters using scatter matrices. The total of the two scatter matrices captures the overall variance in the dataset. The computational complexity of the Scatter Criteria index is given as $O(n\cdot d^2)$. It is computationally expensive for large-scale or high-dimensional datasets. The definition of the Scatter Criteria is given in Equation (50).

$S_k = \sum_{x\in C_k}(x - \mu_k)(x - \mu_k)^{T}$ (50)

Score function: The Score function [90] estimates the cluster centroids' distances from the global centroid to evaluate the dispersion of the clusters from each other. It also evaluates the clusters' degree of closeness by measuring the distance between the data objects and their respective cluster centroids. It has a computational complexity of $O(n^2)$ and typically scales quadratically with the number of data points. The definition of the score function index is given in Equations (51), (52), (53).

$SF(C) = 1 - \frac{1}{e^{\,e^{\,bdc(C) + wcd(C)}}}$ (51)

where:

$bdc(C) = \frac{\sum_{c_k\in C} |c_k|\, d_e(\bar{c}_k, \bar{X})}{N\times K}$ (52)

and

$wcd(C) = \sum_{c_k\in C}\frac{1}{|c_k|}\sum_{x_i\in c_k} d_e(x_i, \bar{c}_k)$ (53)

Scott-Symons index: In the Scott-Symons index [76], the weighted sum of the determinant of the variance-covariance matrix of each cluster is evaluated. It has a computational complexity of $O(n^2)$, making it computationally inefficient for large-scale and high-dimensional datasets. The definition is given in Equation (54).

$C = \sum_{k=1}^{K} n_k\log\det\!\left(\frac{WG^{\{k\}}}{n_k}\right)$ (54)

where:

$WG^{\{k\}}$ represents the within-cluster scatter matrices, whose determinants are positive.

SD validity index: The SD validity index [4] evaluates the mean of intra-cluster and inter-cluster scattering. The SD validity index is defined in Equations (55), (56), (57).

$SD(n_c) = a\cdot Scat(n_c) + Dis(n_c)$ (55)

where:

$Scat(n_c) = \frac{1}{n_c}\sum_{i=1}^{n_c}\frac{\lVert\sigma(v_i)\rVert}{\lVert\sigma(X)\rVert}$ (56)

and

$Dis(n_c) = \frac{D_{\max}}{D_{\min}}\sum_{k=1}^{n_c}\left(\sum_{z=1}^{n_c}\lVert v_k - v_z\rVert\right)^{-1}$ (57)

The SD validity index is a summation-type index. It combines the cluster compactness and separation measures in an additive way. Scat(n_c) is the mean of the normalized variances within the clusters while Dis(n_c) represents the total separation between the clusters. The SD index has a computational complexity of O(n.k.d + k².d). It is computationally expensive for large datasets and high-dimensional data. S_Dbw [5] is a variant of the SD validity index that uses the density of objects between two clusters in place of the total separation term of the SD validity index and also removes the weighting factor a. Other variants of the SD validity index include Vsv1 and Vsv2 [[91], [92], [93]].
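The following sketch illustrates one possible reading of Equations (55), (56), (57), assuming that sigma denotes the per-dimension variance vector, that v_i are cluster centroids, and that the weighting factor a is supplied by the user; the function name is illustrative.

import numpy as np

def sd_index(X, labels, a=1.0):
    # a: weighting factor for the scattering term (user-supplied in this sketch)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in clusters])
    # Scat: mean ratio of within-cluster variance norms to the dataset variance norm
    sigma_x = np.linalg.norm(X.var(axis=0))
    scat = np.mean([np.linalg.norm(X[labels == k].var(axis=0)) / sigma_x
                    for k in clusters])
    # Dis: total separation between cluster centroids
    dists = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    off_diag = dists[~np.eye(len(clusters), dtype=bool)]
    dis = (off_diag.max() / off_diag.min()) * sum(
        1.0 / dists[i][dists[i] > 0].sum() for i in range(len(clusters)))
    return a * scat + dis

Smaller values are better, since the index combines compactness (Scat) and separation (Dis) additively as described above.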

Silhouette index: The Silhouette index [94] requires that information about the separation and compactness of at least two clusters be known. In evaluating cluster validity using the Silhouette index, the index assigns a silhouette width s(i), i = 1, ..., m, to the ith entity of a given cluster X_j (j = 1, ..., c). This is an estimate of the degree of probability that the ith sample belongs to the cluster X_j. The definition for the index is given in Equation (58).

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} (58)

where the mean distance between the ith entity and the other entities in the same cluster X_j is represented by a(i), while the minimum mean distance between the ith entity and all the entities clustered in X_k (k = 1, ..., c; k ≠ j) is represented by b(i). The width of the silhouette is obtained using the normalized difference between an object's distance to the nearest cluster in its neighbourhood and the mean distance to the other objects of its own cluster. A value of 1 indicates that an object is well positioned within its cluster, a value close to 0 indicates that the object lies at the borderline of two clusters, and a value close to −1 indicates it should be assigned to the neighbouring cluster. The Silhouette index does not depend on the clustering algorithm that generates the data partition but only on the actual partition of the objects in its evaluation of cluster quality [94]. It is useful in improving cluster analysis results and in comparing clustering solutions of different clustering algorithms on the same datasets. The Silhouette's main strength is in the interpretation and validation of cluster analysis results. It is tied to a specific distance measure and so cannot be compared with clustering results that use different distance measures. The Silhouette index has an approximate computational complexity of O(n²), which is considerably expensive for large datasets.
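A direct, if naive, implementation of Equation (58) is sketched below; it assumes Euclidean distances, that every cluster contains at least two objects, and illustrative function names (library routines such as scikit-learn's silhouette_score compute an equivalent average more efficiently).

import numpy as np

def silhouette_width(X, labels, i):
    labels = np.asarray(labels)
    # a(i): mean distance from object i to the other members of its own cluster
    same = (labels == labels[i])
    same[i] = False
    a_i = np.linalg.norm(X[same] - X[i], axis=1).mean()
    # b(i): smallest mean distance from object i to the members of any other cluster
    b_i = min(np.linalg.norm(X[labels == k] - X[i], axis=1).mean()
              for k in np.unique(labels) if k != labels[i])
    return (b_i - a_i) / max(a_i, b_i)

def silhouette_index(X, labels):
    # Average silhouette width over all objects; O(n^2) as noted above
    return float(np.mean([silhouette_width(X, labels, i) for i in range(len(X))]))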

Sum of squared error (SSE): SSE is known to be among the most popular cluster validity evaluation criteria. It evaluates a given cluster's quality by considering only the clusters' cohesion. The definition of Sum of squared error is given in Equation (59).

SSE = \sum_{k=1}^{K}\sum_{x_i \in C_k}\lVert x_i - \mu_k\rVert^{2} (59)

where C_k is the set of all entities in cluster k and μ_k is the mean vector of cluster k. The partition with the lowest SSE value is considered the best [95,96]. It has a computational complexity of O(n.k.d).
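Equation (59) translates directly into a few lines of NumPy; the function name and array layout below are illustrative.

import numpy as np

def sse(X, labels):
    labels = np.asarray(labels)
    # Sum of squared distances of each object to its own cluster mean (Equation (59))
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))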

SV-Index: In the Symmetry-based Validity Index (SV-Index) [97], cluster separation is evaluated as a measure of distance between the nearest neighbours while cluster compactness is measured using the boundary points to the clusters’ centroids. The definition is given in Equation (60).

SV(C) = \frac{\sum_{c_k \in C}\min_{c_l \in C\setminus c_k}\{d_e(c_k, c_l)\}}{\sum_{c_k \in C}\frac{10}{|c_k|}\max_{x_i \in c_k}^{(0.1|c_k|)}\{d_e(x_i, c_k)\}} (60)

The SV index aims at efficient validation of clusters whose sizes and densities differ widely. It is similar to Dunn's index GDI11. It measures the compactness using the mean distance of the ten percent of objects that are farthest from the centroids of the cluster and measures the cluster separation using the sum of the smallest pairwise distances between the centroids of the clusters. It is usually used for identifying clusters with symmetric distribution. The SV index is adaptable for different data distributions and types because it can be computed using different distance metric types. It has a computational complexity of O(n²).

Sym-index: The Symmetry index (Sym-index) [18] measures the symmetric distribution of clusters in a dataset. The Sym-index is based on the point-symmetry distance, replacing the Euclidean metric used by most classical cluster validity indices when measuring objects' proximity to the cluster's centroid. It is mostly used in datasets with symmetric or ellipsoidal-shaped clusters. The Sym-index is given in Equation (61).

Sym(C) = \frac{\max_{c_k, c_l \in C}\{d_e(c_k, c_l)\}}{K\sum_{c_k \in C}\sum_{x_i \in c_k} d_{ps}(x_i, c_k)} (61)

The computational complexity of the Sym-index is given as O(n.k).

Tau index: This is also called the Tau coefficient. It is used in assessing the agreement or similarity between two clustering solutions. It measures the extent to which data element pairs are grouped or separated. The definition for the Tau index [92] is given in Equation (62).

C = \frac{s^{+} - s^{-}}{\sqrt{N_B N_W \left(\frac{N_T(N_T - 1)}{2}\right)}} (62)

The numerator is not affected by the equality of the intra-cluster and inter-cluster distances because s^+ and s^- do not count ties. The Tau index has a quadratic computational complexity of O(n²).

Trace_W index: The Trace_W index gives the total dispersion of the cluster which is measured by the within-cluster sum of squares. The definition for Trace_W index [92] is given in Equation (63).

C = \mathrm{Tr}(WG) = WGSS (63)

where WG represents the within-group scatter matrix summed over all clusters while WGSS represents the within-cluster sum of squares. It is counted among the most commonly used cluster validity indices in clustering applications [92]. It performs well mostly when all the clusters have the same dispersion but performs poorly when clusters are hyper-spherical with different sizes. This is because the size of a cluster is measured based on the number of objects it contains and not on the volume of space it occupies. It has an overall computational complexity of O(n.k.d). The computational intensity increases linearly with the number of data points, the number of clusters, and the number of dimensions.

Trace_WiB index: This is also called Hotelling's Trace Criterion. It measures the quality of the clustering solution based on the within-cluster matrix which it seeks to minimize while maximizing the between-cluster distance. The definition for Trace_WiB index [98] is given in Equation (64).

C = \mathrm{Tr}(WG^{-1} \cdot BG) (64)

The computational complexity of the Trace_WiB index is given as O(n.k.d).
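Both indices can be computed from the within-group and between-group scatter matrices; the sketch below illustrates this, assuming a hard partition, that WG is non-singular for the Trace_WiB case, and illustrative function names.

import numpy as np

def scatter_matrices(X, labels):
    # Pooled within-group (WG) and between-group (BG) scatter matrices
    labels = np.asarray(labels)
    g = X.mean(axis=0)
    d = X.shape[1]
    WG = np.zeros((d, d))
    BG = np.zeros((d, d))
    for k in np.unique(labels):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        diff = Xk - mk
        WG += diff.T @ diff
        BG += len(Xk) * np.outer(mk - g, mk - g)
    return WG, BG

def trace_w(X, labels):
    WG, _ = scatter_matrices(X, labels)
    return np.trace(WG)  # equals WGSS, Equation (63)

def trace_wib(X, labels):
    WG, BG = scatter_matrices(X, labels)
    # Tr(WG^{-1} BG), Equation (64); assumes WG is invertible
    return np.trace(np.linalg.solve(WG, BG))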

Wemmert-Gancarski index: The Wemmert-Gancarski index evaluates the weighted average of all clusters’ quantities (Jk) . The definition of the Wemmert-Gancarski index is given in Equations (65), (66), (67).

C = \frac{1}{N}\sum_{k=1}^{K}\max\left\{0,\; n_k - \sum_{i \in I_k} R(M_i)\right\} (65)

where M is an element in the cluster Ck,

J_k = \max\left\{0,\; n_k - \sum_{i \in I_k} R(M_i)\right\} (66)

and

R(M) = \frac{\lVert M - G^{\{k\}}\rVert}{\min_{k' \neq k}\lVert M - G^{\{k'\}}\rVert} (67)

The Wemmert-Gancarski index measures the number of objects that are closer to the centroid of their own cluster than to the centroids of the other clusters. The authors in Ref. [26] proposed a variant of the Wemmert-Gancarski index using the idea of the Silhouette index, defining each object's cluster membership score based on a comparison of the object's distance from its own cluster's centroid with its minimum distance from the centroids of the other clusters. The Wemmert-Gancarski index has a computational complexity of O(n.k).
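A minimal sketch of Equations (65), (66), (67) under Euclidean distance is given below; variable and function names are illustrative.

import numpy as np

def wemmert_gancarski(X, labels):
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in clusters])
    total = 0.0
    for i, k in enumerate(clusters):
        Xk = X[labels == k]
        # Distance of each object to its own centroid
        d_own = np.linalg.norm(Xk - centroids[i], axis=1)
        # Minimum distance of each object to any other centroid
        others = np.delete(centroids, i, axis=0)
        d_other = np.linalg.norm(Xk[:, None, :] - others[None, :, :], axis=2).min(axis=1)
        r = d_own / d_other                      # R(M) of Equation (67)
        total += max(0.0, len(Xk) - r.sum())     # J_k of Equation (66)
    return total / len(X)                        # Equation (65)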

Xie-Beni index: The Xie-Beni index [99] finds the ratio of the mean quadratic error and the minimum of the squared distances between the points in the cluster. The definition is given in Equations (68), (69).

C = \frac{1}{N}\,\frac{WGSS}{\min_{k<k'}\delta_1(C_k, C_{k'})^{2}} (68)

where:

\delta_1(C_k, C_{k'}) = \min_{i \in I_k,\, j \in I_{k'}} d(M_i, M_j) (69)

In the Xie-Beni index, the cluster cohesion is measured using the global mean squared distance of objects from the centroid of their cluster, while the inter-cluster separation is measured using the minimum squared distance between pairs of clusters [26]. The Xie-Beni index is reported as demonstrating a monotonically decreasing tendency as the cluster number grows large and approaches the number of objects. Variants of Xie-Beni include Ray-Turi, the Kw index, the Tang index, and XB2. The XB2 variant uses the maximum cluster variance in place of the global mean of cluster compactness to avoid the general tendency of averaging to hide the effect of unnecessary cluster merging. It has a computational complexity of O(n.k).
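Following Equations (68), (69) literally, the separation term is the minimum squared distance between objects belonging to different clusters; the sketch below assumes this crisp formulation with Euclidean distance (the widely used fuzzy Xie-Beni variant instead measures separation between cluster centres), and all names are illustrative.

import numpy as np

def xie_beni(X, labels):
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in clusters])
    # Cohesion: WGSS / N, the mean squared distance of objects to their own centroid
    wgss = sum(((X[labels == k] - centroids[i]) ** 2).sum()
               for i, k in enumerate(clusters))
    # Separation: minimum squared distance between objects of different clusters
    min_sep = np.inf
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            Xa, Xb = X[labels == clusters[a]], X[labels == clusters[b]]
            d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(axis=2).min()
            min_sep = min(min_sep, d2)
    return (wgss / len(X)) / min_sep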

Algorithm listing 1 presents high-level pseudo-code of a generic cluster validity index computation as it would be incorporated into any metaheuristic method, with or without further modification.

Algorithm 1: Pseudocode for Generic Cluster Validity Indices
Input:
 X = {x1, x2, x3, ..., xn} // Dataset to be clustered
 k // Number of required clusters
 CC = {cc1, cc2, cc3, ..., cck} // Cluster centroids
Output:
 Cluster Validity Index Value
// Initialize parameters
X = {x1, x2, x3, ..., xn}
CC = {cc1, cc2, cc3, ..., cck}
MinInterClust = minItc
MaxInterClust = maxItc
// Compute intra-cluster distances
for i = 1 to k do
 Compute the intra-cluster distance for cluster i
 Update the intra-cluster distance
end for
// Compute inter-cluster distances
for i = 1 to k do
 for j = i + 1 to k do
  Compute the inter-cluster distance between clusters i and j
  Update the inter-cluster distance
 end for
end for
// Compute the cluster validity index value
Use the CVI function to compute the cluster validity index value
Output the Cluster Validity Index Value
End
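For readers who prefer an executable form, the following Python sketch mirrors the listing above: it gathers intra-cluster and inter-cluster distances and hands them to an arbitrary CVI formula supplied as a function. Cluster labels are assumed to be integers 0..k-1 aligned with the centroid rows; all names are illustrative.

import numpy as np

def generic_cvi(X, centroids, labels, cvi_function):
    labels = np.asarray(labels)
    k = len(centroids)
    # Intra-cluster distances: each object to its assigned centroid
    intra = np.array([np.linalg.norm(X[labels == j] - centroids[j], axis=1).sum()
                      for j in range(k)])
    # Inter-cluster distances: pairwise distances between centroids
    inter = np.array([np.linalg.norm(centroids[i] - centroids[j])
                      for i in range(k) for j in range(i + 1, k)])
    # Combine the two components with the chosen CVI formula
    return cvi_function(intra, inter)

# Example usage with a simple, purely illustrative ratio-type formula (smaller is better):
# value = generic_cvi(X, centroids, labels, lambda intra, inter: intra.sum() / inter.min())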

5. Review findings

In this section, a review of the selected articles with reference to the cluster validity index used as a fitness function in automatic clustering is presented with an emphasis on the performances of the CVIs. In Ref. [124], an automatic metaheuristic-based clustering algorithm using Particle swarm optimization is reported. The authors used Dunn's, Turi and the S_Dbw validity indices. Their report shows that the Turi validity index performed better than the other two validity indices. The authors in Ref. [125] used the Sum of Square Error, Variance Ratio Criterion, and Davies Bouldin index in evaluating their automatic clustering algorithm based on the combinatorial Particle swarm optimization metaheuristic algorithm.

In the automatic clustering algorithm reported by Ref. [126], the Calinski-Harabasz index and Rand Index were used as the cluster validity indices. The Turi index was employed by Ref. [127] in the improved Particle swarm optimization automatic clustering algorithm. They observed that obtaining a better Turi index value does not ensure higher accuracy. Their suggestion is to use another validity index if the main concern is accuracy. Moreover, they also observed that the similarity measurement significantly influenced the results obtained.

In [128], variance [129] and connectivity [130] were used in their multi-objective immunized PSO automatic clustering algorithm. A kernel-induced similarity measure was adopted in the CS measure by Ref. [131] for automatic clustering based on the Multi-Elitist Particle swarm optimization algorithm, replacing the usual sum-of-squares distance with a kernelized distance metric in the CS cluster validity index. The CS measure was noted as more efficient in handling clusters of different sizes and/or densities compared with other popular validity indices, although it incurs a higher computational load as the number of clusters and the size of the dataset increase.

In the automatic clustering using Multi-objective Particle Swarm and Simulated Annealing algorithms reported in Ref. [132], three cluster validity indices were used: the DB index, the Sym-index, and the Conn-index, based on the Euclidean distance, total cluster symmetry, and cluster connectedness respectively. The adoption of the three CVIs in the multi-objective function helped in the detection of clusters in datasets with various shapes as well as overlapping and non-convex datasets. In the work of [37], two CVIs were used, the DB index and the CS measure. They observed that the two CVIs were not efficient with datasets that had overlapping clusters.

In [133], the I cluster validity index was used in their differential evolution automatic clustering algorithm using the cluster number oscillation method. In the Differential Evolution Fuzzy clustering for automatic cluster evolution proposed by Ref. [134], the Xie-Beni index was used as the CVI for the automatic cluster evolution algorithm. In Ref. [135], the I index was added to the Xie-Beni index for the cluster validity evaluation of their modified Differential Evolution-based automatic clustering algorithm. In Ref. [136], the Xie-Beni index and Silhouette index were used in the multi-objective Differential Evolution automatic clustering algorithms. The Xie-Beni index was also adopted by Ref. [137] in their automatic clustering using the synergy of GA and multi-objective DE.

The authors [138] opined that the effectiveness of automatic fuzzy clustering methods is dependent on the selection of the validity indexes. Moreover, using a single-objective function may not yield satisfactory results in real-world applications like remote sensing images due to the complexity involved in such applications. They used the Xie-Beni index and Jm in their proposed adaptive multi-objective DE for automatic clustering in remote sensing imagery. According to them, optimizing several validity measures simultaneously is necessary to adequately cluster datasets with varying characteristics.

In automatic clustering using genetic algorithms, Liu, Wu, and Shen [36] employed the DB index to evaluate the automatic clustering result. Their observation was that it is difficult to use one CVI to deal with different datasets. They proposed to use another validity index such as the PBM index for their future research. In Ref. [139], the authors adopted the CH index in their two-stage genetic algorithm for automatic clustering. Their report further noted the challenges with some of the existing cluster validity indices: the Dunn index [90] is computationally heavy, struggles with noisy data, and is only useful for identifying clean clusters in datasets whose sizes are not more than hundreds of points; the DB index cannot accommodate datasets with overlapping clusters although it gives good results on datasets with distinct clusters; the Silhouette index cannot handle datasets with sub-clusters because it is only able to identify the first choice; and the PBM index depends on user-specified parameters.

The Xie-Beni index, Sum of Square Error, and COSEC fitness function were used by Ref. [140] in their hybrid clustering technique based on genetic algorithms with K-means. The authors in Ref. [141] used the VI index on account of its satisfactory performance as reported in Refs. [142,143]. In Ref. [144], the CSkernel measure was used in evaluating the performance of their proposed automatic clustering algorithm. A kernel function replaced the conventional Euclidean distance for efficiently handling datasets with different scales and densities. The use of the kernel function is good with complicated and linearly inseparable datasets. The authors [145] adopted the VI value as the cluster validity index for their automatic clustering algorithm based on artificial bee colony for customer segmentation.

The PBM index was used by Ref. [146] as the fitness function for their proposed dynamic parameter harmony search optimization (AC-DPHS) automatic clustering algorithm. PBM was compared with the DB index and the XB index and was reported to exhibit better performance in terms of the optimal number of clusters and a lower computational time. However, the improvement in clustering quality for higher-dimensional datasets is less obvious, and better cluster validity indexes were suggested for such datasets.

5.1. Analysis of the CVIs usage in automatic clustering algorithms

The analysis of the reviewed articles regarding the indices used for cluster validation is presented in Fig. 2. The highest number of articles used the Davies-Bouldin index for cluster validation, followed by the CS, SI, and Xie-Beni indices respectively. The strengths and weaknesses of the CVIs, as obtained from the reviewed articles, are summarized in Table 3.

Fig. 2. Analysis of reviewed literature for cluster validation.

5.2. Factors affecting the performances of cluster validity indices

Cluster validity indices are measured based on the relationship between cluster characteristics such as cluster cohesion, cluster separation, cluster symmetry, and connectedness [1]. These basic cluster characteristics are determined using some proximity metrics such as the Euclidean distance, the Cosine distance, the maximum edge distance, etc. The proximity measure adopted in any cluster validity index determines the shape of the clusters that can be identified. For instance, the use of Euclidean distance identifies spherically shaped clusters while the maximum edge distance is good at discovering irregular-shaped clusters. The Cosine distance is employed mostly when priority is given to discovering the orientation between patterns rather than their magnitude.

In determining the closeness or similarity of objects in a dataset, the distance measure used has a considerable effect on how the data objects are clustered [100]. Cluster validity indices that use traditional variability criteria (variance, separation, density, and continuity) for cluster validation are not efficient when handling arbitrarily shaped clusters [4]. Validity indices that do not use average values in their evaluation metric perform better in validating clusters of different densities and sizes. According to Ref. [31], the standard measure of interest, that is, distance, is the least reliable measure for cluster validation for clusters of volumetric cloud forms. In cluster validity indices that use Euclidean distance, the scaling of the various dimensions also affects the clustering patterns.

Moreover, from the study conducted by Ref. [147], the cluster validity indices' reliability and performance vary with the clustering method, the data structure, as well as the clustering objective. According to Ref. [5], cluster overlap, experimental factors, and the presence of noise have an impact on cluster validation indices. According to Ref. [5], most CVIs demonstrate better results with fewer clusters. Jiang et al. reported that distance among features becomes meaningless in high-dimensional datasets, especially in gene expression data where the overall shape of the gene expression patterns is more important [148]. Hence, Pearson's correlation coefficient is used in measuring the similarities in the shapes of gene expression patterns.

It is generally concluded that no cluster validity index provides consistent results for different clustering algorithms thus emphasizing the fact that none perform better than others. It is recommended that many validation indexes should be employed to determine the best-performing one for various datasets.

5.3. Application areas of common cluster validity indices

5.3.1. Web Usage

Web intelligence is the general term for describing the research and application of information technology and machine learning focussing on the Web platform. Web intelligence applications include Web document clustering, classification of online text, Web usage profiling, e-commerce web recommender, and other tasks involving knowledge discovery [149]. Web Usage data are often unstructured and characterised by complex attributes. They are usually generated from Web activities dynamically and asynchronously. Clustering plays an important role in mining and extracting knowledge from Web intelligence data and related applications.

Clustering web documents organizes knowledge, improves search engine results, and enhances web crawling [150]. Metaheuristic optimization approaches have been applied in web document clustering due to the high dimensionality and orthogonality characteristics of web documents. The authors [150] suggest entropy-based measures and cluster cohesiveness measures as fitness functions for web document clustering. In text classification, unstructured sets of documents are partitioned into their respective categories based on their content [151]. Due to the exponential growth of information over the internet, there is a need for automatic classification of text documents. The areas of application of text classification include topic tracking, spam filtering, sentiment analysis, web page classification, and email routing.

The authors [149] investigated and reported the efficacy of the application of nature-inspired optimization algorithms such as the Fireflies, Cuckoos, Wolves, and Bats for Web Intelligence data clustering. The performance of the clustering algorithms was measured using the inter-cluster distance and intra-cluster distance. A Fuzzy-based Recommender System for Web Users was proposed by Ref. [152] which uses an algorithm to provide acceptable data clusters without prior knowledge of the initial clusters. Similarity and distance measures were used in calculating the match score for the recommender system.

5.3.2. Speech processing

Speech provides a natural way of communication among humans. The study and processing methods of speech signals are referred to as speech processing. It includes speech-coding algorithms, speech recognition, speech synthesis, and other aspects of speech processing. The speech-coding algorithms provide effective and efficient voice communication and storage. The ability of computers to understand human language and follow human voice commands is made possible through speech recognition. The synthesis creates a platform for interactive systems that correspond to humans with natural voices [153].

According to Ref. [154], one of the most successful yet fundamental techniques in speech recognition, speech coding, image coding, speaker recognition, and speech synthesis is Vector Quantization. The techniques of Vector Quantization are regarded as data clustering methods [155]. It involves compressing voice data for transmission or storage while retaining the data fidelity. A set of k-dimensional data vectors is encoded by the VQ encoder with a much smaller subset called a codebook.

The authors [155] used the Linde-Buzo-Gray (LBG) algorithm to automatically generate initial centroids using a splitting procedure. The LBG algorithm is a local optimization procedure and uses various approaches for its optimization task. The author used the directed search binary-splitting approach for the vector quantization. In Ref. [156], automatic clustering was applied to find an appropriate number of clusters in the application of the clustering method for capturing phonetic classification, to establish the reliability of automatic clustering in phonetic classification. The Davies-Bouldin and I validity indices were employed to validate the quality of the generated clusters. The authors in Ref. [157] proposed a new spectral clustering algorithm that is based on minimizing a cost function built on measures of error between a solution of the spectral relaxation of a minimum cut problem and a given partition. This spectral clustering was used as a learning algorithm in speech separation problems.

In [158], a method for automatic clustering of similar units for unit selection in speech synthesis is presented. The distance between two units is measured using an acoustic measure that gives the mean weighted distance between units, with the shorter unit linearly interpolated to the longer unit. The problem of automatic classification of speech data was addressed by the authors in Ref. [159] without clearly defining the categories that characterize different speaking styles. They proposed an x-means clustering that clusters the data based on a pre-defined distance measurement formulated using a human perception-based weighted distance.

5.3.3. Onset and progression of disease in medical science

The health sector is regarded as one of the primary sectors that has a general impact on the members of the public. Therefore, the improvement of the healthcare sector alongside contemporary society's development is very important. Diseases pose a serious threat to public health across the globe. Analysis of healthcare data has assisted patients, health officials, and healthcare communities in the early detection of many diseases [160]. Access to complete medical data obtained from patterns extracted from healthcare data has assisted in improved medical diagnosis and treatment. A huge number of medical images are generated daily. Analysis of these medical images using image segmentation to identify regions of interest has assisted in extracting important features that aid in the diagnosis of diseases. Clustering has been used as an important tool in addressing the challenge of analysing big image data.

In the work of [161], the desired cluster numbers were identified using learning vector quantization in their proposed automated system for retinal image analysis for the diagnosis of eye diseases. The authors [162] adopted an automatic clustering method for COVID-19 CT image segmentation to assist in diagnosing the disease. They used the generalized extreme value (GEV) to improve the density peak clustering (DPC) in finding the optimal number of clustering centres in their proposed model. The structural similarity index, peak signal-to-noise ratio, and entropy were used to measure the performance of the proposed algorithm.

In [163], the Mean Shift clustering method was used to automatically identify clusters using kernel density estimation of a predetermined feature space for functional Magnetic Resonance Imaging (fMRI), which is used in the identification of activation regions in the brain. The authors [164] solved the problem of intensity inhomogeneity and the associated challenges of initialization and configuration of controlling parameters in medical image segmentation. They proposed a method that integrates a variation of fuzzy clustering with a local region-based level set method to automatically determine the region of interest in the image segmentation. The fuzzy local similarity measure was applied to ensure robustness against noise and to preserve image detail.

In [165], a semi-supervised clustering technique based on multi-objective optimization using simulated annealing was proposed and applied to the automatic segmentation of MR brain images. Three cluster validity indices (Sym-index, I-index, and Minkowski index) were used as the objective functions for the system. The Sym-index used the symmetry distance metric while the I-index used the Euclidean distance metric. A hybrid automatic clustering algorithm proposed by Ref. [166] was used in the cluster analysis of prostate cancer data. Their proposed automatic clustering algorithm combined automatic kernel clustering with bee colony optimization, and it used the CSkernel index as the objective function in the optimization with the aim of efficiently handling datasets with different scales and densities.

Automatic clustering has been used in deciphering hidden patterns in gene expression data. In the review work of [167] regarding the application of clustering algorithms to gene expression data, several automatic clustering algorithms were reported. In Ref. [39], a multi-objective clustering technique was proposed which automatically partitioned gene expression data into an appropriate cluster number. Three objective functions were used simultaneously for the detection of appropriate cluster numbers and optimum clustering of the gene expression data. Other automatic clustering algorithms used for gene expression data include [39,130].

The authors [168] proposed an automatic clustering algorithm for medical big data clustering based on a modified Immune Evolutionary Algorithm. The objective function f, based on the FCM objective function J, was adopted for the optimization process.

5.3.4. Image processing and image segmentation

Image processing involves the application of an extensive range of possible computational operations to an image for knowledge discovery. Image segmentation is an aspect of image processing that involves exhaustive homogeneous partitioning of an image based on some image property. Automatic clustering methods have been applied in solving problems relating to image processing and segmentation. Articles reporting applications of automatic clustering methods for image processing and segmentation include [38,162,[169], [170], [171], [172], [173], [174]].

5.3.5. Retrieval of information

Information Retrieval (IR) focuses on discovering effective computational approaches for automating document storage and retrieval [175]. It involves the process of digging out queries for multimedia information, images, or specific text from web content. The information retrieval techniques find applications in a wide range of fields such as research publications, e-commerce, academics, clinical decision support, etc [176]. The adoption of massive online digital content in this era of digitization has made information retrieval cumbersome and more complex. Evolutionary-based approaches and swarm intelligence approaches transform IR problems into optimization problems using the collection of documents as a space of solutions [177].

The authors [176] proposed a swarm-optimized cluster-based framework of information retrieval using the K-Flock clustering algorithm. To evaluate the performance of the clusters, the modified silhouette coefficient index measure [178] was adopted. The authors in Ref. [179] augmented the user's original query for information retrieval through a query expansion process based on a Firefly algorithm-based approach. The Firefly algorithm was used to find the best-expanded query among a set of expanded query candidates for effective query expansion retrieval while maintaining low computational complexity. The inverted indexes of the terms in the expanded query were used to compute the scores for each document, with the best score considered as the fitness value for the expanded query. Other works reported by these authors on automatic clustering for DIR problems can be found in Refs. [[180], [181], [182]].

In [183], automatic query expansion using cuckoo search and accelerated particle swarm optimization techniques for IR problems was proposed. The authors used the same fitness function as the one used by Ref. [179]. Other work relating to automatic query expansion is reported in Refs. [184,185]. The authors [186] proposed the use of Cellular Automata to improve the quality of clustering for information retrieval. In Ref. [187], a relevance and interface-driven clustering for visual information retrieval is proposed. Their proposed clustering algorithm automatically generates highly relevant clusters while optimizing for interface-driven desiderata for spatial, temporal, and keyword coherence, excluding the need for the specification of complex distance metrics. For automatic clustering-based IR, Ref. [188] reported that the Cosine similarity measure is particularly good for text documents as a distance measure in cluster validity indexes.

The authors [189] implemented a modified firefly algorithm adapted to an Intelligent Ontology and Latent Dirichlet Allocation Information Retrieval model for the enhancement of query searching time in information retrieval systems. The cluster validity is based on semantic relevancy, which is determined using the document topical strength measure. Other research reports on automatic clustering and information retrieval include [190].

5.3.6. Automotive and aviation systems

Trajectory clustering in aviation is a technique that identifies prevailing aircraft patterns. In trajectory clustering, similar trajectories or trajectory segments are identified and classified into clusters that have the potential to reveal the movements and behaviours of the corresponding objects or nodes [191]. Improving efficiency in aviation systems requires that the actual flight trajectory of aircraft is close to their ideal profile. The authors [192] proposed Trajectory clustering that uses both temporal and spatial features in approach trajectory and aircraft descent optimization based on a multi-objective perspective to minimize aircraft emission, fuel consumption, and the impact of noise.

Automatic clustering has been applied in Automatic Identification System trajectory clustering for maritime safety. It provides a theoretical basis for route planning design and management. It also strengthens the dynamic monitoring of ships and improves the efficiency of maritime supervision. The authors in Ref. [193] proposed an automatic multi-step trajectory clustering method for robust shipboard Automatic Identification System trajectory clustering. It was used to find the customary vessel routes and detect abnormal trajectories. The authors [194] proposed a solution for anomaly detection for components of different products in the automotive industry using an automatic clustering algorithm. Six different cluster validity indexes, including the Silhouette index, CH index, WB index, Sum of Square Within Clusters (SSW), Sum of Square Errors (SSE), and Sum of Square Between Clusters (SSB), were used for cluster validation.

In the work of [195], a cluster-based adaptive network fuzzy inference system tuned by Particle Swarm Optimization for the forecasting of annual automotive sales was developed. The authors in Ref. [196] proposed an auto-tuning controller using multi-layer Particle Swarm Optimization with K-means clustering and an adaptive learning strategy for permanent magnet synchronous motor drives. The proposed system uses the Square Error criterion as its fitness function. The authors [197] proposed an automotive product analysis based on automatic MP-DP-Kmeans clustering using MP similarity in place of the Euclidean distance to analyse and make a horizontal comparison of competitor products in automotive product development.

5.3.7. Bioinformatics

Bioinformatics is an interdisciplinary field that mainly involves genetics and molecular biology, statistics, computer science, and mathematics. It has to do with addressing data-intensive large-scale biological problems from a computational point of view. Application of automatic clustering in Bioinformatics can either be in the form of analysing gene expression data which were generated from DNA microarray technologies or by direct clustering process on protein sequences or linear deoxyribonucleic acid (DNA) data [34,198]. Clustering of gene expression data helps in identifying patterns within datasets that relate to this domain and provides insights on natural structures inherent in biological data, understanding of gene functions, subtypes of cells, cellular processes, and gene regulations [199].

The authors in Refs. [200,201] proposed automatic multiple kernel density clustering algorithms for handling high-dimensional bioinformatic data and incomplete datasets in bioinformatics respectively. The authors in Ref. [202] proposed an automatic clustering algorithm for grouping brain tumour gene expression datasets based on Cuckoo search clustering and Levy flight Cuckoo search. The authors in Ref. [203] hybridized the Genetic Algorithm with the Cuckoo search algorithm for automatic clustering of a breast cancer dataset. The Silhouette coefficient index was utilized as the objective function for the clustering algorithm.

In [204], microarray gene expression data clustering based on a two-stage meta-heuristic algorithm that uses the concept of alpha-planes in general type-2 fuzzy sets was considered. The alpha-plane for general type-2 fuzzy c-means was used as the objective function for the clustering process. The automatic metaheuristic-based clustering was based on a Simulated Annealing optimization algorithm. The authors in Ref. [205] introduced a soft computing metaheuristic framework for the automatic clustering of DNA sequences with intelligent techniques based on the Bat algorithm hybridized with the Genetic algorithm. They adopted the pulse-coupled neural network for calculating the DNA sequence similarity or dissimilarity. Their algorithm was used for clustering the expanded human oral microbiome database.

A hybrid gene selection algorithm for cancer classification was proposed by Ref. [206] based on the Bat algorithm. A minimum redundancy maximum relevancy filtering method with a Bat algorithm wrapper method was used for gene selection in the microarray dataset. An article on the soft computing methods that have been used in Bioinformatics was published by Ref. [207] stating clustering as one of the soft computing methods. He summarized some applications of sequence alignment and the soft computing methods indicating metaheuristic and swarm intelligence algorithms as the most used soft computing algorithms for sequence alignment. There is also a literature survey on population-based metaheuristic algorithms used for Gene clustering by Ref. [208] with emphasis on the application of Genetic Algorithm and Particle Swarm Optimization algorithm, their variants and hybridization.

6. Experimental study

This section presents the report on the experimental study carried out using eight cluster validity indices on the SOSK-means clustering algorithm [209]. The SOSK-means clustering algorithm is a hybrid algorithm that combines the symbiotic organism search metaheuristic algorithm with the classical K-means algorithm. It harnesses the benefits of the two algorithms for handling automatic clustering problems. The parameter settings of the algorithm are summarized in Table 4. The algorithm was executed for 200 iterations over 20 replications for each cluster validity index. The algorithm was executed using MATLAB 2018 on an Intel Dual Core i7-7600U CPU at 2.80 GHz with 15.8 GB RAM. The performance of each of the CVIs was evaluated using the average best fitness value obtained for each dataset and the average computational time for convergence.

Table 4.

SOSK-means algorithm parameter setting.

Parameter Description Value
Max-It Number of iterations 200
NP No of population 20

6.1. Datasets

Twelve datasets consisting of synthetic and real-life datasets with different characteristics were considered in this study. The summary of the datasets is presented in Table 5. Breast, Glass, Iris, Thyroid, Wine, and Yeast are real-life datasets taken to represent different domains in Engineering and Science. The remaining datasets are synthetic, representing non-linearly separable datasets with arbitrarily shaped clusters. The Jain dataset represents complex shapes with overlapping characteristics. The Compound and Flame datasets are representations of non-linearly separable clusters with different shapes and densities. The Path-based, Spiral, and Two-moons datasets are representations of arbitrarily shaped, intertwined clusters, with Path-based exhibiting more complex paths than the other two. These datasets are commonly used in the literature for evaluating the performance of clustering algorithms on non-linearly separable data. The clustering illustration for the cluster structure of each of the datasets can be found in Ref. [209].

Table 5.

Characteristics of the datasets.

Datasets Dataset Types Number of Objects Dataset Features Number of Clusters References
Breast UCI 699 9 2 [210,211]
Glass UCI 214 9 7 [210,211]
Iris UCI 150 4 3 [210,211]
Thyroid UCI 215 5 2 [210,211]
Wine UCI 178 13 3 [210,211]
Yeast UCI 1484 8 10 [210,211]
Compound Shape 399 2 6 [210,212]
Flame Shape 240 2 2 [210,213]
Jain Shape 373 2 2 [210,214]
Path-based Shape 300 2 3 [210,215]
Spiral Shape 312 2 2 [210,215]
Two-moons Shape 10,000 2 2

6.2. Evaluated CVIs

Eight different internal cluster validity indices were considered in this study. The CVIs include the General Dunn Index, PBM Index, CH Index, SI Index, DB Index, CS Index, Xie-Beni Index, and the Dunn-Symmetric Index. Details of these CVIs have been presented in section 4.2. The CVIs were used as internal validity indices in the metaheuristic-based automatic clustering algorithm, the SOSK-means algorithm [209], for this study. For each of the CVIs, the algorithm was executed using twenty independent runs of 200 iterations on each dataset, and the results of their performances are presented in Table 6. The computation time of the various CVIs for each dataset is presented in Table 7.

Table 6.

CVIs Average Clustering Performance on each dataset.


Average Clustering Performance
Dataset GDI S_Dbw XB CH Dunn-Sym SI DB CS
Breast 0.1281 0.045155 0.15874 12.9768 0.011814 0.700576 1.2416 1.1019
Compound 0.022873 0.009323 0.093588 22.1904 0.002889 0.59727 0.519386 0.77324
Flame 0.020888 0.020641 0.14963 17.3804 0.01999 0.318642 1.173924 1.55968
Glass 0.071372 0.001358 0.070395 4.52524 0.004836 0.102634 0.82578 0.02
Iris 0.027458 4.11E-05 0.128888 3.042554 2.59E-08 85352.2 0.84139 1.13144
Jain 0.010184 0.006578 0.12338 29.1736 0.001478 0.607982 0.6532 1.02611
Path-based 0.77061 0.0096 0.14029 27.5826 0.000734 0.684518 0.784962 0.968948
Spiral 0.017882 0.011035 0.196416 25.3902 0.000912 0.689684 0.80329 1.17602
Thyroid 0.030312 2.78E-08 0.054334 1.102296 6.41E-09 114850 0.63196 1.57881
Two-moons 0.00483 0.025071 0.111604 142.5633 0.000878 0.202264 0.73901 0.922632
Wine 0.222994 0.011253 0.404006 0.35299 7.38E-06 1197.974 1.0061 1.40034
Yeast 0.10069 0.00038 0.108704 3.39498 0.043199 0.018009 0.762546 0
Average 0.119016 0.011703 0.144998 24.13962 0.007228 16783.67 0.831929 0.971593

Table 7.

CVIs Average Computation Time expended on each dataset.


Average Computational Time

Dataset GDI S_Dbw XB CH Dunn-Sym SI DB CS
Breast 6845.7 11981.4 5348.62 1935.74 14330.4 2306.78 960.21 13514
Compound 6115.78 5572 5458.94 1203.78 10569.4 1216.45 881.4 3938.8
Flame 5694.1 5675.28 4734.42 2168.2 9245.78 1175.54 1092.9 2451.6
Glass 4807.8 2843.78 2049.84 1202.45 4548.68 1357.866 649.57 2007.5
Iris 5610.06 3042 2485 1430.836 5760.98 887.272 604.2 1450.2
Jain 6951.2 5811.54 5338.26 1237.858 6284.62 1219.56 1026.4 3473.9
Path-based 1413.44 4300.84 5128.22 1413.44 10018.72 1390.88 1396.7 2492.4
Spiral 6587.48 5770.68 4004.32 2485.84 9916.82 1664.634 1208.7 2849.3
Thyroid 3692.74 1999.32 1320.84 1320.026 3337.66 1159.196 730.35 2282.2
Two-moons 7076 19002.6 9970.48 4134.7 29142.66667 1236.64 2824.1 13712
Wine 6066.26 4546.3 2625.5 1896.94 6564.62 2064.188 654.03 1863.2
Yeast 10953 16129.8 11775.78 7396.12 26362.66667 4775.05 1295.6 46894
Average 5984.46 7223 5020.02 2318.83 11340.25 1704.51 1110.35 8077.43

It is important to note that while internal cluster validity indices address the challenge of determining the validity of the number of clusters, metrics such as compactness and separation, associated with these indices, are employed to evaluate the quality of the clustering task. Moreover, cluster quality can be assessed by examining the stability of the clustering algorithm under variations in data or algorithm parameters. These steps form the basis of the experimental approach described in this study.

6.3. Experimental results

From Table 6, it can be observed that the Dunn-Sym index demonstrated superior performance in ten of the twelve datasets compared with the other CVIs, followed by S_Dbw with better performance in the remaining two. The GDI followed the Dunn-Sym and S_Dbw considering its average performance compared with the rest of the CVIs. Though the Dunn-Sym demonstrated superior performance, it recorded a greater computation time compared with the other CVIs. From the experimental results, it is obvious that GDI, S_Dbw, XBI, and SI performed better on these datasets compared with the traditional DBI and CSI. The performance of the CH index could not be compared with the others using the fitness value because it is a maximization criterion that produces higher values.

From Table 7 showing the average computation time of the CVIs, it can be observed that the average computational time for the clustering process is lower for the traditional DBI, CS, and CH compared with the better-performing CVIs. The performance of each CVI for each of the datasets is shown in Fig. 3, Fig. 4, Fig. 5, Fig. 6, Fig. 7, Fig. 8, Fig. 9, Fig. 10, Fig. 11, Fig. 12, Fig. 13, Fig. 14 while the performance of each CVI on all the datasets is illustrated in Fig. 15, Fig. 16, Fig. 17, Fig. 18, Fig. 19, Fig. 20, Fig. 21, Fig. 22.

Fig. 3. CVIs performance on Breast Dataset.

Fig. 4. CVIs performance on compound Dataset.

Fig. 5. CVIs performance on flame Dataset.

Fig. 6. CVIs performance on Glass Dataset.

Fig. 7. CVIs performance on Iris Dataset.

Fig. 8. CVIs performance on Jain Dataset.

Fig. 9. CVIs performance on path-based Dataset.

Fig. 10. CVIs performance on Spiral Dataset.

Fig. 11. CVIs performance on Thyroid Dataset.

Fig. 12. CVIs performance on Twomoons Dataset.

Fig. 13. CVIs performance on Wine Dataset.

Fig. 14. CVIs performance on Yeast Dataset.

Fig. 15. Gd index performance on 12 datasets.

Fig. 16. S_Dbw index performance on 12 datasets.

Fig. 17. Xie-Beni index performance on 12 datasets.

Fig. 18. CH index performance on 12 datasets.

Fig. 19. DunnSym index performance on 12 datasets.

Fig. 20. Symm index performance on 12 datasets.

Fig. 21. Db index performance on 12 datasets.

Fig. 22. CS index performance on 12 datasets.

6.4. Discussion on CVIs performances on real-life datasets

The performance of the different CVIs on real-life datasets is discussed in this section. Each of the datasets has varied characteristics. For instance, the Breast, Glass, and Wine datasets have high dimensionality, varying between 9 and 13 features, with the Glass dataset having the highest number of clusters; however, they each contain fewer than a thousand objects. The Yeast dataset is characterised by high dimensionality and a high number of objects, with the highest number of clusters. The performance of each of these CVIs based on these dataset characteristics is noted in this discussion. Table 8 presents this performance based on the clustering results.

Table 8.

Average clustering performance on real-life datasets.


Average Clustering Performance of CVIs on Real-Life Datasets
Dataset GDI S_Dbw XB CH Dunn-Sym SI DB CS
Breast 0.1281 0.045155 0.15874 12.9768 0.011814 0.700576 1.2416 1.1019
Glass 0.071372 0.001358 0.070395 4.52524 0.004836 0.102634 0.82578 0.02
Iris 0.027458 4.11E-05 0.128888 3.042554 2.59E-08 85352.2 0.84139 1.13144
Thyroid 0.030312 2.78E-08 0.054334 1.102296 6.41E-09 114850 0.63196 1.57881
Wine 0.222994 0.011253 0.404006 0.35299 7.38E-06 1197.974 1.0061 1.40034
Yeast 0.10069 0.00038 0.108704 3.39498 0.043199 0.018009 0.762546 0

For the average performance on high dimensional datasets, the Dunn-Sym Index and S_Dbw exhibited the best performances while SI recorded the worst performances. The GDI followed by the Xie-Beni index performed averagely well compared with DB and CS. The SI performed the worst on the Wine dataset which has the highest number of dimensions.

For the average performance on the number of clusters, the Glass and Yeast datasets have the highest numbers of clusters, 7 and 10 respectively. The S_Dbw recorded the best performance for the two datasets, followed by the Dunn-Sym. However, the SI performed better than the Dunn-Sym on the Yeast dataset. The other CVIs performed averagely well.

For the average performance on the dataset density, the Yeast dataset has the highest number of objects followed by Breast with 1484 and 699 objects respectively. The S_Dbw recorded its best performance on the Yeast dataset performing better than Dunn-Sym. The SI also recorded its best performance on the Yeast dataset with a better performance compared with the Dunn-Sym. The worst performance recorded for the Dunn-Sym is on the Yeast dataset though with a better performance compared with others except S_Dbw and SI. The DB and CS recorded their worst performances on the Breast dataset. The DB recorded a poor performance on the Yeast while CS could not return any result at all.

6.5. Discussion on CVIs performances on synthetic datasets

The performance of the different CVIs on the synthetic datasets is discussed next in this section. Each of the synthetic datasets is generated to demonstrate different characteristics with varying degrees of complexity and overlapping. As earlier mentioned in section 6.2, the Path-based, Spiral, and Two-moons datasets represent arbitrarily shaped clusters having intertwined clusters. The Path-based dataset exhibits more complex paths compared with Spiral and Two-moons datasets. The Jain dataset is a representation of complex shapes with overlapping characteristics while the compound and flame represent datasets with non-linearly separable clusters having different shapes and densities. The performance of each of the CVIs based on these datasets with varying degrees of complexity is the point of discussion in this section. Table 9 presents the performances of the CVIs on the various synthetic datasets based on their clustering results.

Table 9.

Average clustering performance on synthetic datasets.


Average Clustering Performance of CVIs on Synthetic Datasets
Dataset GDI S_Dbw XB CH Dunn-Sym SI DB CS
Compound 0.022873 0.009323 0.093588 22.1904 0.002889 0.59727 0.519386 0.77324
Flame 0.020888 0.020641 0.14963 17.3804 0.01999 0.318642 1.173924 1.55968
Jain 0.010184 0.006578 0.12338 29.1736 0.001478 0.607982 0.6532 1.02611
Path-based 0.77061 0.0096 0.14029 27.5826 0.000734 0.684518 0.784962 0.968948
Spiral 0.017882 0.011035 0.196416 25.3902 0.000912 0.689684 0.80329 1.17602
Two-moons 0.00483 0.025071 0.111604 142.5633 0.000878 0.202264 0.73901 0.922632

All the synthetic datasets are low-dimensional, specifically two-dimensional. In terms of the number of clusters, all the synthetic datasets have just two clusters except the Compound and Path-based datasets, which have six and three clusters respectively. The report on the CVIs' performance will mainly focus on how well they can handle non-linearly separable clusters of different shapes and densities.

From a general point of view, the Dunn-Sym index recorded the best performance for all the synthetic datasets, with its best performance on the Two-moons dataset and its worst performance on the Flame dataset. For the dataset with complex shapes and overlapping characteristics, represented by the Jain dataset, the S_Dbw recorded its best performance. The GDI and Xie-Beni performed averagely well compared with SI, DB, and CS on this dataset.

For the datasets characterized by non-linearly separable clusters with different shapes and densities, represented by the Compound and Flame datasets, the Dunn-Sym recorded the best results, followed by the S_Dbw. The GDI, Xie-Beni, and SI performed averagely well in that order. The DB and CS recorded their best performance on the Compound dataset (though worse than the earlier mentioned CVIs) and recorded their worst performance on the Flame dataset.

For the datasets characterised by arbitrarily shaped and overlapping clusters, represented by the Path-based, Spiral, and Two-moons datasets, the Dunn-Sym recorded the best performances, with its best performance recorded on Two-moons, which coincidentally has the highest number of data objects. The S_Dbw performance on Path-based was better compared with Spiral and Two-moons. The performances of GDI, Xie-Beni, and SI on Two-moons were on average better than their performances on the Spiral and Path-based datasets. The performances recorded by DB and CS are poor compared with the other CVIs.

From the observed performances of the CVIs, it can be noted that the Dunn-Sym performed better than the other CVIs on the synthetic datasets and mostly so on datasets with arbitrarily shaped and overlapping clusters. The S_Dbw also recorded averagely better performances compared with GDI, Xie-Beni, and SI. The DB and CS recorded worse performances compared with the other CVIs.

To statistically validate the experimental results, a series of statistical analyses was carried out on the data. The Friedman Rank Test was carried out to detect differences among the various cluster validity indices across the multiple datasets. The Friedman Rank Test [216] is a non-parametric statistical test that is mostly used with repeated measures, where the same subjects are observed under different conditions. In this case, several cluster validity indices are tested on the same datasets to investigate their performances in relation to each of the datasets.

The Friedman Rank Test ranks each of the CVIs per dataset, evaluates the sum of ranks for each of the CVIs, and analyses the sums to determine whether there is a statistically significant difference among them. The Friedman test statistic follows a chi-square distribution under the null hypothesis that there are no differences between the CVIs. The statistical analysis of the data obtained from the experiments produced a Friedman test statistic of 69.80 with a p-value of 1.63e-12.

The Friedman test result shows that there is a statistically significant difference among the CVIs across the datasets that were evaluated. This is indicated by the extremely small p-value, which is much lower than the 0.05 significance level. Therefore, it can be concluded that at least one CVI has a significantly different performance compared with the other CVIs.

To determine which specific CVIs differ, a post-hoc test, the Nemenyi test, was carried out to identify the pairs of CVIs that exhibit statistically significant differences. This shows which indices outperform or underperform relative to each other. The Nemenyi test [217] is used for pairwise comparisons to determine the significant differences between the CVIs. The Nemenyi test produced a heatmap that shows the p-values for each pairwise comparison of the CVIs. The heatmap is shown in Fig. 23.
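As an illustration of how such an analysis can be reproduced, the sketch below applies SciPy's Friedman test and the Nemenyi post-hoc test from the third-party scikit-posthocs package to a matrix of CVI scores; the file name and the matrix layout (rows as datasets, columns as CVIs) are assumptions for illustration.

import numpy as np
from scipy import stats
import scikit_posthocs as sp  # third-party package, assumed installed

# scores: one row per dataset, one column per CVI (e.g. the values of Table 6)
scores = np.loadtxt("cvi_scores.csv", delimiter=",")  # hypothetical file

# Friedman test across the CVI columns
stat, p_value = stats.friedmanchisquare(*scores.T)
print(f"Friedman statistic = {stat:.2f}, p-value = {p_value:.2e}")

# Nemenyi post-hoc test: matrix of pairwise p-values (the data behind the heatmap)
pairwise_p = sp.posthoc_nemenyi_friedman(scores)
print(pairwise_p)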

Fig. 23. Nemenyi test results for pairwise comparisons of the CVIs.

The cells with p-values <0.05 imply that there are significant differences between the CVI pairs that form the cell, indicating that the performance of one CVI is significantly different from the other across the datasets. The red cells have p-values close to 1, indicating that there is no statistically significant difference between the two CVIs in those cells, that is, their performances are very similar.

The cells in shades of blue indicate lower p-values, with darker blue colours showing p-values below 0.05. This implies that there is a statistically significant difference between the two CVIs that form the cell, that is, the CVIs perform differently across the datasets. Based on this, the following can be observed: there is a significant difference between GDI and CS, with an approximate p-value of 8.2×10⁻⁴. An approximate p-value of 0.00082 is also reported for GDI and DunnSym, indicating a significant difference between the two CVIs.

The CH and DunnSym have an approximate p-value of 7.1×10⁻⁹, which shows a very strong difference in the performance of the two CVIs. In the same vein, the S_Dbw performs significantly differently from the DunnSym, as indicated by the approximate p-value of 5.8×10⁻⁷. Moreover, the SI, DB, and CS in comparison with DunnSym show very low p-values, in the range of 10⁻⁵ to 10⁻⁹, which indicates statistically significant differences between them. However, the heatmap indicates that there are no statistically significant differences between GDI and S_Dbw, or between SI and DB. This implies that these pairs demonstrate similar performances across the datasets. It is also shown that S_Dbw paired with XB, SI, and CS yields high p-values, which indicates that there are no significant performance differences between them.

Moreover, the confidence intervals [218] for each of the CVIs across the datasets were also estimated using bootstrap confidence intervals, with the 2.5th and 97.5th percentiles of the bootstrap distribution taken as the bounds of a 95 % confidence interval. This gives a sense of the variability of each of the CVIs with respect to the different datasets. The mean and 95 % confidence interval for each of the CVIs are presented in Table 10.
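A percentile bootstrap of this kind can be sketched in a few lines; the number of resamples and the seed below are arbitrary illustrative choices.

import numpy as np

def bootstrap_ci(values, n_boot=10000, seed=0):
    # Resample with replacement and take the 2.5th and 97.5th percentiles of the
    # resampled means as the bounds of a 95% confidence interval
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    means = np.array([rng.choice(values, size=len(values), replace=True).mean()
                      for _ in range(n_boot)])
    return values.mean(), np.percentile(means, 2.5), np.percentile(means, 97.5)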

Table 10.

Mean and 95 % Confidence Interval for each CVI.

Mean and 95 % Confidence Interval for Each CVI
CVI Mean 95 % CI Lower 95 % CI Upper
GDI 0.12054 0.042117 0.238925
S_Dbw 0.011754 0.005823 0.019251
XB 0.14454 0.106795 0.195972
CH 24.213927 9.903149 45.567862
DunnSym 0.007093 0.001823 0.014243
SI 16211.52378 184.725214 37392.51573
DB 0.833124 0.731929 0.943132
CS 0.967853 0.695004 1.220978

In analysing the performance of the CVIs across various datasets, the mean gives the CVI's average value across the datasets while the 95 % confidence interval indicates the range where the true mean of the CVI is expected to fall with 95 % confidence based on the variability in the data. A higher or lower mean reflects the typical measurement provided by the CVI across the datasets. A narrow confidence interval is an indicator of less variability for the CVI while a wide CI indicates greater variability.

The GDI has a relatively low mean with a relatively narrow confidence interval, indicating moderate performance consistency across the datasets. S_Dbw has a lower mean and a much narrower confidence interval than GDI, indicating a high level of consistency and low variance across the datasets. The XB index has a higher mean but a relatively narrow confidence interval, suggesting that its performance is stable across the datasets.

The CH index has a much larger mean than the other indices and a relatively wide confidence interval, showing significant variability across the datasets. For the DunnSym index, the mean is very low with a narrow confidence interval, suggesting high consistency and low variability across the datasets. The SI index has an extremely high mean with a very wide confidence interval, implying a lot of variability; its performance is not stable and varies significantly across the datasets.

DB has a moderate mean with a relatively narrow CI, indicating that its performance is consistent across the datasets. For the CS index, the mean is relatively high with a relatively wide confidence interval, showing that CS demonstrates some variability across the datasets.

7. Conclusion

The Cluster Validity Index is an important aspect of clustering processes; it is employed to evaluate the quality of potential clustering solutions. Several CVIs have been proposed in the literature, and they are categorized into three groups: external, internal, and relative criteria. Internal cluster validity indices are employed in automatic metaheuristic-based clustering algorithms as fitness functions for the optimization process. Cluster validity indices are computed from the relationships between cluster characteristics such as cohesion, separation, symmetry, and connectedness. This study presents a comprehensive survey of the internal cluster validity indices that have been used as fitness functions in automatic metaheuristic-based clustering algorithms, together with their strengths, weaknesses, and typical application areas. This review will be beneficial for both researchers and practitioners.

The findings of this review show that the Davies-Bouldin (DB) index is the most used CVI in automatic metaheuristic-based clustering algorithms, followed by the CS index, Xie-Beni index, Symmetric index, and WGS index. The DB index's performance, however, degrades on datasets with arbitrarily shaped clusters of varied densities. The proximity measure adopted in a cluster validity index determines the shape of the clusters that can be identified: Euclidean distance favours spherically shaped clusters, while the maximum edge distance is better at discovering irregularly shaped clusters. Cosine distance is employed mostly when priority is given to the orientation between patterns rather than their magnitude.

The reliability and performance of cluster validity indices vary with the clustering method, the data structure, and the clustering objective. Cluster overlap, experimental factors, and the presence of noise affect the performance of cluster validation indices. Most CVIs produce better results with fewer clusters, and distances between features become less meaningful in high-dimensional datasets. In gene expression data and similar domains, where the overall shape of the expression patterns matters more than their magnitude, Pearson's correlation coefficient is used to measure the similarity between the shapes of gene expression patterns.
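As a simple illustration of how the choice of proximity measure changes what counts as "similar" (this snippet is illustrative only and not part of the reported experiments), the following compares Euclidean, cosine, and Pearson-correlation distances on two hypothetical profiles that share the same shape and orientation but differ in magnitude.

```python
import numpy as np
from scipy.spatial.distance import euclidean, cosine, correlation

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # same shape/orientation as x, larger magnitude

print("Euclidean distance :", euclidean(x, y))    # large: sensitive to magnitude
print("Cosine distance    :", cosine(x, y))       # ~0: identical orientation
print("Pearson distance   :", correlation(x, y))  # ~0: identical pattern shape
```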

From the experimental results, it has been statistically validated that DunnSym differs significantly from many other indices, such as GDI, S_Dbw, CH, SI, DB, and CS. The statistical tests also show that GDI and S_Dbw, as well as SI and DB, exhibit similar performances. These findings will assist in making an informed choice of CVIs for future clustering evaluations. Based on the confidence intervals for the CVIs across the datasets, the performances of SI and CH are less consistent, while S_Dbw, DunnSym, and DB are more stable and reliable across the datasets. The stability and reliability demonstrated by DunnSym and S_Dbw make them more suitable for comparative studies of clustering algorithms.

Future experimental studies can examine the performance of these and other CVIs with respect to different distance metrics, dimensionality, and density variation to establish which CVIs perform better under specific conditions.

CRediT authorship contribution statement

Abiodun M. Ikotun: Writing – review & editing, Writing – original draft, Visualization, Validation, Investigation, Data curation, Conceptualization. Faustin Habyarimana: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources. Absalom E. Ezugwu: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Methodology, Investigation, Data curation, Conceptualization.

Ethical approval

NA.

Availability of data and materials

All data generated or analyzed during this study are included in this article.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

The authors wish to acknowledge the funding support by the National Research Foundation of South Africa (Reference Number: PSTD230503101493).

References

  • 1.José-García A., Gómez-Flores W. CVIK: a Matlab-based cluster validity index toolbox for automatic data clustering. SoftwareX. May 2023;22 doi: 10.1016/j.softx.2023.101359. [DOI] [Google Scholar]
  • 2.Ikotun A.M., Ezugwu A.E. Improved SOSK-means automatic clustering algorithm with a three-Part Mutualism phase and random weighted reflection coefficient for high-dimensional datasets. Appl. Sci. Dec. 2022;12(24) doi: 10.3390/app122413019. [DOI] [Google Scholar]
  • 3.José-García A., Gómez-Flores W. Proceedings of the Genetic and Evolutionary Computation Conference. Jun. 2021. A survey of cluster validity indices for automatic data clustering using differential evolution; pp. 314–322. [DOI] [Google Scholar]
  • 4.Halkidi M., Batistakis Y., Vazirgiannis M. Cluster validity methods: Part I. 2002;31(2):40–45. [Google Scholar]
  • 5.Arbelaitz O., Gurrutxaga I., Muguerza J., Pérez J.M., Perona I. An extensive comparative study of cluster validity indices. Pattern Recognit. Jan. 2013;46(1):243–256. doi: 10.1016/j.patcog.2012.07.021. [DOI] [Google Scholar]
  • 6.Singh P., Kumar Choudhary Sushil. Studies in Computational Intelligence. 2021. Metaheuristic and evolutionary computation: algorithms and applications. [Google Scholar]
  • 7.Khanduja N., Bhusha B. Recent advances and application of metaheuristic algorithms: a survey (2014–2020) Springer Metaheuristic Evol. Comput. Algorithms Appl. Stud. Comput. Intell. 2021;916 [Google Scholar]
  • 8.Weidt F., Silva R. Relatórios Técnicos Do DCC/UFJF; 2016. Systematic Literature Review in Computer Science—A Practical Guide. [Google Scholar]
  • 9.Moher D., Liberati A., Tetzlaff J., Altman D.G. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. Int. J. Surg. 2010;8(5):336–341. doi: 10.1016/j.ijsu.2010.02.007. [DOI] [PubMed] [Google Scholar]
  • 10.Wang H.Y., Wang J.S., Wang G. A survey of fuzzy clustering validity evaluation methods. Inf. Sci. 2022;618:270–297. doi: 10.1016/j.ins.2022.11.010. [DOI] [Google Scholar]
  • 11.Gurrutxaga I., Muguerza J., Arbelaitz O., Pérez J.M., Martín J.I. Towards a standard methodology to evaluate internal cluster validity indices. Pattern Recognit. Lett. 2011;32(3):505–515. doi: 10.1016/j.patrec.2010.11.006. [DOI] [Google Scholar]
  • 12.Brun M., et al. Model-based evaluation of clustering validation measures. Pattern Recognit. 2007;40(3):807–824. doi: 10.1016/j.patcog.2006.06.026. [DOI] [Google Scholar]
  • 13.Crawford J., Gower J., Lingoes J., Rhee W., Rohlf F.J., Sarle W. An examination of procedures for determining. 1985;50(2):159–179. [Google Scholar]
  • 14.Dimitriadou E., Dolnicar S., Weingessel A. An examination of indexes for determining. Psychometrika. 2002;67(3) [Google Scholar]
  • 15.Xu R., Xu J., Wunsch D.C. A comparison study of validity indices on swarm-intelligence-based clustering. CYBERNETICS. 2012;42(4):1243. doi: 10.1109/TSMCB.2012.2188509. [DOI] [PubMed] [Google Scholar]
  • 16.Halkidi M. Springer; 2001. On Clustering Validation Techniques; pp. 107–145.http://link.springer.com/article/10.1023/A:1012801612483 [Online]. Available: [Google Scholar]
  • 17.Pakhira M.K., Bandyopadhyay S., Maulik U. Validity index for crisp and fuzzy clusters. Pattern Recognit. Mar. 2004;37(3):487–501. doi: 10.1016/J.PATCOG.2003.06.005. [DOI] [Google Scholar]
  • 18.Bandyopadhyay S., Saha S. A point symmetry-based clustering technique for automatic evolution of clusters. IEEE Trans. Knowl. Data Eng. 2008;20(11):1441–1457. doi: 10.1109/TKDE.2008.79. [DOI] [Google Scholar]
  • 19.Zhang H., Zhou X. 2018. A Novel Clustering Algorithm Combining Niche Genetic Algorithm with Canopy and K-Means; A Novel Clustering Algorithm Combining Niche Genetic Algorithm with Canopy and K-Means. [DOI] [Google Scholar]
  • 20.Saha S., Sanghamitra B. Performance evaluation of some symmetry-based cluster validity indexes. IEEE Trans. Syst. , Man Cybern. 2009;39(4):420–425. [Google Scholar]
  • 21.Halkidi M., Vazirgiannis M. Clustering validity assessment: finding the optimal partitioning of a data set. Proc. - IEEE Int. Conf. Data Mining, ICDM. 2001:187–194. doi: 10.1109/icdm.2001.989517. [DOI] [Google Scholar]
  • 22.Ezugwu A.E.S., Agbaje M.B., Aljojo N., Els R., Chiroma H., Elaziz M.A. A comparative performance study of hybrid firefly algorithms for automatic data clustering. IEEE Access. 2020;8:121089–121118. doi: 10.1109/ACCESS.2020.3006173. [DOI] [Google Scholar]
  • 23.José-garcía A., Gómez-flores W. vol. 41. 2016. pp. 192–213. (Automatic Clustering Using Nature-Inspired Metaheuristics : A Survey). [Google Scholar]
  • 24.José-García A., Gómez-Flores W. A survey of cluster validity indices for automatic data clustering using differential evolution. GECCO 2021 - Proc. 2021 Genet. Evol. Comput. Conf. 2021:314–322. doi: 10.1145/3449639.3459341. [DOI] [Google Scholar]
  • 25.Ezugwu A.E. Nature-inspired metaheuristic techniques for automatic clustering: a survey and performance study. SN Appl. Sci. Feb. 2020;2(2):273. doi: 10.1007/s42452-020-2073-0. [DOI] [Google Scholar]
  • 26.Todeschini R., Ballabio D., Termopoli V., Consonni V. Extended multivariate comparison of 68 cluster validity indices. A review. Chemometr. Intell. Lab. Syst. 2024;251:1–20. doi: 10.1016/j.chemolab.2024.105117. [DOI] [Google Scholar]
  • 27.Milligan G.W., Cooper M.C. An examination of procedures for determining the number of clusters in a data set. Psychometrika. Jun. 1985;50(2):159–179. doi: 10.1007/BF02294245. [DOI] [Google Scholar]
  • 28.Dubes R.C. How many clusters are best? - an experiment. Pattern Recognit. 1987;20(6):645–663. doi: 10.1016/0031-3203(87)90034-3. [DOI] [Google Scholar]
  • 29.Bezdek J.C., Li W.Q., Attikiouzel Y., Windham M. A geometric approach to cluster validity for normal mixtures. Soft Comput. - A Fusion Found. Methodol. Appl. Dec. 1997;1(4):166–179. doi: 10.1007/s005000050019. [DOI] [Google Scholar]
  • 30.Ismail K.N., Seman A., Abu Samah K.A.F. 2021 IEEE 11th Int. Conf. Syst. Eng. Technol. ICSET 2021 - Proc. 2021. A comparison between external and internal cluster validity indices; pp. 229–233. November. [DOI] [Google Scholar]
  • 31.Gagolewski M., Bartoszuk M., Cena A. Are cluster validity measures (in) valid? Inf. Sci. 2021;581:620–636. doi: 10.1016/j.ins.2021.10.004. [DOI] [Google Scholar]
  • 32.Ikotun A.M., Ezugwu A.E. Enhanced firefly-K-means clustering with adaptive mutation and central limit theorem for automatic clustering of high-dimensional datasets. Appl. Sci. Nov. 2022;12(23) doi: 10.3390/app122312275. [DOI] [Google Scholar]
  • 33.José-García A., Gómez-Flores W. Automatic clustering using nature-inspired metaheuristics: a survey. Appl. Soft Comput. Apr. 2016;41:192–213. doi: 10.1016/J.ASOC.2015.12.001. [DOI] [Google Scholar]
  • 34.Ezugwu A.E., et al. Metaheuristics: a comprehensive overview and classification along with bibliometric analysis. Artif. Intell. Rev. Aug. 2021;54(6):4237–4316. doi: 10.1007/s10462-020-09952-0. [DOI] [Google Scholar]
  • 35.He H., Tan Y. A two-stage genetic algorithm for automatic clustering. Neurocomputing. Apr. 2012;81:49–59. doi: 10.1016/J.NEUCOM.2011.11.001. [DOI] [Google Scholar]
  • 36.Liu Y., Wu X., Shen Y. Automatic clustering using genetic algorithms. Appl. Math. Comput. Oct. 2011;218(4):1267–1279. doi: 10.1016/J.AMC.2011.06.007. [DOI] [Google Scholar]
  • 37.Das S., Abraham A., Konar A. Automatic clustering using an improved differential evolution algorithm. Syst. HUMANS. 2008;38(1) doi: 10.1109/TSMCA.2007.909595. [DOI] [Google Scholar]
  • 38.Kapoor S., Zeya I., Singhal C., Nanda S.J. A grey wolf optimizer based automatic clustering algorithm for satellite image segmentation. Procedia Comput. Sci. 2017;115:415–422. doi: 10.1016/j.procs.2017.09.100. [DOI] [Google Scholar]
  • 39.Saha S., Bandyopadhyay S. A generalized automatic clustering algorithm in a multiobjective framework. Appl. Soft Comput. J. 2013;13(1):89–108. doi: 10.1016/j.asoc.2012.08.005. [DOI] [Google Scholar]
  • 40.Tseng L.Y., Yang S.B. Genetic approach to the automatic clustering problem. Pattern Recognit. 2001;34(2):415–424. doi: 10.1016/S0031-3203(00)00005-4. [DOI] [Google Scholar]
  • 41.Garai G., Chaudhuri B.B. A novel genetic algorithm for automatic clustering. Pattern Recognit. Lett. 2004;25(2):173–187. doi: 10.1016/j.patrec.2003.09.012. [DOI] [Google Scholar]
  • 42.Ezugwu A.E., et al. A comprehensive survey of clustering algorithms: state-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Eng. Appl. Artif. Intell. 2022;110(December 2021) doi: 10.1016/j.engappai.2022.104743. [DOI] [Google Scholar]
  • 43.Cowgill M.C., Harvey R.J., Watson L.T. Genetic algorithm approach to cluster analysis. Comput. Math. Appl. 1999;37(7):99–108. doi: 10.1016/S0898-1221(99)00090-5. [DOI] [Google Scholar]
  • 44.Bezdek J.C., Pal N.R. Some new indexes of cluster validity. IEEE Trans. Syst. Man Cybern. Part B. Jun. 1998;28(3):301–315. doi: 10.1109/3477.678624. [DOI] [PubMed] [Google Scholar]
  • 45.Baker F.B., Hubert L.J. Measuring the power of hierarchical cluster analysis. J. Am. Stat. Assoc. 1975;70(349):31–38. [Google Scholar]
  • 46.Ball G.H., Hall D.J. ISODATA, a novel method of data analysis and pattern classification. Tech. Rep. NTIS. 1965;699616 [Google Scholar]
  • 47.Banfield Jeffrey D., Raftery Adrian E. Model-based Gaussian and non-Gaussian clustering. Int. Biometric Soc. 2019;49(3):803–821. [Google Scholar]
  • 48.Schwarz G. Estimating the dimension of a model. Ann. Stat. 1978;6(2):461–464. http://www.jstor.org/stable/2958889 [Google Scholar]
  • 49.Fortier J.J., Solomon H. Clustering procedures. Multivar. Anal. 1996;62 [Google Scholar]
  • 50.Davies D.L., Bouldin D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. Apr. 1979;PAMI-1(2):224–227. doi: 10.1109/TPAMI.1979.4766909. [DOI] [PubMed] [Google Scholar]
  • 51.Aggarwal C.C., Reddy C.K. 2014. Data Clustering : Algorithms and Application. [Google Scholar]
  • 52.Caliński T., Harabasz J. A dendrite method for cluster analysis. Commun. Stat. Theory Methods. 1974;3(1):1–27. [Google Scholar]
  • 53.Walesiak M., Dudek A. 2010. The clusterSim Package. [Google Scholar]
  • 54.Charrad M., Ghazzali N., Boiteau V., Niknafs A. Nbclust: an R package for determining the relevant number of clusters in a data set. J. Stat. Softw. 2014;61(6):1–36. doi: 10.18637/jss.v061.i06. [DOI] [Google Scholar]
  • 55.Nieweglowski . 2013. Clv: Cluster Validation Techniques. [Google Scholar]
  • 56.Dimitriadou E. 2023. Convex Clustering Methods and Clustering Indexes - Package Cclust. [Google Scholar]
  • 57.Corter J.E., Gluck M.A. Explaining basic categories: feature predictability and information. Psychol. Bull. Mar. 1992;111(2):291–303. doi: 10.1037/0033-2909.111.2.291. [DOI] [Google Scholar]
  • 58.Fisher D.H. Knowledge acquisition via incremental conceptual clustering. Mach. Learn. Sep. 1987;2(2):139–172. doi: 10.1007/BF00114265. [DOI] [Google Scholar]
  • 59.c Dalrymple-Alford E. Measurement of Clustering in free recall. Psychol. Bull. 1970;74(1) [Google Scholar]
  • 60.Hubert L.J., Levin J.R. A general statistical framework for assessing categorical clustering in free recall. Psychol. Bull. Nov. 1976;83(6):1072–1080. doi: 10.1037/0033-2909.83.6.1072. [DOI] [Google Scholar]
  • 61.Hennig C. Data Analysis and Applications 1. Wiley; 2019. Cluster validation by measurement of clustering characteristics relevant to the user; pp. 1–24. [DOI] [Google Scholar]
  • 62.Kosters W.A., Laros J.F.J. Research and Development in Intelligent Systems XXIV. Springer; London: 2008. Metrics for mining multisets; pp. 293–303. London. [DOI] [Google Scholar]
  • 63.Žalik K.R. An efficient k′-means clustering algorithm. Pattern Recognit. Lett. Jul. 2008;29(9):1385–1391. doi: 10.1016/j.patrec.2008.02.014. [DOI] [Google Scholar]
  • 64.Ezugwu A.E., Agbaje M.B., Aljojo N., Els R., Chiroma H., Elaziz M.A. A comparative performance study of hybrid firefly algorithms for automatic data clustering. IEEE Access. 2020;8:121089–121118. doi: 10.1109/ACCESS.2020.3006173. [DOI] [Google Scholar]
  • 65.De De-Condorcet N. Cambridge Univ. Press; 2014. Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix. [Google Scholar]
  • 66.Taylor A.D. Cambridge University Press.; 2005. Social Choice and the Mathematics of Manipulation. [Google Scholar]
  • 67.Nurmi H. vol. 13. Springer Science & Business Media; 2012. (Comparing Voting Systems). [Google Scholar]
  • 68.Vendramin L., Campello R.J.G.B., Hruschka E.R. Relative clustering validity criteria: a comparative overview. Stat. Anal. Data Min. 2010;3(4):209–235. doi: 10.1002/sam.10080. [DOI] [Google Scholar]
  • 69.Salazar E.J., Velez A.C., Parra C. A cluster validity index for comparing non-hierarchical clustering methods. Eiti. September 2002:1–5. https://www.researchgate.net/publication/2534590 2014, [Online]. Available: [Google Scholar]
  • 70.Halkidi M., Vazirgiannis M., Batistakis Y. Quality Scheme Assessment in the Clustering Process. 2000:265–276. doi: 10.1007/3-540-45372-5_26. [DOI] [Google Scholar]
  • 71.Xu R., Wunsch D. John Wiley & Sons.; 2008. Clustering. [Google Scholar]
  • 72.Dunn J.C. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J. Cybern. Jan. 1973;3(3):32–57. doi: 10.1080/01969727308546046. [DOI] [Google Scholar]
  • 73.Dunn J.C. Well-separated clusters and optimal fuzzy partitions. J. Cybern. Jan. 1974;4(1):95–104. doi: 10.1080/01969727408546059. [DOI] [Google Scholar]
  • 74.Marriott F.H.C. Practical problems in a method of cluster analysis. 1971. https://about.jstor.org/terms [Online]. Available: [PubMed]
  • 75.Rohlf F.J. Methods of comparing classifications. Annu. Rev. Ecol. Syst. 1974;5(1):101–113. [Google Scholar]
  • 76.Scott A.J., Symons M.J. Clustering methods based on likelihood ratio criteria. 1971. https://about.jstor.org/terms [Online]. Available:
  • 77.Mcclain J.O., Rao V.R. 1975. CLUSTISZ: A Program to Test for the Quality of Clustering of a Set of Objects. [Google Scholar]
  • 78.Hartigan J. John Wiley Sons Inc; 1975. Clustering Algorithms. [Google Scholar]
  • 79.Everitt Brian, Landau Sabine, Leese Morven, Stahl Daniel. fifth ed. Wiley; Chichester: 2011. Cluster Analysis. [Google Scholar]
  • 80.Lago-Fernández L.F., Corbacho F. Normality-based validation for crisp clustering. Pattern Recognit. Mar. 2010;43(3):782–795. doi: 10.1016/j.patcog.2009.09.018. [DOI] [Google Scholar]
  • 81.Hyvärinen A., Oja E. Independent component analysis: algorithms and applications. Neural Network. Jun. 2000;13(4–5):411–430. doi: 10.1016/S0893-6080(00)00026-5. [DOI] [PubMed] [Google Scholar]
  • 82.Rendon E., et al. 2008. NIVA: A Robust Cluster Validity, [Google Scholar]
  • 83.Drewes B. Knowledge Mining. Springer-Verlag; Berlin/Heidelberg: 2005. Some industrial applications of text mining; pp. 233–238. [DOI] [Google Scholar]
  • 84.Ackerman M., Ben-David S. 2008. Measures of Clustering Quality: A Working Set of Axioms for Clustering. [Google Scholar]
  • 85.Milligan G.W. A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika. Jun. 1981;46(2):187–199. doi: 10.1007/BF02293899. [DOI] [Google Scholar]
  • 86.Ratkowsky D.A., Lance G.N. A criterion for determining the number of groups in a classification. Aust. Comput. J. 1978;10(3):115–117. [Google Scholar]
  • 87.Ray S., Turi R.H. 1999. Determination of Number of Clusters in K-Means Clustering and Application in Colour Image Segmentation. [Google Scholar]
  • 88.Sharma S. John Wiley Sons, Inc..; 1995. Applied Multivariate Techniques. [Google Scholar]
  • 89.Rokach L., Maimon O. Clustering methods. Data Min. Knowl. Discov. Handb. 2005:321–352. [Google Scholar]
  • 90.Saitta S., Raphael B., Smith I.F.C. A bounded index for cluster validity. Lect. Notes Comput. Sci. 2007;4571 LNAI:174–187. doi: 10.1007/978-3-540-73499-4_14. [DOI] [Google Scholar]
  • 91.Wiroonsri N. Clustering performance analysis using a new correlation-based cluster validity index. Pattern Recognit. 2024;145(August 2023) doi: 10.1016/j.patcog.2023.109910. [DOI] [Google Scholar]
  • 92.Edwards A.W., Cavalli-Sforza L. A method for cluster analysis. Int. Biometric Soc. 1965;21(2):362–375. [PubMed] [Google Scholar]
  • 93.Kim M., Ramakrishna R.S. New indices for cluster validity assessment. Pattern Recognit. Lett. 2005;26(15):2353–2363. doi: 10.1016/j.patrec.2005.04.007. [DOI] [Google Scholar]
  • 94.Rousseeuw P.J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987;20(C):53–65. doi: 10.1016/0377-0427(87)90125-7. [DOI] [Google Scholar]
  • 95.Tsay R. John Wiley Sons, Inc..; 2005. Analysis of Financial Time Series. [Google Scholar]
  • 96.Hamilton J.D. Princet. Univ. Press; 1994. Time series analysis. [Google Scholar]
  • 97.Žalik K.R., Žalik B. Validity index for clusters of different sizes and densities. Pattern Recognit. Lett. 2011;32(2):221–234. doi: 10.1016/j.patrec.2010.08.007. [DOI] [Google Scholar]
  • 98.Friedman H.P., Rubin J. On some invariant criteria for grouping data. J. Am. Stat. Assoc. 1967;62(320):1159–1178. [Google Scholar]
  • 99.Xie X.L., Beni G. A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell. 1991;13(8):841–847. doi: 10.1109/34.85677. [DOI] [Google Scholar]
  • 100.Ball G.H. AFIPS Conf. Proc. - 1965 Fall Jt. Comput. Conf. AFIPS 1965; 1965. Data analysis in the social sciences: what about the details? pp. 533–559. [DOI] [Google Scholar]
  • 101.Agrawal K.P., Garg S., Patel P. Performance measures for densed and arbitrary shaped clusters. Comput. Sience Electron. Journals. 2015;6(2):338–350. [Google Scholar]
  • 102.Nagargoje A., Kankar P.K., Jain P.K., Tandon P. Performance evaluation of the data clustering techniques and cluster validity indices for efficient toolpath development for incremental sheet forming. J. Comput. Inf. Sci. Eng. Jun. 2021;21(3) doi: 10.1115/1.4048914. [DOI] [Google Scholar]
  • 103.Drton M., Plummer M. A Bayesian information criterion for singular models. J. R. Stat. Soc. Ser. B Stat. Methodol. 2017;79(2):323–380. doi: 10.1111/rssb.12187. [DOI] [Google Scholar]
  • 104.Wang L. Wilcoxon-type generalized Bayesian information criterion. Biometrika. Jan. 2009;96(1):163–173. doi: 10.1093/biomet/asn060. [DOI] [Google Scholar]
  • 105.Zichen W., Zhengqiang P., Zhijun C., Yanlin W. Research on evaluation indices and calculation method of experimental design. Qual. Reliab. Eng. Int. 2023;39(5):1909–1934. doi: 10.1002/qre.3337. [DOI] [Google Scholar]
  • 106.Duan X., Ma Y., Zhou Y., Huang H., Wang B. A novel cluster validity index based on augmented non-shared nearest neighbors. Expert Syst. Appl. Aug. 2023;223 doi: 10.1016/j.eswa.2023.119784. [DOI] [Google Scholar]
  • 107.Castillo A., Castellanos A., VanderMeer D. vol. 2469. CEUR Workshop Proc.; 2019. pp. 84–97. (Inferring Structure for Design: an Inductive Approach to Ontology Generation). [Google Scholar]
  • 108.Sharma N., Bajpai A., Litoriya R. Comparison the various clustering algorithms of weka tools. Int. J. Emerg. Technol. Adv. Eng. 2012;2(5):73–80. [Google Scholar]
  • 109.Slaoui S.C., Lamari Y. Clustering of large data based on the relational analysis. 2015 Intell. Syst. Comput. Vision, ISCV. 2015;2015 doi: 10.1109/ISACV.2015.7105550. [DOI] [Google Scholar]
  • 110.Ros F., Riad R., Guillaume S. vol. 528. 2023. (PDBI: A Partitioning Davies-Bouldin Index for Clustering Evaluation). [DOI] [Google Scholar]
  • 111.Tomasini C., Borges E.N., Machado K., Emmendorfer L. Proceedings of the 19th International Conference on Enterprise Information Systems. 2017. A study on the relationship between internal and external validity indices applied to partitioning and density-based clustering algorithms; pp. 89–98. [DOI] [Google Scholar]
  • 112.Aschenbruck R., Szepannek G. Cluster validation for mixed-type data. Arch. Data Sci. Ser. A. 2020;6(1):2. doi: 10.5445/KSP/1000098011/02. [DOI] [Google Scholar]
  • 113.Maulik U., Bandyopadhyay S. Performance evaluation of some clustering algorithms and validity indices. IEEE Trans. Pattern Anal. Mach. Intell. Dec. 2002;24(12):1650–1654. doi: 10.1109/TPAMI.2002.1114856. [DOI] [Google Scholar]
  • 114.Saitta S., Raphael B., Smith I.F.C. A comprehensive validity index for clustering. Intell. Data Anal. 2008;12(6):529–548. doi: 10.3233/IDA-2008-12602. [DOI] [Google Scholar]
  • 115.Saltos R., Weber R. Generalized black hole clustering algorithm. Pattern Recognit. Lett. 2023;176(January):196–201. doi: 10.1016/j.patrec.2023.11.006. [DOI] [Google Scholar]
  • 116.Powell B.A. How I learned to stop worrying and love the curse of dimensionality: an appraisal of cluster validation in high-dimensional spaces. 2022:1–20. http://arxiv.org/abs/2201.05214 [Online]. Available: [Google Scholar]
  • 117.Lago-Fernández L.F., Corbacho F. Using the Negentropy Increment to Determine the Number of Clusters. 2009:448–455. doi: 10.1007/978-3-642-02478-8_56. [DOI] [Google Scholar]
  • 118.Lago-Fernández L.F., Sánchez-Montañés M., Corbacho F. The effect of low number of points in clustering validation via the negentropy increment. Neurocomputing. 2011;74(16):2657–2664. doi: 10.1016/j.neucom.2011.03.023. [DOI] [Google Scholar]
  • 119.Vergara V.M., Salman M., Abrol A., Espinoza F.A., Calhoun V.D. Determining the number of states in dynamic functional connectivity using cluster validity indexes. J. Neurosci. Methods. 2020;337 doi: 10.1016/j.jneumeth.2020.108651. August 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 120.Niu J., Li Z., Salvendy G. Multi-resolution shape description and clustering of three-dimensional head data. Ergonomics. 2009;52(2):251–269. doi: 10.1080/00140130802334561. [DOI] [PubMed] [Google Scholar]
  • 121.Fu S., Lu S.Y., Davies D.L., Bouldin D.W. The string-to-string correction problem, 1979 doi: 10.1109/TPAMI.1979.4766909. [DOI] [Google Scholar]
  • 122.Angel Latha Mary S., Sivagami A.N., Usha Rani M. Cluster validity measures dynamic clustering algorithms. ARPN J. Eng. Appl. Sci. 2015;10(9):4009–4012. [Google Scholar]
  • 123.Kovács F., Legány C., Babos A. Cluster validity measurement techniques.
  • 124.Omran M.G.H., Salman A., Engelbrecht A.P. Dynamic clustering using particle swarm optimization with application in image segmentation. Pattern Anal. Appl. 2006;8(4):332–344. doi: 10.1007/s10044-005-0015-5. [DOI] [Google Scholar]
  • 125.Masoud H., Jalili S., Hasheminejad S.M.H. Dynamic clustering using combinatorial particle swarm optimization. Appl. Intell. 2013;38(3):289–314. doi: 10.1007/s10489-012-0373-9. [DOI] [Google Scholar]
  • 126.Ling H.L., Wu J.S., Zhou Y., Zheng W.S. How many clusters? A robust PSO-based local density model. Neurocomputing. 2016;207:264–275. doi: 10.1016/j.neucom.2016.03.071. [DOI] [Google Scholar]
  • 127.Kuo R.J., Zulvia F.E. Automatic clustering using an improved particle swarm optimization. 2013;1(1):46–51. doi: 10.12720/jiii.1.1.46-51. [DOI] [Google Scholar]
  • 128.Nanda S.J., Panda G. Automatic clustering algorithm based on multi-objective Immunized PSO to classify actions of 3D human models. Eng. Appl. Artif. Intell. May 2013;26(5–6):1429–1441. doi: 10.1016/j.engappai.2012.11.008. [DOI] [Google Scholar]
  • 129.Handl J., Knowles J. Evolutionary Multiobjective Clustering. 2004:1081–1091. doi: 10.1007/978-3-540-30217-9_109. [DOI] [Google Scholar]
  • 130.Handl J., Knowles J. An evolutionary approach to multiobjective clustering. IEEE Trans. Evol. Comput. Feb. 2007;11(1):56–76. doi: 10.1109/TEVC.2006.877146. [DOI] [Google Scholar]
  • 131.Das S., Abraham A., Konar A. Automatic kernel clustering with a multi-elitist particle swarm optimization algorithm. Pattern Recognit. Lett. 2008;29(5):688–699. doi: 10.1016/j.patrec.2007.12.002. [DOI] [Google Scholar]
  • 132.Abubaker A., Baharum A., Alrefaei M. Automatic clustering using multi-objective particle swarm and simulated annealing. PLoS One. Jul. 2015;10(7) doi: 10.1371/journal.pone.0130995. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 133.Lee W.P., Chen S.W. Automatic clustering with differential evolution using cluster number oscillation method. Proc. - 2010 2nd Int. Work. Intell. Syst. Appl. ISA 2010. 2010;(1):1–4. doi: 10.1109/IWISA.2010.5473289. [DOI] [Google Scholar]
  • 134.Saha I., Maulik U., Bandyopadhyay S. 2009 IEEE International Advance Computing Conference. Mar. 2009. A new differential evolution based fuzzy clustering for automatic cluster evolution; pp. 706–711. [DOI] [Google Scholar]
  • 135.Maulik U., Saha I. Differential evolution for image classification. Ieee Trans. Geosci. Remote Sens. 2010;48(9):3503–3510. [Google Scholar]
  • 136.Suresh K., Kundu D., Ghosh S., Das S., Abraham A. 2009. Data Clustering Using Multi-Objective Differential Evolution Algorithms; pp. 1001–1024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 137.Kundu D., Suresh K., Ghosh S., Das S., Abraham A., Badr Y. In: In Hybrid Artificial Intelligence Systems. HAIS 2009. Corchado E., Wu X., Oja E., Herrero Á., Baruque B., editors. vol. 5572. Springer; Berlin, Heidelberg: 2009. Automatic clustering using a synergy of genetic algorithm and multi-objective differential evolution; pp. 177–186. (Lecture Notes in Computer Science). [DOI] [Google Scholar]
  • 138.Zhong Y., Zhang S., Zhang L. Automatic fuzzy clustering based on adaptive multi-objective differential evolution for remote sensing imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013;6(5):2290–2301. doi: 10.1109/JSTARS.2013.2240655. [DOI] [Google Scholar]
  • 139.He H., Tan Y. A two-stage genetic algorithm for automatic clustering. Neurocomputing. 2012;81:49–59. doi: 10.1016/j.neucom.2011.11.001. [DOI] [Google Scholar]
  • 140.Rahman M.A., Islam M.Z. A hybrid clustering technique combining a novel genetic algorithm with K-Means. Knowledge-Based Syst. 2014;71:345–365. doi: 10.1016/j.knosys.2014.08.011. [DOI] [Google Scholar]
  • 141.Ozturk C., Hancer E., Karaboga D. Dynamic clustering with improved binary artificial bee colony algorithm. Appl. Soft Comput. J. 2015;28:69–80. doi: 10.1016/j.asoc.2014.11.040. [DOI] [Google Scholar]
  • 142.Turi R.H. Monash University; 2001. Clustering-based Colour Image Segmentation. [Google Scholar]
  • 143.Hamerly G., Elkan C. Proceedings of the Eleventh International Conference on Information and Knowledge Management. Nov. 2002. Alternatives to the k-means algorithm that find better clusterings; pp. 600–607. [DOI] [Google Scholar]
  • 144.Kuo R.J., Syu Y.J., Chen Z.Y., Tien F.C. Integration of particle swarm optimization and genetic algorithm for dynamic clustering. Inf. Sci. 2012;195:124–140. doi: 10.1016/j.ins.2012.01.021. [DOI] [Google Scholar]
  • 145.Kuo R.J., Zulvia F.E. Automatic clustering using an improved artificial bee colony optimization for customer segmentation. Knowl. Inf. Syst. Nov. 2018;57(2):331–357. doi: 10.1007/s10115-018-1162-5. [DOI] [Google Scholar]
  • 146.Zhu Q., Tang X., Elahi A. Automatic clustering based on dynamic parameters harmony search optimization algorithm. Pattern Anal. Appl. 2022;25(4):693–709. doi: 10.1007/s10044-022-01065-4. [DOI] [Google Scholar]
  • 147.Shim Y., Jiwon C., In-Chan C. International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce. 2005. A comparison study of cluster validity indices using a non-hierarchical clustering algorithm; pp. 199–204. [Google Scholar]
  • 148.Aouf M., Lyanage L., Hansen S. 5th Int. Conf. Serv. Syst. Serv. Manag. - Explor. Serv. Dyn. With Sci. Innov. Technol. ICSSSM’08. 2008. Review of data mining clustering techniques to analyze data with high dimensionality as applied in gene expression data (June 2008) [DOI] [Google Scholar]
  • 149.Rui T., Fong S., Yang X.-S., Deb S. 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology. Dec. 2012. Nature-inspired clustering algorithms for web intelligence data; pp. 147–153. [DOI] [Google Scholar]
  • 150.Mahdavi M., Chehreghani M.H., Abolhassani H., Forsati R. Novel meta-heuristic algorithms for clustering web documents. Appl. Math. Comput. Jul. 2008;201(1–2):441–451. doi: 10.1016/j.amc.2007.12.058. [DOI] [Google Scholar]
  • 151.Janani R., Vijayarani S. Automatic text classification using machine learning and optimization algorithms. Soft Comput. Jan. 2021;25(2):1129–1145. doi: 10.1007/s00500-020-05209-8. [DOI] [Google Scholar]
  • 152.Nadi S., Saraee M.H., Jazi M.D., Bagheri A. FARS: fuzzy ant based recommender system for web users. IJCSI Int. J. Comput. Sci. Issues. 2011;8(1) www.IJCSI.org [Online]. Available: [Google Scholar]
  • 153.Juang B.H., Chen Tsuhan. The past, present, and future of speech processing. IEEE Signal Process. Mag. May 1998;15(3):24–48. doi: 10.1109/79.671130. [DOI] [Google Scholar]
  • 154.Furui S. Research of individuality features in speech waves and automatic speaker recognition techniques. Speech Commun. Jun. 1986;5(2):183–197. doi: 10.1016/0167-6393(86)90007-5. [DOI] [Google Scholar]
  • 155.Sonkamble B.A., Doye D.D. Speech recognition using vector quantization through modified K-means LBG algorithm. 2012;3(7) www.iiste.org [Online]. Available: [Google Scholar]
  • 156.Neel J., Carlson R. 2005. Cluster Analysis Methods for Speech Recognition. [Google Scholar]
  • 157.Bach F.R., Org F.B., Edu J.B. Jordan; 2006. Learning Spectral Clustering, with Application to Speech Separation Michael I. [Google Scholar]
  • 158.Black A.W., Taylor P. Automatically clustering similar units for unit selection in speech synthesis. 1997. http://www.cstr.ed.ac.uk [Online]. Available:
  • 159.Wu X., Wu Z., Jia J., Meng H., Cai L., Li W. The 9th International Symposium on Chinese Spoken Language Processing. Sep. 2014. Automatic speech data clustering with human perception based weighted distance; pp. 216–220. [DOI] [Google Scholar]
  • 160.Aldhyani T.H.H., Alshebami A.S., Alzahrani M.Y. Soft clustering for enhancing the diagnosis of chronic diseases over machine learning algorithms. J. Healthc. Eng. 2020;2020 doi: 10.1155/2020/4984967. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 161.Waheed A., Akram M.U., Khalid S., Waheed Z., Khan M.A., Shaukat A. Hybrid features and mediods classification based robust segmentation of blood vessels. J. Med. Syst. Oct. 2015;39(10):128. doi: 10.1007/s10916-015-0316-1. [DOI] [PubMed] [Google Scholar]
  • 162.Abd Elaziz M., Al-qaness M.A.A., Abo Zaid E.O., Lu S., Ali Ibrahim R., Ewees A.A. Automatic clustering method to segment COVID-19 CT images. PLoS One. Jan. 2021;16(1) doi: 10.1371/journal.pone.0244416. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 163.Ai L., Gao X., Xiong J. Application of mean-shift clustering to Blood oxygen level dependent functional MRI activation detection. BMC Med. Imaging. Dec. 2014;14(1):6. doi: 10.1186/1471-2342-14-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 164.Rastgarpour M., Shanbehzadeh J. A new kernel-based fuzzy level set method for automated segmentation of medical images in the presence of intensity inhomogeneity. Comput. Math. Methods Med. 2014;2014:1–14. doi: 10.1155/2014/978373. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 165.Saha S., Alok A.K., Ekbal A. Brain image segmentation using semi-supervised clustering. Expert Syst. Appl. Jun. 2016;52:50–63. doi: 10.1016/j.eswa.2016.01.005. [DOI] [Google Scholar]
  • 166.Kuo R.J., Huang Y.D., Lin C.C., Wu Y.H., Zulvia F.E. Automatic kernel clustering with bee colony optimization algorithm. Inf. Sci. Nov. 2014;283:107–122. doi: 10.1016/J.INS.2014.06.019. [DOI] [Google Scholar]
  • 167.Oyelade J., et al. Clustering Algorithms : Their Application to Gene Expression Data. 2016:237–253. doi: 10.4137/BBI.S38316. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 168.Yu J., Li H., Liu D. Modified Immune evolutionary algorithm for medical data clustering and feature extraction under cloud computing environment. J. Healthc. Eng. Jan. 2020;2020:1–11. doi: 10.1155/2020/1051394. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 169.Mousavirad S.J., Ebrahimpour-Komleh H., Schaefer G. Automatic clustering using a local search-based human mental search algorithm for image segmentation. Appl. Soft Comput. Nov. 2020;96 doi: 10.1016/j.asoc.2020.106604. [DOI] [Google Scholar]
  • 170.Mousavirad S.J., Schaefer G., Moghadam M.H., Saadatmand M., Pedram M. Proceedings of the Genetic and Evolutionary Computation Conference Companion. Jul. 2021. A population-based automatic clustering algorithm for image segmentation; pp. 1931–1936. [DOI] [Google Scholar]
  • 171.Lei T., Liu P., Jia X., Zhang X., Meng H., Nandi A.K. Automatic fuzzy clustering framework for image segmentation. IEEE Trans. Fuzzy Syst. Sep. 2020;28(9):2078–2092. doi: 10.1109/TFUZZ.2019.2930030. [DOI] [Google Scholar]
  • 172.Kumar V., Chhabra J.K., Kumar D. Automatic cluster evolution using gravitational search algorithm and its application on image segmentation. Eng. Appl. Artif. Intell. 2014;29:93–103. doi: 10.1016/j.engappai.2013.11.008. [DOI] [Google Scholar]
  • 173.Kumar V., Chhabra J.K., Kumar D. Automatic data clustering using parameter adaptive harmony search algorithm and its application to image segmentation. J. Intell. Syst. Oct. 2016;25(4):595–610. doi: 10.1515/jisys-2015-0004. [DOI] [Google Scholar]
  • 174.Das S., Konar A. Automatic image pixel clustering with an improved differential evolution. Appl. Soft Comput. Jan. 2009;9(1):226–236. doi: 10.1016/J.ASOC.2007.12.008. [DOI] [Google Scholar]
  • 175.Ezugwu A.E., et al. A comprehensive survey of clustering algorithms: state-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Eng. Appl. Artif. Intell. Apr. 2022;110 doi: 10.1016/j.engappai.2022.104743. [DOI] [Google Scholar]
  • 176.Bhopale A.P., Tiwari A. Swarm optimized cluster based framework for information retrieval. Expert Syst. Appl. 2020;154(Sep) doi: 10.1016/j.eswa.2020.113441. [DOI] [Google Scholar]
  • 177.Djenouri Y., Belhadi A., Belkebir R. Bees swarm optimization guided by data mining techniques for document information retrieval. Expert Syst. Appl. Mar. 2018;94:126–136. doi: 10.1016/j.eswa.2017.10.042. [DOI] [Google Scholar]
  • 178.Rawashdeh M., Ralescu A. 2012 Annual Meeting of the North American Fuzzy Information Processing Society (NAFIPS) Aug. 2012. Crisp and fuzzy cluster validity: generalized intra-inter silhouette index; pp. 1–6. [DOI] [Google Scholar]
  • 179.Khennak I., Drias H. Bat Algorithm for Efficient Query Expansion: Application to MEDLINE. 2016:113–122. doi: 10.1007/978-3-319-31232-3_11. [DOI] [Google Scholar]
  • 180.Khennak I., Drias H. An accelerated PSO for query expansion in web information retrieval: application to medical dataset. Appl. Intell. Oct. 2017;47(3):793–808. doi: 10.1007/s10489-017-0924-1. [DOI] [Google Scholar]
  • 181.Khennak I., Drias H. Bat-inspired algorithm based query expansion for medical web information retrieval. J. Med. Syst. Feb. 2017;41(2):34. doi: 10.1007/s10916-016-0668-1. [DOI] [PubMed] [Google Scholar]
  • 182.Khennak I., Drias H. Proceedings of the International Conference on Learning and Optimization Algorithms: Theory and Applications. May 2018. Data mining techniques and nature-inspired algorithms for query expansion; pp. 1–6. [DOI] [Google Scholar]
  • 183.Sharma M., Chhabra J.K. Sustainable automatic data clustering using hybrid PSO algorithm with mutation. Sustain. Comput. Informatics Syst. 2019;23:144–157. doi: 10.1016/j.suscom.2019.07.009. [DOI] [Google Scholar]
  • 184.Gupta Y., Saini A. A novel Fuzzy-PSO term weighting automatic query expansion approach using combined semantic filtering. Knowledge-Based Syst. Nov. 2017;136:97–120. doi: 10.1016/j.knosys.2017.09.004. [DOI] [Google Scholar]
  • 185.Khalifi H., Cherif W., El Qadi A., Ghanou Y. Query expansion based on clustering and personalized information retrieval. Prog. Artif. Intell. Jun. 2019;8(2):241–251. doi: 10.1007/s13748-019-00178-y. [DOI] [Google Scholar]
  • 186.Kiran Sree P., Raju G.V.S., Ramesh Babu I., Viswanadha Raju S. Improving quality of clustering using cellular Automata for information retrieval. J. Comput. Sci. 2008;4(2):167–171. [Google Scholar]
  • 187.Bouadjenek M.R., Sanner S., Du Y. Relevance- and interface-driven clustering for visual information retrieval. Inf. Syst. 2020;94 doi: 10.1016/j.is.2020.101592. [DOI] [Google Scholar]
  • 188.Subhashini R., Kumar V.J.S. Proceedings - 1st International Conference on Integrated Intelligent Computing. ICIIC 2010; 2010. Evaluating the performance of similarity measures used in document clustering and information retrieval; pp. 27–31. [DOI] [Google Scholar]
  • 189.Subramaniam M., Kathirvel A., Sabitha E., Basha H.A. Modified firefly algorithm and fuzzy c-mean clustering based semantic information retrieval. J. Web Eng. Feb. 2021;20(1):33–52. doi: 10.13052/jwe1540-9589.2012. [DOI] [Google Scholar]
  • 190.Mohammed A.J., Yusof Y., Husni H. Nature Inspired Data Mining Algorithm for Document Clustering in Information Retrieval. 2014:382–393. doi: 10.1007/978-3-319-12844-3_33. [DOI] [Google Scholar]
  • 191.Tang J., Liu L., Wu J., Zhou J., Xiang Y. Trajectory clustering method based on spatial-temporal properties for mobile social networks. J. Intell. Inf. Syst. Feb. 2021;56(1):73–95. doi: 10.1007/s10844-020-00607-8. [DOI] [Google Scholar]
  • 192.Yang Z., Tang R., Chen Y., Wang B. Spatial–temporal clustering and optimization of aircraft descent and approach trajectories. Int. J. Aeronaut. Sp. Sci. Dec. 2021;22(6):1512–1523. doi: 10.1007/s42405-021-00401-y. [DOI] [Google Scholar]
  • 193.Li H., Liu J., Liu R.W., Xiong N., Wu K., Kim T.H. A dimensionality reduction-based multi-step clustering method for robust vessel trajectory analysis. Sensors. Aug. 2017;17(8) doi: 10.3390/s17081792. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 194.Guerreiro M.T., et al. Anomaly detection in automotive industry using clustering methods—a case study. Appl. Sci. Nov. 2021;11(21) doi: 10.3390/app11219868. [DOI] [Google Scholar]
  • 195.Hasheminejad S.A., Shabaab M., Javadinarab N. Developing cluster-based adaptive network fuzzy inference system tuned by particle swarm optimization to forecast annual automotive sales: a case study in Iran market. Int. J. Fuzzy Syst. 2022 doi: 10.1007/s40815-022-01263-6. [DOI] [Google Scholar]
  • 196.Tran H.N., Nguyen T.T., Cao H.Q., Nguyen T.H., Nguyen H.X., Jeon J.W. Auto-tuning controller using MLPSO with K-means clustering and adaptive learning strategy for PMSM drives. IEEE Access. 2022;10:18820–18831. doi: 10.1109/ACCESS.2022.3150777. [DOI] [Google Scholar]
  • 197.Feng A. 2023 6th International Conference on Artificial Intelligence and Big Data. vol. 2023. ICAIBD; 2023. Automotive product analysis based on MP-DP-kmeans clustering; pp. 305–311. [DOI] [Google Scholar]
  • 198.Saxena A., et al. A review of clustering techniques and developments. Neurocomputing. Dec. 2017;267:664–681. doi: 10.1016/j.neucom.2017.06.053. [DOI] [Google Scholar]
  • 199.Karim M.R., et al. Deep learning-based clustering approaches for bioinformatics. Brief. Bioinform. Jan. 2021;22(1):393–415. doi: 10.1093/bib/bbz170. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 200.Liao L., Li K., Li K., Yang C., Tian Q. A multiple kernel density clustering algorithm for incomplete datasets in bioinformatics. BMC Syst. Biol. Nov. 2018;12 doi: 10.1186/s12918-018-0630-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 201.Liao L., Li K., Li K., Tian Q., Yang C. 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) Nov. 2017. Automatic density clustering with multiple kernels for high-dimension bioinformatics data; pp. 2105–2112. [DOI] [Google Scholar]
  • 202.Banu P.K.N., Andrews S. Gene clustering using metaheuristic optimization algorithms. Int. J. Appl. Metaheuristic Comput. (IJAMC) Oct. 2015;6(4):14–38. doi: 10.4018/IJAMC.2015100102. [DOI] [Google Scholar]
  • 203.Badr Y.A., Abou El-Naga A.H. 5th International Conference on Computing and Informatics, ICCI 2022. 2022. A hybrid metaheuristic approach for automatic clustering of Breast cancer; pp. 392–399. [DOI] [Google Scholar]
  • 204.Doostparast Torshizi A., Fazel Zarandi M.H. Alpha-plane based automatic general type-2 fuzzy clustering based on simulated annealing meta-heuristic algorithm for analyzing gene expression data. Comput. Biol. Med. Sep. 2015;64:347–359. doi: 10.1016/j.compbiomed.2014.06.017. [DOI] [PubMed] [Google Scholar]
  • 205.Badr Y.A., Wassif K.T., Othman M. Automatic clustering of DNA sequences with intelligent techniques. IEEE Access. 2021;9:140686–140699. doi: 10.1109/ACCESS.2021.3119560. [DOI] [Google Scholar]
  • 206.Alomari O.A., Khader A.T., Azmi Al-Betar M., Abualigah L.M. Mrmr ba: a hybrid gene selection algorithm for cancer classification. J. Theor. Appl. Inf. Technol. 2017;30(12) www.jatit.org [Online]. Available: [Google Scholar]
  • 207.Karlik B. 2013. Soft Computing Methods in Bioinformatics: A Comprehensive Review. [Google Scholar]
  • 208.Jain Arpit, Agrawal Shikha, Agrawal Jitendra, Sharma Sanjeev. Analysis of population based metaheuristic used for gene clustering. Int. J. Comput. Commun. Eng. 2013;2(2) [Google Scholar]
  • 209.Ikotun A.M., Ezugwu A.E. Boosting k-means clustering with symbiotic organisms search for automatic clustering problems. PLoS One. Aug. 2022;17(8) doi: 10.1371/journal.pone.0272861. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 210.Arbelaitz O., Gurrutxaga I., Muguerza J., Pérez J.M., Perona I. An extensive comparative study of cluster validity indices. Pattern Recognit. 2013;46(1):243–256. doi: 10.1016/j.patcog.2012.07.021. [DOI] [Google Scholar]
  • 211.Kelly A., Longjohn R., Nottingham K. Univ. California, Sch. Inf. Comput. Sci.; Irvine, CA, USA: 2023. UCI Machine Learning Repository.https://archive.ics.uci.edu [Online]. Available: [Google Scholar]
  • 212.Jain A.K., Law M.H.C. 2005. Data Clustering : A User ’ S Dilemma; pp. 1–10. [Google Scholar]
  • 213.Fu L., Medico E. vol. 15. 2007. pp. 1–15. (FLAME , a Novel Fuzzy Clustering Method for the Analysis of DNA Microarray Data). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 214.Chang H., Yeung D. Robust path-based spectral clustering. 2008;41:191–203. doi: 10.1016/j.patcog.2007.04.010. [DOI] [Google Scholar]
  • 215.Abraham A., Das S., Roy S. Swarm intelligence algorithms for data clustering. Soft Comput. Knowl. Discov. Data Min. 2008:279–313. doi: 10.1007/978-0-387-69935-6_12. [DOI] [Google Scholar]
  • 216.Sheldon M.R., Fillyaw M.J., Thompson W.D. The use and interpretation of the Friedman test in the analysis of ordinal‐scale data in repeated measures designs. Physiother. Res. Int. Nov. 1996;1(4):221–228. doi: 10.1002/pri.66. [DOI] [PubMed] [Google Scholar]
  • 217.Nemenyi P.B. Princeton University; 1963. Distribution-free Multiple Comparisons. [Google Scholar]
  • 218.Hazra A. Using the confidence interval confidently. J. Thorac. Dis. Oct. 2017;9(10):4124–4129. doi: 10.21037/jtd.2017.09.14. [DOI] [PMC free article] [PubMed] [Google Scholar]


