VIASCKDE Index: A Novel Internal Cluster Validity Index for Arbitrary-Shaped Clusters Based on the Kernel Density Estimation

Ali Şenol

doi:10.1155/2022/4059302

. 2022 Jun 8;2022:4059302. doi: 10.1155/2022/4059302

VIASCKDE Index: A Novel Internal Cluster Validity Index for Arbitrary-Shaped Clusters Based on the Kernel Density Estimation

Ali Şenol ^1,^✉

PMCID: PMC9200537 PMID: 35720897

Abstract

The cluster evaluation process is of great importance in areas of machine learning and data mining. Evaluating the clustering quality of clusters shows how much any proposed approach or algorithm is competent. Nevertheless, evaluating the quality of any cluster is still an issue. Although many cluster validity indices have been proposed, there is a need for new approaches that can measure the clustering quality more accurately because most of the existing approaches measure the cluster quality correctly when the shape of the cluster is spherical. However, very few clusters in the real world are spherical. Therefore, a new Validity Index for Arbitrary-Shaped Clusters based on the kernel density estimation (the VIASCKDE Index) to overcome the mentioned issue was proposed in the study. In the VIASCKDE Index, we used separation and compactness of each data to support arbitrary-shaped clusters and utilized the kernel density estimation (KDE) to give more weight to the denser areas in the clusters to support cluster compactness. To evaluate the performance of our approach, we compared it to the state-of-the-art cluster validity indices. Experimental results have demonstrated that the VIASCKDE Index outperforms the compared indices.

1. Introduction

Clustering approaches are unsupervised learning techniques that separate data into groups called clusters according to the similarities and dissimilarities among the data [1, 2]. The DBSCAN [3], kmeans [4], BIRCH [5], Spectral Clustering [6], Agglomerative Clustering [7], HDBSCAN [8], Affinity Propagation [9], and OPTICS [10] are some examples of them, and they are used in many fields such as pattern recognition [11–13], machine learning [14–16], data mining [17, 18], web mining [1, 19], bioinformatics [20, 21], and streaming data mining [22, 23]. On the other hand, measuring the performance of any proposed clustering approach is also an important issue because each algorithm has its special point of view, and the results of each clustering technique vary. Therefore, to overcome this problem, cluster validation analysis or cluster validation indices have emerged. These approaches are generally used for two purposes, which are measuring the performance of clustering algorithms and contributing to clustering algorithms as a guide by finding the optimum number of clusters.

Cluster validation indices are divided into two main categories as internal and external indices. In external indices, true class labels are compared with the labels that are assigned by the proposed algorithm to measure the performance. Therefore, to use these indices, there is a need for true class labels. The Purity [24], Rand Index [25], Adjusted Rand Index [26], Accuracy, Precision and Recall [27], F-Measure [28], and NMI [29] can be given as examples of these types of indices. On the other hand, in the internal indices, we do not need actual class labels to measure the quality of clusters. In these indices, the evaluation of clustering performance is based on how similar the data in the same cluster are to each other, known as compactness, and how dissimilar the data in different clusters are from each other, known as separation. The Silhouette Index (SI) [30], Dunn Index [31], Davies–Bouldin (DB) [32], Calinski-Harabasz (CH) [33], Xie-Beni (XB) [34], S_Dbw [35], and RMSSTD [36] can be mentioned as primary cluster validity indices. Besides, there are many new cluster validity indices such as the CVNN [37], CVDD [38], DSI [39], SCV [40], and AWCD [41].

The main problem of the majority of state-of-the-art cluster validity indices is that they measure the cluster quality correctly when the shapes of the clusters are spherical. As an example, Silhouette Index (SI) uses the means of distances of each data in the cluster to evaluate their quality. Similarly, Davies–Bouldin (DB) uses cluster diameters and cluster centroids, and the Calinski-Harabasz (CH) uses the square of intracluster and intercluster distances. These all calculations are ideal if the shape of the cluster is spherical. However, the shapes of the minority of clusters are spherical in the real world. Additionally, if the shape is arbitrary, these indices cannot measure the cluster quality correctly because the center of gravity of any cluster is in the middle only if the shape is spherical.

Similar to our approach, there is another kernel density estimation-based cluster validation index, named the M_clus [42]. In the M_clus, the authors used a function of estimation of the mode to assess cluster quality. This mode function allows the index to assess the cluster quality by adopting interpoint distance measures that can be defined to have a probability density function. To evaluate clustering with the number of clusters greater than 1 (K > 1), they applied the mode estimation procedure for interpoint distances that are assumed to have a probability density function between the data members. On the other hand, in this study, we proposed a novel Internal Validity Index for Arbitrary-Shaped Clusters based on the kernel density estimation (the VIASCKDE Index). We aimed to calculate the cluster quality accurately by using compactness and separation of each data to support arbitrary-shaped clusters and the kernel density estimation (KDE) to weight denser regions in the clusters to the compactness of the clusters. Therefore, the advantages of our new approach can be listed as follows:

The VIASCKDE Index can evaluate arbitrary-shaped clusters correctly
It weights denser regions to support the compactness of clusters
It is suitable for all types of clustering techniques, especially for density-based algorithms
It can be used for micro-cluster-based approaches
It has greater performance when compared with state-of-the-art techniques

The rest of this paper was organized as follows: in Section 2, the related studies were reviewed. In the 3^rd section, the problem with existing works and the need for the proposed approach was explained. While details about the VIASCKDE Index were given in the 4^th section, the comparison of experimental results with the state-of-the-art approaches on real and synthetic datasets was given in the 5^th section. After that, the discussion on the results was provided in Section 6. Finally, the conclusion of the study was presented in Section 7.

2. Background and Related Works

As cluster validation techniques, in internal methods, we do not need the actual class labels. The cluster validation operation is done by calculating the similarities in the intraclusters and the differences in the interclusters produced by the model to reveal how consistent the produced clusters are [43]. As mentioned above, in the internal methods, cluster quality is evaluated in the aspects of two concepts [44]:

Compactness: it states how much the data, which is in the same cluster, are close to each other. Closer data mean better clustering.
Separation: it evaluates how much the clusters are far from each other. In the clustering evaluation, it is expected to be far from each other as much as possible.

The illustration of these two concepts is presented in Figure 1, while the equation is demonstrated in Eq. (1). Here, α and β are the weights.

\begin{matrix} Index = \frac{α • Compactness}{β • Separation} . \end{matrix}

(1)

The example of the relationship between the compactness and separation concepts of two clusters in a two-dimensional data space.

There are many internal methods proposed in the literature. In this section, we focused on the validation indices that are relevant to our approach. To make definitions shorter and more understandable, the general definitions are as follows:

Let X = {x₁, x₂,…,x_n} ∈ R^d be a dataset containing n points in a d-dimensional space, and x_i ∈ R^d. X is a set of disjoint k clusters (where C_i is a cluster and i = 1,2,3,…,k), and n_i data are in the C_i cluster. While the cluster center that is the gravity center of cluster C_i is the mean of the data that belongs to C_i and calculated by μ=1/n_i∑_{x_i∈C_i}x_i, the mean of all datasets is calculated by $\bar{μ} = 1 / n \sum_{x \in X} x$ . In the present study, the mentioned distance is the Euclidean distance; one of each x and y is data of the dataset, and the Euclidean distance between these two data is expressed as d_e( x , y ). In light of this information, we can briefly list the main internal cluster validity indices as follows:

Silhouette Index (SI) [30]: as given in Figure 2, the compactness value of one of the data in any cluster is calculated by measuring the distance from the data to each data in the same cluster. Then, the compactness of the cluster, which is notated as a(x), is calculated by measuring the mean of compactness of all the data that the cluster has. The average of the distances from the elements of the nearest cluster, to which the mentioned data do not belong, gives the separation value of that data. After that, the separation value of the cluster is found by calculating the mean of the separation values of all the data of the cluster and it is notated as b(x). From now on, we can calculate the SI value, which is the cluster validity index of the model. The equations to calculate SI, a(x), and b(x) are given in equations (2)–(4), respectively. The SI value is [−1, +1]. While -1 means the worst clustering, +1 means the best clustering.

\begin{matrix} a (x) = \frac{1}{n_{i} - 1} \sum_{x, y \in C_{i}, y \neq x} d_{e} (x, y), \end{matrix}

(2)

\begin{matrix} b (x) = {min}_{j = 1,2, \dots, k; j \neq i} \{\frac{1}{n_{j}} \sum_{x \in C_{j}, y \in C_{j}} d_{e} (x, y)\}, \end{matrix}

(3)

\begin{matrix} S I = \frac{1}{n} \sum_{i = 1}^{k} \sum_{x \in C_{i}} \frac{b (x) - a (x)}{max (a (x), b (x))} . \end{matrix}

(4)

Dunn Index (DI) [31]: the DI calculates the success of the model based on compactness and the separation between the clusters. To do this, the DI value of a cluster is calculated by the distance to the closest cluster and its own diameter. Let d_min(C_i, C_j) be the closest distance between clusters C_i and C_j, and let diam(C_l) be the diameter of the cluster C_l, and the values of these two variables are calculated by d_min(C_i, C_j)=min_{x_i∈C_i, x_j∈C_j}d_e(x_i, y_i) and diam(C_l)=max_{x_i∈C_i, x_j∈C_j}d_e(x_i, y_i). Therefore, by knowing the value of d_min(C_i, C_j) and diam(C_l), the DI of the model is calculated by equation (5). The larger the result value, the more successful the clustering is.

\begin{matrix} D I = {min}_{1 \leq i \leq k} \{{min}_{j \neq i, 1 \leq j \leq k} \{\frac{dmin (C_{i}, C_{j})}{{max}_{1 \leq l \leq k} \{diam (C_{l})\}}\}\} . \end{matrix}

(5)

Calinski-Harabasz (CH) [33]: the CH calculates compactness and separation values via the mean of the squares of the interclass and intraclass distances. The CH index value is calculated by (6). In the CH index, the goal is to make the result as large as possible.

\begin{matrix} C H = \frac{\sum_{i = 1}^{k} n_{i} d_{e}^{2} (μ_{i}, μ) / (k - 1)}{\sum_{i = 1}^{k} \sum_{x \in C_{i}} d_{e}^{2} (x, μ_{i}) / (n - k)} . \end{matrix}

(6)

Davies–Bouldin (DB) [32]: the compactness value is calculated over the mean of the variance of the data in each cluster. On the other hand, the separation value is calculated over the distance from the center of the cluster to the center of the closest one. Let avg(C_i), which is calculated by (7), be the average of the distances of each data in the cluster i to the cluster center, and the avg(C_i) is calculated by (8).

\begin{matrix} a v g (C_{i}) = \frac{1}{n_{i} (n_{i} - 1)} \sum_{x_{i}, x_{j} \in C_{i}} d_{e} (x_{i}, x_{j}), \end{matrix}

(7)

\begin{matrix} D B = \frac{1}{k} \sum_{i = 1, i \neq j, 1 \leq j \leq k}^{k} max \{\frac{avg (C_{i}) + avg (C_{j})}{d_{e} (μ_{i}, μ_{j})}\} . \end{matrix}

(8)

S_Dbw Index [35]: The S_Dbw calculates the compactness value of the clusters over the standard deviations (σ) of the data that the cluster has. On the other hand, it calculates the separation value by the distance between the centers of the clusters. The S_Dbw index is a type of index that considers the density of clusters. Let den be the density of the cluster, and the S_Dbw index value is calculated with the following equations:

\begin{matrix} S_D b w = \frac{1}{k} \sum_{C_{i} \in C} \frac{σ (C_{i})}{σ (X)} + \frac{1}{k (k - 1)} \sum_{C_{i} \in C} \sum_{C_{j} \in C, C_{i} \neq C_{j}} \frac{den (C_{i}, C_{j})}{max \{den (C_{i}), den (C_{j})\}}, \\ den (C_{i}) = \sum_{x_{p} \in C_{i}} f (x_{p}, μ_{i}), \\ den (C_{i}, C_{j}) = \sum_{x_{p} \in C_{i} \cup^{} C_{j}} f (x_{p}, \frac{μ_{i} + μ_{j}}{2}), \\ f (x_{p}, μ_{i}) = \{\begin{matrix} 0, & d_{e} (x_{p}, μ_{i}) > σ (C), \\ 1, & otherwise. \end{matrix} \end{matrix}

(9)

Distance-based Separability Index (DSI) [39]: the DSI is another approach that measures the cluster quality by the means of the distances based on intercluster and intracluster. Let C_i and C_j be two clusters and have N_i and N_j data points, respectively. The intracluster distance set of cluster C_i will be a set as given equation (13). Moreover, the intercluster distance set is measured based on the distances of data pairs of clusters C_i and C_j. To compute the DSI, the Kolmogorov–Smirnov (KS) test was utilized.

\begin{matrix} \{\begin{matrix} \{d_{C_{i}}\} = \{d_{e} (x, y) | x, y \in C_{i}; x \neq y\}, \\ I f |C_{i}| = N_{i,}, then |\{d_{C_{i}}\}| = \frac{1}{2} N_{i,} (N_{i,} - 1), \end{matrix} \\ \{\begin{matrix} \{d_{C_{i, j}}\} = \{d_{e} (x, y) | x \in C_{i}; y \in C_{j}\}, \\ I f |C_{i}| = N_{i,}, |C_{j}| = N_{j,}, then |\{d_{C_{i, j}}\}| = N_{i} . N_{j} . \end{matrix} \end{matrix}

(10)

Let S_{C_i} be Kolmogorov–Smirnov test of cluster C_i, which is calculated as S_{C_i}=KS({d_{C_i}}, {d_{C_i,j}}) and S_{C_j} be of C_j, and the DSI of these two clusters is the result of the following equation:

\begin{matrix} D S I (\{C_{i}, C_{y}\}) = \frac{S_{C_{i}} + S_{C_{j}}}{2} . \end{matrix}

(11)

RMSSTD [35]: the root-mean-square standard deviation (RMSSTD) aims to calculate the clustering quality by measuring the homogeneity of clusters. It is commonly used for hierarchical clustering. Let the dataset consists of k clusters, p be the number of independent variables, ${\bar{x}}_{i j}$ be the mean of data in variable j and cluster i, and n_ij is the number of data in variable p and cluster k. RMSSTD is measured by equation (12). The lower RMSSTD means better clustering.

\begin{matrix} RMSSTD = \sqrt{\frac{\sum_{i = 1,2, \dots, p}^{j = 1,2, \dots, k} \sum_{a = 1}^{n_{i j}} {(x_{a} - {\bar{x}}_{i j})}^{2}}{\sum_{i = 1,2, \dots, n}^{j = 1,2, \dots, k} (n_{i j} - 1)}} . \end{matrix}

(12)

3. Statement of the Problem

Although many approaches have been proposed, analysis of the cluster quality is still an issue. Because there are many clustering approaches in the literature, they differ from each other in many aspects. Therefore, no cluster validation technique can evaluate the quality of all produced clusters precisely. However, some approaches have been used in this task including the Silhouette Index, Dunn Index, Davies–Bouldin, Calinski-Harabasz, and S_Dbw. Although these indices have been used commonly, each of them has a specific problem with cluster validation as given in Table 1. For example, a significant part of the proposed cluster validity indices assumes the shapes of clusters are spherical. In fact, the minority of clusters are spherical in the real world as some examples are given in Figure 3. The SI can be given as an example of these kinds of indices. It cannot achieve a good score if the shape of the cluster is not spherical. On the other hand, the DB and the CH identify clusters that are compact and well separated. However, in the real world, very few clusters are in that shape. Similarly, despite being better than the DB and the CH in case of the clusters are not well separated, the DI encounters some issues with computational cost when the number of clusters or dimensionality is high. Besides, it is affected by the noisy data due to increasing diameter. As for the S_Dbw, although it is proposed as a density-supported validity index and gets a good score with the compact and well-separated clusters, it is affected by the distribution of the data. In addition, thanks to being a density-based clustering validity index, the DSI is good at dealing with arbitrary-shaped clusters. It can successfully evaluate any cluster quality. However, the DSI is also another cluster validity index that is affected when clusters are too close. Likewise, the RMSSTD is another validity index that encounters some problems when the clusters are close to each other. The examples of the problems on the shapes of clusters that existing indices come across can be increased.

Table 1.

Comparison of clustering validity indices that were used for experimentation in the present study.

Cluster validity Index	Notation	Runtime complexity	Optimal value	Considering denser region?	Handling arbitrary-shaped clusters?	Advantages	Disadvantages
Silhouette Index [30]	SI	O(n²)	Max.	✗	✗	The score is higher when the clusters are dense and well separated	Good at handling the spherical clusters, high computational complexity
Dunn Index [31]	DI	O(n²)	Max.	✗	✓	Competent at cluster validity task	High computational cost with high-dimensional data and the number of clusters
Calinski-Harabasz Index [33]	CH	O(n)	Max.	✗	✗	Good at well separated and compact clusters, its computational complexity is very low	It is not competent enough at the cluster validation task.
Davies–Bouldin Index [32]	DB	O(n)	Min.	✗	✗	Good at well separated and compact clusters, its computational complexity is very low	It is not competent enough at the cluster validation task.
S_Dbw validity Index [35]	S_Dbw	O(n)	Min.	✗	✓	Its computational complexity is very low	Affected negatively by the distribution of data
Distance-based Separability Index [39]	DSI	O(n³)	Min	✗	✓	Useful to discover the shape of clusters	Affected negatively when clusters are too close and its computational complexity is high
Root-mean-square std dev [35]	RMSSTD	O(n)	Min.	✗	✗	Good for hierarchical clustering	Has issues when the clusters are close to each other
VIASCKDE Index (proposed)	VIASCKDE	O(n²)	Max.	✓	✓	It can handle the arbitrary-shaped clusters, take into account the denser regions, can be used for density-based and micro-cluster-based approaches	Has issues when the clusters are close to each other

Dataset	Type	# of Features	# of data	# of classes	Reference
Half-kernel	Synthetic	2	1000	2	[45]
Two spirals	Synthetic	2	312	3	[45]
Outlier	Synthetic	2	700	4	[45]
Corners	Synthetic	2	2000	4	[45]
Cluster in cluster	Synthetic	2	1012	2	[45]
Crescent full moon	Synthetic	2	1000	2	[45]
Moon	Synthetic	2	514	4	[45]
Face	Synthetic	2	322	4	[46]
Wave	Synthetic	2	287	2	[46]
Aggregation	Synthetic	2	788	7	[47]
Zelnik1	Synthetic	2	622	4	[48]
Zelnik5	Synthetic	2	512	4	[48]
Xclara	Synthetic	2	3000	3	[48]
Banana	Synthetic	2	4811	2	[48]
D2c2sc13	Synthetic	2	588	13	[48]
2sp2glob	Synthetic	2	999	3	[48]
Cure-t1-200n	Synthetic	2	2000	5	[48]
Thyroid	Real	4	215	2	[49]
Fisher iris	Real	4	150	3	[49]
Breast cancer	Real	8	699	2	[49]

Datasets	Adjusted Rand Index (ARI)
	Methods
	Gaussian Weight	KDE Weight
Half-kernel	1.0000	1.0000
Two spirals	1.0000	1.0000
Outlier	1.0000	1.0000
Corners	1.0000	1.0000
Cluster in cluster	1.0000	1.0000
Crescent full moon	1.0000	1.0000
Moon	0.7424	0.7424
Face	0.9949	1.0000
Wave	1.0000	1.0000
Fisher iris	0.7493	0.7493
Breast cancer	0.7540	0.7540
Aggregation	0.7338	0.9118
Thyroid	-0.0619	0.6783
Zelnik1	1.0000	0.9488
Zelnik5	1.0000	1.0000
Xclara	0.0001	0.0001
Banana	1.0000	1.0000
Ds2c2sc13	0.3187	0.5904
2sp2glob	1.0000	0.9880
Cure-t1-2000n	0.8850	0.8850

Kernels	Datasets
	Obtained VIASCKDE Values with each kernel						Obtained ARI Values with each kernel
	Face	Aggregation	Outliers	Thyroid	Crescent full moon	Cure-t1-200n	Face	Aggregation	Outliers	Thyroid	Crescent full moon	Cure-t1-200n
Gaussian	0.7063	0.6368	0.6797	0.4947	0.6623	0.6555	0.6085	0.8246	1.0000	0.5083	1.0000	0.8850
Cosine	0.5967	0.6564	0.6499	0.1699	0.6340	0.6343	0.6085	0.8089	1.0000	0.5083	1.0000	0.8850
Exponential	0.7005	0.6371	0.6714	0.5541	0.6426	0.6653	0.0386	0.8089	1.0000	0.5034	1.0000	0.8850
Linear	0.5736	0.6427	0.6306	0.1594	0.6169	0.6371	0.6085	0.8089	1.0000	0.5083	1.0000	0.8850
Epanechnikov	0.6021	0.6562	0.6581	0.1758	0.6388	0.6295	0.6085	0.8089	1.0000	0.5083	1.0000	0.8850
Tophat	0.6457	0.6165	0.6433	0.2306	0.6664	0.6299	0.6085	0.0333	1.0000	0.5083	1.0000	0.8850

Bandwidth	Datasets
	Obtained VIASCKDE values with each bandwidth						Obtained ARI values with each bandwidth
	Face	Aggregation	Outliers	Thyroid	Crescent full moon	Cure-t1-200n	Face	Aggregation	Outliers	Thyroid	Crescent full moon	Cure-t1-200n
0.01	0.3377	0.3444	0.4650	0.0556	0.4780	0.5264	−0.0386	0.8089	1.0000	0.5277	1.0000	0.8850
0.03	0.6627	0.6565	0.6508	0.3493	0.6608	0.6421	0.6085	0.8089	1.0000	0.5034	1.0000	0.8850
0.05	0.7063	0.6388	0.6797	0.4947	0.6623	0.6555	0.6085	0.9898	1.0000	0.5034	1.0000	0.8850
0.1	0.7365	0.6225	0.6851	0.6306	0.6486	0.6565	−0.0386	0.8089	1.0000	0.5034	1.0000	0.8850
0.3	0.7857	0.5947	0.6773	0.7402	0.6143	0.6189	−0.0386	0.7338	1.0000	0.2099	1.0000	0.8850
0.5	0.7586	0.5689	0.5481	0.7591	0.5945	0.6039	−0.0386	0.7338	1.0000	0.2099	1.0000	0.8850
1.0	0.7412	0.5636	0.5257	0.7618	0.5927	0.6018	−0.0386	0.7338	1.0000	0.2099	1.0000	0.8850
1.5	0.7362	0.5629	0.5236	0.7618	0.5923	0.6016	−0.0386	0.7338	1.0000	0.2099	1.0000	0.8850
2	0.7339	0.5626	0.5229	0.7618	0.5921	0.6015	−0.0386	0.7338	1.0000	0.2099	1.0000	0.8850
2.5	0.7328	0.5625	0.5226	0.7618	0.5920	0.6015	−0.0386	0.7338	1.0000	0.2099	1.0000	0.8850
3	0.7322	0.5624	0.5225	0.7618	0.5920	0.6015	−0.0386	0.7338	1.0000	0.2099	1.0000	0.8850
3.5	0.7317	0.5624	0.5223	0.7618	0.5919	0.6015	−0.0386	0.7338	1.0000	0.2099	1.0000	0.8850
4	0.7314	0.5623	0.5222	0.7617	0.5919	0.6015	−0.0386	0.7338	1.0000	0.2099	1.0000	0.8850
4.5	0.3377	0.3444	0.4650	0.0556	0.4780	0.5264	−0.0386	0.8089	1.0000	0.5277	1.0000	0.8850
5	0.6627	0.6565	0.6508	0.3493	0.6608	0.6421	0.6085	0.8089	1.0000	0.5034	1.0000	0.8850

Dataset	DBSCAN parameters	Best parameters detected by indices for the DBSCAN algorithm
Dataset	DBSCAN parameters	SI	DI	DB	CH	S_Dbw	DSI	RMSSTD	VIASCKDE
Half-kernel	ε	0.08	0.08	0.05	0.08	0.05	0.05	0.08	0.08
	MinPts	7	7	11	7	15	11	7	7
Two spirals	ε	0.1	0.1	0.05	0.1	0.05	0.1	0.05	0.1
	MinPts	11	11	15	11	15	11	14	11
Outlier	ε	0.07	0.07	0.07	0.07	0.05	0.07	0.05	0.07
	MinPts	15	15	15	15	8	15	14	15
Corners	ε	0.1	0.1	0.1	0.1	0.1	0.1	0.1	0.1
	MinPts	15	15	15	15	15	15	15	15
Cluster in cluster	ε	0.06	0.06	0.06	0.06	0.06	0.06	0.06	0.06
	MinPts	12	12	12	12	12	12	14	12
Crescent full moon	ε	0.07	0.07	0.07	0.07	0.05	0.06	0.05	0.07
	MinPts	14	14	14	14	15	12	15	14
Moon	ε	0.06	0.08	0.06	0.06	0.05	0.05	0.06	0.06
	MinPts	7	11	9	7	9	9	15	15
Face	ε	0.06	0.1	0.1	0.06	0.06	0.05	0.06	0.1
	MinPts	15	8	5	6	15	12	11	8
Wave	ε	0.09	0.09	0.06	0.09	0.05	0.06	0.05	0.06
	MinPts	12	5	12	12	9	12	15	12
Fisher iris	ε	0.14	0.19	0.14	0.14	0.08	0.14	0.06	0.19
	MinPts	15	6	15	15	5	15	7	6
Breast cancer	ε	0.39	0.33	0.39	0.39	0.06	0.06	0.05	0.4
	MinPts	8	5	8	8	5	5	14	5
Aggregation	ε	0.06	0.09	0.06	0.06	0.06	0.06	0.05	0.06
	MinPts	13	7	13	13	14	12	14	13
Thyroid	ε	0.1	0.1	0.06	0.09	0.07	0.05	0.05	0.1
	MinPts	5	5	12	5	6	8	9	5
Zelnik1	ε	0.08	0.08	0.05	0.1	0.07	0.07	0.08	0.07
	MinPts	6	15	14	7	5	5	15	5
Zelnik5	ε	0.06	0.1	0.05	0.1	0.06	0.05	0.05	0.1
	MinPts	14	13	12	13	15	12	14	13
Xclara	ε	0.05	0.08	0.09	0.05	0.05	0.05	0.08	0.05
	MinPts	13	12	15	13	13	13	12	13
Banana	ε	0.05	0.05	0.05	0.05	0.05	0.05	0.05	0.05
	MinPts	9	9	9	9	9	9	9	9
Ds2c2sc13	ε	0.09	0.09	0.06	0.06	0.05	0.06	0.09	0.05
	MinPts	10	10	14	14	13	14	10	8
2sp2glob	ε	0.1	0.1	0.05	0.07	0.08	0.1	0.06	0.07
	MinPts	9	9	12	14	6	9	5	14
Cure-t1-2000n	ε	0.1	0.1	0.1	0.1	0.1	0.1	0.1	0.1
	MinPts	10	10	10	10	10	10	10	10

Dataset	Spectral clustering parameters	Best parameters detected by indices for the Spectral Clustering algorithm
Dataset	Spectral clustering parameters	SI	DI	DB	CH	S_Dbw	DSI	RMSSTD	VIASCKDE
Half-kernel	n_clusters	14	2	15	15	14	15	2	2
Two spirals	n_clusters	15	2	15	15	15	15	2	2
Outlier	n_clusters	2	4	4	13	3	4	2	4
Corners	n_clusters	12	4	12	12	15	14	2	2
Cluster in cluster	n_clusters	4	2	4	15	15	15	2	2
Crescent full moon	n_clusters	5	2	5	13	15	14	2	6
Moon	n_clusters	15	2	15	15	15	15	2	2
Face	n_clusters	11	2	10	12	15	13	2	2
Wave	n_clusters	7	2	15	15	15	15	2	2
Fisher iris	n_clusters	2	2	2	3	15	2	2	3
Breast cancer	n_clusters	2	2	2	2	11	14	15	12
Aggregation	n_clusters	4	2	6	14	2	15	2	2
Thyroid	n_clusters	3	2	3	3	15	15	2	3
Zelnik1	n_clusters	12	2	13	12	15	13	3	3
Zelnik5	n_clusters	8	2	8	15	15	15	2	4
Xclara	n_clusters	3	2	3	3	10	3	2	3
Banana	n_clusters	9	2	9	15	14	15	2	2
Ds2c2sc13	n_clusters	3	3	5	8	2	15	2	5
2sp2glob	n_clusters	7	2	15	15	15	15	2	7
Cure-t1-2000n	n_clusters	5	2	4	13	2	12	2	3

Dataset	HDBSCAN Parameter	Best parameters detected by the indices for the HDBSCAN algorithm
Dataset	HDBSCAN Parameter	SI	DI	DB	CH	S_Dbw	DSI	RMSSTD	VIASCKDE
Half-kernel	n_clusters_size	24	24	2	25	25	25	24	24
	n_samples	6	6	10	25	25	25	6	6
Two spirals	n_clusters_size	3	25	3	17	2	2	15	6
	n_samples	2	17	2	7	2	2	19	12
Outlier	n_clusters_size	16	16	16	16	16	16	16	16
	n_samples	12	12	12	12	12	12	12	12
Corners	n_clusters_size	8	8	8	8	2	2	8	8
	n_samples	8	8	8	8	2	2	8	8
Cluster in cluster	n_clusters_size	20	20	9	11	7	7	20	20
	n_samples	10	10	2	2	3	3	10	10
Crescent full moon	n_clusters_size	20	20	3	20	3	3	20	20
	n_samples	12	12	2	12	2	2	12	12
Moon	n_clusters_size	22	6	22	22	10	2	10	6
	n_samples	3	4	3	3	24	25	24	4
Face	n_clusters_size	21	13	9	21	9	9	13	9
	n_samples	5	19	8	5	8	8	19	8
Wave	n_clusters_size	16	6	16	16	3	4	6	2
	n_samples	13	3	23	13	13	19	3	5
Fisher iris	n_clusters_size	5	5	14	5	5	18	9	5
	n_samples	12	12	16	12	12	21	25	12
Breast cancer	n_clusters_size	11	5	2	5	2	2	22	5
	n_samples	34	55	3	55	3	3	53	55
Aggregation	n_clusters_size	17	12	9	12	23	2	12	2
	n_samples	25	14	16	14	13	4	14	4
Thyroid	n_clusters_size	3	2	3	3	3	3	2	8
	n_samples	2	7	2	2	4	2	16	4
Zelnik1	n_clusters_size	11	3	5	3	20	2	14	3
	n_samples	16	11	25	15	16	17	19	11
Zelnik5	n_clusters_size	20	20	20	20	20	20	20	20
	n_samples	3	3	3	3	3	3	3	3
Xclara	n_clusters_size	9	22	3	13	3	3	3	13
	n_samples	2	6	3	9	3	3	3	9
Banana	n_clusters_size	21	21	13	21	21	16	21	21
	n_samples	14	14	16	14	14	24	14	14
Ds2c2sc13	n_clusters_size	22	22	16	22	4	22	24	16
	n_samples	19	19	20	19	6	19	24	10
2sp2glob	n_clusters_size	21	21	21	21	21	21	21	21
	n_samples	22	22	22	22	22	22	22	22
Cure-t1-2000n	n_clusters_size	4	4	4	4	4	4	4	25
	n_samples	6	6	6	6	6	6	6	4

Index	# of datasets that each index was the best on the different algorithms
Index	DBSCAN	Spectral Clustering	HDBSCAN	Total
SI	11	4	9	24
DI	13	10	16	39
DB	7	5	6	18
CH	13	4	8	25
S_Dbw	5	0	9	14
DSI	8	7	5	20
RMSSTD	5	3	11	19
VIASCKDE (proposed index)	15	15	17	47

Dataset	Obtained values for the each index
Dataset	SI	DI	DB	CH	S_Dbw	DSI	RMSSTD	VIASCKDE
Half-kernel	0.2010	0.0949	1.8818	127.8905	0.5419	0.5068	0.2495	0.7125
Two spirals	0.0588	0.1317	3.3241	152.9447	0.5848	0.1069	0.28	0.7903
Outlier	0.5608	0.4291	0.4037	1075.5609	0.2099	0.9654	0.1302	0.6797
Corners	0.4614	0.2872	0.7436	2020.1068	0.4976	0.6358	0.1187	0.6295
Cluster in cluster	0.2231	0.2341	208.8458	0.0169	0.8536	0.7332	0.2276	0.595
Crescent full moon	0.2784	0.1923	1.1646	285.1423	0.3255	0.6568	0.2449	0.6623
Moon	0.2371	0.1052	0.9739	244.1722	0.2081	0.8788	0.2525	0.7508
Face	0.4569	0.2217	1.1099	213.0246	0.3725	0.7627	0.2423	0.6631
Wave	0.4525	0.1291	0.7119	366.1095	0.2344	0.8935	0.2696	0.6495
Fisher iris	0.5692	0.1222	0.5234	223.6137	0.3386	0.8296	0.2527	0.443
Breast cancer	0.5698	0.1228	0.8037	900.1988	0.3606	0.9617	0.2993	0.2944
Aggregation	0.4763	0.1432	0.5461	1156.7539	0.2073	0.9442	0.1878	0.6388
Thyroid	0.433	0.0598	2.7626	16.6429	0.5343	0.7486	0.1528	0.3275
Zelnik1	0.2045	0.0992	5.6978	95.196	0.2523	0.8939	0.2171	0.6604
Zelnik5	0.4971	0.2224	0.8098	413.8835	0.3651	0.8338	0.1534	0.7739
Xclara	0.6654	0.0656	1.1863	6889.0154	0.3492	0.7462	0.229	0.8101
Banana	0.3589	0.1258	1.1322	3532.2201	0.7625	0.4334	0.2146	0.8076
Ds2c2sc13	0.5724	0.237	0.5891	1907.2388	0.1921	0.9193	0.1091	0.605
2sp2glob	0.3899	0.1278	2.7559	158.5187	0.6374	0.8003	0.2089	0.8819
Cure-t1-2000n	0.4514	0.1196	0.6775	1365.0774	0.3054	0.787	0.1721	0.6555

Dataset	Obtained values for the each index
Dataset	SI	DI	DB	CH	S_Dbw	RMSSTD	DSI	VIASCKDE
Half-kernel	0.4748	0.0949	0.6066	1761.6198	0.2246	0.9163	0.2495	0.7395
Two spirals	0.3175	0.1317	1.058	1378.878	0.2857	0.7829	0.2865	0.8151
Outlier	0.6178	0.4291	0.4037	1804.463	0.1176	0.9654	0.2924	0.6863
Corners	0.5672	0.2872	0.5315	4102.5883	0.1873	0.9439	0.207	0.6575
Cluster in cluster	0.4547	0.2341	0.9465	832.9385	0.2764	0.857	0.2275	0.6052
Crescent full moon	0.4993	0.1923	0.5792	2022.7022	0.2055	0.9103	0.2423	0.6689
Moon	0.4543	0.1285	0.6781	602.0907	0.2169	0.9098	0.2689	0.7527
Face	0.4996	0.2361	0.5473	1055.0573	0.1705	0.9271	0.2481	0.7575
Wave	0.4957	0.1291	0.631	681.3681	0.1639	0.9124	0.2541	0.617
Fisher iris	0.6295	0.3581	0.4877	356.289	0.2163	0.8923	0.1432	0.4539
Breast cancer	0.5839	0.1291	0.7738	993.0158	0.1796	0.7795	0.2031	0.4341
Aggregation	0.4541	0.1091	0.589	1623.9684	0.1434	0.921	0.2966	0.6944
Thyroid	0.5517	0.0973	0.85	138.1291	0.3809	0.685	0.1309	0.4832
Zelnik1	0.5042	0.0992	0.663	194.586	0.2836	0.8614	0.2171	0.6544
Zelnik5	0.5948	0.2651	0.5353	1832.5626	0.1548	0.9495	0.2763	0.7686
Xclara	0.6946	0.023	0.4203	10843.7203	0.2779	0.946	0.1612	0.8164
Banana	0.5087	0.1258	0.5734	14012.5597	0.1806	0.9343	0.2146	0.82
Ds2c2sc13	0.3939	0.0639	0.8082	1133.5545	0.1434	0.9064	0.2896	0.6187
2sp2glob	0.6102	0.1456	0.6921	1548.8465	0.2544	0.8693	0.2396	0.725
Cure-t1-2000n	0.4994	0.1921	0.6581	3615.5302	0.1582	0.9016	0.2817	0.6589

Dataset	Obtained ARI values for the each index
Dataset	SI	DI	DB	CH	S_Dbw	RMSSTD	DSI	VIASCKDE
Half-kernel	0.1514	1.0000	0.1422	0.1421	0.1515	0.1421	1.0000	1.0000
Two spirals	0.1401	1.0000	0.1435	0.1401	0.1401	0.1401	0.2047	1.0000
Outlier	0.8463	1.0000	1.0000	0.2236	0.2322	1.0000	0.2271	1.0000
Corners	0.4581	1.0000	0.4581	0.4581	0.3917	0.4199	0.3330	0.3330
Cluster in cluster	0.6584	1.0000	0.6584	0.1365	0.1368	0.1365	1.0000	1.0000
Crescent full moon	0.2934	1.0000	0.2934	0.1021	0.0869	0.0955	1.0000	0.2341
Moon	0.3629	0.2973	0.3629	0.3629	0.3092	0.3092	0.4916	0.4916
Face	0.0646	0.3662	0.0747	0.0580	0.0443	0.0538	0.3662	0.3662
Wave	0.2970	1.0000	0.1333	0.1323	0.1323	0.1356	1.0000	1.0000
Fisher iris	0.5681	0.5681	0.5681	0.7445	0.2395	0.5681	0.5681	0.7445
Breast cancer	0.8933	0.8933	0.8933	0.8933	0.2875	0.1779	0.0669	0.2534
Aggregation	0.7975	0.0646	0.9066	0.4453	0.0486	0.4156	0.1149	0.0646
Thyroid	0.6307	0.4204	0.6307	0.6307	0.0830	0.0830	0.4204	0.6307
Zelnik1	0.3170	0.4352	0.3004	0.3170	0.2225	0.3007	1.0000	1.0000
Zelnik5	0.6567	0.3096	0.6567	0.3638	0.3790	0.3638	0.5003	1.0000
Xclara	0.9939	0.6270	0.9939	0.9939	0.3602	0.9939	0.6270	0.9939
Banana	0.2394	1.0000	0.2394	0.1369	0.1463	0.1369	1.0000	1.0000
Ds2c2sc13	0.3267	0.3267	0.2766	0.4531	0.0244	0.5344	0.0244	0.2394
2sp2glob	0.7852	0.5709	0.3226	0.3195	0.3185	0.3226	0.5709	0.7852
Cure-t1-2000n	0.6334	0.3423	0.7818	0.3303	0.1757	0.3546	0.1757	0.8427

Dataset	Obtained values for the each index
Dataset	SI	DI	DB	CH	S_Dbw	DSI	RMSSTD	VIASCKDE
Half-kernel	0.201	0.0949	1.8878	171.8984	0.5589	0.4662	0.2495	0.7125
Two spirals	0.4071	0.1317	1.1858	259.0349	0.0136	0.9957	0.28	0.8151
Outlier	0.5608	0.4291	0.4037	1075.5609	0.2099	0.9654	0.1235	0.6881
Corners	0.4614	0.2872	0.7436	2020.1068	0.0437	0.9791	0.1187	0.6268
Cluster in cluster	0.2231	0.2341	4.4083	2.5624	0.0642	0.947	0.2275	0.6052
Crescent full moon	0.2784	0.1923	1.0934	285.1423	0.0527	0.9829	0.2423	0.6623
Moon	0.2371	0.0794	1.1729	244.1722	0.3243	0.7021	0.2628	0.7002
Face	0.417	0.2217	0.9539	204.5665	0.4031	0.8557	0.2339	0.6654
Wave	0.3746	0.1291	1.1785	168.9936	0.3155	0.7862	0.2541	0.617
Fisher iris	0.6295	0.3581	0.4659	353.3674	0.4488	0.9296	0.1478	0.4722
Breast cancer	0.4306	0.1125	1.1919	493.4632	0.1958	0.9575	0.2983	0.0143
Aggregation	0.4925	0.1432	0.6452	778.9448	0.2701	0.8481	0.1497	0.6108
Thyroid	0.4359	0.0683	1.68	38.4235	0.6519	0.7913	0.1532	0.3833
Zelnik1	0.0008	0.0992	13.2535	12.7433	0.3022	0.7287	0.2171	0.541
Zelnik5	0.4663	0.2224	1.0459	413.8835	0.4593	0.7425	0.1493	0.7739
Xclara	0.6745	0.0295	1.213	7008.8746	0.0475	0.9918	0.1114	0.7814
Banana	0.3589	0.1258	1.0288	3532.2201	0.7625	0.7003	0.2146	0.82
Ds2c2sc13	0.5724	0.237	0.5829	1785.9002	0.1831	0.8928	0.1093	0.6045
2sp2glob	0.3899	0.1278	2.7973	158.408	0.6374	0.8003	0.2088	0.7146
Cure-t1-2000n	0.4514	0.1196	0.6775	1365.0774	0.3054	0.787	0.1721	0.655

PERMALINK

VIASCKDE Index: A Novel Internal Cluster Validity Index for Arbitrary-Shaped Clusters Based on the Kernel Density Estimation

Ali Şenol

Abstract

1. Introduction

2. Background and Related Works

Figure 1.

Figure 2.

3. Statement of the Problem

Table 1.

Figure 3.

Figure 4.

4. Proposed Cluster Validity Index: A Novel Internal Cluster Validity Index for Arbitrary-Shaped Clusters Based on the Kernel Density Estimation (The VIASCKDE Index)

4.1. Basic Idea

Figure 5.

4.2. Kernel Density Estimation-Based Weighting

Figure 6.

Figure 7.

4.3. Definitions and Equations

Definition 1 . —

Definition 2 . —

Definition 3 . —

4.4. The Algorithm

4.5. Computational Complexity

5. Experimental Study

5.1. Development Environment

5.2. Used Datasets

Table 2.

Figure 8.

5.3. Experimental Procedure

5.4. Experimental Study

5.4.1. The Selection of Density Distribution Estimation Method

Table 3.

5.4.2. The Kernel Selection for KDE

Table 4.

5.4.3. Bandwidth Selection for the KDE

Figure 9.

Table 5.

5.4.4. The Tests on Both Synthetic and Real Datasets

Figure 10.

Table 6.

Table 7.

Table 8.

Table 9.

Table 10.

Table 11.

Table 12.

Table 13.

Table 14.

6. Evaluation of the Results and Discussion

Table 15.

7. Conclusion and Future Works

Algorithm 1.

Data Availability

Conflicts of Interest

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases