Sensors (Basel, Switzerland). 2023 Apr 3;23(7):3708. doi: 10.3390/s23073708

Cluster Validity Index for Uncertain Data Based on a Probabilistic Distance Measure in Feature Space

Changwan Ko 1, Jaeseung Baek 2,3, Behnam Tavakkol 4, Young-Seon Jeong 1,5,*
Editors: Qiong Wang, Teng Huang, Yan Pang
PMCID: PMC10099331  PMID: 37050769

Abstract

Cluster validity indices (CVIs), which evaluate clustering results to determine the optimal number of clusters, are critical measures in clustering problems. Most CVIs are designed for typical data-type objects called certain data objects. Certain data objects have only a single value and include no uncertainty, so they are assumed to be information-abundant in the real world. In this study, new CVIs for uncertain data, based on kernel probabilistic distance measures that calculate the distance between two distributions in feature space, are proposed for uncertain clusters with arbitrary shapes, sub-clusters, and noise in objects. By transforming the original uncertain data into kernel space, the proposed CVIs accurately measure the compactness and separability of a cluster for arbitrary cluster shapes and are robust to noise and outliers in a cluster. The proposed CVIs were evaluated on diverse types of simulated and real-life uncertain objects, confirming that the proposed validity indices in feature space outperform the pre-existing ones in the original space.

Keywords: uncertain data, cluster validity index, kernel probabilistic distance, feature space

1. Introduction

The purpose of clustering is to partition objects into groups such that the similarity within groups and the dissimilarity among different groups are maximized [1,2]. Although clustering methods have been widely used in many applications, most clustering algorithms do not provide the optimal number of clusters. Partition-based clustering algorithms such as K-means clustering [3] require the number of clusters to be preset [4]. As cluster information is rarely known in the real world, it is crucial to evaluate the clustering results obtained with different numbers of clusters. Although many clustering methods exist for diverse applications, such as pattern recognition [5], semiconductor manufacturing [6], and healthcare [7], they have been developed primarily for certain data, i.e., fixed values. However, the uncertainty embedded in data is essential in many applications. For instance, a patient’s blood pressure may not be consistent because of environmental conditions and instrument errors. Furthermore, measurement values change continuously because of the positions of instrumentation devices or workers’ conditions. Aside from these examples, data randomness, missing data, delayed updates, and worker fatigue are other sources of data uncertainty [8,9].

Uncertain data are assumed to be prevalent in the real world because of, e.g., measurement errors and environmental conditions. The uncertainty of an uncertain data object can be expressed by a probability density function (PDF). Figure 1 illustrates two uncertain data objects, each described by a PDF. The standard way to convert uncertain data into certain data is to replace each object with a summary statistic (e.g., the mean or median). However, such statistics discard information that is needed to capture the uncertainty of uncertain objects.

Figure 1. Two uncertain datasets, each expressed by a PDF.

Cluster validity indices (CVIs), which are indicators for validating the quality of clustering results, have been widely used to determine the correct number of clusters for the given data. As CVIs only use the input data, they must be chosen according to the characteristics of the data. The two components of a CVI are compactness and separability measures. The former refers to an intra-cluster distance, and the latter represents an inter-cluster distance. For most CVIs, a good partition produces a small compactness value and a high separability value. However, the existing CVIs are unreliable for validating clustering results when the clusters are not spherical [10,11].

For certain data, several CVIs, such as the Dunn [12], Calinski–Harabasz [13], Davies–Bouldin [14], and Xie–Beni [15] indices, have been proposed based on combinations of compactness and separability measures. However, most of the existing CVIs have been developed for certain data. There have been few studies on uncertain data. Moreover, relatively new CVIs are also being designed to incorporate mathematical theories into pre-existing CVIs, such as the K-nearest neighbor algorithm, which is used to compute compactness and separation by taking into account shared/non-shared data pairs [10], and principal component analysis, which is used to capture the geometry of the clusters [16]; or to develop clustering algorithms to cluster more well-separated clusters [1].

To apply the existing CVI formulas to uncertain data, the distance measures used for compactness and separability must be changed. In a study of uncertain CVIs, Tavakkol et al. [17] proposed CVIs for uncertain data that calculate the distance between two uncertain objects using probabilistic distance measures in the original space. However, this approach is sensitive to arbitrary cluster shapes, sub-clusters, and outliers, which may cause inaccurate compactness and separability values [11].

Consequently, this study proposes new uncertain CVIs for uncertain data objects based on kernel probabilistic distance measures in feature space. The proposed CVIs for uncertain objects adopt the kernel-based Bhattacharyya probabilistic distance in kernel space. In kernel space, the proposed CVIs produce accurate compactness and separability for arbitrarily shaped clusters by transforming them into elliptical shapes in feature space. Figure 2 illustrates that the ambiguous shape of a dataset in the original space is transformed into a relatively elliptical or circular shape in feature space; thus, the kernel transformation can improve the accuracy of the computed compactness and separability. Furthermore, the proposed approaches could be robust to noise and outliers in a cluster. The performance of the proposed CVIs was evaluated through diverse experiments, including simulated and real-life datasets.

Figure 2. Visualization of kernel transformation: (a) asymmetric shape in original space; (b) transformed shape in feature space.

This paper is organized as follows. Section 2 reviews the previous studies on CVIs. New CVIs for uncertain data based on a kernel probabilistic distance measure are proposed in Section 3. After the extensive experiments are presented in Section 4, the conclusions and future studies are provided in Section 5.

2. Related Work

2.1. CVI for Certain Data

In the past few decades, many CVIs have been developed to determine the optimal number of clusters. Most CVIs focus on calculating compactness and separability measures, and the two measures are combined into a ratio-type or summation-type index. This section presents several popular CVIs that have been evaluated in many applications.

The Dunn (DU) index [12]:

$$\mathrm{DU}(K)=\frac{\min_{i,j=1,\dots,K,\; i\neq j}\left\{\min_{x\in C_i,\; y\in C_j} d(x,y)\right\}}{\max_{i=1,\dots,K}\left\{\max_{x,y\in C_i} d(x,y)\right\}} \tag{1}$$

Compactness is computed as the maximum diameter among all clusters, and separability as the minimum pair-wise distance between objects in different clusters. The DU index is the ratio of separability to compactness. Thus, the number of clusters K that maximizes the DU index is taken as the optimal number of clusters (max. S/C).
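As an illustration of Equation (1), the following is a minimal sketch that evaluates the Dunn index from a precomputed pairwise distance matrix. The function name `dunn_index` and the toy data are assumptions made for this example and are not part of the original study.

```python
import numpy as np

def dunn_index(D, labels):
    """Dunn index (Eq. (1)) from a symmetric pairwise distance matrix D."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # Compactness: the largest within-cluster diameter.
    max_diameter = max(
        D[np.ix_(labels == c, labels == c)].max() for c in clusters
    )
    # Separability: the smallest distance between objects of different clusters.
    min_separation = min(
        D[np.ix_(labels == a, labels == b)].min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_separation / max_diameter

# Toy example (hypothetical data): two tight, well-separated groups on a line.
x = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(x[:, None] - x[None, :])
print(dunn_index(D, [0, 0, 1, 1]))  # 9.0
```

Because the function only needs a distance matrix, the same sketch applies to any distance measure, including probabilistic distances between uncertain objects.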

Calinski–Harabasz (CH) index [13]:

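$$\mathrm{CH}(K)=\frac{\sum_{i=1}^{K} n_i\, d(z_i,z_{tot})^2}{K-1}\cdot\frac{n-K}{\sum_{i=1}^{K}\sum_{x\in C_i} d(x,z_i)^2} \tag{2}$$

Like the DU index, the CH index is a ratio of separability to compactness. Here, $n_i$ is the number of objects in cluster $C_i$, $z_i$ is the centroid of $C_i$, $n$ is the total number of objects, and $z_{tot}$ is the centroid of the entire dataset. Compactness and separability are computed using the within- and between-cluster sums of squares. Thus, the partition that maximizes CH is taken as the optimum partition (max. S/C).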
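Equation (2) can be sketched for certain data with Euclidean distances as follows. This is an illustrative implementation under that assumption; the function name `calinski_harabasz` and the toy data are hypothetical.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """CH index (Eq. (2)) with squared Euclidean distances to centroids."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    n, K = len(X), len(clusters)
    z_tot = X.mean(axis=0)  # centroid of the entire dataset
    between, within = 0.0, 0.0
    for c in clusters:
        Xc = X[labels == c]
        z = Xc.mean(axis=0)  # centroid of cluster c
        between += len(Xc) * np.sum((z - z_tot) ** 2)
        within += np.sum((Xc - z) ** 2)
    return (between / (K - 1)) * ((n - K) / within)

# Toy example (hypothetical data): centroids 1 and 11, overall centroid 6.
X = np.array([[0.0], [2.0], [10.0], [12.0]])
print(calinski_harabasz(X, [0, 0, 1, 1]))  # 50.0
```

In the toy example the between-cluster sum of squares is 100 and the within-cluster sum of squares is 4, so the index is (100 / 1) × (2 / 4) = 50.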

The Davies–Bouldin (DB) index [14]:

$$\mathrm{DB}(K)=\frac{1}{K}\sum_{i=1}^{K}\max_{j=1,\dots,K,\; j\neq i}\left\{\frac{\frac{1}{n_i}\sum_{x\in C_i} d(x,z_i)^2+\frac{1}{n_j}\sum_{y\in C_j} d(y,z_j)^2}{d(z_i,z_j)}\right\} \tag{3}$$

where $z_i$ and $z_j$ are the centroids of clusters $C_i$ and $C_j$. Compactness and separability are calculated cluster by cluster from the mean squared distances of individual clusters, unlike the DU index, which considers the compactness and separability of the whole partition. For a pair of clusters, compactness is the sum of their mean squared distances to their own centroids, and separability is the distance between the two centroids. The DB index is a ratio of compactness to separability. Therefore, the partition that minimizes DB is taken as the optimum partition (min. C/S).

The pre-existing CVIs are sensitive to sub-clusters, arbitrary shapes, and noise in clusters for the compactness measure [18]. This study overcomes those drawbacks by conducting a spatial transformation from the original space into feature space using a kernel function that correctly measures cluster compactness and separability.

2.2. CVI for Uncertain Data

Most CVIs have focused on certain data or fixed values [19]. Certain data do not have uncertainty caused by several factors and environments such as sensor measurement error, repeated measurements by workers, or equipment operating environments. Uncertain data objects come in two possible forms: (1) multiple points for each object and (2) a PDF for each object, either given or obtained by fitting the multiple points [20]. Several studies related to clustering uncertain data have been conducted. However, CVIs for uncertain data have rarely been used. The CVIs are crucial criteria for validating the results of clusters [21,22] to find the appropriate number of clusters. Therefore, the study of CVIs for uncertain data is necessary.

In this study, the proposed CVIs use kernel probabilistic distance measures to compute the distance between two uncertain data objects. There are many popular probabilistic distance measures, such as Bhattacharyya distance [23], Wasserstein distance, and Kullback–Leibler divergence [24]. This study uses the Bhattacharyya distance measure. The Bhattacharyya distance measure is one of the widely used probabilistic distance measures and has been generally used in diverse applications.

The Bhattacharyya distance between two probability distributions can be calculated in discrete and continuous cases. Let p and q be the continuous probability distributions over the same space. The definition of the Bhattacharyya distance for a continuous case in original space can be described as follows:

$$PD_{Bhatt}(p,q)=-\ln\int_{x}\sqrt{p(x)\,q(x)}\,dx \tag{4}$$

There are closed-form solutions for many probabilistic distance measures, including the Bhattacharyya distance, for cases where uncertain data objects are modeled with multivariate normal distributions. As probabilistic distance measures can capture the distance between PDFs, they can also be used to capture the distance between uncertain data objects [25]. The Bhattacharyya distance is a special case of the Chernoff distance with parameters $\alpha_1=\alpha_2=1/2$, and the closed form of the Bhattacharyya distance for multivariate normal PDFs is given in Equation (5):

$$PD_{Bhatt}(p,q)=\frac{1}{8}(\mu_p-\mu_q)^{T}\left(\frac{\Sigma_p+\Sigma_q}{2}\right)^{-1}(\mu_p-\mu_q)+\frac{1}{2}\ln\frac{\left|\frac{\Sigma_p+\Sigma_q}{2}\right|}{|\Sigma_p|^{1/2}\,|\Sigma_q|^{1/2}} \tag{5}$$

where $\mu_p$ and $\mu_q$ are the means, and $\Sigma_p$ and $\Sigma_q$ are the covariance matrices, of $P\sim MVN(\mu_p,\Sigma_p)$ and $Q\sim MVN(\mu_q,\Sigma_q)$.
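The closed form in Equation (5) can be checked numerically with a short sketch. The code below uses the standard Bhattacharyya distance between two multivariate normal distributions, with the averaged covariance $(\Sigma_p+\Sigma_q)/2$; the function name `bhattacharyya_gaussian` is an assumption made for illustration.

```python
import numpy as np

def bhattacharyya_gaussian(mu_p, cov_p, mu_q, cov_q):
    """Bhattacharyya distance between N(mu_p, cov_p) and N(mu_q, cov_q)."""
    mu_p, mu_q = np.asarray(mu_p, float), np.asarray(mu_q, float)
    cov_p, cov_q = np.asarray(cov_p, float), np.asarray(cov_q, float)
    cov = 0.5 * (cov_p + cov_q)  # averaged covariance
    diff = mu_p - mu_q
    # Mahalanobis-type term: (mu_p - mu_q)^T cov^{-1} (mu_p - mu_q)
    mahal = diff @ np.linalg.solve(cov, diff)
    # Log-determinants are computed with slogdet for numerical stability.
    _, logdet = np.linalg.slogdet(cov)
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.125 * mahal + 0.5 * (logdet - 0.5 * (logdet_p + logdet_q))

I2 = np.eye(2)
# Equal covariances: only the mean term remains, (1/8) * 4 = 0.5.
print(bhattacharyya_gaussian([0, 0], I2, [2, 0], I2))  # 0.5
# Identical distributions have zero distance.
print(bhattacharyya_gaussian([1, 1], I2, [1, 1], I2))  # 0.0
```

With equal means, the distance depends only on the covariance term; for variances 1 and 4 in one dimension it equals 0.5 ln(2.5 / 2).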

This study models the Bhattacharyya distance between two uncertain data objects in kernel space. We can compute the probabilistic distance between two uncertain data objects in feature space using a kernel function.

3. Proposed CVIs for Uncertain Data

3.1. Kernel Probabilistic Distance Measure in Feature Space

Computing the probabilistic distance is a nontrivial problem. We can compute the Bhattacharyya distance in feature space by following the steps developed by Zhou and Chellappa [26]. To capture the probabilistic distance, suppose that $\mathbf{x}_1=\{x_1^1,x_2^1,\dots,x_N^1\}$ and $\mathbf{x}_2=\{x_1^2,x_2^2,\dots,x_N^2\}$ are the given objects in the original space $\mathbb{R}^d$ with a multivariate normal density function:

$$N(x;\mu,\Sigma)=\frac{1}{\sqrt{(2\pi)^{d}\,|\Sigma|}}\exp\left\{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right\} \tag{6}$$

The radial basis function (RBF) kernel function given in Equation (7) can be used to transfer the original data into feature space for calculating the distance between uncertain data objects $\mathbf{x}_1$ and $\mathbf{x}_2$. The RBF kernel function is commonly used in various fields and algorithms because it has been reported to outperform other kernel functions [27,28].

$$K_{ij}=\exp\left(-\frac{1}{2\sigma^{2}}\left\|x_i-x_j\right\|^{2}\right),\quad i,j=1,2 \tag{7}$$
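A minimal sketch of Equation (7) is given below. It builds the RBF Gram matrix between two sets of sample points; the function name `rbf_gram` and the sample points are assumptions made for illustration.

```python
import numpy as np

def rbf_gram(X, Y, sigma):
    """RBF Gram matrix with entries exp(-||x - y||^2 / (2 sigma^2))."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Y = np.atleast_2d(np.asarray(Y, dtype=float))
    # Squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

# Hypothetical sample points.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_gram(pts, pts, sigma=1.0)
print(K.shape)  # (3, 3)
```

The bandwidth `sigma` controls how quickly similarity decays with distance; points at unit distance with `sigma = 1` have kernel value exp(-0.5).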

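For the kernel function $K(x_1,x_2)$, where $x_1,x_2\in\mathbb{R}^{d}$, the non-linear mapping function is $\phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{f}$ and the kernel Gram matrix is defined as $K=\Phi^{T}\Phi$, where $\Phi := \Phi_N=[\phi(x_1),\phi(x_2),\dots,\phi(x_N)]\in\mathbb{R}^{f\times N}$, and $\mathbb{R}^{f}$ ($f>d$) represents the kernel space into which the data are transformed. The mean $\mu$ and covariance matrix $\Sigma$ in feature space are estimated as:

$$\mu=N^{-1}\sum_{n=1}^{N}\phi(x_n)=\Phi s, \tag{8}$$

$$\Sigma=N^{-1}\sum_{n=1}^{N}(\phi_n-\mu)(\phi_n-\mu)^{T}=\Phi JJ^{T}\Phi^{T}, \tag{9}$$

where $J=\frac{1}{\sqrt{N}}\left(I_N-s\mathbf{1}\right)$ with $s_{N\times 1}=\frac{1}{N}\mathbf{1}^{T}$ and $\mathbf{1}=[1,1,\dots,1]$.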

The covariance matrix $\Sigma$ must be replaced by an approximation because it is rank-deficient in the feature space ($f>d$). Therefore, we use the following approximation:

$$C=\Phi JQQ^{T}J^{T}\Phi^{T}+\rho I_f=WW^{T}+\rho I_f=\Phi A\Phi^{T}+\rho I_f, \tag{10}$$

where $W\doteq\Phi JQ$, $A\doteq JQQ^{T}J^{T}$, and $\rho$ is a user parameter that must be specified in advance.

Obtaining the matrix $Q$ requires computing the matrix of the top $r$ eigenvalues $\Lambda_r$ and the matrix of the top $r$ eigenvectors $V_r$ of $\bar{K}=J^{T}KJ$, where $r$ is a pre-specified parameter; $r=3$ is used in this study. $Q$ is an $N\times r$ matrix calculated as follows:

$$Q\doteq V_r\left(I_r-\rho\Lambda_r^{-1}\right)^{1/2}. \tag{11}$$
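The construction of $J$, $\bar{K}$, and $Q$ in Equations (8) to (11) can be sketched as follows. The function names, the sample data, and the flooring of eigenvalues at $\rho$ (so that the square root stays real) are assumptions made for this illustration, not part of the original method description.

```python
import numpy as np

def centering_matrix(N):
    """J = N^{-1/2} (I_N - s 1), with s = (1/N) 1^T as in Eq. (9)."""
    s = np.full((N, 1), 1.0 / N)
    one = np.ones((1, N))
    return (np.eye(N) - s @ one) / np.sqrt(N)

def low_rank_factor(K, rho=1e-3, r=3):
    """Return J, Q (Eq. (11)) and the top-r eigenvalues of K_bar = J^T K J."""
    J = centering_matrix(K.shape[0])
    K_bar = J.T @ K @ J
    eigvals, eigvecs = np.linalg.eigh(K_bar)  # ascending order
    top = np.argsort(eigvals)[::-1][:r]
    # Assumption: eigenvalues below rho are floored at rho, which zeroes
    # the corresponding column of Q instead of producing a complex root.
    lam = np.maximum(eigvals[top], rho)
    Q = eigvecs[:, top] * np.sqrt(1.0 - rho / lam)
    return J, Q, lam

# Hypothetical sample points of one uncertain object and their RBF Gram matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq / 2.0)  # sigma = 1

J, Q, lam = low_rank_factor(K)
# W^T W = Q^T K_bar Q should equal Lambda_r - rho I_r, so the non-zero
# eigenvalues of C = W W^T + rho I are exactly the retained eigenvalues.
M = Q.T @ J.T @ K @ J @ Q
```

The matrix `M` is diagonal with entries `lam - rho`, which is the property that makes the retained eigenvalues of $\bar{K}$ the leading eigenvalues of $C$ in Equation (10).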

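Define the matrix $P$ as:

$$P_{(N_1+N_2)\times(r_1+r_2)}=\begin{bmatrix}\sqrt{\alpha_1}\,J_1Q_1 & 0\\ 0 & \sqrt{\alpha_2}\,J_2Q_2\end{bmatrix}. \tag{12}$$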

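Because the Bhattacharyya distance is the special case of the Chernoff distance with $\alpha_1=\alpha_2=1/2$, these values are used in all experiments. The values $\tau_i$, $i=1,\dots,r_1+r_2$, are the eigenvalues of the matrix $L_{ch}$, with dimensions $(r_1+r_2)\times(r_1+r_2)$, given by

$$L_{ch}=P^{T}\begin{bmatrix}\Phi_1^{T}\\ \Phi_2^{T}\end{bmatrix}\begin{bmatrix}\Phi_1 & \Phi_2\end{bmatrix}P=P^{T}\begin{bmatrix}K_{11} & K_{12}\\ K_{21} & K_{22}\end{bmatrix}P. \tag{13}$$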

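The scalar values $\varepsilon_{11}$, $\varepsilon_{12}$, and $\varepsilon_{22}$ are computed by Equation (14):

$$\varepsilon_{ij}=s_i^{T}K_{ij}s_j-s_i^{T}\begin{bmatrix}K_{i1} & K_{i2}\end{bmatrix}B_{ch}\begin{bmatrix}K_{1j}\\ K_{2j}\end{bmatrix}s_j \tag{14}$$

where $B_{ch}=P\left(\rho I_{r_1+r_2}+L_{ch}\right)^{-1}P^{T}$ with dimensions $(N_1+N_2)\times(N_1+N_2)$.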

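The kernel-based probabilistic Bhattacharyya distance between two uncertain data objects $\mathbf{x}_1$ and $\mathbf{x}_2$ in feature space is calculated as follows:

$$KPD_{Bhatt}=0.5\left[\alpha_1\alpha_2\rho^{-1}\left(\varepsilon_{11}+\varepsilon_{22}-2\varepsilon_{12}\right)+0.5\left(\sum_{i=1}^{r_1+r_2}\log\frac{\rho+\tau_i}{\lambda_{i,1}}+\sum_{i=1}^{r_1+r_2}\log\frac{\rho+\tau_i}{\lambda_{i,2}}\right)\right], \tag{15}$$

where $\lambda_{i,j}$, $i=1,\dots,r_j$, are the eigenvalues of $C_j$:

$$\lambda_{i,j}=\begin{cases}\lambda_{i,j}, & \text{when } i=1,\dots,r_j\\ \rho, & \text{when } i=r_j+1,\dots,r_1+r_2\end{cases} \tag{16}$$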
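The steps in Equations (7) and (10) to (16) can be combined into a single sketch. The code below is an illustrative reading of those equations, not the authors' implementation: it assumes that each uncertain object is represented by a set of sample points, that $\alpha_1=\alpha_2=1/2$, and that eigenvalues below $\rho$ are floored at $\rho$. The names `rbf_gram`, `_factor`, and `kpd_bhatt` are hypothetical.

```python
import numpy as np

def rbf_gram(X, Y, sigma):
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def _factor(K, rho, r):
    """Return s, the product J Q, and the top-r eigenvalues for one object."""
    N = K.shape[0]
    s = np.full(N, 1.0 / N)
    J = (np.eye(N) - np.outer(s, np.ones(N))) / np.sqrt(N)
    vals, vecs = np.linalg.eigh(J.T @ K @ J)
    top = np.argsort(vals)[::-1][:r]
    lam = np.maximum(vals[top], rho)             # assumption: floor at rho
    Q = vecs[:, top] * np.sqrt(1.0 - rho / lam)  # Eq. (11)
    return s, J @ Q, lam

def kpd_bhatt(X1, X2, sigma=1.0, rho=1e-3, r=3):
    """Kernel Bhattacharyya distance between two sample sets (Eq. (15))."""
    a1 = a2 = 0.5
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    N1 = len(X1)
    K11, K12, K22 = rbf_gram(X1, X1, sigma), rbf_gram(X1, X2, sigma), rbf_gram(X2, X2, sigma)
    s1, JQ1, lam1 = _factor(K11, rho, r)
    s2, JQ2, lam2 = _factor(K22, rho, r)
    r1, r2 = len(lam1), len(lam2)
    P = np.zeros((K11.shape[0] + K22.shape[0], r1 + r2))  # Eq. (12)
    P[:N1, :r1] = np.sqrt(a1) * JQ1
    P[N1:, r1:] = np.sqrt(a2) * JQ2
    K = np.block([[K11, K12], [K12.T, K22]])
    L = P.T @ K @ P                                        # Eq. (13)
    tau = np.linalg.eigvalsh(L)
    B = P @ np.linalg.solve(rho * np.eye(r1 + r2) + L, P.T)
    u1, u2 = K[:N1].T @ s1, K[N1:].T @ s2                  # [K_i1 K_i2]^T s_i
    e11 = s1 @ K11 @ s1 - u1 @ B @ u1                      # Eq. (14)
    e22 = s2 @ K22 @ s2 - u2 @ B @ u2
    e12 = s1 @ K12 @ s2 - u1 @ B @ u2
    quad = a1 * a2 / rho * (e11 + e22 - 2.0 * e12)
    lam1_pad = np.concatenate([lam1, np.full(r2, rho)])    # Eq. (16)
    lam2_pad = np.concatenate([lam2, np.full(r1, rho)])
    log_tau = np.log(rho + tau).sum()
    logs = (log_tau - np.log(lam1_pad).sum()) + (log_tau - np.log(lam2_pad).sum())
    return 0.5 * (quad + 0.5 * logs)

# Hypothetical sample sets for two uncertain objects.
rng = np.random.default_rng(0)
A = rng.normal(0.0, 0.3, size=(40, 2))
B2 = rng.normal(2.0, 0.3, size=(40, 2))
d_ab, d_ba, d_aa = kpd_bhatt(A, B2), kpd_bhatt(B2, A), kpd_bhatt(A, A)
```

Under these assumptions the distance is symmetric in its two arguments and vanishes (up to rounding) when both sample sets are identical, which is a useful sanity check before using it inside a CVI.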

3.2. New CVI for Uncertain Data

The uncertain data objects in a cluster are transformed into feature space by applying a kernel function, and the compactness and separability are computed there. The mapped uncertain data objects are used to compute the distances within and between clusters, from which compactness and separability, and hence the values of the proposed CVIs, are obtained. The value of each index changes with the number of clusters K. The proposed uncertain feature space DU (UFSDU) and uncertain feature space CH (UFSCH) indices are defined in Equations (17) and (18), respectively:

UFSDU index:

$$\mathrm{UFSDU}(K)=\frac{\min_{i,j=1,\dots,K,\; i\neq j}\left\{\min_{x\in C_i,\; y\in C_j} KPD_{Bhatt}(x,y)\right\}}{\max_{i=1,\dots,K}\left\{\max_{x,y\in C_i} KPD_{Bhatt}(x,y)\right\}} \tag{17}$$

UFSCH index:

$$\mathrm{UFSCH}(K)=\frac{\sum_{i=1}^{K} n_i\, KPD_{Bhatt}(z_i,z_{tot})^2}{K-1}\cdot\frac{n-K}{\sum_{i=1}^{K}\sum_{x\in C_i} KPD_{Bhatt}(x,z_i)^2} \tag{18}$$

These proposed CVI equations are similar to the DU and CH indices, except for the term $KPD_{Bhatt}(x,y)$, which is the distance between two uncertain data objects in feature space computed by Equation (15).
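Given a precomputed matrix of pairwise distances between uncertain objects, Equations (17) and (18) can be sketched as below. Two assumptions are made for illustration: the cluster representative $z_i$ is taken to be the cluster medoid, and $z_{tot}$ is taken to be the medoid of the whole dataset, since a centroid of uncertain objects is not directly available from a distance matrix. The toy matrix `D` merely stands in for a matrix of $KPD_{Bhatt}$ values, and all names are hypothetical.

```python
import numpy as np

def medoid(D, idx):
    """Index (within idx) of the object with the smallest total distance."""
    sub = D[np.ix_(idx, idx)]
    return idx[np.argmin(sub.sum(axis=1))]

def ufs_indices(D, labels):
    """Return (UFSDU, UFSCH) from a pairwise distance matrix D and labels."""
    D = np.asarray(D, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    n, K = len(labels), len(clusters)
    members = [np.flatnonzero(labels == c) for c in clusters]

    # Eq. (17): minimum between-cluster distance over maximum diameter.
    separation = min(
        D[np.ix_(a, b)].min() for i, a in enumerate(members) for b in members[i + 1:]
    )
    diameter = max(D[np.ix_(a, a)].max() for a in members)
    ufsdu = separation / diameter

    # Eq. (18): medoids are used as z_i and z_tot (assumption).
    z_tot = medoid(D, np.arange(n))
    between, within = 0.0, 0.0
    for idx in members:
        z = medoid(D, idx)
        between += len(idx) * D[z, z_tot] ** 2
        within += np.sum(D[idx, z] ** 2)
    ufsch = (between / (K - 1)) * ((n - K) / within)
    return ufsdu, ufsch

# Toy distance matrix (hypothetical) standing in for KPD values.
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
D = np.abs(x[:, None] - x[None, :])
ufsdu, ufsch = ufs_indices(D, [0, 0, 0, 1, 1, 1])
print(ufsdu, ufsch)  # 4.0 246.0
```

In the toy example the cluster medoids are the objects at 1 and 11 and the global medoid is the object at 2, so the between term is 3 × 1 + 3 × 81 = 246 and the within term is 4.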

4. Experimental Results

In this study, we propose two CVIs that are calculated from probabilistic distances between uncertain data objects in feature space. The K-medoids clustering algorithm proposed by Jiang et al. [19], which uses probabilistic distance measures to capture the similarity between uncertain objects, was used to generate the clustering results on which the performances of the CVIs were compared. K-medoids is one of the most useful algorithms in clustering problems, and it differs from the popular K-means clustering algorithm in its robustness to outliers. The K-means method represents each cluster by the mean of all objects in the cluster, whereas the K-medoids method calculates the distance between every pair of uncertain data objects within a cluster [19] and assigns the object with the smallest total distance to the others as the new medoid of the cluster. The experiments require two settings: the number of clusters K and the probabilistic distance measure. In this study, we varied the number of clusters (K) and used the Bhattacharyya distance measure to compute distances between uncertain data objects in feature space.

4.1. Experimental Procedure for Uncertain Data

Experiments were performed with artificial and real-world datasets that may have sub-clusters and clusters with asymmetrical, arbitrary, and noisy shapes to evaluate the performances of the proposed CVIs. Each feature of the datasets was normalized to reduce the scale gap between different features, as defined in Equation (19):

$$x_{norm}=\frac{x-x_{min}}{x_{max}-x_{min}}, \tag{19}$$

where $x_{min}$ and $x_{max}$ are the minimum and maximum values of one feature of the dataset. We then simulated uncertain data objects from certain data objects by following the methodology of [20].

The pre-existent DU and CH indices were also computed for uncertain data objects in original space, giving the uncertain original space DU (UOSDU) and uncertain original space CH (UOSCH) indices, to confirm the validity of the proposed CVIs. The overall experimental procedure is represented by Algorithm 1. The procedure used to compare the performances of the proposed CVIs with those of the previous CVIs was as follows. The inputs included the number of uncertain data objects N, the number of object features M, and the number of clusters K. We modeled the uncertain data with multivariate normal distributions. The means of the distributions were the original certain data. The covariances were estimated as follows:

$$f\left(S_{ik}\mid\Psi_k,df_k\right)=\frac{|\Psi_k|^{df_k/2}}{2^{\,p\,df_k/2}\,\Gamma_p\!\left(\frac{df_k}{2}\right)}\,|S_{ik}|^{-\frac{df_k+p+1}{2}}\,e^{-\frac{1}{2}\mathrm{tr}\left(\Psi_k S_{ik}^{-1}\right)},\quad i=1,\dots,n_k,\; k=1,\dots,K \tag{20}$$

where $S_{ik}$ represents the covariance matrices for objects in class $k$, which follow the inverse Wishart PDF [29] defined in Equation (20) [20]. $\Psi_k$ is a positive definite scale matrix and $df_k$ is the degree of freedom. $p$ indicates the dimension of $S_{ik}$, tr is the trace of a matrix, and $\Gamma_p$ is the multivariate gamma function.
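The data preparation described above (min-max normalization, then one multivariate normal distribution per object with an inverse Wishart covariance) can be sketched as follows. This is an illustrative sketch, not the procedure of [20]: it assumes an integer degree of freedom not smaller than the dimension, uses the fact that the inverse of a Wishart-distributed matrix with scale $\Psi^{-1}$ follows an inverse Wishart distribution with scale $\Psi$, and represents each uncertain object by sample points. All function names and parameter values are hypothetical.

```python
import numpy as np

def min_max_normalize(X):
    """Eq. (19), applied to each feature (column) separately."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)  # assumes no constant feature

def sample_inverse_wishart(scale, df, rng):
    """Draw S ~ inverse Wishart(scale, df) for integer df >= dimension."""
    p = scale.shape[0]
    # G^T G ~ Wishart(scale^{-1}, df); its inverse is inverse Wishart(scale, df).
    G = rng.multivariate_normal(np.zeros(p), np.linalg.inv(scale), size=df)
    S = np.linalg.inv(G.T @ G)
    return 0.5 * (S + S.T)  # symmetrize against rounding error

def make_uncertain_objects(X, scale, df, n_samples, rng):
    """One sample set per object: mean = normalized point, cov ~ inverse Wishart."""
    objects = []
    for mean in min_max_normalize(X):
        cov = sample_inverse_wishart(scale, df, rng)
        objects.append(rng.multivariate_normal(mean, cov, size=n_samples))
    return objects

# Hypothetical certain data with two features on different scales.
rng = np.random.default_rng(0)
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
X_norm = min_max_normalize(X)
S = sample_inverse_wishart(0.01 * np.eye(2), df=5, rng=rng)
objects = make_uncertain_objects(X, 0.01 * np.eye(2), df=5, n_samples=30, rng=rng)
```

Each element of `objects` is a 30 × 2 array of points drawn around one normalized data point, which is the form of input expected by a sample-based kernel distance.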

Algorithm 1: K-medoids for uncertain data using a probabilistic distance measure in feature space.
Input: $n$: the number of objects in cluster $k$; $K$: the number of clusters; iter = 0
Randomly select the cluster medoids $C^{0}=\{c_1^{0},\dots,c_K^{0}\}$ obtained from the initial clusters
Initialize
CVIs $=\{cvi_1,\dots,cvi_K\}$ obtained for UOSDU, UOSCH, UFSDU, and UFSCH
Repeat
  for $k=2$ to $K$
    $c_k^{old}=c_k^{0}$; $c_k^{new}=0$
    Compute the new medoids:
    while $c_k^{old}\neq c_k^{new}$
      $p=\arg\min_{1\le i\le n}\sum_{j=1}^{k}KPD_{Bhatt}(x_i,c_j^{k})$, where $j$ is an index of a cluster medoid in $c_k$
      $c_k^{new}=x_p$
    end
    Calculate $cvi_k$ using Equations (1), (2), (17), and (18).
  end
  iter = iter + 1
Until (iter = Maxiter)

Step 1: Randomly set K initial clusters of uncertain objects for a given dataset. Run the K-medoids clustering algorithm with different values of the K parameter (2 ≤ K ≤ 10).

Step 2: Obtain the medoid of each cluster, i.e., the object for which the sum of the probabilistic distances to the other objects is the smallest.

Step 3: Calculate the CVIs for all the partitions. We calculated the compactness and separability in kernel space using an RBF kernel function with bandwidth σ. The optimal value of σ was determined through a set of preliminary experiments over the candidate values {0.1, 0.2, …, 4}.

Step 4: We increased the reliability of the experimental results by replicating the experiment 100 times for the same dataset, with different trial seeds for the initial medoids in Step 1, and used the average value of each CVI for each number of clusters.

Step 5: Finally, we evaluated each CVI and compared the number of clusters suggested by the CVI with the actual number of clusters of the dataset.
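Steps 1 to 5 can be sketched on a precomputed distance matrix as follows. This is a simplified illustration rather than the algorithm of [19]: it uses a basic alternating K-medoids update, the Dunn-type index as the only CVI, Euclidean distances between toy points in place of $KPD_{Bhatt}$ values, and far fewer replications than the 100 used in the study. All names and data are hypothetical.

```python
import numpy as np

def k_medoids(D, K, rng, max_iter=100):
    """Basic K-medoids on a distance matrix; returns cluster labels."""
    medoids = rng.choice(D.shape[0], size=K, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new = medoids.copy()
        for k in range(K):
            idx = np.flatnonzero(labels == k)
            if idx.size:
                # New medoid: member with the smallest total distance.
                new[k] = idx[np.argmin(D[np.ix_(idx, idx)].sum(axis=1))]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return np.argmin(D[:, medoids], axis=1)

def dunn(D, labels):
    groups = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    sep = min(D[np.ix_(a, b)].min() for i, a in enumerate(groups) for b in groups[i + 1:])
    diam = max(D[np.ix_(a, a)].max() for a in groups)
    return sep / diam

def select_k(D, k_values, n_trials, seed=0):
    """Average the CVI over random restarts and return the maximizing K."""
    rng = np.random.default_rng(seed)
    mean_cvi = {}
    for K in k_values:
        scores = [dunn(D, k_medoids(D, K, rng)) for _ in range(n_trials)]
        mean_cvi[K] = float(np.mean(scores))
    return max(mean_cvi, key=mean_cvi.get), mean_cvi

# Toy data (hypothetical): two compact, very distant groups of points.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(100.0, 0.1, (20, 2))])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
best_k, mean_cvi = select_k(D, k_values=range(2, 6), n_trials=5)
```

For a max-type index such as the Dunn-type index, the suggested number of clusters is the K with the largest averaged value; a min-type index would use the smallest value instead.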

4.2. Experiments with Artificial and Real-World Datasets

Experiments were conducted to evaluate the proposed CVIs in comparison to the pre-existent CVIs. These experiments used 10 datasets with challenging characteristics, including arbitrary shapes, sub-clusters, asymmetry, and noise, provided by the UCI (https://archive.ics.uci.edu/, accessed on 10 March 2023) [30] and Tomas Barton repositories (https://github.com/deric/clustering-benchmark, accessed on 10 March 2023); the latter contains 122 artificial datasets with arbitrary shapes, sub-clusters, and asymmetric shapes in two or three features. The datasets from the UCI repository (e.g., D3, D4, D5, and D7) were collected in real environmental conditions, whereas the other datasets were artificially created, as can be checked in the Tomas Barton repository.

The summary of datasets used for the experiments is presented in Table 1. Two-dimensional (2D) and three-dimensional (3D) dataset shapes are illustrated in Figure 3. The CVI values were computed by changing the number of clusters (K) in each dataset and then comparing the predicted labels of experiments to the actual labels in the datasets.

Table 1. Summary of datasets.

| Dataset Index | Dataset Name | # of Obs. | # of Dim. | # of Clusters | Projection Shape |
|---|---|---|---|---|---|
| D1 | A.K Jain’s Toy | 373 | 2 | 2 | Asymmetry, Arbitrary shape |
| D2 | Flame | 240 | 2 | 2 | Sub-cluster, Noise |
| D3 | Iris | 150 | 4 | 3 | - |
| D4 | Thyroid | 215 | 5 | 2 | - |
| D5 | Wine | 178 | 13 | 3 | - |
| D6 | Wisconsin | 683 | 9 | 2 | - |
| D7 | Haberman | 301 | 3 | 2 | Random shape |
| D8 | Chainlink | 1000 | 3 | 2 | Sub-cluster, Arbitrary shape |
| D9 | Lsun | 400 | 2 | 3 | Asymmetry, Arbitrary shape |
| D10 | Zelnik1 | 299 | 2 | 3 | Sub-cluster |

Figure 3. Shapes of 2D and 3D datasets: (a) D1 dataset; (b) D2 dataset; (c) D7 dataset; (d) D8 dataset; (e) D9 dataset; (f) D10 dataset.

4.3. Performance Comparison of the Proposed CVIs

The experimental results are given in Tables 2–11. The actual number of clusters is given in parentheses with the name of the dataset and is also marked with an asterisk (*) next to the corresponding number of clusters in the header row. Moreover, the results for all datasets are summarized in Table 12, which quantifies the performance of the proposed CVIs. Each cell in Table 12 represents the optimal number of clusters K determined by the corresponding CVI.

Table 2. Performance results for D1 (actual number of clusters: 2).

| CVI \ # of Clusters | 2 * | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| UOSDU | 0.00075 | 0.00063 | 0.00049 | 0.00046 | 0.00043 | 0.00047 | 0.00044 | 0.00042 | 0.000410 |
| UOSCH | 554.4796 | 537.8279 | 573.5387 | 586.5310 | 576.5872 | 562.0666 | 575.2021 | 566.6556 | 567.6008 |
| UFSDU | 0.011830 | 0.00727 | 0.007410 | 0.006350 | 0.006920 | 0.006390 | 0.006740 | 0.00580 | 0.005630 |
| UFSCH | 256.0945 | 204.767 | 167.9338 | 149.5915 | 138.3076 | 128.206 | 122.4676 | 117.0263 | 112.4593 |

Table 3. Performance results for D2 (actual number of clusters: 2).

| CVI \ # of Clusters | 2 * | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| UOSDU | 0.00578 | 0.00581 | 0.00583 | 0.00533 | 0.00494 | 0.00494 | 0.00452 | 0.00454 | 0.00448 |
| UOSCH | 218.9052 | 188.6698 | 201.7685 | 195.0877 | 190.2412 | 190.7961 | 192.3785 | 187.7774 | 186.0032 |
| UFSDU | 0.01875 | 0.01433 | 0.01619 | 0.01386 | 0.01284 | 0.01261 | 0.01263 | 0.0125 | 0.01271 |
| UFSCH | 246.7711 | 190.3472 | 184.7522 | 163.52 | 150.3938 | 143.1108 | 138.9139 | 131.6189 | 127.3284 |

Table 4. Performance results for D3 (actual number of clusters: 3).

| CVI \ # of Clusters | 2 | 3 * | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| UOSDU | 0.57393 | 0.18691 | 0.06671 | 0.04599 | 0.03375 | 0.03045 | 0.02475 | 0.02443 | 0.02427 |
| UOSCH | 393.8149 | 340.7616 | 288.9103 | 257.4766 | 227.8328 | 211.7321 | 193.9894 | 179.4227 | 172.1492 |
| UFSDU | 0.78121 | 0.05291 | 0.0332 | 0.02818 | 0.0201 | 0.02217 | 0.02033 | 0.01676 | 0.01503 |
| UFSCH | 97.24412 | 100.9677 | 83.47847 | 74.68629 | 65.08186 | 59.80128 | 55.42499 | 51.32508 | 48.54411 |

Table 5. Performance results for D4 (actual number of clusters: 2).

| CVI \ # of Clusters | 2 * | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| UOSDU | 0.01059 | 0.00702 | 0.00447 | 0.00389 | 0.00338 | 0.00285 | 0.0029 | 0.00264 | 0.00254 |
| UOSCH | 52.44662 | 49.27229 | 45.29772 | 44.23136 | 46.29286 | 43.05835 | 40.0334 | 38.99379 | 36.43862 |
| UFSDU | 0.09045 | 0.02678 | 0.02097 | 0.01941 | 0.0186 | 0.0166 | 0.01728 | 0.01604 | 0.01577 |
| UFSCH | 88.16833 | 63.62494 | 54.54528 | 48.32164 | 43.33752 | 38.65073 | 35.53446 | 32.89777 | 30.6346 |

Table 6. Performance results for D5 (actual number of clusters: 3).

| CVI \ # of Clusters | 2 | 3 * | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| UOSDU | 0.28546 | 0.19218 | 0.16953 | 0.13451 | 0.13042 | 0.12188 | 0.1222 | 0.11775 | 0.11498 |
| UOSCH | 46.98845 | 41.61822 | 34.08324 | 29.45127 | 26.66111 | 23.71564 | 21.97848 | 20.8878 | 19.0692 |
| UFSDU | 0.1351 | 0.13992 | 0.12361 | 0.11102 | 0.1058 | 0.10544 | 0.10343 | 0.10402 | 0.10242 |
| UFSCH | 166.5115 | 94.11775 | 70.17926 | 57.15066 | 48.48219 | 42.44803 | 38.19718 | 34.55733 | 31.1674 |

Table 7. Performance results for D6 (actual number of clusters: 2).

| CVI \ # of Clusters | 2 * | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| UOSDU | 0.10223 | 0.04719 | 0.02262 | 0.01209 | 0.00742 | 0.00342 | 0.0014 | 0.00109 | 0.00075 |
| UOSCH | 237.829 | 186.8503 | 145.4631 | 119.3866 | 98.36381 | 89.72379 | 80.18472 | 70.83073 | 66.12163 |
| UFSDU | 0.22631 | 0.10763 | 0.04928 | 0.03902 | 0.01416 | 0.01228 | 0.0084 | 0.00605 | 0.00391 |
| UFSCH | 349.3685 | 261.4169 | 205.8692 | 171.2457 | 144.4285 | 124.5582 | 109.2653 | 97.50292 | 88.98401 |

Table 8. Performance results for D7 (actual number of clusters: 2).

| CVI \ # of Clusters | 2 * | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| UOSDU | 0.00198 | 0.0014 | 0.00112 | 0.00086 | 0.00078 | 0.00089 | 0.00069 | 0.00079 | 0.00076 |
| UOSCH | 128.8359 | 117.8517 | 104.8203 | 97.56451 | 95.82686 | 92.17925 | 86.85381 | 84.98897 | 82.71107 |
| UFSDU | 0.13021 | 0.02577 | 0.01681 | 0.01199 | 0.01108 | 0.01122 | 0.01132 | 0.01028 | 0.00945 |
| UFSCH | 319.3255 | 171.7169 | 127.0638 | 104.5919 | 90.63319 | 80.94442 | 72.86994 | 67.51974 | 62.62473 |

Table 9. Performance results for D8 (actual number of clusters: 2).

| CVI \ # of Clusters | 2 * | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| UOSDU | 0.00019 | 0.00017 | 0.00017 | 0.00017 | 0.00017 | 0.00018 | 0.00021 | 0.00019 | 0.00017 |
| UOSCH | 419.8882 | 371.9768 | 388.8548 | 430.2229 | 426.5956 | 430.8854 | 449.3122 | 438.7834 | 417.3569 |
| UFSDU | 0.00439 | 0.00237 | 0.00204 | 0.00114 | 0.0013 | 0.00118 | 0.00149 | 0.00153 | 0.0014 |
| UFSCH | 445.5408 | 463.2664 | 449.4758 | 439.8262 | 425.4487 | 411.5018 | 422.1565 | 428.8755 | 437.9047 |

Table 10. Performance results for D9 (actual number of clusters: 3).

| CVI \ # of Clusters | 2 | 3 * | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| UOSDU | 0.01277 | 0.00168 | 0.00087 | 0.00081 | 0.00069 | 0.00062 | 0.0006 | 0.00063 | 0.00054 |
| UOSCH | 316.7407 | 406.3877 | 395.188 | 401.578 | 380.968 | 363.1193 | 365.4242 | 349.9761 | 351.8199 |
| UFSDU | 0.01439 | 0.02006 | 0.01697 | 0.01119 | 0.00658 | 0.00574 | 0.00485 | 0.00472 | 0.00416 |
| UFSCH | 190.3465 | 205.1745 | 189.6315 | 175.6124 | 164.8462 | 154.2108 | 149.907 | 141.5702 | 133.6363 |

Table 11. Performance results for D10 (actual number of clusters: 3).

| CVI \ # of Clusters | 2 | 3 * | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| UOSDU | 0.030644 | 0.049296 | 0.048849 | 0.048798 | 0.046752 | 0.044594 | 0.042478 | 0.037749 | 0.041905 |
| UOSCH | 235.4205 | 161.3342 | 142.117 | 135.4194 | 127.012 | 126.4954 | 125.6673 | 123.9964 | 132.4379 |
| UFSDU | 0.00368 | 0.00123 | 0.00123 | 0.00115 | 0.00103 | 0.00087 | 0.00073 | 0.00077 | 0.00056 |
| UFSCH | 102.6013 | 106.5976 | 99.7133 | 98.79822 | 97.68495 | 95.67929 | 95.82844 | 96.62246 | 102.6371 |

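Table 12. Difference between the actual and estimated numbers of clusters in lower-dimensional datasets. A circled dot (⨀) marks a correct estimate; a number is the incorrect estimate returned by the CVI.

| Dataset | Dim | # of Clusters | UOSDU | UOSCH | UFSDU | UFSCH |
|---|---|---|---|---|---|---|
| D1 | 2 | 2 | ⨀ | 5 | ⨀ | ⨀ |
| D2 | 2 | 2 | 4 | ⨀ | ⨀ | ⨀ |
| D3 | 4 | 3 | 2 | 2 | 2 | ⨀ |
| D4 | 5 | 2 | ⨀ | ⨀ | ⨀ | ⨀ |
| D5 | 13 | 3 | 2 | 2 | ⨀ | 2 |
| D6 | 9 | 2 | ⨀ | ⨀ | ⨀ | ⨀ |
| D7 | 3 | 2 | ⨀ | ⨀ | ⨀ | ⨀ |
| D8 | 3 | 2 | 8 | 8 | ⨀ | 3 |
| D9 | 2 | 3 | 2 | ⨀ | ⨀ | ⨀ |
| D10 | 2 | 3 | ⨀ | 2 | 2 | ⨀ |
| # of successes in estimating the optimal number of clusters | | | 5 | 5 | 8 | 8 |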

In Tables 2–11, the optimal number of clusters K decided by each CVI is the K with the largest index value in its row (shown in bold with a gray-shaded background in the original tables). As presented in Table 2, three of the CVIs succeeded in estimating the number of clusters as two in D1, whereas UOSCH failed. The proposed UFSDU and UFSCH also successfully predicted the number of clusters in D2. In contrast, UOSDU failed to estimate the number of clusters in D2.

Although the proposed UFSDU index and the pre-existent CVIs failed to predict the number of clusters in D3, UFSCH was successful. All CVIs correctly predicted the number of clusters for D4, D6, and D7; see Table 5, Table 7 and Table 8. In contrast, the proposed UFSDU index is the only CVI that correctly predicted the actual number of clusters in D5, as presented in Table 6. Furthermore, the UFSDU index predicted the actual number of clusters of D8. D8’s shape (Figure 3) is visually distinguishable as two classes. However, it is challenging to calculate the compactness and separability of such clusters in the original space. Nevertheless, the UFSDU index was successful in this prediction, and the UFSCH index estimated the number of clusters as three, which is close to the actual number of clusters, two. The kernel transformation makes it easier to obtain accurate compactness and separability in the feature space than in the original space, leading to high-performance clustering.
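As a worked example of how the suggested number of clusters is read from the result tables, the short script below takes the D8 values reported in Table 9 and selects, for each index, the K with the largest value (all four indices are of the max. S/C type). The variable names are chosen for illustration only.

```python
import numpy as np

k_values = np.arange(2, 11)  # K = 2, ..., 10

# Index values for D8 as reported in Table 9.
table9 = {
    "UOSDU": [0.00019, 0.00017, 0.00017, 0.00017, 0.00017, 0.00018, 0.00021, 0.00019, 0.00017],
    "UOSCH": [419.8882, 371.9768, 388.8548, 430.2229, 426.5956, 430.8854, 449.3122, 438.7834, 417.3569],
    "UFSDU": [0.00439, 0.00237, 0.00204, 0.00114, 0.0013, 0.00118, 0.00149, 0.00153, 0.0014],
    "UFSCH": [445.5408, 463.2664, 449.4758, 439.8262, 425.4487, 411.5018, 422.1565, 428.8755, 437.9047],
}

# The suggested K is the one that maximizes each index.
selected = {name: int(k_values[np.argmax(values)]) for name, values in table9.items()}
print(selected)  # {'UOSDU': 8, 'UOSCH': 8, 'UFSDU': 2, 'UFSCH': 3}
```

The output matches the D8 row of Table 12: only UFSDU recovers the actual number of clusters (two), while UFSCH suggests three and both original-space indices suggest eight.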

The UOSCH index and the new CVIs predicted the number of clusters to be three in D9, and the UOSDU and UFSCH indices successfully estimated the number of clusters in D10. Table 12 presents a summary of the results for the 10 datasets above, where a circled dot (⨀) indicates that the CVI accurately predicted the actual number of clusters. As presented in Table 12, each of the pre-existent CVIs correctly estimated the number of clusters for five datasets, whereas each of the newly proposed CVIs correctly predicted the number of clusters for eight datasets, three more than the pre-existent CVIs.

5. Conclusions

In this study, we proposed novel cluster validity indices (CVIs) for uncertain data objects in feature space. Unlike conventional CVIs in original space, the proposed CVIs handle uncertain data objects whose clusters have arbitrary shapes, sub-clusters, and noise, which are hard to evaluate, by transforming the uncertain data from the original space to the feature space with a kernel function. The proposed CVIs measure the compactness and separability of each cluster in kernel space, which maps the original data into a higher-dimensional space, leading to less sensitivity to arbitrary cluster shapes and more robustness to noise and outliers. We compared the performances of the proposed CVIs with those of pre-existent CVIs that consider only the original space. The Bhattacharyya distance, one of the most widely used probabilistic distance measures, was used to capture the distances between probability density functions in experiments with several artificial and real-life datasets. These numerical examples confirmed that the proposed CVIs are robust to arbitrary cluster shapes, especially sub-clusters, and are promising alternatives for evaluating the fitness of clustering results and finding the optimal number of clusters, K. The proposed CVIs outperform the pre-existent CVIs because kernel functions are applied to the uncertain data, transforming them from the original space to the feature space. As for practical significance, the proposed CVIs could be utilized in diverse applications. For example, Kim et al. proposed a new multivariate kernel density estimator for uncertain data classification of mixed defect patterns on DRAM wafer maps [31]. The proposed CVI method could be applied to evaluate the number of defect patterns on wafer maps.

However, there are some limitations to the proposed CVIs. The uncertain data are assumed in advance to follow multivariate normal distributions so that the distances between different uncertain data objects can be computed. In practice, the uncertainty may follow a variety of probability functions (normal distribution, exponential distribution, etc.), and some uncertain data cannot be strictly modeled by PDFs. This limitation might be overcome through methods for generating random variables and through support-measure data description, a non-parametric machine learning method that does not require a prior distribution to be assumed.

Future research should consider the compactness measure in kernel space in advanced machine learning algorithms, such as support vector data descriptions or Bayesian frameworks of Bayesian support vector data descriptions. The concepts of our CVIs can also be applied to other clustering algorithms.

Abbreviations

The following abbreviations are used in this manuscript:

Bhatt Bhattacharyya distance measure
C/S Compactness/Separability
CH Calinski–Harabasz
CVIs Cluster validity indices
DB Davies–Bouldin
DU Dunn
KPD Kernel-based probabilistic distance
PD Probabilistic distance
PDF Probability density function
RBF Radial basis function
S/C Separability/Compactness
UFSCH Uncertain feature space CH
UFSDU Uncertain feature space DU
UOSCH Uncertain original space CH
UOSDU Uncertain original space DU

Author Contributions

Conceptualization, Y.-S.J.; data curation, C.K.; formal analysis, Y.-S.J.; investigation, B.T. and Y.-S.J.; methodology, C.K. and Y.-S.J.; resources, B.T.; software, B.T.; supervision, Y.-S.J.; validation, J.B.; visualization, J.B.; writing—original draft, C.K.; writing—review and editing, J.B., B.T. and Y.-S.J. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The real-world datasets used in this study are available at https://archive.ics.uci.edu/ml/index.php (accessed on 10 March 2023); the artificial datasets that contain data sensitive to shapes are available at https://github.com/deric/clustering-benchmark/tree/master/ (accessed on 10 March 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Funding Statement

This work was supported by the LG Yonam Foundation (of the Republic of Korea) and by the National Research Foundation of the Republic of Korea (Grant Nos. NRF-2021S1A5A8060639 and NRF-2022R1F1A1063174).

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

References

  • 1. Abdalameer A.K., Alswaitti M., Alsudani A.A., Isa N.A. A new validity clustering index-based on finding new centroid positions using the mean of clustered data to determine the optimum number of clusters. Expert Syst. Appl. 2022;191:116329. doi: 10.1016/j.eswa.2021.116329.
  • 2. Irani J., Pise N., Phatak M. Clustering techniques and the similarity measures used in clustering: A survey. Int. J. Comput. Appl. Technol. 2016;134:9–14. doi: 10.5120/ijca2016907841.
  • 3. MacQueen J.B. Some methods for classification and analysis of multivariate observations; Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; Berkeley, CA, USA. 27 December 1965–7 January 1966; Santa Barbara, CA, USA: The Regents of the University of California; 1967. pp. 281–297.
  • 4. Li M.J., Ng M.K., Cheung Y.-m., Huang J.Z. Agglomerative fuzzy k-means clustering algorithm with selection of number of clusters. IEEE Trans. Knowl. Data Eng. 2008;20:1519–1534. doi: 10.1109/TKDE.2008.88.
  • 5. Mahesh Kumar K., Rama Mohan Reddy A. A fast DBSCAN clustering algorithm by accelerating neighbor searching using groups method. Pattern Recognit. 2016;58:39–48. doi: 10.1016/j.patcog.2016.03.008.
  • 6. Chien C.-F., Wang W.-C., Cheng J.-C. Data mining for yield enhancement in semiconductor manufacturing and an empirical study. Expert Syst. Appl. 2007;33:192–198. doi: 10.1016/j.eswa.2006.04.014.
  • 7. El-shafeiy E., Sallam K.M., Chakrabortty R.K., Abohany A.A. A clustering based swarm intelligence optimization technique for the internet of medical things. Expert Syst. Appl. 2021;173:114648. doi: 10.1016/j.eswa.2021.114648.
  • 8. Aggarwal C.C., Yu P.S. A survey of uncertain data algorithms and applications. IEEE Trans. Knowl. Data Eng. 2009;21:609–623. doi: 10.1109/TKDE.2008.190.
  • 9. Shou L., Zhang X., Chen G., Gao Y., Chen K. Mud: Mapping-based query processing for high-dimensional uncertain data. Inf. Sci. 2012;198:147–168. doi: 10.1016/j.ins.2012.02.023.
  • 10. Duan X., Ma Y., Zhou Y., Huang H., Wang B. A novel cluster validity index based on augmented non-shared nearest neighbors. Expert Syst. Appl. 2023;223:119784. doi: 10.1016/j.eswa.2023.119784.
  • 11. Lee S.-H., Jeong Y.-S., Kim J.-Y., Jeong M.K. A new clustering validity index for arbitrary shape of Clusters. Pattern Recognit. Lett. 2018;112:263–269. doi: 10.1016/j.patrec.2018.08.005.
  • 12. Dunn J.C. Well-separated clusters and optimal fuzzy partitions. J. Cybern. 1974;4:95–104. doi: 10.1080/01969727408546059.
  • 13. Calinski T., Harabasz J. A dendrite method for cluster analysis. Commun. Stat.-Theory Methods. 1974;3:1–27. doi: 10.1080/03610927408827101.
  • 14. Davies D.L., Bouldin D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979;PAMI-1:224–227. doi: 10.1109/TPAMI.1979.4766909.
  • 15. Xie X.L., Beni G. A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell. 1991;13:841–847. doi: 10.1109/34.85677.
  • 16. Rojas-Thomas J.C., Santos M., Mora M. New internal index for clustering validation based on graphs. Expert Syst. Appl. 2017;86:334–349. doi: 10.1016/j.eswa.2017.06.003.
  • 17. Tavakkol B., Jeong M.K., Albin S.L. Validity indices for clusters of uncertain data objects. Ann. Oper. Res. 2018;303:321–357. doi: 10.1007/s10479-018-3043-4.
  • 18. Wang J.-S., Chiang J.-C. A cluster validity measure with a hybrid parameter search method for the support vector clustering algorithm. Pattern Recognit. 2008;41:506–520. doi: 10.1016/j.patcog.2007.06.027.
  • 19. Jiang B., Pei J., Tao Y., Lin X. Clustering uncertain data based on probability distribution similarity. IEEE Trans. Knowl. Data Eng. 2013;25:751–763. doi: 10.1109/TKDE.2011.221.
  • 20. Tavakkol B., Jeong M.K., Albin S.L. Object-to-group probabilistic distance measure for uncertain data classification. Neurocomputing. 2017;230:143–151. doi: 10.1016/j.neucom.2016.12.007.
  • 21. Arbelaitz O., Gurrutxaga I., Muguerza J., Pérez J.M., Perona I. An extensive comparative study of cluster validity indices. Pattern Recognit. 2013;46:243–256. doi: 10.1016/j.patcog.2012.07.021.
  • 22. Rezaee B. A cluster validity index for Fuzzy Clustering. Fuzzy Sets Syst. 2010;161:3014–3025. doi: 10.1016/j.fss.2010.07.005.
  • 23. Bhattacharyya A. On a measure of divergence between two multinomial populations. Sankhya Indian J. Stat. 1946;7:401–406.
  • 24. Kullback S., Leibler R.A. On information and sufficiency. Ann. Math. Stat. 1951;22:79–86. doi: 10.1214/aoms/1177729694.
  • 25. Tavakkol B., Son Y. Fuzzy kernel K-medoids clustering algorithm for uncertain data objects. Pattern Anal. Appl. 2021;24:1287–1302. doi: 10.1007/s10044-021-00983-z.
  • 26. Zhou S.K., Chellappa R. From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Trans. Pattern Anal. Mach. Intell. 2006;28:917–929. doi: 10.1109/TPAMI.2006.120.
  • 27. Patle A., Chouhan D.S. SVM kernel functions for classification; Proceedings of the 2013 International Conference on Advances in Technology and Engineering (ICATE); Mumbai, India. 23–25 January 2013.
  • 28. Tbarki K., Ben Said S., Ksantini R., Lachiri Z. RBF kernel based SVM Classification for landmine detection and discrimination; Proceedings of the 2016 International Image Processing, Applications and Systems (IPAS); Sfax, Tunisia. 5–7 November 2016.
  • 29. Nydick S.W. The wishart and inverse wishart distributions. Electron. J. Stat. 2012;6:1–19.
  • 30. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/ (accessed on 28 March 2023).
  • 31. Kim B., Jeong Y.-S., Jeong M.K. New multivariate kernel density estimator for uncertain data classification. Ann. Oper. Res. 2020;303:413–431. doi: 10.1007/s10479-020-03715-4.

Articles from Sensors (Basel, Switzerland) are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
