Cluster Validity Index for Uncertain Data Based on a Probabilistic Distance Measure in Feature Space

Changwan Ko; Jaeseung Baek; Behnam Tavakkol; Young-Seon Jeong

doi:10.3390/s23073708

2023 Apr 3;23(7):3708. doi: 10.3390/s23073708

Editors: Qiong Wang, Teng Huang, Yan Pang

PMCID: PMC10099331 PMID: 37050769

Abstract

Cluster validity indices (CVIs), which evaluate clustering results to determine the optimal number of clusters, are critical measures in clustering problems. Most CVIs are designed for conventional data objects called certain data objects. Certain data objects have a single value and include no uncertainty, whereas data in the real world are often subject to uncertainty. In this study, new CVIs for uncertain data are proposed. They are based on kernel probabilistic distance measures, which calculate the distance between two distributions in feature space, and they target uncertain clusters with arbitrary shapes, sub-clusters, and noise in objects. By transforming the original uncertain data into kernel spaces, the proposed CVIs accurately measure the compactness and separability of a cluster for arbitrary cluster shapes and are robust to noise and outliers in a cluster. The proposed CVIs were evaluated on diverse types of simulated and real-life uncertain objects, confirming that the proposed validity indices in feature space outperform the pre-existing ones in the original space.

Keywords: uncertain data, cluster validity index, kernel probabilistic distance, feature space

1. Introduction

The purpose of clustering is to partition objects into groups such that the similarity within groups and the dissimilarity among different groups are maximized [1,2]. Although clustering methods have been widely used in many applications, most clustering algorithms do not provide the optimal number of clusters. Partition-based clustering algorithms such as K-means clustering [3] require the number of clusters to be preset [4]. As cluster information is rarely known in the real world, it is crucial to evaluate the clustering results obtained with different numbers of clusters. Although many clustering methods exist for diverse applications, such as pattern recognition [5], semiconductor manufacturing [6], and healthcare [7], they have been developed primarily for certain data, i.e., fixed values. However, the uncertainty embedded in data is essential in many applications. For instance, a patient's blood pressure may not be consistent because of environmental conditions and instrument errors. Furthermore, measurement values change continuously because of the positions of instrumentation devices or workers' conditions. Aside from these examples, data randomness, missing data, delayed updates, and worker fatigue are other sources of data uncertainty [8,9].

Uncertain data are assumed to be prevalent in the real world, arising, e.g., from measurement errors and environmental conditions. The uncertainty of an uncertain data object can be expressed by a probability density function (PDF). Figure 1 illustrates two uncertain data objects, each described by a PDF. The standard way of handling uncertain data is to convert each object into certain data through a summary statistic (e.g., the mean or median). However, such statistics discard information about the uncertainty that is significant for characterizing uncertain objects.

Figure 1. Two uncertain data objects, each expressed by a PDF.

Cluster validity indices (CVIs), which are indicators for validating the quality of clustering results, have been widely used to determine the correct number of clusters for the given data. Because CVIs use only the information in the input data, they must be chosen according to the characteristics of the data. The two components of a CVI are compactness and separability measures. The former refers to an intra-cluster distance, and the latter represents an inter-cluster distance. Most CVIs regard a partition as good when it produces a small compactness value and a high separability value. However, the existing CVIs tend to validate clustering results incorrectly when the clusters are not spherical [10,11].

For certain data, several CVIs, such as the Dunn [12], Calinski–Harabasz [13], Davies–Bouldin [14], and Xie–Beni [15] indices, have been proposed based on combinations of compactness and separability measures. Relatively new CVIs incorporate other techniques into pre-existing CVIs, such as the K-nearest neighbor algorithm, which is used to compute compactness and separation by taking into account shared/non-shared data pairs [10], and principal component analysis, which is used to capture the geometry of the clusters [16]; other work develops clustering algorithms that produce more well-separated clusters [1]. However, most of the existing CVIs have been developed for certain data, and there have been few studies on uncertain data.

To apply the existing CVI formulas to uncertain data, the distance measures used for compactness and separability must be changed. In a study of uncertain CVIs, Tavakkol et al. [17] proposed CVIs for uncertain data that calculate the distance between two uncertain objects using probabilistic distance measures in the original space. However, this approach is sensitive to arbitrary cluster shapes, sub-clusters, and outliers, because the shape of the clusters may cause inaccurate compactness and separability values [11].

Consequently, this study proposes new CVIs for uncertain data objects based on kernel probabilistic distance measures in feature space. The proposed CVIs adopt the kernel-based Bhattacharyya probabilistic distance in kernel spaces. In kernel space, the proposed CVIs produce accurate compactness and separability values for arbitrarily shaped clusters by transforming them into elliptical shapes in feature space. Figure 2 illustrates that the ambiguous shape of a dataset in the original space is transformed into a roughly elliptical or circular shape in feature space; thus, the kernel transformation can improve the accuracy of the computed compactness and separability. Furthermore, the proposed approaches could be robust to noise and outliers in a cluster. The performance of the proposed CVIs was evaluated through diverse experiments, including simulated and real-life datasets.

Figure 2. Visualization of kernel transformation: (a) asymmetric shape in original space; (b) transformed shape in feature space.

This paper is organized as follows. Section 2 reviews the previous studies on CVIs. New CVIs for uncertain data based on a kernel probabilistic distance measure are proposed in Section 3. After the extensive experiments are presented in Section 4, the conclusions and future studies are provided in Section 5.

2. Related Work

2.1. CVI for Certain Data

In the past few decades, many CVIs have been developed to determine the optimal number of clusters. Most CVIs focus on calculating compactness and separability measures, which are combined into either a ratio-type or a summation-type index. This section presents several popular CVIs that have been evaluated in many applications.

The Dunn (DU) index [12]:

DU_{K} = \frac{\min_{i, j = 1, \dots, K, i \neq j} \{\min_{x \in C_{i}, y \in C_{j}} d (x, y)\}}{\max_{i = 1, \dots, K} \{\max_{x, y \in C_{i}} d (x, y)\}} .

(1)

Compactness is computed as the maximum diameter among all clusters, and separability as the minimum pair-wise distance between objects in different clusters. The DU index is the ratio of separability to compactness. Thus, the number of clusters that maximizes the DU index is taken as optimal (max. S/C).
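As an illustration of Equation (1), the following minimal sketch computes a Dunn-type index from a precomputed pairwise distance matrix. The function name `dunn_index` and the toy data are assumptions made for this example only; any distance $d(x, y)$ can be supplied through the matrix.

```python
import numpy as np

def dunn_index(dist, labels):
    """Dunn-type index from a precomputed pairwise distance matrix (sketch)."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    # Mask of object pairs that belong to different clusters.
    different = labels[:, None] != labels[None, :]
    # Separability: smallest distance between objects in different clusters.
    separability = dist[different].min()
    # Compactness: largest within-cluster distance (maximum diameter).
    compactness = dist[~different].max()
    return separability / compactness

# Toy data (hypothetical): two well-separated groups on a line.
points = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(points[:, None] - points[None, :])
labels = np.array([0, 0, 1, 1])
du = dunn_index(dist, labels)  # separability 9, compactness 1
```

Working on a distance matrix rather than on coordinates keeps the sketch independent of the distance measure, so the same function could be reused with a probabilistic distance between uncertain objects.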

Calinski–Harabasz (CH) index [13]:

CH_{K} = \frac{\sum_{i = 1}^{K} n_{i} \cdot d {(z_{i}, z_{tot})}^{2}}{K - 1} \cdot \frac{n - K}{\sum_{i = 1}^{K} \sum_{x \in C_{i}} d {(x, z_{i})}^{2}}

(2)

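Like the DU index, the CH index is a ratio of separability to compactness. Here, $z_{i}$ is the centroid of cluster $C_{i}$, $n_{i}$ is the number of objects in $C_{i}$, $n$ is the total number of objects, and $z_{tot}$ is the centroid of the entire dataset. Separability and compactness are computed using the between- and within-cluster sums of squares, respectively. Thus, the partition that maximizes the CH index is taken as optimal (max. S/C).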

The Davies–Bouldin (DB) index [14]:

DB_{K} = \frac{1}{K} \sum_{i = 1}^{K} \max_{j = 1, \dots, K, j \neq i} \{(\sqrt{\frac{1}{n_{i}} \sum_{x \in C_{i}} d {(x, z_{i})}^{2}} + \sqrt{\frac{1}{n_{j}} \sum_{y \in C_{j}} d {(y, z_{j})}^{2}}) / d (z_{i}, z_{j})\}

(3)

where $z_{i}$ and $z_{j}$ are the centroids of clusters $C_{i}$ and $C_{j}$. Unlike the DU index, which considers the compactness and separability of the partition as a whole, the DB index evaluates them for individual clusters. Compactness is the root mean squared distance of the objects in a cluster to its centroid, and separability is the distance between the centroids of two clusters. The DB index is a ratio of compactness to separability. Therefore, the partition that minimizes the DB index is taken as optimal (min. C/S).
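The DB index can be sketched as follows for certain data with the Euclidean distance. The sketch assumes the common form of the index, in which the worst-case ratio of each cluster is averaged over the $K$ clusters; the function name `davies_bouldin` and the toy data are hypothetical.

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB-type index with Euclidean distance (sketch, certain data only)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # Compactness: root mean squared distance of objects to their centroid.
    scatter = np.array([
        np.sqrt(np.mean(np.sum((X[labels == c] - z) ** 2, axis=1)))
        for c, z in zip(ids, centroids)
    ])
    k = len(ids)
    worst = []
    for i in range(k):
        # Ratio of summed compactness to centroid separation for every j != i.
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))

# Toy data (hypothetical): two clusters with centroids (1, 0) and (11, 0).
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
db = davies_bouldin(X, labels)  # (1 + 1) / 10 for both clusters
```

Because the index is of the min. C/S type, a smaller value indicates a better partition.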

The pre-existing CVIs are sensitive to sub-clusters, arbitrary shapes, and noise in clusters for the compactness measure [18]. This study overcomes those drawbacks by conducting a spatial transformation from the original space into feature space using a kernel function that correctly measures cluster compactness and separability.

2.2. CVI for Uncertain Data

Most CVIs have focused on certain data or fixed values [19]. Certain data do not have uncertainty caused by several factors and environments such as sensor measurement error, repeated measurements by workers, or equipment operating environments. Uncertain data objects come in two possible forms: (1) multiple points for each object and (2) a PDF for each object, either given or obtained by fitting the multiple points [20]. Several studies related to clustering uncertain data have been conducted. However, CVIs for uncertain data have rarely been used. The CVIs are crucial criteria for validating the results of clusters [21,22] to find the appropriate number of clusters. Therefore, the study of CVIs for uncertain data is necessary.

In this study, the proposed CVIs use kernel probabilistic distance measures to compute the distance between two uncertain data objects. There are many popular probabilistic distance measures, such as Bhattacharyya distance [23], Wasserstein distance, and Kullback–Leibler divergence [24]. This study uses the Bhattacharyya distance measure. The Bhattacharyya distance measure is one of the widely used probabilistic distance measures and has been generally used in diverse applications.

The Bhattacharyya distance between two probability distributions can be calculated in discrete and continuous cases. Let $p$ and $q$ be the continuous probability distributions over the same space. The definition of the Bhattacharyya distance for a continuous case in original space can be described as follows:

PD_{Bhatt} (p, q) = - \ln \{\int_{x} \sqrt{p (x) q (x)} \, dx\}

(4)

There are closed-form solutions for many probabilistic distance measures, including the Bhattacharyya distance, for cases where uncertain data objects are modeled with multivariate normal distributions. As probabilistic distance measures can capture the distance between PDFs, they can also be used to capture the distance between uncertain data objects [25]. The Bhattacharyya distance is a special case of the Chernoff distance with parameters $α_{1} = α_{2} = 1 / 2$, and the closed form of the Bhattacharyya distance for multivariate normal PDFs is defined in Equation (5):

PD_{Bhatt} (p, q) = \frac{1}{8} {(μ_{p} - μ_{q})}^{T} {(\frac{Σ_{p} + Σ_{q}}{2})}^{- 1} (μ_{p} - μ_{q}) + \frac{1}{2} \ln (\frac{|\frac{Σ_{p} + Σ_{q}}{2}|}{\sqrt{|Σ_{p}| |Σ_{q}|}})

(5)

where $μ_{p}$ and $μ_{q}$ are the means, and $Σ_{p}$ and $Σ_{q}$ are the covariance matrices of $p \sim MVN (μ_{p}, Σ_{p})$ and $q \sim MVN (μ_{q}, Σ_{q})$.
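The closed form for multivariate normal PDFs can be sketched numerically as follows. The sketch assumes the standard form of the Bhattacharyya distance, which uses the averaged covariance $(Σ_{p} + Σ_{q}) / 2$; the function name `bhattacharyya_gaussian` is hypothetical.

```python
import numpy as np

def bhattacharyya_gaussian(mu_p, cov_p, mu_q, cov_q):
    """Bhattacharyya distance between two multivariate normal PDFs (sketch)."""
    mu_p, mu_q = np.asarray(mu_p, float), np.asarray(mu_q, float)
    cov_p, cov_q = np.asarray(cov_p, float), np.asarray(cov_q, float)
    cov_avg = 0.5 * (cov_p + cov_q)
    diff = mu_p - mu_q
    # Mean term: (1/8) * diff^T cov_avg^{-1} diff, solved without an explicit inverse.
    mean_term = 0.125 * diff @ np.linalg.solve(cov_avg, diff)
    # Covariance term: (1/2) * ln(|cov_avg| / sqrt(|cov_p| |cov_q|)),
    # computed with log-determinants for numerical stability.
    _, logdet_avg = np.linalg.slogdet(cov_avg)
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    cov_term = 0.5 * (logdet_avg - 0.5 * (logdet_p + logdet_q))
    return float(mean_term + cov_term)

# Two objects with identity covariance whose means are 2 units apart.
d_shift = bhattacharyya_gaussian([0.0, 0.0], np.eye(2), [2.0, 0.0], np.eye(2))
# Identical objects have zero distance.
d_same = bhattacharyya_gaussian([1.0, 1.0], np.eye(2), [1.0, 1.0], np.eye(2))
```

With equal covariances the logarithmic term vanishes, so the distance reduces to one eighth of the squared Mahalanobis distance between the means.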

This study models the Bhattacharyya distance between two uncertain data objects in kernel space. We can compute the probabilistic distance between two uncertain data objects in feature space using a kernel function.

3. Proposed CVIs for Uncertain Data

3.1. Kernel Probabilistic Distance Measure in Feature Space

Computing the probabilistic distance is a nontrivial problem. We can compute the Bhattacharyya distance in feature space by referring to several steps developed by Zhou and Chellappa [26]. In capturing the probabilistic distance, suppose that $x_{1} = \{x_{11}, x_{21}, \dots, x_{N 1}\}$ and $x_{2} = \{x_{12}, x_{22}, \dots, x_{N 2}\}$ are the given objects in original space $ℝ^{d}$ with a multivariate normal density function:

N (x; μ, Σ) = \frac{1}{\sqrt{{(2 π)}^{d} |Σ|}} \exp \{- \frac{1}{2} {(x - μ)}^{T} Σ^{- 1} (x - μ)\}

(6)

The radial basis function (RBF) kernel displayed in Equation (7) can be used to transfer the original data into feature space for calculating the distance between uncertain data objects $x_{1}$ and $x_{2}$. The RBF kernel is commonly used in various fields and algorithms because it often outperforms other kernel functions [27,28].

K_{i j} = \exp (- \frac{1}{2 σ^{2}} ‖ x_{i} - x_{j} ‖^{2}), i, j = 1, 2

(7)
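A minimal sketch of Equation (7) is given below: it evaluates the RBF kernel between every sample of one object and every sample of another, yielding the Gram block $K_{ij}$. The function name `rbf_gram` and the sample values are assumptions for illustration.

```python
import numpy as np

def rbf_gram(A, B, sigma):
    """RBF Gram block with entries exp(-||a - b||^2 / (2 sigma^2)) (sketch)."""
    A = np.atleast_2d(np.asarray(A, float))
    B = np.atleast_2d(np.asarray(B, float))
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    sq = np.maximum(sq, 0.0)  # clip tiny negative values caused by round-off
    return np.exp(-sq / (2.0 * sigma ** 2))

# Hypothetical samples of one object in R^2.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_gram(X, X, sigma=1.0)
```

The bandwidth $σ$ controls how quickly the similarity decays with distance and has to be chosen by the user.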

Let $ϕ : ℝ^{d} \to ℝ^{f}$, with $f ≫ d$, be the non-linear mapping function induced by the kernel function $K (x_{1}, x_{2})$, where $x_{1}, x_{2} \in ℝ^{d}$. The kernel Gram matrix $K$ is defined as $K = Φ^{T} Φ$, where $Φ : = Φ_{N} = [ϕ (x_{1}), ϕ (x_{2}), \dots, ϕ (x_{N})] \in ℝ^{f \times N}$ represents the data transformed to kernel space. The mean $μ$ and covariance matrix $Σ$ in feature space are estimated as:

μ = N^{- 1} \sum_{n = 1}^{N} ϕ (x_{n}) = Φ s,

(8)

Σ = N^{- 1} \sum_{n = 1}^{N} (ϕ_{n} - μ) {(ϕ_{n} - μ)}^{T} = Φ J J^{T} Φ^{T},

(9)

where $ϕ_{n} = ϕ (x_{n})$, $J = \frac{1}{\sqrt{N}} (I_{N} - s \vec{1})$ with $s_{N \times 1} = \frac{1}{N} {\vec{1}}^{T}$, and $\vec{1} = [1, 1, \dots, 1]$ is the $1 \times N$ row vector of ones.

The covariance matrix $Σ$ must be converted into approximation form because of its rank-deficient characteristic $f ≫ d$ . Therefore, we can use the approximation form as follows:

C = Φ J J^{T} Φ^{T} + ρ I_{f} = W W^{T} + ρ I_{f} = Φ A Φ^{T} + ρ I_{f},

(10)

where $W \dot{=} Φ J Q$, $A \dot{=} J Q Q^{T} J^{T}$, and $ρ$ is a user parameter that must be specified in advance.

Obtaining the matrix $Q$ requires computing the diagonal matrix $Λ_{r}$ of the top $r$ eigenvalues and the matrix $V_{r}$ of the corresponding eigenvectors of $\bar{K} = J^{T} K J$, where $r$ is a pre-specified parameter; $r = 3$ is used in this study. $Q$ is an $N \times r$ matrix calculated as follows:

Q \dot{=} V_{r} {(I_{r} - ρ Λ_{r}^{- 1})}^{1 / 2} .

(11)
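The construction of $J$, $\bar{K}$, and $Q$ can be sketched as follows. The function names, the grid of sample points, and the value $ρ = 10^{-3}$ are assumptions chosen for illustration; the source does not report the value of $ρ$. The sketch also assumes that $ρ$ is smaller than each of the top $r$ eigenvalues so that the square root in Equation (11) is real.

```python
import numpy as np

def centering_matrix(n):
    """J = N^{-1/2} (I_N - s 1), with s = (1/N) 1^T."""
    return (np.eye(n) - np.full((n, n), 1.0 / n)) / np.sqrt(n)

def low_rank_factor(K, r, rho):
    """Return J, Q = V_r (I_r - rho * Lambda_r^{-1})^{1/2}, and the top-r eigenvalues."""
    n = K.shape[0]
    J = centering_matrix(n)
    K_bar = J.T @ K @ J
    eigvals, eigvecs = np.linalg.eigh(K_bar)   # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:r]        # indices of the r largest
    lam, V = eigvals[idx], eigvecs[:, idx]
    if np.any(lam <= rho):
        raise ValueError("rho must be smaller than the top-r eigenvalues")
    # Scaling column i of V by sqrt(1 - rho / lam_i) equals V (I - rho Lambda^{-1})^{1/2}.
    Q = V * np.sqrt(1.0 - rho / lam)
    return J, Q, lam

# Illustrative Gram matrix: RBF kernel (sigma = 1) on a 4 x 4 grid of points.
grid = np.array([[i, j] for i in range(4) for j in range(4)], dtype=float) * 0.5
sq_dist = np.sum((grid[:, None, :] - grid[None, :, :]) ** 2, axis=2)
K = np.exp(-sq_dist / 2.0)
rho = 1e-3
J, Q, lam = low_rank_factor(K, r=3, rho=rho)
```

Only the $N \times N$ Gram matrix is needed, so the computation never forms the high-dimensional feature vectors explicitly.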

Define matrix $P$ as:

P_{(N_{1} + N_{2}) \times (r_{1} + r_{2})} = [\begin{matrix} \sqrt{α_{1}} J_{1} Q_{1} & 0 \\ 0 & \sqrt{α_{2}} J_{2} Q_{2} \end{matrix}] .

(12)

Because the Bhattacharyya distance is a special case of the Chernoff distance, $α_{1} = α_{2} = 1 / 2$ is set for all experiments. The $τ_{i}$, $i = 1, \dots, r_{1} + r_{2}$, are the eigenvalues of the matrix $L_{ch}$, with dimensions of $(r_{1} + r_{2}) \times (r_{1} + r_{2})$, given by

L_{ch} = P^{T} [\begin{matrix} Φ_{1}^{T} \\ Φ_{2}^{T} \end{matrix}] [\begin{matrix} Φ_{1} & Φ_{2} \end{matrix}] P = P^{T} [\begin{matrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{matrix}] P .

(13)

Scalar values $ε_{11}, ε_{12}, ε_{22}$ are computed by Equation (14).

ε_{i j} = s_{i}^{T} K_{i j} s_{j} - s_{i}^{T} [K_{i 1} K_{i 2}] B_{c h} [\begin{matrix} K_{1 j} \\ K_{2 j} \end{matrix}] s_{j}

(14)

where $B_{c h} = P {(ρ I_{r_{1} + r_{2}} + L_{c h})}^{- 1} P^{T}$ with dimensions of $(N_{1} + N_{2}) \times (N_{1} + N_{2})$ .

The kernel-based probabilistic Bhattacharyya distance between two uncertain data objects $x_{1}$ and $x_{2}$ in feature space is calculated as follows:

KPD_{Bhatt} = 0.5 [α_{1} α_{2} ρ^{- 1} (ε_{11} + ε_{22} - 2 ε_{12}) + 0.5 \sum_{i = 1}^{r_{1} + r_{2}} \log \frac{ρ + τ_{i}}{λ_{i, 1}} + \sum_{i = 1}^{r_{1} + r_{2}} \log \frac{ρ + τ_{i}}{λ_{i, 2}}],

(15)

where $λ_{i, j}$ , $i = 1, \dots, r_{j}$ are the eigenvalues of $C_{j}$ :

λ_{i, j} = \begin{cases} λ_{i, j}, & \text{when } i = 1, \dots, r_{j} \\ ρ, & \text{when } i = r_{j} + 1, \dots, r_{1} + r_{2} \end{cases}

(16)

3.2. New CVI for Uncertain Data

The uncertain data objects in each cluster are transformed into feature space by applying a kernel function so that compactness and separability can be computed there. The mapped uncertain data objects are used to compute the distances from which compactness and separability, and hence the values of the proposed CVIs, are obtained. The value of each index changes with the number of clusters K. The proposed uncertain feature space DU (UFSDU) and uncertain feature space CH (UFSCH) indices are defined in Equations (17) and (18), respectively:

UFSDU index:

UFSDU_{K} = \frac{\min_{i, j = 1, \dots, K, i \neq j} \{\min_{x \in C_{i}, y \in C_{j}} KPD_{Bhatt} (x, y)\}}{\max_{i = 1, \dots, K} \{\max_{x, y \in C_{i}} KPD_{Bhatt} (x, y)\}}

(17)

UFSCH index:

UFSCH_{K} = \frac{\sum_{i = 1}^{K} n_{i} \cdot KPD_{Bhatt} {(z_{i}, z_{tot})}^{2}}{K - 1} \cdot \frac{n - K}{\sum_{i = 1}^{K} \sum_{x \in C_{i}} KPD_{Bhatt} {(x, z_{i})}^{2}}

(18)

These proposed CVIs have the same form as the DU and CH indices, except that the distance is $KPD_{Bhatt} (x, y)$, the distance between two uncertain data objects in feature space computed by Equation (15).
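A CH-type index of the form of Equation (18) can be sketched from a precomputed pairwise distance matrix. As an assumption of this sketch, the representatives $z_{i}$ and $z_{tot}$ are taken to be medoids, i.e., indices of actual objects, because the source does not state how they are obtained for uncertain objects. The function name and the toy distances are hypothetical; in practice the matrix would hold $KPD_{Bhatt}$ values.

```python
import numpy as np

def ch_index_from_distances(dist, labels, medoids, overall_medoid):
    """CH-type index from pairwise distances and medoid indices (sketch)."""
    dist = np.asarray(dist, float)
    labels = np.asarray(labels)
    ids = sorted(medoids)              # cluster labels that have a medoid
    n, k = len(labels), len(ids)
    # Between-cluster term: n_i * d(z_i, z_tot)^2 summed over clusters.
    between = sum(np.sum(labels == c) * dist[medoids[c], overall_medoid] ** 2
                  for c in ids)
    # Within-cluster term: d(x, z_i)^2 summed over the objects of each cluster.
    within = sum(np.sum(dist[labels == c, medoids[c]] ** 2) for c in ids)
    return (between / (k - 1)) * ((n - k) / within)

# Toy distances (hypothetical): six objects on a line, two clusters.
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
dist = np.abs(points[:, None] - points[None, :])
labels = np.array([0, 0, 0, 1, 1, 1])
medoids = {0: 1, 1: 4}                 # medoid index of each cluster
ch = ch_index_from_distances(dist, labels, medoids, overall_medoid=2)
```

In this toy case the between-cluster term is 3·1 + 3·81 = 246 and the within-cluster term is 4, so the index equals 246.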

4. Experimental Results

In this study, we propose two CVIs that are calculated from probabilistic distances between uncertain data objects in feature space. The K-medoids clustering algorithm proposed by Jiang et al. [19], which uses probabilistic distance measures to capture the similarity between uncertain objects, was used to produce the partitions on which the performances of the proposed CVIs were compared. The K-medoids algorithm is widely used in clustering problems and differs from the popular K-means clustering algorithm in its robustness to outliers. The K-means method represents each cluster by the mean of all objects in the cluster, whereas the K-medoids method represents each cluster by a medoid, an actual object of the cluster, and calculates the distances between the uncertain data objects and the medoid within a cluster [19]. The uncertain data object with the smallest total distance is then assigned as the new medoid of the cluster. The experiments require the number of clusters K and a probabilistic distance measure to be specified. In this study, we varied the number of clusters (K) and used the Bhattacharyya distance measure to compute distances between uncertain data objects in feature space.

4.1. Experimental Procedure for Uncertain Data

Experiments were performed with artificial and real-world datasets that may have sub-clusters and clusters with asymmetrical, arbitrary, and noisy shapes to evaluate the performances of the proposed CVIs. Each feature of the datasets was normalized as defined in Equation (19) to reduce the scale gap between different features:

x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}},

(19)

where $x_{min}$ and $x_{max}$ are the minimum and maximum values of one feature of the dataset. We then simulated uncertain data objects from certain data objects by following the methodology used by [20].
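The normalization in Equation (19) can be sketched per feature as follows. The guard for constant features is an assumption added for this example and is not described in the source; the function name and sample values are hypothetical.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature (column) of X to [0, 1] as in Equation (19) (sketch)."""
    X = np.asarray(X, float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Assumption: a constant feature is mapped to 0 instead of dividing by zero.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span

# Hypothetical dataset with two features on different scales.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
X_norm = min_max_normalize(X)
```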

To confirm the validity of the proposed CVIs, the pre-existent DU and CH indexes were computed for uncertain data objects in the original space; these baselines are denoted uncertain original space DU (UOSDU) and uncertain original space CH (UOSCH). The overall experimental procedure is represented by Algorithm 1. The procedure used to compare the performances of the proposed CVIs with those of the previous CVIs was as follows: The inputs included the number of uncertain data objects $N$, the number of object features $M$, and the number of clusters $K$. We modeled the uncertain data with multivariate normal distributions. The means of the distributions were the original certain data. The covariances were estimated as follows:

f (S_{i}^{k} | Ψ^{k}, df^{k}) = \frac{{|Ψ^{k}|}^{\frac{df^{k}}{2}}}{2^{\frac{p \cdot df^{k}}{2}} Γ_{p} (\frac{df^{k}}{2})} {|S_{i}^{k}|}^{- \frac{df^{k} + p + 1}{2}} e^{- \frac{1}{2} tr (Ψ^{k} {(S_{i}^{k})}^{- 1})}, i = 1, \dots, n_{k}, k = 1, \dots, K

(20)

where $S_{i}^{k}$ represents the covariance matrices for objects in class k, drawn from the inverse Wishart PDF [29] defined in Equation (20) [20]. $Ψ^{k}$ is a positive definite scale matrix and $df^{k}$ is the degrees of freedom. $p$ indicates the dimension of $S_{i}^{k}$, $tr (\cdot)$ is the trace of a matrix, and $Γ_{p}$ is the multivariate gamma function.
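Drawing a covariance matrix from an inverse Wishart distribution can be sketched with NumPy alone. The sketch assumes integer degrees of freedom of at least $p$ and uses the fact that the inverse of a Wishart draw with scale $Ψ^{-1}$ follows the inverse Wishart distribution with scale $Ψ$. The scale matrix, degrees of freedom, and data point below are hypothetical values, not those used in the experiments.

```python
import numpy as np

def sample_inverse_wishart(scale, df, rng):
    """Draw one covariance matrix from an inverse Wishart(scale, df) (sketch)."""
    scale = np.asarray(scale, float)
    p = scale.shape[0]
    if df < p:
        raise ValueError("df must be an integer >= p for an invertible draw")
    # W = G^T G with rows of G ~ N(0, scale^{-1}) is Wishart(scale^{-1}, df).
    g = rng.multivariate_normal(np.zeros(p), np.linalg.inv(scale), size=df)
    wishart = g.T @ g
    # The inverse of W follows the inverse Wishart distribution with scale `scale`.
    return np.linalg.inv(wishart)

rng = np.random.default_rng(0)
psi = 0.05 * np.eye(2)                  # hypothetical scale matrix
S = sample_inverse_wishart(psi, df=10, rng=rng)

# An uncertain object: the certain data point is the mean, S is the covariance.
certain_point = np.array([0.3, 0.7])    # hypothetical normalized data point
samples = rng.multivariate_normal(certain_point, S, size=50)
```

SciPy also provides `scipy.stats.invwishart`, which supports non-integer degrees of freedom; the NumPy version is used here only to keep the sketch free of extra dependencies.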

Algorithm 1: K-medoids for uncertain data using a probabilistic distance measure in feature space.

1. Input: $n$: the number of objects in cluster $k$; $K$: the number of clusters; iter = 0;
2. Randomly select the cluster medoids $C^{(0)} = \{c_{1}^{(0)}, \dots, c_{K}^{(0)}\}$ obtained from the initial clusters
3. Initialize
4. $CVIs = \{cvi^{(1)}, \dots, cvi^{(K)}\}$ obtained from UOSDU, UOSCH, UFSDU, and UFSCH
5. Repeat
6. for $k = 2, \dots, K$
7. $c_{k}^{(old)} = c_{k}^{(0)}$; $c_{k}^{(new)} = 0$
8. Compute the new medoids:
9. while $c_{k}^{(old)} \neq c_{k}^{(new)}$
10. $p = \arg\min_{1 \leq i \leq n} \sum_{j = 1}^{k} KPD_{Bhatt} (x_{i}, c_{jk})$, where $j$ is an index of cluster medoid in $c_{k}$
11. $c_{k}^{(new)} = x_{p}$
12. end
13. Calculate the $cvi^{(k)}$ using Equations (1), (2), (17), and (18).
14. end
15. iter = iter + 1
16. Until (iter = Maxiter)

Step 1: Randomly set $K$ initial clusters of uncertain objects for a given dataset. Run the K-medoids clustering algorithm with different values of the K parameter (2 ≤ K ≤ 10).

Step 2: Obtain the medoid of each cluster, i.e., the object for which the sum of the probabilistic distances to the other objects is the smallest.

Step 3: Calculate the CVIs for all the partitions. We calculated the compactness and separability in kernel space using an RBF kernel function with bandwidth σ. The optimal value of σ was determined through a set of preliminary experiments over the values 0.1, 0.2, …, 4.

Step 4: To increase the reliability of the experimental results, we replicated the experiment 100 times for the same dataset, with different random seeds for the initial medoids in Step 1, and used the average value of each CVI for each number of clusters.

Step 5: Finally, we evaluated each CVI by comparing the number of clusters it suggested with the actual number of clusters of the dataset.
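Steps 1 and 2 can be sketched with a standard alternating K-medoids procedure on a precomputed distance matrix. This is a generic variant written for illustration and is not necessarily identical to Algorithm 1; the function name `k_medoids`, its parameters, and the toy distances are assumptions. In practice the matrix would hold the probabilistic distances between uncertain objects.

```python
import numpy as np

def k_medoids(dist, k, rng, max_iter=100):
    """Alternating K-medoids on a precomputed distance matrix (sketch)."""
    dist = np.asarray(dist, float)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)   # random initial medoids
    for _ in range(max_iter):
        # Assign every object to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue                              # keep the old medoid
            # New medoid: member with the smallest summed distance to the others.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break                                     # medoids stopped changing
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

# Toy distances (hypothetical): six objects on a line forming two groups.
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
dist = np.abs(points[:, None] - points[None, :])
medoids, labels = k_medoids(dist, k=2, rng=np.random.default_rng(0))
```

Running the function for each K from 2 to 10 and for several seeds, and then averaging the resulting index values, would mirror Steps 1 to 4.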

4.2. Experiments with Artificial and Real-World Datasets

Experiments were conducted to evaluate the proposed CVIs in comparison to the pre-existent CVIs. These experiments used 10 datasets whose characteristics (arbitrary shapes, sub-clusters, asymmetry, and noise) make cluster validation difficult. The datasets were taken from the UCI repository (https://archive.ics.uci.edu/, accessed on 10 March 2023) [30] and the Tomas Barton repository (https://github.com/deric/clustering-benchmark, accessed on 10 March 2023), which contains 122 artificial datasets with arbitrary shapes, sub-clusters, and asymmetric shapes in two or three features. The datasets from the UCI repository (e.g., D3, D4, D5, and D7) were collected in real environmental conditions, whereas the other datasets were artificially created and can be found in the Tomas Barton repository.

The summary of datasets used for the experiments is presented in Table 1. Two-dimensional (2D) and three-dimensional (3D) dataset shapes are illustrated in Figure 3. The CVI values were computed by changing the number of clusters (K) in each dataset and then comparing the predicted labels of experiments to the actual labels in the datasets.

Table 1. Summary of datasets.

Dataset Index	Dataset Name	# of Obs.	# of Dim.	# of Clusters	Projection Shape
D1	A.K Jain’s Toy	373	2	2	Asymmetry, Arbitrary shape
D2	Flame	240	2	2	Sub-cluster, Noise
D3	Iris	150	4	3	-
D4	Thyroid	215	5	2	-
D5	Wine	178	13	3	-
D6	Wisconsin	683	9	2	-
D7	Haberman	301	3	2	Random shape
D8	Chainlink	1000	3	2	Sub-cluster, Arbitrary shape
D9	Lsun	400	2	3	Asymmetry, Arbitrary shape
D10	Zelnik1	299	2	3	Sub-cluster

Figure 3. Shapes of 2D and 3D datasets: (a) D1 dataset; (b) D2 dataset; (c) D7 dataset; (d) D8 dataset; (e) D9 dataset; (f) D10 dataset.

4.3. Performance Comparison of the Proposed CVIs

The experimental results are given in Tables 2–11. The actual number of clusters is given in parentheses with the name of the dataset and is also marked with an asterisk (*) in the column header. Moreover, the results for all the datasets are summarized in Table 12, which quantifies the performance of the proposed CVIs. Each cell in Table 12 represents the optimal number of clusters K determined by the corresponding CVI.

Table 2. Performance results for D1.

Dataset	CVI	2 *	3	4	5	6	7	8	9	10
D1 (2)	UOSDU	0.00075	0.00063	0.00049	0.00046	0.00043	0.00047	0.00044	0.00042	0.000410
	UOSCH	554.4796	537.8279	573.5387	586.5310	576.5872	562.0666	575.2021	566.6556	567.6008
	UFSDU	0.011830	0.00727	0.007410	0.006350	0.006920	0.006390	0.006740	0.00580	0.005630
	UFSCH	256.0945	204.767	167.9338	149.5915	138.3076	128.206	122.4676	117.0263	112.4593

Table 3. Performance results for D2.

Dataset	CVI	2 *	3	4	5	6	7	8	9	10
D2 (2)	UOSDU	0.00578	0.00581	0.00583	0.00533	0.00494	0.00494	0.00452	0.00454	0.00448
	UOSCH	218.9052	188.6698	201.7685	195.0877	190.2412	190.7961	192.3785	187.7774	186.0032
	UFSDU	0.01875	0.01433	0.01619	0.01386	0.01284	0.01261	0.01263	0.0125	0.01271
	UFSCH	246.7711	190.3472	184.7522	163.52	150.3938	143.1108	138.9139	131.6189	127.3284

Table 4. Performance results for D3.

Dataset	CVI	2	3 *	4	5	6	7	8	9	10
D3 (3)	UOSDU	0.57393	0.18691	0.06671	0.04599	0.03375	0.03045	0.02475	0.02443	0.02427
	UOSCH	393.8149	340.7616	288.9103	257.4766	227.8328	211.7321	193.9894	179.4227	172.1492
	UFSDU	0.78121	0.05291	0.0332	0.02818	0.0201	0.02217	0.02033	0.01676	0.01503
	UFSCH	97.24412	100.9677	83.47847	74.68629	65.08186	59.80128	55.42499	51.32508	48.54411

Table 5. Performance results for D4.

Dataset	CVI	2 *	3	4	5	6	7	8	9	10
D4 (2)	UOSDU	0.01059	0.00702	0.00447	0.00389	0.00338	0.00285	0.0029	0.00264	0.00254
	UOSCH	52.44662	49.27229	45.29772	44.23136	46.29286	43.05835	40.0334	38.99379	36.43862
	UFSDU	0.09045	0.02678	0.02097	0.01941	0.0186	0.0166	0.01728	0.01604	0.01577
	UFSCH	88.16833	63.62494	54.54528	48.32164	43.33752	38.65073	35.53446	32.89777	30.6346

Table 6. Performance results for D5.

Dataset	CVI	2	3 *	4	5	6	7	8	9	10
D5 (3)	UOSDU	0.28546	0.19218	0.16953	0.13451	0.13042	0.12188	0.1222	0.11775	0.11498
	UOSCH	46.98845	41.61822	34.08324	29.45127	26.66111	23.71564	21.97848	20.8878	19.0692
	UFSDU	0.1351	0.13992	0.12361	0.11102	0.1058	0.10544	0.10343	0.10402	0.10242
	UFSCH	166.5115	94.11775	70.17926	57.15066	48.48219	42.44803	38.19718	34.55733	31.1674

Table 7. Performance results for D6.

Dataset	CVI	2 *	3	4	5	6	7	8	9	10
D6 (2)	UOSDU	0.10223	0.04719	0.02262	0.01209	0.00742	0.00342	0.0014	0.00109	0.00075
	UOSCH	237.829	186.8503	145.4631	119.3866	98.36381	89.72379	80.18472	70.83073	66.12163
	UFSDU	0.22631	0.10763	0.04928	0.03902	0.01416	0.01228	0.0084	0.00605	0.00391
	UFSCH	349.3685	261.4169	205.8692	171.2457	144.4285	124.5582	109.2653	97.50292	88.98401

Table 8. Performance results for D7.

Dataset	CVI	2 *	3	4	5	6	7	8	9	10
D7 (2)	UOSDU	0.00198	0.0014	0.00112	0.00086	0.00078	0.00089	0.00069	0.00079	0.00076
	UOSCH	128.8359	117.8517	104.8203	97.56451	95.82686	92.17925	86.85381	84.98897	82.71107
	UFSDU	0.13021	0.02577	0.01681	0.01199	0.01108	0.01122	0.01132	0.01028	0.00945
	UFSCH	319.3255	171.7169	127.0638	104.5919	90.63319	80.94442	72.86994	67.51974	62.62473

Table 9. Performance results for D8.

Dataset	CVI	2 *	3	4	5	6	7	8	9	10
D8 (2)	UOSDU	0.00019	0.00017	0.00017	0.00017	0.00017	0.00018	0.00021	0.00019	0.00017
	UOSCH	419.8882	371.9768	388.8548	430.2229	426.5956	430.8854	449.3122	438.7834	417.3569
	UFSDU	0.00439	0.00237	0.00204	0.00114	0.0013	0.00118	0.00149	0.00153	0.0014
	UFSCH	445.5408	463.2664	449.4758	439.8262	425.4487	411.5018	422.1565	428.8755	437.9047

Table 10. Performance results for D9.

Dataset	CVI	2	3 *	4	5	6	7	8	9	10
D9 (3)	UOSDU	0.01277	0.00168	0.00087	0.00081	0.00069	0.00062	0.0006	0.00063	0.00054
	UOSCH	316.7407	406.3877	395.188	401.578	380.968	363.1193	365.4242	349.9761	351.8199
	UFSDU	0.01439	0.02006	0.01697	0.01119	0.00658	0.00574	0.00485	0.00472	0.00416
	UFSCH	190.3465	205.1745	189.6315	175.6124	164.8462	154.2108	149.907	141.5702	133.6363

Table 11. Performance results for D10.

Dataset	CVI	2	3 *	4	5	6	7	8	9	10
D10 (3)	UOSDU	0.030644	0.049296	0.048849	0.048798	0.046752	0.044594	0.042478	0.037749	0.041905
	UOSCH	235.4205	161.3342	142.117	135.4194	127.012	126.4954	125.6673	123.9964	132.4379
	UFSDU	0.00368	0.00123	0.00123	0.00115	0.00103	0.00087	0.00073	0.00077	0.00056
	UFSCH	102.6013	106.5976	99.7133	98.79822	97.68495	95.67929	95.82844	96.62246	102.6371

Table 12. Difference between the actual and estimated numbers of clusters in lower-dimensional datasets.

Dataset	Dim	# of Clusters	UOSDU	UOSCH	UFSDU	UFSCH
D1	2	2	⨀	5	⨀	⨀
D2	2	2	4	⨀	⨀	⨀
D3	4	3	2	2	2	⨀
D4	5	2	⨀	⨀	⨀	⨀
D5	13	3	2	2	⨀	2
D6	9	2	⨀	⨀	⨀	⨀
D7	3	2	⨀	⨀	⨀	⨀
D8	3	2	8	8	⨀	3
D9	2	3	2	⨀	⨀	⨀
D10	2	3	⨀	2	2	⨀
# of successes in estimating the optimal number of clusters			5	5	8	8

The bold values with gray-shaded backgrounds, which correspond to the largest value of each CVI, indicate the optimal number of clusters K decided by that CVI. As presented in Table 2, three of the CVIs succeeded in estimating the number of clusters as two in D1, whereas UOSCH failed. The proposed UFSDU and UFSCH also successfully predicted the number of clusters in D2. In contrast, UOSDU failed to estimate the number of clusters in D2.

Although the proposed UFSDU index and the pre-existent CVIs failed to predict the number of clusters in D3, UFSCH was successful. All CVIs correctly predicted the number of clusters for D4, D6, and D7; see Table 5, Table 7 and Table 8. In contrast, the proposed UFSDU index is the only CVI that correctly predicted the actual number of clusters in D5, as presented in Table 6. Furthermore, the UFSDU index was the only CVI that predicted the actual number of clusters of D8. Visually, D8 (Figure 3) separates distinctly into two classes; however, it is challenging to calculate the compactness and separability of its clusters in the original space. Nevertheless, the UFSDU index was successful, and the UFSCH index estimated the number of clusters as three, which is close to the actual number, two. The kernel transformation makes it possible to measure compactness and separability more accurately in the feature space than in the original space, which improves the estimation of the number of clusters.

The UOSCH index and the new CVIs predicted the number of clusters to be three in D9, and the UOSDU and UFSCH indexes successfully estimated the number of clusters in D10. Table 12 presents a summary of the results for the 10 datasets above, where a circled dot (⨀) indicates that the CVI accurately predicted the actual number of clusters. As presented in Table 12, each pre-existent CVI correctly estimated the number of clusters for five datasets, whereas each newly proposed CVI correctly predicted the number of clusters for eight datasets, three more than the pre-existent CVIs.
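How the estimated number of clusters follows from the index values can be sketched with the D1 results reported in Table 2. All four indices are of the max. S/C type, so the estimate is the K with the largest value; the helper `select_k` and the variable names are introduced only for this example.

```python
# Index values for D1 copied from Table 2, for K = 2, ..., 10.
k_values = list(range(2, 11))
table2 = {
    "UOSDU": [0.00075, 0.00063, 0.00049, 0.00046, 0.00043,
              0.00047, 0.00044, 0.00042, 0.000410],
    "UOSCH": [554.4796, 537.8279, 573.5387, 586.5310, 576.5872,
              562.0666, 575.2021, 566.6556, 567.6008],
    "UFSDU": [0.011830, 0.00727, 0.007410, 0.006350, 0.006920,
              0.006390, 0.006740, 0.00580, 0.005630],
    "UFSCH": [256.0945, 204.767, 167.9338, 149.5915, 138.3076,
              128.206, 122.4676, 117.0263, 112.4593],
}

def select_k(values, ks):
    """Return the K whose index value is the largest (max. S/C criterion)."""
    best = max(range(len(values)), key=values.__getitem__)
    return ks[best]

estimated = {name: select_k(vals, k_values) for name, vals in table2.items()}
actual_k = 2
successes = [name for name, k in estimated.items() if k == actual_k]
```

For D1 this gives K = 2 for UOSDU, UFSDU, and UFSCH and K = 5 for UOSCH, which matches the D1 row of Table 12.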

5. Conclusions

In this study, we proposed novel cluster validity indices (CVIs) for uncertain data objects in feature space. Unlike conventional CVIs in the original space, the proposed CVIs handle uncertain data objects whose clusters have arbitrary shapes, sub-clusters, and noise, all of which are hard to evaluate, by using a kernel function to transform the uncertain data from the original space to the feature space. The proposed CVIs measure the compactness and separability of each cluster in kernel space, a higher-dimensional space, which makes them less sensitive to arbitrary cluster shapes and more robust to noise and outliers. We compared the performance of the proposed CVIs with that of pre-existent CVIs that consider only the original space. The Bhattacharyya distance measure, one of the most widely used measures for calculating distance, was used in experiments with several artificial and real-life datasets to capture the distances between probability density functions. Numerical examples, including a real-life case study and artificial datasets, confirmed that our proposed CVIs are robust to arbitrary cluster shapes, especially sub-clusters, and are promising alternatives for evaluating the fitness of clustering results and finding the optimal number of clusters, K. The proposed CVIs outperform the pre-existent CVIs because kernel functions are applied to the uncertain data, transforming them from the original space to the feature space. As for practical significance, the proposed CVIs could be utilized in diverse applications. For example, Kim et al. proposed a new multivariate kernel density estimator for uncertain data classification for mixed defect patterns on DRAM wafer maps [31]. The proposed CVI method could be applied to evaluating the number of defect patterns on wafer maps. However, the proposed CVIs have some limitations. The uncertain data are assumed in advance to follow multivariate normal distributions so that the distances between different uncertain data objects can be computed. In practice, the uncertainty may follow a variety of probability functions (normal distribution, exponential distribution, etc.), and some cannot be strictly modeled by PDFs. This might be overcome through methods for generating random variables and support-measure data description, a non-parametric machine learning method that does not require a prior distribution to be assumed in advance.
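Under the multivariate normal assumption, the Bhattacharyya distance has a well-known closed form in the original space: D_B = (1/8)(μ1 − μ2)ᵀ Σ⁻¹ (μ1 − μ2) + (1/2) ln(det Σ / sqrt(det Σ1 · det Σ2)), with Σ = (Σ1 + Σ2)/2. The sketch below implements that textbook formula; the function name is hypothetical, and it is not the feature-space (kernel) version used by the proposed CVIs.

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between N(mu1, cov1) and N(mu2, cov2).

    Original-space closed form; covariances must be positive definite.
    """
    mu1 = np.atleast_1d(np.asarray(mu1, dtype=float))
    mu2 = np.atleast_1d(np.asarray(mu2, dtype=float))
    cov1 = np.atleast_2d(np.asarray(cov1, dtype=float))
    cov2 = np.atleast_2d(np.asarray(cov2, dtype=float))

    cov = 0.5 * (cov1 + cov2)  # averaged covariance
    diff = mu1 - mu2

    # Mean term: (1/8) * diff^T cov^{-1} diff, via a linear solve
    mean_term = 0.125 * diff @ np.linalg.solve(cov, diff)

    # Covariance term: (1/2) * ln(det cov / sqrt(det cov1 * det cov2)),
    # using log-determinants for numerical stability
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    cov_term = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))

    return float(mean_term + cov_term)

# Two uncertain objects with unit covariance whose means differ by 2 along one axis.
print(bhattacharyya_gaussian([0.0, 0.0], np.eye(2), [2.0, 0.0], np.eye(2)))
```

The first term grows with the separation of the means and the second with the mismatch between covariances, so two objects with identical means can still be far apart if their uncertainty differs.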

Future research should consider the compactness measure in kernel space in advanced machine learning algorithms, such as support vector data descriptions or Bayesian frameworks of Bayesian support vector data descriptions. The concepts of our CVIs can also be applied to other clustering algorithms.

Abbreviations

The following abbreviations are used in this manuscript:

Bhatt	Bhattacharyya distance measure
C/S	Compactness/Separability
CH	Calinski–Harabasz
CVIs	Cluster validity indices
DB	Davies–Bouldin
DU	Dunn
KPD	Kernel-based probabilistic distance
PD	Probabilistic distance
PDF	Probability density function
RBF	Radial basis function
S/C	Separability/Compactness
UFSCH	Uncertain feature space CH
UFSDU	Uncertain feature space DU
UOSCH	Uncertain original space CH
UOSDU	Uncertain original space DU


Author Contributions

Conceptualization, Y.-S.J.; data curation, C.K.; formal analysis, Y.-S.J.; investigation, B.T. and Y.-S.J.; methodology, C.K. and Y.-S.J.; resources, B.T.; software, B.T.; supervision, Y.-S.J.; validation, J.B.; visualization, J.B.; writing—original draft, C.K.; writing—review and editing, J.B., B.T. and Y.-S.J. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The real-world datasets used in this study are available at https://archive.ics.uci.edu/ml/index.php (accessed on 10 March 2023); the artificial datasets that contain data sensitive to shapes are available at https://github.com/deric/clustering-benchmark/tree/master/ (accessed on 10 March 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Funding Statement

This work was supported by LG Yonam Foundation (of Republic of Korea) and by National Research Foundation of Republic of Korea Grant (No. NRF-2021S1A5A8060639, NRF-2022R1F1A1063174).

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

References

1. Abdalameer A.K., Alswaitti M., Alsudani A.A., Isa N.A. A new validity clustering index-based on finding new centroid positions using the mean of clustered data to determine the optimum number of clusters. Expert Syst. Appl. 2022;191:116329. doi: 10.1016/j.eswa.2021.116329.
2. Irani J., Pise N., Phatak M. Clustering techniques and the similarity measures used in clustering: A survey. Int. J. Comput. Appl. Technol. 2016;134:9–14. doi: 10.5120/ijca2016907841.
3. MacQueen J.B. Some methods for classification and analysis of multivariate observations; Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; Berkeley, CA, USA. 27 December 1965–7 January 1966; Santa Barbara, CA, USA: The Regents of the University of California; 1967. pp. 281–297.
4. Li M.J., Ng M.K., Cheung Y.-m., Huang J.Z. Agglomerative fuzzy k-means clustering algorithm with selection of number of clusters. IEEE Trans. Knowl. Data Eng. 2008;20:1519–1534. doi: 10.1109/TKDE.2008.88.
5. Mahesh Kumar K., Rama Mohan Reddy A. A fast DBSCAN clustering algorithm by accelerating neighbor searching using groups method. Pattern Recognit. 2016;58:39–48. doi: 10.1016/j.patcog.2016.03.008.
6. Chien C.-F., Wang W.-C., Cheng J.-C. Data mining for yield enhancement in semiconductor manufacturing and an empirical study. Expert Syst. Appl. 2007;33:192–198. doi: 10.1016/j.eswa.2006.04.014.
7. El-shafeiy E., Sallam K.M., Chakrabortty R.K., Abohany A.A. A clustering based swarm intelligence optimization technique for the internet of medical things. Expert Syst. Appl. 2021;173:114648. doi: 10.1016/j.eswa.2021.114648.
8. Aggarwal C.C., Yu P.S. A survey of uncertain data algorithms and applications. IEEE Trans. Knowl. Data Eng. 2009;21:609–623. doi: 10.1109/TKDE.2008.190.
9. Shou L., Zhang X., Chen G., Gao Y., Chen K. Mud: Mapping-based query processing for high-dimensional uncertain data. Inf. Sci. 2012;198:147–168. doi: 10.1016/j.ins.2012.02.023.
10. Duan X., Ma Y., Zhou Y., Huang H., Wang B. A novel cluster validity index based on augmented non-shared nearest neighbors. Expert Syst. Appl. 2023;223:119784. doi: 10.1016/j.eswa.2023.119784.
11. Lee S.-H., Jeong Y.-S., Kim J.-Y., Jeong M.K. A new clustering validity index for arbitrary shape of Clusters. Pattern Recognit. Lett. 2018;112:263–269. doi: 10.1016/j.patrec.2018.08.005.
12. Dunn J.C. Well-separated clusters and optimal fuzzy partitions. J. Cybern. 1974;4:95–104. doi: 10.1080/01969727408546059.
13. Calinski T., Harabasz J. A dendrite method for cluster analysis. Commun. Stat.-Theory Methods. 1974;3:1–27. doi: 10.1080/03610927408827101.
14. Davies D.L., Bouldin D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979;PAMI-1:224–227. doi: 10.1109/TPAMI.1979.4766909.
15. Xie X.L., Beni G. A validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell. 1991;13:841–847. doi: 10.1109/34.85677.
16. Rojas-Thomas J.C., Santos M., Mora M. New internal index for clustering validation based on graphs. Expert Syst. Appl. 2017;86:334–349. doi: 10.1016/j.eswa.2017.06.003.
17. Tavakkol B., Jeong M.K., Albin S.L. Validity indices for clusters of uncertain data objects. Ann. Oper. Res. 2018;303:321–357. doi: 10.1007/s10479-018-3043-4.
18. Wang J.-S., Chiang J.-C. A cluster validity measure with a hybrid parameter search method for the support vector clustering algorithm. Pattern Recognit. 2008;41:506–520. doi: 10.1016/j.patcog.2007.06.027.
19. Jiang B., Pei J., Tao Y., Lin X. Clustering uncertain data based on probability distribution similarity. IEEE Trans. Knowl. Data Eng. 2013;25:751–763. doi: 10.1109/TKDE.2011.221.
20. Tavakkol B., Jeong M.K., Albin S.L. Object-to-group probabilistic distance measure for uncertain data classification. Neurocomputing. 2017;230:143–151. doi: 10.1016/j.neucom.2016.12.007.
21. Arbelaitz O., Gurrutxaga I., Muguerza J., Pérez J.M., Perona I. An extensive comparative study of cluster validity indices. Pattern Recognit. 2013;46:243–256. doi: 10.1016/j.patcog.2012.07.021.
22. Rezaee B. A cluster validity index for Fuzzy Clustering. Fuzzy Sets Syst. 2010;161:3014–3025. doi: 10.1016/j.fss.2010.07.005.
23. Bhattacharyya A. On a measure of divergence between two multinomial populations. Sankhya Indian J. Stat. 1946;7:401–406.
24. Kullback S., Leibler R.A. On information and sufficiency. Ann. Math. Stat. 1951;22:79–86. doi: 10.1214/aoms/1177729694.
25. Tavakkol B., Son Y. Fuzzy kernel K-medoids clustering algorithm for uncertain data objects. Pattern Anal. Appl. 2021;24:1287–1302. doi: 10.1007/s10044-021-00983-z.
26. Zhou S.K., Chellappa R. From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Trans. Pattern Anal. Mach. Intell. 2006;28:917–929. doi: 10.1109/TPAMI.2006.120.
27. Patle A., Chouhan D.S. SVM kernel functions for classification; Proceedings of the 2013 International Conference on Advances in Technology and Engineering (ICATE); Mumbai, India. 23–25 January 2013.
28. Tbarki K., Ben Said S., Ksantini R., Lachiri Z. RBF kernel based SVM Classification for landmine detection and discrimination; Proceedings of the 2016 International Image Processing, Applications and Systems (IPAS); Sfax, Tunisia. 5–7 November 2016.
29. Nydick S.W. The wishart and inverse wishart distributions. Electron. J. Stat. 2012;6:1–19.
30. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/ (accessed on 28 March 2023).
31. Kim B., Jeong Y.-S., Jeong M.K. New multivariate kernel density estimator for uncertain data classification. Ann. Oper. Res. 2020;303:413–431. doi: 10.1007/s10479-020-03715-4.
