Journal of Applied Statistics
2020 Jul 30;49(1):98–121. doi: 10.1080/02664763.2020.1799958

MulticlusterKDE: a new algorithm for clustering based on multivariate kernel density estimation

D. Scaldelai, L. C. Matioli, S. R. Santos, M. Kleina
PMCID: PMC9041763  PMID: 35707794

Abstract

In this paper, we propose the MulticlusterKDE algorithm applied to classify elements of a database into categories based on their similarity. MulticlusterKDE is centered on the multiple optimization of the kernel density estimator function with multivariate Gaussian kernel. One of the main features of the proposed algorithm is that the number of clusters is an optional input parameter. Furthermore, it is very simple, easy to implement, well defined and stops at a finite number of steps and it always converges regardless of the data set. We illustrate our findings by implementing the algorithm in R software. The results indicate that the MulticlusterKDE algorithm is competitive when compared to K-means, K-medoids, CLARA, DBSCAN and PdfCluster algorithms. Features such as simplicity and efficiency make the proposed algorithm an attractive and promising research field that can be used as basis for its improvement and also for the development of new density-based clustering algorithms.

Keywords: Kernel density estimation, Gaussian kernel, clustering data, optimization method,  multiclusterKDE

1. Introduction

The amount of data generated and available to users has increased exponentially due to the growth and massive use of technologies. Significant quantities are collected every day from satellite images, biomedicine, security, marketing, searches on networks, among other sources [23]. However, ‘large databases’ and ‘knowledge about the phenomenon’ cannot be thought of as synonymous. To understand the phenomenon, the data set first needs to be turned into useful knowledge; only then is it possible to generate applications and simulations.

The process of exploiting large databases in search of patterns, rules or information sequences, in order to detect correlations between variables, is known as data mining. According to [1] and [21], data mining is a branch of computing that started to be widely studied in the 1980s, when companies and organizations began to worry about the large amounts of data being stored and left unused.

Over the years, many works have been carried out in this area, leading data mining to be subdivided into different research lines; the main ones are regression, sequential analysis, classification, clustering and outlier analysis.

In this paper, we are interested in clustering, whose objective is to identify patterns or groups of similar objects in a data set. Clustering is a powerful set of exploratory techniques that seek to group data automatically, maximizing the homogeneity of observations within each cluster [23].

Several different clustering strategies have been proposed in order to predict, analyze and replicate phenomena. Even so, no consensus has been reached even on the definition of a cluster. According to [40], clustering methods can be classified into five main classes: partitioning, hierarchical, grid-based, model-based and density-based methods. In this paper, we propose an alternative approach based on the density distribution function.

Partitioning methods are the simplest clustering methods [1]. The basic idea of these algorithms is to partition the data set into k clusters in such a way that each object is assigned to the nearest cluster, using a partitioning criterion given by a distance-based dissimilarity. Because of this, these approaches are not able to detect non-spherical clusters [36]. Their major challenge is that the number of clusters must be known a priori, since this information is precisely the question to be answered in many of the problems studied.

The best-known partitioning algorithm is K-means, proposed by [28]. It is widely used in the literature and its process consists of two stages: the first determines a set of k centroids, and the second allocates the observations to these centroids by the minimum-distance criterion. Other partitioning algorithms are K-medoids and CLARA – Clustering Large Applications [20,24]. These algorithms have interesting features, such as lower sensitivity to noise and outliers, which make them an attractive alternative for clustering. However, they can be computationally expensive, especially for large data sets.

Hierarchical clustering algorithms decompose the data set into several partitioning levels, usually represented by a dendrogram, a tree that divides the database recursively into subsets until a termination criterion is satisfied. Each node in the tree represents one cluster. The dendrogram can be created from leaves to root (agglomerative approach) or from root to leaves (divisive approach) by merging or dividing groups at each step [13]. Although hierarchical clustering algorithms can be very effective in pattern discovery, determining the termination criterion of the merging or splitting process is a great challenge.

An example of a divisive algorithm is DIANA – Divisive Analysis Clustering [24], where initially there is a single cluster containing all points of the data set and, at each subsequent step, the largest cluster is split into two new clusters until all clusters contain only a single point. SLINK – Single-Linkage [41] and CLINK – Complete-Linkage [10] are agglomerative methods that start with each point in a different cluster and, at each subsequent step, join two clusters according to the minimum or maximum distance between elements of each cluster.

Grid-based algorithms quantize the object space into a finite number of cells that form a grid structure [21]. The main advantage of this approach is its fast processing time, which is independent of the number of objects and depends only on the number of cells in each dimension of the quantized space. In model-based algorithms, a model is hypothesized for each of the clusters and the best fit of the data to the given model is then sought [40].

Finally, we have the density-based algorithms, which are based on the idea that cluster centers are characterized by a higher density than their neighbors, and elements are incorporated into clusters as long as the density in their neighborhood does not fall below a certain threshold [40]. Many different density-based algorithms have been proposed in recent years, such as [3,4,7,13,18,23,25,26,30,32,36,45].

In [32], the multivariate PdfCluster algorithm was proposed as a natural extension of the clustering procedure based on univariate density developed by [4]. It determines clusters associated with connected components of the regions where the estimated density is above a threshold. The detection of connected regions is performed by the procedures described in [4] for low-dimensional problems and in [32] for larger dimensions. In both cases, after identifying multiple cluster cores with high density, the lowest-density data are allocated following an approach similar to supervised classification. An interesting feature of PdfCluster is that it does not require the number of clusters a priori, which is instead determined by the clustering process. The PdfCluster algorithm is available in R software [3].

Another important density-based algorithm is DBSCAN – Density-Based Spatial Clustering of Applications with Noise, proposed by [13]. The key idea of this algorithm is that, for each point in a cluster, its neighborhood, bounded by a radius, must contain at least a minimum number of points, so that elements of the same cluster are closely packed and points outside the neighborhood are as far as possible. According to [23], DBSCAN can discover clusters of different shapes and sizes in data sets containing noise or outliers, and it is implemented in R (the ‘dbscan’ package). According to the authors, the results obtained by DBSCAN are significantly more effective at discovering clusters than those of known partition-based algorithms.

Recently, [36] proposed the DPC algorithm, which is based on the assumption that cluster centers are surrounded by neighbors with lower local density and that they are at a relatively large distance from any point with a higher local density. However, this algorithm can produce poor clusterings, because the density metric depends strongly on the dimension of the data set, and the strategy of assigning the remaining points can propagate errors when a point is assigned to an incorrect cluster. To circumvent these problems, the FKNN-DPC algorithm was proposed, which uses two new assignment strategies based on K-nearest neighbors and fuzzy weighted K-nearest neighbors, as can be seen in [45].

In addition to these methods, there are approaches that fuse different techniques, such as density estimation and normal mixture models, which have been proposed and studied in the context of clustering, as seen in [15–17,29,39].

It should be noted that all algorithms presented here are focused on clustering multidimensional data stored in an m×n matrix. However, many modern applications, such as image and video recognition, text mining, internet search, large-scale telecommunications and social networking records, generate large amounts of data with multiple aspects and high dimensionality. In these cases, multi-way arrays provide a natural representation. These applications are beyond the focus of this work; for more details see [9,11,12,27].

Due to the diversity of applications in the data analysis area, we also intend to investigate, in the future, the viability of our methodology for data structured via tensors, as in [22,43,44].

In this paper, we propose a new algorithm, named MulticlusterKDE, so called because it performs multiple optimizations of the Gaussian kernel density, applied to the clustering of a multidimensional data set. It is divided into two main steps. The first determines the number of clusters and the respective centers by minimizing a probability density estimator. The second assigns observations to each cluster by the smallest-distance criterion, that is, the points are allocated to the centers that are closest to them. At the end of the process, the algorithm has determined the number of clusters, which are stored in a matrix.

The new algorithm has the advantage of not requiring the number of clusters a priori; the only input required to run it is a parameter used in the bandwidth matrix. If the user wishes, the number of clusters can be provided, but this parameter is not necessary. Numerical experiments were implemented in R software [34] and the proposed algorithm was compared to the K-means, K-medoids, CLARA, DBSCAN and PdfCluster algorithms, also available in R.

The paper is organized as follows. In Section 2, we describe the elements essential to the development of this paper. In Section 3, we establish our algorithm and then analyze its complexity and convergence. Numerical experiments are reported in Section 4. Finally, concluding remarks close the text in Section 5.

2. Kernel density estimation

This section provides background concepts to develop this paper. For further study, we suggest the following readings [5,19,30,32,38,42].

Let x ∈ R^n be a random variable with probability density function f. The knowledge of this function provides a natural description of the behavior of the variable x and allows the characteristics associated with it to be studied and replicated.

The density function of a given set of observations is not always known, especially when the data set refers to a real phenomenon. This difficulty can be overcome by using non-parametric estimators such as kernel density estimation – KDE. According to [38], the focus of parametric estimators is on obtaining the best estimator θ̂ for a given parameter θ, while in the non-parametric case the objective is directly linked to obtaining a good estimate f̂ of the density function.

As presented by [19], a general form of the multivariate kernel density estimator is

\hat{f}(x, H) = m^{-1} \sum_{i=1}^{m} |H|^{-1/2} K\left(H^{-1/2}(x - X_i)\right) = m^{-1} \sum_{i=1}^{m} K_H(x - X_i), \quad (1)

where

K_H(x) = |H|^{-1/2} K\left(H^{-1/2} x\right), \quad (2)

H is a square bandwidth matrix, non-random, symmetric and positive definite, |H| is the determinant of H, x = (x_1, x_2, ..., x_n)^T ∈ R^n is the vector of the n-dimensional space of variables and X_i = (X_{1i}, X_{2i}, ..., X_{ni})^T, i = 1, 2, ..., m, is the set of observations with unknown density function.

The kernel K is taken as the standard multivariate normal density, given by

K(x) = (2\pi)^{-n/2} \exp\left(-\tfrac{1}{2} x^T x\right). \quad (3)

Considering Equations (1) and (3) the Gaussian kernel density estimator is given as

\hat{f}(x) = \frac{1}{m (2\pi)^{n/2} |H|^{1/2}} \sum_{i=1}^{m} \exp\left(-\tfrac{1}{2} (x - X_i)^T H^{-1} (x - X_i)\right). \quad (4)

According to [19], the multivariate KDE can be viewed as a weighted sum of density ‘bumps’ that are centered at each data point Xi.

The key to applying the Gaussian kernel density estimator to a data set is the choice of the bandwidth matrix. This is an extremely delicate problem, because small changes in H can significantly affect the shape and orientation of the KDE.

As reported by [19], the bandwidth matrix can be specified at three levels. The simplest case is the product of a scalar h ∈ R+ by the identity matrix of dimension n×n, H = {h I_{n×n} : h > 0}. The second level of complexity occurs when H is a positive definite diagonal matrix, H = diag(h_1, h_2, ..., h_n). Finally, the third and most complex level occurs when H is a full matrix, symmetric and positive definite. In this paper, we use the diagonal matrix.
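To make the estimator concrete, the following is a minimal R sketch (not the authors' code) that evaluates the Gaussian kernel density estimator of Equation (4) at a point x, assuming a diagonal bandwidth matrix H = diag(h_1, ..., h_n):

# Minimal sketch: X is an m x n matrix of observations, x a length-n point
# and h the length-n diagonal of H (so that H^{-1} = diag(1/h)).
kde_gauss <- function(x, X, h) {
  m <- nrow(X); n <- length(x)
  diffs <- sweep(X, 2, x)                       # rows hold (X_i - x)
  quad  <- rowSums(sweep(diffs^2, 2, h, "/"))   # (x - X_i)^T H^{-1} (x - X_i)
  sum(exp(-0.5 * quad)) / (m * (2 * pi)^(n / 2) * sqrt(prod(h)))
}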

Based on [32,38], a suitable choice for the components of the diagonal matrix H, under the multivariate normality hypothesis, is

h_i = \left(\frac{4}{n+2}\right)^{\frac{1}{n+4}} \sigma_i\, m^{-\frac{1}{n+4}}, \quad (5)

where n is the dimension of the space, σ_i is the standard deviation of the ith component and m is the number of observations.

Under the hypothesis of normality and in the one-dimensional space, [30,38,42] define the best choice for h as being

h = 1.06\, \sigma\, m^{-1/5}. \quad (6)

In [30], a flexibilization of this value was proposed and the authors introduced a variable α as follows:

h = 1.06\, \sigma\, m^{-1/\alpha}. \quad (7)

In this paper, we use an extension of the h given by [30] and apply relationship (7) to each component of H, that is,

h_i = 1.06\, \sigma_i\, m^{-1/\alpha}, \quad i = 1, \ldots, n. \quad (8)
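A short R sketch of this bandwidth choice (an illustration under the assumptions above, not the authors' implementation) is:

# Diagonal entries of H following Equation (8): h_i = 1.06 * sigma_i * m^(-1/alpha).
bandwidth_diag <- function(X, alpha) {
  m <- nrow(X)                   # number of observations
  sigma <- apply(X, 2, sd)       # standard deviation of each component
  1.06 * sigma * m^(-1 / alpha)  # vector with the diagonal of H
}

For the data of Example 1 in Section 3, for instance, one would call bandwidth_diag(X, alpha = 1.5).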

3. MulticlusterKDE algorithm

In this section, we present the main contribution of our paper, the MulticlusterKDE algorithm. Unlike the mean-shift [8] and Expectation-Maximization (EM) [31] algorithms, which use the gradient of the kernel function to determine the cluster centers, our algorithm determines the centers by minimizing the Gaussian kernel with some optimization method, for example BFGS. In addition, we emphasize that other kernel functions can be used instead of the Gaussian one; it is enough that the kernel be differentiable. Another relevant fact about our algorithm is that it does not need to know the number of clusters a priori. However, if the user wishes, this information can be provided. Therefore, the algorithm is flexible and, as we will see in the numerical experiments, it is competitive when compared to some of the main clustering algorithms well known in the literature.

Next, we present and explain the steps of the algorithm. For a better understanding of the proposed approach, first we present an example and then the algorithm.

Example 1

Consider an example with 20 observations of 2 attributes, randomly generated with a structure that defines 2 clusters. The data are in the ‘Attributes’ column of Table 1 and are represented in Figure 1(a). For this example, we set α = 1.5 as the smoothing coefficient of the kernel function.

Table 1. Example data.

Data Attribute x1 Attribute x2 Distance to S1 Distance to S2 Minimum distance
1 4.97 1.98 0.036 1.449 0.036
2 4.94 1.98 0.063 1.471 0.063
3 5.07 1.93 0.099 1.418 0.099
4 4.78 1.92 0.234 1.629 0.234
5 5.06 1.98 0.063 1.387 0.063
6 5.01 2.00 0.010 1.407 0.010
7 4.81 1.91 0.210 1.614 0.210
8 4.93 2.09 0.114 1.405 0.114
9 4.90 2.05 0.112 1.453 0.112
10 5.13 1.95 0.139 1.364 0.139
11 5.00 2.06 0.060 1.372 0.060
12 5.14 2.13 0.191 1.223 0.191
13 5.98 3.17 1.526 0.171 0.171
14 6.10 3.11 1.563 0.149 0.149
15 6.00 2.74 1.244 0.260 0.260
16 5.97 2.95 1.358 0.058 0.058
17 6.01 2.91 1.359 0.091 0.091
18 6.04 3.05 1.478 0.064 0.064
19 5.97 2.99 1.386 0.032 0.032
20 6.10 3.00 1.487 0.100 0.100

Figure 1. Data set and Gaussian kernel density estimator. (a) Data set. (b) Gaussian kernel.

In the first step of the MulticlusterKDE algorithm, the diagonal elements of the matrix H are determined by Equation (8), and then the Gaussian kernel density estimator is built (Figure 1(b)). An important observation is that H and f̂ are constructed only once by the algorithm.

In order to begin the optimization process, the MulticlusterKDE algorithm randomly chooses the point X4 = (4.78, 1.92), represented by x0 in Figure 2(a). Then the algorithm starts the main loop (while) to find a minimizer point x0* = (5, 2) of the subproblem on f̂ (Figure 2(b)), and this point is assigned to the set S = {S1} as the first center. After this, the algorithm determines the distance from all 20 observations to the centroid S1 = (5, 2); the result is shown in the column ‘Distance to S1’ of Table 1.

Figure 2. MulticlusterKDE steps.

Based on the distances to the centroid S1, MulticlusterKDE selects the observation with the largest distance to S1, that is, it searches the data set for the point furthest from S1, which is X14 = (6.10, 3.11), at an approximate distance of 1.5627 from S1. So, the algorithm uses X14 (x1 in Figure 2(c)) as the initial point to solve the subproblem argmin{f̂(x) : x ∈ R^n}, and a new iteration begins.

The second iteration begins with X14 as the initial point and MulticlusterKDE determines a minimizer x1* = (6, 3) as the solution of the subproblem (Figure 2(d)). Then it checks whether x1* belongs to S; if not, it is added to the set, S = {S1, S2}; otherwise, the loop ends.

Continuing, the algorithm determines the distance of all observations to the centroid S2, as shown in the column ‘Distance to S2’ of Table 1. Since we have 2 centroids in the set S, each of the 20 observations has 2 distances to the set S. The next step of the algorithm is to determine which of the 20 observations is furthest from both elements of S, i.e. which observation presents the highest degree of heterogeneity with respect to S. To determine this element, the algorithm scans all 20 observations of the problem, finding for each one the smallest distance to the two centroids. The results are in the column ‘Minimum distance’ of Table 1. Then MulticlusterKDE selects X15 = (6.00, 2.74), which attains the maximum among the minimum distances. This point, represented by x2 in Figure 2(e), is used as the initial point for the new iteration.

The third iteration begins with X15 as the initial point and determines x2* = (6, 3) as the solution of the subproblem, which is represented in Figure 2(f). Since this point already belongs to the set S, the loop ends. The first stage of MulticlusterKDE finishes with S = {(6, 3), (5, 2)}, that is, the problem will have two clusters, whose centroids are the elements of S.

The second stage is extremely simple too. Considering the distances of the twenty observations to the two centroids, presented in the columns ‘Distance to S1’ and ‘Distance to S2’ of Table 1, the proposed algorithm obtains, for each observation, the nearest centroid, and the observation is then assigned to the respective cluster. For example, the nearest centroid to X1 = (4.97, 1.98) is S1; therefore, X1 belongs to cluster 1. By repeating the process for all observations, MulticlusterKDE is finished. The final result is represented in Figure 3.
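A small R sketch of this assignment stage follows (hypothetical code; only three observations of Table 1 are reproduced for illustration):

S <- rbind(c(5, 2), c(6, 3))                             # centroids S1 and S2 from the first stage
X <- rbind(c(4.97, 1.98), c(4.94, 1.98), c(6.10, 3.11))  # a few observations of Table 1
D <- as.matrix(dist(rbind(S, X)))[-(1:2), 1:2]           # distances of each point to S1 and S2
cluster <- apply(D, 1, which.min)                        # nearest-centroid assignment

The first row of D reproduces the distances 0.036 and 1.449 reported for observation 1 in Table 1.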

Figure 3. Clustering result for Example 3.1.

After this simple example, we shall describe the MulticlusterKDE algorithm (Algorithm 1).

Algorithm 1. MulticlusterKDE.

Firstly, variables and constants are initialized. The matrix X ∈ R^{n×m} contains the problem data, where n and m are the numbers of attributes (variables) and observations, respectively. The parameter α ∈ R is used in Equation (8) in order to determine the matrix H, which is responsible for the smoothness of the kernel density estimator; nc is an integer number provided by the user, which is optional, that limits the number of clusters, and its absence does not disable the algorithm; S is a matrix and F is a vector used to store the centroids and the values of the objective function evaluated at the centroids, respectively, and rep = 0 is used to finish the main loop of the algorithm. Finally, the vector C stores the clustering of the data set.

Algorithm 1 has two main loops. The first one, represented by the while, has as its main task to determine the cluster centers (or centroids). Note that the loop begins by determining a maximizer, x*, of the kernel function f̂. Next, the algorithm checks whether this center has already been selected. If it has, the loop ends; otherwise, x* is stored in the matrix S as a new cluster center and the respective value f̂(x*) is stored in the vector F. For each iteration k, the value Mk gives the largest distance of all points to the centers already found, and x_{k+1} becomes the point that is farthest from all centers determined so far, being used as the initial point for the next iteration.

After the end of the first loop, and before starting the second one, the algorithm fixes the number of clusters, which can be k or nc. If the user has previously defined it, then the maximum number of clusters is nc and, if nc < k, then k − nc of the points already determined by the previous loop (while) will be eliminated and will not be cluster centers.

The second loop, which consists of the for j and for i commands, determines the clusters. The technique used is nearest-center assignment, that is, the points are allocated to the centers that are closest to them. At the end of the process, the algorithm has determined k or nc clusters, which are stored in C.
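For concreteness, the following R sketch outlines the two stages just described. It is only an illustration of the procedure, not the authors' implementation: it assumes a diagonal H built from Equation (8), locates each center by passing the negative of f̂ to optim() with method = 'BFGS' (since optim() minimizes by default), and simply keeps the first nc centers when the user fixes nc.

multicluster_kde_sketch <- function(X, alpha = 1.5, nc = NULL, tol = 1e-3) {
  m <- nrow(X); n <- ncol(X)
  h <- 1.06 * apply(X, 2, sd) * m^(-1 / alpha)              # Equation (8)
  neg_kde <- function(x) {                                  # negative of f-hat, Equation (4)
    quad <- rowSums(sweep(sweep(X, 2, x)^2, 2, h, "/"))
    -sum(exp(-0.5 * quad)) / (m * (2 * pi)^(n / 2) * sqrt(prod(h)))
  }
  S <- NULL
  x0 <- X[sample(m, 1), ]                                   # random starting observation
  repeat {                                                  # stage 1: find the centers
    xs <- optim(x0, neg_kde, method = "BFGS")$par
    if (!is.null(S) && any(sqrt(colSums((t(S) - xs)^2)) < tol)) break
    S <- rbind(S, xs)
    if (nrow(S) == m) break                                 # safety guard: at most m centers
    D <- as.matrix(dist(rbind(S, X)))[-(1:nrow(S)), 1:nrow(S), drop = FALSE]
    x0 <- X[which.max(apply(D, 1, min)), ]                  # farthest point from all centers
  }
  if (!is.null(nc) && nc < nrow(S)) S <- S[1:nc, , drop = FALSE]  # simplification of the nc rule
  D <- as.matrix(dist(rbind(S, X)))[-(1:nrow(S)), 1:nrow(S), drop = FALSE]
  list(centers = S, cluster = apply(D, 1, which.min))       # stage 2: nearest-center labels
}

On the data of Example 1, with alpha = 1.5, this sketch should recover two centers close to (5, 2) and (6, 3).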

The space complexity of MulticlusterKDE is O(k·m), where k is the number of clusters obtained and m is the size of the data set, corresponding to the storage of the distance matrix. In addition, one matrix and three vectors are used during the run of the algorithm. The matrix S stores the cluster centers. The vectors F and C store the values of the objective function evaluated at the centroids and the clustering result, respectively. Finally, since the matrix H is diagonal, it is stored as a vector.

The time complexity of MulticlusterKDE depends mainly on determining the cluster centers x*, which consists of minimizing a function using the BFGS method, on computing the distance matrix and on allocating the points to the clusters. In addition, if the user has fixed the maximum number of clusters, it is necessary to perform a sorting of the elements in F and S.

In the following theorem, we show that the algorithm is well defined and then that it stops at a finite number of steps.

Theorem 1

Algorithm 1 is well defined and stops at a finite iteration k, in which the number of clusters is given by min{k, nc}.

Proof.

First, we show that the algorithm is well defined. In fact, at iteration k of Algorithm 1 we must determine x* as a minimizer of f̂, which is a smooth function with derivatives of all orders. To solve this subproblem, we use the R software, which employs a quasi-Newton method known in optimization as BFGS; this method is globally convergent, i.e. it converges regardless of the initial point (see [33], Theorem 8.5, p. 212). Therefore, the subproblem always has a solution. We now show that the number of subproblems that Algorithm 1 needs to solve is finite; in other words, the number of minimizers of this function for a finite number m of sample points in the matrix X is finite. If all the points in the sample were isolated, in which case each sample point would be a cluster, we would have at most m clusters, and therefore the number is finite. In the case where we have p < m minimizers, Algorithm 1 also stops in a finite number of steps. In fact, since p is finite, at some iteration k the algorithm will find a point x* that has already been determined, and this is the stopping criterion of the main loop (if this did not happen, the number of sample points would be infinite). Therefore, the maximum number of clusters will be k if the user has not set nc. If the user has set nc, the number of clusters determined by Algorithm 1 will be min{k, nc}.

4. Numerical tests

In this section, we present numerical results that demonstrate and qualify the MulticlusterKDE algorithm. The problems analyzed here are divided into two kinds. The first kind comprises problems whose grouping structure and class labels of each observation are known; the objective is to show the ability of the proposed algorithm to group the data set in an appropriate way. The second kind comprises problems with unknown grouping structure, which require determining the number of clusters and the grouping that best fit the data set.

All problems were run with six different algorithms: MulticlusterKDE, K-means, PdfCluster, K-medoids, CLARA and DBSCAN. The main reason for choosing these five comparison algorithms is that they are widely used in the literature and are implemented in R software.

In the following, the parameters of each method are given, as well as the package and function used in R. MulticlusterKDE requires α, which is used to calculate the smoothing matrix, and nc (optional), which corresponds to the number of clusters; we implemented MulticlusterKDE using only the optim() function of the stats package for the optimization part, with method = ‘BFGS’ as the parameter of this function. For K-means, only the number of clusters is given; we used the kmeans() function of the stats package. PdfCluster has hmult as a parameter, a shrink factor that multiplies the smoothing parameter to be used in the Gaussian kernel density estimation, and bwtype, corresponding to a kernel estimator with fixed or adaptive bandwidths; we used the pdfCluster() function of the pdfCluster package. K-medoids and CLARA, as well as K-means, require the number of clusters; we used the pam() (for K-medoids) and clara() (for CLARA) functions of the cluster package. For DBSCAN, two parameters are given, eps and minPts, such that a circle of radius eps must contain at least minPts points in its neighborhood; the dbscan() function of the dbscan package was used.
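For illustration, a minimal usage sketch of these functions is shown below, applied to the raw iris data shipped with R with the parameter values that will be used for the Iris data set in Section 4.1.1 (any preprocessing is omitted here, so this is only an assumption-laden example of the calls):

library(cluster); library(dbscan); library(pdfCluster)
X  <- as.matrix(iris[, 1:4])
km <- kmeans(X, centers = 3)                 # stats::kmeans
pm <- pam(X, k = 3)                          # cluster::pam (K-medoids)
cl <- clara(X, k = 3)                        # cluster::clara
db <- dbscan(X, eps = 0.5, minPts = 14)      # dbscan::dbscan
pc <- pdfCluster(X, hmult = 0.58)            # pdfCluster::pdfCluster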

A computer with an Intel(R) Core i5-7200U CPU @ 2.50 GHz processor and the Windows 10 Home 64-bit operating system was used to run the algorithms for all problems. Version 3.6.1 (2019-07-05) of the R software was used with RStudio version 1.1.463.

4.1. Numerical tests on known class problems

Initially, in this section, seven problems whose grouping structure is known will be addressed. The first is the classification of Iris species (‘Iris data set’), the second is the wine production region (‘Wine data set’), the third is the classification of wheat seeds (‘Seeds data set’), the fourth is the olive oil production region (‘Olive oil data set’), the fifth corresponds to the type of erythemato-squamous disease (‘Dermatology data set’) and the last two refer to waveform classification (‘Waveform Database Generator (Version 1)’ and ‘Waveform Database Generator (Version 2)’).

In order to reduce the standard deviations used in obtaining the matrix H of the MulticlusterKDE algorithm, we applied a logarithmic scale to the data sets of the problems in this section.

As mentioned before, for all these problems we know the class labels of each observation. According to [35], with the label information one can create validity metrics that are easier to understand and to compare between clusters. These metrics are known as external metrics because of their dependence on external class labels. For external metrics, a confusion matrix (also called a matching matrix) is created, a simple table that shows the match between the cluster labels determined by the clustering algorithms and the actual class labels of the data. The confusion matrix allows one, in a simple way, to visualize the correspondence between the clustering algorithms and the real data, that is, it provides the number of hits and misses of the clustering process.
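A minimal sketch of how these external metrics can be computed in R is given below; the label vectors are hypothetical, and the matching between cluster labels and class labels is assumed to have been done already:

truth <- c(1, 1, 2, 2, 3, 3)            # hypothetical true class labels
pred  <- c(1, 1, 2, 3, 3, 3)            # hypothetical cluster labels, matched to the classes
cm      <- table(truth, pred)           # confusion (matching) matrix
abs_err <- sum(truth != pred)           # absolute error: number of misclassified observations
rel_err <- abs_err / length(truth)      # relative error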

4.1.1. Iris data set

The ‘Iris data set’ contains 150 observations of three plant varieties of the Iris species. The data are equally divided among the varieties of iris, Setosa, Versicolour and Virginica. Each observation consists of four attributes: the length and width of the petal, the length and the width of the sepal.

For this data set, the algorithms were calibrated with specific input parameters: α = 1.35 and nc = 3 for the MulticlusterKDE algorithm. For the K-means, K-medoids and CLARA algorithms, the exact number of three clusters was specified as an input argument, and in PdfCluster the parameter hmult = 0.58 was used. In the DBSCAN algorithm, the parameters eps = 0.5 and minPts = 14 were used.

In the PdfCluster and DBSCAN algorithms, the parameters were obtained by numerical tests, meaning that the algorithms were run several times and, at the end, we chose the parameters based on the best results. In this way, for the PdfCluster algorithm we varied the hmult parameter from 0.2 to 1.5 in steps of 0.1, and in DBSCAN the minPts parameter varied from 1 to 20 in steps of 1 and, for each of its values, eps varied from 0.1 to 20 in steps of 0.1. This same strategy was used in all problems of this section (a minimal sketch of such a grid search is given below). The clustering results are shown in Table 2.
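The sketch below illustrates the skeleton of such a grid search for DBSCAN's parameters; the score used here is a crude error count against the known labels and ignores the label-matching step, so it is only an illustration of the looping structure, not the exact procedure used:

library(dbscan)
X <- as.matrix(iris[, 1:4])
best <- list(err = Inf)
for (minPts in 1:20) {
  for (eps in seq(0.1, 20, by = 0.1)) {
    cl  <- dbscan(X, eps = eps, minPts = minPts)$cluster
    err <- sum(cl != as.integer(iris$Species))   # crude score; proper label matching is needed
    if (err < best$err) best <- list(err = err, eps = eps, minPts = minPts)
  }
}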

Table 2. Confusion matrix for Iris data set.
  MulticlusterKDE K-means PdfCluster
Variety C1 C2 C3 C1 C2 C3 C1 C2 C3
Setosa 50 0 0 50 0 0 50 0 0
Versicolour 0 50 0 0 48 2 0 46 4
Virginica 0 7 43 0 14 36 0 13 37
  K-medoids CLARA DBSCAN
  C1 C2 C3 C1 C2 C3 C1 C2 C3
Setosa 50 0 0 50 0 0 46 0 4
Versicolour 0 48 2 0 48 2 0 37 13
Virginica 0 14 36 0 13 37 0 5 45

In Table 3 we show the absolute and relative errors and the runtime of each algorithm used to classify the Iris species. It should be noted that, although several tests were performed with different parameter values, the runtimes in this table correspond to the results printed in Table 2.

Table 3. Comparative results for Iris data set.
Algorithms Absolute error Relative error time(s)
MulticlusterKDE 7 4.7% 5.9
K-means 16 10.7% 0.02
PdfCluster 17 11.3% 0.141
K-medoids 16 10.7% 0.016
CLARA 15 10.0% 0.001
DBSCAN 22 14.7% 0.001

The numerical results indicate that the proposed algorithm spent more computational time; however, it is more accurate in classifying the elements of the Iris data set than K-means, PdfCluster, K-medoids, CLARA and DBSCAN, since it solved the problem with the lowest error.

4.1.2. Wine data set

The ‘Wine data set’ is a set of 178 wines grown in the same region of Italy, but produced from three different cultivars (Barolo, Grignolino, Barbera). Each cultivar has 28 chemical characteristics; however, according to [32], only 13 of the 28 original characteristics are commonly chosen: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines and Proline.

As in the previous problem, all algorithms were calibrated with specific input parameters. Thus, α = 5 and nc = 3 were used in MulticlusterKDE, and 3 clusters were specified in the K-means, K-medoids and CLARA algorithms. In the PdfCluster algorithm, the parameters hmult = 1.2 and bwtype = ‘adaptive’ were used, while in DBSCAN eps = 13.2 and minPts = 8 were used. The clustering results are shown in Table 4.

Table 4. Confusion matrix for Wine data set.
  MulticlusterKDE K-means PdfCluster
Cultivars C1 C2 C3 C1 C2 C3 C1 C2 C3
Barolo 59 0 0 46 0 13 59 0 0
Grignolino 4 66 1 1 50 20 5 62 4
Barbera 0 1 47 0 19 29 0 1 47
  K-medoids CLARA DBSCAN
  C1 C2 C3 C1 C2 C3 C1 C2 C3
Barolo 46 0 13 48 0 11 58 1 0
Grignolino 2 50 19 2 50 19 61 5 5
Barbera 0 18 30 0 18 30 40 2 6

For all algorithms, the absolute and relative errors and the runtime to cluster the Wine data are presented in Table 5. Despite having the longest runtime, followed by the PdfCluster algorithm, the results indicate that the MulticlusterKDE algorithm has the advantage of being more accurate than the other ones. Observe that it solved the problem with almost half the absolute error of PdfCluster, which was the second best algorithm in terms of correct data classifications. This evidence shows the efficiency and competitiveness of the proposed algorithm.

Table 5. Comparative results for Wine data set.
Algorithms Absolute error Relative error time(s)
MulticlusterKDE 6 3.4% 5.51
K-means 53 29.7% 0.015
PdfCluster 10 5.6% 2.35
K-medoids 52 29.2% 0.016
CLARA 50 28.1% 0.008
DBSCAN 109 61.2% 0.001

4.1.3. Seeds data set

The ‘Seeds data set’ was collected and analyzed by [6] and presented in [2]. The information presented refers to seeds of three wheat varieties: Kama, Rosa and Canadian. This set consists of 210 observations, 70 seeds for each variety. Each observation is composed of 7 geometric parameters of wheat seeds: area (A), perimeter (P), compactness (C = 4πA/P²), length of kernel, width of kernel, asymmetry coefficient and length of kernel groove.

In order to execute the MulticlusterKDE algorithm, the parameters α and nc were defined as 1.85 and 3, respectively. Likewise, three clusters were considered for the K-means, K-medoids and CLARA algorithms. For the PdfCluster algorithm, bwtype = ‘adaptive’ and hmult = 0.8 were assigned. Finally, in DBSCAN, we used eps = 0.9 and minPts = 19. The results are shown in Table 6.

Table 6. Confusion matrix for Seeds data set.
  MulticlusterKDE K-means PdfCluster
Varieties C1 C2 C3 C1 C2 C3 C1 C2 C3
Kama 60 1 9 60 1 9 59 3 8
Rosa 4 65 1 10 60 0 4 66 0
Canadian 8 0 62 2 0 68 3 0 67
  K-medoids CLARA DBSCAN
  C1 C2 C3 C1 C2 C3 C1 C2 C3
Kama 57 1 12 60 2 8 69 0 1
Rosa 10 60 0 9 61 0 51 19 0
Canadian 0 0 70 4 0 66 39 0 31

The numerical results presented in Table 7 indicate that, based on the absolute and relative errors, the PdfCluster algorithm showed the best performance on this problem, followed by K-means, MulticlusterKDE/K-medoids/CLARA and lastly DBSCAN. DBSCAN presented the smallest runtime; however, it had the largest error in clustering the data set. Thus, a good runtime alone is not enough; an algorithm must also be accurate in classifying the data set.

Table 7. Comparative results for Seeds data set.
Algorithms Absolute error Relative error time(s)
MulticlusterKDE 23 10.9% 5.28
K-means 22 10.5% 0.003
PdfCluster 18 8.6% 3.25
K-medoids 23 10.9% 0.005
CLARA 23 10.9% 0.003
DBSCAN 64 30.5% 0.001

4.1.4. Olive oil data set

The ‘Olive Oil data set’ was originally presented by [14] and used by several researchers to validate clustering algorithms, among these [32]. According to [14,32], the data refer to olive oil produced in nine areas of Italy, concentrated into three geographical macro-areas: South, Sardinia Island, Centre-North. In the macro-area of the South, one has the regions of Apulia North, Apulia South, Calabria and Sicily. In the macro-area Sardinia, one has the regions of the coast and the interior of Sardinia and finally in the macro-area center-north one has the regions of Liguria East, Liguria West and Umbria.

According to [32], the data set in its raw form is compositional in nature, totaling 10,000 elements. In their paper, the authors adopted an additive log-ratio (ALR) transformation, which reduces the set of analysis to 572 observations. Each observation consists of eight chemical measurements of the oil (Palmitic, Palmitoleic, Stearic, Oleic, Linoleic, Linolenic, Arachidic, Eicosenoic), as well as information about its area and macro-area of origin.

As in [32], we chose to cluster by macro-area; for this, in the MulticlusterKDE algorithm we used the input parameters α = 2.8 and nc = 3. In K-means, K-medoids and CLARA we used three clusters; in PdfCluster, the parameters bwtype = ‘adaptive’ and hmult = 1.2 were used, while in DBSCAN eps = 12.7 and minPts = 4. The results are shown in Table 8.

Table 8. Confusion matrix for Olive oil data set.
  MulticlusterKDE K-means PdfCluster
macro-area C1 C2 C3 C1 C2 C3 C1 C2 C3
South 323 0 0 190 91 42 296 0 27
Sardinia 0 97 1 22 76 0 0 98 0
Centre-North 0 26 125 0 17 134 0 6 145
  K-medoids CLARA DBSCAN
  C1 C2 C3 C1 C2 C3 C1 C2 C3
South 196 84 43 204 78 41 323 0 0
Sardinia 19 79 0 31 67 0 98 0 0
Centre-North 0 18 133 0 21 130 129 4 18

As in the previous cases (Iris and Wine data sets), for this problem the proposed algorithm is also the most accurate in classifying the data set. Analyzing the results presented in Table 9, we can conclude that MulticlusterKDE has the advantage of being the most accurate in the data classification, with a good runtime. PdfCluster also presented a good performance, however with the highest runtime. Finally, the other algorithms showed excellent computational times, however with high classification errors.

Table 9. Comparative results for Olive oil data set.
Algorithms Absolute error Relative error time(s)
MulticlusterKDE 27 4.7% 14.87
K-means 172 29.8% 0.001
PdfCluster 33 5.7% 64.12
K-medoids 164 28.7% 0.04
CLARA 171 29.9% 0.001
DBSCAN 231 40.4% 0.001

4.1.5. Dermatology data set

The ‘Dermatology data set’ contains 366 observations from patients diagnosed with erythemato-squamous diseases. Altogether, the group presents patients diagnosed with 6 different types of this disease, namely psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis and pityriasis rubra pilaris.

Each of the observations (patients) consists of 34 attributes (variables), 33 of which are integer-valued and 1 nominal, more specifically binary, specifying whether or not the patient has a family history of the disease. Of the 33 integer attributes, 1 is the patient's age and 10 are clinical attributes that take the values 0, 1, 2 or 3, where 0 indicates that the characteristic was not present, 3 indicates the largest possible degree and 1 and 2 indicate intermediate relative values. Finally, the other 22 attributes are histopathological information about the patient, also graded by integer values in the range 0 to 3.

In the original dermatology data set, available at [2], 8 observations have no information about the age attribute, so we decided to exclude them, resulting in a data set with 358 observations.

For this data set, the algorithms were calibrated with specific input parameters. In the MulticlusterKDE algorithm, α = 4.6 and nc = 6 were used. In PdfCluster, hmult = 0.7 was used, while in DBSCAN eps = 4.7 and minPts = 12 were considered. In the K-means, K-medoids and CLARA algorithms we specified 6 clusters. The clustering results are shown in Tables 10 and 11.

Table 10. Confusion matrix for Dermatology data set.
  MulticlusterKDE PdfCluster
Diseases C1 C2 C3 C4 C5 C6 C1 C2 C3 C4 C5 C6
psoriasis 95 5 0 1 9 1 52 15 0 1 11 32
seborrheic dermatitis 0 29 0 31 0 0 2 38 15 3 2 0
lichen planus 0 0 70 1 0 0 31 0 40 0 0 0
pityriasis rosea 0 0 0 47 1 0 0 13 27 8 0 0
chronic dermatitis 0 0 0 0 48 0 0 1 1 1 45 0
pityriasis rubra pilaris 0 0 0 0 0 20 8 0 10 0 2 0
  K-means K-medoids
  C1 C2 C3 C4 C5 C6 C1 C2 C3 C4 C5 C6
psoriasis 32 27 0 17 26 9 28 32 15 16 16 4
seborrheic dermatitis 19 19 0 2 16 4 2 17 17 9 13 2
lichen planus 17 0 22 4 27 1 6 20 25 14 6 0
pityriasis rosea 15 15 0 5 8 5 5 14 8 10 10 1
chronic dermatitis 14 11 0 5 10 5 6 14 9 8 10 1
pityriasis rubra pilaris 0 1 0 0 0 19 0 0 0 0 2 18
  CLARA DBSCAN
  C1 C2 C3 C4 C5 C6 C1 C2 C3 C4 C5 C6
psoriasis 29 31 15 20 12 4 111 0 0 0 0 0
seborrheic dermatitis 6 18 12 10 12 2 11 35 0 13 0 1
lichen planus 6 20 25 18 2 0 38 0 21 0 12 0
pityriasis rosea 7 14 6 10 10 1 10 31 0 7 0 0
chronic dermatitis 8 14 7 8 10 1 27 16 0 4 0 1
pityriasis rubra pilaris 0 0 0 0 2 18 3 0 0 0 0 17
Table 11. Comparative results for Dermatology data set.
Algorithms Absolute error Relative error time(s)
MulticlusterKDE 49 13.9% 10.5
K-means 251 70.1% 0.016
PdfCluster 175 48.8% 28.56
K-medoids 250 69.8% 0.031
CLARA 110 30.7% 0.001
DBSCAN 167 46.6% 0.016

The results indicate that all algorithms showed high absolute and relative errors, even higher than for the previous problems; however, despite the unsatisfactory overall performance, the proposed algorithm stands out with the best results, as can be seen in Table 11.

4.1.6. Waveform database generator (Version 1)

The ‘Waveform database generator (version 1)’ data set consists of 5000 waveform observations. The set presents 3 classes of waves generated from a combination of 2 or 3 basic waves, with noise (with mean 0 and variance 1) added to them, totaling 21 attributes for waveform classification.

For this data set, the algorithms were calibrated with the following input parameters. In the MulticlusterKDE algorithm, α = 1.5 and nc = 3 were used. The parameters eps = 4.1 and minPts = 18 were used in DBSCAN. In the K-means, K-medoids and CLARA algorithms we specified three clusters. The PdfCluster algorithm failed to solve this problem in all tests performed, stopping with the message ‘Error: cannot allocate vector of size 2.0 Gb’. The results for this data set are shown in Tables 12 and 13.

Table 12. Confusion matrix for Waveform database (version 1).
  MulticlusterKDE K-means PdfCluster
waveform C1 C2 C3 C1 C2 C3 C1 C2 C3
1 1392 29 236 824 11 822
2 505 1085 57 958 689 0
3 65 277 1354 0 700 996
  K-medoids CLARA DBSCAN
  C1 C2 C3 C1 C2 C3 C1 C2 C3
1 861 9 787 911 673 73 840 817 0
2 945 702 0 0 884 763 703 943 1
3 0 600 1096 870 0 826 766 925 5
Table 13. Comparative results for Waveform database (version 1).
Algorithms Absolute error Relative error time(s)
MulticlusterKDE 1169 23.4% 3.89
K-means 2491 49.8% 0.02
PdfCluster
K-medoids 2341 46.8% 7.9
CLARA 2379 47.6% 0.013
DBSCAN 3212 64.2% 0.687

4.1.7. Waveform database generator (Version 2)

The ‘Waveform database generator (version 2)’ data set has the same classes as its first version, also with 5000 waveform observations; however, each observation consists of 40 attributes.

For this data set, the MulticlusterKDE algorithm was calibrated with α = 4.5 and nc = 3. In DBSCAN, eps = 8.25 and minPts = 1 were used. In the K-means, K-medoids and CLARA algorithms, three clusters were considered. As in the previous problem (version 1), the PdfCluster algorithm was not able to solve this problem even with several different parameters, because it again had a memory allocation error. The results for this data set are shown in Tables 14 and 15.

Table 14. Confusion matrix for Waveform database (version 2).
  MulticlusterKDE K-means PdfCluster
waveform C1 C2 C3 C1 C2 C3 C1 C2 C3
1 1002 425 265 880 12 800
2 139 1462 52 930 23 0
3 224 172 1259 0 685 970
  K-medoids CLARA DBSCAN
  C1 C2 C3 C1 C2 C3 C1 C2 C3
1 677 303 712 878 20 794 1283 402 7
2 714 939 0 925 728 0 1162 491 0
3 0 646 1009 0 651 1004 1227 420 8
Table 15. Comparative results for Waveform database (version 2).
Algorithms Absolute error Relative error time(s)
MulticlusterKDE 1277 25.5% 5.08
K-means 2427 48.5% 0.08
PdfCluster
K-medoids 2375 47.5% 3.77
CLARA 2390 47.8% 0.012
DBSCAN 3218 64.4% 1.55

As in the Dermatology data set, in the Waveform databases (versions 1 and 2) all tested algorithms misclassified several elements, resulting in high relative errors. However, the smallest error was obtained by the MulticlusterKDE algorithm, as can be seen in Tables 13 and 15.

Further remarks on known class problems

Based on the previous results, we have observed that, according to the absolute and relative errors, the MulticlusterKDE algorithm had the best performance in six of the seven problems, and in three of them (Iris, Wine and Olive oil) its relative error did not exceed 5%.

For the Iris, Wine, Olive oil, Dermatology, Waveform 1 and Waveform 2 data sets, we compared the MulticlusterKDE algorithm with the second-best algorithm by the error criterion, and its results were, respectively, 53.3%, 40%, 18.2%, 55.5%, 50.1% and 46.2% better.

Regarding the runtime, K-means, CLARA and DBSCAN had the best performance in most of the problems; however, MulticlusterKDE produced the most accurate clustering for almost all data sets.

4.2. Numerical tests on problems with no classes defined

In this section, we consider two kinds of problems with no classes defined. For the first kind, we take two real data sets, USArrests and TripAdvisor [2]. For the second kind, eight problems are randomly generated.

In order to compare the efficiency of the algorithms, we consider the silhouette coefficient as a measure of performance. The coefficient was proposed by [37] and is one of many indexes used to evaluate the clustering structure. It is calculated for each object i in the data set as follows:

s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \quad i = 1, \ldots, m, \quad (9)

where a(i) is the average dissimilarity between object i and all other points of the cluster to which i belongs, and b(i) is the minimum, over the other clusters, of the mean dissimilarity between object i and the objects of each of those clusters. The silhouette coefficient is the average of all values s(i) and lies in the range [−1, 1], such that s(i) near 1 means that object i is well grouped, while s(i) near −1 means that it has been misclassified.
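In R, the average silhouette coefficient can be computed, for instance, with the silhouette() function of the cluster package; a minimal sketch (the clustering used here is only an example) is:

library(cluster)
X <- as.matrix(USArrests)                  # one of the data sets analyzed below
labels <- kmeans(X, centers = 2)$cluster   # any clustering result can be evaluated
sil <- silhouette(labels, dist(X))         # per-observation values s(i) of Equation (9)
mean(sil[, "sil_width"])                   # overall silhouette coefficient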

The first problem analyzed consists of the ‘USArrests data set’, initially presented by McNeil (1977) [2]. This set contains the number of arrests per 100,000 inhabitants for crimes of assault, murder and rape in each of the 50 US states in 1973. Besides this information, each of the 50 observations contains the percentage of the population living in urban areas, resulting in 4 attributes.

The second problem consists of user reviews on TripAdvisor.com about 10 categories of East Asian attractions, including art galleries, nightclubs, juice bars, restaurants, museums, resorts, parks, beaches, cinemas and religious institutions. Users evaluate each visited attraction as Excellent (4), Very Good (3), Average (2), Bad (1) or Terrible (0). The average user rating for each type of attraction composes the ‘Travel Reviews data set’, which is available at [2] and consists of 980 numerical records.

In order to improve the analysis of the algorithms' performance, besides the two problems already mentioned, we also consider in this section eight random samples, whose generation procedure is described next. Initially, a positive integer g ∈ [a1, a2] was generated, with the objective of determining the number of subgroups, or midpoints, used in the generation of subgroups with Gaussian distribution. Then, another positive integer n ∈ [b1, b2] was generated, corresponding to the dimension of the problem, and a positive real number σ ∈ [c1, c2], which is the standard deviation.

Then, around each of the g accumulation points, m points of dimension n were generated, that is, X_j = (X_1, X_2, ..., X_k, ..., X_m), with X_k ∈ R^n, k = 1, ..., m, and j = 1, ..., g. Each component x_i of X_k was obtained from a Gaussian distribution, x_i ~ N(μ, σ), with i = 1, ..., n, and μ uniformly distributed, μ ~ U(d1, d2). Thus, at the end of the procedure we have the set X ∈ R^{n×(g·m)}. Due to the random nature of the generation of the accumulation points and of the standard deviation of the components, the data sets can have several characteristics, such as extremely well-defined groups or even point clouds.
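A sketch of this generation procedure in R is given below; the interval bounds a1, a2, b1, b2, c1, c2, d1, d2 and the number m of points per subgroup are assumptions chosen only for illustration:

set.seed(1)
g     <- sample(2:10, 1)       # number of accumulation points (subgroups), g in [a1, a2]
n     <- sample(2:15, 1)       # dimension of the problem, n in [b1, b2]
sigma <- runif(1, 0.5, 2)      # standard deviation, sigma in [c1, c2]
m     <- 500                   # points generated around each accumulation point
X <- do.call(rbind, lapply(seq_len(g), function(j) {
  mu <- runif(n, -50, 50)      # accumulation point, mu ~ U(d1, d2)
  matrix(rnorm(m * n, mean = rep(mu, each = m), sd = sigma), nrow = m)
}))                            # final data set: a (g * m) x n matrix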

In this section, we tested 10 problems without knowing the ideal number of clusters. To run all problems, some parameters were defined for each algorithm. In MulticlusterKDE, we considered α = 5 and nc not provided. The PdfCluster algorithm was used in its default form. As K-means, K-medoids and CLARA require the number of clusters, we performed several tests, varying the number of clusters from 2 to 10 (for two problems we considered 20 as the maximum number of clusters), and the best clustering configuration according to the silhouette coefficient was taken as the solution and presented in Table 16. Similarly, for the DBSCAN algorithm we varied the parameter eps from 0.1 to 50 and the parameter minPts from 1 to 5, and the parameters that returned the best solution were considered.

Table 16. Comparative results for problems with no classes defined.

  USArrests: ( 4×50) Tripadvisor: ( 11×980)
Algorithms nc Sil time(s) Par nc Sil time(s) Par
MulticlusterKDE 2 0.58 0.767 α=5 6 0.52 6.193 α=5
PdfCluster 1 0.06 17 0.0353 305.94
DBSCAN 8 0.43 0.001 23.7/1 8 0.54 0.001 4.6/8
K-means 2 0.59 0.01 2:10 2 0.62 0.031 2:10
K-medoids 2 0.59 0.016 2:10 2 0.62 0.172 2:10
CLARA 2 0.59 0.02 2:10 2 0.62 0.031 2:10
  Random 1: ( 4×1506) Random 2: ( 9×4084)
MulticlusterKDE 2 0.75 5.88 α=5 7 0.79 16.145 α=5
PdfCluster 2 0.75 8.77
DBSCAN 2 0.75 0.02 1.2/1 8 0.78 0.172 2.3/5
K-means 2 0.75 0.049 2:10 6 0.76 0.74 2:10
K-medoids 2 0.75 0.31 2:10 7 0.78 9.052 2:10
CLARA 2 0.75 0.045 2:10 7 0.78 0.701 2:10
  Random 3: ( 4×4252) Random 4: ( 14×5845)
MulticlusterKDE 6 0.54 3.88 α=5 3 0.67 2.25 α=5
PdfCluster 3 0.67 423.06
DBSCAN 5 0.49 0.1 1.7/5 4 0.67 0.031 7.8/5
K-means 3 0.50 0.365 2:10 3 0.67 0.047 2:10
K-medoids 6 0.56 11.718 2:10 3 0.67 0.225 2:10
CLARA 6 0.56 0.564 2:10 3 0.67 0.048 2:10
  Random 5: ( 38×3520) Random 6: ( 30×2072)
MulticlusterKDE 14 0.83 10.1 α=5 9 0.63 3.78 α=5
PdfCluster
DBSCAN 15 0.83 0.128 5.7/5 10 0.63 0.185 10.2/5
K-means 18 0.64 2.167 2:20 8 0.57 0.373 2:20
K-medoids 14 0.83 16.72 2:20 9 0.63 2.379 2:20
CLARA 14 0.83 1.20 2:20 9 0.63 0.429 2:20
  Random 7: ( 20×6412) Random 8: ( 8×10351)
MulticlusterKDE 4 0.74 9.44 α=5 6 0.76 159 α=5
PdfCluster
DBSCAN 4 0.74 0.80 3.5/5 6 0.76 1.90 2/3
K-means 4 0.74 0.74 2:10 6 0.76 2.95 2:10
K-medoids 4 0.74 11.02 2:10 6 0.76 56.17 2:10
CLARA 4 0.74 9.88 2:10 6 0.76 3.98 2:10

In Table 16 we show the numerical results obtained by running the algorithms, implemented in R, on the set of problems mentioned above. To compare the algorithms we selected three performance measures, namely nc (number of clusters), Sil (silhouette coefficient) and time(s) (runtime in seconds). The table also presents the parameters adopted (Par).

We can notice an equivalent performance of the algorithms analyzed, since none of them showed to be effectively superior, except for the PdfCluster algorithm, which failed to solve 6 problems.

Taking into account the small variations in the algorithms' performance, we observe that K-medoids and CLARA obtained the best silhouette coefficients; however, this analysis must be viewed with care because, like K-means, they depend on information about the number of clusters a priori. Furthermore, the DBSCAN algorithm presented the smallest runtime in all tests; however, it requires two specific parameters, which can vary from problem to problem.

Finally, although the MulticlusterKDE algorithm presented, in some cases, a slightly higher runtime, it showed a good performance, since its silhouette coefficient was equivalent to that of the other algorithms and, besides, it does not need to be given the ideal number of clusters a priori.

For all that, we remark that our algorithm has shown itself to be suitable for the problems studied, considering both defined and undefined classes.

5. Conclusions

In this paper, we propose a new algorithm for clustering multivariate data, named MulticlusterKDE. This algorithm is based on the multiple optimization of the Gaussian multivariate kernel density, with the objective of determining the number of clusters and their respective centroids, with subsequent allocation of the observations by the criterion of minimum Euclidean distance.

MulticlusterKDE was implemented in R software and applied to several clustering problems, some of them with known class labels. For the seven problems with the number of clusters defined, according to the error measures, the MulticlusterKDE algorithm was the best in six problems, and the second best performance was obtained by the PdfCluster algorithm, despite it failing to produce results for two problems. MulticlusterKDE had the worst runtime in four of the problems because it uses an optimization function in its process. The poor performance of DBSCAN was a surprise. The other methods had an intermediate performance with better runtimes.

In the problems whose number of clusters is unknown a priori, according to the silhouette coefficient, no algorithm stood out in relation to the others. The number of clusters varied considerably, but it is worth mentioning that MulticlusterKDE, without knowing this number a priori, found exactly the same value as the CLARA and K-medoids algorithms in nine problems.

Another positive aspect of the MulticlusterKDE algorithm is that it does not require, a priori, the number of clusters as an input parameter, only as an optional argument. We also showed in this paper that the algorithm is well defined and converges independently of the data set. As a negative aspect of the algorithm, we highlight its dependence on the KDE function, which in turn is strictly dependent on the bandwidth matrix; but this occurs in most related works in the literature.

An interesting point about the runtime of MulticlusterKDE is its subdivision among the algorithm's steps. Most of its runtime is concentrated in the first step, that is, in the multiple optimizations of the Gaussian multivariate kernel density, raising the hypothesis that the speed of MulticlusterKDE is determined by the speed of the optimizer.

As a proposal for future work, we will focus our efforts on minimizing the influence of the α parameter used as the smoothing coefficient in the kernel function. We will also look for alternatives to assign the observations to the centroids. We hope to have results in this regard in a short period of time.

Acknowledgments

This work was partially supported by CAPES, CNPq, and Fundação Araucária, Brazil.

Funding Statement

This work was partially supported by CAPES (10.13039/501100002322), CNPq (10.13039/501100003593), and Fundação Araucária (10.13039/501100004612), Brazil.

Disclosure statement

No potential conflict of interest was reported by the authors.

References

  • 1.Ahuja M. and Bal J., Exploring cluster analysis, Int. J. Comput. Inf. Technol 3 (2014), pp. 594–597. [Google Scholar]
  • 2.Asuncion A. and Newman D., UCI machine learning repository (2007).
  • 3.Azzalini A. and Menardi G., Clustering via nonparametric density estimation: The r package pdfcluster, arXiv preprint arXiv:1301.6559 (2013).
  • 4.Azzalini A. and Torelli N., Clustering via nonparametric density estimation, Stat. Comput. 17 (2007), pp. 71–80. doi: 10.1007/s11222-006-9010-y [DOI] [Google Scholar]
  • 5.Chacon J.E. and Duong T., Multivariate Kernel Smoothing and Its Applications, Chapman and Hall/CRC, Boca Raton, FL, 2018. [Google Scholar]
  • 6.Charytanowicz M., Niewczas J., Kulczycki P., Kowalski P.A., Łukasik S., and Żak S., Complete gradient clustering algorithm for features analysis of x-ray images, in Information Technologies in Biomedicine, Vol. 69, Springer, 2010, pp. 15–24. doi: 10.1007/978-3-642-13105-9_2 [DOI] [Google Scholar]
  • 7.Chen J., Li K., Rong H., Bilal K., Yang N., and Li K., A disease diagnosis and treatment recommendation system based on big data mining and cloud computing, Inf. Sci. (Ny) 435 (2018), pp. 124–149. doi: 10.1016/j.ins.2018.01.001 [DOI] [Google Scholar]
  • 8.Cheng Y., Mean shift, mode seeking, and clustering, IEEE. Trans. Pattern. Anal. Mach. Intell. 17 (1995), pp. 790–799. doi: 10.1109/34.400568 [DOI] [Google Scholar]
  • 9.Cichocki A., Zdunek R., Phan A.H., and Amari S.i., Nonnegative Matrix and Tensor Factorizations: Applications to ExploratoryMulti-way Data Analysis and Blind Source Separation, John Wiley & Sons, Chichester, WSX, 2009. [Google Scholar]
  • 10.Defays D., An efficient algorithm for a complete-link method, Comput. J. 20 (1977), pp. 364–366. doi: 10.1093/comjnl/20.4.364 [DOI] [Google Scholar]
  • 11.Duan M., Li K., Liao X., Li K., and Tian Q., Features-enhanced multi-attribute estimation with convolutional tensor correlation fusion network, ACM Trans. Multimed. Comput. Commun. Appl. 15 (2019), pp. 1–23. [Google Scholar]
  • 12.Duan M., Li K., and Tian Q., A novel multi-task tensor correlation neural network for facial attribute prediction, arXiv preprint arXiv:1804.02810 (2018).
  • 13.Ester M., Kriegel H.P., Sander J., Xu X. et al., A density-based algorithm for discovering clusters in large spatial databases with noise, in Kdd 96 (1996), pp. 226–231. [Google Scholar]
  • 14.Forina M., Armanino C., Lanteri S., and Tiscornia E., Classification of olive oils from their fatty acid composition, in Food Research and Data Analysis: proceedings from the IUFoST Symposium, September 20–23, 1982, Oslo, Norway, H. Martens and H. Russwurm Jr., eds., Applied Science Publishers, London, 1983.
  • 15.Fraley C., Raftery A.E., Murphy T.B., and Scrucca L., mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation, Technical Report No. 597, Department of Statistics, University of Washington, 2012.
  • 16.Fraley C. and Raftery A.E., Bayesian regularization for normal mixture estimation and model-based clustering, J. Classif. 24 (2007), pp. 155–181. doi: 10.1007/s00357-007-0004-5 [DOI] [Google Scholar]
  • 17.Fraley C. and Raftery A.E., Model-based clustering, discriminant analysis and density estimation, J. Am. Stat. Assoc. 97 (2002), pp. 611–631. doi: 10.1198/016214502760047131 [DOI] [Google Scholar]
  • 18.Gan J. and Tao Y., Dynamic density based clustering, in Proceedings of the 2017 ACM International Conference on Management of Data, New York, NY, USA, Association for Computing Machinery, 2017, pp. 1493–1507.
  • 19.Gramacki A., Nonparametric Kernel Density Estimation and Its Computational Aspects, Springer, Cham, CHE, 2018. [Google Scholar]
  • 20.Gupta T. and Panda S.P., A comparison of k-means clustering algorithm and clara clustering algorithm on iris dataset, Int. J. Eng. Technol. 7 (2018), pp. 4766–4768. [Google Scholar]
  • 21.Han J., Kamber M., and Pei J., Data Mining Concepts and Techniques Third Edition, Morgan Kaufmann, Burlington, MA, 2011. [Google Scholar]
  • 22.Huang H., Ding C., Luo D., and Li T., Simultaneous tensor subspace selection and clustering: the equivalence of high order svd and k-means clustering, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data mining, 2008, pp. 327–335.
  • 23.Kassambara A., Practical Guide to Cluster Analysis in R: Unsupervised Machine Learning, Vol. 1, STHDA, 2017. Available at http://www.sthda.com/english/download/3-ebooks/. [Google Scholar]
  • 24.Kaufman P.J. and Rousseeuw L., Finding Groups in Data: An Introduction to Cluster Analysis, Vol. 344, Wiley-Interscience, Hoboken, NJ, 2009. [Google Scholar]
  • 25.Kulczycki P. and Charytanowicz M., A complete gradient clustering algorithm formed with kernel estimators, Int. J. Appl. Math. Comput. Sci. 20 (2010), pp. 123–134. doi: 10.2478/v10006-010-0009-3 [DOI] [Google Scholar]
  • 26.Liao L., Li K., Li K., Tian Q., and Yang C., Automatic density clustering with multiple kernels for high-dimension bioinformatics data, in 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, 2017, pp. 2105–2112.
  • 27.Luo Y., Tao D., Ramamohanarao K., Xu C., and Wen Y., Tensor canonical correlation analysis for multi-view dimension reduction, IEEE Trans. Knowl. Data Eng. 27 (2015), pp. 3111–3124. doi: 10.1109/TKDE.2015.2445757 [DOI] [Google Scholar]
  • 28.MacQueen J. et al., Some methods for classification and analysis of multivariate observations, in Proceedings of the fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. Oakland, CA, USA, 1967, pp. 281–297.
  • 29.Maitra R. and Melnykov V., Simulating data to study performance of finite mixture modeling and clustering algorithms, J. Comput. Graph. Stat. 19 (2010), pp. 354–376. doi: 10.1198/jcgs.2009.08054 [DOI] [Google Scholar]
  • 30.Matioli L., Santos S., Kleina M., and Leite E., A new algorithm for clustering based on kernel density estimation, J. Appl. Stat. 45 (2018), pp. 347–366. doi: 10.1080/02664763.2016.1277191 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.McLachlan G.J. and Krishnan T., The EM Algorithm and Extensions, Vol. 382, Wiley Series in Probability and Mathematical Statistics – John Wiley & Sons, New York, 2007. [Google Scholar]
  • 32.Menardi G. and Azzalini A., An advancement in clustering via nonparametric density estimation, Stat. Comput. 24 (2014), pp. 753–767. doi: 10.1007/s11222-013-9400-x [DOI] [Google Scholar]
  • 33.Nocedal J. and Wright S., Numerical Optimization. Springer, New York, 1999. [Google Scholar]
  • 34.R Core Team , R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2018. Available at https://www.R-project.org/.
  • 35.Race S.L., Iterative Consensus Clustering, North Carolina State University, Raleigh, NC, 2014. [Google Scholar]
  • 36.Rodriguez A. and Laio A., Clustering by fast search and find of density peaks, Science 344 (2014), pp. 1492–1496. doi: 10.1126/science.1242072 [DOI] [PubMed] [Google Scholar]
  • 37.Rousseeuw P.J., Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math. 20 (1987), pp. 53–65. doi: 10.1016/0377-0427(87)90125-7 [DOI] [Google Scholar]
  • 38.Scott D.W., Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley & Sons, Hoboken, NJ, 2015. [Google Scholar]
  • 39.Scrucca L., Fop M., Murphy T.B., and Raftery A.E., mclust 5: Clustering, classification and density estimation using gaussian finite mixture models, R. J. 8 (2016), pp. 205–233. doi: 10.32614/RJ-2016-021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Shah G.H., Bhensdadia C., and Ganatra A.P., An empirical evaluation of density-based clustering techniques, Int. J. Soft Comput. Eng. 22312307 (2012), pp. 216–223. [Google Scholar]
  • 41.Sibson R., Slink: An optimally efficient algorithm for the single-link cluster method, Comput. J. 16 (1973), pp. 30–34. doi: 10.1093/comjnl/16.1.30 [DOI] [Google Scholar]
  • 42.Silverman B.W., Density Estimation for Statistics and Data Analysis, Chapman and Hall, London, 1986. [Google Scholar]
  • 43.Sun W.W. and Li L., Dynamic tensor clustering, J. Am. Stat. Assoc. 114 (2019), pp. 1–28. doi: 10.1080/01621459.2019.1660170 [DOI] [Google Scholar]
  • 44.Wu J., Lin Z., and Zha H., Essential tensor learning for multi-view spectral clustering, IEEE. Trans. Image. Process. 28 (2019), pp. 5910–5922. doi: 10.1109/TIP.2019.2916740 [DOI] [PubMed] [Google Scholar]
  • 45.Xie J., Gao H., Xie W., Liu X., and Grant P.W., Robust clustering by detecting density peaks and assigning points based on fuzzy weighted k-nearest neighbors, Inf. Sci. (Ny) 7 (2016), pp. 10–35. [Google Scholar]
