Abstract
Clustering has primarily been used as an analytical technique to group unlabeled data and extract meaningful information. Because no single clustering algorithm can solve every clustering problem, several clustering algorithms with diverse applications have been developed. We review data clustering with the intention of underscoring recent applications in selected industrial sectors, along with other notable concepts. In this paper, we begin by highlighting clustering components and discussing classification terminologies. Furthermore, specific and general applications of clustering are discussed. Notable concepts concerning clustering algorithms, emerging variants, measures of similarity/dissimilarity, issues surrounding clustering optimization, validation and data types are outlined. Suggestions are made to emphasize the continued interest in clustering techniques among both scholars and industry practitioners. Key findings in this review include the size of data as a classification criterion; as datasets for clustering become larger and more varied, determining the optimal number of clusters will require new feature-extraction methods, validation indices and clustering techniques. In addition, clustering techniques have found growing use in key industry sectors linked to the sustainable development goals, such as manufacturing, transportation and logistics, energy, and healthcare, where clustering is more often integrated with other analytical techniques than used as a stand-alone technique.
Keywords: Clustering; Clustering classification; Clustering components; Industry applications; Clustering algorithms; Clustering trends
Introduction
Clustering has been defined as the grouping of objects where there is little or no knowledge about the relationships between objects in the given data (Jain et al. 1999; Liao 2005; Bose and Chen 2015; Grant and Yeo 2018; Samoilenko and Osei-Bryson 2019; Xie et al. 2020). Clustering also aims to reveal the underlying classes present within the data. It has likewise been referred to as a technique that groups unlabeled data, with little or no supervision, into different classes, such that objects within the same class share similar characteristics and differ from objects in other classes. Clustering has also been described as the aspect of machine learning that deals with unsupervised learning, where algorithms extract patterns from datasets obtained either from direct observation or from simulated data. Schwenker and Trentin (2014) described the learning process as an attempt to classify data observations or independent variables without knowledge of a target variable.
The grouping of objects into different classes has been one of the outcomes of data clustering over the years. However, the difficulty of obtaining a single method for determining the ideal or optimal number of classes across clustering problems has been a key issue noted by several authors, such as Sekula et al. (2017), Rodriguez et al. (2019) and Baidari and Patil (2020). Authors have referred to this issue as the subjectivity of clustering. Sekula et al. (2017), Pérez-Suárez et al. (2019) and Li et al. (2020a) described this subjectivity as the difficulty of indicating the best partition or cluster. The insufficiency of any single clustering technique for solving all clustering problems implies the careful selection of clustering parameters to ensure suitability for the user of the clustering results. Jain et al. (1999) specifically noted the need for several design choices in the clustering process, which creates the potential for the use and development of many clustering techniques/algorithms in existing and new areas of application. They presented general applications of clustering, such as information filtering and retrieval, which can span several industrial/business sectors. This work, however, discusses applications of clustering techniques specifically under selected industrial/business sectors with strong links to the United Nations Sustainable Development Goals (SDGs). We also note some developments in clustering, such as new techniques and data types, in the years since the publication of Jain et al. (1999).
This review aims to give a general overview of data clustering, clustering classification, data concerns in clustering, and applications and trends in the field of clustering. We present a basic description of the clustering component steps, clustering classification issues, clustering algorithms, generic applications of clustering across different industry sectors and specific applications within selected industries. The contribution of this work is mainly to underscore how clustering is being applied in industrial sectors with strong links to the SDGs. Other minor contributions are to point out clustering taxonomy issues and data-input concerns, and to suggest the size of input data as a useful criterion for classifying clustering algorithms. This review is also useful as a quick guide for practitioners or users of clustering methods interested in understanding the rudiments of clustering.
Clustering techniques have predominantly been used in the fields of statistics and computing for exploratory data analysis. However, clustering has found many applications in several industries such as manufacturing, transportation, medical science, energy, education, and wholesale and retail. Furthermore, Han et al. (2011), Landau et al. (2011), and Ezugwu et al. (2022) indicated an increasing application of clustering in many fields where data mining or processing capabilities have increased. Besides, the growing requirement for data in analytics and operations management in several fields has increased research and application interest in the use of clustering techniques.
To keep up with the growing interest in the field of clustering over the years, general reviews of clustering algorithms and approaches have been an observable trend (Jain et al. 1999; Liao 2005; Xu and Wunsch 2005; Alelyani et al. 2013; Schwenker and Trentin 2014; Saxena et al. 2017). Besides, there has been a recent trend of reviews of specific clustering techniques, such as in Denoeux and Kanjanatarakul (2016), Baadel et al. (2016), Shirkhorshidi et al. (2014), Bulò and Pelillo (2017), Rappoport and Shamir (2018), Ansari et al. (2019), Pérez-Suárez et al. (2019), Beltrán and Vilariño (2020), and Campello et al. (2020). We have also observed a growing number of reviews of clustering techniques within particular fields of application, such as in Naghieh and Peng (2009), Xu and Wunsch (2010), Anand et al. (2018), Negara and Andryani (2018), and Delgoshaei and Ali (2019). However, there appear to be insufficient reviews targeted at data clustering applications discussed under industrial sectors. The application of clustering is vast and, as Saxena et al. (2017) indicated, might be difficult to exhaust completely.
To put this article into perspective, we present our article selection method, a basic review of clustering steps, classification and techniques discussed in the literature under Sect. 2. Furthermore, we discuss clustering applications across and within selected business sectors or Industries in Sect. 3. A trend of how clustering is being applied in these sectors is also discussed in Sect. 3. In Sect. 4 we highlight some data issues in the field of clustering. Furthermore, in Sect. 5, we attempt to discuss and summarize clustering concepts from previous sections. We thereafter conclude and suggest future possibilities in the field of data clustering in Sect. 6.
Components and classifications for data clustering
Our article selection in this work follows a literature search approach similar to that of Govender and Sivakumar (2020), where Google Scholar (which provides indirect links to databases such as ScienceDirect) was indicated as the main search engine. In addition to the key reference word combinations they used, such as "clustering" and "clustering analysis", we searched the literature using Google Scholar for "clustering techniques", "approaches", "time series", "clustering sector application", "transportation", "manufacturing", "healthcare" and "energy". Further searching was conducted through cross-referencing and the screening of abstracts of potential articles. We ensured that articles with abstracts containing the keywords indicated earlier were selected for further review, while those not relevant to our clustering area of focus were excluded. Figure 1 below further illustrates our article selection process using the PRISMA flow diagram (Page et al. 2021), which aims to show the flow of information and a summary of the screening at the different stages of a systematic review.
Fig. 1.
Article selection process using PRISMA 2020 Flow diagram
The components of data clustering are the steps needed to perform a clustering task. Different taxonomies have been used in the classification of data clustering algorithms. Some words commonly used are approaches, methods or techniques (Jain et al. 1999; Liao 2005; Bulò and Pelillo 2017; Govender and Sivakumar 2020). However, clustering algorithms tend to be grouped or clustered in diverse ways based on their various characteristics. Jain et al. (1999) described this tendency toward different approaches as a result of cross-cutting issues affecting the specific placement of clustering algorithms under a particular approach. Khanmohammadi et al. (2017) noted these cross-cutting issues as a non-mutual-exclusivity property of clustering classification. We follow the logical perspective of Khanmohammadi et al. (2017), using the term criteria to classify data clustering techniques or approaches. The clustering techniques or approaches are subsequently employed to classify clustering algorithms.
Components of a clustering task
Components of data clustering have been presented as a flow from data sample requirements through clustering algorithms to cluster formation by several authors such as Jain et al. (1999), Liao (2005), and Xu and Wunsch (2010). According to Jain et al. (1999), the necessary steps to undertake a clustering activity are pattern representation (feature extraction and selection), similarity computation, the grouping process and cluster representation. Liao (2005) suggested three key components of time series clustering: the clustering algorithm, the similarity/dissimilarity measure and performance evaluation. Xu and Wunsch (2010) presented the components of a clustering task as four major feedback steps: feature selection/extraction, clustering algorithm design/selection, cluster validation and result interpretation. Alelyani et al. (2013) illustrated the components of data clustering as the requirement of unlabeled data, followed by the operation of collating similar data objects into a group and separating dissimilar data objects into other groups. Due to the subjective nature of clustering results, considering the performance evaluation of any clustering method used has become a necessary part of the clustering steps.
Taking these observations into consideration, we list the steps of a clustering activity below and also present them in Fig. 2:
1. Input data requirement.
2. Pattern representation (feature extraction and selection).
3. Clustering or grouping process (clustering algorithm selection and similarity/dissimilarity computation).
4. Cluster formation.
5. Performance evaluation (clustering validation).
6. Knowledge extraction.
Fig. 2.

Typical clustering steps (1 to 6)
Out of the six steps highlighted above, steps (2), (3), and (5) appear to be the most critical in practice: if any of them is not appropriately and satisfactorily conducted during clustering implementation, steps (2) through (5) might need to be revisited. We briefly discuss these vital steps.
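The six component steps above can be illustrated end to end in code. The following is a minimal sketch using Python and scikit-learn; the synthetic data, the choice of k-means and the silhouette index are illustrative assumptions, not prescriptions from the literature reviewed:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Step 1: input data (synthetic 2-D points around three hypothetical centres)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 3])])

# Step 2: pattern representation -- scale features to comparable ranges
X_scaled = StandardScaler().fit_transform(X)

# Steps 3 and 4: grouping process and cluster formation
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Step 5: performance evaluation with one validation index
score = silhouette_score(X_scaled, labels)

# Step 6: knowledge extraction, e.g., cluster sizes
sizes = np.bincount(labels)
print(round(score, 2), sizes)
```

In practice, an unsatisfactory validation score in step 5 would send the analyst back to steps 2 to 4, as discussed above.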
Pattern representation (step 2)
Jain et al. (1999) defined pattern representation as the "number of classes, the number of available patterns, and the number, type, and scale of the features available to the clustering algorithm". They indicated that pattern representation could consist of feature extraction and/or selection. On one hand, feature selection was defined as “the process of identifying the most effective subset of the original features to use in the clustering process”. On the other hand, “feature extraction is the use of one or more transformations of the data input features to produce new salient features to perform the clustering or grouping of data.” We refer readers to Jain et al. (1999), Parsons et al. (2004), Alelyani et al. (2013), Solorio-Fernández et al. (2020) for a comprehensive review of pattern representation, feature selection and extraction.
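As an illustration of feature extraction in the sense above (transforming the input features into new salient features), the sketch below projects a hypothetical high-dimensional dataset onto its leading principal components; PCA and the dataset dimensions are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# 100 observations with 10 features, where only 2 latent directions
# carry signal (a hypothetical dataset for illustration)
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(100, 10))

# Feature extraction: project onto the 2 leading principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```

The reduced representation would then be passed to the grouping process in step 3.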
Clustering or grouping process (step 3)
Jain et al. (1999) essentially describe this step as the grouping process, producing either a partition of distinct groups or groups with variable degrees of membership. They noted that clustering techniques attempt to group patterns so that the classes thereby obtained reflect the different pattern generation processes represented in the pattern set. As noted by Liao (2005), clustering algorithms are sequences of iterative procedures that rely on a stopping criterion, activated when a good clustering is obtained. Clustering algorithms were indicated to depend both on the type of data available and on the particular purpose and application. Liao (2005) discussed similarity/dissimilarity computation as requiring a function to measure the similarity between the two data items (e.g., raw values, matrices, feature pairs) being compared. Similarly, Jain et al. (1999) presented this as a distance function defined on pairs of patterns or groupings. Several authors, such as Jain et al. (1999), Liao (2005), Xu and Wunsch (2010), and Liu et al. (2020), have noted that similarity computation is an essential subcomponent of a typical clustering algorithm. We further discuss some similarity/dissimilarity measures in Sect. 2.4.
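The similarity/dissimilarity sub-step amounts to computing a pairwise proximity matrix. The hedged sketch below, using SciPy on three toy points, computes both a Euclidean dissimilarity matrix and a cosine similarity matrix:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Dissimilarity: full pairwise Euclidean distance matrix
D = squareform(pdist(X, metric="euclidean"))

# Similarity: 1 - cosine distance (the diagonal becomes 1, since each
# point is maximally similar to itself)
S = 1.0 - squareform(pdist(X, metric="cosine"))
print(np.round(D, 3))
```

A distance-based algorithm (e.g., k-means) consumes matrices like D, while similarity-based algorithms (e.g., ROCK) work from matrices like S.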
Performance evaluation (step 5)
This step is done to confirm the suitability of the number of clusters or groupings obtained as the result of clustering. Liao (2005) discussed this in terms of validation indices or functions used to determine the suitability or appropriateness of any clustering result. Sekula et al. (2017) indicated that clustering solutions depend strongly on the validation indices used and suggested the use of multiple indices for comparison.
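Following the suggestion of Sekula et al. (2017) to compare multiple indices, the sketch below scores candidate numbers of clusters with three common validation indices; the synthetic dataset and the choice of indices are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(2)
# Synthetic data with 3 well-separated groups
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 2))
               for c in ([0, 0], [4, 0], [2, 4])])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = (silhouette_score(X, labels),          # higher is better
                 calinski_harabasz_score(X, labels),   # higher is better
                 davies_bouldin_score(X, labels))      # lower is better

# The k maximizing the silhouette index should agree with the true
# number of groups on data this well separated
k_sil = max(scores, key=lambda k: scores[k][0])
print(k_sil)
```

Comparing several indices guards against the dependence of the solution on any single index.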
Clustering classification
There have been different terminologies for data clustering classification in the literature. This variety of classifications was indicated by Samoilenko and Osei-Bryson (2019) and Rodriguez et al. (2019) as a means of organizing the different clustering algorithms in the literature. Some have used the words approaches, methods and techniques; however, the terms techniques and methods appear to be the most widely used to refer to clustering algorithms.
Liao (2005) segmented time-series data clustering using three main criteria. These criteria refer to the manner of handling the data: either in its raw form, or by transforming the raw data into features or the parameters of a model. Saxena et al. (2017) used the terminology of clustering approaches and linked it to the reason for there being different clustering techniques, namely that the word "cluster" does not have an exact meaning. Bulò and Pelillo (2017) also discussed the limitations of hard or soft classifications of clustering into partitions and suggested an approach to clustering, referred to as the game-theoretic framework, that simultaneously overcomes the limitations of the hard and soft partition approaches. Khanmohammadi et al. (2017) indicated five criteria in the literature for classifying clustering algorithms: the nature of the data input, the measure of proximity of data objects, the generated data clusters, the membership function style and the clustering strategy. These criteria have resulted in different classifications of clustering algorithms.
We present in Fig. 3 below a summary of the classification criteria presented by Khanmohammadi et al. (2017). We extend these by adding a criterion that can also be used to classify clustering algorithms: the size of input data. The size of data was presented as a factor affecting the selection of a clustering algorithm by Andreopoulos et al. (2009), Shirkhorshidi et al. (2014) and, more recently, Mahdi et al. (2021). They observed that some clustering algorithms perform poorly and sacrifice quality when the data increase in volume, velocity, variability and variety. On the other hand, some other clustering algorithms can increase scalability and speed to cope with huge amounts of data. Another possible criterion that could be added is what Bulò and Pelillo (2017) described as a framework for clustering; however, this appears to be a clustering strategy. They described it as a perspective on the clustering process that differs from the traditional approaches of obtaining the number of clusters as a by-product of partitioning, a clustering ideology that can be thought of as a sequential search for structures in the data provided. Figure 3 below categorizes the approaches or criteria and the sub-approaches or sub-criteria that can be useful in classifying clustering algorithms.
Fig. 3.
Criteria and sub-criteria for classifying clustering algorithms
Clustering algorithms
The criteria/sub-criteria described in the previous section can be used in classifying clustering algorithms. However, clustering algorithms have traditionally been classified as having either a partitioning strategy (clusters obtained are put into distinctive groups) or a hierarchical strategy (forming a tree of linkages or relationships for the data objects being grouped). Jain et al. (1999) indicated the possibility of adding categories to this traditional classification, and some authors have since classified clustering algorithms using five clustering strategies, such as in Liao (2005) and Han et al. (2011). Using the clustering criteria described earlier, we demonstrate the classification of 21 clustering algorithms selected from the many in the literature. These are (1) k-means, (2) k-mode, (3) k-medoid, (4) Density-Based Spatial Clustering of Applications with Noise (DBSCAN), (5) CLustering In QUEst (CLIQUE), (6) Density clustering (Denclue), (7) Ordering Points To Identify the Clustering Structure (OPTICS), (8) STatistical INformation Grid (STING), (9) k-prototype, (10) Autoclass (a Bayesian approach to classification), (11) fuzzy k-means, (12) COOLCAT (an entropy-based algorithm for categorical clustering), (13) Cluster Identification via Connectivity Kernels (CLICK), (14) RObust Clustering using linKs (ROCK), (15) Self-Organising Map (SOM), (16) Single-linkage, (17) Complete-linkage, (18) Centroid-linkage, (19) Clustering Large Applications based upon RANdomized Search (CLARANS), (20) Overlapped k-means, (21) Model-based Overlapping Clustering (MOC).
We summarize these classifications in Tables 1 and 2 below and include selected references for extensive reading.
Table 1.
Classifications of clustering algorithms based on identified clustering criteria and sub-criteria
| Clustering criteria | Sub-criteria | Description | Applicable scenario(s) | Grouping of selected clustering algorithms |
|---|---|---|---|---|
| Type of input data: Banerjee et al. (2005), Andreopoulos et al. (2009), Khanmohammadi et al. (2017) | Categorical type | Data points are usually described as qualitative data (having characteristic attributes) | Customer information such as gender, payment method, etc. | k-mode (2), COOLCAT (12), CLICK (13), ROCK (14) |
| | Numeric type | Data points are usually described as quantitative data (measurable in numbers) | Gene expression datasets (gene vs tissue), grouping potential customers in sales and marketing | k-means (1), k-medoid (3), DBSCAN (4), Denclue (6), OPTICS (7), STING (8), SOM (15), CLARANS (19), Overlapped k-means (20) |
| | Mixed type | Data points could have numerical or categorical (discrete) descriptive attributes | Disease data (patient, sex, age, group) | CLIQUE (5), k-prototype (9), Autoclass (10), Fuzzy k-means (11), Single-linkage (16), Complete-linkage (17), Centroid-linkage (18), MOC (21) |
| Generated clusters: Andreopoulos et al. (2009), N'Cir et al. (2015), Khanmohammadi et al. (2017), Beltrán and Vilariño (2020) | Overlapping | Data points can belong to more than one cluster (membership either hard or fuzzy) | Social network analysis, information retrieval (e.g., several topics for a document) | Fuzzy k-means (11), Overlapped k-means (20), MOC (21) |
| | Non-overlapping | Data points can belong to only one of the identified clusters (exclusive) | Clustering of movies by content rating, e.g., AA, A, B, B15, C and D | k-means (1), k-mode (2), k-medoid (3), DBSCAN (4), CLIQUE (5), Denclue (6), OPTICS (7), STING (8), k-prototype (9), Autoclass (10), COOLCAT (12), CLICK (13), ROCK (14), SOM (15), Single-linkage (16), Complete-linkage (17), Centroid-linkage (18), CLARANS (19) |
| Membership style | Soft (Fuzzy) | Probabilistic membership, where a data point can belong to a cluster with some degree of membership between 0 and 1 | Clustering a range of a million colours | Fuzzy k-means (11) |
| | Hard (Crisp) | Binary membership, where a data point either belongs or does not belong to a cluster (0 or 1 membership) | Group work (grouping 12 students into 4 groups of 3 students each) | k-means (1), k-mode (2), k-medoid (3), DBSCAN (4), CLIQUE (5), Denclue (6), OPTICS (7), STING (8), k-prototype (9), Autoclass (10), COOLCAT (12), CLICK (13), ROCK (14), SOM (15), Single-linkage (16), Complete-linkage (17), Centroid-linkage (18), CLARANS (19), Overlapped k-means (20), MOC (21) |
| Proximity measure: Andreopoulos et al. (2009), Xu and Wunsch (2005, 2010), N'Cir et al. (2015), Khanmohammadi et al. (2017) | Similarity matrix | Data points are grouped into different clusters according to their resemblance (or not) to one another (usually for qualitative variables) | Common in document clustering and gene expression data analysis (e.g., use of cosine similarity, Pearson correlation, etc.) | k-mode (2), CLIQUE (5), Autoclass (10), COOLCAT (12), CLICK (13), ROCK (14) |
| | Distance matrix | Data points are grouped into different clusters according to certain distance functions (usually for continuous features) | Clustering using distance functions such as the Euclidean, Minkowski, sup and city-block distances, etc. | k-means (1), k-medoid (3), DBSCAN (4), Denclue (6), OPTICS (7), STING (8), k-prototype (9), Fuzzy k-means (11), SOM (15), Single-linkage (16), Complete-linkage (17), Centroid-linkage (18), CLARANS (19), Overlapped k-means (20), MOC (21) |
Table 2.
Continuation of classification of selected clustering algorithms based on identified clustering criteria and sub-criteria
| Clustering criteria | Sub-criteria | Description | Applicable scenario(s) | Clustering algorithms |
|---|---|---|---|---|
| Clustering strategy: Jain et al. (1999), Han et al. (2012), Khanmohammadi et al. (2017), Govender and Sivakumar (2020), Ezugwu et al. (2022) | Partitioning | Given a number of partitions, e.g., k partitions, n data objects are organized into such partitions by optimizing a partitioning criterion, e.g., a distance function. Each partition contains at least one object, such that k ≤ n | E.g., grouping postgraduate students with different supervisors | k-means (1), k-mode (2), k-medoid (3), k-prototype (9), Fuzzy k-means (11), COOLCAT (12), CLARANS (19), Overlapped k-means (20), MOC (21) |
| | Hierarchical | This method works by grouping data objects into a tree of clusters; it can be agglomerative or divisive | Clusters have different levels, e.g., in text mining (subtopics of mathematics could be algebra, calculus, trigonometry, etc.) | ROCK (14), Single-linkage (16), Complete-linkage (17), Centroid-linkage (18) |
| Andreopoulos et al. (2009), Han et al. (2011), Campello et al. (2020) | Density-based clustering | The central idea is to continue growing a cluster as long as the density (number of objects or data points) in the "neighbourhood" exceeds some threshold, rather than producing a clustering explicitly | E.g., bioinformatics, for locating the densest subspaces in interactome networks | DBSCAN (4), Denclue (6), OPTICS (7), CLICK (13) |
| Wang et al. (1997), Hireche et al. (2020) | Grid-based clustering | This method quantizes the object space into a finite number of cells that form a grid structure on which all clustering operations are performed. It clusters based on the cells rather than the data objects | Useful in facilitating several spatial queries (e.g., listing crime hotspots within a specific distance of a geographical region) | CLIQUE (5), STING (8) |
| Andreopoulos (2009), Hudson et al. (2011), Bouveyron and Brunet-Saumard (2014) | Model-based clustering | The method assumes a model for each of the clusters and attempts to best fit the data to the assumed model. Statistical and neural network methods are the two main approaches | In protein sequencing, bioinformatics, synchronisation of flowering (eucalypt flower records) | Autoclass (10), SOM (15), MOC (21) |
| Size of data: Andreopoulos et al. (2009), Khanmohammadi et al. (2017), Shirkhorshidi et al. (2014) | Suitable for large (high-dimensional) data | As data points increase, clustering quality is minimally compromised, owing to the scalability and speed of the algorithm [small complexity] | For example, social networking websites with billions of subscribers, microarray gene expression data, etc. | k-mode (2), CLIQUE (5), STING (8), SOM (15), CLARANS (19), Overlapped k-means (20) |
| | Not suitable for large data (low-dimensional data) | As data points increase, clustering quality is largely compromised, owing to the high complexity of the data and the computational cost [large complexity] | Extraction of knowledge from data smaller than about 10⁸ bytes | k-means (1), k-medoid (3), DBSCAN (4), Denclue (6), OPTICS (7), k-prototype (9), Autoclass (10), Fuzzy k-means (11), COOLCAT (12), CLICK (13), ROCK (14), Single-linkage (16), Complete-linkage (17), Centroid-linkage (18), MOC (21) |
The bracketed complexity values are useful in describing the effect of the size of data on clustering algorithm speed and scalability: the higher the values, the slower the clustering algorithm (Andreopoulos et al. 2009)
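The size-of-data criterion in Table 2 can be illustrated by comparing a standard algorithm with a scalable variant. The sketch below uses scikit-learn's mini-batch k-means as the scalable variant (an illustrative implementation choice); on an easy synthetic dataset, both recover essentially the same centres, while the mini-batch variant processes small random batches rather than the full data at each step:

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

rng = np.random.default_rng(3)
# 40,000 points in 10 dimensions: two well-separated groups
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20_000, 10))
               for c in (np.zeros(10), np.full(10, 5.0))])

full = KMeans(n_clusters=2, n_init=3, random_state=0).fit(X)
mini = MiniBatchKMeans(n_clusters=2, batch_size=1024,
                       n_init=3, random_state=0).fit(X)

# Compare the recovered centres (sorted by first coordinate); the
# mini-batch variant trades a little exactness for scalability
gap = np.abs(np.sort(full.cluster_centers_[:, 0])
             - np.sort(mini.cluster_centers_[:, 0])).max()
print(round(gap, 3))
```

On genuinely large or streaming data, the quality gap can widen, which is the trade-off the size-of-data criterion captures.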
Traditional clustering strategies
In this section, we provide a basic description of clustering algorithms representing the traditional clustering strategies of partitioning and hierarchical clustering.
We present the common partitioning algorithm (k-means) and the generic hierarchical clustering algorithms, owing to their widespread basic usage and their importance as foundations for other clustering algorithms. These are as discussed by Xu and Wunsch (2010), Sekula (2015) and James et al. (2015), with some modifications to aid comprehension.
Given the following notation:
- n: the number of observations in the data to cluster (number of data objects).
- K: the number of clusters (selected randomly or obtained through statistical tests, such as with the NbClust function in the statistical program R).
- C_k: the centroid of cluster k, where k ranges from 1 to K.
- K-means algorithm
- Randomly assign a number from 1 to K to each of the n observations (initial cluster assignment).
- Iterate until the cluster assignments stop changing:
- For each of the K clusters, compute the cluster centroid C_k.
- Assign each observation to the cluster whose centroid is closest (where closest is defined using a distance measure such as the Euclidean distance).
- Iteration and cluster assignment end when the total within-cluster variation, summed over all clusters, is as small as possible.
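The k-means steps listed above can be sketched directly; the following is a minimal NumPy implementation (the dataset, the value of K and the empty-cluster re-seeding rule are assumptions for illustration):

```python
import numpy as np

def kmeans(X, K, rng, max_iter=100):
    n = len(X)
    # Step 1: randomly assign a number from 1..K (here 0..K-1) to each observation
    labels = rng.integers(0, K, size=n)
    for _ in range(max_iter):
        # Step 2a: compute the centroid C_k of each cluster
        # (an empty cluster is re-seeded from a random point -- an assumption,
        # since the textbook steps do not cover this edge case)
        centroids = np.array([X[labels == k].mean(axis=0)
                              if np.any(labels == k) else X[rng.integers(n)]
                              for k in range(K)])
        # Step 2b: reassign each observation to its closest centroid (Euclidean)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        # Stop when the cluster assignments stop changing
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2))
               for c in ([0, 0], [5, 5])])
labels, centroids = kmeans(X, K=2, rng=rng)
print(np.bincount(labels))
```

Each reassignment pass cannot increase the total within-cluster variation, which is why the iteration terminates.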
- Generic hierarchical agglomerative clustering
- Begin with the n observations and a distance/dissimilarity measure (such as the Euclidean distance) for all pairwise dissimilarities (each observation is treated as its own cluster).
- Compute the pairwise inter-cluster dissimilarities.
- Examine all pairwise inter-cluster dissimilarities among the individual clusters and identify the pair of clusters that are least dissimilar (the dissimilarities computed depend on the type of linkage, such as complete, single or average, and on the type of dissimilarity measure, such as correlation-based or Euclidean distances).
- Combine these two clusters.
- Compute the new pairwise inter-cluster dissimilarities among the remaining clusters.
- Iteration proceeds until all n observations belong to one cluster.
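The agglomerative steps above are implemented in common libraries; a hedged sketch using SciPy follows, with complete linkage and Euclidean distance being one of the choices mentioned above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Four toy points forming two obvious pairs
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])

d = pdist(X, metric="euclidean")        # all pairwise dissimilarities
Z = linkage(d, method="complete")       # iteratively merge least-dissimilar pairs
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
```

The linkage matrix Z records the full merge tree, so the same fit can be cut at any number of clusters without re-running the algorithm.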
Generic hierarchical divisive clustering
Hierarchical divisive clustering is the reverse of hierarchical agglomerative clustering.
- Begin with one cluster (all n observations in a single cluster).
- Split this single (large) cluster hierarchically into new, smaller clusters using a dissimilarity measure and an appropriate linkage.
- Iteration proceeds until all observations have been allocated.
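A divisive procedure can be sketched as repeatedly splitting the largest remaining cluster. The splitting rule below uses 2-means, which is an assumption for illustration; classical divisive methods such as DIANA use different splitting rules:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, n_clusters, seed=0):
    clusters = [np.arange(len(X))]       # begin: one cluster of all observations
    while len(clusters) < n_clusters:
        # split the largest cluster (a common heuristic, assumed here)
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)
        split = KMeans(n_clusters=2, n_init=5,
                       random_state=seed).fit_predict(X[idx])
        clusters += [idx[split == 0], idx[split == 1]]
    return clusters

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [4, 0], [0, 4])])
parts = divisive(X, 3)
print(sorted(len(p) for p in parts))
```

Reading the splits top-down yields the same kind of tree that agglomerative clustering builds bottom-up.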
Traditional clustering strategy variants
Denoeux and Kanjanatarakul (2016) and Saxena et al. (2017) presented clustering algorithms as basically having either hierarchical or partitioning strategies. The density-based, grid-based and model-based clustering strategies were indicated by them to exhibit the spirit of either the hierarchical or the partitioning strategy. The classification of clustering algorithms into one of the five clustering strategies, as presented in Table 2 above, appears to be widely used by several authors. Therefore, we limit further discussion of clustering algorithms to the clustering-strategy criterion.
Some other clustering algorithms have been noted by Han et al. (2011) and Campello et al. (2020) to possess characteristics that make them difficult to classify exclusively under one of the five clustering strategies. As a result, different classification strategies have been given in the literature to account for this (Saxena et al. 2017). Recently, additional clustering strategies have been developed, as discussed in Ezugwu et al. (2022). Some arose partly to overcome limitations of the traditional clustering techniques, as in Bulò and Pelillo (2017), Valls et al. (2018) and He et al. (2020); others have resulted from the need to apply clustering in new fields of application. Saxena et al. (2017) also acknowledged the division of clustering algorithms into the five classifications above; however, they indicated other clustering methods, such as multi-objective clustering, collaborative fuzzy clustering and search-based clustering, as variants of the two broad clustering methods indicated earlier.
We summarize the descriptions of Saxena et al. (2017) in Table 3 below, and suggest references to recent articles that have extended the selected clustering variants for detailed study.
Table 3.
Clustering algorithms based on extended clustering strategy
| Extended clustering strategy | Description | Clustering algorithms | Selected references |
|---|---|---|---|
| Graph (theoretic) clustering | A method that represents clusters using graphs. Graph clustering involves the task of dividing nodes into clusters so that the edge density is higher within clusters as opposed to across clusters | Complete link; minimum cut; information-theoretic; normalized cut etc | Matula (1977), Hu et al. (2009), Das et al. (2020), Chen et al. (2020) |
| Spectral clustering | This constructs affinity matrix in terms of similarity between data points before performing the clustering task. e.g. Un-normalized and Normalized spectral clustering. A special case of graph-theoretic clustering. Obtaining the quality of affinity matrix and spectral vectors determination are major steps | Traditional spectral; Spectral clustering using normalized laplacian;multi-view spectral clustering | Ng et al. (2002), Saxena et al. (2017), Du et al. (2020), Sharma and Seal (2020) |
| Dominant Set Clustering (DSC) | This is based on a stepwise search for patterns or structures in data with the clustering ideology in mind, similar to solution search in optimization theory, game theory and graph theory. Another special case of graph-theoretic clustering | DSC based on Frank-Wolfe algorithms; DSC based on replicator dynamics | Bulò and Pelillo (2017), Johnell and Chehreghani (2020) |
| Evolutionary Approaches Based Clustering (EABC) | The population of solutions corresponds to the K-partitions of the data. Partitions with a large fitness value corresponding to a small square error are retained after the evolutionary operation | EABC on particle swarm optimization; EABC on genetic algorithm; EABC on ant colony optimization; whale optimization algorithm; crow search algorithm; emperor penguin optimizer | Jain et al. (1999), Saxena et al. (2017), Ezugwu et al. (2022) |
| Search-Based Clustering Approaches (SBCA) | This comprises stochastic and deterministic techniques. The stochastic techniques are similar to the evolutionary-based approach and may not guarantee an optimal solution while the deterministic seeks to obtain optimal solutions | SBCA on simulated annealing, SBCA on Tabu search | Saxena et al. (2017), (Bandyopadhyay et al. (2008), Nakayama and Kagaku (1998) |
| Collaborative fuzzy clustering | This is relatively recent compared to other clustering techniques. Subsets of patterns can be processed together to find a structure that is common to all of them | Horizontal or vertical type | Hu et al. (2020), Pedrycz (2002), Zhao et al. (2020) |
| Multi-objective clustering | Clustering criteria are jointly optimized | MOCK; MOCA-SM | Ramadan et al. (2020), Kessira and Kechadi (2020) |
| Overlapping clustering or overlapping community detection | Objects belong to more than one cluster or group in overlapping clustering. Overlapping community detection is aimed at identifying such multiple groups | MOC; SBK; ADCLUS; OKM; DClustR; OCDC; MCLC | Banerjee et al. (2005), Beltrán and Vilariño (2020), Xie et al. (2013) |
| Evidential clustering (EVCLUS) | This is a soft clustering technique based on determining mass functions for data objects | EK-NNclus; EVCLUS; ECM | Denoeux and Kanjanatarakul (2016), Masson and Denoeux (2008), Denoeux (2020) |
| Subspace clustering | This is an extension of feature selection that attempts to find clusters in different subspaces of the same dataset | CLIQUE, ENCLUS, DOC, CBF, Multi-view subspace clustering | Parsons et al. (2004), Huang et al. (2016), Rong et al. (2020) |
We present basic steps of selected variants of the traditional clustering strategies as discussed by Saxena et al. (2017), including examples of clustering algorithms discussed by Jain et al. (1999), Pedrycz (2002), Johnell and Chehreghani (2020) and Ramadan et al. (2020), with minor modifications to aid basic comprehension. Given a number of data observations, the goal is to form clusters using different representations and approaches.
(a) Grid-based clustering
- Define a set of grid cells.
- Assign observations to the appropriate grid cell and compute the density of each cell.
- Eliminate cells whose density is below a certain threshold.
- Form clusters from contiguous groups of dense cells.
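The grid-based steps above can be sketched in a few lines of Python. This is a minimal illustration (the function name, cell size and density threshold are illustrative choices, not from the source), assuming numeric data and axis-aligned square cells:

```python
import numpy as np

def grid_cluster(points, cell_size=1.0, density_threshold=2):
    """Sketch of grid-based clustering: bin points into cells, keep dense
    cells, and merge contiguous dense cells into clusters."""
    # 1. Assign each observation to a grid cell.
    cells = {}
    for idx, p in enumerate(points):
        key = tuple((np.asarray(p) // cell_size).astype(int))
        cells.setdefault(key, []).append(idx)
    # 2. Eliminate cells whose density is below the threshold.
    dense = {k: v for k, v in cells.items() if len(v) >= density_threshold}
    # 3. Form clusters from contiguous (edge-adjacent) dense cells via flood fill.
    unvisited = set(dense)
    clusters = []
    while unvisited:
        stack = [unvisited.pop()]
        members = []
        while stack:
            cell = stack.pop()
            members.extend(dense[cell])
            for dim in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:dim] + (cell[dim] + step,) + cell[dim + 1:]
                    if nb in unvisited:
                        unvisited.remove(nb)
                        stack.append(nb)
        clusters.append(sorted(members))
    return clusters
```

Each returned cluster is a list of observation indices; cells with fewer points than the threshold are discarded as noise.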
(b) Spectral clustering
- Construct a similarity graph between observations or objects to be clustered.
- Compute the associated graph Laplacian matrix (This is obtained from the weighted adjacency matrix and diagonal matrix of the similarity graph).
- Compute the first k eigenvectors of the Laplacian matrix to define a feature vector for each object (k is the number of clusters to construct).
- Organize objects into classes by running the k-means algorithm on the features obtained.
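A compact sketch of the unnormalized variant of these steps, assuming a fully connected Gaussian-kernel similarity graph and a small hand-rolled k-means (all names and parameter choices here are illustrative):

```python
import numpy as np

def kmeans_labels(F, k, iters=100):
    """Minimal Lloyd k-means with deterministic farthest-point initialization."""
    centers = [F[0]]
    for _ in range(1, k):
        d = np.min([((F - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(F[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((F[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        centers = np.array([F[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

def spectral_cluster(X, k, sigma=1.0):
    # 1. Similarity graph: fully connected, Gaussian kernel weights.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # 2. Unnormalized graph Laplacian L = D - W (D = diagonal degree matrix).
    L = np.diag(W.sum(axis=1)) - W
    # 3. First k eigenvectors (smallest eigenvalues) give the spectral features.
    _, vecs = np.linalg.eigh(L)
    # 4. Run k-means on the spectral features.
    return kmeans_labels(vecs[:, :k], k)
```

For well-separated groups the smallest Laplacian eigenvectors are nearly constant on each group, so k-means on them recovers the partition.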
(c) Evolutionary-based clustering
- Generate (e.g., randomly) a population of solutions. Each solution S corresponds to a valid K-partition (set of clusters) of the observations.
- Assign a fitness value with each solution.
- Assign a probability of selection or survival (based on the fitness value) to each solution.
- Obtain a new population of solutions using the evolutionary operators namely selection (e.g., roulette wheel selection), recombination (e.g., crossover) and mutation (e.g., pairwise interchange mutation).
- Evaluate the fitness values of these solutions.
- Repeat the previous two steps (obtaining a new population and evaluating its fitness) until termination conditions are satisfied.
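The loop above can be sketched with a label-vector encoding of each K-partition. Fitness is taken as the negative squared error; the operator choices (roulette-wheel-style selection, single-point crossover, point mutation, plus simple elitism, which the generic steps do not mandate) and all settings are illustrative:

```python
import numpy as np

def sse(X, labels, k):
    """Squared error of a K-partition (small error = large fitness)."""
    return sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
               for j in range(k) if (labels == j).any())

def evolve_clusters(X, k=2, pop_size=20, generations=40, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    pop = rng.integers(0, k, size=(pop_size, n))   # random K-partitions
    for _ in range(generations):
        fitness = np.array([-sse(X, ind, k) for ind in pop])
        # Roulette-wheel-style selection on shifted fitness values.
        w = fitness - fitness.min() + 1e-9
        parents = pop[rng.choice(pop_size, size=pop_size, p=w / w.sum())]
        # Single-point crossover on consecutive parent pairs.
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            cut = rng.integers(1, n)
            children[i, cut:] = parents[i + 1, cut:]
            children[i + 1, cut:] = parents[i, cut:]
        # Point mutation: reassign one random object per child.
        for child in children:
            child[rng.integers(n)] = rng.integers(k)
        # Simple elitism: carry the current best solution over unchanged.
        children[0] = pop[fitness.argmax()]
        pop = children
    return min(pop, key=lambda ind: sse(X, ind, k))
```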
(d) Dominant set clustering
This follows an iterative procedure to compute clusters according to Johnell and Chehreghani (2020):
- Compute a dominant set using the similarity matrix of the available observations or data objects.
- Remove the clustered observation from the data.
- Repeat until a predefined number of clusters has been obtained.
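A minimal sketch of this peel-off procedure, using discrete replicator dynamics, x ← x ∘ (Ax) / (xᵀAx), to extract one dominant set at a time from the similarity matrix; the support threshold and iteration count are illustrative choices:

```python
import numpy as np

def dominant_set(A, iters=200):
    """Extract one dominant set from similarity matrix A (zero diagonal)
    via replicator dynamics; returns the support of the converged vector."""
    n = len(A)
    x = np.full(n, 1.0 / n)
    for _ in range(iters):
        Ax = A @ x
        x = x * Ax / (x @ Ax)
    return x > 1.0 / (10 * n)

def dominant_set_clustering(A, n_clusters=2):
    """Peel off dominant sets one at a time (the iterative procedure above)."""
    remaining = np.arange(len(A))
    clusters = []
    for _ in range(n_clusters):
        support = dominant_set(A[np.ix_(remaining, remaining)])
        clusters.append(remaining[support].tolist())
        remaining = remaining[~support]
        if remaining.size == 0:
            break
    return clusters
```

Mass concentrates on the most cohesive subset of objects, whose indices are returned as the first cluster; the process repeats on the remaining objects.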
(e) Collaborative fuzzy clustering (Pedrycz 2002)
This is achieved through two stages, namely: (A) generation of clusters without collaboration and (B) collaboration of the clusters.
- Given subsets of patterns (patterns are obtained from observations):
- Select distance function, number of fuzzy clusters, termination criterion, and collaboration matrix.
- Initiate randomly all partition matrices based on the number of patterns.
- Stage A: Generation of clusters without collaboration.
  - 4.1. Compute prototypes (centroids) and partition matrices for all subsets of patterns. (The results of clustering for each subset of patterns come in the form of a partition matrix and a collection of prototypes.)
  - 4.2. Computation is done until a termination criterion has been satisfied.
- Stage B: Collaboration of the clusters.
  - 5.1. Given the computed matrix of collaboration,
  - 5.2. compute prototypes (e.g., using the Lagrange multipliers technique) and partition matrices (e.g., using weighted Euclidean distances between prototype and pattern).
  - 5.3. Computation is done until a termination criterion has been satisfied.
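Stage B requires the collaboration-matrix machinery of Pedrycz (2002), which is omitted here for brevity; the sketch below covers only Stage A, running standard fuzzy c-means independently on each subset of patterns (function names and settings are illustrative):

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100, seed=0):
    """Standard fuzzy c-means: alternate prototype (centroid) and
    partition-matrix updates for a fixed number of iterations."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                        # memberships sum to 1 per object
    for _ in range(iters):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)      # prototypes
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        p = 2.0 / (m - 1.0)
        U = (1.0 / d ** p) / (1.0 / d ** p).sum(axis=0)   # partition matrix
    return U, V

# Stage A: cluster each subset of patterns independently, no collaboration yet.
def stage_a(subsets, c=2):
    return [fuzzy_c_means(S, c) for S in subsets]
```

Each subset yields its own partition matrix U and prototypes V, exactly the inputs Stage B would then couple through the collaboration matrix.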
(f) Multi-objective clustering
Multi-objective clustering is described using k-means modified to work on two objective functions according to Ramadan et al. (2020):
- The data (consisting of observations) is divided into a number of sets. The number of sets may depend on the number of distributed machines or the number of threads to be used.
- The mean and variance values are computed for each set of data.
- k-means clustering is applied to each set. The number of clusters k is selected either heuristically or based on the number of records in each set.
- At the global optimization step, Pareto optimality is applied to the clusters’ centroids to identify the nondominated centroids.
- For nondominated clusters, the distance between each point and its cluster centre is computed, as well as the silhouette score between the point and the nearest cluster centre. Then, the k-means algorithm is used to re-cluster those points.
- A window W is used to extract the most effective clusters based on the required points. Pareto optimality could be applied once more for better results.
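The Pareto-optimality filtering used in the steps above can be sketched as a plain nondominated filter over candidate clusters, assuming each candidate is scored on two objectives that are both to be minimized (the scoring choice is illustrative):

```python
import numpy as np

def nondominated(scores):
    """Boolean mask of Pareto-nondominated rows; each row holds the
    objective values of one candidate cluster (smaller is better)."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            # j dominates i if j is no worse everywhere and better somewhere.
            if i != j and np.all(scores[j] <= scores[i]) and np.any(scores[j] < scores[i]):
                keep[i] = False
                break
    return keep
```

Candidates on the Pareto front (no other candidate improves one objective without worsening the other) survive the filter.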
(g) Search-based clustering
Search-based clustering is described using the simulated annealing example presented by Jain et al. (1999):
- Randomly select an initial partition P0 of the data (comprising the observations) and compute its squared error value E_P0.
- Select values for the control parameters: the initial and final temperatures T0 and Tf, respectively, and set the current temperature T = T0.
- Select a neighbour partition P1 of P0 and compute its squared error value E_P1.
- If E_P1 is larger than E_P0, then assign P1 to P0 with a temperature-dependent probability. Else assign P1 to P0.
- Repeat step 3 for a fixed number of iterations.
- Reduce the value of the temperature T, i.e., T = cT, where c is a predetermined constant.
- If T is greater than Tf, then go to step 3. Else stop.
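The annealing loop above can be sketched as follows; the neighbour move (reassigning one random object), cooling constant and temperature settings are illustrative choices:

```python
import numpy as np

def squared_error(X, labels, k):
    return sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
               for j in range(k) if (labels == j).any())

def anneal_cluster(X, k=2, t0=10.0, tf=0.01, c=0.9, moves_per_t=50, seed=0):
    """Sketch of simulated-annealing clustering: perturb the partition, accept
    worse partitions with temperature-dependent probability, cool via T = cT."""
    rng = np.random.default_rng(seed)
    n = len(X)
    part = rng.integers(0, k, size=n)          # initial partition P0
    err = squared_error(X, part, k)            # E_P0
    t = t0
    while t > tf:
        for _ in range(moves_per_t):
            # Neighbour partition P1: move one random object to a random cluster.
            cand = part.copy()
            cand[rng.integers(n)] = rng.integers(k)
            cand_err = squared_error(X, cand, k)
            if cand_err <= err or rng.random() < np.exp((err - cand_err) / t):
                part, err = cand, cand_err
        t *= c                                 # geometric cooling
    return part, err
```

Early, high-temperature iterations accept uphill moves to escape local minima; as T falls toward Tf the search becomes effectively greedy.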
Similarity and dissimilarity measures
As indicated by Jain et al. (1999), similarity measures are the actual strategies that clustering algorithms utilize in grouping data objects into a class or cluster, while dissimilarity measures are used to differentiate one data grouping or cluster from another. Saxena et al. (2017) also emphasized the important role that similarity of objects within a cluster plays in a clustering process. According to Jain et al. (1999), many clustering methods use distance measures to determine the similarity or dissimilarity between any pair of objects, and they gave conditions for any valid distance measure. Xu and Wunsch (2010) emphasized the conditional requirements for computing a similarity/dissimilarity function between any two data objects when using a distance measure: a valid similarity function or measure must satisfy the symmetry, positivity, triangle inequality and reflexivity conditions. We present some of the similarity functions noted in the literature in Table 4 and suggest references for more comprehensive studies. Other similarity functions or measures that have been discussed in the literature are the city-block distance, sup distance, squared Mahalanobis distance and point symmetry distance. Xu and Wunsch (2010), Niwattanakul et al. (2013), Saxena et al. (2017) and Kalgotra et al. (2020) provide additional discussions on similarity functions not included in this article.
Table 4.
Selected use of some similarity and dissimilarity measures
| Measure | Suitability | Selected reference |
|---|---|---|
| Minkowski | For numeric attributes. The similarity between data pairs corresponds to the closeness of distance between data pairs | Xu and Wunsch (2010), Saxena et al. (2017), Xu and Wunsch (2005) |
| Euclidean distance | Most commonly used for numeric attributes. A special instance of Minkowski, e.g. in the k-means algorithm | Thakur et al. (2020), Qian et al. (2004) |
| Cosine measure | Varies more with linear transformations than rotational transformations. More commonly used for document clustering | Qian et al. (2004), Ye (2011) |
| Pearson correlation measure | Suitable for numeric variables and magnitude difference of two variables. Used for analyzing gene expression data | D’haeseleer (2005), Xu and Wunsch (2010) |
| Jaccard measure | Suitable for information retrieval and word similarity measurement. Can detect a mistake in spellings but cannot detect over-type words | Niwattanakul et al. (2013), Xu and Wunsch (2005) |
| Dice coefficient measure | Similar to the Jaccard measure for information retrieval | Pandit and Gupta (2011), Xu and Wunsch (2005) |
Basic mathematical definitions of some of these measures as discussed by Xu and Wunsch (2010) are presented below. It is assumed that the dataset consists of $N$ data objects or observations $x_i$, each with $d$ features. Notation: $D(x_i, x_j)$ denotes the distance function between two objects in the dataset; $S(x_i, x_j)$ denotes the similarity function between two objects in the dataset.

- Minkowski distance: $D(x_i, x_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^{n} \right)^{1/n}$, where $n$ is a generic numeric value.
- Euclidean distance: $D(x_i, x_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^{2} \right)^{1/2}$, the special case of Minkowski with $n = 2$.
- Cosine similarity: $S(x_i, x_j) = \dfrac{x_i^{T} x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}$.
- Extended Jaccard measure: $S(x_i, x_j) = \dfrac{x_i^{T} x_j}{\lVert x_i \rVert^{2} + \lVert x_j \rVert^{2} - x_i^{T} x_j}$.
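These definitions translate directly into code; a small sketch:

```python
import numpy as np

def minkowski(x, y, n=3):
    """Minkowski distance with a generic order n."""
    return float((np.abs(x - y) ** n).sum() ** (1.0 / n))

def euclidean(x, y):
    """Special case of Minkowski with n = 2."""
    return minkowski(x, y, n=2)

def cosine(x, y):
    """Cosine similarity: x.y / (|x| |y|)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def extended_jaccard(x, y):
    """Extended Jaccard: x.y / (|x|^2 + |y|^2 - x.y)."""
    return float(x @ y / (x @ x + y @ y - x @ y))
```

For example, orthogonal unit vectors have Euclidean distance $\sqrt{2}$ but cosine similarity and extended Jaccard measure of 0, illustrating that distance and similarity measures order pairs differently.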
Cluster optimization and validation
As indicated in the introduction section, obtaining the optimal number of clusters has been a major output of data clustering and an issue that keeps research in the field of clustering active. It has been widely indicated that no clustering algorithm can always solve all clustering problems. Saxena et al. (2017) emphasized user control in deciding the number of clusters, which might follow a trial-and-error, heuristic or evolutionary procedure. Fu and Perry (2020) discussed some trial-and-error and heuristic methods of obtaining the number of clusters and proposed a method that predicts errors and subsequently chooses the smallest error to determine the appropriate number of clusters. Improving the quality of clustering results obtainable from traditional clustering algorithms and their variants has recently been advanced by authors such as Calmon and Albi (2020), Chen et al. (2020) and Ushakov and Vasilyev (2020).
As indicated by Jain et al. (1999), multiple features could be extracted or selected from given data, and performing pairwise similarity comparisons within clusters for all data values can make clustering combinatorially difficult as data sizes increase. Also, Xu and Wunsch (2005) emphasized that different clustering algorithms can produce different results for the same data, and even the same clustering algorithm using different approaches can still result in different clusters being formed.
As a result, researchers have validated their search for the optimal number of clusters through techniques widely referred to as indices. Two major categories of indices have been highlighted in the literature: internal indices and external indices. Some authors have broken these validation indices into three categories, but as Xu and Wunsch (2005) and Sekula et al. (2017) indicated, these can still be subsumed under internal and external indices. According to Baidari and Patil (2020), internal indices measure the compactness of the clusters by applying similarity measure techniques, cluster separability, intra-cluster homogeneity, or a combination of these. External criteria match the structure of the cluster to a predefined classification of the instances in order to validate clustering results. They, however, noted the common use of internal validity indices with clustering algorithms. Table 5 below shows selected internal and external indices from the literature.
Table 5.
Selected internal and external validation indices
| Major indices | Other indices linked to major | Examples | Selected references |
|---|---|---|---|
| Internal indices | Stability criteria Sekula et al. (2017) | Sum of squared error; scatter criteria; Condorcet’s criterion; the C-criterion; category utility; edge cut metrics; Calinski and Harabasz (CH) index; Krzanowski and Lai (KL) index; Silhouette index; Gap index; compact-separate proportion (CSP) index; index method based on data depth | Liu et al. (2010), Mourer et al. (2020) |
| External indices | Relative criteria Xu and Wunsch (2005) | Mutual information-based measure; F-measure; biological homogeneity index; biological stability index; Jaccard index; Fowlkes–Mallows index; confusion matrix | Saxena et al. (2017), Li et al. (2020b) |
We present basic definitions of some of the indices discussed by Xu and Wunsch (2010) with some modifications to aid basic comprehension.
Description of selected external indices
Given a derived clustering structure $C$, obtained using a clustering algorithm on dataset $X$, and a prescribed clustering structure $P$, linked to prior information on dataset $X$, define:

$a$ = number of pairs of data objects in $X$ belonging to the same cluster in both $C$ and $P$.
$b$ = number of pairs of data objects in $X$ belonging to the same cluster in $C$ but different clusters in $P$.
$c$ = number of pairs of data objects in $X$ belonging to different clusters in $C$ but the same cluster in $P$.
$d$ = number of pairs of data objects in $X$ belonging to different clusters in both $C$ and $P$.
$M = a + b + c + d = N(N-1)/2$, the total number of pairs of objects among the $N$ data objects in dataset $X$.

- Rand index: $R = (a + d)/M$
- Jaccard coefficient: $J = a/(a + b + c)$
- Fowlkes and Mallows index: $FM = a/\sqrt{(a + b)(a + c)}$
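The pair counts $a$, $b$, $c$, $d$ and the three indices can be computed directly from two label vectors; a sketch:

```python
import numpy as np
from itertools import combinations

def pair_counts(labels_c, labels_p):
    """Count a, b, c, d over all object pairs for clusterings C and P."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_c)), 2):
        same_c = labels_c[i] == labels_c[j]
        same_p = labels_p[i] == labels_p[j]
        if same_c and same_p:
            a += 1
        elif same_c:
            b += 1
        elif same_p:
            c += 1
        else:
            d += 1
    return a, b, c, d

def rand_index(lc, lp):
    a, b, c, d = pair_counts(lc, lp)
    return (a + d) / (a + b + c + d)

def jaccard_coefficient(lc, lp):
    a, b, c, _ = pair_counts(lc, lp)
    return a / (a + b + c)

def fowlkes_mallows(lc, lp):
    a, b, c, _ = pair_counts(lc, lp)
    return a / np.sqrt((a + b) * (a + c))
```

All three indices reach 1 when the derived clustering agrees with the prescribed one on every pair of objects.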
Description of selected internal indices
Also given $N$ data objects in dataset $X$, with $K$ partitions (clusters) $C_1, \dots, C_K$, where:

$n_i$ = number of data objects assigned to cluster $C_i$;
$m_i$ = centroid linked to cluster $C_i$;
$m$ = total centroid (mean) vector of the dataset;
$e_i$ = average error for cluster $C_i$, i.e., the average distance between the objects in $C_i$ and its centroid $m_i$;
$D_{ij}$ = distance function between clusters $C_i$ and $C_j$ in the dataset.

- Calinski and Harabasz index: $CH = \dfrac{\sum_{i=1}^{K} n_i \lVert m_i - m \rVert^{2} / (K - 1)}{\sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - m_i \rVert^{2} / (N - K)}$
The larger the value of $CH$, the better the quality of the clustering solution obtained.
- Davies–Bouldin index: $DB = \dfrac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \dfrac{e_i + e_j}{D_{ij}}$, where $D_{ij} = \lVert m_i - m_j \rVert$.
The minimum $DB$ indicates the potential number of clusters $K$ in the data set.
- Dunn index: $Dunn = \min_{i \neq j} \dfrac{D(C_i, C_j)}{\max_{k} \operatorname{diam}(C_k)}$, where $D(C_i, C_j)$ is the minimum distance between objects of $C_i$ and $C_j$, and $\operatorname{diam}(C_k)$ is the largest distance between objects within $C_k$.
The larger the value of $Dunn$, the better the estimation of $K$.
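A sketch of the three internal indices following the definitions above, assuming integer labels 0..K−1:

```python
import numpy as np

def ch_index(X, labels):
    """Calinski-Harabasz: between- over within-cluster dispersion (larger is better)."""
    k, n = labels.max() + 1, len(X)
    m = X.mean(axis=0)
    between = sum((labels == j).sum() * ((X[labels == j].mean(axis=0) - m) ** 2).sum()
                  for j in range(k))
    within = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
                 for j in range(k))
    return (between / (k - 1)) / (within / (n - k))

def db_index(X, labels):
    """Davies-Bouldin: mean worst-case ratio of within-cluster scatter to
    centroid separation (smaller is better)."""
    k = labels.max() + 1
    cents = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    errs = [np.linalg.norm(X[labels == j] - cents[j], axis=1).mean() for j in range(k)]
    return np.mean([max((errs[i] + errs[j]) / np.linalg.norm(cents[i] - cents[j])
                        for j in range(k) if j != i) for i in range(k)])

def dunn_index(X, labels):
    """Dunn: smallest between-cluster distance over largest cluster diameter
    (larger is better)."""
    groups = [X[labels == j] for j in range(labels.max() + 1)]
    diam = max(np.linalg.norm(g[:, None] - g[None, :], axis=2).max() for g in groups)
    sep = min(np.linalg.norm(gi[:, None] - gj[None, :], axis=2).min()
              for i, gi in enumerate(groups) for gj in groups[i + 1:])
    return sep / diam
```

On the same data, a labeling that matches the natural groups should score higher CH and Dunn values and a lower DB value than a mismatched labeling.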
Applications of clustering
Clustering techniques have been widely used in several fields and areas (Rai et al. 2006; Devolder et al. 2012; Bulò and Pelillo 2017; Grant and Yeo 2018; Nerurkar et al. 2018; Govender and Sivakumar 2020). Its relevance has also been shown as an analytical technique on its own (Ray and Turi 1999; Lismont et al. 2017; Motiwalla et al. 2019) and also as a hybrid method with other analytical solution techniques such as in Grant and Yeo (2018), Zhu et al. (2019), Liu and Chen (2019), Jamali-Dinan et al. (2020), Tanoto et al. (2020), Pereira and Frazzon (2020). We review some field applications of clustering and subsequently review the application of clustering techniques in particular business sectors or fields.
Field applications
Some of the direct areas of clustering application generally discussed in the literature have been textual document classification, image segmentation, object recognition, character recognition, information retrieval, data mining, spatial data analysis, business analytics, data reduction, and big data mining. Other areas indicated by Saxena et al. (2017) have been sequence analysis (Durbin et al. 1998; Li et al. 2012), human genetic clustering (Kaplan and Winther 2013; Lelieveld et al. 2017; Marbac et al. 2019), mobile banking and information systems (Motiwalla et al. 2019; Shiau et al. 2019), social network analysis (Scott and Carrington 2011; Shiau et al. 2017; Khamparia et al. 2020), search result grouping (Mehrotra and Kohli 2016; Kohli and Mehrotra 2016), software evolution (Rathee and Chhabra 2018; Izadkhah and Tajgardan 2019), recommender systems (Petwal et al. 2020), educational data mining (Baker 2010; Guleria and Sood 2020), climatology (Sharghi et al. 2018; Pike and Lintner 2020; Chattopadhyay et al. 2020) and robotics (Khouja and Booth 1995; Zhang et al. 2013). In Table 6 below we briefly discuss a few applications as indicated by Saxena et al. (2017) and also provide references for more detailed studies.
Table 6.
Some field applications of clustering techniques
| Field | Application of clustering | References |
|---|---|---|
|
Textual documents, Document storage |
Basically, clustering of texts. Efficient document storage and retrieval for many institutions of learning have been noted to be one of the important applications of clustering. In addition, discovering events and sub-events from a sequence of news articles | Rasmussen (1992), Piernik et al. (2015), Chan et al. (2016), Lee et al. (2020), Celardo and Everett (2020) |
| Image segmentation | This is centered around the partition of images for visibility and classification of images based on some properties | Forsyth and Ponce (2002), Lam and Wunsch (2014), Zhang et al. (2020) |
| Object recognition | 3D object grouping has been an area of application | Dorai and Jain (1995) |
| Character recognition | Handwriting recognition has been an important application | Connell and Jain (1998) |
| Data mining | Widely used in this field both to analyze structured and unstructured databases | Hedberg (1996), Han et al. (2011) |
| Spatial and space application | Large data sets from geographical information systems and satellite images have been analyzed using clustering techniques | Upton and Fingleton (1985), Tahmasebi et al. (2012), Song et al. (2020), Zhang et al. (2020) |
| Business analytics | Operational areas of marketing, demand management and production areas of product development and categorization | Kiang et al. (2007), Fennell et al. (2003), Pereira and Frazzon (2020) |
| Data reduction | Compression of large data into manageable sizes usually saves processing time and cost | Jiang et al. (2016), Huang (1997) |
| Big data mining | For databases with a growing capacity of being exponential beyond manageable sizes of conventional database tools | Shirkhorshidi et al. (2014), Russom (2011), Ezugwu et al. (2022) |
| Social networking | Applied in behavioural grouping of people and activities such as e-governance and educational learning sites | Cheng et al. (2020), Khamparia et al. (2020) |
| Non-numerical openly expressed information | Categorizing verbal information using motivation (push theory) and meaning (pull theory), e.g. in profiling tourists based on motivations for destinations and meanings of destinations to the same tourists | Batet et al. (2010), Valls et al. (2018) |
Selected industry applications
The application fields or areas of clustering described above are general areas that cut across different industrial and business sectors. Clustering techniques have also found extensive application in certain industries. As indicated by Dalziel et al. (2018), different firms with similar buy-sell characteristics could be grouped under the same industry. Clustering has been used partly as a stand-alone analytical technique and largely as a hybrid technique with other analytical methods to solve industrial problems. According to Jakupović et al. (2010), Dalziel et al. (2018), Grant and Yeo (2018), Xu et al. (2020) and Ezugwu et al. (2022), several business or industrial sectors exist. They further noted that a unique or universal classification of industries or business sectors is difficult because industries or sectors are mostly classified based on the specific needs of the classifier.
According to Citizenship (2016), ten (10) industrial sectors of impact on the SDGs were identified, namely Consumer goods, Industrials, Oil and Gas, Healthcare, Basic Materials, Utilities, Telecoms, Financials, Consumer Services and Technology. In addition, the industrial sectors were organised into three groups: the primary sector (raw material extraction and production), the secondary sector (production of goods from raw materials) and the tertiary sector (provision of services). These industries have also been noted to have strong linkages to one or more SDGs. For example, Healthcare strongly impacts SDG 3, which is to achieve good health and well-being for all, while Oil and Gas are strongly linked to SDG 7 (affordable and clean energy). Consumer goods, Industrials and Consumer Services impact across SDG 12 (responsible consumption and production), SDG 2 (achieving zero hunger) and SDG 14 (on the protection of the marine environment). Furthermore, the Utilities sector, known for infrastructure provision, impacts across SDG 6 (clean water and sanitation), SDG 7 and SDG 9 (industry, innovation and infrastructure). Others, such as SDG 1 (poverty), SDG 4 (education) and SDG 5 (gender equality), have a lower direct linkage to any particular sector and receive supporting actions from the earlier discussed industrial/business sectors.
As several clustering techniques have been extensively reported in the literature, chances also exist of a corresponding application of clustering techniques in several identified industries/sectors. Using the SDG classifications indicated above, we select sectors important in driving most of the SDGs. These sectors are mostly grouped under Transportation and logistics (such as consumer services), Manufacturing (such as industrials, basic materials, consumer goods), Energy (such as oil and gas, utilities) and Healthcare. In addition, the selected industries positively impact economic growth, innovation, the closing of development gaps and well-being in a typical economy (Nhamo et al. 2020; Shi 2020; Abbaspour and Abbasizade 2020).
Transportation and logistics
The application of clustering in the transportation industry has generally been in the identification of similar patterns across various modes of transportation (Almannaa et al. 2020). Some fields under the transportation sector where clustering has been applied are hazardous materials transportation, road transportation and urban/public transportation (De Luca et al. 2011; Lu et al. 2013; Rabbani et al. 2017; Sfyridis and Agnolucci 2020; Almannaa et al. 2020). Recently, Wang and Wang (2020) discussed the application of a genetic fuzzy C-means algorithm and factor analysis to identify and control high-risk drivers. de Armiño et al. (2020) combined hierarchical clustering and neural networks to develop a linkage between road transportation data and macroeconomic indicators. Almannaa et al. (2020) developed a multi-objective clustering method that simultaneously maximizes purity and similarity in each cluster formed. They also noted that the convergence speed of the multi-objective clustering method was fast and that the number of clusters obtained was stable for determining traffic and bike pattern changes within clusters.
Manufacturing
As with clustering applications in the transportation sector, the manufacturing sector and its systems, such as those discussed by Delgoshaei and Gomes (2016) and Delgoshaei et al. (2021), also feature a wide application of clustering techniques, mostly as hybrid methods with other analytical methods. Using a case study of the textile manufacturing business, Li et al. (2011) used clustering analysis to classify customers based on selected customer characteristics and further used cross-analysis to identify customer behavior tendencies. Chandrasekharan and Rajagopalan (1986) adopted k-means in a group technology problem, following which the initial groupings obtained were improved. There has been a recent trend of applying clustering techniques in cloud manufacturing, cyber manufacturing, smart manufacturing, manufacturing systems and cellular manufacturing. Delgoshaei and Ali (2019) reviewed hybrid clustering methods and search algorithms, such as metaheuristics, for designing cellular manufacturing systems. Liu and Chen (2019) used a k-medoids clustering-based algorithm and a trust-aware approach to predict quality-of-service records, which might become intractable under cloud manufacturing. An improved k-means clustering technique was compared to a random k-means by Yin (2020) to determine which method could provide an optimal number of edge computing nodes in a smart manufacturing setup. Sabbagh and Ameri (2020) demonstrated the application of unsupervised learning in text analytics, using k-means clustering and topic modelling techniques to build clusters of supplier capability topics. Subramaniyan et al. (2020) clustered time-series bottleneck data using dynamic time warping and the complete-linkage agglomerative hierarchical clustering technique for determining bottlenecks in manufacturing systems. Ahn and Chang (2019) discussed business process management for manufacturing models and used agglomerative hierarchical clustering in the design and management of manufacturing processes. A hybrid dynamic clustering and other techniques for establishing similarities in the 3D geometry of parts and printing processes were investigated by Chan et al. (2018).
Energy
Clustering techniques have also been widely used in the field of energy, both in isolation and in combination with other analytical techniques. Some fields under energy where clustering has been applied include energy efficiency, renewable energy, electricity consumption, heating and cooling, nuclear energy, and smart metering. The k-means clustering technique and its variants have mostly been used in energy-sector clustering. Vialetto and Noro (2020) used k-means clustering with the silhouette method to define the number of clusters while clustering energy demand data; they used clustering in the design of cogeneration systems to allow energy-cost savings. Wang and Yang (2020) used fuzzy clustering and an accelerated genetic algorithm to analyze and assess sustainability and influencing factors for the renewable energy of 27 European Union countries. Fuzzy C-means and a multi-criteria decision-making process were applied by Tran (2020) to design the optimal loading of ships and the diesel fuel consumption of marine ships. Tanoto et al. (2020) applied a hybrid of k-means clustering and a neural network based self-organizing map to group technology mixes with similar patterns; their method was designed for the energy modelling community for understanding complex design choices in electricity industry planning. Suh et al. (2020) applied text mining in nuclear energy, using clustering analysis and technology network analysis to identify topics in nuclear waste management over time. Shamim and Rihan (2020) compared k-means clustering with and without feature extraction in the smart metering of electricity; their experiments showed that clustering using features extracted from the raw data performed better than clustering the raw data directly.
Healthcare
The healthcare industry has been described as one that can generate a vast amount of data from diverse clinical procedures and sources in which clustering techniques are found useful (Palanisamy and Thirunavukarasu 2019; Ambigavathi and Sridharan 2020). According to Jothi and Nur’Aini Abdul Rashidb (2015), Manogaran and Lopez (2017), Palanisamy and Thirunavukarasu (2019) and Shafqat et al. (2020) some heterogeneous data sources in the healthcare industry include electronic health records, medical imaging, genetic data, clinical diagnosis, metabolomics, proteomics and long-term psychological sensing of an individual.
Clustering techniques have been useful in the healthcare industry as part of data mining techniques for the identification of patterns in healthcare data sets (Jothi and Nur’Aini Abdul Rashidb 2015; Ahmad et al. 2015; Ogundele et al. 2018). As described by Ogundele et al. (2018), data mining is the field of study that seeks to find useful and meaningful information from large data. This definition makes data mining techniques such as clustering relevant in the healthcare industry. Ahmad et al. (2015) showed with examples that clustering algorithms could be used as stand-alone techniques or as hybrids with other analytical techniques in understanding healthcare datasets. Clustering algorithms such as k-means, k-medoids, and x-means have been used to diagnose several diseases such as breast cancers, heart problems, diabetes, and seizures (Ahmad et al. 2015; Alsayat and El-Sayed 2016; Kao et al. 2017; Ogundele et al. 2018; Shafqat et al. 2020). To understand patterns in automatically-collected events in healthcare settings, patient flow and clinical-setting conformance, Johns et al. (2020) discussed the use of trace clustering. Density-based clustering has also been applied to obtain useful patterns from biomedical data (Ahmad et al. 2015). Hybrid techniques for analyzing and predicting health issues, such as combining clustering algorithms with classification trees, combining k-means with statistical analysis, and hybrid hierarchical clustering, were discussed by Ahmad et al. (2015).
Yoo et al. (2012), Jothi and Nur’Aini Abdul Rashidb (2015) and Ogundele et al. (2018) indicated that clustering techniques (unsupervised learning) form the descriptive components of data mining techniques. In addition, Jothi and Nur’Aini Abdul Rashidb (2015) noted that clustering techniques are not as utilized as the predictive (supervised) components of data mining techniques. Ahmad et al. (2015), however, pointed out that a combination of different data mining techniques should be used to achieve better disease prediction, clinical monitoring, and general healthcare improvement in the healthcare industry.
Figure 4 below summarizes the general application of clustering techniques based on the identified industries above.
Fig. 4. General application of clustering techniques
Data size, dimensionality, and data type issues in clustering
One of the approaches earlier listed for classifying clustering algorithms is the type of input data. Liao (2005) observes that the data input into any clustering task can be classified as binary, categorical, numerical, interval, ordinal, relational, textual, spatial, temporal, spatio-temporal, image, multimedia, or mixtures of these data types. These classes can also be sub-classified. For example, numeric raw data for clustering can be static, time-series or a data stream. Static data do not change with time, while time-series data have data objects that change with time. Aggarwal et al. (2003) described a data stream as large volumes of data arriving at an unlimited growth rate. As noted by Mahdi et al. (2021), data types that are vast and complex to store, such as social network data (referred to as big data), and high-speed data (data streams), such as web-click streams and network traffic, can be challenging to cluster. In addition, they emphasized that the data type considered often influences the clustering technique selected.
Applying some clustering algorithms directly to raw data becomes problematic as the data size grows (Gordon 1999; Parsons et al. 2004). Two reasons were given for this problem. The first depends on the type of clustering algorithm used: some algorithms take all dimensions of the data into consideration during the clustering process and, as a result, conceal potential clusters among outlying data objects. The second is that, as the dimensionality of the data increases, distance measures for computing similarity or dissimilarity among data objects become less effective. Feature extraction and selection were suggested as a generic remedy that reduces the dimensionality of the data before the clustering algorithms are applied. However, the authors noted that this feature-based method could omit clusters hidden in subspaces of the data sets; subspace clustering was suggested to overcome this.
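The loss of contrast in distance measures as dimensionality grows can be demonstrated with a small, self-contained experiment (uniform random points; the specific numbers are illustrative only):

```python
import random

def distance_contrast(n_points=200, dim=2, seed=1):
    """Relative contrast (d_max - d_min) / d_min of the distances from the
    origin to uniform random points in [0, 1]^dim. As dim grows, distances
    concentrate and the contrast shrinks, so nearest and farthest
    neighbours become nearly indistinguishable."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(sum(x * x for x in p) ** 0.5)
    return (max(dists) - min(dists)) / min(dists)

low = distance_contrast(dim=2)      # high contrast in low dimensions
high = distance_contrast(dim=1000)  # much lower contrast in high dimensions
```

This shrinking contrast is the quantitative face of the second problem above, and it motivates dimensionality reduction before distance-based clustering.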
Reducing the dimensionality of the original data through feature extraction and selection methods, and variants such as subspace clustering, continues to be investigated by several authors (Huang et al. 2016; Motlagh et al. 2019; Solorio-Fernández et al. 2020). Huang et al. (2016) specifically indicated that time-series data are subject to large data sizes, high dimensionality, and progressive updating. They suggested clustering over time segments of time-series data, rather than the whole time-series sequence, to ensure all hidden clusters in the data are accounted for. Hence, data pre-processing techniques such as normalization and cumulative clustering have been suggested. Pereira and Frazzon (2020) used data pre-processing to detect and remove outliers, followed by normalization, before a clustering algorithm was applied. Li et al. (2020a) considered ameliorating datasets to improve clustering accuracy by transforming bad data sets into good ones using the HIBOG algorithm. Solorio-Fernández et al. (2020) present a comprehensive review of feature selection highlighting the growing advances in unsupervised feature selection methods (filter, wrapper, and hybrid) for unlabeled data.
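An outlier-removal-then-normalization pipeline of the kind Pereira and Frazzon (2020) describe can be sketched as follows; this is a generic illustration (z-score filtering plus min-max scaling) under our own assumptions, not their actual implementation:

```python
def preprocess(values, z_thresh=3.0):
    """Illustrative pre-processing before clustering: drop values whose
    z-score exceeds the threshold, then min-max normalize the rest
    into the range [0, 1]."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    kept = [v for v in values if std == 0 or abs(v - mean) / std <= z_thresh]
    lo, hi = min(kept), max(kept)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in kept]

raw = [10, 12, 11, 13, 9] * 4 + [500]   # 500 is an obvious outlier
clean = preprocess(raw)                  # outlier dropped, rest scaled to [0, 1]
```

Normalizing after removing outliers matters: a single extreme value would otherwise compress all remaining values into a narrow band and distort distance-based clustering.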
Clustering can also become an issue when multi-source and multi-modal data are considered. Multi-source data (originating from several sources) have been characterized by complexity, heterogeneity, dynamicity, distribution, and largeness (Uselton et al. 1998). As noted by Sprague et al. (2017) and Afyouni et al. (2021), the combination or fusion of data from diverse organizations with different reporting formats, structures, and dimensions can introduce complexity into multi-source data. Lahat et al. (2015) and Li and Wang (2021) discussed the complementary and diverse attributes of multi-modal data (e.g., the same content captured as text, image, audio, and video) and reported similar fusion-related complexity challenges. Adapting existing clustering algorithms or developing new ones will be necessary to analyze such potentially big and complex data.
Since clustering results are strongly linked to the type and features of the data representation, clustering performance is being improved through current supervised machine learning methods such as Deep Neural Networks (DNNs). As noted by James et al. (2015) and Ni et al. (2022), DNNs have performed more successfully (e.g., in speech and text modelling and in video and image classification) than the earlier neural networks described in Hastie et al. (2009), owing to the reduced manual tuning required and the increasing availability of large training data sets. DNNs can be used to obtain improved feature representations for clustering before the actual clustering is performed, an approach referred to as deep clustering in the machine learning field (Aljalbout et al. 2018). Min et al. (2018) classified deep clustering methods primarily by network architecture rather than clustering loss, reflecting the basic goal of obtaining clustering-oriented representations. They further classified deep clustering based on: (I) the use of an Autoencoder (AE) to obtain a feasible feature representation; (II) feedforward networks, such as feedforward convolutional networks, which can use a specific clustering loss to obtain a feasible feature representation; and (III) the Generative Adversarial Network (GAN) and Variational Autoencoder (VAE), which use effective generative learning frameworks to obtain feature representations.
Discussions
In this section, we highlight the major considerations from the earlier sections and project possible application trends in the field of clustering. In Sect. 2, we noted some inconsistencies in the terminologies and classification criteria used in grouping clustering algorithms and their variants. Authors in the field of data clustering have suggested different terminologies for grouping clustering algorithms. The partitioning and hierarchical approaches have primarily been used to group clustering algorithms, and other approaches such as density-based, model-based, and grid-based have been suggested as extensions to these primary approaches. The five clustering approaches mentioned earlier can be categorized as clustering strategies. Other criteria, such as proximity measure, input data, size of input data, membership function style, and the generated clusters, can further be used to categorize the approaches employed in classifying clustering algorithms. The selection and design of clustering algorithms is a vital step among the clustering components. We suggest that the clustering component steps are better viewed as cyclical with feedback than as a straight-through sequence; this better reflects the iterative reality of obtaining appropriate cluster results.
The reality is that there is no universally accepted clustering algorithm to solve all clustering problems (Jain et al. 1999; Rodriguez et al. 2019), and the limitations of clustering algorithms are a strong motivation for the emergence of new clustering algorithms or variants of the traditional ones. As new clustering algorithms emerge, existing terminologies and classification approaches are expected to broaden, with a seeming departure from the traditional approaches. Alongside the growing number of clustering algorithms is a growing number of clustering validation indices, perhaps because users of clustering results want to know with good confidence that the results obtained are well suited to the application. To test the suitability of different clustering algorithms and indices in meeting users' needs, and aided by increasing computing capabilities, clustering algorithms and indices are being combined in computer programs. Rodriguez et al. (2019) presented a comparative study of nine clustering algorithms available in the R programming language. Other authors, such as Sekula (2015), have described clustering packages in R that are useful for comparison and as user-friendly applications. In addition, computer programs are used to suggest a suitable number of clusters for algorithms (e.g., k-means) that require the number of clusters as an input (Rhodes et al. 2014; Charrad et al. 2015).
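As a sketch of how such programs can suggest a cluster count, the elbow heuristic below runs a basic k-means for several values of k and tracks the within-cluster sum of squares; the data are synthetic and this is a generic illustration, not the procedure of any package cited above:

```python
import random

def kmeans_inertia(points, k, iters=50, seed=0):
    """Run a basic k-means and return the within-cluster sum of squares
    (inertia), the quantity the elbow heuristic tracks across values of k."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d2.index(min(d2))].append(p)
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return sum(sum((a - b) ** 2 for a, b in zip(p, c))
               for c, cl in zip(centroids, clusters) for p in cl)

# With well-separated groups, inertia typically drops sharply until k
# reaches the true group count (here 3) and then flattens: the "elbow".
data = [(0, 0), (1, 0), (0, 1),
        (10, 10), (11, 10), (10, 11),
        (20, 0), (21, 0), (20, 1)]
wcss = {k: kmeans_inertia(data, k) for k in range(1, 6)}
```

Packages such as NbClust automate this kind of inspection across many validation indices rather than relying on a single curve.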
In Sect. 3, we noted that the application of clustering has largely been reported in areas such as image segmentation, object recognition, character recognition, information retrieval, and data mining. We consider these to be specific applications of clustering algorithms, and more field applications are expected to be reported given the vast applicability of clustering techniques. Also emphasized is the application of clustering in selected industrial sectors. We specifically noted the diverse classification schemes and groupings of industrial sectors. The numerous existing clustering algorithms are potentially applicable across many of these industries. We, however, selected manufacturing, energy, transportation and logistics, and healthcare as examples to illustrate the application of clustering in industries with important links to achieving the sustainable development goals. In these industries, clustering appears to be moving from a stand-alone analytical technique towards hybrid techniques combined with other analytical processes, suggesting that clustering will continue to be relevant as an integrated analytical technique across industries and sectors. Moreover, the vast application of clustering techniques implies that practitioners with a basic understanding of clustering can use algorithms embedded in software with little difficulty.
In Sect. 4, we highlighted some data sources used in clustering and discussed some data issues users of clustering techniques are likely to face. Clustering raw data inputs is generally more problematic than clustering refined inputs, largely because of the dimensionality problem. Owing to the growth of computing technology for industrial applications and cloud computing, the use of clustering techniques to analyze high volumes of static, time-series, multi-source, and multi-modal data is a future trend. For multi-source and multi-modal data, applications or frameworks that can effectively integrate or fuse the complementary attributes of such data are currently observable trends, and clustering techniques will be more readily deployed in such secondary data-use domains.
As data sizes become larger through modern data mining capabilities, and to avoid incomplete knowledge extraction from single sources or modes of data, methods that fuse complementary and diverse data with the goal of identifying hidden clusters are also a notable trend. For example, deep learning methods are sometimes merged with traditional clustering methods to search further for underlying clusters and thereby improve clustering performance.
Putting the main observations in this paper together, the emergence of new clustering algorithms is expected given the subjective nature of clustering and its vast applicability in diverse fields and industries. Emerging scholars can therefore find meaningful research interest in several aspects of data clustering, such as the development of new clustering algorithms and validity indices, improving clustering quality, and comprehensive field and industry reviews of clustering techniques. Industry practitioners will also find use in applying specific clustering algorithms to analyze unlabeled data and extract meaningful information.
Conclusion and future directions
In this paper, we presented a basic definition and description of clustering components. We extended existing criteria in the literature for classifying clustering algorithms, and discussed both traditional clustering algorithms and emerging variants. We also emphasized that clustering algorithms can produce different groupings for the same set of data. Moreover, as no clustering algorithm can solve all clustering problems, several clustering validation indices have been developed and are used to gain confidence in the cluster results obtained from clustering algorithms.
We summarized field applications of clustering algorithms such as in image segmentation, object recognition, character recognition, data mining and social networking that have been pointed out in the literature. Selected applications of clustering techniques and notable trends in industrial sectors with strong links to achieving sustainable development goals were further presented to show the diverse application of clustering techniques. Also suggested are possible application trends in the field of clustering that are observable from both specific and general article reviews in the literature. Some data input concerns in the field of clustering were examined.
This study presents a foundation for further research. First, the investigation of feature extraction, selection, alignment, and other methods that can reveal hidden clusters in large-volume, high-frequency data, such as data streams and the multi-modal and multi-source data obtainable from current data mining capabilities, technologies, and computer simulations, is a current and future research interest for both academia and industry.
In addition, the development of new clustering strategies to analyze existing and modern data types (e.g., fused multi-source and multi-modal data) would also be of more interest to researchers. The outputs and knowledge extracted from such data types could be beneficial to policymakers and business practitioners in informed decision making.
Second, clustering techniques are highly likely to find further applicability in existing fields, for example text mining, industrial big data applications, biomedicine, commercial sectors, military applications, space navigation, and biological processes. In emerging application areas such as learning management systems and social media, which churn out huge amounts of data and have recently seen a further increase due to the COVID-19 pandemic, the development of effective and efficient clustering algorithms to sufficiently mine such massive data is projected. Deep clustering will generally find more applications across business sectors where pure clustering methods have been used, owing to its observed performance in obtaining better clustering results, for example in image classification, where feedforward convolutional networks have been very useful.
Finally, a data clustering trend analysis that summarizes qualitative and quantitative results from applications of diverse variants of clustering strategies would be a worthwhile extension of this research effort.
Data availability
Data sharing not applicable to this article as no datasets were generated or analysed during the current study.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- Abbaspour M, Abbasizade F. Energy performance evaluation based on SDGs. In: Leal Filho W, Azul AM, Brandli L, Lange Salvia A, Wall T, editors. Affordable and clean energy. Cham: Springer; 2020. [Google Scholar]
- Afyouni I, Al Aghbari Z, Razack RA. Multi-feature, multi-modal, and multi-source social event detection: a comprehensive survey. Inf Fusion. 2021 doi: 10.1016/j.inffus.2021.10.013. [DOI] [Google Scholar]
- Aggarwal CC, Philip SY, Han J, Wang J (2003) A framework for clustering evolving data streams. In: Proceedings 2003 VLDB conference, Elsevier, pp 81–92
- Ahmad P, Qamar S, Rizvi SQA. Techniques of data mining in healthcare: a review. Int J Comput Appl. 2015;120:38–50. [Google Scholar]
- Ahn H, Chang T-W. A similarity-based hierarchical clustering method for manufacturing process models. Sustainability. 2019;11:2560. [Google Scholar]
- Alelyani S, Tang J, Liu H. Data clustering: algorithms and applications. London: Chapman and Hal; 2013. Feature selection for clustering: a review; p. 29. [Google Scholar]
- Aljalbout E, Golkov V, Siddiqui Y, Strobel M, Cremers D (2018) Clustering with deep learning: taxonomy and new methods. arXiv preprint arXiv:1801.07648
- Almannaa MH, Elhenawy M, Rakha HA. A novel supervised clustering algorithm for transportation system applications. IEEE Trans Intell Transp Syst. 2020;21:222–232. [Google Scholar]
- Alsayat A, El-Sayed H (2016) Efficient genetic K-means clustering for health care knowledge discovery. In: 2016 IEEE 14th international conference on software engineering research, management and applications (SERA), IEEE, pp 45–52
- Ambigavathi M, Sridharan D (2020) Analysis of clustering algorithms in machine learning for healthcare data. In: International conference on advances in computing and data sciences, Springer, Singapore, pp 117–128
- Anand S, Padmanabham P, Govardhan A, Kulkarni RH. An extensive review on data mining methods and clustering models for intelligent transportation system. J Intell Syst. 2018;27:263–273. [Google Scholar]
- Andreopoulos B, An A, Wang X, Schroeder M. A roadmap of clustering algorithms: finding a match for a biomedical application. Brief Bioinform. 2009;10:297–314. doi: 10.1093/bib/bbn058. [DOI] [PubMed] [Google Scholar]
- Ansari MY, Ahmad A, Khan SS, Bhushan G. Spatiotemporal clustering: a review. Artif Intell Rev. 2019;53:2381–2423. [Google Scholar]
- Baadel S, Thabtah FA, Lu J (2016) Overlapping clustering: a review. In: 2016 SAI computing conference (SAI), IEEE, pp 233–237
- Baidari I, Patil C. A criterion for deciding the number of clusters in a dataset based on data depth. Vietnam J Comput Sci. 2020;7:417–431. [Google Scholar]
- Baker R. Data mining for education. Int Encycl Educ. 2010;7:112–118. [Google Scholar]
- Bandyopadhyay S, Saha S, Maulik U, Deb K. A simulated annealing-based multiobjective optimization algorithm: AMOSA. IEEE Trans Evol Comput. 2008;12:269–283. [Google Scholar]
- Banerjee A, Krumpelman C, Ghosh J, Basu S, Mooney RJ (2005) Model-based overlapping clustering. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining, pp 532–537
- Batet M, Valls A, Gibert K. International conference on artificial intelligence and soft computing. Berlin: Springer; 2010. Performance of ontology-based semantic similarities in clustering; pp. 281–288. [Google Scholar]
- Beltrán B, Vilariño D. Survey of overlapping clustering algorithms. Comput Sist. 2020;24:575–581. [Google Scholar]
- Bose I, Chen X. Detecting the migration of mobile service customers using fuzzy clustering. Inf Manage. 2015;52:227–238. [Google Scholar]
- Bouveyron C, Brunet-Saumard C. Model-based clustering of high-dimensional data: a review. Comput Stat Data Anal. 2014;71:52–78. [Google Scholar]
- Bulò SR, Pelillo M. Dominant-set clustering: a review. Eur J Oper Res. 2017;262:1–13. [Google Scholar]
- Calmon W, Albi M. Estimating the number of clusters in a ranking data context. Inf Sci. 2020;546:977–995. [Google Scholar]
- Campello RJ, Kröger P, Sander J, Zimek A. Density-based clustering. Wiley Interdiscip Rev: Data Min Knowl Discov. 2020;10:e1343. [Google Scholar]
- Celardo L, Everett MG. Network text analysis: a two-way classification approach. Int J Inf Manage. 2020;51:102009. [Google Scholar]
- Chan LM, Intner SS, Weihs J. Guide to the library of congress classification. Santa Barbara: ABC-CLIO; 2016. [Google Scholar]
- Chan SL, Lu Y, Wang Y. Data-driven cost estimation for additive manufacturing in cybermanufacturing. J Manuf Syst. 2018;46:115–126. [Google Scholar]
- Chandrasekharan MP, Rajagopalan R. An ideal seed non-hierarchical clustering algorithm for cellular manufacturing. Int J Prod Res. 1986;24:451–463. [Google Scholar]
- Charrad M, Ghazzali N, Boiteau V, Niknafs A (2015) Determining the best number of clusters in a data set. Recuperado de https://cran.rproject.org/web/packages/NbClust/NbClust.pdf
- Chattopadhyay A, Hassanzadeh P, Pasha S. Predicting clustered weather patterns: a test case for applications of convolutional neural networks to spatio-temporal climate data. Sci Rep. 2020;10:1–13. doi: 10.1038/s41598-020-57897-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen H, Yu Z, Yang Q, Shao J. Attributed graph clustering with subspace stochastic block model. Inf Sci. 2020;535:130–141. [Google Scholar]
- Cheng H, Hong SA, Ye X (2020) Clustering users of a social networking system based on user interactions with content items associated with a topic. Google Patents
- Citizenship C. SDGs & sectors: a review of the business opportunities. London: Corporate Citizenship; 2016. [Google Scholar]
- Connell SD, Jain AK (1998) Learning prototypes for online handwritten digits. In: Proceedings. Fourteenth international conference on pattern recognition (cat. no. 98EX170), IEEE, pp 182–184
- D’haeseleer P. How does gene expression clustering work? Nat Biotechnol. 2005;23:1499–1501. doi: 10.1038/nbt1205-1499. [DOI] [PubMed] [Google Scholar]
- Dalziel M, Yang X, Breslav S, Khan A, Luo J. Can we design an industry classification system that reflects industry architecture? J Enterp Transform. 2018;8:22–46. [Google Scholar]
- Das S, Das A, Bhattacharya D, Tibarewala D. A new graph-theoretic approach to determine the similarity of genome sequences based on nucleotide triplets. Genomics. 2020 doi: 10.1016/j.ygeno.2020.08.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- de Armiño CA, Manzanedo MÁ, Herrero Á. Analysing the intermeshed patterns of road transportation and macroeconomic indicators through neural and clustering techniques. Pattern Anal Appl. 2020;23:1059–1070. [Google Scholar]
- de Luca M, Mauro R, Russo F, Dell’Acqua G. Before-after freeway accident analysis using cluster algorithms. Procedia Soc Behav Sci. 2011;20:723–731. [Google Scholar]
- Delgoshaei A, Ali A. Evolution of clustering techniques in designing cellular manufacturing systems: a state-of-art review. Int J Ind Eng Comput. 2019;10:177–198. [Google Scholar]
- Delgoshaei A, Gomes C. A multi-layer perceptron for scheduling cellular manufacturing systems in the presence of unreliable machines and uncertain cost. Appl Soft Comput. 2016;49:27–55. [Google Scholar]
- Delgoshaei A, Aram AK, Ehsani S, Rezanoori A, Hanjani SE, Pakdel GH, Shirmohamdi F. A supervised method for scheduling multi-objective job shop systems in the presence of market uncertainties. RAIRO-Oper Res. 2021;55:S1165–S1193. [Google Scholar]
- Denoeux T. Calibrated model-based evidential clustering using bootstrapping. Inf Sci. 2020 doi: 10.1016/j.ins.2020.04.014. [DOI] [Google Scholar]
- Denoeux T, Kanjanatarakul O (2016) Evidential clustering: a review. In: International symposium on integrated uncertainty in knowledge modelling and decision making, Springer, Cham, pp 24–35
- Devolder P, Pynoo B, Sijnave B, Voet T, Duyck P. Framework for user acceptance: clustering for fine-grained results. Inf Manage. 2012;49:233–239. [Google Scholar]
- Dorai C, Jain AK (1995) Shape spectra based view grouping for free-form objects. In: Proceedings. International conference on image processing, IEEE, pp 340–343
- Du T, Wen G, Cai Z, Zheng W, Tan M, Li Y. Spectral clustering algorithm combining local covariance matrix with normalization. Neural Comput Appl. 2020;32:6611–6618. [Google Scholar]
- Durbin R, Eddy SR, Krogh A, Mitchison G. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press; 1998. [Google Scholar]
- Ezugwu AE, Ikotun AM, Oyelade OO, Abualigah L, Agushaka JO, Eke CI, Akinyelu AA. A comprehensive survey of clustering algorithms: state-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Eng Appl Artif Intell. 2022;110:104743. [Google Scholar]
- Fennell G, Allenby GM, Yang S, Edwards Y. The effectiveness of demographic and psychographic variables for explaining brand and product category use. Quant Mark Econ. 2003;1:223–244. [Google Scholar]
- Forsyth DA, Ponce J (2002) Computer vision: a modern approach. Prentice Hall professional technical reference
- Fu W, Perry PO. Estimating the number of clusters using cross-validation. J Comput Graph Stat. 2020;29:162–173. [Google Scholar]
- Gordon AD. Classification. Boca Raton: CRC Press; 1999. [Google Scholar]
- Govender P, Sivakumar V. Application of k-means and hierarchical clustering techniques for analysis of air pollution: a review (1980–2019) Atmos Pollut Res. 2020;11:40–56. [Google Scholar]
- Grant D, Yeo B. A global perspective on tech investment, financing, and ICT on manufacturing and service industry performance. Int J Inf Manage. 2018;43:130–145. [Google Scholar]
- Guleria P, Sood M. Intelligent data analysis: from data gathering to data comprehension. Hoboken: Wiley; 2020. Intelligent data analysis using Hadoop cluster-inspired mapreduce framework and association rule mining on educational domain. [Google Scholar]
- Han J, Pei J, Kamber M. Data mining: concepts and techniques. Amsterdam: Elsevier; 2011. [Google Scholar]
- Han J, Kamber M, Pei J. 10-Cluster analysis: Basic concepts and methods. Data mining. Burlington: Morgan Kaufmann; 2012. pp. 443–495. [Google Scholar]
- Hastie T, Tibshirani R, Friedman JH, Friedman JH. The elements of statistical learning: data mining, inference, and prediction. New York: Springer; 2009. [Google Scholar]
- He Y, Wu Y, Qin H, Huang JZ, Jin Y. Improved I-nice clustering algorithm based on density peaks mechanism. Inf Sci. 2020;548:177–190. [Google Scholar]
- Hedberg SR. Searching for the mother lode: tales of the first data miners. IEEE Expert. 1996;11:4–7. [Google Scholar]
- Hireche C, Drias H, Moulai H. Grid based clustering for satisfiability solving. Appl Soft Comput. 2020;88:106069. [Google Scholar]
- Hu W, Hu W, Xie N, Maybank S. Unsupervised active learning based on hierarchical graph-theoretic clustering. IEEE Trans Syst Man Cybern B. 2009;39:1147–1161. doi: 10.1109/TSMCB.2009.2013197. [DOI] [PubMed] [Google Scholar]
- Hu J, Pan Y, Li T, Yang Y. TW-Co-MFC: two-level weighted collaborative fuzzy clustering based on maximum entropy for multi-view data. Tsinghua Sci Technol. 2020;26:185–198. [Google Scholar]
- Huang Z. A fast clustering algorithm to cluster very large categorical data sets in data mining. DMKD. 1997;3:34–39. [Google Scholar]
- Huang X, Ye Y, Xiong L, Lau RY, Jiang N, Wang S. Time series k-means: a new k-means type smooth subspace clustering for time series data. Inf Sci. 2016;367:1–13. [Google Scholar]
- Hudson IL, Keatley MR, Lee SY. Using self-organising maps (SOMs) to assess synchronies: an application to historical eucalypt flowering records. Int J Biometeorol. 2011;55:879–904. doi: 10.1007/s00484-011-0427-4. [DOI] [PubMed] [Google Scholar]
- Izadkhah H, Tajgardan M. Information theoretic objective function for genetic software clustering. Multidiscip Digit Publ Inst Proc. 2019;46:18. [Google Scholar]
- Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surv (CSUR) 1999;31:264–323. [Google Scholar]
- Jakupović A, Pavlić M, Poščić P (2010) Business sectors and ERP solutions. In: Proceedings of the ITI 2010, 32nd international conference on information technology interfaces, IEEE, pp 477–482
- Jamali-Dinan S-S, Soltanian-Zadeh H, Bowyer SM, Almohri H, Dehghani H, Elisevich K, Nazem-Zadeh M-R. A combination of particle swarm optimization and minkowski weighted k-means clustering: application in lateralization of temporal lobe epilepsy. Brain Topogr. 2020 doi: 10.1007/s10548-020-00770-9. [DOI] [PubMed] [Google Scholar]
- James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning with applications in R. New York: Springer; 2015. [Google Scholar]
- Jiang D, Wu S, Chen G, Ooi BC, Tan K-L, Xu J. epiC: an extensible and scalable system for processing big data. VLDB J. 2016;25:3–26. [Google Scholar]
- Johnell C, Chehreghani MH (2020) Frank-wolfe optimization for dominant set clustering. arXiv preprint arXiv:2007.11652
- Johns H, Hearne J, Bernhardt J, Churilov L. Clustering clinical and health care processes using a novel measure of dissimilarity for variable-length sequences of ordinal states. Stat Methods Med Res. 2020;29:3059–3075. doi: 10.1177/0962280220917174. [DOI] [PubMed] [Google Scholar]
- Jothi N, Nur’aini Abdul Rashidb WH. Data mining in healthcare—a review. Procedia Comput Sci. 2015;72:306–313. [Google Scholar]
- Kalgotra P, Sharda R, Luse A. Which similarity measure to use in network analysis: Impact of sample size on phi correlation coefficient and Ochiai index. Int J Inf Manage. 2020;55:102229. [Google Scholar]
- Kao J-H, Chan T-C, Lai F, Lin B-C, Sun W-Z, Chang K-W, Leu F-Y, Lin J-W. Spatial analysis and data mining techniques for identifying risk factors of out-of-hospital cardiac arrest. Int J Inf Manage. 2017;37:1528–1538. [Google Scholar]
- Kaplan JM, Winther RG. Prisoners of abstraction? The theory and measure of genetic variation, and the very concept of “race”. Biol Theory. 2013;7:401–412. [Google Scholar]
- Kessira D, Kechadi M-T (2020) Multi-objective clustering algorithm with parallel games. In: 2020 international multi-conference on:“organization of knowledge and advanced technologies”(OCTA), IEEE, pp 1–7
- Khamparia A, Pande S, Gupta D, Khanna A, Sangaiah AK. Multi-level framework for anomaly detection in social networking. Libr Hi Tech. 2020 doi: 10.1108/LHT-01-2019-0023. [DOI] [Google Scholar]
- Khanmohammadi S, Adibeig N, Shanehbandy S. An improved overlapping k-means clustering method for medical applications. Expert Syst Appl. 2017;67:12–18. [Google Scholar]
- Khouja M, Booth DE. Fuzzy clustering procedure for evaluation and selection of industrial robots. J Manuf Syst. 1995;14:244–251. [Google Scholar]
- Kiang MY, Hu MY, Fisher DM. The effect of sample size on the extended self-organizing map network—a market segmentation application. Comput Stat Data Anal. 2007;51:5940–5948. [Google Scholar]
- Kohli S, Mehrotra S. A clustering approach for optimization of search result. J Images Graph. 2016;4:63–66. [Google Scholar]
- Lahat D, Adali T, Jutten C. Multimodal data fusion: an overview of methods, challenges, and prospects. Proc IEEE. 2015;103:1449–1477. [Google Scholar]
- Lam D, Wunsch DC. Academic Press library in signal processing. Amsterdam: Elsevier; 2014. Clustering. [Google Scholar]
- Landau S, Leese M, Stahl D, Everitt BS. Cluster analysis. Hoboken: Wiley; 2011. [Google Scholar]
- Lee Y-H, Hu PJ-H, Zhu H, Chen H-W. Discovering event episodes from sequences of online news articles: a time-adjoining frequent itemset-based clustering method. Inf Manage. 2020;57:103348. [Google Scholar]
- Lelieveld SH, Wiel L, Venselaar H, Pfundt R, Vriend G, Veltman JA, Brunner HG, Vissers LE, Gilissen C. Spatial clustering of de novo missense mutations identifies candidate neurodevelopmental disorder-associated genes. Am J Human Genet. 2017;101:478–484. doi: 10.1016/j.ajhg.2017.08.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li J, Wang Q. Multi-modal bioelectrical signal fusion analysis based on different acquisition devices and scene settings: overview, challenges, and novel orientation. Inf Fusion. 2021;79:229–247. [Google Scholar]
- Li D-C, Dai W-L, Tseng W-T. A two-stage clustering method to analyze customer characteristics to build discriminative customer management: a case of textile manufacturing business. Expert Syst Appl. 2011;38:7186–7191. [Google Scholar]
- Li W, Fu L, Niu B, Wu S, Wooley J. Ultrafast clustering algorithms for metagenomic sequence analysis. Brief Bioinform. 2012;13:656–668. doi: 10.1093/bib/bbs035. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li Q, Wang S, Zhao C, Zhao B, Yue X, Geng J. HIBOG: improving the clustering accuracy by ameliorating dataset with gravitation. Inf Sci. 2020;550:41–56. [Google Scholar]
- Li X, Liang W, Zhang X, Qing S, Chang P-C. A cluster validity evaluation method for dynamically determining the near-optimal number of clusters. Soft Comput. 2020;24:9227–9241. [Google Scholar]
- Liao TW. Clustering of time series data—a survey. Pattern Recogn. 2005;38:1857–1874. [Google Scholar]
- Lismont J, Vanthienen J, Baesens B, Lemahieu W. Defining analytics maturity indicators: a survey approach. Int J Inf Manage. 2017;37:114–124. [Google Scholar]
- Liu J, Chen Y. A personalized clustering-based and reliable trust-aware QoS prediction approach for cloud service recommendation in cloud manufacturing. Knowl-Based Syst. 2019;174:43–56. [Google Scholar]
- Liu Y, Li Z, Xiong H, Gao X, Wu J (2010) Understanding of internal clustering validation measures. In: 2010 IEEE international conference on data mining, IEEE, pp 911–916
- Liu Y, Jiang Y, Hou T, Liu F. A new robust fuzzy clustering validity index for imbalanced data sets. Inf Sci. 2020;547:579–591.
- Lu J, Gan A, Haleem K, Wu W. Clustering-based roadway segment division for the identification of high-crash locations. J Transp Saf Secur. 2013;5:224–239.
- Mahdi MA, Hosny KM, Elhenawy I. Scalable clustering algorithms for big data: a review. IEEE Access. 2021. doi: 10.1109/ACCESS.2021.3084057.
- Manogaran G, Lopez D. A survey of big data architectures and machine learning algorithms in healthcare. Int J Biomed Eng Technol. 2017;25:182–211.
- Marbac M, Sedki M, Patin T. Variable selection for mixed data clustering: application in human population genomics. J Classif. 2019;37:124–142.
- Masson M-H, Denoeux T. ECM: an evidential version of the fuzzy c-means algorithm. Pattern Recogn. 2008;41:1384–1397.
- Matula DW. Graph theoretic techniques for cluster analysis algorithms. In: Classification and clustering. Amsterdam: Elsevier; 1977.
- Mehrotra S, Kohli S. Application of clustering for improving search result of a website. In: Information systems design and intelligent applications. New Delhi: Springer; 2016.
- Min E, Guo X, Liu Q, Zhang G, Cui J, Long J. A survey of clustering with deep learning: from the perspective of network architecture. IEEE Access. 2018;6:39501–39514.
- Motiwalla LF, Albashrawi M, Kartal HB. Uncovering unobserved heterogeneity bias: measuring mobile banking system success. Int J Inf Manage. 2019;49:439–451.
- Motlagh O, Berry A, O'Neil L. Clustering of residential electricity customers using load time series. Appl Energy. 2019;237:11–24.
- Mourer A, Forest F, Lebbah M, Azzag H, Lacaille J (2020) Selecting the number of clusters K with a stability trade-off: an internal validation criterion. arXiv preprint arXiv:2006.08530
- N’cir C-EB, Cleuziou G, Essoussi N. Overview of overlapping partitional clustering methods. In: Partitional clustering algorithms. Cham: Springer; 2015.
- Naghieh E, Peng Y (2009) Microarray gene expression data mining: clustering analysis review. Department of Computing, pp 1–4
- Nakayama H, Kagaku N. Pattern classification by linear goal programming and its extensions. J Global Optim. 1998;12:111–126.
- Negara ES, Andryani R (2018) A review on overlapping and non-overlapping community detection algorithms for social network analytics. Far East J Electron Commun 18(1):1–27
- Nerurkar P, Shirke A, Chandane M, Bhirud S. Empirical analysis of data clustering algorithms. Procedia Comput Sci. 2018;125:770–779.
- Ng AY, Jordan MI, Weiss Y. On spectral clustering: analysis and an algorithm. In: Advances in neural information processing systems. Boston: MIT Press; 2002. pp 849–856.
- Nhamo G, Nhemachena C, Nhamo S. Using ICT indicators to measure readiness of countries to implement Industry 4.0 and the SDGs. Environ Econ Policy Stud. 2020;22:315–337.
- Ni J, Young T, Pandelea V, Xue F, Cambria E (2022) Recent advances in deep learning based dialogue systems: a systematic survey. Artif Intell Rev, pp 1–101
- Niwattanakul S, Singthongchai J, Naenudorn E, Wanapu S (2013) Using of Jaccard coefficient for keywords similarity. In: Proceedings of the international multiconference of engineers and computer scientists, pp 380–384
- Ogundele I, Popoola O, Oyesola O, Orija K (2018) A review on data mining in healthcare. Int J Adv Res Comput Eng Technol (IJARCET) 7:698–704
- Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Int J Surg. 2021;88:105906. doi: 10.1016/j.ijsu.2021.105906.
- Palanisamy V, Thirunavukarasu R. Implications of big data analytics in developing healthcare frameworks–a review. J King Saud Univ-Comput Inf Sci. 2019;31:415–425.
- Pandit S, Gupta S. A comparative study on distance measuring approaches for clustering. Int J Res Comput Sci. 2011;2:29–31.
- Parsons L, Haque E, Liu H. Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor Newsl. 2004;6(1):90–105.
- Pedrycz W. Collaborative fuzzy clustering. Pattern Recogn Lett. 2002;23:1675–1686.
- Pereira MM, Frazzon EM. A data-driven approach to adaptive synchronization of demand and supply in omni-channel retail supply chains. Int J Inf Manage. 2020;57:102165.
- Pérez-Suárez A, Martínez-Trinidad JF, Carrasco-Ochoa JA. A review of conceptual clustering algorithms. Artif Intell Rev. 2019;52:1267–1296.
- Petwal S, John KS, Vikas G, Rawat SS. Recommender system for analyzing students’ performance using data mining technique. In: Data science and security. Singapore: Springer; 2020.
- Piernik M, Brzezinski D, Morzy T, Lesniewska A. XML clustering: a review of structural approaches. Knowl Eng Rev. 2015;30:297–323.
- Pike M, Lintner BR. Application of clustering algorithms to TRMM precipitation over the tropical and south Pacific Ocean. J Clim. 2020;33:5767–5785.
- Qian G, Sural S, Gu Y, Pramanik S (2004) Similarity between Euclidean and cosine angle distance for nearest neighbor queries. In: Proceedings of the 2004 ACM symposium on applied computing, pp 1232–1237
- Rabbani M, Farrokhi-Asl H, Asgarian B. Solving a bi-objective location routing problem by a NSGA-II combined with clustering approach: application in waste collection problem. J Ind Eng Int. 2017;13:13–27.
- Rai A, Tang X, Brown P, Keil M. Assimilation patterns in the use of electronic procurement innovations: a cluster analysis. Inf Manage. 2006;43:336–349.
- Ramadan RA, Alhaisoni MM, Khedr AY. Multiobjective clustering algorithm for complex data in learning management systems. Complex Adapt Syst Model. 2020;8:1–14.
- Rappoport N, Shamir R. Multi-omic and multi-view clustering algorithms: review and cancer benchmark. bioRxiv. 2018. doi: 10.1093/nar/gky889.
- Rasmussen EM. Clustering algorithms. In: Information retrieval: data structures and algorithms. 1992. pp 419–442.
- Rathee A, Chhabra JK. Clustering for software remodularization by using structural, conceptual and evolutionary features. J Univers Comput Sci. 2018;24:1731–1757.
- Ray S, Turi RH (1999) Determination of number of clusters in k-means clustering and application in colour image segmentation. In: Proceedings of the 4th international conference on advances in pattern recognition and digital techniques, Calcutta, India, pp 137–143
- Rhodes JD, Cole WJ, Upshaw CR, Edgar TF, Webber ME. Clustering analysis of residential electricity demand profiles. Appl Energy. 2014;135:461–471.
- Rodriguez MZ, Comin CH, Casanova D, Bruno OM, Amancio DR, Costa LDF, Rodrigues FA. Clustering algorithms: a comparative approach. PLoS ONE. 2019;14:e0210236. doi: 10.1371/journal.pone.0210236.
- Rong W, Zhuo E, Peng H, Chen J, Wang H, Han C, Cai H. Learning a consensus affinity matrix for multi-view clustering via subspaces merging on Grassmann manifold. Inf Sci. 2020;547:68–87.
- Russom P (2011) Big data analytics. TDWI best practices report, fourth quarter 19:1–34
- Sabbagh R, Ameri F. A framework based on K-means clustering and topic modeling for analyzing unstructured manufacturing capability data. J Comput Inf Sci Eng. 2020;20:011005.
- Samoilenko S, Osei-Bryson K-M. Representation matters: an exploration of the socio-economic impacts of ICT-enabled public value in the context of sub-Saharan economies. Int J Inf Manage. 2019;49:69–85.
- Saxena A, Prasad M, Gupta A, Bharill N, Patel OP, Tiwari A, Er MJ, Ding W, Lin C-T. A review of clustering techniques and developments. Neurocomputing. 2017;267:664–681.
- Schwenker F, Trentin E. Pattern classification and clustering: a review of partially supervised learning approaches. Pattern Recogn Lett. 2014;37:4–14.
- Scott J, Carrington PJ. The SAGE handbook of social network analysis. Thousand Oaks: SAGE Publications; 2011.
- Sekula MN (2015) optCluster: an R package for determining the optimal clustering algorithm and optimal number of clusters. Electronic Theses and Dissertations, Paper 2147. https://doi.org/10.18297/etd/2147
- Sekula M, Datta S, Datta S. optCluster: an R package for determining the optimal clustering algorithm. Bioinformation. 2017;13:101. doi: 10.6026/97320630013101.
- Sfyridis A, Agnolucci P. Annual average daily traffic estimation in England and Wales: an application of clustering and regression modelling. J Transp Geogr. 2020;83:102658.
- Shafqat S, Kishwer S, Rasool RU, Qadir J, Amjad T, Ahmad HF. Big data analytics enhanced healthcare systems: a review. J Supercomput. 2020;76:1754–1799.
- Shamim G, Rihan M. Multi-domain feature extraction for improved clustering of smart meter data. Technol Econ Smart Grids Sustain Energy. 2020;5:1–8.
- Sharghi E, Nourani V, Soleimani S, Sadikoglu F. Application of different clustering approaches to hydroclimatological catchment regionalization in mountainous regions, a case study in Utah State. J Mt Sci. 2018;15:461–484.
- Sharma KK, Seal A. Multi-view spectral clustering for uncertain objects. Inf Sci. 2020;547:723–745.
- Shi L. Industrial symbiosis: context and relevance to the sustainable development goals (SDGs). In: Leal Filho W, Azul AM, Brandli L, Özuyar PG, Wall T, editors. Responsible consumption and production. Cham: Springer; 2020.
- Shiau W-L, Dwivedi YK, Yang HS. Co-citation and cluster analyses of extant literature on social networks. Int J Inf Manage. 2017;37:390–399.
- Shiau W-L, Yan C-M, Lin B-W. Exploration into the intellectual structure of mobile information systems. Int J Inf Manage. 2019;47:241–251.
- Shirkhorshidi AS, Aghabozorgi S, Wah TY, Herawan T (2014) Big data clustering: a review. In: International conference on computational science and its applications, Springer, Cham, pp 707–720
- Solorio-Fernández S, Carrasco-Ochoa JA, Martínez-Trinidad JF. A review of unsupervised feature selection methods. Artif Intell Rev. 2020;53:907–948.
- Song Z, Wang C, Bergmann L. China’s prefectural digital divide: spatial analysis and multivariate determinants of ICT diffusion. Int J Inf Manage. 2020;52:102072.
- Sprague LA, Oelsner GP, Argue DM. Challenges with secondary use of multi-source water-quality data in the United States. Water Res. 2017;110:252–261. doi: 10.1016/j.watres.2016.12.024.
- Subramaniyan M, Skoogh A, Muhammad AS, Bokrantz J, Johansson B, Roser C. A generic hierarchical clustering approach for detecting bottlenecks in manufacturing. J Manuf Syst. 2020;55:143–158.
- Suh JW, Sohn SY, Lee BK. Patent clustering and network analyses to explore nuclear waste management technologies. Energy Policy. 2020;146:111794.
- Tahmasebi P, Hezarkhani A, Sahimi M. Multiple-point geostatistical modeling based on the cross-correlation functions. Comput Geosci. 2012;16:779–797.
- Tanoto Y, Haghdadi N, Bruce A, Macgill I. Clustering based assessment of cost, security and environmental tradeoffs with possible future electricity generation portfolios. Appl Energy. 2020;270:115219.
- Thakur N, Mehrotra D, Bansal A, Bala M. Implementation of quasi-euclidean distance-based similarity model for retrieving information from OHSUMED dataset. In: Soft computing: theories and applications. Singapore: Springer; 2020.
- Tran TA. Effect of ship loading on marine diesel engine fuel consumption for bulk carriers based on the fuzzy clustering method. Ocean Eng. 2020;207:107383.
- Upton G, Fingleton B. Spatial data analysis by example. Volume 1: point pattern and quantitative data. Hoboken: Wiley; 1985.
- Uselton S, Ahrens J, Bethel W, Treinish L (1998) Multi-source data analysis challenges. Lawrence Berkeley National Lab. (LBNL), Berkeley
- Ushakov AV, Vasilyev I. Near-optimal large-scale k-medoids clustering. Inf Sci. 2020;545:344–362.
- Valls A, Gibert K, Orellana A, Antón-Clavé S. Using ontology-based clustering to understand the push and pull factors for British tourists visiting a Mediterranean coastal destination. Inf Manage. 2018;55:145–159.
- Vialetto G, Noro M. An innovative approach to design cogeneration systems based on big data analysis and use of clustering methods. Energy Convers Manage. 2020;214:112901.
- Wang X, Wang H. Driving behavior clustering for hazardous material transportation based on genetic fuzzy C-means algorithm. IEEE Access. 2020;8:11289–11296.
- Wang Q, Yang X. Investigating the sustainability of renewable energy–an empirical analysis of European Union countries using a hybrid of projection pursuit fuzzy clustering model and accelerated genetic algorithm based on real coding. J Clean Prod. 2020;268:121940.
- Wang W, Yang J, Muntz R. STING: a statistical information grid approach to spatial data mining. VLDB. 1997;97:186–195.
- Xie J, Kelley S, Szymanski BK. Overlapping community detection in networks: the state-of-the-art and comparative study. ACM Comput Surv (CSUR). 2013;45:1–35.
- Xie W-B, Lee Y-L, Wang C, Chen D-B, Zhou T. Hierarchical clustering supported by reciprocal nearest neighbors. Inf Sci. 2020. doi: 10.1016/j.ins.2020.04.016.
- Xu R, Wunsch D. Survey of clustering algorithms. IEEE Trans Neural Netw. 2005;16:645–678. doi: 10.1109/TNN.2005.845141.
- Xu R, Wunsch DC. Clustering algorithms in biomedical research: a review. IEEE Rev Biomed Eng. 2010;3:120–154. doi: 10.1109/RBME.2010.2083647.
- Xu X, Qian H, Ge C, Lin Z. Industry classification with online resume big data: a design science approach. Inf Manage. 2020;57:103182.
- Ye J. Cosine similarity measures for intuitionistic fuzzy sets and their applications. Math Comput Model. 2011;53:91–97.
- Yin L. Intelligent clustering evaluation of marine equipment manufacturing based on network connection strength. J Coast Res. 2020;103:900–904.
- Yoo I, Alafaireet P, Marinov M, Pena-Hernandez K, Gopidi R, Chang J-F, Hua L. Data mining in healthcare and biomedicine: a survey of the literature. J Med Syst. 2012;36:2431–2448. doi: 10.1007/s10916-011-9710-5.
- Zhang K, Collins EG, Barbu A. An efficient stochastic clustering auction for heterogeneous robotic collaborative teams. J Intell Rob Syst. 2013;72:541–558.
- Zhang X, Sun Y, Liu H, Hou Z, Zhao F, Zhang C. Improved clustering algorithms for image segmentation based on non-local information and back projection. Inf Sci. 2020. doi: 10.1016/j.ins.2020.10.039.
- Zhao K, Jiang Y, Xia K, Zhou L, Chen Y, Xu K, Qian P. View-collaborative fuzzy soft subspace clustering for automatic medical image segmentation. Multimed Tools Appl. 2020;79:9523–9542.
- Zhu Q, Zhang F, Liu S, Li Y. An anticrime information support system design: application of K-means-VMD-BiGRU in the city of Chicago. Inf Manage. 2019;59:103247.
Associated Data
Data Availability Statement
Data sharing not applicable to this article as no datasets were generated or analysed during the current study.


