Exploring online public survey lifestyle datasets with statistical analysis, machine learning and semantic ontology

2024 Oct 15;14:24190. doi: 10.1038/s41598-024-74539-6

Pseudocode: “K”-value determination based on clustering for data labeling

Step-1: Define input parameters - data, max_clusters = 10, scaling ∈ {True, False}, visualization ∈ {True, False}, and metric = 'euclidean'

Step-2: Define list - n_clusters_list, silhouette_list

Step-3: if (scaling == True) then

scalar = convert_to_min_max (data)

else

scalar = data

Step-4: For n_c = 2 to max_clusters (inclusive) do

kmeans_model = KMeans(n_clusters = n_c).fit(scalar)

labels = find_labels(kmeans_model)

n_clusters_list.append(n_c)

silhouette_list.append(silhouette_score(scalar, labels, metric = metric))

End

Step-5: Cross-verify the “K” value with the Elbow method.

Step-6: Find the best parameters based on defined lists

Step-7: Perform data labeling with the best model

Step-8: Visualize the best clustering, i.e. the one corresponding to the selected number of clusters (n_c) and its silhouette score.
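The steps above can be sketched in Python with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `find_best_k` and `plot_scores`, the `random_state` and `n_init` settings, and the use of `MinMaxScaler` for `convert_to_min_max` and of `model.labels_` for `find_labels` are all choices made here. The pseudocode does not say how the Elbow cross-check in Step-5 is carried out, so the sketch only records the K-means inertia for each K so that the curve can be inspected visually.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler


def find_best_k(data, max_clusters=10, scaling=True, metric="euclidean",
                random_state=0):
    """Sweep K from 2 to max_clusters and pick the K with the best silhouette.

    Hypothetical helper: name, return layout, and random_state are assumptions.
    """
    # Step-3: optional min-max scaling of every feature to [0, 1].
    X = MinMaxScaler().fit_transform(data) if scaling else np.asarray(data, dtype=float)

    # Step-2: lists that collect the result of each candidate K.
    n_clusters_list, silhouette_list, inertia_list, models = [], [], [], []

    # Step-4: fit one K-means model per candidate K and score it.
    for n_c in range(2, max_clusters + 1):
        model = KMeans(n_clusters=n_c, n_init=10, random_state=random_state).fit(X)
        n_clusters_list.append(n_c)
        silhouette_list.append(silhouette_score(X, model.labels_, metric=metric))
        # Step-5 input: within-cluster sum of squares, used for the elbow curve.
        inertia_list.append(model.inertia_)
        models.append(model)

    # Step-6: the best K is the one with the highest silhouette score.
    best = int(np.argmax(silhouette_list))

    # Step-7: the labels of the best model are the data labels.
    return {
        "best_k": n_clusters_list[best],
        "labels": models[best].labels_,
        "n_clusters_list": n_clusters_list,
        "silhouette_list": silhouette_list,
        "inertia_list": inertia_list,
    }


def plot_scores(result):
    """Step-8: plot silhouette score and inertia against the number of clusters."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, (ax_sil, ax_elbow) = plt.subplots(1, 2, figsize=(9, 3.5))
    ax_sil.plot(result["n_clusters_list"], result["silhouette_list"], marker="o")
    ax_sil.axvline(result["best_k"], linestyle="--")  # mark the selected K
    ax_sil.set_xlabel("Number of clusters (n_c)")
    ax_sil.set_ylabel("Silhouette score")
    ax_elbow.plot(result["n_clusters_list"], result["inertia_list"], marker="o")
    ax_elbow.set_xlabel("Number of clusters (n_c)")
    ax_elbow.set_ylabel("Inertia (within-cluster SSE)")
    fig.tight_layout()
    return fig


if __name__ == "__main__":
    from sklearn.datasets import make_blobs

    # Synthetic demo data, not the survey datasets from the paper.
    X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                           cluster_std=0.5, random_state=0)
    result = find_best_k(X_demo, max_clusters=6)
    print("Selected K:", result["best_k"])
```

Keeping the fitted models in a list avoids refitting K-means once the best K is known. If the silhouette maximum and the bend in the inertia curve disagree, the pseudocode gives no rule for resolving the conflict, so that decision is left to the analyst.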