Soft Clustering for Enhancing the Diagnosis of Chronic Diseases over Machine Learning Algorithms

Theyazn H H Aldhyani; Ali Saleh Alshebami; Mohammed Y Alzahrani

doi:10.1155/2020/4984967

. 2020 Mar 9;2020:4984967. doi: 10.1155/2020/4984967

Soft Clustering for Enhancing the Diagnosis of Chronic Diseases over Machine Learning Algorithms

Theyazn H H Aldhyani ^1,^✉, Ali Saleh Alshebami ², Mohammed Y Alzahrani ³

PMCID: PMC7085388 PMID: 32211144

Abstract

Chronic diseases represent a serious threat to public health across the world. It is estimated at about 60% of all deaths worldwide and approximately 43% of the global burden of chronic diseases. Thus, the analysis of the healthcare data has helped health officials, patients, and healthcare communities to perform early detection for those diseases. Extracting the patterns from healthcare data has helped the healthcare communities to obtain complete medical data for the purpose of diagnosis. The objective of the present research work is presented to improve the surveillance detection system for chronic diseases, which is used for the protection of people's lives. For this purpose, the proposed system has been developed to enhance the detection of chronic disease by using machine learning algorithms. The standard data related to chronic diseases have been collected from various worldwide resources. In healthcare data, special chronic diseases include ambiguous objects of the class. Therefore, the presence of ambiguous objects shows the availability of traits involving two or more classes, which reduces the accuracy of the machine learning algorithms. The novelty of the current research work lies in the assumption that demonstrates the noncrisp Rough K-means (RKM) clustering for figuring out the ambiguity in chronic disease dataset to improve the performance of the system. The RKM algorithm has clustered data into two sets, namely, the upper approximation and lower approximation. The objects belonging to the upper approximation are favourable objects, whereas the ones belonging to the lower approximation are excluded and identified as ambiguous. These ambiguous objects have been excluded to improve the machine learning algorithms. The machine learning algorithms, namely, naïve Bayes (NB), support vector machine (SVM), K-nearest neighbors (KNN), and random forest tree, are presented and compared. The chronic disease data are obtained from the machine learning repository and Kaggle to test and evaluate the proposed model. The experimental results demonstrate that the proposed system is successfully employed for the diagnosis of chronic diseases. The proposed model achieved the best results with naive Bayes with RKM for the classification of diabetic disease (80.55%), whereas SVM with RKM for the classification of kidney disease achieved 100% and SVM with RKM for the classification of cancer disease achieved 97.53 with respect to accuracy metric. The performance measures, such as accuracy, sensitivity, specificity, precision, and F-score, are employed to evaluate the performance of the proposed system. Furthermore, evaluation and comparison of the proposed system with the existing machine learning algorithms are presented. Finally, the proposed system has enhanced the performance of machine learning algorithms.

1. Introduction

Chronic diseases are serious diseases because they pose a serious threat to people's lives and persist over long periods. They can impede the freedom and health of people who have physical disabilities. Thus, they further cause frustration of people who suffer from various health disabilities. The available vaccines and medicine cannot completely prevent chronic diseases because they show no indications in any case. With aging changes, chronic diseases continue to become a more common phenomenon. Hence, there is a need to identify factors causing them and to take the required corrective measures accordingly. Factors such as smoking, physical inactivity, food diet, and insufficient or excessive alcohol consumption could largely contribute to chronic diseases. Previous studies have identified chronic diseases as the seventh cause of death among other causes. In the United States, they resulted in 65.8% of deaths among US males and 67.2% of deaths among US females in 2010 [1]. Heart, cancer, diabetes, asthma, and kidney diseases are identified as chronic diseases. In addition, chronic diseases are measured as noncommunicable diseases; they slowly end the life of people in a long period. Chronic diseases do not transfer from one person to another. In the United States, chronic diseases drive up medicinal service expenses and break up human services reasonably. They possess an essential part of the economy and thwart the health quality of people. This study promotes the classification of chronic disease conditions, namely, cancer, kidney, asthma, and diabetes. The World Health Organization reported in 2002 that mortality, dreariness, and incapacity were credited to the major chronic diseases. Currently, records show that 60% of all deaths and 43% of the global weight of illnesses are attributed to chronic diseases. By 2020, it is expected that the percentage of deaths will reach 73% of total deaths and 60% of the global weight of sicknesses [2]. With the help of machine learning algorithms, predicting chronic diseases has become an easy task. Therefore, it is aimed here to develop a surveillance system to predict and diagnose healthcare data for helping the health communities.

The machine learning algorithms increase the level of individual prediction, and therefore diagnosis will ultimately strengthen anticipation efforts. The availability of well prevention measures will not only enhance or provide good health for persons but also reduce healthcare spending. Machine learning techniques have been used widely in the healthcare domain. They have now become crucial tools for healthcare management. They have also assisted in improving health care by using the prediction measures for epidemic diseases faced by people around the world. The World Health Organization (WHO) has significantly benefited from the employment of machine learning applications that improve the quality of health care.

Machine learning algorithms are considered to be classification, clustering, and prediction for the sake of solving various issues in real-time applications. They provide an assurance of the classification and prediction solutions for stability and reliability in performance. Based on machine learning algorithms, a few researchers have developed successful healthcare systems. Algorithms include statistics, SVM, decision trees, clustering, and optimization algorithms and others. Machine learning applications rely largely on datasets that analyze and discover the patterns that are used to solve specific tasks. The healthcare system has the potential promotion in the health domain to extract and discover the hidden patterns in the database [3]. Thus, the available healthcare data are universally scattered and ambiguous. They may also contain insufficient and insignificant information stored in terms of the constancy in prediction and classification. One of the biggest challenges of healthcare data and its information is the accurate diagnosis of certain significant information. To predict and analyze the chronic diseases such as kidney, diabetic, cancer, and heart diseases, there are several proposed machine learning algorithms that can be used. These algorithms include the decision tree (DT), SVM, ANN, linear regression (LR), KNN, NB, and time series prediction models. Because of the rapid innovation and continuous changes in software engineering, a huge volume of information can be generated. With the development of a healthcare database management system, there will be more opportunities for the enhancement of the healthcare systems. Extracting patterns from these datasets and managing large amounts of dimensionality data have become a major field of machine learning. The machine learning algorithm is considered to be the classification of healthcare datasets to obtain useful knowledge that can help health officials and communities. To apply machine learning algorithms that enhance the performance of the classification process, the preprocessing of the soft clustering algorithm is required.

The remaining parts of the article are organized into sections. Introduction is discussed in Section 1, related studies are given in Section 2, data and methods are shown in Section 3, and results and discussion are shown in Section 4. Lastly, conclusion is presented in Section 5.

2. Related Studies

There is a considerable number of research works that have been done in relation to the classification and the prediction of healthcare data. Solanki [4] proposed most of the classifier algorithms on the Weka tool for predicting the prevalent sickle cell disease. The obtained results compared with the classifiers are available on the Weka data mining tool. It is observed that the random tree approach is a better algorithm for classifying sickle cell. Similarly, Joshi et al. [5] used a number of machine learning approaches such as Bayes net, logistic model tree (LMT), multilayer perception, stochastic gradient descent, and sequential minimal optimization techniques. These researchers suggested using LMT algorithms for diagnosing breast cancer because of its high performance and accuracy. Furthermore, David et al. [6] applied the KNN algorithm, Bayesian network, decision tree algorithm, and random tree method, namely, the J48 tree to predict leukemia disease. Accordingly, it was found that the decision tree algorithm had shown better accuracy in the result. In one more study conducted by Vijayarani and Sudha [7], LMT and the sequential minimal optimization multilayer, and perceptron algorithms are employed to predict heart diseases. Furthermore, the study conducted by Sugandhi et al. [8] proposed a random tree algorithm for the classification of heart diseases. The outcome of the research has shown that random tree gives a better performance than other classification algorithms. Consequently, for obtaining results from a random tree classifier, it is found that the random tree classifier is outperformed. The study of Yasodha and Kannan [9] has also reported that the Weka classification algorithm was used for analyzing and predicting the diabetic patient's database. Likewise, Bin Othman and Yau [10] have compared various classification approaches with the Weka data mining tool for predicting breast cancer. Israa [11] has applied NB, decision tree (DT), random forest, and support vector machine techniques to improve the classification of heart diseases. On the other hand, D. Sisodia and Sisodia [12] have proposed three machine learning algorithms; that is, DT, SVM, and naive Bayes (NB) to detect diabetes. Thus, experiments have been done by using standard data from the UCI machine learning repository. It is observed that the NB approach outperforms as compared with other algorithms, which have an accuracy rate of 76.30%. The study of Syed et al. [13] has employed SVM, Bayesian network, and decision tree algorithms to predict the obesity of schoolchildren. Sandeep et al. [14] proposed linear discriminate analysis (LDA), NB, random forest, LR, and quadratic discriminate analysis (QDA) for the analysis and classification of chronic kidney diseases. The study of Sahana and Minavathi [15] has also focused on predicting kidney disease using classification algorithms, namely, ANN and C45. It concentrated on accurate prediction and time factor performance. The ANN and C45 algorithms are used for helping out the medical practitioner to give proper medication and medical treatment. K. Polaraju and Prasad [16] have proposed a multiple regression model to classify chronic heart disease. It is proved that the multiple linear regression model is favourable for predicting heart diseases.

In this research, the training dataset consists of 3000 values with 13 different features. From the experimental results, it is shown that the regression algorithm performs better than other algorithms. Kim et al. [17] have proposed the character-recurrent neural network (Char-RNN) model to predict chronic diseases. They have collected data from the Korean National Health and Nutrition Examination Survey (KNHANES). It is observed that the Char-RNN model obtained higher accuracy than the conventional multilayer perceptron model. Ng et al. [18] have used machine learning algorithms to detect heart failure. Moreover, electronic health record data are used to predict events and the onset of diseases. Zhang et al. [19] have proposed a convolution neural network (CNN) architecture named Group Net to predict chronic diseases. Thus, the experimental analysis is conducted using data from local medical centers. They have noted that CNN has achieved the best accuracy. Kriplani et al. [20] have used deep learning to predict chronic kidney disease. The proposed models are tested by using standard datasets of diseases available on the UCI. The analysis results have appeared better in using cross-validation performance. Liu et al. [21] proposed CNN, LSTM, and hierarchical models to predict chronic diseases. Brisimi et al. [22] have applied four machine learning algorithms, namely, SVM, kernelized, sparse logistic regression, and random forests to predict chronic heart and diabetic diseases. They have gathered standard data from electronic health records (EHRs). Chen et al. [23] used streamline machine learning techniques to predict chronic disease epidemics. Their experiment has proposed prediction models using real-life hospital data gathered from central China in 2013–2015. The convolution neural network is implemented as based on multimodal disease risk prediction (CNN-MDRP) algorithm using structured and unstructured data from hospitals. Patel et al. [24] have developed a system using three classifiers such as KStar, SMO, and J48, Bayes net, and multilayer perception neural network algorithms with the help of Weka software to classify heart diseases. It is observed that the Bayes net has accomplished optimum performance as compared with further classification algorithms, namely, KStar, multilayer perception, and J48 approaches by using the k-fold cross-validation method. The research of Deepika and Seema [25] also designed a system to predict chronic diseases via machine learning algorithms such as naïve Bayes, decision tree, SVM, and ANN. A comparative analysis of the performances of algorithms is presented. It is observed that the support of the vector machine and the naïve Bayes provides the highest accuracy rate when predicting the diabetic disease. Ul Haq et al. [26] have suggested different machine learning algorithms such as naive Bayes, classification tree, KNN, logistic regression, SVM, and ANN to predict heart diseases. Three feature selection methods are applied to improve the classification algorithms. It is concluded that the feature selection method increases the performance of the classifier for predicting heart diseases. Ahmed et al. [27] proposed a fuzzy logic algorithm to classify kidney diseases. Coacci et al. [28] used two classification approaches, namely, logistic regression and ANN. Xun et al. [29] presented ANN and naïve Bayes classifiers for predicting chronic diseases. Some researchers used the UCI machine learning repository for testing proposed models, such as chronic diseases [30, 31], diabetic disease [32, 33], and breast cancer [34]. Using the deep learning algorithm to predict chronic diseases, Kim et al. [17] proposed nature-inspired computing algorithms for the diagnosis of chronic diseases [35] and employed machine learning algorithm to develop E-health for the diagnosis of chronic diseases [36].

In the current research article, traditional machine learning algorithms are employed for predicting chronic diseases. Therefore, the result of the existing classification algorithms is needed to make the healthcare system more reliable. Subsequently, the soft clustering algorithm is applied to increase the accuracy of classification algorithms.

3. Materials and Methods

The proposed model is designed explicitly to classify chronic diseases using machine learning algorithms. Figure 1 displays the proposed system that combines the existing machine learning algorithms with the rough k-means clustering technique. Noncrisp rough k-means algorithm is demonstrated to handle the ambiguous objects. These ambiguous objects obstruct the performance of machine learning algorithms. The RKM clustering has clustered data into two clusters. Thus, it is used to measure the roughness of the objects. Moreover, the threshold value is also used to maximize the roughness of objects for reducing the ambiguous objects. The threshold value parameter plays a very significant role in making the program of the noncrisp algorithm. It has experimented and found out that the threshold value is 1.4. The RKM algorithm is used to deal with ambiguous objects for improving the classification algorithm. The RKM algorithm has clustered data into lower approximation and upper approximation in which the clustered objects in the lower approximation are considered, but the objects clustered in upper approximation are excluded. The novelty of the proposed model has used rough k-means to handle the ambiguous objects belonging to lower approximation that is processed with the help of machine learning algorithms. The rough k-means clustering is proposed to explicitly determine ambiguous objects. To close, it is investigated that the results of the proposed system have outperformed all the alternative models used for measuring the performance. The detailed description of the proposed system is discussed in the following subsections.

3.1. Datasets

The chronic disease datasets have been collected from the different resources as follows:

3.1.1. Diabetic Disease Dataset

The diabetes data collected from the machine learning repository contained nine attributes, eight features, and one class. This dataset has been gathered from an automatic electronic recording device and paper records [37]. Table 1 shows the features of data.

Table 1.

Features of diabetic dataset.

Feature name	Categories
Pregnancies	Numeric
Glucose	Numeric
Blood pressure	Numeric
Skin thickness	Numeric
Insulin	Numeric
BMI	Numeric
Diabetes pedigree function	Numeric
Age	Numeric
Class	Nominal, diabetics, not diabetics (0, 1)

Open in a new tab

3.1.2. Breast Cancer Disease Dataset

The cancer data collected from the Kaggle contained nine attributes, 30 features, and 1 class. These features are obtained from digitized images of breast cancer [38]. Table 2 shows the features of data.

Table 2.

Features of cancer dataset.

Feature name	Categories
Id	Numeric
Radius mean	Numeric
Texture_mean	Numeric
Perimeter_mean	Numeric
Area_mean	Numeric
Smoothness_mean	Numeric
Compactness_mean	Numeric
Concavity_mean	Numeric
CONCAVE points_mean	Numeric
Symmetry_mean	Numeric
Fractal_dimension_mean	Numeric
Radius_se	Numeric
Texture_se	Numeric
Perimeter_se	Numeric
Area_se	Numeric
Smoothness_se	Numeric
Compactness_se	Numeric
Concavity_se	Numeric
Concave points_se	Numeric
Symmetry_se	Numeric
Fractal_dimension_se	Numeric
Radius_worst	Numeric
Texture_worst	Numeric
Perimeter_worst	Numeric
Area_worst	Numeric
Smoothness_worst	Numeric
Compactness_worst	Numeric
Concavity_worst	Numeric
Concave points_worst	Numeric
Symmetry_worst	Numeric
Fractal_dimension_worst	Numeric
Diagnosis	Nominal (M = malignant and B = benign)

Open in a new tab

3.1.3. Kidney Disease Dataset

The collected kidney data from the Kaggle contained 26 attributes, 24 features, and 1 class [38]. Table 3 shows the features of data.

Table 3.

Features of kidney dataset.

Feature name	Categories
Age	Numeric
bp: blood pressure	Numeric
sg: specific gravity	Nominal: 1.005, 1.010, 1.015, 1.020, 1.025
al: albumin	Nominal: 0, 1, 2, 3, 4, 5
su: sugar	Nominal: 0, 1, 2, 3, 4, 5
rbc: red blood cells	Nominal: 0, 1
pc: pus cell	Nominal: 0, 1
pcc: pus cell clumps	Nominal: 0, 1
ba: bacteria	Nominal: 0, 1
bgr: blood glucose random	Numeric
bu: blood urea	Numeric
sc: serum creatinine	Numeric
sod: sodium	Numeric
pot: potassium	Numeric
hemo: hemoglobin	Numeric
pcv: packed cell volume	Numeric
wc: white blood cell count	Numeric
rc: red blood cell count	Numeric
htn: hypertension	Nominal
dm: diabetes mellitus	Nominal
cad: coronary artery disease	Nominal
appet: appetite	Nominal
pe: pedal edema	Nominal
ane: anemia	Nominal
class: class	Nominal: CKD, not CKD

Open in a new tab

3.2. Handling Ambiguity

Machine learning algorithms have succeeded in a number of real-time applications such as image processing recognition, video recognition, marketing prediction, weather forecasting, and network security. The conventional machine learning algorithms are used to identify the objects belonging to exactly one class. In data analysis, it may be possible that an object shows the characteristics of different classes [39]. In that event, an object should belong to more than one class, and as a result, object boundaries should necessarily overlap. The machine learning algorithms categorize an object into one class precisely. Figure 2 shows the ambiguous data. Such requirement is found to be too restrictive in a number of real-time applications.

In Figure 2, the basic example of ambiguous objects can be noted. It clearly shows the three separate classes. Hence, it is observed that five objects are not classified under any precise class. Henceforth, these five objects decrease the performance of machine learning algorithms. Thus, it is required to determine such ambiguous objects and deal with them before applying machine learning algorithms. For this issue, the present research work applies the RKM technique to recognize ambiguous packets from chronic disease datasets. The detailed description of the RKM algorithm employed for identifying the ambiguous objects is presented in the subsequent subsections.

3.2.1. Rough K-Means Clustering Algorithm

The proposed RKM clustering approach is based on a simple K-means clustering [40–42]. Peters [43] enhanced the algorithm of [40] (original proposal) by calculating rough centroid using ratios of distances as new proposals to differentiate between similar distances. Joshi and Lingras [44] used RKM and ECM clustering algorithms to handle high dimensional data. Aldhyani and Joshi [39] used the rough K-means and ECM clustering algorithms to handle ambiguous objects of intrusion detection. The rough K-means approach is designed to determine the ambiguous objects that belong to the upper boundary of clusters. Cluster the data as lower approximation and upper approximation. The rough K-Means represents each.

(P1) An object $\overset{⟶}{x}$ can be part of, at most, one lower approximation (lower bound)
(P2) $\overset{⇀}{x}$ ∈ $\underline{A} ({\overset{⟶}{c}}_{i})$ = ⇒ $\overset{⟶}{c}$ ∈ $\bar{A} ({\overset{⟶}{c}}_{i})$
(P3) An object $\overset{⟶}{x}$ is not part of any lower approximation
⇕
$\overset{⟶}{x}$ belongs to two or more upper approximations (upper bound)

Overall, ideas of soft clustering are more appropriate to deal with ambiguous objects. When the algorithm is processed, all objects are assigned w_lower and w_upper. For each object vector, $\overset{⟶}{v}$ let d ( $\overset{⟶}{v}$ , ${\overset{⟶}{c}}_{j}$ ) be the distance between itself and the centroid of cluster ${\overset{⟶}{c}}_{j}$ . Let d ( $\overset{⟶}{v}$ , ${\overset{⟶}{c}}_{i}$ ) = min 1 ≤ j ≤ k d ( $\overset{⟶}{v}$ , ${\overset{⟶}{c}}_{j}$ ). The ratios d ( $\overset{⟶}{v}$ , ${\overset{⟶}{c}}_{j}$ )/d ( $\overset{⟶}{v}$ , ${\overset{⟶}{c}}_{j}$ ), 1 ≤ I, j ≤ k, are used to determine the membership of $\overset{⟶}{v}$ . Let T = {j : d ( $\overset{⟶}{v}$ , ${\overset{⟶}{c}}_{j}$ )/d ( $\overset{⟶}{v}$ , ${\overset{⟶}{c}}_{j}$ ) ≥ threshold and i ≠ j}.

If T = ϕ, $\overset{⟶}{v}$ ∈ A ( ${\overset{⟶}{c}}_{j}$ ) and $\overset{⟶}{v}$ ∈ A ( ${\overset{⟶}{c}}_{j}$ ), ∀j ∈ T. Furthermore, $\overset{⟶}{v}$ is not part of any lower approximation? The above criterion guarantees that property (P3) is satisfied.
Otherwise, if T = ϕ, $\overset{⟶}{v}$ ∈ A ( ${\overset{⟶}{c}}_{j}$ ). In addition, by property (P2), $\overset{⟶}{v}$ ∈ A ( ${\overset{⟶}{c}}_{j}$ ).

The rough k-means algorithm has stability and reliability for handling ambiguity. The rough k-means algorithm has clustered objects into lower bound and upper bound. The objects in the upper bound are ambiguous objects, whereas the objects in the lower bound are correct objects. The upper bound should not be empty, and the objects in the upper bound can belong to one or more upper bounds in the cluster numbers. Figure 3 shows a snapshot of output obtained from the RKM algorithm to determine the ambiguous objects for improving the performance of machine learning algorithms. The objects in lower bound are correction objects, whereas the objects on boundary bound are ambiguous objects.

3.3. Classification Algorithms

In this section, conventional machine learning algorithms are discussed. The automatic classification, namely, naive Bayes (NB), support vector machine (SVM), K-nearest neighbor (KNN), and random forest tree, are presented to predict chronic diseases for enhancing healthcare systems.

3.3.1. Support Vector Machine Algorithm

The support vector machine is used to analyze data as classification and regression. In the SVM algorithm, the data point is considered as n-dimensional space where there are a number of features of data, and the values of features are the values of a specific coordinate. The classification of data is achieved by finding the best difference between the classes of data using hyperplane. A support vector machine algorithm classifies data by separating the hyperactive plane of label training data. The SVM obtains lower error when the margin is large. In the present research work, two classes of chronic diseases are used. All types of kernel functions are applied to classify the chronic disease datasets in which radial basis function (RBF), along with kernel function, obtain high accuracy. The kernel function is applied and observed that the RBF function and kernel function are appropriate with the RKM algorithm to obtain good accuracy.

\begin{matrix} k (x, \bar{x}) = \exp (- \frac{‖x - \bar{x}‖}{2 σ^{2}}), \end{matrix}

(1)

where ${‖x - \bar{x}‖}^{2}$ is the square Euclidean distance between two feature vectors and σ is a parameter.

3.3.2. Naïve Bayes Algorithm

Naive Bayes algorithm is defined as a probabilistic method used to classify the dataset based on the well-known Bayes theorem of probability. The naïve Bayes classification algorithm works as prior probability, posterior probability, likelihood probability, and evidence probability. It normally uses probability distributions. The working Bayesian algorithm is as follows: Assume A = A₁, A₂, A₃,…, A_n is regarded as the feature vector of chronic disease features, and the values of the features are A₁, A₂, A₃,…, A_n and are considered as a number of features in the dataset. C indicates a class of chronic data as normal and abnormal. The Bayes equation is shown as follows:

Conditional probability:

\begin{matrix} P (C| A) = \frac{P (A, C)}{P (A)}, \end{matrix}

(2)

\begin{matrix} P (C| A) = \frac{P (A, C)}{P (c)} . \end{matrix}

(3)

It is assumed that the predictor A on the given class c is independent of the values of other predictors, and it is known as conditional class independence. P(C|A) is the posterior probability of class c, given predictor (feature).

Theorem is as follows:

\begin{matrix} P (C| A) = \frac{P (C) . P (A| C)}{P (A)} . \end{matrix}

(4)

3.3.3. K-Nearest Neighbors Algorithm

A K-nearest neighbors algorithm is a simple machine learning algorithm, which uses the entire dataset in its training phase. KNN algorithm has low complexity in programming and implementation. The basic idea can be presented in a sample space when its nearest neighboring features belong to a category, and then the features belong to the same category. The KNN classification algorithm can be used with either a single or a multidimensional feature dataset and can find the closest features. It employs the Euclidean distance method for finding the closest point among the features.

\begin{matrix} D = \sqrt{(x_{1} - x_{2}) + {(y_{1} - y_{2})}^{2}} . \end{matrix}

(5)

3.3.4. Random Forest Algorithm

A decision tree algorithm is one of the powerful decisions. It is used to build the block of a random forest. It works to select the best split of an object from the dataset in each step. To reduce the high harnessed of variance, we can create multiple trees with various samples of datasets and combine this operation with bootstrap aggregation or bagging. The disadvantage of the bootstrap aggregation method is used to spill the values of each tree, which creates a problem in decision-making. Furthermore, it makes predictions of training data similar and mitigates the variance originally sought. Thus, the random forest algorithm can be further used for the classification and regression problems and for the overfitting of data as well. The selected attributes are measured by employing the information gain method to discover the value or the information from the entire dataset. The information gain method is calculated for each splitting attribute with selecting high gain attributes. It is assumed that D is the dataset.

\begin{matrix} info (D) = - \sum_{i = 1}^{m} (p_{i} \log p_{i}), \end{matrix}

(6)

where D is the dataset, i = 1, 2,…, m is the class of dataset D, and the probability is p_i

Let B be an attribute in dataset D and {b₁, b₂, b₃,…, b_n} are values of the attributes in B. Attributes are a partition for generating the amount of information from attributes.

\begin{matrix} info (D) \sum_{j = 1}^{n} (\frac{|D_{j}|}{D}) * info (D_{j}) . \end{matrix}

(7)

The attributes show the highest information as follows:

\begin{matrix} Gain (B) = info (D) - {info}_{B} (D) . \end{matrix}

(8)

3.4. Performance Measurement

The performance measures are used to test and evaluate the proposed system. The accuracy, specificity and sensitivity, precision, recall, and F-score evaluation matrices have been employed to test the proposed model. The evaluation matrices are computed by using the equations (9)–(13) as described below. where we have true positive (TP), true negative (TN), false positive (FP), and false negative (FN).

3.4.1. Accuracy

Accuracy is the number of correct predictions made by the model over all kinds of predictions made. It is calculated as the total number of correct labels (TP + TN) divided by the total number of chronic disease datasets (P + N):

\begin{matrix} accuracy = \frac{TP + TN}{TP + FP + FN + TN} . \end{matrix}

(9)

3.4.2. Specificity

Specificity (also called the true negative rate) is a measure that tells us about the percentage of patients who do not suffer from chronic diseases, which are predicted by the model as not chronic diseases:

\begin{matrix} specificity = \frac{TN}{TN + FP} \times 100 % . \end{matrix}

(10)

3.4.3. Sensitivity

Sensitivity (also called the true positive rate, the recall, or probability of detection) is the measure that tells about the percentage of patients who actually suffer from chronic diseases, which are diagnosed by the classification algorithms on chronic diseases:

\begin{matrix} specificity = \frac{TP}{TP + FN} \times 100 % . \end{matrix}

(11)

3.4.4. Precision

Precision is a measure that tells about the proportion of patients that we diagnosed as having chronic diseases, actually had chronic diseases. It is known as positive predictive value (PPV):

\begin{matrix} precision = \frac{TP}{TP + FP} \times 100 % . \end{matrix}

(12)

3.4.5. F1-Score

F1-score (also called the precision, F-score, F-measure, and recall) is the harmonic mean (average) of the precision and recall:

\begin{matrix} F 1 - score = \frac{2 * precision * sensitivity}{precision + sensitivity} % 100. \end{matrix}

(13)

4. Experimental Results and Discussion

Therefore, the rough K-means algorithm is applied for improving the classification of chronic diseases. It is used to determine the ambiguous objects that have obstructed the classification algorithms. It has further experimented with various standard chronic datasets. It is aimed here to improve the diagnosis of chronic diseases. In the beginning, the conventional classification algorithms are applied to predict chronic diseases. However, it is observed that the obtained results were not appropriate. From the obtained results, it is noted that there are ambiguous objects that decrease the accuracy of machine learning algorithms. One of the biggest challenges that we have faced within the implementation of the proposed system is the ambiguity embedded in the variable of the standard dataset. For this reason, the RKM algorithm is considered to handle these ambiguous objects so that the accuracy of the classification algorithms can be improved. The RKM algorithm is appropriately designed for detecting the ambiguity in the chronic disease datasets. The experimental results have shown that the performance of the proposed system is better than that of the conventional models. For measuring and evaluating the performance of the proposed system, the performance measures are applied. The standard evaluation matrices, namely, accuracy, specificity, sensitivity, precision, and F-score have been presented to test the proposed system against the existing machine learning techniques. Moreover, for validating the proposed system, the datasets are divided into 70% train and 30% test. Numerous experiments have attempted to evaluate the proposed system. The results of machine learning algorithms and enhanced model to the various datasets are presented as follows:

4.1. Classification Results of Diabetics Disease

In this section, different experiments of classification algorithms with the enhanced proposed system have been conducted. The soft computing rough K-means algorithm is used to handle ambiguous objects. The ambiguous objects in chronic disease datasets have reduced the performance of machine learning algorithms. When applying the classification algorithm on the original diabetics data, it is observed that the results are not favourable. From the data, it is investigated that there are ambiguous objects that hinder the classification algorithms. The diabetes data contain seven instances and two classes. These ambiguous objects are examined by RKM clustering to assist in determining the exact class of ambiguous diseases or the closest one. The dataset has been clustered for two clusters corresponding into two classes that are labelled variables in datasets. The RKM algorithm has clustered the ambiguous objects into upper approximation and lower approximation. Those objects that belong to upper approximation, which belongs to one or more cluster numbers, are excluded. Among 768 instances, 718 instances are clustered as a lower approximation. Moreover, the remaining objects are clustered as an upper approximation and are considered as ambiguous objects as well. The ambiguous objects have been denied from the data. The classification algorithm is applied to process the data in a lower approximation for diagnosing the diseases. Table 4 shows the results obtained from the RKM algorithm used for discovering the ambiguous objects.

Table 4.

Results of RKM algorithm for handling ambiguous objects in the diabetic dataset.

Cluster number	Lower approximation	Upper approximation
Cluster 1 Cluster 2	718	50

Open in a new tab

Table 5 shows the results of the classification algorithm, namely, naïve Bayes, SVM, random forest tree, and KNN. It is observed that the obtained results are needed to improve. The rough K-means is applied to enhance the existing machine learning. Table 6 shows the results of machine learning techniques with RKM algorithm. The rough K-means is used to deal with ambiguous objects. It is observed that the RKM algorithm has improved the results of the classification algorithms. The results of naïve Bayes with RKM are 80.55%, 80.14%, 80.14%, 90%, and 84.78% in terms of accuracy, sensitivity, specificity, precision, and F-score, respectively. Similarly, the results of SVM with RKM approaches are 77.78, 77.24, 78.87, 88.19, and 82.35 corresponding to accuracy, sensitivity and specificity, precision, and F-score, in that order. The results of random forest with RKM are 77.20, 56.09, 69.05, and 62.0. Furthermore, the obtained results by using KNN with RKM are 71.30%, 79.29%, 56.58%, 77.08%, 77.08%, and 78.70%. Finally, from the obtained data, it is investigated that the classification algorithm is improved by using the RKM algorithm. Figures 4–7 show the performance of the classification algorithms with the RKM algorithm.

Table 5.

Results of existing machine learning for diagnosing diabetic diseases.

Model	Accuracy (%)	Sensitivity (%)	Specificity (%)	Precision (%)	F-score (%)
Naïve Bayes	69.69	70.72	66	88.19	82.35
SVM	73.16	73.79	72.09	81.68	82.35
Random forest	76	55.6	65.6	55.6	60.02
KNN	65.80	56.79	56.58	51	53.80

Open in a new tab

Table 6.

Results of machine learning after handling ambiguous objects for diagnosing diabetic diseases.

Model	Accuracy (%)	Sensitivity (%)	Specificity (%)	Precision (%)	F-score (%)
Naïve Bayes with RKM	80.55	80.14	80.14	90	84.78
SVM with RKM	77.78	77.24	78.87	88.19	82.35
Random forest with RKM	77.20	56.9	69.05	56.9	62.06
KNN with RKM	71.30	79.29	70.67	77.08	78.17

Open in a new tab

Comparison results of existing naïve Bayes classifier and naïve Bayes using RKM algorithm for diabetic diseases.

Comparison results of existing SVM classifier and SVM using RKM algorithm for diabetic diseases.

Comparison results of existing random forest classifier and random forest using RKM algorithm for diabetic diseases.

Comparison results of existing KNN classifier and KNN using RKM algorithm for diabetic diseases.

4.2. Classification Results of Kidney Disease

This section demonstrates the classification of kidney diseases with the help of machine learning algorithms and the enhanced proposed system. Table 7 shows the results obtained from the RKM algorithm to figure out the ambiguous objects. Kidney diseases contain 400 instances. The data are clustered into two clusters. The rough K-means clusters data into upper approximation and lower approximation. The objects that have been clustered in lower approximation are 174 instances. Those objects that belong to lower approximation are regarded as approved objects because they belong to the same cluster numbers. The remaining objects, that is, 226 objects, are clustered in upper approximation, which is considered as ambiguous objects.

Table 7.

Results of RKM algorithm for handling ambiguous objects in kidney dataset.

Cluster number	Lower approximation	Upper approximation
Cluster 1 Cluster 2	174	226

Open in a new tab

Table 8 shows the performance of existing machine learning algorithms, namely, naïve Bayes, SVM, random forest tree, and KNN. It is observed that there is a possibility for improving the classification algorithms if they handle ambiguous objects. It is also noted that the RKM algorithm has improved the results of the classification algorithms. Table 9 shows results of machine learning techniques with the RKM algorithm. The results of naïve Bayes with RKM are 98.11%, 96.43%, 96.15%, 96.15%, and 98.04.78%, with respect to the evaluation matrices. Similarly, results of SVM with RKM approaches are 100%, 100%, 100, 100%, and 100% in terms of accuracy, sensitivity and specificity, precision, and F-score, respectively. The results of random forest with RKM are 100%, 100%, 100, 100%, and 98.02%. Furthermore, the obtained results by using KNN with RKM are 84.91%, 80.65%, 90.91%, 92.59%, and 86.21%. Lastly, from the obtained results, it is found that the classification algorithm is improved by using the RKM algorithm. Figures 8–11 display the performance of the classification algorithms with the RKM algorithm for predicting kidney diseases.

Table 8.

Results of existing machine learning for diagnosing kidney diseases.

Model	Accuracy (%)	Sensitivity (%)	Specificity (%)	Precision (%)	F-score (%)
Naïve Bayes	95.87	89.36	89.36	93.67	96.73
SVM	50.2	37.0	40.0	42.0	51
Random forest	98.07	100	96.6	100	98.02
KNN	69.57	60.03	74.39	82.65	76.86

Open in a new tab

Table 9.

Results of machine learning after handling ambiguous objects for diagnosing kidney diseases.

Model	Accuracy (%)	Sensitivity (%)	Specificity (%)	Precision (%)	F-score (%)
Naïve Bayes with RKM	98.11	96.43	96.15	96.15	98.04
SVM with RKM	100	100	100	100	100
Random forest with RKM	100	100	100	100	100
KNN with RKM	84.91	80.65	90.91	92.59	86.21

Open in a new tab

Comparison of results of existing naïve Bayes classifier and naïve Bayes using RKM algorithm for kidney diseases.

Comparison of results of existing SVM classifier and SVM using the RKM algorithm for kidney diseases.

Comparison of results of existing random forest classifier and random forest using the RKM algorithm for kidney diseases.

Comparison results of existing KNN classifier and KNN using the RKM algorithm for kidney diseases.

4.3. Classification Results of Cancer Disease

This section shows the classification of cancer disease using the existing machine learning algorithms and the proposed system using the RKM algorithm. The cancer data contain 569 instances that are classified into two classes such as benign and malignant. The soft computing RKM clustering algorithm is used to handle ambiguous objects. It clusters data into two clusters according to the class label of data. The RKM technique has clustered data into lower approximation and upper approximation. The objects belonged to the lower approximation are appropriate objects, and are processed by using a machine learning algorithm. However, the objects that have been clustered into upper approximation are considered as ambiguous ones. The RKM is clustered into 539 objects in the lower approximation, and the remaining objects are in the upper approximation. Table 10 demonstrates the results of the RKM algorithm. Subsequently, machine learning is applied to diagnose cancer as benign and malignant.

Table 10.

Results analysis of machine learning algorithm using the RKM algorithm for the diagnosis of cancer disease.

Cluster number	Lower approximation	Upper approximation
Cluster 1 Cluster 2	174	226

Open in a new tab

Table 11 shows the obtained results of conventional machine learning for the classification of cancer disease. It is noted that the results need more improvement, and the RKM algorithm is applied to enhance the existing machine learning algorithm. Table 12 shows results analysis of the proposed model. The results of naïve Bayes with RKM are 94.44%, 94.95%, 93.65%, 95.92%, and 95.43% in terms of accuracy, sensitivity, specificity, precision, and F-score. Similarly, the results of SVM with RKM approaches are 97.53%, 99.07%, 94.44%, 97.27, and 98.17% regarding accuracy, sensitivity, specificity, precision, and F-score, respectively. The results of random forest with RKM are 96.30%, 93.09%, 96.01%, 95%, and 93.09%. Furthermore, the obtained results by using KNN with RKM are 85.80%, 95.05%, 70.49%, 84.21%, and 89.30%. Finally, from the obtained results, it is investigated that the classification algorithms are improved by using a soft clustering RKM algorithm. Figures 12–15 display the performance of the classification algorithms with the RKM algorithm for the diagnosis of cancer disease. From graphic representations that are shown, it can be noted that the proposed system is better.

Table 11.

Results of existing machine learning for the diagnosis of cancer disease dataset.

Model	Accuracy (%)	Sensitivity (%)	Specificity (%)	Precision (%)	F-score (%)
Naïve Bayes	88.89	88.24	89.86	92.78	90.45
SVM	95.91	96.49	94.44	97.27	92.01
Random forest	95.06	90.2	96.05	90	93.2
KNN	79.53	89.42	64.18	79.49	84.16

Open in a new tab

Table 12.

Results of machine learning after handling ambiguous objects for the diagnosis of cancer disease dataset.

Model	Accuracy (%)	Sensitivity (%)	Specificity (%)	Precision (%)	F-score (%)
Naïve Bayes with RKM	94.44	94.95	93.65	95.92	95.43
SVM with RKM	97.53	99.07	94.44	97.27	98.17
Random forest with RKM	96.30	93.09	96.01	95.0	93.09
KNN with RKM	85.80	95.05	70.49	84.21	89.30

Open in a new tab

Comparison of results of the existing naïve Bayes classifier and naïve Bayes using the RKM algorithm for cancer disease.

Comparison of results of the existing SVM classifier and SVM using the RKM algorithm for cancer disease.

Comparison of results of the existing random forest classifier and random forest using the RKM algorithm for cancer disease.

Comparison of results of the existing KNN classifier and KNN using the RKM algorithm for cancer disease.

5. Comparative Analysis

In this section, a comparative analysis between the proposed model and some of the other state-of-the-art work is used in the same datasets. The comparison is very important because it examines the results of the proposed model. The accuracy metric is used to compare the proposed model with the existing classification algorithms. Table 13 shows the results of the proposed system and the existing neural network approach. It is investigated that the results of the proposed system are better than those of the existing neural network approach.

Table 13.

Results of the proposed system against the existing neural network approach.

Technique	References	Accuracy (%)	Diseases
ANN	[45]	80.4	Kidney
General regression neural network	[46]	80.20	Diabetes
Back propagation neural network	[47]	95.03	Kidney
BPNNs	[48]	92.84	Breast cancer
Proposed model		100	Kidney
Proposed model		80.55	Diabetes
Propose model		97.53	Breast cancer

Open in a new tab

6. Conclusion

The performance of the existing machine learning is thwarted from diagnosing chronic diseases because of the availability of ambiguous objects. These ambiguous objects show traits in more than one class. To identify and process the ambiguous objects explicitly, we have demonstrated the noncrisp RKM clustering that can handle these ambiguous objects to improve the accuracy of classification algorithms. The framework of the proposed system lies in its use of a soft clustering algorithm, namely, rough K-means that can be employed for modelling ambiguity. The rough K-means clustering can assist in determining the exact class of the ambiguous objects or the approximate ones. It is observed that the RKM algorithm has increased the performance of the conventional machine algorithms to predict chronic diseases. The ambiguous objects are excluded from chronic dataset. Therefore, the RKM algorithm clustered the data into lower and upper approximation. The objects clustered in lower approximation are considered as appropriate objects. Additionally, the objects that belong to the upper approximation are denied and considered as ambiguous objects. The objects that belong to the lower approximation are proposed by using machine learning algorithms to predict chronic diseases. The experimental results demonstrate that the proposed system is successfully employed for the diagnosis of chronic diseases. Comparative analysis results between existing machine learning algorithms and the proposed system are presented. Moreover, it is observed that the results of the proposed system are superior in terms of accuracy, specificity, sensitivity, precision, recall, and F-score performance measures. Identifying common web search activity behaviour is regarded as a proxy for chronic disease risk factors using machine learning algorithms can be considered in future work.

Acknowledgments

The authors acknowledge the Deanship of Scientific Research at King Faisal University for the financial support under Nasher Track (grant no. 186163). The authors acknowledge the Deanship of Scientific Research at King Faisal University for the financial support.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Supplementary Materials

The chronic disease datasets have been collected from different resources as follows: 1. Diabetics disease dataset: the diabetes data collected from the machine learning repository contained nine attributes, eight features, and one class. This dataset has been gathered from an automatic electronic recording device and paper records. 2. Breast cancer disease dataset: the cancer data collected from the Kaggle contained nine attributes, 30 features, and 1 class. These features are obtained from digitized images of breast cancer [38]. 3. Kidney disease dataset: the collected kidney data from the Kaggle contained 26 attributes, 24 features, and 1 class [38].

Click here for additional data file.^{(64.9KB, zip)}

References

1. https://www.cdc.gov/
2. https://www.who.int/
3.Witten J. H., Frank E. Data Mining: practical machine learning tools and techniques. International Journal on Cybernetics & Informatics (IJCI) 2015;4(4) [Google Scholar]
4.Solanki A. V. Data mining techniques using WEKA classification for sickle cell disease. International Journal of Computer Science and Information Technology. 2014;5(4):5857–5860. [Google Scholar]
5.Joshi J., Rinal D., Patel J. Diagnosis and prognosis of breast cancer using classification rules. International Journal of Engineering Research and General Science. 2014;2(6):315–323. [Google Scholar]
6.David S. K., Saeb A. T., Al Rubeaan K. Comparative analysis of data mining tools and classification techniques using WEKA in medical bioinformatics. Computer Engineering and Intelligent Systems. 2013;4(13):28–38. [Google Scholar]
7.Vijayarani S., Sudha S. Comparative analysis of classification function techniques for heart disease prediction. International Journal of Innovative Research in Computer and Communication Engineering. 2013;1(3):735–741. [Google Scholar]
8.Sugandhi C., Yasodha P., Kannan M. . Analysis of a population of cataract patient database in WEKA tool. International Journal of Scientific and Engineering Research. 2011;2(10) [Google Scholar]
9.Yasodha P., Kannan M. Analysis of population of diabetic patient database in WEKA tool. International Journal of Science and Engineering Research. 2011;2(5) [Google Scholar]
10.Bin Othman M. F., Yau T. M. S. Comparison of different classification techniques using WEKA for breast cancer. Proceedings of the 3rd Kuala Lumpur International Conference on Biomedical Engineering; December 2006; Kuala Lumpur, Malaysia. Springer Heidelberg; pp. 520–523. [DOI] [Google Scholar]
11.Israa A., Azzeh M. Comparative study of predicting heart diseases by using data mining classification. International Journal of Computer Science and Information Security. 2016;14(12) [Google Scholar]
12.Sisodia D., Sisodia D. S. Prediction of diabetes using classification algorithms. Procedia Computer Science. 2018;132:1578–1585. doi: 10.1016/j.procs.2018.05.122. [DOI] [Google Scholar]
13.Syed F. A., Aryati A., Wajihah S., Shahir M. R. Classification of obesity childhood among 6 school children using data mining technique. International Conference on Soft Computing and Data Mining. 2016:p. 549. [Google Scholar]
14.Sandeep J., Mula R., Kuriachan B. Chronic kidney disease analysis using machine learning algorithms. International Journal for Research in Applied Science & Engineering Technology. 2018;6(1):3367–3379. [Google Scholar]
15.Sahana B. J., Minavathi Kidney disease prediction using data mining classification techniquesand ANN. International Journal of Innovative Research in Computer and Communication Engineering. 2017;4(5):8675–8679. [Google Scholar]
16.Polaraju K., Prasad D. D. Prediction of heart disease using multiple linear regression model. International Journal of Engineering Development and Research Development. 2017;5(4):1419–1425. [Google Scholar]
17.Kim C., Son Y., Youm S. Chronic disease prediction using character-recurrent neural network in the presence of missing information. Applied Sciences. 2019;9(10):p. 2170. doi: 10.3390/app9102170. [DOI] [Google Scholar]
18.Ng K., Steinbuhl S., deFilippi C., Dey S., Stewart W. Early detection of heart failure using electronic health records: practical implications for time before diagnosis, data diversity, data quantity, and data density. Journal of Patient-Centered Research and Reviews. 2017;4(3):174–175. doi: 10.17294/2330-0698.1523. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Zhang X., Zhao H., Zhang S., Li R. A novel deep neural network model for multi-label chronic disease prediction. Frontiers in Genetics. 2019;10 doi: 10.3389/fgene.2019.00351. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Kriplani H., Patel B., Roy S. Computer Aided Intervention and Diagnostics in Clinical and Medical Images. Berlin, Germany: Springer; 2019. Prediction of chronic kidney diseases using deep artificial neural network technique; pp. 179–187. [DOI] [Google Scholar]
21.Liu J., Zhang Z., Razavian N. Deep EHR: chronic disease prediction using medical notes. 2018. https://arxiv.org/abs/1808.04928.
22.Brisimi T. S., Xu T., Wang T., Dai W., Adams W. G., Paschalidis I. C. Predicting chronic disease hospitalizations from electronic health records: an interpretable classification approach. Proceedings of the IEEE. 2018;106(4):690–707. doi: 10.1109/jproc.2017.2789319. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Chen M., Hao Y., Hwang K., Wang L., Wang L. Disease prediction by machine learning over big data from healthcare communities. IEEE Access. 2017;5:8869–8879. doi: 10.1109/access.2017.2694446. [DOI] [Google Scholar]
24.Patel J., Upadhyay T., Patel S. Heart disease prediction using machine learning and data mining technique. International Journal of Computer Science & Communication. 2016;7(1):129–137. [Google Scholar]
25.Deepika K., Seema S. S. Predictive analytics to prevent and control chronic disease. Proceedings of the International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT); July 2016; Bangalore, India. IEEE; pp. 381–386. [DOI] [Google Scholar]
26.Ul Haq A., Li J. P., Memon M. H., Nazir S., Sun R. A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms. Mobile Information System. 2018;2018:21. doi: 10.1155/2018/3860146.3860146 [DOI] [Google Scholar]
27.Ahmed S., Kabir M. T., Mahmood N. T., Rahman R. M. Diagnosis of kidney disease using fuzzy expert system. Proceedings of the 8th International Conference on Software, Knowledge, Information Management and Applications (SKIMA 2014); December 2014; IEEE, Dhaka, Bangladesh. [DOI] [Google Scholar]
28.Caocci G., Baccoli R., Littera R., Orru S., Carcassi C., La Nasa G. Artificial Neural Networks—Architectures and Applications. Philadelphia, PA, USA: IntechOpen; 2012. Comparison between an ANN and logistic regression in predicting long term kidney transplantation outcome; pp. 106–124. [DOI] [Google Scholar]
29.Xun L., Xiaoping W., Ningshan L., Tanqi L. Application of radial basis function neural network to estimate glomerular filtration rate in Chinese patients with chronic kidney disease. Proceedings of the International Conference on Computer Application and System Modeling (ICCASM); October 2010; Taiyuan, China. IEEE; pp. V15–V332. [DOI] [Google Scholar]
30.Polat H., Mehr H. D., Cetin A. Diagnosis of chronic kidney disease based on support vector machine by feature selection methods. Systems-Level Quality Improvement. 2017;41(55):2–11. doi: 10.1007/s10916-017-0703-x. [DOI] [PubMed] [Google Scholar]
31.Anuja Kumari V., Chitra R. Classification of diabetes disease using support vector machine. International Journal of Engineering Research and Applications (IJERA) 2013;3(2):1797–1801. [Google Scholar]
32.Hamza A., Moetque H. Diabetes disease diagnosis method based on feature extraction using K-SVM. International Journal of Advanced Computer Science and Applications. 2017;8(1):p. 236. doi: 10.14569/ijacsa.2017.080130. [DOI] [Google Scholar]
33.Mert A., Kilic N., Akan A. Breast cancer classification by using support vector machines with reduced dimension. Proceedings of the 53rd International Symposium ELMAR; September 2011; Zadar, Croatia. [Google Scholar]
34.West D., Mangiameli P., Rampal R., West V. Ensemble strategies for a medical diagnostic decision support system: a breast cancer diagnosis application. European Journal of Operational Research. 2005;162(2):532–551. doi: 10.1016/j.ejor.2003.10.013. [DOI] [Google Scholar]
35.Gautam R., Kaur P., Sharma M. A comprehensive review on nature inspired computing algorithms for the diagnosis of chronic disorders in human beings. Progress in Artificial Intelligence. 2019;8(4):401–424. doi: 10.1007/s13748-019-00191-1. [DOI] [Google Scholar]
36.Rojas E. M., Moreno H. B. R., Palencia J. S. M. Innovation in Medicine and Healthcare Systems, and Multimedia. Vol. 145. Berlin, Germany: Springer; 2019. Contributions of machine learning in the health area as support in the diagnosis and care of chronic diseases; pp. 261–269. [DOI] [Google Scholar]
37. http://archive.ics.uci.edu/ml/datasets/diabetes.
38. https://www.kaggle.com/uciml/breast-cancer-wisconsin-data.
39.Aldhyani T. H., Joshi M. R. Handling ambiguous packets in intrusion detection. Proceedings of the 3rd International Conference on Signal Processing, Communication and Networking (ICSCN); March 2015; IEEE, Chennai, India. [DOI] [Google Scholar]
40.Lingras P., West C. Interval set clustering of web users with rough k-means. Journal of Intelligent Information Systems. 2004;23(1):5–16. doi: 10.1023/b:jiis.0000029668.88665.1a. [DOI] [Google Scholar]
41.Hartigan J. A., Wong M. A. Algorithm AS 136: a k-means clustering algorithm. Applied Statistics. 1979;28(1):100–108. doi: 10.2307/2346830. [DOI] [Google Scholar]
42.MacQueen Q. Some methods for classification and analysis of multivariate observations”. In: Cam L. M. L., Neyman J., editors. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1. Berkeley, CA, USA: University of California Press; 1967. pp. 281–297. [Google Scholar]
43.Peters G. Some refinements of rough -means clustering. Pattern Recognition. 2006;39(8):1481–1491. doi: 10.1016/j.patcog.2006.02.002. [DOI] [Google Scholar]
44.Joshi M., Lingras P. Evidential clustering or rough clustering: the choice is yours. In: Li T., Nguyen H. S., Wang G., et al., editors. Rough Sets and Knowledge Technology, LNCS. Vol. 7414. Heidelberg, Germany: Springer; 2012. pp. 123–128. [Google Scholar]
45.Ramya S., Radha N. Diagnosis of chronic kidney disease using machine learning algorithms. International Journal of Innovative Research in Computer and Communication Engineering. 2016;4(1):812–820. [Google Scholar]
46.Kayaer K., Yldrm T. Medical diagnosis on Pima Indian diabetes using general regression neural networks. Proceedings of the International Conference on Artificial Neural Networks and Neural Information Processing (ICANN/ICONIP); June 2013; Daegu, South Korea. pp. 181–184. [Google Scholar]
47.Ravindra B. V., Sriraam N., Geetha M. Chronic kidney disease detection using back propagation neural network classifier. Proceedings of the International Conference on Communication, Computing and Internet of Things; February 2018; Chennai, India. pp. 15–17. [DOI] [Google Scholar]
48.Marcano-Cedeño A., Buendía-Buendía F. S., Andina D. Breast Cancer Classification Applying Artificial Metaplasticity. Berlin-Heidelberg, Germany: Springer-Verlag; 2009. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Click here for additional data file.^{(64.9KB, zip)}

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

[B1] 1. https://www.cdc.gov/

[B2] 2. https://www.who.int/

[B3] 3.Witten J. H., Frank E. Data Mining: practical machine learning tools and techniques. International Journal on Cybernetics & Informatics (IJCI) 2015;4(4) [Google Scholar]

[B4] 4.Solanki A. V. Data mining techniques using WEKA classification for sickle cell disease. International Journal of Computer Science and Information Technology. 2014;5(4):5857–5860. [Google Scholar]

[B5] 5.Joshi J., Rinal D., Patel J. Diagnosis and prognosis of breast cancer using classification rules. International Journal of Engineering Research and General Science. 2014;2(6):315–323. [Google Scholar]

[B6] 6.David S. K., Saeb A. T., Al Rubeaan K. Comparative analysis of data mining tools and classification techniques using WEKA in medical bioinformatics. Computer Engineering and Intelligent Systems. 2013;4(13):28–38. [Google Scholar]

[B7] 7.Vijayarani S., Sudha S. Comparative analysis of classification function techniques for heart disease prediction. International Journal of Innovative Research in Computer and Communication Engineering. 2013;1(3):735–741. [Google Scholar]

[B8] 8.Sugandhi C., Yasodha P., Kannan M. . Analysis of a population of cataract patient database in WEKA tool. International Journal of Scientific and Engineering Research. 2011;2(10) [Google Scholar]

[B9] 9.Yasodha P., Kannan M. Analysis of population of diabetic patient database in WEKA tool. International Journal of Science and Engineering Research. 2011;2(5) [Google Scholar]

[B10] 10.Bin Othman M. F., Yau T. M. S. Comparison of different classification techniques using WEKA for breast cancer. Proceedings of the 3rd Kuala Lumpur International Conference on Biomedical Engineering; December 2006; Kuala Lumpur, Malaysia. Springer Heidelberg; pp. 520–523. [DOI] [Google Scholar]

[B11] 11.Israa A., Azzeh M. Comparative study of predicting heart diseases by using data mining classification. International Journal of Computer Science and Information Security. 2016;14(12) [Google Scholar]

[B12] 12.Sisodia D., Sisodia D. S. Prediction of diabetes using classification algorithms. Procedia Computer Science. 2018;132:1578–1585. doi: 10.1016/j.procs.2018.05.122. [DOI] [Google Scholar]

[B13] 13.Syed F. A., Aryati A., Wajihah S., Shahir M. R. Classification of obesity childhood among 6 school children using data mining technique. International Conference on Soft Computing and Data Mining. 2016:p. 549. [Google Scholar]

[B14] 14.Sandeep J., Mula R., Kuriachan B. Chronic kidney disease analysis using machine learning algorithms. International Journal for Research in Applied Science & Engineering Technology. 2018;6(1):3367–3379. [Google Scholar]

[B15] 15.Sahana B. J., Minavathi Kidney disease prediction using data mining classification techniquesand ANN. International Journal of Innovative Research in Computer and Communication Engineering. 2017;4(5):8675–8679. [Google Scholar]

[B16] 16.Polaraju K., Prasad D. D. Prediction of heart disease using multiple linear regression model. International Journal of Engineering Development and Research Development. 2017;5(4):1419–1425. [Google Scholar]

[B17] 17.Kim C., Son Y., Youm S. Chronic disease prediction using character-recurrent neural network in the presence of missing information. Applied Sciences. 2019;9(10):p. 2170. doi: 10.3390/app9102170. [DOI] [Google Scholar]

[B18] 18.Ng K., Steinbuhl S., deFilippi C., Dey S., Stewart W. Early detection of heart failure using electronic health records: practical implications for time before diagnosis, data diversity, data quantity, and data density. Journal of Patient-Centered Research and Reviews. 2017;4(3):174–175. doi: 10.17294/2330-0698.1523. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B19] 19.Zhang X., Zhao H., Zhang S., Li R. A novel deep neural network model for multi-label chronic disease prediction. Frontiers in Genetics. 2019;10 doi: 10.3389/fgene.2019.00351. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B20] 20.Kriplani H., Patel B., Roy S. Computer Aided Intervention and Diagnostics in Clinical and Medical Images. Berlin, Germany: Springer; 2019. Prediction of chronic kidney diseases using deep artificial neural network technique; pp. 179–187. [DOI] [Google Scholar]

[B21] 21.Liu J., Zhang Z., Razavian N. Deep EHR: chronic disease prediction using medical notes. 2018. https://arxiv.org/abs/1808.04928.

[B22] 22.Brisimi T. S., Xu T., Wang T., Dai W., Adams W. G., Paschalidis I. C. Predicting chronic disease hospitalizations from electronic health records: an interpretable classification approach. Proceedings of the IEEE. 2018;106(4):690–707. doi: 10.1109/jproc.2017.2789319. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B23] 23.Chen M., Hao Y., Hwang K., Wang L., Wang L. Disease prediction by machine learning over big data from healthcare communities. IEEE Access. 2017;5:8869–8879. doi: 10.1109/access.2017.2694446. [DOI] [Google Scholar]

[B24] 24.Patel J., Upadhyay T., Patel S. Heart disease prediction using machine learning and data mining technique. International Journal of Computer Science & Communication. 2016;7(1):129–137. [Google Scholar]

[B25] 25.Deepika K., Seema S. S. Predictive analytics to prevent and control chronic disease. Proceedings of the International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT); July 2016; Bangalore, India. IEEE; pp. 381–386. [DOI] [Google Scholar]

[B26] 26.Ul Haq A., Li J. P., Memon M. H., Nazir S., Sun R. A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms. Mobile Information System. 2018;2018:21. doi: 10.1155/2018/3860146.3860146 [DOI] [Google Scholar]

[B27] 27.Ahmed S., Kabir M. T., Mahmood N. T., Rahman R. M. Diagnosis of kidney disease using fuzzy expert system. Proceedings of the 8th International Conference on Software, Knowledge, Information Management and Applications (SKIMA 2014); December 2014; IEEE, Dhaka, Bangladesh. [DOI] [Google Scholar]

[B28] 28.Caocci G., Baccoli R., Littera R., Orru S., Carcassi C., La Nasa G. Artificial Neural Networks—Architectures and Applications. Philadelphia, PA, USA: IntechOpen; 2012. Comparison between an ANN and logistic regression in predicting long term kidney transplantation outcome; pp. 106–124. [DOI] [Google Scholar]

[B29] 29.Xun L., Xiaoping W., Ningshan L., Tanqi L. Application of radial basis function neural network to estimate glomerular filtration rate in Chinese patients with chronic kidney disease. Proceedings of the International Conference on Computer Application and System Modeling (ICCASM); October 2010; Taiyuan, China. IEEE; pp. V15–V332. [DOI] [Google Scholar]

[B30] 30.Polat H., Mehr H. D., Cetin A. Diagnosis of chronic kidney disease based on support vector machine by feature selection methods. Systems-Level Quality Improvement. 2017;41(55):2–11. doi: 10.1007/s10916-017-0703-x. [DOI] [PubMed] [Google Scholar]

[B31] 31.Anuja Kumari V., Chitra R. Classification of diabetes disease using support vector machine. International Journal of Engineering Research and Applications (IJERA) 2013;3(2):1797–1801. [Google Scholar]

[B32] 32.Hamza A., Moetque H. Diabetes disease diagnosis method based on feature extraction using K-SVM. International Journal of Advanced Computer Science and Applications. 2017;8(1):p. 236. doi: 10.14569/ijacsa.2017.080130. [DOI] [Google Scholar]

[B33] 33.Mert A., Kilic N., Akan A. Breast cancer classification by using support vector machines with reduced dimension. Proceedings of the 53rd International Symposium ELMAR; September 2011; Zadar, Croatia. [Google Scholar]

[B34] 34.West D., Mangiameli P., Rampal R., West V. Ensemble strategies for a medical diagnostic decision support system: a breast cancer diagnosis application. European Journal of Operational Research. 2005;162(2):532–551. doi: 10.1016/j.ejor.2003.10.013. [DOI] [Google Scholar]

[B35] 35.Gautam R., Kaur P., Sharma M. A comprehensive review on nature inspired computing algorithms for the diagnosis of chronic disorders in human beings. Progress in Artificial Intelligence. 2019;8(4):401–424. doi: 10.1007/s13748-019-00191-1. [DOI] [Google Scholar]

[B36] 36.Rojas E. M., Moreno H. B. R., Palencia J. S. M. Innovation in Medicine and Healthcare Systems, and Multimedia. Vol. 145. Berlin, Germany: Springer; 2019. Contributions of machine learning in the health area as support in the diagnosis and care of chronic diseases; pp. 261–269. [DOI] [Google Scholar]

[B37] 37. http://archive.ics.uci.edu/ml/datasets/diabetes.

[B38] 38. https://www.kaggle.com/uciml/breast-cancer-wisconsin-data.

[B39] 39.Aldhyani T. H., Joshi M. R. Handling ambiguous packets in intrusion detection. Proceedings of the 3rd International Conference on Signal Processing, Communication and Networking (ICSCN); March 2015; IEEE, Chennai, India. [DOI] [Google Scholar]

[B40] 40.Lingras P., West C. Interval set clustering of web users with rough k-means. Journal of Intelligent Information Systems. 2004;23(1):5–16. doi: 10.1023/b:jiis.0000029668.88665.1a. [DOI] [Google Scholar]

[B41] 41.Hartigan J. A., Wong M. A. Algorithm AS 136: a k-means clustering algorithm. Applied Statistics. 1979;28(1):100–108. doi: 10.2307/2346830. [DOI] [Google Scholar]

[B42] 42.MacQueen Q. Some methods for classification and analysis of multivariate observations”. In: Cam L. M. L., Neyman J., editors. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1. Berkeley, CA, USA: University of California Press; 1967. pp. 281–297. [Google Scholar]

[B43] 43.Peters G. Some refinements of rough -means clustering. Pattern Recognition. 2006;39(8):1481–1491. doi: 10.1016/j.patcog.2006.02.002. [DOI] [Google Scholar]

[B44] 44.Joshi M., Lingras P. Evidential clustering or rough clustering: the choice is yours. In: Li T., Nguyen H. S., Wang G., et al., editors. Rough Sets and Knowledge Technology, LNCS. Vol. 7414. Heidelberg, Germany: Springer; 2012. pp. 123–128. [Google Scholar]

[B45] 45.Ramya S., Radha N. Diagnosis of chronic kidney disease using machine learning algorithms. International Journal of Innovative Research in Computer and Communication Engineering. 2016;4(1):812–820. [Google Scholar]

[B46] 46.Kayaer K., Yldrm T. Medical diagnosis on Pima Indian diabetes using general regression neural networks. Proceedings of the International Conference on Artificial Neural Networks and Neural Information Processing (ICANN/ICONIP); June 2013; Daegu, South Korea. pp. 181–184. [Google Scholar]

[B47] 47.Ravindra B. V., Sriraam N., Geetha M. Chronic kidney disease detection using back propagation neural network classifier. Proceedings of the International Conference on Communication, Computing and Internet of Things; February 2018; Chennai, India. pp. 15–17. [DOI] [Google Scholar]

[B48] 48.Marcano-Cedeño A., Buendía-Buendía F. S., Andina D. Breast Cancer Classification Applying Artificial Metaplasticity. Berlin-Heidelberg, Germany: Springer-Verlag; 2009. [Google Scholar]

PERMALINK

Soft Clustering for Enhancing the Diagnosis of Chronic Diseases over Machine Learning Algorithms

Theyazn H H Aldhyani

Ali Saleh Alshebami

Mohammed Y Alzahrani

Abstract

1. Introduction

2. Related Studies

3. Materials and Methods

Figure 1.

3.1. Datasets

3.1.1. Diabetic Disease Dataset

Table 1.

3.1.2. Breast Cancer Disease Dataset

Table 2.

3.1.3. Kidney Disease Dataset

Table 3.

3.2. Handling Ambiguity

Figure 2.

3.2.1. Rough K-Means Clustering Algorithm

Figure 3.

3.3. Classification Algorithms

3.3.1. Support Vector Machine Algorithm

3.3.2. Naïve Bayes Algorithm

3.3.3. K-Nearest Neighbors Algorithm

3.3.4. Random Forest Algorithm

3.4. Performance Measurement

3.4.1. Accuracy

3.4.2. Specificity

3.4.3. Sensitivity

3.4.4. Precision

3.4.5. F1-Score

4. Experimental Results and Discussion

4.1. Classification Results of Diabetics Disease

Table 4.

Table 5.

Table 6.

Figure 4.

Figure 5.

Figure 6.

Figure 7.

4.2. Classification Results of Kidney Disease

Table 7.

Table 8.

Table 9.

Figure 8.

Figure 9.

Figure 10.

Figure 11.

4.3. Classification Results of Cancer Disease

Table 10.

Table 11.

Table 12.

Figure 12.

Figure 13.

Figure 14.

Figure 15.

5. Comparative Analysis

Table 13.

6. Conclusion

Acknowledgments

Data Availability

Conflicts of Interest

Supplementary Materials

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases