Computational Intelligence and Neuroscience. 2022 Apr 18;2022:9005278. doi: 10.1155/2022/9005278

AdaBoost Ensemble Methods Using K-Fold Cross Validation for Survivability with the Early Detection of Heart Disease

T R Mahesh 1, V Dhilip Kumar 2, V Vinoth Kumar 1, Junaid Asghar 3, Oana Geman 4, G Arulkumaran 5, N Arun 1
PMCID: PMC9038394  PMID: 35479597

Abstract

As a result of technological advances, a wide range of features can now be collected for heart disease diagnosis, but the resulting large datasets have several drawbacks, including limited storage capacity and long access and processing times. Early diagnosis of heart problems is crucial for medical therapy. Heart disease is a devastating condition that is increasing rapidly in developed and developing countries alike, resulting in death. In this disease, the heart fails to supply enough blood to the other parts of the body for them to perform their regular functions. Early and accurate diagnosis of this condition is therefore critical for averting further damage and saving patients' lives. In this work, machine learning (ML) is used to determine whether a person has cardiac disease. Both types of ensemble classifiers, namely, homogeneous and heterogeneous classifiers (the latter formed by combining two separate classifiers), have been implemented. The Synthetic Minority Oversampling Technique (SMOTE) is employed in data preprocessing to cope with class imbalance and noise. The proposed work has two steps: SMOTE is used in the first phase to reduce the impact of data imbalance, and the second phase classifies the data using Naive Bayes (NB), decision tree (DT) algorithms, and their ensembles. The experimental results demonstrate that the AdaBoost-Random Forest classifier provides 95.47% accuracy in the early detection of heart disease.

1. Introduction

Heart disease is widely regarded as the world's most dangerous and life-threatening chronic disease. In heart disease, the heart fails to deliver enough blood to the rest of the body for it to operate normally; narrowing and occlusion of the coronary arteries, which control the entire supply of blood to the heart, can cause heart failure. Heart disease is one of the leading causes of death across the globe today [1]. This creates a crucial need for monitoring the functioning of organs in the human body and, in particular, the health records of the cardiovascular system. According to a recent survey, the United States is among the most severely affected countries, with a relatively high ratio of heart disease among patients. Symptoms such as breathing problems, physical weakness, exhaustion, and swollen feet, among various others, are the most typical markers of heart disease [2]. Most cardiovascular diseases affecting people across the world are fatal, so, given the huge growth in technology, new techniques that detect heart disease at an early stage can help to overcome this problem. Machine learning techniques can be used to monitor heart disease and flag it before it causes substantial damage, saving time, cost, and human lives. Machine learning comprises emerging techniques for manipulating data and extracting relevant features or information [3]. It is a complex field with an enormous and still expanding scope across applications. Machine learning techniques include supervised learning, unsupervised learning, and ensemble learning classifiers, which are used here to forecast heart disease at an early stage with improved accuracy [4].

In past years, academics and researchers attempted to create and implement many intelligent programs by applying predefined procedures, much like ordinary existing programs [2]. However, such programs still lag in handling many observations and instances in a timely manner, which limits their ability to address societal challenges. Tasks that remain challenging today include photo tagging, web-based ranking, and classifying emails as spam or not spam. One way to tackle such tasks is to develop a program that generates relevant rules from a set of data samples, called a training set; the emerging field built around this idea is machine learning. Between 2010 and 2015, many intelligent software systems based on machine learning methods were applied, including recognition systems for patient images, improving accuracy results from 72% to 95% [5].

Most machine learning applications are still evolving and now affect every aspect of our daily lives. Machine learning is applied in many emerging areas such as healthcare monitoring systems, pattern recognition and feature extraction, text and speech recognition, education systems, military and defense applications, and fraud detection. Artificial intelligence takes the lead in the development of ML technology, which simulates human learning from an input dataset or other information. Machine learning algorithms at firms such as Facebook, Amazon, and Flipkart are boosting business trends and the development of various brands [6]. With the help of past data, machine learning tries to discover new patterns by applying algorithms to achieve feasible outcomes; it also adds value to businesses and organizations by focusing on forecasting future situations and outcomes [7].

2. Related Work

A great deal of research has applied machine learning methods to achieve more accurate results and predict outputs from an input dataset [8]. Machine learning plays a very important role in identifying new trends and techniques based on customer behavior or other input patterns, and in the development of new products and brands [9]. Using machine learning algorithms, enterprises can understand customer needs at a deeper level, with the algorithms chosen for each application according to their outcomes [10]. Machine learning is also growing in importance in business operations, and artificial intelligence is becoming practical at scale with today's ML models.

One new strategy for detecting cardiac diseases, based on Co-Active Neuro-Fuzzy Inference Systems (CANFIS), was applied in one research work [11]. Much of the research literature examines the reliability of heart disease detection strategies as well as their difficulties. Classifier strategies for the detection of heart disease have been demonstrated using the Naive Bayes machine learning classifier model. Many research papers survey the use of data mining algorithms for the prediction of heart disease in various applications [12], but a traditional invasive approach is still often combined with machine learning algorithms. Classifier models for diagnosing heart disease are based on the medical history of patients and their test or scan results, so that researchers and doctors can investigate the connected symptoms [13].

A further disadvantage of the invasive approach is that the dye used is harmful, as it affects the kidneys by increasing creatinine; it also brings high cost, other kinds of adverse effects, and the need for a high level of technical expertise [14]. The traditional method of disease diagnosis is comparatively costly, computationally intensive, and time-consuming to assess [15]. To overcome the challenges of conventional invasive methods for identifying heart disease, researchers have tried to create noninvasive smart healthcare systems based on predictive ML techniques such as SVM, K-NN, Naive Bayes (NB), and decision tree (DT), among others [16].

In the medical field, one of the most used classifiers is the decision tree. In [17], SEER medical datasets were used to predict disease survivorship using classification and regression trees (CART).

In [18], neural networks were introduced to diagnose and forecast heart disease as well as blood pressure. A Deep Neural Network with roughly 120 hidden layers was built from the given disease attributes, with the output perceptron producing the prediction; such a model is a basic and relevant way of ensuring an accurate indication of heart disease when run on the test dataset [19]. The use of a supervised algorithm for cardiac disease diagnostics has also been recommended [17]. When the attributes of the data are correlated, the random forest approach tends to favor the smaller group [20]. This is why the SMOTE method is used here: to alleviate the challenge of imbalanced data and limit the probability of bias against the minority class in the dataset. In [21], a combination of SMOTE and an Artificial Neural Network (ANN) was used to diagnose ovarian cancer on a publicly available ovarian cancer dataset; that research demonstrates that using SMOTE preprocessing to decrease the impact of data imbalance improves the performance and efficiency of neural networks in cancer classification. On large datasets, most single-classifier algorithms have the drawback of being computationally expensive and unwieldy. In particular, classification approaches on large datasets do not give consistent and reliable results, making some individual classifier systems wasteful and unreliable [22]. For example, the DT approach is particularly good at managing intervariable interactions, but it struggles with linear relationships between variables [23].

In recent years, ensemble classifiers have become a popular strategy in machine learning and pattern recognition. In a nutshell, an ensemble is a method for combining the findings of many classifiers. Its main goal is to improve classification efficiency by weighing multiple independent classifiers and combining them into a single classifier that outperforms each one individually [22, 24, 25].

3. Exploratory Knowledge

One of the most well-known areas of medical research is research on heart disease. Early identification and accurate prediction of heart disease have a significant impact on therapy and reduce patient mortality rates. The sections that follow briefly describe the algorithms used to detect heart disease in this study.

3.1. Decision Tree (DT) Classifier

A decision tree is a supervised ML algorithm that makes decisions based on a set of rules, much as people normally do. In one sense, this ML classification method is designed to make judgments. Both classification and regression problems can be solved with this classifier [26].

The model is defined by several notions, given below; a short computational sketch follows the list.

  • (i) Entropy: Entropy is a measurement of a system's unpredictability or disorder; the concept was proposed in 1850 by the German physicist Rudolf Clausius. It is computed as shown in

    $\mathrm{Entropy} = -\sum p(X) \log_2 p(X)$, (1)

    where p(X) is the fraction of examples in a given class.

  • (ii) Gini Index: Also called the Gini coefficient, a measure of income distribution in a population, created by the Italian statistician Corrado Gini in 1912. The Gini impurity is computed using

    $\mathrm{Gini\ Impurity} = 1 - \sum_{i=1}^{C} p_i^2$. (2)
  • (iii) Information Gain: The reduction in entropy achieved by splitting a dataset is known as information gain, and it is frequently utilized in the training of decision trees. It is calculated from the entropy of the dataset before and after a split, using

    $IG(D_p, f) = I(D_p) - \frac{N_{left}}{N} I(D_{left}) - \frac{N_{right}}{N} I(D_{right})$, (3)

    where f is the feature split on; $D_p$ is the parent dataset; $D_{left}$ is the left child node dataset; $D_{right}$ is the right child node dataset; I is the impurity criterion; N is the total number of samples; $N_{left}$ is the number of samples in the left child node; and $N_{right}$ is the number of samples in the right child node.
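As a minimal illustration (not code from the original paper), the three quantities above can be computed directly with numpy, following formulas (1)-(3):

```python
import numpy as np

def entropy(labels):
    # Entropy = -sum(p * log2(p)) over the classes present in `labels`.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini impurity = 1 - sum(p_i^2).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right, impurity=entropy):
    # IG = I(parent) - (N_left/N) I(left) - (N_right/N) I(right).
    n = len(parent)
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

# Toy example: a split that separates the classes perfectly has maximal gain.
parent = np.array([0, 0, 1, 1])
print(information_gain(parent, parent[:2], parent[2:]))  # 1.0 with entropy
```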

3.2. The CART Algorithm

The CART algorithm was first introduced by Breiman et al. [27] and is built using Hunt's algorithm. To build a DT, it can process categorical as well as continuous attributes; it also accounts for missing data and constructs the DT using the Gini Index as the attribute-selection criterion. CART divides the given dataset (training set) into binary segments and therefore builds binary trees. Unlike ID3 and C4.5, which rely on entropy-based probabilistic assumptions, CART employs the Gini Index, and it further increases classification accuracy by using cost-complexity pruning to remove unpredictable branches from the DT.
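For concreteness, a hedged sketch of a CART-style learner: scikit-learn's DecisionTreeClassifier (assumed here; the paper does not name its implementation) supports the Gini Index criterion and cost-complexity pruning via the ccp_alpha parameter:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the paper uses the UCI heart-disease dataset.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# CART-style tree: Gini Index splits plus cost-complexity pruning.
cart = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
cart.fit(X, y)
print(cart.get_n_leaves(), "leaves after pruning")
```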

3.3. Alternating Decision Tree (AltDTree)

AltDTree is an ML classification method that is related to boosting and generalizes decision trees. An AltDTree is made up of a series of decision nodes, which specify a predicate condition, and prediction nodes, which hold a single number [28]. Classic DTs, Voted DTs, and Voted Decision Stumps are all generalized by the AltDTree, and any boosting implementation can be used as the learning method to extract the AltDTree model from the data. In the context of decision trees, the AltDTree is an appealing extension of boosting: it enables various boosting strategies to create an AltDTree model with unique properties that can handle a wide range of applications.

3.4. Random Forest (RF) Classifier

RF works by using the training data to create several decision trees. For classification, every tree proposes a class as its output, and the class with the greatest number of votes is selected as the final outcome [29]. To build the forest, the number of trees must be specified. RF is a bootstrap-aggregating (bagging) technique, which is used to reduce an important parameter of the outcomes: the variance.
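A minimal sketch, assuming scikit-learn's RandomForestClassifier (the paper does not specify its implementation), where the number of trees is fixed up front and bootstrap aggregation is enabled:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data only; 13 features echo the heart-disease attributes.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# n_estimators fixes the number of trees; bootstrap=True enables bagging.
# The final prediction is the majority vote across the trees.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
rf.fit(X, y)
print(rf.predict(X[:5]))
```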

3.5. Reduced Error Pruning Tree (RedEPTree)

Top-down induction of decision trees can be hampered by a poorly performing pruning phase: it is known, for example, that the size of the resulting tree grows linearly with the sample size even though the tree's accuracy does not improve. Reduced-error pruning addresses this. The RedEPTree technique is based on the notion of calculating information gain using entropy and of backfitting to minimize variance-induced error [30].

3.6. Naive Bayes (NB) Classifier

The Naive Bayesian approach classifies data in two steps [31]. The first stage evaluates the parameters of a probability distribution using the training input data. In the second stage, each test sample is assigned the class with the greatest posterior probability. The NB classifier's pseudocode is shown in Algorithm 1.
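The two stages map naturally onto a fit/predict interface; a sketch assuming scikit-learn's GaussianNB (an assumption, since the paper gives only pseudocode):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Placeholder data standing in for the heart-disease records.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)             # stage 1: estimate distribution parameters
print(nb.predict_proba(X_test[:3]))  # posterior probabilities per class
print(nb.predict(X_test[:3]))        # stage 2: class with greatest posterior
```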

3.7. AdaBoost

AdaBoost makes it possible to merge various "weak classifiers" into a single "strong classifier." Decision trees with one level, i.e., with only one split, are the most popular weak learner used with AdaBoost; such trees are also called Decision Stumps [32]. The approach builds a first model by assigning equal weights to all of the data points and then gives incorrectly classified points a higher weight, so that points with greater weights receive more importance in the next model. It continues to train models until a low enough error is reached [33].

The AdaBoost algorithm starts from the weights of the training set. Consider a training set $(x_1, y_1), \ldots, (x_n, y_n)$, in which each $x_i$ is in the instance space X and each label $y_i$ is in a collection of labels Y, typically $Y = \{-1, +1\}$. The weight on training instance i in round It is denoted $D_{It}(i)$. At the start, the same weight is used for every instance, $D_{It}(i) = 1/M$, $i = 1, \ldots, M$, where It is the iteration number. The weight of each case misclassified by the base learning algorithm is then increased in the following round. The AdaBoost algorithm's pseudocode is shown in Algorithm 2.

The coefficient $\alpha_{It}$ is computed from the class probability estimates as

$\alpha_{It} = \frac{1}{2} \ln \frac{P_{+1}}{P_{-1}}$. (4)

Here $C_{It}$ is the normalization constant, and $\alpha_{It}$ is used to allow the outcome to be generalized and to address overfitting and noise-sensitive situations [33]. The real value of $\alpha_{It} h_{It}(x)$ is built using a class probability estimate (P).
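A hedged sketch of this boosting scheme, assuming scikit-learn's AdaBoostClassifier over decision stumps (the paper's own pseudocode is in Algorithm 2; this is only an illustrative stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# One-level decision tree = "decision stump", the usual AdaBoost weak learner.
stump = DecisionTreeClassifier(max_depth=1)

# Misclassified samples are re-weighted upward between rounds internally.
# (`estimator` was called `base_estimator` in scikit-learn before 1.2.)
ada = AdaBoostClassifier(estimator=stump, n_estimators=50, random_state=0)
ada.fit(X, y)
print(ada.score(X, y))
```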

4. Proposed Methodology

The proposed approach contains two phases. SMOTE is used in the initial phase to lessen the impact of data imbalance. The second phase entails classification using Naive Bayes and DT methods (AltDTree, CART, RedEPTree, and RF) [33]. After that, AdaBoost ensembles of the aforementioned algorithms are constructed and their performance is evaluated. Finally, heterogeneous classifiers, formed by combining two different individual classifiers, are evaluated against the same performance metrics to find the best model. Figure 1 depicts the flow of the suggested technique.
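As an illustrative sketch of the two-phase flow (the dataset, package choices, and parameters here are assumptions, not the authors' exact setup), SMOTE from the imbalanced-learn package can feed resampled data into an AdaBoost-RF ensemble:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Imbalanced stand-in data (70/30 class split).
X, y = make_classification(n_samples=303, n_features=13,
                           weights=[0.7, 0.3], random_state=0)

# Phase 1: reduce class imbalance with SMOTE.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)

# Phase 2: classification; here an AdaBoost ensemble over random forests.
# (`estimator` was `base_estimator` in scikit-learn before 1.2.)
model = AdaBoostClassifier(estimator=RandomForestClassifier(random_state=0),
                           n_estimators=10, random_state=0)
print(cross_val_score(model, X_bal, y_bal, cv=10).mean())
```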

Figure 1. Proposed flow diagram.

4.1. Dataset

The Heart Disease dataset was obtained from the UCI repository. It comprises 13 medical variables for 303 patients, which help to determine whether a patient is in danger of developing heart disease and to categorize patients who are at risk and those who are not; the pattern that identifies patients at risk of heart disease is retrieved from this dataset. The records are split into two parts: training and testing. Each row corresponds to a single record, giving 303 rows and 14 columns. Table 1 lists all of the attributes, and the heatmap is depicted in Figure 2; a loading sketch follows Table 1.

Table 1.

Attributes of the dataset.

Sl. No. Features Description Values
1 Age Age in years Continuous
2 Sex Gender of patient Male/female
3 CP Chest pain Four types
4 Trestbps Resting blood pressure Continuous
5 Chol Serum cholesterol Continuous
6 FBS Fasting blood sugar ≤120 or >120 mg/dl
7 Restecg Resting electrocardiograph Five values
8 Thalach Maximum heart rate achieved Continuous
9 Exang Exercise induced angina Yes/no
10 Oldpeak ST depression when working out compared to the amount of rest taken Continuous
11 Slope Slope of peak exercise ST segment Up/flat/down
12 Ca Gives number of major vessels colored by fluoroscopy 0–3
13 Thal Defect type Reversible/fixed/normal
14 Num (disorder) Heart disease Not present (“NO”)/present in the four major types (“YES”)
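A hypothetical loading sketch for this dataset, with the 14 columns of Table 1 and a placeholder file name:

```python
import pandas as pd

# Standard UCI heart-disease column names, matching Table 1.
cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
        "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv("heart.csv", names=cols)   # "heart.csv" is a placeholder path

X = df.drop(columns="num")                  # the 13 medical variables
y = (df["num"] > 0).astype(int)             # 1 = disease present, 0 = absent
print(df.shape)                             # expected: (303, 14)
```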

Figure 2. Heatmap depiction of the dataset.

4.2. Data Preprocessing

Most classification algorithms aim to gather pure samples to learn from and to make the borderline of each class as definitive as possible in order to improve prediction. Synthetic instances far from the borderline are easier to categorize than those near it, which present a significant learning difficulty for the majority of classifiers. Based on these findings, the authors in [32] describe an advanced strategy (A-SMOTE) for preprocessing imbalanced training sets; it aims to characterize the borderline clearly and to keep only pure synthetic samples from the SMOTE generalization. This approach is divided into two parts, as follows.

Step 1. —

The SMOTE technique is used to create synthetic instances according to

$N = 2(r - z) + z$, (5)

where r denotes the number of majority class samples, z the number of minority class samples, and N the number of newly generated initial synthetic instances.

The synthetic instances generated by SMOTE can be approved or rejected based on two criteria, which correspond to the first stage. Consider $\hat{x} = \{\hat{x}_1, \hat{x}_2, \hat{x}_3, \ldots, \hat{x}_N\}$, the collection of new synthetic instances, where $\hat{x}_{ij}$ is the jth attribute value of $\hat{x}_i$, $j \in [1, M]$. Let $S_m = \{S_{m1}, S_{m2}, \ldots, S_{mz}\}$ and $S_a = \{S_{a1}, S_{a2}, \ldots, S_{ar}\}$ be the sets of minority and majority samples, respectively [32]. To make the rejection or acceptance decision, the distance $DD_{minority}(\hat{x}_i, S_{mk})$ between $\hat{x}_i$ and each $S_{mk}$, and the distance $DD_{majority}(\hat{x}_i, S_{al})$ between $\hat{x}_i$ and each $S_{al}$, are computed. For i from 1 to N, we calculate the distances as stated below, using equations (6) and (7).

$DD_{minority}(\hat{x}_i, S_{mk}) = \sqrt{\sum_{j=1}^{M} \left( \hat{x}_{ij} - S_{mkj} \right)^2}, \quad k \in [1, z]$, (6)

$DD_{majority}(\hat{x}_i, S_{al}) = \sqrt{\sum_{j=1}^{M} \left( \hat{x}_{ij} - S_{alj} \right)^2}, \quad l \in [1, r]$. (7)

As per (6) and (7), we compute the arrays $A_{minority}$ and $A_{majority}$ using (8) and (9).

$A_{minority} = \left[ DD_{minority}(\hat{x}_i, S_{m1}), \ldots, DD_{minority}(\hat{x}_i, S_{mz}) \right]$, (8)

$A_{majority} = \left[ DD_{majority}(\hat{x}_i, S_{a1}), \ldots, DD_{majority}(\hat{x}_i, S_{ar}) \right]$. (9)

Then we choose the minimum value of $A_{minority}$, min($A_{minority}$), and the minimum value of $A_{majority}$, min($A_{majority}$). If min($A_{minority}$) is less than min($A_{majority}$), the new sample is accepted; otherwise, it is rejected:

  •   min($A_{minority}$) < min($A_{majority}$) (accepted).

  •   min($A_{minority}$) ≥ min($A_{majority}$) (rejected).
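Under the definitions above, the acceptance rule of equations (6)-(9) reduces to a nearest-neighbour test; a small numpy sketch with illustrative data:

```python
import numpy as np

def accept_synthetic(x_hat, S_m, S_a):
    # Euclidean distances to every minority and every majority sample,
    # i.e. the arrays A_minority and A_majority of (8) and (9).
    A_minority = np.linalg.norm(S_m - x_hat, axis=1)
    A_majority = np.linalg.norm(S_a - x_hat, axis=1)
    # Accept only if the nearest neighbour belongs to the minority class.
    return A_minority.min() < A_majority.min()

S_m = np.array([[0.0, 0.0], [0.1, 0.2]])    # minority samples (made up)
S_a = np.array([[1.0, 1.0], [0.9, 1.1]])    # majority samples (made up)
print(accept_synthetic(np.array([0.05, 0.1]), S_m, S_a))  # True (accepted)
```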

Step 2. —

Then, using the accepted synthetic instances, the following steps are taken to remove the noise.

Suppose $\hat{S} = \{\hat{S}_1, \hat{S}_2, \hat{S}_3, \ldots, \hat{S}_n\}$ is the set of new synthetic minority samples received from Step 1. We then compute the rapprochement between each $\hat{S}_i$ and the original minority samples $S_m$, $MinRap(\hat{S}_i, S_m)$, defined using

$MinRap(\hat{S}_i, S_m) = \sum_{k=1}^{z} \sqrt{\sum_{j=1}^{M} \left( \hat{S}_{ij} - S_{mkj} \right)^2}$, (10)

where $MinRap(\hat{S}_i, S_m)$ is the rapprochement of the sample to all minority samples. As per (10), L is obtained as follows:

$L = \sum_{i=1}^{n} MinRap(\hat{S}_i, S_m)$. (11)

Step 3. —

Compute the rapprochement between each $\hat{S}_i$ and the original majority samples $S_a$, $MajRap(\hat{S}_i, S_a)$, described using

$MajRap(\hat{S}_i, S_a) = \sum_{l=1}^{r} \sqrt{\sum_{j=1}^{M} \left( \hat{S}_{ij} - S_{alj} \right)^2}$, (12)

where $MajRap(\hat{S}_i, S_a)$ is the rapprochement of the sample to all majority samples. As per (12), H is obtained as follows:

$H = \sum_{i=1}^{n} MajRap(\hat{S}_i, S_a)$. (13)

Then we remove the half of the synthetic samples with the smallest distance between $\hat{S}_i$ and $S_a$, to obtain data of high purity.
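One possible reading of Steps 2 and 3 in code: rank the accepted synthetic samples by their summed distance to the majority class, per equation (12), and discard the closest half (a numpy sketch with made-up data):

```python
import numpy as np

def remove_noise(S_hat, S_a):
    # MajRap(S_hat_i, S_a): summed Euclidean distance to all majority samples.
    maj_rap = np.array([np.linalg.norm(S_a - s, axis=1).sum() for s in S_hat])
    # Drop the half of the synthetic samples closest to the majority class.
    keep = np.argsort(maj_rap)[len(S_hat) // 2:]
    return S_hat[keep]

S_hat = np.array([[0.1, 0.1], [0.5, 0.5], [0.9, 0.9], [0.2, 0.0]])
S_a = np.array([[1.0, 1.0], [0.9, 1.1]])    # majority samples (made up)
print(remove_noise(S_hat, S_a))             # the two samples farthest away
```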

5. Performance Evaluations

The different ML algorithms, namely, Naive Bayes, AltDTree, RedEPTree, CART, and RF, are applied to the dataset as individual classifiers, and their performance is compared in terms of the metrics described in the next section.
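The title refers to k-fold cross validation; a minimal sketch of such an evaluation loop, assuming scikit-learn and a stratified 10-fold split (the fold count is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=303, n_features=13, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Score each individual classifier with the same k-fold protocol.
for name, clf in [("NB", GaussianNB()),
                  ("CART", DecisionTreeClassifier(random_state=0)),
                  ("RF", RandomForestClassifier(random_state=0))]:
    print(name, cross_val_score(clf, X, y, cv=cv).mean())
```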

5.1. Performance Metrics

If the dataset is not balanced, accuracy may not be a good measure [34]. Accuracy is the number of accurately classified examples divided by the total number of data instances, computed using

$\mathrm{Accuracy} = \frac{TN + TP}{TN + TP + FP + FN}$. (14)

Precision is a performance metric that measures how many of the positive forecasts are correct; it thus estimates the accuracy for the minority class. It is the ratio of accurately predicted positive instances to the total number of instances predicted positive, computed using

$\mathrm{Precision} = \frac{TP}{TP + FP}$. (15)

A good classifier should have a precision of 100%; this is achieved only when the numerator and denominator are identical, i.e., TP = TP + FP [33].

Recall is a metric that measures how many correct positive predictions were produced out of all positive instances. Unlike precision, which only considers the correct positive predictions out of all positive predictions, recall also accounts for the positives that were missed, so it gives some indication of the coverage of the positive class. The recall is computed using

$\mathrm{Recall} = \frac{TP}{TP + FN}$. (16)

In a good classifier, we want both precision and recall to equal one, which means FP and FN are zero. As a result, we require a statistic that considers both precision and recall. The F1-score is such a measure, defined as follows:

$F1\ \mathrm{Score} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$. (17)
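Equations (14)-(17) follow directly from the four confusion-matrix counts; a short numpy sketch with hypothetical predictions:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # illustrative labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])  # illustrative predictions

TP = np.sum((y_pred == 1) & (y_true == 1))
TN = np.sum((y_pred == 0) & (y_true == 0))
FP = np.sum((y_pred == 1) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))

accuracy = (TP + TN) / (TP + TN + FP + FN)   # equation (14)
precision = TP / (TP + FP)                   # equation (15)
recall = TP / (TP + FN)                      # equation (16)
f1 = 2 * precision * recall / (precision + recall)  # equation (17)
print(accuracy, precision, recall, f1)
```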

To compute error rates in the forecasted values, let PN denote a collection of data of the form $(t_1, r_1), (t_2, r_2), \ldots, (t_p, r_p)$, such that each $t_i$ denotes an n-dimensional test tuple with respective actual value $r_i$ of a given response r, and p denotes the count of tuples in PN; $\hat{r}_i$ denotes the value predicted for $t_i$.

In all test instances, the mean absolute error (MAE) is the mean of the absolute difference between the predicted and genuine values, calculated using

$\mathrm{MAE} = \frac{1}{p} \sum_{i=1}^{p} \left| r_i - \hat{r}_i \right|$. (18)

The root mean squared error (RMSE) is a well-known approach for measuring the success of numeric prediction: the mean of the squared discrepancies between every predicted value and its matching true value is computed and its square root taken, using

$\mathrm{RMSE} = \sqrt{\frac{1}{p} \sum_{i=1}^{p} \left( r_i - \hat{r}_i \right)^2}$. (19)

The relative absolute error (RAE) is the total absolute error made, relative to what the error would have been if the prediction had simply been the average of the actual values. It is computed using

$\mathrm{RAE} = \frac{\sum_{i=1}^{p} \left| r_i - \hat{r}_i \right|}{\sum_{i=1}^{p} \left| r_i - \bar{r} \right|}$. (20)

The root relative squared error (RRSE) likewise compares the total squared error to what the error would have been if the prediction had been the average of the actual values, with the square root of the ratio taken. It is computed using

$\mathrm{RRSE} = \sqrt{\frac{\sum_{i=1}^{p} \left( r_i - \hat{r}_i \right)^2}{\sum_{i=1}^{p} \left( r_i - \bar{r} \right)^2}}$. (21)
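Equations (18)-(21) can be implemented as plain numpy functions (the values below are illustrative; r_hat denotes the predictions):

```python
import numpy as np

def mae(r, r_hat):
    return np.mean(np.abs(r - r_hat))

def rmse(r, r_hat):
    return np.sqrt(np.mean((r - r_hat) ** 2))

def rae(r, r_hat):
    # Absolute error relative to predicting the mean of the actual values.
    return np.sum(np.abs(r - r_hat)) / np.sum(np.abs(r - r.mean()))

def rrse(r, r_hat):
    # Squared error relative to predicting the mean, then square-rooted.
    return np.sqrt(np.sum((r - r_hat) ** 2) / np.sum((r - r.mean()) ** 2))

r = np.array([1.0, 0.0, 1.0, 1.0])          # hypothetical actual values
r_hat = np.array([0.9, 0.2, 0.8, 0.7])      # hypothetical predictions
print(mae(r, r_hat), rmse(r, r_hat), rae(r, r_hat), rrse(r, r_hat))
```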

Table 2 shows that Random Forest is the best model in terms of build time (TTBM: Time to Build Model), taking only 2.11 seconds, while the AltDTree took 60.18 seconds to build.

Table 2.

Single classifier evaluation comparison.

Performance metrics Naive Bayes AltDTree RF RedEPTree CART
TTBM (sec) 4.56 60.18 2.11 10.25 52.24
Accuracy (%) 78.6 93.56 92.45 79.23 78.67
MAE 0.60 0.28 0.27 0.26 0.27
RMSE 0.83 0.41 0.42 0.42 0.56
RAE 120 67.71 77.87 79.12 68.91
RRSE 127.41 95.33 82.92 97.89 98.34
F1-score 0.3 0.85 0.84 0.83 0.81

Figure 3 shows the accuracy forecast for the individual classifiers. Among all the aforementioned classifiers used in the current work, AltDTree provides the best accuracy of 93.56%, Random Forest provides 92.45%, and the NB classifier's prediction is the lowest at 78.6%.

Figure 3. Accuracy prediction for single classifiers.

Figure 4 depicts the error rates obtained from the individual classifiers. The AltDTree MAE is 0.28 and its RMSE is 0.41, which demonstrates that low error was recorded during the prediction procedures. NB, however, has higher error rates of 0.60 MAE and 0.83 RMSE.

Figure 4. Error rates of individual classifiers.

Table 3 demonstrates that AdaBoost-RF is the best model, having taken only 10.34 seconds to build, while AdaBoost-CART is the worst, at 295.45 seconds. AdaBoost-RF also has the highest F1-score of 0.98, and AdaBoost-NB has the lowest F1-score of 0.81.

Table 3.

AdaBoost classifier.

Performance metrics AB-NB AB-AltDTree AB-RF AB-RedEPTree AB-CART
TTBM (sec) 18.32 30.01 10.34 64.35 295.45
Accuracy (%) 80.6 93.56 95.47 82.23 81.67
MAE 0.54 0.21 0.14 0.21 0.20
RMSE 0.76 0.43 0.38 0.41 0.41
RAE 129.79 57.78 35.87 45.19 41.61
RRSE 155.62 96.23 65.47 91.03 91.08
F1-score 0.81 0.94 0.98 0.83 0.87

From Figure 5, AdaBoost-RF predicts better than any other mentioned classification algorithm, with an accuracy of 95.47%; AdaBoost-AltDTree stands second with 93.56% prediction accuracy, and AdaBoost-NB provides the lowest prediction rate of 80.6%.

Figure 5. Accuracy of AdaBoost classifiers.

Figure 6 depicts the different error rates that were recorded. Among the AdaBoost ensemble classifiers, AdaBoost-RF provides the lowest error rates of 0.14 MAE and 0.38 RMSE. AdaBoost-NB, however, has higher error rates of 0.54 MAE and 0.76 RMSE, values almost the same as those of the individual NB classifier.

Figure 6. AdaBoost classifier error rates.

Table 4 depicts the results of the heterogeneous ensemble classifiers. RF-CART and RF-RedEPTree take only 7.34 and 7.89 seconds, respectively, to build the model, which is very low, whereas AltDTree-CART took 598.02 seconds, the worst build time. RF-CART also has the highest F1-score of 0.85, with RF-RedEPTree second at 0.84; AltDTree-RF and AltDTree-CART have the worst F1-scores of 0.68 and 0.69, respectively.

Table 4.

Ensemble classifiers, heterogeneous.

Performance metrics NB + AltDTree NB + RF AltDTree + RF RF + RedEPTree RF + CART AltDTree + RedEPTree AltDTree + CART
TTBM (sec) 30.03 32.05 398.12 7.89 7.34 357.77 598.02
Accuracy (%) 76.45 76.05 70.12 85.45 86.29 74.49 71.29
MAE 0.42 0.43 0.37 0.35 0.34 0.37 0.41
RMSE 0.42 0.39 0.49 0.36 0.36 0.37 0.42
RAE 99.23 92.23 80.12 71.01 70.89 73.23 89.23
RRSE 98.23 97.49 101.22 91.29 90.12 93.37 99.34
F1-score 0.74 0.75 0.68 0.84 0.85 0.73 0.69

From Figure 7, RF-CART provides the best accuracy of 86.29% in comparison to others, followed by RF-RedEPTree with 85.45% prediction accuracy. AltDTree-RF has the lowest accuracy value of 70.12%.

Figure 7. Accuracy of heterogeneous ensemble classifiers.

Figure 8 depicts the error rates obtained by the heterogeneous ensemble classifiers. RF-CART exhibits the lowest error rates, with an MAE of 0.34 and an RMSE of 0.36, while NB-RF has the highest MAE of 0.43 and AltDTree-RF the highest RMSE of 0.49.

Figure 8. Error rates for heterogeneous ensemble classifiers.

6. Conclusion

The AdaBoost ensemble model for heart disease prediction proposed in this work is based on recognized feature patterns and can be compared with classic data mining methods for the diagnosis of cardiac disease. Ensemble classification approaches replace traditional methods of extracting meaningful information during the feature extraction step. Both homogeneous classifiers and the ensemble classifiers formed by combining multiple methods, called heterogeneous classifiers, were employed in this study. The Synthetic Minority Oversampling Technique (SMOTE) was used in data preprocessing to cope with the class imbalance and noise present in the heart disease dataset. According to the experimental results, the best model-building times among the heterogeneous ensemble classifiers are 7.34 seconds for RF-CART and 7.89 seconds for the RF-RedEPTree ensemble, while AltDTree-CART took the worst time of 598.02 seconds. With 86.29% prediction accuracy, RF-CART outperforms the other heterogeneous classification algorithms, followed by RF-RedEPTree at 85.45%. Among the AdaBoost ensemble classifiers, AdaBoost-RF exhibits the lowest error rates of 0.14 MAE and 0.38 RMSE. Across all experiments, the comparison of classifier performance revealed that AdaBoost-RF is the best classifier overall, with 95.47% accuracy.

Algorithm 1. Pseudocode of NB classifier.

Algorithm 2. Pseudocode of AdaBoost classifier.

Data Availability

The UCI repository data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  • 1.World Health Organization. Cardiovascular Diseases . Geneva, Switzerland: WHO; 2020. https://www.who.int/health-topics/cardiovascular-diseases/#tab=tab_1 . [Google Scholar]
  • 2.Heidenreich P. A., Trogdon J. G., Khavjou O. A., et al. Forecasting the future of cardiovascular disease in the United States. Circulation . 2011 Mar 1;123(8):933–944. doi: 10.1161/CIR.0b013e31820a55f5. [DOI] [PubMed] [Google Scholar]
  • 3.Soni J., Ansari U., Sharma D., Soni S. Predictive data mining for medical diagnosis: an overview of heart disease prediction. International Journal of Computer Application . 2011;17(8):43–48. doi: 10.5120/2237-2860. [DOI] [Google Scholar]
  • 4.Jee S. H., Jang Y., Oh D. J., et al. A coronary heart disease prediction model: the Korean Heart Study. BMJ Open . 2014;4(5):e005025. doi: 10.1136/bmjopen-2014-005025. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.The Economist. From Not Working to Neural Networking . 2016. http://www.economist.com/news/specialreport/21700756-artificialintelligence-boom-based-old-idea-modern-twist-not . [Google Scholar]
  • 6.Ben-David S., Shalev-Shwartz S. Understanding Machine Learning: From Theory to Algorithms . Cambridge, UK: Cambridge University Press; 2020. [Google Scholar]
  • 7.Bote M. P. M., D. S., Deshmukh S. D. Heart disease prediction system using naive Bayes. Int J. Enhanced Res. Sci. Technol. Eng . 2013;2(3) [Google Scholar]
  • 8.Ganna A., Magnusson P. K. E., Pedersen N. L., et al. Multilocus genetic risk scores for coronary heart disease prediction. Arteriosclerosis, Thrombosis, and Vascular Biology . 2013;33(9):2267–2272. doi: 10.1161/atvbaha.113.301218. [DOI] [PubMed] [Google Scholar]
  • 9.American Heart Association. Heart Failure . Chicago, IL, USA: American Heart Association; 2020. https://www.heart.org/en/health-topics/heart-failure . [Google Scholar]
  • 10.Dandapath A., Raja M. K. Heart disease prediction using machine learning techniques: a survey. International Journal of Engineering & Technology . 2018;7(2):684–687. [Google Scholar]
  • 11.Soni J., Soni S., Sharma D., Ansari U. Intelligent and effective heart disease prediction system using weighted associative classifiers. International Journal on Computer Science and Engineering . 2011;3(6):2385–2392. [Google Scholar]
  • 12.Subramanian R., Parthiban L. Intelligent heart disease prediction system using CANFIS and genetic algorithm. International Journal of Biological, Biomedical and Medical Sciences . 2008;3(3) [Google Scholar]
  • 13.Samuel O. W., Asogbon A. K., Fang P., Li G. An integrated decision support system based on ANN and Fuzzy_AHP for heart failure risk prediction. Expert Systems with Applications . 2017;68:163–172. doi: 10.1016/j.eswa.2016.10.020. [DOI] [Google Scholar]
  • 14.Kumaraswamy Y., Patil S. B. Intelligent and effective heart attack prediction system using data mining and artificial neural network. European Journal of Scientific Research . 2009;31:642–656. [Google Scholar]
  • 15.Singaraju J., Vanisree K. Decision support system for congenital heart disease diagnosis based on signs and symptoms using neural networks. International Journal of Computer Application . 2015;19:6–12. [Google Scholar]
  • 16.Edmonds B. Proceedings of AISB Symposium on Socially Inspired Computing . Hatfield; 2005. pp. 1–12. [Google Scholar]
  • 17.Delen D., Walker G., Kadam A. Predicting breast cancer survivability: a comparison of three data mining methods. Artificial Intelligence in Medicine . 2005 Jun;34(2):113–127. doi: 10.1016/j.artmed.2004.07.002. [DOI] [PubMed] [Google Scholar]
  • 18.Kiyasu J. Y. U.S. Patent No. 4,338,396 . Washington, DC: U.S. Patent and Trademark Office; 1982. [Google Scholar]
  • 19.Raihan M., Mondal S., More A., et al. Smartphone based ischemic heart disease (heart attack) risk prediction using clinical data and data mining approaches, a prototype design. Proceedings of the 19th International Conference on Computer and Information Technology (ICCIT); December 2016; Dhaka, Bangladesh. IEEE; pp. 299–303. [DOI] [Google Scholar]
  • 20.Tolosi L., Lengauer T. Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics . 2011;27(14):p. 1986. doi: 10.1093/bioinformatics/btr300. [DOI] [PubMed] [Google Scholar]
  • 21.Hambali Moshood A., Gbolagade Morufat D. Ovarian cancer classification using hybrid synthetic minority over-sampling technique and neural network. Journal of Advances in Computer Research (JACR) . 2016;7(4):109–124. [Google Scholar]
  • 22.Vandar Kuzhali J., Vengataasalam S. A novel ensemble classifier based classification on large datasets with hybrid feature selection approach. Research Journal of Applied Sciences, Engineering and Technology . 2014;7(17):3633–3642. doi: 10.19026/rjaset.7.716. [DOI] [Google Scholar]
  • 23.Lavanya D. Usha R. K. Ensemble decision tree classifier for breast cancer data. International Journal of Information Technology and Computer Science . 2012;2(1):17–24. doi: 10.5121/ijitcs.2012.2103. [DOI] [Google Scholar]
  • 24.Shipp C. A., Kuncheva L. I. Relationships between combination methods and measures of diversity in combining classifiers. Information Fusion . 2002;3(2):135–148. doi: 10.1016/s1566-2535(02)00051-9. [DOI] [Google Scholar]
  • 25.Lior R. Ensemble-based classifiers. Artificial intelligence. Review . 2010;33:1–39. [Google Scholar]
  • 26.Leskovec J., Grover A. node2vec: scalable feature learning for networks. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 2016; San Francisco, CA, USA. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Breiman L., Friedman J. H., Olshen R. A., Stone C. J. Classification and Regression Trees . Monterey, C. A: Wadsworth & Brooks/Cole Advanced Books & Software; 1984. [Google Scholar]
  • 28.Sharma K., Mahesh T. R., Bhuvana J. Big data technology for developing learning resources. Journal of Physics: Conference Series . IOP Publishing Ltd; May 2021;1979(1):p. 012019. [DOI] [Google Scholar]
  • 29.Sarveshvar M. R., Gogoi A., Chaubey A. K., Rohit S., Mahesh T. R. Performance of different machine learning techniques for the prediction of heart diseases. Proceedings of the International Conference on Forensics, Analytics, Big Data, Security (FABS); December 2021; Bengaluru, India. pp. 1–4. [DOI] [Google Scholar]
  • 30.Witten I. H., Frank E. The United States of America, Morgan Kaufmann Series in Data Management Systems . 2nd Edition 2005. Data Mining Practical Machine Learning Tools and Techniques. [Google Scholar]
  • 31.Shashikala H. K., Mahesh T. R., Vivek V., Sindhu M. G., Saravanan C., Baig T. Z. Early detection of spondylosis using point-based image processing techniques. Proceedings of the 2021 International Conference on Recent Trends on Electronics, Information, Communication & Technology (RTEICT); August 2021; Bangalore, India. pp. 655–659. [DOI] [Google Scholar]
  • 32.Hussein A. S., Li T., Yohannese C. W., Bashir K. A-SMOTE: a new pre-processing approach for highly imbalanced datasets by improving SMOTE international journal of computational intelligence systems. 2019;12(2):p. 1412. [Google Scholar]
  • 33.Chaitanya Reddy P., Chandra R. M. S., Vadiraj P., Ayyappa Reddy M., Mahesh T. R., Sindhu Madhuri G. Detection of plant leaf-based diseases using machine learning approach. Proceedings of the 2021 IEEE International Conference on Computation System and Information Technology for Sustainable Solutions (CSITSS); December 2021; Bangalore, India. pp. 1–4. [DOI] [Google Scholar]


