Abstract
A computer-aided diagnosis (CAD) system that employs a super learner to diagnose the presence or absence of a disease has been developed. Each clinical dataset is preprocessed and split into a training set (60%) and a testing set (40%). A wrapper approach that uses three bioinspired algorithms, namely, cat swarm optimization (CSO), krill herd (KH), and bacterial foraging optimization (BFO), with the classification accuracy of a support vector machine (SVM) as the fitness function has been used for feature selection. The selected features of each bioinspired algorithm are stored in three separate databases. The features selected by each bioinspired algorithm are used to train three back propagation neural networks (BPNN) independently using the conjugate gradient algorithm (CGA). Classifier testing is performed by using the testing set on each trained classifier, and the diagnostic results obtained are used to evaluate the performance of each classifier. The classification results obtained for each instance of the testing set of the three classifiers, together with the class label associated with each instance of the testing set, form the candidate instances for training and testing the super learner. The super learner's training set comprises 80% of these instances, and its testing set comprises the remaining 20%. Experimentation has been carried out using seven clinical datasets from the University of California Irvine (UCI) machine learning repository. The super learner has achieved a classification accuracy of 96.83% for the Wisconsin diagnostic breast cancer dataset (WDBC), 86.36% for the Statlog heart disease dataset (SHD), 94.74% for the hepatocellular carcinoma dataset (HCC), 90.48% for the hepatitis dataset (HD), 81.82% for the vertebral column dataset (VCD), 84% for the Cleveland heart disease dataset (CHD), and 70% for the Indian liver patient dataset (ILP).
1. Introduction
Data related to symptoms observed on a patient at a point of time are stored in electronic health records (EHRs). Interesting patterns can be extracted from the data stored in EHRs, the extracted patterns can be represented as knowledge, and this knowledge can assist physicians in diagnosing the presence or absence of a disease. Data mining tasks, namely, association rule mining, classification, and clustering, are used to mine valuable patterns from the data stored in EHRs. Clinical decision support systems (CDSSs) that assist physicians in diagnosing the presence or absence of a disease can be developed from data stored in EHRs using bioinspired algorithms and data mining techniques. Although several algorithms have been proposed by researchers for association rule mining, classification, and clustering, no algorithm can be considered the "universal best." Quality of data and data distribution are the two key factors that determine the effectiveness of a data mining task. The performance of a data mining task also depends on how effectively data preprocessing has been performed. Classification plays a major role in the development of CDSSs. Classification is a two-step process: first, building the classifier and, second, model usage. Building the classifier is the process of training the classifier with a supervised learning algorithm. Model usage is the process of estimating the accuracy of the classifier using testing instances, commonly referred to as the testing set. Overfitting and underfitting are two major problems associated with building the classifier.
A clinical dataset (Cs) used for classifier construction is split into a training set (Ts) and a testing set (Tt). Researchers have proposed different methods to identify the Ts and Tt. One common method is to use 80% of the dataset as Ts and 20% of the dataset as Tt. For clinical decision-making, a balanced dataset is essential for building a prediction model. Clinical datasets are normally not balanced, and classification methods perform poorly on minority class samples when the dataset is severely imbalanced. For example, consider a Cs with n instances, each instance associated with a class label c1 or c2. If 75% of the n instances in Cs are associated with class label c1 and 25% with class label c2, the class labels in Cs are clearly not equally represented, and therefore, Cs is imbalanced. In this context, c1 is the majority class and c2 is the minority class; hence, constructing a classifier with class-imbalanced data will lead to bias in favor of the majority class. One method to handle class imbalance in a Cs is to generate additional instances from the minority class. The Synthetic Minority Oversampling Technique (SMOTE) [1] is one of the prevailing methods used to generate additional training and testing instances.
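The rebalancing step can be illustrated with a short sketch. The snippet below is a minimal example using the imbalanced-learn library on synthetic data, not the exact preprocessing pipeline of this work; it oversamples the minority class with SMOTE so that both class labels become equally represented.

```python
# Minimal sketch: rebalancing an imbalanced dataset with SMOTE
# (imbalanced-learn library; toy data, illustrative only).
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
# Toy imbalanced dataset: 75% majority (class 0), 25% minority (class 1).
X = rng.normal(size=(200, 5))
y = np.array([0] * 150 + [1] * 50)

print("Before SMOTE:", Counter(y))          # {0: 150, 1: 50}
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE: ", Counter(y_res))      # classes equally represented
```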
A training instance can be defined as a tuple ti = (f1, f2, ⋯, fm), where ti represents a training instance and f1, f2, ⋯, fm represent its features. The subscript i in ti can range from 1 to n, where n is the number of instances. The subscript j in fj can range from 1 to m, where m is the number of features. Using irrelevant features to train a classifier will affect its performance. Selecting the optimal features from the Cs and then training the classifier will enhance the accuracy of the classifier. Feature selection methods can be supervised, unsupervised, or semisupervised, depending upon whether the training set is labeled or not. Commonly used supervised feature selection methods are filter and wrapper methods. The filter method considers the dependency of each feature on the class label and is independent of any classification algorithm. Measures, namely, information gain [2], gain ratio [3], Gini index [4], Laplacian score [5], and cosine similarity [6], can be used to rank the features; other ranking measures can also be used in the filter method. The wrapper method considers the classification accuracy of a learning algorithm to select the relevant features. Researchers are using a confluence of disciplines to develop computer-aided diagnostic (CAD) systems to assist physicians.
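As an illustration of the filter approach, the following sketch ranks features by their mutual information with the class label, an information-gain-style dependency score. The dataset and the choice of scikit-learn's mutual_info_classif are illustrative assumptions, not part of the proposed method.

```python
# Minimal sketch of a filter method: rank features by mutual information
# with the class label and keep the top-k (illustrative only).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(scores)[::-1]     # best-scoring feature first
top_k = ranking[:10]                   # keep the 10 highest-ranked features
print("Top-10 feature indices:", top_k)
```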
Knowledge mining using rough sets for feature selection and backpropagation neural network (BPNN) for classifying clinical datasets has been proposed in [7]. A CDSS to diagnose Urticaria using Bayes classification is proposed in [8]. CDSSs to diagnose lung disorders are proposed in [9–14]. A CDSS to diagnose the severity of gait disturbances using a Q-backpropogated time delay neural network on patients affected by Parkinson's disease is proposed in [15]. A statistical tolerance rough set induced decision tree classifier to classify multivariate time series clinical data is proposed in [16]. A CDSS to diagnose gestational diabetes mellitus using the fuzzy logic and radial basis function neural network is proposed in [17]. Use of fuzzy sets and extreme learning machine to classify clinical datasets is proposed in [18]. Wind-driven swarm optimization, a metaheuristic method to classify clinical datasets, is proposed in [19]. A computer-aided diagnostic system that uses a neural network classifier trained using differential evolution, particle swarm optimization, and gradient descent backpropagation algorithms is proposed in [20]. A radial basis function neural network to classify clinical datasets using k-means clustering algorithm and quantum-behaved particle swarm optimization is proposed in [21]. Classifying clinical unevenly spaced time series data by imputing missing values has been proposed in [22]. A framework to classify unevenly spaced time series clinical data using improved double exponential smoothing, rough sets, neural network, and fuzzy logic is proposed in [23].
An outline of nature-inspired algorithms for optimization is presented in [24]. The cooperative, intelligent actions of insect or animal groups in nature, for example, colonies of ants, schools of fish, flocks of birds, and swarms of bees and termites, have attracted the attention of researchers. Entomologists have studied the collective behavior of insects and animals to model biological swarms, and engineers have applied these models as a framework to solve complex real-world problems.
In this work, a CAD system that employs a super learner to diagnose the presence or absence of a disease has been proposed. The bioinspired algorithms used in this work are cat swarm optimization (CSO), krill herd (KH), and bacterial foraging optimization (BFO). The classifiers used in this work are support vector machine (SVM) and BPNN trained using the conjugate gradient algorithm.
The rest of the paper is organized as follows: the abbreviations used in the manuscript are presented in Section 2. An outline of the related work is presented in Section 3. An outline of the datasets used is presented in Section 4. The framework of the proposed classifier is presented in Section 5. The results and discussions are presented in Section 6. Finally, the conclusion and scope for future work are presented in Section 7.
2. Abbreviations Used
Table 1 presents the abbreviations used in the rest of the manuscript in alphabetical order.
Table 1.
Abbreviation | Phrase |
---|---|
ABCO | Artificial bee colony optimization |
ACO | Ant colony optimization |
ANN | Artificial neural networks |
BCS | Binary cuckoo search |
BFA | Binary firefly algorithm |
BFO | Bacterial foraging optimization |
BP | Back propagation |
BPNN | Back propagation neural network |
CAD | Computer-aided diagnosis |
CDC | Counts of dimension to change |
CDSS | Clinical decision support system |
CFCSA | Hybrid crow search optimization algorithm |
CGA | Conjugate gradient algorithm |
CHD | Cleveland heart disease |
CMVO | Chaotic multiverse optimization |
CSM | Cosine similarity measure |
CSO | Cat swarm optimization |
CT | Computed tomography |
DE | Differential evolution |
DGA | Distance-based genetic algorithm |
DISON | Diverse intensified strawberry optimized neural network |
DNN | Deep neural network |
E. coli | Escherichia coli bacteria |
ECSA | Enhanced crow search algorithm |
ELM | Extreme learning machine |
FBFO | Feature selected by bacterial foraging optimization |
FCM | Fuzzy C-means |
FCSO | Feature selected by cat swarm optimization |
FFO | Firefly optimization |
FKH | Feature selected by krill herd |
GA | Genetic algorithm |
GSO | Glowworm swarm optimization |
HCC | Hepatocellular carcinoma |
HD | Hepatitis |
IBPSO | Improved binary particle swarm optimization |
ILP | Indian liver patient |
ISSA | Improved Salp swarm algorithm |
KH | Krill herd |
k-NN | k-nearest neighbors |
LO | Lion optimization |
LR | Logistic regression |
MCC | Matthews correlation coefficient |
MFO | Moth-flame optimization |
ML | Machine learning |
MPNN | Multilayer perceptron neural network |
MR | Mixed ratio |
NB | Naive Bayes |
PCC | Pearson correlation coefficient |
PID | Pima Indian diabetes |
PSO | Particle swarm optimization |
RD | Random diffusion |
RDM | Rough dependency measure |
RF | Random forest |
RoIs | Regions of interest |
SHD | Statlog heart disease |
SMOTE | Synthetic minority oversampling technique |
SMP | Seeking memory pool |
SPC | Self-position consideration |
SRD | Seeking range of the selected dimension |
SVC | Support vector classification |
SVM | Support vector machine |
TS | Thoracic surgery |
UCI | University of California Irvine |
VCD | Vertebral column dataset |
WBC | Wisconsin breast cancer |
WDBC | Wisconsin diagnostic breast cancer |
WOA | Whale optimization algorithm |
3. Literature Survey
Leema et al. [25] in their work have studied the significance of setting appropriate parameter values for training artificial neural networks using the backpropagation algorithm. The parameters are initial weight selection, bias, activation function used, number of hidden layers, number of neurons per hidden layer, number of training epochs, minimum error, and momentum term. Twelve backpropagation learning algorithms have been used in this study. Experimentation has been carried out using three clinical datasets from the UCI ML repository, namely, the PID, hepatitis, and WBC datasets.
Elgin et al. [26] in their work have proposed a clinical decision-making system to diagnose allergic rhinitis. A wrapper approach that uses GA and the accuracy of an ELM classifier as the fitness function has been used for feature selection. The selected features have been used to train an ELM classifier. An intradermal skin test dataset of 872 patients collected from Good Samaritan Lab Services and Allergy Testing Centre, Chennai, has been used in this work, and an accuracy of 97.7% has been achieved.
Sreejith et al. [27] in their work have proposed a framework for classifying clinical datasets which uses an embedded approach for feature selection and a DISON for classification. The feature selection is performed by computing the feature importance of every attribute using an extremely randomized tree classifier. Classification is performed using DISON which is a feed forward neural network whose weights and bias are optimized in two stages first, by using a strawberry optimization algorithm and then by using a gradient descent BP algorithm. Vertebral column, PID, CHD, and SHD datasets from the UCI ML repository have been used for experimentation. The framework has achieved an accuracy of 87.17% for vertebral column, 90.92% for PID, 93.67% for CHD, and 94.5% for SHD.
Sreejith et al. [28] in their work have proposed a framework for CDSSs which addresses the data imbalance problems associated with clinical datasets. The datasets are rebalanced using SMOTE enhanced with Orchard's algorithm. Feature selection is performed using a wrapper approach where CMVO is used to select the feature subsets, and an RF classifier is used to evaluate the goodness of the features. The arithmetic mean of the MCC and F-score computed using the RF classifier is used as the fitness function. Finally, an RF classifier comprising 100 decision trees that use information gain ratio as the split criterion is used for classifying the clinical data. Three clinical datasets from the UCI ML repository, namely, the ILP, TS, and PID datasets, have been used for experimentation. The proposed framework achieved 0.65 MCC, 0.84 F-score, and 82.46% accuracy for ILP; 0.74 MCC, 0.87 F-score, and 86.88% accuracy for TS; and 0.78 MCC, 0.89 F-score, and 89.04% accuracy for PID.
Isaac et al. [29] in their work have proposed a CAD system to diagnose pulmonary emphysema from chest CT slices. A spatial intuitionistic fuzzy C-means clustering algorithm has been used to segment the lung parenchyma and extract the RoIs. From the RoIs, shape, texture, and run-length features have been extracted, and feature selection has been performed using a wrapper approach that employs four bioinspired algorithms with the classification accuracy of SVM as the fitness function. The bioinspired algorithms used are MFO, FFO, ABCO, and ACO. A tenfold crossvalidation technique has been used, and each feature set has been trained using an ELM classifier. Two independent datasets, one consisting of CT slices collected from hospitals and the other consisting of CT slices from a benchmark repository, have been used for classification. Maximum classification accuracies of 89.19% for MFO, 91.89% for FFO, 83.78% for ABCO, 86.49% for ACO, and 75.68% without feature selection have been achieved.
Elgin et al. [30] in their work have performed feature selection and instance selection using a wrapper approach that employs cooperative coevolution with the classification accuracy of a random forest classifier as the fitness function. The optimal feature set is used to train a random forest classifier. Seven datasets, namely, WDBC, HD, PID, CHD, SHD, VCD, and HCC from the UCI ML repository have been used for experimentation. Accuracies of 97.1%, 82.3%, 81.01%, 93.4%, 96.8%, 91.4%, and 72.2% have been achieved for the WDBC, HD, PID, CHD, SHD, VCD, and HCC datasets, respectively.
Anter et al. [31] in their work have developed CFCSA by integrating chaos theory and the FCM method to find the optimal feature subset. Ten clinical datasets from the UCI ML repository have been used for experimentation. The features of each clinical dataset have been normalized, and then random chaotic motion has been incorporated into CFCSA in the form of chaotic maps. The objective function of the FCM has been used as the fitness function, in which the crow with the best fitness has been considered the best solution. Comparison has been done with chaotic ant lion optimization, binary ant lion optimization, and the binary crow search algorithm, and it has been inferred that CFCSA outperforms these algorithms in all the datasets used for experimentation.
Elgin et al. [32] in their work have proposed a correlation-based ensemble feature selection using a wrapper approach that employs three bioinspired algorithms using differential evolution, lion optimization, and glowworm swarm optimization with the accuracy of the AdaboostSVM classifier as the fitness function. Tenfold crossvalidation technique has been used, and the optimal features selected have been used to train a gradient descent BP neural network with variable learning rates. Two clinical datasets from the UCI ML repository, namely, hepatitis and WDBC have been used for experimentation. An accuracy of 93.902% for hepatitis and 98.734% for WDBC datasets have been achieved.
Sweetlin et al. [33] in their work have proposed a CAD system to diagnose pulmonary tuberculosis from chest CT slices. The region growing algorithm has been used for segmenting the lung fields, followed by edge reconstruction. The manifestations of pulmonary tuberculosis, namely, cavities, consolidations, and nodules, have been considered to be RoIs. From the extracted RoIs, texture, run-length, and shape features have been extracted, and feature selection has been performed using a wrapper approach that employs the BCS algorithm with the accuracy of a one-against-all multiclass SVM classifier as the fitness function. The cuckoo search algorithm has been implemented in two ways, first, using an entropy measure and, second, without using an entropy measure. Using the selected features, training is performed using a one-against-all multiclass SVM classifier. Accuracies of 85.54% for the BCS algorithm with the entropy measure and 84.65% for the BCS algorithm without the entropy measure have been achieved.
Sweetlin et al. [34] in their work have proposed a CAD system to diagnose pulmonary hamartoma nodules from chest CT slices. Otsu's thresholding method has been used to segment lung parenchyma from the CT slices. Nodules are considered to be the RoIs, and from the RoIs, texture, shape, and run-length features have been extracted. Feature selection has been performed using filter evaluation measures, namely, CSM and RDM, with the ACO algorithm. The features selected by ACO-CSM and ACO-RDM have been used to train three classifiers, namely, SVM, NB, and J48 decision tree classifiers. A maximum classification accuracy of 94.36% has been achieved by the SVM classifier trained with 38 features selected using ACO-RDM.
Sweetlin et al. [35] in their work have proposed a CAD system to diagnose pulmonary bronchitis from CT slices of the lung. Optimal thresholding has been used to segment the left and right lung fields from the lung CT slices. The RoIs are identified, and from the RoIs, texture and shape features have been extracted. Feature selection has been performed using a hybrid ACO algorithm combined with tandem-run recruitment based on cosine similarity, with the accuracy of the SVM classifier as the fitness function. The selected features have been used to train an SVM classifier. An accuracy of 81.66% for ACO with the tandem-run strategy, 78.10% for ACO without the tandem-run strategy, and 75.14% without feature selection has been achieved.
Raj et al. [36] in their work have proposed DGA for feature selection to develop a CAD system to diagnose lung disorders from chest CT slices. The entire dataset has been split into two sets: one containing 90% of the data and the other containing 10%. Out of the 90%, half has been used as the training set and the other half as the validation set for evaluating the objective function. The set containing 10% of the data has been used as the testing set. The objective function has been defined as the sum of the squared deviations of each instance in the training set of each class from each instance in the validation set of the corresponding class. GA has been used for feature selection by minimizing the proposed objective function, resulting in the proposed DGA. The GA has been iterated over several generations to obtain individuals that best fit the objective function. Classification has been performed using a k-NN classifier to classify the RoIs into one of four classes, namely, bronchiectasis, tuberculosis, pneumonia, and normal. An average accuracy of 88.16% with feature selection and an average accuracy of 86.46% without feature selection have been achieved.
Zawbaa et al. [37] in their work have performed feature selection using a wrapper approach that uses the MFO algorithm with the accuracy of k-NN classifier as the fitness function. Eighteen datasets from the UCI ML repository have been used for experimentation among which four are clinical datasets. Comparison has been done with PSO and GA, and it has been inferred that MFO outperforms in fourteen datasets among which three are clinical datasets.
Shu-Chuan et al. [38] in their work have presented an algorithm called CSO by modeling the natural behavior of cats. The CSO algorithm considers two biological characteristics of cats, namely, seeking mode and tracing mode. Cats spend most of their waking time resting. Nevertheless, while resting, their perception remains high, and they are well aware of what is happening around them. Cats continuously observe their environment, and when they perceive prey, they advance towards it rapidly. While resting, they change their position cautiously and slowly, occasionally even staying in the original position. The seeking mode represents this resting behavior in CSO, and the tracing mode represents the behavior of cats advancing towards prey. The performance of CSO has been evaluated by applying CSO, standard PSO, and PSO with a weighting factor to six benchmark functions. The results obtained reveal that the proposed CSO performs better compared to PSO and PSO with a weighting factor.
Gandomi et al. [39] in their work have proposed a swarm intelligence algorithm named the KH algorithm to solve optimization tasks; it is centered on the imitation of the herding behavior of krill swarms with respect to specific biological and environmental processes. The fitness function of each krill individual has been defined as its least distance from food and from the highest density of the herd. Three vital actions considered to define the time-dependent position of an individual krill are, one, movement induced by other krill individuals, two, foraging activity, and three, random diffusion. The KH algorithm has been tested using twenty benchmark functions and compared with eight algorithms. Experimental results indicate that the KH algorithm can outperform these well-known algorithms.
Chen et al. [40] have proposed a cooperative bacterial foraging optimization (CBFO) algorithm. Two cooperative methods, serial heterogeneous cooperation at the implicit space decomposition level and at the hybrid space decomposition level, are applied to the original BFO [41] to solve complex optimization problems and achieve significant improvement. The authors have compared the performance of the two CBFO variants with the original BFO, PSO, and GA on four commonly used benchmark functions. The experimental results indicate that CBFO achieves better performance than the original BFO, PSO, and GA.
Chen et al. [42] have proposed an adaptive bacterial foraging optimization (ABFO) algorithm for function optimization. Adaptive foraging strategies are used to improve the performance of the original BFO by enabling it to adjust the run-length unit parameter dynamically during execution. The experimental results are compared with the original BFO, PSO, and GA using four benchmark functions. The proposed ABFO performs better than the original BFO and is competitive with PSO and GA.
From the literature, it is evident that classifier training using relevant features enhances the accuracy of the classifier. It can also be inferred that wrapper-based feature selection that employs bioinspired algorithms performs better in numerous cases compared to traditional feature selection methods.
4. Outline of the Datasets Used
Seven clinical datasets from the UCI ML repository, namely, WDBC, SHD, HCC, HD, VCD, CHD, and ILP have been used for binary classification. An outline of each dataset used is presented in Table 2.
Table 2.
Dataset name | No. of instances | No. of features∗ | No. of missing values | Class labels with no. of instances associated with each class label | Interpretation of class labels |
---|---|---|---|---|---|
WDBC | 569 | 31 | Nil | M (212)/B (357) | M-malignant, B-benign |
SHD | 270 | 13 | Nil | 2 (120)/1 (150) | 2-present, 1-absent |
HCC | 165 | 49 | 826 | 0 (63)/1 (102) | 0-dies, 1-lives |
HD | 155 | 18 | 167 | 1 (32)/2 (123) | 1-die, 2-live |
VCD | 310 | 6 | Nil | 0 (210)/1 (100) | 0-abnormal, 1-normal |
CHD | 303 | 13 | Nil | 1 (139)/2 (164) | 1-presence, 2-absence |
ILP | 583 | 10 | Nil | 1 (416)/2 (167) | 1-diseased, 2-nondiseased |
∗ without class label.
5. System Framework
The framework for feature selection and classification of clinical datasets using bioinspired algorithms and a super learner is presented in Figure 1. The major building blocks of the framework are data preprocessing, feature selection, classifier training, classifier testing and dataset construction for the super learner, and super learner training and testing. Each building block is outlined below.
5.1. Preprocessing
Each clinical dataset (Cs) has been subjected to preprocessing prior to feature selection to enhance the quality of data. Mean imputation has been used to handle missing values, and SMOTE is used to handle the class imbalance problem in each Cs by generating additional instances from the minority class.
Normalization has been used to scale the value of a feature so that the value falls in a specified range and is predominantly useful when constructing a classifier involving a neural network. Training a classifier using normalized data will speed up learning. In this work, the range is 0 to 1, and min-max normalization is used. When an attribute "A" in a clinical dataset Cs is subjected to min-max normalization, the minimum value (minA) and maximum value (maxA) in the value set of "A" are first identified, and normalization is performed using the formula presented in Equation (1).
a′ = ((a − minA) / (maxA − minA)) × (newmaxA − newminA) + newminA    (1)
In this formula, a′ is the normalized value of an attribute value a, where a is drawn from the value set of "A." Since min-max normalization is used to normalize the values in the range 0 to 1, the value of newmaxA is 1 and newminA is 0.
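A minimal sketch of Equation (1) applied column-wise to a small matrix is shown below; the function name min_max_normalize is illustrative, not part of the proposed system.

```python
# Minimal sketch of min-max normalization (Equation (1)) scaling each
# attribute (column) of a dataset to the range [0, 1].
import numpy as np

def min_max_normalize(X, new_min=0.0, new_max=1.0):
    """Scale each column of X to [new_min, new_max]."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid division by zero
    return (X - col_min) / span * (new_max - new_min) + new_min

X = np.array([[200.0, 3.0], [400.0, 9.0], [300.0, 6.0]])
print(min_max_normalize(X))   # values now lie in [0, 1] column-wise
```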
The number of instances in each Cs used for constructing and testing the classifier prior to generating additional samples using SMOTE, the number of instances in each Cs after generating additional samples using SMOTE, the number of instances in the training set (Ts), and the number of instances in the testing set (Tt) are presented in Table 3. After preprocessing, each Cs is split into a training set (60%) and a testing set (40%).
Table 3.
Instances | WDBC dataset | SHD dataset | HCC dataset | HD dataset | VCD dataset | CHD dataset | ILP dataset |
---|---|---|---|---|---|---|---|
Total number of instances before SMOTE | 569 | 270 | 165 | 155 | 310 | 303 | 583 |
Total number of instances after SMOTE | 780 | 270 | 228 | 251 | 410 | 303 | 750 |
Number of training instances for FCSO/FKH/FBFO classifiers (60% of the total number of instances after SMOTE) | 468 | 162 | 137 | 151 | 246 | 182 | 450 |
Number of testing instances for FCSO/FKH/FBFO classifiers (40% of the total number of instances after SMOTE) | 312 | 108 | 91 | 100 | 164 | 121 | 300 |
Number of training instances for super learner (80% of the total testing instances∗ for FCSO/FKH/FBFO classifiers) | 250 | 86 | 73 | 80 | 131 | 97 | 240 |
Number of testing instances for super learner (20% of the total testing instances∗ for FCSO/FKH/FBFO classifiers) | 62 | 22 | 18 | 20 | 33 | 24 | 60 |
∗Each instance refers to the classification result pertaining to each instance of the testing set for FCSO, FKH, and FBFO classifiers and the class label corresponding to each instance of the testing set.
5.2. Feature Selection
Feature selection is performed on each Ts used for experimentation to select the optimal features for training the classifier. Selecting the optimal features from the Ts will improve the classification accuracy. A wrapper approach that uses three bioinspired algorithms, namely, CSO, KH, and BFO, with the accuracy of the SVM classifier as the fitness function is used to perform feature selection. An outline of CSO, KH, and BFO as used for feature selection is presented below.
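The core of any such wrapper is a fitness function that scores a candidate feature subset by the accuracy of an SVM trained on it. The sketch below illustrates this idea with scikit-learn; the dataset, the RBF kernel, and the 5-fold cross-validation are illustrative assumptions rather than the exact settings of the proposed system.

```python
# Minimal sketch of a wrapper fitness function: the candidate solution is a
# binary feature mask, and its fitness is the cross-validated accuracy of an
# SVM trained on the selected columns.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

def fitness(mask, X, y):
    """mask: boolean vector with one entry per feature."""
    if not mask.any():                          # an empty subset is invalid
        return 0.0
    acc = cross_val_score(SVC(kernel="rbf"), X[:, mask], y, cv=5)
    return acc.mean()

mask = np.random.default_rng(1).random(X.shape[1]) < 0.5
print("Selected features:", int(mask.sum()), "fitness:", round(fitness(mask, X, y), 4))
```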
5.2.1. Outline of the CSO Algorithm for Feature Selection
CSO is inspired by and modeled on two main postures of cats, namely, resting and tracing. Mimicking the resting behavior of a cat is named the seeking mode, and mimicking the tracing behavior of a cat is named the tracing mode. The seeking mode relates to a local search process, whereas the tracing mode relates to a global search process; the tracing mode corresponds to a cat's movement while chasing prey, for example, chasing a rat. The vital parameters that play an important role in CSO are outlined in Table 4.
Table 4.
Parameter | Description |
---|---|
SMP | SMP is used to define the size of the seeking memory of each cat. Each cat selects possible neighborhood position from a set of solutions. |
SRD | SRD is used to define the seeking range of the selected dimension. |
CDC | CDC is a count of dimensions to be changed in seeking mode. |
SPC | SPC is a Boolean flag indicating whether the cat's current position will be considered as one of the candidate positions to move to. |
N | Number of cats |
MR | Mixed ratio of cats |
C | Constant value |
D | Size of dimension |
R | Random number in the range of [0,1] |
The steps to select the optimal feature subset using CSO are outlined below (Algorithm 1):
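Algorithm 1 itself is given as a figure; as a rough illustration of how the seeking and tracing modes can drive feature selection, the following heavily simplified sketch evolves a small population of binary feature masks, scoring each mask by SVM accuracy. The parameter names SMP, CDC, and MR follow Table 4, but the update rules here are simplified assumptions, not the authors' Algorithm 1.

```python
# Highly simplified sketch of binary CSO for feature selection
# (illustrating only the seeking/tracing idea; illustrative parameters).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_feat = X.shape[1]

def fitness(mask):
    # SVM accuracy on the selected feature columns (wrapper fitness)
    if not mask.any():
        return 0.0
    return cross_val_score(SVC(), X[:, mask], y, cv=3).mean()

N, SMP, CDC, MR, iters = 6, 4, 3, 0.3, 5        # small values to keep the demo fast
cats = rng.random((N, n_feat)) < 0.5            # random binary positions (feature masks)
fit = np.array([fitness(c) for c in cats])

for _ in range(iters):
    best = cats[fit.argmax()].copy()            # global best cat so far
    tracing = rng.random(N) < MR                # roughly MR of the cats trace, the rest seek
    for i in range(N):
        if tracing[i]:                          # tracing mode: move towards the best cat
            copy_bits = rng.random(n_feat) < 0.5
            cats[i, copy_bits] = best[copy_bits]
        else:                                   # seeking mode: local search around the cat
            copies = np.repeat(cats[i][None, :], SMP, axis=0)
            for c in copies:
                flip = rng.choice(n_feat, size=CDC, replace=False)
                c[flip] = ~c[flip]              # mutate CDC randomly chosen dimensions
            cats[i] = copies[int(np.argmax([fitness(c) for c in copies]))]
        fit[i] = fitness(cats[i])

best_mask = cats[fit.argmax()]
print("Selected", int(best_mask.sum()), "features; wrapper accuracy:", round(fit.max(), 4))
```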
5.2.2. Outline of the KH Algorithm for Feature Selection
The KH algorithm is centered on the imitation of the herding behavior of krill swarms with respect to specific biological and environmental processes. Krill density is reduced by predators such as seals, penguins, and seabirds. The herding of the krill individuals includes, one, increasing the krill density and, two, reaching the food. The fitness function of each krill individual has been defined as its least distance from food and from the highest density of the herd.
Three vital actions considered to define the time-dependent position of an individual krill are one, movement induced by other krill individuals, two, foraging activity, and three, random diffusion.
Krill individuals attempt to maintain a high density and hence move due to their mutual effect. Local swarm density, target swarm density, and repulsive swarm density are used to estimate the direction of motion. Food location and prior experience about the food location are the two parameters used to estimate the foraging motion. Random diffusion is used for the exploration of the search space. In the KH algorithm, the population diversity is improved by means of the diffusion function, which is integrated into the krill individuals. Random diffusion is the net movement of each krill individual from high-density to low-density regions.
The motion velocity of a krill individual follows the Lagrangian model [43] shown in Equation (5).
dXi/dt = Ni + Fi + RDi    (5)
In the above formula, dXi/dt is the motion velocity of krill individual i, Ni is the induced motion, Fi is the foraging motion, and RDi is the random diffusion of the ith krill individual. The vital parameters that play an important role in the KH algorithm are outlined in Table 5.
Table 5.
Parameter | Definition | Value |
---|---|---|
V f | Maximum foraging speed | Vf = 0.02 ms−1 |
RDmax | Maximum random diffusion speed | RDmax ∈ (0.002, 0.01) ms−1 |
N max | Maximum induction speed | Nmax = 0.01 ms−1 |
w n | Inertia weight of the motion induced | wn ∈ (0, 1) |
w f | Inertia weight of the foraging motion | wf ∈ (0, 1) |
C t | Step-length scaling factor | Constant in [0, 2] |
δ | Random directional vector | Random numbers in [−1, 1] |
The steps to select the optimal feature subset using KH are outlined below (Algorithm 2):
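As with Algorithm 1, Algorithm 2 is provided as a figure; the simplified sketch below only illustrates the position update of Equation (5) on a toy minimization problem, with the food position approximated by a fitness-weighted centroid. The speed values follow Table 5, but the simplifications (the induced motion uses only the best individual, and no genetic operators are applied) are our assumptions.

```python
# Minimal sketch of the krill position update in Equation (5):
# induced motion + foraging motion + random diffusion (toy objective).
import numpy as np

rng = np.random.default_rng(0)

def sphere(x):
    # toy objective to minimize
    return np.sum(x ** 2, axis=-1)

n_krill, dim, iters = 20, 5, 300
Vf, Nmax, RDmax, Ct = 0.02, 0.01, 0.005, 2.0     # speeds from Table 5; Ct in [0, 2]
X = rng.uniform(-5, 5, (n_krill, dim))

for _ in range(iters):
    f = sphere(X)
    best = X[f.argmin()]
    w = 1.0 / (f + 1e-9)                         # fitness weights (smaller cost = heavier)
    food = (w[:, None] * X).sum(axis=0) / w.sum()
    N_i = Nmax * (best - X)                      # motion induced by the best individual
    F_i = Vf * (food - X)                        # foraging motion towards the food position
    RD_i = RDmax * rng.uniform(-1, 1, X.shape)   # random diffusion
    X = X + Ct * (N_i + F_i + RD_i)              # Equation (5): dX/dt = N + F + RD

print("Best objective value after", iters, "iterations:", round(float(sphere(X).min()), 4))
```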
5.2.3. Outline of the BFO Algorithm for Feature Selection
The bacterial foraging optimization (BFO) algorithm imitates the pattern exhibited during the foraging process of Escherichia coli bacteria, which includes chemotaxis, swarming, reproduction, and elimination-dispersal operations [41]. The basic idea behind the foraging strategy of E. coli bacteria is to obtain the maximum nutrition per unit time. The chemotaxis strategy involves searching for nutrients by taking small movements, namely, tumbling and swimming, using the locomotory organs called flagella. The swarming strategy deals with the communication between bacteria: when the bacteria discover a high amount of nutrients, they release chemical substances to attract other bacteria, and when they are in danger, they release substances that warn other bacteria to keep away. In the reproduction process, each healthier bacterium splits into two bacteria, and the less healthy bacteria die. Finally, the elimination-dispersal strategy involves replacing low-health bacteria with randomly generated new ones. The vital parameters that play an important role in the BFO algorithm are outlined in Table 6.
Table 6.
Parameter | Description |
---|---|
p | Number of features |
S | Number of bacteria |
S r | Number of bacteria in the reproduction steps |
N re | No. of reproductive steps |
N ed | No. of elimination-dispersal steps |
N c | No. of chemotactic steps |
N s | No. of swimming steps |
L (i) | Bacteria step size length |
P ed | Elimination probability |
∅(i) | Direction of ith bacteria |
x | Index of the chemotactic process |
y | Index of the reproduction process. |
z | Index of the elimination-dispersal process |
θ i | The ith bacterium position |
θ | A bacterium on the optimization domain |
J last | The highest objective function value |
∆(i) | A random vector and its value lie between -1 and 1 |
Jcc(θ, θi(x, y, z)) | Cell-to-cell attractant effect to nutrient concentration |
The steps involved in finding the optimal feature subset using the BFO algorithm are outlined below:
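As a rough illustration of the chemotaxis operation described above, the sketch below performs tumble-and-swim steps for a single bacterium on a toy cost function; the swarming, reproduction, and elimination-dispersal operations of the full algorithm are omitted, and the step size and swim length Ns are illustrative values.

```python
# Minimal sketch of the BFO chemotaxis step: tumble in a random direction,
# then swim while the objective keeps improving (toy continuous objective).
import numpy as np

rng = np.random.default_rng(0)

def J(theta):
    # toy cost function to minimize
    return np.sum(theta ** 2)

def chemotaxis_step(theta, step_size=0.1, Ns=4):
    """One chemotactic step for a single bacterium position theta."""
    delta = rng.uniform(-1, 1, theta.shape)        # random tumble direction Δ(i)
    phi = delta / np.linalg.norm(delta)            # unit direction ϕ(i)
    J_last = J(theta)
    for _ in range(Ns):                            # swim up to Ns times
        candidate = theta + step_size * phi
        if J(candidate) < J_last:                  # keep swimming while improving
            theta, J_last = candidate, J(candidate)
        else:
            break
    return theta

theta = rng.uniform(-3, 3, size=5)
for _ in range(100):                               # Nc chemotactic steps
    theta = chemotaxis_step(theta)
print("Final cost:", round(float(J(theta)), 4))
```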
5.3. Classifier Training
Each Cs is preprocessed and split into a training set (Ts, 60%) and a testing set (Tt, 40%). A wrapper approach that uses three bioinspired algorithms, CSO, KH, and BFO, with the classification accuracy of SVM as the fitness function has been used for feature selection. The features selected by each bioinspired algorithm are used to train three BPNNs independently using CGA. The number of hidden layers in each BPNN is 1, and the activation function used in the hidden layer is the sigmoid function. The learning rate is 1e-07, and the maximum number of iterations is 100. Since the classification is binary, each BPNN has only one output node, and the activation function used in the output layer is also the sigmoid function. Figure 2 elaborates the process of training the BPNN classifiers.
The number of training instances for the FCSO, FKH, and FBFO classifiers is presented in Table 3. Though the majority of the features selected by the bioinspired algorithms overlap, it has been observed that the number of features selected by each algorithm is not the same. The parameter settings for each classifier are presented in Table 7.
Table 7.
BPNN parameter | Bioinspired algorithm | WDBC dataset | SHD dataset | HCC dataset | HD dataset | VCD dataset | CHD dataset | ILP dataset |
---|---|---|---|---|---|---|---|---|
Number of input nodes | CSO | 15 | 9 | 20 | 16 | 3 | 6 | 5 |
KH | 17 | 10 | 39 | 10 | 3 | 10 | 8 | |
BFO | 18 | 9 | 35 | 19 | 2 | 11 | 5 | |
Number of hidden nodes | CSO | 30 | 18 | 40 | 32 | 6 | 12 | 10 |
KH | 34 | 20 | 78 | 20 | 6 | 20 | 16 | |
BFO | 36 | 18 | 70 | 38 | 4 | 22 | 10 |
The steps to train the three BPNN classifiers using CGA on the features selected by the CSO, KH, and BFO algorithms are outlined below:
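For illustration, the sketch below trains a one-hidden-layer sigmoid network with a conjugate gradient optimizer (SciPy's method="CG", here with a numerical gradient) on a min-max-normalized dataset. The layer sizes, dataset, and loss function are illustrative assumptions; this is not a reproduction of the BPNN/CGA implementation used in this work.

```python
# Minimal sketch: a one-hidden-layer network with sigmoid activations,
# trained by conjugate gradient via scipy.optimize.minimize(method="CG").
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)            # min-max normalization (Section 5.1)
n_in, n_hid = X.shape[1], 10                   # illustrative hidden-layer size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(w):
    W1 = w[: n_in * n_hid].reshape(n_in, n_hid)
    b1 = w[n_in * n_hid : n_in * n_hid + n_hid]
    W2 = w[n_in * n_hid + n_hid : -1]
    b2 = w[-1]
    return W1, b1, W2, b2

def forward(w, X):
    W1, b1, W2, b2 = unpack(w)
    h = sigmoid(X @ W1 + b1)                   # hidden layer
    return sigmoid(h @ W2 + b2)                # single sigmoid output node

def loss(w):
    # binary cross-entropy over the training set
    p = np.clip(forward(w, X), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

n_params = n_in * n_hid + n_hid + n_hid + 1
w0 = np.random.default_rng(0).normal(scale=0.1, size=n_params)
res = minimize(loss, w0, method="CG", options={"maxiter": 100})
acc = ((forward(res.x, X) > 0.5).astype(int) == y).mean()
print("Training accuracy:", round(float(acc), 4))
```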
5.4. Classifier Testing and Dataset Construction for Super Learner
After training the classifiers with 60% of the preprocessed Cs (Ts), classifier testing is performed using the remaining 40% of the preprocessed Cs (Tt). Figure 3 elaborates the process of testing the three classifiers and also throws light on the process of training the super learner.
Feature selection is performed on the testing set by querying the FCSO, FKH, and FBFO databases. The instances of the testing set containing the features selected by CSO are used to test the FCSO classifier; similarly, the instances of the testing set containing the features selected by KH and BFO are used to test the FKH and FBFO classifiers, respectively. The performance of the FCSO, FKH, and FBFO classifiers is evaluated using the results obtained on the testing set.
The classification result of each instance of the testing set for FCSO, FKH, and FBFO classifiers and the class label corresponding to each instance of the testing set will be the candidate instances for training and testing the super learner.
5.5. Super Learner Training and Testing
As outlined in Section 5.4, the classification result pertaining to each instance of the testing set for the FCSO, FKH, and FBFO classifiers and the class label corresponding to each instance of the testing set are the candidate instances for training and testing the super learner. Figure 4 elaborates the process of training and testing the super learner. The training set comprises 80% of these instances, and the testing set comprises the remaining 20%. The number of training and testing instances for the super learner is presented in Table 3.
Super learner is a type of ensemble classifier [44]. In this work, a BPNN classifier trained using CGA is used as the super learner. The parameter settings for the super learner are presented in Table 8.
Table 8.
Name of the parameter | WDBC dataset | SHD dataset | HCC dataset | HD dataset | VCD dataset | CHD dataset | ILP dataset |
---|---|---|---|---|---|---|---|
Initial population size | 250 | 86 | 73 | 80 | 131 | 97 | 240 |
Number of input nodes | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
Number of hidden nodes | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
The super learner is trained using the steps presented in Section 5.3 for training the BPNN classifier using CGA, and the performance of the super learner is evaluated using the testing set.
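The overall stacking idea can be sketched as follows: the base classifiers' predictions on the testing set become a three-column meta-dataset, which is split 80/20 to train and test the meta-classifier. The base classifiers and scikit-learn's MLPClassifier below are stand-ins for the FCSO/FKH/FBFO classifiers and the BPNN-CGA super learner, respectively.

```python
# Minimal sketch of the super-learner (stacking) stage with stand-in models.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

# Stand-ins for the FCSO / FKH / FBFO classifiers (each would normally be a
# BPNN trained on its own selected feature subset).
bases = [SVC(), DecisionTreeClassifier(random_state=0), GaussianNB()]
meta_X = np.column_stack([clf.fit(X_tr, y_tr).predict(X_te) for clf in bases])

# 80/20 split of the meta-dataset for super learner training and testing.
mX_tr, mX_te, my_tr, my_te = train_test_split(meta_X, y_te, test_size=0.2, random_state=0)
super_learner = MLPClassifier(hidden_layer_sizes=(6,), max_iter=2000, random_state=0)
super_learner.fit(mX_tr, my_tr)
print("Super learner accuracy:", round(super_learner.score(mX_te, my_te), 4))
```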
6. Results and Discussions
Seven clinical datasets from the UCI ML repository, namely, WDBC, SHD, HCC, HD, VCD, CHD, and ILP have been used for experimentation. The performance of the FCSO, FKH, and FBFO classifiers and super learner is evaluated in terms of accuracy, sensitivity, specificity, precision, and F-score, which are calculated based on true positive (TP), true negative (TN), false positive (FP), and false negative (FN) using Equations (22), (23), (24), (25), and (26).
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (22)
In the above formula, TP is the number of positive instances predicted as positive by the classifier, TN is the number of negative instances predicted as negative by the classifier, FP is the number of negative instances predicted as positive by the classifier, and FN is the number of positive instances predicted as negative by the classifier.
Sensitivity = TP / (TP + FN)    (23)
Specificity = TN / (TN + FP)    (24)
Precision = TP / (TP + FP)    (25)
F-score = (2 × Precision × Sensitivity) / (Precision + Sensitivity)    (26)
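For reference, a minimal sketch computing Equations (22)-(26) from the confusion-matrix counts is given below, checked against the FCSO row of Table 9.

```python
# Minimal sketch of Equations (22)-(26): evaluation metrics computed
# directly from the confusion-matrix counts of a binary classifier.
def evaluate(tp, tn, fp, fn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision   = tp / (tp + fp)
    f_score     = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f_score

# Example: the CSO row of Table 9 (TN=137, FP=4, FN=6, TP=165).
print([round(v, 4) for v in evaluate(tp=165, tn=137, fp=4, fn=6)])
# -> accuracy 0.9679, sensitivity 0.9649, specificity 0.9716,
#    precision 0.9763, F-score 0.9706
```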
The accuracy, sensitivity, specificity, precision, and F-score obtained using the FCSO, FKH, and FBFO classifiers and the super learner for the datasets WDBC, SHD, HCC, HD, VCD, CHD, and ILP are presented in Tables 9–15.
Table 9.
Feature selection algorithm | Size of feature subset | TN | FP | FN | TP | Accuracy | Sensitivity | Specificity | Precision | F-score |
---|---|---|---|---|---|---|---|---|---|---|
CSO | 15 | 137 | 4 | 6 | 165 | 96.79 | 96.49 | 97.16 | 97.63 | 0.97 |
KH | 17 | 139 | 2 | 5 | 166 | 97.76 | 97.08 | 98.58 | 98.81 | 0.98 |
BFO | 18 | 139 | 2 | 8 | 163 | 96.79 | 95.32 | 98.58 | 98.79 | 0.97 |
Super learner | — | 22 | 0 | 2 | 39 | 96.83 | 95.12 | 100.00 | 100.00 | 0.98 |
Table 10.
Feature selection algorithm | Size of feature subset | TN | FP | FN | TP | Accuracy | Sensitivity | Specificity | Precision | F-score |
---|---|---|---|---|---|---|---|---|---|---|
CSO | 9 | 53 | 8 | 9 | 38 | 84.26 | 80.85 | 86.89 | 82.61 | 0.82 |
KH | 10 | 53 | 8 | 11 | 36 | 82.41 | 76.60 | 86.89 | 81.82 | 0.79 |
BFO | 9 | 51 | 10 | 10 | 37 | 81.48 | 78.72 | 83.61 | 78.72 | 0.79 |
Super learner | — | 10 | 1 | 2 | 9 | 86.36 | 81.82 | 90.91 | 90.00 | 0.86 |
Table 11.
Feature selection algorithm | Size of feature subset | TN | FP | FN | TP | Accuracy | Sensitivity | Specificity | Precision | F-score |
---|---|---|---|---|---|---|---|---|---|---|
CSO | 20 | 43 | 9 | 9 | 31 | 80.43 | 77.50 | 82.69 | 77.50 | 0.78 |
KH | 39 | 48 | 4 | 13 | 27 | 81.52 | 67.50 | 92.31 | 87.10 | 0.76 |
BFO | 35 | 47 | 5 | 20 | 20 | 72.83 | 50.00 | 90.38 | 80.00 | 0.62 |
Super learner | — | 10 | 1 | 0 | 8 | 94.74 | 100.00 | 90.91 | 88.89 | 0.94 |
Table 12.
Feature selection algorithm | Size of feature subset | TN | FP | FN | TP | Accuracy | Sensitivity | Specificity | Precision | F-score |
---|---|---|---|---|---|---|---|---|---|---|
CSO | 16 | 47 | 2 | 10 | 42 | 88.12 | 80.77 | 95.92 | 95.45 | 0.88 |
KH | 10 | 45 | 4 | 6 | 46 | 90.10 | 88.46 | 91.84 | 92.00 | 0.90 |
BFO | 19 | 47 | 2 | 12 | 40 | 86.14 | 76.92 | 95.92 | 95.24 | 0.85 |
Super learner | — | 8 | 1 | 1 | 11 | 90.48 | 91.67 | 88.89 | 91.67 | 0.92 |
Table 13.
Feature selection algorithm | Size of feature subset | TN | FP | FN | TP | Accuracy | Sensitivity | Specificity | Precision | F-score |
---|---|---|---|---|---|---|---|---|---|---|
CSO | 3 | 74 | 10 | 17 | 63 | 83.54 | 78.75 | 88.10 | 86.30 | 0.82 |
KH | 3 | 81 | 3 | 13 | 67 | 90.24 | 83.75 | 96.43 | 95.71 | 0.89 |
BFO | 2 | 80 | 4 | 17 | 63 | 87.20 | 78.75 | 95.24 | 94.03 | 0.86 |
Super learner | — | 19 | 2 | 4 | 8 | 81.82 | 66.67 | 90.48 | 80.00 | 0.73 |
Table 14.
Feature selection algorithm | Size of feature subset | TN | FP | FN | TP | Accuracy | Sensitivity | Specificity | Precision | F-score |
---|---|---|---|---|---|---|---|---|---|---|
CSO | 6 | 60 | 8 | 12 | 42 | 83.61 | 77.78 | 88.24 | 84.00 | 0.81 |
KH | 10 | 56 | 12 | 11 | 43 | 81.15 | 79.63 | 82.35 | 78.18 | 0.79 |
BFO | 11 | 53 | 15 | 13 | 41 | 77.05 | 75.93 | 77.94 | 73.21 | 0.75 |
Super learner | — | 13 | 2 | 2 | 8 | 84.00 | 80.00 | 86.67 | 80.00 | 0.80 |
Table 15.
Feature selection algorithm | Size of feature subset | TN | FP | FN | TP | Accuracy | Sensitivity | Specificity | Precision | F-score |
---|---|---|---|---|---|---|---|---|---|---|
CSO | 5 | 103 | 62 | 33 | 102 | 68.33 | 75.56 | 62.42 | 62.20 | 0.68 |
KH | 8 | 104 | 61 | 40 | 95 | 66.33 | 70.37 | 63.03 | 60.90 | 0.65 |
BFO | 5 | 101 | 64 | 34 | 101 | 67.33 | 74.81 | 61.21 | 61.21 | 0.67 |
Super learner | — | 26 | 15 | 3 | 16 | 70.00 | 84.21 | 63.41 | 51.61 | 0.64 |
The super learner has achieved a classification accuracy of 96.83% for WDBC, 86.36% for SHD, 94.74% for HCC, 90.48% for HD, 81.82% for VCD, 84.0% for CHD, and 70.0% for ILP. The classification accuracy of the proposed work has been compared with the performance of existing work on clinical datasets, and the comparison results are summarized in Table 16.
Table 16.
Author/year | Method/reference | Accuracy % | ||||||
---|---|---|---|---|---|---|---|---|
WDBC | SHD | HCC | HD | VCD | CHD | ILP | ||
Ayon et al. (2020) | DNN [45] | — | 98.15 | — | — | — | 94.39 | — |
SVM [45] | — | 97.41 | — | — | — | 97.36 | — | |
Bai Ji et al. (2020) | IBPSO with k-NN [46] | 96.14 | — | — | — | — | — | — |
Elgin et al. (2020) | Cooperative coevolution and RF [30] | 97.1 | 96.8 | 72.2 | 82.3 | 91.4 | 93.4 | — |
Magesh et al. (2020) | Cluster-based decision tree [47] | — | — | — | — | — | 89.30 | — |
Rabbi et al. (2020) | PCC and AdaBoost [48] | — | — | — | — | — | — | 92.19 |
Rajesh et al. (2020) | RF classifier [49] | — | — | 80.64 | — | — | — | — |
Salima et al. (2020) | ECSA with k-NN [50] | 95.76 | 82.96 | — | — | — | — | — |
Singh J et al. (2020) | Logistic regression [51] | — | — | — | — | — | — | 74.36 |
Sreejith et al. (2020) | CMVO and RF [28] | — | — | — | — | — | — | 82.46 |
Sreejith et al. (2020) | DISON and ERT[27] | — | 94.5 | — | — | 87.17 | 93.67 | — |
Tougui et al. (2020) | ANN with Matlab [52] | — | — | — | — | — | 85.86 | — |
Tubishat et al. (2020) | ISSA with k-NN [53] | — | 88.1 | — | — | 89.0 | — | — |
Abdar et al. (2019) | Novel nested ensemble nu-SVC [54] | — | — | — | — | — | 98.60 | — |
Anter et al. (2019) | CFCSA with chaotic maps [31] | 98.6 | — | — | 68.0 | — | 88.0 | 68.4 |
Aouabed et al. (2019) | Nested ensemble nu-SVC, GA and multilevel balancing [55] | — | — | — | — | — | 98.34 | — |
Elgin et al. (2019) | DE, LO and GSO with Adaboost SVM [32] | 98.73 | — | — | 93.9 | — | — | — |
Książek et al. (2019) | SVM [56] | — | 97.41 | — | — | — | 97.36 | — |
Sayed et al. (2019) | Novel chaotic crow search algorithm with k-NN [57] | 90.28 | 78.84 | — | 83.7 | — | — | 71.68 |
Abdar et al. (2018) | MPNN and C5.0 [58] | — | — | — | — | — | — | 94.12 |
Abdullah et al. (2018) | k-NN [59] | — | — | — | — | 85.32 | — | — |
RF [59] | — | — | — | — | 79.57 | — | — | |
Sawhney et al. (2018) | BFA and RF [60] | — | — | 83.50 | — | — | — | — |
Abdar et al. (2017) | Boosted C5.0 [61] | — | — | — | — | — | — | 93.75 |
CHAID [61] | — | — | — | — | — | — | 65.0 | |
Zamani et al. (2016) | WOA with k-NN [62] | — | 77.05 | — | 87.10 | — | — | — |
Abdar (2015) | SVM with rapid miner [63] | — | — | — | — | — | — | 72.54 |
C5.0 with IBM SPSS modeller [63] | — | — | — | — | — | — | 87.91 | |
Santos et al. (2015) | Neural networks and augmented set approach [64] | — | — | 75.2 | — | — | — | — |
Chiu et al. (2013) | ANN and LR [65] | — | — | 85.10 | — | — | — | — |
Mauricio et al. (2013) | ABCO with SVM [66] | — | 84.81 | — | 87.10 | — | 83.17 | — |
Proposed | CSO, KH, BFO, and super learner | 96.83 | 86.36 | 94.74 | 90.48 | 81.82 | 84.00 | 70.00 |
7. Conclusion and Scope for Future Work
A CAD system that employs a super learner to diagnose the presence or absence of a disease has been implemented in this work. Seven Cs from the UCI ML repository, namely, WDBC, SHD, HCC, HD, VCD, CHD, and ILP have been used for experimentation. Each Cs is preprocessed, and the preprocessed Cs is split into training and testing sets. A wrapper-based feature selection approach using three bioinspired algorithms, namely, CSO, KH, and BFO, with the accuracy of the SVM classifier as the fitness function has been used to select the optimal feature subsets. The selected feature subsets are used to train three BPNN classifiers using CGA, and the performance of the trained classifiers is evaluated. The classification results obtained for each instance of the testing set of the three classifiers, together with the class label associated with each instance of the testing set, are the candidate instances for training and testing the super learner. The super learner achieved a classification accuracy of 96.83% for WDBC, 86.36% for SHD, 94.74% for HCC, 90.48% for HD, 81.82% for VCD, 84.0% for CHD, and 70.0% for ILP.
CAD systems to diagnose disorders in the human body from different imaging modalities such as X-ray, computed tomography, magnetic resonance imaging, and positron emission tomography are gaining importance. This work can be extended by developing CAD systems to diagnose disorders from the medical images acquired through different imaging modalities. Features based on shape, texture, and run length can be extracted from the images, and the feature selection algorithms used in this work can be used to select the relevant features. The relevant features can be used to build classifier models to predict the presence or absence of disorders from the images.
Data Availability
The data supporting this study are from previously reported studies and datasets, which have been cited. The datasets used in this research work are available at UCI Machine Learning Repository.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
References
- 1.Chawla N. V., Bowyer K. W., Hall L. O., Kegelmeyer W. P. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002;16:321–357. doi: 10.1613/jair.953. [DOI] [Google Scholar]
- 2.Prati R. C. Combining feature ranking algorithms through rank aggregation. The 2012 International Joint Conference on Neural Networks (IJCNN); 2012; Brisbane, QLD, Australia. pp. 1–8. [DOI] [Google Scholar]
- 3.Karegowda A. G., Manjunath A. S., Jayaram M. A. Comparative study of attribute selection using gain ratio and correlation based feature selection. International Journal of Information Technology and Knowledge Management. 2010;2:271–277. [Google Scholar]
- 4.Shang W., Huang H., Zhu H., Lin Y., Qu Y., Wang Z. A novel feature selection algorithm for text categorization. Expert Systems with Applications. 2007;33(1):1–5. doi: 10.1016/j.eswa.2006.04.001. [DOI] [Google Scholar]
- 5.He X., Cai D., Niyogi P. Laplacian score for feature selection. Advances in Neural Information Processing Systems. 2005;18:507–514. [Google Scholar]
- 6.Suebsing A., Nualsawat H. A novel technique for feature subset selection based on cosine similarity. Applied Mathematical Sciences. 2012;6:6627–6655. [Google Scholar]
- 7.Nahato K. B., Harichandran K. N., Arputharaj K. Knowledge mining from clinical datasets using rough sets and backpropagation neural network. Computational and Mathematical Methods in Medicine. 2015;2015:13. doi: 10.1155/2015/460189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Christopher J. J., Nehemiah H. K., Arputharaj K., Moses G. L. Computer-assisted medical decision-making system for diagnosis of Urticaria. MDM Policy & Practice. 2016;1(1) doi: 10.1177/2381468316677752. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Elizabeth D. S., Retmin Raj C. S., Nehemiah H. K., Kannan A. Computer-aided diagnosis of lung cancer based on analysis of the significant slice of chest computed tomography image. IET Image Processing. 2012;6(6):697–705. doi: 10.1049/iet-ipr.2010.0521. [DOI] [Google Scholar]
- 10.Elizabeth D. S., Nehemiah H. K., Raj C. S. R., Kannan A. A novel segmentation approach for improving diagnostic accuracy of CAD systems for detecting lung cancer from chest computed tomography images. Journal of Data and Information Quality (JDIQ) 2012;3:1–16. doi: 10.1145/2184442.2184444. [DOI] [Google Scholar]
- 11.Elizabeth D. S., Kannan A., Nehemiah H. K. Computer aided diagnosis system for the detection of bronchiectasis in chest computed tomography images. International Journal of Imaging Systems and Technology. 2009;19(4):290–298. doi: 10.1002/ima.20205. [DOI] [Google Scholar]
- 12.Darmanayagam S. E., Harichandran K. N., Cyril S. R. R., Arputharaj K. A novel supervised approach for segmentation of lung parenchyma from chest CT for computer-aided diagnosis. Journal of Digital Imaging. 2013;26(3):496–509. doi: 10.1007/s10278-012-9539-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Retmin Raj C. S., Nehemiah H. K., Elizabeth D. S., Kannan A. A novel feature-significance based k-nearest neighbour classification approach for computer aided diagnosis of lung disorders. Current Medical Imaging Reviews. 2018;14(2):289–300. doi: 10.2174/1573405613666170504152628. [DOI] [Google Scholar]
- 14.Titus A., Nehemiah H. K., Kannan A. Classification of interstitial lung diseases using particle swarm optimized support vector machine. International Journal of Soft Computing. 2015;10:25–36. [Google Scholar]
- 15.Nancy Jane Y., Khanna Nehemiah H., Arputharaj K. A Q-backpropagated time delay neural network for diagnosing severity of gait disturbances in Parkinson's disease. Journal of Biomedical Informatics. 2016;60:169–176. doi: 10.1016/j.jbi.2016.01.014. [DOI] [PubMed] [Google Scholar]
- 16.Nancy J. Y., Khanna N. H., Kannan A. A bio-statistical mining approach for classifying multivariate clinical time series data observed at irregular intervals. Expert Systems with Applications. 2017;78:283–300. doi: 10.1016/j.eswa.2017.01.056. [DOI] [Google Scholar]
- 17.Leema N., Khanna Nehemiah H., Kannan A., Jabez Christopher J. Computer aided diagnosis system for clinical decision making: experimentation using Pima Indian diabetes dataset. Asian Journal of Information Technology. 2016;15:3217–3231. [Google Scholar]
- 18.Nahato K. B., Nehemiah K. H., Kannan A. Hybrid approach using fuzzy sets and extreme learning machine for classifying clinical datasets. Informatics in Medicine Unlocked. 2016;2:1–11. doi: 10.1016/j.imu.2016.01.001. [DOI] [Google Scholar]
- 19.Christopher J. J., Nehemiah H. K., Kannan A. A swarm optimization approach for clinical knowledge mining. Computer Methods and Programs in Biomedicine. 2015;121(3):137–148. doi: 10.1016/j.cmpb.2015.05.007. [DOI] [PubMed] [Google Scholar]
- 20.Leema N., Nehemiah H. K., Kannan A. Neural network classifier optimization using differential evolution with global information and back propagation algorithm for clinical datasets. Applied Soft Computing. 2016;49:834–844. doi: 10.1016/j.asoc.2016.08.001. [DOI] [Google Scholar]
- 21.Leema N., Nehemiah H. K., Kannan A. Quantum-behaved particle swarm optimization based radial basis function network for classification of clinical datasets. International Journal of Operations Research and Information Systems (IJORIS) 2020;9:32–52. doi: 10.4018/978-1-7998-2460-2.ch065. [DOI] [Google Scholar]
- 22.Nancy J. Y., Khanna N. H., Arputharaj K. Imputing missing values in unevenly spaced clinical time series data to build an effective temporal classification framework. Computational Statistics & Data Analysis. 2017;112:63–79. doi: 10.1016/j.csda.2017.02.012. [DOI] [Google Scholar]
- 23.Jane N. Y., Nehemiah K., Arputharaj K. A temporal mining framework for classifying un-evenly spaced clinical data: an approach for building effective clinical decision-making system. Applied Clinical Informatics. 2016;7(1):1–21. doi: 10.4338/ACI-2015-08-RA-0102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Fister I., Jr., Yang X. S., Fister I., Brest J., Fister D. A brief review of nature-inspired algorithms for optimization. Elektrotehniski Vestnik. 2013;80(3):1–7. [Google Scholar]
- 25.Leema N., Nehemiah K. H., Elgin Christo V. R., Kannan A. Evaluation of parameter settings for training neural networks using backpropagation algorithms. International Journal of Operations Research and Information Systems. 2020;11(4):62–85. doi: 10.4018/IJORIS.2020100104. [DOI] [Google Scholar]
- 26.Christo V. R. E., Nehemiah H. K., Nahato K. B., Brighty J., Kannan A. Computer assisted medical decision-making system using genetic algorithm and extreme learning machine for diagnosing allergic rhinitis. International Journal of Bio-Inspired Computation. 2020;16(3):148–157. doi: 10.1504/IJBIC.2020.111279. [DOI] [Google Scholar]
- 27.Sreejith S., Khanna Nehemiah H., Kannan A. A classification framework using a diverse intensified strawberry optimized neural network (DISON) for clinical decision-making. Cognitive Systems Research. 2020;64:98–116. doi: 10.1016/j.cogsys.2020.08.003. [DOI] [Google Scholar]
- 28.Sreejith S., Khanna Nehemiah H., Kannan A. Clinical data classification using an enhanced SMOTE and chaotic evolutionary feature selection. Computers in Biology and Medicine. 2020;126:p. 103991. doi: 10.1016/j.compbiomed.2020.103991. [DOI] [PubMed] [Google Scholar]
- 29.Isaac A., Nehemiah H. K., Isaac A., Kannan A. Computer-aided diagnosis system for diagnosis of pulmonary emphysema using bio-inspired algorithms. Computers in Biology and Medicine. 2020;124:p. 103940. doi: 10.1016/j.compbiomed.2020.103940. [DOI] [PubMed] [Google Scholar]
- 30.Christo V. E., Nehemiah H. K., Brighty J., Kannan A. Feature selection and instance selection from clinical datasets using co-operative co-evolution and classification using random Forest. IETE Journal of Research. 2020:1–14. doi: 10.1080/03772063.2020.1713917. [DOI] [Google Scholar]
- 31.Anter A. M., Ali M. Feature selection strategy based on hybrid crow search optimization algorithm integrated with chaos theory and fuzzy C-means algorithm for medical diagnosis problems. Soft Computing. 2020;24(3):1565–1584. doi: 10.1007/s00500-019-03988-3. [DOI] [Google Scholar]
- 32.Elgin Christo V. R., Khanna Nehemiah H., Minu B., Kannan A. Correlation-based ensemble feature selection using bioinspired algorithms and classification using backpropagation neural network. Computational and Mathematical Methods in Medicine. 2019;2019, Article ID 7398307, 17 pages. doi: 10.1155/2019/7398307. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Sweetlin J. D., Nehemiah H. K., Kannan A. Computer aided diagnosis of drug sensitive pulmonary tuberculosis with cavities, consolidations and nodular manifestations on lung CT images. International Journal of Bio-Inspired Computation. 2019;13(2):71–85. doi: 10.1504/IJBIC.2019.098405. [DOI] [Google Scholar]
- 34.Dhalia Sweetlin J., Nehemiah H. K., Kannan A. Computer aided diagnosis of pulmonary hamartoma from CT scan images using ant colony optimization based feature selection. Alexandria Engineering Journal. 2018;57(3):1557–1567. doi: 10.1016/j.aej.2017.04.014. [DOI] [Google Scholar]
- 35.Sweetlin J. D., Nehemiah H. K., Kannan A. Feature selection using ant colony optimization with tandem-run recruitment to diagnose bronchitis from CT scan images. Computer Methods and Programs in Biomedicine. 2017;145:115–125. doi: 10.1016/j.cmpb.2017.04.009. [DOI] [PubMed] [Google Scholar]
- 36.Sunil Retmin Raj C., Khanna Nehemiah H., Shiloah Elizabeth D., Kannan A. Distance based genetic algorithm for feature selection in computer aided diagnosis systems. Current Medical Imaging Reviews. 2017;13(3):284–298. doi: 10.2174/1573405612666160503164115. [DOI] [Google Scholar]
- 37.Zawbaa H. M., Emary E., Parv B., Sharawi M. Feature selection approach based on moth-flame optimization algorithm. 2016 IEEE Congress on Evolutionary Computation (CEC); July 2016; Vancouver, BC, Canada. pp. 4612–4617. [DOI] [Google Scholar]
- 38.Chu S.-C., Tsai P.-W., Pan J.-S. PRICAI 2006: Trends in Artificial Intelligence. Springer; 2006. Cat swarm optimization; pp. 854–858. [DOI] [Google Scholar]
- 39.Gandomi A. H., Alavi A. H. Krill herd: a new bio-inspired optimization algorithm. Communications in Nonlinear Science and Numerical Simulation. 2012;17(12):4831–4845. doi: 10.1016/j.cnsns.2012.05.010. [DOI] [Google Scholar]
- 40.Chen H., Zhu Y., Hu K. Cooperative bacterial foraging optimization. Discrete Dynamics in Nature and Society. 2009;2009, Article ID 815247, 17 pages. doi: 10.1155/2009/815247. [DOI] [Google Scholar]
- 41.Passino K. M. Biomimicry of bacterial foraging for distributed optimization and control. IEEE Control Systems Magazine. 2002;22(3):52–67. doi: 10.1109/MCS.2002.1004010. [DOI] [Google Scholar]
- 42.Chen H., Zhu Y., Hu K. Adaptive bacterial foraging optimization. Abstract and Applied Analysis. 2011;2011, Article ID 108269, 27 pages. doi: 10.1155/2011/108269. [DOI] [Google Scholar]
- 43.Hofmann E. E., Haskell A. G. E., Klinck J. M., Lascara C. M. Lagrangian modelling studies of Antarctic krill (Euphausia superba) swarm formation. ICES Journal of Marine Science. 2004;61(4):617–631. doi: 10.1016/j.icesjms.2004.03.028. [DOI] [Google Scholar]
- 44.van der Laan M. J., Polley E. C., Hubbard A. E. Super learner. Statistical Applications in Genetics and Molecular Biology. 2007;6(1):p. 25. doi: 10.2202/1544-6115.1309. [DOI] [PubMed] [Google Scholar]
- 45.Ayon S. I., Islam M. M., Hossain M. R. Coronary artery heart disease prediction: a comparative study of computational intelligence techniques. IETE Journal of Research. 2020:1–20. doi: 10.1080/03772063.2020.1713916. [DOI] [Google Scholar]
- 46.Ji B., Lu X., Sun G., Zhang W., Li J., Xiao Y. Bio-inspired feature selection: an improved binary particle swarm optimization approach. IEEE Access. 2020;8:85989–86002. doi: 10.1109/ACCESS.2020.2992752. [DOI] [Google Scholar]
- 47.Magesh G., Swarnalatha P. Optimal feature selection through a cluster-based DT learning (CDTL) in heart disease prediction. Evolutionary Intelligence. 2020:1–11. doi: 10.1007/s12065-019-00336-0. [DOI] [Google Scholar]
- 48.Rabbi M. F., Hasan S. M., Champa A. I., Asif Zaman M., Hasan M. K. Prediction of liver disorders using machine learning algorithms: a comparative study. 2020 2nd International Conference on Advanced Information and Communication Technology (ICAICT); November 2020; Dhaka, Bangladesh. pp. 111–116. [DOI] [Google Scholar]
- 49.Rajesh S., Choudhury N. A., Moulik S. Hepatocellular carcinoma (HCC) liver cancer prediction using machine learning algorithms. 2020 IEEE 17th India Council International Conference (INDICON); December 2020; New Delhi, India. pp. 1–5. [DOI] [Google Scholar]
- 50.Ouadfel S., Abd Elaziz M. Enhanced crow search algorithm for feature selection. Expert Systems with Applications. 2020;159:p. 113572. doi: 10.1016/j.eswa.2020.113572. [DOI] [Google Scholar]
- 51.Singh J., Bagga S., Kaur R. Software-based prediction of liver disease with feature selection and classification techniques. Procedia Computer Science. 2020;167:1970–1980. doi: 10.1016/j.procs.2020.03.226. [DOI] [Google Scholar]
- 52.Tougui I., Jilbab A., el Mhamdi J. Heart disease classification using data mining tools and machine learning techniques. Health and Technology. 2020;10(5):1137–1144. doi: 10.1007/s12553-020-00438-1. [DOI] [Google Scholar]
- 53.Tubishat M., Idris N., Shuib L., Abushariah M. A. M., Mirjalili S. Improved Salp swarm algorithm based on opposition based learning and novel local search algorithm for feature selection. Expert Systems with Applications. 2020;145:p. 113122. doi: 10.1016/j.eswa.2019.113122. [DOI] [Google Scholar]
- 54.Abdar M., Acharya U. R., Sarrafzadegan N., Makarenkov V. NE-nu-SVC: a new nested ensemble clinical decision support system for effective diagnosis of coronary artery disease. IEEE Access. 2019;7:167605–167620. doi: 10.1109/ACCESS.2019.2953920. [DOI] [Google Scholar]
- 55.Aouabed Z., Abdar M., Tahiri N., Gareau J. C., Makarenkov V. International Conference Europe Middle East & North Africa Information Systems and Technologies to Support Learning. Springer; 2019. A novel effective ensemble model for early detection of coronary artery disease; pp. 480–489. [DOI] [Google Scholar]
- 56.Książek W., Abdar M., Acharya U. R., Pławiak P. A novel machine learning approach for early detection of hepatocellular carcinoma patients. Cognitive Systems Research. 2019;54:116–127. doi: 10.1016/j.cogsys.2018.12.001. [DOI] [Google Scholar]
- 57.Sayed G. I., Hassanien A. E., Azar A. T. Feature selection via a novel chaotic crow search algorithm. Neural Computing and Applications. 2019;31(1):171–188. doi: 10.1007/s00521-017-2988-6. [DOI] [Google Scholar]
- 58.Abdar M., Yen N. Y., Hung J. C. S. Improving the diagnosis of liver disease using multilayer perceptron neural network and boosted decision trees. Journal of Medical and Biological Engineering. 2018;38(6):953–965. doi: 10.1007/s40846-017-0360-z. [DOI] [Google Scholar]
- 59.Abdullah A. A., Yaakob A., Ibrahim Z. Prediction of spinal abnormalities using machine learning techniques. 2018 International Conference on Computational Approach in Smart Systems Design and Applications (ICASSDA); August 2018; Kuching, Malaysia. pp. 1–6. [DOI] [Google Scholar]
- 60.Sawhney R., Mathur P., Shankar R. Computational Science and Its Applications – ICCSA 2018. Vol. 10960. Springer; 2018. A firefly algorithm based wrapper-penalty feature selection method for cancer diagnosis; pp. 438–449. [DOI] [Google Scholar]
- 61.Abdar M., Zomorodi-Moghadam M., Das R., Ting I. H. Performance analysis of classification algorithms on early detection of liver disease. Expert Systems with Applications. 2017;67:239–251. doi: 10.1016/j.eswa.2016.08.065. [DOI] [Google Scholar]
- 62.Zamani H., Nadimi-Shahraki M. H. Feature selection based on whale optimization algorithm for diseases diagnosis. International Journal of Computer Science and Information Security. 2016;14:1243–1247. [Google Scholar]
- 63.Abdar M. A survey and compare the performance of IBM SPSS modeler and rapid miner software for predicting liver disease by using various data mining algorithms. Cumhuriyet Üniversitesi Fen-Edebiyat Fakültesi Fen Bilimleri Dergisi. 2015;36:3230–3241. [Google Scholar]
- 64.Santos M. S., Abreu P. H., García-Laencina P. J., Simão A., Carvalho A. A new cluster-based oversampling method for improving survival prediction of hepatocellular carcinoma patients. Journal of Biomedical Informatics. 2015;58:49–59. doi: 10.1016/j.jbi.2015.09.012. [DOI] [PubMed] [Google Scholar]
- 65.Chiu H.-C., Ho T.-W., Lee K.-T., Chen H.-Y., Ho W.-H. Mortality predicted accuracy for hepatocellular carcinoma patients with hepatic resection using artificial neural network. The Scientific World Journal. 2013;2013, Article ID 201976, 10 pages. doi: 10.1155/2013/201976. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Schiezaro M., Pedrini H. Data feature selection based on artificial bee colony algorithm. EURASIP Journal on Image and Video Processing. 2013;2013(1):47. doi: 10.1186/1687-5281-2013-47. [DOI] [Google Scholar]
Data Availability Statement
The data supporting this study are from previously reported studies and datasets, which have been cited. The datasets used in this research work are available at the UCI Machine Learning Repository.