Effective hybrid feature selection using different bootstrap enhances cancers classification performance

Noura Mohammed Abdelwahed; Gh S El-Tawel; M A Makhlouf

doi:10.1186/s13040-022-00304-y

2022 Sep 30;15:24. doi: 10.1186/s13040-022-00304-y

PMCID: PMC9523996 PMID: 36175944

Abstract

Background

Machine learning can be used to predict the onset of different human cancers. High-dimensional data pose considerable problems, among them an excessive number of genes, over-fitting, long fitting times, and reduced classification accuracy. Recursive Feature Elimination (RFE) is a wrapper method for selecting the subset of features that yields the best accuracy. Despite its high performance, RFE has two disadvantages: computation time and over-fitting. Random forest for selection (RFS) has proved effective at selecting informative features and mitigating over-fitting.

Method

This paper proposes a method named positions first bootstrap step (PFBS) random forest selection recursive feature elimination (RFS-RFE), abbreviated PFBS-RFS-RFE, to enhance cancer classification performance. It applies a bootstrap at one of three positions: the outer first bootstrap step (OFBS), the inner first bootstrap step (IFBS), and the outer/inner first bootstrap step (O/IFBS). In the first position, OFBS applies bootstrap resampling with replacement before the selection step. RFS is then run with bootstrap = false, i.e., the whole dataset is used to build each tree, and the important features are passed to RFE to select the most relevant subset. In the second position, IFBS applies bootstrap resampling with replacement inside RFS, and the important features are again passed to RFE. The third position, O/IFBS, combines the first and second positions. RFE uses logistic regression (LR) as its estimator. The proposed methods are combined with four classifiers to address feature selection problems and improve the performance of RFE, and five datasets of different sizes are used to assess PFBS-RFS-RFE.
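The three bootstrap positions can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation: the function name `pfbs_rfs_rfe`, the synthetic data, the mean-importance threshold for RFS, and the choice to keep half of the candidates in RFE are all assumptions made for illustration. The LR settings (C = 0.05, 100 iterations, tolerance 0.0001), the 100 trees and the entropy criterion follow the parameter table given later in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample


def pfbs_rfs_rfe(X, y, position, seed=0):
    """Return column indices kept at one bootstrap position (illustrative only)."""
    outer = position in ("OFBS", "O/IFBS")  # resample rows before selection
    inner = position in ("IFBS", "O/IFBS")  # resample rows inside each tree
    if outer:
        X, y = resample(X, y, replace=True, random_state=seed)

    # RFS: rank features with a random forest. With bootstrap=False every
    # tree is built on the whole (possibly already resampled) dataset.
    forest = RandomForestClassifier(
        n_estimators=100, criterion="entropy", bootstrap=inner, random_state=seed
    )
    # Assumption: keep features whose importance is at least the mean importance.
    rfs = SelectFromModel(forest, threshold="mean").fit(X, y)
    candidates = np.flatnonzero(rfs.get_support())

    # RFE: prune the candidates with logistic regression as the estimator.
    # Assumption: keep half of the candidates, and at least one.
    n_keep = max(1, candidates.size // 2)
    lr = LogisticRegression(C=0.05, max_iter=100, tol=1e-4)
    rfe = RFE(lr, n_features_to_select=n_keep).fit(X[:, candidates], y)
    return candidates[rfe.get_support()]


# Hypothetical synthetic data standing in for a gene-expression matrix.
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=8, random_state=0
)
selected = {p: pfbs_rfs_rfe(X, y, p) for p in ("OFBS", "IFBS", "O/IFBS")}
```

Each entry of `selected` holds the surviving column indices for one position, so the three positions can be compared on the same data.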

Results

The results showed that O/IFBS-RFS-RFE achieved the best performance compared with previous work, bringing the accuracy, variance and ROC area to 99.994%, 0.0000004 and 1.000 for the RNA gene dataset and to 100.000%, 0.0 and 1.000 for the dermatology erythemato-squamous diseases dataset.

Conclusion

High-dimensional datasets and the RFE algorithm pose many difficulties for cancer classification. PFBS-RFS-RFE is proposed to address these difficulties using different bootstrap positions. The important features extracted by RFS are passed to RFE to obtain the most effective features.

Keywords: Machine Learning, Random Forest feature importance, Recursive feature elimination and its disadvantages, Over-fitting, Learning algorithms, High dimensional data

Introduction

Artificial intelligence (AI) plays an important role in many fields, especially the biomedical field, where it aims to simulate reality [1, 2]. AI applications have been used in this field for 20 years, driven by the availability of diverse datasets, high-capability computing devices, and arithmetic algorithms [2]. A survey has indicated that AI is highly effective in health care and is expected to outperform specialists in this field; it has also proven effective in cancer research [2]. AI now provides human specialists with information on which decisions are based, and it has become one of the most important members of the medical team [2]. It also improves accuracy, speeds up diagnosis, and identifies features or genes affecting cancer as recommendations for human specialists to consider [2]. AI thus serves as a second opinion that helps specialists make their decisions [2]. Compared with the manual method, AI provides human specialists with more information and detail, and its diagnosis is more accurate and efficient without requiring additional labor.

The manual method may be stressful for patients, because waiting longer for sample results puts them under great pressure [3]. Cancer has become widespread in recent times and is now a major cause of disease and death [4]. It can be defined as a group of diseases caused by abnormal cell growth or changes in genes, and it can occur anywhere in the body [5]. Many factors cause cancer, including (1) tobacco consumption, (2) poor diet, (3) lack of physical activity, (4) alcohol, (5) radiation, (6) infection, (7) genetic factors, (8) smoking and (9) age [6]. There are many types of human cancer; in this paper we used datasets that include Breast Invasive Carcinoma (BR), Bladder urothelial carcinoma (BL), Colon and rectum (CO), Glioblastoma multiforme (GB), Head and neck squamous cell (HN), Kidney renal clear-cell (KI), Parkinson’s disease (PD), Prostate adenocarcinoma (PRAD) and Lung adenocarcinoma (LUAD).

Large datasets raise considerable problems related to the number of features, fitting time, classification accuracy, and model performance. Feature selection is the process of selecting the most relevant features and discarding insignificant ones, and it plays a vital role in enhancing model performance [7–9]. The process aims to select the most relevant subset of r features from the original set of R features (r < R) in a given dataset, where R includes all features in the dataset [9]. The full feature set suffers from high dimensionality, noise, redundancy and over-fitting. Ineffective features diminish classification accuracy and waste time, so deleting them alleviates these problems. Feature selection procedures fall into three major types: filter, wrapper [9, 10], and embedded [11]. The filter procedure selects features by evaluating their relevance. The features are ranked in decreasing order, and low-ranking features are omitted to obtain the most relevant ones [12]. The filter approach can use many measures, including gain ratio, mutual information based feature selection (MIFS), information gain based feature selection (IGF), relaxed functional dependencies [9], and chi-square [10]. This procedure does not depend on any machine learning algorithm and is faster than the wrapper procedure. Despite its simplicity, it suffers from over-fitting. In the wrapper procedure, the best subset of features is selected by using a machine learning algorithm to evaluate each candidate subset [9, 10]. This procedure is computationally expensive when applied to high-dimensional data; on the other hand, it tends to select the most relevant and effective subset of features. In the embedded procedure, feature selection is an integral part of the classification model and is embedded in the learning phase [11]. This procedure has many advantages, including lower computational cost, reduced over-fitting, and selection of the most accurate features. We therefore integrate the wrapper procedure with the embedded one to select relevant features, using the proposed methods to minimize the previous drawbacks and maximize classification accuracy.

Selecting influential features is an effective step in the classification process for obtaining accurate results. Many datasets suffer from high dimensionality, which negatively affects model accuracy, and feature selection is one of the steps that helps solve many of the problems facing such datasets. Many authors have applied different feature selection algorithms to minimize processing time and over-fitting, maximize classification accuracy and find the most relevant features, and these goals still need more research. Numerous filter, wrapper, and embedded methods have therefore been proposed to address these drawbacks. The filter method is simple and selects features based on their ranking with respect to a class. Still, it suffers from over-fitting in high-dimensional datasets and disregards feature dependencies. Elsadek et al. [12] proposed a method using IGF to classify six human cancer types based on a DNA copy number variation (CNV) dataset. The proposed method selected 16,381 features as the most relevant features. Several learning algorithms were applied, including logistic regression (LR), support vector machine (SVM), random forest (RF), J48, neural network, bagging and dagging. The LR learning algorithm achieved the best classification accuracy, about 85%, with a ROC area of 0.965. Rajit et al. [13] proposed the select-best and select-percentile filter methods and applied them to a breast cancer dataset. Several learning algorithms were used, and the LR classifier achieved a better result. Pinar Yildirim [14] applied several filter methods, namely Cfs Subset Eval, principal component analysis (PCA), Consistency Subset Eval, IGF, One-R Attribute Eval, and Relief Attribute Eval. The study used the Hepatitis dataset and showed that the Consistency Subset, IGF, One-R Attribute Eval, and Relief Attribute Eval filter methods achieved better results. In addition, Alirezanejad et al. [15] proposed a filter method for gene selection using two heuristic methods, namely Xvariance and mutual congestion. Xvariance gave the best results with the standard datasets, while mutual congestion enhanced the accuracy on high-dimensional datasets. Kuswanto et al. [16] compared different filtering methods for feature selection: MIFS, correlation based feature selection (CFS) and fast correlation based feature selection (FCBF). The results of these methods were forwarded to a K-nearest neighbors (KNN) classifier. The results showed that FCBF selected a small number of features, while the other methods performed well. Furthermore, Ghasemi et al. [17] proposed a method using IGF and the Gini index to select important features, which were used for early prediction of heart disease. The method aimed to minimize the dimension and maximize the performance of heart disease diagnosis with fewer medical experiments. Mahmood [18] proposed a method to reduce the dimension of a facial expression recognition dataset. Two feature selection methods, Chi-Square and Relief-F, were applied to obtain a minimum number of features; they selected the six highest-ranked features. Four different classifiers were applied to evaluate the performance. In addition, Spencer et al. [19] proposed a method for predicting heart disease. Four feature selection methods were used: ReliefF, Chi-squared, symmetrical uncertainty and PCA. Different machine learning classifiers were applied to create models for comparison, and the best prediction with the smallest subset of features was obtained using Chi-squared. Mohamed et al. [20] proposed a method to obtain the most important subset of features rather than using the whole dataset. Chi-square, IG and the Bat algorithm were applied for feature selection, and a variety of classifiers were used to evaluate the model performance. Vikas et al. [21] proposed a method to minimize processing time and maximize classification accuracy in lung cancer detection. The Chi-square algorithm was applied to select the most relevant features, and two classifiers, SVM and RF, were used to evaluate the performance.

Many authors have applied wrapper methods to solve optimization problems and obtain the most important feature subsets using different datasets. AH et al. [22] proposed an algorithm using the wrapper approach that enhanced the basic salp swarm algorithm (SSA) to improve reliability, convergence speed, and classification accuracy; adding an inertia weight gave better results. Hegazy et al. [9] used a hybrid wrapper method that applied chaotic maps to improve the performance of SSA and overcome its drawbacks. They used five chaotic maps to control the exploitation/exploration rates. The proposed algorithm (CSSA) was applied to twenty-seven datasets and gave the best results, although it did not perform well on high-dimensional datasets. Sanaa et al. [8] proposed a wrapper method combining particle swarm optimization (PSO) and a genetic algorithm (GA) to classify six human cancer types using a DNA CNV dataset. The hybrid method was applied to minimize the number of features and maximize the classification accuracy. It selected 2051 of 16,381 features, which achieved 84.6% classification accuracy. However, it suffered from problems including over-fitting, fitting time, feature relevance, and classification accuracy. RFE is a wrapper method for feature selection that is time-consuming, especially on big data. Li et al. [23] addressed problems of support vector machine recursive feature elimination (SVMRFE). They first proposed random value-based oversampling as a resampling method, then a variable step size method (VSSRFE) to speed up the feature selection process, and a further method called linear SVM (LLSVM). The two proposed methods were used together for feature selection. Jeon et al. [24] proposed a hybrid RFE method using benchmark datasets. This method combined the feature-importance-based RFE methods SVM-RFE, random forest RFE (RF-RFE), and gradient boosting machines RFE (GBM-RFE). Two types of weighting functions were used. The first sums the weights of the three RFE methods, and the second reflects the classification accuracies and the weights of features. Rani et al. [25] proposed a hybrid wrapper method integrating GA and RFE. Compared with other feature selection methods, it improved the classification performance after removing irrelevant features. Zvarevashe et al. [26] proposed a method to select the most relevant feature subset using RFE based on RF. The method was compared with a deep learning algorithm and proved powerful for selecting features. Senan et al. [27] proposed a method to select relevant features using RFE for a kidney disease dataset. Four classification algorithms were applied in the classification step, and the RF algorithm gave the best results.

Many researchers have combined filter and wrapper methods to select relevant features, but this hybrid has limitations: the filter method may discard important features, and wrapper methods take more time. High dimensionality is another limitation when applying this hybrid [28]. Ansari et al. [10] used filter and wrapper approaches for feature selection and proposed two hybrid methods. In the first, an F-score feature ranker and a Chi-square feature ranker were applied, and the intersection of their selected features was taken to obtain the most important features. The result of the intersection was passed to binary particle swarm optimization (BPSO) as a feature optimization approach. In the second, RFE was applied after the intersection. Zhang et al. [7] proposed a method to classify six human cancer types using CNV level values. Zhang selected features using mRMR (minimum Redundancy Maximum Relevance feature selection) and IFS (Incremental Feature Selection). The first method ranked the features by importance and selected 200 features. The second method used IFS to select the optimal set of features; it selected 19 features with an accuracy of 0.75, which is insufficient classification accuracy. Pirgazi et al. [29] proposed a hybrid filter and wrapper method for feature selection in high-dimensional datasets. In the first stage, they applied the Relief method as a filter to weight the features. In the second stage, they applied a wrapper method using the shuffled frog leaping algorithm (SFLA) and IWSSr algorithms. Mandal et al. [30] proposed a hybrid filter and wrapper method for feature selection. They applied MIFS, ReliefF, Chi-Square, and Xvariance as filter methods and took the union of the four filter results to obtain the most important features. The Whale Optimization Algorithm was then applied as the wrapper method to overcome the limitations of the filter methods. Venkatesh et al. [31] proposed a hybrid method using MIFS as a filter method and RFE as a wrapper method. The hybrid method gave better results than the individual algorithms. Gakii et al. [32] compared three feature selection algorithms: PCA, RFE and graph-based feature selection. The results showed that graph-based feature selection enhanced the performance of the sequential minimal optimization and multilayer perceptron classifiers. In addition, researchers have applied hybrid methods that combine the advantages of wrapper and embedded methods to obtain the most effective features and address the drawbacks of previous studies. Liu et al. [28] proposed a hybrid method using GA as a global search with an embedded regularization approach as a local search, in order to solve over-fitting and select relevant features. Compared with the individual algorithms, it proved effective for feature selection. Aruna et al. [33] proposed a hybrid method using LR and RFE for a diabetes dataset. RFE used LR as its estimator, and RF was applied in the classification step. Venkatachalam et al. [34] proposed a hybrid method combining ridge regression and RFE, which addressed over-fitting in feature selection. The method was compared with other models, and RF was applied in the classification step.

To address the research gaps above, this paper presents the proposed method PFBS-RFS-RFE, with three bootstrap positions, to fix feature selection problems and improve the classification model over different datasets. It aims to improve the time consumed by the RFE algorithm, classification accuracy, over-fitting, and fitting time, and to select the most effective features in order to identify the chromosome most associated with the development of human cancers in the datasets. Furthermore, we applied a resampling method to enhance the classification accuracy and reduce over-fitting [35]. The bootstrap is a resampling method that reduces variance and bias between features; the over-fitting problem is therefore minimized and classification accuracy is maximized. We use PFBS as a resampling step with the hybrid RFS-RFE to reduce over-fitting and improve classification accuracy. We compared the proposed methods with RFE, RFS, and previous work over five datasets. Four supervised machine learning classifiers were used to evaluate the performance of the proposed hybrid feature selection methods. The main contributions are summarized as follows:

We propose hybrid methods, named positions first bootstrap step random forest selection recursive feature elimination (PFBS-RFS-RFE), that combine the advantages of the wrapper and embedded feature selection methods to address many feature selection problems, including over-fitting, time consumption, feature relevance and classification accuracy, as well as the long running time of the RFE algorithm on high-dimensional datasets.
The motivation behind the proposed methods is to identify the genes or features associated with cancers; we can thereby identify the chromosome most associated with the development of human cancers by averaging over a number of runs and taking the intersection between features.
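The bootstrap resampling step referred to above draws as many samples as the dataset contains, with replacement, so some samples repeat and others are left out. The following standard-library sketch illustrates only that mechanism; the helper name `bootstrap_indices` and the seed are hypothetical, and the sample count of 2916 is taken from the parameter table given later in the paper.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices uniformly at random with replacement."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
n = 2916                 # number of samples after resampling (parameter table)
idx = bootstrap_indices(n, rng)

unique_fraction = len(set(idx)) / n   # share of distinct original samples drawn
out_of_bag = n - len(set(idx))        # samples that were never drawn
```

For large n the expected share of distinct samples approaches 1 - 1/e, roughly 63%, which is why each resample differs noticeably from the original data.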

The structure of the article is as follows. The “Introduction” section presents the feature selection problems and how previous work tried to solve them. The “Results” section presents the results of the hybrid algorithm and a comparison with other studies using the same datasets. The “Discussion” section summarizes and discusses the application of the hybrid algorithm. The “Conclusions” section presents the main idea and the importance of the proposed methods. The “Method” section presents the hybrid algorithm proposed to solve these problems.

Results

The proposed hybrid methods comprise two stages: feature selection and model performance evaluation. They are applied to the datasets to select the effective cancer genes, reduce over-fitting and improve classification accuracy. The selected features are fed to several classifiers using 10-fold cross-validation. The classifiers are LR, support vector machine (SVM), RF and bagging (Bagg). The proposed methods are compared with the individual algorithms RFE and RFS and with previous work. The results confirmed the effectiveness of the proposed methods.
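The evaluation stage described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the authors' code: the synthetic data, the `evaluate` helper, and the default classifier settings are hypothetical, and the reported variance is assumed here to be the variance of the per-fold test accuracies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Hypothetical stand-in for a dataset restricted to the selected features.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "Bagg": BaggingClassifier(random_state=0),
}


def evaluate(model, X, y, folds=10):
    """Run k-fold cross-validation and summarize train/test accuracy."""
    scores = cross_validate(
        model, X, y, cv=folds, scoring="accuracy", return_train_score=True
    )
    train = 100 * scores["train_score"].mean()
    test = 100 * scores["test_score"].mean()
    return {
        "train": train,
        "test": test,
        "diff": train - test,                # over-fitting difference in %
        "var": scores["test_score"].var(),   # assumed meaning of "Var."
    }


results = {name: evaluate(model, X, y) for name, model in classifiers.items()}
```

Requesting the training scores alongside the test scores is what makes the train/test gap, used in the tables as an over-fitting indicator, available from a single cross-validation run.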

Performance metrics

Performance evaluation is a very important step in machine learning. Selecting the most relevant features increases classification accuracy and decreases classification error. We proposed a hybrid method to obtain accurate classification and thereby address the previous drawbacks. The proposed methods are compared with the individual algorithms RFE and RFS using the following metrics:

The size of feature selection: the number of selected features.
Processing time: the time of the fitting process in seconds.
Performance accuracy: the percentage of samples that are correctly classified by a classifier.
Performance evaluation: precision, F1-score, recall, variance, and the receiver operating characteristic (ROC) area, or area under the curve (AUC) [8, 12], which measures classification performance by plotting the relationship between the true positive (TP) and false positive (FP) rates.
The calculation formulas are applied to evaluate the model performance using ensemble and regularization classifiers with 10-fold cross-validation. Table 1 presents the meanings of the symbols used in the proposed methods. The calculation formulas are as follows:

Table 1.

The meanings of the symbols

Symbol	Meaning
PPV	Positive predictive value
TP	True positive (cancer type diagnosed correctly as a cancer type)
TN	True negative (non-cancer type diagnosed correctly as a non-cancer type)
FN	False negative (cancer type diagnosed incorrectly as a non-cancer type)
FP	False positive (non-cancer type diagnosed incorrectly as a cancer type)
SF	The size of the selected features after applying the algorithm
TF	The total size of features
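Using the symbols in Table 1, the evaluation metrics can be computed as in the sketch below. It uses the standard textbook definitions of precision (PPV), recall, F1-score and accuracy, which are assumed rather than confirmed to match the paper's own formulas. The function names and the example counts are hypothetical, and for the multi-class datasets the counts would be taken per class (one-vs-rest) before averaging.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Standard metrics from the confusion-matrix counts in Table 1."""
    precision = safe_div(tp, tp + fp)  # PPV
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def selected_fraction(sf, tf):
    """Share of features retained: selected features (SF) over total (TF)."""
    return safe_div(sf, tf)


def overfitting_diff(train_acc, test_acc):
    """Gap between training and test accuracy, both in percent."""
    return train_acc - test_acc


m = classification_metrics(tp=90, tn=85, fp=15, fn=10)  # hypothetical counts
kept = selected_fraction(sf=50, tf=200)                 # hypothetical sizes
gap = overfitting_diff(100.000, 99.800)                 # train/test values of the RFE baseline on the RNA gene dataset
```

The last line corresponds to the 0.200 over-fitting difference reported for RFE on the RNA gene dataset, which is consistent with the difference being train accuracy minus test accuracy.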

Parameter	Value	Definition
NRuns	20	Number of runs.
Problem Dimensions	–	Number of features in the dataset.
X^*	2916	The number of samples produced by the bootstrap resampling method.
M	100	The number of trees used in the Random Forest algorithm.
Criterion	–	The function that measures the quality of a split; entropy is applied.
min_samples_leaf	100	The minimum number of samples required to be at a leaf node.
RFE estimators	–	A supervised learning algorithm; LR is applied.
C	0.05	Regularization parameter.
Max-iteration	100	Maximum number of iterations in the LR classifier.
Tol	0.0001	Tolerance for the stopping criterion in the LR classifier.
CV	10	Number of folds in cross-validation.

Algo.	Train %	Test %	Over-fitting Diff. %	Pre	Rec	F1-score	No.F	F-Time (sec)	C-Time (sec)	AUC	Var.	ACC %
RNA gene dataset
RFE	100.000	99.800	0.200	0.999	0.998	0.998	10,265	190,000	60.000	1.000	0.00002	99.800
RFS	100.000	99.800	0.200	0.999	0.998	0.998	374.000	13.015	0.275	1.000	0.00002	99.800
DNA CNV dataset
RFE	97.500	87.000	10.500	0.741	0.706	0.709	8190	182,295	40.000	0.960	0.023125	87.000
RFS	89.803	84.054	5.749	0.819	0.764	0.775	1234	5.000	5.085	0.955	0.000193	84.054
Parkinson’s disease dataset
RFE	76.441	75.133	1.308	0.384	0.480	0.426	376.000	144.783	0.177	0.689	0.00145	75.133
RFS	76.484	75.000	1.484	0.629	0.557	0.537	224.000	1.474	0.158	0.706	0.00108	75.000
BreastEW dataset
RFE	95.000	94.000	1.000	0.948	0.937	0.941	15.000	0.142	0.099	0.990	0.00050	94.000
RFS	95.000	93.000	2.000	0.938	0.928	0.932	27.000	0.090	0.008	0.989	0.00583	93.000
Dermatology erythemato-squamous diseases dataset
RFE	97.541	96.997	0.544	0.972	0.960	0.963	17.000	0.062	0.001	0.997	0.001073	96.997
RFS	89.860	87.725	2.135	0.837	0.821	0.815	11.000	0.094	0.002	0.985	0.002819	87.725

Algo.	Train Data %	Test Data %	Over-fitting Diff. %	Pre	Rec	F1-score	No.F	F-Time (sec)	C-Time (sec)	AUC	Var.	ACC %
RNA gene dataset
LR classifier
OFBS-RFS	100.000	99.944	0.056	1.000	0.999	1.000	379.100	9.537	0.296	1.000	0.000003	99.944
OFBS-RFS-RFE	100.000	99.981	0.019	1.000	1.000	1.000	142.500	189.35	0.445	1.000	0.0000004	99.981
SVM classifier
OFBS-RFS	100.000	99.945	0.055	1.000	0.999	1.000	379.100	9.537	0.296	1.000	0.000003	99.945
OFBS-RFS-RFE	100.000	95.038	4.962	1.000	0.961	1.000	142.500	189.35	0.192	1.000	0.0000002	95.038
RF classifier
OFBS-RFS	100.000	99.875	0.125	0.999	0.999	0.999	379.100	9.537	1.007	0.999	0.000013	99.875
OFBS-RFS-RFE	100.000	99.925	0.075	1.000	0.999	0.999	142.500	189.35	0.807	0.999	0.000005	99.925
Bagg classifier
OFBS-RFS	99.967	99.439	0.528	0.995	0.994	0.994	379.100	9.537	0.912	0.999	0.000074	99.439
OFBS-RFS-RFE	99.972	99.513	0.459	0.996	0.995	0.995	142.500	189.35	0.4482	0.999	0.000063	99.513
DNA CNV dataset
LR classifier
OFBS-RFS	93.838	90.857	2.981	0.914	0.875	0.888	1351	5.435	5.620	0.981	0.00138	90.857
OFBS-RFS-RFE	93.462	91.020	2.442	0.919	0.875	0.887	675.000	2755	2.637	0.983	0.00028	91.020
SVM classifier
OFBS-RFS	94.248	90.980	3.268	0.923	0.873	0.890	1351	5.435	27.316	0.981	0.00012	90.980
OFBS-RFS-RFE	94.248	90.980	3.268	0.923	0.873	0.890	675.000	2755	27.316	0.981	0.00048	90.980
RF classifier
OFBS-RFS	90.966	86.613	4.353	0.918	0.834	0.888	1351	5.435	2.934	0.985	0.00021	86.613
OFBS-RFS-RFE	95.687	91.265	4.421	0.921	0.875	0.890	675.000	2755	2.147	0.986	0.00074	91.265
Bagg classifier
OFBS-RFS	98.622	92.971	5.651	0.927	0.907	0.916	1351	5.435	9.080	0.929	0.00024	92.971
OFBS-RFS-RFE	95.525	92.762	2.763	0.925	0.910	0.912	675.000	2755	4.502	0.981	0.00023	92.762
Parkinson’s disease dataset
LR classifier
OFBS-RFS	78.453	77.864	0.589	0.737	0.602	0.605	228.150	1.704	0.128	0.736	0.00103	77.864
OFBS-RFS-RFE	77.050	72.740	4.310	0.700	0.556	0.579	113.850	11.213	0.149	0.705	0.00093	72.740
SVM classifier
OFBS-RFS	76.109	75.624	0.485	0.623	0.543	0.512	228.150	1.704	0.644	0.643	0.00043	75.624
OFBS-RFS-RFE	76.003	75.499	0.504	0.617	0.541	0.509	113.850	11.214	0.496	0.638	0.00041	75.499
RF classifier
OFBS-RFS	100.000	94.634	5.366	0.948	0.910	0.926	228.150	1.704	1.434	0.986	0.00064	94.634
OFBS-RFS-RFE	100.000	95.000	5.000	0.945	0.906	0.922	113.850	11.214	1.134	0.985	0.00062	95.000
Bagg classifier
OFBS-RFS	99.719	93.163	6.556	0.917	0.904	0.908	228.150	1.704	1.790	0.966	0.00092	93.163
OFBS-RFS-RFE	99.735	93.008	6.727	0.916	0.901	0.906	113.850	11.214	0.820	0.966	0.00078	93.008
Dermatology erythemato-squamous diseases dataset
LR classifier
OFBS-RFS	96.023	95.037	0.986	0.926	0.912	0.907	18.625	0.216	0.014	0.995	0.000190	95.037
OFBS-RFS-RFE	97.241	96.481	0.760	0.932	0.934	0.926	16.000	0.203	0.517	0.997	0.000730	96.481
SVM classifier
OFBS-RFS	79.269	78.375	0.894	0.672	0.708	0.668	18.625	0.216	0.169	0.973	0.002590	78.375
OFBS-RFS-RFE	99.484	98.940	0.544	0.988	0.986	0.984	16.000	0.203	0.064	0.998	0.000368	98.940
RF classifier
OFBS-RFS	100.000	100.000	0.0	1.000	1.000	1.000	18.625	0.216	0.562	1.000	0.0	100.000
OFBS-RFS-RFE	100.000	100.000	0.0	1.000	1.000	1.000	16.000	0.203	0.500	1.000	0.0	100.000
Bagg classifier
OFBS-RFS	99.970	99.730	0.240	0.997	0.995	0.996	18.625	0.216	0.077	0.997	0.000057	99.730
OFBS-RFS-RFE	99.966	99.796	0.170	0.998	0.995	0.996	16.000	0.203	0.242	0.998	0.000055	99.796
BreastEW dataset
LR classifier
OFBS-RFS	94.776	94.218	0.558	0.922	0.933	0.937	27.100	0.298	0.012	0.988	0.000001	94.218
OFBS-RFS-RFE	95.069	94.587	0.482	0.947	0.939	0.941	13.316	0.130	0.091	0.989	0.000810	94.587
SVM classifier
OFBS-RFS	92.167	91.902	0.266	0.934	0.897	0.909	27.100	0.298	0.076	0.978	0.000986	91.902
OFBS-RFS-RFE	93.301	93.114	0.187	0.913	0.914	0.927	13.316	0.130	0.070	0.982	0.001115	93.114
RF classifier
OFBS-RFS	100.000	97.864	2.136	0.984	0.981	0.979	27.100	0.298	0.506	0.997	0.000270	97.864
OFBS-RFS-RFE	100.000	98.000	2.000	0.983	0.979	0.982	13.316	0.130	0.428	0.997	0.000300	98.000
Bagg classifier
OFBS-RFS	99.889	97.548	2.341	0.977	0.972	0.974	27.100	0.298	0.101	0.949	0.000280	97.548
OFBS-RFS-RFE	99.888	97.724	2.164	0.978	0.974	0.976	13.316	0.130	0.104	0.948	0.000430	97.724

Algo.	Train Data %	Test Data %	Over-fitting Diff. %	Pre	Rec	F1-score	No.F	F-Time (sec)	C-Time (sec)	AUC	Var.	ACC %
RNA gene dataset
LR Classifier
IFBS-RFS	100.000	99.925	0.075	0.999	0.999	0.999	239.000	5.421	0.193	1.000	0.000004	99.925
IFBS-RFS-RFE	100.000	99.975	0.025	0.999	0.999	0.999	125.250	15.201	0.357	1.000	0.0000004	99.975
SVM Classifier
IFBS-RFS	99.999	99.906	0.093	0.999	0.998	0.999	239.000	5.421	0.225	1.000	0.000005	99.906
IFBS-RFS-RFE	100.000	99.988	0.012	0.999	0.999	0.999	125.250	15.201	0.153	1.000	0.0000002	99.988
RF Classifier
IFBS-RFS	100.000	99.694	0.306	0.998	0.997	0.997	239.000	5.421	0.901	1.000	0.000030	99.694
IFBS-RFS-RFE	100.000	99.807	0.193	0.999	0.998	0.998	125.250	15.201	0.737	0.999	0.000019	99.807
Bagg Classifier
IFBS-RFS	99.947	99.002	0.945	0.991	0.989	0.989	239.000	5.421	0.635	0.999	0.000075	99.002
IFBS-RFS-RFE	99.955	99.027	0.928	0.992	0.989	0.990	125.250	15.201	0.327	0.999	0.000075	99.027
DNA CNV dataset
LR Classifier
IFBS-RFS	88.525	82.889	5.636	0.812	0.752	0.763	966.000	6.250	3.876	0.942	0.000370	82.889
IFBS-RFS-RFE	88.000	84.000	4.000	0.831	0.764	0.804	482.000	1545	2.149	0.959	0.000420	84.000
SVM Classifier
IFBS-RFS	88.341	81.637	6.704	0.815	0.729	0.745	966.000	6.050	29.078	0.955	0.000580	81.637
IFBS-RFS-RFE	89.668	82.268	7.400	0.827	0.738	0.753	482.000	1545	15.721	0.960	0.000880	82.268
RF Classifier
IFBS-RFS	89.660	80.089	9.571	0.768	0.7025	0.709	966.000	6.050	3.150	0.938	0.000410	80.089
IFBS-RFS-RFE	89.935	80.138	9.797	0.770	0.703	0.719	482.000	1545	2.487	0.941	0.000470	80.138
Bagg Classifier
IFBS-RFS	97.697	78.316	19.381	0.733	0.722	0.702	966.000	6.050	7.850	0.867	0.000450	78.316
IFBS-RFS-RFE	89.035	78.309	10.726	0.730	0.695	0.702	482.000	1545	4.023	0.910	0.000480	78.309
Parkinson’s disease dataset
LR Classifier
IFBS-RFS	78.122	76.718	1.404	0.664	0.588	0.582	154.263	1.071	0.082	0.732	0.002210	76.718
IFBS-RFS-RFE	76.693	73.998	2.695	0.667	0.588	0.581	80.050	4.594	0.164	0.722	0.001990	73.998
SVM Classifier
IFBS-RFS	75.684	72.248	3.436	0.468	0.498	0.448	154.263	1.071	0.482	0.621	0.000770	72.248
IFBS-RFS-RFE	75.697	72.228	3.469	0.464	0.497	0.448	80.050	4.594	0.407	0.619	0.000760	72.228
RF Classifier
IFBS-RFS	99.999	83.912	16.087	0.811	0.738	0.760	154.263	1.071	1.230	0.866	0.003480	83.912
IFBS-RFS-RFE	100.000	81.485	18.515	0.773	0.700	0.719	80.050	4.594	0.964	0.834	0.003500	81.485
Bagg Classifier
IFBS-RFS	99.590	80.810	17.78	0.754	0.731	0.737	154.263	1.071	1.225	0.826	0.003460	80.810
IFBS-RFS-RFE	99.580	79.191	20.389	0.729	0.707	0.713	80.050	4.594	0.546	0.804	0.003440	79.191
Dermatology erythemato-squamous diseases dataset
LR classifier
IFBS-RFS	91.531	91.000	0.531	0.771	0.796	0.777	13.000	0.515	0.002	0.988	0.000881	91.000
IFBS-RFS-RFE	92.198	91.801	0.397	0.773	0.799	0.780	12.000	0.016	0.020	0.988	0.000975	91.801
SVM classifier
IFBS-RFS	94.870	93.979	0.891	0.888	0.878	0.875	13.000	0.515	0.023	0.988	0.001285	93.979
IFBS-RFS-RFE	94.869	93.979	0.890	0.888	0.878	0.875	12.000	0.016	0.075	0.989	0.001285	93.979
RF classifier
IFBS-RFS	97.025	93.183	3.482	0.900	0.892	0.889	13.000	0.515	0.142	0.984	0.001493	93.183
IFBS-RFS-RFE	97.000	93.500	3.500	0.900	0.892	0.889	12.000	0.016	0.140	0.980	0.001490	93.500
Bagg classifier
IFBS-RFS	96.903	92.102	4.801	0.895	0.884	0.881	13.000	0.515	0.016	0.989	0.003297	92.102
IFBS-RFS-RFE	97.177	81.194	15.983	0.789	0.764	0.760	12.000	0.016	0.014	0.970	0.081251	81.194
BreastEW dataset
LR Classifier
IFBS-RFS	94.394	93.678	0.461	0.938	0.938	0.938	23.100	0.410	0.012	0.988	0.000690	93.678
IFBS-RFS-RFE	94.855	94.403	0.452	0.946	0.946	0.946	11.900	0.103	0.091	0.992	0.000520	94.403
SVM Classifier
IFBS-RFS	92.010	91.563	0.447	0.929	0.929	0.929	23.100	0.410	0.069	0.976	0.001010	91.563
IFBS-RFS-RFE	93.888	93.503	0.385	0.944	0.944	0.944	11.900	0.103	0.059	0.983	0.000550	93.503
RF Classifier
IFBS-RFS	100.000	96.411	3.589	0.965	0.965	0.965	23.100	0.410	0.452	0.991	0.000980	96.411
IFBS-RFS-RFE	100.000	95.277	4.723	0.952	0.952	0.952	11.900	0.103	0.433	0.989	0.000930	95.277
Bagg Classifier
IFBS-RFS	99.625	95.302	4.323	0.954	0.954	0.954	23.100	0.410	0.099	0.985	0.000920	95.302
IFBS-RFS-RFE	99.610	94.416	5.194	0.944	0.944	0.944	11.900	0.103	0.085	0.981	0.001170	94.416

Algo.	Train Data %	Test Data %	Over-fitting Diff. %	Pre	Rec	F1-score	NO.F	F-Time (sec)	C-Time (sec)	AUC	Var.	ACC %
RNA gene dataset
LR Classifier
O/IFBS-RFS	100.000	99.975	0.025	0.999	0.999	0.999	238.800	4.220	0.176	1.000	0.0000006	99.975
O/IFBS-RFS-RFE	100.000	99.994	0.006	0.999	0.999	0.999	119.200	13.726	0.307	1.000	0.0000004	99.994
SVM Classifier
O/IFBS-RFS	100.000	99.950	0.05	0.999	0.999	0.999	238.800	4.220	0.197	1.000	0.0000025	99.950
O/IFBS-RFS-RFE	100.000	99.981	0.019	0.999	0.999	0.999	119.200	13.726	0.125	1.000	0.0000004	99.981
RF Classifier
O/IFBS-RFS	100.000	99.888	0.112	0.999	0.999	0.999	238.800	4.220	0.755	1.000	0.0000076	99.888
O/IFBS-RFS-RFE	100.000	99.913	0.087	0.999	0.999	0.999	119.200	13.726	0.596	0.999	0.0000054	99.913
Bagg Classifier
O/IFBS-RFS	99.974	99.357	0.617	0.994	0.992	0.993	238.800	4.220	0.513	0.999	0.0000828	99.357
O/IFBS-RFS-RFE	99.972	99.363	0.609	0.994	0.993	0.993	119.200	13.726	0.266	0.999	0.000083	99.363
DNA CNV dataset
LR Classifier
O/IFBS-RFS	92.581	89.818	2.763	0.904	0.861	0.877	973.000	3.650	3.850	0.975	0.00031	89.818
O/IFBS-RFS-RFE	91.878	89.601	2.277	0.906	0.857	0.885	485.000	1460	1.950	0.936	0.00035	89.601
SVM Classifier
O/IFBS-RFS	93.361	90.253	3.108	0.917	0.860	0.878	973.000	3.650	22.00	0.980	0.00065	90.253
O/IFBS-RFS-RFE	94.241	90.979	3.262	0.925	0.873	0.891	485.000	1460	11.700	0.985	0.00028	90.979
RF Classifier
O/IFBS-RFS	95.527	90.764	4.763	0.914	0.868	0.882	973.000	3.650	2.650	0.984	0.00027	90.764
O/IFBS-RFS-RFE	95.681	90.954	4.727	0.919	0.872	0.890	485.000	1460	1.750	0.941	0.00027	90.954
Bagg Classifier
O/IFBS-RFS	97.958	92.712	5.246	0.926	0.906	0.913	973.000	3.650	6.550	0.980	0.00027	92.712
O/IFBS-RFS-RFE	95.318	92.834	2.484	0.927	0.906	0.913	485.000	1460	3.150	0.980	0.00027	92.834
Parkinson’s disease dataset
LR classifier
O/IFBS-RFS	79.050	78.482	0.568	0.742	0.619	0.626	155.50	1.058	0.093	0.764	0.00123	78.482
O/IFBS-RFS-RFE	77.744	77.427	0.317	0.712	0.597	0.598	77.550	5.551	0.118	0.731	0.00092	77.427
SVM classifier
O/IFBS-RFS	76.009	75.442	0.567	0.612	0.539	0.508	155.500	1.058	0.511	0.637	0.00041	75.442
O/IFBS-RFS-RFE	77.500	76.672	0.828	0.653	0.566	0.542	77.550	5.551	0.420	0.669	0.00051	76.672
RF classifier
O/IFBS-RFS	100.000	94.494	5.506	0.945	0.909	0.924	155.500	1.058	1.122	0.985	0.00064	94.494
O/IFBS-RFS-RFE	100.000	94.082	5.918	0.943	0.901	0.917	77.550	5.551	0.911	0.983	0.00070	94.082
Bagg Classifier
O/IFBS-RFS	99.720	93.196	6.524	0.916	0.906	0.909	155.500	1.058	1.091	0.965	0.00093	93.196
O/IFBS-RFS-RFE	99.719	92.917	6.802	0.914	0.900	0.905	77.550	5.550	0.511	0.966	0.00084	92.917
Dermatology erythemato-squamous diseases dataset
LR classifier
O/IFBS-RFS	96.691	96.441	0.250	0.649	0.624	0.630	11.000	0.167	0.025	0.998	0.000848	96.441
O/IFBS-RFS-RFE	92.532	92.350	0.212	0.801	0.751	0.766	10.000	0.500	0.128	0.999	0.000790	92.350
SVM classifier
O/IFBS-RFS	95.082	95.000	0.082	0.638	0.608	0.613	11.000	0.167	0.025	0.977	0.000632	95.000
O/IFBS-RFS-RFE	98.361	98.356	0.005	0.892	0.900	0.895	10.000	0.500	0.047	0.999	0.001040	98.356
RF classifier
O/IFBS-RFS	100.000	100.000	0.0	1.000	1.000	1.000	11.000	0.167	0.562	1.000	0.0	100.00
O/IFBS-RFS-RFE	100.000	100.000	0.0	1.000	1.000	1.000	10.000	0.500	0.500	1.000	0.0	100.00
Bagg classifier
O/IFBS-RFS	100.000	100.000	0.0	1.000	1.000	1.000	11.000	0.167	0.520	0.999	0.0	100.000
O/IFBS-RFS-RFE	100.000	100.000	0.0	1.000	1.000	1.000	10.000	0.500	0.500	0.991	0.0	100.000
BreastEW dataset
LR classifier
O/IFBS-RFS	94.647	94.148	0.499	0.944	0.932	0.936	22.900	0.399	0.010	0.988	0.00095	94.148
O/IFBS-RFS-RFE	95.305	94.842	0.463	0.949	0.942	0.944	11.300	0.1033	0.091	0.992	0.00086	94.842
SVM classifier
O/IFBS-RFS	92.110	91.889	0.221	0.934	0.897	0.909	22.900	0.399	0.067	0.978	0.00098	91.889
O/IFBS-RFS-RFE	93.515	93.400	0.115	0.943	0.918	0.927	11.300	0.103	0.058	0.983	0.00094	93.400
RF classifier
O/IFBS-RFS	99.563	97.500	2.063	0.981	0.976	0.977	22.900	0.3994	0.411	0.996	0.00031	97.500
O/IFBS-RFS-RFE	100.000	98.000	2.000	0.979	0.977	0.978	11.300	0.103	0.404	0.997	0.00031	98.000
Bagg Classifier
O/IFBS-RFS	99.819	97.618	2.201	0.977	0.973	0.974	22.900	0.399	0.089	0.994	0.00038	97.618
O/IFBS-RFS-RFE	99.803	97.505	2.298	0.976	0.972	0.973	11.300	0.103	0.065	0.993	0.00034	97.505
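The O/IFBS-RFS-RFE rows above combine an outer bootstrap of the samples, a random forest whose trees are each grown on their own bootstrap sample, and RFE with logistic regression as the estimator. The following is a minimal sketch of that sequence, not the authors' implementation. It assumes scikit-learn, synthetic data from `make_classification`, a single hold-out split, and illustrative feature counts; the function name `oifbs_rfs_rfe` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample


def oifbs_rfs_rfe(X, y, n_keep_rfs, n_keep_rfe, seed=0):
    """Return the column indices kept after RFS followed by RFE (hypothetical helper)."""
    # Outer bootstrap: resample the rows with replacement before selection.
    X_boot, y_boot = resample(X, y, replace=True, random_state=seed)

    # Inner bootstrap: bootstrap=True grows each tree on its own resample.
    forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=seed)
    forest.fit(X_boot, y_boot)

    # RFS: keep the n_keep_rfs features with the highest importance.
    rfs_idx = np.argsort(forest.feature_importances_)[::-1][:n_keep_rfs]

    # RFE with logistic regression, applied only to the RFS subset.
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n_keep_rfe)
    rfe.fit(X_boot[:, rfs_idx], y_boot)
    return rfs_idx[rfe.support_]


# Synthetic stand-in data (assumption): 200 samples, 40 features.
X, y = make_classification(n_samples=200, n_features=40, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

# RFE keeps half of the RFS subset here, mirroring the halving of NO.F in the table.
selected = oifbs_rfs_rfe(X_tr, y_tr, n_keep_rfs=20, n_keep_rfe=10)

clf = LogisticRegression(max_iter=1000).fit(X_tr[:, selected], y_tr)
train_acc = 100 * clf.score(X_tr[:, selected], y_tr)
test_acc = 100 * clf.score(X_te[:, selected], y_te)
overfit_diff = train_acc - test_acc  # analogue of the "Over-fitting Diff. %" column
print(f"NO.F={len(selected)} train={train_acc:.3f} test={test_acc:.3f} diff={overfit_diff:.3f}")
```

In this sketch, dropping the `resample` call would correspond to the inner-only position (IFBS), and passing `bootstrap=False` to the forest would correspond to the outer-only position (OFBS).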

Datasets	Before ACC %	Before Over-fitting Diff. %	Before NO.F	Before C-Time (sec)	Before Var.	After ACC %	After Over-fitting Diff. %	After NO.F	After C-Time (sec)	After Var.
RNA gene	99.800	0.200	20,531	16.547	0.000015	99.994	0.006	119.200	0.307	0.0000004
DNA CNV	85.000	12.600	16,381	170	0.000580	92.762	2.763	675.000	0.981	0.000230
Parkinson’s disease	93.677	0.800	753	2.000	0.000634	95.000	5.000	113.850	1.134	0.000620
Dermatology diseases	97.807	0.493	34	0.003	0.000810	100.000	0.0	10.000	0.500	0.0
BreastEW	75.928	2.072	30	0.500	0.002092	98.000	2.000	13.300	0.428	0.000300
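As a worked example of how to read the before/after table above, the relative reduction in the number of features can be computed directly from its NO.F columns. The snippet below is only an illustrative calculation; the helper name `reduction_pct` is hypothetical, and the feature counts are copied from the table.

```python
# (features before, features after PFBS-RFS-RFE), copied from the table above
feature_counts = {
    "RNA gene": (20531, 119.2),
    "DNA CNV": (16381, 675.0),
    "Parkinson's disease": (753, 113.85),
    "Dermatology diseases": (34, 10.0),
    "BreastEW": (30, 13.3),
}


def reduction_pct(before, after):
    """Percentage of the original features that were removed."""
    return 100.0 * (before - after) / before


reductions = {
    name: round(reduction_pct(before, after), 2)
    for name, (before, after) in feature_counts.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.2f}% fewer features")
```

The two large datasets lose well over 90% of their features, while the small datasets, which start with only 30 to 34 features, are reduced far less in relative terms.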

Reference	Dataset	FS Approach	No. of selected features	Var.	AUC	ACC %
García-Díaz et al. [36]	RNA gene	GGA	49	0.000303	–	98.810
Zhang et al. [7]	DNA CNV	mRMR & IFS	19	0.000580	0.973	75.000
Sanaa et al. [8]	DNA CNV	PSO & GA	2050	–	0.961	84.600
Sanaa et al. [12]	DNA CNV	IG	16,381	–	0.965	85.900
Sakar et al. [37]	Parkinson’s disease	mRMR	50	–	–	85.000
Hegazy et al. [9]	BreastEW	CSSA	5.200	–	–	97.080

Datasets	Train Data %	Test Data %	Over-fitting Diff. %	Pre	Rec	F1-score	NO.F	F-Time (sec)	C-Time (sec)	AUC	Var.	ACC %
J48 classifier
RNA gene	99.154	97.125	2.029	0.973	0.974	0.972	4083	3.860	1.030	0.987	0.000627	97.125
DNA CNV	64.034	63.682	0.352	0.539	0.537	0.533	41	2.950	0.009	0.815	0.000997	63.682
Parkinson’s disease	87.683	78.421	9.262	0.730	0.673	0.691	119	0.180	0.016	0.713	0.005994	78.421
Dermatology diseases	74.438	73.776	0.662	0.533	0.598	0.547	9	0.150	0.0002	0.884	0.002967	73.776
BreastEW	94.083	90.865	3.218	0.905	0.902	0.902	3	0.050	0.0002	0.950	0.000596	90.865
Naïve Bayes classifier
RNA gene	99.861	98.503	1.358	0.986	0.979	0.981	4083	3.860	0.048	0.987	0.000131	98.503
DNA CNV	64.864	64.196	0.668	0.601	0.620	0.598	41	2.950	0.001	0.884	0.001219	64.196
Parkinson’s disease	80.904	79.915	0.989	0.753	0.681	0.695	119	0.180	0.001	0.762	0.005410	79.915
Dermatology diseases	62.360	61.441	0.919	0.573	0.628	0.561	9	0.150	0.0007	0.935	0.005554	61.441
BreastEW	93.556	92.973	0.583	0.932	0.921	0.924	3	0.050	0.0008	0.979	0.000886	92.973
K-nearest neighbors (KNN) classifier
RNA gene	99.792	99.627	0.165	0.998	0.996	0.997	4083	3.860	0.007	1.000	0.000036	99.627
DNA CNV	78.986	72.599	6.387	0.691	0.626	0.630	41	2.950	0.00009	0.862	0.000289	72.599
Parkinson’s disease	89.021	80.282	8.739	0.751	0.694	0.708	119	0.180	0.0008	0.743	0.002007	80.282
Dermatology diseases	85.247	79.767	5.48	0.823	0.783	0.780	9	0.150	0.0004	0.943	0.006681	79.767
BreastEW	87.503	81.172	6.331	0.815	0.793	0.797	3	0.050	0.0008	0.857	0.005639	81.172

Datasets	Train Data %	Test Data %	Over-fitting Diff. %	Pre	Rec	F1-score	NO.F	F-Time (sec)	C-Time (sec)	AUC	Var.	ACC %
J48 classifier
RNA gene	93.106	91.136	1.97	0.880	0.869	0.869	3	1.850	0.001	0.963	0.000641	91.136
DNA CNV	64.842	63.921	0.921	0.550	0.549	0.543	42	1.600	0.012	0.816	0.001172	63.921
Parkinson’s disease	86.346	80.279	6.067	0.765	0.688	0.707	11	1.100	0.003	0.738	0.002407	80.279
Dermatology diseases	87.918	87.740	0.178	0.751	0.755	0.742	12	0.102	0.0006	0.946	0.002407	87.740
BreastEW	96.993	94.380	2.613	0.950	0.933	0.939	8	0.090	0.0008	0.971	0.000873	94.380
Naïve Bayes classifier
RNA gene	97.545	97.380	0.165	0.972	0.970	0.970	3	1.850	0.001	0.994	0.000188	97.380
DNA CNV	73.015	71.606	1.409	0.682	0.698	0.680	42	1.600	0.006	0.923	0.001276	71.606
Parkinson’s disease	76.940	75.516	1.424	0.666	0.599	0.605	11	1.100	0.0005	0.729	0.003051	75.516
Dermatology diseases	90.265	89.839	0.426	0.878	0.901	0.867	12	0.102	0.0007	0.995	0.002740	89.839
BreastEW	94.630	94.201	0.429	0.943	0.934	0.937	8	0.090	0.002	0.988	0.000686	94.201
K-nearest neighbors (KNN) classifier
RNA gene	97.517	97.131	0.386	0.963	0.964	0.962	3	1.850	0.001	0.993	0.000311	97.131
DNA CNV	82.735	77.059	5.676	0.750	0.680	0.685	42	1.600	0.0001	0.888	0.000465	77.059
Parkinson’s disease	79.100	69.698	9.402	0.557	0.524	0.517	11	1.100	0.002	0.583	0.002962	69.698
Dermatology diseases	97.632	95.916	1.716	0.962	0.943	0.947	12	0.102	0.0008	0.994	0.001339	95.916
BreastEW	95.704	93.496	2.208	0.937	0.925	0.929	8	0.090	0.0009	0.965	0.000621	93.496

Datasets	Train Data %	Test Data %	Over-fitting Diff. %	Pre	Rec	F1-score	NO.F	F-Time (sec)	C-Time (sec)	AUC	Var.	ACC %
J48 classifier
RNA gene	96.823	94.884	1.939	0.943	0.954	0.942	700	1.027	0.174	0.985	0.000740	94.884
DNA CNV	62.334	60.665	1.669	0.526	0.516	0.505	2800	49.795	1.549	0.827	0.001724	60.665
Parkinson’s disease	83.745	73.816	9.929	0.625	0.607	0.602	250	0.079	0.028	0.645	0.002418	73.816
Dermatology diseases	81.148	80.878	0.270	0.576	0.663	0.604	18	0.016	0.003	0.925	0.000142	80.878
BreastEW	95.723	93.493	2.230	0.942	0.925	0.929	20	0.016	0.002	0.959	0.000832	93.493
Naïve Bayes classifier
RNA gene	87.072	79.403	7.669	0.794	0.806	0.794	700	1.027	0.005	0.954	0.002116	79.403
DNA CNV	29.336	27.641	1.695	0.255	0.352	0.233	2800	49.795	0.077	0.680	0.000319	27.641
Parkinson’s disease	74.471	73.821	0.65	0.604	0.558	0.545	250	0.079	0.009	0.698	0.002826	73.821
Dermatology diseases	98.361	96.179	2.182	0.961	0.952	0.953	18	0.016	0.0001	0.997	0.001025	96.179
BreastEW	90.041	89.803	0.238	0.896	0.886	0.889	20	0.016	3.057	0.962	0.001707	89.803
K-nearest neighbors (KNN) classifier
RNA gene	99.750	99.740	0.010	0.999	0.997	0.998	700	1.027	0.002	0.999	0.000059	99.740
DNA CNV	81.200	74.348	6.852	0.663	0.639	0.634	2800	49.795	0.010	0.867	0.000273	74.348
Parkinson’s disease	81.158	72.612	8.546	0.612	0.571	0.575	250	0.079	0.001	0.627	0.002308	72.612
Dermatology diseases	92.380	87.162	5.218	0.862	0.860	0.844	18	0.050	0.00002	0.969	0.001984	87.162
BreastEW	94.728	92.976	1.752	0.930	0.921	0.924	20	0.030	0.0003	0.961	0.000953	92.976

Datasets	Train Data %	Test Data %	Over-fitting Diff. %	Pre	Rec	F1-score	NO.F	F-Time (sec)	C-Time (sec)	AUC	Var.	ACC %
LR classifier
RNA gene	100.000	99.750	0.250	0.999	0.997	0.998	650	1200.011	0.251	1.000	0.000028	99.750
DNA CNV	91.819	79.699	12.120	0.746	0.688	0.689	505	2296.409	0.686	0.940	0.000529	79.699
Parkinson’s disease	74.617	73.011	1.606	0.500	0.515	0.479	145	61.005	0.017	0.659	0.002502	73.011
Dermatology diseases	95.508	95.075	0.433	0.950	0.908	0.919	15	3.996	0.002	0.995	0.000796	95.075
BreastEW	93.085	92.620	0.465	0.936	0.910	0.917	19	4.181	0.002	0.981	0.003358	92.620
SVM classifier
RNA gene	100.000	99.748	0.252	0.999	0.997	0.998	650	1200.011	0.382	1.000	0.000028	99.748
DNA CNV	92.486	83.848	8.638	0.845	0.747	0.766	505	2296.409	3.609	0.961	0.000559	83.848
Parkinson’s disease	75.661	72.379	3.282	0.435	0.497	0.443	145	61.005	0.142	0.639	0.004378	72.379
Dermatology diseases	52.793	52.185	0.608	0.325	0.463	0.363	15	3.996	0.053	0.948	0.002302	52.185
BreastEW	89.049	88.938	0.111	0.915	0.860	0.870	19	4.181	0.044	0.945	0.005405	88.938
RF classifier
RNA gene	100.000	99.627	0.373	0.998	0.996	0.997	650	1200.011	0.398	1.000	0.000036	99.627
DNA CNV	90.959	79.935	11.024	0.727	0.690	0.689	505	2296.409	0.534	0.942	0.001249	79.935
Parkinson’s disease	100.000	81.918	18.082	0.767	0.703	0.709	145	61.005	0.467	0.833	0.011138	81.918
Dermatology diseases	100.000	97.553	2.447	0.981	0.968	0.972	15	3.996	0.1000	0.999	0.000561	97.553
BreastEW	100.00	95.604	4.396	0.960	0.950	0.952	19	4.181	0.183	0.991	0.002693	95.604
Bagg classifier
RNA gene	99.961	98.746	1.215	0.991	0.984	0.986	650	1200.011	1.680	0.999	0.000430	98.746
DNA CNV	97.817	77.468	20.349	0.731	0.682	0.687	505	2296.409	1.135	0.910	0.000937	77.468
Parkinson’s disease	99.498	79.369	20.129	0.725	0.712	0.705	145	61.005	0.561	0.799	0.010620	79.369
Dermatology diseases	99.545	94.017	5.528	0.951	0.936	0.937	15	3.996	0.009	0.982	0.002412	94.017
BreastEW	99.642	93.684	5.958	0.943	0.928	0.931	19	4.181	0.042	0.980	0.002369	93.684

Datasets	Train Data %	Test Data %	Over-fitting Diff. %	Pre	Rec	F1-score	NO.F	F-Time (sec)	C-Time (sec)	AUC	Var.	ACC %
J48 classifier
RNA gene	99.154	97.625	1.529	0.980	0.979	0.979	10,000	1.950	2.887	0.992	0.000432	97.625
DNA CNV	65.322	63.922	1.4	0.546	0.548	0.540	8000	1.550	0.994	0.816	0.001225	63.922
Parkinson’s disease	83.858	74.759	9.099	0.635	0.599	0.593	300	0.950	0.060	0.710	0.013702	74.759
Dermatology diseases	79.356	78.701	0.655	0.570	0.647	0.591	20	0.500	0.0007	0.924	0.000719	78.701
BreastEW	96.485	93.499	2.986	0.467	0.448	0.455	16	0.350	0.003	0.959	0.002994	93.499
Naïve Bayes classifier
RNA gene	99.855	96.881	2.974	0.962	0.955	0.955	10,000	1.950	0.115	0.973	0.000976	96.881
DNA CNV	65.678	64.679	0.999	0.629	0.668	0.620	8000	1.550	0.297	0.849	0.001359	64.679
Parkinson’s disease	81.550	79.338	2.212	0.742	0.729	0.722	300	0.950	0.003	0.775	0.014727	79.338
Dermatology diseases	87.403	85.570	1.833	0.808	0.852	0.806	20	0.500	0.0007	0.979	0.002454	85.570
BreastEW	94.681	94.415	0.266	0.481	0.444	0.460	16	0.350	0.0005	0.989	0.002167	94.415
K-nearest neighbors (KNN) classifier
RNA gene	99.875	99.873	0.002	0.999	0.999	0.999	10,000	1.950	0.016	1.000	0.000031	99.873
DNA CNV	80.735	74.246	6.489	0.708	0.655	0.654	8000	1.550	0.013	0.874	0.000378	74.246
Parkinson’s disease	85.801	71.708	14.093	0.595	0.586	0.579	300	0.950	0.0007	0.608	0.007203	71.708
Dermatology diseases	92.289	86.059	6.23	0.872	0.848	0.840	20	0.500	0.001	0.961	0.004704	86.059
BreastEW	94.005	91.739	2.266	0.462	0.427	0.442	16	0.350	0.0002	0.964	0.000553	91.739

Datasets	No. Intersection Features	Feature indices or feature names	Feature or gene Description	Reference in cancer
RNA gene	1	G110	–	–
DNA CNV	12	PPP1R8	Through alternative splicing, this gene encodes three different isoforms [38].	[39]
		SCARNA1	Small Cajal body-specific RNA 1 [38].	[40]
		RPA2	A subunit of the replication protein A (RPA) complex is encoded by this gene [38].	[41]
		SMPDL3B	Sphingomyelin phosphodiesterase acid like 3B [38].	[42]
		XKR8	Promotes phosphatidylserine exposure on the apoptotic cell surface, possibly by mediating phospholipid scrambling [43].	[44]
		PHACTR4	This gene encodes a member of the phosphatase and actin regulator (PHACTR) family [38].	[45]
		RCC1	Regulator of chromosome condensation 1 [38].	[46]
		SNHG3	Small nucleolar RNA host gene 3 [32].	[47]
		SNORD99	Small nucleolar RNA, C/D box 99 [38].	[48]
		SNORA16A	Small nucleolar RNA, H/ACA box 16A [38].	[49]
		RAB42	Member RAS oncogene family [38].	–
		TAF12	This gene is involved in the control of transcription by RNA polymerase II [38].	[50]
Parkinson’s disease	7	IMF_SNR_TKEO	–	–
		IMF_NSR_TKEO	–	–
		mean_MFCC_1st_coef	–	–
		mean_4th_delta_delta	–	–
		mean_5th_delta_delta	–	–
		mean_6th_delta_delta	–	–
		mean_7th_delta_delta	–	–
BreastEW	1	Radius	Can be defined as the mean of distances from center to points on the perimeter [51].	[51]
Datasets	No. Intersection Features	Feature indices or feature names	Feature or gene Description	Reference in dermatology
Dermatology erythemato-squamous diseases	5	Borders	The border of the lesion, which is important for diagnosis and for assessing other features [52, 53].	[53]
		Parakeratosis	Nucleated keratinocytes are present in the stratum corneum due to accelerated keratinocytic turnover [54].	[54]
		Spongiosis	Intraepidermal eosinophils are present in spongiotic zones [55].	[55, 56]
		Itching	Itching is a persistent unpleasant sensation that affects the human psyche [57].	[56]
		Age	The age at disease onset [58].	[58]

Category Type	DS No.	Datasets	#Features	#Samples	#Class
Small D < 100	D1	BreastEW	30	569	2
Small D < 100	D2	Dermatology erythemato-squamous diseases	34	366	6
Medium 100 < D < 1000	D3	Parkinson’s disease	753	756	2
Large 1000 < D < 21,000	D4	DNA CNV	16,381	2916	6
Large 1000 < D < 21,000	D5	RNA gene	20,531	801	5

Datasets	Train Data %	Test Data %	Over-fitting Diff. %	Pre	Rec	F1-score	NO.F	F-Time (sec)	C-Time (sec)	AUC	Var.	ACC %
Mutual Information
KNN classifier
RNA gene	99.736	99.627	0.109	0.998	0.996	0.997	10,000	258.902	0.008	1.000	0.000036	99.627
DNA CNV	82.686	76.097	6.589	0.745	0.663	0.667	9000	180.314	0.011	0.854	0.000368	76.097
Parkinson’s disease	80.879	72.479	8.400	0.610	0.568	0.572	300	2.121	0.0001	0.624	0.002344	72.479
Dermatology diseases	97.966	97.267	0.699	0.975	0.969	0.969	25	0.351	0.002	0.963	0.000839	97.267
BreastEW	94.435	92.628	1.807	0.927	0.917	0.920	20	0.083	0.00002	0.958	0.001419	92.628
Correlation Based Feature
KNN classifier
RNA gene	99.867	99.748	0.119	0.999	0.997	0.998	900	2.600	0.003	1.000	0.000092	99.748
DNA CNV	52.831	49.073	3.758	0.447	0.402	0.369	750	1.850	0.003	0.669	0.000490	49.073
Parkinson’s disease	81.158	72.612	8.546	0.612	0.571	0.575	320	0.255	0.002	0.627	0.002308	72.612
Dermatology diseases	94.171	90.953	3.218	0.871	0.855	0.846	20	0.202	0.002	0.947	0.002243	90.953
BreastEW	94.747	92.976	1.771	0.931	0.920	0.924	17	0.105	0.002	0.961	0.000953	92.976
Fast Correlation Based Feature
KNN classifier
RNA gene	99.742	99.625	0.117	0.998	0.996	0.997	400	1.750	0.001	1.000	0.000131	99.625
DNA CNV	81.390	76.236	5.154	0.721	0.671	0.676	13	0.800	0.007	0.905	0.001131	76.236
Parkinson’s disease	82.657	73.270	9.387	73.271	0.585	0.587	16	1.500	0.002	0.675	0.001767	73.270
Dermatology diseases	97.936	97.005	0.931	0.970	0.967	0.966	14	0.101	0.002	0.961	0.001217	97.005
BreastEW	95.333	95.078	0.255	0.953	0.945	0.947	7	0.006	0.002	0.953	0.000261	95.078

Datasets	Train Data %	Test Data %	Over-fitting Diff. %	Pre	Rec	F1-score	NO.F	F-Time (sec)	C-Time (sec)	AUC	Var.	ACC %
Information gain
KNN classifier
RNA gene	99.723	99.627	0.096	0.998	0.996	0.997	3576	1.182	0.005	1.000	0.000036	99.627
DNA CNV	81.310	74.448	6.862	0.671	0.640	0.636	3315	5.651	0.014	0.850	0.001361	74.448
Parkinson’s disease	81.026	72.479	8.547	0.777	0.885	0.827	396	0.093	0.0008	0.452	0.002384	72.479
Dermatology diseases	97.996	97.267	0.729	0.974	0.970	0.969	25	0.032	0.0002	0.964	0.000496	97.267
BreastEW	94.728	92.976	1.752	0.924	0.888	0.904	22	0.064	0.002	0.969	0.000953	92.976
Naïve Bayes classifier
RNA gene	100.000	96.380	3.620	0.9685	0.946	0.952	3576	1.182	0.029	0.970	0.000710	96.380
DNA CNV	66.994	65.637	1.357	0.647	0.657	0.626	3315	5.651	0.435	0.832	0.001336	65.637
Parkinson’s disease	74.618	74.070	0.548	0.803	0.867	0.833	396	0.093	0.004	0.721	0.006235	74.070
Dermatology diseases	86.947	85.781	1.166	0.828	0.856	0.802	25	0.032	0.0007	0.984	0.001143	85.781
BreastEW	94.259	93.853	0.406	0.946	0.887	0.914	22	0.064	0.002	0.988	0.000766	93.853
Decision tree classifier
RNA gene	99.154	97.250	1.904	0.975	0.977	0.975	3576	1.182	0.836	0.989	0.000444	97.250
DNA CNV	65.626	64.574	1.052	0.553	0.559	0.551	3315	5.651	2.067	0.820	0.000900	64.574
Parkinson’s disease	86.449	75.528	10.921	0.810	0.878	0.841	396	0.093	0.077	0.707	0.005645	75.528
Dermatology diseases	88.494	85.015	3.479	0.725	0.745	0.721	25	0.032	0.0008	0.934	0.003209	85.015
BreastEW	96.466	94.029	2.437	0.924	0.920	0.918	22	0.064	0.003	0.967	0.001310	94.029
Chi-square
KNN classifier
RNA gene	99.847	99.750	0.097	0.999	0.997	0.998	7555	0.0801	0.010	1.000	0.000028	99.750
DNA CNV	70.283	59.635	10.648	0.526	0.498	0.492	5555	0.528	0.005	0.753	0.002142	59.635
Parkinson’s disease	81.158	72.612	8.546	0.778	0.886	0.828	398	0.016	0.001	0.452	0.002308	72.612
Dermatology diseases	92.622	88.498	4.124	0.889	0.875	0.865	24	0.094	0.0009	0.967	0.002641	88.498
BreastEW	94.728	92.976	1.752	0.924	0.888	0.904	21	0.016	0.002	0.969	0.000953	92.976
Naïve Bayes classifier
RNA gene	100.000	75.787	24.213	0.746	0.683	0.680	7555	0.0801	0.184	0.809	0.004833	75.787
DNA CNV	49.600	48.765	0.835	0.512	0.506	0.468	5555	0.528	0.552	0.734	0.000893	48.765
Parkinson’s disease	74.691	74.207	0.484	0.797	0.879	0.835	398	0.016	0.006	0.708	0.008073	74.207
Dermatology diseases	89.445	87.164	2.281	0.816	0.860	0.817	24	0.094	0.0004	0.975	0.001954	87.164
BreastEW	94.220	93.678	0.542	0.946	0.882	0.912	21	0.016	0.0007	0.988	0.000967	93.678
Decision tree classifier
RNA gene	99.154	97.250	1.904	0.974	0.976	0.974	7555	0.0801	5.221	0.990	0.000410	97.250
DNA CNV	58.451	55.863	2.588	0.464	0.458	0.452	5555	0.528	2.155	0.776	0.000253	55.863
Parkinson’s disease	83.127	76.721	6.406	0.823	0.878	0.849	398	0.016	0.117	0.729	0.005338	76.721
Dermatology diseases	88.494	85.015	3.479	0.725	0.745	0.721	24	0.094	0.0008	0.934	0.003209	85.015
BreastEW	96.466	93.327	3.139	0.915	0.911	0.909	21	0.016	0.010	0.966	0.002513	93.327
Bat algorithm
KNN classifier
RNA gene	99.861	99.752	0.109	0.999	0.997	0.998	6483	1350	0.012	1.000	0.000027	99.752
DNA CNV	81.786	75.309	6.477	0.685	0.644	0.639	5301	1280	0.008	0.864	0.000235	75.309
Parkinson’s disease	80.277	69.326	10.951	0.757	0.869	0.809	35	0.305	9.422	0.507	0.003475	69.326
Dermatology diseases	98.027	97.260	0.767	0.971	0.970	0.969	19	0.255	0.002	0.974	0.001678	97.260
BreastEW	94.728	92.976	1.752	0.924	0.888	0.904	14	0.200	0.001	0.969	0.000953	92.976
Naïve Bayes classifier
RNA gene	99.882	83.777	16.105	0.858	0.787	0.792	6483	1350	0.084	0.875	0.002498	83.777
DNA CNV	67.463	66.290	1.173	0.654	0.663	0.632	5301	1280	0.186	0.873	0.002012	66.290
Parkinson’s disease	75.617	74.744	0.873	0.792	0.899	0.841	35	0.305	0.0008	0.706	0.005885	74.744
Dermatology diseases	86.109	85.540	0.569	0.801	0.850	0.799	19	0.255	0.001	0.979	0.001915	85.540
BreastEW	95.477	95.099	0.378	0.962	0.906	0.931	14	0.200	0.0005	0.990	0.001183	95.099
Decision tree classifier
RNA gene	99.320	98.750	0.570	0.988	0.990	0.988	6483	1350	1.782	0.994	0.000139	98.750
DNA CNV	65.592	64.608	0.984	0.553	0.560	0.551	5301	1280	0.578	0.822	0.000894	64.608
Parkinson’s disease	81.820	74.609	7.211	0.783	0.915	0.843	35	0.305	0.010	0.699	0.003413	74.609
Dermatology diseases	89.011	87.177	1.834	0.747	0.761	0.742	19	0.255	0.002	0.942	0.002909	87.177
BreastEW	96.466	94.029	2.437	0.924	0.920	0.918	14	0.200	0.003	0.968	0.001310	94.029

Algorithm	ACC%	NO.F	Pre	Rec	F1-score	AUC	Var.
MIFS	99.875	10,000	0.999	0.998	0.988	1.000	0.000016
IGF	99.875	3576	0.999	0.999	0.998	1.000	0.000016
mRMR	99.750	650	0.999	0.997	0.998	1.000	0.000028
CfsSubsetEval	99.627	4083	0.998	0.996	0.997	1.000	0.000036
ReliefAttributeEval	99.873	10,000	0.999	0.999	0.999	1.000	0.000031
OneRAttributeEval	99.627	7000	0.998	0.996	0.997	0.999	0.000036
ConsistencySubsetEval	97.380	3	0.972	0.970	0.970	0.994	0.000188
PCA	99.740	700	0.999	0.997	0.998	0.999	0.000059
MIFS, CBF and FCBF	99.748	900	0.999	0.997	0.998	1.000	0.000092
Chi-square	99.625	7555	0.997	0.995	0.996	1.000	0.000036
IGF, Chi-square and Bat algorithm	99.752	6483	0.999	0.997	0.998	1.000	0.000027
Proposed method (PFBS-RFS-RFE)	100.000	10.000	1.000	1.000	1.000	1.000	0.0

Datasets	Train Data %	Test Data %	Over-fitting Diff. %	Pre	Rec	F1-score	NO.F	F-Time (sec)	C-Time (sec)	AUC	Var.	ACC %
SVM classifier
RNA gene	99.791	99.750	0.221	0.999	0.997	0.998	3123	15,746.043	0.727	1.000	0.000028	99.750
DNA CNV	93.271	84.980	8.291	0.860	0.770	0.790	2940	62,405.810	35.118	0.965	0.000620	84.980
Parkinson’s disease	75.529	74.996	0.533	0.530	0.523	0.474	149.000	55.114	0.071	0.768	0.000652	74.996
Dermatology diseases	84.800	84.722	0.078	0.854	0.835	0.830	5.000	0.651	0.016	0.960	0.000052	84.722
BreastEW	91.799	91.394	0.405	0.463	0.420	0.439	5.000	0.656	0.016	0.977	0.000776	91.394

Algorithm	ACC%	NO.F	Pre	Rec	F1-score	AUC	Var.
MIFS and RFE	99.501	4500	0.794	0.715	0.723	1.000	0.000041
GA and RFE	99.750	3123	0.999	0.997	0.998	1.000	0.000028
Ridge regression and RFE	99.627	10,265	0.998	0.830	0.831	1.000	0.000036
Proposed method (PFBS-RFS-RFE)	100.000	10.000	1.000	1.000	1.000	1.000	0.0
