Cluster Computing. 2022 Sep 10;26(3):1949–1983. doi: 10.1007/s10586-022-03725-w

Scalable feature subset selection for big data using parallel hybrid evolutionary algorithm based wrapper under apache spark environment

Yelleti Vivek 1,2, Vadlamani Ravi 1, P Radha Krishna 2
PMCID: PMC9463682  PMID: 36105649

Abstract

Extant sequential wrapper-based feature subset selection (FSS) algorithms are not scalable and yield poor performance when applied to big datasets. Hence, to circumvent these challenges, we propose parallel and distributed hybrid evolutionary algorithm (EA) based wrappers under Apache Spark. We propose two hybrid EAs based on Binary Differential Evolution (BDE) and Binary Threshold Accepting (BTA), namely, (i) Parallel Binary Differential Evolution and Threshold Accepting (PB-DETA), where BDE and BTA work in tandem in every iteration, and (ii) its ablation variant, Parallel Binary Threshold Accepting and Differential Evolution (PB-TADE). Here, BTA is invoked to enhance the search capability and avoid premature convergence of BDE. For comparison purposes, we also parallelized two state-of-the-art algorithms, adaptive DE (ADE) and permutation-based DE (DE-FSPM), and named them PB-ADE and P-DE-FSPM respectively. Throughout, logistic regression (LR) is employed to compute the fitness function, namely, the area under the receiver operating characteristic curve (AUC). The effectiveness of the proposed algorithms is tested on five big datasets of varying dimensionality. It is noteworthy that PB-TADE turned out to be statistically significantly better than the rest. All the algorithms exhibited the repeatability property. The proposed parallel model attained a speedup of 2.2–2.9. We also report the feature subsets with the highest AUC and least cardinality.

Keywords: Feature subset selection, Differential evolution, Threshold accepting, MapReduce, Apache Spark, Multithreading

Introduction

Selecting a relevant and important feature subset is a paramount pre-processing step in the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology [1] of data mining projects. This process of selecting an important group of features is popularly known as feature subset selection (FSS) [2, 3]. The main objective of FSS is to select the most relevant and highly discriminative feature subsets. The benefits of FSS are as follows: it improves the comprehensibility of the models, reduces model complexity, shortens the training time, avoids overfitting, and sometimes improves the model's performance. Further, the resulting model becomes parsimonious. The ubiquitous presence of big datasets in every domain has made FSS a mandatory pre-processing step.

FSS can be performed in three primary ways: filter, wrapper, and embedded approaches. The main difference lies in how the fitness value is computed and whether salient features are selected individually or group-wise. Filter approaches measure the fitness value using statistical measures such as information gain, mutual information, gain ratio, etc. These methods are fast but yield lower accuracy and cannot account for interaction effects among the features. Wrapper approaches comprise a metaheuristic optimization algorithm that searches for the best feature subsets as indicated by the highest fitness value determined by a classifier (for a classification problem) or a regression model (for a regression problem). These approaches are computationally intensive but highly accurate, as they account for the interaction effects of the features. In embedded approaches, feature subset selection is embedded as part of the model-building phase. These approaches combine the advantages of being less computationally expensive than wrappers while giving better accuracy than filter approaches. Even though wrapper approaches impose high computational complexity, the selected feature subset generalizes well with the underlying classifier.

In the current study, FSS is formulated as a combinatorial problem because, for a given n features, the total number of possible feature subsets is 2^n − 1. Accordingly, the set of all feature subsets that can be formed constitutes the search space. The objective is to search for an efficient feature subset comprising less redundant features. The best feature subset could be found by checking all possible subsets; however, this brute-force method becomes unwieldy when the feature space dimension n becomes large, as in big datasets. Metaheuristics (evolutionary algorithms (EAs) subsumed) have demonstrated their superiority over conventional optimization methods in solving various combinatorial and continuous optimization problems. Metaheuristics are of two different types: (i) point-search-based methods, such as Threshold Accepting (TA), Simulated Annealing (SA), etc., and (ii) population-based EAs, such as the genetic algorithm (GA), differential evolution (DE), particle swarm optimization (PSO), etc. Recently, many new population-based metaheuristics were also proposed, such as monarch butterfly optimization (MBO) [4], the slime mould algorithm (SMA) [5], the moth search algorithm (MSA) [6], hunger games search (HGS) [7], the colony predation algorithm (CPA) [8], and Harris hawks optimization (HHO) [9]. All of these algorithms were applied to solve many machine learning and engineering problems [2, 3, 10–14]. Between the two types, point-search methods consume a lot of time to converge because their exploration capability is weaker than that of population-based EAs [15]. The strategy of performing population-based search gives EAs their inherent parallelism. Hence, population-based EAs are considered robust in maintaining both exploration and exploitation capabilities, and they have often outperformed point-search methods in various engineering problems [15, 16]. This inherent parallelism of EAs is exploited in the current study to bring in process parallelization via Apache Spark. The present study poses FSS in a single-objective environment, where the objective function maximizes the AUC, thereby selecting feature subsets of length less than or equal to k (where k < n) while achieving the best possible AUC.

As data is generated in large volumes at a phenomenal rate, scalability becomes a major concern while developing solutions to analyze such big data; hence, designing scalable solutions has gained prominence. Several new programming models were proposed to perform such large-scale computations on commodity hardware. Distributed frameworks like MapReduce [17] and Spark [18] provide scalability with improved performance by drawing on the capabilities of the underlying commodity hardware. These frameworks provide inherent support for data distribution, load balancing, fault tolerance, and parallel processing. MapReduce [17] is a distributed programming paradigm used in handling big data. It mainly consists of two steps: map and reduce. MapReduce solutions are proven to be scalable, and different big data frameworks are available to design them. Among them, Spark is faster due to its in-memory computation. Spark is an open-source, fast, distributed computing engine used to handle large amounts of data. Spark performs in-memory computing using Resilient Distributed Datasets (RDDs), which boost performance by avoiding disk access. A Spark RDD is inherently distributed, follows lazy evaluation, and is immutable. Apache Spark also provides versatility by combining with other big data tools such as Hadoop.

Extant EA-based wrapper algorithms are sequential and limited to small datasets. Even though they can be applied to larger datasets, they perform poorly. In the current world, data growth is phenomenal, thereby demanding the development of scalable, parallel wrappers for FSS in the context of big data. This motivated us to propose a couple of scalable wrappers for FSS in a single-objective environment. In the current study, the objective is to select a feature subset of length less than or equal to k (where k << n) while achieving the best possible AUC. To the best of our knowledge, no work has been reported so far where feature subset selection is performed by a scalable wrapper involving parallel and distributed EC techniques or their hybrids under the Apache Spark environment.

In the current study, we considered DE from among the population-based EAs because it is proven to be a robust search algorithm [19] for solving optimization problems. Although DE is robust, it still suffers from premature convergence [12] and faces difficulty converging on non-linearly-separable functions [19]. These drawbacks motivated us to develop hybrid EAs based on the memetic approach, in which control shifts between the constituent algorithms at a logical point of the execution in order to enhance the search ability and increase the probability of avoiding premature convergence. TA is chosen from among the point-based metaheuristics because it is a deterministic variant of simulated annealing that exploits the search space well, a nature that makes it well suited for combinatorial optimization problems. For better hybridization, we developed a population-based TA and designed hybrid EAs comprising both DE and TA. It is important to note that the current hybrid sequential DETA is distinct from the hybrid version presented in Chauhan and Ravi [20]. The DE-TA of [20] was a loosely-coupled, two-stage hybrid model, wherein DE is executed in stage I until it converges, after which TA is invoked in stage II. In contrast, the current hybrid (variants of DE and TA) is a tightly-coupled hybrid system, where DE and TA work in tandem, and this scheme is executed in every global iteration.

The major reasons for proposing the parallel hybrid EAs under the Apache Spark environment are as follows: (i) Hybrid EAs, being inherently parallel by design, are eminently amenable to process parallelization and hence suitable for analysing high-dimensional big data problems. (ii) Sequential implementations are not scalable to large, high-dimensional datasets; designing the parallel algorithms under Apache Spark makes them suitable for analyzing big datasets. (iii) The hybrid EAs are designed to enhance the search ability so that they can reach a global or near-global optimal solution. (iv) The hybrid EAs also increase the probability of avoiding premature convergence because they have high exploration capability by virtue of fusing TA with DE in every iteration.

This paper’s significant contributions are as follows: (i) Parallel DE is designed under Apache Spark to develop wrappers for FSS. (ii) Binary versions of TADE and DETA are proposed, developed, and parallelized under Apache Spark; these are named PB-TADE and PB-DETA, respectively. DETA was initially introduced as part of the ablation study, and it turned out to be significantly different from TADE. (iii) These are then invoked to develop wrappers for FSS, where logistic regression is chosen as the classifier to evaluate the fitness function, namely the AUC. (iv) For comparison with the state-of-the-art, we found adaptive differential evolution (ADE) [21] and permutation-based differential evolution (DE-FSPM) [22] to be competitive; however, they are sequential in nature. Hence, we parallelized these two algorithms and refer to them as PB-ADE and P-DE-FSPM throughout the paper. (v) To achieve scalability and algorithm parallelization, we proposed a novel MapReduce-multithread based framework.

The remainder of the paper is structured as follows: Sect. 2 presents the literature review. Section 3 presents an overview of the evolutionary algorithms employed. Section 4 presents the proposed scalable, parallel and distributed wrapper. Section 5 describes the datasets, Sect. 6 discusses the results obtained. Finally, Sect. 7 concludes the paper.

Literature review

Differential evolution, one of the most widely used algorithms for feature subset selection, was proposed by Storn and Price [23]. Table 1 presents the works on filter and wrapper sequential versions of DE, where feature selection is posed as a combinatorial optimization problem. Zhang et al. [24] proposed a modified DE with self-learning (MOFS-BDE), introducing three operators: (i) a modified binary mutation operator based on the probability difference, (ii) a one-bit purifying search (OPS) operator to improve the self-learning capability of the elite individuals, and (iii) non-dominated sorting in the selection phase to reduce the computational complexity of the selection operator. Vivekanandan and Iyengar [25] designed a two-phase solution, where the critical features are selected by DE and fed into an integrated model of a feed-forward neural network and the fuzzy analytic hierarchy process (AHP) [26]. Nayak et al. [27] proposed FAEMODE, a filter approach using elitism-based multi-objective DE; it can handle both linear and non-linear dependency among the features via the Pearson correlation coefficient (PCC) and mutual information (MI). Mlakar et al. [28] designed a multi-objective DE (DEMO) based wrapper for facial recognition systems, where the important features are initially extracted using the histogram of oriented gradients (HOG) descriptor and fed to DEMO to find the Pareto optimal solutions. Khushaba et al. [29] proposed DE with a statistical repair mechanism, DEFS, for selecting optimal feature subsets in datasets of varying dimensionality; the probability of the feature distribution is fed to DEFS by a roulette wheel. Hancer et al. [30] proposed a filter-based DE, MIFS, where the fisher score determines the mutual relevance between the features and class labels, and the features are ranked by their fisher score. Hancer [31] proposed a new multi-objective differential evolution (MODE-CFS) with a two-stage mutation: (i) centroid-based mutation to perform clustering, and (ii) feature-based mutation to perform feature selection; non-dominated sorting is applied to determine the Pareto optimal solution set. Ghosh et al. [32] proposed a self-adaptive DE (SADE) based wrapper for feature selection in hyperspectral image data, with the selected feature subsets fed into fuzzy-KNN to obtain accuracy. Bhadra and Bandyopadhyay [33] improved the modified DE, proposing an unsupervised feature selection approach called MoDE with the objectives of (i) average dissimilarity of the selected feature subset, (ii) average similarity of the non-selected feature subset, and (iii) average standard deviation of the selected feature subset, all computed using normalized mutual information. Baig et al. [34] proposed a modified DE-based wrapper for high-dimensional motor imagery EEG data, with SVM as the classifier. Almasoudy et al. [35] designed a wrapper feature selection method based on modified DE, considered the Extreme Learning Machine (ELM) as the classifier, and tested its effectiveness on the NSL-KDD intrusion detection dataset. Zorarpaci and Ozel [36] proposed a hybrid FS approach based on DE and the Artificial Bee Colony (ABC), where the J48 classifier from Weka computes the fitness score. This hybrid model achieved a significantly better F-score with a less cardinal feature subset than stand-alone DE and stand-alone ABC.

Table 1.

Sequential versions of DE and its wrapper variants

| Authors | # Objectives | Algorithm | Wrapper (classifier)/Filter | Parallel/Sequential |
| --- | --- | --- | --- | --- |
| Zhang et al. [24] | Multi-objective | Self-learning DE | Wrapper (KNN) | Sequential |
| Vivekanandan and Iyengar [25] | Single-objective | Modified DE | Filter | Sequential |
| Nayak et al. [27] | Multi-objective | FAEMODE | Filter | Sequential |
| Mlakar et al. [28] | Multi-objective | DE + HOG | Wrapper (SVM) | Sequential |
| Khushaba et al. [29] | Single-objective | DEFS | Filter | Sequential |
| Hancer [31] | Multi-objective | MODE-CFS | Filter | Sequential |
| Hancer et al. [30] | Multi-objective | DE + MIFS | Filter | Sequential |
| Ghosh et al. [32] | Multi-objective | SADE | Wrapper (Fuzzy-KNN) | Sequential |
| Bhadra and Bandyopadhyay [33] | Multi-objective | MoDE | Filter (MI) | Sequential |
| Baig et al. [34] | Multi-objective | Modified DE | Wrapper (SVM) | Sequential |
| Almasoudy et al. [35] | Multi-objective | Modified DE | Wrapper (ELM) | Sequential |
| Zorarpaci et al. [36] | Single-objective | DE + ABC | Wrapper (Weka J48) | Sequential |
| Srikrishna et al. [37] | Single-objective | Quantum DE | Wrapper (LR) | Sequential |
| Lopez et al. [22] | Single-objective | DE-FSPM | Wrapper (SVM) | Sequential |
| Al-ani [92] | Single-objective | DE + wheel-based strategy | Filter | Sequential |
| Zhao et al. [38] | Single-objective | Modified DE | Wrapper (SVM) | Sequential |
| Hancer [39] | Multi-objective | DE | Filter (Fuzzy + Kernel) | Sequential |
| Li et al. [40] | Single-objective | DE | Wrapper (SVM) | Sequential |
| Wang et al. [41] | Single-objective | DE | Wrapper (KNN) | Sequential |
| Krishna and Ravi [21] | Single-objective | Adaptive DE | Wrapper (LR) | Sequential |
| Current study | Single-objective | PB-TADE, PB-DETA, PB-DE, PB-ADE, P-DE-FSPM | Wrapper (LR) | Parallel |

Then, a quantum-inspired wrapper based on DE (QDE) with logistic regression as the classifier was proposed by Srikrishna et al. [37]. The authors introduced quantum crossover and quantum mutation operators; the results obtained by QDE are better than those of the ADE algorithms, and QDE achieved better repeatability than BDE. Lopez et al. [22] proposed a wrapper based on permutation DE, where a permutation-based mutation replaced the mutation operator, and the diversity of the generated child solutions is controlled using a modified recombination operator. Zhao et al. [38] developed a two-stage wrapper feature selection algorithm: in the first stage, the fisher score and information gain filter out the redundant features; in the second stage, the top-k features are passed to a modified DE to perform feature selection on four different breast cancer datasets. Hancer [39], for the first time, used fuzzy and kernel measures as filters to calculate mutual relevance and redundancy with DE to handle continuous datasets. Li et al. [40] designed DE-SVM-FS, compared it with the default SVM approach, and demonstrated that the DE and SVM based FS achieved better accuracy than the stand-alone SVM. Wang et al. [41] proposed DE-KNN, with KNN as the classifier, performing both feature selection and instance selection.

Along with DE, several other EAs [10, 42–45] treated the FSS problem as a combinatorial optimization problem. Khammassi and Krichen [46] proposed two schemes, namely (i) NSGA-BLR to handle binary-class datasets and (ii) NSGA-MLR to handle multi-class network intrusion datasets. Chaudari and Sahu [47] proposed a binary version of the popular crow search algorithm (CSA) with time-varying flight, in a wrapper version named BCSA-VF. A binary dragonfly algorithm (BDA) was proposed by Too and Mirjalili [48], taking a Covid-19 dataset as a case study. Recently, many new metaheuristics were proposed to solve FSS. Hu et al. [5] introduced a dispersed foraging strategy, inspired by the artificial bee colony (ABC), into the slime mould algorithm (SMA), naming it the dispersed foraging slime mould algorithm (DFSMA), to avoid the former's premature convergence and to maintain population diversity. The dispersed foraging strategy works as follows: an adaptive dispersion rate (DR) is maintained, which is responsible for maintaining diversity in the population; DR decreases as the number of iterations increases, which regulates movement towards the optimal solutions. Hu et al. [49] improved the grey wolf optimizer (GWO), naming it GWOCMALOL, by introducing the following operators: (i) the covariance matrix adaptation evolution strategy (CMA-ES) to improve exploration ability, (ii) a levy flight mechanism to improve search accuracy, and (iii) an orthogonal learning (OL) strategy to predict the optimal search direction. The authors conducted the Wilcoxon signed rank test and showed that the proposed GWOCMALOL obtained superior results in terms of convergence speed and accuracy. Hu et al. [50] introduced a chaotic local search (CLS) mechanism into GWO, named SCGWO, to avoid local optima, improve global exploration capability, and randomize individual movement; SCGWO has apparent advantages in processing unimodal, multimodal, and composition functions. Too et al. [51] proposed MEHHO, which follows an energetic learning strategy that uses the global best experience to update the locations of the other search agents, together with a memory saving and updating mechanism in which the best solution is stored and a hawk is allowed to revert to its best solution if its fitness value becomes worse; the authors applied their proposed algorithm to the classification of electromyography signals. Zhang et al. [52] embedded the salp swarm algorithm (SSA) into HHO and proposed the improved Harris hawks optimization (IHHO) to improve the searchability of HHO. The update stage in IHHO works as follows: (i) SSA generates the population, called the SSA-based population; (ii) hybrid individuals are generated using SSA and HHO; (iii) the search agent is updated using greedy selection and HHO's mechanism. The authors demonstrated that the proposed IHHO achieved faster convergence and maintained a better balance between exploration and exploitation. Several variants of DE are employed elsewhere, for example, in the estimation of tool wear during the milling process [11], optimal resource scheduling [12], energy-efficient models [13], and anomaly detection [14].

Now we shift our attention to the works more relevant to the current study. Several parallel and distributed versions of evolutionary algorithms [53–56] have been proposed to handle high-dimensional datasets and big data. Various parallel and distributed implementations of DE are presented in Table 2. Zhou [57] discussed various strategies for implementing parallel MapReduce versions of DE, and their pros and cons, in the Hadoop distributed framework. Teijeiro et al. [58] designed a parallel DE under the Spark environment, with experiments conducted in the AWS cloud environment to solve benchmark optimization problems. Recently, Cho et al. [59] designed a parallel version of DE to solve large-scale clustering problems. Another parallel DE classifier (SCDE) was proposed by Al-Sawwa and Ludwig [60] to handle imbalanced data; SCDE finds the optimal centroid and assigns a class to each data point based on the Euclidean distance.

Table 2.

Parallel and distributed versions of DE and its variants

| Authors | Algorithm | Environment | Problem solved |
| --- | --- | --- | --- |
| Zhou [57] | DE | Spark | Pros and cons of various approaches discussed |
| Teijeiro et al. [58] | DE | Spark + AWS | Tested on benchmark functions |
| Cho et al. [59] | DE | Spark | Clustering |
| Al-Sawwa and Ludwig [60] | DE | Spark | Designed a DE based classifier |
| Chen et al. [61] | Modified DE | SPMD | Cluster optimization |
| Adhianto et al. [62] | DE | OpenMP | Optical network problem |
| Liu et al. [93] | DE | Distributed cloud | Power electronic circuit optimization |
| Deng et al. [63] | DE | Spark | Tested on benchmark functions and reported speedup |
| Wong et al. [65] | Self-adaptive DE | CUDA | Tested on benchmark functions and reported speedup |
| He et al. [64] | Five variants of DE | Spark + cloud | Developed a ring topology model and evaluated benchmark functions to report speedup |
| Cao et al. [66] | DPCCMOEA | MPI | Developed a co-evolutionary DE to solve large-scale optimization |
| Ge et al. [67] | DDE-AMS | MPI | Developed an adaptive population model to solve large-scale optimization |
| Falco et al. [68] | DE | MPI | Resource allocation |
| Veronese and Krohling [69] | DE | CUDA | Large-scale optimization in a GPU environment |
| Glotik et al. [70] | PSADE | MATLAB | Hydro scheduling |
| Thomert et al. [73] | NSDE-II | OpenMP | Cloud workflow placements |
| Daoudi et al. [71] | DE | Hadoop | Clustering |
| Kromer et al. [72] | DE | Unified Parallel C | Large-scale optimization problems |
| Current study | PB-TADE, PB-DETA, PB-DE, PB-ADE, P-DE-FSPM | Spark | Parallel EA based wrapper solving FSS |

Chen et al. [61] proposed a parallel version of modified DE using single-program multiple-data (SPMD) with improved genetic operators, employing both fine-grained and coarse-grained approaches for cluster optimization. Adhianto et al. [62] proposed a fine-grained parallel version of DE using OpenMP, which supports shared-memory multi-processing, to solve the optical networking problem. Deng et al. [63] proposed a parallel DE for solving benchmark functions and reported the speedup. He et al. [64] proposed a parallel framework for five variants of DE under the Spark cloud computing platform and analyzed the speedup on benchmark functions. Wong et al. [65] developed a Compute Unified Device Architecture (CUDA) framework for self-adaptive DE solving benchmark functions. Cao et al. [66] proposed a message passing interface (MPI) based co-evolutionary version of DE, where the population is divided and co-evolved together to solve large-scale optimization problems. Ge et al. [67] proposed an adaptive merge and split strategy for DE, namely DDE-AMS, using MPI to improve resource utilization, a vital aspect when handling large-scale optimization problems. Falco et al. [68] designed an MPI-based DE under a CUDA grid environment and tested it on different resource allocation strategies. Veronese and Krohling [69] developed the first CUDA implementation of DE, tested it on well-known benchmark functions, and compared its computing time with the stand-alone implementation. Glotik et al. [70] parallelized DE using parallel MATLAB to solve the hydro-scheduling problem. Daoudi et al. [71] developed a MapReduce version of DE under the Hadoop environment to solve the clustering problem. Kromer et al. [72] developed a parallel version of DE using Unified Parallel C to handle large-scale clustering problems. Thomert et al. [73] developed a parallel version of DE using OpenMP to optimize workflow placement in practical cloud settings.

In summary, the drawbacks observed in the extant approaches are as follows:

  1. The existing EA based wrapper techniques listed in Table 1 are sequential and have been applied to small datasets. When applied to big datasets, sequential approaches take a large amount of computation time and result in poor performance. Hence, there is a need to develop parallel EA techniques and make them amenable to big datasets.

  2. Moreover, the current parallel and distributed EA algorithms (refer to Table 2) have not yet been applied to the FSS problem.

These drawbacks motivated us to design and propose scalable, parallel EA-based wrapper techniques. As previously mentioned, population-based EAs alone suffer from premature convergence. Hence, to decrease the probability of premature convergence and to improve the search ability of DE, a point-based metaheuristic, viz., TA, is invoked. Accordingly, we designed two parallel hybrid variants and named them PB-DETA and PB-TADE.

Overview of the evolutionary algorithms employed

Evolutionary algorithms (EAs) [19, 74, 75] effectively obtain global or near-global optimal solutions. The design of the heuristics in EAs is inspired by the natural selection of Darwinian evolution [19] and by social behaviour. They start by initializing a population of solutions randomly. This population is evolved in order to find better solutions. For evolution, EAs utilize specialized heuristics to generate new solutions and compute the corresponding fitness score given by the fitness (or objective) function. EAs are well suited to both continuous and binary search spaces. This section discusses all the methods employed in the current work in detail.

Solution encoding scheme

Kohavi and John [76] pioneered wrappers for FSS by posing FSS as a combinatorial optimization problem. A wrapper takes the help of a classifier or a regression model to evaluate the fitness score of a given feature subset. Wrappers using metaheuristics (EAs subsumed) require the fundamental step of solution encoding. EAs randomly initialize a population consisting of a set of solution vectors, each of which represents a feature subset. Such a vector comprises bits, where 1 indicates the presence, and 0 the absence, of a feature. The dimension of this vector equals the number of features in the dataset.
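For illustration, the following minimal Python sketch shows this encoding; the feature names and helper functions are hypothetical and serve only to demonstrate the scheme.

```python
import numpy as np

def random_solution(nfeat, rng):
    """One candidate solution: a binary mask over the nfeat features."""
    return rng.integers(0, 2, size=nfeat)

def selected_features(solution, feature_names):
    """Decode a binary mask into the names of the selected features."""
    return [name for bit, name in zip(solution, feature_names) if bit == 1]

rng = np.random.default_rng(42)
names = ["f1", "f2", "f3", "f4", "f5"]   # hypothetical feature names
sol = random_solution(len(names), rng)
print(sol, "->", selected_features(sol, names))
```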

Binary differential evolution

Binary Differential Evolution (BDE), a stochastic population-based global optimization algorithm, includes the three heuristics, namely: mutation, crossover, and selection in that order. BDE starts by initializing the random population, consisting of n candidate solution vectors (Xi), where n is the population size. This candidate solution vector follows the binary encoding scheme. Each candidate solution vector is subjected to all the three heuristics in each iteration (or generation) of the algorithm.

In each generation t, in a d-dimensional search space, each candidate solution vector $X_i^t$ is subjected to the mutation operation, producing the mutant vector $M_i^t$. The mutation operation is applied as presented in Eq. (1):

$$M_i^t = X_{i1}^t + MF \cdot (X_{i2}^t - X_{i3}^t) \qquad (1)$$

where $X_{i1}^t$, $X_{i2}^t$ and $X_{i3}^t$ are three randomly chosen distinct vectors from the current generation t. MF is the mutation factor, a user-defined parameter lying in the range [0,1]. After mutation, the mutant vector may no longer be binary. Hence, a sigmoid-based discretization (see Eq. 2) is applied to every $m_{ij}^t$, the j-th member of $M_i^t$, thereby converting the continuous vector into a binary one.

$$m_{ij}^t = \begin{cases} 1, & \text{if } rand(0,1) < sigmoid(m_{ij}^t) \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

The discretized mutant vector is then subjected to the crossover operation, where it mates with the corresponding candidate solution vector to generate the trial vector $U_i^t$, as presented in Eq. (3):

$$u_{ij}^t = \begin{cases} m_{ij}^t, & \text{if } rand(0,1) < CR \text{ or } j = rand_i \\ x_{ij}^t, & \text{if } rand(0,1) \ge CR \text{ and } j \ne rand_i \end{cases} \qquad (3)$$

where j = 1, 2, …, d; $u_{ij}^t$ is the j-th bit of $U_i^t$; rand(0,1) is a random number drawn from a uniform distribution on the interval [0,1]; and $rand_i$ is a randomly chosen index that ensures the trial vector inherits at least one component from the mutant vector. CR represents the crossover rate, a user-defined parameter lying in the range [0,1].

Finally, the fitness score is computed for the trial vectors. The selection operation then compares each target vector with its corresponding trial vector to produce the offspring; the better solutions survive and form the parent population for the subsequent iteration. The selection operation follows the rule presented in Eq. (4):

$$X_i^{t+1} = \begin{cases} X_i^t, & \text{if } f(X_i^t) > f(U_i^t) \\ U_i^t, & \text{otherwise} \end{cases} \qquad (4)$$

As mentioned earlier, this process continues until the maximum number of iterations is completed or another convergence criterion, if any, is met.
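The following Python sketch puts Eqs. (1)–(4) together for one BDE generation. It is a minimal illustration, assuming a NumPy population matrix, a caller-supplied fitness function, and illustrative MF and CR defaults; it is not the paper's Spark implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bde_generation(pop, fitness_fn, MF=0.5, CR=0.9, rng=None):
    """One BDE generation: mutation (Eq. 1), sigmoid discretization (Eq. 2),
    binomial crossover (Eq. 3), and greedy selection (Eq. 4)."""
    rng = rng or np.random.default_rng()
    n, d = pop.shape                      # population size x no. of features
    new_pop = pop.copy()
    for i in range(n):
        # three distinct random indices, all different from i (needs n >= 4)
        i1, i2, i3 = rng.choice([j for j in range(n) if j != i], 3, replace=False)
        mutant = pop[i1] + MF * (pop[i2] - pop[i3])             # Eq. (1)
        mutant = (rng.random(d) < sigmoid(mutant)).astype(int)  # Eq. (2)
        cross = rng.random(d) < CR
        cross[rng.integers(d)] = True     # rand_i: keep >= 1 mutant bit
        trial = np.where(cross, mutant, pop[i])                 # Eq. (3)
        if fitness_fn(trial) >= fitness_fn(pop[i]):             # Eq. (4)
            new_pop[i] = trial
    return new_pop
```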

Binary threshold accepting

Dueck and Scheuer [77] proposed the Threshold Accepting (TA) algorithm. Later, Ravi and Zimmermann [78] optimized a fuzzy rule-based classification model using a modified TA. They developed the solution in three phases: feature selection was used as a preprocessing step, a modified fuzzy rule-based classification system was invoked over the selected feature subset, and finally, the modified TA (MTA) was invoked to minimize the rule base size while guaranteeing high accuracy. Ravi et al. [79] proposed a modified TA (MTA) to minimize the number of rules in a fuzzy rule-based classification system. Then, Ravi and Zimmermann [80] proposed a continuous version of TA as an alternative to the backpropagation algorithm to overcome its limitations while training a neural network model. The trained neural network was utilized for feature selection, and the selected features were fed to the fuzzy classifier optimized by the MTA proposed in [79]. Later, Ravi and Pramodh [81] proposed a principal component neural network architecture trained by TA for bankruptcy prediction in banks. Chauhan and Ravi [20] proposed a hybrid approach involving DE and TA, in that order, to solve a set of unconstrained benchmark problems. That hybrid is a loosely-coupled system implemented as a two-stage model, where DE is executed first and then TA.

Threshold Accepting is a deterministic variant of simulated annealing. BTA, a binary version of TA (the MTA in [79]), is presented in Algorithm 1. BTA is applied to a single solution and performs a neighbourhood search by flipping the bits of the current solution vector one at a time, starting from the left-most position. Each flip yields one neighbourhood solution. If the first neighbourhood solution is not accepted, the bit is reverted to its original value; likewise, the 2nd bit is flipped, and so on, until a neighbourhood solution is accepted. However, it is not necessary to exhaustively search all the neighbourhood solutions. BTA calculates the difference between the fitness scores of the new and current solutions, and accepts the neighbourhood solution if its fitness value is not much worse than that of the current solution.

Algorithm 1: Threshold accepting (TA) algorithm
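A minimal Python sketch of this neighbourhood search follows; the threshold schedule and the rule of restarting the scan after each acceptance are illustrative assumptions, since the exact values and control flow are those of Algorithm 1.

```python
import numpy as np

def bta_pass(solution, fitness_fn, threshold):
    """One BTA pass: flip bits left to right; accept the first neighbour
    whose fitness is not much worse (within `threshold`) than the current
    fitness, otherwise restore the bit and move on."""
    current = np.array(solution).copy()
    f_cur = fitness_fn(current)
    for j in range(len(current)):
        current[j] ^= 1                  # flip-flop the j-th bit
        f_new = fitness_fn(current)
        if f_new > f_cur - threshold:    # "not much worse" acceptance rule
            return current, f_new        # accept this neighbour
        current[j] ^= 1                  # reject: restore the bit
    return current, f_cur                # no neighbour accepted

def bta(solution, fitness_fn, thresholds=(0.02, 0.01, 0.005)):
    """Repeat the pass over a decreasing (deterministic) threshold schedule."""
    for thr in thresholds:
        solution, _ = bta_pass(solution, fitness_fn, thr)
    return solution
```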

Adaptive differential evolution

The Adaptive Differential Evolution (ADE) algorithm [82] was employed to perform FSS by Krishna and Ravi [21]. The main objective of adaptive algorithms is to obviate manual fine-tuning of the hyperparameters and to help achieve faster and better convergence without compromising performance. This algorithm differs from the original DE in the following ways:

  • (i)

    Instead of manually fine-tuning the hyperparameters MF and CR, ADE adjusts them adaptively by using the roulette wheel selection based on the success rate of the parameters.

  • (ii)

    The mutation and crossover operations are performed in the same way as in Differential Evolution (refer to Sect. 3.2).

Like DE, ADE starts by initializing a random population of size n and follows the binary encoding scheme. Similarly, each candidate solution vector is subjected to all three heuristics in each iteration (or generation) of the algorithm. Based on the success rates of the parameter combinations, MF and CR are adaptively adjusted by roulette wheel selection. There are H = 12 parameter settings, with probabilities $q_h$, h = 1, 2, …, H. Using the h-th setting, the heuristics, namely mutation as per Eq. (1) and crossover as per Eq. (3), are applied to the parent solution $x_i$, generating an offspring solution $y_i$. If $f(y_i) > f(x_i)$, then the probability of the h-th setting is adjusted using Eq. (5):

$$q_h = \frac{n_h + n_o}{\sum_{j=1}^{H} (n_j + n_o)} \qquad (5)$$

where $n_h$ is the success count of the h-th combination, and $n_o$ > 1 is a constant term that prevents a sudden change in the success rates due to a single random success of a setting. In each generation, the parent population is subjected to the mutation (Eq. 1) and crossover (Eq. 3) operations, but unlike in the previous algorithms, MF and CR are adaptively determined by roulette wheel selection based on the success rates. The selection operator is then applied, and the better solutions that survive become the parent population for the subsequent iteration. This process continues until the maximum number of iterations is completed or another convergence criterion, if any, is met.
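The adaptive choice of (MF, CR) can be sketched in Python as follows; the pool of 12 settings and the smoothing constant are hypothetical values chosen only to illustrate Eq. (5).

```python
import numpy as np

# hypothetical pool of H = 12 (MF, CR) settings
settings = [(mf, cr) for mf in (0.3, 0.5, 0.7, 0.9) for cr in (0.3, 0.6, 0.9)]
success = np.zeros(len(settings))        # n_h: success count per setting
n_o = 2.0                                # smoothing constant, n_o > 1

def pick_setting(rng):
    """Roulette-wheel selection of an (MF, CR) pair with probabilities q_h
    computed as in Eq. (5)."""
    q = (success + n_o) / (success + n_o).sum()
    h = rng.choice(len(settings), p=q)
    return h, settings[h]

def credit(h, parent_fit, child_fit):
    """Credit the h-th setting whenever its offspring beats the parent."""
    if child_fit > parent_fit:
        success[h] += 1
```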

Permutation based differential evolution

Lopez et al. [22] proposed a permutation-based Differential Evolution and named it DE-FSPM. It differs from the traditional DE in the following ways:

  • (i)

    Instead of a binary vector, the population consists of integer-valued vectors that encode all features of a dataset. Each feature is assigned a number starting from 1, i.e., in the range {1, 2, …, nfeat}, and 0 (zero) acts as a delimiter for the selected features. For example, in the candidate solution $X_i^t$ = [7, 3, 5, 8, 6, 0, 1, 4, 2], the part to the left of the 0 is the selected feature subset, and the part to the right is the non-selected feature subset.

  • (ii)

    A permutation-matrix based mutation operator creates the mutant solutions. Here too, three solutions are randomly chosen, namely $X_{i1}^t$, $X_{i2}^t$, $X_{i3}^t$, but permutation-based mapping operations generate the mutant vector. First, a permutation matrix PM is obtained from the mapping between the chosen vectors. PM is a binary matrix of size nfeat × nfeat, where nfeat is the number of features. PM is then permuted to form a new permutation matrix PM′. Finally, PM′ is applied to $X_{i1}^t$ to form the mutant vector $M_i^t$.

  • (iii)

    It follows the same crossover operation as presented in Eq. (3); the only difference is that it is applied to the integer-valued vector instead of the binary-encoded vector. After this operation, a few features may appear in multiple locations while a few others are discarded. For example, $X_{i1}^t$ = [7, 3, 5, 8, 6, 0, 1, 4, 2] and $X_{i2}^t$ = [3, 5, 2, 6, 1, 0, 7, 8, 4] might result in the trial vector [3, 5, 5, 6, 1, 0, 7, 8, 4], which selects feature 5 twice whereas feature 2 is discarded completely. Hence, a simple repair mechanism is applied to the trial vector to ensure only feasible candidate solutions, i.e., that every feature appears exactly once.

DE-FSPM starts by initializing a population of integer-valued vectors. Each vector is subjected to the permutation-based mutation, followed by the crossover operation of Eq. (3) with the simple repair mechanism, resulting in the offspring population. The selection operation of Eq. (4) is then applied, resulting in the new population. This process continues until the maximum number of iterations is completed or another convergence criterion, if any, is met.
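A small Python sketch of the encoding and of one plausible repair rule is given below; keeping the first occurrence of each duplicated feature and appending the missing ones at the end is our illustrative assumption, not necessarily the exact rule of [22].

```python
import numpy as np

def repair(trial, nfeat):
    """Keep the first occurrence of each feature id (0 is the delimiter) and
    append the missing ids, so every feature appears exactly once."""
    seen, fixed = set(), []
    for v in trial:
        if v == 0 or v not in seen:
            fixed.append(v)
            seen.add(v)
    missing = [f for f in range(1, nfeat + 1) if f not in seen]
    return np.array(fixed + missing)

def decode(perm):
    """Features to the left of the 0 delimiter form the selected subset."""
    cut = list(perm).index(0)
    return perm[:cut]

trial = np.array([3, 5, 5, 6, 1, 0, 7, 8, 4])  # feature 5 duplicated, 2 lost
print(repair(trial, 8))          # -> [3 5 6 1 0 7 8 4 2]
print(decode(repair(trial, 8)))  # -> [3 5 6 1]
```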

Proposed scalable, parallel and distributed wrapper

To perform FSS, we chose DE and TA, and built hybrids around them. DE explores and exploits the search space globally and is stochastic in nature. However, DE sometimes gets bogged down in local minima before convergence [12], thereby slowing down the convergence rate. If the search space becomes large, this phenomenon gets accentuated. Hence, it needs support from a local search-based optimization algorithm. Here, we chose to employ Binary Threshold Accepting (BTA) for that purpose. The deterministic way of accepting the candidate solutions in BTA helps in exploration, exploitation, and fast convergence.

Even though the EAs are intrinsically parallel, explicit parallel versions of EAs have to be designed so that they meet the following requirements: optimal utilization of the distributed resources, scalability, and low communication overhead. In general, the explicit parallelism from the population perspective of EAs is achieved by two main models:

  • Master–slave (MS) strategy [58], which is also known as the global model, has only a single global population. Here, the master takes the responsibility of applying metaheuristics (EAs subsumed) while the slave manages the fitness function evaluation.

  • The island strategy [58] is where the population is divided into islands, upon which the heuristics (operators) are applied independently.

The types mentioned above differ in the underlying topology and in the migration rules that determine the communication between the nodes. A hybrid strategy can also be designed by combining the two strategies. In the current study, we proposed and designed a MapReduce multithreaded framework that mimics the combination of the above-mentioned strategies.

A distributed framework like Apache Spark is not affected by the underlying topology because, in such frameworks, migrant solutions are broadcast to all the partitions irrespective of the topology.

A comparative analysis is carried out across PB-DE, PB-DETA, PB-TADE, PB-ADE, and P-DE-FSPM to establish the importance of hybridizing global and local search optimization heuristics. We also developed a parallel BTA independent of DE; however, when BTA alone was employed for FSS, it consumed enormous computational time without yielding useful results. Hence, the comparative study excludes stand-alone BTA. All the approaches follow the same solution encoding scheme and the population RDD encoding scheme mentioned in Sect. 4.1. In what follows, the algorithm-agnostic details of the proposed parallelization strategy are presented in Sect. 4.2, while the algorithm-specific details are presented in Sect. 4.3. Thus, Sects. 4.1 through 4.3 succinctly capture the proposed parallelization mechanism in an unambiguous manner.

Algorithm 2: Biased sampling driven population initialization

Population RDD encoding scheme

Let the population, denoted by P, be initialized randomly using biased sampling from the uniform distribution on [0,1], as presented in Algorithm 2. Our objective is to select less cardinal feature subsets yielding the highest AUC value; hence, the parameter of the biased sampling is set to 0.99. If the pseudo-random number is greater than 0.99, then the bit is assigned the value 1, indicating the feature's presence, and 0 otherwise, indicating its absence. The population thus initialized according to Algorithm 2 is stored as the population RDD, and the dataset as a separate RDD. The population RDD schema is presented in Fig. 1. The key is the solution id, and the first index of the value is the binary solution vector, which conveys which feature subset is selected. The second index stores the names or ids of the selected features, which helps form the column-reduced dataset and construct the pipelined RDD for the respective solutions. The last index stores the AUC corresponding to the solution vector.
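A minimal Python sketch of this biased initialization (Algorithm 2) is shown below; the population size and dimensionality are hypothetical.

```python
import numpy as np

def biased_init(pop_size, nfeat, p=0.99, rng=None):
    """Biased-sampling initialization: a bit is set to 1 only when the
    uniform pseudo-random number exceeds p, so the initial feature
    subsets are deliberately sparse (low cardinality)."""
    rng = rng or np.random.default_rng()
    return (rng.random((pop_size, nfeat)) > p).astype(int)

pop = biased_init(pop_size=20, nfeat=1000)
print(pop.sum(axis=1)[:5])   # roughly 1% of bits set per solution on average
```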

Fig. 1 Schema of the population RDD

Algorithm-agnostic details of the parallelization

This subsection gives an overview of the proposed parallel model. The framework consists of three stages: (1) population initialization, (2) training phase, and (3) test phase. These stages are common to all the algorithms. In this section, we discuss the changes occurring to the population RDD during these phases in a generic way, to explain the workflow of the framework.

Changes occurring to population RDD during population initialization

This is stage 1, where the chosen algorithm initializes the population according to Algorithm 2, which follows the biased-sampling-driven population initialization. The initialized population is stored in a separate RDD following the schema depicted in Fig. 1; complete details are presented in Sect. 4.1. All the necessary parameters associated with the algorithm are broadcast to all the worker nodes. Once a variable is broadcast, it is cached by the executors and utilized in subsequent RDD operations, viz., transformations and actions. Broadcast variables are read-only; by broadcasting, these variables are shared across the cluster, thereby reducing the communication overhead.

Changes occurring to population RDD during training phase

Once the population is initialized, its fitness score is computed. As mentioned earlier, the population RDD is divided into different partitions, and each partition represents an island. Each island undergoes the algorithm-specific heuristics asynchronously as a separate thread in Spark. As explained earlier, the population RDD follows the schema depicted in Fig. 1: the population information corresponding to the current generation is stored in the binary vector column of each solution, and, using this binary vector information, the corresponding selected features are stored in the selected-features column of the population RDD. EAs spend most of their time computing the fitness value; hence, computing the fitness score (namely, the AUC) asynchronously is the major task. This is achieved by initiating a thread pool mapper, where the number of threads equals the number of worker nodes. The number of threads cannot be chosen arbitrarily, because creating too many threads increases the communication overhead; after rigorous trials, the ideal size of the thread pool was found to be the number of worker nodes. The size of the thread pool also affects the speedup: a small thread pool leads to poor performance, whereas a larger one leads to huge communication overhead.

Each thread in the thread pool is responsible for computing the AUC of one population island. Once this map is called, the reduced dataset is formed from the feature subset information, using the selected-feature index of each solution. Then, the machine learning (ML) pipeline is constructed, binding the reduced dataset and the LR model together, thereby giving a pipelined RDD. The pipelined RDD thus generated is a subclass of RDD, an immutable and partitioned collection of elements in which all operations are executed in parallel. The LR model is trained by creating a vector assembler over the reduced data frame for the corresponding selected features; such a vector assembler is created for every solution in each iteration. As we have adopted the pipelined RDD, the above operations are executed in parallel and distributed across the nodes, thereby achieving fine-grained parallelism. Later, the AUC is evaluated on the training dataset in the same thread. This whole process is repeated for each thread in the fitness value computation step; by adopting this strategy, the computation of the AUC is performed asynchronously. The same strategy is employed to parallelize the fitness function evaluation in all the proposed approaches. The AUCs are then updated in the population RDD accordingly, and this serves as the parent population. All the subsequent iterations follow the same thread pool mapper strategy to evaluate the AUC.
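This strategy can be sketched in PySpark as follows. It is a minimal illustration, not the paper's code: the names evaluate_solution and evaluate_population, and the population items' selected attribute, are our assumptions, and Spark's built-in areaUnderROC metric stands in for the AUC definition given later (the average of sensitivity and specificity).

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

def evaluate_solution(train_df, selected_cols, label_col="label"):
    """Fitness of one solution: keep only the selected feature columns,
    bind them to an LR model through an ML pipeline, and score the model."""
    assembler = VectorAssembler(inputCols=selected_cols, outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol=label_col)
    model = Pipeline(stages=[assembler, lr]).fit(train_df)
    evaluator = BinaryClassificationEvaluator(labelCol=label_col,
                                              metricName="areaUnderROC")
    return evaluator.evaluate(model.transform(train_df))

def evaluate_population(train_df, population, n_workers):
    """Thread pool mapper: one thread per worker node; each submitted job is
    scheduled concurrently by Spark, so the AUCs are computed asynchronously."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(evaluate_solution, train_df, sol.selected)
                   for sol in population]
        return [f.result() for f in futures]
```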

As mentioned earlier, the binary-encoded solution is stored in the binary vector field. Using this information, the population is subjected to the algorithm-specific exploration and exploitation heuristics, thereby forming the offspring population, which follows the same structure as depicted in Fig. 1. The corresponding selected features are also updated using the binary vector information of the offspring thus formed. Then the thread pool mapper is called, where the LR model evaluates the AUC on the reduced datasets, and the AUC column is updated with the computed AUC of each solution. The selection operator is applied to both the parent and offspring populations, and each worse parent with a lower AUC is replaced by the offspring with a higher AUC. These better solutions form one population, which serves as the parent population for the next iteration. It is important to note that all the algorithms follow a similar fitness score evaluation scheme. This process of computing the AUC and selecting the better solutions, thereby achieving the evolution of the population, is repeated for maxIter iterations.

Changes occurring to population RDD during test phase

The fully evolved population at convergence, which is nothing but the set of required feature subsets, is evaluated on the test dataset in the test phase. The test AUC is computed, and the corresponding solutions and test AUCs are reported.

It is important to note that this stage is common for all the algorithms.

Algorithm-specific details of the parallelization

In this subsection, the discussion pivots to the proposed parallel algorithms. It is important to note that all the algorithms are subjected to the same framework discussed in Sect. 4.2, and all of them are identical in the fitness score evaluation and test phases; hence, the discussion of these two is omitted. The discussion instead focuses on the algorithm-specific details, i.e., population initialization and the training phase.

Parallel binary differential evolutionary (PB-DE)

The proposed PB-DE based wrapper algorithm is presented in Algorithm 3, and the flowchart of the execution flow is depicted in Fig. 5 in the Appendix.

Fig. 5 Flowchart of the PB-DE based wrapper

Population initialization

The PB-DE algorithm starts by initializing the population as mentioned in Sect. 4.2. The following information is broadcast to all the nodes: the mutation factor (MF), crossover rate (CR), number of features (nfeat), and population size (n). Then the AUC of the initialized population is evaluated by LR in an asynchronous fashion by creating the thread pool as mentioned in Sect. 4.2. All the necessary information is updated in the population RDD.

Training phase

Using the binary vector field information, the population is subjected to the DE heuristics, viz., mutation and crossover, thus forming the offspring population, which follows the RDD structure depicted in Fig. 1. Using the binary vector information of the offspring population, the corresponding selected feature subsets are updated. Then the thread pool mapper (refer to Sect. 4.2) is called, where the LR model evaluates the AUC on the reduced datasets, and the AUC column is updated with the obtained AUC of each solution. The binary vector field information is extracted from the parent and offspring population RDDs, which serve as the parent and offspring populations. Then the selection operator is applied to both populations, and each worse parent with a lower AUC is replaced by the better offspring with a higher AUC. These better solutions form one population, which serves as the parent population for the next iteration. This process of computing the AUC and selecting the better solutions, thereby achieving the evolution of the population, is repeated for maxIter iterations.

Algorithm 3: Proposed PB-DE based wrapper

Algorithm 4: Proposed parallel PB-DETA based wrapper

Parallel binary DETA (PB-DETA)

The proposed parallel approach PB-DETA is presented in Algorithm 4, the schematic representation is depicted in Fig. 2, and the flowchart of the execution flow is depicted in Fig. 6 in the Appendix. This hybrid variant of DE and TA is a tightly-coupled hybrid system, where DE and TA work in tandem, and this scheme is executed in every global iteration.

Fig. 2 Schematic representation of the DETA based wrapper

Fig. 6 Flowchart of the PB-DETA based wrapper

Population initialization

The population is initialized as per Algorithm 2, and then the AUC is computed on the initialized population. All the initialized parameters, such as MF, CR, nfeat, and n, are broadcast.

Training phase

As mentioned earlier, BDE and BTA are executed in tandem, i.e., one after another within one complete generation. Here too, the parent population is subjected to the BDE heuristics, viz., mutation and crossover, thereby forming the offspring population. As mentioned in Sect. 4.2, the thread pool mapper is then called on the offspring population to evaluate the AUC, and the corresponding fields are updated. Then, the better offspring solutions replace the parent solutions. The population thus evolved is subjected to the BTA heuristics as outlined in Algorithm 1.

Then, the thread pool mapper is called on the offspring population thus formed to evaluate the AUC, and the corresponding fields are updated. Here, an offspring solution that is not worse than the threshold limit value replaces the corresponding solution in the parent population. The BTA heuristics are invoked maxIter2 times, and the population that emerges after these maxIter2 passes serves as the parent population. After this, the whole process of BDE and BTA in tandem is repeated until maxIter1 iterations are completed.
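In outline, the tandem control flow of PB-DETA (and of its reversed variant PB-TADE, described next) can be sketched as follows; bde_generation and bta_pass refer to the earlier sketches, the threshold value is hypothetical, and the fitness function is assumed to be supplied by the caller.

```python
import numpy as np

def pb_deta(pop, fitness_fn, max_iter1, max_iter2, threshold=0.01):
    """DETA tandem loop: each global iteration runs one BDE generation,
    then max_iter2 BTA passes over every solution."""
    for _ in range(max_iter1):
        pop = bde_generation(pop, fitness_fn)              # global search
        for _ in range(max_iter2):                         # local search
            pop = np.array([bta_pass(s, fitness_fn, threshold)[0] for s in pop])
    return pop

def pb_tade(pop, fitness_fn, max_iter1, max_iter2, threshold=0.01):
    """TADE reverses the order: BTA first, then BDE, in every global iteration."""
    for _ in range(max_iter1):
        for _ in range(max_iter2):
            pop = np.array([bta_pass(s, fitness_fn, threshold)[0] for s in pop])
        pop = bde_generation(pop, fitness_fn)
    return pop
```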

Algorithm 5: Proposed parallel PB-TADE based wrapper

Parallel binary TADE (PB-TADE)

The proposed parallel approach PB-TADE is presented in Algorithm 5, the schematic representation is depicted in Fig. 3, and the flowchart is depicted in Fig. 7 in the Appendix. This hybrid variant of DE and TA is also a tightly-coupled hybrid system, where TA and DE work in tandem: TA executes first, the evolved population is given as input to DE, and this scheme is executed in every global iteration.

Fig. 3 Schematic representation of the PB-TADE based wrapper

Fig. 7 Flowchart of the parallel PB-TADE based wrapper

Population initialization

The population is initialized as per Algorithm 2, and then the AUC is computed on the initialized population. All the initialized parameters, such as MF, CR, nfeat, and n, are broadcast.

Training phase

This algorithm is a hybrid of BTA and BDE. Here, BTA is executed first for maxIter2 times, followed by BDE. It means that first, the local exploitation happens in each solution individually. Later, it is followed by global exploration and exploitation over the entire population.

Using the binary vector field information, the parent population is subjected to the BTA heuristic as presented in Algorithm 1, thus forming the offspring population. Then, the AUC is evaluated by calling the thread pool mapper on the offspring population. The offspring solutions whose fitness values are not much worse than those of the corresponding parent solutions are selected. This process is continued maxIter2 times, which results in an evolved population that serves as the initial parent population for BDE. The population thus evolved by BTA is subjected to the BDE heuristics, viz., mutation and crossover. After that, the thread pool mapper is called, and the AUC of the offspring population is evaluated. Then, the selection operation is applied, replacing the worse parent solutions with their better offspring solutions. This serves as the parent population for the next iteration. After this, the whole process of BTA and BDE in tandem is repeated until maxIter1 iterations are completed.

Parallel binary ADE (PB-ADE)

The proposed parallel approach PB-ADE is presented in Algorithm 6, and the flowchart is depicted in Fig. 8 in the Appendix. As it is implemented in the same environment, PB-ADE also follows the same structure as the previous algorithms; the difference is the adaptive selection of the MF and CR parameters, as described in detail in Sect. 3.4.

Fig. 8 Flowchart of the PB-ADE based wrapper

Population initialization

The population is initialized as per Algorithm 2, and then the AUC is computed on the initialized population. All the initialized parameters, such as MF, CR, nfeat, and n, are broadcast.

Training phase

Using the binary vector field information, the parent population is subjected to the ADE heuristics as presented in Algorithm 6, thus forming the offspring population. Then, the AUC is evaluated by calling the thread pool mapper on the offspring population. Then, the selection operation is applied, replacing the worse parent solutions with their better offspring solutions; this serves as the parent population for the next iteration. The success rate of the current MF and CR combination is now calculated (refer to Eq. 5) and stored accordingly, and the roulette wheel is then called to choose the next MF and CR. This whole process of applying the heuristics and calling the roulette wheel is repeated until maxIter iterations are completed.

Algorithm 6: Proposed PB-ADE based wrapper

Parallel permutation based DE (P-DE-FSPM)

The proposed parallel approach P-DE-FSPM is presented in Algorithm 7, and the flowchart is depicted in Fig. 9 in the Appendix. P-DE-FSPM has special, permutation-inspired heuristics that require the integer-valued form of the solution. Hence, along with the binary-encoded form of the solution, the integer-valued vector is also stored to undergo the underlying permutation-based heuristics, as depicted in Fig. 9.

Fig. 9 Flowchart of P-DE-FSPM based wrapper

Algorithm 7: Proposed P-DE-FSPM based wrapper

Population initialization

The P-DE-FSPM algorithm requires the initialization of the integer-valued vector (refer to Sect. 3.5). The value '0' acts as the delimiter between non-selected and selected features, where the portion to its right constitutes the selected feature subset. Based on the selected feature subset, the binary vector is generated, and the population is formed following the structure depicted in Fig. 4, as per the chosen population size (n). The major difference between the previous and current population RDD structures is the additional storage of the integer-valued vector. Similarly, all the initialized parameters such as MF, CR, nfeat, and n are broadcasted. The same fitness function strategy used in the earlier algorithms is also used here. Then the AUC is computed on the initialized population. The encoding is illustrated below.
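A toy illustration of this encoding, assuming features are numbered 1..nfeat and '0' acts as the delimiter described above:

```python
import numpy as np

def decode(perm):
    """perm: permutation of {0, 1, ..., nfeat}; entries after 0 are selected."""
    cut = np.where(perm == 0)[0][0]
    selected = perm[cut + 1:]                  # right-hand side = chosen features
    binary = np.zeros(len(perm) - 1, dtype=int)
    binary[selected - 1] = 1                   # features are numbered from 1
    return binary

perm = np.random.permutation(np.arange(0, 6)) # toy example with nfeat = 5
print(perm, "->", decode(perm))
```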

Fig. 4 Schema of the population RDD for the P-DE-FSPM

Training phase

Using the binary vector field information, the parent population is subjected to the DE-FSPM heuristics, namely the permutation based mutation, crossover, and the repair mechanism, as presented in Algorithm 7, thus forming the offspring population. After applying the heuristics, the corresponding binary form of the vector is regenerated from the integer-valued information and updated in the corresponding column. As mentioned in Sect. 4.2, the AUC is evaluated by calling the thread pool mapper on the offspring population. Then, the selection operation is applied, replacing the worse parent solutions with their better offspring solutions; the result serves as the parent population for the next iteration. The whole process is repeated until maxIter iterations are completed.

Classification algorithm

Logistic Regression (LR) is employed as the classifier for the proposed FSS wrappers. LR is chosen because it is fast to train, makes no distributional assumptions about the errors or the variables, and has no hyperparameters to fine-tune.

Fitness function

Area under the ROC Curve (AUC) is the fitness function for our proposed wrappers. It is proven to be a more robust measure than accuracy for unbalanced datasets. Here, it is defined as the average of specificity and sensitivity. The cut-off probability for labelling samples into one of the two classes is taken as 0.5. Accordingly,

$$AUC = \frac{Sensitivity + Specificity}{2} \qquad (6)$$

where sensitivity is the ratio of the positive samples that are correctly predicted to be positive to all the actual positive samples. This is also called the True Positive Rate (TPR):

$$Sensitivity = \frac{TP}{TP + FN} \qquad (7)$$

where TP is the number of true positives and FN the number of false negatives. Specificity is the ratio of the negative samples that are correctly predicted to be negative to all the actual negative samples. This is also called the True Negative Rate (TNR):

$$Specificity = \frac{TN}{TN + FP} \qquad (8)$$

where TN is the number of true negatives and FP the number of false positives. A minimal sketch of this fitness evaluation is given below.
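The following sketch trains LR on the columns selected by a binary mask and scores Eqs. (6)-(8); scikit-learn is used purely for illustration, not as the authors' exact implementation.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

def fitness_auc(X_tr, y_tr, X_te, y_te, mask):
    cols = mask.astype(bool)                       # selected feature subset
    model = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    y_hat = (model.predict_proba(X_te[:, cols])[:, 1] >= 0.5).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, y_hat).ravel()
    sensitivity = tp / (tp + fn)                   # Eq. (7)
    specificity = tn / (tn + fp)                   # Eq. (8)
    return (sensitivity + specificity) / 2         # Eq. (6)
```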

Time complexity

This subsection estimates the computational complexity of all the algorithms listed in Table 3. Zielinski et al. [83] pioneered the investigation of the time complexity of differential evolution. Let the population size be P, let the search space be d-dimensional, and assume that DE converges after globalGmax global generations. Then, the computational complexity of DE is O(P * d * globalGmax). TA is a point-based algorithm, hence it operates on a single solution; its complexity over d dimensions and globalGmax iterations is O(d * globalGmax). As we designed a population-based TA, the complexity becomes O(P * d * globalGmax). For TADE, with globalGmax global iterations and localGmax local TA iterations, the complexity is O(globalGmax * (P * d + localGmax * P * d)) = O(globalGmax * localGmax * P * d); the simplification is worked out below. Similarly, the DETA complexity is also O(globalGmax * localGmax * P * d). Hence, the complexities of DETA and TADE are identical, as both are tightly coupled hybrids, and the hybrid algorithms have higher time complexity than the native algorithms. The time complexity of ADE is the same as that of DE because the complexity of the underlying heuristics is the same. Rivera-Lopez et al. [22] reported the complexity of DE-FSPM as O(P * d^2 * globalGmax).
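For clarity, the simplification invoked for the hybrids follows because the localGmax term dominates the sum:

$$O\bigl(globalG_{max}\cdot(P\cdot d + localG_{max}\cdot P\cdot d)\bigr) = O\bigl(globalG_{max}\cdot P\cdot d\cdot(1 + localG_{max})\bigr) = O(globalG_{max}\cdot localG_{max}\cdot P\cdot d)$$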

Table 3.

Time complexity of the algorithms

Algorithm Time complexity
DE O(P * d * globalGmax)
DETA O(globalGmax * localGmax * P * d)
TADE O(globalGmax * localGmax * P * d)
ADE O(P * d * globalGmax)
DE-FSPM O(P * d^2 * globalGmax)

Dataset description

The meta-information of the benchmark datasets is presented in Table 4. None of the datasets except the Microsoft Malware dataset contains categorical features; the categorical features are handled using the one-hot encoding mechanism (a brief sketch follows). All the datasets pertain to binary classification problems. The Microsoft Malware dataset is accessed from the Kaggle repository [84], the IEEE Malware dataset from the IEEE DataPort [85], the OVA_Omentum and OVA_Uterus genomic datasets from the open-source OpenML repository [86], and the Epsilon dataset from the LIBSVM binary dataset repository [87].
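As a brief illustration of the one-hot encoding step (pandas shown for convenience; the column names are modelled on Table 7 features, while the values below are illustrative, not taken from the raw data):

```python
import pandas as pd

df = pd.DataFrame({"SmartScreen": ["On", "Off", "Warn"],   # categorical
                   "AVProductsInstalled": [1, 2, 1]})      # already numeric
encoded = pd.get_dummies(df, columns=["SmartScreen"])      # one binary column per level
print(encoded.columns.tolist())
# ['AVProductsInstalled', 'SmartScreen_Off', 'SmartScreen_On', 'SmartScreen_Warn']
```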

Table 4.

Description of the benchmark datasets

Name of the dataset # objects # features # classes Size of the dataset
Epsilon 500,000 2000 2 10.8 GB
Microsoft Malware 3,259,724 76 2 1.8 GB
IEEE Malware 1,500,000 1000 2 3.2 GB
OVA_Omentum 1584 10,935 2 108.3 MB
OVA_Uterus 1584 10,935 2 108.3 MB

Results and discussions

All the datasets are divided into training and test sets in the ratio 80%:20%. Stratified random sampling is performed to maintain a similar proportion of the classes in the training and test sets (see the sketch below). It is well known that the performance of EC techniques is susceptible to changes in hyperparameters. Hence, after rigorous fine-tuning over several combinations, the hyperparameters listed in Table 5 were chosen. For each algorithm and dataset, the population size is fixed at 10, and the maximum number of generations is taken as 20. In the case of PB-DE, the DE is executed for 20 generations, whereas in the two hybrids (PB-DETA and PB-TADE), DE and TA are individually executed for 10 generations each, making 20 generations in all. The algorithms PB-ADE and P-DE-FSPM are also executed for 20 generations. PB-ADE adapts MF and CR during the run, which is why it is absent from Table 5. All the algorithms are run 20 times to nullify the impact of the random seed, as is customary for EC techniques. The top solutions that achieved the highest AUC in each run are used to report the average highest AUC and the corresponding average cardinality over 20 runs (see Table 6). All the top results obtained in this study are highlighted in bold.
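The stratified split can be sketched as follows; scikit-learn is shown for illustration, whereas the paper performs the equivalent operation in the Spark environment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 20)                  # toy data for illustration
y = np.random.randint(0, 2, 1000)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y,         # preserves the class proportions
    random_state=42)
```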

Table 5.

Hyperparameters for all the approaches

Dataset PB-DE PB-DETA PB-TADE P-DE-FSPM
MF CR MF CR MF CR MF CR
Epsilon 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8
Microsoft Malware 0.8 0.9 0.8 0.9 0.8 0.9 0.8 0.9
IEEE Malware 0.8 0.9 0.8 0.9 0.8 0.9 0.8 0.9
OVM_Omentum 0.75 0.9 0.75 0.9 0.75 0.9 0.75 0.9
OVM_Uterus 0.85 0.9 0.85 0.9 0.85 0.9 0.85 0.9

Table 6.

Average Cardinality and mean AUC obtained

Dataset PB-DE PB-DETA PB-TADE PB-ADE P-DE-FSPM
Avg. Cardinality Mean AUC Avg. Cardinality Mean AUC Avg. Cardinality Mean AUC Avg. Cardinality Mean AUC Avg. Cardinality Mean AUC
Epsilon 617.3 0.7932 486.1 0.8029 457.7 0.8089 555.65 0.797 558 0.7971
Microsoft Malware 29.6 0.6872 21.7 0.7002 18.60 0.7054 16.95 0.682 15.8 0.6924
IEEE Malware 643.45 0.7929 477.9 0.8035 463.9 0.8009 463.5 0.790 499.55 0.7937
OVA_Omentum 47.28 0.8607 35.54 0.8722 26.15 0.8817 49.3 0.846 32.9 0.870
OVA_Uterus 37.3 0.8607 28.60 0.8712 27.12 0.8802 46.2 0.845 49.7 0.871

Environmental setup

All the experiments are conducted on a Spark-HDFS cluster with Spark version 2.4.4 [88] and Hadoop version 2.7, having one master node and four worker nodes, each with 32 GB RAM and an 8th-generation Intel i5 processor.

Mean AUC comparative analysis

The mean AUC obtained is compared to establish the superiority of the proposed hybrid approaches over PB-DE and the other baselines, PB-ADE and P-DE-FSPM.

The results in Table 6 show that PB-TADE achieved the best AUC, while PB-DE got stuck in local optima. Both PB-DETA and PB-TADE avoided this, as they employ TA either before or after DE. The local search exploitation contributed by TA not only prevents entrapment in local optima but also helps locate better optima. The feature subsets selected by PB-DE achieved lower AUC when compared to both PB-DETA and PB-TADE. Also, the average cardinality obtained by PB-DE is relatively high compared to that of both PB-DETA and PB-TADE. Neither outcome is ideal for obtaining an optimal solution. The point at which TA is invoked also matters while designing the hybrid model; hence, we examined both possibilities as part of the ablation study. We also designed a parallel BTA independent of the DE; however, when BTA alone is employed for FSS, it consumes enormous computational time and produces inferior results. Hence, the comparative study excludes standalone BTA.

In the ablation experiment involving PB-TADE and PB-DETA, the former achieved a slightly lower average cardinality of the feature subsets than the latter, while its mean AUC remains quite comparable in the case of IEEE Malware and Microsoft Malware. The former outperformed the latter because (i) BTA is essentially very good at local search by virtue of being a point-based algorithm and a deterministic variant of simulated annealing; (ii) even though we proposed a population-based BTA, in these hybrids the hallmark of a population-based evolutionary algorithm, namely passing on the knowledge learned by the individual solutions to one another from generation to generation, is conspicuously missing by design; there are, therefore, at most population-size many BTA instances running independently; (iii) thus, in PB-TADE, after BTA performs the exploitation well, the baton is passed on to BDE, which is demonstrably superior in both exploration and exploitation, and this process continues in every iteration; (iv) in PB-DETA, by contrast, BDE does the job of exploration and exploitation before BTA is invoked, so BTA can only fine-tune the fitness values already obtained by DE. Further, we should note that both BTA and BDE are run for 10 iterations each, which means that they are run with relaxed convergence criteria without adversely impacting the fitness value, i.e., the AUC. This is a significant departure from traditional implementations of BDE and BTA for combinatorial optimization, where they are typically run for many iterations until convergence. This strategy is designed to reduce computational time, primarily because we deal with big datasets in a distributed manner in this paper.

Further, the proposed hybrids PB-TADE and PB-DETA outperformed PB-DE, PB-ADE, and P-DE-FSPM in terms of mean AUC. Among the latter three, P-DE-FSPM stands out as the best algorithm because, on the Microsoft Malware and Omentum datasets, its average cardinality is lower than that of PB-DETA at a small cost in AUC. P-DE-FSPM also achieved results comparable to PB-DETA on the high-dimensional datasets in terms of AUC, although the cardinality of its selected feature subsets is higher than that of the latter. Moreover, P-DE-FSPM outperformed both PB-DE and PB-ADE on most of the datasets.

Further, to the best of our knowledge, no feature subset selection work has been reported on the Microsoft Malware and IEEE Malware datasets. On the Epsilon dataset, Peralta et al. [89] designed a MapReduce approach for evolutionary feature selection using CHC as the evolutionary strategy and logistic regression as the classifier, achieving an AUC of 0.6985 with 721 features, whereas PB-TADE obtained an average AUC of 0.8089 with an average of 457.7 features. Moreover, Pes [90] conducted feature selection using Symmetric Uncertainty (SU), with the AUC scores computed using Random Forest (RF), and reported AUCs of 0.695 and 0.6 on the OVA_Uterus and OVA_Omentum datasets, respectively, without reporting the number of features that obtained these scores. PB-TADE, in contrast, achieved average AUCs of 0.8802 and 0.8817 with an average of 27.12 and 26.15 features on OVA_Uterus and OVA_Omentum, respectively. Thus, our proposed methods outperformed the state-of-the-art results on these datasets.

Repeatability

Repeatability is one of the critical criteria for determining how robust and stable the designed wrapper method is. The more often an optimal feature or feature subset repeats itself across runs, the more powerful the underlying EA is said to be. In this subsection, repeatability analysis is conducted in two ways: first, with respect to the features repeated individually among the often-repeated feature subsets with the highest AUC, and second, with respect to the repetition of whole feature subsets corresponding to the highest AUC.

Repeatability of the individual features

The most often repeated features that are part of an optimal feature subset with the highest AUC are identified and presented in Table 7. Features repeated in more than 50% of the best individuals obtained over the 20 runs are considered (a counting sketch follows); the top five most repeated features selected by each approach are reported. It turns out that the repeated features selected by PB-DETA and PB-TADE are almost identical, whereas the features chosen by PB-DE and PB-ADE differ slightly. Interestingly, the most repeated individual features selected by PB-TADE are also selected by PB-DETA and P-DE-FSPM on most of the datasets.
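The counting behind Table 7 can be sketched as follows; best_subsets is a hypothetical list holding the winning feature-index set of each run.

```python
from collections import Counter

def repeated_features(best_subsets, n_runs=20, threshold=0.5):
    """Return features appearing in more than `threshold` of the runs."""
    counts = Counter(f for subset in best_subsets for f in subset)
    return sorted(f for f, c in counts.items() if c / n_runs > threshold)
```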

Table 7.

Most often repeated features selected by each approach

Dataset Approach Most repeated features
Epsilon PB-DE 1, 3, 5, 7, 9
PB-DETA 1, 3, 6, 12, 19
PB-TADE 1, 3, 6, 12, 19
PB-ADE 1, 3, 5, 7, 9
P-DE-FSPM 1, 3, 6, 12, 9
Microsoft Malware PB-DE AVProductsInstalled, HasTpm, IsProtected, Census_OEMNameIdentifier, SmartScreen
PB-DETA AVProductsInstalled, HasTpm, IsPassiveMode, OsSuite, SmartScreen
PB-TADE AVProductsInstalled, HasTpm, OsSuite, RipStateBuild, SmartScreen
PB-ADE AVProductsInstalled, HasTpm, IsProtected, Census_OEMNameIdentifier, SmartScreen
P-DE-FSPM AVProductsInstalled, HasTpm, OsSuite, RipStateBuild, SmartScreen
IEEE Malware PB-DE GetProcAddress, GetThreadId, Sleep, FindClose, RaiseException
PB-DETA GetProcAddress, GetLastError, Sleep, ReadFile, RaiseException
PB-TADE GetProcAddress, GetLastError, Sleep, ReadFile, RaiseException
PB-ADE GetProcAddress, GetThreadId, Sleep, FindClose, RaiseException
P-DE-FSPM GetProcAddress, GetThreadId, Sleep, FindClose, RaiseException
OVA_Omentum PB-DE 158765_at, 201608_s_at, 206442_at, 207096_s_at, 210002_s_at
PB-DETA 1554436_s_at, 201669_s_at, 20644_s_at, 207442_s_at, 208970_s_at
PB-TADE 1554436_s_at, 201669_s_at, 20644_s_at, 207442_s_at, 208970_s_at
PB-ADE 158765_at, 201608_s_at, 206442_at, 207096_s_at, 210002_s_at
P-DE-FSPM 158765_at, 201608_s_at, 206442_at, 207096_s_at, 210002_s_at
OVA_Uterus PB-DE 205866_s_at, 209682_s_at, 217294_s_at, 222421_s_at, 220148_s_at
PB-DETA 202125_s_at, 205866_s_at, 218132_s_at, 222421_s_at, 222784_s_at
PB-TADE 202125_s_at, 205866_s_at, 218132_s_at, 222421_s_at, 222784_s_at
PB-ADE 202125_s_at, 205866_s_at, 218132_s_at, 222421_s_at, 222784_s_at
P-DE-FSPM 202125_s_at, 205866_s_at, 218132_s_at, 222421_s_at, 222784_s_at

Repeatability of the feature subsets

All the feature subsets that yielded the highest AUC and repeated most often are reported in Table 8, where #s1 denotes the cardinality of the most-repeated feature subset and the corresponding AUC is given alongside. In the case of the Epsilon dataset, PB-DE selected 639 features, resulting in an AUC of 0.7967; PB-TADE outperformed the rest in terms of AUC while also selecting a feature subset of lower cardinality. On the Microsoft Malware dataset, PB-DE achieved an AUC of 0.6983 with 31 features, while both PB-DETA and PB-TADE achieved a better AUC than PB-DE with feature subsets of lower cardinality. P-DE-FSPM and PB-ADE also achieved better AUC with lower-cardinality subsets than PB-DE, but lower AUC than PB-DETA and PB-TADE. Even though the cardinalities of the subsets selected by PB-DETA and PB-TADE on the IEEE Malware dataset are nearly the same, the AUCs differ because the selected features are not identical; the same holds for the OVA_Omentum and OVA_Uterus datasets. Hence, the results indicate that, in terms of repeatability, PB-TADE achieves lower cardinality with better AUC than PB-DETA and PB-DE on all the datasets. Similarly, PB-DETA outperformed PB-DE. In terms of AUC, PB-TADE outperformed the rest of the algorithms overall. PB-DETA and P-DE-FSPM achieved comparable AUC performance on almost all the datasets, and PB-ADE achieved comparable performance except on the Omentum and Uterus datasets.

Table 8.

Cardinalities and the corresponding AUC of the Top-most repeated feature subsets

Dataset PB-DE PB-DETA PB-TADE PB-ADE P-DE-FSPM
#s1 AUC #s1 AUC #s1 AUC #s1 AUC #s1 AUC
Epsilon 639 0.7967 564 0.8068 488 0.8097 505 0.797 555 0.801
Microsoft Malware 31 0.6983 24 0.7057 17 0.7061 17 0.7 20 0.70
IEEE Malware 550 0.7956 486 0.8057 487 0.8058 483 0.804 588 0.798
OVA_Omentum 55 0.8701 37 0.8723 33 0.8779 66 0.866 33 0.876
OVA_Uterus 37 0.8607 28 0.8712 27 0.8802 60 0.86 50 0.877

*Where #s1 is the cardinality of the top-most repeated feature subset

Least cardinal feature subset with highest AUC

In this subsection, the least cardinal feature subset with the highest AUC among the 20 runs is discussed; the results are presented in Table 9. It turns out that PB-TADE outperformed all the algorithms in detecting the least cardinal feature subset, with PB-DETA and PB-ADE standing second and third, respectively. Except for the OVA_Omentum dataset, PB-TADE achieved a better AUC with fewer features than PB-DETA. In the case of Epsilon, even though PB-TADE and PB-DETA achieved the same AUC, PB-TADE selected a feature subset of lower cardinality. On all the datasets, PB-DE, PB-ADE, and P-DE-FSPM were outperformed by the proposed hybrid models. This fact further reinforces the role played by BTA in the hybridization.

Table 9.

Least Cardinal Feature subset with the highest AUC

Dataset PB-DE PB-DETA PB-TADE PB-ADE P-DE-FSPM
#s1 AUC #s1 AUC #s1 AUC #s1 AUC #s1 AUC
Epsilon 588 0.7967 471 0.8068 441 0.8068 505 0.797 555 0.801
Microsoft Malware 27 0.6915 22 0.7007 17 0.7061 17 0.7 20 0.70
IEEE Malware 550 0.7956 484 0.8057 487 0.8197 388 0.799 464 0.804
OVA_Omentum 41 0.8504 29 0.8723 31 0.8699 66 0.866 27 0.864
OVA_Uterus 31 0.8504 24 0.8699 26 0.8712 60 0.86 48 0.870

*Where #s1 is the cardinality of the least cardinal feature subset with the highest AUC

Speedup

Speedup is defined as the gain obtained by the parallel version of the algorithm over the sequential version executed on a single processor, as given in Eq. (9).

$$Speedup\ (S.U.) = \frac{\text{Time taken by the sequential version}}{\text{Time taken by the parallel version}} \qquad (9)$$

Speedup is one of the essential characteristics in evaluating the performance of the parallel version of an algorithm. The results are presented in Table 10; note that the speedup values are rounded to two decimals. We observed that the proposed parallel algorithms achieved significant speedup, ranging from 2.21 to 2.90 over their sequential counterparts across all datasets. As the number of worker nodes in the cluster is 4, the maximum possible speedup is 4. Linear speedup is not achieved because of the synchronization points in the parallel model. A quick numerical check of Eq. (9) is given below.
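The check below applies Eq. (9) to one row of Table 10 (Microsoft Malware, PB-DE):

```python
sequential_et, parallel_et = 16412, 6741       # execution times in seconds
speedup = sequential_et / parallel_et
print(round(speedup, 2))                       # -> 2.43, as reported
```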

Table 10.

Speedup Analysis of parallel versions over the sequential ones

Dataset Algorithm Sequential E.T Parallel E.T S.U
Epsilon PB-DE 12,780 4485 2.84
PB-DETA 12,680 4361 2.90
PB-TADE 12,688 4369 2.90
PB-ADE 12,780 4485 2.84
P-DE-FSPM 12,912 4860 2.65
Microsoft Malware PB-DE 16,412 6741 2.43
PB-DETA 15,781 6447 2.44
PB-TADE 15,779 6432 2.45
PB-ADE 16,222 6741 2.40
P-DE-FSPM 15,640 7077 2.21
IEEE Malware PB-DE 20,412 8151 2.50
PB-DETA 19,793 7936 2.49
PB-TADE 19,801 7938 2.49
PB-ADE 20,517 8817 2.32
P-DE-FSPM 20,331 8996 2.26
OVA_Omentum PB-DE 14,892 5428 2.74
PB-DETA 14,651 5226 2.80
PB-TADE 14,689 5428 2.81
PB-ADE 14,891 5407 2.75
P-DE-FSPM 16,860 6882 2.45
OVA_Uterus PB-DE 14,108 5368 2.62
PB-DETA 13,979 5222 2.68
PB-TADE 13,968 5378 2.68
PB-ADE 14,891 5407 2.75
P-DE-FSPM 16,042 6712 2.39

Top results are highlighted in bold

*Where E.T is the execution time given in seconds

Statistical testing of the results

The two-tailed t-test at a 5% level of significance with 38 (= 20 + 20 − 2) degrees of freedom is conducted pairwise on the proposed algorithms and the two parallelized baselines to make statistically valid statements about their performance. The results are presented in Table 11.

Table 11.

Paired t-test results

Model Parameter Epsilon Microsoft Malware IEEE Malware OVA_Omentum OVA_Uterus
PB-DE vs PB-TADE t-statistic 7.72 4.62 8.045 4.168 3.69
p-value 2.66 × 10^−9 4.25 × 10^−5 9.91 × 10^−10 0.00017 0.00069
PB-DETA vs PB-TADE t-statistic 3.56 3.106 3.63 2.06 1.744
p-value 0.001 0.0035 0.0008 0.045 0.0891
PB-ADE vs PB-TADE t-statistic 7.21 3.648 5.22 3.54 6.084
p-value 1.25 × 10^−9 0.0007 6.57 × 10^−6 0.0010 4.36 × 10^−7
P-DE-FSPM vs PB-TADE t-statistic 14.21 4.035 6.90 2.45 1.78
p-value 8.54 × 10^−17 0.00025 3.32 × 10^−8 0.0186 0.0818

The null hypothesis is H0: both the algorithms are statistically equal, while the alternate hypothesis is H1: both the algorithms are statistically not equal.

Conducting the t-test is important to check whether the numerical superiority of algorithm A over algorithm B occurred purely by chance or is indeed due to the superior nature of algorithm A. Hence, in this work, the top AUC scores achieved by the best solution in the population in each of the 20 runs are taken as the samples for the t-test. A two-tailed t-statistic is calculated, from which the p-value is obtained; the p-value is then compared with the level of significance to decide whether to retain the null hypothesis H0. As the p-values are below 0.05 for the large majority of the dataset-algorithm pairs, the null hypothesis is rejected, and the alternate hypothesis is accepted. We infer that PB-TADE is significantly different from PB-DE, PB-ADE, and P-DE-FSPM, as the corresponding p-values are significantly small. A sketch of this test is given below.
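The test can be sketched as follows; the two arrays of 20 per-run top AUC scores are illustrative stand-ins for the actual experimental samples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
auc_a = rng.normal(0.81, 0.005, 20)              # e.g. PB-TADE top AUCs over 20 runs
auc_b = rng.normal(0.79, 0.005, 20)              # e.g. PB-DE top AUCs over 20 runs
t_stat, p_value = stats.ttest_ind(auc_a, auc_b)  # df = 20 + 20 - 2 = 38
reject_h0 = p_value < 0.05                       # 5% level of significance
print(t_stat, p_value, reject_h0)
```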

In summary, the results indicate that PB-TADE achieved higher AUC with lower cardinality compared to PB-DETA, while P-DE-FSPM and PB-ADE stood second and third, respectively, and outperformed PB-DE on the majority of the datasets. The statistical analysis shows that PB-TADE is significantly different from PB-DETA and PB-DE in both exploration and exploitation. All the proposed parallel approaches achieved significant speedup compared to their sequential counterparts. Although PB-TADE and PB-DETA consumed more time than PB-DE, mainly owing to more function evaluations, both hybrid variants yielded optimal feature subsets much better than those obtained by PB-DE, PB-ADE, and P-DE-FSPM in terms of higher AUC and lower cardinality.

Conclusions and limitations

This paper develops parallel versions of BDE, BDETA, and BTADE and employs them for wrapper based feature subset selection, with logistic regression chosen as the classifier. We demonstrated their effectiveness on five high-dimensional datasets taken from the literature. The results indicate that PB-TADE achieved higher AUC with lower cardinality than PB-DETA, which stood second, while P-DE-FSPM and PB-ADE gave performance comparable to or better than PB-DE on the majority of the datasets. The statistical analysis shows that PB-TADE is significantly different from PB-DETA and PB-DE in both exploration and exploitation. Further, our proposed methods outperformed the state-of-the-art results wherever such results were reported.

The limitations of the current work are as follows:

  • The current study is conducted in a single-objective environment where only AUC is considered as the objective function.

  • The proposed hybrid algorithms are non-adaptive in the sense that their hyperparameters need to be tweaked manually rather than adapted automatically.

  • Further, we observed that the specific way of parallelizing the EAs in the current study consumes more time as the population size increases. Hence, the method of parallelizing EAs can be further improved.

  • A valid alternative to the proposed hybrid approach could be parallel particle swarm optimization (PSO) and its hybrid variants with TA in either a tightly or loosely coupled manner. Several recent metaheuristics, such as HGS, CPA, and HHO, can also be parallelized.

The above-discussed limitations motivate our future work. The current investigation will be carried out on the same problem in bi-objective and multi-objective environments. In addition, the machine learning based optimization algorithm presented in Hooten et al. [91] can be employed for this problem.

Biographies

Yelleti Vivek

graduated with an M.Tech. in Computer Science from the University of Hyderabad in 2020. He is currently a doctoral student at the National Institute of Technology, Warangal and IDRBT. His research areas include big data analytics, machine learning, evolutionary computing and reinforcement learning.

Vadlamani Ravi

is with the Institute for Development and Research in Banking Technology (IDRBT), Hyderabad since 2005. He holds a PhD in Soft Computing from Osmania University, Hyderabad and RWTH Aachen, Germany (2001), an MS from BITS Pilani (1991) and an M.Sc. (Statistics & OR) from IIT Bombay (1987). Earlier, he worked as a faculty member at the National University of Singapore from 2002 to 2005, and at RWTH Aachen under a DAAD Long Term Fellowship from 1997 to 1999, conducting advanced research. During 1988-2002, he worked in two CSIR research labs as Scientist B, Scientist C and Assistant Director. Over the last 34 years, he has been working in soft computing, evolutionary/neuro/fuzzy computing, data/text mining, global/multi-criteria optimization, big data analytics, social media analytics, time series data mining, machine learning/deep learning, bankruptcy prediction and analytical CRM. He has 22 years of teaching experience and 34 years of research experience. He has published more than 250 papers in refereed international/national journals/conferences and invited book chapters, with 9700 citations and an h-index of 45. He also edited three books, published by IGI Global, USA, in 2007 and IRT, UK, in 2021. He appears in the list of the top 1.03% of scientists published by Stanford University researchers in the PLOS Biology journal. He won Best Research Awards at CSIR-IICT and IDRBT. He is a referee for 40 international journals, an Associate Editor for Swarm and Evolutionary Computation, and an editorial board member for a few international journals of repute. He was the founding Managing Editor of the Journal of Banking and Financial Technology. He is an advisor to various Indian banks and regulators, including BoI, Andhra Bank, Canara Bank, NABARD, RBI, IRDAI and SEBI, for their DWH, analytical CRM, analytics and fraud analytics projects, and has developed roadmaps for implementing data science (AI/ML subsumed) projects for several banks. He evaluates project proposals submitted to the European Science Foundation, Dutch Science Foundation, Belgium Science Foundation and Irish Science Foundation, and book proposals submitted to Elsevier and Springer. He is a referee, PC member and chair for many international conferences of repute.

P. Radha Krishna

has been in the profession of research, development and technology adoption for about thirty years. He is currently working as a Professor and Head, Department of Computer Science and Engineering, National Institute of Technology (NIT) Warangal. His research interests include data mining, big data, machine learning, databases and workflow systems. Prior to joining NIT, he served as Principal Research Scientist at Infosys Labs, Infosys Limited, Hyderabad, where he was associated with research projects leading to futuristic intelligent systems and analytical solutions. Krishna conceptualized numerous research projects in the areas of CRM; social network mining; web, sequence, text and image mining; e-check clearing and settlement; and financial inclusion, and helped build innovation labs for leading industries. He also served as adjunct faculty at NIT Warangal and IIIT Hyderabad. Dr. Krishna also served as a faculty member at the Institute for Development and Research in Banking Technology (IDRBT, a research arm of the Reserve Bank of India and an associate institute of the University of Hyderabad), and as a scientist at the National Informatics Centre (Govt. of India), Bhopal. At IDRBT, he was an advisor to many public and private sector banks in India for the implementation of end-to-end enterprise-wide data warehouse and business intelligence solutions. Krishna was also a member of the IT Advisory Committee, Insurance Regulatory and Development Authority (IRDA), India, and has been a member of the research advisory committees of several academic institutes. He has also been a member of the editorial/review boards of international journals and a reviewer for several international journals, including IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Services Computing, Information Sciences, Data and Knowledge Engineering, Information Systems Frontiers and Soft Computing. Krishna has double PhDs, the first from Osmania University and the second from IIIT Hyderabad; 19 granted patents; has authored/co-authored six books; and has over a hundred publications in refereed journals and conferences. He has also been a tutorial speaker at several international conferences, including ER 2006, ICWS 2007, WWW 2008, IEEE Services 2010, CAiSE 2010, ICSE 2014 and ER 2014.

Appendix

See Figs. 5, 6, 7, 8 and 9.

Author contributions

VY: Methodology, software, validation, formal analysis, investigation, data curation, writing-original draft, visualization; VR: conceptualization, methodology, validation, formal analysis, investigation, writing-original draft, writing-review and editing, visualization, supervision, project administration; PRK: writing-review and editing, resources, project administration.

Funding

We did not receive any funding for this research.

Data availability

The datasets used in this study are taken from the public domain and the appropriate URLs have been cited in the text.

Declarations

Conflict of interest

The authors have not disclosed any competing interests.

Ethical approval

This manuscript is not under review by any other Journal/Conference. However, for obvious reasons, its earlier version was submitted to arXiv preprint server. Hence, it has a similarity count of 75%.

Informed consent

No human participants or animals are involved in this research.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Yelleti Vivek, Email: yvivek@idrbt.ac.in.

Vadlamani Ravi, Email: vravi@idrbt.ac.in.

P. Radha Krishna, Email: prkrishna@nitw.ac.in.

References

  • 1.CRISP DM. https://www.the-modeling-agency.com/crisp-dm.pdf. Accessed 24 Apr 2021
  • 2.Chandrashekar G, Sahin F. A survey on feature selection methods. Comput. Electr. Eng. 2014;40:16–28. doi: 10.1016/j.compeleceng.2013.11.024. [DOI] [Google Scholar]
  • 3.Xue B, Zhang M, Browne WN, Yao X. A survey on evolutionary computation approaches to feature selection. IEEE Trans. Evol. Comput. 2016;20:606–626. doi: 10.1109/TEVC.2015.2504420. [DOI] [Google Scholar]
  • 4.Wang G, Deb S, Cui Z. Monarch butterfly optimization. Neural Comput. Appl. 2019;31:1995–2014. doi: 10.1007/s00521-015-1923-y. [DOI] [Google Scholar]
  • 5.Hu, J., Gui, W., Heidari, A.A., Cai, Z., Liang, G., Chen, H., Pan, Z. Dispersed foraging slime mould algorithm: Continuous and binary variants for global optimization and wrapper-based feature selection. Knowl. Syst. 237 (2022).
  • 6.Strumberger I, Bacanin N. Modified moth search algorithm for global optimization problems. Int. J. Comput. 2018;3:44–48. [Google Scholar]
  • 7.Yang Y, Chen H, Heidari AA, Gandomi AH. Hunger games search: visions, conception, implementation, deep analysis, perspectives, and towards performance shifts. Expert Syst. Appl. 2021;177:114864. doi: 10.1016/j.eswa.2021.114864. [DOI] [Google Scholar]
  • 8.Tu J, Chen H, Wang M, et al. The colony predation algorithm. J. Bionic Eng. 2021;18:674–710. doi: 10.1007/s42235-021-0050-y. [DOI] [Google Scholar]
  • 9.Heidari AA, Mirjalili S, Faris H, Aljarah I, Mafarja M, Chen H. Harris hawks optimization: algorithm and applications. Future Gener Comput Syst. 2019;97:849–872. doi: 10.1016/j.future.2019.02.028. [DOI] [Google Scholar]
  • 10.Nguyen BH, Xue B, Zhang M. A survey on swarm intelligence approaches to feature selection in data mining. Swarm Evol. Comput. 2020;54:100663. doi: 10.1016/j.swevo.2020.100663. [DOI] [Google Scholar]
  • 11.Yang WA, Zhou Q, Tsui KL. Differential evolution-based feature selection and parameter optimisation for extreme learning machine in tool wear estimation. Int. J. Prod. Res. 2016;54:4703–4721. doi: 10.1080/00207543.2015.1111534. [DOI] [Google Scholar]
  • 12.Xie, X., Xu, K., Wang, X.: Cloud computing resource scheduling based on improved differential evolution ant colony algorithm. In: ACM International Conference Proceeding Series, pp. 171–177 (2019).
  • 13.Silva-Filho, A.G., Nunes, L.J.C., Lacerda, H.F.: Differential evolution to reduce energy consumption in three-level memory hierarchy. In: Proceedings of SBCCI 2015—28th Symposium on Integrated Circuits and Systems Design: CHIP in Bahia (2015).
  • 14.Krishna, G.J., Ravi, V.: Anomaly detection using modified differential evolution: an application to banking and insurance. In: Proceedings of the 11th International Conference on Soft Computing and Pattern Recognition (SoCPaR 2019). Advances in Intelligent Systems and Computing, p. 1182. Springer, Cham (2019).
  • 15.Nissen V., Propach. J.: On the robustness of population-based versus point-based optimization in the presence of noise. In: IEEE Transactions on Evolutionary Computation, vol. 2, no. 3, pp. 107–119 (1998).
  • 16.Roeva, O., Slavov, T., Fidanova, S.: Population-based vs. single point search meta-heuristics for a PID controller tuning. In: Handbook of Research on Novel Soft Computing Intelligent Algorithms: Theory and Practical Applications, pp. 200–233. IGI Global (2014).
  • 17.Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI’04 6th Symposium on Operating Systems Design and Implement, pp. 137--150 (2004).
  • 18.Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I. Apache Spark: a unified engine for big data processing. Commun. ACM. 2016;59(11):56–65. doi: 10.1145/2934664. [DOI] [Google Scholar]
  • 19.Das S, Suganthan PN. Differential evolution: a survey of the state-of-the-art. IEEE Trans. Evol. Comput. 2011;15(1):4–31. doi: 10.1109/TEVC.2010.2059031. [DOI] [Google Scholar]
  • 20.Chauhan N, Ravi V. Differential evolution and threshold accepting hybrid algorithm for unconstrained optimization. Int. J. Bio-Inspired Comput. 2010;2:169–182. doi: 10.1504/IJBIC.2010.033086. [DOI] [Google Scholar]
  • 21.Krishna, G.J., Ravi, V.: Feature subset selection using adaptive differential evolution: an application to banking. In: ACM International Conference Proceeding Series, pp. 157–163 (2019).
  • 22.Rivera-Lopez R, Mezura-Montes E, Canul-Reich J, Cruz-Chávez MA. A permutational-based differential evolution algorithm for feature subset selection. Pattern Recognit. Lett. 2020;133:86–93. doi: 10.1016/j.patrec.2020.02.021. [DOI] [Google Scholar]
  • 23.Price K, Storn R. Differential evolution—a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. 1997;11:341–359. doi: 10.1023/A:1008202821328. [DOI] [Google Scholar]
  • 24.Zhang Y, Gong DW, Gao XZ, Tian T, Sun XY. Binary differential evolution with self-learning for multi-objective feature selection. Inf. Sci. (NY) 2020;507:67–85. doi: 10.1016/j.ins.2019.08.040. [DOI] [Google Scholar]
  • 25.Vivekanandan T, Iyengar NCSN. Optimal feature selection using a modified differential evolution algorithm and its effectiveness for prediction of heart disease. Comput. Biol. Med. 2017;90:125–136. doi: 10.1016/j.compbiomed.2017.09.011. [DOI] [PubMed] [Google Scholar]
  • 26.Samuel OW, Asogbon GM, Sangaiah AK, Fang P, Li G. An integrated decision support system based on ANN and Fuzzy_AHP for heart failure risk prediction. Expert Syst. Appl. 2017;68:163–172. doi: 10.1016/j.eswa.2016.10.020. [DOI] [Google Scholar]
  • 27.Nayak, S.K., Rout, P.K., Jagadev, A.K., Swarnkar, T.: Elitism based multi-objective differential evolution for feature selection: a filter approach with an efficient redundancy measure. In: Journal of King Saud University—Computer and Information Sciences, vol. 32, pp. 174–187 (2020).
  • 28.Mlakar U, Fister I, Brest J, Potočnik B. Multi-objective differential evolution for feature selection in facial expression recognition systems. Expert Syst. Appl. 2017;89:129–137. doi: 10.1016/j.eswa.2017.07.037. [DOI] [Google Scholar]
  • 29.Khushaba RN, Al-Ani A, Al-Jumaily A. Feature subset selection using differential evolution and a statistical repair mechanism. Expert Syst. Appl. 2011;38:11515–11526. doi: 10.1016/j.eswa.2011.03.028. [DOI] [Google Scholar]
  • 30.Hancer E, Xue B, Zhang M. Differential evolution for filter feature selection based on information theory and feature ranking. Knowl. Syst. 2018;140:103–119. doi: 10.1016/j.knosys.2017.10.028. [DOI] [Google Scholar]
  • 31.Hancer E. A new multi-objective differential evolution approach for simultaneous clustering and feature selection. Eng. Appl. Artif. Intell. 2020;87:103307. doi: 10.1016/j.engappai.2019.103307. [DOI] [Google Scholar]
  • 32.Ghosh A, Datta A, Ghosh S. Self-adaptive differential evolution for feature selection in hyperspectral image data. Appl. Soft Comput. J. 2013;13:1969–1977. doi: 10.1016/j.asoc.2012.11.042. [DOI] [Google Scholar]
  • 33.Bhadra T, Bandyopadhyay S. Unsupervised feature selection using an improved version of differential evolution. Expert Syst. Appl. 2015;42:4042–4053. doi: 10.1016/j.eswa.2014.12.010. [DOI] [Google Scholar]
  • 34.Baig MZ, Aslam N, Shum HPH, Zhang L. Differential evolution algorithm as a tool for optimal feature subset selection in motor imagery EEG. Expert Syst. Appl. 2017;90:184–195. doi: 10.1016/j.eswa.2017.07.033. [DOI] [Google Scholar]
  • 35.Almasoudy FH, Al-Yaseen WL, Idrees AK. Differential evolution wrapper feature selection for intrusion detection system. Procedia Comput. Sci. 2020;167:1230–1239. doi: 10.1016/j.procs.2020.03.438. [DOI] [Google Scholar]
  • 36.ZorarpacI E, Ozel SA. A hybrid approach of differential evolution and artificial bee colony for feature selection. Expert Syst. Appl. 2016;62:91–103. doi: 10.1016/j.eswa.2016.06.004. [DOI] [Google Scholar]
  • 37.Srikrishna, V., Ghosh, R., Ravi, V., Deb, K.: Elitist quantum-inspired differential evolution based wrapper for feature subset selection. In: Lect. Notes Comput. Sci. (Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 9426, pp. 113–124 (2015).
  • 38.Zhao XS, Bao LL, Ning Q, Ji JC, Zhao XW. An improved binary differential evolution algorithm for feature selection in molecular signatures. Mol. Inform. 2018;37:1–15. doi: 10.1002/minf.201700081. [DOI] [PubMed] [Google Scholar]
  • 39.Hancer E. Fuzzy kernel feature selection with multi-objective differential evolution algorithm. Conn. Sci. 2019;3:323–341. doi: 10.1080/09540091.2019.1639624. [DOI] [Google Scholar]
  • 40.Li J, Ding L, Li B. Differential evolution-based parameters optimisation and feature selection for support vector machine. Int. J. Comput. Sci. Eng. 2016;13:355–363. [Google Scholar]
  • 41.Wang, J., Xue, B., Gao, X., Zhang, M: A differential evolution approach to feature selection and instance selection. In: Proceedings of the 14th Pacific RIM International Conference on Trends in Artificial Intelligence (PRICAI'16). Gewerbestrassse 11 CH-6330, Cham (ZG), CHE, pp. 588–602. Springer (2016).
  • 42.Carrasco J, García S, Rueda MM, Das S, Herrera F. Recent trends in the use of statistical tests for comparing swarm and evolutionary computing algorithms: practical guidelines and a critical review. Swarm Evol. Comput. 2020;54:100665. doi: 10.1016/j.swevo.2020.100665. [DOI] [Google Scholar]
  • 43.Cao B, Fan S, Zhao J, Yang P, Muhammad K, Tanveer M. Quantum-enhanced multiobjective large-scale optimization via parallelism. Swarm Evol. Comput. 2020;57:100697. doi: 10.1016/j.swevo.2020.100697. [DOI] [Google Scholar]
  • 44.BenSaid, F., Alimi, A.M.: Moanofs: multi-objective automated negotiation based online feature selection system for big data classification (2018). arXiv:1810.04903.
  • 45.Khan A, Baig AR. Multi-objective feature subset selection using non-dominated sorting genetic algorithm. J. Appl. Res. Technol. 2015;13:145–159. doi: 10.1016/S1665-6423(15)30013-4. [DOI] [Google Scholar]
  • 46.Khammassi C, Krichen S. A NSGA2-LR wrapper approach for feature selection in network intrusion detection. Comput. Netw. 2020;172:107183. doi: 10.1016/j.comnet.2020.107183. [DOI] [Google Scholar]
  • 47.Chaudhuri A, Sahu TP. Feature selection using Binary Crow search Algorithm with time varying flight length. Expert Syst. Appl. 2021;168:114288. doi: 10.1016/j.eswa.2020.114288. [DOI] [Google Scholar]
  • 48.Too, J., Mirjalili, S.: A hyper learning binary dragonfly algorithm for feature selection: a COVID-19 case study. Knowl. Syst. 212 (2021).
  • 49.Hu, J., Chen, H., Heidari, A.A., Wang, M., Zhang, X., Chen, Y., Pan, Z.: Orthogonal learning covariance matrix for defects of grey wolf optimizer: insights, balance, diversity, and feature selection. Knowl. Syst. 213 (2021).
  • 50.Hu J, Heidari AA, Zhang L, Xue X, Gui W, Chen H, Pan Z. Chaotic diffusion-limited aggregation enhanced grey wolf optimizer: insights, analysis, binarization, and feature selection. Int. J. Intell. Syst. 2021;37(8):4864–4927. doi: 10.1002/int.22744. [DOI] [Google Scholar]
  • 51.Too, J., Liang, G., Chen, H.: Memory-based Harris hawk optimization with learning agents: a feature selection approach. Eng. Comput. 1–22 (2021).
  • 52.Zhang Y, Liu R, Wang X, Chen H, Li C. Boosted binary Harris Hawks optimizer and feature selection. Eng. Comput. 2021;37:3741–3770. doi: 10.1007/s00366-020-01028-5. [DOI] [Google Scholar]
  • 53.Hammami M, Bechikh S, Hung CC, Ben Said L. A multi-objective hybrid filter-wrapper evolutionary approach for feature selection. Memetic Comput. 2019;11:193–208. doi: 10.1007/s12293-018-0269-2. [DOI] [Google Scholar]
  • 54.Harada, T., Kaidan, M., Thawonmas, R.: Comparison of synchronous and asynchronous parallelization of extreme surrogate-assisted multi-objective evolutionary algorithm. Nat. Comput. (2020).
  • 55.Peralta, D., Del Río, S., Ramírez-Gallego, S., Triguero, I., Benitez, J.M., Herrera, F.: Evolutionary feature selection for big data classification: a MapReduce approach. Math. Probl. Eng. (2015).
  • 56.Rong M, Gong D, Gao X. Feature selection and its use in big data: challenges. Methods Trends IEEE Access. 2019;7:19709–19725. doi: 10.1109/ACCESS.2019.2894366. [DOI] [Google Scholar]
  • 57.Zhou, C.: Fast parallelization of differential evolution algorithm Using MapReduce. In: Proceedings of 12th Annual Genetic and Evolutionary Computation Conference (GECCO ’10), pp. 1113–1114 (2010).
  • 58.Teijeiro, D., Pardo, X.C., González, P., Banga, J.R., Doallo, R.: Implementing parallel differential evolution on spark. In: Squillero, G., Burelli, P. (eds.) Applications of Evolutionary Computation (EvoApplications 2016). Lecture Notes in Computer Science, p. 9598. Springer, Cham (2016).
  • 59.Cho, P.P.W., Nyunt, T.T.S., Aung, T.T.: Differential evolution for large-scale clustering. In: Proceedings of 2019 9th International Workshop on Computer Science and Engineering (WCSE 2019 SPRING), pp. 58–62 (2019).
  • 60.Al-Sawwa J, Ludwig SA. Performance evaluation of a cost-sensitive differential evolution classifier using spark—imbalanced binary classification. J. Comput. Sci. 2020;40:101065. doi: 10.1016/j.jocs.2019.101065. [DOI] [Google Scholar]
  • 61.Chen Z, Jiang X, Li J, Li S, Wang L. PDECO: parallel differential evolution for clusters optimization. J. Comput. Chem. 2013;34:1046–1059. doi: 10.1002/jcc.23235. [DOI] [PubMed] [Google Scholar]
  • 62.Adhianto L, Banerjee S, Fagan M, Krentel M, Marin G, Mellor-Crummey J, Tallent NR. HPCTOOLKIT: tools for performance analysis of optimized parallel programs. Concurr. Comput. Pract. Exp. 2010;22:685–701. [Google Scholar]
  • 63.Deng C, Tan X, Dong X, Tan Y. A parallel version of differential evolution based on resilient distributed datasets model. Commun. Comput. Inf. Sci. 2015;562:84–93. [Google Scholar]
  • 64.He Z, Peng H, Chen J, Deng C, Wu Z. A Spark-based differential evolution with grouping topology model for large-scale global optimization. Clust. Comput. 2021;24:515–535. doi: 10.1007/s10586-020-03124-z. [DOI] [Google Scholar]
  • 65.Wong, T.H., Qin, A.K., Wang, S., Shi, Y.: cuSaDE: a CUDA-based parallel self-adaptive differential evolution algorithm. In: Proceedings of the 18th Asia Pacific Symposium on Intelligent and Evolutionary Systems, vol. 2, pp. 375–388 (2015).
  • 66.Cao B, Zhao J, Lv Z, Liu X. A distributed parallel cooperative coevolutionary multiobjective evolutionary algorithm for large-scale optimization. IEEE Trans. Ind. Inf. 2017;13:2030–2038. doi: 10.1109/TII.2017.2676000. [DOI] [Google Scholar]
  • 67.Ge Y, Yu W, Lin Y, Gong Y, Zhan Z, Chen W, Zhang J. Distributed differential evolution based on adaptive mergence and split for large-scale optimization. IEEE Trans. Cybern. 2018;48:2166–2180. doi: 10.1109/TCYB.2017.2728725. [DOI] [PubMed] [Google Scholar]
  • 68.De Falco, I., Scafuri, U., Tarantino, E., Della Cioppa, A.: A distributed differential evolution approach for mapping in a grid environment. In: 15th EUROMICRO international conference on parallel, distributed and network-based processing (PDP'07), pp. 442–449 (2007). 10.1109/PDP.2007.6.
  • 69.Veronese, L.P., Krohling, R.A.: Differential evolution algorithm on the GPU with C-CUDA. In: IEEE Congress on Evolutionary Computation, pp. 1–7 (2010). 10.1109/CEC.2010.5586219.
  • 70.Glotic A, Glotic A, Kitak P, Pihler J, Ticar I. Parallel self-adaptive differential evolution algorithm for solving short-term hydro scheduling problem. IEEE Trans. Power Syst. 2014;29:2347–2358. doi: 10.1109/TPWRS.2014.2302033. [DOI] [Google Scholar]
  • 71.Daoudi, M., Hamena, S., Benmounah, Z., Batouche, M.: Parallel differential evolution clustering algorithm based on MapReduce. In: 2014 6th International Conference of Soft Computing and Pattern Recognition (SoCPaR), pp. 337–341 (2014).
  • 72.Krömer, P., Platoš, J., Snášel, V.: Scalable differential evolution for many-core and clusters in unified parallel C. In: 2013 IEEE International Conference on Cybernetics (CYBCO), pp. 180–185 (2013).
  • 73.Thomert, D.B., Bhattacharya, A. K., Caron, E., Gadireddy, K., Lefevre, L.: Parallel differential evolution approach for cloud workflow placements under simultaneous optimization of multiple objectives. In: 2016 IEEE Congress on Evolutionary Computation (CEC), pp. 822–829 (2016).
  • 74.Abbass HA, Sarker R. The Pareto differential evolution algorithm. Int. J. Artif. Intell. Tools. 2002;11(4):531–552. doi: 10.1142/S0218213002001039. [DOI] [Google Scholar]
  • 75.Ali MM, Tom A. Population set based global optimization algorithms: some modifications and numerical studies. Comput. Oper. Res. 2004;31(10):1703–1725. doi: 10.1016/S0305-0548(03)00116-3. [DOI] [Google Scholar]
  • 76.Kohavi, R., John, G.H.: Wrappers for feature subset selection. In: Lecture Notes Computer Science (Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 97, pp. 273–324 (1997).
  • 77.Dueck G, Scheuer T. Threshold accepting: a general purpose optimization algorithm appearing superior to simulated annealing. J. Comput. Phys. 1990;90:161–175. doi: 10.1016/0021-9991(90)90201-B. [DOI] [Google Scholar]
  • 78.Ravi V, Zimmermann HJ. Fuzzy rule based classification with FeatureSelector and modified threshold accepting. Eur. J. Oper. Res. 2000;123:16–28. doi: 10.1016/S0377-2217(99)00090-9. [DOI] [Google Scholar]
  • 79.Ravi V, Reddy PJ, Zimmermann HJ. Fuzzy rule base generation for classification and its minimization via modified threshold accepting. Fuzzy Sets Syst. 2001;120:271–279. doi: 10.1016/S0165-0114(99)00100-1. [DOI] [Google Scholar]
  • 80.Ravi V, Zimmermann H-J. A neural network and fuzzy rule base hybrid for pattern classification. Soft Comput. 2001;5:152–159. doi: 10.1007/s005000000071. [DOI] [Google Scholar]
  • 81.Ravi V, Pramodh C. Threshold accepting trained principal component neural network and feature subset selection: application to bankruptcy prediction in banks. Appl. Soft Comput. J. 2008;8:1539–1548. doi: 10.1016/j.asoc.2007.12.003. [DOI] [Google Scholar]
  • 82.Tvrdík J. Adaptation in differential evolution: a numerical comparison. Appl. Soft Comput. 2009;9(3):1149–1155. doi: 10.1016/j.asoc.2009.02.010. [DOI] [Google Scholar]
  • 83.Zielinski, K., Peters, D., Laur, R.: Run time analysis regarding stopping criteria for differential evolution and particle swarm optimization. In: Proceedings of 1st International Conference on Experiments/Process/System Modelling/Simulation/Optimization (2005).
  • 84.Kaggle Open source Datasets. https://www.kaggle.com/c/microsoft-malware-prediction/data. Accessed 27 Mar 2021
  • 85.IEEE Dataport. https://ieee-dataport.org/. Accessed 27 Mar 2021
  • 86.OpenML Open Source Datasets. https://www.openml.org/home. Accessed 27 Mar 2021
  • 87.LIBSVM repository for the binary class high dimensional datasets. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Accessed 27 Mar 2021
  • 88.Apache Spark. https://spark.apache.org/. Accessed 26 Jan 2021
  • 89.Peralta, D., Río, S.D., Gallego, S.R., Triguero, I., Benitez, J.M., Herrera, F.: Evolutionary feature selection for big data classification: a mapreduce approach. Math. Probl. Eng. (2015)
  • 90.Pes B. Learning from high-dimensional biomedical datasets: the issue of class imbalance. IEEE Access. 2020;8:13527–13540. doi: 10.1109/ACCESS.2020.2966296. [DOI] [Google Scholar]
  • 91.Hooten, S., Vadlamani, S.K., Beausoleil, R.G., Vaerenbergh, T.V.: Generative neural network based non-convex optimization using policy gradients with an application to electromagnetic design. In: NeurIPS 2021 AI for Science Workshop (2021).
  • 92.Al-Ani A, Alsukker A, Khushaba RN. Feature subset selection using differential evolution and a wheel based search strategy. Swarm Evol. Comput. 2013;9:15–26. doi: 10.1016/j.swevo.2012.09.003. [DOI] [Google Scholar]
  • 93.Liu, X.F., Zhan, Z.H., Lin, J.H., Zhang, J.: Parallel differential evolution based on distributed cloud computing resources for power electronic circuit optimization. In: GECCO 2016 Companion—Proceedings of 2016 Genetic and Evolutionary Computation Conference, pp. 117–118 (2016).
