Computational Intelligence and Neuroscience
2021 Dec 20; 2021:5227377. doi: 10.1155/2021/5227377

A Two-Stage Method Based on Multiobjective Differential Evolution for Gene Selection

Shuangbao Song,1 Xingqian Chen,2 Zheng Tang,2 and Yuki Todo3
PMCID: PMC8712129  PMID: 34966420

Abstract

Microarray gene expression data provide a promising way to diagnose disease and classify cancer. However, in bioinformatics, the gene selection problem, i.e., how to select the most informative genes from thousands of genes, remains challenging. This problem is a specific feature selection problem with high-dimensional features and small sample sizes. In this paper, a two-stage method combining a filter feature selection method and a wrapper feature selection method is proposed to solve the gene selection problem. In contrast to common methods, the proposed method models the gene selection problem as a multiobjective optimization problem. Both stages employ the same multiobjective differential evolution (MODE) as the search strategy but incorporate different objective functions. The three objective functions of the filter method are mainly based on mutual information. The two objective functions of the wrapper method are the number of selected features and the classification error of a naive Bayes (NB) classifier. Finally, the performance of the proposed method is tested and analyzed on six benchmark gene expression datasets. The experimental results verify that the proposed method offers a novel and effective way to solve the gene selection problem by applying a multiobjective optimization algorithm.

1. Introduction

Gene selection is an important issue in bioinformatics [1]. A gene is the basic functional unit of heredity. Gene expression is the process in which the instructions encoded in genes are used to synthesize gene products [2] such as proteins. The gene products then dictate cellular function. Therefore, abnormal gene expression is usually correlated with different types of disease, such as cancer [3]. Many diseases correspond to unique gene expression profiles that can be revealed by DNA microarray technology [4]. Typically, microarray data corresponding to a certain disease consist of a set of biological samples. For each sample, the expression levels of thousands of genes can be measured. As a result, microarray data are usually in the form of a matrix. However, it is not an easy task for researchers to check which genes are responsible for a given disease because of the high dimensionality of microarray data. Thus, determining how to select the most significant genes effectively for further analysis becomes urgent and vital.

The gene selection problem is intrinsically a feature selection problem with high-dimensional features and small sample sizes. Since gene expression data can be labeled (whether the sample is malignant or not), partially labeled, or unlabeled, three categories of methods are applied to solve the gene selection problem in the literature [5]: supervised, semisupervised, and unsupervised feature selection methods. Because labeled data are the most common type of data in practice, supervised feature selection methods are the most widely used and most practical for the gene selection problem. In the remainder of this paper, feature (gene) selection refers to supervised feature (gene) selection.

In the field of machine learning, feature selection, also known as attribute selection, is defined as the process by which the best subset of relevant features is selected from a large set of features [6], and the performance of classifiers is improved by the optimal feature subset when compared with the utilization of all features. However, it is difficult to execute feature selection by retaining relevant features and removing irrelevant and redundant features. There are two main obstacles in feature selection. First, the size of the search space is quite large. Given a dataset with n features, there are 2^n candidate subsets (solutions) [7]. Moreover, as big data continues to grow [8], n becomes increasingly large. Thus, in most cases, an exhaustive search for feature selection is impossible. Second, the feature interaction problem makes feature selection complex. For example, a feature as a single entity may be irrelevant to the target, but when combined with another feature, it may become significantly relevant. In fact, there are many interaction patterns among features. As a result, "the m best features are not the best m features" [9]. Therefore, the performance of a feature selection method depends on two key factors: (1) effective evaluation criteria to measure the quality of a feature subset and (2) an efficient search strategy to explore the large search space [10].

Regarding evaluation criteria, feature selection methods can be roughly classified into two categories: filter methods and wrapper methods [10]. The main difference between them is that wrapper methods use a classifier to evaluate a feature subset, while filter methods do not. Filter methods are independent of any classifier and focus on the intrinsic characteristics of the dataset. The common metrics used in filter methods are correlation [11] and mutual information [12]. Filter methods that examine each feature separately are considered univariate; they ignore the feature interaction problem and lead to redundancy in the selected feature subsets. Thus, multivariate filter methods such as minimum redundancy-maximum relevance (mRMR) [13] are considered better choices. Wrapper methods select discriminative feature subsets to improve the classification performance. Most popular classifiers can be incorporated into wrapper methods, e.g., the naive Bayes (NB), K-nearest neighbors, support vector machine, and neural network [14]. Filter methods are generally faster but less accurate, whereas wrapper methods are more accurate but computationally expensive because of the repeated training of the involved classifiers. Thus, combining them as a hybrid method is an alternative and promising approach for feature selection problems, especially for the gene selection problem [15].

There are two main categories of search strategies applied in feature selection. The first category is sequential search. Sequential forward selection and sequential backward selection [16] are considered conventional methods but suffer from the "nesting effect" [17] because only one feature is added or removed at a time. The second category is randomized search, which starts by randomly selecting some features and then executes a heuristic search. It has been verified that methods based on randomized search are better than methods based on sequential search because they can escape local optima more easily [10]. In particular, applying evolutionary computation (EC) techniques such as genetic algorithms (GAs) [18], particle swarm optimization (PSO) [19, 20], and differential evolution (DE) [21, 22] to feature selection has attracted the attention of researchers in recent years.

Regarding the gene selection problem, numerous methods based on EC techniques have been proposed in the literature [5]. These pertinent experiments have shown that EC techniques can achieve very competitive performance compared with traditional methods. For example, Mohamad et al. proposed an improved binary PSO as a wrapper method and obtained positive results [23]. Shreem et al. proposed a Markov blanket-embedded harmony search algorithm as a wrapper method to solve the gene selection problem [24], and Elyasigomari et al. proposed a filter method based on the cuckoo optimization algorithm and shuffling [25], where a clustering technique was involved. In addition, a modified artificial bee colony algorithm was applied to solve the gene selection problem in the work of Alshamlan et al. [26], where the search method was enhanced by combining two EC algorithms. Note that most current methods based on EC techniques treat the gene selection problem as a single-objective optimization problem. On the other hand, recent work [22, 27] suggests that multiobjective optimization techniques are alternatives for solving the gene selection problem. This is because the single objective to multiobjective transformation can lead to improvements in the search strategy and evaluation criteria; thus, more competitive results can be obtained. However, to the best of the authors' knowledge, employing an effective multiobjective differential evolution (MODE) approach to address the gene selection problem has not yet been well explored.

Thus, in this study, a two-stage method based on multiobjective optimization is proposed. The first stage is a multivariate filter method in which three objective functions referring to mutual information are incorporated. The second stage is a conventional wrapper method involving the NB classifier. The number of selected features and the classification error are incorporated as the two objective functions in this stage. In addition, both stages employ the same search strategy: a well-designed MODE. Finally, six benchmark datasets are used to test and analyze the performance of the proposed method. The experimental results are statistically compared with those of five widely used feature selection methods.

The remainder of the paper is organized as follows. Section 2 introduces three important concepts: multiobjective optimization, differential evolution, and mutual information. Section 3 describes the proposed method. Section 4 provides the experimental results and analysis. Finally, Section 5 draws the conclusion of this paper.

2. Materials

2.1. Multiobjective Optimization Problem

Many real-world problems involve multiple conflicting objectives that should be optimized simultaneously [28]. A multiobjective optimization problem (MOOP), stated here as a minimization problem, involves more than one objective function to be optimized and can be written mathematically as follows:

\[
\begin{aligned}
\text{minimize } \quad & f(x) = \left( f_1(x), f_2(x), \ldots, f_k(x) \right) \\
\text{s.t. } \quad & x = \left( x_1, x_2, \ldots, x_n \right) \in \Omega,
\end{aligned} \tag{1}
\]

where x is the n-dimensional decision vector and Ω is the decision space. f : Ω → R^k consists of k (k ≥ 2) real-valued objective functions f_1(x), f_2(x), …, f_k(x). In general, no single solution can optimize all the objective functions because of the conflicts among these objectives. Four important definitions relating to MOOPs are given as follows.

Definition 1 . —

(Pareto dominance). Let a = (a_1, a_2, …, a_n) and b = (b_1, b_2, …, b_n) be two decision vectors. a is said to dominate b, represented as a ≻ b, if

\[
\forall i \in \{1, 2, \ldots, k\}:\ f_i(a) \le f_i(b), \quad \text{and} \quad f(a) \ne f(b). \tag{2}
\]

Definition 2 . —

(Pareto optimal solution). For a given MOOP, a vector x ∈ Ω is called a Pareto optimal solution if

\[
\neg \exists\, x' \in \Omega:\ x' \succ x. \tag{3}
\]

Definition 3 . —

(Pareto optimal set). All Pareto optimal solutions compose the Pareto optimal set P, which can be described as follows:

\[
P = \left\{ x \in \Omega \mid \neg \exists\, x' \in \Omega:\ x' \succ x \right\}. \tag{4}
\]

Definition 4 . —

(Pareto front). The image of the Pareto optimal set is called the Pareto front PF, which is composed of objective vectors and is defined as follows:

\[
PF = \left\{ f(x) \mid x \in P \right\}. \tag{5}
\]

For a real-world MOOP, the Pareto optimal set P is usually unreachable or infinite. Therefore, the goal of an optimization method [29–31] is to obtain an approximation of P that is as convergent and diverse in the objective space as possible. In addition, an excellent approximation of P is crucial for a decision maker to select the final solutions.
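As an illustration, the dominance relation of Definition 1 can be checked in a few lines of code. This is a minimal sketch for minimization problems; the function name `dominates` is ours, not from the paper:

```python
def dominates(fa, fb):
    """Return True if objective vector fa Pareto-dominates fb (minimization):
    fa is no worse on every objective and strictly better on at least one."""
    return all(a <= b for a, b in zip(fa, fb)) and any(a < b for a, b in zip(fa, fb))

# (1, 2) dominates (1, 3); (1, 3) and (2, 2) are mutually nondominated.
assert dominates((1.0, 2.0), (1.0, 3.0))
assert not dominates((1.0, 3.0), (2.0, 2.0))
assert not dominates((2.0, 2.0), (1.0, 3.0))
```

The two nondominated vectors in the example illustrate why MOOPs generally admit a whole set of Pareto optimal solutions rather than a single optimum.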

2.2. Standard Differential Evolution

DE is a simple but powerful stochastic optimization algorithm that was first proposed by Storn and Price in the 1990s [32]. Recent research has further improved its efficiency in solving many real-world problems [33–35]. The characteristic of DE is using the difference between two candidate solutions to generate a new candidate solution. The algorithm is population based and works through a cycle of computational steps similar to those employed in common evolutionary algorithms. The flowchart of standard DE is shown in Figure 1, and it can be separated into four stages: initialization, mutation, crossover, and selection.

Figure 1. The flowchart of standard differential evolution.

DE optimizes a problem by maintaining a population of candidate solutions and evolving them with specific formulas within the search space. An individual, also called a genome, is represented as a vector forming a candidate solution for a specific problem as follows:

\[
X^{(i)}(t) = \left( x_1^{(i)}(t), x_2^{(i)}(t), \ldots, x_d^{(i)}(t) \right), \tag{6}
\]

where d is the dimension of the search space and X(i)(t) represents the ith individual in the NP-sized population at generation t.

Initially, all individuals X(i)(t), also called target vectors, are randomly initialized by restricting them in a problem-specific range. Then, standard DE starts its main loop. Every individual evolves in the following steps. First, for each individual X(i)(t), the differential mutation operator works and generates a donor vector V(i)(t)=(v1(i)(t), v2(i)(t),…, vd(i)(t)) as follows:

\[
V^{(i)}(t) = X^{(r_1)}(t) + F \left( X^{(r_2)}(t) - X^{(r_3)}(t) \right), \tag{7}
\]

where F is the mutation scale factor that controls the scaled difference and r1, r2, and r3 are three different integers, which are randomly chosen from the range [1, NP]. Note that the three integers must be different from the current index i.

Next, the trial vector U(i)(t)=(u1(i)(t), u2(i)(t),…, ud(i)(t)) is generated by crossing over the target vector X(i)(t) and the donor vector V(i)(t). A typical crossover operation employed in standard DE exchanges components between X(i)(t) and V(i)(t) as follows:

\[
u_j^{(i)}(t) =
\begin{cases}
v_j^{(i)}(t), & \text{if } r \le Cr \text{ or } j = j_r, \\
x_j^{(i)}(t), & \text{otherwise},
\end{cases} \tag{8}
\]

where uj(i)(t) is the jth element of U(i)(t) and r is a uniformly distributed random real number in [0,1]. Cr is the crossover rate that controls the expected fraction of elements of U(i)(t) inherited from V(i)(t). jr, which ensures that U(i)(t) obtains at least one element from V(i)(t), is a random integer in [1, d].

Then, the selection process is executed to update all individuals as follows:

\[
X^{(i)}(t+1) =
\begin{cases}
U^{(i)}(t), & \text{if } f\!\left( U^{(i)}(t) \right) \le f\!\left( X^{(i)}(t) \right), \\
X^{(i)}(t), & \text{otherwise},
\end{cases} \tag{9}
\]

where f(·) is the single-objective function of DE.

Finally, the DE terminates when the stopping criterion is met.
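The four stages above can be sketched as a compact DE/rand/1/bin loop. This is an illustrative toy implementation minimizing a sphere function, not the paper's code; the function name, population size, and parameter values are all our own choices:

```python
import random

def de_minimize(f, d, bounds, np_=20, F=0.5, Cr=0.9, iters=300, seed=1):
    """Minimal standard DE (rand/1/bin): initialization, mutation (eq. 7),
    binomial crossover (eq. 8), and greedy selection (eq. 9)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Initialization: NP target vectors restricted to a problem-specific range.
    pop = [[rng.uniform(lo, hi) for _ in range(d)] for _ in range(np_)]
    fit = [f(x) for x in pop]
    for _ in range(iters):
        for i in range(np_):
            # Mutation: three mutually distinct indices, all different from i.
            r1, r2, r3 = rng.sample([j for j in range(np_) if j != i], 3)
            donor = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j]) for j in range(d)]
            # Crossover: inherit from the donor with probability Cr; index jr
            # guarantees at least one donor component.
            jr = rng.randrange(d)
            trial = [donor[j] if (rng.random() <= Cr or j == jr) else pop[i][j]
                     for j in range(d)]
            # Selection: the trial replaces the target only if it is no worse.
            ft = f(trial)
            if ft <= fit[i]:
                pop[i], fit[i] = trial, ft
    best = min(range(np_), key=fit.__getitem__)
    return pop[best], fit[best]

best_x, best_f = de_minimize(lambda x: sum(v * v for v in x), d=5, bounds=(-5, 5))
assert best_f < 1e-2  # converges to near zero on the sphere function
```

The greedy one-to-one selection in the inner loop is what the MODE of Section 3 later replaces, since a single scalar comparison no longer suffices with multiple objectives.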

2.3. Mutual Information

In information theory [36], the mutual information of two variables quantifies the mutual dependence between them. This metric is a powerful measure of the correlation between two variables and is robust to sampling noise [37]. Given two continuous variables x and y, their mutual information is defined as follows:

\[
I(x; y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy, \tag{10}
\]

where p(x) and p(y) are the probability density functions of x and y, respectively, and p(x, y) is the joint probability density function. Therefore, if two variables are strictly independent, their mutual information is equal to 0. Similarly, for two discrete variables x and y, mutual information has the following form:

\[
I(x; y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}. \tag{11}
\]

Given two variables x and y, the range of the mutual information I(x; y) between them is [0, min{H(x), H(y)}], where H(·) is the function to calculate the entropy of a variable.

Although mutual information is considered an excellent indicator for quantifying the dependence between two variables, its calculation is not easy because estimating probability density functions is a complex task. If both variables are discrete, the calculation of mutual information is straightforward: the samples in the different categories are counted to build the joint and marginal probability tables. However, if at least one of the two variables is continuous, the calculation becomes difficult. In this work, we use entropy estimation based on the K-nearest neighbor distance [38] to calculate mutual information.
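For the straightforward discrete case, equation (11) can be computed directly from count tables. This sketch (in nats; the function name is ours) illustrates the two limiting cases, identical and independent variables; it does not implement the K-nearest-neighbor estimator [38] used for continuous data:

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Discrete mutual information I(x; y) in nats, via equation (11),
    using empirical joint and marginal probability tables."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))          # joint counts
    px, py = Counter(xs), Counter(ys)   # marginal counts
    return sum((c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

# Identical fair binary variables: I(x; x) = H(x) = log 2.
xs = [0, 1, 0, 1]
assert abs(mutual_information(xs, xs) - log(2)) < 1e-12
# Independent variables: I(x; y) = 0.
assert abs(mutual_information([0, 0, 1, 1], [0, 1, 0, 1])) < 1e-12
```

The second assertion shows why zero mutual information certifies independence in the discrete case, matching the range [0, min{H(x), H(y)}] stated above.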

3. Methodology

In this section, the proposed two-stage method based on MODE is described. Figure 2 illustrates the flowchart of the proposed method, which consists of two stages: a filter stage and a wrapper stage. In the latter stage, a novel wrapper method based on MODE is proposed. In addition, two single-objective wrapper methods based on DE are proposed in this stage. These two single-objective methods serve as the baseline to test the performance of the MODE-based wrapper method and help us investigate the following: (1) whether it is necessary to consider the number of selected features in the wrapper method and (2) whether the method based on multiobjective optimization outperforms the methods based on single-objective optimization.

Figure 2. The flowchart of the proposed two-stage method based on MODE for gene selection.

3.1. Multiobjective Differential Evolution

Due to the effectiveness of DE for solving single-objective optimization problems, extending DE to solve MOOPs has attracted the interest of researchers in the literature [34]. Two important issues in extending DE into MODE need to be overcome. The first issue is how to order two candidate solutions. The solutions are straightforward to order when one solution dominates the other solution. However, if two candidate solutions do not dominate each other, an additional strategy to assign the complete order must be provided. Second, an effective scheme of maintaining a set of nondominated solutions during the optimization process is necessary. In contrast to single-objective optimization problems where only one global optimal solution is generated, the goal of MOOPs is to obtain a set of nondominated solutions. Therefore, the convergence and diversity of the set of nondominated solutions should be ensured. A widely used method is to adopt an external archive to couple with the current population [30].

The proposed MODE follows the framework of the standard DE, which is shown in Figure 1. The external archive stores the nondominated solutions that interact with the current population. In addition, the mutation operator and the selection operator, which are different from those of the standard DE algorithm, are modified. The key components of the proposed MODE are described below.

3.1.1. External Archive

Adopting an external archive to store nondominated solutions is a common and effective method in numerous multiobjective evolutionary algorithms [39, 40]. Similarly, an archive Arc with limited size Na is maintained in the optimization process of the proposed MODE. A solution s is added to Arc if any one of the following criteria is met: (1) Arc is empty; (2) Arc is not full, and s is not dominated by any solution in Arc; (3) s dominates at least one solution in Arc (in this case, the solutions dominated by s are removed from Arc); (4) Arc is full, and s is not dominated by any solution in Arc. In this last condition, s is first added to Arc, and a density estimation operation is executed to assign each solution a crowding distance value (see Section 3.1.2); then, the solution in the most crowded region is removed from Arc.
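The archive update rules can be sketched as follows. This is a simplified illustration: for the full-archive case it evicts by distance to the nearest neighbor in objective space, a crude stand-in for the crowding-distance procedure of Section 3.1.2, and all names are ours:

```python
import math

def dominates(fa, fb):  # Pareto dominance for minimization
    return all(a <= b for a, b in zip(fa, fb)) and any(a < b for a, b in zip(fa, fb))

def nearest_gap(f, others):
    """Distance to the nearest neighbor in objective space (density proxy)."""
    return min(math.dist(f, g) for g in others) if others else math.inf

def archive_add(arc, s, na):
    """Insert objective vector s into archive arc of capacity na (rules 1-4)."""
    if any(dominates(a, s) for a in arc):           # s is dominated: reject
        return arc
    arc = [a for a in arc if not dominates(s, a)]   # rule 3: drop dominated members
    arc.append(s)                                    # rules 1/2: plain insertion
    if len(arc) > na:                                # rule 4: evict the most crowded
        crowded = min(arc, key=lambda a: nearest_gap(a, [b for b in arc if b is not a]))
        arc.remove(crowded)
    return arc

arc = []
for s in [(3.0, 1.0), (1.0, 3.0), (2.0, 2.0), (0.5, 0.5)]:
    arc = archive_add(arc, s, na=3)
assert arc == [(0.5, 0.5)]  # (0.5, 0.5) dominates the other three
```

The final assertion exercises rule 3: once a dominating solution arrives, every dominated archive member is discarded at once.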

The archive Arc interacts with the current population in two aspects. First, the equation for generating a donor vector V(i)(t) (see equation (7)) is modified by

\[
V^{(i)}(t) = X^{(r_{arc})}(t) + F \left( X^{(r_2)}(t) - X^{(r_3)}(t) \right), \tag{12}
\]

where X(rarc)(t) is a solution that is randomly selected from the external archive Arc rather than the current population. This handling method is inspired by the standard mutation strategy in DE/best/1 [41]. X(rarc)(t) can be regarded as one of the best solutions that are stored in archive Arc. Second, the selection operator (see equation (9)) of MODE is modified, which is illustrated in Algorithm 1, and the interaction between the current population and archive Arc will be enhanced. Since the updating scheme of archive Arc is based on crowded information, archive Arc will be updated in a timely manner at each iteration, and the convergence and diversity of these nondominated solutions in Arc can be ensured.

Algorithm 1. The selection process of the proposed MODE.

3.1.2. Density Estimation

Many density estimation methods have been proposed in the literature [29, 30]. In our proposed method, a parameter-independent method called the crowding distance is used to assist Pareto dominance in assigning the complete order. The basic idea is that the degree of crowding of a solution in objective space is quantified by the distance between its neighbors. For a given solution s and an archive Arc, the crowding distance of s can be calculated by Algorithm 2. This method is similar to the method used in the nondominated sorting genetic algorithm-II (NSGA-II) [29], and the crowding distance of a solution is considered the perimeter of the cuboid formed by its neighbors.

Algorithm 2. Calculating the crowding distance of a solution.
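An NSGA-II-style crowding distance can be sketched as below. This is a generic illustration of the metric (boundary solutions get infinite distance; interior solutions accumulate the normalized side lengths of the cuboid formed by their neighbors), not a transcription of Algorithm 2:

```python
def crowding_distances(points):
    """NSGA-II-style crowding distance for a list of objective vectors."""
    n, k = len(points), len(points[0])
    dist = [0.0] * n
    for m in range(k):
        # Sort indices by the m-th objective.
        order = sorted(range(n), key=lambda i: points[i][m])
        lo, hi = points[order[0]][m], points[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = float('inf')  # keep boundary solutions
        if hi == lo:
            continue
        for pos in range(1, n - 1):
            i = order[pos]
            dist[i] += (points[order[pos + 1]][m] - points[order[pos - 1]][m]) / (hi - lo)
    return dist

d = crowding_distances([(0.0, 3.0), (1.0, 2.0), (3.0, 0.0)])
assert d[0] == float('inf') and d[2] == float('inf')   # boundary solutions
assert abs(d[1] - 2.0) < 1e-12                          # (3-0)/3 + (3-0)/3
```

A larger crowding distance means a less crowded region, so when the archive overflows, the solution with the smallest value is the one removed.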

3.1.3. Parameter Control

The mutation scale factor F (see equation (7)) and the crossover rate Cr (see equation (8)) are the two main control parameters in DE. A well-tuned setting of F and Cr is crucial to the performance of DE [41]. However, determining how to set the suitable values of F and Cr is problem-specific. To select suitable parameters for F and Cr, we follow the idea of self-adaptive differential evolution (SaDE) [42] and use a self-adaptive strategy to control the two parameters in MODE.

The employed parameter control strategy is described as follows. At each iteration, a set of F values is regenerated from a normal distribution with a mean of μ=0.5 and a standard deviation of σ=0.3. These F values are then applied in turn in equation (12) to generate the donor vectors. In this way, both exploitation (small F values) and exploration (large F values) are ensured during the evolution process. Furthermore, the crossover rate Cr is gradually adjusted according to previous experience during the evolutionary process. Specifically, Cr is assumed to obey a normal distribution with a mean of μ=Crm and a standard deviation of σ=0.1 but is restricted to [0,1]. Initially, an empty pool is created, and Crm is set to 0.5. At each iteration, a set of Cr values is regenerated and applied to generate the trial vectors, as shown in equation (8). If a trial vector successfully replaces its target vector in the selection process, the corresponding Cr value enters the pool. At the end of each iteration, Crm is reset to the median of the pool, and the pool is then emptied.
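The SaDE-style parameter control above can be sketched as a small helper. The class and method names are illustrative, not from the paper:

```python
import random
import statistics

def sample_F(rng, n):
    """Regenerate a fresh set of scale factors each iteration, F ~ N(0.5, 0.3)."""
    return [rng.gauss(0.5, 0.3) for _ in range(n)]

class CrController:
    """Cr ~ N(Crm, 0.1) clipped to [0, 1]; Crm is reset to the median of the
    Cr values that produced successful trial vectors, once per iteration."""
    def __init__(self):
        self.crm, self.pool = 0.5, []

    def sample(self, rng, n):
        return [min(1.0, max(0.0, rng.gauss(self.crm, 0.1))) for _ in range(n)]

    def report_success(self, cr):
        self.pool.append(cr)  # trial vector replaced its target vector

    def end_iteration(self):
        if self.pool:
            self.crm = statistics.median(self.pool)
        self.pool = []

rng = random.Random(0)
ctrl = CrController()
crs = ctrl.sample(rng, 50)
assert all(0.0 <= c <= 1.0 for c in crs)   # restriction to [0, 1]
for c in (0.8, 0.9, 0.7):                  # pretend these trial vectors succeeded
    ctrl.report_success(c)
ctrl.end_iteration()
assert ctrl.crm == 0.8 and ctrl.pool == []
```

Resampling F every iteration keeps a mix of small (exploitative) and large (explorative) steps, while Cr drifts toward values that have recently produced successful trial vectors.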

3.2. Implementation of MODE in Feature Selection

The proposed MODE is an optimization method over continuous spaces. However, the landscape of feature selection problems is discrete. To implement MODE in feature selection, a binary strategy is incorporated in the proposed method. For a given dataset with M features H={h1, h2,…, hM}, a candidate solution in MODE is represented as

\[
X^{(i)}(t) = \left( x_1^{(i)}(t), x_2^{(i)}(t), \ldots, x_M^{(i)}(t) \right), \quad x_j \in [0, 1], \tag{13}
\]

where M is the number of dimensions of X(i)(t), equal to the dimensionality of the data points. Consequently, a feature subset S ⊆ H is determined by X(i)(t) and a preset threshold parameter λ ∈ (0,1), as shown in Algorithm 3. This strategy is also employed in the two single-objective methods based on DE.

Algorithm 3. A binary scheme to transform continuous values to binary values for feature selection.
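A plausible sketch of this binarization is simple thresholding: feature j is selected when x_j exceeds λ. Whether Algorithm 3 uses a strict or non-strict comparison is not stated here, so this is an assumption, and the names are ours:

```python
def to_subset(x, features, lam=0.5):
    """Select feature h_j when the continuous component x_j exceeds λ
    (an assumed thresholding sketch of Algorithm 3)."""
    return [h for xj, h in zip(x, features) if xj > lam]

genes = ['g1', 'g2', 'g3', 'g4']
assert to_subset([0.9, 0.1, 0.6, 0.4], genes) == ['g1', 'g3']
assert to_subset([0.9, 0.1, 0.6, 0.4], genes, lam=0.8) == ['g1']
```

Raising λ shrinks the selected subset, which is why the filter stage (Table 1) tunes λ per dataset while the wrapper stage fixes it at 0.5.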

3.3. Three Objectives of the Filter Stage

The first stage of the proposed method is a multivariate filter method in which the intrinsic characteristics of the raw data are considered. Three objective functions to be minimized are defined in the filter stage to evaluate a feature subset. The first objective function is the number of selected features, whose minimization is a prime motivation of feature selection. Previous works [27] have shown that incorporating the number of selected features as an objective is necessary in feature selection. For a given feature subset S={s1, s2,…, sn}, the first objective function is defined as

\[
f_1^{filter} = |S| = n. \tag{14}
\]

The second objective function strives to select the features with the highest relevance to the target class variable (labeled as malignant or not). Since all objectives are minimized, the relevance D(S, c) is negated. Normalized by the number of selected features in S, it is defined as follows:

\[
f_2^{filter} = -D(S, c) = -\frac{1}{|S|} \sum_{s_i \in S} I(s_i; c), \tag{15}
\]

where c is the target class variable and I(si; c) is the mutual information between feature si and target class c.

In addition, the redundancy among each pair of the selected features should be narrowed down because redundant information does little to improve the accuracy of a classifier [43]. The third objective function aims at minimizing the redundancy of the feature subset, and it is defined as follows:

\[
f_3^{filter} = R(S) = \frac{1}{|S|^2} \sum_{s_i, s_j \in S} I(s_i; s_j). \tag{16}
\]
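Given any mutual-information function, the three filter objectives can be evaluated as below. This sketch assumes the relevance term enters with a minus sign so that all three objectives are minimized, and it uses a hypothetical two-gene mutual-information table purely for illustration:

```python
def filter_objectives(S, mi, c='class'):
    """Evaluate the three filter objectives for subset S, given a
    mutual-information function mi(a, b); all three are minimized."""
    n = len(S)
    f1 = n                                               # equation (14)
    f2 = -sum(mi(s, c) for s in S) / n                   # equation (15): -D(S, c)
    f3 = sum(mi(a, b) for a in S for b in S) / (n * n)   # equation (16): R(S)
    return f1, f2, f3

# Toy mutual-information table for two hypothetical genes g1 and g2.
mi_table = {('g1', 'class'): 0.4, ('g2', 'class'): 0.2,
            ('g1', 'g1'): 1.0, ('g2', 'g2'): 1.0,
            ('g1', 'g2'): 0.1, ('g2', 'g1'): 0.1}
mi = lambda a, b: mi_table[(a, b)]

f1, f2, f3 = filter_objectives(['g1', 'g2'], mi)
assert f1 == 2
assert abs(f2 + 0.3) < 1e-12    # -(0.4 + 0.2) / 2
assert abs(f3 - 0.55) < 1e-12   # (1.0 + 0.1 + 0.1 + 1.0) / 4
```

Note that the double sum in equation (16) ranges over all ordered pairs, including each feature with itself, which is why the self-information terms appear in the redundancy value.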

3.4. Two Objectives of the Wrapper Stage

The second stage of the proposed method is a wrapper method in which the employed classifier must be considered. As shown in Figure 2, a set of nondominated solutions is generated after the filter stage. Although any of these solutions could be accepted as the starting point of the second stage, it is more reasonable to select some typical solutions among them according to computational costs. Since the aim of the filter stage is to select a small number of informative features, we select the solution with the smallest number of features as the input of the wrapper stage. Minimizing the classification error rate of a classifier is the main goal of the wrapper stage. In this study, the well-known and effective Gaussian NB [44] classifier is applied. The NB classifier is a supervised learning method for classification that is based on Bayes' theorem and assumes that every pair of features is independent. The Gaussian NB classifier is the variant of the NB classifier for continuous data, in which the continuous values of each feature are assumed to follow a Gaussian distribution. After selecting a suitable classifier, the two objective functions of the wrapper stage can be defined as follows:

\[
f_1^{wrapper} = \text{Num. of selected features}, \qquad f_2^{wrapper} = ErrorRate. \tag{17}
\]

Following the guidance of Xue et al. [27], the first objective function to be minimized is defined as the number of selected features. In the experiments, we investigate whether it is necessary to take the number of selected features as an objective in the wrapper stage. Moreover, the average classification error rate of a selected feature subset is defined as the second objective function, which is evaluated by 5-fold cross-validation on the training data. A more detailed description of how the 5-fold cross-validation is performed on training data is given in [45].
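The two wrapper objectives can be sketched as follows. To keep the example self-contained we plug in a 1-nearest-neighbor rule as a stand-in for the Gaussian NB classifier, and the fold splitting is a simple interleaved partition rather than the paper's exact protocol; all names and the toy data are ours:

```python
def kfold_error(xs, ys, train_and_predict, k=5):
    """Average classification error over k folds (second wrapper objective)."""
    folds = [list(range(i, len(xs), k)) for i in range(k)]   # interleaved split
    errs = []
    for test in folds:
        train = [i for i in range(len(xs)) if i not in test]
        preds = train_and_predict([xs[i] for i in train], [ys[i] for i in train],
                                  [xs[i] for i in test])
        errs.append(sum(p != ys[i] for p, i in zip(preds, test)) / len(test))
    return sum(errs) / k

def wrapper_objectives(subset, xs, ys, train_and_predict):
    """(number of selected features, cross-validated error) for a column subset."""
    cols = [[row[j] for j in subset] for row in xs]
    return len(subset), kfold_error(cols, ys, train_and_predict)

# Stand-in classifier: a 1-nearest-neighbor rule (not the paper's Gaussian NB).
def one_nn(train_x, train_y, test_x):
    return [train_y[min(range(len(train_x)),
                        key=lambda i: sum((a - b) ** 2
                                          for a, b in zip(train_x[i], t)))]
            for t in test_x]

xs = [[0.0, 9.0], [0.1, 1.0], [0.2, 5.0], [5.0, 9.1], [5.1, 1.2], [5.2, 4.8],
      [0.05, 2.0], [5.05, 2.0], [0.15, 7.0], [5.15, 7.0]]
ys = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]
n_feat, err = wrapper_objectives([0], xs, ys, one_nn)   # feature 0 separates classes
assert n_feat == 1 and err == 0.0
```

A real run would substitute a Gaussian NB for `one_nn`; the objective pair itself is classifier-agnostic, which is what makes the wrapper stage modular.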

3.5. Two Single-Objective Feature Selection Methods

Two single-objective feature selection methods based on DE are also proposed in the wrapper stage for comparison. The main difference between the two methods is the choice of fitness functions. The fitness function of one method (DE1) is the same as the second objective function of MODE in the wrapper stage, which is defined as follows:

\[
f^{DE1} = ErrorRate. \tag{18}
\]

The aim of DE1 is to minimize the classification error rate during the training process. However, the other method (DE2) considers the number of selected features. The fitness function of DE2 is defined as follows:

\[
f^{DE2} = \alpha \cdot \frac{\text{Num. of selected features}}{\text{Num. of all features}} + (1 - \alpha) \cdot ErrorRate, \tag{19}
\]

where α is a scaling parameter determining the relative importance of the two terms and ErrorRate is the average classification error rate of 5-fold cross-validation on the training data.
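The scalarized DE2 fitness of equation (19) is a one-liner; the α value below is purely illustrative, since the paper's setting is not restated here:

```python
def de2_fitness(n_selected, n_total, error_rate, alpha=0.1):
    """Weighted single-objective fitness of DE2 (equation (19));
    alpha is an illustrative value, not the paper's."""
    return alpha * n_selected / n_total + (1 - alpha) * error_rate

# 10 of 100 features selected with a 20% CV error.
assert abs(de2_fitness(10, 100, 0.2) - (0.1 * 0.1 + 0.9 * 0.2)) < 1e-12
```

Fixing α collapses the trade-off between subset size and error into one number, which is exactly the design choice that the multiobjective formulation of MODE avoids.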

To ensure a fair comparison with the proposed MODE, the procedure of the two DE-based methods is kept as similar as possible to that of the proposed MODE described above. The differences between the MODE-based method and the DE-based methods are the selection process and the updating strategy of the external archive. The selection process of the two DE-based methods is the same as that of standard DE, as shown in equation (9). The updating strategy of the external archive of the two DE-based methods is based on tournaments with limited size Na.

4. Experimental Studies

All the algorithms in this study are implemented in C and Python languages. The programs are executed on a Linux 64-bit system with a 3.4 GHz Core i5 CPU and 8 GB RAM. In addition, the parameters of MODE used in the two stages are listed in Table 1, which have been discussed above. To assess the performance of the proposed two-stage feature selection method, six widely used benchmark microarray datasets are selected in our experiments. The details of these datasets are summarized in Table 2. Note that all of the datasets are binary. The reason for excluding the multiclass datasets is that binary microarray datasets are more common in the field of gene selection [46].

Table 1.

The parameters in the two stages of the proposed method.

Parameters Filter stage Wrapper stage Description
NP 100 50 Population size
F N(0.5, 0.3) N(0.5, 0.3) Mutation scale factor (resampled each iteration)
Cr Self-adapted Self-adapted Crossover rate
N a 200 100 The size of archive
λ Tuned per dataset 0.5 Threshold for binarization
Ite 10000 400 Max number of iterations

Table 2.

The details of the benchmark datasets.

Dataset Total Num. of genes (features) Num. of instances Num. of classes Num. of instances for each class
Colon 2000 62 2 40, 22
DLBCL 5469 77 2 58, 19
Leukemia 7129 72 2 47, 25
Prostate 10,509 102 2 52, 50
Prostate2 2135 102 2 52, 50
TCellLymphoma 2922 63 2 43, 20

Because the numbers of samples in microarray datasets are relatively small, 5-fold cross-validation is applied to each dataset to evaluate the effectiveness of feature selection [46]. Specifically, the samples of each dataset are randomly partitioned into five equal subsamples. Four subsamples are used as the training data, and the remaining subsample is used as the test data. The cross-validation process is then repeated five times so that each subsample serves once as the test data. The flowchart of the 5-fold cross-validation experiment is presented in Figure 3. The training data are used by the feature selection methods to select a feature subset. The selected feature subset is then used to reduce the dimensions of the training data and the test data. Finally, the goodness of the selected feature subset is evaluated on the test data.

Figure 3. The flowchart of the 5-fold cross-validation experiment.

4.1. Results of the Filter Stage

The filter stage of the proposed method is first run on these benchmark datasets. The threshold λ is problem-specific and is set separately for each dataset. Since MODE obtains a set of nondominated solutions in each independent run, five independent sets of nondominated solutions with three objectives are generated. We collect the five sets of nondominated solutions into a union set and report its statistics in Table 3. The obtained solutions are clearly diverse, as the values of the three objectives vary over wide ranges. In addition, the small values of |S| indicate that few features are selected, demonstrating the effectiveness of the filter stage.

Table 3.

The statistical information of the solutions obtained in the filter stage (calculated over the five cross-validation runs).

Dataset |S| D(S, c) R(S)
Colon Min 195.0 9.29E − 02 −1.03E − 01
Avg ±  std 242.3 ± 20.3 1.01E − 01 ± 4.00E − 03 −8.93E − 02 ± 5.88E − 03
Max 300.0 1.11E − 01 −7.66E − 02
DLBCL Min 273.0 4.52E − 02 −9.95E − 02
Avg ± std 388.9 ± 61.4 5.15E − 02 ± 3.46E − 03 −8.00E − 02 ± 8.20E − 03
Max 554.0 6.03E − 02 −6.17E − 02
Leukemia Min 195.0 3.54E − 02 −7.18E − 02
Avg ± std 702.1 ± 229.7 4.02E − 02 ± 4.61E − 03 −5.61E − 02 ± 7.98E − 03
Max 1406.0 6.09E − 02 −3.52E − 02
Prostate Min 241.0 6.15E − 02 −1.16E − 01
Avg ± std 351.5 ± 51.4 7.19E − 02 ± 4.98E − 03 −9.52E − 02 ± 6.75E − 03
Max 458.0 8.54E − 02 −7.92E − 02
Prostate2 Min 175.0 1.25E − 01 −1.25E − 01
Avg ± std 248.9 ± 54.4 1.49E − 01 ± 1.14E − 02 −1.04E − 01 ± 8.58E − 03
Max 362.0 1.74E − 01 −8.86E − 02
TCellLymphoma Min 310.0 5.52E − 02 −8.36E − 02
Avg ± std 386.8 ± 30.4 5.94E − 02 ± 2.17E − 03 −7.29E − 02 ± 4.67E − 03
Max 469.0 6.56E − 02 −6.22E − 02

Figure 4 shows the nondominated solutions of the Colon dataset in one experiment. These solutions are mapped onto the (R(S), −D(S, c)) space. Similar results are obtained for the remaining datasets. Figure 4 shows that R(S) and −D(S, c) conflict strongly along a curve, which supports treating them as two separate objectives for optimization. Moreover, a common dominance pattern can be observed in Figure 4. For example, solutions A and B are nondominated, with R(S_A) < R(S_B) and −D(S_A, c) < −D(S_B, c); since the two remain nondominated in the full three-objective space, it follows that |S_A| > |S_B|. This finding supports our premise that simply reducing the number of features in a subset may diminish its compactness. Therefore, it is necessary to use different criteria to measure the quality of a feature subset.
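The dominance relation underlying this pattern can be made concrete with a small sketch (all objectives minimized; an illustrative helper, not the paper's MODE code):

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and
    strictly better in at least one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated(points):
    """Keep only the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]
```

With objective vectors such as (|S|, −D(S, c), R(S)), two solutions can each win on a different objective, so both survive the nondominance filter.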

Figure 4. The obtained nondominated solutions (200 data points in one experiment) in the filter stage of the proposed method on the Colon dataset, mapped onto the (R(S), −D(S, c)) space.

The first objective |S| is used to direct the search procedure toward fewer selected features. To observe how this objective changes during evolution, Figure 5 shows the convergence curves of the average value of |S| of the solutions stored in archive Arc for each dataset. The average number of selected features converges quickly and finally stabilizes near a certain value. This means that the filter results are not sensitive to the number of iterations, provided the maximum number of iterations Ite is set sufficiently large. In addition, the convergence speed and the stable number of selected features depend on the intrinsic characteristics of each dataset. For example, Leukemia and Prostate have similar scales but converge to different values, and Prostate has the largest number of features yet converges fastest.

Figure 5. The convergence curves of the average number of selected features of the solutions stored in archive Arc in the filter stage.

4.2. Results of the Wrapper Stage

Next, the proposed MODE is adopted in the wrapper stage. Inspired by recent works [22, 47], the threshold λ is set to 0.5 for all datasets. Finally, five independent sets of nondominated solutions with two objectives are generated. In addition, to analyze the performance of the proposed multiobjective approach, two single-objective approaches mentioned above are executed in the same setting. They also generate five independent sets of solutions for each dataset. Note that for DE2, the classification performance is more important; thus, α is set to 0.2 in equation (19).
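With λ = 0.5, each continuous DE individual is decoded into a feature subset before the two objectives are evaluated. A minimal sketch of this decoding and of the bi-objective evaluation follows; `error_rate` is a hypothetical stand-in for the NB classification error on a candidate subset:

```python
def decode(vector, threshold=0.5):
    """Map a continuous individual in [0, 1]^d to a feature subset:
    feature j is selected when vector[j] exceeds the threshold lambda."""
    return [j for j, v in enumerate(vector) if v > threshold]

def wrapper_objectives(vector, error_rate, threshold=0.5):
    """The two minimization objectives of the wrapper stage:
    the subset size |S| and the classification error on that subset."""
    subset = decode(vector, threshold)
    return len(subset), error_rate(subset)
```

The multiobjective search then compares individuals by Pareto dominance over these two values rather than by a weighted sum.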

Since each method generates five sets of nondominated solutions, comparing the performances of these methods directly is difficult. We therefore use the comparison method adopted in previous works [22, 27]. It is worth noting that the classification performance is evaluated and compared on the test data rather than the training data. Specifically, the five sets of nondominated solutions achieved by the proposed MODE in 5-fold cross-validation are first collected into a union set. Then, the test classification error of each solution is calculated, and the test classification errors of solutions that select the same number of features are averaged. The resulting set of “average” solutions is defined as the “average” front. The set of solutions in the union set that are nondominated with respect to |S| and the test classification error is defined as the “best” front. For the two single-objective methods, the solutions are likewise collected into union sets and processed in the same way. Finally, the performance of the three methods can be compared on these three union sets.
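The construction of the two fronts can be sketched directly from this description, with solutions taken as (|S|, test error) pairs from the union set (an illustrative sketch; both objectives are minimized):

```python
from collections import defaultdict

def average_front(solutions):
    """Average the test error of all solutions sharing the same subset size."""
    groups = defaultdict(list)
    for size, err in solutions:
        groups[size].append(err)
    return sorted((size, sum(errs) / len(errs)) for size, errs in groups.items())

def best_front(solutions):
    """Nondominated (size, error) pairs: sweep by increasing size and keep
    a pair only if its error strictly improves on everything kept so far."""
    front, best_err = [], float("inf")
    for size, err in sorted(set(solutions)):
        if err < best_err:
            front.append((size, err))
            best_err = err
    return front
```

For two minimization objectives, the sorted sweep in `best_front` is equivalent to a full pairwise dominance check but runs in O(n log n).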

The experimental results of the three methods on the benchmark datasets in the wrapper stage are shown in Figures 6 and 7. The horizontal axis represents the number of selected features of a solution, and the vertical axis represents the test classification error rate. The dashed line crossing each chart represents the average classification error rate of 5-fold cross-validation using all features. Moreover, in each chart, the label “-Avg” in the legends refers to the average front obtained by each method, and the label “-Best” refers to the best front.

Figure 6. Experimental results of the three methods on three benchmark datasets ((a) Colon, (b) DLBCL, and (c) Leukemia) in the wrapper stage.

Figure 7. Experimental results of the three methods on three benchmark datasets ((a) Prostate, (b) Prostate2, and (c) TCellLymphoma) in the wrapper stage.

According to Figures 6 and 7, the average fronts of the three methods are under the dashed line in most cases. This suggests that all the methods work effectively because their solutions achieve a lower test classification error rate and select fewer features. Moreover, the fluctuation in the curves of the average fronts means that the solutions with a similar number of features can have different test classification error rates. This implies that the feature subset search space is relatively complex.

When comparing DE1 with the other two methods, it is obvious that the classification performance of DE1 is similar to that of the other two methods on most datasets, but the number of features selected by DE1 is considerably larger. This is because the fitness function of DE1 (equation (18)) contains no term that accounts for the number of selected features. The experimental results strongly suggest the necessity of considering both the classification accuracy of a classifier and the number of features in feature selection.

Both MODE and DE2 consider the number of features in their fitness functions. However, the former uses a multiobjective technique, while the latter uses a single-objective technique. As shown in the left charts of Figures 6 and 7, both methods achieve low classification error rates while selecting few features. When these two methods are compared, MODE outperforms DE2: in terms of the “average” fronts, MODE achieves significantly lower test classification error rates on all datasets except Leukemia, and it also selects fewer features. In terms of the “best” fronts, MODE again performs better than DE2, obtaining both fewer features and a lower test classification error rate. Although fine-tuning the parameter α in equation (19) can improve the performance of DE2, doing so requires prior knowledge, and the value must be set properly in advance. These results demonstrate the advantage of the proposed MODE in the wrapper stage.

4.3. Comparison with Other Methods

To further evaluate the performance of the proposed two-stage method based on MODE, we compare it with seven widely used feature selection methods. GainRatio [48] and ReliefF [49] are two univariate feature selection methods that rank each feature according to its relevance to the target class; we retain the top 10, top 20, and top 40 features to evaluate their performance. mRMR [13] is a classical feature selection method based on mutual information that returns a subset of features of a predefined size; we set the returned number of features to 10, 20, and 40. Correlation-based feature selection (CFS) [50] is a classical multivariate feature selection method that returns a subset of features. WrapperNB [45] is a wrapper method coupled with the NB classifier; its search strategy is greedy hill climbing augmented with a backtracking facility. In addition, two wrapper methods based on GA and PSO are compared. Based on the parameter settings in the literature [51, 52], the population size NP and the maximum number of iterations T of the two methods are set to 50 and 100, respectively. The key parameters of GA are the crossover rate pc = 0.9, the mutation rate pm = 0.1, and the number of elites Ne = 10. The key parameters of PSO are the inertia weight w = 0.5 and the acceleration constants c1 = c2 = 1.5.
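The ranking idea behind the univariate filters can be illustrated with information gain, the quantity underlying GainRatio (GainRatio additionally normalizes by the split information). The sketch below ranks discretized feature columns and keeps the k best; it is an illustration, not the benchmarked implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Information gain of one discrete feature with respect to the class."""
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        sub = [l for f, l in zip(feature, labels) if f == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

def top_k(features, labels, k):
    """Rank feature columns by information gain and keep the k best indices."""
    scores = [(info_gain(col, labels), j) for j, col in enumerate(features)]
    return [j for _, j in sorted(scores, reverse=True)[:k]]
```

Such univariate rankers are fast but score each feature in isolation, which is exactly the limitation the multivariate filter and wrapper stages address.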

We use 5-fold cross-validation and follow the workflow in Figure 3 to perform the experiments. The final classifier is the NB classifier. To provide a fair comparison, for the proposed method, we select the solutions in the training Pareto front of the union set because the test data cannot be seen until the final performance evaluation. Specifically, the training Pareto front of the union set is constructed according to the training classification performance and the number of features. The comparison is performed on the test data, and the results are listed in Table 4. Acc represents the average test classification accuracy, and Gene represents the number of selected genes (features). As illustrated in Table 4, the proposed method obtains the best classification performance on three of the six problems. Moreover, it selects a small number of features, meeting the target of gene selection.

Table 4.

The comparison results of different methods on the six benchmark datasets.

Colon DLBCL Leukemia Prostate Prostate2 TCellLymphoma p value
All features Acc 58.08% 76.58% 71.33% 62.81% 76.57% 67.82% 0.000 011
Gene 2000.0 5469.0 7129.0 10,509.0 2135.0 2922.0

The proposed method Acc 88.06% 87.67% 72.10% 94.41% 90.90% 84.47%
Gene 6.2 12.7 12.4 10.0 7.0 10.1

GainRatio Acc 79.10% 83.08% 62.67% 91.24% 93.19% 74.87% 0.005 106
Gene (top 10) 10.0 10.0 10.0 10.0 10.0 10.0
Acc 78.85% 87.08% 71.14% 92.24% 92.19% 77.95% 0.054 282
Gene (top 20) 20.0 20.0 20.0 20.0 20.0 20.0
Acc 82.31% 87.08% 64.10% 91.24% 92.14% 73.21% 0.015 658
Gene (top 40) 40.0 40.0 40.0 40.0 40.0 40.0

ReliefF Acc 82.44% 88.42% 69.62% 92.19% 92.14% 76.28% 0.362 370
Gene (top 10) 10.0 10.0 10.0 10.0 10.0 10.0
Acc 82.44% 88.33% 72.38% 92.24% 90.14% 76.15% 0.452 807
Gene (top 20) 20.0 20.0 20.0 20.0 20.0 20.0
Acc 83.97% 84.50% 69.90% 92.24% 92.19% 74.62% 0.236 936
Gene (top 40) 40.0 40.0 40.0 40.0 40.0 40.0

mRMR Acc 85.38% 82.92% 58.76% 89.33% 90.24% 74.74% 0.016 822
Gene 10.0 10.0 10.0 10.0 10.0 10.0
Acc 83.97% 85.67% 68.38% 92.24% 92.24% 68.46% 0.037 739
Gene 20.0 20.0 20.0 20.0 20.0 20.0
Acc 82.31% 86.92% 75.33% 93.24% 91.19% 70.13% 0.180 025
Gene 40.0 40.0 40.0 40.0 40.0 40.0

CFS Acc 82.18% 92.08% 68.48% 93.19% 91.19% 72.95% 0.456 408
Gene 26.8 92.0 90.0 78.8 33.8 44.0

WrapperNB Acc 76.15% 76.58% 60.00% 91.14% 88.33% 69.87% 0.002 557
Gene 6.2 4.2 6.6 5.2 5.4 6.2

GA Acc 71.03% 76.67% 70.00% 63.81% 86.33% 69.49% 0.000 114
Gene 122.2 378.0 514.8 803.2 105.6 161.8

PSO Acc 65.90% 81.92% 68.29% 71.67% 82.48% 71.03% 0.000 026
Gene 134.0 371.6 455.6 506.2 115.2 150.4

The best classification accuracy for each benchmark dataset is in bold.

We further conduct the Wilcoxon signed-rank test to determine whether the differences between the proposed method and the other methods are significant. The significance level is set to 0.05, and the p values are listed in Table 4. The proposed method significantly outperforms eight of the fourteen compared configurations, for which the p values are smaller than 0.05. For the remaining six, the p values are larger than 0.05, indicating that the proposed method is not significantly better but still obtains competitive results. Therefore, the proposed method can be considered very competitive relative to classical methods, and the comparison results suggest that the proposed two-stage method based on MODE is a promising way to solve the gene selection problem.
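The test statistic itself is simple to compute from paired accuracies: rank the nonzero paired differences by magnitude and sum the ranks of each sign. The sketch below assumes no ties among the absolute differences (tied magnitudes would require average ranks, and the statistic must then be compared against a Wilcoxon critical value or converted to a p value):

```python
def wilcoxon_w(a, b):
    """Wilcoxon signed-rank statistic W = min(W+, W-) for paired samples.
    Zero differences are dropped; assumes no ties among |differences|."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    by_magnitude = sorted(diffs, key=abs)  # rank 1 = smallest |difference|
    w_pos = sum(r for r, d in enumerate(by_magnitude, 1) if d > 0)
    w_neg = sum(r for r, d in enumerate(by_magnitude, 1) if d < 0)
    return min(w_pos, w_neg)
```

When one method wins every paired comparison, W = 0, the strongest possible evidence for a given number of pairs.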

5. Conclusion

The gene selection problem is a specific feature selection problem and remains challenging in bioinformatics. In this paper, a two-stage feature selection method was proposed to solve it. The first stage was a multivariate filter method, and the second stage was a wrapper method. Both stages were based on the same MODE but with different objective functions. The objective functions of the filter stage were mainly based on mutual information, while the classification error of the NB classifier and the number of selected features served as the two objective functions of the wrapper stage. In our experiments, six common benchmark datasets were used to test and analyze the performance of the proposed method, and its effectiveness for solving the gene selection problem was verified by comparing it with seven classical methods. Since the main difference between the two stages (filter and wrapper) lies in the objective functions, the proposed method is straightforward to understand and implement.

This study provides a new perspective on the gene selection problem through multiobjective optimization, whose solution ideas differ substantially from those of single-objective methods. In the future, we plan to apply the proposed method to more gene expression datasets to verify its effectiveness. To improve performance, the search strategy and the evaluation criteria will also be further investigated.

Acknowledgments

This study was supported by the Research Foundation for Talented Scholars of Changzhou University (grant no. ZMF20020459) and JSPS KAKENHI (grant no. 19K12136).

Contributor Information

Xingqian Chen, Email: star1991chen@outlook.com.

Yuki Todo, Email: yktodo@ec.t.kanazawa-u.ac.jp.

Data Availability

The data used to support the findings of this study are included within the article and can be obtained from the corresponding authors upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  • 1.Higgins M. E., Claremont M., Major J. E., Sander C., Lash A. E. Cancergenes: a gene selection resource for cancer genome projects. Nucleic Acids Research . 2006;35(1):D721–D726. doi: 10.1093/nar/gkl811. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Schena M., Shalon D., Davis R. W., Brown P. O. Quantitative monitoring of gene expression patterns with a complementary dna microarray. Science . 1995;270(5235):467–470. doi: 10.1126/science.270.5235.467. [DOI] [PubMed] [Google Scholar]
  • 3.Fearon E. R., Vogelstein B. A genetic model for colorectal tumorigenesis. Cell . 1990;61(5):759–767. doi: 10.1016/0092-8674(90)90186-i. [DOI] [PubMed] [Google Scholar]
  • 4.Heller M. J. DNA microarray technology: devices, systems, and applications. Annual Review of Biomedical Engineering . 2002;4(1):129–153. doi: 10.1146/annurev.bioeng.4.020702.153438. [DOI] [PubMed] [Google Scholar]
  • 5.Ang J. C., Mirzal A., Haron H., Hamed H. N. A. Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection. IEEE/ACM Transactions on Computational Biology and Bioinformatics . 2016;13(5):971–989. doi: 10.1109/tcbb.2015.2478454. [DOI] [PubMed] [Google Scholar]
  • 6.Nguyen B. H., Xue B., Zhang M. A survey on swarm intelligence approaches to feature selection in data mining. Swarm and Evolutionary Computation . 2020;54:100663. doi: 10.1016/j.swevo.2020.100663. [DOI] [Google Scholar]
  • 7.Guyon I., Elisseeff A. An introduction to variable and feature selection. Journal of Machine Learning Research . 2003;3:1157–1182. [Google Scholar]
  • 8.Zhai Y., Ong Y.-S., Tsang I. W. The emerging big dimensionality. IEEE Computational Intelligence Magazine . 2014;9(3):14–26. doi: 10.1109/mci.2014.2326099. [DOI] [Google Scholar]
  • 9.Jain A. K., Duin R. P. W., Mao J. Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence . 2000;22(1):4–37. doi: 10.1109/34.824819. [DOI] [Google Scholar]
  • 10.Xue B., Zhang M., Browne W. N., Yao X. A survey on evolutionary computation approaches to feature selection. IEEE Transactions on Evolutionary Computation . 2016;20(4):606–626. doi: 10.1109/tevc.2015.2504420. [DOI] [Google Scholar]
  • 11.Henriques J. F., Caseiro R., Martins P., Batista J. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence . 2015;37(3):583–596. doi: 10.1109/tpami.2014.2345390. [DOI] [PubMed] [Google Scholar]
  • 12.Wang Z., Li M., Li J. A multi-objective evolutionary algorithm for feature selection based on mutual information with a new redundancy measure. Information Sciences . 2015;307:73–88. doi: 10.1016/j.ins.2015.02.031. [DOI] [Google Scholar]
  • 13.Peng H., Long F., Ding C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence . 2005;27(8):1226–1238. doi: 10.1109/tpami.2005.159. [DOI] [PubMed] [Google Scholar]
  • 14.Song S., Chen X., Song S., Todo Y. A neuron model with dendrite morphology for classification. Electronics . 2021;10(9):p. 1062. doi: 10.3390/electronics10091062. [DOI] [Google Scholar]
  • 15.Jain I., Jain V. K., Jain R. Correlation feature selection based improved-binary particle swarm optimization for gene selection and cancer classification. Applied Soft Computing . 2018;62:203–215. doi: 10.1016/j.asoc.2017.09.038. [DOI] [Google Scholar]
  • 16.Whitney A. W. A direct method of nonparametric measurement selection. IEEE Transactions on Computers . 1971;C-20(9):1100–1103. doi: 10.1109/t-c.1971.223410. [DOI] [Google Scholar]
  • 17.Pudil P., Novovičová J., Kittler J. Floating search methods in feature selection. Pattern Recognition Letters . 1994;15(11):1119–1125. doi: 10.1016/0167-8655(94)90127-9. [DOI] [Google Scholar]
  • 18.Zhou Y., Zhang W., Kang J., Zhang X., Wang X. A problem-specific non-dominated sorting genetic algorithm for supervised feature selection. Information Sciences . 2021;547:841–859. doi: 10.1016/j.ins.2020.08.083. [DOI] [Google Scholar]
  • 19.Song X.-f., Zhang Y., Gong D.-w., Sun X.-y. Feature selection using bare-bones particle swarm optimization with mutual information. Pattern Recognition . 2021;112:107804. doi: 10.1016/j.patcog.2020.107804. [DOI] [Google Scholar]
  • 20.Engelbrecht A. P., Grobler J., Langeveld J. Set based particle swarm optimization for the feature selection problem. Engineering Applications of Artificial Intelligence . 2019;85:324–336. doi: 10.1016/j.engappai.2019.06.008. [DOI] [Google Scholar]
  • 21.Baig M. Z., Aslam N., Shum H. P. H., Zhang L. Differential evolution algorithm as a tool for optimal feature subset selection in motor imagery EEG. Expert Systems with Applications . 2017;90:184–195. doi: 10.1016/j.eswa.2017.07.033. [DOI] [Google Scholar]
  • 22.Hancer E., Xue B., Zhang M. Differential evolution for filter feature selection based on information theory and feature ranking. Knowledge-Based Systems . 2018;140:103–119. doi: 10.1016/j.knosys.2017.10.028. [DOI] [Google Scholar]
  • 23.Mohamad M. S., Omatu S., Deris S., Yoshioka M. A modified binary particle swarm optimization for selecting the small subset of informative genes from gene expression data. IEEE Transactions on Information Technology in Biomedicine . 2011;15(6):813–822. doi: 10.1109/titb.2011.2167756. [DOI] [PubMed] [Google Scholar]
  • 24.Shreem S. S., Abdullah S., Nazri M. Z. A. Hybridising harmony search with a Markov blanket for gene selection problems. Information Sciences . 2014;258:108–121. doi: 10.1016/j.ins.2013.10.012. [DOI] [Google Scholar]
  • 25.Elyasigomari V., Mirjafari M. S., Screen H. R. C., Shaheed M. H. Cancer classification using a novel gene selection approach by means of shuffling based on data clustering with optimization. Applied Soft Computing . 2015;35:43–51. doi: 10.1016/j.asoc.2015.06.015. [DOI] [Google Scholar]
  • 26.Alshamlan H. M., Badr G. H., Alohali Y. A. Genetic bee colony (GBC) algorithm: a new gene selection method for microarray cancer classification. Computational Biology and Chemistry . 2015;56:49–60. doi: 10.1016/j.compbiolchem.2015.03.001. [DOI] [PubMed] [Google Scholar]
  • 27.Xue B., Zhang M., Browne W. N. Particle swarm optimization for feature selection in classification: a multi-objective approach. IEEE Transactions on Cybernetics . 2013;43(6):1656–1671. doi: 10.1109/tsmcb.2012.2227469. [DOI] [PubMed] [Google Scholar]
  • 28.Song S., Gao S., Chen X., Jia D., Qian X., Todo Y. Aimoes: archive information assisted multi-objective evolutionary strategy for ab initio protein structure prediction. Knowledge-Based Systems . 2018;146:58–72. doi: 10.1016/j.knosys.2018.01.028. [DOI] [Google Scholar]
  • 29.Deb K., Pratap A., Agarwal S., Meyarivan T. A fast and elitist multiobjective genetic algorithm: nsga-ii. IEEE Transactions on Evolutionary Computation . 2002;6(2):182–197. doi: 10.1109/4235.996017. [DOI] [Google Scholar]
  • 30.Coello C. A. C., Pulido G. T., Lechuga M. S. Handling multiple objectives with particle swarm optimization. IEEE Transactions on Evolutionary Computation . 2004;8(3):256–279. doi: 10.1109/tevc.2004.826067. [DOI] [Google Scholar]
  • 31.Dhiman G., Singh K. K., Soni M., et al. Mosoa: a new multi-objective seagull optimization algorithm. Expert Systems with Applications . 2021;167:114150. doi: 10.1016/j.eswa.2020.114150. [DOI] [Google Scholar]
  • 32.Storn R., Price K. Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization . 1997;11(4):341–359. doi: 10.1023/a:1008202821328. [DOI] [Google Scholar]
  • 33.Du Y., Fan Y., Liu X., Luo Y., Tang J., Liu P. Multiscale cooperative differential evolution algorithm. Computational Intelligence and Neuroscience . 2019;2019:5259129. doi: 10.1155/2019/5259129. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Chen X., Song S., Ji J., Tang Z., Todo Y. Incorporating a multiobjective knowledge-based energy function into differential evolution for protein structure prediction. Information Sciences . 2020;540:69–88. doi: 10.1016/j.ins.2020.06.003. [DOI] [Google Scholar]
  • 35.Song S., Chen X., Zhang Y., Tang Z., Todo Y. Protein-ligand docking using differential evolution with an adaptive mechanism. Knowledge-Based Systems . 2021;231:107433. doi: 10.1016/j.knosys.2021.107433. [DOI] [Google Scholar]
  • 36.Cover T. M., Thomas J. A. Elements of Information Theory . Hoboken, NJ, USA: John Wiley & Sons; 2012. [Google Scholar]
  • 37.Huang J., Cai Y., Xu X. A hybrid genetic algorithm for feature selection wrapper based on mutual information. Pattern Recognition Letters . 2007;28(13):1825–1844. doi: 10.1016/j.patrec.2007.05.011. [DOI] [Google Scholar]
  • 38.Kraskov A., Stögbauer H., Grassberger P. Estimating mutual information. Physical Review E . 2004;69(6):066138. doi: 10.1103/PhysRevE.69.066138. [DOI] [PubMed] [Google Scholar]
  • 39.Cai X., Li Y., Fan Z., Zhang Q. An external archive guided multiobjective evolutionary algorithm based on decomposition for combinatorial optimization. IEEE Transactions on Evolutionary Computation . 2015;19(4):508–523. doi: 10.1109/tevc.2014.2350995. [DOI] [Google Scholar]
  • 40.Song S., Ji J., Chen X., Gao S., Tang Z., Todo Y. Adoption of an improved PSO to explore a compound multi-objective energy function in protein structure prediction. Applied Soft Computing . 2018;72:539–551. doi: 10.1016/j.asoc.2018.07.042. [DOI] [Google Scholar]
  • 41.Das S., Suganthan P. N. Differential evolution: a survey of the state-of-the-art. IEEE Transactions on Evolutionary Computation . 2011;15(1):4–31. doi: 10.1109/tevc.2010.2059031. [DOI] [Google Scholar]
  • 42.Qin A. K., Huang V. L., Suganthan P. N. Differential evolution algorithm with strategy adaptation for global numerical optimization. IEEE Transactions on Evolutionary Computation . 2009;13(2):398–417. doi: 10.1109/tevc.2008.927706. [DOI] [Google Scholar]
  • 43.Yu L., Liu H. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research . 2004;5:1205–1224. [Google Scholar]
  • 44.Webb G. I., Boughton J. R., Wang Z. Not so naive Bayes: aggregating one-dependence estimators. Machine Learning . 2005;58(1):5–24. doi: 10.1007/s10994-005-4258-6. [DOI] [Google Scholar]
  • 45.Kohavi R., John G. H. Wrappers for feature subset selection. Artificial Intelligence . 1997;97(1-2):273–324. doi: 10.1016/s0004-3702(97)00043-x. [DOI] [Google Scholar]
  • 46.Bolón-Canedo V., Sánchez-Maroño N., Alonso-Betanzos A., Benítez J. M., Herrera F. A review of microarray datasets and applied feature selection methods. Information Sciences . 2014;282:111–135. doi: 10.1016/j.ins.2014.05.042. [DOI] [Google Scholar]
  • 47.Gu S., Cheng R., Jin Y. Feature selection for high-dimensional classification using a competitive swarm optimizer. Soft Computing . 2018;22(3):811–822. doi: 10.1007/s00500-016-2385-6. [DOI] [Google Scholar]
  • 48.Quinlan J. R. Induction of decision trees. Machine Learning . 1986;1(1):81–106. doi: 10.1007/bf00116251. [DOI] [Google Scholar]
  • 49.Kira K., Rendell L. A. Machine Learning Proceedings 1992 . Amsterdam, Netherlands: Elsevier; 1992. A practical approach to feature selection; pp. 249–256. [DOI] [Google Scholar]
  • 50.Yu L., Liu H. Feature selection for high-dimensional data: a fast correlation-based filter solution. Proceedings of the 20th International Conference on Machine Learning (ICML-03); August 2003; Washington, DC, USA. pp. 856–863. [Google Scholar]
  • 51.Herrera F., Lozano M., Verdegay J. L. Tackling real-coded genetic algorithms: operators and tools for behavioural analysis. Artificial Intelligence Review . 1998;12(4):265–319. doi: 10.1023/a:1006504901164. [DOI] [Google Scholar]
  • 52.Bonyadi M. R., Michalewicz Z. Particle swarm optimization for single objective continuous space problems: a review. Evolutionary Computation . 2017;25(1):1–54. doi: 10.1162/evco_r_00180. [DOI] [PubMed] [Google Scholar]


